Cosine Similarity using BERT

조회 수: 7 (최근 30일)
Nicholas Ang
Nicholas Ang 2021년 6월 30일
댓글: Nicholas Ang 2021년 6월 30일
I am using BERT to calculate similarities in Question Answering. I have encoded my Question data using
data.Tokens = encode(mdl.Tokenizer,data.Questions) which returns me a cell array.
Next, I proceeded to encode new text to test the similiarity with the already encoded Questions in the database: testTokens = encode(mdl.Tokenizer,text)
However, I am imable to use the cosineSimilarity(data.Tokens,testTokens) and I receive an error that says:
Input must be a matrix, a tokenizedDocument array, a bagOfWords model, a bagOfNgrams model, a string array of words, or a cell array of character vectors.
Do I need padding here or reshape of my cell vectors?

채택된 답변

Divyam Gupta
Divyam Gupta 2021년 6월 30일
Hi Nicholas, I notice that you're facing an issue while computing the cosine similarity using a text encoder. As per the documentation mentioned at https://www.mathworks.com/help/textanalytics/ref/cosinesimilarity.html#d123e8335 the cosineSimilarity function takes a matrix to compute the similarity between two documents.
Since the encoded vector sizes for each of the questions is different, constructing a matrix might be difficult. You can do a pairwise comparision between the data.Tokens and the testTokens to compute the similarities. This can be achieved by running a nested loop while simultaneously storing the similarity scores.
Hope this helps.
  댓글 수: 1
Nicholas Ang
Nicholas Ang 2021년 6월 30일
Thank you! This worked!

댓글을 달려면 로그인하십시오.

추가 답변 (0개)

카테고리

Help CenterFile Exchange에서 Modeling and Prediction에 대해 자세히 알아보기

제품


릴리스

R2021a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by