measuring term frequency of words

Question

John on 3 Dec 2017


Direct link to this question

https://kr.mathworks.com/matlabcentral/answers/370725-measuring-term-frequency-of-words

Commented: D. Frank on 16 Oct 2020

I have been able to obtain a bag of words from a document. How can I work with the bag-of-words array so that I can calculate the frequency of terms within each document?

str = extractFileText('file.txt');
paras = split(str,"</P>");
paras(end) = [];                % the split left an empty last entry
paras = extractAfter(paras,">") % Drop the "<P ID=n>" from the beginning
tdoc = tokenizedDocument(lower(paras));
bag = bagOfWords(tdoc)

I have this result:

For clarification, I believe the columns are the terms, while the rows are the documents. Am I right?

I loaded two text files (one document set, one query set). I want to evaluate the similarity between each document and each query using cosine similarity, tf-idf, or any other suitable measure.

Comments: 3

Christopher Creutzig on 24 Apr 2020


If I understand your question correctly, you can simply divide the counts (the raw term frequencies) by the document length. You may need to adjust the orientation of the vectors, and transpose everything if, as I did here, you want to display the result in a table:

>> str = ["This is a short document.",...
          "This is a longer document. With more tokens. Maybe that is about enough?"];
>> td = tokenizedDocument(str)
td = 
  1×2 tokenizedDocument:
     6 tokens: This is a short document .
    16 tokens: This is a longer document . With more tokens . Maybe that is about enough ?
>> bow = bagOfWords(td);
>> relFreq = bow.Counts ./ doclength(td).';
>> table(bow.Vocabulary.', relFreq.', 'VariableNames',["Word","relative Frequency"])
ans =
  15×2 table
       Word       relative Frequency
    __________    __________________
    "This"        0.16667     0.0625
    "is"          0.16667      0.125
    "a"           0.16667     0.0625
    "short"       0.16667          0
    "document"    0.16667     0.0625
    "."           0.16667      0.125
    "longer"            0     0.0625
    "With"              0     0.0625
    "more"              0     0.0625
    "tokens"            0     0.0625
    "Maybe"             0     0.0625
    "that"              0     0.0625
    "about"             0     0.0625
    "enough"            0     0.0625
    "?"                 0     0.0625
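On the orientation question from the original post: in bow.Counts, rows correspond to documents and columns to vocabulary terms. The following is only a small sketch that checks this on the two example sentences above and looks up one term. It assumes Text Analytics Toolbox is available; the variable names isCounts and relFreq are just illustrative. It divides by the row sums instead of doclength(td).', which gives the same result here because every token is counted.

```matlab
% Sketch: inspect the orientation of bow.Counts and look up a single term.
str = ["This is a short document.", ...
       "This is a longer document. With more tokens. Maybe that is about enough?"];
td  = tokenizedDocument(str);
bow = bagOfWords(td);

% Rows are documents, columns are vocabulary terms.
[numDocs, numTerms] = size(bow.Counts);          % 2 documents, 15 terms
assert(numDocs == bow.NumDocuments && numTerms == bow.NumWords)

% Raw count of the term "is" in each document (one entry per document).
isCounts = full(bow.Counts(:, bow.Vocabulary == "is"));   % [1; 2]

% Relative frequency: each row of counts divided by that row's total count.
counts  = full(bow.Counts);
relFreq = counts ./ sum(counts, 2);              % every row sums to 1
```

Indexing columns with a logical mask on bow.Vocabulary is one simple way to pull out the counts for a specific word without worrying about its position in the vocabulary.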

D. Frank on 16 Oct 2020

Can I ask, is there any way to find the frequency and the number of repeated letters, pairs of letters, and spaces in a note, Word, or PDF file?


Answer 1

Christopher Creutzig on 4 Dec 2017


Direct link to this answer

https://kr.mathworks.com/matlabcentral/answers/370725-measuring-term-frequency-of-words#answer_294520

See the bagOfWords documentation. For example, you can use the tfidf function; extract bag.Counts and call pdist(bag.Counts,'cosine'); use fitlsa for what is essentially a principal component analysis for dimensionality reduction; or use fitlda to fit a topic model.
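As a rough illustration of the first two suggestions, here is a minimal sketch that builds tf-idf weights and computes pairwise cosine similarity between documents. The three sentences are made up for the example, and the similarity is computed by hand with base MATLAB operations so that only Text Analytics Toolbox is assumed; pdist(M,'cosine') from Statistics and Machine Learning Toolbox returns the corresponding distances (one minus these similarities) for each pair of rows.

```matlab
% Sketch: tf-idf weights and pairwise cosine similarity between documents.
% The example sentences are invented for illustration.
docs = tokenizedDocument([ ...
    "the cat sat on the mat"; ...
    "the dog sat on the log"; ...
    "stock markets fell sharply today"]);
bag = bagOfWords(docs);

M = full(tfidf(bag));            % one row per document, one column per term

% Cosine similarity: scale each row to unit length, then take inner products.
rowNorms = sqrt(sum(M.^2, 2));   % assumes no document has an all-zero tf-idf row
Mn  = M ./ rowNorms;
sim = Mn * Mn.';                 % sim(i,j): cosine similarity of documents i and j
```

Documents 1 and 2 share several words and get a similarity strictly between 0 and 1, while document 3 shares no words with the others and gets a similarity of 0 with both.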

Comments: 2

John on 7 Dec 2017

Edited: John on 7 Dec 2017


I need to compute the similarity between each query in QueTF and each document in DocTF.
How can I do that? QueTF and DocTF are both bags of words.
What is the significance of pdist2?
I am having problems applying it to the bags of words:
Cosss = pdist2(QueTF,DocTF,'cosine');

Christopher Creutzig on 15 Oct 2018

Edited: Christopher Creutzig on 15 Oct 2018


John, you need to encode both sets of documents with the same bag-of-words model. (That model contains not only counts but also a specific mapping of each word to a position in the vector, and if you use tfidf, you need to use the same idf factors for consistency within your analysis.) Something like this:

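corpus = tokenizedDocument(corpusData);            % corpusData: the text of the documents
bow = bagOfWords(corpus);                          % vocabulary and counts come from the corpus only
query = tokenizedDocument(queryData);              % queryData: the text of the queries
queryVectors = encode(bow,query);                  % count the queries against the corpus vocabulary
dists = pdist2(queryVectors,bow.Counts,'cosine');  % dists(i,j): cosine distance, query i vs. document j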
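To illustrate the point about shared idf factors, here is a small sketch that weights both the corpus and the queries with tf-idf from the same bag-of-words model and then ranks documents per query. The contents of corpusData and queryData are invented for the example, and the cosine similarity is computed by hand so that only Text Analytics Toolbox is assumed.

```matlab
% Sketch: tf-idf for corpus and queries using one shared bag-of-words model.
% corpusData and queryData are made-up example inputs.
corpusData = ["the cat sat on the mat"; ...
              "the dog sat on the log"; ...
              "stock markets fell sharply today"];
queryData  = ["cat on a mat"; "markets today"];

corpus = tokenizedDocument(corpusData);
bow    = bagOfWords(corpus);
query  = tokenizedDocument(queryData);

D = full(tfidf(bow));            % corpus weights, one row per document
Q = full(tfidf(bow, query));     % queries weighted with the corpus vocabulary and idf factors

% Cosine similarity between every query (rows) and every document (columns).
% Assumes every query contains at least one vocabulary word with nonzero idf.
Qn  = Q ./ sqrt(sum(Q.^2, 2));
Dn  = D ./ sqrt(sum(D.^2, 2));
sim = Qn * Dn.';                 % sim(i,j): query i vs. document j

[~, best] = max(sim, [], 2);     % index of the most similar document per query
```

Words that appear only in a query (such as "a" here) are not in the corpus vocabulary and therefore do not contribute to the query vector. In this toy data the first query is closest to document 1 and the second to document 3.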
