Measuring term frequency of words

7 views (last 30 days)
John on 3 Dec 2017
Commented: D. Frank on 16 Oct 2020
I have been able to obtain a bag of words from a document. How can I work with the bag-of-words counts so that I can compute the frequency of terms within each document?
str = extractFileText('file.txt');
paras = split(str,"</P>");
paras(end) = []; % the split left an empty last entry
paras = extractAfter(paras,">") % Drop the "<P ID=n>" from the beginning
tdoc = tokenizedDocument(lower(paras));
bag = bagOfWords(tdoc)
Looking at the result, I believe the columns of the Counts matrix are the terms, while the rows are the documents. Am I right?
I loaded two txt files (one document set, one query set). I want to evaluate the similarity between each document and each query by cosine similarity, tf-idf, or some other means.
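A minimal sketch of getting at the counts directly, assuming the bag built above (the Counts property is stored sparse, so full makes it easier to view; the normalization mirrors the relative-frequency idea discussed below):
counts = full(bag.Counts);         % numDocuments-by-numWords matrix
vocab  = bag.Vocabulary;           % 1-by-numWords string array
tf     = counts ./ sum(counts,2);  % relative term frequency per document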
3 Comments
Christopher Creutzig on 24 Apr 2020
If I understand your question correctly, you can simply divide the counts (the term frequencies) by the document length. You may need to adapt the orientation of the vectors a bit, and also transpose everything if you want to display them in a table, as I did here:
>> str = ["This is a short document.",...
"This is a longer document. With more tokens. Maybe that is about enough?"];
>> td = tokenizedDocument(str)
td =
1×2 tokenizedDocument:
6 tokens: This is a short document .
16 tokens: This is a longer document . With more tokens . Maybe that is about enough ?
>> bow = bagOfWords(td);
>> relFreq = bow.Counts ./ doclength(td).';
>> table(bow.Vocabulary.', relFreq.', 'VariableNames',["Word","relative Frequency"])
ans =

  15×2 table

       Word        relative Frequency
    __________    __________________

    "This"        0.16667      0.0625
    "is"          0.16667       0.125
    "a"           0.16667      0.0625
    "short"       0.16667           0
    "document"    0.16667      0.0625
    "."           0.16667       0.125
    "longer"            0      0.0625
    "With"              0      0.0625
    "more"              0      0.0625
    "tokens"            0      0.0625
    "Maybe"             0      0.0625
    "that"              0      0.0625
    "about"             0      0.0625
    "enough"            0      0.0625
    "?"                 0      0.0625
D. Frank on 16 Oct 2020
Can I ask: is there any way to find the frequency and the number of repeated letters, pairs of letters, and spaces in a text, Word, or PDF file?
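One possible approach, as a sketch only: extractFileText from Text Analytics Toolbox reads txt, Word, and PDF files, and "file.pdf" here is a placeholder name:
str = extractFileText("file.pdf");   % also works for .txt and .docx
c = char(lower(str));                % the text as one char vector

% Frequency of every character, including spaces
[chars,~,idx] = unique(c');
charCounts = accumarray(idx,1);
table(string(chars), charCounts, 'VariableNames',["Char","Count"])

% Number of spaces
numSpaces = nnz(c == ' ')

% Frequency of adjacent character pairs (bigrams)
bigrams = string(c(1:end-1)') + string(c(2:end)');
[pairs,~,idx2] = unique(bigrams);
table(pairs, accumarray(idx2,1), 'VariableNames',["Pair","Count"])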


Accepted Answer

Christopher Creutzig on 4 Dec 2017
See the bagOfWords documentation. For example, you can use the tfidf function; you can extract bag.Counts and use pdist(bag.Counts,'cosine'); you can use fitlsa for what is essentially a principal component analysis for dimensionality reduction; or you can use fitlda to fit a topic model.
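For the cosine-similarity route specifically, a minimal sketch along those lines, assuming the bag from the question (full is used because pdist may not accept a sparse matrix):
M = tfidf(bag);                % tf-idf weighted counts, one row per document
D = pdist(full(M),'cosine');   % pairwise cosine distances between documents
S = 1 - squareform(D);         % square matrix of pairwise cosine similarities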
2 Comments
John on 7 Dec 2017
Edited: John on 7 Dec 2017
I need to compute the similarity between each query loaded in QueTF and each document in DocTF. How can I do that? QueTF and DocTF are both bag-of-words models. What is the significance of pdist2? I am having problems applying it to the bags of words:
Cosss = pdist2(QueTF,DocTF,'cosine');
Christopher Creutzig on 15 Oct 2018
Edited: Christopher Creutzig on 15 Oct 2018
John, you need to encode both sets of documents with the same bag-of-words model. (That model not only contains counts; it also has a specific mapping of which word goes into which position, and if you use tfidf, you need to use the same idf factors for consistency within your analysis.) Something like this:
corpus = tokenizedDocument(corpusData);
bow = bagOfWords(corpus);
query = tokenizedDocument(queryData);
queryVectors = encode(bow,query);
dists = pdist2(queryVectors,bow.Counts,'cosine');
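One caveat, as an assumption based on the sparse matrices that encode and bag.Counts return: pdist2 may not accept sparse input, in which case converting first should work:
dists = pdist2(full(queryVectors),full(bow.Counts),'cosine');  % full() guards against sparse-input errors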


More Answers (0)
