MATLAB Answers

measuring term frequency of words

조회 수: 25(최근 30일)
John
John 3 Dec 2017
댓글: D. Frank 16 Oct 2020
I have been able to obtain a bag of words from a document. Please, how can I interact with the bag of words array, so I may make calculations on the frequency of terms within each document?
str = extractFileText('file.txt');
paras = split(str,"</P>");
paras(end) = []; % the split left an empty last entry
paras = extractAfter(paras,">") % Drop the "<P ID=n>" from the beginning
tdoc = tokenizedDocument(lower(paras));
bag = bagOfWords(tdoc)
I have this result:
For clarification, I believe the columns are the terms, while the rows are the documents. Am I right?
I loaded 2 txt files (1 document set, 1 query set) I want to evaluate similarity between each document and each query by Cosine similarity, tf-idf or whatsoever means.

  댓글 수: 3

Abirami M
Abirami M 21 Apr 2020
friend how to find term frequency only in MATLAB for after bag of words model.Please tell me.for using the formula number of time term t appear in document /total number of terms in document.
Christopher Creutzig
Christopher Creutzig 24 Apr 2020
If I understand your question correctly, you can simply divide the counts, aka term frequency, by the document length. You may need to adapt the orientation of the vectors a bit, and also transpose everything if you want to, as I did here, display them in a table:
>> str = ["This is a short document.",...
"This is a longer document. With more tokens. Maybe that is about enough?"];
>> td = tokenizedDocument(str)
td =
1×2 tokenizedDocument:
6 tokens: This is a short document .
16 tokens: This is a longer document . With more tokens . Maybe that is about enough ?
>> bow = bagOfWords(td);
>> relFreq = bow.Counts ./ doclength(td).';
>> table(bow.Vocabulary.', relFreq.', 'VariableNames',["Word","relative Frequency"])
ans =
15×2 table
Word relative Frequency
__________ __________________
"This" 0.16667 0.0625
"is" 0.16667 0.125
"a" 0.16667 0.0625
"short" 0.16667 0
"document" 0.16667 0.0625
"." 0.16667 0.125
"longer" 0 0.0625
"With" 0 0.0625
"more" 0 0.0625
"tokens" 0 0.0625
"Maybe" 0 0.0625
"that" 0 0.0625
"about" 0 0.0625
"enough" 0 0.0625
"?" 0 0.0625
D. Frank
D. Frank 16 Oct 2020
Can i ask, is there any way to find the frequency and the number of repeated letters,pair of letters, space in a note, word or pdf file??

댓글을 달려면 로그인하십시오.

채택된 답변

Christopher Creutzig
See the bagOfWords documentation. E.g., you can use the tfidf function, you can extract bag.Counts and use pdist(bag.Counts,'cosine'), you can use fitlsa for what is essentially a principal component analysis for dimensionality reduction, or fitlda to train/fit a topic model.

  댓글 수: 2

John
John 7 Dec 2017
I need to compute the similarity between each query loaded in QueTF and each document in DocTF.
How may I do that? QueTF and DocTF are both bag of words.
What is the significance of pdist2?
I am having problems applying this to the bag of words.
Cosss = pdist2(QueTF,DocTF,'cosine');
Christopher Creutzig
Christopher Creutzig 15 Oct 2018
John, you need to encode both sets of documents with the same bag-of-words model. (That model not only contains counts, it also has a specific mapping which word to put into which position, and if you use tfidf, you need to use the same idf factors for consistency within your analysis.) Something like this:
corpus = tokenizedDocument(corpusData);
bow = bagOfWords(corpus);
query = tokenizedDocument(queryData);
queryVectors = encode(bow,query);
dists = pdist2(queryVectors,bow.Counts,'cosine');

댓글을 달려면 로그인하십시오.

추가 답변(0개)

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by