How to find the most used word in a text?

Question

Armina Petrean, 3 Apr 2023

https://kr.mathworks.com/matlabcentral/answers/1940214-how-to-find-the-most-used-word-in-a-text

Edited: DGM, 3 Apr 2023

I have a Notepad file containing a literary text, and I need to find the most repeated word or words and how many times they appear in that text.

Comments: 1

the cyclist, 3 Apr 2023

Edited: the cyclist, 3 Apr 2023

FYI, this question was closed by another editor as a duplicate, but I don't think it was. This question is asking about repeated words, and the other was asking about repeated letters.


Answer 1

the cyclist, 3 Apr 2023

https://kr.mathworks.com/matlabcentral/answers/1940214-how-to-find-the-most-used-word-in-a-text#answer_1207534

Edited: the cyclist, 3 Apr 2023

I'm putting this answer here as possibly the "canonical" MATLAB answer, but I expect you do not have the Text Analytics Toolbox.

myTextFile = "sonnets.txt"; % Put your file name here
str = extractFileText(myTextFile);
T = wordCloudCounts(str);
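
To get from that table to the most used word or words, one could read off the rows with the highest count. This is only a sketch, assuming the Text Analytics Toolbox is installed, that a file named sonnets.txt is on the path, and that T has the variables Word and Count as described in the wordCloudCounts documentation. As far as I understand, wordCloudCounts applies its own preprocessing to the text, so the counts may differ from a raw tally.

```matlab
% Assumes the Text Analytics Toolbox and a local file named sonnets.txt.
myTextFile = "sonnets.txt";
str = extractFileText(myTextFile);
T = wordCloudCounts(str);              % table with variables Word and Count

% Sort explicitly rather than relying on the order of the returned table.
T = sortrows(T, 'Count', 'descend');

% Keep every word tied for the highest count.
mostUsed = T(T.Count == max(T.Count), :);
disp(mostUsed)
```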


Answer 2

DGM, 3 Apr 2023

https://kr.mathworks.com/matlabcentral/answers/1940214-how-to-find-the-most-used-word-in-a-text#answer_1207689

Edited: DGM, 3 Apr 2023

wordpile.txt

Define "word". Once you have defined "word" and have implemented a means to split a block of text into said words, then the rest is basic.

I'm sure this can be improved a lot, but I was in a hurry.

bunchofwords = fileread('wordpile.txt')
bunchofwords = 
    'This is a text file.
     This file contains many words.
     
     It also contains a list:
     1: Entry one (first)
     2: Entry two (second)
     3: This is the third entry in the list.
     
     Sometimes words need to be hyphen-
     ated in order to make them fit. 
     
     I'm sure any reasonably-observant person
     would notice that not all hyphenation
     should be treated the same.
     
     I'm sure they'd also notice the problems 
     with quotes and 'apostrophes'.
     '
% i assume the capitalization doesn't matter
bunchofwords = lower(bunchofwords);
% try to fix words that are hyphenated on linebreaks
% but not all hyphenation is done with U+002D
bunchofwords = regexprep(bunchofwords,'(?<=\w+)-(\r\n|\r|\n)+(?=\w+)','');
% split the file into blobs separated by whitespace
% this causes lots of problems
%words = regexp(bunchofwords,'\S+','match');
% instead, split the file into blobs of "word" type characters
% this still has problems, but it's a bit better
words = regexp(bunchofwords,'\w+','match');
% find unique words
[uwords,~,uwidx] = unique(words);
% get histogram counts and sort them
hc = histcounts(uwidx,'binmethod','integers');
[hc,hcidx] = sort(hc,'descend');
% sort unique word list by frequency
uwordssorted = uwords(hcidx);
% display the results as a table as a cursory effort toward readability
table(uwordssorted.',hc.')
ans = 54×2 table
        Var1        Var2
    ____________    ____

    {'the'     }     4  
    {'entry'   }     3  
    {'this'    }     3  
    {'a'       }     2  
    {'also'    }     2  
    {'be'      }     2  
    {'contains'}     2  
    {'file'    }     2  
    {'i'       }     2  
    {'in'      }     2  
    {'is'      }     2  
    {'list'    }     2  
    {'m'       }     2  
    {'notice'  }     2  
    {'sure'    }     2  
    {'to'      }     2  

Note that this still has plenty of problems with contractions.
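
One way to soften the contraction problem is to let a word contain internal apostrophes, and then report every word tied for the highest count. The following is a minimal sketch rather than a complete tokenizer: the pattern and the sample text are my own assumptions, and it only handles the ASCII apostrophe (U+0027), not typographic quotes.

```matlab
% Sample text; in practice this would come from fileread, as above.
txt = lower('I''m sure they''d also notice the problems with ''apostrophes''. I''m sure.');

% A word is a run of letters/digits, optionally followed by groups of
% an apostrophe plus more letters/digits, so "i'm" stays one token
% while surrounding quote marks are dropped.
words = regexp(txt, '[a-z0-9]+(?:''[a-z0-9]+)*', 'match');

% Count occurrences of each unique word.
[uwords, ~, idx] = unique(words);
counts = accumarray(idx(:), 1);

% Report every word tied for the highest count.
maxCount = max(counts);
topWords = uwords(counts == maxCount);
fprintf('Most used (%d times): %s\n', maxCount, strjoin(topWords, ', '));
```

For this sample, both i'm and sure appear twice, so both are reported. Possessives and words like 'tis would still need their own rules.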

Comments: 2

Image Analyst, 3 Apr 2023

Or simpler than

words = regexp(bunchofwords,'\w+','match');

is to use strsplit

words = strsplit(bunchofwords);

DGM, 3 Apr 2023

Edited: DGM, 3 Apr 2023

wordpile.txt

No, that would be similar to the first example, naively splitting on whitespace. This causes problems with any punctuation. Note the cases of 'file', 'list', and 'words'.

bunchofwords = fileread('wordpile.txt');
bunchofwords = lower(bunchofwords);
uwords = unique(strsplit(bunchofwords))
uwords = 1×34 cell array
  Columns 1 through 17

    {0×0 char}    {'1:'}    {'2:'}    {'3:'}    {'a'}    {'also'}    {'ated'}    {'be'}    {'contains'}    {'entry'}    {'file'}    {'file.'}    {'fit.'}    {'hyphen-'}    {'in'}    {'is'}    {'it'}

  Columns 18 through 33

    {'list.'}    {'list:'}    {'make'}    {'many'}    {'need'}    {'one'}    {'order'}    {'sometimes'}    {'text'}    {'the'}    {'them'}    {'third'}    {'this'}    {'to'}    {'two'}    {'words'}

  Column 34

    {'words.'}
uwords = unique(regexp(bunchofwords,'\S+','match'))
uwords = 1×33 cell array
  Columns 1 through 17

    {'1:'}    {'2:'}    {'3:'}    {'a'}    {'also'}    {'ated'}    {'be'}    {'contains'}    {'entry'}    {'file'}    {'file.'}    {'fit.'}    {'hyphen-'}    {'in'}    {'is'}    {'it'}    {'list.'}

  Columns 18 through 33

    {'list:'}    {'make'}    {'many'}    {'need'}    {'one'}    {'order'}    {'sometimes'}    {'text'}    {'the'}    {'them'}    {'third'}    {'this'}    {'to'}    {'two'}    {'words'}    {'words.'}
uwords = unique(regexp(bunchofwords,'\w+','match'))
uwords = 1×30 cell array
  Columns 1 through 17

    {'1'}    {'2'}    {'3'}    {'a'}    {'also'}    {'ated'}    {'be'}    {'contains'}    {'entry'}    {'file'}    {'fit'}    {'hyphen'}    {'in'}    {'is'}    {'it'}    {'list'}    {'make'}

  Columns 18 through 30

    {'many'}    {'need'}    {'one'}    {'order'}    {'sometimes'}    {'text'}    {'the'}    {'them'}    {'third'}    {'this'}    {'to'}    {'two'}    {'words'}

I'm sure there are better ways to handle splitting into words, but using \w+ was simple enough.
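
One possible refinement: \w also matches digits and underscores, which is why '1', '2', and '3' show up as words in the output above. Here is a minimal sketch, assuming that purely numeric tokens should not count as words (that is a judgment call, and the sample text is made up for illustration):

```matlab
% Sample text modeled on the list entries in wordpile.txt.
txt = lower(sprintf('1: Entry one (first)\n2: Entry two (second)\n3: This is the third entry.'));
words = regexp(txt, '\w+', 'match');

% Flag tokens made up only of digits, such as list numbers.
% With a cell array input and 'once', regexp returns a cell array
% holding the match start index, or [] where nothing matched.
isNumeric = ~cellfun(@isempty, regexp(words, '^\d+$', 'once'));
words(isNumeric) = [];

uwords = unique(words);
disp(uwords)
```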


Answer 3

Image Analyst, 3 Apr 2023

https://kr.mathworks.com/matlabcentral/answers/1940214-how-to-find-the-most-used-word-in-a-text#answer_1207614

If you don't have the Text Analytics Toolbox (which @the cyclist's solution requires), then you can get a histogram like this:

str = 'abcddrd,ee,fghd,**^^###$s t q j'; % Whatever your character array is
% Convert characters to their numeric character codes.
strAscii = double(str);
% Plot the histogram with one bin per integer code. histogram returns a
% Histogram object, so read the counts from its Values property.
h = histogram(strAscii, 'BinMethod', 'integers');
counts = h.Values;
% Fancy up the plot.
grid on;
xlabel('ASCII value');
ylabel('Count');
title('Histogram of Characters')

Comments: 2

the cyclist, 3 Apr 2023

Unless I misunderstand, this solution finds the count of characters. This question (and my solution) is about finding words.

Image Analyst, 3 Apr 2023

I think your solution is more like what the OP wants. But maybe I'll leave mine up in case someone in the future stumbles across it and wants a histogram of characters.

By the way, if he doesn't have that toolbox, is there a solution for a histogram of complete words?
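
For a histogram of complete words without the toolbox, a bar chart of the counts from unique and accumarray is one option. This is a rough sketch using base MATLAB only; the sample sentence is made up, and splitting on \w+ carries the same caveats discussed in the other answer.

```matlab
% Sample text; replace with lower(fileread('yourfile.txt')) for a real file.
txt = lower('The cat and the hat and the bat.');
words = regexp(txt, '\w+', 'match');

% Count each unique word, then order the words from most to least frequent.
[uwords, ~, idx] = unique(words);
counts = accumarray(idx(:), 1);
[counts, order] = sort(counts, 'descend');
uwords = uwords(order);

% Draw one bar per word and label the x axis with the words themselves.
bar(counts);
set(gca, 'XTick', 1:numel(uwords), 'XTickLabel', uwords);
xlabel('Word');
ylabel('Count');
title('Histogram of Words');
```

For a long text it would probably make sense to plot only the first few dozen entries of counts and uwords, since the axis labels get crowded quickly.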
