How to extract a list of unique words from a set of one-row strings

Harrison on 14 Nov 2024 at 0:58
Commented: Harrison on 15 Nov 2024 at 16:56
Basically I have a set of 11 strings of words, and each string has no repeating words, but I need a list of every unique word in all 11 strings.
I've found that this works for one string at a time, but I can't get a list for all 11 strings this way.
A{1} = updatedDocuments(1,1)
B{1} = strjoin(unique(strtrim(strsplit(A{1}, ',')))', '')
Is it possible to index A{1} as updatedDocuments(1:11,1) or do something similar?

Accepted Answer

Madheswaran on 14 Nov 2024 at 9:32
Edited: Madheswaran on 15 Nov 2024 at 5:17
I am assuming the following:
  • 'updatedDocuments' is an array of 'tokenizedDocument'
  • Each document contains text that is comma-separated and doesn't end with a comma
To get the unique words from the entire set of strings, you can use the following approach:
% Remove commas from the documents if you don't want a comma to be
% included in 'uniqueWords'
updatedDocuments = removeWords(updatedDocuments, ",");
uniqueWords = updatedDocuments.Vocabulary;
If 'updatedDocuments' is a cell array of character vectors, you can use the following approach:
% Join with ',' directly; appending a comma to every cell first would leave
% an empty '' entry after the final split
allWords = strjoin(updatedDocuments(1:11,1), ','); % Join all documents into one comma-separated char vector
allWords = strtrim(strsplit(allWords, ',')); % Split with comma as delimiter and trim whitespace
uniqueWords = unique(allWords); % Unique words, sorted (1 x n cell where n is the number of unique words)
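As a rough, self-contained sketch of the same idea, the snippet below runs on made-up data. The variable names 'sampleDocs', 'parts', and 'uniqueInOrder' are hypothetical and not from the original post; it assumes each document is a char vector of comma-separated words. It also shows the 'stable' option of unique, which keeps words in order of first appearance instead of sorting them.

```matlab
% Hypothetical sample data standing in for updatedDocuments (3 documents)
sampleDocs = {'cherry, apple'; 'banana, cherry'; 'apple, date'};

% Split each document on commas and trim whitespace.
% Each cell of 'parts' holds a 1-by-k cell array of words.
parts = cellfun(@(s) strtrim(strsplit(s, ',')), sampleDocs, 'UniformOutput', false);

% Pool the per-document word lists into a single row cell array
allWords = [parts{:}];

% Deduplicate: default is sorted order, 'stable' keeps first-appearance order
uniqueWords   = unique(allWords);
uniqueInOrder = unique(allWords, 'stable');
```

Splitting per document and then concatenating avoids having to worry about delimiters at the join points, at the cost of one cellfun call.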
For more information, refer to the following documentation:
  1. https://mathworks.com/help/textanalytics/ref/tokenizeddocument.html
  2. https://mathworks.com/help/matlab/ref/double.unique.html
Hope this helps!
  3 Comments
Madheswaran on 15 Nov 2024 at 5:18
That is because I assumed 'updatedDocuments' to be a cell array of character vectors. If 'updatedDocuments' is an array of 'tokenizedDocument', resolving this issue is straightforward. I have updated the answer to include a solution for the case where 'updatedDocuments' is a 'tokenizedDocument', in addition to the existing explanation.
Let me know if that helps!
Harrison on 15 Nov 2024 at 16:56
That's exactly right! Thank you!!


More Answers (1)

Paul on 14 Nov 2024 at 1:09
If UpdatedDocuments is a 1D cell array of chars ...
UpdatedDocuments{1} = 'one,two,three,one';
UpdatedDocuments{2} = 'one,two,three,two';
UpdatedDocuments{3} = 'one,two,three,three';
result = cellfun(@(S) strjoin(unique(strtrim(strsplit(S, ','))),','),UpdatedDocuments,'Uni',false)
result = 1x3 cell array
{'one,three,two'} {'one,three,two'} {'one,three,two'}
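Note that the result above is deduplicated within each document separately. A possible sketch for pooling the words across all documents into a single list, using string arrays on the same sample data, is shown below. The names 'joined', 'words', and 'allUnique' are illustrative assumptions, not part of the original answer.

```matlab
% Same sample data as above: a 1D cell array of char vectors
UpdatedDocuments = {'one,two,three,one', 'one,two,three,two', 'one,two,three,three'};

% Convert to a string array and join all documents into one comma-separated string
joined = join(string(UpdatedDocuments), ",");

% Split on commas, trim whitespace, and keep each word once (sorted)
words     = strtrim(split(joined, ","));
allUnique = unique(words);
```

Here 'allUnique' is a column string array; wrap it in cellstr if a cell array of char vectors is preferred.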
  1 Comment
Paul on 15 Nov 2024 at 1:06
The Vocabulary property of tokenizedDocument returns the unique words in the array
documents = tokenizedDocument([
"an example of a short sentence an example of a short sentence "
"a second short sentence a second short sentence"]);
documents
documents =
  2x1 tokenizedDocument:
    12 tokens: an example of a short sentence an example of a short sentence
     8 tokens: a second short sentence a second short sentence
documents.Vocabulary
ans = 1x7 string array
"an" "example" "of" "a" "short" "sentence" "second"
