Fastest way to find keywords in a large number of news sentences?
Hello, I have a database containing over 900,000 lines of news, and I want to scan these lines of text for certain keywords. I tried:
tic; strfind(newsDb.SingleNewline, kws{1}); toc
tic; contains(newsDb.SingleNewline, kws{1}); toc
Both take over 0.003 sec to search for one keyword in one news line.
If I want to build a new database with over 20,000 keywords, it would take
900000 * 20000 * 0.003 / 60 / 60 / 24
over 600 days to do this. :(
Does anyone have an idea how to do this within perhaps one or two days?
Thank you very much
Comments: 6
Walter Roberson
28 January 2021
What do you want to do about substrings, plurals, upper/lowercase, and the other factors I asked about? For example, if the headline were "Elon visits Oak Hammock Marsh", is it acceptable for it to match "Mars"? What about "Elon eats musk-melon", or "Eucre trumps Bridge in recent poll"?
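To make those cases concrete, here is a small sketch comparing plain substring matching with whole-word regular-expression matching on the same headlines. The variable names are only illustrative, and which behaviour is "right" depends on what you want the keyword database to contain.

```matlab
% Substring matching: "Mars" is found inside "Marsh"
substringHit = contains("Elon visits Oak Hammock Marsh", "Mars");

% Whole-word matching with \< and \>: "Marsh" no longer matches "Mars"
wholeWordHit = ~isempty(regexp('Elon visits Oak Hammock Marsh', '\<Mars\>', 'once'));

% Case-insensitive whole-word matching still hits the "musk" of "musk-melon",
% because the hyphen counts as a word boundary
hyphenHit = ~isempty(regexpi('Elon eats musk-melon', '\<Musk\>', 'once'));

% A plural such as "trumps" is not a whole-word match for "Trump"
pluralHit = ~isempty(regexpi('Eucre trumps Bridge in recent poll', '\<Trump\>', 'once'));
```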
Accepted Answer
Walter Roberson
8 February 2021
You can do the search phase efficiently:
S = [ "Elon Musk is the richest man on the planet"
"Elon Musk is the poorest man on Mars"
"Trump is the president of US"
"Elon eats musk-melon"
"Eucre Trumps Bridge in recent poll"
"Trump is the not president of US"]
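Tags = ["Musk" "Trump" "Mars"]
numTags = length(Tags);
% Build one alternation, \<(?<word>(Musk|Trump|Mars))\>, so each line is scanned once for all tags.
% \< and \> anchor the match to whole words; the match is case-sensitive.
pattern = "\<(?<word>(" + strjoin(Tags, "|") + "))\>"
search_results = regexp(S, pattern, 'names')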
However, the output is not really what you want: it lists the tags that were matched in each line, and it needs to be re-arranged to give information about where each tag was found.
% Collect the matched tags of each line into a string array (one cell per line)
tags_matched = cellfun(@(C) string({C.word}), search_results, 'uniform', 0).'
% For each tag, list the indices of the lines in which it was found
TagWasFoundAt = cell(numTags,1);
for K = 1 : numTags
    TagWasFoundAt{K} = find(cellfun(@(C) ismember(Tags{K}, C), tags_matched));
end
[cellstr(Tags(:)), TagWasFoundAt]
% OR: build a logical matrix with one row per line and one column per tag
match_bits = cell2mat(cellfun(@(C) ismember(Tags, string({C.word})), search_results, 'uniform', 0));
TagWasFoundAt = arrayfun(@(COL) find(match_bits(:,COL)).', (1:numTags).', 'uniform', 0);
[cellstr(Tags(:)), TagWasFoundAt]
It is likely that there are other ways to do the matching from tags to entries.
The first of those two approaches is probably more efficient, but the match_bits array is useful if you want a single data structure that you can easily query to find out which articles contain a particular tag, or which tags a particular article contains. It also lends itself to boolean searches, for example finding articles that contain Musk or Mars but not Trump:
(match_bits(:,1) | match_bits(:,3)) & ~match_bits(:,2)
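The other two kinds of query mentioned above can be sketched in the same way. In this example match_bits is typed in by hand as the matrix expected for the six sample headlines (rows are headlines, columns are Musk, Trump, Mars), and the names on the left-hand sides are made up for illustration.

```matlab
Tags = ["Musk" "Trump" "Mars"];
% Assumed content of match_bits for the six sample headlines
match_bits = logical([1 0 0; 1 0 1; 0 1 0; 0 0 0; 0 0 0; 0 1 0]);

% Which articles contain a particular tag? Pick the column by tag name.
articlesWithTrump = find(match_bits(:, Tags == "Trump")).';

% Which tags does a particular article contain? Pick the row by article index.
tagsInArticle2 = Tags(match_bits(2, :));

% Boolean search: (Musk or Mars) and not Trump
muskOrMarsNotTrump = find((match_bits(:,1) | match_bits(:,3)) & ~match_bits(:,2)).';
```

Selecting the column with Tags == "Trump" rather than a hard-coded index keeps the query readable once there are thousands of tags.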
There might be better ways of doing the matching.
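One possible alternative, offered as an untested-at-scale sketch rather than a benchmarked solution: split every line into words once, then look all words up in the tag list with a single ismember call and assemble the result as a sparse matrix. This swaps the regexp alternation for a tokenize-and-lookup approach, so it only handles single-word, case-sensitive tags. The names words, lineIdx, allWords and tagIdx are made up for the example.

```matlab
S = [ "Elon Musk is the richest man on the planet"
      "Elon Musk is the poorest man on Mars"
      "Trump is the president of US"
      "Elon eats musk-melon"
      "Eucre Trumps Bridge in recent poll"
      "Trump is the not president of US" ];
Tags = ["Musk" "Trump" "Mars"];

% Split each line into runs of word characters (one cell per line)
words = regexp(S, "\w+", 'match');

% Flatten all words into one column and remember which line each came from
counts   = cellfun(@numel, words);
lineIdx  = repelem((1:numel(S)).', counts);
allWords = string([words{:}]).';

% One lookup for every word against the whole tag list
[isTag, tagIdx] = ismember(allWords, Tags);

% Sparse logical matrix: rows are lines, columns are tags
match_bits = sparse(lineIdx(isTag), tagIdx(isTag), 1, numel(S), numel(Tags)) > 0;
```

A sparse matrix is used because a full 900,000-by-20,000 logical array would need roughly 18 GB. Whether this is actually faster than the regexp version on the real data would have to be timed.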