Text Extraction and retrieval
<P ID=1>
A LITTLE BLACK BIRD.
</P>
<P ID=2>
Story about a bird,
(1811)
</P>
<P ID=3>
Part 1.
</P>
As I am new to text extraction, I need help in;
Accepted Answer
Akira Agata
25 Oct 2017
Just tried to make a script to do that. Here is the result (assuming the maximum ID = 10).
% Read your text file
fid = fopen('yourText.txt');
C = textscan(fid,'%s','TextType','string','Delimiter','\n','EndOfLine','\r\n');
C = C{1};
fclose(fid);
% 1. Count the delimiters '</P>'
idx = strfind(C,'</P>');
n = nnz(cellfun(@(x) ~isempty(x), idx));
% 2. Remove all punctuation (C2 is not used below; write C2 instead of C
%    in step 3 if the output files should be punctuation-free)
C2 = regexprep(C,'[.,!?:;]','');
% 3. Break the text into individual documents at each delimiter
idx2 = find(strcmp(C,'</P>'));
for kk = 1:10   % assumes the maximum ID is 10
    str = ['<P ID=',num2str(kk),'>'];
    idx_s = find(strcmp(C,str));
    if ~isempty(idx_s)
        idx_e = idx2(find(idx2>idx_s,1));   % first closing tag after the opening tag
        fileName = ['document',num2str(kk),'.txt'];
        fid = fopen(fileName,'w');
        fprintf(fid,'%s\r\n',C(idx_s:idx_e));
        fclose(fid);
    end
end
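As a hedged sketch, the hard-coded upper bound on the ID could be avoided by locating the opening tags with a regular expression and pairing each one with the next closing tag. The cell array C below is a hypothetical in-memory stand-in for the lines read from the file, and the variable names idxStart, idxEnd, ids and docs are illustrative only. Unlike the script above, this variant keeps the body lines in memory and leaves out the tag lines; it assumes every opening tag has a matching closing tag.

```matlab
% Sample lines standing in for the contents of the text file (assumption)
C = {'<P ID=1>'; 'A LITTLE BLACK BIRD.'; '</P>'; ...
     '<P ID=2>'; 'Story about a bird,'; '(1811)'; '</P>'; ...
     '<P ID=3>'; 'Part 1.'; '</P>'};

% Find the opening tag lines and capture their IDs
tok = regexp(C, '^<P ID=(\d+)>$', 'tokens', 'once');
idxStart = find(~cellfun(@isempty, tok));
ids = cellfun(@(t) str2double(t{1}), tok(idxStart));

% Find the closing tag lines
idxEnd = find(strcmp(C, '</P>'));

% Collect the body lines of each document (tag lines excluded)
docs = cell(numel(idxStart), 1);
for kk = 1:numel(idxStart)
    e = idxEnd(find(idxEnd > idxStart(kk), 1));   % first closing tag after this opening tag
    docs{kk} = C(idxStart(kk)+1 : e-1);
end
```

Here numel(idxEnd) gives the number of closing delimiters, and docs{kk} holds the lines of the document whose ID is ids(kk).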
More Answers (2)
Cedric Wannaz
26 Oct 2017
Here is another approach based on pattern matching:
>> data = regexp(fileread('data.txt'), '(?<=<P[^>]+>\s*)[\w ]+', 'match' )
data =
1×3 cell array
{'A LITTLE BLACK BIRD'} {'Story about a bird'} {'Part 1'}
If you don't need the IDs (e.g. if they always run from 1 to the number of P tags), you are done.
If you do need the IDs, you can get both the IDs and the content as follows:
>> data = regexp(fileread('data.txt'), '<P ID=(\d+)>\s*([\w ]+)', 'tokens' ) ;
data = vertcat( data{:} ) ;
ids = str2double( data(:,1) )
data = data(:,2)
ids =
1
2
3
data =
3×1 cell array
{'A LITTLE BLACK BIRD'}
{'Story about a bird' }
{'Part 1' }
Comments: 6
Cedric Wannaz
9 Nov 2017
Edited: Cedric Wannaz
9 Nov 2017
If you have a keyword count per document, counting the documents that contain the keyword is easy:
counts = [7, 0 ,3] ;
hasKey = counts > 0 ; % [1,0,1]
nDocs = sum( hasKey ) ; % 2
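One possible way to obtain such a per-document count, sketched under the assumption that each document is available as a character vector in a cell array (docs and keyword are hypothetical names, and the text is the sample from the question). Note that strfind matches substrings, so a keyword such as 'bird' would also be counted inside 'birds'.

```matlab
% Hypothetical documents, one per cell
docs = {'A LITTLE BLACK BIRD'; 'Story about a bird'; 'Part 1'};
keyword = 'bird';

% Occurrences of the keyword in each document (case-insensitive substring match)
counts = cellfun(@(d) numel(strfind(lower(d), keyword)), docs);

hasKey = counts > 0;     % which documents contain the keyword
nDocs  = sum(hasKey);    % number of documents containing the keyword
```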
Christopher Creutzig
2 Nov 2017
Edited: Christopher Creutzig
2 Nov 2017
It's probably easiest to split the text and then check the number of splits created to count, using string functions:
str = extractFileText('file.txt');
paras = split(str,"</P>");
paras(end) = []; % the split left an empty last entry
paras = extractAfter(paras,">") % Drop the "<P ID=n>" from the beginning
Then, numel(paras) will give you the number of </P>.
If you do not have extractFileText, calling string(fileread('file.txt')) should work just fine, too.
In one of the comments, you indicated you also need to count the frequency of words in documents. That is what bagOfWords is for:
tdoc = tokenizedDocument(lower(paras));
bag = bagOfWords(tdoc)
bag =
bagOfWords with 13 words and 3 documents:
a little black bird . …
1 1 1 1 1
1 0 0 1 0
…
Comments: 2
shilpa patil
23 Sep 2019
Edited: shilpa patil
23 Sep 2019
How can the above code be rewritten for a document image instead of a text file?