Reading/fetching text from text/PDF file for pre-processing

I have text/PDF files which contain millions of words. If I use str = extractFileText(filename), MATLAB first becomes very slow and sometimes hangs. Also, a single variable is not able to hold such a large amount of data.
I want to read the file word by word so I can filter the text and build a smaller array of filtered data. Alternatively, I want to write the filtered data to a temp file for the next processing step (as it will be small).
I need help with this. If you have any other solution to my problem, please reply.

2 Comments

Ive J on 21 Mar 2021
Edited: Ive J on 21 Mar 2021
Did you try calling the function with a name-value pair, so that only one page is read at a time?
for i = 1:numel(pages)   % pages: vector of page numbers to read
    str = extractFileText(filename, 'Pages', pages(i)); % read only one page at a time
    % do whatever you want with str
end
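The loop above could be extended so that only the filtered words from each page are kept, which keeps the result small. This is a minimal sketch, not a tested solution for the original file: the file name, the page count, and the length-based filter are all hypothetical placeholders, and it assumes the Text Analytics Toolbox is available for extractFileText.

```matlab
% Sketch: read a PDF one page at a time and keep only filtered words.
% Assumptions: filename, numPages and minLength are placeholders.
filename  = "example.pdf";   % hypothetical file name
numPages  = 10;              % hypothetical page count, supplied by the user
minLength = 4;               % hypothetical filter: keep words of 4+ characters

% Placeholder filter; replace with whatever rule the task needs.
keepWords = @(w) w(strlength(w) >= minLength);

if ~isfile(filename)
    warning('File not found: %s', filename);
    numPages = 0;            % skip the loop if there is nothing to read
end

kept = strings(0,1);         % filtered words collected so far
for p = 1:numPages
    str   = extractFileText(filename, 'Pages', p);  % text of page p only
    words = split(lower(strtrim(str)));             % split at whitespace
    kept  = [kept; keepWords(words)];               %#ok<AGROW>
end
```

Only kept grows across iterations, so memory use depends on how much text survives the filter rather than on the size of the whole file.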
Kindly have a look at my solution below.


Accepted Answer

moin khan on 21 Mar 2021


I first tried extractFileText on my file (a text file with 19 million words). It was really slow and did not work, because everything was going into a single variable. Now I fetch the data line by line and save it in a cell array. It takes some seconds, which is fine for such a large file.
Code:
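fid = fopen(filename, 'r');
if fid == -1
    error('Could not open file: %s', filename);
end
inputData = cell(0,1);
while ~feof(fid)
    tline = fgetl(fid);                  % read one line, without the newline
    if ischar(tline) && ~isempty(tline)  % fgetl returns -1 (numeric) at end of file
        inputData{end+1,1} = tline;      % keep non-empty lines only
    end
end
fclose(fid);
clear('ans','fid','tline');
documents = tokenizedDocument(inputData); % one document per line
clear('inputData');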
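The question also asked about writing the filtered data to a temp file. A minimal sketch of that variant follows, using the same line-by-line reading but writing the kept words straight to disk instead of holding them in memory. The stop-word list and the one-word-per-line output format are assumptions for illustration, and the small demo input file only stands in for the real large file.

```matlab
% Sketch: stream a text file line by line, filter the words, and write the
% kept words to a temporary file. All names below are placeholders.
inFile = [tempname '.txt'];               % stand-in for the large input file
fid = fopen(inFile, 'w');
fprintf(fid, 'The quick brown fox\n\nand a lazy dog of note\n');
fclose(fid);

outFile   = [tempname '.txt'];            % temporary file for filtered words
stopWords = ["the" "a" "an" "of" "and"];  % hypothetical filter list

fin = fopen(inFile, 'r');
if fin == -1
    error('Could not open input file: %s', inFile);
end
fout = fopen(outFile, 'w');
while true
    tline = fgetl(fin);
    if ~ischar(tline), break; end          % fgetl returns -1 at end of file
    words = split(string(strtrim(tline))); % split the line at whitespace
    keep  = strlength(words) > 0 & ~ismember(lower(words), stopWords);
    words = words(keep);
    if ~isempty(words)
        fprintf(fout, '%s\n', join(words, newline)); % one kept word per line
    end
end
fclose(fin);
fclose(fout);
```

Only one line is held in memory at a time, and outFile can be read back later for the next processing step once it is small enough.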

More Answers (0)


Asked: 21 Mar 2021

Commented: 21 Mar 2021
