Quickly Search Strings inside PDF files

Question

0 votes

I have ~25,000 PDF files that I want to classify based on the presence of keywords in their text. I know there's a PDF Toolbox that provides MATLAB with an interface for reading PDF text, but it is hosted on SourceForge, which makes it difficult to obtain (this is for work), and its reliance on Java seems likely to make the process very slow, especially when searching so many files. Is there a simpler, faster way to parse these documents if all I want to do is basically run strfind on the text to check for keywords?

Comments: 7

Michael B on 18 Apr 2016

Edited: Michael B on 18 Apr 2016

@jgg: That's a good explanation. If you'd made this an official answer I'd accept it. Sounds like I'll have to figure out a clever way to get PDFBox on my work machine without breaking the rules...

@Walter Roberson: Thank you sir, that is tremendously helpful. You are truly a font of knowledge and support. I can only hope that one day I might possess a fraction of your wit.

Walter Roberson on 4 Jan 2023

For batch extracting I see the commercial product https://www.qoppa.com/files/pdfstudio/guide/batch-extract-text-from-pdf.htm (which I have never used.)

I also see instructions at https://kenbenoit.net/how-to-batch-convert-pdf-files-to-text/ for a free converter. Since those instructions basically involve preparing a file of names and then running a shell script, building the file name list inside MATLAB would not be difficult. Running the converter would be simple on macOS or Linux; on Windows it would take more work.
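Building the list and calling a converter from MATLAB could look roughly like the sketch below. It assumes a command-line converter named pdftotext (as shipped with Xpdf) is installed and on the system path; the folder name pdfs and the runConverter switch are hypothetical placeholders, not something taken from the linked instructions.

```matlab
% Folder holding the PDFs (hypothetical path; adjust to your setup).
pdfDir = fullfile(pwd, 'pdfs');
files  = dir(fullfile(pdfDir, '*.pdf'));   % empty struct array if none found

% Assumption: flip this to true only once pdftotext is installed and on the path.
runConverter = false;

commands = cell(numel(files), 1);
for k = 1:numel(files)
    pdfPath = fullfile(files(k).folder, files(k).name);
    [~, baseName] = fileparts(pdfPath);
    txtPath = fullfile(files(k).folder, [baseName '.txt']);

    % Quote both paths so file names containing spaces survive the shell.
    commands{k} = sprintf('pdftotext "%s" "%s"', pdfPath, txtPath);

    if runConverter
        status = system(commands{k});
        if status ~= 0
            warning('Conversion failed for %s', pdfPath);
        end
    end
end
fprintf('%d conversion commands prepared.\n', numel(files));
```

Because each file is handled by its own system call, the same loop should work on Windows as well, provided the converter executable is available there.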


Answer 1

Jan on 11 Apr 2016

0 votes

PDFs are designed to guarantee identical output on different machines. You want to build a catalogue of the strings they contain. These two goals do not match well.

What about converting the PDFs with one of the many PDF-to-text tools and then working on the text files? E.g. http://www.foolabs.com/xpdf, http://www.codeproject.com/Articles/14170/Extract-Text-from-PDF-in-C-NET
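Once the text files exist, the keyword check itself is short. The following is only a sketch: the keyword list is hypothetical, and two tiny sample files are written to a temporary folder so that it runs on its own. In practice txtDir would point at the converter's output folder.

```matlab
keywords = {'invoice', 'contract', 'warranty'};   % hypothetical keyword list

% Write two small sample files so the sketch is self-contained.
txtDir = tempname;
mkdir(txtDir);
fid = fopen(fullfile(txtDir, 'a.txt'), 'w');
fprintf(fid, '%s', 'Signed CONTRACT, no warranty.');
fclose(fid);
fid = fopen(fullfile(txtDir, 'b.txt'), 'w');
fprintf(fid, '%s', 'Invoice 17 is attached.');
fclose(fid);

% Collect the text files in a fixed (alphabetical) order.
files = dir(fullfile(txtDir, '*.txt'));
names = sort({files.name});

% hits(i,k) is true when file i contains keyword k (case-insensitive).
hits = false(numel(names), numel(keywords));
for i = 1:numel(names)
    txt = lower(fileread(fullfile(txtDir, names{i})));
    for k = 1:numel(keywords)
        hits(i, k) = ~isempty(strfind(txt, lower(keywords{k})));
    end
end
disp(hits)
```

Each row of hits can then be used to assign the corresponding document to a class.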

Comments: 1

Michael B on 18 Apr 2016

Since I have to parse 25,000 files, using an external converter really isn't a viable option unless it has batch capability. Alright, I guess there really isn't anything simpler than the PDFBox tool for a MATLAB interface. Thanks.


Answer 2

Sarah Palfreyman on 30 Apr 2018

0 votes

Have you tried Text Analytics Toolbox?
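For reference, a minimal sketch of what that might look like, assuming Text Analytics Toolbox is installed and using its documented extractFileText function. The file name example.pdf and the keywords are hypothetical, and no claim is made here about how fast this is across 25,000 files.

```matlab
pdfFile  = 'example.pdf';             % hypothetical file name
keywords = ["invoice", "contract"];   % hypothetical keywords
found    = false(size(keywords));     % default when nothing could be read

% Only attempt extraction if the toolbox function and the file are both present.
hasToolbox = ~isempty(which('extractFileText'));
if hasToolbox && exist(pdfFile, 'file') == 2
    % Read the whole document ...
    txt = extractFileText(pdfFile);
    % ... or only selected pages, e.g. the first two:
    % txt = extractFileText(pdfFile, 'Pages', 1:2);

    for k = 1:numel(keywords)
        found(k) = contains(txt, keywords(k), 'IgnoreCase', true);
    end
else
    disp('Text Analytics Toolbox or the sample PDF is not available; nothing was read.');
end
disp(found)
```

Restricting the read with 'Pages' may help when only a small part of each document matters, though whether it reduces extraction time has not been measured here.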

Comments: 1

Benjamin Ehrlich on 4 Jan 2023

Is there ANY way to effectively speed up textanalytics.internal.pdfparser.extractText?

A single page can take up to 20 seconds... I just want to extract a small section of text.

-Ben
