Extracting text from PDF with extractFileText is not working for some PDFs

mario, 17 Oct 2023
Answered: Christopher Creutzig, 11 Dec 2023
I am using the extractFileText function to extract text from PDF files but with some files the function returns an empty string.
Using the pdfinfo function, I found that the PDF files from which extractFileText cannot extract text have a different Producer tag than those for which it works. In particular, extractFileText seems to fail when the Producer tag is "iText 2.1.7 by 1T3XT".
No error message is generated; you simply get an empty string.
Can anyone help me? Thank you!
  Comments: 7
dpb, 18 Oct 2023
pdfinfo('cs 2023.01.03.pdf')
ans = struct with fields:
                NumPages: 40
                PageSize: [40×4 double]
              PDFVersion: "1.4"
                   Title: ""
                 Subject: ""
                Language: ""
                Keywords: ""
                  Author: ""
                 Creator: ""
                Producer: "iText 2.1.7 by 1T3XT"
            CreationDate: 03-Jan-2023 03:17:20
        ModificationDate: 03-Jan-2023 03:17:20
               Encrypted: 0
    AllowsTextExtraction: 1
                Filename: "/users/mss.system.asxxnt/cs 2023.01.03.pdf"
extractFileText('cs 2023.01.03.pdf')
ans = ""
This probably confirms the same symptoms you see locally. I mentioned the issue to a MathWorks (TMW) staff member who had responded to another question on reading PDF files, so that they are aware of it if they come back, on the presumption that they might have a specific interest in MATLAB's PDF file functions.
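
To check whether the empty result really tracks the Producer tag across a set of files, something like the following sketch might help. It assumes the PDFs sit in a single folder and that Text Analytics Toolbox is installed; the function name findEmptyExtractions and the table column names are made up for illustration and are not part of any toolbox.

```matlab
function report = findEmptyExtractions(folder)
% findEmptyExtractions  Hypothetical helper: for each PDF in a folder,
% record its Producer tag and whether extractFileText returned no text.
% Uses pdfinfo and extractFileText from Text Analytics Toolbox.
    files = dir(fullfile(folder, "*.pdf"));
    n = numel(files);
    name = strings(n, 1);
    producer = strings(n, 1);
    isEmpty = false(n, 1);
    for k = 1:n
        f = fullfile(files(k).folder, files(k).name);
        name(k) = files(k).name;
        info = pdfinfo(f);               % metadata, including Producer
        producer(k) = info.Producer;
        txt = extractFileText(f);        % "" for the affected files
        isEmpty(k) = strlength(strtrim(txt)) == 0;
    end
    report = table(name, producer, isEmpty, ...
        'VariableNames', {'File', 'Producer', 'EmptyText'});
end
```

Calling report = findEmptyExtractions(pwd) and then looking at report(report.EmptyText, :) lists only the files that came back empty, which makes it easy to see whether they all share one Producer value.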


Answers (1)

Christopher Creutzig, 11 Dec 2023
This is a known issue in Text Analytics Toolbox. Please watch https://www.mathworks.com/support/bugreports/3155425 for updates.

Release

R2023a