Reading mathematical formulas in a PDF with MATLAB is inconsistent; how can I generalize this?

Milan Lips on 29 Mar 2021
Commented: Walter Roberson on 21 May 2021
Hi all,
I'm trying to extract certain pieces of text (the 4.50% and the 22.50% in picture 1) from a PDF file with MATLAB. To do so, I use the pdfRead function. To make the text as generic as possible, I remove line breaks, double spaces, tabs and indents, and convert all text to uppercase. When reading the file, I run into the following problems:
  • Some text in the file seems to be in math mode (see picture 1, paying special attention to the two occurrences of "Notional Amount").
  • This math-mode text is not read consistently by pdfRead (see picture 2, again paying attention to the two occurrences of "Notional Amount"). For readability, picture 2 shows the text before line breaks, double spaces, etc. are removed, but the problem is the same afterwards.
  • The spaces within the words "Notional Amount" are in a different place in every PDF file, so I cannot use one piece of MATLAB code for multiple PDF files, which is what I need.
  • In addition, when I copy and paste this part into my command window, it looks different from how it appears in the text (see picture 3).
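One possible explanation for the last point, offered only as a guess since the PDF itself is not available here: text extracted from formulas sometimes consists of characters from Unicode's Mathematical Alphanumeric Symbols block (for example U+1D441, MATHEMATICAL ITALIC CAPITAL N) rather than plain letters. Such characters look like letters but do not compare equal to them, and a console may render them differently. The sketch below is in Python rather than MATLAB, purely because Python's standard library ships a Unicode normalizer. The sample string and the function names are made up for illustration. It lists the non-ASCII code points in a string and folds compatibility characters back to plain letters with NFKC normalization.

```python
import unicodedata


def describe_chars(text):
    """Return (character, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


def to_plain_text(text):
    """Fold compatibility characters (e.g. math italics) to their plain forms."""
    return unicodedata.normalize("NFKC", text)


# Hypothetical sample: "Notional" spelled with mathematical italic letters,
# followed by a plain-ASCII " Amount". Not taken from the real PDF.
sample = (
    "\U0001D441\U0001D45C\U0001D461\U0001D456"
    "\U0001D45C\U0001D45B\U0001D44E\U0001D459 Amount"
)

# Show what the "letters" really are.
for _, code_point, name in describe_chars(sample):
    print(code_point, name)

# After normalization the string compares equal to ordinary text.
print(to_plain_text(sample) == "Notional Amount")
```

If the diagnostic shows only ordinary ASCII letters, this guess does not apply and the inconsistency lies purely in the spacing.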
My question has two parts:
  • Why doesn't the text appear as text, and how can I make it appear as text?
  • How can I make this part generic so that I can read multiple PDF files with the same code?
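For the second question, one spacing-independent approach is to build a pattern that allows optional whitespace between every character of the label, then capture the first percentage that follows it. The sketch below is in Python and uses made-up sample strings. The function names are hypothetical. It also rests on two assumptions not confirmed by the original files: that the percentage is the first number after the label, and that its digits are contiguous. MATLAB's `regexp` should accept a very similar pattern.

```python
import re


def spaced_pattern(label):
    """Regex source matching `label` with any whitespace between its characters."""
    chars = [re.escape(ch) for ch in label if not ch.isspace()]
    return r"\s*".join(chars)


def percent_after(label, text):
    """Return the first percentage after each occurrence of `label`, as floats."""
    pattern = re.compile(
        # label, then any non-digit filler, then a number followed by '%'
        spaced_pattern(label) + r"\D*?(\d+(?:\.\d+)?)\s*%",
        re.IGNORECASE,
    )
    return [float(m.group(1)) for m in pattern.finditer(text)]


# Hypothetical extractions of the same label with spaces in different places.
page_a = "N oti onal  Am ount: 4.50% per annum"
page_b = "NOTION AL\nAMOU NT   22.50 %"

print(percent_after("Notional Amount", page_a))
print(percent_after("Notional Amount", page_b))
```

Because the pattern tolerates whitespace anywhere inside the label, the same call works whether or not line breaks and double spaces have been stripped first.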
Solutions I tried:
  • Removing all spaces
  • Saving it as a .txt file and trying to change the font (the formula part did not change)
  • Using Python to try to adjust the file
Thanks in advance!

Answers (1)

Pranav Verma on 20 May 2021
Edited: Pranav Verma on 21 May 2021
Hi Milan,
The function you mention, pdfRead, does not appear to be part of the official MATLAB software. However, I see a similar function in a MATLAB File Exchange submission: "Read text from a PDF document".
"Read text from a PDF document" is one of many submissions on the MATLAB File Exchange, part of MATLAB Central, a forum where our product users interact and exchange information and knowledge without MathWorks' involvement. Feel free to contact the author of the submission directly with specific questions about the implementation.

Release: R2020a
