How to extract from a PDF a table that contains empty cells, and then, how to rebuild it as a Matlab table (or cell array)?

Question

Sim, 12 March 2023

https://kr.mathworks.com/matlabcentral/answers/1927375-how-to-extract-from-a-pdf-a-table-that-contains-empty-cells-and-then-how-to-rebuild-it-as-a-matlab

Commented: Sim, 28 March 2023

pdftable.pdf

[EDITED] How to extract from a PDF a table that contains empty cells, and then, how to rebuild it as a Matlab table (or cell array)?

Just as an example, I created a PDF document (attached here as "pdftable.pdf") with LaTeX (code below) that contains only a fictitious table with a few empty cells:

\documentclass[12pt]{article}
\usepackage{graphicx}
\begin{document}
    \begin{center}
        \begin{tabular}{ |c|c|c| } 
         \hline
         cell1 & cell2 &       \\ \hline
         cell4 &       & cell6 \\ \hline
               & cell8 & cell9 \\ 
         \hline
        \end{tabular}
    \end{center}
\end{document}

Then, I tried to extract that table's content from the PDF document using extractFileText, which appears to be Matlab's state-of-the-art tool for extracting text from PDFs:

>> str = extractFileText('pdftable.pdf')
str = 
    "cell1 cell2
     cell4 cell6
     cell8 cell9
     
     1"

As we can see, extractFileText returns every row of the table, but how can I tell, from its output alone, which cells of the table in the PDF document were empty?
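To make the loss of position concrete, here is a minimal sketch that tokenizes the output shown above. The string is hard-coded from that output, so the snippet does not need Text Analytics Toolbox; the variable names rows, tokens, and nTokens are illustrative only.

```matlab
% Output of extractFileText for pdftable.pdf, hard-coded from the listing above
str = join(["cell1 cell2", "cell4 cell6", "cell8 cell9", "", "1"], newline);

% Split into lines and keep the three table rows
% (the last two lines are a blank line and the page number)
rows = strtrim(splitlines(str));
rows = rows(1:3);

% Split each row on whitespace and count the tokens per row
tokens  = arrayfun(@(r) split(r), rows, 'UniformOutput', false);
nTokens = cellfun(@numel, tokens);

disp(nTokens.')
```

Every row yields two tokens, so nothing in the plain text says whether the missing entry was in column 1, 2, or 3.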

OK, in this simple example I have a tiny 3x3 table with numbered cells (i.e. "cell1", "cell2", "cell3", "cell4", "cell5", etc.), so I can easily see by eye which cells were empty in my PDF.

However, I would usually have very long tables in my PDF document, and I could not check manually which of the rows extracted by extractFileText contained empty cells.

Therefore, is there a way to rebuild the table contained in my PDF as a Matlab table or cell array, with the empty cells included as follows?

% Desired Output (either as a table or a cell array)
str =
  3×3 cell array
    {'cell1'}    {'cell2'}    {'_'    }
    {'cell4'}    {'_'    }    {'cell6'}
    {'_'    }    {'cell8'}    {'cell9'}

Sim, 21 March 2023

Edited: Sim, 21 March 2023

Anyone who can help here? :-)

Or, maybe, someone from the @MathWorks Support Team...?

Answer 1

Suraj, 28 March 2023

https://kr.mathworks.com/matlabcentral/answers/1927375-how-to-extract-from-a-pdf-a-table-that-contains-empty-cells-and-then-how-to-rebuild-it-as-a-matlab#answer_1202739

Hi Sim

From the example you've presented, it is clear that you're trying to read a table from a PDF, but the empty cells cause extractFileText to lose the column positions.

The extractFileText function belongs to the Text Analytics Toolbox and focuses primarily on extracting text data from documents, not table structure. Extracting tabular data from a PDF is a highly requested feature that MathWorks plans to add in a future release.

For now, I suggest a workaround that takes a .docx or .xlsx file as input rather than a PDF. You can use an online service or another widely available tool to convert your PDF to one of these formats, and then pass the converted file to the readtable function, which already handles tables in both Word and Excel files well.
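The workaround can be sketched roughly as follows, assuming the PDF has already been converted by an external tool to a hypothetical file named pdftable.xlsx in the current folder. As a substitution, the sketch uses readcell (a companion of readtable that returns a cell array directly) rather than readtable itself, and it takes the '_' placeholder from the desired output in the question. If the file is not present, it falls back to a hand-built stand-in that mimics what readcell would return, so the snippet still runs.

```matlab
% Assumption: pdftable.xlsx is a hypothetical Excel file obtained by converting
% pdftable.pdf with an external tool; the file name is illustrative.
xlsxFile = 'pdftable.xlsx';

if isfile(xlsxFile)
    % readcell imports empty spreadsheet cells as <missing>
    C = readcell(xlsxFile);
else
    % Stand-in mimicking what readcell would return, so the sketch runs without the file
    C = {'cell1', 'cell2', missing; ...
         'cell4', missing, 'cell6'; ...
         missing, 'cell8', 'cell9'};
end

% Mark the empty cells and replace them with the '_' placeholder
isEmptyCell = cellfun(@(x) isa(x, 'missing'), C);
C(isEmptyCell) = {'_'};

% Optional: the same content as a table (default variable names C1, C2, ...)
T = cell2table(C);

disp(C)
```

Whether the converted file preserves the empty cells depends on the conversion tool, so it is worth checking a few rows against the PDF before trusting a long table.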

Hope this helps.

Sim, 28 March 2023

Thanks a lot! OK, so for this specific task we still need to rely on tools external to Matlab.

Do you already know which future release (and its release date) will contain this highly requested feature?
