Data extraction from a research article (an electronic pdf with highly unstructured data)

Question

Aditi Mahajan on 2 Nov 2022

Direct link to this question

https://kr.mathworks.com/matlabcentral/answers/1842163-data-extraction-from-a-research-article-an-electronic-pdf-with-highly-unstructured-data

Answered: Image Analyst on 3 Nov 2022

My research area is applied machine learning in materials science. I am looking for an algorithm that can retrieve specific data values (both categorical and numerical) from research articles, which are electronic PDF documents with highly unstructured content. There are thousands of such PDFs, so extracting the data by hand is very time-consuming. Some PDFs carry the data in graphs, others in tables or text. Kindly guide me through a process for extracting the data efficiently.

Comments: 1
dpb on 2 Nov 2022

Essentially an impossible task -- the data/figures in a pdf file are not stored in a retrievable format other than by interpreting/rendering the pdf document itself.

Answer 1

Image Analyst on 3 Nov 2022

Direct link to this answer

https://kr.mathworks.com/matlabcentral/answers/1842163-data-extraction-from-a-research-article-an-electronic-pdf-with-highly-unstructured-data#answer_1090183

Try this:

https://www.mathworks.com/matlabcentral/fileexchange/?term=tag%3A%22digitize%22

Comments: 2
Aditi Mahajan on 3 Nov 2022

s12034-019-1813-5.pdf

I have attached a research article from which I need to extract the materials studied, the processes deployed, and the material characterization type (all given in the text); the sequence type (given in a table); and the mechanical characterization (given in figures).

Is this possible in one go? Or do I need to do it in fragments using code? Or does some of the collection need to be done manually?

dpb on 3 Nov 2022

@Image Analyst is pointing out that the "converters" actually use OCR to recognize and convert PDF content to text. Then you'll need code to find the particular code words of interest.
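
As a rough illustration of that last step, here is a minimal Python sketch of searching already-converted text for terms of interest and pulling out numbers that carry a unit. It assumes the PDF text has already been extracted to a plain string by whatever converter is used. The keyword list, the `MPa` unit, the sample text, and both function names are hypothetical choices for illustration, not part of any tool mentioned in this thread.

```python
import re

# Hypothetical key words; replace with the terms relevant to your own study.
KEYWORDS = ["tensile strength", "hardness", "annealing"]

def find_keyword_sentences(text, keywords=KEYWORDS):
    """Return {keyword: [sentences containing it]} for already-extracted text."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = {kw: [] for kw in keywords}
    for sentence in sentences:
        for kw in keywords:
            # Case-insensitive whole-phrase match.
            if re.search(r"\b" + re.escape(kw) + r"\b", sentence, re.IGNORECASE):
                hits[kw].append(sentence)
    return hits

def find_numbers_with_unit(sentence, unit="MPa"):
    """Return the numeric values that are directly followed by the given unit."""
    pattern = r"(-?\d+(?:\.\d+)?)\s*" + re.escape(unit) + r"\b"
    return [float(m) for m in re.findall(pattern, sentence)]

# Made-up sample text standing in for the output of a PDF-to-text converter.
sample = ("The alloy was subjected to annealing at 450 C. "
          "The tensile strength reached 310 MPa after rolling. "
          "Hardness was not measured.")

hits = find_keyword_sentences(sample)
values = find_numbers_with_unit(hits["tensile strength"][0])
print(hits)
print(values)
```

A sentence-level search like this only narrows down where to look. Across thousands of papers with different wording, the keyword list and patterns would likely need tuning, and the hits would still need checking by hand.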

That's only the text portion; tables and images in my experience weren't converted to raw data but simply embedded into the document as objects. That might get you at least part of the way, but it's not going to be anything simple to do for a generic collection of papers.

See <acrobat/online/convert-pdf.html> and Google is your friend to find alternates outside Adobe...altho your uni probably has a site license.

This really isn't much of a MATLAB question.

Answer 2

Image Analyst on 3 Nov 2022

Direct link to this answer

https://kr.mathworks.com/matlabcentral/answers/1842163-data-extraction-from-a-research-article-an-electronic-pdf-with-highly-unstructured-data#answer_1090658

I doubt thousands of articles would all be in this format/style. You might be able to get the text and numbers out but it could be tough to automatically figure out which numbers are in a table and what they mean. For the plots, you might just need to convert them to images and then use one of the File Exchange submissions. But even then I imagine it's going to require a lot of manual processing.
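
To make the plot step a little more concrete, below is a minimal Python sketch of the calibration that plot digitizing relies on: once two reference ticks on each axis are located in the image, any pixel position can be mapped to a data value by linear (or log-linear) interpolation. This is not the code of any File Exchange submission. The function name `make_axis_mapper` and all pixel and tick values are made up for illustration, and finding the curve pixels in the first place (by clicking or by image processing) is assumed to have been done separately.

```python
import math

def make_axis_mapper(pix_a, val_a, pix_b, val_b, log_scale=False):
    """Build a function mapping a pixel coordinate to a data value on one axis.

    pix_a, pix_b: pixel positions of two reference ticks (hypothetical inputs,
    for example read off by clicking on the image).
    val_a, val_b: the data values printed at those ticks.
    """
    if pix_a == pix_b:
        raise ValueError("reference pixels must differ")
    if log_scale:
        # On a log axis, pixels are linear in log10 of the data value.
        val_a, val_b = math.log10(val_a), math.log10(val_b)
    slope = (val_b - val_a) / (pix_b - pix_a)

    def mapper(pix):
        value = val_a + slope * (pix - pix_a)
        return 10 ** value if log_scale else value

    return mapper

# Illustrative calibration: x tick 0 at pixel column 100, x tick 10 at column 600;
# y tick 0 at pixel row 500, y tick 400 at row 100 (image rows grow downward).
x_of = make_axis_mapper(100, 0.0, 600, 10.0)
y_of = make_axis_mapper(500, 0.0, 100, 400.0)

# Pixel positions of points on the curve (made-up numbers for illustration).
curve_pixels = [(100, 500), (350, 300), (600, 100)]
curve_data = [(x_of(px), y_of(py)) for px, py in curve_pixels]
print(curve_data)
```

The calibration itself is simple. The manual effort mentioned above goes into locating the axes and the curve in each figure, since every paper draws its plots differently.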

Maybe you could just use Amazon Mechanical Turk to hire a bunch of cheap global workers to do it for you.

https://www.mturk.com/

Comments: 0