Extract documents from a website's hyperlinks

Question

0 votes

Hello folks!

I have a few websites from which I am trying to download the files that their embedded hyperlinks point to. Does MATLAB have a way to do this? For example, take this page:

https://en.wikipedia.org/wiki/Quantum_mechanics

It has several hyperlinks at the bottom, listed as references. In this case, disregard the earlier hyperlinks that lead to other articles or to these references.

Is there a way to extract these documents automatically with MATLAB?

Comments: 1

Adrian on 25 Jul 2023

Yes, MATLAB can extract files from websites, including files that hyperlinks point to. You can use the "webread" function to read the content of a webpage and then parse the returned HTML to extract the links you're interested in.

Here is a general outline of the steps you can follow:

1. Use the "webread" function to retrieve the content of the webpage (e.g., https://en.wikipedia.org/wiki/Quantum_mechanics).
2. Parse the HTML content to identify and extract the hyperlinks you want. For this, you can use regular expressions with the "regexp" function or, if Text Analytics Toolbox is installed, an HTML parser such as "htmlTree".
3. Keep only the links you need, based on specific criteria (e.g., discarding links that don't lead to references).
4. Use the "websave" function to save the files pointed to by the extracted hyperlinks to disk ("webread" only returns the content in memory).
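Steps 2 and 3 could be sketched with an HTML parser instead of regular expressions. This is only a sketch and assumes Text Analytics Toolbox is installed (htmlTree, findElement, and getAttribute come from that toolbox). The inline HTML and the example.org/example.com links are made up so that it runs without a network connection, and "absolute http(s) link" is just one possible filtering criterion:

```matlab
% Requires Text Analytics Toolbox (htmlTree, findElement, getAttribute).
% Inline HTML sample standing in for the output of webread.
html_content = ['<html><body>' ...
    '<a href="/wiki/Physics">Physics</a>' ...
    '<ol class="references">' ...
    '<li><a href="https://example.org/paper.pdf">Paper</a></li>' ...
    '<li><a href="https://example.com/book">Book</a></li>' ...
    '</ol></body></html>'];

tree    = htmlTree(html_content);          % Step 2: parse into a tree
anchors = findElement(tree, "a");          % every <a> element in the document
hrefs   = getAttribute(anchors, "href");   % string array of link targets
hrefs   = hrefs(~ismissing(hrefs));        % drop anchors that have no href

% Step 3: keep absolute links, dropping site-internal ones such as /wiki/...
ref_links = hrefs(startsWith(hrefs, ["http://", "https://"]));
disp(ref_links);
```

On a real page you would pass the text returned by webread to htmlTree and will probably need a stricter filter than this one.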

Note that the process may vary depending on the structure of the webpage and how the hyperlinks are embedded in the HTML. Also, respect the website's terms of service and check whether there are any restrictions on web scraping or downloading files from the site.

Below is a basic skeleton to get started, using MATLAB's "webread" function to retrieve the webpage's content:

% Step 1: Read the content of the webpage
url = 'https://en.wikipedia.org/wiki/Quantum_mechanics';
html_content = webread(url);

% Step 2: Parse the HTML to extract hyperlinks
% You'll need to implement this part based on the specific HTML structure
% and the criteria you want to use to identify the relevant links.

% Step 3: Filter out the relevant links

% Step 4: Download the files pointed to by the hyperlinks
% You can use "websave" to save each file to disk. Make sure to handle
% file names and saving appropriately.

% Additional steps:
% - Implement error handling for webread and file downloads.
% - Be mindful of website policies and restrictions to avoid any legal issues.
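Step 4 and the error-handling note in the skeleton could look roughly like the following. Treat it as a sketch under assumptions: file_urls is a hypothetical list standing in for the links you filtered, out_folder is an arbitrary folder under tempdir, and the timeout and one-second pause are illustrative values rather than recommendations.

```matlab
% Hypothetical list of document links produced by the filtering step.
file_urls  = {'https://example.org/docs/paper.pdf', ...
              'https://example.com/files/report.pdf?download=1'};
out_folder = fullfile(tempdir, 'downloaded_refs');   % assumed target folder
if ~exist(out_folder, 'dir')
    mkdir(out_folder);
end

options   = weboptions('Timeout', 15);   % seconds before a request gives up
out_files = cell(size(file_urls));
for k = 1:numel(file_urls)
    % Derive a local file name from the URL path, ignoring query/fragment.
    url_path       = regexprep(file_urls{k}, '[?#].*$', '');
    [~, name, ext] = fileparts(url_path);
    out_files{k}   = fullfile(out_folder, [name ext]);
    try
        websave(out_files{k}, file_urls{k}, options);
        fprintf('Saved %s\n', out_files{k});
    catch err
        % A failed download should not stop the remaining ones.
        warning('Could not download %s: %s', file_urls{k}, err.message);
    end
    pause(1);   % small delay between requests to go easy on the server
end
```

The placeholder URLs are not expected to resolve to real files, so with them the loop should only print warnings; swap in your own links to actually save anything.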

Please keep in mind that the actual implementation may be more involved, and you may need to adjust it to the structure of the webpages you're dealing with. Be mindful of each website's terms of service, and of its resources and bandwidth.

Answer 1

Koundinya on 29 May 2019

2 votes

That could be done using webread to retrieve data from the webpage and regexp to extract all the hyperlinks in the page by parsing through the retrieved text.

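% The URL must be passed as a character vector (quoted)
html_text = webread('https://en.wikipedia.org/wiki/Quantum_mechanics');
% '<a\s' avoids matching other tags that start with "a", such as <abbr>
hyperlinks = regexp(html_text,'<a\s.*?</a>','match');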
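The matches from the snippet above still contain the full anchor markup. Here is a minimal sketch of pulling out just the link targets and keeping the external ones. It uses a made-up inline HTML sample in place of the webread output so it runs offline, and the pattern assumes double-quoted href values, which is not guaranteed on every site:

```matlab
% Small inline HTML sample standing in for the text returned by webread.
html_text = ['<p><a href="/wiki/Physics">Physics</a> ' ...
             '<a class="external text" href="https://example.org/paper.pdf">Paper</a> ' ...
             '<a href="#cite_note-1">[1]</a> ' ...
             '<a href="//example.com/report.pdf">Report</a></p>'];

% Capture only the value of each href attribute (double-quoted form).
tokens = regexp(html_text, '<a\s+(?:[^>]*?\s)?href="([^"]*)"', 'tokens');
hrefs  = cellfun(@(t) t{1}, tokens, 'UniformOutput', false);

% Protocol-relative links ("//host/...") get an explicit scheme.
hrefs = regexprep(hrefs, '^//', 'https://');

% Keep absolute http(s) links; this drops in-page anchors and /wiki/ links.
is_absolute    = startsWith(hrefs, {'http://', 'https://'});
external_links = hrefs(is_absolute);

disp(external_links(:));
```

For the Wikipedia page in the question, external_links would still include every outbound link on the page, so you would likely add a further filter (for example on file extension) before downloading anything.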
