MATLAB: extract URLs from HTML source

hinhthoi on 2 May 2012
Commented: Gobert on 13 Jun 2021
Hi, I am trying to extract all URLs from HTML source code. I used strfind to locate "http" as the start of each URL and ".html", ".php", or ".png" as the end, then joined each start with an end to form a complete URL.
This gives very bad results because the starts and ends usually get mixed up.
Is there an easier way to do this?
I am thinking of a pattern search: a single command that returns all URLs that start with http:// and end with .html, .php, or .png.
The HTML source also contains URLs with other extensions, but I want to ignore all of them.
Thank you very much for any help.

Answers (3)

Jason Ross on 2 May 2012
I would do this using a series of regular expressions. Take a look at "Parsing Strings with Regular Expressions" on the following page for an example. It uses email addresses, but doing it for a URL is very similar since you know how it starts and ends, and you care about what's in between.

Walter Roberson on 2 May 2012
urls = regexp(TheString, 'http://.*?\.(html|php|png)', 'match')
However, this cannot tell that (say) http://mathworks.com/scripts.htmlx/logo.png should extend to the .png instead of stopping at the .html. To determine that you have reached the end of the URI, you need to know the list of characters that terminate a URI in your context, taking into account that sloppy pages often send URIs with embedded blanks, which is syntactically invalid...
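One way to act on this is to name the terminating characters explicitly and require one of them right after the extension. The sketch below assumes that whitespace, quotes, and angle brackets end a URL; that set is an assumption for illustration, not a rule from the answer, and it deliberately does not handle URLs with embedded blanks. The sample text, the variable names, and the switch to https? are likewise illustrative.

```matlab
% Sample HTML fragment (hypothetical) containing the tricky URL from above.
src = ['<a href="http://mathworks.com/scripts.htmlx/logo.png">logo</a> ' ...
       '<img src=''http://example.com/a.php''> ' ...
       'see http://example.com/b.gif'];

% Assumption: a URL ends at whitespace, a quote, or an angle bracket.
terminators = '\s"''<>';

% Lazy body, one of the wanted extensions, then a lookahead that requires
% a terminator (or the end of the text) right after the extension.
expr = ['https?://[^' terminators ']+?\.(?:html|php|png)(?=[' terminators ']|$)'];

% 'match' returns the matched text instead of the start indices.
urls = regexp(src, expr, 'match');
disp(urls');
```

Because the lookahead rejects ".html" when it is followed by "x", the first match runs on to ".png". A URL with a query string, such as a.php?id=3, would be skipped unless "?" is added to the lookahead set.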

Abhisar Ekka on 13 Feb 2021
You can run this piece of code and it works.
html = webread("<----paste your url here ---->");
hyperlinks = regexp(html,'https?://[^"]+','match')'
Paste your URL inside webread. webread downloads the page and returns its HTML source as text. regexp with the 'match' option then returns every http and https link found in that source, up to the next double quote.
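Since the original question asks only for links ending in .html, .php, or .png, the list returned by this approach can be filtered afterwards. The following is a minimal sketch that uses a hard-coded sample string in place of a webread call so it runs offline; the sample HTML and the variable names are assumptions for illustration, and endsWith requires R2016b or later.

```matlab
% Hypothetical page source standing in for the output of webread.
html = ['<a href="https://example.com/index.html">Home</a>' ...
        '<img src="http://example.com/logo.png">' ...
        '<script src="https://example.com/app.js"></script>'];

% Same pattern as in the answer: everything from http(s):// up to the next quote.
hyperlinks = regexp(html, 'https?://[^"]+', 'match')';

% Keep only the links that end with one of the wanted extensions.
wantedExt = {'.html', '.php', '.png'};
keep = endsWith(hyperlinks, wantedExt, 'IgnoreCase', true);
filtered = hyperlinks(keep);
disp(filtered);
```

Filtering after extraction keeps the regular expression simple, at the cost of missing links that are not delimited by double quotes.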
  Comments: 1
Gobert on 13 Jun 2021
How can one search the HTML code for email addresses? For example, see below. How can this code be made to work?
html = webread("https://edition.cnn.com");
hyperlinks = regexp(html,'https?://[^"]+','match')';
rgx ='[a-zA-Z0-9._%''+-]+@([a-zA-Z0-9._-])+\.([a-zA-Z]{2,4})';
emails = regexpi(hyperlinks,rgx,'match')';
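One possible reason the code above does not give a usable result: regexpi is applied to hyperlinks, a cell array of URLs, so it returns a nested cell array, and most URLs contain no email address at all. A hedged sketch of an alternative is to search the page text itself. The sample HTML below replaces the webread call so the example runs offline, and the simplified pattern is an assumption for illustration, not a complete email validator.

```matlab
% Hypothetical page source standing in for the output of webread.
html = ['<a href="mailto:info@example.com">Mail</a> ' ...
        'Contact press@example.org or info@example.com. ' ...
        '<a href="https://example.com/a.html">A</a>'];

% Simplified email pattern (assumption): local part, @, domain, dot, TLD.
rgx = '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}';

% Search the raw page text, not the extracted hyperlinks.
found = regexp(html, rgx, 'match');

% Remove duplicates; unique also sorts the result.
emails = unique(found(:));
disp(emails);
```

To scan the pages behind each extracted link instead, the same regexp call could be run on the text returned by webread for each entry of hyperlinks, ideally inside a try/catch because some links will not return text.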
