Matlab extract url from html source
Hi, I am trying to extract all URLs from HTML source code. I used the strfind command to find "http" as the start of each URL and ".html", ".php", or ".png" as the end, then joined each start with an end to form a complete URL.
But this gives very bad results because the starts and ends usually get mixed up.
Is there an easier way to do this?
I'm thinking of a pattern search: a single command that returns all URLs that start with http:// and end with .html, .php, or .png.
The HTML source also contains URLs with other extensions, but I want to ignore all of those.
Thank you very much for any help.
Answers (3)
Jason Ross
2 May 2012
I would do this using a series of regular expressions. Take a look at "Parsing Strings with Regular Expressions" on the following page for an example. It uses email addresses, but doing it for a URL is very similar since you know how it starts and ends, and you care about what's in between.
Comments: 1
Walter Roberson
28 Feb 2015
Walter Roberson
2 May 2012
regexp(TheString, 'http://.*?\.(html|php|png)', 'match')   % 'match' returns the matched text instead of start indices
However, this cannot tell that (say) http://mathworks.com/scripts.htmlx/logo.png should extend to the .png rather than stopping at the .html. To determine where a URI ends, you need to know which characters terminate a URI in your context, bearing in mind that sloppy pages often contain URIs with embedded blanks, which is syntactically invalid...
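One way to act on that is to spell out the terminators in the pattern itself. The following is only a sketch: the sample HTML is made up, and the terminator set (whitespace, quotes, angle brackets) is an assumption that suits typical href/src attributes, not a general rule.

```matlab
% Sample HTML, invented for illustration only.
TheString = ['<a href="http://mathworks.com/scripts.htmlx/logo.png">logo</a> ' ...
             '<a href="http://example.com/index.html">home</a> ' ...
             '<img src="http://example.com/pic.gif">'];

% Assumption: a URL ends at whitespace, a quote, or an angle bracket.
% The character class stops the match from running past a terminator,
% and the lookahead accepts an extension only when a terminator (or the
% end of the text) follows it, so ".html" inside ".htmlx" is skipped.
expr = 'http://[^\s"''<>]*?\.(?:html|php|png)(?=[\s"''<>]|$)';
urls = regexp(TheString, expr, 'match');
```

With this terminator set, a link such as page.php?id=1 is not matched, because the extension is followed by "?"; add "?" and "#" to the lookahead class if such links should count.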
Abhisar Ekka
13 Feb 2021
You can run this piece of code and it works.
html = webread("<----paste your url here ---->");
hyperlinks = regexp(html,'https?://[^"]+','match')'
Paste your URL inside the webread call. webread downloads the page and returns its HTML source as text; it does not parse the HTML. regexp then matches the regular expression against that text, which gives every http and https link in the page that is followed by a double quote.
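To keep only the .html, .php, and .png links that the original question asked for, the matches can be filtered in a second step. This is a minimal sketch: the sample HTML stands in for the webread output, and the variable names keep and wanted are made up here.

```matlab
% Sample HTML standing in for the text returned by webread (illustrative).
html = ['<a href="https://example.com/a.html">a</a>' ...
        '<img src="http://example.com/b.png">' ...
        '<script src="https://example.com/c.js"></script>'];

% Step 1: grab every http/https link up to the closing double quote.
hyperlinks = regexp(html, 'https?://[^"]+', 'match')';

% Step 2: keep only the links whose text ends in one of the wanted
% extensions. regexp on a cell array returns one result per cell, which
% is empty where the link does not match.
hits   = regexp(hyperlinks, '\.(html|php|png)$', 'once');
keep   = ~cellfun(@isempty, hits);
wanted = hyperlinks(keep);
```

Filtering after the fact keeps the link pattern simple, at the cost of dropping links that carry a query string after the extension.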
Comments: 1
Gobert
13 Jun 2021
How can one search the HTML source for email addresses? For example, see below. How can this code be made to work?
html = webread("https://edition.cnn.com");
hyperlinks = regexp(html,'https?://[^"]+','match')';
rgx ='[a-zA-Z0-9._%''+-]+@([a-zA-Z0-9._-])+\.([a-zA-Z]{2,4})';
emails = regexpi(hyperlinks,rgx,'match')';
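One likely reason the code above finds nothing is that the email pattern is applied to hyperlinks, which only holds text beginning with http:// or https://, so mailto: links and addresses in the page body never reach it. A hedged sketch that searches the page text directly instead, using made-up sample HTML and a simplified address pattern that will not cover every valid email:

```matlab
% Sample HTML standing in for the text returned by webread (illustrative).
html = ['<a href="mailto:news@example.com">Contact</a> ' ...
        '<p>Write to Tips.Desk@example.org or visit ' ...
        'https://example.com/about.html</p>'];

% Simplified email pattern (an assumption, not a full RFC 5322 matcher):
% local part, "@", domain labels, then a top-level domain of 2+ letters.
rgx = '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}';

% Search the whole page text, then drop duplicate addresses.
emails = unique(regexp(html, rgx, 'match'));
```

For a live page, replace the sample text with html = webread(...) as in the code above; a page that contains no addresses in its source returns an empty cell array.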