How do I read the text between href tags and return the results in a cell array?

I have an HTML webpage saved as a text file. Below is an example of the portion of the text I am interested in:
<a href='/some/1056-text-stuff'>
I want to search the text document for every occurrence of the <a href='/some/ pattern and extract the text between the quotes, i.e.
/some/1056-text-stuff
MATLAB's regexp has the 'match' and 'tokens' options, but I am struggling to pick out the string cleanly. Ideally, I would like to search the document and return a cell array of strings listing all of the matches. Here is my current code:
str= fileread('C:\Users\Me\Documents\MATLAB\trial.txt'); %read in text file
urls = regexp(str, 'href=(\S+)(\s*)$', 'tokens', 'lineAnchors'); %find urls

Accepted Answer

Julian on 17 Jun 2016
You can try something like this, where html is the text of the page (str in your code):
>> RE='<a[\s]+href="(?<target>.*?)"[^>]*>(?<text>.*?)</a>';
>> list=regexp(html, RE, 'names')
I can also recommend this tool for building and testing regular expressions: https://www.regexbuddy.com/
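The pattern above expects double quotes around the href value, while the sample line in the question uses single quotes. A minimal sketch of one possible adaptation follows, assuming (as in the pattern above) that href is the first attribute of the tag. The sample text and the variable names tok and urls are made up for illustration.

```matlab
% Hypothetical sample standing in for the contents of the saved page.
html = ['<a href=''/some/1056-text-stuff''>First</a>' char(10) ...
        '<a href="/some/2042-other">Second</a>'];

% Token 1 captures the opening quote (either " or '), token 2 the URL.
% The backreference \1 requires the closing quote to match the opening one.
RE = '<a\s+href=(["''])(.*?)\1';

% 'tokens' returns a 1-by-N cell array; each element is itself {quote, url}.
tok = regexp(html, RE, 'tokens');

% Keep only the URL from each match, giving a 1-by-N cell array of strings.
urls = cellfun(@(t) t{2}, tok, 'UniformOutput', false);
```

To keep only links under a given prefix such as /some/, the second group could be written as (/some/.*?) instead.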
  2 Comments
StuartG on 21 Jun 2016
Thank you very much, the regex command was giving me a lot of grief. I tailored your expression a little bit and it worked perfectly.
Ana Alonso on 17 Dec 2019
Hi there,
What do the (?<target>.*?) and (?<text>.*?) expressions correspond to?
I've never worked with HTML before and I'm just trying to scrape URLs from the HTML code.
Thanks!
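For reference, (?<target>.*?) and (?<text>.*?) are named tokens: each captures the text matched by .*? and stores it under the given name. With the 'names' option, regexp returns a structure array with one element per match and one field per name. A small sketch with made-up sample text, which may help show what comes back:

```matlab
% Hypothetical sample with two links.
html = ['<a href="/some/1056-text-stuff">Text stuff</a> ' ...
        '<a href="/some/2042-other">Other</a>'];

RE = '<a[\s]+href="(?<target>.*?)"[^>]*>(?<text>.*?)</a>';

% list is a 1-by-2 struct array with fields target and text.
list = regexp(html, RE, 'names');

firstUrl   = list(1).target;   % the href value of the first link
firstLabel = list(1).text;     % the text between <a ...> and </a>

% Collect every URL into a cell array of strings.
urls = {list.target};
```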
