How do I read the text between href tags and return the results in a cell array?
Currently, I have an html webpage saved in a text format. Below is an example of the portion of the text I am interested in:
<a href='/some/1056-text-stuff'>
I want to search the text document for every occurrence of the "<a href='/some/" pattern and extract the text between the quotes, i.e.
/some/1056-text-stuff
MATLAB's regexp offers the 'match' and 'tokens' options, but I am struggling to pick out the string cleanly. Ideally, I would like to search the document and return a cell array of strings listing all of the matches. Here is my current code:
str= fileread('C:\Users\Me\Documents\MATLAB\trial.txt'); %read in text file
urls = regexp(str, 'href=(\S+)(\s*)$', 'tokens', 'lineAnchors'); %find urls
Accepted Answer
Julian
17 Jun 2016
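You can try something like this, where html holds the page text (str in the question's code):
>> RE='<a[\s]+href=["''](?<target>.*?)["''][^>]*>(?<text>.*?)</a>'; % accepts single- or double-quoted href values
>> list=regexp(html, RE, 'names')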
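The sample in the question wraps the href value in single quotes, so a pattern has to accept either quote style to match it. Below is a minimal sketch of one way to get a plain cell array of URLs using the 'tokens' option. The inline sample text, the second link, and the variable names pattern, urls and someUrls are assumptions for illustration; in practice str would come from fileread as in the question.

```matlab
% Stand-in for str = fileread(...); both quote styles are included on purpose.
str = ['<p>intro</p>' char(10) ...
       '<a href=''/some/1056-text-stuff''>First</a>' char(10) ...
       '<a class="x" href="/other/2000-page">Second</a>'];

% <a, whitespace, any other attributes, then href= followed by a quoted value.
% Inside a MATLAB char vector, '' stands for one literal single quote.
pattern = '<a\s[^>]*?href\s*=\s*["'']([^"'']*)["'']';

% 'tokens' returns one 1x1 cell per match, holding the captured group.
tokens = regexp(str, pattern, 'tokens');

% Flatten the nested cells into a cell array of strings.
urls = cellfun(@(t) t{1}, tokens, 'UniformOutput', false);

% Optionally keep only the links that start with /some/ (6 characters).
someUrls = urls(strncmp(urls, '/some/', 6));
```

Matching HTML with regular expressions is fragile in general (attribute order, unquoted values and uppercase tags all vary), so this is only meant for simple, regular pages like the one described.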
I can recommend this tool https://www.regexbuddy.com/
Ana Alonso
17 Dec 2019
Hi there,
What do the (?<target>.*?) and (?<text>.*?) expressions correspond to?
I've never worked with html before and I'm just trying to scrape urls from the html code.
Thanks!
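For reference, (?<target>...) and (?<text>...) are named tokens: each one captures whatever its inner expression matches and stores it under that name, and .*? matches as few characters as possible. With the 'names' option, regexp returns a struct whose fields carry those names. A small sketch, assuming a hypothetical one-link sample that uses double quotes to suit the pattern as posted in the answer:

```matlab
% Hypothetical sample input, double-quoted to suit the pattern below.
html = '<a href="/some/1056-text-stuff">link text</a>';
RE = '<a[\s]+href="(?<target>.*?)"[^>]*>(?<text>.*?)</a>';

% Field names of the result come from the (?<name>...) groups.
list = regexp(html, RE, 'names');

disp(list.target)   % the href value
disp(list.text)     % the text between <a ...> and </a>

% With several links, list is a struct array; this gathers every
% target into a cell array of strings.
urls = {list.target};
```

So for scraping URLs only the target field is needed; the text field is the visible link label.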