Web scraping with regular expressions: getting rid of HTML tags
Hi all,
I am writing some web-scraping code and am therefore using regular expressions. I need to isolate the words in a string, and of course HTML tags should not be included. HTML tags are words enclosed in < > (e.g. <br>). Unfortunately, my code does not work and I am wondering why. Here is an example:
regexp('qu <qa>','(?!<)\w*(?!>)','match')
My expected result is 'qu', but instead I get 'qu' and 'q'. The code works with the string 'qu q'. What can I do to solve this issue?
Thanks.
Regards,
Pietro
Accepted Answer
Guillaume
3 June 2017
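The first part of your expression, (?!<), is a negative look-ahead: it only checks that the next character is not <, so it does nothing to stop a match from starting right after a <. You want a negative look-behind instead. Add a < before the !: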
regexp('qu <qa>', '(?<!<)\w*(?!>)', 'match')
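As a quick check of the difference, the sketch below runs both patterns on the string from the question. The variable names are illustrative and not from the original post, and the comments only restate the results reported in this thread.

```matlab
str = 'qu <qa>';

% Original pattern: (?!<) is a negative look-ahead, so it tests the
% character that follows the current position, not the one before it.
% The question reports 'qu' and 'q' for this call.
withLookahead = regexp(str, '(?!<)\w*(?!>)', 'match');

% Corrected pattern: (?<!<) is a negative look-behind, so a match may
% not start immediately after a '<'. Only 'qu' is expected here.
withLookbehind = regexp(str, '(?<!<)\w*(?!>)', 'match');

disp(withLookahead)
disp(withLookbehind)
```

One caveat, as far as I can tell: the look-behind only guards the first letter after the <. With a longer tag name such as <qab>, I would expect the match to be able to start at the second letter and return 'a', so this fix is best treated as specific to short tags like the one in the question.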
Comments: 3
Guillaume
3 June 2017
It's a lot more difficult to tell a regular expression not to match something than it is to tell it to match something. Therefore, I'd do it in two passes.
1. Remove the tags:
notags = regexprep(yourstring, '<[^>]*>', '')
2. Match whatever it is you want to match:
matches = regexp(notags, '\w+', 'match')
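Put together, the two passes could look like the minimal sketch below. The sample string is made up for illustration and is not from the original post; the two patterns are the ones given above.

```matlab
% Hypothetical sample input, chosen to include several kinds of tags.
yourstring = 'qu <qa> some <b>bold</b> text<br>';

% Pass 1: delete everything from a '<' up to and including the next '>'.
notags = regexprep(yourstring, '<[^>]*>', '');

% Pass 2: collect runs of word characters (letters, digits, underscore).
matches = regexp(notags, '\w+', 'match');

disp(matches)
```

Because each tag is replaced with an empty string, two words separated only by a tag (for example 'a<br>b') would be joined into one. If that matters for your pages, replacing the tags with ' ' instead of '' is one way around it.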