Web scraping with regular expression, getting rid of html tags.

Question

pietro, June 3, 2017

https://kr.mathworks.com/matlabcentral/answers/343153-web-scraping-with-regular-expression-getting-rid-of-html-tags

Edited: pietro, June 4, 2017

Hi all,

I am writing some web scraping code, so I am using regular expressions. I need to isolate the words in a string, and HTML tags should not be included. HTML tags are words enclosed in < > (e.g. <br>). Unfortunately, my code does not work and I am wondering why. Here is an example:

regexp('qu <qa>','(?!<)\w*(?!>)','match')

My expected result is 'qu', but instead I get 'qu' and 'q'. The code works with this string: 'qu q'. What can I do to solve this issue?

thanks

Regards,

Pietro

The following code works: regexp('qu qa','(?!<)\w*(?!>)','match')

Answer 1

Guillaume, June 3, 2017

https://kr.mathworks.com/matlabcentral/answers/343153-web-scraping-with-regular-expression-getting-rid-of-html-tags#answer_269431

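The first part of your expression, (?!<), is a negative look-ahead: it tests the character that follows the current position. You want a negative look-behind instead, which tests the character that precedes it. Add a < before the !: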

regexp('qu <qa>', '(?<!<)\w*(?!>)', 'match')
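To see the difference on the example string, here is a small MATLAB sketch that runs both patterns side by side. The variable names are illustrative and not from the thread, and the last input ('qu <qab>') is an extra case added here, with its result traced by hand rather than taken from the original posts.

```matlab
str = 'qu <qa>';

% Look-ahead version from the question: (?!<) only checks that the NEXT
% character is not '<'. At the 'q' inside the tag that check passes, and
% \w* backtracks from 'qa' to 'q' to satisfy the trailing (?!>).
withLookahead = regexp(str, '(?!<)\w*(?!>)', 'match');    % reported: 'qu' and 'q'

% Look-behind version: (?<!<) checks that the PREVIOUS character is not
% '<', so a match can no longer start at the 'q' right after '<'.
withLookbehind = regexp(str, '(?<!<)\w*(?!>)', 'match');  % expected: 'qu'

% Added case (assumption, not from the thread): the look-arounds only guard
% the first and last characters of the tag name, so a middle character of a
% longer tag name can still be matched.
longerTag = regexp('qu <qab>', '(?<!<)\w*(?!>)', 'match'); % traced by hand: 'qu' and 'a'

disp(withLookahead)
disp(withLookbehind)
disp(longerTag)
```

The last case suggests the look-around pattern is fragile for tag names longer than two characters, which is one reason to prefer removing the tags first.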

Guillaume, June 3, 2017

It's a lot more difficult to tell a regular expression not to match something than it is to tell it to match something. Therefore, I'd do it in two passes.

1. Remove the tags:

notags = regexprep(yourstring, '<[^>]*>', '')

2. Match whatever it is you want to match:

matches = regexp(notags, '\w+', 'match')
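Put together, the two passes might look like the sketch below. The input string and the variable names html and words are made up for illustration; only the two regular expressions come from the steps above.

```matlab
% Hypothetical input: any char vector containing HTML would do.
html = 'qu <qa> some <b>bold</b> text<br>';

% Pass 1: delete every run that starts at '<' and ends at the next '>'.
notags = regexprep(html, '<[^>]*>', '');

% Pass 2: collect the remaining runs of word characters.
words = regexp(notags, '\w+', 'match');

disp(words)   % 'qu'  'some'  'bold'  'text'
```

One thing to watch: replacing a tag with an empty string glues the neighbouring text together, so 'line1<br>line2' becomes the single word 'line1line2'. If that matters for your pages, replacing with a space (' ') instead of '' in the first pass keeps the words apart.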

pietro, June 4, 2017

Edited: pietro, June 4, 2017

Thanks for your reply. I hadn't thought of using regexprep.
