Help with REGEXP: extracting info from a fragment of URL inside the HTML code.

Question

Ajpaezm, 3 April 2019

https://kr.mathworks.com/matlabcentral/answers/454226-help-with-regexp-extracting-info-from-a-fragment-of-url-inside-the-html-code

Commented: Ajpaezm, 4 May 2019

Hey guys, I have used webread/urlread to get info from this site. The output is huge, but I'm only interested in these lines:

     <li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=-1'> < </a></li>
     <li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=1'>1</a></li>
     <li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=2'>2</a></li>
     <li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=3'>3</a></li>
     <li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=4'>4</a></li>
     <li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=5'>5</a></li>
     <li class='disabled'><span>...</span></li>
     <li><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=22'>22</a></li>

If you notice, there's a 'segment' from the main url included in this part of the HTML code (this one: /en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=5). From this, I'd like to get the numbers at the very end of this fragment, or the numbers between the >< symbols (like 1, 2, 3, 4, 5 and 22).

I tried this, foolishly thinking it was going to help:

url='https://www.interactivebrokers.com/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=';
pattern='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=[1-9]';
[a1, a2]=regexp(url, pattern,'match'); 

But it didn't work. Do you have any suggestions for this one? I previously tried '<li[^>]*><a[^>]*>(.*?)</a></li>' with the 'tokens' option, and although it captures these values, it also captures a lot of stuff I don't want.

Thanks for your help!

Comments: 2

Stephen23, 3 April 2019

Edited: Stephen23, 3 April 2019

Keep in mind that regular expressions are not a robust or neat way to parse HTML:

https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not

https://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/

https://blog.codinghorror.com/parsing-html-the-cthulhu-way/

https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

Just because you can do something does not mean that it is a good idea. You would likely be better off using a proper HTML parser.

Walter Roberson, 3 April 2019

The Text Analytics Toolbox might have suitable tools.
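As a rough illustration of the parser route suggested in these comments, here is a minimal sketch that assumes the Text Analytics Toolbox is installed and provides htmlTree, findElement and getAttribute. The sample HTML is a shortened, illustrative subset of the lines in the question, and the variable names are hypothetical. It has not been checked against the live page.

```matlab
% Sketch only: requires Text Analytics Toolbox (assumed available).
% Shortened sample of the pagination list from the question (illustrative).
html = "<ul>" + ...
    "<li class=''><a href='/en/index.php?f=2222&exch=IBIS&limit=100&page=1'>1</a></li>" + ...
    "<li class=''><a href='/en/index.php?f=2222&exch=IBIS&limit=100&page=2'>2</a></li>" + ...
    "<li class='disabled'><span>...</span></li>" + ...
    "<li><a href='/en/index.php?f=2222&exch=IBIS&limit=100&page=22'>22</a></li>" + ...
    "</ul>";

tree    = htmlTree( html );              % parse the HTML into a tree
anchors = findElement( tree, "a" );      % all <a> elements
hrefs   = getAttribute( anchors, "href" ); % string array of href values

% Keep only the trailing page number of each href; hrefs that do not end
% in 'page=<digits>' yield an empty token and drop out of the concatenation.
tok   = regexp( hrefs, 'page=(\d+)$', 'tokens', 'once' );
pages = str2double( [ tok{:} ] );
disp( pages )
```

On the real page there will be many other anchors, so the href filter (or a narrower selector) matters more than in this small sample. Anchors without an href attribute are not handled here.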

Answer 1

per isakson, 3 May 2019

https://kr.mathworks.com/matlabcentral/answers/454226-help-with-regexp-extracting-info-from-a-fragment-of-url-inside-the-html-code#answer_373522

Edited: per isakson, 4 May 2019

"Keep in mind that regular expressions are not a robust or neat way to parse HTML:" Anyhow, it can be used as an exercise on regular expressions.

>> cssm('h:\m\cssm\cssm.txt')
ans =
     1     2     3     4     5    22

where

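function    num = cssm( ffs )
    % Read the saved HTML source from the file ffs.
    str = fileread( ffs );
    % Literal URL fragment that precedes every page number.
    xpr = '/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=';
    % Escape regexp metacharacters such as '.' and '?'.
    xpr = regexptranslate( 'escape', xpr );
    % Match the link text: digits preceded by <fragment><digits>'> and followed by '<'.
    xpr = ['(?<=',xpr,'\d+''>)\d+(?=<)'];
    cac = regexp( str, xpr, 'match' );
    % Convert the cell array of matches to a numeric row vector.
    num = str2double( cac );
end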

and where h:\m\cssm\cssm.txt contains the html-code of the question.

The length of the look-behind-text varies because of the expression, '\d+', which may hamper performance.
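One way to sidestep the variable-length look-behind is to match the whole fragment and capture only the link text with a token. This is a minimal sketch of that alternative, not the code from the answer. The inline sample HTML is a shortened subset of the lines in the question, and the names html, prefix and tok are chosen here for illustration.

```matlab
% Shortened sample of the HTML from the question (illustrative subset).
html = [ ...
    "<li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=-1'> < </a></li>"
    "<li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=1'>1</a></li>"
    "<li class='disabled'><span>...</span></li>"
    "<li><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=22'>22</a></li>" ];
str = char( join( html, newline ) );   % one char vector, as fileread would return

% Literal URL fragment, escaped so that '.' and '?' are matched literally.
prefix = '/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=';
xpr = [ regexptranslate( 'escape', prefix ), '\d+''>(\d+)<' ];

% 'tokens' returns only the parenthesised part: the digits between '>' and '<'.
tok = regexp( str, xpr, 'tokens' );    % 1-by-N cell, each cell holds one token
num = str2double( [ tok{:} ] );
disp( num )
```

The "previous page" link (page=-1) is skipped because '\d+' does not match the minus sign, which is the same behaviour as the look-behind version.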

Comments: 3 (1 older comment hidden)

per isakson, 4 May 2019

See https://se.mathworks.com/matlabcentral/answers/402062-help-using-regexp#answer_321487

Ajpaezm, 4 May 2019

Oh I remember that post, I went back there a couple of times for other things, but for this one I couldn't find a solution with those resources. Actually I gave up this route and tried something else until your reply on this post today.

I was trying to do it with regexp directly, and with that piece of string that always seems to repeat itself in that part of the HTML code. But failed :/

I'll analyze this approach you used for HTML parsing. :)
