How to select specific urls in a webpage with regexp?

Question

pietro on 8 Jun 2017

Direct link to this question:

https://kr.mathworks.com/matlabcentral/answers/343912-how-to-select-specific-urls-in-a-webpage-with-regexp

Commented: pietro on 1 Jul 2017

Hi all,

I'm scraping this website and need to extract the tractor links, which appear in many lines like the following one:

<tr><td><a href="http://www.tractordata.com/farm-tractors/005/4/6/5460-john-deere-20a.html">20A</a></td><td>21 hp</td><td>2008 - 2011</td></tr>

So each link is followed by a string matching '\d* hp'. Here is the code I use to detect them:

url='http://www.tractordata.com/farm-tractors/tractor-brands/johndeere/johndeere-tractors.html';
html=urlread(url);
hyperlinks = regexp(html,'(?<=<tr><td.*>)<a.*?/a>(?=.*{8,50}\d* hp</td>)','match');

This code works fairly well, but I can't get rid of the first result, which is wrong:

<a href="http://www.tractordata.com/spacer.gif" height="1" width="1" alt=""></td></tr>
<tr><td><a href="http://www.tractordata.com/farm-tractors/005/4/6/5460-john-deere-20a.html">20A</a>

As you can see, the match starts one row above the link that should be selected. How can I fix this? Thanks.

Michael Dombrowski on 29 Jun 2017

When I run your code, I get no results in hyperlinks. But have you thought of adding "farm-tractors" to your regex? It would resolve your issue, and it would work fine as long as all the tractor links point into the farm-tractors directory.
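
A minimal sketch of this suggestion, assuming the two rows quoted in the question are representative of the page. The html string below is a hand-built sample rather than the live page, and the pattern is only one possible way to require "farm-tractors" inside the href. Note that it does not check the following "hp" cell.

```matlab
% Sample input assembled from the two rows quoted in the question (not fetched live).
lf = char(10);  % line feed between the two table rows
html = [ ...
    '<tr><td><a href="http://www.tractordata.com/spacer.gif" height="1" width="1" alt=""></td></tr>' lf ...
    '<tr><td><a href="http://www.tractordata.com/farm-tractors/005/4/6/5460-john-deere-20a.html">20A</a></td><td>21 hp</td><td>2008 - 2011</td></tr>'];

% Only accept anchors whose href contains /farm-tractors/.
% [^"]* stays inside the quoted href; [^<]* stays inside the link text.
pattern = '<a href="[^"]*/farm-tractors/[^"]*">[^<]*</a>';
hyperlinks = regexp(html, pattern, 'match');

% The spacer.gif anchor is skipped because its href has no /farm-tractors/ part.
disp(hyperlinks{1})
```

Because the filter relies only on the directory name, any non-tractor link under farm-tractors would be matched as well.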

Answer 1

Guillaume on 29 Jun 2017

Direct link to this answer:

https://kr.mathworks.com/matlabcentral/answers/343912-how-to-select-specific-urls-in-a-webpage-with-regexp#answer_272391

Edited: Guillaume on 29 Jun 2017

Note: avoid greedy .*, particularly in complex expressions; it's bound to cause you problems. Negated character classes often work better. For example, instead of <td.*>, use <td[^>]*>.
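
To illustrate the difference, here is a small sketch on a made-up table row (the row string and its class attribute are hypothetical, not taken from the page):

```matlab
% Hypothetical table row, loosely modeled on the one quoted in the question.
row = '<tr><td class="a"><a href="x.html">20A</a></td><td>21 hp</td></tr>';

% Greedy: .* runs to the end of the input, then backs off to the LAST '>'.
greedy = regexp(row, '<td.*>', 'match', 'once');

% Negated class: [^>]* cannot cross a '>', so the match ends with the tag itself.
negated = regexp(row, '<td[^>]*>', 'match', 'once');

disp(greedy)   % everything from the first <td to the final </tr>
disp(negated)  % <td class="a">
```

The greedy version swallows the rest of the row, while the negated class stops at the first closing bracket.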

As Michael's comment says, your posted regex does not work. But even with the simplified regex:

hyperlinks = regexp(html, '(?<=<tr><td[^>]*>)<a.*?/a>', 'match')' %transposed for easy viewing in command window

you can see that there is a problem. Unfortunately for you, the problem is the webpage itself, which is not valid HTML. The spacer.gif <a> hyperlink (on line 131 of the source HTML) is never closed. So of course your regex captures everything up to the next /a>, which belongs to the next <tr><td>.
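
The effect can be reproduced on a small sample, assuming the two rows quoted in the question stand in for the relevant part of the page (the html string is hand-built, not fetched):

```matlab
% Sample input assembled from the two rows quoted in the question (not fetched live).
lf = char(10);  % line feed between the two table rows
html = [ ...
    '<tr><td><a href="http://www.tractordata.com/spacer.gif" height="1" width="1" alt=""></td></tr>' lf ...
    '<tr><td><a href="http://www.tractordata.com/farm-tractors/005/4/6/5460-john-deere-20a.html">20A</a></td><td>21 hp</td><td>2008 - 2011</td></tr>'];

% Simplified regex: the lazy .*? still has to reach some /a>, and the first
% one available after the unclosed spacer.gif anchor is in the NEXT row.
matches = regexp(html, '(?<=<tr><td[^>]*>)<a.*?/a>', 'match');

% A single match comes back, starting at the spacer.gif anchor and ending
% at the closing tag of the tractor link.
fprintf('%d match(es)\n', numel(matches));
disp(matches{1})
```

Since the tractor link is consumed by that first match, it cannot be returned on its own.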

Unfortunately that makes your life rather difficult. Try:

 hyperlinks = regexp(html, '(?<=<tr><td[^>]*>)<a[^>]*>[^<]*</a>(?=</td><td[^>]*>\d+ hp</td>)', 'match')' %transposed for easy viewing in command window

And if you can, report to the website owner that their page is missing a closing tag.
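
As a quick check, here is a sketch that applies the suggested regex to a hand-built sample made of the two rows quoted in the question, then pulls the bare URL out of each match. The second regexp call and the variable names lf, pattern and urls are additions for illustration, not part of the original suggestion.

```matlab
% Sample input assembled from the two rows quoted in the question (not fetched live).
lf = char(10);  % line feed between the two table rows
html = [ ...
    '<tr><td><a href="http://www.tractordata.com/spacer.gif" height="1" width="1" alt=""></td></tr>' lf ...
    '<tr><td><a href="http://www.tractordata.com/farm-tractors/005/4/6/5460-john-deere-20a.html">20A</a></td><td>21 hp</td><td>2008 - 2011</td></tr>'];

% Suggested regex: [^>]* and [^<]* keep the match inside a single anchor, and
% the lookahead requires the next cell to be a horsepower value.
pattern = '(?<=<tr><td[^>]*>)<a[^>]*>[^<]*</a>(?=</td><td[^>]*>\d+ hp</td>)';
hyperlinks = regexp(html, pattern, 'match');

% Illustrative extra step: keep only the text between href=" and the next quote.
urls = regexp(hyperlinks, '(?<=href=")[^"]*', 'match', 'once');

disp(hyperlinks{1})
disp(urls{1})
```

The unclosed spacer.gif anchor is rejected because [^<]* cannot reach a closing </a> without crossing another tag.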

pietro on 1 Jul 2017

I got it. In fact, with .* it gets a bit tricky sometimes.
