HTML Page source info

Question


Hello, we often come across a series of numbered webpages

basePage.html?page=2
basePage.html?page=3

and so forth, wherein there are several fields identified by their labels:

<h2 class="category-heading">Name1</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name2</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name3</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>

and so on.

How can the "textOfInterest" of one particular parameter, say Parameter2, for every Name* on every page,

basePage.html?page=1toInf

be exported into one text file, say Parameter2.txt?

The "textOfInterest" is often alphanumeric and may also contain special characters such as !@#$%.

Thanks.


b on 1 Dec 2020

Initially, I was hesitant to download this file because I thought it was religious or something of that sort. But I am happy to have downloaded it. It is immensely useful and 'on the money' for this thread.

My interest is in the function button_Callback in BibleDownloader.m. The webpage gets saved in the variable called 'data'. And since finding <div class="pagination"> is right in the ballpark of my initial query, I was greatly excited to see the output and experiment with the case 'NB2014' inside this function. Unfortunately, execution doesn't seem to reach this branch, since I was unable to retrieve either 'data' or the indices idx* (idx, idx2 and idx3), all of which will be useful for me. How can I get to this part and access them?

Also, perhaps you can suggest one regexp line to pull out 'textOfInterest' from

<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>

Better still, if you already have something like the BibleDownloader m-file that uses regexp to extract text between <div class> and </div> type of structure, that would be great.

Rik on 1 Dec 2020

Edited: Rik on 1 Dec 2020

The goal of Bible downloader is religious (although you can of course use the text of a Bible translation for non-religious purposes as well), but the code isn't.

Did you try adapting any of the code? I'll post some code as an answer.


Answer 1

Rik on 1 Dec 2020

One possibility with strfind:

%d is the page source as a char vector
close_div=strfind(d,'</div>');%start index of every closing div tag
param=1;
pat=sprintf('<label>Parameter%d : </label> <div class="category-related">',param);
position=strfind(d,pat);
position=position+numel(pat);%this will be the start of your text of interest
texts=cell(size(position));
for n=1:numel(position)
    %first closing tag at or after the start (>= also handles an empty text)
    end_of_text=close_div(close_div>=position(n));
    end_of_text=end_of_text(1)-1;
    texts{n}=d(position(n):end_of_text);
end

Or with a regexp:

d=['<h2 class="category-heading">Name1</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'...
'<h2 class="category-heading">Name2</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'...
'<h2 class="category-heading">Name3</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'];
RE=['<label>Parameter\d',... % \d matches a single digit
    ' : </label> <div class="category-related">',...
    '(',... % use parentheses to capture a token
    '[^<]*',... % this matches any number of characters other than <
    ')',...
    '</div>'];
t=regexp(d,RE,'tokens');
clc
celldisp(t)
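
To cover the original question (one parameter, all pages, one text file), the same kind of expression can be restricted to a single parameter and applied page by page. This is only a sketch under stated assumptions: `pages` and `fetchPage` are hypothetical stand-ins for the real site so the example runs offline, `max_pages` is an assumed safety limit, and stopping at the first page without a match is a guess at how the end of the series shows up. For a live site, `fetchPage` would presumably wrap `webread` with the real URL.

```matlab
% Hypothetical stand-in for the real site: two fake pages held in memory.
pages={['<label>Parameter1 : </label> <div class="category-related">skip</div>',...
        '<label>Parameter2 : </label> <div class="category-related">a!1</div>'],...
       '<label>Parameter2 : </label> <div class="category-related">b@2</div>'};
fetchPage=@(k) pages{k}; % assumed helper; replace with a webread call for real pages

param=2;
RE=sprintf('<label>Parameter%d : </label> <div class="category-related">([^<]*)</div>',param);

max_pages=numel(pages); % assumed upper bound on the page number
all_texts={};
for k=1:max_pages
    t=regexp(fetchPage(k),RE,'tokens'); % 1xN cell, each element a 1x1 cell
    if isempty(t),break,end % assumption: no match means we are past the last page
    all_texts=[all_texts t{:}]; %#ok<AGROW> flatten the tokens and append
end

% Write one text per line; the texts are passed as data, so characters
% such as % or \ in them are not interpreted as format specifiers.
out_file=sprintf('Parameter%d.txt',param);
fid=fopen(out_file,'w');
fprintf(fid,'%s\n',all_texts{:});
fclose(fid);
```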

You can also adapt the expression to use a lookahead for </div>, so that you can use a lazy .*? instead of [^<]* (a greedy .* would run on to the last </div>).
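
A minimal sketch of that lookaround variant, assuming the labels are spaced exactly as in the question. The sample `d` below is hypothetical; its last entry contains a `<` inside the text, which `[^<]*` would not match.

```matlab
% Hypothetical sample input
d=['<label>Parameter1 : </label> <div class="category-related">abc</div>',...
   '<label>Parameter2 : </label> <div class="category-related">x!@#$%1</div>',...
   '<label>Parameter2 : </label> <div class="category-related">a<b</div>'];
param=2;
RE=sprintf(['(?<=<label>Parameter%d : </label> <div class="category-related">)',... % lookbehind: must follow the label
    '.*?',...       % lazy: as few characters as possible
    '(?=</div>)'],... % lookahead: must be followed by the closing tag
    param);
texts=regexp(d,RE,'match'); % the match itself is the text of interest
```

Because nothing outside the lookarounds is consumed, 'match' returns the texts directly and no token is needed. One caveat: an empty field produces a zero-length match, which regexp drops by default, so the token version may be the safer choice if empty fields can occur.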


b on 1 Dec 2020

Thank you.

But I have run into a problem with the following:

I am trying to extract two parameters simultaneously: Parameter1 and Parameter2. It often happens that Parameter1 is present but Parameter2 is missing. That is, the structure is like this:

<h2 class="category-heading">Name1</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name2</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name3</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>

The same problem occurs if I try to take all three parameters.

When all three parameters are extracted, the objective is to get ' ' (no value) in the place where a parameter is missing, rather than skipping it completely. Skipping it would misalign the results; with a placeholder, the corresponding entry in the exported text file is simply blank.

In the first (strfind) code, I tried replicating the for loop three times for the three parameters, but quickly ran into problems.

b on 2 Dec 2020

Thanks for the link.

I downloaded readfile from GitHub. The 'elements' output seems promising, except: what are those ->->-> arrows in front of all the fields of interest? Anyway, I am glad to have gotten to this point.

But the same situation arises with all three approaches: when the mail field is missing, how can I write 'NULL' in the output file and continue with the loop?

Name1    mail1
Name2    missing
Name3    mail3
Name4    mail4

The strfind and regexp approaches give

Name{1}='Name1'
Name{2}='Name2'
Name{3}='Name3'
Name{4}='Name4'

and

Parameter{1}='mail1'
Parameter{2}='mail3'
Parameter{3}='mail4'

How can I bypass the for loop and at the same time print 'NULL' in the corresponding Excel cell? In this example, (row=2,col=2) would be 'NULL', and (row=3,col=2) would be Parameter{2}.

It is not a question of 'skipping if not found', because numel(position) has already been evaluated: it is 4 here for the Name field and 3 for the Parameter. So it seems to be hardcoded.

Rik on 2 Dec 2020

Those arrows are probably newline characters. What release are you using?

I would suggest parsing each element separately. That way you can write an empty char or whatever you prefer in the email field for that person.
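
As a rough sketch of that per-element approach (not necessarily the code used elsewhere in this thread): split the source at the headings, then search each block on its own, so that a missing field can be filled with a placeholder. The sample `d`, the placeholder 'NULL', and the variable names are illustrative assumptions.

```matlab
% Hypothetical sample in which Name2 has no Parameter2
d=['<h2 class="category-heading">Name1</h2>',...
   '<label>Parameter1 : </label> <div class="category-related">p1a</div>',...
   '<label>Parameter2 : </label> <div class="category-related">mail1</div>',...
   '<h2 class="category-heading">Name2</h2>',...
   '<label>Parameter1 : </label> <div class="category-related">p1b</div>',...
   '<h2 class="category-heading">Name3</h2>',...
   '<label>Parameter2 : </label> <div class="category-related">mail3</div>'];

% Capture the names and split the source at every heading in one call.
[names,blocks]=regexp(d,'<h2 class="category-heading">([^<]*)</h2>','tokens','split');
names=[names{:}];      % flatten to a 1xN cell of char
blocks=blocks(2:end);  % drop the text before the first heading

RE='<label>Parameter2 : </label> <div class="category-related">([^<]*)</div>';
result=cell(numel(names),2);
for n=1:numel(names)
    tok=regexp(blocks{n},RE,'tokens','once'); % empty cell if this block has no match
    result{n,1}=names{n};
    if isempty(tok)
        result{n,2}='NULL';   % placeholder keeps the rows aligned
    else
        result{n,2}=tok{1};
    end
end
```

Each row of `result` now pairs a name with its own value, so the rows stay aligned when they are written out, and an empty but present field is still distinguishable from a missing one.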


Answer 2

b on 3 Dec 2020

That is exactly how I am doing it. When the fields are parsed separately, there is no way to tell which Name field has its Mail field missing. The code parses all the Name fields, then all the Mail fields, as a sequential process.

What modification should be made to the code so that it prints 'Not Found' when the mail field is missing in the corresponding iteration? Is there a way to get the index values of the missing Mail fields?
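
One possible way to get those indices while keeping the sequential strfind approach is to compare start positions: every label belongs to the last heading that starts before it. This is a sketch with a hypothetical sample `d` and illustrative variable names, assuming each field always follows its own heading in the page source.

```matlab
% Hypothetical sample in which Name2 has no Parameter2
d=['<h2 class="category-heading">Name1</h2>',...
   '<label>Parameter2 : </label> <div class="category-related">mail1</div>',...
   '<h2 class="category-heading">Name2</h2>',...
   '<label>Parameter1 : </label> <div class="category-related">p1b</div>',...
   '<h2 class="category-heading">Name3</h2>',...
   '<label>Parameter2 : </label> <div class="category-related">mail3</div>'];

name_pos=strfind(d,'<h2 class="category-heading">'); % start of every heading
par_pos=strfind(d,'<label>Parameter2 : </label>');   % start of every Parameter2 label

% owner(k) is the index of the heading that the k-th label belongs to:
% the number of headings that start before that label.
owner=arrayfun(@(p) sum(name_pos<p),par_pos);
owner=owner(owner>0); % ignore labels that appear before the first heading

has_par=false(size(name_pos));
has_par(owner)=true;
missing_idx=find(~has_par); % headings without a Parameter2 field
```

With `owner` known, the k-th extracted text can be stored at row `owner(k)` of a pre-filled cell array (for example filled with 'Not Found'), so the rows listed in `missing_idx` keep the placeholder.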


b on 3 Dec 2020

I am overwhelmed by how patiently you have worked with me on this thread. I think I will close this elaborate thread here, but not before posting this limerick:

There was once a man named Rik, 
Who wrote matlab codes so quick, 
To the topic, they were relevant
The codes themselves so elegant, 
His m-files, sir, were completely sick!

Enjoy your freedom from this thread.

Rik on 3 Dec 2020

You're welcome (and thanks for the limerick XD).

If you have a follow-up question, feel free to post a link to it here.
