HTML Page source info
Hello, we often come across a series of numbered webpages such as
basePage.html?page=2
basePage.html?page=3
and so forth, wherein there are several fields identified by their labels:
<h2 class="category-heading">Name1</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name2</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name3</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
and so on.
How can the "textOfInterest" of one particular parameter, say, Parameter2, of all the Name*, of all the pages,
basePage.html?page=1toInf
be taken (outputted/exported) into one text file, say, Parameter2.txt?
The "textOfInterest" is often alphanumeric and may also contain special characters such as !@#$%.
Thanks.
Rik
1 Dec 2020
Edited: Rik
1 Dec 2020
The goal of Bible downloader is religious (although you can of course use the text of a Bible translation for non-religious purposes as well), but the code isn't.
Did you try adapting any of the code? I'll post some code as an answer.
Accepted Answer
Rik
1 Dec 2020
One possibility with strfind:
%d is the page source as a char vector (for example as read with webread)
close_div=strfind(d,'</div>');
param=1;
pat=sprintf('<label>Parameter%d : </label> <div class="category-related">',param);
position=strfind(d,pat);
position=position+numel(pat);%this will be the start of your text of interest
texts=cell(size(position));
for n=1:numel(position)
    %closing tags at or after the start of this text (>= so an empty field is handled)
    end_of_text=close_div(close_div>=position(n));
    end_of_text=end_of_text(1)-1;
    texts{n}=d(position(n):end_of_text);
end
Or with a regexp:
d=['<h2 class="category-heading">Name1</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'...
'<h2 class="category-heading">Name2</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'...
'<h2 class="category-heading">Name3</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'];
RE=['<label>Parameter\d',... % \d matches a single digit
' : </label> <div class="category-related">',...
'(',... % use parentheses to capture a token
'[^<]*',... % this matches any number of characters other than <
')',...
'</div>'];
t=regexp(d,RE,'tokens');
clc
celldisp(t)
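To tie this back to the original question, the same expression can be restricted to one parameter and applied page by page. The function below is only a sketch of that idea. The name exportParameter, the fetchPage function handle, the maxPages cap, and the rule "stop at the first page without a match" are all assumptions made for illustration, since the thread does not say how the last page can be recognised.

```matlab
function n = exportParameter(fetchPage, param, outFile)
% Collect the text of one parameter from consecutive pages into a text file.
% fetchPage : function handle; fetchPage(k) returns the HTML of page k as text
% param     : parameter number, e.g. 2 for "Parameter2"
% outFile   : name of the text file to write, e.g. 'Parameter2.txt'
% n         : number of entries written
maxPages = 1000;   % assumed safety cap, since the number of pages is unknown
RE = sprintf(['<label>Parameter%d : </label> <div class="category-related">' ...
              '([^<]*)</div>'], param);
fid = fopen(outFile, 'w');
if fid < 0, error('Could not open %s for writing.', outFile); end
cleanup = onCleanup(@() fclose(fid));   % close the file even if an error occurs
n = 0;
for k = 1:maxPages
    d = char(fetchPage(k));
    t = regexp(d, RE, 'tokens');   % cell array of 1x1 cells
    if isempty(t), break, end      % assumption: a page without matches ends the series
    t = [t{:}];                    % flatten to a cell array of char vectors
    for i = 1:numel(t)
        % the text is passed as an argument, so characters such as % stay literal
        fprintf(fid, '%s\n', t{i});
    end
    n = n + numel(t);
end
end
```

With a hypothetical base URL, a call could look like exportParameter(@(k) webread(sprintf('https://example.com/basePage.html?page=%d',k), weboptions('ContentType','text')), 2, 'Parameter2.txt'). Passing the download step in as a handle keeps the parsing testable without network access.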
You can also adapt the expression to look ahead for </div>, so you can use a lazy .*? instead of [^<]* (a greedy .* would run on to the last </div> in the source).
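A minimal sketch of that lookahead variant, using made-up sample text in which one field contains markup of its own (something [^<]* would not get past). The sample string and the fixed Parameter2 label are assumptions for illustration.

```matlab
% Sample source (hypothetical): the first field contains a nested tag
d = ['<label>Parameter2 : </label> <div class="category-related">text with <b>markup</b></div>' ...
     '<label>Parameter2 : </label> <div class="category-related">x!@#$%</div>'];
RE = ['<label>Parameter2 : </label> <div class="category-related">',...
    '(.*?)',...      % lazy: capture as few characters as possible
    '(?=</div>)'];   % lookahead: the next characters must be </div>, but they are not consumed
t = regexp(d, RE, 'tokens');
t = [t{:}];          % flatten to a cell array of char vectors
celldisp(t)
```

The lazy quantifier is what keeps each match inside its own div. Note that in MATLAB the dot also matches newline characters by default, so a field that spans several lines is captured as well.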
Rik
2 Dec 2020
Those arrows are probably newline characters. What release are you using?
I would suggest parsing each element separately. That way you can write an empty char or whatever you prefer in the email field for that person.
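One way to parse each element separately is to split the source at the headings first and then search each block on its own. This is a sketch under assumptions: the sample data is invented (the second entry lacks Parameter2, standing in for a missing field), and splitting on the h2 heading assumes every entry starts with one.

```matlab
d = ['<h2 class="category-heading">Name1</h2>' ...
     '<label>Parameter1 : </label> <div class="category-related">a1</div>' ...
     '<label>Parameter2 : </label> <div class="category-related">a2</div>' ...
     '<h2 class="category-heading">Name2</h2>' ...
     '<label>Parameter1 : </label> <div class="category-related">b1</div>'];
param = 2;
% One block per heading; the piece before the first heading is dropped
blocks = regexp(d, '<h2 class="category-heading">', 'split');
blocks = blocks(2:end);
RE = sprintf(['<label>Parameter%d : </label> <div class="category-related">' ...
              '([^<]*)</div>'], param);
texts = cell(size(blocks));
for n = 1:numel(blocks)
    tok = regexp(blocks{n}, RE, 'tokens', 'once');   % first match only
    if isempty(tok)
        texts{n} = '';        % field missing for this entry
    else
        texts{n} = tok{1};
    end
end
celldisp(texts)
```

Because texts has exactly one cell per heading, the entries stay aligned with the names even when a field is absent.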