Retrieve data from a website with multiple pages

sani on 18 Feb 2022
Commented: Ive J on 22 Feb 2022
Hi all,
I want to pull the data from this website into a table.
It has 185 pages, so I wrote a for loop to go through the entire table.
The problem is that I'm using webread, which seems to read everything into a char array.
What I want is for the table data to be read in each iteration of the for loop. How can this be done?
4 Comments
Rik on 19 Feb 2022
You can search the html for the text in the table and guess the structure from what you see.
sani on 19 Feb 2022
I think it is a <table>, if I understand correctly. I tried setting the ContentType option of weboptions to 'table', but it says that there is only text.
I'm not sure this is the right way to approach it, though.
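
One way to act on Rik's suggestion, sketched on a tiny made-up HTML snippet rather than the real page: parse the text that webread returns with htmlTree (which needs Text Analytics Toolbox) and pull out the th and td elements. The HTML string and the column names below are invented for illustration only.

```matlab
% Made-up HTML standing in for the char array returned by webread
html = "<table>" + ...
    "<tr><th>ID</th><th>City</th></tr>" + ...
    "<tr><td>1</td><td>Haifa</td></tr>" + ...
    "<tr><td>2</td><td>Yavne</td></tr>" + ...
    "</table>";

tree  = htmlTree(html);                            % requires Text Analytics Toolbox
hdr   = extractHTMLText(findElement(tree, "th"));  % header cells
cells = extractHTMLText(findElement(tree, "td"));  % body cells in document order

% cells are listed row by row, so fill column-wise and transpose
data = reshape(cells, numel(hdr), []).';
tab  = array2table(data, 'VariableNames', hdr);
disp(tab)
```

All values come back as strings, so numeric columns would still need a conversion such as double(tab.ID).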


Accepted Answer

Ive J on 19 Feb 2022
Edited: Ive J on 20 Feb 2022
My answer doesn't totally solve your problem, but it addresses your main questions (hopefully!). Before getting to the HTML parsing itself: webread doesn't return the content of the URL because the website uses some measures against bot attacks, so that needs to be fixed first.
url = "";
opts = weboptions('Timeout', 5e3);
htmlraw = webread(url, opts);
% webread cannot read the contents as the website requests cookies =========
% credits:
htmlraw = string(htmlraw);
top = htmlraw.split('<script>');
top = top(2);
if contains(top, "Challenge=")
    % the page returned a bot challenge: extract it and compute the expected answer
    Challenge = extractBetween(top, "Challenge=", ";");
    challenge_id = extractBetween(top, "ChallengeId=", ";");
    arr = char(Challenge);
    last_digit = str2double(arr(end));
    arr = sort(arr);
    min_digit = str2double(arr(1));
    subvar1 = (2*str2double(arr(3))) + str2double(arr(2));
    % string + number concatenates, so subvar2 is text, not a sum
    subvar2 = string(2 * str2double(arr(3))) + str2double(arr(2));
    power = ((str2double(arr(1)) * 1) + 2)^str2double(arr(2));
    x = double(Challenge) * 3 + subvar1;
    y = cos(pi * subvar1);
    answer = x * y;
    answer = answer - power;
    answer = answer + (min_digit - last_digit);
    answer = string(floor(answer)) + subvar2;
    hdrs = {'X-AA-Challenge' char(Challenge); ...
        'X-AA-Challenge-ID' char(challenge_id); ...
        'X-AA-Challenge-Result' char(answer)};
    % now read the website contents ===========================================
    htmlraw = webwrite(url, hdrs, opts); % content of URL in HTML code
end
% the patterns below were found by manually looking at the HTML code
data = htmlTree(htmlraw); % creating an HTML tree from raw content
hdr = findElement(data ,"th").extractHTMLText; % table header
col1 = extractBetween(htmlraw, 'columnLisenceNumber">', "</td>");
col2 = extractBetween(htmlraw, 'HeValue">', "</td>");
col3 = extractBetween(htmlraw, "</table></td><td>", '</td><td class="columnWorkCity');
col4 = extractBetween(htmlraw, 'columnWorkCity">', "</td>");
col5 = extractBetween(htmlraw, 'columnWorkCity">' + ...
wildcardPattern + "</td><td>", ...
'</td><td class="gvDescImg showHideImg"');
% description column (last col)
col6_hdr = extractBetween(htmlraw, '<td class="TenderDescription">', '</td>');
col6_hdr = col6_hdr(1:2);
col6 = extractBetween(htmlraw, '<p class="itemDesc TenderDesc">', '</p>');
col6 = reshape(col6, [], 2);
% reorder as a table
% append the header so column 6 can have descriptive info as well
hdr = string([hdr; hdr(end)]); % 7 columns
hdr(6:7) = hdr(6:7) + "_" + col6_hdr;
tab = array2table([col1, col2, col3, col4, col5, col6], ...
'VariableNames', hdr);
tab = convertvars(tab, 1:width(tab), @string);
tab.(1) = double(tab.(1));
ans = 8×7 table
מספר רישיון    שם יצרן    כתובת    ישוב    מחוז    פרטים_סוג מזון (מהות היצור):    פרטים_קבוצת מזון:
___________    _______    _____    ____    ____    ____________________________    _________________
55678    "א. הקר 2009 גלאט למהדרין בע"מ"    "מרכז ספיר 3 ירושלים"    "ירושלים"    "ירושלים"    "ייצור מוצרי בשר קפואים בלבד: בשר בקר טחון, בשר בעלי כנף טחון ומוצריהם, קישקע ממולא, בשר בקר מעובד, בשר בעלי כנף מעובד, ניסור ואריזת בשר בקר קפוא"    "הסעדה"
68795    "א. כ. התעשיינים בע"מ"    "שד הסנהדרין 3 יבנה"    "יבנה"    "מרכז"    "בשר ומוצריו, לרבות עופות וצייד"    "הסעדה (קיטרינג)"
52319    "א.א בורקס ליאון"    "איתן 24 ראשון לציון"    "ראשון לציון"    "מרכז"    "אחסנה בקירור"    "אחסון מזון בקירור"
69047    "א.א בליסימו בע"מ"    "איתן 3 ראשון לציון"    "ראשון לציון"    "מרכז"    "קרחונים אכילים, כולל שרבט וסורבט"    "מחסן קרור/מחסן בטמ' מבוקרת"
67457    "א.א מטעמים הכי טעים בע"מ"    "מודיעין 8 פתח תקווה"    "פתח תקווה"    "מרכז"    "ייצור בצקים ממולאים, ייצור עוגיות יבשות"    "לחם, לחמניות, עוגות שמרים ומאפים"
52312    "א.א. בליסימו בע"מ"    "לזרוב 3 ראשון לציון"    "ראשון לציון"    "מרכז"    "מוצרי מאפה, תערובות להכנתם ובצקים"    "לחמים ולחמניות מאודים"
50780    "א.א. דרך האוכל (חיפה) בע"מ"    "שנקר אריה 47 חיפה"    "חיפה"    "חיפה"    "אחסנת בצקים קפואים"    "יצור מוצרי בשר בקר וצאן טחון בלבד"
52587    "א.א. לרנר מוצרי מזון העמק בע"מ"    "הפועלים 2 באר שבע"    "באר שבע"    "דרום"    "מחסן קרור/מחסן בטמ' מבוקרת"    "בשר ומוצריו, לרבות עופות וצייד"
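
The script above passes the challenge values to webwrite. As an alternative sketch, and not what the answer itself does, weboptions also has a HeaderFields option that attaches custom HTTP headers to a request. Whether this particular site accepts the headers on a plain webread is an assumption that would need testing, and the header values below are placeholders, not real challenge data.

```matlab
% Placeholder values for illustration only (not real challenge data)
hdrs = {'X-AA-Challenge'        '123456'; ...
        'X-AA-Challenge-ID'     '42'; ...
        'X-AA-Challenge-Result' '99999'};

% HeaderFields takes an m-by-2 array of header names and values
opts = weboptions('Timeout', 30, 'HeaderFields', hdrs);
disp(opts.HeaderFields)

% The request itself would then look like this (url is the page being read):
% htmlraw = webread(url, opts);
```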
10 Comments
sani on 22 Feb 2022
I actually put your entire script in a for loop and changed the URL as i increases. Then, in each iteration, I appended the result of your script to another table using vertcat. If I understand correctly, size(unitab) = (36,7) is for pages 1-3? If so, this is the dimension I'm expecting to receive.
Ive J on 22 Feb 2022
Yes, that's for 3 pages.
Feel free to use the function above! Also be aware that when you send too many requests to a website, it may block your IP (temporarily).
To track possible parsing bugs, you can also save each table as a MAT-file. This way, if you expect, say, 120 rows and get only 100, you can inspect each table individually. You can do this by adding these lines:
for i = 1:n
    fprintf('reading page %d of %d\n', i, n)
    tab = readEachPage(i);
    save("" + i + ".mat", "tab") % e.g. contains table for page 10
    unitab{i} = tab;
end
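
The bookkeeping described here can be tried without touching the website. In this sketch, readPage is a hypothetical stand-in that fabricates a 12-row table per page (it is not the readEachPage function mentioned above), and the pause length is an arbitrary guess at a polite delay between requests.

```matlab
n = 3;  % number of pages to read

% Hypothetical stand-in for the real page reader: 12 fake rows per page
readPage = @(i) table((1:12).' + 12*(i-1), repmat("page" + i, 12, 1), ...
    'VariableNames', ["id", "source"]);

pages = cell(n, 1);            % preallocate one cell per page
for i = 1:n
    pages{i} = readPage(i);
    pause(0.1);                % small delay to avoid hammering the server
end

rowsPerPage = cellfun(@height, pages);  % row count per page, for spotting short pages
unitab = vertcat(pages{:});             % stack all pages into one table
disp(size(unitab))
```

Comparing rowsPerPage against the expected rows per page points directly at the page whose parsing went wrong.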
