retrieve data from a website with multiple pages

This answer does not completely solve your problem, but it should address your main questions. Before the HTML can be parsed, there is another issue: webread does not return the actual page content, because the website uses an anti-bot challenge (read more: https://stackoverflow.com/questions/53434555/python-requests-enable-cookies-javascript). That needs to be handled first.

url = "https://www.health.gov.il/Subjects/FoodAndNutrition/food/Pages/Manufacturer.aspx?WPID=WPQ8&PN=1";
opts = weboptions('Timeout', 5e3);
htmlraw = webread(url, opts);
% webread cannot read the contents as the website requests cookies =========
% credits: https://stackoverflow.com/a/53435185
htmlraw = string(htmlraw);
top = htmlraw.split('<script>');
top = top(2);
if contains(top, "Challenge=")
    Challenge = extractBetween(top, "Challenge=", ";");
    challenge_id = extractBetween(top, "ChallengeId=", ";");
    
    arr = char(Challenge);
    last_digit = str2double(arr(end));
    arr = sort(arr);
    min_digit = str2double(arr(1));
    subvar1 = (2*str2double(arr(3))) + str2double(arr(2));
    subvar2 = string(2 * str2double(arr(3))) + str2double(arr(2));
    power = ((str2double(arr(1)) * 1) + 2)^str2double(arr(2));
    x = double(Challenge) * 3 + subvar1;
    y = cos(pi * subvar1);
    answer = x * y;
    answer = answer - power;
    answer = answer + (min_digit - last_digit);
    answer = string(floor(answer)) + subvar2;
    
    hdrs = {'X-AA-Challenge' char(Challenge); ...
        'X-AA-Challenge-ID' char(challenge_id); ...
        'X-AA-Challenge-Result' char(answer)};
    
    % now read the website contents ===========================================
    htmlraw = webwrite(url, hdrs, opts); % content of URL in HTML code
end
% by manually looking at the HTML code
data = htmlTree(htmlraw); % creating an HTML tree from raw content
hdr = findElement(data ,"th").extractHTMLText; % table header
col1 = extractBetween(htmlraw, 'columnLisenceNumber">', "</td>");
col2 = extractBetween(htmlraw, 'HeValue">', "</td>");
col3 = extractBetween(htmlraw, "</table></td><td>", '</td><td class="columnWorkCity');
col4 = extractBetween(htmlraw, 'columnWorkCity">', "</td>");
col5 = extractBetween(htmlraw, 'columnWorkCity">' + ...
    wildcardPattern + "</td><td>", ...
    '</td><td class="gvDescImg showHideImg"');
% description column (last col)
col6_hdr = extractBetween(htmlraw, '<td class="TenderDescription">', '</td>');
col6_hdr = col6_hdr(1:2);
col6 = extractBetween(htmlraw, '<p class="itemDesc TenderDesc">', '</p>');
col6 = reshape(col6, [], 2);
% reorder as a table
% append the header so column 6 can have descriptive info as well
hdr = string([hdr; hdr(end)]); % 7 columns
hdr(6:7) = hdr(6:7) + "_" + col6_hdr;
tab = array2table([col1, col2, col3, col4, col5, col6], ...
    'VariableNames', hdr);
tab = convertvars(tab, 1:width(tab), @string);
tab.(1) = double(tab.(1));
head(tab)
ans = 8×7 table
    מספר רישיון                שם יצרן                         כתובת                ישוב           מחוז                                                                  פרטים_סוג מזון (מהות היצור):                                                                        פרטים_קבוצת מזון:         
    ___________    ________________________________    _____________________    _____________    _________    __________________________________________________________________________________________________________________________________________________    ___________________________________

       55678       "א. הקר 2009 גלאט למהדרין בע"מ"     "מרכז ספיר 3 ירושלים"    "ירושלים"        "ירושלים"    "ייצור מוצרי בשר קפואים בלבד: בשר בקר טחון, בשר בעלי כנף טחון ומוצריהם, קישקע ממולא, בשר בקר מעובד, בשר בעלי כנף מעובד, ניסור ואריזת בשר בקר קפוא"    "הסעדה"                            
       68795       "א. כ. התעשיינים בע"מ"              "שד הסנהדרין 3 יבנה"     "יבנה"           "מרכז"       "בשר ומוצריו, לרבות עופות וצייד"                                                                                                                      "הסעדה (קיטרינג)"                  
       52319       "א.א בורקס ליאון"                   "איתן 24 ראשון לציון"    "ראשון לציון"    "מרכז"       "אחסנה בקירור"                                                                                                                                        "אחסון מזון בקירור"                
       69047       "א.א בליסימו בע"מ"                  "איתן 3 ראשון לציון"     "ראשון לציון"    "מרכז"       "קרחונים אכילים, כולל שרבט וסורבט"                                                                                                                    "מחסן קרור/מחסן בטמ' מבוקרת"       
       67457       "א.א מטעמים הכי טעים בע"מ"          "מודיעין 8 פתח תקווה"    "פתח תקווה"      "מרכז"       "ייצור בצקים ממולאים, ייצור עוגיות יבשות"                                                                                                             "לחם, לחמניות, עוגות שמרים ומאפים" 
       52312       "א.א. בליסימו בע"מ"                 "לזרוב 3 ראשון לציון"    "ראשון לציון"    "מרכז"       "מוצרי מאפה, תערובות להכנתם ובצקים"                                                                                                                   "לחמים ולחמניות מאודים"            
       50780       "א.א. דרך האוכל (חיפה) בע"מ"        "שנקר אריה 47 חיפה"      "חיפה"           "חיפה"       "אחסנת בצקים קפואים"                                                                                                                                  "יצור מוצרי בשר בקר וצאן טחון בלבד"
       52587       "א.א. לרנר מוצרי מזון העמק בע"מ"    "הפועלים 2 באר שבע"      "באר שבע"        "דרום"       "מחסן קרור/מחסן בטמ' מבוקרת"                                                                                                                          "בשר ומוצריו, לרבות עופות וצייד"   
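
The snippet above hands the three challenge values to webwrite as request data. If the server expects them as HTTP request headers instead (an assumption based on the linked Stack Overflow answer, not something verified against this site), a minimal sketch is to attach them through the HeaderFields option of weboptions and call webread again. The helper name buildChallengeOptions is hypothetical.

```matlab
function opts = buildChallengeOptions(challenge, challengeId, result)
% Hypothetical helper: pack the challenge values into HTTP request headers.
% Assumes the server reads them from the request headers, not the body.
hdrs = {'X-AA-Challenge',        char(challenge); ...
        'X-AA-Challenge-ID',     char(challengeId); ...
        'X-AA-Challenge-Result', char(result)};
opts = weboptions('Timeout', 5e3, 'HeaderFields', hdrs);
end
```

It would be used as opts = buildChallengeOptions(Challenge, challenge_id, answer); followed by htmlraw = webread(url, opts);. Whether this is needed depends on how the site validates the challenge.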


Ive J on 21 Feb 2022

Edited: Ive J on 21 Feb 2022

I'm not sure I understood correctly; do you mean you tried something like this?

function unitab = parseFoodAndNutrition(n)
if nargin < 1
    n = 3; % read only 3 pages
end
unitab = cell(n, 1);
for i = 1:n
    fprintf('reading page %d of %d\n', i, n)
    unitab{i} = readEachPage(i);
end
unitab = vertcat(unitab{:});
unitab = convertvars(unitab, 1:width(unitab), @string);
unitab.(1) = double(unitab.(1));
end % END
%% subfunctions ===========================================================
function tab = readEachPage(n)
url = "https://www.health.gov.il/Subjects/FoodAndNutrition/food/Pages/Manufacturer.aspx?WPID=WPQ8&PN=" + n;
opts = weboptions('Timeout', 5e3);
htmlraw = webread(url, opts);
% webread cannot read the contents as the website requests cookies =========
% credits: https://stackoverflow.com/a/53435185
htmlraw = string(htmlraw);
top = htmlraw.split('<script>');
top = top(2);
if contains(top, "Challenge=")
    Challenge = extractBetween(top, "Challenge=", ";");
    challenge_id = extractBetween(top, "ChallengeId=", ";");
    
    arr = char(Challenge);
    last_digit = str2double(arr(end));
    arr = sort(arr);
    min_digit = str2double(arr(1));
    subvar1 = (2*str2double(arr(3))) + str2double(arr(2));
    subvar2 = string(2 * str2double(arr(3))) + str2double(arr(2));
    power = ((str2double(arr(1)) * 1) + 2)^str2double(arr(2));
    x = double(Challenge) * 3 + subvar1;
    y = cos(pi * subvar1);
    answer = x * y;
    answer = answer - power;
    answer = answer + (min_digit - last_digit);
    answer = string(floor(answer)) + subvar2;
    
    hdrs = {'X-AA-Challenge' char(Challenge); ...
        'X-AA-Challenge-ID' char(challenge_id); ...
        'X-AA-Challenge-Result' char(answer)};
    
    % now read the website contents ===========================================
    htmlraw = webwrite(url, hdrs, opts); % content of URL in HTML code
end
% by manually looking at the HTML code
data = htmlTree(htmlraw); % creating an HTML tree from raw content
hdr = findElement(data ,"th").extractHTMLText; % table header
col1 = extractBetween(htmlraw, 'columnLisenceNumber">', "</td>");
col2 = extractBetween(htmlraw, 'HeValue">', "</td>");
col3 = extractBetween(htmlraw, "</table></td><td>", '</td><td class="columnWorkCity');
col4 = extractBetween(htmlraw, 'columnWorkCity">', "</td>");
col5 = extractBetween(htmlraw, 'columnWorkCity">' + ...
    wildcardPattern + "</td><td>", ...
    '</td><td class="gvDescImg showHideImg"');
% description column (last col)
col6_hdr = extractBetween(htmlraw, '<td class="TenderDescription">', '</td>');
col6_hdr = col6_hdr(1:2);
col6 = extractBetween(htmlraw, '<p class="itemDesc TenderDesc">', '</p>');
col6 = reshape(col6, [], 2);
% reorder as a table
% append the header so column 6 can have descriptive info as well
hdr = string([hdr; hdr(end)]); % 7 columns
hdr(6:7) = hdr(6:7) + "_" + col6_hdr;
tab = array2table([col1, col2, col3, col4, col5, col6], ...
    'VariableNames', hdr);
% can be done at once in the end
% tab = convertvars(tab, 1:width(tab), @string);
% tab.(1) = double(tab.(1));
end

When I run the above function I get this:

size(unitab)
ans =
    36     7

sani on 22 Feb 2022

I had actually put your entire script in a for loop and changed the URL as i increased. Then, in each iteration, I appended the result of your script to another table using vertcat. If I understand correctly, size(unitab) = (36,7) is for pages 1-3? If so, that is the dimension I expect to get.

Ive J on 22 Feb 2022

Yes, that's for 3 pages.

Feel free to use the function above! Also be aware that if you send many requests to a website in a short time, it may (temporarily) block your IP.
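
One way to reduce that risk is to pause between requests and retry after a failure. The following is only a sketch: fetchWithRetry is a hypothetical helper, and the number of attempts and the waiting time are arbitrary values, not limits published by the site.

```matlab
function out = fetchWithRetry(fetchFcn, maxTries, waitSeconds)
% Hypothetical helper: call fetchFcn(), pausing and retrying on error.
% fetchFcn    - function handle with no inputs, e.g. @() readEachPage(i)
% maxTries    - maximum number of attempts (must be >= 1)
% waitSeconds - pause between a failed attempt and the next one
assert(maxTries >= 1, 'maxTries must be at least 1');
for attempt = 1:maxTries
    try
        out = fetchFcn();
        return
    catch err
        if attempt == maxTries
            rethrow(err) % give up after the last attempt
        end
        fprintf('attempt %d failed (%s); waiting %g s\n', ...
            attempt, err.message, waitSeconds);
        pause(waitSeconds)
    end
end
end
```

Inside the page loop this could be called as unitab{i} = fetchWithRetry(@() readEachPage(i), 3, 10); followed by a short pause such as pause(2) before the next page.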

To track down possible parsing bugs, you can also save each page's table as a MAT-file. That way, if you expect, say, 120 rows and get only 100, you can inspect each table individually. You can do this by changing the loop as follows:

for i = 1:n
    fprintf('reading page %d of %d\n', i, n)
    tab = readEachPage(i);
    save("tab.page." + i + ".mat", "tab") % e.g. tab.page.10.mat contains table for page 10
    unitab{i} = tab;
end
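
The saved files can then be checked page by page. A minimal sketch, assuming the file name pattern used in the loop above; countSavedRows is a hypothetical helper name.

```matlab
function counts = countSavedRows(n)
% Hypothetical helper: number of rows in each saved per-page table.
% Assumes files tab.page.1.mat ... tab.page.n.mat exist in the current
% folder and each contains a variable named tab.
counts = zeros(n, 1);
for i = 1:n
    s = load("tab.page." + i + ".mat", "tab");
    counts(i) = height(s.tab);
end
end
```

For example, counts = countSavedRows(3); find(counts ~= 12) would list the pages that do not have 12 rows. The value 12 is only an assumption derived from the 36 rows returned for 3 pages; the last page of the listing may legitimately be shorter.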
