How do I extract the contents of an HTML table on a web page into a MATLAB table?

I'd like to plot and analyze the TSA traveler data from this website: https://www.tsa.gov/coronavirus/passenger-throughput
The data is embedded on the page as an HTML table element.
How do I extract the table content into a MATLAB table?

Accepted Answer

Pat Canny, June 23, 2020
You can extract the <table> content, which is all stored in a set of <td> tags, as a string array and go from there.
You first need to use findElement and extractHTMLText on an htmlTree object (these functions are part of Text Analytics Toolbox).
You then can use reshape to arrange the data, then use array2table to convert to a table.
Here is one approach:
travel_data = webread('https://www.tsa.gov/coronavirus/passenger-throughput');
travel_data_tree = htmlTree(travel_data);
selector = "td";
subtrees = findElement(travel_data_tree,selector);
str = extractHTMLText(subtrees);
table_data = str(4:end); % first three elements are just the column names
reshape_ncols = 3;
reshape_nrows = length(table_data)/reshape_ncols;
table_data_reshaped = reshape(table_data,reshape_ncols,reshape_nrows)';
% Convert to table
traveler_data_table = array2table(table_data_reshaped,'VariableNames',["Date" "Travelers_Today" "Travelers_Last_Year"]); % I got lazy with VariableNames, I know.
% Convert data types from strings to appropriate types
traveler_data_table.Date = datetime(traveler_data_table.Date);
traveler_data_table.Travelers_Today = str2double(traveler_data_table.Travelers_Today);
traveler_data_table.Travelers_Last_Year = str2double(traveler_data_table.Travelers_Last_Year);
traveler_data_table.Traveler_Ratio = traveler_data_table.Travelers_Today ./ traveler_data_table.Travelers_Last_Year;
% Plot the results
figure
plot(traveler_data_table.Date,traveler_data_table.Traveler_Ratio)
title("TSA Traveler Ratio by Date (2020 vs. 2019)")
grid on
% Some more fun analysis
% When did it bottom out?
[min_ratio,idx] = min(traveler_data_table.Traveler_Ratio);
min_ratio_pct = 100*min_ratio;
min_date = traveler_data_table.Date(idx);
disp("The minimum traveler ratio of " + min_ratio_pct + "% occurred on " + string(min_date))
latest_pct = 100*traveler_data_table.Traveler_Ratio(1);
disp("The current ratio is " + latest_pct + "%")
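
The reshape step above relies on MATLAB filling arrays column by column, which is why the result has to be transposed. Here is a minimal offline sketch of just that step. The cell values are made-up placeholders rather than real TSA figures, and the names cell_text, cell_grid, and T are only illustrative.

```matlab
% Flat list of cell texts in reading (row-major) order, similar in shape to
% what extractHTMLText returns for the <td> elements.
% NOTE: placeholder values for illustration only, not real TSA data.
cell_text = ["2020-06-02" "100" "400" ...
             "2020-06-01" "150" "500"];

ncols = 3;  % number of columns in the HTML table

% reshape fills column by column, so build an ncols-by-nrows array first
% and then transpose it to get one table row per array row.
cell_grid = reshape(cell_text, ncols, []).';

T = array2table(cell_grid, 'VariableNames', ...
    ["Date" "Travelers_Today" "Travelers_Last_Year"]);

% Convert the string columns to appropriate types.
T.Date = datetime(T.Date, 'InputFormat', 'yyyy-MM-dd');
T.Travelers_Today = str2double(T.Travelers_Today);
T.Travelers_Last_Year = str2double(T.Travelers_Last_Year);
T.Traveler_Ratio = T.Travelers_Today ./ T.Travelers_Last_Year;

disp(T)
```

Reshaping straight to nrows-by-ncols without the transpose would mix values from different table rows into the same row.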

More Answers (1)

Christopher Creutzig, June 7, 2022
Starting in R2021b, you can directly use readtable for HTML tables:
readtable("https://www.tsa.gov/coronavirus/passenger-throughput",...
FileType="html",ReadVariableNames=true,ThousandsSeparator=",")
ans = 364×5 table
       Date         2022          2021          2020          2019   
    __________    __________    __________    __________    __________
    06/05/2022    2.3872e+06    1.9847e+06    4.4126e+05    2.6699e+06
    06/04/2022    1.9814e+06    1.6812e+06    3.5302e+05     2.226e+06
    06/03/2022    2.3326e+06    1.8799e+06    4.1968e+05    2.6498e+06
    06/02/2022    2.2132e+06    1.8159e+06    3.9188e+05    2.6239e+06
    06/01/2022    1.9991e+06    1.5879e+06    3.0444e+05    2.3702e+06
    05/31/2022    2.1081e+06    1.6828e+06    2.6774e+05    2.2474e+06
    05/30/2022    2.3122e+06    1.9002e+06    3.5326e+05     2.499e+06
    05/29/2022    2.0965e+06    1.6505e+06    3.5295e+05    2.5556e+06
    05/28/2022    1.9942e+06    1.6058e+06    2.6887e+05    2.1172e+06
    05/27/2022    2.3847e+06    1.9596e+06    3.2713e+05    2.5706e+06
    05/26/2022    2.3799e+06    1.8545e+06    3.2178e+05    2.4858e+06
    05/25/2022    2.1477e+06    1.6182e+06    2.6117e+05     2.269e+06
    05/24/2022    2.0207e+06    1.4708e+06    2.6484e+05    2.4536e+06
    05/23/2022     2.329e+06    1.7474e+06    3.4077e+05    2.5122e+06
    05/22/2022    2.3509e+06    1.8637e+06    2.6745e+05    2.0707e+06
    05/21/2022    1.9888e+06      1.55e+06    2.5319e+05    2.1248e+06
Comments: 3
Simon, January 27, 2025 (edited January 27, 2025)
Thanks for this excellent answer. Is there a way to have readtable() fetch only the latest number in the top row, rather than reading the whole table and then extracting it from the MATLAB table?
I want to read in the latest data from a table, which is updated daily.
Christopher Creutzig, January 27, 2025
You can use DataRows=[1,1] to read only the first row.
Note: The row count used by DataRows does not necessarily start at the row where an import without the option starts. Depending on the data, you may have better luck with DataRows=[2,2] or similar.
Why not just DataRows=2? Because, for backward compatibility with other related options in readtable, DataRows=2 means "data rows start at row 2."
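
To experiment with DataRows without hitting the live page, one option is to write a small HTML file and read it back. This is only a sketch: the table contents are made-up placeholder values, the file name comes from tempname, and it assumes R2021b or later. Since the row counting depends on the data, compare the single-row result against the full import before relying on it.

```matlab
% Build a small HTML table (placeholder values, not real TSA data).
html = join([ ...
    "<html><body><table>"
    "<tr><th>Date</th><th>Travelers</th></tr>"
    "<tr><td>06/05/2022</td><td>300</td></tr>"
    "<tr><td>06/04/2022</td><td>200</td></tr>"
    "<tr><td>06/03/2022</td><td>100</td></tr>"
    "</table></body></html>"], newline);

% Write it to a temporary file so readtable can import it.
htmlFile = tempname + ".html";
fid = fopen(htmlFile, "w");
fprintf(fid, "%s", html);
fclose(fid);

% Full import, for comparison.
full = readtable(htmlFile, FileType="html");

% Single-row import using the two-element form [first,last].
% If this does not return the row you expect, try DataRows=[2,2].
latest = readtable(htmlFile, FileType="html", DataRows=[1,1]);

disp(full)
disp(latest)

delete(htmlFile);  % clean up the temporary file
```

The same name-value argument can be passed when the first input is a URL instead of a local file.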


Release

R2020a
