Extracting Data field of a Series in HTML file

조회 수: 3 (최근 30일)
b
b 2020년 4월 7일
댓글: b 2020년 5월 10일
In an HTML file, there is a section like this :
series: [{
name: 'Numbers',
color: '#33CCFF',
lineWidth: 5,
data: [45,78,84,91,111,125,178,231,274,283,303,333] }],
How to extract the 'data' field into an array in a matlab code ?
There are many such series' in that same HTML file with different 'name' fields. For example, name: 'Total Value', 'Log Scale', 'Base Value' etc.
  댓글 수: 4
Mohammad Sami
Mohammad Sami 2020년 4월 7일
are you parsing the html in Matlab as char array ? regexp is for string, cellstr or char data.
you can easily change the pattern to name: \'Numbers\'
b
b 2020년 4월 7일
I am trying to do the following:
url="c:\finCase\case1.html";
code=webread(url);
tree=htmlTree(code);
selector="series";
subtrees=findElement(tree,selector);
The subtrees field is empty whereas it should have all the series' corresponding to various names ('Numbers', 'Total Value', 'Log Scale', 'Base Value' etc).

댓글을 달려면 로그인하십시오.

채택된 답변

per isakson
per isakson 2020년 4월 7일
편집: per isakson 2020년 5월 9일
I misunderstood your question. This is a bit of overkill.
Assumptions
  • the string, series:, always indicates the start of a block of interest
I created a sample file, cssm.txt, which I uploaded. (Matlab Answers doesn't allow the extension .html ).
This script reads all blocks
%%
chr = fileread('cssm.txt');
cac = regexp( chr, '(?<=series\:)[^\}]+\}\],', 'match' );
%%
len = length( cac );
series(1,len) = struct( 'name','', 'color','', 'lineWidth',[], 'data',[] );
for jj = 1 : len
txt = regexp( cac{jj}, '(?<=name\:)[^,]+', 'match', 'once' );
txt(txt== '''') = [];
series(jj).name = matlab.lang.makeValidName( txt );
txt = regexp( cac{jj}, '(?<=color\:)[^,]+', 'match', 'once' );
txt(txt== '''') = [];
series(jj).color = txt;
txt = regexp( cac{jj}, '(?<=lineWidth\:)[^},]+', 'match', 'once' );
series(jj).lineWidth = str2double( txt );
txt = regexp( cac{jj}, '(?<=data\:)[^}]+', 'match', 'once' );
series(jj).data = str2num( txt ); %#ok<ST2NM>
end
and extract "series which matches name='Numbers'. Not the other series'."
>> series(strcmp({series.name},'Numbers')).data
ans =
45 78 84 91 111 125 178 231 274 283 303 333
In response to comment below
Assumptions
  • the string, series:, always indicates the start of a block of interest
  • the string, }], indicates the end of a block of interest
  • all html-files of interest are named index.html
  • all files named index.html are of interest
  • all html-files of interest are in subfolders under a root-folder, ...\finCase
  • every html-file, index.html, contains exactly one block that has a specific value of the field name:, e.g. Numbers
The overkill is still there. However, reading and parsing four html-files (copies of cssm.txt ) takes less than 10ms.
Try
>> client_data = read_client_data( 'd:\m\cssm\finCase', 'index.html', 'Numbers' )
client_data =
4×2 cell array
{'anderson' } {1×9 double}
{'kim-j-clijsters'} {1×10 double}
{'paul-judd' } {1×11 double}
{'simmi' } {1×12 double}
>>
where (in one m-file)
function client_data = read_client_data( root, file, name )
sad = dir( fullfile( root, '**', file ) );
len = length( sad );
client_data = cell( len, 2 );
for jj = 1 : len
cac = strsplit( sad(jj).folder, filesep );
client = cac{end};
series = read_one_file_( fullfile( sad(jj).folder, sad(jj).name ) );
client_data(jj,:) = { client, series(strcmp({series.name},name)).data };
end
end
function series = read_one_file_( file )
chr = fileread( fullfile( file ) );
cac = regexp( chr, '(?<=series\:)[^\}]+\}\],', 'match' );
len = length( cac );
series(1,len) = struct( 'name','', 'color','', 'lineWidth',[], 'data',[] );
for jj = 1 : len
txt = regexp( cac{jj}, '(?<=name\:)[^,]+', 'match', 'once' );
txt(txt== '''') = [];
series(jj).name = strtrim( txt );
txt = regexp( cac{jj}, '(?<=color\:)[^,]+', 'match', 'once' );
txt(txt== '''') = [];
series(jj).color = txt;
txt = regexp( cac{jj}, '(?<=lineWidth\:)[^},]+', 'match', 'once' );
series(jj).lineWidth = str2double( txt );
txt = regexp( cac{jj}, '(?<=data\:)[^}]+', 'match', 'once' );
series(jj).data = str2num( txt ); %#ok<ST2NM>
end
end
TODO: add error handling and comments
  댓글 수: 10
per isakson
per isakson 2020년 5월 9일
A nice thing with standards is that there are so many to chose between. Null (or NULL) is a special marker used in Structured Query Language to indicate that a data value does not exist in the database [Wikipedia]. However, Matlab doesn't honor Null.
Replace the statement
series(jj).data = str2num( txt ); %#ok<ST2NM>
by
out = textscan( txt , '%f' ...
, 'CollectOutput' , true ...
, 'Delimiter' , ',' ...
, 'EmptyValue' , 0 ...
, 'TreatAsEmpty' , 'null' ...
, 'Whitespace' , ' \t[]' );
series(jj).data = reshape( out{:}, 1,[] );
and read about textscan in the documentation.
b
b 2020년 5월 10일
LOL on the tragedy of being Null.
The code section works nicely with output as needed.
Indebted once again.

댓글을 달려면 로그인하십시오.

추가 답변 (0개)

카테고리

Help CenterFile Exchange에서 Data Type Conversion에 대해 자세히 알아보기

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by