readtable(html file) producing extra empty columns

Simon, 10 Sep 2023
Commented: Simon, 17 Sep 2023
Original question: In another thread, a similar question was asked about readtable with a CSV file. The answer there was to set {'Delimiter', ','}. Because htmlImportOptions does not have a 'Delimiter' property, that answer does not work for my problem. I found that {'EmptyColumnRule','skip'} is a solution. Unfortunately, it can't be combined with an htmlImportOptions object, which I use to set DataRows.
Update: the name-value arguments do include a 'DataRows' option.
Update: with opt.ExtraColumnsRule = 'ignore', readtable reads only the first column.
% either
opt = htmlImportOptions;
opt.DataRows = 4;
% opt.EmptyColumnRule = 'skip' % error, html opt doesn't have this property.
% update
opt.ExtraColumnsRule = 'ignore';
readtable(htmlfile, opt) % read in only the first column. The other non-extra columns are ignored.
% or
% original post: readtable(htmlfile, 'EmptyColumnRule', 'skip') % {'DataRows', 4} is an error
% update. this works
readtable(htmlfile, 'EmptyColumnRule', 'skip', 'DataRows', 4)
% but not both
readtable(htmlfile, opt, 'EmptyColumnRule', 'skip') % error
I suppose I can read in the ExtraVar columns first and then delete the empty columns, but I would rather have readtable() handle it.
Thanks for any solutions!
  Comments: 6
Simon, 11 Sep 2023
@dpb I will see if I can create a sample data.
dpb, 11 Sep 2023
It would seem highly unlikely that simply uploading a few files with one or two records would reveal anything terribly damaging. :) Of course, in some rare instance it might be possible for industrial sabotage to occur with only a handful of numbers, or it may be company policy regardless of whether there's any real danger. Or it could be a case such as in my former employment, where the data is part of a classified document, which by those rules makes anything in the document classified whether or not the specific pieces of data are sensitive. In that case nothing can be released (despite our current and former leaders who seem to ignore such rules) without a signoff from a derivative classifier, who likely won't declassify it for you just on general principle.
IOW, I'm just suggesting that you really consider the actual content and whether there's truly a need not to just use the data as is... Of course, it should be relatively simple to just readcell, substitute the numeric values with rand of the same size, and write it back out...


Answers (1)

dpb, 10 Sep 2023
Use 'SelectedVariableNames' with the variable(s) desired.
I can't tell specifically what you want; there's a comment about reading only the first column. If that is so, then
opt = htmlImportOptions;
opt.DataRows = 4;
opt.ExtraColumnsRule = 'ignore';
opt.SelectedVariableNames = opt.VariableNames(1); % read only the first column
tData=readtable(htmlfile, opt);
  Comments: 5
dpb, 12 Sep 2023
As above, I've never really had to mess with parsing HTML much, but it isn't set up as a format for scanning by tools such as readtable, so it's not at all surprising to me that you're having difficulties.
While it won't be directly applicable to your case, I'll see if I can strip out the parsing stuff/modifications to the import object I described above into a short piece of example code, just as an idea generator.
If you can figure out a way to post some examples of what your files actually look like, that would still be the best way to see if somebody can build a better mousetrap.
Simon, 17 Sep 2023
@dpb Thanks for offering to help. I couldn't find a similar sample file to upload here. HTML files have all sorts of defects. You were right in your earlier comments not to rely on one function to parse them correctly. I finally used a simple combination of all and ismissing to remove the extra empty columns after readtable(). I greatly appreciate your feedback.
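The exact code was not posted, so the following is only a minimal sketch of how all and ismissing might be combined for this. The table T is a made-up stand-in for the result of readtable(htmlfile, ...), and the variable names are assumptions for illustration.

```matlab
% Hypothetical stand-in for the table returned by readtable(htmlfile, ...).
% Var2 and Var4 play the role of the extra, entirely empty columns;
% Var1 has one missing entry but is not entirely empty.
T = table([1; NaN; 3], [NaN; NaN; NaN], {'a'; 'b'; 'c'}, {''; ''; ''}, ...
    'VariableNames', {'Var1', 'Var2', 'Var3', 'Var4'});

% ismissing(T) returns a logical array the same size as T.
% all(..., 1) collapses it column-wise, flagging columns where every entry is missing.
emptyCols = all(ismissing(T), 1);

% Delete the flagged variables; partially filled columns are kept.
T(:, emptyCols) = [];

disp(T.Properties.VariableNames)
```

Note that ismissing treats NaN and '' (in cell arrays of character vectors) as missing, but not the empty string "" in string columns. If the text columns are imported as string, they may need to be converted first, for example with standardizeMissing.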

Release: R2023a
