readtable(html file) producing extra empty columns
이전 댓글 표시
Original question: In another thread, similar question was asked for readtable(csv file). The answer was to set {'delimiter', ','}. Because htmlImportOptions does not have 'delimiter' property, that answer does not work for my problem. I found that {'EmptyColumnRule','skip'} is a solution. Unfortunately, it can't work together with htmlImportOptions, which is used to set up DataRows.
Update: name-value pair does have 'DataRows' option.
opt.ExtraColumnsRule = 'ignore' % readtable only the first column.
% either
opt = htmlImportOptions;
opt.DataRows = 4;
% opt.EmptyColumnRule = 'skip' % error, html opt doesn't have this property.
% update
opt.ExtraColumnsRule = 'ignore';
readtable(htmlfile, opt) % read in only the first column. The other non-extra columns are ignored.
% or
% orignial post: readtable(htmlfile, 'EmptyColumnRule', 'skip') % {'DataRows', 4} is an error
% update. this works
readtable(htmlfile, 'EmptyColumnRule', 'skip', 'DataRows', 4)
% but not both
readtable(htmlfile, opt, 'EmptyColumnRule', 'skip') % error
I suppose I can read in the ExtraVar columns first and then delete the empty columns, just that I would rather readtable( ) handle it.
Thanks for any solutions!
댓글 수: 6
Simon
2023년 9월 10일
Sulaymon Eshkabilov
2023년 9월 10일
이동: Voss
2023년 9월 10일
Can you share your sample data?
Simon
2023년 9월 11일
You should be able to create sample file(s) that reproduces the issues by confounding the actual data...it's essentially impossible to solve a problem remotely that can't see/touch and for others to try to duplicate your issue is too much to expect of busy volunteers...
Simon
2023년 9월 11일
dpb
2023년 9월 11일
It would seem highly unlikely that simply uploading a few files with one or two records would reveal anything terribly damaging. :) Of course, it some rare instance it might be possible for industrial sabatoge to occur with only a handful of numbers or it may be company policy regardless of whether there's any real danger or not, or it could be a case such as in my former employment is part of a classified document which, by those rules makes anything in the document classified whether the specific pieces of data are sensitive or not and so can't release anything (despire our current and former leaders who seem to ignore such rules) without a signoff from a derivative classifier who likely won't declassify it for you just on general principle.
IOW, I'm just suggesting to really consider the actual content and whether it's really of need to not just use the data as is...of course, it should be relatively simple to just readcell, substitute the numeric values with rand of the same size and write back out...
답변 (1개)
dpb
2023년 9월 10일
Use 'SelectedVariableNames' with the variable(s) desired
I can't tell what you want, specifically, there's a comment to read only the first??? If that is so, then
opt = htmlImportOptions;
opt.DataRows = 4;
opt.ExtraColumnsRule = 'ignore';
opt.SelectedVariables=opt.VariableNames(1); % read only the first column
tData=readtable(htmlfile, opt);
댓글 수: 5
You can't expect a single solution to be able to handle different file formats.
If you can't standardize the file format, then you'll have to have a catalog of what are possible formats you expect to be able to handle and then build a tool which can discern which is which from some characteristic within the file or have a way to know heuristically which is which a priori.
But, if the header line or data content is such that the import options detection heuristics don't return the expected number of variables/columns, then you'll have to use some other technique to identify what are columns/variables. But again, if we can't see examples to work with, we're also shooting blindly at an ill-defined problem.
The mind-(file)reading toolbox has yet to be released...
ADDENDUM:
If, as the original Q? content seems to suggest, if the "dead ahead" readtable function returns the expected variables plus some that may occasionally be extra/empty, the most expedient solution may well be to simply read the file then delete the extraneous columns instead of going through a lot of gyrations of parsing the changeable input file to handle various cases explicitly -- again, if you can't standardize the input form and can't know a priori from some other source the specifics of the particular file a priori.
In a particular case with some more-or-less free format Excel files, I simply read the variable names row first and parse it for the specific variables of interest. In that case, the variables will always exist; just sometimes there are some others that user may have thrown in ad hoc and sometimes there are some empty columns to parse out, but detectImportOptions will return a correct number of variables. If, in the case of an .html file that can't be assured, then it's going to be a lot more trouble, but I don't have to mess with such and without specific example(s) from which to work, as noted before, we're pretty-much hamstrung.
Simon
2023년 9월 12일
dpb
2023년 9월 12일
As above I've never had to really mess with parsing HTML much, but it's not set up as a format for scanning by tools such as readtable so it's not at all surprising to me to find you're having difficulties.
While it won't be directly applicable to your case, I'll see if I can strip out the parsing stuff/modifications to the import object I described above into a short piece of example code just as idea generator.
If you can figure out a way to post some examples of what your files actually look like, it would still be the best way to see if somebody can build a better mouse trap.
Simon
2023년 9월 17일
카테고리
도움말 센터 및 File Exchange에서 Data Type Identification에 대해 자세히 알아보기
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!