readtable(html file) producing extra empty columns

Original question: In another thread, similar question was asked for readtable(csv file). The answer was to set {'delimiter', ','}. Because htmlImportOptions does not have 'delimiter' property, that answer does not work for my problem. I found that {'EmptyColumnRule','skip'} is a solution. Unfortunately, it can't work together with htmlImportOptions, which is used to set up DataRows.
Update: name-value pair does have 'DataRows' option.
opt.ExtraColumnsRule = 'ignore' % readtable only the first column.
% either
opt = htmlImportOptions;
opt.DataRows = 4;
% opt.EmptyColumnRule = 'skip' % error, html opt doesn't have this property.
% update
opt.ExtraColumnsRule = 'ignore';
readtable(htmlfile, opt) % read in only the first column. The other non-extra columns are ignored.
% or
% orignial post: readtable(htmlfile, 'EmptyColumnRule', 'skip') % {'DataRows', 4} is an error
% update. this works
readtable(htmlfile, 'EmptyColumnRule', 'skip', 'DataRows', 4)
% but not both
readtable(htmlfile, opt, 'EmptyColumnRule', 'skip') % error
I suppose I can read in the ExtraVar columns first and then delete the empty columns, just that I would rather readtable( ) handle it.
Thanks for any solutions!

댓글 수: 6

I prefer htmlImportOptions to name-value pair. I hope opt.EmptyColumnRule will be added in Matlab's future version.
Sulaymon Eshkabilov
Sulaymon Eshkabilov 2023년 9월 10일
이동: Voss 2023년 9월 10일
Can you share your sample data?
It's not permissible for me to share my data, but I suspect the header rows, which I select to be used as variables, cause the problem. The headers of the first two columns are empty, and the other headers are, say X, Y, Z. So the extra variables are created by Matlab as Z_1, Z_2, ... etc.
dpb
dpb 2023년 9월 11일
편집: dpb 2023년 9월 11일
You should be able to create sample file(s) that reproduces the issues by confounding the actual data...it's essentially impossible to solve a problem remotely that can't see/touch and for others to try to duplicate your issue is too much to expect of busy volunteers...
@dpb I will see if I can create a sample data.
It would seem highly unlikely that simply uploading a few files with one or two records would reveal anything terribly damaging. :) Of course, it some rare instance it might be possible for industrial sabatoge to occur with only a handful of numbers or it may be company policy regardless of whether there's any real danger or not, or it could be a case such as in my former employment is part of a classified document which, by those rules makes anything in the document classified whether the specific pieces of data are sensitive or not and so can't release anything (despire our current and former leaders who seem to ignore such rules) without a signoff from a derivative classifier who likely won't declassify it for you just on general principle.
IOW, I'm just suggesting to really consider the actual content and whether it's really of need to not just use the data as is...of course, it should be relatively simple to just readcell, substitute the numeric values with rand of the same size and write back out...

댓글을 달려면 로그인하십시오.

답변 (1개)

dpb
dpb 2023년 9월 10일
Use 'SelectedVariableNames' with the variable(s) desired
I can't tell what you want, specifically, there's a comment to read only the first??? If that is so, then
opt = htmlImportOptions;
opt.DataRows = 4;
opt.ExtraColumnsRule = 'ignore';
opt.SelectedVariables=opt.VariableNames(1); % read only the first column
tData=readtable(htmlfile, opt);

댓글 수: 5

Simon
Simon 2023년 9월 11일
편집: Simon 2023년 9월 11일
@dpb "Use 'SelectedVariableNames' with the variable(s) desired"
I think your answer is in the right direction. I suspect the header rows, which I select to be used as variables, cause the problem. The headers of the first two columns are empty, and the other headers are, say X, Y, Z. So the extra variables are created by Matlab as Z_1, Z_2, ... etc.
But the problem is that I have many tables, whose headers may more or less be the same, but not alway be the same. For example, one table may have headers (empty), (empty), X, Y, Z., another table (empty), (empty), X, Y, W, Z. Matlab treats the first empty header as Var1, and the rest as 'extra' columns.
I will give 'SelectedVariableNames' a shot. Thank you.
@dpb I have just now tried it. not working. Matlab insist treats the first column, whose header is empty, as the only variable. Even SelectedVariableNames = 1:2 will produce out-of-range error. It's critical for me to have the headers of the other columns as variables. But I don't know what they might be beforehand. I have to unaccept your answer. Sorry!
dpb
dpb 2023년 9월 11일
편집: dpb 2023년 9월 11일
You can't expect a single solution to be able to handle different file formats.
If you can't standardize the file format, then you'll have to have a catalog of what are possible formats you expect to be able to handle and then build a tool which can discern which is which from some characteristic within the file or have a way to know heuristically which is which a priori.
But, if the header line or data content is such that the import options detection heuristics don't return the expected number of variables/columns, then you'll have to use some other technique to identify what are columns/variables. But again, if we can't see examples to work with, we're also shooting blindly at an ill-defined problem.
The mind-(file)reading toolbox has yet to be released...
ADDENDUM:
If, as the original Q? content seems to suggest, if the "dead ahead" readtable function returns the expected variables plus some that may occasionally be extra/empty, the most expedient solution may well be to simply read the file then delete the extraneous columns instead of going through a lot of gyrations of parsing the changeable input file to handle various cases explicitly -- again, if you can't standardize the input form and can't know a priori from some other source the specifics of the particular file a priori.
In a particular case with some more-or-less free format Excel files, I simply read the variable names row first and parse it for the specific variables of interest. In that case, the variables will always exist; just sometimes there are some others that user may have thrown in ad hoc and sometimes there are some empty columns to parse out, but detectImportOptions will return a correct number of variables. If, in the case of an .html file that can't be assured, then it's going to be a lot more trouble, but I don't have to mess with such and without specific example(s) from which to work, as noted before, we're pretty-much hamstrung.
@dpb “You can't expect a single solution to be able to handle different file formats.” You are right. My Matlab codes are for data wrangling. In my experience, there is really no fixed set of solutions for that kind of task. While Matlab is a great tool for it, many problems can't be solved by built-in functions. Thanks for your excellent feedbacks!
As above I've never had to really mess with parsing HTML much, but it's not set up as a format for scanning by tools such as readtable so it's not at all surprising to me to find you're having difficulties.
While it won't be directly applicable to your case, I'll see if I can strip out the parsing stuff/modifications to the import object I described above into a short piece of example code just as idea generator.
If you can figure out a way to post some examples of what your files actually look like, it would still be the best way to see if somebody can build a better mouse trap.
@dpb Thanks for offering the help. I couldn't find a similar sample file to upload here. htmls have all sorts of defects. You were right in your earlier comments not to rely on one function to parse them correctly. I finally used a simple combination of all and ismissing to remove the extra empty columns after readtable(). I greatly appreciate your feedbacks.

댓글을 달려면 로그인하십시오.

카테고리

도움말 센터File Exchange에서 Data Type Identification에 대해 자세히 알아보기

제품

릴리스

R2023a

질문:

2023년 9월 10일

댓글:

2023년 9월 17일

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by