Why would csvread read all data into a single column?
I am trying to read a CSV file into MATLAB. It has 1 million columns and 2 rows. When I use csvread, it reads the file in as a 1-column by 2-million-row matrix. Why would it do this?
1 Comment
dpb
12 May 2017
Edited: dpb, 12 May 2017
Dunno...would seem there's either
- something in the file that's confusing textscan's row count, or
- an actual bug/limitation inside textscan
csvread is just a wrapper to dlmread which in turn simply parses the inputs and calls textscan. For a .csv file, the call boils down to
delimiter = sprintf(delimiter);
whitespace = setdiff(sprintf(' \b\t'),delimiter);
result = textscan(fid,'',nrows, ...
'delimiter',delimiter,'whitespace',whitespace, ...
'headerlines',r,'headercolumns',c,...
'returnonerror',0,'emptyvalue',0,'CollectOutput', true);
where, of course, delimiter is ','.
The "magic" occurs inside textscan. As you'll notice, there is no explicit format string, only an empty-string placeholder; this is the cue used internally that instructs textscan to return the array in the shape the records have in the file, without the user having to count fields and build a format string.
Since we can't see inside textscan, this is as far as we can go.
You could try building a test file and parsing it, to see if you can replicate the problem at a specific record length; or you may determine along the way that such a file works correctly and the fault is in this particular data file.
Answers (2)
Matthew Eicholtz
12 May 2017
I think dpb's comment addresses potential csvread issues well, so I'll just add an alternative option that may work for you: readtable.
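A minimal sketch of that route, assuming a headerless numeric test.csv (whether a table copes gracefully with a million variables is its own question, as the comments below note):

```matlab
% Sketch: read the wide CSV with readtable, then pull out a plain numeric array.
T = readtable('test.csv', 'ReadVariableNames', false); % the file has no header row
M = T{:,:};                                            % brace indexing -> double matrix
```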
3 Comments
Matthew Eicholtz
12 May 2017
Ah yes, after re-reading the question, I agree. I was thinking 1 million instances of 2 variables, not the other way around. Good catch.
dpb
12 May 2017
Actually, for data files of this size it would seem far better to use .mat files or streaming or somesuch; there's certainly no looking at them usefully by hand, it would seem.
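A minimal sketch of the .mat route (assuming the array is already in memory from wherever it was generated; the filename is illustrative):

```matlab
d = randi(127, 2, 1e6);            % stand-in for the real 2 x 1e6 data
save('bigdata.mat', 'd', '-v7.3')  % -v7.3 (HDF5-based) is the format intended for large variables
clear d
S = load('bigdata.mat');           % S.d comes back with its 2 x 1e6 shape intact
```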
dpb
12 May 2017
Edited: dpb, 12 May 2017
Expanding upon the above comments, I did a test that looked like--
N=1E6; % the long row length
csvwrite('test.csv',randi(127,2,N)) % write a 2-row file of same N (@)
d=csvread('test.csv');
while isvector(d)
N=N/2;
csvwrite('test.csv',randi(127,2,N)) % write a 2-row file of same N
d=csvread('test.csv');
end
disp(N)
The result was N=62500, which seems to prove conclusively there's an internal limit in textscan; probably some sort of buffer limit, one would guess, when the format string isn't provided.
I didn't try to refine the result to find where between 62,500 and 125,000 it breaks, but that definitely seems to be the cause of the issue.
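For anyone who does want to refine it, a simple bisection between the known-good and known-bad lengths would pin down the break point (the addendum below finds it directly):

```matlab
lo = 62500;              % csvread handled this record length correctly
hi = 125000;             % csvread collapsed this length to a vector
while hi - lo > 1
    N = floor((lo + hi)/2);
    csvwrite('test.csv', randi(127, 2, N));
    if isvector(csvread('test.csv'))
        hi = N;          % still broken: the limit is at or below N
    else
        lo = N;          % read back correctly: the limit is above N
    end
end
fprintf('First failing record length: %d\n', hi)
```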
I tried the venerable textread; it never completed the 1E6 case before I gave up, so that's not a workaround.
While it would be butt-ugly as a solution, I tried an explicit format string (fmt = repmat('%f',1,N), say) and got
>> d=textscan(fid,fmt,'delimiter',',','collectoutput',1);
Out of memory. Type HELP MEMORY for your options.
>>
Looks like a support request to TMW is in order, to see if they can find a workaround or put it on the enhancement list to resolve. It certainly seems as though MATLAB should be able to read any file in whatever form it is in on disk, as long as it can actually fit in memory, without gyrations by the user.
(@) Just to be sure, I did scan the long-record file by reading it as a stream of characters and confirmed csvwrite wrote the linefeeds where it should have, so that, in fact, the file was actually two records on disk.
ADDENDUM
I hate it when I get fixated on something... :(
But, I did a couple of additional tests and confirmed there's a hard limit apparently buried inside the textscan code at 100000--
>> N=100000;
>> csvwrite('test.csv',randi(127,2,N))
>> isvector(csvread('test.csv'))
ans =
1
>> N=N-1
N =
99999
>> csvwrite('test.csv',randi(127,2,N))
>> isvector(csvread('test.csv'))
ans =
0
>>
It fails beginning at 100,000 elements per record; 99,999 is ok. You're just not supposed to have a file with records any longer than that, it seems.
1 Comment
dpb
12 May 2017
Edited: dpb, 13 May 2017
Well, one way to make it work, albeit slowly:
>> fid=fopen('test.csv','r');
>> dd=str2num(fread(fid,'*char').');
>> whos dd
Name Size Bytes Class Attributes
dd 2x1000000 16000000 double
>> fid=fclose(fid);
If you know the size a priori, it would be better to just read and reshape. If the size isn't known, the two-step solution below is probably still significantly faster, as str2num uses eval internally. But it is interesting that the interpreter can deal with that long an internal input record while textscan can't handle that long an external one.
fid=fopen('test.csv','r');
n=length(find(fread(fid,'*char')==10)); % count linefeeds to get the row count
fid=fclose(fid);
d=reshape(csvread('test.csv'),[],n).'; % elements arrive in file (row) order; reshape fills columns, so fill by rows and transpose