Why does csvread behave differently for large csv files?

Question

Peter on 5 Jun 2015

Direct link to this question:

https://kr.mathworks.com/matlabcentral/answers/222551-why-does-csvread-behave-differently-for-large-csv-files

Edited: Jeremy Hughes on 8 Jun 2015

I have two csv files that I'm trying to read in. The first contains one row of integers, the second contains one row of floats.

They are both formatted in the same way (with a trailing comma):

int_val_1,int_val_2,...,int_val_n,
float_val_1,float_val_2,...,float_val_m,

As I understand it, csvread should produce a row matrix with an extra 0 at the end (due to the trailing comma). In my case, however, csvread produces a column matrix without an extra 0 for the first file, and a row matrix with an extra 0 for the second file. This only happens if the first file is large (e.g., 589824 integers). If there are a small number of integers, it behaves as expected.

What's going on?

Comments: 1

Jeremy Hughes on 8 Jun 2015

Edited: Jeremy Hughes on 8 Jun 2015

Hi Peter,

You have run into an unfortunate limitation in the way csvread detects the number of columns in the file. Since your file is one long row, csvread assumes it's all one never-ending string of data. (At 100,000 columns, as Per discovered below, it stops counting and just returns a column.)

If you want to get consistent results on the output shape, you can call textscan in the following way.

fid = fopen(filename);
[data] = textscan(fid,'%f','Delimiter',',','EndOfLine','\r\n');
fclose(fid);

The variable "data" will be a cell array containing a column of numbers. If you need a row, just pull it out of the cell array and transpose:

data = (data{1})';
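
The two snippets above can be combined into one small function. This is only a sketch, assuming the file holds a single line of comma-separated numbers; the name readRowVector is hypothetical (it is not a MATLAB function), and the fopen check and onCleanup guard are additions so the file is closed even if textscan errors.

```matlab
function row = readRowVector(filename)
% readRowVector  Read a one-line, comma-separated file as a 1-by-N row.
% Hypothetical helper for illustration; not part of MATLAB.
    fid = fopen(filename, 'r');
    if fid == -1
        error('readRowVector:openFailed', 'Could not open %s.', filename);
    end
    closer = onCleanup(@() fclose(fid));  % close the file on any exit path

    % Same textscan call as above: one numeric column, comma-delimited.
    cac = textscan(fid, '%f', 'Delimiter', ',', 'EndOfLine', '\r\n');

    row = cac{1}.';                       % column vector -> row vector
end
```

Called as row = readRowVector('ints.csv'), the output shape no longer depends on how many values the file contains.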

I hope this helps,

Jeremy


Answer 1

per isakson on 5 Jun 2015

Direct link to this answer:

https://kr.mathworks.com/matlabcentral/answers/222551-why-does-csvread-behave-differently-for-large-csv-files#answer_181643

Edited: per isakson on 5 Jun 2015

I reproduced your result on R2013a, Win7

>> [CR,FS] = cssm(1e5); whos('CR','FS')
  Name           Size             Bytes  Class     Attributes
  CR        100000x1             800000  double              
  FS        100000x1             800000  double              
>> [CR,FS] = cssm(1e3); whos('CR','FS')
  Name         Size              Bytes  Class     Attributes
  CR           1x1001             8008  double              
  FS        1000x1                8000  double

where

function    [ CR, FS ] = cssm( N )
    str = repmat( '1.1,', 1, N );
    fid = fopen( 'cssm.txt', 'w' );
    fprintf( fid, '%s', str );
    fclose( fid );
    CR  = csvread( 'cssm.txt' );
    fid = fopen( 'cssm.txt', 'r' );
    FS  = fscanf( fid, '%f,' );
    fclose( fid );
end

"As I understand it, csvread should produce a row matrix with an extra 0". I didn't find that stated in the documentation of csvread.

csvread is based on textscan and contains a bit of automagic. I guess it was never intended for rows that long, i.e., files without newlines.

without the ending comma

And without the ending comma, csvread returns a row for the large file.

>> [CR,FS] = cssm(1e5); whos('CR','FS')
  Name           Size                 Bytes  Class     Attributes
  CR             1x100000            800000  double              
  FS        100000x1                 800000  double

textscan with empty formatSpec

csvread calls textscan with formatSpec set to an empty string. That option of textscan is not documented. It makes a difference in this special case.

>> [CR,FS,TS1,TS2] = cssm(1e3); whos('CR','FS','TS1','TS2')
  Name         Size              Bytes  Class     Attributes
  CR           1x1001             8008  double              
  FS        1000x1                8000  double              
  TS1       1000x1                8000  double              
  TS2          1x1001             8008  double              
>> [CR,FS,TS1,TS2] = cssm(1e5); whos('CR','FS','TS1','TS2')
  Name           Size             Bytes  Class     Attributes
  CR        100000x1             800000  double              
  FS        100000x1             800000  double              
  TS1       100000x1             800000  double              
  TS2       100000x1             800000  double

where

function    [ CR, FS, TS1, TS2 ] = cssm( N )
    str = repmat( '1.1,', 1, N );
    fid = fopen( 'cssm.txt', 'w' );
    fprintf( fid, '%s', str(1:end) );
    fclose( fid );
    CR  = csvread( 'cssm.txt' );
    fid = fopen( 'cssm.txt', 'r' );
    FS  = fscanf( fid, '%f,' );
    fclose( fid );
    fid = fopen( 'cssm.txt', 'r' );
    cac = textscan( fid, '%f', 'Delimiter',','            ... 
                  , 'CollectOutput',true, 'EmptyValue',999 );
    fclose( fid );
    TS1 = cac{:};
    fid = fopen( 'cssm.txt', 'r' );
    cac  = textscan( fid, '', 'Delimiter',','               ...
                   , 'CollectOutput',true, 'EmptyValue',999 );
    fclose( fid );
    TS2 = cac{:};
end

For the large file, all of the functions and options I tested fail to recognize the ending comma.
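
If the extra trailing 0 is actually wanted regardless of file size, one possible workaround is to check for the ending comma by hand. The sketch below is an assumption-laden illustration, not something csvread does internally: it reads the whole file with fileread, parses it with the same '%f,' format used for FS above, and appends a 0 when the text ends in a comma. The function name readRowWithTrailingZero is hypothetical.

```matlab
function row = readRowWithTrailingZero(filename)
% readRowWithTrailingZero  Read one line of comma-separated numbers as a
% row and append a 0 if the line ends with a comma.
% Hypothetical helper for illustration; not part of MATLAB.
    txt = strtrim(fileread(filename));   % whole file, outer whitespace removed

    % '%f,' is reapplied until the text is exhausted, as with fscanf above.
    row = sscanf(txt, '%f,').';

    % Treat a trailing comma as one empty field and fill it with zero.
    if ~isempty(txt) && txt(end) == ','
        row(end+1) = 0;
    end
end
```

Because the whole file is loaded into memory as text, this is only reasonable for files of moderate size, such as the one-row files discussed here.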

Comments: 2

Peter on 5 Jun 2015

Interesting. While it's not on the documentation page,

help csvread

produces

csvread fills empty delimited fields with zero.  Data files where
    the lines end with a comma will produce a result with an extra last 
    column filled with zeros.
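
A quick way to check that statement on a small file is sketched below. The temporary file name and the values 1,2,3 are made up for illustration; going by the help text and the 1x1001 result above, M should come back as a 1-by-4 row whose last element is 0.

```matlab
% Write a short, single-line CSV file that ends with a comma.
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, '1,2,3,\n');
fclose(fid);

% Read it back and inspect the shape and the last element.
M = csvread(fname);
disp(size(M))
disp(M(end))

delete(fname);   % remove the temporary file
```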

per isakson on 5 Jun 2015

Edited: per isakson on 5 Jun 2015

The test suites at The MathWorks don't always cover all the edge cases, I guess.
