Importing a Large CSV in chunks and getting Data Range Invalid

Question

0 개 추천

I have a large CSV File with 17+million rows of data. My current code is importing 100,000 lines at a time and processing that data down than overwritting that raw data for the next 100k in an attempt to not run out of memory.

for i=0:100000:10000000
    start_row=2+i;
    end_row=100000+i; 
    RangeString = sprintf('B%d:O%d', start_row, end_row); 
    [num,text,raw] = xlsread('ABC.csv', 1, RangeString);
 %%Process Data%%
End

The End Result anytime I try import a line over 1m.

Data range 'B1000002:O1100000' is invalid.

Error in DataAnalysis (line 13)

[num,text,raw] = xlsread('ABC.csv', 1, RangeString);

What is the better way to import and process this data?

댓글 수: 6
이전 댓글 4개 표시 이전 댓글 4개 숨기기

Walter Roberson 2019년 4월 25일

You might want to look at https://www.mathworks.com/help/matlab/ref/matlab.io.datastore.tabulartextdatastore.html -- though it appears to use textscan() in part.

Walter Roberson 2019년 4월 25일

For textscan() you specify a format in a fscanf() like way. You need one format specifier for each input field, and if you do not want the field returned you can use the * modifier:

%f%*s%f

=> read a number to be returned, read a string but discard it, read a number to be returned

Normally, textscan returns a cell array in which there is one element for each format specifier (that is not marked to discard.) In most uses, that means a cell array that contains one cell per column of input (that you want to keep.) The CollectOutput option triggers merging the output for adjacent elements that use the same datatype.

Dates can be read in as dates with a %D specifier. However, it is tricky to get it to scan a date with a space in it as a single field; it is often easier to scan a date with a space in it as a date field and then a duration field, and add the duration to the date afterwards.

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Follow Question

Answer 1

Bob Thompson 2019년 4월 25일

편집: Bob Thompson 2019년 4월 25일

0 개 추천

You can call headerlines to specify a starting point with textscan, but I don't know if it is possible to specify an ending point, which is part of why I suggested dlmread.

Because you have dates, another option you can try is to read each line as a string of text, and then split the values with regexp. That might look something like this.

fid = fopen('mydata.csv');
line = fgetl(fid);
count = 1;
while ~isnumeric(line)
    if rem(count,100000)==0; % Look for 
        tmp = regexp(line,'(\d+)\s+(\d+)\s+(\d+\s\d+\s\d+)\s+(\d+)','tokens'); % Sample only, should return four cells with data
                            ^        ^            ^            ^
                          Interger Integer    DateTime      Integer
        data(count,:)  = [tmp{:}]; % May need to go one level deep {1}{:}, regexp is weird
        % Put what you want to do with each block of 100k data lines here. Not really sure what you want to do with it
        count = 0;
    else
        tmp = regexp(line,'(\d+)\s+(\d+)\s+(\d+\s\d+\s\d+)\s+(\d+)','tokens');
        data(count,:)  = [tmp{:}];
    end
    line = fgetl(fid);
    count = count + 1;
end

This is not a perfect code, and may require you to learn some about regexp to get the formatting correct. I was also unsure what exactly your format looks like.

Alternatively, you could try reading with dlmread, and then concatenating several columns into one for a date time, either using datetime, or just by concatenating.

Keep in mind that no matter which method you use that you will have a set of data with mixed classes, so you will either need to work with cells, a table, or a structure.

댓글 수: 2
없음 표시 없음 숨기기

John Gagnon 2019년 4월 25일

Thank you. I will take a look at regexp and see if I can get this to work!

Walter Roberson 2019년 4월 25일

For textscan() you specify a repeat count immediately after the format string.

댓글을 달려면 로그인하십시오.

Answer 2

Jeremy Hughes 2019년 4월 25일

0 개 추천

I'd reccomend using tabularTextDatastore for this case. It will try to automatically detect formats and handle the reading of sections of the file for you.

댓글 수: 0
이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

댓글을 달려면 로그인하십시오.

Importing a Large CSV in chunks and getting Data Range Invalid

댓글 수: 6
이전 댓글 4개 표시 이전 댓글 4개 숨기기

답변 (2개)

댓글 수: 2
없음 표시 없음 숨기기

댓글 수: 0
이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

카테고리

제품

릴리스

태그

Community Treasure Hunt

Importing a Large CSV in chunks and getting Data Range Invalid

댓글 수: 6 이전 댓글 4개 표시 이전 댓글 4개 숨기기

답변 (2개)

댓글 수: 2 없음 표시 없음 숨기기

댓글 수: 0 이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

카테고리

제품

릴리스

태그

참고 항목

Community Treasure Hunt

댓글 수: 6
이전 댓글 4개 표시 이전 댓글 4개 숨기기

댓글 수: 2
없음 표시 없음 숨기기

댓글 수: 0
이전 댓글 -2개 표시 이전 댓글 -2개 숨기기