MATLAB Answers

0

Importing a Large CSV in chunks and getting Data Range Invalid

John Gagnon 님이 질문을 제출함. 25 Apr 2019
최근 활동 Jeremy Hughes 님이 답변함. 25 Apr 2019
I have a large CSV File with 17+million rows of data. My current code is importing 100,000 lines at a time and processing that data down than overwritting that raw data for the next 100k in an attempt to not run out of memory.
for i=0:100000:10000000
start_row=2+i;
end_row=100000+i;
RangeString = sprintf('B%d:O%d', start_row, end_row);
[num,text,raw] = xlsread('ABC.csv', 1, RangeString);
%%Process Data%%
End
The End Result anytime I try import a line over 1m.
Data range 'B1000002:O1100000' is invalid.
Error in DataAnalysis (line 13)
[num,text,raw] = xlsread('ABC.csv', 1, RangeString);
What is the better way to import and process this data?

  댓글 수: 6

표시 이전 댓글 수: 3
Im going down the TextScan rabbit hole now. Any help on formating would be great, im unsure where to call the specific Rows and Columns.
For textscan() you specify a format in a fscanf() like way. You need one format specifier for each input field, and if you do not want the field returned you can use the * modifier:
%f%*s%f
=> read a number to be returned, read a string but discard it, read a number to be returned
Normally, textscan returns a cell array in which there is one element for each format specifier (that is not marked to discard.) In most uses, that means a cell array that contains one cell per column of input (that you want to keep.) The CollectOutput option triggers merging the output for adjacent elements that use the same datatype.
Dates can be read in as dates with a %D specifier. However, it is tricky to get it to scan a date with a space in it as a single field; it is often easier to scan a date with a space in it as a date field and then a duration field, and add the duration to the date afterwards.

로그인 to comment.

제품


릴리스

R2016b

답변 수: 2

Bob Nbob 님의 답변 25 Apr 2019
Bob Nbob 님이 편집함. 25 Apr 2019

You can call headerlines to specify a starting point with textscan, but I don't know if it is possible to specify an ending point, which is part of why I suggested dlmread.
Because you have dates, another option you can try is to read each line as a string of text, and then split the values with regexp. That might look something like this.
fid = fopen('mydata.csv');
line = fgetl(fid);
count = 1;
while ~isnumeric(line)
if rem(count,100000)==0; % Look for
tmp = regexp(line,'(\d+)\s+(\d+)\s+(\d+\s\d+\s\d+)\s+(\d+)','tokens'); % Sample only, should return four cells with data
^ ^ ^ ^
Interger Integer DateTime Integer
data(count,:) = [tmp{:}]; % May need to go one level deep {1}{:}, regexp is weird
% Put what you want to do with each block of 100k data lines here. Not really sure what you want to do with it
count = 0;
else
tmp = regexp(line,'(\d+)\s+(\d+)\s+(\d+\s\d+\s\d+)\s+(\d+)','tokens');
data(count,:) = [tmp{:}];
end
line = fgetl(fid);
count = count + 1;
end
This is not a perfect code, and may require you to learn some about regexp to get the formatting correct. I was also unsure what exactly your format looks like.
Alternatively, you could try reading with dlmread, and then concatenating several columns into one for a date time, either using datetime, or just by concatenating.
Keep in mind that no matter which method you use that you will have a set of data with mixed classes, so you will either need to work with cells, a table, or a structure.

  댓글 수: 2

Thank you. I will take a look at regexp and see if I can get this to work!
For textscan() you specify a repeat count immediately after the format string.

로그인 to comment.


Jeremy Hughes 님의 답변 25 Apr 2019

I'd reccomend using tabularTextDatastore for this case. It will try to automatically detect formats and handle the reading of sections of the file for you.

  댓글 수: 0

로그인 to comment.



Translated by