Read a very large .csv file, split into parts and save each part into a smaller .csv file

Question

GioPapas81 on 26 Sep 2019


https://kr.mathworks.com/matlabcentral/answers/482275-read-a-very-large-csv-file-split-into-parts-and-save-each-part-into-a-smaller-csv-file

Commented: GioPapas81 on 1 Oct 2019

Dear Matlabers,

I need to read a very large .csv file with about 15,000 columns and 500,000 rows. I need to split it into chunks of rows (e.g. 20,000 rows and all 15,000 columns) and save each chunk into a separate .csv file.

1. I have tried to use textscan, but I am not sure it can work, as the file contains not only numeric columns but also non-numeric and date columns. Ideally I would like to keep all of this information, as I will need it for different parts of my project.

2. I also attempted tabularTextDatastore, but I get an error:

Unable to determine the format of the DATETIME data.

Try adding a format to the DATETIME specifier. e.g. '%{MM/dd/uuuu}D'.

Is there any way I could provide a DATETIME specifier (this is not explained in the relevant documentation)?

Memory is not a problem here, as I am currently working on a machine with a very large amount of RAM.

I would be grateful for any ideas.

George


Answer 1 (accepted answer)

Jeremy Hughes on 27 Sep 2019


https://kr.mathworks.com/matlabcentral/answers/482275-read-a-very-large-csv-file-split-into-parts-and-save-each-part-into-a-smaller-csv-file#answer_393809


If your plan is just to write out the smaller CSV files and do nothing else with the data in MATLAB, I'd say use tabularTextDatastore and set every column format to '%q' with ds.TextscanFormats(:) = {'%q'}. There should never be any errors with '%q', because every field is then read as text (with any surrounding double quotes removed).

Then use writetable.

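ds = tabularTextDatastore(filename,'ReadSize',myReadSize);  % myReadSize = rows per read, e.g. 20000
ds.TextscanFormats(:) = {'%q'};   % read every column as text
k = 0;
while hasdata(ds)
    k = k + 1;
    % The file names are up to you; numbering the parts is one possible scheme.
    output_filename = sprintf('part_%03d.csv',k);
    writetable(read(ds),output_filename);
end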
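
If the date columns are needed as datetime values rather than text, the specifier hinted at in the error message can be given per column through TextscanFormats. The following is only a minimal sketch on a made-up two-column file: the column names, the sample dates and the dd/MM/uuuu format are assumptions for illustration, not details taken from the original data.

```matlab
% Build a tiny CSV to experiment with (hypothetical data and column names).
tmp = [tempname '.csv'];
fid = fopen(tmp,'w');
fprintf(fid,'id,visit\n1,26/09/2019\n2,27/09/2019\n');
fclose(fid);

% Give the date column an explicit datetime format instead of relying on
% automatic detection. The format dd/MM/uuuu is an assumption about the data.
ds = tabularTextDatastore(tmp,'TextscanFormats',{'%f','%{dd/MM/uuuu}D'});
T = read(ds);
disp(class(T.visit))   % shows the class of the date column

delete(tmp);           % clean up the temporary file
```

On the real file the same idea would apply to individual columns, e.g. ds.TextscanFormats{100} = '%{dd/MM/uuuu}D', once the date columns and their actual format are known.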


GioPapas81 on 30 Sep 2019

Edited: GioPapas81 on 30 Sep 2019

Hi Jeremy,

Many thanks for your answer. Sorry, as I am not very familiar with the TextscanFormats yet, what is the term (:) specifying here? Is it the number of the column (e.g. column no 1000 should be denoted as (1000))?

I have dates across many columns, but I could not find a way to specify them. If, let's say, I have dates in columns 100 and 1000, would the following command be correct?

ds.TextscanFormats(100,1000) = {'%q','%q'};

2. On a separate but similar note, I also tried the following way:

ds.SelectedVariableNames = {'eid','20201-2.0'};

where eid, and 20210-2.0 are column variables. This way would also work for me, as I could extract specific columns to work with for my data analysis. However, I get an error:

Error using matlab.io.datastore.TabularTextDatastore/set.SelectedVariableNames (line 619)

SelectedVariableNames must be a unique subset of VariableNames.

I don't have any other variable with that name, but the same name repeats in the same column across multiple rows.

If I could get either of 1 or 2 to work, that would be so helpful.

Thank you again,

George
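
One possible cause of the SelectedVariableNames error (an assumption; the thread does not confirm it) is that a header such as 20201-2.0 is not a valid MATLAB identifier, so the datastore may store it under a modified name. A hedged way to check is to inspect ds.VariableNames and select from that list. The sketch below uses a made-up two-column file, so the data values are hypothetical.

```matlab
% Build a tiny CSV whose second header is not a valid MATLAB identifier
% (hypothetical data).
tmp = [tempname '.csv'];
fid = fopen(tmp,'w');
fprintf(fid,'eid,20201-2.0\n1001,a\n1002,b\n');
fclose(fid);

ds = tabularTextDatastore(tmp);
disp(ds.VariableNames)   % the names exactly as the datastore stores them

% Select columns by matching against the stored names, not the raw header text.
wanted = contains(ds.VariableNames,'20201') | strcmp(ds.VariableNames,'eid');
ds.SelectedVariableNames = ds.VariableNames(wanted);
T = read(ds);

delete(tmp);             % clean up the temporary file
```

Because the selection is taken from ds.VariableNames itself, it is always a subset of the names the datastore accepts.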

Jeremy Hughes on 30 Sep 2019

':' is MATLAB syntax meaning "all". For example,

x(:) = -1

would set all the values in x to -1. I meant literally that code. =)
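
The same indexing applies to a cell array of format strings. As a small sketch (the four-element format list here is hypothetical), (:) assigns to every element, while a vector of indices such as ([2 4]) assigns to selected columns only:

```matlab
% Hypothetical format list for a four-column file.
fmts = {'%f','%D','%f','%D'};
fmts(:) = {'%q'};            % every column becomes '%q'

% To change only specific columns, index with a vector of column numbers.
fmts2 = {'%f','%D','%f','%D'};
fmts2([2 4]) = {'%q'};       % only columns 2 and 4 become '%q'
```

So for dates in columns 100 and 1000, the indexing would be ds.TextscanFormats([100 1000]) = {'%q'}, rather than (100,1000), which addresses row 100, column 1000.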

GioPapas81 on 1 Oct 2019

Thank you Jeremy, I will try this out.

George


Answer 2

Sulaymon Eshkabilov on 26 Sep 2019


https://kr.mathworks.com/matlabcentral/answers/482275-read-a-very-large-csv-file-split-into-parts-and-save-each-part-into-a-smaller-csv-file#answer_393623


Hi,

The answer is rather simple. You can read all the dates as text with the string specifier %s. For example, take a file called DATA_date.txt:

DATE Row1 Row2 Row3 Row5
11/11/2019 1 1.13 2 3.33
11/11/2019 2 0.13 3.12 3.33
11/11/2019 3 2.13 -2 -5.33
11/11/2019 4 4.13 -3 -7.33
11/11/2019 5 3.13 5.5 -8.33
11/11/2019 6 2.13 2.6 -13.33

This can be imported into the MATLAB workspace with:

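FileName = 'DATA_date.txt';
FID = fopen(FileName, 'r');
assert(FID ~= -1, 'Could not open %s', FileName);
SPECs = '%s%d%f%f%f';   % date as text, one integer column, three floating-point columns
N_header = 1;           % number of header lines to skip
DATA = textscan(FID, SPECs, 'HeaderLines', N_header);
fclose(FID);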

Now all imported data will be inside a cell array DATA. DATA{1,1} contains DATE values; DATA{1,2} contains data of Row1; ... DATA{1,5} contains data of Row5.
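
If the dates are later needed as datetime values, the text read with %s can be converted afterwards. This is a minimal sketch on in-memory text rather than the file above. The sample rows are made up, and the MM/dd/yyyy input format is an assumption, since the month/day order cannot be inferred from the example data.

```matlab
% Two hypothetical data rows (no header), parsed from text instead of a file.
txt = sprintf('11/11/2019 1 1.13\n11/12/2019 2 0.13\n');
DATA = textscan(txt, '%s%d%f');

% Convert the text dates; the MM/dd/yyyy order is an assumption about the data.
dates = datetime(DATA{1}, 'InputFormat', 'MM/dd/yyyy');
```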

Good luck.


Sulaymon Eshkabilov on 26 Sep 2019

Pay careful attention to how your data is formatted, i.e. the data type of each column (integer, floating point, date, text, etc.). The number of columns in each row has to match that of the following rows, so your data needs to be neatly formatted. A single missing data point somewhere in a large file can cause problems.

Good luck.
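
For delimited files, one partial mitigation is that textscan, when given an explicit 'Delimiter', returns the 'EmptyValue' (NaN by default) for an empty numeric field, so a blank cell between two commas need not shift the columns. A minimal sketch on made-up in-memory text (the values are hypothetical):

```matlab
% Three hypothetical comma-separated rows; the second row has an empty numeric field.
txt = sprintf('a,1,2\nb,,3\nc,4,5\n');

% With an explicit delimiter, the empty numeric field is returned as EmptyValue.
C = textscan(txt, '%s%f%f', 'Delimiter', ',', 'EmptyValue', NaN);
missingInCol2 = isnan(C{2});   % logical mask of missing entries in the second column
```

This only covers empty fields in numeric columns. Rows that are actually missing a delimiter would still misalign.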

GioPapas81 on 27 Sep 2019

Unfortunately, I do have lots of missing data in my file, randomly distributed. I also don't know which columns contain dates (there are thousands of columns, across hundreds of thousands of rows).

I hoped that the tabularTextDatastore option would be possible, but I think it is not possible to account for dates via that route (according to the errors I get above).

But, thank you for your responses Sulaymon.
