Reading data from large .CSV files

Question

hal9k 2021년 1월 15일

0
링크

이 질문에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/718100-reading-data-from-large-csv-files

답변: Athul Prakash 2021년 1월 22일

So I am trying to figure out a best and efficient way to achive as shown in pic. I have looked into datastore and other things but I can't seem to find right strategy which will allow me to accomplish this.

I have large csv filesets (each of them are ~10 Gbs) that have timestamps on them. I need to extract certain section based on times (see sample on the right), combine them and create .mat or .txt (or .csv) files.

What would be the best strategy to achieve this? If the file sizes were small, I could easily achieve that by loading and sorting but with massive files I cant seem to do it efficiently.

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

dpb 2021년 1월 15일

Are the timestamps uniform across files so can compute number of records needed for each timestep? If so could just copy that many records sequentially from each to the new output file...

What is to be done in the output file about timestamps -- each file a column or just append or what?

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Answer 1

Athul Prakash 2021년 1월 22일

0
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/718100-reading-data-from-large-csv-files#answer_603898

MATLAB Online에서 열기

Hi there,

The use of a datastore is what comes to mind at first - you could try using read to obtain a chunk of data at once and select required fields efficiently using logical indexing..

% something like..
r = read(ds);   % 'ds' - tabularTextDataStore
r_sel = r(r.Time<t1 & r.Time>t2, :);
% Save 'r_sel' into another CSV, or add to a running variable.

It's not clear which factor would cause datastore operation to be too slow for your case.

Perhaps it may be faster if you could process one whole csv file at once, allocating all the data into different time slots and then moving on to the next CSV file. You could consider implementing Bucket Sort for the different time stamps.

Finally, you may consider using MapReduce. It's more flexible and powerful than using a datastore, but may require a learning curve. You may come up with the same solution you've tried before on MapReduce and find it working faster.

Documentation: Getting started with MapReduce

Hope it helps!

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글을 달려면 로그인하십시오.

Reading data from large .CSV files

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

채택된 답변

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

추가 답변 (0개)

참고 항목

카테고리

태그

Community Treasure Hunt

Reading data from large .CSV files

댓글 수: 1 이전 댓글 -1개 표시이전 댓글 -1개 숨기기

채택된 답변

댓글 수: 0 이전 댓글 -2개 표시이전 댓글 -2개 숨기기

추가 답변 (0개)

참고 항목

카테고리

태그

Community Treasure Hunt

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기