Any simple ideas to improve performance of reading long (10 GB scale) .csv file?

Question

Boldizsár Balog, 1 February 2022

Direct link to this question:

https://kr.mathworks.com/matlabcentral/answers/1640740-any-simple-ideas-to-improve-performance-of-reading-long-10-gb-scale-csv-file

Commented: Boldizsár Balog, 7 February 2022

Hello,

I am trying to improve the performance of reading .csv files containing fiber photometry data. As stated, the files are a few tens of GB in size and have around 10e7 rows and 11 columns. Long story short: they fit in the memory of my machine.

I've done some benchmarking using textscan, fread, readtable and readmatrix. Please look at the attached file, which generates some sample data and runs the benchmark. This data is smaller than the original data, but it can easily be made longer.

I've read other questions on the topic but did not find a solution. Does anyone have a good idea for speeding up the file reading? Could I, for example, try parallel computing?

Another thing (maybe I should open a separate question for this): in the example I attach, textscanning "linewise" (using a line format) results in an error if the file contains empty values. This is the error message I get:

Error using textscan
Mismatch between file and format character vector.
Trouble reading 'Numeric' field from file (row number 4, field number 7) ==> ,0.643668,0.396115,0.987143,1,1,1,1,\n

Is there a way to get NaN values for missing data, just as the readtable function does? I tried all the settings I could find in the textscan documentation, without success. I attached a sample file with missing values.
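One possible cause, assuming the line format spells the commas out literally (something like '%f,%f,%f\n'; the attached format is not shown here), is that textscan then expects a number between every pair of commas. A minimal sketch of an alternative: declare the comma through the 'Delimiter' option, so that empty numeric fields fall back to the 'EmptyValue' setting, which is NaN by default. The three-column sample text and the variable names are made up for illustration.

```matlab
% Made-up sample: row 1 has an empty middle field, row 2 an empty last field.
txt = sprintf('1,,3\n4,5,\n');

nCols = 3;                        % assumed number of columns
fmt   = repmat('%f', 1, nCols);   % '%f%f%f', no literal commas in the format

% The comma is declared as the delimiter, so an empty field is recognised
% as such and filled with EmptyValue (NaN is the default; shown explicitly).
C = textscan(txt, fmt, 'Delimiter', ',', 'EmptyValue', NaN, ...
             'CollectOutput', true);
data = C{1};                      % nRows-by-nCols double matrix
```

The same options apply when textscan is given a file identifier from fopen instead of a character vector.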

Thank you for your time in advance

Boldizsar

Comments: 7 (5 earlier comments hidden)

Boldizsár Balog, 4 February 2022

Yes, it is stored on an SSD.

Walter Roberson, 5 February 2022

By the way, see my recent remarks at https://www.mathworks.com/matlabcentral/answers/1638310-running-matlab-in-parallel-on-local-machine-or-another-suggestion#answer_884150


Answer 1

Yair Altman, 2 February 2022

Direct link to this answer:

https://kr.mathworks.com/matlabcentral/answers/1640740-any-simple-ideas-to-improve-performance-of-reading-long-10-gb-scale-csv-file#answer_886850

If you need to read the entire file for some reason, then try parallelization with multiple workers. Each worker would read a different section of the file, starting at a different offset. At the end, you would need to merge the sections read by the different workers. You can test different numbers of workers, but I assume you would get better results by using more workers than your CPU has cores (i.e. more than the default number of pool workers), because in all likelihood you would be limited by I/O throughput more than by CPU processing.
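The offset-based scheme described above could be sketched roughly as follows. This is an illustration under assumptions, not a benchmarked solution: readCsvInChunks and readChunk are hypothetical names, the file is assumed to be purely numeric with no header line and exactly nCols fields per row, and parfor only runs in parallel when Parallel Computing Toolbox is available (otherwise the loop runs serially).

```matlab
function data = readCsvInChunks(fname, nCols, nChunks)
% Hypothetical helper (not from the thread): read a header-less numeric
% CSV file in nChunks byte ranges and stack the results in file order.
info  = dir(fname);
edges = round(linspace(0, info.bytes, nChunks + 1));   % byte offsets
parts = cell(nChunks, 1);
parfor k = 1:nChunks          % runs serially if no parallel pool exists
    parts{k} = readChunk(fname, edges(k), edges(k + 1), nCols);
end
data = vertcat(parts{:});     % merge the sections
end

function block = readChunk(fname, startByte, endByte, nCols)
% Parse every line that STARTS inside the byte range [startByte, endByte).
block = zeros(0, nCols);
fid = fopen(fname, 'r');
cleaner = onCleanup(@() fclose(fid));
if startByte > 0
    % Step back one byte and discard up to the next newline, so reading
    % begins at the first line start at or after startByte.
    fseek(fid, startByte - 1, 'bof');
    fgetl(fid);
end
nBytes = endByte - ftell(fid);
if nBytes <= 0
    return                    % no line starts inside this range
end
raw = fread(fid, nBytes, 'uint8=>char')';
if ~isempty(raw) && raw(end) ~= char(10)
    tail = fgets(fid);        % finish the line that crosses endByte
    if ischar(tail)
        raw = [raw tail];
    end
end
C = textscan(raw, repmat('%f', 1, nCols), 'Delimiter', ',', ...
             'CollectOutput', true);
block = C{1};
end
```

A call such as data = readCsvInChunks('recording.csv', 11, 16) (file name made up) would split the file into 16 ranges. The number of chunks is independent of the pool size, so it can be varied when testing whether I/O or CPU is the bottleneck.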

But if you do not need to read the entire file into memory, just a certain section, this would make the problem much easier (and faster) to resolve.
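If only a block of rows is needed, one way to avoid parsing the rest of the file is to let textscan skip the leading lines and stop after a fixed number of rows. A minimal sketch, assuming a header-less numeric file; readRowRange and its arguments are hypothetical names chosen for illustration.

```matlab
function block = readRowRange(fname, nCols, firstRow, nRows)
% Hypothetical helper: return rows firstRow .. firstRow+nRows-1 of a
% header-less numeric CSV file.
fid = fopen(fname, 'r');
cleaner = onCleanup(@() fclose(fid));
fmt = repmat('%f', 1, nCols);
% 'HeaderLines' skips lines without converting them to numbers; the third
% argument limits how many times the format (one row here) is applied.
C = textscan(fid, fmt, nRows, 'Delimiter', ',', ...
             'HeaderLines', firstRow - 1, 'CollectOutput', true);
block = C{1};
end
```

Skipped lines still have to be scanned for line breaks, so rows far into the file cost more to reach than rows near the start.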

Comments: 2

Boldizsár Balog, 4 February 2022

I will try that and let you know what I find. Thank you.

Boldizsár Balog, 7 February 2022

I don't have any experience with parallel computing. I'm sure there has to be an example of your suggestion somewhere. If you know of any, please send it to me!


Answer 2

Walter Roberson, 5 February 2022

Direct link to this answer:

https://kr.mathworks.com/matlabcentral/answers/1640740-any-simple-ideas-to-improve-performance-of-reading-long-10-gb-scale-csv-file#answer_889300

Ah, I managed to find it. See https://www.mathworks.com/matlabcentral/answers/801591-improving-speed-of-readtable#comment_1458376 where I did some timing tests.

fileread() followed by str2num() was the fastest, by a small margin. However, you have the difficulty that you have empty fields. Neither str2num() nor sscanf() can deal with empty fields.

As your files are quite big, it is difficult at the moment to predict which approach that takes empty fields into account would be fastest. It might turn out to be readmatrix(). But if you have enough memory, it might be more efficient to fileread() the file, then regexprep() to replace empty fields with NaN, and then str2num().
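The fileread() / regexprep() / str2num() idea could look roughly like the sketch below. The helper name readCsvViaStr2num and the two regular expressions are assumptions made for illustration, and the file is assumed to be header-less and purely numeric. Note that str2num() evaluates its input, so this is only appropriate for trusted files, and that the whole text plus the converted matrix must fit in memory.

```matlab
function data = readCsvViaStr2num(fname)
% Hypothetical helper: read a header-less numeric CSV file as text, fill
% empty fields with NaN, and convert the text in a single str2num call.
txt = fileread(fname);
txt = strrep(txt, char(13), '');                    % drop carriage returns
% An empty first field: a line that starts with a comma.
txt = regexprep(txt, '^,', 'NaN,', 'lineanchors');
% An empty middle or last field: a comma directly followed by another
% comma, a line break, or the end of the text.
txt = regexprep(txt, ',(?=,|\n|$)', ',NaN');
% Line breaks in the text become row separators in the resulting matrix.
data = str2num(txt); %#ok<ST2NM>
end
```

Whether this beats readmatrix() on a given file would need to be timed; the extra text copies made by strrep() and regexprep() add to the memory footprint.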

Comments: 2

Yair Altman, 5 February 2022

I just uploaded the str2number function to the File Exchange. It is a faster replacement for str2num/str2double/sscanf for the common case of simple scalar non-complex numbers represented as strings. It makes little difference when the number of values is small, but can be important when you process a very large number of values, when performance is critical (e.g. when processing real-time input data), or when the input data contains non-convertible strings.

Boldizsár Balog, 7 February 2022

Thank you for your suggestions. The problem is that, in my case, converting a string to numbers requires about 10 times more memory than the size of the original file (with str2num, str2double, str2number and others I tried but can't recall), which unfortunately does not fit in the 16 GB installed in my machine. Maybe I will come back to this issue later, but for now I will patiently wait a minute to read the data.
