Improving speed of readtable

Daniel van Huyssteen, 14 April 2021
Commented: Daniel van Huyssteen, 14 April 2021
I have a large array stored in a .dat file (see Example.dat attached) and I need to import the array into MATLAB.
At the moment I am using the following approach to load the table and convert it to an array.
Example_Table = readtable("Example.dat");
Example_Array = table2array(Example_Table);
This process is, however, taking much longer than I would expect, given that I have a reasonably powerful PC.
I suspect that the issue is related to the array having a large number of zero entries.
The results of Run & Time are shown below
It is clear that pretty much all of the time is involved in reading the table and not in converting it to an array.
The timing profile of table.readTextFile>textscanReadData is shown below
Where all of the time is spent on the TreatAsEmpty command (because of having many zero entries?).
Below is a snapshot of the CPU and RAM usage while the table is being read.
It is clear that a lot of computational power is going unused, so it should be possible to speed this process up one way or another.
How can I make this process run faster?
I have to read in lots of data like this and it is a very frustrating process.
Thanks in advance!

Accepted Answer

Walter Roberson, 14 April 2021
"Where all of the time is spent on the TreatAsEmpty command (because of having many zero entries?)"
No, that is not what is happening.
What is happening is that MathWorks coded the internal call to textscan() split across two lines:
data = textscan(fid, format, a bunch of stuff here, ...
'TreatAsEmpty', treatasempty, a bunch more stuff here)
The time for the overall call is accounted to the last line of the call, as the textscan() call itself cannot start until all of the parameters have been executed.
For this purpose, "executed" for something like
treatasempty = [];
textscan('TreatAsEmpty', treatasempty)
would consist of parsing the character vector 'TreatAsEmpty', creating a temporary (unnamed) expression block for it, and pushing that into the parameters; and then parsing the variable named treatasempty, locating the variable in scope, and pushing its (named) expression block into the parameters. Those operations might not take long, but they take some time, and that is time spent preparing to call textscan() without yet having called it. The time to parse the parameters and get ready for the call is shown on line 554, the data = textscan( part.
'TreatAsEmpty' is not a command in this context: it is just a literal constant to be prepared and passed in to the function.
The timing you are seeing for line 555 is the time spent executing textscan() itself.
  7 Comments
Walter Roberson, 14 April 2021
If you have a matrix of data, D, then
tic
[r, c, s] = find(D);
as_rows = [r, c, s].';
dlmwrite('Example_sparse.csv', as_rows, 'precision', 16);
toc
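This takes about 1.3 seconds to write.
Restore with
tic
S = fileread('Example_sparse.csv');   % whole file as one character vector
rcs = str2num(S);                     % 3-by-N numeric matrix: rows, columns, values
Drestored = sparse(rcs(1,:), rcs(2,:), rcs(3,:));
time_via_sparse = toc
This takes about 1.5 seconds to read.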
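One caveat with the round trip above: sparse(r, c, s) sizes the result from the largest row and column index present, so trailing all-zero rows or columns of D would be dropped. A minimal sketch of one way to keep the original size, assuming you control the file layout, is to store the size in an extra leading column. The file name and the small stand-in matrix are made up for illustration, and dlmread is used here instead of str2num.

```matlab
% Small stand-in matrix; the last row and the last columns are all zero on purpose.
D = zeros(4, 5);
D(1, 2) = 3.5;
D(3, 1) = -2;

[r, c, s] = find(D);              % row indices, column indices, nonzero values
[nRows, nCols] = size(D);

% Assumed layout: the first column holds [nRows; nCols; 0],
% every later column holds one (row, column, value) triplet.
payload = [[nRows; nCols; 0], [r, c, s].'];
outFile = fullfile(tempdir, 'Example_sparse_sized.csv');   % hypothetical file name
dlmwrite(outFile, payload, 'precision', 16);

% Read back and rebuild with the stored size passed explicitly to sparse().
rcs = dlmread(outFile);
Drestored = sparse(rcs(1, 2:end), rcs(2, 2:end), rcs(3, 2:end), rcs(1, 1), rcs(2, 1));
```

Call full(Drestored) if the rest of the code expects an ordinary (non-sparse) array.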
Daniel van Huyssteen, 14 April 2021
This works great! Thanks! :D


More Answers (1)

Bjorn Gustavsson, 14 April 2021
It is a rather large data file to read. You might reduce the read time by using load instead of readtable: load should avoid much of the overhead that readtable carries in order to handle many different data formats.
If you are able to change the format of your files, that might be a far more successful way forward for very sparse data: save only the nonzero components together with their row and column indices, and rebuild the array when reading, instead of storing a large number of zeros. But maybe you're given the data as is and have to shovel zeros around...
HTH
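The readtable versus load comparison can be checked with a simple tic/toc sketch like the one below. It is only an illustration: the small demo file stands in for Example.dat, its name and space-delimited layout are assumptions, and the timings will vary with the file, the machine, and the MATLAB release.

```matlab
% Stand-in for Example.dat: a small, mostly-zero matrix written as text.
demoFile = fullfile(tempdir, 'Example_demo.dat');   % hypothetical file name
A = zeros(200, 300);
A(5, 7) = 1.25;
dlmwrite(demoFile, A, 'delimiter', ' ');

% Time the readtable + table2array route used in the question.
tic
T = readtable(demoFile, 'Delimiter', ' ', 'ReadVariableNames', false);
viaReadtable = table2array(T);
tReadtable = toc;

% Time load on the same file, read as plain ASCII.
tic
viaLoad = load(demoFile, '-ascii');
tLoad = toc;

fprintf('readtable: %.3f s, load: %.3f s\n', tReadtable, tLoad);
```

Pointing demoFile at the real data file gives the comparison for the actual workload.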
  2 Comments
Daniel van Huyssteen, 14 April 2021
Thanks for the answer.
Unfortunately I'm not able to manipulate the writing/storage of the original data.
Curiously, using load instead of readtable takes even longer!
Bjorn Gustavsson, 14 April 2021
That's a double bummer. I'm really surprised that load takes longer; I would've bet good money that the more general capabilities of readtable would cost time. Then perhaps you can save overall processing time by following Walter's suggestion of converting the data files to a sparse format. You might be able to bulk-process all the data files overnight, when it doesn't test your patience...
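A minimal sketch of such a one-off batch conversion might look like the following. It assumes all source files sit in one folder and match *.dat, and it stores each array as a sparse matrix in a binary MAT-file rather than the CSV layout from the accepted answer; the folder, the _sparse.mat suffix, and the variable names are hypothetical.

```matlab
% Assumption: every source file lives in one folder and matches *.dat.
srcDir = pwd;                                   % hypothetical location of the .dat files
files  = dir(fullfile(srcDir, '*.dat'));

for k = 1:numel(files)
    srcFile = fullfile(files(k).folder, files(k).name);
    [~, baseName] = fileparts(srcFile);
    dstFile = fullfile(files(k).folder, [baseName, '_sparse.mat']);

    if exist(dstFile, 'file')
        continue                                % already converted on an earlier run
    end

    D = table2array(readtable(srcFile));        % the slow text read, paid only once
    S = sparse(D);                              % keeps the full size of D
    save(dstFile, 'S');                         % binary MAT-file
end

% Later sessions can reload the binary file instead of re-parsing the text:
% loaded = load('Example_sparse.mat');
% D = full(loaded.S);
```

Skipping files that already have a converted copy lets the script be stopped and restarted without redoing finished work.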

Release

R2018a
