textscan: Instantaneous out of memory error when accessing very large file (only with newest Matlab versions)

I am working with a very large dataset (total 500GB) that is split up into more than a thousand individual .txt files (160 columns/characteristics per file, more than a million rows possible, contains a mixture of string and numeric variables), each covering a specific geo area. For files covering large areas, a single .txt file can be as large as 16GB. To cope with the large amount of data, I proceed as follows for each of the files:
  • access the respective .txt file (with "fopen")
  • within a while-loop import 250,000 rows using "textscan"
  • process data and export smaller dataset (append if not first loop iteration)
  • repeat steps above until end of the .txt file is reached (while ~feof)
The code that imports the data looks like this:
fileID = fopen(filename); % "filename" is the path of the current .txt file
while ~feof(fileID) % import blocks of 250,000 rows until the end of the current .txt file is reached
    % Import data; "format_varlist" identifies the format of the columns to
    % import. The delimiter is "|".
    Data = textscan(fileID, strcat(char(format_varlist),'\r\n'), 250000, 'Delimiter', '|', ...
        'HeaderLines', double(first_iteration==1), 'EndOfLine', '\r\n', 'EmptyValue', -1);
end
fclose(fileID); % release the file handle once the file has been fully read
Doing so allows me to effectively reduce the size of my dataset such that I can conveniently work with the full dataset later.
My problem is the following: The code works perfectly well for ALL files (including the very large .txt ones) with version 2019a. With version 2021a (if I recall correctly, it did not work with 2020a either), the code works perfectly well UNTIL the code reaches a file that is too large. At this point, the code (instantaneously!) stops with an "out of memory" error:
Out of memory.
Related documentation
I suspect that the newer "textscan" function recognizes that the file to be accessed would be too large to load in fully (which it is), but does not recognize that I only want 250k lines at a time.
I looked at the "readtable" command, but as far as I know, it does not allow importing smaller chunks of data one at a time (only for spreadsheets).
Is there a workaround/fix for my issue? As I work (and worked) frequently with these types of codes, I would otherwise be stuck with the 2019 version forever. Thank you very much in advance for your help.

3 Comments

I'd submit this issue directly to MathWorks as a technical support issue/bug.
There are now datastores and tall arrays designed for large datasets, which will likely be the recommended alternative/workaround. I've never used them "in anger", though, so I don't have any practical hands-on experience.
It is a shame that the movement to more elegant coding practices comes at such a price in efficiency and memory usage and that old tried-and-true methods break, agreed.
ADDENDUM:
Hmmm...I went back and read some of the background documentation again and found the following interesting tidbit in a recent version--
"Use the textscan function to access parts of a large text file by reading only the selected columns and rows. If you specify the number of rows or a repeat format number with textscan, MATLAB calculates the exact amount of memory required beforehand."
I wonder if the problem is that, with all the added baggage/overhead in newer releases of MATLAB, you've simply reached a point where your chunk size is too big. Have you tried cutting 250000 to 200000, say, to see if the symptoms change?
Of course, if that does work, it will come at the cost of having to do 25% more iterations and probably at least that much more in processing time. "There is no free lunch!" with the upgrades, for sure. All the new bells and whistles come with side effects, sometimes serious ones for serious problems. But, I'd think TMW would still be interested in specific use cases that expose such problems.
Thanks a lot for the hints. I was hoping to avoid completely restructuring my code, but I will definitely look into the datastore methods.
As for cutting the chunk size, I have tried various smaller numbers. The error message still pops up immediately, which is why I am almost sure that the function "does not even try".
And textscan is fully built in, so not even the preliminaries can be inspected to see what it might do in that regard.
If it indeed won't work at all, I'd say that qualifies as a bug and is in total violation of documented behavior.


Accepted Answer

Specify the encoding in your fopen() call so that the I/O library does not have to read through the entire file to determine the encoding. The default now is to detect the encoding automatically, but that can require reading the entire file to disprove the hypothesis that the file might contain UTF-8.
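A minimal sketch of this fix, assuming the files use a single-byte encoding such as ISO-8859-1 (substitute whatever encoding your data actually uses):

```matlab
% Open with an explicit encoding so fopen does not scan the whole file
% to auto-detect it. 'r' = read mode, 'n' = native byte ordering.
fileID = fopen(filename, 'r', 'n', 'ISO-8859-1');
```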

6 Comments

Assuming a large file in this case, would the error thus occur when FOPEN is called?
(the original question apparently does not state on which line the error is thrown)
Is this new behavior, Walter? OP's code worked with R2019a but not with R2021a...unless the behavior is in the underlying C i/o library and changed when TMW upgraded their compilers unbeknownst to them???
But, if "fopen() ... can require reading the entire file to disprove the hypothesis that the file might contain utf8.", it would seem that would introduce a sizable and observable latency into fopen()
It is a hypothesis Simon could certainly test on a failing file, however...
Specifying the encoding worked just fine. In case someone else encounters the same problem, this line fixed it for me:
fileID = fopen(filename, 'r', 'n', 'ISO-8859-1'); % open with an explicit character encoding
Thank you very much for your help!
R2019b if I recall correctly. It was documented.
The error occurs the first time that characters are read from the file, as it needs to figure out how to convert file bytes to characters. If only binary i/o is done then the check is not performed.
R2020a rather than R2019b.
File Encoding: Save MATLAB code files (.m) and other plain text files as UTF-8 encoded files by default
[...] When opening existing files, the Editor and other functions like type or fopen automatically determine the current encoding.
Huh. But it never even hints that this feature can cause an out-of-memory error in any of the documentation. That should definitely be highlighted along with the above description and fopen ought to be able to report the problem specifically as to the cause and fix instead of just dumping the "out-of-memory" standard error message.


More Answers (2)

Hi,
Refer to Memory Usage information located at the following URL:
Specifically to the sections:
1. Strategies for Efficient Use of Memory
2. Resolving "Out of Memory" Errors
Concepts:
1. Memory Allocation
2. Memory Management Functions
Some additional resources for resolving "Out of Memory" errors:
Hope it helps.
I recommend you look at the functionality in MATLAB to process large files and big data. The approach you've described sounds like you could use a tall array backed by a tabularTextDatastore.
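The approach above can be sketched roughly as follows. This is an untested outline, not the recommender's code: the 'ReadSize' and 'FileEncoding' values simply mirror the original textscan loop and the encoding fix from the accepted answer, and would need adjusting to the actual data:

```matlab
% Sketch: chunked reading of a large "|"-delimited file with a
% tabularTextDatastore instead of a manual fopen/textscan loop.
ds = tabularTextDatastore(filename, ...
    'Delimiter', '|', ...
    'FileEncoding', 'ISO-8859-1', ...  % explicit encoding avoids the auto-detection pass
    'ReadSize', 250000);               % rows per read, mirroring the original chunk size

while hasdata(ds)
    T = read(ds);  % returns a table with up to 250,000 rows
    % ... process T and append the reduced result, as in the original loop ...
end
```

A datastore can also back a tall array (`tall(ds)`), which lets MATLAB manage the chunking and deferred evaluation itself.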


Asked: 7 May 2021

Answered: 13 May 2021

