Array size of data from large file
Hi People, I have some very large text files (~3GB) that I need to process, and can't read the whole thing in at once. I know I can use csvread to read the data in one line at a time, but I would like to know before I begin what the dimensions of the data array are. The files are only numbers, no text or anything. Any ideas would be appreciated.
Answers (1)
Walter Roberson
30 October 2013
csvread() invokes dlmread(), which in turn invokes textscan(). You can call textscan() directly for better efficiency. Note that you can tell textscan() how many times you want the format to be re-used, so you can effectively tell it how many lines to process at one time.
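A minimal sketch of reading a fixed number of lines per textscan() call. The temporary demo file, the column count (3), and the chunk size (4 lines) are assumptions for illustration only; for a real file you would substitute your own file name, column count, and a much larger chunk size.

```matlab
% Demo input so the sketch runs on its own (hypothetical temp file).
filename = [tempname '.csv'];
fid = fopen(filename, 'w');
fprintf(fid, '%d,%d,%d\n', 1:30);       % 10 lines, 3 values per line
fclose(fid);

nCols      = 3;                          % assumed number of columns
chunkLines = 4;                          % lines per call (tiny, for the demo)
fmt        = repmat('%f', 1, nCols);     % one %f per column

fid = fopen(filename, 'r');
totalRows = 0;
while ~feof(fid)
    % Third argument: how many times the format is re-used, which here
    % means how many lines this call reads.
    C = textscan(fid, fmt, chunkLines, 'Delimiter', ',', 'CollectOutput', true);
    block = C{1};                        % up to chunkLines-by-nCols doubles
    if isempty(block)
        break                            % nothing left to read
    end
    totalRows = totalRows + size(block, 1);
    % ... process block here ...
end
fclose(fid);
delete(filename);                        % remove the demo file
```

With 'CollectOutput' set to true, the numeric columns come back as one matrix in C{1} rather than as one cell per column.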
Knowing the number of items involved is good for pre-allocation. Unfortunately, there is no method in MATLAB or any of the supported operating systems to find out how many lines are in a file without reading through the file and counting the lines. That can be expensive for large files.
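If you decide the counting pass is worth its cost, one way to do it is to read the file as raw bytes in large blocks and count the newline characters. This is only a sketch: the demo file and the 1 MiB block size are assumptions, and the column count is taken from the first line on the assumption that the file is comma-delimited with the same number of fields on every line.

```matlab
% Demo input so the sketch runs on its own (hypothetical temp file).
filename = [tempname '.csv'];
fid = fopen(filename, 'w');
fprintf(fid, '%d,%d,%d\n', 1:30);       % 10 lines, 3 values per line
fclose(fid);

fid = fopen(filename, 'r');

% Columns: count the delimiters on the first line (assumes a non-empty,
% comma-delimited file with no trailing delimiter).
firstLine = fgetl(fid);
nCols = numel(strfind(firstLine, ',')) + 1;
frewind(fid);                            % back to the start of the file

% Rows: count newline bytes (value 10) block by block.
blockSize = 2^20;                        % 1 MiB per read (assumed)
nLines    = 0;
lastByte  = 10;                          % so an empty file yields 0 lines
while true
    buf = fread(fid, blockSize, '*uint8');
    if isempty(buf)
        break
    end
    nLines   = nLines + sum(buf == 10);
    lastByte = buf(end);
end
fclose(fid);

if lastByte ~= 10
    nLines = nLines + 1;                 % last line had no trailing newline
end
delete(filename);                        % remove the demo file
```

This still reads every byte of the file once, so it does not avoid the cost described above; it only avoids parsing the numbers during the counting pass.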
A strategy that works fairly well for datasets of unknown length is to allocate in chunks: fill the current chunk, allocate another when you need it, and so on until you reach end of file, at which point you truncate the final chunk and include it in your data.
Depending on how you need to work with your data while you are reading it in, sometimes it is best to extend your existing array by writing at the new provisional endpoint. A reminder if you do that: adding extra columns is more efficient than adding extra rows. MATLAB stores arrays column by column, so the internal copy of the old data into the new array is most efficient when the data does not need to be reorganized as it goes, and adding rows requires that reorganization.
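The chunked allocation and the column advice can be combined as in the sketch below. It stores one file line per column, grows the buffer by a fixed number of columns by assigning to the new last column, and truncates at the end. The demo file, column count, chunk size, and growth step are all assumptions chosen small so the growth path is exercised; the file is assumed to be well formed.

```matlab
% Demo input so the sketch runs on its own (hypothetical temp file).
filename = [tempname '.csv'];
fid = fopen(filename, 'w');
fprintf(fid, '%d,%d,%d\n', 1:30);       % 10 lines, 3 values per line
fclose(fid);

nCols      = 3;                          % assumed values per line
chunkLines = 4;                          % lines read per textscan call
growBy     = 8;                          % columns added per reallocation
fmt        = repmat('%f', 1, nCols);

capacity = chunkLines;                   % deliberately small start
data     = zeros(nCols, capacity);       % one file line per COLUMN
nUsed    = 0;

fid = fopen(filename, 'r');
while ~feof(fid)
    C = textscan(fid, fmt, chunkLines, 'Delimiter', ',', 'CollectOutput', true);
    block = C{1}.';                      % transpose: nCols-by-k
    k = size(block, 2);
    if k == 0
        break
    end
    if nUsed + k > capacity
        % Grow by writing at the new provisional endpoint; MATLAB
        % zero-fills the columns in between.
        capacity = capacity + max(growBy, k);
        data(:, capacity) = 0;
    end
    data(:, nUsed+1:nUsed+k) = block;
    nUsed = nUsed + k;
end
fclose(fid);
delete(filename);                        % remove the demo file

data = data(:, 1:nUsed);                 % truncate the unused tail
```

If you need the usual one-line-per-row layout afterwards, a single transpose (data.') at the end costs one copy instead of one per growth step.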
If you do not need to work with the data until the end, then a useful strategy can be to create a cell array, read in a chunk of data and store that chunk as a cell, and keep reading chunks and pushing them into further cell array slots. Then at the end, you can vertcat() or horzcat() or cell2mat() the cell array chunks into one real array. This strategy involves a lot less intermediate data movement than extending a numeric array by writing to the end of it.
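A sketch of the cell-array approach, under the same assumptions as before (a small comma-delimited demo file with 3 columns, read 4 lines at a time). Appending to the cell array only adds a reference to each chunk, so the numeric data is copied once, in the final vertcat().

```matlab
% Demo input so the sketch runs on its own (hypothetical temp file).
filename = [tempname '.csv'];
fid = fopen(filename, 'w');
fprintf(fid, '%d,%d,%d\n', 1:30);       % 10 lines, 3 values per line
fclose(fid);

nCols      = 3;                          % assumed values per line
chunkLines = 4;                          % lines read per textscan call
fmt        = repmat('%f', 1, nCols);

chunks = {};                             % one cell per chunk read
fid = fopen(filename, 'r');
while ~feof(fid)
    C = textscan(fid, fmt, chunkLines, 'Delimiter', ',', 'CollectOutput', true);
    if isempty(C{1})
        break
    end
    chunks{end+1} = C{1};                %#ok<AGROW> cheap: grows the cell only
end
fclose(fid);
delete(filename);                        % remove the demo file

data = vertcat(chunks{:});               % single concatenation at the end
```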
You might find that your algorithm can meaningfully process the data in chunks. For example, smoothing over a chunk boundary only requires the previous chunk and the current chunk to be available, after which you can discard the previous chunk or refill it with new data (two-buffer algorithms used to be pretty common in the days of smaller memory).
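As a sketch of the two-buffer idea, the loop below keeps only the previous and current chunks in memory. It uses row-to-row differences as a simpler stand-in for smoothing, since both need data from across the chunk boundary. The demo file, column count, chunk size, and the names prev, curr, maxJump, and nDiffs are assumptions for illustration.

```matlab
% Demo input so the sketch runs on its own (hypothetical temp file).
filename = [tempname '.csv'];
fid = fopen(filename, 'w');
fprintf(fid, '%d,%d,%d\n', 1:30);       % 10 lines, 3 values per line
fclose(fid);

nCols      = 3;                          % assumed values per line
chunkLines = 4;                          % lines read per textscan call
fmt        = repmat('%f', 1, nCols);

prev    = [];                            % previous chunk (empty at start)
maxJump = 0;                             % largest |difference| between lines
nDiffs  = 0;                             % number of line-to-line differences

fid = fopen(filename, 'r');
while ~feof(fid)
    C = textscan(fid, fmt, chunkLines, 'Delimiter', ',', 'CollectOutput', true);
    curr = C{1};
    if isempty(curr)
        break
    end
    if isempty(prev)
        d = diff(curr, 1, 1);            % first chunk: nothing to bridge
    else
        % Bridge the boundary with the last line of the previous chunk.
        d = diff([prev(end, :); curr], 1, 1);
    end
    if ~isempty(d)
        maxJump = max(maxJump, max(abs(d(:))));
    end
    nDiffs = nDiffs + size(d, 1);
    prev = curr;                         % the old previous chunk is discarded
end
fclose(fid);
delete(filename);                        % remove the demo file
```

For this particular computation only the last line of the previous chunk is needed; a smoothing window wider than two samples would keep correspondingly more of it.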