Loading Large .txt files

Question

ha ha on 16 Jun 2019

https://kr.mathworks.com/matlabcentral/answers/467290-loading-large-txt-files

Commented: per isakson on 16 Jun 2019

Let's say I have a very large .txt file with 200 million rows and 11 columns (a 200m-by-11 matrix). All values are numeric (e.g., 10, 100, 200...). The file is about 20 GB.

When I load this data in MATLAB, an "Out of Memory" error occurs:

clear; clc;
filename = 'test42.txt';
load('test42.txt');
P = test42(:,1:3); % coordinates (x,y,z): all rows, columns 1 to 3

My system: Windows 10 64-bit, 16 GB RAM, Core i7, 1 TB HDD, 1 TB SSD.

Actually, I only want to load the first 3 columns, so the matrix I want is 200m-by-3. With fewer columns, I hope MATLAB will be able to load the data.

Do you know any way to read the whole dataset, or to read only the first 3 columns? Thanks.

The format of my file is like this.


Answer 1

per isakson on 16 Jun 2019

https://kr.mathworks.com/matlabcentral/answers/467290-loading-large-txt-files#answer_379359

Edited: per isakson on 16 Jun 2019


You didn't say how much physical memory is in your system.

MATLAB provides ways to handle large text files (see "Large Files and Big Data" in the documentation), but forget that for a moment. It's not a free lunch.

" (e.g., 10, 100 ,200...)" Does that mean positive whole numbers only? If so, do you know the maximum value in each of the first three columns?

"When I load this data" What exactly did you do?

The first three columns will take 4.8 GB to store as double (200 million rows, 3 columns, 8 bytes per value).

>> 200*1e6*3*8/1e9
ans =
          4.8

But do you need to use double?
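The question of class matters because the memory footprint scales with the bytes per element. The arithmetic above can be repeated for a few other numeric classes, as a rough sketch. Whether a smaller class is actually safe depends on the value range and precision of the data, which has not been established at this point in the thread; the list of classes below is only an illustration.

```matlab
% Memory needed for a 200e6-by-3 array in several numeric classes.
% The choice of classes is illustrative, not a recommendation.
nRows    = 200e6;
nCols    = 3;
classes  = {'double', 'single', 'int32', 'uint16'};
bytesPer = [8 4 4 2];                      % bytes per element of each class
gb       = nRows * nCols * bytesPer / 1e9; % size in GB, same units as above
for k = 1:numel(classes)
    fprintf('%-6s: %.1f GB\n', classes{k}, gb(k));
end
```

textscan can read directly into such classes with conversion specifiers like %f32 or %d32 instead of %f, so no intermediate double array is needed.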

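Simplest first: convert the first three columns to double and skip the remaining columns. The format below reads three fields and skips seven, i.e. it assumes ten columns per row; adjust the number of %*f if the file has a different column count (the question mentions 11). Try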

%%
fid = fopen( 'c:\whatever\the_huge_text_file.txt', 'r' );
cac = textscan( fid, '%f%f%f%*f%*f%*f%*f%*f%*f%*f', 'HeaderLines',0 );
fclose( fid );

Or use an alternative formatSpec, which is a bit easier to read. It says: read the first three columns and skip the rest of the line, up to the newline character.

cac = textscan( num2str([1:10]), '%f%f%f%*[^\n]', 'HeaderLines',0 );
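To see what this format does on more than one row, here is a small sketch on a made-up two-line string with 11 columns per line. The sample values are an assumption for illustration only; 'CollectOutput' is used so the three columns come back as one matrix.

```matlab
% Two sample rows with 11 columns each (made-up data, not from the real file)
str = sprintf('%d %d %d %d %d %d %d %d %d %d %d\n', 1:22);

% Read three numbers per line, then skip everything up to the newline
cac = textscan(str, '%f%f%f%*[^\n]', 'CollectOutput', true);
P   = cac{1};   % k-by-3 double matrix, one row per line of text
disp(P)
```

Because the rest of each line is skipped as a whole, this format does not depend on how many columns follow the first three.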

Next step requires input from you on the numbers in the three first columns.

In response to the edited question

To keep the precision of the numbers in the text file you need to use double. I'm positive that

%%
fid = fopen( 'c:\whatever\test42.txt', 'r' );
cac = textscan( fid, '%f%f%f%*f%*f%*f%*f%*f%*f%*f', 'CollectOutput',true );
fclose( fid );

will load the three first columns without problems (on your system).


ha ha on 16 Jun 2019

importfile_upgrade.m

%%
tic
clear; clc;
filename = 'Rama9_A.txt';
inputfile = importfile_upgrade(filename); % 1x1 cell containing a k-by-3 matrix of all the data
P = cat(1, inputfile{:}); % coordinates (x,y,z): all rows, columns 1 to 3
toc

per isakson on 16 Jun 2019

Thank you for showing these results.

The three elapsed times puzzle me. The differences are so large.

However, one thing I know is that it's difficult to reproduce results from reading files. A specific result depends on the state of the system cache. Large parts of the file may already be in the cache and furthermore the cache may be more or less "fragmented".

You use the statement

P=cat(1, inputfile{:});

It's possible to increase readability (imo) and save a second or two by indexing the single cell directly instead of concatenating. Try

>> cssm
Elapsed time is 1.812582 seconds.
Elapsed time is 2.085810 seconds.
Elapsed time is 0.000041 seconds.
Elapsed time is 0.314024 seconds.
>> cssm
Elapsed time is 1.700285 seconds.
Elapsed time is 2.239885 seconds.
Elapsed time is 0.000040 seconds.
Elapsed time is 0.370981 seconds.

where the script cssm reads

%% Sample data 
tic, cac = { reshape( [1:194500412*3], [],3 ) }; toc
%%
tic, P1 = cat(1,cac{:}); toc
%%
tic, P2 = cac{1}; toc
%%
tic, P3 = cell2mat( cac ); toc

and P1 and P2 are equal

 >> all( P1==P2, 1 )
ans =
  1×3 logical array
   1   1   1
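The same comparison can be tried at a small scale without needing gigabytes of memory. This is only a sketch with a tiny stand-in matrix (the 4-by-3 size is an assumption for illustration); it checks that the three ways of getting the matrix out of a 1x1 cell give the same result.

```matlab
% Small stand-in for the 1x1 cell that textscan returns with 'CollectOutput',true
cac = { reshape(1:12, [], 3) };   % one cell holding a 4-by-3 matrix

P1 = cat(1, cac{:});    % concatenates the cell contents into a new array
P2 = cac{1};            % plain indexing into the cell
P3 = cell2mat(cac);     % generic cell-to-matrix conversion

tf = isequal(P1, P2, P3);   % true when all three results are identical
disp(tf)
```

The near-zero time for P2 in the timings above is consistent with MATLAB's copy-on-write behavior: indexing the cell does not copy the data until one of the variables is modified.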


Answer 2

Walter Roberson on 16 Jun 2019

https://kr.mathworks.com/matlabcentral/answers/467290-loading-large-txt-files#answer_379357

textscan() is more likely to succeed than some of the other alternatives.

Most reliable would be to pre-allocate all of the storage and then process the file one chunk at a time (for efficiency). For example, if you told textscan() to read 50 lines of the file, that would be just under 4 KB, which would fit easily into MATLAB's "small blocks" storage strategy, where it can extend an array in place if the array is sufficiently small. Copy the 50 rows into the master matrix, then proceed to the next chunk.
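The pre-allocate-and-read-in-chunks idea could look roughly like the sketch below. It is not code from this thread: the demo file, its size, and the chunk size are assumptions chosen so the example runs on its own, and the '%f%f%f%*[^\n]' format is borrowed from the other answer. For the real file, nRows would be about 200e6 and fname the actual path.

```matlab
% Sketch of chunked reading into a preallocated matrix.
% Assumptions (not from the thread): the demo file, nRows, and chunkRows.

% Create a small 11-column demo file so the sketch is self-contained
nRows = 1234;
M     = reshape(1:nRows*11, 11, []).';             % nRows-by-11, made-up values
fname = [tempname '.txt'];
fid   = fopen(fname, 'w');
fprintf(fid, [repmat('%d ', 1, 10) '%d\n'], M.');  % one matrix row per text line
fclose(fid);

chunkRows = 500;                 % rows per textscan call; a tuning knob
P   = zeros(nRows, 3);           % preallocate the master matrix once
row = 0;                         % number of rows filled so far

fid = fopen(fname, 'r');
while ~feof(fid)
    % Read up to chunkRows lines: keep columns 1-3, skip the rest of each line
    cac = textscan(fid, '%f%f%f%*[^\n]', chunkRows, 'CollectOutput', true);
    n   = size(cac{1}, 1);
    if n == 0, break; end        % nothing more to read
    P(row+1:row+n, :) = cac{1};  % copy the chunk into the master matrix
    row = row + n;
end
fclose(fid);
delete(fname);                   % remove the demo file

P(row+1:end, :) = [];            % drop unused rows if nRows was an overestimate
```

The chunk size only bounds the temporary memory used per textscan call; the 50 lines mentioned above are one choice, and larger chunks mean fewer calls at the cost of bigger temporaries.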
