How to read large text data into MATLAB

Fan Li on 5 Jan 2018
Commented: Abdullahi Samantar on 12 Dec 2018
Hi everyone, I have a text file of up to 10 GB that has to be read into MATLAB. Part of the data is listed below:
ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
4323
ITEM: BOX BOUNDS pp pp ff
3.6821000000000000e-02 3.6996820000000000e+01
8.5320999999999994e-02 3.4761423000000001e+01
9.0000000000000002e-06 6.8636712000000003e+01
ITEM: ATOMS id c_water_force[1] c_water_force[2] c_water_force[3] c_water_force[4] c_water_force[5] c_water_force[6]
2241 51.4573 -48.0145 -55.5854 0.00121546 -0.00693737 -0.00454935
2242 -25.5898 -24.3081 -29.3729 0.00671099 0.00205397 -0.0108453
2243 9.2867 27.1493 -37.9274 -0.00115821 0.00912371 -0.00178601
2244 3.89714 -48.5019 70.5903 0.0041159 -0.00255481 -0.0029498
2245 49.8803 -40.1819 -5.30361 -0.0106695 0.0224494 0.00918698
2246 0.22115 -19.9758 -2.30173 0.0190817 0.0262146 -0.0153229
2247 -53.6289 50.5517 -23.5032 0.00388499 -0.00559089 0.000787281
.
.
.
.
.
.
.
.
ITEM: TIMESTEP
10
ITEM: NUMBER OF ATOMS
4323
ITEM: BOX BOUNDS pp pp ff
3.6821000000000000e-02 3.6996820000000000e+01
8.5320999999999994e-02 3.4761423000000001e+01
9.0000000000000002e-06 6.8636712000000003e+01
ITEM: ATOMS id c_water_force[1] c_water_force[2] c_water_force[3] c_water_force[4] c_water_force[5] c_water_force[6]
2241 -50.0606 -93.6118 -70.4534 0.000504085 -0.00684199 -0.00394166
2242 -14.4928 20.0993 3.55963 0.00244236 0.00203074 -0.0162865
2243 -2.64823 8.26566 23.6457 -0.000503352 0.0140246 -0.00909782
2244 -153.189 40.6383 -12.0141 0.00192712 -0.00177534 -0.00194966
2245 35.0712 -14.4107 6.31868 0.00668828 0.012556 0.00468532
2246 22.0675 -14.7867 61.4774 0.0182799 0.0194239 -0.00942033
2247 -3.80959 -88.6786 1.61222 0.00459477 -0.00577238 0.000324204
2248 -18.4777 -9.35017 -1.12766 0.0146401 0.00924069 -0.00730373
2249 16.2354 -7.34658 -25.1694 -0.0169203 0.0249397 0.0085598
2250 110.508 19.9749 -4.95758 -0.00500049 0.000961677 0.00667405
2251 -7.46059 3.35324 -41.665 0.0175383 -0.00791068 -0.00702065
Basically, the file consists of many parts, each starting with "ITEM: TIMESTEP". For each part I have to skip the first 9 lines and then read the remaining lines.
I tried the textscan function (maybe I misused it), but it is very slow. Is there a faster way to do this in MATLAB?
  1 Comment
Cedric on 6 Jan 2018
Edited: Cedric on 6 Jan 2018
How much RAM do you have available?


Accepted Answer

Cedric on 6 Jan 2018
Edited: Cedric on 6 Jan 2018
If you have enough RAM for this, the following could run a little faster. It is much less versatile than Per's solution, though, and exploits specific characters present in the header (the '6]' that ends the ATOMS line and the 'ITEM: T' that starts each block). You may have to adapt it a bit if there are, e.g., other types of header content:
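% Read the whole file into memory as one char vector.
content = fileread( 'data.txt' ) ;
% Each block ends just before the next 'ITEM: T(IMESTEP)'; the last one ends at EOF.
blockEnds = strfind( content, 'ITEM: T' ) - 1 ;
blockEnds = [blockEnds(2:end), numel( content )] ;
% The numbers start right after the '6]' that closes the ATOMS header line.
blockStarts = strfind( content, '6]' ) + 3 ;
nBlocks = numel( blockStarts ) ;
data = cell( nBlocks, 1 ) ;
fprintf( '%d blocks found.\n', nBlocks ) ;
for bId = 1 : nBlocks
    % Parse all numbers of the block, then reshape to one row per atom (7 columns).
    data{bId} = reshape( sscanf( content(blockStarts(bId):blockEnds(bId)), '%f' ), 7, [] ).' ;
end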
PS: this takes < 20s for a 1GB data file on a small laptop (with 32GB RAM though).
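For readers adapting the code above to a different dump file: its two hard-coded assumptions are the search string '6]' and the column count 7. The following is a minimal sketch, not part of the original answer, that derives both from the "ITEM: ATOMS" header line instead. It assumes every block has such a header line with the same columns and that the content fits in RAM. The three-column sample string and the variable names (atomsHdr, hdrStarts, hdrEnds, stepStarts, names, nCols) are illustrative only:

```matlab
% Tiny two-block stand-in with 3 columns; for a real file use
% content = fileread( 'data.txt' ) instead.
atomsHdr = 'ITEM: ATOMS id c_water_force[1] c_water_force[2]' ;
content = sprintf( [ ...
    'ITEM: TIMESTEP\n0\n%s\n2241 51.4573 -48.0145\n2242 -25.5898 -24.3081\n', ...
    'ITEM: TIMESTEP\n10\n%s\n2241 -50.0606 -93.6118\n2242 -14.4928 20.0993\n'], ...
    atomsHdr, atomsHdr ) ;

% Start and end index of every "ITEM: ATOMS ..." line, newline included.
[hdrStarts, hdrEnds] = regexp( content, 'ITEM: ATOMS[^\n]*\n' ) ;
% A block of numbers ends just before the next "ITEM: TIMESTEP" (or at EOF).
stepStarts = strfind( content, 'ITEM: TIMESTEP' ) ;
blockEnds  = [stepStarts(2:end) - 1, numel( content )] ;
% Column count = number of names that follow "ITEM: ATOMS" (11 characters).
names = strsplit( strtrim( content(hdrStarts(1)+11 : hdrEnds(1)) ) ) ;
nCols = numel( names ) ;

data = cell( numel( hdrEnds ), 1 ) ;
for bId = 1 : numel( hdrEnds )
    % Everything between the ATOMS header and the next block is numeric.
    block = content(hdrEnds(bId)+1 : blockEnds(bId)) ;
    data{bId} = reshape( sscanf( block, '%f' ), nCols, [] ).' ;
end
```

Each data{bId} is then an nAtoms-by-nCols matrix, as in the original code. The sketch still relies on the file starting with "ITEM: TIMESTEP" and on nothing but numbers appearing between an ATOMS header and the next block.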
  2 Comments
Fan Li on 8 Jan 2018
Thanks. This way is faster.
Abdullahi Samantar on 12 Dec 2018
Hi Fan Li,
How did you adapt Cedric's code to get it to run on your large txt file (LAMMPS)?
I couldn't figure it out.
Thank you


More Answers (2)

per isakson on 5 Jan 2018
Edited: per isakson on 6 Jan 2018
Given:
  • All headers consist of 9 lines
  • All data blocks consist of 7 columns of numerical data
  • The blocks of numerical data should be converted to double. (Added later.)
  • The columns of the data are separated by space, char(32)
  • There is enough RAM to store the parsed data. Nearly 10GB is needed to store it as double. Single would introduce a rounding error.
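As a rough check on the last bullet, the per-block cost in double follows directly from the header values in the question (4323 atoms, 7 columns, 8 bytes per double). The second part illustrates the rounding that single would introduce, using the first force value of the sample data. This is only a back-of-the-envelope sketch, and the variable names are illustrative:

```matlab
nAtoms = 4323 ;   % from "ITEM: NUMBER OF ATOMS" in the question
nCols  = 7 ;      % id plus six c_water_force columns
bytesPerBlock = nAtoms * nCols * 8 ;   % 8 bytes per double
fprintf( 'One block as double: %d bytes.\n', bytesPerBlock ) ;

% Converting to single and back does not return the original double value.
x = 51.4573 ;
roundTrip = double( single( x ) ) ;
fprintf( 'Rounding error introduced by single: %g\n', abs( roundTrip - x ) ) ;
```

Multiplying bytesPerBlock by the number of timesteps in the file gives the total needed for the parsed data.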
Try:
>> cac = cssm( 'cssm.txt' );
>> whos cac
Name Size Bytes Class Attributes
cac 1x6 3696 cell
>> cac
cac =
[7x7 double] [11x7 double] [7x7 double] [11x7 double] [7x7 double] [11x7 double]
>>
>> cac{1}
ans =
1.0e+03 *
2.2410 0.0515 -0.0480 -0.0556 0.0000 -0.0000 -0.0000
2.2420 -0.0256 -0.0243 -0.0294 0.0000 0.0000 -0.0000
2.2430 0.0093 0.0271 -0.0379 -0.0000 0.0000 -0.0000
2.2440 0.0039 -0.0485 0.0706 0.0000 -0.0000 -0.0000
2.2450 0.0499 -0.0402 -0.0053 -0.0000 0.0000 0.0000
2.2460 0.0002 -0.0200 -0.0023 0.0000 0.0000 -0.0000
2.2470 -0.0536 0.0506 -0.0235 0.0000 -0.0000 0.0000
>>
where
function cac = cssm( ffs )
    % Read every block: skip its 9 header lines, then parse 7 numeric columns.
    fid = fopen( ffs );
    cac = cell(1,0);
    while not( feof( fid ) )
        cac(1,end+1) = textscan( fid, '%f%f%f%f%f%f%f' ...
            , 'Headerlines',9, 'CollectOutput',true );
    end
    fclose( fid );
end
and cssm.txt contains three copies of the sample data from the question.
"textscan [...] is very slow. Is there a faster way [...] Matlab?" AFAIK: No, not significantly faster. However, I don't agree that it's very slow.
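Once the blocks are in a cell array such as cac, a typical next step is to follow one atom across timesteps. A minimal sketch, assuming every block has the same seven columns with the atom id in column 1 and that the id occurs exactly once per block. The two small blocks below are rows copied from the sample data in the question; atomId and track are illustrative names:

```matlab
% Two blocks with two rows each, copied from the sample data in the question.
cac = { [2241  51.4573 -48.0145 -55.5854 0.00121546  -0.00693737 -0.00454935 ;
         2242 -25.5898 -24.3081 -29.3729 0.00671099   0.00205397 -0.0108453 ], ...
        [2241 -50.0606 -93.6118 -70.4534 0.000504085 -0.00684199 -0.00394166 ;
         2242 -14.4928  20.0993  3.55963 0.00244236   0.00203074 -0.0162865 ] } ;

atomId = 2241 ;
% One row per timestep block for the chosen atom.
track = zeros( numel( cac ), size( cac{1}, 2 ) ) ;
for k = 1 : numel( cac )
    rows = cac{k}(:,1) == atomId ;   % logical index of the atom in block k
    track(k,:) = cac{k}(rows,:) ;
end
disp( track(:,2:4) )   % columns 2 to 4 of that atom, one row per block
```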
  2 Comments
Fan Li on 6 Jan 2018
Hi per isakson
I am using the fgetl function, which is faster.
per isakson on 6 Jan 2018
Edited: per isakson on 6 Jan 2018
That's comparing apples to oranges. I assumed, without stating it, that the "numerical blocks" should be parsed.



Steven Lord on 8 Jan 2018
If your data is too large to fit in memory all at once, consider using a datastore. Since you have data in the headers that I assume you want to access, using a TabularTextDatastore probably won't suit your needs. You may need to use a general FileDatastore or develop your own custom datastore using your knowledge of the way your data is formatted.
Once you have a datastore you could use it to create a tall array.
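If the file does not fit in memory, one alternative that avoids a datastore altogether is to read and process one timestep block at a time with low-level file I/O. The sketch below is a plain fopen/fgetl/fscanf loop, not a datastore or tall array. It assumes the 9-line header shown in the question, 7 numeric columns, and that each block contains exactly the number of rows announced under "ITEM: NUMBER OF ATOMS". The sample file written at the top and the per-block reduction (a column mean) are placeholders for the real file and the real processing:

```matlab
% Write a tiny two-block sample file; replace fname with the real dump file.
fname = [tempname, '.txt'] ;
fid = fopen( fname, 'w' ) ;
for step = [0 10]
    fprintf( fid, 'ITEM: TIMESTEP\n%d\nITEM: NUMBER OF ATOMS\n2\n', step ) ;
    fprintf( fid, 'ITEM: BOX BOUNDS pp pp ff\n0 1\n0 1\n0 1\n' ) ;
    fprintf( fid, 'ITEM: ATOMS id f1 f2 f3 f4 f5 f6\n' ) ;
    fprintf( fid, '%d %g %g %g %g %g %g\n', [1 step 0 0 0 0 0 ; 2 step 1 1 1 1 1].' ) ;
end
fclose( fid ) ;

nCols = 7 ;
steps = [] ;
colMeans = [] ;
fid = fopen( fname, 'r' ) ;
while true
    tline = fgetl( fid ) ;
    if ~ischar( tline ), break ; end                            % end of file
    if ~strncmp( tline, 'ITEM: TIMESTEP', 14 ), continue ; end  % skip blank/leftover lines
    step   = sscanf( fgetl( fid ), '%f' ) ;        % header line 2: timestep value
    fgetl( fid ) ;                                 % header line 3: "ITEM: NUMBER OF ATOMS"
    nAtoms = sscanf( fgetl( fid ), '%f' ) ;        % header line 4: atom count
    for k = 1 : 5, fgetl( fid ) ; end              % header lines 5-9: box bounds, ATOMS line
    % Read exactly this block; only one block is held in memory at a time.
    block = fscanf( fid, '%f', [nCols, nAtoms] ).' ;
    steps(end+1)      = step ;                     %#ok<AGROW>
    colMeans(end+1,:) = mean( block, 1 ) ;         %#ok<AGROW>
end
fclose( fid ) ;
delete( fname ) ;
```

Because the headers are consumed line by line, the timestep from each header stays available, and only the reduced result per block (here one row of colMeans) accumulates in memory.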
  1 Comment
Fan Li on 16 Jan 2018
Edited: Fan Li on 16 Jan 2018
Hi Steven Lord
I have read about tall arrays and datastores. They are useful for me, but I do not know how to skip the headers, which I do not need. The function I am using now and the other methods provided here do not work with a datastore, and there is little material on skipping headers with a datastore. Can you tell me how to skip the headers with a datastore? The format of the data is given above.
Thanks

