Load parts of VERY LARGE text file content and create a smaller matrix

조회 수: 8 (최근 30일)
John Tetteh
John Tetteh . 2015년 3월 11일
댓글: dpb . 2015년 3월 14일
I have a very large file. Sample format is attached. There is a header with comment marks $$.
The rest of the data begins from Start 1, Pos #, , followed by two columns of data the length
to Start 2, Pos # etc. Note that the columns after Pos # is NOT fixed.
The length of the two columns after Start #, Pos # ranges between 100 to around 500,000.
The Scan # ranges from 1 to around 4000.
I want to be able to read in sequentially the two columns after each Start 1, Pos# to just before Start 2 and Pos # and then move on to Start 2, Pos # etc.
I have tried textscan with block size but this is not working well.
It is not possible to load all data directly into Matlab.
Any directions will be greatly appreciated.
  댓글 수: 6
dpb 2015년 3월 11일
OK, that's a big step forward...I've gotta' run and finish up the evening chores now but I'll try to take a look at it later on this evening. My first hunch is can make a textscan call work ok since you can process by grouping but I'll have to 'spearmint to test the hypothesis...altho the basic idea is once you get to the beginning of the first section you then do an unterminated read of the floating point data; textscan will convert until it errors on the next section. Then you trap the error and get the next character line to reset the file pointer to a clean record and repeat. "Rinse and repeat" until feof.
As say, one generally has to test these things on a given file to work out the nitty, but the above tactic generally works as a tactic.

댓글을 달려면 로그인하십시오.

채택된 답변

dpb 2015년 3월 11일
편집: dpb 님. 2015년 3월 12일
OK, stuff's taken care of and I'm in for the evening (we're back to family farm; retired from the consulting gig so this is my fun at keeping hand in a little). Anyway, the basic outline is--
>> fid=fopen('vlarge.txt');
id=cell2mat(textscan(fid,'Start %d','headerlines',5)); % first section ID
pos=cell2mat(textscan(fid,'Pos %f')); fprintf('\n Section %d\n', id),
dat=cell2mat(textscan(fid,'%f %f')); fprintf('%.3f %d\n',dat.')
% process first section here
while ~feof(fid) % w/ the header out of the way, do rest of file...
id=cell2mat(textscan(fid,'Start %d'));
pos=cell2mat(textscan(fid,'Pos %f'));
dat=cell2mat(textscan(fid,'%f %f'));
fprintf('\n Section %d\n', id),fprintf('%.3f %d\n',dat.')
% proces subsequent sections here, of course...
Section 1
100.037 0
118.979 0
118.983 1
118.987 5
Section 2
100.037 0
100.966 0
100.969 1
100.973 0
121.007 7
Section 3
100.037 8
100.966 0
100.969 1
100.973 0
121.007 0
141.040 0
161.074 20
181.107 0
As you see, you're lucky with the blank line in the file that terminates the translation and that all you need for the indeterminate section lengths is the two fields to return the array in the right shape. Note I also went ahead and cast the cell output from textscan to an ordinary array at the time of the read; I almost always do this unless there's some specific reason for needing a cell array.
  댓글 수: 7
dpb 2015년 3월 14일
Yes, you always want to process in as large of blocks as possible; the line-by-line parsing is gare-on-teed to be slow for large files and is to be relied on only when there's no other way.
It is, however, appropriate for the leading header to simply look for the beginning of the data section when there's a variable number of lines and no data at the beginning of the file that encodes that to be able to compute the 'headerlines' parameter value. It'll be a little slower than being able to use the header count of course, but since it's only done once it won't be a killer. One can refine the search if one knows there's some minimum number of lines and then possibly more by not doing any testing until that minimum number have been read and all sorts of other fancier things for any specific file of course, including up to reading a sizable chunk of a file into memory as a character array image and doing the searches in memory, then reposition the file for the actual scan/conversion...

댓글을 달려면 로그인하십시오.

추가 답변 (1개)

Robert Cumming
Robert Cumming 2015년 3월 11일
Use fopen to open the file then parse it line by line saving what you need and ignoring the rest. Remember to close the file with fclose as well.
  댓글 수: 4
John Tetteh
John Tetteh 2015년 3월 13일
Hi Robert,
Many thanks for your efforts. I have found a solution on the forum. I really want to thank everyone who contributed their time towards my question.
Best regards,

댓글을 달려면 로그인하십시오.


Help CenterFile Exchange에서 Data Import and Export에 대해 자세히 알아보기


Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by