Fast way to read text file of formatted data

I have a text file formatted as (just showing the first, second, second to last and last lines):
0.00 68.000 68.000
1.00 68.001 68.000
...
1923.00 1871.164 1869.803
1924.00 1871.484 1870.134
The data values are delimited by spaces (the number of spaces varies, depending on the data values).
I want to import these as floating-point numbers, eventually into a 3-column array. I will always know ahead of time how many columns there are, but I will not know ahead of time how many rows there will be.
I can input this easily using any one of several commands: dlmread, importdata, textscan, or fscanf. For a resulting 1925x3 array, fscanf is the fastest, taking around 0.004 s. Since I will have to do this import over a hundred thousand times in my MATLAB script, is there a faster way to do this? Thanks
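For reference, a minimal sketch of the fscanf approach described above ('data.txt' is a placeholder file name):

```matlab
% Minimal sketch; 'data.txt' stands in for the real file name.
fid = fopen('data.txt', 'rt');
A = fscanf(fid, '%f %f %f', [3, Inf]).';   % read all rows, transpose to N-by-3
fclose(fid);
```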

3 Answers

Cedric on 27 May 2014
Edited: Cedric on 27 May 2014

2 votes

You should/could perform this operation using a RAM drive/disk. There is little point in writing temporary files to a physical disk when you have a lot of them. Once you have the data in MATLAB, save it in one or a few .mat files, to avoid having to deal with that many files in the future. Your processing flow becomes:
  1. MATLAB code writes MATLAB variables to RAM drive/txt file.
  2. Ancient software reads RAM drive/txt file, processes content, outputs outcome to RAM drive/txt file.
  3. MATLAB code reads RAM drive/txt file, processes content, stores outcome in variable.
  4. Loop to 1 if not done.
When done: store the variable in a single .mat file. If it is too large to hold in memory, split it into a few blocks, e.g. 1 GB each, to minimize the number of files. As mentioned by Star Strider, the .mat format is well suited for storing large amounts of data; it is based on HDF5 [ ref ] (from version 7.3 on). Yet, if you want to push even further, this post is not uninteresting.
To optimize read/write operations on a very large number of files, stick to low-level functions, and be as specific as possible in your calls (i.e. it is more efficient to specify a delimiter than to let the function find it by testing, more efficient to specify a date format than to let the function infer it, etc.).
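As a sketch of the "be specific" point above, a fully specified low-level read of one of these 3-column files might look like this (the RAM-drive path is hypothetical; adjust it for your platform):

```matlab
% Hypothetical RAM-drive path; e.g. 'R:\out.txt' on Windows or
% '/mnt/ramdisk/out.txt' on Linux.
fid = fopen('/mnt/ramdisk/out.txt', 'rt');
% Spelling out the format ('%f %f %f') spares fscanf any delimiter or
% format detection work.
data = fscanf(fid, '%f %f %f', [3, Inf]).';   % N-by-3 double array
fclose(fid);
```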

1 Comment

Ted on 27 May 2014
Thanks for the RAM disk suggestion. I did test a RAM disk when I was trying out different data import methods (dlmread, importdata, textscan, fscanf): I did a simple large import and looped it 10,000 times. The RAM disk was only about 5% faster; I suspect most of the execution time was spent parsing the imported text file. Now that my code is done, I will try some timing tests again. Even if the difference in time is small, reducing HDD wear and tear is a good idea.
I settled on fscanf; it seemed to be the fastest, but not by much. I tried several different format specs, but they all timed about the same. Since I don't have to provide a delimiter, I think I've made fscanf as fast as it can be.
I will also try rifat's code below.


Star Strider on 26 May 2014

1 vote

I would import them once and save them as a ‘.mat’ file, with their variable names included. Then load the ‘.mat’ each time instead.
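A minimal sketch of that pattern ('data.txt' and 'data.mat' are placeholder names):

```matlab
% One-time conversion from text to binary .mat:
fid = fopen('data.txt', 'rt');
A = fscanf(fid, '%f %f %f', [3, Inf]).';
fclose(fid);
save('data.mat', 'A');

% Thereafter, load the binary file instead of re-parsing the text:
S = load('data.mat');
A = S.A;
```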

9 Comments

Ted on 26 May 2014
Sorry, I should have been clearer. Each text file is different.
I still suggest that once you read in the data from each text file, you save the data from each file as a separate ‘.mat’ file (along with its variable names, if you choose to associate variable names with the data). Every ‘.mat’ file you then load (using the load function) will bring in not only the data but also the variable names.
The ‘.mat’ files are binary files, so they not only require less space than the original files but also load much faster than reading the text files.
You have 100,000 text files? That's a lot. If you created one file every second, that would take over 27 hours. Where did all these files originate?
Can you avoid the problem in the first place and just create one file, such as a binary file, a .mat file, or an image?
Ted on 27 May 2014
Edited: Ted on 27 May 2014
Actually, I might see some cases where I will approach a million files!
Another point of clarification (sorry): I only use the text files one at a time, and I am stuck working with text files.
I am working on analysis software that uses the outputs of 40-year-old executables (aka ancient software) as inputs; the text files are generated on the fly depending on previous results. I can't edit the old code, so I am stuck with text-file inputs and outputs. Here is a simple code flow:
  • My MATLAB Code (step 1)
  • Write MATLAB variables to txt file
  • Ancient software reads txt file
  • Ancient code executes inputs from txt file
  • Ancient code writes output to txt file
  • MATLAB code reads txt file
  • MATLAB code puts txt inputs into MATLAB variables
  • My MATLAB code
  • Loop back to step 1 until reach termination criteria
I only need access to one file at a time, since the data in the file drives computation that results in a subsequent text file.
Unfortunately, I am stuck dealing with reading txt files.
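A sketch of the MATLAB side of that loop (the executable and file names are hypothetical placeholders):

```matlab
x = [0.00 68.000 68.000];                 % current MATLAB variables
done = false;
while ~done
    % Steps 1-2: write MATLAB variables to the text file the old code reads.
    fid = fopen('input.txt', 'wt');
    fprintf(fid, '%.2f %.3f %.3f\n', x.');
    fclose(fid);

    % Steps 3-5: run the ancient executable (reads input.txt, writes output.txt).
    system('./ancient_code');

    % Steps 6-7: read the output back with an explicit format.
    fid = fopen('output.txt', 'rt');
    y = fscanf(fid, '%f %f %f', [3, Inf]).';
    fclose(fid);

    % Steps 8-9: my MATLAB code updates x and checks the termination criteria.
    x = y(end, :);                        % placeholder update
    done = true;                          % placeholder termination test
end
```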
Now knowing your requirements and constraints, all I can offer you is my sympathy.
Ted on 27 May 2014
LOL, it is self-inflicted, since I am trying to use 40-year-old Fortran code. So far fscanf seems to be my fastest option, unless someone has a better idea.
That’s probably the best you can hope for, unless you want to reverse-engineer the FORTRAN code and create MATLAB code from it. That’s not something I would eagerly undertake.
Ted on 27 May 2014
Neither would I; it's 10,000 lines of Fortran 66. Since I have not touched Fortran in 30 years, it was painful enough to tweak a few file names and suppress command-line output.
It was challenging just to find a compatible compiler on the Mac, since neither g77 nor gfortran would compile it. Fortunately, the Intel Fortran Compiler for the Mac worked.
I rarely program in FORTRAN now (haven’t in more than a decade) but I have a compiler in my DVD software library that will run on this machine (Win 8) that I keep partially out of nostalgia. I was still writing FORTRAN code for my neural nets and genetic algorithms about 20 years ago because MATLAB was very slow on those machines. FORTRAN was significantly faster, and probably still is for large projects.
I strongly suggest you consider Cedric’s idea of a RAM drive for the temporary files. It is much faster in terms of read/write time, you can also wipe the files from the RAM drive quickly, and you don’t have to worry about HDD file fragmentation that would likely slow your processes.


rifat on 27 May 2014
Edited: rifat on 27 May 2014

0 votes

I did a similar thing before; hope this helps. The result will be in the variable mat.
fclose('all');
fid   = fopen('fileName.txt', 'rt');
lnum  = 0;
count = 1;
while count == 1
    line = fgetl(fid);
    if ~ischar(line)                   % fgetl returns -1 at end of file
        count = 0;
    else
        lnum = lnum + 1;
        line = strtrim(line);
        a    = (line == ' ');          % logical mask of space positions
        loc  = find(diff(a) ~= 0);     % indices where runs of spaces begin/end
        num1 = line(1:loc(1));
        num2 = line(loc(2)+1:loc(3));
        num3 = line(loc(3)+1:end);
        mat(lnum,1) = str2num(num1);   % note: mat grows inside the loop
        mat(lnum,2) = str2num(num2);
        mat(lnum,3) = str2num(num3);
    end
end
fclose(fid);

2 Comments

Ted on 27 May 2014
rifat, thanks for the suggestion. I tried your code and found fscanf to be significantly faster. I suspect that str2num (or even str2double) is much slower than the I/O in fscanf.
rifat on 27 May 2014
Yeah, I wasn't worried about performance; that's why it's a rough implementation.



Question by Ted, 26 May 2014

