Reading a very large text file of almost regular data with empty values

Hello everybody,
I am trying to import an almost regular matrix into MATLAB. I used textscan with the EmptyValue option to do it.
But it always gives an error message, 'badly formatted string'. I do not understand why. Could you please give me a hand?
Below is the data file. The problems with the text file are:
First, some values are missing. It would be better if I could get a NaN or 0 in place of the missing values at the end of a row.
Second, the values in columns 3 and 4 are sometimes attached to each other, with no space between them. This also makes the import difficult.
4.417E-03 1.000E+00 2.200E+05 462 2.543878E+00 5.440884E+01
4.417E-03 1.000E+00 2.200E+05 468 2.544193E+00 7.315421E+01
4.417E-03 1.000E+00 2.200E+05 687 2.255183E+00 5.011286E+01
4.417E-03 1.000E+00 2.200E+05 943 7.015397E+00
4.417E-03 1.000E+00 2.200E+05 947 1.877077E+01
4.417E-03 1.000E+00 2.200E+0511135 2.543452E+00
4.417E-03 1.000E+00 2.200E+0511138
4.417E-03 1.000E+00 2.200E+0511141
4.417E-03 1.000E+00 2.200E+0511144 2.543891E+00 4.701584E+01
4.417E-03 1.000E+00 2.200E+0511351 2.255163E+00 4.291446E+01
4.417E-03 1.000E+00 2.200E+05 1591 2.544160E+00 2.182716E+01
4.417E-03 1.000E+00 2.200E+05 1596 2.543892E+00 3.667904E+01
4.417E-03 1.000E+00 2.200E+05 1598
4.417E-03 1.000E+00 2.200E+05 2350
4.417E-03 1.000E+00 2.200E+05 2356
4.417E-03 1.000E+00 2.200E+05 2522
4.417E-03 1.000E+00 2.200E+05 2711
The matrix I want to obtain is:
4.417E-03 1.000E+00 2.200E+05 462 2.543878E+00 5.440884E+01
4.417E-03 1.000E+00 2.200E+05 468 2.544193E+00 7.315421E+01
4.417E-03 1.000E+00 2.200E+05 687 2.255183E+00 5.011286E+01
4.417E-03 1.000E+00 2.200E+05 943 7.015397E+00 NaN
4.417E-03 1.000E+00 2.200E+05 947 1.877077E+01 NaN
4.417E-03 1.000E+00 2.200E+05 11135 2.543452E+00 NaN
4.417E-03 1.000E+00 2.200E+05 11138 NaN NaN
4.417E-03 1.000E+00 2.200E+05 11141 NaN NaN
4.417E-03 1.000E+00 2.200E+05 11144 2.543891E+00 4.701584E+01
4.417E-03 1.000E+00 2.200E+05 11351 2.255163E+00 4.291446E+01
4.417E-03 1.000E+00 2.200E+05 1591 2.544160E+00 2.182716E+01
4.417E-03 1.000E+00 2.200E+05 1596 2.543892E+00 3.667904E+01
4.417E-03 1.000E+00 2.200E+05 1598 NaN NaN
4.417E-03 1.000E+00 2.200E+05 2350 NaN NaN
4.417E-03 1.000E+00 2.200E+05 2356 NaN NaN
4.417E-03 1.000E+00 2.200E+05 2522 NaN NaN
4.417E-03 1.000E+00 2.200E+05 2711 NaN NaN
PS. The file is very large. The data contains thousands of rows and columns. Should I optimise the way I read the file? Thank you very much in advance for your help.
[EDITED]
Format='%*10E %10E %9f %5d %E %E'
opt = {'EmptyValue',NaN,'CollectOutput',1};
tmp = textscan(fid,Format,'Delimiter','','Whitespace','',opt{:});

Accepted Answer

Jacob Palczynski on 25 Aug 2011
Since your matrix is only almost regular, you will not be able to work with textscan easily. I guess the problem with EmptyValue is that there are no delimiters for the empty fields in your file.
I would suggest using fgetl and extracting the numbers from the resulting char array with regexp. You then get a cell array of char arrays, which you can convert to doubles.
For example:
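clear
fid = fopen('test.txt','rt');
% maximum number of columns
maxlength = 6;
% preallocation
step = 2;
tm = nan(step,maxlength);
% reading line by line
k = 1;
while ~feof(fid)
    thisline = fgetl(fid);
    if ~ischar(thisline), break; end   % fgetl returns -1 at end of file
    % Match floats with a two-digit exponent first, then plain integers, so that
    % '2.200E+0511135' splits into '2.200E+05' and '11135'. This assumes two-digit
    % exponents and non-negative values; a bare '\d*' would cut '4.417E-03' into digit runs.
    thisline = regexp(thisline,'\d+\.\d+E[+-]\d{2}|\d+','match');
    thisline = cellfun(@str2double,thisline);
    tm(k,1:length(thisline)) = thisline;
    r = size(tm,1);
    % need to preallocate more?
    if k == r
        tm((r+1):(r+step), 1:maxlength) = nan;
    end
    k = k + 1;
end
fclose(fid);
% drop the preallocated rows that were never filled
tm = tm(1:k-1,:);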
For large files it makes sense to increase step.
Comments: 8
gringoire on 26 Aug 2011
Otherwise, do you know how we can reduce the time needed to read the text? The file has 500*10000 values to be read...
Jacob Palczynski on 26 Aug 2011
When I run the script in the profiler with a 10000*6 data file, most of the time (~20 seconds) is spent in the cellfun line. Only 7 seconds are needed to read the file.
Hence, the conversion, which we need because of the data format, takes most of the time.
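
Since the conversion dominates the run time, one option is to convert each line with a single sscanf call instead of cellfun. The following is only a sketch under stated assumptions, not a measured improvement: it assumes every exponent has exactly two digits (so a digit directly after the exponent starts the next column), it writes a three-line sample taken from the question to a temporary file so that it runs on its own, and the names sampleFile, chunk and data are made up for illustration.

```matlab
% Write a small sample (lines copied from the question) to a temporary file.
sampleFile = [tempname '.txt'];
fid = fopen(sampleFile, 'wt');
fprintf(fid, '%s\n', ...
    '4.417E-03 1.000E+00 2.200E+05 462 2.543878E+00 5.440884E+01', ...
    '4.417E-03 1.000E+00 2.200E+0511135 2.543452E+00', ...
    '4.417E-03 1.000E+00 2.200E+0511138');
fclose(fid);

maxCols = 6;                 % maximum number of columns
chunk   = 1000;              % rows added per preallocation step
data    = nan(chunk, maxCols);
k       = 0;                 % number of rows read so far

fid = fopen(sampleFile, 'rt');
while true
    thisline = fgetl(fid);
    if ~ischar(thisline), break; end          % fgetl returns -1 at end of file
    % Assumption: two-digit exponents. Insert a space wherever an exponent
    % is immediately followed by another digit ('E+0511135' -> 'E+05 11135').
    thisline = regexprep(thisline, '(E[+-]\d{2})(?=\d)', '$1 ');
    vals = sscanf(thisline, '%f');            % all numbers of the line at once
    k = k + 1;
    if k > size(data, 1)                      % need to preallocate more?
        data(end+1:end+chunk, :) = NaN;
    end
    n = numel(vals);
    data(k, 1:n) = reshape(vals, 1, n);       % missing trailing fields stay NaN
end
fclose(fid);
data = data(1:k, :);                          % drop unused preallocated rows
delete(sampleFile);
```

The idea is that sscanf parses the whole line in one call, while str2num evaluates each piece of text separately. Whether this is actually faster on the real file would have to be checked in the profiler.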


More Answers (3)

Oleg Komarov on 25 Aug 2011
The text file I created is the following (with a newline character at the end, otherwise it won't be read properly):
4.417E-03 1.000E+00 2.200E+05 462 2.543878E+00 5.440884E+01
4.417E-03 1.000E+00 2.200E+05 468 2.544193E+00 7.315421E+01
4.417E-03 1.000E+00 2.200E+05 687 2.255183E+00 5.011286E+01
4.417E-03 1.000E+00 2.200E+05 943 7.015397E+00
4.417E-03 1.000E+00 2.200E+05 947 1.877077E+01
4.417E-03 1.000E+00 2.200E+0511135 2.543452E+00
4.417E-03 1.000E+00 2.200E+0511138
4.417E-03 1.000E+00 2.200E+0511141
4.417E-03 1.000E+00 2.200E+0511144 2.543891E+00 4.701584E+01
4.417E-03 1.000E+00 2.200E+0511351 2.255163E+00 4.291446E+01
4.417E-03 1.000E+00 2.200E+05 1591 2.544160E+00 2.182716E+01
4.417E-03 1.000E+00 2.200E+05 1596 2.543892E+00 3.667904E+01
4.417E-03 1.000E+00 2.200E+05 1598
4.417E-03 1.000E+00 2.200E+05 2350
4.417E-03 1.000E+00 2.200E+05 2356
4.417E-03 1.000E+00 2.200E+05 2522
4.417E-03 1.000E+00 2.200E+05 2711
.
fid = fopen('test.txt');
% Import 3rd and 4th column as fixed char, the other directly as doubles
out = textscan(fid,'%f%f%9c%5c%f%f','EmptyValue',NaN);
% Line of blank to pad char fields
pad = repmat(' ',1,numel(out{1}));
% Read in the fixed width char fields as double
out{3} = cell2mat(textscan([pad; out{3}.'],'%f'));
out{4} = cell2mat(textscan([pad; out{4}.'],'%f'));
fclose(fid);
cell2mat(out)
Comments: 4
Oleg Komarov on 26 Aug 2011
It also works on the data posted by gringoire, except that I have to add a newline at its end.
Soni huu on 29 Jun 2012 (edited 29 Jun 2012)
What about this, can you solve it? This data has 9 columns:
02:38:00 R- .026 065.457 01** 4862 0097 0074 +19
02:39:00 NaN .000 065.457 01** 4862 0101 0074 +19
02:40:00 NaN .000 065.457 01** 4862 0099 0074 +19
02:41:00 NaN .000 065.457 01** 4862 0097 0074 +19
02:42:00 R- .129 065.459 01** 4862 0111 0074 +19
02:43:00 R- .051 065.460 01** 4862 0099 0074 +19
Note: NaN stands for a field that is empty or two spaces ("  ").
Thanks in advance.



Jan on 25 Aug 2011
Please post the command that fails. It is impossible to guess exactly what you have done.
This is a serious problem: "Second, between the column 3 and 4, sometimes, the values are attached. It also makes the input difficult." It is not trivial to split the string "2.200E+0511135". Without additional restrictions, it is even impossible. I assume the exponent is limited to 2 digits. Can you confirm this? Values > 1.0e100 are valid in principle, but then I'd consider the data file damaged.
The file uses 4 significant digits per number in the leading columns. This is rather imprecise. I'd never draw any conclusions about the regularity of a large matrix with millions of elements if the data are represented with such low accuracy. Do you see any possibility of obtaining the values with SINGLE or, better, DOUBLE precision?
TEXTSCAN cannot be used directly because of the touching numbers. Therefore you first have to insert a space after the 29th character and create a new file:
fidIN = fopen(FileName1, 'r');
if fidIN < 0, error('Cannot open file %s', FileName1); end
fidOUT = fopen(FileName2, 'w'); % EDITED: 'r'->'w'
if fidOUT < 0, error('Cannot open file %s', FileName2); end
while 1
    S = fgetl(fidIN);
    if ~ischar(S)
        break;
    end
    % Split '4.417E-03 1.000E+00 2.200E+0511135'
    S = [S(1:29), ' ', S(30:end)];
    fwrite(fidOUT, S, 'uchar');
    fwrite(fidOUT, 10, 'uchar'); % Or [13,10] for DOS line breaks
end
fclose(fidIN);
fclose(fidOUT);
Then TEXTSCAN has a chance to read the data. You need the 'MultipleDelimsAsOne' and the 'EmptyValue' flags.
PS. Please tell the programmer who created the file that the format is trashy. It is close to unusable. The strategy for formatting ASCII files is easy: let the file be readable with the minimum number of rules. But you have repeated delimiters, missing data, touching columns and a large data set with low precision.
Comments: 5
Jan on 25 Aug 2011
@gringoire: Talking in English about MATLAB is not easy.
Look at "help textscan" -> Supported FORMAT specifiers: http://www.mathworks.de/help/techdoc/ref/textscan.html
There is no "%E" specifier. You can define a field width, as in "%10E", in newer MATLAB versions only, e.g. not in 2009a.
Your format specifiers belong to FSCANF, which is, by the way, also a good way to read the file line by line.
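
To illustrate the remark about FSCANF-style specifiers, here is a small sketch that applies field widths with sscanf to one line of the posted data. The widths 9, 9, 9 and 5 are an assumption read off the sample as it appears in the question (the real file may use different widths), and the variable names str, vals and row are made up for illustration.

```matlab
% One line from the question in which columns 3 and 4 touch.
str = '4.417E-03 1.000E+00 2.200E+0511135 2.543452E+00';

% Assumed widths: three 9-character floats, then an integer of up to 5 digits.
% The width stops the third field after '2.200E+05', so '11135' is read separately.
vals = sscanf(str, '%9f %9f %9f%5f %f %f');

% Pad to six columns; fields missing at the end of the line stay NaN.
row = nan(1, 6);
row(1:numel(vals)) = vals;
```

The same format string could be applied to each line returned by fgetl; whether it fits the real file depends on the actual column widths.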
gringoire on 26 Aug 2011
Ah ok.. I thought %10E works for every function-



Antti on 25 Aug 2011
Hi,
I think Jan's example code is good, but I suppose 3rd line should be:
fidOUT = fopen(FileName2, 'w'); % 'w' like write
