Using textscan with mixed data type in a single field/array
Hello,
I am having trouble reading a large (~30,000 rows) text file into MATLAB. The data looks something like this:
BLOCK
1) 1996/01/01 00:00:00 -99.000N -99.000N
2) 1996/01/01 00:15:00 -99.000N -99.000N
3) 1996/01/01 00:30:00 -99.000N -99.000N
4) 1996/01/01 00:45:00 -99.000N -99.000N
5) 1996/01/01 01:00:00 -99.000N -99.000N
- skipped rows
16455) 1996/06/20 09:30:00 -99.000N -99.000N
16456) 1996/06/20 09:45:00 -99.000N -99.000N
16457) 1996/06/20 10:00:00 -99.000N -99.000N
16458) 1996/06/20 10:15:00 1.869T 0.088T
16459) 1996/06/20 10:30:00 1.892 0.083
16460) 1996/06/20 10:45:00 1.913 -0.082
16461) 1996/06/20 11:00:00 1.913 -0.064
16462) 1996/06/20 11:15:00 1.895 0.035
I use textscan to read in the data like this:
textFilename = [year,SID,'.txt'];
fid = fopen(textFilename, 'rt');
C = textscan(fid, '%*s%d/%d/%d%d%c%d%c%d%f%c%f%c','Headerlines',11);
The problem (as you can see from the data) is that some of the values in the last two columns have a letter appended. Because this doesn't apply to all rows, when I read this letter as a character (%c) and it is absent, textscan moves along and consumes the '-' sign of the next number. As a result, the values from the fourth column are read as positive when they are actually negative.
My question is: how can I tell textscan to read the values from the last two columns while separating out the letters?
Any and all help greatly appreciated!
Ozgun
Comments: 4
Cedric
2 June 2014
The first important point to settle is the first question asked by dpb: do you need the 'N' and 'T' flags?
Accepted Answer
Cedric
2 June 2014
Edited: Cedric
2 June 2014
If you don't need N and T, the simplest approach is probably to eliminate them before the call to TEXTSCAN:
content = fileread( 'myFile.txt' ) ;
isNT = content == 'N' | content == 'T' ; % Logical mask of flag characters.
content(isNT) = ' ' ; % Replace with white space.
then you can TEXTSCAN type-homogeneous columns:
C = textscan( content, ... ) ;
Note the content variable as first argument, as TEXTSCAN accepts both file handles and strings. If you need N and T, we can talk about the post-processing mentioned in my comment above (no time now, but I'll come back later tonight).
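For illustration, here is a minimal sketch of that approach on three lines copied from the question. The format string is my assumption (it skips the row ID with %*s and reads the date and time as strings), and the sketch has no header lines, so the 'Headerlines' option is left out. Note that the replacement blanks every N and T in the text, including any that appear in the header of a real file.

```matlab
% Three sample lines from the question, joined with newlines.
content = sprintf( '%s\n', ...
    '16457) 1996/06/20 10:00:00 -99.000N -99.000N', ...
    '16458) 1996/06/20 10:15:00 1.869T 0.088T', ...
    '16460) 1996/06/20 10:45:00 1.913 -0.082' ) ;

% Blank out the flag characters so the last two columns are purely numeric.
isNT = content == 'N' | content == 'T' ;
content(isNT) = ' ' ;

% Assumed format: skip row ID, date string, time string, two floats.
C = textscan( content, '%*s %s %s %f %f' ) ;

dates = C{1} ;   % Cell array of date strings.
times = C{2} ;   % Cell array of time strings.
col4  = C{3} ;   % First numeric column.
col5  = C{4} ;   % Second numeric column.
```

With this format, col4 and col5 hold the two numeric columns with their minus signs intact.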
If you wanted to process the whole in one shot using REGEXP, here is an example, but keep in mind that REGEXP is overkill for this operation and will take more time to process than a basic TEXTSCAN.
content = fileread( 'myFile.txt' ) ;
% Build cell array of entries.
pattern = '([\d]+)\)\s+([\d\s:/]{19})\s+([\d\-.]+)([NT]?)\s+([\d\-.]+)([NT]?)' ;
tokens = regexp( content, pattern, 'tokens' ) ;
tokens = reshape( [tokens{:}], numel( tokens{1} ), [] ).' ;
% Convert columns into numeric, string, and time data.
numData = str2double( tokens(:,[1,3,5]) ) ; % Row ID, 1st coord, 2nd coord.
strData = tokens(:, [4,6]) ; % 1st and 2nd N, T, or empty.
timData = datevec(tokens(:,2), 'yyyy/mm/dd HH:MM:SS' ) ;
Comments: 3
dpb
2 June 2014
I was just coming back to mention that if you do need N and T, there's an easy way to get it and still be able to do the quick 'n dirty trick of replacing them. First, read the file but put it in a cell array of strings, one per line instead of just binary memory copy--
>> txt = textread('organz.txt','%s','delimiter','\n','whitespace','');
>> find(cellfun(@length,strfind(txt,'N')))
ans =
1
2
3
4
5
6
7
8
>> find(cellfun(@length,strfind(txt,'T')))
ans =
9
>>
If a line can contain both a T and an N, or a flag in only one of the two columns, you'll have to parse the output of strfind a little more; this takes advantage of the fact that, in the sample file, whenever a flag exists it appears in both columns.
The result is the row numbers containing N or T, so you can associate each flag with its array row and, once that is done, eliminate the characters as Cedric suggests and textscan the input.
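A rough sketch of that bookkeeping, under the same assumption that a flagged line carries the same letter in both columns. The names hasN, hasT and flag are mine, and the three lines are copied from the question rather than read from a file. The search looks at the whole line, so it would also pick up an N or T in a header line.

```matlab
% One char vector per line, as textread with '%s' and a newline delimiter returns.
txt = { '16457) 1996/06/20 10:00:00 -99.000N -99.000N' ; ...
        '16458) 1996/06/20 10:15:00 1.869T 0.088T'     ; ...
        '16460) 1996/06/20 10:45:00 1.913 -0.082' } ;

% Logical masks of the lines that contain each flag.
hasN = ~cellfun( @isempty, strfind( txt, 'N' ) ) ;
hasT = ~cellfun( @isempty, strfind( txt, 'T' ) ) ;

% One flag character per row; a blank means no flag on that line.
flag = repmat( ' ', numel( txt ), 1 ) ;
flag(hasN) = 'N' ;
flag(hasT) = 'T' ;
```

flag(k) then lines up with row k of whatever textscan returns once the letters have been blanked out.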
Cedric
3 June 2014
Edited: Cedric
3 June 2014
There are many ways to post process columns. One would be a basic loop which tests whether the last char of an entry is in 'TN', and then SSCANF the appropriate length of the entry. Most approaches based on CELLFUN or ARRAYFUN would actually be equivalent to this loop. After thinking a little about it, I think that an efficient way to proceed is to REGEXP the two relevant columns only, using a short pattern (more efficient than processing the whole file with a loop or a long pattern in principle). Here is an example of such an approach, which assumes that these columns are available as data{4} and data{5} after an appropriate call to TEXTSCAN (using %s in the formatSpec for these columns):
>> data{[4,5]}
ans =
'-99.000N'
'-99.000N'
'-99.000N'
'-99.000N'
'-99.000N'
'-99.000N'
'-99.000N'
'-99.000N'
'1.869T'
'1.892'
'1.913'
'1.913'
'1.895'
ans =
'-99.000N'
'-99.000N'
'-99.000N'
'-99.000N'
'-99.000N'
'-99.000N'
'-99.000N'
'-99.000N'
'0.088T'
'0.083'
'-0.082'
'-0.064'
'0.035'
We pass to REGEXP a comma separated merge of these entries, and we match numbers and optional T or N chars:
tokens = regexp( sprintf( '%s,', data{4}{:} ), '([\d\-\.]+)([TN]*),', ...
'tokens' ) ;
tokens = cat( 1, tokens{:} ) ;
vals4 = str2double( tokens(:,1) ) ;
chrs4 = tokens(:,2) ;
where vals4 and chrs4 contain the split values and chars for column 4.
>> vals4
vals4 =
-99.0000
-99.0000
-99.0000
-99.0000
-99.0000
-99.0000
-99.0000
-99.0000
1.8690
1.8920
1.9130
1.9130
1.8950
>> chrs4
chrs4 =
'N'
'N'
'N'
'N'
'N'
'N'
'N'
'N'
'T'
''
''
''
''
The same can be applied to column 5.
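For comparison, the basic loop mentioned at the start of this comment could look roughly like the sketch below. The variable names col, vals and chrs are hypothetical, and the entries are a few samples taken from the columns shown above.

```matlab
% A few sample entries, as read with %s by TEXTSCAN.
col = { '-99.000N' ; '1.869T' ; '1.892' ; '-0.082' } ;

vals = zeros( numel( col ), 1 ) ;        % Numeric part of each entry.
chrs = repmat( {''}, numel( col ), 1 ) ; % Flag of each entry, or empty.

for k = 1 : numel( col )
    entry = col{k} ;
    % If the last char is a flag, store it and drop it from the entry.
    if any( entry(end) == 'TN' )
        chrs{k} = entry(end) ;
        entry   = entry(1:end-1) ;
    end
    vals(k) = sscanf( entry, '%f' ) ;
end
```

This gives the same kind of vals/chrs pair as the REGEXP version, just one entry at a time.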
Finally, it is worth looking into Image Analyst's answer as well as dpb's hints above. I have never really used READTABLE, and I'd be curious to see how well it performs in such a situation; if you have time, it would be interesting to profile all the approaches.
More Answers (2)
Image Analyst
3 June 2014
If you can add a header row giving the names of the columns, then you can simply use the new table data type:
t = readtable('organz.txt')
Nice and simple.