Using textscan with mixed data type in a single field/array

Question


Hello,

I am having trouble reading a large (~30,000 rows) text file into Matlab. The data looks something like this:

BLOCK

 1) 1996/01/01 00:00:00     -99.000N    -99.000N 
 2) 1996/01/01 00:15:00     -99.000N    -99.000N 
 3) 1996/01/01 00:30:00     -99.000N    -99.000N 
 4) 1996/01/01 00:45:00     -99.000N    -99.000N 
 5) 1996/01/01 01:00:00     -99.000N    -99.000N

skipped rows

 16455) 1996/06/20 09:30:00     -99.000N    -99.000N 
 16456) 1996/06/20 09:45:00     -99.000N    -99.000N 
 16457) 1996/06/20 10:00:00     -99.000N    -99.000N 
 16458) 1996/06/20 10:15:00       1.869T      0.088T 
 16459) 1996/06/20 10:30:00       1.892       0.083  
 16460) 1996/06/20 10:45:00       1.913      -0.082  
 16461) 1996/06/20 11:00:00       1.913      -0.064  
 16462) 1996/06/20 11:15:00       1.895       0.035

I use textscan to read in the data like this:

textFilename = [year,SID,'.txt'];
fid = fopen(textFilename, 'rt');
C = textscan(fid, '%*s%d/%d/%d%d%c%d%c%d%f%c%f%c','Headerlines',11);

The problem (as you can see from the data) is that some of the values in the last two columns have a letter attached to them. Because this doesn't apply to all rows, when I read this letter as a character (%c) and it is absent, textscan moves along and consumes the '-' sign of the next number instead. As a result, values in the last column are read as positive when they are actually negative.

My question is: how can I tell textscan to read the values from the last two columns while separating out the letters?

Any and all help greatly appreciated!

Ozgun

Comments: 4 (2 earlier comments not shown)

Ozgun on 2 Jun 2014

dpb,

Thanks for the advice, I will look into regexp and see how that goes.

Cedric, thanks also. Out of curiosity, how would you go about post-processing? I've already tried to find a way to split a string of mixed data type, but struggled to find anything about it.

Thanks again to both. Oz

Cedric on 2 Jun 2014

The first important point to settle is the first question asked by dpb: do you need the 'N' and 'T' flags?


Answer 1

Cedric on 2 Jun 2014

Edited: Cedric on 2 Jun 2014

If you don't need N and T, the simplest approach is probably to eliminate them before the call to TEXTSCAN:

 content       = fileread( 'myFile.txt' ) ;
 isNT          = content == 'N' | content == 'T' ; % Logical mask of all N and T characters.
 content(isNT) = ' ' ;                             % Replace with white space.

then you can TEXTSCAN type-homogeneous columns:

C = textscan( content, ... ) ;

Note the content variable as first argument, as TEXTSCAN accepts both file handles and strings. If you need N and T, we can talk about the post-processing mentioned in my comment above (no time now, but I'll come back later tonight).
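As a rough sketch of what the full call could look like, here is the same idea run on three rows copied from the question. The format string is my assumption about the layout (row label, date, time, two values) and is not taken from the answer. A real file would also need the header lines skipped, as the question does with 'Headerlines'.

```matlab
% Stand-in for fileread('myFile.txt'): three rows copied from the question.
content = sprintf('%s\n', ...
    ' 16457) 1996/06/20 10:00:00     -99.000N    -99.000N ', ...
    ' 16458) 1996/06/20 10:15:00       1.869T      0.088T ', ...
    ' 16459) 1996/06/20 10:30:00       1.892       0.083  ');

% Blank out the flags so the last two columns are purely numeric.
content(content == 'N' | content == 'T') = ' ';

% Skip the "16457)" row label, then read date, time and the two values.
% The '/' and ':' in the format are literals that textscan matches and discards.
C = textscan(content, '%*s %d/%d/%d %d:%d:%d %f %f');

values = [C{7}, C{8}];   % N-by-2 double array holding the two data columns
```

With this layout, C{1} to C{6} hold year, month, day, hour, minute and second as integers, and the minus signs in the last two columns are preserved.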

If you wanted to process the whole in one shot using REGEXP, here is an example, but keep in mind that REGEXP is overkill for this operation and will take more time to process than a basic TEXTSCAN.

 content = fileread( 'myFile.txt' ) ;
 % Build cell array of entries.
 pattern = '([\d]+)\)\s+([\d\s:/]{19})\s+([\d\-.]+)([NT]?)\s+([\d\-.]+)([NT]?)' ;
 tokens  = regexp( content, pattern, 'tokens' ) ;
 tokens  = reshape( [tokens{:}], numel( tokens{1} ), [] ).' ;
 % Convert columns into numeric, string, and time data.
 numData = str2double( tokens(:,[1,3,5]) ) ;   % Row ID, 1st coord, 2nd coord.
 strData = tokens(:, [4,6]) ;                  % 1st and 2nd N, T, or empty.
 timData = datevec(tokens(:,2), 'yyyy/mm/dd HH:MM:SS' ) ;

Comments: 3 (1 earlier comment not shown)

dpb on 2 Jun 2014

I was just coming back to mention that if you do need N and T, there's an easy way to get them and still be able to do the quick 'n dirty trick of replacing them. First, read the file into a cell array of strings, one per line, instead of into one raw block of characters:

>> txt = textread('organz.txt','%s','delimiter','\n','whitespace','');
>> find(cellfun(@length,strfind(txt,'N')))
ans =
   1
   2
   3
   4
   5
   6
   7
   8
>> find(cellfun(@length,strfind(txt,'T')))
ans =
   9
>>

If a row can contain both a T and an N, or a flag in only one of the two columns, you'll have to parse the strfind output a little more; the code above relies on the fact that, in the sample file, a flag appears in both columns whenever it appears at all.

The result gives the rows that contain N or T, so you can associate each flag with the corresponding array row and then, once that is done, eliminate the characters as Cedric suggests and textscan the input.
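A minimal sketch of that sequence, assuming (as above) that a row carries the same flag in both columns and that the header lines have already been dropped from txt. The variable names flag and clean and the format string are illustrative choices, not part of the original comment.

```matlab
% Stand-in for the textread call above: one cell per line of the file.
txt = {' 16457) 1996/06/20 10:00:00     -99.000N    -99.000N '; ...
       ' 16458) 1996/06/20 10:15:00       1.869T      0.088T '; ...
       ' 16459) 1996/06/20 10:30:00       1.892       0.083  '};

% Remember which rows carry a flag before the flags are removed.
hasN = ~cellfun(@isempty, strfind(txt, 'N'));
hasT = ~cellfun(@isempty, strfind(txt, 'T'));
flag = repmat(' ', numel(txt), 1);   % ' ' means "no flag on this row"
flag(hasN) = 'N';
flag(hasT) = 'T';

% Strip the flags, then parse the now purely numeric last two columns.
% The three %*s fields skip the row label, the date and the time.
clean = regexprep(txt, '[NT]', ' ');
C = textscan(sprintf('%s\n', clean{:}), '%*s %*s %*s %f %f');
```

Row k of flag then lines up with row k of C{1} and C{2}.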

Cedric on 3 Jun 2014

Edited: Cedric on 3 Jun 2014

There are many ways to post process columns. One would be a basic loop which tests whether the last char of an entry is in 'TN', and then SSCANF the appropriate length of the entry. Most approaches based on CELLFUN or ARRAYFUN would actually be equivalent to this loop. After thinking a little about it, I think that an efficient way to proceed is to REGEXP the two relevant columns only, using a short pattern (more efficient than processing the whole file with a loop or a long pattern in principle). Here is an example of such an approach, which assumes that these columns are available as data{4} and data{5} after an appropriate call to TEXTSCAN (using %s in the formatSpec for these columns):

 >> data{[4,5]}
 ans = 
    '-99.000N'
    '-99.000N'
    '-99.000N'
    '-99.000N'
    '-99.000N'
    '-99.000N'
    '-99.000N'
    '-99.000N'
    '1.869T'
    '1.892'
    '1.913'
    '1.913'
    '1.895'
 ans = 
    '-99.000N'
    '-99.000N'
    '-99.000N'
    '-99.000N'
    '-99.000N'
    '-99.000N'
    '-99.000N'
    '-99.000N'
    '0.088T'
    '0.083'
    '-0.082'
    '-0.064'
    '0.035'
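For reference, one way to end up with string columns like these is to read every whitespace-separated field with %s. This is only a sketch on three sample rows. The five-field format is my assumption, chosen so that the two value columns land in data{4} and data{5}. A real file would again need its header lines skipped.

```matlab
% Stand-in for the file content: three rows copied from the question.
content = sprintf('%s\n', ...
    ' 16457) 1996/06/20 10:00:00     -99.000N    -99.000N ', ...
    ' 16458) 1996/06/20 10:15:00       1.869T      0.088T ', ...
    ' 16459) 1996/06/20 10:30:00       1.892       0.083  ');

% Read row label, date, time and both value columns as strings.
% Nothing is converted yet, so the optional N/T suffix cannot shift the parse.
data = textscan(content, '%s %s %s %s %s');

col4 = data{4};   % cell array of strings such as '-99.000N', '1.869T', '1.892'
col5 = data{5};   % cell array of strings such as '-99.000N', '0.088T', '0.083'
```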

We pass to REGEXP a comma separated merge of these entries, and we match numbers and optional T or N chars:

 tokens = regexp( sprintf( '%s,', data{4}{:} ), '([\d\-\.]+)([TN]*),', ...
                  'tokens' ) ;
 tokens = cat( 1, tokens{:} ) ;
 vals4  = str2double( tokens(:,1) ) ;
 chrs4  = tokens(:,2) ;

where vals4 and chrs4 contain the split values and chars for column 4.

 >> vals4
 vals4 =
  -99.0000
  -99.0000
  -99.0000
  -99.0000
  -99.0000
  -99.0000
  -99.0000
  -99.0000
    1.8690
    1.8920
    1.9130
    1.9130
    1.8950
 >> chrs4
 chrs4 = 
    'N'
    'N'
    'N'
    'N'
    'N'
    'N'
    'N'
    'N'
    'T'
    ''
    ''
    ''
    ''

The same can be applied to column 5.
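A short sketch of doing both columns in one loop, using a three-row stand-in for data{4} and data{5}. The names cols, vals and chrs are illustrative, and the pattern is the one used above.

```matlab
% Small stand-in for the string columns returned by TEXTSCAN.
data    = cell(1, 5);
data{4} = {'-99.000N'; '1.869T'; '1.892'};
data{5} = {'-99.000N'; '0.088T'; '-0.082'};

cols = [4, 5];
nRow = numel(data{4});
vals = zeros(nRow, numel(cols));   % numeric part of each entry
chrs = cell(nRow, numel(cols));    % 'N', 'T', or empty for each entry

for k = 1:numel(cols)
    % Join the column into one comma-terminated string and split each entry
    % into a number and an optional trailing flag.
    joined = sprintf('%s,', data{cols(k)}{:});
    tokens = regexp(joined, '([\d\-\.]+)([TN]*),', 'tokens');
    tokens = cat(1, tokens{:});            % nRow-by-2 cell array
    vals(:, k) = str2double(tokens(:, 1));
    chrs(:, k) = tokens(:, 2);
end
```

Column k of vals and chrs then corresponds to data{cols(k)}, so the signs and flags stay aligned row by row.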

Finally, it is worth looking into Image Analyst's answer as well as dpb's hints above. I have never really used READTABLE, and I'd be curious to see how well it performs in this situation. If you have time, it would be interesting to profile all the approaches.

Answer 2

Image Analyst on 3 Jun 2014

If you can add a header row giving the names of the columns, then you can simply use the new table data type:

t = readtable('organz.txt')

Nice and simple.


Answer 3

Ozgun on 3 Jun 2014

I appreciate the help, everyone. I've used regexp to post-process the files exactly as you described, Cedric, and it worked great! Unfortunately, the headers within the files meant that I couldn't use readtable as you suggested, Image Analyst.

BIG thanks to all...

Oz
