Textscan with different formats

Question

Hi,

I'm not familiar with reading text files or with the formats involved. Here is my problem: I have been trying to read a text file with an unknown number of rows and columns and a '\t' delimiter. The column headers have more than one part (the second is a unit, which I do not need; only the first is used). I was using importdata to read the text and data separately, and it was working fine, but yesterday I found a problem: my input text file contains '*' for missing data, which is treated as a character and as a row header during import.

Hundreds of questions have been asked about reading text files. I've found solutions like readtable, importing as char and converting with str2double (which is slow), and readtext (File Exchange), but none of them is as fast as the importdata function.

What I want is to read only the numeric data from the text file, replacing characters with NaN during the import itself, as xlsread does. I understand this can be done with textscan, but I was unable to work out the formatSpec for my files. Alternatively, a faster str2double function would help.

When I give the formatSpec as '%s %f', is the first row taken as a string, or the first column?

Note: the text file size is 100000 rows by 600 columns. In some files the second column (Units) may not be present, and the data starts from the second column itself. Also, if my delimiter changes to ',' for another file, how can I auto-detect the delimiter?

Comments

Stephen23 on 25 Oct 2017

@surey: are the missing data always in that column, or can they occur in other columns as well?

Vick on 25 Oct 2017

Edited: Vick on 25 Oct 2017

Hi, there are more than 20 columns with missing data in my actual data, and they can be any column. Additionally, the missing data may be in a single row of any column, rather than a whole column.


Answer 1

Walter Roberson on 24 Oct 2017


"When i give formatspec as ('%s %f') is the first row is taken as string or the first column?"

No, neither. textscan() loops, continuing from the current file position, which might be in the middle of a line. If your format only reads a portion of a line, then the rest of the line is not discarded before the format is used again: instead the file position is left in the middle of the line, and textscan loops and applies the format again from wherever it is.

For example, in the file

abc 123 456 789 1011
def

then a "%s%f" format would first read the 'abc' with the %s format, then read the 123 with the numeric format, temporarily leaving the textscan output as {{'abc'}, [123]}. Then textscan would re-apply the format from where it was, reading '456' with the %s format and 789 with the numeric format, updating the textscan output to {{'abc'; '456'}, [123; 789]}. Then the %s would grab the 1011, and the %f would choke on the def of the next line, leaving you with {{'abc'; '456'; '1011'}, [123; 789]}. Notice that the numeric column is shorter than the text column, because textscan gave up reading before that column was updated.
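To see this behaviour directly, the example can be run on a character vector instead of a file. This is only a sketch: passing the text as a char vector and the variable names txt and C are assumptions made for illustration, and the output is expected to match the result described above.

```matlab
% Text from the example above; sprintf turns \n into a real line break.
txt = sprintf('abc 123 456 789 1011\ndef');

% Two format items, applied over and over from the current position.
% textscan does not restart the format at the beginning of each line.
C = textscan(txt, '%s%f');

disp(C{1})   % entries read with %s
disp(C{2})   % entries read with %f (one fewer, since reading stopped at 'def')
```

Because the two cells can end up with different lengths, something like [C{:}] or cell2mat(C) on such output may fail, which is often the first visible symptom of a format that does not match the number of columns.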

Now, if you happen to have the same number of format items as you have columns, then the effect is that each format item applies to one column. But if you hit a row that has a missing entry that is implied by spacing (no explicit delimiter between fields), or you have a numeric field specification but encounter a string instead and you do not have TreatAsEmpty set, or if a %s column unexpectedly has a space in it, then in any of those circumstances the nice correspondence between column and format specifier will get messed up.
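For a tab-delimited file where '*' marks missing data, one way to keep format items and columns in step might look like the sketch below. The sample file contents, the temporary file, and the variable names are hypothetical; it also assumes every row has the same number of fields and that only one header row is present.

```matlab
% Write a small hypothetical sample file: one header row, '*' for missing data.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'time\tspeed\ttemp\n0.0\t1.5\t20\n0.1\t*\t21\n0.2\t1.7\t*\n');
fclose(fid);

fid = fopen(fname, 'r');
headerLine = fgetl(fid);                              % consume the header row
nCols = numel(strsplit(headerLine, sprintf('\t')));   % count columns from it

% One %f per column, so each format item corresponds to exactly one column.
fmt = repmat('%f', 1, nCols);

% '*' is treated as an empty numeric field, which textscan fills with NaN.
C = textscan(fid, fmt, 'Delimiter', '\t', ...
    'TreatAsEmpty', '*', 'CollectOutput', true);
fclose(fid);
delete(fname);

data = C{1};   % numeric matrix, NaN where '*' appeared
```

If a units row follows the header, adding 'HeaderLines', 1 to the textscan call should skip it, since textscan continues from the current file position after fgetl.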

One of the key things you need to know about textscan() is that, unless you have set 'WhiteSpace' to exclude the space character, blanks starting at the current position are discarded at the beginning of every format specifier, even if the format specifier is %c or %s or %[]. This makes it tricky to deal with optional fields that are replaced by blanks (unless you happen to be using a field separator such as comma or tab). The immediate thought might be to just remove space from the 'WhiteSpace' setting, but when that parameter does not include space, leading spaces are an error for numeric fields! I showed how to get around that in https://www.mathworks.com/matlabcentral/answers/361377-textscan-failing-to-read-data-in-text-file#answer_286302
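The difference between an explicit separator and blank-filled fields can be illustrated with two one-line inputs. This is a rough sketch with made-up inputs and hypothetical variable names (C1, C2), not taken from the linked answer.

```matlab
% Explicit separator: the empty second field is kept and becomes NaN,
% so the value 3 stays in the third column.
C1 = textscan('1,,3', '%f%f%f', 'Delimiter', ',');

% Fields separated only by blanks: the run of blanks is skipped before
% the second format item, so the value 3 ends up in the second column.
C2 = textscan('1   3', '%f%f%f');

disp(C1{2})   % second column of the comma-separated input
disp(C2{2})   % second column of the blank-separated input
```

This is why a tab or comma delimiter makes missing entries much easier to handle than fixed-width text padded with spaces.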

Comments

Vick on 25 Oct 2017

Hi Roberson,

Thanks for the detailed explanation. I'm now able to specify the formatSpec for simpler problems, but I'm still struggling to specify the formatSpec for my problem.

I have attached the file to @Stephen Cobeldick's comment: https://in.mathworks.com/matlabcentral/answers/362921-textscan-with-different-formats#comment_496926
