Textscan encountering unwanted character. How do I kill that line and move on without killing the script??

Question

ndb on 27 Mar 2019

https://kr.mathworks.com/matlabcentral/answers/452934-textscan-encountering-unwanted-character-how-do-i-kill-that-line-and-move-on-without-killing-the-sc

Edited: ndb on 4 Apr 2019

Accepted Answer: Walter Roberson

ndb_sample.txt

Hello:

I'm processing a large temporal dataset (data recorded every minute with 60+ columns). Right now, I'm using textscan() to parse it a bit. Maybe 5 times throughout one file, there is an upside-down question mark (¿) within the data. This kills my script because textscan expects a float. I'd like to avoid that by skipping the column where it finds that character, as well as the remaining data/columns in that line, and treating them as empty. I've attached a few minutes of the data that include good data and one line with the ¿. Here's a bit of ugly code within a loop that deals with that:

filename = fileList(i).name;
delimiter = ' ';
formatSpec = '%*s%*s%*s%*s%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%[^\n\r]';
fileID = fopen(filename,'r');
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, ...
    'TextType', 'string', 'ReturnOnError', false,'EmptyValue',-Inf);
fclose(fileID);

I know it's probably not the most efficient way to do it, but it's what I've got now. I've looked a bit into regular expression replacement, but I could never get that to work. Any advice is appreciated.

Comments: 4 (2 earlier comments not shown)

dpb on 28 Mar 2019

How big is the actual file?

With today's memory, I'd be tempted to just load the whole thing in memory and clean up the offending lines, then process.

Or, since it's surprisingly fast, just write a quick filter that kills any line it finds with the bum character... or use a standalone grep utility first...

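magicchar=char(N);    % N is a placeholder: the code of whatever the offending character is
fidi=fopen('yourfile.txt','r');
fido=fopen('newfile.txt','w');
while ~feof(fidi)
  l=fgets(fidi);                          % read one line, keeping its line terminator
  if contains(l,magicchar),continue,end   % skip any line that holds the bad character
  fprintf(fido,'%s',l);                   % copy clean lines through unchanged
end
fclose(fidi);
fclose(fido);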
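
The "load the whole thing in memory" variant mentioned above could look roughly like the sketch below. It is only an illustration: the sample text stands in for the result of fileread, and char(191) (the inverted question mark) is an assumed value for the offending character, which may well be something else in the real files.

```matlab
% Sketch: drop every line that contains the offending character, in memory.
magicchar = char(191);          % ASSUMPTION: the bad character is the inverted question mark
nl = char(10);                  % newline
% Hypothetical sample standing in for S = fileread('yourfile.txt');
S = ['1 2 3' nl '4 ' magicchar ' 6' nl '7 8 9' nl];
lines  = regexp(S, '\r?\n', 'split');           % split on LF or CRLF
lines  = lines(~contains(lines, magicchar));    % keep only the clean lines
cleanS = strjoin(lines, nl);                    % reassemble for textscan or fprintf
```

cleanS can then be passed straight to textscan in place of a file identifier, or written out with fprintf.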

ndb on 28 Mar 2019

Edited: ndb on 28 Mar 2019

The files aren't big at all, ~400 KB. The thing is, I'd like to keep the data up to the point where the offending character lives, and then kill everything after that... if possible.

You're right though: I should be reading in the whole file, doing my magic on it, and then writing it out somewhere else. I wasn't doing that to begin with because, up to a certain date, the files had an even wonkier format that wasn't uniform. Now that the files coming in are uniform, this will definitely be the way to go. Thanks also for the shortcut on the line formatting. That will help in all my other endeavours as well.

Answer 1

Walter Roberson on 28 Mar 2019

https://kr.mathworks.com/matlabcentral/answers/452934-textscan-encountering-unwanted-character-how-do-i-kill-that-line-and-move-on-without-killing-the-sc#answer_367797

Edited: Walter Roberson on 28 Mar 2019

filename = fileList(i).name;
delimiter = ' ';
formatSpec = '%*s%*s%*s%*s%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%[^\n\r]';
ncol = 72;
filler = repmat('-inf ', 1, ncol);
S = fileread(filename);
newS = regexprep(S, '\S*\x00.*?$', filler, 'lineanchors');
dataArray = textscan(newS, formatSpec, 'Delimiter', delimiter, 'TextType', 'string');

You already have a %[^\n\r] to eat to the end of line. Typically that will get "0 0 " in it (that is, if you were trying to read all the numeric columns as numeric then your count was off by 2). I take advantage of that eating by detecting the bad characters and substituting an entire full line's worth of -inf pattern, knowing that the -inf will be used by the %f format and that any left-over -inf will be eaten by the %[^\n\r] pattern. On those lines, dataArray{end} will contain a number of "-inf" occurrences. I figure that if the 0 0 was significant for something, you would have read it with %f%f .

Comments: 10 (8 earlier comments not shown)

Walter Roberson on 29 Mar 2019

Sure.

\S is a non-space character. * means "any number of them", including none at all. This part is expressing "match all of the non-space characters that are immediately before the bad character", wiping out the field that the bad character appears in, as requested. (It would have also been valid to consider only wiping out from the bad character onwards. And looking at your data, it looks like you could have considered splitting the records at the bad character, as it looks like the bad character replaces the end of a record and a new line starts immediately after the character.)

Then the \x means that the next two digits are to be interpreted as the hexadecimal representation of the character. So \x00 means char( hex2dec('00')) which is char(0) . The bad character was the null character when I looked in the file.

Then . means "match any character", ordinarily including possibly newline characters. The * means "any number of them". Normally that would extend as far as possible, matching any number of any character, going as far as you can in the file until forced to backtrack to match something else that followed in the regular expression. However, the ? means to instead use as few characters as is needed to match the rest of the expression.

The $ then normally matches the end of the string. However, with the 'lineanchors' option, it matches the end of a line instead. Following .*? it effectively changes the meaning from "match everything until end of file" to "match everything until the end of the current line".

Thus, the \x00.*?$ with the 'lineanchors' option means to match from a null character to the end of the same line. And the \S* before that goes back to the beginning of the field that the null was found in.

The 'lineanchors' option is important in this context. It would not hurt to also use the 'dotexceptnewline' option to reinforce that the .*? is not to go past the end of the current line.

regexprep(string, pattern, replacement) finds all occurrences of the pattern in the string, and replaces them with the replacement. Thus we are searching for all cases in which null occurs on a line and replacing from the beginning of that field until the end of that line.

What we are replacing with is 72 copies of '-inf '. -inf is what you had previously configured at your EmptyValue option for textscan, so in cases where it found emptiness in a field you wanted -inf returned for that field; that's where the '-inf' comes from.

The code assumes that the null could appear anywhere on the line. If we knew that the null only appeared at one particular location, we could replace it with a run of -inf just long enough for that one case, but the code makes no such assumption. It assumes that the null could even be in the first field, so it might have to put in all 72 copies of -inf to fill the fields. The code does not bother to try to figure out which field number it is working in, so it does not work out how many fields it needs to replace with -inf on this particular line. Instead the code just puts in all the -inf it could possibly need, and counts on the %[^\n\r] catch-all at the end of the textscan pattern to consume any -inf that were not needed.
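
To see the substitution on something small, here is a minimal sketch using a made-up three-column sample rather than the real 72-column data; the sample text and ncol = 3 are assumptions for illustration only.

```matlab
% Sketch: the same regexprep call applied to a tiny hypothetical sample.
ncol   = 3;                          % ASSUMPTION: 3 numeric fields instead of 72
nl     = char(10);                   % newline
% Line 2 has a null character inside its second field.
S      = ['1.0 2.0 3.0' nl '4.0 5' char(0) '.0 6.0' nl '7.0 8.0 9.0' nl];
filler = repmat('-inf ', 1, ncol);   % one full line's worth of -inf
newS   = regexprep(S, '\S*\x00.*?$', filler, 'lineanchors');
rows   = strsplit(newS, nl);         % split into lines for inspection
disp(rows{2})                        % the repaired second line
```

The first field of the damaged line (4.0) survives, the field holding the null and everything after it are replaced by the filler, and the untouched lines pass through unchanged.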

Walter Roberson on 29 Mar 2019

I take it with the last example there that ,00,07 is what is unexpected, because we "know" that 0 immediately after comma should only occur if the entire field is 0 as in ,0, or if there is a decimal place after the 0, as in ,0.482 ? The two 0 in a row at the beginning of the field would be unexpected, and the 0 before the 7 in the field would be unexpected?

... Because if so then that is more work to analyze. %f format will gladly convert 00 or 07 and think nothing is wrong. You have to get relatively contextual as to exactly what numbers normally look like to figure out those kinds of glitches.

The characters like ö are relatively easy to deal with. Where I had \x00 in the pattern, you could instead use [^\x0a\x0d\x20-\x7e] or add \x09 if you want to permit tab as well. \x0d is carriage return, \x0a is newline, \x20 is space (first non-control character), \x7e is ~ (last regular ASCII character other than the non-printing char(127) which is "delete").
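
As a rough sketch of that substitution, the pattern below swaps the negated character class in for \x00. The two-line sample and the three-column filler are hypothetical; only the character class itself comes from the comment above.

```matlab
% Sketch: treat anything outside printable ASCII, CR and LF as a bad character.
nl   = char(10);
S    = ['1.0 2.0 3.0' nl '4.0 5' char(246) '.0 6.0' nl];   % char(246) is the ö character
bad  = '[^\x0a\x0d\x20-\x7e]';                 % negated class from the comment above
hasBad = ~isempty(regexp(S, bad, 'once'));     % true if any bad character is present
filler = repmat('-inf ', 1, 3);                % ASSUMPTION: 3 numeric fields
newS = regexprep(S, ['\S*' bad '.*?$'], filler, 'lineanchors');
stillBad = ~isempty(regexp(newS, bad, 'once'));  % false once the line is repaired
```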

Walter Roberson on 29 Mar 2019

Ah, I see it now, the 13.22.282 . Unfortunately, textscan is happy to treat that as 13.22 0.282 without noticing anything wrong. So yes, a fair bit would have to be known about the correct representation of numbers on the system. For example it helps to know for sure that it always puts leading 0. on valid fractional values < 1: some systems would instead leave out the leading 0 and go directly to the period, '0.282' versus '.282' .

Are the numbers certain to have 3 decimal places? And is it certain that a positive number will always have a single space after the comma but a negative value will have no space after the comma?
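
If the answers to those questions pin down the number format, a per-field check along these lines could flag glitches that %f silently accepts. This is only a sketch under assumed rules (comma-delimited fields, optional minus sign, no exponent notation, no leading zeros apart from a lone 0 before the decimal point); the sample line is made up.

```matlab
% Sketch: flag fields that do not look like well-formed numbers.
% ASSUMED format: optional '-', then 0 or a non-zero-led integer, then optional .digits
validNum = '^-?(0|[1-9]\d*)(\.\d+)?$';
line     = '13.22.282,00,07, 0.482,-1.250,0';       % hypothetical record
fields   = strtrim(strsplit(line, ','));            % split on commas, trim spaces
isValid  = ~cellfun(@isempty, regexp(fields, validNum, 'once'));
badFields = fields(~isValid);                       % fields to reject or replace
```

Here 13.22.282, 00 and 07 are rejected while 0.482, -1.250 and 0 pass; a record with any rejected field could then be dropped or filled with -inf before calling textscan.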

ndb on 4 Apr 2019

Edited: ndb on 4 Apr 2019

Just to close this out: I ended up incorporating readtable() (new to 2019a) into my workflow instead of textscan(). I decided to read in each file, play with it, and write out a file of filtered and processed data. While it's a bit slower, readtable() allowed me to deal with the data in a slightly cleaner way. I also ended up killing each line of data that had offending characters, an insufficient number of columns or variables, etc. While this exercise has allowed me to work on my regular expressions, I decided that I don't have time to deal with every little exception and error. For what it's worth, here's the bit of code that uses readtable() to read in my data:

fileList = dir('*.Neph.txt');
varNames = {'Year','Month','Day','Hour','Minute','Sec','Mtrash',...
    'Dtrash','ytrash','Htrash','Mintrash','Sectrash','nm635','nm525',...
    'nm450','back635nm','back525nm','back450nm','SampleTemp',...
    'EnclosureTemp','RH','Pressure','MajorState','DIOState'};
varTypes = {'double','double','double','double','double','double',...
    'double','double','double','double','double','double','double',...
    'double','double','double','double','double','double','double',...
    'double','double','char','char'};
delimiter = {' ','\t',',','/',':','-'};
dataStartLine = 1;
opts = delimitedTextImportOptions('VariableNames',varNames,...
    'VariableTypes',varTypes,...
    'Delimiter',delimiter,...
    'DataLines', dataStartLine,...
    'ConsecutiveDelimitersRule','join',...
    'MissingRule','omitrow',...
    'EmptyLineRule','skip',...
    'ImportErrorRule','omitrow',...
    'ExtraColumnsRule','ignore');
for i = 1:length(fileList)
    if fileList(i).bytes ~= 0
        %read in data
        data = readtable(fileList(i).name,opts);
.
.
.

Thanks to Walter and dpb for the help and suggestions. Cheers
