Textscan encountering unwanted character. How do I kill that line and move on without killing the script??

Hello:
I'm processing a large temporal dataset (data recorded every minute, with 60+ columns). Right now I'm using textscan() to parse it. Maybe 5 times in one file, an upside-down question mark (¿) appears within the data. This kills my script because textscan expects a float. I'd like to avoid that by skipping the column where it finds that character, along with the remaining columns on that line, and treating them as empty. I've attached a few minutes of data that include good lines and one line with the ¿. Here's a bit of ugly code, inside a loop, that deals with that:
filename = fileList(i).name;
delimiter = ' ';
formatSpec = '%*s%*s%*s%*s%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%[^\n\r]';
fileID = fopen(filename,'r');
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, ...
'TextType', 'string', 'ReturnOnError', false,'EmptyValue',-Inf);
fclose(fileID);
I know it's probably not the most efficient way to do it, but it's what I've got now. I've looked a bit into regular expression replacement, but I could never get that to work. Any advice is appreciated.
Comments: 4
dpb on 28 Mar 2019
How big is the actual file?
With today's memory, I'd be tempted to just load the whole thing in memory and clean up the offending lines, then process.
Or, since it's surprisingly fast, just write a quick filter that kills any line it finds with the bum character, or run a standalone grep utility first:
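magicchar = char(N);  % N is the code of the offending character, whatever it is
fidi = fopen('yourfile.txt','r');
fido = fopen('newfile.txt','w');
while ~feof(fidi)
    l = fgets(fidi);                          % fgets keeps the line terminator
    if contains(l,magicchar), continue, end   % skip any line with the bad character
    fprintf(fido,'%s',l);                     % copy clean lines unchanged
end
fclose(fidi);
fclose(fido);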
ndb on 28 Mar 2019
Edited: ndb on 28 Mar 2019
The files aren't big at all, ~400 KB. The thing is, I'd like to keep the data up to the point where the offending character lives, and then kill everything after that... if possible.
You're right though: I should be reading in the whole file, doing my magic on it, and then writing it out somewhere else. I wasn't doing that to begin with because, up to a certain date, the files had an even wonkier format that wasn't uniform. Now that the files coming in are uniform, this will definitely be the way to go. Thanks also for the shortcut on the line formatting. That will help in all my other endeavours as well.
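For the "keep the data up to the offending character, drop everything after it" idea, a minimal in-memory sketch might look like the following. The character code 191 (U+00BF, ¿) and the sample text are assumptions for illustration; in practice S would come from fileread(filename), and the character actually present depends on the file's encoding.

```matlab
% Sketch only: badChar and the sample text are assumptions for illustration.
badChar = char(191);   % U+00BF, the inverted question mark
S = sprintf('1 2 3 4\n5 6%s7 8\n9 10 11 12\n', badChar);   % stand-in for fileread(filename)

% Remove the token that contains badChar and everything after it on that line.
% 'lineanchors' lets $ match at the end of every line; 'dotexceptnewline'
% stops .* from running past the line break.
pattern = ['\S*' badChar '.*$'];
cleaned = regexprep(S, pattern, '', 'lineanchors', 'dotexceptnewline');

% Split into rows to inspect the result.
rows = strsplit(strtrim(cleaned), newline);
```

The truncated rows end up with fewer fields than the good ones, so whatever parses cleaned afterwards has to tolerate short rows or pad them.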

Accepted Answer

Walter Roberson on 28 Mar 2019
Edited: Walter Roberson on 28 Mar 2019
filename = fileList(i).name;
delimiter = ' ';
formatSpec = '%*s%*s%*s%*s%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%[^\n\r]';
ncol = 72;
filler = repmat('-inf ', 1, ncol);
S = fileread(filename);
newS = regexprep(S, '\S*\x00.*?$', filler, 'lineanchors');
dataArray = textscan(newS, formatSpec, 'Delimiter', delimiter, 'TextType', 'string');
You already have a %[^\n\r] to eat to the end of the line. Typically that will get "0 0 " in it (that is, if you were trying to read all the numeric columns as numeric, then your count was off by 2). I take advantage of that eating by detecting the bad characters and substituting an entire line's worth of the -inf pattern, knowing that the -inf will be consumed by the %f formats and that any left-over -inf will be eaten by the %[^\n\r] pattern. You will get dataArray{end} entries that contain a number of "-inf" occurrences. I figure that if the 0 0 were significant for something, you would have read it with %f%f.
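To see the filler trick on something small, here is a scaled-down sketch with made-up data: two skipped text fields, three numeric columns, and a trailing %[^\n\r]. The character code 191 (¿) and the sample lines are assumptions for illustration, not taken from the attached file.

```matlab
% Sketch with made-up data; badChar is an assumed offending character (U+00BF).
badChar = char(191);
S = sprintf('a b 1 2 3 0 0\nc d 4 %s5 6 0 0\ne f 7 8 9 0 0\n', badChar);

% Build a filler at least as long as the number of fields in the format.
ncol = 5;
filler = repmat('-inf ', 1, ncol);

% Replace the bad token and the rest of its line with the filler.
newS = regexprep(S, ['\S*' badChar '.*?$'], filler, 'lineanchors');

% %f consumes as many -inf as it needs; %[^\n\r] swallows the remainder.
C = textscan(newS, '%*s%*s%f%f%f%[^\n\r]', 'Delimiter', ' ');
```

On the damaged line the first numeric column keeps its real value (4), and the columns from the bad token onward should come back as -Inf, which makes them easy to mask afterwards with isinf.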
Comments: 10
Walter Roberson on 29 Mar 2019
Ah, I see it now, the 13.22.282. Unfortunately, textscan is happy to treat that as 13.22 0.282 without noticing anything wrong. So yes, a fair bit would have to be known about the correct representation of numbers on the system. For example, it helps to know for sure that it always puts a leading 0. on valid fractional values < 1: some systems would instead leave out the leading 0 and go directly to the period, '0.282' versus '.282'.
Are the numbers certain to have 3 decimal places? And is it certain that a positive number will always have a single space after the comma but a negative value will have no space after the comma?
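Since textscan will not flag something like 13.22.282 on its own, one option is to validate the tokens with a regular expression before parsing. The sketch below assumes a valid field is an optional minus sign, one or more digits, and at most one decimal point followed by digits; that rule is a guess for illustration, and the real one depends on the answers to the questions above.

```matlab
% Assumed rule for a valid field: optional sign, digits, optional single
% fractional part with a leading digit. Adjust once the real format is known.
tokens = {'13.22', '13.22.282', '-0.282', '.282'};
validNumber = '^-?\d+(\.\d+)?$';

% With 'match' and 'once', regexp returns '' for tokens that do not match.
matches = regexp(tokens, validNumber, 'match', 'once');
isValid = ~cellfun(@isempty, matches);
```

Any line containing a token with isValid false could then be dropped or replaced with filler before it reaches textscan.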
ndb on 4 Apr 2019
Edited: ndb on 4 Apr 2019
Just to close this out: I ended up incorporating readtable() (new to R2019a) into my workflow instead of textscan(). I decided to read in each file, play with it, and write out a file of filtered and processed data. While it's a bit slower, readtable() allowed me to deal with the data in a slightly cleaner way. I also ended up killing each line of data that had offending characters, an insufficient number of columns or variables, etc. While this exercise has let me work on my regular expressions, I decided that I don't have time to deal with every little exception and error. For what it's worth, here's the bit of code that uses readtable() to read in my data:
fileList = dir('*.Neph.txt');
varNames = {'Year','Month','Day','Hour','Minute','Sec','Mtrash',...
'Dtrash','ytrash','Htrash','Mintrash','Sectrash','nm635','nm525',...
'nm450','back635nm','back525nm','back450nm','SampleTemp',...
'EnclosureTemp','RH','Pressure','MajorState','DIOState'};
varTypes = {'double','double','double','double','double','double',...
'double','double','double','double','double','double','double',...
'double','double','double','double','double','double','double',...
'double','double','char','char'};
delimiter = {' ','\t',',','/',':','-'};
dataStartLine = 1;
opts = delimitedTextImportOptions('VariableNames',varNames,...
'VariableTypes',varTypes,...
'Delimiter',delimiter,...
'DataLines', dataStartLine,...
'ConsecutiveDelimitersRule','join',...
'MissingRule','omitrow',...
'EmptyLineRule','skip',...
'ImportErrorRule','omitrow',...
'ExtraColumnsRule','ignore');
for i = 1:length(fileList)
if fileList(i).bytes ~= 0
%read in data
data = readtable(fileList(i).name,opts);
.
.
.
Thanks to Walter and dpb for the help and suggestions. Cheers

Release: R2019a
