read inconsistent ascii file to matrix

Question

Tom DeLonge, 27 Mar 2019

https://kr.mathworks.com/matlabcentral/answers/452816-read-inconsistent-ascii-file-to-matrix

Commented: Tom DeLonge, 10 Apr 2019

I'd like to get maximum performance when reading a file that contains both numeric and non-numeric lines. The files typically look like this:

% comment
text 1.49
1.52 -5.3 8.9710
3.629 -5.77 9
another text and numbers
% comment again
1 2 3
and so on

The file can easily contain 1 million lines.

I would like to obtain two cell arrays:

1. One that contains all rows that match %f %f %f, i.e. a numeric triplet, already parsed as doubles. Invalid lines should show up as empty entries or NaN.
2. Another that contains all rows that did not match cell array 1, still as a cellstr, preferably with whitespace trimmed.

Obtaining cell array 2 is fairly simple once you have cell array 1: read the lines with textscan and set all rows that did not match 1 as empty. However, I struggle to obtain cell array 1, because textscan stops reading as soon as it encounters an invalid line.

In a working example I used sscanf and parsed everything line-by-line. This took about 15s for 1 million lines. Since textscan can read the whole file in less than a second, I am confident that there is room for improvement...

Comments: 4 (2 earlier comments not shown)

Jan, 1 Apr 2019

Edited: Jan, 1 Apr 2019

What is the purpose of searching for ["a" "e" "i" "o" "u" "A" "E" "I" "O" "U"]? What do you call "a lot of memory"? Can you provide an example file?

Tom DeLonge, 9 Apr 2019

Sorry, I was on vacation last week.

I found searching for vowels to be the fastest way to find all rows that also contain non-numeric data. As said above, it is faster than a regular expression, and it works because in my case every text-containing row has a vowel in it.

By "a lot of memory" I mean that the data array occupies about 100 MB of RAM for a 10 MB text file, a factor-of-10 overhead. 100 MB is not dramatic yet, but the overhead gets worse for even larger files.

The file is proprietary, which means I cannot provide an example file. But the few lines I've shown above should come pretty close...
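The vowel prefilter described in this comment could look roughly like the sketch below. The original code is not shown in the thread, so the use of `contains`, the variable names, and the single `sscanf` call on the joined lines are all assumptions. The sketch also assumes that every line without a vowel is a clean numeric triplet and that no number uses exponent notation such as `1e5`, which would be misclassified as text.

```matlab
% Sample lines taken from the question (stand-in for the real file).
C = {'% comment', 'text 1.49', '1.52 -5.3 8.9710', '3.629 -5.77 9', ...
     'another text and numbers', '% comment again', '1 2 3', 'and so on'};

% Assumption: a line is text if it contains any vowel or starts with '%'.
vowels = ["a" "e" "i" "o" "u" "A" "E" "I" "O" "U"];
isText = contains(C, vowels) | strncmp(C, '%', 1);

TextC = C(isText);

% Parse all remaining lines in one call instead of one sscanf per line.
% This only works if every remaining line holds exactly three numbers.
Num = sscanf(strjoin(C(~isText), char(10)), '%f', [3, Inf]).';
```

Here `Num` has one row per numeric line and `TextC` keeps the comment and text lines in file order.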


Answer 1 (accepted answer)

Jan, 27 Mar 2019

https://kr.mathworks.com/matlabcentral/answers/452816-read-inconsistent-ascii-file-to-matrix#answer_367698

Edited: Jan, 9 Apr 2019


Data  = fileread(FileName);          % FileName: path to the input file
C     = strsplit(Data, char(10));    % one cell per line
% [EDITED] Remove comments:
C(strncmp(C, '%', 1)) = [];
match = true(size(C));               % true = line is kept as text
NumC  = cell(size(C));               % parsed triplets; stays [] for text lines
for iC = 1:numel(C)
    % [EDITED2] Small shortcut: only parse lines starting with a digit, '-' or '.'
    aC = C{iC};
    if ~isempty(aC) && any(aC(1) == '1234567890-.')
        [Num, n] = sscanf(aC, '%g %g %g');
        if n == 3
            NumC{iC}  = Num;
            match(iC) = false;
        end
    end
end
TextC = C(match);

Is this your current version using a loop? How long does it take?
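The question asks for invalid lines to show up as empty entries or NaN. As a minimal sketch, the `NumC` cell produced by the loop above could be collected into one numeric matrix with NaN rows. The matrix name `M` and the hard-coded stand-in for `NumC` are illustrative assumptions, not part of the answer.

```matlab
% Hypothetical stand-in for the NumC built by the loop: 3-by-1 columns for
% numeric lines, [] for everything else.
NumC = {[], [1.52; -5.3; 8.971], [], [1; 2; 3]};

M     = NaN(numel(NumC), 3);           % one row per line, NaN by default
isNum = ~cellfun('isempty', NumC);     % lines that produced a triplet
M(isNum, :) = [NumC{isNum}].';         % concatenate columns, then transpose
```

A plain double matrix like `M` is typically more compact than a cell array holding one small vector per line, which may help with the memory overhead mentioned in the question's comments.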

Comments: 5 (3 earlier comments not shown)

Jan, 10 Apr 2019

@Tom: textscan is fast for valid inputs, and I expect fscanf to be even faster. But as soon as the input cannot be described by a simple format specifier, processing gets much slower.

C code would also be faster, but it is very tedious to write. It must import the file line by line, so you have to create a buffer large enough to hold the longest line. Unfortunately, you know neither that length nor the number of outputs in advance, and reallocating the output array dynamically is a mess in C. So the code might run a few seconds faster, but you would need many hours for writing and testing. That is why I like MATLAB.

Tom DeLonge, 10 Apr 2019

Yes, I do agree and understand the limitations of textscan. Thank you for the insights!

Answer 2

Guillaume, 27 Mar 2019

https://kr.mathworks.com/matlabcentral/answers/452816-read-inconsistent-ascii-file-to-matrix#answer_367697

Edited: Guillaume, 27 Mar 2019


Unfortunately, textscan has no option to ignore invalid lines, so you will have to parse the file line by line or implement the parsing in a MEX function.

The following takes about 10s on my machine for a million lines. It's probably similar to what you've done already:

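function [num, text] = parsefile(path)
    % Split the file into lines, then try to parse each line as numbers.
    lines = strsplit(fileread(path), '\n');
    % sscanf returns [] for lines that do not start with a number.
    num = cellfun(@(l) sscanf(l, '%f %f %f')', lines, 'UniformOutput', false);
    text = lines(cellfun(@isempty, num));  % could use cellfun('isempty', num) for a marginal speed gain
end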

Comments: 1

Tom DeLonge, 27 Mar 2019

Thanks, this version takes 20 s on my computer, so it is a bit slower than Jan's.
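Because the real file is proprietary, timings like the ones quoted in this thread can only be reproduced on synthetic data. The sketch below builds a test file by repeating the sample lines from the question and times a plain line-by-line sscanf loop with tic/toc. The block count, the temporary file name, and the variable names are assumptions for illustration; increase `nBlocks` to approach a million lines.

```matlab
% Assumption: repeating the sample lines is representative of the real file.
nBlocks = 1000;
block = sprintf('%s\n', '% comment', 'text 1.49', '1.52 -5.3 8.9710', ...
                '3.629 -5.77 9', 'another text and numbers', '1 2 3');

fileName = [tempname '.txt'];          % hypothetical temporary test file
fid = fopen(fileName, 'w');
fwrite(fid, repmat(block, 1, nBlocks));
fclose(fid);

tic;
C = strsplit(fileread(fileName), char(10));
isTriplet = false(size(C));
for k = 1:numel(C)
    [~, n] = sscanf(C{k}, '%g %g %g'); % n = number of values parsed
    isTriplet(k) = (n == 3);
end
elapsed = toc;

fprintf('%d lines, %d triplets, %.3f s\n', numel(C), nnz(isTriplet), elapsed);
delete(fileName);
```

The same harness can wrap either answer's parser, so that the approaches are compared on identical input and on the same machine.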
