File Reading - Skip to next line

dvd7e on 5 Jul 2016
Edited: Stephen23 on 5 Jul 2016
Hello, I have a very large text file (a few gigabytes). The file size is dominated by how much data there is per line, not by the total number of lines. I only need a small fraction of these lines (which are not evenly spaced), and until now I have been reading every line with fgetl() and discarding the data if I didn't need it. But this has become really slow.
What I want is to be able to read in just the first few characters of a line, and if it matches my search criteria then read in the rest of the line. Otherwise, I don't really care what's in there and I don't want to waste the time of reading it in, so skip to the next line. Can this be done?
fread(fid,[1 10],'*char') seems to handle reading in the first few characters, but then how do I skip to the next line without actually reading in the rest of the data on that line?
  1 Comment
Stephen23 on 5 Jul 2016
Edited: Stephen23 on 5 Jul 2016
@Michael Epstein: you need to think about what a "line" really means: some (zero or more) non-newline characters separated by newline characters.
  • Question: How does a program know where the newline characters are?
  • Answer: It has to read all the characters until it finds a newline character.
So what you are proposing is self-contradictory: you want to skip characters (jump to the next newline) by reading all of the characters until the next newline...
There is no simple solution to this for standard text files with arbitrary line lengths. There may be file formats that have some index of the line locations, or that use a fixed line length.
Probably a better solution would be to store the data in a better format to start with (some binary file like a .mat file), or read your data file once, filter the parts that you need, and then save this in a more efficient form (e.g. a .mat file). Designing good data structures and storage is one of the most important steps of program design, but is sadly underrated by many coders, even though it makes a huge difference to program operation and efficiency.
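The "read once, filter, save" idea could look roughly like the sketch below. The file contents, the 'KEEP' prefix and the variable names are all made up for illustration; a real file would replace the small demo file written at the top.

```matlab
% Demo input; in practice srcFile would be the existing large text file.
srcFile = [tempname '.txt'];            % hypothetical path
fid = fopen(srcFile, 'w');
fprintf(fid, 'KEEP 1 2 3\nSKIP 4 5 6\nKEEP 7 8 9\n');
fclose(fid);

prefix = 'KEEP';                        % hypothetical search criterion
kept = {};                              % lines whose start matches prefix

% One slow pass over the text file: every line is still read in full here.
fid = fopen(srcFile, 'r');
txt = fgetl(fid);
while ischar(txt)                       % fgetl returns -1 at end of file
    if strncmp(txt, prefix, numel(prefix))
        kept{end+1} = txt;              %#ok<AGROW>
    end
    txt = fgetl(fid);
end
fclose(fid);

% Store only the filtered lines; later runs load this MAT-file instead.
matFile = [tempname '.mat'];            % hypothetical output path
save(matFile, 'kept');
```

This does not make the first pass any faster; the point is only that the expensive scan happens once and later runs call load(matFile).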


Answers (1)

José-Luis on 5 Jul 2016
Edited: José-Luis on 5 Jul 2016
I'm afraid that is beyond what MATLAB, and most I/O routines, can do. In order to know where the next line begins you need to know where the current line ends, which means scanning the entire line until you find a newline. If all lines were the same length you could use low-level I/O to skip a fixed number of bytes, but my guess is that the gain, if any, would be minimal.
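For the fixed-length case only, skipping bytes might look something like this sketch. It assumes every line is exactly recLen bytes including the newline and that the text is plain ASCII; the demo file, recLen and the 'AAA' prefix are invented for the example.

```matlab
% Demo file: every line is exactly 14 bytes (13 characters + newline).
fname = [tempname '.txt'];              % hypothetical path
fid = fopen(fname, 'w');
fprintf(fid, 'AAA-data-0001\nBBB-data-0002\nAAA-data-0003\n');
fclose(fid);

recLen = 14;                            % assumed fixed record length in bytes
wanted = 'AAA';                         % hypothetical search criterion
nHead  = numel(wanted);
hits   = {};

fid = fopen(fname, 'r');
while true
    head = fread(fid, [1 nHead], '*char');   % read only the line prefix
    if numel(head) < nHead
        break                                 % end of file reached
    end
    if strcmp(head, wanted)
        rest = fgetl(fid);                    % read the remainder of this line
        hits{end+1} = [head rest];            %#ok<AGROW>
    else
        fseek(fid, recLen - nHead, 'cof');    % jump over the rest of the record
    end
end
fclose(fid);
```

Whether this is measurably faster than fgetl() on a real file would need to be timed, since the operating system may still read the skipped bytes from disk in blocks.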
You could try memmapfile(), but that would mostly help if all you have is numeric data.
The better approach would be to either modify the file externally in order to get only the lines of interest (e.g. grep) or if you have access to the program that generated the files, then only save what you need.
Also, text files are not the fastest format around.
  1 Comment
Walter Roberson on 5 Jul 2016
If you re-read the same file multiple times then it can be worthwhile to preprocess it once to determine the ftell() position of the start of each line. After that you can check the first few characters of a line and fseek() straight to the next line if you are not interested.
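A minimal sketch of that two-pass idea, assuming plain ASCII text and a made-up 'KEEP' prefix (the demo file and variable names are illustrative, not from the original question):

```matlab
% Demo file with lines of different lengths.
fname = [tempname '.txt'];              % hypothetical path
fid = fopen(fname, 'w');
fprintf(fid, 'KEEP 1 2 3\nSKIP 4 5 6 7 8\nKEEP 9\n');
fclose(fid);

% Pass 1 (done once): record the byte offset at which each line starts.
fid = fopen(fname, 'r');
offsets = [];
while true
    pos = ftell(fid);                   % position before reading the line
    txt = fgetl(fid);
    if ~ischar(txt)
        break                           % fgetl returned -1: end of file
    end
    offsets(end+1) = pos;               %#ok<AGROW>
end
fclose(fid);
% offsets could be stored in a MAT-file here and reused in later sessions.

% Later passes: peek at the prefix, read the full line only on a match.
prefix = 'KEEP';                        % hypothetical search criterion
found  = {};
fid = fopen(fname, 'r');
for k = 1:numel(offsets)
    fseek(fid, offsets(k), 'bof');
    head = fread(fid, [1 numel(prefix)], '*char');
    if strcmp(head, prefix)
        fseek(fid, offsets(k), 'bof');  % rewind to the start of this line
        found{end+1} = fgetl(fid);      %#ok<AGROW>
    end
end
fclose(fid);
```

The index only pays off if the file is read more than once, because building it still requires one complete pass with fgetl().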

