How to read data from desired lines of a large data set?

George on 5 Oct 2012
Dear all, I want to read selected lines from a large data set (>50 GB). The file is too big to load in full by simply calling textscan.
What I have come up with is:
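fid = fopen('data.dat');
nline = 0;           % index of the line just read
wline = 1000:10^7;   % the wanted lines (sorted ascending)
i = 1;               % index into wline
% stop at end of file or once the last wanted line has been read
while ~feof(fid) && nline < max(wline)
    ldata = fgets(fid);
    nline = nline + 1;
    if nline == wline(i)
        datas{i} = ldata;   % cell array: each line is a char vector
        i = i + 1;
    end
end
fclose(fid);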
As you can see, this loop is very time consuming. My questions are: 1. Is there a function that reads the file faster (on a Unix system)? 2. Is it possible to use a file pointer, so that only the desired lines are read?
Thank you,
George
The data set has 10^9 lines and 4 columns:
0 0 0 0.5
0 0.05 200.05 1 ...

Answers (1)

José-Luis on 5 Oct 2012
Edited: José-Luis on 5 Oct 2012
That is one big chunk of data. I have several suggestions:
  • Preallocate: in your code you are growing datas at each iteration. Preallocate using, e.g.
datas = ones(numLines,5);
This might not be a viable option if you want to allocate a 10^9 x 5 matrix, which would need about 40 GB as doubles.
  • Split your data into several chunks that you can read when needed. Look at the Unix split utility.
  • Use a database.
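To illustrate the preallocation advice, here is a minimal sketch of the line-by-line loop with the output matrix sized up front. The sample file, the 4-column layout, and the small line range are assumptions made so the snippet runs on its own; they stand in for the real data.dat and the range 1000:10^7.

```matlab
% Write a small sample file (hypothetical stand-in for data.dat).
fname = [tempname '.dat'];
fid = fopen(fname, 'w');
for k = 1:20
    fprintf(fid, '%d %g %g %g\n', k, 0.05*k, 200.05, 0.5);
end
fclose(fid);

wline = 5:12;                      % wanted lines, sorted ascending
datas = zeros(numel(wline), 4);    % preallocated: one row per wanted line

fid = fopen(fname, 'r');
nline = 0;                         % index of the line just read
i = 1;                             % index into wline
while i <= numel(wline)
    ldata = fgetl(fid);
    if ~ischar(ldata)              % fgetl returns -1 at end of file
        break
    end
    nline = nline + 1;
    if nline == wline(i)
        datas(i, :) = sscanf(ldata, '%f', [1 4]);   % parse the 4 numbers
        i = i + 1;
    end
end
fclose(fid);
delete(fname);
```

Preallocating removes the repeated reallocation of datas, but the loop still has to read every line up to the last wanted one.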
If you want to read just one line and know its exact position (in bytes from the beginning of the file), you could always try fseek.
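When the wanted lines form one contiguous range, as with 1000:10^7 in the question, one possible alternative to a per-line loop is a single textscan call that skips the leading lines with 'HeaderLines' and reads a fixed number of rows. This is only a sketch: the sample file and the 4-column numeric format are assumptions, and it has not been timed against a 50 GB file.

```matlab
% Write a small sample file (hypothetical stand-in for data.dat).
fname = [tempname '.dat'];
fid = fopen(fname, 'w');
for k = 1:20
    fprintf(fid, '%d %g %g %g\n', k, 0.05*k, 200.05, 0.5);
end
fclose(fid);

firstLine = 5;     % first wanted line (1000 in the question)
lastLine  = 12;    % last wanted line (10^7 in the question)

fid = fopen(fname, 'r');
% Skip firstLine-1 lines, then apply the 4-column format once per wanted line.
C = textscan(fid, '%f %f %f %f', lastLine - firstLine + 1, ...
             'HeaderLines', firstLine - 1);
fclose(fid);
delete(fname);

datas = [C{:}];    % numeric matrix, one row per wanted line
```

Because the file identifier keeps its position between calls, the same pattern can be repeated to process a large file block by block without loading it all at once.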
  2 Comments
George on 5 Oct 2012
Thank you for your helpful suggestions, José.
The problem is that the number of bytes changes from line to line, which makes it difficult to calculate the exact position.
Again, thank you. George
José-Luis on 5 Oct 2012
My pleasure.
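When line lengths vary, one way to still use fseek is to scan the file once, record the byte offset at which each line starts with ftell, and reuse that index for later reads. The sketch below assumes a tiny sample file and hypothetical variable names (offsets, wanted); it is an illustration of the idea, not a tuned solution.

```matlab
% Write a small sample file with lines of different lengths
% (hypothetical stand-in for data.dat).
fname = [tempname '.dat'];
fid = fopen(fname, 'w');
fprintf(fid, '0 0 0 0.5\n0 0.05 200.05 1\n10 2.5 3 4\n7 8 9 10.25\n');
fclose(fid);

fid = fopen(fname, 'r');
offsets = zeros(4, 1);         % preallocated; the sample has 4 lines
n = 0;
while true
    pos = ftell(fid);          % byte offset where the next line starts
    if ~ischar(fgetl(fid))     % fgetl returns -1 at end of file
        break
    end
    n = n + 1;
    offsets(n) = pos;
end
offsets = offsets(1:n);

wanted = 3;                            % line to fetch
fseek(fid, offsets(wanted), 'bof');    % jump straight to its first byte
vals = sscanf(fgetl(fid), '%f').';     % parse that line into a row vector
fclose(fid);
delete(fname);
```

Building the index still costs one full pass, but it can be saved and reused. For 10^9 lines a full index of doubles takes about 8 GB, so storing only every k-th offset and reading forward from the nearest one may be more practical.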
