Fast move in datastore (TabularTextDatastore)

Question

Adam Koutný 2017년 10월 26일

0
링크

이 질문에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/363438-fast-move-in-datastore-tabulartextdatastore

댓글: AnaelG 2018년 2월 9일

Hi,

I'm using TabularTextDatastore to iteratively read from several huge (~GB) text files. Is there a way how to move in a datastore say ~millions of rows without actually reading the data (some kind of offset)?

Note:

a) One step reading is impractical due the memory requirements.

b) Iterative reading of smaller portions requires some time...

c) Need to check data while actually reading due to this problem https://www.mathworks.com/matlabcentral/answers/260298-how-to-solve-problem-with-smaller-number-of-records-read-from-datastore

Thanks for your help.

Adam

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Answer 1

Jeremy Hughes 2017년 10월 26일

2
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/363438-fast-move-in-datastore-tabulartextdatastore#answer_287921

편집: Jeremy Hughes 2017년 10월 26일

MATLAB Online에서 열기

Hi Adam,

Unfortunately, there's not a way to do this. (And I don't mean with datastore--text files are linear things... to find the rows you have to read the rows.) Assuming there was an API for it, the underlying code would still need to read the data in between to know where each "row" starts and stops to skip the right data.

If you know more about what's in the file than just text (like if each row is 3000 characters long) you could possibly implement something with a custom datastore. Check the documentation here: Custom Datastore in R2017b

I have a suspicion you're looking for a performance improvement here, and you might see something by looking at the SelectedVariableNames. The data is still being read but no import happens for variables that aren't selected. So if you don't need all the variables, you could possibly skip some, saving memory and time.

ds = tabularTextDatastore(files...)
ds.SelectedVariableNames = {'a';'c'}; % Only import two variables
data = read(ds)

Hope this helps,

Jeremy

댓글 수: 2
없음 표시없음 숨기기

Walter Roberson 2017년 10월 26일

편집: Walter Roberson 2017년 10월 26일

Formally speaking, POSIX says that text files that might have carriage returns as well as new lines cannot be positioned to particular numeric locations, so even if you knew that each row is 3000 characters long, if you were operating in text mode then you would be forbidden to seek to position (3000+2)*(row_number-1) where the +2 is the number of bytes occupied by CR and LF together. POSIX says that for text files, you can only seek to a "marker" that has been previously returned by ftell() on the file, and that the marker may be opaque (e.g., could be a hash value, could be a marker number with the position stored internally.)

That is the formal requirements: in practice what you do with such files is fopen() them in binary mode and seek by offset, either knowing or having deduced that the CR is present and needs to be included in the counting.

More practically though, if a file is UTF encoded then character representation in the file is variable length, and seeking to numeric positions does not become useful (not unless each record has a record identifier and a unique boundary marker, in which case this process can at least get you closer to where you need to be.)

If you are using the same file multiple times then it can become worthwhile to build an index for it.... however at that point you should start considering whether you should just be tossing the records into a database.

Jeremy Hughes 2017년 10월 26일

Very good points; I was thinking ascii data and probably should have said "3000 bytes".

댓글을 달려면 로그인하십시오.

Answer 2

Aaditya Kalsi 2017년 10월 26일

2
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/363438-fast-move-in-datastore-tabulartextdatastore#answer_287976

MATLAB Online에서 열기

If where you want to seek to in the datastore is approximate, there may be a way to do this using PARTITION:

% divide the datastore into 1000 parts and pick the 4th
subds = partition(ds,1000,4);

Granted that this is not exact, but it may be what you are looking for.

In case it is, more details here.

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

AnaelG 2018년 2월 9일

MATLAB Online에서 열기

Not sure it applies here but this worked for me to at least be able to skip to the Nth file.

    ds= tabularTextDatastore(files...);
    ds.ReadSize= 'file';
    numFiles= length(ds.Files);
    tableN= read(ds.partition('Files', N ))

댓글을 달려면 로그인하십시오.

Answer 3

Adam Koutný 2017년 10월 27일

0
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/363438-fast-move-in-datastore-tabulartextdatastore#answer_288085

MATLAB Online에서 열기

Hi all, thanks for your answers, appreciate it.

I'm actually using two datastores operation on the same data. The first is used to read only those columns (kind of timestamp) subsequenty utilized to filter/find appropriate data. Then the second reads much more columns of data.

In ordinary way, whole file is processed and reading goes regularly from the begining till the end of file - this works fine as datastores remember a position of lastly read record (row/byte). However, sometimes there is a requirement of reading just portion of data of arbitrary time interval. In case of an interval far from the datastore/file begining, it is quite effective to find appropriate data in the first datastore, however, moving to the same position in second datastore takes relatively long time.

I'm not sure if I can use fseek or ftell in connection with datastore. I don't think so. I was hoping that there is some "undocumented" datastore feature, e.g. that property holding the actual position.

Unfortunately, reading as

[~] = datastore.read();

is not a save way due to c) above. On the other, there is some information about position in info...

[data, info] = datastore.read();

Thanks again

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

Jeremy Hughes 2017년 10월 27일

MATLAB Online에서 열기

Hmmm, you might try using tall arrays. This is newer than datastore, and I don't know which release you're using.

>> ds = tabularTextDatastore(files...);
>> T = tall(ds)
>> subT = T(T.Var1 > 3 & T.Var1 < 6,:)
>> gather(subT)

Tall arrays let you operate on the datastore in many of the same ways you can work on in-memory arrays. If you have Parallel Computing Toolbox, you can execute your calculations on multiple workers. You can also use the same SelectedVariableNames optimization.

I believe this will help your workflow. Also, look into TIMETABLE which has additional features for working with timebased data. Based on your description, it sounds like you might be able to do what you're looking for with something like this:

ds = tabularTextDatastore('airlinesmall.csv')
ds.SelectedVariableNames = {'Year','Month','DayofMonth'}
T = tall(ds)
TT = table2timetable(T,'RowTimes',datetime(T.Year,T.Month,T.DayofMonth))
subTT = TT(timerange('16-Oct-1987','21-Oct-1987'),:)
gather(subTT)

Hope this helps,

Jeremy

댓글을 달려면 로그인하십시오.

Fast move in datastore (TabularTextDatastore)

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

채택된 답변

댓글 수: 2
없음 표시없음 숨기기

추가 답변 (2개)

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

참고 항목

카테고리

태그

Community Treasure Hunt

Fast move in datastore (TabularTextDatastore)

댓글 수: 0 이전 댓글 -2개 표시이전 댓글 -2개 숨기기

채택된 답변

댓글 수: 2 없음 표시없음 숨기기

추가 답변 (2개)

댓글 수: 1 이전 댓글 -1개 표시이전 댓글 -1개 숨기기

댓글 수: 1 이전 댓글 -1개 표시이전 댓글 -1개 숨기기

참고 항목

카테고리

태그

Community Treasure Hunt

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글 수: 2
없음 표시없음 숨기기

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기