Pre-indexing a tall datastore

Question

0 개 추천

Hello,

I'd like to generate a set of indicies for my tall datastore in advance so, when I want to select rows, it will be fast.

Here is what I came up with.

I expect there is a much better way. Please help me improve it.

FYI, the datastore will be several million rows by four columns. I ran the following test on a very small file.

Thanks

Michael

Code:

% CREATE TALL DATASTORE
ds = datastore(fname);
ds.VariableNames = {'DATE','TICKER','FIELD','VALUE'};
ds.SelectedFormats(1:3) = {'%{MM/dd/uuuu}D','%C','%C'};
tds = tall(ds);
gDATE = unique(tds.DATE);
gTICKER = unique(tds.TICKER);
gFIELD = unique(tds.FIELD);
% FIND UNIQUE LABELS
[uniqueDates, uniqueTickers, uniqueFields] = gather(gDATE,gTICKER,gFIELD);
% FIND THE INDEX OF EACH UNIQUE LABEL
for iTicker = 1:length(uniqueTickers)
    gTicker = find(tds.TICKER==uniqueTickers(iTicker));
    idxTicker{iTicker} = gather(gTicker);
end
for iField = 1:length(uniqueFields)
    gField = find(tds.FIELD==uniqueFields(iField));
    idxField{iField} = gather(gField);
end
% TABULATE
tTickerIndex = array2table([cellstr(uniqueTickers) idxTicker.'], ...
    'VariableNames',{'TICKER','INDEX'});
tTickerIndex.TICKER= categorical(tTickerIndex.TICKER)
tFieldIndex = array2table([cellstr(uniqueFields) idxField.'], ...
    'VariableNames',{'FIELD','INDEX'})
tFieldIndex.FIELD = categorical(tFieldIndex.FIELD)
% DISPLAY FIRST 10 VALUES FOR EQY_DVD_YLD_IND
out = tds(tFieldIndex.INDEX{tFieldIndex.FIELD=='EQY_DVD_YLD_IND'}(1:10),:);
gather(out)

Output

tTickerIndex = 1×2 table
        TICKER              INDEX      
    _______________    ________________
    DFIVX US EQUITY    [34112×1 double]
    
tFieldIndex = 6×2 table
                 FIELD                       INDEX     
    ________________________________    _______________
    DAY_TO_DAY_TOT_RETURN_GROSS_DVDS    [6423×1 double]
    DAY_TO_DAY_TOT_RETURN_NET_DVDS      [6423×1 double]
    DIVIDEND_INDICATED_YIELD            [6242×1 double]
    EQY_DVD_YLD_IND                     [6242×1 double]
    EQY_DVD_YLD_IND_NET                 [6242×1 double]
    FUND_TOTAL_ASSETS                   [2540×1 double]
    
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 2: Completed in 1.1 sec
- Pass 2 of 2: Completed in 0.98 sec
Evaluation completed in 2.8 sec
ans = 10×4 table
       DATE           TICKER              FIELD         VALUE 
    __________    _______________    _______________    ______
    06/01/1994    DFIVX US EQUITY    EQY_DVD_YLD_IND    2.1484
    06/02/1994    DFIVX US EQUITY    EQY_DVD_YLD_IND    2.1611
    06/03/1994    DFIVX US EQUITY    EQY_DVD_YLD_IND    2.1718
    06/06/1994    DFIVX US EQUITY    EQY_DVD_YLD_IND    2.1654
    06/07/1994    DFIVX US EQUITY    EQY_DVD_YLD_IND     2.159
    06/08/1994    DFIVX US EQUITY    EQY_DVD_YLD_IND    2.1442
    06/09/1994    DFIVX US EQUITY    EQY_DVD_YLD_IND    2.1484
    06/10/1994    DFIVX US EQUITY    EQY_DVD_YLD_IND     2.138
    06/13/1994    DFIVX US EQUITY    EQY_DVD_YLD_IND    2.1277
    06/14/1994    DFIVX US EQUITY    EQY_DVD_YLD_IND    2.1277

댓글 수: 0
이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Follow Question

Answer 1

Michael 2019년 9월 1일

MATLAB Online에서 열기

0 개 추천

Apparently, indexing is completely ineffective. My data is about 950 million rows by 4 columns

ds = datastore('tallFinal.csv');
ds.SelectedFormats={'%q','%C','%C','%s'};
ds.VariableNames = {'DATE','TICKER','FIELD','VALUE'};
tds = tall(ds);
i = int16(intersect(idxTicker,idxField));
size(i)
ans = 225 1
temp = tds(i,:);
temp = gather(temp);
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 3: Completed in 23 min 0 sec
- Pass 2 of 3: 33% complete