Pre-indexing a tall datastore

조회 수: 6 (최근 30일)
Michael
Michael 2019년 8월 30일
답변: Rajani Mishra 2020년 2월 13일
Hello,
I'd like to generate a set of indicies for my tall datastore in advance so, when I want to select rows, it will be fast.
Here is what I came up with.
I expect there is a much better way. Please help me improve it.
FYI, the datastore will be several million rows by four columns. I ran the following test on a very small file.
Thanks
Michael
Code:
% CREATE TALL DATASTORE
ds = datastore(fname);
ds.VariableNames = {'DATE','TICKER','FIELD','VALUE'};
ds.SelectedFormats(1:3) = {'%{MM/dd/uuuu}D','%C','%C'};
tds = tall(ds);
gDATE = unique(tds.DATE);
gTICKER = unique(tds.TICKER);
gFIELD = unique(tds.FIELD);
% FIND UNIQUE LABELS
[uniqueDates, uniqueTickers, uniqueFields] = gather(gDATE,gTICKER,gFIELD);
% FIND THE INDEX OF EACH UNIQUE LABEL
for iTicker = 1:length(uniqueTickers)
gTicker = find(tds.TICKER==uniqueTickers(iTicker));
idxTicker{iTicker} = gather(gTicker);
end
for iField = 1:length(uniqueFields)
gField = find(tds.FIELD==uniqueFields(iField));
idxField{iField} = gather(gField);
end
% TABULATE
tTickerIndex = array2table([cellstr(uniqueTickers) idxTicker.'], ...
'VariableNames',{'TICKER','INDEX'});
tTickerIndex.TICKER= categorical(tTickerIndex.TICKER)
tFieldIndex = array2table([cellstr(uniqueFields) idxField.'], ...
'VariableNames',{'FIELD','INDEX'})
tFieldIndex.FIELD = categorical(tFieldIndex.FIELD)
% DISPLAY FIRST 10 VALUES FOR EQY_DVD_YLD_IND
out = tds(tFieldIndex.INDEX{tFieldIndex.FIELD=='EQY_DVD_YLD_IND'}(1:10),:);
gather(out)
Output
tTickerIndex = 1×2 table
TICKER INDEX
_______________ ________________
DFIVX US EQUITY [34112×1 double]
tFieldIndex = 6×2 table
FIELD INDEX
________________________________ _______________
DAY_TO_DAY_TOT_RETURN_GROSS_DVDS [6423×1 double]
DAY_TO_DAY_TOT_RETURN_NET_DVDS [6423×1 double]
DIVIDEND_INDICATED_YIELD [6242×1 double]
EQY_DVD_YLD_IND [6242×1 double]
EQY_DVD_YLD_IND_NET [6242×1 double]
FUND_TOTAL_ASSETS [2540×1 double]
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 2: Completed in 1.1 sec
- Pass 2 of 2: Completed in 0.98 sec
Evaluation completed in 2.8 sec
ans = 10×4 table
DATE TICKER FIELD VALUE
__________ _______________ _______________ ______
06/01/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1484
06/02/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1611
06/03/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1718
06/06/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1654
06/07/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.159
06/08/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1442
06/09/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1484
06/10/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.138
06/13/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1277
06/14/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1277

답변 (2개)

Michael
Michael 2019년 9월 1일
Apparently, indexing is completely ineffective. My data is about 950 million rows by 4 columns
ds = datastore('tallFinal.csv');
ds.SelectedFormats={'%q','%C','%C','%s'};
ds.VariableNames = {'DATE','TICKER','FIELD','VALUE'};
tds = tall(ds);
i = int16(intersect(idxTicker,idxField));
size(i)
ans = 225 1
temp = tds(i,:);
temp = gather(temp);
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 3: Completed in 23 min 0 sec
- Pass 2 of 3: 33% complete

Rajani Mishra
Rajani Mishra 2020년 2월 13일
I found below link related to indexing in tall array, find it below:
You can also consider creating a custom datastore and process data in smaller groups. For creating custom datastore please refer below:
Hope this helps!

카테고리

Help CenterFile Exchange에서 Tall Arrays에 대해 자세히 알아보기

제품


릴리스

R2019a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by