Extracting a wide table from a very tall datastore array, also trying KDB+

조회 수: 1 (최근 30일)
Michael
Michael 2019년 8월 28일
댓글: Michael 2019년 9월 12일
Hello,
I am currently storing about 40GB of data in a 4 column tall array with about 1B rows. I'm using a Matlab Tall Array Datastore, but I also plan on experimenting with KDB+.
Could someone comment on the best way to extract a wide table, with about 350,000 rows and about 20 columns by selecting some of the items in the first three columns and the corresponding values,e.g DATE = {29-Jul-1983:31-Aug-1983} & STOCK = AAPL US EQUITY & FIELD= {MOV_AVG_50D & PRICE}?
Thank You,
Michael
Input
DATE,STOCK,FIELD,VALUE
29-Jul-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.8979
31-Aug-1983 00:00:00,BHP AT EQUITY,PRICE,0.9029
29-Jul-1983 00:00:00,IBM US EQUITY,MOV_AVG_50D,0.9106
31-Aug-1983 00:00:00,IBM US EQUITY,PRICE_50D,0.9154
29-Jul-1983 00:00:00,AAPL US EQUITY,MOV_AVG_50D,0.9227
31-Aug-1983 00:00:00,AAPL US EQUITY,PRICE,0.9311
Output:
DATE,MOV_AVG_50D,PRICE
29-Jul-1983 00:00:00,0.9227,NaN
31-Aug-1983 00:00:00,NaN,0.9311
  댓글 수: 1
Michael
Michael 2019년 8월 29일
This is what I tried so far.
ds = datastore('bigtall.csv','DatetimeType','text');
ds.SelectedFormats={'%q','%C','%C','%f'};
tds = tall(ds);tm.cputime(1)=cputime;
tic;
t = tds(tds.TICKER=='UNIT US EQUITY' & tds.FIELD=='EBITDA_MARGIN',:);
T = gather(t);
tm.cputime(2)=cputime;
tm.tictoc(2)= toc;
fprintf('%d minutes CPU, %d minutes\n',[diff(tm.cputime) tm.tictoc(2)]./60);
fprintf('Table size [%dx%d]\n',size(T));
T.DATE = datetime(T.DATE);
tb = unstack(T(:,{'DATE','VALUE','FIELD'}), 'VALUE', 'FIELD');
fprintf('Table size [%dx%d]\n',size(tb));
The output took 23 minutes.
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 4).Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 20 min 44 sec
Evaluation completed in 21 min 26 sec
5 minutes CPU, 23 minutes
Table size [762x4]
Table size [762x2]

댓글을 달려면 로그인하십시오.

채택된 답변

Michael
Michael 2019년 9월 12일
Hello,
Here's an update. Unfortunatly, or fortuntely depending on how you look at it, KDB+ is about 45x times faster (31 seconds vs. 23 minutes) and can be accessed through Matlab's datafeed toolbox.
Best,
Michael
  댓글 수: 1
Michael
Michael 2019년 9월 12일
This took 1 second instead of 23 minutes!
select from f where TICKER=`$("AAPL US EQUITY"), FIELD in (`MOV_AVG_50D,`HIGH), DATE within 1983.07.29 1983.08.31

댓글을 달려면 로그인하십시오.

추가 답변 (0개)

카테고리

Help CenterFile Exchange에서 Big Data Processing에 대해 자세히 알아보기

제품


릴리스

R2019a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by