Efficient Way To Split Dataset Into Subsets

Hello,
I need to split a large dataset (DxN numeric array) into multiple subsets. I can use the code below (where groupIDs is an Nx1 matrix of integer IDs - the group to which each datapoint belongs).
groups = unique(groupIDs);
for i = 1:numel(groups)
tempData = data(:,groupIDs==groups(i));
%do work on tempData
end
However, 90% of the run time of the above code is spent just creating tempData! That amounts to over a minute every time I want to do this. Is there a more efficient way to split data by groupIDs? I tried splitapply() but it doesn't seem to be any faster.
Are there any MATLAB gurus out there who know a trick? Thanks!

Comments: 5

Jos (10584) on 24 Nov 2017
How large is "large"?
E on 24 Nov 2017
500 x 3,000,000 (so a 12GB non-sparse double).
Greg on 24 Nov 2017
Edited: Greg on 24 Nov 2017
Use the second (or third? - I always have to guess and check between the two) output of unique(groupIDs).
Edit: This likely isn't faster, you still need a comparison check inside the loop. I always forget that part about the third output of unique.
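For reference, here is a minimal sketch of what the third output of unique provides. The small groupIDs and data arrays are made-up stand-ins, not values from the original post. accumarray is used to collect the column indices of every group in a single pass, which removes the comparison inside the loop. The selected columns still have to be copied, so this is not expected to fix the slowdown on its own.

```matlab
% Small illustrative inputs (assumed; the real array is 500 x 3,000,000).
groupIDs = [7; 3; 7; 9; 3; 7];
data = reshape(1:12, 2, 6);       % 2 x 6 stand-in for the D x N array

% Third output ic maps each element to its group: groups(ic) equals groupIDs.
[groups, ~, ic] = unique(groupIDs);

% Collect the column indices of every group in one pass (N x 1 -> cell array).
idxs = accumarray(ic, (1:numel(groupIDs))', [], @(v) {sort(v)});

for i = 1:numel(groups)
    tempData = data(:, idxs{i});  % still copies the selected columns
    % do work on tempData
end
```

With these inputs, idxs{i} holds the columns belonging to groups(i), so no groupIDs==groups(i) test is needed inside the loop.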
Jos (10584) on 24 Nov 2017
12 GB? That is quite a lot. If this doesn't fit in memory, swapping to disk is the likely bottleneck ...
E on 26 Nov 2017
Thanks for the replies. I have plenty of RAM to spare, so it doesn't look like the hard drive is involved. I confirmed (re Greg) that using the output of unique is no better. In the test below, numeric indexing offers no improvement over logical indexing, and computing the index itself is not really the problem - it's probably the data copying:
disp('a. original (without "doing work")');
tic;
for i = 1:numel(groups)
tempData = data(:,groupIDs==groups(i));
end
toc
disp('b. numeric indexing');
idxs = cell(numel(groups),1); % cell(n) would allocate an n-by-n cell array
for i = 1:numel(groups)
idxs{i} = find(groupIDs==groups(i));
end
tic;
for i = 1:numel(groups)
tempData = data(:,idxs{i});
end
toc
disp('c. logical operation alone');
tic;
for i = 1:numel(groups)
tempData = (groupIDs==groups(i));
end
toc
a. original (without "doing work")
Elapsed time is 4.590886 seconds.
b. numeric indexing
Elapsed time is 4.526391 seconds.
c. logical operation alone
Elapsed time is 0.066057 seconds.
There's gotta be another way - if I use a for loop with 3 million iterations it only takes 2 seconds longer.
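One option sometimes suggested for this pattern is to sort the columns by group once, so that each group becomes a contiguous block of columns. The sketch below uses small made-up inputs, and the names sortedIDs, order, sortedData, starts and stops are illustrative. It is not a confirmed fix: the one-time reordering needs a second full-size copy of the data, and slicing a contiguous block still copies it, so any gain would have to be measured on the real array.

```matlab
% Small illustrative inputs (assumed; the real array is 500 x 3,000,000).
groupIDs = [7; 3; 7; 9; 3; 7];
data = reshape(1:12, 2, 6);

% Sort the IDs once; order is the permutation that puts equal IDs together.
[sortedIDs, order] = sort(groupIDs);
sortedData = data(:, order);           % one full copy of the data

% First and last column of each run of equal IDs in the sorted vector.
starts = [1; find(diff(sortedIDs) ~= 0) + 1];
stops  = [starts(2:end) - 1; numel(sortedIDs)];

for i = 1:numel(starts)
    tempData = sortedData(:, starts(i):stops(i));  % contiguous column block
    % do work on tempData; its group ID is sortedIDs(starts(i))
end
```

If the same split is needed repeatedly, the sort and the reordering only have to be done once and sortedData can be reused.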


Answers (0)


Asked: E on 18 Nov 2017

Commented: E on 26 Nov 2017
