Loop through the unique values of a very large column and extract data

Julian Williams on 15 Jun 2020
Commented: Julian Williams on 16 Jun 2020
This is more a speed question than a "how to" question.
Assume I have the following problem:
Three variables A, B and C.
A is a series of IDs and B and C are data (e.g. dates and a measurement).
For various reasons I want to separate the data by ID, so instead of three columns I have a structure with something like:
mystruct.First_ID_FROM_A = [B(indexFirstID,:) C(indexFirstID,:)]
Traditionally I just do the following:
[uA,~,IB] = unique(A);        % IB(k) is the group number of row k
for i = 1:numel(uA)
    ii = find(IB == i);       % rows belonging to the i-th unique ID
    mystruct.(uA{i,1}) = [B(ii,:) C(ii,:)];
    % sometimes I do other stuff here with some cross referencing, so the index ii is useful.
end
Job done. I have tried other methods, but this is pretty fast. The problem is that I now have very large data (A, B and C each have the best part of a billion rows). So this is my second attempt, which I run on a server:
[uA,~,IB] = unique(A);
N = numel(uA);
temp = cell(N,1);
% do the indexing with a cell array that can be sliced by parfor.
parfor i = 1:N
    ii = find(IB == i);
    temp{i,1} = [B(ii,:) C(ii,:)];
end
% do a second loop just to reallocate the data
for i = 1:N
    mystruct.(uA{i,1}) = temp{i,1};
end
So despite being two loops, this can be quicker, as the extraction runs in parallel and the assignment is fast.
Is there a fancy way of using something like an array-based version of a binary expansion function that can do this faster without the loop, in either step of the second process? Or should I write a C++ MEX routine to speed this tedious thing up? I think one problem here is that the size of each output array is not known in advance.
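For the second step, one possible loop-free alternative is `cell2struct`, which builds the structure from the cell array in a single call. The sketch below uses small made-up values for `uA` and `temp` (they are illustrative, not the real data) and assumes, as the dynamic field names in the loop already do, that every ID is a valid MATLAB identifier.

```matlab
% Hypothetical stand-ins for the unique IDs and the per-ID blocks.
uA   = {'id_a'; 'id_b'; 'id_c'};
temp = {[2 20; 5 50]; [1 10; 3 30]; [4 40]};

% One field per row of temp, named by the matching entry of uA.
mystruct = cell2struct(temp, uA, 1);
```

Whether this is measurably faster than the assignment loop on the real data would need to be timed.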
If so, does anyone have any experience or examples of how to create and populate a MATLAB structure in C++ so that the output can be read by MATLAB? I use str2doubleq a lot; it takes a cell array of strings and outputs doubles, which is quite vanilla. I have also written a few custom C and C++ routines for fast date and time parsing, when datenum was too slow.
But this is annoying me; I am sure there is a neater way to do it. Once the data is in the structure, it is really fast to just use the fieldnames command and then loop through the sub data objects.
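Each `find(i==IB)` call in the loops above scans every row, so the total work grows with the number of rows times the number of unique IDs. A possible single-pass alternative is to sort the group numbers once and cut the reordered data into blocks with `mat2cell`. This is only a sketch on toy data (the values of `A`, `B` and `C` are made up), and it assumes the IDs are a cell array of valid field names and that a reordered copy of `[B C]` fits in memory, which may not hold at close to a billion rows.

```matlab
% Toy data standing in for the real A, B and C (hypothetical values).
A = {'id_b'; 'id_a'; 'id_b'; 'id_c'; 'id_a'};
B = [1; 2; 3; 4; 5];
C = [10; 20; 30; 40; 50];

[uA, ~, IB] = unique(A);       % IB(k) is the group number of row k
[~, order]  = sort(IB);        % rows of each group become contiguous
counts      = accumarray(IB, 1);   % number of rows per group

data   = [B C];
blocks = mat2cell(data(order, :), counts, size(data, 2));  % one cell per ID
idx    = mat2cell(order, counts, 1);   % original row indices per ID

mystruct = cell2struct(blocks, uA, 1);

% Looping over the result with fieldnames, as described above.
names = fieldnames(mystruct);
nRows = zeros(numel(names), 1);
for k = 1:numel(names)
    nRows(k) = size(mystruct.(names{k}), 1);
end
```

Here `idx{i}` plays the role of the index `ii` from the original loop, so any cross-referencing can still use it.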
Comments: 7
Sindar on 16 Jun 2020
The point of tables is that they act like a more organized structure array. If you are naming each structure field, you already spend that memory. Depending on the shape of your data, something similar to:
mytable = array2table([B(IB,:) C(IB,:)],'RowNames',num2str(uA))
should work without any loops
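Table row names must be unique, so when an ID repeats across rows, one option is to keep the ID as an ordinary table variable and group on it with `findgroups` and `splitapply`. The following is a rough sketch of that idea on made-up data; the variable names `ID`, `B` and `C` and the choice of `mean` as the per-group operation are assumptions for illustration only.

```matlab
% Toy data standing in for the real A, B and C (hypothetical values).
A = {'id_b'; 'id_a'; 'id_b'; 'id_c'; 'id_a'};
B = [1; 2; 3; 4; 5];
C = [10; 20; 30; 40; 50];

T = table(A, B, C, 'VariableNames', {'ID', 'B', 'C'});

[G, ids] = findgroups(T.ID);           % G(k) is the group number of row k
meanC    = splitapply(@mean, T.C, G);  % one value per unique ID

% Per-ID blocks, if the separated data is still needed.
blocks = splitapply(@(b, c) {[b c]}, T.B, T.C, G);
```

With this layout the data stays in one table, and per-ID work is expressed as a function applied to each group rather than as a field lookup.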
Julian Williams on 16 Jun 2020
Benjamin, that is very neat, much appreciated. Sindar, many thanks for the point on the tables.

Answers (0)

Release: R2019b
