Speed up distance calculation in N-dimensional space

Hi all,
I have a function I'd like your help to speed up. The function takes two vectors x and Y of points in N-dimensional space as inputs (x and Y have the same size, typically N=500 by m=50,000). For every point in x I want to find the closest (according to squared difference) point in Y and return its number. The inputs are almost random. The function is :
function I = findnearest(x,Y)
I = zeros(size(x,1),1);
o = ones(size(Y,1),1);
for i = 1:size(x,1)
sqdiff = (o*x(i,:) - Y).^2;
xdiff = sum(sqdiff,2);
[~,I(i)] = min(xdiff);
end
end
I've tried to vectorize it but it turns out even slower than the current version. Any ideas on how to improve performance would be greatly appreciated! The function is called upon many times and as such takes immense computational time.
Thanks in advance!
Edit: I just realized that the part of the code that takes 99% of the time I only have N=5 to N=20... I'm terribly sorry for this confusion! The parts where N>=400 are not run that often so they are not that big a deal.

댓글 수: 7

What's the aim of this function?
What is the size of x and y, and what does "immense" mean to you (how many microseconds)?
Hi!
Sorry if I was unclear. The function is part of a numerical algorithm for solving economical models. I have two sets X and Y of M points each in N-dimensional space. Typically N is 200-1000 and M is 40,000-200,000. For each point in X I want to find the closest point in Y.
For now I define "closest" as the point with the smallest sum of squared difference, but the exact definition is not that important - if it is possible to speed up the function using some other definition of closeness that's fine by me.
Hope this clarifies it. If not please ask again! :)
Kindly, Anders
In my tests on problems of that size, it appears to take on the order of a minute for each computation.
Vi us that sounds about right. The thing is that it is within a bunch of nested loops and will be run potentially 100,000's of times. And that's not feasible of it takes a minute each time. :)
An important question is if BOTH sets of points change with each evaluation. If Y stays fixed for example, then one of the tree schemes might be very good, as then the cost of building the tree will be spread out over the many times it will be used.

Hi again! The Y will sometimes be constant 1,000's of times, but will change every now and then. So then the trees might be good after all.

댓글을 달려면 로그인하십시오.

답변 (1개)

Matt J
Matt J 2015년 2월 28일

0 개 추천

See IPDM ( Download ) and maybe also the files it acknowledges and the files it inspired.

댓글 수: 5

John D'Errico
John D'Errico 2015년 2월 28일
편집: John D'Errico 2015년 2월 28일
Oh. I was going to answer this, then I went off to have some breakfast and forgot about it. I'm not sure whether IPDM will be any faster, as the algorithm it uses for a nearest neighbor is not really that great.
I'd wonder about looking for a good k-d tree code instead, which might be able to help here. A problem with a kd-tree is it may not be efficient in a high number of dimensions (500) as is seen here. Ball-trees might also be worth looking into, but I've not seen much of them.
Some further thought suggests that the tree search schemes might not be as useful as I might have hoped. These schemes have the flaw that they are inefficient as the number of dimensions rises, forcing the code to test too many points. So in a very high number of dimensions (500 is very high in this context) you won't see much improvement at all. As well, the branching ensures no gain can come from any vectorization at all.
Hi thanks for your replies!! I'll check out the tree algorithms and see how they do; any improvement would be amazing! Matt - in really impressed by your use of bsxfun in my other question! Do you think it, or array fun, can be used in this context?
Kindly, Anders
Matt J
Matt J 2015년 2월 28일
편집: Matt J 2015년 2월 28일
@Anders,
Matt - in really impressed by your use of bsxfun in my other question!
Perhaps you could Accept-click the answer then :D ...
As for the rest of your last post, IPDM does use bsxfun, so if John doesn't recommend it, I'm not sure how much to hope for. Arrayfun is not designed for high speed computation, so I wouldn't recommend it for what you are trying to do.
The only thing I can think of at this point is to divide the task with parfor in combination with Worker Object Wrapper ( Download ). The Worker Object Wrapper will allow you to make Y persist on the parallel workers across multiple calls to parfor, so you don't have to rebroadcast it each time you repeat the parfor loop. Since you say Y changes rarely, this will save you some overhead.
Hi again,
I'm sorry for a late reply - something else came up that I needed to deal with before this. Thank you soo much John and Matt! Its getting better, but I'm not yet there.
I've tried a bunch of different things but still haven't found anything that is quite as fast as I'd like it to be. My homemade findnearest is an obviously bad first attempt. I also tried John's ipdm, as well as the built in functions createns/knnsearch combo. Please see below for computational comparisons. I'll keep looking and will report back if I find anything faster than the createns/knnsearch. If anyone has additional ideas on how to improve this, feel free to comment below. [Approximate nearest neighbour might be good enough for me, so I might try to compile FLANN (Link)]. The 2.5seconds might not seem that bad at first, but they are very unwelcome when repeated hundreds of thousands of times!
disp('50000 by 1')
A=rand(50000,1); B = rand(50000,1);
tic;
disp('Built in: createns/knnsearch')
disp(' -- Grow tree --')
kdtree=createns(B,'NSMethod','kdtree','Distance','Euclidean');
toc
tic
disp(' -- Search --')
res1=knnsearch(kdtree,A)';
toc
tic;
disp('Johns IPDM')
res2=ipdm(A,B,'Subset','nearest','result','struct')';
toc
tic;
disp('My findnearest')
res3=findnearest(A,B);
toc
disp(' ')
disp('50000 by 8')
A=rand(50000,8); B = rand(50000,8);
tic;
disp('Built in: createns/knnsearch')
disp(' -- Grow tree --')
kdtree=createns(B,'NSMethod','kdtree','Distance','Euclidean');
toc
tic
disp(' -- Search --')
res1=knnsearch(kdtree,A)';
toc
tic;
disp('Johns IPDM')
res2=ipdm(A,B,'Subset','nearest','result','struct')';
toc
tic;
disp('My findnearest')
res3=findnearest(A,B);
toc
Results
50000 by 1
Built in: createns/knnsearch
-- Grow tree --
Elapsed time is 0.069353 seconds.
-- Search --
Elapsed time is 0.055465 seconds.
Johns IPDM
Elapsed time is 0.029600 seconds.
My findnearest
Elapsed time is 12.559964 seconds.
50000 by 8
Built in: createns/knnsearch
-- Grow tree --
Elapsed time is 0.128941 seconds.
-- Search --
Elapsed time is 2.504368 seconds.
Johns IPDM
Elapsed time is 175.717344 seconds.
My findnearest
Elapsed time is 295.582135 seconds.|

댓글을 달려면 로그인하십시오.

카테고리

도움말 센터File Exchange에서 Loops and Conditional Statements에 대해 자세히 알아보기

질문:

2015년 2월 28일

댓글:

2015년 3월 6일

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by