Any GPU implementation of k-nearest neighbor search?

조회 수: 20 (최근 30일)
Hao Zhang
Hao Zhang 2018년 12월 13일
댓글: Matt J 2018년 12월 14일
Hi, I am developing a SPH (smoothed particle hydrodynamic) code of solving fluid equations in Matlab. I have successfully vectorized and implemented the code on GPU. The speed up is astonishing, ~10 times faster than CPU code. In the process I got invaluable suggestions from Matt J and Joss Knight for speeding up the code. Thank them for their wonderful suggestions.
In SPH, one has to find the neighbor particles in a given radius for every particles in the domain. I use the matlab function knnsearch for this purpose. Now the other part of the code (except neighbor search but solving the fluid equations) is so fast that the limiting part now is the knnsearch, which uses kdtree algorithm runing on CPU. It takes 85% percent of the runtime (see the following code profiler results)
Gputime.png
The function 'knnCPU_kdtree_func' uses the matlab built-in function knnsearch with kdtree algorithm runing on CPU. The other functions are solving the real fluid equations runing on GPU only consumes 10% of the total time.
I wonder is there any GPU implementation of k-nearest neighbor search that I can free download and using as a function call in my matlab code? Many thanks.

채택된 답변

Matt J
Matt J 2018년 12월 14일
편집: Matt J 2018년 12월 14일
A quick Google search turned up this,
They have Matlab bindings in their legacy version.
  댓글 수: 2
Hao Zhang
Hao Zhang 2018년 12월 14일
Many thanks! This seems to be a very relevant code!
I have downloaded and tried to use nvcc to compile their knncuda.cu file into a ptx file. This step succeed.
In the next step I use the matlab GPUcoder to create CUDAKernel object and trying to run it on GPU, for the entry I choose '_Z17compute_distancesPfiiS_iiiS_'.
clear;close all;clc;
gd=gpuDevice();
reset(gd);
N=int32(10);
rx=rand(N,1);
ry=rand(N,1);
ref=[gpuArray(single(rx)) gpuArray(single(ry))];
ref_width=N;
ref_pitch=int32(2);
query=ref;
query_width=N;
query_pitch=int32(2);
height=int32(2);
dist=zeros(N,1,'single','gpuArray');
%%% create CUDAKernel object %%%
knncuda_kernel=parallel.gpu.CUDAKernel('knncuda.ptx','knncuda.cu','_Z17compute_distancesPfiiS_iiiS_');
%%% create CUDAKernel object %%%
%%% run CUDAKernel %%%
result=feval(knncuda_kernel,ref,ref_width,ref_pitch,query,query_width,query_pitch,height,dist)
%%% run kernel %%%
No error is reported, however I always get the same function return (result) as one of the input argument (ref). Why is that? I have very limited knowledge of cuda C... Could you please help? Thanks!
Matt J
Matt J 2018년 12월 14일
It's all 3rd party black boxes to me, I'm afraid.

댓글을 달려면 로그인하십시오.

추가 답변 (2개)

Joss Knight
Joss Knight 2018년 12월 13일
knnsearch is supported on the GPU, so just use it!
  댓글 수: 7
Matt J
Matt J 2018년 12월 14일
편집: Matt J 2018년 12월 14일
The workspace of pdist2 has no awareness of the workspace that calls it. It cannot know that you have a pre-allocated variable D there ready to house the results. To do that, you would need a different implementation of pdist2 which lets you pass in D as an input argument. This can be done with a MEX file and probably on the GPU as well with a cudaKernel object, but you would have to be willing to get into C coding for that.
Hao Zhang
Hao Zhang 2018년 12월 14일
Ok, I am willing to write a cuda C code, but I am not a computer scientist, I would imagine cuda C will be a steep learning curve for me...
I wonder is there any standard cuda libaray that I can free download to do this?

댓글을 달려면 로그인하십시오.


Matt J
Matt J 2018년 12월 13일
편집: Matt J 2018년 12월 13일
In SPH, one has to find the neighbor particles in a given radius for every particles in the domain. I use the matlab function knnsearch for this purpose.
Perhaps instead, you should use rangesearch with a pre-constructed KDTreeSearcher object. You could then use parfor to process different chunks of query data in parallel.
  댓글 수: 1
Hao Zhang
Hao Zhang 2018년 12월 13일
Hi Matt, I found using knnsearch followed by finding which neighbors are within radius is faster than directly call rangesearch. Just have to use an enough large K so that all neighbors within radius is included in the K-nearest neighbors.
I tried to use parfor to process chunks of query data in parallel, but the execution time is even longer than call knnsearch once incuding all query points. The following is a test code:
clear;clc;close all
n_parallel=4;
if isempty(gcp('nocreate'))==1
pooljob=parpool('local',n_parallel);
end
N=1e5;
knn_K=50;
rx=rand(N,1);
ry=rand(N,1);
BucketSize_kdtree=50;
Mdl=KDTreeSearcher([rx ry],'Distance','euclidean','BucketSize',BucketSize_kdtree); %%% kdtree search object
idx_Neighbor_cell=cell(n_parallel,1);
d_Neighbor_cell=cell(n_parallel,1);
logic_beyor_cell=cell(n_parallel,1);
N_par=round(N/n_parallel);
tic;
[idx_Neighbor1,d_Neighbor1]=knnsearch(Mdl,[rx ry],'K',knn_K);
toc
tic;
rx_query_cell=mat2cell(rx,[N_par*ones(1,n_parallel-1) N-N_par*(n_parallel-1)],1);
ry_query_cell=mat2cell(ry,[N_par*ones(1,n_parallel-1) N-N_par*(n_parallel-1)],1);
parfor i=1:n_parallel
[idx_Neighbor_cell{i},d_Neighbor_cell{i}]=knnsearch(Mdl,[rx_query_cell{i} ry_query_cell{i}],'K',knn_K);
end
idx_Neighbor=cell2mat(idx_Neighbor_cell);
d_Neighbor=cell2mat(d_Neighbor_cell);
toc;
The runtime is:
Elapsed time is 0.389322 seconds.
Elapsed time is 1.199775 seconds.
The parfor version takes much longer time despite using 4 cores.
Does anyone know why this is and how to optimize it further? Thanks!

댓글을 달려면 로그인하십시오.

카테고리

Help CenterFile Exchange에서 GPU Computing에 대해 자세히 알아보기

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by