Optimize this program on a GPU

I'm trying to speed up my program on a GPU, but it runs much slower than the CPU version even though I am using a very powerful GPU. I haven't been able to vectorize my code because every element in my main vector has to be compared to all the other elements in that vector. Therefore I had to use a parfor loop.
This is my program:
%% generate data
num_tst = 5000; % this is the key number. The CPU takes about 10 s for 65000
binning = num_tst/100;
N = num_tst/binning;
timestamps = gpuArray.randi(num_tst,num_tst,1);
detectors = gpuArray.randi(4,num_tst,1);
T = N*binning;
timestamps = sort(timestamps);
%% to be optimized
tic;
correlations = gpuArray.zeros(1,N);
parfor i = 1:(size(timestamps,1)-1)
    a = gpuArray.zeros(1,N);
    j = i+1;
    dts = timestamps(j) - timestamps(i);
    while (dts < T) && (j <= size(timestamps,1))
        if dts == 0 && detectors(i) ~= detectors(j)
            a(1) = a(1) + 2;
        elseif detectors(i) ~= detectors(j)
            dts = floor(dts/binning)+1;
            a(dts) = a(dts) + 1;
        end
        j = j + 1;
        if j <= size(timestamps,1)
            dts = timestamps(j) - timestamps(i);
        end
    end
    correlations = correlations + a;
end
toc;
How can I speed up my program? Is it possible to vectorize a program like this or do I need to implement this program in CUDA code?

Comments: 2

Edric Ellis on 22 Apr 2013
You have managed only to run the addition on the GPU, which is why you're seeing no benefit. You should profile the code with no parallelism to see which portions might benefit. If you could make the code self-contained so that it is executable, it might be possible to see how to vectorise or parallelise things.
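As a rough illustration of the profiling step suggested here, the sketch below runs the loop from the question serially on the CPU under the MATLAB profiler. It assumes plain double arrays (`randi` instead of `gpuArray.randi`), a `for` loop in place of `parfor`, and a smaller `num_tst` than in the question so the run stays short; these substitutions are assumptions for the example, not part of the original code.

```matlab
% Sketch: serial CPU baseline of the question's loop, run under the profiler.
% Assumptions: plain CPU arrays, for instead of parfor, reduced num_tst.
num_tst = 2000;
binning = num_tst/100;
N = num_tst/binning;
T = N*binning;
timestamps = sort(randi(num_tst, num_tst, 1));
detectors  = randi(4, num_tst, 1);
n = numel(timestamps);

profile on
correlations = zeros(1, N);
for i = 1:n-1
    j = i + 1;
    % walk forward until the time difference leaves the window T
    while j <= n && timestamps(j) - timestamps(i) < T
        dts = timestamps(j) - timestamps(i);
        if detectors(i) ~= detectors(j)
            if dts == 0
                correlations(1) = correlations(1) + 2;   % coincident events count twice
            else
                b = floor(dts/binning) + 1;              % bin index of this pair
                correlations(b) = correlations(b) + 1;
            end
        end
        j = j + 1;
    end
end
profile off
p = profile('info');   % collected timing data as a struct
```

Saving this as a script file and calling `profile viewer` afterwards opens the report, which should show how the time is split between the indexing, the comparisons, and the accumulation.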
Matthias on 22 Apr 2013
I thought I had parallelised the whole iteration process. Addition and iteration are basically all the program is made of. I edited the code so that it is executable. The second section is the interesting one.


Accepted Answer

Matt J on 22 Apr 2013
Edited: Matt J on 22 Apr 2013


Here's a somewhat more vectorized version. The operations you're doing don't seem to be very good candidates for GPU acceleration. Accumarray ops seem hard to accelerate on the GPU, judging from the conspicuous absence of a GPU-accelerated accumarray from MATLAB and all other GPU applications I've seen. However, parfor on the CPU might bring some advantage.
correlations=zeros(N,1);
parfor i = 1:(size(timestamps,1)-1)
    dts=timestamps(i+1:end)-timestamps(i);          % time differences to all later events
    val=detectors(i+1:end)~=detectors(i) & dts<T;   % pairs from different detectors within T
    val=val+(val&(dts==0));                         % coincident pairs count twice
    subs=floor(dts/binning)+1;                      % bin index of each pair
    a=accumarray(subs(:),val(:),[N,1]);
    correlations = correlations + a;
end
correlations=correlations.';
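One difference from the loop in the question is worth spelling out: that loop stops as soon as dts reaches T, whereas the version above computes subs for every later element, so any dts >= T produces a bin index larger than N. The following is a minimal sketch of one way to handle this, by discarding out-of-window pairs before calling accumarray. It assumes plain CPU arrays and a serial `for` loop, and the test data (`num_tst`, `binning`, `N`) is made up for the example so that the data span is longer than T.

```matlab
% Sketch only: CPU arrays and a serial loop are assumed (no gpuArray, no parfor).
rng(0);                                   % reproducible example data
num_tst = 2000;
binning = 5;
N = 100;
T = N*binning;                            % window shorter than the data span
timestamps = sort(randi(num_tst, num_tst, 1));
detectors  = randi(4, num_tst, 1);

correlations = zeros(N, 1);
for i = 1:numel(timestamps)-1
    dts  = timestamps(i+1:end) - timestamps(i);
    % keep only the pairs the original while-loop would have counted
    keep = (dts < T) & (detectors(i+1:end) ~= detectors(i));
    dts  = dts(keep);
    if ~isempty(dts)
        subs = floor(dts/binning) + 1;    % never exceeds N because dts < T
        val  = 1 + (dts == 0);            % coincident timestamps count twice
        correlations = correlations + accumarray(subs(:), val(:), [N, 1]);
    end
end
correlations = correlations.';
```

Because timestamps is sorted, masking with `dts < T` selects the same pairs that the while-loop visits before it terminates, so the result should match the original loop on the same data.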

Comments: 8

Thanks for the quick response. But I think the code is not fully correct. I get the following error and haven't been able to fix it:
First input SUBS and third input SZ must satisfy ALL(MAX(SUBS)<=SZ).
I am just wondering how it is possible to satisfy this condition, since max(subs) ~ 2000 and 1 < 2000. The documentation didn't really help me either.
Matt J on 22 Apr 2013
What is N?
Matthias on 22 Apr 2013
'N' is the number of bins. Each bin has a width of 'binning', so the total time covered by all the bins is T = N*binning.
This means that correlations(1) is incremented for all dts < binning, and correlations(2) is incremented for all dts with binning <= dts < 2*binning.
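The bin assignment described here can be checked with a few hand-picked time differences. This is a small sketch using binning = 50 and N = 100, the values that the question's test data produces for num_tst = 5000; the `dts` values themselves are chosen only for illustration.

```matlab
binning = 50;                       % bin width (num_tst/100 with num_tst = 5000)
N = 100;                            % number of bins
T = N*binning;                      % total time covered by all bins

dts  = [0 49 50 99 100 4999];       % example time differences, all below T
bins = floor(dts/binning) + 1;      % gives [1 1 2 2 3 100]

edge_bin = floor(T/binning) + 1;    % gives N+1: a difference of exactly T is out of range
```

Any dts below T maps to a bin between 1 and N, while dts equal to or greater than T maps to N+1 or higher, which is outside a length-N result vector.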
Matt J on 22 Apr 2013
Edited: Matt J on 22 Apr 2013
No, I mean what number is N set to? In your earlier comment you wrote what looks like a typo, 1 < 2000. Did you really mean N < 2000? If so, your original code should have had the same problem. The only way that max(subs) can be ~2000 is if there is some value of
floor(dts/binning)+1
that is ~2000, which would have made a(dts)=a(dts)+1 in your original code impossible to execute without an error.
Matthias on 22 Apr 2013
Edited: Matthias on 22 Apr 2013
2000 is just an example. As I explained, N is the number of bins; it is only used to initialize empty arrays here and there.
After the program has run, correlations should be a vector of length N.
Matt J on 22 Apr 2013
Edited: Matt J on 22 Apr 2013
I can only repeat what I said before. Both your code and mine evaluate the expression
floor(dts/binning)+1
If this expression is ever greater than N, then both our versions will fail. Otherwise, both should be fine.
Matthias on 22 Apr 2013
Edited: Matthias on 22 Apr 2013
I found my error. The fake data I generated was wrong. Thank you!
If I'm not mistaken, the expression ALL(MAX(SUBS)<=SZ) will only be true if max(subs) <= 1.
No, it requires all(max(subs)<=N). Here are some examples to illustrate the issue:
>> X=accumarray([1;2;2;2;3;3],true,[10,1]); X'
ans =
1 3 2 0 0 0 0 0 0 0
but
>> X=accumarray([1;2;2;2;3;30],true,[10,1]); X'
Error using accumarray
First input SUBS and third input SZ must satisfy ALL(MAX(SUBS)<=SZ).
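When this error appears, it can help to look at which subscripts are out of range before calling accumarray. The snippet below is a small sketch using the failing subscripts from the example above; dropping the offending entries is only one possible response, and whether it is appropriate depends on what the out-of-range values mean in the data.

```matlab
subs = [1;2;2;2;3;30];               % same subscripts as the failing call above
N = 10;                              % length of the requested output

bad = subs > N;                      % logical mask of out-of-range subscripts
fprintf('%d of %d subscripts exceed N = %d\n', nnz(bad), numel(subs), N);

X = accumarray(subs(~bad), 1, [N, 1]);   % accumulate only the in-range entries
```

Here one subscript (30) is flagged, and the remaining five are counted into bins 1 to 3.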


Asked: 22 Apr 2013
