Slice into gpuArray and perform functions on the GPU with arrayfun

I would like to know how to index into a given matrix to form all pairwise combinations of its column-vectors, and perform operations on those vectors, all on the GPU. So consider the simple function below:
function out = sum2Vecs(in1,in2)   %in1 and in2 are (n x 1) vectors.
    out = sum(in1,1) + sum(in2,1); %Output is a scalar double.
end
Quick example: an array such as
fullMatrix = rand(3000,100);
Now I choose all pairwise column-vector combinations of "fullMatrix":
idxArray = nchoosek(1:100,2); %All possible pairwise index combinations of "fullMatrix".
nCombinations = size(idxArray,1); %Number of pairs, i.e. rows of idxArray.
And a simple for-loop applies "sum2Vecs" to each pair of column-vectors:
outArray = zeros(1,nCombinations); %Preallocate for speed.
for idx = 1:nCombinations
    outArray(idx) = sum2Vecs( fullMatrix(:,idxArray(idx,1)), fullMatrix(:,idxArray(idx,2)) );
end
Also, a parfor-loop with slicing works fine:
outArray = zeros(1,nCombinations); %Preallocate; parfor slices outArray.
parfor idx = 1:nCombinations
    in1 = fullMatrix(:,idxArray(idx,1));
    in2 = fullMatrix(:,idxArray(idx,2));
    outArray(idx) = sum2Vecs(in1,in2);
end
My goal is to perform this loop on the GPU using e.g. "arrayfun". I am relatively inexperienced with this, so I would appreciate any helpful pointers. What I am particularly interested in learning is how to index efficiently into an array like "fullMatrix" and send parts of it to each GPU worker.
Thanks very much. Hamad.

Answers (1)

Matt J on 11 Jan 2015
Edited: Matt J on 11 Jan 2015
In the generality that you've described, that computation doesn't look well-suited to the GPU. The GPU is for situations where you have lots of parallel tasks involving small chunks of data. The chunks in your example, two 3000x1 vectors, are probably not small enough unless the operation can be subdivided further.
For that specific example, I would probably try to vectorize on the GPU as follows:
idxArray = gpuArray( nchoosek(1:100,2).' ); %2 x nCombinations; each column holds one pair of indices
A = gpuArray(fullMatrix);
[m,n] = size(A);
outArray = sum( reshape(A(:,idxArray), 2*m, []), 1 ); %Stack each pair of columns, then sum each stack
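As a quick CPU-side check (my own sketch; the variable names are mine) that the reshape trick is equivalent to summing each pair of columns directly:

```matlab
%Sanity check on the CPU: the reshape trick vs. a direct pairwise sum.
fullMatrix = rand(3000,100);
idxArray   = nchoosek(1:100,2);            %nCombinations x 2
pairs      = idxArray.';                   %2 x nCombinations, pairs in columns
m          = size(fullMatrix,1);

%Vectorized: stack each pair of columns into one 2m x 1 column, then sum.
vecOut  = sum( reshape(fullMatrix(:,pairs), 2*m, []), 1 );

%Reference using column sums directly.
colSums = sum(fullMatrix,1);
refOut  = colSums(idxArray(:,1)) + colSums(idxArray(:,2));

max(abs(vecOut - refOut))                  %Should be at round-off level
```

The key point is that indexing with the 2 x nCombinations matrix `pairs` returns the columns in pair order, so the reshape groups exactly the right two columns into each 2m-element stack.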

4 Comments

Thanks very much for that, Matt. The last line of your answer works very well when the objective is to parallelize "sum2Vecs" across the GPU cores, but it does not extend to cases where the function is not built from an existing vectorized MATLAB function such as "sum". As I explained, what I am really looking for is a way to efficiently index into a gpuArray and send "chunks" of the data to different GPU cores, on which to perform an arbitrary @function.
Finally, I'm not sure that sending two 3000x1 arrays to each GPU core should be difficult. Crude calculation: suppose you have 2500 CUDA cores; then 3000 x 2500 x 2 x 8 bytes = around 114 MB of RAM per cycle.
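For what it's worth, the arithmetic behind that estimate can be spelled out (a back-of-the-envelope sketch):

```matlab
%Back-of-the-envelope memory estimate for one "cycle" of 2500 cores,
%each holding two 3000 x 1 double (8-byte) vectors.
nCores    = 2500;
vecLen    = 3000;
bytes     = vecLen * nCores * 2 * 8;   %= 120,000,000 bytes
megabytes = bytes / 2^20               %~114.4 MB (binary megabytes)
```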
In any case, thanks very much for your help. Hamad.
Matt J on 12 Jan 2015
Edited: Matt J on 12 Jan 2015
As I explained, what I am really looking for is a way to efficiently index into a gpuArray and send "chunks" of the data to different GPU cores on which to perform an arbitrary @function.
Yes, I understood that. But as I said, I don't think you can do it! (Not efficiently).
Finally, I'm not sure that two 3000x1 arrays sent to each GPU core should be difficult.
I'd have to know more about your graphics card. You only get the benefit of the GPU if the data can be cached in the local memory used by the "cores", and arrayfun needs to be smart enough to do so. According to the tables here, the latest cards have 96KB of shared memory per multiprocessor, but each of your threads requires about 48KB (two 3000x1 double vectors). That means each multiprocessor would only be able to run 2 such threads in parallel.
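Spelled out (my own arithmetic, using the shared-memory figure quoted above):

```matlab
%Shared-memory budget per multiprocessor vs. per-thread working set.
perThread = 2 * 3000 * 8;              %Two 3000x1 double vectors = 48,000 bytes
perSM     = 96 * 1024;                 %96KB shared memory per multiprocessor
floor(perSM / perThread)               %= 2 threads fit per multiprocessor
```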
Anyway, it's just my speculation. We'll see if any TMW contributors have better insights...
Thanks very much, Matt.
arrayfun can take a user-defined function, as long as that function carries out scalar operations. You can also index into arrays in that function as long as the array is passed in as an upvalue - see for instance here, the Mandelbrot example on this page and the Monte Carlo example here.
You need to remember that GPU cores are not like parallel workers: individually they cannot perform complex vector operations; only taken together can they. In PCT, a large number of complex algorithms have been implemented in such a way as to take maximum advantage of the GPU. If you are having trouble formulating your problem in a data-parallel way, then post your real code and we can have a look at whether it is inherently parallelisable. The example you gave, summing vectors, is easily vectorizable, as Matt showed above.
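As a concrete illustration of the upvalue-indexing approach mentioned above, a sketch along the following lines may work (untested here; it assumes Parallel Computing Toolbox with a supported GPU, and a release recent enough that GPU arrayfun allows indexing outer-scope variables from a nested function; the function and variable names are mine):

```matlab
function out = pairwiseSumsGPU(fullMatrix, idxArray)
%Sketch: run a scalar-only user function over all column pairs on the GPU.
%fullMatrix is captured and indexed as an upvalue inside the nested function.
A  = gpuArray(fullMatrix);
i1 = gpuArray(idxArray(:,1));          %First column index of each pair
i2 = gpuArray(idxArray(:,2));          %Second column index of each pair
m  = size(A,1);

    function s = sumPair(c1,c2)
        %Scalar operations only: loop over rows, indexing A element-wise.
        s = 0;
        for r = 1:m
            s = s + A(r,c1) + A(r,c2);
        end
    end

out = arrayfun(@sumPair, i1, i2);      %One GPU thread per pair of columns
end
```

Note the body of sumPair is restricted to scalar arithmetic and scalar indexing, which is what GPU arrayfun requires; whether this beats Matt's vectorized reshape will depend on the card and the actual function being applied.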



Question asked: 11 Jan 2015
Last comment: 23 Feb 2015
