3D gpuArray vs cells of 2D gpuArrays major speed difference!

Dan Ryan, 24 May 2013
Can anybody explain why these two versions have drastically different runtimes?
Both share this setup routine:
clear all
y = gpuArray.rand(1000, 1000, 'single');
W = cell(1, 5);
WFull = gpuArray.zeros(1000, 1000, 5);
for j = 1:5
W{j} = gpuArray.rand(1000, 1000, 'single');
WFull(:,:,j) = W{j};
end
Version 1 (finishes in 1.4 seconds on my machine)
z = gpuArray.zeros(1000, 1000, 5);
tic
for i = 1:1000
for j = 1:size(W)
z(:,:,j) = W{j}*y;
end
end
toc
vs. Version 2 (finishes in 39 seconds on my machine, roughly 28x slower)
z = gpuArray.zeros(1000, 1000, 5);
tic
for i = 1:1000
for j = 1:size(WFull, 3)
z(:,:,j) = WFull(:,:,j)*y;
end
end
toc
Do you think that slicing large 3D gpuArrays is just really slow compared to looking up cell array values?

Accepted Answer

Matt J, 24 May 2013
Edited: Matt J, 24 May 2013
Do you think that slicing large 3D gpuArrays is just really slow compared to looking up cell array values?
Yes, it is faster to look up a cell than to pull a slice out of a 3D array, and that's true for normal arrays as well, as long as there is a small number of slices/cells. Of course, for a fair comparison you should also include the time needed to allocate memory for each W{j}.
Another reason is a bug in your for-loop over W{j} (a logic error rather than a syntax error, since the code runs without complaint). size(W) returns the vector [1 5], and the colon operator uses only the first element of a nonscalar operand, so the loop does 1 iteration instead of 5:
>> for j=1:size(W), j, end
j =
1
This is biasing the comparison to some degree.
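The loop-bound problem can be reproduced on the CPU without any GPU code. The following is a minimal sketch; the variable names r1, r2, and r3 are introduced here purely for illustration and are not from the original post.

```matlab
% Minimal sketch of the loop-bound bug (illustrative variable names).
W = cell(1, 5);

% size(W) returns the row vector [1 5]. The colon operator uses only
% the first element of a nonscalar operand, so this is really 1:1.
r1 = 1:size(W);

% Two ways to get the intended range 1:5.
r2 = 1:size(W, 2);   % explicit dimension argument
r3 = 1:numel(W);     % number of cells, independent of orientation

fprintf('1:size(W)    has %d element(s)\n', numel(r1));
fprintf('1:size(W, 2) has %d element(s)\n', numel(r2));
fprintf('1:numel(W)   has %d element(s)\n', numel(r3));
```

Using numel(W) is arguably the more robust choice here, since it gives the same count whether W is a row or a column cell array.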
Comments: 2
Dan Ryan, 24 May 2013
I also caught a couple of other issues: I had left 'single' off the gpuArray creation for some arrays but included it for others. In addition, I changed
size(W)
to
size(W, 2)
and now the comparison is much closer.
Here is the new code:
clear all
y = gpuArray.rand(1000, 1000, 'single');
z = gpuArray.zeros(1000, 1000, 5, 'single');
W = cell(1, 5);
for j = 1:5
W{j} = gpuArray.rand(1000, 1000, 'single');
end
tic
for i = 1:500
for j = 1:size(W, 2)
z(:,:,j) = W{j}*y;
end
end
toc
clear all
y = gpuArray.rand(1000, 1000, 'single');
z = gpuArray.zeros(1000, 1000, 5, 'single');
WMat = gpuArray.rand(1000, 1000, 5, 'single');
tic
for i = 1:500
for j = 1:size(WMat, 3)
z(:,:,j) = WMat(:,:,j)*y;
end
end
toc
What is really strange to me is that the execution time is very nonlinear in the number of outer-loop iterations (the upper bound of i). There must be some sort of memory flush going on when that count gets large, but I'm not sure why.
i = 100 -> runtimes are 0.10 and 0.14 seconds
i = 200 -> runtimes are 0.73 and 1.98 seconds
i = 500 -> runtimes are 10.3 and 11.7 seconds (notice the large jump for version 1!)
i = 1000 -> runtimes are 26.3 and 28.0 seconds!
Any idea what causes this highly nonlinear trend? I don't see why GPU memory would come into play, since I am just overwriting existing values and performing exactly the same computations in every iteration!
Dan Ryan, 30 May 2013
James Lebak from MathWorks helped me out with a really good tip:
use a
wait(gpuDevice)
command before the
toc
command when timing the GPU speeds.
Now the timings increase linearly with number of loop iterations and the two implementations give very similar results. Good to know!
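For reference, here is a hedged sketch of the synchronized timing pattern described above. GPU operations in MATLAB can be queued asynchronously, so toc may be reached before the queued work has finished; wait(gpuDevice) blocks until it has. The names n, numIter, dev, and elapsed are illustrative assumptions, and the arrays are created with gpuArray(rand(...)) rather than the gpuArray.rand form used in the thread, on the assumption that this spelling works across more releases.

```matlab
% Sketch: timing GPU matrix products with explicit synchronization.
% Assumes Parallel Computing Toolbox and a supported GPU are available.
n = 1000;        % matrix size (same as in the thread)
numIter = 100;   % outer-loop count, chosen for illustration

y = gpuArray(rand(n, n, 'single'));
W = gpuArray(rand(n, n, 5, 'single'));
z = gpuArray(zeros(n, n, 5, 'single'));

dev = gpuDevice;   % handle to the currently selected GPU
wait(dev);         % let setup work finish before starting the timer

tic
for i = 1:numIter
    for j = 1:size(W, 3)
        z(:,:,j) = W(:,:,j) * y;
    end
end
wait(dev);         % block until all queued GPU work has completed
elapsed = toc;

fprintf('Elapsed time with synchronization: %.3f s\n', elapsed);
```

Without the second wait call, the measured time may mostly reflect how long it takes to queue the work, which would be consistent with the nonlinear timings reported earlier in the thread.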
