Why is pagemtimes slower on a GPU than on a CPU?

I'm a physics researcher, and a lot of my numerical work involves batch multiplication of small, complex-valued matrices. This made me very excited to see that in R2020b the pagemtimes function has been implemented, as this is exactly what I need.
On a speed comparison test between GPU and CPU, however, GPU performs significantly worse. Here's a minimal example of such a test:
batchSize = [100 100 100];
matSize = [2 2];
a = complex(rand([matSize batchSize]), rand([matSize batchSize]));
gpuA = gpuArray(a);
f = @() pagemtimes(a,a);
gpuF = @() pagemtimes(gpuA, gpuA);
timeit(f) % I get around 0.05 seconds
gputimeit(gpuF) % I get around 0.3 seconds
Is the significant slowdown simply because the batch/matrix size is too small for GPU optimisation to beat the overheads? Or is there something else going on that I've missed?
I'm testing this on a NVIDIA Quadro P1000 GPU.

Comments: 5

Ameer Hamza on 1 Dec 2020
Edited: Ameer Hamza on 1 Dec 2020
I think this might be the result of transferring the data to the GPU and then getting the results back. Try using bigger matrices if you have enough memory:
batchSize = [500 500 500];
matSize = [2 2];
It may reduce the discrepancy.
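One way to check the transfer hypothesis is to time the upload, the multiplication, and the download separately. The sketch below is only illustrative: the page count of 1e5 and the variable names tUpload, tCompute and tDownload are assumptions, not values from the thread. Note that in the original test gpuA is created before the timed function handle, so the gputimeit figure there should already exclude the host-to-device copy.

```matlab
% Illustrative sketch: time transfer and compute separately.
% The sizes and names below are assumptions, not taken from the thread.
a = complex(rand(2, 2, 1e5), rand(2, 2, 1e5));

tUpload = NaN; tCompute = NaN; tDownload = NaN;   % stay NaN if no GPU is available

if canUseGPU
    gpuA = gpuArray(a);                               % data already on the device
    tUpload   = gputimeit(@() gpuArray(a));           % host -> device copy only
    tCompute  = gputimeit(@() pagemtimes(gpuA, gpuA)); % device-side multiply only
    gpuC      = pagemtimes(gpuA, gpuA);
    tDownload = gputimeit(@() gather(gpuC));          % device -> host copy only
end
```

If tCompute alone is already larger than the CPU time, transfer overhead is unlikely to be the main cause.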
David Ho on 1 Dec 2020
Thanks for the comment. I don't have enough memory to try a batch size of [500 500 500] but using [200 200 200] gives me 0.35s for CPU and 2.8 seconds for GPU.
Joss Knight
I don't have an answer to your question yet, but certainly if you fold all your batch dimensions together the performance improves.
>> batchSize = [100 100 100];
>> matSize = [2 2];
>> a = complex(rand([matSize batchSize]), rand([matSize batchSize]));
>> gpuA = gpuArray(a);
>> timeit(@()pagemtimes(a,a))
ans =
0.0626
>> gputimeit(@()pagemtimes(gpuA,gpuA))
ans =
0.1029
>> batchSize = [1000000];
>> a = complex(rand([matSize batchSize]), rand([matSize batchSize]));
>> gpuA = gpuArray(a);
>> timeit(@()pagemtimes(a,a))
ans =
0.0625
>> gputimeit(@()pagemtimes(gpuA,gpuA))
ans =
0.0475
David Ho on 2 Dec 2020
Hi Joss, that's interesting to know. For my applications it would be helpful to use a [2 2 N N N] grid, as it represents a matrix-valued field in 3D space. If it's a significant performance increase, though, it might be possible to flatten the batch dimensions and postprocess afterwards.
Joss Knight on 30 May 2021
Flattening and unflattening only changes array metadata so it is essentially a free operation.
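As a minimal sketch of the flatten, multiply, unflatten round trip discussed above (the smaller batch size and the names aFlat, cFlat, cDirect and maxDiff are illustrative assumptions), the following runs on the CPU; wrapping a in gpuArray should give the GPU variant:

```matlab
% Illustrative sketch: fold the three batch dimensions into one, multiply, restore.
batchSize = [20 20 20];      % smaller than the thread's [100 100 100], for a quick check
matSize   = [2 2];
a = complex(rand([matSize batchSize]), rand([matSize batchSize]));

% Flatten [2 2 20 20 20] -> [2 2 8000]; [] lets reshape infer the page count.
aFlat = reshape(a, matSize(1), matSize(2), []);

% Batched product on the flattened array.
cFlat = pagemtimes(aFlat, aFlat);

% Unflatten back to the original 5-D layout.
c = reshape(cFlat, [matSize batchSize]);

% Sanity check against multiplying the 5-D array directly.
cDirect = pagemtimes(a, a);
maxDiff = max(abs(c(:) - cDirect(:)));
```

Because reshape keeps the column-major element order, page (i,j,k) of the 5-D result corresponds to the same linear page index in the flattened result.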


Answers (3)

Matt J on 1 Dec 2020
Edited: Matt J on 1 Dec 2020


I don't have a MATLAB version that supports pagemtimes, but for the GPU version, it might be advisable to try pagefun(@mtimes,...) instead. I do see an improvement relative to mtimesx.
batchSize = [100 100 100];
matSize = [2 2];
a = complex(rand([matSize batchSize]), rand([matSize batchSize]));
gpuA = gpuArray(a);
f = @() mtimesx(a,a);
gpuF = @() pagefun(@mtimes,gpuA, gpuA);
timeit(f) % I get around 0.8175 seconds
gputimeit(gpuF) % I get around 0.1033 seconds

Comments: 1

David Ho on 2 Dec 2020
Hi Matt, thanks for answering. When I test pagefun I get exactly the same performance as pagemtimes. It's interesting that your CPU and GPU times are very different to mine, though.


Nathan Zechar on 29 May 2021


Hello David, I have a similar problem and have found that pagemtimes is slower than just expanding the equation and coding it by hand, on both CPU and GPU. On the GPU it is exceptionally slow.
Here is an example. This can be coded up two different ways. Notice the performance of pagemtimes with just the CPU.
clear all
Nx = 100;
Ny = 100;
Nz = 100;
[A1,A2,A3,B1,B2,B3,C1,C2,C3,...
E11,E12,E13,E21,E22,E23,E31,E32,E33,...
F11,F12,F13,F21,F22,F23,F31,F32,F33] = deal(rand(Nx,Ny,Nz));
tic
for i = 1:20
%% Electric Field Update
C1 = F11.*(A1.*E11+B1)+F12.*(A2.*E12+B2)+F13.*(A3.*E13+B3);
C2 = F21.*(A1.*E21+B1)+F22.*(A2.*E22+B2)+F23.*(A3.*E23+B3);
C3 = F31.*(A1.*E31+B1)+F32.*(A2.*E32+B2)+F33.*(A3.*E33+B3);
end
toc
[A,B,C] = deal(rand(3,1,Nx,Ny,Nz));
[E,F] = deal(rand(3,3,Nx,Ny,Nz));
tic
for i = 1:20
C = pagemtimes(F,(B+pagemtimes(E,A)));
end
toc
Without pagemtimes - "Elapsed time is 0.032141 seconds".
With pagemtimes - "Elapsed time is 0.325006 seconds"
Using gpuArray() on the variables in the deal() calls, the difference in times is even larger!
Without pagemtimes - "Elapsed time is 0.012688 seconds."
With pagemtimes - "Elapsed time is 5.357220 seconds."
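For the 2-by-2 complex case in the original question, the same "expand the equation" idea can be sketched as follows. This is only an illustration: the page count N and the names a11 through b22, cRef and maxDiff are assumptions, and the sketch checks correctness against pagemtimes rather than making any timing claim.

```matlab
% Illustrative sketch: hand-expanded batched 2x2 product, checked against pagemtimes.
N = 1000;                                   % number of pages (assumed for illustration)
a = complex(rand(2, 2, N), rand(2, 2, N));
b = complex(rand(2, 2, N), rand(2, 2, N));

% Pull out each matrix entry as a 1-by-1-by-N array.
a11 = a(1,1,:); a12 = a(1,2,:); a21 = a(2,1,:); a22 = a(2,2,:);
b11 = b(1,1,:); b12 = b(1,2,:); b21 = b(2,1,:); b22 = b(2,2,:);

% c(i,j) = sum over k of a(i,k)*b(k,j), written out element-wise for every page.
c = zeros(2, 2, N, 'like', a);
c(1,1,:) = a11.*b11 + a12.*b21;
c(1,2,:) = a11.*b12 + a12.*b22;
c(2,1,:) = a21.*b11 + a22.*b21;
c(2,2,:) = a21.*b12 + a22.*b22;

% Reference result from the built-in batched multiply.
cRef = pagemtimes(a, b);
maxDiff = max(abs(c(:) - cRef(:)));
```

Checking maxDiff before timing is worthwhile, since a hand expansion only gives a fair comparison if it computes the same product as the pagemtimes call it replaces.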
Joss Knight on 30 May 2021


This might be simply because you are running double-precision math on a device designed for single precision operations. gpuBench doesn't show much improvement for double precision operations over the CPU on these devices. Can you convert your data to single precision and see if there is an improvement on GPU?
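A minimal sketch of the suggested single-precision comparison, assuming a flattened batch of 1e5 pages (the size and the timing variable names are illustrative; the GPU part runs only if a supported device is present):

```matlab
% Illustrative sketch: compare double and single precision for pagemtimes.
a  = complex(rand(2, 2, 1e5), rand(2, 2, 1e5));   % complex double
aS = single(a);                                   % complex single copy of the same data

tCpuDouble = timeit(@() pagemtimes(a, a));
tCpuSingle = timeit(@() pagemtimes(aS, aS));

tGpuDouble = NaN; tGpuSingle = NaN;               % stay NaN if no GPU is available
if canUseGPU
    gA = gpuArray(a);
    gS = gpuArray(aS);
    tGpuDouble = gputimeit(@() pagemtimes(gA, gA));
    tGpuSingle = gputimeit(@() pagemtimes(gS, gS));
end
```

Single precision carries roughly 7 significant decimal digits, so it is worth confirming that the reduced accuracy is acceptable for the application before switching.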


Release: R2020b

Asked: 1 Dec 2020

Answered: 30 May 2021
