Summing array elements seems to be slow on GPU

Views: 9 (last 30 days)
Damian Suski on 26 April 2023
Commented: Damian Suski on 18 May 2023
I am testing the execution time of the following function on the CPU and the GPU:
function funTestGPU(P,U,K,UN)
for k = 1:P
    H = exp(1i*K);
    HU = U.*H;
    UN(k,:) = sum(HU,[1,3]);
end
end
where U and UN are complex arrays of size P-by-P and K is a complex array of size P-by-P-by-P. So in each iteration I perform an element-wise exp(), an element-wise multiplication of two arrays, and a summation of the elements of a 3D array along two dimensions.
I measure the execution time on the CPU and on the GPU with the following script:
P = 200;
URe = 1/(sqrt(2))*rand(P);
UIm = 1/(sqrt(2))*rand(P);
KRe = 1/(sqrt(2))*rand(P,P,P);
KIm = 1/(sqrt(2))*rand(P,P,P);
% CPU
U = complex(URe, UIm);
K = complex(KRe, KIm);
UN = complex(zeros(P), zeros(P));
fcpu = @() funTestGPU(P,U,K,UN);
tcpu = timeit(fcpu);
disp(['CPU time: ',num2str(tcpu)])
% GPU
U = gpuArray(complex(URe, UIm));
K = gpuArray(complex(KRe, KIm));
UN = gpuArray(complex(zeros(P), zeros(P)));
fgpu = @() funTestGPU(P,U,K,UN);
tgpu = gputimeit(fgpu);
disp(['GPU time: ',num2str(tgpu)])
and I obtain the following results:
CPU time: 9.0315
GPU time: 3.3894
My concern is that if I remove the last operation (summing the array elements) from funTestGPU, I obtain:
CPU time: 8.0185
GPU time: 0.0045631
So it looks like the summation is the most time-consuming operation on GPU. Is that an expected result?
I wrote analogous code in CuPy and in PyTorch, and there the summation does not seem to be the most time-consuming operation.
I use MATLAB R2019b. My graphics card is an NVIDIA GeForce GTX 1050 Ti (768 CUDA cores), and my processor is an AMD Ryzen 7 3700X (8 physical cores).
2 Comments
Matt J on 27 April 2023
Moved: Matt J on 27 April 2023
So it looks like the summation is the most time-consuming operation on GPU. Is that an expected result?
That's what I would expect. It's the only operation in the chain that is not element-wise.
Damian Suski on 27 April 2023
@Matt J Thank you for your comment. Before I ran the tests, I imagined that the exponential would be the most time-consuming operation, but it turns out that the element-wise operations are not the bottleneck of the calculation. I just wanted to make sure that I am not missing something obvious.
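For reference, the relative cost of the two kernels can be checked in isolation with gputimeit. A minimal sketch, assuming the same P-by-P-by-P complex gpuArray K as in the example above (variable names here are just illustrative):
% Time the element-wise exponential and the reduction separately on the GPU
P = 200;
K = gpuArray(complex(rand(P,P,P), rand(P,P,P)));
H = exp(1i*K);                        % precomputed input for the reduction
texp = gputimeit(@() exp(1i*K));      % element-wise kernel
tsum = gputimeit(@() sum(H,[1,3]));   % reduction along dimensions 1 and 3
disp(['exp: ',num2str(texp),' s, sum: ',num2str(tsum),' s'])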

Sign in to comment.

Accepted Answer

Joss Knight on 27 April 2023
These are the results I got on my (somewhat old) GeForce GTX 1080 Ti:
CPU time: 16.1288
GPU time: 0.96266
If I change the datatype to single I get:
CPU time: 14.9785
GPU time: 0.35102
That's maybe 2x faster?
So on the one hand your GPU is pretty slow and your CPU is pretty fast, and on the other maybe you could try using single precision instead, if you don't mind the loss of accuracy.
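For example, a minimal single-precision sketch of the same benchmark (only the data setup changes; funTestGPU is used exactly as posted):
% Single-precision variant of the GPU benchmark
P = 200;
URe = single(1/sqrt(2)*rand(P)); UIm = single(1/sqrt(2)*rand(P));
KRe = single(1/sqrt(2)*rand(P,P,P)); KIm = single(1/sqrt(2)*rand(P,P,P));
U = gpuArray(complex(URe, UIm));
K = gpuArray(complex(KRe, KIm));
UN = gpuArray(complex(zeros(P,'single'), zeros(P,'single')));
fgpu = @() funTestGPU(P,U,K,UN);
tgpu = gputimeit(fgpu);
disp(['GPU time (single): ',num2str(tgpu)])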
1 Comment
Damian Suski on 27 April 2023
Well, I would also say that my CPU is quite fast and my GPU is rather weak (only 800 CUDA cores, 4 GB RAM). Several years ago I bought the cheapest graphics card, without parallel computations in mind.
The results for your card (over 3.5k CUDA cores, 11 GB RAM) are pretty impressive. I tried a GeForce RTX 3060 (over 3.5k CUDA cores, 12 GB RAM) on another computer, and it gave 1.5 s for double precision. For the analogous code in PyTorch, I tried a Tesla T4 card (freely available on Google Colab), which also gave 1.5 s. So the proper choice of GPU card makes a difference.
I will definitely try single precision, but at the moment it is hard for me to say whether the precision loss will be acceptable for my purposes.

Sign in to comment.

More Answers (1)

Joss Knight on 27 April 2023
Moved: Matt J on 27 April 2023
Why are you recomputing H and HU inside the loop? They do not change. Also, if you remove the sum, then because the results of the first (P-1) iterations are never used, only the last computation of those values will actually take place.
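For the dummy example as posted, where K does not depend on k, a hoisted version would look roughly like this (funTestGPU2 is just an illustrative name):
function funTestGPU2(P,U,K,UN)
% Loop-invariant work is done once, outside the loop
HU = U.*exp(1i*K);      % implicit expansion to P-by-P-by-P
S = sum(HU,[1,3]);      % single reduction along dimensions 1 and 3
for k = 1:P
    UN(k,:) = S;        % every row receives the same result in this dummy case
end
end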
6 Comments
Damian Suski on 28 April 2023
I have tried a batching approach on my GPU, but have not noticed any speed-up. I will try it on a better GPU and describe the detailed results.
Damian Suski on 18 May 2023
I ran the experiments and did not notice any speedup from batching; the computation time increases proportionally to the batch size.
I have implemented the proper procedure and was able to reproduce the discussed speedup for the dummy example: the computation time was reduced from 186 s on the CPU to 42 s on the GPU, and on a better graphics card it is even shorter, 21 s. Summing up, I'm satisfied with the results.
What still concerns me is that in MATLAB the element-wise exp() is much faster than summing elements along two dimensions, while for the analogous calculations in CuPy or PyTorch the situation seems to be the opposite. Can I post the detailed results of my findings here, or should I start a new topic?

Sign in to comment.
