Speed of looped operation on a GPU depending on number of iterations in loop?

Question

D. Plotnick, 16 Oct 2017


https://kr.mathworks.com/matlabcentral/answers/361608-speed-of-looped-operation-on-a-gpu-depending-on-number-of-iterations-in-loop

Commented: Thomas Barrett, 9 Feb 2021

This is a question that I think will get a bit into the weeds of MATLAB's JIT and GPU toolbox. I include a minimal working example (MWE) below. I am using R2017a with a Titan X (Pascal) 12 GB GPU.

The basic issue is this: I am performing a looped operation (e.g. an interpolation) on the GPU, and if the number of iterations in the loop is small, the operation is very fast. However, once the number of iterations passes some threshold, each operation slows way down (a factor of >100 in my case).

To illustrate this, I ran the MWE below, which produced these two figures on my machine.

The first shows the average time per numerical operation versus the number of iterations in the loop. At values n<200 the operations take on the order of 1E-4 s/op. After that threshold is passed, they take around 2E-2 s/op, a massive slowdown.

The second shows the total time for the loop. Again, we see a change in behavior: the number of iterations doesn't affect the total time (this is why I think it's a JIT thing) until the threshold around n = 200, after which the total time increases linearly as expected.

Finally, for each loop I output the time spent on each individual operation. For 150 iterations, the time per operation is fairly constant in the 1E-4 s range, but for 200 iterations there is a sudden, massive change in the time partway through the loop.

The questions are:

(A) Why is this sudden change in speed occurring?
(B) Is there a way to code this so that it does not occur? (Neither pre-allocation nor clearing variables seemed to help.)
(C) If I cannot avoid it, can I predict it? In many cases I have the flexibility of changing the number of iterations in a loop through other means, so if keeping that number of iterations below some magic number will make my processing 400x faster, I will work on it.

My MWE code is below; it should be noted that this code shows this behavior on my machine, but it may not on yours. Also, the numerical operation being used here is a stand-in for an actual looped process and is just being used to illustrate the speed issue.

 % =========================================================================
 % MWE
 % =========================================================================
% Clean up
clear all
close all
clc
% Set up some demo data and interpolating spaces
times2 = cell(10,1);
times1 = zeros(10,1);
x = (1:4000).';
y = (1:240);
v = rand(240,4000);
xi = 4000*rand(500);
xi = repmat(xi,1,1,240);
[Mf,Nf,~] = size(xi);
yi = repmat(y,Mf,1,Nf);
yi = permute(yi,[1,3,2]);
% Put it all on the GPU
x = gpuArray(x);
y = gpuArray(y);
v = gpuArray(v);
xi = gpuArray(xi);
yi = gpuArray(yi);
% Outer loop - changes the number of iterations used in the inner loop
for nn = 1:10
    t1 = tic;
    nn
    timesIn = zeros(50*nn,1);
    % Inner loop, perform our interpolation n-times
    for ii  = 1:50*nn
        tI = tic;
        vi = interp2(x,y,v,xi,yi);
        vi = sum(vi,3);
        timesIn(ii) = toc(tI);
    end   
    % Plot the current time/op and save times
    figure(1)
    plot(timesIn); title(nn); drawnow;
    times1(nn) = toc(t1);
    times2{nn} = timesIn;
    toc(t1)
end
% Make Figures 
for  nn = 1:10
    mTimes(nn) = mean(times2{nn});
end
figure; plot((1:10)*50,mTimes); title('Mean time/operation'); ylabel('Time'); xlabel('n-Iterations');
figure; plot((1:10)*50,times1); title('Total Loop Time'); ylabel('Time'); xlabel('n-Iterations');

Comments: 1

Thomas Barrett, 9 Feb 2021

Did you manage to figure this out in the end?


Answer 1

Joss Knight, 16 Oct 2017


https://kr.mathworks.com/matlabcentral/answers/361608-speed-of-looped-operation-on-a-gpu-depending-on-number-of-iterations-in-loop#answer_286093

You're just doing the timing in an invalid way. Most GPU operations run asynchronously, so all you were timing for the first 100 or so iterations was the kernel launch time. Eventually, you filled the queue and no more kernels could be launched until running kernels had finished. So then you are actually timing the true cost. Use wait(gpuDevice) to synchronize the device before each call to tic or toc to ensure that the timing values make sense. Even better, use gputimeit to get more accurate timings for functional code.
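The synchronization described above can be sketched as follows. This is only an illustration, assuming Parallel Computing Toolbox and a supported GPU; the query sizes (100-by-100) and the iteration count are arbitrary choices made to keep the example small, not values from the original MWE.

```matlab
% Sketch: synchronize the GPU before every tic and toc so that each
% measurement covers the kernel's execution, not just its launch.
dev = gpuDevice;                          % handle to the currently selected GPU

% Stand-in data (sizes are assumptions, smaller than in the MWE)
x  = gpuArray((1:4000).');
y  = gpuArray(1:240);
v  = rand(240,4000,'gpuArray');
xi = 1 + 3999*rand(100,100,'gpuArray');   % query points inside [1,4000]
yi = 1 + 239*rand(100,100,'gpuArray');    % query points inside [1,240]

nIter   = 50;
timesIn = zeros(nIter,1);
for ii = 1:nIter
    wait(dev);                            % drain any previously queued work
    tI = tic;
    vi = interp2(x,y,v,xi,yi);
    wait(dev);                            % block until this kernel has finished
    timesIn(ii) = toc(tI);
end
fprintf('Mean time per operation: %.3g s\n', mean(timesIn));
```

With both `wait(dev)` calls in place, the per-iteration times should no longer depend on how many kernels happen to be queued, so the apparent threshold in the loop count should disappear.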

Comments: 2

D. Plotnick, 17 Oct 2017

Thanks as always, Joss. Unfortunately, this means I made an error in how I formed my MWE: in my actual code, something odd is happening with performance that is not related to how the timing is measured. I'll have to come up with another, more appropriate MWE.

D. Plotnick, 19 Oct 2017

Joss, I have revised a question posted here if you have a chance to look at it. I did not end up using gputimeit in that MWE, since I couldn't figure out a way to code it using anonymous functions not requiring an input.
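For reference, one way to use `gputimeit` with a function that takes no inputs is to let an anonymous function capture the workspace variables when it is created. A minimal sketch, assuming Parallel Computing Toolbox and a supported GPU; the data sizes are illustrative assumptions, not those of the MWE.

```matlab
% Sketch: time a GPU operation with gputimeit via a zero-input
% anonymous function. Variables are captured when the handle is created.
x  = gpuArray((1:4000).');
y  = gpuArray(1:240);
v  = rand(240,4000,'gpuArray');
xi = 1 + 3999*rand(100,100,'gpuArray');   % assumed query size
yi = 1 + 239*rand(100,100,'gpuArray');

f = @() interp2(x,y,v,xi,yi);             % no inputs; x, y, v, xi, yi are captured
t = gputimeit(f);                         % handles synchronization and repeats internally
fprintf('interp2 on the GPU: %.3g s per call\n', t);
```

Because the handle stores the values it captured, later changes to `xi` or `yi` in the workspace do not affect `f`; the handle has to be recreated to time different inputs.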
