MATLAB Answers

Confusing results when measuring time of function on CPU/GPU with tic-toc/(gpu)timeit

조회 수: 3(최근 30일)
Florian
Florian 31 May 2020
댓글: Florian 1 Jun 2020
I am currently working on implementing an algorithm on GPU, which is already working. When measuring the time of the algorithm - and especially its subroutines - I encountered very strange results. In order to understand these results better I decided to evaluate the times again for a very simple function. Once again I was surprised by the results and am therefore asking here for an explanation of the results. I post the code as well as the resulting output and hope that someone can explain me, why I am measuring these times.
The code is given by:
%%
% Goal of this script is to better understand the disagreement of the
% time measurements with tic toc and gpu time it.
% fix random numbers
clear;
rng('default');
s = rng(5);
% clear workspace
%% Part 1 measure whether there is a time difference between b and b_gpu
disp('##################################################################')
disp('##################################################################')
disp(' apply b(_gpu) once and measure with tic toc and timeit; ')
disp('##################################################################')
disp('##################################################################')
disp('-----------------------------------------------------')
disp('measuring the times for one application of b(_gpu) with tic and toc')
disp('-----------------------------------------------------')
% 1.1: Timeit and GPU timeit
% 1.1.1: CPU
size=200000000;
input=rand(size,1);
b_st=2;
amplitude=4;
periode=5;
b_st_gpu=gpuArray(b_st);
amplitude_gpu=gpuArray(amplitude);
periode_gpu=gpuArray(periode);
b = @(u) b_st+amplitude*sin(u*2*pi/periode);
b_gpu = @(u) b_st_gpu+amplitude_gpu*sin(u*gpuArray(2)*gpuArray(pi)/periode_gpu);
tic
test=b(input);
time_111=toc;
disp(['Time for operation b on CPU is ',num2str(time_111)])
% 1.1.2 GPU with original b
input_gpu_array = gpuArray(input);
tic
test=b(input_gpu_array);
time_112 = toc;
disp(['Time for operation b on GPU is ',num2str(time_112)])
% 1.1.3 GPU with b_gpu
input_gpu_array = gpuArray(input);
tic
test=b_gpu(input_gpu_array);
time_113 = toc;
disp(['Time for operation b_gpu on GPU is ',num2str(time_113)])
% 1.2: Timeit and GPU timeit
disp('-----------------------------------------------------')
disp('measuring the times for one application of b(_gpu) with timeit')
disp('-----------------------------------------------------')
clear;
rng('default');
s = rng(5);
size=200000000;
input=rand(size,1);
b_st=2;
amplitude=4;
periode=5;
b_st_gpu=gpuArray(b_st);
amplitude_gpu=gpuArray(amplitude);
periode_gpu=gpuArray(periode);
b = @(u) b_st+amplitude*sin(u*2*pi/periode);
b_gpu = @(u) b_st_gpu+amplitude_gpu*sin(u*gpuArray(2)*gpuArray(pi)/periode_gpu);
test=b(input);
time_121=timeit(@()b(input));
disp(['Time for operation b on CPU is ',num2str(time_121)])
% 1.1.2 GPU with original b
input_gpu_array = gpuArray(input);
test=b(input_gpu_array);
time_122 = gputimeit(@()b(input_gpu_array));
disp(['Time for operation b on GPU is ',num2str(time_122)])
% 1.1.3 GPU with b_gpu
input_gpu_array = gpuArray(input);
test=b_gpu(input_gpu_array);
time_123 = gputimeit(@()b_gpu(input_gpu_array));
disp(['Time for operation b_gpu on GPU is ',num2str(time_123)])
disp(' ')
disp(' ')
disp('Q1.1: Why is the time measured with gputimeit smaller than wit tic toc')
disp('Q1.2: Why is the time on the other hand longer for the CPU when measured')
disp(' with timeit')
disp('Q1.3: Why is b_gpu faster than b on the gpu when measured with tic toc')
disp('Q1.4: Why are b_gpu and b equally fast when measured with gputimeit')
%% Part 2: measure the time it takes to do the operation iteratively
disp(' ')
disp(' ')
disp(' ')
disp(' ')
disp(' ')
disp(' ')
disp('##################################################################')
disp('##################################################################')
disp(' apply b(_gpu) iteratively and measure with tic toc; ')
disp(' important note: the input is overwritten as input = b(input) ')
disp('##################################################################')
disp('##################################################################')
disp('-----------------------------------------------------')
disp('measuring the times for one application of b(_gpu) with tic and toc')
disp('-----------------------------------------------------')
clear;
rng('default');
s = rng(5);
% 1.1:
% 2.1.1: CPU
size=200000000;
input=rand(size,1);
b_st=2;
amplitude=4;
periode=5;
b_st_gpu=gpuArray(b_st);
amplitude_gpu=gpuArray(amplitude);
periode_gpu=gpuArray(periode);
b = @(u) b_st+amplitude*sin(u*2*pi/periode);
b_gpu = @(u) b_st_gpu+amplitude_gpu*sin(u*gpuArray(2)*gpuArray(pi)/periode_gpu);
tic
for counter = 1:10
input=b(input);
end
time_211=toc;
disp(['Time for operation 10 times b on CPU is ',num2str(time_211)])
disp(['Time on average for operation b on CPU is ',num2str(time_211/10)])
disp(' ')
% 2.1.2 GPU with original b
input_gpu_array = gpuArray(input);
for counter = 1:10
input_gpu_array=b(input_gpu_array);
end
time_212=toc;
disp(['Time for operation 10 times b on GPU is ',num2str(time_212)])
disp(['Time on average for operation b on GPU is ',num2str(time_212/10)])
disp(' ')
% 2.1.3 GPU with b_gpu
input_gpu_array = gpuArray(input);
for counter = 1:10
input_gpu_array=b_gpu(input_gpu_array);
end
time_213=toc;
disp(['Time for operation 10 times b_gpu on GPU is ',num2str(time_213)])
disp(['Time on average for operation b_gpu on GPU is ',num2str(time_213/10)])
disp(' ')
disp(' ')
disp(' ')
disp('Q2.1: Why is now the CPU faster than the GPU with tic toc?')
disp('Q2.2: Why are now the b and b_gpu equally fast when measured with tic toc?')
%% Part 3: measure the time it takes to do the operation iteratively but
% without overwriting stuff
disp(' ')
disp(' ')
disp(' ')
disp(' ')
disp(' ')
disp(' ')
disp('##################################################################')
disp('##################################################################')
disp(' apply b(_gpu) iteratively and measure with tic toc; ')
disp('important note: the input is not! overwritten as result = b(input)')
disp('##################################################################')
disp('##################################################################')
disp('-----------------------------------------------------')
disp('measuring the times for one application of b(_gpu) with tic and toc')
disp('-----------------------------------------------------')
clear;
rng('default');
s = rng(5);
% 3.1.1: CPU
size=200000000;
input=rand(size,1);
b_st=2;
amplitude=4;
periode=5;
b_st_gpu=gpuArray(b_st);
amplitude_gpu=gpuArray(amplitude);
periode_gpu=gpuArray(periode);
b = @(u) b_st+amplitude*sin(u*2*pi/periode);
b_gpu = @(u) b_st_gpu+amplitude_gpu*sin(u*gpuArray(2)*gpuArray(pi)/periode_gpu);
tic
for counter = 1:10
result=b(input);
end
time_311=toc;
disp(['Time for operation 10 times b on CPU is ',num2str(time_311)])
disp(['Time on average for operation b on CPU is ',num2str(time_311/10)])
disp(' ')
% 3.1.2 GPU with original b
input_gpu_array = gpuArray(input);
for counter = 1:10
result=b(input_gpu_array);
end
time_312=toc;
disp(['Time for operation 10 times b on GPU is ',num2str(time_312)])
disp(['Time on average for operation b on GPU is ',num2str(time_312/10)])
disp(' ')
% 3.1.3 GPU with b_gpu
input_gpu_array = gpuArray(input);
for counter = 1:10
result=b_gpu(input_gpu_array);
end
time_313=toc;
disp(['Time for operation 10 times b_gpu on GPU is ',num2str(time_313)])
disp(['Time on average for operation b_gpu on GPU is ',num2str(time_313/10)])
disp(' ')
disp(' ')
disp(' ')
disp('Q3.1: Why are all three times now shorter than in 2?')
and the corresponding output by:
##################################################################
##################################################################
apply b(_gpu) once and measure with tic toc and timeit;
##################################################################
##################################################################
-----------------------------------------------------
measuring the times for one application of b(_gpu) with tic and toc
-----------------------------------------------------
Time for operation b on CPU is 0.34433
Time for operation b on GPU is 0.14935
Time for operation b_gpu on GPU is 0.093943
-----------------------------------------------------
measuring the times for one application of b(_gpu) with timeit
-----------------------------------------------------
Time for operation b on CPU is 0.42076
Time for operation b on GPU is 0.036677
Time for operation b_gpu on GPU is 0.036649
Q1.1: Why is the time measured with gputimeit smaller than wit tic toc
Q1.2: Why is the time on the other hand longer for the CPU when measured
with timeit
Q1.3: Why is b_gpu faster than b on the gpu when measured with tic toc
Q1.4: Why are b_gpu and b equally fast when measured with gputimeit
##################################################################
##################################################################
apply b(_gpu) iteratively and measure with tic toc;
important note: the input is overwritten as input = b(input)
##################################################################
##################################################################
-----------------------------------------------------
measuring the times for one application of b(_gpu) with tic and toc
-----------------------------------------------------
Time for operation 10 times b on CPU is 5.5627
Time on average for operation b on CPU is 0.55627
Time for operation 10 times b on GPU is 6.3557
Time on average for operation b on GPU is 0.63557
Time for operation 10 times b_gpu on GPU is 7.091
Time on average for operation b_gpu on GPU is 0.7091
Q2.1: Why is now the CPU faster than the GPU with tic toc?
Q2.2: Why are now the b and b_gpu equally fast when measured with tic toc?
##################################################################
##################################################################
apply b(_gpu) iteratively and measure with tic toc;
important note: the input is not! overwritten as result = b(input)
##################################################################
##################################################################
-----------------------------------------------------
measuring the times for one application of b(_gpu) with tic and toc
-----------------------------------------------------
Time for operation 10 times b on CPU is 4.2037
Time on average for operation b on CPU is 0.42037
Time for operation 10 times b on GPU is 4.6343
Time on average for operation b on GPU is 0.46343
Time for operation 10 times b_gpu on GPU is 5.008
Time on average for operation b_gpu on GPU is 0.5008
Q3.1: Why are all three times now shorter than in 2?

  댓글 수: 0

댓글을 달려면 로그인하십시오.

답변(1개)

Edric Ellis
Edric Ellis 1 Jun 2020
Ok, there's a lot going on here. A few observations:
  1. Putting scalar values on the GPU is often counter-productive, since the implementation often needs to implicitly gather them to the CPU before proceeding.
  2. Timing GPU operations with tic and toc is not reliable because the GPU can perform some (but not all) operations asynchronously. This is why in your first section the timings look odd - some time that should be accounted for in the first operation actually ends up being timed in the second operation
  3. You're missing some tic calls in sections 2.1.2, 2.1.3, 3.1.2, and 3.1.3, so the timings there are completely messed up. Fixing this as per my suggestion below should give you sensible timings.
To get accurate timings of iterative GPU operations, you should do something like this:
dev = gpuDevice();
timer = tic();
for i = 1:N
myGpuAlgorithm();
end
wait(dev);
t = toc(timer);
Calling wait on the gpuDevice ensures all asynchronous work has completed.

  댓글 수: 3

Florian
Florian 1 Jun 2020
Thank you for your answer and especially for pointing out the missing tic's! Now that I have added your suggestions (using wait, adding the missing tics), the results are more reasonable. Still I have some questions concerning your observations:
  1. As far as I understand (1.) b_gpu should perform worse than b, as the scalars have to be transfered to the CPU and then back to the GPU. Yet the performance for b_gpu is (after fixing my errors) faster or equally fast as b (on the GPU). Generally speaking: if i have to use a function of type b=@(array_gpu) sin(array_gpu.*scalar), what would garanty in general the best performance on the GPU? The algorithm, which I want to implement for the GPU, unfortunately relies on such functions.
  2. As mentioned, I followd your advice and use now wait(device) for measuring the time. Still one thing that suprises me is the comparison of (gpu)timeit and tic-(wait(dev)-)toc for the simple case of 1 (applying b only once). While b and b_gpu give almost the same times (roughly 0.043 seconds) when using gputimeit, when using tic toc (with wait(dev)) I obtain for b (on GPU) 0.1824 seconds and for b_gpu only 0.043 seconds. I assume this has to something with initialising the functions or transfering the scalars to the gpu in the case of using b, but I am not sure, what the reason is exactly. What confuses me even more is the different behaviour in the case of performing the operation within the for loop. Using
input_gpu_array=b_gpu(input_gpu_array);
within the for-loop gives very similar results for both b and b_gpu (roughly 0.7 seconds for the whole
loop for both cases). If I, on the other hand, use
res=b_gpu(input_gpu_array);
within the for-loop, it leads to b_gpu being 3 times faster than b (0.17 seconds vs. 0.06 seconds).
Edric Ellis
Edric Ellis 1 Jun 2020
  1. Usually, I'd expect better performance when scalars are kept as CPU data, but it shouldn't make a huge difference either way if the whole operation is non-trivial.
  2. Hm, I'm not quite sure what's going on there. Perhaps you could narrow that down to a simple reproduction showing exactly what you're trying.
Florian
Florian 1 Jun 2020
I think the script, which I upload with this message, should show what I am trying to do. In principle I just (try to) compare the performance on CPU and GPU for b and b_gpu. For this I
  1. apply the function once and measure this with tic toc and timeit. There, I already encounter the difference between b_gpu and gpu
  2. apply the function in a for loop, where I (1) overwrite each time the input like input=b(input) and (2) just return the output each time output=b(input). While for (1) the results of b_gpu and b are the same, for (2) they are different.

댓글을 달려면 로그인하십시오.

태그

제품


릴리스

R2020a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by