
Strange: Execution time per iteration increases when GPU arrays are used

Maurice on 10 Feb 2014
Edited: Joss Knight on 19 Mar 2014
Hi,
the following code computes the product of two rather large 1-D arrays on the GPU 20,000 times:
 1  reset(gpuDevice);
 2  a = 1:1:(256*256*100);
 3  b = a;
 4  c = a;
 5  a = gpuArray(a);
 6  b = gpuArray(b);
 7  c = gpuArray(c);
 8  tic
 9  for z = 1:20000
10      tstart(z) = tic;
11      c = a .* b;
12      telapsed(z) = toc(tstart(z));
13  end
14  toc
If lines 5-7 are commented out, the product is then computed on the CPU.
I recorded the execution time per iteration for three cases; the total execution times were:
1. CPU: 386.48 seconds
2. GPU, driver 332.21 downloaded from the NVIDIA page: 17.11 seconds
3. GPU, driver 332.21 included in the current CUDA Toolkit version 5.5: 16.99 seconds
Here are my settings:
  • no memory leakage (I've checked that - the memory usage on the GPU is constant)
  • no other CPU or GPU processes are running or interfering with the computation
  • GPU: GeForce GTX 570
The execution times for cases 2 and 3 are very strange.
  1. Do you know why the execution time increases in case 2?
  2. Do you know why there is a 'pattern' in cases 2 and 3?
It would be perfect if the execution time in case 3 stayed at 1e-4 s per iteration (as in the first 6,000 iterations).
I'm running this on Windows 7. I know that it is hard to measure times below 1e-2 s per iteration on Windows. However, I can confirm that in case 3 the overall execution time for 6,000 iterations is 0.32 seconds. Therefore 20,000 iterations should be computed in less than 1 second (I measured 17 seconds). This confirms that the measurements are correct.
Is this a known bug in MATLAB? I don't see any reason why the execution time per iteration should increase.
Thanks for your advice!
Best,
Maurice
4 Comments
Maurice on 14 Feb 2014
Thanks for the hints; I've updated the title and the other places where I mixed this up.
Joss Knight on 14 Mar 2014
Maurice, what is the difference between cases 2 and 3? Your comments imply the driver is different, but the version number is the same.


Answers (3)

Anton on 4 Mar 2014
I have the same problem using gpuArray (but still no solution). I would also guess it is a memory problem. Maybe there is an input transfer buffer on the NVIDIA graphics card. If someone has a solution, please post it!
3 Comments
Maurice on 5 Mar 2014
Jill, you're right that the timing using tic and toc may not be accurate. However, one can analyze the total running time as I wrote above: 6,000 iterations on the GPU run in 0.32 seconds in total, which is 0.32/6000 = 5.3*10^-5 seconds per iteration. However, 20,000 iterations run in approx. 17 seconds in total; in this case the time per iteration is more than one order of magnitude higher (approx. 8.5*10^-4 seconds), and the execution speed keeps decreasing with further iterations (see the quick check below). Best,
Maurice
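A quick sanity check of the arithmetic quoted above, in plain MATLAB (the 0.32 s and 17 s totals are the figures from the post):
tPerIter6000  = 0.32 / 6000     % ~5.3e-5 s per iteration over the first 6,000 iterations
tPerIter20000 = 17 / 20000      % ~8.5e-4 s per iteration averaged over all 20,000 iterations
tPerIter20000 / tPerIter6000    % ~16x, i.e. more than an order of magnitude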



Joss Knight on 14 Mar 2014
Edited: Joss Knight on 19 Mar 2014
As Jill points out, your timings are basically meaningless because you are using tic and toc. c = a.*b returns as soon as the GPU kernel is queued. All you've done is queue up a huge number of kernels - the step change is just when the queue is full and so you've started actually measuring the kernel execution time.
The only fair way to measure the time is to divide the overall time by the number of iterations, or to use gputimeit. You could use wait(gpuDevice) inside the loop, but this unfairly adds the cost of waiting for the kernel to complete and return control to MATLAB before queuing the next one. I ran your code using both gputimeit and wait(gpuDevice) and the execution time was dead flat throughout.
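For reference, a minimal sketch of the "overall time divided by the number of iterations" approach, assuming a and b are the gpuArrays from the original code; the single wait(gpuDevice) after the loop ensures all queued kernels have completed before toc is read:
N = 20000;
tic
for z = 1:N
    c = a .* b;             % each call just queues a kernel and returns
end
wait(gpuDevice);            % block until every queued kernel has finished
tPerIteration = toc / N     % fair average time per iteration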
You also need to pre-allocate your timing array, otherwise you're including the time taken to grow the array. This will necessarily take more and more time each time more memory is allocated because of the cost of copying the existing data to the new location. This is probably the reason for your growing execution time.
Try this instead and let me know whether it gives you flat timings:
telapsed = zeros(20000,1); % Pre-allocate
for z = 1:20000
    telapsed(z) = gputimeit(@() a.*b);
end
It will take a long time to execute because gputimeit runs the code at least 10 times to get an average value. You might want to reduce the number of iterations. If you don't think this code is testing the timing properly, do it with wait(gpuDevice) - just be aware that the actual execution time is much less:
telapsed = zeros(20000,1); % Pre-allocate
for z = 1:20000
    tstart = tic;
    c = a .* b;
    wait(gpuDevice);
    telapsed(z) = toc(tstart);
end
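As a small follow-up, a simple way to check whether the recorded timings stay flat is to plot them (this assumes telapsed has been filled by one of the loops above):
plot(telapsed)                             % a flat line means no per-iteration slowdown
xlabel('Iteration')
ylabel('Elapsed time per iteration (s)')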

Iain on 10 Feb 2014
I suspect that it has to do with caching.
A slow main DRAM of 1 GB or whatever will have "slow" write access times (tens to hundreds of ns). A cached SRAM of 16 MB (or whatever) will have "fast" write access times (sub-ns).
Once the cached SRAM fills, it needs to start unloading to the main DRAM, which may restrict your ability to write to the SRAM and say "done".
2 Comments
Maurice on 14 Feb 2014
There is no write-back to the main RAM during the iterations; everything is computed on either the CPU or the GPU.
Iain on 5 Mar 2014
On a GPU, you'll still have some local static RAM running at very high clock rates (a few MB is likely), a much bigger RAM space (hundreds to thousands of MB) still on the graphics card, then the main RAM on the motherboard (ones to tens of GB), and then a further massive "RAM" located on the machine's hard drive (hundreds to thousands of GB).
It might not be writing back to the motherboard RAM, but it's almost certain to be writing back to a slow, large-scale RAM on the graphics card.

