
Strange: Execution time per iteration increases when GPU arrays are used

Maurice on 10 Feb 2014
Edited: Joss Knight on 19 Mar 2014
Hi,
the following code computes the product of two rather large 1-D arrays on the GPU 20,000 times:
 1  reset(gpuDevice);
 2  a = 1:1:(256*256*100);
 3  b = a;
 4  c = a;
 5  a = gpuArray(a);
 6  b = gpuArray(b);
 7  c = gpuArray(c);
 8  tic
 9  for z = 1:20000
10      tstart(z) = tic;
11      c = a .* b;
12      telapsed(z) = toc(tstart(z));
13  end
14  toc
If lines 5-7 are commented out, the product is then computed on the CPU.
I recorded the execution time per iteration for three cases; the total execution times were:
1. CPU: 386.48 seconds
2. GPU, driver 332.21 downloaded from the NVIDIA page: 17.11 seconds
3. GPU, driver 332.21 included in the current CUDA Toolkit version 5.5: 16.99 seconds
Here are my settings:
  • no memory leakage (I've checked that - the memory usage on the GPU is constant)
  • no other CPU or GPU processes are running or interfering with the computation
  • GPU: GeForce GTX 570
The execution times for cases 2 and 3 are very strange.
  1. Do you know why the execution time increases in case 2?
  2. Do you know why there is a 'pattern' in cases 2 and 3?
It would be perfect if the execution time in case 3 stayed at 1e-4 s per iteration (as in the first 6,000 iterations).
I'm running this on Windows 7. I know that it is hard to measure times below 1e-2 s per iteration on Windows. However, I can confirm that in case 3 the overall execution time for 6,000 iterations is 0.32 seconds. Therefore 20,000 iterations should be computed in less than 1 second (I measured 17 seconds). This confirms that the measurements are correct.
Is this a known bug in MATLAB? I don't see any reason why the execution time per iteration should increase.
Thanks for your advice!
Best,
Maurice
4 Comments
Maurice on 14 Feb 2014
Thanks for the hints; I've updated the title and the other places where I mixed this up.
Joss Knight on 14 Mar 2014
Maurice, what is the difference between cases 2 and 3? Your comments imply the driver is different, but the version number is the same.


Answers (3)

Anton on 4 Mar 2014
I have the same problem using gpuArray (but still no solution). I would also guess it is a memory problem. Maybe there is an input transfer buffer on the NVIDIA graphics card. If someone has a solution, please post it!
3 Comments
Maurice on 5 Mar 2014
Jill, you're right that the timing using tic and toc may not be accurate. However, one can analyze the total running time as I wrote above: 6,000 iterations on the GPU run in 0.32 seconds in total, which is 0.32/6000 = 5.3*10^-5 seconds per iteration. However, 20,000 iterations run in approx. 17 seconds in total; in this case the time per iteration is more than one order of magnitude higher (approx. 8.5*10^-4 seconds), and the execution speed keeps decreasing with further iterations (see the quick check below). Best,
Maurice
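A quick sanity check of the arithmetic quoted above, in plain MATLAB (the 0.32 s and 17 s totals are the figures from the post):
tPerIter6000  = 0.32 / 6000     % ~5.3e-5 s per iteration over the first 6,000 iterations
tPerIter20000 = 17 / 20000      % ~8.5e-4 s per iteration averaged over all 20,000 iterations
tPerIter20000 / tPerIter6000    % ~16x, i.e. more than an order of magnitude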



Joss Knight on 14 Mar 2014
Edited: Joss Knight on 19 Mar 2014
As Jill points out, your timings are basically meaningless because you are using tic and toc. c = a.*b returns as soon as the GPU kernel is queued. All you've done is queue up a huge number of kernels - the step change is just when the queue is full and so you've started actually measuring the kernel execution time.
The only fair way to measure the time is to divide the overall time by the number of iterations, or to use gputimeit. You could use wait(gpuDevice) inside the loop, but this unfairly adds the cost of waiting for the kernel to complete and return control to MATLAB before queuing the next one. I ran your code using both gputimeit and wait(gpuDevice) and the execution time was dead flat throughout.
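For reference, a minimal sketch of the "overall time divided by the number of iterations" approach, assuming a and b are the gpuArrays from the original code; the single wait(gpuDevice) after the loop ensures all queued kernels have completed before toc is read:
N = 20000;
tic
for z = 1:N
    c = a .* b;             % each call just queues a kernel and returns
end
wait(gpuDevice);            % block until every queued kernel has finished
tPerIteration = toc / N     % fair average time per iteration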
You also need to pre-allocate your timing array, otherwise you're including the time taken to grow the array. This will necessarily take more and more time each time more memory is allocated because of the cost of copying the existing data to the new location. This is probably the reason for your growing execution time.
Try this instead and let me know whether it gives you flat timings:
telapsed = zeros(20000,1); % Pre-allocate
for z = 1:20000
    telapsed(z) = gputimeit(@() a.*b);
end
It will take a long time to execute because gputimeit runs the code at least 10 times to get an average value. You might want to reduce the number of iterations. If you don't think this code is testing the timing properly, do it with wait(gpuDevice) - just be aware that the actual execution time is much less:
telapsed = zeros(20000,1); % Pre-allocate
for z = 1:20000
    tstart = tic;
    c = a .* b;
    wait(gpuDevice);
    telapsed(z) = toc(tstart);
end
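As a small follow-up, a simple way to check whether the recorded timings stay flat is to plot them (this assumes telapsed has been filled by one of the loops above):
plot(telapsed)                             % a flat line means no per-iteration slowdown
xlabel('Iteration')
ylabel('Elapsed time per iteration (s)')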

Iain on 10 Feb 2014
I suspect that it has to do with caching.
A slow main DRAM of 1 GB or whatever will have "slow" write access times (tens to hundreds of ns). A cached SRAM of 16 MB (or whatever) will have "fast" write access times (sub-ns).
Once the cached SRAM fills, it needs to start unloading to the main DRAM, which may restrict your ability to write to the SRAM and say "done".
2 Comments
Maurice on 14 Feb 2014
There is no write-back to the main RAM during the iterations; everything is computed on either the CPU or the GPU.
Iain on 5 Mar 2014
On a GPU, you'll still have some local static RAM running at very high clock rates (a few MB is likely), a much bigger RAM space (hundreds to thousands of MB) still on the graphics card, then the main RAM on the motherboard (ones to tens of GB), and then a further massive "RAM" located on the machine's hard drive (hundreds to thousands of GB).
It might not be writing back to the motherboard RAM, but it's almost certain to be writing back to a slow, large-scale RAM on the graphics card.

