Slow performance of fftn in the gpu when used inside a loop

Question

Arabarra on 9 Oct 2019


https://kr.mathworks.com/matlabcentral/answers/484382-slow-performance-of-fftn-in-the-gpu-when-used-inside-a-loop

Commented: Arabarra on 29 Apr 2020

I have just realized that the execution time of fftn operations inside a loop is not proportional to the number of loop iterations when working on the GPU.

As an example, if I define a cube on the GPU

a = gpuArray(ones(256,256,256,'single'));

I see that the elapsed time does not scale with the number of loop iterations. For a moderate loop I get:

 >> N=100;tic;for i=1:N;g=fftn(a);end;toc
Elapsed time is 0.008618 seconds.

... but for a loop which is 10 times bigger

>> N=1000;tic;for i=1:N;g=fftn(a);end;toc
Elapsed time is 7.299844 seconds.

the total time does not scale by a factor of 10 but by a factor of roughly 1000! I know tic/toc is not the best way to measure performance, but it is still the time seen by the users of the program. Is there some basic principle of handling gpuArrays inside loops that I am missing?

Comments: 2

Daniel M on 9 Oct 2019

Edited: Daniel M on 9 Oct 2019

I suspect that most of the time is spent transferring data back and forth to the GPU. If you post your full code we could check. But based on the line you wrote above, you may not be familiar with optimizing performance for the GPU.

tic
a = gpuArray(ones(256,256,256,'single'));
toc
tic
b = ones(256,'single','gpuArray');
toc
Elapsed time is 0.072559 seconds.
Elapsed time is 0.019793 seconds.
% method 1 is over 3x slower
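
One caveat on the comparison above: ones(256,'single','gpuArray') creates a 256-by-256 matrix rather than the 256-by-256-by-256 cube, so the two timings are not like for like. A minimal sketch of a like-for-like comparison follows, using gputimeit so that the GPU work has finished before the clock stops. The variable names tCopy and tDirect are illustrative, and the numbers will depend on the hardware.

```matlab
% Method 1: build the cube on the host, then copy it to the GPU.
tCopy = gputimeit(@() gpuArray(ones(256,256,256,'single')));

% Method 2: create the same cube directly in GPU memory.
tDirect = gputimeit(@() ones(256,256,256,'single','gpuArray'));

% Report both timings; no particular ratio is assumed here.
fprintf('host then copy: %.4f s, direct on GPU: %.4f s\n', tCopy, tDirect);
```

Either way, the array is created once before the loop, so this cost is separate from the per-iteration cost of fftn.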

Arabarra on 15 Oct 2019

The full code is pretty much the three lines above... not certain what you mean.


Answer 1

Edric Ellis on 10 Oct 2019


https://kr.mathworks.com/matlabcentral/answers/484382-slow-performance-of-fftn-in-the-gpu-when-used-inside-a-loop#answer_395647


Various methods on the GPU operate to some extent asynchronously, but there are limits to this, depending on the amount of memory available and so on. The best way to time GPU operations is to use gputimeit, like this:

a = gpuArray(ones(256,256,256,'single'));
% Basic case, no looping
t1 = gputimeit(@() fftn(a));
% Looping cases
t100 = gputimeit(@() iLoop(a,100));
t1000 = gputimeit(@() iLoop(a,1000));
% Compare results
disp([t1, t100/100, t1000/1000])
% Local function: call fftn N times and discard the result
function iLoop(a,N)
for i = 1:N
    fftn(a);
end
end

On my machine, I see that the results are consistent - i.e. gputimeit does a good job of getting an accurate time even for a single call to fftn. Running the above script, the result I get is:

>> repro
    0.0081    0.0081    0.0081

Comments: 3

Joss Knight on 15 Oct 2019

The point is simply that your timing code is wrong, because the FFTs have not completed when you call toc. For a short loop you are mostly timing how long it takes to queue the kernels. After enough iterations you can't queue any more kernels, and you start to see the real computation time. You need to consult the documentation on how to time your code:
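
As a rough illustration of this effect, the sketch below times the same loop twice, once stopping the clock as soon as the loop returns and once after waiting for the GPU to finish. It assumes the 256-by-256-by-256 single-precision cube from the question; the names tQueued and tSynced are illustrative, and the actual numbers depend on the device.

```matlab
a  = gpuArray(ones(256,256,256,'single'));
gd = gpuDevice();      % handle to the currently selected GPU
N  = 100;

wait(gd);              % make sure nothing is pending before timing

% Timing 1: toc is called as soon as the loop returns, so the
% GPU may still be working through queued FFTs at that point.
tic
for i = 1:N
    g = fftn(a);
end
tQueued = toc;
wait(gd);              % drain the queue before the next measurement

% Timing 2: wait blocks until all queued GPU work has completed.
tic
for i = 1:N
    g = fftn(a);
end
wait(gd);
tSynced = toc;

fprintf('without wait: %.4f s, with wait: %.4f s\n', tQueued, tSynced);
```

With the wait in place, the measured time should grow roughly in proportion to N, which is the scaling the original tic/toc measurement appeared to violate.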

Measure Performance on the GPU

The best way to measure performance on the GPU is to use gputimeit. This function takes as input a function handle with no input arguments, and returns the measured execution time of that function. It takes care of such benchmarking considerations as repeating the timed operation to get better resolution, executing the function before measurement to avoid initialization overhead, and subtracting out the overhead of the timing function. Also, gputimeit ensures that all operations on the GPU have completed before the final timing.

For example, consider measuring the time taken to compute the lu factorization of a random matrix A of size N-by-N. You can do this by defining a function that does the lu factorization and passing the function handle to gputimeit:

A = rand(N,'gpuArray');
fh = @() lu(A);
gputimeit(fh,2); % 2nd arg indicates number of outputs

You can also measure performance with tic and toc. However, to get accurate timing on the GPU, you must wait for operations to complete before calling toc. There are two ways to do this. You can call gather on the final GPU output before calling toc: this forces all computations to complete before the time measurement is taken. Alternately, you can use the wait function with a gpuDevice object as its input. For example, if you wanted to measure the time taken to compute the lu factorization of matrix A using tic, toc, and wait, you can do it as follows:

gd = gpuDevice();
tic();
[l,u] = lu(A);
wait(gd);
tLU = toc();
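
The gather alternative mentioned above can be sketched in the same way. This is only an illustration: the matrix size N = 1000 and the names lHost and uHost are assumptions, not part of the documentation excerpt. Note that the measured time then also includes copying the results back to host memory.

```matlab
N = 1000;                      % assumed matrix size for illustration
A = rand(N,'gpuArray');

tic();
[l,u] = lu(A);
% gather copies the results to host memory, which cannot happen
% until the factorization has finished on the GPU.
[lHost,uHost] = gather(l,u);
tLU = toc();

fprintf('lu plus gather took %.4f s\n', tLU);
```

If the device-to-host transfer should not be part of the measurement, the wait(gd) form is the more direct choice.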

You can also use the MATLAB profiler to show how computation time is distributed in your GPU code. Note that, to accomplish timing measurements, the profiler runs each line of code independently, so it cannot account for overlapping (asynchronous) execution such as might occur during normal operation. For timing whole algorithms, you should use tic and toc, or gputimeit, as described above. Also, the profiler might not yield correct results for user-defined MEX functions if they run asynchronously.

Arabarra on 29 Apr 2020

Thanks for the answer! The wait command on the device gave me the key to discovering where my algorithm was creating a hidden bottleneck.
