GPU memory overhead dependent on fft dimension.

Question

D. Plotnick on 2 Jul 2018


Direct link to this question

https://kr.mathworks.com/matlabcentral/answers/408452-gpu-memory-overhead-dependent-on-fft-dimension

Answered: Joss Knight on 2 Jul 2018

Hello all, I have a question regarding memory management during MATLAB's gpuArray/fft operation. I have a large NxM matrix (approximately N = 10E3, M = 20E3) where I wish to take an FFT along the M dimension. For CPU operations I would normally permute the matrix so that the FFT acts along the 1st (column) dimension, for speed.

On the GPU, if I run the FFT along the 1st dimension, I slam into the memory ceiling of my GPU. However, if I apply it along the row dimension I do not. I assume that this has to do with whether MATLAB is doing N asynchronous FFTs in the row direction vs. a single massive matrix operation in the column dimension.

So, 4 questions:

1. Is my assumption true?
2. Are GPU operations still faster in the column direction? (I sort of answered this myself: I got a 3x speed advantage with the snippet below.)
3. Is there a way to know what the GPU memory need will be for the FFT? If so, I can try chunking up the FFT based on the GPU memory available.
4. Is there another implementation that will have the speed of the column operation without the memory issues? I am going to try doing this as an arrayfun just to see.

Code snippet:

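x = rand(10000,10000,'gpuArray');   % same data as gpuArray.rand(10000,10000)
xp = x.';                           % transposed copy, so dim 2 of xp matches dim 1 of x
gputimeit(@() fft(x,[],1))          % FFT down the columns (contiguous in memory)
gputimeit(@() fft(xp,[],2))         % FFT along the rows (strided in memory)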

Thanks all.

Comments: 1

D. Plotnick on 2 Jul 2018


As I suspected, arrayfun (at least my way of using it) is way slower.

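g = gpuDevice;                     % handle to the current GPU, needed by wait()
f = @(i) fft(x(:,i),[],1);         % FFT of a single column of x
tic
y = arrayfun(f,1:size(x,2),'UniformOutput',false);   % one fft call per column
wait(g);                           % block until the GPU has finished its queued work
y = cat(2,y{:});                   % reassemble the columns into one matrix
toc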


Answer 1

Joss Knight on 2 Jul 2018


Direct link to this answer

https://kr.mathworks.com/matlabcentral/answers/408452-gpu-memory-overhead-dependent-on-fft-dimension#answer_327211


MATLAB uses cuFFT, so the behaviour is whatever cuFFT's behaviour is. The implication of the batching API, as described in the documentation (https://docs.nvidia.com/cuda/cufft/index.html), is that batches that are not contiguous result in multiple kernel launches. This will be slower, but more efficient with memory.

Because the amount of memory an FFT needs is so variable and dependent on signal length, it isn't that valuable to know what the size will be for any particular example. If you're curious you can watch the FreeMemory property output from gpuDevice:

gpu = gpuDevice
gpu.FreeMemory

After an FFT the FFT plan is retained so you should see how much memory it took up (as long as it's the first FFT you do in the MATLAB session). For working memory you can assume there will be a copy of the input, possibly two because MATLAB itself will often take a copy of the input in order to ensure your data is not corrupted in the event of an error.
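One way to act on the working-memory estimate above is to run the column-wise FFT in blocks of columns sized from the memory that is currently available. The following is only a rough sketch under stated assumptions: the variable names, the factor of 4 (output, up to two input copies, plan workspace) and the "use at most half of what is available" budget are illustrative guesses, not values from cuFFT or MATLAB documentation.

```matlab
% Sketch: FFT down the columns in blocks, sized from the available GPU memory.
% ASSUMPTIONS (not from this thread): safetyFactor, the 50% budget, all names.
gpu = gpuDevice;
x = rand(4096, 2048, 'single', 'gpuArray');    % example data; sizes are illustrative
y = complex(zeros(size(x), 'like', x));         % preallocated complex output

bytesPerColumn = size(x, 1) * 8;                % complex single: 8 bytes per element
safetyFactor   = 4;                             % assumed: output + input copies + plan
budget         = gpu.AvailableMemory / 2;       % assumed: use at most half of what is available
blockCols      = floor(budget / (safetyFactor * bytesPerColumn));
blockCols      = max(1, min(size(x, 2), blockCols));   % clamp to [1, number of columns]

for j = 1:blockCols:size(x, 2)
    cols = j:min(j + blockCols - 1, size(x, 2));        % columns in this block
    y(:, cols) = fft(x(:, cols), [], 1);                % contiguous column-wise FFT
end
wait(gpu);                                      % make sure all GPU work has finished
```

Each block is still a contiguous batch, so it should keep most of the column-direction speed while capping the working memory. The right safety factor has to be found by experiment on the actual hardware.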

If you can get your signals to be a power of 2 in length (say, 8192) you'll find them much more efficient with memory.
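If the signal length cannot be changed at the source, one option is to let fft zero-pad each signal up to the next power of two. This is a minimal sketch with made-up sizes; note that zero-padding changes the frequency bin spacing and enlarges the output, so whether it actually reduces total memory use for a given problem is something to measure rather than assume.

```matlab
% Sketch: zero-pad the transform length to the next power of two.
% The array sizes below are illustrative only.
x    = rand(10000, 64, 'single', 'gpuArray');   % 10000-sample signals stored as columns
nfft = 2^nextpow2(size(x, 1));                  % next power of two at or above 10000
y    = fft(x, nfft, 1);                         % fft zero-pads each column to nfft samples
```

For a length of 10000, nextpow2 returns 14, so nfft is 16384 and y has 16384 rows.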