IFFT slowdown when using gpuArray

Two data sets, A (a 4096 x 1024 matrix) and B (a 32768 x 1024 matrix), have been transferred to the GPU using gpuArray. A is passed to the FFT function and shows a significant speed increase compared with the same data on the CPU. B is passed to the IFFT function and shows approximately a 50% decrease in efficiency compared with the same data on the CPU. Is there a reason why the IFFT function does not show a speed increase proportional to that of the FFT function? I understand that the sizes differ, but I do not understand why the GPU IFFT is slower than the CPU IFFT. Both tic/toc and the Run and Time profiler were used to time the results. Thank you for your help.

Comments: 4

Jill Reese on 3 May 2013
What version of MATLAB are you running?
Michael on 3 May 2013
MATLAB R2012b
James Lebak on 3 May 2013
Edited: James Lebak on 3 May 2013
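When I time this on MATLAB R2013a with a 3.5 GHz Xeon and a Tesla C2075 GPU, I see 0.36 s for the IFFT of a 32768x1024 matrix on the CPU and 0.051 s on the GPU. Here is the code I used:
x = gpuArray.ones(32768,1024);        % test data created directly on the GPU
gd = gpuDevice;                       % handle to the selected GPU
tic; y = ifft(x); wait(gd); toc       % GPU IFFT; wait for the GPU to finish before stopping the timer
xc = gather(x);                       % copy the data to host memory
tic; y = ifft(xc); toc                % CPU IFFT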
And the output:
Elapsed time is 0.050705 seconds.
Elapsed time is 0.364836 seconds.
I would be interested to know what this code shows you, and also whether having the other array that you mentioned in memory changes the performance. I didn't see a change, but I don't have access to this specific card that you have.
Michael
Thank you for the test case. When I run the same program, the output is:
Elapsed time is 0.466822 seconds.
Elapsed time is 0.863542 seconds.
I believe the Tesla C2075 is a faster processor than the GeForce GT 630M. However, your efficiency is terrific, with a speedup of approximately 600%, while mine was 85%. Why would there be such a difference? Thank you.
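The percentages quoted above follow directly from the two pairs of reported timings. A minimal sketch of that arithmetic, using only the numbers posted in these comments (the variable names are illustrative and not from the thread):

```matlab
% Timings reported in the two comments above, in seconds.
tGpuTesla = 0.050705;   % IFFT on the Tesla C2075
tCpuXeon  = 0.364836;   % IFFT on the 3.5 GHz Xeon
tGpu630M  = 0.466822;   % IFFT on the GeForce GT 630M
tCpuHost  = 0.863542;   % IFFT on the CPU of the same machine as the GT 630M

% Speedup of each GPU relative to the CPU it was compared against.
speedupTesla = tCpuXeon / tGpuTesla;
speedup630M  = tCpuHost / tGpu630M;

% Ratio of the two GPU timings to each other.
gpuRatio = tGpu630M / tGpuTesla;

fprintf('Tesla C2075 vs. its CPU: %.2fx (%.0f%% faster)\n', ...
        speedupTesla, 100 * (speedupTesla - 1));
fprintf('GT 630M vs. its CPU    : %.2fx (%.0f%% faster)\n', ...
        speedup630M, 100 * (speedup630M - 1));
fprintf('GT 630M time / Tesla time: %.1f\n', gpuRatio);
```

The ratios come out at roughly 7.2x and 1.85x, which matches the "approximately 600%" and "85%" figures. Note that both the CPU and the GPU differ between the two machines, so the numbers compare whole systems rather than the two cards alone.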


Accepted Answer

Matt J on 3 May 2013

Votes: 0

What graphics card do you have? How much RAM does it have? It could be that the larger array is just having a harder time because of memory constraints.

Comments: 10

Matt J on 3 May 2013
Whatever the issue is, it's hardware specific. I see a speed-up of more than 3x on the GTX 580.
Michael on 3 May 2013
Graphics card: GeForce GT 630M. RAM: 6.00 GB (5.89 GB usable).
I thought that B, being 32768 x 1024 (2^15 x 2^10 values) with each value a float, would not run into memory constraints, but I could be wrong. Would this lead to a problem?
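As a rough check on the memory question, the footprint of B can be estimated from its dimensions. This is a sketch under stated assumptions: 4 bytes per single, 8 bytes per double, and twice that for complex values; the variable names are illustrative.

```matlab
% Estimate the memory footprint of a 32768 x 1024 array (sizes from the question).
rows = 32768;                 % 2^15
cols = 1024;                  % 2^10
numValues = rows * cols;      % 2^25 elements

bytesSingleReal    = numValues * 4;    % assumption: 4 bytes per single
bytesDoubleReal    = numValues * 8;    % assumption: 8 bytes per double
bytesDoubleComplex = numValues * 16;   % real and imaginary parts, 8 bytes each

fprintf('single, real   : %d MiB\n', bytesSingleReal    / 2^20);
fprintf('double, real   : %d MiB\n', bytesDoubleReal    / 2^20);
fprintf('double, complex: %d MiB\n', bytesDoubleComplex / 2^20);
```

A single array of this size is therefore a few hundred MiB at most. An FFT generally needs an output array and some workspace in addition to its input, so the working set on the GPU is likely to be a small multiple of these figures.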
Matt J on 3 May 2013
It doesn't look like a memory issue. I have half as much RAM on the GTX 580 and, as I said, I see a significant speed-up. Maybe if you show the code you used to run the test, it will provide a clue.
Michael on 3 May 2013
Edited: Matt J on 3 May 2013
function [ pdZeropadBuffer ] = Zeropadding( pdBuffer, pdInit, pdHeader)
NumberofAlines = pdHeader(1);
nAlineLength = pdHeader(2);
nPaddingFactor = pdInit(4);
pdBuffer = gpuArray(pdBuffer);
pdPaddedFFT = gpuArray.ones(nPaddingFactor*nAlineLength, NumberofAlines, 'double');
nMidLength = nAlineLength / 2 + 1;
pdFFT = fft(pdBuffer);
pdFFT(nMidLength, :) = pdFFT(nMidLength, :) / 2.0;
pdPaddedFFT(1:nMidLength, :) = pdFFT(1:nMidLength, :);
pdPaddedFFT(end-nMidLength+2:end, :) = pdFFT(end-nMidLength+2:end, :);
pdPaddedFFT = ifft(pdPaddedFFT);
pdZeropadBuffer = real(pdPaddedFFT) * nPaddingFactor;
pdZeropadBuffer = gather(pdZeropadBuffer);
clear nMidLength pdFFT pdPaddedFFT;
end
Michael on 3 May 2013
I am sorry about the spacing of the program. I used the code formatting tool, but I feel that I used it poorly.
Matt J on 3 May 2013
Edited: Matt J on 3 May 2013
That is not code we can run. We do not have values for the input arguments. Also, where did you apply tic..toc? To this entire code or just the fft parts?
James Lebak on 3 May 2013
Edited: James Lebak on 3 May 2013
Assuming you are timing this whole function, a couple of things jump out. pdPaddedFFT is originally created as real, but it will be converted to complex when you do the assignment from pdFFT to pdPaddedFFT (line 10). This could be costly, especially if pdPaddedFFT is the 32768x1024 array you mentioned. Create pdPaddedFFT as complex from the start by doing something like
X = gpuArray.zeros(M,N);
pdPaddedFFT = complex(X,X);
Also if you are timing the creation of the array, as well as the gather, then IFFT is probably not the bottleneck. On my machine the gather of a 32768x1024 matrix takes more than twice as long as the IFFT. If you can leave the data on the GPU for the next step you will get better performance.
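Both points above can be checked with a short timing script. This is only a sketch: the sizes follow the thread, but the variable names and the all-ones test data are assumptions, and it needs a supported GPU with roughly 1.5 GB of free memory. It uses the gpuArray.zeros and gpuArray.ones syntax already used in this thread; newer releases also accept zeros(M, N, 'gpuArray').

```matlab
% Sketch: cost of assigning complex data into a real vs. a complex gpuArray,
% and of the gather compared with the IFFT. Names here are illustrative.
M = 32768;  N = 1024;         % padded array size from the thread
K = 4096/2 + 1;               % number of rows copied in the posted function
gd = gpuDevice;

% Complex source block, standing in for part of an FFT result.
R   = gpuArray.ones(K, N);
src = complex(R, R);

% Case 1: destination preallocated as real. Assigning complex values
% converts the whole destination array to complex.
dstReal = gpuArray.ones(M, N);
wait(gd);
tic; dstReal(1:K, :) = src; wait(gd); tRealDst = toc;
clear dstReal

% Case 2: destination preallocated as complex from the start.
Z = gpuArray.zeros(M, N);
dstCplx = complex(Z, Z);
clear Z
wait(gd);
tic; dstCplx(1:K, :) = src; wait(gd); tCplxDst = toc;

% Time the IFFT and the transfer back to host memory separately.
tic; Y = ifft(dstCplx); wait(gd); tIfft = toc;
tic; y = gather(Y); tGather = toc;

fprintf('assign into real dest    : %.4f s\n', tRealDst);
fprintf('assign into complex dest : %.4f s\n', tCplxDst);
fprintf('ifft                     : %.4f s\n', tIfft);
fprintf('gather                   : %.4f s\n', tGather);
```

The first GPU call of each kind can include one-time setup cost, so it is worth running the script twice and reading the second set of numbers.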
Michael on 3 May 2013
I placed one tic/toc pair around the fft call "pdFFT = fft(pdBuffer);" and another around the ifft call "pdPaddedFFT = ifft(pdPaddedFFT);". Also, Matt J, my apologies about the program. The code I attached is part of a project. Is there a way to attach .m files in this forum? James Lebak, thank you for the suggestion about the conversion from real to complex; I will change that. I timed the FFT and IFFT calls individually so that I could compare only those two calls. Would this conversion affect the FFT and IFFT calls?
Matt J
"Also, Matt J, my apologies about the program. The code I attached is part of a project. Is there a way to attach .m files in this forum?"
Just give values for
NumberofAlines = pdHeader(1);
nAlineLength = pdHeader(2);
nPaddingFactor = pdInit(4);
I assume that pdBuffer is the 32768x1024 array?
Michael
Of course. Thank you:
NumberofAlines = 1024
nAlineLength = 4096
nPaddingFactor = 8
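With these values, the inputs to the posted Zeropadding function can be reconstructed for testing. The sketch below assumes random double data for pdBuffer (the original data is not shown in the thread) and fills the unused elements of pdInit with zeros; it only calls Zeropadding if that function is found on the path.

```matlab
% Reproduction sketch using the values given above.
NumberofAlines = 1024;
nAlineLength   = 4096;
nPaddingFactor = 8;

pdHeader = [NumberofAlines, nAlineLength];
pdInit   = [0, 0, 0, nPaddingFactor];            % only element 4 is read by the function
pdBuffer = rand(nAlineLength, NumberofAlines);   % placeholder data (assumption)

% Sizes implied by the posted function.
paddedRows = nPaddingFactor * nAlineLength;      % rows of pdPaddedFFT
nMidLength = nAlineLength / 2 + 1;               % rows copied from each end of pdFFT

fprintf('fft input  : %d x %d\n', size(pdBuffer, 1), size(pdBuffer, 2));
fprintf('ifft input : %d x %d\n', paddedRows, NumberofAlines);

% Call the function from the thread only if it is available on the path.
if exist('Zeropadding', 'file') == 2
    tic;
    pdZeropadBuffer = Zeropadding(pdBuffer, pdInit, pdHeader);
    toc
end
```

This shows that pdBuffer is the 4096 x 1024 array passed to fft, and that the 32768 x 1024 array is pdPaddedFFT, the input to ifft.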


More Answers (1)

James Lebak on 3 May 2013
Edited: James Lebak on 4 May 2013

Votes: 1

The GeForce GT 630M is a mobile graphics card. Frequently, these cards don't perform as well in double precision as they do in single precision. If your application can handle single precision, you can try the IFFT in single and see if that gives you better performance. If you need double-precision performance, you might want to try a different card.
This especially applies if the card in question has compute capability 3.0. You can find the compute capability of the card in MATLAB from the structure returned by 'gpuDevice'.
Edit: removed incorrect identification of the 630M.
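The single-precision suggestion can be tried with a sketch like the following. It assumes all-ones test data of the size discussed in the thread and a supported GPU with enough free memory; the variable names are illustrative.

```matlab
% Sketch: compare double- and single-precision IFFT on the GPU.
gd = gpuDevice;
fprintf('%s, compute capability %s\n', gd.Name, gd.ComputeCapability);

M = 32768;  N = 1024;                  % array size from the thread
xd = gpuArray.ones(M, N, 'double');
xs = gpuArray.ones(M, N, 'single');

% Warm-up calls so that one-time setup cost is not included in the timings.
w = ifft(xd(:, 1:2));
w = ifft(xs(:, 1:2));
wait(gd);
clear w

tic; yd = ifft(xd); wait(gd); tDouble = toc;
clear yd                               % free memory before the next run
tic; ys = ifft(xs); wait(gd); tSingle = toc;

fprintf('double IFFT: %.4f s\n', tDouble);
fprintf('single IFFT: %.4f s\n', tSingle);
fprintf('double / single time ratio: %.2f\n', tDouble / tSingle);
```

Whether single precision is acceptable depends on the accuracy the application needs, so it is worth comparing a single-precision result against the double-precision one on representative data before switching.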

Comments: 5

Michael
When I run 'gpuDevice', the output is:
                      Name: 'GeForce GT 630M'
                     Index: 1
         ComputeCapability: '2.1'
            SupportsDouble: 1
             DriverVersion: 5
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [65535 65535]
                 SIMDWidth: 32
               TotalMemory: 2.1473e+09
                FreeMemory: 2.0642e+09
       MultiprocessorCount: 2
              ClockRateKHz: 950000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 1
          CanMapHostMemory: 1
           DeviceSupported: 1
            DeviceSelected: 1
The GeForce GT 630M supports double precision because its compute capability is greater than 1.3. Also, I assume 'SupportsDouble' refers to double precision, but I am not sure.
James Lebak on 3 May 2013
You are correct that the card can compute in double precision, but that doesn't always mean that it can compute faster than your CPU. I apologize for getting the compute capability of your card wrong (I misread the chart), but the point is that many GeForce and mobile cards are good at single-precision computation and less good at double-precision.
Michael on 4 May 2013
That is no problem, and thank you. I will rewrite this program for single precision and see if that is the problem. Thank you again for all your help.
Michael on 5 May 2013
James Lebak, you were correct. Single precision performs significantly more efficiently than double precision. Matt J and James Lebak, thank you for all your help.
Matt J on 6 May 2013
Edited: Matt J on 6 May 2013
If James was right, then why didn't you accept his Answer instead of mine???


Asked: 3 May 2013
