Failed to generate large CUDA kernel in GPU coder with FFT function inside

Question


I am trying to parallelize my code on the GPU.

I converted the code using the attached "main.m" script, but the generated MEX code on the GPU is much slower than the M-code on the CPU. I understand that a GPU is not well suited to such a small data size, but with a bigger data size the GPU version takes far longer still.

I then checked the profiling timeline and found that many CUDA kernels are created and that overall GPU utilization is low. After some debugging, I found that when the fft command is used, GPU Coder fails to generate one large CUDA kernel.

I think performance would improve significantly if the fft could be incorporated into a single CUDA kernel, as happens in the version without fft. The FFT is required. I searched on Google but found nothing relevant. Can you provide any information about this, or a solution? The output of gpuDevice is also provided in the attachment.

Here is the profiling timeline without fft.

Here is the profiling timeline with fft.


Answer 1

Justin Hontz, September 18, 2024


Hi He,

In your M-code for RandCopy, the for loop cannot be executed as a GPU kernel (even with the coder.gpu.kernel pragma) because of the fft / ifft calls inside the loop. This is because fft is implemented using its own specialized GPU kernel, and GPU Coder does not support nested kernel execution. Consequently, the for loop runs sequentially, with each iteration launching its own small kernels, which explains why you see thousands of small kernel instances in the performance analyzer timeline graph.

To improve the performance of your code, you will want to perform your computation using only a single fft / ifft call that operates on the entire input array instead of individual slices. Something like this should work:

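% Transform every row at once (FFT along dimension 2)
Tmp = fft(Data,[],2);

% Element-wise operations applied to the whole array
Tmp = Tmp + (1 + 1i);

Tmp = Tmp * (1564 + 798i);

% Transform back along the same dimension
Data = ifft(Tmp,[],2);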

After making the change on my end, the performance analyzer report shows a significant performance improvement, with the timeline graph looking similar to the original one without fft.
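As a quick sanity check, the batched form can be compared against a per-slice loop on the CPU before regenerating code. The sketch below assumes that each row of Data is one independent signal (suggested by the use of dimension 2 above, but the original RandCopy loop is not shown here); the array size and variable names loopOut, batchOut, and maxErr are made up for illustration.

```matlab
% Hypothetical test data: 8 independent signals (rows) of 16 samples each.
rng(0);
Data = complex(rand(8,16), rand(8,16));

% Per-slice version: one fft/ifft pair per row.
% (Assumed shape of the original loop; this is the pattern that serializes.)
loopOut = complex(zeros(size(Data)));
for k = 1:size(Data,1)
    tmp = fft(Data(k,:));
    tmp = (tmp + (1 + 1i)) * (1564 + 798i);
    loopOut(k,:) = ifft(tmp);
end

% Batched version: a single fft/ifft pair over all rows along dimension 2.
Tmp = fft(Data,[],2);
Tmp = (Tmp + (1 + 1i)) * (1564 + 798i);
batchOut = ifft(Tmp,[],2);

% Largest absolute difference between the two results.
maxErr = max(abs(loopOut(:) - batchOut(:)));
```

If maxErr is at the level of floating-point round-off relative to the magnitude of the output, the two forms compute the same thing and only the batched one needs to go through code generation.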

Comments: 4

Justin Hontz, September 19, 2024

Based on your description of the computation, the best approach would still likely be to rewrite your code similarly to the way I described above. That is, instead of performing a sequence of operations on each slice of the input array, perform a sequence of larger operations on the entire input array. With this approach, you would still be able to parallelize the computation over the slices. You may end up with multiple kernels being generated instead of just one, but this is unlikely to be a significant performance bottleneck as long as the entire computation can be performed on the GPU.

Without seeing your full code, I cannot give any specific advice. For certain individual operations, though, you may want to implement them using a for loop with the coder.gpu.kernel pragma if they cannot be implemented efficiently using vectorized MATLAB toolbox functions.
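A minimal sketch of that pattern follows, using the element-wise step from the answer above as the loop body. The function name addScaleKernel is hypothetical, and the loop contains no fft call, so it is the kind of loop GPU Coder can be asked to map to a kernel. Whether the pragma helps compared with plain vectorized code depends on the actual operation.

```matlab
function out = addScaleKernel(in) %#codegen
% addScaleKernel  Hypothetical example of an element-wise loop marked
% for kernel creation. It contains no fft/ifft calls.

out = complex(zeros(size(in)));

% The pragma must appear immediately before the for loop it applies to.
coder.gpu.kernel();
for k = 1:numel(in)
    % Each iteration is independent of the others.
    out(k) = (in(k) + (1 + 1i)) * (1564 + 798i);
end
end
```

The fft and ifft calls would stay outside such a function, as single calls on the whole array.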

Regarding nested kernel execution, combining the code manually is unlikely to work. This is because GPU Coder by default implements fft using the cuFFT API, which is likely not callable from device code. If you still wish to keep your code in its current form, you can also try disabling the use of cuFFT in the coder config (see the EnableCUFFT property of coder.GpuCodeConfig) and see if that improves the situation.
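For reference, disabling cuFFT might look like the sketch below. The input size is an assumption made up for illustration, and the entry-point name RandCopy is taken from the thread; the real argument list of that function is not shown here, so the -args value is a placeholder to adapt.

```matlab
% Create a GPU Coder configuration object for MEX output.
cfg = coder.gpuConfig('mex');

% Ask GPU Coder not to use the cuFFT library for fft/ifft calls.
cfg.GpuConfig.EnableCUFFT = false;

% Hypothetical example input; replace with the real argument types.
Data = complex(zeros(1024, 4096));

% Generate code for the entry-point function.
codegen -config cfg RandCopy -args {Data}
```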

He, September 19, 2024

I fully understand the benefit of computing on the entire array, which is how I have worked for years. However, this computation is inherently not suited to it. I tried disabling cuFFT in the coder config, which results in thousands of memory copies between the host and the device. Maybe it requires other optimization.

I found something here: https://developer.nvidia.com/cufftdx-downloads

It says: NVIDIA cuFFT introduces cuFFTDx APIs, device side API extensions for performing FFT calculations inside your CUDA kernel. Fusing numerical operations can decrease the latency and improve the performance of your application.

So it seems that an FFT can be called from device code through cuFFTDx. Could you show me how to use cuFFTDx in RandCopy.m? Perhaps that is asking too much.

Justin Hontz, September 19, 2024

GPU Coder currently does not support generating direct calls to the cuFFTDx API. With that said, however, you may still be able to indirectly call into the API in the generated code if you are willing to write your own CUDA wrapper function that directly uses the API. This can possibly be achieved by invoking the wrapper function inside the for loop of your M-code via coder.ceval. The call would look something like this:

coder.ceval('-gpudevicefcn', 'myFFTWrapper', coder.ref(data), ...);

The -gpudevicefcn flag indicates that the wrapper function is meant to be executed by a GPU thread rather than by the CPU.

Note that I have not tried using this approach on my end, so I cannot guarantee that such an approach would work correctly without issue.
