Kernels from Scatter-Gather Type Operations
GPU Coder™ also supports the concept of reductions - an important exception to the rule
that loop iterations must be independent. A reduction variable accumulates a value that
depends on all the iterations together, but is independent of
the iteration order. Reduction variables appear on both sides of an assignment statement, such
as in summation, dot product, and sort. The following example shows the typical usage of a
reduction variable x:

x = ...;   % Some initialization of x
for i = 1:n
    x = x + d(i);
end
In each iteration, the variable x gets its value either from before the loop or from the
previous iteration of the loop. This serial implementation is not suitable for parallel
execution because of the chain of dependencies between iterations. An alternative is to
employ a binary tree-based approach.
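For instance, a minimal MATLAB sketch of the tree-based idea for the sum above might look like the following; the variable names and the power-of-two length are assumptions chosen for illustration.

d = rand(1, 8);                    % example data; length assumed to be a power of two
x = d;
while numel(x) > 1
    % One pass (one level of the tree): combine pairs of partial results.
    % All additions within a pass are independent of one another.
    x = x(1:2:end) + x(2:2:end);
end
s = x;                             % equals sum(d) up to floating-point rounding

Each pass halves the number of partial results, so the full sum takes log2(n) passes.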
In the tree-based approach, you can execute every horizontal level of the tree in parallel
over a certain number of passes. When compared to sequential execution, the binary tree does
require more memory because each pass requires an array of temporary values as output. The
performance benefit that you receive far outweighs the cost of
increased memory usage.

GPU Coder creates reduction kernels by using this tree-based approach, wherein each thread
block reduces a portion of the array. Parallel reduction requires partial result data
exchanges between thread blocks. On older CUDA® devices, this data exchange was achieved by
using shared memory and thread synchronization. Starting with the Kepler GPU architecture,
CUDA provides the shuffle (shfl) instruction and fast device memory atomic operations that
make reductions even faster. Reduction kernels that GPU Coder creates use the shfl_down
instruction to reduce across a warp of 32 threads. Then, the first thread of each warp uses
atomic operations to update the reduced value. For more information on these instructions,
refer to the NVIDIA® documentation.
Vector Sum Example
This example shows how to create CUDA reduction type kernels by using GPU Coder. Suppose that you want to create a vector v
and compute the
sum of its elements. You can implement this example as a MATLAB® function.
function s = VecSum(v)
    s = 0;
    for i = 1:length(v)
        s = s + v(i);
    end
end
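You can call this function directly in MATLAB, for example as follows; the vector length of 512 is an assumption chosen to match the generated code shown later.

v = rand(1, 512);
s = VecSum(v);      % equivalent to sum(v) up to floating-point rounding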
Prepare VecSum for Kernel Creation
GPU Coder requires no special pragma to infer reduction
kernels. In this example, use the coder.gpu.kernelfun
pragma to generate
CUDA reduction kernels. Use the modified VecSum
function.
Note
Using the coder.gpu.kernel
pragma for loops containing reductions
is not recommended.
function s = VecSum(v) %#codegen
    s = 0;
    coder.gpu.kernelfun();

    for i = 1:length(v)
        s = s + v(i);
    end
end
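For example, one way to generate a CUDA MEX function from this design file at the MATLAB command line is sketched below; the 512-element double input used in the -args specification is an assumption for illustration, consistent with the snippets that follow.

cfg = coder.gpuConfig('mex');                      % GPU code generation configuration object
codegen -config cfg -args {zeros(1,512)} VecSum    % generate a CUDA MEX file for VecSum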
Generated CUDA Code
When you generate CUDA code by using the GPU Coder app or from the command line, GPU Coder creates a single kernel that performs the vector sum calculation. The
following is a snippet of VecSum_kernel1.

static __global__ __launch_bounds__(512, 1) void VecSum_kernel1(const real_T *v,
  real_T *s)
{
  uint32_T threadId;
  uint32_T threadStride;
  uint32_T thdBlkId;
  uint32_T idx;
  real_T tmpRed;
  thdBlkId = (threadIdx.z * blockDim.x * blockDim.y + threadIdx.y * blockDim.x)
    + threadIdx.x;
  threadId = ((gridDim.x * gridDim.y * blockIdx.z + gridDim.x * blockIdx.y) +
              blockIdx.x) * (blockDim.x * blockDim.y * blockDim.z) + thdBlkId;
  threadStride = gridDim.x * blockDim.x * (gridDim.y * blockDim.y) *
    (gridDim.z * blockDim.z);
  if (!((int32_T)threadId >= 512)) {
    tmpRed = 0.0;
    for (idx = threadId; threadStride < 0U ? idx >= 511U : idx <= 511U; idx +=
         threadStride) {
      tmpRed += v[idx];
    }

    tmpRed = workGroupReduction1(tmpRed, 0.0);
    if (thdBlkId == 0U) {
      atomicOp1(s, tmpRed);
    }
  }
}
Before calling VecSum_kernel1
, two cudaMemcpy
calls transfer the vector v
and the scalar s
from the
host to the device. The kernel launches with one thread block of 512 threads, matching the
512-element input vector. A third cudaMemcpy
call
copies the result of the computation back to the host. The following is a snippet of the
main function.
cudaMemcpy((void *)gpu_v, (void *)v, 4096ULL, cudaMemcpyHostToDevice);
cudaMemcpy((void *)gpu_s, (void *)&s, 8ULL, cudaMemcpyHostToDevice);
VecSum_kernel1<<<dim3(1U, 1U, 1U), dim3(512U, 1U, 1U)>>>(gpu_v, gpu_s);
cudaMemcpy(&s, gpu_s, 8U, cudaMemcpyDeviceToHost);
Note
For better performance, GPU Coder gives priority to parallel kernels over reductions. If your algorithm contains a reduction inside a parallel loop, GPU Coder infers the reduction as a regular loop and generates kernels for it.
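As a hypothetical illustration of this behavior, consider a function that sums each row of a matrix. In a pattern like the following, GPU Coder can map the outer loop to a parallel kernel and generate the inner accumulation as a regular sequential loop inside that kernel.

function s = rowSums(A) %#codegen
    s = zeros(size(A,1), 1);
    coder.gpu.kernelfun();
    for i = 1:size(A,1)           % parallelizable loop: rows are independent
        for j = 1:size(A,2)       % reduction inside the parallel loop
            s(i) = s(i) + A(i,j);
        end
    end
end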
1-D Reduction Operations on the GPU
You can use the gpucoder.reduce
function to generate CUDA code that performs efficient 1-D reduction operations on the GPU. The
generated code uses the CUDA shuffle intrinsics to implement the reduction operation.
For example, to compute the sum and the maximum of the elements of an array A:
function s = myReduce(A)
    s = gpucoder.reduce(A, {@mysum, @mymax});
end

function c = mysum(a, b)
    c = a + b;
end

function c = mymax(a, b)
    c = max(a, b);
end
The gpucoder.reduce function has these requirements:
The input must be of numeric or logical data type.
The function passed through the @handle must be a binary function that accepts two inputs and returns one output. The inputs and outputs must be of the same data type.
The function must be commutative and associative (see the example after this list).
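For example, mysum above qualifies because addition is commutative and associative. A hypothetical subtraction-based function would not qualify, because the result would then depend on the order in which the GPU combines the elements.

function c = mydiff(a, b)   % hypothetical example; NOT suitable for gpucoder.reduce
    c = a - b;              % subtraction is neither commutative nor associative
end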
Note
For some inputs that are of the integer data type, the code generated for the
gpucoder.reduce
function may contain intermediate computations that
reach saturation. In such cases, the results from the generated code may not match the
simulation results from MATLAB.
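To see why saturation can change a result, note that MATLAB integer arithmetic clips at the limits of the data type, so a partial sum can saturate even when the final mathematical result is representable. The values below are chosen only to illustrate the effect.

x = int8([100 50 -50]);       % the true sum is 100, which fits in int8
s1 = (x(1) + x(2)) + x(3)     % 100 + 50 saturates to 127, so s1 is 77
s2 = x(1) + (x(2) + x(3))     % 50 + (-50) is 0, so s2 is 100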
See Also
coder.gpu.kernel
| coder.gpu.kernelfun
| gpucoder.matrixMatrixKernel
| coder.gpu.constantMemory
| gpucoder.stencilKernel
| gpucoder.reduce