gpucoder.reduce

Optimized GPU implementation for reduction operations

Since R2019b

Description

S = gpucoder.reduce(A,FUN) aggregates the values present in the input array A to a single value using the given function handle FUN. The output S is a scalar.

S = gpucoder.reduce(A,{@FUN1,@FUN2,...}) accepts an input array and a cell array of function handles. It aggregates the values in the input array to a single value for each function handle in the cell array. The size of the output is 1-by-N, where N is the number of function handles.

The code generator uses shuffle intrinsics to perform efficient reduction on the GPU. Multiple function handles are aggregated inside a single kernel on the GPU.

S = gpucoder.reduce(___,Name,Value) aggregates the values present in the input array using the options specified by one or more Name,Value pair arguments.
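As a minimal sketch of the single-handle syntax S = gpucoder.reduce(A,FUN), the following entry-point function adds all elements of an array into one scalar. The names sumAll and addTwo are hypothetical and chosen only for illustration; running it requires GPU Coder.

```matlab
function s = sumAll(A)
  % Reduce every element of A to one scalar by using a binary add.
  s = gpucoder.reduce(A, @addTwo);
end

function c = addTwo(a, b)
  % Binary, commutative, and associative, as gpucoder.reduce requires.
  c = a + b;
end
```

For a double input, the result should agree with sum(A(:)) up to floating-point rounding, because the elements may be combined in a different order than in a sequential sum.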

Examples

This example generates CUDA® code to find the sum and the maximum of the elements of an array.

In one file, write an entry-point function multireduce that accepts a matrix input A and dimension dim. Use the gpucoder.reduce function to perform two types of reduction operations on the elements of A.

function [s1,s2] = multireduce(A,dim)
  [s1,s2] = gpucoder.reduce(A, {@mysum, @mymax},"dim",dim); 
end

function c = mysum(a, b)
  c = a+b;
end

function c = mymax(a, b)
  c = max(a,b);
end

Use the codegen function to generate a CUDA MEX function.

inputArgs = {rand(32,32,"double"),coder.Constant(2)};
cfg = coder.gpuConfig('mex');
codegen -config cfg -args inputArgs -report multireduce

The following is a snippet of the generated code.

...
cudaMalloc(&gpu_iv, 8UL);
cudaMalloc(&gpu_s2, 256UL);
cudaMalloc(&gpu_s1, 256UL);
cudaMalloc(&gpu_A, 8192UL);
multireduce_kernel1<<<dim3(1U, 1U, 1U), dim3(32U, 1U, 1U)>>>(*gpu_iv);
cudaMemcpy(cpu_iv, *gpu_iv, 8UL, cudaMemcpyDeviceToHost);
validLaunchParams = mwGetLaunchParameters1D(static_cast<real_T>(cpu_iv[0U]),
                                            &grid, &block, 1024U, 65535U);
if (validLaunchParams) {
  cudaMemcpy(*gpu_A, cpu_A, 8192UL, cudaMemcpyHostToDevice);
  multireduce_kernel2<<<grid, block>>>(*gpu_A, *gpu_iv, *gpu_s1);
} else {
  cudaMemcpy(*gpu_A, cpu_A, 8192UL, cudaMemcpyHostToDevice);
}
coder_reduce0<<<dim3(1U, 1U, 1U), dim3(32U, 1U, 1U)>>>(*gpu_A, *gpu_s1);
multireduce_kernel3<<<dim3(1U, 1U, 1U), dim3(32U, 1U, 1U)>>>(*gpu_iv);
cudaMemcpy(cpu_iv, *gpu_iv, 8UL, cudaMemcpyDeviceToHost);
validLaunchParams = mwGetLaunchParameters1D(static_cast<real_T>(cpu_iv[0U]),
                                            &grid, &block, 1024U, 65535U);
if (validLaunchParams) {
  multireduce_kernel4<<<grid, block>>>(*gpu_A, *gpu_iv, *gpu_s2);
}
coder_reduce1<<<dim3(1U, 1U, 1U), dim3(32U, 1U, 1U)>>>(*gpu_A, *gpu_s2);
cudaMemcpy(cpu_s1, *gpu_s1, 256UL, cudaMemcpyDeviceToHost);
cudaMemcpy(cpu_s2, *gpu_s2, 256UL, cudaMemcpyDeviceToHost);
cudaFree(*gpu_A);
cudaFree(*gpu_s1);
cudaFree(*gpu_s2);
cudaFree(*gpu_iv);
...
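After code generation completes, the MEX function can be checked against built-in MATLAB reductions. The sketch below assumes that codegen produced multireduce_mex (the default MEX name) on the MATLAB path and that a CUDA-capable GPU is available. The tolerance and the variable names are illustrative choices.

```matlab
A = rand(32,32,"double");

% dim was specified as coder.Constant(2) at code generation time,
% so pass the same value when calling the MEX function.
[s1,s2] = multireduce_mex(A,2);

% Reference results from built-in functions along the same dimension.
refSum = sum(A,2);
refMax = max(A,[],2);

% The GPU may add elements in a different order, so compare the sums
% with a small tolerance. The maximum is not affected by ordering.
sumOK = all(abs(s1(:) - refSum(:)) < 1e-10);
maxOK = isequal(s2(:), refMax(:));
```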

Input Arguments

A: The input array on which to perform the reduction operation. For code generation, the input array must be of numeric or logical data type.

FUN: Handle to a user-defined function. FUN can also be a cell array of function handles. Each function must be a binary function that satisfies the following requirements:

  • The function must accept two inputs and return one output. The types of the inputs and the output must match the type of the input array A.

  • The function must be commutative and associative; otherwise, the behavior is undefined.
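To see why this requirement matters, note that a parallel reduction may group and order the operands differently from a sequential loop. The plain MATLAB sketch below (no GPU required) contrasts addition with subtraction under two groupings. The variable names are illustrative.

```matlab
x = [8 4 2 1];

addFcn = @(a,b) a + b;   % commutative and associative
subFcn = @(a,b) a - b;   % neither commutative nor associative

% Left-to-right grouping: ((x1 op x2) op x3) op x4
seqAdd = addFcn(addFcn(addFcn(x(1),x(2)),x(3)),x(4));
seqSub = subFcn(subFcn(subFcn(x(1),x(2)),x(3)),x(4));

% Pairwise (tree) grouping, typical of parallel reductions:
% (x1 op x2) op (x3 op x4)
treeAdd = addFcn(addFcn(x(1),x(2)),addFcn(x(3),x(4)));
treeSub = subFcn(subFcn(x(1),x(2)),subFcn(x(3),x(4)));

disp([seqAdd treeAdd])   % 15 15: both groupings agree for addition
disp([seqSub treeSub])   % 1 3: the groupings disagree for subtraction
```

Because the result for subtraction depends on the grouping, passing such a function to gpucoder.reduce gives no guarantee about which value is returned.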

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: gpucoder.reduce(A, {@mySum, @myMax},'dim',2);

dim: Perform reduction along the specified dimension.

Example: gpucoder.reduce(A, {@mySum, @myMax},'dim',2);

preprocess: Apply a preprocessing function to the elements of the input array before performing the reduction operation.

Example: gpucoder.reduce(A,@mySum,'preprocess',@myScale);
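As an illustration of the preprocess option, the following sketch computes a sum of squares. It assumes that the preprocessing function takes one element and returns one value of the same type. The names sumOfSquares, addTwo, and squareIt are hypothetical, and running it requires GPU Coder.

```matlab
function s = sumOfSquares(A)
  % Square each element first, then add the squared values together.
  s = gpucoder.reduce(A, @addTwo, 'preprocess', @squareIt);
end

function c = addTwo(a, b)
  % Binary reduction function: commutative and associative.
  c = a + b;
end

function y = squareIt(x)
  % Assumed to be a unary function applied to each element of A.
  y = x * x;
end
```

Under that assumption, the result for a double input should agree with sum(A(:).^2) up to floating-point rounding.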

Output Arguments

S: Result of the reduction operation. During reduction, S is initialized to the value of one of the elements of the input array A. Then, the reduction operation is performed by applying FUN to every element in A and S.
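This description corresponds to a fold over the elements of A. The following plain MATLAB function is a sequential reference model only. It assumes the first element as the initial value, whereas the generated GPU code may start from a different element and combine values in a different order. The name reduceRef is hypothetical.

```matlab
function s = reduceRef(A, fun)
  % Sequential model of a full reduction of a nonempty array A.
  s = A(1);                 % initialize from one element of A
  for k = 2:numel(A)
    s = fun(s, A(k));       % fold each remaining element into s
  end
end
```

For a commutative and associative fun, this model and a parallel reduction produce the same value, apart from rounding or saturation effects that depend on the order of operations.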

Limitations

  • gpucoder.reduce does not support input arrays that are of complex data type.

  • The user-defined function must accept two inputs and return one output. The types of the inputs and the output must match the type of the input array A.

  • The user-defined function must be commutative and associative; otherwise, the behavior is undefined.

  • For some inputs that are of the integer data type, the generated code may contain intermediate computations that reach saturation. In such cases, the results from the generated code may not match the simulation results from MATLAB®.
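The saturation effect can be reproduced in plain MATLAB without a GPU. In this sketch, the same three int8 values are added with two different groupings. The values and variable names are chosen only for illustration.

```matlab
x = int8([100 100 -100]);

% Left-to-right: 100 + 100 saturates at intmax('int8') = 127,
% and then 127 + (-100) gives 27.
leftToRight = (x(1) + x(2)) + x(3);

% Different grouping: 100 + (-100) = 0 is computed first,
% so no intermediate value saturates and the result is 100.
regrouped = x(1) + (x(2) + x(3));

disp(leftToRight)   % 27
disp(regrouped)     % 100
```

Because integer addition in MATLAB saturates, it is not associative near the limits of the type, so a reduction that groups the operands differently can return a different result.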

Version History

Introduced in R2019b