GPU Coder

Key Features

  • CUDA C and C++ code generation
  • Deep learning network support (with Deep Learning Toolbox™)
  • Image processing support (with Image Processing Toolbox™)
  • Loop optimizations and CUDA kernel optimizations
  • MEX function generation for code verification and acceleration
  • Legacy CUDA code integration
  • Code profiling and verification

GPU Coder app (left) and code generation report (right) showing generated CUDA code.

Generate CUDA from MATLAB

Translating MATLAB® algorithms to CUDA® code involves specifying implementation requirements. The GPU Coder app and equivalent command-line functions guide you through this iterative process while enabling you to continue working with the familiar MATLAB language.

GPU Coder™ helps you prepare your algorithm for code generation by analyzing your MATLAB code to propose the data types and sizes for your inputs. You can ensure that your algorithm is ready for code generation by generating a MEX function that wraps the compiled code for execution back within MATLAB. GPU Coder produces a report that identifies any errors you need to fix to make your MATLAB algorithm ready for code generation. You iterate between fixing errors and regenerating a MEX function until your MATLAB algorithm is suitable for code generation.

You can then generate CUDA from your algorithm as source code, as a static or dynamic library, or as a MEX function tuned for performance to accelerate computationally intensive portions of your MATLAB code. The generated code can be used in applications such as deep learning, embedded vision, and autonomous systems.

Use GPU Coder to generate CUDA code from a fog rectification algorithm written in MATLAB.

Three-step process for generating code.

Generate CUDA from Deep Learning Networks

Create, Train, and Deploy Trained Deep Learning Networks with Deep Learning Toolbox

Use GPU Coder with Deep Learning Toolbox to deploy trained deep learning networks on NVIDIA GPUs such as the Tesla® and Tegra®. Use the transfer learning approach to retrain existing networks, such as AlexNet or VGG-16/19, to perform new tasks. For example, categorize only the ten most critical objects in your data set instead of 1000 different objects. Alternatively, you can train a deep network from scratch for a new application by gathering a very large labeled data set and designing a network architecture that learns the features and the model.

GPU Coder generates code for preprocessing and postprocessing along with the trained deep learning network, so you get the complete algorithm. For example, you might need to clean up foggy input images using classical machine learning techniques before using a trained deep learning network like AlexNet or VGG-16 to detect and classify objects. GPU Coder generates code for both the machine learning algorithm and the trained deep learning network, so you can develop your complete application more easily.

Walk through a real-time object detection example using YOLO v2 in MATLAB. Generate optimized CUDA code and verify it using a MEX file that runs at about 80 fps on a test file. Deploy the generated...

See an example of a DAG network for semantic segmentation. Using the cnncodegen function in GPU Coder, generate CUDA code and build it into a MEX function that runs 6x faster than in MATLAB.

Deep learning approach to categorizing vehicles.

Accelerate Training of Deep Learning Models with Parallel Computing Toolbox

Training a deep learning model can take a long time, from days to weeks. With Parallel Computing Toolbox™, you can leverage GPUs on your machine, on a cluster, or in the cloud to significantly speed up the training process. Using GPUs can cut the training time for an image classification problem from days to hours.

Optimize the Generated CUDA

Kernel Creation, Memory Transfer Minimization, and GPU Memory Allocation

GPU Coder creates CUDA kernels that minimize memory transfers between the CPU and GPU and optimize GPU memory usage. It automatically analyzes your MATLAB code and partitions it into segments that run on either the CPU or the GPU. You can also use pragmas to manually specify that all or part of your MATLAB algorithm runs on the GPU. MATLAB code identified to run on the GPU is converted into CUDA kernels. Kernels are created from constructs such as for-loops, element-wise matrix and vector math, scatter-gather and reduction operations (for example, mean and sum), and higher-level algorithms such as FFTs and image processing functions.
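
To make two of the constructs named above concrete, the following C++ sketch shows an element-wise loop and a reduction as plain CPU reference code. It is an illustration only and not the code GPU Coder emits; the function names scale_add and tree_sum are hypothetical and chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise pattern: iteration i reads and writes only index i, so the
// iterations are independent and each one could run as its own GPU thread.
std::vector<float> scale_add(const std::vector<float>& x,
                             const std::vector<float>& y, float a) {
    assert(x.size() == y.size());
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = a * x[i] + y[i];
    }
    return out;
}

// Reduction pattern: a sum computed in roughly log2(n) pairwise passes.
// Within one pass the additions touch disjoint elements, which is the
// property that lets a GPU perform them in parallel.
float tree_sum(std::vector<float> v) {
    if (v.empty()) {
        return 0.0f;
    }
    for (std::size_t stride = 1; stride < v.size(); stride *= 2) {
        for (std::size_t i = 0; i + stride < v.size(); i += 2 * stride) {
            v[i] += v[i + stride];
        }
    }
    return v[0];
}
```

The element-wise loop needs no coordination between iterations, while the reduction needs one synchronization point per pass. That difference is one reason the two constructs map to differently shaped kernels.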

GPU Coder also analyzes the data dependencies between the CPU and GPU partitions. Data shared between the CPU and GPU is allocated in GPU memory using cudaMalloc or cudaMallocManaged. The analysis determines the minimum set of locations where data must be copied between the CPU and GPU using cudaMemcpy. If you are using unified memory in CUDA, GPU Coder also determines the minimum number of cudaDeviceSynchronize calls needed for correct functional behavior.

Various GPU memory spaces are supported, from local to global memory. Within each kernel, GPU Coder maps data to the memory space that provides the greatest memory bandwidth.

Use design patterns like gpucoder.stencilKernel to optimize the generated CUDA.

Accelerated Library Support

The generated code calls optimized NVIDIA CUDA libraries, including TensorRT, cuDNN, cuSOLVER, cuFFT, cuBLAS, and Thrust.

  • NVIDIA TensorRT™ is a high-performance deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications.
  • The NVIDIA cuDNN library is a GPU-accelerated library of primitives for deep neural networks and provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.
  • The NVIDIA cuSOLVER library provides a collection of dense and sparse direct solvers to accelerate computer vision and linear optimization applications.
  • The NVIDIA cuFFT library is designed for high-performance computation of Fast Fourier Transforms.
  • The NVIDIA cuBLAS library is a fast GPU-accelerated implementation of the standard basic linear algebra subroutines (BLAS).
  • Thrust is a C++ template library for CUDA that provides a rich collection of data-parallel primitives such as scan, sort, and reduce to implement high-performance parallel applications with minimal programming effort.

Generate CUDA code from a trained deep neural network in MATLAB and leverage the NVIDIA TensorRT library for inference on NVIDIA GPUs using a pedestrian detection application as an example.

Design Patterns

You can achieve additional acceleration by using MATLAB design patterns such as stencil processing and matrix-matrix processing. Stencil processing can be used for operations such as convolution, median filtering, and finite element methods. These operations exhibit data locality and reuse opportunities, so the generated code uses shared memory to improve memory bandwidth. Matrix-matrix processing can be used for operations such as sum of absolute differences (SAD) and sum of squared differences (SSD). In this case, the generated code reuses data and improves memory bandwidth.
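
The two patterns can be sketched as CPU reference code in C++. This is an illustration under stated assumptions, not GPU Coder output: the Matrix type and the functions mean3x3 and sad are hypothetical, the stencil treats out-of-range neighbors as zero, and the matrix-matrix pattern is assumed to combine a row of one matrix with a column of the other, as matrix multiplication does.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major matrix; a hypothetical helper type for this sketch only.
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data;
    Matrix(std::size_t r, std::size_t c, float fill = 0.0f)
        : rows(r), cols(c), data(r * c, fill) {}
    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Stencil pattern: each output element is the mean of its 3x3 neighborhood.
// Neighboring outputs read overlapping inputs, which is the reuse that
// shared memory can exploit on a GPU.
Matrix mean3x3(const Matrix& in) {
    Matrix out(in.rows, in.cols);
    const long nr = static_cast<long>(in.rows);
    const long nc = static_cast<long>(in.cols);
    for (long r = 0; r < nr; ++r) {
        for (long c = 0; c < nc; ++c) {
            float acc = 0.0f;
            for (long dr = -1; dr <= 1; ++dr) {
                for (long dc = -1; dc <= 1; ++dc) {
                    const long rr = r + dr;
                    const long cc = c + dc;
                    if (rr >= 0 && rr < nr && cc >= 0 && cc < nc) {
                        acc += in.at(static_cast<std::size_t>(rr),
                                     static_cast<std::size_t>(cc));
                    }
                }
            }
            out.at(static_cast<std::size_t>(r),
                   static_cast<std::size_t>(c)) = acc / 9.0f;
        }
    }
    return out;
}

// Matrix-matrix pattern with SAD:
// out(i, j) = sum over k of |a(i, k) - b(k, j)|.
// Row i of a is reused for every column j of b, and vice versa.
Matrix sad(const Matrix& a, const Matrix& b) {
    assert(a.cols == b.rows);
    Matrix out(a.rows, b.cols);
    for (std::size_t i = 0; i < a.rows; ++i) {
        for (std::size_t j = 0; j < b.cols; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < a.cols; ++k) {
                acc += std::fabs(a.at(i, k) - b.at(k, j));
            }
            out.at(i, j) = acc;
        }
    }
    return out;
}
```

Replacing the absolute difference with a squared difference turns the second function into SSD. In both patterns every output element is independent, so the outer two loops are the natural candidates for mapping to GPU threads.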

Use Supported MATLAB Language and Toolboxes for Code Generation

GPU Coder generates code from MATLAB language features that design engineers typically use for developing algorithms as components of larger systems. This includes more than 380 operators and functions from MATLAB and companion toolboxes.

MATLAB language and toolbox support for code generation.

Incorporate External CUDA with Generated Code

You can use coder.ceval to incorporate external CUDA code into your generated code. External code can be existing handwritten code, code for environments that you integrate with the generated code, or other user-specified lines of code that you include in the GPU Coder build process. The generated code will contain calls to the external CUDA functions at the appropriate locations.

You can also bring external CUDA code into MATLAB for simulation and verification by writing a MATLAB function that uses coder.ceval to call the external CUDA code, and then generating a MEX file. The resulting MEX file, when executed in MATLAB, in turn executes the external CUDA code. If you generate standalone code from the MATLAB function, the generated code calls the external CUDA function.

External CUDA code, which can be integrated into the generated code. You can also call external CUDA functions in MATLAB via a MEX file.

Generate MEX Functions for Acceleration and Verification

A MEX function, compiled C or CUDA code for execution within MATLAB, can be called in place of your MATLAB code to:

  • Test and verify the compiled code back in MATLAB
  • Accelerate the execution

As a part of the three-step iterative workflow, you should generate and test the MEX function to verify that it provides the same functionality as the original MATLAB code.

Testing the MEX function before generating code enables you to detect and fix run-time errors that are much harder to diagnose in the generated code. Running your MEX function in MATLAB executes memory integrity checks, including array bounds checking and dimension checking, on the C/C++ code generated for your MATLAB functions. It also checks for register spills and runs stack size conformance checks in the CUDA code. If a violation is detected, MATLAB stops execution and provides a diagnostic message.


Run Generated Code on GPUs such as NVIDIA Tesla and Tegra

You can compile the generated code using NVIDIA tools and execute it on GPUs such as NVIDIA Tesla and NVIDIA Tegra. For GPUs installed in the host machine where MATLAB resides, you can compile the code using NVIDIA compilers. If the generated code calls third-party accelerated libraries such as cuDNN, cuFFT, cuSOLVER, or cuBLAS, you need to install these libraries separately before compiling the generated code.

For embedded GPUs, you can manually integrate the generated code and compile it on the target using NVIDIA tools. Alternatively, with GPU Coder™ Support Package for NVIDIA® GPUs, you can cross-compile and deploy the generated CUDA code as a standalone application on an embedded GPU such as the NVIDIA Drive platform or the NVIDIA Jetson® board. The support package also enables you to communicate remotely with the NVIDIA target and control its peripheral devices for early prototyping.

Learn how you can use the GPU Coder hardware support package for NVIDIA GPUs to prototype, verify, and deploy your deep learning models and algorithms in MATLAB for embedded vision and autonomous driving applications on NVIDIA GPUs like the NVIDIA Drive.

Run generated CUDA on NVIDIA GPUs such as the Tesla and Tegra.

Use GPU Coder with Embedded Coder

By using GPU Coder with Embedded Coder®, you can further optimize code efficiency and customize the generated code. Use the interactive traceability report to gain insight into how your MATLAB code maps to the generated C code. Embedded Coder also enables you to use software-in-the-loop (SIL) execution to verify the numerical behavior of the generated CUDA code as deployed on your embedded GPU.

Interactive traceability report using MATLAB Coder with Embedded Coder.