Translating MATLAB® algorithms to CUDA® code involves specifying implementation requirements. The GPU Coder app and equivalent command-line functions guide you through this iterative process while enabling you to continue working with the familiar MATLAB language.
GPU Coder™ helps you prepare your algorithm for code generation by analyzing your MATLAB code to propose the data types and sizes for your inputs. You can ensure that your algorithm is ready for code generation by generating a MEX function that wraps the compiled code for execution back within MATLAB. GPU Coder produces a report that identifies any errors you need to fix to make your MATLAB algorithm ready for code generation. You iterate between fixing errors and regenerating a MEX function until your MATLAB algorithm is suitable for code generation.
You can then generate CUDA code from your algorithm, either as source code, as a static or dynamic library, or as a MEX function tuned for performance, to accelerate computationally intensive portions of your MATLAB code. The generated code can be used for applications such as deep learning, embedded vision, and autonomous systems.
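The iterative MEX workflow described above can be sketched as follows. This is a minimal illustration, assuming a user-defined entry-point function `myFilter.m` that takes a single-precision image input; the function name and input size are placeholders.

```matlab
% Sketch of the iterative workflow: generate a MEX function from a
% MATLAB entry-point function and verify it within MATLAB.
% myFilter is an illustrative placeholder for your own algorithm.
cfg = coder.gpuConfig('mex');    % target a MEX function for testing in MATLAB
codegen -config cfg myFilter -args {ones(480,640,'single')}

% Call the generated MEX in place of the original MATLAB function
% and compare the results.
img = rand(480,640,'single');
out_matlab = myFilter(img);
out_mex    = myFilter_mex(img);
assert(max(abs(out_matlab(:) - out_mex(:))) < 1e-5)
```

If code generation fails, the report identifies the offending lines; you fix them and rerun `codegen` until the MEX function builds and matches the MATLAB results.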
Use GPU Coder with Deep Learning Toolbox to deploy trained deep learning networks on NVIDIA GPUs such as the Tesla® and Tegra®. Use the transfer learning approach to retrain existing networks, such as AlexNet or VGG-16/19, to perform new tasks. For example, classify only the ten most critical object categories in your data set instead of 1000 different objects. Or, train a deep network from scratch for a new application by gathering a very large labeled data set and designing a network architecture that learns the features and the model.
GPU Coder generates code for preprocessing and postprocessing along with the trained deep learning network, so you get the complete algorithm. For example, you might need to clean up foggy input images using classical machine learning techniques before using a trained deep learning network like AlexNet or VGG-16 to detect and classify objects. GPU Coder generates code for both the machine learning algorithm and for the trained deep learning network, so you can develop your complete application more easily.
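A combined preprocessing-plus-network function might look like the following sketch. The function name, the normalization step, and the use of AlexNet are illustrative assumptions; `coder.loadDeepLearningNetwork` is the documented GPU Coder API for loading a trained network into generated code.

```matlab
% Sketch: one entry-point function that combines classical preprocessing
% with inference through a pretrained network (AlexNet assumed here).
function out = classifyScene(img)  %#codegen
    % Classical preprocessing before the network. AlexNet expects a
    % 227x227x3 single-precision input; resizing/cleanup steps would go here.
    img = im2single(img);

    % Load the pretrained network once and reuse it across calls.
    persistent net;
    if isempty(net)
        net = coder.loadDeepLearningNetwork('alexnet');
    end
    out = net.predict(img);
end
```

Generating code for this one function yields CUDA code for both the preprocessing and the network inference, so the deployed application is complete.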
Training a deep learning model can take a long time, from days to weeks. With Parallel Computing Toolbox™, you can leverage GPUs on your machine, on a cluster, or in the cloud to significantly speed up the training process. Using GPUs can cut the training time for an image classification problem from days to hours.
GPU Coder creates CUDA kernels that minimize memory transfers between the CPU and GPU and optimize GPU memory usage. GPU Coder automatically analyzes, identifies, and partitions segments of MATLAB code to run on either the CPU or the GPU. You can also use pragmas to manually specify that all or part of your MATLAB algorithm runs on the GPU. MATLAB code identified to run on the GPU is converted into CUDA kernels, which are created from constructs such as for-loops, element-wise matrix and vector math, scatter-gather and reduction operations (for example, sum), and higher-level algorithms such as FFTs and image processing functions.
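The pragma-based approach can be sketched with `coder.gpu.kernelfun`, the documented GPU Coder pragma that maps the computation in a function to CUDA kernels. The function below is an illustrative placeholder, not from the original text.

```matlab
% Sketch: coder.gpu.kernelfun requests kernel generation for this
% function; the for-loop below becomes a CUDA kernel.
function y = saxpyLike(a, x, y)  %#codegen
    coder.gpu.kernelfun;           % pragma: map computation to CUDA kernels
    for i = 1:numel(x)             % element-wise loop -> one kernel
        y(i) = a * x(i) + y(i);
    end
end
```

Without the pragma, GPU Coder's automatic analysis would still consider this loop for kernel creation; the pragma makes the intent explicit.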
GPU Coder also analyzes the data dependencies between the CPU and GPU partitions. Data shared between the CPU and GPU is allocated in GPU memory using cudaMallocManaged. The analysis determines the minimum set of locations where data must be copied between the CPU and GPU using cudaMemcpy. If you are using unified memory in CUDA, GPU Coder also determines the minimum number of cudaDeviceSynchronize calls needed for correct functional behavior.
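The memory allocation mode is controlled through the code generation configuration object. The sketch below uses the documented `MallocMode` property of `coder.gpuConfig`; the entry-point function name is a placeholder.

```matlab
% Sketch: select unified memory in the GPU Coder configuration so that
% shared CPU/GPU data is allocated with cudaMallocManaged.
cfg = coder.gpuConfig('lib');            % generate a static library
cfg.GpuConfig.MallocMode = 'unified';    % 'unified' or 'discrete'
codegen -config cfg myFunction -args {ones(1024,1,'single')}
```

With 'discrete' mode, GPU Coder instead inserts the minimal set of cudaMemcpy calls determined by its dependency analysis.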
Various GPU memory spaces are supported, from local to global memory. Within each kernel, GPU Coder maps data to the memory space that yields the greatest memory bandwidth.
The generated code calls optimized NVIDIA CUDA libraries, including TensorRT, cuDNN, cuSolver, cuFFT, cuBLAS, and Thrust.
You can achieve additional acceleration by using MATLAB design patterns such as stencil processing and matrix-matrix processing. Stencil processing can be used for operations such as convolution, median filtering, and finite element methods. The generated code uses shared memory to improve memory bandwidth for operations that exhibit data locality and reuse opportunities. Matrix-matrix processing can be used for operations such as sum of absolute differences (SAD) and sum of squared differences (SSD). In this case, the generated code reuses data and improves memory bandwidth.
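The stencil pattern can be expressed with `gpucoder.stencilKernel`, the documented GPU Coder function for sliding-window operations; the 3x3 mean filter below is an illustrative example.

```matlab
% Sketch: gpucoder.stencilKernel applies a function over sliding windows
% of the input and generates shared-memory CUDA code. Here, a 3x3 mean
% filter (illustrative example).
function out = meanFilter3x3(in)  %#codegen
    out = gpucoder.stencilKernel(@windowMean, in, [3 3], 'same');
end

function y = windowMean(w)
    % w is the 3x3 neighborhood around each output element
    y = sum(w(:)) / numel(w);
end
```

The generated kernel stages each tile of the input in shared memory, so neighboring threads reuse data instead of re-reading global memory.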
GPU Coder generates code from MATLAB language features that design engineers typically use for developing algorithms as components of larger systems. This includes more than 380 operators and functions from MATLAB and companion toolboxes.
You can use coder.ceval to incorporate external CUDA code into your generated code. External code can be existing handwritten code, code for environments that you integrate with the generated code, or other user-specified code that you include in the GPU Coder build process. The generated code contains calls to the external CUDA functions at the appropriate locations.
You can also bring external CUDA code into MATLAB for simulation and verification by writing a MATLAB function that uses coder.ceval to call the external CUDA code, and then generating a MEX file. When executed in MATLAB, the resulting MEX file in turn executes the external CUDA code. If you generate standalone code from the MATLAB function, the generated code calls the external CUDA function.
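A typical coder.ceval wrapper looks like the following sketch. The external function name (`myCudaSum`) and source file are hypothetical placeholders; `coder.ceval`, `coder.target`, `coder.rref`, and `coder.updateBuildInfo` are documented MATLAB Coder APIs.

```matlab
% Sketch: wrap an external CUDA function so the same MATLAB function
% simulates in MATLAB and calls the external code in generated code.
function s = callExternalCuda(x)  %#codegen
    s = single(0);
    if coder.target('MATLAB')
        % Simulation path: plain MATLAB stands in for the CUDA code
        s = sum(x);
    else
        % Add the external CUDA source to the GPU Coder build
        coder.updateBuildInfo('addSourceFiles', 'myCudaSum.cu');
        % Call the external function; coder.rref passes x by reference
        s = coder.ceval('myCudaSum', coder.rref(x), int32(numel(x)));
    end
end
```

Generating a MEX file from this function lets you verify the external CUDA code from within MATLAB before building standalone code.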
A MEX function (compiled C or CUDA code for execution within MATLAB) can be called in place of your MATLAB code.
As a part of the three-step iterative workflow, you should generate and test the MEX function to verify that it provides the same functionality as the original MATLAB code.
Testing the MEX function before generating standalone code enables you to detect and fix run-time errors that are much harder to diagnose in the generated code. Running your MEX function in MATLAB performs memory integrity checks, such as array bounds and dimension checking, that detect violations of memory integrity in the C/C++ code generated for your MATLAB functions. Executing the MEX function in MATLAB also checks for register spills and runs stack size conformance checks on the CUDA code. If a violation is detected, MATLAB stops execution and provides a diagnostic message.
You can compile the generated code using NVIDIA products and execute it on GPUs such as NVIDIA Tesla and NVIDIA Tegra. For GPUs mounted on the host machine where MATLAB resides, you can compile it using NVIDIA compilers. If the generated code calls third-party accelerated libraries such as cuDNN, cuFFT, cuSolver, or cuBLAS, you need to install these libraries separately prior to compiling the generated code.
For embedded GPUs, you can manually integrate the generated code and compile it on the target using NVIDIA tools. Alternatively, with GPU Coder™ Support Package for NVIDIA® GPUs, you can cross-compile and deploy the generated CUDA code as a standalone application on embedded GPUs such as the NVIDIA DRIVE platform or NVIDIA Jetson® boards. The support package also enables you to communicate remotely with the NVIDIA target and control its peripheral devices for early prototyping.
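Deployment through the support package can be sketched as below. The board address, credentials, build directory, and entry-point function are placeholders; `jetson` and `coder.hardware('NVIDIA Jetson')` are the documented support-package APIs for connecting to and targeting a Jetson board.

```matlab
% Sketch: cross-compile and deploy to an NVIDIA Jetson board, assuming
% the GPU Coder Support Package for NVIDIA GPUs is installed.
% Address and credentials below are illustrative placeholders.
hwobj = jetson('192.168.1.15', 'ubuntu', 'ubuntu');  % connect to the board

cfg = coder.gpuConfig('exe');                 % standalone executable
cfg.Hardware = coder.hardware('NVIDIA Jetson');
cfg.Hardware.BuildDir = '~/remoteBuildDir';   % build location on the target
codegen -config cfg myFunction -args {ones(224,224,3,'single')}
```

The generated executable is built on the target over the connection established by the hardware object.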
By using GPU Coder with Embedded Coder®, you can further optimize code efficiency and customize the generated code. Use the interactive traceability report to gain insight into how your MATLAB code maps to the generated code. Embedded Coder also enables you to verify the numerical behavior of the generated CUDA code as deployed on your embedded GPU, using software-in-the-loop (SIL) execution.