Generate GPU Code That Uses the NVIDIA cuBLAS Library
This example shows how to generate CUDA® code that multiplies two matrices by using the NVIDIA® cuBLAS library.
Examine the blas_gemm Entry-Point Function
The blas_gemm function multiplies the matrices A and B. The function uses the coder.gpu.kernelfun pragma to map the function to the GPU in the generated code.
type blas_gemm.m
function [C] = blas_gemm(A,B)
coder.gpu.kernelfun();
C = A * B;
end
By default, GPU Coder™ replaces certain math functions with calls to the cuBLAS library in the generated code. In this example, if A, B, and C are matrices, GPU Coder replaces the multiplication of A and B with a call to the cuBLAS library. For more information about the operations that can use the cuBLAS library, see Kernels from Library Calls.
Generate CUDA Code
To generate code from the blas_gemm function, create a code configuration object by using the coder.gpuConfig function with the mex input argument.
cfg = coder.gpuConfig("mex");
Create two 1024-by-1024 input matrices named X and Y to use as arguments for the blas_gemm function.
X = rand(1024); Y = rand(1024);
Generate CUDA code from blas_gemm by using the codegen command.
codegen blas_gemm.m -config cfg -args {X,Y};
Code generation successful.
Examine the Generated Code
When you generate CUDA code, GPU Coder creates function calls in the generated code that initialize the cuBLAS library, perform matrix-matrix operations, and release the hardware resources that the cuBLAS library uses.
After initializing the cuBLAS library, the generated code calls the blas_gemm function. The function copies the inputs from the CPU to the GPU by using the cudaMemcpy function, and then calls the cublasDgemm function.
codePath = fullfile(pwd,"codegen","mex","blas_gemm","blas_gemm.cu");
coder.example.extractLines(codePath,"void blas_gemm","// End of code");
void blas_gemm(const real_T cpu_A[1048576], const real_T cpu_B[1048576],
real_T cpu_C[1048576])
{
real_T(*gpu_A)[1048576];
real_T(*gpu_B)[1048576];
real_T(*gpu_C)[1048576];
real_T alpha1;
real_T beta1;
checkCudaError(mwCudaMalloc(&gpu_C, 1048576ULL * sizeof(real_T)), __FILE__,
__LINE__);
checkCudaError(mwCudaMalloc(&gpu_B, 1048576ULL * sizeof(real_T)), __FILE__,
__LINE__);
checkCudaError(mwCudaMalloc(&gpu_A, 1048576ULL * sizeof(real_T)), __FILE__,
__LINE__);
alpha1 = 1.0;
beta1 = 0.0;
checkCudaError(cudaMemcpy(*gpu_A, cpu_A, 1048576ULL * sizeof(real_T),
cudaMemcpyHostToDevice),
__FILE__, __LINE__);
checkCudaError(cudaMemcpy(*gpu_B, cpu_B, 1048576ULL * sizeof(real_T),
cudaMemcpyHostToDevice),
__FILE__, __LINE__);
cublasDgemm(getCublasGlobalHandle(), CUBLAS_OP_N, CUBLAS_OP_N, 1024, 1024,
1024, &alpha1, &(*gpu_A)[0], 1024, &(*gpu_B)[0], 1024, &beta1,
&(*gpu_C)[0], 1024);
checkCudaError(cudaMemcpy(cpu_C, *gpu_C, 1048576ULL * sizeof(real_T),
cudaMemcpyDeviceToHost),
__FILE__, __LINE__);
checkCudaError(mwCudaFree(*gpu_A), __FILE__, __LINE__);
checkCudaError(mwCudaFree(*gpu_B), __FILE__, __LINE__);
checkCudaError(mwCudaFree(*gpu_C), __FILE__, __LINE__);
}
The function cublasDgemm is a level-3 Basic Linear Algebra Subprograms (BLAS3) routine that performs General Matrix Multiplication (GEMM). The function calculates the result C as C = αAB + βC, where α and β are scalars, and A, B, and C are matrices in column-major format. Because the desired result is C = A*B, the generated code sets alpha1 to one and beta1 to zero. The generated code passes the enumeration value CUBLAS_OP_N so that the matrix multiplication is computed without transposing either matrix.