Deep Learning Prediction with Different Batch Sizes

This example demonstrates code generation with batch sizes greater than 1. The example has two parts: the first uses cnncodegen to generate code that takes a batch of images as input, and the second creates a MEX file by using codegen and passes a batch of images as input.

Prerequisites

  • CUDA® enabled NVIDIA® GPU with compute capability 3.2 or higher.

  • NVIDIA CUDA toolkit and driver.

  • NVIDIA cuDNN library.

  • NVIDIA TensorRT library.

  • Environment variables for the compilers and libraries. For information on the supported versions of the compilers and libraries, see Third-party Products. For setting up the environment variables, see Environment Variables. A sketch of setting these variables from MATLAB follows this list.

  • Computer Vision Toolbox™ for the video reader and viewer used in the example.

  • Deep Learning Toolbox™ for using SeriesNetwork or DAGNetwork objects.

  • Image Processing Toolbox™ for reading and displaying images.

  • This example uses the TensorRT library, which is supported only on Linux® platforms.
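
The following is a minimal sketch of setting these environment variables from MATLAB on a Linux host. The variable names and install paths shown are assumptions based on a typical setup; see Environment Variables for the authoritative list and adjust the paths to match your installation.

% Minimal sketch only -- the variable names and paths below are assumed placeholders.
setenv('NVIDIA_CUDNN','/usr/local/cudnn');        % assumed root folder of the cuDNN installation
setenv('NVIDIA_TENSORRT','/usr/local/tensorrt');  % assumed root folder of the TensorRT installation
setenv('PATH',[getenv('PATH') ':/usr/local/cuda/bin']);  % make nvcc visible to MATLAB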

Verify the GPU Environment

Use the coder.checkGpuInstall function to verify that the compilers and libraries needed for running this example are set up correctly.

envCfg = coder.gpuEnvConfig('host');
envCfg.DeepLibTarget = 'tensorrt';
envCfg.DeepCodegen = 1;
envCfg.Quiet = 1;
coder.checkGpuInstall(envCfg);

Create a Folder and Copy Relevant Files

The following code creates a folder in your current working folder (pwd). The new folder contains only the files that are relevant for this example. If you do not want to affect the current folder (or if you cannot generate files in this folder), change your working folder.

gpucoderdemo_setup('gpucoderdemo_resnet_batchsize');

Classification with ResNet-50 Network

The example uses the popular DAG network ResNet-50 for image classification. A pretrained ResNet-50 model for MATLAB® is available in the ResNet-50 support package of the Deep Learning Toolbox. To download and install the support package, use the Add-On Explorer. To learn more about finding and installing add-ons, see Get Add-Ons (MATLAB).

net = resnet50;
% To see all the layers, use |net.Layers|

Generate Code for NVIDIA GPUs with TensorRT Support

For an NVIDIA target with TensorRT, code generation and execution are performed on the host development computer. To run the generated code, your development computer must have an NVIDIA GPU with compute capability of at least 3.2. Use the cnncodegen command to generate code for the NVIDIA platform by using the 'tensorrt' option. By default, the cnncodegen command generates code that uses 32-bit float precision for the tensor inputs to the network. In the predict call, multiple images can be batched into a single call and passed as input, which runs prediction over the batch of inputs in parallel. The default batch size is 1.

% Specify the input batch size by using the |'batchsize'| option. This is
% the batch size that must be passed to the generated code. Here,
% 15 images are treated as one batch, so the batch size is 15.
% Passing a different batch size at run time causes errors.

Note: To generate code that uses cuDNN, specify 'cudnn' instead of 'tensorrt' for the 'targetlib' option.

cnncodegen(net,'targetlib','tensorrt', 'batchsize', 15);
nvcc -c  -rdc=true  -Xcompiler -fPIC -Xptxas "-w" -Xcudafe "--display_error_number --diag_suppress=2381 --diag_suppress=unsigned_compare_with_zero" -O3 -arch sm_35 -std=c++11 -I"/mathworks/devel/sbs/37/vravicha.lcmFirst/matlab/toolbox/gpucoder/gpucoderdemos/gpucoderdemo_resnet_batchsize2/codegen" -I"/mathworks/hub/share/apps/GPUTools/TensorRT/5.0.2.6/glnxa64/include"  -I"/mathworks/hub/3rdparty/R2019a/3840803/glnxa64/cuDNN/cuda/include" -o "MWAdditionLayer.o" "MWAdditionLayer.cpp"
nvcc -c  -rdc=true  -Xcompiler -fPIC -Xptxas "-w" -Xcudafe "--display_error_number --diag_suppress=2381 --diag_suppress=unsigned_compare_with_zero" -O3 -arch sm_35 -std=c++11 -I"/mathworks/devel/sbs/37/vravicha.lcmFirst/matlab/toolbox/gpucoder/gpucoderdemos/gpucoderdemo_resnet_batchsize2/codegen" -I"/mathworks/hub/share/apps/GPUTools/TensorRT/5.0.2.6/glnxa64/include"  -I"/mathworks/hub/3rdparty/R2019a/3840803/glnxa64/cuDNN/cuda/include" -o "MWBatchNormalizationLayer.o" "MWBatchNormalizationLayer.cpp"
nvcc -c  -rdc=true  -Xcompiler -fPIC -Xptxas "-w" -Xcudafe "--display_error_number --diag_suppress=2381 --diag_suppress=unsigned_compare_with_zero" -O3 -arch sm_35 -std=c++11 -I"/mathworks/devel/sbs/37/vravicha.lcmFirst/matlab/toolbox/gpucoder/gpucoderdemos/gpucoderdemo_resnet_batchsize2/codegen" -I"/mathworks/hub/share/apps/GPUTools/TensorRT/5.0.2.6/glnxa64/include"  -I"/mathworks/hub/3rdparty/R2019a/3840803/glnxa64/cuDNN/cuda/include" -o "MWConvLayer.o" "MWConvLayer.cpp"
nvcc -c  -rdc=true  -Xcompiler -fPIC -Xptxas "-w" -Xcudafe "--display_error_number --diag_suppress=2381 --diag_suppress=unsigned_compare_with_zero" -O3 -arch sm_35 -std=c++11 -I"/mathworks/devel/sbs/37/vravicha.lcmFirst/matlab/toolbox/gpucoder/gpucoderdemos/gpucoderdemo_resnet_batchsize2/codegen" -I"/mathworks/hub/share/apps/GPUTools/TensorRT/5.0.2.6/glnxa64/include"  -I"/mathworks/hub/3rdparty/R2019a/3840803/glnxa64/cuDNN/cuda/include" -o "cnn_api.o" "cnn_api.cpp"
nvcc -c  -rdc=true  -Xcompiler -fPIC -Xptxas "-w" -Xcudafe "--display_error_number --diag_suppress=2381 --diag_suppress=unsigned_compare_with_zero" -O3 -arch sm_35 -std=c++11 -I"/mathworks/devel/sbs/37/vravicha.lcmFirst/matlab/toolbox/gpucoder/gpucoderdemos/gpucoderdemo_resnet_batchsize2/codegen" -I"/mathworks/hub/share/apps/GPUTools/TensorRT/5.0.2.6/glnxa64/include"  -I"/mathworks/hub/3rdparty/R2019a/3840803/glnxa64/cuDNN/cuda/include" -o "MWAdditionLayerImpl.o" "MWAdditionLayerImpl.cu"
nvcc -c  -rdc=true  -Xcompiler -fPIC -Xptxas "-w" -Xcudafe "--display_error_number --diag_suppress=2381 --diag_suppress=unsigned_compare_with_zero" -O3 -arch sm_35 -std=c++11 -I"/mathworks/devel/sbs/37/vravicha.lcmFirst/matlab/toolbox/gpucoder/gpucoderdemos/gpucoderdemo_resnet_batchsize2/codegen" -I"/mathworks/hub/share/apps/GPUTools/TensorRT/5.0.2.6/glnxa64/include"  -I"/mathworks/hub/3rdparty/R2019a/3840803/glnxa64/cuDNN/cuda/include" -o "MWBatchNormalizationLayerImpl.o" "MWBatchNormalizationLayerImpl.cu"
nvcc -c  -rdc=true  -Xcompiler -fPIC -Xptxas "-w" -Xcudafe "--display_error_number --diag_suppress=2381 --diag_suppress=unsigned_compare_with_zero" -O3 -arch sm_35 -std=c++11 -I"/mathworks/devel/sbs/37/vravicha.lcmFirst/matlab/toolbox/gpucoder/gpucoderdemos/gpucoderdemo_resnet_batchsize2/codegen" -I"/mathworks/hub/share/apps/GPUTools/TensorRT/5.0.2.6/glnxa64/include"  -I"/mathworks/hub/3rdparty/R2019a/3840803/glnxa64/cuDNN/cuda/include" -o "MWCNNLayerImpl.o" "MWCNNLayerImpl.cu"
nvcc -c  -rdc=true  -Xcompiler -fPIC -Xptxas "-w" -Xcudafe "--display_error_number --diag_suppress=2381 --diag_suppress=unsigned_compare_with_zero" -O3 -arch sm_35 -std=c++11 -I"/mathworks/devel/sbs/37/vravicha.lcmFirst/matlab/toolbox/gpucoder/gpucoderdemos/gpucoderdemo_resnet_batchsize2/codegen" -I"/mathworks/hub/share/apps/GPUTools/TensorRT/5.0.2.6/glnxa64/include"  -I"/mathworks/hub/3rdparty/R2019a/3840803/glnxa64/cuDNN/cuda/include" -o "MWConvLayerImpl.o" "MWConvLayerImpl.cu"
nvcc -c  -rdc=true  -Xcompiler -fPIC -Xptxas "-w" -Xcudafe "--display_error_number --diag_suppress=2381 --diag_suppress=unsigned_compare_with_zero" -O3 -arch sm_35 -std=c++11 -I"/mathworks/devel/sbs/37/vravicha.lcmFirst/matlab/toolbox/gpucoder/gpucoderdemos/gpucoderdemo_resnet_batchsize2/codegen" -I"/mathworks/hub/share/apps/GPUTools/TensorRT/5.0.2.6/glnxa64/include"  -I"/mathworks/hub/3rdparty/R2019a/3840803/glnxa64/cuDNN/cuda/include" -o "MWTargetNetworkImpl.o" "MWTargetNetworkImpl.cu"
nvcc -c  -rdc=true  -Xcompiler -fPIC -Xptxas "-w" -Xcudafe "--display_error_number --diag_suppress=2381 --diag_suppress=unsigned_compare_with_zero" -O3 -arch sm_35 -std=c++11 -I"/mathworks/devel/sbs/37/vravicha.lcmFirst/matlab/toolbox/gpucoder/gpucoderdemos/gpucoderdemo_resnet_batchsize2/codegen" -I"/mathworks/hub/share/apps/GPUTools/TensorRT/5.0.2.6/glnxa64/include"  -I"/mathworks/hub/3rdparty/R2019a/3840803/glnxa64/cuDNN/cuda/include" -o "cnn_exec.o" "cnn_exec.cpp"
nvcc -lib -Xlinker -rpath,"/bin/glnxa64",-L"/bin/glnxa64" -lc -Xnvlink -w -Wno-deprecated-gpu-targets -arch sm_35 -std=c++11 -o cnnbuild.a MWAdditionLayer.o MWBatchNormalizationLayer.o MWConvLayer.o cnn_api.o MWAdditionLayerImpl.o MWBatchNormalizationLayerImpl.o MWCNNLayerImpl.o MWConvLayerImpl.o MWTargetNetworkImpl.o cnn_exec.o -L".." -lcudnn -lnvcaffe_parser -lnvinfer_plugin -lnvinfer -lcublas -lcudart -lcusolver 
### Created: cnnbuild.a
### Successfully generated all binary outputs.

1. Description of the Generated Code

The presetup() and postsetup() functions perform additional configuration required for TensorRT. Layer classes in the generated code folder call into TensorRT libraries.

Main File

The main file creates and sets up the CnnMain network object with layers and weights. It uses the OpenCV VideoCapture method to read frames from the input video. It runs prediction for each frame, fetching the output from the final fully connected layer.

Frames obtained from the OpenCV VideoCapture object are converted from the packed BGR (OpenCV) format to the planar RGB (MATLAB) format. A buffer is allocated and filled with the image data as shown. This raw buffer is used as the input to the network.

   void readBatchData(float *input, vector<Mat>& orig, int batchSize)
   {
        for (int i=0; i<batchSize; i++)
        {
           // If the video runs out of frames, pad the batch with zero frames
           if (orig[i].empty())
           {
               orig[i] = Mat::zeros(ROWS,COLS, orig[i-1].type());
               continue;
           }
           // Resize the frame to the network input size
           Mat tmpIm;
           resize(orig[i], tmpIm, Size(COLS,ROWS));
           for (int j=0; j<ROWS*COLS; j++)
           {
               // BGR packed to RGB planar conversion
               input[CH*COLS*ROWS*i + 2*COLS*ROWS + j] = (float)(tmpIm.data[j*3+0]);
               input[CH*COLS*ROWS*i + 1*COLS*ROWS + j] = (float)(tmpIm.data[j*3+1]);
               input[CH*COLS*ROWS*i + 0*COLS*ROWS + j] = (float)(tmpIm.data[j*3+2]);
           }
        }
   }

2. Build and Execute

Use make to build the resnet_batchSize_exe executable.

system(['make ','tensorrt']);
% or, if the code was generated with the 'cudnn' option:
system(['make ','cudnn']);

Run the executable with the batch size as the first argument and the name of an input video file as the second argument. For example, ./resnet_batchSize_exe 15 testVideo.avi

system(['./resnet_batchSize_exe ', num2str(15), ' testVideo.avi']);

Generate CUDA MEX for the 'resnet_predict' Function

Create a GPU configuration object for a MEX target and set the target language to C++. On the configuration object, set DeepLearningConfig by using the 'tensorrt' target library. To generate CUDA MEX, use the codegen command and specify the input as a 4-D matrix of size [224,224,3,batchSize]. This value corresponds to the input layer size of the ResNet-50 network.
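
The codegen command compiles the entry-point function resnet_predict included with this example. The following is a minimal sketch of such an entry-point function, assuming it loads ResNet-50 into a persistent network object by using coder.loadDeepLearningNetwork and calls predict on the input batch; the file shipped with the example may differ in its details.

function out = resnet_predict(in)
% resnet_predict -- minimal sketch of the entry-point function (assumed form).
% A persistent object ensures the network is loaded only once, not on every call.
persistent mynet;
if isempty(mynet)
    mynet = coder.loadDeepLearningNetwork('resnet50');
end
% in is a 224-by-224-by-3-by-batchSize array; predict runs over the whole batch.
out = mynet.predict(in);
end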

batchSize = 5;
cfg = coder.gpuConfig('mex');
cfg.TargetLang = 'C++';
cfg.DeepLearningConfig = coder.DeepLearningConfig('tensorrt');
codegen -config cfg resnet_predict -args {ones(224,224,3,batchSize,'uint8')} -report
Code generation successful: To view the report, open('codegen/mex/resnet_predict/html/report.mldatx').

Call Predict on a Test Image Batch

im = imread('peppers.png');
im = imresize(im, [224,224]);
%
% Concatenating 5 images since |batchSize = 5|
imBatch = cat(4,im,im,im,im,im);
predict_scores = resnet_predict_mex(imBatch);
%
% get top 5 probability scores and their labels, for each image in the batch
[val,indx] = sort(transpose(predict_scores), 'descend');
scores = val(1:5,:)*100;
net = resnet50;
classnames = net.Layers(end).ClassNames;
for i = 1:batchSize
    labels = classnames(indx(1:5,i));
    disp(['Top 5 predictions on image, ', num2str(i)]);
    disp(labels);
end
Top 5 predictions on image, 1
    'bell pepper'
    'cucumber'
    'acorn squash'
    'lemon'
    'zucchini'

Top 5 predictions on image, 2
    'bell pepper'
    'cucumber'
    'acorn squash'
    'lemon'
    'zucchini'

Top 5 predictions on image, 3
    'bell pepper'
    'cucumber'
    'acorn squash'
    'lemon'
    'zucchini'

Top 5 predictions on image, 4
    'bell pepper'
    'cucumber'
    'acorn squash'
    'lemon'
    'zucchini'

Top 5 predictions on image, 5
    'bell pepper'
    'cucumber'
    'acorn squash'
    'lemon'
    'zucchini'

Clear the static network object loaded in memory.

clear mex;

The warning about non-coalesced access occurs because the codegen command generates column-major code. For the purposes of this example, you can ignore this warning.

Run Command: Cleanup

Clear the static network object loaded in memory, and then run the cleanup script to remove the generated files.

clear mex;
cleanup