Vehicle Detection Using YOLO v2 Deployed to FPGA

Deep learning is a powerful machine learning technique that you can use to train robust object detectors. Several techniques for object detection exist, including Faster R-CNN and you only look once (YOLO) v2.

Train and deploy a you only look once (YOLO) v2 object detector by using the dlhdl.Workflow object.

Load Data Set

This example uses a small vehicle data set that contains 295 images. Many of these images come from the Caltech Cars 1999 and 2001 data sets, used with permission and available at the Caltech Computational Vision website, created by Pietro Perona. Each image contains one or two labeled instances of a vehicle. A small data set is useful for exploring the YOLO v2 training procedure, but in practice, more labeled images are needed to train a robust detector. Extract the vehicle images and load the vehicle ground truth data.

data = load('vehicleDatasetGroundTruth.mat');
vehicleDataset = data.vehicleDataset;

The vehicle data is stored in a two-column table, where the first column contains the image file paths and the second column contains the vehicle bounding boxes.

% Add the full path to the local vehicle data folder.
vehicleDataset.imageFilename = fullfile(pwd,vehicleDataset.imageFilename);

Split the data set into training and test sets. Select 60% of the data for training and use the rest for testing the trained detector.

shuffledIndices = randperm(height(vehicleDataset));
idx = floor(0.6 * length(shuffledIndices) );
trainingDataTbl = vehicleDataset(shuffledIndices(1:idx),:);
testDataTbl = vehicleDataset(shuffledIndices(idx+1:end),:);
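
The split logic can be checked on index vectors alone, without loading any images. The following is a minimal sketch, not part of the original example: the fixed random seed and the variable names trainIdx and testIdx are assumptions added for illustration.

```matlab
% Sketch of the 60/40 split on index vectors only (no image data needed).
rng(0);                                  % assumption: fixed seed for repeatable shuffling
numImages = 295;                         % size of the vehicle data set
shuffledIndices = randperm(numImages);
idx = floor(0.6 * numImages);            % number of training images
trainIdx = shuffledIndices(1:idx);       % hypothetical name for the training indices
testIdx  = shuffledIndices(idx+1:end);   % hypothetical name for the test indices

% The two sets are disjoint and together cover every image exactly once.
assert(isempty(intersect(trainIdx,testIdx)));
assert(numel(trainIdx) + numel(testIdx) == numImages);
fprintf('%d training images, %d test images\n', numel(trainIdx), numel(testIdx));
```

Because the shuffle is random, the images that land in each set change from run to run unless the seed is fixed, which also changes the trained detector and its reported metrics.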

Use imageDatastore and boxLabelDatastore to create datastores for loading the image and label data during training and evaluation.

imdsTrain = imageDatastore(trainingDataTbl{:,'imageFilename'});
bldsTrain = boxLabelDatastore(trainingDataTbl(:,'vehicle'));

imdsTest = imageDatastore(testDataTbl{:,'imageFilename'});
bldsTest = boxLabelDatastore(testDataTbl(:,'vehicle'));

Combine the image and box label datastores.

trainingData = combine(imdsTrain,bldsTrain);
testData = combine(imdsTest,bldsTest);

Create YOLO v2 Object Detection Network

A YOLO v2 object detection network is composed of two subnetworks: a feature extraction network followed by a detection network. The feature extraction network is typically a pretrained CNN (for details, see Pretrained Deep Neural Networks). This example uses AlexNet for feature extraction. You can also use other pretrained networks, such as MobileNet v2 or ResNet-18, depending on application requirements. The detection subnetwork is a small CNN compared to the feature extraction network and is composed of a few convolutional layers and layers specific to YOLO v2.

Use the yolov2Layers function to create a YOLO v2 object detection network. The yolov2Layers function requires you to specify several inputs that parameterize a YOLO v2 network:

  • Network input size

  • Anchor boxes

  • Feature extraction network

First, specify the network input size and the number of classes. When choosing the network input size, consider the minimum size required by the network itself, the size of the training images, and the computational cost incurred by processing data at the selected size. When feasible, choose a network input size that is close to the size of the training images and larger than the input size required for the network. To reduce the computational cost of running the example, specify a network input size of 224-by-224-by-3, which is the minimum size required to run the network.

inputSize = [224 224 3];

Define the number of object classes to detect.

numClasses = width(vehicleDataset)-1;

The training images used in this example are larger than 224-by-224 and vary in size, so you must resize the images in a preprocessing step prior to training.

Next, use the estimateAnchorBoxes function to estimate anchor boxes based on the size of objects in the training data. To account for the resizing of the images prior to training, resize the training data for estimating anchor boxes. Use the transform function to preprocess the training data and then define the number of anchor boxes and estimate the anchor boxes. Resize the training data to the input image size of the network by using the supporting function yolo_preprocessData, attached to this example.

For more information on choosing anchor boxes, see Estimate Anchor Boxes From Training Data (Computer Vision Toolbox) and Anchor Boxes for Object Detection (Computer Vision Toolbox).

trainingDataForEstimation = transform(trainingData,@(data)yolo_preprocessData(data,inputSize));
numAnchors = 7;
[anchorBoxes, meanIoU] = estimateAnchorBoxes(trainingDataForEstimation, numAnchors)
anchorBoxes = 7×2

   145   126
    91    86
   161   132
    41    34
    67    64
   136   111
    33    23

meanIoU = 0.8651
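
The mean IoU value summarizes how closely the estimated anchors overlap the boxes in the training data. As a rough illustration of the underlying measure, the following sketch computes the IoU between each anchor and a single box size, assuming both boxes share the same top-left corner so that only their sizes matter. The box size [100 90] is a hypothetical value, not a box from the data set, and the sketch is not the internal implementation of estimateAnchorBoxes.

```matlab
% Anchor boxes reported above, one [height width] pair per row.
anchorBoxes = [145 126; 91 86; 161 132; 41 34; 67 64; 136 111; 33 23];

% Hypothetical object size [height width]; not taken from the data set.
boxSize = [100 90];

% For two boxes with a shared top-left corner, the overlap is the
% rectangle spanned by the smaller height and the smaller width.
interArea = min(anchorBoxes(:,1),boxSize(1)) .* min(anchorBoxes(:,2),boxSize(2));
unionArea = prod(anchorBoxes,2) + prod(boxSize) - interArea;
iou = interArea ./ unionArea;

% Pick the anchor that overlaps the box best.
[bestIoU,bestAnchor] = max(iou);
fprintf('Best anchor: [%d %d], IoU = %.3f\n', anchorBoxes(bestAnchor,:), bestIoU);
```

Averaging the best IoU over all training boxes gives a single number comparable in spirit to meanIoU. Values closer to 1 indicate that the anchors fit the object sizes in the data well.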

Use the alexnet function to load a pretrained model.

featureExtractionNetwork = alexnet
featureExtractionNetwork = 
  SeriesNetwork with properties:

         Layers: [25×1 nnet.cnn.layer.Layer]
     InputNames: {'data'}
    OutputNames: {'output'}

Select 'relu5' as the feature extraction layer to replace the layers after 'relu5' with the detection subnetwork. This feature extraction layer outputs feature maps that are downsampled by a factor of 16. This amount of downsampling is a good trade-off between spatial resolution and the strength of the extracted features, as features extracted farther down the network encode stronger image features at the cost of spatial resolution.

featureLayer = 'relu5';
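
The downsampling factor can be worked out from the filter sizes, strides, and padding of the layers up to 'relu5'. The sketch below copies those values from the layer listing printed by the compile step later in this example and applies the standard output-size formula. The variable names are hypothetical. It assumes square inputs and filters, and it ignores the ReLU and normalization layers because they do not change the spatial size.

```matlab
% [filterSize stride padding] for each size-changing layer up to 'relu5'.
layers = [11 4 0;   % conv1
           3 2 0;   % pool1
           5 1 2;   % conv2
           3 2 0;   % pool2
           3 1 1;   % conv3
           3 1 1;   % conv4
           3 1 1];  % conv5

sz = 224;                                    % input height (and width)
for k = 1:size(layers,1)
    f = layers(k,1); s = layers(k,2); p = layers(k,3);
    sz = floor((sz + 2*p - f)/s) + 1;        % output size of layer k
end
totalStride = prod(layers(:,2));             % nominal downsampling factor

fprintf('Feature map: %d-by-%d, nominal stride %d\n', sz, sz, totalStride);
```

The product of the strides is 16, which is the downsampling factor quoted above. The resulting feature map is 12-by-12 rather than 224/16 = 14 in each dimension because the unpadded convolution and pooling layers trim the borders.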

Create the YOLO v2 object detection network.

lgraph = yolov2Layers(inputSize,numClasses,anchorBoxes,featureExtractionNetwork,featureLayer);
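
In a YOLO v2 network, the last convolution of the detection subnetwork predicts, for each anchor box, four box offsets, one objectness score, and one score per class. The following sketch, with hypothetical variable names, computes the number of filters this implies for the configuration used here.

```matlab
numAnchors = 7;                               % anchor boxes estimated above
numClasses = 1;                               % single 'vehicle' class
predictionsPerAnchor = 4 + 1 + numClasses;    % box offsets + objectness + class scores
numFilters = numAnchors * predictionsPerAnchor;
fprintf('Filters in the final detection convolution: %d\n', numFilters);
```

The result agrees with the 42 1×1×256 convolutions listed for 'yolov2ClassConv' in the compile output later in this example.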

You can visualize the network by using the analyzeNetwork function or the Deep Network Designer from Deep Learning Toolbox™.

If you require more control over the YOLO v2 network architecture, use the Deep Network Designer to design the YOLO v2 detection network manually. For more information, see Design a YOLO v2 Detection Network (Computer Vision Toolbox).

Data Augmentation

Data augmentation improves network accuracy by randomly transforming the original data during training. By using data augmentation, you can add more variety to the training data without increasing the number of labeled training samples.

Use the transform function to augment the training data by randomly flipping the image and associated box labels horizontally. Note that data augmentation is not applied to the test and validation data. Ideally, test and validation data is representative of the original data and is left unmodified for unbiased evaluation.

augmentedTrainingData = transform(trainingData,@yolo_augmentData);
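
The supporting function yolo_augmentData is attached to the example and its contents are not shown here. To illustrate what a horizontal flip must do to a box label, the following sketch mirrors one box and checks the result pixel by pixel. It assumes boxes in [x y width height] format with 1-based pixel coordinates. The image size and box values are hypothetical, and the sketch is not the implementation of yolo_augmentData.

```matlab
% Toy image size and one box in [x y width height] pixel coordinates.
imageSize = [60 100];                 % [rows cols], hypothetical
bbox = [10 20 30 40];                 % hypothetical box

% Mark the box region in a mask so the flip can be verified exactly.
mask = false(imageSize);
mask(bbox(2):bbox(2)+bbox(4)-1, bbox(1):bbox(1)+bbox(3)-1) = true;

% Flip the image content left to right.
flippedMask = fliplr(mask);

% Mirror the box: columns x..x+w-1 map to W-x-w+2..W-x+1.
W = imageSize(2);
flippedBox = [W - bbox(1) - bbox(3) + 2, bbox(2), bbox(3), bbox(4)];

% Rebuild a mask from the mirrored box and confirm it matches the flipped image.
check = false(imageSize);
check(flippedBox(2):flippedBox(2)+flippedBox(4)-1, ...
      flippedBox(1):flippedBox(1)+flippedBox(3)-1) = true;
assert(isequal(check,flippedMask));
```

Only the x coordinate changes; the y coordinate, width, and height stay the same. Flipping the image without updating the box in this way would leave the label pointing at the wrong region.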

Preprocess Training Data and Train YOLO v2 Object Detector

Preprocess the augmented training data and the validation data to prepare for training.

preprocessedTrainingData = transform(augmentedTrainingData,@(data)yolo_preprocessData(data,inputSize));

Use the trainingOptions function to specify network training options. Set 'ValidationData' to the preprocessed validation data. Set 'CheckpointPath' to a temporary location. This setting enables the saving of partially trained detectors during the training process. If training is interrupted, such as by a power outage or system failure, you can resume training from the saved checkpoint.

options = trainingOptions('sgdm', ...
        'MiniBatchSize', 16, ...
        'InitialLearnRate',1e-3, ...
        'MaxEpochs',20, ...            % inferred from the 20 epochs in the training log below
        'CheckpointPath', tempdir);

Use the trainYOLOv2ObjectDetector function to train the YOLO v2 object detector.

[detector,info] = trainYOLOv2ObjectDetector(preprocessedTrainingData,lgraph,options);
Training a YOLO v2 Object Detector for the following object classes:

* vehicle

Training on single CPU.
Initializing input data normalization.
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Mini-batch  |  Base Learning  |
|         |             |   (hh:mm:ss)   |     RMSE     |     Loss     |      Rate       |
|       1 |           1 |       00:00:02 |         7.23 |         52.3 |          0.0010 |
|       5 |          50 |       00:00:43 |         0.99 |          1.0 |          0.0010 |
|      10 |         100 |       00:01:24 |         0.77 |          0.6 |          0.0010 |
|      14 |         150 |       00:02:03 |         0.64 |          0.4 |          0.0010 |
|      19 |         200 |       00:02:41 |         0.57 |          0.3 |          0.0010 |
|      20 |         220 |       00:02:55 |         0.58 |          0.3 |          0.0010 |
Detector training complete.

As a quick test, run the detector on one test image. Make sure that you resize the image to the same size as the training images.

I = imread(testDataTbl.imageFilename{2});
I = imresize(I,inputSize(1:2));
[bboxes,scores] = detect(detector,I);

Display the results.

I_new = insertObjectAnnotation(I,'rectangle',bboxes,scores);
imshow(I_new)

Load Pretrained Network

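Load the pretrained network.

snet = detector.Network;   % assumption: snet is the network of the detector trained above

Use the analyzeNetwork function to obtain information about the network layers.

analyzeNetwork(snet)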

Create Target Object

Create a target object for your target device with a vendor name and an interface to connect your target device to the host computer. Interface options are JTAG (default) and Ethernet. Vendor options are Intel or Xilinx. Use the installed Xilinx Vivado Design Suite over an Ethernet connection to program the device.

hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');

Create Workflow Object

Create an object of the dlhdl.Workflow class. Specify the network and the bitstream name. Specify the network snet as the network. Make sure the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Zynq UltraScale+ MPSoC ZCU102 board. The bitstream uses the single data type.

hW=dlhdl.Workflow('Network', snet, 'Bitstream', 'zcu102_single','Target',hTarget)
hW = 
  Workflow with properties:

            Network: [1×1 DAGNetwork]
          Bitstream: 'zcu102_single'
    ProcessorConfig: []
             Target: [1×1 dlhdl.Target]

Compile YOLO v2 Object Detector

To compile the snet network, run the compile function of the dlhdl.Workflow object.

dn = hW.compile
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_single ...
### The network includes the following layers:

     1   'data'                Image Input                   224×224×3 images with 'zerocenter' normalization                                  (SW Layer)
     2   'conv1'               Convolution                   96 11×11×3 convolutions with stride [4  4] and padding [0  0  0  0]               (HW Layer)
     3   'relu1'               ReLU                          ReLU                                                                              (HW Layer)
     4   'norm1'               Cross Channel Normalization   cross channel normalization with 5 channels per element                           (HW Layer)
     5   'pool1'               Max Pooling                   3×3 max pooling with stride [2  2] and padding [0  0  0  0]                       (HW Layer)
     6   'conv2'               Grouped Convolution           2 groups of 128 5×5×48 convolutions with stride [1  1] and padding [2  2  2  2]   (HW Layer)
     7   'relu2'               ReLU                          ReLU                                                                              (HW Layer)
     8   'norm2'               Cross Channel Normalization   cross channel normalization with 5 channels per element                           (HW Layer)
     9   'pool2'               Max Pooling                   3×3 max pooling with stride [2  2] and padding [0  0  0  0]                       (HW Layer)
    10   'conv3'               Convolution                   384 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]              (HW Layer)
    11   'relu3'               ReLU                          ReLU                                                                              (HW Layer)
    12   'conv4'               Grouped Convolution           2 groups of 192 3×3×192 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    13   'relu4'               ReLU                          ReLU                                                                              (HW Layer)
    14   'conv5'               Grouped Convolution           2 groups of 128 3×3×192 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    15   'relu5'               ReLU                          ReLU                                                                              (HW Layer)
    16   'yolov2Conv1'         Convolution                   256 3×3×256 convolutions with stride [1  1] and padding 'same'                    (HW Layer)
    17   'yolov2Batch1'        Batch Normalization           Batch normalization with 256 channels                                             (HW Layer)
    18   'yolov2Relu1'         ReLU                          ReLU                                                                              (HW Layer)
    19   'yolov2Conv2'         Convolution                   256 3×3×256 convolutions with stride [1  1] and padding 'same'                    (HW Layer)
    20   'yolov2Batch2'        Batch Normalization           Batch normalization with 256 channels                                             (HW Layer)
    21   'yolov2Relu2'         ReLU                          ReLU                                                                              (HW Layer)
    22   'yolov2ClassConv'     Convolution                   42 1×1×256 convolutions with stride [1  1] and padding [0  0  0  0]               (HW Layer)
    23   'yolov2Transform'     YOLO v2 Transform Layer.      YOLO v2 Transform Layer with 7 anchors.                                           (SW Layer)
    24   'yolov2OutputLayer'   YOLO v2 Output                YOLO v2 Output with 7 anchors.                                                    (SW Layer)

### Optimizing series network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
2 Memory Regions created.

Skipping: data
Compiling leg: conv1>>yolov2ClassConv ...
Compiling leg: conv1>>yolov2ClassConv ... complete.
Skipping: yolov2Transform
Skipping: yolov2OutputLayer
Creating Schedule...
Creating Schedule...complete.
Creating Status Table...
Creating Status Table...complete.
Emitting Schedule...
Emitting Schedule...complete.
Emitting Status Table...
Emitting Status Table...complete.

### Allocating external memory buffers:

          offset_name          offset_address    allocated_space 
    _______________________    ______________    ________________

    "InputDataOffset"           "0x00000000"     "24.0 MB"       
    "OutputResultOffset"        "0x01800000"     "4.0 MB"        
    "SchedulerDataOffset"       "0x01c00000"     "0.0 MB"        
    "SystemBufferOffset"        "0x01c00000"     "28.0 MB"       
    "InstructionDataOffset"     "0x03800000"     "4.0 MB"        
    "ConvWeightDataOffset"      "0x03c00000"     "16.0 MB"       
    "EndOffset"                 "0x04c00000"     "Total: 76.0 MB"

### Network compilation complete.
dn = struct with fields:
             weights: [1×1 struct]
        instructions: [1×1 struct]
           registers: [1×1 struct]
    syncInstructions: [1×1 struct]
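
The allocated_space column of the memory table follows directly from the offsets: each region ends where the next one begins. The following sketch, with hypothetical variable names, recomputes the sizes from the hexadecimal addresses copied from the compile log.

```matlab
% Region names and offsets copied from the allocation table in the compile log.
names   = {'InputDataOffset','OutputResultOffset','SchedulerDataOffset', ...
           'SystemBufferOffset','InstructionDataOffset','ConvWeightDataOffset', ...
           'EndOffset'};
offsets = {'00000000','01800000','01c00000','01c00000', ...
           '03800000','03c00000','04c00000'};

addr = hex2dec(offsets);                 % byte addresses
sizeMB = diff(addr) / 2^20;              % size of each region in MB

for k = 1:numel(sizeMB)
    fprintf('%-24s %5.1f MB\n', names{k}, sizeMB(k));
end
fprintf('Total: %.1f MB\n', addr(end)/2^20);
```

The computed sizes are 24, 4, 0, 28, 4, and 16 MB, which sum to the 76 MB total reported at EndOffset.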

Program Bitstream onto FPGA and Download Network Weights

To deploy the network on the Zynq® UltraScale+™ MPSoC ZCU102 hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file. The function also downloads the network weights and biases. The deploy function checks for the Xilinx Vivado tool and the supported tool version. It then starts programming the FPGA device by using the bitstream and displays progress messages and the time it takes to deploy the network.

hW.deploy
### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target FPGA.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 20-Dec-2020 15:26:28

Load Example Image and Run Prediction

Execute the predict function on the dlhdl.Workflow object and display the result.

[prediction, speed] = hW.predict(I_pre,'Profile','on');
### Finished writing input activations.
### Running single input activations.

              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                    8615567                  0.03916                       1            8615567             25.5
    conv1                  1357049                  0.00617 
    norm1                   569406                  0.00259 
    pool1                   205869                  0.00094 
    conv2                  2207222                  0.01003 
    norm2                   360973                  0.00164 
    pool2                   197444                  0.00090 
    conv3                   976419                  0.00444 
    conv4                   761188                  0.00346 
    conv5                   521782                  0.00237 
    yolov2Conv1             660213                  0.00300 
    yolov2Conv2             661162                  0.00301 
    yolov2ClassConv         136816                  0.00062 
 * The clock frequency of the DL processor is: 220MHz
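
The latency and frame rate in the profiler table can be reproduced from the cycle counts and the clock frequency. The following sketch copies the numbers from the table above; the variable names are hypothetical.

```matlab
clockHz = 220e6;                  % DL processor clock from the profiler footnote
networkCycles = 8615567;          % LastFrameLatency(cycles) for the whole network

latencySeconds  = networkCycles / clockHz;
framesPerSecond = 1 / latencySeconds;

% Per-layer cycle counts from the same table.
layerNames  = {'conv1','norm1','pool1','conv2','norm2','pool2', ...
               'conv3','conv4','conv5','yolov2Conv1','yolov2Conv2','yolov2ClassConv'};
layerCycles = [1357049 569406 205869 2207222 360973 197444 ...
               976419 761188 521782 660213 661162 136816];

% Find the layer that takes the largest share of the frame time.
[maxCycles,slowest] = max(layerCycles);
fprintf('Latency: %.5f s, %.1f frames/s\n', latencySeconds, framesPerSecond);
fprintf('Slowest layer: %s (%.1f%% of network cycles)\n', ...
        layerNames{slowest}, 100*maxCycles/networkCycles);
```

The layer with the largest cycle count is conv2, at roughly a quarter of the total, so it is the first place to look when trying to reduce latency.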

Display the prediction results.

[bboxesn, scoresn, labelsn] = yolo_post_proc(prediction,I_pre,anchorBoxes,{'Vehicle'});
I_new3 = insertObjectAnnotation(I,'rectangle',bboxesn,scoresn);
imshow(I_new3)