
Get Started with SOLOv2 for Instance Segmentation

Perform instance segmentation using the Computer Vision Toolbox™ Model for SOLOv2 Instance Segmentation support package.

The Segmenting Objects by LOcations version 2 (SOLOv2) model for instance segmentation offers a lightweight, scalable, and memory-efficient architecture [1]. SOLOv2 achieved state-of-the-art performance on the COCO instance segmentation benchmark, outperforming previous models. Because of its multiscale feature pyramid network (FPN), the model can process inputs of various resolutions and capture object details across a wide range of object sizes. SOLOv2 does not require an external region proposal network; it directly estimates object centers and their associated masks through anchor point localization and mask segmentation modeling.

Install Support Package

You can install the Computer Vision Toolbox Model for SOLOv2 Instance Segmentation from Add-On Explorer. For more information about installing add-ons, see Get and Manage Add-Ons. The support package also requires Deep Learning Toolbox™ and Computer Vision Toolbox. Processing image data on a GPU requires a supported GPU device and Parallel Computing Toolbox™.

Segment Image with Pretrained SOLOv2 Network

Use the process in this section to segment a test image using a pretrained SOLOv2 network with default settings, or to perform inference using a trained SOLOv2 network.

At inference, a fully convolutional network (FCN) backbone of the SOLOv2 network extracts a set of feature maps of various spatial resolutions, or levels, from the input image. The network feeds the extracted feature maps into parallel category and mask branches to generate the final predictions: semantic categories (classes) and instance masks. You can overlay the predicted instance segmentation masks on the image to create the visualization of each object instance, and generate corresponding class labels.

SOLOv2 architecture: the FCN backbone extracts multiscale features and classifies each pixel in the input image into segmentation categories. The predicted instance masks and semantic categories are then overlaid on the input image to combine them.

You can perform inference on a test image with default network options using a pretrained SOLOv2 network.

  1. Load the image or image datastore that you want to segment into the workspace. The SOLOv2 model supports RGB and grayscale images.

    I = imread("kobi.png");
    
  2. Create a solov2 object to configure a pretrained SOLOv2 network with a ResNet-50 or ResNet-18 backbone as the feature extractor. To increase inference speed at the possible cost of detecting fewer objects, specify the lightweight ResNet-18 backbone with a reduced number of features, "light-resnet18-coco".

    model = solov2("light-resnet18-coco");
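
    Alternatively, to favor accuracy over inference speed, specify the "resnet50-coco" pretrained model.

    model = solov2("resnet50-coco");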
  3. Perform instance segmentation by using the segmentObjects function on the pretrained network, specifying that the function return the object masks, labels, and detection scores.

    [masks,labels,scores] = segmentObjects(model,I);
  4. Visualize the results by using the insertObjectMask function.

    maskedImage = insertObjectMask(I,masks);
    imshow(maskedImage)

    Instance mask of the object, generated using the pretrained SOLOv2 network, overlaid on the input image

Perform Transfer Learning with SOLOv2

To modify a network to detect additional classes, or to adjust other network parameters, you can perform transfer learning. This section shows how to prepare your training data, configure the SOLOv2 model, and train the network to perform transfer learning.

Configure Training Data

To train a SOLOv2 detector, specify your labeled ground truth training data trainingData as a datastore. You must set up your data so that calling the read and readall functions on the datastore returns a cell array with four columns: the images, ground truth bounding boxes, instance labels, and instance masks. The following descriptions explain the required format of each column.

RGB or grayscale image

RGB or grayscale images that serve as network inputs, specified as H-by-W-by-3 or H-by-W numeric arrays, respectively. For example, load a sample modified RGB image from the CamVid data set [2] that contains objects of interest such as vehicles, traffic lights, and pedestrians.

RGB image of a street scene with vehicles, traffic lights, and pedestrians

Ground truth bounding boxes

Bounding boxes for objects in the RGB images, specified as an M-by-4 matrix, with rows of the form [x y w h], where M is the number of object instances in the image.

For example, the bboxes variable shows the bounding boxes of nine objects in the sample RGB image.

bboxes =

     1   178    94   133
   178   173   115   126
    63   181    54    68
   320   169    15    42
   383   173    12    39
   359   167    14    41
   141   131    12    30
    55    86    75   117
   146   167    14    43

Instance labels

Label of each instance, specified as a NumObjects-by-1 categorical vector, string vector, or cell array of character vectors, where NumObjects is the number of labeled objects in the image.

For example, the labels variable shows the label names of the nine labeled objects in the sample RGB image.

labels = 

  9×1 categorical array

     car 
     car 
     car 
     person 
     person 
     person 
     traffic light 
     bus 
     person 

Instance masks

Masks for instances of objects. Mask data comes in these formats:

  • Binary masks, specified as a logical array of size H-by-W-by-NumObjects. Each mask is the segmentation of one instance in the image.

  • Polygon coordinates, specified as a NumObjects-by-2 cell array. Each row of the array contains the (x, y) coordinates of a polygon along the boundary of one instance in the image.

    The SOLOv2 network requires binary masks, not polygon coordinates. If your mask data is in polygon coordinates, use the poly2mask function to convert the polygon coordinates to binary masks of size h-by-w-by-numObjects, where h and w are the image height and width, respectively. For example, if the variable masks_polygon contains polygon coordinates, you can use this code to convert them to binary masks.

    % Preallocate the logical mask array, then rasterize each polygon into
    % one binary mask per object instance.
    denseMasks = false([h w numObjects]);
    for i = 1:numObjects
        denseMasks(:,:,i) = poly2mask(masks_polygon{i}(:,1),masks_polygon{i}(:,2),h,w);
    end

To display the instance mask data over a sample training image I, use the insertObjectMask function. You can specify a colormap so that each object instance appears in a different color.

For example, if the variable masks contains the corresponding instance masks, overlay the masks on the image using the lines colormap.

imOverlay = insertObjectMask(I,masks,Color=lines(numObjects));
imshow(imOverlay)

Each object instance appears in a different false-color hue over the RGB image

The datastore must return your data as a four-column cell array of the form {RGB images, Bounding boxes, Labels, Masks}. You can create a datastore in the required format using these steps, as shown in the sketch after this list:

  1. Create an ImageDatastore that returns RGB or grayscale image data.

  2. Create a boxLabelDatastore that returns bounding box data and instance labels as a two-element cell array.

  3. Create an ImageDatastore and specify a custom read function that returns mask data as an H-by-W-by-NumObjects logical array.

  4. Combine the three datastores using the combine function.
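
For example, this sketch outlines the four steps. The folder names imageDir and maskDir, the box label table boxLabelTable, and the helper function readInstanceMasks (which must return an H-by-W-by-NumObjects logical array for each image) are placeholders for your own data.

imds = imageDatastore(imageDir);                              % Step 1: RGB or grayscale training images
blds = boxLabelDatastore(boxLabelTable);                      % Step 2: bounding boxes and instance labels
maskds = imageDatastore(maskDir,ReadFcn=@readInstanceMasks);  % Step 3: binary instance masks
trainingData = combine(imds,blds,maskds);                     % Step 4: four-column training datastore

Calling the read function on trainingData then returns a four-column cell array of the form {RGB images, Bounding boxes, Labels, Masks}.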

For more information, see Datastores for Deep Learning (Deep Learning Toolbox).

Train the SOLOv2 Network

To configure a SOLOv2 network for training, specify the class names when you create a solov2 object. You can optionally specify additional network properties, such as the network input size to use during training and inference. For example, specify a SOLOv2 network that uses ResNet-50 as the base network to detect the classes in ClassNames during training.

ClassNames = ["person","traffic light","car","bus"];
Network = solov2("resnet50-coco",ClassNames);

Specify the network training options using the trainingOptions (Deep Learning Toolbox) function. To learn more about using trainingOptions to fine-tune network parameters for training, see Set Up Parameters and Train Convolutional Neural Network (Deep Learning Toolbox).
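
For example, this is a minimal sketch of training options. The solver and values shown here are illustrative starting points, not recommended settings for any particular data set.

options = trainingOptions("sgdm", ...
    InitialLearnRate=0.001, ...  % small learning rate for fine-tuning a pretrained backbone
    MaxEpochs=10, ...
    MiniBatchSize=2, ...         % reduce the batch size if you run out of GPU memory
    VerboseFrequency=10);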

To train the network, pass your training data, the configured solov2 object, and the trainingOptions function output to the trainSOLOV2 function. The function returns a trained solov2 network trainedNetwork.

trainedNetwork = trainSOLOV2(trainingData,Network,options);

To perform inference on a test image I using the trained network, pass the trained network as input to the segmentObjects function. For more details, see the Segment Image with Pretrained SOLOv2 Network section.
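
For example, using the trained network and a test image I:

[masks,labels,scores] = segmentObjects(trainedNetwork,I);
maskedImage = insertObjectMask(I,masks);
imshow(maskedImage)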

For a detailed example of a custom training workflow, see the Perform Instance Segmentation Using SOLOv2 example.

Evaluate Instance Segmentation Results

Evaluate the quality of the instance segmentation results using the evaluateInstanceSegmentation function. Ensure that your ground truth datastore is set up so that calling the datastore with the read function returns a cell array with at least two elements in the format {masks labels}.

To calculate the prediction metrics, specify the output of the segmentObjects function and your ground truth data as inputs to the evaluateInstanceSegmentation function. The function computes metrics, such as the confusion matrix and average precision, and returns them in an instanceSegmentationMetrics object.
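
For example, this is a minimal sketch, assuming dsTest is a datastore of test images and dsTruth is a ground truth datastore that returns data in the format {masks labels}:

dsResults = segmentObjects(trainedNetwork,dsTest);          % segment every test image
metrics = evaluateInstanceSegmentation(dsResults,dsTruth);  % compare predictions to ground truth

You can then inspect the metrics stored in the returned instanceSegmentationMetrics object, metrics.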

References

[1] Wang, Xinlong, Rufeng Zhang, Tao Kong, Lei Li, and Chunhua Shen. “SOLOv2: Dynamic and Fast Instance Segmentation.” arXiv, October 23, 2020. https://doi.org/10.48550/arXiv.2003.10152.

[2] Brostow, Gabriel J., Julien Fauqueur, and Roberto Cipolla. "Semantic Object Classes in Video: A High-Definition Ground Truth Database." Pattern Recognition Letters 30, no. 2 (January 2009): 88–97. https://doi.org/10.1016/j.patrec.2008.04.005.
