Classify Videos Using Deep Learning with Custom Training Loop
This example shows how to create a network for video classification by combining a pretrained image classification model and a sequence classification network.
You can perform video classification without using a custom training loop by using the trainNetwork
function. For an example, see Classify Videos Using Deep Learning. However, If trainingOptions
does not provide the options you need (for example, a custom learning rate schedule), then you can define your own custom training loop as shown in this example.
To create a deep learning network for video classification:
Convert videos to sequences of feature vectors using a pretrained convolutional neural network, such as GoogLeNet, to extract features from each frame.
Train a sequence classification network on the sequences to predict the video labels.
Assemble a network that classifies videos directly by combining layers from both networks.
The following diagram illustrates the network architecture:
To input image sequences to the network, use a sequence input layer.
To extract features from the image sequences, use convolutional layers from the pretrained GoogLeNet network.
To classify the resulting vector sequences, include the sequence classification layers.
When training this type of network with the trainNetwork
function (not done in this example), you must use sequence folding and unfolding layers to process the video frames independently. When you train this type of network with a dlnetwork
object and a custom training loop (as in this example), sequence folding and unfolding layers are not required because the network uses dimension information given by the dlarray
dimension labels.
Load Pretrained Convolutional Network
To convert frames of videos to feature vectors, use the activations of a pretrained network.
Load a pretrained GoogLeNet model using the googlenet
function. This function requires the Deep Learning Toolbox™ Model for GoogLeNet Network support package. If this support package is not installed, then the function provides a download link.
netCNN = googlenet;
Load Data
Download the HMBD51 data set from HMDB: a large human motion database and extract the RAR file into a folder named "hmdb51_org"
. The data set contains about 2 GB of video data for 7000 clips over 51 classes, such as "drink"
, "run"
, and "shake_hands"
.
After extracting the RAR file, make sure that the folder hmdb51_org
contains subfolders named after the body motions. If it contains RAR files, you need to extract them as well. Use the supporting function hmdb51Files
to get the file names and the labels of the videos. To speed up training at the cost of accuracy, specify a fraction in the range [0 1] to read only a random subset of files from the database. If the fraction
input argument is not specified, the function hmdb51Files
reads the full dataset without changing the order of the files.
dataFolder = "hmdb51_org";
fraction = 1;
[files,labels] = hmdb51Files(dataFolder,fraction);
Read the first video using the readVideo
helper function, defined at the end of this example, and view the size of the video. The video is an H-by-W-by-C-by-T array, where H, W, C, and T are the height, width, number of channels, and number of frames of the video, respectively.
idx = 1; filename = files(idx); video = readVideo(filename); size(video)
ans = 1×4
240 352 3 115
View the corresponding label.
labels(idx)
ans = categorical
shoot_ball
To view the video, loop over the individual frames and use the image
function. Alternatively, you can use the implay
function (requires Image Processing Toolbox).
numFrames = size(video,4); figure for i = 1:numFrames frame = video(:,:,:,i); image(frame); xticklabels([]); yticklabels([]); drawnow end
Convert Frames to Feature Vectors
Use the convolutional network as a feature extractor: input video frames to the network and extract the activations. Convert the videos to sequences of feature vectors, where the feature vectors are the output of the activations
function on the last pooling layer of the GoogLeNet network ("pool5-7x7_s1"
).
This diagram illustrates the data flow through the network.
Read the video data using the readVideo
function, defined at the end of this example, and resize it to match the input size of the GoogLeNet network. Note that this step can take a long time to run. After converting the videos to sequences, save the sequences and corresponding labels in a MAT file in the tempdir
folder. If the MAT file already exists, then load the sequences and labels from the MAT file directly. In case a MAT file already exists but you want to overwrite it, set the variable overwriteSequences
to true
.
inputSize = netCNN.Layers(1).InputSize(1:2); layerName = "pool5-7x7_s1"; tempFile = fullfile(tempdir,"hmdb51_org.mat"); overwriteSequences = false; if exist(tempFile,'file') && ~overwriteSequences load(tempFile) else numFiles = numel(files); sequences = cell(numFiles,1); for i = 1:numFiles fprintf("Reading file %d of %d...\n", i, numFiles) video = readVideo(files(i)); video = imresize(video,inputSize); sequences{i,1} = activations(netCNN,video,layerName,'OutputAs','columns'); end % Save the sequences and the labels associated with them. save(tempFile,"sequences","labels","-v7.3"); end
View the sizes of the first few sequences. Each sequence is a D-by-T array, where D is the number of features (the output size of the pooling layer) and T is the number of frames of the video.
sequences(1:10)
ans=10×1 cell array
{1024×115 single}
{1024×227 single}
{1024×180 single}
{1024×40 single}
{1024×60 single}
{1024×156 single}
{1024×83 single}
{1024×42 single}
{1024×82 single}
{1024×110 single}
Prepare Training Data
Prepare the data for training by partitioning the data into training and validation partitions and removing any long sequences.
Create Training and Validation Partitions
Partition the data. Assign 90% of the data to the training partition and 10% to the validation partition.
numObservations = numel(sequences); idx = randperm(numObservations); N = floor(0.9 * numObservations); idxTrain = idx(1:N); sequencesTrain = sequences(idxTrain); labelsTrain = labels(idxTrain); idxValidation = idx(N+1:end); sequencesValidation = sequences(idxValidation); labelsValidation = labels(idxValidation);
Remove Long Sequences
Sequences that are much longer than typical sequences in the networks can introduce lots of padding into the training process. Having too much padding can negatively impact the classification accuracy.
Get the sequence lengths of the training data and visualize them in a histogram of the training data.
numObservationsTrain = numel(sequencesTrain); sequenceLengths = zeros(1,numObservationsTrain); for i = 1:numObservationsTrain sequence = sequencesTrain{i}; sequenceLengths(i) = size(sequence,2); end figure histogram(sequenceLengths) title("Sequence Lengths") xlabel("Sequence Length") ylabel("Frequency")
Only a few sequences have more than 400 time steps. To improve the classification accuracy, remove the training sequences that have more than 400 time steps along with their corresponding labels.
maxLength = 400; idx = sequenceLengths > maxLength; sequencesTrain(idx) = []; labelsTrain(idx) = [];
Create Datastore for Data
Create an arrayDatastore
object for the sequences and the labels, and then combine them into a single datastore.
dsXTrain = arrayDatastore(sequencesTrain,'OutputType','same'); dsYTrain = arrayDatastore(labelsTrain,'OutputType','cell'); dsTrain = combine(dsXTrain,dsYTrain);
Determine the classes in the training data.
classes = categories(labelsTrain);
Create Sequence Classification Network
Next, create a sequence classification network that can classify the sequences of feature vectors representing the videos.
Define the sequence classification network architecture. Specify the following network layers:
A sequence input layer with an input size corresponding to the feature dimension of the feature vectors.
A BiLSTM layer with 2000 hidden units with a dropout layer afterwards. To output only one label for each sequence, set the
'OutputMode'
option of the BiLSTM layer to'last'.
A dropout layer with a probability of 0.5.
A fully connected layer with an output size corresponding to the number of classes and a softmax layer.
numFeatures = size(sequencesTrain{1},1); numClasses = numel(categories(labelsTrain)); layers = [ sequenceInputLayer(numFeatures,'Name','sequence') bilstmLayer(2000,'OutputMode','last','Name','bilstm') dropoutLayer(0.5,'Name','drop') fullyConnectedLayer(numClasses,'Name','fc') softmaxLayer('Name','softmax') ];
Convert the layers to a layerGraph
object.
lgraph = layerGraph(layers);
Create a dlnetwork
object from the layer graph.
dlnet = dlnetwork(lgraph);
Specify Training Options
Train for 15 epochs and specify a mini-batch size of 16.
numEpochs = 15; miniBatchSize = 16;
Specify the options for Adam optimization. Specify an initial learning rate of 1e-4
with a decay of 0.001, a gradient decay of 0.9, and a squared gradient decay of 0.999.
initialLearnRate = 1e-4; decay = 0.001; gradDecay = 0.9; sqGradDecay = 0.999;
Visualize the training progress in a plot.
plots = "training-progress";
Train Sequence Classification Network
Create a minibatchqueue
object that processes and manages mini-batches of sequences during training. For each mini-batch:
Use the custom mini-batch preprocessing function
preprocessLabeledSequences
(defined at the end of this example) to convert the labels to dummy variables.Format the vector sequence data with the dimension labels
'CTB'
(channel, time, batch). By default, theminibatchqueue
object converts the data todlarray
objects with underlying typesingle
. Do not add a format to the class labels.Train on a GPU if one is available. By default, the
minibatchqueue
object converts each output to agpuArray
object if a GPU is available. Using a GPU requires Parallel Computing Toolbox™ and a supported GPU device. For information on supported devices, see GPU Computing Requirements (Parallel Computing Toolbox).
mbq = minibatchqueue(dsTrain,... 'MiniBatchSize',miniBatchSize,... 'MiniBatchFcn', @preprocessLabeledSequences,... 'MiniBatchFormat',{'CTB',''});
Initialize the training progress plot.
if plots == "training-progress" figure lineLossTrain = animatedline('Color',[0.85 0.325 0.098]); ylim([0 inf]) xlabel("Iteration") ylabel("Loss") grid on end
Initialize the average gradient and average squared gradient parameters for the Adam solver.
averageGrad = []; averageSqGrad = [];
Train the model using a custom training loop. For each epoch, shuffle the data and loop over mini-batches of data. For each mini-batch:
Evaluate the model gradients, state, and loss using
dlfeval
and themodelGradients
function and update the network state.Determine the learning rate for the time-based decay learning rate schedule: for each iteration, the solver uses the learning rate given by , where t is the iteration number, is the initial learning rate, and k is the decay.
Update the network parameters using the
adamupdate
function.Display the training progress.
Note that training can take a long time to run.
iteration = 0; start = tic; % Loop over epochs. for epoch = 1:numEpochs % Shuffle data. shuffle(mbq); % Loop over mini-batches. while hasdata(mbq) iteration = iteration + 1; % Read mini-batch of data. [dlX, dlY] = next(mbq); % Evaluate the model gradients, state, and loss using dlfeval and the % modelGradients function. [gradients,state,loss] = dlfeval(@modelGradients,dlnet,dlX,dlY); % Determine learning rate for time-based decay learning rate schedule. learnRate = initialLearnRate/(1 + decay*iteration); % Update the network parameters using the Adam optimizer. [dlnet,averageGrad,averageSqGrad] = adamupdate(dlnet,gradients,averageGrad,averageSqGrad, ... iteration,learnRate,gradDecay,sqGradDecay); % Display the training progress. if plots == "training-progress" D = duration(0,0,toc(start),'Format','hh:mm:ss'); addpoints(lineLossTrain,iteration,double(gather(extractdata(loss)))) title("Epoch: " + epoch + " of " + numEpochs + ", Elapsed: " + string(D)) drawnow end end end
Test Model
Test the classification accuracy of the model by comparing the predictions on the validation set with the true labels.
After training is complete, making predictions on new data does not require the labels.
To create a minibatchqueue
object for testing:
Create an array datastore containing only the predictors of the test data.
Specify the same mini-batch size used for training.
Preprocess the predictors using the
preprocessUnlabeledSequences
helper function, listed at the end of the example.For the single output of the datastore, specify the mini-batch format
'CTB'
(channel, time, batch).
dsXValidation = arrayDatastore(sequencesValidation,'OutputType','same'); mbqTest = minibatchqueue(dsXValidation, ... 'MiniBatchSize',miniBatchSize, ... 'MiniBatchFcn',@preprocessUnlabeledSequences, ... 'MiniBatchFormat','CTB');
Loop over the mini-batches and classify the images using the modelPredictions
helper function, listed at the end of the example.
predictions = modelPredictions(dlnet,mbqTest,classes);
Evaluate the classification accuracy by comparing the predicted labels to the true validation labels.
accuracy = mean(predictions == labelsValidation)
accuracy = 0.6721
Assemble Video Classification Network
To create a network that classifies videos directly, assemble a network using layers from both of the created networks. Use the layers from the convolutional network to transform the videos into vector sequences and the layers from the sequence classification network to classify the vector sequences.
The following diagram illustrates the network architecture:
To input image sequences to the network, use a sequence input layer.
To use convolutional layers to extract features, that is, to apply the convolutional operations to each frame of the videos independently, use the GoogLeNet convolutional layers.
To classify the resulting vector sequences, include the sequence classification layers.
When training this type of network with the trainNetwork
function (not done in this example), you have to use sequence folding and unfolding layers to process the video frames independently. When training this type of network with a dlnetwork
object and a custom training loop (as in this example), sequence folding and unfolding layers are not required because the network uses dimension information given by the dlarray
dimension labels.
Add Convolutional Layers
First, create a layer graph of the GoogLeNet network.
cnnLayers = layerGraph(netCNN);
Remove the input layer ("data"
) and the layers after the pooling layer used for the activations ("pool5-drop_7x7_s1"
, "loss3-classifier"
, "prob"
, and "output"
).
layerNames = ["data" "pool5-drop_7x7_s1" "loss3-classifier" "prob" "output"]; cnnLayers = removeLayers(cnnLayers,layerNames);
Add Sequence Input Layer
Create a sequence input layer that accepts image sequences containing images of the same input size as the GoogLeNet network. To normalize the images using the same average image as the GoogLeNet network, set the 'Normalization'
option of the sequence input layer to 'zerocenter'
and the 'Mean'
option to the average image of the input layer of GoogLeNet.
inputSize = netCNN.Layers(1).InputSize(1:2); averageImage = netCNN.Layers(1).Mean; inputLayer = sequenceInputLayer([inputSize 3], ... 'Normalization','zerocenter', ... 'Mean',averageImage, ... 'Name','input');
Add the sequence input layer to the layer graph. Connect the output of the input layer to the input of the first convolutional layer ("conv1-7x7_s2"
).
lgraph = addLayers(cnnLayers,inputLayer); lgraph = connectLayers(lgraph,"input","conv1-7x7_s2");
Add Sequence Classification Layers
Add the previously trained sequence classification network layers to the layer graph and connect them.
Take the layers from the sequence classification network and remove the sequence input layer.
lstmLayers = dlnet.Layers; lstmLayers(1) = [];
Add the sequence classification layers to the layer graph. Connect the last convolutional layer pool5-7x7_s1
to the bilstm
layer.
lgraph = addLayers(lgraph,lstmLayers); lgraph = connectLayers(lgraph,"pool5-7x7_s1","bilstm");
Convert to dlnetwork
To be able to do predictions, convert the layer graph to a dlnetwork
object.
dlnetAssembled = dlnetwork(lgraph)
dlnetAssembled = dlnetwork with properties: Layers: [144×1 nnet.cnn.layer.Layer] Connections: [170×2 table] Learnables: [119×3 table] State: [2×3 table] InputNames: {'input'} OutputNames: {'softmax'} Initialized: 1
Classify Using New Data
Unzip the file pushup_mathworker.zip.
unzip("pushup_mathworker.zip")
The extracted pushup_mathworker
folder contains a video of a push-up. Create a file datastore for this folder. Use a custom read function to read the videos.
ds = fileDatastore("pushup_mathworker", ... 'ReadFcn',@readVideo);
Read the first video from the datastore. To be able to read the video again, reset the datastore.
video = read(ds); reset(ds);
To view the video, loop over the individual frames and use the image
function. Alternatively, you can use the implay
function (requires Image Processing Toolbox).
numFrames = size(video,4); figure for i = 1:numFrames frame = video(:,:,:,i); image(frame); xticklabels([]); yticklabels([]); drawnow end
To preprocess the videos to have the input size expected by the network, use the transform
function and apply the imresize
function to each image in the datastore.
dsXTest = transform(ds,@(x) imresize(x,inputSize));
To manage and process the unlabeled videos, create a minibatchqueue:
Specify a mini-batch size of 1.
Preprocess the videos using the
preprocessUnlabeledVideos
helper function, listed at the end of the example.For the single output of the datastore, specify the mini-batch format
'SSCTB'
(spatial, spatial, channel, time, batch).
mbqTest = minibatchqueue(dsXTest,... 'MiniBatchSize',1,... 'MiniBatchFcn', @preprocessUnlabeledVideos,... 'MiniBatchFormat',{'SSCTB'});
Classify the videos using the modelPredictions
helper function, defined at the end of this example. The function expects three inputs: a dlnetwork
object, a minibatchqueue
object, and a cell array containing the network classes.
[predictions] = modelPredictions(dlnetAssembled,mbqTest,classes)
predictions = categorical
pushup
Helper Functions
Video Reading Function
The readVideo
function reads the video in filename
and returns an H-by-W-by-C-
by-T array, where H, W, C, and T are the height, width, number of channels, and number of frames of the video, respectively.
function video = readVideo(filename) vr = VideoReader(filename); H = vr.Height; W = vr.Width; C = 3; % Preallocate video array numFrames = floor(vr.Duration * vr.FrameRate); video = zeros(H,W,C,numFrames,'uint8'); % Read frames i = 0; while hasFrame(vr) i = i + 1; video(:,:,:,i) = readFrame(vr); end % Remove unallocated frames if size(video,4) > i video(:,:,:,i+1:end) = []; end end
Model Gradients Function
The modelGradients
function takes as input a dlnetwork
object dlnet
and a mini-batch of input data dlX
with corresponding labels Y
, and returns the gradients of the loss with respect to the learnable parameters in dlnet
, the network state, and the loss. To compute the gradients automatically, use the dlgradient
function.
function [gradients,state,loss] = modelGradients(dlnet,dlX,Y) [dlYPred,state] = forward(dlnet,dlX); loss = crossentropy(dlYPred,Y); gradients = dlgradient(loss,dlnet.Learnables); end
Model Predictions Function
The modelPredictions
function takes as input a dlnetwork
object dlnet
, a minibatchqueue
object of input data mbq
, and the network classes, and computes the model predictions by iterating over all data in the mini-batch queue. The function uses the onehotdecode
function to find the predicted class with the highest score. The function returns the predicted labels.
function [predictions] = modelPredictions(dlnet,mbq,classes) predictions = []; while hasdata(mbq) % Extract a mini-batch from the minibatchqueue and pass it to the % network for predictions [dlXTest] = next(mbq); dlYPred = predict(dlnet,dlXTest); % To obtain categorical labels, one-hot decode the predictions YPred = onehotdecode(dlYPred,classes,1)'; predictions = [predictions; YPred]; end end
Labeled Sequence Data Preprocessing Function
The preprocessLabeledSequences
function preprocesses the sequence data using the following steps:
Use the
padsequences
function to pad the sequences in the time dimension and concatenate them in the batch dimension.Extract the label data from the incoming cell array and concatenate into a categorical array.
One-hot encode the categorical labels into numeric arrays.
Transpose the array of one-hot encoded labels to match the shape of the network output.
function [X, Y] = preprocessLabeledSequences(XCell,YCell) % Pad the sequences with zeros in the second dimension (time) and concatenate along the third % dimension (batch) X = padsequences(XCell,2); % Extract label data from cell and concatenate Y = cat(1,YCell{1:end}); % One-hot encode labels Y = onehotencode(Y,2); % Transpose the encoded labels to match the network output Y = Y'; end
Unlabeled Sequence Data Preprocessing Function
The preprocessUnlabeledSequences
function preprocesses the sequence data using the padsequences
function. This function pads the sequences with zeros in the time dimension and concatenates the result in the batch dimension.
function [X] = preprocessUnlabeledSequences(XCell) % Pad the sequences with zeros in the second dimension (time) and concatenate along the third % dimension (batch) X = padsequences(XCell,2); end
Unlabeled Video Data Preprocessing Function
The preprocessUnlabeledVideos
function preprocesses unlabeled video data using the padsequences
function. This function pads the videos with zero in the time dimension and concatenates the result in the batch dimension.
function [X] = preprocessUnlabeledVideos(XCell) % Pad the sequences with zeros in the fourth dimension (time) and % concatenate along the fifth dimension (batch) X = padsequences(XCell,4); end
See Also
lstmLayer
| sequenceInputLayer
| dlfeval
| dlgradient
| dlarray
Related Topics
- Define Custom Training Loops, Loss Functions, and Networks
- Time Series Forecasting Using Deep Learning
- Sequence-to-Sequence Classification Using Deep Learning
- Sequence-to-Sequence Regression Using Deep Learning
- Sequence-to-One Regression Using Deep Learning
- Classify Videos Using Deep Learning
- Long Short-Term Memory Neural Networks
- Deep Learning in MATLAB