Object Recognition: Deep Learning and Machine Learning for Computer Vision

Object recognition is enabling innovative systems like self-driving cars, image-based retrieval, and autonomous robotics. The machine learning and deep learning models these systems rely on can be difficult to train, evaluate, and compare.

In this webinar we explore how MATLAB addresses the most common challenges encountered while developing object recognition systems. This webinar will cover new capabilities for deep learning, machine learning and computer vision.

We will use real-world examples to demonstrate:

Training models using large image datasets
Training deep neural networks from scratch
Using transfer learning to re-use trained deep networks for new tasks
Exploring the tradeoffs between machine learning and deep learning

About the Presenters

Johanna Pingel joined the MathWorks team in 2013, specializing in Image Processing and Computer Vision applications with MATLAB. She has an M.S. degree from Rensselaer Polytechnic Institute and a B.A. degree from Carnegie Mellon University. She has been working in the Computer Vision application space for over 5 years, with a focus on object detection and tracking.

Avinash Nehemiah works on computer vision applications in technical marketing at MathWorks. Prior to joining MathWorks he spent 7 years as an algorithm developer and researcher designing computer vision algorithms for hospital safety and video surveillance. He holds an MSEE degree from Carnegie Mellon University.

Recorded: 7 Mar 2017

Hello and welcome to the Object Recognition webinar. My name's Johanna, and I'll be talking to you today about machine learning and deep learning. The agenda today is to go over two real-world examples in MATLAB of object recognition using machine learning and deep learning. The two demos are going to be scene classification and object classification. In other words, where am I, and what am I looking at?

So let's jump right in. Our first demo today will be to automatically classify what scene we are in. And for this demo, we're going to be using a machine learning approach. So in order to get calibrated, let's define what a machine learning workflow looks like. And later, we'll define a deep learning workflow and the differences between the two. Let's say I want to classify images into three different categories. And for this example, I chose three different categories of animals.

I first want to take a training image and extract features from it. Features can be anything from color, or edges, or corners. Anything that can represent an image that is invariant to different lighting conditions and can also handle changes in rotation and scale. Once we extract those features, we want to add them to our machine learning model. And we don't want to just do this for the one image; we want to have many images of the object that we're looking to recognize.

So once again, we take all of those images, we extract those features, and we add them to our model. So if I'm looking to classify dogs, cats, and turkeys, we want to extract those features and add those to our model. We know what class these features represent, so we want to train our model to fit these features into three distinct buckets. We now have a trained model, so when a test image comes in, we can extract those same features and predict what category those features best represent. So keeping this model in mind, let's jump into MATLAB to perform this feature extraction and classification.
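The workflow just described (extract features, fit a model to labeled features, then predict on a new image) can be sketched in a few lines. This is only an illustration in Python, not the MATLAB code used in the demo: the two-dimensional "features" are made-up numbers, and the nearest-centroid rule and the names `train_centroids` and `predict` are assumptions chosen to keep the example small.

```python
from math import dist

def train_centroids(features, labels):
    """Average the feature vectors of each class into one centroid per class."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, vec):
    """Return the class whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda label: dist(centroids[label], vec))

# Toy 2-D features standing in for real image descriptors (made-up values).
train_x = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.5, 0.5], [0.6, 0.4]]
train_y = ["dog", "dog", "cat", "cat", "turkey", "turkey"]

model = train_centroids(train_x, train_y)   # "training": one bucket per class
print(predict(model, [0.85, 0.15]))         # features of a new test image
```

A real classifier would replace the centroid rule, but the shape of the workflow stays the same: the same feature extraction is applied to training and test images, and only the training features carry labels.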

The goal of this demo is to identify what scene an image is from, with the goal of answering the question, where am I? Am I in a field, in an auditorium, on the beach, or in a restaurant? And these locations are fairly diverse and have lots of different variations. So we want lots of samples from each category to help correctly differentiate between the different categories. So I want to bring up the first challenge of how we're supposed to read in all of these images. We don't want to have to loop through each file in a directory and check that it's a proper image format.

Instead, we can use a function in MATLAB called imageDatastore to automatically read in all of the images in a directory. We can specify whether we want to read the subfolders or not, and we can label our data based on the folder names. So if I already have my data labeled in subfolders with the proper names, imageDatastore will read in my data from these folders and label them accordingly. Note that another benefit of using imageDatastore is that it can handle massive amounts of data, and handle reading in the images even if they don't all fit into memory at once, which makes it great for machine learning and deep learning tasks.

With imageDatastore, I can also do things like count the number of images in each category. So we can see here all of the labels of the folders and the number of images in each category. I can also split each label into a certain number of images. So here I am asking for 16 different images from each directory, but later on, I'm going to use this to separate a training set and a test set. I can show a subset of images in a montage that are associated with an auditorium, and I can also show a small sample from each of the four categories.
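To make the bookkeeping concrete, here is a rough Python sketch of labeling files by folder name, counting images per label, and taking a fixed number of images from each label. It is not the MATLAB datastore itself: the file paths are invented, the helper names `count_each_label` and `split_each_label` are hypothetical, and picking the images at random with a fixed seed is an assumption of this sketch.

```python
import random
from collections import Counter, defaultdict

def count_each_label(labels):
    """Number of images per label."""
    return Counter(labels)

def split_each_label(items, labels, n_first, seed=0):
    """Per label, put n_first randomly chosen items in the first set and the rest in the second."""
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)
    rng = random.Random(seed)
    first, second = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        first += [(item, label) for item in group[:n_first]]
        second += [(item, label) for item in group[n_first:]]
    return first, second

# Invented file list: the folder name doubles as the label.
scenes = ["auditorium", "beach", "field", "restaurant"]
files = [f"{scene}/img{i}.jpg" for scene in scenes for i in range(5)]
labels = [path.split("/")[0] for path in files]

print(count_each_label(labels))
train_set, test_set = split_each_label(files, labels, n_first=3)
print(len(train_set), len(test_set))
```

Splitting per label, rather than over the whole list, keeps every category represented in both sets.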

Now, I'm going to separate my images into a training set and a test set. And I'm selecting 700 images for my training set, and 200 for my test set. Now, I want to take these images in my training set and extract features from them. And I have a lot of different options for extracting features. I could use edges or corners. The list goes on and on. But I want to highlight one feature extractor called bagOfFeatures. And I'm going to run this, and at the same time pull up a slide, so we can talk about visually what's going on.

bagOfFeatures could also be defined as bag of visual words. So let's take our example. If I were to ask, how would you describe a beach, what words would you use? Maybe water, sand, trees. How about a restaurant? Maybe words like food, table, chairs. So I have listed these four words. How likely is it that I see these words for these scenes? Maybe the probabilities look something like this for a beach, and something like this for a restaurant. And there is some overlap. These words are not mutually exclusive.

By the way, I made these numbers up, so please don't quote me that 40% of beaches have trees. So we count the number of times that we see these words, and based on that frequency, we can describe different scenes. So we can look at our code, and we essentially have two lines of code to do this. bagOfFeatures, which defines the most relevant words, and then we count the number of occurrences.

So in the time it took me to say all that, the creation of the bag of features is complete. Now, we can visualize those features. So we're pulling a random image from each of our categories, and we've asked for 250 words to describe these scenes. And on the right, you can see the occurrences of those words. Now, we want to be able to create a classifier that takes those occurrences and is able to differentiate between the four classes.
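The counting step can be sketched as follows: each local descriptor from an image is assigned to its nearest visual word, and the image is then described by how often each word occurs. This is a simplified Python illustration, not the MATLAB implementation. The three-word vocabulary and the descriptors are made-up numbers (in practice the vocabulary would be learned by clustering descriptors from the training images), and normalizing the histogram to sum to 1 is an assumption of this sketch.

```python
from math import dist

def encode(descriptors, vocabulary):
    """Histogram of nearest-visual-word assignments, normalized to sum to 1."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        # Index of the visual word closest to this descriptor.
        nearest = min(range(len(vocabulary)), key=lambda k: dist(vocabulary[k], d))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Made-up 2-D vocabulary of three "visual words" and four descriptors from one image.
vocabulary = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 0.1], [1.0, 0.2], [0.1, 0.8]]

histogram = encode(descriptors, vocabulary)
print(histogram)  # → [0.25, 0.5, 0.25]
```

The resulting histogram has one entry per word, so an image described with 250 words becomes a 250-element feature vector regardless of how many descriptors it produced.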

There are a lot of different machine learning techniques that we can use to create a classifier. And to help us determine what works best for our data interactively, we can use an app called the Classification Learner. There are a lot of different options in this app, so I encourage you to try this out on your own data, since we don't have time to go through all of the options in detail. But I want to point your attention to the list of all of the classifiers.

The one question I get a lot is, how do I know which classifier to use, or what classifier will work best on my data? And the answer is I have no idea. Everyone's data is different, so I can't say, always use a quadratic SVM, for example. If that worked every time, we wouldn't need so many options. Instead, I can tell you that you should try out a bunch of different classifiers and see what works best for your data. So we can try a quadratic SVM, for example, and see how that performs on our data. We can also use this app to try all of the SVMs. And maybe we want to try all of the KNNs, as well.

So at the end of this, I want to pick the best classifier with the highest accuracy, which happens to be the medium Gaussian SVM. One other feature in this app is the confusion matrix. If the trained classifier was working perfectly, 100% accurate, we would see everything green along the diagonal, which would mean that every time it predicted a scene, it got it right. Instead, there is some red on the outside of the diagonal.

So for example, there were 26 cases where it predicted a beach but it was actually a field. And this tells us that sometimes it may mistake beaches for fields. And this could give us an indication of what features we might want to use or avoid, and what we should be on the lookout for in our data to maybe help improve the accuracy of the classifier.
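For readers who want to see how such a matrix is built, here is a small Python sketch. The labels are invented toy data, not the results from the demo, and the convention used here (rows are the actual class, columns are the predicted class) is an assumption; tools differ on which axis is which.

```python
def confusion_matrix(actual, predicted, classes):
    """Count matrix with one row per actual class and one column per predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. predicted correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

classes = ["beach", "field"]
actual = ["beach", "beach", "field", "field", "field"]
predicted = ["beach", "beach", "beach", "field", "field"]

cm = confusion_matrix(actual, predicted, classes)
print(cm)            # → [[2, 0], [1, 2]]
print(accuracy(cm))  # → 0.8
```

In this toy case the single off-diagonal entry is a field that was predicted as a beach, which is the same kind of mistake pointed out in the app.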

I'm going to export that model, but do keep in mind that I can also generate MATLAB code from this app. So if I'm looking to repeat this process over and over again, I can automate this process and not have to use the interaction of the app every time. Once I've exported the model from the Classification Learner app, it gives me a way to predict on new data. And I can use that line of code to predict on my test data.

So in this case, I'm extracting those bagOfFeatures for every image in my test set, and I'm going to predict the category using the trained classifier. So for each of my 200 images in my test set, I get an accuracy of roughly 70%. Keep this number in mind, because we're going to revisit this example later on in the webinar and see if we can use different techniques to potentially increase this accuracy.

The final thing I'm doing in this script is testing the classifier visually. So I can pick a random image from my test set, and I can see visually which images the classifier is getting right and wrong. Maybe, based on that information, I can decide to use different features or a different classifier to potentially increase the accuracy.

So that concludes our first demo. We just went through a lot of content, so let me give you three things to take away from that demo. The first is the Classification Learner app. You don't need to be a machine learning expert to use this app. You don't need to know which classifier is going to work best. You can try as many as you'd like, or all of them, and see which one works best for your data. bagOfFeatures was the feature extraction technique that we used. You can call this in one line of MATLAB code. Finally, try imageDatastore to be able to handle reading in directories of data without worrying about it all fitting into memory at once.

Moving on. You've already seen this slide, and we've used this workflow in the previous demo. Now, how does this workflow compare to a deep learning one? We're going to take everything related to feature extraction and classification, and bundle it into a CNN, or a convolutional neural network. This is one of the most common deep learning architectures. This will perform end-to-end feature learning and classification for us.

So we will feed our images directly into the network, and it will learn the best features and perform the classification. So I want to discuss two different approaches to a deep learning task. The first is to create your own neural network from scratch, and the second is to take a pre-trained model and use that to perform a new classification task. So in the second case, I have a network that someone else has already trained to classify many objects, such as cars, trucks, and bicycles. And now, I want to fine-tune the network to perform a new classification task.

So training the network from scratch is one approach to deep learning. You would essentially be responsible for setting up all of the layers of the network and the weights. And MATLAB is a great environment to try this approach. However, be aware that many training samples are needed, as in millions of images. And training the network can be quite time-consuming.

An alternative approach is transfer learning, which can be fairly accurate in a smaller amount of time and with less data. So for the next demo, we'll see how to perform transfer learning to classify five new categories of objects. For this demo, let's import a pre-trained CNN, and retrain it to classify five different foods that you might find in a restaurant. I have my five different categories of food separated into five different folders. So I can use imageDatastore to import that data, and label it based on those folder names.

I first want to import my pre-trained CNN. I've already downloaded the CNN, so I'm simply going to load it in from a saved file. But we do have a handy helperImport function to help import the data and convert it to the proper series network format.

Now, we can take a look at the layers of this network. We can see that the CNN has already been trained to classify 1,000 different objects, and we want to change this to instead classify our five different objects. We can also take a look at the first layer, the input layer, to see what type of data it's expecting. So we need to make sure that our input data is of these same dimensions.

Now, we want to manipulate the CNN to perform our task. I'm going to take that last layer that's classifying the 1,000 categories, and switch that to 5. I'm also going to alter the learning rates. So the last layer, the one that I'm manipulating, I want that to learn the fastest, and I want to keep all of the other learning rates the same. So all of the training that has already been accomplished in the network in the earlier layers does not get altered as much as the final layer, where I'm doing the five-category classification.
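The idea of letting the new last layer learn faster than the pre-trained layers can be illustrated with a single gradient-descent step in which each layer's update is scaled by its own factor. This is a Python sketch of the concept only, not the training code from the demo: the layer names, weights, gradients, and the factor of 20 are all made-up values, and `sgd_step` is a hypothetical helper.

```python
def sgd_step(weights, grads, base_rate, rate_factors):
    """One gradient-descent step where each layer's step is scaled by its own factor."""
    return {
        layer: [
            w - base_rate * rate_factors[layer] * g
            for w, g in zip(weights[layer], grads[layer])
        ]
        for layer in weights
    }

# Made-up one-weight "layers": a pre-trained early layer and the new 5-class layer.
weights = {"conv1": [1.0], "fc_new": [1.0]}
grads = {"conv1": [0.5], "fc_new": [0.5]}
rate_factors = {"conv1": 1, "fc_new": 20}  # the new layer learns 20x faster (assumed value)

updated = sgd_step(weights, grads, base_rate=0.001, rate_factors=rate_factors)
for layer in weights:
    print(layer, "moved by", weights[layer][0] - updated[layer][0])
```

With the same gradient, the new layer moves twenty times further than the early layer, so the pre-trained weights are only nudged while the replaced layer adapts quickly to the new categories.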

Finally, before we retrain, I want to ensure that each class has the same number of input images. And I once again want to set aside some images in a training and a test set, the test set to be able to validate the classifier. Here's where the training happens. If we take a look at the documentation, we have a number of different options available for how we can alter the network. I'll just point out a few here. The initial learning rate and the maximum number of epochs, or stages.

So for this example, you'll see that I set the learning rate lower than the default, and also, the maximum stages I set to 20. You can change how the network is trained by changing these parameters, but because there are so many combinations of training options, it may take a few iterations to find the best combination to work best with your data. Now, I'm going to run this. And in the interest of time, I've sped up the training to show what it looks like to train the entire network. So keep in mind that the actual time this took to train is much longer than you're seeing in this video.

While this is training, you get information on how the network is performing through each of the stages. You can see that it is getting a very high accuracy in just a few stages. And while this might sound like a good thing, it could be an indication that the network is overfitting the data. Now that the training is done, we want to test to see how well it does on our data. We set a number of test samples aside, and now we can use those to classify what category each of the images is predicted to be. So for all of our test data, we get an accuracy of around 84%.

We can take a look at the confusion matrix, which will give us the accuracy of each category along the diagonal. And we can see the lowest accuracy is in the third column, which corresponds to the hamburger class. I bet you didn't think you were going to be hearing the words “hamburger class” in this webinar. So we can write a few lines of MATLAB code to pull out the test images of the class that we want to investigate, and we can also limit the images to those that were misclassified only. This can be a way of visualizing why the network is misidentifying a particular category of images.

And we can see that some of these images of hamburgers aren't very good. They're either blurry, they might have bad lighting, and some aren't even hamburgers at all. So this is a good reminder to make sure that your data that you're trying to represent is of high quality and represents how you hope the classifier will respond.

So going back to the very beginning of the script again, that's exactly what I did. I went back through all of the data and cut out the images that weren't of high-enough quality—either low lighting or weird angles—and also weren't of the object that we were trying to classify. And this is a good reminder that before you get too deep into training a neural network and trying all of those different parameters to improve the accuracy of your classifier, make sure that the network isn't getting confused based on the quality of your images.

Another thing I changed for the second pass was to add another layer to the network. This is just one other trick that you can use to prevent overfitting your data, and to add more nonlinearity to the network. This is a completely optional step. I'm going to run this from the beginning again, but in the interest of time, I'm not going to show you the entire training of the network. Instead, I'm going to load in a trained network that I've already saved.

This time when we test the network, we get an accuracy of a little over 90%. So with those small changes, we can directly affect the accuracy of the results. So in order to truly validate the CNN, I went out to a bunch of restaurants and took video from those five categories. And I know what you're thinking, it's a tough job, but someone had to do it. And we can see, as the video is streaming, how the classifier is working, and also the confidence of that prediction. Sadly, I'm in an office right now, and not in the restaurant. But keep in mind that while this is a pre-recorded video, I can also run this on a livestream as well.

So here are two takeaways from the second demo. The first is to make sure to always use good data. Do this before spending lots of time training the network. If you are passing bad images into the network, expect the accuracy to be lower than you would like. The second is to remember transfer learning can be a very powerful technique to classify new objects. You have control over the network with a variety of training options. It may take a few times to get the right combination of options to converge on a highly accurate solution, but MATLAB is the right environment to try these combinations out quickly.

Our final example is going to be a combination of machine learning and a deep learning approach. So what does that look like? We have our deep learning workflow again, but now we're going to use deep learning as a feature extractor and pull those features into a traditional machine learning algorithm. So we're taking advantage of the many layers of the CNN architecture to extract the most relevant features from our images.

Let's see this machine learning and deep learning combined approach in action. So as promised, we're revisiting the first demo. We're using the same categories as before, and we still want to identify what scene we're in, this time using a combination of machine learning and a deep learning approach. So some of this should look familiar by now. We're still using imageDatastore to handle accessing our images, and we're splitting each label into exactly the same number of images. In this case, the minimum set is 904.

And we can look at a sample image from each of our four categories. Once I've imported the pre-trained CNN, I can take a look at the layers. Pulling the features out of the layer FC7 gives us the combination of training from all of the previous layers. And we can look at the first layer as a reminder that there is a requirement that the input images be 227 by 227 by 3.

I can specify my own custom read function using imageDatastore. This allows me to do any pre-processing of my data all in one place. The custom read function would also be the place I could specify how to read in any non-standard image formats. But in this case, I'm simply ensuring it's an RGB image, and also resizing that image to the 227-by-227 requirement.
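The two steps of that read function (make sure the image has three channels, then resize it to the network's input size) can be sketched in plain Python. This is an illustration only, not the MATLAB read function from the demo: images are represented as nested lists, the helper names `to_rgb`, `resize_nearest`, and `read_for_network` are hypothetical, and nearest-neighbor resizing is an assumption chosen for brevity.

```python
def to_rgb(image):
    """Replicate a grayscale image (rows of numbers) into three channels; leave RGB as is."""
    if isinstance(image[0][0], (int, float)):
        return [[(p, p, p) for p in row] for row in image]
    return image

def resize_nearest(image, out_h, out_w):
    """Nearest-neighbor resize: each output pixel copies the closest source pixel."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

def read_for_network(image, size=227):
    """Preprocess one image so it matches a size-by-size-by-3 input layer."""
    return resize_nearest(to_rgb(image), size, size)

# A tiny 2-by-2 grayscale image standing in for a file read from disk.
gray = [[0, 255], [128, 64]]
ready = read_for_network(gray)
print(len(ready), len(ready[0]), len(ready[0][0]))  # → 227 227 3
```

Doing the conversion inside the read function means every image, whether used for training or testing, reaches the network in the same shape.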

So I once again want to separate my data into a training set and a test set. And now, I want to extract those features from the CNN. The activations of the network at FC7 are the result of passing the image through all of the learned filters of the CNN. This is likely going to make a good feature extractor, because the original CNN was trained using more than a million images.

So now we have those features. This is typically when we can go to the Classification Learner app and choose a classifier. But if you already have extracted code from the app, or if you simply know the classifier you want to use and the function to call it, you can call that directly in the script.

So now I have a linear SVM that was trained using the features from the CNN, and I want to evaluate how that classifier performs using the test data. I'm going to extract the same features from my test data, and I'm going to predict using the classifier that was just created. And as this is running, recall the accuracy of the first example, where we were getting roughly 70% accuracy. So in this case, it looks like this method was able to considerably increase the accuracy of the prediction by changing the feature extraction method to a deep learning approach. Finally, we can visually inspect how our classifier is performing by going through random images from our test set and comparing the prediction with the actual results.

That concludes the final demo for this webinar. Before we finish, I want to start a discussion on deep learning versus machine learning. You've seen three demos today which featured machine learning and deep learning techniques, but how do you know which one to use? With machine learning, you will have the option to train on many different classifiers, and you have a wide range of feature extraction methods to choose from, as well.

Also, if you understand your data, you may intuitively know which features are the best that will produce good results. Plus you have the flexibility to choose a combination of different classifiers and different features, and see which combination works best for your data. And MATLAB is a great tool to try these combinations quickly. There is a lot of hype recently around deep learning, and for good reason.

The accuracy of some models is very, very high. You don't have to understand which features are best to represent in the model. They're chosen for you. But a deep learning model can take a while to train. And because it is a semi black-box solution, if something isn't working correctly, it may be hard to figure out how to debug it.

I'll leave you with a summary of the last demo, which is a deep learning and a machine learning combination approach. You can use deep learning as a powerful feature extractor, and then have the ability to choose a classifier that gives you the most accurate results. But always remember, there is not going to be a one-size-fits-all approach, so be sure to try out as many options as possible for getting the best results for your particular application.

Thank you for tuning in today. If you have any questions, you can reach me and the rest of the computer vision and deep learning team at image-processing@mathworks.com.

Related Products

Deep Learning Toolbox
