What Is Transfer Learning? | Deep Learning for Engineers, Part 4
From the series: Deep Learning for Engineers
Brian Douglas
* This video includes Korean subtitles.
In an earlier video, we showed an example of using an accelerometer to recognize and count high fives. This video explains how that was accomplished, how the data was preprocessed, and how transfer learning made it possible to solve this problem in just a couple of hours.
Published: 1 April 2021
In the first video in this series, I talked about how we can use deep learning to recognize complex patterns in data, and I sort of jokingly stated that one example could be to recognize and count high fives using an accelerometer. Well, it turns out it wasn’t so much of a joke because this is me testing it out on a network that I just trained. And it’s working surprisingly well. So, in this video, I want to talk about how this was accomplished, the data preprocessing I did, and why transfer learning made it possible for me to solve this problem in just a couple of hours. I hope you stick around for it. I’m Brian, and welcome to a MATLAB Tech Talk.
For this project, I’m using a simple MEMS accelerometer that is wired to an Arduino. The Arduino reads the sensor and prints the measurements onto the serial bus, which is then connected to my computer via USB. I’m reading the serial bus with MATLAB and plotting the last second and a half of data onto the screen. You can see how the plot changes as I move the accelerometer around. Now, this is just sort of random motion, but what I really want is something that can recognize a very specific sequence of accelerations: the high five. Of course, this was kind of a simple high five motion since the cables I’m using are quite short, but this is the motion that I’m going to teach a deep neural network to learn, and you’ll just have to imagine that this is a full-scale high five in all of its glory.
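To give a sense of how that streaming setup might look in code, here is a minimal MATLAB sketch. The port name, baud rate, sample rate, and the one-line-per-sample print format are all assumptions for illustration, not details confirmed in the video.

```matlab
% Minimal sketch of the streaming setup. The port, baud rate, sample
% rate, and Arduino print format are assumptions.
s = serialport("COM3", 115200);          % Arduino prints one "ax ay az" line per sample
fs = 100;                                % assumed sample rate in Hz
winLen = round(1.5 * fs);                % 1.5-second rolling buffer
buffer = zeros(winLen, 3);

while true
    sample = sscanf(readline(s), "%f %f %f")';  % parse one x/y/z measurement
    buffer = [buffer(2:end, :); sample];        % slide the window forward
    plot(buffer);                               % plot the last 1.5 s of data
    drawnow limitrate;
end
```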
Ok, so this is the acceleration pattern that I’m looking for. And the general approach I’m taking to solve this problem is to turn this 3-dimensional signal into images that I can use to train a deep neural network. This is similar to what we did in the last video, where we preprocessed audio signals with the short-time Fourier transform to create a spectrogram.
And I could do that here as well and it would probably work just fine, but what I’m going to do instead is use the continuous wavelet transform to create a scalogram. A scalogram is another time-frequency representation of a signal, but it differs from the spectrogram in a way that makes it more suitable for signals that exist at multiple scales. That is, signals that are low frequency and slowly varying, but are occasionally interrupted by high-frequency transients. Scalograms are useful for signals like ECGs that have this characteristic heartbeat pattern, and as it turns out, they are useful for the occasional high five in an otherwise slowly moving hand.
I’m not going to go into the details of the scalogram in this video since there are a lot of other things I want to cover, but I left several great resources in the description that you should check out if you want to learn what it is and how to create one in MATLAB.
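That said, for reference, a basic scalogram only takes a couple of lines with the cwt function from the Wavelet Toolbox. Here is a minimal sketch on a synthetic signal; the sample rate and the test signal are assumptions for illustration only.

```matlab
% Minimal sketch: a scalogram of a synthetic signal that is slowly varying
% with one short high-frequency transient (loosely high-five-like).
fs = 100;                                   % assumed sample rate in Hz
t  = (0:1.5*fs - 1) / fs;                   % 1.5-second window
x  = sin(2*pi*2*t) + (t > 0.7 & t < 0.8) .* sin(2*pi*30*t);
[wt, f] = cwt(x, fs);                       % continuous wavelet transform
imagesc(t, f, abs(wt)); axis xy;            % magnitudes as a time-frequency image
xlabel("Time (s)"); ylabel("Frequency (Hz)"); title("Scalogram");
```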
Alright, the first thing I want to point out is that we need 3 images to represent a high five, one per axis, and therefore I have to feed three images into my network architecture. Luckily, many architectures are already set up to accept color images, which have a red, green, and blue channel. So, to fit with existing architectures, what I did was assign the x scalogram to the red channel, y to the green, and z to the blue, and then combine them to create a single color image. And this sort of pinkish volcano with some smoke coming off the peak is the scalogram representation of a high five.
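A sketch of that channel stacking might look like the following. The rescaling and resizing choices are assumptions, and accelX, accelY, and accelZ stand in for the three buffered acceleration channels from the earlier sketch.

```matlab
% Minimal sketch of combining three per-axis scalograms into one RGB image.
% accelX/Y/Z are the buffered channels; fs is the assumed sample rate.
cfsX = abs(cwt(accelX, fs));             % x-axis scalogram magnitudes
cfsY = abs(cwt(accelY, fs));
cfsZ = abs(cwt(accelZ, fs));
img  = cat(3, rescale(cfsX), ...         % x -> red channel
              rescale(cfsY), ...         % y -> green channel
              rescale(cfsZ));            % z -> blue channel
img  = imresize(img, [224 224]);         % GoogleNet expects 224x224x3 input
imshow(img)
```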
And this image looks pretty cool, I think, but it gets even cooler if we view the scalogram on the real-time acceleration stream. At each sample time, I receive a new measurement from the sensor, I update the buffer of the last few seconds of data, and I create the scalogram. You can see how the patterns and colors change as I move the sensor around. And what’s really cool is that you probably couldn’t tell if I was doing a high five by looking at the raw acceleration data streaming by, but I’m sure you can see the characteristic pink volcano streaking past every once in a while. It’s a very obvious pattern. And now we can take this processed data and train a deep neural network to find that obvious pattern.
In order to do that, there are still two things I need. I need some training data with labeled scalograms of high fives and some with no high fives. And I need an architecture.
Let’s talk about the architecture first. Rather than design and train an architecture completely from scratch, we can build on what already exists with transfer learning. Transfer learning is modifying an existing architecture and then retraining it to accomplish your task rather than the task it was originally trained for.
To understand what that means, let’s revisit what we talked about in the first video with the very basic description of what these image-based architectures are doing. They are looking for patterns in data, or in this case, patterns in the images. They do that by looking for primitive features in the early layers, like blobs and edges and colors, then combining them into more complex features as you progress through the layers, and ultimately combining those into final patterns that can be labeled.
Again, this is an oversimplification of these networks, but it is a useful oversimplification for describing transfer learning. Let’s assume this is a network that is fully trained to recognize flowers in images.
Now obviously, a network trained to recognize flowers won’t do a great job of classifying high five patterns in our scalograms. But here’s the interesting thing: blobs, and color, and loops, and lines, and features like that exist in pretty much all images, including our scalograms. It’s only the last few layers in the network, which combine those features and do the final classification, that are very specific to the types of images you want to classify.
So, we can think of the first portion of a trained image classification network as a general feature recognizer. Therefore, we can keep it, chop off the last few layers responsible for the specific classification, and replace them with new layers that output the labels that we are looking to classify.
Now, theoretically training this network should be much faster and require much less data since there is a lot less that the network has to learn. It doesn’t need to learn how to recognize the primitive features, it only needs to learn how to combine them to recognize the larger patterns you’re looking for.
Which is great, because most image networks require millions of training images and weeks of time using several high-end GPUs to train. And I don’t have the time, the hardware, or the arm strength to create millions of high five training images and then train a giant network.
So, this is what I’m going to do. I’m going to start with an existing network, modify the last few layers, and then retrain it with a much smaller amount of training data. So, let’s get to it.
There are many different pretrained image classification networks that I can start from. They each have their own execution speeds, database sizes, and prediction accuracies. I’m going to go with GoogleNet for this project, but I assure you this choice didn’t come down to any logical reasoning; it’s just that the MATLAB example that I’m basing my project on uses GoogleNet and transfer learning to recognize patterns in ECG signals. So, to keep in line with that example, I’m doing the same.
We can open it in the Deep Network Designer app to visually see all of the layers and how they are connected. And check out this architecture! It has 144 layers. There are a bunch of different things going on in this architecture. There are some convolutional layers at the beginning, followed by a series of parallel inception layers, and there are a number of these groupings, one after another. Luckily, we don’t really need to concern ourselves with most of these layers for our transfer learning example. The important ones are at the end. This layer here is a fully connected layer, which means each input, of which there are 1024, connects to each output, of which there are 1000. Simplistically, the way we can think about this is that there are 1024 different complex features that this network has learned, and it uses combinations of those features to classify 1000 different objects. So, the network assigns the label associated with whichever neuron has the largest output value. It finds the maximum value with this probability layer and then determines the label with this output layer. You can see the labels here: tench, goldfish, great white shark, and 997 others.
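If you want to poke at those final layers programmatically rather than in the app, something like this minimal sketch would do it. It assumes the Deep Learning Toolbox Model for GoogLeNet support package is installed.

```matlab
% Minimal sketch: load pretrained GoogleNet and inspect its final layers.
net = googlenet;                 % requires the GoogLeNet support package
deepNetworkDesigner(net)         % open it in the app, or inspect in code:
net.Layers(end-2:end)            % fully connected, softmax, and output layers
net.Layers(end).Classes(1:3)     % tench, goldfish, great white shark, ...
```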
So, for this project, we only need to replace two layers: the fully connected layer and the output layer. I can drag a new fully connected layer into our network and then, by clicking on it, change its parameters. I want to set the output size to be 2, which essentially means that we want this layer to combine the 1024 input features to recognize just two main patterns. And since this layer hasn’t been trained at all, I’m going to increase the weight and bias learn rate factors to 5. This will allow this layer to make larger changes with each training cycle.
The second layer we need to replace is the final output layer, so that it only has two labels: high five and not high five. It’s going to determine what those labels are automatically from the training data; we don’t have to specify them here. And that’s it, we’ll use the rest of the network as is!
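For reference, here is a sketch of the same two edits done in code instead of in the app. The names 'loss3-classifier' and 'output' are GoogleNet's final fully connected and output layers; the new layer names are arbitrary.

```matlab
% Minimal sketch: the same two layer swaps done programmatically.
lgraph = layerGraph(googlenet);
newFC  = fullyConnectedLayer(2, "Name", "fc_highfive", ...
             "WeightLearnRateFactor", 5, "BiasLearnRateFactor", 5);
lgraph = replaceLayer(lgraph, "loss3-classifier", newFC);
newOut = classificationLayer("Name", "out_highfive");  % labels are set from the training data
lgraph = replaceLayer(lgraph, "output", newOut);
```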
Now we’re almost ready to start training, but first we need to create our training data.
The ECG example used about 160 images for training and validation, so to stay in line with that order of magnitude, I opted to record 100 high five images and 100 non-high-five images. I created a script that loops 100 times, recording 1.5 seconds of acceleration data each loop and saving off the scalogram in a data folder. I got pretty good at timing my high fives so that the event happened near the middle of the window, but I did have to go through the full list of images and prune out the ones that didn’t look so good. I didn’t want those bad images to corrupt the training. But overall, you can clearly see the pink volcano in all of these.
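That recording script might look something like the following sketch. Both recordWindow and makeScalogramRGB are hypothetical helpers standing in for the buffering and scalogram steps sketched earlier, not functions from the video.

```matlab
% Minimal sketch of the data-collection loop. recordWindow and
% makeScalogramRGB are hypothetical helpers, not real toolbox functions.
outDir = fullfile("data", "high_five");
if ~exist(outDir, "dir"), mkdir(outDir); end
for k = 1:100
    accel = recordWindow(s, 1.5, fs);     % hypothetical: 1.5 s of x/y/z samples
    img   = makeScalogramRGB(accel, fs);  % hypothetical: the RGB scalogram image
    imwrite(img, fullfile(outDir, sprintf("high_five_%03d.png", k)));
end
```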
I did the same thing for the other label, no high five. Here I just recorded various things like no motion, slight random motion, and other faster motions that are similar to but not quite a high five.
And that’s it. With my training data created and GoogleNet ready to be modified, I can start the training.
Let’s go back to the Deep Network Designer app in MATLAB. Under the Data tab, we can import all of the training data we just created. All of my data exists within the data folder and the high_five and no_high_five subfolders. The names of the subfolders are where it’s pulling the labels from. If I wanted to, I could also augment all of this data by scaling, translating, or rotating it in order to cover more of the possible conditions, but I’m going to leave it as is. Lastly, I’m going to have it randomly select 20 percent of the data to be used for validation.
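The app's import step is roughly equivalent to building an image datastore in code, something like this sketch:

```matlab
% Minimal sketch of the same import done in code: folder names become labels.
imds = imageDatastore("data", ...
    "IncludeSubfolders", true, ...
    "LabelSource", "foldernames");              % labels: high_five, no_high_five
[imdsTrain, imdsVal] = splitEachLabel(imds, 0.8, "randomized");  % 20% for validation
```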
So, this is a snapshot of my training data: we have 80 examples of high fives and 80 examples of no high fives. On the Training tab, I can now set my training options. Here, I’m making a few adjustments to how I want to train this network. I’m just changing them to match the training parameters that were in the MATLAB example that I showed you earlier. The link to that example is in the description of this video if you’d like to check it out. It goes over all of this stuff in really good detail. Alright, with the options set, we can start training this network.
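In code, those options plus the training call would look roughly like this. The specific values here are assumptions loosely based on the linked ECG example, not settings confirmed in the video.

```matlab
% Minimal sketch of training. The option values are assumptions loosely
% based on the linked ECG example, not confirmed from the video.
opts = trainingOptions("sgdm", ...
    "MiniBatchSize", 10, ...
    "MaxEpochs", 10, ...
    "InitialLearnRate", 1e-4, ...
    "ValidationData", imdsVal, ...
    "Plots", "training-progress");
trainedNet = trainNetwork(imdsTrain, lgraph, opts);
```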
So, let’s kick it off.
Now, while we wait for this to train, let’s talk about the steps we took to get here. We took a network that was trained to recognize objects like goldfish and sharks, and repurposed it to find patterns in 3-dimensional acceleration data. And while I used it to create something rather silly, this process can be used to find patterns in any data. GoogleNet, in particular, requires a color image as input, and so as long as you can format your data into an image, you could use this transfer learning process.
You could use it for object detection, like finding pedestrians or street signs. You could use it for predictive maintenance, finding patterns in data that indicate when a component will fail. You could use it to look for defects in materials. And you could use it to find patterns in data from wearable electronics, like determining if a person is walking or running, or if their head was impacted too hard. There are a million use cases for this.
There are also networks that are pretrained to classify sounds, like the VGGish network. And we can use transfer learning with it as well to get it to classify our own sounds in audio signals.
And what is really interesting about this whole thing is how quickly you can get a trained network up and running. I started this project by following along with an existing MATLAB example, but I essentially went from nothing to a trained network in about 2 hours. And this was with a single CPU. Now, your project might require more training data than I used, and maybe a GPU to train in a reasonable amount of time, but starting from an existing network will almost certainly require less time and data than starting from scratch. And by starting from an existing MATLAB example, of which there are a lot, you can jump into these types of problems much more quickly. I left a link to this example list in the description.
Alright, so we’ve got ourselves a trained network now, and it took about 4 minutes to train. Also, the network is 97% accurate, at least on the 20% of the data that we set aside for validation. So it missed one of the 40 images. Not too bad!
And now, this brings us back to the initial video I showed you of me testing out the trained network on the acceleration data stream. Each time sample, I would update the scalogram based on the latest acceleration data, and then feed that image into the classify function and have it return a label. If the label came back as high five, I displayed that and incremented my high five counter. And that’s pretty much the whole thing. It turned out to be easier than I was expecting, and the whole high five counter is actually pretty satisfying to run. So, I hope this has helped you understand the benefits of transfer learning, and maybe got you thinking about your particular pattern recognition problems and whether a technique like this could help you with classification.
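A minimal sketch of that live loop, reusing the hypothetical helpers from the earlier sketches, might look like this. A real version would probably also debounce the counter so one high five isn't counted across several consecutive windows; that detail is omitted here.

```matlab
% Minimal sketch of the live classification loop. recordWindow and
% makeScalogramRGB are the hypothetical helpers from earlier sketches.
count = 0;
while true
    accel = recordWindow(s, 1.5, fs);        % latest 1.5 s of x/y/z data
    img   = makeScalogramRGB(accel, fs);     % RGB scalogram image
    label = classify(trainedNet, img);       % returns high_five or no_high_five
    if label == "high_five"
        count = count + 1;
        fprintf("High five! Total: %d\n", count);
    end
end
```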
That is where I’ll leave this video. In the next video, I want to talk about verification of these trained networks and how we can be confident they are going to work. So, if you don’t want to miss that or any other Tech Talk video, don’t forget to subscribe to this channel. Also, if you want to check out my channel, Control System Lectures, I cover more control topics there as well. Thanks for watching, and I’ll see you next time.