Detection and classification of hand poses in Python
Today, I'll show you how I built a detection and classification algorithm for hand poses in Python.
This project will make use of the following principles:
If you want to follow along, I'll detail every step, or you can grab the finished code here on my GitHub. The README is pretty detailed and should give you everything you need to know to install the required packages and use the application.
Hand detection - SSD
For the hand detection, I'm using a pretrained MobileNet model, retrained by Victor Dibia.
The architecture is based on MobileNet, which was trained on the COCO dataset for object detection. The last layers are retrained on the EgoHands dataset. The CNN localizes the hand(s) in the image and returns the bounding box(es) corresponding to their locations.
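Detectors in this family typically return normalized bounding boxes with confidence scores; cropping the hand region then looks roughly like the numpy sketch below. The function name and score threshold are my own choices; the `[y_min, x_min, y_max, x_max]` normalized box format is the usual convention for TensorFlow detection models.

```python
import numpy as np

def crop_detections(frame, boxes, scores, threshold=0.5):
    """Crop the regions of `frame` whose detection score passes the threshold.

    `boxes` use the normalized [y_min, x_min, y_max, x_max] format that
    TensorFlow object-detection models typically return.
    """
    h, w = frame.shape[:2]
    crops = []
    for box, score in zip(boxes, scores):
        if score < threshold:
            continue  # skip low-confidence detections
        y1, x1, y2, x2 = box
        # Scale normalized coordinates back to pixel indices before slicing.
        crops.append(frame[int(y1 * h):int(y2 * h), int(x1 * w):int(x2 * w)])
    return crops

# Dummy 480x640 RGB frame with one confident detection.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
crops = crop_detections(frame, boxes=[(0.25, 0.25, 0.75, 0.75)], scores=[0.9])
print(crops[0].shape)  # (240, 320, 3)
```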
Once we have the location of the hand, we can feed the cropped image to the CNN for pose classification.
Pose classification - CNN
The CNN architecture used is the following:
The CNN architecture is nothing special. It's one of the simplest CNN architectures you usually find for tackling the MNIST digit classification problem.
It consists of a grayscale input of size 28x28, two convolutional layers with an increasing number of filters, a max-pooling layer for dimensionality reduction, and finally a fully connected dense layer, followed by a softmax that gives us 6 class prediction values between 0 and 1.
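As a sketch, such a network could be written in Keras like this. The exact filter and unit counts are my assumptions, not necessarily the ones used in the project; only the 28x28 grayscale input and the 6-way softmax come from the description above.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Minimal MNIST-style CNN: grayscale 28x28 input, two conv layers with
# increasing filter counts, max pooling, one dense layer, 6-way softmax.
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(6, activation="softmax"),  # one confidence per pose class
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# Sanity check: a dummy image produces 6 probabilities summing to 1.
probs = model.predict(np.zeros((1, 28, 28, 1)), verbose=0)
print(probs.shape)  # (1, 6)
```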
I'll first describe the whole pipeline of the program; you'll find more detail below on how to train the CNN and how to create the data.
To achieve better performance, this application uses multiple threads. The pipeline is as follows:
The main thread is, as the name indicates, the central part of the program. It is in charge of getting frames from the camera thread, putting them into a queue, and displaying the results when they are available.
The camera thread is important to ensure that I/O is not holding us back: waiting for the next frame can be very time-consuming. Once a frame is ready, the main thread grabs it and puts it into the input queue, where one of the workers will pick it up.
The workers do the heavy lifting. First, they run the hand detection (SSD), which gives them a frame with the bounding box drawn, and a cropped frame of only the bounding box region. This cropped frame is sent through the CNN to classify the pose, which outputs a 6-dimensional vector of the network's confidence in each pose. The image, the cropped image, and the inference vector are each put into their respective queue for the main thread to grab.
Once the main thread grabs the results from the queues, it can display three windows: one for the video feed with the bounding box, one for the cropped frame, and a last one to visualize the classification.
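The pipeline above can be sketched with Python's standard `threading` and `queue` modules. The detection and classification steps are replaced by stubs here, and the queue names and worker count are illustrative, not taken from the actual code.

```python
import queue
import threading

input_q = queue.Queue(maxsize=4)   # frames waiting for a worker
output_q = queue.Queue()           # (frame, box, scores) results

def detect_hand(frame):
    # Stand-in for the SSD: returns the frame plus a fake bounding box.
    return frame, (10, 10, 38, 38)

def classify_pose(crop):
    # Stand-in for the CNN: returns a fake 6-way confidence vector.
    return [0.9, 0.02, 0.02, 0.02, 0.02, 0.02]

def worker():
    while True:
        frame = input_q.get()
        if frame is None:          # poison pill: shut this worker down
            break
        frame, box = detect_hand(frame)
        scores = classify_pose(frame)
        output_q.put((frame, box, scores))
        input_q.task_done()

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

# The main thread would normally pull frames from the camera thread;
# here we just push dummy frames, then one poison pill per worker.
for i in range(8):
    input_q.put(f"frame-{i}")
for _ in workers:
    input_q.put(None)
for w in workers:
    w.join()

print(output_q.qsize())  # 8
```

Bounding the input queue's size keeps the camera from racing ahead of the workers, so stale frames are never processed.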
Below you can see the application running. With this architecture, I achieved a solid 25 FPS with 4 workers on a 4-core, 8-thread Intel i5-8300H running at 4 GHz. I think the performance is decent and the application could be considered real-time.
Note: I think it's possible to train the SSD to do the pose classification directly, on top of the hand recognition. That would surely improve performance.
Generating the data for the CNN
The CNN has to be trained, and the simplest way to generate the required data is to film yourself doing the desired hand pose. Indeed, filming at 25 FPS will generate 750 images for only 30 seconds of video. You can then film yourself in different environments (lighting, background, etc.). To generate good examples, it's important to move your hand during the recording. Here is an example of the video I recorded for the startrek pose:
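As a quick sanity check of that arithmetic, here is a small hypothetical helper (standard library only) that computes how many training images a recording yields, optionally keeping only every Nth frame to reduce near-duplicate examples:

```python
def training_frame_count(duration_s, fps=25.0, stride=1):
    """Number of training images a recording yields if we keep every
    `stride`-th frame of a `duration_s`-second clip shot at `fps`."""
    total_frames = int(duration_s * fps)
    return len(range(0, total_frames, stride))

print(training_frame_count(30))            # 750 images from a 30 s clip
print(training_frame_count(30, stride=3))  # 250 if we keep every 3rd frame
```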
When generating the data, you won't see the hand detection bounding box; the detection is done afterwards. I did it that way because I think it is less biased. Indeed, you won't think about how you should move your hand for the SSD to detect it, you'll simply move it around, and this generates "truer" data, data that is more likely to match real usage.
The importance of a garbage class
The 6th class is not really a pose, but what I call a "garbage" class. This class is composed of examples that are not related to any of your poses: false positives from the hand detection, or images of your hand doing nothing. This greatly helps the classification. Indeed, the CNN will always try to predict a class for a given image, but sometimes the SSD detects a hand where there is none. In that case, you don't want your CNN to try to classify this image as one of your poses. With the garbage class, such an image is classified as garbage and will not interfere with the classification of the real poses you want to detect.
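In code, handling the garbage class is just a matter of checking the argmax of the confidence vector before acting on it. A minimal sketch follows; the label names other than "startrek" are hypothetical, as is the garbage class sitting at index 5:

```python
GARBAGE = 5  # index of the garbage class in the 6-way softmax output

def interpret(scores, labels=("fist", "palm", "ok", "peace", "startrek")):
    """Map a 6-way confidence vector to a pose name, or None for garbage.

    Label names here are hypothetical; only "startrek" appears in the post.
    """
    best = max(range(len(scores)), key=scores.__getitem__)  # argmax
    if best == GARBAGE:
        return None   # likely a false positive from the SSD: ignore it
    return labels[best]

print(interpret([0.05, 0.8, 0.05, 0.04, 0.03, 0.03]))  # palm
print(interpret([0.1, 0.1, 0.1, 0.1, 0.1, 0.5]))       # None
```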
Now you should have everything you need to make a hand pose recognizer. Again, you can check the code here. Some future improvements I might work on, or that you can try to implement:
- Retrain the SSD with a better dataset.
- Use GPU for inference.
- Make a webserver as an interface replacement.
- Use the SSD for both detection and pose classification.