Today, I'll show you how I built a detection and classification algorithm for hand poses in Python.
This project will make use of the following principles:
If you want to follow along, I'll detail every step, or you can grab the finished code here on my GitHub. The readme is pretty detailed and should give you everything you need to know to install the required packages and use the application.
Hand detection - SSD
For the hand detection I'm using a pre-trained MobileNet model, retrained by Victor Dibia.
The architecture is based on MobileNet, which was trained on the COCO dataset for object detection. The last layers were retrained on the EgoHands dataset. The CNN localizes the hand(s) in the image and returns the bounding box(es) corresponding to the hands' locations.
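As a rough sketch of this step (the function and variable names here are my own, not from Victor Dibia's code), TensorFlow object-detection models return box coordinates normalized to [0, 1], which we convert to pixels to crop the hand region:

```python
import numpy as np

def crop_hand(frame, box, score, threshold=0.5):
    """Crop the hand region from a frame given one SSD detection.

    box is (ymin, xmin, ymax, xmax) in normalized [0, 1] coordinates,
    the convention used by TensorFlow object-detection models.
    """
    if score < threshold:
        return None  # detection not confident enough
    h, w = frame.shape[:2]
    ymin, xmin, ymax, xmax = box
    top, left = int(ymin * h), int(xmin * w)
    bottom, right = int(ymax * h), int(xmax * w)
    return frame[top:bottom, left:right]

# Toy example: a fake 100x200 frame and a detection covering its center.
frame = np.zeros((100, 200, 3), dtype=np.uint8)
crop = crop_hand(frame, (0.25, 0.25, 0.75, 0.75), score=0.9)
print(crop.shape)  # (50, 100, 3)
```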
Once we have the location of the hand, we can feed the cropped image to the CNN for pose classification.
Pose classification - CNN
The CNN architecture used is the following:
The CNN architecture is nothing special: it's one of the simplest architectures you'll find for tackling the MNIST digit classification problem.
It consists of a 28x28 grayscale input, two convolutional layers with an increasing number of feature maps, a max-pooling layer for dimensionality reduction, and finally a fully connected dense layer, followed by a softmax that gives us prediction values between 0 and 1 for each of the 6 classes.
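A minimal Keras sketch of such an architecture could look like this (the filter counts and dense-layer size are my assumptions, not necessarily the exact values used in the project):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Simple MNIST-style classifier: 28x28 grayscale in, 6 pose classes out.
model = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),          # grayscale crop of the hand
    layers.Conv2D(32, 3, activation="relu"),  # first convolutional layer
    layers.Conv2D(64, 3, activation="relu"),  # more feature maps
    layers.MaxPooling2D(2),                   # dimensionality reduction
    layers.Flatten(),
    layers.Dense(128, activation="relu"),     # fully connected layer
    layers.Dense(6, activation="softmax"),    # confidence per pose class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```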
I'll first describe the whole pipeline of the program, then go into more detail on how the CNN is trained and how to create the data.
To achieve better performance, this application uses multiple threads. The pipeline is as follows:
The main thread is, as the name indicates, the central part of the program. It is in charge of getting frames from the camera thread, putting them into a queue, and displaying the results when they are available.
The camera thread is important to ensure that I/O isn't holding us back: waiting for the next frame can be very time consuming. Once a frame is ready, the main thread grabs it and puts it into the input queue, where one of the workers will pick it up.
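The camera thread can be sketched as below. In the real application it would wrap something like OpenCV's `VideoCapture`; here a stub frame source stands in so the sketch stays self-contained, and all names are hypothetical:

```python
import itertools
import threading
import time

class CameraThread:
    """Grabs frames on a background thread so the main loop never blocks
    on camera I/O; read() always returns the most recent frame."""

    def __init__(self, source):
        self.source = source          # stand-in for a real camera read
        self.frame = None
        self.lock = threading.Lock()
        self.running = True
        self.thread = threading.Thread(target=self._update, daemon=True)
        self.thread.start()

    def _update(self):
        while self.running:
            frame = self.source()     # the blocking read happens off-main
            with self.lock:
                self.frame = frame

    def read(self):
        with self.lock:
            return self.frame

    def stop(self):
        self.running = False
        self.thread.join()

# Stub source that "captures" incrementing frame ids.
counter = itertools.count()
cam = CameraThread(lambda: next(counter))
time.sleep(0.05)                      # let the thread grab a few frames
latest = cam.read()
cam.stop()
print(latest is not None)  # True
```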
The workers do the heavy lifting. First, they run the hand detection (SSD), which gives them a frame with the bounding box drawn and a cropped frame containing only the bounding-box region. This cropped frame is sent through the CNN to classify the pose, which outputs a 6-dimensional vector of the network's confidence in each pose. The image, the cropped image, and the inference vector are each put into their respective queues for the main thread to grab.
Once the main thread grabs the results from the queues, it displays three windows: one for the video feed with the bounding box, one for the cropped frame, and a last one to visualize the classification.
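The worker/queue structure described above can be sketched with the standard library. Here `detect_hand` and `classify_pose` are stubs standing in for the SSD and CNN inference, and a single output queue of tuples replaces the separate result queues, so the sketch runs without any model files:

```python
import queue
import threading

def detect_hand(frame):
    """Stub for the SSD detector: returns the frame and a fake box."""
    return frame, (0, 0, 28, 28)

def classify_pose(crop):
    """Stub for the pose CNN: returns a fake 6-class confidence vector."""
    return [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

input_q = queue.Queue(maxsize=8)
output_q = queue.Queue(maxsize=8)

def worker():
    while True:
        frame = input_q.get()
        if frame is None:             # poison pill shuts the worker down
            break
        annotated, box = detect_hand(frame)
        scores = classify_pose(box)   # inference on the cropped region
        output_q.put((annotated, box, scores))

workers = [threading.Thread(target=worker, daemon=True) for _ in range(4)]
for w in workers:
    w.start()

# Main thread: feed frames in, collect results, then stop the workers.
for frame_id in range(10):
    input_q.put(frame_id)
results = [output_q.get() for _ in range(10)]
for _ in workers:
    input_q.put(None)
print(len(results))  # 10
```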
Below you can see the application running. With this architecture I achieved a solid 25 FPS with 4 workers on a 4-core/8-thread Intel i5-8300H running at 4 GHz. I think the performance is decent, and the application could be considered 'real-time'.