In the course of my recent work I've been investigating the field of computer vision, and more specifically, the problem of object detection and tracking. Object detection is a classic problem in computer vision, with applications in many areas. In this post I'm going to share my findings about the latest research advances in this area.
Everyone who is even remotely familiar with the field knows the great paper “Robust Real-Time Face Detection” by Paul Viola and Michael J. Jones. Following their approach it is possible to detect faces at 15 frames per second on a conventional Intel Pentium III (as they did). Our task is to achieve the same results on a conventional embedded processor. Also, their learning algorithm http://en.wikipedia.org/wiki/Machine_learning may run for days or even weeks. Our goal is to try to reduce this time.
There are three main contributions in their paper: the integral image http://en.wikipedia.org/wiki/Summed_Area_Table, a simple and efficient classifier (by the standards of that time), and an approach for combining classifiers of increasing complexity.
The integral image allows for extremely fast feature evaluation. Viola and Jones used a set of Haar-like features http://en.wikipedia.org/wiki/Haar-like_features. Using the integral image, any Haar-like feature can be computed at any scale or location in constant time. Haar features are similar to Haar wavelets http://en.wikipedia.org/wiki/Haar_wavelet. The feature set considers rectangular regions of the image and sums up the pixel intensities within each region; differences between such sums are then used to classify images. The Haar wavelets are a natural set of basis functions which encode differences in average intensities between different regions.
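To make the constant-time claim concrete, here is a minimal sketch of an integral image and a two-rectangle Haar-like feature (the function names are my own, not from the paper):

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns; ii[y, x] holds the sum of
    all pixels above and to the left of (y, x), inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, y, x, h, w):
    """Sum of pixels in the h-by-w rectangle with top-left corner (y, x),
    using at most four lookups into the integral image -- constant time."""
    total = ii[y + h - 1, x + w - 1]
    if y > 0:
        total -= ii[y - 1, x + w - 1]
    if x > 0:
        total -= ii[y + h - 1, x - 1]
    if y > 0 and x > 0:
        total += ii[y - 1, x - 1]
    return total

def haar_two_rect(ii, y, x, h, w):
    """A simple two-rectangle Haar-like feature: the difference between
    the left and right halves of an (h x 2w) window."""
    return rect_sum(ii, y, x, h, w) - rect_sum(ii, y, x + w, h, w)
```

Note that the cost of `rect_sum` does not depend on `h` or `w`, which is exactly why any feature can be evaluated at any scale in constant time.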
Face detection is a rare-event binary classification problem, in the sense that among millions of sub-windows, only a few contain faces. The classifier introduced by Viola and Jones filters out over 50% of the image while preserving 99% of the faces. As we will see later, we can go far beyond these limits. This classifier is built using the AdaBoost learning algorithm http://en.wikipedia.org/wiki/AdaBoost to select a small number of critical visual features from a very large set of potential features. The basic idea of AdaBoost is to train an ensemble of M weak classifiers of the form:

F(x) = α_1 h_1(x) + α_2 h_2(x) + … + α_M h_M(x),

where α_m are voting coefficients. This is done by training the weak classifiers one by one, for m from 1 to M. However, at the m-th stage, the training examples are weighted differently, using a weighting function based on the performance of the previous stage. The idea is to increase the weights of the wrongly classified examples, so that subsequent weak classifiers focus more on them.
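This reweighting scheme can be sketched as a toy discrete AdaBoost with threshold stumps (the textbook algorithm, not the authors' exact implementation; the feature matrix and names here are illustrative):

```python
import numpy as np

def train_adaboost(X, y, M):
    """Discrete AdaBoost with threshold stumps on single features.
    X: (N, T) matrix of feature values, y: labels in {-1, +1}."""
    N, T = X.shape
    w = np.full(N, 1.0 / N)          # example weights, updated each round
    ensemble = []                     # (alpha, feature, threshold, polarity)
    for m in range(M):
        best = None
        # pick the stump (feature, threshold, polarity) with least weighted error
        for t in range(T):
            for thr in np.unique(X[:, t]):
                for p in (+1, -1):
                    pred = np.where(p * X[:, t] < p * thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, t, thr, p, pred)
        err, t, thr, p, pred = best
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # voting coefficient
        # increase the weights of misclassified examples, decrease the rest
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        ensemble.append((alpha, t, thr, p))
    return ensemble

def predict(ensemble, x):
    s = sum(a * (1 if p * x[t] < p * thr else -1) for a, t, thr, p in ensemble)
    return 1 if s >= 0 else -1
```

The naive stump search here is what costs O(NT) per round; this is the factor the later papers attack.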
Each feature is associated with one weak classifier, which classifies a sub-window by first evaluating the feature on the sub-window using the integral image, and then thresholding the resulting value with a properly chosen threshold. The weak classifiers are then combined into a “cascade” which allows useless regions of the image to be quickly discarded. Those sub-windows which are not rejected by the initial classifier of the cascade are processed by a sequence of classifiers, each slightly more complex than the last.
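The early-rejection logic of the cascade fits in a few lines; `score_fn` and the per-stage thresholds below are hypothetical placeholders, not values from the paper:

```python
def cascade_classify(stages, subwindow):
    """Evaluate a cascade of increasingly complex stages. Each stage is a
    (score_fn, threshold) pair; the sub-window is rejected as soon as any
    stage's score falls below its threshold, so most background regions
    are discarded after only one or two cheap tests."""
    for score_fn, threshold in stages:
        if score_fn(subwindow) < threshold:
            return False   # early rejection
    return True            # survived every stage: report a detection
```

For example, with identity score functions `stages = [(lambda v: v, 0.5), (lambda v: v, 0.9)]`, a "sub-window" with value 0.7 passes the first stage but is rejected by the second.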
That concludes Viola and Jones' paper. Now let us examine some ideas for improving the performance of the learning algorithm and the detector.
Minh-Tri Pham and Tat-Jen Cham brought real new insight into the aforementioned method with their paper “Fast training and selection of Haar features using statistics in boosting-based face detection”. Their method is extremely fast and comparable in accuracy. They observe that traditional techniques for training a weak classifier usually run in O(NT) time, with N examples and T features, and they present a novel approach to train a weak classifier in O(Nd² + T) time, where d is the number of pixels of the probed image sub-window, by using only the statistics of the weighted input data. Did you see it?! Rather than trying to reduce either N or T, they break up the NT factor. I do not want to reprint that exciting paper because it contains a lot of math, so just go and read it. By the way, this is currently the world's fastest method for training a face detector, so it is worth the reader's attention. If you need a refresher on statistics, I strongly advise reading the first chapters of “Pattern Classification” by R. O. Duda, P. E. Hart, and D. G. Stork.
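To give a flavour of the statistics-based idea (my own sketch, not code from the paper): since every Haar-like feature is linear in the pixels, the weighted mean and variance of any feature value can be derived from one weighted mean/covariance pass over the pixel data, so the per-feature training cost no longer depends on the number of examples:

```python
import numpy as np

def pixel_statistics(X, w):
    """One pass over the data: weighted mean and covariance of the d pixel
    values, O(N d^2). These statistics are shared by all features."""
    w = w / w.sum()
    mu = w @ X                            # (d,) weighted pixel means
    Xc = X - mu
    sigma = (Xc * w[:, None]).T @ Xc      # (d, d) weighted covariance
    return mu, sigma

def feature_statistics(a, mu, sigma):
    """For a linear feature f(x) = a @ x (Haar features are linear in the
    pixels), its weighted mean and variance follow from the pixel
    statistics without touching the N examples again."""
    return a @ mu, a @ sigma @ a
```

A weak classifier can then be chosen from these per-feature statistics alone, which is what decouples the N and T factors.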
Let us look in another direction. The main limitation of the aforementioned detectors is the need to gather a representative training set. To overcome that limitation we turn to H. Grabner, P. M. Roth, and H. Bischof, who have published an “inspiring” piece of work: “Is Pedestrian Detection Really a Hard Task?”.
The basic idea is to train a separate classifier for each image location, which only has to discriminate the object from the background at that specific location. Because the complexity is reduced, “we can use a simple update strategy that requires only a few positive samples and is stable by design”. Some limitations arise here: the cameras must be fixed and always looking at the same scene. This approach uses on-line unsupervised learning methods http://en.wikipedia.org/wiki/Unsupervised_learning, which usually tend to make wrong updates that reduce the performance of the detector. The detector might start to drift and end up in an unreliable state. Here are the variations of on-line learning algorithms which suffer from the drifting problem:
Self-training. In a self-training framework the current classifier evaluates an input sample and predicts a label which is then directly used to update the classifier.
Co-training. In a co-training framework two classifiers are trained in parallel using different views of the data. The confidently predicted labels of each classifier are used to update the other one.
Autonomous supervision. The results obtained by the classifier are verified by an analyzer and if the obtained labels are confident the samples are used for updating the classifier.
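A toy illustration of why the first scheme can drift (the classifier and its confidence measure are invented for this example): the model updates on its own predictions with no external check, so a wrong label feeds straight back into the model.

```python
class NearestMeanClassifier:
    """Toy two-class model: predict the class whose running mean is nearer."""
    def __init__(self, mean_pos, mean_neg):
        self.means = {+1: mean_pos, -1: mean_neg}
        self.counts = {+1: 1, -1: 1}

    def predict(self, x):
        d_pos = abs(x - self.means[+1])
        d_neg = abs(x - self.means[-1])
        label = +1 if d_pos < d_neg else -1
        confidence = abs(d_pos - d_neg) / (d_pos + d_neg + 1e-12)
        return label, confidence

    def update(self, x, label):
        self.counts[label] += 1
        # shift the running mean of the predicted class toward the sample
        self.means[label] += (x - self.means[label]) / self.counts[label]

def self_training_step(clf, sample, confidence_threshold=0.5):
    """Self-training: the classifier labels the sample itself and, if
    confident, updates on its own prediction -- errors feed back, hence drift."""
    label, confidence = clf.predict(sample)
    if confidence >= confidence_threshold:
        clf.update(sample, label)
    return label, confidence
```

Every confidently mislabeled sample pulls a class mean in the wrong direction, making the next mistake more likely, which is the drift the authors describe.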
And again, the authors overcome these problems and prove that neither the false positive rate increases nor the recall decreases if the system is running for a longer period of time. Their method does not take the classifier's response into account when delivering updates.
For the practical implementation, a fixed, highly overlapping grid (both in location and scale) is placed over the image. Each grid element corresponds to a classifier which is responsible only for its underlying image patch.
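Such a grid might be laid out as follows (a sketch; the patch size, stride, and scale values are illustrative, not from the paper):

```python
def build_grid(h, w, base_patch, stride, scales=(1.0, 1.5, 2.0)):
    """Highly overlapping grid of patches over an h-by-w image, at several
    scales. Each cell (y, x, patch_size) gets its own classifier that only
    has to separate object vs. background at that fixed location."""
    cells = []
    for s in scales:
        p = int(base_patch * s)
        for y in range(0, h - p + 1, stride):
            for x in range(0, w - p + 1, stride):
                cells.append((y, x, p))   # one per-location classifier per cell
    return cells
```

Because the stride is much smaller than the patch size, neighbouring cells overlap heavily, which is what lets detections be confirmed by several independent per-location classifiers.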
To be continued...
Paul Viola, Michael Jones. Robust Real-Time Face Detection. 2004
Minh-Tri Pham, Tat-Jen Cham. Fast training and selection of Haar features using statistics in boosting-based face detection. 2007
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern classification. Wiley-interscience. 2000
H. Grabner, P. M. Roth, and H. Bischof. Is Pedestrian Detection Really a Hard Task? 2007
Tags: computer vision