Detecting Humans in Video Footage using Multiple Classifiers

Student: James Russell

Supervisor: Dr. Eun-Jung Holden

 

Manual analysis of video content such as surveillance footage is tedious, as it requires high levels of concentration over long periods of time. Systems capable of automatically detecting and tracking humans can assist or replace human operators, providing a more accurate and cost-effective solution. Such systems are currently in high demand due to the emphasis on security and anti-terrorism strategies and the increasing availability and affordability of security cameras. Many other applications, such as the archiving of surveillance footage, vision-based user interfaces, and people counting, also rely on the ability to detect and track humans.

 

This project detects foreground objects using a background subtraction technique, tracks the objects using a mean-shift-based algorithm, and then detects humans using a combination of classifiers that exploit visual features such as skin colour, body shape, and the periodic nature of the walking motion. Combining a number of independent classifiers produces a more robust system capable of handling occlusions and changes in human orientation.

 

TRACKING

We employ the mean-shift algorithm of Fukunaga and Hostetler [Fu75] to track individuals. This algorithm has recently received renewed attention for its capacity to track objects in real time, even in the presence of partial occlusions, significant clutter, and target scale variations. Mean shift is a clustering algorithm that seeks the nearest mode of a distribution; used as a tracker, it follows the target candidate that is most similar to a given target model. When applied to appearance-based blob tracking [Cm00, Co03], sample points are regularly distributed across the image, and each point is associated with a weight that is high for the target foreground object and low for the background. Weights can be determined using visual features such as colour, texture, or even template correlation. The mean-shift algorithm compares the target model, consisting of sample weights in a local window, with a set of target candidates to find the displacement of the centroid of the blob in the image.
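The centroid-seeking step described above can be illustrated with a minimal sketch. It assumes the per-pixel weights have already been computed (for example from colour similarity to the target model) and uses a flat rectangular window; the function name `mean_shift` and its parameters are hypothetical and are not taken from the prototype.

```python
def mean_shift(weights, cx, cy, half_w, half_h, max_iter=20, eps=0.5):
    """Move a window centred at (cx, cy) towards the local mode of `weights`.

    weights: 2D list of non-negative floats, indexed as weights[y][x].
    Returns the converged window centre as a (cx, cy) pair of floats.
    """
    rows, cols = len(weights), len(weights[0])
    for _ in range(max_iter):
        # Clip the current window to the image bounds.
        x0, x1 = max(0, round(cx) - half_w), min(cols - 1, round(cx) + half_w)
        y0, y1 = max(0, round(cy) - half_h), min(rows - 1, round(cy) + half_h)
        total = sx = sy = 0.0
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                w = weights[y][x]
                total += w
                sx += w * x
                sy += w * y
        if total == 0.0:
            break  # no weight under the window; keep the last estimate
        # The weighted centroid of the window is the new centre.
        nx, ny = sx / total, sy / total
        shift = ((nx - cx) ** 2 + (ny - cy) ** 2) ** 0.5
        cx, cy = nx, ny
        if shift < eps:
            break  # converged
    return cx, cy


# A 5x5 blob of high weights; the window starts off-centre and drifts onto it.
weights = [[0.0] * 20 for _ in range(20)]
for y in range(10, 15):
    for x in range(12, 17):
        weights[y][x] = 1.0
centre = mean_shift(weights, cx=10.0, cy=9.0, half_w=4, half_h=4)
```

In a tracker, the converged centre from one frame would serve as the starting point for the next frame.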

We have implemented a prototype of the tracking process. In our prototype, the system is initialised by first using a background subtraction technique to find the moving foreground object. A rectangular area around the mid-upper body is then chosen as the object window to track. This initialisation occurs in the first frame, and the object is tracked in subsequent frames. The resulting tracker is fast and robust, and copes with partial occlusions. Some example results are shown below. An animated sequence of these images can be viewed in this MPEG MOVIE.
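The initialisation step can be sketched as follows. The text does not say which background subtraction technique the prototype uses, so this example assumes the simplest one: thresholding the absolute difference against a static background image. The function names, the threshold, and the fractions used to place the mid-upper-body window are all illustrative assumptions.

```python
def foreground_mask(frame, background, threshold=30):
    """Mark pixels whose grey level differs from the background by more than `threshold`."""
    return [[abs(f - b) > threshold for f, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]


def bounding_box(mask):
    """Return (x0, y0, x1, y1), inclusive, around the foreground pixels, or None if empty."""
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)


def upper_body_window(box, top=0.15, bottom=0.55):
    """Pick a window over the mid-upper part of the box (the fractions are guesses)."""
    x0, y0, x1, y1 = box
    height = y1 - y0 + 1
    return x0, y0 + int(top * height), x1, y0 + int(bottom * height)


# A bright 3x8 "person" on a dark 10x10 background.
background = [[0] * 10 for _ in range(10)]
frame = [[200 if 3 <= x <= 5 and 1 <= y <= 8 else 0 for x in range(10)]
         for y in range(10)]
box = bounding_box(foreground_mask(frame, background))
window = upper_body_window(box)
```

The resulting window would then be handed to the tracker as the object window for the first frame.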

 

 

 

 

 

 
 
 

SKIN COLOUR CLASSIFIER

 

The Bayesian approach [Ch00] is employed as it exhibits the highest classification rate amongst the methods surveyed by Vezhnevets et al. [Ve01]. Classification is also fast because the process is simple: to classify a colour as skin or non-skin, only a lookup table is consulted, resulting in minimal processing time. The Bayesian method also performs similarly across all colour spaces.

 

The method involves dividing the chosen colour space into equal-sized bins. Using a set of skin training pixels, a skin probability map is formed by counting the number of pixels that fall into each bin. A non-skin probability map is formed in the same way using non-skin pixels as training data. These maps give the probabilities P(c|skin) and P(c|~skin) of observing a particular colour c given a skin pixel and a non-skin pixel respectively. The Bayes decision rule is then used to calculate the probability that a given colour c corresponds to skin. An example skin detection result is shown below.
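As a rough illustration of the lookup-table construction, the sketch below applies Bayes' rule, P(skin|c) = P(c|skin)P(skin) / (P(c|skin)P(skin) + P(c|~skin)P(~skin)), to two binned histograms. The bin count, the use of training-set class frequencies as the prior, the 0.5 decision threshold, and all function names are assumptions made for this example.

```python
from collections import Counter

BINS = 32  # bins per channel; an assumed quantisation


def bin_of(colour, bins=BINS):
    """Map an 8-bit three-channel colour to its histogram bin."""
    step = 256 // bins
    return tuple(channel // step for channel in colour)


def build_lookup(skin_pixels, nonskin_pixels, bins=BINS):
    """Return {bin: P(skin | c)} computed from two non-empty training sets."""
    skin = Counter(bin_of(p, bins) for p in skin_pixels)
    nonskin = Counter(bin_of(p, bins) for p in nonskin_pixels)
    n_skin, n_non = len(skin_pixels), len(nonskin_pixels)
    prior_skin = n_skin / (n_skin + n_non)  # assumed prior: class frequency
    table = {}
    for b in set(skin) | set(nonskin):
        p_c_skin = skin[b] / n_skin    # P(c | skin)
        p_c_non = nonskin[b] / n_non   # P(c | ~skin)
        evidence = p_c_skin * prior_skin + p_c_non * (1 - prior_skin)
        table[b] = p_c_skin * prior_skin / evidence
    return table


def is_skin(colour, table, threshold=0.5, bins=BINS):
    """Classify by table lookup; colours never seen in training count as non-skin."""
    return table.get(bin_of(colour, bins), 0.0) >= threshold


# Tiny made-up training sets, for illustration only.
skin_train = [(200, 150, 130)] * 3 + [(90, 90, 90)]
nonskin_train = [(90, 90, 90)] * 3 + [(10, 10, 10)]
table = build_lookup(skin_train, nonskin_train)
```

Once the table is built, classifying a pixel costs a single dictionary lookup, which is the source of the speed noted above.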

 

 

 

 

 

 

 


SHAPE CLASSIFIER

 

The shape classification technique of Tabb et al. [Ta01,Ta02] is adapted for this system. The primary reasons for this choice are its speed of execution, which follows from the simplicity of the model, and its effective use of neural networks to classify the model parameters.

 

The technique presented by Tabb et al. uses the axis crossover representation of objects to train a feed-forward back-propagation neural network to classify objects as human or non-human. The axis crossover representation involves projecting several axes from the centroid of an object and locating the points at which they intersect the object's perimeter. The normalised lengths of these axes form a vector representation of the shape of the object. An example shape representation is illustrated below.
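The axis crossover vector can be approximated with a short sketch. It marches outwards from the centroid of a binary silhouette along evenly spaced directions and records the distance at which each ray first leaves the object. The number of axes, the step size, normalisation by the longest axis, and the assumption that the centroid lies inside the object are choices made for this example rather than details of the method of Tabb et al.; the neural network stage is not shown.

```python
import math


def centroid(mask):
    """Centroid (x, y) of the True pixels in a 2D boolean mask."""
    points = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    n = len(points)
    return sum(x for x, _ in points) / n, sum(y for _, y in points) / n


def axis_crossover(mask, n_axes=16, step=0.5):
    """Return centroid-to-perimeter distances along n_axes directions, scaled to [0, 1]."""
    rows, cols = len(mask), len(mask[0])
    cx, cy = centroid(mask)
    lengths = []
    for k in range(n_axes):
        angle = 2 * math.pi * k / n_axes
        dx, dy = math.cos(angle), math.sin(angle)
        r = 0.0
        while True:
            # Probe the next sample along the ray; stop at the first exit.
            x = round(cx + (r + step) * dx)
            y = round(cy + (r + step) * dy)
            if not (0 <= x < cols and 0 <= y < rows) or not mask[y][x]:
                break
            r += step
        lengths.append(r)
    longest = max(lengths) or 1.0  # avoid dividing by zero for a one-pixel object
    return [length / longest for length in lengths]


# A tall, narrow rectangle: vertical axes come out longer than horizontal ones.
mask = [[5 <= x <= 9 and 2 <= y <= 12 for x in range(15)] for y in range(15)]
shape = axis_crossover(mask, n_axes=4)
```

A vector of this kind, computed for each tracked object, would be the input to the neural network classifier.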

 

 

 

 

 

 

 


PERIODIC MOTION DETECTOR

 

The periodic nature of the human walking motion is used as a cue to classify objects as human or non-human. The method proposed by Cutler and Davis [Cu00] is employed to identify periodicity in the shape changes of objects during motion. The approach uses a set of images of an object's appearance in consecutive frames. The object images are aligned on their centroids and resized to a common size so that they can be compared accurately. The resizing step compensates for changes in the object's scale caused by its varying distance from the camera.

 

The object's self-similarity image is then computed from the set of normalised images by measuring the similarity between every pair of object images. The self-similarity image is a greyscale representation of how similar each image is to every other image in the sequence: pixel (i, j) corresponds to the similarity between the ith and jth object images. If the motion of the object is periodic, the self-similarity image is also periodic. If the appearance of an object is similar to one observed in a previous frame, the object has returned to the same orientation. During human walking, the body returns to a similar orientation after half a cycle of motion.
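A minimal sketch of the self-similarity computation is given below. It assumes the object images have already been aligned and resized to the same dimensions, and it uses the mean absolute pixel difference as the comparison measure, so a value of 0 means identical images and larger values mean less similar ones. The measure and the function names are assumptions for illustration.

```python
def dissimilarity(a, b):
    """Mean absolute pixel difference between two equal-sized greyscale images."""
    total = sum(abs(pa - pb) for row_a, row_b in zip(a, b)
                for pa, pb in zip(row_a, row_b))
    return total / (len(a) * len(a[0]))


def self_similarity(images):
    """Return the N x N matrix whose entry (i, j) compares image i with image j."""
    n = len(images)
    return [[dissimilarity(images[i], images[j]) for j in range(n)]
            for i in range(n)]


# Two alternating poses produce a checkerboard-like, periodic matrix.
pose_a = [[0, 255], [255, 0]]
pose_b = [[255, 0], [0, 255]]
matrix = self_similarity([pose_a, pose_b, pose_a, pose_b])
```

The matrix is symmetric with a zero diagonal, and for periodic motion its low values recur at regular offsets from the diagonal.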

 

To determine whether the self-similarity image is periodic, the Fourier transform, which decomposes a waveform into sinusoids of different frequencies, is used. The power spectrum of the Fourier transform is then computed to give the magnitude of each frequency present. Periodic motion shows up as peaks in this spectrum at the fundamental frequency of the motion.
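The spectral test can be sketched on a one-dimensional signal, for instance a single row of the self-similarity image (an assumed simplification). The example uses a direct discrete Fourier transform from the standard library for clarity rather than speed, and simply picks the strongest non-zero frequency; any test of whether that peak is statistically significant is omitted. The function names are hypothetical.

```python
import cmath
import math


def power_spectrum(signal):
    """Power at frequencies 0 .. N/2 (cycles per N samples) of a mean-removed signal."""
    n = len(signal)
    mean = sum(signal) / n
    centred = [s - mean for s in signal]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, v in enumerate(centred))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


def dominant_period(signal):
    """Period, in samples, of the strongest non-zero frequency component."""
    power = power_spectrum(signal)
    k = max(range(1, len(power)), key=power.__getitem__)
    return len(signal) / k


# A signal that repeats every 8 frames.
signal = [math.sin(2 * math.pi * t / 8) for t in range(32)]
period = dominant_period(signal)
```

For a walking person, the period recovered from the self-similarity image corresponds to half a gait cycle, since the body returns to a similar appearance after half a cycle.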

 

COMBINING CLASSIFIERS

 

The proposed approach uses a weighted majority output scheme, producing a weighted sum of the outputs of the colour, shape and motion classifiers. This allows the periodic motion classifier to be incorporated into the system by assigning it weight only once sufficient frames have been processed to recognise periodic motion. The weighting is initially distributed evenly between the skin colour and shape-based classifiers. Once enough frames have been processed to identify periodic motion, the weighting is spread evenly between all three classifiers. In the implementation, a sequence of 20 frames is processed before weight is assigned to the motion classifier.
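The weighting schedule can be written down directly. In the sketch below, the 20-frame threshold and the even splits come from the description above; the assumption that each classifier outputs a score in [0, 1], the 0.5 decision threshold, and the function names are illustrative.

```python
MIN_MOTION_FRAMES = 20  # frames processed before the motion classifier is trusted


def classifier_weights(frames_processed):
    """Even split over two classifiers at first, then over all three."""
    if frames_processed < MIN_MOTION_FRAMES:
        return {"skin": 0.5, "shape": 0.5, "motion": 0.0}
    return {"skin": 1 / 3, "shape": 1 / 3, "motion": 1 / 3}


def is_human(scores, frames_processed, threshold=0.5):
    """Weighted sum of classifier scores (each assumed in [0, 1]) against a threshold."""
    weights = classifier_weights(frames_processed)
    combined = sum(weights[name] * scores[name] for name in weights)
    return combined >= threshold


# Early on, the motion score is ignored; later it counts as much as the others.
scores = {"skin": 1.0, "shape": 0.0, "motion": 0.0}
early = is_human(scores, frames_processed=5)
late = is_human(scores, frames_processed=25)
```

With these scores the object is accepted while only two classifiers vote, and rejected once the motion classifier also disagrees.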

 

A detailed description of this study can be found in [Ru04].

 

 

References

[Ch00] D. Chai and A. Bouzerdoum.  A Bayesian approach to skin colour classification in YCbCr colour space.  Proceedings of the IEEE Region Ten Conference, Vol.2, pp.421-424, 2000.

[Cm00] D. Comaniciu, V. Ramesh and P. Meer.  Real-time tracking of non-rigid objects using mean shift.  Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol.2, pp.142-149, 2000.

[Co03] R. T. Collins.  Mean-shift blob tracking through scale space.  Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR03), Madison, WI, pp.234-240, June 2003.

[Cu00] R. Cutler and L. S. Davis.  Robust real-time periodic motion detection, analysis, and applications.  IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.22, pp.781-796, Aug. 2000.

[Fu75] K. Fukunaga and L. D. Hostetler.  The estimation of the gradient of a density function, with applications in pattern recognition.  IEEE Transactions on Information Theory, Vol.21, pp.32-40, 1975.

[Ru04] J. Russell.  Detecting humans in video footage using multiple classifiers, Honours Thesis, School of Computer Science & Software Engineering, The University of Western Australia, 2004.

[Ta01] K. Tabb, N. Davey, R. Adams and S. George.  Omni-directional motion: Pedestrian shape classification using neural networks and active contour models.  Proceedings of the Image and Vision Computing Conference, Nov. 2001.