Student: James Russell
Supervisor: Dr. Eun-Jung Holden
Manual analysis of video content such as surveillance footage is a
tedious task as it requires high levels of concentration over long periods of
time. Systems capable of automatically detecting and tracking humans can assist
or replace human operators by providing a more accurate and cost-effective solution.
Such systems are currently in high demand due to the emphasis on security and
anti-terrorism strategies, and the increasing availability and affordability of
security cameras. Many other applications such as the archiving of surveillance
footage, vision-based user interfaces, and people counting also rely on the
ability to detect and track humans.
This project detects foreground objects using a background subtraction
technique, tracks the objects using a mean-shift-based algorithm, and then
detects humans using a combination of classifiers that utilise visual features
such as skin colour, body shape, and the periodic nature of the walking motion.
Combining a number of independent classifiers produces a more robust system
capable of handling occlusions and changes in human orientation.
TRACKING
We employ the mean-shift algorithm of Fukunaga and Hostetler
[Fu75] to track individuals. This algorithm has recently received renewed
attention for its capacity to track objects in real time, even in the presence
of partial occlusions, significant clutter, and target scale variations.
Mean shift is a nearest-mode-seeking clustering algorithm: it tracks the
target candidate that is most similar to a given target model.
When applied to appearance-based blob tracking [Cm00, Co03], sample
points are regularly distributed across the image, and each point is associated
with a weight that is high for the target foreground object and low for the
background. Weights can be determined using visual features such as colour,
texture or even template correlation. The mean-shift algorithm compares the
target model, consisting of sample weights in a local window, with a set of
target candidates to find the displacement of the centroid of the blob in the
image.
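The centroid-seeking step described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: it assumes the weights are already available as a 2D list, uses a flat (uniform) kernel over a fixed-size rectangular window, and the function name `mean_shift_track` and its parameters are hypothetical.

```python
def mean_shift_track(weights, window, max_iter=20, eps=0.5):
    """Move a rectangular window to the local weighted centroid of `weights`.

    weights: 2D list of non-negative floats, indexed as weights[y][x].
    window:  (cx, cy, half_w, half_h), window centre and half-size in pixels.
    Returns the (cx, cy) centre after convergence or max_iter iterations.
    """
    cx, cy, half_w, half_h = window
    rows, cols = len(weights), len(weights[0])
    for _ in range(max_iter):
        # Clip the current window to the image bounds.
        x0 = max(0, int(round(cx - half_w)))
        x1 = min(cols - 1, int(round(cx + half_w)))
        y0 = max(0, int(round(cy - half_h)))
        y1 = min(rows - 1, int(round(cy + half_h)))
        total = sx = sy = 0.0
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                w = weights[y][x]
                total += w
                sx += w * x
                sy += w * y
        if total == 0.0:
            break  # no weight under the window; keep the last position
        new_cx, new_cy = sx / total, sy / total
        shift = ((new_cx - cx) ** 2 + (new_cy - cy) ** 2) ** 0.5
        cx, cy = new_cx, new_cy
        if shift < eps:
            break  # the window has stopped moving
    return cx, cy
```

Each iteration moves the window to the weighted mean of the pixels it currently covers, so it climbs towards the nearest mode of the weight image. A tracker would call this once per frame, starting from the previous frame's position.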
We have implemented a prototype of the tracking process. The system is
initialised by first using a background subtraction technique to find the
moving foreground object. A rectangular area around the mid-upper body is then
chosen as the object window to track. This initialisation occurs in the first
frame, and the object is tracked in subsequent frames. The tracker is fast and
robust, and handles partial occlusions. Some example results are shown below.
An animated sequence of these images is available as an MPEG movie.
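The initialisation step described above can be sketched as follows. The source does not say which background subtraction technique is used, so this sketch assumes the simplest one: thresholding the absolute difference against a static greyscale background frame. The function names, the threshold, and the fractions that select the mid-upper body are all illustrative assumptions.

```python
def foreground_mask(frame, background, threshold=30):
    """Mark pixels whose grey level differs from the background by more than threshold."""
    return [[abs(f - b) > threshold for f, b in zip(frame_row, bg_row)]
            for frame_row, bg_row in zip(frame, background)]


def bounding_box(mask):
    """Return (x0, y0, x1, y1) enclosing the True pixels, or None if there are none."""
    xs = [x for row in mask for x, value in enumerate(row) if value]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)


def upper_body_window(box, top_frac=0.15, bottom_frac=0.55):
    """Pick a rectangle over the mid-upper part of the box (fractions are assumed)."""
    x0, y0, x1, y1 = box
    height = y1 - y0 + 1
    return x0, y0 + int(top_frac * height), x1, y0 + int(bottom_frac * height)
```

The returned window could then serve as the initial object window for the tracker in the first frame.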

For skin colour classification, the Bayesian approach [Ch00] is employed, as it
exhibits the highest classification rate amongst the methods surveyed by
Vezhnevets et al. [Ve01]. Classification is also fast because the process is
simple: to classify a colour as skin or non-skin, a lookup table is consulted,
resulting in minimal processing time. The Bayesian method also performs
similarly across all colour spaces.
The method involves dividing the chosen colour space into equal-sized bins.
Using a set of skin training pixels, a skin probability map is formed by
determining the number of pixels belonging to each bin. A non-skin probability
map is also formed using non-skin pixels as training data. These maps give the
probabilities P(c|skin) and P(c|~skin) of observing a particular colour c given
a skin or non-skin pixel, respectively. The Bayes decision rule is then used to
calculate the probability of a given colour c corresponding to skin. An example
skin detection result is shown below.
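As an illustration of the lookup-table method described above (not the project's actual code), the sketch below builds the two histograms and applies Bayes' rule, P(skin|c) = P(c|skin)P(skin) / (P(c|skin)P(skin) + P(c|~skin)P(~skin)). The bin count, the prior P(skin) = 0.5, the decision threshold, and all function names are assumptions made for the example, and both training sets are assumed to be non-empty.

```python
from collections import Counter


def bin_of(colour, bins_per_channel=32):
    """Quantise an 8-bit three-channel colour into a tuple of bin indices."""
    step = 256 // bins_per_channel
    return tuple(value // step for value in colour)


def build_skin_table(skin_pixels, nonskin_pixels, prior_skin=0.5, bins_per_channel=32):
    """Return {bin: P(skin | c)} for every bin seen in the training data."""
    skin_hist = Counter(bin_of(p, bins_per_channel) for p in skin_pixels)
    nonskin_hist = Counter(bin_of(p, bins_per_channel) for p in nonskin_pixels)
    n_skin, n_nonskin = len(skin_pixels), len(nonskin_pixels)
    table = {}
    for b in set(skin_hist) | set(nonskin_hist):
        p_c_skin = skin_hist[b] / n_skin            # P(c | skin)
        p_c_nonskin = nonskin_hist[b] / n_nonskin   # P(c | ~skin)
        evidence = p_c_skin * prior_skin + p_c_nonskin * (1 - prior_skin)
        table[b] = p_c_skin * prior_skin / evidence
    return table


def is_skin(colour, table, threshold=0.5, bins_per_channel=32):
    """Classify by table lookup; colours never seen in training count as non-skin."""
    return table.get(bin_of(colour, bins_per_channel), 0.0) > threshold
```

Once the table is built, classifying a pixel costs one quantisation and one lookup, which is why the method is fast at run time.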

The shape classification technique of Tabb et al. [Ta01, Ta02] is adapted
for the system. The primary reason for this choice is the speed of execution,
which follows from the simple nature of the model and the effective use of
neural networks to classify model parameters.
The technique presented by Tabb et al. uses the axis crossover representation
of objects to train a feed-forward back-propagation neural network to classify
objects as human or non-human. The axis crossover representation involves
projecting several axes from the centroid of an object and locating the points
at which they intersect the object's perimeter. The normalised lengths of these
axes form a vector representation of the shape of the object. An example shape
representation is illustrated below.
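The axis crossover representation described above can be approximated with a simple ray march over a binary object mask. This is a sketch under stated assumptions rather than the method of Tabb et al.: the number of axes, the half-pixel step, stopping at the first exit from the object, and normalising by the longest axis are all choices made for the example, and the mask is assumed to contain at least one object pixel.

```python
import math


def axis_crossover_vector(mask, n_axes=8):
    """Normalised centroid-to-perimeter lengths along n_axes equally spaced directions.

    mask: 2D list of booleans, indexed as mask[y][x], True inside the object.
    """
    points = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    rows, cols = len(mask), len(mask[0])
    lengths = []
    for k in range(n_axes):
        angle = 2 * math.pi * k / n_axes
        dx, dy = math.cos(angle), math.sin(angle)
        r = 0.0
        # Step outwards until the next sample leaves the object or the image.
        while True:
            x = int(round(cx + (r + 0.5) * dx))
            y = int(round(cy + (r + 0.5) * dy))
            if not (0 <= x < cols and 0 <= y < rows) or not mask[y][x]:
                break
            r += 0.5
        lengths.append(r)
    longest = max(lengths) or 1.0  # avoid dividing by zero for a one-pixel object
    return [length / longest for length in lengths]
```

The resulting fixed-length vector is the kind of input a feed-forward network could be trained on; the network itself is outside the scope of this sketch.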

The periodic nature of the human walking motion is used as a cue to
classify objects as human or non-human. The method proposed by Cutler and Davis
[Cu00] is employed to identify periodicity in the shape changes of objects
during motion. The periodic-motion-based approach uses a set of images of an
object's appearance in consecutive frames. The centroids of the object images
are aligned and each image is resized so that the images can be compared
accurately. The resizing process allows for scale changes in the object due to
varying distances from the camera.
The object's self similarity image is then computed from the set of
normalised images by measuring the similarity between every pair of object
images. The self similarity image is a greyscale representation of how similar
each image is to every other image in the sequence: pixel (i, j) of the self
similarity image corresponds to the similarity between the i-th and j-th object
images in the sequence. If the motion of the object is periodic, this self
similarity image is also periodic. If the appearance of an object is similar to
one observed in a previous frame, it indicates the object has returned to the
same orientation. During the human walking process the body will return to a
similar orientation after half a cycle of motion.
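A minimal sketch of the self similarity computation follows. It assumes the object images have already been aligned and resized to a common size, and it uses the mean absolute pixel difference as the similarity measure, so lower values mean more similar images. That measure and the function name are assumptions for illustration.

```python
def self_similarity(images):
    """Return S where S[i][j] is the mean absolute difference between images i and j.

    images: list of equally sized 2D lists of grey levels.
    """
    def distance(a, b):
        total = sum(abs(pa - pb)
                    for row_a, row_b in zip(a, b)
                    for pa, pb in zip(row_a, row_b))
        return total / (len(a) * len(a[0]))

    n = len(images)
    return [[distance(images[i], images[j]) for j in range(n)] for i in range(n)]
```

For a sequence that alternates between two poses, the matrix shows zeros on every second diagonal, which is the periodic structure the text describes.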
To determine whether the self similarity image is periodic, the Fourier
transform, which decomposes a waveform into sinusoids of different frequencies,
is used. The power spectrum of the Fourier transform is then computed to give
the magnitude of each frequency present. Periodic motion will show up as peaks
in this spectrum at the motion's fundamental frequencies.
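The spectral test can be illustrated on a one-dimensional signal, for example a single row of the self similarity image (that choice of input is an assumption, not taken from the source). The sketch uses a plain discrete Fourier transform from the standard library rather than a fast transform, and the function names are hypothetical.

```python
import cmath


def power_spectrum(signal):
    """Power in DFT bins 0..n//2 of a real signal after removing its mean."""
    n = len(signal)
    mean = sum(signal) / n
    centred = [s - mean for s in signal]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, v in enumerate(centred))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


def dominant_period(signal):
    """Period in frames of the strongest non-zero frequency, or None for a flat signal."""
    spectrum = power_spectrum(signal)
    k = max(range(1, len(spectrum)), key=spectrum.__getitem__)
    if spectrum[k] < 1e-12:
        return None
    return len(signal) / k
```

A practical detector would also check that the peak stands well above the rest of the spectrum before declaring the motion periodic; that significance test is omitted here.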
The proposed approach uses a weighted majority output scheme, producing
a weighted sum of the outputs of the colour, shape and motion classifiers. This
allows the periodic motion classifier to be incorporated into the system by
assigning it importance only once sufficient frames have been processed to
recognise periodic motion. The weighting is initially distributed evenly
between the skin colour and shape-based classifiers. Once enough frames have
been processed to identify periodic motion, the weighting is spread evenly
between all three classifiers. In the implementation, a sequence of 20 frames
is processed before assigning importance to the motion classifier.
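The weighting scheme described above can be sketched as follows. The 20-frame switch-over and the even weight splits come from the text; the assumption that each classifier outputs a value in [0, 1], the 0.5 decision threshold, and the function name are illustrative.

```python
def combine_classifiers(skin, shape, motion, frames_seen, min_frames=20, threshold=0.5):
    """Weighted sum of classifier outputs in [0, 1]; returns (score, is_human)."""
    if frames_seen >= min_frames:
        # Enough frames to judge periodicity: all three classifiers share the weight.
        weights = (1 / 3, 1 / 3, 1 / 3)
    else:
        # Too few frames: the motion classifier is ignored.
        weights = (0.5, 0.5, 0.0)
    score = sum(w * output for w, output in zip(weights, (skin, shape, motion)))
    return score, score > threshold
```

For example, with outputs (1.0, 0.0, 1.0) the score is 0.5 before frame 20 and about 0.67 afterwards, so the motion cue tips the decision once it becomes available.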
A detailed description of this study can be found in [Ru04].
[Ch00]
D. Chai and A. Bouzerdoum. A Bayesian
approach to skin colour classification in YCbCr colour space. Proceedings of the IEEE Region Ten
Conference, Vol. 2, pp. 421-424, 2000.
[Cm00]
D. Comaniciu, V. Ramesh and P. Meer.
Real-time tracking of non-rigid objects using mean shift. Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Vol. 2, pp. 142-149, 2000.
[Co03]
R. T. Collins. Mean-shift blob tracking through scale space. IEEE Computer Vision and Pattern Recognition
(CVPR03), Madison, WI, June 16-22, 2003, pp 234-240.
[Cu00]
R. Cutler and L. S. Davis. Robust
real-time periodic motion detection, analysis, and applications. IEEE Transactions on Pattern Analysis and
Machine Intelligence, Vol. 22, pp. 781-796, Aug. 2000.
[Fu75]
K. Fukunaga and L. D. Hostetler. The estimation of the gradient of a density
function, with applications in pattern recognition. IEEE Transactions on Information Theory, Vol. 21, pp. 32-40,
1975.
[Ru04]
J. Russell. Detecting humans in
video footage using multiple classifiers, Honours Thesis, School of
Computer Science & Software Engineering, The University of Western
Australia, 2004.
[Ta01]
K. Tabb, N. Davey, R. Adams and S. George.
Omni-directional motion: Pedestrian shape classification using neural
networks and active contour models. Proceedings
of the Image and Vision Computing Conference, Nov. 2001.