Accommodating for 3D Head Movement in Visual Lipreading
[This is an abstract of the paper that
is submitted to the Int. Conf. on Signal and Image Processing]
[By Eun-Jung Holden*, Gareth Loy**, Robyn
Owens*]
[* Department of Computer Science, UWA; **
School of Systems Engineering, ANU]
BACKGROUND
Existing automatic lipreading systems extract lip shapes from 2D images and recognise them as visual speech. Such techniques assume that the speaker faces the camera directly, without any rotation of the head. As people talk, however, their heads naturally move as they gesture and follow conversational cues. An automatic lipreading device must therefore be robust to this natural behaviour: it must detect, monitor and account for the 3D movement of a speaker's head. We have developed a 3D head tracker that extracts the 6 degree-of-freedom (DOF) parameters of the head from a 2D image sequence captured from a single viewpoint. These 3D head parameters are then used to correct the lip shape appearing in the image for further recognition processing.
3D HEAD TRACKING
Tracking is a sequential estimation process in which the 3D head transformation parameters are estimated for each image of the sequence. The system employs the 3D model-based tracking algorithm of David Lowe [1]. Given a 3D model and the 2D features extracted from an image, the 3D model is reconfigured to fit the head pose appearing in that image. This is achieved by a Newton-style iterative minimisation technique, in which the Euclidean distances between the projected 3D model and the extracted features are minimised to find the correction parameters for the model.
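The fitting step can be illustrated with a minimal sketch. It is not Lowe's implementation: it uses a damped Gauss-Newton (Levenberg-Marquardt style) update with a numerical Jacobian, swapped in here for robustness. The model coordinates `MODEL`, the focal length `FOCAL`, the rotation order and all function names are assumptions made for illustration only.

```python
import math

# Hypothetical three-point head model (cm): outer eye corners and one nostril.
MODEL = [(-4.5, 3.0, 0.0), (4.5, 3.0, 0.0), (1.0, -3.0, 0.0)]
FOCAL = 800.0  # assumed focal length in pixels


def rotation(rx, ry, rz):
    """Rotation matrix R = Rz * Ry * Rx for angles in radians."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def project(pose):
    """Perspective projection of MODEL under pose = (rx, ry, rz, tx, ty, tz)."""
    R = rotation(*pose[:3])
    out = []
    for p in MODEL:
        x, y, z = (sum(R[i][j] * p[j] for j in range(3)) + pose[3 + i]
                   for i in range(3))
        out += [FOCAL * x / z, FOCAL * y / z]
    return out


def sq_error(observed, pose):
    """Sum of squared image distances between features and projected model."""
    return sum((o - p) ** 2 for o, p in zip(observed, project(pose)))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_pose(observed, pose, iterations=50, eps=1e-6, lam=1e-3):
    """Damped Gauss-Newton fit of the 6 pose parameters to 2D features."""
    pose = list(pose)
    err = sq_error(observed, pose)
    n = len(pose)
    for _ in range(iterations):
        predicted = project(pose)
        r = [o - p for o, p in zip(observed, predicted)]
        m = len(r)
        # Numerical Jacobian of the projection by forward differences.
        J = [[0.0] * n for _ in range(m)]
        for j in range(n):
            bumped = pose[:]
            bumped[j] += eps
            for i, q in enumerate(project(bumped)):
                J[i][j] = (q - predicted[i]) / eps
        # Damped normal equations: (J^T J + lam * diag(J^T J)) delta = J^T r.
        A = [[sum(J[k][a] * J[k][b] for k in range(m)) for b in range(n)]
             for a in range(n)]
        g = [sum(J[k][a] * r[k] for k in range(m)) for a in range(n)]
        for a in range(n):
            A[a][a] *= 1.0 + lam
        trial = [p + d for p, d in zip(pose, solve(A, g))]
        trial_err = sq_error(observed, trial)
        if trial_err < err:      # accept the step and relax the damping
            pose, err, lam = trial, trial_err, lam * 0.1
        else:                    # reject the step and damp more strongly
            lam *= 10.0
    return pose, err


# Synthetic usage: features generated from a known pose, fitted from a nearby
# starting pose (as if carried over from the previous frame).
observed = project([0.30, -0.40, 0.10, 2.0, -1.0, 60.0])
pose, err = fit_pose(observed, [0.25, -0.35, 0.05, 1.5, -0.5, 58.0])
```

In a tracking setting the pose estimated for one frame would typically seed the fit for the next, which keeps the starting point close to the solution. With only three points the fronto-parallel pose is poorly conditioned, which is one reason the sketch damps the update rather than taking plain Newton steps.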
Head Model
The head is represented as a rigid triangle formed by three feature locations on the face, namely the outer corners of the two eyes and one nostril, as illustrated below. The head model is a kinematic chain that describes the head configuration, where a model state encodes the pose of the head using 3 rotational and 3 translational parameters.
![]()
Feature Extraction
Features are the locations of the outer corners of both eyes and one nostril. A template matching technique is used to locate the features required for head tracking in each frame. The templates are initialised manually on the first frame and updated automatically in subsequent frames using a weighted average of the initial and the current templates.
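A minimal sketch of this template matching and update step follows. The abstract does not name the similarity measure or the weighting, so the sum of squared differences (SSD), the weight `alpha`, the reading of "current template" as the patch just matched, and all function names are assumptions for illustration.

```python
def ssd(image, template, top, left):
    """Sum of squared differences between the template and an image patch."""
    return sum(
        (image[top + r][left + c] - template[r][c]) ** 2
        for r in range(len(template))
        for c in range(len(template[0]))
    )


def match(image, template):
    """Exhaustive search for the (row, col) of the patch with the lowest SSD."""
    h, w = len(template), len(template[0])
    positions = [
        (r, c)
        for r in range(len(image) - h + 1)
        for c in range(len(image[0]) - w + 1)
    ]
    return min(positions, key=lambda p: ssd(image, template, *p))


def crop(image, top, left, h, w):
    """Extract the h x w patch whose top-left corner is (top, left)."""
    return [row[left:left + w] for row in image[top:top + h]]


def update_template(initial, current, alpha=0.7):
    """Weighted average of the initial template and the current patch."""
    return [
        [alpha * a + (1.0 - alpha) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(initial, current)
    ]


# Usage on a tiny synthetic greyscale frame (nested lists of intensities).
frame = [[0] * 5 for _ in range(5)]
frame[2][1], frame[2][2], frame[3][1], frame[3][2] = 1, 2, 3, 4
initial_template = [[1, 2], [3, 4]]
top, left = match(frame, initial_template)
template = update_template(initial_template, crop(frame, top, left, 2, 2))
```

Blending with the initial template rather than replacing it outright limits drift: the template can follow gradual appearance changes while staying anchored to the manually selected feature.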
MOUTH SHAPE DETECTION AND CORRECTION
Assuming that the mouth lies and moves within the flat face plane, the mouth shape is first detected in the image. The detected shape is the 2D projection of the mouth shape on the face plane, whose 3D orientation is already known from the 3D head tracker. The correction is made by projecting the detected mouth shape back onto the face plane in 3D and measuring it within that plane, as shown in the figure below.
Thus the corrected mouth shape represents the mouth shape as it would appear if the face plane were normal to the viewing direction of the camera. The mouth shape is detected by tracking the corners, top and bottom of the outer lip contour.
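The back-projection can be sketched as a ray-plane intersection. This is an illustrative sketch only: it assumes a pinhole camera at the origin with focal length `focal` in pixels, takes the face plane to be the model plane z = 0 moved by the tracked rotation `R` and translation `t`, and uses hypothetical function names and mouth dimensions.

```python
import math


def back_project(pixel, R, t, focal):
    """Intersect the camera ray through `pixel` with the face plane.

    R (3x3 nested lists) and t (3-vector) are the tracked head rotation and
    translation. Returns the point in face-plane coordinates (x, y).
    """
    d = (pixel[0], pixel[1], focal)          # ray direction from the camera
    n = [R[i][2] for i in range(3)]          # plane normal: third column of R
    s = (sum(n[i] * t[i] for i in range(3))
         / sum(n[i] * d[i] for i in range(3)))
    p = [s * d[i] - t[i] for i in range(3)]  # intersection, relative to t
    # Express in the head frame (R^T p); the z component is zero by design.
    return tuple(sum(R[i][k] * p[i] for i in range(3)) for k in range(2))


def corrected_mouth_size(left, right, top, bottom, R, t, focal):
    """Mouth width and height measured within the face plane."""
    l, r, u, b = (back_project(p, R, t, focal)
                  for p in (left, right, top, bottom))
    width = math.hypot(r[0] - l[0], r[1] - l[1])
    height = math.hypot(b[0] - u[0], b[1] - u[1])
    return width, height


def to_pixel(point, R, t, focal):
    """Forward projection, used here only to build a synthetic example."""
    X = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    return (focal * X[0] / X[2], focal * X[1] / X[2])


# Synthetic usage: head turned 30 degrees about the vertical axis.
a = math.radians(30.0)
R = [[math.cos(a), 0.0, math.sin(a)],
     [0.0, 1.0, 0.0],
     [-math.sin(a), 0.0, math.cos(a)]]
t = [0.0, 0.0, 60.0]
mouth = [(-2.5, -6.0, 0.0), (2.5, -6.0, 0.0), (0.0, -5.0, 0.0), (0.0, -7.0, 0.0)]
pixels = [to_pixel(p, R, t, 800.0) for p in mouth]
width, height = corrected_mouth_size(*pixels, R, t, 800.0)
```

The sizes returned here are in the units of the head model rather than pixels. Multiplying them by `focal / t[2]` would give an approximate frontal-view size in pixels, if that is the form the recogniser expects.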
EXPERIMENTAL RESULTS
An experiment was conducted in which the speaker was allowed to rotate his head by up to 30 degrees about all axes, in addition to slight translational movement, while speaking a sequence of phonemes ('W'-'long E'-'M'-'short A'-'long O'). The system successfully tracked the changes in the 3D head parameters. The tracking results can be viewed as an MPEG movie.
Mouth shapes are effectively detected and corrected throughout the sequence. The figure below shows the results of the head tracking as well as the mouth detection and correction. The plots illustrate the changes in parameter values over time. (a) Head translations along the x, y and z axes are shown dotted, dashed and solid respectively. (b) Head rotations about the x, y and z axes are shown dotted, dashed and solid respectively. (c) Detected (uncorrected) and corrected mouth heights are shown dotted and dashed respectively. (d) Detected (uncorrected) and corrected mouth widths are shown dotted and dashed respectively.