
Lipreading feature extraction using linear predictive coding analysis and higher order local autocorrelation features
Investigators: Dr. Eun-Jung Holden, Professor Robyn Owens
Last updated: 6 March 2000.
[This is an ARC supported project.]
1. Overview
Automatic lipreading provides an alternative or additional communication
channel to speech recognition in an acoustically-noisy environment.
Lipreading requires not only the lip contours but also inner lip information
such as the appearance of the tongue and teeth. Representing and
extracting such features from the visual input is a challenging task.
While 2D image templates of the inner lip are often used for visual
speech recognition, they form a very high-dimensional input to the classifier,
which is usually a Hidden Markov Model (HMM). We applied the gesture
recognition technique of Kurita and Hayamizu [1] to lipreading. It
combines the Linear Predictive Coding (LPC) technique with
the higher order local autocorrelation (HLAC) feature extraction technique.
LPC is an acoustic speech analysis technique that represents
a continuous speech waveform as a sequence of equally spaced feature
vectors, where each feature vector describes the waveform over the corresponding
interval of the speech. Here, LPC is used to represent
the image sequence as a collection of partial correlation (PARCOR)
coefficients computed from each pixel's intensity change over time, forming the PARCOR
images. HLAC features are then extracted from the
PARCOR images, producing a single 70-dimensional feature vector
as the visual feature for lipreading. This is illustrated in the
figure below.

2. Linear Predictive Coding and Higher Order Local Autocorrelation Features
In speech recognition, linear predictive analysis
approximates a speech sample as a linear combination of past speech
samples. Minimizing the sum of the squared differences (over a finite
interval) between the actual samples and the linearly predicted ones
determines a unique set of predictor coefficients. To solve for
these LPC parameters, our lipreading system uses the autocorrelation method,
which, together with the covariance method, is among the most commonly used
approaches in speech recognition. While acoustic speech recognition deals with
one-dimensional signals, visual speech recognition deals with 2D image
sequences. To apply LPC analysis to an image sequence, we
treat the normalised luminance values of each pixel over time as the
speech signal. Firstly, the autocorrelation values are calculated.
Secondly, using Durbin's method, we convert the autocorrelation coefficients
into the LPC parameter set [2]. Then the 1st and 2nd order PARCOR
coefficients of the LPC parameter set are used to form the PARCOR images.
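
The per-pixel steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the project's implementation: the function name `parcor_images`, the `eps` guard for dark pixels, and the choice to treat the whole sequence as a single unwindowed frame are all assumptions. The sign convention for PARCOR coefficients also varies between texts; the one below takes the first coefficient as r(1)/r(0).

```python
import numpy as np


def parcor_images(frames, eps=1e-12):
    """Return the 1st and 2nd order PARCOR images of an image sequence.

    frames: array of shape (T, H, W) holding normalised luminance values.
    The whole sequence is treated as one analysis frame (an assumption).
    """
    s = np.asarray(frames, dtype=np.float64)

    # Autocorrelation of each pixel's time series at lags 0, 1 and 2.
    r0 = np.sum(s * s, axis=0)
    r1 = np.sum(s[:-1] * s[1:], axis=0)
    r2 = np.sum(s[:-2] * s[2:], axis=0)

    # Durbin's recursion, order 1: k1 = r1 / r0, prediction error E1.
    ok0 = r0 > eps
    k1 = np.where(ok0, r1 / np.where(ok0, r0, 1.0), 0.0)
    e1 = (1.0 - k1 ** 2) * r0

    # Durbin's recursion, order 2: k2 = (r2 - a1 * r1) / E1 with a1 = k1.
    ok1 = e1 > eps
    k2 = np.where(ok1, (r2 - k1 * r1) / np.where(ok1, e1, 1.0), 0.0)

    # Pixels with (near) zero energy are assigned 0 in both images.
    return k1, k2
```

Calling `parcor_images(frames)` on a `(T, H, W)` array returns two `(H, W)` arrays whose values lie in [-1, 1], which play the role of the 1st and 2nd order PARCOR images.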
Once the PARCOR images are generated, the HLAC
features are extracted using autocorrelation functions up to the 2nd order
[3]. This generates a 35-dimensional feature vector for each PARCOR
image, so a 70-dimensional feature vector is produced from the 1st and
2nd order PARCOR images.
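
As a rough illustration of where the 35 dimensions come from, the sketch below enumerates the displacement patterns of order 0, 1 and 2 inside a 3 x 3 window, discards patterns that are translations of one another, and sums the product of pixel values for each remaining pattern. The 3 x 3 window, the function names `hlac_patterns` and `hlac_features`, and the handling of image borders (only interior pixels are used as reference points) are assumptions made for this example; the masks used in [3] may be ordered or normalised differently.

```python
from itertools import combinations_with_replacement

import numpy as np

# Offsets (dy, dx) inside the 3x3 window around the reference pixel.
WINDOW = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def _canonical(points):
    """Pick one representative among all translations of a pattern
    that keep a pattern point on the reference pixel and fit the window."""
    candidates = []
    for py, px in set(points):
        shifted = sorted((y - py, x - px) for y, x in points)
        if all(abs(y) <= 1 and abs(x) <= 1 for y, x in shifted):
            candidates.append(tuple(shifted))
    return min(candidates)


def hlac_patterns(max_order=2):
    """Enumerate translation-distinct patterns up to max_order."""
    patterns = set()
    for order in range(max_order + 1):
        # Repeated offsets are allowed, which matters for grey-level images.
        for extra in combinations_with_replacement(WINDOW, order):
            patterns.add(_canonical(((0, 0),) + extra))
    return sorted(patterns)


def hlac_features(image, patterns=None):
    """Sum, over interior pixels, the product of the values at each pattern."""
    f = np.asarray(image, dtype=np.float64)
    if patterns is None:
        patterns = hlac_patterns()
    h, w = f.shape
    feats = []
    for pattern in patterns:
        prod = np.ones((h - 2, w - 2))
        for dy, dx in pattern:
            prod = prod * f[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        feats.append(prod.sum())
    return np.array(feats)
```

With two PARCOR images held in hypothetical arrays `parcor1` and `parcor2`, `np.concatenate([hlac_features(parcor1), hlac_features(parcor2)])` would give a 70-dimensional vector of the kind described above.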
3. Experimental Results
A preliminary evaluation was conducted with two
participants who have quite different speech patterns. The purpose
of this exercise was to investigate whether the HLAC features from the PARCOR
images could be used to distinguish words for a single speaker, and to
assess the similarity between individuals for a single word.
Two subjects were asked to say `YES' and `NO'.
The following figure shows each word as spoken by both subjects, starting
from the neutral mouth position. Each speech sequence consists of color
images of size 328 x 228.

The following figure shows the PARCOR images produced
from the input sequences, as well as the HLAC feature vector which is plotted
for observation.

There appear to be no similarities in the features for a given word
between the individuals, because a much larger area of facial movement
was detected for subject 1 than for subject 2. The lip area was therefore
manually selected and the features were extracted from the selected area only, as shown
in the figure below.

With this selection, there are more similarities for a given word between individuals, and
the distinction between words within an individual appears more recognisable.
4. On-going Development
We are currently integrating the snake algorithm
to extract lip areas with the feature extraction technique reported here.
These feature vectors will be classified by HMMs. Other on-going
investigations include the following:
- The possible use of other LPC parameter sets, such
  as cepstral coefficients, which are reported to be a more robust and reliable
  feature set for speech recognition.
- The size of the frame in the frame blocking process
  for spoken words.
- Further reduction of the feature dimension using
  principal component analysis or a similar technique.
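
For the last item, a principal component projection of a set of feature vectors could look like the sketch below. It is a generic illustration rather than anything evaluated in this project: the function name `pca_project` and the number of retained components are hypothetical, and the input would be a matrix with one 70-dimensional feature vector per row.

```python
import numpy as np


def pca_project(features, n_components):
    """Project row-wise feature vectors onto their leading principal axes."""
    x = np.asarray(features, dtype=np.float64)
    mean = x.mean(axis=0)
    centred = x - mean

    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    components = vt[:n_components]

    # Coordinates of every sample in the reduced space.
    return centred @ components.T, components, mean
```

A new feature vector `v` can be mapped into the same reduced space with `(v - mean) @ components.T`.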
References
[1] T. Kurita and S. Hayamizu. Gesture recognition
using HLAC features of PARCOR images and HMM based recogniser. Proc.
of the Int. Conf. on Automatic Face and Gesture Recognition, pp. 422-427, 1998.
[2] L. Rabiner and R. Schafer. Digital
Processing of Speech Signals. Prentice-Hall Signal Processing
Series, 1978.
[3] F. Goudail, E. Lange, T. Iwamoto, K. Kyuma,
and N. Otsu. Face recognition system using local autocorrelations
and multiscale integration. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 18(10):1024-1028, October 1996.