Lipreading feature extraction using the linear predictive coding analysis and the higher order local autocorrelation features.


Investigators: Dr. Eun-Jung Holden, Professor Robyn Owens
Last updated: 6 March 2000.

[This is an ARC supported project.]

1. Overview

Automatic lipreading provides an alternative or additional communication channel to speech recognition in acoustically noisy environments.  Lipreading requires not only the lip contours but also inner lip information such as the appearance of the tongue and teeth.  Representing and extracting such features from the visual input is a challenging task.  While 2D image templates of the inner lip are often used for visual speech recognition, they form a very high-dimensional input to the classifier, which is usually a Hidden Markov Model (HMM).  We applied the gesture recognition technique of Kurita and Hayamizu [1] to lipreading; it combines the Linear Predictive Coding (LPC) technique with the higher order local autocorrelation (HLAC) feature extraction technique.  LPC is an acoustic speech recognition technique that represents the continuous speech waveform as a sequence of equally spaced feature vectors, where each feature vector represents the waveform over the corresponding interval of the continuous speech.  Here, LPC is used to represent the image sequence as a collection of partial correlation (PARCOR) coefficients describing how each pixel's intensity changes over time, forming the PARCOR images.  The HLAC features are then extracted from the PARCOR images, and a single 70-dimensional feature vector is produced as the visual feature for lipreading.  This is illustrated in the figure below.

2. Linear Predictive Coding and Higher Order Local Autocorrelation Features

In speech recognition, linear predictive analysis approximates a speech sample as a linear combination of past speech samples.  By minimizing the sum of the squared differences (over a finite interval) between the actual samples and the linearly predicted ones, a unique set of predictor coefficients can be determined.  To solve for these LPC parameters, our lipreading system uses the autocorrelation method, which, along with the covariance method, is among the most commonly used in speech recognition.  While acoustic speech recognition deals with one-dimensional signals, visual speech recognition deals with 2D image sequences.  In applying LPC analysis to the image sequence, we treat the normalised luminance values of each pixel over time as the speech signal.  Firstly, the autocorrelation values are calculated.  Secondly, using Durbin's method, we convert the autocorrelation coefficients into the LPC parameter set [2].  Then the 1st and 2nd order PARCOR coefficients of the LPC parameter set are used to form the PARCOR images.
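The per-pixel analysis can be sketched in a few lines of Python.  This is a minimal illustration rather than the project's implementation: the function names (autocorrelation, parcor, parcor_images) are hypothetical, the frames are assumed to be already normalised greyscale images stored as nested lists, and the sign convention for the PARCOR coefficients follows the Durbin recursion as given in [2] (other texts use the opposite sign).

```python
from typing import List, Sequence


def autocorrelation(signal: Sequence[float], max_lag: int) -> List[float]:
    """Short-time autocorrelation r[0..max_lag] of one finite signal."""
    n = len(signal)
    return [sum(signal[t] * signal[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def parcor(signal: Sequence[float], order: int = 2) -> List[float]:
    """Durbin's recursion: return the PARCOR coefficients k_1..k_order."""
    r = autocorrelation(signal, order)
    a = [0.0] * (order + 1)      # predictor coefficients a[1..i]
    err = r[0]                   # prediction error energy E(0)
    ks: List[float] = []
    for i in range(1, order + 1):
        if err <= 0.0:           # all-zero or perfectly predicted signal
            ks.append(0.0)
            continue
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        ks.append(k)
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k       # E(i) = (1 - k_i^2) E(i-1)
    return ks


def parcor_images(frames: Sequence[Sequence[Sequence[float]]],
                  order: int = 2) -> List[List[List[float]]]:
    """Build one image per PARCOR order from a [time][row][column] sequence."""
    height, width = len(frames[0]), len(frames[0][0])
    images = [[[0.0] * width for _ in range(height)] for _ in range(order)]
    for y in range(height):
        for x in range(width):
            series = [frame[y][x] for frame in frames]   # one pixel over time
            for i, k in enumerate(parcor(series, order)):
                images[i][y][x] = k
    return images


if __name__ == "__main__":
    # Toy sequence: four 1 x 2 frames, left pixel constant, right pixel dark.
    toy = [[[1.0, 0.0]] for _ in range(4)]
    first, second = parcor_images(toy)
    print(first, second)
```

Because the autocorrelation method is used, every coefficient lies between -1 and 1, so each PARCOR image has a bounded value range regardless of the length of the sequence.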

Once the PARCOR images are generated, the HLAC features are extracted using the 2nd order autocorrelation functions [3].  This generates a 35-dimensional feature vector for each PARCOR image; thus a 70-dimensional feature vector is produced from the 1st and 2nd order PARCOR images.
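The figure of 35 features per image can be reproduced with a short sketch.  It assumes the usual HLAC construction, which the text above does not spell out: displacements are restricted to a 3 x 3 window around the reference pixel, orders 0, 1 and 2 are all included, and patterns that are translations of one another are counted once.  The names hlac_masks and hlac_features are hypothetical, and the image is a nested list of greyscale values.

```python
from itertools import combinations_with_replacement
from typing import List, Optional, Sequence, Tuple

Offset = Tuple[int, int]
Mask = Tuple[Offset, ...]

# All displacements inside the 3 x 3 window, including the centre itself.
WINDOW: List[Offset] = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def _canonical(points: Mask) -> Mask:
    """Pick one representative among the translates that fit in the window."""
    candidates = []
    for py, px in set(points):
        shifted = tuple(sorted((y - py, x - px) for y, x in points))
        if all(abs(y) <= 1 and abs(x) <= 1 for y, x in shifted):
            candidates.append(shifted)
    return min(candidates)


def hlac_masks(max_order: int = 2) -> List[Mask]:
    """Enumerate displacement patterns up to max_order, one per shift class."""
    masks = set()
    for order in range(max_order + 1):
        for extra in combinations_with_replacement(WINDOW, order):
            masks.add(_canonical(((0, 0),) + extra))
    return sorted(masks, key=lambda m: (len(m), m))


def hlac_features(image: Sequence[Sequence[float]],
                  masks: Optional[List[Mask]] = None) -> List[float]:
    """Sum, over interior pixels, the product of values at each mask's offsets."""
    if masks is None:
        masks = hlac_masks()
    height, width = len(image), len(image[0])
    features = []
    for mask in masks:
        total = 0.0
        for y in range(1, height - 1):
            for x in range(1, width - 1):
                product = 1.0
                for dy, dx in mask:
                    product *= image[y + dy][x + dx]
                total += product
        features.append(total)
    return features


if __name__ == "__main__":
    print(len(hlac_masks()))                      # number of masks
    toy = [[0.5] * 4 for _ in range(4)]           # stand-in for a PARCOR image
    vector = hlac_features(toy) + hlac_features(toy)
    print(len(vector))                            # two images concatenated
```

Under these assumptions the enumeration yields 1 zeroth-order, 5 first-order and 29 second-order patterns, that is 35 features per greyscale image, and concatenating the vectors of two images gives 70 values.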

3. Experimental Results

A preliminary evaluation was conducted with two participants who have quite different speech patterns.  The purpose of this exercise was to investigate whether the HLAC features from the PARCOR images could be used to distinguish words for a single speaker, and to find the similarity between individuals for a single word.

Two subjects were asked to say `YES' and `NO'.  The following figure shows each word as spoken by both subjects, starting from the neutral mouth position.  Each speech sequence consists of color images of size 328 x 228.

The following figure shows the PARCOR images produced from the input sequences, along with a plot of the HLAC feature vector for each.

There appear to be no similarities in the features for a given word between the individuals, because a much larger area of facial movement was detected for subject 1 than for subject 2.  Thus, the lip area was manually selected and the features were extracted from the selected area, as shown in the figure below.

With the lip area selected, there are more similarities for a given word between individuals, and the distinction between the words within an individual seems to be more recognisable.

4. On-going Development

We are currently integrating the snake algorithm to extract lip areas with the feature extraction technique reported here.  These feature vectors will be classified by HMMs.  Other on-going investigations include the following

References

[1] T. Kurita and S. Hayamizu.  Gesture recognition using HLAC features of PARCOR images and HMM based recogniser.  Proc. of the Int. Conf. on Automatic Face and Gesture Recognition, pp. 422-427, 1998.
[2] L. Rabiner and R. Schafer.  Digital Processing of Speech Signals.  Prentice-Hall Signal Processing Series, 1978.
[3] F. Goudail, E. Lange, T. Iwamoto, K. Kyuma, and N. Otsu.  Face recognition system using local autocorrelations and multiscale integration.  IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(10):1024-1028, October, 1996.
