Visual Speech Recognition Using Cepstral Images
Automatic lipreading provides an alternative or additional communication channel to speech recognition in acoustically noisy environments. Lipreading requires not only the lip contours but also inner-lip information such as the appearance of the tongue and teeth. Representing and extracting such features from the visual input is a challenging task. While 2D image templates of the inner lip are often used for visual speech recognition, they form a very high-dimensional input to the classifier, usually a Hidden Markov Model (HMM). To deal with this problem, we adopted a gesture recognition technique used by Kurita and Hayamizu [1] and extended it to lipreading. As in Kurita's gesture recognition system, we combine the Linear Predictive Coding (LPC) technique with the higher order local autocorrelation (HLAC) feature extraction technique. However, whereas that system uses PARCOR coefficients from the LPC analysis of the image sequence, we use Cepstral coefficients, which are known to be more robust and reliable parameters for acoustic speech recognition. A feasibility study was conducted by recognising words spoken by a single individual. The experiment used visual-only input to recognise the spoken numbers from one to twelve. Using a simple template matching technique, the system correctly recognised 8 of the 12 numbers. Failed recognition was caused either by the ambiguity of visual-only information or by head movement during speech.
Linear Predictive Coding (LPC) Analysis of 2D images
The LPC technique comes from acoustic speech recognition. It represents the continuous speech waveform as a sequence of equally spaced feature vectors, where each feature vector describes the waveform over the corresponding interval of the speech. Linear predictive analysis approximates each speech sample as a linear combination of past speech samples. By minimising the sum of the squared differences (over a finite interval) between the actual samples and the linearly predicted ones, a unique set of predictor coefficients can be determined. Possible parameter sets include the LPC coefficients, the PARCOR coefficients, and others, but the cepstral coefficients are known to be a more reliable and robust feature set for speech recognition.
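In symbols, the prediction and the error being minimised can be written as below. The notation is an assumption that follows the usual textbook convention (as in [2]): s(n) is the signal, p is the predictor order, and a_k are the predictor coefficients.

```latex
\hat{s}(n) = \sum_{k=1}^{p} a_k \, s(n-k),
\qquad
E = \sum_{n} \bigl( s(n) - \hat{s}(n) \bigr)^{2}
```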
While acoustic speech recognition deals with one-dimensional signals, visual speech recognition deals with 2D image sequences. In applying LPC analysis to the image sequence, we treat the normalised intensity values of each pixel over time as the speech signal. First, the autocorrelation values are calculated. Second, using Durbin's method, we convert the autocorrelation coefficients into the LPC parameter set [2]. Finally, the 1st and 2nd order Cepstral coefficients derived from the LPC parameter set at every pixel are used to form two Cepstral images.
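The three steps above can be sketched per pixel as follows. This is a minimal illustration and not the authors' implementation: the function names, the LPC order of 2, and the use of per-pixel mean removal as the "normalisation" are all assumptions, since the text does not specify them.

```python
import numpy as np


def autocorrelation(x, order):
    """Autocorrelation values r[0..order] of a 1D signal."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([np.dot(x[: n - k], x[k:]) for k in range(order + 1)])


def levinson_durbin(r, order):
    """Durbin's recursion: autocorrelation -> predictor and PARCOR coefficients.

    Uses the convention x[n] ~ sum_k a[k] * x[n - k], k = 1..order.
    """
    a = np.zeros(order + 1)       # a[0] is unused
    parcor = np.zeros(order + 1)  # reflection (PARCOR) coefficients
    err = float(r[0])             # prediction error energy
    for i in range(1, order + 1):
        if err <= 1e-12:          # flat signal: leave remaining terms at zero
            break
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        new_a = a.copy()
        new_a[i] = k
        new_a[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a, parcor[i] = new_a, k
        err *= 1.0 - k * k
    return a[1:], parcor[1:]


def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c[1..n_ceps] of the all-pole model (n_ceps <= len(a))."""
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = sum((k / n) * c[k] * a[n - k - 1] for k in range(1, n))
        c[n] = a[n - 1] + acc
    return c[1:]


def cepstral_images(frames, order=2):
    """Return the 1st and 2nd order Cepstral images of a (T, H, W) sequence.

    Assumptions: order >= 2, and normalisation is per-pixel mean removal.
    """
    frames = np.asarray(frames, dtype=float)
    frames = frames - frames.mean(axis=0)
    _, height, width = frames.shape
    out = np.zeros((2, height, width))
    for y in range(height):
        for x in range(width):
            r = autocorrelation(frames[:, y, x], order)
            a, _ = levinson_durbin(r, order)
            out[:, y, x] = lpc_to_cepstrum(a, 2)
    return out[0], out[1]
```

The same recursion also yields the PARCOR coefficients, so swapping the returned values would give PARCOR images of the kind used in [1].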
Feature Extraction
From the two Cepstral images, the higher order local autocorrelation (HLAC) features are extracted and combined into a single 70-dimensional feature vector, which serves as the visual feature. This is illustrated in the figure below.
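Separately from the figure, the extraction step can be sketched in code. The text does not list the mask set, so this sketch assumes the common grey-level HLAC formulation: displacement patterns up to second order inside a 3x3 window, with patterns that differ only by a translation counted once. Under that assumption there are 35 features per image, which would account for the 70 dimensions obtained from two Cepstral images. The function names are hypothetical.

```python
from itertools import combinations_with_replacement

import numpy as np

# The nine positions of a 3x3 window, relative to the reference pixel.
OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def hlac_masks(max_order=2):
    """Enumerate displacement patterns, keeping one per translation class."""
    masks = {}
    for order in range(max_order + 1):
        for extra in combinations_with_replacement(OFFSETS, order):
            pts = ((0, 0),) + extra  # the reference pixel is always included
            min_y = min(p[0] for p in pts)
            min_x = min(p[1] for p in pts)
            # Translation-invariant key: shift the pattern to the origin.
            key = tuple(sorted((y - min_y, x - min_x) for y, x in pts))
            masks.setdefault(key, pts)
    return list(masks.values())


def hlac_features(image, masks):
    """Sum, over all interior pixels, the product of the pixels in each mask."""
    img = np.asarray(image, dtype=float)
    h, w = img.shape
    feats = []
    for pts in masks:
        prod = np.ones((h - 2, w - 2))
        for dy, dx in pts:
            prod *= img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        feats.append(prod.sum())
    return np.array(feats)


def visual_feature(cepstral_1, cepstral_2):
    """Concatenate the HLAC features of the two Cepstral images."""
    masks = hlac_masks()
    return np.concatenate(
        [hlac_features(cepstral_1, masks), hlac_features(cepstral_2, masks)]
    )
```

Because each feature is a sum over the whole image, the vector length does not depend on the image size, which is what keeps the classifier input small compared with raw image templates.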
The purpose of this experiment was to investigate whether the HLAC features from the Cepstral images could be used to distinguish words spoken by a single speaker. Two sequences for each of the numbers from 'one' to 'twelve' were recorded at 25 Hz. One sequence is used as the template and the other as the test. Simple template matching is used to classify the words.
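A template matcher of this kind can be sketched as a nearest-template search. The distance measure is not stated in the text, so Euclidean distance is an assumption here, and the function name and the toy vectors are purely illustrative.

```python
import numpy as np


def classify(test_vector, templates):
    """Return the label of the template closest to the test vector.

    templates maps a word label to its feature vector.
    Euclidean distance is an assumption, not taken from the source.
    """
    test_vector = np.asarray(test_vector, dtype=float)
    best_label, best_dist = None, float("inf")
    for label, template in templates.items():
        dist = np.linalg.norm(test_vector - np.asarray(template, dtype=float))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist


# Toy usage with made-up 3-dimensional vectors (the real vectors have 70).
toy_templates = {"one": [1.0, 0.0, 0.0], "two": [0.0, 1.0, 0.0]}
label, distance = classify([0.9, 0.1, 0.0], toy_templates)
```

With one template per word, every test sequence is always assigned to some word, so ambiguous inputs show up as misclassifications and not as rejections.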
Two experiments were conducted: one using features from the whole image, and the other using features from mouth-only images. An example image sequence, together with the corresponding Cepstral images and feature vectors, is shown below.
With the whole-image feature vectors, the system correctly recognised 6 of the 12 numbers (50%). With the mouth-only feature vectors, it correctly recognised 8 of the 12 numbers, a 67% success rate.
In mouth-only feature recognition, incorrect recognition was observed to have one of two causes. The first is the ambiguity of visual-only data, where some numbers look similar when spoken, for example 'five', 'eight' and 'nine'. The second is movement of the speaker's head, which causes a mismatch between the test features and the template features and results in an incorrect recognition.
We have also experimented with PARCOR images in place of Cepstral images. For the mouth-only images, the system recognised only 5 of the 12 numbers in our data set, so the Cepstral images produced a better result (8 correct versus 5).
Currently, the whole input sequence is treated as a single frame block for LPC analysis. For longer speech, the system will divide the input sequence into smaller frame blocks and generate a sequence of feature vectors for recognition. An example is shown below.
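Apart from that example, the block division itself can be sketched as a sliding window over the frames. The block length and hop size are not given in the text, so the parameters block_len and hop below are hypothetical.

```python
import numpy as np


def frame_blocks(frames, block_len, hop):
    """Split a (T, H, W) sequence into blocks of block_len frames.

    Consecutive blocks start hop frames apart; trailing frames that do not
    fill a whole block are dropped. Both parameters are assumptions.
    """
    frames = np.asarray(frames)
    last_start = len(frames) - block_len
    return [frames[s:s + block_len] for s in range(0, last_start + 1, hop)]


# Toy usage: 10 frames of 2x2 pixels, blocks of 4 frames, advancing by 2.
toy_sequence = np.zeros((10, 2, 2))
blocks = frame_blocks(toy_sequence, block_len=4, hop=2)
```

Each block would then go through the same LPC and HLAC steps as the whole sequence does now, giving one feature vector per block.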
Various extensions to this work are under development.
[1] T. Kurita and S. Hayamizu. Gesture recognition using HLAC features of PARCOR images and HMM based recogniser. Proc. of the Int. Conf. on Automatic Face and Gesture Recognition, pp. 422-427, 1998.
[2] L. Rabiner and R. Schafer. Digital Processing of Speech Signals. Prentice-Hall Signal Processing Series, 1978.
[3] F. Goudail, E. Lange, T. Iwamoto, K. Kyuma, and N. Otsu. Face recognition system using local autocorrelations and multiscale integration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(10):1024-1028, October 1996.