Visual Solutions to the Translation between Spoken and Signed Languages

BACKGROUND

Communication between Auslan and English Speakers

Deaf people in Australia communicate by using the sign language called Auslan. Auslan uses movement of the hands and arms in a signing space defined about the upper body, as well as facial expressions. It is a language unrelated to spoken English, with an entirely different grammatical structure. Thus, there are Auslan users in our community who do not understand English. Many deaf people do, however, understand English, and they can sign using another sign language called Signed English. Nevertheless, Auslan remains the official language of the Australian deaf community.

To build a communication bridge between the deaf and the hearing in our community, there is a need for an automatic two-way communication tool that translates between Auslan and English. Such a tool would be useful in public places such as hospitals and police stations where human interpreters are not immediately available in emergency cases. Moreover, it is a tool that could be integrated into our telecommunication systems, allowing the deaf to fully participate in society via telephones and teleconferencing.

For several years research in the Department of Computer Science & Software Engineering at The University of Western Australia has focussed on the development of an automatic translation system between English and Auslan. This is a daunting challenge requiring developments in many areas. Translation from English to Auslan requires:

  1. Speech recognition, since spoken English must be recognised robustly;
  2. Language mapping from English to Auslan; and
  3. The graphical animation of the Auslan signs.

The reverse translation from Auslan to English requires:

  1. Fine and coarse-grain gesture recognition;
  2. Language mapping from Auslan to English; and
  3. The synthesis of English speech or a text display.

Our work to date provides partial solutions to some of these problems, as outlined below.

Graphics-based solutions

We began our work by developing a system that translates English text into Signed English. In 1992, we developed a prototype of the Hand Sign Translator (HST) (Holden & Roy 1992). The HST system translates typed English sentences into Signed English, animating two-handed movement using computer graphics. It uses human movement animation techniques employing kinematic constraints, where the hand shapes and motion are generated in 3D. The prototype has a tutorial interface, so that the system can help an English speaker to learn Signed English. At the time of development, Signed English was taught in deaf schools in Australia, but shortly afterwards the deaf community moved back to teaching Auslan. Part of the project outlined in this proposal will involve updating the HST system to animate the whole upper body and use Auslan instead of Signed English. With an increased vocabulary the system will then be ready for public use.

Vision-based solutions

As a first step towards the translation from Auslan to English, a prototype of the Hand Motion Understanding (HMU) system was developed (Holden, 1997). The HMU system is a computer-vision based hand sign recognition system that recognises fine-grain static and dynamic hand signs from colour images captured from a single viewpoint, and it generates English text as output. This is achieved by visually tracking 21 degrees of freedom in the hand, specifying the 3D hand orientation parameters consisting of joint angles. We use a 3D model-based object-tracking algorithm. The motion reflecting the change of these parameters is recognised as Auslan signs by a fuzzy expert system. The evaluation (conducted over a set of 22 signs) successfully demonstrated the effectiveness of these techniques by achieving a recognition rate of over 90% (Holden, Roy and Owens, 1996; Holden & Owens, 2001).

In 1998 we were granted an ARC Large Grant A49941023 to work on the visual component of speech recognition, in particular lip reading. In recent years, lip reading has often been used to improve the robustness of audio speech recognition when the acoustic data are degraded by noise (Bregler & Omohundro, 1995). The motivation for combining visual and acoustic data for speech recognition is well justified by the fact that humans use visual information in speech recognition when a listener can see the speaker (McGurk & MacDonald, 1976).

Lip reading has been an important communication tool for the hearing impaired. Speech is composed of individual speech sounds, called phonemes, and when spoken, they show specific movements in the mouth shape including the lips, tongue and the appearance of teeth. However, there are 43 phonemes in the English language, while there exist only 28 different mouth shapes that separate them (Greenwald, 1984). For example, 'd' and 't', or 'f' and 'v', produce the same mouth shape. Therefore, the art of lip reading for humans is context sensitive: it consists not only in visually recognising mouth shapes, but also in mentally recognising key elements to predict the word, as well as further recognising key words to predict the sentence.

Automatic lip reading is, moreover, difficult for both the visual feature extraction and the speech recognition processes. It is, however, important for our project, which requires a robust speech recognition system for the translation from viewed spoken English to Auslan. It is also important in gesture recognition, because it may provide additional information when the deaf signer combines speech and signing.

Cognitive Models of Spoken and Signed Languages

In 1999 we were granted an ARC IREX Grant X00001612 to develop collaboration between the University of Southern California and the University of Western Australia. The objective of that proposal was to capitalize on the findings of Arbib and his colleagues, which show that speech and gestures are controlled by the same part of the brain. Moreover, they have discovered a "mirror system" within the brain, where the observation of grasping movements triggers similar neural responses to the execution of grasping. The focus of the collaboration is on the principled analysis of computational strategies for the recognition of hand movements in general, and of their use in sign language in particular. The UWA group is focussing on the adaptive classification of hand movements using an artificial computation system; the USC group is modelling the human "mirror system" for visually based generation and observation of hand movements using a computer model of biological neural networks in the monkey brain.

Project Objectives

Existing automatic lip reading systems use various feature extraction techniques such as the active contour models of Kass et al. (1987) to extract lip shape (Chiou & Hwang, 1994), or lip image template based features (Vanegas et al. 1998). To date, Bregler & Omohundro (1995) have developed the most successful acoustic and visual speech recognition system. They combine active contour models with learned non-linear manifolds to extract the outer lip contour and the image template inside this outer lip contour; these data then represent the inner mouth appearance. Principal Component Analysis is used to reduce the dimension of the lip image, and the result is sent concurrently with acoustic speech data to their Hidden Markov Model (HMM) classifier. Their system was evaluated using 346 letters uttered by 6 speakers. With acoustic-only input, the recognition rate fell from 89% to 38% when noise was up to 15 decibels with cross-talk. The results show that the visual data improved the performance significantly in noisy environments, particularly in cross-talk experiments.

We have been working on a visual-only speech recognition system that uses a combination of mouth shape dimensions that encode width and height, and a representation of the inner mouth appearance. Currently, we have implemented and tested the following techniques.

Inner Mouth Representation

The inner mouth image is represented using cepstral analysis and Higher-Order Local Auto Correlation (HLAC) features. We have improved the technique of Kurita and Hayamizu (1998) used for gesture recognition, and adapted it for lip reading. An initial evaluation has been conducted to recognise the numbers from 'one' to 'ten' uttered by a single speaker. Our system achieved a 67% recognition rate (Holden & Owens 2000), which is comparable to expert human lip reading results. The failures were caused by the similarity of the visual speech of some numbers as well as the movement of the speaker's head, resulting in distorted features. An example result is shown in Figure 1.

  Figure 1: Cepstral analysis and HLAC feature extraction. We use the mouth-only cepstral images to generate a 70-dimensional feature vector of the utterance.

Lip Contour Tracking

Lip contour tracking is achieved by using a combination of active contour models and template matching. Many modifications of deformable templates exist, but our technique does not require any learning and is adaptive to different speakers. The initial mouth positions are manually selected, and subsequent movements are automatically tracked by a template-matching snake. We implemented a prototype and successfully tested it with image sequences of speakers with various mouth shapes, lip colours, and facial hair distributions (Barnard, Holden & Owens, 2001). Some tracking results are shown in Figure 2.

Figure 2: Mouth tracking result on a subject with poor differentiation between lip and skin colour.

3D Head Tracking

3D head tracking is used to detect the 3D movement of the speaker's head to correct the measured mouth shape dimensions. This technique is an adaptation of our gesture recognition work to head tracking. The 6 degrees of freedom of the head movement parameters are extracted from an image sequence. The outer corners of the eyes and one nostril define the feature points on the face, and the outer corners of the mouth and the mid points of the top and bottom lip boundaries define the feature points on the mouth. These features are manually selected at initialisation and then tracked throughout the sequence. Rotational head movement during speech causes the incorrect measurement of mouth shape dimensions. For example, turning the head sideways causes the detected mouth width to be smaller than the actual width. As speakers naturally move their head by following conversational cues, it is important to deal with this problem. We have implemented a head tracker and successfully tracked 3D head movement, correcting perceived mouth width and height dimensions in real-time. So far, only a single speaker has been used for the evaluation (Holden, Loy & Owens, 2000). Head tracking results are shown in Figure 3.

Figure 3: 3D head tracking result.

Our visual speech recognition techniques are all novel and are not yet used in commercial systems or in systems under development. While each of our prototypes has been successfully tested and published in isolation, each still needs many improvements, and they must all be integrated into a single system.

The aims of our proposed project are thus the following:

1. We aim to develop a robust visual speech recognition system by extending and further developing our solutions to the identified component problems. It is not trivial to build such a complex practical system: to achieve this goal, each of the techniques requires further improvements and extensive tests.

2. We aim to apply our proposed visual speech recognition system to the translation of viewed spoken English to signed Auslan. As well as achieving one half of our total goal, such a system would allow English speakers within Australia's deaf community to use mainstream telecommunications such as teleconferencing, where a computer animation of the spoken English could be generated for the user. Additionally, such techniques could be used to generate a computer animation of the speaker, thus enhancing voice-only systems for all Australians (for example, automated telephone systems giving standard information such as flight arrivals and departures).

3. We aim to integrate our proposed visual speech recognition system with our gesture recognition system to improve its performance and robustness. The result of this integration will be a deeper understanding of the computational models linking language and gesture that we have already begun exploring in the neural models work of Arbib and his colleagues.

SIGNIFICANCE AND INNOVATION

There are three modules to this proposed project: a robust visual speech recognition system, a sophisticated animation system for Auslan, and new insights into the computational models that govern the function of language, gesture and visual recognition.

The significance of a visual speech recognition system is substantial as an aid to audio speech recognition, where it can improve recognition rates substantially (Bregler & Omohundro, 1995). We hypothesise a similar improvement to recognition rates when visual speech recognition is used in conjunction with sign language understanding systems. Such a result will have a major impact on research in modern telecommunication systems using audio as well as visual input and output, for the integration of sound, sight and gesture will provide powerful communication environments.

The innovation of our work in this module relates to the new algorithms we have begun developing for lip tracking, head tracking, and mouth shape representation, and the work we propose in developing a classifier of such data for understanding visual speech.

The significance of the new animation system for Auslan is obvious for the deaf community of Australia, and for all Australians wishing to learn Auslan. It will provide a dynamic tutorial tool for learning Auslan, substantially enhancing the value of the written dictionaries of Auslan signs currently available. The innovation in this module will involve the development of an articulated upper body model designed specifically for language signing, using a 3D language space and kinematic representations tailored to Auslan signs.

In our third module, new cognitive models linking language and gesture will provide a significant insight into understanding the working of the brain. Our innovation in this module is substantial, developing links between hitherto separate cognitive functions.

APPROACH

1. To build a robust visual speech recognition system

The input to the visual speech recognition system is a sequence of grey-scale speech images, and the desired output is a recognised English utterance. Our proposed approach to building this system consists of the following four steps:

Step 1.

The first step is to isolate the mouth region within the image, determine the scale of the image and the pose of the head, and then detect the mouth contour movement and 3D head movement by an extension of our prototypical solutions. Our input to this step will be a sequence of speech images taken from a single viewpoint.

We intend to develop the technique of Okada et al. (1998) that uses frequency space filtering operations to extract features such as the eyes and mouth in facial images. The scale of the image will be determined by normalising for outer eye corner distance. The mouth contour-tracking algorithm automatically detects the mouth dimensions as well as locating the mouth region of interest throughout the image sequence. Our proposed lip tracking technique will use a combination of the active contour models of Williams & Shah (1992) and a 2D template matching technique. In this sense it is a hybrid 1D-2D deformable template. The 1D deformable template (or snake) is an energy minimising spline that usually converges to an object intensity gradient contour within the image. However, when we track the outer lip contours from speech images, the appearance of the teeth and tongue generates a large intensity gradient and causes the snake to diverge from the outer lip contours. Our early experiments show that the approach of combining a 2D template-matching index to drive the spline control points is reliable and adaptive to speakers with various mouth shapes and colours. To make this technique practical, we will collect 2D image templates automatically from the initial images of each speaker and then update the templates throughout the sequence.
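As a rough illustration of how a 2D template-matching index might drive a single contour control point, the sketch below searches a small neighbourhood of the previous control-point position for the location whose surrounding patch best matches a stored template. The function names, the sum-of-squared-differences score and the search radius are assumptions made for this example; the proposal does not specify the matching index used, and the snake's energy terms are omitted.

```python
def ssd(image, template, top, left):
    """Sum of squared differences between the template and the image patch
    whose top-left corner is at (top, left)."""
    total = 0.0
    for r, row in enumerate(template):
        for c, t in enumerate(row):
            d = image[top + r][left + c] - t
            total += d * d
    return total


def best_match(image, template, centre, radius=2):
    """Return the (row, col) within `radius` of `centre` whose surrounding
    patch best matches `template` (lowest SSD).

    `centre` is the previous control-point position; the template is taken
    to be centred on the control point. Both conventions are assumptions.
    """
    th, tw = len(template), len(template[0])
    h, w = len(image), len(image[0])
    best_pos, best_score = centre, float("inf")
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            r, c = centre[0] + dr, centre[1] + dc
            top, left = r - th // 2, c - tw // 2
            # Skip candidate positions whose patch would leave the image.
            if top < 0 or left < 0 or top + th > h or left + tw > w:
                continue
            score = ssd(image, template, top, left)
            if score < best_score:
                best_pos, best_score = (r, c), score
    return best_pos
```

In a full tracker, a score of this kind would be one term pulling each spline control point towards the lip boundary, alongside the usual continuity and curvature terms of the active contour.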

The speaker's head orientation detection technique uses a 3D model-based object-tracking algorithm (Lowe, 1991) that detects the changes of translational and rotational parameters of the head orientation from a sequence of images. Like our hand model, the 3D head model is a kinematic chain that uses 6 degrees of freedom (3 rotational and 3 translational) of the head. Throughout the sequence, the model parameters are corrected by minimising the Euclidean distances between the projected model and the image features. We currently use three user-initialised reference points on the face as features, namely the outer corners of both eyes and the corner of one of the nostrils. For the mouth we use the outer corners of the lips, and the top and bottom mid points. For both the mouth and the head trackers, we will automate the initial selection of features. We plan to adapt the technique of Okada et al. (1998), which uses the Gabor wavelet response to detect the facial features for their face recognition system.
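The quantity minimised by such a model-based tracker can be sketched as follows: a rigid head model with 6 degrees of freedom is projected into the image, and the squared Euclidean distances to the observed feature points are summed. The perspective camera, the focal length, the rotation order and the model coordinates are all assumptions chosen for illustration, and the iterative parameter correction of Lowe (1991) is not shown.

```python
import math


def rotation_matrix(rx, ry, rz):
    """3x3 rotation R = Rz * Ry * Rx for angles in radians (assumed order)."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def project(points, pose, focal=500.0):
    """Project 3D model points into the image for a 6-DOF pose
    (rx, ry, rz, tx, ty, tz) using a simple pinhole camera."""
    rx, ry, rz, tx, ty, tz = pose
    rot = rotation_matrix(rx, ry, rz)
    projected = []
    for x, y, z in points:
        cam_x = rot[0][0] * x + rot[0][1] * y + rot[0][2] * z + tx
        cam_y = rot[1][0] * x + rot[1][1] * y + rot[1][2] * z + ty
        cam_z = rot[2][0] * x + rot[2][1] * y + rot[2][2] * z + tz
        projected.append((focal * cam_x / cam_z, focal * cam_y / cam_z))
    return projected


def reprojection_error(points, pose, observed, focal=500.0):
    """Sum of squared distances between projected model points and the
    image features; this is the value a tracker would drive towards zero."""
    return sum(
        (u - ou) ** 2 + (v - ov) ** 2
        for (u, v), (ou, ov) in zip(project(points, pose, focal), observed)
    )
```

With the three facial reference points as `points`, a tracker would adjust `pose` from frame to frame so that `reprojection_error` stays small.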

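The desired output of Step 1 is a sequence of tracked mouth contours, the corresponding mouth region of interest in each image, and the 6 degrees of freedom of the head orientation for each image.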

Step 2.

The second step is to generate the mouth shape dimensions from the mouth contour tracking result and correct them according to the 3D head orientation.

For this step the input will be a sequence of mouth shape dimensions such as width and height, detected from the mouth contour tracker, and a sequence of 6 degrees of freedom of the head orientations detected from the head tracker. (Note that only 2 degrees of freedom, the turning and nodding movements, are relevant for mouth dimension correction.)

We have shown that mouth dimensions can be corrected (at least for one speaker uttering 5 different sounds) according to the two head orientation parameters. Turning the head affects the mouth width, and nodding affects the mouth height. We will extend this solution to deal with a large population of speakers and an extensive vocabulary.

The output for step 2 will be a sequence of corrected mouth shape width and height dimensions.
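One simple way to picture this correction is to assume a roughly planar mouth viewed under near-orthographic projection, so that turning the head by an angle foreshortens the width by the cosine of that angle, and nodding does the same to the height. The sketch below applies the inverse of that foreshortening. This geometric model, the function name and the angle limit are assumptions for illustration only and may differ from the correction actually used in our prototype.

```python
import math


def correct_mouth_dimensions(width, height, turn, nod,
                             max_angle=math.radians(60)):
    """Undo cosine foreshortening of measured mouth width and height.

    `turn` and `nod` are the two relevant head rotation angles in radians.
    Angles beyond `max_angle` are rejected, since the correction factor
    grows without bound as the cosine approaches zero.
    """
    if abs(turn) > max_angle or abs(nod) > max_angle:
        raise ValueError("head rotation too large for a reliable correction")
    return width / math.cos(turn), height / math.cos(nod)
```

For example, a mouth that is 50 units wide appears about 43.3 units wide when the head is turned by 30 degrees; dividing by cos(30°) recovers the original value.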

Step 3.

The third step is to extract the mouth region of interest image by using the mouth contour tracking result, and to produce the corrected inner mouth representation features.

The input to this step will be a sequence of mouth region of interest images coupled with a sequence of corrected mouth shape dimensions. Each input sequence is divided into equal size frame blocks with overlap between the adjacent frame blocks. Each of the frame blocks will be processed as follows.

The inner mouth representation technique selects the mouth region of interest throughout the image sequence, then aligns and analyses this region along time sequences centred at the pixel midway between the mouth corners. It produces two cepstral images where each cepstral pixel represents the intensity changes within the mouth region of interest over time. An HLAC feature, which is a 35-dimensional vector, is then extracted from each of the cepstral images. Because we use both first- and second-order cepstral images, we generate a 70-dimensional vector.

It is this vector that must now be corrected according to the detected 3D head orientation. We will develop a technique that uses the determined affine transformation of the head motion to modify the representational space of the mouth region of interest. At this stage, dimension-reduction techniques will also be investigated.

The final mouth region of interest feature vector will then be concatenated with the 4 dimensional width and height parameters generated from the two cepstral images, giving the desired output for this step.
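The frame blocking and the per-pixel temporal analysis of this step can be sketched as below. The sequence is cut into equal-size overlapping blocks, and for each pixel the intensity over time is turned into a cepstrum, from which two coefficients are kept to form two cepstral images. Note the assumptions: the sketch uses a DFT-based real cepstrum rather than the LPC-derived analysis on which our technique is based, it interprets "first and second order" as the coefficients at indices 1 and 2, and all function names are hypothetical. HLAC extraction is not shown.

```python
import cmath
import math


def frame_blocks(frames, block_size, overlap):
    """Split a sequence into equal-size blocks that share `overlap` frames.
    Trailing frames that do not fill a whole block are dropped."""
    if block_size <= 0 or not 0 <= overlap < block_size:
        raise ValueError("need block_size > 0 and 0 <= overlap < block_size")
    step = block_size - overlap
    return [frames[i:i + block_size]
            for i in range(0, len(frames) - block_size + 1, step)]


def real_cepstrum(signal, eps=1e-9):
    """Real cepstrum: inverse DFT of the log magnitude of the DFT."""
    n = len(signal)
    spectrum = [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # eps keeps the logarithm finite when a spectral magnitude is zero.
    log_mag = [math.log(abs(s) + eps) for s in spectrum]
    return [
        sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n)
            for k in range(n)).real / n
        for q in range(n)
    ]


def cepstral_images(block, orders=(1, 2)):
    """Build one image per requested cepstral coefficient.

    `block` is a list of frames, each a nested list of grey levels with the
    same height and width; each output pixel summarises how that pixel's
    intensity changes over the block.
    """
    height, width = len(block[0]), len(block[0][0])
    images = [[[0.0] * width for _ in range(height)] for _ in orders]
    for r in range(height):
        for c in range(width):
            ceps = real_cepstrum([frame[r][c] for frame in block])
            for i, order in enumerate(orders):
                images[i][r][c] = ceps[order]
    return images
```

Each block returned by `frame_blocks` would be passed to `cepstral_images`, and the two resulting images would then feed the HLAC feature extraction described above.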

Step 4.

The fourth step is to recognise the utterance by using the mouth shape dimensions and inner mouth representation features.

This step of our proposal involves analysing the output from step 3 above using a Hidden Markov Model. There are excellent freeware Hidden Markov Model implementations available, such as the HTK package (http://htk.eng.cam.ac.uk). We plan to use the HTK for our experiment. Our proposed visual speech recognition technique is a modification of the acoustic speech recognition technique called Linear Predictive Coding (Rabiner & Juang, 1993). The use of a similar process for visual and acoustic speech recognition provides a good foundation for the concurrent processing of both modalities. Expertise in audio speech recognition will be provided by collaboration with Dr. Roberto Togneri in the Department of Electrical and Electronic Engineering at The University of Western Australia. Finally, the output of step 4 will be a recognised English utterance.
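To make the classification step concrete, the sketch below scores an observation sequence against one Hidden Markov Model per vocabulary word using the scaled forward algorithm, and picks the word whose model gives the highest likelihood. It is a simplified stand-in, not the HTK interface: it assumes discrete observation symbols (for example, vector-quantised feature vectors), whereas a practical system would typically use continuous densities, and the model layout and function names are hypothetical.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model).

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability of state s emitting symbol o
    """
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][obs[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        # Rescale at every step to avoid numerical underflow.
        alpha = [a / scale for a in alpha]
        log_prob += math.log(scale)
    return log_prob


def recognise(obs, models):
    """Return the word whose HMM assigns the sequence the highest likelihood.
    `models` maps each word to a (start, trans, emit) tuple."""
    return max(models, key=lambda word: log_likelihood(obs, *models[word]))
```

Training the per-word models, which HTK would handle, is outside the scope of this sketch.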

2. To translate spoken English to signed Auslan

Concurrently with the development of the visual speech recognition system we will extend our initial Hand Sign Translator system to a full upper body animator of signed Auslan. We have begun to build the body model, which consists of several fully object-oriented articulated body segments. Currently, only the framework of the body model is built, and we need to collect the Auslan vocabulary by devising a convenient method for collecting the body kinematic parameters. One solution is to build a sign dictionary editor that enables the system developer to graphically manipulate the model to store its configuration. Furthermore, we will build an interface where images of the spoken English input are translated into Auslan by animating the signing in computer graphics.
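A sign dictionary of the kind described might store each sign as a short list of key postures, with the animator interpolating joint angles between them. The following sketch shows one possible data layout; the class names, joint names and the use of linear interpolation are assumptions made for illustration, not a description of the existing body model.

```python
from dataclasses import dataclass, field


@dataclass
class Posture:
    """Joint angles in degrees for one key posture, keyed by joint name."""
    angles: dict


@dataclass
class SignEntry:
    """One dictionary entry: a gloss and its ordered key postures."""
    gloss: str
    postures: list = field(default_factory=list)


class SignDictionary:
    """Stores sign entries by gloss, as a dictionary editor might save them."""

    def __init__(self):
        self._entries = {}

    def store(self, entry):
        self._entries[entry.gloss] = entry

    def lookup(self, gloss):
        return self._entries[gloss]


def interpolate(a, b, t):
    """Linear blend between two postures for 0 <= t <= 1.
    Both postures must define the same joints."""
    return Posture({j: (1 - t) * a.angles[j] + t * b.angles[j]
                    for j in a.angles})
```

An animator could then step `t` from 0 to 1 between successive key postures of a looked-up sign to produce intermediate frames.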

This project will solve a significant component of the language translation problem to the point where we should be able to parse simple utterances from a restricted domain, such as airline information at the point of checking in luggage, or hospital check-in procedures. The intention of this part of the proposal is to build a practical application that illustrates the success of the visual speech recognition system.

3. To integrate our proposed visual speech recognition system with our gesture recognition system

Our Partner Investigator Arbib is working on computational neural models of grasping movements of the hand (Fagg and Arbib, 1998; Rizzolatti and Arbib, 1998). In biological systems it is known that the observation of hand movements triggers neural responses similar to those triggered by the execution of hand movements. Moreover, the human "mirror system" for matching execution and observation of grasping overlaps Broca's area, a key region of the language system of the brain.

Our hypothesis is that observation of language (either spoken or signed) triggers similar neural responses to the execution of language (either spoken or signed).

Through collaboration with Arbib's group, we intend to integrate our visual speech recognition system with our existing gesture recognition system. Similar techniques are used in both systems. For example, our current visual technique for recognising inner mouth appearances is an adaptation and improvement of the technique by Kurita and Hayamizu (1998) used for gesture recognition. Our work will provide a basis for further neural studies by Arbib and his colleagues to model the underlying computational processes used in both language and gesture. Moreover, the visual speech recognition system that we aim to develop in this project will be used to improve the prediction performance of our current gesture recognition system.

NATIONAL BENEFIT

This proposed project falls directly within information and communication technologies, which have been identified as being of national importance in the Federal Government's recent Innovation Statement. The results of this work will have a direct impact on mainstream telecommunications, which are increasingly moving to visual technologies. More importantly, we are the only group working on such technologies for Australia's deaf community. Thus the national benefit is to the national communications industries, the national knowledge base, the Australian deaf community, and the Australian education sector, where this work will stimulate several Honours, Masters and PhD theses, contribute to conference presentations nationally and internationally, and result in several fully-refereed journal publications.

COMMUNICATION OF RESULTS

This work will continue to be published in international scientific journals and conferences. In particular, we plan to present regular papers at the International Conference on Face and Gesture Recognition. We have begun fruitful collaborations with several research groups (in particular, the Natural Language Group at Macquarie University, the Systems Engineering Department at the Australian National University, the Speech Understanding Research Group at the University at Martigny in Switzerland, the University of Canterbury at Christchurch in New Zealand, The Hankyong University in Korea, and Professor Arbib's group at the University of Southern California in the US). It is our intention to continue exchanges with these groups, particularly at the research student level.

DESCRIPTION OF PERSONNEL

Chief Investigator

Professor Robyn Owens will be program coordinator on this project. She has coordinated the previous work in this area achieved by our group. Her contribution to this proposed project will be in developing the mathematical techniques used for the visual speech recognition system, in particular new methods of representing visual features that have undergone geometric transformations (mouth image correction), dimension-reduction techniques for the mouth shape space, and the animation techniques used in the Auslan animation system. She will be directly responsible for all system integration problems. Additionally, she will supervise all research students associated with this project. (6 person months)

Partner Investigator

Professor Michael Arbib will coordinate the work at USC on extending the modelling of grasping movements to the definition of a mirror system for hand signing. He is directly responsible for all the work in cognitive modelling and he will lead the researchers at USC working in this area. His involvement in this project will be via annual visits to The University of Western Australia for periods of at least one month to integrate the results of his team with those of the UWA team. During those visits he will co-supervise research students working on the integration of the gesture and visual speech recognition systems. (3 person months)

Research Associate

Dr Eunjung Holden is critical to this project, since she has worked on it through her MSc and PhD theses, and is the Research Associate in the current related ARC Large Grant, which finishes in December 2001. She has a working knowledge of Auslan and is intimately involved in all the implemented modules to date. Her role in the proposed project is to design and carry out all data collection experiments, and to develop extensions to the prototype implementations we have for fine gesture recognition, mouth shape analysis, 3D head orientation detection, lip tracking and hand animation. Her CV is attached as an Appendix to this proposal. (36 person months)

REFERENCES

Barnard, M., Holden, E. J. and Owens, R. "Lip tracking using pattern matching snakes" 2001 Technical Report 01/01, Department of Computer Science & Software Engineering, The University of Western Australia.

Benussi, I. A. 2000 A Learning Tool for Sign Language, Honours Thesis (Supervised by Owens, R. and Holden E. J.), Department of Computer Science & Software Engineering, The University of Western Australia.

Bregler, C. and Omohundro, S. M. "Nonlinear manifold learning for visual speech recognition" 1995 Proceedings of 5th International Conference on Computer Vision 494-499.

Chiou, G. I. and Hwang, J. N. "Image sequence classification using a neural network based active contour model and a hidden Markov model" 1994 Proceedings of International Conference on Image Processing 926-930.

Greenwald, A. B. 1984 Lipreading Made Easy Graham Bell Association for the Deaf.

Kass, M., Witkin, A. and Terzopoulos, D. "Snakes: Active contour models" 1987 Proceedings of IEEE First International Conference on Computer Vision, 259-269.

Holden, E. J. 1997 Visual Recognition of Hand Motion (PhD thesis, Department of Computer Science) The University of Western Australia.

Holden, E. J. and Roy, G. G. "The graphical translation of English text into Signed English in the Hand Sign Translator system" 1992 Computer Graphics Forum (Eurographics'92) 11(3) C357-C366.

Holden, E. J., Roy, G. G. and Owens, R. "Hand movement classification using an adaptive fuzzy expert system" 1996 International Journal of Expert Systems V9(4) 465-480.

Holden, E. J., Loy, G., and Owens, R. "Accommodating for 3D head movement in visual lipreading" 2000 Proceedings of the IASTED International Conference on Signal and Image Processing 166-171.

Holden, E. J. and Owens, R., "Visual Speech Recognition using Cepstral Images" 2000 Proceedings of the IASTED International Conference on Signal and Image Processing 331-336.

Holden, E. J. and Owens, R., "Visual Sign Language Recognition" 2001 Multi-Image Search and Analysis, Klette, R., Huang, T., Gimel'farb, G. (Eds) Lecture Notes in Computer Science, Springer (to appear).

Kurita, T. and Hayamizu, S. "Gesture recognition using HLAC features of PARCOR images and HMM based recognizer" 1998 Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition 422-427.

Lee, D. and Seung, S. 1999 "Learning the parts of objects by non-negative matrix factorization" Nature V401:788-796.

Lowe, D. G. "Fitting parameterized three-dimensional models to images" 1991 IEEE Transactions on Pattern Analysis and Machine Intelligence 13(5):441-450.

McGurk, H. and MacDonald, J. "Hearing lips and seeing voices" 1976 Nature V264 December.

Okada, K., Steffens, J., Maurer, T., Hong, H., Elagin, E., Neven, H., and Malsburg, C. "The Bochum/USC Face Recognition System and How it Fared in the FERET Phase III Test" 1998 Face Recognition: From Theory to Applications, Wechsler, H., Phillips, P. J., Bruce, V., Fogelman Soulié, F., and Huang, T. S. (Eds) Springer-Verlag 186-205.

Rabiner, L. and Juang, B. 1993 Fundamentals of Speech Recognition.

Vanegas, O., Tanaka, A., Tokuda, K. and Kitamura, T. 1998 "HMM-based visual speech recognition using intensity and location normalization" Proceedings of International Conference on Spoken Language Processing 289-292.

Williams, D. J. and Shah, M. "A fast algorithm for active contours and curvature estimation" 1992 CVGIP: Image Understanding V55(1):14-26.