Lip Language Recognition for Specific Words



Introduction
Owing to the rapid growth of image processing technology and computer hardware, computer input interfaces have become increasingly important over the past few decades. Many research institutes have devoted their efforts to automated lip reading in the last decade and obtained useful results. Several articles serve as references for lip reading research; however, most of them discuss only methods for extracting or tracking lip shape features, not a complete lip reading recognition system (Chiang et al., 2003; Yao et al., 2010). In this research, a lip reading recognition system is implemented. When a user sits in front of a CCD camera, the system tracks the lip position on the user's face and recognizes whether the lip movement for a word's pronunciation is correct. The system can thus be used as a lip shape control interface.
At the time of writing, very little research had examined lip-language applications as tools to enhance human-computer communication. One reason is that lip reading can recognize only those pronunciations that produce obvious changes in the lips and the outward appearance of the oral area. Many sounds articulated at the back of the throat or inside the oral cavity are not easily recognized from lip movements. As a result, only about 30% of Mandarin characters can be distinguished by lip reading.
Although lip reading is not entirely accurate, many people without good hearing ability rely on it every day, especially in noisy environments. According to research from the University of Manchester (Bauman, 2003), hearing-impaired subjects could recognize only 21% of spoken language. With a hearing aid, the recognition rate increased to 65%; with both a hearing aid and lip reading, it reached 90%. Lip reading can therefore significantly improve communication ability. For this reason, we implemented a lip language recognition system using image processing techniques, a neural network algorithm, and a database to create an interactive human-computer system. The system detects the dynamic variation of the changing lip shape so that it can serve as a lip shape control interface.

Image Processing Steps
The pre-processing of the lip reading image extracts the lip portion from the color image by localizing the lip position. The steps include brightness adjustment, color space transformation, skin color segmentation, facial operation region selection, and lip localization. The lip portion obtained from pre-processing is the identification target of all subsequent processes. The recognition stage uses principal component analysis (PCA) together with the lip changing rate as the lip characteristic values. These characteristic values are then learned by a neural network so that the lip image analysis yields recognition results. The flow chart of the image processing steps is shown in Figure 1.

Position Detection and Lip Tracking
In this research, skin color is the major feature used to find the lip position for the recognition system. The system input is a color image grabbed by a CCD camera. The facial area detection locates the human face region through the following steps: lighting compensation, color segmentation, and skin color detection.

Lighting Compensation
The traditional gamma correction is an image adjustment technique that uses a power operation to change the brightness of an image and can refine gray scale images:

V_out = c (V_in + e)^γ (1)

where V_in is the gray scale of the input pixel, V_out is the gray scale of the output pixel, c and γ are constants, and e is a parallel shift value. Since e is small, it can be neglected.
The traditional gamma procedure processes the whole image with a constant γ, so it cannot adjust every region well. Hence, an adaptive gamma procedure is used to obtain better gray scale enhancement, as seen in Eq. (2), where the gray scale of the input pixel V_in lies between 0 and 255, V_out is the gray scale of the output pixel, a is a constant 0.5, and x0 and x1 are the threshold values at the two ends. This method can be applied to color or black-and-white images; for color images, the red, green, and blue channels are processed one by one. Figure 2 compares the bright and dark parts of images processed by the traditional and adaptive gamma procedures. The adaptive gamma result in Figure 2(c) shows better contrast between the bright and dark parts than the traditional gamma result.
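The two correction steps above can be sketched in code. The traditional form follows Eq. (1) directly; since the exact piecewise form of the paper's adaptive Eq. (2) is not reproduced in the text, the adaptive variant below is an assumed illustration that brightens pixels below x0 and darkens pixels above x1 with the exponent a = 0.5, leaving the mid-range unchanged.

```python
def gamma_correct(v_in, c=1.0, gamma=0.5, e=0.0):
    """Traditional gamma correction of Eq. (1): V_out = c * (V_in + e)^gamma.
    The pixel value is normalized to [0, 1] before the power operation."""
    v = (v_in + e) / 255.0
    return min(255, int(round(c * (v ** gamma) * 255)))

def adaptive_gamma(v_in, x0=64, x1=192, a=0.5):
    """Adaptive variant in the spirit of Eq. (2) (exact form assumed):
    dark pixels (below x0) are brightened, bright pixels (above x1) are
    darkened symmetrically, and mid-range pixels pass through unchanged."""
    if v_in < x0:
        return int(round(x0 * (v_in / x0) ** a))                      # brighten dark end
    if v_in > x1:
        return int(round(255 - (255 - x1) * ((255 - v_in) / (255 - x1)) ** a))  # darken bright end
    return v_in                                                       # mid-range unchanged
```

For a color image, each of the R, G, and B channels would be passed through the same function pixel by pixel, as the text describes.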

RGB and YCbCr Color Space Transformation
Skin color is an important recognition feature of the human face area. Skin color varies with age and race, but brightness is the major factor that influences it. By normalizing the brightness of skin images, the detection error of facial skin can be minimized and the skin portion detected from the integrity of skin color clustering. A suitable color model is therefore needed to remove the brightness factor from a test image. YCbCr is a reasonable representation in color space: image colors are stated by the luminance Y together with Cb and Cr, which represent the blue-difference and red-difference chroma components. The YCbCr model is adopted by the MPEG and JPEG image compression formats and suits the needs of digital image processing. From Chai's research (Chai & Bouzerdoum, 2000), there is no large difference among human skin colors when YCbCr is used to extract the skin color portion. The YCbCr normalization model is shown in Eq. (4).
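As a concrete reference for the transformation, the standard ITU-R BT.601 RGB-to-YCbCr matrix (the basis on which YCbCr skin models such as the paper's are built) can be written as follows; the additional normalization step of Eq. (4) is not reproduced here.

```python
def rgb_to_ycbcr(r, g, b):
    """Standard ITU-R BT.601 full-range RGB -> YCbCr conversion.
    Y is luminance; Cb and Cr are the blue- and red-difference chroma
    components, offset by 128 so they stay in [0, 255]."""
    y  =       0.299    * r + 0.587    * g + 0.114    * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5      * b
    cr = 128 + 0.5      * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr
```

Note that pure white and pure black both map to Cb = Cr = 128, i.e. the chroma channels are brightness-free, which is exactly why this space helps separate skin color from lighting.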

Segmentation of the Skin Portion
The normalized RGB model can decrease the effect of image brightness, but it cannot decrease the effects of hue and saturation. Hence, using both the YCbCr and RGB models to segment the skin portion of the image is a better method (Kong & Zhu, 2006). From experiments, the skin portion distribution of the RGB model is shown in Eq. (5). The face skin distribution map of the r (red) and g (green) pixels is shown in Figure 3. The combination of the YCbCr and RGB models can segment the skin portion of the face image. A test image processed by the RGB and YCbCr models produced the binary output image shown in Figure 5, where the skin portion of the face is clearly segmented.

Lip Color Segmentation
After the face skin has been segmented, the next step is to segment the lip area. By choosing a suitable R/G range in the RGB model (Hsu et al., 2002), the lip portion is easy to extract, as shown in Figure 6. The R/G range found from experiments is shown in Eq. (7).

Enhancement of the Lip Feature
Most of the lip area consists of red rather than blue color; hence the lip area has higher Cr values and lower Cb values. Using this property, the lip area is easily enhanced and a clear lip shape can be obtained. Owing to noise, some facial areas have shapes similar to the lips, which can cause false recognition. The inflation (dilation) method is a good way to enhance the lip feature and decrease the recognition error. The algorithm is shown in Eq. (8), and the lip feature and its enhanced result are shown in Figures 7 and 8, respectively.
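The inflation step corresponds to morphological binary dilation; a minimal sketch with a 3×3 structuring element (the element size is an assumption, as Eq. (8) is not reproduced in the text) looks like this:

```python
def dilate(mask, iterations=1):
    """Binary dilation ('inflation'): a pixel becomes 1 if it or any of
    its 8 neighbours is 1, which fills small gaps in the lip mask."""
    h, w = len(mask), len(mask[0])
    for _ in range(iterations):
        out = [[0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx]:
                            out[y][x] = 1
        mask = out
    return mask
```

In practice a library routine such as OpenCV's `cv2.dilate` would replace this loop, but the effect is the same: the lip blob grows and small holes caused by noise close up.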

Region of Interest
Figure 7(b) shows the face skin pixel measurement, and Figure 7(c) shows the lip pixel measurement. Figure 8 shows the located face and lip regions. Principal component analysis (PCA) converts the original variables into a set of independent linear combinations. These combinations, which retain most of the information in the original data, are called the principal components. PCA involves the eigenvalue decomposition of a data covariance matrix or the singular value decomposition of a data matrix (Shaw, 2003). PCA is the simplest of the true eigenvector-based multivariate analyses. Generally, its operation can be thought of as revealing the internal structure of the data in the way that best explains the variance in the data. If a multivariate data set is viewed as a set of coordinates in a high-dimensional data space, PCA supplies the user with a lower-dimensional picture, a "shadow" of this object seen from its most informative viewpoint. The steps of feature extraction with PCA are described as follows.
Step 1. Calculate the covariance matrix C. Convert the N samples of two-dimensional image data into N one-dimensional data arrays. The result is shown in Figure 9.
Step 2. Calculate the eigenvalues and unit eigenvectors of the covariance matrix.
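The two steps above can be sketched with NumPy; the projection dimension k is a free parameter here (the paper keeps 200 PCA values per word).

```python
import numpy as np

def pca_features(images, k=2):
    """PCA sketch of the steps above: flatten N 2-D images into N
    one-dimensional rows, form the covariance matrix, and project onto
    the k unit eigenvectors with the largest eigenvalues."""
    X = np.asarray([img.ravel() for img in images], dtype=float)  # N x D data matrix
    X -= X.mean(axis=0)                       # centre the data
    C = np.cov(X, rowvar=False)               # D x D covariance matrix
    vals, vecs = np.linalg.eigh(C)            # eigh: C is symmetric
    order = np.argsort(vals)[::-1]            # descending eigenvalue order
    W = vecs[:, order[:k]]                    # top-k unit eigenvectors
    return X @ W                              # N x k principal components
```

Because the leading eigenvectors capture most of the variance, the image data is condensed to k numbers per sample without losing its major features, which is exactly the dimensionality reduction the recognition stage relies on.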

Feature Extraction: The Lip Shape Changing Rate
The steps of feature extraction for lip shape changing rate are described as follows.
Step 1. Find the length and width of the lips. The lip image is processed by an edge detection algorithm to find the contour of the lips. The result is shown in Figure 11.
Step 2. Calculate the lip changing rate V with Eq. (12).
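Since Eq. (12) is not reproduced in the text, the sketch below assumes one common definition of a lip changing rate: the frame-to-frame change of the height/width aspect ratio measured from the contour found in Step 1.

```python
def lip_changing_rate(widths, heights):
    """Assumed lip changing rate: the difference of the lip height/width
    aspect ratio between consecutive frames. For 10 frames per word this
    yields 9 values; the paper's exact Eq. (12) may differ."""
    ratios = [h / w for w, h in zip(widths, heights)]          # aspect ratio per frame
    return [ratios[i + 1] - ratios[i] for i in range(len(ratios) - 1)]
```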

Neural Network
The self-organizing map (SOM) neural network algorithm transforms an input signal vector of arbitrary dimension into a one- or two-dimensional discrete map that displays the important statistical characteristics of the input vector. After an input vector is processed by the SOM, a best-matching or winning neuron is found in the output map. Similar input vectors activate the selected neuron and its neighbors simultaneously, which means neurons with similar characteristics group together. The input vector, representing the set of input signals, is denoted by Eq. (13). Eq. (14) is then used to calculate the distance between each output-layer weight vector and the input vector; the best-matching criterion is the minimum Euclidean distance between the vectors (Kohonen, 1990).
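The winning-neuron search and neighborhood update can be sketched as follows; the grid size, epoch count, learning rate, and Gaussian neighborhood function are assumptions for illustration, not the paper's settings.

```python
import math, random

def train_som(data, grid=(4, 4), epochs=50, lr=0.5, seed=0):
    """Minimal SOM sketch: the winner is the neuron whose weight vector
    has the minimum Euclidean distance to the input (Eq. (14)); the
    winner and its grid neighbours are pulled toward the input vector."""
    rng = random.Random(seed)
    dim = len(data[0])
    rows, cols = grid
    weights = {(i, j): [rng.random() for _ in range(dim)]
               for i in range(rows) for j in range(cols)}

    def winner(x):
        # best-matching neuron by minimum squared Euclidean distance
        return min(weights, key=lambda n: sum((wi - xi) ** 2
                                              for wi, xi in zip(weights[n], x)))

    for t in range(epochs):
        radius = max(1.0, rows / 2 * (1 - t / epochs))   # shrinking neighbourhood
        alpha = lr * (1 - t / epochs)                    # decaying learning rate
        for x in data:
            wi, wj = winner(x)
            for (i, j), w in weights.items():
                d = math.hypot(i - wi, j - wj)           # distance on the output grid
                if d <= radius:
                    h = math.exp(-d * d / (2 * radius * radius))
                    for k in range(dim):
                        w[k] += alpha * h * (x[k] - w[k])
    return weights, winner
```

Because neighbouring neurons are updated together, inputs with similar characteristics end up activating nearby map coordinates, which is the clustering behavior the recognition stage exploits.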

The Topological Graph of the Neural Network
The output layer neurons are arranged in the output space with a significant topological structure based on the features of the input vectors. The topological structure of the output layer reflects the distribution relations of the input values; therefore, this network is called a self-organizing feature mapping network.
This mapping graph is also called a topology. Each input vector maps to a coordinate on the topological graph, and the relationship or similarity of any two input vectors can be calculated from the distance between their two output coordinates. During the learning stage, input vectors of similar character gradually move closer together; that is, the distance between the neighbors of similar input vectors decreases to a certain degree.

Experimental Results
In the experiment, 62 words and 10 consecutive lip images per word were used as the database to test the implemented system's performance with 55 users. Each word's image sequence yields 200 PCA values and 10 lip changing rate values; hence, a word is represented by 210 floating point values covering all 10 image features. The self-organizing map is the neural network method used to capture the dynamic variation of the changing lip shape. Table 2, showing the topological graph of the 62 words in the database, reveals that the SOM coordinates for the same word cluster together. The lip language recognition results are shown in Table 3, which gives the lip reading recognition rate of the 62 tested Mandarin words with 55 users. The average correct rate is 85%.

Conclusion
In this study, the proposed system captured consecutive changing lip images and extracted the features of each image using the PCA algorithm and neural network technology. The PCA method decreases the dimensionality of the raw data, allowing the image data to be condensed without losing its major features. The neural network clusters similar image features together for recognition. The recognition rate of the implemented system depends strongly on the lip shape. From the experimental results, the researchers found that the major problem in automatic lip reading is distinguishing the lip shapes of words with similar pronunciations; this is a limitation of all lip reading recognition systems. However, combining the lip reading system with a method for clustering such same-sounding words could make it a good supplementary tool for an optical character recognition system, serving as a novel input device for a computer. Additionally, only 62 words were used to analyze lip reading recognition; further study should enlarge the word pool to make the research more comprehensive.