An Improved Visual Speech Recognition of Isolated Words using Combined Pixel and Geometric Features

N. Radha, A. Shahina and A. Nayeemulla Khan

doi:10.17485/ijst/2016/v9i44/102234

Indian Journal of Science and Technology

Year: 2016, Volume: 9, Issue: 44, Pages: 1-7

Original Article

N. Radha¹*, A. Shahina¹ and A. Nayeemulla Khan²

¹Department of Information Technology, SSN College of Engineering, Chennai, India; radhan, [email protected]
²School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, India; [email protected]

*Author for correspondence
N. Radha
Department of Information Technology, SSN College of Engineering, Chennai, India; radhan,[email protected]

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objectives: This paper proposes a method to improve the performance of a Visual Speech Recognition (VSR) system by combining pixel-based and geometry-based features, so that it can augment audio-based Automatic Speech Recognition (ASR) systems in adverse conditions. Methods/Statistical Analysis: A video database comprising 11000 utterances of isolated words, collected from 20 speakers, is used in this study. Pixel-based features (Discrete Cosine Transform, DCT, and Discrete Wavelet Transform, DWT) and geometric features (Active Shape Model, ASM) are fused at two levels: the feature level and the decision level. A simple Gaussian mixture Hidden Markov Model (HMM) word model is built for feature-level fusion, while a two-stream HMM is built for decision-level fusion. Findings: The VSR system built using the combined features shows a significant improvement in performance over the individual VSR systems built using pixel-based or geometric features alone. The accuracy of the individual systems is 76% for geometric (ASM) features, 64% for DCT and 72% for DWT pixel-based features. With combined features, the accuracy improves to 80% for ASM+DCT and 84.7% for ASM+DWT. Weighted decision-level fusion results in a further improvement, with an accuracy of 84% for ASM+DCT and 92% for ASM+DWT. Application/Improvements: The combined VSR system could be preferred over systems based on pixel or geometric features alone to augment the performance of audio-based ASR systems in adverse conditions. Further studies on improving the VSR system, which could be used in lieu of audio-based ASR systems in adverse situations, are being carried out.

Keywords: HMM, Pixel and Geometric Features, Visual Speech Recognition
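
The two fusion levels named in the abstract can be illustrated with a minimal sketch. The abstract does not give the weighting scheme, the feature dimensions, or the HMM configuration, so everything below is an assumption made for illustration: feature-level fusion is shown as per-frame concatenation of the pixel and geometric vectors, and weighted decision-level fusion is simplified to a linear combination of per-word log-likelihoods from two separately scored streams (a two-stream HMM would normally apply stream weights inside the model rather than on final word scores). The function names `fuse_features` and `fuse_decisions`, the weight value, and the example scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def fuse_features(
    pixel: Sequence[Sequence[float]],
    geometric: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Feature-level fusion (assumed form): concatenate the pixel-based and
    geometric feature vectors of each video frame into one longer vector."""
    if len(pixel) != len(geometric):
        raise ValueError("both streams must have the same number of frames")
    return [list(p) + list(g) for p, g in zip(pixel, geometric)]


def fuse_decisions(
    pixel_ll: Dict[str, float],
    geometric_ll: Dict[str, float],
    weight: float,
) -> str:
    """Weighted decision-level fusion (assumed form): combine the per-word
    log-likelihoods of the two streams and return the best-scoring word.

    `weight` is the share given to the pixel stream; the geometric stream
    receives `1 - weight`.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if pixel_ll.keys() != geometric_ll.keys():
        raise ValueError("both streams must score the same vocabulary")
    combined = {
        word: weight * pixel_ll[word] + (1.0 - weight) * geometric_ll[word]
        for word in pixel_ll
    }
    return max(combined, key=combined.get)


if __name__ == "__main__":
    # Made-up log-likelihoods for a two-word vocabulary (illustration only).
    pixel_scores = {"one": -120.0, "two": -118.0}
    geometric_scores = {"one": -95.0, "two": -101.0}
    print(fuse_decisions(pixel_scores, geometric_scores, weight=0.4))
```

In this sketch the two streams disagree on the best word, and the stream weight decides which one prevails. In practice such a weight would be tuned on held-out data.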