Language Models Creation for the Tatar Speech Recognition System

Aidar Failovich Khusainov

doi:10.17485/ijst/2017/v10i1/109954


Indian Journal of Science and Technology


Year: 2017, Volume: 10, Issue: 1, Pages: 1-5

Original Article


Aidar Failovich Khusainov*

Kazan Federal University, Kazan, Kremlevskaya st., 18, Institute of Applied Semiotics of the Tatarstan Academy of Sciences, Kazan, Levo-Bulachnaya st., 36a; [email protected]

*Author for correspondence:
Aidar Failovich Khusainov
Email: [email protected]

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objectives: This article presents experiments on building language models for the Tatar language. Statistical N-gram models are used with five different smoothing techniques. Methods: Such models can be used in various applications, including machine translation and spell checking. This study intends to use them in an automatic speech recognition system for Tatar. Because Tatar has a rich morphology, speech recognition systems may use not only words but also sub-word units, such as syllables and morphemes, as basic modeling units. Findings: The following basic units were chosen for a complete analysis of Tatar language model development: word, morpheme, morph (a statistically selected component of a word), stem and affix chain, syllable, and letter. Models were constructed for all combinations of N-gram order (2-, 3-, and 4-grams), smoothing technique, and basic unit. In addition, an experiment was conducted showing that a language model based on word classes can be built. Conclusion: Based on the experimental results, conclusions are drawn about how well each type of model describes Tatar grammar, its degree of lexicon coverage, and the vocabulary size it requires.

Keywords: Automatic Speech Recognition, Class-Based Models, Language Model, N-Grams, Tatar Language
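
The experimental grid described in the abstract (N-gram order crossed with basic unit) can be illustrated with a minimal Python sketch. It is not the author's implementation. The abstract does not name the five smoothing techniques, so add-k (Laplace) smoothing is used here only because it is short. The `tokenize` helper covers just two of the six units (word and letter), and the English toy sentences are placeholders, not Tatar data. All names (`AddKNgramModel`, `run_grid`) are hypothetical.

```python
import math
from collections import Counter


def tokenize(text, unit):
    """Split text into modeling units; only 'word' and 'letter' are sketched."""
    if unit == "word":
        return text.split()
    if unit == "letter":
        # Mark word boundaries explicitly so they survive letter splitting.
        return list(text.replace(" ", "_"))
    raise ValueError(f"unsupported unit: {unit}")


class AddKNgramModel:
    """N-gram model with add-k smoothing (an illustrative choice)."""

    def __init__(self, order, k=1.0):
        self.order = order
        self.k = k
        self.ngrams = Counter()
        self.contexts = Counter()
        self.vocab = {"<unk>"}  # unseen tokens are mapped to <unk>

    def _pad(self, tokens):
        return ["<s>"] * (self.order - 1) + list(tokens) + ["</s>"]

    def fit(self, sequences):
        for tokens in sequences:
            padded = self._pad(tokens)
            self.vocab.update(padded)
            for i in range(self.order - 1, len(padded)):
                context = tuple(padded[i - self.order + 1:i])
                self.ngrams[context + (padded[i],)] += 1
                self.contexts[context] += 1
        return self

    def prob(self, context, token):
        if token not in self.vocab:
            token = "<unk>"
        context = tuple(context)
        numerator = self.ngrams[context + (token,)] + self.k
        denominator = self.contexts[context] + self.k * len(self.vocab)
        return numerator / denominator

    def perplexity(self, sequences):
        log_sum, count = 0.0, 0
        for tokens in sequences:
            padded = self._pad(tokens)
            for i in range(self.order - 1, len(padded)):
                context = padded[i - self.order + 1:i]
                log_sum += math.log(self.prob(context, padded[i]))
                count += 1
        return math.exp(-log_sum / count)


def run_grid(train, held_out, units=("word", "letter"), orders=(2, 3, 4)):
    """Train one model per (unit, order) pair and return held-out perplexities."""
    results = {}
    for unit in units:
        for order in orders:
            model = AddKNgramModel(order).fit(tokenize(s, unit) for s in train)
            results[(unit, order)] = model.perplexity(
                tokenize(s, unit) for s in held_out
            )
    return results


if __name__ == "__main__":
    # Placeholder sentences standing in for a real training corpus.
    train = ["the cat sat", "the cat ran", "a dog sat"]
    held_out = ["the dog ran"]
    for (unit, order), ppl in run_grid(train, held_out).items():
        print(f"{unit:6s} {order}-gram perplexity: {ppl:.2f}")
```

Perplexities computed over different units are not directly comparable, because each unit type yields a different number of tokens for the same text. A real comparison would normalize to a common unit or evaluate the models inside the recognizer.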