Indian Journal of Science and Technology
Year: 2017, Volume: 10, Issue: 1, Pages: 1-5
Aidar Failovich Khusainov*
Kazan Federal University, Kazan, Kremlevskaya st., 18, Institute of Applied Semiotics of the Tatarstan Academy of Sciences, Kazan, Levo-Bulachnaya st., 36a; [email protected]
*Author for correspondence:
Aidar Failovich Khusainov
Kazan Federal University, Kazan, Kremlevskaya st., 18, Institute of Applied Semiotics of the Tatarstan Academy of Sciences, Kazan, Levo-Bulachnaya st., 36a;
Email: [email protected]
Objectives: This article presents experiments on building different language models for the Tatar language. Statistical n-gram models are used with five different smoothing techniques. Methods: Such models can be used in various applications, such as machine translation and spell checking; this study intends to use them in an automatic speech recognition system for Tatar. Because Tatar has a rich morphology, speech recognition systems may use not only words but also sub-word building blocks as basic modeling units: syllables, morphemes, etc. Findings: The following basic units were chosen for a complete analysis of Tatar language model development: word, morpheme, morph (a statistically selected component of a word), stem and affix chain, syllable, and letter. Models were constructed for all combinations of n-gram order (2-, 3-, and 4-grams), smoothing technique, and basic unit. In addition, an experiment was conducted to show that a language model based on word classes can be built. Conclusion: Based on the experimental results, conclusions are drawn about the quality of the Tatar language grammar description, the degree of lexicon coverage, and the vocabulary size required for each type of constructed model.
Keywords: Automatic Speech Recognition, Class-Based Models, Language Model, N-Grams, Tatar Language
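The idea of training n-gram models over different basic modeling units can be illustrated with a minimal sketch. It is not the author's implementation: the function names (split_units, train_ngram, prob, perplexity) are hypothetical, only two of the units (word and letter) are shown, and add-one (Laplace) smoothing is used purely as a simple stand-in, since the abstract does not name the five smoothing techniques that were compared.

```python
import math
from collections import Counter


def split_units(sentence, unit="word"):
    """Split a sentence into modeling units: whole words or single letters."""
    if unit == "word":
        return sentence.split()
    if unit == "letter":
        return [ch for ch in sentence if not ch.isspace()]
    raise ValueError(f"unknown unit: {unit}")


def pad(sentence, n, unit):
    """Add start/end markers so every token has a full (n-1)-token context."""
    return ["<s>"] * (n - 1) + split_units(sentence, unit) + ["</s>"]


def train_ngram(sentences, n=2, unit="word"):
    """Count n-grams and their contexts; returns (ngrams, contexts, vocab)."""
    ngrams, contexts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = pad(sentence, n, unit)
        vocab.update(tokens)
        for i in range(n - 1, len(tokens)):
            ctx = tuple(tokens[i - n + 1:i])
            ngrams[ctx + (tokens[i],)] += 1
            contexts[ctx] += 1
    return ngrams, contexts, vocab


def prob(token, ctx, model):
    """Add-one smoothed P(token | ctx); an assumed, illustrative smoothing."""
    ngrams, contexts, vocab = model
    return (ngrams[ctx + (token,)] + 1) / (contexts[ctx] + len(vocab))


def perplexity(sentence, model, n=2, unit="word"):
    """Per-token perplexity of one sentence under the smoothed model."""
    tokens = pad(sentence, n, unit)
    log_sum = 0.0
    for i in range(n - 1, len(tokens)):
        ctx = tuple(tokens[i - n + 1:i])
        log_sum += math.log(prob(tokens[i], ctx, model))
    return math.exp(-log_sum / (len(tokens) - (n - 1)))


# Toy placeholder corpus (not Tatar data), used only to exercise the code.
corpus = ["a b a b", "a b c"]
word_model = train_ngram(corpus, n=2, unit="word")
letter_model = train_ngram(corpus, n=3, unit="letter")
print(perplexity("a b c", word_model, n=2, unit="word"))
print(perplexity("a b c", letter_model, n=3, unit="letter"))
```

Switching the unit argument changes only the tokenization step, which is why the same counting and smoothing code can in principle be reused for morphemes, stems with affix chains, or syllables once a suitable segmenter is available.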