Low Resource Kannada Speech Recognition using Lattice Rescoring and Speech Synthesis

Savitha Murthy; Dinkar Sitaram

doi:10.17485/IJST/v16i4.2371

Indian Journal of Science and Technology

Year: 2023, Volume: 16, Issue: 4, Pages: 282-291

Original Article

Savitha Murthy¹*, Dinkar Sitaram²

¹PhD Scholar, Department of CSE, PES University, India
²Director, Cloud Computing Innovation Council of India, India

*Corresponding Author
Email: [email protected]

Received Date: 10 December 2022, Accepted Date: 05 January 2023, Published Date: 31 January 2023

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objectives: To improve, through data augmentation, the accuracy of low-resource speech recognition for a model trained on only 4 hours of transcribed continuous speech in the Kannada language.

Methods: For initial decoding, the baseline language model is augmented with unigram counts of words that are present in the Wikipedia text corpus but absent from the baseline. Lattice rescoring is then applied using the language model augmented with Wikipedia text. Speech synthesis-based augmentation is employed, using multi-speaker, syllable-based synthesis with Kannada voices and cross-lingual Telugu voices. For Kannada, we synthesize basic syllables, syllables with consonant conjuncts, and words containing syllables that are absent from the training speech.

Findings: An overall word error rate (WER) of 9.04% is achieved, against a baseline WER of 40.93%. Language model augmentation and lattice rescoring give an absolute improvement of 16.68%. Applying our method of syllable-based speech synthesis on top of language model augmentation and rescoring yields a total absolute reduction of 31.89% in WER. The proposed approach to language model augmentation is memory efficient: it consumes only about 1/8th of the memory required for decoding with the Wikipedia-augmented language model (2 gigabytes versus 18 gigabytes) while giving a comparable WER (22.95% for Wikipedia versus 24.25% for our method). Augmentation with synthesized syllables enhances the ability of the speech recognition model to recognize basic sounds, improving recognition of out-of-vocabulary words to 90% and of in-vocabulary words to 97%.

Novelty: We propose novel methods of language model augmentation and synthesis-based augmentation to achieve a low WER for a speech recognition model trained on only 4 hours of continuous speech. Obtaining high recognition accuracy (that is, low WER) from a very small speech corpus is a challenge. In this paper, we demonstrate that data augmentation can deliver high accuracy for speech recognition built on a small corpus.
Keywords: Low resource; Speech synthesis; Data augmentation; Language model; Lattice rescoring
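
The WER figures quoted in the abstract can be cross-checked with a few lines of Python. The sketch below is illustrative only and is not taken from the paper: it assumes the conventional definition of WER (word-level edit distance divided by the number of reference words), and the names `word_error_rate`, `absolute_reduction` and `relative_reduction` are hypothetical helpers introduced here.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: this is the conventional WER definition; the paper's own
    scoring tool is not named in the abstract.
    """
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(reference)


def absolute_reduction(before: float, after: float) -> float:
    """Difference between two WER values, in percentage points."""
    return round(before - after, 2)


def relative_reduction(before: float, after: float) -> float:
    """Reduction expressed as a percentage of the starting WER."""
    return round(100.0 * (before - after) / before, 2)


# WER values (in percent) as reported in the abstract.
BASELINE_WER = 40.93
LM_RESCORING_WER = 24.25
FINAL_WER = 9.04

if __name__ == "__main__":
    print(absolute_reduction(BASELINE_WER, LM_RESCORING_WER))
    print(absolute_reduction(BASELINE_WER, FINAL_WER))
    print(relative_reduction(BASELINE_WER, FINAL_WER))
```

Run as a script, the first two printed values correspond to the 16.68% and 31.89% improvements stated in the Findings, which confirms that both are absolute differences in WER rather than relative reductions.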

Copyright

© 2023 Murthy & Sitaram. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published by Indian Society for Education and Environment (iSee).