Use of Bidirectional Long Short Term Memory in Spoken Word Detection with reference to the Assamese language

Deepjyoti Kalita; Khurshid Alam Borbora; Dipen Nath

doi:10.17485/IJST/v15i27.655

Indian Journal of Science and Technology

Year: 2022, Volume: 15, Issue: 27, Pages: 1364-1371

Original Article

Deepjyoti Kalita¹*, Khurshid Alam Borbora², Dipen Nath³

¹Dept. of Computer Science & IT, Mangaldai College, Assam, India
²Dept. of Computer Science, IDOL, Gauhati University, Assam, India
³Dept. of Computer Science, Kokrajhar Govt. College, Kokrajhar, Assam, India

*Corresponding Author
Email: [email protected]

Received Date: 22 March 2022, Accepted Date: 06 June 2022, Published Date: 18 July 2022

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objectives: The proposed method applies a distinct deep learning technique to identify spoken words in the Assamese language. Most DNN-based algorithms have been successfully applied in image recognition, computer vision, natural language processing and medical image analysis. Methods: The method used here is Bidirectional Long Short Term Memory (BLSTM), which incorporates both past and future context. The speech database for this work was obtained from the repository of the Indian Language Technology Proliferation and Development Center (ILTP-DC). This repository contains 32,335 utterances by 1000 male and female participants, covering 262 unique native Assamese words. The BLSTM-based recognition model uses 10 of the 262 unique words; the remaining words are used to construct synthesized sentences. The feature extraction module uses 39 feature coefficients, composed of MFCC, ΔMFCC and ΔΔMFCC coefficients. Findings: The BLSTM-based recognition model achieves a Word Error Rate (WER) of 18.84% with an average accuracy of 98.12%, a promising benchmark when compared to recent findings. Novelty: This work attempts a different approach to detecting certain keywords of the Assamese language by adopting a deep learning methodology. The future objective is to improve the detection capability of the model by combining multiple DNN models in a hybrid approach and including additional features.
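
As an illustration of how a 39-coefficient feature vector of the kind mentioned above might be assembled, the following minimal sketch stacks static coefficients with their first-order (Δ) and second-order (ΔΔ) regression deltas. It assumes 13 static MFCCs per frame, a regression half-width of 2 frames, and edge padding; none of these settings is stated in the abstract. The function names `delta` and `stack_features` are hypothetical, and the input matrix is random placeholder data rather than real MFCCs.

```python
import numpy as np


def delta(features: np.ndarray, width: int = 2) -> np.ndarray:
    """Regression-based delta along the time axis of a (frames, coeffs) matrix.

    Uses the common formula
        d_t = sum_{n=1..width} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..width} n^2)
    The half-width of 2 is an assumption, not a value taken from the paper.
    """
    num_frames = features.shape[0]
    # Repeat the first and last frames so every frame has full context.
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + num_frames]
        behind = padded[width - n : width - n + num_frames]
        out += n * (ahead - behind)
    return out / denom


def stack_features(mfcc: np.ndarray) -> np.ndarray:
    """Concatenate static, delta and delta-delta coefficients per frame."""
    d1 = delta(mfcc)   # first-order differences (ΔMFCC)
    d2 = delta(d1)     # second-order differences (ΔΔMFCC)
    return np.hstack([mfcc, d1, d2])


# Placeholder input: 100 frames of 13 coefficients (assumed split of 13 + 13 + 13).
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(100, 13))
features = stack_features(mfcc)
print(features.shape)  # (100, 39)
```

In practice the static coefficients would come from an MFCC front end applied to the speech signal; only the stacking step is sketched here.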

Keywords: Bidirectional Long Short Term Memory; Deep Learning; Speech recognition; WER; MFCC
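
For readers unfamiliar with the WER metric listed among the keywords, a minimal sketch of the standard definition follows: the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The abstract does not describe how the authors computed their figure, so this is the textbook formula only. The function name `word_error_rate` and the placeholder tokens are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens: one substitution (w2 -> w9) and one deletion (w4).
print(word_error_rate("w1 w2 w3 w4", "w1 w9 w3"))  # 0.5
```

Because insertions count as errors while the denominator is the reference length, WER and a separately reported accuracy need not sum to 100%.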

Copyright

© 2022 Kalita et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Published By Indian Society for Education and Environment (iSee)