An Empirical Analysis of Language Detection in Dravidian Languages

G Shimi; C Jerin Mahibha; Durairaj Thenmozhi

doi:10.17485/IJST/v17i15.765

Indian Journal of Science and Technology

Year: 2024, Volume: 17, Issue: 15, Pages: 1515-1526

Original Article

G Shimi¹, C Jerin Mahibha²*, Durairaj Thenmozhi³

¹Department of Computer Applications, Madras Christian College, Tambaram, Chennai, 600059, Tamil Nadu, India
²Department of Computer Science and Engineering, Meenakshi Sundararajan Engineering College, Kodambakkam, Chennai, 600 024, Tamil Nadu, India
³Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam, 603 110, Tamil Nadu, India

*Corresponding Author
Email: [email protected]

Received Date: 01 April 2023, Accepted Date: 13 March 2024, Published Date: 04 April 2024

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objectives: Language detection is the process of identifying the language in which a text is written. The proposed system aims to detect the Dravidian language of a given text using different machine learning and deep learning algorithms. The paper presents an empirical analysis of the results obtained with these models and evaluates the performance of a language-agnostic model for language detection.

Method: Dravidian language identification in social media text is implemented and analyzed empirically using machine learning (ML) and deep learning (DL) approaches with k-fold cross-validation. The classes identified are Tamil, Malayalam, Tamil Code Mix, and Malayalam Code Mix. The ML algorithms used are Naive Bayes (NB), Multinomial Logistic Regression (MLR), Support Vector Machine (SVM), and Random Forest (RF). The supervised DL models include BERT, mBERT, and language-agnostic models.

Findings: The language-agnostic model outperforms all other models on the task of language detection in Dravidian languages. The results of both the ML and DL models are analyzed empirically using accuracy, precision, recall, and F1-score. The accuracy of the ML algorithms ranges from 85% to 89%, while the experimental results show that the deep learning model performs best, with an accuracy of 98%.

Novelty: The proposed system emphasizes the use of the language-agnostic model for detecting the Dravidian language of a given text; it achieves a promising accuracy of 98%, which is higher than that of existing methodologies.
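As an illustration of the kind of evaluation pipeline described in the abstract, the following minimal sketch runs k-fold cross-validation with a Naive Bayes classifier and reports accuracy and macro-averaged precision, recall, and F1-score. It is not the authors' implementation: the character-bigram features, the Laplace smoothing, the deterministic fold assignment, and the four toy sentences are all assumptions made for this example, and none of the data comes from the paper's dataset.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy samples (not from the paper's dataset), two per class.
TEXTS = ["வணக்கம் நண்பா", "நன்றி நண்பா", "നമസ്കാരം സുഹൃത്തേ", "നന്ദി സുഹൃത്തേ"]
LABELS = ["Tamil", "Tamil", "Malayalam", "Malayalam"]


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train_nb(texts, labels, n=2):
    """Fit a multinomial Naive Bayes model over character n-gram counts."""
    class_docs = Counter(labels)
    gram_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(texts, labels):
        grams = char_ngrams(text, n)
        gram_counts[label].update(grams)
        vocab.update(grams)
    return class_docs, gram_counts, vocab, n


def predict_nb(model, text):
    """Return the class with the highest Laplace-smoothed log posterior."""
    class_docs, gram_counts, vocab, n = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)
        denom = sum(gram_counts[label].values()) + len(vocab)
        for gram in char_ngrams(text, n):
            score += math.log((gram_counts[label][gram] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def macro_scores(y_true, y_pred):
    """Compute accuracy and macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / len(precisions),
        "recall": sum(recalls) / len(recalls),
        "f1": sum(f1s) / len(f1s),
    }


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; fold f tests samples with i % k == f."""
    for fold in range(k):
        test_idx = [i for i in range(n_samples) if i % k == fold]
        train_idx = [i for i in range(n_samples) if i % k != fold]
        yield train_idx, test_idx


def cross_validate(texts, labels, k=2):
    """Pool held-out predictions over all k folds and score them once."""
    y_true, y_pred = [], []
    for train_idx, test_idx in k_fold_indices(len(texts), k):
        model = train_nb([texts[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        for i in test_idx:
            y_true.append(labels[i])
            y_pred.append(predict_nb(model, texts[i]))
    return macro_scores(y_true, y_pred)


if __name__ == "__main__":
    print(cross_validate(TEXTS, LABELS, k=2))
```

Pooling the held-out predictions before scoring is one common convention; averaging per-fold scores is another, and the abstract does not say which one the study used.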

Keywords: Language, Machine learning, Deep learning, Transformer model, Encoder, Decoder

Copyright

© 2024 Shimi et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published By Indian Society for Education and Environment (iSee)