P-ISSN 0974-6846 E-ISSN 0974-5645

Indian Journal of Science and Technology



Year: 2024, Volume: 17, Issue: 15, Pages: 1515-1526

Original Article

An Empirical Analysis of Language Detection in Dravidian Languages

Received Date: 01 April 2023, Accepted Date: 13 March 2024, Published Date: 04 April 2024


Objectives: Language detection is the process of identifying the language in which a text is written. The proposed system aims to detect the Dravidian language of a given text using different machine learning and deep learning algorithms. The paper presents an empirical analysis of the results obtained with the different models and evaluates the performance of a language-agnostic model for language detection.

Method: Dravidian language identification in social media text is implemented using machine learning (ML) and deep learning (DL) approaches and evaluated with k-fold cross-validation. The Dravidian languages identified include Tamil, Malayalam, Tamil Code Mix, and Malayalam Code Mix. The ML algorithms used for language detection are Naive Bayes (NB), Multinomial Logistic Regression (MLR), Support Vector Machine (SVM), and Random Forest (RF). The supervised DL models used include BERT, mBERT, and language-agnostic models.

Findings: The language-agnostic model outperforms all other models on the task of language detection in Dravidian languages. The results of both the ML and DL models are analyzed empirically using accuracy, precision, recall, and F1-score. The accuracy of the machine learning algorithms ranges from 85% to 89%, and the experimental results show that the deep learning model outperforms them with an accuracy of 98%.

Novelty: The proposed system emphasizes the use of the language-agnostic model to detect the Dravidian language of a given text. It achieves a promising accuracy of 98%, which is higher than that of existing methodologies.
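As an illustration of the evaluation idea in the Method, the following is a minimal sketch of k-fold cross-validation around a character-bigram multinomial Naive Bayes classifier. It is not the authors' implementation: the toy samples, the function names (`char_ngrams`, `train_nb`, `predict_nb`, `k_fold_accuracy`), the bigram features, the add-one smoothing, and the interleaved fold assignment are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy samples (NOT from the paper's dataset), three per class,
# grouped by class so that the interleaved folds below stay balanced.
SAMPLES = [
    ("வணக்கம் நண்பா", "Tamil"),
    ("நன்றி", "Tamil"),
    ("இது நல்ல படம்", "Tamil"),
    ("നമസ്കാരം", "Malayalam"),
    ("നന്ദി", "Malayalam"),
    ("ഇത് നല്ല പടം", "Malayalam"),
    ("padam semma super", "Tamil Code Mix"),
    ("intha song vera level", "Tamil Code Mix"),
    ("nalla irukku bro", "Tamil Code Mix"),
    ("ee padam adipoli aanu", "Malayalam Code Mix"),
    ("nalla song aanu machane", "Malayalam Code Mix"),
    ("ithu kollam ketto", "Malayalam Code Mix"),
]

def char_ngrams(text, n=2):
    """Return the character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train_nb(samples, n=2):
    """Collect per-class n-gram counts and class frequencies."""
    counts = defaultdict(Counter)  # label -> n-gram counts
    docs = Counter()               # label -> number of training texts
    for text, label in samples:
        counts[label].update(char_ngrams(text, n))
        docs[label] += 1
    vocab = {gram for c in counts.values() for gram in c}
    return counts, docs, vocab

def predict_nb(model, text, n=2):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    counts, docs, vocab = model
    total_docs = sum(docs.values())
    best_label, best_score = None, -math.inf
    for label in docs:
        total = sum(counts[label].values())
        score = math.log(docs[label] / total_docs)  # class prior
        for gram in char_ngrams(text, n):
            score += math.log((counts[label][gram] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def k_fold_accuracy(samples, k=3, n=2):
    """Mean accuracy over k folds; fold i holds out samples i, i+k, i+2k, ..."""
    scores = []
    for i in range(k):
        test = samples[i::k]
        train = [s for j, s in enumerate(samples) if j % k != i]
        model = train_nb(train, n)
        correct = sum(predict_nb(model, text, n) == label for text, label in test)
        scores.append(correct / len(test))
    return sum(scores) / k

if __name__ == "__main__":
    print(f"mean 3-fold accuracy: {k_fold_accuracy(SAMPLES, k=3):.2f}")
```

On a dataset this small the reported accuracy says nothing about real performance. The sketch only shows how each fold is held out once while the model is retrained on the remaining folds, and how the per-fold accuracies are averaged.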

Keywords: Language, Machine learning, Deep learning, Transformer model, Encoder, Decoder




© 2024 Shimi et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published By Indian Society for Education and Environment (iSee)

