Hate Speech Detection based on Word Embedding and Linguistic Features

Archika Jain; Sandhya Sharma

doi:10.17485/IJST/v16i41.2128

Indian Journal of Science and Technology

Year: 2023, Volume: 16, Issue: 41, Pages: 3704-3713

Original Article

Archika Jain¹*, Sandhya Sharma²

¹Department of CSE, Suresh Gyan Vihar University, 302017, Jaipur, India
²Department of ECE, Suresh Gyan Vihar University, 302017, Jaipur, India

*Corresponding Author
Email: [email protected]

Received Date: 22 August 2023, Accepted Date: 03 October 2023, Published Date: 12 November 2023

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objectives: To develop an improved hate speech detection method based on word embedding and linguistic features. Methods: Several machine-learning classifiers, namely Logistic Regression (LR), Gaussian Naive Bayes (GNB), Random Forest (RF), K-Nearest Neighbor (KNN) and Linear Support Vector Classifier (SVC), are trained on linguistic data to identify hate speech. Two datasets are used: the Tweet hate speech detection dataset (24783 tweets) and the Hasoc2019 dataset (6977 tweets). Both datasets are split 67/33, with 67% of the data used for training and 33% for testing. Findings: On the Tweet hate speech detection dataset, the Random Forest classifier achieves the highest accuracy of 0.90, with the highest precision, recall and F1-score of 0.87, 0.85 and 0.90 respectively for class 0, 0.98, 0.98 and 0.93 respectively for class 1, and 0.86, 0.85 and 0.74 respectively for class 2. On the Hasoc2019 dataset, the Random Forest classifier with linguistic features and the TF-IDF word embedding technique achieves the highest accuracy of 0.99, with the highest precision, recall and F1-score of 1.00, 0.99 and 1.00 respectively for class 0 and 1.00, 0.99 and 0.99 respectively for class 1. Novelty: The combination of twenty linguistic features with the term frequency-inverse document frequency (TF-IDF) word embedding technique makes this research unique. The twenty linguistic features used to detect hateful content are drawn from three groups of attributes: complexity attributes, stylometric attributes and psycho-linguistic attributes.

Keywords: Machine Learning Classifiers, Linguistic Features, Accuracy, TFIDF, Random Forest Classifier
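
The feature construction summarised in the abstract (TF-IDF weights combined with linguistic features, followed by a 67/33 split) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not list the twenty features, so the three features here, the NEGATIVE_WORDS mini-lexicon, the toy sentences and labels, and all function names are hypothetical. The TF-IDF variant is the plain tf * log(N / df) form; library implementations often apply smoothing and normalisation, so their values would differ.

```python
import math
import random
import string
from collections import Counter

# Hypothetical mini-lexicon standing in for a psycho-linguistic resource.
NEGATIVE_WORDS = {"awful", "terrible", "rude", "stupid"}


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    tokens = (w.strip(string.punctuation) for w in text.lower().split())
    return [t for t in tokens if t]


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using the plain tf * log(N / df) weighting."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    df = Counter(t for toks in tokenized for t in set(toks))  # document frequency
    n_docs = len(docs)
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


def linguistic_features(text):
    """Three illustrative features, one per attribute group named in the abstract."""
    tokens = tokenize(text)
    n = len(tokens) or 1
    avg_word_len = sum(len(t) for t in tokens) / n                   # complexity
    exclamations = float(text.count("!"))                            # stylometric
    negative_share = sum(t in NEGATIVE_WORDS for t in tokens) / n    # psycho-linguistic
    return [avg_word_len, exclamations, negative_share]


def build_features(docs):
    """Concatenate each TF-IDF vector with the document's linguistic features."""
    vocab, tfidf = tfidf_vectors(docs)
    rows = [vec + linguistic_features(doc) for vec, doc in zip(tfidf, docs)]
    return rows, vocab


def split_67_33(rows, seed=0):
    """Shuffle and split into 67% training and 33% testing rows."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = round(len(rows) * 0.67)
    return rows[:cut], rows[cut:]


# Toy, made-up examples; 0 and 1 are placeholder class labels.
docs = [
    "have a nice day",
    "this is a great match",
    "what a TERRIBLE awful idea!",
    "you are awful!!",
    "a calm and kind reply",
    "a STUPID rude comment!",
]
labels = [0, 0, 1, 1, 0, 1]

features, vocab = build_features(docs)
train, test = split_67_33(list(zip(features, labels)))
print(len(vocab), len(features[0]), len(train), len(test))
```

The resulting training rows could then be passed to any of the classifiers named in the abstract, for example a Random Forest from a machine-learning library; that step is omitted here to keep the sketch dependency-free.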

Copyright

© 2023 Jain & Sharma. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published by Indian Society for Education and Environment (iSee).