Social media allows users to create, remove and share their ideas freely over the Internet. Platforms have also changed their features in recent years. For example, the maximum number of characters per tweet was increased from 140 to 280, encouraging greater flexibility in interaction. Social media lets users communicate and express their ideas freely in natural language. It also lends itself to data and text mining approaches such as social network analysis, information retrieval, and the discovery of patterns in a collection of data to uncover hidden information, opinions, or sentiments.
However, a challenge of using natural language on social media is the generation of hate speech: when users express opinions freely, some disseminate hateful content from various perspectives that violates the rights of individuals.
Social media platforms, however, do not have systems that block hate content from being displayed and poisoning the social media environment for every natural language used on them. To address this problem, some researchers have developed hate speech detection for resource-rich languages such as Arabic.
In Ethiopia, various social media platforms such as Facebook, Pinterest, YouTube, Twitter, Instagram, LinkedIn, Tumblr and others are available. The people of Ethiopia use Afan Oromo, Amharic, Tigrigna, Somali and English to communicate on social media.
In work of
Users also write in Afan Oromo on social media platforms to express emotions, feelings, and opinions, and some of these comments and posts contain hateful ideas that can lead to discrimination, social conflict, and even genocide. Yet no research work has attempted to develop a hate speech detection prototype for Afan Oromo on any social media platform. A hate speech detection model for Afan Oromo on social media therefore needs to be developed.
Therefore, to fill this gap, we aim to develop an automatic hate speech detection system for Afan Oromo on Facebook and Twitter using the Support Vector Classifier (SVC), Multinomial Naive Bayes (Multinomial NB), Linear Support Vector Classifier (LSVC), and Random Forest Classifier.
Which preprocessing techniques need to be applied to prepare a quality Afan Oromo hate speech dataset?
Which software tools are appropriate for collecting data from Facebook and Twitter?
Which feature extraction techniques need to be applied to obtain important features from Afan Oromo hate speech data?
What framework can be used to develop a hate speech detection model for Afan Oromo social media?
Which machine learning algorithm performs best for building an Afan Oromo hate speech detection model?
To what extent is the system's performance effective when different algorithms are combined?
The current literature shows that understanding and analyzing social media has become a main concern. Today, one of the main concerns about social media is the positive and negative impact that comments and posts on social media platforms have on individuals, groups or society. Hence, sentiment analysis and the detection and classification of hate speech, abusive language, offensive language and cyberbullying have become topics of research interest for researchers, governments and social media companies. Hate speech detection identifies whether content displayed on a social media platform in the form of comments or posts is hate or normal, irrespective of its nature. It uses approaches such as machine learning, natural language processing and statistics to design a model that detects hate speech. Using hate speech detection, natural language processing tasks and machine learning algorithms, comments and posts on social media platforms like Facebook, YouTube and Twitter can be analyzed and identified as either hate or normal.
In this section, we review related work from the perspective of machine learning algorithms, hate speech detection and classification, sentiment analysis, and natural language processing. A brief summary of the literature reviewed in this study is indicated in
The researcher collected data and applied pre-processing tasks such as tokenization and cleaning data performed on data collection in the work of
Code-mixed data concept applied for hate speech detection in work of
In study
Integrated deep feature extraction resulted from CNN trained on semantic word embedding and n-gram approaches were applied as feature extraction techniques
The author of
In study
The author of
Afan Oromo Sentiment analysis model for Facebook developed by
| References | Language | Feature extraction techniques | Social media | Algorithms | Dataset | Availability | F1-score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | Danish and English | | Facebook, Reddit and Twitter | - | Original | No | 74% |
| | | - | | LSTM | Original | No | 70.4% |
| | Indonesian | Textual, acoustic and their combination | Facebook, Twitter, Line Today, YouTube | Long Short-Term Memory | Original | No | 87.98% |
| | Indonesian | Word n-gram, character n-gram, orthography, lexicon | | Support Vector Machine (SVM), Naive Bayes (NB), and Random Forest Decision Tree (RFDT) | Existing | Yes | 77.36% |
| | Arabic | BoW, TF, and TF-IDF | | Support Vector Machine (SVM), Naive Bayes (NB), Decision Tree (DT) and Random Forest (RF) | Original | No | 91.2% |
The following series of steps was carried out to achieve the objective of this study, that is, to detect hate speech in Facebook and Twitter comments and posts effectively. A research design is an approach that enables us to combine various research methodologies to solve identified problems. Using an appropriate research design is essential for solving the selected problems properly. To accomplish this study, we used the following research design techniques.
In the present day, social media is used as a source of data for research. The author of
| Source | Hate | Normal | Total |
| --- | --- | --- | --- |
| Facebook | 2,900 | 3,700 | 6,600 |
| Twitter | 3,562 | 3,438 | 7,000 |
| Total | 6,462 | 7,138 | 13,600 |
Our dataset is composed of user-written comments and posts from public Facebook and Twitter pages of BBC Afan Oromo, OBN Afan Oromo, Fana Afan Oromo Program, politicians, activists, religious men, and the Oromia Communication Bureau.
We collected 13,600 comments and posts between September 2019 and 2020 from the respective public pages using Facepager (https://facepager.software.informer.com/3.6/); 7,000 were collected from Twitter and 6,600 from Facebook.
The researchers performed text preprocessing tasks to clean irrelevant content from the collected data and consolidated all data into a single comma-separated values (CSV) file. The data were also labelled into two main classes, hate or normal.
Removing irrelevant content: Text preprocessing tasks are essential to obtain a relevant dataset.
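Start:
  Open the dataset file;
  Read text in dataset;
  While (! end of text in dataset):
    Remove irrelevant content (links, email addresses, emoji, numbers, punctuation, stop words) from the current text;
    Append the cleaned text to corpus;
  Return corpus;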
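The cleaning loop above can be sketched in Python as follows. This is a minimal illustration and not the authors' script: the function names, the regular expressions, and the four-word stop-word sample are assumptions made for the example, and a real Afan Oromo stop-word list would be much longer.

```python
import re

# Hypothetical, tiny stop-word sample used only for illustration.
STOP_WORDS = {"fi", "kan", "akka", "irraa"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NUMBER_RE = re.compile(r"\d+")
# Anything that is not a basic Latin letter, an apostrophe or whitespace.
NON_TEXT_RE = re.compile(r"[^a-z'\s]")


def clean_text(text):
    """Return one comment or post with irrelevant content removed."""
    # Normalize the typographic apostrophe so it survives the cleaning.
    text = text.replace("\u2019", "'").lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = NUMBER_RE.sub(" ", text)
    # Drops punctuation (except the apostrophe), emoji and other symbols.
    text = NON_TEXT_RE.sub(" ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


def build_corpus(rows):
    """Clean every row and skip rows that end up empty."""
    corpus = []
    for row in rows:
        cleaned = clean_text(row)
        if cleaned:
            corpus.append(cleaned)
    return corpus
```

Because the last filter keeps only basic Latin letters, text written in other scripts is dropped as well, which is a rough stand-in for removing non-Afan Oromo content.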
In this work, we divided hate speech identification into subtasks that were finally combined to support the proposed hate speech detection and classification. Each subtask was designed to handle one aspect of hate speech. We divided hate speech detection into five tasks.
Task 1. Hate speech identification: the researchers identified whether the given posts and comments are hate speech or normal speech. Data were annotated with these specific labels.
Task 2. Automatic hate speech detection: the aim of this step is to check whether posts and comments are hate speech or normal, based on the labels from Task 1.
Task 3. Automatic hate speech classification: in this step, the target of the hate speech is identified.
Task 4. Identifying target: for the class identified in Task 3, the aim of target identification is to identify the target of the hate speech based on the labeled posts and comments.
Task 5. Target of speech identification: analyzing whether the text content is hate or normal is essential. Therefore, at the target of speech identification level, the researchers analyzed the content of the text documents with the help of experts.
Nowadays, researchers use unsupervised, supervised, or semi-supervised machine learning techniques to conduct experiments. The supervised technique is the dominant approach to hate speech detection in social media, and it requires manual annotation of the dataset. In this study, we used manually annotated data for the supervised machine learning algorithms. Five experts annotated the data according to the prepared annotation procedure. The number of experts was limited to five due to resource scarcity. The experts recruited for data annotation held an MA degree or higher. They were selected based on their interest, computer skills and knowledge of Afan Oromo. The annotators followed the Afan Oromo hate speech dataset annotation procedure and could annotate data whenever they wanted within a limited period of time.
The researchers generated a username and password for each expert to log in to the Afan Oromo hate speech detection dataset evaluation system hosted on the website (https://www.naolinfo.info). After the experts successfully log in, the system allows them to choose the Afan Oromo hate speech dataset evaluation page from the displayed list of pages (see figure: 4). On that page, the expert first reads the content of the document to be annotated and then selects the hate or normal radio button. Finally, the expert submits his/her choice by clicking the “Submit and next” button. After all five experts have submitted their selections, the system assigns a class to the data depending on the number of experts who selected each label.
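The label assignment described above can be sketched as a vote count. The text does not state the exact rule, so the simple majority used here is an assumption, and `assign_label` is a hypothetical helper name rather than part of the evaluation system.

```python
from collections import Counter


def assign_label(votes):
    """Return the label chosen by most annotators (assumed majority rule)."""
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    # With an odd number of annotators and two labels a tie cannot happen,
    # but the check keeps the helper safe for other panel sizes.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        raise ValueError("tie between labels")
    return top_label


# Example: three of five experts selected "hate".
label = assign_label(["hate", "normal", "hate", "hate", "normal"])
```

With five annotators and two labels, one label always receives at least three votes, so the tie branch is never reached in that setting.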
If it contains ideas that are against a gender “Yoo documenting jibe sale when agarsiisuu of cases qabaate”.
If it contains ideas that are against groups of persons “Yoo qabiyyeen barreeffamichaa can agree namoota motion faalleessu qabaate”.
If it contains ideas that insult or curse individuals “Yoo qabiyyeen barreeffamichaa nama kan arrabsu tea”.
If it contains ideas that are against a specific religion, political party or ethnicity “Yoo qabiyyeen barreeffamichaa amount, siyaasaafi Saba Tokko can faalleessu tea”.
If it contains ideas that motivate a person or group of people to violence “Yoo qabiyyeen barreeffamichaa hookkaraaf Kana name kakaasuu tea”.
In the following
| S.no | Content/qabiyyee | Class/garee |
| --- | --- | --- |
| 1 | Oromo is enemy of Ethiopia “Oromoon diina Itiyoophiyaati” | hate “jibba” |
| 2 | Selfish “Abbaa garaa” | hate “jibba” |
| 3 | The struggle you contributed is unforgettable “qabsoon ati giite hin dagatamu” | normal “fayyaalessa” |
The text document "Oromo is enemy of Ethiopia" “Oromoon diina Itiyoophiyaati” is against the Oromo ethnic group and has to be labelled as hate “jibba”.
To extract features from all features in the dataset, we used a combination of approaches used in
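TF-IDF is computed as

$$\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t), \qquad \mathrm{IDF}(t) = \log\left[\frac{n}{\mathrm{DF}(t) + 1}\right]$$

where $n$ is the total number of documents in the document set and $\mathrm{DF}(t)$ is the document frequency of term $t$. The weighted n-gram is given as

$$W(t, d) = \mathrm{NGram}(t, d) \times \mathrm{TF\text{-}IDF}(t, d)$$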
In our approach, we used a combination of bigrams and TF-IDF to select important features and obtain a relevant Afan Oromo hate speech detection dataset.
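As a rough illustration of bigram features weighted by TF-IDF, the sketch below treats each bigram as a term t and uses the IDF variant log[n / (DF(t) + 1)]. Reading the combination this way is an assumption, the function names are hypothetical, and the NGram(t, d) factor is left out because the text does not define it.

```python
import math
from collections import Counter


def bigrams(text):
    """Return the word bigrams of a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def tfidf_bigrams(documents):
    """Return one {bigram: weight} dict per document."""
    n = len(documents)
    term_counts = [Counter(bigrams(doc)) for doc in documents]

    # DF(t): number of documents that contain bigram t at least once.
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    weights = []
    for counts in term_counts:
        weights.append({
            term: tf * math.log(n / (doc_freq[term] + 1))
            for term, tf in counts.items()
        })
    return weights


# Placeholder documents made of arbitrary tokens.
docs = ["a b c", "a b d", "x y z", "p q r"]
weights = tfidf_bigrams(docs)
```

With this IDF variant, a bigram that occurs in all but one document gets weight zero and one that occurs in every document gets a negative weight, which is worth keeping in mind on a small corpus.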
The dataset is the input to the experiments and plays a vital role in obtaining the right output. Since we used a supervised machine learning approach to train the model, the final annotated dataset was split into a train set and a test set with a fair distribution of classes and data. The train set contains 67% of the data and the remaining 33% is the test set.
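A 67/33 split that keeps the class distribution can be sketched as below. The per-class shuffling, the fixed seed, and the name `stratified_split` are assumptions for illustration; the text does not say which routine was used for the split.

```python
import random
from collections import defaultdict


def stratified_split(texts, labels, train_fraction=0.67, seed=42):
    """Split (text, label) pairs so each class keeps the same train share."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    train, test = [], []
    for label, items in by_label.items():
        items = items[:]          # copy so the caller's data is untouched
        rng.shuffle(items)
        cut = round(len(items) * train_fraction)
        train.extend((text, label) for text in items[:cut])
        test.extend((text, label) for text in items[cut:])

    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Placeholder data: 100 documents per class.
texts = [f"doc{i}" for i in range(200)]
labels = ["hate"] * 100 + ["normal"] * 100
train_set, test_set = stratified_split(texts, labels)
```

Splitting each class separately is one way to obtain the "fair distribution of classes" mentioned above; a purely random split would only approximate it.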
As indicated in the previous section, we collected data from the channels of BBC Afan Oromo, OBN Afan Oromo, Fana Afan Oromo Program, politicians, activists, religious men, and the Oromia Communication Bureau. Facepager is a tool that retrieves data from Facebook and Twitter pages and saves the retrieved data in CSV format on a local machine.
Machine learning algorithms were applied to develop the Afan Oromo hate speech detection model. Supervised machine learning algorithms in particular require a properly annotated dataset to obtain models with the highest performance. We annotated a dataset for Afan Oromo hate speech detection according to the prepared annotation procedure.
Machine learning is automating several activities that need human intelligence, such as text classification, text categorization, pattern recognition, pattern discovery and decision making. Machine learning is a branch of Artificial Intelligence that contains two main learning approaches: supervised learning and unsupervised learning. In the machine learning approach for predefined classes, a set of documents manually classified by the user always exists. These predefined datasets are then used to learn automatically the meaning of the attributes that the user assigned to the classes.
The supervised learning approach needs predefined classes and deals with classification techniques, whereas the unsupervised learning approach does not need predefined classes and deals with clustering techniques. The supervised machine learning approach requires partial human involvement to label the classes of the data and to divide the dataset into train and test sets. Decision tree, support vector machine and Naïve Bayes are among the best-known supervised machine learning algorithms.
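To make the supervised setting concrete, the sketch below trains a multinomial Naïve Bayes classifier on labelled documents and predicts the class of an unseen one. It is a small teaching example with add-one (Laplace) smoothing written for this illustration, not the library implementation used in the experiments, and the training sentences are placeholder tokens.

```python
import math
from collections import Counter, defaultdict


class MultinomialNaiveBayes:
    """Word-count Naive Bayes with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for doc, label in zip(documents, labels):
            tokens = doc.split()
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, document):
        # Words never seen in training carry no information and are skipped.
        tokens = [t for t in document.split() if t in self.vocabulary]
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            # log P(class) + sum of log P(word | class)
            score = math.log(n_docs / self.total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Placeholder training data with predefined classes.
train_docs = ["bad ugly enemy", "enemy bad", "good thanks friend", "friend good work"]
train_labels = ["hate", "hate", "normal", "normal"]
model = MultinomialNaiveBayes().fit(train_docs, train_labels)
```

Calling `model.predict("enemy bad words")` scores both classes and returns the one with the higher log-probability; the labelled examples play the role of the manually annotated dataset described above.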
As we understand from the literature review, supervised machine learning algorithms are currently utilized for hate speech detection and classification. In our work, we also used supervised machine learning algorithms to conduct experiments and then compared their performance.
The researchers identified two classes for the hate speech detection dataset, hate and normal. These classes became the names of two radio buttons shown for each row of data displayed from the hate speech detection dataset, which is stored in the MySQL database we used as the back end of the Afan Oromo hate speech dataset evaluation system. Depending on the number of experts who assigned either the hate or the normal label (see
In machine learning, accuracy, precision, recall and F-measure are used as the main performance evaluation metrics.
A confusion matrix is a table that visualizes the performance of a machine learning algorithm. In the confusion matrix, the variable takes positive or negative values; the columns represent the actual values of the variable, whereas the rows represent the predicted values.
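The four metrics follow directly from the cells of a two-class confusion matrix. The sketch below counts the cells with "hate" as the positive class and derives accuracy, precision, recall and F-score using the standard definitions; the function names and the toy label lists are assumptions for the example.

```python
def confusion_counts(actual, predicted, positive="hate"):
    """Return (tp, fp, fn, tn) for a two-class problem."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive and truth == positive:
            tp += 1      # predicted hate, actually hate
        elif guess == positive:
            fp += 1      # predicted hate, actually normal
        elif truth == positive:
            fn += 1      # predicted normal, actually hate
        else:
            tn += 1      # predicted normal, actually normal
    return tp, fp, fn, tn


def scores(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F-score from the counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}


# Toy labels for illustration only.
actual = ["hate", "hate", "hate", "normal", "normal", "normal"]
predicted = ["hate", "hate", "normal", "hate", "normal", "normal"]
counts = confusion_counts(actual, predicted)
metrics = scores(*counts)
```

The guards against zero denominators matter in practice when a classifier never predicts the positive class on a small test set.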
The experiments were conducted on the Afan Oromo hate speech detection dataset using Python 3.6 to develop the proposed model.
We present the performance of the Linear Support Vector Classifier, Multinomial NB, Random Forest Classifier, Logistic Regression, Decision Tree and Support Vector Classifier in Table 5.
| S.no | Algorithm | Precision (%) | Recall (%) | F-score (%) |
| --- | --- | --- | --- | --- |
| 1 | Linear SVC | 66 | 66 | 64 |
| 2 | Multinomial NB | 66 | 65 | 62 |
| 3 | Random Forest Classifier | 64 | 64 | 63 |
| 4 | Logistic Regression Classifier | 65 | 64 | 61 |
| 5 | Support Vector Classifier | 66 | 65 | 62 |
| 6 | Decision Tree Classifier | 59 | 59 | 59 |
The Afan Oromo hate speech detection data were collected from the Facebook and Twitter social media platforms using Facepager. The data were loaded into the database of a system we developed using PHP and MySQL, through which labels were assigned. Accounts generated for the experts enabled them to annotate the dataset in this system. The annotated Afan Oromo hate speech dataset was then divided into train and test sets using the Python programming language. The important features selected from the prepared dataset resulted in a benchmark Afan Oromo hate speech dataset that contains the train and test sets.
We conducted experiments by applying the machine learning algorithms one at a time to the dataset. The performance of each classifier, measured by accuracy, precision, recall and F-score, is shown in Table 5. When tested on the test dataset, the developed Afan Oromo hate speech detection model scored 64%.
The study is centered on developing hate speech detection models for Afan Oromo on social media platforms, specifically Facebook and Twitter. For successful development of the proposed model, we performed a series of activities. First, data were collected from the selected sources and annotated according to the prepared procedures. Then, text preprocessing was applied to the gathered data to keep relevant data and remove irrelevant data. In the text preprocessing phase, Afan Oromo stop words, punctuation marks except the apostrophe (’), numbers, all non-Afan Oromo text, empty rows, images, videos, audio, links, emoji and email addresses were removed. Typographical errors were replaced with the correct spelling where possible. We also applied data normalization. Next, the feature selection techniques bigram and TF-IDF were applied for data vectorization. Supervised machine learning algorithms were then applied to the vectorized Afan Oromo hate speech data. To conduct the experiments, we used the Linear Support Vector Classifier, Multinomial NB, Random Forest Classifier, Logistic Regression and Support Vector Classifier. Of all the machine learning algorithms applied to build models, the Linear Support Vector Classifier achieved higher accuracy than the others and was selected as the highest performer.
Finally, we also tested the performance of the developed Afan Oromo hate speech detection model on a test dataset, and the model scored an F-score of 64%.
Since the Linear Support Vector Classifier shows good results in detecting hate content on both training and testing data, it has learned from the training data and can also apply that knowledge to new text documents of unknown class.
Finally, the Afan Oromo hate speech model can identify hate speech content in Afan Oromo posts and comments after training on a dataset collected from Facebook and Twitter in Afan Oromo. The model can be put to use detecting and alerting on hate content from Facebook and Twitter. The output of the developed model can help overcome the problems the country may face due to hate speech if it is properly implemented by the Ministry of Peace in Ethiopia and by social media companies.
We have outlined that developing hate speech detection for Afan Oromo social media is essential to eradicate the risk that hate speech poses to social welfare.
Our work has led to the conclusion that machine learning is applicable for the development of hate speech detection models for Afan Oromo on Facebook and Twitter.
We conducted six experiments by applying the Support Vector Classifier, Multinomial NB, Linear Support Vector Classifier, Logistic Regression, Random Forest Classifier and Decision Tree Classifier to build hate speech detection prototypes for Facebook and Twitter. To evaluate the performance of each algorithm, the researchers used performance metrics such as accuracy, precision, recall and F-score. Bigram and TF-IDF were applied as feature selection techniques. The results in Table 5 indicate that the Linear Support Vector Classifier achieved a precision of 66%, recall of 66% and F-score of 64%. Multinomial NB achieved a precision of 66%, recall of 65% and F-score of 62%. The Random Forest Classifier achieved a precision of 64%, recall of 64% and F-score of 63%. The Logistic Regression Classifier achieved a precision of 65%, recall of 64% and F-score of 61%. The Support Vector Classifier achieved a precision of 66%, recall of 65% and F-score of 62%. The Decision Tree Classifier achieved a precision, recall and F-score of 59%. With an F-score of 64%, the Linear Support Vector Classifier scored the highest performance compared with the others. Therefore, the researchers agreed to use the Linear Support Vector Classifier to deploy the Afan Oromo hate speech detection model.
Even though we have developed the Afan Oromo hate speech detection model using machine learning algorithms on data collected from Facebook and Twitter, this study only investigated posts and comments in text form. Posts and comments in the form of images/photos, audio and video have not been considered.
In this study, the experiments were conducted on a small dataset. Future studies can collect data from other social media platforms. In addition, researchers can consider other modes of data for further investigation. Applying algorithms beyond conventional machine learning is another direction for future study.