Multi-text classification of the Urdu language through a combination of Urdu and Roman Urdu text is considered one of the most challenging tasks in text classification. Urdu is the national language of Pakistan and is spoken
The increasing demand for information and the rapid growth of online social platforms have made text classification increasingly important for managing and organizing text data
News text circulating on different websites consists of a variety of categories such as sports, entertainment, education, and politics. Given the huge volume of news data, these categories need to be identified quickly and automatically rather than through manual effort
Lately, the field of text mining has been gaining importance due to the availability of unstructured and semi-structured data from multiple sources. The prime purpose of text mining is to enable users to extract information from different sources and then perform operations such as information retrieval, classification (supervised, unsupervised, and semi-supervised), data mining, and natural language processing, in combination with machine learning approaches for automatic classification
Previously, a great deal of work has been performed on other languages, but the research community has given less attention to Urdu. In earlier work, data was acquired from sources such as online blogs and then classified with well-known machine learning algorithms, including Decision Trees, Support Vector Machines, and K-Nearest Neighbors. Comparative analysis concluded that K-NN outperformed Decision Trees and Support Vector Machines in terms of accuracy, precision, and recall. The same paper cited another study in which five well-known classification algorithms were applied to an Urdu corpus of 21769 news documents spanning seven categories (including Culture, Business, Health, Entertainment, and Sports). After various NLP preprocessing steps, 93400 features were extracted from the corpus, and the machine learning algorithms achieved up to 94% precision and recall using majority voting
In this section we describe the flow of our proposed work. First, corpora of Urdu and Roman Urdu text were collected from different sources. Next, NLP preprocessing techniques were applied to the text data for feature extraction and selection. Finally, multiple ML classification algorithms were implemented. The
In the first step we collected a dataset of Urdu and Roman Urdu text from various online platforms with the help of the BeautifulSoup web-scraping tool. The compiled data was raw, unstructured text, so to process it with machine learning algorithms we first needed to put it into a structured format suitable for decision making and classification. The corpus contains 10500 news texts in Urdu and Roman Urdu, prepared for the subsequent classification steps. The target categories are Accidental, Education, Entertainment, International, Sports, and Weather news.
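The scraping step can be sketched as follows. This is a minimal illustration of extracting news texts and labels with BeautifulSoup; the HTML structure, class names, and content here are hypothetical stand-ins for the actual news pages, which would normally be fetched over HTTP first.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a fetched news page;
# in practice the markup would come from an HTTP response body.
html = """
<div class="news-item"><h2>Match report</h2><span class="cat">Sports</span></div>
<div class="news-item"><h2>Exam schedule</h2><span class="cat">Education</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
records = []
for item in soup.find_all("div", class_="news-item"):
    records.append({
        "text": item.find("h2").get_text(strip=True),
        "label": item.find("span", class_="cat").get_text(strip=True),
    })

print(records)
```

Collected records of this shape (text plus category label) are what the later preprocessing and classification steps operate on.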
The problem of imbalanced data distribution is frequently observed in data science. In an imbalanced dataset, one class has significantly more instances than another, leaving the minority class with few data points. This directly affects the performance of most ML/DL classification algorithms, which are not designed to handle class imbalance and consequently become biased towards the classes with the majority of data points.
The literature reveals that many standard classification algorithms, such as Logistic Regression and Decision Trees, do not handle imbalanced class distributions well. This leads to a heavy bias towards the classes with more data points, while classes with fewer data points are treated as noise and are mostly ignored. Hence, minority classes suffer a higher misclassification rate than majority classes. Consequently, the accuracy metric is not very meaningful when a model is trained on imbalanced data. However, there are several methods to handle the imbalance issue, described below.
When building a corpus, collecting more real data is always a better approach than generating artificial data by sampling the existing data points.
One way to reduce the size of an over-represented class is to remove duplicated data from the dataset, since similar data points may be repeated many times. For example, “Where is the order” and “Where is my order” have the same semantic meaning. Removing such repeated content reduces the volume of the majority class.
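A simple way to catch near-duplicates like the pair above is to compare messages after dropping a few filler words. This is only a sketch: the filler list here is illustrative, not the one used in the study, and real de-duplication would use a more robust similarity measure.

```python
# Illustrative filler words; a real pipeline would use a fuller stop list
# or a string-similarity measure instead.
FILLERS = {"the", "my", "a", "is"}

def normalize(text):
    """Reduce a message to its content words, in order."""
    return tuple(t for t in text.lower().split() if t not in FILLERS)

def deduplicate(messages):
    seen, kept = set(), []
    for msg in messages:
        key = normalize(msg)
        if key not in seen:
            seen.add(key)
            kept.append(msg)
    return kept

msgs = ["Where is the order", "Where is my order", "Order delivered"]
print(deduplicate(msgs))  # keeps only one copy of the near-duplicate pair
```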
To balance the text classes in our dataset, we implemented SMOTE (Synthetic Minority Over-sampling Technique), which balances the data by oversampling minority-class data points and undersampling majority-class instances.
This is the process of deleting data points from the majority class at random until both classes are balanced. There is a higher probability of losing information, which can ultimately lead to poor model training.
This is the process of randomly duplicating data points of the minority class. The problem with this approach is that it may lead to overfitting and inaccurate prediction results on the test data.
More importantly, the SMOTE approach effectively forces the decision region of the minority class to become more general. It produces synthetic data points by taking each minority-class sample and introducing synthetic examples along the line segments joining any or all of its k minority-class nearest neighbors, as can be seen in the
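The interpolation step described above can be illustrated in a few lines of NumPy. This is a toy sketch of SMOTE's core idea only; in practice a library implementation (such as the one in imbalanced-learn) would be used, which also handles neighbor search and sampling ratios.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(x, neighbors):
    """Create one synthetic minority point on the line segment between
    sample x and a randomly chosen minority-class nearest neighbor."""
    nb = neighbors[rng.integers(len(neighbors))]
    gap = rng.random()            # random position along the segment
    return x + gap * (nb - x)

# Toy minority-class sample and two of its nearest minority neighbors.
x = np.array([1.0, 1.0])
neighbors = np.array([[2.0, 1.0], [1.0, 2.0]])
synthetic = smote_sample(x, neighbors)
print(synthetic)  # lies on a segment between x and one neighbor
```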
After applying the SMOTE technique to our dataset, we obtained balanced data, as shown in the
Words that contribute little to information extraction and decision making, such as conjunctions, pronouns, articles, and prepositions, are non-semantic words usually described as stop words. During NLP preprocessing these words are removed from the text features because they contribute little or nothing to the sentiment of a sentence. Stop words in Urdu include آئے, ارے, اس, آئ, and in Roman Urdu they include pronouns such as ’hum’, ’mein’, and ’tum’; such words cause confusion in the text classification process. After applying stop-word removal we ensure there are no stop words left in the feature vector. The result of stop-word removal can be observed in
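Stop-word removal for Roman Urdu can be sketched as a simple filter. The stop list below is a small illustrative subset; the list actually used in the study is larger and also covers Urdu-script stop words.

```python
# Illustrative Roman Urdu stop list (a small subset for demonstration).
ROMAN_URDU_STOPWORDS = {"hum", "mein", "tum", "ka", "ki", "hai"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in ROMAN_URDU_STOPWORDS]

tokens = ["hum", "cricket", "match", "mein", "jeet", "gaye"]
print(remove_stopwords(tokens))  # → ['cricket', 'match', 'jeet', 'gaye']
```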
Feature extraction is one of the important preprocessing steps in a machine learning problem. To facilitate learning, this step builds a feature vector from the textual data. A feature vector is simply an n-dimensional vector representation of a data point's attributes, which may be binary, categorical, or numerical.
Term Frequency–Inverse Document Frequency (TF-IDF) is a numerical statistic whose main job is to indicate how important a word is within a corpus or data collection. Higher TF-IDF values indicate a stronger relationship between a word and the document. Furthermore, it has been shown that a combination of bag-of-words and TF-IDF can perform better than either TF-IDF or bag-of-words alone
A major problem when working with language is that classifiers and learning algorithms cannot work on raw text directly. To deal with this we need feature extraction techniques that convert text data into a matrix (or vector) of features, so during the preprocessing step the texts are converted to a more manageable representation. Bag-of-Words and TF-IDF are the two most common approaches for extracting features from text data, and we have used TF-IDF in this research study. Relying only on text preprocessing is not enough; feature extraction makes a significant contribution to improved results. Prior studies have successfully extracted features from corpora, choosing combinations of different text features: starting from simple features and then moving to a set of eight features, which was concluded to be useful for sentiment analysis of the English language
Specifically, for each term in our dataset we calculate a measure called Term Frequency–Inverse Document Frequency, abbreviated tf-idf. In our experiments we used the sklearn.feature_extraction.text.TfidfVectorizer class to compute the tf-idf vectors, along with a few other related settings, for feature extraction in this research.
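A minimal usage of TfidfVectorizer looks like the following. The toy English documents and the parameters shown are illustrative; the study's exact settings (n-gram range, vocabulary size, tokenization for Urdu script, etc.) may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the news dataset.
docs = [
    "pakistan won the cricket match",
    "heavy rain expected in lahore",
    "cricket series starts next week",
]

# fit_transform learns the vocabulary and returns a sparse
# document-by-term matrix of tf-idf weights.
vectorizer = TfidfVectorizer(lowercase=True)
X = vectorizer.fit_transform(docs)

print(X.shape)  # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_)[:5])
```

Each row of `X` is the feature vector for one document and can be fed directly to the classifiers discussed later.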
Feature selection is an important data-preprocessing step in NLP, and likewise when dealing with text classification. One research survey on opinion mining described a model that extracted data from a Chinese opinion corpus (NTCIR6) and used Chi-square for feature selection from customer reviews
The filter method selects a good subset of the original feature set. In a filter-based approach we specify a metric and filter features based on it. Common metrics used in the filter approach are Chi-square and correlation.
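Chi-square filtering can be sketched with scikit-learn's SelectKBest. The tiny labeled corpus below is a made-up stand-in; `k` is chosen arbitrarily for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = [
    "goal scored in football",
    "rain and storm warning",
    "football final tonight",
    "storm hits coastal city",
]
labels = ["sports", "weather", "sports", "weather"]

# Bag-of-words counts, then keep the k features most associated
# with the class labels according to the chi-square statistic.
X = CountVectorizer().fit_transform(docs)
selector = SelectKBest(chi2, k=4)
X_selected = selector.fit_transform(X, labels)
print(X_selected.shape)  # (4 documents, 4 selected features)
```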
The wrapper-based method treats the selection of a feature subset as a search problem; Recursive Feature Elimination (RFE) is commonly used in this approach.
Embedded methods use algorithms that have built-in feature selection. For example, Random Forest and Lasso have their own feature selection mechanisms.
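As a sketch of the embedded idea, a fitted Random Forest exposes per-feature importance scores that can be used to rank and prune features. The synthetic data below is purely illustrative: the target depends only on the first feature, so that feature should receive the highest importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((40, 5))
# Make feature 0 the only informative one.
y = (X[:, 0] > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = model.feature_importances_
print(importances)  # feature 0 should dominate the ranking
```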
In this research, different classification algorithms have been implemented for news text classification; they are explained below:
The K-NN algorithm implemented in this research study achieved 53% accuracy on our multi-text classification task. For regression, K-NN outputs the average of the values of the k nearest neighbors as the predicted property value of the object; for classification, it outputs a class membership.
K-NN is considered an instance-based or lazy learning method because no model is built until classification is actually performed. Furthermore, in this method the nearest neighbors can be given more weight in the average than the distant ones. Neighbor selection can be based either on object properties or on the K-NN classification value
Naïve Bayes (NB) is a very common method for text classification and belongs to the category of supervised learning algorithms. NB assumes the data is described by multiple features and makes independence assumptions: it does not consider the order of features, treating every feature as independent, so features do not affect each other in the classification task. One advantage of this method is that it performs well even with a small amount of training data. In conclusion, it works well when features are independent, but its performance decreases as the dependency between features increases. It has also been shown to offer good speed and accuracy on large databases. The accuracy we obtained from NB in this research study for multi-text classification is significantly high at 92%.
Random Forest works for both classification and regression. It builds multiple decision trees during training and outputs the class chosen by most trees for classification, or the mean prediction of the individual trees for regression. It is named “random” because data points are sampled randomly from the training data and features are selected randomly during tree induction. Predictions for classification are based on majority voting, and for regression on averaging. It is more robust to noise and shows a good performance improvement compared with a single-tree classifier such as C4.5
The text classification approach we used in this research study, which is widely applied to effective categorization, news filtering, personalization, and information routing, is the Linear SVM classifier. In SVM the data samples are treated as vectors scattered in space, and the closest points between classes are called support vectors. Once the closest points are found, a line is drawn connecting them; the SVM then chooses as the best separating boundary the line that bisects, and is perpendicular to, this connecting line, separating the two classes as far as possible. In this way documents and classes are both connected and separated by a particular margin. Whenever a new document appears, we map it to a point and check which side of the separating line it falls on to predict its class. According to research, state-of-the-art classification accuracy can be achieved by applying a linear Support Vector Machine (SVM) to a bag-of-words representation of the text, where each unique word in the training corpus becomes a separate feature
Logistic Regression is a linear model used for classification problems. It performs predictive analysis on the basis of probability, most commonly using the sigmoid function. In this research study Logistic Regression performed well on text classification with an accuracy of 93%, ranking second among all the tested classification algorithms.
In this section the machine learning classifiers selected and trained for text classification have been discussed. The supervised learning classifiers implemented in this research are: Naive Bayes Classifier, Logistic Regression, Random Forest Classifier, Linear SVC, and K-Neighbors Classifier.
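The overall training setup can be sketched as TF-IDF pipelines fitted per classifier. The six-document corpus below is a toy stand-in for the study's 10500-item news dataset, and the default hyperparameters shown are not necessarily those used in the experiments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labeled corpus (illustrative only).
docs = ["team wins the final", "exam results announced",
        "storm warning issued", "players training hard",
        "new school semester", "rain expected tomorrow"]
labels = ["sports", "education", "weather",
          "sports", "education", "weather"]

# Fit each classifier on the same TF-IDF features and predict a new text.
for clf in (LinearSVC(), LogisticRegression(max_iter=1000), MultinomialNB()):
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(docs, labels)
    print(type(clf).__name__, pipe.predict(["big match final today"]))
```

In the actual experiments, each pipeline would be evaluated on a held-out test split to produce the accuracies reported in the table below.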
| Sl. no | ML Classification Algorithm | Accuracy |
|---|---|---|
| 1 | K-Nearest Neighbor (K-NN) | 53.27% |
| 2 | Linear SVC | 96.1% |
| 3 | Logistic Regression | 93.43% |
| 4 | Naive Bayes Classifier | 92.42% |
| 5 | Random Forest Classifier | 54.48% |
In this work, different classification algorithms were implemented for multi-text classification, including the Naive Bayes Classifier, Logistic Regression, Random Forest Classifier, Linear SVC, and K-Neighbors Classifier. Prior to classification, various NLP preprocessing techniques were applied, namely data cleaning, feature extraction, and feature selection. The results showed that the Linear SVC and Logistic Regression classifiers achieved 96% and 93% accuracy respectively, outperforming the other classifiers tested on the same dataset for multi-text classification of the Urdu language.
In future, we plan to extend this work by adding more categories of Urdu and Roman Urdu news text in order to further verify the accuracy of the existing algorithms.