P-ISSN 0974-6846, E-ISSN 0974-5645

Indian Journal of Science and Technology

Year: 2019, Volume: 12, Issue: 35, Pages: 1-9

Original Article

Roman-Urdu News Headline Classification with IR Models using Machine Learning Algorithms

Abstract

Objectives: Roman-Urdu is considered a non-standard language that is used frequently on the Internet. Classifying Roman-Urdu text for article tagging is a difficult task because of the many irregularities in spelling; for example, the word khubsurat (beautiful) can also be written as khoobsurat, khubsoorat, and khobsorat. Methods/Statistical Analysis: In this study, we scrape Roman-Urdu news headlines from various online newspapers. Our corpus contains 12319 news headlines in seven categories: Accident, Sports, Weather, Arrest, Conference, Operation and Violence. We apply preprocessing steps such as Roman-Urdu stop-word removal, and use the IR models TF-IDF and Count Vector for feature extraction before applying the classification algorithms. Findings: We compare the results of different machine learning algorithms: RF, LSVC, MNB, LR, RC, PAC, Perceptron, NC and SGDC. Our model identifies the desired class best with the SGD classifier, which gives 93.50% accuracy. Application/Improvements: We recommend the SGD classifier for Roman-Urdu news headline text classification.

Keywords: Linear SVC, Multinomial Naïve Bayes (MNB), Ridge Classifier (RC), Random Forest, Roman-Urdu, Supervised Machine Learning, Stochastic Gradient Descent (SGD), Text Classification, TF-IDF
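
The feature-extraction step named in the abstract (stop-word removal, then Count Vector and TF-IDF features) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the three toy headlines and the stop-word list are invented and are not taken from the paper's corpus, and the weighting is the textbook form tf × log(N/df), which may differ from the variant the authors actually used.

```python
import math
from collections import Counter

# Hypothetical toy headlines, invented for illustration (not from the paper's corpus).
headlines = [
    "karachi mein bus hadsa",
    "lahore mein cricket match",
    "karachi mein barish",
]

# Hypothetical Roman-Urdu stop-word list; the abstract does not give the real one.
STOP_WORDS = {"mein", "ka", "ki", "ke", "aur"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


def count_vectors(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


def tfidf_vectors(docs):
    """Return (vocabulary, matrix) of textbook TF-IDF weights: tf * log(N / df)."""
    vocab, counts = count_vectors(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    matrix = []
    for row in counts:
        total = sum(row)
        # Term frequency is the count normalised by document length.
        matrix.append([(c / total) * w if total else 0.0 for c, w in zip(row, idf)])
    return vocab, matrix


vocab, counts = count_vectors(headlines)
_, tfidf = tfidf_vectors(headlines)
print(vocab)
print(counts[0])
print([round(w, 3) for w in tfidf[0]])
```

In this toy example the term "karachi" occurs in two of the three headlines, so it receives a lower TF-IDF weight than terms that occur in only one. Vectors of this kind are what a classifier such as the SGD classifier would then be trained on.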
