Indian Journal of Science and Technology
DOI: 10.17485/ijst/2019/v12i35/146571
Year: 2019, Volume: 12, Issue: 35, Pages: 1-9
Original Article
Syed Muhammad Hassan1*, Fayyaz Ali2, Shaukat Wasi3, Samreen Javeed1, Imtiaz Hussain1 and Syeda Nazia Ashraf1
1Department of Computer Science, Sindh Madressatul Islam University, Karachi, Pakistan
2Department of Computer Science, Sir Syed University of Engineering and Technology, Karachi, Pakistan
3Department of Computer Science, Mohammad Ali Jinnah University, Karachi, Pakistan
*Author for correspondence
Syed Muhammad Hassan
Department of Computer Science, Sindh Madressatul Islam University, Karachi, Pakistan
Objectives: Roman-Urdu is considered a non-standard language that is used frequently on the Internet. Classifying Roman-Urdu text for article tagging is difficult because of the many irregularities in spelling; for example, the word khubsurat (beautiful) can also be written as khoobsurat, khubsoorat, and khobsorat. Methods/Statistical Analysis: In this study, we scrape Roman-Urdu news headlines from various online newspapers. Our corpus contains 12319 news headlines in seven categories: Accident, Sports, Weather, Arrest, Conference, Operation and Violence. We apply preprocessing steps such as Roman-Urdu stop-word removal, and use the TF-IDF and Count Vector models for feature extraction before applying the classifier algorithms. Findings: We compare the results of different machine learning algorithms: RF, LSVC, MNB, LR, RC, PAC, Perceptron, NC and SGDC. The SGD classifier gives the best result in identifying the desired class, with 93.50% accuracy. Application/Improvements: We recommend using the SGD classifier for Roman-Urdu news headline text classification.
Keywords: Linear SVC, Multinomial Naïve Bayes (MNB), Ridge Classifier (RC), Random Forest, Roman-Urdu, Supervised Machine Learning, Stochastic Gradient Descent (SGD), Text Classification, TF-IDF
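The feature-extraction step named in the abstract (stop-word removal followed by TF-IDF weighting) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the stop-word set, the three toy headlines, the helper names `tokenize` and `tfidf`, and the particular IDF variant (ln(N/df) + 1) are all assumptions made for illustration, and the paper's actual stop-word list and weighting settings are not given here.

```python
import math
from collections import Counter

# Hypothetical Roman-Urdu stop words (illustrative only, not the paper's list).
STOP_WORDS = {"ka", "ki", "ke", "mein", "se", "ko", "hai"}


def tokenize(headline):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in headline.lower().split() if t not in STOP_WORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = ln(N / df) + 1, where df is the number of documents
          containing the term (an assumed variant, one of several in use)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: count each term once per document.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty headlines
        vectors.append({
            term: (count / total) * (math.log(n_docs / df[term]) + 1.0)
            for term, count in counts.items()
        })
    return vectors


# Made-up toy headlines, not taken from the paper's corpus.
HEADLINES = [
    "karachi mein barish ka imkan",
    "lahore mein barish se sarkein band",
    "pakistan ne cricket match jeet liya",
]
VECTORS = tfidf(HEADLINES)
```

In this toy data, a term that appears in two headlines (`barish`) receives a lower weight than a term unique to one headline (`karachi`), which is the property that makes TF-IDF features useful to a downstream classifier such as the SGD classifier reported above.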