P-ISSN 0974-6846, E-ISSN 0974-5645

Indian Journal of Science and Technology



Year: 2020, Volume: 13, Issue: 19, Pages: 1890-1900

Original Article

Multi-text classification of Urdu/Roman using machine learning and natural language preprocessing techniques

Received Date: 11 April 2020, Accepted Date: 17 May 2020, Published Date: 18 June 2020


Objectives: This research presents multi-text classification of a news text dataset. The main purpose of this work is to classify multi-text in the Urdu and Roman Urdu languages using Natural Language Processing and Machine Learning classification models.

Methods/Statistical analysis: Online news data was collected with the Beautiful Soup web scraping tool. To evaluate model accuracy, the news data, compiled from various online newspaper platforms, is divided into six categories. The corpus consists of 10,500 news items in Urdu and Roman Urdu across the categories Accidental, Education, Entertainment, International, Sports, and Weather. Preprocessing is performed on the text corpus using Natural Language Processing techniques such as data cleaning, data balancing, and stop word removal. Count vectors, TF-IDF, and Chi2 are employed for feature extraction and word filtering. For multi-text classification, the following Machine Learning classifiers were implemented: Naive Bayes Classifier, Logistic Regression, Random Forest Classifier, Linear SVC, and KNeighbors Classifier. In the comparative analysis, the Linear Support Vector Classifier achieved the highest accuracy among the tested methods, at 96%.

Findings: Multi-text classification of Urdu and Roman Urdu is a challenging task because of differing writing styles, word structure, irregularities, grammar, and the combined corpus. For this purpose, we implemented different Machine Learning algorithms with Natural Language preprocessing techniques, which yielded the best results in classifying multi-text news data.
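The kind of pipeline the abstract describes (cleaning, stop word removal, count vectors, and a classifier) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses only the Python standard library, it swaps in a hand-written multinomial Naive Bayes (one of the listed classifiers) rather than the best-performing Linear SVC, and it omits TF-IDF and Chi2. The stop word list, the function names, and the four toy Roman Urdu headlines are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Illustrative stop words only (assumption); the paper's actual list is not given.
STOP_WORDS = {"ka", "ki", "ke", "hai", "aur", "mein", "ne", "se"}


def preprocess(text):
    """Lowercase, keep runs of letters only, and drop stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def train_naive_bayes(docs, labels):
    """Collect per-class document counts and word counts (count vectors)."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = preprocess(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    # Words never seen in training carry no information, so skip them.
    tokens = [t for t in preprocess(text) if t in vocab]
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for t in tokens:
            score += math.log(
                (word_counts[label][t] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, two of the six categories named in the abstract.
train_docs = [
    "Pakistan ne cricket match jeet liya",
    "hockey team ka final match",
    "Karachi mein barish aur thanda mausam",
    "mausam ki peshgoi: kal barish hogi",
]
train_labels = ["Sports", "Sports", "Weather", "Weather"]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, "cricket team ka match"))
print(predict(model, "kal barish hogi"))
```

On a corpus of realistic size, a library implementation with TF-IDF weighting and Chi2 feature selection would be the more natural choice; the sketch only shows how the preprocessing and classification steps fit together.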

Keywords: MultiText Classification; Machine Learning; NLP Preprocessing Techniques 




© 2020 Chhajro, Khuhro, Kumar, Wagan, Umrani, Laghari. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Published By Indian Society for Education and Environment (iSee) 

