Multi-text classification of Urdu/Roman using machine learning and natural language preprocessing techniques

M Ameen Chhajro*; Mansoor Ahmed Khuhro; Kamlesh Kumar; Asif Ali Wagan; Aamir Iqbal Umrani; Asif Ali Laghari

doi:10.17485/IJST/v13i19.230


Indian Journal of Science and Technology

Year: 2020, Volume: 13, Issue: 19, Pages: 1890-1900

Original Article

M Ameen Chhajro¹∗, Mansoor Ahmed Khuhro¹, Kamlesh Kumar¹, Asif Ali Wagan¹, Aamir Iqbal Umrani¹, Asif Ali Laghari¹

¹ Department of Computer Science, Sindh Madressatul Islam University, Karachi, Pakistan

∗Corresponding author:
M Ameen Chhajro
Department of Computer Science, Sindh Madressatul Islam University, Karachi, Pakistan
Email: [email protected]

Received Date: 11 April 2020, Accepted Date: 17 May 2020, Published Date: 18 June 2020

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objectives: This research presents multi-text classification of a news text dataset. The main purpose of this work is to classify multi-text in Urdu and Roman Urdu using Natural Language Processing (NLP) and Machine Learning classification models.

Methods/Statistical analysis: Online news data was collected with the Beautiful Soup web scraping tool from various online newspaper platforms. The resulting corpus consists of 10,500 news items in Urdu and Roman Urdu, divided into six categories to analyze model accuracy: Accidental, Education, Entertainment, International, Sports, and Weather. The text corpus was preprocessed using NLP techniques, for example data cleaning, data balancing, and stop word removal. For feature extraction, count vectors, TF-IDF, and Chi2 were employed for word filtering. For multi-text classification, five Machine Learning classifiers were implemented: Naive Bayes Classifier, Logistic Regression, Random Forest Classifier, Linear SVC, and KNeighbors Classifier. In the comparative analysis, the Linear Support Vector Classifier achieved the highest accuracy among the tested methods, at 96%.

Findings: Multi-text classification of Urdu and Roman Urdu is a challenging task because of differing writing styles, word structure, irregularities, grammar, and the combined corpus. For this purpose, we implemented different Machine Learning algorithms with NLP preprocessing techniques, which provided optimal results in the classification of multi-text news data.
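The feature-extraction steps named in the abstract (stop word removal, count vectors, TF-IDF weighting, and Chi2 word filtering) can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration: the four-document toy corpus, the stop word list, the function names, the smoothed IDF formula, and the textbook 2x2 contingency-table form of the chi-squared statistic. The abstract does not state which formulas or library the authors used, so this is not their implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus (Roman Urdu-like text, category); not the paper's data.
DOCS = [
    ("team ne match jeet liya", "Sports"),
    ("cricket team ka match kal hai", "Sports"),
    ("barish ka imkan kal hai", "Weather"),
    ("mausam sard barish tez hai", "Weather"),
]
# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"ne", "ka", "hai", "kal"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def count_vectors(docs):
    """Return (sorted vocabulary, one term-count Counter per document)."""
    counts = [Counter(tokenize(text)) for text, _ in docs]
    vocab = sorted(set().union(*counts))
    return vocab, counts


def tfidf(counts, vocab):
    """Weight counts by a smoothed IDF: tf * (ln((1 + n) / (1 + df)) + 1)."""
    n = len(counts)
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    return [{t: c[t] * idf[t] for t in c} for c in counts]


def chi2_score(term, counts, labels, target):
    """Chi-squared of the 2x2 table (term present?) x (label == target?)."""
    a = b = c = d = 0
    for cnt, lab in zip(counts, labels):
        present, positive = term in cnt, lab == target
        if present and positive:
            a += 1
        elif present:
            b += 1
        elif positive:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_top_k(vocab, counts, labels, target, k):
    """Keep the k terms with the highest chi-squared score (ties by name)."""
    scored = [(chi2_score(t, counts, labels, target), t) for t in vocab]
    return [t for _, t in sorted(scored, key=lambda s: (-s[0], s[1]))[:k]]


if __name__ == "__main__":
    labels = [lab for _, lab in DOCS]
    vocab, counts = count_vectors(DOCS)
    weights = tfidf(counts, vocab)
    print("vocabulary:", vocab)
    print("TF-IDF of first document:", weights[0])
    print("top terms for Sports:", select_top_k(vocab, counts, labels, "Sports", 3))
```

In a full pipeline, the filtered TF-IDF vectors would then be passed to each of the five classifiers for comparison. Libraries such as scikit-learn provide ready-made versions of these steps (for example `CountVectorizer`, `TfidfVectorizer`, and `chi2`), though their chi-squared scoring is computed from feature values rather than from the presence/absence table used in this sketch.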

Keywords: Multi-Text Classification; Machine Learning; NLP Preprocessing Techniques


Copyright

© 2020 Chhajro, Khuhro, Kumar, Wagan, Umrani, Laghari. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Published By Indian Society for Education and Environment (iSee)