Indian Journal of Science and Technology
Year: 2019, Volume: 12, Issue: 33, Pages: 1-10
Irfan Ali Kandhro1*, Sahar Zafar Jumani1, Ajab Ali Lashari2, Saima Sipy Nangraj1, Qurban Ali Lakhan1, Mirza Taimoor Baig3 and Subhash Guriro4
1Department of Computer Science, Sindh Madressatul Islam University, Karachi, Pakistan; [email protected], [email protected], [email protected], [email protected]
2Department of Education, Sindh Madressatul Islam University, Karachi, Pakistan; [email protected]
3Department of Computer Science, University of Karachi, Pakistan; [email protected]
4Faculty of Media and Communication Studies, Sindh Madressatul Islam University, Karachi, Pakistan; [email protected]
*Author for correspondence
Irfan Ali Kandhro
Department of Computer Science, Sindh Madressatul Islam University, Karachi, Pakistan; [email protected]
Objectives: Sindhi is a historically rich Indo-Aryan language with a diverse background and many dialects. The recent drive toward globalization, e-commerce and e-literacy has influenced languages as well. Many Sindhi magazines, books, newspapers and web materials are available online, but no proper dataset has yet been designed for Sindhi information processing. This study presents a dataset of Sindhi news headline texts and an automated tool that classifies online texts according to predefined labels.
Methods/Statistical Analysis: To collect the dataset, a scraping tool was designed to extract news headlines from the most popular newspapers: Awami Awaz and Daily Jhoongar. The dataset contains 2800 Sindhi news headlines labeled with the following categories: 0. Entertainment, 1. Sports, 2. Science and Technology, 3. International, 4. National, 5. Sindhi news. The dataset is normalized by removing stop words and cleaning up spaces, punctuation and other unnecessary text. Language features are then analyzed using TF-IDF and a vector model. The paper presents a Sindhi news headline classification model implemented with the following machine learning classification algorithms: Multinomial NB, Linear SVC, Logistic Regression, MLP Classifier, SGD Classifier, Random Forest Classifier and Ridge Classifier.
Findings: The results show that Linear SVC and the MLP Classifier perform better on Sindhi news headline categorization than the other classification techniques. This study helps improve the automatic classification of Sindhi news headlines.
Application/Improvements: It is recommended that the Linear SVC (LSVC) and MLP classifiers be used for Sindhi news headline classification.
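
The normalization and TF-IDF steps described under Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the punctuation set, the function names `normalize` and `tfidf`, and the particular TF-IDF variant (relative term frequency times log(N / df)) are all assumptions chosen for illustration.

```python
import math
import re
import string
from collections import Counter

# Illustrative stop words only (a few common Sindhi function words);
# the stop-word list actually used in the study is not given in the abstract.
STOP_WORDS = {"۾", "۽", "جو", "جي"}

# ASCII punctuation plus a few Arabic-script marks (assumed set).
PUNCT_RE = re.compile("[" + re.escape(string.punctuation + "،؛؟۔") + "]")


def normalize(headline, stop_words=STOP_WORDS):
    """Replace punctuation with spaces, split on whitespace, drop stop words."""
    text = PUNCT_RE.sub(" ", headline)
    return [tok for tok in text.split() if tok not in stop_words]


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Weight = (count / document length) * log(N / document frequency).
    This is one common TF-IDF variant; libraries often add smoothing.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up placeholder headlines (not taken from the dataset).
headlines = ["cricket, team   wins!", "film team: new release"]
tokens = [normalize(h, stop_words={"new"}) for h in headlines]
vectors = tfidf(tokens)
```

Sparse vectors of this kind could then be passed to any of the classifiers listed above. A term that occurs in every headline (here `team`) receives weight zero under this variant, which is why library implementations commonly smooth the inverse document frequency.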