P-ISSN 0974-6846 E-ISSN 0974-5645

Indian Journal of Science and Technology

Year: 2019, Volume: 12, Issue: 33, Pages: 1-10

Original Article

Classification of Sindhi Headline News Documents based on TF-IDF Text Analysis Scheme

Abstract

Objectives: Sindhi is a historically rich Indo-Aryan language with a diverse background and many dialects. The recent drive toward globalization, e-commerce and e-literacy has influenced languages as well. Many Sindhi magazines, books, newspapers and web materials are available online, but no proper dataset has yet been designed for Sindhi information processing. This study presents a dataset of Sindhi news headline texts and an automated tool that classifies online texts according to predefined labels.

Methods/Statistical Analysis: To collect the dataset, a scraping tool was designed to extract headlines from the popular newspapers Awami Awaz and Daily Jhoongar. The dataset contains 2800 Sindhi news headlines in the following categories: 0. Entertainment, 1. Sports, 2. Science and Technology, 3. International, 4. National, 5. Sindhi news. The dataset is normalized by removing stop words and cleaning spaces, punctuation and other unnecessary text. The language features are then analyzed using TF-IDF and a vector model. The paper presents a Sindhi headline news classification model implemented with the following machine learning classification algorithms: Multinomial NB, Linear SVC, Logistic Regression, MLP Classifier, SGD Classifier, Random Forest Classifier and Ridge Classifier.

Findings: The results show that Linear SVC and the MLP Classifier perform better on Sindhi news headline categorization than the other classification techniques. This study helps improve the automatic classification of Sindhi news headlines.

Application/Improvements: It is recommended that the Linear SVC (LSVC) and MLP classifiers be used for Sindhi news headline classification.
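The TF-IDF and vector-model step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, which the abstract does not show. The smoothed IDF formula, the function names `tfidf_vectors` and `cosine`, and the toy English tokens (stand-ins for tokenized, stop-word-filtered Sindhi headlines) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into L2-normalized TF-IDF vectors.

    docs: list of token lists (assumed already cleaned and stop-word filtered).
    Returns (idf, vectors), where each vector is a dict mapping term -> weight.
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    # Assumed smoothed IDF variant: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vec = {t: (count / len(tokens)) * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return idf, vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


# Hypothetical toy headlines; placeholders, not taken from the dataset.
headlines = [
    ["team", "wins", "cricket", "match"],
    ["cricket", "team", "loses", "final"],
    ["new", "phone", "launch", "technology"],
]
idf, vectors = tfidf_vectors(headlines)
sports_sim = cosine(vectors[0], vectors[1])
cross_sim = cosine(vectors[0], vectors[2])
```

In this toy setup the two sports-like headlines share terms and so have a higher cosine similarity than a sports headline and a technology headline, which share none. Vectors of this kind are what a classifier such as Linear SVC would take as input.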
