Indian Journal of Science and Technology
Year: 2017, Volume: 10, Issue: 5, Pages: 1-9
Rajnish M. Rakholia1* and Jatinderkumar R. Saini2
1School of Computer Science, R. K. University, Rajkot - 360020, Gujarat, India
2Narmada College of Computer Application, Bharuch - 392011, Gujarat, India
*Author for correspondence:
Rajnish M. Rakholia
School of Computer Science, R. K. University, Rajkot - 360020, Gujarat, India
Objectives: Information overload on the web is a major problem faced by institutions and businesses today. Identifying useful web documents written in an Indian language is challenging because of morphological variance and the language barrier. To date, no document classifier is available for the Gujarati language. Methods: Keyword search is one way to retrieve meaningful documents from the web, but it does not discriminate by context. In this paper we present the Naïve Bayes (NB) statistical machine learning algorithm for the classification of Gujarati documents. Six pre-defined categories are used: sports, health, entertainment, business, astrology and spiritual. A corpus of 280 Gujarati documents per category is used for training and testing the categorizer. We used k-fold cross-validation to evaluate the performance of the Naïve Bayes classifier. Findings: The experimental results show that the accuracy of the NB classifier was 75.74% without feature selection and 88.96% with feature selection. These results indicate that the NB classifier contributes effectively to Gujarati document classification. Applications: The proposed work is useful for implementing directory search in web portals to sort useful documents, and for many Information Retrieval (IR) applications.
Keywords: Classification, Document Categorization, Gujarati Language, Naïve Bayes
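The method named in the abstract, Naïve Bayes classification evaluated with k-fold cross-validation, can be illustrated with a minimal sketch. This is not the authors' implementation: the multinomial model with add-one (Laplace) smoothing, the interleaved fold assignment, the function names train_nb, predict_nb and k_fold_accuracy, and the toy English tokens are all assumptions made for illustration. The feature selection step mentioned in the findings is omitted because the abstract does not specify it.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model on tokenized documents."""
    class_counts = Counter(labels)        # number of documents per category
    word_counts = defaultdict(Counter)    # per-category token frequencies
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the category with the highest log-posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # log prior
        for tok in tokens:
            if tok in vocab:  # tokens never seen in training are ignored
                # Add-one smoothing (an assumption, not from the paper)
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def k_fold_accuracy(docs, labels, k=5):
    """Average accuracy over k folds; fold f holds out indices f, f+k, ..."""
    n = len(docs)
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_docs = [docs[i] for i in range(n) if i not in test_idx]
        train_labels = [labels[i] for i in range(n) if i not in test_idx]
        model = train_nb(train_docs, train_labels)
        correct += sum(predict_nb(model, docs[i]) == labels[i] for i in test_idx)
    return correct / n

# Hypothetical toy corpus standing in for tokenized Gujarati documents.
docs = [
    ["match", "team", "score"], ["team", "player", "goal"],
    ["score", "goal", "match"], ["player", "match", "team"],
    ["doctor", "diet", "sleep"], ["diet", "medicine", "doctor"],
    ["sleep", "medicine", "diet"], ["doctor", "sleep", "medicine"],
]
labels = ["sports"] * 4 + ["health"] * 4

print(k_fold_accuracy(docs, labels, k=4))
```

In a setting like the one described, each document would first be tokenized and each of the six categories would contribute an equal number of documents, so every fold keeps the categories roughly balanced.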