Classification of Gujarati Documents using Naïve Bayes Classifier

Affiliations

  • School of Computer Science, R. K. University, Rajkot - 360020, Gujarat, India
  • Narmada College of Computer Application, Bharuch - 392011, Gujarat, India

Abstract


Objectives: Information overload on the web is a major problem for institutions and businesses today. Picking out useful web documents written in an Indian language is challenging because of the language's morphological variation and the language barrier. As of this writing, no document classifier is available for the Gujarati language. Methods: Keyword search is one way to retrieve meaningful documents from the web, but it does not discriminate by context. In this paper we present the Naïve Bayes (NB) statistical machine learning algorithm for the classification of Gujarati documents. Six pre-defined categories are used in this work: sports, health, entertainment, business, astrology and spiritual. A corpus of 280 Gujarati documents per category is used to train and test the categorizer. We use k-fold cross validation to evaluate the performance of the Naïve Bayes classifier. Findings: The experimental results show that the NB classifier achieved an accuracy of 75.74% without feature selection and 88.96% with feature selection. These results indicate that the NB classifier contributes effectively to the classification of Gujarati documents. Applications: The proposed work can be used to implement directory search in web portals, to sort useful documents, and in many Information Retrieval (IR) applications.
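The evaluation described in the Methods can be illustrated with a minimal sketch of a Naïve Bayes text classifier scored by k-fold cross validation. This is not the authors' implementation: the abstract does not state the NB variant, the smoothing, the tokenization, the feature selection method or the value of k, so the multinomial model, add-one smoothing, interleaved folds and k = 4 below are all assumptions. The function names are hypothetical, and the tiny English token lists are placeholders standing in for tokenized Gujarati documents.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect the counts needed by a multinomial Naive Bayes model."""
    class_docs = Counter(labels)          # number of training documents per class
    word_counts = defaultdict(Counter)    # class -> token frequency table
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab, len(docs)


def predict_nb(model, tokens):
    """Return the class with the highest log-posterior for a tokenized document."""
    class_docs, word_counts, vocab, n_docs = model
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / n_docs)   # log prior
        for tok in tokens:
            if tok in vocab:   # tokens never seen in training are ignored
                # add-one (Laplace) smoothing, an assumption of this sketch
                score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def k_fold_accuracy(docs, labels, k):
    """Mean accuracy over k folds; fold i holds out documents i, i+k, i+2k, ..."""
    accuracies = []
    for i in range(k):
        test_idx = set(range(i, len(docs), k))
        train_docs = [d for j, d in enumerate(docs) if j not in test_idx]
        train_labels = [l for j, l in enumerate(labels) if j not in test_idx]
        model = train_nb(train_docs, train_labels)
        correct = sum(predict_nb(model, docs[j]) == labels[j] for j in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Placeholder corpus: two of the six categories, four toy documents each.
docs = [
    ["match", "team", "score"], ["doctor", "diet", "sleep"],
    ["team", "win", "match"],   ["diet", "health", "doctor"],
    ["score", "team", "goal"],  ["sleep", "health", "diet"],
    ["goal", "match", "win"],   ["doctor", "health", "sleep"],
]
labels = ["sports", "health"] * 4

mean_accuracy = k_fold_accuracy(docs, labels, k=4)
print(f"mean accuracy over 4 folds: {mean_accuracy:.2f}")
```

Each fold trains on the documents outside the held-out set and scores only the held-out documents, so every document is used for testing exactly once. A real experiment on a corpus like the one described would also need a Gujarati tokenizer and, for the second reported result, a feature selection step before training.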

Keywords

Classification, Document Categorization, Gujarati Language, Naïve Bayes.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.