Efficient Accreditation Document Classification Using Naïve Bayes Classifier

Anna Fay E Naïve; Jocelyn B Barbosa

doi:10.17485/IJST/v15i1.1761

Indian Journal of Science and Technology

Year: 2022, Volume: 15, Issue: 1, Pages: 9-18

Original Article

Anna Fay E Naïve¹, Jocelyn B Barbosa²*

¹Instructor, Department of Information Technology, University of Science and Technology of
Southern Philippines, C.M Recto Ave., Lapasan, Cagayan de Oro City, 9000, Philippines
²Associate Professor, Department of Information Technology, University of Science and
Technology of Southern Philippines, C.M Recto Ave., Lapasan, Cagayan de Oro City, 9000, Philippines

*Corresponding Author
Email: [email protected]

Received Date: 10 December 2021, Accepted Date: 30 December 2021, Published Date: 21 January 2022

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objectives: To develop a desktop application that automatically classifies a document into the accreditation area to which it belongs. Specifically, the study aims to: a) create a predictive model that addresses document classification tasks; b) design and develop an application that classifies documents by accreditation area; and c) evaluate the performance of the automatic document classification. Methods: We introduce an innovative approach for the automatic classification of accreditation documents. Specifically, the approach includes scanned or captured documents in the classification task by extracting their text with Optical Character Recognition (OCR); preprocesses the text documents using TF-IDF (Term Frequency-Inverse Document Frequency) with stopword removal and n-grams of 1-2; and builds the model using the Naive Bayes algorithm with additive (Laplace/Lidstone) smoothing as the classifier. Results: Accuracy, precision, recall, and F1-score were computed to evaluate the efficiency of the approach. The results showed 82% accuracy, 84% precision, 82% recall, and an 82% F1-score. With OCR for text extraction, TF-IDF for text preprocessing, and the Naive Bayes classifier, the results indicate that the proposed approach is efficient. Conclusions: Input documents in any form, whether captured images, scanned documents, or plain text documents, were classified using OCR, TF-IDF, and the Naive Bayes classifier. The approach provides an efficient way to classify accreditation documents automatically, and it addresses limiting factors of previous works, i.e., classification based on personal opinion and time-consuming manual classification.

Keywords: Accreditation Document Classification; Document Classification Objective Evaluation; TF-IDF; Term frequency-inverse document frequency; Multinomial Naive Bayes; OCR; Optical Character Recognition
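
The classification steps named in the abstract (stopword removal, n-grams of 1-2, TF-IDF weighting, and Naive Bayes with additive smoothing) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The OCR step is omitted and the input is assumed to be already-extracted text; the stopword list, the smoothed IDF variant, the area labels ("Faculty", "Library"), the sample documents, and the support-weighted averaging of the metrics are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (assumption; the paper does not give its list).
STOPWORDS = {"the", "of", "and", "a", "an", "to", "in", "for", "on", "is", "are"}

def tokenize(text):
    """Lowercase, drop stopwords, and return unigrams plus bigrams (n-grams of 1-2)."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return words + [f"{a} {b}" for a, b in zip(words, words[1:])]

def fit_idf(docs_tokens):
    """Smoothed IDF, ln((1 + N) / (1 + df)) + 1 (one common variant, assumed here)."""
    n_docs = len(docs_tokens)
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

def tfidf(tokens, idf):
    """Raw term count times IDF; terms unseen during training are ignored."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}

def train_nb(texts, labels, alpha=1.0):
    """Multinomial Naive Bayes over TF-IDF weights with additive smoothing alpha."""
    docs_tokens = [tokenize(t) for t in texts]
    idf = fit_idf(docs_tokens)
    weights = defaultdict(Counter)  # class -> term -> summed TF-IDF weight
    for tokens, label in zip(docs_tokens, labels):
        weights[label].update(tfidf(tokens, idf))
    classes = {
        label: {
            "log_prior": math.log(n / len(labels)),
            "weights": weights[label],
            "total": sum(weights[label].values()),
        }
        for label, n in Counter(labels).items()
    }
    return {"idf": idf, "alpha": alpha, "classes": classes}

def predict(model, text):
    """Return the class with the highest log-posterior score."""
    vec = tfidf(tokenize(text), model["idf"])
    alpha, vocab_size = model["alpha"], len(model["idf"])
    best_label, best_score = None, -math.inf
    for label, c in model["classes"].items():
        denom = c["total"] + alpha * vocab_size
        score = c["log_prior"] + sum(
            w * math.log((c["weights"][t] + alpha) / denom) for t, w in vec.items()
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def evaluate(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1 (assumed averaging)."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for label, support in Counter(y_true).items():
        tp = sum(t == p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        precision += p_c * support / n
        recall += r_c * support / n
        f1 += f_c * support / n
    return accuracy, precision, recall, f1

# Hypothetical training documents and accreditation areas.
train_texts = [
    "faculty qualifications and teaching load report",
    "faculty development training certificates",
    "library holdings and book acquisition list",
    "library services and journal subscriptions",
]
train_labels = ["Faculty", "Faculty", "Library", "Library"]
model = train_nb(train_texts, train_labels)
print(predict(model, "list of new book acquisition for the library"))
```

Setting `alpha=1.0` corresponds to Laplace smoothing, and values between 0 and 1 correspond to Lidstone smoothing. In a real pipeline, the text passed to `predict` would come from an OCR engine for scanned or captured documents.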

Copyright

© 2022 Naïve & Barbosa. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Published By Indian Society for Education and Environment (iSee)