OCR for historical Kannada documents using clustering methods

P Ravi; C Naveena; Y H Sharathkumar

doi:10.17485/IJST/v13i35.1287

Indian Journal of Science and Technology

Year: 2020, Volume: 13, Issue: 35, Pages: 3652-3663

Original Article

P Ravi¹,²,³*, C Naveena¹,³, Y H Sharathkumar⁴,³

¹Department of Computer Science and Engineering, SJB Institute of Technology, Bengaluru, 560060, Karnataka, India. Tel.: +919480507409
²Department of Computer Science and Engineering, Vidyavardhaka College of Engineering, Mysuru, 570002, Karnataka, India
³Affiliated to Visvesvaraya Technological University, Belagavi, 590018, Karnataka, India
⁴Department of Information Science and Engineering, Maharaja Institute of Technology, Mysuru, 571438, Karnataka, India

*Corresponding Author
Tel: +919480507409
Email: [email protected]

Received Date: 31 July 2020, Accepted Date: 06 September 2020, Published Date: 03 October 2020

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Motivation: Kannada is an ancient language and the official language of the Indian state of Karnataka. The study of ancient Kannada scripts from stone carvings, leaves, metal, cloth, paper and other sources enhances our knowledge of the traditions and culture practiced in Karnataka. Poor quality, variability and contrast make it very challenging to extract information from ancient Kannada scripts or to recognize their characters.

Objectives: To design a suitable Optical Character Recognition (OCR) technique for reading ancient Kannada scripts.

Method: Clustering by fast search and find of density peaks is a state-of-the-art density-based clustering algorithm that can effectively find clusters of arbitrary shape. However, it requires calculating the distances between all pairs of points in a data set to determine the density and separation of each point. Consequently, its computational cost is extremely high for large-scale data sets. In this work, the given document is first preprocessed. Features such as SIFT and SURF are then extracted and clustered using K-means clustering, and similarity is computed using different measures.

Findings: Classification accuracy was studied under different clustering methods (K-means, agglomerative and density-based clustering) with distance-based measures such as Euclidean and Manhattan. To evaluate the performance of the proposed method, we created our own database of Ashok, Kadamba, Hoysala and Mysuru scripts; experiments were conducted on this 4-class database under 70, 50 and 30 different training models from each class.

Novelty: We propose K-means clustering using SIFT and SURF features for ancient Kannada manuscripts. Experiments were conducted on our own database to validate the performance of the presented system.

Keywords: Historical Kannada; Karnataka; SIFT; SURF; KMeans
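
The clustering and matching steps named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the toy 2-D vectors stand in for real SIFT/SURF descriptors, the farthest-point seeding is an assumption chosen to keep the example deterministic, and all function names are hypothetical. Clustering uses plain K-means with Euclidean distance; the query is then matched to a centroid under either the Euclidean or the Manhattan measure.

```python
import math

def euclidean(a, b):
    """Straight-line (L2) distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block (L1) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def init_centroids(points, k):
    """Deterministic farthest-point seeding (an assumption, not from the paper)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point whose nearest existing centroid is farthest away.
        farthest = max(points, key=lambda p: min(euclidean(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids

def kmeans(points, k, iterations=20):
    """Plain Lloyd-style K-means; returns (centroids, labels)."""
    centroids = init_centroids(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: euclidean(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, labels

def nearest_cluster(vector, centroids, distance=euclidean):
    """Index of the centroid closest to `vector` under the chosen measure."""
    return min(range(len(centroids)), key=lambda j: distance(vector, centroids[j]))

# Toy 2-D vectors standing in for real SIFT/SURF descriptors (hypothetical data).
toy_descriptors = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.2),      # group A
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),      # group B
    (10.0, 0.0), (10.1, 0.2), (9.9, 0.1),    # group C
]
centroids, labels = kmeans(toy_descriptors, k=3)

# Match an unseen vector to a cluster under both distance measures.
query = (5.1, 5.0)
match_l2 = nearest_cluster(query, centroids, euclidean)
match_l1 = nearest_cluster(query, centroids, manhattan)
```

In a real pipeline the descriptors would come from a feature extractor applied to the preprocessed document image, and each vector would have far more than two dimensions (128 for standard SIFT). On well-separated data like this toy set, both measures select the same cluster; they can disagree when clusters overlap.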

Copyright

© 2020 Ravi et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published by Indian Society for Education and Environment (iSee).