Year: 2016, Volume: 9, Issue: 40, Pages: 1-7

Preparation of a Dataset and Issues Related with Recognition of Optical Character in Assamese Script


According to the website ‘ethnologue.com’, which surveys and statistically analyses languages, there are currently 7102 living languages on earth. The number of living languages is steadily declining, which is a matter of alarm. An article published by UNESCO in 2009 says that most of the endangered languages belong to India. In this digital era, a language can be kept alive if it is widely used on computers, for example through software applications with interfaces in the regional language. In this context, researchers from this region are working to develop an Optical Character Recognition (OCR) system that can digitize optical images of text written in major North-East Indian languages. As the characteristics of scripts vary from one another, so do the challenges. Keeping in mind the needs of researchers, we have developed a novel offline dataset of Assamese historical, machine-printed and handwritten documents, which can be used to experiment with various techniques for the Assamese character recognition task. The dataset comprises a variety of modern and old Assamese texts collected from a variety of sources, which can be broadly divided into machine-printed and handwritten documents. Both good-quality and degraded documents are available in the dataset. Many researchers are working on the development of an OCR system for Assamese script; however, many challenges remain to be addressed. Various issues related to degraded text, historical documents, handwritten Assamese text and machine-printed text are discussed here with reference to the data samples available in the dataset. These include problems in segmenting touching characters and the difficulty of distinguishing compound characters from touching characters; skewed documents, whose variation makes line segmentation difficult; and heavily printed documents, which make feature extraction a complicated task.
The dataset also contains pages in which the text on the reverse side is visible, which makes the document noisy. Besides these inherent issues of character recognition, issues related to the recognition of old Assamese script are also discussed in detail. This dataset will be of ample use, and the issues we have discussed should attract the attention of researchers working in this field. More research and innovation in the digitization of Assamese documents, books and historical documents will help sustain the language and the script as well.
Keywords: Assamese Character Recognition, Dataset of Major North-East Indian Script, Document Analysis and Retrieval, Historical Document


