Indian Journal of Science and Technology
Year: 2016, Volume: 9, Issue: 45, Pages: 1-4
Atul Kumar* and Gurpreet Singh Lehal
*Author for correspondence
Atul Kumar, Department of Computer Science, Punjabi University, Patiala – 147002, Punjab, India; [email protected]
Objectives: This paper proposes a new technique, based on a confusion matrix, for correcting errors made by a Devanagari OCR (Optical Character Reader) system. Methods/Statistical Analysis: The confusion matrix is generated from a large Hindi corpus. The system takes each word of the OCR output and generates a number of candidate strings from the top five confused characters for each character of the input word, along with the probability of each string for ranking. Each string is validated against a character trigram dictionary, and the valid strings are used to select the best suggestions. Findings: The top five words are taken as suggestions. The system has been tested on a variety of Devanagari OCR output documents. The system provides suggestions for all correct words at the top position. On more than 10,000 unique words of Devanagari OCR output, the system achieves an accuracy of 97%. Application/Improvements: The system is used in post-processing of Devanagari OCR. With some improvements, it can also be used for the Gurumukhi and Urdu scripts.
Keywords: Automatic Text Correction, Confusion Matrix, Devanagari, OCR, Trigram
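The pipeline described in the abstract (expand each character into its top confused alternatives, validate each candidate against character trigrams, rank by probability) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `^`/`$` boundary padding, the product-of-probabilities score, and the toy confusion table and lexicon below are all assumptions made for illustration, since the abstract does not specify these details.

```python
from itertools import product
from math import prod

def char_trigrams(word):
    """Character trigrams of a word, padded with ^ and $ (padding is an assumption)."""
    padded = f"^{word}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def suggest(word, confusions, valid_trigrams, top_k=5, top_n=5):
    """Return up to top_n candidate corrections for one OCR output word."""
    per_char = []
    for ch in word:
        # Characters missing from the table are assumed to be recognized correctly.
        alts = confusions.get(ch, {ch: 1.0})
        ranked = sorted(alts.items(), key=lambda kv: kv[1], reverse=True)
        per_char.append(ranked[:top_k])  # keep the top_k confused characters

    scored = {}
    for combo in product(*per_char):  # every combination of alternatives
        candidate = "".join(c for c, _ in combo)
        # Keep a candidate only if all of its trigrams appear in the dictionary.
        if all(t in valid_trigrams for t in char_trigrams(candidate)):
            scored[candidate] = prod(p for _, p in combo)

    return sorted(scored, key=scored.get, reverse=True)[:top_n]

# Hypothetical toy data: each entry maps an OCR output character to the
# characters it may stand for, with made-up probabilities.
CONFUSIONS = {
    "ध": {"ध": 0.6, "घ": 0.3, "क": 0.1},
    "र": {"र": 0.9, "स": 0.1},
}
# Trigram dictionary built from a two-word toy lexicon.
VALID_TRIGRAMS = {t for w in ["घर", "कर"] for t in char_trigrams(w)}

print(suggest("धर", CONFUSIONS, VALID_TRIGRAMS))  # ['घर', 'कर']
```

Enumerating every combination grows as 5^n with word length n, so a practical system would likely prune candidates early, for example by rejecting a prefix as soon as one of its trigrams is invalid.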