A Fast and Efficient Framework for Creating Parallel Corpus

B. Premjith, S. Sachin Kumar, R. Shyam, M. Anand Kumar and K. P. Soman

doi:10.17485/ijst/2016/v9i45/106520


Indian Journal of Science and Technology


Year: 2016, Volume: 9, Issue: 45, Pages: 1-7

Original Article


B. Premjith^*, S. Sachin Kumar, R. Shyam, M. Anand Kumar and K. P. Soman

Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita University, Amrita Vishwa Vidyapeetham, Amritanagar, Coimbatore – 641 112, Tamilnadu, India

*Author for correspondence
B. Premjith, Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita University, Amrita Vishwa Vidyapeetham, Amritanagar, Coimbatore – 641 112, Tamilnadu, India

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objectives: To present a framework that combines the Scansnap SV600 scanner and Google Optical Character Recognition (OCR) for creating a parallel corpus, which is an essential component of Statistical Machine Translation (SMT). Methods and Analysis: Training an SMT system depends heavily on the availability of a parallel corpus, so an effective approach for collecting parallel sentences is a key step in building an MT system. However, creating a parallel corpus requires extensive knowledge of both languages and is a time-consuming process. These limitations make it difficult to digitize documents, which in turn affects the quality of machine translation systems. In this paper, we propose a faster and more efficient way of generating English to Indian languages parallel corpora with less human involvement. Using the Scansnap SV600 book scanner, Google OCR and a little linguistic knowledge, a parallel corpus can be created for any language pair, provided that paper documents containing parallel sentences are available. Findings: It was possible to generate 40 parallel sentences in 1 hour with this approach. Sophisticated morphological tools were used to change the morphology of the generated text and thereby increase the size of the corpus. An additional benefit is that ancient scriptures and other manuscripts can be made available in digital form, which coming generations can refer to in order to keep up the traditions of a nation or a society. Novelty: The time required for creating a parallel corpus is reduced by incorporating Google OCR and a book scanner.

Keywords: Google OCR, Machine Translation, Parallel Corpus, Statistical Machine Translation, Scansnap SV600 Scanner
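
The final step summarized in the abstract, turning recognized text into sentence pairs, can be illustrated with a minimal sketch. It assumes that the OCR output for each language has already been saved as plain text, with one sentence per line and both sides in the same order. The function names and the tab-separated output format are illustrative assumptions and are not taken from the paper; scanning and OCR themselves are not shown.

```python
from typing import List, Tuple


def pair_sentences(source_text: str, target_text: str) -> List[Tuple[str, str]]:
    """Pair line i of the source text with line i of the target text.

    Assumption (not from the paper): each non-empty line holds exactly
    one sentence, and both texts list their sentences in the same order.
    """
    source = [line.strip() for line in source_text.splitlines() if line.strip()]
    target = [line.strip() for line in target_text.splitlines() if line.strip()]
    if len(source) != len(target):
        # A count mismatch means the pages need manual checking first.
        raise ValueError(
            f"sentence counts differ: {len(source)} source vs {len(target)} target"
        )
    return list(zip(source, target))


def to_tsv(pairs: List[Tuple[str, str]]) -> str:
    """Serialize sentence pairs as tab-separated lines, one pair per line."""
    return "\n".join(f"{src}\t{tgt}" for src, tgt in pairs)
```

Raising an error on a count mismatch, rather than pairing as many lines as possible, keeps a single OCR line-break error from silently shifting every later pair.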