A Fast and Efficient Framework for Creating Parallel Corpus

Affiliations

  • Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita University, Amrita Vishwa Vidyapeetham, Amritanagar, Coimbatore – 641 112, Tamilnadu, India

Abstract


Objectives: To present a framework that uses the Scansnap SV600 scanner and Google optical character recognition (OCR) to create a parallel corpus, an essential component of Statistical Machine Translation (SMT). Methods and Analysis: Training an SMT system depends heavily on the availability of a parallel corpus, so an effective approach to collecting parallel sentences is a crucial step in building an MT system. However, creating a parallel corpus requires extensive knowledge of both languages and is time-consuming. Because of these limitations, digitizing documents is difficult, which in turn affects the quality of machine translation systems. In this paper, we propose a fast and efficient way of generating parallel corpora from English to Indian languages with less human involvement. With a special type of scanner, the Scansnap SV600, together with Google OCR and a little linguistic knowledge, a parallel corpus can be created for any language pair, provided that paper documents containing parallel sentences are available. Findings: With this approach it was possible to generate 40 parallel sentences in 1 hour. Sophisticated morphological tools were used to vary the morphology of the generated text and thereby increase the size of the corpus. An additional benefit is that ancient scriptures and other manuscripts can be converted to digital format, which future generations can then consult to preserve the traditions of a nation or a society. Novelty: The time required to create a parallel corpus is reduced by combining Google OCR with a book scanner.
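The last stage of the workflow described above, turning OCR output for the two languages into sentence pairs, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the OCR text for each language has already been exported as plain text with one sentence per line and in the same order; the function names `clean_lines`, `pair_sentences` and `to_tsv` are hypothetical, and the scanning and OCR steps themselves are not shown.

```python
from typing import List, Tuple


def clean_lines(ocr_text: str) -> List[str]:
    """Split OCR output into non-empty lines with normalized whitespace."""
    lines = []
    for raw in ocr_text.splitlines():
        line = " ".join(raw.split())  # collapse runs of spaces and tabs
        if line:
            lines.append(line)
    return lines


def pair_sentences(source_text: str, target_text: str) -> List[Tuple[str, str]]:
    """Pair sentences by position (assumption: one sentence per line, same order)."""
    src = clean_lines(source_text)
    tgt = clean_lines(target_text)
    if len(src) != len(tgt):
        # A mismatch usually means an OCR or segmentation error that
        # needs a manual check before the pairs can be trusted.
        raise ValueError(
            f"line count mismatch: {len(src)} source vs {len(tgt)} target"
        )
    return list(zip(src, tgt))


def to_tsv(pairs: List[Tuple[str, str]]) -> str:
    """Serialize sentence pairs as tab-separated lines, one pair per line."""
    return "\n".join(f"{s}\t{t}" for s, t in pairs)


if __name__ == "__main__":
    english = "Hello.\n\n  Thank   you. \n"
    tamil = "வணக்கம்.\nநன்றி."
    print(to_tsv(pair_sentences(english, tamil)))
```

Raising an error on a line-count mismatch, rather than silently truncating, keeps the small amount of human checking the abstract mentions focused on the pages where OCR or sentence splitting went wrong.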

Keywords

Google OCR, Machine Translation, Parallel Corpus, Statistical Machine Translation, Scansnap SV600 Scanner.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.