Indian Journal of Science and Technology
Year: 2015, Volume: 8, Issue: 27, Pages: 1-5
Annarao Kulkarni*, B. R. Srivatsa and Chetan Baji
Centre for Development of Advanced Computing (C-DAC), Bengaluru - 560100, Karnataka, India; [email protected]
Indian languages belong to four language families: Indo-Aryan, Dravidian, Tibeto-Burman and Austro-Asiatic. Hindi and Kannada belong to the Indo-Aryan and Dravidian families respectively; their scripts evolved from the ancient Brahmi script and share a common phonetic structure. However, the conventions for writing Named Entities differ because of dialectal influence, language-specific rules and other factors. As a result, Named Entity transliteration from Hindi to Kannada and vice versa is not a one-to-one character mapping. This creates many problems in Machine Translation (MT), Cross-Lingual Information Retrieval (CLIR) and parallel corpus creation between Hindi and Kannada. This paper discusses the Named Entity transliteration issues encountered while creating Hindi-to-Kannada parallel corpora for the Indian Language Corpus Initiative (ILCI) project. We discuss in detail cases where no exactly equivalent character exists between Hindi and Kannada, multiple mappings, diacritic marks, loan words and language-specific transliteration issues, and propose possible solutions to these problems. At the implementation level, one may use either Finite-State Transducers (FST) or Regular Expressions.
Keywords: Hindi, Kannada, Named Entity, Regular Expressions, Transliteration
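The regular-expression route mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes a two-stage design that is not taken from the paper: a few regex rewrite rules handle Hindi marks that are not usually carried over one to one, and the remaining characters are shifted from the Devanagari Unicode block (U+0900 onwards) to the Kannada block (U+0C80 onwards), which share a parallel layout. The function name hindi_to_kannada and the two rules shown (dropping the nukta, writing chandrabindu as anusvara) are hypothetical examples, not the rules proposed by the authors.

```python
import re
import unicodedata

# Devanagari starts at U+0900 and Kannada at U+0C80; the two blocks are laid
# out in parallel, so most letters and vowel signs differ by a fixed offset.
KANNADA_OFFSET = 0x0C80 - 0x0900

# Only letters and dependent signs (U+0901..U+094D) are shifted in this sketch.
# Some code points in that range have no Kannada counterpart; they are NOT
# handled here and would need additional rules.
FIRST_MAPPED, LAST_MAPPED = 0x0901, 0x094D

# Hypothetical rewrite rules applied to the Hindi text before the shift.
PRE_RULES = [
    (re.compile("\u093C"), ""),        # assumption: drop the nukta diacritic
    (re.compile("\u0901"), "\u0902"),  # assumption: chandrabindu -> anusvara
]


def hindi_to_kannada(text: str) -> str:
    """Transliterate Devanagari text to Kannada script (illustrative only)."""
    # NFD splits precomposed nukta letters (e.g. U+095B) into base + nukta,
    # so the nukta rule sees both spellings.
    text = unicodedata.normalize("NFD", text)
    for pattern, replacement in PRE_RULES:
        text = pattern.sub(replacement, text)
    return "".join(
        chr(ord(ch) + KANNADA_OFFSET)
        if FIRST_MAPPED <= ord(ch) <= LAST_MAPPED
        else ch
        for ch in text
    )


if __name__ == "__main__":
    print(hindi_to_kannada("\u0915\u0930\u094D\u0928\u093E\u091F\u0915"))
```

Keeping the exceptions in an ordered rule list means that new cases (multiple mappings, loan words, language-specific conventions) can be added as further pattern and replacement pairs without touching the block-shift step. A Finite-State Transducer would encode the same rewrite rules as state transitions instead.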