Indian Journal of Science and Technology
Year: 2015, Volume: 8, Issue: 27, Pages: 1-9
Sharmin Muzaffar1, Pitambar Behera2*, Girish Nath Jha2, Lars Hellan3 and Dorothee Beermann3
1 Department of Linguistics, Aligarh Muslim University, Aligarh - 202002, Uttar Pradesh, India; [email protected]
2 Center for Linguistics, Jawaharlal Nehru University, New Delhi - 110067, India; [email protected], [email protected]
3 Norwegian University of Science and Technology, Trondheim, Norway; [email protected], [email protected]
The authors present Urdu, one of the important Indo-Aryan languages, on the TypeCraft (TC) platform, an online, multilingual, corpus-based natural language database and documentation platform. The platform had previously incorporated other Indian languages such as Telugu, Bengali, Hindi, and Odia; it has recently been extended to the annotation and incorporation of Urdu. The TC framework is designed to facilitate linguistic annotation up to the level of semantics, in order to support the comparison of structures across languages of different families. The recent version, TC 2.2, takes annotation up to the level of discourse and pragmatics through a closer integration of text-level and sentence-level annotation. In theory the system is applicable to all languages, but in practice it is also very specific with regard to encoding salient syntactic and semantic features. The paper highlights some of the issues encountered: agreement, case, verbs and mood, feature labeling, glossing, and technical challenges. The study focuses on Urdu linguistic annotation, drawing on the data annotated on the platform.
Keywords: Linguistic Annotation, Natural Language Database, Semantic Argument Structure, South-Asian Languages (SAL), Syntactic Argument Structure, TypeCraft (TC)