The TypeCraft Natural Language Database: Annotating and Incorporating Urdu

Sharmin Muzaffar; Pitambar Behera; Girish Nath Jha; Lars Hellan; Dorothee Beermann

doi:10.17485/ijst/2015/v8i27/81728

Indian Journal of Science and Technology

Year: 2015, Volume: 8, Issue: 27, Pages: 1-9

Original Article

Sharmin Muzaffar¹, Pitambar Behera²*, Girish Nath Jha², Lars Hellan³ and Dorothee Beermann³

¹Department of Linguistics, Aligarh Muslim University, Aligarh - 202002, Uttar Pradesh, India; [email protected]
²Center for Linguistics, Jawaharlal Nehru University, New Delhi - 110067, India; [email protected], [email protected]
³Norwegian University of Science and Technology, Trondheim, Norway; [email protected], [email protected]

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

The authors present Urdu, one of the important Indo-Aryan languages, on the TypeCraft (TC) platform, an online, multilingual, corpus-based natural language database and documentation platform for natural languages. The platform had previously incorporated other Indian languages such as Telugu, Bengali, Hindi, and Odia, and has recently been extended to the annotation and incorporation of Urdu. The TC framework is designed to support linguistic annotation up to the level of semantics, which facilitates the comparison of structures across languages of different families. The recent version, TC 2.2, extends annotation to discourse and pragmatics through a closer integration of text-level and sentence-level annotation. In theory the system is applicable to all languages; in practice it is also highly specific in how it encodes salient syntactic and semantic features. The paper highlights several issues: agreement, case, verbs and mood, the labeling of features, glossing, and technical challenges. The study focuses on the linguistic annotation of Urdu, drawing on the data annotated on the platform.
Keywords: Linguistic Annotation, Natural Language Database, Semantic Argument Structure, South-Asian Languages (SAL), Syntactic Argument Structure, TypeCraft (TC)