
Building Arabic Corpus Applied to Part-of-Speech Tagging


  • University of Dammam, Department of Computer Science, Dammam − 31451, Saudi Arabia
  • Ajloun National University, Information Technology, Ajloun − 26810, Jordan
  • Universiti Teknologi Malaysia, UTM Johor Bahru, 81310 Johor, Malaysia


Objective: This paper aimed to review corpus linguistics sources related to part-of-speech tagging and to build a sufficient annotated corpus for the Arabic language that contains Arabic words and their grammatical tags. Methods/Statistical Analysis: An in-depth survey conducted by the authors showed that there is a need for a free tagged Arabic corpus that can be used in natural language processing research. A corpus of 25,000 words was collected manually from different web sources written in Modern Standard Arabic. The collected words were tagged using Arabic grammar books. Findings: The developed corpus can help researchers working on natural language processing applications. Applications/Improvements: The corpus needs to be expanded to include more words and their grammatical tags.


Arabic Language, Corpus, Linguistics, Part of Speech Tagging.






Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.