A Comparison of Topic Modelling Approaches for Urdu Text

Siraj Munir, Shaukat Wasi and Syed Imran Jami

doi:10.17485/ijst/2019/v12i45/145722


Indian Journal of Science and Technology

Year: 2019, Volume: 12, Issue: 45, Pages: 1-7

Original Article

Siraj Munir, Shaukat Wasi and Syed Imran Jami*

Department of Computer Science, Mohammad Ali Jinnah University, Karachi, Pakistan

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objectives: Machine learning based approaches for topic modelling have been successful in extracting logical and semantic topics from a given collection of text. We experimented with topic modelling approaches on Urdu poetry text to test whether these approaches perform equally well on any genre of text. Methods: Latent Dirichlet Allocation (LDA), Hierarchical Dirichlet Process (HDP), and Latent Semantic Indexing (LSI) were applied to three datasets: (i) the CORPUS dataset for news, (ii) the poetry collection of Dr. Allama Iqbal, and (iii) a poetry collection of miscellaneous poets. Each poetry corpus includes more than five hundred poems, approximately equivalent to 1200 documents. Findings: Before passing the raw text to the aforementioned models, we performed feature engineering comprising (i) tokenization and removal of special characters (if any), (ii) removal of stop words, (iii) lemmatization, and (iv) stemming. To compare the approaches on our test samples, we used the coherence and dominance model. Applications: Our experiments show that LDA and LSI performed well on the CORPUS dataset, but none of the approaches performed well on poetry text. We conclude that sequence-based models are needed that allow users to define weights for poetry-specific text. This work opens a new direction for the domain of text generation and processing.
Keywords: LDA, LSI, HDP, Urdu Poetry Processing, Urdu Poetry Collection, Topic Modelling.
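
The preprocessing and coherence-based comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list is a tiny hypothetical sample, lemmatization and stemming are omitted because they need Urdu-specific resources, and the UMass-style coherence score is an assumed stand-in, since the abstract does not say which coherence variant was used. All function names are illustrative.

```python
import math
import unicodedata
from itertools import combinations

# Hypothetical, deliberately tiny stop-word list; a real experiment would
# need a curated Urdu stop-word resource.
STOP_WORDS = {"کی", "کے", "کا", "میں", "ہے", "اور", "سے", "کو"}


def tokenize(text):
    """Split on whitespace after blanking punctuation, symbols, and digits."""
    cleaned = "".join(
        " " if unicodedata.category(ch)[0] in "PSN" else ch for ch in text
    )
    return cleaned.split()


def preprocess(text, stop_words=STOP_WORDS):
    """Tokenize, then drop stop words (lemmatization and stemming omitted)."""
    return [tok for tok in tokenize(text) if tok not in stop_words]


def umass_coherence(top_words, token_docs):
    """UMass-style coherence of a ranked word list over tokenized documents.

    Assumed variant: sum over ranked pairs of
    log((D(earlier, later) + 1) / D(earlier)), where D counts documents.
    """
    doc_sets = [set(doc) for doc in token_docs]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    # combinations() yields (earlier, later) pairs in ranked order.
    for earlier, later in combinations(top_words, 2):
        base = doc_freq(earlier)
        if base == 0:
            continue  # word never occurs; skip to avoid dividing by zero
        score += math.log((doc_freq(earlier, later) + 1) / base)
    return score


if __name__ == "__main__":
    docs = [
        "پاکستان میں کرکٹ کا میچ ہے۔",
        "کرکٹ اور ہاکی کے میچ!",
        "دل اور عشق کی شاعری 123",
    ]
    token_docs = [preprocess(d) for d in docs]
    print(token_docs)
    print(umass_coherence(["کرکٹ", "میچ"], token_docs))
```

Scores closer to zero indicate that a topic's top words co-occur in the same documents more often. Short poetic lines offer few such co-occurrences, which is one plausible reason a document-level measure of this kind would rate poetry topics poorly.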