Indian Journal of Science and Technology
DOI: 10.17485/ijst/2019/v12i45/145722
Year: 2019, Volume: 12, Issue: 45, Pages: 1-7
Original Article
Siraj Munir, Shaukat Wasi and Syed Imran Jami*
Department of Computer Science, Mohammad Ali Jinnah University, Karachi, Pakistan
Objectives: Machine learning based approaches to topic modelling are successful in extracting logical and semantic topics from a given collection of text. We experimented with topic modelling approaches on Urdu poetry text to examine whether these approaches perform equally well across genres of text.
Methods: Latent Dirichlet Allocation (LDA), Hierarchical Dirichlet Process (HDP), and Latent Semantic Indexing (LSI) were applied to three datasets: (i) the CORPUS dataset for news, (ii) the poetry collection of Dr. Allama Iqbal, and (iii) a poetry collection of miscellaneous poets. Each poetry corpus includes more than five hundred poems, approximately equivalent to 1,200 documents.
Findings: Before forwarding the raw text to the aforementioned models, we performed feature engineering comprising (i) tokenization and removal of special characters (if any), (ii) removal of stop words, (iii) lemmatization, and (iv) stemming. To compare the approaches on our test samples, we used the coherence and dominance model.
Applications: Our experiments show that LDA and LSI performed well on the CORPUS dataset, but none of the approaches performed well on poetry text. This leads us to conclude that sequence-based models are needed that allow users to define weights for poetry-specific text. This work opens a new direction for the domain of text generation and processing.
Keywords: LDA, LSI, HDP, Urdu Poetry Processing, Urdu Poetry Collection, Topic Modelling.
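The preprocessing and comparison steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list is a tiny hypothetical sample, the function names `preprocess` and `umass_coherence` are invented for illustration, and the UMass-style coherence shown here is an assumption, since the abstract does not say which coherence measure was used. Lemmatization and stemming are omitted because they need language-specific resources.

```python
import math
import re

# Hypothetical, very small stop-word sample; a real Urdu list is much larger.
URDU_STOP_WORDS = frozenset({"کا", "کی", "کے", "میں", "ہے", "اور", "سے", "کو"})


def preprocess(text, stop_words=URDU_STOP_WORDS):
    """Tokenize on whitespace after dropping special characters and stop words."""
    # Replace every character that is neither a word character nor whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", text)
    return [token for token in cleaned.split() if token not in stop_words]


def umass_coherence(top_words, documents):
    """UMass-style coherence (assumed measure): sum of log co-document ratios.

    top_words: topic words, ordered from most to least probable.
    documents: list of token lists, e.g. the output of preprocess().
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            freq_l = doc_freq(top_words[l])
            if freq_l == 0:
                continue  # skip words that never occur in the corpus
            co_freq = doc_freq(top_words[m], top_words[l])
            score += math.log((co_freq + 1) / freq_l)
    return score
```

A higher (less negative) score indicates that a topic's top words tend to appear in the same documents, which is one way the topics produced by LDA, HDP, and LSI could be compared on the same corpus.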