P-ISSN 0974-6846, E-ISSN 0974-5645

Indian Journal of Science and Technology

Year: 2020, Volume: 13, Issue: 44, Pages: 4474-4482

Original Article

A comparative analysis of Latent Semantic Analysis and Latent Dirichlet Allocation topic modeling methods using Bible data

Received Date: 20 August 2020, Accepted Date: 03 December 2020, Published Date: 13 December 2020

Abstract

Objective: To compare topic modeling techniques, since the no free lunch theorem states that, under a uniform distribution over search problems, all machine learning algorithms perform equally. Hence, we compare Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) to identify the better performer on an English Bible dataset, which has not been studied yet. Methods: This comparative study is divided into three levels. In the first level, the Bible data were extracted from the sources and preprocessed to remove words and characters that were not useful for obtaining the semantic structures or patterns needed to build a meaningful corpus. In the second level, the preprocessed data were converted into a bag of words, and the numerical statistic TF-IDF (Term Frequency-Inverse Document Frequency) was used to assess how relevant a word is to a document in the corpus. In the third level, the Latent Semantic Analysis and Latent Dirichlet Allocation methods were applied over the resultant corpus to study the feasibility of the techniques. Findings: Based on our evaluation, we observed that LDA achieves 60 to 75% superior performance compared to LSA on document similarity within the corpus and document similarity with an unseen document. Additionally, LDA showed a better coherence score (0.58018) than LSA (0.50395). Moreover, for any word within the corpus, word association showed better results with LDA. Some words are homonyms whose meaning depends on context; for example, in the Bible, "bear" can mean both to endure punishment and to give birth. In our study, the LDA word association results are closer to human word associations than those of LSA. Novelty: LDA was found to be a computationally efficient and interpretable method for the English Bible dataset (New International Version), which had not been studied before.
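The Methods section outlines a three-level pipeline (preprocessing, bag of words with TF-IDF weighting, then LSA and LDA) evaluated by coherence score and document similarity. The sketch below illustrates one plausible realization of that pipeline using the gensim library; the sample verses, num_topics value, and the choice to fit LDA on raw counts are illustrative assumptions, not the authors' exact configuration.

    # A minimal sketch of the three-level pipeline, assuming the gensim library;
    # the toy verses and parameter values below are illustrative assumptions.
    from gensim import corpora, models, similarities
    from gensim.models.coherencemodel import CoherenceModel
    from gensim.parsing.preprocessing import preprocess_string

    # Level 1: preprocessing (lowercasing, stop-word/punctuation removal, stemming).
    raw_verses = [
        "In the beginning God created the heavens and the earth.",
        "The earth was formless and empty, and darkness was over the deep.",
    ]
    texts = [preprocess_string(v) for v in raw_verses]

    # Level 2: bag of words, then TF-IDF weighting of the corpus.
    dictionary = corpora.Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(t) for t in texts]
    tfidf_corpus = models.TfidfModel(bow_corpus)[bow_corpus]

    # Level 3: fit LSA on the TF-IDF corpus and LDA on the raw counts
    # (LDA is conventionally trained on counts rather than real-valued weights).
    lsa = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=2)
    lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2, passes=10)

    # Coherence comparison, the kind of score reported in the Findings.
    for name, model in (("LSA", lsa), ("LDA", lda)):
        cm = CoherenceModel(model=model, texts=texts, dictionary=dictionary,
                            coherence="c_v", topn=5)
        print(name, cm.get_coherence())

    # Document similarity with an unseen document: cosine similarity in LDA topic space.
    index = similarities.MatrixSimilarity(lda[bow_corpus], num_features=lda.num_topics)
    unseen = dictionary.doc2bow(preprocess_string("God separated the light from the darkness."))
    print(index[lda[unseen]])  # similarity of the unseen text to each corpus document

With the full NIV corpus in place of the toy verses, the same coherence and similarity calls support the style of comparison reported above.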

Keywords: Topic modeling; LSA; LDA; word association; document similarity; Bible dataset

Copyright

© 2020 Garbhapu & Bodapati. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published by the Indian Society for Education and Environment (iSee).
