P-ISSN 0974-6846, E-ISSN 0974-5645

Indian Journal of Science and Technology

Year: 2020, Volume: 13, Issue: 44, Pages: 4474-4482

Original Article

A comparative analysis of Latent Semantic Analysis and Latent Dirichlet Allocation topic modeling methods using Bible data

Received Date: 20 August 2020, Accepted Date: 03 December 2020, Published Date: 13 December 2020

Abstract

Objective: To compare topic modeling techniques, since the no free lunch theorem states that, under a uniform distribution over search problems, all machine learning algorithms perform equally. Hence, we compare Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) to identify the better performer on an English Bible dataset, which has not been studied yet. Methods: This comparative study is divided into three levels. In the first level, the Bible data were extracted from the sources and preprocessed to remove words and characters that were not useful for obtaining the semantic structures or patterns needed to build a meaningful corpus. In the second level, the preprocessed data were converted into a bag of words, and the numerical statistic TF-IDF (Term Frequency-Inverse Document Frequency) was used to assess how relevant a word is to a document in the corpus. In the third level, the Latent Semantic Analysis and Latent Dirichlet Allocation methods were applied over the resultant corpus to study the feasibility of the techniques. Findings: Based on our evaluation, we observed that LDA achieves 60 to 75% superior performance compared to LSA on document similarity within the corpus and document similarity with an unseen document. Additionally, LDA showed a better coherence score (0.58018) than LSA (0.50395). Moreover, for any word within the corpus, word association showed better results with LDA. Some words are homonyms whose meaning depends on context; for example, in the Bible, "bear" can mean both to endure punishment and to give birth. In our study, the LDA word association results are closer to human word associations than those of LSA. Novelty: LDA was found to be a computationally efficient and interpretable method for the English Bible dataset (New International Version), which had not been studied before.
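The Methods section outlines a three-level pipeline (preprocessing, bag of words with TF-IDF weighting, then LSA and LDA) evaluated by coherence score and document similarity. The sketch below illustrates one plausible realization of that pipeline using the gensim library; the sample verses, num_topics value, and the choice to fit LDA on raw counts are illustrative assumptions, not the authors' exact configuration.

    # A minimal sketch of the three-level pipeline, assuming the gensim library;
    # the toy verses and parameter values below are illustrative assumptions.
    from gensim import corpora, models, similarities
    from gensim.models.coherencemodel import CoherenceModel
    from gensim.parsing.preprocessing import preprocess_string

    # Level 1: preprocessing (lowercasing, stop-word/punctuation removal, stemming).
    raw_verses = [
        "In the beginning God created the heavens and the earth.",
        "The earth was formless and empty, and darkness was over the deep.",
    ]
    texts = [preprocess_string(v) for v in raw_verses]

    # Level 2: bag of words, then TF-IDF weighting of the corpus.
    dictionary = corpora.Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(t) for t in texts]
    tfidf_corpus = models.TfidfModel(bow_corpus)[bow_corpus]

    # Level 3: fit LSA on the TF-IDF corpus and LDA on the raw counts
    # (LDA is conventionally trained on counts rather than real-valued weights).
    lsa = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=2)
    lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2, passes=10)

    # Coherence comparison, the kind of score reported in the Findings.
    for name, model in (("LSA", lsa), ("LDA", lda)):
        cm = CoherenceModel(model=model, texts=texts, dictionary=dictionary,
                            coherence="c_v", topn=5)
        print(name, cm.get_coherence())

    # Document similarity with an unseen document: cosine similarity in LDA topic space.
    index = similarities.MatrixSimilarity(lda[bow_corpus], num_features=lda.num_topics)
    unseen = dictionary.doc2bow(preprocess_string("God separated the light from the darkness."))
    print(index[lda[unseen]])  # similarity of the unseen text to each corpus document

With the full NIV corpus in place of the toy verses, the same coherence and similarity calls support the style of comparison reported above.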

Keywords: Topic modeling; LSA; LDA; word association; document similarity; Bible dataset

Copyright

© 2020 Garbhapu & Bodapati. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published by the Indian Society for Education and Environment (iSee).
