Probabilistic multiple correlation based term weighting scheme for measuring similarity of unstructured text records

J Ujwala Rekha; K Shahu Chatrapati

doi:10.17485/IJST/v13i11.2020-31

Indian Journal of Science and Technology

Year: 2020, Volume: 13, Issue: 11, Pages: 1276-1282

Original Article

J Ujwala Rekha¹*, K Shahu Chatrapati¹,²

¹Associate Professor of CSE, JNTUH College of Engineering Hyderabad, Telangana, India
²Professor of CSE, JNTUH College of Engineering Manthani, Telangana, India

*Author for correspondence
J Ujwala Rekha
Associate Professor of CSE, JNTUH College of Engineering Hyderabad, Telangana, India
Email: [email protected]

Received Date: 29 February 2020, Accepted Date: 30 March 2020, Published Date: 03 May 2020

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Background/Objectives: This study defines a term weighting scheme, derived from probabilistic multiple correlation, for measuring the similarity between unstructured text records. Methods: Intra-correlation is the correlation between terms in the same record, whereas inter-correlation is the correlation between terms in different records. Probabilistic multiple correlation-based term weighting calculates the weight (relevance) of a term by considering its intra-correlation with one or more terms simultaneously. The term weights are then used to measure the inter-correlation of terms and, from that, the similarity between two text records. Findings: Experiments were run on unstructured text records that are incomplete and contain abbreviations. The probabilistic multiple correlation-based term weighting scheme yields a significant improvement in precision, recall and F-score over the probabilistic simple correlation weighting scheme. Applications: The probabilistic multiple correlation-based term weighting scheme can improve overall accuracy when matching unstructured text records that contain abbreviations and incomplete data.
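The abstract does not give the weighting formula, so the following is only an illustrative sketch of the general idea of weighting a term by its intra-correlation with one or more co-occurring terms. It is not the authors' method. Everything in it is an assumption made for illustration: whitespace tokenization, estimating probabilities as record frequencies, using the conditional co-occurrence probability P(term | companions) as the correlation, averaging those values into a weight, and the function names `cooccurrence_probs` and `term_weight`.

```python
from collections import Counter
from itertools import combinations


def tokenize(record):
    """Assumed tokenizer: lowercase, split on whitespace, keep unique terms."""
    return set(record.lower().split())


def cooccurrence_probs(records, max_size=3):
    """Estimate P(term set) as the fraction of records containing every term.

    Keys are sorted tuples of up to `max_size` terms. Enumerating subsets
    grows quickly, so this is only practical for short records.
    """
    counts = Counter()
    for record in records:
        terms = sorted(tokenize(record))
        for size in range(1, max_size + 1):
            for subset in combinations(terms, size):
                counts[subset] += 1
    total = len(records)
    return {subset: count / total for subset, count in counts.items()}


def term_weight(term, record_terms, probs, max_companions=2):
    """Hypothetical weight: mean of P(term | companions) over companion sets.

    max_companions=1 looks at one other term at a time (a "simple"
    correlation); larger values condition on several terms at once
    (a "multiple" correlation). `probs` must cover subsets of size
    max_companions + 1.
    """
    others = sorted(record_terms - {term})
    scores = []
    for size in range(1, max_companions + 1):
        for companions in combinations(others, size):
            base = probs.get(companions, 0.0)
            joint = probs.get(tuple(sorted(companions + (term,))), 0.0)
            if base > 0:
                scores.append(joint / base)
    return sum(scores) / len(scores) if scores else 0.0


# Toy records with abbreviations, loosely in the style of citation strings.
records = [
    "proc acm sigmod",
    "proc acm sigmod workshop",
    "acm trans inf syst",
]
probs = cooccurrence_probs(records, max_size=3)
record_terms = tokenize(records[0])
simple = term_weight("sigmod", record_terms, probs, max_companions=1)
multiple = term_weight("sigmod", record_terms, probs, max_companions=2)
print(simple, multiple)
```

In this toy setup the only difference between the two calls is whether pairs of companion terms are also conditioned on, which mirrors the abstract's distinction between simple and multiple correlation without claiming to reproduce the published scheme.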

Keywords: Unstructured Text; Approximate String Matching; Citation Matching; Probabilistic Correlation; Term Weight; Similarity Measure

Copyright

Copyright: © 2020 Rekha, Chatrapati. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Published By Indian Society for Education and Environment (iSee)