P-ISSN 0974-6846, E-ISSN 0974-5645

Indian Journal of Science and Technology

Year: 2020, Volume: 13, Issue: 11, Pages: 1276-1282

Original Article

Probabilistic multiple correlation based term weighting scheme for measuring similarity of unstructured text records

Received Date: 29 February 2020, Accepted Date: 30 March 2020, Published Date: 03 May 2020

Abstract

Background/Objectives: This study defines a term weighting scheme, derived from probabilistic multiple correlation, for measuring the similarity between unstructured text records.

Methods: Intra-correlation is the correlation between terms in the same record, whereas inter-correlation is the correlation between terms in different records. Probabilistic multiple correlation-based term weighting calculates the weight (relevance) of a term by considering its intra-correlation with one or more other terms simultaneously. The term weights are then used to measure the inter-correlation of terms and, from that, the similarity between two text records.

Findings: The experiments were run on unstructured text records that are incomplete and contain abbreviations. The probabilistic multiple correlation-based term weighting scheme significantly improved precision, recall, and F-score compared with the probabilistic simple correlation weighting scheme.

Applications: The probabilistic multiple correlation-based term weighting scheme can improve overall accuracy when matching unstructured text records that contain abbreviations and incomplete data.

Keywords: Unstructured Text; Approximate String Matching; Citation Matching; Probabilistic Correlation; Term Weight; Similarity Measure
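
The idea described under Methods can be illustrated with a small sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration only: co-occurrence frequency across records stands in for probabilistic correlation, the hypothetical `simple_weight` looks at one companion term at a time, the hypothetical `multiple_weight` looks at groups of `k` companion terms simultaneously, and `similarity` is a plain weighted overlap that skips the inter-correlation of non-identical terms. The sample records and all function names are invented here and are not taken from the article.

```python
from itertools import combinations

def tokenize(record):
    """Split a raw text record into a set of lowercase terms."""
    return set(record.lower().split())

def cooccurrence_prob(terms, corpus):
    """Fraction of records in the corpus that contain every term in `terms`."""
    terms = set(terms)
    return sum(1 for rec in corpus if terms <= rec) / len(corpus)

def simple_weight(term, record, corpus):
    """Assumed 'simple' weight: mean co-occurrence probability of `term`
    with each other term of the same record, one companion at a time."""
    others = sorted(record - {term})
    if not others:
        return cooccurrence_prob({term}, corpus)
    return sum(cooccurrence_prob({term, u}, corpus) for u in others) / len(others)

def multiple_weight(term, record, corpus, k=2):
    """Assumed 'multiple' weight: mean co-occurrence probability of `term`
    with every group of k other terms of the same record, taken together."""
    others = sorted(record - {term})
    k = min(k, len(others))
    if k == 0:
        return cooccurrence_prob({term}, corpus)
    groups = list(combinations(others, k))
    return sum(cooccurrence_prob({term, *g}, corpus) for g in groups) / len(groups)

def similarity(r1, r2, corpus, k=2):
    """Weighted overlap of two records using the assumed multiple weights.
    Inter-correlation between different terms is deliberately left out."""
    w1 = {t: multiple_weight(t, r1, corpus, k) for t in r1}
    w2 = {t: multiple_weight(t, r2, corpus, k) for t in r2}
    terms = sorted(r1 | r2)
    num = sum(min(w1.get(t, 0.0), w2.get(t, 0.0)) for t in terms)
    den = sum(max(w1.get(t, 0.0), w2.get(t, 0.0)) for t in terms)
    return num / den if den > 0 else 0.0

# Invented sample records containing an abbreviation and incomplete data.
records = [
    "very large data bases vldb",
    "vldb conference 2000",
    "very large data bases conference",
    "acm sigmod conference",
]
corpus = [tokenize(r) for r in records]

if __name__ == "__main__":
    print(simple_weight("vldb", corpus[1], corpus))
    print(multiple_weight("vldb", corpus[1], corpus, k=2))
    print(similarity(corpus[0], corpus[2], corpus))
```

In this sketch, raising `k` makes a term's weight depend on larger groups of companion terms, which is one way to read "one or more terms simultaneously"; the published scheme may define both the weights and the similarity differently.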

Copyright

Copyright: © 2020 Rekha, Chatrapati. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Published By Indian Society for Education and Environment (iSee)
