Indian Journal of Science and Technology
Year: 2020, Volume: 13, Issue: 11, Pages: 1276-1282
J Ujwala Rekha1*, K Shahu Chatrapati1,2
1Associate Professor of CSE, JNTUH College of Engineering Hyderabad, Telangana, India
2Professor of CSE, JNTUH College of Engineering Manthani, Telangana, India
*Author for correspondence
J Ujwala Rekha
Associate Professor of CSE, JNTUH College of Engineering Hyderabad, Telangana, India
Email: [email protected]
Received Date: 29 February 2020, Accepted Date: 30 March 2020, Published Date: 03 May 2020
Background/Objectives: This study defines a term weighting scheme, derived from probabilistic multiple correlation, for measuring the similarity between unstructured text records. Methods: Intra-correlation is the correlation between terms within the same record, whereas inter-correlation is the correlation between terms in different records. Probabilistic multiple correlation-based term weighting computes the weight (relevance) of a term from its intra-correlation with one or more other terms simultaneously. The term weights are then used to measure the inter-correlation of terms and, from that, the similarity between two text records. Findings: The experiments use unstructured text records that are incomplete and contain abbreviations. The probabilistic multiple correlation-based term weighting scheme yields a significant improvement in precision, recall and F-score over the probabilistic simple correlation-based weighting scheme. Applications: The probabilistic multiple correlation-based term weighting scheme can improve overall accuracy when matching unstructured text records that contain abbreviations and incomplete data.
Keywords: Unstructured Text; Approximate String Matching; Citation Matching; Probabilistic Correlation; Term Weight; Similarity Measure
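The findings are reported as precision, recall and F-score for record matching. As a minimal sketch of how these standard metrics can be computed over matched record pairs, the Python snippet below may help. It is not the authors' evaluation code: the function name `precision_recall_f1` and the record identifiers are hypothetical, chosen only for illustration.

```python
def precision_recall_f1(predicted, actual):
    """Compute precision, recall and F1 over sets of matched record pairs.

    predicted: pairs the matcher declared as matches (hypothetical input)
    actual:    pairs that are true matches in the ground truth
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative data only: (query record id, matched record id) pairs.
predicted = {("r1", "c1"), ("r2", "c2"), ("r3", "c9")}
actual = {("r1", "c1"), ("r2", "c2"), ("r4", "c4")}

p, r, f = precision_recall_f1(predicted, actual)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Here two of the three predicted pairs are correct and two of the three true pairs are found, so all three metrics come out equal in this toy case. With real data, comparing these values between two weighting schemes is one way to express the kind of improvement the abstract describes.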