Indian Journal of Science and Technology
DOI: 10.17485/ijst/2015/v8i34/75103
Year: 2015, Volume: 8, Issue: 34, Pages: 1-7
Original Article
R. Parimala Devi1* and V. Thigarasu2
1 Department of Computer Science, Karpagam University, Coimbatore - 641021, Tamil Nadu, India
2 Department of Computer Science, Gobi Arts and Science College, Gobichettipalayam - 638453, Tamil Nadu, India
Objective: The main objective of this paper is to improve the true positive level of record deduplication using an Ontology based MHMM-Fuzzy clustering approach. Methods/Statistical Analysis: Most record deduplication systems in the literature use genetic programming, which combines different pieces of evidence extracted from the data content. However, the accuracy of such systems is low. To overcome this problem, a Multiple Hidden Markov Model (MHMM) is proposed to increase accuracy and to identify joint duplicate records. When the database has multiple columns, however, this model performs deduplication on all columns, which can degrade the performance of the system. To solve this problem, MHMM-Fuzzy clustering based record deduplication is introduced. In this system, fuzzy clustering is performed on multiple observations from the Hidden Markov Model. Duplicate records are then grouped into one cluster according to fuzzy logic, so they can be eliminated easily. However, the true positive level of this system is low. To improve it, Fuzzy Ontology based semantic similarity is incorporated into the MHMM-Fuzzy clustering approach. This in turn increases the efficiency of the deduplication function that identifies replica and duplicate records. Findings: Multiple Hidden Markov Model (MHMM) based record deduplication, MHMM-Fuzzy clustering based record deduplication and the Ontology based MHMM-Fuzzy clustering approach are applied to the Cora Bibliographic dataset and the Restaurants dataset. Performance is evaluated in terms of precision, recall, f-measure, execution time and accuracy. Applications/Improvements: The proposed approach achieves better record deduplication results than previous works in terms of precision, recall, f-measure, execution time and accuracy.
Keywords: Hidden State Sequence, Membership Function, Observation Sequence, States, Semantic Deduplication
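As an illustration of the evaluation measures named in the abstract, the following is a minimal sketch of how precision, recall and f-measure are commonly computed for a deduplication result. It assumes a pairwise evaluation, in which every pair of records placed in the same cluster counts as a predicted duplicate; the abstract does not state which evaluation protocol the authors used. The function names and the toy record ids are hypothetical and do not come from the paper.

```python
from itertools import combinations


def pairs_from_clusters(clusters):
    """Return the set of unordered record-id pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives each pair one canonical (smaller, larger) form.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_scores(predicted_clusters, true_clusters):
    """Pairwise precision, recall and f-measure (an assumed protocol)."""
    predicted = pairs_from_clusters(predicted_clusters)
    actual = pairs_from_clusters(true_clusters)
    true_positives = len(predicted & actual)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure


# Toy example: record 3 is wrongly merged with the duplicates 1 and 2.
predicted = [[1, 2, 3], [4, 5]]
truth = [[1, 2], [3], [4, 5]]
precision, recall, f_measure = pairwise_scores(predicted, truth)
print(precision, recall, round(f_measure, 4))
```

In this toy case, two of the four predicted pairs are true duplicates and both true duplicate pairs are found, so a wrongly merged record lowers precision while leaving recall unchanged.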