P-ISSN 0974-6846, E-ISSN 0974-5645

Indian Journal of Science and Technology



Year: 2015, Volume: 8, Issue: 34, Pages: 1-7

Original Article

A Semantic Deduplication of Temporal Dynamic Records from Multiple Web Databases


Objective: The main objective of this paper is to improve the true positive level of record deduplication using an ontology-based MHMM-Fuzzy clustering approach. Methods/Statistical Analysis: Most record deduplication systems in the literature use genetic programming, which combines different pieces of evidence extracted from the data content. However, the accuracy of such systems is low. To overcome this problem, a Multiple Hidden Markov Model (MHMM) is proposed to increase accuracy and to identify joint duplicate records. When the database has multiple columns, however, this model performs deduplication on all columns, which can degrade the performance of the system. To solve this problem, MHMM-Fuzzy clustering based record deduplication is introduced. In this system, fuzzy clustering is performed over multiple observations from the Hidden Markov Model. Duplicate records are then grouped into one cluster according to fuzzy logic, so they can be eliminated easily. However, the true positive level of this system is still low. To improve it, fuzzy ontology based semantic similarity is incorporated into the MHMM-Fuzzy clustering approach. This increases the efficiency of the deduplication function that identifies replica and duplicate records. Findings: MHMM based record deduplication, MHMM-Fuzzy clustering based record deduplication and the ontology-based MHMM-Fuzzy clustering approach are applied to the Cora bibliographic dataset and the Restaurants dataset. Performance is evaluated in terms of precision, recall, F-measure, execution time and accuracy. Applications/Improvements: The proposed approach achieves better record deduplication results than previous works in terms of precision, recall, F-measure, execution time and accuracy.
Keywords: Hidden State Sequence, Membership Function, Observation Sequence, States, Semantic Deduplication 
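
The abstract reports precision, recall and F-measure but does not state how they are computed. One common convention for deduplication, assumed here and not confirmed by the paper, is to score unordered pairs of records that fall in the same cluster. The sketch below illustrates that convention in Python; the function names and the toy clusters are hypothetical and are not taken from the paper or its datasets.

```python
from itertools import combinations


def duplicate_pairs(clusters):
    """Return the set of unordered record-id pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives each pair a canonical (smaller, larger) order.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_scores(predicted_clusters, true_clusters):
    """Pairwise precision, recall and F-measure (an assumed convention)."""
    predicted = duplicate_pairs(predicted_clusters)
    actual = duplicate_pairs(true_clusters)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical toy data: record ids grouped into duplicate clusters.
true_clusters = [{1, 2, 3}, {4, 5}, {6}]
predicted_clusters = [{1, 2}, {3}, {4, 5, 6}]

precision, recall, f_measure = pairwise_scores(predicted_clusters, true_clusters)
print(precision, recall, f_measure)
```

In this toy case two of the four predicted pairs are true duplicates and two of the four true pairs are recovered, so all three scores are 0.5. A higher true positive level, which the paper targets, raises both precision and recall under this convention.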

