Indian Journal of Science and Technology
Year: 2015, Volume: 8, Issue: 33, Pages: 1-7
G. Somasekhar1* and K. Karthikeyan2
1 SCSE, VIT University, Vellore – 632006, Tamil Nadu, India; [email protected]
2 SAS, VIT University, Vellore – 632006, Tamil Nadu, India; [email protected]
Data matching provides valuable information relevant to complex decisions about programs or policies. For example, information about peer influences on teen behavior, obtained through data matching, can help decide which kinds of programs would discourage early pregnancy, teenage drinking, and delinquency. When the data to be matched has big data characteristics and cannot be processed on a single machine, the task is termed big data matching, and traditional data matching methods fail. Dedoop [1] is the latest tool developed for big data matching. It takes a pair of clusters as input. State-of-the-art big data matching techniques share a common disadvantage: they perform expensive redundant similarity computations. We focus on the process of selecting the pair of clusters given as input to Dedoop, so that unnecessary similarity computations are avoided. The entire data set is subjected to this prior selection process before entering Dedoop. Canopy clustering is combined with a unique linkage pair formation technique to solve the redundancy problem. A test sample of article-author information is used to match authors related to a common subject. Although the basic pre data matching redundancy avoidance approach (BPRA) [2] solves the problem to some extent, it has limitations: in addition to considerable preprocessing overhead, it does not address scalability and incremental issues. The proposed PRAMR approach reduces the preprocessing overhead of BPRA by a factor of 'm'. Because PRAMR uses the big data technique MapReduce, the scalability and incremental issues are solved. Hadoop is used to run the MapReduce jobs. The results are compared with BPRA and Kolb's approach [3]. Sections IV and V show that PRAMR is more efficient than the state-of-the-art techniques. PRAMR improves on BPRA and Kolb's approach, giving a better solution to the overlapping clusters problem.
Keywords: Big Data Matching, Canopy Clustering, Data Point, Overlapping Clusters, Redundancy, Similarity Roller
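The abstract combines canopy clustering with unique linkage pair formation so that records falling into several overlapping clusters are not compared more than once. The following is a minimal single-machine sketch of that general idea, not the authors' PRAMR implementation and not a MapReduce job. The Jaccard distance, the thresholds `t1` and `t2`, the record identifiers, and all function names are assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Cheap distance between two token sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def canopy_clusters(points, t1, t2):
    """Group records into possibly overlapping canopies.

    points: dict mapping a record id to its set of tokens (hypothetical layout).
    t1: loose threshold; records within t1 of a center join its canopy.
    t2: tight threshold; records within t2 of a center stop being candidate centers.
    """
    if not 0 <= t2 <= t1:
        raise ValueError("thresholds must satisfy 0 <= t2 <= t1")
    remaining = list(points)  # candidate centers, in insertion order
    canopies = []
    while remaining:
        center = remaining[0]
        # Every record within the loose threshold joins this canopy.
        members = [p for p in points
                   if jaccard_distance(points[center], points[p]) <= t1]
        canopies.append(members)
        # Records within the tight threshold (including the center) are
        # dropped from the candidate list, so the loop always terminates.
        remaining = [p for p in remaining
                     if jaccard_distance(points[center], points[p]) > t2]
    return canopies


def unique_pairs(canopies):
    """Return each candidate pair once, even if it occurs in several canopies."""
    seen = set()
    pairs = []
    for members in canopies:
        for pair in combinations(sorted(members), 2):
            if pair not in seen:
                seen.add(pair)
                pairs.append(pair)
    return pairs


# Hypothetical article-author records, reduced to subject tokens.
records = {
    "a1": {"big", "data", "matching"},
    "a2": {"big", "data", "clustering"},
    "a3": {"data", "clustering", "canopy"},
    "a4": {"protein", "folding"},
}

canopies = canopy_clusters(records, t1=0.7, t2=0.4)
naive_count = sum(len(list(combinations(c, 2))) for c in canopies)
pairs = unique_pairs(canopies)
print(canopies)
print(naive_count, len(pairs))
```

In this toy input the canopies overlap, so comparing every pair inside every canopy would evaluate some pairs more than once; deduplicating the pairs before the expensive similarity step removes that repetition. How PRAMR distributes this selection across MapReduce tasks is described in the paper itself and is not reproduced here.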