The Pre Big Data Matching Redundancy Avoidance Algorithm with Mapreduce

G. Somasekhar and K. Karthikeyan

doi:10.17485/ijst/2015/v8i33/77477

Indian Journal of Science and Technology

Year: 2015, Volume: 8, Issue: 33, Pages: 1-7

Original Article

G. Somasekhar¹* and K. Karthikeyan²

¹SCSE, VIT University, Vellore – 632006, Tamil Nadu, India; [email protected]
²SAS, VIT University, Vellore – 632006, Tamil Nadu, India; [email protected]

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Data matching provides valuable information for complex decisions about programs or policies. For example, information about peer influences on teen behavior, obtained through data matching, can help people decide what kinds of programs would discourage early pregnancy, teenage drinking, and delinquency. When the data used in the matching process has big data characteristics and cannot be processed on a single machine, the task is termed big data matching, and traditional data matching methods fail. Dedoop¹ is the latest tool developed for big data matching. It takes a pair of clusters as input. State-of-the-art big data matching techniques share a common disadvantage: they perform expensive redundant similarity computations. We focus on the process of selecting the pair of clusters given as input to Dedoop, which avoids these unnecessary similarity computations. The entire data set is subjected to this prior selection process before it enters Dedoop. Canopy clustering is combined with a unique linkage pair formation technique to solve the redundancy problem. A test sample of article-author information is used to match authors related to a common subject. Although the basic pre data matching redundancy avoidance approach (BPRA²) solves the problem to some extent, it has limitations: in addition to considerable preprocessing overhead, it does not address scalability and incremental issues. The proposed PRAMR approach reduces the preprocessing overhead of BPRA by a factor of 'm'. Because PRAMR uses the big data technique MapReduce, the scalability and incremental issues are solved. Hadoop is used to run the MapReduce jobs. The results are compared with BPRA and Kolb's approach³. Sections IV and V show that PRAMR is more efficient than the state-of-the-art techniques. PRAMR improves on BPRA and Kolb's approach, giving a better solution to the overlapping clusters problem.
Keywords: Big Data Matching, Canopy Clustering, Data Point, Overlapping Clusters, Redundancy, Similarity Roller
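
The redundancy problem described in the abstract can be illustrated with a minimal sketch. This is not the authors' PRAMR or BPRA implementation; it only shows, under stated assumptions, how overlapping canopies cause the same record pair to be compared more than once and how forming each pair only once avoids that. The function names `canopy_clusters` and `unique_candidate_pairs`, the thresholds `t1` and `t2`, and the one-dimensional toy data are all hypothetical choices made for illustration.

```python
from itertools import combinations


def canopy_clusters(points, distance, t1, t2):
    """Group points into possibly overlapping canopies.

    t1 is the loose threshold (canopy membership) and t2 the tight
    threshold (removal from the list of candidate centers); t1 >= t2.
    Returns a list of canopies, each a list of point indices.
    """
    if t1 < t2:
        raise ValueError("t1 (loose) must be >= t2 (tight)")
    remaining = list(range(len(points)))  # indices that may still seed a canopy
    canopies = []
    while remaining:
        center = remaining[0]
        # Every point within the loose threshold joins this canopy,
        # so a point can belong to several canopies.
        members = [i for i in range(len(points))
                   if distance(points[center], points[i]) <= t1]
        canopies.append(members)
        # Points within the tight threshold can no longer seed a canopy.
        remaining = [i for i in remaining
                     if distance(points[center], points[i]) > t2]
    return canopies


def unique_candidate_pairs(canopies):
    """Return each candidate pair once, even if it occurs in several canopies."""
    seen = set()
    for members in canopies:
        for pair in combinations(sorted(members), 2):
            seen.add(pair)  # a set keeps a single copy of every (i, j) pair
    return sorted(seen)


# Toy data (assumed): five 1-D points compared by absolute difference.
points = [0.0, 1.0, 2.0, 3.0, 10.0]
canopies = canopy_clusters(points, lambda a, b: abs(a - b), t1=2.0, t2=1.0)

# Comparing all pairs inside every canopy repeats work in the overlap.
naive_comparisons = sum(len(c) * (len(c) - 1) // 2 for c in canopies)
unique_pairs = unique_candidate_pairs(canopies)

print(canopies)            # [[0, 1, 2], [0, 1, 2, 3], [4]]
print(naive_comparisons)   # 9
print(len(unique_pairs))   # 6
```

In this toy run the first two canopies share three points, so three of the nine within-canopy comparisons are repeats. Only the six unique pairs would need to be passed on to a similarity function.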