P-ISSN 0974-6846 E-ISSN 0974-5645

Indian Journal of Science and Technology



Year: 2018, Volume: 11, Issue: 18, Pages: 1-9

Original Article

Hash Semi Join MapReduce to Join Billion Records in a Reasonable Time


Objective: MapReduce is a programming model for processing massive data sets, and analyzing big data is among the most important issues today. Methods/Statistical Analysis: MapReduce is used to discover hidden patterns and relations in data and so extract more useful information. The programmer writes two simple functions, map and reduce, and the framework provides load balancing, fault tolerance and high scalability. Join is one of the most important operations in data analysis, but MapReduce does not directly support it. Findings: This paper explains the two-way MapReduce join algorithms semi-join and per-split semi-join, and proposes a new algorithm, hash semi-join, which uses a hash table to increase performance by eliminating unused records as early as possible. In the second phase, the join is applied through the hash table rather than by using a map function to match the join key against the other data table. The hash table does not inflate memory use, because only the matched records from the second table are stored. Our experimental results show that hash semi-join with a hash table outperforms the other two algorithms as the data size increases from 100 million records to 50 billion. Application/Improvements: Running time increases with the number of joined records between the two tables. On a cluster of 30 machines, our algorithm achieves a better running time than the other algorithms.

Keywords: Hadoop, Hash Semi Join, MapReduce, Two-Way Join
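
The idea summarized in the abstract can be illustrated with a small single-process sketch. It is not the authors' Hadoop implementation: the phase boundaries, the function name `hash_semi_join`, the tuple-based record layout and the sample tables are all assumptions made for illustration. The sketch only shows the principle that the hash table holds matched records from the second table and nothing else.

```python
from collections import defaultdict


def hash_semi_join(first, second, key_first=0, key_second=0):
    """Toy in-memory hash semi-join (hypothetical helper, not the paper's code).

    Records are tuples; key_first / key_second are the join-key positions.
    """
    # Phase 1 (assumed): collect the distinct join keys of the first table.
    first_keys = {row[key_first] for row in first}

    # Phase 2 (assumed): scan the second table once and keep ONLY the rows
    # whose key also occurs in the first table. Unmatched rows are dropped
    # here, so they never occupy memory in the hash table.
    matched = defaultdict(list)
    for row in second:
        if row[key_second] in first_keys:
            matched[row[key_second]].append(row)

    # Phase 3 (assumed): probe the hash table with each row of the first
    # table instead of re-matching keys against the full second table.
    joined = []
    for row in first:
        for other in matched.get(row[key_first], ()):
            joined.append(row + other)
    return joined


# Hypothetical sample data: (user_id, name) and (user_id, item).
users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "book"), (3, "pen"), (3, "ink"), (9, "cup")]

result = hash_semi_join(users, orders)
for record in result:
    print(record)
```

In this sketch the order with key 9 is discarded during the second phase and user 2 produces no output, which mirrors the abstract's point that unused records are eliminated as early as possible. In a real MapReduce job each phase would run as distributed tasks over input splits, a detail this example deliberately leaves out.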

