Indian Journal of Science and Technology
Year: 2017, Volume: 10, Issue: 19, Pages: 1-7
Sudhanshu Shekhar Bisoyi*, Pragnyaban Mishra and S. N. Mishra
*Author for correspondence:
Sudhanshu Shekhar Bisoyi
Department of Computer Science and Engineering, GIET Gunupur, Gunupur – 765022, Odisha, India; [email protected]
Objectives: File structure and storage become challenging issues when processing huge amounts of data in a parallel and distributed environment. To increase processing capability, an appropriate file structure must be implemented. Methods/Statistical Analysis: In our approach, we imported data from available relational databases such as Oracle or MySQL into Hive using Sqoop and analyzed query processing under different file storage formats. We focused on the Parquet, Sequence, RC and ORC file formats for query analysis in the MapReduce framework on top of Hadoop. Findings: For understanding the dynamic behavior of user buying habits across web services and for product recommendation using social media, e-marketing, etc., a MapReduce-based data warehousing system plays a vital role in performing Big Data analytics in a parallel and distributed environment. In this type of analysis, the data structure used to store the data for parallel query processing affects the performance of the Big Data warehouse system. When analyzing huge amounts of relational data in a parallel and distributed system, a few issues must be addressed to improve query performance and optimization: 1. faster loading of huge amounts of relational data into the Big Data warehouse; 2. an optimized file format to manage the storage system efficiently; 3. faster query processing through increased throughput. Our findings identify appropriate file formats for storing huge amounts of relational data in Hive, the Big Data warehouse system built on HDFS and the MapReduce framework, and evaluate query processing performance on a multi-node Hadoop cluster. Application/Improvements: Choosing an appropriate file structure in Big Data warehouse systems reduced the cost of parallel query processing and increased distributed storage efficiency.
Keywords: Big Data, HDFS, Hive, Hadoop, MapReduce, ORC File, Sqoop
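The workflow outlined in the abstract (a Sqoop import into Hive, followed by storing the same data in each file format for comparison) might be sketched roughly as below. This is only an illustrative sketch, not the authors' actual setup: the JDBC URL, user name, table names and mapper count are hypothetical placeholders, and the script only builds the command and HiveQL strings rather than running them against a cluster.

```python
import shlex

# Hive STORED AS keywords for the four formats compared in the abstract.
FORMATS = {
    "parquet": "PARQUET",
    "sequence": "SEQUENCEFILE",
    "rc": "RCFILE",
    "orc": "ORC",
}


def sqoop_import_command(jdbc_url, user, table, hive_table, mappers=4):
    """Build a Sqoop command that imports one relational table into Hive.

    All argument values are supplied by the caller; the defaults here are
    assumptions for illustration, not settings taken from the paper.
    """
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", user,
        "-P",                      # prompt for the password at run time
        "--table", table,
        "--hive-import",           # load the imported data into Hive
        "--hive-table", hive_table,
        "-m", str(mappers),        # number of parallel map tasks
    ]
    return " ".join(shlex.quote(a) for a in args)


def ctas_statements(source_table):
    """Return one CREATE TABLE ... AS SELECT statement per file format."""
    return {
        name: (
            f"CREATE TABLE {source_table}_{name} STORED AS {keyword} "
            f"AS SELECT * FROM {source_table};"
        )
        for name, keyword in FORMATS.items()
    }


if __name__ == "__main__":
    # Hypothetical connection details and table names.
    print(sqoop_import_command(
        "jdbc:mysql://dbhost/sales", "etl_user", "orders", "orders_staging"))
    for statement in ctas_statements("orders_staging").values():
        print(statement)
```

Running the same query against each of the resulting tables would then allow a comparison of the formats in terms of load time, storage footprint and query time, in the spirit of the three issues the abstract lists.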