Relational Query Optimization Technique using Space Efficient File Formats of Hadoop for the Big Data Warehouse System

Affiliations

  • Department of Computer Science and Engineering, GIET Gunupur, Gunupur – 765022, Odisha, India

Abstract


Objectives: File structure and storage become challenging issues when processing huge amounts of data in a parallel and distributed environment. To increase processing capability, an appropriate file structure must be implemented.

Methods/Statistical Analysis: In our approach, we imported data from relational databases such as Oracle or MySQL into Hive using Sqoop and analyzed query processing across different file storage formats. We focused on the Parquet, Sequence, RC and ORC file formats for query analysis in the MapReduce framework on top of Hadoop.

Findings: For understanding the dynamic behavior of user buying habits across web services, and for product recommendation using social media, e-marketing, etc., a MapReduce-based data warehousing system plays a vital role in performing Big Data analytics in a parallel and distributed environment. In this type of analysis, the data structure used to store the data for parallel query processing affects the performance of the Big Data warehouse system. When analyzing huge amounts of relational data in a parallel and distributed system, a few issues must be addressed to improve query performance and optimization:
1. Faster loading of huge amounts of relational data into the Big Data warehouse.
2. An optimized file format to manage the storage system efficiently.
3. Faster query processing through increased throughput.
Our findings identify appropriate file formats for storing huge amounts of relational data in Hive, the Big Data warehouse system built on HDFS and the MapReduce framework, and evaluate query processing performance on a multi-node Hadoop cluster.

Application/Improvements: Choosing an appropriate file structure in Big Data warehouse systems reduced the cost of parallel query processing and increased distributed storage efficiency.
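
As a rough illustration of the comparison setup described in the abstract, the sketch below generates one Hive `CREATE TABLE ... AS SELECT` statement per storage format, so that the same relational data can be queried under each format. It is a minimal sketch under stated assumptions, not the authors' actual procedure: the table name `sales`, the helper `ctas_statements` and the `_<format>` naming scheme are hypothetical, and the source table is assumed to have already been imported into Hive with Sqoop.

```python
# Hive STORED AS keywords for the four formats named in the abstract.
FORMATS = {
    "sequence": "SEQUENCEFILE",
    "rc": "RCFILE",
    "orc": "ORC",
    "parquet": "PARQUET",
}


def ctas_statements(source_table):
    """Return one CREATE TABLE ... AS SELECT statement per storage format.

    Hypothetical helper: each statement copies `source_table` into a new
    table named `<source_table>_<format>` stored in that file format.
    """
    statements = {}
    for suffix, keyword in FORMATS.items():
        target = f"{source_table}_{suffix}"
        statements[suffix] = (
            f"CREATE TABLE {target} STORED AS {keyword} "
            f"AS SELECT * FROM {source_table}"
        )
    return statements


if __name__ == "__main__":
    # "sales" is a hypothetical table assumed to exist in Hive already.
    for sql in ctas_statements("sales").values():
        print(sql + ";")
```

Running the same query against each generated table, and comparing on-disk size and execution time, would give the kind of per-format comparison the abstract refers to.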

Keywords

Big Data, HDFS, Hive, Hadoop, MapReduce, ORC File, Sqoop.


Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.