Exploring Non-Homogeneity and Dynamicity of High Scale Cloud through Hive and Pig

Kashish Ara Shakil, Mansaf Alam and Shuchi Sethi

doi:10.17485/ijst/2015/v8i35/72419


Indian Journal of Science and Technology

Year: 2015, Volume: 8, Issue: 35, Pages: 1-8

Original Article

Kashish Ara Shakil*, Mansaf Alam and Shuchi Sethi*

Department of Computer Science, Jamia Millia Islamia, New Delhi - 110025, India

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Cloud environments are usually associated with non-homogeneity and dynamicity in resource usage and access at all levels, so studying this heterogeneous and non-uniform behavior is an important problem. The Google cluster trace, a production trace released by Google in November 2014, serves as an example of a high scale Cloud environment. This paper presents a statistical analysis of this cluster trace. Because the production trace is very large, Hive, a Hadoop Distributed File System (HDFS) based platform for querying and analyzing big data, was used. Hive was accessed through its Beeswax interface, and the data was imported into HDFS through HCatalog. In addition to Hive, Pig, a scripting language that provides an abstraction on top of Hadoop, was used. The method involves clustering and studying the distribution of job arrival times, the distribution of resource usage and the distribution of process runtime. To the best of our knowledge, this analytical method is novel. The findings reveal that jobs in a production trace can be classified into major, mediocre and minor resource usage types. Furthermore, the arrival time of jobs followed a Weibull distribution. Usage of resources such as CPU and memory followed a Zipf-like distribution, while process runtime followed a heavy-tailed distribution: some jobs had very small runtimes while others had very large ones. Our analysis will help researchers understand the non-homogeneous and dynamic behavior characteristic of cloud environments. It will also help them develop new algorithms for resource allocation and scheduling in the Cloud.
Keywords: Dynamicity, Hadoop, High Scale Cloud, Hive, Pig, Non-Homogenous
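
The Zipf-like behavior of resource usage mentioned in the abstract can be illustrated with a rough check: rank the usage values from largest to smallest and fit a straight line to log(usage) against log(rank). The sketch below is only an illustration under stated assumptions, not the authors' procedure. The function name `zipf_exponent` and the synthetic usage values are hypothetical, and a least-squares fit on a log-log scale is a simple heuristic rather than a rigorous goodness-of-fit test.

```python
import math


def zipf_exponent(usages):
    """Estimate s in usage ~ C / rank**s via least squares on a log-log scale.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Rank positive usage values from largest (rank 1) to smallest.
    ranked = sorted((u for u in usages if u > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive usage values")

    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(u) for u in ranked]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # The fitted slope is -s for a Zipf-like rank/usage relationship.
    return -(sxy / sxx)


# Synthetic, made-up values (NOT from the Google trace): usage = 1 / rank.
synthetic_cpu_usage = [1.0 / rank for rank in range(1, 51)]
estimated_s = zipf_exponent(synthetic_cpu_usage)
print(f"estimated Zipf exponent: {estimated_s:.3f}")
```

An exponent close to 1 on real per-job CPU or memory usage, together with a roughly straight log-log plot, would be consistent with a Zipf-like distribution; strong curvature would suggest a different heavy-tailed form.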