Analysis of Diabetic Dataset and Developing Prediction Model by using Hive and R

Aanurag Kumar Srivastava; Chandan Kumar and Neha Mangla

doi:10.17485/ijst/2016/v9i47/106496

Indian Journal of Science and Technology

Year: 2016, Volume: 9, Issue: 47, Pages: 1-5

Original Article

Aanurag Kumar Srivastava¹*, Chandan Kumar and Neha Mangla²

¹Atria Institute of Technology, Bangalore – 24, Karnataka, India; [email protected], [email protected]
²Department of ISE, Atria Institute of Technology, Bangalore – 24, Karnataka, India; [email protected]

*Author for correspondence
Aanurag Kumar Srivastava
Atria Institute of Technology, Bangalore – 24, Karnataka, India; [email protected]

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objectives: Diabetes is one of the most serious diseases spreading across the world; it is caused by hereditary factors and by poor diet. Analyzing the disease lets us extract facts from its symptoms, and from these facts we can build a model that predicts diabetes. Such a model makes prediction easier and can benefit many people. Sharing the information extracted from the model with the government can help it design welfare programs for citizens.

Method and Analysis: We use the Pima Indian diabetes dataset, which contains 768 samples. The dataset is first given as input to Hive to convert it into a formatted dataset. We then run a few queries on the formatted dataset to extract useful information. Next, we use R to perform statistical analysis: generating graphs, calculating the Gini index, developing the prediction model, and measuring the model's efficiency.

Findings: We ran a few Hive queries on the diabetic dataset, such as finding the distinct values in the table, which lets us analyze the table's different attributes. Hive also reports the time taken by each query by default, which is one of its advantages. We then used R for statistical analysis. Because a picture conveys more than words, the graphs generated in R let us analyze the dataset more easily and quickly than reading through every row. We calculated the Gini index of the attributes in R to measure the inequality among their values. We also built a prediction model using the KNN algorithm and measured its accuracy. Doing all of this in R keeps the method simple and makes it easy for a user to build the prediction model and calculate its efficiency. Using the prediction model, we can find the number of samples predicted correctly.

Improvements: The work can be improved by performing the same operations on a much larger dataset, for example one with millions of records. The model's efficiency is about 79%, which can be improved further.

Keywords: Big-Data, Gini Index, Hadoop, Hive, K Nearest Neighbor, R
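
The two computations named in the abstract, a Gini index over an attribute's values and a KNN classifier scored by the share of correct predictions, can be illustrated with a minimal sketch. The paper performs these steps in R; the version below is a plain-Python stand-in and not the authors' code. It assumes the Gini index is the inequality (Lorenz-curve) coefficient, since the abstract describes it as measuring inequality among values, and it assumes Euclidean distance with majority voting for KNN. The function names, the choice of `k`, and the six training rows are hypothetical and are not taken from the Pima dataset.

```python
import math
from collections import Counter


def gini_index(values):
    """Inequality Gini coefficient of non-negative values (0.0 means all equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Closed form on sorted data: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def knn_predict(train_X, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training rows."""
    # Euclidean distance on raw features; real data would usually be scaled first.
    ranked = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_X, train_y, test_X, test_y, k=3):
    """Fraction of test samples whose predicted label matches the true label."""
    correct = sum(
        knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y)
    )
    return correct / len(test_y)


# Hypothetical toy rows: (glucose, BMI) with label 1 = diabetic, 0 = not diabetic.
train_X = [(85, 22.0), (90, 24.5), (100, 26.0), (150, 33.0), (165, 35.5), (180, 38.0)]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [(95, 25.0), (170, 36.0)]
test_y = [0, 1]

glucose_gini = gini_index([row[0] for row in train_X])
toy_accuracy = accuracy(train_X, train_y, test_X, test_y, k=3)
```

On real data the accuracy would be computed on a held-out portion of the 768 samples; the toy split here only shows the mechanics and says nothing about the roughly 79% figure reported above.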