Indian Journal of Science and Technology
Year: 2017, Volume: 10, Issue: 19, Pages: 1-7
Geeta Chhabra1*, Vasudha Vashisht2 and Jayanthi Ranjan3
1Amity School of Institute Technology, Amity University, Noida – 201313, Uttar Pradesh, India; [email protected] 2Department of Computer Science and Engineering, Amity School of Engineering, Amity University, Noida – 201313, Uttar Pradesh, India; [email protected] 3Institute of Management Technology, Ghaziabad – 201001, Uttar Pradesh, India; [email protected]
*Author for the correspondence:
Amity School of Institute Technology, Amity University, Noida – 201313, Uttar Pradesh, India; [email protected]
Missing data is relatively common in all type of research, which can reduce the statistical power and have biased results if not handled properly. Multivariate Imputation by Chained Equations (MICE) has emerged as one of the principled method of addressing missing data. This paper provides comparison of MICE using various methods to deal with missing values. The chained equations approach is very flexible and can handle various types of data such as continuous or binary as well as various missing data patterns. Objectives: To discuss commonly used techniques for handling missing data and common issues that could arise when these techniques are used. In particular, we will focus on different approaches of one of the most popular methods, Multiple Imputation using Chained Equations (MICE). Methods/Statistical Analysis: Multivariate Imputation by Chained Equation is a statistical method for addressing missing value imputation. The paper will focus on Multiple Imputation using Predictive Mean Matching, Multiple Random Forest Regression Imputation, Multiple Bayesian Regression Imputation, Multiple Linear Regression using Non-Bayesian Imputation, Multiple Classification and Regression Tree (CART), Multiple Linear Regression with Bootstrap Imputation which provides a general framework for analyzing data with missing values. Findings: We have chosen to explore Multiple Imputation using MICE through an examination of sample data set. Our analysis confirms that the power of Multiple Imputations lies in getting smaller standard errors and narrower confidence intervals. The smaller is the standard error and narrower is the confidence interval; the predicted value is more accurate, thus, minimizing the bias and inefficiency considerably. In our results from sample data set, it has been observed that standard error and mean confidence interval length is the least in case of Multiple Imputation combined with Bayesian Regression. Also, it is obvious from the density plot that the imputed values are more close to the observed values in this method than other methods. Even in case of random forest, the results are quite close to Bayesian Regression. Application/Improvements: These Multiple Imputation methods can further be combined with machine learning and Genetic Algorithms on real set data to further reduce the bias and inefficiency.
Keywords: Missing Completely at Random, Missing at Random (MAR), Multiple Imputation, Not Missing at Random (NMAR)
Subscribe now for latest articles and news.