A Comparison of Multiple Imputation Methods for Data with Missing Values

Geeta Chhabra, Vasudha Vashisht and Jayanthi Ranjan

doi:10.17485/ijst/2017/v10i19/110646

Indian Journal of Science and Technology

Year: 2017, Volume: 10, Issue: 19, Pages: 1-7

Original Article

Geeta Chhabra¹*, Vasudha Vashisht² and Jayanthi Ranjan³

¹Amity School of Institute Technology, Amity University, Noida – 201313, Uttar Pradesh, India; [email protected]
²Department of Computer Science and Engineering, Amity School of Engineering, Amity University, Noida – 201313, Uttar Pradesh, India; [email protected]
³Institute of Management Technology, Ghaziabad – 201001, Uttar Pradesh, India; [email protected]

*Author for correspondence:
Geeta Chhabra
Amity School of Institute Technology, Amity University, Noida – 201313, Uttar Pradesh, India; [email protected]

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Missing data are common in all types of research; if not handled properly, they can reduce statistical power and bias the results. Multivariate Imputation by Chained Equations (MICE) has emerged as one of the principled methods of addressing missing data. This paper compares several imputation methods used within MICE to deal with missing values. The chained equations approach is very flexible: it can handle various types of data, such as continuous or binary, as well as various missing data patterns.

Objectives: To discuss commonly used techniques for handling missing data and the common issues that can arise when these techniques are used. In particular, we focus on different approaches within one of the most popular methods, Multiple Imputation using Chained Equations (MICE).

Methods/Statistical Analysis: Multivariate Imputation by Chained Equations is a statistical method for imputing missing values. The paper focuses on six variants, which together provide a general framework for analyzing data with missing values: Multiple Imputation using Predictive Mean Matching, Multiple Random Forest Regression Imputation, Multiple Bayesian Regression Imputation, Multiple Linear Regression using Non-Bayesian Imputation, Multiple Classification and Regression Tree (CART), and Multiple Linear Regression with Bootstrap Imputation.

Findings: We explore Multiple Imputation using MICE through an examination of a sample data set. Our analysis confirms that the power of Multiple Imputation lies in yielding smaller standard errors and narrower confidence intervals. The smaller the standard error and the narrower the confidence interval, the more accurate the predicted value, which considerably reduces bias and inefficiency. In our results on the sample data set, the standard error and the mean confidence interval length are smallest for Multiple Imputation combined with Bayesian Regression. The density plot also shows that the imputed values are closer to the observed values with this method than with the other methods. The results for random forest are quite close to those for Bayesian Regression.

Application/Improvements: These Multiple Imputation methods can be combined with machine learning and Genetic Algorithms on real data sets to further reduce bias and inefficiency.

Keywords: Missing Completely at Random (MCAR), Missing at Random (MAR), Multiple Imputation, Not Missing at Random (NMAR)
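
To make the chained equations idea in the abstract concrete, the following is a minimal Python sketch and not the procedure used in the paper. It assumes a synthetic two-column data set with values missing completely at random, imputes each column from the other by linear regression plus random noise (loosely in the spirit of the non-Bayesian linear regression variant; a Bayesian variant would also draw the regression parameters), and pools the mean of the second column over several completed data sets with Rubin's rules. All names (fit_line, chained_imputation, pool) and all numeric settings are hypothetical, and the 95% interval uses a normal approximation for simplicity.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, math.sqrt(rss / (len(xs) - 2))


def chained_imputation(rows, n_iter, rng):
    """Return one completed copy of a two-column data set (None = missing)."""
    n_cols = 2
    missing = [[row[j] is None for row in rows] for j in range(n_cols)]
    data = [list(row) for row in rows]
    # Initial fill: replace every missing cell with its column mean.
    for j in range(n_cols):
        col_mean = statistics.fmean([r[j] for r in rows if r[j] is not None])
        for i, row in enumerate(data):
            if missing[j][i]:
                row[j] = col_mean
    # Chained equations: cycle over the columns, re-imputing each one
    # from the other column's current (observed or imputed) values.
    for _ in range(n_iter):
        for j in range(n_cols):
            k = 1 - j  # index of the predictor column
            obs = [i for i in range(len(data)) if not missing[j][i]]
            a, b, sd = fit_line([data[i][k] for i in obs],
                                [data[i][j] for i in obs])
            for i in range(len(data)):
                if missing[j][i]:
                    data[i][j] = a + b * data[i][k] + rng.gauss(0.0, sd)
    return data


def pool(estimates, variances):
    """Rubin's rules: pooled estimate and its standard error."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)
    within = statistics.fmean(variances)      # average sampling variance
    between = statistics.variance(estimates)  # variance across imputations
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)


# Synthetic data (an assumption for illustration): y depends linearly on x,
# and about 20% of the cells in each column are deleted at random.
rng = random.Random(0)
rows = []
for _ in range(200):
    x = rng.gauss(0.0, 1.0)
    y = 2.0 + 1.5 * x + rng.gauss(0.0, 1.0)
    rows.append([x if rng.random() > 0.2 else None,
                 y if rng.random() > 0.2 else None])

# Build m completed data sets and estimate the mean of y in each one.
m = 5
estimates, variances = [], []
for _ in range(m):
    completed = chained_imputation(rows, n_iter=10, rng=rng)
    ys = [r[1] for r in completed]
    estimates.append(statistics.fmean(ys))
    variances.append(statistics.variance(ys) / len(ys))

pooled_mean, pooled_se = pool(estimates, variances)
ci_low = pooled_mean - 1.96 * pooled_se
ci_high = pooled_mean + 1.96 * pooled_se
print(f"pooled mean of y: {pooled_mean:.3f}")
print(f"standard error:   {pooled_se:.3f}")
print(f"95% CI length:    {ci_high - ci_low:.3f}")
```

The standard error and confidence interval length printed at the end are the kind of quantities the abstract uses to compare imputation methods; swapping the regression step for another model would give a different variant to compare under the same pooling step.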