
A Comparison of Multiple Imputation Methods for Data with Missing Values

Affiliations

  • Amity School of Institute Technology, Amity University, Noida – 201313, Uttar Pradesh, India
  • Department of Computer Science and Engineering, Amity School of Engineering, Amity University, Noida – 201313, Uttar Pradesh, India
  • Institute of Management Technology, Ghaziabad – 201001, Uttar Pradesh, India

Abstract


Missing data is relatively common in all types of research and can reduce statistical power and bias results if not handled properly. Multivariate Imputation by Chained Equations (MICE) has emerged as one of the principled methods for addressing missing data. This paper provides a comparison of MICE using various methods to deal with missing values. The chained-equations approach is very flexible: it can handle various data types, such as continuous or binary, as well as various missing-data patterns. Objectives: To discuss commonly used techniques for handling missing data and common issues that arise when these techniques are used. In particular, we focus on different approaches within one of the most popular methods, Multiple Imputation using Chained Equations (MICE). Methods/Statistical Analysis: Multivariate Imputation by Chained Equations is a statistical method for imputing missing values. The paper focuses on Multiple Imputation using Predictive Mean Matching, Multiple Random Forest Regression Imputation, Multiple Bayesian Regression Imputation, Multiple Linear Regression using Non-Bayesian Imputation, Multiple Classification and Regression Tree (CART) Imputation, and Multiple Linear Regression with Bootstrap Imputation, which together provide a general framework for analyzing data with missing values. Findings: We explore Multiple Imputation using MICE through an examination of a sample data set. Our analysis confirms that the power of Multiple Imputation lies in obtaining smaller standard errors and narrower confidence intervals: the smaller the standard error and the narrower the confidence interval, the more accurate the predicted value, which considerably reduces bias and inefficiency. In our results on the sample data set, the standard error and mean confidence-interval length are smallest for Multiple Imputation combined with Bayesian Regression. The density plot also shows that the imputed values are closer to the observed values with this method than with the others. For random forest, the results are quite close to those of Bayesian Regression. Application/Improvements: These Multiple Imputation methods can further be combined with machine learning and Genetic Algorithms on real data sets to further reduce bias and inefficiency.
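The study itself was presumably carried out with R's mice package, whose method names (pmm, norm, norm.nob, cart, rf, norm.boot) match the approaches listed above; that code is not reproduced here. The following is only a minimal Python sketch, built on scikit-learn's IterativeImputer and a toy data set of our own (all names and dimensions are illustrative assumptions), of the two ideas the abstract relies on: chained-equations imputation with an interchangeable regression model (Bayesian ridge regression and random forest here), and pooling the m per-imputation estimates with Rubin's rules so that pooled standard errors and confidence-interval lengths can be compared across methods.

```python
# Minimal sketch, not the paper's code: chained-equations imputation with two
# interchangeable regressors, followed by Rubin's rules for pooling.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy data with roughly 20% of entries missing at random (a stand-in for the
# sample data set analysed in the paper).
X = rng.normal(size=(200, 4))
X[:, 3] = 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)
X[rng.random(X.shape) < 0.2] = np.nan

m = 5  # number of imputed data sets per method


def imputer_for(method, i):
    """Build one chained-equations imputer for imputation draw i."""
    if method == "bayesian_regression":
        # sample_posterior=True draws imputations from the posterior
        # predictive distribution, giving between-imputation variability.
        return IterativeImputer(estimator=BayesianRidge(),
                                sample_posterior=True, max_iter=10,
                                random_state=i)
    # Random-forest imputation: variability across draws comes from the
    # forest's own randomisation.
    return IterativeImputer(estimator=RandomForestRegressor(
        n_estimators=30, random_state=i), max_iter=10, random_state=i)


for method in ("bayesian_regression", "random_forest"):
    means, within_vars = [], []
    for i in range(m):
        completed = imputer_for(method, i).fit_transform(X)
        col = completed[:, 3]                    # analyse one variable's mean
        means.append(col.mean())
        within_vars.append(col.var(ddof=1) / len(col))  # squared SE

    # Rubin's rules: total variance = within + (1 + 1/m) * between.
    q_bar = float(np.mean(means))
    t_var = np.mean(within_vars) + (1 + 1 / m) * np.var(means, ddof=1)
    se = np.sqrt(t_var)
    print(f"{method}: pooled mean {q_bar:.3f}, SE {se:.3f}, "
          f"approx. 95% CI length {2 * 1.96 * se:.3f}")
```

Comparing the printed pooled standard error and confidence-interval length across methods mirrors the comparison reported in the Findings, where smaller values indicate less bias and inefficiency.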

Keywords

Missing Completely at Random (MCAR), Missing at Random (MAR), Multiple Imputation, Not Missing at Random (NMAR).





This work is licensed under a Creative Commons Attribution 3.0 License.