P-ISSN 0974-6846, E-ISSN 0974-5645

Indian Journal of Science and Technology

Year: 2023, Volume: 16, Issue: 43, Pages: 3862-3874

Original Article

MVI-DR: An Efficient Missing Value Imputation Method Using Decision Tree and Regression Analysis

Received Date: 10 August 2023, Accepted Date: 16 October 2023, Published Date: 14 November 2023

Abstract

Objectives: To estimate the missing values of datasets that contain both numeric and categorical attributes. Developing a missing value imputation method that handles mixed-type data is an important problem for machine learning researchers.

Methods: We developed a method called MVI-DR to estimate the missing values of a mixed-type dataset. MVI-DR uses Linear Regression (LiR) to compute missing numeric values and Decision Trees (DT) to compute missing categorical values. The method is validated with five classifiers, viz., Logistic Regression (LoR), Support Vector Machine (SVM), k-Nearest Neighbor (k-NN), DT, and Random Forest (RF), on 9 mixed-type datasets taken from the UCI and Kaggle repositories.

Findings: The experimental results show that MVI-DR effectively estimates missing values for both numeric and categorical data types. In particular, on the Car, Lung Cancer, Thyroid, Melb, and Penguins datasets, the proposed method gives 75.7% accuracy, whereas the traditional method gives 75.6% accuracy, using the LoR model. Similarly, on the Lung Cancer dataset, MVI-DR yields 66.3%, 57.2%, 61.1%, 51%, and 60.7%, whereas the traditional method gives 62.2%, 53.2%, 55.8%, 47.4%, and 58.6%, using the LoR, k-NN, SVM, DT, and RF classifiers, respectively. In addition to accuracy, the proposed method yields better MCC results on most of the datasets. Moreover, the proposed method performed better on high-dimensional mixed-type datasets.

Novelty: A new missing value imputation method called MVI-DR is developed. The method can handle both numeric and categorical data types. MVI-DR is evaluated in terms of Accuracy, F1-score, and MCC.
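The idea described under Methods, regression for numeric gaps and a decision tree for categorical gaps, can be sketched roughly as follows. This is an illustrative sketch only, not the authors' MVI-DR implementation: the abstract does not specify how predictors are prepared, so the provisional mean/mode fill, the one-hot encoding, the function name `impute_mixed`, and the toy data are all assumptions made here. It also assumes every column has at least one observed value.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier


def impute_mixed(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: linear regression fills numeric columns,
    a decision tree fills categorical columns."""
    out = df.copy()
    numeric_cols = set(df.select_dtypes(include="number").columns)

    # Assumption: provisionally fill every column (mean or mode) so that
    # all other columns can be used as predictors without gaps.
    base = df.copy()
    for col in df.columns:
        fill = df[col].mean() if col in numeric_cols else df[col].mode().iloc[0]
        base[col] = df[col].fillna(fill)

    for col in df.columns:
        missing = df[col].isna()
        if not missing.any():
            continue
        # Predictors: all other columns, categorical ones one-hot encoded.
        X = pd.get_dummies(base.drop(columns=col), dtype=float)
        if col in numeric_cols:
            model = LinearRegression()
        else:
            model = DecisionTreeClassifier(random_state=0)
        # Train on rows where the target is observed, predict the rest.
        model.fit(X[~missing], df.loc[~missing, col].to_numpy())
        out.loc[missing, col] = model.predict(X[missing])
    return out


# Toy mixed-type data (made up for illustration): y = 2x + 1,
# and label is "high" for larger x.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [3.0, 5.0, np.nan, 9.0, 11.0, 13.0],
    "label": ["low", "low", "low", "high", None, "high"],
})
filled = impute_mixed(df)
print(filled)
```

On this toy frame the missing `y` should come out close to 7 and the missing `label` as "high", because the observed rows follow those patterns exactly. Real datasets would need a held-out comparison against a baseline, as the abstract does with downstream classifiers.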

Keywords: Numeric, Categorical, Imputation, MVI-DR, Machine Learning, Mixed type

Copyright

© 2023 Khumukcham et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Published by Indian Society for Education and Environment (iSee)
