A New Method of Data Preparation for Classifying Diabetes Dataset

M. S. Padmavathi and C. P. Sumathi

doi:10.17485/ijst/2019/v12i22/144929

Indian Journal of Science and Technology

Year: 2019, Volume: 12, Issue: 22, Pages: 1-9

Original Article

M. S. Padmavathi* and C. P. Sumathi

SDNB Vaishnav College for Women, Chennai – 600044, Tamil Nadu, India

*Author for correspondence
M. S. Padmavathi
SDNB Vaishnav College for Women, Chennai – 600044, Tamil Nadu, India

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objective: Diabetes mellitus affects millions of people, including children and pregnant women. Undiagnosed diabetes can affect the entire body, leading to complications such as heart attacks, chronic kidney disease, foot ulcers and damage to the eyes. An intelligent model should therefore be developed for the early detection of diabetes. Method: Data preprocessing is an important step in building classification models. The Pima Indian Diabetes dataset from the University of California Irvine (UCI) repository is a challenging dataset with a large proportion (48%) of missing values. Several data preprocessing steps are applied to the Pima Diabetes dataset to improve the accuracy of the classification model. The proposed model includes outlier removal and imputation at stage 1, normalization at stage 2 and balancing of the dataset at stage 3. After each stage of preprocessing, the model is evaluated using three classifiers: Support Vector Machine (SVM), Random Forest (RF) and K-nearest neighbor (Knn). Findings: Classification accuracy increases after each stage of preprocessing. After all three stages, the Random Forest classifier achieves the highest accuracy (82.14%) and balanced accuracy (81.94%) on the diabetes dataset, compared with SVM and Knn. Novelty/Improvements: The preprocessing steps, namely replacing outliers using the 5th and 95th percentile values with median imputation, followed by Z-score normalization and balancing the dataset using SMOTE, improve the quality of the Pima Diabetes dataset and thereby increase the classification accuracy of the model. The same data preprocessing methods can also be applied to other datasets or classifier models.

Keywords: Balanced Dataset, Imputation, Normalization, Outlier Removal, Random Forest
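
The first two preprocessing stages named in the abstract, and the balanced accuracy metric it reports, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that missing values are represented as None, that percentiles are computed on the observed values only, and that capping is applied before median imputation (the abstract does not state the order). The function names and the sample column are hypothetical. Stage 3 (SMOTE) and the classifiers are not shown. Z-score normalization here is z = (x - mean) / standard deviation.

```python
import statistics


def cap_and_impute(column, low_pct=5, high_pct=95):
    """Stage 1 (assumed order): cap observed values to the
    [low_pct, high_pct] percentile range, then fill missing
    entries (None) with the median of the capped values."""
    observed = [v for v in column if v is not None]
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = statistics.quantiles(observed, n=100, method="inclusive")
    low, high = cuts[low_pct - 1], cuts[high_pct - 1]
    capped = [None if v is None else min(max(v, low), high) for v in column]
    median = statistics.median(v for v in capped if v is not None)
    return [median if v is None else v for v in capped]


def z_score(column):
    """Stage 2: rescale a column to zero mean and unit standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; for two classes this is the average
    of sensitivity and specificity."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical feature column (illustrative values, not taken from the dataset).
glucose = [85, None, 183, 89, 137, 116, 78, None, 197, 125, 500]
cleaned = cap_and_impute(glucose)
normalized = z_score(cleaned)
```

Balanced accuracy is worth reporting alongside plain accuracy on an imbalanced dataset because a classifier that always predicts the majority class can score well on accuracy while its balanced accuracy stays at 0.5.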