Genetic Algorithm Based Over-Sampling with DNN in Classifying the Imbalanced Data Distribution Problem

S Karthikeyan; T Kathirvalavakumar

doi:10.17485/IJST/v16i8.863

Indian Journal of Science and Technology

Year: 2023, Volume: 16, Issue: 8, Pages: 547-556

Original Article

S Karthikeyan¹*, T Kathirvalavakumar²

¹Research Scholar, Research centre in Computer Science, V.H.N.Senthikumara Nadar College, Virudhunagar, Tamil Nadu, India
²Associate Professor, Research centre in Computer Science, V.H.N.Senthikumara Nadar College, Virudhunagar, Tamil Nadu, India

*Corresponding Author
Email: [email protected]

Received Date: 26 April 2022, Accepted Date: 13 January 2023, Published Date: 27 February 2023

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objective: Data imbalance exists in many real-life applications. In imbalanced datasets, the minority class leads the classifier to wrong inferences, which increases misclassification. Much research has been done to solve this issue, but no universally effective solution for efficient classification has been found so far. After analyzing the existing literature, this work proposes to minimize misclassification through genetic-algorithm-based oversampling and a deep neural network (DNN) classifier.

Method: In the proposed oversampling method, synthetic samples are generated with a genetic algorithm. The initial population for the genetic algorithm is generated using the Gaussian weight initialization technique, and the fittest individuals in the population are selected by Euclidean distance for further processing. Synthetic data of double the minority class size are generated, and the resulting dataset is classified with the DNN.

Findings: The performance of the DNN classifier on the oversampled training data is compared with that of the C4.5 and Support Vector Machine (SVM) classifiers, and the DNN classifier outperforms both. Data generated using SMOTE and ADASYN are also considered for comparison, and the proposed approach outperforms these approaches. The experiments further show that misclassification is reduced and that the proposed method is comparatively better, with statistically significant results.

Novelty: The novelty of this work lies in generating the initial population by Gaussian weight initialization, selecting the fittest samples by a Euclidean distance measure, generating synthetic samples of double the minority class size, and using a DNN for classification to reduce misclassification.

Keywords: Genetic algorithm; Gauss weight initialization; SMOTE; ADASYN; Imbalanced data; Classification
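
The oversampling pipeline summarized in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not specify the genetic operators, the population size, or how the Euclidean distance is turned into a fitness value, so everything below is an assumption made for illustration and not the authors' implementation: the function name `ga_oversample` and its parameters are hypothetical, the initial population is drawn from a per-feature Gaussian fitted to the minority class, fitness is the distance to the nearest real minority sample (smaller is fitter), and uniform crossover with Gaussian mutation is used. Only the output size, double the minority class size, follows the abstract directly.

```python
import numpy as np


def ga_oversample(X_min, n_generations=10, pop_factor=4, mutation_scale=0.05, seed=0):
    """Return 2 * len(X_min) synthetic minority samples (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n_min, n_feat = X_min.shape
    n_out = 2 * n_min               # from the abstract: double the minority class size
    pop_size = pop_factor * n_out   # assumed population size

    mean = X_min.mean(axis=0)
    std = X_min.std(axis=0) + 1e-8  # small offset avoids zero spread on constant features

    # Assumed reading of "Gaussian initialization": each gene is drawn from N(mean, std).
    pop = rng.normal(mean, std, size=(pop_size, n_feat))

    def fitness(P):
        # Assumed fitness: Euclidean distance to the nearest real minority sample.
        d = np.linalg.norm(P[:, None, :] - X_min[None, :, :], axis=2)
        return d.min(axis=1)

    for _ in range(n_generations):
        order = np.argsort(fitness(pop))
        parents = pop[order[: pop_size // 2]]  # keep the fittest half
        n_children = pop_size - len(parents)
        # Uniform crossover between randomly paired parents.
        a = parents[rng.integers(len(parents), size=n_children)]
        b = parents[rng.integers(len(parents), size=n_children)]
        children = np.where(rng.random(a.shape) < 0.5, a, b)
        # Gaussian mutation scaled by the per-feature spread of the minority class.
        children = children + rng.normal(0.0, mutation_scale * std, size=children.shape)
        pop = np.vstack([parents, children])

    order = np.argsort(fitness(pop))
    return pop[order[:n_out]]


# Usage on toy data: 20 minority samples with 3 features.
X_min = np.random.default_rng(1).normal(size=(20, 3))
X_syn = ga_oversample(X_min)
print(X_syn.shape)  # (40, 3)
```

In a full pipeline the synthetic rows would be appended to the training set with the minority label before fitting the classifier. The DNN itself is omitted here because the abstract gives no architecture details.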


Copyright

© 2023 Karthikeyan & Kathirvalavakumar. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published By Indian Society for Education and Environment (iSee)