A Hybrid Data Resampling Algorithm Combining Leader and SMOTE for Classifying the High Imbalanced Datasets

S Karthikeyan; T Kathirvalavakumar

doi:10.17485/IJST/v16i16.146

Indian Journal of Science and Technology

Year: 2023, Volume: 16, Issue: 16, Pages: 1214-1220

Original Article

S Karthikeyan¹*, T Kathirvalavakumar²

¹Research Scholar, Research centre in Computer Science, V.H.N. Senthikumara Nadar College, Virudhunagar, Tamil Nadu, India
²Associate Professor, Research centre in Computer Science, V.H.N. Senthikumara Nadar College, Virudhunagar, Tamil Nadu, India

*Corresponding Author
Email: [email protected]

Received Date: 20 January 2023, Accepted Date: 29 March 2023, Published Date: 24 April 2023

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objective: Traditional classifiers are ineffective at classifying imbalanced datasets. The most popular approach to resolving this problem is data resampling. This paper proposes a hybrid resampling method that reduces misclassification in all classes.

Method: The proposed method employs the Leader algorithm for undersampling and the SMOTE algorithm for oversampling. It generates the desired number of samples in both classes, chosen according to the problem, which overcomes the over-fitting and under-fitting issues.

Findings: The proposed method is evaluated on 13 highly imbalanced datasets obtained from the KEEL repository, and the results are compared with state-of-the-art hybrid data resampling methods: SMOTE+Tomek Links, SMOTE+ENN, and SMOTE+RSB*. Of the 13 datasets, the proposed method outperforms the compared methods on 12 and produces the same result on 1. It reduces the misclassification rates of both the minority and majority classes and is more suitable for extremely imbalanced datasets.

Novelty: This work introduces a novel classification approach that combines machine learning algorithms with domain-specific knowledge, resulting in significantly improved accuracy on extremely imbalanced datasets compared with traditional methods. Its uniqueness lies in using the Leader algorithm and the SMOTE algorithm with a required resampling ratio instead of balancing the classes, which improves classification performance on imbalanced data.
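The combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the Leader distance threshold or the resampling ratio is chosen, so `threshold`, `minority_target`, `k`, and all function names below are assumptions made for illustration. The sketch uses the textbook single-pass Leader clustering (keeping the leaders as majority-class representatives) and the basic SMOTE interpolation between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def leader_undersample(samples, threshold):
    """Single-pass Leader clustering; the leaders are kept as representatives.

    A sample within `threshold` of an existing leader is absorbed by it;
    otherwise the sample becomes a new leader. `threshold` is an assumed
    tuning parameter, not a value taken from the paper.
    """
    leaders = []
    for x in samples:
        if not any(math.dist(x, leader) <= threshold for leader in leaders):
            leaders.append(x)
    return leaders


def smote_oversample(samples, n_new, k=5, rng=None):
    """Create `n_new` synthetic points by basic SMOTE interpolation."""
    if n_new <= 0:
        return []
    if len(samples) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = rng or random.Random(0)
    k = min(k, len(samples) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(samples))
        x = samples[i]
        # k nearest minority neighbours of x, excluding x itself by index
        neighbours = sorted(
            (s for j, s in enumerate(samples) if j != i),
            key=lambda s: math.dist(x, s),
        )[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nn
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic


def hybrid_resample(majority, minority, threshold, minority_target, k=5, seed=0):
    """Undersample the majority with Leader, oversample the minority with SMOTE.

    `minority_target` is the desired minority size after resampling
    (a hypothetical stand-in for the paper's required resampling ratio).
    """
    rng = random.Random(seed)
    reduced_majority = leader_undersample(majority, threshold)
    n_new = max(0, minority_target - len(minority))
    enlarged_minority = list(minority) + smote_oversample(minority, n_new, k, rng)
    return reduced_majority, enlarged_minority


if __name__ == "__main__":
    rng = random.Random(1)
    majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
    minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(10)]
    maj, mino = hybrid_resample(majority, minority, threshold=0.5, minority_target=30)
    print(len(majority), "->", len(maj), "majority;", len(minority), "->", len(mino), "minority")
```

A larger `threshold` absorbs more majority samples into each leader and so removes more of them, while `minority_target` controls how many synthetic minority samples are generated. Setting both so that the classes end up unequal in size corresponds to the idea of resampling to a required ratio rather than to full balance.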

Keywords: Imbalanced Data; Leader; SMOTE; Hybrid Sampling; Resampling; Classification

Copyright

© 2023 Karthikeyan & Kathirvalavakumar. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published by Indian Society for Education and Environment (iSee).