P-ISSN 0974-6846, E-ISSN 0974-5645

Indian Journal of Science and Technology

Year: 2023, Volume: 16, Issue: 16, Pages: 1214-1220

Original Article

A Hybrid Data Resampling Algorithm Combining Leader and SMOTE for Classifying the High Imbalanced Datasets

Received Date: 20 January 2023, Accepted Date: 29 March 2023, Published Date: 24 April 2023

Abstract

Objective: Traditional classifiers are ineffective at classifying imbalanced datasets. The most popular approach to this problem is data resampling. This paper proposes a hybrid resampling method that reduces misclassification in all classes.

Method: The proposed method employs the Leader algorithm for undersampling and the SMOTE algorithm for oversampling. It generates the desired number of samples in both classes, chosen according to the problem, which overcomes the over-fitting and under-fitting issues.

Findings: The proposed method is evaluated on 13 highly imbalanced datasets obtained from the KEEL repository, and the results are compared with state-of-the-art hybrid data resampling methods such as SMOTE+Tomek Links, SMOTE+ENN, and SMOTE+RSB*. Of the 13 datasets, the proposed method outperforms the compared methods on 12 and produces the same result on 1. It reduces the misclassification rates of both the minority and majority classes and is better suited to extremely imbalanced datasets.

Novelty: This work introduces a novel approach to classification that combines machine learning algorithms with domain-specific knowledge, resulting in significantly improved accuracy on extremely imbalanced datasets compared with traditional methods. Its distinguishing feature is the use of the Leader algorithm and the SMOTE algorithm with a required resampling ratio instead of full balancing, which improves classification performance on imbalanced data.
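The abstract names the two building blocks but does not give their details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes Euclidean distance, a user-chosen distance threshold for the Leader pass, and that the leaders themselves are kept as the reduced majority set; the function names `leader_undersample` and `smote_oversample`, the parameters `threshold`, `n_new`, and `k`, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def leader_undersample(X, threshold):
    """Single-pass Leader clustering; the leaders form the reduced sample set.

    Assumption: a point becomes a new leader when it is farther than
    `threshold` (Euclidean) from every existing leader.
    """
    leaders = [X[0]]
    for x in X[1:]:
        # Distance from x to every leader found so far.
        d = np.linalg.norm(np.asarray(leaders) - x, axis=1)
        if d.min() > threshold:
            leaders.append(x)
    return np.asarray(leaders)


def smote_oversample(X, n_new, k=5, rng=None):
    """Basic SMOTE: interpolate between a sample and one of its k nearest
    same-class neighbours. Returns the originals plus `n_new` synthetic rows.
    """
    n = len(X)
    if n < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = np.random.default_rng(rng)
    k = min(k, n - 1)
    # Pairwise distances within the minority class; exclude self-matches.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    # Pick a base sample and one of its neighbours for each synthetic point.
    base = rng.integers(0, n, size=n_new)
    nb = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    synthetic = X[base] + gap * (X[nb] - X[base])
    return np.vstack([X, synthetic])


# Hypothetical two-class data: 500 majority and 10 minority samples.
data_rng = np.random.default_rng(0)
X_maj = data_rng.normal(0.0, 1.0, size=(500, 2))
X_min = data_rng.normal(3.0, 0.5, size=(10, 2))

# Resample each class to a chosen size rather than forcing a 1:1 balance.
X_maj_red = leader_undersample(X_maj, threshold=0.5)
X_min_aug = smote_oversample(X_min, n_new=40, k=3, rng=0)

print("majority:", X_maj.shape, "->", X_maj_red.shape)
print("minority:", X_min.shape, "->", X_min_aug.shape)
```

In this sketch the threshold controls how aggressively the majority class is reduced (a larger threshold yields fewer leaders), while `n_new` sets how many synthetic minority samples are added, so the two together determine the resampling ratio.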

Keywords: Imbalanced Data; Leader; SMOTE; Hybrid Sampling; Resampling; Classification

Copyright

© 2023 Karthikeyan & Kathirvalavakumar. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published By Indian Society for Education and Environment (iSee
