KNN Based Under Sampling: A Cognitive Centred Solution for Imbalanced Dataset Problem in Anaphora Resolution

K Arul Deepa; Shanmuga Priya; P Velvizhy

doi:10.17485/IJST/v16i30.1523

Indian Journal of Science and Technology

Year: 2023, Volume: 16, Issue: 30, Pages: 2317-2324

Original Article

K Arul Deepa¹*, Shanmuga Priya², P Velvizhy²

¹Dept. of IST, College of Engineering Guindy, Anna University, Chennai
²Dept. of CSE, College of Engineering Guindy, Anna University, Chennai

*Corresponding Author
Email: [email protected]

Received Date: 20 June 2023, Accepted Date: 28 June 2023, Published Date: 08 August 2023

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Background: Like many other real-world applications, machine learning systems for anaphora resolution struggle with skewed data. The class imbalance problem arises in classification tasks where the number of instances differs greatly among the classes involved.

Objectives: The proposed framework first removes the imbalance between positive and negative class instances and only then classifies them. Balancing is performed by KBUS (KNN Based Under Sampling), which uses cognitive knowledge about the language and analyses the data at the attribute level.

Method: Nine pruning rules are crafted by KBUS for the TDIL dataset.

Findings: During experimentation, the share of positive instances rose from 5.32% to 43.95%, while the share of negative instances fell from 94.68% to 56.05%. The loss ratio of positive to negative instances is 1:112. Finally, the pruned dataset is classified by a set of classifiers, namely Naïve Bayes, SVM, random forest, decision tree and k-NN.

Novelty: Classifier results are discussed from two perspectives: first, the number of input instances, and second, the performance improvement achieved after pruning. The results indicate that pruning improves every classifier markedly. The proposed system achieved an encouraging F-measure of 78% for k-NN and 77% for the decision tree. Performance is compared before and after pruning, and the improvement in F-measure ranges from 13% (k-NN) to 41% (random forest). This work thus provides a machine learning model that resolves Tamil anaphoric situations effectively in an imbalanced classification environment.

Keywords: Imbalanced Dataset; Classification; Anaphora Resolution; Machine Learning; Pronominal Reference
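
The class shares and the loss ratio reported in the abstract can be related with a little arithmetic. The sketch below is illustrative only. It assumes that the 1:112 loss ratio means 112 negative instances were pruned for every positive instance pruned, which the abstract does not state explicitly. The function names `positive_share_after` and `positives_removed_share` are hypothetical helpers and are not part of the authors' KBUS implementation.

```python
def positive_share_after(pos_before, x, neg_per_pos):
    """Share of positive instances once pruning is done.

    pos_before  -- positive share of the original dataset (0..1)
    x           -- positives removed, as a share of the original dataset
    neg_per_pos -- negatives removed per positive removed
                   (ASSUMED meaning of the 1:112 loss ratio)
    """
    remaining_pos = pos_before - x
    remaining_total = 1.0 - (1 + neg_per_pos) * x
    return remaining_pos / remaining_total


def positives_removed_share(pos_before, pos_after, neg_per_pos):
    """Solve (pos_before - x) / (1 - (1 + r) * x) = pos_after for x."""
    return (pos_after - pos_before) / (pos_after * (1 + neg_per_pos) - 1)


# Shares taken from the abstract; the ratio interpretation is an assumption.
x = positives_removed_share(0.0532, 0.4395, 112)
total_removed = (1 + 112) * x
print(f"positives pruned: {x:.4%} of the original dataset")
print(f"all instances pruned: {total_removed:.2%} of the original dataset")
print(f"positive share after pruning: {positive_share_after(0.0532, x, 112):.2%}")
```

Under this assumed reading, the reported shares are reached only when most of the original dataset, almost all of it negative, is pruned. A different definition of the loss ratio would give different numbers.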

Copyright

© 2023 Deepa et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published By Indian Society for Education and Environment (iSee)