P-ISSN 0974-6846, E-ISSN 0974-5645

Indian Journal of Science and Technology

Year: 2023, Volume: 16, Issue: 30, Pages: 2317-2324

Original Article

KNN Based Under Sampling: A Cognitive Centred Solution for Imbalanced Dataset Problem in Anaphora Resolution

Received Date: 20 June 2023, Accepted Date: 28 June 2023, Published Date: 08 August 2023

Abstract

Background: Like many other real-world applications, machine learning systems for anaphora resolution struggle with skewed data. The class imbalance problem arises in classification tasks when the number of instances differs greatly among the classes involved.

Objectives: The proposed framework first removes the imbalance between positive and negative class instances before classifying them. It does so with KBUS (KNN Based Under Sampling), which makes use of cognitive knowledge about the language and performs its analysis at the attribute level.

Method: Nine pruning rules are crafted by KBUS for the TDIL dataset.

Findings: During experimentation, the share of positive instances increased from 5.32% to 43.95%, whereas the share of negative instances decreased from 94.68% to 56.05%. The loss ratio of positive to negative instances is 1:112. Finally, the pruned dataset is classified by a list of classifiers, namely Naïve Bayes, SVM, random forest, decision tree, and k-NN.

Novelty: Classifier results are discussed from two perspectives: first, the number of input instances, and second, the performance improvement achieved after pruning. Pruning yields a remarkable improvement for all the classifiers. The proposed system produced encouraging results, with an F-measure of 78% for k-NN and 77% for decision tree. Performance is compared before and after pruning, and the improvement in F-measure ranges from 13% (k-NN) to 41% (random forest). This work thus presents a machine learning model that resolves Tamil anaphoric situations effectively in an imbalanced classification environment.
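To illustrate the general idea of neighbourhood-based under-sampling mentioned in the abstract, the sketch below drops majority-class instances whose k nearest neighbours all belong to the majority class, and then reports how the positive share changes. This is a generic illustration only: it does not reproduce the nine KBUS pruning rules, which are not given here, and the function names `knn_undersample` and `positive_share`, the toy points, and the choice of k = 3 are all assumptions made for this example.

```python
import math

def knn_undersample(X, y, k=3, majority=0):
    """Generic k-NN under-sampling sketch (not the KBUS rules).

    A majority-class point is dropped when all of its k nearest
    neighbours are also majority-class points; minority points
    are always kept.
    """
    kept = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority:
            kept.append(i)  # never prune the minority class
            continue
        # Indices of the k nearest other points (Euclidean distance).
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )[:k]
        # Keep a majority point only if a minority point is nearby.
        if any(y[j] != majority for j in neighbours):
            kept.append(i)
    return [X[i] for i in kept], [y[i] for i in kept]

def positive_share(y, positive=1):
    """Fraction of instances carrying the positive label."""
    return sum(1 for label in y if label == positive) / len(y)

# Hypothetical toy data: 2 positives, 2 negatives close to them,
# and 6 negatives in a distant cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1),
     (10, 10), (10, 11), (11, 10), (11, 11), (12, 12), (12, 11)]
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

X_pruned, y_pruned = knn_undersample(X, y, k=3)
share_before = positive_share(y)
share_after = positive_share(y_pruned)
print(f"positive share: {share_before:.2%} -> {share_after:.2%}")
```

On this toy data only the distant negative cluster is pruned, so the positive share rises while every positive instance is retained. The abstract reports that KBUS also loses some positive instances (loss ratio 1:112), which this simplified sketch does not model.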

Keywords: Imbalanced Dataset; Classification; Anaphora Resolution; Machine Learning; Pronominal Reference

Copyright

© 2023 Deepa et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published By Indian Society for Education and Environment (iSee)
