Evaluation of Cost Sensitive Learning for Imbalanced Bank Direct Marketing Data

Khor Kok Chin and Ng Keng Hoong

doi:10.17485/ijst/2016/v9i42/100812


Indian Journal of Science and Technology

Year: 2016, Volume: 9, Issue: 42, Pages: 1-8

Original Article

Khor Kok-Chin* and Ng Keng-Hoong

Faculty of Computing and Informatics, Multimedia University, 63100, Cyberjaya, Selangor; [email protected], [email protected]

*Author for correspondence
Khor Kok-Chin
Faculty of Computing and Informatics, Multimedia University, 63100, Cyberjaya, Selangor; [email protected]

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objectives: The imbalanced bank direct marketing data set used in this study poses a two-class data mining problem, where a customer either subscribes or does not subscribe to a product from a bank. Methods/Statistical Analysis: The data set exhibits the rare class problem, in which the classification rate attained for the rare class is low. In this study, we applied cost sensitive learning to mitigate the problem and to account for the different costs incurred when misclassification occurs. Three learning algorithms, namely Naive Bayes (NB), C4.5 and Naive Bayes Tree (NBT), were used in the cost sensitive learning, and their results were empirically evaluated. Findings: The results were also compared with two previous studies that used cost insensitive SVM and over-sampling, respectively. Although cost sensitive learning is claimed to be able to handle imbalanced data sets, we found that it was less effective overall for the bank direct marketing data set. Cost sensitive learning provides a way of “wrapping” learning algorithms that are not designed to handle imbalanced class distributions. Therefore, it may not work well for certain imbalanced data sets. Over-sampling, on the other hand, worked well for the data set. Improvements/Applications: Over-sampling helped to generalize the decision region of the rare class clearly and subsequently improved the classification result.

Keywords: Bank Direct Marketing, Cost Sensitive Learning, Imbalanced Data Set, Rare Class Problem, Over-Sampling
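
Illustrative sketch: One common way to "wrap" a cost insensitive classifier, as described in the abstract, is to keep its class probability estimates unchanged and pick the class with the minimum expected misclassification cost. The Python sketch below shows that decision rule only. It is not the configuration used in the study; the function name, the cost matrix values and the example probabilities are all assumptions chosen for illustration.

```python
from typing import Sequence


def min_expected_cost_class(
    probs: Sequence[float], cost: Sequence[Sequence[float]]
) -> int:
    """Return the index of the predicted class with the lowest expected cost.

    probs[i]   : estimated probability that the true class is i
    cost[i][j] : cost of predicting class j when the true class is i
    """
    n_classes = len(probs)
    expected = [
        sum(probs[i] * cost[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]
    return min(range(n_classes), key=expected.__getitem__)


# Hypothetical cost matrix (assumption, not taken from the paper):
# class 0 = customer does not subscribe, class 1 = customer subscribes (rare).
# Missing a subscriber is assumed to cost five times as much as a false alarm.
COST = [
    [0.0, 1.0],  # true class 0: correct prediction, false alarm
    [5.0, 0.0],  # true class 1: missed subscriber, correct prediction
]

# Probabilities as an underlying classifier (e.g. Naive Bayes) might output them.
probs = [0.8, 0.2]

# Cost insensitive decision: the most probable class.
plain = max(range(len(probs)), key=probs.__getitem__)

# Cost sensitive decision: the class with the minimum expected cost.
cost_sensitive = min_expected_cost_class(probs, COST)
```

With these assumed values, the plain rule predicts the majority class, while the wrapped rule predicts the rare class because a missed subscriber is penalized more heavily. The underlying probability estimates are untouched, which is consistent with the abstract's point that wrapping does not change how the base learner models the rare class.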