A credit scoring model is an analysis tool that determines the creditworthiness of a loan applicant from historical data by estimating the probability of default. Ensemble modeling has been shown to improve the performance of credit scoring models. Such models train individual base classifiers and integrate their outputs with an ensemble strategy to enhance performance.
Credit scoring models are used to evaluate the financial risks associated with granting credit to an applicant.
The two most widely used statistical methods in credit scoring are Logistic Regression (LR) and Linear Discriminant Analysis (LDA). Machine learning classification approaches such as K-Nearest Neighbor (KNN), Decision Tree (DT), Support Vector Machine (SVM), Random Forest (RF), Naïve Bayes (NB), Classification and Regression Tree (CART), Genetic Algorithms (GA), and Artificial Neural Networks (ANN) are also extensively used in credit scoring.
The credit industry demands well-structured and effective credit scoring systems. The research incentive is improved classification accuracy, and the research trend is moving towards hybrid models and ensemble models.
To improve the performance of machine learning algorithms, it is important to adopt hybridization and ensemble learning approaches. The authors
Handling the imbalanced classification problem is an important and difficult task.
In recent years, research has focused mainly on ensemble strategies. Different approaches to ensemble learning use different base classifiers and different consensus methods.
The authors in
The authors in
The authors in
The authors in
The authors in
The authors in
Authors in
A wide range of machine learning algorithms is now available for building individual models. Each algorithm has its own strengths and weaknesses, and no single algorithm achieves the best performance on all available credit datasets. A broad set of diverse and accurate algorithms can be combined into an ensemble to obtain superior performance.
Ensemble methods are currently booming in the banking and finance industry. Ensemble modelling combines different classifiers to improve the predictive power and the stability of a classification model. An effective prediction system helps bankers assess credit risk when making loans to credit applicants.
A stacked support vector machine is proposed in
Hence, this study uses a stacking strategy to construct an ensemble model by integrating base classifiers. Six different classifiers are used to build a heterogeneous classifier ensemble: SVM, LR, KNN, RF, NB and DT.
The proposed model is made up of the following steps:
Collecting the datasets.
Splitting the available datasets into training and testing samples.
Applying the classification algorithms.
Building an ensemble classifier using the stacking and voting methods.
Assessing the built models using performance evaluation measures.
The result of every individual classifier is compared with that of the heterogeneous ensemble classifier. The rest of the paper is structured as follows: Section 2 details the research methodology. Section 3 discusses and analyses the results. Finally, Section 4 provides the conclusion and recommendations for future research.
Support vector machine (SVM) is a classification technique proposed by Vapnik. It is extensively used in credit risk assessment because of its strong predictive capability, and it often performs better than other algorithms because it handles the sparseness problem well. Its main idea is to project the input data into a higher-dimensional feature space and find a separating hyperplane there. The hyperplane is determined by the support vectors and separates the two classes with the maximum margin.
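Let S be a dataset with M observations, where $(x_i, y_i)$ are labelled instance pairs for $i = 1, 2, \ldots, M$, with $x_i \in \mathbb{R}^n$ and the two classes coded as $y_i \in \{-1, +1\}$. The maximum-margin hyperplane $w \cdot x + b = 0$ is obtained by solving
$$ \min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \ldots, M \tag{1} $$
where w denotes the plane's normal and b denotes the plane's intercept.
To solve the quadratic optimization problem, we can use Lagrange multipliers. The Lagrange function is:
$$ L_P = \frac{1}{2}\lVert w \rVert^2 - \sum_{i=1}^{M} a_i \left[ y_i (w \cdot x_i + b) - 1 \right] \tag{2} $$
where $a_i$ represents the Lagrange multipliers.
To obtain the saddle point, $L_P$ must be maximized with respect to the nonnegative dual variables $a_i$ and minimized with respect to the variables w and b. $L_P$ is converted to the dual Lagrangian $L_D(a)$:
$$ \max_{a} \; L_D(a) = \sum_{i=1}^{M} a_i - \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} a_i a_j y_i y_j \, (x_i \cdot x_j) \quad \text{subject to} \quad a_i \ge 0, \; \sum_{i=1}^{M} a_i y_i = 0 \tag{3} $$
where the inner product $x_i \cdot x_j$ can be replaced by a kernel function $K(x_i, x_j)$, and $a_i > 0$ only if the corresponding $x_i$ is a support vector. The resulting decision function is
$$ f(x) = \sum_{i=1}^{M} a_i y_i K(x_i, x) + b \tag{4} $$
Using the above function in equation 4, SVM classifies samples as class 0 (the class coded as $-1$) if $f(x) < 0$ and as class 1 otherwise.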
We can predict the label of the new input sample using the features of the support vectors. Many kernels can be selected to map input data to multidimensional feature space in SVMs, such as linear, sigmoid, radial basis function (RBF) and polynomial.
All four kernels (linear, sigmoid, radial basis function and polynomial) are used in our ensemble aggregation.
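The four kernels named above can be written out directly. The following is a minimal sketch in plain Python; the parameter names (`gamma`, `coef0`, `degree`) and their default values are illustrative assumptions, not settings reported in this paper.

```python
import math

def linear_kernel(x, z):
    # Plain inner product of the two feature vectors.
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=1.0):
    # (gamma * <x, z> + coef0) ** degree; parameter values are assumed.
    return (gamma * linear_kernel(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    # exp(-gamma * squared Euclidean distance).
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, z, gamma=0.1, coef0=0.0):
    # tanh(gamma * <x, z> + coef0).
    return math.tanh(gamma * linear_kernel(x, z) + coef0)

# Two hypothetical two-dimensional samples.
x, z = [1.0, 2.0], [3.0, 1.0]
kernel_values = {
    "linear": linear_kernel(x, z),
    "polynomial": polynomial_kernel(x, z),
    "rbf": rbf_kernel(x, z),
    "sigmoid": sigmoid_kernel(x, z),
}
```

Each function returns a similarity score between two samples; an SVM replaces the inner product in its dual problem with one of these scores.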
Logistic regression (LR) is a generalized linear model for a binary response. It models the relationship between a dependent variable and one or more independent variables, and it is used in the social sciences, medicine and machine learning. The LR model consists of k predictors and one dichotomous output (response) variable with only two possible outcomes: 1 (good) and 0 (bad). The equation is as shown below,
$$ \log\left(\frac{p}{1-p}\right) = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k \tag{5} $$
where p represents the probability that the response equals 1 for a specific customer, $b_0$ is the intercept term, and $b_i$ is the coefficient related to predictor $x_i$ (i = 1, ..., k). The purpose of the logistic regression model is to estimate the conditional probability that a particular instance belongs to a particular class.
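As a worked illustration, the probability p can be recovered from the linear predictor with the logistic function. This is a minimal sketch; the coefficient values, the intercept and the 0.5 decision threshold are hypothetical and not taken from the paper.

```python
import math

def predict_probability(x, coefficients, intercept):
    # Linear predictor (log-odds): intercept + sum of b_i * x_i.
    logit = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    # The logistic function maps the log-odds to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical customer with two predictors and made-up coefficients.
p = predict_probability([2.0, 1.0], coefficients=[0.8, -0.5], intercept=-0.6)
label = 1 if p >= 0.5 else 0  # 1 = good, 0 = bad (assumed threshold of 0.5)
```

Here the log-odds are about 0.5, so p is roughly 0.62 and the customer is labelled 1 (good).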
Random Forest is a classification technique based on ensemble learning. Each classifier in the ensemble is a DT classifier, and together the trees form a forest. Individual decision trees are constructed using attributes randomly selected at each node. During classification each tree votes, and the class with the most votes is taken as the output.
Bagging or Bootstrap aggregation techniques are applied for random forest during training. Consider a training set
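The RF classifier uses the Gini index to select attributes. The Gini index measures the impurity of an attribute with respect to the classes. For a given training set T, when one case is selected at random, the Gini index is
$$ \sum_{i} \sum_{j \ne i} \left( \frac{f(C_i, T)}{|T|} \right) \left( \frac{f(C_j, T)}{|T|} \right) \tag{6} $$
where $f(C_i, T)/|T|$ represents the probability that the selected case belongs to class $C_i$.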
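A minimal sketch of the Gini computation follows. It assumes the common form $1 - \sum_i p_i^2$, which is algebraically equal to the sum of $p_i p_j$ over all pairs $i \ne j$; the node contents are hypothetical.

```python
from collections import Counter

def gini_index(labels):
    # p_i = share of class i among the cases in the node.
    total = len(labels)
    probs = [count / total for count in Counter(labels).values()]
    # 1 - sum(p_i^2) equals the sum of p_i * p_j over all i != j.
    return 1.0 - sum(p * p for p in probs)

# Hypothetical node: 7 good and 3 bad applicants.
node = ["good"] * 7 + ["bad"] * 3
gini = gini_index(node)            # 1 - (0.49 + 0.09) = 0.42
pure = gini_index(["good"] * 5)    # a pure node has zero impurity
```

A lower value indicates a purer node, so a split that reduces the weighted Gini index of the child nodes is preferred.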
Decision tree is a powerful and very popular classification algorithm that produces simple, interpretable rules with very little user intervention. The most widely used DT algorithms are ID3, C4.5 and C5.0. Building an optimal decision tree is the key task in a decision tree classifier, since many different trees can be built from a given attribute set. While building the tree, information gain is used at each step to determine which feature to split on. Based on information theory, entropy is given as,
$$ \text{Entropy} = -\sum_{i=1}^{j} p_i \log_2 p_i \tag{7} $$
where $p_1, p_2, \ldots, p_j$ are the fractions of each class present in the node, which add up to 1.
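The entropy and the information gain of a candidate split can be sketched as follows. The example assumes base-2 logarithms and the usual definition of information gain (parent entropy minus the size-weighted entropy of the children); the labels and the split are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    # -sum(p_i * log2(p_i)) over the classes present in the node.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(parent, children):
    # Parent entropy minus the size-weighted entropy of the child nodes.
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node with 5 good (1) and 5 bad (0) cases, split in two.
parent = [1] * 5 + [0] * 5
split = [[1, 1, 1, 1, 0], [1, 0, 0, 0, 0]]
parent_entropy = entropy(parent)
gain = information_gain(parent, split)
```

The balanced parent has an entropy of 1 bit, and this split lowers it by about 0.28 bits; the feature with the largest gain would be chosen for the split.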
The K-nearest neighbor (KNN) classifier is one of the most frequently used methods for credit scoring. It is a nonparametric classification method, meaning that it makes no assumptions about the underlying data distribution. KNN considers the entire training dataset when making a decision.
KNN is used as a benchmark for many classifiers. Euclidean distance is adopted in KNN to compute the distance between a test sample and the training samples. To predict a new point, KNN chooses its k nearest neighbors from the training dataset and assigns the majority class among them (for a numeric target, it takes their average). KNN is easy to tune because it has only one parameter, k.
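The prediction rule can be sketched in a few lines of plain Python. The sketch assumes Euclidean distance and a majority vote among the k nearest training samples; the toy data and the choice k = 3 are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # Euclidean distance from the query to every training sample.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Majority class among the k closest samples.
    top_k = [label for _, label in distances[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Hypothetical two-feature training set with two well-separated classes.
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = [0, 0, 0, 1, 1, 1]
pred_a = knn_predict(train_X, train_y, (2, 2))
pred_b = knn_predict(train_X, train_y, (6, 5))
```

Because every prediction scans the whole training set, the cost of a query grows with the number of training samples.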
The Naïve Bayes (NB) algorithm applies Bayes' theorem with strong independence assumptions among the features. A set of training data is used to estimate the probability terms required for classification, and the classifier's performance depends on how accurately these probability terms are estimated.
The naïve Bayes classifier mainly focuses on conditional probability. It assumes that the attributes are independent of each other, and it is suitable for high-dimensional inputs. The assumption is that, given the target value of the instance, the probability of observing the conjunction $a_1, a_2, \ldots, a_n$ is the product of the probabilities of the individual attributes:
$$ P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_{i=1}^{n} P(a_i \mid v_j) \tag{8} $$
where $a_i$ is a distinct attribute value and $v_j$ is a distinct target value.
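A minimal sketch of this product rule for categorical attributes is given below. It scores each target value by its prior times the product of the per-attribute conditional probabilities, estimated as raw frequencies without smoothing (an assumption made for brevity); the attribute names and records are hypothetical.

```python
from collections import Counter

def train_naive_bayes(rows, targets):
    # class_counts[v] = number of training records with target value v.
    class_counts = Counter(targets)
    # cond_counts[(i, a, v)] = records with attribute i == a and target v.
    cond_counts = Counter()
    for row, v in zip(rows, targets):
        for i, a in enumerate(row):
            cond_counts[(i, a, v)] += 1
    return class_counts, cond_counts

def predict(row, class_counts, cond_counts):
    total = sum(class_counts.values())
    scores = {}
    for v, n_v in class_counts.items():
        score = n_v / total                         # prior P(v_j)
        for i, a in enumerate(row):
            score *= cond_counts[(i, a, v)] / n_v   # P(a_i | v_j)
        scores[v] = score
    return max(scores, key=scores.get), scores

# Hypothetical records: (employment status, housing) -> credit class.
rows = [("employed", "own"), ("employed", "rent"), ("employed", "own"),
        ("unemployed", "rent"), ("unemployed", "rent"), ("employed", "rent")]
targets = ["good", "good", "good", "bad", "bad", "bad"]
class_counts, cond_counts = train_naive_bayes(rows, targets)
label, scores = predict(("employed", "own"), class_counts, cond_counts)
```

In practice a smoothing term is usually added so that an attribute value never seen with a class does not force the whole product to zero, as happens for the "bad" class here.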
Each individual classifier performs differently on different datasets, and it is difficult to predict which classifier will perform best on a given dataset. An ensemble classifier is therefore a suitable choice across datasets.
Ensemble classification methods train multiple classifiers to solve the same problem. Ensemble learning comprises a set of base classifiers that are trained individually. The predicted outputs of these classifiers are combined using majority voting, weighted voting, bagging, stacking or boosting.
Each classifier is trained on heterogeneous data and yields a predicted output. These predictions are combined in several ways: i) Simple average: for each sample, the average of the predictions of all classifiers is calculated to produce the final prediction. ii) Majority voting: the predictions of all classifiers are combined, and for each sample the class with the highest number of votes is selected as the final output.
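The two combination rules can be sketched as follows. The sketch assumes that simple averaging operates on predicted positive-class probabilities with a 0.5 cut-off and that majority voting operates on hard labels; the prediction values are hypothetical.

```python
from collections import Counter

def simple_average(prob_predictions, threshold=0.5):
    # prob_predictions holds one list of positive-class probabilities
    # per classifier; zip(*...) regroups them per sample.
    n = len(prob_predictions)
    averaged = [sum(col) / n for col in zip(*prob_predictions)]
    return [1 if p >= threshold else 0 for p in averaged]

def majority_vote(label_predictions):
    # For each sample, pick the label predicted by most classifiers.
    return [Counter(col).most_common(1)[0][0]
            for col in zip(*label_predictions)]

# Hypothetical outputs of three classifiers on three samples.
probs = [[0.9, 0.4, 0.2], [0.6, 0.7, 0.1], [0.8, 0.3, 0.6]]
labels = [[1, 0, 0], [1, 1, 0], [1, 0, 1]]
avg_result = simple_average(probs)
vote_result = majority_vote(labels)
```

With an odd number of classifiers and two classes, majority voting never produces a tie.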
In this study, a model combining stacking and majority voting is developed to aggregate the outputs of the six base classifiers and improve the predictive accuracy of the credit scoring system.
Two real-world datasets, namely German and Australian, taken from the UCI machine learning repository are used in the experiments. The details of the datasets are presented in Table 1.
The German credit dataset contains a total of 1000 samples, with 700 positive samples and 300 negative ones. Each instance has 7 numerical features, 13 categorical features and a target attribute. The Australian dataset consists of 690 samples, of which 307 are positive and 383 are negative. Each instance comprises 8 numerical features, 6 categorical features and a target attribute.
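Table 1. Details of the datasets

| Dataset | Samples | Positive | Negative | Categorical features | Numerical features | Total features |
|---|---|---|---|---|---|---|
| German | 1000 | 700 | 300 | 13 | 7 | 20 |
| Australian | 690 | 307 | 383 | 6 | 8 | 14 |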
Feature selection provides cost-effective and faster classifiers and improves the prediction of credit scoring systems. The process of feature selection can be combined with subset selection. There are three categories of methods for selecting feature subsets: wrappers, filters and embedded methods.
In this work, we use the random forest method, a tree-based approach, for feature selection. Feature importance is computed with random forest; the most important features are selected and irrelevant features are removed. Although random forest is commonly used as a classifier, its ability to estimate feature importance also makes it usable as a feature selector.
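Random forest builds a set of decision trees. Given the difference in performance for tree i, denoted by $d_i$, the importance of feature $A_j$ can be computed as
$$ \text{Importance}(A_j) = \frac{\bar{d}}{SE_d} \tag{9} $$
where $\bar{d}$ is the mean of $d_i$ over all trees and $SE_d$ denotes the standard error of $d_i$ considering all trees ($SE_d = SD_{d_i} / \sqrt{n}$, with n the number of trees).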
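The following sketch illustrates one way such a score could be computed. It assumes that the score is the mean of the per-tree differences divided by their standard error, with the standard error taken as the sample standard deviation over the square root of the number of trees; the per-tree differences and the selection threshold of 2.0 are hypothetical.

```python
import math
import statistics

def importance_score(differences):
    # differences[i] = performance difference d_i observed for tree i.
    n = len(differences)
    mean_d = statistics.mean(differences)
    # Standard error of d_i: sample standard deviation / sqrt(n).
    se_d = statistics.stdev(differences) / math.sqrt(n)
    return mean_d / se_d

# Hypothetical per-tree differences for two features across four trees.
per_tree_differences = {
    "A1": [0.04, 0.05, 0.06, 0.05],   # consistently useful feature
    "A2": [0.0, 0.01, -0.01, 0.0],    # feature with no systematic effect
}
scores = {name: importance_score(d) for name, d in per_tree_differences.items()}
# Keep features whose score exceeds an assumed threshold of 2.0.
selected = [name for name, s in scores.items() if s > 2.0]
```

A feature whose differences are large and consistent across trees receives a high score, while a feature whose differences fluctuate around zero is dropped.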
The proposed model is built using six base classifiers for ensemble aggregation. The
A feature selection algorithm based on the random forest technique is used to select the best features. The credit dataset is divided into training and test sets. The proposed model uses stacking and majority voting for ensemble classification. Stacking is first applied to the base classifiers in two levels. At the first level, the training dataset is split into 10 folds for cross-validation. In each iteration, 10 − 1 = 9 folds are used to train the base classifiers and the remaining fold is used for output prediction. After 10 iterations, predictions for the entire training set are obtained. The output of each classifier is taken and the dataset is updated with the meta-features, i.e., the predictions made by each classifier. At the second level, three meta-classifiers (MC), namely LR, SVM and RF, are used. Majority voting is applied to the outputs of these meta-classifiers to produce the final prediction.
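The two-level procedure can be sketched with scikit-learn as shown below. This is an illustrative sketch, not the authors' implementation: it uses synthetic data from `make_classification` as a stand-in for a credit dataset, library-default hyperparameters, and only the base-classifier predictions as second-level inputs, all of which are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a credit dataset (binary target, numeric features).
X, y = make_classification(n_samples=400, n_features=14, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

base_models = [
    SVC(random_state=0),
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(),
    RandomForestClassifier(n_estimators=50, random_state=0),
    GaussianNB(),
    DecisionTreeClassifier(random_state=0),
]

# Level 1: out-of-fold predictions from 10-fold cross-validation
# become the training meta-features (one column per base classifier).
meta_train = np.column_stack(
    [cross_val_predict(m, X_train, y_train, cv=10) for m in base_models])
# Refit each base model on the full training set for the test meta-features.
meta_test = np.column_stack(
    [m.fit(X_train, y_train).predict(X_test) for m in base_models])

# Level 2: three meta-classifiers (LR, SVM, RF) trained on the meta-features.
meta_models = [
    LogisticRegression(max_iter=1000),
    SVC(random_state=0),
    RandomForestClassifier(n_estimators=50, random_state=0),
]
meta_preds = np.array(
    [m.fit(meta_train, y_train).predict(meta_test) for m in meta_models])

# Majority vote over the three meta-classifier outputs (labels are 0/1).
final_pred = (meta_preds.sum(axis=0) >= 2).astype(int)
stacking_accuracy = accuracy_score(y_test, final_pred)
```

Using out-of-fold predictions keeps the meta-classifiers from being trained on predictions that the base classifiers made for samples they had already seen.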
The selection of evaluation measures is important for validating the performance of classification models. The assessment measures used here are derived from the confusion matrix. The confusion matrix is shown in
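| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True Positive (TP) | False Negative (FN) |
| Actual negative | False Positive (FP) | True Negative (TN) |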
True positives (TP) are positive instances that are predicted as positive. False positives (FP) are negative samples that are predicted as positive. Likewise, false negatives (FN) are positive samples that are predicted as negative, and true negatives (TN) are negative samples that are predicted as negative. Using the confusion matrix, Accuracy, Precision, Recall, F-measure and Specificity are expressed as,
$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10} $$
$$ \text{Precision} = \frac{TP}{TP + FP} \tag{11} $$
$$ \text{Recall} = \frac{TP}{TP + FN} \tag{12} $$
$$ \text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{13} $$
$$ \text{Specificity} = \frac{TN}{TN + FP} \tag{14} $$
The F-measure given in equation (13) is a measure of a model's accuracy, computed as two times the product of recall and precision divided by the sum of recall and precision. The precision given in equation (11) is TP (the number of correctly classified positive results) divided by the sum of TP and FP. Recall is the number of correctly classified positive results divided by the sum of TP and FN.
The Area Under the Curve (AUC) is the two-dimensional area under the Receiver Operating Characteristic (ROC) curve. The true-positive rate (sensitivity, equation (12)) is plotted along the y-axis and the false-positive rate (calculated as 1 − specificity, equation (14)) along the x-axis. A model with a larger AUC indicates better performance.
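The measures described above follow directly from the four confusion-matrix counts. The sketch below computes them for a hypothetical confusion matrix; the counts are made up for illustration and are not results from this study.

```python
def classification_metrics(tp, fn, fp, tn):
    # All measures are derived from the four confusion-matrix counts.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity / true-positive rate
    f_measure = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)       # 1 - false-positive rate
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "specificity": specificity,
    }

# Hypothetical counts for 200 test samples.
metrics = classification_metrics(tp=80, fn=20, fp=10, tn=90)
```

For these counts the accuracy is 0.85, the recall is 0.80 and the specificity is 0.90, so the corresponding point on the ROC curve lies at a false-positive rate of 0.10 and a true-positive rate of 0.80.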
Based on the proposed heterogeneous ensemble model and the experimental setup, we compare the predictive performance of the heterogeneous ensemble model with that of the six individual machine learning models. The final computational results are shown in
The dataset is split into two parts: 80% as the training dataset and 20% as the testing dataset. SVM, LR, KNN, RF, Naïve Bayes and DT are used as base models.
To examine whether the proposed ensemble model is effective in terms of accuracy, the following steps are executed:
Each single classifier is tested on the test dataset and the results are noted.
The proposed heterogeneous ensemble model is then tested on the test dataset.
Finally, the accuracies of the individual and ensemble models are compared to select the best model.
The results reported in this section were based on the testing set of both German and Australian datasets. Tenfold cross validation is applied during training of each classifier.
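The evaluation protocol (an 80/20 hold-out split followed by ten-fold cross-validation on the training part) can be sketched with index lists in plain Python. The random seed, the unstratified shuffling and the interleaved fold assignment are assumptions made for illustration.

```python
import random

def holdout_split(n_samples, test_fraction=0.2, seed=0):
    # Shuffle the sample indices and reserve a fraction for testing.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]

def k_fold(indices, k=10):
    # Partition the indices into k folds; each fold is used once for
    # validation while the remaining k - 1 folds are used for training.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i
                    for idx in fold]
        yield training, validation

# 1000 samples, matching the size of the German dataset.
train_idx, test_idx = holdout_split(1000)
splits = list(k_fold(train_idx, k=10))
```

With 1000 samples this gives 800 training and 200 test indices, and each cross-validation round trains on 720 samples and validates on 80.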
For the Australian dataset, the Random Forest classifier shows the best performance among the individual classifiers, while for the German dataset Logistic Regression performs best among the single classifiers. The proposed ensemble method achieves the highest accuracy on both the Australian and German datasets, with 91.56% and 84.35% respectively. The ensemble classifier therefore gives better classification than the single classifiers. The comparison of single base classifiers with the proposed model is shown in
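| Classifier (Australian dataset) | Accuracy (%) | | | |
|---|---|---|---|---|
| LR | 85.35 | 78.15 | 71.36 | 73.57 |
| DT | 88.23 | 87.42 | 65.47 | 77.12 |
| SVM | 70.65 | 69.56 | 64.35 | 68.54 |
| KNN | 70.95 | 69.84 | 66.09 | 67.46 |
| NB | 83.62 | 74.47 | 75.64 | 72.35 |
| RF | 89.50 | | 83.54 | |
| Proposed model | 91.56 | 86.45 | | 85.42 |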
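| Classifier (German dataset) | Accuracy (%) | | | |
|---|---|---|---|---|
| LR | 79.21 | 86.26 | 70.86 | 68.43 |
| DT | 74.52 | 83.45 | 66.37 | 68.18 |
| SVM | 74.65 | 84.62 | 54.58 | |
| KNN | 73.76 | 85.28 | 65.35 | 68.18 |
| NB | 74.54 | 83.63 | 74.24 | 65.56 |
| RF | 77.15 | | 82.35 | 69.09 |
| Proposed model | 84.35 | 87.52 | | 70.56 |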
The proposed model is validated by comparing it with the state-of-the-art models proposed in
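| Study | Measure | Australian (%) | German (%) |
|---|---|---|---|
| Our Proposed Model | Accuracy | 91.56 | 84.35 |
| H Van Sang et al. | Accuracy | 89.40 | 76.20 |
| S. Wei et al. | Accuracy | 87.92 | |
| S. Guo et al. | Accuracy | 87.4 | 78.3 |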
In this study, we built six individual credit scoring models and proposed a heterogeneous ensemble model. The proposed ensemble model was developed as follows. First, the datasets (Australian and German) were collected. Next, each dataset was split into training (80%) and testing (20%) sets. Then six individual classifiers and the ensemble classifier were built, and each classifier was trained on both the Australian and German datasets. Finally, the outputs of the classifiers were combined in the ensemble using stacking and majority voting to obtain the final results. The results show that the ensemble model achieved an accuracy of 91.56% on the Australian dataset and 84.35% on the German dataset, higher than that of any individual classifier.
Future research will focus on feature selection, extending this work with an ensemble of feature selection methods that combines different feature selection algorithms. Although the developed model resulted in better performance, our future work will also focus on building a credit scoring system that includes a neural network classifier in the ensemble aggregation and on adopting a weighted voting approach.