Defect prediction is essential for software quality and reliability, and it is one of the major research areas in modern software engineering. With the continuous development of machine learning and data mining, numerous defect prediction models have been introduced to address the class imbalance problem. In machine learning, classifiers are built to minimize errors and increase accuracy. Class imbalance has gradually become a dominant research hotspot in software engineering. In general, imbalanced classification refers to the phenomenon in which the sample sizes of different categories are unevenly distributed; in a binary classification problem, for example, the imbalance problem appears when the sample sizes of the two categories differ greatly. In practice, class imbalance problems are common and need to be addressed.
Machine learning prediction models have also been applied to medical data sets, for example to predict acute organ failure in critically ill patients.
The main contributions of this article are to: i) introduce a classification-imbalance impact analysis method that transforms the original imbalanced data set into a set of new data sets with an increasing imbalance rate, applies different prediction models to the new data sets, and analyzes the degree of performance stability of each model when the classification is imbalanced; ii) carry out experiments on three existing typical prediction models (KNN, NB, and BPN) and two proposed machine learning and extreme learning methods (SVM and ELM) to show their stability in classifying imbalanced data; and iii) derive the performance stability of the existing prediction models (KNN, NB, BPN) on the PROMISE data library under class imbalance, showing that the proposed SVM and ELM are reasonable prediction models for practical applications, which provides guidance for research on software defect prediction.
The proposed method constructs a set of new data sets with an increasing imbalance rate from the original imbalanced data set and then applies typical classification models (KNN, NB, BPN, SVM, and ELM) as defect prediction models to the constructed data sets. The Area Under the ROC (Receiver Operating Characteristic) Curve (AUC) is used to evaluate the classification performance of the different prediction models, and the coefficient of variation is used to measure the performance stability of each model. The experimental results show that the performance of the three existing prediction models, BPN, NB, and KNN, decreases as the imbalance rate increases, indicating that these three models are highly susceptible to class imbalance, whereas SVM and ELM maintain high performance in defect prediction.
To build an improved prediction model, it is necessary to clean the noisy or imbalanced data in the repository; the quality of the data sets is essential for experimentation and evaluation. Machine learning methods are well suited to dealing with imbalanced data sets, minimizing error rates, and filtering noisy records out of large collections. The proposed SVM and ELM methods evaluate the data sets in the PROMISE library by transforming the original imbalanced data sets into new ones for imbalanced classification. The preliminary statistical results are shown in
The number of nondefective samples in a data set is typically much higher than the number of defective samples, a phenomenon called imbalanced classification. The data set D = {x1, x2, ..., xn}, xi ∈ R^d (i = 1, 2, ..., n), includes n samples, each described by d software features. In addition, each sample carries a category label that marks it as defective or nondefective. According to this label, the data set can be divided into the defective class C1 and the nondefective class C2, with n1 and n2 samples respectively, where n = n1 + n2. The imbalance rate (Imbalance Ratio) of the data set D is then r = n2/n1.
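For concreteness, the imbalance ratio can be computed directly from the two class counts; the example below uses the Jedit4.3 and Tomcat counts reported later in this paper.

```python
import math

def imbalance_ratio(n_defective, n_nondefective):
    """Imbalance ratio r = floor(n2 / n1), where n1 is the minority
    (defective) class size and n2 the majority (nondefective) size."""
    return math.floor(n_nondefective / n_defective)

# Class counts for the two PROMISE data sets used in this paper.
print(imbalance_ratio(11, 481))   # Jedit4.3 -> 43
print(imbalance_ratio(77, 781))   # Tomcat   -> 10
```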
A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression, although it is mostly used for classification. In the SVM algorithm, each data item is plotted as a point in an n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate, and a separating hyperplane is sought that handles the imbalanced data in an optimized manner.
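A minimal sketch of applying an SVM to an imbalanced binary problem follows (assuming scikit-learn; the linear kernel and the `class_weight='balanced'` option are illustrative choices, not settings taken from this paper):

```python
from sklearn.svm import SVC

# Toy 2-D data: an imbalanced binary problem (8 "nondefective", 3 "defective").
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 2], [2, 0], [2, 1], [1, 2],
     [5, 5], [5, 6], [6, 5]]
y = [0] * 8 + [1] * 3

# class_weight='balanced' reweights errors inversely to class frequency,
# so the minority (defective) class is not ignored by the margin.
clf = SVC(kernel='linear', class_weight='balanced')
clf.fit(X, y)

print(clf.predict([[0.5, 0.5], [5.5, 5.5]]))  # expected: [0 1]
```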
In ELM, the number of defects is identified and the TP and TN counts are computed. The original imbalanced data set is transformed into new data sets, the ROC curve is used for evaluation, and hidden or noisy parameters are removed. The imbalance rate is calculated from the number of defective samples (n1) and the number of nondefective samples (n2). The classification and transformation tasks are validated by running extensive tests on imbalanced classification data sets with various class ratios from the PROMISE data library. To explore the impact of class imbalance on the performance of prediction models, that is, how the performance of each prediction model changes as the classification becomes imbalanced, a set of data sets with different imbalance rates is needed. Therefore, this paper designs a new data set construction algorithm that transforms the original imbalanced data set into new data sets with successively increasing imbalance rates. The specific process is shown in Algorithm 1.
Input: DataSet — the original unbalanced data set
Output: NewDataSets — the constructed new data sets
1. Divide DataSet into DefectSet and NonDefectSet according to the category label;
2. Number of defective samples n1 = DefectSet.Size();
3. Number of nondefective samples n2 = NonDefectSet.Size();
4. Imbalance rate r = Math.floor(n2/n1);
5. Form newDataSet = DefectSet;
6. Form restNonDefectSet = NonDefectSet;
7. WHILE restNonDefectSet != null
8.   Randomize restNonDefectSet;
9.   IF restNonDefectSet.Size() >= 2*n1
10.    Choose n1 samples randomly from restNonDefectSet and add them to newDataSet;
11.    Remove the selected samples from restNonDefectSet;
12.  ELSE
13.    Add the remaining samples in restNonDefectSet to newDataSet;
14.    restNonDefectSet = null;
15.  END IF
16.  Save a copy of newDataSet;
17. END WHILE
18. RETURN all saved data sets as NewDataSets;
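As an illustration, Algorithm 1 can be sketched in Python as follows (a minimal rendering; samples are represented abstractly as list elements):

```python
import random

def build_new_datasets(defect_set, nondefect_set, seed=None):
    """Algorithm 1: repeatedly move n1 nondefective samples (or the final
    remainder) into a growing data set that starts from all defective
    samples, saving a snapshot after each step. Each successive snapshot
    has a higher imbalance rate than the previous one."""
    rng = random.Random(seed)
    n1 = len(defect_set)
    new_data_set = list(defect_set)
    rest = list(nondefect_set)
    snapshots = []
    while rest:
        rng.shuffle(rest)                       # randomize the remainder
        if len(rest) >= 2 * n1:
            chosen, rest = rest[:n1], rest[n1:]  # take n1 random samples
        else:
            chosen, rest = rest, []              # take everything left
        new_data_set.extend(chosen)
        snapshots.append(list(new_data_set))     # save the current data set
    return snapshots

# Example: 3 defective vs. 10 nondefective samples.
sets = build_new_datasets(['d1', 'd2', 'd3'],
                          [f'n{i}' for i in range(10)], seed=0)
print([len(s) for s in sets])  # sizes 6, 9, 13: imbalance rates 1, 2, 3
```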
The performance analysis was conducted against KNN, NB, and BPN. The proposed method selects two imbalanced classification data sets, whose basic information is shown in
Data set   Language   Features   Instances   Defective   Nondefective   Defect rate (%)   Imbalance rate
Jedit4.3   Java       20         492         11          481            2.24              43
Tomcat     Java       20         858         77          781            8.97              10
The prediction result is jointly determined by the data set and the prediction model. A given data set includes many software features, and when feature selection is not performed, all of these features are used to train the prediction model. All features in the above data sets are therefore used to train the machine learning prediction models. To explore the differences in performance stability across prediction models, KNN, NB, and BPN are compared with SVM and ELM in the performance analysis ^{7}.
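Since the paper names ELM but does not detail its configuration, the following is a minimal single-hidden-layer ELM sketch (assuming NumPy; the hidden-layer size, tanh activation, and 0.5 decision threshold are illustrative assumptions). The random hidden weights stay fixed, and only the output weights are solved in closed form by least squares:

```python
import numpy as np

def elm_train(X, y, n_hidden=50, seed=0):
    """Fit an ELM: random fixed hidden weights, least-squares output weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))   # random input weights
    b = rng.normal(size=n_hidden)                 # random biases
    H = np.tanh(X @ W + b)                        # hidden-layer outputs
    beta = np.linalg.pinv(H) @ y                  # least-squares solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Binary decision by thresholding the network output at 0.5."""
    return (np.tanh(X @ W + b) @ beta > 0.5).astype(int)

# Toy separable problem: class 0 near the origin, class 1 near (3, 3).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
W, b, beta = elm_train(X, y)
acc = (elm_predict(X, W, b, beta) == y).mean()
```

Because no iterative training is needed, the fit is a single linear solve, which is what makes ELM fast compared with back-propagation networks such as BPN.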
A binary classification problem, such as fault-prone (positive) versus not fault-prone (negative), has four possible prediction outcomes: True Positive (TP), a correctly classified positive instance; False Positive (FP), a negative instance classified as positive; True Negative (TN), a correctly classified negative instance; and False Negative (FN), a positive instance classified as negative. These four values form the basis for several well-known performance measures commonly used for classifier evaluation. Overall Accuracy (OA) provides a single value that ranges from 0 to 1 and is calculated as OA = (TP + TN)/N, where N is the total number of instances in a data set. While overall accuracy facilitates model comparisons, it is not always a reliable metric, particularly in the presence of class imbalance. The area under the ROC (receiver operating characteristic) curve is a single-value measure that originated in the field of signal detection; the AUC ranges from 0 to 1. The ROC curve describes the trade-off between the True Positive Rate, TPR = TP/(TP + FN), and the False Positive Rate, FPR = FP/(FP + TN). A classifier with a larger area under the curve is preferable to one with a smaller area. The TPR is the ratio of correctly predicted positive cases to actual positive cases, that is, the ratio of correctly predicted defective samples to actual defective samples. The FPR is the ratio of false positive cases to actual negative cases, that is, the ratio of samples falsely predicted as defective to actual nondefective samples.
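The definitions above can be made concrete with a small helper for the confusion-count metrics, plus a rank-based AUC (equivalent to the area under the ROC curve, with ties counting as one half; the confusion counts in the example are hypothetical):

```python
def confusion_metrics(tp, fp, tn, fn):
    """Overall accuracy, TPR, and FPR from the four confusion counts."""
    n = tp + fp + tn + fn
    oa = (tp + tn) / n      # OA  = (TP + TN) / N
    tpr = tp / (tp + fn)    # TPR = TP / (TP + FN)
    fpr = fp / (fp + tn)    # FPR = FP / (FP + TN)
    return oa, tpr, fpr

def auc(scores_pos, scores_neg):
    """AUC as the probability that a randomly chosen positive sample
    outscores a randomly chosen negative one (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Example: 8 of 10 defective and 85 of 90 nondefective samples correct.
oa, tpr, fpr = confusion_metrics(tp=8, fp=5, tn=85, fn=2)
print(round(oa, 2), round(tpr, 2), round(fpr, 3))   # 0.93 0.8 0.056
print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))        # 8/9, about 0.889
```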
For a specific prediction model and training data set, the prediction result corresponds to a point on the ROC curve. By adjusting the classification threshold of the model, a curve passing through (0, 0) and (1, 1) is traced out, and the area under this curve is the AUC value. The AUC ranges from 0 to 1, and a value of 0.5 corresponds to the performance of a random-guessing model; the larger the AUC, the better the performance of the model. Therefore, the ROC curve of a good prediction model should pass as close as possible to the upper left corner of the coordinate system. Using the prediction models presented above, the experimental results of each model on the different data sets were obtained, including the mean μ, standard deviation σ, and coefficient of variation CV. The larger the CV, the more unstable the performance of the prediction model, i.e., the greater the impact of class imbalance on its performance.
Model      Jedit4.3 (μ / σ / CV)      Tomcat (μ / σ / CV)
KNN ^{7}   0.513 / 0.021 / 1.710      0.582 / 0.024 / 2.135
NB ^{7}    0.524 / 0.021 / 1.902      0.621 / 0.026 / 2.102
BPN ^{7}   0.581 / 0.022 / 1.903      0.624 / 0.053 / 2.821
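For reference, the coefficient of variation is the ratio of the standard deviation to the mean; a minimal computation looks like the sketch below (the AUC values are hypothetical, standing in for one model's results across data sets with a rising imbalance rate):

```python
import math

def coefficient_of_variation(values):
    """CV = sigma / mu, using the population standard deviation."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return sigma / mu

# Hypothetical AUC values of one model as the imbalance rate increases.
aucs = [0.70, 0.66, 0.61, 0.55]
print(round(coefficient_of_variation(aucs), 3))  # 0.089
```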
It can be seen from the results above that the performance of KNN, NB, and BPN becomes less stable as the imbalance rate increases, while the proposed SVM and ELM remain comparatively stable.
Finally, external factors such as the quality of the data sets may affect how well the methods used in this paper evaluate the stability of a prediction model's performance. Therefore, further experiments on data sets of different scales, defect rates, and imbalance rates are needed; all experimental data sets in this paper are real defect prediction data sets, and among the most commonly used ones, which helps ensure the validity and reliability of the prediction results.
This research study proposed a new model based on the machine learning techniques SVM and ELM to classify the imbalanced data in the PROMISE library and evaluate software defect prediction. The method transforms the original imbalanced data set into a set of new data sets with an increasing imbalance rate, and then applies typical prediction models to predict and evaluate each new data set. The experimental results show that the SVM and ELM defect prediction techniques outperform KNN, NB, and BPN in terms of imbalance rate and data classification: SVM scores 29% higher than KNN, up to 2.102, and ELM scores 19% higher than NB and BPN, up to 2.204, on Jedit, with 6.031 and 7.035 respectively on Tomcat. A limitation of the study is that prediction and classification accuracy may vary across data sets. In the future, the model can be improved to predict faulty data in large data sets in an optimized manner, as software quality and reliability are the main concerns in the modern software engineering era.