Software Effort Estimation (SEE) predicts the effort required to develop software, expressed in person-months or person-hours. Although several models have been proposed ^{1}, SEE remains one of the challenging tasks in successful software development. Initially, software effort estimates were produced using expert judgment, user stories, analogy-based estimation and the use case point approach. Later, various machine learning algorithms such as Linear regression, Logistic regression, Multiple linear regression, Stepwise regression, Ridge regression, Lasso regression, ElasticNet regression, Decision tree, Neural networks, Support vector machine, Random forest and Naïve Bayes were used for estimation. Ensemble approaches have also gained attention and produce more accurate predictions than individual algorithms. The following is a survey of the various models used for effort estimation.

According to ^{2}, expert estimation is based on a judgmental process. It draws on the experience and advice of experts, depending on how closely the new project matches projects they have previously completed. The techniques used for expert estimation are the Delphi technique and Work Breakdown Structure (WBS) ^{3}. ^{4} to avoid inaccurate estimations. ^{5}. The project manager and team work together in this kind of approach. Once the requirements of the present project are known, they search the database for a similar past project ^{6}. ^{7} that checks the functionality of the system from the user's point of view. Function point analysis is initially used to estimate size, from which effort can be estimated ^{8}. The use case point approach ^{9} is based on the use cases involved in the project. Use cases generally describe user interactions with the system. The actors or users are categorized and weighted based on the type of work they perform ^{10}.

Supervised machine learning techniques are widely used for estimation. They take input (independent) variables and predict an output (target) variable from them. The types of supervised learning are as follows:

Regression produces a continuous output (dependent) variable ^{11}. It describes the relationship between the output variable and one or more input variables ^{12}. Regression algorithms include Simple linear regression, Multiple linear regression, Logistic regression, Stepwise regression, Ridge regression, Lasso regression and ElasticNet regression.

Logistic regression ^{13} is of two kinds: binomial or multinomial. Binomial logistic regression has only two possible outcomes, such as yes or no, good or bad, true or false, or 1 or 0. Multinomial logistic regression has more than two possible categorical outcomes, such as poor, average, good, very good or excellent, or very small, small, big or very big. Logistic regression is based on the sigmoid (logistic) function.
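To illustrate the sigmoid function mentioned above, the following minimal Python sketch turns a weighted sum of inputs into a probability and then into a binomial (two-class) outcome. The function names, weights, bias and threshold are hypothetical values chosen for illustration; they do not come from the study, which used RStudio.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic (sigmoid) function: maps a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_binomial(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one sample.

    weights, bias and threshold are hypothetical; in practice the weights
    and bias would be fitted to training data.
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


# Hypothetical sample with two input variables.
prob, label = predict_binomial([2.0, 1.0], weights=[0.8, -0.4], bias=-0.5)
print(f"probability={prob:.3f}, label={label}")
```

Here the weighted sum is 0.7, so the probability is about 0.668 and the sample is assigned to class 1.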

Stepwise regression ^{13} performs well when there are multiple independent (input) variables. The main purpose of this technique is to maximize prediction accuracy using the minimum number of input variables or predictors. Ridge regression ^{13} uses the L2 regularization technique. Lasso regression ^{13} is similar to ridge regression but uses the L1 regularization technique to minimize the error between actual and predicted values. ElasticNet regression combines Ridge and Lasso regression: it uses both L1 and L2 regularization. This type of regression is used when there are many features and when they suffer from multicollinearity.
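The difference between the L1 and L2 penalties can be made concrete by writing out the objective each method minimizes. The notation below is an assumption introduced for illustration and does not appear in the source: $y_i$ is the actual effort of project $i$, $x_i$ its vector of input variables, $\beta$ the coefficient vector, and $\lambda, \lambda_1, \lambda_2 \ge 0$ the regularization strengths.

$$\text{Ridge (L2):}\quad \min_{\beta}\ \sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^{2}$$

$$\text{Lasso (L1):}\quad \min_{\beta}\ \sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2 + \lambda\sum_{j=1}^{p}\left|\beta_j\right|$$

$$\text{ElasticNet (L1 + L2):}\quad \min_{\beta}\ \sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2 + \lambda_1\sum_{j=1}^{p}\left|\beta_j\right| + \lambda_2\sum_{j=1}^{p}\beta_j^{2}$$

The L1 term can shrink some coefficients exactly to zero, which is why Lasso can be used for feature selection, whereas the L2 term only shrinks coefficients towards zero.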

Classification produces a discrete output (dependent) variable. Classification algorithms include Decision Tree Classifier, Support Vector Machines, Naïve Bayes classifier, K-Nearest Neighbors, Random Forest classifier and Neural Network. In a decision tree ^{14}, internal nodes represent attributes, branches represent decisions, and leaf nodes are the outcomes, which can be either categorical or continuous. Thus decision trees can be used for both classification and regression problems. ^{15}. It can be used for both regression and classification problems, but it is predominantly used for classification, either two-class or multi-class.

^{16}. It is well suited to large datasets. The basic principle of Naïve Bayes is that the features are assumed to be independent of one another, and Bayes' theorem is applied as $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$, where A and B are events. ^{17}. An advantage of this algorithm is that accuracy tends to increase as the number of trees increases. The decision tree (CART model) algorithm is the basis for the random forest algorithm ^{18}. ^{19} which includes an input layer, a hidden layer and an output layer. Error is corrected by adjusting the weights until it falls below a threshold value ^{20}. In ^{21}, effort was estimated using a hybrid multilayer perceptron that models the complex non-linear input-output relationship of a dataset.
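As a worked instance of the Bayes' theorem formula above, the following Python sketch computes a posterior probability from a prior and a likelihood. The events and all probability values are hypothetical, chosen only to make the arithmetic easy to follow; they are not taken from any of the datasets.

```python
def bayes_posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b <= 0.0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b


# Hypothetical events:
#   A = "project requires high effort"
#   B = "team has more than 10 members"
p_a = 0.3              # prior P(A)
p_b_given_a = 0.8      # likelihood P(B|A)
p_b_given_not_a = 0.2  # P(B|not A)

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)

posterior = bayes_posterior(p_b_given_a, p_a, p_b)
print(f"P(A|B) = {posterior:.4f}")
```

With these assumed numbers, P(B) is 0.38 and the posterior P(A|B) is about 0.63, so observing a large team roughly doubles the assumed prior of 0.3.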

Ensemble-based approaches have also become popular for effort estimation. Ensembles of machine learning algorithms ^{22} provide more accurate results than individual predictive machine learning algorithms. A hybrid of a fuzzy-based technique with function point analysis ^{23} and an ensemble of fuzzy logic with analogy-based estimation ^{24} have also drawn attention in effort estimation.
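One simple way to combine several estimators is to average their predictions. The sketch below assumes this averaging scheme purely for illustration; the source does not state which combination rule the cited ensembles use, and the three base models and the project attributes are hypothetical.

```python
from statistics import mean


def ensemble_predict(models, features):
    """Average the effort predicted by each base model (simple averaging ensemble)."""
    return mean(model(features) for model in models)


# Hypothetical base estimators, each returning effort in person-hours.
def size_model(f):
    return 5.0 * f["size_fp"]


def team_model(f):
    return 160.0 * f["team_size"] * f["duration_months"]


def flat_model(f):
    return 2000.0


project = {"size_fp": 300, "team_size": 3, "duration_months": 4}
estimate = ensemble_predict([size_model, team_model, flat_model], project)
print(f"ensemble estimate: {estimate:.1f} person-hours")
```

The individual predictions here are 1500, 1920 and 2000 person-hours, so the averaged estimate is about 1806.7. Other combination rules, such as weighted averaging or stacking, follow the same pattern with a different aggregation step.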

Benefits and limitations of the aforementioned methods are summarized in the following table.

| S. No. | Estimation methods | Benefits | Drawbacks |
|---|---|---|---|
| 1.1 | Expert Judgment | Inexpensive method. | Estimates can be accurate or inaccurate depending on the experience of the expert. |
| | Delphi Technique | It incurs less cost. Experts can work out the requirements of a future project from their experience of past projects. | Error-prone method. It often leads to overoptimistic estimates. |
| | Work Breakdown Structure | Project risks can be identified at earlier stages. Improves productivity. | Complex process. It uses a step-by-step approach. |
| 1.2 | Analogy based Estimation | This approach is simple and fast. | This approach is not always accurate. |
| 1.3 | User Stories | Story points are relative to the size of the project. | User stories differ between teams in a project. |
| 1.4 | Function point Estimation | Can be applied during the early stages of software development. Independent of any programming language. | Time-consuming and less accurate, as it is based on a judgmental approach. |
| 1.5 | Use case point approach | Use case points are good measures for size prediction. | Use cases are large units of work, and estimation can be done only when all use cases are written. |
| | Supervised machine learning techniques: Regression algorithms | | |
| 1.6 | Linear regression | It is the simplest method to find the relationship between two or more variables. | It can only capture linear relationships between the independent and dependent variables. |
| | Logistic regression | Used when the independent (input) variables are categorical and/or continuous. Efficient and easy to implement. | Only linear problems can be solved; non-linear problems cannot. |
| | Stepwise regression | Can handle a large number of independent (predictor or input) variables. | Only linear problems can be solved; non-linear problems cannot. |
| | Ridge regression | A larger number of independent variables can be used. | The model can be complex, which in turn can lead to poor performance. This method generally produces high bias. |
| | Lasso regression | Avoids overfitting and performs feature selection. | Often unstable; selection among highly correlated features is arbitrary. |
| | Elastic Net regression | Generally preferred over Ridge or Lasso regression. | Computational cost is high. |
| | Classification algorithms | | |
| 1.7 | Decision Tree Classifier | Simple method. Better suited to estimating categorical data. | Less accurate in prediction than other machine learning algorithms. |
| | Support Vector Machine | Can also be applied to unstructured or semi-structured data. Performs well even with many attributes. | Prediction takes longer on larger datasets. |
| | Naive Bayes Classifier | Easy to implement. Produces better results if the input variables are independent. | It always assumes that the input variables are independent, which is not always true. |
| | K Nearest Neighbor algorithm | Optimal for larger sample input. | Requires large storage. Sensitive to noise. |
| | Random Forest classifier | User-friendly and robust against overfitting. Can handle huge datasets. | Time-consuming and complex. It is a black-box approach. |
| | Artificial Neural Network | It can learn from previous data. Suitable for complex datasets. Suitable for linear and non-linear functions, and thus produces accurate predictions of software effort. | Slow convergence and overfitting can occur. |
| 1.8 | Ensemble approaches | They combine multiple models into a better aggregated model. | Computationally expensive. |

Machine learning algorithms considered for estimation are Multilinear Regression (MR), Random Forest (RF), Support Vector Machine (SVM), Decision Tree (DT), NeuralNet (NN), Ridge Regression (RR), Lasso Regression (LR) and ElasticNet Regression (ER).

The datasets considered for effort estimation are Desharnais, Maxwell, China and Albrecht ^{25}. Their source repositories, attributes and record counts are detailed in the following table.

Dataset Name | Source Repository | No. of Records | No. of Attributes | Output Attribute: Effort (Unit)
---|---|---|---|---
Dataset 1 - Desharnais | GitHub | 81 | 12 | Person-hours
Dataset 2 - Maxwell | PROMISE | 62 | 27 | Person-hours
Dataset 3 - China | PROMISE | 499 | 16 | Person-hours
Dataset 4 - Albrecht | PROMISE | 24 | 8 | Person-months

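i. Mean Absolute Error (MAE)

It is the average of the absolute errors ^{26}.

Prediction error = Actual value - Predicted value

Absolute error = |Prediction error|

MAE, the average of all absolute errors, is given by Eq. (1):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|X_{obs,i} - X_{model,i}\right| \qquad (1)$$

ii. Mean Squared Error (MSE)

It is the average of the squared errors ^{27} in the data set and is given by Eq. (2):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(X_{obs,i} - X_{model,i}\right)^{2} \qquad (2)$$

iii. Root Mean Square Error (RMSE)

It measures the standard deviation of the prediction errors ^{28} and is given by Eq. (3):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_{obs,i} - X_{model,i}\right)^{2}} \qquad (3)$$

In Eqs. (1) to (3), $X_{obs,i}$ is the $i$-th observed value, $X_{model,i}$ is the corresponding modelled value, and $n$ is the number of records.

iv. R-squared

It is also known as the coefficient of determination. The higher the value of R-squared, the better the model.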
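The four metrics described above can be computed directly from lists of actual and predicted values. The following Python sketch is illustrative only (the study itself used RStudio), and the sample values are hypothetical. It assumes the common definition of R-squared as one minus the ratio of the residual sum of squares to the total sum of squares, under which the value becomes negative when a model predicts worse than simply using the mean of the actual values.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE and R-squared for paired actual/predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]  # prediction errors

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual) # total sum of squares
    if ss_tot == 0.0:
        raise ValueError("R-squared is undefined when all actual values are equal")
    r_squared = 1.0 - ss_res / ss_tot

    return {"mae": mae, "mse": mse, "rmse": rmse, "r_squared": r_squared}


# Hypothetical effort values (person-hours).
actual = [100.0, 200.0, 300.0]
good = regression_metrics(actual, [110.0, 190.0, 310.0])
poor = regression_metrics(actual, [300.0, 100.0, 200.0])
print(good)
print(poor)
```

For the first set of predictions, MAE and RMSE are both 10 and R-squared is 0.985. For the second set, R-squared is -2.0, which shows how negative R-squared values can arise under this definition.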

The machine learning algorithms considered for estimation are Multilinear Regression (MR), Random Forest (RF), Support Vector Machine (SVM), Decision Tree (DT), NeuralNet (NN), Ridge Regression (RR), Lasso Regression (LR) and ElasticNet Regression (ER). The software used for estimation is RStudio. The datasets used for effort estimation are Desharnais, Maxwell, China and Albrecht. The metrics used for evaluation are Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Square Error (RMSE) and R-squared. The lower the values of MAE, MSE and RMSE, the better the model; the higher the R-squared value, the better the model.

Dataset 1 - Desharnais

Algorithms | MAE | MSE | RMSE | R-squared
---|---|---|---|---
Multilinear Regression | 2575.103 | 11499285 | 3391.06 | -1.12404
Random Forest | 2018.067 | 7465119 | 2732.237 | -0.3788865
Support Vector Machines | 1888.018 | 5576003 | 2361.356 | -0.02994683
Decision Tree | 2945.62 | 17502217 | 4183.565 | -2.232845
Neuralnet | 2024.692 | 5566429 | 2359.328 | -0.0281783
Ridge Regression | 2044.344 | 7283709 | 2698.835 | -0.3453782
Lasso Regression | 2562.611 | 11377363 | 3373.035 | -1.101519
ElasticNet Regression | 2562.606 | 11375202 | 3372.714 | -1.10112

Dataset 2 - Maxwell

Algorithms | MAE | MSE | RMSE | R-squared
---|---|---|---|---
Multilinear Regression | 6200.33 | 63759737 | 7984.969 | 0.1710158
Random Forest | 3593.95 | 23507201 | 4848.423 | 0.6943667
Support Vector Machines | 4276.059 | 38859106 | 6233.707 | 0.494766
Decision Tree | 4328.487 | 44327679 | 6657.903 | 0.4236654
Neuralnet | 6482.574 | 78104934 | 8837.7 | -0.01549594
Ridge Regression | 3895.332 | 22003377 | 4690.776 | 0.713919
Lasso Regression | 3273.256 | 22017498 | 4692.281 | 0.7137354
ElasticNet Regression | 3113.22 | 20578229 | 4536.323 | 0.7324483

Dataset 3 - China

Algorithms | MAE | MSE | RMSE | R-squared
---|---|---|---|---
Multilinear Regression | 427.1405 | 1448957 | 1203.726 | 0.9640306
Random Forest | 574.369 | 6345708 | 2519.069 | 0.8424721
Support Vector Machines | 1070.117 | 15775971 | 3971.898 | 0.6083722
Decision Tree | 857.3184 | 4579875 | 2140.064 | 0.8863077
Neuralnet | 3553.69 | 40430223 | 6358.476 | -0.003652838
Ridge Regression | 645.274 | 1420046 | 1191.657 | 0.9647483
Lasso Regression | 330.683 | 293412.4 | 541.6755 | 0.9927162
ElasticNet Regression | 345.1585 | 315410.5 | 561.6142 | 0.9921701

Dataset 4 - Albrecht

Algorithms | MAE | MSE | RMSE | R-squared
---|---|---|---|---
Multilinear Regression | 8.375672 | 87.21542 | 9.33892 | 0.1998402
Random Forest | 4.626463 | 32.98749 | 5.743474 | 0.6973555
Support Vector Machines | 3.581059 | 15.27448 | 3.908258 | 0.940334
Decision Tree | 15.36563 | 261.2114 | 16.16204 | -1.39649
Neuralnet | 12.37725 | 212.2981 | 14.57045 | -0.9477339
Ridge Regression | 5.77861 | 45.74711 | 6.763661 | 0.5802921
Lasso Regression | 6.214668 | 53.05237 | 7.283706 | 0.5132699
ElasticNet Regression | 5.803795 | 49.15886 | 7.011338 | 0.5489909

Software Effort Estimation (SEE) predicts the effort, in person-hours or person-months, required for software development. It is difficult to forecast effort during the initial stages because of uncertainties. Estimates serve as input to pricing, project planning, iteration planning, and budget and investment analysis. This study compares machine learning algorithms, namely Multilinear Regression, Ridge Regression, Lasso Regression, ElasticNet Regression, Random Forest, Support Vector Machine, Decision Tree and NeuralNet, on the Desharnais, Maxwell, China and Albrecht datasets. The evaluation metrics considered are Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Square Error (RMSE) and R-squared. Based on this comparative study, Support Vector Machine (SVM) is found to outperform the other algorithms.