The Indian Premier League (IPL) is a Twenty20 (T20) cricket league in India in which eight teams, representing eight Indian cities, play against each other. It is India's biggest cricket festival, the most celebrated and the most viewed, and the action is not limited to the cricket field: the glamour, promotional events, cheerleaders, advertisements, fan clubs, interactions, and betting are celebrated along with the players and the matches.
The entire revenue cycle of the IPL revolves around advertising. The IPL also utilizes television timeouts, and there are other enormous opportunities associated with advertising. Apart from national and global broadcasts, the matches are transmitted to regional channels in eight different languages. The brand value of the IPL was ₹475 billion (US$6.7 billion) in 2019
"Due to the saturated market, it is especially important for sports organizations to function with maximum efficiency and to make smart business decisions"
In this paper, machine learning models were developed to predict the outcomes of IPL matches.
During the research, a multi-step approach was taken to gather and pre-process the historical data. Feature engineering
Many researchers have contributed towards predicting the results of cricket matches. Authors
Although not related to cricket match prediction, the authors conducted a study
The historical dataset was obtained from various sources – Kaggle
Class imbalance is a problem in machine learning that arises when the class distribution of the training data is highly skewed
A few assumptions were followed to make the model accurate and robust. The owners changed the names of a few teams due to legal actions or a change in ownership; however, the players and team dynamics did not change. Delhi Daredevils was renamed Delhi Capitals, Deccan Chargers was succeeded by Sunrisers Hyderabad, and Pune Warriors by Rising Pune Supergiant. In these cases, each was treated as the same team irrespective of the change in name. Moreover, only the data of the 11 players per team who had played the highest number of IPL matches was considered.
Refer
Player | Mat | Inns | NO | Runs | HS | Ave | BF | SR | 100 | 50 | 4's | 6's | Overs | Mdns | Wkts | Econ | Ct | St

a. Features extracted from ESPN Cricinfo
Team | Won | Lost | Tied | Pts

b. Features extracted from IPL T20
Season | City | Team_1 | Team_2 | Toss_Winner | Toss_Decision | Winner

c. Features extracted from IPL T20
Since the algorithms cannot interpret string values, label encoding was applied to the above three features, as follows:
1. City: If the match is played on the home ground of Team 1, the city value is taken as zero. If the match is played on the home ground of Team 2, then the city's value is taken as 1, and if the match is played in some other city, then the city value is taken as 2.
2. Toss Winner: If the Toss is won by Team 1, the Toss Winner value is taken as zero. If the Toss is won by Team 2, the Toss Winner value is taken as 1.
3. Toss Decision: If the Toss winner chooses to Bat, the value of the Toss Decision is taken as zero, and if the Toss winner chooses to Bowl, then the value of the Toss Decision is taken as 1.
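The three encodings above can be sketched as simple functions; the team names and home-ground mapping here are illustrative, not from the paper's dataset.

```python
# Sketch of the label encoding described above. The home-ground mapping
# is an illustrative assumption, not the paper's actual lookup table.

def encode_city(city, team1, team2, home_ground):
    """0 if played at Team 1's home ground, 1 if at Team 2's, else 2."""
    if city == home_ground.get(team1):
        return 0
    if city == home_ground.get(team2):
        return 1
    return 2

def encode_toss_winner(toss_winner, team1):
    """0 if Team 1 won the toss, else 1."""
    return 0 if toss_winner == team1 else 1

def encode_toss_decision(decision):
    """0 if the toss winner chose to bat, 1 if to bowl."""
    return 0 if decision == "bat" else 1

home_ground = {"CSK": "Chennai", "MI": "Mumbai"}
print(encode_city("Mumbai", "CSK", "MI", home_ground))  # 1
print(encode_toss_winner("CSK", "CSK"))                 # 0
print(encode_toss_decision("bowl"))                     # 1
```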
for all players p ∈ P do
    Φ ← Φ(p)
    u ← (1 × ΦRuns_Scored) + (1 × Φnum_4s) + (2 × Φnum_6s) + (8 × Φfifties) + (16 × Φhundreds) − (2 × Φfduck)
    if Φbat_strike_rate ≤ 50 then
        w ← v × Φbat_strike_rate × Φbat_innings    (v: the strike-rate weight from the points tables in the Appendix)
    y ← u + w
    ΦBatting_Score ← y
end for
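A minimal sketch of the batting tally above, assuming v is the strike-rate weight from the strike-rate points table in the Appendix; the sample player is illustrative.

```python
# Sketch of the batting-score loop above. The tiered strike-rate weight
# follows the Appendix points table (minimum 10 balls faced); the sample
# player dict is illustrative, not from the paper's data.

def strike_rate_weight(sr, balls_faced):
    """Strike-rate points per the Appendix table."""
    if balls_faced < 10:
        return 0
    if sr < 50:
        return -6
    if sr < 60:
        return -4
    if sr <= 70:
        return -2
    return 0

def batsman_score(p):
    # Base tally u, term by term as in the pseudocode
    u = (1 * p["runs"] + 1 * p["num_4s"] + 2 * p["num_6s"]
         + 8 * p["fifties"] + 16 * p["hundreds"] - 2 * p["ducks"])
    v = strike_rate_weight(p["strike_rate"], p["balls_faced"])
    # Strike-rate adjustment w, applied only when SR <= 50 per the pseudocode
    w = v * p["strike_rate"] * p["bat_innings"] if p["strike_rate"] <= 50 else 0
    return u + w

player = {"runs": 420, "num_4s": 40, "num_6s": 15, "fifties": 3,
          "hundreds": 1, "ducks": 1, "strike_rate": 135.0,
          "balls_faced": 311, "bat_innings": 14}
print(batsman_score(player))  # 420 + 40 + 30 + 24 + 16 - 2 = 528
```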
for all players p ∈ P do
    Φ ← Φ(p)
    u ← (25 × Φwickets) + (8 × Φctchs) + (12 × Φstmp) + (8 × Φ4_wicket_haul) + (16 × Φ5_wicket_haul) + (8 × Φmaidens)
    if 5 < Φbowl_economy ≤ 6 then
        w ← v × Φbowl_economy × Φbowl_innings    (v: the economy weight from the points tables in the Appendix)
    y ← u + w
    ΦBowling_Score ← y
end for
for all players p ∈ P do
    ΦTotal_Strength ← (ΦBowling_Score + ΦBatting_Score) / Φtot_matches
    ΦBatsman_Strength ← ΦBatting_Score / Φbat_innings
    ΦBowling_Strength ← ΦBowling_Score / Φbowl_innings
end for
for all players p ∈ P, ΦTotal_Strength do
    ΦTeam_Strength ← (
    ΦTeam_Batting_Strength ← (
    ΦTeam_Bowling_Strength ← (
end for
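The team-level formulas are truncated in the source, so the sketch below assumes a team's strength aggregates the per-player strengths of its XI by summation; the sample squad numbers are illustrative.

```python
# Sketch of the strength aggregation above. Per-player strengths follow
# the loop in the text; summing them over the XI to get team-level
# strengths is an assumption, since those formulas are truncated.

def player_strengths(bat_score, bowl_score, tot_matches, bat_inns, bowl_inns):
    total = (bat_score + bowl_score) / tot_matches
    batting = bat_score / bat_inns if bat_inns else 0.0
    bowling = bowl_score / bowl_inns if bowl_inns else 0.0
    return total, batting, bowling

# Illustrative squad: (bat_score, bowl_score, matches, bat_inns, bowl_inns)
squad = [(528, 0, 14, 14, 0), (310, 260, 14, 12, 14), (40, 610, 14, 5, 14)]
totals = [player_strengths(*p) for p in squad]
team_strength = sum(t for t, _, _ in totals)
team_batting_strength = sum(b for _, b, _ in totals)
team_bowling_strength = sum(w for _, _, w in totals)
print(round(team_strength, 2))  # 124.86
```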
For a particular year, Team Strength represents the previous year's performance, whereas the Cumulative Team Strength signifies the mean of the Team Strength of all the last years. For example – for the Mumbai Indians in 2016, the Strength will be the 2015 strength, and Cumulative Strength will be the mean of the Strength from 2008 to 2015. From this section, eight significant features were collected, mentioned below:
1. |
2. |
3. |
4. |
5. |
6. |
7. |
8. |
For Dream 11 strength feature distribution, refer to
Different measures highlight different aspects of a player's ability, which makes some features more essential than others. For example, the strike rate is a necessary feature, especially in T20: with fewer overs, it adds to a team's ability to score maximum runs. In this research, the features were weighted according to their importance relative to the other measures. The Analytic Hierarchy Process (AHP) was adopted to determine these weights for each player's batting and bowling features. Besides, weights for each team were calculated based on their past performance.
The Analytic Hierarchy Process is a method for decision-making in complex conditions in which many variables or criteria are considered in prioritizing and selecting options
Priority Order: The attributes were arranged in their decreasing order of importance based on the knowledge and experience from the T20 cricket matches, as below:
Subsequently, a matrix was created to compare the importance of each attribute. Refer to
Batting | Average | INN | SR | 50's | 100's | 0's
Average | 1 | 2 | 3 | 5 | 6 | 7
INN | 0.5 | 1 | 2 | 4 | 5 | 6
SR | 0.333333 | 0.5 | 1 | 3 | 4 | 5
50's | 0.2 | 0.25 | 0.333333 | 1 | 2 | 3
100's | 0.166667 | 0.2 | 0.25 | 0.5 | 1 | 2
0's | 0.142857 | 0.166667 | 0.2 | 0.333333 | 0.5 | 1
Finally, the weight of each attribute was noted: Batting Average: 0.3887, Innings: 0.2601, Strike Rate: 0.1754, Fifties: 0.0834, Centuries: 0.0550, Zeros: 0.0373. Using these values, the batting strength through AHP was calculated.
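The weights above can be approximated directly from the pairwise-comparison matrix. The sketch below uses the geometric-mean (row-mean) method; the paper's exact computation is not stated, so the result matches the published weights only approximately.

```python
# Deriving AHP priority weights from the batting pairwise-comparison
# matrix, via the geometric-mean approximation of the principal
# eigenvector (the paper's exact method is an assumption here).
import numpy as np

A = np.array([
    [1,   2,    3,    5,    6,   7],   # Average
    [1/2, 1,    2,    4,    5,   6],   # INN
    [1/3, 1/2,  1,    3,    4,   5],   # SR
    [1/5, 1/4,  1/3,  1,    2,   3],   # 50's
    [1/6, 1/5,  1/4,  1/2,  1,   2],   # 100's
    [1/7, 1/6,  1/5,  1/3,  1/2, 1],   # 0's
])

gm = A.prod(axis=1) ** (1.0 / A.shape[1])  # row geometric means
weights = gm / gm.sum()                    # normalize to sum to 1
print(np.round(weights, 4))  # close to the published 0.3887, 0.2601, ...
```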
Priority Order: The attributes were arranged in their decreasing order of importance based on the knowledge and experience from the T20 cricket matches, as below:
Subsequently, a matrix was created to compare the importance of each attribute. Refer to
Finally, the weight of each attribute was noted: Overs: 0.4174, Economy: 0.2634, Wickets: 0.1602, Bowling Average: 0.0975, Bowling Strike Rate: 0.067862, 4-Wicket Haul: 0.0615. Using these values, the bowling strength through AHP was calculated.
AHP_bowl = 0.387509 × Overs + 0.281308 × Economy + 0.158765 × Wickets + 0.073609 × Bowling Average + 0.067862 × Bowling Strike Rate + 0.030947 × 4W Haul
Bowling | Overs | Economy | Wickets | Bowling Avg | Bowling SR | 4W Haul
Overs | 1 | 2 | 4 | 6 | 6 | 7
Economy | 0.5 | 1 | 4 | 5 | 5 | 6
Wickets | 0.25 | 0.25 | 1 | 4 | 4 | 6
Bowling Avg | 0.166666 | 0.2 | 0.25 | 1 | 1 | 5
Bowling SR | 0.166666 | 0.2 | 0.25 | 1 | 1 | 4
4W Haul | 0.142857 | 0.166666 | 0.166666 | 0.2 | 0.25 | 1
From this section, six essential features were formed, mentioned below:
1. |
2. |
3. |
4. |
5. |
6. |
For AHP Strength Feature Distribution, refer to
Using the AHP, the coefficients for the win rate of each team against the others were derived. Assumption: the KTK (Kochi Tuskers Kerala) and GL (Gujarat Lions) teams were dropped while calculating the weights, as they never played against each other.
Rank | CSK | DD | KKR | KXIP | MI | RCB | RPS | RR | SRH
CSK | 1 | 2.5 | 1.857143 | 1.333333 | 0.6875 | 2.142857 | 2 | 2 | 2.142857
DD | 0.4 | 1 | 0.769231 | 0.642857 | 1 | 0.571429 | 1.25 | 0.72727 | 1
KKR | 0.538462 | 1.3 | 1 | 2.125 | 0.315789 | 1.4 | 3.5 | 1 | 1.888889
KXIP | 0.75 | 1.555556 | 0.470588 | 1 | 0.846154 | 1 | 1 | 0.9 | 0.846154
MI | 1.454545 | 1 | 3.166667 | 1.181818 | 1 | 1.777778 | 1.4 | 1 | 1.181818
RCB | 0.466667 | 1.75 | 0.714286 | 1 | 0.5625 | 1 | 3.5 | 0.7 | 0.785714
RPS | 0.5 | 0.8 | 0.285714 | 1 | 0.714286 | 0.285714 | 1 | 0.25 | 0.666667
RR | 0.5 | 1.375 | 1 | 1.111111 | 1 | 1.428571 | 4 | 1 | 1.5
SRH | 0.466667 | 1 | 0.529412 | 1.181818 | 0.846154 | 1.272727 | 1.5 | 0.66666 | 1
Further, the yearly ranks of each team based on the win ratios were noted, and the ranks were derived using AHP. Refer to
Teams | RPS | DD | SRH | GL | KTK | RCB | KXIP | RR | KKR | MI | CSK
Coefficients | 0.6043 | 0.8090 | 0.9042 | 1 | 1 | 1 | 0.9397 | 1.272 | 1.277 | 1.5188 | 1.6931
Ranks | 9 | 8 | 7 | 5 | 5 | 5 | 6 | 4 | 3 | 2 | 1
For KTK and GL, the mean value (1) was taken as the coefficient, and two features were formed from this section, as below:
For AHP Rank Feature Distribution, refer to
For a cricket match, the win rate almost determines the overall performance of a team. A team that continuously wins matches against other teams is in good form, and the probability of it winning upcoming matches is higher. On the other hand, a losing team is out of form and may lose further games.
As a next step, the entire list of IPL matches played every year by each team from 2008 to 2019 was crawled. If two teams played against each other for the first time, the win rate was reset to 0 for both teams. Subsequently, all the played matches were checked and the winners of those encounters were noted, which defined a ratio for each team. For a match, the team's past win-rate ratio was taken as below:
Φwin_rate(Match R) = (Total number of wins up to match R−1) / (Total number of matches played up to match R−1)
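The running win rate above can be sketched as follows; the match list is illustrative.

```python
# Sketch of the pre-match win-rate feature: for match R, only matches
# before R count. The example results list is illustrative.

def win_rates(results, team):
    """Return the pre-match win rate of `team` before each of its matches.
    `results` is the ordered list of winners of the team's matches."""
    wins = played = 0
    rates = []
    for winner in results:
        rates.append(wins / played if played else 0.0)  # 0 before any match
        played += 1
        if winner == team:
            wins += 1
    return rates

print(win_rates(["MI", "CSK", "MI", "MI"], "MI"))
# [0.0, 1.0, 0.5, 0.6666666666666666]
```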
Two important features from this section were derived, as below:
For Win Rate Feature Distribution, refer to
The IPL is a league tournament based on a point system. Every year, two teams play against each other twice before entering the semi-final stage, if not eliminated. The point table comprises teams, matches won/lost/tied, and net run rate. Teams were ranked according to their points, and past-performance features were fed to the model for predicting the results. Four significant features were formed from this section, as below:
1. |
2. |
3. |
4. |
For a particular year, Team Point represents the previous year's performance, whereas the Cumulative Team Point represents the mean of the points of all the previous years.
For Team Point Feature Distribution, refer to
The consistency of a team gives more weightage to its current performance than to its overall performance. Therefore, 80 percent weightage was allotted to a team's current performance and 20 percent to its overall performance.
Two features were formed from this section, mentioned below:
For Consistency Feature Distribution, refer to
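The 80/20 weighting above can be written as a one-line score; the performance inputs here are illustrative.

```python
# Sketch of the consistency feature: 80% weight on current performance,
# 20% on overall performance. Input values are illustrative.

def consistency_score(current, overall):
    return 0.8 * current + 0.2 * overall

print(round(consistency_score(0.75, 0.50), 2))  # 0.7
```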
The individual strength of a team represents how strong the team is based on its stats. However, various other factors impact a team's winning, for example, the sequence in which a team plays, its performance as a unit, and audience sentiment. This information was captured by multiplying the team's strength with its previous win rate.
Four features were derived from this section, mentioned below:
1. |
2. |
3. |
4. |
For Win Strength Feature Distribution, refer to
With all the formulated Base and Intersection features, Transformed features were developed. These features were created by subtracting two base or intersection features of the same category. For example, Team1_Team_Strength is subtracted from Team2_Team_Strength to create a new feature.
Since many new features were created for the model based on the base and intersection features, multicollinearity
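A sketch of the differencing step, together with a simple correlation screen for multicollinearity; the feature values and the 0.9 threshold are assumptions, not the paper's exact criterion.

```python
# Sketch of the transformed-feature step (Team2 minus Team1 base feature)
# and a pairwise-correlation screen for multicollinearity. Values and the
# 0.9 cutoff are illustrative assumptions.
import numpy as np

team1_strength = np.array([120.0, 98.5, 110.2])
team2_strength = np.array([101.3, 115.0, 95.7])

# Transformed feature: difference of the two base features
diff_strength = team2_strength - team1_strength

# Flag highly correlated feature pairs (candidates to drop)
features = np.vstack([team1_strength, team2_strength, diff_strength])
corr = np.corrcoef(features)
pairs = [(i, j) for i in range(len(corr)) for j in range(i + 1, len(corr))
         if abs(corr[i, j]) > 0.9]
print(np.round(diff_strength, 1))  # [-18.7  16.5 -14.5]
```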
As per the primary assumption, every team's performance is independent of the opposition team, the toss decision, home-field advantage, and progress into the series. This allowed independent team features to be created that are present for both TEAM1 and TEAM2. The generated features can be broadly bucketed into Match and Team features. As there are similar features for both TEAM1 and TEAM2, symmetry in the dataset was observed (Refer to
Team1 | Team2 | Team1_Strength | Team2_Strength | Winner | Winning Team
CSK | MI | X | Y | 1 | CSK
It is apparent to a human that switching TEAM1 with TEAM2 leaves the result unchanged. However, a machine learning model is asymmetric in nature: it is neither capable of identifying the symmetry of features nor has a way to take that symmetry as input. Hence, this information was fed to the model by generating a symmetric duplicate of every row in the training set (Refer to
Team1 | Team2 | Team1_Strength | Team2_Strength | Winner
CSK | MI | X | Y | 1
MI | CSK | Y | X | 0
The following steps were taken to build the train and test sets:
The original dataset was split using train_test_split from sklearn
The training set was then mirrored as shown above and appended to the original training set, doubling its size
The test set was also mirrored, but the mirrored rows were not appended; instead, the original and mirrored versions were kept as two separate test sets
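The mirroring step above can be sketched by swapping the TEAM1/TEAM2 columns and flipping the winner label; column names follow the example table.

```python
# Sketch of the row-mirroring step: swap the Team1/Team2 columns and
# flip the winner label. The single-row train set is illustrative.

def mirror_row(row):
    m = dict(row)
    m["Team1"], m["Team2"] = row["Team2"], row["Team1"]
    m["Team1_Strength"], m["Team2_Strength"] = (row["Team2_Strength"],
                                                row["Team1_Strength"])
    m["Winner"] = 1 - row["Winner"]
    return m

train = [{"Team1": "CSK", "Team2": "MI",
          "Team1_Strength": "X", "Team2_Strength": "Y", "Winner": 1}]
augmented = train + [mirror_row(r) for r in train]  # doubles the train set
print(augmented[1])
# {'Team1': 'MI', 'Team2': 'CSK', 'Team1_Strength': 'Y',
#  'Team2_Strength': 'X', 'Winner': 0}
```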
The mirroring of the rows only tells the model that a symmetric scenario exists; the model still interprets the mirrored rows as new training rows completely unrelated to the originals. This asymmetric nature of the model leads to ambiguity in the results for certain rows (Refer to
Test Set | Team1 | Team2 | Winner | Prediction | Predicted Team
1 | KKR | KXIP | 1 | 1 | KKR
2 | KXIP | KKR | 0 | 1 | KXIP
The model was tested for a given match in two configurations, which it interprets as two different test cases. As a result, the model sometimes returns different predictions for the same match. Such an occurrence is called Model Ambiguity. Note: this occurrence is not an incorrect prediction, as the prediction will be counted correct in either the test set 1 accuracy or the test set 2 accuracy.
To tackle this phenomenon of Model Ambiguity, the model was evaluated using five parameters apart from just training and test accuracy:
Training Accuracy: % of correct predictions in the mirrored and merged train set
Test 1 Accuracy: % of correct predictions in the original test set
Test 2 Accuracy: % of correct predictions in the mirrored test set
Real Test Accuracy: % of correct predictions after discounting the ambiguous rows
Ambiguity: % of rows in which ambiguity is observed
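The ambiguity and Real Test Accuracy metrics above can be computed by comparing predictions on the original and mirrored test sets; treating ambiguous rows as incorrect is an assumption about how their scores are discounted.

```python
# Sketch of the ambiguity-aware evaluation: a row is ambiguous when the
# predictions on the original and mirrored test rows disagree (after
# flipping the mirrored label back). Ambiguous rows count as incorrect
# here, which is an assumption about the paper's "discounting".

def ambiguity_and_real_accuracy(y_true, pred_orig, pred_mirr):
    n = len(y_true)
    ambiguous = sum(p != (1 - q) for p, q in zip(pred_orig, pred_mirr))
    correct = sum(p == (1 - q) and p == t
                  for t, p, q in zip(y_true, pred_orig, pred_mirr))
    return 100 * ambiguous / n, 100 * correct / n

y_true   = [1, 0, 1, 1]
p_orig   = [1, 0, 0, 1]   # predictions on the original test set
p_mirror = [0, 1, 0, 0]   # predictions on the mirrored test set
amb, real_acc = ambiguity_and_real_accuracy(y_true, p_orig, p_mirror)
print(amb, real_acc)  # 25.0 75.0
```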
The objective of hyperparameter tuning was to maximize real test accuracy by driving down the ambiguity while evaluating the overfitting of the model using training accuracy and test 1 & 2 accuracies.
It was noted that the accuracy differs considerably when the random state is changed, because the training and testing datasets are split randomly based on that state. To prevent such a scenario and to make the model robust, RepeatedStratifiedKFold
The model was evaluated using accuracy and Standard Deviation, Cohen Kappa
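A sketch of this evaluation setup with scikit-learn's RepeatedStratifiedKFold, which preserves the class ratio in every fold and repeats the split to average out the random-state effect; the dataset, model, and fold counts here are illustrative.

```python
# Sketch of repeated stratified cross-validation. The synthetic data,
# the LogisticRegression model, and the 5x10 fold scheme are
# illustrative assumptions, not the paper's exact configuration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=1)
model = LogisticRegression(max_iter=400)

scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
print(f"accuracy {scores.mean():.3f} ± {scores.std():.3f}")
```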
Eight supervised algorithms were selected to train the derived model:
A Real Test Accuracy of 58.233% with a standard deviation of 5.5% and an ambiguity of 3.0% was derived (Refer to
Ambiguity | Real Test Accuracy | Train Accuracy | Cohen Kappa score
3.008 ± 2.5 % | 58.233 ± 5.5 % | 60.757 ± 0.9 % | 0.1891
The Area Under the Curve is 0.63. The ROC curve was plotted with the best result using Naïve Bayes. The distribution of the Real Test Accuracy was plotted to derive its skewness and kurtosis (Refer to
Kurtosis of the Real Test Accuracy is -0.7954
Skewness of the Real Test Accuracy: -0.2357
The model was tuned with over 1232 combinations. Refer to Appendix C (a). The best results derived: Real Test Accuracy of 57.78% with a standard deviation of 5.8% and ambiguity of 2.2 % (Refer
penalty | l2
solver | liblinear
max_iter | 400
tol | 1
C | 2

Ambiguity | 2.199455 ± 1.4 %
Real Test Accuracy | 57.77618 ± 5.8 %
Train Accuracy | 61.11655 ± 3.2 %
Cohen Kappa score | 0.1351
Further, the ROC curve for the best result was plotted and an AUC value of 0.57 was derived. The Real Test Accuracy distribution was plotted to derive its kurtosis and skewness. Refer
Kurtosis of the Real Test Accuracy: 0.5892
Skewness of the Real Test Accuracy: 1.4699
The model was tuned with over 25 combinations. Refer to Appendix C (b). Real test accuracy of 58.416% with a standard deviation of 5.69% and ambiguity of 0.24% was derived (Refer to
C | 0.1
Gamma | 0.001
Kernel | rbf

Ambiguity | 0.24 ± 0.3 %
Real Test Accuracy | 58.416 ± 5.69 %
Train Accuracy | 61.13 ± 4.2 %
Cohen Kappa score | 0.1921
The Area Under the Curve is 0.72. The ROC curve was plotted with the best result from Support Vector Machines. The distribution of the Real Test Accuracy was plotted to derive its skewness and kurtosis (Refer to
Kurtosis of the Real Test Accuracy is 1.6979
Skewness of the Real Test Accuracy: 0.4171
The model was tuned with over 300 combinations. Refer to Appendix C (c). Real test accuracy of 53.472% with a standard deviation of 5.2% and ambiguity of 1.90% was derived (Refer to
n_neighbors | 15
weights | uniform
metric | manhattan
leaf_size | 20

Ambiguity | 1.900 ± 1.1 %
Real Test Accuracy | 53.472 ± 5.2 %
Train Accuracy | 62.043 ± 4.8 %
Cohen Kappa score | 0.1634
The Area Under the Curve is 0.81. The ROC curve was plotted with the best result using KNN. The distribution of the Real Test Accuracy was plotted to derive its skewness and kurtosis (Refer to
Kurtosis of the Real Test Accuracy is 0.0502
Skewness of the Real Test Accuracy: -0.3635
The model was tuned with over 56 combinations. Refer to Appendix C (d). The best result with the corresponding hyper-parameters was derived: a Real Test Accuracy of 60.035% with a standard deviation of 6.2% and an ambiguity of 0.4% (Refer to
learning_rate | 0.01
n_estimators | 150

Ambiguity | 0.402 ± 0.9 %
Real Test Accuracy | 60.035 ± 6.2 %
Train Accuracy | 62.127 ± 0.9 %
Cohen Kappa score | 0.194
Further, the ROC curve for the best result was plotted and an AUC value of 0.62 was derived. The Real Test Accuracy distribution with AdaBoost was plotted to derive its kurtosis and skewness (Refer
Kurtosis of the Real Test Accuracy: -0.6021
Skewness of the Real Test Accuracy: -0.4677
The model was tuned with over 3600 combinations. Refer to Appendix C (e). The best result with the corresponding hyper-parameters was derived: a Real Test Accuracy of 55.42% with a standard deviation of 5.9% and an ambiguity of 7% (Refer to
learning_rate | 0.05
max_depth | 4
min_child_weight | 1
gamma | 0
colsample_bytree | 0.3
n_estimators | 100

Ambiguity | 7.098 ± 2.9 %
Real Test Accuracy | 55.42 ± 5.9 %
Train Accuracy | 78.079 ± 0.9 %
Cohen Kappa | 0.228
Further, the ROC curve for the best result was plotted and an AUC value of 0.62 was derived. The Real Test Accuracy distribution with XGBoost was plotted to derive its kurtosis and skewness (Refer
Kurtosis of the Real Test Accuracy is -0.8633
Skewness of the Real Test Accuracy: 0.0456
The model was tuned with over 320 combinations. Refer to Appendix C (f). The best results derived: a Real Test Accuracy of 59.506% with a standard deviation of 5.9% and an ambiguity of 4.3% (Refer to
n_estimators | 2100
max_depth | 12
max_features | log2
min_samples_leaf | 12

Ambiguity | 4.286 ± 2.0 %
Real Test Accuracy | 59.506 ± 5.9 %
Train Accuracy | 74.71 ± 0.5 %
Cohen Kappa | 0.1864
Further, the ROC curve for the best result was plotted and an AUC value of 0.64 was derived. The Real Test Accuracy distribution with the Extra Trees classifier was plotted to derive its kurtosis and skewness (Refer to
Kurtosis of the Real Test Accuracy is -0.2121
Skewness of the Real Test Accuracy: 0.5902
The model was tuned with over 1200 combinations.
max_features | 0.5
bootstrap | True
max_depth | 3
min_samples_leaf | 4
min_samples_split | 2
colsample_bytree | 0.3
n_estimators | 1200

Ambiguity | 1.404 ± 1.4 %
Real Test Accuracy | 60.043 ± 6.3 %
Train Accuracy | 65.978 ± 0.7 %
Cohen Kappa | 0.1785
Kurtosis of the Real Test Accuracy is -0.8606
Skewness of the Real Test Accuracy: -0.2491
The research focused on predicting the winner of an IPL match using machine learning and the available historical IPL data from seasons 2008–2019. In the process, various data science methods were adopted, including data mining, visualization, database preparation, feature engineering, applying the Analytic Hierarchy Process, creating prediction models, and training classification techniques.
The IPL dataset was gathered and pre-processed. The missing values were removed, and variables were encoded into numerical format to make the dataset uniform. The essential features were then derived from the data using domain knowledge via data mining techniques, and the results were derived from the model. Since the available IPL dataset is limited and small, multiple levels of features were created to ensure that the derived model does not underfit. Almost every feature that can affect the result of a match was derived. Further, the problem of multicollinearity was solved and the issue of data symmetry (termed Model Ambiguity) was identified. Several machine learning models were applied to the selected features to predict IPL match results. The best results were obtained with tree-based classifiers: the highest accuracy of 60.043%, with a standard deviation of 6.3% and an ambiguity of 1.4%, was observed with Random Forest (Refer to
Algorithm | Accuracy | Cohen Kappa | Ambiguity
Naïve Bayes | 58.23 ± 5.5 % | 0.19 | 3.00 %
AdaBoost | 60.03 ± 6.2 % | 0.19 | 0.40 %
Logistic Regression | 57.77 ± 5.8 % | 0.13 | 2.20 %
Support Vector Machines | 58.42 ± 5.69 % | 0.19 | 0.24 %
KNN | 53.47 ± 5.2 % | 0.16 | 1.90 %
XGBoost | 55.42 ± 5.9 % | 0.23 | 7.10 %
Extra Trees Classifier | 59.51 ± 5.9 % | 0.19 | 4.30 %
Random Forest Classifier | 60.04 ± 6.3 % | 0.18 | 1.40 %
In this research, the players' series-wise performance rather than their match-wise performance was used when calculating player strength. For a more thorough approach to further develop this research, match-wise data can be considered. The research can also be enhanced by adding other factors, such as comparing players' performances at a particular stadium.
Notation | Type of Points | Weight
ΦMatches | Being a part of the starting XI | 4
ΦRuns_Scored | Every run scored | 1
Φctchs | Total catches taken | 8
Φfifties | Total number of 50s scored | 8
Φhundreds | Total number of 100s scored | 16
Φnum_4s | Total number of 4s scored | 1
Φnum_6s | Total number of 6s scored | 2
Φstmp | Stumping / Run Out (direct) | 12
Φr_out | Run Out (Thrower/Catcher) | 8/4
Φfduck | Dismissal for a duck (only for batsmen, wicket-keepers and all-rounders) | -2
Φbat_innings | Number of times a player has batted in a match |
Φbowl_innings | Number of times a player has bowled in a match |
Φwickets | Number of wickets taken by a bowler in the season | 25
Φmaidens | Number of times a bowler has bowled an over without conceding any runs | 8
Φ4_wicket_haul | Number of times a player has taken 4 wickets in a single match | 8
Φ5_wicket_haul | Number of times a player has taken 5 wickets in a single match | 16
Φbowl_economy | Bowling economy of a player |
Φbat_strike_rate | Batting strike rate of a player |
Φmax_matches | Maximum matches played by a team |
Type of Points | Weight
Every boundary hit | 1
Every six hit | 2
Half-Century (50 runs scored by a batsman in a single inning) | 8
Century (100 runs scored by a batsman in a single inning) | 16
Maiden Over | 8
4 wickets | 8
5 wickets | 16
Type of Points | Weight
Minimum overs bowled by a player to be applicable | 2 overs
Between 6 and 5 runs per over | 2
Between 4.99 and 4 runs per over | 4
Below 4 runs per over | 6
Between 9 and 10 runs per over | -2
Between 10.01 and 11 runs per over | -4
Above 11 runs per over | -6
Type of Points | Weight
Minimum balls faced by a player to be applicable | 10 balls
Between 60 and 70 runs per 100 balls | -2
Between 50 and 59.99 runs per 100 balls | -4
Below 50 runs per 100 balls | -6
a. Hyperparameters for Logistic Regression

penalty | l2, l1
solver | liblinear
max_iter | 100, 200, 300, 400, 600, 900, 1200, 1500, 1800, 2100
tol | 0.0001, 0.00001, 0.0005, 0.001, 0.1, 0.5, 1
C | 1.0, 1.25, 1.5, 1.75, 2, 3, 4, 5, 7, 9, 12

b. Hyperparameters for Support Vector Machines

Kernel | rbf
Gamma | 1, 0.1, 0.01, 0.001, 0.0001
C | 0.1, 1, 10, 100, 1000

c. Hyperparameters for KNN

n_neighbors | 1, 3, 5, 6, 8, 10, 12, 14, 15, 18
weights | uniform, distance
metric | euclidean, manhattan, hamming
leaf_size | 10, 15, 20, 25, 30

d. Hyperparameters for AdaBoost

learning_rate | 0.005, 0.01, 0.02, 0.05, 0.1, 0.15, 0.5, 1
n_estimators | 10, 20, 50, 100, 200, 800, 1000

e. Hyperparameters for XGBoost

learning_rate | 0.05, 0.10, 0.15, 0.25
n_estimators | 50, 100, 200, 500, 700, 1000
max_depth | 4, 5, 6, 7, 8
gamma | 0, 0.1, 0.2, 0.3
min_child_weight | 1, 3
colsample_bytree | 0.3, 0.4, 0.7

f. Hyperparameters for ExtraTreesClassifier

n_estimators | 100, 200, 600, 900, 1200, 1500, 1800, 2100
max_depth | 3, 4, 5, 12, 15
max_features | sqrt, log2
min_samples_leaf | 3, 5, 8, 12

g. Hyperparameters for RandomForestClassifier

bootstrap | True, False
max_depth | 3, 4, 5, 6, 7
max_features | 0.5, sqrt, log2
min_samples_leaf | 2, 4
min_samples_split | 2, 5
criterion | gini, entropy
n_estimators | 800, 1000, 1200, 1600, 2000