Over the last ten years, heart disease has become the top cause of death worldwide. Heart disease is associated with a variety of symptoms, making it challenging to identify quickly and accurately. Large volumes of healthcare data are collected by the healthcare industry, which must be mined to uncover hidden information for successful decision-making. In the healthcare industry, data science plays a critical role in analyzing vast amounts of data. Machine learning is a type of data analysis that allows computers to learn from data, recognize patterns, and make judgments without the need for human interaction. In this work, heart disease is predicted using a variety of machine learning techniques.
Five categories of machine learning models are used in this research: "Logistic Regression", "Decision Tree", "Support Vector Machine", "K-Nearest Neighbor", and "Naive Bayes". The accuracy of each model is determined and compared to the Hybrid Ensemble model, which is a composite of all five models. The term "Hybrid" is employed because conventional ensemble models use a homogeneous set of machine learning models, whereas this study combines a heterogeneous set of machine learning models.
Many studies have been conducted to predict cardiac disease at an early stage.
Chu-Hsing Lin et al. compared "Convolutional Neural Networks" to "Conventional Neural Networks" to predict heart disease in
The synthetic minority oversampling technique was employed by
Random Forest predicts heart disease more accurately, according to Riddi Kasabe et al.
The importance of care for patients at an early stage was described by Montu Saw et al.
Noor Basha et al.
Rahul Kataria et al.
Using the SVM and KNN machine learning methods, a heart disease prediction model based on risk analysis was developed by Latin Miao et al.
The support vector machine model had higher accuracy than KNN. With 84.28 percent accuracy, Halima El Hamdaoui et al.
Daniel AnaneyObiri et al.
In a study of several machine learning algorithms, Rohit Bharti et al.
Muktevi Srivenkatesh employed "K-Nearest Neighbor", "Support Vector Machines", "Logistic Regression", "Naive Bayes", and "Random Forest" in his paper
In
Abdulwahab Ali Almazroi et al.
In
Abdulaziz Albahr et al.
The dataset used in this study is “heart” which was taken from Kaggle [www.kaggle.com]. The heart dataset contains 13 features and a target variable. The dataset description is given in



| No. | Feature | Description |
|---|---|---|
| 1 | Age | Age of the patient |
| 2 | Sex | 0 = female, 1 = male |
| 3 | Chest Pain (cp) | Chest pain is categorized into 4 types: 0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic |
| 4 | Resting Blood Pressure (trestbps) | The patient's resting blood pressure, measured in millimetres of mercury (mmHg) |
| 5 | Cholesterol (chol) | The patient's serum cholesterol in mg/dl |
| 6 | Fasting Blood Sugar (fbs) | 1 if fbs > 120 mg/dl (true), 0 otherwise (false) |
| 7 | Resting ECG (restecg) | Divided into three types: 0 = normal, 1 = having ST-T wave abnormality, 2 = left ventricular hypertrophy |
| 8 | Max Heart Rate (thalach) | Maximum heart rate achieved by the patient |
| 9 | Exercise-induced angina (exang) | 0 = no, 1 = yes |
| 10 | Oldpeak | ST depression induced by exercise relative to rest |
| 11 | Slope | Slope of the peak exercise ST segment: 0 = upsloping, 1 = flat, 2 = downsloping |
| 12 | No. of major vessels (ca) | Number of major vessels colored by fluoroscopy, with values ranging from 0 to 4 |
| 13 | Thalassemia (thal) | Ranges from 1 to 3: 1 = normal, 2 = fixed defect, 3 = reversible defect |
| 14 | Target | Prediction attribute: 0 = no possibility of heart attack, 1 = possibility of heart attack |
In
The highest heart rate of the patients is depicted in
The heart dataset was preprocessed before applying the machine learning models, using the standard scaler method. The dataset is split into two parts: a training set (70%) and a testing set (30%). This work uses K-fold cross-validation for model evaluation and model selection, with K set to ten. The five machine learning models used in this study are the "Logistic Regression Model", "Decision Tree", "Support Vector Machine", "K-Nearest Neighbor", and "Naive Bayes"; the "Hybrid Ensemble model" is then created by combining these five techniques. The output of each machine learning model is summarized in a confusion matrix, which contrasts the actual target values with those predicted by the model. The "accuracy" metric is used in this paper to compare the models: classification accuracy is the number of correct predictions divided by the total number of predictions made. The confusion matrix and formula for calculating accuracy are shown below in
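A minimal sketch of the preprocessing and split described above, using scikit-learn. A synthetic stand-in is used for the Kaggle heart dataset (in practice this would be `pd.read_csv("heart.csv")`); the column names and random seeds are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in for the Kaggle "heart" dataset: 303 rows, 13 features + target.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(303, 13)),
                  columns=[f"f{i}" for i in range(13)])
df["target"] = rng.integers(0, 2, size=303)

X, y = df.drop(columns=["target"]), df["target"]

# 70/30 train/test split, as in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

# Standard scaling: fit on the training data only to avoid leakage.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.shape, X_test_s.shape)  # (212, 13) (91, 13)
```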
Cross-validation is a technique used in applied machine learning to estimate a machine learning model's skill on unseen data. The procedure includes only one parameter, k, which specifies the number of groups into which a given data sample should be divided. As a result, the procedure is frequently referred to as K-fold cross-validation. When a specific value of k is chosen, it replaces k in the name of the method, for example, k = 10 for 10-fold cross-validation.
The general procedure for K-fold cross-validation is as follows:
1. Shuffle the dataset randomly.
2. Split the dataset into k groups.
3. For each unique group:
   - Take the group as a holdout or test data set.
   - Take the remaining groups as a training data set.
   - Fit a model on the training set and evaluate it on the test set.
   - Retain the evaluation score and discard the model.
4. Summarize the model's skill using the sample of model evaluation scores.
Importantly, each observation in the data sample is assigned to a separate group and stays in that group throughout the procedure. This means that each sample is used in the holdout set exactly once and is used to train the model k-1 times.
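The steps above can be sketched with scikit-learn's built-in cross-validation helper; the synthetic data and the Logistic Regression base model here are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative stand-in data: 300 samples, 13 features (like the heart dataset).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# 10-fold cross-validation (k = 10, as chosen in this study):
# each fold serves as the holdout set exactly once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)

print(len(scores))    # 10 accuracy scores, one per fold
print(scores.mean())  # summarized skill estimate
```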
Bagging is a type of ensemble machine learning strategy that improves performance by combining the outputs of multiple learners. These methods work by dividing the training set into subsets and putting them through several machine learning models, then aggregating their predictions to create an overall forecast for each instance in the original data. Bagging is also known as bootstrap aggregation; it is a data sampling approach that samples data with replacement. Bootstrap aggregation is a machine learning ensemble meta-algorithm for lowering the variance of an estimate, hence improving its stability. Bagging classifiers combine the predictions of various estimators, reducing variance. In this study, we used five machine learning models, resulting in a total of 25 weak learners. Finally, the Bagging classifier is used, and the ensemble model's final class prediction is the class predicted by the majority of the weak learners.
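A minimal sketch of bootstrap aggregation with scikit-learn's `BaggingClassifier`. The Decision Tree base learner, synthetic data, and number of estimators are illustrative assumptions, not the paper's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# Bagging = bootstrap aggregation: each base learner is trained on a
# bootstrap sample (drawn with replacement) and the final prediction
# is the majority vote across learners.
bag = BaggingClassifier(
    DecisionTreeClassifier(),  # base learner
    n_estimators=5,            # five bootstrapped copies
    bootstrap=True,
    random_state=0,
).fit(X, y)

print(bag.predict(X[:3]))
```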
Five categories of machine learning models are used in this work, which is outlined below:
Logistic Regression is a probabilistic algorithm that predicts outcomes. It uses a more sophisticated cost function, which can be described by the 'sigmoid function' or 'logistic function'. The logistic regression hypothesis requires the output of this function to be confined between 0 and 1. This approach is used in classification problems.
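The sigmoid function mentioned above maps any real-valued input into the interval (0, 1), so its output can be read as a class probability; a small sketch:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The hypothesis output can be read as P(heart disease | features):
print(sigmoid(0.0))  # 0.5 -- the decision boundary
print(sigmoid(4.0))  # ~0.982 -- confidently positive
```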
A decision tree classifier is a tree-structured classifier: internal nodes represent attributes, branches represent decision rules, and leaf nodes represent the output. To predict a sample's class, the decision tree classifier starts at the root node, compares the sample's attribute value with the node's condition, and jumps to the next node based on the result of the comparison. The process is repeated until a leaf node is reached.
In machine learning, the "Support Vector Machine" is one of the most commonly used Supervised Learning approaches. Both classification and regression analysis can be done using this model. SVM generates an optimal line or decision boundary that divides the n-dimensional space into classes, so that new data points can be classified correctly in the future. This boundary is referred to as a hyperplane. SVM creates the hyperplane by selecting the extreme points or vectors; these extreme points are called support vectors, which gives the Support Vector Machine its name.
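A minimal SVM sketch with scikit-learn, showing the support vectors that define the hyperplane; the linear kernel and synthetic two-feature data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Two illustrative features so the "hyperplane" is simply a line.
X, y = make_classification(n_samples=200, n_features=2,
                           n_redundant=0, random_state=0)

# A linear SVM finds the maximum-margin hyperplane; the training points
# that define it are exposed as support_vectors_.
svm = SVC(kernel="linear").fit(X, y)

print(svm.support_vectors_.shape)  # (n_support_vectors, 2)
```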
The simplest machine learning algorithm is "K-Nearest Neighbour". The value of "k" has a significant impact on the correctness of the algorithm's output. KNN calculates the "Euclidean", "Manhattan", or "Minkowski" distance between feature points to compare unclassified and classified data. It is also known as a lazy learner.
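The three distance metrics named above are related: Manhattan and Euclidean are the Minkowski distance with p = 1 and p = 2 respectively. A small sketch:

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a, b = [0, 0], [3, 4]
print(minkowski(a, b, 1))  # Manhattan distance: 7.0
print(minkowski(a, b, 2))  # Euclidean distance: 5.0
```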
This machine learning model is based on the Bayes theorem and assumes predictor independence. According to the "Naïve Bayes" model, the presence of a feature in a class is considered to be independent of the presence of any other feature. The Naïve Bayes model is used to develop models with analytical skills and offers novel approaches to analyzing and comprehending datasets. When the data volume is large, attributes are independent of one another, and efficient output is desired, Naïve Bayes is preferred over other methods.
In the proposed method, five weak learners, the "Logistic Regression Model", "Decision Tree", "Support Vector Machine", "K-Nearest Neighbor", and "Naive Bayes", are used. We used five machine learning models in this investigation, resulting in a total of 25 weak learners. Finally, the Bagging classifier is used, and the final class prediction of the ensemble model is the class predicted by the majority of the weak learners. The accuracy of each model is determined and compared to the "Hybrid Ensemble Model", which is a composite of all five models.
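One plausible reading of this construction, sketched with scikit-learn: each of the five heterogeneous models is bagged five times (5 × 5 = 25 weak learners) and the bagged ensembles are combined by majority vote. The exact configuration, hyperparameters, and data here are illustrative assumptions, not the paper's confirmed setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# The five heterogeneous base models of the hybrid ensemble.
base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier()),
    ("svm", SVC()),
    ("knn", KNeighborsClassifier()),
    ("nb", GaussianNB()),
]

# Bag each base model five times (25 weak learners in total), then
# combine the five bagged ensembles by majority (hard) vote.
hybrid = VotingClassifier(
    estimators=[(name, BaggingClassifier(model, n_estimators=5, random_state=0))
                for name, model in base_models],
    voting="hard",
).fit(X, y)

print(hybrid.predict(X[:5]))
```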
The accuracy of the machine learning models such as “Logistic Regression Model”, “Decision Tree”, “Support Vector Machine”, “KNearest Neighbor”, “Naive Bayes”, and “Hybrid Ensemble model” is shown in
Figure 6 demonstrates that Logistic Regression has an accuracy of 80%, Decision Tree has an accuracy of 75%, SVM has an accuracy of 87%, KNN has an accuracy of 82%, Naïve Bayes has an accuracy of 79%, and the proposed Hybrid Ensemble model has an accuracy of 98%. With 98% accuracy, the "Hybrid Ensemble model" surpassed all of the individual models, allowing the physician to effectively identify heart disease.
The accuracy comparison of the heart dataset for predicting heart disease by various authors and the proposed “Hybrid Ensemble model” is shown in



| Author | Model Used | Accuracy (%) |
|---|---|---|
| Chu-Hsing Lin et al. | Convolutional Neural Networks (CNN) | 93 |
| Montu Saw et al. | Logistic Regression | 80 |
| Noor Basha et al. | KNN | 85 |
| Halima El Hamdaoui et al. | Naïve Bayes | 84.28 |
| Rohit Bharti et al. | Deep Learning | 94.28 |
| Muhammad Zeeshan Younas | Logistic Regression | 86.89 |
| Abdulwahab Ali Almazroi et al. | Decision Tree | 82 |
| D. Deepika et al. | MLP-EBMDA | 94.28 |
| Abdulaziz Albahr et al. | RSDANN | 96.3 |
| This work (proposed) | Hybrid Ensemble Model | 98 |



In
Noor Basha et al.
Halima El Hamdaoui et al.
In
Muhammad Zeeshan
Abdulwahab’s
Deepika et al.
Abdulaziz Albahr et al.
As mentioned in
We proposed a "Hybrid Ensemble Model" in this study, in which we compared the accuracy of weak learners such as "Logistic Regression", "Decision Tree", "Support Vector Machine", "K-Nearest Neighbor", and "Naive Bayes" to the proposed "Hybrid Ensemble Model", which yielded encouraging results. Many researchers have previously applied machine learning techniques in various studies and achieved strong results. Our proposed model, however, predicted best with 98% accuracy, and thus this study may be useful to doctors and patients in predicting heart disease in advance.
As the scope of future work, this research can be extended to larger datasets, comparing the proposed technique with deep learning models. Various alternative optimization approaches, as well as different methods of data normalization, can be applied, and the results may be compared to improve accuracy. The proposed model can also be incorporated into a user-friendly mobile or web-based application for easier use by doctors and patients in the real world.