The contributions of the present work are as follows: (a) enhanced pre-processing is carried out to remove noise, improve the quality of the input data, and enhance salient features; (b) a novel optimal method, C4.5 combined with the Firefly algorithm, is selected for classification.
Numerous approaches have been developed in recent years to detect distinct thyroid disorders, with many researchers applying a variety of machine learning techniques.
According to the researchers, the main goal of the investigation is to propose a Machine Learning approach for improving the accuracy of hypothyroidism detection by incorporating patient queries with test outcomes during the detection procedure. The second objective is to decrease the risks associated with diagnostic errors. Data are taken from the UCI database, which contains 3163 samples; among them, 151 are hypothyroid and the remaining are negative (non-hypothyroid). The decision task is to determine whether new instances are hypothyroid. To address the imbalanced class distribution, distinct sampling approaches were used in data collection, and models were developed to detect hypothyroidism using K-Nearest Neighbor (KNN), Logistic Regression (LR) and Support Vector Machine (SVM) classifiers. In this regard, the study revealed the impact of sampling methods on the detection of hypothyroidism. Compared to all the other models, the Logistic Regression classifier produced the best results: trained on the dataset using over-sampling techniques, it obtained a precision of 97.8%, an F-score of 82.26%, a Matthews Correlation Coefficient (MCC) of 81.8% and a ROC of 93.2%.
Researchers are applying new statistical analysis and data mining techniques to develop tools that help healthcare professionals diagnose thyroid-related diseases easily and efficiently. Useful knowledge can be extracted from databases where a significant amount of relevant data is stored. A decision-based hybrid system for the diagnosis of thyroid diseases is presented here. The proposed system consists of three stages. In the first stage, the 25 features of the dataset (retrieved from the University of California Irvine machine learning repository) are reduced using the Information Gain method to avoid data redundancy and reduce computation time. In the second stage, the missing values in the dataset are handled with a K-Nearest Neighbor (KNN) weighting pre-processing scheme. Finally, the resultant data is provided as input to the Adaptive Neuro-Fuzzy Inference System for input-output mapping in the last stage of the proposed system. The classification accuracy of this approach was calculated to be 99.1%, whereas sensitivity and specificity were 94.77% and 99.70%, respectively. The approach obtains the highest classification accuracy with the minimum possible features of the dataset and can be applied to diagnose other lethal diseases.
For researchers, medical diagnosis and the extraction of patterns that can be converted into applicable knowledge is a difficult task. Medical records are real-time data of high dimensionality, making pattern extraction even more difficult. The prediction of various diseases, such as thyroid, diabetes, and cancer, is frequently hampered by this high dimensionality. Machine learning is the process of extracting facts from massive amounts of information and building a knowledge base that stays useful. These functionalities are accomplished using clustering, classification, prediction and regression. Reducing dimensionality for the purpose of knowledge discovery is an important factor that aids prediction and decision-making in diverse fields such as medical identification, business modelling, and data examination. The article described a suitable method for thyroid classification that combines C4.5 and the Random Forest Classification Technique (CCTML). When the experimental outcomes are compared to other traditional approaches, the CCTML results show higher accuracy.
Other research applied two Multilayer Perceptron (MLP) classifiers for categorizing thyroid conditions, namely normal thyroid, hypothyroid and hyperthyroid, while achieving maximum accuracy in the shortest amount of time.
Recently, authors conducted research based on an Extreme Learning Machine technique with Graph Clustering Ant Colony Optimization to detect thyroid threats.
Thyroid disease diagnosis necessitates accurate assessment of functional data from the thyroid gland, which forms the hormones that regulate the human body's metabolic rate.
| Author | Method | Accuracy |
| --- | --- | --- |
| Sparano et al. | Logistic Regression | 97.8% |
| Dharamkar et al. | Fusing C4.5 and random forest | 97% |
| Marques et al. | Multilayer Perceptron | 97.4% |
| Yazdani et al. | Graph Clustering Ant Colony Optimization | 98.5% |
| Jha et al. | Linear discriminant analysis, kNN-based weighted preprocessing, and adaptive neuro-fuzzy inference system | 98.5% |
The data in our investigation are acquired from https://www.kaggle.com/datasets/yasserhessein/thyroid-disease-data-set for analyzing and diagnosing diseases. The data are associated with thyroid disorder and comprise 4672 samples of people, both female and male, including people who have hypothyroidism and hyperthyroidism as well as healthy people without thyroid disorder. The data were gathered over one year with the primary goal of classifying thyroid disease using ML algorithms. The attributes comprise gender, age, Thyroid Hormone (T4), Triiodothyronine (T3), Thyroid Stimulating Hormone (TSH) and so on, as shown in the attribute table below.
| No. | Attribute | Description | Type | Range |
| --- | --- | --- | --- | --- |
| 1 | Gender | Female or Male | Integer | [0, 1] |
| 2 | Age | In years | Real | [0.00, 0.90] |
| 3 | Pregnant | During pregnancy | Integer | [0, 1] |
| 4 | Sick | Illness | Integer | [0, 1] |
| 5 | ATD | Antithyroid drug medication | Integer | [0, 1] |
| 6 | TSH | Thyroid-stimulating hormone | Real | [0.0, 0.53] |
| 7 | Class | Category | Integer | – |
| 8 | Li | Lithium | Integer | [0, 1] |
| 9 | Query thyroxine | – | Integer | [0, 1] |
| 10 | T3 | Triiodothyronine | Real | [0, 1] |
| 11 | Hyperthyroid | – | Integer | [0, 1] |
| 12 | Hypothyroid | – | Integer | [0, 1] |
| 13 | TSH_M | Thyroid-stimulating hormone medication | Real | [0, 1] |
| 14 | TT4 | Total thyroxine | Real | [0.0020, 0.6] |
| 15 | T4 | Thyroid hormone | Real | [0, 1] |
| 16 | Thyroid surgery | – | Integer | [0, 1] |
| 17 | Tumor | Cyst | Integer | [0, 1] |
| 18 | Hypopituitary | – | Integer | [0, 1] |
| 19 | T4U | Thyroxine utilization rate | Real | [0.017, 0.233] |
| 20 | FTI | Free Thyroxine Index | Real | [0.0020, 0.642] |
Pre-processing is used to resolve various kinds of concerns such as noisy data, missing values, and redundant records. High-quality data leads to high-quality outcomes and reduces the cost of ML algorithms; it is a minor but significant step. Generalization, cleaning, feature extraction, and feature selection are examples of data pre-processing.
Raw data contains noise, missing values, and inconsistencies that affect the results of data processing; pre-processing helps improve data quality, efficiency, and simplicity in the extraction process.
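As a concrete illustration, a minimal pre-processing sketch is given below. The imputation and scaling steps are assumptions for illustration, not the authors' exact pipeline: missing values are filled with the column mean, then each column is min-max scaled into [0, 1], matching the attribute ranges in the table above.

```python
import numpy as np

def preprocess(X):
    """Mean-impute missing values, then min-max scale each column to [0, 1]."""
    X = np.array(X, dtype=float)
    col_mean = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_mean[cols]          # fill gaps with the column mean
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (X - lo) / span

# Illustrative raw records: [age, TSH], with one missing TSH reading.
raw = [[25.0, 1.2], [40.0, float("nan")], [70.0, 4.8]]
clean = preprocess(raw)
print(clean.round(3))
```

After this step every attribute lies in [0, 1], so no single hormone measurement dominates distance-based classifiers such as KNN.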
The primary goal of the ML algorithms is to distinguish between three categories of thyroid condition: hyperthyroidism, hypothyroidism, and healthy patients with no thyroid problem, which are examined in this section.
Decision Tree is a Supervised Learning (SL) approach that can be utilized for both classification and regression problems, but it is most commonly used to solve classification problems. The approach relies on a decision-boosting machine, examined for forecasting the energy utilization factor, which attempts a tree-based process to identify the most effective predictors. The model improves its decisions by including thousands of trees, each grown using information from the earlier tree. A few tuning parameters are available, such as the shrinkage parameter, the number of trees, and the total splits in each tree. To construct the learning computation, a split-and-conquer method, also known as divide and conquer, is used.
A set of attributes characterizes each instance. The decision tree is made up of nodes and leaves, with leaves representing the class of an instance that meets the conditions and internal nodes representing a test on the value of an attribute. Rules can be obtained by following the path from the root to a leaf, taking the node tests along the way as pre-conditions for predicting the class at the leaf. The tree must be pruned to remove excessive pre-conditions and duplication.
Decision Trees are based on the Sum of Products (SOP) representation. Each path from the tree's root to a leaf node of a given class is a conjunction (product) of attribute tests, and the different paths ending in that class form a disjunction (sum).
Step 1: Place the dataset's most suitable feature at the top of the tree.
Step 2: Subdivide the training set into smaller subsets.
Step 3: Repeat steps 1 and 2 on each subset until leaf nodes are obtained in all branches of the tree.
Step 1: Initially, the complete training set is taken as the root.
Step 2: Categorical feature values are preferred; if the values are continuous, they are discretized before the model is constructed.
Step 3: Records are distributed recursively depending on feature values.
Step 4: A statistical procedure is employed to specify which features should be placed as the tree's root or as interior nodes.
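The recursive construction described in the steps above can be sketched as follows. This is a minimal ID3-style builder for categorical features; the function names and toy data are illustrative, not the paper's implementation.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, col):
    # Entropy reduction achieved by splitting on column `col`.
    gain = entropy(labels)
    for value in set(r[col] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[col] == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

def build_tree(rows, labels):
    # Steps 1-3 above: pick the best feature, split, recurse per subset.
    if len(set(labels)) == 1:
        return labels[0]                             # pure leaf
    if all(r == rows[0] for r in rows):
        return Counter(labels).most_common(1)[0][0]  # majority leaf
    best = max(range(len(rows[0])), key=lambda c: info_gain(rows, labels, c))
    tree = {}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[(best, value)] = build_tree([rows[i] for i in idx],
                                         [labels[i] for i in idx])
    return tree

# Toy data: feature 0 separates the two classes perfectly.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = ["hypo", "hypo", "normal", "normal"]
print(build_tree(X, y))  # the root splits on feature 0
```

The dictionary keys `(feature, value)` encode the decision rules along each path from root to leaf.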
The C4.5 decision tree technique was used to perform the disease classification. The input data is loaded into the database to perform the classification process. Input data can have missing values, noisy values, or inconsistencies; such data must be handled in the pre-processing step.
The optimal choice of features is a critical step in improving classification. The best features are chosen, and the dataset is separated into testing and training data for the classification process. The classification algorithm is fed with training data and validated with the testing process. The overall performance is measured using parameters such as accuracy, precision, F1-score and recall. Finally, the prediction values are displayed in the results. The prediction process flow diagram is depicted in
Input: X, a matrix of continuous parameters
Output: the feasible parameter X0 and the feasible cut point cp0
For each parameter Xi, i = 1, …, n do
    Sort the parameter values a1i, …, ani
    Identify all possible cut points ccp1i, …, ccpmi
    For each cut point ccpij, j = 1, …, m do
        Compute the information gain Gain(A, ccpij)
    Choose the feasible cut point cp0i
    Compute the splitting performance Split(A, ccp0i)
    Compute the gain ratio Gain_ratio(A, ccp0i)
Choose the feasible parameter X0 and cut point cp0
End
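The cut-point search above for a single continuous attribute can be sketched in Python as follows. The midpoint scan and the TSH-like toy values are illustrative assumptions, not taken from the dataset.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut_point(values, labels):
    """Scan midpoints between sorted values; return (cut, gain_ratio)."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no cut between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= cut]
        right = [l for v, l in pairs if v > cut]
        p = len(left) / len(pairs)
        gain = base - p * entropy(left) - (1 - p) * entropy(right)
        split_info = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
        ratio = gain / split_info if split_info else 0.0
        if ratio > best[1]:
            best = (cut, ratio)
    return best

# Toy TSH-like values: low readings in one class, high in the other.
cut, ratio = best_cut_point([0.1, 0.2, 0.3, 0.8, 0.9],
                            ["neg", "neg", "neg", "pos", "pos"])
print(cut)  # midpoint between 0.3 and 0.8
```

The gain ratio rather than the raw gain is used to rank cut points, mirroring how C4.5 penalizes splits that fragment the data.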
The Firefly algorithm is a meta-heuristic algorithm based on the flashing patterns and behavior of fireflies.
Input: X(a)
Output: ranking of the fireflies
Procedure: selecting from the total population
begin:
    X(a), a = (a1, a2, a3, …, an)
    For the total population m, firefly aj, where j = 1, 2, 3, …, m
        Compute the light intensity lj associated with X(a)
    while (k < MaxVal) do
        define γ
        for (x = 1; x < n; x++) do
            for (y = 1; y < n; y++) do
                if (ly > lx) then
                    pull x towards y
                compute the distance of each firefly
                estimate the intensity I
    return the ranking of the fireflies
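A minimal sketch of the standard firefly algorithm follows. The population size, step parameters, and the sphere test function are assumptions for illustration, not the paper's settings: brighter (lower-cost) fireflies attract dimmer ones, with attractiveness decaying with squared distance.

```python
import numpy as np

def firefly_minimize(f, dim, n=15, iters=100, alpha=0.2, beta0=1.0,
                     gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-5, 5, (n, dim))        # initial population
    light = np.array([f(x) for x in X])     # lower cost = brighter
    for _ in range(iters):
        for i in range(n):
            for j in range(n):
                if light[j] < light[i]:     # j is brighter than i
                    r2 = np.sum((X[i] - X[j]) ** 2)
                    beta = beta0 * np.exp(-gamma * r2)   # attractiveness
                    X[i] += beta * (X[j] - X[i]) \
                            + alpha * rng.uniform(-0.5, 0.5, dim)
                    light[i] = f(X[i])
    best = np.argmin(light)
    return X[best], light[best]

# Sphere function: global minimum 0 at the origin.
x_best, f_best = firefly_minimize(lambda x: np.sum(x ** 2), dim=2)
print(f_best)
```

In the proposed hybrid, the cost function would score candidate feature/parameter choices for the C4.5 classifier instead of the sphere function used here.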
The proposed C4.5 algorithm extends the ID3 algorithm by determining the best feature based on the Information Gain (IG) ratio. It can manage continuous features by suggesting two separate tests based on the type of attribute value. To build the decision tree during the training phase, C4.5 employs a top-down design based on the divide-and-conquer methodology. Starting from the training dataset, it uses the IG ratio as the metric for identifying splitting features while producing nodes from the root to the leaves. Each path from the root node to a leaf node comprises a decision rule for determining which category a new instance belongs to.
To account for unknown attribute values, the root node contains the entire training collection, with all training case weights set to 1.0. The algorithm terminates if all of the present node's training cases belong to a single class. Conversely, if the training cases come from more than one category, the technique computes the IG ratio for every feature. To split the data at the node, the attribute with the greatest IG ratio is chosen. The IG ratio for a discrete feature is calculated by dividing the training set of the current node according to each value of the feature. If the attribute is continuous, a threshold value for splitting must be determined.
The decision tree method is a Machine Learning (ML) technique that recursively partitions the data according to specified metrics. In the recursive DT algorithm, interior nodes represent features, branches describe decision rules, and every leaf node gives the outcome. A top-down approach is used to categorize the data, ordering from the root node to a leaf (terminal) node. After acquiring the training dataset, it divides the data into small subsets using the Information Gain, Gini Index, Gain Ratio and Entropy.
IG is calculated for a split by subtracting the weighted entropies of every branch from the original entropy. When using these metrics to build a DT, the best split is the one that maximizes the information gain.
The Gini Index is computed by subtracting the sum of the squared class probabilities from one.
In the majority of datasets, the Information Gain and Gini Index are calculated. The procedure is repeated for every child node until all rows belong to the same class and no further features are required. The Gini Index is used to improve detection accuracy and precision.
The entropy of a random variable is a measure of its uncertainty. It quantifies the average amount of information that is missing when the random variable's value is unknown.
Information gain is biased toward selecting features with a large number of values as root nodes; that is, it favors features represented by many distinct values. First, determine the information gain of each attribute separately and calculate the average information gain. Second, compute the gain ratio of all features whose information gain is higher than or equal to the computed average, and then select the feature with the highest gain ratio for further subdivision.
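The splitting measures discussed above can be sketched as plain Python functions. This is a minimal illustration; the toy labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # One minus the sum of squared class probabilities.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def info_gain(parent, partitions):
    # Parent entropy minus the weighted entropies of the child branches.
    n = len(parent)
    return entropy(parent) - sum(len(p) / n * entropy(p) for p in partitions)

def gain_ratio(parent, partitions):
    # Information gain normalized by the split information.
    n = len(parent)
    split_info = -sum((len(p) / n) * math.log2(len(p) / n)
                      for p in partitions if p)
    g = info_gain(parent, partitions)
    return g / split_info if split_info else 0.0

parent = ["hypo", "hypo", "normal", "normal"]
split = [["hypo", "hypo"], ["normal", "normal"]]  # a perfect binary split
print(info_gain(parent, split), gain_ratio(parent, split))  # → 1.0 1.0
```

A perfect split earns the maximum gain (the full parent entropy), while the gain-ratio normalization penalizes splits with many small branches.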
The proposed algorithm uses a top-down greedy search strategy to traverse the space of possible decision trees, never retracing or reanalyzing prior choices. The metric employed to determine the best feature at every point of tree expansion is information gain. At the beginning, the predictive model must be properly trained using the available data.
Later, reliability and accuracy are verified, specifically by predicting against known test outputs. Various methods, such as the Gini index or information gain, are used as Attribute Selection Measures (ASM) to determine the optimal feature for splitting a node. The chosen feature then divides the dataset into smaller subsets until no further child feature is found. Initially, the model is built using training data; its accuracy is determined and enhanced by evaluating the recognized output against the data. Finally, the model can be employed to forecast future results. After the computational process, the predictive score is computed; these values give the percentage of correct predictions in the final node (or leaf) of the trained model. Figure 2 represents the decision tree's working process.
The experiment was carried out in MATLAB R2019a on a Windows 64-bit operating system with an AMD Ryzen 5 4500U 2.38 GHz processor and 8 GB of RAM.
Accuracy is measured in terms of positives and negatives, with a scale ranging from 0 to 100 percent.
Here, TN denotes True negative, FN denotes False Negative, TP denotes True Positive, and FP denotes False Positive values.
Precision and recall quantify performance on the positive class in terms of True Positives (TP), False Positives (FP) and False Negatives (FN). Precision is the capacity of a classifier not to label as positive a case that is actually negative, and it is stated as follows:
Conversely, recall assesses the model's sensitivity. It is the ratio of correctly predicted instances of a class to the total number of instances of that class, and it is calculated as follows:
Many real-world classification problems have unequal class distributions. Accuracy is used when the TP and TN are more significant, but the F1-score is employed when the FN and FP are crucial. As a result, the F1-score may be a better metric for evaluating our model. The harmonic mean of precision and recall is taken as the F1-score.
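The four metrics can be computed directly from the raw confusion counts; a minimal sketch follows (the counts below are illustrative, not the paper's results).

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from raw confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # ability not to flag negatives as positive
    recall = tp / (tp + fn)      # sensitivity to the positive class
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Illustrative counts only.
acc, prec, rec, f1 = metrics(tp=90, tn=85, fp=5, fn=10)
print(round(acc, 4), round(prec, 4), round(rec, 4), round(f1, 4))
```

Because F1 is a harmonic mean, it sits closer to the worse of precision and recall, which is what makes it informative under class imbalance.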
Most of the available datasets suffer from irrelevant and noisy features, which can cause the classification procedure to generate unsatisfactory outcomes. Consequently, a feature selection procedure was employed at this stage to reduce the dimensionality of the dataset and eliminate unwanted features.
In machine learning, extracting or selecting the best features plays a significant part in improving the entire classification procedure. To investigate the importance of feature selection for classification, the datasets were also classified without feature selection.
The results clearly showed that feature selection is one of the most significant stages in classification when the dataset suffers from noise and irrelevant data that can negatively affect classification accuracy. Outcomes without feature selection were poorer because of high dimensionality and noisy data, whereas with feature selection the overall accuracy was enhanced.
Thyroid patients were divided into three groups: normal, hyperthyroidism, and hypothyroidism. There were 4672 samples after data pre-processing; 3066 samples were taken for training and 1606 for testing.
They are split into 3574 training and 1098 testing samples, with a train-test split ratio of approximately 70:30. A confusion matrix is computed to examine the FP and FN counts.
Naive Bayes (NB) is a simple classification procedure for prediction with clear semantics, representing a probabilistic learning technique based on Bayes' theorem. NB classifiers assume the value of each attribute to be independent of the values of the other attributes; NB therefore assumes that the presence or absence of specific attributes determines the prediction with maximum probability. In the proposed method, collaborative filtering helps in the construction of recommendation systems using machine learning methods for filtering hidden data and for predicting what users are seeking in a given feature.
K-Nearest Neighbors (KNN) makes no assumption regarding the distribution of the data on which it is grounded. KNN models have been successfully employed on specific datasets and are advantageous on some real-world datasets. In this research, the dataset is associated with thyroid disorder, comprising hypothyroidism and hyperthyroidism as well as healthy people without thyroid disorder. KNN introduces delay at testing time and requires a large amount of time and memory. In KNN, K specifies the control parameter of the prediction model; KNN does not learn immediately from the training set but stores the dataset and employs it at classification time, which is not the case in the proposed method.
AdaBoost machine learning models characteristically encompass several thousand shallow decision trees, possibly producing much larger search spaces. For an input x ∈ X, it becomes possible to reduce these spaces exponentially by considering only the decision path of x in each decision tree and dropping every additional branch. These paths preserve all the information needed to determine g(x). In the proposed method, a statistical procedure is employed to identify which features become the tree's root or interior nodes.
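A baseline comparison along these lines can be sketched with scikit-learn. Synthetic data stands in for the thyroid features, and the hyper-parameters are assumptions, not the paper's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier

# Synthetic 3-class data standing in for normal/hypo/hyperthyroid samples.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=42)

for name, clf in [("NB", GaussianNB()),
                  ("KNN", KNeighborsClassifier(n_neighbors=5)),
                  ("AdaBoost", AdaBoostClassifier(n_estimators=100))]:
    score = clf.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: test accuracy {score:.4f}")
```

On the real dataset, the same loop would also record training time per classifier to reproduce the runtime comparison reported below.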
| Classifier | Training time (s) |
| --- | --- |
| NB | 0.0923 |
| KNN | 0.5453 |
| AdaBoost | 4.1251 |
| Proposed | 1.117 |
The results compare the accuracy obtained with the deployed classifiers. The proposed method achieved the highest accuracy of 99.81 percent, followed by AdaBoost, KNN, and NB, which achieved 96.65 percent, 94.47 percent, and 25.09 percent, respectively. When the TP and TN are more important, accuracy is used, but the F1-score is used when the FN and FP are critical. An F1-score of 1 is considered perfect. Among all the classifiers, the proposed method has the highest F1-score of 0.9951. Thus, in terms of precision, accuracy, F1-score and recall, the proposed method outperformed the other models. This demonstrates that the proposed classifier is significantly better at thyroid disease prediction.
In real-time applications, model accuracy must be prioritized over speed. The results show that while NB and KNN take less time to train, their accuracy is much lower. AdaBoost took 4.1251 s in our case, which is slower than the more accurate proposed model. In this investigation, the proposed method is therefore more suitable for further processing.
In this article, a novel algorithm, C4.5 with the Firefly Optimization Algorithm (CFOA), was proposed to speed up and enhance the effectiveness of machine learning classification. The runtimes of the NB, KNN, AdaBoost and proposed algorithms are 0.0923 s, 0.5453 s, 4.1251 s and 1.117 s respectively, which demonstrates that the proposed classifier is considerably faster than AdaBoost in thyroid disease prediction. Moreover, the accuracies are 0.2509, 0.9447, 0.9665 and 0.9981 for the NB, KNN, AdaBoost and proposed algorithms respectively, showing that the proposed algorithm achieved the highest accuracy of 99.81%.
In the future, additional attributes would necessitate further clinical tests, which are both costly and time-consuming. As a result, there is a need to design algorithms and predictive standards that demand the fewest possible measurements from a person to analyze thyroid disease, saving the patient both money and time.