P-ISSN 0974-6846, E-ISSN 0974-5645

Indian Journal of Science and Technology

Year: 2024, Volume: 17, Issue: 27, Pages: 2873-2879

Original Article

A Refined Approach Involving Feature Correlation Analysis Followed by Classification using Python for Gene Assortment

Received Date: 05 April 2024, Accepted Date: 29 June 2024, Published Date: 19 July 2024

Abstract

Objectives: To study a hybrid scheme for selecting the most representative subset of genes in order to improve microarray data classification accuracy using machine learning algorithms.

Method: The proposed ML-based SGSA approach runs in two stages on gene expression datasets. First, it uses entropy-based information gain (IG, a Kullback–Leibler divergence measure) to select the most informative genes. Next, attribute evaluation is performed using correlation-based feature selection. A random forest (RF) classifier is then applied with 10-fold cross-validation. The overall procedure covers data preprocessing, training, testing, and model fitting, and then identifies the best accuracy at a comparable CPU cost.

Findings: The rationale behind this study is that only the most informative genes are submitted to the classifier. The accuracy achieved by this approach is 98.48 for Lymphoma, 89.69 for Breast, 86.67 for CNS, 97.22 for Leukemia, 98.4 for Lung cancer, 96.6 for MLL, 97.21 for Ovarian cancer and 98.5 for SRBCT, compared against traditional machine learning algorithms such as naïve Bayes, J48 and SMO. These results demonstrate the effectiveness of the suggested approach in accurately classifying tumors. The numerical illustration also shows that the proposed method is more efficient in terms of CPU cost.

Novelty: The major problem with traditional machine learning algorithms is that all features are treated as equally important during classification, which makes the algorithms susceptible to outliers and makes it difficult to find a meaningful class in the dataset. In this study, classification accuracy is improved by passing only the most informative features to the classifiers. The primary contribution of the research is a hybrid ML model that uses IG-based feature selection followed by correlation analysis. The result is then fed to an RF-based classifier, which significantly improves both accuracy and CPU cost.
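
The two selection stages described under Method can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`information_gain`, `select_genes`), the parameters (`top_k`, `max_corr`), the median-split discretization, and the toy data are all assumptions made for illustration. The correlation stage is also simplified to a greedy redundancy filter on Pearson correlation, which stands in for a full correlation-based feature selection search.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Information gain of the class labels given one gene.

    Assumption: the continuous expression values are discretized with a
    simple median split; the article does not specify a discretization.
    """
    threshold = sorted(values)[len(values) // 2]
    gain = entropy(labels)
    for side in (True, False):
        subset = [y for v, y in zip(values, labels) if (v >= threshold) == side]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b) if var_a > 0 and var_b > 0 else 0.0


def select_genes(X, y, top_k=10, max_corr=0.9):
    """Stage 1: rank genes by information gain and keep the top_k.
    Stage 2: greedily drop genes highly correlated with one already kept.

    X is a list of samples, each a list of gene expression values.
    Returns (indices of kept genes, information gain of every gene).
    """
    n_genes = len(X[0])
    columns = [[row[j] for row in X] for j in range(n_genes)]
    gains = [information_gain(col, y) for col in columns]
    ranked = sorted(range(n_genes), key=lambda j: gains[j], reverse=True)[:top_k]
    kept = []
    for j in ranked:
        if all(abs(pearson(columns[j], columns[k])) < max_corr for k in kept):
            kept.append(j)
    return kept, gains


# Toy data (hypothetical): gene 0 separates the classes, gene 1 is a scaled
# copy of gene 0 (redundant), gene 2 carries no class information.
X = [[1, 2, 5], [2, 4, 1], [3, 6, 7], [4, 8, 3],
     [11, 22, 6], [12, 24, 2], [13, 26, 8], [14, 28, 4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

kept, gains = select_genes(X, y, top_k=3, max_corr=0.9)
print(kept)  # prints [0, 2]
```

On the toy data the redundant gene 1 is dropped because it is perfectly correlated with gene 0. In a full pipeline, the columns in `kept` would then be passed to a random forest classifier and evaluated with 10-fold cross-validation, for example with scikit-learn's `RandomForestClassifier` and `cross_val_score`; that step is omitted here to keep the sketch dependency-free.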

Keywords: Machine Learning, Classification, Correlation, Feature Selection, Gene Expression

Copyright

© 2024 Machchhar et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published By Indian Society for Education and Environment (iSee)
