Enhancing Spam Email Classification Using Effective Preprocessing Strategies and Optimal Machine Learning Algorithms

Pramod P Ghogare; Husain H Dawoodi; Manoj P Patil

doi:10.17485/IJST/v17i15.2979

Indian Journal of Science and Technology

Year: 2024, Volume: 17, Issue: 15, Pages: 1545-1556

Original Article

Pramod P Ghogare¹*, Husain H Dawoodi², Manoj P Patil³

¹Research Scholar, School of Computer Sciences, Kavayitri Bahinabai Chaudhari North Maharashtra University, Jalgaon, Maharashtra, India
²System Analyst, School of Computer Sciences, Kavayitri Bahinabai Chaudhari North Maharashtra University, Jalgaon, Maharashtra, India
³Assistant Professor, School of Computer Sciences, Kavayitri Bahinabai Chaudhari North Maharashtra University, Jalgaon, Maharashtra, India

*Corresponding Author
Email: [email protected]

Received Date: 22 November 2023, Accepted Date: 06 March 2024, Published Date: 12 April 2024

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objective: This article proposes a content-based approach to spam email classification that applies a range of text pre-processing techniques. Natural language processing (NLP) techniques are applied to the email content to obtain the best classification performance from machine learning models.

Method: Several combinations of pre-processing methods, such as stop-word removal, tag removal, lower-case conversion, punctuation removal, special-character removal, and natural language processing, were applied to the content extracted from each email. Machine learning algorithms, namely Naïve Bayes (NB), Support Vector Machine (SVM), and Random Forest (RF), then classified each email as ham or spam. The standard Enron and SpamAssassin datasets, along with a personal email dataset collected from Yahoo Mail, were used to evaluate the models.

Findings: Applying stemming during pre-processing with the RF classifier yielded the best results: 99.2% accuracy on the SpamAssassin dataset and 99.3% on the Enron dataset. Lemmatization followed closely with 99% accuracy. In real-world testing on the personal Yahoo email dataset, the proposed method reached 97.28% accuracy, compared with 89.82% for the email service provider's built-in classifier. The study also found that SVM performs accurately when stop words are retained.

Novelty: This article highlights the fine-tuning of pre-processing techniques. The focus is on removing tags and certain special characters while retaining those that improve spam classification accuracy. Unlike prior works that primarily emphasize algorithmic approaches and pre-defined processing functions, this research examines the details of data preparation and shows its significant impact on spam email classifiers. These findings underline the role of pre-processing in robust spam detection.
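The pre-processing combinations described in the abstract could be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the tiny stop-word list, the toy suffix stripper (standing in for a real stemmer), and the choice to retain "$" and "!" are all assumptions made for the example. Each step is a switch, so different combinations can be compared before the tokens are passed to a classifier.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "you", "your"}

# Matches simple HTML-style tags such as <p> or </a>.
TAG_RE = re.compile(r"<[^>]+>")

# Characters retained during punctuation removal. Keeping "$" and "!" is a
# hypothetical choice; the abstract does not say which characters were kept.
KEEP_CHARS = "$!"
STRIP_CHARS = "".join(c for c in string.punctuation if c not in KEEP_CHARS)
PUNCT_TABLE = str.maketrans({c: " " for c in STRIP_CHARS})


def crude_stem(token):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, strip_tags=True, lowercase=True, strip_punct=True,
               remove_stop_words=True, stem=True):
    """Return a token list after applying the enabled pre-processing steps."""
    if strip_tags:
        text = TAG_RE.sub(" ", text)
    if lowercase:
        text = text.lower()
    if strip_punct:
        text = text.translate(PUNCT_TABLE)
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    return tokens


sample = "<p>You are the WINNER of $1000! Claiming is free.</p>"
print(preprocess(sample))                                        # all steps on
print(preprocess(sample, remove_stop_words=False, stem=False))   # stop words kept
```

Toggling `remove_stop_words` or `stem` mirrors the kind of comparison the abstract reports, for example retaining stop words for one classifier while stemming for another. In practice the resulting tokens would be vectorized and fed to the NB, SVM, or RF model.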

Keywords: Spam, Classification, Pre-processing, NLP, Machine Learning

Copyright

© 2024 Ghogare et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published by the Indian Society for Education and Environment (iSee).