A Novel Hybrid Feature Extraction Technique and Spam Review Detection using Ensemble Machine Learning Algorithm by Web Scrapping

Navin Kumar Goyal; Anil Pal; Bright Keswani; Dinesh Goyal; Mukesh Kr Gupta

doi:10.17485/IJST/v16i29.1500

Indian Journal of Science and Technology

Year: 2023, Volume: 16, Issue: 29, Pages: 2261-2268

Original Article

Navin Kumar Goyal¹*, Anil Pal², Bright Keswani³, Dinesh Goyal⁴, Mukesh Kr Gupta⁵

¹Research Scholar, Department of Computer Engineering and Information Technology, Suresh Gyan Vihar University, Jaipur, Rajasthan, India
²Professor, Department of Computer Application, Suresh Gyan Vihar University, Jaipur, Rajasthan, India
³Professor, Department of Computer Application, Poornima University, Jaipur, Rajasthan, India
⁴Director, Poornima Institute of Engineering and Technology, Jaipur, 302022, Rajasthan, India
⁵Professor, Department of Electrical Engineering, Suresh Gyan Vihar University, Jaipur, Rajasthan, India

*Corresponding Author
Email: [email protected]

Received Date: 19 June 2023, Accepted Date: 28 June 2023, Published Date: 05 August 2023

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objectives: To develop a novel hybrid method for feature generation and a novel dataset for experimenting with and extracting features for numerical representation.

Methods: A four-stage process was followed to find the best spam review detection model. First, a ‘Fake reviews’ dataset of 9926 samples from the home and kitchen products domain was collected from Flipkart. Next, the data were pre-processed using the Natural Language Toolkit (NLTK) library. A novel Hybrid Feature Generator (HFG) was then developed to extract informative features based on parameters such as TF-IDF (Term Frequency - Inverse Document Frequency), sentiment analysis scores, and syntactic patterns. Finally, models were trained on the generated features using the Gaussian Naïve Bayes (GNB), Multinomial Naïve Bayes (MNB), and Bernoulli Naïve Bayes (BNB) algorithms. Performance was evaluated using accuracy, precision, recall, and F1-score, comparing the models’ predictions against gold-standard (known) spam reviews.

Findings: The feature generation technique was applied to three models, each trained on 70% of the available data. GNB, MNB, and BNB achieved testing accuracies of 99.7%, 96.4%, and 99%, respectively. The models were compared with and without the extracted product review features, and GNB outperformed the other methods in accuracy and precision.

Novelty: This study presents a novel HFG for feature extraction from review text, together with a novel dataset, that outperforms previously reported approaches.
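The pipeline described in the Methods (hybrid features, a Naïve Bayes classifier, a 70% training split, and precision/recall/F1 evaluation) can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' HFG: the abstract does not specify how the features are computed, so the toy sentiment lexicon, the surface-pattern counts used as a stand-in for syntactic patterns, the smoothed IDF formula, the simple tokenizer used in place of NLTK, and all function and class names (`hybrid_features`, `TinyGaussianNB`, and so on) are assumptions made for illustration. The sample reviews are invented and are not taken from the Flipkart dataset.

```python
import math
from collections import Counter

# Hypothetical toy sentiment lexicon (assumption; the paper's scorer is not specified).
POSITIVE = {"good", "great", "excellent", "love", "best", "amazing"}
NEGATIVE = {"bad", "poor", "broke", "worst", "terrible"}


def tokenize(text):
    """Lowercase and keep alphabetic tokens (a stand-in for NLTK pre-processing)."""
    return "".join(c.lower() if c.isalpha() else " " for c in text).split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the training docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}


def hybrid_features(text, idf, vocab):
    """Concatenate TF-IDF weights, one sentiment score, and three surface-pattern values."""
    tokens = tokenize(text)
    tf = Counter(tokens)
    total = max(len(tokens), 1)
    tfidf = [tf[t] / total * idf[t] for t in vocab]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    sentiment = (pos - neg) / total
    # Assumed stand-ins for "syntactic patterns": '!' count, uppercase ratio, token count.
    upper_ratio = sum(c.isupper() for c in text) / max(len(text), 1)
    return tfidf + [sentiment, text.count("!"), upper_ratio, len(tokens)]


class TinyGaussianNB:
    """Minimal Gaussian Naive Bayes: per-class feature means and variances."""

    def fit(self, X, y):
        self.stats = {}
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            means = [sum(col) / len(rows) for col in zip(*rows)]
            # A small constant keeps the variance positive for constant features.
            variances = [sum((v - m) ** 2 for v in col) / len(rows) + 1e-6
                         for col, m in zip(zip(*rows), means)]
            self.stats[label] = (math.log(len(rows) / len(y)), means, variances)
        return self

    def predict(self, X):
        def log_posterior(x, prior, means, variances):
            return prior + sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
                               for xi, m, v in zip(x, means, variances))
        return [max(self.stats, key=lambda lab: log_posterior(x, *self.stats[lab]))
                for x in X]


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the chosen positive (spam) label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Invented toy reviews for illustration only (1 = spam, 0 = genuine).
    reviews = [
        ("BEST product ever!!! Amazing amazing, buy now!!!", 1),
        ("Great great great!!! Love it, best deal!!!", 1),
        ("Excellent!!! Best best best, amazing offer!!!", 1),
        ("The kettle heats water quickly and the lid fits well.", 0),
        ("Handle broke after two months of daily use.", 0),
        ("Good pan for the price, slightly heavier than expected.", 0),
        ("AMAZING!!! Love love love, best purchase!!!", 1),
        ("The mixer is a bit loud but works as described.", 0),
        ("Great!!! Excellent excellent, best seller!!!", 1),
        ("Poor packaging, but the knife set itself is sharp.", 0),
    ]
    split = int(0.7 * len(reviews))  # 70% of the samples are used for training
    train, test = reviews[:split], reviews[split:]
    idf = fit_idf([text for text, _ in train])
    vocab = sorted(idf)
    model = TinyGaussianNB().fit([hybrid_features(t, idf, vocab) for t, _ in train],
                                 [label for _, label in train])
    predictions = model.predict([hybrid_features(t, idf, vocab) for t, _ in test])
    print(precision_recall_f1([label for _, label in test], predictions))
```

Concatenating the lexical, sentiment, and pattern values into one vector is what makes the representation "hybrid" in this sketch. Gaussian Naïve Bayes fits it naturally because the features are real-valued; Multinomial and Bernoulli variants would instead expect non-negative counts or binarized features.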

Keywords: Fake Reviews Detection; Ensemble Machine Learning; Feature Engineering; Naïve Bayes; Web Scrapping

Copyright

© 2023 Goyal et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published By Indian Society for Education and Environment (iSee)