P-ISSN 0974-6846, E-ISSN 0974-5645

Indian Journal of Science and Technology



Year: 2023, Volume: 16, Issue: 29, Pages: 2261-2268

Original Article

A Novel Hybrid Feature Extraction Technique and Spam Review Detection using Ensemble Machine Learning Algorithm by Web Scrapping

Received Date: 19 June 2023, Accepted Date: 28 June 2023, Published Date: 05 August 2023


Objectives: To develop a novel hybrid method for feature generation, together with a novel dataset for experimenting with and extracting features as numerical representations.

Methods: A four-stage process was followed to find the best spam review detection model. First, a dataset, ‘Fake reviews’, was collected from Flipkart; it contains 9926 samples from the home and kitchen products domain. Second, the data were pre-processed using the Natural Language Toolkit (NLTK) library. Third, a novel Hybrid Feature Generator (HFG) was developed to extract informative features based on TF-IDF (Term Frequency - Inverse Document Frequency), sentiment analysis scores, and syntactic patterns. Finally, models were trained on the generated features using the Gaussian Naïve Bayes (GNB), Multinomial Naïve Bayes (MNB), and Bernoulli Naïve Bayes (BNB) algorithms. Performance was evaluated using accuracy, precision, recall, and F1-score, comparing each model’s results against gold-standard labels of known spam reviews.

Findings: The feature generation technique was applied to three different models, each trained on 70% of the available data. GNB, MNB, and BNB achieved testing accuracies of 99.7%, 96.4%, and 99%, respectively. The performance of these models was compared with and without the extracted product review features. GNB outperformed the other methods in terms of accuracy and precision.

Novelty: This study presents a novel HFG for feature extraction from review text and a novel dataset, and the approach outperforms previously reported approaches.
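The pipeline described in the Methods (hybrid features, a Naïve Bayes classifier, and the four evaluation metrics) can be sketched in a few lines. This is a minimal illustration only, not the authors' HFG: the six toy reviews, the tiny sentiment lexicon, and the names `hybrid_features`, `GaussianNB`, and `scores` are all assumptions made for this example, and the repetition and length features are simple stand-ins for the syntactic patterns the abstract mentions. It uses only the Python standard library rather than NLTK.

```python
import math
from collections import Counter

# Toy reviews invented for illustration (1 = spam, 0 = genuine);
# they are NOT taken from the paper's Flipkart dataset.
REVIEWS = [
    ("best product ever buy now amazing amazing deal", 1),
    ("amazing best best offer buy buy now", 1),
    ("great great amazing must buy best price now", 1),
    ("the kettle heats water quickly but the lid is loose", 0),
    ("pan is sturdy although the handle gets hot", 0),
    ("mixer works fine and the jar is easy to clean", 0),
]

# Hypothetical mini sentiment lexicon; a real system would use a full one.
POSITIVE = {"best", "amazing", "great", "fine", "sturdy", "easy"}
NEGATIVE = {"loose", "hot", "bad", "poor"}


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}


def hybrid_features(tokens, idf, vocab):
    """TF-IDF vector plus a sentiment score and two surface-pattern features."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    tfidf = [counts[t] / total * idf[t] for t in vocab]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    sentiment = (pos - neg) / total
    repetition = 1 - len(set(tokens)) / total  # share of repeated tokens
    return tfidf + [sentiment, repetition, float(len(tokens))]


class GaussianNB:
    """Gaussian Naive Bayes with a small variance floor for stability."""

    def fit(self, X, y):
        self.stats = {}
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            cols = list(zip(*rows))
            means = [sum(c) / len(rows) for c in cols]
            variances = [sum((v - m) ** 2 for v in c) / len(rows) + 1e-3
                         for c, m in zip(cols, means)]
            self.stats[label] = (math.log(len(rows) / len(X)), means, variances)
        return self

    def _log_posterior(self, x, label):
        prior, means, variances = self.stats[label]
        return prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances))

    def predict(self, X):
        return [max(self.stats, key=lambda c: self._log_posterior(x, c))
                for x in X]


def scores(y_true, y_pred):
    """Return (accuracy, precision, recall, F1) for the positive class 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


docs = [text.lower().split() for text, _ in REVIEWS]
labels = [label for _, label in REVIEWS]
idf = fit_idf(docs)
vocab = sorted(idf)
features = [hybrid_features(doc, idf, vocab) for doc in docs]

model = GaussianNB().fit(features, labels)
predictions = model.predict(features)  # evaluated on the training data only
print(scores(labels, predictions))
```

Because the toy model is scored on the same six reviews it was trained on, its output says nothing about real-world accuracy; a faithful replication would hold out 30% of the data for testing, as the Findings describe.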

Keywords: Fake Reviews Detection; Ensemble Machine Learning; Feature Engineering; Naïve Bayes; Web Scrapping




© 2023 Goyal et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published By Indian Society for Education and Environment (iSee)

