P-ISSN 0974-6846, E-ISSN 0974-5645

Indian Journal of Science and Technology



Year: 2021, Volume: 14, Issue: 24, Pages: 2039-2050

Original Article

A Hybrid of Proposed Filtration and Feature Selections to Enhance the Model Performance

Received Date: 07 December 2020, Accepted Date: 05 July 2021, Published Date: 15 July 2021


Objectives: To extract and identify the subjective information of social media users from unstructured data; to overcome high dimensionality and sparsity, the two major challenges in sentiment analysis of text datasets; and to increase model performance using the smallest feasible feature set in a text classification problem.

Methods: We propose a new filtration method that removes correlated features and zero-importance features and is applied in addition to various feature selection methods. The feature selection methods Mutual Information, Lasso, and Recursive Feature Elimination, together with the dimensionality reduction method Principal Component Analysis (PCA), were used along with the proposed filtration to find the most relevant features. The approach was evaluated on tweets about three Indian Government schemes, which were classified using a Random Forest classifier. Performance was evaluated using accuracy, precision, recall, F1-score, log loss, and ROC-AUC.

Findings: We propose a model for selecting relevant and non-correlated feature subsets from an unstructured dataset. The model achieved an accuracy of 92% with a minimum log loss of 0.22 using the smallest feature set.

Improvements: This study shows that model performance improves when the two problems (dimensionality and sparsity) are addressed. Various feature selection methods were applied together with the proposed filtration to minimize the number of features. Reducing the number of features improves both computing time and model performance, and the effect is expected to be greater for large datasets. Even though Random Forest performs well on high-dimensional datasets, further optimization is needed.

Keywords: Mutual Information (MI); Lasso (L1); Recursive Feature Elimination (RFE); Random Forest (RF); Principal Component Analysis (PCA)
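
The filtration step described under Methods (removing correlated features and zero-importance features) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the helper name `filter_features`, the correlation threshold of 0.9, the use of Random Forest importances to detect zero-importance features, and the toy data are all assumptions that do not appear in the article.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def filter_features(X, y, corr_threshold=0.9, random_state=0):
    """Return indices of columns that survive both filters.

    Hypothetical helper: the threshold and the importance source are
    assumptions, not values taken from the article.
    """
    n_features = X.shape[1]

    # Step 1: for every highly correlated pair, drop the later column.
    # Constant columns produce NaN correlations; silence those warnings.
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.abs(np.corrcoef(X, rowvar=False))
    dropped = set()
    for i in range(n_features):
        if i in dropped:
            continue
        for j in range(i + 1, n_features):
            # NaN compares False, so constant columns are not dropped here.
            if j not in dropped and corr[i, j] > corr_threshold:
                dropped.add(j)
    kept = [i for i in range(n_features) if i not in dropped]

    # Step 2: drop columns whose importance in a fitted forest is zero.
    forest = RandomForestClassifier(n_estimators=50, random_state=random_state)
    forest.fit(X[:, kept], y)
    return [col for col, imp in zip(kept, forest.feature_importances_) if imp > 0]


# Toy data (assumed): column 1 duplicates column 0, column 2 is constant.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, a, np.zeros(200), b])
y = (a + b > 0).astype(int)

selected = filter_features(X, y)
print(selected)
```

In this toy setup the duplicated column is removed by the correlation filter and the constant column by the importance filter, leaving the two informative columns. On real TF-IDF or bag-of-words features, the surviving columns could then be passed to Mutual Information, Lasso, RFE, or PCA before training the classifier.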




© 2021 Sujatha & Radha. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published By Indian Society for Education and Environment (iSee)

