A Hybrid of Proposed Filtration and Feature Selections to Enhance the Model Performance

E Sujatha; R Radha

doi:10.17485/IJST/v14i24.2017

Indian Journal of Science and Technology

Year: 2021, Volume: 14, Issue: 24, Pages: 2039-2050

Original Article

E Sujatha¹*, R Radha²

¹Research Scholar, Research Dept of Computer Science, SDNBV College for Women, University of Madras, Chrompet, Chennai, 600 044, India
²Associate Professor, Research Dept of Computer Science, SDNBV College for Women, Chrompet, Chennai, 600 044, India

*Corresponding Author
Email: [email protected]

Received Date: 07 December 2020, Accepted Date: 05 July 2021, Published Date: 15 July 2021

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objectives: To extract and identify the subjective information of social media users from unstructured data; to overcome high dimensionality and sparsity, the two major challenges in sentiment analysis of text datasets; and to increase model performance in a text classification problem while using the smallest possible feature set.

Methods: We propose a new filtration method that removes correlated features and zero-importance features, and we apply it in addition to various feature selection methods. The feature selection methods Mutual Information, Lasso and Recursive Feature Elimination, together with the dimensionality reduction method Principal Component Analysis (PCA), were used along with the proposed filtration to find the most compelling features. The approach was evaluated on tweets about three Indian Government schemes, which were classified using a Random Forest classifier. Performance was evaluated using accuracy, precision, recall, F1 score, log loss and ROC-AUC.

Findings: We propose a model for selecting relevant and non-correlated feature subsets from an unstructured dataset. With this model, an accuracy of 92% and a minimum log loss of 0.22 were achieved using the smallest feature set.

Improvements: This study shows that model performance improves when the two problems of dimensionality and sparsity are overcome. Various feature selection methods were applied together with the proposed filtration in order to minimize the number of features. Reducing the number of features improves both computing time and model performance, and the effect is expected to be greater for large datasets. Although Random Forest performs well on high-dimensional datasets, further optimization is still needed.

Keywords: Mutual Information (MI); Lasso (L1); Recursive Feature Elimination (RFE); Random Forest (RF); Principal Component Analysis (PCA)
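The correlated-feature part of the filtration described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which correlation measure or cut-off was used, so Pearson correlation and a threshold of 0.9 are assumptions, and the function names and the toy term counts are hypothetical. The zero-importance step is omitted because it needs a fitted model such as a Random Forest.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is <= threshold.

    `columns` maps a feature name to its list of values (one per document).
    The threshold of 0.9 is an assumed value, not one taken from the paper.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy term-count matrix: 5 tweets, 4 hypothetical terms.
features = {
    "good":   [1, 0, 2, 0, 1],
    "great":  [2, 0, 4, 0, 2],  # exactly 2 * "good", so perfectly correlated
    "scheme": [1, 1, 0, 1, 1],
    "bad":    [0, 2, 0, 1, 0],
}
selected = drop_correlated(features, threshold=0.9)
print(selected)
```

In this toy data the term "great" duplicates the signal of "good" and is dropped, while the other terms stay below the assumed cut-off. In a full pipeline, the surviving features would then go through a selector such as Mutual Information, Lasso or RFE before classification.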

Copyright

© 2021 Sujatha & Radha. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published By Indian Society for Education and Environment (iSee)