Annotated corpus creation for sentiment analysis in code-mixed Hindi-English (Hinglish) social network data

Neha Garg∗, Kamlesh Sharma

doi:10.17485/IJST/v13i40.1451

Indian Journal of Science and Technology

Year: 2020, Volume: 13, Issue: 40, Pages: 4216-4224

Original Article

Neha Garg¹∗, Kamlesh Sharma²

¹ Assistant Professor, Department of Computer Science & Engineering, Manav Rachna International Institute of Research & Studies, Faridabad, Haryana, India.
² Associate Professor, Department of Computer Science & Engineering, Manav Rachna International Institute of Research & Studies, Faridabad, Haryana, India.

∗ Corresponding author:
Tel: +91-8882879289
[email protected]

Received Date: 26 September 2020, Accepted Date: 08 November 2020, Published Date: 20 November 2020

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Background: Evaluating the sentiment of tweets, blogs, comments and posts has become a crucial part of many applications. Sentiment analysis of social network data supports decision-making in application areas such as movie reviews, product feedback and the impact of a politician's speech. Users often comment in their native languages or in slang, frequently use abbreviations, and do not always follow the grammatical rules of the language. Bilingual and multilingual communities often mix two or more languages in their comments. The unavailability of annotated code-mixed data for native languages adds to the difficulty of performing sentiment analysis.
Objectives: This article presents the process of creating an annotated corpus for code-mixed social media text in Hindi, English and Hinglish collected from Twitter. Words with ambiguous meanings and inconsistent spellings in both languages are included in the study to provide a broad canvas.
Method: This study identifies the significant elements that should be considered when developing an annotated corpus from a Hindi, Hinglish and English dataset. Annotation is based on the polarity of words in three categories: positive, negative and neutral. Some words carry mixed feelings, i.e. they have both positive and negative sentiments; to handle these words, inner agreement among the polarities has been considered. Words used for sarcasm or as slang have also been taken into account, as have words with ambiguous meanings and inconsistent spellings in both languages.
Findings: The proposed work provides a standard annotated corpus for code-switched social media text in Hindi-English (Hinglish). The process of developing the corpus and calculating the polarity is shown. It is found that accuracy can be enhanced if code-mixed text is taken into account.
Application: The proposed corpus can be used in market analysis, customer behavior analysis, polling analysis, brand monitoring, etc. The corpus serves as a dataset that can be extended further according to the problem definition.
Keywords: Machine learning; sentiment analysis; data preprocessing; data cleaning; code-switch; linguistic-switching; multilingual
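
As an illustration of the word-polarity annotation described in the Method, the following is a minimal Python sketch and not the authors' implementation. The lexicon entries, the spelling-variant table, the function names and the tie-breaking rule (equal positive and negative counts are labeled neutral) are all assumptions made for this example; the article does not specify them.

```python
from collections import Counter

# Hypothetical table mapping inconsistent romanized spellings to one canonical form.
SPELLING_VARIANTS = {
    "acha": "accha",
    "achha": "accha",
    "gud": "good",
    "bekaar": "bekar",
}

# Hypothetical toy polarity lexicon covering Hindi (romanized) and English words.
# +1 marks a positive word, -1 a negative word; unlisted words count as neutral.
POLARITY = {
    "accha": 1,
    "good": 1,
    "love": 1,
    "bekar": -1,
    "bad": -1,
    "boring": -1,
}


def normalize(token):
    """Lowercase a token, strip surrounding punctuation, unify spelling variants."""
    token = token.lower().strip(".,!?\"'")
    return SPELLING_VARIANTS.get(token, token)


def annotate(text):
    """Label a code-mixed sentence as positive, negative or neutral.

    Assumption: the label follows the majority word polarity, and a tie
    (including a sentence with mixed feelings) is labeled neutral.
    """
    tokens = [normalize(t) for t in text.split()]
    counts = Counter(POLARITY[t] for t in tokens if t in POLARITY)
    positive, negative = counts[1], counts[-1]
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for tweet in [
        "Movie bahut achha tha, I love it!",
        "Yeh phone bilkul bekaar hai",
        "Story good thi but ending boring",
    ]:
        print(tweet, "->", annotate(tweet))
```

Normalizing spelling before the lexicon lookup is one simple way to handle the inconsistent spellings mentioned above; sarcasm and ambiguous words would need context that a word-level lookup like this cannot capture.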

Copyright

© 2020 Garg & Sharma. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published By Indian Society for Education and Environment (iSee).