• P-ISSN 0974-6846 E-ISSN 0974-5645

Indian Journal of Science and Technology



Year: 2020, Volume: 13, Issue: 40, Pages: 4216-4224

Original Article

Annotated corpus creation for sentiment analysis in code-mixed Hindi-English (Hinglish) social network data

Received Date: 26 September 2020, Accepted Date: 08 November 2020, Published Date: 20 November 2020


Background: Evaluating the sentiment of tweets, blogs, comments and posts has become a crucial part of many applications. Sentiment analysis of social network data supports decision-making in application areas such as movie reviews, product feedback and assessing the impact of a politician's speech. Users often comment in their native languages or in slang, frequently use abbreviations, and do not always follow the grammatical rules of the language. Bilingual and multilingual communities often mix two or more languages in their comments. The lack of annotated code-mixed data for native languages adds to the difficulty of performing sentiment analysis.
Objectives: This article presents the process of creating an annotated corpus of code-mixed social media text in Hindi, English and Hinglish collected from Twitter. Words with ambiguous meanings and inconsistent spellings in both languages are included in the study to give broad coverage.
Method: This study identifies the significant elements that should be considered while developing an annotated corpus from a Hindi, Hinglish and English dataset. Annotation is based on the polarity of words in three categories: positive, negative and neutral. Some words carry mixed feeling, i.e. they have positive as well as negative sentiment; to handle these words, inner agreement among the polarities has been considered. Words used for sarcasm and slang have also been taken into account, as have words with ambiguous meanings and inconsistent spellings in both languages.
Findings: The proposed work provides a standard annotated corpus for code-switched social media text in Hindi-English (Hinglish). The process of developing the corpus and calculating the polarity is shown. It is found that accuracy can be enhanced if code-mixed text is considered.
Application: The proposed corpus can be used in market analysis, customer behavior analysis, polling analysis, brand monitoring, etc. The corpus serves as a dataset that can be extended further according to the problem definition.
Keywords: Machine learning; sentiment analysis; data preprocessing; data cleaning; code-switch; linguistic-switching; multilingual 
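The word-level polarity annotation described in the Method could be sketched roughly as follows. This is a minimal illustration, not the authors' procedure. The lexicon entries, the spelling-variant table, the `mixed` label for tied votes and all function names are hypothetical assumptions introduced here; the abstract does not specify them.

```python
from collections import Counter

# Hypothetical toy lexicon of romanized Hindi and English words (assumption,
# not taken from the article's corpus).
LEXICON = {
    "accha": "positive",
    "good": "positive",
    "love": "positive",
    "bekar": "negative",
    "bad": "negative",
    "movie": "neutral",
    "hai": "neutral",
}

# Hypothetical map from inconsistent spellings to one canonical form.
SPELLING_VARIANTS = {
    "acha": "accha",
    "achha": "accha",
    "bekaar": "bekar",
    "h": "hai",
}


def normalize(token):
    """Lowercase, strip surrounding punctuation, and canonicalize the spelling."""
    token = token.lower().strip(".,!?")
    return SPELLING_VARIANTS.get(token, token)


def word_polarities(text):
    """Tag each whitespace-separated token; unknown words default to neutral."""
    return [(tok, LEXICON.get(normalize(tok), "neutral")) for tok in text.split()]


def sentence_polarity(text):
    """Compare positive and negative word counts; a tie yields neutral."""
    counts = Counter(label for _, label in word_polarities(text))
    if counts["positive"] > counts["negative"]:
        return "positive"
    if counts["negative"] > counts["positive"]:
        return "negative"
    return "neutral"


def resolve_label(votes):
    """Majority label among several polarity votes for one word.

    A tie between the top labels is reported as "mixed" (an assumed convention
    for words carrying both positive and negative sentiment).
    """
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "mixed"
    return ranked[0][0]
```

For example, `sentence_polarity("Movie bahut achha hai!")` normalizes `achha` to `accha` before the lookup, so the spelling variant still contributes a positive vote. A real corpus would need a far larger lexicon and human annotators rather than a lookup table.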




© 2020 Garg & Sharma. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published By Indian Society for Education and Environment (iSee).

