Sentiment Analysis: A Comprehensive Overview and the State of Art Research Challenges

Introduction
The big data generated by microblogging sites attracts many communities that explore its hidden content to extract valuable information. Sentiment analysis is one such research area; it concentrates on identifying subjective information in a given piece of text. Opinions expressed in social media serve as a major input for gauging public outlook in areas such as product purchases, stock market prediction and movie reviews. Such web-generated content plays a major role in mining user sentiment for customer relationship management and public opinion tracking. Sentiment analysis is a natural language processing technique that uses computational linguistics and text mining to identify the polarity of a text as positive, negative or neutral. It has also been defined as an automated knowledge discovery technique that identifies hidden patterns in reviews, blogs and tweets 1 .
In recent years, companies have started to use sentiment analysis as part of their research. Apart from the data received from social networking sites, companies create their own websites to gather reviews of their products. By mining these reviews, they are able to build better customer relationships and to create recommendation systems based on the positive and negative feedback from customers. Another advantage of sentiment analysis is that companies can develop their marketing strategies by predicting the public attitude towards their products. Numerous companies have already developed tools that crawl online information and summarize it as graphical representations of recent trends 2 . Sentiment analysis is broadly classified into three categories, namely document level, sentence level and aspect based. The goal of document-level sentiment classification is to determine the overall sentiment of a given review document 3 . Sentence-level 4 analysis focuses on categorizing each sentence as subjective or objective. The aspect-based 5 approach is more fine-grained: it splits the document into various aspects (entities), and sentiment analysis is carried out on each entity to find the overall polarity.
A detailed study 6 has been carried out on the various techniques for sentiment analysis. It reviewed recent work across these techniques and also discussed some feature selection methods and fields related to sentiment analysis. A detailed survey 7 presents the various applications and challenges in sentiment analysis. Another survey 8 covers the latest trends in sentiment analysis. Together, these works cover sentiment analysis from its evolution through to multimodal sentiment analysis.
The rest of this paper is organized as follows. Section 2 outlines some of the common sentiment analysis tasks, section 3 gives an overview of sentiment analysis across various domains, section 4 discusses the challenges in this field and section 5 concludes the paper.

Common Sentiment Analysis Tasks
The main tasks involved in sentiment analysis are subjectivity detection, feature selection for sentiment classification, and sentiment classification itself.

Subjectivity Detection
Subjectivity detection is the process of identifying subjective sentences. Sentences can be classified as subjective or objective. Subjectivity indicates that the text bears opinion content, whereas objectivity indicates that the text is without opinion content. For example, "This movie is superb." is a subjective sentence, since it expresses the writer's feeling about the movie. "Fruits are good for health" states a fact, general information rather than the opinion or view of some individual, and hence it is objective. Sentence-level sentiment analysis deals with the process of subjectivity detection. Various approaches such as bootstrapping 9,10 , conditional random fields 11 and the Viterbi algorithm 12 are used for subjectivity detection. An SVM (trained with the sequential minimal optimization algorithm and a polynomial kernel) has been used for classification 13 . Objective words from SentiWordNet have also been used to improve sentiment classification.

Features for Sentiment Classification
Feature engineering is one of the basic and most important steps in sentiment classification. English sentences must be converted into feature vectors before sentiment classification can be performed. The most commonly used features are term presence and frequency 14 , n-grams 15,16 , negation, adjectives, adverb-adjective combinations and the Gini index 17 . Feature selection methods fall into two categories: lexicon-based and statistical. Some of the statistical feature selection methods are point-wise mutual information (PMI) 18,19 , chi-square 20 and latent semantic indexing (LSI).
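The features named above can be illustrated with a minimal Python sketch. It is not taken from any of the cited works: the tokenizer, the helper names (tokenize, ngrams, term_frequency, term_presence, pmi) and the four toy documents are assumptions made purely for illustration, and PMI is estimated here from document-level presence counts, which is only one of several possible estimators.

```python
import math
from collections import Counter

PUNCT = ".,!?;:\"'"

def tokenize(text):
    """Lowercase, split on whitespace and strip simple punctuation (a deliberately crude tokenizer)."""
    stripped = (tok.strip(PUNCT) for tok in text.lower().split())
    return [tok for tok in stripped if tok]

def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def term_frequency(tokens):
    """Term frequency: how often each term occurs in the document."""
    return Counter(tokens)

def term_presence(tokens):
    """Term presence: 1 for every term that occurs at least once."""
    return {tok: 1 for tok in set(tokens)}

def pmi(term, label, docs):
    """PMI(term, label) = log2( P(term, label) / (P(term) * P(label)) ).

    docs is a list of (tokens, label) pairs; probabilities are estimated from
    document-level presence counts. Returns None when any count is zero.
    """
    n = len(docs)
    n_term = sum(1 for toks, _ in docs if term in toks)
    n_label = sum(1 for _, lab in docs if lab == label)
    n_both = sum(1 for toks, lab in docs if lab == label and term in toks)
    if 0 in (n, n_term, n_label, n_both):
        return None
    return math.log2((n_both / n) / ((n_term / n) * (n_label / n)))

# Hypothetical toy corpus, labeled by hand for illustration only.
docs = [
    (tokenize("This movie is superb."), "pos"),
    (tokenize("Superb acting and a superb plot"), "pos"),
    (tokenize("The plot is dull"), "neg"),
    (tokenize("Dull and slow movie"), "neg"),
]

bigrams = ngrams(docs[0][0], 2)          # word pairs from the first document
tf = term_frequency(docs[1][0])          # "superb" is counted twice
superb_pmi = pmi("superb", "pos", docs)  # positive: "superb" co-occurs with "pos"
movie_pmi = pmi("movie", "pos", docs)    # zero: "movie" is spread evenly over both classes
```

In this toy corpus a term that appears only in positive documents receives a positive PMI with the positive class, while a term spread evenly across both classes scores zero, which is why PMI can serve as a filter for uninformative features.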

Sentiment Classification
Sentiment classification techniques are broadly classified into three categories: machine learning methods, lexicon-based approaches and hybrid approaches. The machine learning approach applies machine learning algorithms to the sentiment analysis problem. These algorithms are broadly classified as supervised 21,22 and unsupervised 23,24 . Widely used algorithms include the Naïve Bayes classifier 25 , Support Vector Machines (SVM) 26-28 , neural networks 29 , conditional random fields (CRF) 30-32 and rule-based classifiers 33,34 . Lexicon-based approaches include dictionary-based 35,36 and corpus-based 37,38 methods. The hybrid approach is at an early stage, and not much work has been done on the topic.
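As an illustration of the supervised route, the following is a minimal sketch of a multinomial Naïve Bayes classifier with add-one (Laplace) smoothing, written from scratch in Python. The class name NaiveBayesSentiment, the whitespace tokenization and the four training sentences are assumptions made for this example; they are not drawn from the cited works, and a real system would use a far larger labeled corpus.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSentiment:
    """Multinomial Naive Bayes over whitespace tokens with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing constant (1.0 = Laplace)
        self.class_docs = Counter()              # number of training documents per class
        self.word_counts = defaultdict(Counter)  # per-class token counts
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.class_docs[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, float("-inf")
        for label in sorted(self.class_docs):
            # log P(class) + sum of log P(token | class), smoothed
            score = math.log(self.class_docs[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + self.alpha * len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical toy training data, for illustration only.
train_texts = [
    "superb movie great acting",
    "great plot superb cast",
    "dull movie poor acting",
    "poor plot boring cast",
]
train_labels = ["pos", "pos", "neg", "neg"]

model = NaiveBayesSentiment().fit(train_texts, train_labels)
first = model.predict("great cast")         # "great" was seen only in positive documents
second = model.predict("boring dull plot")  # "boring" and "dull" were seen only in negative ones
```

Working in log space avoids numerical underflow on long documents, and the smoothing term keeps a word that was never seen in one class from forcing that class's probability to zero.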

Study of Sentiment Analysis Application Across Various Domains
Sentiment analysis is an active research topic, and work has been done in a variety of domains. A few examples are interpreting public sentiment variation 39 , classifying customer reviews as positive or negative, detecting internet hotspots 40 , and predicting stock market behavior.

Movie Reviews
Sentiment analysis has been carried out extensively on movie reviews. Such analysis has a considerable bearing on a movie's success, since people increasingly choose to watch movies that have received good reviews. The data is taken from benchmark sources such as IMDB and rottentomatoes.com. Several of the works carried out in this domain report positive results 41-45 .

Product Reviews
Sentiment analysis is mostly used by marketing companies to increase the sales of their products. It has been carried out for many products, such as the iPhone, cameras, hardware components, printers and scanners. Apart from products, many works address restaurant reviews, in which aspects of the restaurant such as food and service are reviewed. The review data is mainly taken from social networking sites such as Twitter and Facebook and from review sites created by the respective companies 46-49 .

Stock Market
Sentiment analysis is also useful in the stock market for predicting the performance of shares. The data is collected from the Yahoo Finance discussion board and other networking sites. In general, shares are categorized into five categories and weights are assigned to them accordingly: 2 for "Strong Buy", 1 for "Buy", 0 for "Hold", -1 for "Sell" and -2 for "Strong Sell" 50 .
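The five-category weighting can be written down directly. In the sketch below only the weights come from the text; the name RATING_WEIGHTS, the function average_rating_score and the choice of a simple mean as the aggregate are assumptions made for illustration, since the aggregation used in the cited work is not described here.

```python
# Weights for the five recommendation categories, as listed in the text.
RATING_WEIGHTS = {
    "Strong Buy": 2,
    "Buy": 1,
    "Hold": 0,
    "Sell": -1,
    "Strong Sell": -2,
}

def average_rating_score(ratings):
    """Mean weight of a list of category labels (the mean is an assumed aggregate).

    Returns 0.0 for an empty list, i.e. a neutral "Hold"-like score.
    """
    if not ratings:
        return 0.0
    return sum(RATING_WEIGHTS[r] for r in ratings) / len(ratings)

# Hypothetical ratings extracted from four message-board posts about one share.
posts = ["Strong Buy", "Buy", "Hold", "Sell"]
score = average_rating_score(posts)  # positive values lean towards "Buy"
```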

Crime Analysis
A preliminary work 51 has been carried out on predicting crime with sentiment analysis techniques. In that work, the author carried out spatio-temporal mining to identify the crimes that are happening in various fields. Linguistic analysis and statistical topic modeling were used to automatically identify discussion topics across a major city in the United States, and these topics were then incorporated into the crime prediction model.

Disaster Recovery
A number of works have analyzed the mood of people during crises and disasters. Some of them examine how social networking sites are used during disasters. Such analyses help in reaching out to people in need: voluntary organizations can read the data and render help to them. The disasters analyzed include earthquakes and typhoons 52 .

Challenges
Sentiment analysis is a growing field, yet many research challenges still need to be addressed. Some of the open challenges in text mining are summarized as follows.
• Negation 53 is very important because it changes the polarity of the text. Negation terms affect the contextual polarity of words, but the presence of a negation word in a sentence does not mean that all of the sentiment-bearing words are inverted. Negation is conveyed not only by common negation words (not, never, no) but also by other lexical units.
• Another major hurdle is the handling of anaphora resolution. Here anaphora means that different phrases refer to the same thing. This problem mainly occurs while grouping the entities in aspect-based sentiment analysis. For example, "battery life" and "power usage" refer to the same aspect of a phone, so sentiments about both should be combined in order to produce accurate results.
• Word arrangement in a sentence plays a vital role in identifying the subjective nature of the text. Word order is important in deciding the polarity of a text: if the order of the words changes, the polarity of the text is affected.
• Implicit sentiment and sarcasm: A sentence may carry an implicit sentiment without containing any sentiment-bearing words. For example, in "How can you do this?" none of the words expresses a negative opinion, but the meaning of the sentence is negative. Thus identifying semantics is very important in sentiment analysis.
• Spam detection: Anyone from any location can express their views in social media without disclosing their true identity. As a result, many fake reviews are written in order to promote the sales of a product. Such activity is called opinion spamming. Apart from individuals, there are also commercial companies in the business of spreading fake information. Identifying such opinion spam in order to extract the true sentiment is a challenging task.
• Conjunctions: The presence of a conjunction can change the meaning of the entire sentence. For example: "The restaurant was very nice, but the service was poor". This sentence splits into two parts. Analyzing the first part alone yields a positive sentiment, but the clause that follows the conjunction reverses the overall meaning of the sentence. So conjunctions should be taken into account in sentiment analysis.
• Co-reference resolution is one of the biggest research challenges. It has to be done at both the aspect level and the entity level, and it is especially relevant where comparative texts are used. References between sentences must be resolved effectively in order to produce better analysis. For example, consider the opinionated text "Comparing Nikon's Cool pix to its main competitor the Canon, it takes excellent photos and is quite compact". Here the pronoun "it" refers to "Nikon Cool pix". If this co-reference is not identified correctly, sentiment analysis cannot be carried out effectively 54 .
• Domain adaptation is another important aspect of sentiment analysis. Most of the available sentiment lexicons are general-purpose, so it is important to study ways of adapting them to a specific domain. There are three main issues in this regard. The first is that the same entity term can have different polarity in different domains. The second is assigning a strength marker to each sentiment word. The third is the difference in vocabulary across domains, which makes sentiment analysis a domain-dependent application.
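The negation and conjunction challenges listed above can be made concrete with a small lexicon-based Python sketch. Everything in it is an assumption made for illustration: the six-word LEXICON, the three-token NEGATION_WINDOW, and the rule of splitting only on the conjunction "but". It is a naive heuristic, not a method from the cited works.

```python
import re

# Hypothetical miniature polarity lexicon (+1 positive, -1 negative).
LEXICON = {"nice": 1, "good": 1, "excellent": 1, "poor": -1, "bad": -1, "dull": -1}
NEGATORS = {"not", "never", "no"}
NEGATION_WINDOW = 3  # assumed scope: a negator affects only the next 3 tokens

def clause_score(clause):
    """Sum lexicon polarities, flipping those that fall inside a negation window."""
    tokens = re.findall(r"[a-z']+", clause.lower())
    score = 0
    negate_left = 0
    for tok in tokens:
        if tok in NEGATORS:
            negate_left = NEGATION_WINDOW
            continue
        polarity = LEXICON.get(tok, 0)
        if polarity and negate_left > 0:
            polarity = -polarity
        score += polarity
        if negate_left > 0:
            negate_left -= 1
    return score

def clause_scores(sentence):
    """Split a sentence on the conjunction 'but' and score each clause separately."""
    clauses = re.split(r"\bbut\b", sentence, flags=re.IGNORECASE)
    return [clause_score(c) for c in clauses if c.strip()]

restaurant = clause_scores("The restaurant was very nice, but the service was poor")
negated = clause_score("The food was not good")
out_of_scope = clause_score("No doubt the camera is excellent")
```

Scoring the restaurant sentence clause by clause keeps the positive and negative parts visible instead of letting them cancel in a single sum, and the limited window reflects the point that a negation word does not invert every sentiment word in the sentence. A fixed window is still a crude approximation of negation scope, which is part of why negation remains an open challenge.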

Conclusion and Future work
This paper has presented a brief survey of various aspects of sentiment analysis. The Naïve Bayes classifier and Support Vector Machines are the most commonly used methods for sentiment classification. Most of the research concentrates on the English language; it can be extended to other languages to understand regional trends. In sentiment classification, machine learning and lexicon-based approaches are the most widely used, but the hybrid approach needs to be explored further for better results. Sentiment analysis is a domain-dependent problem, and work has so far been carried out in only a few domains, so quite a lot of domains remain to be explored.