Sentiment analysis is an ongoing area of knowledge discovery. It is the process of determining people’s thoughts, opinions, and feelings about a given topic or item. Sentiment analysis is highly domain-dependent.
Sentiment analysis aims to categorize product reviews or opinions as good or bad in order to measure the overall customer sentiment behind a brand. However, most sentiment analysis work uses simple terms to express sentiment about a product or service. “There are three classification levels in sentiment analysis: document level, sentence level, and aspect level. Document-level sentiment analysis aims to classify an opinion document as expressing a positive or negative opinion or sentiment; it considers the whole document a basic information unit. Sentence-level sentiment analysis aims to classify the emotion expressed in each sentence. Aspect-level sentiment analysis aims to classify sentiment with respect to specific aspects of entities.”
With the development of e-commerce and social networking websites, social media generates huge amounts of data. Opinions shared on social media websites play a prominent part in the analysis of the business industry. As the number of posts expands at a rapid rate, analyzing comments written on social media becomes complex and difficult for the customer.
Consequently, we need an opinion mining system that allows for the collection of comments. Companies can use such a system to determine how people perceive their product and how it matches up to the competition. It is natural for buyers to rely on the comments and knowledge of others when purchasing items.
The key objective of this study is to evaluate the performance of ensemble methods and neural network learning on various combinations of unigram, bigram, and trigram feature vectors, along with attribute selection and attribute reduction, for opinion classification of movie reviews.
This paper consists of five sections. The introduction and literature survey are discussed in Section 1. The methodology is covered in Section 2. Section 3 discusses performance validation. Section 4 contains the study outcomes and discussion. Section 5 concludes the study.
Ensemble learning methods such as bagging and AdaBoost are used to enhance accuracy in classifying the review dataset. The movie review dataset is collected and the opinions are categorized into good or bad class labels. The opinions are preprocessed and attributes are extracted. The extracted attributes are grouped into unigrams, bigrams, and trigrams. After preprocessing, IG and PCA are used to reduce the dimensionality of the movie review dataset. An analysis compares the outcomes of the ensemble approaches and the neural network method and identifies the best-performing combination. The methodology for testing the different classification models is described below.
Data preprocessing is performed to remove irrelevant data.
Attribute vector spaces are developed: model m-i with unigrams; model m-ii with unigrams and bigrams; and model m-iii with unigrams, bigrams, and trigrams.
IG is applied to models m-i, m-ii, and m-iii to produce selected feature sets.
PCA is applied to models m-i, m-ii, and m-iii to produce reduced feature sets.
The following classification models are developed with models m-i, m-ii, and m-iii:
Support vector machine (SVM) model
Naive Bayes model
Bagging with SVM model
Bagging with Naive Bayes model
AdaBoost with SVM model
AdaBoost with Naive Bayes model
Neural network (backpropagation) model
The class label of each user opinion in the testing dataset is predicted.
The accuracy and other performance measures are computed from the final table values.
The performance values of all models are compared.
In the preprocessing step, the movie review dataset is preprocessed. Preprocessing includes tokenization, conversion of upper case to lower case, stop-word removal, stemming, and calculation of a TF-IDF value for each feature in the review dataset. Tokenization splits a sentence into multiple words. Case transformation converts upper-case characters in a review into lower case. Stop words are words that carry no meaning in the review dataset and are removed. Stemming removes suffixes from words in the dataset; the suffix characters are removed to reduce word length. TF-IDF is a statistical measure that represents how important a feature is for the reviews. The number of times a feature occurs in a review is called the term frequency. The inverse document frequency is a statistical weight that measures the importance of a feature across the review data, and it can be calculated as follows:
IDF(w) = log(D / DW), where D is the number of opinion reviews and DW is the number of reviews in which term w is present. TF-IDF is then calculated as TF-IDF(w) = TF(w) × IDF(w).
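The TF-IDF weighting described above can be sketched in pure Python as follows (a minimal illustration using raw term frequency and a natural-log IDF; Weka and other tools may apply smoothing or a different log base):

```python
import math

def tf_idf(reviews):
    """Compute TF-IDF weights for each term in each review.

    reviews: list of token lists. Uses raw term frequency and
    IDF(w) = log(D / DW), where D is the number of reviews and
    DW the number of reviews containing term w.
    """
    D = len(reviews)
    # document frequency: number of reviews containing each term
    df = {}
    for tokens in reviews:
        for w in set(tokens):
            df[w] = df.get(w, 0) + 1
    weights = []
    for tokens in reviews:
        tf = {}
        for w in tokens:
            tf[w] = tf.get(w, 0) + 1
        weights.append({w: tf[w] * math.log(D / df[w]) for w in tf})
    return weights

# toy corpus of three tokenized reviews
docs = [["good", "movie"], ["bad", "movie"], ["good", "acting"]]
w = tf_idf(docs)
```

A term such as "acting" that appears in only one of the three reviews receives a higher IDF than "movie", which appears in two.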
In general, attribute selection and attribute reduction methods are used to reduce the dimension of the feature vectors.
IG is an attribute selection method. The preprocessed review data are given as input for feature selection. It reduces bias towards multi-valued attributes by taking the number and size of branches into account when choosing an attribute. It is applicable to attributes that can take a large number of distinct values and might otherwise fit the training set too closely. It is often used to select the attributes that are most informative in the dataset.
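The information gain of a single binary attribute (term present or absent) against the class label can be sketched as follows (an illustrative pure-Python computation, not the Weka implementation used in the study):

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(presence, labels):
    """IG of a binary attribute with respect to the class label.

    presence: 0/1 flags (term absent/present), one per review.
    labels: class label of each review.
    """
    n = len(labels)
    gain = entropy(labels)
    for v in (0, 1):
        subset = [y for p, y in zip(presence, labels) if p == v]
        if subset:
            gain -= (len(subset) / n) * entropy(subset)
    return gain

labels = ["pos", "pos", "neg", "neg"]
# a term that perfectly separates the classes has IG of 1 bit
perfect = info_gain([1, 1, 0, 0], labels)
# a term present in every review carries no information
useless = info_gain([1, 1, 1, 1], labels)
```

Attributes whose gain falls below a threshold (the study uses 0.002) would then be dropped from the feature set.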
PCA is a commonly used statistical technique. In PCA, the original data space is mapped into standardized transformation matrices. The covariance matrix for each word in the review dataset is calculated, and the correlation matrix is derived from the covariance matrix. The correlation matrix shows how strongly the words in the dataset are correlated. Its values range from -1 to 1, and it is a symmetric matrix. A correlation value of -1 indicates a perfect negative correlation between attributes, a value of +1 indicates a perfect positive correlation, and a value of 0 indicates that the words are uncorrelated.
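The correlation entries underlying the PCA step can be illustrated with the Pearson coefficient between two word-frequency vectors (a hedged sketch with made-up frequencies, not the study's data):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

# hypothetical frequencies of two words across five reviews
r_pos = pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])   # perfectly positive
r_neg = pearson([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])   # perfectly negative
```

PCA then keeps the principal components whose eigenvalues exceed the chosen stopping rule (an eigenvalue greater than 1 in this study).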
The bagging algorithm generates a series of models, each of which provides a prediction with equal weight. Each classifier is trained on a bootstrap sample of review words drawn from the training set and predicts the class label for each word. Each classifier then casts its vote, and the majority vote determines the predicted class label for an unknown sample X, where M is the number of models used to construct the classifier and P is the predicted output value of each classifier.
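The bootstrap-and-vote procedure can be sketched as follows (a toy illustration: the nearest-feature base learner and the data are hypothetical stand-ins for the SVM/NB base learners used in the study):

```python
import random

def majority_vote(votes):
    """Return the label predicted by most base classifiers."""
    return max(set(votes), key=votes.count)

def bagging_predict(train, x, n_models=5, seed=0):
    """Toy bagging sketch: each base learner is 'trained' on a bootstrap
    sample of the training data and predicts the label of the nearest
    training feature; the ensemble returns the majority vote.

    train: list of (feature, label) pairs with 1-D numeric features.
    """
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # bootstrap sample: draw len(train) examples with replacement
        sample = [rng.choice(train) for _ in range(len(train))]
        nearest = min(sample, key=lambda fy: abs(fy[0] - x))
        votes.append(nearest[1])
    return majority_vote(votes)

train = [(0.1, "neg"), (0.2, "neg"), (0.9, "pos"), (1.0, "pos")]
pred = bagging_predict(train, 0.95)
```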
The AdaBoost algorithm works as follows. Given a training review dataset R of training words X = {X1, X2, ..., Xn} and labels Y = {positive, negative}, AdaBoost assigns each training example an equal initial weight of 1/n. For each iteration k (here k = 10), training words from the movie review dataset R are sampled, a model M is derived using a base classifier, and the error of each classifier M is calculated.
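A single boosting round, with the standard weight-update rule, can be sketched as follows (an illustrative pure-Python version of the generic AdaBoost update, not the Weka implementation):

```python
import math

def adaboost_round(weights, correct):
    """One AdaBoost round.

    weights: current example weights (summing to 1).
    correct: booleans marking which examples the base classifier got right.
    Returns the classifier's vote weight alpha and the re-normalized
    example weights for the next round.
    """
    err = sum(w for w, c in zip(weights, correct) if not c)
    alpha = 0.5 * math.log((1 - err) / err)
    # raise the weight of misclassified examples, lower the rest
    new_w = [w * math.exp(-alpha if c else alpha)
             for w, c in zip(weights, correct)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

n = 4
weights = [1 / n] * n            # AdaBoost starts with equal weights 1/n
# suppose the base classifier misclassifies only the last example
alpha, weights = adaboost_round(weights, [True, True, True, False])
```

After the update, the misclassified example carries half of the total weight, so the next base classifier concentrates on it.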
The support vector machine (SVM) is an effective classification method for both linear and non-linear sentiment datasets. The SVM is used to evaluate the opinion data used for classification, where Xi is the set of training words in the movie reviews and Yi is the good or bad class label.
Naive Bayes (NB) is a supervised learning classification algorithm. Let R be the movie review dataset with class labels positive and negative. Each review in R is represented by an n-dimensional attribute vector X = {X1, X2, ..., Xn}. The prior probability of each class label, P(Cpositive) and P(Cnegative), is computed from the training words, along with the conditional probabilities P(X|Cpositive) and P(X|Cnegative); for each attribute vector, the product P(X|Ci)P(Ci) is maximized.
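The maximization of P(X|Ci)P(Ci) can be sketched as a small multinomial NB classifier (a toy illustration with made-up reviews and Laplace smoothing, which is a common choice but not stated in the study):

```python
import math

def train_nb(reviews):
    """Train a multinomial naive Bayes model.

    reviews: list of (tokens, label) pairs.
    Returns class priors, per-class word counts, and the vocabulary.
    """
    priors, counts, vocab = {}, {}, set()
    for tokens, label in reviews:
        priors[label] = priors.get(label, 0) + 1
        bucket = counts.setdefault(label, {})
        for w in tokens:
            bucket[w] = bucket.get(w, 0) + 1
            vocab.add(w)
    n = len(reviews)
    priors = {c: k / n for c, k in priors.items()}
    return priors, counts, vocab

def classify_nb(model, tokens):
    """Pick the class maximizing log P(Ci) + sum of log P(w|Ci)."""
    priors, counts, vocab = model
    best, best_score = None, float("-inf")
    for c, prior in priors.items():
        total = sum(counts[c].values())
        score = math.log(prior)
        for w in tokens:
            # Laplace smoothing avoids zero probabilities for unseen words
            score += math.log((counts[c].get(w, 0) + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

model = train_nb([(["good", "fun"], "positive"),
                  (["great", "good"], "positive"),
                  (["bad", "boring"], "negative"),
                  (["awful", "bad"], "negative")])
pred = classify_nb(model, ["good", "great"])
```

Working in log space keeps the product of many small conditional probabilities numerically stable.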
The backpropagation algorithm is a neural network learning technique. It learns by iteratively processing the training words in the review dataset, comparing the network's prediction for each word with its class label (positive/negative). The words in the review dataset are passed through the input neurons to a second layer of hidden neurons. The input to each hidden unit is a weighted sum of the input neurons with the input-level weights; the weighted output of the hidden units is fed to the output nodes, which produce the network's prediction for the given word.
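The forward pass and one weight update can be sketched as follows (a minimal single-output network with sigmoid units and made-up initial weights; the study's Weka network uses momentum 0.2 and learning rate 0.3, omitted here for brevity):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Forward pass: input layer -> hidden layer -> single output neuron."""
    hidden = [sigmoid(sum(wi * xi for wi, xi in zip(row, x)))
              for row in w_hidden]
    out = sigmoid(sum(wo * h for wo, h in zip(w_out, hidden)))
    return hidden, out

def backprop_step(x, target, w_hidden, w_out, lr=0.5):
    """One backpropagation update on a single training example."""
    hidden, out = forward(x, w_hidden, w_out)
    # output-layer error term (sigmoid derivative is out * (1 - out))
    delta_out = (target - out) * out * (1 - out)
    # hidden-layer error terms, propagated back through w_out
    delta_h = [h * (1 - h) * w_out[j] * delta_out
               for j, h in enumerate(hidden)]
    w_out = [wo + lr * delta_out * h for wo, h in zip(w_out, hidden)]
    w_hidden = [[wij + lr * delta_h[j] * xi for wij, xi in zip(row, x)]
                for j, row in enumerate(w_hidden)]
    return w_hidden, w_out

x, target = [1.0, 0.0], 1.0              # toy word vector, label 1 = positive
w_hidden = [[0.1, -0.2], [0.3, 0.4]]     # two hidden neurons
w_out = [0.2, -0.1]
_, before = forward(x, w_hidden, w_out)
for _ in range(100):
    w_hidden, w_out = backprop_step(x, target, w_hidden, w_out)
_, after = forward(x, w_hidden, w_out)
```

Repeated updates move the network's output toward the target label, which is the iterative comparison the paragraph above describes.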
A confusion matrix summarizes the actual and predicted classifications.
Actual \ Predicted | Negative | Positive
Negative | True Negative (TN) | False Positive (FP)
Positive | False Negative (FN) | True Positive (TP)
Accuracy is defined as the proportion of all predictions that are correct. Precision is defined as the ratio of predicted positive words that are actually positive, and recall as the ratio of positive words that are correctly identified. F-measure is the harmonic mean of precision and recall. These performance measures can be calculated from the confusion matrix.
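The four measures follow directly from the confusion matrix cells; as a sketch (the cell counts below are illustrative, not the study's results):

```python
def metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall, and F-measure
    from the four confusion matrix cells."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# hypothetical counts: 80 TN, 20 FP, 15 FN, 85 TP
acc, prec, rec, f1 = metrics(tn=80, fp=20, fn=15, tp=85)
```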
In this section, comparison studies are performed with two feature selection and reduction techniques (IG and PCA), two classifiers (SVM and NB), two ensemble methods, and backpropagation. These methods are discussed in the methodology section. The features are extracted from the movie review dataset, and the attribute vector model is developed with the TF-IDF measure. The attributes are grouped as unigrams, bigrams, and trigrams. The impact of attribute size is identified in the three models, and their features are shown in the tables below.
The Weka tool is used to compute the PCA for models m-i, m-ii, and m-iii. The stopping rule used is an eigenvalue greater than 1. The number of principal components is reduced to 165, 150, and 147 for models m-i, m-ii, and m-iii, respectively. A cumulative variance of 30% is obtained for the m-i, m-ii, and m-iii feature vectors. The variance percentage is also reduced based on the chosen stopping rule. The information gain values for models m-i, m-ii, and m-iii are identified by setting a threshold value of 0.002.
SVM and NB are the base learners used in the ensemble learning techniques to classify the training and testing datasets. The SVM uses a normalized polynomial kernel with the default kernel parameter value. The backpropagation algorithm is used for neural network learning, with the momentum parameter set to 0.2, the learning rate set to 0.3, and the training time set to 50.
Based on the workflow, the performance of the different classification algorithms is analyzed on various combinations of unigram, bigram, and trigram attribute vectors. In this analysis, accuracy is used to evaluate the various approaches on the movie review dataset. All experiments are verified through tenfold cross-validation.
Properties | Model m-i | Model m-ii | Model m-iii
Number of reviews | 2000 | 2000 | 2000
Positive reviews | 1000 | 1000 | 1000
Negative reviews | 1000 | 1000 | 1000
Number of features | 1160 | 1138 | 1144
Type | Unigram | Bigram | Trigram
Properties | Model m-i | Model m-ii | Model m-iii
Number of features | 1160 | 1138 | 1144
Variance covered | 0.37 | 0.37 | 0.37
Threshold value | 0.002 | 0.002 | 0.002
Number of principal components | 165 | 150 | 147
Number of features in IG | 168 | 153 | 150
Model/ Performance | IG | PCA | ||||||||||
SVM | NB | SVM | NB | |||||||||
m-i | m-ii | m-iii | m-i | m-ii | m-iii | m-i | m-ii | m-iii | m-i | m-ii | m-iii | |
Accuracy | 84.10 | 83.65 | 84.40(↑) | 79.60 | 81.50 | 82.00 | 84.05 | 83.90 | 83.70 | 81.50 | 79.50 | 80.05 |
Precision | 83.20 | 84.10 | 84.70 | 81.10 | 80.50 | 80.80 | 83.80 | 83.80 | 83.50 | 82.50 | 80.80 | 80.05 |
Recall | 84.80 | 83.30 | 84.10 | 78.70 | 82.05 | 82.70 | 84.30 | 83.90 | 83.80 | 80.80 | 79.10 | 79.70 |
F-Measure | 83.24 | 82.50 | 84.30 | 79.80 | 81.26 | 81.73 | 84.00 | 83.80 | 83.60 | 81.64 | 79.90 | 79.80 |
Model/ Performance | IG | PCA | ||||||||||
Bagging + SVM | Bagging + NB | Bagging + SVM | Bagging + NB | |||||||||
m-i | m-ii | m-iii | m-i | m-ii | m-iii | m-i | m-ii | m-iii | m-i | m-ii | m-iii | |
Accuracy | 84.35(↑) | 83.15 | 83.90 | 80.75 | 81.60 | 82.25 | 82.45 | 81.85 | 83.70 | 80.95 | 78.90 | 79.90 |
Precision | 84.20 | 83.10 | 84.10 | 82.10 | 80.80 | 81.40 | 81.10 | 81.30 | 83.50 | 81.70 | 79.50 | 80.30 |
Recall | 84.50 | 83.40 | 83.76 | 79.94 | 82.10 | 82.80 | 83.30 | 82.90 | 83.80 | 80.40 | 78.50 | 79.60 |
F-Measure | 84.30 | 83.24 | 83.92 | 81.00 | 81.40 | 82.09 | 82.18 | 82.20 | 83.60 | 81.04 | 78.90 | 79.90 |
Model/ Performance | IG | PCA | ||||||||||
AdaBoost + SVM | AdaBoost + NB | AdaBoost + SVM | AdaBoost + NB | |||||||||
m-i | m-ii | m-iii | m-i | m-ii | m-iii | m-i | m-ii | m-iii | m-i | m-ii | m-iii | |
Accuracy | 84.15 | 83.95 | 84.45(↑) | 81.50 | 83.40 | 83.65 | 84.10 | 84.15 | 83.78 | 81.50 | 80.35 | 79.45 |
Precision | 83.20 | 82.50 | 84.80 | 81.90 | 85.00 | 83.10 | 83.80 | 83.90 | 83.50 | 82.50 | 80.00 | 78.60 |
Recall | 84.80 | 83.90 | 84.15 | 81.25 | 82.36 | 84.02 | 84.30 | 84.80 | 83.80 | 80.80 | 80.50 | 79.90 |
F-Measure | 83.99 | 83.19 | 84.32 | 81.57 | 83.65 | 83.55 | 84.00 | 84.10 | 83.60 | 81.64 | 80.24 | 78.25 |
Model/ Performance | IG + AdaBoost + SVM | IG + NN | ||||
m-i | m-ii | m-iii | m-i | m-ii | m-iii | |
Accuracy | 84.15 | 83.95 | 84.45(↑) | 80.95 | 80.35 | 82.15 |
Precision | 83.20 | 82.50 | 84.70 | 83.40 | 84.30 | 79.70 |
Recall | 84.80 | 83.90 | 84.10 | 79.50 | 78.12 | 83.80 |
F-Measure | 83.99 | 83.19 | 84.30 | 81.40 | 81.00 | 81.69 |
The classification results of NB show that its accuracy is comparatively lower than that of the other classifiers. NB is not an efficient algorithm on bigrams. The reason for its higher error rate is its assumption that all features are independent. The classification results obtained for the combinations of IG with bagging and the SVM and NB classifiers, and of PCA with bagging and the SVM and NB classifiers, are tabulated above.
The classification results attained for the combinations of IG with AdaBoost and the SVM and NB classifiers, and of PCA with AdaBoost and the SVM and NB classifiers, are also tabulated above.
The last table above shows the classification results of IG combined with AdaBoost and SVM, and of the combination of IG with the neural network. The following can be observed from these results.
Among all models, model m-iii with AdaBoost and SVM predicts reviews with the highest accuracy of 84.45%. This shows that this model correctly classifies positive reviews as positive and negative reviews as negative for the trigram feature set. Among all models, the combination of PCA, bagging, and NB on model m-ii gives the lowest accuracy, 78.90%. This shows that this model more often misclassifies negative reviews as positive and vice versa.
In this study, movie reviews are collected and preprocessed. The features are extracted from the movie review dataset, and the attribute vector model is developed with the TF-IDF measure. The attributes are grouped as unigrams, bigrams, and trigrams, and the impact of attribute size is identified in three models. We empirically compared two feature selection and reduction techniques (IG and PCA), two classifiers (SVM and NB), two ensemble methods, and backpropagation. Between the IG and PCA methods, IG performs better than PCA. Among the two ensemble methods and backpropagation, AdaBoost + SVM performs best in classifying the sentiment of movie reviews for the m-iii feature vector.