Raw image data contains noise such as random variation in brightness or color, and removing this noise drastically improves the performance of facial emotion recognition models. Many denoising techniques exist for this purpose, such as Gaussian blur, the bilateral filter, and non-local means filtering. Gaussian blur reduces noise by smoothing edges and lowering contrast, but it also removes fine image detail.
The conditional generative adversarial network is one approach used to reduce intra-class variation. The proposed approach consists of a generator G and three discriminators (Di, Da, and Dexp), with three loss functions designed to learn generative and discriminative representations. A limitation of this approach is that a separate model must be trained for each dataset; a model trained on one dataset may yield poor accuracy on another.
As a different approach to uncovering the correlation between emotional triggers and changes in facial temperature, emotion-specific activation maps are constructed from infrared thermal facial image sequences. Emotional clips for testing were created from data stored in the International Affective Picture System. The results show that the difficulty of selecting local regions when examining frame temperature was resolved.
Facial expression recognition can also use two convolutional neural networks, one for face identification and a second for expression recognition. For face recognition, two models are employed: one is slow but very accurate, while the other is fast but less accurate.
Group-based emotion recognition plays a vital role in real-world applications. Multivariate Local Texture Pattern, Local Energy-based Shape Histogram, and the gray-level co-occurrence matrix are used for feature extraction. The proposed model achieved 99.16% accuracy, 99.33% recall, 99% precision, and 99.93% sensitivity, and it achieved 87.8% accuracy on low-resolution images.
A 3D convolutional neural network has been designed for facial expression recognition in videos, implemented using the TensorFlow deep learning framework.
Little research has been done on facial expression recognition in low-resolution images. In this work, we therefore propose a novel convolutional neural network and a novel hybrid denoising method for facial expression recognition. The proposed network has a simple architecture and is compared with state-of-the-art models on the Fer2013 dataset. The batch size used in this work is 64, and the model is trained for 100 epochs. To deal with overfitting, dropout and batch normalization are used.
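The layer-by-layer architecture of FERConvNet is not given here, but the two regularization techniques named above can be sketched in a few lines of NumPy. This is an illustrative sketch of inverted dropout and batch normalization (learnable scale/shift and inference statistics omitted), not the authors' implementation; only the batch size of 64 comes from the training setup described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate=0.5, training=True):
    """Inverted dropout: zero a random fraction of activations and
    rescale the survivors so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def batch_norm(x, eps=1e-5):
    """Normalize each feature to zero mean / unit variance over the batch
    (learnable gamma/beta parameters omitted for brevity)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# One batch of 64 flattened feature vectors (batch size matches the paper).
x = rng.normal(size=(64, 128))
h = dropout(batch_norm(x), rate=0.5)
```

At test time, dropout is disabled (`training=False`) and batch normalization would use running statistics accumulated during training.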
To recognize facial expressions in low-resolution images, we created a low-resolution facial expression (LRFE) dataset containing more than 6000 images across seven types of facial expression. FERConvNet, FERConvNet with the hybrid denoising method (FERConvNet_HDM), VGG16, and VGG19 are tested on this dataset and their results compared.
Our primary contributions in this paper are as follows: (1) a novel convolutional neural network model for facial expression recognition, (2) a novel hybrid denoising method, and (3) a low-resolution facial expression (LRFE) dataset for facial expression recognition in low-resolution images.
| Category | JAFFE | MMI | FER2013 | LRFE |
|---|---|---|---|---|
| Static images | 219 | 740 | 35887 | 6100 |
| Downloadable | Yes | Yes | Yes | No |
| No. of emotion expressions | 7 | 7 | 7 | 7 |
| Gender | Female | Female/Male | Female/Male | Female/Male |
| Glasses | No | Yes | Yes | Yes |
The main aim of our research is to compare the proposed convolutional neural network with state-of-the-art models. Filtering techniques such as Gaussian, bilateral, and non-local means filtering are applied to the images to remove unwanted noise, because noise in images degrades the performance of a convolutional neural network. A hybrid denoising method is proposed by combining the Gaussian, bilateral, and non-local means denoising techniques. The Gaussian filter is a 2D convolution filter that blurs the image, helping to remove noise; its main limitation is that it loses more image detail than the other techniques. The bilateral filter is a non-linear filtering technique that removes noise while preserving edges; its limitation is that it can introduce false edges into the image. The non-local means filter, rather than taking the mean of a local group of pixels, takes a weighted mean over all pixels, and unlike techniques that simply blur the image, it can restore image texture.
The equations for the Gaussian and non-local means filters are given below.

Gaussian filtering:

$$G(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}}$$

Non-local means filtering:

$$NL[u](x) = \frac{1}{N(x)} \int_{\Omega} e^{-\frac{\left(G_a * |u(x+\cdot) - u(y+\cdot)|^2\right)(0)}{h^2}}\, u(y)\, dy$$

The normalizing factor $N(x)$ is given by:

$$N(x) = \int_{\Omega} e^{-\frac{\left(G_a * |u(x+\cdot) - u(z+\cdot)|^2\right)(0)}{h^2}}\, dz$$

where $\sigma$ is the standard deviation of the Gaussian kernel, $G_a$ is a Gaussian kernel with standard deviation $a$, and $h$ controls the filtering strength.
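The kernel used by Gaussian filtering can be sampled directly from its standard definition; a minimal NumPy sketch is shown below. The kernel size and sigma are illustrative choices, not values from the paper.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Sample G(x, y) = exp(-(x^2 + y^2) / (2*sigma^2)) / (2*pi*sigma^2)
    on a size x size grid centered at the origin, then normalize so the
    weights sum to 1 and filtering preserves overall image brightness."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    return k / k.sum()

kernel = gaussian_kernel(5, 1.0)
```

Convolving an image with this kernel performs the Gaussian blur described above; larger sigma values give stronger smoothing and greater loss of detail.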
The existing Fer2013 dataset contains 35887 images of facial expressions belonging to seven expressions (Happy, Disgust, Fear, Sad, Neutral, Angry, Surprise): 4953 angry, 547 disgust, 5121 fear, 8989 happy, 6077 sad, 4002 surprise, and 6198 neutral images. All images are grayscale and 48x48 pixels. We created the LRFE dataset by collecting nearly 6000 images from various sources, belonging to the seven categories of facial expression. Since the raw images come in different file extension formats (.JPG, .PNG, .GIF), we converted them all to .JPG. Because convolutional neural networks require large numbers of training samples, we applied three image appearance filters (average, Gaussian, and bilateral) and four affine transform matrices, bringing the number of samples in the LRFE dataset to 35000 across the seven facial expressions. All images were then converted to grayscale and resized to 48x48 pixels, and the dataset was divided into training and testing sets in an 80:20 ratio.
A novel convolutional neural network is proposed for facial expression recognition and compared with state-of-the-art models. Filtering techniques such as the bilateral filter, Gaussian filter, and non-local means denoising are applied to all images to remove noise. A hybrid denoising method is designed by combining these three techniques. The data is divided into training and test sets in an 80:20 ratio; the model is then trained on the training set, evaluated on the test set, and the performance metrics are reported.
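The 80:20 split can be done with a simple index permutation; a minimal sketch is shown below. The helper name and seed are our own choices, not from the paper.

```python
import numpy as np

def split_indices(n_samples, train_ratio=0.8, seed=42):
    """Shuffle sample indices and split them into train/test sets
    in the given ratio (80:20 by default, as used in the paper)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(n_samples * train_ratio)
    return idx[:n_train], idx[n_train:]

# Splitting the 35887 Fer2013 images gives 28709 train / 7178 test samples.
train_idx, test_idx = split_indices(35887)
```

The resulting counts match the Fer2013 split reported later in the paper (28709 training and 7178 test images).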
The model is executed on the Kaggle platform, which provides a single NVIDIA Tesla P100 GPU or TPU v3-8, 9 hours of execution time, 20 GB of disk space, 2 CPU cores, and 13 GB of RAM. For implementing this model we used the LRFE and Fer2013 datasets. Fer2013 contains 35887 images, divided into training and test sets in an 80:20 ratio: 28709 and 7178 images respectively. The LRFE dataset contains 6000 facial expression images belonging to seven emotions (Happy, Sad, Surprise, Neutral, Fear, Disgust, Angry), collected from various sources.
The Fer2013 dataset contains 35887 images, each labeled with one of the seven emotions. All images are grayscale and 48x48 pixels, and both posed and unposed images are present. The dataset contains 4953 angry, 547 disgust, 5121 fear, 8989 happy, 6077 sad, 4002 surprise, and 6198 neutral images.
| Dataset | Happy | Sad | Angry | Disgust | Fear | Surprise | Neutral |
|---|---|---|---|---|---|---|---|
| Fer2013 | 8989 | 6077 | 4953 | 547 | 5121 | 4002 | 6198 |
We now present the results on the Fer2013 dataset, which is divided in an 80:20 ratio for training and testing/validation. The batch size used is 64 and the models are trained for 100 epochs.
| S.no | Model Name | Dataset | Train Accuracy | Test Accuracy | Train Loss | Test Loss |
|---|---|---|---|---|---|---|
| 1 | VGG16 | Fer2013 | 0.63 | 0.60 | 1.01 | 1.10 |
| 2 | VGG19 | Fer2013 | 0.54 | 0.53 | 1.22 | 1.20 |
| 3 | FERConvNet | Fer2013 | 0.79 | 0.65 | 0.69 | 1.07 |
| 4 | EfficientNetB7 | Fer2013 | 0.63 | 0.60 | 1.10 | 1.09 |
| S.no | Model Name | Dataset | Train Accuracy | Test Accuracy | Train Loss | Test Loss |
|---|---|---|---|---|---|---|
| 1 | FERConvNet_Gaussian | Fer2013 | 0.65 | 0.55 | 1.05 | 1.33 |
| 2 | FERConvNet_Bilateral | Fer2013 | 0.80 | 0.65 | 0.69 | 1.09 |
| 3 | FERConvNet_Nonlocal Means | Fer2013 | 0.79 | 0.65 | 0.71 | 1.09 |
| 4 | FERConvNet_HDM | Fer2013 | 0.87 | 0.85 | 0.47 | 0.56 |
We then applied Gaussian, bilateral, and non-local means filtering to the Fer2013 dataset. The proposed model (FERConvNet) with the Gaussian, bilateral, and non-local means filters obtained 55%, 65%, and 65% accuracy respectively on Fer2013. When the proposed novel hybrid denoising method, a combination of the Gaussian, bilateral, and non-local means filters, was applied to Fer2013, the proposed model with the hybrid denoising method (FERConvNet_HDM) achieved 85% accuracy on the test set. FERConvNet_HDM thus performs better for facial expression recognition than the traditional filtering techniques.
The low-resolution facial expression (LRFE) dataset was created from various sources; the primary motivation is that no existing dataset contains low-resolution images for facial expression recognition. Existing work on facial expression recognition addresses well-posed conditions rather than wild, real-world conditions, so we created the LRFE dataset, whose images are taken in real-world conditions. The dataset contains nearly 6000 images belonging to seven emotions (Happy, Sad, Surprise, Angry, Neutral, Disgust, Fear). Because the images are collected from various sources, they come in different file extension formats (.JPG, .PNG, .GIF); we converted them all to .JPG. We used three image appearance filters and four affine transform matrices to increase the number of samples, since convolutional neural networks require a large number of samples for training.
The three image appearance filters are the average, bilateral, and Gaussian filters. After augmentation, the LRFE dataset contains 35000 samples. All images are converted to grayscale and resized to 48x48 pixels, and the dataset is divided into training and testing sets in an 80:20 ratio (80% training and 20% testing).
| Dataset | Happy | Sad | Angry | Disgust | Fear | Surprise | Neutral |
|---|---|---|---|---|---|---|---|
| LRFE | 5162 | 5148 | 5218 | 5155 | 5002 | 5194 | 5274 |
We now present the results on the LRFE dataset, which is divided in an 80:20 ratio for training and testing/validation. The batch size used is 64 and the models are trained for 100 epochs.
| S.no | Model Name | Dataset | Train Accuracy | Test Accuracy | Train Loss | Test Loss |
|---|---|---|---|---|---|---|
| 1 | VGG16 | LRFE | 0.87 | 0.69 | 0.40 | 1.16 |
| 2 | VGG19 | LRFE | 0.84 | 0.66 | 0.47 | 0.96 |
| 3 | FERConvNet | LRFE | 0.95 | 0.71 | 0.16 | 1.19 |
| 4 | EfficientNetB7 | LRFE | 0.79 | 0.65 | 0.71 | 1.09 |
| S.no | Model Name | Dataset | Train Accuracy | Test Accuracy | Train Loss | Test Loss |
|---|---|---|---|---|---|---|
| 1 | FERConvNet_Gaussian | LRFE | 0.98 | 0.58 | 0.30 | 3.00 |
| 2 | FERConvNet_Bilateral | LRFE | 0.98 | 0.63 | 0.43 | 2.52 |
| 3 | FERConvNet_Nonlocal Means | LRFE | 0.93 | 0.61 | 0.79 | 2.32 |
| 4 | FERConvNet_HDM | LRFE | 0.98 | 0.95 | 0.07 | 0.33 |
In this study, a novel convolutional neural network (FERConvNet) and a new hybrid denoising method, which combines the Gaussian, bilateral, and non-local means filters, are presented. Since no existing dataset provides low-resolution images for facial expression recognition, we created the low-resolution facial expression (LRFE) dataset, containing nearly 6000 images of seven emotions (Happy, Sad, Surprise, Fear, Angry, Neutral, Disgust). Because convolutional neural networks require a large number of training samples, we applied three image appearance filters and four affine transform matrices, increasing the number of samples in the LRFE dataset to 35000. The proposed FERConvNet_HDM approach achieved 85% accuracy on the Fer2013 dataset, outperforming VGG16, VGG19, and EfficientNetB7, whose accuracies on Fer2013 are 60%, 53%, and 60% respectively. On the LRFE dataset, FERConvNet_HDM achieved 95% accuracy. The results show that FERConvNet_HDM performs better than VGG16, VGG19, and EfficientNetB7 on both Fer2013 and LRFE. Our approach is computationally simple and robust on low-resolution images, which are close to real-world conditions, making the proposed model promising for real-world applications.