Human behavior classification has become a popular and active research area over the past two decades. It is a categorization problem that involves identifying human movements and activities for monitoring purposes, as well as detecting anomalies in behavior. It plays an important role in human-to-human interaction and interpersonal relations. The aim of human activity recognition is to understand human activities from still photos or video clips. In this work, 2DCNN, VGG16 and ResNet50 are used to classify digital images of humans into different classes based on their behaviors.
The purpose of this work is to develop a classification system for different human behaviors using 2DCNN, VGG16 and ResNet50. The proposed system detects normal human behaviors such as sitting, walking and standing. For training, 2271 human behavior images are used, and 539 human behavior images are used for testing, covering the normal behaviors sitting, standing and walking. The steps followed in this work are: dataset collection, converting videos to frames, training the 2DCNN, VGG16 and ResNet50 models, and testing on the different human behavior images. The block diagram of the proposed system is shown in
CNNs play a key role in fields like image processing, natural language processing (NLP), computer vision and other cognitive processing. A convolutional neural network is made up of an input layer, an output layer, and numerous hidden layers. An image, which is a matrix of pixel values, is provided as input. The pixel values range from 0 to 255; if the image is grayscale, the channel value is 1, and if it is a color image, the channel value is 3, standing for Red, Green, and Blue. The main purpose of the convolution layer is to extract features from the input image; it is parameterized by filter size, stride and depth, and it generates a feature map. The filter size used for convolution is 3x3. The number of filters used corresponds to the depth. The stride defines how many input steps the filter slides through: when the stride value is 1, the filters are moved one pixel at a time, and with a stride of 2 they jump 2 pixels at a time. Common activation functions in a 2DCNN are ReLU, softmax, tanh and sigmoid; the ReLU activation function is used in this work. The advantages of ReLU are that negative values in the feature map are replaced by zero, there is no upper threshold so the vanishing gradient problem is avoided, output prediction accuracy and efficiency are high, and it is fast compared to other activation functions. When convolution is applied to an input, the produced output matrix is reduced in size, resulting in information loss; the padding concept is implemented to avoid this. Padding surrounds the input volume with zeros at the border. Valid and same are two popular padding options: same padding indicates that the output size remains the same as the input size, and valid padding means no padding is added. After convolution layers, CNNs frequently employ a pooling layer, whose goal is to reduce the dimension, also known as down sampling.
For the pooling layer, max pooling is used, which takes the maximum value from each region of the feature map. The pooled features are then fed into the Fully Connected (FC) layer, which uses flattening. Flattening converts the resulting 2-dimensional arrays from the pooled feature maps into a single long continuous linear vector.
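As a concrete illustration of these two operations, 2x2 max pooling and flattening can be sketched in a few lines of NumPy. This is a minimal sketch for a single-channel feature map, not the implementation used in this work:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a (H, W) feature map."""
    h, w = feature_map.shape
    # Group the pixels into non-overlapping 2x2 blocks and take the maximum of each.
    blocks = feature_map[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 1],
                 [3, 4, 5, 6]])
pooled = max_pool_2x2(fmap)   # [[6, 4], [7, 9]] -- dimension halved
flattened = pooled.flatten()  # [6, 4, 7, 9] -- the linear vector fed to the FC layer
```

Each 2x2 block of the 4x4 input contributes a single maximum value, halving both spatial dimensions before the flattened vector reaches the fully connected layer.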
In
The ReLU activation function uses the equation below:

f(x) = max(0, x)
A 2DCNN model includes convolution layers, max pooling layers, and a flattening layer. The flattened output is fed to a variable number of fully connected dense layers. The first layer is made up of the unprocessed pixels of a 100x100 human behavior image with three color channels. The first convolution layer has 32 filters, giving an output of size 100x100x32, obtained by performing a dot product of the filter weights and the input image pixel values. A 2x2 max pooling layer is then applied along the spatial dimensions (height x width), reducing the dimension to 50x50x32 and accomplishing the down sampling operation. The output of the first max pooling layer is the input to the next level; the second layer applies a CNN filter of size 3x3 with 64 filters, and the following max pooling down samples the output dimension to 25x25x64. The third layer's CNN filter, which has a 3x3 filtering matrix and 128 filters, outputs 25x25x128, and the second 2x2 max pooling layer outputs 12x12x128. Then, in layer 4, the feature maps are "flattened." The final output layer has a softmax activation function and six output neurons for classification: the normal categories standing, walking and sitting, and the abnormal activities hit, kick and punch, are used for the proposed human behavior classification system.
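The layer dimensions described above can be traced with a short helper. This is a sketch that only tracks shapes, assuming 'same'-padded 3x3 convolutions and 2x2 max pooling with stride 2, as described in the text:

```python
def trace_shapes(input_shape, layers):
    """Trace an (H, W, C) shape through ('conv', filters) / ('pool',) layers.

    Assumes 'same'-padded convolutions (spatial size unchanged) and
    2x2 max pooling with stride 2 (spatial size halved, floor division).
    """
    h, w, c = input_shape
    shapes = []
    for layer in layers:
        if layer[0] == "conv":
            c = layer[1]              # convolution changes only the depth
        else:
            h, w = h // 2, w // 2     # pooling halves height and width
        shapes.append((h, w, c))
    return shapes

# The 2DCNN described above: a 100x100x3 input through three conv/pool blocks.
shapes = trace_shapes((100, 100, 3),
                      [("conv", 32), ("pool",), ("conv", 64), ("pool",),
                       ("conv", 128), ("pool",)])
# Reproduces 100x100x32, 50x50x32, ..., down to 12x12x128 before flattening.
```

The final 12x12x128 shape matches the dimension stated above for the input to the flattening layer.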
In
ResNet was first proposed by researchers at Microsoft Research in 2015, introducing a new design, the residual network architecture. A network's performance decreases or becomes saturated as it gets deeper: since gradients vanish, accuracy is reduced. The residual network provides a solution to vanishing gradients during back propagation through a method called skip connections. A skip connection links straight to the output after skipping a few layers. Gradients can pass directly from later layers to earlier layers through the skip connections, as shown in
A ‘skip connection’ is a direct connection that skips over some layers of the model; because of it, the output is not the same. Without the skip connection, the input ‘x’ gets multiplied by the weights of the layer, followed by adding a bias term and applying the activation function F(), and the output is shown in
But with the skip connection technique, the output is:

H(x) = F(x) + x
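The effect of the skip path can be illustrated in NumPy. This is a deliberately degenerate sketch (zero weights, so F(x) collapses to zero), not an actual residual branch, chosen to make the identity path visible:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def plain_block(x, w, b):
    # Without a skip connection: output = F(x) = relu(Wx + b)
    return relu(w @ x + b)

def residual_block(x, w, b):
    # With a skip connection: output = F(x) + x, so the signal (and its
    # gradient during back propagation) can flow through the identity path.
    return relu(w @ x + b) + x

x = np.array([1.0, -2.0])
w = np.zeros((2, 2))      # degenerate weights: F(x) becomes all zeros
b = np.zeros(2)
print(plain_block(x, w, b))     # [0. 0.]  -- the input is lost
print(residual_block(x, w, b))  # [ 1. -2.] -- the input survives via the skip path
```

Even when the learned branch contributes nothing, the residual block still passes the input through unchanged, which is what protects deep networks from vanishing gradients.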
In ResNet50, there are two kinds of blocks: the identity block and the convolutional block.
The value of ‘x’ is added directly to the output layer (the identity block) if and only if the input size equals the output size. If this is not the case, a 1x1 convolution is added in the skip connection (the convolutional block) to match the dimensions. There are two ways to make the input size equal to the output size: padding the input, or projecting it with a 1x1 convolution. The equation relating input and output sizes is

output size = ((n + 2p - f) / s) + 1

where n = input image size, p = padding, s = stride, and f = filter size.
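The size relation above can be checked with a one-line helper (a sketch; f is the filter size):

```python
def conv_output_size(n, f, p, s):
    """Output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

# 'Same' convolution: a 3x3 filter with padding 1 and stride 1 keeps the size.
assert conv_output_size(100, f=3, p=1, s=1) == 100
# ResNet50's first layer: a 7x7 filter with padding 3 and stride 2 halves 224 to 112.
assert conv_output_size(224, f=7, p=3, s=2) == 112
```

Whenever this formula gives an output size different from n, the identity skip cannot be used directly and the convolutional block is needed.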
In CNNs, pooling is used to reduce the size of the image. ResNet50 instead makes use of convolutions with stride 2 for most of its down sampling.
The ResNet50 architecture contains the following elements. A convolution with 64 distinct kernels, each of size 7x7 with a stride of 2, provides layer 1; next comes max pooling, also with a stride of 2. The next stage has three layers: a 1x1, 64 kernel; a 3x3, 64 kernel; and finally a 1x1, 256 kernel. These three layers are repeated a total of three times, yielding nine layers in this stage. The kernel of 1x1, 128 comes next, followed by the kernel of 3x3, 128 and, finally, the kernel of 1x1, 512; this procedure is performed four times for a total of 12 layers. Following that is a kernel of size 1x1, 256, followed by two more kernels of size 3x3, 256 and 1x1, 1024; this is repeated six times, giving a total of 18 layers. After that, a 1x1, 512 kernel is added, followed by two more kernels of 3x3, 512 and 1x1, 2048; this process is done three times, giving a total of nine layers. Following that, an average pooling layer is added, finished with a fully connected layer with three nodes for the classes sitting, standing and walking, and a softmax function, giving one more layer.
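The layer count in the paragraph above can be verified with simple arithmetic, which is where the name ResNet50 comes from:

```python
# Each stage repeats a (1x1, 3x3, 1x1) bottleneck; counts taken from the text above.
stages = {"conv1": 1,        # 7x7, 64, stride 2
          "conv2_x": 3 * 3,  # three bottleneck blocks of three conv layers each
          "conv3_x": 4 * 3,  # four blocks
          "conv4_x": 6 * 3,  # six blocks
          "conv5_x": 3 * 3,  # three blocks
          "fc": 1}           # final fully connected layer
total_layers = sum(stages.values())
print(total_layers)  # 50
```

The pooling layers are not counted, since they have no learnable weights.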
The dataset is collected from real-time video of human behaviors, which is then converted to frames used as input images. All the RGB images are fed to the CNN. The dataset is divided into a training set and a test set in an 80:20 ratio. For training, 2271 human behavior images are used, and 539 human behavior images are used for testing, covering the normal human behaviors sitting, standing and walking, as shown in
The CNN model is trained with varying numbers of dense layers, and then the model is trained with various numbers of parameters whose weights are adjusted. The model with six labels is trained with 3,999,079 trainable parameters.
Class      Training images   Testing images
Sitting        310               201
Standing       990                32
Walking        971               306
Layer                          Parameter calculation                       Parameters
Convolution 1 (32 filters)     ((3 x 3 x 3) + 1) x 32 = 28 x 32            896
Convolution 2 (64 filters)     ((3 x 3 x 32) + 1) x 64 = 289 x 64          18,496
Convolution 3 (128 filters)    ((3 x 3 x 64) + 1) x 128 = 577 x 128        73,856
Convolution 4 (256 filters)    ((3 x 3 x 128) + 1) x 256 = 1,153 x 256     295,168
Convolution 5 (512 filters)    ((3 x 3 x 256) + 1) x 512 = 2,305 x 512     1,180,160
FC1                            4,609 x 500                                 2,304,500
FC2                            501 x 250                                   125,250
FC3                            251 x 3                                     753

For each convolution, the parameter count is (kernel size x previous layer depth + 1 bias) x number of filters in the current layer.
The first convolution has kernel size 3x3 and stride 1 with a filter depth of 32. As the input is a color image, the channel value is 3, i.e. Red, Green and Blue. The second convolution takes the previous filter depth of 32, with kernel size 3x3 and a filter depth of 64 in that layer. The third convolution takes the previous depth of 64, with kernel size 3x3 and a depth of 128. The fourth convolution takes the previous depth of 128, with kernel size 3x3 and a depth of 256. The fifth convolution takes the previous depth of 256, with kernel size 3x3 and a depth of 512. The total number of trainable parameters is 3,999,079.
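The trainable-parameter arithmetic above can be reproduced in a few lines. The flatten size of 4608 (= 3 x 3 x 512) is an assumption consistent with FC1's 4,609 x 500 term; the rest follows directly from the table:

```python
def conv_params(kernel, in_ch, filters):
    # Each filter has kernel*kernel*in_ch weights plus one bias.
    return (kernel * kernel * in_ch + 1) * filters

def dense_params(in_units, out_units):
    # Each output unit has in_units weights plus one bias.
    return (in_units + 1) * out_units

params = [conv_params(3, 3, 32),     # Convolution 1:   896
          conv_params(3, 32, 64),    # Convolution 2:   18,496
          conv_params(3, 64, 128),   # Convolution 3:   73,856
          conv_params(3, 128, 256),  # Convolution 4:   295,168
          conv_params(3, 256, 512),  # Convolution 5:   1,180,160
          dense_params(4608, 500),   # FC1:             2,304,500
          dense_params(500, 250),    # FC2:             125,250
          dense_params(250, 3)]      # FC3:             753
print(sum(params))  # 3999079
```

The sum matches the 3,999,079 trainable parameters reported for the model.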
In this work, human behavior classification using 2DCNN, VGG16 and ResNet50 is tested and the results are compared.
As discussed in previous sections, VGG16 is trained with 16 layers, consisting of 13 convolution layers with five max pooling layers and 3 fully connected layers. The output dimension of Conv layer 1 is 224 x 224 x 64, of Conv layer 2 is 224 x 224 x 64, and of Max_pooling1 is 112 x 112 x 64. The following table shows the 16-layer structure and its dimensions. Network parameters are given in the
Layer                  Output dimension    Parameters
Conv2d_1               224 x 224 x 64      1,792
Conv2d_2               224 x 224 x 64      36,928
Max_pooling2d_1        112 x 112 x 64      0
Conv2d_3               112 x 112 x 128     73,856
Conv2d_4               112 x 112 x 128     147,584
Max_pooling2d_2        56 x 56 x 128       0
Conv2d_5               56 x 56 x 256       295,168
Conv2d_6               56 x 56 x 256       590,080
Conv2d_7               56 x 56 x 256       590,080
Max_pooling2d_3        28 x 28 x 256       0
Conv2d_8               28 x 28 x 512       1,180,160
Conv2d_9               28 x 28 x 512       2,359,808
Conv2d_10              28 x 28 x 512       2,359,808
Max_pooling2d_4        14 x 14 x 512       0
Conv2d_11              14 x 14 x 512       2,359,808
Conv2d_12              14 x 14 x 512       2,359,808
Conv2d_13              14 x 14 x 512       2,359,808
Max_pooling2d_5        7 x 7 x 512         0
Flatten                25088               0
FC1 (Dense)            256                 6,422,784
FC2 (Dense)            128                 32,896
Softmax                3                   387
Trainable parameters                       21,170,755
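The same parameter arithmetic used for the 2DCNN reproduces this total: the 13 VGG16 convolution layers plus the custom classifier head (256, 128 and 3 units, taken from the table above):

```python
def conv_params(in_ch, filters, kernel=3):
    # (kernel*kernel*in_ch weights + 1 bias) per filter; VGG16 uses 3x3 kernels.
    return (kernel * kernel * in_ch + 1) * filters

def dense_params(in_units, out_units):
    return (in_units + 1) * out_units

# The 13 VGG16 convolution layers as (input channels, filters) pairs.
conv_layers = [(3, 64), (64, 64),
               (64, 128), (128, 128),
               (128, 256), (256, 256), (256, 256),
               (256, 512), (512, 512), (512, 512),
               (512, 512), (512, 512), (512, 512)]
conv_total = sum(conv_params(i, f) for i, f in conv_layers)  # 14,714,688

# Classifier head: 7 x 7 x 512 = 25088 flattened inputs, then 256, 128, 3 units.
head_total = (dense_params(25088, 256) + dense_params(256, 128)
              + dense_params(128, 3))
print(conv_total + head_total)  # 21170755
```

The max pooling and flatten layers contribute no parameters, so only the convolutions and the dense head appear in the sum.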
The back propagation method is used in this case; convergence becomes more difficult as the network grows deeper. As discussed in previous sections, ResNet50 is trained with 50 layers, consisting of convolution layers with zero padding, max pooling and activation functions, batch normalization layers, average pooling and fully connected layers. The input to Conv layer 1 is 224 x 224 x 3, its output is 112 x 112 x 64, and the output of Max_pooling1 is 55 x 55 x 64. The layer configuration is shown in the following table.
Layer       Configuration
Conv1       7 x 7, 64, stride 2
Max pool    3 x 3, stride 2
Conv2_x     [1 x 1, 64; 3 x 3, 64; 1 x 1, 256] x 3
Conv3_x     [1 x 1, 128; 3 x 3, 128; 1 x 1, 512] x 4
Conv4_x     [1 x 1, 256; 3 x 3, 256; 1 x 1, 1024] x 6
Conv5_x     [1 x 1, 512; 3 x 3, 512; 1 x 1, 2048] x 3
Output      Average pool, fully connected, softmax
Class      2DCNN                      VGG16                      ResNet50
           Precision  Recall  F1      Precision  Recall  F1      Precision  Recall  F1
Sitting    0.99       0.95    0.97    0.99       0.96    0.97    1.00       1.00    1.00
Standing   0.88       0.98    0.93    0.91       0.98    0.94    1.00       0.93    0.97
Walking    1.00       0.99    1.00    1.00       1.00    1.00    0.98       1.00    0.99
Dataset                                  2DCNN     VGG16     ResNet50
Human Behavior Classification dataset    92.63%    94.56%    99.72%
Method             Accuracy
CNN+LSTM           90.89%
VGG16+SVM model    79.55%
LSTM               85.6%

This research uses convolutional neural networks (a 2DCNN), VGG16 and ResNet50 to build a system that can recognize actions like sitting, standing, and walking. A human behavior classification dataset was created for this work, and the experimental results show that ResNet50 outperformed VGG16 and the 2DCNN. Recognition of multiple human activities can be addressed in future work with the use of an LSTM architecture.