Over the past few years, neural networks have supported diverse tasks in computer vision, medical diagnosis, and other fields. Neural networks are designed to handle variety in the input data so that they can classify it in a generic way. Unlike convolutional neural networks (CNNs), artificial neural networks (ANNs) have no concept of filters or pooling. As a result, the number of parameters that must be trained and updated during backpropagation to reduce the cost function is very large. This can exceed the memory of an ordinary system and slows down training. Secondly, training too many neurons and parameters also invites overfitting, which can degrade the performance of the model. Another benefit of CNNs is that, because they use filters, they can learn relevant features from an image at different levels, similar to human intelligence, a concept called feature learning. These details are explained by implementing image classification on a clothing dataset, Fashion-MNIST. In
Since only a few studies have examined how image classification is performed by a traditional ANN and how today's CNN outperforms it, the proposed study compares these two kinds of neural networks. It performs image classification on the Fashion-MNIST dataset using an ANN and a CNN with different optimizers, namely SGD, Adagrad, RMSprop, and Adam.
Images are taken from the Fashion-MNIST dataset, of which 60,000 are used as training samples and 10,000 as test samples. An ANN model is created in which the input layer is formed by flattening images of size 28 x 28 x 1 into a 1D vector, followed by the remaining dense layers. ReLU is used as the activation function, and softmax is used in the last layer since the task is multiclass classification. Similarly, the CNN is designed with two convolution layers of 32 and 64 filters respectively, with ReLU and softmax activation functions, and different optimizers are considered to check the performance. The dataset, the architectures, and their trainable parameters are presented in the following sections.
The Fashion-MNIST dataset of clothing article images is readily available and convenient to work with. Using it as the base dataset for training the models (ANN and CNN in this study) makes it much easier to implement and understand the concepts behind classification and prediction algorithms. The Fashion-MNIST dataset consists of a training set of 60,000 examples and a test set of 10,000 examples, where each sample is a 28 x 28 grayscale image associated with a label from one of 10 classes. In tabular form, each sample has 785 columns in total: the first column is the class label representing the article of clothing, and the remaining 784 columns hold the pixel values. In the study proposed here, two architectures are built for image classification, one using an ANN and the other using a CNN, and their performance is assessed with different optimizers.
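The 785-column layout described above can be illustrated with a minimal sketch. It assumes a row of 785 integers with the label first and the 784 pixel values stored row by row; the function name `split_row` and the synthetic row are hypothetical and are not taken from the study.

```python
def split_row(row):
    """Split one 785-value row into (label, 28 x 28 image)."""
    if len(row) != 785:
        raise ValueError("expected 785 values: 1 label + 784 pixels")
    label = row[0]
    pixels = row[1:]
    # Rebuild the 28 x 28 grid, assuming the pixels are stored row by row.
    image = [pixels[r * 28:(r + 1) * 28] for r in range(28)]
    return label, image


# Synthetic row for illustration only (not real dataset values).
example_row = [9] + list(range(784))
label, image = split_row(example_row)
print(label, len(image), len(image[0]))
```

The flattened vector fed to the ANN is simply `row[1:]`, while the CNN would consume the reshaped 28 x 28 grid with a single channel.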
In the work proposed here, the ANN is a simple sequential model consisting of an input layer and three dense layers, the last of which is the output layer. To form the input layer, the input images (of size 28 x 28 x 1) are flattened into one-dimensional vectors. The second layer, i.e., the first dense or hidden layer, is made up of 3,000 neurons and uses ReLU as its activation function to introduce non-linearity into the model. The third layer comprises 1,000 neurons, also with ReLU as its activation function. The final or output layer consists of 10 neurons, one per class, with the softmax activation function, which is used for multiclass classification problems and is typically trained with the categorical cross-entropy loss.
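To make the size of this fully connected model concrete, the following sketch counts its trainable parameters. It assumes each dense layer has one weight per input-unit pair plus one bias per unit, and it assumes 10 output neurons (one per class); the helper name `dense_params` is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


# Flattened 28 x 28 x 1 input, two hidden layers, assumed 10-unit output layer.
layer_sizes = [28 * 28 * 1, 3000, 1000, 10]
ann_params = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(ann_params, sum(ann_params))
```

Under these assumptions the three dense layers hold 2,355,000, 3,001,000, and 10,010 parameters, a little over 5.3 million in total, which illustrates why a dense network on raw pixels becomes expensive to train.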
In contrast to the ANN, the CNN here is a sequential model consisting of a stack of layers, where each layer extracts progressively more prominent features as we move deeper into the network. The input images of size 28 x 28 x 1 are convolved with 32 filters (kernels) of size 3 x 3, using ReLU as the activation function, followed by max pooling with a kernel size of 2 x 2. This forms the first convolution layer. The second convolution layer consists of 64 filters of size 3 x 3 with ReLU as the activation function and 2 x 2 max pooling. After the two consecutive convolutional layers, the next layer is the first fully connected layer with 64 neurons: the feature map from the previous convolutional layer is flattened and connected to these 64 neurons, again with ReLU as the activation function. The output layer is a dense layer of 10 neurons with softmax as the activation function. Both models are compiled and trained for classification with different optimizers, namely SGD, Adagrad, RMSprop, and Adam, to reduce the overall loss and improve the predictions.
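The layer-by-layer shapes and parameter counts of this CNN can be traced with a short sketch. The text does not state the padding or stride, so the sketch assumes unpadded ("valid") convolutions with stride 1 and 2 x 2 pooling with stride 2; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1):
    """Output width/height of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1


def conv_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


size, channels = 28, 1
cnn_params = []
for filters in (32, 64):
    cnn_params.append(conv_params(3, channels, filters))
    size = conv_out(size, 3)            # 3 x 3 convolution (assumed stride 1, no padding)
    size = conv_out(size, 2, stride=2)  # 2 x 2 max pooling (assumed stride 2)
    channels = filters

flat = size * size * channels           # length of the flattened feature map
cnn_params.append(dense_params(flat, 64))
cnn_params.append(dense_params(64, 10))
print(flat, cnn_params, sum(cnn_params))
```

Under these assumptions the flattened feature map has 1,600 values and the whole network has roughly 122 thousand trainable parameters, far fewer than a dense network applied directly to the 784 input pixels with thousands of hidden units.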
A graphic representation of the normal Gradient Descent is presented in
Since SGD is noisy while converging, the updates are smoothed using exponentially weighted moving averages. As we move forward in time, we keep encountering new data points, and the average is recalculated at each step in such a way that newer points receive a higher weight and older points a lower weight. The exponentially weighted moving average is calculated with the equation $V_t = \beta V_{t-1} + (1-\beta)\theta_t$: the average $V_t$ at any time step $t$ is obtained by multiplying the hyperparameter $\beta$ with the previous average $V_{t-1}$ and $(1-\beta)$ with the current data point $\theta_t$. Different values of $V$ are thus calculated at different time steps. Plotting them gives a smooth curve (the red line) that acts as an approximate average of the data.
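The recurrence for the exponentially weighted moving average can be sketched in a few lines. This is a minimal illustration that assumes the average starts from zero (the `v0` argument); the function name `ewma` and the sample points are hypothetical.

```python
def ewma(points, beta=0.9, v0=0.0):
    """Return the running averages V_t = beta * V_(t-1) + (1 - beta) * theta_t."""
    v = v0
    averages = []
    for theta in points:
        v = beta * v + (1 - beta) * theta
        averages.append(v)
    return averages


# Noisy synthetic points for illustration only.
noisy = [1.0, 3.0, 0.5, 2.5, 1.5, 2.0]
print(ewma(noisy, beta=0.9))
```

A larger `beta` gives a smoother curve that reacts more slowly to new points, while a smaller `beta` follows the most recent points more closely.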
The hyperparameter $\beta$ takes a value between 0 and 1 and is mostly set to 0.9. At any time step $t$, $V_t = (1-\beta)\left[\theta_t + \beta\theta_{t-1} + \beta^2\theta_{t-2} + \beta^3\theta_{t-3} + \cdots + \beta^{t-1}\theta_1 + \beta^t\theta_0\right]$. We use this to implement momentum. The weight update in gradient descent is given by $W = W - \alpha \frac{\partial \text{cost}}{\partial W}$ and $b = b - \alpha \frac{\partial \text{cost}}{\partial b}$, where $\frac{\partial \text{cost}}{\partial W}$ is the change in the cost (error) with respect to the old weights. With momentum, these derivatives are replaced by $v_{dW}$ and $v_{db}$: $W = W - \alpha v_{dW}$ and $b = b - \alpha v_{db}$, where $v_{dW} = \beta v_{dW}^{\text{prev}} + (1-\beta)\,dW$ (here $dW = \frac{\partial \text{cost}}{\partial W}$) and $v_{db} = \beta v_{db}^{\text{prev}} + (1-\beta)\,db$. Here $v_{dW}$ and $v_{db}$ are nothing but the exponentially weighted moving averages of the gradients. Because the gradients oscillate in the vertical direction, their average in that direction is close to zero, while the average in the horizontal direction is larger (consider the contour plot of the corresponding 3D graph for two parameters, say W and b, in the vertical and horizontal directions respectively), as shown in Fig. 6. Thus the net movement is mostly in the horizontal direction with very little in the vertical direction, and in this way momentum speeds up the training of the model.
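The damping effect described here can be checked numerically with a small sketch. It applies the momentum average to two synthetic gradient sequences, one that flips sign at every step (like the vertical direction) and one that is constant (like the horizontal direction). The function name `momentum_step` and the gradient values are illustrative assumptions.

```python
def momentum_step(w, v_prev, grad, alpha=0.1, beta=0.9):
    """One update: v = beta * v_prev + (1 - beta) * grad, then w = w - alpha * v."""
    v = beta * v_prev + (1 - beta) * grad
    return w - alpha * v, v


v_osc = 0.0     # average of a gradient that flips sign every step
v_steady = 0.0  # average of a gradient that always points the same way
for step in range(50):
    _, v_osc = momentum_step(0.0, v_osc, grad=(-1.0) ** step)
    _, v_steady = momentum_step(0.0, v_steady, grad=1.0)

print(v_osc, v_steady)
```

The oscillating sequence averages out to a value near zero, while the steady sequence keeps nearly its full magnitude, which is why the net movement ends up mostly along the consistent direction.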
RMSprop is an optimization algorithm that speeds up the training of the model. It is often faster than the SGD with momentum seen above, since each update is scaled using the square of the gradient. We know that the weight update in gradient descent is given by these two equations:
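$W = W - \alpha \frac{\partial \text{cost}}{\partial W}, \quad b = b - \alpha \frac{\partial \text{cost}}{\partial b}$
Now for RMSprop, it is
$W = W - \alpha \frac{dW}{\sqrt{s_{dW}} + \epsilon}$ (here $dW = \frac{\partial \text{cost}}{\partial W}$)
$b = b - \alpha \frac{db}{\sqrt{s_{db}} + \epsilon}$
where $s_{dW} = \beta s_{dW}^{\text{prev}} + (1-\beta)(dW)^2$ and $s_{db} = \beta s_{db}^{\text{prev}} + (1-\beta)(db)^2$.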
Training becomes faster, and because the square of $dW$ is taken and then its root, the algorithm is called root mean square propagation. The difference between gradient descent and RMSprop lies in how the gradient (derivative) terms are used in the update. RMSprop damps the oscillations in the vertical direction. Hence a larger learning rate can be used, letting the algorithm take bigger steps in the horizontal direction and reach the local minimum more quickly, as per the contour plot given above.
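A minimal sketch of one RMSprop update for a single scalar parameter is given below, following the update rule in this section. The function name `rmsprop_step` and the default values of the learning rate, `beta`, and `eps` are illustrative assumptions rather than settings reported in the study.

```python
import math


def rmsprop_step(w, s_prev, grad, alpha=0.01, beta=0.9, eps=1e-8):
    """s = beta * s_prev + (1 - beta) * grad^2, then w = w - alpha * grad / (sqrt(s) + eps)."""
    s = beta * s_prev + (1 - beta) * grad ** 2
    return w - alpha * grad / (math.sqrt(s) + eps), s


# A steep direction (large gradient) and a flat direction (small gradient).
w_steep, _ = rmsprop_step(0.0, 0.0, grad=100.0)
w_flat, _ = rmsprop_step(0.0, 0.0, grad=0.01)
print(abs(w_steep), abs(w_flat))
```

Although the two gradients differ by four orders of magnitude, the resulting steps are almost the same size, because each gradient is divided by the root of its own squared average. This is how large oscillations in the steep direction are restrained.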
The weight update with momentum was given earlier. The Adam optimizer combines both momentum and RMSprop into one single update:
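$W = W - \alpha \frac{v_{dW}}{\sqrt{s_{dW}} + \epsilon}$ (here $dW = \frac{\partial \text{cost}}{\partial W}$)
$b = b - \alpha \frac{v_{db}}{\sqrt{s_{db}} + \epsilon}$
where
$v_{dW} = \beta v_{dW}^{\text{prev}} + (1-\beta)\,dW$
$v_{db} = \beta v_{db}^{\text{prev}} + (1-\beta)\,db$
$s_{dW} = \beta s_{dW}^{\text{prev}} + (1-\beta)(dW)^2$
$s_{db} = \beta s_{db}^{\text{prev}} + (1-\beta)(db)^2$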
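The combined update can be sketched for a single scalar parameter as follows. The sketch follows the equations of this section, so it omits the bias-correction step of the standard Adam algorithm. It also uses two separate decay rates, `beta_v` and `beta_s`, where the equations write a single β; these names, their default values, and the toy cost $(w-3)^2$ are assumptions made for illustration.

```python
import math


def adam_step(w, v_prev, s_prev, grad, alpha=0.01, beta_v=0.9, beta_s=0.999, eps=1e-8):
    """Momentum average v and squared-gradient average s combined in one update."""
    v = beta_v * v_prev + (1 - beta_v) * grad
    s = beta_s * s_prev + (1 - beta_s) * grad ** 2
    return w - alpha * v / (math.sqrt(s) + eps), v, s


# Toy cost (w - 3)^2 with gradient 2 * (w - 3); illustration only.
w, v, s = 0.0, 0.0, 0.0
for _ in range(20):
    grad = 2.0 * (w - 3.0)
    w, v, s = adam_step(w, v, s, grad)
print(w)
```

In this toy run the parameter moves steadily from 0 towards the minimum at 3; the numerator supplies the smoothed direction and the denominator normalizes the step size.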
Since there is no concept of momentum in the Adagrad optimizer, it is simpler than stochastic gradient descent with momentum, but it has a minor drawback. Unlike the other optimization algorithms, Adagrad uses a different effective learning rate at each iteration. In the Adagrad equation, the squares of the gradients are accumulated in the denominator, so every term that is added is positive. This accumulated sum keeps growing during training, which makes the effective learning rate shrink constantly, and after some point the algorithm may not be able to learn any further. Other algorithms such as RMSprop and Adam overcome this shortcoming of Adagrad by using an exponentially weighted moving average instead of a plain sum.
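SGD: $w_t = w_{t-1} - \eta \frac{\partial L}{\partial w_{t-1}}$
Adagrad: $w_t = w_{t-1} - \eta'_t \frac{\partial L}{\partial w_{t-1}}$
where $\eta'_t = \frac{\eta}{\sqrt{\alpha_t + \epsilon}}$, $\epsilon$ is a small positive number that avoids division by zero, and $\alpha_t = \sum_{i=1}^{t} \left(\frac{\partial L}{\partial w_{i-1}}\right)^2$ is the sum of the squared gradients.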
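The shrinking learning rate of Adagrad can be observed with a short sketch for a single scalar parameter. The accumulated sum of squared gradients is called `accum` in the code to avoid confusion with the learning rate α used in other sections; the function name `adagrad_step`, the base rate `eta=0.1`, and the constant synthetic gradient are illustrative assumptions.

```python
import math


def adagrad_step(w, accum_prev, grad, eta=0.1, eps=1e-8):
    """Accumulate grad^2, scale the base rate by 1/sqrt(accum + eps), then update w."""
    accum = accum_prev + grad ** 2
    eta_t = eta / math.sqrt(accum + eps)
    return w - eta_t * grad, accum, eta_t


w, accum = 0.0, 0.0
rates = []
for _ in range(100):
    w, accum, eta_t = adagrad_step(w, accum, grad=1.0)  # constant synthetic gradient
    rates.append(eta_t)
print(rates[0], rates[-1])
```

With a constant gradient of 1, the effective rate falls from about 0.1 at the first step to about 0.01 after 100 steps, and it can only keep decreasing because every added term is positive.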
Considering both the aforementioned CNN and ANN architectures for image classification on the Fashion-MNIST dataset, it was observed that the CNN worked better than the ANN and yielded improved accuracy on both training and testing data with the prominent optimizers (RMSprop, Adam). This is probably due to the power of convolution operations for feature extraction in the CNN. All the results are tabulated in





| Optimizer | ANN Training Accuracy (%) | ANN Testing Accuracy (%) | CNN Training Accuracy (%) | CNN Testing Accuracy (%) |
| SGD | 89 | 87 | 88 | 87 |
| Adagrad | 86 | 84 | 81 | 81 |
| RMSprop | 88 | 87 | 93 | 89 |
| Adam | 91 | 88 | 95 | 91 |
Looking at the individual performance with each optimizer, it is observed that with SGD the ANN gave an accuracy of 89% on training data and 87% on test data, as shown in Figure, whereas the CNN with SGD yielded an accuracy of 89% on training data and 81% on test data.
The Adagrad optimizer gives 86% training and 84% testing accuracy for the ANN, and 81% training and 81% testing accuracy for the CNN.
Similarly, with the RMSprop optimizer, the ANN model gives an accuracy of 88% on the training data and 87% on the test data, as shown in figure. The CNN outperformed the ANN with the same optimizer, giving a training accuracy of 93% and a test accuracy of 89%.
Lastly, the Adam optimizer worked best among all the optimizers discussed in this study. For the ANN, the training accuracy for image classification was about 91% and the test accuracy was 88%, as in figure. Adam along with the CNN yielded an accuracy of 95% on training data and 91% on test data. Sample classification report for CNN with Adam optimizer is shown in
Hence it was found that, although both models classify images well with different optimizers, the CNN worked better than the ANN, and the optimizer that reduced the overall error or cost most effectively during training was Adam. CNNs are therefore well suited to diverse applications, since they greatly reduce the number of parameters to be trained, which reduces the computational time and speeds up the training process.
A clear comparison of the CNN and ANN, along with the internal computations of the proposed models (such as the number of parameters trained at each layer and each layer's input and output), is given in figure and figure. For ANN, as per
In
Moreover, image classification problems in particular require the best and most prominent features to be detected, and this can be achieved with a CNN since it has convolution with filters at its core. Hence CNNs are recommended over traditional artificial neural networks for such image classification applications.
The study found that different optimization techniques work differently for the ANN and the CNN. Overall, the Adam optimizer performed best when used with the CNN architecture, yielding a maximum training accuracy of about 95% and a test accuracy of 91%. Another finding was that the training time and the number of parameters to be trained are much lower for the CNN than for the ANN. Therefore, unlike the regular ANN, the CNN can be used widely for diverse computer vision applications because of the power of the convolution operation, and its scope is not restricted to a limited set of applications. In the future, this work can be extended to classify many more classes of textiles and clothing.