Backbones are convolutional neural networks built from different layers. The Neck stage refers to a collection of layers that gather feature maps; it is composed of several top-down and several bottom-up paths. Finally, the head of the model, which predicts the bounding boxes of objects and their classes, can be either a one-stage or a two-stage detector. Two-stage detectors are more complicated than one-stage detectors, which are elegant and straightforward.
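As a rough illustration of this backbone/Neck/head decomposition, here is a minimal PyTorch-style sketch; the module names (e.g. `SimpleDetector`) are ours, and the single 1×1 convolution standing in for the Neck is a drastic simplification of real necks such as feature pyramids:

```python
import torch
import torch.nn as nn

class SimpleDetector(nn.Module):
    """Illustrative backbone/neck/head decomposition of a one-stage detector."""
    def __init__(self, num_classes: int, num_anchors: int = 9):
        super().__init__()
        # Backbone: convolutional layers that extract feature maps.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Neck: collects/refines feature maps (real necks fuse several
        # top-down and bottom-up paths; a 1x1 conv is only a placeholder).
        self.neck = nn.Conv2d(128, 128, kernel_size=1)
        # Head: predicts class scores and bounding-box offsets per location.
        self.cls_head = nn.Conv2d(128, num_anchors * num_classes, kernel_size=1)
        self.box_head = nn.Conv2d(128, num_anchors * 4, kernel_size=1)

    def forward(self, images):
        features = self.neck(self.backbone(images))
        return self.cls_head(features), self.box_head(features)

scores, boxes = SimpleDetector(num_classes=20)(torch.randn(1, 3, 224, 224))
```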
Let us examine some of these more complicated two-stage detector architectures and observe their improvements, ranging from object detection to object segmentation; in other words, starting from R-CNN[4] and going up to Mask R-CNN[7].
R-CNN[4] stands for "Region-based Convolutional Neural Networks". It is one of the famous models that brought a large performance gain to object detection. Its architecture is composed of two steps. First, it relies on selective search to identify several candidate object bounding boxes, named regions of interest (RoIs). Second, a CNN extracts features from each region separately for classification (see Figure 3).

Faster R-CNN[8] stands for "Faster Region-based Convolutional Neural Networks". The main idea behind Faster R-CNN[8] is to integrate the region proposal model into the CNN, which makes the R-CNN[4] family train rapidly. Proposed in 2016, its architecture builds a unified model composed of a region proposal network and Fast R-CNN[5] on top of a shared convolutional feature layer.
Figure 5: Faster R-CNN Architecture
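To make the two steps of the original pipeline concrete, here is a minimal sketch of the per-region R-CNN idea; `propose_regions` is a stub standing in for selective search, and all names and sizes are illustrative rather than the published implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def propose_regions(image):
    # Stand-in for selective search; a real run returns ~2000 candidate boxes.
    return [(0, 0, 112, 112), (56, 56, 224, 224), (30, 30, 200, 150)]

# A tiny CNN as the per-region feature extractor; 21 = 20 classes + background.
feature_cnn = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
classifier = nn.Linear(32, 21)

image = torch.randn(3, 224, 224)
for x1, y1, x2, y2 in propose_regions(image):
    region = image[:, y1:y2, x1:x2].unsqueeze(0)   # crop one region of interest
    region = F.interpolate(region, size=(64, 64))  # warp it to a fixed size
    logits = classifier(feature_cnn(region))       # classify each RoI separately
```

Faster R-CNN removes exactly this per-region bottleneck by computing the shared feature map once and generating proposals from it.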
2. BACKGROUND
Object detection frameworks add backbones that competed in the ImageNet classification contest to their models to gain better performance. These convolutional neural networks have different architectures and characteristics.

AlexNet[11] is repeatedly considered the pioneer of convolutional neural networks and the starting point of the deep learning boom. AlexNet[11] competed in the famous ImageNet Large Scale Visual Recognition Challenge in 2012, where the proposed network achieved high accuracy. The AlexNet[11] architectural model is depicted in Figure 7: the architecture is composed of eight learned layers, i.e., five convolutional layers, three of them followed by max pooling, and three fully connected layers ending in a softmax.
Figure 7: An illustration of the AlexNet architecture
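For reference, this layer structure can be inspected directly from the torchvision implementation (a quick sketch, assuming a recent torchvision; `weights=None` skips downloading pretrained parameters):

```python
import torchvision.models as models

alexnet = models.alexnet(weights=None)
print(alexnet)  # five conv layers in .features, three fully connected in .classifier
```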
VGG16[9] is a convolutional neural network that achieved top results in the ImageNet Large Scale Visual Recognition Challenge in 2014 and was regarded as one of the best models of its time. The 16 in VGG16[9] refers to its 16 layers. Indeed, VGG16[9] is a large model, with approximately 138 million parameters. As shown in Figure 8, VGG16[9] has 5 convolution blocks and 1 fully connected block. Each convolution block contains a set of convolutional layers followed by pooling. Finally, the three fully connected layers are referred to as Dense in Figure 8.
Figure 8: An illustration of the VGG16 architecture
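That parameter count is easy to verify from the torchvision implementation (again assuming a recent torchvision):

```python
import torchvision.models as models

vgg16 = models.vgg16(weights=None)
n_params = sum(p.numel() for p in vgg16.parameters())
print(f"{n_params:,}")  # 138,357,544 -- approximately 138 million parameters
```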
ResNet18[12] is a convolutional neural network from the Residual Network family, which won the ImageNet Large Scale Visual Recognition Challenge classification competition in 2015; residual networks with 100 and even 1000 layers have also been trained. The 18 refers to its 18 weight layers, shown in Figure 9 together with two pooling layers.
Figure 9: An illustration of ResNet18 architecture
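The ingredient that lets residual networks grow to 100 or even 1000 layers is the identity shortcut. Below is a minimal sketch of a basic residual block, our simplified version that omits the strided and projection variants of the full network:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Simplified ResNet basic block: two 3x3 convolutions plus an identity shortcut."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # skip connection: add the input back

x = torch.randn(1, 64, 56, 56)
assert BasicBlock(64)(x).shape == x.shape
```

Because each block only has to learn a residual correction to its input, gradients flow through the shortcut unimpeded, which is what makes very deep networks trainable.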
GoogleNet[13] is based on inception modules, as shown in Figure 10. Each inception module is composed of several convolutional layers and a max pooling layer, arranged as four parallel operations.
Figure 10: The inception module: parallel 1×1, 3×3, and 5×5 convolutions (the 3×3 and 5×5 branches preceded by 1×1 reductions) and a 3×3 max pooling branch followed by a 1×1 convolution, merged by filter concatenation.
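A minimal sketch of such a module follows; it is simplified from the paper (no ReLUs or batch normalization) and the channel counts are arbitrary:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Simplified inception module: four parallel branches, concatenated channel-wise."""
    def __init__(self, in_ch, c1, c3, c5, cp):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, 1)                          # 1x1 conv
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, c3, 1),           # 1x1 reduce
                                     nn.Conv2d(c3, c3, 3, padding=1))   # then 3x3
        self.branch5 = nn.Sequential(nn.Conv2d(in_ch, c5, 1),           # 1x1 reduce
                                     nn.Conv2d(c5, c5, 5, padding=2))   # then 5x5
        self.branchp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                     nn.Conv2d(in_ch, cp, 1))           # pool, then 1x1

    def forward(self, x):
        # Filter concatenation: stack the four branch outputs along the channel axis.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branchp(x)], dim=1)

y = InceptionModule(192, 64, 128, 32, 32)(torch.randn(1, 192, 28, 28))
assert y.shape[1] == 64 + 128 + 32 + 32
```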
The GoogleNet[13] architecture is 22 layers deep (27 layers when pooling is counted). In total there are 9 inception modules. After the inception modules comes a global average pooling layer, as illustrated in Figure 11.
Figure 11: An illustration of GoogleNet architecture
In the DenseNet[14] architecture, each layer is connected to every other layer, hence the name Densely Connected Convolutional Network. This is the main idea of DenseNet, and it is extremely powerful: the input of each layer inside DenseNet[14] is the concatenation of the feature maps of all preceding layers (see Figure 12).
Figure 12: An illustration of the DenseNet architecture (four dense blocks of 6, 12, 24, and 16 dense layers, separated by transition layers of convolution and average pooling).
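The concatenation idea fits in a few lines. This is our simplified dense block; a real DenseNet layer adds batch normalization and a 1×1 bottleneck before the 3×3 convolution:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Simplified dense block: each layer receives all earlier feature maps."""
    def __init__(self, in_ch: int, growth: int, n_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(in_ch + i * growth, growth, 3, padding=1)
            for i in range(n_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Input of each layer = concatenation of every previous feature map.
            features.append(torch.relu(layer(torch.cat(features, dim=1))))
        return torch.cat(features, dim=1)

out = DenseBlock(in_ch=64, growth=32, n_layers=6)(torch.randn(1, 64, 56, 56))
assert out.shape[1] == 64 + 6 * 32  # channels grow by `growth` per layer
```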
MobileNet[15] utilizes depthwise separable convolutions instead of standard convolutions, except for the first layer, to reduce computation and model size. It can therefore be used to construct lightweight deep neural networks for embedded and mobile vision applications. All layers are followed by batch normalization and a ReLU non-linearity, except the final fully connected layer, which has no non-linearity and feeds a softmax for classification (see Figure 13).
Figure 13: An illustration of the MobileNet architecture (the central depthwise separable block is repeated 5 times).
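The depthwise separable building block itself is short to write down; the sketch below follows the description above, with illustrative channel sizes:

```python
import torch
import torch.nn as nn

def depthwise_separable(in_ch: int, out_ch: int, stride: int = 1) -> nn.Sequential:
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution,
    each with batch normalization and ReLU, as in the MobileNet block."""
    return nn.Sequential(
        # Depthwise: one 3x3 filter per input channel (groups=in_ch).
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        # Pointwise: a 1x1 convolution that mixes the channels.
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

block = depthwise_separable(32, 64)
print(block(torch.randn(1, 32, 112, 112)).shape)  # torch.Size([1, 64, 112, 112])
```

Splitting a standard convolution into these two steps cuts the multiply-accumulate count roughly by a factor of the kernel area, which is where MobileNet's savings come from.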
3. COMPARISON OF BACKBONES
Table 1 lists the deep learning models used for the classification task of the ImageNet Large Scale Visual Recognition Challenge; the number associated with each model name refers to its number of layers. The table contains the model name, reference, paper title, accuracy, and, finally, time (see Table 1). Our comparison criteria are time and accuracy. Time refers to the training time on the ImageNet dataset. Accuracy is an evaluation metric that describes, in general terms, how the model performs across all classes: it is computed as the ratio of the number of correct predictions to the total number of predictions, and it lies between 0% and 100%. There are also other performance metrics, such as recall and precision.
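Spelled out, the accuracy used here reduces to a one-line computation (a plain-Python sketch):

```python
def accuracy(predictions, labels) -> float:
    """Percentage of predictions that match the ground-truth labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)

print(accuracy([0, 2, 1, 1], [0, 2, 2, 1]))  # 75.0 -- three of four correct
```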
After gathering the main methods, we compare them (see Table 1), on the one hand with a view to identifying the best model in terms of training time. The bar plot in Figure 14 shows the best method.
Figure 14: Training time comparison of deep learning based classification models
Figure 15: Accuracy comparison of deep learning based classification models