What is the real architecture(layers) of YOLOv3? - yolo

Recently i'm reading yolov3's paper and code and i found a question.
In yolov3 it is darknet-53, which means it has 53 convolutional layers, but when i see this picture and count, i only get 52 convolutional layers.
this picture is from it's paper
And i also see the yolov3.cfg, totally it has 107 layers(from 0 to 106), and my result is:
75 convolutional layers + 23 shortcut layers + 3 yolo layers + 4 route layers + 2 upsample layers = 107 layers
I want to know did i misunderstand something? or where is the 53th convolutional layer in the picture above?
Thanks in advance.

Like most of the neural network models, in darknet-53, 53 refers to the total trainable layers (like 19 in VGG-19). So 52 CNN layers that you counted plus the last connected layer gives darknet-53.
Note, the connected layer can also be created using CNN layer, that's why yolo is being called fully convolutional neural network.
For your second query, I personally think you are right. There is a total of 107 layers in yolov3.cfg file.
52 layers are taken from darknet-53 (of course excluding connected layer), 27 other convolutional layers are added including 3 YOLO layers. This gives a total of 79 trainable convolutional layers in total.

Related

In CNN network, should input image go to all neurons in first convolution layer (I mean first hidden layer) or not?

I am new to CNN topic, I have one basic question regarding mapping between input image with neurons in first convolution layer.
My question is :
should input image go to all neurons in first convolution layer (I mean first hidden layer) or not ?
For example: if my first hidden layer in CNN as 8 neurons, in that case complete input image is passed to all these 8 neurons or only set of pixels of input image is passed to each neuron.
I am not sure whether you understand what you are asking because the number of neurons in a convolutional layer is something that you don't concern yourself too much with unless you are building your own implementation of a CNN.
To answer your question - No, each neuron in the first convolutional layer is connected only to pixels in its own receptive field (given by kernel size) and the same logic also applies to next convolutional layers as well except now they are connected to lower layer neurons in their receptive field.
For example: if my first hidden layer in CNN as 8 neurons
How do you know that it has 8 neurons? Unless you are doing some low level CNN programming, you do not specify the number of neurons. The number of neurons that you need is usually given by a combination of kernel size, stride, type of padding and the amount of filters that you have chosen to use. These 4 things (together with input size of the image) tells you exactly how many neurons you need.
For example, in Keras (since you have tagged this question with tensorflow) you might see convolutional layer like this one:
keras.layers.Conv2D(filters=128, kernel_size=3, strides=1, activation="relu",
input_shape=(100, 100, 1))
As you can see, you don't specify anything like the number of neurons here (at least not directly). Under this settings, you end up with output whose width and height are cropped by 2 (due to the default padding) so the output shape of this layer is (128, 98, 98) (technically it has a shape of (None, 128, 98, 98) where None is for batch size). If you would flatten this and feed the output into a single neuron (let's say in a dense layer), you would end up with 128 * 98 * 98 = 1,229,313 weights just between these two layers.
So for the analogy between dense and convolutional layer, the above convolutional layers with 128 filters connected to one output neuron is similar to having dense layer with 1,229,313 neurons connected to one output neuron.

Last fc layers in VGG16

The VGG16 architecture has input: 224x224x3 images.I want to have 48x48x3 inputs but to do this in keras, we remove the last fc layers which have 4096 neurons each.Why we have to do this? and is it needed to add another size of fc layers for this input?
Final pooling layer of VGG16 has dimension 7x7x512 for 224x224 input image. From there VGG16 uses fully connected layer of (7x7x512)x4096 to get 4096 dimensional output. However, since your input size is different your feature output dimension from final pooling layer will also be different (2x2x512 I think). So you need to change matrix dimension for fully connected layer to make it work. You have two other options though
use a global average pooling across spatial dimension to get 512 dimensional feature and then use few fully connected layers to get to your number of classes.
Resize you input image to 224x224x3 and you won't need to change anything in model architecture.
Removing the last FC layers is for fine-tuning or transfer learning, where you adapt an existing network to a new problem, such as changing the number of categories that your classifier can choose between.
You are adapting the network to take a different sized input, so you need to adjust the first layer(s) of the network.

Keras CIFAR-10 Dense Layer Code Why 512 Neurons in Last Layer?

I am using Keras to build a CNN to work with the CIFAR-10 dataset. I am slightly confused at one of the last lines of an online tutorial. They take 50,000 32x32 color images and process them through 4 convolutional layers and one fully connected layer. The last part is accomplished by:
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
I am trying to understand why it is model.add(Dense(512)) and not some other number. For example, I thought 32x32 images can be flattened to a 1024-size vector. But, why did they choose 512 here?
Thanks!
Actualy not 32x32, it's 32x32x3 because of color channels and flatten and dense different methods I think you won't get the code there is low level implementation:
W1=tf.Variable(tf.random_normal([32*32*3,512]),name="W1") #variable
x=tf.placeholder(tf.float32,[batch,32,32,3]) #placeholder for inputs
flat=tf.reshape(x,[batch,32*32*3]) #model.add(Flatten())
mul1=tf.matmul(flat,W1) #model.add(Dense(512))
relu=tf.nn.relu(mul1) #model.add(Activation('relu'))
flat's shape=[batch,32*32*3]
mul1's shape=[batch,512]
Of course it could be 1024 or 5000 but it becomes harder to optimize.

How to decide number of nodes for CNN model for image classification using tensorflow ? Images are of size 178,218

How to select the following: Size of filter for convolution, strides, pooling, and densely connected layer
There is no single answer to this question. This Reddit and this answer have some nice discussion. To quote the second post on the Reddit, "Start simple."
Celeba has similar, maybe exactly the same, image size. When I was working with Celeba on a DCGAN project, I gently cropped and then reshaped the images to 64 x 64 x 3. My discriminator was a convolutional neural network and used 4 convolutional layers and one fully connected layer. All conv layers had 5 x 5 window size and stride size of 2 x 2. SAME padding and no pooling. The output channels per layer were 128 -> 256 -> 512 -> 1024. So, the last conv layer output a 4 x 4 x 1024 tensor. My dense layer then had a weight size of classes x 1024. (I had 1 class since its purpose was to determine whether the input image was from the dataset or made by the generator.)
That relatively simple architecture had good results, but it was intentionally built to not overpower the generator. If you're looking for pure classification, you may want a deeper architecture. You might not want to crop as aggressively as I did. Then you can include more conv layers before the fully connected layer. You may want to use a 3 x 3 window size with a stride size of 1 x 1 and use pooling - although I see architectures abandoning pooling in favor of larger stride size. If your dataset is small, it is prone to overfitting. Having smaller weights helps combat this when dropout isn't enough. That means fewer output channels per layer.
There are a lot of possibilities when choosing an architecture, and there is no hard-and-fast rule for the best architecture. Remember to start simple.

TensorFlow CNN some kernels not learning

I have an AlexNet implementation in TensorFlow which is converging, though not to the supposed loss values... I start with a learning rate of 0.01 (as stated in the paper for ILSVRC2012) and right before the first LR drop, the loss is about 6.0, where it should be around 3.0 and 3.5... now that I exported the weights for visualisation, I noticed that some of the channels in the convolutional layers are not learning... in the first layer I have 96 channels, but only 63 of them have "normal values", the remaining 33 have 0 values...
Here is an image of the kernels, each normalised to its maximum. The unlearned kernels are the ones that are "random".
Any ideias as to why? I have a dropout rate of 50% during train in the first 2 FC layers.
Thanks in advance.