Understanding darknet's yolo.cfg config files - yolo

I have searched around the internet but found very little information around this, I don't understand what each variable/value represents in yolo's .cfg files. So I was hoping some of you could help, I don't think I'm the only one having this problem, so if anyone knows 2 or 3 variables please post them so that people who needs such info in the future might find them.
The main one that I'd like to know are :
batch
subdivisions
decay
momentum
channels
filters
activation

Here is my current understanding of some of the variables. Not necessarily correct though:
[net]
batch: That many images+labels are used in the forward pass to compute a gradient and update the weights via backpropagation.
subdivisions: The batch is subdivided in this many "blocks". The images of a block are ran in parallel on the gpu.
decay: Maybe a term to diminish the weights to avoid having large values. For stability reasons I guess.
channels: Better explained in this image :
On the left we have a single channel with 4x4 pixels, The reorganization layer reduces the size to half then creates 4 channels with adjacent pixels in different channels.
momentum: I guess the new gradient is computed by momentum * previous_gradient + (1-momentum) * gradient_of_current_batch. Makes the gradient more stable.
adam: Uses the adam optimizer? Doesn't work for me though
burn_in: For the first x batches, slowly increase the learning rate until its final value (your learning_rate parameter value). Use this to decide on a learning rate by monitoring until what value the loss decreases (before it starts to diverge).
policy=steps: Use the steps and scales parameters below to adjust the learning rate during training
steps=500,1000: Adjust the learning rate after 500 and 1000 batches
scales=0.1,0.2: After 500, multiply the LR by 0.1, then after 1000 multiply again by 0.2
angle: augment image by rotation up to this angle (in degree)
layers
filters: How many convolutional kernels there are in a layer.
activation: Activation function, relu, leaky relu, etc. See src/activations.h
stopbackward: Do backpropagation until this layer only. Put it in the panultimate convolution layer before the first yolo layer to train only the layers behind that, e.g. when using pretrained weights.
random: Put in the yolo layers. If set to 1 do data augmentation by resizing the images to different sizes every few batches. Use to generalize over object sizes.
Many things are more or less self-explanatory (size, stride, batch_normalize, max_batches, width, height). If you have more questions, feel free to comment.
Again, please keep in mind that I am not 100% certain about many of those.

More complete explanation about the cfg parameters, copied from the author of YOLO v4 https://github.com/AlexeyAB/darknet/wiki/CFG-Parameters-in-the-%5Bnet%5D-section
and https://github.com/AlexeyAB/darknet/wiki/CFG-Parameters-in-the-different-layers
Below is only the snapshot of the documentation, please refer to the above links for a better format
CFG-Parameters in the [net] section:
[net] section
batch=1 - number of samples (images, letters, ...) which will be precossed in one batch
subdivisions=1 - number of mini_batches in one batch, size mini_batch = batch/subdivisions, so GPU processes mini_batch samples at once, and the weights will be updated for batch samples (1 iteration processes batch images)
width=416 - network size (width), so every image will be resized to the network size during Training and Detection
height=416 - network size (height), so every image will be resized to the network size during Training and Detection
channels=3 - network size (channels), so every image will be converted to this number of channels during Training and Detection
inputs=256 - network size (inputs) is used for non-image data: letters, prices, any custom data
max_chart_loss=20 - max value of Loss in the image chart.png
For training only
Contrastive loss:
contrastive=1 - use Supervised contrastive loss for training Classifier (should be used with [contrastive] layer)
unsupervised=1 - use Unsupervised contrastive loss for training Classifier on images without labels (should be used with contrastive=1 parameter and with [contrastive] layer)
Data augmentation:
angle=0 - randomly rotates images during training (classification only)
saturation = 1.5 - randomly changes saturation of images during training
exposure = 1.5 - randomly changes exposure (brightness) during training
hue=.1 - randomly changes hue (color) during training https://en.wikipedia.org/wiki/HSL_and_HSV
blur=1 - blur will be applied randomly in 50% of the time: if 1 - will be blured background except objects with blur_kernel=31, if >1 - will be blured whole image with blur_kernel=blur (only for detection and if OpenCV is used)
min_crop=224 - minimum size of randomly cropped image (classification only)
max_crop=448 - maximum size of randomly cropped image (classification only)
aspect=.75 - aspect ration can be changed during croping from 0.75 - to 1/0.75 (classification only)
letter_box=1 - keeps aspect ratio of loaded images during training (detection training only, but to use it during detection-inference - use flag -letter_box at the end of detection command)
cutmix=1 - use CutMix data augmentation (for Classifier only, not for Detector)
mosaic=1 - use Mosaic data augmentation (4 images in one)
mosaic_bound=1 - limits the size of objects when mosaic=1 is used (does not allow bounding boxes to leave the borders of their images when Mosaic-data-augmentation is used)
data augmentation in the last [yolo]-layer
jitter=0.3 - randomly changes size of image and its aspect ratio from x(1 - 2*jitter) to x(1 + 2*jitter)
random=1 - randomly resizes network size after each 10 batches (iterations) from /1.4 to x1.4 with keeping initial aspect ratio of network size
adversarial_lr=1.0 - Changes all detected objects to make it unlike themselves from neural network point of view. The neural network do an adversarial attack on itself
attention=1 - shows points of attention during training
gaussian_noise=1 - add gaussian noise
Optimizator:
momentum=0.9 - accumulation of movement, how much the history affects the further change of weights (optimizer)
decay=0.0005 - a weaker updating of the weights for typical features, it eliminates dysbalance in dataset (optimizer) http://cs231n.github.io/neural-networks-3/
learning_rate=0.001 - initial learning rate for training
burn_in=1000 - initial burn_in will be processed for the first 1000 iterations, current_learning rate = learning_rate * pow(iterations / burn_in, power) = 0.001 * pow(iterations/1000, 4) where is power=4 by default
max_batches = 500200 - the training will be processed for this number of iterations (batches)
policy=steps - policy for changing learning rate: constant (by default), sgdr, steps, step, sig, exp, poly, random (f.e., if policy=random - then current learning rate will be changed in this way = learning_rate * pow(rand_uniform(0,1), power))
power=4 - if policy=poly - the learning rate will be = learning_rate * pow(1 - current_iteration / max_batches, power)
sgdr_cycle=1000 - if policy=sgdr - the initial number of iterations in cosine-cycle
sgdr_mult=2 - if policy=sgdr - multiplier for cosine-cycle https://towardsdatascience.com/https-medium-com-reina-wang-tw-stochastic-gradient-descent-with-restarts-5f511975163
steps=8000,9000,12000 - if policy=steps - at these numbers of iterations the learning rate will be multiplied by scales factor
scales=.1,.1,.1 - if policy=steps - f.e. if steps=8000,9000,12000, scales=.1,.1,.1 and the current iteration number is 10000 then current_learning_rate = learning_rate * scales[0] * scales[1] = 0.001 * 0.1 * 0.1 = 0.00001
label_smooth_eps=0.1 - use label smoothing for training Classifier
For training Recurrent networks:
Object Detection/Tracking on Video - if [conv-lstm] or [crnn] layers are used in additional to [connected] and [convolutional] layers
Text generation - if [lstm] or [rnn] layers are used in additional to [connected] layers
track=1 - if is set 1 then the training will be performed in Recurrents-tyle for image sequences
time_steps=16 - training will be performed for a random image sequence that contains 16 images from train.txt file
for [convolutional]-layers: mini_batch = time_steps*batch/subdivisions
for [conv_lstm]-recurrent-layers: mini_batch = batch/subdivisions and sequence=16
augment_speed=3 - if set 3 then can be used each 1st, 2nd or 3rd image randomly, i.e. can be used 16 images with indexes 0, 1, 2, ... 15 or 110, 113, 116, ... 155 from train.txt file
sequential_subdivisions=8 - lower value increases the sequence of images, so if time_steps=16 batch=16 sequential_subdivisions=8, then will be loaded time_steps*batch/sequential_subdivisions = 16*16/8 = 32 sequential images with the same data-augmentation, so the model will be trained for sequence of 32 video-frames
seq_scales=0.5, 0.5 - increasing sequence of images at some steps, i.e. the coefficients to which the original sequential_subdivisions value will be multiplied (and batch will be dividied, so the weights will be updated rarely) at correspond steps if is used policy=steps or policy=sgdr
CFG-Parameters in the different layers
Image processing [N x C x H x W]:
[convolutional] - convolutional layer
batch_normalize=1 - if 1 - will be used batch-normalization, if 0 will not (0 by default)
filters=64 - number of kernel-filters (1 by default)
size=3 - kernel_size of filter (1 by default)
groups = 32 - number of groups for grouped-convolutional (depth-wise) (1 by default)
stride=1 - stride (offset step) of kernel filter (1 by default)
padding=1 - size of padding (0 by default)
pad=1 - if 1 will be used padding = size/2, if 0 the will be used parameter padding= (0 by default)
dilation=1 - size of dilation (1 by default)
activation=leaky - activation function after convolution: logistic (by default), loggy, relu, elu, selu, relie, plse, hardtan, lhtan, linear, ramp, leaky, tanh, stair, relu6, swish, mish
[activation] - separate activation layer
activation=leaky - activation function: linear (by default), loggy, relu, elu, selu, relie, plse, hardtan, lhtan, linear, ramp, leaky, tanh, stair
[batchnorm] - separate Batch-normalization layer
[maxpool] - max-pooling layer (the maximum value)
size=2 - size of max-pooling kernel
stride=2 - stirde (offset step) of max-pooling kernel
[avgpool] - average pooling layer input W x H x C -> output 1 x 1 x C
[shortcut] - residual connection (ResNet)
from=-3,-5 - relative layer numbers, preforms element-wise adding of several layers: previous-layer and layers specified in from= parameter
weights_type=per_feature - will be used weights for shortcut y[i] = w1*layer1[i] + w2*layer2[i] ...
per_feature - 1 weights per layer/feature
per_channel - 1 weights per channel
none - weights will not be used (by default)
weights_normalization=softmax - will be used weights normalization
softmax - softmax normalization
relu - relu normalization
none - without weights normalization - unbound weights (by default)
activation=linear - activation function after shortcut/residual connection (linear by default)
[upsample] - upsample layer (increase W x H resolution of input by duplicating elements)
stride=2 - factor for increasing both Width and Height (new_w = w*stride, new_h = h*stride)
[scale_channels] - scales channels (SE: squeeze-and-excitation blocks) or (ASFF: adaptively spatial feature fusion) -it multiplies elements of one layer by elements of another layer
from=-3 - relative layer number, performs multiplication of all elements of channel N from layer -3, by one element of channel N from the previous layer -1 (i.e. for(int i=0; i < b*c*h*w; ++i) output[i] = from_layer[i] * previous_layer[i/(w*h)]; )
scale_wh=0 - SE-layer (previous layer 1x1xC), scale_wh=1 - ASFF-layer (previous layer WxHx1)
activation=linear - activation function after scale_channels-layer (linear by default)
[sam] - Spatial Attention Module (SAM) - it multiplies elements of one layer by elements of another layer
from=-3 - relative layer number (this and previous layers should be the same size WxHxC)
[reorg3d] - reorg layer (resize W x H x C)
stride=2 - if reverse=0 input will be resized to W/2 x H/2 x C4, if reverse=1thenW2 x H*2 x C/4`, (1 by default)
reverse=1 - if 0(by default) then decrease WxH, if1thenincrease WxH (0 by default)
[reorg] - OLD reorg layer from Yolo v2 - has incorrect logic (resize W x H x C) - depracated
stride=2 - if reverse=0 input will be resized to W/2 x H/2 x C4, if reverse=1thenW2 x H*2 x C/4`, (1 by default)
reverse=1 - if 0(by default) then decrease WxH, if1thenincrease WxH (0 by default)
[route] - concatenation layer, Concat for several input-layers, or Identity for one input-layer
layers = -1, 61 - layers that will be concatenated, output: W x H x C_layer_1 + C_layer_2
if index < 0, then it is relative layer number (-1 means previous layer)
if index >= 0, then it is absolute layer number
[yolo] - detection layer for Yolo v3 / v4
mask = 3,4,5 - indexes of anchors which are used in this [yolo]-layer
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326 - initial sizes if bounded_boxes that will be adjusted
num=9 - total number of anchors
classes=80 - number of classes of objects which can be detected
ignore_thresh = .7 - keeps duplicated detections if IoU(detect, truth) > ignore_thresh, which will be fused during NMS (is used for training only)
truth_thresh = 1 - adjusts duplicated detections if IoU(detect, truth) > truth_thresh, which will be fused during NMS (is used for training only)
jitter=.3 - randomly crops and resizes images with changing aspect ratio from x(1 - 2*jitter) to x(1 + 2*jitter) (data augmentation parameter is used only from the last layer)
random=1 - randomly resizes network for each 10 iterations from 1/1.4 to 1.4(data augmentation parameter is used only from the last layer)
resize=1.5 - randomly resizes image in range: 1/1.5 - 1.5x
max=200 - maximum number of objects per image during training
counters_per_class=100,10,1000 - number of objects per class in Training dataset to eliminate the imbalance
label_smooth_eps=0.1 - label smoothing
scale_x_y=1.05 - eliminate grid sensitivity
iou_thresh=0.2 - use many anchors per object if IoU(Obj, Anchor) > 0.2
iou_loss=mse - IoU-loss: mse, giou, diou, ciou
iou_normalizer=0.07 - normalizer for delta-IoU
cls_normalizer=1.0 - normalizer for delta-Objectness
max_delta=5 - limits delta for each entry
[crnn] - convolutional RNN-layer (recurrent)
batch_normalize=1 - if 1 - will be used batch-normalization, if 0 will not (0 by default)
size=1 - convolutional kernel_size of filter (1 by default)
pad=0 - if 1 will be used padding = size/2, if 0 the will be used parameter padding= (0 by default)
output = 1024 - number of kernel-filters in one output convolutional layer (1 by default)
hidden=1024 - number of kernel-filters in two (input and hidden) convolutional layers (1 by default)
activation=leaky - activation function for each of 3 convolutional-layers in the [crnn]-layer (logistic by default)
[conv_lstm] - convolutional LSTM-layer (recurrent)
batch_normalize=1 - if 1 - will be used batch-normalization, if 0 will not (0 by default)
size=3 - convolutional kernel_size of filter (1 by default)
padding=1 - convolutional size of padding (0 by default)
pad=1 - if 1 will be used padding = size/2, if 0 the will be used parameter padding= (by default)
stride=1 - convolutional stride (offset step) of kernel filter (1 by default)
dilation=1 - convolutional size of dilation (1 by default)
output=256 - number of kernel-filters in each of 8 or 11 convolutional layers (1 by default)
groups=4 - number of groups for grouped-convolutional (depth-wise) (1 by default)
state_constrain=512 - constrains LSTM-state values [-512; +512] after each inference (time_steps*32 by default)
peephole=0 - if 1 then will be used Peephole (additional 3 conv-layers), if 0 will not (1 by default)
bottleneck=0 - if 1 then will be used reduced optimal versionn of conv-lstm layer
activation=leaky - activation function for each of 8 or 11 convolutional-layers in the [conv_lstm]-layer (linear by default)
lstm_activation=tanh - activation for G (gate: g = tanh(wg + ug)) and C (memory cell: h = o * tanh(c))
Detailed-architecture-of-the-peephole-LSTM
Free-form data processing [Inputs]:
[connected] - fully connected layer
output=256 - number of outputs (1 by default), so number of connections is equal to inputs*outputs
activation=leaky - activation after layer (logistic by default)
[dropout] - dropout layer
probability=0.5 - dropout probability - what part of inputs will be zeroed (0.5 = 50% by default)
dropblock=1 - use as DropBlock
dropblock_size_abs=7 - size of DropBlock in pixels 7x7
[softmax] - SoftMax CE (cross entropy) layer - Categorical cross-entropy for multi-class classification
[contrastive] - Contrastive loss layer for Supervised and Unsupervised learning (should be set [net] contrastive=1 and optionally [net] unsupervised=1)
classes=1000 - number of classes
temperature=1.0 - temperature
[cost] - cost layer calculates (linear)Delta and (squared)Loss
type=sse - cost type: sse (L2), masked, smooth (smooth-L1) (SSE by default)
[rnn] - fully connected RNN-layer (recurrent)
batch_normalize=1 - if 1 - will be used batch-normalization, if 0 will not (0 by default)
output = 1024 - number of outputs in one connected layer (1 by default)
hidden=1024 - number of outputs in two (input and hidden) connected layers (1 by default)
activation=leaky - activation after layer (logistic by default)
[lstm] - fully connected LSTM-layer (recurrent)
batch_normalize=1 - if 1 - will be used batch-normalization, if 0 will not (0 by default)
output = 1024 - number of outputs in all connected layers (1 by default)
[gru] - fully connected GRU-layer (recurrent)
batch_normalize=1 - if 1 - will be used batch-normalization, if 0 will not (0 by default)
output = 1024 - number of outputs in all
connected layers (1 by default)

Although this is a quite old request of help, for the future users looking for an answer,
you can find all the explanation on the Wiki page inside the most famous fork of the original Yolo project
https://github.com/AlexeyAB/darknet/wiki
In particular, copying and pasting only the [net] part from here as follows:
[net]
batch=1 - number of samples (images, letters, ...) which will be precossed in one batch
subdivisions=1 - number of mini_batches in one batch, size mini_batch = batch/subdivisions, so GPU processes mini_batch samples at once, and the weights will be updated for batch samples (1 iteration processes batch images)
width=416 - network size (width), so every image will be resized to the network size during Training and Detection
height=416 - network size (height), so every image will be resized to the network size during Training and Detection
channels=3 - network size (channels), so every image will be converted to this number of channels during Training and Detection
inputs=256 - network size (inputs) is used for non-image data: letters, prices, any custom data
Anyway, you should even try to look in the relative Github/issues part for something, even naive, you want to know, because usually it has already been asked and answered.
Good luck.

batch the number of images chosen in each batch to reduce loss
subdivisions division of batch size to no. of sub batches for parallel processing
decay is a learning parameter and as specified in the journal a momentum of 0.9 and decay of 0.0005 is used
momentum is a learning parameter and as specified in the journal a momentum of 0.9 and decay of 0.0005 is used
channels Channels refers to the channel size of the input image(3) for a BGR image
filters the number of filters used for a CNN algorithm
activation the activation function of CNN: mostly Leaky RELU function is used ( what i have seen mostly in the configuration files)

Related

Number of nodes in output later greater than number of classes in a neural network

While training a neural network, on the fashion mnist dataset, I decided to have a greater number of nodes in my output layer than the number of classes in the dataset.
The dataset has 10 classes, while I trained my network to have 15 nodes in the output layer. I also used a softmax.
Now surprisingly, this gave me an accuracy of 97% which is quite good.
This leads me to the question, what do those extra 5 nodes even mean, and what do they do here?
Why is my softmax able to work properly when the label range(0-9) isn't equal to the number of nodes(15)?
And finally, in general, what does it mean to have more nodes in your output layer than the number of classes, in a classification task?
I understand the effects of having lesser nodes than the number of classes, and also that the rule of thumb is to use number of nodes = number of classes. Yet, I've never seen someone use a greater number of nodes, and I'd like to understand why/why not.
I'm attaching some code so that the results can be reproduced. This was done using Tensorflow 2.3
import tensorflow as tf
print(tf.__version__)
mnist = tf.keras.datasets.mnist
(training_images, training_labels) , (test_images, test_labels) = mnist.load_data()
training_images = training_images/255.0
test_images = test_images/255.0
model = tf.keras.models.Sequential([tf.keras.layers.Flatten(),
tf.keras.layers.Dense(256, activation=tf.nn.relu),
tf.keras.layers.Dense(15, activation=tf.nn.softmax)])
model.compile(optimizer = 'adam',
loss = 'sparse_categorical_crossentropy',
metrics = ['accuracy'])
model.fit(training_images, training_labels, epochs=5)
model.evaluate(test_images, test_labels)
The only reason you are able to use such a configuration is because you have specified your loss function as sparse_categorical_crossentropy.
let's understand the effects of greater output nodes in forward propagation.
Consider a neural network with 2 layers.
1st layer - 6 neurons (Hidden layer)
2nd layer - 4 neurons (output layer)
You have dataset X whose shape is(100*12) ie. 12 features and 100 rows.
you have labels y whose shape is (100,) containing two unique values 0 and 1.
Therefore essentially this is a binary classification problem but we will use 4 neurons in our output layer.
Consider each neuron as a logistic regression unit. Therefore each of your neurons will 12 weights (w1, w2,.....,w12)
Why? - Because you have 12 features.
Each neuron will output a single term given by a. I will give the computation of a in two steps.
z = w1x1 + w2x2 + ........ + w12*x12 + w0 # w0 is bias
a = activation(z)
Therefore, your 1st layer will output 6 values for each row in our dataset.
So now you have a feature matrix of 100 * 6.
This is passed to the 2nd layer and the same process repeats.
So in essence you are able to complete the forward propagation step even when you have more neurons than the actual classes.
Now let's see backpropagation.
For backpropagation to exist you must be able to calculate the loss_value.
we will take a small example:
y_true has two labels as in our problem and y_pred has 4 probability values since we have 4 units in our final layer.
y_true = [0, 1]
y_pred = [[0.03, 0.90, 0.02, 0.05], [0.15, 0.02, 0.8, 0.03]]
# Using 'auto'/'sum_over_batch_size' reduction type.
scce = tf.keras.losses.SparseCategoricalCrossentropy()
scce(y_true, y_pred).numpy() # 3.7092905
How is it calculated:
( log(0.03) + log(0.02) ) / 2
So essentially we can compute the loss so we can also compute its gradients.
Therefore no problem in using backpropagation too.
Therefore our model can very well train and achieve 90 % accuracy.
So the final question, what are these extra neurons representing. ie( neuron 2 and neuron 3).
Ans - They are representing the probability of the example being of class 2 and class 3 respectively. But since the labels contain no values of class 2 and class 3 they will have zero contribution in calculating the loss value.
Note- If you encode your y_label in one-hot-encoding and use categorical_crossentropy as your loss you will encounter an error.

Keras dense model gradient explosion

I have a very simple dense layer model takes 10 input values, 20 units in hidden layer, 1 unit in output layer, and "relu" as activation function, adam optimizer with learning rate 0.01
densemodel=keras_model_sequential();
layer_dense(densemodel, input_shape=ncol(trainingX), units=20, activation="relu")
layer_dropout(densemodel, rate=0.1)
layer_dense(densemodel, units=1, activation="relu")
optimizer=optimizer_adam(lr=0.01,clipnorm=1);
compile(densemodel, optimizer=optimizer, loss="logcosh", metrics = list("mean_squared_error"))
I trained the model with n = 2e4 training data and ran into serious gradient explosion, which was finally confirmed caused by some outliers (n < 10) in the training records.
Without removing the the outlier records, any one or combination of the following strategies failed to address the gradient explosion problem.
kernel_regularizer, bias_regularizer, activity_regularizer, clipnorm=1, clipvalue=0.5 or 0.1, set learning rate to 1e-5, add drop out layer, increase batch size.
basically none of them work.
I expect at least clipnorm or clipvalue should work since according to definition
clipnorm: Gradients will be clipped when their L2 norm exceeds this
value.
clipvalue: Gradients will be clipped when their absolute value exceeds
this value.
but why they failed?

Variational autoencoder cannot train with smal input values

I am using a variational autoencoder to reconstruct images in tensorflow 2.0 with the Keras API. My model's architecture looks like that:
The lambda layer uses a function to sample from a normal distribution which looks like that:
def sampling(args):
z_mean, z_log_var = args
epsilon = K.random_normal(shape =(1,1,16))
return z_mean + K.exp(0.5 * z_log_var) * epsilon
My hyperparameters are as follows:
epochs = 50
batch size =16
num_training = 1800
num_val = 100
num_test = 100
learning rate = 0.001
exponential decay = 0.9 * initial learning rate (calculated every 5 epochs)
optimizer = Adam
shuffle = True
I am using the following loss:
def vae_loss(y_pred, y_gt):
mse_loss = mse(y_pred, y_gt)
z_mean = model.get_layer('z_mean_layer').output
z_log_var = model.get_layer('z_log_var_layer').output
kl_loss = 1 + z_log_var - K.square(z_mean) - K.exp(z_log_var)
kl_loss = K.sum(kl_loss, axis=-1)
kl_loss *= -0.5
return K.mean(mse_loss + kl_loss)
My weights are initialized the default way: kernel_initializer='glorot_uniform', bias_initializer='zeros'.
My datasets images consist of a randomly placed circle, which looks like that:
The background has the value 0 and the circle's value is sampled from a uniform distribution between -1 and 1, e.g. 0.987 for all circle pixels.
When I train with this configuration, I get the following loss.
The KL divergence is of magnitude 1e-8, whereas the MSE loss is stays at 0.101.
And I always get the same reconstruction, regardless of the input, which is an image with a constant pixel intensity
Now, if I multiply all input images with 500 (eg. background stays zero, circle pixel values are uniformly distributed in the range (-500, 500)), the network miraculously starts to learn.
with a KL loss of magnitude 50 and MSE loss of magnitude 250 (last epochs)
And the image reconstruction works well. Basically, the MSE metric is high, but the circle contour is positioned in the right place.
My quiestion is: How come the network cannot reconstruct images in the range (-1,1) , but does so in the range (-500, 500)?
Machine precision is set to float32.
I have used numerous learning rates, e.g. 0.00001, but this does not solve the problem. I have also trained for many epochs, e.g. 200, still no result.
As mentioned in the comments there is probably a problem with the scaling of the loss. Your current implementation of the MSE loss uses the mean of the squared differences (which is fairly small). Instead of using the mean, try using the sum of the squared differences over your image. The Keras VAE (https://keras.io/examples/variational_autoencoder/) does this by scaling the computed MSE loss with the original image size (in pytorch this can be specified directly https://github.com/pytorch/examples/blob/234bcff4a2d8480f156799e6b9baae06f7ddc96a/vae/main.py#L74).

Cosine similarity loss cause weight values to explode

Suppose my data consists of images of bubbles, and the labels are histograms describing the distribution of sizes, for example:
0-10mm 10%
10-20mm 30%
20-30mm 40%
30-40mm 20%
It is important to note that -
All size percentages sum to 100% (or 1.0 to be more precise).
I don't have annotated data, so i can't train an object detector and then just calculate the distribution by counting objects detected. However, i do have a feature extractor train on my data.
I implemented a simple CNN that consists of -
Resnet50 backbone.
Global max pooling.
1x1 convolution of 6 filters (6 distribution bins in labels).
After some experiments i came to the conclusion that softmax and cross entropy as loss function does not suit my problem and needs.
I thought that maybe a cosine similarity loss, with a light modification, may be a good alternative (normalization will be part of post process). This is the implementation:
def cosine_similarity_loss(logits, probs, weights=1.0, label_smoothing=0):
x1_val = tf.sqrt(tf.reduce_sum(tf.matmul(logits, tf.transpose(logits)), axis=1))
x2_val = tf.sqrt(tf.reduce_sum(tf.matmul(probs, tf.transpose(probs)), axis=1))
denom = tf.multiply(x1_val, x2_val)
num = tf.reduce_sum(tf.multiply(logits, probs), axis=1)
cosine_sim = tf.math.divide(num, denom)
cosine_dist = tf.math.reduce_mean(1 - tf.square(cosine_sim)) # Cosine Distance. Reduce mean for shape compatibility.
return cosine_dist
Loss is a summation of cosine distance and l2 regularization on weights. After first feed forward i got loss: 3.1267 and after second feed forward i got loss: 96003645440.0000 - meaning weights exploded (logits: [[-785595.812 -553858.625 -545579.625 -148547.875 -12845.8633 19871.1055]] while probs: [[0.466 0.297 0.19 0.047 0 0]]).
What could be the reason for such rapid and extreme increase?
My guess is cosine distance does an internal normalisation of the logits, removing the magnitude, and thus there is no gradient to propogate that opposes the values increasing. BTW weights is not used in your implementation.
What about just plain Euclidian distance using sigmoid instead of softmax in the last layer. Also, I would try adding another one or two dense layers (say size 512) between resnet50 and output dense layer.

Confused usage of dropout in mini-batch gradient descent

My question is in the end.
An example CNN trained with mini-batch GD and used the dropout in the last fully-connected layer (line 60) as
fc1 = tf.layers.dropout(fc1, rate=dropout, training=is_training)
At first I thought the tf.layers.dropout or tf.nn.dropout randomly sets neurons to zero in columns. But I recently found it's not the case. The below piece of code prints what the dropout does. I used the fc0 as a 4 sample x 10 feature matrix, and the fc as the dropped out version.
import tensorflow as tf
import numpy as np
fc0 = tf.random_normal([4, 10])
fc = tf.nn.dropout(fc0, 0.5)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
a, b = sess.run([fc0, fc])
np.savetxt("oo.txt", np.vstack((a, b)), fmt="%.2f", delimiter=",")
And in the output oo.txt (original matrix: line 1-4, dropped out matrix: line 5-8):
0.10,1.69,0.36,-0.53,0.89,0.71,-0.84,0.24,-0.72,-0.44
0.88,0.32,0.58,-0.18,1.57,0.04,0.58,-0.56,-0.66,0.59
-1.65,-1.68,-0.26,-0.09,-1.35,-0.21,1.78,-1.69,-0.47,1.26
-1.52,0.52,-0.99,0.35,0.90,1.17,-0.92,-0.68,-0.27,0.68
0.20,0.00,0.71,-0.00,0.00,0.00,-0.00,0.47,-0.00,-0.87
0.00,0.00,0.00,-0.00,3.15,0.07,1.16,-0.00,-1.32,0.00
-0.00,-3.36,-0.00,-0.17,-0.00,-0.42,3.57,-3.37,-0.00,2.53
-0.00,1.05,-1.99,0.00,1.80,0.00,-0.00,-0.00,-0.55,1.35
My understanding of the proper? dropout is, knocking out p% same units for each sample in a mini-batch or batch gradient descent phase, and the back-propagation updates the weights and biases of the "thinned network". However, in the implementation of the example, the neurons of each sample in one batch were randomly dropped out, as illustrated in the oo.txt line 5 to 8, and for each sample, the "thinned network" is different.
As a comparison, in a stochastic gradient descent case, samples are fed into the neural network one-by-one, and in each iteration, weights of each tf.layers.dropout introduced "thinned network" are updated.
My question is, in the mini-batch or batch training, shouldn't it be implemented to knock out same neurons for all samples in one batch? Maybe by applying one mask to all input batch samples at each iteration?
Something like:
# ones: a 1xN all 1s tensor
# mask: a 1xN 0-1 tensor, multiply fc1 by mask with broadcasting along the axis of samples
mask = tf.layers.dropout(ones, rate=dropout, training=is_training)
fc1 = tf.multiply(fc1, mask)
Now I'm thinking the dropout strategy in the example may be a weighted way of updating weights of a certain neuron, that if a neuron is kept in 1 out of 10 samples in a mini-batch, its weights will be updated by alpha * 1/10 * (y_k_hat-y_k) * x_k, compared with alpha * 1/10 * sum[(y_k_hat-y_k) * x_k] for weights of another neuron kept in all 10 samples?
the screenshot from here
Dropouts are commonly used to prevent overfitting. In this case it would be a huge weight applied to one of the neurons. By randomly making it 0 from time to time, you force the network to use more neurons in determining the outcome. For this to work well you should drop different neurons for each example so that the gradient you compute is more similar to the one you would get without the dropout.
If you were to drop the same neurons for each example in the batch, my guess is that you will have a less stable gradient (might not matter for your application).
In addition dropout up-scales the rest of the values to keep the average activation at about the same level. Without it the network would learn wrong biases or would over-saturate when you turn dropout off.
If you still want the same neurons to be dropped in the batch then apply dropout to a all 1 tensor of shape (1, num_neurons) and then multiply it with the activations.
When using dropout, you are effectively trying to estimate the average performance of the network for a randomly chosen dropout mask, using Monte-Carlo sampling (by differentiation under the integral sign, the average gradient is equal to the gradient of the average). By fixing a dropout mask for each mini-batch, you are just introducing correlation between successive gradient estimates, which increases the variance and leads to slower training.
Imagine using a different dropout-mask for each image in the mini-batch, but forming the mini-batch from k copies of the same image; it's obvious that this would be a complete waste of effort!