Which one is the correct YOLOv4 total loss function formula? - object-detection

I couldn't find the total loss function in the main YOLOv4 paper. However, I found two differing formulas from two different papers (difference is highlighted/marked in the images below). Which formula is the correct default total loss formula for YOLOv4?
FORMULA 1 - Source
FORMULA 2 - Source

Formula 1 is the correct one: the highlighted term is the confidence loss for the case where an object is detected, and the confidence loss is based on the cross-entropy error.
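In YOLO-family detectors the objectness/confidence term is conventionally a binary cross-entropy between the target objectness and the predicted confidence. A minimal sketch of that single-cell term (the function name and the example probabilities are made up for illustration, not taken from either paper):

```python
import math

def bce(target, pred):
    """Binary cross-entropy for a single objectness/confidence score."""
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))

# Cell containing an object (target = 1), predicted confidence 0.9:
loss_obj = bce(1.0, 0.9)
# Empty cell (target = 0), predicted confidence 0.1:
loss_noobj = bce(0.0, 0.1)
```

In the full loss these per-cell terms are summed over all grid cells and anchors, with the object/no-object cases usually weighted differently.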

Related

MSE loss function calculation

I trained a seq2seq network using input samples with a shape of [30,26] and an output shape of [1,7], with MSE as the loss function (model.compile(loss="mse", optimizer="adam")). However, when I compare history.history['loss'] to
keras_error = tf.keras.losses.MSE(predictions_train, data_train) (which returns an array of errors that I then averaged), the results differ by about 0.2. Insights on how the MSE loss is calculated for an output sequence like this would be greatly appreciated!
The MSE loss is calculated the same way for a sequence: you have 7 values in both the prediction and the target, the element-wise differences are squared and then averaged over those 7 values, giving one error per sample. tf.keras.losses.MSE returns that per-sample array, which you then average over the batch. The discrepancy you see is expected for another reason: history.history['loss'] is the running average over all batches of an epoch, computed while the weights were still changing, whereas your post-training call uses the final weights. But in the end what matters is the performance of the network; is that being achieved?
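To make the reduction concrete, here is the computation in plain NumPy (the data is made up; only the shapes match the question):

```python
import numpy as np

# Three hypothetical samples of length 7, matching the [1,7] output shape
y_true = np.array([[1., 2., 3., 4., 5., 6., 7.],
                   [0., 0., 0., 0., 0., 0., 0.],
                   [2., 2., 2., 2., 2., 2., 2.]])
y_pred = y_true + 1.0  # every element off by exactly 1

# What tf.keras.losses.MSE computes: one error per sample,
# averaged over the last axis only
per_sample = np.mean((y_true - y_pred) ** 2, axis=-1)  # shape (3,)

# The scalar Keras reports as the batch loss is the mean of those
batch_loss = per_sample.mean()
```

Since every element is off by exactly 1, each per-sample error and the batch loss are both 1.0; forgetting the second averaging step is a common source of mismatches.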

Dynamic Unrolling of Simple Neural Nets using Keras

I am trying to replicate a neural net to compute the energy of molecules (image given below). The energy is the sum of bonded/non-bonded interactions and angle/dihedral strains. I have 4 separate neural networks that compute the energy due to each of these, and the total energy is the sum of the energies from each interaction; there may be hundreds of them. In my data set, I only know the total energy.
If my total energy is computed using multiple forward passes (an unknown number, decided by the molecule) on different neural networks, how do I get Keras to backpropagate through the dynamically constructed sum? A non-Keras TensorFlow method would work too. (I would have just summed the outputs of the neural nets if I knew beforehand how many bonds there would be; the issue is having to unfold copies of the neural net at runtime.)
This is just an example image given in the paper:
In summary, the question is: "How do I implement dynamic unrolling and feed it to a sum in Keras?".
Keras layers can be given a shape of (None, actual-shape...) if one of the dimensions is not known. You can then use a TensorFlow operation to sum over axis 0 with tf.reduce_sum(layer, axis=0), so dynamic layer sizes are not hard to achieve in Keras.
However, if the input shapes pose more of a constraint, you can pass in a full matrix padded with dummy 0 values together with a mask matrix, and use tf.multiply to reject the dummy values; backpropagation then works automatically.
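The padding-plus-mask trick can be shown with plain NumPy (the bond count, energies, and padded length below are invented for illustration):

```python
import numpy as np

# Suppose a molecule has 3 real bond-energy outputs but the batch is
# padded to length 5; the last two entries are dummy values.
energies = np.array([1.5, 2.0, 0.5, 9.9, 9.9])  # 9.9s are padding
mask     = np.array([1.0, 1.0, 1.0, 0.0, 0.0])  # zeros reject padding

# The tf.multiply + tf.reduce_sum combination from the answer:
total_energy = np.sum(energies * mask, axis=0)
```

Because the mask enters the sum multiplicatively, gradients through the padded slots are exactly zero, so the dummy values never influence training.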

Can I have more than one 1 in the target one-hot encoded vector in categorical cross entropy?

If I prepare target labels such as [0,0,1,0,1], which contain the number 1 more than once, will categorical cross-entropy work fine, or is there a better way to do this? Please help.
Yes, you can; it then becomes a multi-label classification problem. In that setting the usual choice is a sigmoid output with binary cross-entropy rather than a softmax, since the class probabilities no longer need to sum to 1.
The cross-entropy would calculate something like this for [0,0,1,0,1]:
loss = -[0*log(p0)+0*log(p1)+1*log(p2)+0*log(p3)+1*log(p4)]
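Evaluating that expression with some made-up predicted probabilities p0..p4 (the values are arbitrary, chosen only to illustrate the formula):

```python
import numpy as np

# Multi-hot target with two positive labels, as in the question
target = np.array([0., 0., 1., 0., 1.])
probs  = np.array([0.05, 0.05, 0.6, 0.1, 0.2])  # hypothetical p0..p4

# Only the positions where target == 1 contribute to the loss:
loss = -np.sum(target * np.log(probs))  # = -(log(0.6) + log(0.2))
```

The zero entries of the target zero out their log terms, so the loss only rewards probability mass on the two positive classes.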

Tensorflow bounded regression vs classification

As part of my masters thesis I have been tasked with predicting a label integer (0-255) which is a binned representation of an angle. The feature columns are also integers, in the range (0-255).
So far I have used the custom Tensorflow layers estimator, implementing a 256 output classifier which performs well. However, my issue with the classification approach I am using is the following:
My classification model treats predicting a 3 instead of a 28 as exactly as good/bad as predicting a 27 instead of a 28
The numerical interval / ordinal nature of my data (not sure which) leads me to believe that if I used regression I would achieve results with less drastically incorrect predictions or outliers.
My goal:
to reduce the number of drastically incorrect predicted outliers
My questions:
Is regression the better approach, or can I improve my classification to include an ordinal/interval relationship between my labels?
If I choose regression, is there a way to bound my predicted output between 0 and 255? (I know I will have to round predicted float values.)
Thanks in advance. Any other comments, suggestions or ideas to help me to best tackle the problem are also very helpful.
If I made any incorrect assumptions or mistake in my interpretation of the problem feel free to correct me.
Question 1: Regression is the simpler approach; however, you can also use classification and modify the loss function so that misclassifications that are "close" to the true class incur a lower loss.
Question 2: The TensorFlow operation for bounding your prediction is tf.clip_by_value. Are you mapping all 360 degrees to [0, 255]? In that case you will want to consider the boundary cases: if your estimator yields -4 and the true value is 251, they represent nearly the same angle, so the loss should be small.
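Both ideas can be sketched in a few lines of NumPy (the example predictions are invented; tf.clip_by_value behaves like np.clip here, and the circular-distance helper is a hypothetical name, not a TensorFlow function):

```python
import numpy as np

# Bounding raw regression outputs into the label range,
# analogous to tf.clip_by_value(pred, 0, 255):
pred = np.clip(np.array([-4.0, 300.0, 128.0]), 0.0, 255.0)

# If the 256 bins wrap around (360 degrees mapped to 0-255), a
# circular distance avoids penalising predictions across the boundary:
def circular_dist(a, b, n_bins=256):
    d = np.abs(a - b) % n_bins
    return np.minimum(d, n_bins - d)

# 252 vs 251 is 1 bin apart, and so is 0 vs 255 once wraparound counts
d = circular_dist(np.array([252.0, 0.0]), np.array([251.0, 255.0]))
```

Using a circular distance inside the loss (instead of plain absolute error) is one way to make the boundary cases behave correctly.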

dropout with relu activations

I am trying to implement a neural network with dropout in tensorflow.
tf.layers.dropout(inputs, rate, training)
From the documentation: "Dropout consists in randomly setting a fraction rate of input units to 0 at each update during training time, which helps prevent overfitting. The units that are kept are scaled by 1 / (1 - rate), so that their sum is unchanged at training time and inference time."
Now I understand this behavior if dropout is applied on top of sigmoid activations, which are strictly positive: if half of the input units are zeroed, the sum of all the outputs will also roughly be halved, so it makes sense to scale them by a factor of 2 in order to regain some kind of consistency before the next layer.
Now what if one uses the tanh activation which is centered around zero? The reasoning above no longer holds true so is it still valid to scale the output of dropout by the mentioned factor? Is there a way to prevent tensorflow dropout from scaling the outputs?
Thanks in advance
If you have a set of inputs to a node and a set of weights, their weighted sum is a value, S. You can define another random variable by selecting a random fraction f of the original random variables. The weighted sum using the same weights of the random variable defined in this way is S * f. From this, you can see the argument for rescaling is precise if the objective is that the mean of the sum remains the same with and without scaling. This would be true when the activation function is linear in the range of the weighted sums of subsets, and approximately true if the activation function is approximately linear in the range of the weighted sum of subsets.
After passing the linear combination through any non-linear activation function, it is no longer true that rescaling exactly preserves the expected mean. However, if the contribution to a node is not dominated by a small number of nodes, the variance in the sum of a randomly selected subset of a chosen, fairly large size will be relatively small, and if the activation function is approximately linear fairly near the output value, rescaling will work well to produce an output with approximately the same mean. E.g. the logistic and tanh functions are approximately linear over any small region. Note that the range of the function is irrelevant; only the differences between its values matter.
With relu activation, if the original weighted sum is close enough to zero for the weighted sum of subsets to be on both sides of zero, a non-differentiable point in the activation function, rescaling won't work so well, but this is a relatively rare situation and limited to outputs that are small, so may not be a big problem.
The main observations here are that rescaling works best with large numbers of nodes making significant contributions, and relies on local approximate linearity of activation functions.
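The expectation-preserving rescaling described above can be checked numerically. A quick NumPy sketch of inverted dropout (the activation values and rate are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, size=100_000)  # made-up pre-dropout activations
rate = 0.5

# Inverted dropout: zero a fraction `rate` of units, then scale the
# survivors by 1/(1 - rate) so the expected value is unchanged
keep = rng.random(x.shape) >= rate
dropped = np.where(keep, x / (1.0 - rate), 0.0)

# The two means agree up to sampling noise
print(x.mean(), dropped.mean())
```

Note the rescaling preserves the mean regardless of the sign of the activations, which is why tanh (centered around zero) is not a special case for the expectation argument; only the local-linearity argument in the answer above is affected.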
The point of setting a node's output to zero is so that the neuron has no effect on the neurons it feeds. This creates sparsity and thereby helps reduce overfitting. When using sigmoid or tanh, the value is still set to zero.
I think your line of reasoning here is incorrect: think in terms of each node's contribution rather than the sum.