MSE loss function calculation - tensorflow

I trained a seq2seq network using input samples with a shape of [30, 26] and an output shape of [1, 7], with MSE as the loss function (model.compile(loss="mse", optimizer="adam")). However, when I compare history.history['loss'] to
keras_error = tf.keras.losses.MSE(predictions_train, data_train) (which returns an array of errors that I then averaged), the results differ by about 0.2. Any insight into how the MSE loss is calculated for an output sequence like this would be greatly appreciated!

The MSE loss is calculated in the same way regardless of the output shape. You have 7 values in both y_true and y_pred: the element-wise differences are squared and then averaged over those 7 values. One reason your numbers can differ is that model.compile uses tf.keras.losses.MeanSquaredError while you are calling a different function with a different reduction; another is that history.history['loss'] is averaged over batches computed while the weights were still being updated during the epoch, so it will not match a loss recomputed with the final weights. But the end goal is the performance of the network; is that being met?
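For concreteness, a minimal sketch of the comparison (names taken from the question; note that tf.keras.losses.MSE expects the arguments in the order (y_true, y_pred), although for MSE swapping them does not change the result):

import tensorflow as tf

# Per-sample MSE: the last axis (the 7 outputs) is averaged away, so for
# targets of shape (n, 1, 7) this returns a tensor of shape (n, 1).
per_sample = tf.keras.losses.MSE(data_train, predictions_train)

# Averaging the per-sample errors gives one scalar, comparable to what
# model.evaluate(...) reports with the final weights.
manual_mse = tf.reduce_mean(per_sample)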

Related

My loss function starts to diverge in TensorFlow 1. What can I do?

First of all, I am new to DNNs. I am training a network using TensorFlow 1. I used TF1 because the paper I am using as a benchmark for my code is written in TF1; although I tried to implement the same algorithm in TF2, I could not achieve the same performance as the original version.
My problem is that my loss function, which is the MSE, normally goes down, but after a random number of epochs it diverges to NaN. It is really strange: although the result diverges, if I train the network again from the last set of weights saved before the divergence, the MSE loss continues to decrease until a new divergence happens.
I read that this type of error is related to numerical instability. In the algorithm I am implementing, the network is fed several times before the loss function is computed: from the network output a new input is computed, which is used to feed the network again. This process is repeated several times, and after that the loss function is computed. It is similar to the following code (which I have simplified, since the original is quite long):
input = ...  # initial network input

for t in range(max_iter):
    # Three dense layers (A1..A4, b1..b4 are weight/bias variables),
    # each followed by batch normalization
    x1 = tf.nn.relu(input @ A1 + b1)
    x1 = BatchNormalization()(x1)
    x2 = tf.nn.relu(x1 @ A2 + b2)
    x2 = BatchNormalization()(x2)
    x3 = tf.nn.relu(x2 @ A3 + b3)
    x3 = BatchNormalization()(x3)
    output = x3 @ A4 + b4
    # The network output is used to build the next input
    input = recompute_input(output)

compute_loss(input, true_values)
I suppose that during this process some inputs to the network break the gradients. I have read about and tried the following solutions:
Standardize the network input
Change the learning rate
Change the number of batches
I have implemented all of the above, but I have not solved the problem. Currently I am using the Adam optimizer with a learning rate of 0.0001, and I keep having the same problem: the algorithm randomly diverges. I tried changing the learning rate, but it ends in divergence again, so I do not know how to proceed.
How do you recommend I proceed? In any case, is there some way to check the values of the tensors inside the "run" in order to analyze when the result diverges?
Thank you in advance!
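Not from the original thread, but one common way to localize this kind of divergence in TF1 is to wrap suspect tensors in tf.debugging.check_numerics, which makes the session fail loudly at the first NaN/Inf, and to clip the gradients before applying them. A rough sketch (variable names are placeholders):

import tensorflow as tf

# Fail fast with a descriptive error as soon as the loss turns NaN/Inf.
loss = tf.debugging.check_numerics(loss, "loss diverged")

# Clip gradients before applying them; exploding gradients are a common
# cause of sudden NaN losses in unrolled computations like the one above.
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)
grads_and_vars = optimizer.compute_gradients(loss)
clipped = [(tf.clip_by_norm(g, 5.0), v)
           for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped)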

What shape does my loss tensor need to be in tensorflow 2 using Keras API?

I have been playing around with custom loss functions for a while with some success, but I'm struggling with a new loss function, and I wonder if it might be due to the loss result tensor's shape.
My y_true and y_pred tensors have shape == (100, 216, 563). Due to the nature of the data and the calculations I'm performing in my loss function, it makes perfect sense to output a loss tensor of shape == (100, 563) because the second dimension gets reduced away with a reduce_prod() operation.
However, if I use this loss function alone, the loss value steadily increases instead of decreasing... I've not seen this before. If it were all over the place, I'd think the loss function was just a bad idea or my maths was wrong somewhere, but as far as I can tell the maths is right.
Will this weird shape with a missing middle dimension throw off the gradient calculations? I've already tried using keepdims=True in my reduce_foo() methods, but this makes no difference to the increasing loss value (and the results still have a different shape, shape == (100, 1, 563)).
Looking through the TensorFlow docs, I can find examples of both a loss whose shape matches y_pred and y_true and a loss that is a single scalar value. Are there any rules stated anywhere as to what shape the output loss should be, or can anyone give me insights that might help me understand why the loss should be a specific shape (if that is even my problem)?
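There is no single documented shape requirement, but the usual behaviour can be sketched (shapes taken from the question): Keras accepts a loss that returns anything from per-element values down to a scalar, and averages whatever comes back into a single scalar before computing gradients, so a (100, 563) result should not by itself break the gradients:

import tensorflow as tf

# Hypothetical custom loss mirroring the shapes in the question:
# y_true / y_pred are (batch, 216, 563); reducing axis=1 leaves (batch, 563).
def product_loss(y_true, y_pred):
    return tf.reduce_prod(tf.square(y_pred - y_true), axis=1)

# Keras then reduces the returned tensor to a scalar, roughly equivalent to:
# scalar_loss = tf.reduce_mean(product_loss(y_true, y_pred))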

An efficient way to calculate the loss function batchwise?

I am using autoencoders to do anomaly detection. I have finished training my model, and now I want to calculate the reconstruction loss for each entry in the dataset, so that I can flag data points with a high reconstruction loss as anomalies.
My current code to calculate the reconstruction loss is below.
But this is really slow. By my estimation, it should take 5 hours to go through the dataset, whereas training one epoch takes approximately 55 minutes.
I feel that the convert-to-tensor operation is bottlenecking the code, but I can't find a better way to do it.
I've tried changing the batch size, but it does not make much of a difference. I have to use the convert_to_tensor part because K.eval throws an error if I do it normally.
for i in range(0, encoded_dataset.shape[0], batch_size):
    y_true = tf.convert_to_tensor(encoded_dataset[i:i + batch_size].values,
                                  np.float32)
    y_pred = tf.convert_to_tensor(ae1.predict(encoded_dataset[i:i + batch_size].values),
                                  np.float32)
    # Append the batch losses (numpy array) to the list
    reconstruction_loss_transaction.append(K.eval(loss_function(y_true, y_pred)))
I was able to train at 55 minutes per epoch, so I feel prediction should not take 5 hours. encoded_dataset is a variable that holds the entire dataset in main memory as a data frame. I am using an Azure VM instance.
K.eval(loss_function(y_true, y_pred)) is there to find the loss for each row of the batch, so y_true will be of size (batch_size, 2000), and so will y_pred. K.eval(loss_function(y_true, y_pred)) gives me an output of (batch_size, 1), evaluating binary cross-entropy on each row of y_true and y_pred.
Moved from comments:
My suspicion is that ae1.predict and K.eval(loss_function) are behaving in unexpected ways. ae1.predict should normally be used to output the loss function value as well as y_pred: when you create the model, specify that the loss value is another output (you can have a list of multiple outputs), then just call predict once here to get both y_pred and the loss value in one call.
But I want the loss for each row. Won't the loss returned by the predict method be the mean loss for the entire batch?
The answer depends on how the loss function is implemented. Both ways produce perfectly valid and identical results in TF under the hood: you can average the loss over the batch before taking the gradient with respect to it, or take the gradient with respect to a vector of losses. The gradient operation in TF performs the averaging of the losses for you if you use the latter approach (see the SO articles on taking per-sample gradients; that is actually hard to do).
If Keras implements the loss with reduce_mean built into it, you can just define your own loss. If you are using squared loss, replace 'mean_squared_error' with lambda y_true, y_pred: tf.square(y_pred - y_true). That produces squared error instead of MSE (no difference to the gradient), but look here for the variant that includes the mean.
In any case this produces a per-sample loss as long as you don't use tf.reduce_mean, which is purely optional in the loss. Another option is simply to compute the loss separately from what you optimize for and make it an output of the model; that is also perfectly valid.
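As a concrete illustration of the approach above, a hedged sketch assuming TF2 eager execution and the ae1 and encoded_dataset objects from the question: the Python loop and the per-batch K.eval round-trips can be replaced with a single predict call plus one vectorized loss call, since the built-in binary cross-entropy reduces only the last axis and therefore already returns one value per row:

import numpy as np
import tensorflow as tf

x = encoded_dataset.values.astype(np.float32)   # (n_rows, 2000)
y_pred = ae1.predict(x, batch_size=batch_size)  # one pass over the whole dataset

# Reduces only the last axis, so the result has shape (n_rows,):
# one reconstruction loss per row, ready for anomaly thresholding.
per_row_loss = tf.keras.losses.binary_crossentropy(x, y_pred).numpy()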

How does Keras compute its loss function for matrix-valued outputs?

I am trying to compute the next several video frames given a collection of previous frames, i.e. I have a deep neural network that directly outputs a small video clip of dimension (samples, frames, m, n, channels). I train my neural network using Keras' mean squared error loss function.
Keras' implementation of the mean squared error loss function is
K.mean(K.square(y_pred - y_true), axis=-1)
In my case the computed loss value will therefore still be a rank-4 tensor (which I checked is indeed true).
As the loss function should be scalar, I imagined this would cause a problem, but surprisingly Keras issues no warning and I do get meaningful results.
Any clue as to how Keras does its back-propagation in this case? Is there an internal conversion to a scalar loss that Keras performs that I am not aware of?
Thank you!
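For reference, the behaviour can be reproduced outside of training (a sketch of the effective computation, not Keras internals verbatim): the built-in MSE averages only the last axis, and the training loop then averages every remaining entry, together with any sample weights, into a single scalar before back-propagation:

import tensorflow as tf

y_true = tf.random.normal((2, 4, 8, 8, 3))  # (samples, frames, m, n, channels)
y_pred = tf.random.normal((2, 4, 8, 8, 3))

per_element = tf.keras.losses.mse(y_true, y_pred)  # rank 4: (2, 4, 8, 8)
scalar_loss = tf.reduce_mean(per_element)          # what actually gets differentiated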

CNN Loss stuck at 2.302 (ln(10))

I am trying to build a neural net to solve the CIFAR-10 dataset, but I am facing a very odd problem: I have tried over 6 different CNN architectures, with many different CNN hyperparameters and fully connected layer sizes, yet all of them fail with a loss of 2.302 and a corresponding accuracy of 0.0625. Why does this happen? What property of a CNN or neural net causes it? I have also tried dropout, L2 regularization, different kernel sizes, and different padding in the conv and max-pool layers. I don't understand why the loss gets stuck at such an odd number.
I am implementing this using TensorFlow, and I have tried a softmax layer + cross_entropy_loss as well as no softmax layer + sparse_cross_entropy_loss. Is this a plateau that the neural net's loss function is stuck on?
It seems like you accidentally applied a non-linearity/activation function to the last layer of your network. Keep in mind that cross-entropy operates on values between 0 and 1. Since your output is already forced into this range by the softmax applied just before the cross-entropy is computed, your last layer should have a linear activation, i.e. simply don't add one.
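In TensorFlow terms, that means feeding raw logits, with no activation on the final layer, into the cross-entropy op. A minimal sketch, with features and labels standing in for your own tensors:

import tensorflow as tf

logits = tf.keras.layers.Dense(10)(features)  # linear output layer, no softmax
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                   logits=logits))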
By the way, the value 2.302 is no coincidence: it is the softmax loss -ln(0.1) that results when all 10 classes (CIFAR-10) initially receive the same diffuse probability of 0.1. Check out the explanation by Andrej Karpathy:
http://cs231n.github.io/neural-networks-3/
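You can verify the number directly:

import math

# 10 equally likely classes -> the true class gets probability 0.1,
# so the expected initial cross-entropy is -ln(0.1):
print(-math.log(0.1))  # 2.302585..., exactly the value training is stuck at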