Is the loss printed by tensorflow a batch/sample wise loss or is it a running average loss? - tensorflow

When I train a TensorFlow model, it usually prints a line like the one below at each iteration:
INFO:tensorflow:loss = 1.9433185, step = 11 (0.300 sec)
Is the loss being printed the loss of the batch that the model has just seen, or is it a running average over all the previous batches of the training?
If I use a batch size of 1, i.e. only one training sample in each batch, will the printed loss be the loss of each sample separately, or will it still be a running average loss?

The loss reported in the progress bar of Keras/TensorFlow is always a running mean over the batches seen so far; it is not a per-batch value.
I do not think there is a way to see the per-batch values during training.
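A minimal sketch of how to look at what fit() reports after each batch, using a toy model and random data (both made up for illustration); in recent tf.keras versions the 'loss' passed to batch-level callbacks is still this running mean, not the individual batch loss.
import numpy as np
import tensorflow as tf

class LossLogger(tf.keras.callbacks.Callback):
    def on_train_batch_end(self, batch, logs=None):
        # 'loss' here is the running mean over the batches seen so far
        # in the current epoch, matching the progress-bar value.
        print(f"batch {batch}: reported loss = {logs['loss']:.4f}")

x = np.random.rand(256, 10).astype("float32")
y = np.random.rand(256, 1).astype("float32")
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
model.compile(optimizer="sgd", loss="mse")
model.fit(x, y, batch_size=32, epochs=1, callbacks=[LossLogger()], verbose=0)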

Related

What to do when accuracy increasing but loss is also increasing on validation data?

I'm currently working on a multi-class classification problem which is highly imbalanced. I want to save my model weights for the best epoch, but I'm confused about which metric I should choose.
Here's my training progress bar:
I am using the ModelCheckpoint callback in tf.keras and monitoring val_loss as the metric to save the best model weights.
As you can see in the image,
at the 8th epoch I got val_acc = 0.9845 but val_loss = 0.629, and precision and recall are also high here.
But at the 3rd epoch I got val_acc = 0.9840 but val_loss = 0.590.
I understand the difference is not huge, but in such cases, what's the ideal metric to trust on an imbalanced dataset?
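For reference, a minimal sketch of the checkpointing setup described above, assuming a compiled tf.keras model; the file path and the x/y variable names are placeholders.
import tensorflow as tf

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath="best_weights.h5",     # placeholder path
    monitor="val_loss",             # save whenever the validation loss improves
    mode="min",
    save_best_only=True,
    save_weights_only=True,
)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=20, callbacks=[checkpoint])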
The most important factors to watch are the validation and training error.
If the validation loss (error) starts to increase, that means overfitting. Set the number of epochs as high as you can and terminate training based on the error rates: as long as the validation loss keeps dropping, training should continue, until the model starts to converge at some epoch. Ideally it should converge to a low val_loss.
Just bear in mind that an epoch is one learning cycle in which the learner sees the whole training data set. If you have two batches, the learner needs to go through two iterations for one epoch.
This link can be helpful.
You can divide the data into 3 sets: training, validation and evaluation. Train each network for enough epochs that the training Mean Squared Error settles into a minimum.
The training process uses the training data set and should be executed epoch by epoch; then calculate the Mean Squared Error of the network on the validation set at each epoch. The network from the epoch with the minimum validation MSE is selected for the evaluation process.
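A minimal sketch of that procedure with a toy regression model and made-up random data (the split sizes, architecture and epoch count are arbitrary): train with a held-out validation set, then pick the epoch with the lowest validation MSE.
import numpy as np
import tensorflow as tf

x = np.random.rand(1000, 20).astype("float32")
y = np.random.rand(1000, 1).astype("float32")
x_train, x_val, x_eval = x[:600], x[600:800], x[800:]   # training / validation / evaluation
y_train, y_val, y_eval = y[:600], y[600:800], y[800:]

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                    epochs=30, verbose=0)

best_epoch = int(np.argmin(history.history["val_loss"]))
print("epoch with minimum validation MSE:", best_epoch)
# Saving weights every epoch (e.g. with a checkpoint callback) lets you restore
# exactly that network for the evaluation set.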
This can happen for several reasons. Assuming you have a proper separation of training, test and validation sets and have preprocessed the data (e.g. min-max scaling, handling missing values), you can do the following.
First, run the model for several epochs and plot the validation loss.
If the loss first decreases and then, after a certain point, starts increasing, so the graph is U-shaped, you can use early stopping.
In the other scenario, when the loss is steadily increasing, early stopping won't work. In this case, add dropout layers of 0.2-0.3 between the major layers. This introduces randomness into the layers and stops the model from memorising.
Once you add dropout, your model may suddenly start to behave strangely. Tweak the activation functions and the number of output nodes or Dense layers, and it will eventually come right.
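A minimal sketch of those two suggestions, with made-up layer sizes, dropout rates and patience: dropout between the major Dense layers plus early stopping on the validation loss (x_train, y_train, x_val, y_val are placeholders).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(100,)),
    tf.keras.layers.Dropout(0.3),                 # dropout between the major layers
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])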

An Efficient way to Calculate loss function batchwise?

I am using autoencoders to do anomaly detection. I have finished training my model and now I want to calculate the reconstruction loss for each entry in the dataset, so that I can flag data points with a high reconstruction loss as anomalies.
This is my current code to calculate the reconstruction loss
But this is really slow. By my estimate, it would take 5 hours to go through the dataset, whereas training one epoch takes approximately 55 minutes.
I feel that the conversion to tensor operations is bottlenecking the code, but I can't find a better way to do it.
I've tried changing the batch sizes, but it does not make much of a difference. I have to use the convert_to_tensor part because K.eval throws an error if I do it normally.
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K

for i in range(0, encoded_dataset.shape[0], batch_size):
    y_true = tf.convert_to_tensor(encoded_dataset[i:i + batch_size].values,
                                  np.float32)
    y_pred = tf.convert_to_tensor(ae1.predict(encoded_dataset[i:i + batch_size].values),
                                  np.float32)
    # Append the batch losses (numpy array) to the list
    reconstruction_loss_transaction.append(K.eval(loss_function(y_true, y_pred)))
I was able to train at 55 minutes per epoch, so I feel prediction should not take 5 hours for one pass over the dataset. encoded_dataset is a variable that holds the entire dataset in main memory as a data frame.
I am using an Azure VM instance.
K.eval(loss_function(y_true,y_pred) is to find the loss for each row of the batch
So y_true will be of size (batch_size,2000) and so will y_pred
K.eval(loss_function(y_true,y_pred) will give me an output of
(batch_size,1) evaluating binary cross entropy on each row of y
_true and y_pred
Moved from comments:
My suspicion is that ae1.predict and K.eval(loss_function) are behaving in unexpected ways. ae1.predict should normally be used to output the loss function value as well as y_pred. When you create the model, specify that the loss value is another output (you can have a list of multiple outputs), then just call predict once to get both y_pred and the loss value in a single call.
But I want the loss for each row. Won't the loss returned by the predict method be the mean loss for the entire batch?
The answer depends on how the loss function is implemented. Both ways produce perfectly valid and identical results in TF under the hood. You could average the loss over the batch before taking the gradient w.r.t. the loss, or take the gradient w.r.t. a vector of losses. The gradient operation in TF will perform the reduction of the losses for you if you use the latter approach (see SO articles on taking the per-sample gradient; it's actually hard to do).
If Keras implements the loss with reduce_mean built into the loss, you can just define your own loss. If you're using square loss, replace 'mean_squared_error' with lambda y_true, y_pred: tf.square(y_pred - y_true). That would produce the squared error instead of the MSE (no difference to the gradient), but look here for the variant including the mean.
In any case this produces a per-sample loss as long as you don't use tf.reduce_mean, which is purely optional in the loss. Another option is to simply compute the loss separately from what you optimize for and make that an output of the model, which is also perfectly valid.
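Along the lines of that last suggestion, a sketch of computing the per-row loss in one vectorized pass, reusing ae1 and encoded_dataset from the question (the intermediate variable names are mine); it assumes TF 2.x eager execution, and relies on tf.keras.losses.binary_crossentropy reducing only over the feature axis, so it returns one value per row.
import numpy as np
import tensorflow as tf

x = encoded_dataset.values.astype(np.float32)      # shape (n_rows, 2000)
recon = ae1.predict(x, batch_size=256)             # a single predict call over everything
per_row_loss = tf.keras.losses.binary_crossentropy(x, recon).numpy()   # shape (n_rows,)
reconstruction_loss_transaction = per_row_loss.tolist()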

Could tensorflow optimize each element's loss in a batch separately, instead of optimizing the whole average loss?

How could tensorflow optimize the batch's element losses individually instead of optimizing the batch loss?
When optimizing the loss for each batch, the common way is to sum or average all the element losses in the batch into a single batch loss, and then optimize this batch loss. In my case, I would like to optimize each element's loss individually, instead of reducing them together into the batch loss.
For example, in the following code:
losses = tf.nn.nce_loss(<my batch inputs here>)
loss = tf.reduce_mean(losses)
optim = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(loss)
How could I skip loss = tf.reduce_mean(losses) and minimize the tensor losses directly? (In this way, the mini-batch effectively reduces to the situation where the batch size is 1.)
I have tried feeding the losses to minimize directly, as:
optim = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(losses) # instead of loss
but I am not sure how the minimization will work. When I run it in the session, the losses tend to explode to nan.
So is it possible to achieve the above aim in tensorflow?
The difference between computation of gradients of tf.reduce_mean(losses) and gradients of losses is that for losses tensor you will get the SUM of gradients (the sum over gradients over each sample in a batch), while for tf.reduce_mean(losses) you will get the MEAN of the gradients (mean of gradients over the samples in a batch). That's why you start to get NaN values - the sum of gradients becomes a very large number as the size of the batch increases.
If you are going to optimize a tensor of losses instead of the reduced mean loss, you can get the exact equivalent by dividing your learning rate by the batch size.
To optimize individually for each sample, just feed one sample per batch.
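A toy sketch of that equivalence (a single scalar weight and made-up targets, written in TF1-style graph mode to match the question): minimizing the unreduced losses tensor sums the per-sample gradients, so dividing the learning rate by the batch size reproduces the reduce_mean update exactly.
import numpy as np
import tensorflow as tf
tf.compat.v1.disable_eager_execution()

batch_size = 4
targets = tf.constant(np.arange(1.0, batch_size + 1, dtype=np.float32))
w = tf.Variable(0.0)
losses = tf.square(w - targets)          # per-sample losses, shape (batch_size,)
mean_loss = tf.reduce_mean(losses)

lr = 0.1
step_vec = tf.compat.v1.train.GradientDescentOptimizer(lr / batch_size).minimize(losses)
step_mean = tf.compat.v1.train.GradientDescentOptimizer(lr).minimize(mean_loss)

with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    sess.run(step_vec)
    print("after one step on the losses vector:", sess.run(w))   # 0.5
    sess.run(w.assign(0.0))
    sess.run(step_mean)
    print("after one step on the mean loss:   ", sess.run(w))    # 0.5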

Why does the accuracy drop to zero in each epoch, while training convlstm layers in keras?

I am trying to use ConvLSTM layers in Keras 2 to train an action recognition model. The model has 3 ConvLSTM layers and 2 Fully Connected ones.
At each and every epoch, the accuracy for the first batch (usually the first few batches) is zero, and then it increases to some value higher than in the previous epoch. For example, the first epoch finishes at 0.3 and the next would finish at 0.4, and so on.
My question is why does it get back to zero at each epoch?
p.s.
The ConvLSTM is stateless.
The model is compiled with SGD(lr=0.001, decay=1e-6, momentum=0.9, nesterov=True); for some reason it does not converge using Adam.
So, in order to understand why something like this is happening, you need to understand how Keras computes accuracy during batch computation:
Before each batch, the number of correctly classified examples seen so far is stored.
After each batch, the updated number of correctly classified examples is stored, and the accuracy that gets printed is this number divided by the number of examples used in training so far. These running counts start from zero at the beginning of every epoch.
As your accuracy is pretty low, it's highly probable that in the first few batches none of the examples will be classified properly, especially when you have a small batch. This makes the accuracy 0 at the beginning of each epoch.
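A tiny sketch with made-up numbers of that bookkeeping: the counters restart at zero for every epoch, so the first reported values of a new epoch can be near zero even if the model ended the previous epoch much higher.
batches = [(0, 32), (2, 32), (7, 32), (15, 32)]   # (correctly classified, batch size) over one epoch
correct_so_far, seen_so_far = 0, 0                # reset at the start of every epoch
for i, (correct, size) in enumerate(batches):
    correct_so_far += correct
    seen_so_far += size
    print(f"batch {i}: running accuracy = {correct_so_far / seen_so_far:.3f}")
# batch 0: 0.000, batch 1: 0.031, batch 2: 0.094, batch 3: 0.188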

Tensorflow: loss decreasing, but accuracy stable

My team is training a CNN in Tensorflow for binary classification of damaged/acceptable parts. We created our code by modifying the cifar10 example code. In my prior experience with Neural Networks, I always trained until the loss was very close to 0 (well below 1). However, we are now evaluating our model with a validation set during training (on a separate GPU), and it seems like the precision stopped increasing after about 6.7k steps, while the loss is still dropping steadily after over 40k steps. Is this due to overfitting? Should we expect to see another spike in accuracy once the loss is very close to zero? The current max accuracy is not acceptable. Should we kill it and keep tuning? What do you recommend? Here is our modified code and graphs of the training process.
https://gist.github.com/justineyster/6226535a8ee3f567e759c2ff2ae3776b
Precision and Loss Images
A decrease in binary cross-entropy loss does not imply an increase in accuracy. Consider label 1, predictions 0.2, 0.4 and 0.6 at timesteps 1, 2, 3 and a classification threshold of 0.5. Timesteps 1 and 2 will produce a decrease in loss but no increase in accuracy.
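The arithmetic behind that example, as a quick check (nothing beyond the numbers above):
import numpy as np

y_true = 1.0
for p in (0.2, 0.4, 0.6):
    bce = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    print(f"p={p}: loss={bce:.3f}, predicted class={int(p >= 0.5)}")
# p=0.2: loss=1.609, class 0   p=0.4: loss=0.916, class 0   p=0.6: loss=0.511, class 1
# The loss drops at every step, but accuracy only changes once p crosses 0.5.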
Ensure that your model has enough capacity by checking that it can overfit the training data. If the model is overfitting the training data, avoid overfitting by using regularization techniques such as dropout, L1 and L2 regularization, and data augmentation.
Last, confirm your validation data and training data come from the same distribution.
Here are my suggestions. One of the possible problems is that your network has started to memorize the data; yes, you should increase regularization.
Update:
Here I want to mention one more problem that may cause this:
The balance ratio in the validation set is far away from what you have in the training set. As a first step, I would recommend trying to understand what your test data (the real-world data, the data your model will face at inference time) looks like: what is its balance ratio, and what are its other characteristics. Then try to build a train/validation set with almost the same characteristics as the real data.
Well, I faced a similar situation when I used a Softmax function in the last layer instead of a Sigmoid for binary classification.
My validation loss and training loss were decreasing, but the accuracy of both remained constant. So this taught me why sigmoid is used for binary classification.
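For reference, a minimal sketch of the setup that lesson points to (layer sizes and input shape are placeholders): a single sigmoid output unit with binary cross-entropy instead of a two-unit softmax head.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(30,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # rather than Dense(2, activation="softmax")
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])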