Does batch training use the sum of updates or the average of updates? - tensorflow

I have a few questions about batch training of neural networks.
First, when we update weights using batch training, the amount of change is the gradient accumulated over the batch. In this case, is the amount of change the sum of the gradients, or the average of the gradients?
If the answer is the sum of the gradients, the amount of change will be much bigger than in online training, because the gradients are accumulated. In that case, I don't think the weights can be optimized well.
Otherwise, if the answer is the average of the gradients, then it seems very reasonable that the weights will be optimized well. However, in that case, we would have to train many more times than with online training, because the weights are updated only a little, and only once per batch of data.
Second, whatever the answer to the first question is, when I use the TensorFlow CNN sample code for MNIST shown below, it optimizes the weights so fast that the training accuracy is above 90% even at the second step.
=======================================================================
train_step = tf.train.GradientDescentOptimizer(0.001).minimize(cross_entropy)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
for i in range(1000):
    batch = mnist.train.next_batch(100)
    if i % 100 == 0:
        train_accuracy = sess.run(accuracy, feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
    sess.run(train_step, feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
========================================================================
Please explain how TensorFlow optimizes the weights so fast.

The answer to this question depends on your loss function.
If loss_element is your loss function for one element of the batch, then the loss of your batch will be some function of all your individual losses.
For example, if you use tf.reduce_mean, then your loss is averaged over all the elements of your batch, and so is the gradient. If you use tf.reduce_sum, then your loss, and hence your gradient, will be the sum over all the elements.
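As a quick check, here is a minimal sketch (TF 2 eager style, with made-up values) showing that the gradient of a reduce_sum loss is exactly batch_size times the gradient of a reduce_mean loss:

import tensorflow as tf

# Hypothetical minimal example: one weight, a batch of 4 inputs.
w = tf.Variable(1.0)
x = tf.constant([1.0, 2.0, 3.0, 4.0])

with tf.GradientTape(persistent=True) as tape:
    per_element = tf.square(w * x)           # one loss value per batch element
    loss_mean = tf.reduce_mean(per_element)  # averaged over the batch
    loss_sum = tf.reduce_sum(per_element)    # summed over the batch

g_mean = tape.gradient(loss_mean, w)
g_sum = tape.gradient(loss_sum, w)
print(g_sum.numpy() / g_mean.numpy())  # 4.0, i.e. the batch size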

Using the sum of the gradients or the average gradient comes down to the same thing, because you later have to find a good learning rate anyway, and that learning rate will most likely absorb the division by the batch size that the average introduces.
However, using the average over the batch has the advantage that the loss is comparable between two training runs that use different batch sizes.
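For illustration, a small numeric sketch (synthetic numbers, not from the question) of why the mean reduction keeps the loss on a comparable scale across batch sizes:

import numpy as np

rng = np.random.default_rng(0)
errors_small = rng.normal(size=32) ** 2    # per-element squared errors, batch of 32
errors_large = rng.normal(size=128) ** 2   # per-element squared errors, batch of 128

# With a mean reduction the loss stays on the same scale across batch sizes...
print(errors_small.mean(), errors_large.mean())   # both around 1.0
# ...with a sum reduction it grows with the batch size.
print(errors_small.sum(), errors_large.sum())     # roughly 32 vs roughly 128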

Related

How to define a combined loss function in Keras?

My model architecture has two outputs. I want to train the model with two losses, MSE and cross-entropy. At first, I used two Keras losses:
model1.compile(loss=['mse', 'sparse_categorical_crossentropy'], metrics=['mse', 'accuracy'], optimizer='adam')
It works fine, but the cross-entropy loss is very unstable: sometimes it gives 74% accuracy, and in the next epoch it shows 32%. I'm confused about why that is.
Now if I define a custom loss:
from tensorflow.keras.losses import mean_squared_error, binary_crossentropy

def my_custom_loss(y_true, y_pred):
    mse = mean_squared_error(y_true[0], y_pred[0])
    crossentropy = binary_crossentropy(y_true[1], y_pred[1])
    return mse + crossentropy
But it's not working: it shows a negative total loss.
It is hard to judge the issue from the information given. A reason might be a batch size that is too small or a learning rate that is too high, making the training unstable. I also wonder why you use sparse_categorical_crossentropy in the top example and binary_crossentropy in the lower one. How many classes do you actually have?
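For what it's worth, here is one hedged way to set this up in Keras without indexing y_true/y_pred inside a single custom loss; the layer names, sizes, and class count are illustrative assumptions, not taken from the question:

from tensorflow import keras

inp = keras.Input(shape=(16,))                                         # input size is assumed
h = keras.layers.Dense(32, activation="relu")(inp)
out_reg = keras.layers.Dense(1, name="reg")(h)                         # regression head
out_cls = keras.layers.Dense(3, activation="softmax", name="cls")(h)   # 3 classes, assumed

model1 = keras.Model(inp, [out_reg, out_cls])
model1.compile(
    optimizer="adam",
    loss={"reg": "mse", "cls": "sparse_categorical_crossentropy"},
    loss_weights={"reg": 1.0, "cls": 1.0},   # tune the relative weighting here
    metrics={"reg": ["mse"], "cls": ["accuracy"]},
)
# The total loss Keras reports is loss_weights["reg"] * mse + loss_weights["cls"] * crossentropy,
# and each loss only ever sees its own output, so there is no indexing to get wrong.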

An efficient way to calculate the loss function batchwise?

I am using autoencoders to do anomaly detection. I have finished training my model, and now I want to calculate the reconstruction loss for each entry in the dataset, so that I can flag the data points with high reconstruction loss as anomalies.
This is my current code to calculate the reconstruction loss:
But this is really slow. By my estimation, it would take 5 hours to go through the dataset, whereas training one epoch takes approximately 55 minutes.
I feel that converting to tensor operations is bottlenecking the code, but I can't find a better way to do it.
I've tried changing the batch size, but it does not make much of a difference. I have to use the convert-to-tensor part because K.eval throws an error if I do it directly.
for i in range(0, encoded_dataset.shape[0], batch_size):
    y_true = tf.convert_to_tensor(encoded_dataset[i:i+batch_size].values,
                                  np.float32)
    y_pred = tf.convert_to_tensor(ae1.predict(encoded_dataset[i:i+batch_size].values),
                                  np.float32)
    # Append the batch losses (numpy array) to the list
    reconstruction_loss_transaction.append(K.eval(loss_function(y_true, y_pred)))
I was able to train in 55 minutes per epoch, so I feel prediction should not take 5 hours. encoded_dataset is a variable that holds the entire dataset in main memory as a data frame.
I am using an Azure VM instance.
K.eval(loss_function(y_true, y_pred)) is there to find the loss for each row of the batch. So y_true and y_pred will both be of size (batch_size, 2000), and K.eval(loss_function(y_true, y_pred)) gives an output of shape (batch_size, 1), evaluating binary cross-entropy on each row of y_true and y_pred.
Moved from comments:
My suspicion is that ae1.predict and K.eval(loss_function) are behaving in unexpected ways. ae1.predict should normally be used to output the loss function value as well as y_pred. When you create the model, specify that the loss value is another output (you can have a list of multiple outputs), then just call predict here once to get both y_pred and the loss value in one call.
But I want the loss for each row. Won't the loss returned by the predict method be the mean loss for the entire batch?
The answer depends on how the loss function is implemented. Both ways produce perfectly valid and identical results in TF under the hood. You could average the loss over the batch before taking the gradient w.r.t. the loss, or take the gradient w.r.t. a vector of losses. The gradient operation in TF will perform the reduction of the losses for you if you use the latter approach (see the SO articles on taking the per-sample gradient; it's actually hard to do).
If Keras implements the loss with reduce_mean built into the loss, you can just define your own loss. If you're using square loss, replace 'mean_squared_error' with lambda y_true, y_pred: tf.square(y_pred - y_true). That produces the square error instead of the MSE (no difference to the gradient), but look here for the variant that includes the mean.
In any case this produces a per-sample loss as long as you don't use tf.reduce_mean, which is purely optional in the loss. Another option is simply to compute the loss separately from what you optimize for and make it an output of the model, which is also perfectly valid.
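As a concrete illustration, here is a hedged sketch of computing the per-row reconstruction loss with a single vectorized predict() call, assuming binary cross-entropy over 2000 features per row as described in the question (ae1 and encoded_dataset are the objects from the question):

import numpy as np

X = encoded_dataset.values.astype(np.float32)
X_pred = ae1.predict(X, batch_size=256)          # one vectorized forward pass over the data

eps = 1e-7
X_pred = np.clip(X_pred, eps, 1.0 - eps)
# Per-row binary cross-entropy, averaged over the features of each row,
# mirroring what Keras' binary_crossentropy returns per sample.
reconstruction_loss = -np.mean(
    X * np.log(X_pred) + (1.0 - X) * np.log(1.0 - X_pred), axis=1)
# reconstruction_loss has shape (num_rows,): one anomaly score per data point.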

Could tensorflow optimize each element's loss in a batch separately, instead of optimizing the whole average loss?

How can tensorflow optimize the losses of a batch's elements individually instead of optimizing the batch loss?
When optimizing the loss for each batch, the common way is to sum or average all of the batch's element losses into the batch loss, and then optimize this batch loss. In my case, I would like to optimize each element's loss individually, instead of reducing them together into the batch loss.
For example, in the following code:
losses = tf.nn.nce_loss(<my batch inputs here>)
loss = tf.reduce_mean(losses)
optim = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(loss)
How can I skip loss = tf.reduce_mean(losses) and minimize the tensor losses directly? (In this way, the mini-batch actually reduces to the situation where the batch size is 1.)
I have tried feeding losses to minimize directly, as:
optim = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(losses)  # instead of loss
I am not sure how the minimization will work. When I run it in the session, the losses tend to explode to NaN.
So is it possible to achieve the above aim in tensorflow?
The difference between computing the gradients of tf.reduce_mean(losses) and the gradients of losses is that for the losses tensor you get the SUM of the gradients (the sum over the gradients of each sample in the batch), while for tf.reduce_mean(losses) you get the MEAN of the gradients. That's why you start to get NaN values: the sum of the gradients becomes a very large number as the batch size increases.
If you are going to optimize the losses tensor instead of the reduced mean loss, you can get the exact equivalent by dividing your learning rate by the batch size.
To optimize each sample individually, just feed one sample per batch.
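For illustration, a hedged sketch in the TF 1.x style of the question (the batch size value is an assumption) of the learning-rate scaling that makes minimizing the unreduced losses tensor equivalent to minimizing the mean:

import tensorflow as tf

batch_size = 128       # assumed value, use your actual batch size
learning_rate = 0.01

# Minimizing the mean loss:
loss = tf.reduce_mean(losses)
optim_mean = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

# Minimizing the unreduced losses tensor sums the per-sample gradients,
# so dividing the learning rate by the batch size gives the same update:
optim_sum = tf.train.GradientDescentOptimizer(
    learning_rate / batch_size).minimize(losses)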

In tensorflow estimator class, what does it mean to train one step?

Specifically, within one step, how does it train the model? What is the stopping condition for the gradient descent and back-propagation?
Docs here: https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator#train
e.g.
mnist_classifier = tf.estimator.Estimator(model_fn=cnn_model_fn)
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": X_train},
    y=y_train,
    batch_size=50,
    num_epochs=None,
    shuffle=True)
mnist_classifier.train(
    input_fn=train_input_fn,
    steps=100,
    hooks=[logging_hook])
I understand that training one step means feeding the neural network model batch_size data points once. My question is, within this one step, how many times does it perform gradient descent? Does it do back-propagation and gradient descent just once, or does it keep performing gradient descent until the model weights reach an optimum for this batch of data?
In addition to David Parks' answer: using batches for performing gradient descent is referred to as (mini-batch) stochastic gradient descent. Instead of updating the weights after each training sample, you average the gradients over the batch and use this averaged gradient to update your weights.
For example, if you have 1000 training samples and use batches of 200, you calculate the average gradient over 200 samples and update your weights with it. That means you only perform 5 updates per epoch instead of updating your weights 1000 times. On sufficiently big data sets, you will experience a much faster training process.
Michael Nielsen has a really nice explanation of this concept in his book.
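Spelled out, the arithmetic from the example above:

num_samples = 1000
batch_size = 200
updates_per_epoch = num_samples // batch_size   # 5 weight updates per epoch
updates_online = num_samples                    # 1000 updates per epoch with online training
print(updates_per_epoch, updates_online)        # 5 1000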
1 step = 1 gradient update. And each gradient update step requires one forward pass and one backward pass.
The stopping condition is generally left up to you and is arguably more art than science. Commonly you will plot (TensorBoard is handy here) your cost, your training accuracy, and periodically your validation-set accuracy. The high point on validation accuracy (equivalently, the low point on validation loss) is generally a good point to stop. Depending on your dataset, validation accuracy may drop and at some point increase again, or it may simply flatten out, at which point the stopping condition often correlates with the developer's degree of impatience.
Here's a nice article on stopping conditions, a google search will turn up plenty more.
https://stats.stackexchange.com/questions/231061/how-to-use-early-stopping-properly-for-training-deep-neural-network
Another common approach to stopping is to drop the learning rate every time you observe that validation accuracy has not changed for some "reasonable" number of steps. When you've effectively hit a learning rate of 0, you call it quits.
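The question is about tf.estimator, but for illustration the two stopping strategies above map onto standard Keras callbacks roughly like this (monitor names and patience values are assumptions):

from tensorflow import keras

callbacks = [
    # Stop when validation accuracy has not improved for a while.
    keras.callbacks.EarlyStopping(
        monitor="val_accuracy", patience=10, restore_best_weights=True),
    # Drop the learning rate when validation accuracy plateaus; training
    # effectively ends once the learning rate hits the floor.
    keras.callbacks.ReduceLROnPlateau(
        monitor="val_accuracy", factor=0.5, patience=5, min_lr=1e-6),
]
# model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=callbacks)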
The input function emits batches (when num_epochs=None, num_batches is infinite):
num_batches = num_epochs * (num_samples / batch_size)
One step processes 1 batch; if steps > num_batches, training will stop after num_batches steps.
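Plugging in the numbers from the question's example (and assuming the standard 55,000-sample MNIST training split used by the TF tutorials):

num_samples = 55000    # assumed size of mnist.train in the TF tutorials
batch_size = 50
steps = 100

samples_seen = steps * batch_size            # 5000 samples
epochs_covered = samples_seen / num_samples  # about 0.09 of one epoch
# With num_epochs=None the input_fn never runs out of batches,
# so training stops only because steps=100 is reached.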

Tensorflow estimator: average_loss vs loss

In tf.estimator, what's the difference between average_loss and loss? I would have guessed from the names that the former would be the latter divided by the number of records, but that's not the case; with a few thousand records, the latter is about three or four times the former.
The difference between average_loss and loss is that loss reduces the SUM over the batch losses, while average_loss reduces the MEAN over the same losses. Hence, the ratio is exactly the batch_size argument of your input_fn. If you pass batch_size=1, you should see them become equal.
The actual reported tensors depend on the particular type of tf.Estimator, but they are very similar; here's the source code for the regression head (which corresponds to tf.DNNRegressor):
training_loss = losses.compute_weighted_loss(unweighted_loss, weights=weights,
                                             reduction=losses.Reduction.SUM)
mean_loss = metrics_lib.mean(unweighted_loss, weights=weights)
As you can see, they are computed from the same unweighted_loss and weights tensors, and the same values are reported to the TensorBoard summary.
The actual ratio is exactly 4.0, which corresponds to the batch size.
When you train a network, you usually feed the inputs as a batch.
In the example you refer to, the batch size is 4, so loss is the sum of the losses over the whole batch, while average_loss is the mean of the losses over the whole batch.
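A minimal numeric sketch of the two answers (values made up for illustration): with a batch of 4, the summed loss is exactly 4x the averaged loss:

import numpy as np

per_example_loss = np.array([0.7, 1.2, 0.4, 0.9])   # batch of 4, made-up values

loss = per_example_loss.sum()            # what `loss` reports (SUM reduction)
average_loss = per_example_loss.mean()   # what `average_loss` reports (MEAN reduction)
print(loss / average_loss)               # 4.0 == batch size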