Could TensorFlow optimize each element's loss in a batch separately, instead of optimizing the average loss over the whole batch? - tensorflow

How could TensorFlow optimize a batch's element losses individually instead of optimizing the batch loss?
When optimizing the loss for each batch, the common approach is to sum or average all of the batch's element losses into a single batch loss, and then optimize that batch loss. In my case, I would like to optimize each element's loss individually, instead of reducing them together into the batch loss.
For example, in the following code:
losses = tf.nn.nce_loss(<my batch inputs here>)
loss = tf.reduce_mean(losses)
optim = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(loss)
How could I skip loss = tf.reduce_mean(losses) and minimize the tensor losses directly? (This way, the mini-batch effectively reduces to the case where the batch size is 1.)
I have fed losses to minimize directly:
optim = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(losses) # instead of loss
I am not sure how the minimization will work. When I run it in the session, the losses tend to explode to NaN.
So is it possible to achieve this in TensorFlow?

The difference between the gradients of tf.reduce_mean(losses) and the gradients of losses is that for the losses tensor you get the SUM of the gradients over the samples in the batch, while for tf.reduce_mean(losses) you get the MEAN of the gradients over the samples in the batch. That is why you start to get NaN values: the sum of gradients becomes very large as the batch size increases.
If you are going to optimize the losses tensor instead of the reduced mean loss, you can get an exactly equivalent update by dividing your learning rate by the batch size.
To optimize individually for each sample, just feed one sample per batch.
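The equivalence between the summed gradient with a scaled learning rate and the averaged gradient can be checked by hand. The sketch below is plain Python, not TensorFlow, and assumes a hypothetical one-parameter model (prediction = w * x, squared-error loss per sample) chosen only to keep the arithmetic visible.

```python
# Toy model (an assumption for illustration): prediction = w * x,
# per-sample loss = (w * x - y) ** 2.

def per_sample_grads(w, xs, ys):
    """Gradient of each sample's squared error with respect to w."""
    return [2.0 * (w * x - y) * x for x, y in zip(xs, ys)]

def sgd_step(w, xs, ys, learning_rate, reduction):
    """One gradient-descent step using the summed or the averaged gradient."""
    grads = per_sample_grads(w, xs, ys)
    total = sum(grads)
    if reduction == "mean":
        total /= len(grads)
    return w - learning_rate * total

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w0 = 0.5
lr = 0.01

w_mean = sgd_step(w0, xs, ys, lr, "mean")                  # averaged gradient
w_sum_scaled = sgd_step(w0, xs, ys, lr / len(xs), "sum")   # summed gradient, lr / batch size
w_sum = sgd_step(w0, xs, ys, lr, "sum")                    # summed gradient, unscaled lr
print(w_mean, w_sum_scaled, w_sum)
```

With the unscaled learning rate, the summed-gradient step is batch-size times larger than the averaged one, which is consistent with the divergence described in the question.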

Related

An Efficient way to Calculate loss function batchwise?

I am using autoencoders to do anomaly detection. I have finished training my model, and now I want to calculate the reconstruction loss for each entry in the dataset, so that I can flag data points with high reconstruction loss as anomalies.
This is my current code to calculate the reconstruction loss
But this is really slow. By my estimation, it should take 5 hours to go through the dataset whereas training one epoch occurs in approx 55 mins.
I feel that converting to tensor operation is bottlenecking the code, but I can't find a better way to do it.
I've tried changing the batch sizes but it does not make much of a difference. I have to use the convert to tensor part because K.eval is throwing an error if I do it normally.
for i in range(0, encoded_dataset.shape[0], batch_size):
    y_true = tf.convert_to_tensor(encoded_dataset[i:i+batch_size].values,
                                  np.float32)
    y_pred = tf.convert_to_tensor(ae1.predict(encoded_dataset[i:i+batch_size].values),
                                  np.float32)
    # Append the batch losses (numpy array) to the list
    reconstruction_loss_transaction.append(K.eval(loss_function(y_true, y_pred)))
I was able to train in 55 mins per epoch. So I feel prediction should not take 5 hours per epoch. encoded_dataset is a variable that has the entire dataset in main memory as a data frame.
I am using Azure VM instance.
K.eval(loss_function(y_true, y_pred)) finds the loss for each row of the batch.
So y_true will be of size (batch_size, 2000) and so will y_pred.
K.eval(loss_function(y_true, y_pred)) will give me an output of size (batch_size, 1), evaluating binary cross-entropy on each row of y_true and y_pred.
Moved from comments:
My suspicion is that ae1.predict and K.eval(loss_function) are behaving in unexpected ways. ae1.predict can be used to output the loss value as well as y_pred. When you create the model, specify the loss value as another output (a model can have a list of multiple outputs), then call predict once here to get both y_pred and the loss value in one call.
But I want the loss for each row. Won't the loss returned by the predict method be the mean loss for the entire batch?
The answer depends on how the loss function is implemented. Both ways are perfectly valid in TF and lead to the same gradient up to a constant factor. You could average the loss over the batch before taking the gradient w.r.t. the loss, or take the gradient w.r.t. a vector of losses. In the latter case, the gradient operation in TF sums the per-sample gradients for you, which differs from the mean only by a factor of the batch size (see SO articles on taking the per-sample gradient; it's actually hard to do).
If Keras implements the loss with reduce_mean built in, you could just define your own loss. If you're using squared loss, replace 'mean_squared_error' with lambda y_true, y_pred: tf.square(y_pred - y_true). That would produce the squared error instead of the MSE (the gradient changes only by a constant scale), but look here for the variant including the mean.
In any case this produces a per sample loss so long as you don't use tf.reduce_mean, which is purely optional in the loss. Another option is to simply compute the loss separately from what you optimize for and make that an output of the model, also perfectly valid.
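To make "per-sample loss" concrete, here is a minimal plain-Python sketch of binary cross-entropy that averages over each row's features but never reduces across the batch. The function name and the clipping constant eps are assumptions for illustration; this is not the Keras implementation.

```python
import math

def per_row_binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Return one loss per row: the mean binary cross-entropy over that row's features."""
    row_losses = []
    for true_row, pred_row in zip(y_true, y_pred):
        total = 0.0
        for t, p in zip(true_row, pred_row):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
        row_losses.append(total / len(true_row))
    return row_losses

y_true = [[1.0, 0.0], [1.0, 1.0]]
y_pred = [[0.5, 0.5], [0.9, 0.9]]

row_losses = per_row_binary_crossentropy(y_true, y_pred)  # one value per row
batch_mean = sum(row_losses) / len(row_losses)            # the usual reduced loss
print(row_losses, batch_mean)
```

The list row_losses is what anomaly scoring needs; batch_mean is the single number that a reduced loss would report instead.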

How to make a selective back-propagation in a mini-batch in Tensorflow?

Recently, I'm working on a project "predicting future trajectories of objects from their past trajectories by using LSTMs in Tensorflow."
(Here, a trajectory means a sequence of 2D positions.)
Input to the LSTM is, of course, 'past trajectories' and output is 'future trajectories'.
The mini-batch size is fixed during training. However, the number of past trajectories in a mini-batch can vary. For example, let the mini-batch size be 10. If I have only 4 past trajectories for the current training iteration, the other 6 of the 10 slots in the mini-batch are padded with zeros.
When calculating the loss for back-propagation, I set the loss from those 6 to zero so that only the 4 contribute to the back-propagation.
My concern is that TensorFlow still seems to calculate gradients for the 6 even though their loss is zero. As a result, training gets slower as I increase the mini-batch size, even with the same training data.
I also tried the tf.where function when calculating the loss. However, the training time did not decrease.
How can I reduce the training time?
Here is my pseudo code for training.
# For each frame in a sequence
for f in range(pred_length):
    # For each element in a batch
    for b in range(batch_size):
        with tf.variable_scope("rnnlm") as scope:
            if (f > 0 or b > 0):
                scope.reuse_variables()
            # for each pedestrian in an element
            for p in range(MNP):
                # ground-truth position
                cur_gt_pose = ...
                # loss mask
                loss_mask_ped = ... # '1' or '0'
                # go through RNN decoder
                output_states_dec_list[b][p], zero_states_dec_list[b][p] = cell_dec(cur_embed_frm_dec,
                    zero_states_dec_list[b][p])
                # fully connected layer for output
                cur_pred_pose_dec = tf.nn.xw_plus_b(output_states_dec_list[b][p], output_wd, output_bd)
                # go through embedding function for the next input
                prev_embed_frms_dec_list[b][p] = tf.reshape(tf.nn.relu(tf.nn.xw_plus_b(cur_pred_pose_dec, embedding_wd, embedding_bd)), shape=(1, rnn_size))
                # calculate MSE loss
                mse_loss = tf.reduce_sum(tf.pow(tf.subtract(cur_pred_pose_dec, cur_gt_pose_dec), 2.0))
                # only valid ped's traj contributes to the loss
                self.loss += tf.multiply(mse_loss, loss_mask_ped)
I think you're looking for the function tf.stop_gradient. Using this, you could do something like tf.where(loss_mask, tensor, tf.stop_gradient(tensor)) to achieve the desired result, assuming that the dimensions are correct.
However, it looks like this is probably not your issue. It seems as though you are defining new graph nodes for each item in your dataset. This is not how TensorFlow is supposed to be used: you should have only one graph, built beforehand, that performs some fixed function regardless of the batch size. You should definitely not define new nodes for every element in the batch, since that cannot efficiently take advantage of parallelism.
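As a rough illustration of treating the loss as one fixed function of the whole padded batch plus a mask, here is a plain-Python sketch. The function name and the sample values are hypothetical, and in TensorFlow the same arithmetic would be a single expression over batched tensors rather than a Python loop.

```python
def masked_squared_error(preds, targets, mask):
    """Sum of squared errors over the batch, counting only slots whose mask is 1."""
    total = 0.0
    for (px, py), (tx, ty), m in zip(preds, targets, mask):
        total += m * ((px - tx) ** 2 + (py - ty) ** 2)
    return total

# A batch of 4 slots: two real 2D positions followed by two padded slots.
preds = [(1.0, 2.0), (0.0, 1.0), (5.0, 5.0), (7.0, 7.0)]
targets = [(1.5, 2.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
mask = [1, 1, 0, 0]

masked_loss = masked_squared_error(preds, targets, mask)
valid_only_loss = masked_squared_error(preds[:2], targets[:2], [1, 1])
print(masked_loss, valid_only_loss)
```

The padded slots add nothing to the loss, so the masked batch loss matches the loss computed on the valid slots alone. The padded slots still cost computation, which is why the batch-size slowdown in the question is expected.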

Tensorflow estimator: average_loss vs loss

In tf.estimator, what's the difference between average_loss and loss? I would have guessed from the names that the former would be the latter divided by the number of records, but that's not the case; with a few thousand records, the latter is about three or four times the former.
The difference between average_loss and loss is that loss reduces the per-example losses over the batch with a SUM, while average_loss reduces the same losses with a MEAN. Hence, the ratio is exactly the batch_size argument of your input_fn. If you pass batch_size=1, you should see them equal.
The actual reported tensors depend on the particular type of tf.estimator.Estimator, but they are very similar. Here's the source code for the regression head (corresponds to tf.estimator.DNNRegressor):
training_loss = losses.compute_weighted_loss(unweighted_loss, weights=weights,
                                             reduction=losses.Reduction.SUM)
mean_loss = metrics_lib.mean(unweighted_loss, weights=weights)
As you can see, they are computed from the same unweighted_loss and weights tensors. The same values are reported to tensorboard summary.
The actual ratio is exactly 4.0, which corresponds to the batch size.
When you train a network, you usually feed inputs as a batch.
In the example you refer to, the batch size is 4, so the loss is the sum of the losses over the whole batch, while the average loss is the average of the losses over the whole batch.
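The relation between the two reported numbers can be reproduced with a few lines of plain Python. This is only a sketch: the per-example loss values are made up, and all example weights are assumed to be 1.

```python
def summed_loss(per_example_losses):
    """Counterpart of `loss`: SUM reduction over the batch."""
    return sum(per_example_losses)

def mean_loss(per_example_losses):
    """Counterpart of `average_loss`: MEAN reduction over the batch."""
    return sum(per_example_losses) / len(per_example_losses)

batch_losses = [0.5, 1.5, 2.0, 4.0]  # hypothetical per-example losses, batch size 4

loss = summed_loss(batch_losses)
average_loss = mean_loss(batch_losses)
ratio = loss / average_loss
print(loss, average_loss, ratio)
```

Under these assumptions the ratio equals the number of examples in the batch, which matches the factor of 4.0 observed above.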

Batch training uses sum of updates? or average of updates?

I have a few questions about batch training of neural networks.
First, when we update weights using batch training, the amount of change is the gradient accumulated over the batch. In this case, is the amount of change the sum of the gradients or the average of the gradients?
If the answer is the sum of the gradients, the amount of change will be much bigger than in online training, because the amounts are accumulated. In this case, I don't think the weights can be optimized well.
Otherwise, if the answer is the average of the gradients, then it seems reasonable that the weights are optimized well. However, in this case, we have to train many more times than in online training, because the weights are updated only slightly, once per batch of data.
Second, whatever the answer to the first question is, when I use the TensorFlow CNN sample code for MNIST as follows, it optimizes the weights so fast that the training accuracy rises above 90% even in the second step.
=======================================================================
train_step = tf.train.GradientDescentOptimizer(0.001).minimize(cross_entropy)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
for i in range(1000):
    batch = mnist.train.next_batch(100)
    if i%100 == 0:
        train_accuracy = sess.run(accuracy, feed_dict={x:batch[0], y_:batch[1], keep_prob: 1.0})
    sess.run(train_step, feed_dict={x: batch[0], y_:batch[1], keep_prob:1.0})
========================================================================
Please explain how TensorFlow optimizes the weights so fast.
The answer to this question depends on your loss function.
If loss_element is your loss function for one element of the batch, then the loss of your batch will be some function of all your individual losses.
For example, if you choose to use tf.reduce_mean, then your loss is averaged over all the elements of your batch, and so is the gradient. If you use tf.reduce_sum, then your gradient will be the sum of the gradients of all the elements.
Using the sum of the gradients or the average gradient amounts to the same thing, because you later have to find a good learning rate, which will most likely absorb the division by the batch size that appears in the average.
However, using the average over the batch has the advantage that the loss is comparable between two training runs that use different batch sizes.

Tensorflow CIFAR10 Multi GPU - Why Combined Loss?

In the TensorFlow CIFAR10 example, trained over multiple GPUs, the loss seems to be combined for each "tower", and the gradient is calculated from this combined loss.
# Build the portion of the Graph calculating the losses. Note that we will
# assemble the total_loss using a custom function below.
_ = cifar10.loss(logits, labels)
# Assemble all of the losses for the current tower only.
losses = tf.get_collection('losses', scope)
# Calculate the total loss for the current tower.
total_loss = tf.add_n(losses, name='total_loss')
# Attach a scalar summary to all individual losses and the total loss; do the
# same for the averaged version of the losses.
for l in losses + [total_loss]:
    # Remove 'tower_[0-9]/' from the name in case this is a multi-GPU training
    # session. This helps the clarity of presentation on tensorboard.
    loss_name = re.sub('%s_[0-9]*/' % cifar10.TOWER_NAME, '', l.op.name)
    tf.contrib.deprecated.scalar_summary(loss_name, l)
return total_loss
I'm new to TensorFlow, but from my understanding, every time cifar10.loss is called, tf.add_to_collection('losses', cross_entropy_mean) is run and the loss from the current batch is being stored in the collection.
Then losses = tf.get_collection('losses', scope) is called, and all the losses are being retrieved from the collection. Then tf.add_n op is adding all the retrieved loss tensors from this "tower" together.
I expected the loss to be just from the current training step/batch, not all batches.
Am I misunderstanding something? Or is there a reason for combining the losses together?
If weight decay is enabled, it will also add it to the losses collection.
Therefore, for each tower (scope), add_n sums all the losses in the collection: cross_entropy_mean and weight_decay.
Then gradients are calculated for each tower (scope). At the end, the gradients from the different towers (scopes) are averaged in average_gradients.
Why combined loss
The example you are referring to is an example of data parallelism over multiple GPUs. Data parallelism helps with training deeper models with a bigger batch_size. In this setting you need to combine the losses from the GPUs, as each GPU holds one part of the input batch (and the loss and gradients corresponding to that part). One illustration is provided in the following example from the TensorFlow data parallelism example.
Note: In the case of model parallelism, different subgraphs of the model run on separate GPUs and intermediate outputs are collected by the master.
Example
Suppose you want to train the model with a batch size of 256, and a deeper model (for example, ResNet/Inception) may not fit on a single GPU (for example, one with 8 GB of memory). You can split the batch into two batches of size 128, run the forward pass of the model on the two batches on separate GPUs, and compute the loss and gradients. The computed (loss, gradients) from each of the GPUs are collected and averaged. The averaged gradient is used to update the model parameters.
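The averaging step can be sanity-checked without any GPUs. The sketch below is plain Python with a hypothetical one-parameter model (prediction = w * x, squared-error loss); each "tower" is just a slice of the batch, and the helper names are assumptions rather than the functions used in the CIFAR10 example.

```python
def mean_gradient(w, xs, ys):
    """Mean gradient of the squared error (w * x - y) ** 2 with respect to w."""
    grads = [2.0 * (w * x - y) * x for x, y in zip(xs, ys)]
    return sum(grads) / len(grads)

def average_tower_gradients(tower_grads):
    """Average the per-tower gradients (towers are assumed to hold equal-sized shards)."""
    return sum(tower_grads) / len(tower_grads)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5

half = len(xs) // 2
tower_grads = [
    mean_gradient(w, xs[:half], ys[:half]),  # first "GPU" sees the first half
    mean_gradient(w, xs[half:], ys[half:]),  # second "GPU" sees the second half
]

combined = average_tower_gradients(tower_grads)
full_batch = mean_gradient(w, xs, ys)
print(combined, full_batch)
```

When the shards have equal size, the average of the per-tower mean gradients equals the mean gradient over the full batch, so the parameter update is the same as if the whole batch had fit on one device.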