Multi-GPU CIFAR10 example in TensorFlow: aggregated loss

In the TensorFlow multi-GPU CIFAR10 example, they compute the loss for each GPU (lines 174-180):
for i in xrange(FLAGS.num_gpus):
  with tf.device('/gpu:%d' % i):
    with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
      loss = tower_loss(scope)
Then, a few lines below (line 246), they evaluate the loss with
_, loss_value = sess.run([train_op, loss])
What loss exactly is computed here?
I looked at the tower_loss function, but I don't see any incremental aggregation over all GPUs (towers).
I understand that the whole graph is being executed (over all GPUs), but what value of the loss will be returned? Only the loss on the last GPU? I don't see any aggregation on the actual loss variable.

The computed loss is indeed only the loss on the last GPU. In the code they use the Python variable loss to access the Tensor.
You can also validate this easily by printing the Python variable representing this tensor. E.g., adding print(loss) on line 244 (with a 2-GPU setup) will return:
Tensor("tower_1/total_loss_1:0", shape=(), dtype=float32, device=/device:GPU:1)

I think the gradient computed from the loss of each GPU tower is appended to the tower_grads list, and the average_gradients function averages all of the gradients. I don't quite understand the question here, because the tower_loss() function operates within one GPU; the aggregation and synchronization of all GPU outputs happen outside of it. The print in the previous answer will indeed print the last GPU's result, because it is the last output of the for loop over all GPUs, but that does not mean only the last loss is used.
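For reference, a condensed sketch of the structure this answer describes (simplified from the example; variable reuse, the learning rate lr, and other details are omitted or placeholders): the loss stays per-tower, and it is the gradients that are collected and averaged.

opt = tf.train.GradientDescentOptimizer(lr)   # lr as computed in the example
tower_grads = []
for i in xrange(FLAGS.num_gpus):
  with tf.device('/gpu:%d' % i):
    with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
      loss = tower_loss(scope)                     # this tower's loss only
      tower_grads.append(opt.compute_gradients(loss))

grads = average_gradients(tower_grads)             # aggregation happens here
train_op = opt.apply_gradients(grads)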

Related

An Efficient way to Calculate loss function batchwise?

I am using autoencoders to do anomaly detection. I have finished training my model and now I want to calculate the reconstruction loss for each entry in the dataset, so that I can assign anomalies to data points with high reconstruction loss.
This is my current code to calculate the reconstruction loss:
But this is really slow. By my estimation, it should take 5 hours to go through the dataset, whereas training one epoch takes approximately 55 minutes.
I feel that converting to tensor operation is bottlenecking the code, but I can't find a better way to do it.
I've tried changing the batch sizes but it does not make much of a difference. I have to use the convert to tensor part because K.eval is throwing an error if I do it normally.
for i in range(0, encoded_dataset.shape[0], batch_size):
    y_true = tf.convert_to_tensor(encoded_dataset[i:i+batch_size].values,
                                  np.float32)
    y_pred = tf.convert_to_tensor(ae1.predict(encoded_dataset[i:i+batch_size].values),
                                  np.float32)
    # Append the batch losses (numpy array) to the list
    reconstruction_loss_transaction.append(K.eval(loss_function(y_true, y_pred)))
I was able to train in 55 mins per epoch. So I feel prediction should not take 5 hours per epoch. encoded_dataset is a variable that has the entire dataset in main memory as a data frame.
I am using Azure VM instance.
K.eval(loss_function(y_true, y_pred)) is used to find the loss for each row of the batch.
So y_true will be of size (batch_size, 2000), and so will y_pred.
K.eval(loss_function(y_true, y_pred)) will give me an output of shape (batch_size, 1), evaluating binary cross entropy on each row of y_true and y_pred.
Moved from comments:
My suspicion is that ae1.predict and K.eval(loss_function) are behaving in unexpected ways. ae1.predict should normally be used to output the loss function value as well as y_pred. When you create the model, specify that the loss value is another output (you can have a list of multiple outputs), then just call predict here once to get both y_pred and the loss value in one call.
But I want the loss for each row. Won't the loss returned by the predict method be the mean loss for the entire batch?
The answer depends on how the loss function is implemented. Both ways produce perfectly valid and identical results in TF under the hood. You could average the loss over the batch before taking the gradient w.r.t. the loss, or take the gradient w.r.t. a vector of losses. The gradient operation in TF will perform the averaging of the losses for you if you use the latter approach (see SO articles on taking the per-sample gradient, it's actually hard to do).
If Keras implements the loss with reduce_mean built into the loss, you can just define your own loss. If you're using square loss, replace 'mean_squared_error' with lambda y_true, y_pred: tf.square(y_pred - y_true). That produces the squared error instead of the MSE (no difference to the gradient), but look here for the variant that includes the mean.
In any case this produces a per sample loss so long as you don't use tf.reduce_mean, which is purely optional in the loss. Another option is to simply compute the loss separately from what you optimize for and make that an output of the model, also perfectly valid.
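A hedged sketch of the per-sample idea above, reusing the names ae1, encoded_dataset and batch_size from the question (the per_row_bce helper and the tf.keras backend import are assumptions for illustration): keep the mean over the 2000 features of each row but skip the reduction over rows, and evaluate everything in one pass instead of converting each batch to tensors and calling K.eval inside the Python loop.

import numpy as np
from tensorflow.keras import backend as K   # or the Keras backend import already used in the question

def per_row_bce(y_true, y_pred):
    # Mean binary cross entropy over the features of each row; no reduction
    # over the batch axis, so the result has shape (num_rows,).
    return K.mean(K.binary_crossentropy(y_true, y_pred), axis=-1)

x = encoded_dataset.values.astype(np.float32)   # DataFrame from the question
y_pred = ae1.predict(x, batch_size=batch_size)  # one prediction pass over the data

reconstruction_loss = K.eval(per_row_bce(K.constant(x), K.constant(y_pred)))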

Tensorflow Optimizer On Multi-valued Tensor

By mistake I forgot to reduce the mean of the output of the cross entropy before I fed it in as the loss, but the training ran anyway and produced reasonable results.
Now I'm wondering if what I did:
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits, name='cross_entropy_per_example')
op = tf.train.AdamOptimizer(0.01).minimize(loss)
Is the same as:
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits, name='cross_entropy_per_example'))
op = tf.train.AdamOptimizer(0.01).minimize(loss)
I was under the impression that the optimization of the cost function required a single value tensor, but I'm confused why the training ran despite passing a tensor with more than one value.
tf.gradients (and therefore most higher-level interfaces built on it, including the Optimizers) implicitly sums whatever you're differentiating: it only ever differentiates a scalar, so a multi-valued loss is summed into one before the gradient is taken. There is some mention of this in the tf.gradients documentation.
So in your case the two versions differ only by the constant factor that reduce_mean would have divided by.
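A small self-contained sketch (TF 1.x) of what that means: differentiating the unreduced loss vector gives the same gradient as differentiating its sum, and the reduce_mean version is just that divided by the number of elements.

import tensorflow as tf

x = tf.Variable([1.0, 2.0, 3.0])
per_example = tf.square(x)                      # multi-valued "loss"

grad_vector = tf.gradients(per_example, x)[0]   # implicit sum before differentiating
grad_sum = tf.gradients(tf.reduce_sum(per_example), x)[0]
grad_mean = tf.gradients(tf.reduce_mean(per_example), x)[0]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad_vector))   # [2. 4. 6.]
    print(sess.run(grad_sum))      # [2. 4. 6.]  -- identical
    print(sess.run(grad_mean))     # [0.667 1.333 2.]  -- divided by 3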

Batch training uses sum of updates? or average of updates?

I have a few questions about batch training of neural networks.
First, when we update the weights using batch training, the amount of change is the accumulated gradient over the batch. In this case, is the amount of change the sum of the gradients, or the average of the gradients?
If the answer is the sum of the gradients, the amount of change will be much bigger than in online training, because the contributions are accumulated. In this case, I don't think the weights can be optimized well.
Otherwise, if the answer is the average of the gradients, then it seems very reasonable to expect the weights to be optimized well. However, in this case, we would have to train many more times than with online training, because the weights are updated only once, and only by a little, per batch of data.
Second, whatever the answer to the first question is, when I use the CNN sample code of TensorFlow for MNIST as follows, it can optimize the weights so fast that the training accuracy goes above 90% even by the second step.
=======================================================================
train_step = tf.train.GradientDescentOptimizer(0.001).minimize(cross_entropy)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

for i in range(1000):
    batch = mnist.train.next_batch(100)
    if i % 100 == 0:
        train_accuracy = sess.run(accuracy, feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
    sess.run(train_step, feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
========================================================================
Please explain how TensorFlow optimizes the weights so fast.
The answer to this question depends on your loss function.
If loss_element is your loss function for one element of the batch, then the loss of your batch will be some function of all your individual losses.
For example, if you choose to use tf.reduce_mean, then your loss is averaged over all the elements of your batch, and so is the gradient. If you use tf.reduce_sum, then your gradient will be the element-wise sum of all your gradients.
Using the sum of the gradients or the average gradient amounts to the same thing, because you later have to find a good learning rate, which will most likely absorb the division by the batch size introduced by averaging.
However, using the average over the batch has the advantage of producing a loss that is comparable between two training runs that use different batch sizes.
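A tiny self-contained sketch of that equivalence (the toy variable and targets are placeholders): with the sum, rescaling the learning rate by 1/batch_size reproduces exactly the update you would get from the mean.

import tensorflow as tf

batch_size = 4
w = tf.Variable(1.0)
targets = tf.constant([1.0, 2.0, 3.0, 4.0])
per_example_loss = tf.square(w - targets)       # shape (batch_size,)

# Running either op once produces the same update to w; the learning rate
# simply absorbs the 1/batch_size factor from the averaging.
step_mean = tf.train.GradientDescentOptimizer(0.1).minimize(
    tf.reduce_mean(per_example_loss))
step_sum = tf.train.GradientDescentOptimizer(0.1 / batch_size).minimize(
    tf.reduce_sum(per_example_loss))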

Tensorflow CIFAR10 Multi GPU - Why Combined Loss?

In the TensorFlow CIFAR10 example, trained over multiple GPUs, the loss seems to be combined for each "tower", and the gradient is calculated from this combined loss.
# Build the portion of the Graph calculating the losses. Note that we will
# assemble the total_loss using a custom function below.
_ = cifar10.loss(logits, labels)
# Assemble all of the losses for the current tower only.
losses = tf.get_collection('losses', scope)
# Calculate the total loss for the current tower.
total_loss = tf.add_n(losses, name='total_loss')
# Attach a scalar summary to all individual losses and the total loss; do the
# same for the averaged version of the losses.
for l in losses + [total_loss]:
    # Remove 'tower_[0-9]/' from the name in case this is a multi-GPU training
    # session. This helps the clarity of presentation on tensorboard.
    loss_name = re.sub('%s_[0-9]*/' % cifar10.TOWER_NAME, '', l.op.name)
    tf.contrib.deprecated.scalar_summary(loss_name, l)
return total_loss
I'm new to TensorFlow, but from my understanding, every time cifar10.loss is called, tf.add_to_collection('losses', cross_entropy_mean) is run and the loss from the current batch is being stored in the collection.
Then losses = tf.get_collection('losses', scope) is called, and all the losses are retrieved from the collection. Then the tf.add_n op adds all the retrieved loss tensors from this "tower" together.
I expected the loss to be just from the current training step/batch, not all batches.
Am I misunderstanding something? Or is there a reason for combining the losses together?
If weight decay is enabled, it will also be added to the losses collection.
Therefore, for each tower (scope), add_n combines all the losses: cross_entropy_mean and weight_decay.
Then gradients are calculated for each tower (scope). At the end, the gradients of the different towers (scopes) are averaged in average_gradients.
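A hedged, self-contained sketch of what ends up in one tower's 'losses' collection when weight decay is enabled, and how tf.add_n combines it (the toy logits, labels and weights here are placeholders, not the example's model):

import tensorflow as tf

logits = tf.random_normal([8, 10])
labels = tf.constant([0, 1, 2, 3, 4, 5, 6, 7], dtype=tf.int64)
weights = tf.Variable(tf.truncated_normal([10, 10], stddev=0.1))
wd = 0.004                                   # weight-decay factor

cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels, logits=logits, name='cross_entropy_per_example')
cross_entropy_mean = tf.reduce_mean(cross_entropy, name='cross_entropy')
tf.add_to_collection('losses', cross_entropy_mean)

# When weight decay is enabled, an L2 term per weight variable is also added:
weight_decay = tf.multiply(tf.nn.l2_loss(weights), wd, name='weight_loss')
tf.add_to_collection('losses', weight_decay)

# add_n over the tower's collection: cross_entropy_mean + weight-decay terms.
total_loss = tf.add_n(tf.get_collection('losses'), name='total_loss')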
Why combined loss
The example you are referring to is an example of data parallelism over multiple GPUs. Data parallelism helps with training deeper models with a bigger batch_size. In this setting you need to combine the losses from the GPUs, as each of the GPUs holds one part of the input batch (and the loss and gradients corresponding to that part). One illustration is provided in the TensorFlow data parallelism example.
Note: in the case of model parallelism, different subgraphs of the model run on separate GPUs and the intermediate outputs are collected by the master.
Example
If you want to train a deeper model (for example, ResNet/Inception) with a batch size of 256, and the model may not fit into a single GPU (for example, one with 8 GB of memory), you can split the batch into two batches of size 128, do the forward pass of the model with the two batches on separate GPUs, and compute the losses and gradients. The computed (loss, gradients) from each of the GPUs are collected and averaged. The averaged gradient is used to update the model parameters.
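A hypothetical minimal sketch of that setup, with a toy linear model standing in for the real network: the 256-sample batch is split in two, each half is processed on its own GPU, and it is the per-GPU gradients that are then averaged and applied.

import tensorflow as tf

images = tf.placeholder(tf.float32, [256, 3072])
labels = tf.placeholder(tf.int64, [256])
w = tf.Variable(tf.zeros([3072, 10]))            # toy stand-in for the model

opt = tf.train.GradientDescentOptimizer(0.1)
tower_grads = []
for i, (img, lab) in enumerate(zip(tf.split(images, 2), tf.split(labels, 2))):
  with tf.device('/gpu:%d' % i):                 # each half-batch on its own GPU
    logits = tf.matmul(img, w)
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=lab, logits=logits))
    tower_grads.append(opt.compute_gradients(loss))
# tower_grads now holds one (gradient, variable) list per GPU; these are
# averaged (e.g. by average_gradients) and applied once to update w.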

[Tensorflow]: cifar10_multi_gpu_train.py - unintended loss reporting

cifar10_multi_gpu_train.py
At this line, the loss for each tower in the multi-GPU setup is calculated.
However, these losses are not averaged, and it seems like the loss from the last GPU is used as the returned loss.
Is this on purpose (if yes, why?) or is it a bug in the code?
At this line, note that loss is in different name scopes (tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i))); so if I understand correctly, it is not that only the loss for the last GPU is used; instead, all losses under a corresponding naming scope for each GPU are used.
Each tower (corresponding to each GPU) will have a loss, which is used to calculate the gradient. Losses are not averaged; instead, all gradients for all towers are averaged at line 196.
Note that in the figure from the tutorial there is no aggregation of the individual losses; it is the gradients that are averaged.
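For completeness, a condensed sketch of what the gradient averaging at that line does (simplified from the example's average_gradients function): for each variable, stack the gradients coming from the different towers and take their mean.

def average_gradients(tower_grads):
  """tower_grads: a list (one entry per tower) of lists of (gradient, variable) pairs."""
  average_grads = []
  for grad_and_vars in zip(*tower_grads):        # the same variable across all towers
    grads = tf.stack([g for g, _ in grad_and_vars])
    grad = tf.reduce_mean(grads, axis=0)         # mean over the tower dimension
    average_grads.append((grad, grad_and_vars[0][1]))
  return average_grads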