In tf.estimator, what's the difference between average_loss and loss? I would have guessed from the names that the former would be the latter divided by the number of records, but that's not the case; with a few thousand records, the latter is about three or four times the former.
The difference between average_loss and loss is that one reduces the SUM over the batch losses, while the other reduces the MEAN over the same losses. Hence, the ratio is exactly the batch_size argument of your input_fn. If you pass batch_size=1, you should see them equal.
The exact tensors reported depend on the particular type of tf.estimator.Estimator, but they are all very similar. Here is the source code for the regression head (used by tf.estimator.DNNRegressor):
training_loss = losses.compute_weighted_loss(
    unweighted_loss, weights=weights, reduction=losses.Reduction.SUM)
mean_loss = metrics_lib.mean(unweighted_loss, weights=weights)
As you can see, they are computed from the same unweighted_loss and weights tensors. The same values are written to the TensorBoard summaries.
The actual ratio is exactly 4.0, which corresponds to the batch size.
When you train a network, you usually feed inputs as a batch.
In the example you refer to, the batch size is 4, so the loss is the sum of the per-example losses over the whole batch, while the average loss is the mean of those same losses over the whole batch.
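As a quick sanity check of the ratio, here is a minimal sketch in plain Python (no TensorFlow needed). The per-example loss values are made up, and all example weights are assumed to be 1:

```python
# Per-example losses for one batch of 4 (made-up numbers, unit weights assumed).
per_example_losses = [0.5, 1.5, 2.0, 4.0]
batch_size = len(per_example_losses)

# SUM reduction over the batch: what the estimator reports as "loss".
loss = sum(per_example_losses)

# MEAN reduction over the batch: what the estimator reports as "average_loss".
average_loss = loss / batch_size

print("loss:", loss)
print("average_loss:", average_loss)
print("ratio:", loss / average_loss)  # equals batch_size
```

With batch_size=1 the two numbers coincide, which matches the suggestion above.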
I have a feedforward regression network (in Keras with the TensorFlow backend) with a single hidden layer (30 neurons) and an output layer with 2 neurons (for the imaginary and real parts of a complex signal) ... My question is: how exactly is the MSE loss calculated?
I ask because I get only one number in the history object for each epoch.
Eventually I would like to extract a separate loss number per output neuron for each epoch. Is this possible in Keras?
Losses are calculated for every batch pass and those are then averaged into an epoch loss which is the number you are given.
If you want to calculate loss for output neurons separately I think you will have to split your output layer into two, see image below for illustration. You can then assign a loss function for both outputs and you will have access to loss values of both neurons. Note that you will have to split your ground truth into two values as you now have two outputs instead of one.
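The code could look like this:
import tensorflow as tf

input_shape = (16,)  # assumed number of input features; replace with your own
inputs = x = tf.keras.layers.Input(shape=input_shape)
x = tf.keras.layers.Dense(30)(x)
y1 = tf.keras.layers.Dense(1)(x)  # first output, e.g. real part
y2 = tf.keras.layers.Dense(1)(x)  # second output, e.g. imaginary part
model = tf.keras.Model(inputs=inputs, outputs=[y1, y2])
# One loss per output, so Keras reports a separate loss value for each.
loss = [tf.keras.losses.MeanSquaredError(), tf.keras.losses.MeanSquaredError()]
model.compile(optimizer="adam", loss=loss)  # the optimizer choice is an assumption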
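To see how the single MSE number relates to per-output losses, here is a small plain-Python sketch. It assumes plain MSE with no sample weights, and the targets and predictions are made-up values. Under those assumptions the single number is the mean of the two per-output MSEs:

```python
def mse(y_true, y_pred):
    """Mean squared error over two equally long sequences of numbers."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical batch: each row is (real part, imaginary part).
y_true = [(1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
y_pred = [(0.5, 0.0), (0.0, 1.0), (2.0, 1.0)]

# Split the two output columns, as you would split the ground truth
# for a two-output model.
real_true, imag_true = zip(*y_true)
real_pred, imag_pred = zip(*y_pred)

mse_real = mse(real_true, real_pred)
mse_imag = mse(imag_true, imag_pred)

# MSE over all values at once, i.e. the single number for this batch.
flat_true = [v for row in y_true for v in row]
flat_pred = [v for row in y_pred for v in row]
mse_all = mse(flat_true, flat_pred)

print(mse_real, mse_imag, mse_all)
```

Here mse_all equals (mse_real + mse_imag) / 2, so the combined number hides how the error is distributed between the two outputs.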
I am using autoencoders to do anomaly detection. I have finished training my model, and now I want to calculate the reconstruction loss for each entry in the dataset so that I can flag data points with a high reconstruction loss as anomalies.
This is my current code to calculate the reconstruction loss
But this is really slow. By my estimation, it should take 5 hours to go through the dataset whereas training one epoch occurs in approx 55 mins.
I feel that converting to tensor operation is bottlenecking the code, but I can't find a better way to do it.
I've tried changing the batch sizes but it does not make much of a difference. I have to use the convert to tensor part because K.eval is throwing an error if I do it normally.
for i in range(0, encoded_dataset.shape[0], batch_size):
    y_true = tf.convert_to_tensor(encoded_dataset[i:i+batch_size].values,
                                  np.float32)
    y_pred = tf.convert_to_tensor(ae1.predict(encoded_dataset[i:i+batch_size].values),
                                  np.float32)
    # Append the batch losses (numpy array) to the list
    reconstruction_loss_transaction.append(K.eval(loss_function(y_true, y_pred)))
I was able to train in 55 mins per epoch. So I feel prediction should not take 5 hours per epoch. encoded_dataset is a variable that has the entire dataset in main memory as a data frame.
I am using Azure VM instance.
K.eval(loss_function(y_true, y_pred)) finds the loss for each row of the batch.
y_true has shape (batch_size, 2000), and so does y_pred.
K.eval(loss_function(y_true, y_pred)) gives me an output of shape (batch_size, 1), evaluating binary cross-entropy on each row of y_true and y_pred.
Moved from comments:
My suspicion is that ae1.predict and K.eval(loss_function) are behaving in unexpected ways. ae1.predict can output the loss value as well as y_pred. When you create the model, specify the loss value as another output (a model can have a list of multiple outputs), then call predict once here to get both y_pred and the loss value in one call.
But I want the loss for each row. Won't the loss returned by the predict method be the mean loss for the entire batch?
The answer depends on how the loss function is implemented. Both ways are perfectly valid in TF. You could average the loss over the batch before taking the gradient w.r.t. the loss, or take the gradient w.r.t. a vector of losses. With the latter approach, the gradient operation in TF implicitly sums over the vector of losses, so you get one combined gradient that differs from the averaged one only by a factor of the batch size, not one gradient per sample (see SO articles on taking the per-sample gradient, it's actually hard to do).
If Keras implements the loss with reduce_mean built in, you could just define your own loss. If you're using square loss, replace 'mean_squared_error' with lambda y_true, y_pred: tf.square(y_pred - y_true). That produces the squared error instead of the MSE (the gradient direction is unchanged, only its scale differs), but look here for the variant including the mean.
In any case this produces a per sample loss so long as you don't use tf.reduce_mean, which is purely optional in the loss. Another option is to simply compute the loss separately from what you optimize for and make that an output of the model, also perfectly valid.
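As an illustration of a loss without the final reduction, here is a minimal plain-Python sketch of per-row binary cross-entropy. The function name, the eps clipping value, and the sample data are assumptions for illustration and are not the Keras implementation. Each row gets its own loss, and the batch mean is computed only as an optional last step:

```python
import math

def per_row_binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Return one loss per row: the mean binary cross-entropy over that row's features."""
    row_losses = []
    for true_row, pred_row in zip(y_true, y_pred):
        total = 0.0
        for t, p in zip(true_row, pred_row):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
        row_losses.append(total / len(true_row))
    return row_losses

# Hypothetical batch of 2 rows with 2 features each.
y_true = [[1.0, 0.0], [1.0, 1.0]]
y_pred = [[0.9, 0.1], [0.5, 0.5]]

row_losses = per_row_binary_crossentropy(y_true, y_pred)  # one value per row
batch_loss = sum(row_losses) / len(row_losses)            # optional mean reduction

print(row_losses, batch_loss)
```

For anomaly detection, row_losses is the quantity of interest: the second row is reconstructed worse than the first and gets the larger loss.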
How could tensorflow optimize the batch's element losses individually instead of optimizing the batch loss?
When optimizing the loss for each batch, the common way is summing or taking the average of all the batch's element losses as the batch loss, and then optimizing this batch loss. In my case, I would like to optimize each element's loss individually, instead of reducing them together as the batch loss.
For example, in the following code:
losses = tf.nn.nce_loss(<my batch inputs here>)
loss = tf.reduce_mean(losses)
optim = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(loss)
How could I skip loss = tf.reduce_mean(losses) and minimize the tensor losses directly? (In this way, the mini-batch actually reduces to the situation that batch size is 1.)
I have fed losses to minimize directly:
optim = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(losses)  # instead of loss
I am not sure how the minimization will work. When I run it in the session, the losses tend to explode to NaN.
So is it possible to achieve the above aim in tensorflow?
The difference between computation of gradients of tf.reduce_mean(losses) and gradients of losses is that for losses tensor you will get the SUM of gradients (the sum over gradients over each sample in a batch), while for tf.reduce_mean(losses) you will get the MEAN of the gradients (mean of gradients over the samples in a batch). That's why you start to get NaN values - the sum of gradients becomes a very large number as the size of the batch increases.
If you are going to optimize the losses tensor instead of the reduced mean loss, you can get exact equivalence by dividing your learning rate by the batch size.
To optimize each sample individually, just feed one sample per batch.
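The equivalence can be checked by hand with a small plain-Python sketch. It assumes a hypothetical one-parameter model y = w * x with squared-error loss and made-up data, chosen only so that the gradients are easy to compute:

```python
def per_sample_grads(w, xs, ys):
    """Gradient of (w * x - y)**2 with respect to w, for each sample."""
    return [2.0 * (w * x - y) * x for x, y in zip(xs, ys)]

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.0
lr = 0.01
batch_size = len(xs)

grads = per_sample_grads(w, xs, ys)
sum_grad = sum(grads)               # what minimizing the losses tensor uses
mean_grad = sum_grad / batch_size   # what minimizing reduce_mean(losses) uses

w_mean = w - lr * mean_grad                       # step with the mean loss
w_sum_scaled = w - (lr / batch_size) * sum_grad   # same step, learning rate divided
w_sum_unscaled = w - lr * sum_grad                # batch_size times larger step

print(w_mean, w_sum_scaled, w_sum_unscaled)
```

The unscaled step is batch_size times larger than the other two, which is why the summed version diverges more easily as the batch grows.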
I have a few questions about batch training of neural networks.
First, when we update weights using batch training, the amount of change is the gradient accumulated over the batch. Is that amount the sum of the gradients or the average of the gradients?
If the answer is the sum of the gradients, the amount of change will be much bigger than in online training, because the amounts are accumulated. In this case, I don't think the weights can be optimized well.
Otherwise, if the answer is the average of the gradients, then it seems very reasonable that the weights can be optimized well. However, in this case, we have to train many more times than with online training, because the weights are updated only once, by a small amount, per batch of data.
Second, whatever the answer to the first question is, when I use the TensorFlow CNN sample code for MNIST as follows, it optimizes the weights so fast that the training accuracy is above 90% even at the second step.
=======================================================================
train_step = tf.train.GradientDescentOptimizer(0.001).minimize(cross_entropy)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
for i in range(1000):
    batch = mnist.train.next_batch(100)
    if i%100 == 0:
        train_accuracy = sess.run(accuracy, feed_dict={x:batch[0], y_:batch[1], keep_prob: 1.0})
    sess.run(train_step, feed_dict={x: batch[0], y_:batch[1], keep_prob:1.0})
========================================================================
Please explain how TensorFlow optimizes the weights so quickly.
The answer to this question depends on your loss function.
If loss_element is your loss function for one element of the batch, then the loss of your batch will be some function of all the individual losses.
For example, if you choose to use tf.reduce_mean, then your loss is averaged over all the elements of your batch, and so is the gradient. If you use tf.reduce_sum, then your gradient will be the sum of the gradients of the individual elements.
In practice it makes little difference whether you use the sum or the average of the gradients, because you have to tune the learning rate anyway, and a good learning rate will absorb the division by the batch size that the average introduces.
However, averaging over the batch has the advantage that the loss is comparable between two training runs that use different batch sizes.
In the TensorFlow CIFAR10 example, trained over multiple GPUs, the loss seems to be combined for each "tower", and the gradient is calculated from this combined loss.
# Build the portion of the Graph calculating the losses. Note that we will
# assemble the total_loss using a custom function below.
_ = cifar10.loss(logits, labels)
# Assemble all of the losses for the current tower only.
losses = tf.get_collection('losses', scope)
# Calculate the total loss for the current tower.
total_loss = tf.add_n(losses, name='total_loss')
# Attach a scalar summary to all individual losses and the total loss; do the
# same for the averaged version of the losses.
for l in losses + [total_loss]:
    # Remove 'tower_[0-9]/' from the name in case this is a multi-GPU training
    # session. This helps the clarity of presentation on tensorboard.
    loss_name = re.sub('%s_[0-9]*/' % cifar10.TOWER_NAME, '', l.op.name)
    tf.contrib.deprecated.scalar_summary(loss_name, l)
return total_loss
I'm new to TensorFlow, but from my understanding, every time cifar10.loss is called, tf.add_to_collection('losses', cross_entropy_mean) is run and the loss from the current batch is being stored in the collection.
Then losses = tf.get_collection('losses', scope) is called, and all the losses are being retrieved from the collection. Then tf.add_n op is adding all the retrieved loss tensors from this "tower" together.
I expected the loss to be just from the current training step/batch, not all batches.
Am I misunderstanding something? Or is there a reason for combining the losses together?
If weight decay is enabled, it will also add it to the losses collection.
Therefore, for each tower (scope), add_n sums all the losses in the collection: cross_entropy_mean and the weight decay terms.
Gradients are then calculated for each tower (scope). At the end, the gradients from the different towers (scopes) are averaged in average_gradients.
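To make the contents of one tower's collection concrete, here is a toy plain-Python sketch. The numbers, the two hypothetical layers, and the helper l2_weight_decay are assumptions for illustration. The penalty uses the sum(w^2) / 2 form of an L2 loss scaled by a decay factor. The list holds terms for a single batch, not losses from several batches:

```python
def l2_weight_decay(weights, wd):
    """L2 penalty: wd * sum(w**2) / 2 for one layer's weights."""
    return wd * sum(w * w for w in weights) / 2.0

# Hypothetical contents of one tower's 'losses' collection for ONE batch.
cross_entropy_mean = 1.2
weights_layer1 = [0.5, -0.5]
weights_layer2 = [1.0]

losses = [
    cross_entropy_mean,                         # data term for the current batch
    l2_weight_decay(weights_layer1, wd=0.004),  # regularization term, layer 1
    l2_weight_decay(weights_layer2, wd=0.004),  # regularization term, layer 2
]

# Plays the role of tf.add_n(losses, name='total_loss').
total_loss = sum(losses)

print(losses, total_loss)
```

The sum therefore combines different kinds of loss terms for the same step, which is why adding them together is the intended behavior.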
Why combined loss
The example you are referring to is an example of data parallelism over multiple GPUs. Data parallelism helps with training deeper models with a bigger batch_size. In this setting you need to combine the losses from the GPUs, because each GPU holds one part of the input batch (and the loss and gradients corresponding to that part). One illustration is provided in the following example from the TensorFlow data parallelism example.
Note: In the case of model parallelism, different subgraphs of the model run on separate GPUs and the intermediate outputs are collected by the master.
Example
Suppose you want to train a model using a batch size of 256, and the model is deep enough (for example, ResNet or Inception) that it may not fit on a single GPU (for example, one with 8 GB of memory). You can split the batch into two batches of size 128, run the forward pass of the model on the two batches on separate GPUs, and compute the loss and gradients on each. The computed (loss, gradients) pairs from the GPUs are collected and averaged. The averaged gradient is used to update the model parameters.
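The averaging step can be checked with a small plain-Python sketch. It assumes a hypothetical one-parameter model y = w * x with a mean squared-error loss, and a batch of 8 standing in for the batch of 256. The two halves play the role of the two GPUs:

```python
def mean_grad(w, xs, ys):
    """Gradient with respect to w of the mean of (w * x - y)**2 over the given samples."""
    grads = [2.0 * (w * x - y) * x for x, y in zip(xs, ys)]
    return sum(grads) / len(grads)

# Full batch of 8 samples (stand-in for 256), made-up data.
xs = [float(i) for i in range(1, 9)]
ys = [3.0 * x for x in xs]
w = 0.5

# Split the batch into two equal halves, one per "tower".
half = len(xs) // 2
grad_tower_0 = mean_grad(w, xs[:half], ys[:half])
grad_tower_1 = mean_grad(w, xs[half:], ys[half:])

# Average the per-tower gradients, as done before the parameter update.
averaged = (grad_tower_0 + grad_tower_1) / 2

# Gradient of the mean loss over the whole batch on a single device.
full_batch = mean_grad(w, xs, ys)

print(grad_tower_0, grad_tower_1, averaged, full_batch)
```

Because the halves have equal size and each tower uses a mean loss, the averaged gradient matches the full-batch gradient, so the update is the same as training on the whole batch at once.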