I use tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits)) to calculate the loss, and the output is around 1.15.
Is that low enough? Since the labels and the softmax-normalized logits each sum to 1, I assume the scale of this loss function is fixed, so I wonder how low the loss value should be. Or is there another way to evaluate the loss?
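Whether 1.15 is low enough depends mainly on the number of classes rather than on a universal threshold: a model that always predicts the uniform distribution over C classes has a cross-entropy of ln(C) (about 2.30 for 10 classes), which is a natural baseline, while 0 is the lower bound for one-hot labels. A minimal sketch of that comparison (assuming TF 2.x and a hypothetical 10-class problem):

import numpy as np
import tensorflow as tf  # assuming TF 2.x

num_classes = 10  # hypothetical; use your own class count

# One-hot labels and arbitrary logits, purely for illustration.
labels = tf.one_hot([3, 7], depth=num_classes)
logits = tf.random.normal([2, num_classes])

model_loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# Baseline: a model that always predicts the uniform distribution.
uniform_baseline = np.log(num_classes)  # ~2.30 for 10 classes

print(f"model loss: {model_loss.numpy():.3f}, uniform baseline: {uniform_baseline:.3f}")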
I know this has probably been addressed many times, but I keep hearing conflicting points, and I'm unsure how I should go about computing the loss function and, moreover, how to compute the gradient over a mini-batch.
Let's say I have a simple linear regression ANN model with one input, one output and no activation function. The weight matrix (W) and bias matrix (B) then just have the shape (1, 1). If I batch my data into mini-batches of size 32, then the input matrix (X) will have dimensions (1, 32). Forward prop is then performed without a hitch: it's just W.X + B which works fine because the shapes are compatible. It then produces the predictions, which can be denoted by matrix Yi with shape (1, 32). The cost is then computed as the mean squared error of the model outputs and the truth values. In this case, there's only one output, so the cost over one training example is just (truth - predicted)².
So I'm confused about a couple of aspects at this point. Do you a) compute the average cost over a mini-batch, then compute the derivative of the averaged cost w.r.t. the weight and bias; or b) calculate individual costs for each example in the mini-batch, compute the derivatives of those costs w.r.t. the weight and bias, and finally sum the gradients and average them?
Since the gradient is a linear operator, the two approaches give the same result:
grad((cost(x1) + ... + cost(xn)) / n) = (grad(cost(x1)) + ... + grad(cost(xn))) / n.
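A quick numerical check of that identity for the single-weight linear model described above (a sketch assuming TF 2.x; the names W, B, X, Y are illustrative):

import tensorflow as tf  # assuming TF 2.x

W = tf.Variable([[2.0]])
B = tf.Variable([[0.5]])
X = tf.random.normal([1, 32])   # mini-batch of 32 examples
Y = 3.0 * X + 1.0               # synthetic ground truth

# Option (a): average the costs first, then differentiate once.
with tf.GradientTape() as tape:
    pred = tf.matmul(W, X) + B                      # forward pass: W.X + B
    mean_cost = tf.reduce_mean(tf.square(Y - pred))
grad_of_mean = tape.gradient(mean_cost, W)

# Option (b): differentiate each example's cost, then average the gradients.
grads = []
for i in range(32):
    with tf.GradientTape() as tape:
        pred_i = tf.matmul(W, X[:, i:i+1]) + B
        cost_i = tf.square(Y[:, i:i+1] - pred_i)
    grads.append(tape.gradient(cost_i, W))
mean_of_grads = tf.add_n(grads) / 32.0

print(grad_of_mean.numpy(), mean_of_grads.numpy())  # identical up to float error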
A previous developer applied a neural network and gave me results for loss, MSE and MAE.
How do I compare these results with my model (linear regression)? I can calculate MSE and MAE, but what is loss?
Loss is the "cost function" that is being optimized. It is assigned when the model is compiled (i.e. you can set the loss to MSE, MAE, etc., and track metrics such as accuracy alongside it).
You can take a look at the .compile line of your code to see what the loss is set to.
https://www.tensorflow.org/api_docs/python/tf/keras/Model#compile
https://www.youtube.com/watch?v=0twSSFZN9Mc
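As a concrete illustration (a minimal sketch assuming tf.keras and a made-up one-input regression model): the loss is whatever is passed to the loss argument of compile and is what training minimizes, while MSE/MAE can also be tracked as metrics that are only reported.

import tensorflow as tf  # assuming TF 2.x / tf.keras

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])

model.compile(optimizer='sgd',
              loss='mse',        # optimized; reported as "loss" in the training logs
              metrics=['mae'])   # monitored only, not optimized

# fit() will then report both 'loss' (here, the MSE) and 'mae' per epoch.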
How could tensorflow optimize the batch's element losses individually instead of optimizing the batch loss?
When optimizing the loss for each batch, the common way is summing or taking the average of all the batch's element losses as the batch loss, and then optimizing this batch loss. In my case, I would like to optimize each element's loss individually, instead of reducing them together as the batch loss.
For example, in the following code:
losses = tf.nn.nce_loss(<my batch inputs here>)  # per-example losses, shape [batch_size]
loss = tf.reduce_mean(losses)                    # reduce to a scalar batch loss
optim = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(loss)
How could I skip loss = tf.reduce_mean(losses) and minimize the tensor losses directly? (In this way, the mini-batch effectively reduces to a batch size of 1.)
I have tried feeding the losses to minimize directly:
optim = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(losses)  # instead of loss
I am not sure how the minimization will work in this case. When I run it in the session, the losses tend to explode to NaN.
So is it possible to achieve the above aim in tensorflow?
The difference between the gradients of tf.reduce_mean(losses) and the gradients of losses is that for the losses tensor you get the SUM of the gradients (the sum over the gradients of each sample in the batch), while for tf.reduce_mean(losses) you get the MEAN of the gradients. That's why you start to get NaN values: the sum of the gradients becomes a very large number as the batch size increases.
If you are going to optimize a tensor of losses instead of the reduced mean loss, you can get the exact equivalent by dividing your learning rate by the batch size.
To optimize individually for each sample, just feed one sample per batch.
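A small sketch of that equivalence (assuming TF 2.x; the per-example squared errors are just illustrative): the gradient of the summed losses is batch_size times the gradient of the mean loss, so a gradient-descent step with learning_rate / batch_size on the sum matches a step with learning_rate on the mean.

import tensorflow as tf  # assuming TF 2.x

batch_size = 32
w = tf.Variable(1.0)
x = tf.random.normal([batch_size])
y = 2.0 * x

with tf.GradientTape(persistent=True) as tape:
    per_example = tf.square(y - w * x)        # per-example losses
    loss_mean = tf.reduce_mean(per_example)
    loss_sum = tf.reduce_sum(per_example)

g_mean = tape.gradient(loss_mean, w)
g_sum = tape.gradient(loss_sum, w)
del tape

lr = 0.01
step_mean = lr * g_mean                # update size from the mean loss
step_sum = (lr / batch_size) * g_sum   # scaled update size from the summed losses
print(step_mean.numpy(), step_sum.numpy())  # equal up to float error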
In tf.estimator, what's the difference between average_loss and loss? I would have guessed from the names that the former would be the latter divided by the number of records, but that's not the case; with a few thousand records, the latter is about three or four times the former.
The difference between average_loss and loss is that the latter reduces the batch losses with a SUM, while the former takes the MEAN over the same losses. Hence, the ratio is exactly the batch_size argument of your input_fn. If you pass batch_size=1, you should see them equal.
The actual reported tensors depend on the particular type of tf.Estimator, but they are very similar, here's the source code for the regression head (corresponds to tf.DNNRegressor):
training_loss = losses.compute_weighted_loss(
    unweighted_loss, weights=weights, reduction=losses.Reduction.SUM)
mean_loss = metrics_lib.mean(unweighted_loss, weights=weights)
As you can see, they are computed from the same unweighted_loss and weights tensors. The same values are reported to tensorboard summary.
The actual ratio is exactly 4.0, which corresponds to the batch size.
When you train a network, you usually feed inputs as a batch.
In the example you refer to, the batch size is 4, so loss is the sum of the losses over the whole batch, while average_loss is their average.
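A tiny sketch of that relationship (assuming TF 2.x; the per-example losses are stand-ins, with reduce_sum and reduce_mean mirroring the SUM and MEAN reductions in the head above):

import tensorflow as tf  # assuming TF 2.x

batch_size = 4
unweighted_loss = tf.random.uniform([batch_size])  # stand-in per-example losses

loss = tf.reduce_sum(unweighted_loss)           # SUM reduction -> reported as loss
average_loss = tf.reduce_mean(unweighted_loss)  # MEAN reduction -> average_loss

print((loss / average_loss).numpy())  # == batch_size, i.e. 4.0 here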
I have a few questions about batch training of neural networks.
First, when we update weights using batch training, the amount of change is the gradient accumulated over the batch. In this case, is the amount of change the sum of the gradients, or the average of the gradients?
If the answer is the sum of the gradients, the amount of change will be much bigger than in online training, because the gradients are accumulated. In this case, I don't think the weights can be optimized well.
Otherwise, if the answer is the average of the gradients, then it seems very reasonable that the weights can be optimized well. However, in this case, we would have to train for many more iterations than in online training, because the weights are updated only once per batch of data, and only by a small amount.
Second, whatever the answer to the first question is, when I use the TensorFlow CNN sample code for MNIST shown below, it optimizes the weights so fast that the training accuracy rises above 90% even at the second step.
=======================================================================
train_step = tf.train.GradientDescentOptimizer(0.001).minimize(cross_entropy)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
for i in range(1000):
    batch = mnist.train.next_batch(100)
    if i % 100 == 0:
        train_accuracy = sess.run(accuracy, feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
    sess.run(train_step, feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
========================================================================
Please explain how TensorFlow optimizes the weights so fast.
The answer to this question depends on your loss function.
If loss_element is your loss function for one element of the batch, then the loss of your batch will be some function of all your individual losses.
For example, if you choose to use tf.reduce_mean, then your loss is averaged over all the elements of your batch, and so is the gradient. If you use tf.reduce_sum, then your gradient will be the sum of the element-wise gradients.
Using the sum of the gradients or the average gradient comes down to the same thing, because you later have to find a good learning rate, and that tuning will most likely absorb the division by the batch size that the averaged gradient introduces.
However, using the average over the batch has the advantage of making the loss comparable between two training runs that use different batch sizes.
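For instance (a minimal sketch with made-up per-example losses), the averaged loss stays on the same scale when the batch size changes, while the summed loss grows with it:

import tensorflow as tf  # assuming TF 2.x

for batch_size in (32, 128):
    per_example = tf.random.uniform([batch_size])  # stand-in per-example losses
    print(batch_size,
          "mean:", tf.reduce_mean(per_example).numpy(),  # comparable across sizes
          "sum:", tf.reduce_sum(per_example).numpy())    # grows with batch size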