I have been working on the Udacity self-driving challenge #2. Whatever changes I make to the deep network (learning rate, activation function, etc.), I run into a zero-gradient issue while training. I have used both cross-entropy loss and MSE loss. For cross entropy, 100 classes are used with a bin width of 10 degrees, i.e. a radian angle of 0.17. For example, (-8.2 to -8.03) is class 0, (-8.03 to -7.86) is class 1, and so on.
Please find the attached screenshots. As seen there, the layer before the output (fc4 in the first image) almost becomes zero, so most of the gradients above it follow the same pattern. I need some suggestions on how to eliminate this zero-gradient problem.
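For reference, the binning described above corresponds to something like the following sketch (the bin edges and the angle_to_class helper are my own illustration, not the actual pipeline code):

NUM_CLASSES = 100
BIN_WIDTH = 0.17     # ~10 degrees in radians
ANGLE_MIN = -8.2     # lower edge of class 0

def angle_to_class(angle_rad):
    # map a steering angle in radians onto one of the 100 discrete classes
    idx = int((angle_rad - ANGLE_MIN) // BIN_WIDTH)
    return min(max(idx, 0), NUM_CLASSES - 1)

print(angle_to_class(-8.1))   # inside (-8.2, -8.03) -> class 0
print(angle_to_class(-7.9))   # inside (-8.03, -7.86) -> class 1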
This seems to be the vanishing gradient problem.
1.) Have you tried ReLU? (I know you said you have tried different activation functions.)
2.) Have you tried reducing the number of layers?
3.) Are your features normalized?
There are architectures designed to prevent this as well (e.g. LSTM), but I think you should be able to get by with something simple like the above.
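A minimal sketch of points 1 and 3, assuming a fully connected head over some feature tensor (the layer names and sizes are my own illustration):

import tensorflow as tf

def classifier_head(features, num_classes=100):
    # 3) normalize the incoming features to roughly zero mean / unit variance
    mean, var = tf.nn.moments(features, axes=[0])
    x = (features - mean) / tf.sqrt(var + 1e-8)
    # 1) ReLU activations instead of saturating sigmoids/tanh
    x = tf.layers.dense(x, 256, activation=tf.nn.relu)
    # 2) keep the stack shallow while debugging
    x = tf.layers.dense(x, 128, activation=tf.nn.relu)
    # linear logits; the softmax is applied inside the cross-entropy loss op
    return tf.layers.dense(x, num_classes, activation=None)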
I've found what is probably a rare case in Tensorflow, but I'm trying to train a classifier (linear or nonlinear) using KL divergence (cross entropy) as the cost function in Tensorflow, with soft targets/labels (labels that form a valid probability distribution but are not "hard" 1 or 0).
However, it is clear (tell-tale signs) that something is definitely wrong. I've tried both linear and nonlinear (dense neural network) forms, but no matter what, I always get the same final value for my loss function regardless of network architecture (even if I train only a bias). Also, the cost function converges extremely quickly (within 20-30 iterations) using L-BFGS (a very reliable optimizer!). Another sign something is amiss is that I can't overfit the data, and the validation set appears to have exactly the same loss value as the training set. However, strangely, I do see some improvement when I increase the network size and/or change the regularization loss. The accuracy improves with this as well (although not to the point that I'm happy with it, or that matches what I expect).
It DOES work as expected when I use the exact same code but send in one-hot encoded labels (not soft targets). An example of the cost function from training taken from Tensorboard is shown below. Can someone pitch me some ideas?
Ahh my friend, your problem is that with soft targets, especially ones that aren't close to 1 or 0, cross-entropy loss doesn't change significantly as the algorithm improves. One thing that will help you understand this is to take an example from your training data and compute its entropy; then you will know the lowest value your cost function can possibly reach. This may shed some light on your problem. So for one of your examples, let's say the targets are [0.39019628, 0.44301641, 0.16678731]. Using the formula for cross entropy
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
but plugging the targets y_ in place of the predicted probabilities y, we arrive at the true entropy value of 1.0266190072458234. If your predictions are just slightly off target, say [0.39511779, 0.44509024, 0.15979198], then the cross entropy is 1.026805558049737.
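A quick way to verify those numbers (a minimal NumPy sketch, not part of the original code):

import numpy as np

y_true = np.array([0.39019628, 0.44301641, 0.16678731])
y_pred = np.array([0.39511779, 0.44509024, 0.15979198])

# entropy of the soft target = the lowest cross entropy any model can reach
print(-np.sum(y_true * np.log(y_true)))   # ~1.02662
# cross entropy of the slightly-off prediction above
print(-np.sum(y_true * np.log(y_pred)))   # ~1.02681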
Now, as with most difficult problems, it's not just one thing but a combination of things. The loss function is being implemented correctly, but you made the "mistake" of doing what you should do in 99.9% of cases when training deep learning algorithms: you used 32-bit floats. In this particular case, though, you will run out of significant digits that a 32-bit float can represent well before your training algorithm converges to a nice result. If I use your exact same data and code but only change the data types to 64-bit floats, you can see below that the results are much better: the algorithm continues to train well out past 2000 iterations, and you will see it reflected in your accuracy as well. In fact, you can see from the slope that if 128-bit floating point were supported, you could continue training and probably see further gains. You wouldn't need that precision in your final weights and biases, just during training to support continued optimization of the cost function.
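In TensorFlow terms, that change amounts to declaring the placeholders and variables (and therefore the downstream ops) in float64 instead of float32. A minimal sketch, with hypothetical shapes and placeholder names:

import tensorflow as tf

num_features, num_classes = 10, 3

# float64 keeps enough significant digits for the tiny loss differences above
x  = tf.placeholder(tf.float64, [None, num_features])
y_ = tf.placeholder(tf.float64, [None, num_classes])

W = tf.Variable(tf.zeros([num_features, num_classes], dtype=tf.float64))
b = tf.Variable(tf.zeros([num_classes], dtype=tf.float64))
y = tf.nn.softmax(tf.matmul(x, W) + b)

cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))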
When using neural networks for classification, it is said that:
You generally want to use softmax cross-entropy output, as this gives you the probability of each of the possible options.
In the common case where there are only two options, you want to use a sigmoid, which is the same thing except that it avoids redundantly outputting both p and 1-p.
The way to calculate softmax cross entropy in TensorFlow seems to be along the lines of:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction,labels=y))
So the output can be connected directly to the minimization code, which is good.
The code I have for sigmoid output, likewise based on various tutorials and examples, is along the lines of:
p = tf.sigmoid(tf.squeeze(...))
cost = tf.reduce_mean((p - y)**2)
I would have thought the two should be similar in form, since they are doing the same job in almost the same way, but the code fragments above look almost completely different. Furthermore, the sigmoid version explicitly squares the error whereas the softmax version doesn't. (Is the squaring happening somewhere inside the softmax implementation, or is something else going on?)
Is one of the above simply incorrect, or is there a reason why they need to be completely different?
The softmax cross-entropy cost and the squared-error cost of a sigmoid are completely different cost functions. Though they seem to be closely related, they are not the same thing.

It is true that both functions are "doing the same job" if the job is defined as "be the output layer of a classification network". Similarly, you could say that "softmax regression and neural networks are doing the same job": both techniques are trying to classify things, but in different ways.

The softmax layer with cross-entropy cost is usually preferred over sigmoids with L2 loss. Softmax with cross entropy has its own pros, such as a stronger gradient at the output layer and normalization to a probability vector, whereas the derivatives of sigmoids with L2 loss are weaker. You can find plenty of explanations in this beautiful book.
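As a side note, the two-class form that actually parallels the softmax snippet in the question is sigmoid cross entropy, not squared error; TensorFlow exposes it directly. A minimal sketch (the placeholder shapes are my own):

import tensorflow as tf

# raw, un-activated output of the final layer and binary labels (0.0 or 1.0)
logits = tf.placeholder(tf.float32, [None])
y = tf.placeholder(tf.float32, [None])

# the binary analogue of softmax_cross_entropy_with_logits:
# the sigmoid is applied internally, so there is no squaring and no explicit tf.sigmoid call
cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=y))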
I am trying to build a neural net for the CIFAR-10 dataset, but I am facing a very odd problem. I have tried over 6 different CNN architectures with many different hyperparameters and numbers of fully connected neurons, but all of them seem to fail with a loss of 2.302 and a corresponding accuracy of 0.0625. Why does this happen? What property of a CNN or neural net causes this? I also tried dropout, L2 regularization, different kernel sizes, and different padding in the conv and max-pool layers. I don't understand why the loss gets stuck at such an odd number.

I am implementing this in TensorFlow, and I have tried a softmax layer + cross_entropy_loss, as well as no softmax layer + sparse_cross_entropy_loss. Is this a plateau that the loss function is stuck at?
It seems like you accidentally applied a non-linearity/activation function to the last layer of your network. Keep in mind that cross entropy works on values in the range between 0 and 1. Since your output is already "forced" into this range by the softmax applied just before the cross entropy is computed, the last layer should have a linear activation (i.e. don't add one).

By the way, the value 2.302 is no accident. It is the softmax loss -ln(0.1) that you get when all 10 classes (CIFAR-10) start out with the same diffuse probability of 0.1. Check out the explanation by Andrej Karpathy:
http://cs231n.github.io/neural-networks-3/
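A quick sanity check of that number, together with the "linear last layer" fix (a minimal sketch; the placeholder shapes and layer call are illustrative, not the asker's code):

import numpy as np
import tensorflow as tf

# uniform predictions over 10 classes give exactly this starting loss
print(-np.log(0.1))   # 2.3025850929940455

features = tf.placeholder(tf.float32, [None, 256])   # output of the conv stack
labels = tf.placeholder(tf.float32, [None, 10])      # one-hot CIFAR-10 labels

# last layer: no activation -- the softmax is applied inside the loss op
logits = tf.layers.dense(features, 10, activation=None)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))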
I am following this tutorial on RNNs, where on line 177 the following code is executed.
max_grad_norm = 10
....
grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), max_grad_norm)
optimizer = tf.train.GradientDescentOptimizer(self.lr)
self._train_op = optimizer.apply_gradients(
    zip(grads, tvars),
    global_step=tf.contrib.framework.get_or_create_global_step())
Why do we do clip_by_global_norm? How is the value of max_grad_norm decided?
The reason for clipping the norm is that otherwise it may explode:
There are two widely known issues with properly training recurrent neural networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients
The above is taken from this paper.
In terms of how to set max_grad_norm, you can play with it a bit to see how it affects your results. It is usually set to quite a small number (I have seen 5 in several cases). Note that TensorFlow does not force you to specify this value; if you don't, it will specify it itself (as explained in the documentation).
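What the clipping actually does can be written in a few lines (a minimal NumPy sketch of the rescaling rule, not the TensorFlow implementation):

import numpy as np

def clip_by_global_norm(grads, max_grad_norm):
    # joint L2 norm of all gradient tensors
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > max_grad_norm:
        # rescale every gradient by the same factor so the joint norm equals max_grad_norm
        scale = max_grad_norm / global_norm
        grads = [g * scale for g in grads]
    return grads, global_norm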
The reason that exploding/vanishing gradients are common in RNNs is that during backpropagation (here called backpropagation through time), we need to multiply gradient matrices all the way back to t=0 (that is, if we are currently at t=100, say the 100th character in a sentence, we need to multiply 100 matrices). Here is the equation for t=3:
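In standard BPTT notation (my reconstruction of the equation, with s_t the hidden state, \hat{y}_3 the prediction, E_3 the loss at step 3, and W the recurrent weights):

\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial s_3}\,\left(\prod_{j=k+1}^{3} \frac{\partial s_j}{\partial s_{j-1}}\right)\frac{\partial s_k}{\partial W}

The product of the \partial s_j / \partial s_{j-1} Jacobians is the chain of matrices being multiplied.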
(this equation is taken from here)
If the norm of these matrices is bigger than 1, the product will eventually explode; if it is smaller than 1, it will eventually vanish. This can happen in ordinary neural networks as well if they have a lot of hidden layers. However, feed-forward networks usually don't have that many hidden layers, while the input sequences to an RNN can easily be many characters long.
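A tiny numerical illustration of that claim (my own sketch, using a symmetric matrix so the norm statement holds exactly):

import numpy as np

def repeated_product_norm(scale, steps=100, dim=10, seed=0):
    # build a symmetric matrix and rescale it so its spectral norm equals `scale`
    G = np.random.RandomState(seed).randn(dim, dim)
    W = (G + G.T) / 2
    W = scale * W / np.linalg.norm(W, 2)
    # norm of the product of `steps` copies of W
    return np.linalg.norm(np.linalg.matrix_power(W, steps), 2)

print(repeated_product_norm(1.1))   # ~1.1**100 ~ 1.4e4  (explodes)
print(repeated_product_norm(0.9))   # ~0.9**100 ~ 2.7e-5 (vanishes)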
I'm using TensorFlow for a multi-target regression problem. Specifically, I have a convolutional network with pixel-wise labeling: the input is an image and the label is a "heat-map" in which each pixel has a float value. More specifically, the ground-truth label for each pixel is lower bounded by zero and, while technically having no upper bound, usually gets no larger than 1e-2.
Without batch normalization, the network is able to give a reasonable heat-map prediction. With batch normalization, the network takes much longer to reach a reasonable loss value, and the best it does is predict the average value for every pixel. This is using the tf.contrib.layers conv2d and batch_norm methods, with batch_norm passed as the conv2d's normalization_fn (or not, in the case of no batch normalization). I briefly tried batch normalization on another (single-value) regression network and had trouble then as well (though I didn't test that as extensively). Is there a problem with using batch normalization on regression problems in general? Is there a common solution?

If not, what could cause batch normalization to fail on such an application? I've attempted a variety of initializations, learning rates, etc. I would expect the final layer (which of course does not use batch normalization) to be able to use its weights to scale the output of the penultimate layer to the appropriate regression values. Failing that, I removed batch norm from that layer too, but with no improvement. I've attempted a small classification problem using batch normalization and saw no problem there, so it seems reasonable that it could be due somehow to the nature of the regression problem, but I don't see how that could cause such a drastic difference. Is batch normalization known to have trouble on regression problems?
I believe your issue is in the labels. Batch norm will scale all input values into a 0-1 range. If the labels are not scaled to a similar range, the task will be more difficult, because it requires the network to learn outputs on a different scale.

By removing batch norm from the penultimate layer you may improve things slightly, but you are still requiring a layer to learn to downscale its input while that input keeps getting normalized back to the 0-1 range (the opposite of your objective).

To solve this, apply a 0-1 scaler to the labels so that your upper bound is no longer 1e-2. At inference time, transform the predictions back with the inverse of the same function to get the actual prediction.
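A minimal sketch of that label-scaling trick (the bounds and helper names are my own, not from the answer):

import numpy as np

LABEL_MIN, LABEL_MAX = 0.0, 1e-2   # observed range of the heat-map values

def scale_labels(y):
    # map raw heat-map labels from [LABEL_MIN, LABEL_MAX] into [0, 1] for training
    return (y - LABEL_MIN) / (LABEL_MAX - LABEL_MIN)

def unscale_predictions(y_pred):
    # invert the scaling at inference time to recover heat-map units
    return y_pred * (LABEL_MAX - LABEL_MIN) + LABEL_MIN

print(scale_labels(np.array([0.0, 5e-3, 1e-2])))   # -> [0.  0.5 1. ]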