Why do we clip_by_global_norm to obtain gradients while performing RNN - tensorflow

I am following this tutorial on RNN where on line 177 the following code is executed.
max_grad_norm = 10
....
grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), max_grad_norm)
optimizer = tf.train.GradientDescentOptimizer(self.lr)
self._train_op = optimizer.apply_gradients(zip(grads, tvars),
global_step=tf.contrib.framework.get_or_create_global_step())
Why do we do clip_by_global_norm? How is the value of max_grad_norm decided?

The reason for clipping the norm is that otherwise it may explode:
There are two widely known issues with properly training recurrent
neural networks, the vanishing and the exploding gradient problems
detailed in Bengio et al. (1994). In this paper we attempt to improve
the understanding of the underlying issues by exploring these problems
from an analytical, a geometric and a dynamical systems perspective.
Our analysis is used to justify a simple yet effective solution. We
propose a gradient norm clipping strategy to deal with exploding
gradients
The above is taken from this paper.
As for how to set max_grad_norm, you can experiment with it to see how it affects your results. It is usually set to a fairly small number (I have seen 5 in several cases). Note that clip_norm is a required argument of tf.clip_by_global_norm; the optional argument is use_norm, the precomputed global norm, which TensorFlow computes itself if you don't supply it (as explained in the documentation).
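To make the clipping step concrete, here is a minimal pure-Python sketch of global-norm clipping. The function name clip_by_global_norm_sketch and the list-of-lists gradient layout are assumptions made for illustration; it follows the documented rule t * clip_norm / max(global_norm, clip_norm) but is not TensorFlow's implementation.

```python
import math

def clip_by_global_norm_sketch(grads, clip_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is at most clip_norm."""
    # The global norm is taken over every element of every gradient vector.
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # If the norm is already below clip_norm, the scale is exactly 1 (no change).
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, global_norm

# Hypothetical gradients for two variables.
grads = [[3.0], [4.0]]
clipped, norm = clip_by_global_norm_sketch(grads, clip_norm=2.5)
unchanged, _ = clip_by_global_norm_sketch(grads, clip_norm=10.0)
print(norm, clipped, unchanged)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length is capped.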
The reason exploding/vanishing gradients are common in RNNs is that during backpropagation (here called backpropagation through time) we need to multiply the gradient matrices all the way back to t=0 (that is, if we are currently at t=100, say the 100th character in a sentence, we need to multiply 100 matrices). Here is the equation for t=3:
(this equation is taken from here)
If the norm of the matrices is bigger than 1, the product will eventually explode. If it is smaller than 1, it will eventually vanish. This may happen in ordinary neural networks as well if they have a lot of hidden layers. However, feed-forward neural networks usually don't have that many hidden layers, while the input sequences to an RNN can easily have many characters.
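A scalar stand-in for the matrix product shows the effect. This is only a sketch: the factors 1.1 and 0.9 and the helper name repeated_product are made up for illustration, and a single number replaces each gradient matrix.

```python
def repeated_product(factor, steps):
    """Multiply the same factor together `steps` times, as a stand-in for a chain of Jacobians."""
    result = 1.0
    for _ in range(steps):
        result *= factor
    return result

# 100 time steps, with a per-step factor slightly above or slightly below 1.
grow = repeated_product(1.1, 100)
shrink = repeated_product(0.9, 100)
print(grow, shrink)
```

Even a factor that is only 10% away from 1 gives a product that is several orders of magnitude too large or too small after 100 steps.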

Multiple questions regarding the KL term in the ELBO loss with TensorFlow Probability

I have been trying to conduct a few experiments using TensorFlow Probability (TFP), and I have a few questions.
What is the proper value of the coefficient of the KL loss?
In the paper by Blundell (2015), the coefficient is set to 1/M (where M is the number of mini-batches). In the example given by TFP, the coefficient is given as 1/mnist_data.train.num_examples. Why?
As I go from 2D input to 3D image volumes, the KL loss is still significantly larger (~1k) than the cross-entropy (~1), even after dividing by mnist_data.train.num_examples. Why?
What is the guideline for getting a proper value for this coefficient? Maybe the two loss terms should be the same order of magnitude?
The current coefficient only takes care of the number of training samples, but not the network complexity or the number of parameters in the network, and I assume the KL loss increases with the complexity of the model.
I am trying to implement a neural network with the KL loss without using keras.model.losses, because of some software production and hardware support limitations. I am trying to train my model with TF 1.10 and TFP 0.3.0. The issue is that for tf<=1.14, tf.keras.model does not support tf.layers inside the Keras model, so I can't use my original model straight away. Is there a way to get the KL loss, not from model.losses, but from the layers or weights of the network in a TF construct?
Is batch normalization or group normalization still helpful in Bayesian deep learning?
In the paper by Blundell (2015), the coefficient is set to 1/M (where M is the number of mini-batches). In the example given by TFP, the coefficient is given as 1/mnist_data.train.num_examples. Why?
In the BBB paper eq. 8, they refer to M being the number of mini-batches. To be consistent with the non-stochastic gradient learning, it should be scaled by the number of mini-batches, which is what is done by Graves. Another alternative is the one in eq. 9, where they scale it by \pi_i, where the values in the set {\pi} sum to one.
In the TFP example, num_examples does look like the total number of independent samples in the training set, which is much larger than the number of batches. Down-weighting the KL term like this goes by a few names, such as Safe Bayes or tempering. Have a look at sec. 8 of this paper for more discussion of the use of tempering within Bayesian inference and its suitability.
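When comparing the two coefficients it is also worth checking how the likelihood term is reduced over the batch. The sketch below uses made-up numbers and hypothetical function names, and it assumes one loss sums the negative log-likelihood over a batch (with KL/M) while the other averages it (with KL/num_examples). I have not verified this against the TFP example, so treat it as an assumption to check rather than a description of that code.

```python
def loss_sum_nll_kl_over_m(nll_per_example, kl, num_batches):
    """Minibatch cost with a summed NLL and the KL term weighted by 1/M."""
    return sum(nll_per_example) + kl / num_batches

def loss_mean_nll_kl_over_n(nll_per_example, kl, num_examples):
    """Minibatch cost with an averaged NLL and the KL term weighted by 1/num_examples."""
    return sum(nll_per_example) / len(nll_per_example) + kl / num_examples

# Hypothetical setup: 50,000 examples in batches of 100 gives M = 500 batches.
batch_size = 100
num_examples = 50000
num_batches = num_examples // batch_size
kl = 1000.0
nll_per_example = [1.2] * batch_size

summed = loss_sum_nll_kl_over_m(nll_per_example, kl, num_batches)
averaged = loss_mean_nll_kl_over_n(nll_per_example, kl, num_examples)
# Under these assumptions the two costs differ only by a factor of batch_size.
print(summed / batch_size, averaged)
```

If the likelihood term in your own code is a mean over the batch, dividing the KL by the number of examples gives the same relative weighting as 1/M with a summed likelihood; if it is a sum, the KL term really is being down-weighted.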
As I go from 2D input to 3D image volumes, the KL loss is still significantly larger (~1k) than the cross-entropy (~1), even after dividing by mnist_data.train.num_examples. Why?
The ELBO loss will always be larger than just your cross-entropy (which defines your likelihood). Have a look at how the KL divergence term in the ELBO is found, assuming a full mean-field approach where each weight/parameter is independent.
Since the assumed posterior is factorised (each parameter is assumed independent), you can write the joint distribution as a product. This means that when you take the log while computing the KL between the approximate posterior and the prior, you can write it as a sum of the KL terms for each parameter. Since the KL is >= 0, every parameter you add to your model adds another non-negative term to your ELBO loss. This is likely why the loss is so much larger for your 3D model: there are more parameters.
Another reason this could occur is if you have less data (your M is smaller, so the KL term is scaled down less).
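The sum-over-parameters argument can be sketched numerically. The example assumes a factorised Gaussian posterior and a standard normal prior N(0, 1) for every weight, which is a simplification chosen for illustration (Bayes by Backprop also considers other priors); the function names and the values mu=0.1, sigma=0.5 are made up.

```python
import math

def kl_gaussian_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ) for a single parameter."""
    return 0.5 * (sigma ** 2 + mu ** 2 - 1.0) - math.log(sigma)

def total_kl(params):
    """Mean-field KL: the sum of the per-parameter KL terms."""
    return sum(kl_gaussian_to_standard_normal(mu, sigma) for mu, sigma in params)

# Two hypothetical models whose weights all share the same posterior (mu=0.1, sigma=0.5).
small_model = [(0.1, 0.5)] * 100
large_model = [(0.1, 0.5)] * 1000

kl_small = total_kl(small_model)
kl_large = total_kl(large_model)
print(kl_small, kl_large)
```

With identical per-weight posteriors, ten times as many parameters gives ten times the KL, regardless of how well either model fits the data.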
What is the guideline for getting a proper value for this coefficient? Maybe the two loss terms should be the same order of magnitude?
I am unsure of any specific guideline. For training you are interested primarily in the gradients, and a large loss does not mean a large gradient. Have a look at the gradients contributed by the negative log likelihood and the KL term in your ELBO. If the KL term is too large, you probably need a more informative prior or more data (you could simply scale the KL term, but this feels a bit yucky for the Bayesian in me).
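A toy illustration of the point that loss magnitude and gradient magnitude are separate things. Both loss functions and the helper numerical_gradient are invented for this sketch; they stand in for a large but nearly flat term and a small but steep one.

```python
def numerical_gradient(f, x, eps=1e-4):
    """Central-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def large_flat_loss(x):
    # Large value, tiny slope.
    return 1000.0 + 0.001 * x

def small_steep_loss(x):
    # Small value, much larger slope.
    return 10.0 * x ** 2

flat_value = large_flat_loss(1.0)
steep_value = small_steep_loss(0.1)
flat_grad = numerical_gradient(large_flat_loss, 1.0)
steep_grad = numerical_gradient(small_steep_loss, 0.1)
print(flat_value, flat_grad, steep_value, steep_grad)
```

The term with the far larger value contributes the far smaller gradient here, which is why comparing the gradients of the two ELBO terms tells you more than comparing their values.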
The current coefficient only takes care of the number of training samples, but not the network complexity or the number of parameters in the network, and I assume the KL loss increases with the complexity of the model.
Yes, as stated before, in general more parameters means a larger ELBO loss (for a mean-field approach as used in Bayes by Backprop).
I am trying to implement a neural network with the KL loss without using keras.model.losses, because of some software production and hardware support limitations. I am trying to train my model with TF 1.10 and TFP 0.3.0. The issue is that for tf<=1.14, tf.keras.model does not support tf.layers inside the Keras model, so I can't use my original model straight away. Is there a way to get the KL loss, not from model.losses, but from the layers or weights of the network in a TF construct?
I am unsure about the best way to tackle this part of it. I would be cautious about going to older versions where it isn't explicitly supported. They put those warnings/exceptions in for a reason.
Is batch normalization or group normalization still helpful in Bayesian deep learning?
For variational inference (as done in Bayes by Backprop) batch normalization is fine. For sampling methods such as MCMC, batch normalization is no longer suitable. Have a look at https://arxiv.org/pdf/1908.03491v1.pdf for info on the suitability of batch norm with sampling methods for approximate Bayesian inference.

Neural network immediately overfitting

I have a FFNN with 2 hidden layers for a regression task that overfits almost immediately (epoch 2-5, depending on # hidden units). (ReLU, Adam, MSE, same # hidden units per layer, tf.keras)
32 neurons: [plot not shown]
128 neurons: [plot not shown]
I will be tuning the number of hidden units, but to limit the search space I would like to know what the upper and lower bounds should be.
Afaik it is better to have a too large network and try to regularize via L2-reg or dropout than to lower the network's capacity -- because a larger network will have more local minima, but the actual loss value will be better.
Is there any point in trying to regularize (via e.g. dropout) a network that overfits from the get-go?
If so I suppose I could increase both bounds. If not I would lower them.
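from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# n_neurons: number of hidden units per layer (e.g. 32 or 128)
model = Sequential()
model.add(Dense(n_neurons, activation='relu'))
model.add(Dense(n_neurons, activation='relu'))
model.add(Dense(1, activation='linear'))
model.compile(optimizer='adam', loss='mse')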
Hyperparameter tuning is generally the hardest step in ML. In general we try different values at random, evaluate the model, and choose the set of values that gives the best performance.
Getting back to your question: you have a high variance problem (good in training, bad in testing).
There are eight things you can do, in order:
Make sure your test and training distributions are the same.
Make sure you shuffle and then split the data into two sets (test and train).
A good train:test split will be 105:15K
Use a deeper network with Dropout/L2 regularization.
Increase your training set size.
Try Early Stopping
Change your loss function
Change the network architecture (Switch to ConvNets, LSTM etc).
Depending on your computation power and time you can set a bound to the number of hidden units and hidden layers you can have.
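Of the items above, early stopping is the easiest to sketch without any framework. The function name, the patience value and the validation losses below are all made up for illustration; in tf.keras you would normally use the built-in EarlyStopping callback rather than a hand-written loop.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch), both 0-indexed, for per-epoch validation losses.

    Training stops once `patience` epochs have passed without a new best loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Never ran out of patience: training ran to the last epoch.
    return best_epoch, len(val_losses) - 1

# Hypothetical validation curve that starts overfitting after epoch 2.
val_losses = [0.90, 0.70, 0.65, 0.68, 0.74, 0.80]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=2)
print(best_epoch, stop_epoch)
```

You would then restore the weights saved at best_epoch rather than keep the ones from stop_epoch.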
because a larger network will have more local minima.
Nope, this is not quite true. In reality, as the number of input dimensions increases, the chance of getting stuck in a local minimum decreases, so we usually ignore the problem of local minima. It is very rare: the derivatives across all the dimensions in the working space must be zero for a local/global minimum, which is highly unlikely in a typical model.
One more thing: I noticed you are using a linear unit for the last layer. I suggest you go for ReLU instead. In general we do not need negative values in regression. It will reduce test/train error.
Take this :
In MSE 1/2 * (y_true - y_prediction)^2
Because y_prediction can be a negative value, the whole MSE term may blow up to large values as y_prediction gets highly negative or highly positive.
Using a ReLU for the last layer makes sure that y_prediction is non-negative. Hence a lower error is expected.
Let me try to substantiate some of the ideas here, referenced from Ian Goodfellow et al.'s Deep Learning book, which is available for free online:
Chapter 7: Regularization The most important point is data: one can and should avoid regularization if one has large amounts of data that best approximate the distribution. In your case, it looks like there might be a significant discrepancy between training and test data. You need to ensure the data is consistent.
Section 7.4: Data-augmentation With regards to data, Goodfellow talks about data-augmentation and inducing regularization by injecting noise (most likely Gaussian) which mathematically has the same effect. This noise works well with regression tasks as you limit the model from latching onto a single feature to overfit.
Section 7.8: Early Stopping is useful if you just want a model with the best test error. But again this only works if your data allows the training to infer the test data. If there is an immediate increase in test error the training would stop immediately.
Section 7.12: Dropout Just applying dropout to a regression model doesn't necessarily help. In fact "when extremely few labeled training examples are available, dropout is less effective". For classification, dropout forces the model to not rely on single features, but in regression all inputs might be required to compute a value rather than classify.
Chapter 11: Practicals emphasises the use of base models to ensure that the training task is not trivial. If a simple linear regression can achieve similar behaviour, then you don't even have a training problem to begin with.
Bottom line is you can't just play with the model and hope for the best. Check the data, understand what is required and then apply the corresponding techniques. For more details read the book, it's very good. Your starting point should be a simple regression model, 1 layer, very few neurons and see what happens. Then incrementally experiment.
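As a sketch of the "start with a simple base model" advice, the snippet below compares a predict-the-mean baseline with a one-variable least-squares line in plain Python. The data points and function names are invented for illustration; with real data you would evaluate on a held-out set.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Hypothetical, roughly linear data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 8.8]

slope, intercept = fit_line(xs, ys)
mean_baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
linear_mse = mse(ys, [slope * x + intercept for x in xs])
print(mean_baseline_mse, linear_mse)
```

If the network's test error is not clearly better than linear_mse, the extra capacity is not buying anything, and the problem is more likely in the data than in the architecture.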

Is batchnorm used in neural networks that are not CNN?

1.) Batchnorm is always used in deep convolutional neural networks. But is it also used in non-CNN networks, that is, in networks with just fully-connected layers?
2.) Is batchnorm used in shallow CNNs?
3.) If I have a CNN with an input image and an input array IN_array, the output is an array after the last fully-connected layer. I call this array FC_array. If I want to concat that FC_array with the IN_array.
CONCAT_array = tf.concat(values=[FC_array, IN_array], axis=1)  # axis=1 assumes both tensors are [batch, features]
Is it useful to have a batchnorm after the concat layer? Or should that batchnorm be just after the FC_array, before the concat layer?
For information, the IN_array is a tf.one_hot() vector.
Thank you
TL;DR: 1. Yes 2. Yes 3. No
TS;WM:
Batch normalization was a great invention by Sergey Ioffe and Christian Szegedy in early 2015. Back in those days, battling vanishing or exploding gradients was an everyday problem. Read that article if you want to gain a deep understanding, but basically this quote from the abstract should give you some idea:
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs.
They did in fact first use batch normalization for DCNNs, which allowed them to beat human performance in the top-5 ImageNet classification, but any network where there are nonlinearities can benefit from batch normalization, including a network consisting of fully-connected layers.
Yes, it is used for shallow CNNs too. Any network with more than one layer can benefit from it, although it is true that deeper networks benefit more.
First of all, one-hot vectors should never be normalized. Normalization means you subtract the mean and divide by the standard deviation, thus creating a dataset with mean 0 and variance 1. If you do this to a one-hot vector, then the cross-entropy loss calculation will be completely off. Second, there is no point in normalizing a concat layer separately, since it does not change the values, just concatenates them. Batch normalization is done on the input of a layer, so the layer after the concat, which receives the concatenated values, can do it if necessary.
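A minimal sketch of the normalization step, applied to one feature across a batch. It leaves out the learned scale and shift (gamma and beta) and the running statistics that a real batch norm layer keeps, and the function name and sample values are made up.

```python
import math

def batch_normalize(column, eps=1e-5):
    """Normalize one feature across a batch: subtract the mean, divide by the standard deviation."""
    n = len(column)
    mean = sum(column) / n
    var = sum((x - mean) ** 2 for x in column) / n
    return [(x - mean) / math.sqrt(var + eps) for x in column]

# One ordinary activation and one column of a one-hot encoding, over a batch of four.
activations = [1.0, 2.0, 3.0, 4.0]
one_hot_column = [1.0, 0.0, 0.0, 0.0]

normalized = batch_normalize(activations)
normalized_one_hot = batch_normalize(one_hot_column)
print(normalized)
print(normalized_one_hot)
```

The second result shows why normalizing the one-hot input is a bad idea: the entries are no longer 0 and 1, and their values depend on how often that class happens to appear in the batch.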

Why is the code for a neural network with a sigmoid so different than the code with softmax_cross_entropy_with_logits?

When using neural networks for classification, it is said that:
You generally want to use softmax cross-entropy output, as this gives you the probability of each of the possible options.
In the common case where there are only two options, you want to use sigmoid, which is the same thing except avoids redundantly outputting p and 1-p.
The way to calculate softmax cross entropy in TensorFlow seems to be along the lines of:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction,labels=y))
So the output can be connected directly to the minimization code, which is good.
The code I have for sigmoid output, likewise based on various tutorials and examples, is along the lines of:
p = tf.sigmoid(tf.squeeze(...))
cost = tf.reduce_mean((p - y)**2)
I would have thought the two should be similar in form since they are doing the same jobs in almost the same way, but the above code fragments look almost completely different. Furthermore, the sigmoid version is explicitly squaring the error whereas the softmax isn't. (Is the squaring happening somewhere in the implementation of softmax, or is something else going on?)
Is one of the above simply incorrect, or is there a reason why they need to be completely different?
The softmax cross-entropy cost and the squared-loss cost of a sigmoid are completely different cost functions. Though they seem to be closely related, they are not the same thing.
It is true that both functions are "doing the same job" if the job is defined as "be the output layer of a classification network". Similarly, you can say that "softmax regression and neural networks are doing the same job". It is true, both techniques are trying to classify things, but in a different way.
The softmax layer with cross-entropy cost is usually preferred over sigmoids with l2-loss. Softmax with cross-entropy has its own pros, such as a stronger gradient of the output layer and normalization to probability vector, whereas the derivatives of the sigmoids with l2-loss are weaker. You can find plenty of explanations in this beautiful book.
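The "stronger gradient" point can be checked for a single output unit. The sketch below compares the derivative with respect to the logit z of the squared error (p - y)^2 and of the binary cross-entropy, with p = sigmoid(z); the function names and the example logit are chosen for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_squared_error(z, y):
    """d/dz of (sigmoid(z) - y)^2."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def grad_cross_entropy(z, y):
    """d/dz of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z)."""
    return sigmoid(z) - y

# A confidently wrong prediction: the target is 1 but the logit is strongly negative.
z, y = -6.0, 1.0
g_se = grad_squared_error(z, y)
g_ce = grad_cross_entropy(z, y)
print(g_se, g_ce)
```

The extra p * (1 - p) factor in the squared-error gradient goes to zero when the sigmoid saturates, so a confidently wrong unit learns very slowly, while the cross-entropy gradient stays proportional to the error.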

TensorFlow - Batch normalization failing on regression?

I'm using TensorFlow for a multi-target regression problem. Specifically, in a convolutional network with pixel-wise labeling with the input being an image and the label being a "heat-map" where each pixel has a float value. More specifically, the ground truth labeling for each pixel is lower bounded by zero, and, while technically having no upper bound, usually gets no larger than 1e-2.
Without batch normalization, the network is able to give a reasonable heat-map prediction. With batch normalization, the network takes much longer to get to a reasonable loss value, and the best it does is make every pixel the average value. This is using the tf.contrib.layers conv2d and batch_norm methods, with batch_norm being passed to conv2d's normalization_fn (or not, in the case of no batch normalization). I had briefly tried batch normalization on another (single value) regression network, and had trouble then as well (though I hadn't tested that as extensively). Is there a problem using batch normalization on regression problems in general? Is there a common solution?
If not, what could be some causes of batch normalization failing on such an application? I've attempted a variety of initializations, learning rates, etc. I would expect the final layer (which of course does not use batch normalization) could use weights to scale the output of the penultimate layer to the appropriate regression values. Failing that, I removed batch norm from that layer, but with no improvement. I've attempted a small classification problem using batch normalization and saw no problem there, so it seems reasonable that it could be due somehow to the nature of the regression problem, but I don't know how that could cause such a drastic difference. Is batch normalization known to have trouble on regression problems?
I believe your issue is in the labels. Batch norm normalizes each layer's activations to roughly zero mean and unit variance (before its learned scale and shift), so they are of order 1. If the labels are not scaled to a similar range, the task will be more difficult, because the NN has to learn values of a different scale.
By removing the batch norm from the penultimate layer, the task may be improved slightly, but you are still requiring an NN layer to learn to downscale the values of its input while the earlier layers keep normalizing them back to order 1 (opposite to your objective).
To solve this problem, apply a 0 - 1 scaler to the labels such that your upper bound is no longer 1e-2. During inference, transform the predictions back with the same function to get the actual prediction.
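A minimal sketch of that label scaling and its inverse, in plain Python. The helper names and the sample label values are hypothetical; the scaler must be fitted on the training labels only, and it assumes the labels are not all identical (otherwise hi - lo is zero).

```python
def fit_min_max(values):
    """Record the range of the training labels."""
    return min(values), max(values)

def scale(values, lo, hi):
    """Map labels from [lo, hi] to [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

def unscale(values, lo, hi):
    """Inverse of scale: map network outputs in [0, 1] back to the original label range."""
    return [v * (hi - lo) + lo for v in values]

# Hypothetical heat-map values, bounded below by zero and rarely above 1e-2.
labels = [0.0, 0.002, 0.005, 0.01]
lo, hi = fit_min_max(labels)

scaled_labels = scale(labels, lo, hi)          # train against these
recovered = unscale(scaled_labels, lo, hi)     # apply the same inverse to predictions
print(scaled_labels, recovered)
```

Store lo and hi alongside the model so that exactly the same transform is used at inference time.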