TensorFlow batch normalization: difference between momentum and renorm_momentum

I want to replicate a network built with the Lasagne library in TensorFlow. I'm having some trouble with the batch normalization.
This is the Lasagne documentation for the batch normalization layer that was used:
http://lasagne.readthedocs.io/en/latest/modules/layers/normalization.html?highlight=batchNorm
In tensorflow I found two functions to normalize:
https://www.tensorflow.org/api_docs/python/tf/nn/batch_normalization
https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization
The first one is simpler, but it does not let me choose the alpha parameter from Lasagne (the coefficient for the exponential moving average of batch-wise means and standard deviations computed during training). I tried using the second function, which has a lot more options, but there are two things I do not understand about it:
I am not clear about the difference between momentum and renorm_momentum. If I have an alpha of 0.9 in the Lasagne network, can I just set both TensorFlow momentums to 0.9 and expect the same behaviour?
The tf documentation notes:
when training, the moving_mean and moving_variance need to be updated. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. For example:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)
I do not really understand what is happening here and where I need to put something similar in my code. Can I just put this somewhere before I run the session? What parts of this code piece should I not copy literally but change depending on my code?

There is a big difference between tf.nn.batch_normalization and tf.layers.batch_normalization. See my answer here. So you have made the right choice by using the layers version. Now, on your questions:
renorm_momentum only has an effect if you use batch renormalization by setting the renorm argument to True. You can ignore it if you are using default batch normalization.
Short answer: You can literally copy that code snippet. Put it exactly where you would normally call optimizer.minimize.
Long answer on 2.: Batch normalization has two "modes": training and inference. During training, the mean and variance of the current minibatch are used. During inference, this is not desirable (e.g. you might not even use batches as input, so there would be no minibatch statistics). For this reason, moving averages over the minibatch means/variances are kept during training. These moving averages are then used for inference.
By default, TensorFlow only executes what it needs to. Those moving averages are not needed for training, so they normally would never be executed/updated. The tf.control_dependencies context manager forces TensorFlow to do the updates every time it computes whatever is in the code block (in this case the cost). Since the cost certainly needs to be computed exactly once per training step, this is a good way of making sure the moving averages are updated.
The code example seems a bit arcane, but in context it would really just be (as an example):
loss = ...
train_step = SomeOptimizer().minimize(loss)
with tf.Session() as sess:
    ...
becomes
loss = ...
with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
    train_step = SomeOptimizer().minimize(loss)
with tf.Session() as sess:
    ...
Finally, remember to pass the correct training argument to batch normalization so that either minibatch statistics or moving averages are used as intended.
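To tie the pieces together, here is a minimal sketch of how this can look in practice (the layer sizes, placeholder names and choice of optimizer are made up for illustration): momentum controls the exponential moving averages, renorm stays at its default of False, and the UPDATE_OPS dependency wraps the optimizer. Note also that Lasagne's alpha weights the new batch statistics while TensorFlow's momentum weights the old moving average, so they are complementary conventions; double-check both docs before carrying a value over directly.

import tensorflow as tf

# Hypothetical shapes and names, for illustration only.
x = tf.placeholder(tf.float32, [None, 128])
labels = tf.placeholder(tf.int64, [None])
is_training = tf.placeholder(tf.bool, name='is_training')

h = tf.layers.dense(x, 64)
# momentum: coefficient of the moving averages of mean/variance.
# training: use minibatch statistics (True) or the moving averages (False).
h = tf.layers.batch_normalization(h, momentum=0.9, training=is_training)
h = tf.nn.relu(h)
logits = tf.layers.dense(h, 10)

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# The moving-average update ops live in UPDATE_OPS; make the train op depend on them.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

# At run time, feed is_training=True for training steps and False for inference.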


Defining assignment function as variable in tensorflow?

I am training a neural network by SGD (batch size = 1). The inputs are randomly generated, and the labels are calculated based on the inputs. In other words, the data does not have to be realistic, but the relationship between inputs and labels has to be specific. I will train my NN for only 1 epoch, but with many batches.
I have the following code:
training_input = tf.Variable(tf.zeros(...))
assign_training_input_with_random_values = training_input.assign(tf.random_normal(...))
# Create a session, initialize a bunch of variables, construct a neural network...
for batch in range(batch_number):
    sess.run(assign_training_input_with_random_values)
    # Train my neural network...
However I noticed that if I write the above code differently the speed goes down by a lot:
# Run the assignment operation directly without defining it as a variable
for batch in range(batch_number):
    sess.run(training_input.assign(tf.random_normal(...)))
    # Train my neural network...
The first snippet being significantly faster makes me worry that TensorFlow only randomizes when I define assign_training_input_with_random_values, and that the same training example is fed to the NN in every batch afterwards. In this case, the NN would probably not generalize well. Meanwhile, the second snippet is slow because it randomizes every batch. Is this actually the case, or is there another reason for the slowdown?
First, the explanation for your observations
Computational difference between 1st and 2nd solutions
It makes sense that your first solution is faster than the second. You define the assign operation once and then execute it on every iteration. In the second solution, however, you create a new op on every iteration, growing the computational graph over time, which causes your program to slow down.
Observation about the 1st solution
(After @Y.Z.'s finding) Apparently the first solution does evaluate to different random number arrays every time you run it. Therefore, the first solution is also valid.
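You can check this with a small standalone sketch (the shape is made up; it mirrors the question's first snippet): running the predefined assign op twice prints two different random matrices, because tf.random_normal is re-evaluated on every session run.

import tensorflow as tf

tf.reset_default_graph()
training_input = tf.Variable(tf.zeros(shape=[3, 2]))
assign_op = training_input.assign(tf.random_normal(shape=[3, 2]))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Two runs of the same assign op -> two different random draws.
    print(sess.run(assign_op))
    print(sess.run(assign_op))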
Another way to implement this
One alternative is to use a tf.placeholder and feed new values in every epoch, the following way.
import tensorflow as tf
import numpy as np

training_input = tf.Variable(tf.zeros(shape=[3, 2]))
tf_random = tf.placeholder(shape=[3, 2], dtype=tf.float32)
assign_training_input_with_random_values = training_input.assign(tf_random)

# Create a session, initialize a bunch of variables, construct a neural network...
epoch = 0
with tf.Session() as sess:
    while epoch < 10:
        epoch += 1
        sess.run(assign_training_input_with_random_values,
                 feed_dict={tf_random: np.random.normal(size=(3, 2))})
Comparing Solution 1 vs My solution
So it turns out that both your first solution and my solution will not grow the graph. If you run the line
print([n.name for n in tf.get_default_graph().as_graph_def().node])
for your first solution and my solution (be careful to run tf.reset_default_graph() at the beginning), you'll see that the number of nodes remains constant regardless of the number of iterations. It appears that TensorFlow is smart enough to prune old tf.random tensors that are no longer used.

Difference between fit and evaluate in Keras

I have used 100,000 samples to train a general model in Keras and achieved good performance. Then, for a particular sample, I want to use the trained weights as initialization and continue optimizing the weights to further reduce the loss on that particular sample.
However, a problem occurred. First, I load the trained weights easily with the Keras API; then I evaluate the loss on the one particular sample, and it is close to the validation loss seen during training of the model. I think that is normal. However, when I use the trained weights as the initialization and further optimize them on the one sample with model.fit(), the loss is really strange: it is much higher than the evaluate result and only gradually becomes normal after several epochs.
I find it strange that, for the same single sample and the same loaded model weights, model.fit() and model.evaluate() return different results. I used batch normalization layers in my model and I wonder whether that may be the reason. The result of model.evaluate() seems normal, as it is close to what I saw on the validation set before.
So what causes the difference between fit and evaluate? How can I solve it?
I think your core issue is that you are observing two different loss values during fit and evaluate. This has been extensively discussed here, here, here and here.
The fit() function loss includes contributions from:
Regularizers: L1/L2 regularization loss will be added during training, increasing the loss value
Batch norm variations: during batch norm, running mean and variance of the batch will be collected and then those statistics will be used to perform normalization irrespective of whether batch norm is set to trainable or not. See here for more discussion on that.
Multiple batches: Of course, the training loss is averaged over multiple batches. So if you take the average of the first 100 batches but evaluate on the 100th batch only, the results will be different.
evaluate(), on the other hand, just does a forward pass and returns the loss value; there is nothing random there.
The bottom line is: you should not compare train and validation loss (or fit and evaluate loss). Those functions do different things. Look at other metrics to check whether your model is training fine.
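As an illustration (a sketch only: the model path, sample shapes and compile settings are assumed, not taken from the question), evaluating one sample in inference mode and then fitting on that same sample can already show the gap, because fit() runs the batch-norm layers in training mode and adds any regularization terms to the reported loss.

from tensorflow import keras
import numpy as np

model = keras.models.load_model('pretrained.h5')   # assumed path to the trained model
x = np.random.rand(1, 32).astype('float32')        # dummy single sample
y = np.random.rand(1, 1).astype('float32')

# Pure forward pass in inference mode: batch norm uses its moving averages.
eval_loss = model.evaluate(x, y, verbose=0)

# Training mode: batch norm uses the statistics of this 1-sample batch,
# and regularizers (if any) are added to the reported loss.
history = model.fit(x, y, epochs=1, batch_size=1, verbose=0)
fit_loss = history.history['loss'][0]

print(eval_loss, fit_loss)   # these two numbers will generally differ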

Questions about tensorflow GetStarted tutorial

So I was reading the TensorFlow Getting Started tutorial and I found it very hard to follow. There were a lot of explanations missing about each function and why it is necessary (or not).
1. In the tf.estimator section, what are the "x_eval" and "y_eval" arrays supposed to be, and what do they mean? The x_train and y_train arrays give the desired output (which is the corresponding y coordinate) for a given x coordinate. But the x_eval and y_eval values are incorrect: for x=5, y should be -4, not -4.1. Where do those values come from? What do x_eval and y_eval mean? Are they necessary? How did they choose those values?
2. The difference between "input_fn" (what does "fn" even mean?) and "train_input_fn". I see that the only difference is that one has
num_epochs=None, shuffle=True
num_epochs=1000, shuffle=False
but I don't understand what "input_fn" or "train_input_fn" are/do, what the difference between the two is, or whether both are necessary.
3. In the
estimator.train(input_fn=input_fn, steps=1000)
piece of code, I don't understand the difference between "steps" and "num_epochs". What's the meaning of each one? Can you have num_epochs=1000 and steps=1000 too?
The final question is: how do I get the W and the b? In the previous way of doing it (not using tf.estimator) they explicitly found that W=-1 and b=1. If I were building a more complex neural network, involving biases and weights, I would want to recover the actual values of the weights and biases. That's the whole point of why I'm using TensorFlow: to find the weights! So how do I recover them in the tf.estimator example?
These are just some of the questions that bugged me while reading the Getting Started tutorial. I personally think it leaves a lot to be desired, since it's very unclear what each thing does and you can at best guess.
I agree with you that the tf.estimator is not very well introduced in this "getting started" tutorial. I also think that some machine learning background would help with understanding what happens in the tutorial.
As for the answers to your questions:
In machine learning, we usually minimize the loss of the model on the training set, and then we evaluate the performance of the model on the evaluation set. This is because it is easy to overfit the training set and get 100% accuracy on it, so using a separate validation set makes it impossible to cheat in this way.
Here (x_train, y_train) corresponds to the training set, where the global minimum is obtained for W=-1, b=1.
The validation set (x_eval, y_eval) doesn't have to perfectly follow the distribution of the training set. Although we can get a loss of 0 on the training set, we obtain a small loss on the validation set because we don't have exactly y_eval = - x_eval + 1
input_fn means "input function". This is to indicate that the object input_fn is a function.
In tf.estimator, you need to provide an input function if you want to train the estimator (estimator.train()) or evaluate it (estimator.evaluate()).
Usually you want different transformations for training or evaluation, so you have two functions train_input_fn and eval_input_fn (the input_fn in the tutorial is almost equivalent to train_input_fn and is just confusing).
For instance, during training we want to train for multiple epochs (i.e. multiple passes over the dataset). For evaluation, we only need one pass over the validation data to compute the metrics we need.
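For concreteness, here is roughly what the two input functions can look like with tf.estimator.inputs.numpy_input_fn (the arrays are the toy data from the tutorial; the batch size and epoch counts are illustrative):

import numpy as np
import tensorflow as tf

x_train = np.array([1., 2., 3., 4.])
y_train = np.array([0., -1., -2., -3.])
x_eval = np.array([2., 5., 8., 1.])
y_eval = np.array([-1.01, -4.1, -7., 0.])

# Training: repeat indefinitely (num_epochs=None) and shuffle; the number of
# steps actually run is then controlled by estimator.train(steps=...).
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x_train}, y_train, batch_size=4, num_epochs=None, shuffle=True)

# Evaluation: a single ordered pass over the validation data is enough.
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x_eval}, y_eval, batch_size=4, num_epochs=1, shuffle=False)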
The number of epochs is the number of times we repeat the entire dataset. For instance if we train for 10 epochs, the model will see each input 10 times.
When we train a machine learning model, we usually use mini-batches of data. For instance if we have 1,000 images, we can train on batches of 100 images. Therefore, training for 10 epochs means training on 100 batches of data.
Once the estimator is trained, you can access the list of variables through estimator.get_variable_names() and the value of a variable through estimator.get_variable_value().
Usually we never need to do that, as we can for instance use the trained estimator to predict on new examples, using estimator.predict().
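A short sketch of what that looks like after training; the exact variable names depend on the estimator (the ones below are typical for the tutorial's LinearRegressor in TF 1.x), so read them off get_variable_names() rather than assuming them:

# After estimator.train(...) has finished:
print(estimator.get_variable_names())
# e.g. ['global_step', 'linear/linear_model/bias_weights', 'linear/linear_model/x/weights', ...]

W = estimator.get_variable_value('linear/linear_model/x/weights')
b = estimator.get_variable_value('linear/linear_model/bias_weights')
print(W, b)   # should be close to -1 and 1 for the tutorial's data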
If you feel that the getting started is confusing, you can always submit a GitHub issue to tell the TensorFlow team and explain your point.

What's the difference between tf.GraphKeys.TRAINABLE_VARIABLES and tf.GraphKeys.UPDATE_OPS in tensorflow?

Here is the doc of tf.GraphKeys in TensorFlow, which lists collections such as TRAINABLE_VARIABLES: the subset of Variable objects that will be trained by an optimizer.
I also know about tf.get_collection(), which can retrieve the tensors you want from a collection.
When using tensorflow.contrib.layers.batch_norm(), the default value of the updates_collections parameter is GraphKeys.UPDATE_OPS.
How can we understand those collections, and the difference between them?
Besides, we can find more in ops.py.
These are two different things.
TRAINABLE_VARIABLES
TRAINABLE_VARIABLES is the collection of variables or training parameters which should be modified when minimizing the loss. For example, these can be the weights determining the function performed by each node in the network.
How do variables get added to this collection? This happens automatically when you define a new variable with tf.get_variable, unless you specify
tf.get_variable(..., trainable=False)
When would you want a variable to be untrainable? This happens from time to time. For example, occasionally you will want to use a two-step approach in which you first train the entire network on a large, generic dataset, then fine-tune the network on a smaller dataset which is specifically related to your problem. In such cases, you might want to fine-tune only part of the network, e.g., the last layer. Specifying some variables as untrainable is one of the ways to do this.
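For example, here is a minimal sketch of that fine-tuning case (the scopes and shapes are made up): only the variables left trainable end up in TRAINABLE_VARIABLES, which is the set optimizer.minimize() updates by default.

import tensorflow as tf

tf.reset_default_graph()

with tf.variable_scope('backbone'):
    # Pretrained feature extractor: excluded from TRAINABLE_VARIABLES.
    w_frozen = tf.get_variable('W', shape=[128, 64], trainable=False)

with tf.variable_scope('head'):
    # Last layer: trainable by default, so it lands in TRAINABLE_VARIABLES.
    w_head = tf.get_variable('W', shape=[64, 10])

print(tf.trainable_variables())   # only head/W
print(tf.global_variables())      # both variables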
UPDATE_OPS
UPDATE_OPS is a collection of ops (operations performed when the graph runs, like multiplication, ReLU, etc.), not variables. Specifically, this collection maintains a list of ops which need to run before each training step.
How do ops get added to this collection?
By definition, update_ops occur outside the regular flow of training by loss minimization, so generally you will be adding ops to this collection only under special circumstances. For example, when performing batch normalization, you want to recompute the batch mean and variance before each training step, and this is how it's done. The mechanics of batch normalization using tf.contrib.layers.batch_norm are described in more detail in this article.
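A small sketch (the input shape and loss are placeholders) showing what ends up in each collection when a tf.contrib.layers.batch_norm layer is created, and how the UPDATE_OPS are then tied to the train op:

import tensorflow as tf

tf.reset_default_graph()

x = tf.placeholder(tf.float32, [None, 32])
is_training = tf.placeholder(tf.bool)

# batch_norm registers its moving_mean/moving_variance update ops in
# UPDATE_OPS (the default value of its updates_collections argument).
h = tf.contrib.layers.batch_norm(x, is_training=is_training)

print(tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES))   # beta (and gamma if scale=True)
print(tf.get_collection(tf.GraphKeys.UPDATE_OPS))            # the moving-average assign ops

# The update ops must run together with each training step:
loss = tf.reduce_mean(tf.square(h))
with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)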
I disagree with the previous answer.
Actually, everything is an op in TensorFlow; the variables in the TRAINABLE_VARIABLES collection are also ops, created by tf.get_variable or tf.Variable.
As for the UPDATE_OPS collection, it usually contains the moving-mean and moving-variance update ops created in the tf.layers.batch_norm function. These can also be regarded as variables, as their values are updated at each training step, just like the weights and biases.
The main difference is that the trainable variables participate in back-propagation, while the variables behind UPDATE_OPS do not. The latter are only used during inference in test mode, so no gradients are computed for them.

`estimator.train` with num_steps in Tensorflow

I have made a custom estimator in Tensorflow 1.4. In the estimator.train function, I see a steps parameter, which I am using as a way to stop the training and then evaluate on my validation dataset.
while True:
    model.train(input_fn=lambda: train_input_fn(train_data), steps=FLAGS.num_steps)
    model.evaluate(input_fn=lambda: train_input_fn(test_data))
After every num_steps, I run evaluate on validation dataset.
What I am observing is that after num_steps, once the evaluation is done, there is a jerk in the plots of the AUC/loss curves (in general, all metrics).
Plot attached:
I am unable to understand why this is happening.
Is this not the right way to evaluate metrics on the validation dataset at regular intervals?
Link to code
The issue
The issue comes from the fact that what you plot in TensorBoard is the accuracy or AUC computed since the beginning of estimator.train.
Here is what happens in details:
you create a summary based on the second output of tf.metrics.accuracy
accuracy = tf.metrics.accuracy(labels, predictions)
tf.summary.scalar('accuracy', accuracy[1])
when you call estimator.train(), a new Session is created and all the local variables are initialized again. This includes the local variables of accuracy (sum and count)
during this Session, the summary op created by tf.summary.merge_all() is evaluated at regular intervals. What happens is that your summary reports the accuracy over all the batches processed since you last called estimator.train(). Therefore, at the beginning of each training phase the output is pretty noisy, and it gets more stable as you progress.
Whenever you evaluate and call estimator.train() again, the local variables are initialized again and you go in a short "noisy" phase, which results in bumps on the training curve.
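To see concretely why the tf.metrics-based summary is noisy at the start and resets with each new Session, here is a small standalone sketch (placeholder tensors only, not tied to the question's estimator code) contrasting the running accuracy with a per-batch accuracy:

import tensorflow as tf

labels = tf.placeholder(tf.int64, [None])
predictions = tf.placeholder(tf.int64, [None])

# tf.metrics.accuracy keeps running totals in local variables; its update op
# returns the accuracy accumulated since those variables were last initialized.
acc_value, acc_update = tf.metrics.accuracy(labels, predictions)

# Per-batch accuracy: no hidden state, just the current batch.
batch_acc = tf.reduce_mean(tf.cast(tf.equal(labels, predictions), tf.float32))

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())
    print(sess.run([acc_update, batch_acc],
                   {labels: [1, 1], predictions: [1, 1]}))   # running 1.0, batch 1.0
    print(sess.run([acc_update, batch_acc],
                   {labels: [1, 1], predictions: [0, 0]}))   # running 0.5, batch 0.0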
A solution
If you want a scalar summary that gives you the actual accuracy for each batch, it seems like you need to implement it without using tf.metrics. For instance, if you want the accuracy you will need to do:
accuracy = tf.reduce_mean(tf.cast(tf.equal(labels, predictions), tf.float32))
tf.summary.scalar('accuracy', accuracy)
It is easy to implement this for the accuracy, and I know it might be painful to do for AUC but I don't see a better solution for now.
Maybe having these bumps is not so bad. For instance, if you train for one epoch, you will get the overall training accuracy over that epoch at the end.