Why and when do I need to use the global step in tensorflow - tensorflow

I am using tensorflow, but I am not sure why I even need the global_step variable or if it is even necessary for training. I have sth like this:
gradients_and_vars = optimizer.compute_gradients(value)
train_op = optimizer.apply_gradients(gradients_and_vars)
and then in my loop inside a session I do this:
_ = sess.run([train_op])
I am using a Queue to feed my data the the graph. Do I even have to instantiate a global_step variable?
My loop looks like this:
while not coord.should_stop():
So this loop stops, when it should stop. So why do I need the global_step at all?

You don't need the global step in all cases. But sometimes people want to stop training, tweak some code and then continue training with the saved and restored model. Then often it is nice to know how long (=for how many time steps) this model had been trained so far. Thus the global step.
Also sometimes your learning rate regime might depend on the time the model already had been trained. Say you want to decay your learning rate every 100.000 steps. If you don't keep track of the number of steps already taken this might be difficult if you interrupted training in between and didn't keep track of the number of steps already taken.
And furthermore if you are using tensorboard the global step is the central parameter for your x-axis of the charts.

Related

Saving model weights after each forward pass

I am doing research in explainable AI by looking at the patterns in weights as a function of model hyperparameters and input data. One of the things I'm examining is how weights progress from randomness (or initializer starting values) to stabilization, after learning completes. I'd like to, instead of saving weights every epoch, save them at every other or third forward pass. How would I do that? Must I tweak the 'period' argument to the Keras model checkpoint method? If so, what's an easy formula to set that argument? Thx and have a great day.
just pass save_freq=3 when instantiating the tf.keras.callbacks.ModelCheckpoint.
Quoting the docs:
https://keras.io/api/callbacks/model_checkpoint/
save_freq: 'epoch' or integer. When using 'epoch', the callback saves the model after each epoch. When using integer, the callback saves the model at end of this many batches. If the Model is compiled with steps_per_execution=N, then the saving criteria will be checked every Nth batch. Note that if the saving isn't aligned to epochs, the monitored metric may potentially be less reliable (it could reflect as little as 1 batch, since the metrics get reset every epoch). Defaults to 'epoch'.

Tensorflow batch normalization: difference momentum and renorm_momentum

I want to replicate a network build with the lasagne-library in tensor flow. I'm having some trouble with the batch normalization.
This is the lasagne documentation about the used batch normalization:
http://lasagne.readthedocs.io/en/latest/modules/layers/normalization.html?highlight=batchNorm
In tensorflow I found two functions to normalize:
https://www.tensorflow.org/api_docs/python/tf/nn/batch_normalization
https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization
The first one is simpler but does not let me choose the alpha parameter from lasagne (Coefficient for the exponential moving average of batch-wise means and standard deviations computed during training). I tried using the second function, which has a lot more options, but there are two things I do not understand about it:
I am not clear about the difference between momentum and renorm_momentum. If I have a alpha of 0.9 in the lasagne network, can I just set both tensorflow momentums to 0.9 and expect the same behaviour?
The tf documentation notes:
when training, the moving_mean and moving_variance need to be updated. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. For example:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
train_op = optimizer.minimize(loss)
I do not really understand what is happening here and where I need to put something similar in my code. Can I just put this somewhere before I run the session? What parts of this code piece should I not copy literally but change depending on my code?
There is a big difference between tf.nn.batch_normalization and tf.layers.batch_normalization. See my answer here. So you have made the right choice by using the layers version. Now, on your questions:
renorm_momentum only has an effect is you use batch renormalization by setting the renorm argument to True. You can ignore this if using default batch normalization.
Short answer: You can literally copy that code snippet. Put it exactly where you would normally call optimizer.minimize.
Long answer on 2.: Batch normalization has two "modes": Training and inference. During training, mean and variance of the current minibatch is used. During inference, this is not desirable (e.g. you might not even use batches as input, so there would be no minibatch statistics). For this reason, moving averages over minibatch means/variances are kept during training. These moving averages are then used for inference.
By default, Tensorflow only executes what it needs to. Those moving averages are not needed for training, so they normally would never be executed/updated. The tf.control_dependencies context manager forces Tensorflow to do the updates every time it computes whatever is in the code block (in this case the cost). Since the cost certainly needs to be computed exactly one per training step, this is a good way of making sure the moving averages are updated.
The code example seems a bit arcane, but in context it would really just be (as an example):
loss = ...
train_step = SomeOptimizer().minimize(loss)
with tf.Session() as sess:
....
becomes
loss = ...
with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
train_step = SomeOptimizer().minimize(loss)
with tf.Session() as sess:
....
Finally, keep in mind to use the correct training argument for batch normalization so that either minibatch statistics or moving averages are used as intended.

How to smoothly produce Tensorflow auc summaries for training and test sets?

Tensorflow describes writing file summaries to visualize graph execution.
I envision three stages:
training the data (with optimization)
measuring accuracy on the training set (no optimization)
measuring accuracy on the test set (no optimization!)
I'd like all stages in the same script, as in the evaluate function of the wide_and_deep tutorial, but with the low-level API. I'd like three different graphs for stats like loss or AUC, one for each stage.
Suppose I use one session, and in each stage I define an AUC summary op:
# define auc
auc, auc_op = tf.metrics.auc(labels, predictions)
# summary scalar to track it
tf.summary.scalar("auc", auc_op, family=family_name)
# merge all summaries for evaluation and later writing
summary_op = tf.summary.merge_all()
...
summary_writer.add_summary(summary, step_num)
There are three graphs, but the first graph has all three runs on it, and the second graph has the last two runs (see below). What's worse, each stage starts from the previous state. This makes sense, because all the variables from the previous stages are still around.
I could use a different session for each stage, but that would throw away the model as well.
What is the smooth way to handle this?
I'd like to just clear some of the summary variables. I've tried re-initializing some variables, looked at related questions, read about name scope and variable scope and tried not to re-use variables for AUC, read about variables and sharing, looked into pruning nodes (though I don't understand it), etc. I have not made it work yet.
I am using the low-level API. I saw something like this in the high-level API in _eval_metric_ops, but I don't understand how they 'clear' the different stages. With name_scope?
Do I have to save and load the model into a new session just for this, or is there some clean way to graph each summary separately?
The metric ops will be local variables, so you could run tf.local_variables_initializer() in your Session, which will reset all of your metrics. You could also look through the local variables collection for those with "auc" in the name if you wanted to be a bit more discerning. The high-level way to do this would be to use an Estimator, which will manage metrics for you.

Running training op multiple times using tf.contrib.data.Dataset?

I'm playing around with the tf.contrib.data.Dataset API. Basically, I have an input tensor generated by dataset.batch(BATCH_SIZE).get_next() and I can currently do something like
for _ in range(100):
sess.run(train_op)
to repeat the train step 100 times. I don't need to feed anything, since all the inputs come from dataset. Doing this in a loop seems like a waste, so is there a way to tell TF to repeat the run step 100 times without having to fall back into Python code between every iteration?
I saw some similar questions about preventing the CPU-GPU transfer between iterations by feeding persistent tensors residing on the GPU, but that's a different problem.

What does tf.train.get_global_step() do in TensorFlow?

What is the use of the function tf.train.get_global_step() in TensorFlow?
In machine learning concepts what is it equivalent to?
You could use it to restart training exactly where you left off when the training procedure has been stopped for some reason. Of course you can always restart training without knowing the global_step (if you save checkpoints regularly in your code, that is), but unless you somehow keep track of how many iterations you already performed, you will not know how many iterations are left after the restart. Sometimes you really want your model to be trained exactly n iterations and not n plus unknown amount before crash. So in my opinion, this is more of a practicality than a theoretical machine learning concept.
tf.train.get_global_step() return global step(variable, tensor from variable node or None) through get_collection(tf.GraphKeys.GLOBAL_STEP) or get_tensor_by_name('global_step:0')
global step is widely used in learn rate decay(like tf.train.exponential_decay, see Decaying the learning rate for more information).
You can pass global step to optimzer apply_gradients or minimize method to increment by one.
while you defined the global step operator, you can get value of it by sess.run(global_step_op)