This is tutorial code from the TensorFlow website; could anyone help explain what global_step means?
The TensorFlow documentation says that global_step is used to count training steps, but I don't quite get what exactly that means.
Also, what does the number 0 mean when setting up global_step?
def training(loss, learning_rate):
    tf.summary.scalar('loss', loss)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    # Why 0 as the first parameter of the global_step tf.Variable?
    global_step = tf.Variable(0, name='global_step', trainable=False)
    train_op = optimizer.minimize(loss, global_step=global_step)
    return train_op
According to the TensorFlow docs, global_step is incremented by one after the variables have been updated. Does that mean that after one update global_step becomes 1?
global_step refers to the number of batches seen by the graph. Every time a batch is provided, the weights are updated in the direction that minimizes the loss. global_step just keeps track of the number of batches seen so far. When it is passed in the minimize() argument list, the variable is increased by one each time the returned training op is run. Have a look at optimizer.minimize().
You can get the global_step value using tf.train.global_step().
Also handy are the utility methods tf.train.get_global_step or tf.train.get_or_create_global_step.
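To see the mechanics end to end, here is a minimal sketch (assuming TF 1.x): every run of the op returned by minimize() bumps the counter by one.
import tensorflow as tf

x = tf.Variable(5.0)
loss = tf.square(x)
global_step = tf.train.get_or_create_global_step()
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss, global_step=global_step)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(tf.train.global_step(sess, global_step))   # 0
    sess.run(train_op)
    sess.run(train_op)
    print(tf.train.global_step(sess, global_step))   # 2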
0 is the initial value of the global step in this context.
The global_step Variable holds the total number of steps during training across the tasks (each step index will occur only on a single task).
A timeline created by global_step helps us understand where we are in the grand scheme, from each of the tasks separately. For instance, the loss and accuracy could be plotted against global_step on TensorBoard.
Here is a concrete example:
train_op = tf.train.GradientDescentOptimizer(learning_rate=LEARNING_RATE).minimize(
    loss_tensor, global_step=tf.train.create_global_step())

with tf.Session() as sess:
    ...
    tf.logging.log_every_n(tf.logging.INFO, "np.mean(loss_evl)= %f at step %d", 100,
                           np.mean(loss_evl), sess.run(tf.train.get_global_step()))
The corresponding output:
INFO:tensorflow:np.mean(loss_evl)= 1.396970 at step 1
INFO:tensorflow:np.mean(loss_evl)= 1.221397 at step 101
INFO:tensorflow:np.mean(loss_evl)= 1.061688 at step 201
There are networks, e.g. GANs, that may need two (or more) different steps. Training a GAN with the WGAN specification requires more steps on the discriminator (or critic) D than on the generator G. In that case, it is useful to declare separate global_step variables.
Example (G_loss and D_loss are the losses of the generator and the discriminator):
G_global_step = tf.Variable(0, name='G_global_step', trainable=False)
D_global_step = tf.Variable(0, name='D_global_step', trainable=False)
minimizer = tf.train.RMSPropOptimizer(learning_rate=0.00005)
G_solver = minimizer.minimize(G_loss, var_list=G_params, global_step=G_global_step)
D_solver = minimizer.minimize(D_loss, var_list=D_params, global_step=D_global_step)
Related
I implemented a network using TensorFlow, and the loss does not converge. Then I inspected some values in the network, and I found that the BN layer does not work. Please look at the following picture:
We can see that s2 is the result of batch normalization of s1, but the values in s2 are still very large. I don't know what the problem is. Why are the values in s2 so large?
I have updated my code on GitHub. Anyone who is interested can test it.
As per the official tensorflow documentation here,
when training, the moving_mean and moving_variance need to be updated.
By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so
they need to be executed alongside the train_op. Also, be sure to add
any batch_normalization ops before getting the update_ops collection.
Otherwise, update_ops will be empty, and training/inference will not
work properly.
For example:
training = tf.placeholder(tf.bool, name="is_training")
# ...
x_norm = tf.layers.batch_normalization(x, training=training)
# ...
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
train_op = optimizer.minimize(loss)
train_op = tf.group([train_op, update_ops])
# Or, you can also do something like this:
# with tf.control_dependencies(update_ops):
#     train_op = optimizer.minimize(loss)
So, it is really important to fetch the update ops, as stated in the TensorFlow documentation, because at training time the moving mean and moving variance of the layer have to be updated. If you don't do this, batch normalization will not work and the network will not train as expected. It is also useful to declare a placeholder that tells the network whether it is in training or inference mode, because at test (or inference) time the mean and the variance are fixed: they are estimated from the previously calculated means and variances of the training batches.
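To make the training/inference switch concrete, here is a rough sketch of how the is_training placeholder from the snippet above might be fed (x_batch and x_test are illustrative names, not from the original post):
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Training step: batch statistics are used, and the moving mean/variance
    # are updated via the update ops grouped into train_op above.
    sess.run(train_op, feed_dict={x: x_batch, training: True})
    # Inference: the stored moving mean/variance are used instead of batch statistics.
    normed = sess.run(x_norm, feed_dict={x: x_test, training: False})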
I know that optimizers in TensorFlow divide minimize() into compute_gradients and apply_gradients. However, optimization algorithms like Adam generally process the gradients with momentum and some other techniques, as the following figure suggests (thanks @kmario23 for providing the figure).
I wonder when these techniques are applied to the gradients? Are they applied in compute_gradients or apply_gradients?
Update
sess = tf.Session()
x = tf.placeholder(tf.float32, [None, 1])
y = tf.layers.dense(x, 1)
loss = tf.losses.mean_squared_error(tf.ones_like(y), y)
opt = tf.train.AdamOptimizer()
grads = opt.compute_gradients(loss)
sess.run(tf.global_variables_initializer())
print(sess.run(grads, feed_dict={x: [[1]]}))
print(sess.run(grads, feed_dict={x: [[1]]}))
The above code outputs the same results twice. Does that suggest that the moment estimates are computed in apply_gradients? Because, IMHO, if the moment estimates were computed in compute_gradients, then after the first print statement the first and second moments would be updated, which should result in a different result in the second print statement.
Below is the Adam algorithm as presented in the Deep Learning book. As for your question, the important thing to note here is the gradient of theta (written with the nabla symbol, ∇θ) in the second-to-last step.
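Since the figure isn't reproduced here, a plain NumPy paraphrase of the Adam update it describes (g is the gradient of the loss with respect to theta, i.e. exactly what compute_gradients returns; this follows Kingma & Ba, not the book's exact notation):
import numpy as np

def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam parameter update, paraphrasing Kingma & Ba."""
    m = beta1 * m + (1 - beta1) * g              # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * g ** 2         # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v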
As for how TensorFlow computes this, it is a two-step process in the optimization (i.e. minimization):
1) compute_gradients
2) apply_gradients
In the first step all the necessary ingredients for the final gradients are computed. So, the second step is just applying the update to the parameters based on the gradients computed in the first step and the learning rate (lr).
compute_gradients computes only the gradients; all other additional operations specific to a particular optimization algorithm are done in apply_gradients. The code in the update above is one piece of evidence; another is the following figure cropped from TensorBoard, where Adam corresponds to the compute_gradients operation.
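As a sanity check, here is a small sketch (assuming TF 1.x) extending the snippet in the question: Adam's moment accumulators (the 'm' and 'v' slots) stay at zero when only compute_gradients is run and change only once apply_gradients runs.
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 1])
y = tf.layers.dense(x, 1)
loss = tf.losses.mean_squared_error(tf.ones_like(y), y)

opt = tf.train.AdamOptimizer()
grads_and_vars = opt.compute_gradients(loss)
train_step = opt.apply_gradients(grads_and_vars)

var = tf.trainable_variables()[0]
m_slot = opt.get_slot(var, 'm')                  # Adam's first-moment accumulator for var

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(grads_and_vars, feed_dict={x: [[1.]]})
    print(sess.run(m_slot))                      # still all zeros
    sess.run(train_step, feed_dict={x: [[1.]]})
    print(sess.run(m_slot))                      # updated only after apply_gradients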
I'm trying to compute the gradient of the output layer with respect to the input layer. My neural network is relatively small (the input layer is composed of 9 activation units and the output layer of 1), and training went fine, as the test provided very good accuracy. I made the NN model using Keras.
In order to solve my problem, I need to compute the gradient of the output with respect to the input. That is, I need to obtain the Jacobian, which has dimension [1x9]. The gradients function in TensorFlow should provide me with everything I need, but when I run the code below I obtain a different solution every time.
output_v = model.output
input_v = model.input
gradients = tf.gradients(output_v, input_v)
sess = tf.Session()
sess.run(tf.initialize_all_variables())
print(sess.run(model.input,feed_dict={model.input:x_test_N[0:1,:]}))
evaluated_gradients = sess.run(gradients, feed_dict={model.input: x_test_N[0:1,:]})
print(evaluated_gradients)
sess.close()
The first print command shows this value every time I run it (just to make sure that the input values are not modified):
[[-1.4306372 -0.1272892 0.7145787 1.338818 -1.2957293 -0.5402862 -0.7771702 -0.5787912 -0.9157122]]
But the second print shows different ones:
[[ 0.00175761, -0.0490326 , -0.05413761, 0.09952173, 0.06112418, -0.04772799, 0.06557006, -0.02473242, 0.05542536]]
[[-0.00416433, 0.08235116, -0.00930298, 0.04440641, 0.03752216, 0.06378302, 0.03508484, -0.01903783, -0.0538374 ]]
Using finite differences, evaluated_gradients[0,0] = 0.03565103, which isn't close to any of the first values previously printed.
Thanks for your time!
Alberto
Solved by creating a specific session just before training my model:
sess = tf.Session()
sess.run(tf.global_variables_initializer())
K.set_session(sess)
history = model.fit(x_train_N, y_train_N, epochs=n_epochs,
                    validation_split=split, verbose=1, batch_size=n_batch_size,
                    shuffle=True, callbacks=[early_stop, tensorboard])
And evaluating the gradient after training, while the tf.Session is still open:
evaluated_gradients = sess.run(K.gradients(model.output, model.input), feed_dict={model.input: x_test_N})
Presumably your network is set up to initialize its weights to random values. When you run sess.run(tf.initialize_all_variables()), you are initializing your variables to new random values. Therefore you get different values for output_v in every run, and hence different gradients. If you want to use a model you trained before, you should replace the call to initialize_all_variables() with a restore command. I am not familiar with how this is done in Keras since I usually work directly with tensorflow, but I would try this.
Also note that initialize_all_variables is deprecated and you should use global_variables_initializer instead.
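For instance, with the (old-style) Keras TF backend, one way to try this is to reuse the session that already holds the trained weights instead of creating and re-initializing a new one. A sketch, reusing model and x_test_N from the question:
import tensorflow as tf
from keras import backend as K

sess = K.get_session()     # the session Keras trained in; weights are already set
gradients = tf.gradients(model.output, model.input)
evaluated_gradients = sess.run(gradients, feed_dict={model.input: x_test_N[0:1, :]})
print(evaluated_gradients)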
Hi everyone. I am using TensorFlow 1.4 to train a model like U-Net for my purposes. Due to the constraints of my hardware, the batch_size can only be set to 1 during training, otherwise there will be an OOM error.
Here comes my question: in this case, when batch_size equals 1, will tf.layers.batch_normalization() work correctly (i.e. the moving average, moving variance, gamma, and beta)? Will a small batch_size make it unstable?
In my work, I set training=True when training and training=False when testing. When training, I use:
logits = mymodel.inference()
loss = tf.losses.mean_squared_error(labels, logits)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)
...
saver = tf.train.Saver(tf.global_variables())
with tf.Session() as sess:
    sess.run(tf.group(tf.global_variables_initializer(),
                      tf.local_variables_initializer()))
    sess.run(train_op)
    ...
    saver.save(sess, save_path, global_step)
when testing, I use:
logits = model.inference()
saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, checkpoint)
    sess.run(tf.local_variables_initializer())
    results = sess.run(logits)
Could anyone tell me whether I am using this incorrectly? And how much does batch_size = 1 affect tf.layers.batch_normalization()?
Any help will be appreciated! Thanks in advance.
Yes, tf.layers.batch_normalization() works with batches of single elements. Doing batch normalization on such batches is actually named instance normalization (i.e. normalization of a single instance).
@Maxim made a great post about instance normalization if you want to know more. You can also find more theory on the web and in the literature, e.g. Instance Normalization: The Missing Ingredient for Fast Stylization.
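For illustration, a minimal sketch (assuming TF 1.x) of batch normalization over a batch of one image; with a single example it effectively normalizes per channel over the spatial dimensions, i.e. instance normalization:
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [1, 64, 64, 3])      # batch_size = 1
is_training = tf.placeholder(tf.bool)
y = tf.layers.batch_normalization(x, training=is_training)

# The moving mean/variance update ops still need to run during training.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out, _ = sess.run([y, update_ops],
                      feed_dict={x: np.random.rand(1, 64, 64, 3), is_training: True})
    print(out.shape)    # (1, 64, 64, 3)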
I'm working with tf.data.dataset/iterator mechanism and trying to improve data loading performance. It occurred to me that offloading the entire minibatch loop from Python might help. My data is small enough that storing on CPU or GPU is no problem.
So, Is it possible to loop an optimizer node over a full minibatched epoch within a call to session.run?
The tensor returned by iterator.get_next() is only incremented once per session.run, which would seem to make it impossible to iterate through a dataset of minibatches... but if it could be done, my CPU would only have to touch the Python thread once per epoch.
UPDATE: @muskrat's suggestion to use tf.slice can be used for this purpose. See my subsequent non-answer with a schematic implementation of this using tf.while_loop. However, the question is whether this can be accomplished using datasets/iterators... and I'd still like to know.
From the description it seems that you already have the dataset preloaded as a constant on CPU/GPU, like at this example. That's certainly the first step.
Second, I suggest using tf.slice() to replicate the effect of the minibatch operation. In other words, just manually slice minibatches out of the preloaded constant (your dataset), and you should get the desired behavior. See for example the slice docs or this related post.
If that's not enough detail, please edit your question to include a code example (with mnist or something) and I can give more details.
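A minimal sketch of that suggestion (the arrays here are just random placeholders): slice the i-th minibatch out of the preloaded constant with tf.slice.
import numpy as np
import tensorflow as tf

data = np.random.rand(1000, 9).astype(np.float32)   # stand-in for the preloaded dataset
full_data = tf.constant(data)
batch_size = 32
batch_index = tf.placeholder(tf.int32, shape=[], name='batch_index')

# Rows [batch_index*batch_size, batch_index*batch_size + batch_size), all columns.
minibatch = tf.slice(full_data, [batch_index * batch_size, 0], [batch_size, -1])

with tf.Session() as sess:
    print(sess.run(minibatch, feed_dict={batch_index: 3}).shape)   # (32, 9)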
This "answer" is an implementation of muskrat's tf.slice suggestion with the details of tf.while_loop worked out (with help from How to use tf.while_loop() in tensorflow and https://www.tensorflow.org/api_docs/python/tf/while_loop).
Unless your data and model are small enough that you're bottlenecked by Python I/O (like me!), this solution is probably academic.
Advantages:
Trains over minibatches without returning to the Python thread.
Uses only ops that have GPU implementations, meaning that the entire graph can be placed on the GPU.
On my small dataset, which is presumably bottlenecked by Python I/O, this solution is twice the speed of my dataset/iterator (which touches Python once per minibatch) and four times the speed of passing minibatches through feed_dict.
Disadvantages:
tf.while_loop is treacherous. It's challenging to understand when ops inside the loop's body are evaluated and when the ops they depend on are evaluated, particularly given the (thin) official documentation and limited Stack Overflow coverage.
What the documentation doesn't make clear about tf.while_loop is that tensors outside the body of the loop are only evaluated once, even if inner ops depend on them. This means that the optimizer, model, and loss have to be defined inside the loop. This limits flexibility if you'd like, e.g., to be able to run validation-loss ops between training epochs. Presumably this could be accomplished with tf.cond statements and the appropriate flags passed in via feed_dict, but it is not nearly as flexible or elegant as the dataset/iterator mechanism in tf.data.
Adding shuffling operations at each epoch doesn't seem to be available on GPU.
Here's my schematic code (I've omitted the variable and model definitions for brevity):
def buildModel(info, training_data, training_targets):
    graph = tf.Graph()
    with graph.as_default():
        # batch_size is passed in from Python once per epoch.
        batch_size = tf.placeholder(tf.float32, name='batch_size')

        # Initializers for loop variables for tf.while_loop
        batchCounter = tf.Variable(0, dtype=tf.float32, trainable=False)
        lossList = tf.Variable(tf.zeros([0, 1]), trainable=False)

        # In a full example, I'd normalize my data here. And possibly shuffle.
        tf_training_data = tf.constant(training_data, dtype=tf.float32)
        tf_training_targets = tf.constant(training_targets, dtype=tf.float32)

        # For brevity, I'll spare the definitions of my variables. Because tf.Variables
        # are essentially treated as globals in the model and are manipulated directly
        # (e.g. with tf.assign), they can reside outside runMinibatch, the body of tf.while_loop.
        # weights_1 =
        # biases_1 =
        # etc.

        def moreMinibatches(batchCount, lossList):
            return (batchCount + 1) * batch_size <= len(training_data)

        def runMinibatch(batchCount, lossList):
            # These tensors and ops have to be defined inside runMinibatch, otherwise they're
            # not updated as tf.while_loop loops. This means slices, model definition,
            # loss tensor, and training op.
            dat_batch = tf.slice(tf_training_data,
                                 [tf.cast(batchCount * batch_size, tf.int32), 0],
                                 [tf.cast(batch_size, tf.int32), -1])
            targ_batch = tf.slice(tf_training_targets,
                                  [tf.cast(batchCount * batch_size, tf.int32), 0],
                                  [tf.cast(batch_size, tf.int32), -1])

            # Here's where you'd define the model as a function of the weights and biases
            # above and dat_batch.
            # model = <insert here>

            loss = tf.reduce_mean(tf.squared_difference(model, targ_batch))
            optimizer = tf.train.AdagradOptimizer(learning_rate=0.01)  # for example
            train_op = optimizer.minimize(loss, name='optimizer')

            # control_dependencies ensures that train_op is run before return,
            # even though the return values don't explicitly depend on it.
            with tf.control_dependencies([train_op]):
                return batchCount + 1, tf.concat([lossList, [[loss]]], 0)

        # So, the idea is that this trains a full epoch without returning to Python.
        trainMinibatches = tf.while_loop(moreMinibatches, runMinibatch,
                                         [batchCounter, lossList],
                                         shape_invariants=[batchCounter.get_shape(),
                                                           tf.TensorShape(None)])

    return (graph,
            {'trainMinibatches': trainMinibatches,
             'batchCounter': batchCounter,
             # 'norm_loss': norm_loss,  # defined in the omitted normalization code
            })

numEpochs = 100       # e.g.
minibatchSize = 32    #
# training_dataset = <data here>
# training_targets = <targets here>

graph, ops = buildModel(info, training_dataset, training_targets)

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    for i in range(numEpochs):
        # This op trains on all the minibatches that fit in the full dataset. finalBatchCount
        # will be the number of complete minibatches in the dataset; lossList holds each
        # minibatch's loss.
        finalBatchCount, lossList = session.run(ops['trainMinibatches'],
                                                feed_dict={'batch_size:0': minibatchSize})
        print('minibatch losses at epoch', i, ': ', lossList)
I implemented the tf.slice() and tf.while_loop approach suggested above to vectorize the mini-batch loop.
In my case the performance was about 1.86 times faster than feeding mini-batches through feed_dict, but I found a problem: the loss values of each epoch did not stabilize.
Then I changed the code to tf.random_shuffle the inputs every epoch, and the problem was much mitigated (though the performance gain was reduced to 1.68 times).
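For reference, a sketch of the per-epoch shuffling I mean: shuffle an index vector with tf.random_shuffle and tf.gather the rows, so the data and the targets stay aligned (names follow the schematic code above):
perm = tf.random_shuffle(tf.range(tf.shape(tf_training_data)[0]))
shuffled_data = tf.gather(tf_training_data, perm)
shuffled_targets = tf.gather(tf_training_targets, perm)
# Evaluate these (or tf.assign them to Variables) once per epoch,
# before running the tf.while_loop for that epoch.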