Outputting batch/epoch training loss during `tf.train.MonitoredTrainingSession` - tensorflow

I would like to output my loss with MonitoredTrainingSession every epoch or batch.
Ideally I would love to get a flag that the epoch has ended, or be able to provide a callback like in Keras. I see that I can also do it by manually counting steps, but I want to use the TF functionality, which still seems poorly documented.
From what I could find in the documentation, one can use tf.train.LoggingTensorHook to print tensors every n steps.
The problem, however, is that it prints at a frequency different from the one I request. When I run the following with every_n_iter=4, I get output every 2nd iteration:
tf.reset_default_graph()
with g.as_default():
    loghook = tf.train.LoggingTensorHook([tf.reduce_mean(loss, name='m_loss')],
                                         every_n_iter=4,
                                         formatter=lambda x: "LOSS\t%.4f" % [tt for kk,tt in x.items() if kk.name.startswith('m_loss')][-1]
                                         )
    optimizer = get_optimizer(lr=lr, opt_name = opt_name)
    training_op = optimizer.minimize(loss)
    init_op = tf.global_variables_initializer()
    with tf.Session(graph=g) as sess:
        sess.run(init_op)
    with tf.train.MonitoredTrainingSession(log_step_count_steps=1, hooks=[loghook]) as sess:
        losslist = []
        while not sess.should_stop():
            print('.')
            loss_ = sess.run(loss, feed_dict={K.learning_phase():1})
            sess.run(training_op)
            losslist.append(np.mean(loss_))
I am getting output like:
.
INFO:tensorflow:LOSS 2.2416
.
.
INFO:tensorflow:LOSS 2.1547
.
.
INFO:tensorflow:LOSS 2.1186
.
.
etc. That is, it outputs every 2nd step, not every 4th.
The documentation says:
every_n_iter: `int`, print the values of `tensors` once every N local
steps taken on the current worker.
I am running it on one local machine. Why does one Python loop iteration count as two "local steps"? Why two and not five?
Looking at the Python source does not seem to help. Are any Google folks aware of what it is doing?

"local step" is incremented on every call to sess.run(). You are calling sess.run() twice within your while loop.
Here are some pointers to relevant code:
https://github.com/tensorflow/tensorflow/blob/r1.3/tensorflow/python/training/basic_session_run_hooks.py#L255 - increment _iter_count after every call to sess.run().
https://github.com/tensorflow/tensorflow/blob/r1.3/tensorflow/python/training/basic_session_run_hooks.py#L228 - If _iter_count should trigger logging, add the current tensors to be run in the following call to sess.run() so that their values can be logged next.
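The counting behaviour described above can be reproduced with a small pure-Python model. This is only a sketch: CountingHook and simulate are hypothetical names, and the class is a toy stand-in that counts run calls, not TensorFlow's actual hook implementation.

```python
class CountingHook:
    """Toy stand-in for an every-N-iterations logging hook (not TensorFlow code)."""

    def __init__(self, every_n_iter):
        self.every_n_iter = every_n_iter
        self.iter_count = 0   # counts run() calls, not Python loop iterations
        self.fired_at = []    # loop indices at which the hook would log

    def after_run(self, loop_index):
        # Fire on run-call number 0, N, 2N, ... and then count the call.
        if self.iter_count % self.every_n_iter == 0:
            self.fired_at.append(loop_index)
        self.iter_count += 1


def simulate(runs_per_loop, num_loops, every_n_iter=4):
    """Return the loop indices at which the toy hook fires."""
    hook = CountingHook(every_n_iter)
    for loop_index in range(num_loops):
        for _ in range(runs_per_loop):
            hook.after_run(loop_index)
    return hook.fired_at


two_runs = simulate(runs_per_loop=2, num_loops=8)  # like the loop in the question
one_run = simulate(runs_per_loop=1, num_loops=8)   # a single run call per loop
print(two_runs)
print(one_run)
```

With two run calls per loop iteration the toy hook fires on every second iteration, and with one call it fires on every fourth. In the question's loop, fetching both values in one call, e.g. sess.run([loss, training_op], ...), would be one way to get a single local step per iteration.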

Related

How to get current global_step in data pipeline

I am trying to create a filter which depends on the current global_step of the training but I am failing to do so properly.
First, I cannot use tf.train.get_or_create_global_step() in the code below because it will throw
ValueError: Variable global_step already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope? Originally defined at:
This is why I tried fetching the scope with tf.get_default_graph().get_name_scope() and within that context I was able to "get" the global step:
def filter_examples(example):
    scope = tf.get_default_graph().get_name_scope()
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        current_step = tf.train.get_or_create_global_step()
    subtokens_by_step = tf.floor(current_step / curriculum_step_update)
    max_subtokens = min_subtokens + curriculum_step_size * tf.cast(subtokens_by_step, dtype=tf.int32)
    return tf.size(example['targets']) <= max_subtokens
dataset = dataset.filter(filter_examples)
The problem is that this does not work as I expected. From what I observe, current_step in the code above seems to be 0 all the time (I have not verified this; it is an assumption based on my observations).
The only thing that seems to make a difference, odd as it sounds, is restarting the training. Again based on observations, I think that in that case current_step is the actual step of the training at the moment of the restart, but the value does not update as the training continues.
Is there a way to get the actual value of the current step and use it in my filter as above?
Environment
Tensorflow 1.12.1
As we discussed in the comments, keeping and updating your own counter might be an alternative to using the global_step variable. The counter variable could be updated as follows:
op = tf.assign_add(counter, 1)
# control_dependencies expects a list of ops or tensors.
with tf.control_dependencies([op]):
    # Some operation here before which the counter should be updated
Using tf.control_dependencies lets you "attach" the update of counter to a path within the computational graph. You can then use the counter variable wherever you need it.
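As a rough illustration of the idea, here is a plain-Python sketch of a counter that is updated before each use and then drives the threshold formula from the question's filter. The helper max_subtokens_at, the example lengths and all parameter values are hypothetical, and the sketch only models the arithmetic, not TensorFlow's graph execution.

```python
def max_subtokens_at(step, min_subtokens, curriculum_step_size, curriculum_step_update):
    # Integer floor division plays the role of tf.floor(step / curriculum_step_update).
    return min_subtokens + curriculum_step_size * (step // curriculum_step_update)


counter = 0
kept = []
example_lengths = [3, 12, 14, 14, 18, 16]  # hypothetical target lengths

for length in example_lengths:
    counter += 1  # the counter is updated before the filter reads it
    limit = max_subtokens_at(counter, min_subtokens=8,
                             curriculum_step_size=4, curriculum_step_update=2)
    if length <= limit:
        kept.append(length)

print(kept)
```

If the counter never advanced, the limit would stay at min_subtokens for every example, which matches the symptom described in the question.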
If you use variables inside datasets, you need to reinitialize the iterators in TF 1.x.
iterator = tf.compat.v1.make_initializable_iterator(dataset)
init = iterator.initializer
tensors = iterator.get_next()
with tf.compat.v1.Session() as sess:
    for epoch in range(num_epochs):
        # Re-run the initializer at the start of every epoch.
        sess.run(init)
        for example in range(num_examples):
            tensor_vals = sess.run(tensors)

tf.gradients(model.output, model.input) computes a different value each time I run it

I'm trying to compute the gradient of the output layer with respect to the input layer. My neural network is relatively small (an input layer of 9 activation units and an output layer of 1), and training went fine: the test set gave very good accuracy. I built the model with Keras.
To solve my problem, I need to compute the gradient of the output with respect to the input. That is, I need to obtain the Jacobian, which has dimension [1x9]. The gradients function in TensorFlow should provide everything I need, but when I run the code below I obtain a different result every time.
output_v = model.output
input_v = model.input
gradients = tf.gradients(output_v, input_v)
sess = tf.Session()
sess.run(tf.initialize_all_variables())
print(sess.run(model.input,feed_dict={model.input:x_test_N[0:1,:]}))
evaluated_gradients = sess.run(gradients, feed_dict={model.input: x_test_N[0:1,:]})
print(evaluated_gradients)
sess.close()
The first print command shows this value every time I run it (just to make sure that the input values are not modified):
[[-1.4306372 -0.1272892 0.7145787 1.338818 -1.2957293 -0.5402862 -0.7771702 -0.5787912 -0.9157122]]
But the second print shows different ones:
[[ 0.00175761, -0.0490326 , -0.05413761, 0.09952173, 0.06112418, -0.04772799, 0.06557006, -0.02473242, 0.05542536]]
[[-0.00416433, 0.08235116, -0.00930298, 0.04440641, 0.03752216, 0.06378302, 0.03508484, -0.01903783, -0.0538374 ]]
Using finite differences, evaluated_gradients[0,0] = 0.03565103, which isn't close to any of the first values previously printed.
Thanks for your time!
Alberto
Solved by creating a dedicated session just before training my model:
sess = tf.Session()
sess.run(tf.global_variables_initializer())
K.set_session(sess)
history = model.fit(x_train_N, y_train_N, epochs=n_epochs,
                    validation_split=split, verbose=1, batch_size=n_batch_size,
                    shuffle=True, callbacks=[early_stop, tensorboard])
And evaluating the gradient after training, while that session is still open:
evaluated_gradients = sess.run(K.gradients(model.output, model.input), feed_dict={model.input: x_test_N})
Presumably your network is set up to initialize its weights to random values. When you run sess.run(tf.initialize_all_variables()), you initialize the variables to new random values. You therefore get a different output_v in every run, and hence different gradients. If you want to use a model you trained before, replace the initialize_all_variables() call with a restore command. I am not familiar with how this is done in Keras, since I usually work directly with TensorFlow, but that is what I would try.
Also note that initialize_all_variables is deprecated and you should use global_variables_initializer instead.
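The effect can be illustrated without TensorFlow. The sketch below uses a hypothetical one-unit tanh model with 9 inputs (not the asker's network): with fixed weights the gradient is reproducible and agrees with central finite differences, whereas re-drawing the weights gives a different gradient. All names, seeds and input values are assumptions made for illustration.

```python
import math
import random


def model(weights, x):
    # Toy model: a single tanh unit over 9 inputs.
    return math.tanh(sum(w * xi for w, xi in zip(weights, x)))


def analytic_gradient(weights, x):
    # d/dx_i tanh(w . x) = (1 - tanh(w . x)**2) * w_i
    y = model(weights, x)
    return [(1.0 - y * y) * w for w in weights]


def finite_difference_gradient(weights, x, eps=1e-6):
    # Central differences, one input coordinate at a time.
    grads = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grads.append((model(weights, up) - model(weights, down)) / (2 * eps))
    return grads


def random_weights(seed):
    rng = random.Random(seed)
    return [rng.uniform(-0.5, 0.5) for _ in range(9)]


x = [-1.43, -0.13, 0.71, 1.34, -1.30, -0.54, -0.78, -0.58, -0.92]

trained = random_weights(seed=0)         # stand-in for trained weights
grad_a = analytic_gradient(trained, x)
grad_b = analytic_gradient(trained, x)   # same weights, same gradient
fd = finite_difference_gradient(trained, x)
max_error = max(abs(a - b) for a, b in zip(grad_a, fd))

reinitialised = random_weights(seed=1)   # models re-running a random initializer
grad_c = analytic_gradient(reinitialised, x)

print(max_error)
print(grad_a != grad_c)
```

A finite-difference check like this is only meaningful when both gradients are evaluated against the same weights.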

tensorflow Dataset order undefined?

If I use multiple elements from a tf.data.Dataset to build the graph and then evaluate the graph later, the order of the elements taken from the Dataset seems to be undefined. As an example, the following code snippet
import tensorflow as tf
dataset = tf.data.Dataset.range(5)
iterator = dataset.make_one_shot_iterator()
print('build graph and then eval')
keep = []
for i in range(5):
    keep.append(iterator.get_next())
with tf.Session() as sess:
    keep_eval = sess.run(keep)
    print(keep_eval)
print('eval each element')
with tf.Session() as sess:
    for i in range(5):
        print(sess.run(iterator.get_next()), end=' ')
will result in output like:
build graph and then eval
[3 0 1 4 2]
eval each element
0 1 2 3 4
Also, each run yields a different result for "build graph and then eval".
I would expect "build graph and then eval" to be ordered as well, like "eval each element". Can anyone explain why this happens?
The order of a tf.data.Dataset is defined and deterministic (unless you add a non-deterministic Dataset.shuffle()).
However, your two loops build different graphs, which accounts for the difference:
The "build graph and then eval" part creates a list of five iterator.get_next() operations and runs the five operations in parallel. Because these operations run in parallel, they may produce results in different order.
The "eval each element" part also creates five iterator.get_next() operations, but it runs them sequentially, so you always get the results in the expected order.
Note that we do not recommend calling iterator.get_next() in a loop, because it creates a new operation on each call, which gets added to the graph, and consumes memory. Instead, when you loop over a Dataset, try to use the following pattern:
dataset = tf.data.Dataset.range(5)
iterator = dataset.make_one_shot_iterator()
# Call `iterator.get_next()` once and use the result in each iteration.
next_element = iterator.get_next()
with tf.Session() as sess:
    for i in range(5):
        print(sess.run(next_element))
From the TensorFlow FAQ:
The individual ops have parallel implementations, using multiple cores in a CPU, or multiple threads in a GPU.
So your "build graph and then eval" call runs the elements of the list in parallel, which is why the numbers come out in random order, while the for loop forces one call to run after another, so it is serial. You can verify this by timing both: the first should be fast, and the for loop slower.
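A loose analogy in plain Python, using threads rather than TensorFlow ops: five workers pull from one shared iterator, and which worker receives which element depends on scheduling. The names source, slots and get_next are hypothetical, and this is a sketch of the idea, not of how TensorFlow schedules ops.

```python
import threading

source = iter(range(5))   # one shared, ordered source, like a single iterator
lock = threading.Lock()   # makes each pull from the source atomic
slots = [None] * 5        # slot i stands for the i-th get_next() op


def get_next(slot):
    # Each worker takes whatever element is next when it gets to run.
    with lock:
        slots[slot] = next(source)


threads = [threading.Thread(target=get_next, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(slots)
```

Every element is handed out exactly once, but nothing guarantees that slot i receives element i. Pulling the elements one after another in a single loop removes that freedom.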

TensorFlow: Read batch features to an array

I am using tf.contrib.learn.ReadBatchFeatures (https://www.tensorflow.org/versions/master/api_docs/python/contrib.learn/input_processing#read_batch_features) to read in Example protos as part of my input function, which returns a dict of Tensor objects. After training my model, calling predict on my Estimator returns one batch of predictions as an array, which I would like to compare to the known values.
I try to obtain the known values by calling tf.Session().run(labels), where labels is a Tensor of known values, returned from the input function. However, at this point, my program hangs. I suspect it is stuck in an infinite loop reading labels from the disk, rather than just reading one batch as I would like.
Is this the correct way to obtain one batch of values in the labels Tensor?
Edit: I have tried to start the queue runners, is the following correct?
_, labels = eval_input_fn()
with tf.Session().as_default():
    tf.local_variables_initializer()
    tf.train.start_queue_runners()
    label_values = labels.eval()
    print(label_values)
The whole setup you need is:
_, labels = eval_input_fn()
with tf.Session() as sess:
    # The initializer ops must actually be run, not just created.
    sess.run([
        tf.local_variables_initializer(),
        tf.global_variables_initializer()
    ])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        # Each sess.run(labels) call yields one batch of label values.
        while not coord.should_stop():
            print(sess.run(labels))
    except tf.errors.OutOfRangeError as error:
        coord.request_stop(error)
    finally:
        coord.request_stop()
        coord.join(threads)

What does global_step mean in Tensorflow?

In this tutorial code from the TensorFlow website,
could anyone help explain what global_step means?
The TensorFlow website says that the global step is used to count training steps, but I don't quite get what exactly that means.
Also, what does the number 0 mean when setting up global_step?
def training(loss, learning_rate):
    tf.summary.scalar('loss', loss)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    # Why 0 as the first parameter of the global_step tf.Variable?
    global_step = tf.Variable(0, name='global_step', trainable=False)
    train_op = optimizer.minimize(loss, global_step=global_step)
    return train_op
According to the TensorFlow docs, global_step is an optional variable "to increment by one after the variables have been updated". Does that mean that after one update global_step becomes 1?
global_step refers to the number of batches seen by the graph. Every time a batch is provided, the weights are updated in the direction that minimizes the loss, and global_step keeps track of the number of batches seen so far. When it is passed in the minimize() argument list, the variable is increased by one each time the resulting training op runs. Have a look at optimizer.minimize().
You can get the global_step value using tf.train.global_step().
Also handy are the utility methods tf.train.get_global_step or tf.train.get_or_create_global_step.
0 is the initial value of the global step in this context.
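A toy gradient-descent loop in plain Python shows the bookkeeping. The loss w**2, the learning rate and the number of batches are assumptions chosen for illustration, and the integer global_step here only mimics what the TensorFlow variable does.

```python
learning_rate = 0.1
w = 5.0          # a single toy weight
global_step = 0  # starts at 0 because no update has happened yet

history = []
for batch in range(5):
    grad = 2.0 * w             # gradient of the toy loss w**2
    w -= learning_rate * grad  # the weight update
    global_step += 1           # incremented once per update
    history.append(global_step)

print(history)
print(w)
```

After one update the counter reads 1, after five it reads 5, which is how the step index can serve as the x-axis for loss curves.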
The global_step Variable holds the total number of steps taken during training across all tasks (each step index occurs on only a single task).
A timeline built from global_step helps us see where we are in the grand scheme, from each of the tasks separately. For instance, the loss and accuracy could be plotted against global_step in TensorBoard.
An example is shown below.
Code:
train_op = tf.train.GradientDescentOptimizer(learning_rate=LEARNING_RATE).minimize(loss_tensor, global_step=tf.train.create_global_step())
with tf.Session() as sess:
    ...
    tf.logging.log_every_n(tf.logging.INFO, "np.mean(loss_evl)= %f at step %d", 100, np.mean(loss_evl), sess.run(tf.train.get_global_step()))
Corresponding output:
INFO:tensorflow:np.mean(loss_evl)= 1.396970 at step 1
INFO:tensorflow:np.mean(loss_evl)= 1.221397 at step 101
INFO:tensorflow:np.mean(loss_evl)= 1.061688 at step 201
There are networks, e.g. GANs, that may need two (or more) different step counters. Training a GAN with the WGAN specification requires more steps on the discriminator (or critic) D than on the generator G. In that case, it is useful to declare separate global_step variables.
Example (G_loss and D_loss are the losses of the generator and the discriminator):
G_global_step = tf.Variable(0, name='G_global_step', trainable=False)
D_global_step = tf.Variable(0, name='D_global_step', trainable=False)
minimizer = tf.train.RMSPropOptimizer(learning_rate=0.00005)
G_solver = minimizer.minimize(G_loss, var_list=params, global_step=G_global_step)
D_solver = minimizer.minimize(D_loss, var_list=params, global_step=D_global_step)
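How the two counters drift apart can be sketched in plain Python. The value n_critic = 5 and the number of generator updates are assumptions for illustration, and the loop body stands in for running D_solver and G_solver without doing any real training.

```python
n_critic = 5           # assumed number of critic updates per generator update
generator_updates = 3  # assumed number of outer training iterations

G_global_step = 0
D_global_step = 0

for _ in range(generator_updates):
    for _ in range(n_critic):
        D_global_step += 1  # stands in for one run of D_solver
    G_global_step += 1      # stands in for one run of G_solver

print(G_global_step, D_global_step)
```

With a single shared counter the two kinds of update would be indistinguishable. Keeping them separate lets each loss be logged against its own step count.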