Getting runtime statistics with MonitoredTrainingSession in TensorFlow

I am trying to profile my TensorFlow code (runtime and memory consumption of each layer in the network) by following the runtime statistics instructions here. As far as I understand, I need to create run options and run metadata like this
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
and pass them to sess.run.
However, as I am also trying to use tf.train.MonitoredTrainingSession, I don't know whether I can pass the same things into this class. A plausible approach could make use of hooks, but I do not know how to use them, as I am still very new to them.

You can simply create a custom hook and pass it to the MonitoredTrainingSession. There is no need to pass your own tf.RunMetadata() instance to the run call.
Here is an example Hook which stores metadata every N steps to ckptdir:
import tensorflow as tf


class TraceHook(tf.train.SessionRunHook):
    """Hook to perform Traces every N steps."""

    def __init__(self, ckptdir, every_step=50, trace_level=tf.RunOptions.FULL_TRACE):
        self._trace = every_step == 1
        self.writer = tf.summary.FileWriter(ckptdir)
        self.trace_level = trace_level
        self.every_step = every_step

    def begin(self):
        self._global_step_tensor = tf.train.get_global_step()
        if self._global_step_tensor is None:
            raise RuntimeError("Global step should be created to use _TraceHook.")

    def before_run(self, run_context):
        if self._trace:
            options = tf.RunOptions(trace_level=self.trace_level)
        else:
            options = None
        return tf.train.SessionRunArgs(fetches=self._global_step_tensor,
                                       options=options)

    def after_run(self, run_context, run_values):
        global_step = run_values.results - 1
        if self._trace:
            self._trace = False
            self.writer.add_run_metadata(run_values.run_metadata,
                                         f'{global_step}', global_step)
        if not (global_step + 1) % self.every_step:
            self._trace = True
In before_run it checks whether it has to trace and, if so, adds the RunOptions. In after_run it checks whether the next run call needs to be traced and, if so, sets _trace to True again. Additionally, it stores the metadata when it is available.
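For completeness, here is a usage sketch. The names loss, the directory path, and the optimizer are placeholders for your own setup, not part of the original answer; a global step must exist because the hook's begin checks for it.
# Usage sketch with assumed placeholders: `loss` is your model's loss tensor
# and '/tmp/traindir' stands in for your checkpoint/summary directory.
global_step = tf.train.get_or_create_global_step()
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss, global_step=global_step)

hooks = [TraceHook(ckptdir='/tmp/traindir', every_step=100)]
with tf.train.MonitoredTrainingSession(checkpoint_dir='/tmp/traindir',
                                       hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)  # run metadata is written every 100th step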

Related

Implement early stopping in tf.estimator.DNNRegressor using the available training hooks

I am new to TensorFlow and want to implement early stopping in tf.estimator.DNNRegressor with the available training hooks for the MNIST dataset. The early stopping hook should stop training if the loss does not improve for some specified number of steps. The TensorFlow documentation only provides an example for logging hooks. Can someone write a code snippet for implementing it?
Here is an EarlyStoppingHook sample implementation:
import numpy as np
import tensorflow as tf
import logging
from tensorflow.python.training import session_run_hook


class EarlyStoppingHook(session_run_hook.SessionRunHook):
    """Hook that requests stop at a specified step."""

    def __init__(self, monitor='val_loss', min_delta=0, patience=0,
                 mode='auto'):
        """
        Args:
            monitor: name of the tensor (or op) to monitor, e.g. the loss.
            min_delta: minimum change to qualify as an improvement.
            patience: number of steps with no improvement before stopping.
            mode: one of 'auto', 'min', 'max'.
        """
        self.monitor = monitor
        self.patience = patience
        self.min_delta = min_delta
        self.wait = 0
        if mode not in ['auto', 'min', 'max']:
            logging.warning('EarlyStopping mode %s is unknown, '
                            'fallback to auto mode.', mode)
            mode = 'auto'

        if mode == 'min':
            self.monitor_op = np.less
        elif mode == 'max':
            self.monitor_op = np.greater
        else:
            if 'acc' in self.monitor:
                self.monitor_op = np.greater
            else:
                self.monitor_op = np.less

        if self.monitor_op == np.greater:
            self.min_delta *= 1
        else:
            self.min_delta *= -1

        self.best = np.Inf if self.monitor_op == np.less else -np.Inf

    def begin(self):
        # Convert names to tensors if given
        graph = tf.get_default_graph()
        self.monitor = graph.as_graph_element(self.monitor)
        if isinstance(self.monitor, tf.Operation):
            self.monitor = self.monitor.outputs[0]

    def before_run(self, run_context):  # pylint: disable=unused-argument
        return session_run_hook.SessionRunArgs(self.monitor)

    def after_run(self, run_context, run_values):
        current = run_values.results

        if self.monitor_op(current - self.min_delta, self.best):
            self.best = current
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                run_context.request_stop()
This implementation is based on the Keras EarlyStopping implementation.
To use it with the CNN MNIST example, create the hook and pass it to train:
early_stopping_hook = EarlyStoppingHook(monitor='sparse_softmax_cross_entropy_loss/value',
                                        patience=10)

mnist_classifier.train(
    input_fn=train_input_fn,
    steps=20000,
    hooks=[logging_hook, early_stopping_hook])
Here sparse_softmax_cross_entropy_loss/value is the name of the loss op in that example.
EDIT 1:
It looks like there is no "official" way of finding the loss node when using estimators (or at least I couldn't find one).
For the DNNRegressor this node has name dnn/head/weighted_loss/Sum.
Here is how to find it in the graph:
Start TensorBoard in the model directory. In my case I didn't set any directory, so the estimator used a temporary directory and printed this line:
WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmpInj8SC
Start tensorboard:
tensorboard --logdir /tmp/tmpInj8SC
Open it in browser and navigate to GRAPHS tab.
Find loss in the graph. Expand blocks in the sequence: dnn → head → weighted_loss and click on the Sum node (note that there is summary node named loss connected to it).
The name shown in the info window on the right is the name of the selected node; this is what needs to be passed to the monitor argument of EarlyStoppingHook.
The loss node of DNNClassifier has the same name by default. Both DNNClassifier and DNNRegressor have an optional loss_reduction argument that influences the loss node name and behavior (it defaults to losses.Reduction.SUM).
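For example, assuming the hook above and a DNNRegressor left at its default loss_reduction, it could be wired up like this (estimator and train_input_fn here are placeholders for your own objects):
# Sketch: monitor the default DNNRegressor loss node found above.
early_stopping_hook = EarlyStoppingHook(monitor='dnn/head/weighted_loss/Sum',
                                        patience=10)
estimator.train(input_fn=train_input_fn,
                steps=20000,
                hooks=[early_stopping_hook])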
EDIT 2:
There is a way of finding the loss without looking at the graph: you can use the GraphKeys.LOSSES collection to get it. However, this only works after training has started, so you can only use it from inside a hook.
For example, you can remove the monitor argument from the EarlyStoppingHook class and change its begin function to always use the first loss in the collection:
self.monitor = tf.get_default_graph().get_collection(tf.GraphKeys.LOSSES)[0]
You also probably need to check that there is a loss in the collection.
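A sketch of what that begin method could look like, including that check (the rest of the hook stays as above):
    def begin(self):
        # Take the first loss registered in the LOSSES collection instead of
        # resolving a user-supplied name. This only works once the estimator
        # has built its graph, i.e. from inside the hook.
        losses = tf.get_default_graph().get_collection(tf.GraphKeys.LOSSES)
        if not losses:
            raise RuntimeError("No loss found in the GraphKeys.LOSSES collection.")
        self.monitor = losses[0]
        if isinstance(self.monitor, tf.Operation):
            self.monitor = self.monitor.outputs[0]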

Tensorflow - running total

How can I add the number 5 after every iteration of the loop?
I want to do something like this:
weight = 0.225

for i in range(10):
    weight += 5
    print(weight)
Here is how I am trying to do it in TensorFlow, but it never updates the weight:
import tensorflow as tf


def dummy(x):
    weights['h0'] = tf.add(weights['h0'], 5)
    res = tf.add(weights['h0'], x)
    return res


weights = {
    'h0': tf.Variable(tf.random_normal([1]))
}

# build computational graph
a = tf.placeholder('float', None)

d = dummy(a)

# initialize variables
init = tf.global_variables_initializer()

# create session and run the graph
with tf.Session() as sess:
    sess.run(init)

    for i in range(10):
        print(sess.run(d, feed_dict={a: [2]}))

# close session
sess.close()
There's an operation explicitly created for adding a value and assigning the result back to the input node: tf.assign_add.
You should use it instead of tf.assign + tf.add.
More importantly, you should understand why your previous code doesn't work.
weights['h0'] = tf.add(weights['h0'], 5)
res = tf.add(weights['h0'], x)
On the first line, you're defining an add node whose inputs are weights['h0'] and 5, and you're assigning this node to the Python variable weights['h0'].
So weights['h0'] is now a Python variable holding a TensorFlow node.
On the next line, you're defining another add node, between the previous node and x, and you return this node.
When the graph is evaluated, you evaluate the node pointed to by res, which forces the evaluation of the previous node (because res is a function of the node held by weights['h0']).
The problem is that your assignment on the first line is a Python assignment, not a TensorFlow assignment.
That assignment is executed only in the Python environment; it does not define an assign node in the TensorFlow graph.
P.S.: when you use with, you're defining a context manager that handles the closing operations for you. You can therefore remove sess.close(), because it is executed automatically when you exit that context.
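For illustration, here is a minimal sketch of the example rewritten with tf.assign_add (same TF 1.x API as in the question); running res now increments the variable in the graph on every call:
import tensorflow as tf

weights = {'h0': tf.Variable(tf.random_normal([1]))}
x = tf.placeholder('float', None)

# assign_add creates a real update node in the graph: every time it is
# evaluated, the variable is incremented by 5 and the new value is returned.
increment_op = tf.assign_add(weights['h0'], [5.0])
res = tf.add(increment_op, x)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(10):
        print(sess.run(res, feed_dict={x: [2]}))  # result grows by 5 each call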
Apparently there is an assign operator
https://www.tensorflow.org/api_docs/python/tf/assign
weights['h0'] = tf.assign(weights['h0'], tf.add(weights['h0'], 5))

How to understand sess.as_default() and sess.graph.as_default()?

I read the docs of sess.as_default()
N.B. The default session is a property of the current thread. If you create a new thread, and wish to use the default session in that thread, you must explicitly add a with sess.as_default(): in that thread's function.
My understanding is that if there are two or more sessions when a new thread is created, we must choose which session will run the TensorFlow code in that thread. So, to do this, a session is chosen and as_default() is called.
N.B. Entering a with sess.as_default(): block does not affect the current default graph. If you are using multiple graphs, and sess.graph is different from the value of tf.get_default_graph, you must explicitly enter a with sess.graph.as_default(): block to make sess.graph the default graph.
So inside a sess.as_default() block, does one have to call sess.graph.as_default() to run a specific graph?
The tf.Session API mentions that a graph is launched in a session. The following code illustrates this:
import tensorflow as tf

graph1 = tf.Graph()
graph2 = tf.Graph()

with graph1.as_default() as graph:
    a = tf.constant(0, name='a')
    graph1_init_op = tf.global_variables_initializer()

with graph2.as_default() as graph:
    a = tf.constant(1, name='a')
    graph2_init_op = tf.global_variables_initializer()

sess1 = tf.Session(graph=graph1)
sess2 = tf.Session(graph=graph2)
sess1.run(graph1_init_op)
sess2.run(graph2_init_op)

# Both tensor names are a!
print(sess1.run(graph1.get_tensor_by_name('a:0')))  # prints 0
print(sess2.run(graph2.get_tensor_by_name('a:0')))  # prints 1

with sess1.as_default() as sess:
    print(sess.run(sess.graph.get_tensor_by_name('a:0')))  # prints 0

with sess2.as_default() as sess:
    print(sess.run(sess.graph.get_tensor_by_name('a:0')))  # prints 1

with graph2.as_default() as g:
    with sess1.as_default() as sess:
        print(tf.get_default_graph() == graph2)  # prints True
        print(tf.get_default_session() == sess1)  # prints True

        # This is the interesting line
        print(sess.run(sess.graph.get_tensor_by_name('a:0')))  # prints 0
        print(sess.run(g.get_tensor_by_name('a:0')))  # fails

print(tf.get_default_graph() == graph2)  # prints False
print(tf.get_default_session() == sess1)  # prints False
You don't need to call sess.graph.as_default() to run the graph, but you need to get the correct tensors or operations in the graph to run it. The context allows you to get the graph or session using tf.get_default_graph or tf.get_default_session.
In the interesting line above, the default session is sess1 and it is implicitly calling sess1.graph, which is the graph in sess1, which is graph1, and hence it prints 0.
In the line following that, it fails because it is trying to run an operation in graph2 with sess1.

A custom alternating update rule with keras

I would like to use an alternating update rule with keras.
I.e. per-batch I would like to call a regular gradient-based step, and next call a custom step.
I thought about implementing it by either inheriting from an optimizer or from a callback (and using the on-batch calls). However, neither will do, because they both lack the batch data and batch labels (and I need both).
Any idea how to implement a custom alternating update with Keras?
If required, I don't mind directly calling TensorFlow-specific methods, as long as I can keep the project wrapped in the Keras framework (with model.fit, model.predict, ...).
Try creating a custom callback:
import json
import keras.callbacks as callbacks


class JSONMetrics(callbacks.Callback):

    _model = None
    _each_epoch = None
    _metrics = None
    _epoch = None
    _file_json = None

    def __init__(self, model, each_epoch, logger=None):
        self._file_json = "file_log.json"
        self._model = model
        self._each_epoch = each_epoch
        self._epoch = 0
        self._metrics = {'loss': [], 'acc': []}
        self._logger = logger

    def on_epoch_begin(self, epoch, logs):
        # print('Epoch {0} begin'.format(epoch))
        try:
            with open(self._file_json, 'r') as f:
                self._metrics = json.load(f)
        except IOError:
            # No previous log file yet; keep the empty metrics dict.
            pass

    def on_epoch_end(self, epoch, logs):
        if self._logger is not None:
            self._logger.info('Nemesis: Epoch {0} end'.format(epoch))
        self._metrics['loss'].append(logs.get('loss'))
        self._metrics['acc'].append(logs.get('acc'))
        with open(self._file_json, 'w') as f:
            json.dump(self._metrics, f)

        if self._epoch % self._each_epoch == 0:
            file_name = 'weights%08d.h5' % self._epoch
            # print('Saving weights at {0} file'.format(file_name))
            self._model.save_weights(file_name)

        self._epoch += 1
You can access self.model inside the callback to solve your problem, and save the acc and loss, for example.
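A usage sketch, where model, x_train and y_train are placeholders for your own Keras model and training data:
# Hypothetical usage: pass the callback to model.fit so its on_epoch_*
# methods run during training.
metrics_callback = JSONMetrics(model, each_epoch=5)
model.fit(x_train, y_train,
          epochs=20,
          batch_size=32,
          callbacks=[metrics_callback])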

Dequeueing from RandomShuffleQueue does not reduce size

In order to train a model I have encapsulated my model in a class.
I use a tf.RandomShuffleQueue to enqueue a list of filenames.
However, when I dequeue elements they do get dequeued, but the size of the queue does not decrease.
Following are more specific questions, followed by the code snippet:
If I have only 5 images, for example, but the steps range up to 100, would this result in addfilenames being called repeatedly and automatically? It does not give me any error on dequeuing, so I am thinking that it is getting called automatically.
Why does the size of the tf.RandomShuffleQueue not change? It remains constant.
import os
import time
import functools
import tensorflow as tf
from Read_labelclsloc import readlabel


def ReadTrain(traindir):
    # Returns a list of training images, their labels and a dictionary.
    # The dictionary maps label names to integer numbers.
    return trainimgs, trainlbls, classdict


def ReadVal(valdir, classdict):
    # Reads the validation image labels.
    # Returns a dictionary with filenames as keys and
    # corresponding labels as values.
    return valdict


def lazy_property(function):
    # Just a decorator to make sure that on repeated calls to
    # member functions, ops don't get created repeatedly.
    # Acknowledgements : https://danijar.com/structuring-your-tensorflow-models/
    attribute = '_cache_' + function.__name__

    @property
    @functools.wraps(function)
    def decorator(self):
        if not hasattr(self, attribute):
            setattr(self, attribute, function(self))
        return getattr(self, attribute)

    return decorator


class ModelInitial:

    def __init__(self, traindir, valdir):
        self.graph
        self.traindir = traindir
        self.valdir = valdir
        self.traininginfo()
        self.epoch = 0

    def traininginfo(self):
        self.trainimgs, self.trainlbls, self.classdict = ReadTrain(self.traindir)
        self.valdict = ReadVal(self.valdir, self.classdict)
        with self.graph.as_default():
            self.trainimgs_tensor = tf.constant(self.trainimgs)
            self.trainlbls_tensor = tf.constant(self.trainlbls, dtype=tf.uint16)
            self.trainimgs_dict = {}
            self.trainimgs_dict["ImageFile"] = self.trainimgs_tensor
        return None

    @lazy_property
    def graph(self):
        g = tf.Graph()
        with g.as_default():
            # Layer definitions go here
            return g

    @lazy_property
    def addfilenames(self):
        # This is the function where filenames are pushed to a RandomShuffleQueue
        filename_queue = tf.RandomShuffleQueue(capacity=len(self.trainimgs), min_after_dequeue=0,
                                               dtypes=[tf.string], names=["ImageFile"],
                                               seed=0, name="filename_queue")
        sz_op = filename_queue.size()
        dq_op = filename_queue.dequeue()
        enq_op = filename_queue.enqueue_many(self.trainimgs_dict)
        return filename_queue, enq_op, sz_op, dq_op

    def Train(self):
        # The function for training.
        # I have not written the training part yet.
        # Still struggling with preprocessing
        with self.graph.as_default():
            filename_q, filename_enqueue_op, sz_op, dq_op = self.addfilenames
            qr = tf.train.QueueRunner(filename_q, [filename_enqueue_op])
            filename_dequeue_op = filename_q.dequeue()
            init_op = tf.global_variables_initializer()
            sess = tf.Session(graph=self.graph)
            sess.run(init_op)
            coord = tf.train.Coordinator()
            enq_threads = qr.create_threads(sess, coord=coord, start=True)
            counter = 0
            for step in range(100):
                print(sess.run(dq_op["ImageFile"]))
                print("Epoch = %d " % (self.epoch))
                print("size = %d" % (sess.run(sz_op)))
                counter += 1
            names = [n.name for n in self.graph.as_graph_def().node]
            coord.request_stop()
            coord.join(enq_threads)
            print("Counter = %d" % (counter))
        return None


if __name__ == "__main__":
    modeltrain = ModelInitial(<Path to training images>,
                              <Path to validation images>)
    a = modeltrain.graph
    print(a)
    modeltrain.Train()
    print("Success")
The mystery is caused by the tf.train.QueueRunner that you created for the queue, which causes it to be filled in the background.
The following lines cause a background "queue runner" thread to be created:
qr = tf.train.QueueRunner(filename_q, [filename_enqueue_op])
# ...
enq_threads = qr.create_threads(sess, coord=coord, start=True)
This thread calls filename_enqueue_op in a loop, which causes the queue to be filled up as you remove elements from it.
The background queue runner thread will almost always have a pending enqueue operation (filename_enqueue_op) on the queue. This means that after you dequeue a filename, the pending enqueue will run and fill the queue back up to capacity. (Technically there is a race condition here, and you could see a size of capacity - 1, but this is quite unlikely.)
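If you drive the queue yourself without creating a QueueRunner, the size does go down as you dequeue. A standalone sketch that illustrates this (not the class from the question; the filenames are made up):
import tensorflow as tf

# Fill the queue once, then dequeue without any background QueueRunner
# refilling it, so size() drops as expected.
filenames = tf.constant(['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg', 'e.jpg'])
queue = tf.RandomShuffleQueue(capacity=5, min_after_dequeue=0,
                              dtypes=[tf.string], seed=0)
enqueue_op = queue.enqueue_many([filenames])
dequeue_op = queue.dequeue()
size_op = queue.size()

with tf.Session() as sess:
    sess.run(enqueue_op)
    for _ in range(5):
        print(sess.run(dequeue_op), sess.run(size_op))  # size: 4, 3, 2, 1, 0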