How to smoothly produce TensorFlow AUC summaries for training and test sets?

Tensorflow describes writing file summaries to visualize graph execution.
I envision three stages:
training the data (with optimization)
measuring accuracy on the training set (no optimization)
measuring accuracy on the test set (no optimization!)
I'd like all stages in the same script, as in the evaluate function of the wide_and_deep tutorial, but with the low-level API. I'd like three different graphs for stats like loss or AUC, one for each stage.
Suppose I use one session, and in each stage I define an AUC summary op:
# define auc
auc, auc_op = tf.metrics.auc(labels, predictions)
# summary scalar to track it
tf.summary.scalar("auc", auc_op, family=family_name)
# merge all summaries for evaluation and later writing
summary_op = tf.summary.merge_all()
...
summary_writer.add_summary(summary, step_num)
There are three graphs, but the first graph has all three runs on it, and the second graph has the last two runs (see below). What's worse, each stage starts from the previous state. This makes sense, because all the variables from the previous stages are still around.
I could use a different session for each stage, but that would throw away the model as well.
What is the smooth way to handle this?
I'd like to just clear some of the summary variables. I've tried re-initializing some variables, looked at related questions, read about name scope and variable scope and tried not to re-use variables for AUC, read about variables and sharing, looked into pruning nodes (though I don't understand it), etc. I have not made it work yet.
I am using the low-level API. I saw something like this in the high-level API in _eval_metric_ops, but I don't understand how they 'clear' the different stages. With name_scope?
Do I have to save and load the model into a new session just for this, or is there some clean way to graph each summary separately?

The metric ops will be local variables, so you could run tf.local_variables_initializer() in your Session, which will reset all of your metrics. You could also look through the local variables collection for those with "auc" in the name if you wanted to be a bit more discerning. The high-level way to do this would be to use an Estimator, which will manage metrics for you.
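To see why the curves accumulate across stages, here is a framework-free sketch of a streaming metric. The class StreamingAccuracy and its methods are hypothetical stand-ins, not TensorFlow API: its two counters play the role of the local variables that a tf.metrics op creates, and reset() plays the role of running tf.local_variables_initializer() between stages.

```python
class StreamingAccuracy:
    """Hypothetical stand-in for a streaming metric such as tf.metrics.auc."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Analogous to re-running the initializer for the metric's local variables.
        self.correct = 0
        self.total = 0

    def update(self, labels, predictions):
        # Analogous to running the metric's update op on one batch.
        self.correct += sum(int(l == p) for l, p in zip(labels, predictions))
        self.total += len(labels)
        return self.value()

    def value(self):
        return self.correct / self.total if self.total else 0.0


metric = StreamingAccuracy()

# Stage 1: every training prediction is right.
train_acc = metric.update([1, 0, 1, 1], [1, 0, 1, 1])

# Without a reset, stage 2 inherits the stage 1 counts.
leaky_test_acc = metric.update([1, 0, 1, 1], [0, 1, 0, 1])

# With a reset, stage 2 reflects only the test batch.
metric.reset()
clean_test_acc = metric.update([1, 0, 1, 1], [0, 1, 0, 1])
```

In the TensorFlow 1.x setup from the question, the corresponding step would be sess.run(tf.local_variables_initializer()) before each stage. To reset only the AUC state, one option is to filter tf.local_variables() for names containing "auc" and pass that list to tf.variables_initializer.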

Related

Training multiple Keras models in one script

I want to train different Keras models (or in some cases just multiple runs of the same model to compare the results) in a queue (using TensorFlow as the backend if that matters). In my current setup I create and fit all of these models in one big python script, e.g. (in a simplified way):
for i in range(10):
    model = create_model(i)
    model.compile(...)
    model.fit(...)
    some_function_to_save_model(model)
The create_model(i) function creates the specific model for the i'th run. This includes changing the number of inputs / labels for example. The compile function can be different (e.g. different optimizer) for each run as well.
While this code works for me and I have not found any problems, I am unclear if this is the correct way to do it because all of the models reside in the same TensorFlow Graph (if I understand the way Keras / TensorFlow work together correctly). My questions are:
Is this the correct way to run multiple independent models? (I do not want the i'th run to have any influence on the i+1'th run.)
Is running the models from different Python scripts (in this example model1.py, model2.py, ... model9.py) technically better in any way (I am not referring to readability / reproducibility here), because each model would then have its own separate TensorFlow Graph / Session?
Does clearing the Session / deleting the Graph via keras.backend.clear_session() have any influence in this case if it is run after the save function (some_function_to_save_model() inside the for loop)? Is this in some way beneficial compared to the current setup?
Once again: I am not concerned with the problems that might arise from messy code if all models are crammed together in one script instead of having a single script per model. I am only concerned with creating and training the models independently.
Unfortunately I did not find a concise answer to this (only suggestions using both methods). Maybe someone here can enlighten me?
Edit: Maybe I should be more precise. Basically I would like to have a technical explanation regarding the differences (advantages & disadvantages) of the following three cases:
create_and_train.py:
for i in range(10):
    model = create_model(i)
    model.compile(...)
    model.fit(...)
    some_function_to_save_model(model)
create_and_train.py:
for i in range(10):
    model = create_model(i)
    model.compile(...)
    model.fit(...)
    some_function_to_save_model(model)
    # clear session:
    keras.backend.clear_session()
create_and_train_i.py with i in [0, 1, ..., 9]:
i = 5 # (e.g.)
model = create_model(i)
model.compile(...)
model.fit(...)
some_function_to_save_model(model)
and e.g. a bash script that loops through these

What's the difference between tf.GraphKeys.TRAINABLE_VARIABLES and tf.GraphKeys.UPDATE_OPS in TensorFlow?

The documentation of tf.GraphKeys in TensorFlow describes, for example, TRAINABLE_VARIABLES: the subset of Variable objects that will be trained by an optimizer.
I also know tf.get_collection(), which can retrieve the tensors you want.
When using tensorflow.contrib.layers.batch_norm(), the default value of the updates_collections parameter is GraphKeys.UPDATE_OPS.
How should these collections be understood, and what is the difference between them?
Besides, we can find more in ops.py.
These are two different things.
TRAINABLE_VARIABLES
TRAINABLE_VARIABLES is the collection of variables or training parameters which should be modified when minimizing the loss. For example, these can be the weights determining the function performed by each node in the network.
How do variables get added to this collection? This happens automatically when you define a new variable with tf.get_variable, unless you specify
tf.get_variable(..., trainable=False)
When would you want a variable to be untrainable? This happens from time to time. For example, occasionally you will want to use a two-step approach in which you first train the entire network on a large, generic dataset, then fine-tune the network on a smaller dataset which is specifically related to your problem. In such cases, you might want to fine-tune only part of the network, e.g., the last layer. Specifying some variables as untrainable is one of the ways to do this.
UPDATE_OPS
UPDATE_OPS is a collection of ops (operations performed when the graph runs, like multiplication, ReLU, etc.), not variables. Specifically, this collection maintains a list of ops which need to run before each training step.
How do ops get added to this collection?
By definition, update_ops occur outside the regular flow of training by loss minimization, so generally you will be adding ops to this collection only under special circumstances. For example, when performing batch normalization, you want to recompute the batch mean and variance before each training step, and this is how it's done. The mechanics of batch normalization using tf.contrib.layers.batch_norm are described in more detail in this article.
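As a rough illustration of the kind of update batch normalization registers in UPDATE_OPS, the sketch below maintains a moving mean in plain Python. The function name update_moving_mean and the decay value are assumptions made for illustration, not TensorFlow API. The point is that the statistic changes by assignment from batch statistics, not by a gradient step.

```python
def update_moving_mean(moving_mean, batch, decay=0.9):
    """Return the moving mean after blending in one batch (illustrative only)."""
    batch_mean = sum(batch) / len(batch)
    # Exponential moving average: keep `decay` of the old value, blend in the rest.
    return decay * moving_mean + (1.0 - decay) * batch_mean


moving_mean = 0.0
for batch in ([2.0, 4.0], [4.0, 6.0]):
    # In TensorFlow 1.x this step would be an op in UPDATE_OPS, run alongside
    # the train op; here it is an ordinary function call.
    moving_mean = update_moving_mean(moving_mean, batch)
```

In TensorFlow 1.x, a common way to make sure such ops run is to build the train op inside tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)).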
I disagree with the previous answer.
Actually, everything in TensorFlow is an op. The variables in the TRAINABLE_VARIABLES collection are also ops, created by tf.get_variable or tf.Variable.
As for the UPDATE_OPS collection, it usually includes the moving average and moving variance created in the tf.layers.batch_normalization function. These ops can also be regarded as variables, as their values are updated at each training step, just like the weights and biases.
The main difference is that the trainable variables participate in backpropagation, while the variables in UPDATE_OPS do not. They only participate in inference in test mode, so no gradients are computed for the variables in UPDATE_OPS.

Why and when do I need to use the global step in tensorflow

I am using TensorFlow, but I am not sure why I even need the global_step variable or if it is even necessary for training. I have something like this:
gradients_and_vars = optimizer.compute_gradients(value)
train_op = optimizer.apply_gradients(gradients_and_vars)
and then in my loop inside a session I do this:
_ = sess.run([train_op])
I am using a Queue to feed my data to the graph. Do I even have to instantiate a global_step variable?
My loop looks like this:
while not coord.should_stop():
So this loop stops, when it should stop. So why do I need the global_step at all?
You don't need the global step in all cases. But sometimes people want to stop training, tweak some code, and then continue training with the saved and restored model. It is then often useful to know how long (that is, for how many steps) the model has been trained so far. Hence the global step.
Also, your learning rate schedule might depend on how long the model has already been trained. Say you want to decay your learning rate every 100,000 steps. If you interrupt training and have not kept track of the number of steps already taken, this becomes difficult.
And furthermore if you are using tensorboard the global step is the central parameter for your x-axis of the charts.
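A small sketch of the learning-rate point, in plain Python rather than TensorFlow: with a staircase schedule the rate is a function of the global step alone, so a restored step count is enough to resume at the right rate. The function name decayed_learning_rate, the constants, and the restored step value are all hypothetical.

```python
def decayed_learning_rate(base_lr, global_step, decay_steps=100_000, decay_rate=0.5):
    """Staircase exponential decay: multiply by decay_rate every decay_steps steps."""
    return base_lr * decay_rate ** (global_step // decay_steps)


# Fresh run: no decay has happened yet.
lr_start = decayed_learning_rate(0.1, global_step=0)

# Resumed run: the step count restored from a checkpoint (value assumed here)
# puts training back in the third decay interval.
restored_step = 250_000
lr_resumed = decayed_learning_rate(0.1, global_step=restored_step)
```

Without the saved step, a restarted run would begin again at the base rate instead of the decayed one.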

TensorFlow - Removing name scope when plotting summary?

I'm currently building my operations twice, once for training and once for validation with the variable_scope set to have reuse=True to ensure I've only got one set of weights to train.
To organize the operations though, I wrap the operation building call for training in a
with tf.name_scope('train'):
and similarly do the same for validation. This allows me to create a few summary hooks easily, by simply calling
tf.summary.merge(tf.get_collection(tf.GraphKeys.SUMMARIES, scope='train'))
at the end to get summaries for either the training graph or the validation graph and save these summaries with the appropriate summary saver.
Unfortunately, this also means that a scalar in the training summaries is not displayed on the same plot as the equivalent scalar in the validation (because they are in different name scopes).
Is there either a way to remove the name scope before saving the summary, or a different method of wrapping the summaries for a specific case together without applying the name scope to begin with? Or do I need to manually keep track of the summaries for each case?
EDIT:
Just to clarify, my code looks something like:
with tf.name_scope('train'):
    create_network() # Summaries created in here.
with tf.name_scope('validation'):
    create_network(reuse=True) # More summaries in here.
train_summaries = tf.summary.merge(tf.get_collection(tf.GraphKeys.SUMMARIES, scope='train'))
validation_summaries = tf.summary.merge(tf.get_collection(tf.GraphKeys.SUMMARIES, scope='validation'))
# Down here, create the summary saver hooks, etc.
Something like this is done in the multi-GPU CIFAR-10 example code to get rid of unnecessary prefixes:
loss_name = re.sub('%s_[0-9]*/' % cifar10.TOWER_NAME, '', l.op.name)
tf.summary.scalar(loss_name, l)
Perhaps you can report the scalar with the same name from both validation as well as the training part of your code.
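A minimal sketch of that idea for the train/validation case, using only the re module: strip the leading scope from the op name so both summaries end up with the same tag. The scope names come from the question; the helper strip_scope and the sample op names are hypothetical.

```python
import re

# Scopes used when building the two copies of the network (from the question).
SCOPE_PATTERN = re.compile(r"^(train|validation)/")


def strip_scope(op_name):
    """Drop a leading train/ or validation/ prefix from an op name."""
    return SCOPE_PATTERN.sub("", op_name)


train_tag = strip_scope("train/loss")
validation_tag = strip_scope("validation/loss")
nested_tag = strip_scope("train/layer1/weights_norm")
```

This assumes the summary ops are created where no enclosing name scope re-adds the prefix to the tag. With identical tags, writing the training and validation summaries through two separate writers pointed at different log subdirectories should let TensorBoard overlay them as two runs on one chart.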

How to structure the model for training and evaluation on the test set

I want to train a model. Every 1000 steps, I want to evaluate it on the test set and write the result to the TensorBoard log. However, there's a problem. I have code like this:
image_b_train, label_b_train = tf.train.shuffle_batch(...)
out_train = model.inference(image_b_train)
accuracy_train = tf.reduce_mean(...)
image_b_test, label_b_test = tf.train.shuffle_batch(...)
out_test = model.inference(image_b_test)
accuracy_test = tf.reduce_mean(...)
where model.inference declares the variables in the model. For the test set I have a separate queue, and I can't swap one queue for another with TensorFlow.
Currently I solved the problem by creating 2 graphs, one for training and the other for testing. I copy from one graph to the other with tf.train.Saver. Another solution might be to use tf.get_variable, but this is a global variable, and I don't like it because my code becomes less reusable.
Yes, you need two graphs. These graphs can share variables. This can be done by:
Using Keras layers (from tf.contrib.keras) which let you define the model once and use it to compute two inference graphs
Using slim-style layers (from tf.layers) with tf.get_variable and reuse
Using tf.make_template to make your own model-like object which can be called once to build the training graph and once to build the inference graph
Using tf.estimator.Estimator which lets you define a model function once and runs it automatically for training and evaluation for you
There are other options, but any of these is well-supported and should unblock you.