How to structure a model for training and evaluation on the test set - TensorFlow

I want to train a model. Every 1000 steps, I want to evaluate it on the test set and write the result to the TensorBoard log. However, there's a problem. I have code like this:
# training input pipeline
image_b_train, label_b_train = tf.train.shuffle_batch(...)
out_train = model.inference(image_b_train)
accuracy_train = tf.reduce_mean(...)

# test input pipeline (separate queue)
image_b_test, label_b_test = tf.train.shuffle_batch(...)
out_test = model.inference(image_b_test)
accuracy_test = tf.reduce_mean(...)
where model.inference declares the variables of the model. The problem is that the test set uses a separate queue, and I can't swap one queue for another in TensorFlow.
Currently I solve this by creating two graphs, one for training and one for testing, and copying from one graph to the other with tf.train.Saver. Another solution might be to use tf.get_variable, but that relies on global variable scoping, and I don't like it because my code becomes less reusable.

Yes, you need two graphs. These graphs can share variables. This can be done by:
Using Keras layers (from tf.contrib.keras) which let you define the model once and use it to compute two inference graphs
Using slim-style layers (from tf.layers) with tf.get_variable and reuse
Using tf.make_template to make your own model-like object which can be called once to build the training graph and once to build the inference graph
Using tf.estimator.Estimator, which lets you define a model function once and runs it for training and evaluation for you
There are other options, but any of these is well-supported and should unblock you.
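As a minimal sketch of the tf.make_template option: the model function below is a hypothetical stand-in for model.inference (not the asker's actual code), and image_b_train / label_b_train / image_b_test / label_b_test are assumed to come from the two queues in the question. Both calls to the template reuse the same variables, so the test accuracy is computed with the weights currently being trained.

import tensorflow as tf

def model_fn(images):
    # Hypothetical two-layer network standing in for model.inference().
    net = tf.layers.flatten(images)
    net = tf.layers.dense(net, 128, activation=tf.nn.relu, name='fc1')
    return tf.layers.dense(net, 10, name='logits')

def accuracy(logits, labels):
    # labels are assumed to be integer class ids from the input queues
    correct = tf.equal(tf.argmax(logits, 1), tf.cast(labels, tf.int64))
    return tf.reduce_mean(tf.cast(correct, tf.float32))

# Wrap the model function in a template: the first call creates the
# variables, every later call reuses the same ones.
model = tf.make_template('model', model_fn)

logits_train = model(image_b_train)   # batch from the training queue
logits_test = model(image_b_test)     # batch from the separate test queue

accuracy_train = accuracy(logits_train, label_b_train)
accuracy_test = accuracy(logits_test, label_b_test)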

Related

Training multiple Keras models in one script

I want to train different Keras models (or in some cases just multiple runs of the same model to compare the results) in a queue (using TensorFlow as the backend if that matters). In my current setup I create and fit all of these models in one big python script, e.g. (in a simplified way):
for i in range(10):
    model = create_model(i)
    model.compile(...)
    model.fit(...)
    some_function_to_save_model(model)
The create_model(i) function creates the specific model for the i'th run. This includes changing the number of inputs / labels for example. The compile function can be different (e.g. different optimizer) for each run as well.
While this code works for me and I have not found any problems, I am unclear if this is the correct way to do it because all of the models reside in the same TensorFlow Graph (if I understand the way Keras / TensorFlow work together correctly). My questions are:
Is this the correct way to run multiple independent models? (I do not want any influence of the i'th run on the i+1'th run.)
Is running the models from different python scripts (in this example model1.py, model2.py, ... model9.py) in any way better, technically speaking (I am not referring to readability / reproducibility here), because each model would then have its own separate TensorFlow Graph / Session?
Does clearing the Session / deleting the Graph via keras.backend.clear_session() have any influence in this case if it is run after the save function (some_function_to_save_model() inside the for loop)? Is this in some way beneficial compared to the current setup?
Once again: I am not concerned with problems that might arise from messy code when all models are crammed into one script instead of one script per model; my question is only about creating and training the models independently.
Unfortunately I did not find a concise answer to this (only suggestions using both methods). Maybe someone here can enlighten me?
Edit: Maybe I should be more precise. Basically I would like to have a technical explanation regarding the differences (advantages & disadvantages) of the following three cases:
create_and_train.py:
for i in range(10):
    model = create_model(i)
    model.compile(...)
    model.fit(...)
    some_function_to_save_model(model)
create_and_train.py:
for i in range(10):
    model = create_model(i)
    model.compile(...)
    model.fit(...)
    some_function_to_save_model(model)
    # clear session:
    keras.backend.clear_session()
create_and_train_i.py with i in [0, 1, ..., 9]:
i = 5 # (e.g.)
model = create_model(i)
model.compile(...)
model.fit(...)
some_function_to_save_model(model)
and e.g. a bash script that loops through these
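If you prefer to stay in Python rather than bash, a rough sketch of driving case 3 could look like the following (script names follow the question; how i is passed to each script is up to your setup). Each run happens in its own Python process, so it gets a completely fresh TensorFlow graph and session, with nothing carried over between runs.

import subprocess

for i in range(10):
    # Launch each training script as a separate process.
    subprocess.run(['python', 'create_and_train_{}.py'.format(i)], check=True)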

How to fine-tune using a pre-trained model in tf.estimator

I have a model converted from Caffe using the MMdnn tool; it converted the Caffe model into a TensorFlow-style saved_model. It's a ResNet-18 model, and I just stripped out several of the last layers. I wish I could load this architecture in the model_fn of a tf.estimator and manually add some extra layers to do my job.
The tutorial recommends the loader.load method to load the saved_model. But I want to use it in an Estimator, where I need to define the architecture in the model_fn function. I searched SO and GitHub but there isn't a very specific workflow for doing this; could somebody help me out?
Here is one way of fine-tuning using tf.estimator.Estimator:
Define your model using the SAME variable names/scopes as in your saved model
Use tf.estimator's warm start functions to initialize your new model with the saved weights. Here is a code snippet:
if fine_tuning:
    ws = tf.estimator.WarmStartSettings(ckpt_to_initialize_from=path_saved_model,
                                        vars_to_warm_start='.*')
else:
    ws = None

estimator = tf.estimator.Estimator(model_fn=model_function,
                                   warm_start_from=ws,
                                   ...
                                   )
This will initialize any variables that share names between your currently defined graph and the saved model.
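If you only want to warm-start the converted backbone and leave your newly added layers randomly initialized, vars_to_warm_start also accepts a regular expression. A small sketch (the scope name resnet18 is hypothetical; use whatever scope your model_fn actually defines):

ws = tf.estimator.WarmStartSettings(
    ckpt_to_initialize_from=path_saved_model,
    # Only variables matching this regex are warm-started; the extra head
    # layers added in model_fn keep their fresh initialization.
    vars_to_warm_start='resnet18/.*')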

Loading a model from a TensorFlow SavedModel onto multiple GPUs

Let's say someone hands me a TF SavedModel and I would like to replicate this model on the 4 GPUs I have on my machine so I can run inference in parallel on batches of data. Are there any good examples of how to do this?
I can load a saved model in this way:
def load_model(self, saved_model_dirpath):
    '''Loads a model from a saved model directory - this should
    contain a .pb file and a variables directory'''
    signature_key = tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY
    input_key = 'input'
    output_key = 'output'
    meta_graph_def = tf.saved_model.loader.load(self.sess, [tf.saved_model.tag_constants.SERVING],
                                                saved_model_dirpath)
    signature = meta_graph_def.signature_def
    input_tensor_name = signature[signature_key].inputs[input_key].name
    output_tensor_name = signature[signature_key].outputs[output_key].name
    self.input_tensor = self.sess.graph.get_tensor_by_name(input_tensor_name)
    self.output_tensor = self.sess.graph.get_tensor_by_name(output_tensor_name)
...but this would require that I have a handle to the session. For models that I have written myself, I would have access to the inference function and I could just call it and wrap it using with tf.device(), but in this case, I'm not sure how to extract the inference function out of a SavedModel. Should I load 4 separate sessions or is there a better way? Couldn't find much documentation on this, but apologies in advance if I missed something. Thanks!
There is no support for this use case in TensorFlow at the moment. Unfortunately, "replicating the inference function" based only on the SavedModel (which is basically the computation graph with some metadata) is a fairly complex (and brittle, if implemented) graph transformation problem.
If you don't have access to the source code that produced this model, your best bet is to load the SavedModel 4 times into 4 separate graphs, rewriting the target device to the corresponding GPU each time. Then, run each graph/session separately.
Note that you can invoke sess.run() multiple times concurrently since sess.run() releases the GIL for the time of actual computation. All you need is several Python threads.
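A rough sketch of that approach follows. The 'input'/'output' signature keys are taken from the question's load_model; pinning an imported graph to a specific GPU may additionally require soft placement (or clearing the saved device annotations), so this is only a starting point.

import threading
import tensorflow as tf

def load_on_gpu(saved_model_dir, gpu_id):
    # Load the SavedModel into its own graph and session, pinned to one GPU.
    graph = tf.Graph()
    with graph.as_default(), tf.device('/gpu:{}'.format(gpu_id)):
        sess = tf.Session(graph=graph,
                          config=tf.ConfigProto(allow_soft_placement=True))
        meta_graph_def = tf.saved_model.loader.load(
            sess, [tf.saved_model.tag_constants.SERVING], saved_model_dir)
        signature = meta_graph_def.signature_def[
            tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
        input_tensor = graph.get_tensor_by_name(signature.inputs['input'].name)
        output_tensor = graph.get_tensor_by_name(signature.outputs['output'].name)
    return sess, input_tensor, output_tensor

def predict(sess, input_tensor, output_tensor, batch, results, idx):
    # sess.run releases the GIL while computing, so plain threads give
    # real parallelism across the four sessions.
    results[idx] = sess.run(output_tensor, feed_dict={input_tensor: batch})

# Hypothetical usage: one session per GPU, one thread per batch shard.
# replicas = [load_on_gpu('/path/to/saved_model', i) for i in range(4)]
# threads = [threading.Thread(target=predict, args=(s, i_t, o_t, shard, results, k))
#            for k, ((s, i_t, o_t), shard) in enumerate(zip(replicas, shards))]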

How to smoothly produce TensorFlow AUC summaries for training and test sets?

TensorFlow describes writing file summaries to visualize graph execution.
I envision three stages:
training the data (with optimization)
measuring accuracy on the training set (no optimization)
measuring accuracy on the test set (no optimization!)
I'd like all stages in the same script, as in the evaluate function of the wide_and_deep tutorial, but with the low-level API. I'd like three different graphs for stats like loss or AUC, one for each stage.
Suppose I use one session, and in each stage I define an AUC summary op:
# define auc
auc, auc_op = tf.metrics.auc(labels, predictions)
# summary scalar to track it
tf.summary.scalar("auc", auc_op, family=family_name)
# merge all summaries for evaluation and later writing
summary_op = tf.summary.merge_all()
...
summary_writer.add_summary(summary, step_num)
There are three graphs, but the first graph has all three runs plotted on it, and the second graph has the last two runs. What's worse, each stage starts from the previous state. This makes sense, because all the variables from the previous stages are still around.
I could use a different session for each stage, but that would throw away the model as well.
What is the smooth way to handle this?
I'd like to just clear some of the summary variables. I've tried re-initializing some variables, looked at related questions, read about name scope and variable scope and tried not to re-use variables for AUC, read about variables and sharing, looked into pruning nodes (though I don't understand it), etc. I have not made it work yet.
I am using the low-level API. I saw something like this in the high-level API in _eval_metric_ops, but I don't understand how they 'clear' the different stages. With name_scope?
Do I have to save and load the model into a new session just for this, or is there some clean way to graph each summary separately?
The metric ops will be local variables, so you could run tf.local_variables_initializer() in your Session, which will reset all of your metrics. You could also look through the local variables collection for those with "auc" in the name if you wanted to be a bit more discerning. The high-level way to do this would be to use an Estimator, which will manage metrics for you.
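A minimal sketch of that reset, with placeholder inputs standing in for the question's actual pipeline:

import numpy as np
import tensorflow as tf

labels = tf.placeholder(tf.float32, [None])       # stand-in for real labels
predictions = tf.placeholder(tf.float32, [None])  # stand-in for model output
auc, auc_update = tf.metrics.auc(labels, predictions)

# tf.metrics.* keep their running totals in *local* variables, so this op
# resets every metric in the graph:
reset_all_metrics = tf.local_variables_initializer()

# More targeted: reset only the local variables created for the AUC metric.
auc_vars = [v for v in tf.local_variables() if 'auc' in v.name]
reset_auc = tf.variables_initializer(auc_vars)

with tf.Session() as sess:
    sess.run(reset_all_metrics)
    # ... accumulate AUC over the training stage ...
    sess.run(auc_update, {labels: np.array([0., 1.]),
                          predictions: np.array([0.2, 0.8])})
    print(sess.run(auc))
    # Reset before the test stage so its AUC starts from a clean slate.
    sess.run(reset_auc)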

How to properly freeze a tensorflow graph containing a LookupTable

I am working with a model that uses multiple lookup tables to transform the model input from text to feature ids. I am able to train the model fine, and I am able to load it via the JavaCPP bindings. I save checkpoints periodically with a default Saver object via the TensorFlow Supervisor.
When I try to run the model I get the following error:
Table not initialized.
[[Node: hash_table_Lookup_3 = LookupTableFind[Tin=DT_STRING, Tout=DT_INT64,
_class=["loc:#string_to_index_2/hash_table"], _output_shapes=[[-1]],
_device="/job:localhost/replica:0/task:0/cpu:0"]
(string_to_index_2/hash_table, ParseExample/ParseExample:5, string_to_index_2/hash_table/Const)]]
I prepare the model by using the freeze_graph.py script as follows:
bazel-bin/tensorflow/python/tools/freeze_graph --input_graph=/tmp/tf/graph.pbtxt
--input_checkpoint=/tmp/tf/model.ckpt-0 --output_graph=/tmp/ticker_classifier.pb
--output_node_names=sigmoid --initializer_nodes=init_all_tables
As far as I can tell, specifying initializer_nodes has no effect on the resulting file. Am I running into something that is not currently supported? If not, is there something else I need to do to prepare the graph to be frozen?
I had the same problem when using C++ to invoke the TF API to run inference. It seems the reason is that I trained a model using tf.feature_column.categorical_column_with_hash_bucket, which needs its tables initialized like this:
table_init_op = tf.tables_initializer(name="init_all_tables")
sess.run(table_init_op)
So when you want to freeze the model, you must append the name of table_init_op to the --output_node_names argument:
freeze_graph --input_graph=/tmp/tf/graph.pbtxt
--input_checkpoint=/tmp/tf/model.ckpt-0
--output_graph=/tmp/ticker_classifier.pb
--output_node_names=sigmoid,init_all_tables
--initializer_nodes=init_all_tables
When you load and initialize the model in C++, you should first invoke the TF C++ API like this:
std::vector<Tensor> dummy_outputs;
Status st = session->Run({}, {}, {"init_all_tables"}, dummy_outputs);
Now you have initialized all tables and can do other things such as inference. This issue may be of help.
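For completeness, the same initialization from Python after loading the frozen graph might look roughly like this (the path and node names follow the freeze_graph command above, and assume init_all_tables was kept via --output_node_names):

import tensorflow as tf

# Read the frozen GraphDef produced by freeze_graph.
graph_def = tf.GraphDef()
with tf.gfile.GFile('/tmp/ticker_classifier.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')

with tf.Session(graph=graph) as sess:
    # Initialize the hash tables before any lookup is run.
    sess.run(graph.get_operation_by_name('init_all_tables'))
    # ... now run the 'sigmoid' output for inference ...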