How to run evaluation with MultiWorkerMirroredStrategy in Tensorflow 2? - tensorflow

I have the following code which works fine on a single machine:
if distributed_training:
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
elif len(available_gpus) > 1:
    strategy = tf.distribute.MirroredStrategy()
else:
    strategy = None

run_config = tf.compat.v1.estimator.tpu.RunConfig(
    train_distribute=strategy,
    model_dir=model_dir
)

estimator = tf.compat.v1.estimator.tpu.TPUEstimator(
    model_fn=model_fn,
    config=run_config,
    train_batch_size=train_batch_size,
    eval_batch_size=eval_batch_size)

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
However, when I try to run it in a distributed fashion it simply won't do evaluation. It will train fine, but for some reason evaluation is skipped. I've already tried setting eval_distribute to MultiWorkerMirroredStrategy, but that doesn't change anything.
Any ideas?
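One detail that may be relevant here (editorial note, not part of the original post): with tf.estimator.train_and_evaluate in a multi-worker setup, evaluation is only performed by a process whose TF_CONFIG task type is "evaluator"; worker tasks only train. A minimal sketch of the environment for such a dedicated evaluator process, with hypothetical host names:

import json
import os

# Hypothetical cluster: two training workers, plus this process acting as the evaluator.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:2222", "host2:2222"]},
    "task": {"type": "evaluator", "index": 0},
})
# tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec) run in this
# process would then perform evaluation while the worker processes train.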

Related

How to use feedable iterator from Tensorflow Dataset API along with MonitoredTrainingSession?

The TensorFlow programmer's guide recommends using a feedable iterator to switch between training and validation datasets without reinitializing the iterator. This mainly requires feeding a handle to choose between them.
How can it be used together with tf.train.MonitoredTrainingSession?
The following approach fails with a "RuntimeError: Graph is finalized and cannot be modified." error.
with tf.train.MonitoredTrainingSession() as sess:
    training_handle = sess.run(training_iterator.string_handle())
    validation_handle = sess.run(validation_iterator.string_handle())
How can I get both the convenience of MonitoredTrainingSession and the ability to iterate over training and validation datasets simultaneously?
I got the answer from the Tensorflow GitHub issue - https://github.com/tensorflow/tensorflow/issues/12859
The solution is to call iterator.string_handle() before creating the MonitoredSession.
import tensorflow as tf
from tensorflow.contrib.data import Dataset, Iterator

dataset_train = Dataset.range(10)
dataset_val = Dataset.range(90, 100)

iter_train_handle = dataset_train.make_one_shot_iterator().string_handle()
iter_val_handle = dataset_val.make_one_shot_iterator().string_handle()

handle = tf.placeholder(tf.string, shape=[])
iterator = Iterator.from_string_handle(
    handle, dataset_train.output_types, dataset_train.output_shapes)
next_batch = iterator.get_next()

with tf.train.MonitoredTrainingSession() as sess:
    handle_train, handle_val = sess.run([iter_train_handle, iter_val_handle])
    for step in range(10):
        print('train', sess.run(next_batch, feed_dict={handle: handle_train}))
        if step % 3 == 0:
            print('val', sess.run(next_batch, feed_dict={handle: handle_val}))
Output:
('train', 0)
('val', 90)
('train', 1)
('train', 2)
('val', 91)
('train', 3)
@Michael Jaison G's answer is correct. However, it does not work when you also want to use certain session_run_hooks that need to evaluate parts of the graph, e.g. LoggingTensorHook or SummarySaverHook.
The example below will cause an error:
import tensorflow as tf

dataset_train = tf.data.Dataset.range(10)
dataset_val = tf.data.Dataset.range(90, 100)

iter_train_handle = dataset_train.make_one_shot_iterator().string_handle()
iter_val_handle = dataset_val.make_one_shot_iterator().string_handle()

handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    handle, dataset_train.output_types, dataset_train.output_shapes)
feature = iterator.get_next()

pred = feature * feature
tf.summary.scalar('pred', pred)

global_step = tf.train.create_global_step()

summary_hook = tf.train.SummarySaverHook(save_steps=5,
                                         output_dir="summaries",
                                         summary_op=tf.summary.merge_all())

with tf.train.MonitoredTrainingSession(hooks=[summary_hook]) as sess:
    handle_train, handle_val = sess.run([iter_train_handle, iter_val_handle])
    for step in range(10):
        feat = sess.run(feature, feed_dict={handle: handle_train})
        pred_ = sess.run(pred, feed_dict={handle: handle_train})
        print('train: ', feat)
        print('pred: ', pred_)
        if step % 3 == 0:
            print('val', sess.run(feature, feed_dict={handle: handle_val}))
This will fail with error:
InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'Placeholder' with dtype string
[[Node: Placeholder = Placeholder[dtype=DT_STRING, shape=[], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
[[Node: cond/Switch_1/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_18_cond/Switch_1", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
The reason is that the hook tries to evaluate the graph already during the first session.run([iter_train_handle, iter_val_handle]), which obviously does not yet contain a handle in its feed_dict.
The workaround is to override the hooks that cause the problem, changing their before_run and after_run to evaluate only on session.run calls whose feed_dict contains the handle (you can access the feed_dict of the current session.run call via the run_context argument of before_run and after_run). A sketch of such a hook is shown below.
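As a rough illustration of that workaround (a minimal sketch, not from the original answer; the FeedAwareSummarySaverHook name and the handle argument are made up here), one can subclass SummarySaverHook so that it only does its work on runs where the handle placeholder is fed:

# Sketch only: a SummarySaverHook variant that skips runs where the iterator
# handle placeholder is not in the feed_dict (e.g. the initial run that fetches
# the string handles). `handle_placeholder` is the tf.placeholder defined above.
class FeedAwareSummarySaverHook(tf.train.SummarySaverHook):
    def __init__(self, handle_placeholder, *args, **kwargs):
        super(FeedAwareSummarySaverHook, self).__init__(*args, **kwargs)
        self._handle_placeholder = handle_placeholder

    def _handle_is_fed(self, run_context):
        feed_dict = run_context.original_args.feed_dict or {}
        return self._handle_placeholder in feed_dict

    def before_run(self, run_context):
        if self._handle_is_fed(run_context):
            return super(FeedAwareSummarySaverHook, self).before_run(run_context)
        return None  # request no extra fetches for this run

    def after_run(self, run_context, run_values):
        if self._handle_is_fed(run_context):
            super(FeedAwareSummarySaverHook, self).after_run(run_context, run_values)

summary_hook = FeedAwareSummarySaverHook(handle, save_steps=5,
                                         output_dir="summaries",
                                         summary_op=tf.summary.merge_all())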
Alternatively, you can use the latest master of TensorFlow (post-1.4), which adds a run_step_fn method to MonitoredSession. It lets you specify a step_fn like the following, which avoids the error (at the expense of evaluating the if statement once per training iteration ...):
def step_fn(step_context):
    if handle_train is None:
        handle_train, handle_val = sess.run([iter_train_handle, iter_val_handle])
    return step_context.run_with_hooks(fetches=..., feed_dict=...)
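A hedged usage sketch (my addition, not in the original answer): the step_fn is then driven through MonitoredSession.run_step_fn inside the training loop, and its step_context.run_with_hooks behaves like a hook-aware sess.run.

with tf.train.MonitoredTrainingSession(hooks=[summary_hook]) as sess:
    while not sess.should_stop():
        # run_step_fn passes a step_context to step_fn for each training step
        sess.run_step_fn(step_fn)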
There is a demo of using a placeholder in a MonitoredTrainingSession together with a SessionRunHook.
The demo switches datasets by feeding different handle strings.
BTW, I have tried all the solutions above, but only this one works for me.
dataset_switching

TensorFlow Estimator restoring all variables properly, but loss spikes up afterwards

I am using TensorFlow 1.2.1 on Windows 10 with the Estimator API. Everything runs without errors, but whenever the parameters have to be restored from a checkpoint, some aspect of it doesn't work. I've checked that the value of every variable in classifier.get_variable_names() is unchanged after an evaluation, yet the loss spikes back up to near where it started and then keeps dropping, each time faster than the last.
This happens within a single TensorFlow run whenever a validation or evaluation pass occurs, and also when I rerun the Python file to continue training.
The following graphs show one example of this problem; the variables are restored every 2500 steps:
http://imgur.com/6q9Wuat
http://imgur.com/CQ2hdR8
The following code is a significantly reduced version of my code, which still reproduces the error:
import tensorflow as tf
from tensorflow.contrib import learn
from tensorflow.contrib.learn.python.learn.estimators import model_fn as model_fn_lib

tf.logging.set_verbosity(tf.logging.INFO)

sess = tf.InteractiveSession()

def cnn_model_fn(features, labels, mode):
    dense_layer1 = tf.layers.dense(inputs=features, units=512, activation=tf.nn.relu, name="FC_1")
    dense_layer2 = tf.layers.dense(inputs=dense_layer1, units=1024, activation=tf.nn.relu, name="FC_2")
    dense_layer3 = tf.layers.dense(inputs=dense_layer2, units=2048, activation=tf.nn.relu, name="FC_3")
    dense_layer4 = tf.layers.dense(inputs=dense_layer3, units=512, activation=tf.nn.relu, name="FC_4")
    logits = tf.layers.dense(inputs=dense_layer4, units=2, name="logit_layer")

    loss = None
    train_op = None

    if mode != learn.ModeKeys.INFER:
        loss = tf.losses.softmax_cross_entropy(
            onehot_labels=labels, logits=logits)
    if mode == learn.ModeKeys.TRAIN:
        train_op = tf.contrib.layers.optimize_loss(
            loss=loss,
            global_step=tf.contrib.framework.get_global_step(),
            learning_rate=.001,
            optimizer="SGD")

    predictions = {
        "classes": tf.argmax(input=logits, axis=1),
        "probabilities": tf.nn.softmax(
            logits, name="softmax_tensor")}

    return model_fn_lib.ModelFnOps(
        mode=mode,
        predictions=predictions,
        loss=loss,
        train_op=train_op)

def main(unused_param):
    def data_pipeline(filenames, batch_size, num_epochs=None, min_after_dequeue=10000):
        with tf.name_scope("data_pipeline"):
            filename_queue = tf.train.string_input_producer(filenames, num_epochs=num_epochs)
            reader = tf.TextLineReader()
            key, value = reader.read(filename_queue)
            row = tf.decode_csv(value, record_defaults=[[0.0] for _ in range(66)])
            example_op, label_op = tf.stack(row[:len(row)-2]), tf.stack(row[len(row)-2:])
            capacity = min_after_dequeue + 3 * batch_size
            example_batch, label_batch = tf.train.shuffle_batch(
                [example_op, label_op],
                batch_size=batch_size,
                capacity=capacity,
                min_after_dequeue=min_after_dequeue)
            return example_batch, label_batch

    def input_data_fn(data_getter_ops):
        batch, labels = sess.run(data_getter_ops)
        return tf.constant(batch, dtype=tf.float32), tf.constant(labels, dtype=tf.float32)

    NUM_EPOCHS = 6
    BATCHES_IN_TRAINING_EPOCH = 8000

    training_data_pipe_ops = data_pipeline(
        filenames=["train_data.csv"],
        batch_size=500,
        min_after_dequeue=10000)

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    classifier = tf.contrib.learn.Estimator(
        model_fn=cnn_model_fn,
        model_dir="/tmp/bug_finder")

    for j in range(NUM_EPOCHS):
        classifier.fit(
            input_fn=lambda: input_data_fn(training_data_pipe_ops),
            steps=BATCHES_IN_TRAINING_EPOCH)
        print("Epoch", str(j+1), "training completed.")

    coord.request_stop()
    coord.join(threads)

if __name__ == "__main__":
    tf.app.run()
I figured out the issue: I was creating the data pipeline with the interactive session I had created, and then having my input function evaluate the examples (like a feed dictionary). The reason this is a problem is that the Estimator class creates its own session (a MonitoredTrainingSession), and since the graph operations weren't being created from within a call made by the Estimator (and thus with its session), they were not being saved. Using an input function to create the graph operations and return the final graph operation (the batching) made everything work smoothly; a sketch of that arrangement is below.
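For concreteness, a minimal sketch of the fix described above (my reconstruction, not the asker's exact final code): build the pipeline inside the input_fn so its ops live in the Estimator's own graph and session, and return the batch tensors directly.

def input_fn():
    # data_pipeline is the same helper as in the question; its ops are now created
    # inside the Estimator's graph instead of the InteractiveSession's graph.
    example_batch, label_batch = data_pipeline(
        filenames=["train_data.csv"],
        batch_size=500,
        min_after_dequeue=10000)
    return example_batch, label_batch

for j in range(NUM_EPOCHS):
    classifier.fit(input_fn=input_fn, steps=BATCHES_IN_TRAINING_EPOCH)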

Tensorflow classifier.export_savedmodel (Beginner)

I know about the "Serving a Tensorflow Model" page
https://www.tensorflow.org/serving/serving_basic
but those functions assume you're using tf.Session(), which the DNNClassifier tutorial does not... I then looked at the API doc for DNNClassifier, and it has an export_savedmodel function (the export function is deprecated). It seems simple enough, but I am getting a "'NoneType' object is not iterable" error... which supposedly means I'm passing in an empty variable, but I'm unsure what I need to change... I've essentially copied and pasted the code from the get_started/tflearn page on tensorflow.org and then added
directoryName = "temp"

def serving_input_fn():
    print("asdf")

classifier.export_savedmodel(
    directoryName,
    serving_input_fn
)
just after the classifier.fit function call... the other parameters for export_savedmodel are optional I believe... any ideas?
Tutorial with Code:
https://www.tensorflow.org/get_started/tflearn#construct_a_deep_neural_network_classifier
API Doc for export_savedmodel
https://www.tensorflow.org/api_docs/python/tf/contrib/learn/DNNClassifier#export_savedmodel
There are two kinds of TensorFlow applications:
the functions that assume you are using tf.Session() belong to "low level" TensorFlow examples, and
the DNNClassifier tutorial is a "high level" TensorFlow application.
I'm going to explain how to export "high level" TensorFlow models (using export_savedmodel).
The function export_savedmodel requires the argument serving_input_receiver_fn, a function that takes no arguments and defines the inputs for the model and for the predictor. Therefore, you must create your own serving_input_receiver_fn, where the model input type matches the model input in the training script, and the predictor input type matches the predictor input in the testing script.
On the other hand, if you create a custom model, you must define the export_outputs via tf.estimator.export.PredictOutput, whose input is a dictionary defining names that have to match the names of the predictor outputs in the testing script.
For example:
TRAINING SCRIPT
def serving_input_receiver_fn():
    serialized_tf_example = tf.placeholder(dtype=tf.string, shape=[None], name='input_tensors')
    receiver_tensors = {"predictor_inputs": serialized_tf_example}
    feature_spec = {"words": tf.FixedLenFeature([25], tf.int64)}
    features = tf.parse_example(serialized_tf_example, feature_spec)
    return tf.estimator.export.ServingInputReceiver(features, receiver_tensors)

def estimator_spec_for_softmax_classification(logits, labels, mode):
    predicted_classes = tf.argmax(logits, 1)
    if (mode == tf.estimator.ModeKeys.PREDICT):
        export_outputs = {'predict_output': tf.estimator.export.PredictOutput({"pred_output_classes": predicted_classes, 'probabilities': tf.nn.softmax(logits)})}
        return tf.estimator.EstimatorSpec(mode=mode, predictions={'class': predicted_classes, 'prob': tf.nn.softmax(logits)}, export_outputs=export_outputs)  # IMPORTANT!!!
    onehot_labels = tf.one_hot(labels, 31, 1, 0)
    loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits)
    if (mode == tf.estimator.ModeKeys.TRAIN):
        optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
    eval_metric_ops = {'accuracy': tf.metrics.accuracy(labels=labels, predictions=predicted_classes)}
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)

def model_custom(features, labels, mode):
    bow_column = tf.feature_column.categorical_column_with_identity("words", num_buckets=1000)
    bow_embedding_column = tf.feature_column.embedding_column(bow_column, dimension=50)
    bow = tf.feature_column.input_layer(features, feature_columns=[bow_embedding_column])
    logits = tf.layers.dense(bow, 31, activation=None)
    return estimator_spec_for_softmax_classification(logits=logits, labels=labels, mode=mode)

def main():
    # ...
    # preprocess -> features_train_set and labels_train_set
    # ...
    classifier = tf.estimator.Estimator(model_fn=model_custom)
    train_input_fn = tf.estimator.inputs.numpy_input_fn(x={"words": features_train_set}, y=labels_train_set, batch_size=batch_size_param, num_epochs=None, shuffle=True)
    classifier.train(input_fn=train_input_fn, steps=100)
    full_model_dir = classifier.export_savedmodel(export_dir_base="C:/models/directory_base", serving_input_receiver_fn=serving_input_receiver_fn)
TESTING SCRIPT
def main():
    # ...
    # preprocess -> features_test_set
    # ...
    with tf.Session() as sess:
        tf.saved_model.loader.load(sess, [tf.saved_model.tag_constants.SERVING], full_model_dir)
        predictor = tf.contrib.predictor.from_saved_model(full_model_dir)
        model_input = tf.train.Example(features=tf.train.Features(feature={"words": tf.train.Feature(int64_list=tf.train.Int64List(value=features_test_set))}))
        model_input = model_input.SerializeToString()
        output_dict = predictor({"predictor_inputs": [model_input]})
        y_predicted = output_dict["pred_output_classes"][0]
(Code tested in Python 3.6.3, Tensorflow 1.4.0)
If you try to use the predictor with TensorFlow > 1.6, you can get this error:
signature_def_key "serving_default". Available signatures are ['predict']. Original error:
No SignatureDef with key 'serving_default' found in MetaGraphDef.
Here is a working example tested on 1.7.0:
SAVING:
First you need to define the feature spec (the feature lengths) in dict format like this:
feature_spec = {'x': tf.FixedLenFeature([4],tf.float32)}
Then you have to build a function which has a placeholder with the same shape as the features and returns a tf.estimator.export.ServingInputReceiver:
def serving_input_receiver_fn():
    serialized_tf_example = tf.placeholder(dtype=tf.string,
                                           shape=[None],
                                           name='input_tensors')
    receiver_tensors = {'inputs': serialized_tf_example}
    features = tf.parse_example(serialized_tf_example, feature_spec)
    return tf.estimator.export.ServingInputReceiver(features, receiver_tensors)
Then just save with export_savedmodel:
classifier.export_savedmodel(dir_path, serving_input_receiver_fn)
Full example code:
import os
from six.moves.urllib.request import urlopen

import numpy as np
import tensorflow as tf

dir_path = os.path.dirname('.')

IRIS_TRAINING = os.path.join(dir_path, "iris_training.csv")
IRIS_TEST = os.path.join(dir_path, "iris_test.csv")

feature_spec = {'x': tf.FixedLenFeature([4], tf.float32)}

def serving_input_receiver_fn():
    serialized_tf_example = tf.placeholder(dtype=tf.string,
                                           shape=[None],
                                           name='input_tensors')
    receiver_tensors = {'inputs': serialized_tf_example}
    features = tf.parse_example(serialized_tf_example, feature_spec)
    return tf.estimator.export.ServingInputReceiver(features, receiver_tensors)

def main():
    training_set = tf.contrib.learn.datasets.base.load_csv_with_header(
        filename=IRIS_TRAINING,
        target_dtype=np.int,
        features_dtype=np.float32)
    test_set = tf.contrib.learn.datasets.base.load_csv_with_header(
        filename=IRIS_TEST,
        target_dtype=np.int,
        features_dtype=np.float32)

    feature_columns = [tf.feature_column.numeric_column("x", shape=[4])]

    classifier = tf.estimator.DNNClassifier(feature_columns=feature_columns,
                                            hidden_units=[10, 20, 10],
                                            n_classes=3,
                                            model_dir=dir_path)
    # Define the training inputs
    train_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": np.array(training_set.data)},
        y=np.array(training_set.target),
        num_epochs=None,
        shuffle=True)

    # Train model.
    classifier.train(input_fn=train_input_fn, steps=200)

    classifier.export_savedmodel(dir_path, serving_input_receiver_fn)

if __name__ == "__main__":
    main()
Restoring
Now let's restore the model:
import tensorflow as tf
import os

dir_path = os.path.dirname('.')  # current directory
exported_path = os.path.join(dir_path, "1536315752")

def main():
    with tf.Session() as sess:
        tf.saved_model.loader.load(sess, [tf.saved_model.tag_constants.SERVING], exported_path)
        model_input = tf.train.Example(features=tf.train.Features(feature={
            'x': tf.train.Feature(float_list=tf.train.FloatList(value=[6.4, 3.2, 4.5, 1.5]))
        }))
        predictor = tf.contrib.predictor.from_saved_model(exported_path)
        input_tensor = tf.get_default_graph().get_tensor_by_name("input_tensors:0")
        model_input = model_input.SerializeToString()
        output_dict = predictor({"inputs": [model_input]})
        print(" prediction is ", output_dict['scores'])

if __name__ == "__main__":
    main()
Here is an IPython notebook demo example with data and explanation:
There are two possible questions here, and two possible answers. First, you encounter a missing session for the DNNClassifier, which uses the higher-level Estimator API (as opposed to the lower-level APIs where you manipulate the ops yourself). The nice thing about TensorFlow is that the high- and low-level APIs are more or less interoperable, so if you want a session and want to do something with it, it is as simple as adding:
sess = tf.get_default_session()
Then you can start hooking in the remainder of the serving tutorial.
The second interpretation of your question is: what about export_savedmodel? Actually, export_savedmodel and the sample code from the serving tutorial try to achieve the same goal. When you are training your graph, you set up some infrastructure to feed input to the graph (typically batches from a training dataset); when you switch to 'serving', however, you will often read your input from somewhere else, and you need separate infrastructure that replaces the training input of the graph. The bottom line is that the serving_input_fn() which you filled with a print should, in essence, return an input op. This is also stated in the documentation:
serving_input_fn: A function that takes no argument and returns an
InputFnOps.
Hence, instead of print("asdf"), it should do something similar to adding an input chain (comparable to what builder.add_meta_graph_and_variables also adds).
Examples of serving_input_fn()s can be found in the cloudml samples (https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/census/customestimator/trainer/model.py#L240), such as the following, which serves input from JSON:
def json_serving_input_fn():
    """Build the serving inputs."""
    inputs = {}
    for feat in INPUT_COLUMNS:
        inputs[feat.name] = tf.placeholder(shape=[None], dtype=feat.dtype)
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)
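A brief usage sketch (my addition; the estimator variable and export directory name are placeholders): such a function is then passed to export_savedmodel just like the receiver functions above.

# Hypothetical export call; `estimator` and "export_dir" stand in for your own names.
estimator.export_savedmodel("export_dir", json_serving_input_fn)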

How can input operators be added to a tensorflow graph after distributed execution has begun?

I would like to use distributed TensorFlow to train a model in a streaming fashion using a parameter server. The worker setup is something like this, based on https://www.tensorflow.org/how_tos/distributed/:
def train_model(filenames, params):
    with tf.device(tf.train.replica_device_setter(
            worker_device='/job:worker/task:%d' % params.task_index, cluster=cluster)):
        input_op = construct_input_op(filenames)
        global_step = tf.Variable(0)
        train_op = construct_train_op(input_op, global_step, params)
        init_op = tf.global_variables_initializer()
        saver = tf.train.Saver(tf.global_variables() + tf.local_variables())

    supervisor = tf.train.Supervisor(
        is_chief=params.task_index == 0,
        logdir=params.training_summary_dir,
        init_op=init_op,
        saver=saver,
        global_step=global_step,
        save_model_secs=0)

    with supervisor.managed_session(server.target) as sess:
        while not supervisor.should_stop() and step <= params.max_steps:
            sess.run(train_op)

    supervisor.stop()

cluster = tf.train.ClusterSpec({"ps": [params.param_server_host], "worker": params.worker_hosts})
server = tf.train.Server(cluster, job_name="worker", task_index=params.task_index)

while True:
    filenames = wait_for_new_training_data(...)
    train_model(filenames, params)
With this setup, I get a runtime error that the Graph is finalized and cannot be modified when adding input operators in the second pass. How can I make this example work?
You can try calling ._unsafe_unfinalize() on your graph object, although a more robust solution is to create all needed operations ahead of time (modifying the graph while it is running, or concurrently, causes performance and thread-safety issues). A sketch of the ahead-of-time approach is below.
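As a rough sketch of the "create everything ahead of time" approach (my reconstruction, not from the answer; it assumes TFRecord input, a TF version with tf.data, and made-up names filenames_ph / build_train_op), the pipeline is built once with a filename placeholder and re-initialized with new files at run time, so the finalized graph never needs to change:

filenames_ph = tf.placeholder(tf.string, shape=[None], name="filenames")
dataset = tf.data.TFRecordDataset(filenames_ph)   # assumed input format
iterator = dataset.make_initializable_iterator()
input_op = iterator.get_next()
train_op = build_train_op(input_op, params)        # hypothetical helper

# ... create init_op, the saver and the Supervisor exactly as above, then:
with supervisor.managed_session(server.target) as sess:
    while not supervisor.should_stop():
        filenames = wait_for_new_training_data(...)
        # Re-initializing the iterator feeds new files without adding graph ops.
        sess.run(iterator.initializer, feed_dict={filenames_ph: filenames})
        # ... run train_op until this batch of files is exhausted ...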

Restoring a tensorflow model for finetuning, with "slim.learning.train"

In TensorFlow, with slim.learning.train (TF 0.11), I would like to restore a model from a checkpoint and continue training. The model had a successful training session, and I would like to fine-tune it. However, when I do that, TF crashes with the error
Init operations did not make model ready.
I do the training with:
tf.contrib.slim.learning.train(
    train_op,
    train_dir,
    log_every_n_steps=FLAGS.log_every_n_steps,
    graph=g,
    global_step=model.global_step,
    number_of_steps=FLAGS.number_of_steps,
    init_fn=model.init_fn,
    saver=model.saver,
    session_config=session_config)
I tried 3 alternatives:
#1
Following this doc
model.init_fn = None
#2
with g.as_default():
    model_path = tf.train.latest_checkpoint(train_dir)
    if model_path:
        def restore_fn(sess):
            tf.logging.info(
                "Restoring SA&T variables from checkpoint file %s",
                restore_fn.model_path)
            model.saver.restore(sess, restore_fn.model_path)
        restore_fn.model_path = model_path
        model.init_fn = restore_fn
    else:
        model.init_fn = None
#3
with g.as_default():
    model_path = tf.train.latest_checkpoint(train_dir)
    if model_path:
        variables_to_restore = tf.contrib.slim.get_variables_to_restore()
        model.init_fn = tensorflow.contrib.framework.assign_from_checkpoint_fn(
            model_path, variables_to_restore)
    else:
        model.init_fn = None
The issue was solved. It happened because the saver (tf.train.Saver) was defined directly after the model was built.
Defining it after the train op definition instead solved the issue; a sketch of that ordering is below.
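A minimal sketch of that ordering (my reconstruction; model.build, total_loss and optimizer are illustrative names, not the poster's code):

with g.as_default():
    total_loss = model.build()                # build the network and its loss
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
    train_op = tf.contrib.slim.learning.create_train_op(total_loss, optimizer)
    # Create the Saver AFTER the train op, so it also covers the optimizer's
    # slot variables and the init/restore step leaves the model fully ready.
    model.saver = tf.train.Saver()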