Issues with Tensor-flow distributed training (Early stopping)

Issues with Tensor-flow distributed training (Early stopping) - tensorflow

During TensorFlow distributed MultiWorkerMirroredStrategy training with I am facing errors during early_stopping.
I am using early stopping criterion for TrainSpec for distributed training with 2 nodes.With early stopping hook, I receive the following error immediately on all the worker nodes simultaneously.
If the early stopping hook is removed the code finishes to completion.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
run_config = tf.estimator.RunConfig(
model_dir=model_output_dir,
save_checkpoints_steps=5000,
keep_checkpoint_max=1,
train_distribute=strategy
early_stopping_mae = tf.estimator.experimental.stop_if_no_decrease_hook(
estimator, metric_name='mae',
max_steps_without_decrease=20000, min_steps=100)
train_spec = tf.estimator.TrainSpec(
input_fn=lambda: csv_input_fn(
train_filepath,
hparams['batch_size'],
hparams['num_epochs']),
hooks=[early_stopping_mae]
)
With early stopping, I receive the following error. Its unclear if early stopping is even supported with multi worker strategy.
File "/home/ab981s/anaconda3/envs/py2tensorflow_nightly/lib/python2.7/site-packages/tensorflow/python/ops/collective_ops.py", line 133, in broadcast_send
instance_key=instance_key)
File "/home/ab981s/anaconda3/envs/py2tensorflow_nightly/lib/python2.7/site-packages/tensorflow/python/ops/gen_collective_ops.py", line 159, in collective_bcast_send
shape=shape, name=name)
File "/home/ab981s/anaconda3/envs/py2tensorflow_nightly/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 626, in _apply_op_helper
param_name=input_name)
File "/home/ab981s/anaconda3/envs/py2tensorflow_nightly/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 60, in _SatisfiesTypeConstraint
", ".join(dtypes.as_dtype(x).name for x in allowed_list)))
TypeError: Value passed to parameter 'input' has DataType bool not in list of allowed values: float32, float16, float64, int32, int64

Related

Argument must be a string or a number, not 'ExponentialDecay'

I am on Tensorflow 2.4.0, and tried to perform Exponential decay on the learning rate as follows:
learning_rate_scheduler = tf.keras.optimizers.schedules.ExponentialDecay(initial_learning_rate=0.1, decay_steps=1000, decay_rate=0.97, staircase=False)
and start the learning rate of my optimizer with such decay method:
optimizer_to_use = Adam(learning_rate=learning_rate_scheduler)
the model is compiled as follows
model.compile(loss=metrics.contrastive_loss, optimizer=optimizer_to_use, metrics=[accuracy])
The train goes well until the third epoch, where the following error is showed:
File "train_contrastive_siamese_network_inception.py", line 163, in run_experiment
history = model.fit([pairTrain[:, 0], pairTrain[:, 1]], labelTrain[:], validation_data=([pairTest[:, 0], pairTest[:, 1]], labelTest[:]), batch_size=config.BATCH_SIZE, epochs=config.EPOCHS, callbacks=callbacks)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 1145, in fit
callbacks.on_epoch_end(epoch, epoch_logs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/callbacks.py", line 432, in on_epoch_end
callback.on_epoch_end(epoch, numpy_logs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/callbacks.py", line 2542, in on_epoch_end
old_lr = float(K.get_value(self.model.optimizer.lr))
TypeError: float() argument must be a string or a number, not 'ExponentialDecay'
I checked this issue was even raised in the official keras Forum, but no success even there. Plus, the documentation clearly states that:
A LearningRateSchedule instance can be passed in as the learning_rate argument of any optimizer.
What could be the issue?

AttributeError: module 'tensorflow.contrib.seq2seq' has no attribute 'prepare_attention'

I am trying to run my code and the code is throwing the error.
Error is mentioned below:
AttributeError: module 'tensorflow.contrib.seq2seq' has no attribute 'prepare_attention'
I updated my tensorflow version to 1.0.0. But the up-gradation did not solved my problem. I also searched in google regarding this error, but i did not got correct solution.
Here is the code part, please have a look.
Getting the training and test predictions
training_predictions, test_predictions = seq2seq_model(tf.reverse(inputs, [-1]),
targets,
keep_prob,
batch_size,
sequence_length,
len(answerswords2int),
len(questionswords2int),
encoding_embedding_size,
decoding_embedding_size,
rnn_size,
num_layers,
questionswords2int)
C:\Users\Maniech\Anaconda3\lib\site-packages\tensorflow_core\python\client\session.py:1750: UserWarning: An interactive session is already active. This can cause out-of-memory errors in some cases. You must explicitly call `InteractiveSession.close()` to release resources held by the other session(s).
warnings.warn('An interactive session is already active. This can '
Traceback (most recent call last):
File "<ipython-input-8-aecd893a8ef5>", line 37, in <module>
questionswords2int)
File "C:/Users/Maniech/Desktop/Deep NLP AZ/chatbot.py", line 292, in seq2seq_model
batch_size)
File "C:/Users/Maniech/Desktop/Deep NLP AZ/chatbot.py", line 258, in decoder_rnn
batch_size)
File "C:/Users/Maniech/Desktop/Deep NLP AZ/chatbot.py", line 201, in decode_training_set
attention_keys, attention_values, attention_score_function, attention_construct_function = tf.contrib.seq2seq.prepare_attention(attention_states, attention_option = "bahdanau", num_units = decoder_cell.output_size)
AttributeError: module 'tensorflow.contrib.seq2seq' has no attribute 'prepare_attention'
Any help is appreciated.

TPUEstimator error -- AttributeError: module 'tensorflow.contrib.tpu.python.ops.tpu_ops' has no attribute 'cross_replica_sum'

I have written a tensorflow code using the TPUEstimator, but I am having problems running it in use_tpu=False mode. I would like to run it on my local computer to make sure that all the operations are TPU-compatible. The code works fine with the normal Estimator. Here is my master code:
import logging
from tensorflow.contrib.tpu.python.tpu import tpu_config, tpu_estimator, tpu_optimizer
from tensorflow.contrib.cluster_resolver import TPUClusterResolver
from capser_7_model_fn import *
from capser_7_input_fn import *
import subprocess
from absl import flags
flags.DEFINE_bool(
'use_tpu', False,
'Use TPUs rather than plain CPUs')
tf.flags.DEFINE_string(
"tpu", default='$TPU_NAME',
help="The Cloud TPU to use for training. This should be either the name "
"used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
"url.")
tf.flags.DEFINE_string("model_dir", LOGDIR, "Estimator model_dir")
flags.DEFINE_integer(
'save_checkpoints_secs', 1000,
'Interval (in seconds) at which the model data '
'should be checkpointed. Set to 0 to disable.')
flags.DEFINE_integer(
'save_summary_steps', 100,
'Number of steps which must have run before showing summaries.')
tf.flags.DEFINE_integer("iterations", 1000,
"Number of iterations per TPU training loop.")
tf.flags.DEFINE_integer("num_shards", 8, "Number of shards (TPU chips).")
tf.flags.DEFINE_integer("batch_size", 1024,
"Mini-batch size for the training. Note that this "
"is the global batch size and not the per-shard batch.")
FLAGS = tf.flags.FLAGS
if FLAGS.use_tpu:
my_project_name = subprocess.check_output(['gcloud', 'config', 'get-value', 'project'])
my_zone = subprocess.check_output(['gcloud', 'config', 'get-value', 'compute/zone'])
cluster_resolver = TPUClusterResolver(
tpu=[FLAGS.tpu],
zone=my_zone,
project=my_project_name)
master = TPUClusterResolver(tpu=[os.environ['TPU_NAME']]).get_master()
else:
master = ''
my_tpu_run_config = tpu_config.RunConfig(
master=master,
model_dir=FLAGS.model_dir,
save_checkpoints_secs=FLAGS.save_checkpoints_secs,
save_summary_steps=FLAGS.save_summary_steps,
session_config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True),
tpu_config=tpu_config.TPUConfig(iterations_per_loop=FLAGS.iterations, num_shards=FLAGS.num_shards),
)
# create estimator for model (the model is described in capser_7_model_fn)
capser = tpu_estimator.TPUEstimator(model_fn=model_fn_tpu,
config=my_tpu_run_config,
use_tpu=FLAGS.use_tpu,
train_batch_size=batch_size,
params={'model_batch_size': batch_size_per_shard})
# train model
logging.getLogger().setLevel(logging.INFO) # to show info about training progress
capser.train(input_fn=train_input_fn_tpu, steps=n_steps)
I have a capsule network defined in model_fn_tpu, which returns the TPUEstimator spec. The optimizer is a standard AdamOptimizer. I have made all the changes explained here https://www.tensorflow.org/guide/using_tpu#optimizer to make my code compatible with TPUEstimator. I get the following error:
Traceback (most recent call last):
File "C:/Users/doerig/PycharmProjects/capser/TPU_playground.py", line 85, in <module>
capser.train(input_fn=train_input_fn_tpu, steps=n_steps)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 363, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 843, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 856, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 831, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\contrib\tpu\python\tpu\tpu_estimator.py", line 2016, in _model_fn
features, labels, is_export_mode=is_export_mode)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\contrib\tpu\python\tpu\tpu_estimator.py", line 1121, in call_without_tpu
return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\contrib\tpu\python\tpu\tpu_estimator.py", line 1317, in _call_model_fn
estimator_spec = self._model_fn(features=features, **kwargs)
File "C:\Users\doerig\PycharmProjects\capser\capser_7_model_fn.py", line 101, in model_fn_tpu
**output_decoder_deconv_params)
File "C:\Users\doerig\PycharmProjects\capser\capser_model.py", line 341, in capser_model
loss_training_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step(), name="training_op")
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\training\optimizer.py", line 424, in minimize
name=name)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\contrib\tpu\python\tpu\tpu_optimizer.py", line 113, in apply_gradients
summed_grads_and_vars.append((tpu_ops.cross_replica_sum(grad), var))
AttributeError: module 'tensorflow.contrib.tpu.python.ops.tpu_ops' has no attribute 'cross_replica_sum'
Any ideas to solve this problem? Thank you in advance!

I suspect this is either a bug in the version of TensorFlow you are using + Windows, or else an issue with your build of TensorFlow.
For example, when I chase down the file tensorflow\contrib\tpu\python\tpu\tpu_optimizer.py in the TF 1.4 branch, I see that tpu_ops is imported as:
from tensorflow.contrib.tpu.python.ops import tpu_ops
and if you chase that to the relevant file, you see:
if platform.system() != "Windows":
# pylint: disable=wildcard-import,unused-import,g-import-not-at-top
from tensorflow.contrib.tpu.ops.gen_tpu_ops import *
from tensorflow.contrib.util import loader
from tensorflow.python.platform import resource_loader
# pylint: enable=wildcard-import,unused-import,g-import-not-at-top
_tpu_ops = loader.load_op_library(
resource_loader.get_path_to_datafile("_tpu_ops.so"))
else:
# We have already built the appropriate libraries into the binary via CMake
# if we have built contrib, so we don't need this
pass
Following up with the other TF branches that existed at the time of this posting, we see similar comments in 1.5, in 1.6, in 1.7, in 1.8, and in 1.9.
I strongly suspect this would not occur under Linux, but I might test this later and edit this answer.

How to use tf.train.Saver in SessionRunHook?

I have trained many sub-models, each sub-models is a part of the last model. And then I want to use those pretrained sub models to initial the last model's parameters. I try to use SessionRunHook to load other ckpt file's model parameters to initial the last model's.
I tried the follow code but failed. Hope some advices. Thanks!
The error info is:
Traceback (most recent call last):
File "train_high_api_local.py", line 282, in <module>
tf.app.run()
File "/Users/zhouliaoming/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 124, in run
_sys.exit(main(argv))
File "train_high_api_local.py", line 266, in main
clf_.train(input_fn=lambda: read_file([tables[0]], epochs_per_eval), steps=None, hooks=[hook_test]) # input yield: x, y
File "/Users/zhouliaoming/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 314, in train
.......
File "/Users/zhouliaoming/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 674, in create_session
hook.after_create_session(self.tf_sess, self.coord)
File "train_high_api_local.py", line 102, in after_create_session
saver = tf.train.Saver([ti]) # TODO: ERROR INFO: Graph is finalized and cannot be modified.
.......
File "/Users/zhouliaoming/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3135, in create_op
self._check_not_finalized()
File "/Users/zhouliaoming/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2788, in _check_not_finalized
raise RuntimeError("Graph is finalized and cannot be modified.")
RuntimeError: Graph is finalized and cannot be modified.
and the code detail is:
class SetTensor(session_run_hook.SessionRunHook):
""" like tf.train.LoggingTensorHook """
def after_create_session(self, session, coord):
""" Called when new TensorFlow session is created: graph is finalized and ops can no longer be added. """
graph = tf.get_default_graph()
ti = graph.get_tensor_by_name("h_1_15/bias:0")
with session.as_default():
with tf.name_scope("rewrite"):
saver = tf.train.Saver([ti]) # TODO: ERROR INFO: Graph is finalized and cannot be modified.
saver.restore(session, "/Users/zhouliaoming/data/credit_dnn/model_retrain/rm_gene_v2_sall/model.ckpt-2102")
pass
def main(unused_argv):
""" train """
norm_all_func = lambda x: tf.cond(x>1, lambda: tf.log(x), lambda: tf.identity(x))
feature_columns=[[tf.feature_column.numeric_column(COLUMNS[i], shape=fi, normalizer_fn=lambda x: tf.py_func(weight_norm2, [x], tf.float32) )] for i, fi in enumerate(FEA_DIM)] # normlized: running OK!
## use self-defined model
param = {"learning_rate": 0.0001, "feature_columns": feature_columns, "isanalysis": FLAGS.isanalysis, "isall": False}
clf_ = tf.estimator.Estimator(model_fn=model_fn_wide2deep, params=param, model_dir=ckpt_dir)
hook_test = SetTensor(["h_1_15/bias", "h_1_15/kernel"])
epochs_per_eval = 1
for n in range(int(FLAGS.num_epochs/epochs_per_eval)):
# train num_epochs
clf_.train(input_fn=lambda: read_file([tables[0]], epochs_per_eval), steps=None, hooks=[hook_test]) # input yield: x, y

SessionRunHook is not meant for this use case. As the error says, you cannot change the graph once sess.run() has been invoked.
You can assign variables using saver.restore() in your "normal code". You don't have to be inside any hooks.
Also, if you want to restore many variables and can match them to their names and shapes in a checkpoint, you might want to take a look at https://gist.github.com/iganichev/d2d8a0b1abc6b15d4a07de83171163d4. It shows some example code to restore a subset of variables.

You can do this:
class SaveAtEnd(tf.train.SessionRunHook):
def begin(self):
self._saver = # create your saver
def end(self, session):
self._saver.save(session, ...)

Proper way to optimize the input in TensorFlow for visualization

I have trained a model in TensorFlow and now I would like to visualize which inputs maximally activate an output. I'd like to know what the cleanest way to do this is.
I had thought to do this by creating a trainable input variable which I can assign once per run. Then by using an appropriate loss function and using an optimizer with a var_list containing just this input variable I would update this input variable until convergence. i.e.
trainable_input = tf.get_variable(
'trainable_input',
shape=data_op.get_shape(),
dtype=data_op.dtype,
initializer=tf.zeros_initializer(),
trainable=True,
collections=[tf.GraphKeys.LOCAL_VARIABLES])
trainable_input_assign_op = tf.assign(trainable_input, data_op)
data_op = trainable_input
# ... run the rest of the graph building code here, now with a trainable input
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
# loss_op is defined on one of the outputs
train_op = optimizer.minimize(loss_op, var_list=[trainable_input])
However, when I do this I run into issues. If I try to restore the pre-trained graph using a Supervisor, then it naturally complains that the new variables created by the AdamOptimizer do not exist in the graph I'm trying to restore. I can remedy this by using get_slots to get the variables the AdamOptimizer creates and manually adding those variables to the tf.GraphKeys.LOCAL_VARIABLES collection, but it feels pretty hacky and I'm not sure what the consequences of this would be. I can also exclude those variables explicitly from the Saver that is passed to the Supervisor without adding them to the tf.GraphKeys.LOCAL_VARIABLES collection, but then I get an exception that they do not get properly initialized by the Supervisor:
File "/usr/local/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 973, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/local/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 801, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/local/lib/python3.5/site-packages/tensorflow/python/training/coordinator.py", line 386, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python3.5/site-packages/six.py", line 686, in reraise
raise value
File "/usr/local/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 962, in managed_session
start_standard_services=start_standard_services)
File "/usr/local/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 719, in prepare_or_wait_for_session
init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
File "/usr/local/lib/python3.5/site-packages/tensorflow/python/training/session_manager.py", line 280, in prepare_session
self._local_init_op, msg))
RuntimeError: Init operations did not make model ready. Init op: init, init fn: None, local_init_op: name: "group_deps_5"
op: "NoOp"
input: "^init_1"
input: "^init_all_tables"
, error: Variables not initialized: trainable_input/trainable_input/Adam, trainable_input/trainable_input/Adam_1
I'm not really sure why these variables are not getting initialized since I have used that technique before to exclude some variables from the restore process (GLOBAL and LOCAL) and they seem to get initialized as expected.
In short, my question is whether there is a simple way to add an optimizer to the graph and do a checkpoint restore (where the checkpoint does not contain the optimizer variables) without having to muck around with the internals of the optimizer. If that's not possible, then is there any downside to just adding the optimizer variables to the LOCAL_VARIABLES collection?

The same error occurs when I use slim library. In fact, the slim.learning.train() uses tf.train.Supervisor inside. I hope my answer on this GitHub issue may help your Supervisor problem.
I have the same problem with you. I solve it by doing following two steps.
1. pass the parameter saver to slim.learning.train()
ckpt = tf.train.get_checkpoint_state(FLAGS.train_dir)
saver = tf.train.Saver(var_list=optimistic_restore_vars(ckpt.model_checkpoint_path) if ckpt else None)
where function optimistic_restore_vars is defined as
def optimistic_restore_vars(model_checkpoint_path):
reader = tf.train.NewCheckpointReader(model_checkpoint_path)
saved_shapes = reader.get_variable_to_shape_map()
var_names = sorted([(var.name, var.name.split(':')[0]) for var in tf.global_variables() if var.name.split(':')[0] in saved_shapes])
restore_vars = []
name2var = dict(zip(map(lambda x:x.name.split(':')[0], f.global_variables()), tf.global_variables()))
with tf.variable_scope('', reuse=True):
for var_name, saved_var_name in var_names:
curr_var = name2var[saved_var_name]
var_shape = curr_var.get_shape().as_list()
if var_shape == saved_shapes[saved_var_name]:
restore_vars.append(curr_var)
return restore_vars
```
2. pass the parameter local_init_op to slim.learning.train() to initialize the added new variables
local_init_op = tf.global_variables_initializer()
In last, the code should look like this
ckpt = tf.train.get_checkpoint_state(FLAGS.train_dir)
saver = tf.train.Saver(var_list=optimistic_restore_vars ckpt.model_checkpoint_path) if ckpt else None)
local_init_op = tf.global_variables_initializer()
###########################
# Kicks off the training. #
###########################
learning.train(
train_tensor,
saver=saver,
local_init_op=local_init_op,
logdir=FLAGS.train_dir,
master=FLAGS.master,
is_chief=(FLAGS.task == 0),
init_fn=_get_init_fn(),
summary_op=summary_op,
number_of_steps=FLAGS.max_number_of_steps,
log_every_n_steps=FLAGS.log_every_n_steps,
save_summaries_secs=FLAGS.save_summaries_secs,
save_interval_secs=FLAGS.save_interval_secs,
sync_optimizer=optimizer if FLAGS.sync_replicas else None
)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Issues with Tensor-flow distributed training (Early stopping) - tensorflow

Related

Argument must be a string or a number, not 'ExponentialDecay'

AttributeError: module 'tensorflow.contrib.seq2seq' has no attribute 'prepare_attention'

TPUEstimator error -- AttributeError: module 'tensorflow.contrib.tpu.python.ops.tpu_ops' has no attribute 'cross_replica_sum'

How to use tf.train.Saver in SessionRunHook?

Proper way to optimize the input in TensorFlow for visualization

Categories

Resources