I am on TensorFlow 2.4.0 and tried to apply exponential decay to the learning rate as follows:
learning_rate_scheduler = tf.keras.optimizers.schedules.ExponentialDecay(initial_learning_rate=0.1, decay_steps=1000, decay_rate=0.97, staircase=False)
and then initialized my optimizer's learning rate with this schedule:
optimizer_to_use = Adam(learning_rate=learning_rate_scheduler)
The model is compiled as follows:
model.compile(loss=metrics.contrastive_loss, optimizer=optimizer_to_use, metrics=[accuracy])
Training goes well until the third epoch, where the following error is shown:
File "train_contrastive_siamese_network_inception.py", line 163, in run_experiment
history = model.fit([pairTrain[:, 0], pairTrain[:, 1]], labelTrain[:], validation_data=([pairTest[:, 0], pairTest[:, 1]], labelTest[:]), batch_size=config.BATCH_SIZE, epochs=config.EPOCHS, callbacks=callbacks)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 1145, in fit
callbacks.on_epoch_end(epoch, epoch_logs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/callbacks.py", line 432, in on_epoch_end
callback.on_epoch_end(epoch, numpy_logs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/callbacks.py", line 2542, in on_epoch_end
old_lr = float(K.get_value(self.model.optimizer.lr))
TypeError: float() argument must be a string or a number, not 'ExponentialDecay'
I found that this issue was even raised on the official Keras forum, but with no resolution there either. Moreover, the documentation clearly states that:
A LearningRateSchedule instance can be passed in as the learning_rate argument of any optimizer.
What could be the issue?
Related
I am facing errors with early stopping during distributed TensorFlow training with MultiWorkerMirroredStrategy.
I am using an early stopping criterion in the TrainSpec for distributed training with 2 nodes. With the early stopping hook, I receive the following error immediately on all the worker nodes simultaneously.
If the early stopping hook is removed the code finishes to completion.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

run_config = tf.estimator.RunConfig(
    model_dir=model_output_dir,
    save_checkpoints_steps=5000,
    keep_checkpoint_max=1,
    train_distribute=strategy
)

early_stopping_mae = tf.estimator.experimental.stop_if_no_decrease_hook(
    estimator, metric_name='mae',
    max_steps_without_decrease=20000, min_steps=100)

train_spec = tf.estimator.TrainSpec(
    input_fn=lambda: csv_input_fn(
        train_filepath,
        hparams['batch_size'],
        hparams['num_epochs']),
    hooks=[early_stopping_mae]
)
With early stopping, I receive the following error. It's unclear if early stopping is even supported with the multi-worker strategy.
File "/home/ab981s/anaconda3/envs/py2tensorflow_nightly/lib/python2.7/site-packages/tensorflow/python/ops/collective_ops.py", line 133, in broadcast_send
instance_key=instance_key)
File "/home/ab981s/anaconda3/envs/py2tensorflow_nightly/lib/python2.7/site-packages/tensorflow/python/ops/gen_collective_ops.py", line 159, in collective_bcast_send
shape=shape, name=name)
File "/home/ab981s/anaconda3/envs/py2tensorflow_nightly/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 626, in _apply_op_helper
param_name=input_name)
File "/home/ab981s/anaconda3/envs/py2tensorflow_nightly/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 60, in _SatisfiesTypeConstraint
", ".join(dtypes.as_dtype(x).name for x in allowed_list)))
TypeError: Value passed to parameter 'input' has DataType bool not in list of allowed values: float32, float16, float64, int32, int64
I have written TensorFlow code using the TPUEstimator, but I am having problems running it in use_tpu=False mode. I would like to run it on my local computer to make sure that all the operations are TPU-compatible. The code works fine with the normal Estimator. Here is my master code:
import os
import logging
import subprocess

import tensorflow as tf
from tensorflow.contrib.tpu.python.tpu import tpu_config, tpu_estimator, tpu_optimizer
from tensorflow.contrib.cluster_resolver import TPUClusterResolver
from absl import flags

from capser_7_model_fn import *
from capser_7_input_fn import *

flags.DEFINE_bool(
    'use_tpu', False,
    'Use TPUs rather than plain CPUs')
tf.flags.DEFINE_string(
    "tpu", default='$TPU_NAME',
    help="The Cloud TPU to use for training. This should be either the name "
         "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
         "url.")
tf.flags.DEFINE_string("model_dir", LOGDIR, "Estimator model_dir")
flags.DEFINE_integer(
    'save_checkpoints_secs', 1000,
    'Interval (in seconds) at which the model data '
    'should be checkpointed. Set to 0 to disable.')
flags.DEFINE_integer(
    'save_summary_steps', 100,
    'Number of steps which must have run before showing summaries.')
tf.flags.DEFINE_integer("iterations", 1000,
                        "Number of iterations per TPU training loop.")
tf.flags.DEFINE_integer("num_shards", 8, "Number of shards (TPU chips).")
tf.flags.DEFINE_integer("batch_size", 1024,
                        "Mini-batch size for the training. Note that this "
                        "is the global batch size and not the per-shard batch.")
FLAGS = tf.flags.FLAGS

if FLAGS.use_tpu:
    my_project_name = subprocess.check_output(['gcloud', 'config', 'get-value', 'project'])
    my_zone = subprocess.check_output(['gcloud', 'config', 'get-value', 'compute/zone'])
    cluster_resolver = TPUClusterResolver(
        tpu=[FLAGS.tpu],
        zone=my_zone,
        project=my_project_name)
    master = TPUClusterResolver(tpu=[os.environ['TPU_NAME']]).get_master()
else:
    master = ''

my_tpu_run_config = tpu_config.RunConfig(
    master=master,
    model_dir=FLAGS.model_dir,
    save_checkpoints_secs=FLAGS.save_checkpoints_secs,
    save_summary_steps=FLAGS.save_summary_steps,
    session_config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True),
    tpu_config=tpu_config.TPUConfig(iterations_per_loop=FLAGS.iterations, num_shards=FLAGS.num_shards),
)

# create estimator for model (the model is described in capser_7_model_fn)
capser = tpu_estimator.TPUEstimator(model_fn=model_fn_tpu,
                                    config=my_tpu_run_config,
                                    use_tpu=FLAGS.use_tpu,
                                    train_batch_size=batch_size,
                                    params={'model_batch_size': batch_size_per_shard})

# train model
logging.getLogger().setLevel(logging.INFO)  # to show info about training progress
capser.train(input_fn=train_input_fn_tpu, steps=n_steps)
I have a capsule network defined in model_fn_tpu, which returns the TPUEstimator spec. The optimizer is a standard AdamOptimizer. I have made all the changes explained at https://www.tensorflow.org/guide/using_tpu#optimizer to make my code compatible with the TPUEstimator.
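For reference, the optimizer change that guide describes amounts to the following pattern inside the model function (this is a sketch of the documented pattern, not the actual code from capser_7_model_fn.py):

optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)
if FLAGS.use_tpu:
    # CrossShardOptimizer sums the gradients across TPU shards; the guide
    # suggests applying it only when actually running on a TPU.
    optimizer = tpu_optimizer.CrossShardOptimizer(optimizer)
train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())

With these changes in place, I get the following error: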
Traceback (most recent call last):
File "C:/Users/doerig/PycharmProjects/capser/TPU_playground.py", line 85, in <module>
capser.train(input_fn=train_input_fn_tpu, steps=n_steps)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 363, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 843, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 856, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 831, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\contrib\tpu\python\tpu\tpu_estimator.py", line 2016, in _model_fn
features, labels, is_export_mode=is_export_mode)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\contrib\tpu\python\tpu\tpu_estimator.py", line 1121, in call_without_tpu
return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\contrib\tpu\python\tpu\tpu_estimator.py", line 1317, in _call_model_fn
estimator_spec = self._model_fn(features=features, **kwargs)
File "C:\Users\doerig\PycharmProjects\capser\capser_7_model_fn.py", line 101, in model_fn_tpu
**output_decoder_deconv_params)
File "C:\Users\doerig\PycharmProjects\capser\capser_model.py", line 341, in capser_model
loss_training_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step(), name="training_op")
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\training\optimizer.py", line 424, in minimize
name=name)
File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\contrib\tpu\python\tpu\tpu_optimizer.py", line 113, in apply_gradients
summed_grads_and_vars.append((tpu_ops.cross_replica_sum(grad), var))
AttributeError: module 'tensorflow.contrib.tpu.python.ops.tpu_ops' has no attribute 'cross_replica_sum'
Any ideas to solve this problem? Thank you in advance!
I suspect this is either a bug in the combination of the TensorFlow version you are using and Windows, or else an issue with your build of TensorFlow.
For example, when I chase down the file tensorflow\contrib\tpu\python\tpu\tpu_optimizer.py in the TF 1.4 branch, I see that tpu_ops is imported as:
from tensorflow.contrib.tpu.python.ops import tpu_ops
and if you chase that to the relevant file, you see:
if platform.system() != "Windows":
  # pylint: disable=wildcard-import,unused-import,g-import-not-at-top
  from tensorflow.contrib.tpu.ops.gen_tpu_ops import *
  from tensorflow.contrib.util import loader
  from tensorflow.python.platform import resource_loader
  # pylint: enable=wildcard-import,unused-import,g-import-not-at-top

  _tpu_ops = loader.load_op_library(
      resource_loader.get_path_to_datafile("_tpu_ops.so"))
else:
  # We have already built the appropriate libraries into the binary via CMake
  # if we have built contrib, so we don't need this
  pass
Following up with the other TF branches that existed at the time of this posting, we see similar comments in 1.5, in 1.6, in 1.7, in 1.8, and in 1.9.
I strongly suspect this would not occur under Linux, but I might test this later and edit this answer.
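If you want to check this on your own machine in the meantime, a quick diagnostic sketch (separate from your training code) is to import the module directly and look for the symbol:

import platform

# Diagnostic sketch: check whether this TensorFlow build exposes
# cross_replica_sum from tpu_ops on the current platform.
from tensorflow.contrib.tpu.python.ops import tpu_ops

print("Platform:", platform.system())
print("Has cross_replica_sum:", hasattr(tpu_ops, "cross_replica_sum"))

If this prints False on Windows but True on a Linux installation of the same TensorFlow version, that would confirm the Windows build is the culprit.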
I use tf.contrib.seq2seq.dynamic_decode for decoder training:
prediction, final_decoder_state, _ = dynamic_decode(
    custom_decoder
)
with a custom decoder
custom_decoder = CustomDecoder(decoder_cell, helper, decoder_init_state)
and a helper
helper = CustomTrainingHelper(batch_size, targets, stop_targets,
                              num_outs, outputs_per_step, 1.0, False)
And dynamic_decode raises an error:
Traceback (most recent call last):
File "E:/tasks/text_to_speech/tts/tf_seq2seq.py", line 95, in <module>
custom_decoder
File "C:\Users\User\Anaconda3\lib\site-packages\tensorflow\contrib\seq2seq\python\ops\decoder.py", line 304, in dynamic_decode
swap_memory=swap_memory)
File "C:\Users\User\Anaconda3\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 3224, in while_loop
result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
File "C:\Users\User\Anaconda3\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 2956, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "C:\Users\User\Anaconda3\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 2930, in _BuildLoop
next_vars.append(_AddNextAndBackEdge(m, v))
File "C:\Users\User\Anaconda3\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 688, in _AddNextAndBackEdge
_EnforceShapeInvariant(m, v)
File "C:\Users\User\Anaconda3\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 632, in _EnforceShapeInvariant
(merge_var.name, m_shape, n_shape))
ValueError: The shape for decoder/while/Merge_12:0 is not an invariant for the loop. It enters the loop with shape (10, 1), but has shape (?, 1) after one iteration. Provide shape invariants using either the `shape_invariants` argument of tf.while_loop or set_shape() on the loop variables.
batch_size is equal to 10. As I understand it, the issue is with tf.while_loop and batch_size. How can this error be fixed? Thanks in advance.
You provided too little information to say anything specific. Please follow the guidance at https://stackoverflow.com/help/mcve in the future.
In general, this error is telling you the following: by default, TensorFlow checks that the variables passed from one iteration of the while loop to the next do not change shape. In your case, the decoder/while/Merge_12:0 tensor originally had a shape of (10, 1), but after one iteration it became (?, 1), meaning that TensorFlow can no longer infer the size of the first dimension.
If you know that the first dimension is really 10, you can use Tensor.set_shape to tell this to TensorFlow.
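For illustration, here is a minimal, self-contained sketch of that fix (TF 1.x style, with made-up loop variables rather than your decoder code): an op inside the loop body loses the static batch dimension, and set_shape pins it back so the invariant inferred from the initial value keeps holding.

import tensorflow as tf

batch_size = 10

x0 = tf.zeros([batch_size, 1])

def cond(i, x):
    return i < 3

def body(i, x):
    # An op whose output shape depends on a dynamic value: the static first
    # dimension becomes unknown, i.e. the shape turns from (10, 1) into (?, 1).
    x = tf.reshape(x, tf.stack([tf.shape(x)[0], 1]))
    # We know the first dimension is really batch_size, so pin it back and the
    # (10, 1) shape invariant inferred from the initial value still holds.
    x.set_shape([batch_size, 1])
    return i + 1, x

_, x_final = tf.while_loop(cond, body, [tf.constant(0), x0])

with tf.Session() as sess:
    print(sess.run(x_final).shape)  # (10, 1)

If the first dimension genuinely can change between iterations, the alternative is the one the error message itself suggests: pass shape_invariants to tf.while_loop, with tf.TensorShape([None, 1]) for that variable.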
I am implementing a modification of the U-net for semantic segmentation.
I have two outputs from the network:
model = Model(input=inputs, output=[conv10, dense3])
model.compile(optimizer=Adam(lr=1e-5), loss=common_loss, metrics=[common_loss])
where common_loss is defined as:
def common_loss(y_true, y_pred):
    segmentation_loss = categorical_crossentropy(y_true[0], y_pred[0])
    classification_loss = categorical_crossentropy(y_true[1], y_pred[1])
    return segmentation_loss + alpha * classification_loss
When I run this, I get a ValueError:
File "y-net.py", line 138, in <module>
train_and_predict()
File "y-net.py", line 133, in train_and_predict
callbacks=[model_checkpoint], validation_data=(X_val, [y_img_val, y_class_val]))
File "/home/gpu_users/meetshah/miniconda2/envs/check/lib/python2.7/site-packages/keras/engine/training.py", line 1124, in fit
callback_metrics=callback_metrics)
File "/home/gpu_users/meetshah/miniconda2/envs/check/lib/python2.7/site-packages/keras/engine/training.py", line 848, in _fit_loop
callbacks.on_batch_end(batch_index, batch_logs)
File "/home/gpu_users/meetshah/miniconda2/envs/check/lib/python2.7/site-packages/keras/callbacks.py", line 63, in on_batch_end
callback.on_batch_end(batch, logs)
File "/home/gpu_users/meetshah/miniconda2/envs/check/lib/python2.7/site-packages/keras/callbacks.py", line 191, in on_batch_end
self.progbar.update(self.seen, self.log_values)
File "/home/gpu_users/meetshah/miniconda2/envs/check/lib/python2.7/site-packages/keras/utils/generic_utils.py", line 147, in update
if abs(avg) > 1e-3:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
My implementation and the entire trace can be found here:
https://gist.github.com/meetshah1995/19d54270e8d1b20f814e6c1495facc6a
You can see how to implement multiple metrics with multiple outputs here: https://github.com/EdwardTyantov/ultrasound-nerve-segmentation/blob/master/u_model.py.
model.compile(optimizer=optimizer,
              loss={'main_output': dice_coef_loss, 'aux_output': 'binary_crossentropy'},
              metrics={'main_output': dice_coef, 'aux_output': 'acc'},
              loss_weights={'main_output': 1., 'aux_output': 0.5})
I am not sure if combined output metrics are supported yet.
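As a rough illustration of those per-output dictionaries (a Keras 2-style sketch with made-up layer shapes and names, not your actual network), the dictionary keys come from the names given to the output layers when the model is built:

from keras.layers import Input, Conv2D, Flatten, Dense
from keras.models import Model

# Name each output layer so compile() can refer to it by key.
inputs = Input(shape=(64, 64, 1))
features = Conv2D(8, (3, 3), padding='same', activation='relu')(inputs)
main_output = Conv2D(2, (1, 1), activation='softmax', name='main_output')(features)
aux_output = Dense(2, activation='softmax', name='aux_output')(Flatten()(features))

model = Model(inputs=inputs, outputs=[main_output, aux_output])
model.compile(optimizer='adam',
              loss={'main_output': 'categorical_crossentropy',
                    'aux_output': 'categorical_crossentropy'},
              loss_weights={'main_output': 1., 'aux_output': 0.5})

The targets are then passed per output as well, either as a list in the same order as outputs or as a dict keyed by the layer names, e.g. model.fit(X_train, {'main_output': y_seg, 'aux_output': y_cls}, ...).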
I've written some code in TensorFlow to compute the edit distance between one string and a set of strings. I can't figure out the error.
import tensorflow as tf
sess = tf.Session()
# Create input data
test_string = ['foo']
ref_strings = ['food', 'bar']
def create_sparse_vec(word_list):
    num_words = len(word_list)
    indices = [[xi, 0, yi] for xi, x in enumerate(word_list) for yi, y in enumerate(x)]
    chars = list(''.join(word_list))
    return tf.SparseTensor(indices, chars, [num_words, 1, 1])
test_string_sparse = create_sparse_vec(test_string*len(ref_strings))
ref_string_sparse = create_sparse_vec(ref_strings)
sess.run(tf.edit_distance(test_string_sparse, ref_string_sparse, normalize=True))
This code works and when run, it produces the output:
array([[ 0.25],
       [ 1.  ]], dtype=float32)
But when I attempt to do this by feeding the sparse tensors in through sparse placeholders, I get an error.
test_input = tf.sparse_placeholder(dtype=tf.string)
ref_input = tf.sparse_placeholder(dtype=tf.string)
edit_distances = tf.edit_distance(test_input, ref_input, normalize=True)
feed_dict = {test_input: test_string_sparse,
             ref_input: ref_string_sparse}
sess.run(edit_distances, feed_dict=feed_dict)
Here is the error traceback:
Traceback (most recent call last):
File "<ipython-input-29-4e06de0b7af3>", line 1, in <module>
sess.run(edit_distances, feed_dict=feed_dict)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 372, in run
run_metadata_ptr)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 597, in _run
for subfeed, subfeed_val in _feed_fn(feed, feed_val):
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 558, in _feed_fn
return feed_fn(feed, feed_val)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 268, in <lambda>
[feed.indices, feed.values, feed.shape], feed_val)),
TypeError: zip argument #2 must support iteration
Any idea what is going on here?
TL;DR: For the return type of create_sparse_vec(), use tf.SparseTensorValue instead of tf.SparseTensor.
The problem here comes from the return type of create_sparse_vec(): a tf.SparseTensor is not understood as a feed value in the call to sess.run().
When you feed a (dense) tf.Tensor, the expected value type is a NumPy array (or certain objects that can be converted to an array). When you feed a tf.SparseTensor, the expected value type is a tf.SparseTensorValue, which is similar to a tf.SparseTensor but whose indices, values, and shape properties are NumPy arrays (or certain objects that can be converted to arrays, like the lists in your example).
The following code should work:
def create_sparse_vec(word_list):
    num_words = len(word_list)
    indices = [[xi, 0, yi] for xi, x in enumerate(word_list) for yi, y in enumerate(x)]
    chars = list(''.join(word_list))
    return tf.SparseTensorValue(indices, chars, [num_words, 1, 1])
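With that change, the placeholder version from the question runs as written; for completeness, here is a short usage sketch reusing sess, test_string, and ref_strings defined above:

# Build the graph with sparse placeholders, then feed SparseTensorValue objects.
test_input = tf.sparse_placeholder(dtype=tf.string)
ref_input = tf.sparse_placeholder(dtype=tf.string)
edit_distances = tf.edit_distance(test_input, ref_input, normalize=True)

feed_dict = {test_input: create_sparse_vec(test_string * len(ref_strings)),
             ref_input: create_sparse_vec(ref_strings)}

print(sess.run(edit_distances, feed_dict=feed_dict))
# [[ 0.25]
#  [ 1.  ]]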