Continue training after keyboard interrupt? - tensorflow

I was training a TensorFlow model and then interrupted it by mashing the keyboard:
INFO:tensorflow:global step 101: loss = 5.1761 (52.61 sec/step)
INFO:tensorflow:global step 102: loss = 4.8679 (18.78 sec/step)
INFO:tensorflow:global step 103: loss = 4.9662 (19.02 sec/step)
INFO:tensorflow:global step 104: loss = 5.1126 (17.36 sec/step)
^C^X^C^[^[^[^[^[
exit
Traceback (most recent call last):
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 114, in <module>
tf.app.run()
File "/Library/Python/2.7/site-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 110, in main
saver=saver)
File "/Library/Python/2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 782, in train
sess, train_op, global_step, train_step_kwargs)
File "/Library/Python/2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 530, in train_step
run_metadata=run_metadata)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 1021, in _do_call
return fn(*args)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 1003, in _run_fn
status, run_metadata)
KeyboardInterrupt
Kristoffers-MacBook-Pro:im2txt kristoffer$ logout
Saving session...
...copying shared history...
...saving history...truncating history files...
...completed.
[Process completed]
When I try to start training again, I get the following error:
$ bazel-bin/im2txt/train --input_file_pattern="${MSCOCO_DIR}/train-?????-of-00256" --inception_checkpoint_file="${INCEPTION_CHECKPOINT}" --train_dir="${MODEL_DIR}/train" --train_inception=false --number_of_steps=150
CRITICAL:tensorflow:Found no input files matching /train-?????-of-00256
Traceback (most recent call last):
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 114, in <module>
tf.app.run()
File "/Library/Python/2.7/site-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 65, in main
model.build()
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/show_and_tell_model.py", line 353, in build
self.build_inputs()
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/show_and_tell_model.py", line 153, in build_inputs
num_reader_threads=self.config.num_input_reader_threads)
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/ops/inputs.py", line 98, in prefetch_input_data
data_files, shuffle=True, capacity=16, name=shard_queue_name)
File "/Library/Python/2.7/site-packages/tensorflow/python/training/input.py", line 211, in string_input_producer
raise ValueError(not_null_err)
ValueError: string_input_producer requires a non-null input tensor
What causes this and what can I do about it? Is there a proper way to pause/cancel a training session? (TensorFlow seems to pick up where it left off if you start by training for 50 steps and then set the number of steps to 100.)

It seems like your problem is caused by TensorFlow trying to load a session that wasn't properly saved when you interrupted your code. Your options now are either not loading the last session when you restart (by commenting out the loading line) or deleting the saved session/checkpoint files (training should then restart from scratch). It is hard to give a more specific example since you didn't share your code. Also note that the CRITICAL line in your log shows the pattern /train-?????-of-00256 with an empty prefix, which suggests ${MSCOCO_DIR} may not have been set in the shell where you restarted training.
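For reference, a minimal sketch of the usual TF 1.x save/resume pattern; the directory name and the surrounding training graph are assumed here, and tf.contrib.slim's train() does essentially the same checkpoint lookup in the train_dir you pass it:
import tensorflow as tf

train_dir = "/tmp/im2txt_train"  # assumed checkpoint directory
global_step = tf.Variable(0, name="global_step", trainable=False)
# ... the rest of the training graph would be built here ...
saver = tf.train.Saver(max_to_keep=5)

with tf.Session() as sess:
    ckpt = tf.train.latest_checkpoint(train_dir)
    if ckpt:
        saver.restore(sess, ckpt)  # resume from the last saved checkpoint
    else:
        sess.run(tf.global_variables_initializer())  # no checkpoint: start fresh
    # Run training steps here, periodically calling
    # saver.save(sess, train_dir + "/model.ckpt", global_step=global_step)
    # so that an interrupt loses at most a few steps of work.
So if the checkpoint files in ${MODEL_DIR}/train are intact, restarting the training command should resume from the last saved step; if they are corrupted, deleting them makes training start over.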

Related

Invalid argument: Nan in summary histogram by editing the number of labels

I have decreased the default number of labels of the Cityscapes dataset from 19 to 10. My goal is to change the dataset so the decoder has to relearn its weights, as a preparation exercise for increasing the number of output classes of the decoder.
The network I am using is DeepLab. Training runs fine at first; about 500 steps complete before the error.
(The log below does not start from the very first lines after training begins.)
I1111 16:19:23.461441 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.82067
Total loss is :[6.42209053]
INFO:tensorflow:global_step/sec: 1.84064
I1111 16:19:28.894436 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.84064
Total loss is :[6.23576546]
INFO:tensorflow:global_step/sec: 1.84368
I1111 16:19:34.318257 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.84368
Total loss is :[6.09628582]
INFO:tensorflow:global_step/sec: 1.83645
I1111 16:19:39.763585 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.83645
Total loss is :[6.20008707]
INFO:tensorflow:global_step/sec: 1.84192
I1111 16:19:45.192930 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.84192
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
[[{{node image_pooling/BatchNorm/moving_variance_1}}]]
[[Mean_225/_10177]]
(1) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
[[{{node image_pooling/BatchNorm/moving_variance_1}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/zwang/workspace//models-master/research/deeplab/train.py", line 521, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/zwang/.local/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/zwang/.local/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/home/zwang/workspace//models-master/research/deeplab/train.py", line 515, in main
sess.run([train_tensor])
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1252, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1353, in run
raise six.reraise(*original_exc_info)
File "/home/zwang/.local/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1338, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1411, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1169, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
[[node image_pooling/BatchNorm/moving_variance_1 (defined at home/zwang/workspace//models-master/research/deeplab/train.py:328) ]]
[[Mean_225/_10177]]
(1) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
[[node image_pooling/BatchNorm/moving_variance_1 (defined at home/zwang/workspace//models-master/research/deeplab/train.py:328) ]]
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node image_pooling/BatchNorm/moving_variance_1:
image_pooling/BatchNorm/moving_variance/read (defined at home/zwang/workspace/models-master/research/deeplab/model.py:478)
Input Source operations connected to node image_pooling/BatchNorm/moving_variance_1:
image_pooling/BatchNorm/moving_variance/read (defined at home/zwang/workspace/models-master/research/deeplab/model.py:478)
Original stack trace for 'image_pooling/BatchNorm/moving_variance_1':
File "home/zwang/workspace//models-master/research/deeplab/train.py", line 521, in <module>
tf.app.run()
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "home/zwang/.local/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "home/zwang/.local/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "home/zwang/workspace//models-master/research/deeplab/train.py", line 472, in main
dataset.ignore_label)
File "home/zwang/workspace//models-master/research/deeplab/train.py", line 379, in _train_deeplab_model
reuse_variable=(i != 0))
File "home/zwang/workspace//models-master/research/deeplab/train.py", line 275, in _tower_loss
_build_deeplab(iterator, {common.OUTPUT_TYPE: num_of_classes}, ignore_label)
File "home/zwang/workspace//models-master/research/deeplab/train.py", line 257, in _build_deeplab
output_type_dict[model.MERGED_LOGITS_SCOPE])
File "home/zwang/workspace//models-master/research/deeplab/train.py", line 328, in _log_summaries
tf.summary.histogram(model_var.op.name, model_var)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/summary/summary.py", line 179, in histogram
tag=tag, values=values, name=scope)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 329, in histogram_summary
"HistogramSummary", tag=tag, values=values, name=name)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
I think the error
(0) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
looks like a TensorBoard-related error. Is there some way to avoid it?
Since my training ran 500 of its 30000 steps without any problem, I am hoping that by disabling part of the functionality (such as the TensorBoard histograms), or by editing the number of labels somewhere else (maybe there is another num_of_classes parameter that needs editing), the training process would run properly.
Could you give some suggestions, either directly about this error or about my general approach? Thanks.
Best regards,
Zhe
The problem was solved by adjusting the training hyper-parameters, for example decreasing the learning rate to stabilize the training process.
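If you only need to keep the histogram summary from crashing the run while you debug the underlying divergence, here is a minimal sketch of my own (assuming TF 1.x graph mode, as in the deeplab script above) that zeroes out non-finite values before summarizing:
import tensorflow as tf

def safe_histogram(name, tensor):
    # Replace NaN/Inf entries so HistogramSummary does not raise
    # InvalidArgumentError; the numerical problem itself still needs fixing,
    # e.g. by lowering the learning rate as described above.
    finite = tf.where(tf.is_finite(tensor), tensor, tf.zeros_like(tensor))
    return tf.summary.histogram(name, finite)
You could swap this in for the tf.summary.histogram(model_var.op.name, model_var) call shown in the traceback (train.py line 328), but treat it as hiding a symptom rather than a fix.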

Using InMemoryEvaluatorHook with TPU throws exception

I tried using an InMemoryEvaluatorHook with a TPUEstimator to get validation statistics while training my model. Using a loop of estimator.train() and estimator.evaluate() was too expensive as it rebuilt the graph every epoch, rather than trying to reuse it (as referenced in this issue: https://github.com/tensorflow/tensorflow/issues/13895). This is the basic code I use:
estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    config=run_config,
    use_tpu=True,
    train_batch_size=self.batch_size,
    eval_batch_size=self.batch_size,
    predict_batch_size=self.batch_size,
    params={})
train_fn = lambda params: input_fn(
    'train', self.data_dir, batch_size=params['batch_size'], train=True)
val_fn = lambda params: input_fn(
    'validation',
    self.data_dir,
    batch_size=params['batch_size'],
    train=False)
train_hook = tf.contrib.estimator.InMemoryEvaluatorHook(
    estimator,
    val_fn,
    steps=self.steps_per_val_epoch,
    every_n_iter=self.steps_per_epoch)
estimator.train(
    input_fn=train_fn,
    steps=self.steps_per_epoch * self.max_num_training_epochs,
    hooks=[
        train_hook,
    ])
This resulted in the following error:
Traceback (most recent call last):
File "dev/google_communicator/worker.py", line 160, in <module>
main()
File "dev/google_communicator/worker.py", line 133, in main
results = evaluator.eval(inputs, outputs)
File "/darch/deep_architect/contrib/misc/evaluators/tensorflow/tpu_estimator_classification.py", line 278, in eval
train_hook,
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2409, in train
rendezvous.raise_errors()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
six.reraise(typ, value, traceback)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2403, in train
saving_listeners=saving_listeners
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1468, in _train_with_estimator_spec
log_step_count_steps=log_step_count_steps) as mon_sess:
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 631, in __init__
h.begin()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/estimator/python/estimator/hooks.py", line 135, in begin
self._input_fn, self._hooks, checkpoint_path=None)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1484, in _evaluate_build_graph
self._call_model_fn_eval(input_fn, self.config))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1520, in _call_model_fn_eval
features, labels, model_fn_lib.ModeKeys.EVAL, config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2195, in _call_model_fn
features, labels, mode, config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1195, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2631, in _model_fn
rendezvous=self._rendezvous[mode]),
KeyError: 'eval'
Is there a better way to get validation statistics every epoch with TPUs? If not, how are you supposed to do validation?
Edit: I seem to have gotten past this error by running estimator.train() and estimator.evaluate() for one step without the hook, and then running the full training with the hook (see the sketch at the end of this post). Unfortunately, after the first evaluation, there is an error when training restarts:
Traceback (most recent call last):
File "dev/google_communicator/worker.py", line 160, in <module>
main()
File "dev/google_communicator/worker.py", line 133, in main
results = evaluator.eval(inputs, outputs)
File "/darch/deep_architect/contrib/misc/evaluators/tensorflow/tpu_estimator_classification.py", line 329, in eval
train_hook,
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2409, in train
rendezvous.raise_errors()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
six.reraise(typ, value, traceback)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2403, in train
saving_listeners=saving_listeners
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1471, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 671, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1312, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: The TPU system has not been initialized.
[[{{node TPUReplicate/_compile/_14248540389241865347/_28}} = TPUCompile[NumDynamicShapes=0, Tguaranteed_constants=[], function=cluster_18378946049549366873_f15n_0[], metadata="\n\006\010...6\323\352L", num_computations=1, _device="/job:worker/replica:0/task:0/device:CPU:0"](^cluster/control_before/_0)]]
[[{{node tpu_compile_succeeded_assert/_1897752282630996029/_29_G679}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:TPU:2", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_incarnation=2337451129362726278, tensor_name="edge_174_tpu_compile_succeeded_assert/_1897752282630996029/_29", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:TPU:2"]()]]
To clarify, the following things happen before the error is thrown: the two initializing train and evaluate calls to the estimator, training for one epoch, evaluating on the validation set. When the estimator tries to restart training, this exception is thrown.
This open issue might be relevant: https://github.com/tensorflow/tensor2tensor/issues/1202
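For reference, a rough sketch of the warm-up workaround described in the edit above, reusing the estimator, train_fn, val_fn and train_hook objects from the snippet at the top of this post. This only reproduces what the edit describes (which then hits the TPU-initialization error shown above); it is not a confirmed fix, and whether it is a supported pattern for TPUEstimator is unclear.
# Warm up: build and run the train and eval graphs once, without the in-memory hook.
estimator.train(input_fn=train_fn, steps=1)
estimator.evaluate(input_fn=val_fn, steps=1)

# Then run the full training with the InMemoryEvaluatorHook attached.
estimator.train(
    input_fn=train_fn,
    steps=self.steps_per_epoch * self.max_num_training_epochs,
    hooks=[train_hook])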

Nan in summary histogram for: FirstStageFeatureExtractor

I am trying to run the TensorFlow Object Detection API. I made a dataset with 1 class using labelImg and then converted the XMLs to TFRecord files. Some information:
os: Ubuntu 16.04
gpu: nvidia geforce 1080Ti & 1060
tensorflow version: 1.3.0
training model: faster_rcnn_resnet101_coco (although I have tried others)
Classes: 1
I then run train.py and training begins. At around step 355 I get the error:
INFO:tensorflow:global step 363: loss = 1.4006 (0.294 sec/step)
INFO:tensorflow:Finished training! Saving model to disk.
Traceback (most recent call last):
File "object_detection/train.py", line 163, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "object_detection/train.py", line 159, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/ucfadng/tensorflow/models/research/object_detection/trainer.py", line 332, in train
saver=saver)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 767, in train
sv.stop(threads, close_summary_writer=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 792, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 296, in stop_on_exception
yield
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 494, in run
self.run_loop()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 994, in run_loop
self._sv.global_step])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma_1
[[Node: SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma_1/tag, SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma/read)]]
[[Node: Loss/RPNLoss/map/TensorArray_2/_1353 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_5239_Loss/RPNLoss/map/TensorArray_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Caused by op u'SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma_1', defined at:
File "object_detection/train.py", line 163, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "object_detection/train.py", line 159, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/ucfadng/tensorflow/models/research/object_detection/trainer.py", line 295, in train
global_summaries.add(tf.summary.histogram(model_var.op.name, model_var))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/summary.py", line 192, in histogram
tag=tag, values=values, name=scope)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 129, in _histogram_summary
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Nan in summary histogram for: SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma_1
[[Node: SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma_1/tag, SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv3/BatchNorm/gamma/read)]]
[[Node: Loss/RPNLoss/map/TensorArray_2/_1353 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_5239_Loss/RPNLoss/map/TensorArray_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
My label map is:
item {
  id: 1
  name: 'rail'
}
The pipeline used was downloaded from: https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/faster_rcnn_resnet101_coco.config
I have had a look at the TFRecord files and they look as expected, which leads me to think this has something to do with the initial dataset. But what specifically would cause this error?
My dataset consists of 250 clear images containing railway tracks. I have labelled sections of the tracks, so each image has roughly 20-30 labelled objects.
So far the main things I have tried are lowering the learning rate and changing the batch size, but these did not solve the problem.
Any help in solving this would be very appreciated.
Cheers
I have now updated to version 1.4.0 and this is no longer a problem. I did not change anything else, so perhaps this was a bug in version 1.3.0. I will leave this up in case anyone has the same issue.
This doesn't seem to be an error with TensorFlow itself; I believe it has something to do with your system running out of memory.
Setting train to false in batch_norm (within your pipeline.config file) appears to overcome this problem.
It should look something like this:
batch_norm {
  decay: 0.999
  center: true
  scale: true
  epsilon: 0.001
  train: false
}
Hope this helped.

tensorflow CTC runtime error

I ran into a runtime error during training when I tried to apply TensorFlow's built-in CTC loss function (https://www.tensorflow.org/versions/r0.10/api_docs/python/nn/conectionist_temporal_classification__ctc_) to the SynthText dataset (http://www.robots.ox.ac.uk/~vgg/data/scenetext/).
It says "Not enough time for target transition sequence (required: 4, available: 0)".
Some environment info: TensorFlow version '0.12.0-rc0'.
I am able to apply the built-in CTC loss to the Synth90K dataset with great performance.
It seems like the SynthText dataset is not compatible with the built-in CTC loss, while the Synth90K dataset is.
Please find the error message below for reference:
step 980, loss = 50.17 (92.1 examples/sec; 0.695 sec/batch)
W tensorflow/core/framework/op_kernel.cc:975] Invalid argument: Not enough time for target transition sequence (required: 4, available: 0), skipping data instance in batch: 28
Traceback (most recent call last):
File "multi-gpu-train.py", line 305, in
tf.app.run()
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "multi-gpu-train.py", line 301, in main
train()
File "multi-gpu-train.py", line 270, in train
_, loss_value = sess.run([train_op, loss])
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Not enough time for target transition sequence (required: 4, available: 0), skipping data instance in batch: 28
[[Node: tower_0/CTCLoss = CTCLoss[ctc_merge_repeated=true, preprocess_collapse_repeated=false, _device="/job:localhost/replica:0/task:0/cpu:0"](tower_0/transpose_2/_555, tower_0/Where, tower_0/sub_2/_557, tower_0/Sum_1/_559)]]
Caused by op u'tower_0/CTCLoss', defined at:
File "multi-gpu-train.py", line 305, in
tf.app.run()
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "multi-gpu-train.py", line 301, in main
train()
File "multi-gpu-train.py", line 179, in train
loss,logits_op,images,labels = tower_loss(scope)
File "multi-gpu-train.py", line 79, in tower_loss
_ = network2.loss(logits,images, labels)
File "/home/ubuntu/experiments/network2_dev/network2.py", line 61, in loss
out = tf.nn.ctc_loss(logit, to_sparse(y), seq_len, time_major=False)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/ctc_ops.py", line 145, in ctc_loss
ctc_merge_repeated=ctc_merge_repeated)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_ctc_ops.py", line 164, in _ctc_loss
name=name)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in init
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): Not enough time for target transition sequence (required: 4, available: 0), skipping data instance in batch: 28
[[Node: tower_0/CTCLoss = CTCLoss[ctc_merge_repeated=true, preprocess_collapse_repeated=false, _device="/job:localhost/replica:0/task:0/cpu:0"](tower_0/transpose_2/_555, tower_0/Where, tower_0/sub_2/_557, tower_0/Sum_1/_559)]]
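The error message itself says that the target transition sequence needs 4 time steps but 0 are available, i.e. the logit sequence passed to CTC is shorter than what the label sequence requires. A small sketch of my own (assuming dense per-example label lists and known per-example logit lengths) of how one might pre-filter such examples before batching:
def required_ctc_length(labels):
    # CTC needs one time step per label plus one extra step (for a blank)
    # between each pair of repeated adjacent labels.
    repeats = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return len(labels) + repeats

def keep_example(labels, num_time_steps):
    # Drop examples like the one in the error above, where the available
    # time steps (0) are fewer than the required transitions (4).
    return num_time_steps >= required_ctc_length(labels)
An example with zero available time steps may also point to the sequence-length tensor being computed incorrectly for this dataset rather than to the dataset being unusable with CTC.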

Adding to vocab during Tensorflow seq2seq run

I am using the TensorFlow seq2seq tutorial to play with machine translation. Say I have trained the model for some time and decide I want to supplement the original vocabulary with new words to improve the quality of the model. Is there a way to pause training, add words to the vocabulary, and then resume training from the most recent checkpoint? I attempted to do so, but when I began training again I got this error:
Traceback (most recent call last):
File "execute.py", line 405, in <module>
train()
File "execute.py", line 127, in train
model = create_model(sess, False)
File "execute.py", line 108, in create_model
model.saver.restore(session, ckpt.model_checkpoint_path)
File "/home/jrthom18/.local/lib/python2.7/site- packages/tensorflow/python/training/saver.py", line 1388, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [384633] rhs shape= [384617]
[[Node: save/Assign_82 = Assign[T=DT_FLOAT, _class=["loc:#proj_b"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/cpu:0"](proj_b, save/RestoreV2_82)]]
Caused by op u'save/Assign_82', defined at:
File "execute.py", line 405, in <module>
train()
File "execute.py", line 127, in train
model = create_model(sess, False)
File "execute.py", line 99, in create_model
model = seq2seq_model.Seq2SeqModel( gConfig['enc_vocab_size'], gConfig['dec_vocab_size'], _buckets, gConfig['layer_size'], gConfig['num_layers'], gConfig['max_gradient_norm'], gConfig['batch_size'], gConfig['learning_rate'], gConfig['learning_rate_decay_factor'], forward_only=forward_only)
File "/home/jrthom18/data/3x256_bs32/easy_seq2seq/seq2seq_model.py", line 166, in __init__
self.saver = tf.train.Saver(tf.global_variables(), keep_checkpoint_every_n_hours=2.0)
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1000, in __init__
self.build()
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1030, in build
restore_sequentially=self._restore_sequentially)
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 624, in build
restore_sequentially, reshape)
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 373, in _AddRestoreOps
assign_ops.append(saveable.restore(tensors, shapes))
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 130, in restore
self.op.get_shape().is_fully_defined())
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/ops/gen_state_ops.py", line 47, in assign
use_locking=use_locking, name=name)
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/jrthom18/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [384633] rhs shape= [384617]
[[Node: save/Assign_82 = Assign[T=DT_FLOAT, _class=["loc:#proj_b"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/cpu:0"](proj_b, save/RestoreV2_82)]]
Obviously the new vocab is larger and so the tensor sizes do not match. Is there some way around this?
You cannot update the vocabulary once it is set, but you can use a shared wordpiece model instead. It lets you copy out-of-vocabulary words directly from the source to the target output.
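To see exactly which variables changed size when the vocabulary grew (proj_b went from 384617 to 384633 entries in the error above), here is a diagnostic sketch of my own, assuming the TF 1.x checkpoint reader API and that the enlarged graph has already been built:
import tensorflow as tf

def report_shape_mismatches(checkpoint_path):
    # Compare each graph variable's shape with the shape stored in the checkpoint.
    reader = tf.train.NewCheckpointReader(checkpoint_path)
    ckpt_shapes = reader.get_variable_to_shape_map()
    for var in tf.global_variables():
        name = var.op.name
        graph_shape = var.get_shape().as_list()
        if name in ckpt_shapes and graph_shape != ckpt_shapes[name]:
            print("mismatch: %s graph=%s checkpoint=%s"
                  % (name, graph_shape, ckpt_shapes[name]))
Restoring only the matching variables with an explicit tf.train.Saver(var_list=...) and initializing the mismatched ones fresh is a possible workaround, but as the answer above suggests, a vocabulary that does not change size (such as a shared wordpiece vocabulary) is the cleaner option.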