Invalid argument: Nan in summary histogram after editing the number of labels - tensorflow

I have decreased the default number of labels of the Cityscapes dataset from 19 to 10. My goal is to change the dataset so that the decoder needs to relearn its weights, as a preparation exercise for increasing the number of output classes of the decoder.
The network I am using is DeepLab; the training process is fine at first, and about 500 steps ran before the error.
(The log below doesn't start from the first line after the start of training.)
I1111 16:19:23.461441 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.82067
Total loss is :[6.42209053]
INFO:tensorflow:global_step/sec: 1.84064
I1111 16:19:28.894436 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.84064
Total loss is :[6.23576546]
INFO:tensorflow:global_step/sec: 1.84368
I1111 16:19:34.318257 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.84368
Total loss is :[6.09628582]
INFO:tensorflow:global_step/sec: 1.83645
I1111 16:19:39.763585 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.83645
Total loss is :[6.20008707]
INFO:tensorflow:global_step/sec: 1.84192
I1111 16:19:45.192930 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.84192
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
[[{{node image_pooling/BatchNorm/moving_variance_1}}]]
[[Mean_225/_10177]]
(1) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
[[{{node image_pooling/BatchNorm/moving_variance_1}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/zwang/workspace//models-master/research/deeplab/train.py", line 521, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/zwang/.local/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/zwang/.local/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/home/zwang/workspace//models-master/research/deeplab/train.py", line 515, in main
sess.run([train_tensor])
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1252, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1353, in run
raise six.reraise(*original_exc_info)
File "/home/zwang/.local/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1338, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1411, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1169, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
[[node image_pooling/BatchNorm/moving_variance_1 (defined at home/zwang/workspace//models-master/research/deeplab/train.py:328) ]]
[[Mean_225/_10177]]
(1) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
[[node image_pooling/BatchNorm/moving_variance_1 (defined at home/zwang/workspace//models-master/research/deeplab/train.py:328) ]]
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node image_pooling/BatchNorm/moving_variance_1:
image_pooling/BatchNorm/moving_variance/read (defined at home/zwang/workspace/models-master/research/deeplab/model.py:478)
Input Source operations connected to node image_pooling/BatchNorm/moving_variance_1:
image_pooling/BatchNorm/moving_variance/read (defined at home/zwang/workspace/models-master/research/deeplab/model.py:478)
Original stack trace for 'image_pooling/BatchNorm/moving_variance_1':
File "home/zwang/workspace//models-master/research/deeplab/train.py", line 521, in <module>
tf.app.run()
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "home/zwang/.local/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "home/zwang/.local/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "home/zwang/workspace//models-master/research/deeplab/train.py", line 472, in main
dataset.ignore_label)
File "home/zwang/workspace//models-master/research/deeplab/train.py", line 379, in _train_deeplab_model
reuse_variable=(i != 0))
File "home/zwang/workspace//models-master/research/deeplab/train.py", line 275, in _tower_loss
_build_deeplab(iterator, {common.OUTPUT_TYPE: num_of_classes}, ignore_label)
File "home/zwang/workspace//models-master/research/deeplab/train.py", line 257, in _build_deeplab
output_type_dict[model.MERGED_LOGITS_SCOPE])
File "home/zwang/workspace//models-master/research/deeplab/train.py", line 328, in _log_summaries
tf.summary.histogram(model_var.op.name, model_var)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/summary/summary.py", line 179, in histogram
tag=tag, values=values, name=scope)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 329, in histogram_summary
"HistogramSummary", tag=tag, values=values, name=name)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
I think the error
(0) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
looks like a TensorBoard-related error. Is there some way to avoid it?
My training has run 500 of the 30000 steps without any problem, so I am hoping that by disabling part of the functionality (like the TensorBoard histogram summaries), or by editing the number of labels somewhere else (maybe there is another num_of_classes parameter that needs editing), the training process would run properly.
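For illustration, a minimal sketch (assumed code, not DeepLab's actual _log_summaries) of what I mean by guarding the histogram summaries against non-finite values:
import tensorflow as tf

def log_histogram_safely(name, tensor):
    # Replace NaN/Inf entries with zeros before handing the tensor to the
    # histogram summary, so a bad moving_variance does not crash the run.
    finite = tf.where(tf.is_finite(tensor), tensor, tf.zeros_like(tensor))
    tf.summary.histogram(name, finite)

# Hypothetical usage inside a _log_summaries-style loop:
# for model_var in tf.model_variables():
#     log_histogram_safely(model_var.op.name, model_var)
This would only hide the symptom, of course; the moving_variance itself would still be NaN.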
Could you give some suggestions, either directly about this error or about my general approach? Thanks
Best Regards
Zhe

The problem was solved by adjusting the training hyper-parameters, e.g. decreasing the learning rate to stabilize the training process.
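For reference, a minimal generic TF 1.x sketch (assumed toy model and optimizer, not DeepLab's actual training graph) of the kind of stabilization meant here: a lower learning rate, optionally combined with gradient clipping. In DeepLab this amounts to passing a smaller learning-rate flag (e.g. base_learning_rate) to train.py.
import tensorflow as tf

# Toy graph standing in for the segmentation model.
images = tf.placeholder(tf.float32, [None, 64])
labels = tf.placeholder(tf.int32, [None])
logits = tf.layers.dense(images, 10)  # 10 classes after reducing the label set
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# Lowered learning rate plus gradient clipping to keep the BatchNorm statistics
# (and everything else) from blowing up to NaN.
optimizer = tf.train.MomentumOptimizer(learning_rate=1e-4, momentum=0.9)
grads_and_vars = optimizer.compute_gradients(loss)
clipped = [(tf.clip_by_norm(g, 5.0), v) for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped)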

Related

Creating custom object detection model

I am trying to build an object detection model with my custom dataset, which has only 1 class.
While following all the procedures explained in the tutorial, the script crashes and logs the following error:
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d}}]]
[[Loss/unstack_1/_10307]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d}}]]
0 successful operations.
0 derived errors ignored.
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d}}]]
[[Loss/unstack_1/_10307]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "model_main.py", line 109, in <module>
tf.app.run()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "model_main.py", line 105, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
return executor.run()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
return self.run_local()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
saving_listeners=saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1192, in _train_model_default
saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1484, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1252, in run
run_metadata=run_metadata)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1353, in run
raise six.reraise(*original_exc_info)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/six.py", line 703, in reraise
raise value
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1338, in run
return self._sess.run(*args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1411, in run
run_metadata=run_metadata)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1169, in run
return self._sess.run(*args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d (defined at /home/stud/hammadal/custom-model/models/research/slim/nets/inception_v2.py:129) ]]
[[Loss/unstack_1/_10307]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d (defined at /home/stud/hammadal/custom-model/models/research/slim/nets/inception_v2.py:129) ]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d':
File "model_main.py", line 109, in <module>
tf.app.run()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "model_main.py", line 105, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
return executor.run()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
return self.run_local()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
saving_listeners=saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1188, in _train_model_default
features, labels, ModeKeys.TRAIN, self.config)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1146, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/stud/hammadal/custom-model/models/research/object_detection/model_lib.py", line 308, in model_fn
features[fields.InputDataFields.true_image_shape])
File "/home/stud/hammadal/custom-model/models/research/object_detection/meta_architectures/ssd_meta_arch.py", line 600, in predict
preprocessed_inputs)
File "/home/stud/hammadal/custom-model/models/research/object_detection/models/ssd_inception_v2_feature_extractor.py", line 130, in extract_features
scope=scope)
File "/home/stud/hammadal/custom-model/models/research/slim/nets/inception_v2.py", line 129, in inception_v2_base
scope=end_point)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 2784, in separable_convolution2d
outputs = layer.apply(inputs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1479, in apply
return self.__call__(inputs, *args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 537, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 634, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 146, in wrapper
), args, kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 446, in converted_call
return _call_unconverted(f, args, kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 253, in _call_unconverted
return f(*args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/keras/layers/convolutional.py", line 1658, in call
data_format=conv_utils.convert_data_format(self.data_format, ndim=4))
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py", line 793, in separable_conv2d
name=name)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 1953, in conv2d
name=name)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1071, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "model_main.py", line 109, in <module>
tf.app.run()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "model_main.py", line 105, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
return executor.run()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
return self.run_local()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
saving_listeners=saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1192, in _train_model_default
saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1484, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1252, in run
run_metadata=run_metadata)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1353, in run
raise six.reraise(*original_exc_info)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/six.py", line 703, in reraise
raise value
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1338, in run
return self._sess.run(*args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1411, in run
run_metadata=run_metadata)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1169, in run
return self._sess.run(*args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d (defined at /home/stud/hammadal/custom-model/models/research/slim/nets/inception_v2.py:129) ]]
[[Loss/unstack_1/_10307]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d (defined at /home/stud/hammadal/custom-model/models/research/slim/nets/inception_v2.py:129) ]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d':
File "model_main.py", line 109, in <module>
tf.app.run()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "model_main.py", line 105, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
return executor.run()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
return self.run_local()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
saving_listeners=saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1188, in _train_model_default
features, labels, ModeKeys.TRAIN, self.config)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1146, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/stud/hammadal/custom-model/models/research/object_detection/model_lib.py", line 308, in model_fn
features[fields.InputDataFields.true_image_shape])
File "/home/stud/hammadal/custom-model/models/research/object_detection/meta_architectures/ssd_meta_arch.py", line 600, in predict
preprocessed_inputs)
File "/home/stud/hammadal/custom-model/models/research/object_detection/models/ssd_inception_v2_feature_extractor.py", line 130, in extract_features
scope=scope)
File "/home/stud/hammadal/custom-model/models/research/slim/nets/inception_v2.py", line 129, in inception_v2_base
scope=end_point)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 2784, in separable_convolution2d
outputs = layer.apply(inputs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1479, in apply
return self.__call__(inputs, *args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 537, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 634, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 146, in wrapper
), args, kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 446, in converted_call
return _call_unconverted(f, args, kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 253, in _call_unconverted
return f(*args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/keras/layers/convolutional.py", line 1658, in call
data_format=conv_utils.convert_data_format(self.data_format, ndim=4))
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py", line 793, in separable_conv2d
name=name)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 1953, in conv2d
name=name)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1071, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
This happens while running it on a server where I can use the power of the GPU.
When I run the script on my local machine using only the CPU and a batch size of 1, the script executes.
The script being used is from the official TensorFlow repo HERE.
The server hardware information is as follows:
> OS: Ubuntu x86_64
> memory: 503GiB system memory
> processor: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> display: GV100GL [Tesla V100 PCIe 32GB]
Libraries:
> tensorflow-gpu: 1.14
> numpy: 1.16
> absl-py: 0.9
I have been trying to work my way through this for the last 2 weeks. If someone can help or guide me on what I need to read, I would highly appreciate it.
It looks like cuDNN failed to initialize, which relates more to the TensorFlow installation itself. Try running the following on the server, which should install cuDNN properly:
conda install tensorflow-gpu
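If reinstalling does not help, another commonly reported workaround for this cuDNN initialization failure (a sketch, not part of the answer above) is to let TensorFlow allocate GPU memory on demand instead of grabbing it all at once:
import tensorflow as tf

# TF 1.x: allow GPU memory to grow on demand. For an Estimator-based script
# such as model_main.py, the session config is passed in via RunConfig; where
# exactly to wire it in depends on the script.
session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True

run_config = tf.estimator.RunConfig(session_config=session_config)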

Object Detection crash after 5428 steps, TypeError: 'numpy.float64' object cannot be interpreted as an integer

My object detector has run multiple times, but at this 5428-step mark it crashes from TypeErrors.
I'm running it in Anaconda with:
numpy 1.18.1
numpy-base 1.18.1
tensorflow-gpu 1.14
I think this snippet below is the most important error:
2020-02-19 13:56:06.901096: W tensorflow/core/framework/op_kernel.cc:1490] Invalid argument: TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
Traceback (most recent call last):
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\numpy\core\function_base.py", line 117, in linspace
num = operator.index(num)
TypeError: 'numpy.float64' object cannot be interpreted as an integer
The full traceback is below:
I0219 13:55:41.016854 15428 basic_session_run_hooks.py:260] loss = 0.0140173, step = 5400 (10.773 sec)
INFO:tensorflow:Saving checkpoints for 5428 into training/model.ckpt.
I0219 13:55:43.900022 15428 basic_session_run_hooks.py:606] Saving checkpoints for 5428 into training/model.ckpt.
INFO:tensorflow:Calling model_fn.
I0219 13:55:56.207441 15428 estimator.py:1145] Calling model_fn.
INFO:tensorflow:Scale of 0 disables regularizer.
I0219 13:55:58.009801 15428 regularizers.py:98] Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
I0219 13:55:58.025418 15428 regularizers.py:98] Scale of 0 disables regularizer.
INFO:tensorflow:depth of additional conv before box predictor: 0
I0219 13:55:58.025418 15428 convolutional_box_predictor.py:151] depth of additional conv before box predictor: 0
INFO:tensorflow:Scale of 0 disables regularizer.
I0219 13:55:59.573186 15428 regularizers.py:98] Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
I0219 13:55:59.588815 15428 regularizers.py:98] Scale of 0 disables regularizer.
WARNING:tensorflow:From C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\object_detection-0.1-py3.7.egg\object_detection\eval_util.py:796: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
W0219 13:56:00.855241 15428 deprecation.py:323] From C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\object_detection-0.1-py3.7.egg\object_detection\eval_util.py:796: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
WARNING:tensorflow:From C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\object_detection-0.1-py3.7.egg\object_detection\utils\visualization_utils.py:498: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, there are two
options available in V2.
- tf.py_function takes a python function which manipulates tf eager
tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
an ndarray (just call tensor.numpy()) but having access to eager tensors
means `tf.py_function`s can use accelerators such as GPUs as well as
being differentiable using a gradient tape.
- tf.numpy_function maintains the semantics of the deprecated tf.py_func
(it is not differentiable, and manipulates numpy arrays). It drops the
stateful argument making all functions stateful.
W0219 13:56:01.105266 15428 deprecation.py:323] From C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\object_detection-0.1-py3.7.egg\object_detection\utils\visualization_utils.py:498: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, there are two
options available in V2.
- tf.py_function takes a python function which manipulates tf eager
tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
an ndarray (just call tensor.numpy()) but having access to eager tensors
means `tf.py_function`s can use accelerators such as GPUs as well as
being differentiable using a gradient tape.
- tf.numpy_function maintains the semantics of the deprecated tf.py_func
(it is not differentiable, and manipulates numpy arrays). It drops the
stateful argument making all functions stateful.
WARNING:tensorflow:From C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\object_detection-0.1-py3.7.egg\object_detection\utils\visualization_utils.py:1044: The name tf.summary.image is deprecated. Please use tf.compat.v1.summary.image instead.
W0219 13:56:01.277014 15428 deprecation_wrapper.py:119] From C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\object_detection-0.1-py3.7.egg\object_detection\utils\visualization_utils.py:1044: The name tf.summary.image is deprecated. Please use tf.compat.v1.summary.image instead.
WARNING:tensorflow:From C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\object_detection-0.1-py3.7.egg\object_detection\model_lib.py:484: The name tf.metrics.mean is deprecated. Please use tf.compat.v1.metrics.mean instead.
W0219 13:56:01.386395 15428 deprecation_wrapper.py:119] From C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\object_detection-0.1-py3.7.egg\object_detection\model_lib.py:484: The name tf.metrics.mean is deprecated. Please use tf.compat.v1.metrics.mean instead.
INFO:tensorflow:Done calling model_fn.
I0219 13:56:01.749697 15428 estimator.py:1147] Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-02-19T13:56:01Z
I0219 13:56:01.781106 15428 evaluation.py:255] Starting evaluation at 2020-02-19T13:56:01Z
INFO:tensorflow:Graph was finalized.
I0219 13:56:02.489665 15428 monitored_session.py:240] Graph was finalized.
2020-02-19 13:56:02.508162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:06:00.0
2020-02-19 13:56:02.512995: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2020-02-19 13:56:02.516493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2020-02-19 13:56:02.518703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-19 13:56:02.523922: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2020-02-19 13:56:02.526614: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2020-02-19 13:56:02.529223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8788 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0, compute capability: 6.1)
WARNING:tensorflow:From C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
W0219 13:56:02.535778 15428 deprecation.py:323] From C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from training/model.ckpt-5428
I0219 13:56:02.538779 15428 saver.py:1280] Restoring parameters from training/model.ckpt-5428
INFO:tensorflow:Running local_init_op.
I0219 13:56:03.495252 15428 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0219 13:56:03.656017 15428 session_manager.py:502] Done running local_init_op.
INFO:tensorflow:Performing evaluation on 5 images.
I0219 13:56:06.852077 13368 coco_evaluation.py:205] Performing evaluation on 5 images.
creating index...
index created!
INFO:tensorflow:Loading and preparing annotation results...
I0219 13:56:06.867704 13368 coco_tools.py:115] Loading and preparing annotation results...
INFO:tensorflow:DONE (t=0.00s)
I0219 13:56:06.867704 13368 coco_tools.py:137] DONE (t=0.00s)
creating index...
index created!
2020-02-19 13:56:06.901096: W tensorflow/core/framework/op_kernel.cc:1490] Invalid argument: TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
Traceback (most recent call last):
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\numpy\core\function_base.py", line 117, in linspace
num = operator.index(num)
TypeError: 'numpy.float64' object cannot be interpreted as an integer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\ops\script_ops.py", line 209, in __call__
ret = func(*args)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\object_detection-0.1-py3.7.egg\object_detection\metrics\coco_evaluation.py", line 384, in first_value_func
self._metrics = self.evaluate()
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\object_detection-0.1-py3.7.egg\object_detection\metrics\coco_evaluation.py", line 215, in evaluate
coco_wrapped_groundtruth, coco_wrapped_detections, agnostic_mode=False)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\object_detection-0.1-py3.7.egg\object_detection\metrics\coco_tools.py", line 176, in __init__
iouType=iou_type)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\pycocotools\cocoeval.py", line 76, in __init__
self.params = Params(iouType=iouType) # parameters
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\pycocotools\cocoeval.py", line 527, in __init__
self.setDetParams()
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\pycocotools\cocoeval.py", line 507, in setDetParams
self.iouThrs = np.linspace(.5, 0.95, np.round((0.95 - .5) / .05) + 1, endpoint=True)
File "<__array_function__ internals>", line 6, in linspace
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\numpy\core\function_base.py", line 121, in linspace
.format(type(num)))
TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
Traceback (most recent call last):
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1356, in _do_call
return fn(*args)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.OutOfRangeError: 2 root error(s) found.
(0) Out of range: End of sequence
[[{{node IteratorGetNext}}]]
(1) Out of range: End of sequence
[[{{node IteratorGetNext}}]]
[[Loss/BoxClassifierLoss/assert_equal/Assert/Assert/data_4/_2449]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\evaluation.py", line 272, in _evaluate_once
session.run(eval_ops, feed_dict)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1252, in run
run_metadata=run_metadata)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1353, in run
raise six.reraise(*original_exc_info)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\six.py", line 703, in reraise
raise value
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1338, in run
return self._sess.run(*args, **kwargs)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1411, in run
run_metadata=run_metadata)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1169, in run
return self._sess.run(*args, **kwargs)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 950, in run
run_metadata_ptr)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_run
run_metadata)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: 2 root error(s) found.
(0) Out of range: End of sequence
[[node IteratorGetNext (defined at model_main.py:105) ]]
(1) Out of range: End of sequence
[[node IteratorGetNext (defined at model_main.py:105) ]]
[[Loss/BoxClassifierLoss/assert_equal/Assert/Assert/data_4/_2449]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'IteratorGetNext':
File "model_main.py", line 109, in <module>
tf.app.run()
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\absl\app.py", line 299, in run
_run_main(main, args)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\absl\app.py", line 250, in _run_main
sys.exit(main(argv))
File "model_main.py", line 105, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\training.py", line 473, in train_and_evaluate
return executor.run()
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\training.py", line 613, in run
return self.run_local()
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\training.py", line 714, in run_local
saving_listeners=saving_listeners)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 367, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 1158, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 1192, in _train_model_default
saving_listeners)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 1484, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1252, in run
run_metadata=run_metadata)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1338, in run
return self._sess.run(*args, **kwargs)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1419, in run
run_metadata=run_metadata))
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\basic_session_run_hooks.py", line 594, in after_run
if self._save(run_context.session, global_step):
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\basic_session_run_hooks.py", line 619, in _save
if l.after_save(session, step):
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\training.py", line 519, in after_save
self._evaluate(global_step_value) # updates self.eval_result
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\training.py", line 539, in _evaluate
self._evaluator.evaluate_and_export())
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\training.py", line 920, in evaluate_and_export
hooks=self._eval_spec.hooks)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 477, in evaluate
name=name)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 519, in _actual_eval
return _evaluate()
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 501, in _evaluate
self._evaluate_build_graph(input_fn, hooks, checkpoint_path))
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 1501, in _evaluate_build_graph
self._call_model_fn_eval(input_fn, self.config))
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 1534, in _call_model_fn_eval
input_fn, ModeKeys.EVAL)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 1022, in _get_features_and_labels_from_input_fn
self._call_input_fn(input_fn, mode))
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\util.py", line 65, in parse_input_fn_result
result = iterator.get_next()
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 426, in get_next
output_shapes=self._structure._flat_shapes, name=name)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\ops\gen_dataset_ops.py", line 1947, in iterator_get_next
output_shapes=output_shapes, name=name)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\framework\ops.py", line 3616, in create_op
op_def=op_def)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\framework\ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\ops\script_ops.py", line 209, in __call__
ret = func(*args)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\object_detection-0.1-py3.7.egg\object_detection\metrics\coco_evaluation.py", line 384, in first_value_func
self._metrics = self.evaluate()
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\object_detection-0.1-py3.7.egg\object_detection\metrics\coco_evaluation.py", line 215, in evaluate
coco_wrapped_groundtruth, coco_wrapped_detections, agnostic_mode=False)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\object_detection-0.1-py3.7.egg\object_detection\metrics\coco_tools.py", line 176, in __init__
iouType=iou_type)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\pycocotools\cocoeval.py", line 76, in __init__
self.params = Params(iouType=iouType) # parameters
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\pycocotools\cocoeval.py", line 527, in __init__
self.setDetParams()
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\pycocotools\cocoeval.py", line 507, in setDetParams
self.iouThrs = np.linspace(.5, 0.95, np.round((0.95 - .5) / .05) + 1, endpoint=True)
File "<__array_function__ internals>", line 6, in linspace
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\numpy\core\function_base.py", line 121, in linspace
.format(type(num)))
TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
[[node PyFunc_3 (defined at C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\object_detection-0.1-py3.7.egg\object_detection\metrics\coco_evaluation.py:394) ]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'PyFunc_3':
File "model_main.py", line 109, in <module>
tf.app.run()
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\absl\app.py", line 299, in run
_run_main(main, args)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\absl\app.py", line 250, in _run_main
sys.exit(main(argv))
File "model_main.py", line 105, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\training.py", line 473, in train_and_evaluate
return executor.run()
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\training.py", line 613, in run
return self.run_local()
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\training.py", line 714, in run_local
saving_listeners=saving_listeners)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 367, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 1158, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 1192, in _train_model_default
saving_listeners)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 1484, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1252, in run
run_metadata=run_metadata)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1338, in run
return self._sess.run(*args, **kwargs)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1419, in run
run_metadata=run_metadata))
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\basic_session_run_hooks.py", line 594, in after_run
if self._save(run_context.session, global_step):
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\basic_session_run_hooks.py", line 619, in _save
if l.after_save(session, step):
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\training.py", line 519, in after_save
self._evaluate(global_step_value) # updates self.eval_result
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\training.py", line 539, in _evaluate
self._evaluator.evaluate_and_export())
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\training.py", line 920, in evaluate_and_export
hooks=self._eval_spec.hooks)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 477, in evaluate
name=name)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 519, in _actual_eval
return _evaluate()
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 501, in _evaluate
self._evaluate_build_graph(input_fn, hooks, checkpoint_path))
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 1501, in _evaluate_build_graph
self._call_model_fn_eval(input_fn, self.config))
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 1537, in _call_model_fn_eval
features, labels, ModeKeys.EVAL, config)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 1146, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\object_detection-0.1-py3.7.egg\object_detection\model_lib.py", line 482, in model_fn
eval_config, list(category_index.values()), eval_dict)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\object_detection-0.1-py3.7.egg\object_detection\eval_util.py", line 947, in get_eval_metric_ops_for_evaluators
eval_dict))
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\object_detection-0.1-py3.7.egg\object_detection\metrics\coco_evaluation.py", line 394, in get_estimator_eval_metric_ops
first_value_op = tf.py_func(first_value_func, [], tf.float32)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\util\deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\ops\script_ops.py", line 480, in py_func
return py_func_common(func, inp, Tout, stateful, name=name)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\ops\script_ops.py", line 462, in py_func_common
func=func, inp=inp, Tout=Tout, stateful=stateful, eager=False, name=name)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\ops\script_ops.py", line 285, in _internal_py_func
input=inp, token=token, Tout=Tout, name=name)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\ops\gen_script_ops.py", line 159, in py_func
"PyFunc", input=input, token=token, Tout=Tout, name=name)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\framework\ops.py", line 3616, in create_op
op_def=op_def)
File "C:\Users\luke9\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\framework\ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
Some of the traceback is missing because of the character limit, but the rest is all numpy TypeErrors.
Try downgrading your numpy version.
In my case, I had to downgrade it to 1.17.4.
I had the same issue. It seems that updating to TensorFlow 1.15.0 solves it.
You need to downgrade your numpy to 1.17.
I also found that limiting the GPU memory growth can help, although it slowed down training a lot.
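In case it helps, here is a minimal TF 1.x sketch of that GPU memory setting; the RunConfig wiring is an assumption about how the Estimator in the traceback above is constructed, so adapt it to your own model_main.py setup.
import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing it all up front
# (this is the memory-growth option mentioned above).
session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of GPU memory the process may use:
# session_config.gpu_options.per_process_gpu_memory_fraction = 0.7

# Pass the session config to the Estimator via RunConfig.
run_config = tf.estimator.RunConfig(session_config=session_config)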

maximum number of labels in cross_entropy classification is 300?

I found that with sparse_softmax_cross_entropy_with_logits you can only have at most 300 labels? I found nothing documented about that.
What if I have more?
EDIT:
If I do not limit it to 300 classes, I get the following trace every time:
2019-03-05 15:24:17.899610: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_xent_op.cc:90 : Invalid argument: Received a label value of 428 which is outside the valid range of [0, 300). Label values: 428 262
Traceback (most recent call last):
File "C:\Users\\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call
return fn(*args)
File "C:\Users\\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "C:\Users\\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Received a label value of 428 which is outside the valid range of [0, 300). Label values: 428 262
[[{{node QAModel/loss/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits}} = SparseSoftmaxCrossEntropyWithLogits[T=DT_FLOAT, Tlabels=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](QAModel/StartDist/SimpleSoftmaxLayer/Add, _arg_QAModel/Placeholder_4_0_5)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "main.py", line 236, in <module>
tf.app.run()
File "C:\Users\\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "main.py", line 194, in main
qa_model.train(sess, train_context_path, train_qn_path, train_ans_path, dev_qn_path, dev_context_path, dev_ans_path)
File "C:\Users\\IB\QA-Models-Bidaf\code\qa_model.py", line 764, in train
loss, global_step, param_norm, grad_norm = self.run_train_iter(session, batch, summary_writer)
File "C:\Users\\IB\QA-Models-Bidaf\code\qa_model.py", line 359, in run_train_iter
[_, summaries, loss, global_step, param_norm, gradient_norm] = session.run(output_feed, input_feed)
File "C:\Users\\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
run_metadata_ptr)
File "C:\Users\\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
run_metadata)
File "C:\Users\\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Received a label value of 428 which is outside the valid range of [0, 300). Label values: 428 262
[[node QAModel/loss/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits (defined at C:\Users\\IB\QA-Models-Bidaf\code\qa_model.py:318) = SparseSoftmaxCrossEntropyWithLogits[T=DT_FLOAT, Tlabels=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](QAModel/StartDist/SimpleSoftmaxLayer/Add, _arg_QAModel/Placeholder_4_0_5)]]
Caused by op 'QAModel/loss/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits', defined at:
File "main.py", line 236, in <module>
tf.app.run()
File "C:\Users\\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "main.py", line 165, in main
qa_model = QAModel(FLAGS, id2word, word2id, emb_matrix)
File "C:\Users\\IB\QA-Models-Bidaf\code\qa_model.py", line 64, in __init__
self.add_loss()
File "C:\Users\\IB\QA-Models-Bidaf\code\qa_model.py", line 318, in add_loss
loss= tf.nn.sparse_softmax_cross_entropy_with_logits(logits=self.logits, labels=self.ans_span) # loss_start has shape (batch_size)
File "C:\Users\\Anaconda3\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 2049, in sparse_softmax_cross_entropy_with_logits
precise_logits, labels, name=name)
File "C:\Users\\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 8063, in sparse_softmax_cross_entropy_with_logits
labels=labels, name=name)
File "C:\Users\\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "C:\Users\\Anaconda3\lib\site-packages\tensorflow\python\util\deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "C:\Users\\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 3274, in create_op
op_def=op_def)
File "C:\Users\\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): Received a label value of 428 which is outside the valid range of [0, 300). Label values: 428 262
[[node QAModel/loss/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits (defined at C:\Users\\IB\QA-Models-Bidaf\code\qa_model.py:318) = SparseSoftmaxCrossEntropyWithLogits[T=DT_FLOAT, Tlabels=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](QAModel/StartDist/SimpleSoftmaxLayer/Add, _arg_QAModel/Placeholder_4_0_5)]]
Every time I get this range up to 300. Why?!
valid range of [0, 300)
When I took part in the YouTube8M Kaggle competition, it had about 5k classes and we used a loss provided by the competition organizers, i.e. Google.
Have a look at https://github.com/mpekalski/Y8M/blob/master/video_level_code/losses.py
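For what it's worth, the op itself has no hard-coded 300-class cap: the valid label range [0, num_classes) comes from the last dimension of the logits, so the trace above suggests the model's output layer has 300 units. A minimal TF 1.x sketch (batch size and class count are illustrative):
import numpy as np
import tensorflow as tf

batch_size = 2
num_classes = 429  # must exceed the largest label id (428 in the trace above)

logits = tf.placeholder(tf.float32, [batch_size, num_classes])
labels = tf.placeholder(tf.int32, [batch_size])
# Valid labels are 0 .. num_classes - 1; anything larger triggers the
# InvalidArgumentError shown above.
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)

with tf.Session() as sess:
    print(sess.run(loss, {
        logits: np.random.randn(batch_size, num_classes).astype(np.float32),
        labels: np.array([428, 262], dtype=np.int32),
    }))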

tensorflow CTC runtime error

I ran into a runtime error during training when I tried to apply the TensorFlow built-in CTC loss function (https://www.tensorflow.org/versions/r0.10/api_docs/python/nn/conectionist_temporal_classification__ctc_) to the SynthText dataset (http://www.robots.ox.ac.uk/~vgg/data/scenetext/).
It said "Not enough time for target transition sequence (required: 4, available: 0)".
Here is some info about the environment: TensorFlow version '0.12.0-rc0'.
I am able to apply the built-in CTC to the Synth90K dataset with great performance.
It seems like the SynthText dataset is not compatible with the built-in CTC while the Synth90K dataset is.
Please find the error message below for reference:
step 980, loss = 50.17 (92.1 examples/sec; 0.695 sec/batch)
W tensorflow/core/framework/op_kernel.cc:975] Invalid argument: Not enough time for target transition sequence (required: 4, available: 0), skipping data instance in batch: 28
Traceback (most recent call last):
File "multi-gpu-train.py", line 305, in
tf.app.run()
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "multi-gpu-train.py", line 301, in main
train()
File "multi-gpu-train.py", line 270, in train
_, loss_value = sess.run([train_op, loss])
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Not enough time for target transition sequence (required: 4, available: 0), skipping data instance in batch: 28
[[Node: tower_0/CTCLoss = CTCLoss[ctc_merge_repeated=true, preprocess_collapse_repeated=false, _device="/job:localhost/replica:0/task:0/cpu:0"](tower_0/transpose_2/_555, tower_0/Where, tower_0/sub_2/_557, tower_0/Sum_1/_559)]]
Caused by op u'tower_0/CTCLoss', defined at:
File "multi-gpu-train.py", line 305, in
tf.app.run()
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "multi-gpu-train.py", line 301, in main
train()
File "multi-gpu-train.py", line 179, in train
loss,logits_op,images,labels = tower_loss(scope)
File "multi-gpu-train.py", line 79, in tower_loss
_ = network2.loss(logits,images, labels)
File "/home/ubuntu/experiments/network2_dev/network2.py", line 61, in loss
out = tf.nn.ctc_loss(logit, to_sparse(y), seq_len, time_major=False)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/ctc_ops.py", line 145, in ctc_loss
ctc_merge_repeated=ctc_merge_repeated)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_ctc_ops.py", line 164, in _ctc_loss
name=name)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in init
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): Not enough time for target transition sequence (required: 4, available: 0), skipping data instance in batch: 28
[[Node: tower_0/CTCLoss = CTCLoss[ctc_merge_repeated=true, preprocess_collapse_repeated=false, _device="/job:localhost/replica:0/task:0/cpu:0"](tower_0/transpose_2/_555, tower_0/Where, tower_0/sub_2/_557, tower_0/Sum_1/_559)]]
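The message itself points at the cause: CTC needs at least as many time frames as the target transition sequence requires (label length, plus a blank between repeated characters), and here one SynthText instance ends up with 0 usable frames. A sketch of one possible workaround, filtering such instances in the input pipeline before tf.nn.ctc_loss (the helper name is illustrative):
# Hypothetical pre-filtering step: drop samples whose label cannot fit into
# the number of time frames the network will produce for them.
def keep_sample(label_ids, num_time_steps):
    # With ctc_merge_repeated=True, each pair of adjacent repeated characters
    # needs one extra blank frame.
    repeats = sum(1 for a, b in zip(label_ids, label_ids[1:]) if a == b)
    return len(label_ids) + repeats <= num_time_steps

# The failing instance in the trace: 4 required frames, 0 available.
print(keep_sample([7, 3, 9, 2], 0))   # False -> skip this sample
print(keep_sample([7, 3, 9, 2], 20))  # True  -> safe to feed to ctc_loss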

Continue training after keyboard interrupt?

I was training a TensorFlow model and then I mashed the keyboard just to see what would happen:
INFO:tensorflow:global step 101: loss = 5.1761 (52.61 sec/step)
INFO:tensorflow:global step 102: loss = 4.8679 (18.78 sec/step)
INFO:tensorflow:global step 103: loss = 4.9662 (19.02 sec/step)
INFO:tensorflow:global step 104: loss = 5.1126 (17.36 sec/step)
^C^X^C^[^[^[^[^[
exit
Traceback (most recent call last):
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 114, in <module>
tf.app.run()
File "/Library/Python/2.7/site-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 110, in main
saver=saver)
File "/Library/Python/2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 782, in train
sess, train_op, global_step, train_step_kwargs)
File "/Library/Python/2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 530, in train_step
run_metadata=run_metadata)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 1021, in _do_call
return fn(*args)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 1003, in _run_fn
status, run_metadata)
KeyboardInterrupt
Kristoffers-MacBook-Pro:im2txt kristoffer$ logout
Saving session...
...copying shared history...
...saving history...truncating history files...
...completed.
[Process completed]
When I'm trying to start training again, I get the following error:
$ bazel-bin/im2txt/train --input_file_pattern="${MSCOCO_DIR}/train-?????-of-00256" --inception_checkpoint_file="${INCEPTION_CHECKPOINT}" --train_dir="${MODEL_DIR}/train" --train_inception=false --number_of_steps=150
CRITICAL:tensorflow:Found no input files matching /train-?????-of-00256
Traceback (most recent call last):
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 114, in <module>
tf.app.run()
File "/Library/Python/2.7/site-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 65, in main
model.build()
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/show_and_tell_model.py", line 353, in build
self.build_inputs()
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/show_and_tell_model.py", line 153, in build_inputs
num_reader_threads=self.config.num_input_reader_threads)
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/ops/inputs.py", line 98, in prefetch_input_data
data_files, shuffle=True, capacity=16, name=shard_queue_name)
File "/Library/Python/2.7/site-packages/tensorflow/python/training/input.py", line 211, in string_input_producer
raise ValueError(not_null_err)
ValueError: string_input_producer requires a non-null input tensor
What causes this and what can I do about it? Is there a proper way to pause/cancel a training session? (TensorFlow seems to pick up where it left off if you start by training 50 steps and then set the steps to 100.)
It seems like your problem is caused by TensorFlow trying to load a session that wasn't properly saved when you interrupted your code. Your options now are either to skip loading the last session when you restart the code (by commenting out the loading line), or to delete the saved session files (then it should automatically restart from scratch). It is hard to give a more specific example, since you didn't share your code...
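On the pause/resume part of the question, here is a minimal TF 1.x sketch of picking up from whatever checkpoint the interrupted run managed to write; the path and the single variable are placeholders, and im2txt's train.py already does something equivalent internally via tf.train.Saver.
import tensorflow as tf

train_dir = "/path/to/model_dir/train"   # hypothetical checkpoint directory

# At least one variable so the Saver has something to restore or save.
global_step = tf.Variable(0, name="global_step", trainable=False)
saver = tf.train.Saver()

with tf.Session() as sess:
    ckpt = tf.train.latest_checkpoint(train_dir)
    if ckpt:
        # Resume from the last checkpoint written before the interrupt.
        saver.restore(sess, ckpt)
    else:
        # No usable checkpoint: initialize and start from scratch.
        sess.run(tf.global_variables_initializer())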