I am trying to build an object detection model with my custom dataset having only 1 class.
While following all the procedures explained in the tutorial the script crashes and log out the following error
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d}}]]
[[Loss/unstack_1/_10307]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d}}]]
0 successful operations.
0 derived errors ignored.
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d}}]]
[[Loss/unstack_1/_10307]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "model_main.py", line 109, in <module>
tf.app.run()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "model_main.py", line 105, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
return executor.run()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
return self.run_local()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
saving_listeners=saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1192, in _train_model_default
saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1484, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1252, in run
run_metadata=run_metadata)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1353, in run
raise six.reraise(*original_exc_info)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/six.py", line 703, in reraise
raise value
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1338, in run
return self._sess.run(*args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1411, in run
run_metadata=run_metadata)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1169, in run
return self._sess.run(*args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d (defined at /home/stud/hammadal/custom-model/models/research/slim/nets/inception_v2.py:129) ]]
[[Loss/unstack_1/_10307]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d (defined at /home/stud/hammadal/custom-model/models/research/slim/nets/inception_v2.py:129) ]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d':
File "model_main.py", line 109, in <module>
tf.app.run()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "model_main.py", line 105, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
return executor.run()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
return self.run_local()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
saving_listeners=saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1188, in _train_model_default
features, labels, ModeKeys.TRAIN, self.config)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1146, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/stud/hammadal/custom-model/models/research/object_detection/model_lib.py", line 308, in model_fn
features[fields.InputDataFields.true_image_shape])
File "/home/stud/hammadal/custom-model/models/research/object_detection/meta_architectures/ssd_meta_arch.py", line 600, in predict
preprocessed_inputs)
File "/home/stud/hammadal/custom-model/models/research/object_detection/models/ssd_inception_v2_feature_extractor.py", line 130, in extract_features
scope=scope)
File "/home/stud/hammadal/custom-model/models/research/slim/nets/inception_v2.py", line 129, in inception_v2_base
scope=end_point)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 2784, in separable_convolution2d
outputs = layer.apply(inputs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1479, in apply
return self.__call__(inputs, *args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 537, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 634, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 146, in wrapper
), args, kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 446, in converted_call
return _call_unconverted(f, args, kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 253, in _call_unconverted
return f(*args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/keras/layers/convolutional.py", line 1658, in call
data_format=conv_utils.convert_data_format(self.data_format, ndim=4))
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py", line 793, in separable_conv2d
name=name)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 1953, in conv2d
name=name)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1071, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "model_main.py", line 109, in <module>
tf.app.run()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "model_main.py", line 105, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
return executor.run()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
return self.run_local()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
saving_listeners=saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1192, in _train_model_default
saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1484, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1252, in run
run_metadata=run_metadata)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1353, in run
raise six.reraise(*original_exc_info)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/six.py", line 703, in reraise
raise value
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1338, in run
return self._sess.run(*args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1411, in run
run_metadata=run_metadata)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1169, in run
return self._sess.run(*args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d (defined at /home/stud/hammadal/custom-model/models/research/slim/nets/inception_v2.py:129) ]]
[[Loss/unstack_1/_10307]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d (defined at /home/stud/hammadal/custom-model/models/research/slim/nets/inception_v2.py:129) ]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d':
File "model_main.py", line 109, in <module>
tf.app.run()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "model_main.py", line 105, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
return executor.run()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
return self.run_local()
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
saving_listeners=saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1188, in _train_model_default
features, labels, ModeKeys.TRAIN, self.config)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1146, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/stud/hammadal/custom-model/models/research/object_detection/model_lib.py", line 308, in model_fn
features[fields.InputDataFields.true_image_shape])
File "/home/stud/hammadal/custom-model/models/research/object_detection/meta_architectures/ssd_meta_arch.py", line 600, in predict
preprocessed_inputs)
File "/home/stud/hammadal/custom-model/models/research/object_detection/models/ssd_inception_v2_feature_extractor.py", line 130, in extract_features
scope=scope)
File "/home/stud/hammadal/custom-model/models/research/slim/nets/inception_v2.py", line 129, in inception_v2_base
scope=end_point)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 2784, in separable_convolution2d
outputs = layer.apply(inputs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1479, in apply
return self.__call__(inputs, *args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 537, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 634, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 146, in wrapper
), args, kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 446, in converted_call
return _call_unconverted(f, args, kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 253, in _call_unconverted
return f(*args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/keras/layers/convolutional.py", line 1658, in call
data_format=conv_utils.convert_data_format(self.data_format, ndim=4))
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py", line 793, in separable_conv2d
name=name)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 1953, in conv2d
name=name)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1071, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/nfs/student/hammadal/custom-model/tf1.14/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
This is being faced while running it on to a server where I can use the power of the GPU.
When I run the script on the local machine using only CPU and batch size of 1 the script executes.
The script being used is from the tensorflow official repo HERE.
The server hardware information is as follow:
> OS: Ubuntu x86_64 memory: 503GiB
> system memory processor: Intel(R)
> Xeon(R) CPU E5-2630 v4 # 2.20GHz
> display: GV100GL [Tesla V100 PCIe 32GB]
Libraries:
> tensorflow-gpu: 1.14
> numpy: 1.16
> absl-py 0.9
I have been trying to work my way through since last 2 weeks. If someone can help or guide me what do I need to read I would highly appericiate it
It looks like cuDNN failed to initialize. Which is related more so to TensorFlow. Try using the following on the server, which should install cuDNN properly:
conda install tensorflow-gpu
I met a runtime error during training when I tried to applied tensorflow built-in CTC loss function (https://www.tensorflow.org/versions/r0.10/api_docs/python/nn/conectionist_temporal_classification__ctc_) to SynthText dataset. http://www.robots.ox.ac.uk/~vgg/data/scenetext/
It said " Not enough time for target transition sequence (required: 4, available: 0)".
Here is some info for the environment: tensorflow version '0.12.0-rc0'.
I am able to apply tensorflow built-in CTC to Synth90K Dataset with great performance.
It seems like the SynthText dataset is not compilable with tensorflow built-in CTC but Synth90K Dataset could.
Please find the error message as reference
step 980, loss = 50.17 (92.1 examples/sec; 0.695 sec/batch)
W tensorflow/core/framework/op_kernel.cc:975] Invalid argument: Not enough time for target transition sequence (required: 4, available: 0), skipping data instance in batch: 28
Traceback (most recent call last):
File "multi-gpu-train.py", line 305, in
tf.app.run()
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "multi-gpu-train.py", line 301, in main
train()
File "multi-gpu-train.py", line 270, in train
_, loss_value = sess.run([train_op, loss])
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Not enough time for target transition sequence (required: 4, available: 0), skipping data instance in batch: 28
[[Node: tower_0/CTCLoss = CTCLoss[ctc_merge_repeated=true, preprocess_collapse_repeated=false, _device="/job:localhost/replica:0/task:0/cpu:0"](tower_0/transpose_2/_555, tower_0/Where, tower_0/sub_2/_557, tower_0/Sum_1/_559)]]
Caused by op u'tower_0/CTCLoss', defined at:
File "multi-gpu-train.py", line 305, in
tf.app.run()
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "multi-gpu-train.py", line 301, in main
train()
File "multi-gpu-train.py", line 179, in train
loss,logits_op,images,labels = tower_loss(scope)
File "multi-gpu-train.py", line 79, in tower_loss
_ = network2.loss(logits,images, labels)
File "/home/ubuntu/experiments/network2_dev/network2.py", line 61, in loss
out = tf.nn.ctc_loss(logit, to_sparse(y), seq_len, time_major=False)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/ctc_ops.py", line 145, in ctc_loss
ctc_merge_repeated=ctc_merge_repeated)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_ctc_ops.py", line 164, in _ctc_loss
name=name)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/ubuntu/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in init
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): Not enough time for target transition sequence (required: 4, available: 0), skipping data instance in batch: 28
[[Node: tower_0/CTCLoss = CTCLoss[ctc_merge_repeated=true, preprocess_collapse_repeated=false, _device="/job:localhost/replica:0/task:0/cpu:0"](tower_0/transpose_2/_555, tower_0/Where, tower_0/sub_2/_557, tower_0/Sum_1/_559)]]
I was training tensorflow and then i mashed the keyboard for shits and giggles:
INFO:tensorflow:global step 101: loss = 5.1761 (52.61 sec/step)
INFO:tensorflow:global step 102: loss = 4.8679 (18.78 sec/step)
INFO:tensorflow:global step 103: loss = 4.9662 (19.02 sec/step)
INFO:tensorflow:global step 104: loss = 5.1126 (17.36 sec/step)
^C^X^C^[^[^[^[^[
exit
Traceback (most recent call last):
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 114, in <module>
tf.app.run()
File "/Library/Python/2.7/site-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 110, in main
saver=saver)
File "/Library/Python/2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 782, in train
sess, train_op, global_step, train_step_kwargs)
File "/Library/Python/2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 530, in train_step
run_metadata=run_metadata)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 1021, in _do_call
return fn(*args)
File "/Library/Python/2.7/site-packages/tensorflow/python/client/session.py", line 1003, in _run_fn
status, run_metadata)
KeyboardInterrupt
Kristoffers-MacBook-Pro:im2txt kristoffer$ logout
Saving session...
...copying shared history...
...saving history...truncating history files...
...completed.
[Process completed]
When I'm trying to start training again, I get the following error:
$ bazel-bin/im2txt/train --input_file_pattern="${MSCOCO_DIR}/train-?????-of-00256" --inception_checkpoint_file="${INCEPTION_CHECKPOINT}" --train_dir="${MODEL_DIR}/train" --train_inception=false --number_of_steps=150
CRITICAL:tensorflow:Found no input files matching /train-?????-of-00256
Traceback (most recent call last):
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 114, in <module>
tf.app.run()
File "/Library/Python/2.7/site-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 65, in main
model.build()
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/show_and_tell_model.py", line 353, in build
self.build_inputs()
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/show_and_tell_model.py", line 153, in build_inputs
num_reader_threads=self.config.num_input_reader_threads)
File "/Users/kristoffer/web/im2txt/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/ops/inputs.py", line 98, in prefetch_input_data
data_files, shuffle=True, capacity=16, name=shard_queue_name)
File "/Library/Python/2.7/site-packages/tensorflow/python/training/input.py", line 211, in string_input_producer
raise ValueError(not_null_err)
ValueError: string_input_producer requires a non-null input tensor
What causes this and what can I do about it? Is there a proper way to pause/cancel a training session? (Tensorflow seems to pick up where it left of if you startup with training 50 steps and then set the steps to 100)
It seems like your problem is caused by that Tensorflow is trying to load a session, which wasn't properly saved when you interrupted your code. Your solutions now would either be not loading the last session when you restart the code (by commenting the loading line), or deleting the saved session files (then it should automatically restart from scratch). It is hard to give a more specific example, since you didn't share your code...