InvalidArgumentError: Cannot assign a device for operation replica_0/lambda_1/Shape - tensorflow

I am testing Yolo-v3 (https://github.com/experiencor/keras-yolo3) with tensorflow-gpu 1.15 and keras 2.3.1. The training process is started by:
runfile("train.py",'-c config.json')
Here are the printed messages:
Using TensorFlow backend.
WARNING:tensorflow:From train.py:40: The name tf.keras.backend.set_session is deprecated. Please use tf.compat.v1.keras.backend.set_session instead.
valid_annot_folder not exists. Spliting the trainining set.
Seen labels: {'kangaroo': 266}
Given labels: ['kangaroo']
Training on: ['kangaroo']
WARNING:tensorflow:From C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\ops\resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
.....
Loading pretrained weights.
C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\callbacks\callbacks.py:998: UserWarning: `epsilon` argument is deprecated and will be removed, use `min_delta` instead.
warnings.warn('`epsilon` argument is deprecated and '
Traceback (most recent call last):
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call
return fn(*args)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1348, in _run_fn
self._extend_graph()
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1388, in _extend_graph
tf_session.ExtendSession(self._session)
InvalidArgumentError: Cannot assign a device for operation replica_0/lambda_1/Shape: {{node replica_0/lambda_1/Shape}} was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.
[[replica_0/lambda_1/Shape]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 305, in <module>
_main_(args)
File "train.py", line 282, in _main_
max_queue_size = 8
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\engine\training.py", line 1732, in fit_generator
initial_epoch=initial_epoch)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\engine\training_generator.py", line 42, in fit_generator
model._make_train_function()
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\engine\training.py", line 333, in _make_train_function
**self._function_kwargs)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\backend\tensorflow_backend.py", line 3006, in function
v1_variable_initialization()
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\backend\tensorflow_backend.py", line 420, in v1_variable_initialization
session = get_session()
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\backend\tensorflow_backend.py", line 385, in get_session
return tf_keras_backend.get_session()
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\keras\backend.py", line 486, in get_session
_initialize_variables(session)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\keras\backend.py", line 903, in _initialize_variables
[variables_module.is_variable_initialized(v) for v in candidate_vars])
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run
run_metadata_ptr)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1359, in _do_run
run_metadata)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
InvalidArgumentError: Cannot assign a device for operation replica_0/lambda_1/Shape: node replica_0/lambda_1/Shape (defined at C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.
[[replica_0/lambda_1/Shape]]
I don't understand what caused the InvalidArgumentError. Is my tensorflow-gpu not installed correctly? Or is there some conflict in deploying the GPU?

Try changing the "gpus" value in config.json to "0" if it is anything else. It should work if you are executing on a GPU.
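Since the error says the only available device is /device:CPU:0, it is also worth confirming that TensorFlow can see the GPU at all. A minimal check (TF 1.x; my addition, not part of the original answer):
from tensorflow.python.client import device_lib

# List every device this TensorFlow build can see. If no /device:GPU:0
# entry appears, the CUDA/cuDNN installation is the problem rather than
# the training configuration.
print([d.name for d in device_lib.list_local_devices()])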

Related

Training the XSeg model for DeepFaceLab fails due to memory error

I'm new to deepfakes and I'm trying to run the 5.XSeg) train.bat step, and every time it finishes the filtering I get the following error. I use WF (whole face) and tried batch sizes from 1 to 8, always with the same result. I have a Ryzen 5 3600, a 3080 Ti, and 16 GB of RAM.
Using 26519 xseg labeled samples.
Traceback (most recent call last):
File "multiprocessing\queues.py", line 234, in _feed
File "multiprocessing\reduction.py", line 51, in dumps
MemoryError
Error:
Traceback (most recent call last):
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1375, in _do_call
return fn(*args)
Traceback (most recent call last):
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1360, in _run_fn
target_list, run_metadata)
File "multiprocessing\queues.py", line 234, in _feed
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1453, in _call_tf_sessionrun
run_metadata)
File "multiprocessing\reduction.py", line 51, in dumps
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
[[{{node MatMul}}]]
[[concat_6/concat/_3]]
(1) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
[[{{node MatMul}}]]
0 successful operations.
0 derived errors ignored.
MemoryError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\ModelBase.py", line 263, in update_sample_for_preview
self.get_history_previews()
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\ModelBase.py", line 383, in get_history_previews
return self.onGetPreview (self.sample_for_preview, for_history=True)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\Model_XSeg\Model.py", line 209, in onGetPreview
I, M, IM, = [ np.clip( nn.to_data_format(x,"NHWC", self.model_data_format), 0.0, 1.0) for x in ([image_np,mask_np] + self.view (image_np) ) ]
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\Model_XSeg\Model.py", line 141, in view
return nn.tf_sess.run ( [pred], feed_dict={self.model.input_t :input_np})
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 968, in run
run_metadata_ptr)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1191, in _run
feed_dict_tensor, options, run_metadata)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1369, in _do_run
run_metadata)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1394, in _do_call
raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
[[node MatMul (defined at E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\layers\Dense.py:66) ]]
[[concat_6/concat/_3]]
(1) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
[[node MatMul (defined at E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\layers\Dense.py:66) ]]
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node MatMul:
XSeg/dense1/weight/read (defined at E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\layers\Dense.py:47)
Reshape_60 (defined at E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\ops\__init__.py:182)
Input Source operations connected to node MatMul:
XSeg/dense1/weight/read (defined at E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\layers\Dense.py:47)
Reshape_60 (defined at E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\ops\__init__.py:182)
Original stack trace for 'MatMul':
File "threading.py", line 884, in _bootstrap
File "threading.py", line 916, in _bootstrap_inner
File "threading.py", line 864, in run
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\mainscripts\Trainer.py", line 58, in trainerThread
debug=debug)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\Model_XSeg\Model.py", line 17, in __init__
super().__init__(*args, force_model_class_name='XSeg', **kwargs)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\ModelBase.py", line 193, in __init__
self.on_initialize()
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\Model_XSeg\Model.py", line 103, in on_initialize
gpu_pred_logits_t, gpu_pred_t = self.model.flow(gpu_input_t, pretrain=self.pretrain)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\facelib\XSegNet.py", line 85, in flow
return self.model(x, pretrain=pretrain)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\models\ModelBase.py", line 117, in __call__
return self.forward(*args, **kwargs)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\models\XSeg.py", line 124, in forward
x = self.dense1(x)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\layers\LayerBase.py", line 14, in __call__
return self.forward(*args, **kwargs)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\layers\Dense.py", line 66, in forward
x = tf.matmul(x, weight)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\util\dispatch.py", line 206, in wrapper
return target(*args, **kwargs)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\ops\math_ops.py", line 3655, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 5713, in mat_mul
name=name)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 750, in _apply_op_helper
attrs=attr_protos, op_def=op_def)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\ops.py", line 3569, in _create_op_internal
op_def=op_def)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\ops.py", line 2045, in __init__
self._traceback = tf_stack.extract_stack_for_node(self._c_op)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\mainscripts\Trainer.py", line 58, in trainerThread
debug=debug)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\Model_XSeg\Model.py", line 17, in __init__
super().__init__(*args, force_model_class_name='XSeg', **kwargs)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\ModelBase.py", line 216, in __init__
self.update_sample_for_preview(choose_preview_history=self.choose_preview_history)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\ModelBase.py", line 265, in update_sample_for_preview
self.sample_for_preview = self.generate_next_samples()
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\ModelBase.py", line 461, in generate_next_samples
sample.append ( generator.generate_next() )
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\samplelib\SampleGeneratorBase.py", line 21, in generate_next
self.last_generation = next(self)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\samplelib\SampleGeneratorFace.py", line 112, in __next__
return next(generator)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\joblib\SubprocessGenerator.py", line 73, in __next__
gen_data = self.cs_queue.get()
File "multiprocessing\queues.py", line 94, in get
File "multiprocessing\connection.py", line 216, in recv_bytes
File "multiprocessing\connection.py", line 318, in _recv_bytes
File "multiprocessing\connection.py", line 344, in _get_more_data
MemoryError
Neither reducing the batch size nor increasing the page file helped. I tried to Google it but I couldn't find a solution.
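For context (my addition, not from the thread): the "BLAS operation using StreamExecutor without BLAS support" error usually means cuBLAS failed to initialize because GPU memory was already exhausted. With a plain TF 1.x session the standard mitigation is to allocate GPU memory on demand; DeepFaceLab bundles its own TensorFlow, so treat this only as a sketch of the general fix:
import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all at session
# creation, which often clears "without BLAS support" failures caused
# by memory exhaustion (TF 1.x API).
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)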

ValueError: Operation u'tpu_140462710602256/VarIsInitializedOp' has been marked as not fetchable

The code works fine on GPU and CPU, but when I use the keras_to_tpu_model function to make the model able to run on a TPU, this error occurs.
This is the full output on Colab: https://colab.research.google.com/gist/WangHexie/2252beb26f16354cb6e9ba2639970e5b/tpu-error.ipynb
Change the runtime type to TPU and I think this can be reproduced.
Code on GitHub: https://github.com/WangHexie/DHNE/blob/master/src/hypergraph_embedding.py#L60
You can test the code on a GPU by switching to the gpu branch.
Traceback
Traceback (most recent call last):
File "src/hypergraph_embedding.py", line 158, in <module>
h.train(dataset)
File "src/hypergraph_embedding.py", line 75, in train
epochs=self.options.epochs_to_train, verbose=1)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training.py", line 2177, in fit_generator
initial_epoch=initial_epoch)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training_generator.py", line 176, in fit_generator
x, y, sample_weight=sample_weight, class_weight=class_weight)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training.py", line 1940, in train_on_batch
outputs = self.train_function(ins)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/keras_support.py", line 1238, in __call__
infeed_manager)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/keras_support.py", line 1143, in _tpu_model_ops_for_input_specs
infeed_manager)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/keras_support.py", line 1053, in _specialize_model
_model_fn, inputs=[[]] * self._tpu_assignment.num_towers)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu.py", line 687, in split_compile_and_replicate
outputs = computation(*computation_inputs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/keras_support.py", line 959, in _model_fn
self.model.cpu_optimizer)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/keras_support.py", line 378, in _clone_optimizer
config = optimizer.get_config()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/optimizers.py", line 275, in get_config
'lr': float(K.get_value(self.lr)),
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend.py", line 2709, in get_value
return x.eval(session=get_session())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend.py", line 469, in get_session
_initialize_variables(session)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend.py", line 731, in _initialize_variables
[variables_module.is_variable_initialized(v) for v in candidate_vars])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
self._graph, fetches, feed_dict_tensor, feed_handles=feed_handles)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 484, in __init__
self._assert_fetchable(graph, fetch.op)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 497, in _assert_fetchable
'Operation %r has been marked as not fetchable.' % op.name)
ValueError: Operation u'tpu_140276544043536/VarIsInitializedOp' has been marked as not fetchable.
I had the same issue, and it confused me for two days. The solution I found is to simply switch to tf.train.RMSPropOptimizer instead of using RMSProp from tensorflow.keras.optimizers.
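A minimal sketch of that swap (TF 1.x-era APIs; the toy model and hyperparameters are placeholders, not the poster's code):
import tensorflow as tf

# Toy model compiled with the tf.train optimizer instead of
# tensorflow.keras.optimizers.RMSprop, which is the swap described above.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer=tf.train.RMSPropOptimizer(learning_rate=0.001),
              loss='mse')
# The TPU conversion then proceeds as before, e.g. with
# tf.contrib.tpu.keras_to_tpu_model(model, strategy=...).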

Protobuf errors while using Tensorflow Object Detection API locally

I have TensorFlow and the Object Detection API on my machine.
A test run shows that everything works:
~ $ cd models/research
research $ protoc object_detection/protos/*.proto --python_out=.
research $ export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
research $ python3 object_detection/builders/model_builder_test.py
...............
----------------------------------------------------------------------
Ran 15 tests in 0.144s
OK
Then I tried to retrain a model and got a protobuf error:
research $ cd object_detection
object_detection $ python3 train.py --logtostderr --train_dir=training/ --pipeline_config_path=ssdlite_mobilenet_v2_coco_2018_05_09/pipeline.config
WARNING:tensorflow:From /Users/me/models/research/object_detection/trainer.py:257: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
Traceback (most recent call last):
File "/Users/me/models/research/object_detection/utils/label_map_util.py", line 135, in load_labelmap
text_format.Merge(label_map_string, label_map)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/google/protobuf/text_format.py", line 533, in Merge
descriptor_pool=descriptor_pool)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/google/protobuf/text_format.py", line 587, in MergeLines
return parser.MergeLines(lines, message)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/google/protobuf/text_format.py", line 620, in MergeLines
self._ParseOrMerge(lines, message)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/google/protobuf/text_format.py", line 635, in _ParseOrMerge
self._MergeField(tokenizer, message)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/google/protobuf/text_format.py", line 735, in _MergeField
merger(tokenizer, message, field)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/google/protobuf/text_format.py", line 823, in _MergeMessageField
self._MergeField(tokenizer, sub_message)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/google/protobuf/text_format.py", line 722, in _MergeField
tokenizer.Consume(':')
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/google/protobuf/text_format.py", line 1087, in Consume
raise self.ParseError('Expected "%s".' % token)
google.protobuf.text_format.ParseError: 3:10 : Expected ":".
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/google/protobuf/internal/python_message.py", line 1083, in MergeFromString
if self._InternalParse(serialized, 0, length) != length:
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/google/protobuf/internal/python_message.py", line 1105, in InternalParse
(tag_bytes, new_pos) = local_ReadTag(buffer, pos)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/google/protobuf/internal/decoder.py", line 181, in ReadTag
while six.indexbytes(buffer, pos) & 0x80:
TypeError: unsupported operand type(s) for &: 'str' and 'int'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 184, in <module>
tf.app.run()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "train.py", line 180, in main
graph_hook_fn=graph_rewriter_fn)
File "/Users/me/models/research/object_detection/trainer.py", line 264, in train
train_config.prefetch_queue_capacity, data_augmentation_options)
File "/Users/me/models/research/object_detection/trainer.py", line 59, in create_input_queue
tensor_dict = create_tensor_dict_fn()
File "train.py", line 121, in get_next
dataset_builder.build(config)).get_next()
File "/Users/me/models/research/object_detection/builders/dataset_builder.py", line 155, in build
label_map_proto_file=label_map_proto_file)
File "/Users/me/models/research/object_detection/data_decoders/tf_example_decoder.py", line 245, in __init__
use_display_name)
File "/Users/me/models/research/object_detection/utils/label_map_util.py", line 152, in get_label_map_dict
label_map = load_labelmap(label_map_path)
File "/Users/me/models/research/object_detection/utils/label_map_util.py", line 137, in load_labelmap
label_map.ParseFromString(label_map_string)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/google/protobuf/message.py", line 185, in ParseFromString
self.MergeFromString(serialized)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/google/protobuf/internal/python_message.py", line 1089, in MergeFromString
raise message_mod.DecodeError('Truncated message.')
google.protobuf.message.DecodeError: Truncated message.
object_detection $
I tried the solutions to a bunch of similar problems, but they didn't work in my case. For example, this one suggests encoding the pbtxt file as ASCII.
Python 2 gives an error too; here is its last line:
google.protobuf.message.DecodeError: Unexpected end-group tag.
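One way to narrow this down (my suggestion, not from the question; the path is a placeholder) is to parse the label map directly with the same helper the traceback goes through, so the failure reproduces outside of train.py:
from object_detection.utils import label_map_util

# A malformed or binary label map file will raise the same ParseError
# or DecodeError here as it does inside train.py.
label_map = label_map_util.load_labelmap('training/label_map.pbtxt')
print(label_map)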
Context:
macOS 10.13.4
Local run on CPU
Python 3.6.4
protobuf 3.5.1
libprotoc 3.4.0
tensorflow 1.8.0
Google Cloud SDK 200.0.0
bq 2.0.33
core 2018.04.30
gsutil 4.31

CudnnLSTM runs out of space with Eager Execution

I'm using 3 tf.contrib.cudnn_rnn.CudnnLSTM(1, 128, direction='bidirectional') layers with a batch size of 32 on an AWS p2.xlarge instance. The exact same configuration works correctly with non-eager (standard) TensorFlow. The error log follows:
2018-04-27 18:15:59.139739: E tensorflow/stream_executor/cuda/cuda_dnn.cc:1520] Failed to allocate RNN workspace of 74252288 bytes.
2018-04-27 18:15:59.139758: E tensorflow/stream_executor/cuda/cuda_dnn.cc:1697] Unable to create rnn workspace
Traceback (most recent call last):
File "tf_run_eager.py", line 424, in <module>
run_experiments()
File "tf_run_eager.py", line 417, in run_experiments
train_losses.append(model.optimize(bX, bY).numpy())
File "tf_run_eager.py", line 397, in optimize
loss, grads_and_vars = self.loss(phoneme_features, utterances)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/eager/backprop.py", line 233, in grad_fn
sources)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/eager/imperative_grad.py", line 65, in imperative_grad
tape._tape, vspace, target, sources, output_gradients, status) # pylint: disable=protected-access
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/eager/backprop.py", line 141, in grad_fn
op_inputs, op_outputs, orig_outputs)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/eager/backprop.py", line 109, in _magic_gradient_function
return grad_fn(mock_op, *out_grads)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1609, in _cudnn_rnn_backward
direction=op.get_attr("direction"))
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/ops/gen_cudnn_rnn_ops.py", line 320, in cudnn_rnn_backprop
_six.raise_from(_core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: Failed to call ThenRnnBackward [Op:CudnnRNNBackprop]
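Since the log shows the cuDNN RNN workspace allocation failing, one thing worth trying (my suggestion, TF 1.x API) is enabling GPU memory growth when eager execution is turned on, so the workspace has headroom left:
import tensorflow as tf

# Pass a session config into eager mode so GPU memory is allocated on
# demand rather than all up front (TF 1.x API).
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
tf.enable_eager_execution(config=config)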

Error message while running model_test.py for TensorFlow DeepLab

I have been trying to test the installation of DeepLab by following this:
# From tensorflow/models/research/
python deeplab/model_test.py
However, I got the following error message. The complete traceback is as follows:
2018-04-25 10:54:23.488868: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at mkl_concat_op.cc:784 : Aborted: Operation received an exception:Status: 3, message: could not create a concat primitive descriptor, in file tensorflow/core/kernels/mkl_concat_op.cc:781
E...
======================================================================
ERROR: testForwardpassDeepLabv3plus (__main__.DeeplabModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
status, run_metadata)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.AbortedError: Operation received an exception:Status: 3, message: could not create a concat primitive descriptor, in file tensorflow/core/kernels/mkl_concat_op.cc:781
[[Node: concat = _MklConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _kernel="MklOp", _device="/job:localhost/replica:0/task:0/device:CPU:0"](ResizeBilinear, aspp0/Relu, concat/axis, DMT/_283, aspp0/Relu:1, DMT/_284)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "deeplab/model_test.py", line 108, in testForwardpassDeepLabv3plus
outputs_to_scales_to_logits = sess.run(outputs_to_scales_to_logits)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1140, in _run
feed_dict_tensor, options, run_metadata)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
run_metadata)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.AbortedError: Operation received an exception:Status: 3, message: could not create a concat primitive descriptor, in file tensorflow/core/kernels/mkl_concat_op.cc:781
[[Node: concat = _MklConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _kernel="MklOp", _device="/job:localhost/replica:0/task:0/device:CPU:0"](ResizeBilinear, aspp0/Relu, concat/axis, DMT/_283, aspp0/Relu:1, DMT/_284)]]
Caused by op 'concat', defined at:
File "deeplab/model_test.py", line 120, in <module>
tf.test.main()
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/platform/test.py", line 76, in main
return _googletest.main(argv)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/platform/googletest.py", line 99, in main
benchmark.benchmarks_main(true_main=main_wrapper)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/platform/benchmark.py", line 338, in benchmarks_main
true_main()
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/platform/googletest.py", line 98, in main_wrapper
return app.run(main=g_main, argv=args)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/platform/googletest.py", line 69, in g_main
return unittest_main(argv=argv)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/unittest/main.py", line 95, in __init__
self.runTests()
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/unittest/main.py", line 256, in runTests
self.result = testRunner.run(self.test)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/unittest/runner.py", line 176, in run
test(result)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/unittest/suite.py", line 122, in run
test(result)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/unittest/suite.py", line 122, in run
test(result)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/unittest/case.py", line 653, in __call__
return self.run(*args, **kwds)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/unittest/case.py", line 605, in run
testMethod()
File "deeplab/model_test.py", line 105, in testForwardpassDeepLabv3plus
image_pyramid=[1.0])
File "/data/dsp_emerging/ugwz/virtualE/deeplab/models/research/deeplab/model.py", line 296, in multi_scale_logits
fine_tune_batch_norm=fine_tune_batch_norm)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/models/research/deeplab/model.py", line 461, in _get_logits
fine_tune_batch_norm=fine_tune_batch_norm)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/models/research/deeplab/model.py", line 424, in _extract_features
concat_logits = tf.concat(branch_logits, 3)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1181, in concat
return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 949, in concat_v2
"ConcatV2", values=values, axis=axis, name=name)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
op_def=op_def)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
AbortedError (see above for traceback): Operation received an exception:Status: 3, message: could not create a concat primitive descriptor, in file tensorflow/core/kernels/mkl_concat_op.cc:781
[[Node: concat = _MklConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _kernel="MklOp", _device="/job:localhost/replica:0/task:0/device:CPU:0"](ResizeBilinear, aspp0/Relu, concat/axis, DMT/_283, aspp0/Relu:1, DMT/_284)]]
----------------------------------------------------------------------
Ran 5 tests in 23.571s
FAILED (errors=1)
Roll back to TensorFlow 1.6. This issue is still being addressed in version 1.7 and above: https://github.com/tensorflow/tensorflow/issues/17494
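For example, in a pip-managed environment, that would be:
pip3 install tensorflow==1.6.0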
In Google Colab, with runtime type Python 2 or Python 3 and a GPU, I ran without any error using these commands:
!git clone https://github.com/tensorflow/models.git
%env PYTHONPATH=/env/python/:/content/models/research/:/content/models/research/slim
!python /content/models/research/deeplab/model_test.py