Training the XSeg model for DeepFaceLab fails due to memory error

I'm new to deepfakes and I'm trying to run 5XSeg) train.bat. Every time it finishes the filtering, I get the following error. I use wf and have tried batch sizes from 1 to 8, always with the same result. I have a Ryzen 5 3600, a 3080 Ti, and 16 GB of RAM.
Using 26519 xseg labeled samples.
Traceback (most recent call last):
File "multiprocessing\queues.py", line 234, in _feed
File "multiprocessing\reduction.py", line 51, in dumps
MemoryError
Error:
Traceback (most recent call last):
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1375, in _do_call
return fn(*args)
Traceback (most recent call last):
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1360, in _run_fn
target_list, run_metadata)
File "multiprocessing\queues.py", line 234, in _feed
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1453, in _call_tf_sessionrun
run_metadata)
File "multiprocessing\reduction.py", line 51, in dumps
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
[[{{node MatMul}}]]
[[concat_6/concat/_3]]
(1) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
[[{{node MatMul}}]]
0 successful operations.
0 derived errors ignored.
MemoryError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\ModelBase.py", line 263, in update_sample_for_preview
self.get_history_previews()
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\ModelBase.py", line 383, in get_history_previews
return self.onGetPreview (self.sample_for_preview, for_history=True)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\Model_XSeg\Model.py", line 209, in onGetPreview
I, M, IM, = [ np.clip( nn.to_data_format(x,"NHWC", self.model_data_format), 0.0, 1.0) for x in ([image_np,mask_np] + self.view (image_np) ) ]
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\Model_XSeg\Model.py", line 141, in view
return nn.tf_sess.run ( [pred], feed_dict={self.model.input_t :input_np})
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 968, in run
run_metadata_ptr)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1191, in _run
feed_dict_tensor, options, run_metadata)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1369, in _do_run
run_metadata)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1394, in _do_call
raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
[[node MatMul (defined at E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\layers\Dense.py:66) ]]
[[concat_6/concat/_3]]
(1) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
[[node MatMul (defined at E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\layers\Dense.py:66) ]]
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node MatMul:
XSeg/dense1/weight/read (defined at E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\layers\Dense.py:47)
Reshape_60 (defined at E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\ops\__init__.py:182)
Input Source operations connected to node MatMul:
XSeg/dense1/weight/read (defined at E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\layers\Dense.py:47)
Reshape_60 (defined at E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\ops\__init__.py:182)
Original stack trace for 'MatMul':
File "threading.py", line 884, in _bootstrap
File "threading.py", line 916, in _bootstrap_inner
File "threading.py", line 864, in run
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\mainscripts\Trainer.py", line 58, in trainerThread
debug=debug)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\Model_XSeg\Model.py", line 17, in __init__
super().__init__(*args, force_model_class_name='XSeg', **kwargs)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\ModelBase.py", line 193, in __init__
self.on_initialize()
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\Model_XSeg\Model.py", line 103, in on_initialize
gpu_pred_logits_t, gpu_pred_t = self.model.flow(gpu_input_t, pretrain=self.pretrain)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\facelib\XSegNet.py", line 85, in flow
return self.model(x, pretrain=pretrain)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\models\ModelBase.py", line 117, in __call__
return self.forward(*args, **kwargs)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\models\XSeg.py", line 124, in forward
x = self.dense1(x)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\layers\LayerBase.py", line 14, in __call__
return self.forward(*args, **kwargs)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\layers\Dense.py", line 66, in forward
x = tf.matmul(x, weight)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\util\dispatch.py", line 206, in wrapper
return target(*args, **kwargs)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\ops\math_ops.py", line 3655, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 5713, in mat_mul
name=name)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 750, in _apply_op_helper
attrs=attr_protos, op_def=op_def)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\ops.py", line 3569, in _create_op_internal
op_def=op_def)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\ops.py", line 2045, in __init__
self._traceback = tf_stack.extract_stack_for_node(self._c_op)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\mainscripts\Trainer.py", line 58, in trainerThread
debug=debug)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\Model_XSeg\Model.py", line 17, in __init__
super().__init__(*args, force_model_class_name='XSeg', **kwargs)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\ModelBase.py", line 216, in __init__
self.update_sample_for_preview(choose_preview_history=self.choose_preview_history)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\ModelBase.py", line 265, in update_sample_for_preview
self.sample_for_preview = self.generate_next_samples()
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\ModelBase.py", line 461, in generate_next_samples
sample.append ( generator.generate_next() )
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\samplelib\SampleGeneratorBase.py", line 21, in generate_next
self.last_generation = next(self)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\samplelib\SampleGeneratorFace.py", line 112, in __next__
return next(generator)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\joblib\SubprocessGenerator.py", line 73, in __next__
gen_data = self.cs_queue.get()
File "multiprocessing\queues.py", line 94, in get
File "multiprocessing\connection.py", line 216, in recv_bytes
File "multiprocessing\connection.py", line 318, in _recv_bytes
File "multiprocessing\connection.py", line 344, in _get_more_data
MemoryError
Neither reducing the batch size nor increasing the page file helped. I tried to Google it but couldn't find a solution.

Related

InvalidArgumentError: Cannot assign a device for operation replica_0/lambda_1/Shape

I am testing Yolo-v3 (https://github.com/experiencor/keras-yolo3) with tensorflow-gpu 1.15 and Keras 2.3.1. The training process is started by:
runfile("train.py",'-c config.json')
Here are the printed out messages:
Using TensorFlow backend.
WARNING:tensorflow:From train.py:40: The name tf.keras.backend.set_session is deprecated. Please use tf.compat.v1.keras.backend.set_session instead.
valid_annot_folder not exists. Spliting the trainining set.
Seen labels: {'kangaroo': 266}
Given labels: ['kangaroo']
Training on: ['kangaroo']
WARNING:tensorflow:From C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\ops\resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
.....
Loading pretrained weights.
C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\callbacks\callbacks.py:998: UserWarning: `epsilon` argument is deprecated and will be removed, use `min_delta` instead.
warnings.warn('`epsilon` argument is deprecated and '
Traceback (most recent call last):
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call
return fn(*args)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1348, in _run_fn
self._extend_graph()
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1388, in _extend_graph
tf_session.ExtendSession(self._session)
InvalidArgumentError: Cannot assign a device for operation replica_0/lambda_1/Shape: {{node replica_0/lambda_1/Shape}} was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.
[[replica_0/lambda_1/Shape]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 305, in <module>
_main_(args)
File "train.py", line 282, in _main_
max_queue_size = 8
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\engine\training.py", line 1732, in fit_generator
initial_epoch=initial_epoch)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\engine\training_generator.py", line 42, in fit_generator
model._make_train_function()
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\engine\training.py", line 333, in _make_train_function
**self._function_kwargs)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\backend\tensorflow_backend.py", line 3006, in function
v1_variable_initialization()
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\backend\tensorflow_backend.py", line 420, in v1_variable_initialization
session = get_session()
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\backend\tensorflow_backend.py", line 385, in get_session
return tf_keras_backend.get_session()
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\keras\backend.py", line 486, in get_session
_initialize_variables(session)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\keras\backend.py", line 903, in _initialize_variables
[variables_module.is_variable_initialized(v) for v in candidate_vars])
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run
run_metadata_ptr)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1359, in _do_run
run_metadata)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
InvalidArgumentError: Cannot assign a device for operation replica_0/lambda_1/Shape: node replica_0/lambda_1/Shape (defined at C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.
[[replica_0/lambda_1/Shape]]
I don't understand what caused the InvalidArgumentError. Is my tensorflow-gpu not installed correctly? Or is there some conflict in how the GPU is being used?
Try changing the "gpus" value to "0" if it is set to anything else. It should work if you are running on a GPU.
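The suggestion above can be sketched as a small script. This is only an illustration: it assumes the "gpus" value lives under a "train" section of config.json, as in the keras-yolo3 sample config, and set_single_gpu is a hypothetical helper name, not part of that repository.

```python
import json


def set_single_gpu(config_path, gpu_id="0"):
    """Set train.gpus in a keras-yolo3 style config.json; return the old value.

    Assumption: the config keeps the GPU list as a string under
    config["train"]["gpus"], for example "0,1".
    """
    with open(config_path) as f:
        config = json.load(f)

    train = config.setdefault("train", {})
    old_value = train.get("gpus")
    train["gpus"] = gpu_id  # a single device id, stored as a string

    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
    return old_value


if __name__ == "__main__":
    # Hypothetical usage: run next to train.py before starting training.
    print("previous gpus value:", set_single_gpu("config.json"))
```

Note that the error message lists only a CPU device as available, so it is also worth confirming that TensorFlow can see the GPU at all.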

Invalid argument: TypeError: an integer is required (got type NoneType) when closing Tensorflow

I'm implementing the IMPALA framework (https://github.com/deepmind/scalable_agent), which uses TensorFlow's multiprocessing. Everything seems to work fine, as the experiment ran for the specified number of steps. However, I start to get a type error when TensorFlow is closing processes.
Does someone know what could have contributed to this error and how to solve it? This seems to be an issue with TensorFlow rather than with the original code itself.
INFO:tensorflow:Closing all processes.
[750. 450. 25.125]
2020-01-13 17:01:00.624889: W tensorflow/core/framework/op_kernel.cc:1389] Invalid argument: TypeError: an integer is required (got type NoneType)
Traceback (most recent call last):
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 207, in __call__
ret = func(*args)
File "/home/haianh/anaconda3/envs/lab/scalable_agent/py_process.py", line 86, in py_call
result = self._out.recv()
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/multiprocessing/connection.py", line 411, in _recv_bytes
return self._recv(size)
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
TypeError: an integer is required (got type NoneType)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, TypeError: an integer is required (got type NoneType)
Traceback (most recent call last):
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 207, in __call__
ret = func(*args)
File "/home/haianh/anaconda3/envs/lab/scalable_agent/py_process.py", line 86, in py_call
result = self._out.recv()
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/multiprocessing/connection.py", line 411, in _recv_bytes
return self._recv(size)
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
TypeError: an integer is required (got type NoneType)
[[{{node scan/while/flow_environment_step/step}}]]
INFO:tensorflow:All processes closed.
Traceback (most recent call last):
File "experiment.py", line 689, in <module>
tf.app.run()
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "experiment.py", line 683, in main
train(action_set, level_names)
File "experiment.py", line 630, in train
session.run(enqueue_ops)
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 788, in __exit__
self._close_internal(exception_type)
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 826, in _close_internal
self._sess.close()
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1082, in close
self._sess.close()
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1244, in close
ignore_live_threads=True)
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/site-packages/six.py", line 696, in reraise
raise value
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/site-packages/tensorflow/python/training/queue_runner_impl.py", line 257, in _run
enqueue_callable()
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1257, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: TypeError: an integer is required (got type NoneType)
Traceback (most recent call last):
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 207, in __call__
ret = func(*args)
File "/home/haianh/anaconda3/envs/lab/scalable_agent/py_process.py", line 86, in py_call
result = self._out.recv()
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/multiprocessing/connection.py", line 411, in _recv_bytes
return self._recv(size)
File "/home/haianh/anaconda3/envs/lab/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
TypeError: an integer is required (got type NoneType)
[[{{node scan/while/flow_environment_step/step}}]]

ValueError: Operation u'tpu_140462710602256/VarIsInitializedOp' has been marked as not fetchable

The code works fine on GPU and CPU. But when I use the keras_to_tpu_model function to make the model able to run on TPU, the error occurs.
This is the full output on Colab: https://colab.research.google.com/gist/WangHexie/2252beb26f16354cb6e9ba2639970e5b/tpu-error.ipynb
Change the runtime type to TPU; I think this can be reproduced.
Code on GitHub: https://github.com/WangHexie/DHNE/blob/master/src/hypergraph_embedding.py#L60
You can test the code on GPU by switching to the gpu branch.
Traceback
Traceback (most recent call last):
File "src/hypergraph_embedding.py", line 158, in <module>
h.train(dataset)
File "src/hypergraph_embedding.py", line 75, in train
epochs=self.options.epochs_to_train, verbose=1)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training.py", line 2177, in fit_generator
initial_epoch=initial_epoch)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training_generator.py", line 176, in fit_generator
x, y, sample_weight=sample_weight, class_weight=class_weight)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training.py", line 1940, in train_on_batch
outputs = self.train_function(ins)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/keras_support.py", line 1238, in __call__
infeed_manager)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/keras_support.py", line 1143, in _tpu_model_ops_for_input_specs
infeed_manager)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/keras_support.py", line 1053, in _specialize_model
_model_fn, inputs=[[]] * self._tpu_assignment.num_towers)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu.py", line 687, in split_compile_and_replicate
outputs = computation(*computation_inputs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/keras_support.py", line 959, in _model_fn
self.model.cpu_optimizer)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/keras_support.py", line 378, in _clone_optimizer
config = optimizer.get_config()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/optimizers.py", line 275, in get_config
'lr': float(K.get_value(self.lr)),
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend.py", line 2709, in get_value
return x.eval(session=get_session())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend.py", line 469, in get_session
_initialize_variables(session)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend.py", line 731, in _initialize_variables
[variables_module.is_variable_initialized(v) for v in candidate_vars])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
self._graph, fetches, feed_dict_tensor, feed_handles=feed_handles)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 484, in __init__
self._assert_fetchable(graph, fetch.op)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 497, in _assert_fetchable
'Operation %r has been marked as not fetchable.' % op.name)
ValueError: Operation u'tpu_140276544043536/VarIsInitializedOp' has been marked as not fetchable.
I had the same issue, which confused me for two days. The solution I found is to switch to tf.train.RMSPropOptimizer instead of using RMSProp from tensorflow.keras.optimizers.

Error message while running model_test.py for TensorFlow DeepLab

I have been trying to test my DeepLab installation by running the following:
# From tensorflow/models/research/
python deeplab/model_test.py
However, I got the following error message; specifically:
2018-04-25 10:54:23.488868: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at mkl_concat_op.cc:784 : Aborted: Operation received an exception:Status: 3, message: could not create a concat primitive descriptor, in file tensorflow/core/kernels/mkl_concat_op.cc:781
E...
======================================================================
ERROR: testForwardpassDeepLabv3plus (__main__.DeeplabModelTest)
----------------------------------------------------------------------
The complete traceback is as follows:
2018-04-25 10:54:23.488868: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at mkl_concat_op.cc:784 : Aborted: Operation received an exception:Status: 3, message: could not create a concat primitive descriptor, in file tensorflow/core/kernels/mkl_concat_op.cc:781
E...
======================================================================
ERROR: testForwardpassDeepLabv3plus (__main__.DeeplabModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
status, run_metadata)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.AbortedError: Operation received an exception:Status: 3, message: could not create a concat primitive descriptor, in file tensorflow/core/kernels/mkl_concat_op.cc:781
[[Node: concat = _MklConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _kernel="MklOp", _device="/job:localhost/replica:0/task:0/device:CPU:0"](ResizeBilinear, aspp0/Relu, concat/axis, DMT/_283, aspp0/Relu:1, DMT/_284)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "deeplab/model_test.py", line 108, in testForwardpassDeepLabv3plus
outputs_to_scales_to_logits = sess.run(outputs_to_scales_to_logits)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1140, in _run
feed_dict_tensor, options, run_metadata)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
run_metadata)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.AbortedError: Operation received an exception:Status: 3, message: could not create a concat primitive descriptor, in file tensorflow/core/kernels/mkl_concat_op.cc:781
[[Node: concat = _MklConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _kernel="MklOp", _device="/job:localhost/replica:0/task:0/device:CPU:0"](ResizeBilinear, aspp0/Relu, concat/axis, DMT/_283, aspp0/Relu:1, DMT/_284)]]
Caused by op 'concat', defined at:
File "deeplab/model_test.py", line 120, in <module>
tf.test.main()
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/platform/test.py", line 76, in main
return _googletest.main(argv)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/platform/googletest.py", line 99, in main
benchmark.benchmarks_main(true_main=main_wrapper)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/platform/benchmark.py", line 338, in benchmarks_main
true_main()
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/platform/googletest.py", line 98, in main_wrapper
return app.run(main=g_main, argv=args)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/platform/googletest.py", line 69, in g_main
return unittest_main(argv=argv)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/unittest/main.py", line 95, in __init__
self.runTests()
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/unittest/main.py", line 256, in runTests
self.result = testRunner.run(self.test)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/unittest/runner.py", line 176, in run
test(result)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/unittest/suite.py", line 122, in run
test(result)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/unittest/suite.py", line 122, in run
test(result)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/unittest/case.py", line 653, in __call__
return self.run(*args, **kwds)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/unittest/case.py", line 605, in run
testMethod()
File "deeplab/model_test.py", line 105, in testForwardpassDeepLabv3plus
image_pyramid=[1.0])
File "/data/dsp_emerging/ugwz/virtualE/deeplab/models/research/deeplab/model.py", line 296, in multi_scale_logits
fine_tune_batch_norm=fine_tune_batch_norm)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/models/research/deeplab/model.py", line 461, in _get_logits
fine_tune_batch_norm=fine_tune_batch_norm)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/models/research/deeplab/model.py", line 424, in _extract_features
concat_logits = tf.concat(branch_logits, 3)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1181, in concat
return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 949, in concat_v2
"ConcatV2", values=values, axis=axis, name=name)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
op_def=op_def)
File "/data/dsp_emerging/ugwz/virtualE/deeplab/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
AbortedError (see above for traceback): Operation received an exception:Status: 3, message: could not create a concat primitive descriptor, in file tensorflow/core/kernels/mkl_concat_op.cc:781
[[Node: concat = _MklConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _kernel="MklOp", _device="/job:localhost/replica:0/task:0/device:CPU:0"](ResizeBilinear, aspp0/Relu, concat/axis, DMT/_283, aspp0/Relu:1, DMT/_284)]]
----------------------------------------------------------------------
Ran 5 tests in 23.571s
FAILED (errors=1)
Roll back to TensorFlow 1.6.
This issue is still being addressed in versions 1.7 and above:
https://github.com/tensorflow/tensorflow/issues/17494
In Google Colab, with the runtime type set to Python 2 or Python 3 and a GPU, it runs without any error using these commands:
!git clone https://github.com/tensorflow/models.git
%env PYTHONPATH=/env/python/:/content/models/research/:/content/models/research/slim
!python /content/models/research/deeplab/model_test.py
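Outside of Colab, the same PYTHONPATH setup can be sketched in plain Python. This is a minimal illustration, assuming the models repository was cloned to the path you pass in; deeplab_env is a hypothetical helper name, not part of DeepLab.

```python
import os


def deeplab_env(models_root, base_env=None):
    """Return a copy of the environment with PYTHONPATH extended for DeepLab.

    models_root is the directory created by cloning tensorflow/models.
    The research and research/slim directories are put first on the path.
    """
    env = dict(os.environ if base_env is None else base_env)
    research = os.path.join(models_root, "research")
    slim = os.path.join(research, "slim")

    parts = [research, slim]
    existing = env.get("PYTHONPATH")
    if existing:
        parts.append(existing)  # keep whatever was already on the path

    env["PYTHONPATH"] = os.pathsep.join(parts)
    return env


if __name__ == "__main__":
    env = deeplab_env("/content/models")
    print(env["PYTHONPATH"])
    # To run the test with this environment (needs TensorFlow and the clone):
    # import subprocess, sys
    # subprocess.run(
    #     [sys.executable, "/content/models/research/deeplab/model_test.py"],
    #     env=env, check=True)
```

Passing the environment explicitly to the child process avoids changing the path for the rest of the session.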

Error while running TensorFlow wide_n_deep Tutorial

I encountered the error:
AttributeError: 'NoneType' object has no attribute 'bucketize'
The full error is as follows:
Traceback (most recent call last):
File "wide_n_deep_tutorial_1.py", line 214, in <module>
train_and_eval()
File "wide_n_deep_tutorial_1.py", line 203, in train_and_eval
m.fit(input_fn=lambda: input_fn(df_train), steps=FLAGS.train_steps)
File "C:\Python35\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\dnn_linear_combined.py", line 711, in fit
max_steps=max_steps)
File "C:\Python35\lib\site-packages\tensorflow\python\util\deprecation.py", line 191, in new_func
return func(*args, **kwargs)
File "C:\Python35\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\estimator.py", line 355, in fit
max_steps=max_steps)
File "C:\Python35\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\estimator.py", line 699, in _train_model
train_ops = self._get_train_ops(features, labels)
File "C:\Python35\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\estimator.py", line 1052, in _get_train_ops
return self._call_model_fn(features, labels, model_fn_lib.ModeKeys.TRAIN)
File "C:\Python35\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\estimator.py", line 1019, in _call_model_fn
params=self.params)
File "C:\Python35\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\dnn_linear_combined.py", line 504, in _dnn_linear_combined_model_fn
scope=scope)
File "C:\Python35\lib\site-packages\tensorflow\contrib\layers\python\layers\feature_column_ops.py", line 526, in weighted_sum_from_feature_columns
transformed_tensor = transformer.transform(column)
File "C:\Python35\lib\site-packages\tensorflow\contrib\layers\python\layers\feature_column_ops.py", line 869, in transform
feature_column.insert_transformed_feature(self._columns_to_tensors)
File "C:\Python35\lib\site-packages\tensorflow\contrib\layers\python\layers\feature_column.py", line 1489, in insert_transformed_feature
name="bucketize")
File "C:\Python35\lib\site-packages\tensorflow\contrib\layers\python\ops\bucketization_op.py", line 48, in bucketize
return _bucketization_op.bucketize(input_tensor, boundaries, name=name)
AttributeError: 'NoneType' object has no attribute 'bucketize'
I got the same issue. It seems that on Windows, _bucketization_op is simply None (see the source code).
Try running this code on Linux, or try removing the bucketization and the column crossing. For example, change the line:
flags.DEFINE_string("model_type","wide_n_deep","valid model types:{'wide','deep', 'wide_n_deep'}")
to
flags.DEFINE_string("model_type","deep","valid model types:{'wide','deep', 'wide_n_deep'}")
Follow the linked issue for updates.
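One way to apply this workaround without editing the flag by hand on each machine is to pick the default from the platform. This is only a sketch under the assumption stated in the answer, namely that the bucketize op is unavailable on Windows; default_model_type is a hypothetical helper name, not part of the tutorial.

```python
import platform


def default_model_type(system=None):
    """Pick a model type that avoids bucketized columns on Windows.

    Assumption: the "deep" model does not go through the bucketize op,
    while "wide" and "wide_n_deep" do.
    """
    if system is None:
        system = platform.system()
    return "deep" if system == "Windows" else "wide_n_deep"


if __name__ == "__main__":
    # The result could be passed as the default value of the model_type flag.
    print(default_model_type())
```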