How to fix "Input shape axis 0 must equal 4, got shape [5]" in TensorFlow?

I am running the TensorFlow Object Detection API in a Docker container with a TITAN Xp GPU. When I run the command python object_detection/model_main.py --pipeline_config_path object_detection/train_manhole/faster_rcnn_resnet101_coco.config --model_dir object_detection/train_manhole --alsologtostder, I receive an error.
Here is the error information:
root#a358c8644e9c:~/manhole/models/research# python object_detection/model_main.py --pipeline_config_path object_detection/train_manhole/faster_rcnn_resnet101_coco.config --model_dir object_detection/train_manhole --alsologtostder
/root/manhole/models/research/object_detection/utils/visualization_utils.py:26: UserWarning:
This call to matplotlib.use() has no effect because the backend has already
been chosen; matplotlib.use() must be called *before* pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.
The backend was *originally* set to 'TkAgg' by the following code:
File "object_detection/model_main.py", line 26, in <module>
from object_detection import model_lib
File "/root/manhole/models/research/object_detection/model_lib.py", line 27, in <module>
from object_detection import eval_util
File "/root/manhole/models/research/object_detection/eval_util.py", line 28, in <module>
from object_detection.metrics import coco_evaluation
File "/root/manhole/models/research/object_detection/metrics/coco_evaluation.py", line 20, in <module>
from object_detection.metrics import coco_tools
File "/root/manhole/models/research/object_detection/metrics/coco_tools.py", line 47, in <module>
from pycocotools import coco
File "/root/manhole/models/research/pycocotools/coco.py", line 49, in <module>
import matplotlib.pyplot as plt
File "/usr/local/lib/python3.5/dist-packages/matplotlib/pyplot.py", line 71, in <module>
from matplotlib.backends import pylab_setup
File "/usr/local/lib/python3.5/dist-packages/matplotlib/backends/__init__.py", line 16, in <module>
line for line in traceback.format_stack()
import matplotlib; matplotlib.use('Agg') # pylint: disable=multiple-statements
WARNING:tensorflow:Forced number of epochs for all eval validations to be 1.
WARNING:tensorflow:Expected number of evaluation epochs is 1, but instead encountered `eval_on_train_input_config.num_epochs` = 0. Overwriting `num_epochs` to 1.
WARNING:tensorflow:Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x7f0a65a25e18>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From /root/manhole/models/research/object_detection/builders/dataset_builder.py:80: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/sparse_ops.py:1165: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
WARNING:tensorflow:From /root/manhole/models/research/object_detection/builders/dataset_builder.py:152: batch_and_drop_remainder (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.batch(..., drop_remainder=True)`.
WARNING:tensorflow:From /root/manhole/models/research/object_detection/predictors/heads/box_head.py:93: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /root/manhole/models/research/object_detection/core/losses.py:345: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.
See `tf.nn.softmax_cross_entropy_with_logits_v2`.
/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py:112: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2019-03-29 02:38:24.469848: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-03-29 02:38:24.546573: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-03-29 02:38:24.547117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:01:00.0
totalMemory: 11.91GiB freeMemory: 11.41GiB
2019-03-29 02:38:24.547150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-03-29 02:38:24.709319: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-29 02:38:24.709370: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-03-29 02:38:24.709377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-03-29 02:38:24.709670: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11036 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:01:00.0, compute capability: 6.1)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input shape axis 0 must equal 4, got shape [5]
[[{{node Preprocessor/ResizeToRange/cond/resize_images/unstack}} = Unpack[T=DT_INT32, axis=0, num=4, _device="/device:CPU:0"](Preprocessor/ResizeToRange/cond/resize_images/Shape)]]
[[{{node IteratorGetNext}} = IteratorGetNext[output_shapes=[[1], [1,?,?,3], [1,2], [1,3], [1,100], [1,100,4], [1,100,2], [1,100,2], [1,100], [1,100], [1,100], [1]], output_types=[DT_INT32, DT_FLOAT, DT_INT32, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_BOOL, DT_FLOAT, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](IteratorV2)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "object_detection/model_main.py", line 109, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "object_detection/model_main.py", line 105, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
return executor.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 610, in run
return self.run_local()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 711, in run_local
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
saving_listeners)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1471, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 671, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1312, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input shape axis 0 must equal 4, got shape [5]
[[{{node Preprocessor/ResizeToRange/cond/resize_images/unstack}} = Unpack[T=DT_INT32, axis=0, num=4, _device="/device:CPU:0"](Preprocessor/ResizeToRange/cond/resize_images/Shape)]]
[[node IteratorGetNext (defined at object_detection/model_main.py:105) = IteratorGetNext[output_shapes=[[1], [1,?,?,3], [1,2], [1,3], [1,100], [1,100,4], [1,100,2], [1,100,2], [1,100], [1,100], [1,100], [1]], output_types=[DT_INT32, DT_FLOAT, DT_INT32, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_BOOL, DT_FLOAT, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](IteratorV2)]]
My Docker environment is:
== cat /etc/issue ===============================================
Linux a358c8644e9c 4.15.0-46-generic #49~16.04.1-Ubuntu SMP Tue Feb 12 17:45:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
VERSION="16.04.6 LTS (Xenial Xerus)"
VERSION_ID="16.04"
VERSION_CODENAME=xenial
== are we in docker =============================================
Yes
== compiler =====================================================
c++ (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== uname -a =====================================================
Linux a358c8644e9c 4.15.0-46-generic #49~16.04.1-Ubuntu SMP Tue Feb 12 17:45:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
== check pips ===================================================
numpy 1.14.2
protobuf 3.7.1
tensorflow-estimator 1.13.0
tensorflow-gpu 1.12.0
== check for virtualenv =========================================
False
== tensorflow import ============================================
tf.VERSION = 1.12.0
tf.GIT_VERSION = v1.12.0-0-ga6d8ffae09
tf.COMPILER_VERSION = 4.8.5
Sanity check: array([1], dtype=int32)
== env ==========================================================
LD_LIBRARY_PATH /usr/local/cuda-9.0/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
DYLD_LIBRARY_PATH is unset
== nvidia-smi ===================================================
Fri Mar 29 02:57:43 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78 Driver Version: 410.78 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:01:00.0 On | N/A |
| 23% 36C P0 69W / 250W | 378MiB / 12192MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
== cuda libs ===================================================
/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudart.so.9.0.176
/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudart_static.a
Thanks for your help!

Thanks! I fixed it. The cause was that the dataset contained some broken images. In addition, some images had a depth of more than 3 channels. I used a Python script to pick out the images that caused the error above. With those handled, training now runs normally.
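The script itself was not posted. As a rough sketch of what such a check could look like, the code below walks a directory and reports files that fail to decode or that do not have exactly 3 channels. It assumes Pillow is installed and that the images are .jpg, .jpeg, or .png files; the function names and the extension list are hypothetical and not taken from the answer.

```python
import os

# Assumption: these are the image types in the dataset; adjust as needed.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def is_image_file(name):
    """Return True if the file name has one of the assumed image extensions."""
    return name.lower().endswith(IMAGE_EXTENSIONS)


def check_image(path):
    """Return None if the image decodes and has 3 channels, else a reason string."""
    from PIL import Image  # Pillow (third-party), imported lazily

    try:
        with Image.open(path) as img:
            img.load()  # force a full decode so truncated files fail here
            bands = len(img.getbands())
    except Exception as exc:
        return "cannot decode: {}".format(exc)
    if bands != 3:
        return "has {} channel(s), expected 3".format(bands)
    return None


def find_bad_images(image_dir):
    """Map each problematic image path under image_dir to the reason it was flagged."""
    bad = {}
    for root, _dirs, files in os.walk(image_dir):
        for name in sorted(files):
            if is_image_file(name):
                path = os.path.join(root, name)
                reason = check_image(path)
                if reason is not None:
                    bad[path] = reason
    return bad
```

Calling find_bad_images on the image folder returns a dictionary of suspicious files, which can then be inspected by hand before anything is deleted or converted.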

I faced the same issue; the problem was caused by broken images in the dataset. Some images had incorrect shapes, such as 2D (height, width) arrays, where the correct shape is 3D (height, width, channels).
What solved it was deleting the non-3D images after comparing the images' shapes. Done!
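A shape comparison like this could be sketched as follows. This is only an illustration, assuming NumPy and Pillow are available; the helper names are hypothetical, and the sketch moves offending files to a separate folder instead of deleting them so that the step can be undone.

```python
import os
import shutil


def is_hwc_rgb(shape):
    """Return True only for (height, width, 3) array shapes."""
    return len(shape) == 3 and shape[2] == 3


def image_shape(path):
    """Decode an image and return the shape of its pixel array."""
    # NumPy and Pillow are third-party; imported lazily.
    import numpy as np
    from PIL import Image

    with Image.open(path) as img:
        return np.asarray(img).shape


def quarantine_bad_shapes(paths, quarantine_dir, shape_fn=image_shape):
    """Move every image whose shape is not (H, W, 3) into quarantine_dir."""
    os.makedirs(quarantine_dir, exist_ok=True)
    moved = []
    for path in paths:
        if not is_hwc_rgb(shape_fn(path)):
            target = os.path.join(quarantine_dir, os.path.basename(path))
            shutil.move(path, target)
            moved.append(path)
    return moved
```

If training reads from TFRecord files built from these images, the records would presumably need to be regenerated after the bad images are removed.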

Related

export_model.py - Not found: Tensor name "MobilenetV2/Conv/BatchNorm/beta" not found in checkpoint files

I've been trying to train my own deeplab model from https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/pascal.md.
I'm running everything on Google Colab.
I've been able to train the model fine:
%%shell
export PYTHONPATH=$PYTHONPATH:"/content/models/research":"/content/models/research/slim"
NUM_ITERATIONS=50
python3 train.py \
--logtostderr \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size=200,200 \
--train_batch_size=12 \
--training_number_of_steps="${NUM_ITERATIONS}" \
--fine_tune_batch_norm=true \
--tf_initial_checkpoint="/content/deeplabv3_pascal_train_aug/model.ckpt.index" \
--train_logdir="/content/output" \
--dataset_dir="/content/drive/My Drive/Colab Notebooks/Background Removal/tfrecord"
And create visualizations fine:
%%shell
export PYTHONPATH=$PYTHONPATH:"/content/models/research":"/content/models/research/slim"
python3 vis.py \
--logtostderr \
--vis_split="val" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--vis_crop_size=200,200 \
--checkpoint_dir=/content/output \
--vis_logdir=/content/output/vis \
--dataset_dir="/content/drive/My Drive/Colab Notebooks/Background Removal/tfrecord" \
--max_number_of_iterations=1
But running export_model.py does not work. I thought it might have been an issue with the model I have trained, so I tried exporting the initial checkpoint I am training off of - it doesn't work either.
%%shell
export PYTHONPATH=$PYTHONPATH:"/content/models/research":"/content/models/research/slim"
NUM_ITERATIONS=50
python3 export_model.py \
--logtostderr \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--crop_size=200 \
--crop_size=200 \
--checkpoint_path='/content/output/model.ckpt-50.index' \
--export_path='/content/output'
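One way to investigate a "Tensor name ... not found in checkpoint" error is to list which variables the checkpoint actually stores and compare their top-level scopes with the name in the error. The sketch below does this with tf.train.list_variables. The helper names are hypothetical, and the checkpoint prefix is assumed to be the path without the .index suffix.

```python
def top_level_scopes(variable_names):
    """Return the sorted, de-duplicated first path component of each name."""
    return sorted({name.split("/", 1)[0] for name in variable_names})


def checkpoint_scopes(checkpoint_prefix):
    """List the top-level variable scopes stored in a TensorFlow checkpoint."""
    import tensorflow as tf  # third-party, imported lazily

    # list_variables yields (name, shape) pairs for every stored variable.
    names = [name for name, _shape in tf.train.list_variables(checkpoint_prefix)]
    return top_level_scopes(names)
```

For example, checkpoint_scopes("/content/output/model.ckpt-50") would show whether the checkpoint contains a MobilenetV2 scope at all. Note that the export command above, unlike the train.py and vis.py commands, does not pass --model_variant, which may be why the export graph asks for MobilenetV2 variables.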
Full output from running export_model.py:
WARNING:tensorflow:From /content/models/research/deeplab/core/conv2d_ws.py:40: The name tf.layers.Layer is deprecated. Please use tf.compat.v1.layers.Layer instead.
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
WARNING:tensorflow:From export_model.py:201: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.
WARNING:tensorflow:From export_model.py:117: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
W0329 17:24:00.753659 139709292058496 module_wrapper.py:139] From export_model.py:117: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
WARNING:tensorflow:From export_model.py:117: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.
W0329 17:24:00.753914 139709292058496 module_wrapper.py:139] From export_model.py:117: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.
WARNING:tensorflow:From export_model.py:118: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
W0329 17:24:00.754124 139709292058496 module_wrapper.py:139] From export_model.py:118: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
INFO:tensorflow:Prepare to export model to: /content/output
I0329 17:24:00.754279 139709292058496 export_model.py:118] Prepare to export model to: /content/output
WARNING:tensorflow:From export_model.py:91: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
W0329 17:24:00.755340 139709292058496 module_wrapper.py:139] From export_model.py:91: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
INFO:tensorflow:Exported model performs single-scale inference.
I0329 17:24:00.817728 139709292058496 export_model.py:130] Exported model performs single-scale inference.
WARNING:tensorflow:From /content/models/research/deeplab/model.py:320: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.
W0329 17:24:00.818036 139709292058496 module_wrapper.py:139] From /content/models/research/deeplab/model.py:320: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.
WARNING:tensorflow:From /content/models/research/deeplab/core/feature_extractor.py:461: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
W0329 17:24:00.818522 139709292058496 deprecation.py:323] From /content/models/research/deeplab/core/feature_extractor.py:461: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
WARNING:tensorflow:From /content/models/research/deeplab/core/feature_extractor.py:75: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
W0329 17:24:00.821603 139709292058496 module_wrapper.py:139] From /content/models/research/deeplab/core/feature_extractor.py:75: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
WARNING:tensorflow:From /tensorflow-1.15.2/python3.6/tensorflow_core/contrib/layers/python/layers/layers.py:1057: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
W0329 17:24:00.825009 139709292058496 deprecation.py:323] From /tensorflow-1.15.2/python3.6/tensorflow_core/contrib/layers/python/layers/layers.py:1057: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /content/models/research/deeplab/core/utils.py:41: The name tf.image.resize_bilinear is deprecated. Please use tf.compat.v1.image.resize_bilinear instead.
W0329 17:24:02.636440 139709292058496 module_wrapper.py:139] From /content/models/research/deeplab/core/utils.py:41: The name tf.image.resize_bilinear is deprecated. Please use tf.compat.v1.image.resize_bilinear instead.
WARNING:tensorflow:From export_model.py:162: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.
W0329 17:24:02.986706 139709292058496 module_wrapper.py:139] From export_model.py:162: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.
WARNING:tensorflow:From export_model.py:178: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.
W0329 17:24:02.991279 139709292058496 module_wrapper.py:139] From export_model.py:178: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.
WARNING:tensorflow:From export_model.py:178: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Please use tf.global_variables instead.
W0329 17:24:02.991502 139709292058496 deprecation.py:323] From export_model.py:178: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Please use tf.global_variables instead.
WARNING:tensorflow:From export_model.py:181: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.
W0329 17:24:03.295938 139709292058496 module_wrapper.py:139] From export_model.py:181: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.
WARNING:tensorflow:From export_model.py:182: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
W0329 17:24:03.296255 139709292058496 module_wrapper.py:139] From export_model.py:182: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
WARNING:tensorflow:From /tensorflow-1.15.2/python3.6/tensorflow_core/python/tools/freeze_graph.py:127: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
W0329 17:24:03.419735 139709292058496 deprecation.py:323] From /tensorflow-1.15.2/python3.6/tensorflow_core/python/tools/freeze_graph.py:127: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2020-03-29 17:24:03.901045: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-03-29 17:24:03.919472: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-29 17:24:03.920276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
2020-03-29 17:24:03.920544: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-29 17:24:03.922225: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-03-29 17:24:03.923832: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-03-29 17:24:03.924132: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-03-29 17:24:03.926131: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-03-29 17:24:03.927020: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-03-29 17:24:03.930883: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-29 17:24:03.931017: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-29 17:24:03.931838: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-29 17:24:03.932481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2020-03-29 17:24:03.937940: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2020-03-29 17:24:03.938159: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1a83480 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-29 17:24:03.938192: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-03-29 17:24:03.993090: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-29 17:24:03.993934: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1a83640 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-03-29 17:24:03.993966: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-03-29 17:24:03.994138: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-29 17:24:03.994819: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
2020-03-29 17:24:03.994883: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-29 17:24:03.994912: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-03-29 17:24:03.994937: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-03-29 17:24:03.994960: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-03-29 17:24:03.994984: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-03-29 17:24:03.995007: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-03-29 17:24:03.995031: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-29 17:24:03.995121: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-29 17:24:03.995850: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-29 17:24:03.996477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2020-03-29 17:24:03.996539: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-29 17:24:03.998097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-29 17:24:03.998127: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186] 0
2020-03-29 17:24:03.998140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0: N
2020-03-29 17:24:03.998307: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-29 17:24:03.999000: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-29 17:24:03.999707: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2020-03-29 17:24:03.999752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10805 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
INFO:tensorflow:Restoring parameters from /content/output/model.ckpt-50.index
I0329 17:24:04.002565 139709292058496 saver.py:1284] Restoring parameters from /content/output/model.ckpt-50.index
Traceback (most recent call last):
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
(0) Not found: Tensor name "MobilenetV2/Conv/BatchNorm/beta" not found in checkpoint files /content/output/model.ckpt-50.index
[[{{node save/RestoreV2}}]]
(1) Not found: Tensor name "MobilenetV2/Conv/BatchNorm/beta" not found in checkpoint files /content/output/model.ckpt-50.index
[[{{node save/RestoreV2}}]]
[[save/RestoreV2/_301]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 1290, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
(0) Not found: Tensor name "MobilenetV2/Conv/BatchNorm/beta" not found in checkpoint files /content/output/model.ckpt-50.index
[[node save/RestoreV2 (defined at /tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]]
(1) Not found: Tensor name "MobilenetV2/Conv/BatchNorm/beta" not found in checkpoint files /content/output/model.ckpt-50.index
[[node save/RestoreV2 (defined at /tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]]
[[save/RestoreV2/_301]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'save/RestoreV2':
File "export_model.py", line 201, in <module>
tf.app.run()
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "export_model.py", line 178, in main
saver = tf.train.Saver(tf.all_variables())
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 828, in __init__
self.build()
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 840, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 878, in _build
build_restore=build_restore)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 508, in _build_internal
restore_sequentially, reshape)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 328, in _AddRestoreOps
restore_sequentially)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 575, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/gen_io_ops.py", line 1696, in restore_v2
name=name)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 1300, in restore
names_to_keys = object_graph_key_mapping(save_path)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 1618, in object_graph_key_mapping
object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/pywrap_tensorflow_internal.py", line 915, in get_tensor
return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str))
tensorflow.python.framework.errors_impl.NotFoundError: _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint file
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "export_model.py", line 201, in <module>
tf.app.run()
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "export_model.py", line 192, in main
initializer_nodes=None)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/tools/freeze_graph.py", line 151, in freeze_graph_with_def_protos
saver.restore(sess, input_checkpoint)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 1306, in restore
err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
2 root error(s) found.
(0) Not found: Tensor name "MobilenetV2/Conv/BatchNorm/beta" not found in checkpoint files /content/output/model.ckpt-50.index
[[node save/RestoreV2 (defined at /tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]]
(1) Not found: Tensor name "MobilenetV2/Conv/BatchNorm/beta" not found in checkpoint files /content/output/model.ckpt-50.index
[[node save/RestoreV2 (defined at /tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]]
[[save/RestoreV2/_301]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'save/RestoreV2':
File "export_model.py", line 201, in <module>
tf.app.run()
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "export_model.py", line 178, in main
saver = tf.train.Saver(tf.all_variables())
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 828, in __init__
self.build()
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 840, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 878, in _build
build_restore=build_restore)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 508, in _build_internal
restore_sequentially, reshape)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 328, in _AddRestoreOps
restore_sequentially)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 575, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/gen_io_ops.py", line 1696, in restore_v2
name=name)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
<ipython-input-14-46a5ede3bd50> in <module>()
----> 1 get_ipython().run_cell_magic('shell', '', 'export PYTHONPATH=$PYTHONPATH:"/content/models/research":"/content/models/research/slim"\nNUM_ITERATIONS=50\npython3 export_model.py \\\n --logtostderr \\\n --atrous_rates=6 \\\n --atrous_rates=12 \\\n --atrous_rates=18 \\\n --output_stride=16 \\\n --crop_size=200 \\\n --crop_size=200 \\\n --checkpoint_path=\'/content/output/model.ckpt-50.index\' \\\n --export_path=\'/content/output\'')
2 frames
/usr/local/lib/python3.6/dist-packages/google/colab/_system_commands.py in check_returncode(self)
136 if self.returncode:
137 raise subprocess.CalledProcessError(
--> 138 returncode=self.returncode, cmd=self.args, output=self.output)
139
140 def _repr_pretty_(self, p, cycle): # pylint:disable=unused-argument
CalledProcessError: Command 'export PYTHONPATH=$PYTHONPATH:"/content/models/research":"/content/models/research/slim"
NUM_ITERATIONS=50
python3 export_model.py \
--logtostderr \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--crop_size=200 \
--crop_size=200 \
--checkpoint_path='/content/output/model.ckpt-50.index' \
--export_path='/content/output'' returned non-zero exit status 1.
I'm aware of similar GitHub issues (https://github.com/tensorflow/models/issues/6212 and https://github.com/tensorflow/models/issues/3992), but it doesn't look like any were resolved. I also tried poking around in the export_model.py code in deeplab, but I don't understand the TF code enough to know where to look.
By default, export_model.py looks for the variables of a MobileNet-v2 backbone in the checkpoint. Since you trained your model with an Xception backbone, add the --model_variant="xception_65" argument to your export_model.py command.
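To check which backbone a checkpoint actually contains before exporting, it can help to count the top-level variable scopes stored in it. The following is a minimal sketch, not part of the DeepLab code: the helper name top_level_scopes and the example variable names are hypothetical, and the commented usage assumes tf.train.list_variables is given the checkpoint prefix.

```python
from collections import Counter


def top_level_scopes(variable_names):
    """Count how many variables live under each top-level scope."""
    return Counter(name.split("/", 1)[0] for name in variable_names)


# Hypothetical variable names, for illustration only.
example_names = [
    "xception_65/entry_flow/conv1_1/weights",
    "xception_65/entry_flow/conv1_1/BatchNorm/beta",
    "logits/semantic/weights",
    "global_step",
]
print(top_level_scopes(example_names))

# With TensorFlow installed, the real names could be read like this
# (checkpoint_prefix is an assumption; substitute your own path):
#   import tensorflow as tf
#   names = [name for name, _ in tf.train.list_variables(checkpoint_prefix)]
#   print(top_level_scopes(names))
```

If no scope matching the backbone that export_model.py builds shows up in the counts, the model variant flag and the checkpoint probably disagree.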

CUDNN_STATUS_INTERNAL_ERROR in tensorflow-gpu 2.0

When I run a CNN in TensorFlow 2.0, I get CUDNN_STATUS_INTERNAL_ERROR.
It seems that libcublas.so.10.0 and libcudnn.so.7 are loaded fine.
The versions should be fine:
Tensorflow 2.0
ubuntu 18.04
GeForce GTX 1650
NVIDIA driver 430
cudnn: 7.4.2.24 (also tried with 7.3.0.29 and 7.6.4.38)
(ref)
I tried the following, but neither fixed the problem:
I removed ~/.nv (ref)
Modified /usr/include/cudnn.h #include "driver_types.h" to #include <driver_types.h> and passed mnistCUDNN test (ref)
Questions:
Does passing the mnistCUDNN test mean that required packages are installed correctly?
How can I fix this problem below?
Finally, here is the error message:
Using TensorFlow backend.
2019-10-16 14:48:16.226892: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-10-16 14:48:16.255123: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
...
2019-10-16 14:48:16.370703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3253 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1650, pci bus id: 0000:01:00.0, compute capability: 7.5)
Train on 48000 samples, validate on 12000 samples
Epoch 1/12
2019-10-16 14:48:17.357747: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-10-16 14:48:17.525865: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
--error here--
2019-10-16 14:48:17.873127: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-10-16 14:48:17.879412: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
--error here--
2019-10-16 14:48:17.879516: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv2d_1/convolution}}]]
Traceback (most recent call last):
File "lenet.py", line 96, in <module> x_train, y_train, batch_size=128, epochs=12, validation_split=0.2
File "lenet.py", line 83, in train verbose=self.verbose
File "/home/yuyu/venv/lib/python3.6/site-packages/keras/engine/training.py", line 1239, in fit validation_freq=validation_freq)
File "/home/yuyu/venv/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 196, in fit_loop outs = fit_function(ins_batch)
File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/keras/backend.py", line 3740, in __call__
outputs = self._graph_fn(*converted_inputs)
File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1081, in __call__
return self._call_impl(args, kwargs)
File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1121, in _call_impl
return self._call_flat(args, self.captured_inputs, cancellation_manager)
File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
ctx, args, cancellation_manager=cancellation_manager)
File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 511, in call
ctx=ctx)
File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv2d_1/convolution (defined at /home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1751) ]] [Op:__inference_keras_scratch_graph_1220]
Function call stack:
keras_scratch_graph
I encountered this error on my Ubuntu 20.04 / RTX 2070 system. I found this:
https://gist.github.com/mikaelhg/cae5b7938aa3dfdf3d06a40739f2f3f4#file-cuda-install-md
where it suggests exporting an environment variable like this:
export TF_FORCE_GPU_ALLOW_GROWTH=true
That fixed it for me. Happy days.
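The same setting can also be applied from inside a Python script, as long as it happens before TensorFlow initializes the GPU. This is a minimal sketch assuming the variable is read at GPU initialization; the try/except around the import is only there so the snippet also runs on a machine without TensorFlow.

```python
import os

# Set the variable before importing TensorFlow so it is visible
# when the GPU device is first initialized.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

try:
    import tensorflow as tf
    print("TensorFlow", tf.__version__, "imported with TF_FORCE_GPU_ALLOW_GROWTH set")
except ImportError:
    print("TensorFlow is not installed; only the environment variable was set")
```

Placing the assignment at the very top of the entry script keeps it ahead of any module that imports TensorFlow indirectly.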

tensorflow.python.framework.errors_impl.ResourceExhaustedError

I'm using an object detection module for classifying images. My specs are as follows:
OS: Ubuntu 18.04 LTS
Python: 3.6.7
VirtualEnv: Version: 16.4.3
Pip3 version inside virtualenv: 19.0.3
TensorFlow Version: 1.13.1
Protoc Version: 3.0.0-9
I'm working on Windows virtualenv and google-colab. This is the error message I get:
python3 legacy/train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/ssd_mobilenet_v1_pets.config
INFO:tensorflow:global step 1: loss = 18.5013 (48.934 sec/step)
INFO:tensorflow:Finished training! Saving model to disk.
/home/priyank/venv/lib/python3.6/site-packages/tensorflow/python/summary/writer/writer.py:386: UserWarning: Attempting to use a closed FileWriter. The operation will be a noop unless the FileWriter is explicitly reopened.
warnings.warn("Attempting to use a closed FileWriter. "
Traceback (most recent call last):
File "legacy/train.py", line 184, in <module>
tf.app.run()
File "/home/priyank/venv/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/priyank/venv/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "legacy/train.py", line 180, in main
graph_hook_fn=graph_rewriter_fn)
File "/home/priyank/venv/models-master/research/object_detection/legacy/trainer.py", line 416, in train
saver=saver)
File "/home/priyank/venv/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 785, in train
ignore_live_threads=ignore_live_threads)
File "/home/priyank/venv/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 832, in stop
ignore_live_threads=ignore_live_threads)
File "/home/priyank/venv/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/priyank/venv/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/priyank/venv/lib/python3.6/site-packages/tensorflow/python/training/queue_runner_impl.py", line 257, in _run
enqueue_callable()
File "/home/priyank/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1257, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/home/priyank/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[15,1,1755,2777,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
[[{{node batch}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
You can try the following fixes:
1. Reduce the image dimensions if you are using very high-resolution images.
2. Reduce the batch size.
3. Check whether another process is using up your memory.
Could you also please share your config file?
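To see why the first two suggestions matter, it helps to work out how much memory the tensor in the error message needs. A rough sketch, assuming "type float" means 4-byte float32; the batch size of 4 in the second call is a hypothetical value chosen only for comparison.

```python
from functools import reduce
from operator import mul


def tensor_bytes(shape, bytes_per_element=4):
    """Memory needed for a dense tensor of the given shape."""
    return reduce(mul, shape, 1) * bytes_per_element


# Shape reported in the OOM message: batch of 15 images of 1755 x 2777 x 3.
reported = tensor_bytes([15, 1, 1755, 2777, 3])
# Hypothetical smaller batch of 4 with the same image size.
smaller_batch = tensor_bytes([4, 1, 1755, 2777, 3])

print("reported tensor: %.1f MiB" % (reported / 2**20))
print("batch of 4:      %.1f MiB" % (smaller_batch / 2**20))
```

A single batch at the reported shape takes about 877 MB, and the input queue may hold several such batches at once, so either smaller images or a smaller batch size reduces the pressure considerably.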

Conv2D for GPU is not currently supported without cudnn

I'm testing a TensorFlow program which used tf.nn.conv2d, but an error occurs as below:
2018-09-18 01:33:54.908161: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-09-18 01:33:54.987724: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-09-18 01:33:54.988106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.695
pciBusID: 0000:01:00.0
totalMemory: 7.92GiB freeMemory: 7.44GiB
2018-09-18 01:33:54.988122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-09-18 01:33:55.193045: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-18 01:33:55.193076: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2018-09-18 01:33:55.193082: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2018-09-18 01:33:55.193257: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7173 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1)
Traceback (most recent call last):
File "/home/jzsb/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
return fn(*args)
File "/home/jzsb/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/jzsb/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnimplementedError: Conv2D for GPU is not currently supported without cudnn
[[Node: Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, Variable/read)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "facerec.py", line 77, in <module>
l, _ = sess.run([loss, train], feed_dict={x:imag, y:ans})
File "/home/jzsb/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run
run_metadata_ptr)
File "/home/jzsb/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run
feed_dict_tensor, options, run_metadata)
File "/home/jzsb/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
run_metadata)
File "/home/jzsb/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnimplementedError: Conv2D for GPU is not currently supported without cudnn
[[Node: Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, Variable/read)]]
Caused by op 'Conv2D', defined at:
File "facerec.py", line 31, in <module>
l1 = tf.nn.relu(tf.nn.conv2d(x, kcore1, [1, 1, 1, 1], 'SAME', False)+bias1)
File "/home/jzsb/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 956, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/home/jzsb/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/jzsb/.local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "/home/jzsb/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
op_def=op_def)
File "/home/jzsb/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in __init__
self._traceback = tf_stack.extract_stack()
UnimplementedError (see above for traceback): Conv2D for GPU is not currently supported without cudnn
[[Node: Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, Variable/read)]]
I have confirmed that cuDNN is installed:
jzsb#jzsb-tf:~/PycharmProjects/untitled$ ls -las /usr/local/cuda/include/*dnn*
100 -r--r--r-- 1 root root 100962 8月 15 01:08 /usr/local/cuda/include/cudnn.h
jzsb#jzsb-tf:~/PycharmProjects/untitled$ ls -las /usr/local/cuda/lib64/*dnn*
0 lrwxrwxrwx 1 root root 13 8月 15 01:09 /usr/local/cuda/lib64/libcudnn.so -> libcudnn.so.7
0 lrwxrwxrwx 1 root root 17 8月 15 01:09 /usr/local/cuda/lib64/libcudnn.so.7 -> libcudnn.so.7.2.1
281260 -rwxr-xr-x 1 root root 288007984 8月 15 01:09 /usr/local/cuda/lib64/libcudnn.so.7.2.1
278240 -rw-r--r-- 1 root root 284914492 8月 15 01:09 /usr/local/cuda/lib64/libcudnn_static.a
My TensorFlow version is 1.10. Please tell me how to solve this problem. Thank you.
You can install the latest version of TensorFlow (GPU and CPU) together with the matching cuDNN and CUDA versions, which should resolve the issue.
For details on TensorFlow GPU installation and setup, such as drivers and PATH, you can follow this documentation.
For the table of tested TensorFlow, Python, cuDNN, and CUDA version combinations, you can follow this documentation.
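The traceback itself shows the cause: the Conv2D node has use_cudnn_on_gpu=false, so cuDNN is disabled in your code.
In facerec.py line 31, the fifth argument to tf.nn.conv2d (False) is use_cudnn_on_gpu. Change False to True and the program should work.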
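As a sketch of that change, the call can pass the arguments by keyword so the cuDNN switch is explicit rather than a bare positional value. The helper name conv_relu and the CONV_ARGS dictionary are hypothetical; the code assumes a TensorFlow 1.x style conv2d that accepts use_cudnn_on_gpu (reached through tf.compat.v1 when that namespace exists), and the import is guarded so the snippet loads without TensorFlow.

```python
try:
    import tensorflow as tf
except ImportError:
    tf = None  # the helper below then raises instead of building ops

# Keyword arguments for the convolution; cuDNN stays enabled.
CONV_ARGS = {
    "strides": [1, 1, 1, 1],
    "padding": "SAME",
    "use_cudnn_on_gpu": True,
}


def conv_relu(x, kernel, bias):
    """Return relu(conv2d(x, kernel) + bias) with cuDNN enabled."""
    if tf is None:
        raise RuntimeError("TensorFlow is required to build the graph")
    # Use the 1.x-style op, which still has the use_cudnn_on_gpu parameter.
    conv2d = getattr(tf.compat, "v1", tf).nn.conv2d
    return tf.nn.relu(conv2d(x, kernel, **CONV_ARGS) + bias)
```

In facerec.py this would correspond to l1 = conv_relu(x, kcore1, bias1). Simply dropping the fifth argument has the same effect, since use_cudnn_on_gpu defaults to True.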

Tensorflow GPU stopped working

Reproducing the issue
I had TensorFlow running a few days ago, but it has stopped working. When I test it with the tutorial code, both mnist_softmax and mnist_deep fail, although TensorFlow still runs the simple hello-world example successfully.
What I've tried
As with delton137, I've tried setting allow_growth to True or the per_process_gpu_memory_fraction to 0.1, but this does not help.
I've tried reinstalling my cudnn files.
Additional notes
I don't remember making any changes to my Tensorflow installation or my CUDA/cuDNN setup, so my best guess is that this might be an issue with a driver that auto-updated.
System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No. Issue is reproducible using code from tensorflow tutorials.
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04.3 LTS
TensorFlow installed from (source or binary): source
TensorFlow version (use command below): v1.3.0-rc2-20-g0787eee 1.3.0
Python version: Python 3.5.2 (default, Aug 18 2017, 17:48:00)
Bazel version (if compiling from source): N/A
CUDA/cuDNN version: CUDA release 8.0, V8.0.61 / libcudnn.so.6.0.21
GPU model and memory: GeForce GTX 1080, 8GB, on 384.90 driver
Source code / logs
For helloworld code in REPL
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2017-10-26 21:56:00.418991: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 21:56:00.419027: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 21:56:00.419036: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 21:56:00.419046: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 21:56:00.419054: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 21:56:00.565143: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-10-26 21:56:00.565417: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 6.48GiB
2017-10-26 21:56:00.565432: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-10-26 21:56:00.565437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2017-10-26 21:56:00.565447: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
>>> print(sess.run(hello))
b'Hello, TensorFlow!'
For python3 mnist_deep.py
2017-10-26 21:37:56.993479: E tensorflow/stream_executor/cuda/cuda_dnn.cc:371] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-10-26 21:37:56.993560: E tensorflow/stream_executor/cuda/cuda_dnn.cc:338] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-10-26 21:37:56.993580: F tensorflow/core/kernels/conv_ops.cc:672] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)
For python3 mnist_softmax.py
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 6.50GiB
2017-10-26 21:53:16.150706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-10-26 21:53:16.150712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2017-10-26 21:53:16.150723: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
2017-10-26 21:53:16.422081: E tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2017-10-26 21:53:16.422132: W tensorflow/stream_executor/stream.cc:1756] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1306, in _run_fn
status, run_metadata)
File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(100, 784), b.shape=(784, 10), m=100, n=10, k=784
[[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](_arg_Placeholder_0_0/_9, Variable/read)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "mnist_softmax.py", line 78, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "mnist_softmax.py", line 65, in main
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(100, 784), b.shape=(784, 10), m=100, n=10, k=784
[[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](_arg_Placeholder_0_0/_9, Variable/read)]]
Caused by op 'MatMul', defined at:
File "mnist_softmax.py", line 78, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "mnist_softmax.py", line 42, in main
y = tf.matmul(x, W) + b
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 1844, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1289, in _mat_mul
transpose_b=transpose_b, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(100, 784), b.shape=(784, 10), m=100, n=10, k=784
[[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](_arg_Placeholder_0_0/_9, Variable/read)]]
Here is the output of nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 00000000:01:00.0 On | N/A |
| 34% 51C P0 35W / 180W | 1340MiB / 8110MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1250 G /usr/lib/xorg/Xorg 785MiB |
| 0 2426 G compiz 359MiB |
| 0 3840 G ...-token=44A975F4EE134A1BF9C8CD1C7223C977 107MiB |
| 0 4944 G ...-token=4F87ADEE5575E9B5125D41E08D43BF0E 83MiB |
+-----------------------------------------------------------------------------+
Try closing TensorFlow sessions that are still active in other processes. See this thread for details:
TensorFlow: InternalError: Blas SGEMM launch failed
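One way to look for other processes that still hold the GPU is to ask nvidia-smi for its list of compute applications. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the query fields (pid, process_name, used_memory) and the csv,noheader,nounits format are assumed to be supported by the installed driver, and the function returns an empty list when nvidia-smi is not on the PATH.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, name, memory' lines into a list of dictionaries."""
    apps = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip blank or unexpected lines
        apps.append({
            "pid": int(fields[0]),
            "name": ",".join(fields[1:-1]),
            "used_mib": fields[-1],  # kept as text; may not be numeric
        })
    return apps


def list_gpu_compute_processes():
    """Return compute processes on the GPU, or [] without nvidia-smi."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, stdout=subprocess.PIPE,
                            universal_newlines=True, check=False)
    return parse_compute_apps(result.stdout)


for app in list_gpu_compute_processes():
    print(app["pid"], app["name"], app["used_mib"], "MiB")
```

Any leftover Python process in that list (for example an old REPL or notebook kernel) is a candidate to close before rerunning the script.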