Why is the GPU device used not consistent with the log info? - tensorflow

My machine has 4 GPUs, and at the very beginning of my code I set:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
With the nvidia-smi command I can see that GPU 1 is actually being used. However, the TensorFlow log in the terminal shows that GPU 0 is used:
2021-09-24 02:27:55.691073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:00:0d.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-09-24 02:27:55.691123: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-09-24 02:27:55.694585: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-09-24 02:27:55.698234: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-09-24 02:27:55.698776: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-09-24 02:27:55.702390: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-09-24 02:27:55.703656: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-09-24 02:27:55.709853: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-09-24 02:27:55.710078: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-24 02:27:55.711069: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-24 02:27:55.711917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
...
2021-09-24 02:27:55.906440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-09-24 02:27:55.906571: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-09-24 02:27:57.342555: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-09-24 02:27:57.342608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2021-09-24 02:27:57.342619: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2021-09-24 02:27:57.342980: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-24 02:27:57.343982: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-24 02:27:57.344891: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14419 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:0d.0, compute capability: 7.0)
I have two questions:
1. GPU 0 is indeed in use, but by another process. My code is using GPU 1. Why is the log above not consistent with the device actually used?
2. TensorFlow 2 should automatically detect the available GPUs and use them. If I don't add this line:
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
the log shows that it tries to use GPU 0 and produces an out-of-memory error.

1. The CUDA_VISIBLE_DEVICES environment variable remaps the devices you select so that, from the point of view of your CUDA process, the devices in your list are enumerated starting at zero. So after you do:
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
CUDA sees that device as if it were device 0.
2. Just because a GPU is in use by another process or user does not mean that it is "not available" for you to use. CUDA doesn't prevent two users or two processes from trying to use the same GPU, and in some cases that scenario is sensible and effective. So TF sees GPU 0 as a usable device, attempts to use it, and runs out of memory. That is one typical reason why people use the environment variable described in 1 above: it makes only certain devices "visible" or "usable" to your TF process.
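The renumbering can be illustrated without a GPU. The following is a minimal sketch, not CUDA's actual implementation: logical_to_physical is a hypothetical helper, and it assumes the variable holds only comma-separated integer indices (it can also hold device UUIDs, which this sketch ignores).

```python
import os


def logical_to_physical(visible):
    """Map the index a CUDA process sees to the physical index that was listed.

    Hypothetical helper for illustration; handles integer indices only.
    """
    physical_ids = [int(tok) for tok in visible.split(",") if tok.strip()]
    # The first listed device becomes device 0, the second device 1, and so on.
    return {logical: physical for logical, physical in enumerate(physical_ids)}


# The variable must be set before CUDA is initialized,
# i.e. before TensorFlow is imported in the same process.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

mapping = logical_to_physical(os.environ["CUDA_VISIBLE_DEVICES"])
print(mapping)  # {0: 1}
```

This is why the log reports "device 0" while nvidia-smi shows activity on GPU 1. One caveat: CUDA's default enumeration order is not guaranteed to match the order shown by nvidia-smi. Setting CUDA_DEVICE_ORDER=PCI_BUS_ID before CUDA is initialized should make the two agree.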

Related

Issues with TensorFlow-GPU on Nvidia K2200 (Manjaro/Arch-Linux)

I set up my second computer (Manjaro Linux) to run TensorFlow so that I don't have to use my main system for long computations. I installed everything from the repository with:
sudo pacman -S cuda cudnn python-tensorflow-opt-cuda
This is precisely what I did on my other system, which runs a GTX 1060. I updated all my header files and the Linux kernel, but I get the following error from TensorFlow when I try to run my code:
2021-06-24 12:17:16.984616: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
WARNING:root:Limited tf.summary API due to missing TensorBoard installation.
2021-06-24 12:17:25.310893: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-06-24 12:17:25.530511: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-24 12:17:25.531007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA Quadro K2200 computeCapability: 5.0
coreClock: 1.124GHz coreCount: 5 deviceMemorySize: 3.94GiB deviceMemoryBandwidth: 74.65GiB/s
2021-06-24 12:17:25.541584: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-06-24 12:17:25.662406: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-06-24 12:17:25.662508: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-06-24 12:17:25.712766: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-06-24 12:17:25.744087: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-06-24 12:17:25.783408: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-06-24 12:17:25.829728: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-06-24 12:17:25.847623: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-06-24 12:17:25.847767: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-24 12:17:25.848155: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-24 12:17:25.848463: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
Found 20580 images belonging to 120 classes.
2021-06-24 12:17:26.485288: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-24 12:17:26.515400: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-24 12:17:26.516586: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA Quadro K2200 computeCapability: 5.0
coreClock: 1.124GHz coreCount: 5 deviceMemorySize: 3.94GiB deviceMemoryBandwidth: 74.65GiB/s
2021-06-24 12:17:26.516645: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-24 12:17:26.517055: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-24 12:17:26.517521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-06-24 12:17:26.536155: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-06-24 12:17:31.411471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-24 12:17:31.411503: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021-06-24 12:17:31.411509: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2021-06-24 12:17:31.411682: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-24 12:17:31.412043: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-24 12:17:31.412421: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-24 12:17:31.412844: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3398 MB memory) -> physical GPU (device: 0, name: NVIDIA Quadro K2200, pci bus id: 0000:01:00.0, compute capability: 5.0)
2021-06-24 12:17:31.430573: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2021-06-24 12:17:31.556036: F ./tensorflow/core/kernels/random_op_gpu.h:244] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), key, counter, gen, data, size, dist) status: Internal: no kernel image is available for execution on the device
Aborted (core dumped)
Can anyone help me resolve this?
EDIT
Here is the output of nvidia-smi and nvcc --version; both were installed with pacman.

"iterating over `tf.Tensor` is not allowed in Graph execution." while using numpy_function to return more than one variable

I'm not sure if this just isn't intended, a bug, or if I'm doing it wrong. I'm trying to use a library called albumentations to augment both images and their corresponding keypoints. However, while following the guide here and attempting to add keypoint augmentation, I got the following error:
OperatorNotAllowedInGraphError: iterating over `tf.Tensor` is not allowed in Graph execution. Use Eager execution or decorate this function with @tf.function.
The error is raised on the line that calls numpy_function. Is there a workaround that lets me return both the image and the keypoints? I would obviously like to keep all the features tf.data brings, and not have to do augmentation outside the pipeline.
The following code reproduces this error on my machine:
import tensorflow as tf

def aug_fn(image, keypoints):
    # Augment image and keypoints here
    return image, keypoints

def process_data(data):
    image = data[0]
    keypoints = data[1]
    image, keypoints = tf.numpy_function(func=aug_fn, inp=[image, keypoints], Tout=tf.float32)
    return image, keypoints

if __name__ == '__main__':
    dummy_data = [tf.zeros((50, 50, 3))[(10, 20)]]
    dataset = tf.data.Dataset.from_tensor_slices(dummy_data)
    dataset = dataset.map(process_data)
    for s in dataset:
        print(s)
And this is the whole error message:
/home/ab/anaconda3/bin/python3.8 /snap/pycharm-community/214/plugins/python-ce/helpers/pydev/pydevd.py --multiproc --qt-support=auto --client 127.0.0.1 --port 45661 --file /home/ab/Documents/GitLab/landing-page-dl-model/test.py
pydev debugger: process 85444 is connecting
Connected to pydev debugger (build 202.7660.27)
2020-10-29 16:49:33.515315: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-29 16:49:35.788599: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-10-29 16:49:35.821091: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 16:49:35.821839: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 15 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 238.66GiB/s
2020-10-29 16:49:35.821870: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-29 16:49:35.823450: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-10-29 16:49:35.824841: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-10-29 16:49:35.825091: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-10-29 16:49:35.826777: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-10-29 16:49:35.827672: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-10-29 16:49:35.830437: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-10-29 16:49:35.830552: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 16:49:35.831208: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 16:49:35.831719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-10-29 16:49:35.832054: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-10-29 16:49:35.837064: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3392040000 Hz
2020-10-29 16:49:35.837401: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558a0dab57b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-29 16:49:35.837414: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-10-29 16:49:35.924237: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 16:49:35.924685: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558a0db49630 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-10-29 16:49:35.924702: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 1070, Compute Capability 6.1
2020-10-29 16:49:35.924899: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 16:49:35.925308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 15 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 238.66GiB/s
2020-10-29 16:49:35.925339: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-29 16:49:35.925368: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-10-29 16:49:35.925387: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-10-29 16:49:35.925404: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-10-29 16:49:35.925422: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-10-29 16:49:35.925439: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-10-29 16:49:35.925455: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-10-29 16:49:35.925516: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 16:49:35.925921: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 16:49:35.927635: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-10-29 16:49:35.927680: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-29 16:49:36.255274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-29 16:49:36.255309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2020-10-29 16:49:36.255321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2020-10-29 16:49:36.255529: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 16:49:36.255965: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 16:49:36.256337: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7101 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1)
Traceback (most recent call last):
File "/home/ab/anaconda3/lib/python3.8/contextlib.py", line 131, in __exit__
self.gen.throw(type, value, traceback)
File "/home/ab/.local/lib/python3.8/site-packages/tensorflow/python/training/tracking/tracking.py", line 178, in resource_tracker_scope
yield
File "/home/ab/.local/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 3371, in __init__
self._function = wrapper_fn.get_concrete_function()
File "/home/ab/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2938, in get_concrete_function
graph_function = self._get_concrete_function_garbage_collected(
File "/home/ab/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2906, in _get_concrete_function_garbage_collected
graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
File "/home/ab/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3213, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/home/ab/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3065, in _create_graph_function
func_graph_module.func_graph_from_py_func(
File "/home/ab/.local/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 986, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/home/ab/.local/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 3364, in wrapper_fn
ret = _wrapper_helper(*args)
File "/home/ab/.local/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 3299, in _wrapper_helper
ret = autograph.tf_convert(func, ag_ctx)(*nested_args)
File "/home/ab/.local/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 258, in wrapper
raise e.ag_error_metadata.to_exception(e)
tensorflow.python.framework.errors_impl.OperatorNotAllowedInGraphError: in user code:
/home/ab/Documents/GitLab/landing-page-dl-model/test.py:12 process_data *
image, keypoints = tf.numpy_function(func=aug_fn, inp=[image, keypoints], Tout=tf.float32)
/home/ab/.local/lib/python3.8/site-packages/tensorflow/python/framework/ops.py:503 __iter__
self._disallow_iteration()
/home/ab/.local/lib/python3.8/site-packages/tensorflow/python/framework/ops.py:499 _disallow_iteration
self._disallow_in_graph_mode("iterating over `tf.Tensor`")
/home/ab/.local/lib/python3.8/site-packages/tensorflow/python/framework/ops.py:477 _disallow_in_graph_mode
raise errors.OperatorNotAllowedInGraphError(
OperatorNotAllowedInGraphError: iterating over `tf.Tensor` is not allowed in Graph execution. Use Eager execution or decorate this function with @tf.function.
python-BaseException
Process finished with exit code 130 (interrupted by signal 2: SIGINT)
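One possible workaround, offered as a sketch rather than as an answer from the original thread: when Tout is a single dtype, tf.numpy_function returns a single tensor, and unpacking it into two names is what triggers the iteration error in graph mode. Passing a list with one dtype per returned array makes it return a list of tensors instead. The dummy data layout below (a tuple of an image batch and a keypoint batch) and the float32 casts are assumptions made for this example.

```python
import numpy as np
import tensorflow as tf


def aug_fn(image, keypoints):
    # Placeholder augmentation: returns the inputs unchanged as float32 arrays.
    return image.astype(np.float32), keypoints.astype(np.float32)


def process_data(image, keypoints):
    # One dtype per returned array, so the result is a list of two tensors
    # that can be unpacked while the map function is traced as a graph.
    image, keypoints = tf.numpy_function(
        func=aug_fn, inp=[image, keypoints], Tout=[tf.float32, tf.float32])
    return image, keypoints


# Assumed dummy data: one 50x50 RGB image with one (x, y) keypoint.
images = tf.zeros((1, 50, 50, 3))
keypoints = tf.constant([[10.0, 20.0]])

dataset = tf.data.Dataset.from_tensor_slices((images, keypoints))
dataset = dataset.map(process_data)

for image, kp in dataset:
    print(image.shape, kp.numpy())
```

Tensors returned by tf.numpy_function carry no static shape information inside the graph, so later pipeline stages that need known shapes may require an explicit set_shape call on each output.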

Exporting Tensorflow Model - AssertionError: No checkpoint specified (save_path=None); nothing is being restored

I'm using Google Colab (an Ubuntu machine) with TensorFlow 2.3.0, and working through the example from here:
TensorFlow 2 Training Custom Model
This is my code:
!python exporter_main_v2.py --input_type image_tensor --pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config --trained_checkpoint_dir=/models/my_ssd_resnet50_v1_fpn --output_directory=exported-models/my_model/
I'm getting the following error:
2020-09-06 08:03:23.830447: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-06 08:03:25.844063: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-09-06 08:03:25.879149: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-06 08:03:25.879813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.73GiB deviceMemoryBandwidth: 298.08GiB/s
2020-09-06 08:03:25.879853: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-06 08:03:25.881273: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-09-06 08:03:25.882999: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-09-06 08:03:25.883384: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-09-06 08:03:25.885102: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-09-06 08:03:25.886330: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-09-06 08:03:25.889988: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-09-06 08:03:25.890105: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-06 08:03:25.891047: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-06 08:03:25.891854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-09-06 08:03:25.901457: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2200000000 Hz
2020-09-06 08:03:25.901653: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2cdd480 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-09-06 08:03:25.901678: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-09-06 08:03:26.012959: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-06 08:03:26.013665: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2cdd640 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-09-06 08:03:26.013697: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2020-09-06 08:03:26.013935: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-06 08:03:26.014510: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.73GiB deviceMemoryBandwidth: 298.08GiB/s
2020-09-06 08:03:26.014556: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-06 08:03:26.014600: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-09-06 08:03:26.014625: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-09-06 08:03:26.014647: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-09-06 08:03:26.014667: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-09-06 08:03:26.014689: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-09-06 08:03:26.014712: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-09-06 08:03:26.014784: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-06 08:03:26.015364: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-06 08:03:26.015875: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-09-06 08:03:26.015919: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-06 08:03:26.651590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-06 08:03:26.651650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2020-09-06 08:03:26.651663: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2020-09-06 08:03:26.651874: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-06 08:03:26.652564: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-06 08:03:26.653153: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2020-09-06 08:03:26.653195: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13962 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
Traceback (most recent call last):
File "exporter_main_v2.py", line 159, in <module>
app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "exporter_main_v2.py", line 155, in main
FLAGS.side_input_types, FLAGS.side_input_names)
File "/usr/local/lib/python3.6/dist-packages/object_detection-0.1-py3.6.egg/object_detection/exporter_lib_v2.py", line 260, in export_inference_graph
status.assert_existing_objects_matched()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/util.py", line 885, in assert_existing_objects_matched
"No checkpoint specified (save_path=None); nothing is being restored.")
AssertionError: No checkpoint specified (save_path=None); nothing is being restored.
I have already worked through a different example using TensorFlow 1, ran into what I think is the same problem, and asked for help here:
Stack Overflow question
There are multiple checkpoint files in the directory specified. The training seemed to run as it should.
I'm really stumped. Can anyone please help?
Snapshot added as requested:
Follow these steps:
1. Open exporter_lib_v2.py:
models/research/object_detection/exporter_lib_v2.py
2. Comment out these lines (you will find them roughly between lines 265 and 279):
# status.assert_existing_objects_matched()
# concrete_function = detection_module.__call__.get_concrete_function()
3. Replace concrete_function with None in this line:
tf.saved_model.save(detection_module, output_saved_model_directory, signatures=None)
4. Reinstall the Object Detection API:
cd models/research
sudo python3 setup.py install
There should be only one checkpoint in the trained_checkpoint_dir. Remove unnecessary checkpoints.
Remove the leading / from the path passed to --trained_checkpoint_dir:
#!python exporter_main_v2.py --input_type image_tensor --pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config --trained_checkpoint_dir=models/my_ssd_resnet50_v1_fpn --output_directory=exported-models/my_model/
That should solve your problem.
Check that your pipeline config sets fine_tune_checkpoint_type: "detection".
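The save_path=None in the AssertionError suggests that no checkpoint was found under the path given to --trained_checkpoint_dir. Below is a minimal sketch for inspecting that directory before exporting. It assumes the usual TF2 layout (a text file named checkpoint next to ckpt-N.index files); the helper name is hypothetical and is not part of TensorFlow or the Object Detection API.

```python
import os
import re


def inspect_checkpoint_dir(directory):
    """Report what a TF2-style checkpoint directory contains.

    Assumes the usual layout: a text file named 'checkpoint' plus
    'ckpt-<N>.index' / 'ckpt-<N>.data-*' files. Hypothetical helper.
    """
    if not os.path.isdir(directory):
        return {"exists": False, "has_state_file": False, "prefixes": []}
    names = os.listdir(directory)
    # Each checkpoint has one '.index' file; stripping the suffix gives
    # the prefix that a restore call would be pointed at.
    numbered = []
    for name in names:
        match = re.fullmatch(r"(ckpt-(\d+))\.index", name)
        if match:
            numbered.append((int(match.group(2)), match.group(1)))
    return {
        "exists": True,
        "has_state_file": "checkpoint" in names,
        "prefixes": [prefix for _, prefix in sorted(numbered)],
    }


# Path taken from the command above; adjust it to your own layout.
print(inspect_checkpoint_dir("models/my_ssd_resnet50_v1_fpn"))
```

If exists or has_state_file comes back False, the exporter probably has nothing to restore, which would be consistent with the error shown in the question.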

Am I using tensorflow GPU?

I'm using TensorFlow-GPU 1.14 on Ubuntu 16.04.
Since I'm not familiar with TensorFlow, I wonder whether I'm actually using the GPU or not.
I have
GeForce GTX 1060
Nvidia-driver 418
CUDA 10.0
cuDNN v7.6.5
When I run my code, I always get these messages:
WARNING:tensorflow:From /home/mine/catkin_ws/src/PROJECT/project6_3/src/ddpg.py:26: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.
2020-06-24 20:29:13.827441: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-24 20:29:13.834067: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-06-24 20:29:13.930412: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-24 20:29:13.931260: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x6a40a50 executing computations on platform CUDA. Devices:
2020-06-24 20:29:13.931277: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce GTX 1060 6GB, Compute Capability 6.1
2020-06-24 20:29:13.959129: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3000000000 Hz
2020-06-24 20:29:13.959392: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x6ab1d70 executing computations on platform Host. Devices:
2020-06-24 20:29:13.959409: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2020-06-24 20:29:13.959576: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-24 20:29:13.960326: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.759
pciBusID: 0000:01:00.0
2020-06-24 20:29:13.961867: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-06-24 20:29:13.988711: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-06-24 20:29:14.002012: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2020-06-24 20:29:14.006381: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2020-06-24 20:29:14.038179: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2020-06-24 20:29:14.057922: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2020-06-24 20:29:14.114149: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-06-24 20:29:14.114248: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-24 20:29:14.115060: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-24 20:29:14.115765: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2020-06-24 20:29:14.116472: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-06-24 20:29:14.118350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-24 20:29:14.118378: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2020-06-24 20:29:14.118386: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2020-06-24 20:29:14.119144: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-24 20:29:14.119963: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-24 20:29:14.120610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4889 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
WARNING:tensorflow:From /home/mine/catkin_ws/src/PROJECT/project6_3/src/actor_network_bn.py:75: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
WARNING:tensorflow:From /home/mine/catkin_ws/src/PROJECT/project6_3/src/actor_network_bn.py:177: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.
WARNING:tensorflow:From /home/mine/catkin_ws/src/PROJECT/project6_3/src/actor_network_bn.py:178: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.
Does anyone know what these messages mean?
Am I using the GPU properly?
(When I check how much of my GPU is being used with the following command,
$ watch -d -n 0.5 nvidia-smi
it always reports 1407 MiB / 6000 MiB in use.)
Also, should I modify my code as the WARNING messages suggest?
(It currently works reasonably well.)
Thanks in advance. :)
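Am I using TensorFlow GPU?
If the code below returns an entry with device_type='GPU', there is no issue with your TensorFlow GPU installation and you are good to go. (On TensorFlow 1.14, as used in the question, tf.config.list_physical_devices may not be available; tf.test.is_gpu_available() is an alternative check there.)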
import tensorflow as tf
tf.config.list_physical_devices('GPU')
Output:
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2020-06-24 20:29:14.120610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4889 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
The line Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4889 MB memory) in your log means that TensorFlow is using the GPU.
2020-06-24 20:29:13.961867: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-06-24 20:29:13.988711: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-06-24 20:29:14.002012: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2020-06-24 20:29:14.006381: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2020-06-24 20:29:14.038179: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
These are just informational messages, as indicated by the I after the timestamp. Warnings are marked with W and errors with E.
The WARNING:tensorflow: messages tell you to replace deprecated names with their newer equivalents (e.g. those under tf.compat.v1), so that the same code can also run on TF 2.x.
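To make the I/W/E distinction concrete, here is a small sketch that tallies log lines by severity and picks out the "Created TensorFlow device" lines. The regular expressions are assumptions based on the log format shown in the question, not an official TensorFlow format specification, and the function name is illustrative.

```python
import re
from collections import Counter

# Assumed line shape: "<date> <time>: <severity letter> <file>:<line>] <message>"
LOG_LINE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+: ([IWEF]) ")
CREATED_DEVICE = re.compile(
    r"Created TensorFlow device \(\S*/device:GPU:(\d+) with (\d+) MB memory\)"
)


def summarize(log_text):
    """Count lines per severity and collect (gpu_index, memory_mb) pairs."""
    severities = Counter()
    devices = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match:
            severities[match.group(1)] += 1
        created = CREATED_DEVICE.search(line)
        if created:
            devices.append((int(created.group(1)), int(created.group(2))))
    return severities, devices


# Two lines copied from the log in the question.
sample = """\
2020-06-24 20:29:13.834067: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-06-24 20:29:14.120610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4889 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
"""
severities, devices = summarize(sample)
print(dict(severities))  # {'I': 2}
print(devices)           # [(0, 4889)]
```

A non-empty devices list is the same signal the answer points to: TensorFlow created a GPU device. Lines starting with WARNING:tensorflow: come from the Python side and are deliberately not counted by this pattern.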

Failed to initialize cuDNN

While trying to use distributed training, I ran into a cuDNN error:
Failed to initialize cuDNN
The GPU has enough memory; nvidia-smi tells me that only 102 MB is in use.
CUDA 10.1, TensorFlow 2.1, cuDNN 7.6.5
All versions are up to date.
I even set config.gpu_options.allow_growth = True.
Can anyone help me resolve this issue?
Complete error log:
2020-04-29 09:01:30.184547: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.0/lib64
2020-04-29 09:01:30.184643: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.0/lib64
2020-04-29 09:01:30.184660: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-04-29 09:01:30.728942: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-29 09:01:30.801630: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:30.802477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-29 09:01:30.802600: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:30.803364: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:00:05.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-29 09:01:30.804118: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-29 09:01:30.819996: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-29 09:01:30.822466: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-29 09:01:30.823123: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-29 09:01:30.827450: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-29 09:01:30.830013: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-29 09:01:30.835374: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-29 09:01:30.835527: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:30.836349: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:30.837131: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:30.837895: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:30.838568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-04-29 09:01:30.838902: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-04-29 09:01:30.846028: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2000134999 Hz
2020-04-29 09:01:30.846366: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4522c60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-29 09:01:30.846396: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-04-29 09:01:31.066134: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.069463: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.070509: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4598920 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-04-29 09:01:31.070536: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2020-04-29 09:01:31.070542: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): Tesla T4, Compute Capability 7.5
2020-04-29 09:01:31.071050: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.071823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-29 09:01:31.071949: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.072675: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:00:05.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-29 09:01:31.072750: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-29 09:01:31.072792: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-29 09:01:31.072814: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-29 09:01:31.072837: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-29 09:01:31.072855: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-29 09:01:31.072875: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-29 09:01:31.072897: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-29 09:01:31.072978: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.073755: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.074494: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.075230: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.075934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-04-29 09:01:31.076050: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-29 09:01:31.078176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-29 09:01:31.078206: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0 1
2020-04-29 09:01:31.078213: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N N
2020-04-29 09:01:31.078229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 1: N N
2020-04-29 09:01:31.078466: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.079271: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.080016: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.080730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14249 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
2020-04-29 09:01:31.081303: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-29 09:01:31.082033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14249 MB memory) -> physical GPU (device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5)
WARNING:tensorflow:From /home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/ops/image_ops_impl.py:1556: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Train for 390 steps
Epoch 1/60
2020-04-29 09:01:55.079654: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-29 09:02:05.724980: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:150] Filling up shuffle buffer (this may take a while): 33418 of 50000
2020-04-29 09:02:10.436522: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:199] Shuffle buffer filled.
2020-04-29 09:02:10.438717: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-29 09:02:11.453483: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library: 7.3.1 but source was compiled with: 7.6.4. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2020-04-29 09:02:11.456625: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library: 7.3.1 but source was compiled with: 7.6.4. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2020-04-29 09:02:11.482157: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node replica_1/resnet56/conv1/Conv2D}}]]
[[metrics/sparse_categorical_accuracy/div_no_nan/AddN_1/_28]]
2020-04-29 09:02:11.482214: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node replica_1/resnet56/conv1/Conv2D}}]]
[[replica_1/loss/ArithmeticOptimizer/HoistCommonFactor_Mul_add_1/_8]]
2020-04-29 09:02:11.482314: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node replica_1/resnet56/conv1/Conv2D}}]]
2020-04-29 09:02:12.282331: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library: 7.3.1 but source was compiled with: 7.6.4. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2020-04-29 09:02:12.284817: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library: 7.3.1 but source was compiled with: 7.6.4. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
1/390 [..............................] - ETA: 3:47:17
Traceback (most recent call last):
File "worker.py", line 88, in <module>
epochs=NUM_EPOCHS)
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
use_multiprocessing=use_multiprocessing)
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit
total_epochs=epochs)
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch
batch_outs = execution_function(iterator)
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function
distributed_function(input_fn))
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
result = self._call(*args, **kwds)
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 632, in _call
return self._stateless_fn(*args, **kwds)
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2363, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1611, in _filtered_call
self.captured_inputs)
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
ctx=ctx)
File "/home/christie/yolo2/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node replica_1/resnet56/conv1/Conv2D (defined at usr/lib/python3.6/threading.py:916) ]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node replica_1/resnet56/conv1/Conv2D (defined at usr/lib/python3.6/threading.py:916) ]]
[[metrics/sparse_categorical_accuracy/div_no_nan/AddN_1/_28]]
0 successful operations.
1 derived errors ignored. [Op:__inference_distributed_function_35914]
Function call stack:
distributed_function -> distributed_function
2020-04-29 09:02:12.502074: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-04-29 09:02:12.502969: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
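The E lines in the log above spell out the compatibility rule: the cuDNN library loaded at runtime (7.3.1) has to share the major version with the one TensorFlow was compiled against (7.6.4) and, for cuDNN 7.0 or later, have an equal or higher minor version. The sketch below encodes that rule as worded in the message. The function name is hypothetical, the branch for versions before 7.0 is an assumed reading of the message, and this is not TensorFlow's actual check.

```python
def cudnn_runtime_compatible(loaded, compiled):
    """Apply the rule quoted in the cuda_dnn.cc error message.

    Both arguments are version strings such as "7.6.4".
    """
    loaded_major, loaded_minor = (int(p) for p in loaded.split(".")[:2])
    compiled_major, compiled_minor = (int(p) for p in compiled.split(".")[:2])
    if loaded_major != compiled_major:
        return False
    if compiled_major >= 7:
        # cuDNN 7.0 or later: a higher runtime minor version is accepted.
        return loaded_minor >= compiled_minor
    # Assumed reading for older versions: the minor version must match.
    return loaded_minor == compiled_minor


print(cudnn_runtime_compatible("7.3.1", "7.6.4"))  # False: the case in the log
print(cudnn_runtime_compatible("7.6.5", "7.6.4"))  # True: the version listed in the question
```

The question lists cuDNN 7.6.5, yet the log reports 7.3.1 as the loaded runtime library, and LD_LIBRARY_PATH in the log points at /usr/local/cuda-10.0/lib64. One thing worth checking, then, is whether an older libcudnn.so.7 is being picked up from that directory.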