I am trying to run TensorFlow on my GPU and have followed the instructions at this link. After running the commands in Step 6, I get the expected output.
Then, when I try to run the actual model I am building, I get the following error.
2023-01-06 18:39:14.692537: W tensorflow/compiler/xla/service/gpu/nvptx_helper.cc:56] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
./cuda_sdk_lib
/usr/local/cuda-11.2
/usr/local/cuda
.
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2023-01-06 18:39:14.693094: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-01-06 18:39:14.693196: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
2023-01-06 18:39:14.693275: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-01-06 18:39:14.704458: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-01-06 18:39:14.704603: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
Traceback (most recent call last):
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_Main.py", line 42, in <module>
main(sys.argv[1:])
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_Main.py", line 27, in main
model.train()
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_V5.py", line 99, in train
history = self.model.fit(x, y, batch_size = batchSize, epochs = epochs)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
Detected at node 'StatefulPartitionedCall_10' defined at (most recent call last):
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_Main.py", line 42, in <module>
main(sys.argv[1:])
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_Main.py", line 27, in main
model.train()
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_V5.py", line 99, in train
history = self.model.fit(x, y, batch_size = batchSize, epochs = epochs)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1650, in fit
tmp_logs = self.train_function(iterator)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1249, in train_function
return step_function(self, iterator)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1233, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1222, in run_step
outputs = model.train_step(data)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1027, in train_step
self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 527, in minimize
self.apply_gradients(grads_and_vars)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1140, in apply_gradients
return super().apply_gradients(grads_and_vars, name=name)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 634, in apply_gradients
iteration = self._internal_apply_gradients(grads_and_vars)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1166, in _internal_apply_gradients
return tf.__internal__.distribute.interim.maybe_merge_call(
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1216, in _distributed_apply_gradients_fn
distribution.extended.update(
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1211, in apply_grad_to_update_var
return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: 'StatefulPartitionedCall_10'
libdevice not found at ./libdevice.10.bc
[[{{node StatefulPartitionedCall_10}}]] [Op:__inference_train_function_8591]
After doing some research, it appears that the relevant errors are the following:
2023-01-06 18:39:14.692537: W tensorflow/compiler/xla/service/gpu/nvptx_helper.cc:56] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
./cuda_sdk_lib
/usr/local/cuda-11.2
/usr/local/cuda
.
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2023-01-06 18:39:14.693094: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-01-06 18:39:14.693196: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
2023-01-06 18:39:14.693275: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-01-06 18:39:14.704458: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-01-06 18:39:14.704603: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
For context, this is running on Ubuntu 20.04 with Python 3.9. Any ideas on how to fix this?
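For reference, the workaround the log itself suggests would look roughly like this. It is only a sketch, and the CUDA path is an assumption: point it at whatever directory on your machine actually contains nvvm/libdevice/libdevice.10.bc (with a conda-installed CUDA toolkit, that may be under the environment prefix instead).
import os

# Sketch of the workaround hinted at in the log: tell XLA where the CUDA data
# directory lives. /usr/local/cuda-11.2 is an assumption; substitute the
# directory that really contains nvvm/libdevice/libdevice.10.bc.
os.environ["XLA_FLAGS"] = "--xla_gpu_cuda_data_dir=/usr/local/cuda-11.2"

# The flag must be set before TensorFlow initializes XLA, so import afterwards.
import tensorflow as tf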
Related
I'm new to the TF Object Detection API 2.
After training the model, you can run an evaluation process to check its accuracy.
But when I tried to run the evaluation, I got the error below. I'm using EfficientDet as the backbone.
I was able to run the evaluation at a scaling resolution of 512, but 640 fails with the error below.
This is the Python file I called that ended with the exception below:
/tensorflow/models/research/object_detection/model_main_tf2.py
Call arguments received:
• inputs=tf.Tensor(shape=(1, 480, 640, 3), dtype=float32)
• kwargs={'training': 'False'}
INFO:tensorflow:A replica probably exhausted all examples. Skipping pending examples on other replicas.
I0719 06:49:27.115007 140042699994880 model_lib_v2.py:943] A replica probably exhausted all examples. Skipping pending examples on other replicas.
Traceback (most recent call last):
File "/home/pictcompute/effient_net_ve/tensorflow/models/research/object_detection/model_main_tf2.py", line 115, in <module>
tf.compat.v1.app.run()
File "/home/pictcompute/effient_net_ve/lib/python3.8/site-packages/tensorflow/python/platform/app.py", line 36, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/pictcompute/effient_net_ve/lib/python3.8/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/pictcompute/effient_net_ve/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/home/pictcompute/effient_net_ve/tensorflow/models/research/object_detection/model_main_tf2.py", line 82, in main
model_lib_v2.eval_continuously(
File "/home/pictcompute/effient_net_ve/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 1159, in eval_continuously
eager_eval_loop(
File "/home/pictcompute/effient_net_ve/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 1009, in eager_eval_loop
for evaluator in evaluators:
TypeError: 'NoneType' object is not iterable
Any help is highly appreciated. Thanks.
The error occurs when you try to iterate over a None value. For example:
mylist = None
for x in mylist:
    print(x)
TypeError                                 Traceback (most recent call last)
<ipython-input-2-a63d8b17c4a7> in <module>
      1 mylist = None
      2
----> 3 for x in mylist:
      4     print(x)
TypeError: 'NoneType' object is not iterable
The error can be avoided by checking whether the value is None before iterating over it. Thank you.
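For instance, a minimal guard (a generic sketch, not specific to the Object Detection API):
mylist = None
# Iterate only when there is actually a list to iterate over.
if mylist is not None:
    for x in mylist:
        print(x)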
I am testing Yolo-v3 (https://github.com/experiencor/keras-yolo3) with tensorflow-gpu 1.15 and Keras 2.3.1. The training process is started by:
runfile("train.py",'-c config.json')
Here are the printed out messages:
Using TensorFlow backend.
WARNING:tensorflow:From train.py:40: The name tf.keras.backend.set_session is deprecated. Please use tf.compat.v1.keras.backend.set_session instead.
valid_annot_folder not exists. Spliting the trainining set.
Seen labels: {'kangaroo': 266}
Given labels: ['kangaroo']
Training on: ['kangaroo']
WARNING:tensorflow:From C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\ops\resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
.....
Loading pretrained weights.
C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\callbacks\callbacks.py:998: UserWarning: `epsilon` argument is deprecated and will be removed, use `min_delta` instead.
warnings.warn('`epsilon` argument is deprecated and '
Traceback (most recent call last):
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call
return fn(*args)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1348, in _run_fn
self._extend_graph()
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1388, in _extend_graph
tf_session.ExtendSession(self._session)
InvalidArgumentError: Cannot assign a device for operation replica_0/lambda_1/Shape: {{node replica_0/lambda_1/Shape}} was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.
[[replica_0/lambda_1/Shape]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 305, in <module>
_main_(args)
File "train.py", line 282, in _main_
max_queue_size = 8
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\engine\training.py", line 1732, in fit_generator
initial_epoch=initial_epoch)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\engine\training_generator.py", line 42, in fit_generator
model._make_train_function()
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\engine\training.py", line 333, in _make_train_function
**self._function_kwargs)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\backend\tensorflow_backend.py", line 3006, in function
v1_variable_initialization()
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\backend\tensorflow_backend.py", line 420, in v1_variable_initialization
session = get_session()
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\backend\tensorflow_backend.py", line 385, in get_session
return tf_keras_backend.get_session()
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\keras\backend.py", line 486, in get_session
_initialize_variables(session)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\keras\backend.py", line 903, in _initialize_variables
[variables_module.is_variable_initialized(v) for v in candidate_vars])
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run
run_metadata_ptr)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1359, in _do_run
run_metadata)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
InvalidArgumentError: Cannot assign a device for operation replica_0/lambda_1/Shape: node replica_0/lambda_1/Shape (defined at C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.
[[replica_0/lambda_1/Shape]]
I don't understand what caused the InvalidArgumentError. Is my tensorflow-gpu not installed correctly? Or is there some conflict in deploying the GPU?
Try changing the "gpus" value in your config to "0" if it is anything else. It should work if you are executing on a GPU.
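To confirm whether TensorFlow can see a GPU at all (the traceback above lists only a CPU device), a quick TF 1.x check is:
from tensorflow.python.client import device_lib

# Lists every device TensorFlow has registered; a working GPU setup should
# show a /device:GPU:0 entry alongside the CPU.
print(device_lib.list_local_devices())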
When trying to reload the official TensorFlow Models ResNet-50 checkpoint from here:
http://download.tensorflow.org/models/official/20181001_resnet/checkpoints/resnet_imagenet_v1_fp32_20181001.tar.gz
...using this code:
import os
import tensorflow as tf
print(tf.__version__)
saver = tf.train.import_meta_graph(os.path.join(
    'resnet_imagenet_v1_fp32_20181001',
    'model.ckpt-225207.meta'))
I get this error:
1.13.1
Traceback (most recent call last):
File "chehckpoint_to_savedmodel.py", line 11, in <module>
'model.ckpt-225207.meta'))
File "/Users/*user*/Library/Python/3.7/lib/python/site-packages/tensorflow/python/training/saver.py", line 1435, in import_meta_graph
meta_graph_or_file, clear_devices, import_scope, **kwargs)[0]
File "/Users/*user*/Library/Python/3.7/lib/python/site-packages/tensorflow/python/training/saver.py", line 1457, in _import_meta_graph_with_return_elements
**kwargs))
File "/Users/*user*/Library/Python/3.7/lib/python/site-packages/tensorflow/python/framework/meta_graph.py", line 806, in import_scoped_meta_graph_with_return_elements
return_elements=return_elements)
File "/Users/*user*/Library/Python/3.7/lib/python/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/Users/*user*/Library/Python/3.7/lib/python/site-packages/tensorflow/python/framework/importer.py", line 399, in import_graph_def
_RemoveDefaultAttrs(op_dict, producer_op_list, graph_def)
File "/Users/*user*/Library/Python/3.7/lib/python/site-packages/tensorflow/python/framework/importer.py", line 159, in _RemoveDefaultAttrs
op_def = op_dict[node.op]
KeyError: 'ExperimentalFunctionBufferingResource'
Funny that googling "KeyError: 'ExperimentalFunctionBufferingResource'" returns zero hits. That's a first.
Ideas?
Not sure how else to reload this model. I also tried this:
path = os.path.join(
    'resnet_imagenet_v1_fp32_20181001',
    'model.ckpt-225207')
checkpoint = tf.train.Checkpoint()
status = checkpoint.restore(path)
print(status)
status.assert_consumed()
But it fails the assertion with no other information.
Thanks in advance.
This seems to be an issue with TF >= 1.13. Try downgrading to 1.12 (for example, pip install tensorflow==1.12.0); it should work.
An issue to track: #29751
I'm using 3 tf.contrib.cudnn_rnn.CudnnLSTM(1, 128, direction='bidirectional') layers with a batch size of 32 on an AWS p2.xlarge instance. The exact same configuration works correctly with non-eager (standard) TensorFlow. Here is the error log:
2018-04-27 18:15:59.139739: E tensorflow/stream_executor/cuda/cuda_dnn.cc:1520] Failed to allocate RNN workspace of 74252288 bytes.
2018-04-27 18:15:59.139758: E tensorflow/stream_executor/cuda/cuda_dnn.cc:1697] Unable to create rnn workspace
Traceback (most recent call last):
File "tf_run_eager.py", line 424, in <module>
run_experiments()
File "tf_run_eager.py", line 417, in run_experiments
train_losses.append(model.optimize(bX, bY).numpy())
File "tf_run_eager.py", line 397, in optimize
loss, grads_and_vars = self.loss(phoneme_features, utterances)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/eager/backprop.py", line 233, in grad_fn
sources)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/eager/imperative_grad.py", line 65, in imperative_grad
tape._tape, vspace, target, sources, output_gradients, status) # pylint: disable=protected-access
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/eager/backprop.py", line 141, in grad_fn
op_inputs, op_outputs, orig_outputs)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/eager/backprop.py", line 109, in _magic_gradient_function
return grad_fn(mock_op, *out_grads)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1609, in _cudnn_rnn_backward
direction=op.get_attr("direction"))
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/ops/gen_cudnn_rnn_ops.py", line 320, in cudnn_rnn_backprop
_six.raise_from(_core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: Failed to call ThenRnnBackward [Op:CudnnRNNBackprop]
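Since the failure is a workspace allocation error ("Failed to allocate RNN workspace of 74252288 bytes"), one possibility is plain GPU memory pressure in eager mode. A sketch of one thing to try under that assumption, using the TF 1.x API of this era (the other obvious lever is simply reducing the batch size):
import tensorflow as tf

# Assumption: eager mode leaves too little free GPU memory for the cuDNN RNN
# workspace. allow_growth makes TensorFlow claim memory incrementally instead
# of reserving the whole GPU up front.
config = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))
tf.enable_eager_execution(config=config)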
Hello,
I'm using TensorFlow v1.4.0, and when I try to start a TensorBoard session with the following command:
tensorboard --logdir="folder_path"
I get this error:
2018-04-11 17:18:44.422839: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:02:00.0
totalMemory: 11,91GiB freeMemory: 11,74GiB
2018-04-11 17:18:44.467559: E tensorflow/core/common_runtime/direct_session.cc:167] Internal: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE
Traceback (most recent call last):
File "/usr/local/bin/tensorboard", line 11, in <module>
sys.exit(run_main())
File "/usr/local/lib/python3.5/dist-packages/tensorboard/main.py", line 36, in run_main
tf.app.run(main)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/usr/local/lib/python3.5/dist-packages/tensorboard/main.py", line 45, in main
default.get_assets_zip_provider())
File "/usr/local/lib/python3.5/dist-packages/tensorboard/program.py", line 166, in main
tb = create_tb_app(plugins, assets_zip_provider)
File "/usr/local/lib/python3.5/dist-packages/tensorboard/program.py", line 200, in create_tb_app
window_title=FLAGS.window_title)
File "/usr/local/lib/python3.5/dist-packages/tensorboard/backend/application.py", line 124, in standard_tensorboard_wsgi
plugin_instances = [constructor(context) for constructor in plugins]
File "/usr/local/lib/python3.5/dist-packages/tensorboard/backend/application.py", line 124, in <listcomp>
plugin_instances = [constructor(context) for constructor in plugins]
File "/usr/local/lib/python3.5/dist-packages/tensorboard/plugins/beholder/beholder_plugin.py", line 47, in __init__
self.most_recent_frame = im_util.get_image_relative_to_script('no-data.png')
File "/usr/local/lib/python3.5/dist-packages/tensorboard/plugins/beholder/im_util.py", line 277, in get_image_relative_to_script
return read_image(filename)
File "/usr/local/lib/python3.5/dist-packages/tensorboard/plugins/beholder/im_util.py", line 265, in read_image
return np.array(decode_png(image_file.read()))
File "/usr/local/lib/python3.5/dist-packages/tensorboard/plugins/beholder/im_util.py", line 182, in __call__
self._lazily_initialize()
File "/usr/local/lib/python3.5/dist-packages/tensorboard/plugins/beholder/im_util.py", line 160, in _lazily_initialize
self._session = tf.Session(graph=graph, config=config)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1509, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 638, in __init__
self._session = tf_session.TF_NewDeprecatedSession(opts, status)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
TensorBoard worked when I used TensorFlow 1.6, but I don't think the version is the problem, because I tried version 1.6 again today and it no longer works.
My folder does contain an "event.out.po" file; I checked.
Do you know where the problem is?
Thank you
I found the problem. Before launching TensorBoard, this command must be run in the shell so that the GPU is used (the log shows CUDA device ordinal 1 failing to initialize, so restricting visibility to device 0 avoids it):
export CUDA_VISIBLE_DEVICES=0
If the previous command does not work, you can try hiding the GPUs entirely so everything runs on the CPU:
export CUDA_VISIBLE_DEVICES=''