I'm trying to train cGAN on g4dn.xlarge GPU ec2 machine and it crashes every time after 8 epochs exactly with the following message:
Traceback (most recent call last):
File "pix2pix_tf2.py", line 841, in <module>
main()
File "pix2pix_tf2.py", line 802, in main
results = sess.run(fetches, options=options, run_metadata=run_metadata)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 958, in run
run_metadata_ptr)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1181, in _run
feed_dict_tensor, options, run_metadata)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: 2 root error(s) found.
(0) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
(1) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
0 successful operations.
0 derived errors ignored.
[[{{node TensorArrayV2Write/TensorListSetItem}}]]
(1) Invalid argument: 2 root error(s) found.
(0) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
(1) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
0 successful operations.
0 derived errors ignored.
[[{{node TensorArrayV2Write/TensorListSetItem}}]]
[[Func/encode_images/target_pngs/while/body/_47/input/_154/_773]]
0 successful operations.
0 derived errors ignored.
env spec:
tensorflow 2.2.0
CUDA V10.0.130
cudnn 7.6.5
updating CUDA to 10.1 solved the issue
Related
I am trying to run tensorflow using my GPU and have followed the instructions at this link. After running the commands in Step 6, I get the proper output.
Then, when I try to run an actual model I am trying to build, I get the following error.
2023-01-06 18:39:14.692537: W tensorflow/compiler/xla/service/gpu/nvptx_helper.cc:56] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
./cuda_sdk_lib
/usr/local/cuda-11.2
/usr/local/cuda
.
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2023-01-06 18:39:14.693094: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-01-06 18:39:14.693196: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
2023-01-06 18:39:14.693275: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-01-06 18:39:14.704458: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-01-06 18:39:14.704603: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
Traceback (most recent call last):
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_Main.py", line 42, in <module>
main(sys.argv[1:])
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_Main.py", line 27, in main
model.train()
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_V5.py", line 99, in train
history = self.model.fit(x, y, batch_size = batchSize, epochs = epochs)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
Detected at node 'StatefulPartitionedCall_10' defined at (most recent call last):
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_Main.py", line 42, in <module>
main(sys.argv[1:])
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_Main.py", line 27, in main
model.train()
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_V5.py", line 99, in train
history = self.model.fit(x, y, batch_size = batchSize, epochs = epochs)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1650, in fit
tmp_logs = self.train_function(iterator)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1249, in train_function
return step_function(self, iterator)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1233, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1222, in run_step
outputs = model.train_step(data)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1027, in train_step
self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 527, in minimize
self.apply_gradients(grads_and_vars)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1140, in apply_gradients
return super().apply_gradients(grads_and_vars, name=name)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 634, in apply_gradients
iteration = self._internal_apply_gradients(grads_and_vars)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1166, in _internal_apply_gradients
return tf.__internal__.distribute.interim.maybe_merge_call(
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1216, in _distributed_apply_gradients_fn
distribution.extended.update(
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1211, in apply_grad_to_update_var
return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: 'StatefulPartitionedCall_10'
libdevice not found at ./libdevice.10.bc
[[{{node StatefulPartitionedCall_10}}]] [Op:__inference_train_function_8591]
After doing some research, it appears that the relevant errors are the following:
2023-01-06 18:39:14.692537: W tensorflow/compiler/xla/service/gpu/nvptx_helper.cc:56] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
./cuda_sdk_lib
/usr/local/cuda-11.2
/usr/local/cuda
.
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2023-01-06 18:39:14.693094: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-01-06 18:39:14.693196: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
2023-01-06 18:39:14.693275: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-01-06 18:39:14.704458: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-01-06 18:39:14.704603: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
Traceback (most recent call last):
For context, this is running in Ubuntu 20.04 and python 3.9. Any ideas on how to fix?
I'm new to tf object detection api 2.
After training the model you can run an evaluation process to check the accuracy of the model.
But when I tried to run I got the below error. I'm using the backbone as an efficientDet.
I was able to run the evaluation for scaling resolution 512 but 640 is failing with the below error.
This is the python file I called and ended up with the below error.
enter code here /tensorflow/models/research/object_detection/model_main_tf2.py
`enter code here`enter code here`Call arguments received:
• inputs=tf.Tensor(shape=(1, 480, 640, 3), dtype=float32)
• kwargs={'training': 'False'}
exception.
INFO:tensorflow:A replica probably exhausted all examples. Skipping pending examples on other replicas.
I0719 06:49:27.115007 140042699994880 model_lib_v2.py:943] A replica probably exhausted all examples. Skipping pending e
xamples on other replicas.
Traceback (most recent call last):
File "/home/pictcompute/effient_net_ve/tensorflow/models/research/object_detection/model_main_tf2.py", line 115, in <m
odule>
tf.compat.v1.app.run()
File "/home/pictcompute/effient_net_ve/lib/python3.8/site-packages/tensorflow/python/platform/app.py", line 36, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/pictcompute/effient_net_ve/lib/python3.8/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/pictcompute/effient_net_ve/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/home/pictcompute/effient_net_ve/tensorflow/models/research/object_detection/model_main_tf2.py", line 82, in mai
n
model_lib_v2.eval_continuously(
File "/home/pictcompute/effient_net_ve/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 1159, in ev
al_continuously
eager_eval_loop(
File "/home/pictcompute/effient_net_ve/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 1009, in ea
ger_eval_loop
for evaluator in evaluators:
TypeError: 'NoneType' object is not iterable
Highly appreciate your help.
Thanks
The error occurs when you try to iterate over a None value. For example
mylist = None
for x in mylist:
print(x)
TypeError Traceback (most recent call last)
<ipython-input-2-a63d8b17c4a7> in <module>
1 mylist = None
2
----> 3 for x in mylist:
4 print(x)
TypeError: 'NoneType' object is not iterable
The error can be avoided by checking if a value is None or not before iterating over it. Thank You.
I am testing Yolo-v3 (https://github.com/experiencor/keras-yolo3) with tensorflow-gpu 1.15 an keras 2.3.1. The training process is started by:
runfile("train.py",'-c config.json')
Here are the printed out messages:
Using TensorFlow backend.
WARNING:tensorflow:From train.py:40: The name tf.keras.backend.set_session is deprecated. Please use tf.compat.v1.keras.backend.set_session instead.
valid_annot_folder not exists. Spliting the trainining set.
Seen labels: {'kangaroo': 266}
Given labels: ['kangaroo']
Training on: ['kangaroo']
WARNING:tensorflow:From C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\ops\resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
.....
Loading pretrained weights.
C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\callbacks\callbacks.py:998: UserWarning: `epsilon` argument is deprecated and will be removed, use `min_delta` instead.
warnings.warn('`epsilon` argument is deprecated and '
Traceback (most recent call last):
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call
return fn(*args)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1348, in _run_fn
self._extend_graph()
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1388, in _extend_graph
tf_session.ExtendSession(self._session)
InvalidArgumentError: Cannot assign a device for operation replica_0/lambda_1/Shape: {{node replica_0/lambda_1/Shape}} was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.
[[replica_0/lambda_1/Shape]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 305, in <module>
_main_(args)
File "train.py", line 282, in _main_
max_queue_size = 8
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\engine\training.py", line 1732, in fit_generator
initial_epoch=initial_epoch)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\engine\training_generator.py", line 42, in fit_generator
model._make_train_function()
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\engine\training.py", line 333, in _make_train_function
**self._function_kwargs)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\backend\tensorflow_backend.py", line 3006, in function
v1_variable_initialization()
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\backend\tensorflow_backend.py", line 420, in v1_variable_initialization
session = get_session()
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\keras\backend\tensorflow_backend.py", line 385, in get_session
return tf_keras_backend.get_session()
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\keras\backend.py", line 486, in get_session
_initialize_variables(session)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\keras\backend.py", line 903, in _initialize_variables
[variables_module.is_variable_initialized(v) for v in candidate_vars])
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run
run_metadata_ptr)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1359, in _do_run
run_metadata)
File "C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\client\session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
InvalidArgumentError: Cannot assign a device for operation replica_0/lambda_1/Shape: node replica_0/lambda_1/Shape (defined at C:\Users\Dy\Anaconda3\envs\tf1x\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.
[[replica_0/lambda_1/Shape]]
I don't understand what caused the InvalidArgumentError. Is my tensoflow-gpu not installed correctly? Or there is some conflict in deploying gpu?
Try changing the "gpus" value to "0" if it is anythong else. It should work if you are executing in GPU.
I am trying to create version under google cloud ml models for the successfully trained tensorflow estimator model. I believe that I am providing the correct Uri(in google storage) which includes saved_model.pb.
Framework: Tensorflow,
Framework Version: 1.13.1,
Runtime Version: 1.13,
Python: 3.5
Here is the traceback of the error:
Traceback (most recent call last):
File "/google/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 985, in Execute
resources = calliope_command.Run(cli=self, args=args)
File "/google/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 795, in Run
resources = command_instance.Run(args)
File "/google/google-cloud-sdk/lib/surface/ml_engine/versions/create.py", line 119, in Run
python_version=args.python_version)
File "/google/google-cloud-sdk/lib/googlecloudsdk/command_lib/ml_engine/versions_util.py", line 114, in Create
message='Creating version (this might take a few minutes)...')
File "/google/google-cloud-sdk/lib/googlecloudsdk/command_lib/ml_engine/versions_util.py", line 75, in WaitForOpMaybe
return operations_client.WaitForOperation(op, message=message).response
File "/google/google-cloud-sdk/lib/googlecloudsdk/api_lib/ml_engine/operations.py", line 114, in WaitForOperation
sleep_ms=5000)
File "/google/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/waiter.py", line 264, in WaitFor
sleep_ms, _StatusUpdate)
File "/google/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/waiter.py", line 326, in PollUntilDone
sleep_ms=sleep_ms)
File "/google/google-cloud-sdk/lib/googlecloudsdk/core/util/retry.py", line 229, in RetryOnResult
if not should_retry(result, state):
File "/google/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/waiter.py", line 320, in _IsNotDone
return not poller.IsDone(operation)
File "/google/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/waiter.py", line 122, in IsDone
raise OperationError(operation.error.message)
OperationError: Bad model detected with error: "Failed to load model: a bytes-like object is required, not 'str' (Error code: 0)"
ERROR: (gcloud.ml-engine.versions.create) Bad model detected with error: "Failed to load model: a bytes-like object is required, not 'str' (Error code: 0)"
Do you have any idea what might be the problem?
EDIT
I am using:
tf.estimator.LatestExporter('exporter', model.serving_input_fn)
as a estimator exporter.
serving_input_fn:
def serving_input_fn():
inputs = {'string1': tf.placeholder(tf.int16, [None, MAX_SEQUENCE_LENGTH]),
'string2': tf.placeholder(tf.int16, [None, MAX_SEQUENCE_LENGTH])}
return tf.estimator.export.ServingInputReceiver(inputs, inputs)
PS: my model takes two inputs and returns one binary output.
Hello
I'm using TensorFlow v 1.4.0 and when I want to create a TensorBoard session with the following commands:
tensorboard --logdir="folder_path"
I have an error:
2018-04-11 17:18:44.422839: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:02:00.0
totalMemory: 11,91GiB freeMemory: 11,74GiB
2018-04-11 17:18:44.467559: E tensorflow/core/common_runtime/direct_session.cc:167] Internal: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE
Traceback (most recent call last):
File "/usr/local/bin/tensorboard", line 11, in <module>
sys.exit(run_main())
File "/usr/local/lib/python3.5/dist-packages/tensorboard/main.py", line 36, in run_main
tf.app.run(main)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/usr/local/lib/python3.5/dist-packages/tensorboard/main.py", line 45, in main
default.get_assets_zip_provider())
File "/usr/local/lib/python3.5/dist-packages/tensorboard/program.py", line 166, in main
tb = create_tb_app(plugins, assets_zip_provider)
File "/usr/local/lib/python3.5/dist-packages/tensorboard/program.py", line 200, in create_tb_app
window_title=FLAGS.window_title)
File "/usr/local/lib/python3.5/dist-packages/tensorboard/backend/application.py", line 124, in standard_tensorboard_wsgi
plugin_instances = [constructor(context) for constructor in plugins]
File "/usr/local/lib/python3.5/dist-packages/tensorboard/backend/application.py", line 124, in <listcomp>
plugin_instances = [constructor(context) for constructor in plugins]
File "/usr/local/lib/python3.5/dist-packages/tensorboard/plugins/beholder/beholder_plugin.py", line 47, in __init__
self.most_recent_frame = im_util.get_image_relative_to_script('no-data.png')
File "/usr/local/lib/python3.5/dist-packages/tensorboard/plugins/beholder/im_util.py", line 277, in get_image_relative_to_script
return read_image(filename)
File "/usr/local/lib/python3.5/dist-packages/tensorboard/plugins/beholder/im_util.py", line 265, in read_image
return np.array(decode_png(image_file.read()))
File "/usr/local/lib/python3.5/dist-packages/tensorboard/plugins/beholder/im_util.py", line 182, in __call__
self._lazily_initialize()
File "/usr/local/lib/python3.5/dist-packages/tensorboard/plugins/beholder/im_util.py", line 160, in _lazily_initialize
self._session = tf.Session(graph=graph, config=config)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1509, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 638, in __init__
self._session = tf_session.TF_NewDeprecatedSession(opts, status)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
TensorBoard worked when I used TensorFlow 1.6 but I think it is not the problem because I tried to re-use the version 1.6 today and it is not working
My folder contains a file "event.out.po", I checked it.
Do you know where is the problem ?
Thank you
I found the problem. In the batch before using TensorBoard, this command must be run to use the gpu:
export CUDA_VISIBLE_DEVICES=0
If the precedent command does not work, you can try:
export CUDA_VISIBLE_DEVICES=''