I'm using three tf.contrib.cudnn_rnn.CudnnLSTM(1, 128, direction='bidirectional') layers with a batch size of 32 on an AWS p2.xlarge instance. The exact same configuration works correctly with non-eager (standard) TensorFlow. The error log follows; a minimal sketch of the setup appears after it.
2018-04-27 18:15:59.139739: E tensorflow/stream_executor/cuda/cuda_dnn.cc:1520] Failed to allocate RNN workspace of 74252288 bytes.
2018-04-27 18:15:59.139758: E tensorflow/stream_executor/cuda/cuda_dnn.cc:1697] Unable to create rnn workspace
Traceback (most recent call last):
File "tf_run_eager.py", line 424, in <module>
run_experiments()
File "tf_run_eager.py", line 417, in run_experiments
train_losses.append(model.optimize(bX, bY).numpy())
File "tf_run_eager.py", line 397, in optimize
loss, grads_and_vars = self.loss(phoneme_features, utterances)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/eager/backprop.py", line 233, in grad_fn
sources)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/eager/imperative_grad.py", line 65, in imperative_grad
tape._tape, vspace, target, sources, output_gradients, status) # pylint: disable=protected-access
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/eager/backprop.py", line 141, in grad_fn
op_inputs, op_outputs, orig_outputs)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/eager/backprop.py", line 109, in _magic_gradient_function
return grad_fn(mock_op, *out_grads)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1609, in _cudnn_rnn_backward
direction=op.get_attr("direction"))
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/ops/gen_cudnn_rnn_ops.py", line 320, in cudnn_rnn_backprop
_six.raise_from(_core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: Failed to call ThenRnnBackward [Op:CudnnRNNBackprop]
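For reference, a minimal sketch of the eager setup that triggers this; the shapes and the toy loss below are my assumptions, and the real script stacks three such layers inside a larger model:

import tensorflow as tf
tf.enable_eager_execution()

# One of the three bidirectional layers used in the real model.
lstm = tf.contrib.cudnn_rnn.CudnnLSTM(1, 128, direction='bidirectional')

# CudnnLSTM expects time-major input: [max_time, batch_size, input_size].
x = tf.random_normal([100, 32, 128])  # hypothetical sequence length and feature size

with tf.GradientTape() as tape:
    outputs, _ = lstm(x, training=True)
    loss = tf.reduce_mean(outputs)    # stand-in for the real loss

# The backward pass is where CudnnRNNBackprop / ThenRnnBackward fails.
grads = tape.gradient(loss, lstm.variables)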
I'm new to deepfakes and I'm trying to run 5.XSeg) train.bat; every time it finishes the filtering I get the error below. I use the WF (whole-face) face type and have tried batch sizes from 1 to 8, always with the same result. My machine has a Ryzen 5 3600, an RTX 3080 Ti, and 16 GB of RAM.
Using 26519 xseg labeled samples.
Traceback (most recent call last):
File "multiprocessing\queues.py", line 234, in _feed
File "multiprocessing\reduction.py", line 51, in dumps
MemoryError
Error:
Traceback (most recent call last):
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1375, in _do_call
return fn(*args)
Traceback (most recent call last):
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1360, in _run_fn
target_list, run_metadata)
File "multiprocessing\queues.py", line 234, in _feed
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1453, in _call_tf_sessionrun
run_metadata)
File "multiprocessing\reduction.py", line 51, in dumps
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
[[{{node MatMul}}]]
[[concat_6/concat/_3]]
(1) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
[[{{node MatMul}}]]
0 successful operations.
0 derived errors ignored.
MemoryError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\ModelBase.py", line 263, in update_sample_for_preview
self.get_history_previews()
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\ModelBase.py", line 383, in get_history_previews
return self.onGetPreview (self.sample_for_preview, for_history=True)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\Model_XSeg\Model.py", line 209, in onGetPreview
I, M, IM, = [ np.clip( nn.to_data_format(x,"NHWC", self.model_data_format), 0.0, 1.0) for x in ([image_np,mask_np] + self.view (image_np) ) ]
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\Model_XSeg\Model.py", line 141, in view
return nn.tf_sess.run ( [pred], feed_dict={self.model.input_t :input_np})
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 968, in run
run_metadata_ptr)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1191, in _run
feed_dict_tensor, options, run_metadata)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1369, in _do_run
run_metadata)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1394, in _do_call
raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
[[node MatMul (defined at E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\layers\Dense.py:66) ]]
[[concat_6/concat/_3]]
(1) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
[[node MatMul (defined at E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\layers\Dense.py:66) ]]
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node MatMul:
XSeg/dense1/weight/read (defined at E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\layers\Dense.py:47)
Reshape_60 (defined at E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\ops\__init__.py:182)
Input Source operations connected to node MatMul:
XSeg/dense1/weight/read (defined at E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\layers\Dense.py:47)
Reshape_60 (defined at E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\ops\__init__.py:182)
Original stack trace for 'MatMul':
File "threading.py", line 884, in _bootstrap
File "threading.py", line 916, in _bootstrap_inner
File "threading.py", line 864, in run
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\mainscripts\Trainer.py", line 58, in trainerThread
debug=debug)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\Model_XSeg\Model.py", line 17, in __init__
super().__init__(*args, force_model_class_name='XSeg', **kwargs)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\ModelBase.py", line 193, in __init__
self.on_initialize()
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\Model_XSeg\Model.py", line 103, in on_initialize
gpu_pred_logits_t, gpu_pred_t = self.model.flow(gpu_input_t, pretrain=self.pretrain)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\facelib\XSegNet.py", line 85, in flow
return self.model(x, pretrain=pretrain)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\models\ModelBase.py", line 117, in __call__
return self.forward(*args, **kwargs)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\models\XSeg.py", line 124, in forward
x = self.dense1(x)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\layers\LayerBase.py", line 14, in __call__
return self.forward(*args, **kwargs)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\leras\layers\Dense.py", line 66, in forward
x = tf.matmul(x, weight)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\util\dispatch.py", line 206, in wrapper
return target(*args, **kwargs)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\ops\math_ops.py", line 3655, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 5713, in mat_mul
name=name)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 750, in _apply_op_helper
attrs=attr_protos, op_def=op_def)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\ops.py", line 3569, in _create_op_internal
op_def=op_def)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\ops.py", line 2045, in __init__
self._traceback = tf_stack.extract_stack_for_node(self._c_op)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\mainscripts\Trainer.py", line 58, in trainerThread
debug=debug)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\Model_XSeg\Model.py", line 17, in __init__
super().__init__(*args, force_model_class_name='XSeg', **kwargs)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\ModelBase.py", line 216, in __init__
self.update_sample_for_preview(choose_preview_history=self.choose_preview_history)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\ModelBase.py", line 265, in update_sample_for_preview
self.sample_for_preview = self.generate_next_samples()
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\models\ModelBase.py", line 461, in generate_next_samples
sample.append ( generator.generate_next() )
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\samplelib\SampleGeneratorBase.py", line 21, in generate_next
self.last_generation = next(self)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\samplelib\SampleGeneratorFace.py", line 112, in __next__
return next(generator)
File "E:\DeepFaceLab_NVIDIA_RTX3000_series\_internal\DeepFaceLab\core\joblib\SubprocessGenerator.py", line 73, in __next__
gen_data = self.cs_queue.get()
File "multiprocessing\queues.py", line 94, in get
File "multiprocessing\connection.py", line 216, in recv_bytes
File "multiprocessing\connection.py", line 318, in _recv_bytes
File "multiprocessing\connection.py", line 344, in _get_more_data
MemoryError
Neither reducing the batch size nor increasing the page file helped. I tried to Google it, but I couldn't find a solution.
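For what it's worth, the combination of a MemoryError in the sample feeder and "Attempting to perform BLAS operation using StreamExecutor without BLAS support" often points at RAM/VRAM exhaustion when the session starts, rather than at the model itself. A minimal standalone check, outside DeepFaceLab and purely as an assumption to test whether cuBLAS can initialize on the GPU at all:

import tensorflow as tf

# Standalone sanity check (not a DeepFaceLab setting): enable GPU memory
# growth and run a single MatMul to see whether cuBLAS initializes.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

a = tf.random.normal([1024, 1024])
print(tf.matmul(a, a).shape)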
I am modifying TensorFlow source code, specifically a call() function. Interestingly, checkpoint restore() is giving a shape-mismatch error.
But as I understand it, the code flow is: build() -> restore() -> call().
Although my change to the TensorFlow code is expected to alter shapes in the call phase, is it expected that restoring the checkpoint, which happens before call(), gives a shape-mismatch error? Or must there be another reason for this error? The traceback is below, followed by a short sketch of why this can happen.
Traceback (most recent call last):
File "run_classifier.py", line 549, in <module>
app.run(main)
File "/home/arpitj/projects/lib/python3.8/site-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/home/arpitj/projects/lib/python3.8/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "run_classifier.py", line 542, in main
custom_main(custom_callbacks=None, custom_metrics=None)
File "run_classifier.py", line 485, in custom_main
checkpoint.restore(
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/training/tracking/util.py", line 2354, in restore
status = self.read(save_path, options=options)
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/training/tracking/util.py", line 2229, in read
result = self._saver.restore(save_path=save_path, options=options)
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/training/tracking/util.py", line 1366, in restore
base.CheckpointPosition(
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py", line 254, in restore
restore_ops = trackable._restore_from_checkpoint_position(self) # pylint: disable=protected-access
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py", line 983, in _restore_from_checkpoint_position
current_position.checkpoint.restore_saveables(
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/training/tracking/util.py", line 329, in restore_saveables
new_restore_ops = functional_saver.MultiDeviceSaver(
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/training/saving/functional_saver.py", line 339, in restore
restore_ops = restore_fn()
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/training/saving/functional_saver.py", line 323, in restore_fn
restore_ops.update(saver.restore(file_prefix, options))
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/training/saving/functional_saver.py", line 115, in restore
restore_ops[saveable.name] = saveable.restore(
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/training/saving/saveable_object_util.py", line 133, in restore
return resource_variable_ops.shape_safe_assign_variable_handle(
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 309, in shape_safe_assign_variable_handle
shape.assert_is_compatible_with(value_tensor.shape)
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/framework/tensor_shape.py", line 1171, in assert_is_compatible_with
raise ValueError("Shapes %s and %s are incompatible" % (self, other))
ValueError: Shapes (10, 64, 768) and (12, 64, 768) are incompatible
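For what it's worth, this order of failure is consistent with how delayed restoration works: once build() has created a variable, tf.train.Checkpoint.restore() assigns the saved value into it immediately and checks shape compatibility at assignment time, so a shape change introduced by the modified code surfaces during restore() rather than during call(). A minimal sketch with a hypothetical layer and shapes:

import tensorflow as tf

class Toy(tf.keras.layers.Layer):
    def build(self, input_shape):
        # The modified code now builds a (10, 64, 768) weight, but the
        # checkpoint was written when this weight had shape (12, 64, 768).
        self.w = self.add_weight('w', shape=(10, 64, 768))

    def call(self, x):
        return x

layer = Toy()
layer.build((None, 64))                  # the variable exists before restore()
ckpt = tf.train.Checkpoint(layer=layer)
# ckpt.restore('path/to/old_checkpoint') # raises at restore time:
# ValueError: Shapes (10, 64, 768) and (12, 64, 768) are incompatible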
I am trying to download CelebA via tensorflow_datasets (version 4.5.2) and am getting an API error. How can I fix it?
I have updated tensorflow_datasets, but the issue is not fixed.
My code is:
import tensorflow_datasets as tfds
dataset_builder = tfds.builder('celeb_a')
dataset_builder.download_and_prepare()
I am getting the following error:
Downloading and preparing dataset 1.38 GiB (download: 1.38 GiB, generated: 1.62 GiB, total: 3.00 GiB) to /root/tensorflow_datasets/celeb_a/2.0.1...
Dl Size...: 0 MiB [00:00, ? MiB/s] | 0/4 [00:00<?, ? url/s]
Dl Completed...: 0%| | 0/4 [00:00<?, ? url/s]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/miniconda/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 464, in download_and_prepare
download_config=download_config,
File "/miniconda/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1158, in _download_and_prepare
dl_manager, **optional_pipeline_kwargs)
File "/miniconda/lib/python3.7/site-packages/tensorflow_datasets/image/celeba.py", line 129, in _split_generators
"landmarks_celeba": LANDMARKS_DATA,
File "/miniconda/lib/python3.7/site-packages/tensorflow_datasets/core/download/download_manager.py", line 549, in download
return _map_promise(self._download, url_or_urls)
File "/miniconda/lib/python3.7/site-packages/tensorflow_datasets/core/download/download_manager.py", line 767, in _map_promise
res = tf.nest.map_structure(lambda p: p.get(), all_promises) # Wait promises
File "/miniconda/lib/python3.7/site-packages/tensorflow/python/util/nest.py", line 867, in map_structure
structure[0], [func(*x) for x in entries],
File "/miniconda/lib/python3.7/site-packages/tensorflow/python/util/nest.py", line 867, in <listcomp>
structure[0], [func(*x) for x in entries],
File "/miniconda/lib/python3.7/site-packages/tensorflow_datasets/core/download/download_manager.py", line 767, in <lambda>
res = tf.nest.map_structure(lambda p: p.get(), all_promises) # Wait promises
File "/miniconda/lib/python3.7/site-packages/promise/promise.py", line 512, in get
return self._target_settled_value(_raise=True)
File "/miniconda/lib/python3.7/site-packages/promise/promise.py", line 516, in _target_settled_value
return self._target()._settled_value(_raise)
File "/miniconda/lib/python3.7/site-packages/promise/promise.py", line 226, in _settled_value
reraise(type(raise_val), raise_val, self._traceback)
File "/miniconda/lib/python3.7/site-packages/six.py", line 703, in reraise
raise value
File "/miniconda/lib/python3.7/site-packages/promise/promise.py", line 844, in handle_future_result
resolve(future.result())
File "/miniconda/lib/python3.7/concurrent/futures/_base.py", line 428, in result
return self.__get_result()
File "/miniconda/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/miniconda/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/miniconda/lib/python3.7/site-packages/tensorflow_datasets/core/download/downloader.py", line 216, in _sync_download
with _open_url(url, verify=verify) as (response, iter_content):
File "/miniconda/lib/python3.7/contextlib.py", line 112, in __enter__
return next(self.gen)
File "/miniconda/lib/python3.7/site-packages/tensorflow_datasets/core/download/downloader.py", line 276, in _open_with_requests
url = _get_drive_url(url, session)
File "/miniconda/lib/python3.7/site-packages/tensorflow_datasets/core/download/downloader.py", line 298, in _get_drive_url
_assert_status(response)
File "/miniconda/lib/python3.7/site-packages/tensorflow_datasets/core/download/downloader.py", line 310, in _assert_status
response.url, response.status_code))
tensorflow_datasets.core.download.downloader.DownloadError: Failed to get url https://drive.google.com/uc?export=download&id=0B7EVK8r0v71pZjFTYXZWM3FlRnM. HTTP code: 404.
It seems the link is broken, hence this error when fetching the celeb_a TensorFlow dataset. However, until that error is fixed, you can download this dataset manually.
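A minimal sketch of how manually downloaded files can be handed to the builder, assuming the standard tfds manual-download hook (the path is a placeholder, and whether celeb_a picks files up from manual_dir depends on the tfds version):

import tensorflow_datasets as tfds

# Placeholder path to the manually downloaded CelebA archives.
config = tfds.download.DownloadConfig(manual_dir='/path/to/downloaded/celeb_a')
builder = tfds.builder('celeb_a')
builder.download_and_prepare(download_config=config)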
I'm trying to run this deep learning model in the cloud:
https://github.com/razvanmarinescu/brgm#image-reconstruction-with-pre-trained-stylegan2-generators
What I do is simply use their Colab notebook: https://colab.research.google.com/drive/1G7_CGPHZVGFWIkHOAke4HFg06-tNHIZ4?usp=sharing#scrollTo=qMgE6QFiHuSL
When I try to execute:
!python recon.py recon-real-images --input=/content/drive/MyDrive/boeing/EDGEconnect/val_imgs --masks=/content/drive/MyDrive/boeing/EDGEconnect/val_masks --tag=brains --network=dropbox:brains.pkl --recontype=inpaint --num-steps=1000 --num-snapshots=1
I receive this error:
args: Namespace(command='recon-real-images', input='/content/drive/MyDrive/boeing/EDGEconnect/val_imgs', masks='/content/drive/MyDrive/boeing/EDGEconnect/val_masks', network_pkl='dropbox:brains.pkl', num_snapshots=1, num_steps=1000, recontype='inpaint', superres_factor=4, tag='brains')
Local submit - run_dir: results/00004-brains-inpaint
dnnlib: Running recon.recon_real_images() on localhost...
Processing image 1/4
Loading networks from "dropbox:brains.pkl"...
Setting up TensorFlow plugin "fused_bias_act.cu": Preprocessing... Loading... Failed!
Traceback (most recent call last):
File "recon.py", line 270, in <module>
main()
File "recon.py", line 263, in main
dnnlib.submit_run(sc, func_name_map[subcmd], **kwargs)
File "/content/drive/MyDrive/boeing/brgm/brgm/dnnlib/submission/submit.py", line 343, in submit_run
return farm.submit(submit_config, host_run_dir)
File "/content/drive/MyDrive/boeing/brgm/brgm/dnnlib/submission/internal/local.py", line 22, in submit
return run_wrapper(submit_config)
File "/content/drive/MyDrive/boeing/brgm/brgm/dnnlib/submission/submit.py", line 280, in run_wrapper
run_func_obj(**submit_config.run_func_kwargs)
File "/content/drive/MyDrive/boeing/brgm/brgm/recon.py", line 189, in recon_real_images
recon_real_one_img(network_pkl, img_list[image_idx], masks, num_snapshots, recontype, superres_factor, num_steps)
File "/content/drive/MyDrive/boeing/brgm/brgm/recon.py", line 132, in recon_real_one_img
_G, _D, Gs = pretrained_networks.load_networks(network_pkl)
File "/content/drive/MyDrive/boeing/brgm/brgm/pretrained_networks.py", line 83, in load_networks
G, D, Gs = pickle.load(stream, encoding='latin1')
File "/content/drive/MyDrive/boeing/brgm/brgm/dnnlib/tflib/network.py", line 297, in __setstate__
self._init_graph()
File "/content/drive/MyDrive/boeing/brgm/brgm/dnnlib/tflib/network.py", line 154, in _init_graph
out_expr = self._build_func(*self.input_templates, **build_kwargs)
File "<string>", line 395, in G_synthesis_stylegan2
File "<string>", line 359, in layer
File "<string>", line 106, in modulated_conv2d_layer
File "<string>", line 75, in apply_bias_act
File "/content/drive/MyDrive/boeing/brgm/brgm/dnnlib/tflib/ops/fused_bias_act.py", line 68, in fused_bias_act
return impl_dict[impl](x=x, b=b, axis=axis, act=act, alpha=alpha, gain=gain)
File "/content/drive/MyDrive/boeing/brgm/brgm/dnnlib/tflib/ops/fused_bias_act.py", line 122, in _fused_bias_act_cuda
cuda_kernel = _get_plugin().fused_bias_act
File "/content/drive/MyDrive/boeing/brgm/brgm/dnnlib/tflib/ops/fused_bias_act.py", line 16, in _get_plugin
return custom_ops.get_plugin(os.path.splitext(__file__)[0] + '.cu')
File "/content/drive/MyDrive/boeing/brgm/brgm/dnnlib/tflib/custom_ops.py", line 156, in get_plugin
plugin = tf.load_op_library(bin_file)
File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/framework/load_library.py", line 61, in load_op_library
lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /content/drive/MyDrive/boeing/brgm/brgm/dnnlib/tflib/_cudacache/fused_bias_act_237d55aca3e3c3ec0547da06888d8e66.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
I found that the very last part of the error:
tensorflow.python.framework.errors_impl.NotFoundError: /content/drive/MyDrive/boeing/brgm/brgm/dnnlib/tflib/_cudacache/fused_bias_act_237d55aca3e3c3ec0547da06888d8e66.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
can reportedly be solved by changing a flag in a CUDA Makefile (https://github.com/mgharbi/hdrnet_legacy/issues/2) or by installing TF 1.14 (Colab runs 1.15.2, and this change had no positive effect).
My question is: how can I get rid of this error? Is there an option to change something inside Google Colab's CUDA Makefile?
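For context, an undefined symbol containing __cxx11 usually indicates a C++ ABI mismatch between the prebuilt TensorFlow wheel and the locally compiled plugin. A minimal sketch of the kind of flag change the linked issue describes, assuming the nvcc options are assembled in dnnlib/tflib/custom_ops.py (the variable name below is my assumption, not the repository's confirmed code); deleting the stale _cudacache directory afterwards would force the .so to be rebuilt with the new flag:

# Hypothetical edit in dnnlib/tflib/custom_ops.py (variable name assumed):
# compile the CUDA plugin with the pre-C++11 ABI so its symbols match the
# ones exported by the prebuilt TensorFlow 1.15 wheel.
compile_opts += " --compiler-options '-fPIC -D_GLIBCXX_USE_CXX11_ABI=0'"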
The issue started appearing over the weekend. For some reason, it seems to be a Dataflow issue.
Previously, I was able to execute the script and write TFRecords just fine. However, now I am unable to initialize the computation graph to process the data.
The traceback is:
Traceback (most recent call last):
File "my_script.py", line 1492, in <module>
MyBeamClass()
File "my_script.py", line 402, in __init__
self.run()
File "my_script.py", line 514, in run
transform_fn_io.WriteTransformFn(path=self.JOB_DIR + '/transform/'))
File "/anaconda3/envs/ml27/lib/python2.7/site-packages/apache_beam/pipeline.py", line 426, in __exit__
self.run().wait_until_finish()
File "/anaconda3/envs/ml27/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1238, in wait_until_finish
(self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 649, in do_work
work_executor.execute()
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 176, in execute
op.start()
File "apache_beam/runners/worker/operations.py", line 531, in apache_beam.runners.worker.operations.DoOperation.start
def start(self):
File "apache_beam/runners/worker/operations.py", line 532, in apache_beam.runners.worker.operations.DoOperation.start
with self.scoped_start_state:
File "apache_beam/runners/worker/operations.py", line 533, in apache_beam.runners.worker.operations.DoOperation.start
super(DoOperation, self).start()
File "apache_beam/runners/worker/operations.py", line 202, in apache_beam.runners.worker.operations.Operation.start
def start(self):
File "apache_beam/runners/worker/operations.py", line 206, in apache_beam.runners.worker.operations.Operation.start
self.setup()
File "apache_beam/runners/worker/operations.py", line 480, in apache_beam.runners.worker.operations.DoOperation.setup
with self.scoped_start_state:
File "apache_beam/runners/worker/operations.py", line 485, in apache_beam.runners.worker.operations.DoOperation.setup
pickler.loads(self.spec.serialized_fn))
File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 247, in loads
return dill.loads(s)
File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 317, in loads
return load(file, ignore)
File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 305, in load
obj = pik.load()
File "/usr/lib/python2.7/pickle.py", line 864, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1232, in load_build
for k, v in state.iteritems():
AttributeError: 'str' object has no attribute 'iteritems'
I am using tensorflow==1.13.1, tensorflow-transform==0.9.0, and apache_beam==2.7.0. The relevant part of the pipeline is:
with beam.Pipeline(options=self.pipe_opt) as p:
    with beam_impl.Context(temp_dir=self.google_cloud_options.temp_location):
        # rest of the script
        _ = (
            transform_fn
            | 'WriteTransformFn' >>
            transform_fn_io.WriteTransformFn(path=self.JOB_DIR + '/transform/'))
I was experiencing the same error.
It seems to be triggered by a mismatch between the tensorflow-transform version on your local (or master) machine and the one on the workers (specified in the setup.py file).
In my case, I was running tensorflow-transform==0.13 on my local machine, whereas the workers were running 0.8. Downgrading the local version to 0.8 fixed the issue.
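A minimal sketch of the kind of pin this refers to, assuming the job ships its dependencies to the Dataflow workers through a setup.py passed with --setup_file (project metadata and layout below are placeholders):

# setup.py (placeholder project metadata): pin the worker packages to the
# same versions as the machine that launches the pipeline.
import setuptools

setuptools.setup(
    name='my-tft-pipeline',
    version='0.1.0',
    packages=setuptools.find_packages(),
    install_requires=[
        'tensorflow-transform==0.9.0',
        'apache-beam[gcp]==2.7.0',
    ],
)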