Tensorflow restore() function throwing incompatible shape error - tensorflow

I am doing tensorflow source code modification, wherein I am modigying the call() function of the source code. Interestingly, the checkpoint restore() is giving the error of mismatch in shape.
But from my understanding the flow of code is: build()->restore()->call().
Although my change in tensorflow code is expected to alter shape in call phase, is it expected that restoring checkpoint which happens before it will give shape mismatch error? Or, there must be another reason for this error?
Traceback (most recent call last):
File "run_classifier.py", line 549, in <module>
app.run(main)
File "/home/arpitj/projects/lib/python3.8/site-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/home/arpitj/projects/lib/python3.8/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "run_classifier.py", line 542, in main
custom_main(custom_callbacks=None, custom_metrics=None)
File "run_classifier.py", line 485, in custom_main
checkpoint.restore(
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/training/tracking/util.py", line 2354, in restore
status = self.read(save_path, options=options)
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/training/tracking/util.py", line 2229, in read
result = self._saver.restore(save_path=save_path, options=options)
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/training/tracking/util.py", line 1366, in restore
base.CheckpointPosition(
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py", line 254, in restore
restore_ops = trackable._restore_from_checkpoint_position(self) # pylint: disable=protected-access
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py", line 983, in _restore_from_checkpoint_position
current_position.checkpoint.restore_saveables(
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/training/tracking/util.py", line 329, in restore_saveables
new_restore_ops = functional_saver.MultiDeviceSaver(
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/training/saving/functional_saver.py", line 339, in restore
restore_ops = restore_fn()
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/training/saving/functional_saver.py", line 323, in restore_fn
restore_ops.update(saver.restore(file_prefix, options))
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/training/saving/functional_saver.py", line 115, in restore
restore_ops[saveable.name] = saveable.restore(
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/training/saving/saveable_object_util.py", line 133, in restore
return resource_variable_ops.shape_safe_assign_variable_handle(
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 309, in shape_safe_assign_variable_handle
shape.assert_is_compatible_with(value_tensor.shape)
File "/home/arpitj/projects/lib/python3.8/site-packages/tensorflow/python/framework/tensor_shape.py", line 1171, in assert_is_compatible_with
raise ValueError("Shapes %s and %s are incompatible" % (self, other))
ValueError: Shapes (10, 64, 768) and (12, 64, 768) are incompatible

Related

tensorflow.python.framework.errors_impl.NotFoundError while running deep learning model on Google Colaboratory

I'm trying to run on cloud this deep learning model:
https://github.com/razvanmarinescu/brgm#image-reconstruction-with-pre-trained-stylegan2-generators
What I do is simply utilizing their Colab Notebook: https://colab.research.google.com/drive/1G7_CGPHZVGFWIkHOAke4HFg06-tNHIZ4?usp=sharing#scrollTo=qMgE6QFiHuSL
When I try to exectute:
!python recon.py recon-real-images --input=/content/drive/MyDrive/boeing/EDGEconnect/val_imgs --masks=/content/drive/MyDrive/boeing/EDGEconnect/val_masks --tag=brains --network=dropbox:brains.pkl --recontype=inpaint --num-steps=1000 --num-snapshots=1
I receive this error:
args: Namespace(command='recon-real-images', input='/content/drive/MyDrive/boeing/EDGEconnect/val_imgs', masks='/content/drive/MyDrive/boeing/EDGEconnect/val_masks', network_pkl='dropbox:brains.pkl', num_snapshots=1, num_steps=1000, recontype='inpaint', superres_factor=4, tag='brains')
Local submit - run_dir: results/00004-brains-inpaint
dnnlib: Running recon.recon_real_images() on localhost...
Processing image 1/4
Loading networks from "dropbox:brains.pkl"...
Setting up TensorFlow plugin "fused_bias_act.cu": Preprocessing... Loading... Failed!
Traceback (most recent call last):
File "recon.py", line 270, in <module>
main()
File "recon.py", line 263, in main
dnnlib.submit_run(sc, func_name_map[subcmd], **kwargs)
File "/content/drive/MyDrive/boeing/brgm/brgm/dnnlib/submission/submit.py", line 343, in submit_run
return farm.submit(submit_config, host_run_dir)
File "/content/drive/MyDrive/boeing/brgm/brgm/dnnlib/submission/internal/local.py", line 22, in submit
return run_wrapper(submit_config)
File "/content/drive/MyDrive/boeing/brgm/brgm/dnnlib/submission/submit.py", line 280, in run_wrapper
run_func_obj(**submit_config.run_func_kwargs)
File "/content/drive/MyDrive/boeing/brgm/brgm/recon.py", line 189, in recon_real_images
recon_real_one_img(network_pkl, img_list[image_idx], masks, num_snapshots, recontype, superres_factor, num_steps)
File "/content/drive/MyDrive/boeing/brgm/brgm/recon.py", line 132, in recon_real_one_img
_G, _D, Gs = pretrained_networks.load_networks(network_pkl)
File "/content/drive/MyDrive/boeing/brgm/brgm/pretrained_networks.py", line 83, in load_networks
G, D, Gs = pickle.load(stream, encoding='latin1')
File "/content/drive/MyDrive/boeing/brgm/brgm/dnnlib/tflib/network.py", line 297, in __setstate__
self._init_graph()
File "/content/drive/MyDrive/boeing/brgm/brgm/dnnlib/tflib/network.py", line 154, in _init_graph
out_expr = self._build_func(*self.input_templates, **build_kwargs)
File "<string>", line 395, in G_synthesis_stylegan2
File "<string>", line 359, in layer
File "<string>", line 106, in modulated_conv2d_layer
File "<string>", line 75, in apply_bias_act
File "/content/drive/MyDrive/boeing/brgm/brgm/dnnlib/tflib/ops/fused_bias_act.py", line 68, in fused_bias_act
return impl_dict[impl](x=x, b=b, axis=axis, act=act, alpha=alpha, gain=gain)
File "/content/drive/MyDrive/boeing/brgm/brgm/dnnlib/tflib/ops/fused_bias_act.py", line 122, in _fused_bias_act_cuda
cuda_kernel = _get_plugin().fused_bias_act
File "/content/drive/MyDrive/boeing/brgm/brgm/dnnlib/tflib/ops/fused_bias_act.py", line 16, in _get_plugin
return custom_ops.get_plugin(os.path.splitext(__file__)[0] + '.cu')
File "/content/drive/MyDrive/boeing/brgm/brgm/dnnlib/tflib/custom_ops.py", line 156, in get_plugin
plugin = tf.load_op_library(bin_file)
File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/framework/load_library.py", line 61, in load_op_library
lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /content/drive/MyDrive/boeing/brgm/brgm/dnnlib/tflib/_cudacache/fused_bias_act_237d55aca3e3c3ec0547da06888d8e66.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
I found that the very last part of an error:
tensorflow.python.framework.errors_impl.NotFoundError: /content/drive/MyDrive/boeing/brgm/brgm/dnnlib/tflib/_cudacache/fused_bias_act_237d55aca3e3c3ec0547da06888d8e66.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
Can be solved by changing a flag in Cuda Makefile: https://github.com/mgharbi/hdrnet_legacy/issues/2 or by installing tf 1.14(colab runs on 1.15.2 and this change made no positive effect).
My question is, how can I get rid of this error, is there an option to change smth inside Google Colab's Cuda Makefile?

Apache BEAM pipeline fails when writing TF Records - AttributeError: 'str' object has no attribute 'iteritems'

The issue started appearing over the weekend. For some reason, it feels to be a DataFlow issue.
Previously, I was able to execute the script and write TF records just fine. However, now, I am unable to initialize the computation graph to process the data.
The traceback is:
Traceback (most recent call last):
File "my_script.py", line 1492, in <module>
MyBeamClass()
File "my_script.py", line 402, in __init__
self.run()
File "my_script.py", line 514, in run
transform_fn_io.WriteTransformFn(path=self.JOB_DIR + '/transform/'))
File "/anaconda3/envs/ml27/lib/python2.7/site-packages/apache_beam/pipeline.py", line 426, in __exit__
self.run().wait_until_finish()
File "/anaconda3/envs/ml27/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1238, in wait_until_finish
(self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 649, in do_work
work_executor.execute()
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 176, in execute
op.start()
File "apache_beam/runners/worker/operations.py", line 531, in apache_beam.runners.worker.operations.DoOperation.start
def start(self):
File "apache_beam/runners/worker/operations.py", line 532, in apache_beam.runners.worker.operations.DoOperation.start
with self.scoped_start_state:
File "apache_beam/runners/worker/operations.py", line 533, in apache_beam.runners.worker.operations.DoOperation.start
super(DoOperation, self).start()
File "apache_beam/runners/worker/operations.py", line 202, in apache_beam.runners.worker.operations.Operation.start
def start(self):
File "apache_beam/runners/worker/operations.py", line 206, in apache_beam.runners.worker.operations.Operation.start
self.setup()
File "apache_beam/runners/worker/operations.py", line 480, in apache_beam.runners.worker.operations.DoOperation.setup
with self.scoped_start_state:
File "apache_beam/runners/worker/operations.py", line 485, in apache_beam.runners.worker.operations.DoOperation.setup
pickler.loads(self.spec.serialized_fn))
File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 247, in loads
return dill.loads(s)
File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 317, in loads
return load(file, ignore)
File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 305, in load
obj = pik.load()
File "/usr/lib/python2.7/pickle.py", line 864, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1232, in load_build
for k, v in state.iteritems():
AttributeError: 'str' object has no attribute 'iteritems'
I am using tensorflow==1.13.1 and tensorflow-transform==0.9.0 and apache_beam==2.7.0
with beam.Pipeline(options=self.pipe_opt) as p:
with beam_impl.Context(temp_dir=self.google_cloud_options.temp_location):
# rest of the script
_ = (
transform_fn
| 'WriteTransformFn' >>
transform_fn_io.WriteTransformFn(path=self.JOB_DIR + '/transform/'))
I was experiencing the same error.
It seems to be triggered by a mismatch in the tensorflow-transform versions of your local (or master) machine and the workers one (specified in the setup.py file).
In my case I was running tensorflow-transform==0.13 on my local machine whereas the workers were running 0.8.
Downgrading the local version to 0.8 fixed the issue.

CudnnLSTM runs out of space with Eager Execution

I'm using 3 tf.contrib.cudnn_rnn.CudnnLSTM(1, 128, direction='bidirectional') layers with a batch size of 32 on an AWS p2.xlarge instance. The exact same configuration works correctly with non-eager(standard) tensorflow. Following is the error log:
2018-04-27 18:15:59.139739: E tensorflow/stream_executor/cuda/cuda_dnn.cc:1520] Failed to allocate RNN workspace of 74252288 bytes.
2018-04-27 18:15:59.139758: E tensorflow/stream_executor/cuda/cuda_dnn.cc:1697] Unable to create rnn workspace
Traceback (most recent call last):
File "tf_run_eager.py", line 424, in <module>
run_experiments()
File "tf_run_eager.py", line 417, in run_experiments
train_losses.append(model.optimize(bX, bY).numpy())
File "tf_run_eager.py", line 397, in optimize
loss, grads_and_vars = self.loss(phoneme_features, utterances)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/eager/backprop.py", line 233, in grad_fn
sources)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/eager/imperative_grad.py", line 65, in imperative_grad
tape._tape, vspace, target, sources, output_gradients, status) # pylint: disable=protected-access
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/eager/backprop.py", line 141, in grad_fn
op_inputs, op_outputs, orig_outputs)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/eager/backprop.py", line 109, in _magic_gradient_function
return grad_fn(mock_op, *out_grads)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1609, in _cudnn_rnn_backward
direction=op.get_attr("direction"))
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/ops/gen_cudnn_rnn_ops.py", line 320, in cudnn_rnn_backprop
_six.raise_from(_core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: Failed to call ThenRnnBackward [Op:CudnnRNNBackprop]

Error while running TensorFlow wide_n_deep Tutorial

I encountered the error:
AttributeError: 'NoneType' object has no attribute 'bucketize'
The full error is as follows:
Traceback (most recent call last):
File "wide_n_deep_tutorial_1.py", line 214, in <module>
train_and_eval()
File "wide_n_deep_tutorial_1.py", line 203, in train_and_eval
m.fit(input_fn=lambda: input_fn(df_train), steps=FLAGS.train_steps)
File "C:\Python35\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\dnn_linear_combined.py", line 711, in fit
max_steps=max_steps)
File "C:\Python35\lib\site-packages\tensorflow\python\util\deprecation.py", line 191, in new_func
return func(*args, **kwargs)
File "C:\Python35\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\estimator.py", line 355, in fit
max_steps=max_steps)
File "C:\Python35\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\estimator.py", line 699, in _train_model
train_ops = self._get_train_ops(features, labels)
File "C:\Python35\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\estimator.py", line 1052, in _get_train_ops
return self._call_model_fn(features, labels, model_fn_lib.ModeKeys.TRAIN)
File "C:\Python35\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\estimator.py", line 1019, in _call_model_fn
params=self.params)
File "C:\Python35\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\dnn_linear_combined.py", line 504, in _dnn_linear_combined_model_fn
scope=scope)
File "C:\Python35\lib\site-packages\tensorflow\contrib\layers\python\layers\feature_column_ops.py", line 526, in weighted_sum_from_feature_columns
transformed_tensor = transformer.transform(column)
File "C:\Python35\lib\site-packages\tensorflow\contrib\layers\python\layers\feature_column_ops.py", line 869, in transform
feature_column.insert_transformed_feature(self._columns_to_tensors)
File "C:\Python35\lib\site-packages\tensorflow\contrib\layers\python\layers\feature_column.py", line 1489, in insert_transformed_feature
name="bucketize")
File "C:\Python35\lib\site-packages\tensorflow\contrib\layers\python\ops\bucketization_op.py", line 48, in bucketize
return _bucketization_op.bucketize(input_tensor, boundaries, name=name)
AttributeError: 'NoneType' object has no attribute 'bucketize'
I got the same issue, it seems that on windows, we just got None, sourcecode,
try to run this code on linux, or try to remove the bucketization and the column crossing, for example. change the line:
flags.DEFINE_string("model_type","wide_n_deep","valid model types:{'wide','deep', 'wide_n_deep'")
to
flags.DEFINE_string("model_type","deep","valid model types:{'wide','deep', 'wide_n_deep'")
follow this issue for update: issue

tensorflow tutorial mnist_with_summary throws TypeError

I am running the mnist_with_summary tutorial to see how the TensorBoard works. It throws a TypeError right away.
Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz
Traceback (most recent call last):
File "/Users/bruceho/workspace/TestTensorflow/mysrc/examples/tutorials/mnist/mnist_with_summaries.py", line 166, in <module>
tf.app.run()
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/Users/bruceho/workspace/TestTensorflow/mysrc/examples/tutorials/mnist/mnist_with_summaries.py", line 163, in main
train()
File "/Users/bruceho/workspace/TestTensorflow/mysrc/examples/tutorials/mnist/mnist_with_summaries.py", line 110, in train
y = nn_layer(dropped, 500, 10, 'layer2', act=tf.nn.softmax)
File "/Users/bruceho/workspace/TestTensorflow/mysrc/examples/tutorials/mnist/mnist_with_summaries.py", line 104, in nn_layer
activations = act(preactivate, 'activation')
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/ops/nn_ops.py", line 582, in softmax
return _softmax(logits, gen_nn_ops._softmax, dim, name)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/ops/nn_ops.py", line 542, in _softmax
logits = _swap_axis(logits, dim, math_ops.sub(input_rank, 1))
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/ops/nn_ops.py", line 518, in _swap_axis
0, [math_ops.range(dim_index), [last_index],
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/ops/math_ops.py", line 991, in range
return gen_math_ops._range(start, limit, delta, name=name)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1675, in _range
delta=delta, name=name)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/framework/op_def_library.py", line 490, in apply_op
preferred_dtype=default_dtype)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 657, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/framework/constant_op.py", line 180, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/framework/constant_op.py", line 163, in constant
tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape))
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/framework/tensor_util.py", line 353, in make_tensor_proto
_AssertCompatible(values, dtype)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/framework/tensor_util.py", line 290, in _AssertCompatible
(dtype.name, repr(mismatch), type(mismatch).__name__))
TypeError: Expected int32, got 'activation' of type 'str' instead.
I tried running from inside eclipse and command line with the same results. Any one experience the same problem?
I think you must have modified the original code somehow. Your problem lies in this line:activations = act(preactivate, 'activation'). So if you check the api of tf.nn.softmax, you would find that the second argument represents dim instead of name. So to fix the problem, just change this line into:activations = act(preactivate, name='activation')
Besides, I don't know if you have changed
diff = tf.nn.softmax_cross_entropy_with_logits(y, y_)
If not, you probably have softmax the output twice.