Tensorflow gpu not able to train my Xception model - tensorflow

I am training an Xception model with tensorflow-gpu. I am getting this error:
FailedPreconditionError: 2 root error(s) found.
(0) Failed precondition: Error while reading resource variable block14_sepconv2_bn_5/moving_variance from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/block14_sepconv2_bn_5/moving_variance/N10tensorflow3VarE does not exist.
[[{{node FusedBatchNormV3/ReadVariableOp_1}}]]
[[block9_sepconv3_bn_5/cond/else/_9661/OptionalFromValue_3/_1548]]
(1) Failed precondition: Error while reading resource variable block14_sepconv2_bn_5/moving_variance from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/block14_sepconv2_bn_5/moving_variance/N10tensorflow3VarE does not exist.
[[{{node FusedBatchNormV3/ReadVariableOp_1}}]]
0 successful operations.
0 derived errors ignored.
When I reload, the error is raised on a different conv layer. I have not used a GPU before, so I do not know how it works. Thanks for the help!

The problem has been sorted. I am not exactly sure what happened, but I think it was a memory issue. The following lines of code helped me control GPU memory usage:
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession
config = ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.5
session = InteractiveSession(config=config)
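For reference, TensorFlow 2.x exposes on-demand GPU allocation through tf.config, without a compat.v1 session. The sketch below is an assumption about how the allow_growth part of the snippet above might be ported; it is not part of the original answer. The helper name enable_gpu_memory_growth is hypothetical, and the import is guarded only so the snippet can also be run on a machine without TensorFlow. It has to run before any op touches the GPU.

```python
try:
    import tensorflow as tf
except ImportError:  # guard so the sketch still runs where TensorFlow is absent
    tf = None


def enable_gpu_memory_growth():
    """Ask TensorFlow to allocate GPU memory on demand.

    Returns the number of GPUs that were configured (0 if no GPU is visible
    or TensorFlow is not installed). Hypothetical helper, for illustration.
    """
    if tf is None:
        return 0
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # TF2 counterpart of config.gpu_options.allow_growth = True
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


# Call once at program start, before building or loading the model.
configured = enable_gpu_memory_growth()
print("GPUs configured for memory growth:", configured)
```

This only covers allow_growth; capping the fraction of memory per process is a separate setting and is left out of the sketch.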

Related

How can I solve this cuDNN launch failure error with TensorFlow2.7 FusedBatchNormV3 during inference?

I managed to successfully train a BERT model with TF2.7, but at inference I get the following error:
tensorflow.python.framework.errors_impl.InternalError: cuDNN launch failure : input shape ([1,26624,384,1])
[[node bert/embeddings/layer_normalization/FusedBatchNormV3
(defined at /mnt/task_runtime/nlbt/features/siri_embeddings/subowl_embedder.py:21)
]] [Op:__inference_pruned_2214]
Errors may have originated from an input operation.
Input Source operations connected to node bert/embeddings/layer_normalization/FusedBatchNormV3:
In[0] bert/embeddings/layer_normalization/Reshape:
In[1] bert/embeddings/layer_normalization/ones:
In[2] bert/embeddings/layer_normalization/zeros:
In[3] bert/embeddings/layer_normalization/Const:
In[4] bert/embeddings/layer_normalization/Const_1:
I have found that similar issues are often indicative of a memory problem, but the issue does not occur when I use TF2.4 at inference. For completeness, I am using CUDA 11.1 and cuDNN 8.0.5.39.

Colab TPU: INTERNAL: {{function_node __inference_train_function_7167}} failed to connect to all addresses

I am currently trying to train a model for my bachelor's.
The estimated training time is very long, so I have considered using TPUs. However, every time I try to train with a TPU strategy, following this Google notebook, I keep getting the following error:
(0) INTERNAL: {{function_node __inference_train_function_7167}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1651692210.674048314","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3124,"referenced_errors":[{"created":"@1651692210.674047476","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[strided_slice_69/_310]]
You can check my TPU boilerplate code here:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
    print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
except ValueError:
    raise BaseException('ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)
My dataset is stored as images in my Google Drive.
I am trying to train using tf.keras.Model.fit.

Runtime Error: tensorflow.python.framework.errors_impl.NotFoundError: Could not find metadata file. [[{{node LoadDataset/_1}}]] [Op:DatasetFromGraph]

As in the tutorial, I am trying to execute tff.learning.build_federated_averaging_process(model_fn, client_optimizer_fn=lambda: tf.keras.optimizers.SGD(0.02)), but on an orchestrator (server) with the data saved on an edge node (client) and loaded with the tf.data.experimental.load() method:
@tff.tf_computation
def make_data():
    element_spec = collections.OrderedDict([('x', tf.TensorSpec(shape=(None, 784), dtype=tf.float32, name=None)),
                                            ('y', tf.TensorSpec(shape=(None,), dtype=tf.int32, name=None))])
    data = tf.data.experimental.load('./train_data', element_spec=element_spec)
    return data
However, I'm getting the following error:
W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at dataset_ops.cc:175 : Not found: Could not find metadata file.
[[{{node LoadDataset/_1}}]]
The dataset was saved using tf.data.experimental.save(train_data[0], './train_data'). The implementation works when executed locally with tff.backends.native.set_local_execution_context().
python - 3.7
libraries versions:
tensorflow - 2.5.2
tensorflow-estimator - 2.5.0
tensorflow-federated - 0.19.0
Any help would be most appreciated.
When you decorate a Python function with @tff.tf_computation, its contents are serialized as a tf.Graph to be reused later. Frankly, I do not know how I/O such as the experimental tf.data load logic interacts with that serialization.
The recommended pattern is to avoid it: load the data in Python at the level where you create the TFF computations, then pass the loaded dataset as an input to your tff.tf_computation or tff.federated_computation, with a matching type signature (tff.types.SequenceType).

Unable to Enable Tensorflows Eager execution

I have a conda environment with TensorFlow 2.0.0-beta1 installed. However, whenever I import tensorflow and attempt to enable eager execution, I get this error:
AttributeError: module 'tensorflow' has no attribute 'enable_eager_execution'
The only code that I have run for this is:
import tensorflow as tf
print(tf.__version__)
tf.enable_eager_execution()
Is this an error in the TensorFlow 2.0 beta module or an issue with my installation?
In TensorFlow 2.0, the enable_eager_execution method was moved to the tf.compat.v1 module. The following works on tensorflow-2.0.0-beta1:
tf.compat.v1.enable_eager_execution()
In TensorFlow 2.0, eager execution is enabled by default, so you don't need to enable it in your program.
For example:
import tensorflow as tf
t = tf.constant([5.0])
Now you can view the tensor's value directly, without using a session object:
print(t)
# tf.Tensor([5.], shape=(1,), dtype=float32)
You can also convert the tensor to a NumPy array:
numpy_array = t.numpy()
print(numpy_array)
# [5.]
You can also disable eager execution in TensorFlow 2 (tested on tensorflow-2.0.0-beta1; this might not work on future versions):
tf.compat.v1.disable_eager_execution()
t2 = tf.constant([5.0])
print(t2)
# Tensor("Const:0", shape=(1,), dtype=float32)
Calling the numpy() method on a tensor after eager execution is disabled throws an error:
AttributeError: 'Tensor' object has no attribute 'numpy'
One thing to consider before disabling eager execution: once it is disabled, it cannot be re-enabled in the same program, because tf.enable_eager_execution must be called at program startup. Calling it after eager execution has been disabled throws an error:
ValueError: tf.enable_eager_execution must be called at program startup.
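Because the mode cannot be switched back once eager execution is disabled, it can help to check it before calling numpy(). A minimal sketch, assuming TensorFlow 2.x; tensor_value is a hypothetical helper name, and the import is guarded only so the snippet also runs without TensorFlow installed:

```python
try:
    import tensorflow as tf
except ImportError:  # guard so the sketch still runs where TensorFlow is absent
    tf = None


def tensor_value(t):
    """Return the tensor's NumPy value in eager mode, otherwise None.

    Hypothetical helper: in graph mode the value only exists inside a
    session run, so there is nothing to return here.
    """
    if tf is not None and tf.executing_eagerly() and hasattr(t, "numpy"):
        return t.numpy()
    return None


if tf is not None:
    print("Eager execution active:", tf.executing_eagerly())
    print(tensor_value(tf.constant([5.0])))
```

tf.executing_eagerly() reports the current mode, so the same code path works whether or not tf.compat.v1.disable_eager_execution() was called earlier in the program.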

Kafka in the TensorFlow: Failed to consume:Broker: No more messages

I want to use the Kafka API in TensorFlow 1.7, but I get the error "Failed to consume:Broker: No more messages". I don't know how to fix this problem.
import tensorflow as tf
import tensorflow.contrib.kafka as kafka
temp = kafka.KafkaDataset(topics='bt1_meeting_appeventlog_oracle:0:0:-1', group="None",
                          servers="kafka1:9092,kafka2:9092,kafka3:9092")
iterator = temp.make_one_shot_iterator()
#iterator = temp.make_initializable_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    print(sess.run(next_element))
Then I got this error:
InternalError (see above for traceback): Failed to consume:Broker: No more messages
[[Node: IteratorGetNext_13 = IteratorGetNext[output_shapes=[[]], output_types=[DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator_4)]]
I ran into the same problem when testing TensorFlow's Kafka client. In my case, setting the eof argument of KafkaDataset to True solved it.
TensorFlow uses librdkafka as its C++ Kafka client. In librdkafka, EOF is treated as an error by default, so setting eof to True avoids this failure.
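Applied to the question's code, the fix might look like the sketch below. It assumes TensorFlow 1.x with tf.contrib.kafka available; make_kafka_dataset is a hypothetical wrapper name, and the import is kept inside the function so the definition itself loads on any Python installation.

```python
def make_kafka_dataset(topics, servers, group=""):
    """Build a KafkaDataset that stops reading at end of partition.

    Hypothetical wrapper for illustration. Requires TensorFlow 1.x with
    tf.contrib.kafka; the import happens only when the function is called.
    """
    import tensorflow.contrib.kafka as kafka

    return kafka.KafkaDataset(
        topics=topics,
        servers=servers,
        group=group,
        eof=True,  # stop on EOF instead of failing with "No more messages"
    )
```

With the values from the question, the call would be make_kafka_dataset('bt1_meeting_appeventlog_oracle:0:0:-1', "kafka1:9092,kafka2:9092,kafka3:9092", group="None"), and the rest of the iterator and session code stays unchanged.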