Kaggle TPU Unavailable: failed to connect to all addresses - tensorflow

I'm a newbie to ML. While trying to do digit recognition on a TPU, I ran into a problem that has me completely stuck.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    Model = Sequential([
        InputLayer((28, 28, 1)),
        Dropout(0.1),
        Conv2D(128, 3, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        MaxPooling2D(2, 2),
        Conv2D(64, 3, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        MaxPooling2D(2, 2),
        Flatten(),
        Dense(128, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        Dense(10, activation='softmax')
    ])

with strategy.scope():
    Model.compile(optimizer='adam',
                  loss='categorical_crossentropy', metrics='accuracy')
CancelledError: 4 root error(s) found.
(0) Cancelled: Operation was cancelled
[[node IteratorGetNextAsOptional_1 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
(1) Cancelled: Iterator was cancelled
[[node IteratorGetNextAsOptional_6 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
(2) Cancelled: Operation was cancelled
[[node IteratorGetNextAsOptional_3 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
(3) Cancelled: Iterator was cancelled
[[node IteratorGetNextAsOptional_5 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
0 successful operations.
5 derived errors ignored. [Op:__inference_train_function_23675]
Function call stack:
train_function -> train_function -> train_function -> train_function
Then I ran it again and got the following errors:
UnavailableError: 9 root error(s) found.
(0) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[cond_11/switch_pred/_107/_78]]
(1) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[TPUReplicate/_compile/_7290104207349758044/_4/_178]]
(2) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[tpu_compile_succeeded_assert/_13543899577889784813/_5/_281]]
(3) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[strided_slice_37 ... [truncated] [Op:__inference_train_function_6939]
Function call stack:
train_function -> train_function -> train_function -> train_function
Is a strategy.scope() missing somewhere?
I have tried many times and succeeded in several other notebooks, but those all use tf.data.Dataset. I still can't figure out what is wrong in this simple digit-recognition example. I have searched again and again and have been stuck here for two days; it's really frustrating.
The full code is at
https://www.kaggle.com/dacianpeng/digit-hello-world?scriptVersionId=72464286
Version 6 is the TPU version, modified from Version 5 only by the code above. Please help me!

It looks like you are storing your training data locally, which is what causes the issue: TPUs can only access data in GCS.
TPUs read training data exclusively from GCS (Google Cloud Storage); see the details here.
You can also check this Stack Overflow post: Colab TPU Error when calling model.fit() : UnimplementedError.
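For illustration, here is a minimal sketch of a TPU-friendly input pipeline that reads TFRecords straight from a GCS bucket; the gs:// path and the feature layout are assumptions, not from the original notebook:
import tensorflow as tf

GCS_PATTERN = 'gs://your-bucket/mnist/train-*.tfrec'  # hypothetical bucket/path

feature_spec = {
    'image': tf.io.FixedLenFeature([], tf.string),  # assumed: raw uint8 image bytes
    'label': tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_raw(parsed['image'], tf.uint8)
    image = tf.reshape(tf.cast(image, tf.float32) / 255.0, (28, 28, 1))
    return image, parsed['label']

ds = (tf.data.TFRecordDataset(tf.io.gfile.glob(GCS_PATTERN))
      .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
      .batch(128, drop_remainder=True)  # TPUs prefer static batch shapes
      .prefetch(tf.data.AUTOTUNE))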

Fixed the problem by converting the data to tf.data.Dataset (without GCS).
Calling fit() with a local tf.data.Dataset works fine, but it fails with Unavailable: failed to connect to all addresses as soon as ImageDataGenerator() is used.
# Fixed by changing to tf.data.Dataset.
ds1 = tf.data.Dataset.from_tensor_slices((DS1, L1)).batch(128).prefetch(-1)
ds2 = tf.data.Dataset.from_tensor_slices((DS2, L2)).batch(128).prefetch(-1)
...
...
History = Model.fit(ds1, epochs=Epochs, validation_data=ds2,
                    callbacks=[ReduceLR, Stop], verbose=1)
# Epoch time is not stable (sometimes faster, sometimes slower),
# but most of the time it is roughly the same as the GPU cost.
It fails as soon as ImageDataGenerator() is used:
# Fails again when ImageDataGenerator() is used
ds1 = tf.data.Dataset.from_generator(
    lambda: ImageModifier.flow(DS1, L1),
    output_signature=(
        tf.TensorSpec(shape=(28, 28, 1), dtype=tf.float32),
        tf.TensorSpec(shape=(10,), dtype=tf.float32))
).batch(128).prefetch(-1)
History = Model.fit(ds1, epochs=Epochs, verbose=1)
---------------------------------------------------------------------------
UnavailableError Traceback (most recent call last)
<ipython-input-107-149f17c4776c> in <module>
1 Epochs = 15
----> 2 History = Model.fit(ds1, epochs=Epochs, verbose=1)
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
1100 tmp_logs = self.train_function(iterator)
1101 if data_handler.should_sync:
-> 1102 context.async_wait()
1103 logs = tmp_logs # No error, now safe to assign to logs.
1104 end_step = step + data_handler.step_increment
/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py in async_wait()
2328 an error state.
2329 """
-> 2330 context().sync_executors()
2331
2332
/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py in sync_executors(self)
643 """
644 if self._context_handle:
--> 645 pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
646 else:
647 raise ValueError("Context is not initialized.")
UnavailableError: 4 root error(s) found.
(0) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[Pad_2/paddings/_130]]
(1) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[strided_slice_36/_238]]
(2) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[IteratorGetNextAsOptional_3/_35]]
(3) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
0 successful operations.
5 derived errors ignored.
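A hedged sketch of a TPU-compatible alternative: the kind of augmentation ImageDataGenerator performs can be written with tf.image ops inside .map(), which run in-graph and avoid the py_function path that breaks on TPU. The specific augmentations here are assumptions; DS1/L1 are the in-memory arrays from the snippet above.
def augment(image, label):
    # small random shifts: pad to 32x32, then randomly crop back to 28x28
    image = tf.image.resize_with_crop_or_pad(image, 32, 32)
    image = tf.image.random_crop(image, size=[28, 28, 1])
    image = tf.image.random_brightness(image, max_delta=0.1)
    return image, label

ds1 = (tf.data.Dataset.from_tensor_slices((DS1, L1))
       .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
       .batch(128, drop_remainder=True)
       .prefetch(tf.data.AUTOTUNE))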

Related

InvalidArgumentError after step 6 during my training on TPU at Kaggle

I am getting an error when I train my model with a TPU on Kaggle. I have already looked around the internet but can't find any solution. Note that it works if I use a GPU, but there I run into an OOM error.
Here are the parameters:
batch_size = 8 * strategy.num_replicas_in_sync
tensorflow v 2.4.1
TPU v3-8
Here is my code:
try:  # detect TPUs
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()  # TPU detection
    strategy = tf.distribute.TPUStrategy(tpu)
except ValueError:  # detect GPUs
    strategy = tf.distribute.MirroredStrategy()  # for GPU or multi-GPU machines

image_shape = (224, 224, 3)
batch_size = 8 * strategy.num_replicas_in_sync
N_epoch = 200  # number of epochs

# I created the model using the strategy
with strategy.scope():
    model = ...  # I create the model here

def train_step(in_one, in_two):
    loss_ = model.train_on_batch([in_one,], [in_two,])
    return loss_

for epoch in range(N_epoch):
    for step, (in_one, in_two) in enumerate(zip(data_one, data_two)):
        loss_ = train_step(in_one, in_two)
        if step % 1 == 0:
            print("epoch> %d | step> %d | loss>%.2f" % (epoch + 1, step, loss_))
The training process works fine until it reaches step 6.
Here is my training output with the error:
epoch> 1 | step> 0 | loss>0.09
epoch> 1 | step> 1 | loss>0.08
epoch> 1 | step> 2 | loss>0.07
epoch> 1 | step> 3 | loss>0.07
epoch> 1 | step> 4 | loss>0.07
epoch> 1 | step> 5 | loss>0.08
epoch> 1 | step> 6 | loss>0.07
2022-12-16 06:28:40.938812: W tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:76] Unable to destroy remote tensor handles. If you are running a tf.function, it usually indicates some op in the graph gets an error: 9 root error(s) found.
(0) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_63]]
(1) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_95]]
(2) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_111]]
(3) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_47]]
(4) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_35854588329 ... [truncated]
2022-12-16 06:28:40.940134: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find the relevant tensor remote_handle: Op ID: 17806, Output num: 0
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"#1671172120.940038755","description":"Error received from peer ipv4:10.0.0.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 17806, Output num: 0","grpc_status":3}
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
/tmp/ipykernel_20/2809624530.py in <module>
11 # Train the discriminator & generator on one batch of real images.
12 for step, (low, high) in enumerate(zip(dataL_all, dataH_all)):
---> 13 loss_ = train_step(low, high)
14 historic.append(loss_)
15
/tmp/ipykernel_20/2809624530.py in train_step(low, high)
1 def train_step(low, high):
----> 2 loss_ = model.train_on_batch([low,], [high,])
3 # tensorboard.on_epoch_end(epoch, named_logs(model, loss_))
4 return loss_
5
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight, reset_metrics, return_dict)
1729 if reset_metrics:
1730 self.reset_metrics()
-> 1731 logs = tf_utils.to_numpy_or_python_type(logs)
1732 if return_dict:
1733 return logs
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/utils/tf_utils.py in to_numpy_or_python_type(tensors)
512 return t # Don't turn ragged or sparse tensors to NumPy.
513
--> 514 return nest.map_structure(_to_single_numpy_or_python_type, tensors)
515
516
/opt/conda/lib/python3.7/site-packages/tensorflow/python/util/nest.py in map_structure(func, *structure, **kwargs)
657
658 return pack_sequence_as(
--> 659 structure[0], [func(*x) for x in entries],
660 expand_composites=expand_composites)
661
/opt/conda/lib/python3.7/site-packages/tensorflow/python/util/nest.py in <listcomp>(.0)
657
658 return pack_sequence_as(
--> 659 structure[0], [func(*x) for x in entries],
660 expand_composites=expand_composites)
661
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/utils/tf_utils.py in _to_single_numpy_or_python_type(t)
508 def _to_single_numpy_or_python_type(t):
509 if isinstance(t, ops.Tensor):
--> 510 x = t.numpy()
511 return x.item() if np.ndim(x) == 0 else x
512 return t # Don't turn ragged or sparse tensors to NumPy.
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in numpy(self)
1069 """
1070 # TODO(slebedev): Consider avoiding a copy for non-CPU or remote tensors.
-> 1071 maybe_arr = self._numpy() # pylint: disable=protected-access
1072 return maybe_arr.copy() if isinstance(maybe_arr, np.ndarray) else maybe_arr
1073
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in _numpy(self)
1037 return self._numpy_internal()
1038 except core._NotOkStatusException as e: # pylint: disable=protected-access
-> 1039 six.raise_from(core._status_to_exception(e.code, e.message), None) # pylint: disable=protected-access
1040
1041 #property
/opt/conda/lib/python3.7/site-packages/six.py in raise_from(value, from_value)
InvalidArgumentError: 9 root error(s) found.
(0) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_63]]
(1) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_95]]
(2) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_111]]
(3) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_47]]
(4) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_35854588329 ... [truncated]

Colab TPU: INTERNAL: {{function_node __inference_train_function_7167}} failed to connect to all addresses

I am currently trying to train a model for my bachelor's thesis. The training ETA is very long, though, so I have considered using TPUs. However, every time I try to train with a TPU strategy, following this Google notebook, I keep getting the following error:
(0) INTERNAL: {{function_node __inference_train_function_7167}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1651692210.674048314","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3124,"referenced_errors":[{"created":"#1651692210.674047476","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[strided_slice_69/_310]]
Error as shown in Colab
You can check my TPU boilerplate code here:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
    print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
except ValueError:
    raise BaseException('ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')

tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)
My dataset is stored in my Google Drive as images.
I am trying to train using tf.keras.Model.fit.

Error while training object detection model in TF2

While training using the TensorFlow Object Detection API from Google Colab, I got the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: assertion failed: [[0.397090524]] [[0.199330032]]
[[{{node Assert/AssertGuard/else/_25/Assert/AssertGuard/Assert}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[[IteratorGetNext]]
(1) Invalid argument: assertion failed: [[0.397090524]] [[0.199330032]]
[[{{node Assert/AssertGuard/else/_25/Assert/AssertGuard/Assert}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[[IteratorGetNext]]
[[Loss/classification_loss_1/write_summary/ReadVariableOp/_48]]
0 successful operations.
0 derived errors ignored. [Op:__inference__dist_train_step_62874]
Function call stack:
_dist_train_step -> _dist_train_step
Can someone please help me?

Compilation failure: Detected unsupported operations when trying to compile graph get_loss_cond_1_true_88089_rewritten[]

I'm getting the following error on a Google Colab TPU when I try to use a custom CRF loss function.
I checked https://cloud.google.com/tpu/docs/tensorflow-ops for the FakeParam operation, and it looks like the operator is available on Cloud TPU.
InvalidArgumentError: 9 root error(s) found.
(0) Invalid argument: {{function_node __inference_train_function_104228}} Compilation failure: Detected unsupported operations when trying to compile graph get_loss_cond_1_true_88089_rewritten[] on XLA_TPU_JIT: FakeParam (No registered 'FakeParam' OpKernel for XLA_TPU_JIT devices compatible with node {{node get_loss/cond_1/FakeParam_15}} (OpKernel was found, but attributes didn't match) Requested Attributes: dtype=DT_VARIANT, shape=[]){{node get_loss/cond_1/FakeParam_15}}
[[get_loss/cond_1]]
TPU compilation failed
[[tpu_compile_succeeded_assert/_12238515605435969423/_6]]
[[tpu_compile_succeeded_assert/_12238515605435969423/_6/_279]]
(1) Invalid argument: {{function_node __inference_train_function_104228}} Compilation failure: Detected unsupported operations when trying to compile graph get_loss_cond_1_true_88089_rewritten[] on XLA_TPU_JIT: FakeParam (No registered 'FakeParam' OpKernel for XLA_TPU_JIT devices compatible with node {{node get_loss/cond_1/FakeParam_15}} (OpKernel was found, but attributes didn't match) Requested Attributes: dtype=DT_VARIANT, shape=[]){{node get_loss/cond_1/FakeParam_15}}
[[get_loss/cond_1]]
TPU compilation failed
[[tpu_compile_succeeded_assert/_12238515605435969423/_6]]
[[tpu_compile_succeeded_assert/_12238515605435969423/_6/_223]]
(2) Invalid argument: {{function_node __inference_train_function_104228}} Compilation failure: Detected unsupported operations when trying to compile graph get_loss_cond_1_true_88089_rewritten[] on XLA_TPU_JIT: FakeParam (No registered 'FakeParam' OpKernel for XLA_TPU_JIT devices compatible with node {{node get_loss/cond_1/FakeParam_15}} (OpKernel was found, but attributes didn't match) Requested Attributes: dtype=DT_VARIANT, shape=[]){{node get_loss/cond_1/FakeParam_15}}
[[get_loss/cond_1]]
TPU compilation failed
[[tpu_compile_succeeded_assert/_12238515605435969423/_6]]
[[tpu_compile_succeeded_assert/_12238515605435969423/_6/_265]]
(3) Invalid argument: {{function_node __inference_train_function_104228}} Compilation failure: Detected unsupported operations when trying to compile graph get_loss_cond_1_true_88089_rewritten[] on XLA_TPU_JIT: FakeParam (No registered 'FakeParam' OpKernel for XLA_TPU_JIT devices compatible with node {{node get_loss/cond_1/FakeParam_15}} (OpKernel was found, but attributes didn't match) Requested Attributes: dtype=DT_VARIANT, shape=[]){{node get_loss/cond_1/FakeParam_15}}
[[get_loss/cond_1]]
TPU compilation failed
[[tpu_compile_succeeded_assert/_12238515605435969423/_6]]
[[tpu_compile_succeeded_assert/_12238515605435969423/_6/_251]]
(4) Invalid argument: {{function_node __inference_train_function_104228}} Compilation failure: Detected unsupported operations when trying to compile graph get_loss_cond_1_true_88089_rewritten[] on XLA_TPU_JIT: FakeParam (No registered 'FakeParam' OpKernel for XLA_TPU_JIT devices compatible with node {{node get_loss/cond_1/FakeParam_15}} (OpKernel was found, but attributes didn't match) Requested Attributes: dtype=DT_VARIANT, shape=[ ... [truncated]
Here is my code:
def make_model():
    input_ids_in = tf.keras.layers.Input(shape=(100,), name='input_token', dtype=tf.int32)
    input_mask_in = tf.keras.layers.Input(shape=(100,), name='input_mask', dtype=tf.int32)
    bert_model = TFAutoModel.from_pretrained("dbmdz/bert-base-turkish-cased")
    embedding_layer = bert_model(input_ids_in, attention_mask=input_mask_in)[0]
    model = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50, trainable=False,
                                          return_sequences=True))(embedding_layer)
    model = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(len(labels_ner), activation="relu"))(model)
    crf = CRF(len(labels_ner))  # CRF layer
    out = crf(model)  # output
    model = Model([input_ids_in, input_mask_in], out)
    model.compile('adam', loss=crf.get_loss)
    print("Baseline/LSTM-CRF model built: ")
    return model

with strategy.scope():
    model = make_model()

model.fit(x_tr, np.argmax(y_tr, axis=-1), batch_size=32, epochs=5, verbose=1, validation_split=0.1)
I used this tensorflow_addons crf.py module: https://github.com/howl-anderson/addons/blob/feature/crf_layers/tensorflow_addons/layers/crf.py
Thanks!
It looks like FakeParam is supported only for these dtypes: {bfloat16, bool, complex64, float, int32, int64, uint32, uint64}, and not for dtype=DT_VARIANT.
Enabling automatic outside compilation on TF2 should resolve this; please add this line somewhere:
tf.config.set_soft_device_placement(True)
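For context, a sketch of where that line would go relative to the usual TPU setup (the boilerplate mirrors the snippets earlier in this thread):
import tensorflow as tf

# Let ops without a TPU kernel (such as FakeParam with dtype=DT_VARIANT)
# fall back to the CPU host instead of failing XLA compilation.
tf.config.set_soft_device_placement(True)

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)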

Error Training Keras Model on Google Colab using TPU runtime

I am trying to create and train my CNN model using a TPU in Google Colab. I plan to use it to classify dogs and cats. The model works on the GPU/CPU runtime, but I have trouble running it on the TPU runtime. The code for creating my model is below.
I used the flow_from_directory() function to load my dataset; here's the code for that:
train_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
    MAIN_DIR,
    target_size=(128, 128),
    batch_size=50,
    class_mode='binary'
)
def create_model():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(128, 128, 3)))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Conv2D(128, (3, 3), activation='relu'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(512, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(Dense(2, activation='softmax'))
    return model
Here is the code used to initiate the TPU on Google Colab:
tf.keras.backend.clear_session()

resolver = tf.distribute.cluster_resolver.TPUClusterResolver('grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
# This is the TPU initialization code that has to be at the beginning.
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))
strategy = tf.distribute.experimental.TPUStrategy(resolver)

with strategy.scope():
    model = create_model()
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss='sparse_categorical_crossentropy',
        metrics=['sparse_categorical_accuracy'])

model.fit(
    train_generator,
    epochs=5,
)
But when I run this code, I am greeted with this error:
UnavailableError Traceback (most recent call last)
<ipython-input-15-1970b3405ba3> in <module>()
20 model.fit(
21 train_generator,
---> 22 epochs = 5,
23
24 )
14 frames
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)
UnavailableError: 5 root error(s) found.
(0) Unavailable: {{function_node __inference_train_function_42823}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1598016644.748265484","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"#1598016644.748262999","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[cond_11/switch_pred/_107/_78]]
(1) Unavailable: {{function_node __inference_train_function_42823}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1598016644.748265484","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"#1598016644.748262999","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[cond_12/switch_pred/_118/_82]]
(2) Unavailable: {{function_node __inference_train_function_42823}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1598016644.748265484","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"#1598016644.748262999","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[TPUReplicate/_compile/_7955920754087029306/_4/_266]]
(3) Unavailable: {{function_node __inference_train_function_42823}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1598016644.748265484","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"#1598016644.748262999","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[Shape_7/_104]]
(4) Unavailable: {{functi ... [truncated]
I really have no idea how to fix this, nor do I know what these errors mean.
You're hitting a known problem with TPUs: they don't support PyFunction. Details in #38762, #34346, and #39099:
Sorry for the issue. Dataset.from_generator is expected to not work with TPUs as it uses py_function underneath, which is incompatible with the Cloud TPU 2VM setup. If you would like to read from large datasets, maybe try to materialize it on disk and use TFRecordDataset instead.
Since ImageDataGenerator also uses PyFunction under the hood, it is incompatible with TPUs. Instead, you have to use the tf.data API to load images. This tutorial explains how to do it.
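As a rough sketch of that approach (the directory layout follows the question; the class-name check and everything else are assumptions), the generator can be replaced with a pure tf.data pipeline. Note that with a TPU the image files would still need to live somewhere the TPU workers can read, such as GCS:
import tensorflow as tf

def load_image(path):
    # label comes from the parent directory name; 'dogs' is a hypothetical class name
    label = tf.cast(tf.strings.split(path, '/')[-2] == 'dogs', tf.int32)
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    image = tf.image.resize(image, (128, 128)) / 255.0
    return image, label

train_ds = (tf.data.Dataset.list_files(MAIN_DIR + '/*/*')
            .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
            .batch(50, drop_remainder=True)
            .prefetch(tf.data.AUTOTUNE))

model.fit(train_ds, epochs=5)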