InvalidArgumentError after step 6 during my training on TPU at Kaggle - tensorflow

I am getting an error when I train my model on a TPU on Kaggle. I have already looked around the internet but can't find any solution. Note that training works if I use a GPU, but then I run into an OOM error.
Here are the parameters:
batch_size = 8 * strategy.num_replicas_in_sync
tensorflow v 2.4.1
TPU v3-8
Here is my code
try:  # detect TPUs
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()  # TPU detection
    strategy = tf.distribute.TPUStrategy(tpu)
except ValueError:  # detect GPUs
    strategy = tf.distribute.MirroredStrategy()  # for GPU or multi-GPU machines

image_shape = (224, 224, 3)
batch_size = 8 * strategy.num_replicas_in_sync
N_epoch = 200  # number of epochs

# I create the model under the strategy scope
with strategy.scope():
    model = ...  # I create the model here

def train_step(in_one, in_two):
    loss_ = model.train_on_batch([in_one,], [in_two,])
    return loss_

for epoch in range(N_epoch):
    for step, (in_one, in_two) in enumerate(zip(data_one, data_two)):
        loss_ = train_step(in_one, in_two)
        if step % 1 == 0:
            print("epoch> %d | step> %d | loss>%.2f" % (epoch + 1, step, loss_))
The training process works fine until it reaches step 6.
Here is my training output, followed by the error:
epoch> 1 | step> 0 | loss>0.09
epoch> 1 | step> 1 | loss>0.08
epoch> 1 | step> 2 | loss>0.07
epoch> 1 | step> 3 | loss>0.07
epoch> 1 | step> 4 | loss>0.07
epoch> 1 | step> 5 | loss>0.08
epoch> 1 | step> 6 | loss>0.07
2022-12-16 06:28:40.938812: W tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:76] Unable to destroy remote tensor handles. If you are running a tf.function, it usually indicates some op in the graph gets an error: 9 root error(s) found.
(0) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_63]]
(1) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_95]]
(2) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_111]]
(3) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_47]]
(4) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_35854588329 ... [truncated]
2022-12-16 06:28:40.940134: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find the relevant tensor remote_handle: Op ID: 17806, Output num: 0
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"#1671172120.940038755","description":"Error received from peer ipv4:10.0.0.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 17806, Output num: 0","grpc_status":3}
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
/tmp/ipykernel_20/2809624530.py in <module>
11 # Train the discriminator & generator on one batch of real images.
12 for step, (low, high) in enumerate(zip(dataL_all, dataH_all)):
---> 13 loss_ = train_step(low, high)
14 historic.append(loss_)
15
/tmp/ipykernel_20/2809624530.py in train_step(low, high)
1 def train_step(low, high):
----> 2 loss_ = model.train_on_batch([low,], [high,])
3 # tensorboard.on_epoch_end(epoch, named_logs(model, loss_))
4 return loss_
5
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight, reset_metrics, return_dict)
1729 if reset_metrics:
1730 self.reset_metrics()
-> 1731 logs = tf_utils.to_numpy_or_python_type(logs)
1732 if return_dict:
1733 return logs
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/utils/tf_utils.py in to_numpy_or_python_type(tensors)
512 return t # Don't turn ragged or sparse tensors to NumPy.
513
--> 514 return nest.map_structure(_to_single_numpy_or_python_type, tensors)
515
516
/opt/conda/lib/python3.7/site-packages/tensorflow/python/util/nest.py in map_structure(func, *structure, **kwargs)
657
658 return pack_sequence_as(
--> 659 structure[0], [func(*x) for x in entries],
660 expand_composites=expand_composites)
661
/opt/conda/lib/python3.7/site-packages/tensorflow/python/util/nest.py in <listcomp>(.0)
657
658 return pack_sequence_as(
--> 659 structure[0], [func(*x) for x in entries],
660 expand_composites=expand_composites)
661
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/utils/tf_utils.py in _to_single_numpy_or_python_type(t)
508 def _to_single_numpy_or_python_type(t):
509 if isinstance(t, ops.Tensor):
--> 510 x = t.numpy()
511 return x.item() if np.ndim(x) == 0 else x
512 return t # Don't turn ragged or sparse tensors to NumPy.
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in numpy(self)
1069 """
1070 # TODO(slebedev): Consider avoiding a copy for non-CPU or remote tensors.
-> 1071 maybe_arr = self._numpy() # pylint: disable=protected-access
1072 return maybe_arr.copy() if isinstance(maybe_arr, np.ndarray) else maybe_arr
1073
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in _numpy(self)
1037 return self._numpy_internal()
1038 except core._NotOkStatusException as e: # pylint: disable=protected-access
-> 1039 six.raise_from(core._status_to_exception(e.code, e.message), None) # pylint: disable=protected-access
1040
1041 #property
/opt/conda/lib/python3.7/site-packages/six.py in raise_from(value, from_value)
InvalidArgumentError: 9 root error(s) found.
(0) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_63]]
(1) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_95]]
(2) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_111]]
(3) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_47]]
(4) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_35854588329 ... [truncated]

Related

Mask RCNN Keras custom image dimensions

I am trying to follow the example provided by https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#configure-the-training-pipeline to train the faster_rcnn_resnet50_v1_640x640_coco17_tpu-8 model on images sized 1221 by 1562. Even after configuring the pipeline with the following image_resizer settings:
image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 1221
    max_dimension: 1562
    pad_to_max_dimension: false
  }
}
I still get the following error message snippet:
Node: 'mask_rcnn_keras_box_predictor/mask_rcnn_class_head/Reshape'
2 root error(s) found.
(0) INVALID_ARGUMENT: Input to reshape is a tensor with 27300 values, but the requested shape requires a multiple of 109
[[{{node mask_rcnn_keras_box_predictor/mask_rcnn_class_head/Reshape}}]]
[[Identity_29/_1566]]
(1) INVALID_ARGUMENT: Input to reshape is a tensor with 27300 values, but the requested shape requires a multiple of 109
[[{{node mask_rcnn_keras_box_predictor/mask_rcnn_class_head/Reshape}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_compute_eval_dict_28788] exception.
Any help in resolving this error is greatly appreciated!

Kaggle TPU Unavailable: failed to connect to all addresses

I'm a newbie to ML. While trying to do digit recognition with the TPU method, I encountered a problem that has really got me stuck.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    Model = Sequential([
        InputLayer((28, 28, 1)),
        Dropout(0.1),
        Conv2D(128, 3, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        MaxPooling2D(2, 2),
        Conv2D(64, 3, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        MaxPooling2D(2, 2),
        Flatten(),
        Dense(128, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        Dense(10, activation='softmax')
    ])

with strategy.scope():
    Model.compile(optimizer='adam',
                  loss='categorical_crossentropy', metrics='accuracy')
CancelledError: 4 root error(s) found.
(0) Cancelled: Operation was cancelled
[[node IteratorGetNextAsOptional_1 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
(1) Cancelled: Iterator was cancelled
[[node IteratorGetNextAsOptional_6 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
(2) Cancelled: Operation was cancelled
[[node IteratorGetNextAsOptional_3 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
(3) Cancelled: Iterator was cancelled
[[node IteratorGetNextAsOptional_5 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
0 successful operations.
5 derived errors ignored. [Op:__inference_train_function_23675]
Function call stack:
train_function -> train_function -> train_function -> train_function
Then I ran it again and got the following errors:
UnavailableError: 9 root error(s) found.
(0) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[cond_11/switch_pred/_107/_78]]
(1) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[TPUReplicate/_compile/_7290104207349758044/_4/_178]]
(2) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[tpu_compile_succeeded_assert/_13543899577889784813/_5/_281]]
(3) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[strided_slice_37 ... [truncated] [Op:__inference_train_function_6939]
Function call stack:
train_function -> train_function -> train_function -> train_function
There must be a strategy.scope() missing somewhere.
I have tried many times and succeeded in several other notebooks, but those all use tf.data.Dataset.
Still, I can't figure out what is wrong in this simple digit-recognition example. I have searched again and again and have been stuck here for 2 days; it's really frustrating.
The full code is at
https://www.kaggle.com/dacianpeng/digit-hello-world?scriptVersionId=72464286
Version 6 is the TPU version, and it differs from Version 5 only by the code above. Please help me!
It looks like you are storing your training data locally, which is causing the issue, as TPUs can only access data in GCS.
TPUs read training data exclusively from GCS (Google Cloud Storage); see the details here.
You can also check this Stack Overflow post: Colab TPU Error when calling model.fit(): UnimplementedError.
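For context, this is roughly what a GCS-backed tf.data pipeline can look like (a minimal sketch: the gs:// pattern, feature names, and decoding below are hypothetical, not taken from the question):
import tensorflow as tf

# Hypothetical TFRecord layout in a GCS bucket; TPU workers can read gs:// paths directly.
GCS_PATTERN = "gs://my-bucket/digits/train-*.tfrec"

feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    ex = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.cast(tf.io.decode_png(ex["image"], channels=1), tf.float32) / 255.0
    label = tf.one_hot(ex["label"], depth=10)
    return image, label

train_ds = (tf.data.TFRecordDataset(tf.io.gfile.glob(GCS_PATTERN))
            .map(parse_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)
            .batch(128, drop_remainder=True)  # static batch shape keeps the TPU graph happy
            .prefetch(tf.data.experimental.AUTOTUNE))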
I fixed the problem by changing the inputs to tf.data.Dataset (without GCS).
Calling fit() with a local tf.data.Dataset works fine, but it fails with Unavailable: failed to connect to all addresses once ImageDataGenerator() is used.
# Fixed by changing to tf.data.Dataset
ds1 = tf.data.Dataset.from_tensor_slices((DS1, L1)).batch(128).prefetch(-1)
ds2 = tf.data.Dataset.from_tensor_slices((DS2, L2)).batch(128).prefetch(-1)
...
...
History = Model.fit(ds1, epochs=Epochs, validation_data=ds2,
                    callbacks=[ReduceLR, Stop], verbose=1)
# one epoch's time is not stable, sometimes faster, sometimes slower,
# but most of the time it is approximately the same as on GPU
It fails again as soon as ImageDataGenerator() is used (a possible tf.data-only workaround is sketched after the traceback below):
# Fails again with ImageDataGenerator() used
ds1 = tf.data.Dataset.from_generator(
    lambda: ImageModifier.flow(DS1, L1),
    output_signature=(
        tf.TensorSpec(shape=(28, 28, 1), dtype=tf.float32),
        tf.TensorSpec(shape=(10,), dtype=tf.float32))
    ).batch(128).prefetch(-1)
History = Model.fit(ds1, epochs=Epochs, verbose=1)
---------------------------------------------------------------------------
UnavailableError Traceback (most recent call last)
<ipython-input-107-149f17c4776c> in <module>
1 Epochs = 15
----> 2 History = Model.fit(ds1, epochs=Epochs, verbose=1)
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
1100 tmp_logs = self.train_function(iterator)
1101 if data_handler.should_sync:
-> 1102 context.async_wait()
1103 logs = tmp_logs # No error, now safe to assign to logs.
1104 end_step = step + data_handler.step_increment
/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py in async_wait()
2328 an error state.
2329 """
-> 2330 context().sync_executors()
2331
2332
/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py in sync_executors(self)
643 """
644 if self._context_handle:
--> 645 pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
646 else:
647 raise ValueError("Context is not initialized.")
UnavailableError: 4 root error(s) found.
(0) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[Pad_2/paddings/_130]]
(1) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[strided_slice_36/_238]]
(2) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[IteratorGetNextAsOptional_3/_35]]
(3) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
0 successful operations.
5 derived errors ignored.
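As a possible workaround for the ImageDataGenerator case above (an assumption on my part, not from the thread): Dataset.from_generator wraps a Python callable that lives only in the local notebook process, so the remote TPU workers likely cannot pull batches from it. Expressing the augmentation with graph-compatible tf.data ops avoids the generator entirely. A sketch, reusing DS1, L1, Model, and Epochs from the snippets above, with illustrative augmentation choices (not the original ImageModifier settings):
import tensorflow as tf

def augment(image, label):
    # Graph-compatible augmentations that run inside the tf.data pipeline
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, 0.9, 1.1)
    return image, label

ds1 = (tf.data.Dataset.from_tensor_slices((DS1, L1))
       .shuffle(10_000)
       .map(augment, num_parallel_calls=tf.data.experimental.AUTOTUNE)
       .batch(128, drop_remainder=True)
       .prefetch(tf.data.experimental.AUTOTUNE))

History = Model.fit(ds1, epochs=Epochs, verbose=1)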

Error while training object detection model in TF2

While training with the TensorFlow Object Detection API on Google Colab, I got the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: assertion failed: [[0.397090524]] [[0.199330032]]
[[{{node Assert/AssertGuard/else/_25/Assert/AssertGuard/Assert}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[[IteratorGetNext]]
(1) Invalid argument: assertion failed: [[0.397090524]] [[0.199330032]]
[[{{node Assert/AssertGuard/else/_25/Assert/AssertGuard/Assert}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[[IteratorGetNext]]
[[Loss/classification_loss_1/write_summary/ReadVariableOp/_48]]
0 successful operations.
0 derived errors ignored. [Op:__inference__dist_train_step_62874]
Function call stack:
_dist_train_step -> _dist_train_step
Can someone please help me?

Compilation failure: Detected unsupported operations when trying to compile graph get_loss_cond_1_true_88089_rewritten[]

I'm getting the following error on a Google Colab TPU when trying to use a custom CRF loss function.
I checked https://cloud.google.com/tpu/docs/tensorflow-ops for the FakeParam operation, and it looks like the operator is available on Cloud TPU.
InvalidArgumentError: 9 root error(s) found. (0) Invalid argument: {{function_node __inference_train_function_104228}} Compilation failure: Detected unsupported operations when trying to compile graph get_loss_cond_1_true_88089_rewritten[] on XLA_TPU_JIT: FakeParam (No registered 'FakeParam' OpKernel for XLA_TPU_JIT devices compatible with node {{node get_loss/cond_1/FakeParam_15}} (OpKernel was found, but attributes didn't match) Requested Attributes: dtype=DT_VARIANT, shape=[]){{node get_loss/cond_1/FakeParam_15}} [[get_loss/cond_1]] TPU compilation failed [[tpu_compile_succeeded_assert/_12238515605435969423/_6]] [[tpu_compile_succeeded_assert/_12238515605435969423/_6/_279]] (1) Invalid argument: {{function_node __inference_train_function_104228}} Compilation failure: Detected unsupported operations when trying to compile graph get_loss_cond_1_true_88089_rewritten[] on XLA_TPU_JIT: FakeParam (No registered 'FakeParam' OpKernel for XLA_TPU_JIT devices compatible with node {{node get_loss/cond_1/FakeParam_15}} (OpKernel was found, but attributes didn't match) Requested Attributes: dtype=DT_VARIANT, shape=[]){{node get_loss/cond_1/FakeParam_15}} [[get_loss/cond_1]] TPU compilation failed [[tpu_compile_succeeded_assert/_12238515605435969423/_6]] [[tpu_compile_succeeded_assert/_12238515605435969423/_6/_223]] (2) Invalid argument: {{function_node __inference_train_function_104228}} Compilation failure: Detected unsupported operations when trying to compile graph get_loss_cond_1_true_88089_rewritten[] on XLA_TPU_JIT: FakeParam (No registered 'FakeParam' OpKernel for XLA_TPU_JIT devices compatible with node {{node get_loss/cond_1/FakeParam_15}} (OpKernel was found, but attributes didn't match) Requested Attributes: dtype=DT_VARIANT, shape=[]){{node get_loss/cond_1/FakeParam_15}} [[get_loss/cond_1]] TPU compilation failed [[tpu_compile_succeeded_assert/_12238515605435969423/_6]] [[tpu_compile_succeeded_assert/_12238515605435969423/_6/_265]] (3) Invalid argument: {{function_node __inference_train_function_104228}} Compilation failure: Detected unsupported operations when trying to compile graph get_loss_cond_1_true_88089_rewritten[] on XLA_TPU_JIT: FakeParam (No registered 'FakeParam' OpKernel for XLA_TPU_JIT devices compatible with node {{node get_loss/cond_1/FakeParam_15}} (OpKernel was found, but attributes didn't match) Requested Attributes: dtype=DT_VARIANT, shape=[]){{node get_loss/cond_1/FakeParam_15}} [[get_loss/cond_1]] TPU compilation failed [[tpu_compile_succeeded_assert/_12238515605435969423/_6]] [[tpu_compile_succeeded_assert/_12238515605435969423/_6/_251]] (4) Invalid argument: {{function_node __inference_train_function_104228}} Compilation failure: Detected unsupported operations when trying to compile graph get_loss_cond_1_true_88089_rewritten[] on XLA_TPU_JIT: FakeParam (No registered 'FakeParam' OpKernel for XLA_TPU_JIT devices compatible with node {{node get_loss/cond_1/FakeParam_15}} (OpKernel was found, but attributes didn't match) Requested Attributes: dtype=DT_VARIANT, shape=[ ... [truncated]
Here is my code:
def make_model():
    input_ids_in = tf.keras.layers.Input(shape=(100,), name='input_token', dtype=tf.int32)
    input_mask_in = tf.keras.layers.Input(shape=(100,), name='input_mask', dtype=tf.int32)
    bert_model = TFAutoModel.from_pretrained("dbmdz/bert-base-turkish-cased")
    embedding_layer = bert_model(input_ids_in, attention_mask=input_mask_in)[0]
    model = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(50, trainable=False, return_sequences=True))(embedding_layer)
    model = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(len(labels_ner), activation="relu"))(model)
    crf = CRF(len(labels_ner))  # CRF layer
    out = crf(model)  # output
    model = Model([input_ids_in, input_mask_in], out)
    model.compile('adam', loss=crf.get_loss)
    print("Baseline/LSTM-CRF model built: ")
    return model

with strategy.scope():
    model = make_model()

model.fit(x_tr, np.argmax(y_tr, axis=-1), batch_size=32, epochs=5, verbose=1, validation_split=0.1)
I used this tensorflow_addons crf.py module: https://github.com/howl-anderson/addons/blob/feature/crf_layers/tensorflow_addons/layers/crf.py
Thanks
It looks like FakeParam is supported only for these dtypes: {bfloat16, bool, complex64, float, int32, int64, uint32, uint64}, and not for dtype=DT_VARIANT.
Enabling automatic outside compilation on TF2 should resolve this; please add this line somewhere:
tf.config.set_soft_device_placement(True)
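A sketch of where the flag can go (the resolver lines are the standard Colab TPU boilerplate; make_model(), x_tr, and y_tr are assumed from the question above):
import numpy as np
import tensorflow as tf

# Setting this before building and running the model lets ops that XLA/TPU
# cannot compile (here the DT_VARIANT FakeParam from the CRF loss) fall back
# to the host via outside compilation instead of failing the whole graph.
tf.config.set_soft_device_placement(True)

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = make_model()  # make_model() as defined in the question

model.fit(x_tr, np.argmax(y_tr, axis=-1), batch_size=32, epochs=5,
          verbose=1, validation_split=0.1)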

Transformers for action recognition: Resource exhausted

I am trying to adapt the transformer code from https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/transformer.ipynb and use it for action recognition, but I keep getting Resource exhausted: OOM when allocating. I have an RTX Titan, which has 24 GB, so I find it very strange to be running into this kind of problem. My dataset is composed of 1000 actions with a variable number of frames (N), where each frame contains 84 float32 points (x, y). I combine N and the points to form a fairly long 1-D tensor for every action. Only one action has a tensor of length ~40K; all the others are around 10K, 20K, etc.
My batch = 1, num_layers = 2, d_model = 32, dff = 64, num_heads = 2.
A couple of batches are able to run before giving me this error.
EPOCH: 0
tf.Tensor([[336.655 324.18 310.146 ... 252.652 260.521 260.083]],
shape=(1, 15960), dtype=float32) tf.Tensor([[783 499 572 19 784]],
shape=(1, 5), dtype=int64)
Epoch 1 Batch 0 Loss 6.3220 Accuracy 0.2500
tf.Tensor([[323.237 320.201 310.713 ... 223.767 226.396 226.396]],
shape=(1, 13020), dtype=float32) tf.Tensor([[783 42 50 784]],
shape=(1, 4), dtype=int64)
Epoch 1 Batch 1 Loss 6.2927 Accuracy 0.2917
tf.Tensor([[343.387 331.561 316.581 ... 263.883 260.453 255.308]],
shape=(1, 12096), dtype=float32) tf.Tensor([[783 46 784]], shape=(1,
3), dtype=int64)
Epoch 1 Batch 2 Loss 6.1787 Accuracy 0.3611
tf.Tensor([[320.014 317.322 306.94 ... 219.537 220.311 221.472]],
shape=(1, 10332), dtype=float32) tf.Tensor([[783 334 784]], shape=(1,
3), dtype=int64)
Epoch 1 Batch 3 Loss 6.1479 Accuracy 0.3958
tf.Tensor([[224.648 218.128 208.176 ... 188.814 191.243 196.101]],
shape=(1, 27300), dtype=float32) tf.Tensor([[783 350 784]], shape=(1,
3), dtype=int64)
Below is the error I am getting:
2 root error(s) found. (0) Resource exhausted: OOM when allocating
tensor with shape[1,2,31332,31332] and type float on
/job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node
transformer_1/encoder_2/encoder_layer_5/multi_head_attention_16/Softmax
(defined at :32) ]] Hint: If you want
to see a list of allocated tensors when OOM happens, add
report_tensor_allocations_upon_oom to RunOptions for current
allocation info.
[[gradient_tape/transformer_1/encoder_2/embedding_4/embedding_lookup/Reshape/_280]]
Hint: If you want to see a list of allocated tensors when OOM happens,
add report_tensor_allocations_upon_oom to RunOptions for current
allocation info.
(1) Resource exhausted: OOM when allocating tensor with
shape[1,2,31332,31332] and type float on
/job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node
transformer_1/encoder_2/encoder_layer_5/multi_head_attention_16/Softmax
(defined at :32) ]] Hint: If you want
to see a list of allocated tensors when OOM happens, add
report_tensor_allocations_upon_oom to RunOptions for current
allocation info.
0 successful operations. 0 derived errors ignored.
[Op:__inference_train_step_20907]
Errors may have originated from an input operation. Input Source
operations connected to node
transformer_1/encoder_2/encoder_layer_5/multi_head_attention_16/Softmax:
transformer_1/encoder_2/encoder_layer_5/multi_head_attention_16/add
(defined at :28)
Input Source operations connected to node
transformer_1/encoder_2/encoder_layer_5/multi_head_attention_16/Softmax:
transformer_1/encoder_2/encoder_layer_5/multi_head_attention_16/add
(defined at :28)
Function call stack: train_step -> train_step
::UPDATED PROBLEM::
So I reduced my tensor input shape and was able to run it without resource exhaustion, BUT I am running into another problem. I keep getting:
2 root error(s) found.
(0) Invalid argument: indices[0,1923] = -1
is not in [0, 12936) [[node
transformer_1/encoder_2/embedding_4/embedding_lookup (defined at
:24) ]] (1) Invalid argument:
indices[0,1923] = -1 is not in [0, 12936) [[node
transformer_1/encoder_2/embedding_4/embedding_lookup (defined at
:24) ]]
[[transformer_1/encoder_2/embedding_4/embedding_lookup/_24]] 0
successful operations. 0 derived errors ignored.
[Op:__inference_train_step_17044]
Function call stack: train_step -> train_step
If my tensor inputs have fewer than 3000 elements I can run it successfully, but anything larger gives the above error. Has anyone run into this kind of problem? I have no idea what the error means or how to fix it :(
Any help is again appreciated.
Your sequence is too long: in a transformer every element attends to every other element, so you end up allocating a huge tensor (~8 GB here). The transformer needs a lot of memory to run, and it looks like 24 GB is not enough for this. Try a shorter sequence length.
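For a rough sense of scale, the shape in the OOM message already tells the story (a back-of-the-envelope sketch using the numbers from the error above):
# Self-attention materializes a score tensor of shape [batch, num_heads, seq_len, seq_len].
batch, num_heads, seq_len = 1, 2, 31332   # values from the OOM message above
bytes_per_float32 = 4
attention_bytes = batch * num_heads * seq_len * seq_len * bytes_per_float32
print(attention_bytes / 1024**3)  # ~7.3 GiB for a single attention map
The backward pass keeps additional copies of tensors like this, and there is one per attention layer, so 24 GB disappears quickly; since the cost grows with the square of the sequence length, halving the sequence roughly quarters it.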