Error while training object detection model in TF2 - google-colaboratory

While training with the TensorFlow Object Detection API on Google Colab, I got the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: assertion failed: [[0.397090524]] [[0.199330032]]
[[{{node Assert/AssertGuard/else/_25/Assert/AssertGuard/Assert}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[[IteratorGetNext]]
(1) Invalid argument: assertion failed: [[0.397090524]] [[0.199330032]]
[[{{node Assert/AssertGuard/else/_25/Assert/AssertGuard/Assert}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[[IteratorGetNext]]
[[Loss/classification_loss_1/write_summary/ReadVariableOp/_48]]
0 successful operations.
0 derived errors ignored. [Op:__inference__dist_train_step_62874]
Function call stack:
_dist_train_step -> _dist_train_step
Can someone please help me?

Related

Mask RCNN Keras custom image dimensions

I am trying to follow the example provided at https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#configure-the-training-pipeline to train the faster_rcnn_resnet50_v1_640x640_coco17_tpu-8 model on images sized 1221 by 1562. Even after configuring the pipeline with the following image_resizer settings:
image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 1221
    max_dimension: 1562
    pad_to_max_dimension: false
  }
}
I still get the following error message snippet:
Node: 'mask_rcnn_keras_box_predictor/mask_rcnn_class_head/Reshape'
2 root error(s) found.
(0) INVALID_ARGUMENT: Input to reshape is a tensor with 27300 values, but the requested shape requires a multiple of 109
[[{{node mask_rcnn_keras_box_predictor/mask_rcnn_class_head/Reshape}}]]
[[Identity_29/_1566]]
(1) INVALID_ARGUMENT: Input to reshape is a tensor with 27300 values, but the requested shape requires a multiple of 109
[[{{node mask_rcnn_keras_box_predictor/mask_rcnn_class_head/Reshape}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_compute_eval_dict_28788] exception.
Any help in resolving this error is greatly appreciated!

Colab TPU: INTERNAL: {{function_node __inference_train_function_7167}} failed to connect to all addresses

I am currently trying to train a model for my bachelor's degree.
The estimated training time is very long, so I have considered using TPUs. However, every time I try to train with a TPU strategy following this Google notebook, I keep getting the following error:
(0) INTERNAL: {{function_node __inference_train_function_7167}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1651692210.674048314","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3124,"referenced_errors":[{"created":"#1651692210.674047476","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[strided_slice_69/_310]]
Error as shown in Colab
You can check my TPU boilerplate code here:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
    print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
except ValueError:
    raise BaseException('ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')

tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)
My dataset is stored in my Google Drive as images.
I am trying to train using tf.keras.Model.fit.

Kaggle TPU Unavailable: failed to connect to all addresses

I'm a newbie to ML. While trying to do digit recognition with a TPU, I ran into problems that have really got me stuck.
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (InputLayer, Dropout, Conv2D, LeakyReLU,
                                     BatchNormalization, MaxPooling2D, Flatten, Dense)

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    Model = Sequential([
        InputLayer((28, 28, 1)),
        Dropout(0.1),
        Conv2D(128, 3, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        MaxPooling2D(2, 2),
        Conv2D(64, 3, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        MaxPooling2D(2, 2),
        Flatten(),
        Dense(128, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        Dense(10, activation='softmax')
    ])

with strategy.scope():
    Model.compile(optimizer='adam',
                  loss='categorical_crossentropy', metrics='accuracy')
CancelledError: 4 root error(s) found.
(0) Cancelled: Operation was cancelled
[[node IteratorGetNextAsOptional_1 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
(1) Cancelled: Iterator was cancelled
[[node IteratorGetNextAsOptional_6 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
(2) Cancelled: Operation was cancelled
[[node IteratorGetNextAsOptional_3 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
(3) Cancelled: Iterator was cancelled
[[node IteratorGetNextAsOptional_5 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
0 successful operations.
5 derived errors ignored. [Op:__inference_train_function_23675]
Function call stack:
train_function -> train_function -> train_function -> train_function
Then I ran it again and got the following errors:
UnavailableError: 9 root error(s) found.
(0) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[cond_11/switch_pred/_107/_78]]
(1) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[TPUReplicate/_compile/_7290104207349758044/_4/_178]]
(2) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[tpu_compile_succeeded_assert/_13543899577889784813/_5/_281]]
(3) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[strided_slice_37 ... [truncated] [Op:__inference_train_function_6939]
Function call stack:
train_function -> train_function -> train_function -> train_function
Maybe a strategy.scope() is missing somewhere?
I have tried many times and succeeded in many other notebooks, but those all use tf.data.Dataset.
Still, I can't figure out what is wrong in this simple digit-recognition case. I have searched again and again and have been stuck here for two days; it's really frustrating.
The full code is at
https://www.kaggle.com/dacianpeng/digit-hello-world?scriptVersionId=72464286
Version 6 is the TPU version; it only differs from Version 5 by the code above. Please help me!
It looks like you are storing your training data locally, which is causing the issue, as TPUs can only access data in GCS.
TPUs read training data exclusively from GCS (Google Cloud Storage); see the details here.
You can also check this Stack Overflow post: Colab TPU Error when calling model.fit(): UnimplementedError.
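For illustration, here is a minimal sketch of reading the images straight from a GCS bucket with tf.data; the gs:// path, the PNG format, and the omitted label handling are my assumptions, not from the question:
import tensorflow as tf

# Sketch: stream training images from GCS so the TPU workers can reach them.
# 'gs://your-bucket/train_images/*.png' is a hypothetical path, not from the question.
image_paths = tf.io.gfile.glob('gs://your-bucket/train_images/*.png')

def load_image(path):
    # Decode one 28x28 grayscale image directly from GCS and scale it to [0, 1].
    raw = tf.io.read_file(path)
    img = tf.io.decode_png(raw, channels=1)
    return tf.cast(img, tf.float32) / 255.0

ds = (tf.data.Dataset.from_tensor_slices(image_paths)
      .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
      .batch(128)
      .prefetch(tf.data.AUTOTUNE))
The same pattern works for TFRecords via tf.data.TFRecordDataset if the data has been converted to that format.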
Fixed the problem by changing the inputs to tf.data.Dataset (without GCS).
Calling fit() with a local tf.data.Dataset works fine, but it fails with Unavailable: failed to connect to all addresses once ImageDataGenerator() is used.
# Fixed by changing to tf.data.Dataset.
ds1 = tf.data.Dataset.from_tensor_slices((DS1, L1)).batch(128).prefetch(-1)
ds2 = tf.data.Dataset.from_tensor_slices((DS2, L2)).batch(128).prefetch(-1)
...
...
History = Model.fit(ds1, epochs=Epochs, validation_data=ds2,
                    callbacks=[ReduceLR, Stop], verbose=1)
# One epoch's time is not stable (sometimes faster, sometimes slower),
# but most of the time it is roughly the same as on GPU.
It fails once ImageDataGenerator() is used.
# Fails again when ImageDataGenerator() is used
ds1 = tf.data.Dataset.from_generator(
    lambda: ImageModifier.flow(DS1, L1),
    output_signature=(
        tf.TensorSpec(shape=(28, 28, 1), dtype=tf.float32),
        tf.TensorSpec(shape=(10,), dtype=tf.float32))
).batch(128).prefetch(-1)
History = Model.fit(ds1, epochs=Epochs, verbose=1)
---------------------------------------------------------------------------
UnavailableError Traceback (most recent call last)
<ipython-input-107-149f17c4776c> in <module>
1 Epochs = 15
----> 2 History = Model.fit(ds1, epochs=Epochs, verbose=1)
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
1100 tmp_logs = self.train_function(iterator)
1101 if data_handler.should_sync:
-> 1102 context.async_wait()
1103 logs = tmp_logs # No error, now safe to assign to logs.
1104 end_step = step + data_handler.step_increment
/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py in async_wait()
2328 an error state.
2329 """
-> 2330 context().sync_executors()
2331
2332
/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py in sync_executors(self)
643 """
644 if self._context_handle:
--> 645 pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
646 else:
647 raise ValueError("Context is not initialized.")
UnavailableError: 4 root error(s) found.
(0) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[Pad_2/paddings/_130]]
(1) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[strided_slice_36/_238]]
(2) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[IteratorGetNextAsOptional_3/_35]]
(3) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"#1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"#1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
0 successful operations.
5 derived errors ignored.
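Since the plain tf.data.Dataset pipeline above trains fine while the ImageDataGenerator-backed from_generator pipeline fails, one TPU-friendly workaround (a sketch, not part of the original post) is to do the augmentation in-graph with tf.image ops inside .map(), so no host-side Python generator is involved. DS1, L1, Model and Epochs are assumed to be the same objects as in the notebook above, and the specific augmentations are placeholders:
import tensorflow as tf

# Sketch: in-graph augmentation instead of ImageDataGenerator.
# DS1 / L1 are assumed to be the (N, 28, 28, 1) images and one-hot labels from above;
# the chosen augmentations are placeholders, tune them for your data.
def augment(image, label):
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, lower=0.9, upper=1.1)
    return image, label

ds1 = (tf.data.Dataset.from_tensor_slices((DS1, L1))
       .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
       .batch(128)
       .prefetch(tf.data.AUTOTUNE))

History = Model.fit(ds1, epochs=Epochs, verbose=1)
Because every step here is a TensorFlow op, the whole pipeline can be serialized and fed to the TPU workers, which a Python generator cannot.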

Tensorflow gpu not able to train my Xception model

I am training an Xception model with tensorflow-gpu. I am getting this error:
FailedPreconditionError: 2 root error(s) found.
(0) Failed precondition: Error while reading resource variable block14_sepconv2_bn_5/moving_variance from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/block14_sepconv2_bn_5/moving_variance/N10tensorflow3VarE does not exist.
[[{{node FusedBatchNormV3/ReadVariableOp_1}}]]
[[block9_sepconv3_bn_5/cond/else/_9661/OptionalFromValue_3/_1548]]
(1) Failed precondition: Error while reading resource variable block14_sepconv2_bn_5/moving_variance from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/block14_sepconv2_bn_5/moving_variance/N10tensorflow3VarE does not exist.
[[{{node FusedBatchNormV3/ReadVariableOp_1}}]]
0 successful operations.
0 derived errors ignored.
When I reload, it gives the error on a different conv layer. I have not used a GPU before, so I do not know how it works. Thanks for the help!
The problem has been sorted out. Though I am not exactly sure what happened, I think it was a memory issue. The lines of code below helped me a lot in managing my memory usage:
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession
config = ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.5
session = InteractiveSession(config=config)
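As a side note, a TF2-native way to get similar behaviour without the compat.v1 session is sketched below; it is not from the original answer, and the 4096 MB cap is an arbitrary example value:
import tensorflow as tf

# Run this before any op touches the GPU (i.e. before building the model).
# Option A: let allocations grow on demand instead of grabbing all GPU memory up front.
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# Option B (instead of A, not combined with it): cap usage, roughly mirroring
# per_process_gpu_memory_fraction=0.5; 4096 MB is an assumed value for an 8 GB card.
# if gpus:
#     tf.config.set_logical_device_configuration(
#         gpus[0],
#         [tf.config.LogicalDeviceConfiguration(memory_limit=4096)])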

Compilation failure: Detected unsupported operations when trying to compile graph get_loss_cond_1_true_88089_rewritten[]

I'm getting the following error on a Google Colab TPU when trying to use a custom CRF loss function.
I checked https://cloud.google.com/tpu/docs/tensorflow-ops for the FakeParam operation, and it looks like the operator is available on Cloud TPU.
InvalidArgumentError: 9 root error(s) found. (0) Invalid argument: {{function_node __inference_train_function_104228}} Compilation failure: Detected unsupported operations when trying to compile graph get_loss_cond_1_true_88089_rewritten[] on XLA_TPU_JIT: FakeParam (No registered 'FakeParam' OpKernel for XLA_TPU_JIT devices compatible with node {{node get_loss/cond_1/FakeParam_15}} (OpKernel was found, but attributes didn't match) Requested Attributes: dtype=DT_VARIANT, shape=[]){{node get_loss/cond_1/FakeParam_15}} [[get_loss/cond_1]] TPU compilation failed [[tpu_compile_succeeded_assert/_12238515605435969423/_6]] [[tpu_compile_succeeded_assert/_12238515605435969423/_6/_279]] (1) Invalid argument: {{function_node __inference_train_function_104228}} Compilation failure: Detected unsupported operations when trying to compile graph get_loss_cond_1_true_88089_rewritten[] on XLA_TPU_JIT: FakeParam (No registered 'FakeParam' OpKernel for XLA_TPU_JIT devices compatible with node {{node get_loss/cond_1/FakeParam_15}} (OpKernel was found, but attributes didn't match) Requested Attributes: dtype=DT_VARIANT, shape=[]){{node get_loss/cond_1/FakeParam_15}} [[get_loss/cond_1]] TPU compilation failed [[tpu_compile_succeeded_assert/_12238515605435969423/_6]] [[tpu_compile_succeeded_assert/_12238515605435969423/_6/_223]] (2) Invalid argument: {{function_node __inference_train_function_104228}} Compilation failure: Detected unsupported operations when trying to compile graph get_loss_cond_1_true_88089_rewritten[] on XLA_TPU_JIT: FakeParam (No registered 'FakeParam' OpKernel for XLA_TPU_JIT devices compatible with node {{node get_loss/cond_1/FakeParam_15}} (OpKernel was found, but attributes didn't match) Requested Attributes: dtype=DT_VARIANT, shape=[]){{node get_loss/cond_1/FakeParam_15}} [[get_loss/cond_1]] TPU compilation failed [[tpu_compile_succeeded_assert/_12238515605435969423/_6]] [[tpu_compile_succeeded_assert/_12238515605435969423/_6/_265]] (3) Invalid argument: {{function_node __inference_train_function_104228}} Compilation failure: Detected unsupported operations when trying to compile graph get_loss_cond_1_true_88089_rewritten[] on XLA_TPU_JIT: FakeParam (No registered 'FakeParam' OpKernel for XLA_TPU_JIT devices compatible with node {{node get_loss/cond_1/FakeParam_15}} (OpKernel was found, but attributes didn't match) Requested Attributes: dtype=DT_VARIANT, shape=[]){{node get_loss/cond_1/FakeParam_15}} [[get_loss/cond_1]] TPU compilation failed [[tpu_compile_succeeded_assert/_12238515605435969423/_6]] [[tpu_compile_succeeded_assert/_12238515605435969423/_6/_251]] (4) Invalid argument: {{function_node __inference_train_function_104228}} Compilation failure: Detected unsupported operations when trying to compile graph get_loss_cond_1_true_88089_rewritten[] on XLA_TPU_JIT: FakeParam (No registered 'FakeParam' OpKernel for XLA_TPU_JIT devices compatible with node {{node get_loss/cond_1/FakeParam_15}} (OpKernel was found, but attributes didn't match) Requested Attributes: dtype=DT_VARIANT, shape=[ ... [truncated]
Here is my code:
def make_model():
    input_ids_in = tf.keras.layers.Input(shape=(100,), name='input_token', dtype=tf.int32)
    input_mask_in = tf.keras.layers.Input(shape=(100,), name='input_mask', dtype=tf.int32)
    bert_model = TFAutoModel.from_pretrained("dbmdz/bert-base-turkish-cased")
    embedding_layer = bert_model(input_ids_in, attention_mask=input_mask_in)[0]
    model = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(50, trainable=False, return_sequences=True))(embedding_layer)
    model = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(len(labels_ner), activation="relu"))(model)
    crf = CRF(len(labels_ner))  # CRF layer
    out = crf(model)  # output
    model = Model([input_ids_in, input_mask_in], out)
    model.compile('adam', loss=crf.get_loss)
    print("Baseline/LSTM-CRF model built: ")
    return model

with strategy.scope():
    model = make_model()

model.fit(x_tr, np.argmax(y_tr, axis=-1), batch_size=32, epochs=5, verbose=1, validation_split=0.1)
I used this tensorflow_addons crf.py module: https://github.com/howl-anderson/addons/blob/feature/crf_layers/tensorflow_addons/layers/crf.py
Thanks.
It looks like FakeParam is supported only for these dtypes: {bfloat16, bool, complex64, float, int32, int64, uint32, uint64}, and not for dtype=DT_VARIANT.
Enabling automatic outside compilation on TF2 should resolve this; please add this line somewhere:
tf.config.set_soft_device_placement(True).
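For context, here is a minimal sketch of where that flag could go, reusing make_model() and the TPU setup from the question; placing it before TPU initialization and model building is my assumption:
import tensorflow as tf

# Allow ops that XLA cannot compile for the TPU (e.g. the DT_VARIANT FakeParam
# above) to fall back to the CPU host instead of failing compilation.
tf.config.set_soft_device_placement(True)

resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = make_model()  # make_model() as defined in the question above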