Mask RCNN Keras custom image dimensions - tensorflow

I am trying to follow the example provided by https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#configure-the-training-pipeline to train the faster_rcnn_resnet50_v1_640x640_coco17_tpu-8 model on images sized 1221 by 1562. Even after configuring the pipeline with the following image_resizer block:
image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 1221
    max_dimension: 1562
    pad_to_max_dimension: false
  }
}
I still get the following error message snippet:
Node: 'mask_rcnn_keras_box_predictor/mask_rcnn_class_head/Reshape'
2 root error(s) found.
(0) INVALID_ARGUMENT: Input to reshape is a tensor with 27300 values, but the requested shape requires a multiple of 109
[[{{node mask_rcnn_keras_box_predictor/mask_rcnn_class_head/Reshape}}]]
[[Identity_29/_1566]]
(1) INVALID_ARGUMENT: Input to reshape is a tensor with 27300 values, but the requested shape requires a multiple of 109
[[{{node mask_rcnn_keras_box_predictor/mask_rcnn_class_head/Reshape}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_compute_eval_dict_28788] exception.
Any help in resolving this error is greatly appreciated!
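One hedged observation, based only on the numbers in the error rather than on the full pipeline config (which is not shown): the class head reshapes its output to [batch, num_proposals, num_classes + 1], and 27300 = 300 × 91, which matches 300 proposals and COCO's 90 classes plus background, while a required multiple of 109 would correspond to num_classes: 108. If that reading is right, the error may come from a num_classes mismatch between the pipeline config and the checkpoint or label map, independent of the image dimensions. The block to double-check would be:
model {
  faster_rcnn {
    num_classes: 90  # must match the number of classes in your label_map.pbtxt
    ...
  }
}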

Related

Onnx load error: The input tensor cannot be reshaped to the requested shape

I want to convert a TensorFlow model to an ONNX file. The conversion is successful and the file is saved, but when I load the ONNX model with onnxruntime, it throws an error:
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Loop node. Name:'generic_loop_Loop__518' Status Message: Non-zero status code returned while running Reshape node. Name:'NmtModel/while/Reshape_1' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/tensor/reshape_helper.h:37 onnxruntime::ReshapeHelper::ReshapeHelper(const onnxruntime::TensorShape&, std::vector&) size != 0 && (input_shape.Size() % size) == 0 was false. The input tensor cannot be reshaped to the requested shape. Input shape:{1,4,0,512}, requested shape:{1,-1,0,512}
It seems the reshape operation cannot accept the actual tensor shape of {1,4,0,512}. How can I solve this?
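For intuition, here is a minimal NumPy sketch (my own illustration, not from the original question) of why this reshape fails: with a zero-sized dimension the tensor holds 0 elements in total, so the -1 placeholder is ambiguous, since any value would satisfy it.
import numpy as np

x = np.zeros((1, 4, 0, 512), dtype=np.float32)  # 0 elements in total
# Raises ValueError: the -1 dimension cannot be inferred when the size is 0.
y = x.reshape(1, -1, 0, 512)
A common fix is to guard against empty tensors upstream (for example, skip the loop body when the sequence length is 0) or to export with a fixed dimension instead of -1.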

InvalidArgumentError after step 6 during my training on TPU at Kaggle

I am getting an error when I train my model with a TPU on Kaggle. I have already looked around the internet but cannot find a solution. Note that training works if I use a GPU, but then I run into an OOM error.
Here are the parameters:
batch_size = 8 * strategy.num_replicas_in_sync
tensorflow v 2.4.1
TPU v3-8
Here is my code
try:  # detect TPUs
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()  # TPU detection
    strategy = tf.distribute.TPUStrategy(tpu)
except ValueError:  # detect GPUs
    strategy = tf.distribute.MirroredStrategy()  # for GPU or multi-GPU machines

image_shape = (224, 224, 3)
batch_size = 8 * strategy.num_replicas_in_sync
N_epoch = 200  # number of epochs

# I created the model using the strategy
with strategy.scope():
    model = ...  # I create the model here

def train_step(in_one, in_two):
    loss_ = model.train_on_batch([in_one,], [in_two,])
    return loss_

for epoch in range(N_epoch):
    for step, (in_one, in_two) in enumerate(zip(data_one, data_two)):
        loss_ = train_step(in_one, in_two)
        if step % 1 == 0:
            print("epoch> %d | step> %d | loss>%.2f" % (epoch + 1, step, loss_))
The training process works fine until it reaches step 6.
Here is the output of my training, including the error:
epoch> 1 | step> 0 | loss>0.09
epoch> 1 | step> 1 | loss>0.08
epoch> 1 | step> 2 | loss>0.07
epoch> 1 | step> 3 | loss>0.07
epoch> 1 | step> 4 | loss>0.07
epoch> 1 | step> 5 | loss>0.08
epoch> 1 | step> 6 | loss>0.07
2022-12-16 06:28:40.938812: W tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:76] Unable to destroy remote tensor handles. If you are running a tf.function, it usually indicates some op in the graph gets an error: 9 root error(s) found.
(0) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_63]]
(1) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_95]]
(2) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_111]]
(3) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_47]]
(4) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_35854588329 ... [truncated]
2022-12-16 06:28:40.940134: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find the relevant tensor remote_handle: Op ID: 17806, Output num: 0
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"#1671172120.940038755","description":"Error received from peer ipv4:10.0.0.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 17806, Output num: 0","grpc_status":3}
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
/tmp/ipykernel_20/2809624530.py in <module>
11 # Train the discriminator & generator on one batch of real images.
12 for step, (low, high) in enumerate(zip(dataL_all, dataH_all)):
---> 13 loss_ = train_step(low, high)
14 historic.append(loss_)
15
/tmp/ipykernel_20/2809624530.py in train_step(low, high)
1 def train_step(low, high):
----> 2 loss_ = model.train_on_batch([low,], [high,])
3 # tensorboard.on_epoch_end(epoch, named_logs(model, loss_))
4 return loss_
5
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight, reset_metrics, return_dict)
1729 if reset_metrics:
1730 self.reset_metrics()
-> 1731 logs = tf_utils.to_numpy_or_python_type(logs)
1732 if return_dict:
1733 return logs
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/utils/tf_utils.py in to_numpy_or_python_type(tensors)
512 return t # Don't turn ragged or sparse tensors to NumPy.
513
--> 514 return nest.map_structure(_to_single_numpy_or_python_type, tensors)
515
516
/opt/conda/lib/python3.7/site-packages/tensorflow/python/util/nest.py in map_structure(func, *structure, **kwargs)
657
658 return pack_sequence_as(
--> 659 structure[0], [func(*x) for x in entries],
660 expand_composites=expand_composites)
661
/opt/conda/lib/python3.7/site-packages/tensorflow/python/util/nest.py in <listcomp>(.0)
657
658 return pack_sequence_as(
--> 659 structure[0], [func(*x) for x in entries],
660 expand_composites=expand_composites)
661
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/utils/tf_utils.py in _to_single_numpy_or_python_type(t)
508 def _to_single_numpy_or_python_type(t):
509 if isinstance(t, ops.Tensor):
--> 510 x = t.numpy()
511 return x.item() if np.ndim(x) == 0 else x
512 return t # Don't turn ragged or sparse tensors to NumPy.
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in numpy(self)
1069 """
1070 # TODO(slebedev): Consider avoiding a copy for non-CPU or remote tensors.
-> 1071 maybe_arr = self._numpy() # pylint: disable=protected-access
1072 return maybe_arr.copy() if isinstance(maybe_arr, np.ndarray) else maybe_arr
1073
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in _numpy(self)
1037 return self._numpy_internal()
1038 except core._NotOkStatusException as e: # pylint: disable=protected-access
-> 1039 six.raise_from(core._status_to_exception(e.code, e.message), None) # pylint: disable=protected-access
1040
1041 @property
/opt/conda/lib/python3.7/site-packages/six.py in raise_from(value, from_value)
InvalidArgumentError: 9 root error(s) found.
(0) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_63]]
(1) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_95]]
(2) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_111]]
(3) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_358545883293887456/_5/_47]]
(4) Invalid argument: {{function_node __inference_train_function_69334}} Compilation failure: Reshape's input dynamic dimension is decomposed into multiple output dynamic dimensions, but the constraint is ambiguous and XLA can't infer the output dimension %reshape.7586 = f32[3920,16,384]{2,1,0} reshape(f32[62720,384]{1,0} %transpose.7577), metadata={op_type="Reshape" op_name="model/swin_transformer_block_1/window_attention_1/name1/attn/qkv/Tensordot"}.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_358545883293887456/_5}}]]
[[tpu_compile_succeeded_assert/_35854588329 ... [truncated]
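One assumption worth testing: XLA compiles a fixed graph per input shape on the TPU, and the Swin window-attention reshape in the error involves a dimension XLA cannot infer, which is a typical symptom of batches whose size is not static (for example, a smaller final batch). Assuming data_one and data_two hold whole arrays of examples rather than pre-made batches, a sketch of feeding statically shaped batches through tf.data:
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((data_one, data_two))
# drop_remainder=True keeps every batch the same static size,
# which TPU/XLA compilation generally requires.
dataset = dataset.batch(batch_size, drop_remainder=True)

for epoch in range(N_epoch):
    for step, (in_one, in_two) in enumerate(dataset):
        loss_ = train_step(in_one, in_two)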

Error while training object detection model in TF2

While training with the TensorFlow Object Detection API on Google Colab, I got the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: assertion failed: [[0.397090524]] [[0.199330032]]
[[{{node Assert/AssertGuard/else/_25/Assert/AssertGuard/Assert}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[[IteratorGetNext]]
(1) Invalid argument: assertion failed: [[0.397090524]] [[0.199330032]]
[[{{node Assert/AssertGuard/else/_25/Assert/AssertGuard/Assert}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[[IteratorGetNext]]
[[Loss/classification_loss_1/write_summary/ReadVariableOp/_48]]
0 successful operations.
0 derived errors ignored. [Op:__inference__dist_train_step_62874]
Function call stack:
_dist_train_step -> _dist_train_step
Can someone please help me?
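One hedged guess: this assertion usually fires inside the input pipeline while decoding ground-truth boxes, and the two printed values look like a box whose minimum coordinate exceeds its maximum. If your annotations live in a CSV (the file name and column names below are hypothetical), a quick sanity check could look like:
import pandas as pd

df = pd.read_csv("train_labels.csv")  # hypothetical annotations file
# Flag boxes with inverted or negative coordinates.
bad = df[(df["xmin"] >= df["xmax"]) | (df["ymin"] >= df["ymax"]) |
         (df[["xmin", "ymin", "xmax", "ymax"]] < 0).any(axis=1)]
print(bad)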

Keras BatchNormalization layer incompatibility error

I have the following (partial) network architecture, obtained by:
...
pool = GlobalAvgPool()(gc_2)
predictions = Dense(units=32, activation='relu', use_bias=False)(pool)
predictions = BatchNormalization()(predictions)
...
I am trying to insert a batch normalization layer, but I get the following error:
ValueError: Input 0 of layer batch_normalization_1 is incompatible with the layer: expected ndim=2, found ndim=3. Full shape received: [None, 1, 32]
I am guessing the second dimension is causing this mishap. Is there any way I can get rid of it?
If your model compiles successfully, there is no problem with your model definition.
This error is more likely to happen because the input data's shape and dimensions are incompatible with your model's expected input shape.
expected ndim=2, found ndim=3 means that the layer requires a 2D tensor of shape (batch_size, features), but it received a 3D tensor of shape [None, 1, 32].
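As an illustration (a sketch I am adding, assuming the stray middle dimension is always of size 1), the extra dimension can be squeezed out before the BatchNormalization layer with a Reshape, continuing the snippet above:
from tensorflow.keras.layers import BatchNormalization, Dense, Reshape

predictions = Dense(units=32, activation='relu', use_bias=False)(pool)
predictions = Reshape((32,))(predictions)  # [None, 1, 32] -> [None, 32]
predictions = BatchNormalization()(predictions)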

TensorFlow fake-quantize layers are also called from TF-Lite

I'm using TensorFlow 2.1 in order to train models with quantization-aware training.
The code to do that is:
import tensorflow_model_optimization as tfmot
model = tfmot.quantization.keras.quantize_annotate_model(model)
This will add fake-quantize nodes to the graph. These nodes should adjust the model's weights so they are easier to quantize to int8 and to work with int8 data.
When the training ends, I convert and quantize the model to TF-Lite like so:
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = [give data provider]
quantized_tflite_model = converter.convert()
At this point, I wouldn't expect to see the fake-quantize layers in the TF-Lite graph. But surprisingly, I do see them.
Moreover, when I run this quantized model in the TF-Lite C++ sample app, I see that it also runs the fake-quantize nodes during inference. In addition, it dequantizes and quantizes the activations between each layer.
That's a sample of the output from the C++ code:
Node 0 Operator Builtin Code 80 FAKE_QUANT
Inputs: 1
Outputs: 237
Node 1 Operator Builtin Code 114 QUANTIZE
Inputs: 237
Outputs: 238
Node 2 Operator Builtin Code 3 CONV_2D
Inputs: 238 59 58
Outputs: 167
Temporaries: 378
Node 3 Operator Builtin Code 6 DEQUANTIZE
Inputs: 167
Outputs: 239
Node 4 Operator Builtin Code 80 FAKE_QUANT
Inputs: 239
Outputs: 166
Node 5 Operator Builtin Code 114 QUANTIZE
Inputs: 166
Outputs: 240
Node 6 Operator Builtin Code 3 CONV_2D
Inputs: 240 61 60
Outputs: 169
I find all this very weird, especially considering that this model should run only on int8, while the fake-quantize nodes are getting float32 as inputs.
Any help here would be appreciated.
representative_dataset is mostly used with post-training quantization.
Comparing your commands with the QAT example, you probably want to remove that line.
https://www.tensorflow.org/model_optimization/guide/quantization/training_example
import os
import tempfile
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_tflite_model = converter.convert()

# Create float TFLite model.
float_converter = tf.lite.TFLiteConverter.from_keras_model(model)
float_tflite_model = float_converter.convert()

# Measure sizes of models.
_, float_file = tempfile.mkstemp('.tflite')
_, quant_file = tempfile.mkstemp('.tflite')

with open(quant_file, 'wb') as f:
    f.write(quantized_tflite_model)

with open(float_file, 'wb') as f:
    f.write(float_tflite_model)

print("Float model in Mb:", os.path.getsize(float_file) / float(2**20))
print("Quantized model in Mb:", os.path.getsize(quant_file) / float(2**20))
You can force TF Lite to use only INT8 operations:
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
If an error occurs, then some layers of your network do not yet have an INT8 implementation.
Furthermore, you could investigate your network using Netron.
Nonetheless, if you also want INT8 inputs and outputs, you need to adjust those as well:
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
However, there is currently an open issue regarding the input and output; see Issue #38285.
I have encountered the same issue. In my case, the quantized tflite model's size increases by ~3x with fake quantization. Does this happen for you as well? Inspecting the tflite graph in Netron shows quantization layers inserted between all ops.
My workaround so far is to instantiate a new copy of the model without fake quantization and then load the weights layer by layer from the quantization-aware-trained model, as sketched below. The weights can't be set directly on the whole model because the fake-quantization layers have parameters, too.
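A minimal sketch of that workaround (my own illustration, under the assumptions that tfmot wraps layers with a quant_ name prefix and that the wrapper exposes the original layer via its .layer attribute):
def copy_weights_from_qat(qat_model, clean_model):
    # Copy weights layer by layer from a quantization-aware-trained model
    # into a freshly built model without fake-quantize wrappers.
    clean_layers = {layer.name: layer for layer in clean_model.layers}
    for layer in qat_model.layers:
        # QuantizeWrapper is a Keras Wrapper, so the wrapped layer (holding
        # kernel/bias but not the quantizer variables) should be layer.layer.
        inner = getattr(layer, 'layer', layer)
        name = layer.name.replace('quant_', '', 1)  # assumed tfmot prefix
        target = clean_layers.get(name) or clean_layers.get(inner.name)
        if target is not None and target.get_weights():
            target.set_weights(inner.get_weights())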