Mask R-CNN for TPU on Google Colab - google-colaboratory

We are trying to build an image segmentation deep learning model using a Google Colab TPU. Our model is Mask R-CNN.
import os
import tensorflow as tf

TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']
tpu_model = tf.contrib.tpu.keras_to_tpu_model(
    model.keras_model,
    strategy=tf.contrib.tpu.TPUDistributionStrategy(
        tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)))
However, I am running into the following issue while converting our Mask R-CNN model to a TPU model:
ValueError: Layer <keras.engine.topology.InputLayer object at 0x7f58574f1940> has a variable shape in a non-batch dimension. TPU models must have constant shapes for all operations.
You may have to specify `input_length` for RNN/TimeDistributed layers.
Layer: <keras.engine.topology.InputLayer object at 0x7f58574f1940>
Input shape: (None, None, None, 3)
Output shape: (None, None, None, 3)
Appreciate any help.

Google recently released a tutorial on getting Mask R-CNN running on their TPUs. For this, they use an experimental Mask R-CNN model in Google's TPU GitHub repository (under models/experimental/mask_rcnn). Looking through the code, it appears they define the model with a fixed input size to overcome the issue you are seeing.
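For illustration, a fixed-shape Keras input might look like the sketch below (the sizes here are hypothetical; the experimental Mask R-CNN model chooses its own fixed image size and batch size):

import tensorflow as tf

# Hypothetical fixed shapes: 1024x1024 RGB images, batch size of 8.
# Every dimension, including the batch, is static, so XLA can compile the graph.
inputs = tf.keras.layers.Input(shape=(1024, 1024, 3), batch_size=8, name='input_image')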
See below for more explanation:
As #aman2930 points out, the shape of your input tensor is not static. This won't work because TensorFlow compiles models with XLA to run on a TPU, and XLA requires all tensor shapes to be defined at compile time. The Cloud TPU documentation calls this out specifically:
Static shapes

During regular usage TensorFlow attempts to determine the shapes of each tf.Tensor during graph construction. During execution any unknown shape dimensions are determined dynamically, see Tensor Shapes for more details.

To run on Cloud TPUs TensorFlow models are compiled using XLA. XLA uses a similar system for determining shapes at compile time. XLA requires that all tensor dimensions be statically defined at compile time. All shapes must evaluate to a constant, and not depend on external data, or stateful operations like variables or a random number generator.
That said, further down the document they mention that the input function runs on the CPU, so it isn't limited by the static XLA shape requirement. They point to batch size being the issue, not image size:
Static shapes and batch size

The input pipeline generated by your input_fn is run on CPU. So it is mostly free from the strict static shape requirements imposed by the XLA/TPU environment. The one requirement is that the batches of data fed from your input pipeline to the TPU have a static shape, as determined by the standard TensorFlow shape inference algorithm. Intermediate tensors are free to have dynamic shapes. If shape inference has failed, but the shape is known, it is possible to impose the correct shape using tf.set_shape().
So you could fix this by reformulating your model to have a fixed batch size, or by using tf.contrib.data.batch_and_drop_remainder as they suggest.
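For example, here is a minimal input-pipeline sketch using that TF 1.x contrib API (the TFRecord file and parse_example function are placeholders):

import tensorflow as tf

BATCH_SIZE = 8  # must be a compile-time constant for the TPU

def input_fn():
    dataset = tf.data.TFRecordDataset('train.tfrecord')  # hypothetical input file
    dataset = dataset.map(parse_example)                  # hypothetical parsing function
    # Drop the final partial batch so every batch has a static shape of BATCH_SIZE
    dataset = dataset.apply(tf.contrib.data.batch_and_drop_remainder(BATCH_SIZE))
    return dataset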

Could you please share the input data function? It is hard to tell the exact issue, but it seems that the shape of the tensor representing the input sample is not static.

Related

Cannot load tflite model, Did not get operators or tensors in subgraph 1

I have converted a tf model to tflite, and applied quantization in the process, but I cannot load it. The error was raised when I tried to do interpreter = tf.lite.Interpreter(tflite_model_path); the error message was:
ValueError: Did not get operators or tensors in subgraph 1.
Also during quantization, I got lots of these INFO messages for every dense layer in my model:
2021-09-06 04:38:40.879693: I tensorflow/lite/tools/optimize/quantize_weights.cc:217] Skipping quantization of tensor bert_token_clssfification/classifier/Tensordot/Shape that is not type float.
These messages confuse me greatly, because I'm sure those weights are of type float32. Any ideas what I'm doing wrong? Thanks!
I figured out the cause in my case. It was because I have dropout layers in my model, and I was using an input tf.bool tensor to explicitly control the training/inference mode of the dropout layers. Dropout is not currently supported in TFLite, and because I was explicitly controlling the dropout behaviour, the TFLite conversion could not remove the dropout operations.
The correct way to use dropout is to pass the training kwarg when calling the model: out = model(input_batch, training=True).
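A minimal sketch of the difference, assuming a simple Keras model with a Dropout layer (the model itself is illustrative):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10),
])

input_batch = tf.random.normal((2, 32))         # placeholder batch
out_train = model(input_batch, training=True)   # training mode: dropout is active
out_infer = model(input_batch, training=False)  # inference mode: dropout is a no-op, so the converter can strip it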

Set batch size of trained keras model to 1

I have a Keras model trained on my own dataset. However, after loading the weights, the summary shows None as the first dimension (the batch size).
I want to know how to fix the shape to a batch size of 1, as this is required for me to convert the model to TFLite with GPU support.
What worked for me was to specify batch size to the Input layer, like this:
input = layers.Input(shape=input_shape, batch_size=1, dtype='float32', name='images')
This then carried through the rest of the layers.
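Putting it together, a minimal sketch (build_model, the input shape, and the weights path are placeholders for your own architecture):

import tensorflow as tf
from tensorflow.keras import layers

input_shape = (224, 224, 3)  # hypothetical image size

# Rebuild the model with a static batch size of 1
input = layers.Input(shape=input_shape, batch_size=1, dtype='float32', name='images')
output = build_model(input)           # hypothetical function that recreates your architecture
model = tf.keras.Model(input, output)
model.load_weights('my_weights.h5')   # hypothetical path; the weights do not depend on the batch size

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()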
The bad news is that despite this "fix" the TFLite runtime still complains about dynamic tensors. I get these non-fatal errors in logcat when it runs:
E/tflite: third_party/tensorflow/lite/core/subgraph.cc:801 tensor.data.raw != nullptr was not true.
E/tflite: Attempting to use a delegate that only supports static-sized tensors with a graph that has dynamic-sized tensors (tensor#26 is a dynamic-sized tensor).
E/tflite: Ignoring failed application of the default TensorFlow Lite delegate indexed at 0.
The good news is that despite these errors it seems to be using the GPU anyway, based on performance testing.
I'm using:
tensorflow-lite-support:0.2.0
tensorflow-lite-metadata:0.2.1
tensorflow-lite:2.6.0
tensorflow:tensorflow-lite-gpu:2.3.0
Hopefully, they'll fix the runtime so it doesn't matter whether the batch size is 'None'. It shouldn't matter for doing inference.

Keras Upsampling2d -> tflite conversion results in failing shape inference and undefined output shape

The Keras UpSampling2D operation is converted into a subgraph with additional operations and an undefined shape, whereas converting the same model through plain TensorFlow produces no extra operations and the correct shape.
This leads to an undefined overall model output shape and to errors on device. How can this be fixed?
This behavior is described here: https://github.com/tensorflow/tensorflow/issues/45090
Keras by default sets the batch size to be dynamic. That means that the model input shape is [*,28,28], not [1,28,28].
The old (deprecated) converter used to ignore the dynamic batch and override it to 1, which is wrong, since this is not what the original model has; you can imagine how badly it would behave if you tried to resize the inputs at runtime.
The current converter handles the dynamic batch size correctly, and the generated model can be resized correctly at runtime.
That's why the sequence "Shape, StridedSlice, Pack" was not constant-folded: the shape depends on the input shape defined at runtime.
For a single-input model this can be fixed by setting a constant shape on the Keras model before saving:
model.input.set_shape(1 + model.input.shape[1:])
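For example, a minimal sketch of applying that fix right before conversion (the model path is hypothetical):

import tensorflow as tf

model = tf.keras.models.load_model('model.h5')    # hypothetical saved Keras model
model.input.set_shape(1 + model.input.shape[1:])  # pin the batch dimension to 1

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()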

Channels dimension index in the input shape while porting Pytorch models to Tensorflow

One of the major problems I've encountered when converting PyTorch models to TensorFlow through ONNX is slowness, which appears to be related to the input shape, even though I was able to get bit-exact outputs with the two frameworks.
While the PyTorch input shape is B,C,H,W, the TensorFlow input shape is B,H,W,C, where B,C,H,W stand for batch size, channels, height and width, respectively. Technically, I solve the input shape problem easily when working in TensorFlow, using two calls to np.swapaxes:
import numpy as np

# Single image, no batch size here yet
image = np.swapaxes(image, 0, 2)  # Swapping C and H dimensions - result: C,W,H
image = np.swapaxes(image, 1, 2)  # Swapping H and W dimensions - result: C,H,W (like PyTorch)
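The same reordering can also be written as a single np.transpose call, equivalent to the two swapaxes calls above:

image = np.transpose(image, (2, 0, 1))  # H,W,C -> C,H,W: axis order (2, 0, 1) picks C first, then H, then W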
The slowness problem seems to be related to the differences in the ways the convolutional operations are implemented in PyTorch vs Tensorflow. While PyTorch expects channels first, Tensorflow expects channels last.
As a result, when I visualize the models using Netron, the ONNX model looks abstract and makes sense (first image), whereas the TensorFlow .pb model looks like a big mess (second image).
Note: It appears that this problem has already concerned the authors of the onnx2keras library, which supports an experimental feature for changing the C,H,W ordering that originates in PyTorch into H,W,C.
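For reference, a minimal sketch of that experimental onnx2keras option (the input name and file path are placeholders):

import onnx
from onnx2keras import onnx_to_keras

onnx_model = onnx.load('model.onnx')  # hypothetical exported model
# change_ordering=True asks onnx2keras to rewrite the graph from C,H,W to channels-last H,W,C
k_model = onnx_to_keras(onnx_model, ['input'], change_ordering=True)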
Any idea how to overcome this limitation? Are there other options for more abstractly exporting PyTorch models into Tensorflow?
ONNX (from PyTorch): the graph shows a straight flow and the residual blocks.
TensorFlow (imported from the ONNX model): almost nothing looks like a series of predefined operations.

What does `training=True` mean when calling a TensorFlow Keras model?

In TensorFlow's official documentation, they always pass training=True when calling a Keras model in a training loop, for example, logits = mnist_model(images, training=True).
I tried help(tf.keras.Model.call) and it shows that
Help on function call in module tensorflow.python.keras.engine.network:
call(self, inputs, training=None, mask=None)
Calls the model on new inputs.
In this case `call` just reapplies
all ops in the graph to the new inputs
(e.g. build a new computational graph from the provided inputs).
Arguments:
inputs: A tensor or list of tensors.
training: Boolean or boolean scalar tensor, indicating whether to run
the `Network` in training mode or inference mode.
mask: A mask or list of masks. A mask can be
either a tensor or None (no mask).
Returns:
A tensor if there is a single output, or
a list of tensors if there are more than one outputs.
It says that training is a Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode, but I didn't find any information about these two modes.
In a nutshell, I don't know what the influence of this argument is. And what happens if I omit this argument when training?
Some neural network layers behave differently during training and inference, for example Dropout and BatchNormalization layers. For example:
During training, dropout will randomly drop out units and correspondingly scale up activations of the remaining units.
During inference, it does nothing (since you usually don't want the randomness of dropping out units here).
The training argument lets the layer know which of the two "paths" it should take. If you set this incorrectly, your network might not behave as expected.
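A quick way to see the effect with a standalone Dropout layer:

import tensorflow as tf

layer = tf.keras.layers.Dropout(0.5)
x = tf.ones((1, 4))

print(layer(x, training=True))   # roughly half the units zeroed, the rest scaled up to 2.0
print(layer(x, training=False))  # unchanged: all ones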
The training argument indicates whether the layer should behave in training mode or in inference mode. For a BatchNormalization layer, for example:
training=True: The layer will normalize its inputs using the mean and variance of the current batch of inputs.
training=False: The layer will normalize its inputs using the mean and variance of its moving statistics, learned during training.
Usually training=False in inference mode, but in some networks, such as the pix2pix cGAN, training=True is used at both training and inference time.
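A small sketch of the BatchNormalization behaviour described above (the numbers are illustrative):

import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
x = tf.random.normal((8, 4), mean=3.0, stddev=2.0)

y_train = bn(x, training=True)   # normalizes with the current batch mean/variance and updates the moving averages
y_infer = bn(x, training=False)  # normalizes with the moving mean/variance accumulated during training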