OOM with tensorflow - tensorflow

I'm facing an OOM error while training my TensorFlow model. The structure is as follows:
tf.contrib.layers.embed_sequence initialized with GoogleNewsVector
2 * tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.LSTMCell) #forward
2 * tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.LSTMCell) #backward
tf.nn.bidirectional_dynamic_rnn wrapping the above layers
tf.layers.dense as an output layer
I tried reducing the batch size to as low as 64. My input data is padded to 1500, and my vocab size is 8938.
The cluster I'm using is very powerful (https://wiki.calculquebec.ca/w/Helios/en). I'm using two nodes with 8 GPUs each and still getting this error:
2019-02-23 02:55:16.366766: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at reverse_op.cc:270 : Resource exhausted: OOM when
allocating tensor with shape[2000,800,300] and type float on
/job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
I'm using the Estimator API with MirroredStrategy, to no avail. Is there a way to ask TensorFlow to run the training on the GPUs while keeping the tensors stored in the main machine memory? Any other suggestions are welcome.

Running a particular operation (e.g. a tensor multiplication during training) on a GPU requires the tensors involved to be stored on that GPU.
You might want to use TensorBoard or a similar tool to see which operations in your computation graph require the most memory. In particular, it's possible that the first link between the embeddings and the LSTM is the culprit, and you'd need to narrow that somehow.
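As a rough check on the error message in the question, the size of the tensor that failed to allocate can be worked out from its shape. This is a minimal plain-Python sketch: `tensor_bytes` is a hypothetical helper (not a TensorFlow function), it assumes 4 bytes per element for `type float`, and the 300-wide embedding in the second call is an assumption taken from the last dimension of the failing shape.

```python
from math import prod


def tensor_bytes(shape, bytes_per_element=4):
    """Return the bytes needed for one dense tensor of the given shape.

    bytes_per_element=4 assumes float32 elements.
    """
    return prod(shape) * bytes_per_element


# Shape copied from the OOM message in the question.
failed = tensor_bytes([2000, 800, 300])
print(f"failing tensor: {failed / 2**30:.2f} GiB")

# Shape the question describes: batch 64, padded length 1500,
# and (assumption) a 300-wide embedding.
described = tensor_bytes([64, 1500, 300])
print(f"described batch: {described / 2**20:.1f} MiB")
```

The failing tensor alone comes to roughly 1.79 GiB, while a batch of 64 sequences padded to 1500 would be around 110 MiB. If the leading dimension of the failing shape is the batch dimension, it is 2000 rather than 64, so it may be worth checking which batch size actually reaches the graph.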

Related

I encounter this error when training with yolov7 RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

I want to train with yolov7 and get a .pt file, but I run into this problem during training. What could be the reason?

Tensorflow not using multiple GPUs - getting OOM

I'm running into OOM on a multi-gpu machine, because TF 2.3 seems to be allocating a tensor using only one GPU.
tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at conv_ops.cc:539 :
Resource exhausted: OOM when allocating tensor with shape[20532,64,48,32]
and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc.
But tensorflow does recognize multiple GPUs when I run my code:
Adding visible gpu devices: 0, 1, 2
Is there anything else I need to do to have TF use all GPUs?
The direct answer is yes, you do need to do more to get TF to use multiple GPUs. You should refer to this guide, but the tl;dr is:
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    ...
https://www.tensorflow.org/guide/distributed_training#using_tfdistributestrategy_with_tfkerasmodelfit
But in your case, something else is happening. While this one tensor may be triggering the OOM, it's likely because a few previous large tensors were allocated.
The first dimension, your batch size, is 20532, which is really big. Since the factorization of that is 2**2 × 3 × 29 × 59, I'm going to guess you are working with CHW format and your source image was 3x64x128 which got trimmed after a few convolutions. I'd suspect an inadvertent broadcast. Print a model.summary() and then review the sizes of the tensors coming out of each layer. You may also need to look at your batching.
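The arithmetic in this answer can be checked with a short plain-Python sketch. The helper names (`prime_factors`, `tensor_gib`, `per_replica_batch`) are hypothetical, float32 (4 bytes per element) is assumed, and the replica count of 3 is taken from the "Adding visible gpu devices: 0, 1, 2" line in the question. The per-replica split assumes the global batch is divided as evenly as possible among the replicas, which is my understanding of how MirroredStrategy behaves with Keras model.fit.

```python
from math import prod


def prime_factors(n):
    """Return the prime factors of n in ascending order (trial division)."""
    factors = []
    divisor = 2
    while divisor * divisor <= n:
        while n % divisor == 0:
            factors.append(divisor)
            n //= divisor
        divisor += 1
    if n > 1:
        factors.append(n)
    return factors


def tensor_gib(shape, bytes_per_element=4):
    """Size of one dense tensor in GiB, assuming float32 by default."""
    return prod(shape) * bytes_per_element / 2**30


def per_replica_batch(global_batch, num_replicas):
    """Largest share of the batch any one replica gets (ceiling division)."""
    return -(-global_batch // num_replicas)


shape = [20532, 64, 48, 32]  # copied from the OOM message
print(prime_factors(shape[0]))
print(f"whole tensor on one GPU: {tensor_gib(shape):.2f} GiB")

local = per_replica_batch(shape[0], num_replicas=3)
print(f"per replica: batch {local}, {tensor_gib([local] + shape[1:]):.2f} GiB")
```

The single tensor is about 7.5 GiB on one GPU and still about 2.5 GiB per GPU when split three ways, which is why reviewing the batching is worthwhile even after a distribution strategy is in place.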

Set batch size of trained keras model to 1

I have a Keras model trained on my own dataset. However, after loading the weights, the summary shows None as the first dimension (the batch size).
I want to know how to fix the shape to a batch size of 1, since I need a fixed batch size in order to convert the model to TFLite with GPU support.
What worked for me was to specify the batch size on the Input layer, like this:
input = layers.Input(shape=input_shape, batch_size=1, dtype='float32', name='images')
This then carried through the rest of the layers.
The bad news is that despite this "fix" the TFLite runtime still complains about dynamic tensors. I get these non-fatal errors in logcat when it runs:
E/tflite: third_party/tensorflow/lite/core/subgraph.cc:801 tensor.data.raw != nullptr was not true.
E/tflite: Attempting to use a delegate that only supports static-sized tensors with a graph that has dynamic-sized tensors (tensor#26 is a dynamic-sized tensor).
E/tflite: Ignoring failed application of the default TensorFlow Lite delegate indexed at 0.
The good news is that despite these errors it seems to be using the GPU anyway, based on performance testing.
I'm using:
tensorflow-lite-support:0.2.0'
tensorflow-lite-metadata:0.2.1'
tensorflow-lite:2.6.0'
tensorflow:tensorflow-lite-gpu:2.3.0'
Hopefully, they'll fix the runtime so it doesn't matter whether the batch size is 'None'. It shouldn't matter for doing inference.

Mask R-CNN for TPU on Google Colab

We are trying to build an image segmentation deep learning model using Google Colab TPU. Our model is Mask R-CNN.
import os
import tensorflow as tf
TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']
tpu_model = tf.contrib.tpu.keras_to_tpu_model(
    model.keras_model,
    strategy=tf.contrib.tpu.TPUDistributionStrategy(
        tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)))
However, I am running into the following error while converting our Mask R-CNN model to a TPU model:
ValueError:
Layer <keras.engine.topology.InputLayer object at 0x7f58574f1940> has a
variable shape in a non-batch dimension. TPU models must
have constant shapes for all operations.
You may have to specify `input_length` for RNN/TimeDistributed layers.
Layer: <keras.engine.topology.InputLayer object at 0x7f58574f1940>
Input shape: (None, None, None, 3)
Output shape: (None, None, None, 3)
Appreciate any help.
Google recently released a tutorial on getting Mask R-CNN going on their TPUs. For this, they use an experimental Mask R-CNN model in Google's TPU GitHub repository (under models/experimental/mask_rcnn). Looking through the code, it appears they define the model with a fixed input size to overcome the issue you are seeing.
See below for more explanation:
As @aman2930 points out, the shape of your input tensor is not static. This won't work because TensorFlow compiles models with XLA to use a TPU, and XLA must have all tensor shapes defined at compile time. In the link above, the website specifically calls this out:
Static shapes
During regular usage TensorFlow attempts to determine the shapes of each tf.Tensor during graph construction. During
execution any unknown shape dimensions are determined dynamically, see
Tensor Shapes for more details.
To run on Cloud TPUs TensorFlow models are compiled using XLA. XLA
uses a similar system for determining shapes at compile time. XLA
requires that all tensor dimensions be statically defined at compile
time. All shapes must evaluate to a constant, and not depend on
external data, or stateful operations like variables or a random
number generator.
That said, further down the document, they mention that the input function is run on the CPU, so it isn't limited by static XLA sizes. They point to batch size being the issue, not image size:
Static shapes and batch size
The input pipeline generated by your
input_fn is run on CPU. So it is mostly free from the strict static
shape requirements imposed by the XLA/TPU environment. The one
requirement is that the batches of data fed from your input pipeline
to the TPU have a static shape, as determined by the standard
TensorFlow shape inference algorithm. Intermediate tensors are free to
have a dynamic shapes. If shape inference has failed, but the shape is
known it is possible to impose the correct shape using tf.set_shape().
So you could fix this by reformulating your model to have a fixed batch size, or by using tf.contrib.data.batch_and_drop_remainder as they suggest.
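To illustrate why dropping the remainder gives every batch a static shape, here is a minimal plain-Python sketch of the idea. `full_batches_only` is a hypothetical stand-in written for this example; it is not the TensorFlow function and only mimics the behaviour described in the quoted documentation.

```python
def full_batches_only(items, batch_size):
    """Yield consecutive batches of exactly batch_size items.

    A trailing partial batch is discarded, so every batch that is
    produced has the same leading dimension.
    """
    full_batches = len(items) // batch_size
    for i in range(full_batches):
        start = i * batch_size
        yield items[start:start + batch_size]


samples = list(range(10))  # 10 hypothetical input samples
batches = list(full_batches_only(samples, batch_size=4))
print([len(batch) for batch in batches])
```

With 10 samples and a batch size of 4, plain batching would end on a batch of 2, which makes the batch dimension unknown at graph-construction time. Dropping that last batch costs two samples per epoch but keeps the batch dimension constant.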
Could you please share the input data function? It is hard to tell the exact issue, but it seems that the shape of the tensor representing the input sample is not static.

OOM error when training TensorFlow object-detection API using inception-resnet & NASnet (especially) as backbone

Please help me find a solution to my problem. It's important to state first that I have successfully created my own custom dataset and trained on it using resnet101 on my own computer (16GB RAM and 4GB NVIDIA 980).
The problem arose when I tried to switch the backbone to inception-resnet and nasnet. I got the following error:
"ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape ..."
I thought I didn't have enough resources on my computer, so I created an instance on AWS EC2 with 60GB RAM and a 12GB NVIDIA Tesla K80 (my workplace only provides this service) and trained the network there.
The training for inception-resnet worked well; however, that's not the case with nasnet. Even with 100GB of memory I still get the OOM error.
I found one solution on the GitHub tensorflow models page at issue #1817, and I followed the instructions by adding the following lines to the nasnet config file:
train_config: {
  batch_size: 1
  batch_queue_capacity: 50
  num_batch_queue_threads: 8
  prefetch_queue_capacity: 10
  ...
The code ran well for a while. However, I still got the OOM error after around 6000 steps:
INFO:tensorflow:global step 6348: loss = 2.0393 (3.988 sec/step)
INFO:tensorflow:Saving checkpoint to path /home/ubuntu/crack-detection/structure-crack/models/faster_rcnn_nas_coco_2017_11_08/train/model.ckpt
INFO:tensorflow:global step 6349: loss = 0.9803 (3.980 sec/step)
2018-01-25 05:51:25.959402: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 79.73MiB. Current allocation summary follows.
...
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[64,17,17,4032]
[[Node: MaxPool2D/MaxPool = MaxPool[T=DT_FLOAT, data_format="NHWC", ksize=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 1, 1],
...
Is there anything else I could do to run this smoothly without any OOM errors? Thanks for your help
EDIT #1: The errors come more frequently now; they show up after 1000-1500 steps.
EDIT #2: Based on issue #2668 and issue #3014, there's one more thing we can do to run the code without OOM errors: add second_stage_batch_size: 25 (default is 50) to the model section of the config file. The file should look like the following:
model {
  faster_rcnn {
    ...
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
    second_stage_batch_size: 25
  }
}
Hope this can help.
I would like to point out that the memory you are running out of is GPU memory, so I'm afraid those 100GB are only useful for data wrangling outside of training. Also, without code, it's really difficult to figure out where the error is coming from.
That being said, if you can initialize the network with weights, train for 6000 iterations, and then suddenly run out of GPU memory, I'd guess that either you are somehow storing values in GPU memory or, if you have variable-length inputs, you are passing a sequence in that iteration that is too big memory-wise.
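The variable-length hypothesis can be tested offline before training by computing how large each padded batch would be. The following is a minimal plain-Python sketch under stated assumptions: `padded_batch_elements` is a hypothetical helper, each batch is assumed to be padded to its own longest sequence, and the lengths and the feature width of 300 are made-up illustration values, not data from the question.

```python
def padded_batch_elements(lengths, batch_size, feature_dim):
    """Return the element count of each batch after padding.

    Every batch is padded to the length of its longest sequence, so one
    long outlier inflates the whole batch it lands in.
    """
    sizes = []
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        sizes.append(len(batch) * max(batch) * feature_dim)
    return sizes


# Hypothetical sequence lengths with a single long outlier.
lengths = [100, 120, 90, 110, 100, 4000, 95, 105]
sizes = padded_batch_elements(lengths, batch_size=4, feature_dim=300)
worst = max(range(len(sizes)), key=sizes.__getitem__)
print(f"elements per batch: {sizes}")
print(f"largest batch is index {worst}")
```

If one batch turns out to be far larger than the rest, truncating or bucketing the long inputs is worth trying before concluding that memory is accumulating across steps.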