Tensorflow not using multiple GPUs - getting OOM - tensorflow

I'm running into an OOM error on a multi-GPU machine because TF 2.3 seems to be allocating a tensor on only one GPU:
tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at conv_ops.cc:539 :
Resource exhausted: OOM when allocating tensor with shape[20532,64,48,32]
and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc.
But TensorFlow does recognize multiple GPUs when I run my code:
Adding visible gpu devices: 0, 1, 2
Is there anything else I need to do to have TF use all GPUs?

The direct answer is yes: you do need to do more to get TF to use multiple GPUs. Recognizing the devices is not the same as using them; you have to build your model inside a distribution strategy. You should refer to this guide, but the TL;DR is:
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    ...
https://www.tensorflow.org/guide/distributed_training#using_tfdistributestrategy_with_tfkerasmodelfit
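For concreteness, here is a minimal sketch of that pattern with Keras and model.fit. The layer sizes, optimizer, loss, and dataset below are placeholders, not taken from the question:

import tensorflow as tf

mirrored_strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs by default
print("Replicas in sync:", mirrored_strategy.num_replicas_in_sync)

with mirrored_strategy.scope():
    # Build and compile the model inside the scope so its variables are mirrored.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 3, activation='relu', input_shape=(64, 128, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# model.fit(train_dataset, epochs=5)  # each global batch is split across the GPUs

Note that MirroredStrategy replicates the model and splits each batch across the GPUs; it does not pool their memory, so a single tensor too large for one GPU will still OOM.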
But in your case, something else is happening. While this one tensor may be the one triggering the OOM, it's likely failing because several previously allocated tensors were already large.
The first dimension, your batch size, is 20532, which is really big. Since that factors as 2² × 3 × 29 × 59, I'm going to guess you are working in CHW format and your source image was 3×64×128, which got trimmed after a few convolutions. I'd suspect an inadvertent broadcast. Print model.summary() and review the shape of the tensor coming out of each layer. You may also need to look at your batching.
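If the summary is hard to eyeball, a short loop over the layers works too. This is just a sketch, and model stands for your own compiled Keras model:

# Print each layer's output shape to spot an unexpected blow-up,
# e.g. an inadvertent broadcast multiplying the batch dimension.
for layer in model.layers:
    print(layer.name, layer.output_shape)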

Related

Set batch size of trained keras model to 1

I have a Keras model trained on my own dataset. However, after loading the weights, the summary shows None as the first dimension (the batch size).
I want to know how to fix that shape to a batch size of 1, since a fixed batch size is required for me to convert the model to TFLite with GPU support.
What worked for me was to specify the batch size on the Input layer, like this:
input = layers.Input(shape=input_shape, batch_size=1, dtype='float32', name='images')
This then carried through the rest of the layers.
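For context, here is a minimal end-to-end sketch of that approach; the layer sizes and file name are invented, and only the batch_size=1 argument is the point:

import tensorflow as tf
from tensorflow.keras import layers

input_shape = (224, 224, 3)  # placeholder input shape

# Fixing batch_size on the Input layer propagates a static batch dimension.
inputs = layers.Input(shape=input_shape, batch_size=1, dtype='float32', name='images')
x = layers.Conv2D(16, 3, activation='relu')(inputs)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(10)(x)
model = tf.keras.Model(inputs, outputs)

model.summary()  # the first dimension should now read 1 instead of None

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)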
The bad news is that despite this "fix" the TFLite runtime still complains about dynamic tensors. I get these non-fatal errors in logcat when it runs:
E/tflite: third_party/tensorflow/lite/core/subgraph.cc:801 tensor.data.raw != nullptr was not true.
E/tflite: Attempting to use a delegate that only supports static-sized tensors with a graph that has dynamic-sized tensors (tensor#26 is a dynamic-sized tensor).
E/tflite: Ignoring failed application of the default TensorFlow Lite delegate indexed at 0.
The good news is that despite these errors it seems to be using the GPU anyway, based on performance testing.
I'm using:
tensorflow-lite-support:0.2.0
tensorflow-lite-metadata:0.2.1
tensorflow-lite:2.6.0
tensorflow:tensorflow-lite-gpu:2.3.0
Hopefully, they'll fix the runtime so it doesn't matter whether the batch size is 'None'. It shouldn't matter for doing inference.

Tensorflow: How can I find out the BIG tensors?

I encountered the OOM problem in TensorFlow: it warns that it went OOM when allocating tensor XXX. But I believe it's because some other big tensors occupy too much memory, rather than THAT tensor in the error, since I've used the same structure with the same shapes before, with lower total memory usage, and no OOM occurred.
Another hard thing is that the BIG tensor is a RUNTIME tensor (an activation), not one of the so-called trainable parameters, so I cannot observe its size before the session runs; all I can do is wait for the OOM to occur after it begins to run.
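One thing that can help here (a sketch assuming the TF1-style session workflow described above; train_op and feed stand for your own training op and feed dict): RunOptions has a report_tensor_allocations_upon_oom field that makes TensorFlow print a per-tensor allocation summary when the OOM happens, which is usually enough to spot the big runtime tensors.

import tensorflow as tf

# Ask TensorFlow to dump tensor allocations if this run goes OOM.
run_options = tf.compat.v1.RunOptions(report_tensor_allocations_upon_oom=True)

with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    sess.run(train_op, feed_dict=feed, options=run_options)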

How to set specific gpu in bert?

ResourceExhaustedError (see above for traceback):
OOM when allocating tensor of shape [768] and type float [[node
bert/encoder/layer_0/attention/output/LayerNorm/beta/adam_m/Initializer/zeros
(defined at /home/zyl/souhu/bert/optimization.py:122) =
Const_class=["loc:#bert/encoder/layer_0/attention/output/LayerNorm/beta/adam_m/Assign"],
dtype=DT_FLOAT, value=Tensor, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
How can I set GPU 1 (or another GPU) to run BERT?
The easiest way to choose which GPUs are used is to set the CUDA_VISIBLE_DEVICES environment variable. The device will still appear as GPU:0 to TensorFlow, but it will be a different physical device.
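A minimal sketch of that approach from Python; the device id 1 is just an example, and the variable must be set before TensorFlow initializes CUDA:

import os

# Expose only the second physical GPU to TensorFlow; this must happen
# before TensorFlow initializes CUDA (i.e. before the first GPU op).
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# Equivalent from the shell:
#   CUDA_VISIBLE_DEVICES=1 python your_training_script.py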
If you are using BERT from within Python (which is rather painful), you can wrap the code that creates the BERT graph in a device block:
with tf.device('/device:GPU:1'):
    model = modeling.BertModel(...)

OOM with tensorflow

I'm facing an OOM error while training my TensorFlow model; the structure is as follows:
tf.contrib.layers.embed_sequence initialized with GoogleNewsVector
2 * tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.LSTMCell) #forward
2 * tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.LSTMCell) #backward
tf.nn.bidirectional_dynamic_rnn wrapping the above layers
tf.layers.dense as an output layer
I tried reducing the batch size down to as low as 64; my input data is padded to 1500, and my vocab size is 8938.
The cluster I'm using is very powerful (https://wiki.calculquebec.ca/w/Helios/en); I'm using two nodes with 8 GPUs each and still getting this error:
2019-02-23 02:55:16.366766: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at reverse_op.cc:270 : Resource exhausted: OOM when
allocating tensor with shape[2000,800,300] and type float on
/job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
I'm using the Estimator API with MirroredStrategy and it's still no use. Is there maybe a way to ask TensorFlow to run the training on the GPUs but keep the tensors stored in the main machine's memory? Any other suggestions are welcome.
Running a particular operation on the GPU (e.g. some tensor multiplication during training) requires having those tensors stored on the GPU, so keeping them only in host memory is not an option.
You might want to use TensorBoard or something similar to see which operations require the most memory in your computation graph. In particular, it's possible that the first link between the embeddings and the LSTM is the culprit, and you'd need to narrow that down somehow.
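For a sense of scale, a quick back-of-the-envelope calculation on the tensor in the error above (shape [2000, 800, 300], float32):

elements = 2000 * 800 * 300      # 480,000,000 elements
size_bytes = elements * 4        # float32 takes 4 bytes
print(size_bytes / 1024 ** 3)    # roughly 1.79 GiB for this single tensor

A bidirectional dynamic RNN keeps activations for every time step for backpropagation, so a handful of tensors this size, plus their gradients, can exhaust a single GPU's memory quickly.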

OOM error when training TensorFlow object-detection API using inception-resnet & NASnet (especially) as backbone

Please help me find a solution to my problem. It's important to state first that I have successfully created my own custom dataset and have successfully trained it using resnet101 on my own computer (16GB RAM and a 4GB NVIDIA 980).
The problem arises when I try to switch the backbone to inception-resnet or nasnet. I get the following error:
"ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape ..."
I thought I didn't have enough resources on my computer, so I created an AWS EC2 instance with 60GB RAM and a 12GB NVIDIA Tesla K80 (my workplace only provides this service) and trained the network there.
The training for inception-resnet worked well; however, that's not the case with nasnet. Even with 100GB of memory I still get the OOM error.
I found one solution on the GitHub tensorflow/models page, at issue #1817, and I followed the instructions by adding the following lines to the nasnet config file:
train_config: {
  batch_size: 1
  batch_queue_capacity: 50
  num_batch_queue_threads: 8
  prefetch_queue_capacity: 10
  ...
The code then ran well for a while. However, I still got the OOM error after running around 6000 steps:
INFO:tensorflow:global step 6348: loss = 2.0393 (3.988 sec/step)
INFO:tensorflow:Saving checkpoint to path /home/ubuntu/crack-detection/structure-crack/models/faster_rcnn_nas_coco_2017_11_08/train/model.ckpt
INFO:tensorflow:global step 6349: loss = 0.9803 (3.980 sec/step)
2018-01-25 05:51:25.959402: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 79.73MiB. Current allocation summary follows.
...
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[64,17,17,4032]
[[Node: MaxPool2D/MaxPool = MaxPool[T=DT_FLOAT, data_format="NHWC", ksize=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 1, 1],
...
Is there anything else I could do to run this smoothly without any OOM errors? Thanks for your help.
EDIT #1: The errors come more frequently now; they show up after 1000-1500 steps.
EDIT #2: Based on issue #2668 and issue #3014, there's one more thing we can do to run the code without the OOM error: add second_stage_batch_size: 25 (the default is 50) in the model section of the config file. So the file should look like the following:
model {
  faster_rcnn {
    ...
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
    second_stage_batch_size: 25
  }
}
Hope this can help.
I would like to point out that the memory you are running out of is GPU memory, so I'm afraid those 100GB of RAM are only useful for data wrangling outside of training. Also, without code, it's really difficult to figure out where the error is coming from.
That being said, if you can initialize the neural net architecture with weights and train for 6000 iterations before suddenly running out of GPU memory, then my guess is that you are either gradually storing values in GPU memory or, if you have variable-length inputs, passing a sequence in that iteration that is too big memory-wise.