TensorFlow: How can I find out the BIG tensors?

I ran into an OOM error in TensorFlow, with a warning that it went OOM when allocating tensor XXX. But I believe the real cause is that some other big tensors are occupying too much memory, not THAT particular tensor, because I have used the same structure with the same shape before, with lower total memory usage, and no OOM occurred.
Another difficulty is that the BIG tensor is a RUNTIME tensor, not one of the trainable parameters, so I cannot inspect its size before the session runs; all I can do is wait for the OOM to occur after training starts.
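One way to see per-op memory at run time is to trace a step with RunMetadata and feed it to tf.profiler, which reports peak memory per op. This is a minimal TF1-style sketch (the question mentions sessions); the toy matmul graph is only there to make it self-contained, and in practice you would run your own training op instead.
import tensorflow as tf

# Toy graph so the sketch runs on its own; replace with your own training op.
a = tf.random_normal([4096, 4096])
big = tf.matmul(a, a)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    # Trace one step so the allocator statistics are recorded.
    sess.run(big, options=run_options, run_metadata=run_metadata)

# Report per-op execution time and peak memory; the biggest runtime tensors stand out here.
tf.profiler.profile(
    tf.get_default_graph(),
    run_meta=run_metadata,
    cmd='scope',
    options=tf.profiler.ProfileOptionBuilder.time_and_memory())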

Related

GPU OOM after many hours of training

I am training a model within my own loop kind of like this:
while True:
    [x, y] = getSomeTrainingData(...)
    model.fit(x, y, ...)
    <misc>
What I'm seeing is that it will train for a very long time, but then randomly OOM on the GPU. The batch size and data size are constant. What would cause this, and is there anything I can do, like potentially doing some kind of garbage collection between iterations?
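If stale Python references are part of the problem, one thing to try is dropping the old batch and forcing garbage collection between iterations. This is a hedged sketch; build_model and get_some_training_data are hypothetical stand-ins for the asker's own code.
import gc

model = build_model()                   # hypothetical: however you construct your Keras model
step = 0
while True:
    x, y = get_some_training_data()     # hypothetical data-loading helper
    model.fit(x, y, verbose=0)
    del x, y                            # release references to the previous batch
    step += 1
    if step % 100 == 0:
        gc.collect()                    # periodically force Python garbage collection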

Tensorflow not using multiple GPUs - getting OOM

I'm running into OOM on a multi-gpu machine, because TF 2.3 seems to be allocating a tensor using only one GPU.
tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at conv_ops.cc:539 :
Resource exhausted: OOM when allocating tensor with shape[20532,64,48,32]
and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc.
But tensorflow does recognize multiple GPUs when I run my code:
Adding visible gpu devices: 0, 1, 2
Is there anything else I need to do to have TF use all GPUs?
The direct answer is yes, you do need to do more to get TF to use multiple GPUs. You should refer to this guide, but the TL;DR is:
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    ...
https://www.tensorflow.org/guide/distributed_training#using_tfdistributestrategy_with_tfkerasmodelfit
But in your case, something else is also happening. While this one tensor may be the one that triggers the OOM, the likely cause is that several previously allocated large tensors have already filled up memory.
The first dimension, your batch size, is 20532, which is really big. Since its factorization is 2² × 3 × 29 × 59, I'm going to guess you are working with CHW format and your source image was 3×64×128, which got trimmed after a few convolutions. I'd suspect an inadvertent broadcast. Print model.summary() and review the sizes of the tensors coming out of each layer. You may also need to look at your batching.
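Putting those two points together, here is a minimal sketch of the pattern from the linked guide; the model and layer sizes are illustrative and not taken from the question.
import tensorflow as tf

mirrored_strategy = tf.distribute.MirroredStrategy()   # uses all visible GPUs by default

with mirrored_strategy.scope():
    # Model construction and compile() must happen inside the strategy scope.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 3, activation='relu', input_shape=(64, 128, 3)),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# Check the per-layer output shapes for anything unexpectedly large.
model.summary()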

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[33694,33694] and type float on:GPU:0 by allocator GPU_0_bfc

I am trying to run my training file and this error pops up. My data dimensions are 4190 by 33694. I tried to reduce the batch size, but it still didn't work.
You are trying to allocate a tensor of size 4.2 GB (= 33694 × 33694 × 4 bytes / 1024³). Another tensor of the same size would be allocated during backprop. Even if your network has only a single (fully-connected?) layer, you would probably need 12 GB of video memory to run it.
You will have to review your design; there is hardly any way around it. Replace the FC layers with something different, reduce the number of neurons drastically, or rescale your data if possible.
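For reference, the size arithmetic above can be checked in a couple of lines (just an illustration, not code from the question):
# float32 tensor of shape [33694, 33694], 4 bytes per element
size_gb = 33694 * 33694 * 4 / 1024**3
print(round(size_gb, 1))   # prints 4.2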

OOM with tensorflow

I'm facing an OOM error while training my TensorFlow model; the structure is as follows:
tf.contrib.layers.embed_sequence initialized with GoogleNewsVector
2 * tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.LSTMCell) #forward
2 * tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.LSTMCell) #backward
tf.nn.bidirectional_dynamic_rnn wrapping the above layers
tf.layers.dense as an output layer
I tried reducing the batch size to as low as 64; my input data is padded to length 1500 and my vocab size is 8938.
The cluster I'm using is very powerful (https://wiki.calculquebec.ca/w/Helios/en); I'm using two nodes with 8 GPUs each and still getting this error:
2019-02-23 02:55:16.366766: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at reverse_op.cc:270 : Resource exhausted: OOM when
allocating tensor with shape[2000,800,300] and type float on
/job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
I'm using the Estimator API with MirroredStrategy and it still doesn't help. Is there a way to ask TensorFlow to run the training on the GPUs but keep the tensors stored in the main machine's memory? Any other suggestions are welcome.
Running a particular operation (e.g. a tensor multiplication during training) on the GPU requires that its input tensors be stored on the GPU.
You might want to use TensorBoard or something similar to see which operations require the most memory in your computation graph. In particular, it's possible that the first link between the embeddings and the LSTM is the culprit, and you would need to narrow that down somehow.
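As a concrete starting point for the TensorBoard suggestion, attaching RunMetadata to a FileWriter makes per-node memory and compute time visible in TensorBoard's Graph tab. This is a hedged TF1-style sketch; the toy graph and 'logdir' path are stand-ins for the asker's Estimator model and log directory.
import tensorflow as tf

# Toy graph so the sketch is self-contained; replace with your own training op.
a = tf.random_normal([2000, 800, 300])
op = tf.reduce_sum(a)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    writer = tf.summary.FileWriter('logdir', sess.graph)
    sess.run(op, options=run_options, run_metadata=run_metadata)
    # In TensorBoard's Graph tab you can then colour nodes by memory or compute time.
    writer.add_run_metadata(run_metadata, 'step_0')
    writer.close()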

OOM after n iterations in tensorflow without further tensor allocation

Several times, when trying to use as much GPU memory as possible, I've experienced OOM errors only after a certain number of training iterations have passed (without explicitly allocating new tensors). Reducing the batch size just a bit (e.g. from 32 to 30) has always solved the problem, but I can't understand what could be causing this behavior.
Thanks!