TensorFlow out of memory after a few iterations - tensorflow

I am running a deep learning project on a large dataset with a GPU. Each image is about 200*200*200 voxels.
During training I receive warnings and an out-of-memory (OOM) error at different iterations: sometimes the program dies on the very first iteration, sometimes only after hundreds of iterations, for the same reason.
So I am wondering: if the model can train and has already run some iterations, why does the OOM error still occur later? No other programs were running on that GPU and the batch size is fixed. Could someone please help me fix this or give some ideas about how to deal with it?
Some details:
TensorFlow always warns like this:
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 13278 get requests, put_count=13270 evicted_count=1000 eviction_rate=0.075358 and unsatisfied allocation rate=0.0834463
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
and after hundreds of iterations, the program is stopped by a memory error (part of the output):
...
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x33288a7e00 of size 17408
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x33288ac200 of size 17408
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x33288b0600 of size 6912
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x33288b2100 of size 6912
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x33288b3c00 of size 6912
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x33288b5700 of size 6912
...
...
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 648.00MiB. See logs for memory state.
Some operations I used: tf.nn.conv3d/tf.nn.conv3d_transpose/tf.nn.batch_normalization
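For a rough sense of scale (a back-of-envelope sketch, not part of the original question; the 32-channel count is a hypothetical example), a single 200*200*200 float32 feature volume is already large, and each conv3d/conv3d_transpose layer keeps such volumes alive for the backward pass:
# Hypothetical activation sizing for one 200x200x200 volume (float32).
voxels = 200 * 200 * 200            # 8,000,000 voxels per volume
channels = 32                       # hypothetical feature-channel count
bytes_per_float32 = 4
per_example = voxels * channels * bytes_per_float32
print(per_example / 2**30)          # ~0.95 GiB of activations for one layer of one example
With several such layers plus their gradients, a run like this can sit close to the memory ceiling even at a fixed batch size, which is consistent with the OOM appearing only at some iterations.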

Related

How to interpret the TensorFlow Lite benchmark app results?

I have used the TensorFlow Lite benchmark app, and I got the following result:
E tflite : Average inference timings in us: Warmup: 119501, Init: 3556, Inference: 135968, Overall max resident set size = 32.0469 MB, total malloc-ed size = 0 MB, in-use allocated/mmapped size = 13.3229 MB
I would like to know what these values mean: Warmup, Init, Inference, Overall max resident set size, total malloc-ed size and in-use allocated/mmapped size.
I didn't find them in the documentation.
For the latencies:
Warmup: The latency of the first warmup invocation. Note that the very first invocation may be slower, since the code may do extra initialization / allocation.
Init: Initialization time (to create the TensorFlow Lite interpreter).
Inference: The average latency of the inference invocations. This should be the most important metric in most cases.
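If you want to pull those latency numbers out of the raw log programmatically, here is a minimal parsing sketch (assuming the exact line format shown above; the variable names are mine):
import re

line = ("E tflite : Average inference timings in us: Warmup: 119501, "
        "Init: 3556, Inference: 135968, Overall max resident set size = 32.0469 MB, "
        "total malloc-ed size = 0 MB, in-use allocated/mmapped size = 13.3229 MB")

# Extract the three latency values (microseconds) from the benchmark output.
timings = dict(re.findall(r"(Warmup|Init|Inference): (\d+)", line))
print({k: int(v) / 1000 for k, v in timings.items()})  # values in milliseconds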

Understand the OOM mechanism of tensorflow

I am using GloVe pre-trained embeddings to train my own network. I use
self.embedding = tf.get_variable(name="embedding", shape=self.id2vec_table.shape, initializer=tf.constant_initializer(self.id2vec_table), trainable=False)
and
tuning_embedding = tf.nn.embedding_lookup(self.embedding, self.txt_from_mfcc)
to initialize and look up the embedding. However, when I train, the error appears as follows (the full message is too long, so I include only the parts I believe matter most):
Sum Total of in-use chunks: 3.85GiB
Limit: 11281927373
InUse: 4131524096
MaxInUse: 6826330624
NumAllocs: 47061
MaxAllocSize: 2842165248
OP_REQUIRES failed at matmul_op.cc:478 : Resource exhausted: OOM when allocating tensor with shape[4800,400001] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
However, according to these stats, my Tesla K80 has about 11 GB of memory and only around 4-7 GB (40%-70%) is in use. How can my GPU be out of memory when at most 70% of its total memory is used? I just cannot understand the inner mechanism of how this works.
I have also tried methods from other posts, such as
https://stackoverflow.com/questions/42495930/tensorflow-oom-on-gpu
I limited my batch size to 16 and set config.gpu_options.allow_growth = True, config.gpu_options.allocator_type = 'BFC', and config.gpu_options.per_process_gpu_memory_fraction = 0.4, but the error is still there.
Any help here?
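A quick back-of-envelope check (my own sketch, not from the original post) shows why this single allocation can fail even with only ~4 GiB in use: the [4800, 400001] float tensor by itself needs roughly 7 GiB.
# Size of the tensor the error above is trying to allocate.
rows, cols, bytes_per_float32 = 4800, 400001, 4
print(rows * cols * bytes_per_float32 / 2**30)   # ~7.15 GiB for one matmul output
# ~3.85 GiB already in use + ~7.15 GiB requested exceeds the ~10.5 GiB allocator
# limit (Limit: 11281927373 bytes) reported in the stats, hence the OOM.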

Tensorflow memory needed doesn't scale with batch size and image size

I am using Ubuntu 16.04 and TensorFlow 1.3, with a network of ~17M weights.
Experiments
image size 400x1000, batch size 4, during graph construction:
failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
image size 300x750, batch size 4, during graph construction:
failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
image size 300x740, batch size 1, during graph construction:
failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
So the memory requested is the same for all three experiments. My questions: do 17M weights really need such a huge amount of memory? And why doesn't the required memory change with different image sizes and batch sizes?
It could be because you are storing a lot of intermediate results. Each time you call sess.run, some new memory is allocated to store the new tensor results, and after adding those allocations the total memory allocated on your host exceeds 32 GB. Please check your host memory (not GPU memory) usage during runtime. If that is the case, you need to reduce your host memory allocation; storing intermediate results to the hard disk may be a good choice.
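Two quick checks make this concrete (my own sketch, not part of the answer above; psutil is an assumed extra dependency): the failed host allocation is exactly 32 GiB, far more than the ~65 MB the 17M weights themselves need, and the host RSS can be watched during training.
import os
import psutil   # assumed available; used only to read host (CPU) memory

# The failed allocation is exactly 32 GiB, independent of batch and image size.
print(34359738368 == 32 * 1024**3)        # True

# The 17M float32 weights alone are tiny by comparison.
print(17000000 * 4 / 2**20)               # ~65 MB

# Check resident host memory (not GPU memory), e.g. once per training step.
print(psutil.Process(os.getpid()).memory_info().rss / 2**30, "GiB host RSS")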

OOM after n iterations in tensorflow without further tensor allocation

Several times, when using as much GPU memory as possible, I've experienced OOM errors only after a certain number of training iterations have passed (without explicitly allocating new tensors). Reducing the batch size just a bit (e.g. from 32 to 30) has always solved the problem, but I can't understand what could be causing this behavior.
Thanks!
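One way to check whether GPU memory usage actually creeps up across iterations (a sketch assuming TF 1.x with tf.contrib available; train_op and sess stand in for your own training op and session) is to log the allocator's in-use and peak bytes every step:
import tensorflow as tf
from tensorflow.contrib.memory_stats import BytesInUse, MaxBytesInUse

# Build the measurement ops once, on the GPU whose allocator you care about.
with tf.device('/gpu:0'):
    mem_now = BytesInUse()
    mem_peak = MaxBytesInUse()

# Inside the training loop:
# _, now, peak = sess.run([train_op, mem_now, mem_peak])
# print('in use: %.2f GiB  peak: %.2f GiB' % (now / 2**30, peak / 2**30))
If the peak grows every iteration, something is being added to the graph or cached; if it stays flat and the failure still happens, fragmentation of an almost-full allocator is the more likely culprit, which would also fit the observation that a slightly smaller batch fixes it.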

tensorflow: CUDA_ERROR_OUT_OF_MEMORY always happens

I'm trying to train a seq2seq model with the tf-seq2seq package on a 1080 Ti (11 GB) GPU. I always get the following error with different network sizes (even nmt_small):
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Graphics Device
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:03:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Graphics Device, pci bus id: 0000:03:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 10.91G (11715084288 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 12337 get requests, put_count=10124 evicted_count=1000 eviction_rate=0.0987752 and unsatisfied allocation rate=0.268542
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
INFO:tensorflow:Saving checkpoints for 1 into ../model/model.ckpt.
INFO:tensorflow:step = 1, loss = 5.07399
It seems that TensorFlow tries to occupy the total amount of the GPU's memory (10.91 GiB), but clearly only 10.75 GiB is available.
You should note a few tips:
1- Use memory growth. From the TensorFlow documentation: "in some cases it is desirable for the process to only allocate a subset of the available memory, or to only grow the memory usage as is needed by the process. TensorFlow provides two Config options on the Session to control this." (A TF 2.x equivalent is sketched after these tips.)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, ...)
2- Are you training with batches, or feeding the whole dataset at once? If you are batching, try decreasing your batch size.
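For reference, here is a TF 2.x sketch of the same memory-growth setting (tf.config replaces the Session config; this is not part of the original answer):
import tensorflow as tf

# TF 2.x equivalent of allow_growth; call this before any GPU is initialized.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)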
In addition to both of the suggestions made concerning the memory growth, you can also try:
sess_config = tf.ConfigProto()
sess_config.gpu_options.per_process_gpu_memory_fraction = 0.90
with tf.Session(config=sess_config) as sess:
...
With this you can limit the amount of GPU memory allocated by the program, in this case to 90 percent of the available GPU memory. Maybe this is sufficient to solve your problem of the network trying to allocate more memory than available.
If this is not sufficient, you will have to decrease the batch size or the network's size.
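Similarly, if you are on TF 2.x, the counterpart of per_process_gpu_memory_fraction is a hard per-GPU memory cap (a sketch; the 10 GB figure below is just an example, not from the answer):
import tensorflow as tf

# Cap TensorFlow at roughly 10 GB on the first GPU (TF 2.x); memory_limit is in MB.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=10240)])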