I use Keras' pre-trained InceptionResNetV2 to extract image features, but it always raises CUDA_ERROR_OUT_OF_MEMORY when I run prediction, even on a single image.
The environment is CUDA 10.0, cuDNN 7.4, TensorFlow 1.13, and an RTX 2070 with 8 GB of GPU memory.
Here is the code:
import numpy as np
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
from keras.applications.inception_resnet_v2 import InceptionResNetV2

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))
...
IR2 = InceptionResNetV2(weights='imagenet', include_top=False)
...
features = IR2.predict_on_batch(np.array([test_image]))  # test_image is a single image
The error messages are:
E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 3.53G (3794432768 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.39GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.39GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
I am trying to run tensorflow-gpu version 2.4.0-dev20200828 (a tf-nightly build) for a convolutional neural network implementation. Some other details:
Python 3.8.5
Windows 10
NVIDIA RTX 2080 with 8 GB of VRAM
CUDA 11.1
The following snippet is what I run:
import numpy as np
import tensorflow as tf
from tensorflow import keras

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)

vgg_16 = keras.applications.VGG16(include_top=False, input_shape=(600, 600, 3))
random_image = np.random.rand(1, 600, 600, 3)
output = vgg_16(random_image)
The code for the memory configuration was taken from answers here.
The issue I am having is that my GPU has 8 GB of VRAM, and I need to be able to run the CNN with relatively large image batch sizes. The example above runs on a single image, but surprisingly I only seem to be able to increase the batch size to about 2-3 images of 600x600. As its comment says, the configuration code "Restrict[s] TensorFlow to only allocate 1GB of memory on the first GPU", which is clearly not ideal.
If I allocate more, say 4000 MB, I get errors such as:
E tensorflow/stream_executor/cuda/cuda_dnn.cc:325] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
On the other hand, if I leave it at 1024 MB, I get messages like:
Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.25GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Any insights/resources on how to understand this issue would be much appreciated. I'd be willing to switch to another version of tensorflow/python/CUDA if necessary, but ultimately I just want to gain a deeper understanding of what this issue is.
A better way to control memory usage is to enable memory growth. Remove all of the GPU configuration code above and use this instead:
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
Additionally, you can resize or crop the input images to a smaller size to further reduce memory usage, as in the sketch below.
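A minimal sketch of that idea, with memory growth enabled as above; the 300x300 target size here is only an illustration, not a recommendation:

import numpy as np
import tensorflow as tf
from tensorflow import keras

for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

# build the network for the smaller resolution
vgg_16 = keras.applications.VGG16(include_top=False, input_shape=(300, 300, 3))

batch = np.random.rand(4, 600, 600, 3).astype('float32')
# downscale the 600x600 inputs to 300x300 before they reach the network;
# activation memory scales roughly with the spatial resolution
smaller = tf.image.resize(batch, (300, 300))
output = vgg_16(smaller)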
I am trying to run StyleGAN2 using a cluster equipped with eight GPUs (NVIDIA GeForce RTX 2080). At present, I am using the following configuration in training_loop.py:
minibatch_size_dict = {4: 512, 8: 256, 16: 128, 32: 64, 64: 32}, # Resolution-specific overrides.
minibatch_gpu_base = 8, # Number of samples processed at a time by one GPU.
minibatch_gpu_dict = {}, # Resolution-specific overrides.
G_lrate_base = 0.001, # Learning rate for the generator.
G_lrate_dict = {}, # Resolution-specific overrides.
D_lrate_base = 0.001, # Learning rate for the discriminator.
D_lrate_dict = {}, # Resolution-specific overrides.
lrate_rampup_kimg = 0, # Duration of learning rate ramp-up.
tick_kimg_base = 4, # Default interval of progress snapshots.
tick_kimg_dict = {4:10, 8:10, 16:10, 32:10, 64:10, 128:8, 256:6, 512:4}): # Resolution-specific overrides.
I am training on a set of 512x512 pixel images. After a couple of iterations, I get the error message reported below and it looks like the script stops running (watching nvidia-smi, both the temperature and the fan activity of the GPUs decrease). I already reduced the batch size, but it looks like the problem is somewhere else. Do you have any tips on how to fix this?
I was able to run the original StyleGAN with the same dataset. The paper says that StyleGAN2 should be less heavy, so I am a bit surprised.
Here is the error message I get:
2019-12-16 18:22:54.909009: E tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 334.11M (350338048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-12-16 18:22:54.909087: W tensorflow/core/common_runtime/bfc_allocator.cc:314] Allocator (GPU_0_bfc) ran out of memory trying to allocate 129.00MiB (rounded to 135268352). Current allocation summary follows.
2019-12-16 18:22:54.918750: W tensorflow/core/common_runtime/bfc_allocator.cc:319] **_***************************_*****x****x******xx***_******************************_***************
2019-12-16 18:22:54.918808: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at conv_grad_input_ops.cc:903 : Resource exhausted: OOM when allocating tensor with shape[4,128,257,257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
The config-f model for StyleGAN2 is actually bigger than StyleGAN1's. Try a configuration that consumes less VRAM, such as config-e. You can change the model configuration by passing a flag on your Python command line, as shown here: https://github.com/NVlabs/stylegan2/blob/master/run_training.py#L144
In my case, I'm able to train StyleGAN2 with config-e on 2 RTX 2080 Ti cards.
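For example, an invocation along these lines (the flag names come from that run_training.py; the dataset name and data directory are placeholders for your own setup):

python run_training.py --num-gpus=2 --data-dir=~/datasets --dataset=my_dataset --config=config-e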
One or more high-end NVIDIA GPUs, NVIDIA drivers, CUDA 10.0 toolkit
and cuDNN 7.5. To reproduce the results reported in the paper, you
need an NVIDIA GPU with at least 16 GB of DRAM.
Your NVIDIA GeForce RTX 2080 card has 8GB (the 2080 Ti has 11GB), but I guess you're saying you have eight of them? I don't think TensorFlow is set up for that kind of parallelism out of the box.
On my two systems (P40, CUDA 9, cuDNN 7), tf1.8 and tf1.12 are installed respectively, and the same piece of code allocates almost double the GPU memory in tf1.12 as in tf1.8.
I wrote the following code to simplify the comparison. With it, 1241 MiB of GPU memory is allocated in tf1.8 and 737 MiB in tf1.12. How can I optimize the GPU memory allocation in tf? Any suggestion would be appreciated.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.allow_soft_placement = True

# one small variable and one large one (10000 x 10000 float32 ~= 381 MiB)
a = tf.get_variable('a', (100, 100))
b = tf.get_variable('b', (10000, 10000))

sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())
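For reference, the same comparison can also be made from inside the graph via tf.contrib.memory_stats (a minimal sketch, assuming that contrib module is available in both installations):

import tensorflow as tf
from tensorflow.contrib.memory_stats import BytesInUse, MaxBytesInUse

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.allow_soft_placement = True

a = tf.get_variable('a', (100, 100))
b = tf.get_variable('b', (10000, 10000))

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # bytes currently held on the GPU vs. the peak so far, as seen by the allocator
    print(sess.run([BytesInUse(), MaxBytesInUse()]))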
I am using GloVe pre-trained embeddings to train my own network. I use
self.embedding = tf.get_variable(name="embedding", shape=self.id2vec_table.shape, initializer=tf.constant_initializer(self.id2vec_table), trainable=False)
and
tuning_embedding = tf.nn.embedding_lookup(self.embedding, self.txt_from_mfcc)
to initialize and look up the embeddings. However, when I train, the error shows up as follows (the full message is too long, so I include only the parts I believe matter most):
Sum Total of in-use chunks: 3.85GiB
Limit:        11281927373
InUse:        4131524096
MaxInUse:     6826330624
NumAllocs:    47061
MaxAllocSize: 2842165248
OP_REQUIRES failed at matmul_op.cc:478 : Resource exhausted: OOM when allocating tensor with shape[4800,400001] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
However, according to those stats, the total memory limit of my Tesla K80 is about 11 GB, and only 40%-70% of it (around 4-7 GB) is in use. How can my GPU be out of memory when at most 70% of the total is used? I just cannot understand the inner mechanism of how this works.
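For scale, the tensor in that OOM line is already huge on its own; a quick back-of-the-envelope check (assuming float32 at 4 bytes per element):

rows, cols = 4800, 400001        # shape from the OOM message above
bytes_needed = rows * cols * 4   # float32 = 4 bytes per element
print(bytes_needed)              # 7680019200 bytes, i.e. roughly 7.2 GiB

Together with the ~3.85 GiB already in use, a single contiguous block of that size would already push past the ~10.5 GiB limit shown in the stats.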
I have also tried methods from other posts, such as
https://stackoverflow.com/questions/42495930/tensorflow-oom-on-gpu
limiting my batch size to 16, setting config.gpu_options.allow_growth = True, config.gpu_options.allocator_type = 'BFC', and config.gpu_options.per_process_gpu_memory_fraction = 0.4, but the error is still there.
Any help here?
I'm trying to build a large CNN in TensorFlow, and intend to run it on a multi-GPU system. I've adopted a "tower" system and split batches for both GPUs, while keeping the variables and other computations on the CPU. My system has 32GB of memory, but when I run my code I get the error:
E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 17179869184
Killed
I've seen that the code works (though very, very slowly) if I hide the CUDA devices from TensorFlow, so that it does not use cudaMallocHost()...
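For context, here is a rough sketch of the tower layout I described above (hypothetical model and helper names, TF 1.x style; my real code is larger):

import tensorflow as tf

NUM_GPUS = 2

def assign_to_device(device, ps_device='/cpu:0'):
    # place variable ops on the CPU, everything else on the given GPU
    def _assign(op):
        if op.type in ('Variable', 'VariableV2', 'VarHandleOp'):
            return ps_device
        return device
    return _assign

def build_tower(images):
    # hypothetical stand-in for the real CNN body
    return tf.layers.dense(tf.layers.flatten(images), 10, name='logits')

images = tf.placeholder(tf.float32, [None, 64, 64, 3])
image_splits = tf.split(images, NUM_GPUS)  # one shard of the batch per GPU

tower_outputs = []
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(NUM_GPUS):
        with tf.device(assign_to_device('/gpu:%d' % i)):
            tower_outputs.append(build_tower(image_splits[i]))
            tf.get_variable_scope().reuse_variables()  # share weights across towers

with tf.device('/cpu:0'):
    logits = tf.concat(tower_outputs, axis=0)  # gather tower results on the CPU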
Thank you for your time.
There are some options:
1- Reduce your batch size.
2- Use memory growth:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, ...)
3- Don't allocate the whole GPU memory (use only 90%, for example):
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9
session = tf.Session(config=config, ...)
Reduce the batch_size in your code to 100 and it will work.