OOM - cannot run StyleGAN2 despite reducing batch size - tensorflow

I am trying to run StyleGAN2 using a cluster equipped with eight GPUs (NVIDIA GeForce RTX 2080). At present, I am using the following configuration in training_loop.py:
minibatch_size_dict = {4: 512, 8: 256, 16: 128, 32: 64, 64: 32}, # Resolution-specific overrides.
minibatch_gpu_base = 8, # Number of samples processed at a time by one GPU.
minibatch_gpu_dict = {}, # Resolution-specific overrides.
G_lrate_base = 0.001, # Learning rate for the generator.
G_lrate_dict = {}, # Resolution-specific overrides.
D_lrate_base = 0.001, # Learning rate for the discriminator.
D_lrate_dict = {}, # Resolution-specific overrides.
lrate_rampup_kimg = 0, # Duration of learning rate ramp-up.
tick_kimg_base = 4, # Default interval of progress snapshots.
tick_kimg_dict = {4:10, 8:10, 16:10, 32:10, 64:10, 128:8, 256:6, 512:4}): # Resolution-specific overrides.
I am training on a set of 512x512 pixel images. After a couple of iterations, I get the error message reported below and the script appears to stop running (watch nvidia-smi shows that both the temperature and the fan activity of the GPUs decrease). I have already reduced the batch size, but the problem seems to be somewhere else. Do you have any tips on how to fix this?
I was able to run the original StyleGAN with the same dataset. The paper says that StyleGAN2 should be lighter on resources, so I am a bit surprised.
Here is the error message I get:
2019-12-16 18:22:54.909009: E tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 334.11M (350338048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-12-16 18:22:54.909087: W tensorflow/core/common_runtime/bfc_allocator.cc:314] Allocator (GPU_0_bfc) ran out of memory trying to allocate 129.00MiB (rounded to 135268352). Current allocation summary follows.
2019-12-16 18:22:54.918750: W tensorflow/core/common_runtime/bfc_allocator.cc:319] **_***************************_*****x****x******xx***_******************************_***************
2019-12-16 18:22:54.918808: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at conv_grad_input_ops.cc:903 : Resource exhausted: OOM when allocating tensor with shape[4,128,257,257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

The config-f model for StyleGAN2 is actually bigger than StyleGAN1. Try a configuration that consumes less VRAM, such as config-e. You can change the model configuration by passing a flag to your python command, as shown here: https://github.com/NVlabs/stylegan2/blob/master/run_training.py#L144
In my case, I am able to train StyleGAN2 with config-e on two RTX 2080 Ti cards.
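For example, the invocation would look something like this (a sketch only; flag names are taken from run_training.py in that repo, and the data-dir/dataset values are placeholders for your own setup):
python run_training.py --num-gpus=8 --data-dir=~/datasets --dataset=my_dataset_512 --config=config-e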

One or more high-end NVIDIA GPUs, NVIDIA drivers, CUDA 10.0 toolkit
and cuDNN 7.5. To reproduce the results reported in the paper, you
need an NVIDIA GPU with at least 16 GB of DRAM.
Your NVIDIA GeForce RTX 2080 card has 11GB, but I guess you're saying you have 8 of them? I don't think TensorFlow is set up for multi-GPU parallelism out of the box.
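As a quick sanity check (not part of StyleGAN2 itself, just a minimal sketch assuming the TF 1.x build that StyleGAN2 targets), you can confirm how many GPUs TensorFlow actually sees before launching training:
from tensorflow.python.client import device_lib

# List every GPU TensorFlow can see; with eight cards this should print GPU:0 through GPU:7.
# Note that this call initializes the GPUs itself, so run it in a separate Python session.
print([d.name for d in device_lib.list_local_devices() if d.device_type == 'GPU'])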

Related

Tensorflow-GPU 2.4 VRAM issue

I am trying to run tensorflow-gpu version 2.4.0-dev20200828 (a tf-nightly build) for a convolutional neural network implementation. Some other details:
The version of python is Python 3.8.5.
Running Windows 10
Using an NVIDIA RTX 2080, which has 8 GB of VRAM
CUDA version 11.1
The following snippet is what I run:
import numpy as np
import tensorflow as tf
from tensorflow import keras

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)

vgg_16 = keras.applications.VGG16(include_top=False, input_shape=(600, 600, 3))
random_image = np.random.rand(1, 600, 600, 3)
output = vgg_16(random_image)
The code for the memory configuration was taken from answers here.
The issue I am having is that my GPU has 8 GB of VRAM, and I need to be able to run the CNN with relatively large image batch sizes. The example above runs on a single image, but surprisingly I seem to be able to increase the batch size to only about 2-3 images of 600x600. As its comment says, the code will "Restrict TensorFlow to only allocate 1GB of memory on the first GPU", which is clearly not ideal.
On the one hand, if I allocate more, say 4000 MB, I get errors such as:
E tensorflow/stream_executor/cuda/cuda_dnn.cc:325] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
On the other hand, if I leave it at 1024 MB, I get messages like:
Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.25GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Any insights/resources on how to understand this issue would be much appreciated. I'd be willing to switch to another version of tensorflow/python/CUDA if necessary, but ultimately I just want to gain a deeper understanding of what this issue is.
A better way to control memory usage is to enable memory growth. Remove all of the GPU configuration code above and use this instead:
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
Additionally, you can resize or crop the input images to a smaller size to further reduce memory usage.
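Putting this together with the VGG16 code from the question, a minimal sketch might look like the following (the batch size of 4 is only an illustrative value to tune against an 8 GB card):
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Enable memory growth before any op touches the GPU.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

vgg_16 = keras.applications.VGG16(include_top=False, input_shape=(600, 600, 3))
batch = np.random.rand(4, 600, 600, 3).astype('float32')  # hypothetical batch size of 4
output = vgg_16(batch)
print(output.shape)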

Understand the OOM mechanism of tensorflow

I am using GloVe pre-trained embeddings to train my own network. I use
self.embedding = tf.get_variable(name="embedding", shape=self.id2vec_table.shape, initializer=tf.constant_initializer(self.id2vec_table), trainable=False)
and tuning_embedding = tf.nn.embedding_lookup(self.embedding, self.txt_from_mfcc)
to initialize and look up the embeddings. However, when I run the training, the error shows up as follows (the full message is too long, so I include only the parts I believe are most important):
Sum Total of in-use chunks: 3.85GiB
Limit:        11281927373
InUse:         4131524096
MaxInUse:      6826330624
NumAllocs:          47061
MaxAllocSize:  2842165248
OP_REQUIRES failed at matmul_op.cc:478 : Resource exhausted: OOM when allocating tensor with shape[4800,400001] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
However, according to the error stats, the maximum memory of my Tesla K80 is 11 GB, and only about 40%-70% of it (around 4-7 GB) is in use. How can my GPU be out of memory when it is using at most 70% of its total memory? I just cannot understand the inner mechanism of how this works.
I have also tried methods from other posts, such as
https://stackoverflow.com/questions/42495930/tensorflow-oom-on-gpu
and limiting my batch size to 16, setting config.gpu_options.allow_growth = True, config.gpu_options.allocator_type = 'BFC', or config.gpu_options.per_process_gpu_memory_fraction = 0.4, but the error is still there.
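For reference, these options all go on a tf.ConfigProto that is passed to the session; a minimal sketch of what I set (TF 1.x, with the options shown together here for brevity):
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True                     # grow allocations on demand
config.gpu_options.allocator_type = 'BFC'                  # best-fit-with-coalescing allocator
config.gpu_options.per_process_gpu_memory_fraction = 0.4   # cap the process at 40% of the card
sess = tf.Session(config=config)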
Any help here?

google colaboratory `ResourceExhaustedError` with GPU

I'm trying to fine-tune a Vgg16 model using colaboratory but I ran into this error when training with the GPU.
OOM when allocating tensor of shape [7,7,512,4096]
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.ResourceExhaustedError'>, OOM when allocating tensor of shape [7,7,512,4096] and type float
[[Node: vgg_16/fc6/weights/Momentum/Initializer/zeros = Const[_class=["loc:@vgg_16/fc6/weights"], dtype=DT_FLOAT, value=Tensor<type: float shape: [7,7,512,4096] values: [[[0 0 0]]]...>, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Caused by op 'vgg_16/fc6/weights/Momentum/Initializer/zeros', defined at:
I also have this output for my VM session:
--- colab vm info ---
python v=3.6.3
tensorflow v=1.4.1
tf device=/device:GPU:0
model name : Intel(R) Xeon(R) CPU @ 2.20GHz
model name : Intel(R) Xeon(R) CPU @ 2.20GHz
MemTotal: 13341960 kB
MemFree: 1541740 kB
MemAvailable: 10035212 kB
My tfrecord is just 118 256x256 JPGs with file size <2MB
Is there a workaround? It works when I use the CPU, just not the GPU.
Seeing a small amount of free GPU memory almost always indicates that you've created a TensorFlow session without the allow_growth = True option. See:
https://www.tensorflow.org/guide/using_gpu#allowing_gpu_memory_growth
If you don't set this option, by default, TensorFlow will reserve nearly all GPU memory when a session is created.
Good news: as of this week, Colab now sets this option by default, so you should see much lower growth as you use multiple notebooks on Colab. You can also inspect GPU memory usage per notebook by selecting 'Manage sessions' from the Runtime menu.
Once selected, you'll see a dialog that lists all notebooks and the GPU memory each is consuming. To free memory, you can terminate runtimes from this dialog as well.
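If you are creating the session yourself (for example on an older runtime where the default is not applied), a minimal TF 1.x sketch of enabling the option looks like this:
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory incrementally instead of grabbing it all up front
sess = tf.Session(config=config)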
I ran into the same issue, and I found that my problem was caused by the code below:
from tensorflow.python.framework.test_util import is_gpu_available as tf
if tf() == True:
    device = '/gpu:0'
else:
    device = '/cpu:0'
I used the code below to check the GPU memory usage and found that it was 0% before running the code above and 95% after.
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize
import psutil
import humanize
import os
import GPUtil as GPU
GPUs = GPU.getGPUs()
# XXX: only one GPU on Colab and isn't guaranteed
gpu = GPUs[0]
def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize( psutil.virtual_memory().available ), " I Proc size: " + humanize.naturalsize( process.memory_info().rss))
    print('GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB'.format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))
printm()
Before:
Gen RAM Free: 12.7 GB I Proc size: 139.1 MB
GPU RAM Free: 11438MB | Used: 1MB | Util 0% | Total 11439MB
After:
Gen RAM Free: 12.0 GB I Proc size: 1.0 GB
GPU RAM Free: 564MB | Used: 10875MB | Util 95% | Total 11439MB
Somehow, is_gpu_available() managed to consume most of the GPU memory without releasing it afterwards, so instead I used the code below to detect the GPU status, and that solved the problem:
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
try:
    import GPUtil as GPU
    GPUs = GPU.getGPUs()
    device = '/gpu:0'
except:
    device = '/cpu:0'
I failed to repro the originally-reported error, but if that is caused by running out of GPU memory (as opposed to main memory) this might help:
# See https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
and then pass session_config=config to e.g. slim.learning.train() (or whatever session constructor you end up using).
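For instance, a self-contained sketch of wiring the config into slim.learning.train (assuming TF 1.x with tf.contrib.slim; the toy loss and log directory below are placeholders, not part of the original answer):
import tensorflow as tf
slim = tf.contrib.slim

# Toy graph so the snippet runs on its own; replace with your VGG16 fine-tuning graph.
x = tf.Variable(1.0)
loss = tf.square(x)
optimizer = tf.train.GradientDescentOptimizer(0.1)
train_op = slim.learning.create_train_op(loss, optimizer)

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # grow GPU allocations on demand

# session_config is forwarded to the tf.Session that slim creates internally.
slim.learning.train(train_op, '/tmp/slim_logs', session_config=config, number_of_steps=1)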
In my case the solution provided by Ami did not work, even though it is excellent, probably because the Colaboratory VM could not provide more resources.
I had the OOM error in the detection phase (not during model training). I solved it with a workaround, disabling the GPU for detection:
config = tf.ConfigProto(device_count = {'GPU': 0})
sess = tf.Session(config=config)

tensorflow: CUDA_ERROR_OUT_OF_MEMORY always happen

I'm trying to train a seq2seq model using the tf-seq2seq package on a 1080 Ti (11 GB) GPU. I always get the following error regardless of network size (even nmt_small):
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Graphics Device
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:03:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Graphics Device, pci bus id: 0000:03:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 10.91G (11715084288 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 12337 get requests, put_count=10124 evicted_count=1000 eviction_rate=0.0987752 and unsatisfied allocation rate=0.268542
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
INFO:tensorflow:Saving checkpoints for 1 into ../model/model.ckpt.
INFO:tensorflow:step = 1, loss = 5.07399
It seems that TensorFlow tries to occupy the total amount of GPU memory (10.91 GiB), even though only 10.75 GiB is free.
You should consider a few tips:
1- Use memory growth. From the TensorFlow documentation: "in some cases it is desirable for the process to only allocate a subset of the available memory, or to only grow the memory usage as is needed by the process. TensorFlow provides two Config options on the Session to control this."
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, ...)
2- Are you training in batches, or feeding the whole dataset at once? If you are using batches, try decreasing your batch size.
In addition to both of the suggestions made concerning the memory growth, you can also try:
sess_config = tf.ConfigProto()
sess_config.gpu_options.per_process_gpu_memory_fraction = 0.90
with tf.Session(config=sess_config) as sess:
...
With this you can limit the amount of GPU memory allocated by the program, in this case to 90 percent of the available GPU memory. Maybe this is sufficient to solve your problem of the network trying to allocate more memory than available.
If this is not sufficient, you will have to decrease the batch size or the network's size.

CNTK cannot detect my GPU via cntk.all_devices()

CNTK only detects one device (my CPU) when I call cntk.all_devices(). However, I do have a GPU on my computer. By running the tutorial provided by CNTK, I get some info:
-------------------------------------------------------------------
-------------------------------------------------------------------
GPU info:
Device[0]: cores = 48; computeCapability = 2.1; type = "NVS 310"; memory = 512 MB
-------------------------------------------------------------------
##############################################################################
# #
# Train command (train action) #
# #
##############################################################################
Model has 9 nodes. Using CPU.
As a consequence, I cannot use my GPU, even by calling set_default_device(gpu(0)). How can I solve this problem?
The minimum GPU compute capability for CNTK is 3.0. (Edit: the fact that you can run the tutorial using cntk.exe indicates a bug somewhere in the v1 executable.) When you run the tutorial with cntk.exe, it prints out the GPU info but still ends up using the CPU: "Model has 9 nodes. Using CPU."
The only way to solve this problem is to change the value of the constant MininumCCMajorForGpu in BestGpu.cpp and recompile.