Forcing PyTorch to keep some memory aside - tensorflow

I am using YOLOv3 by Ultralytics (PyTorch) to detect the behavior of cows in a video. YOLOv3 was trained to detect each individual cow in the video. Each detected cow is cropped from the frame using the X and Y coordinates of its bounding box, and the crop then goes through a second model that determines whether the cow is sitting or standing. The second model, a very simple InceptionV3 built with TensorFlow, was also trained on our own dataset.
However, whenever I try to load both models, I get the following error:
RuntimeError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 16.00 GiB total capacity; 427.42 MiB already allocated; 7.50 MiB free; 448.00 MiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
If the second model is not loaded, YOLOv3 (PyTorch) runs without any issue and does not even use the whole 16 GB of VRAM. Is YOLOv3 reserving the whole VRAM and leaving nothing for the TensorFlow-based InceptionV3? If so, is there any way to force torch to keep 2 GB of VRAM aside?
Full output:
>> python detectv2.py --weights best.pt --source outch06_20181022073801_0_10.avi
2022-06-01 16:02:40.975544: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-06-01 16:02:41.342394: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14123 MB memory: -> device: 0, name: Quadro RTX 5000, pci bus id: 0000:65:00.0, compute capability: 7.5
detectv2: weights=['best.pt'], source=outch06_20181022073801_0_10.avi, imgsz=[640, 640], conf_thres=0.25, iou_thres=0.45, max_det=1000, device=, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs\detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False
Empty DataFrame
Columns: []
Index: []
YOLOv3 2022-5-16 torch 1.11.0 CUDA:0 (Quadro RTX 5000, 16384MiB)
Fusing layers...
Model Summary: 269 layers, 62546518 parameters, 0 gradients
Traceback (most recent call last):
File "detectv2.py", line 462, in <module>
main(opt)
File "detectv2.py", line 457, in main
run(**vars(opt))
File "C:\Users\sourav\Anaconda3\envs\yl37\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "detectv2.py", line 221, in run
model(torch.zeros(1, 3, *imgsz).to(device).type_as(next(model.model.parameters()))) # warmup
File "C:\Users\sourav\Anaconda3\envs\yl37\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\sourav\yolov3-master\models\common.py", line 357, in forward
y = self.model(im) if self.jit else self.model(im, augment=augment, visualize=visualize)
File "C:\Users\sourav\Anaconda3\envs\yl37\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\sourav\yolov3-master\models\yolo.py", line 127, in forward
return self._forward_once(x, profile, visualize) # single-scale inference, train
File "C:\Users\sourav\yolov3-master\models\yolo.py", line 150, in _forward_once
x = m(x) # run
File "C:\Users\sourav\Anaconda3\envs\yl37\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\sourav\yolov3-master\models\common.py", line 48, in forward_fuse
return self.act(self.conv(x))
File "C:\Users\sourav\Anaconda3\envs\yl37\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\sourav\Anaconda3\envs\yl37\lib\site-packages\torch\nn\modules\conv.py", line 447, in forward
return self._conv_forward(input, self.weight, self.bias)
File "C:\Users\sourav\Anaconda3\envs\yl37\lib\site-packages\torch\nn\modules\conv.py", line 444, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 16.00 GiB total capacity; 427.42 MiB already allocated; 7.50 MiB free; 448.00 MiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

PyTorch was not occupying the GPU as I had thought; it was the other way around. I was initializing the TensorFlow model first, which claimed nearly all of the GPU memory and left nothing for PyTorch.
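The figures in the error message already hint at this: PyTorch had reserved only 448 MiB of a 16 GiB card, yet just 7.50 MiB was free, so the rest must have been held outside PyTorch's allocator. Below is a minimal sketch that pulls those numbers out of the message. The helper name and the regular expression are my own (not part of PyTorch), and the message format may differ between PyTorch versions.

```python
import re

MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 50.00 MiB "
    "(GPU 0; 16.00 GiB total capacity; 427.42 MiB already allocated; "
    "7.50 MiB free; 448.00 MiB reserved in total by PyTorch)"
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}
FIELD = re.compile(
    r"([\d.]+) (MiB|GiB) (total capacity|already allocated|free|reserved)"
)


def parse_oom_message(message):
    """Hypothetical helper: return the sizes quoted in the message, in MiB."""
    return {
        label: float(value) * UNIT_TO_MIB[unit]
        for value, unit, label in FIELD.findall(message)
    }


stats = parse_oom_message(MESSAGE)
# Memory that is neither free nor reserved by PyTorch's caching allocator
held_elsewhere = stats["total capacity"] - stats["free"] - stats["reserved"]
print(f"{held_elsewhere:.1f} MiB held outside PyTorch's allocator")
```

For this message the result is about 15.5 GiB, which is in line with the TensorFlow log line reporting a device created with 14123 MB. The remainder would be things like CUDA contexts and other processes, which I have not measured.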
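The solution is to let TensorFlow allocate GPU memory on demand (memory growth) instead of claiming it all at start-up.
For TensorFlow 2.2+ with a single GPU: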
import tensorflow as tf
gpu = tf.config.experimental.list_physical_devices('GPU')[0]
tf.config.experimental.set_memory_growth(gpu, True)
Details can be found in the TensorFlow GPU guide, in the section on limiting GPU memory growth.
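For completeness, the original question (keeping about 2 GB away from PyTorch) can also be approached from the PyTorch side. This is only a sketch, assuming a PyTorch version that provides torch.cuda.set_per_process_memory_fraction; the two helper functions are hypothetical names of mine. Note that the cap only limits PyTorch's own caching allocator, so it does not stop TensorFlow from claiming the card first, and memory growth is still needed.

```python
def fraction_leaving_headroom(total_gib, reserve_gib):
    """Fraction of the card PyTorch may use so that reserve_gib stays untouched."""
    if not 0 <= reserve_gib < total_gib:
        raise ValueError("reserve must be non-negative and smaller than the total")
    return (total_gib - reserve_gib) / total_gib


def cap_pytorch_memory(fraction, device=0):
    """Apply the cap if PyTorch and a CUDA device are available; return True if applied."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    # Limits how much memory PyTorch's caching allocator may take on this device
    torch.cuda.set_per_process_memory_fraction(fraction, device)
    return True


fraction = fraction_leaving_headroom(16, 2)  # 16 GiB card, keep 2 GiB aside
applied = cap_pytorch_memory(fraction)
print(fraction, applied)
```

With the cap in place, PyTorch raises its out-of-memory error once it tries to go past the allowed fraction, instead of growing into memory meant for the other framework.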

Related

Tensorflow-GPU 2.4 VRAM issue

I am trying to run tensorflow-gpu version 2.4.0-dev20200828 (a tf-nightly build) for a convolutional neural network implementation. Some other details:
The version of python is Python 3.8.5.
Running Windows 10
Using an nVidia RTX 2080 which has 8 GB VRAM
Cuda Version 11.1
The following snippet is what I run:
import numpy as np
import tensorflow as tf
from tensorflow import keras

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
vgg_16 = keras.applications.VGG16(include_top=False, input_shape=(600, 600, 3))
random_image = np.random.rand(1, 600, 600, 3)
output = vgg_16(random_image)
The memory configuration code was taken from answers elsewhere.
The issue I am having is that my GPU has 8 GB of VRAM, and I need to be able to run the CNN with relatively large image batch sizes. The example is executed on a single image, but surprisingly I seem to be able to increase the batch size only to about 2-3 images of 600 by 600. According to its comments, the borrowed code will restrict TensorFlow to only allocate 1 GB of memory on the first GPU, which is clearly not ideal.
If I allocate more, say 4000 MB, I get errors such as:
E tensorflow/stream_executor/cuda/cuda_dnn.cc:325] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
If I leave it as 1024 MB, I get messages like:
Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.25GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Any insights/resources on how to understand this issue much appreciated. I'd be willing to switch to another version of tensorflow/python/cuda if necessary, but ultimately I just want to have a deeper understanding of what this issue is.
A better way to control memory usage is to enable memory growth. Remove all of the GPU configuration code above and use this instead:
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
Additionally, you can resize or crop the input image to a smaller size to further reduce memory usage.
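To get a feel for why a 1024 MB limit only fits two or three images, a back-of-the-envelope estimate of the convolution outputs helps. This is a rough sketch, assuming the standard VGG16 layout (13 3x3 "same" convolutions in five blocks, each block followed by 2x2 pooling), float32 activations, and that every convolution output is alive at once. Real usage also depends on cuDNN workspaces and on what TensorFlow frees during inference, so treat it as an order of magnitude only.

```python
# (number of conv layers, output channels) for the five VGG16 blocks
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def vgg16_conv_activation_bytes(height, width, batch=1, bytes_per_value=4):
    """Sum the sizes of all convolution outputs for one forward pass."""
    total = 0
    h, w = height, width
    for n_convs, channels in VGG16_BLOCKS:
        total += n_convs * batch * h * w * channels * bytes_per_value
        h, w = h // 2, w // 2  # 2x2 max pooling halves each spatial dimension
    return total


per_image = vgg16_conv_activation_bytes(600, 600)
print(f"{per_image / 2**20:.1f} MiB of conv outputs per 600x600 image")
print(f"{vgg16_conv_activation_bytes(300, 300) / 2**20:.1f} MiB at 300x300")
```

Under these assumptions a single 600x600 image needs roughly 370 MiB for convolution outputs alone, so a 1024 MB cap is at least consistent with the observed limit of 2-3 images. Halving the image side cuts the figure to about a quarter, which is why resizing helps so much.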

OOM - cannot run StyleGAN2 despite reducing batch size

I am trying to run StyleGAN2 using a cluster equipped with eight GPUs (NVIDIA GeForce RTX 2080). At present, I am using the following configuration in training_loop.py:
minibatch_size_dict = {4: 512, 8: 256, 16: 128, 32: 64, 64: 32}, # Resolution-specific overrides.
minibatch_gpu_base = 8, # Number of samples processed at a time by one GPU.
minibatch_gpu_dict = {}, # Resolution-specific overrides.
G_lrate_base = 0.001, # Learning rate for the generator.
G_lrate_dict = {}, # Resolution-specific overrides.
D_lrate_base = 0.001, # Learning rate for the discriminator.
D_lrate_dict = {}, # Resolution-specific overrides.
lrate_rampup_kimg = 0, # Duration of learning rate ramp-up.
tick_kimg_base = 4, # Default interval of progress snapshots.
tick_kimg_dict = {4:10, 8:10, 16:10, 32:10, 64:10, 128:8, 256:6, 512:4}): # Resolution-specific overrides.
I am training on a set of 512x512 pixel images. After a couple of iterations, I get the error message reported below and the script appears to stop running (using watch nvidia-smi, I can see that both the temperature and the fan activity of the GPUs decrease). I have already reduced the batch size, but the problem seems to be somewhere else. Do you have any tips on how to fix this?
I was able to run StyleGAN with the same dataset. In the paper they say that StyleGAN2 should be less heavy, so I am a bit surprised.
Here is the error message I get:
2019-12-16 18:22:54.909009: E tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 334.11M (350338048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-12-16 18:22:54.909087: W tensorflow/core/common_runtime/bfc_allocator.cc:314] Allocator (GPU_0_bfc) ran out of memory trying to allocate 129.00MiB (rounded to 135268352). Current allocation summary follows.
2019-12-16 18:22:54.918750: W tensorflow/core/common_runtime/bfc_allocator.cc:319] **_***************************_*****x****x******xx***_******************************_***************
2019-12-16 18:22:54.918808: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at conv_grad_input_ops.cc:903 : Resource exhausted: OOM when allocating tensor with shape[4,128,257,257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
The config-f model for StyleGAN2 is actually bigger than StyleGAN1. Try a configuration that consumes less VRAM, such as config-e. You can change the configuration of the model by passing a flag in your python command, as shown here: https://github.com/NVlabs/stylegan2/blob/master/run_training.py#L144
In my case, I'm able to train StyleGAN2 with config-e on two RTX 2080 Ti cards.
The StyleGAN2 README lists these requirements:
One or more high-end NVIDIA GPUs, NVIDIA drivers, CUDA 10.0 toolkit
and cuDNN 7.5. To reproduce the results reported in the paper, you
need an NVIDIA GPU with at least 16 GB of DRAM.
A GeForce RTX 2080 has 8 GB of memory (it is the 2080 Ti that has 11 GB), but I guess you're saying you have eight of them? I don't think TensorFlow is set up for parallelism out of the box.
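The size of the tensor named in the OOM line can be checked by hand, which helps when judging whether a smaller per-GPU batch would be enough. A minimal sketch, assuming that "float" in the TensorFlow log means 32-bit floats (the helper name is mine):

```python
BYTES_PER_ELEMENT = {"float": 4, "half": 2}  # assumption: "float" is float32


def tensor_bytes(shape, dtype="float"):
    """Number of bytes needed for a dense tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * BYTES_PER_ELEMENT[dtype]


# shape[4,128,257,257] from the "Resource exhausted" line in the question
size = tensor_bytes([4, 128, 257, 257])
print(size, f"{size / 2**20:.2f} MiB")
```

The result, 135268352 bytes, is the same number the allocator reports ("rounded to 135268352", about 129 MiB). A single allocation of that size failing suggests the card was already close to full at that point, so a lighter configuration is more likely to help than shaving the batch size slightly.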

PyTorch Object Detection with GPU on Ubuntu 18.04 - RuntimeError: CUDA out of memory. Tried to allocate xx.xx MiB

I'm attempting to get this PyTorch person detection example:
https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html
running locally with a GPU, either in a Jupyter Notebook or a regular python file. I get the error in the title either way.
I'm using Ubuntu 18.04. Here is a summary of the steps I've performed:
1) Stock Ubuntu 18.04 install on a Lenovo ThinkPad X1 Extreme Gen 2 with a GTX 1650 GPU.
2) Perform a standard CUDA 10.0 / cuDNN 7.4 install. I'd rather not restate all the steps as this post is going to be more than long enough already. This is a standard procedure, pretty much any link found via googling is what I followed.
3) Install torch and torchvision
pip3 install torch torchvision
4) From this link on the PyTorch site:
https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html
I've both saved the linked notebook:
https://colab.research.google.com/github/pytorch/vision/blob/temp-tutorial/tutorials/torchvision_finetuning_instance_segmentation.ipynb
and also tried the link at the bottom of that page to the regular Python file:
https://pytorch.org/tutorials/_static/tv-training-code.py
5) Before running either the notebook or the regular Python way, I did the following (found at the top of the above linked notebook):
Install the CoCo API into Python:
cd ~
git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI
open Makefile in gedit, change the two instances of "python" to "python3", then:
python3 setup.py build_ext --inplace
sudo python3 setup.py install
Get the necessary files the above linked files need to run:
cd ~
git clone https://github.com/pytorch/vision.git
cd vision
git checkout v0.5.0
From ~/vision/references/detection, copy coco_eval.py, coco_utils.py, engine.py, transforms.py, and utils.py to whichever directory the above linked notebook or tv-training-code.py file is being run from.
6) Download the Penn Fudan Pedestrian dataset from the link on the above page:
https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip
then unzip and put in the same directory as the notebook or tv-training-code.py
In case the above link ever breaks or just for easier reference, here is tv-training-code.py as I have downloaded it at this time:
# Sample code from the TorchVision 0.3 Object Detection Finetuning Tutorial
# http://pytorch.org/tutorials/intermediate/torchvision_tutorial.html
import os
import numpy as np
import torch
from PIL import Image
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor
from engine import train_one_epoch, evaluate
import utils
import transforms as T


class PennFudanDataset(object):
    def __init__(self, root, transforms):
        self.root = root
        self.transforms = transforms
        # load all image files, sorting them to
        # ensure that they are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))

    def __getitem__(self, idx):
        # load images ad masks
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        # note that we haven't converted the mask to RGB,
        # because each color corresponds to a different instance
        # with 0 being background
        mask = Image.open(mask_path)
        mask = np.array(mask)
        # instances are encoded as different colors
        obj_ids = np.unique(mask)
        # first id is the background, so remove it
        obj_ids = obj_ids[1:]
        # split the color-encoded mask into a set
        # of binary masks
        masks = mask == obj_ids[:, None, None]
        # get bounding box coordinates for each mask
        num_objs = len(obj_ids)
        boxes = []
        for i in range(num_objs):
            pos = np.where(masks[i])
            xmin = np.min(pos[1])
            xmax = np.max(pos[1])
            ymin = np.min(pos[0])
            ymax = np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # there is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)
        masks = torch.as_tensor(masks, dtype=torch.uint8)
        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)
        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["masks"] = masks
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd
        if self.transforms is not None:
            img, target = self.transforms(img, target)
        return img, target

    def __len__(self):
        return len(self.imgs)


def get_model_instance_segmentation(num_classes):
    # load an instance segmentation model pre-trained pre-trained on COCO
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
    # get number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    # now get the number of input features for the mask classifier
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    # and replace the mask predictor with a new one
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask,
                                                       hidden_layer,
                                                       num_classes)
    return model


def get_transform(train):
    transforms = []
    transforms.append(T.ToTensor())
    if train:
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)


def main():
    # train on the GPU or on the CPU, if a GPU is not available
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    # our dataset has two classes only - background and person
    num_classes = 2
    # use our dataset and defined transformations
    dataset = PennFudanDataset('PennFudanPed', get_transform(train=True))
    dataset_test = PennFudanDataset('PennFudanPed', get_transform(train=False))
    # split the dataset in train and test set
    indices = torch.randperm(len(dataset)).tolist()
    dataset = torch.utils.data.Subset(dataset, indices[:-50])
    dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])
    # define training and validation data loaders
    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=2, shuffle=True, num_workers=4,
        collate_fn=utils.collate_fn)
    data_loader_test = torch.utils.data.DataLoader(
        dataset_test, batch_size=1, shuffle=False, num_workers=4,
        collate_fn=utils.collate_fn)
    # get the model using our helper function
    model = get_model_instance_segmentation(num_classes)
    # move model to the right device
    model.to(device)
    # construct an optimizer
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.005,
                                momentum=0.9, weight_decay=0.0005)
    # and a learning rate scheduler
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                                   step_size=3,
                                                   gamma=0.1)
    # let's train it for 10 epochs
    num_epochs = 10
    for epoch in range(num_epochs):
        # train for one epoch, printing every 10 iterations
        train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
        # update the learning rate
        lr_scheduler.step()
        # evaluate on the test dataset
        evaluate(model, data_loader_test, device=device)
    print("That's it!")


if __name__ == "__main__":
    main()
Here is an example run of tv-training-code.py:
$ python3 tv-training-code.py
Epoch: [0] [ 0/60] eta: 0:01:17 lr: 0.000090 loss: 4.1717 (4.1717) loss_classifier: 0.8903 (0.8903) loss_box_reg: 0.1379 (0.1379) loss_mask: 3.0632 (3.0632) loss_objectness: 0.0700 (0.0700) loss_rpn_box_reg: 0.0104 (0.0104) time: 1.2864 data: 0.1173 max mem: 1865
Traceback (most recent call last):
File "tv-training-code.py", line 165, in <module>
main()
File "tv-training-code.py", line 156, in main
train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
File "/xxx/PennFudanExample/engine.py", line 46, in train_one_epoch
losses.backward()
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 166, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/function.py", line 77, in apply
return self._forward_cls.backward(self, *args)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/function.py", line 189, in wrapper
outputs = fn(ctx, *args)
File "/usr/local/lib/python3.6/dist-packages/torchvision/ops/roi_align.py", line 38, in backward
output_size[0], output_size[1], bs, ch, h, w, sampling_ratio)
RuntimeError: CUDA out of memory. Tried to allocate 132.00 MiB (GPU 0; 3.81 GiB total capacity; 2.36 GiB already allocated; 132.69 MiB free; 310.59 MiB cached) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:267)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fdfb6c9b813 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1ce68 (0x7fdfb6edce68 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1de6e (0x7fdfb6edde6e in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #3: at::native::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x279 (0x7fdf59472789 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so)
[many more frame lines omitted]
Clearly the line:
RuntimeError: CUDA out of memory. Tried to allocate 132.00 MiB (GPU 0; 3.81 GiB total capacity; 2.36 GiB already allocated; 132.69 MiB free; 310.59 MiB cached) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:267)
is the critical error.
If I run nvidia-smi before a run:
$ nvidia-smi
Tue Dec 24 14:32:49 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1650    Off  | 00000000:01:00.0  On |                  N/A |
| N/A   47C    P8     5W /  N/A |    296MiB /  3903MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1190      G   /usr/lib/xorg/Xorg                           142MiB |
|    0      1830      G   /usr/bin/gnome-shell                          72MiB |
|    0      3711      G   ...uest-channel-token=14371934934688572948    78MiB |
+-----------------------------------------------------------------------------+
It seems pretty clear there is plenty of GPU memory available (this GPU is 4GB).
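The same check can be scripted so that the free memory is logged right before training starts. This is a small sketch, assuming nvidia-smi is on the PATH and prints plain integers for both fields (it can print placeholders such as [N/A] on some setups, which this parser does not handle). The function names are my own.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse lines such as '296, 3903' into (used_mib, total_mib) tuples, one per GPU."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Return [(used_mib, total_mib), ...], or an empty list if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)


# Sample values copied from the nvidia-smi table above
used, total = parse_memory_csv("296, 3903\n")[0]
print(f"{total - used} MiB not yet in use")
```

With the values from the table, about 3607 MiB is unused before the run. That is the ceiling for everything the training process needs, including the CUDA context and allocator overhead, not only the tensors.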
Moreover, I'm confident my CUDA/cuDNN install and GPU hardware are good because I frequently run training and inference with the TensorFlow Object Detection API on this computer, and as long as I use the allow_growth option I never get GPU-related errors.
From Googling on this error it seems to be relatively common. The most common solutions are:
1) Try a smaller batch size (not really applicable in this case since the training and testing batch sizes are 2 and 1 respectively, and I tried with 1 and 1 and still got the same error)
2) Update to the latest version of PyTorch (but I'm already at the latest version).
Some other suggestions involve reworking the training script. I'm very familiar with TensorFlow but I'm new to PyTorch so I'm not sure how to go about that. Also, most of the rework suggestions I can find for this error do not pertain to object detection and therefore I'm not able to relate them to this training script specifically.
Has anybody else gotten this script to run locally with an NVIDIA GPU? Do you suspect an OS/CUDA/PyTorch configuration problem, or is there some way the script can be reworked to prevent this error? Any assistance would be greatly appreciated.
Very strange: after changing both the training and testing batch size to 1, it no longer crashes with a GPU error, even though I'm certain I tried this before.
Perhaps it had something to do with changing the batch size to 1 for both training and testing and then rebooting, or with something else getting refreshed? I'm not really sure.
Now the evaluate function call is crashing with the error:
object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
But it seems this is completely unrelated so I'll make a separate post for that.

How to train and evaluate simultaneously in Object Detection API?

I want to train and evaluate ssd_mobilenet_v1_coco on my own dataset at the same time with the Object Detection API.
However, when I simply try to do so, I am faced with GPU memory being nearly full and thus the evaluation script fails to start.
Here are the commands I use for training and then evaluation:
The training script is called in one terminal pane like this:
python3 train.py \
--logtostderr \
--train_dir=training_ssd_mobile_caltech \
--pipeline_config_path=ssd_mobilenet_v1_coco_2017_11_17/ssd_mobilenet_v1_focal_loss_coco.config
That runs fine and training works. Then I try to run the evaluation script in the second terminal pane:
python3 eval.py \
--logtostderr \
--checkpoint_dir=training_ssd_mobile_caltech \
--eval_dir=eval_caltech \
--pipeline_config_path=ssd_mobilenet_v1_coco_2017_11_17/ssd_mobilenet_v1_focal_loss_coco.config
It fails with the following error:
python3 eval.py \
--logtostderr \
--checkpoint_dir=training_ssd_mobile_caltech \
--eval_dir=eval_caltech \
--pipeline_config_path=ssd_mobilenet_v1_coco_2017_11_17/ssd_mobilenet_v1_focal_loss_coco.config
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
2018-02-28 18:40:00.302271: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-02-28 18:40:00.412808: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-02-28 18:40:00.413217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.835
pciBusID: 0000:01:00.0
totalMemory: 7.92GiB freeMemory: 93.00MiB
2018-02-28 18:40:00.413424: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-02-28 18:40:00.957090: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 43.00M (45088768 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-02-28 18:40:00.957919: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 38.70M (40580096 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
INFO:tensorflow:Restoring parameters from training_ssd_mobile_caltech/model.ckpt-4775
INFO:tensorflow:Restoring parameters from training_ssd_mobile_caltech/model.ckpt-4775
2018-02-28 18:40:02.274830: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 8.17M (8566528 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-02-28 18:40:02.278599: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 8.17M (8566528 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-02-28 18:40:12.280515: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 8.17M (8566528 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-02-28 18:40:12.281958: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 8.17M (8566528 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-02-28 18:40:12.282082: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.75MiB. Current allocation summary follows.
2018-02-28 18:40:12.282160: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (256): Total Chunks: 190, Chunks in use: 190. 47.5KiB allocated for chunks. 47.5KiB in use in bin. 11.8KiB client-requested in use in bin.
2018-02-28 18:40:12.282251: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (512): Total Chunks: 70, Chunks in use: 70. 35.0KiB allocated for chunks. 35.0KiB in use in bin. 35.0KiB client-requested in use in bin.
[.......................................]
2018-02-28 18:40:12.290959: I tensorflow/core/common_runtime/bfc_allocator.cc:684] Sum Total of in-use chunks: 29.83MiB
2018-02-28 18:40:12.290971: I tensorflow/core/common_runtime/bfc_allocator.cc:686] Stats:
Limit: 45088768
InUse: 31284736
MaxInUse: 32368384
NumAllocs: 808
MaxAllocSize: 5796864
2018-02-28 18:40:12.291022: W tensorflow/core/common_runtime/bfc_allocator.cc:277] **********************xx*********xx**_*__****______***********************************************xx
2018-02-28 18:40:12.291044: W tensorflow/core/framework/op_kernel.cc:1198] Resource exhausted: OOM when allocating tensor with shape[1,32,150,150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
WARNING:root:The following classes have no ground truth examples: 1
/home/mm/models/research/object_detection/utils/metrics.py:144: RuntimeWarning: invalid value encountered in true_divide
num_images_correctly_detected_per_class / num_gt_imgs_per_class)
/home/mm/models/research/object_detection/utils/object_detection_evaluation.py:710: RuntimeWarning: Mean of empty slice
mean_ap = np.nanmean(self.average_precision_per_class)
/home/mm/models/research/object_detection/utils/object_detection_evaluation.py:711: RuntimeWarning: Mean of empty slice
mean_corloc = np.nanmean(self.corloc_per_class)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1329, in _run_fn
status, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,32,150,150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Preprocessor/sub, FeatureExtractor/MobilenetV1/Conv2d_0/weights/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1/_469 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1068_Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "eval.py", line 146, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 124, in run
_sys.exit(main(argv))
File "eval.py", line 142, in main
FLAGS.checkpoint_dir, FLAGS.eval_dir)
File "/home/mm/models/research/object_detection/evaluator.py", line 240, in evaluate
save_graph_dir=(eval_dir if eval_config.save_graph else ''))
File "/home/mm/models/research/object_detection/eval_util.py", line 407, in repeated_checkpoint_run
save_graph_dir)
File "/home/mm/models/research/object_detection/eval_util.py", line 286, in _run_checkpoint_once
result_dict = batch_processor(tensor_dict, sess, batch, counters)
File "/home/mm/models/research/object_detection/evaluator.py", line 183, in _process_batch
result_dict = sess.run(tensor_dict)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1128, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1344, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1363, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,32,150,150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Preprocessor/sub, FeatureExtractor/MobilenetV1/Conv2d_0/weights/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1/_469 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1068_Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Caused by op 'FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Conv2D', defined at:
File "eval.py", line 146, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 124, in run
_sys.exit(main(argv))
File "eval.py", line 142, in main
FLAGS.checkpoint_dir, FLAGS.eval_dir)
File "/home/mm/models/research/object_detection/evaluator.py", line 161, in evaluate
ignore_groundtruth=eval_config.ignore_groundtruth)
File "/home/mm/models/research/object_detection/evaluator.py", line 72, in _extract_prediction_tensors
prediction_dict = model.predict(preprocessed_image, true_image_shapes)
File "/home/mm/models/research/object_detection/meta_architectures/ssd_meta_arch.py", line 334, in predict
preprocessed_inputs)
File "/home/mm/models/research/object_detection/models/ssd_mobilenet_v1_feature_extractor.py", line 112, in extract_features
scope=scope)
File "/home/mm/models/research/slim/nets/mobilenet_v1.py", line 232, in mobilenet_v1_base
scope=end_point)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1057, in convolution
outputs = layer.apply(inputs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/base.py", line 762, in apply
return self.__call__(inputs, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/base.py", line 652, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/convolutional.py", line 167, in call
outputs = self._convolution_op(inputs, self.kernel)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 838, in __call__
return self.conv_op(inp, filter)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 502, in __call__
return self.call(inp, filter)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 190, in __call__
name=self.name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 639, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,32,150,150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Preprocessor/sub, FeatureExtractor/MobilenetV1/Conv2d_0/weights/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1/_469 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1068_Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Before eval.py starts, the TF training job has already allocated all of the GPU memory, so I can't figure out how to run both at the same time, or at least have the Object Detection API run evaluation at specific intervals.
So, is it possible in the first place to run evaluation simultaneously with training? If so, how is it done?
System information
What is the top-level directory of the model you are using: object_detection
Have I written custom code: not yet...
OS Platform and Distribution : Linux Ubuntu 16.04 LTS
TensorFlow installed from (source or binary): pip3 tensorflow-gpu
TensorFlow version (use command below): 1.5.0
CUDA/cuDNN version: 9.0/7.0
GPU model and memory: GTX 1080, 8 GB
One simple way to do this is to set CUDA_VISIBLE_DEVICES to an empty string before your command:
CUDA_VISIBLE_DEVICES="" python eval.py --logtostderr --pipeline_config_path=multires.config --checkpoint_dir=/train_dir/ --eval_dir=eval_dir/
which will prevent your evaluation script from seeing any GPU, and it should fall back to CPU automatically.
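The same effect can be had from inside a Python script, as long as the variable is set before TensorFlow is imported. A minimal sketch, assuming nothing in the process has initialized CUDA yet; the helper name hide_gpus is made up for illustration:

```python
import os


def hide_gpus():
    """Hide all CUDA devices from libraries imported after this call."""
    # An empty value means "no visible devices", like the shell prefix above.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""
    return os.environ["CUDA_VISIBLE_DEVICES"]


hide_gpus()
# Import TensorFlow only after this point, for example:
# import tensorflow as tf
```

Setting the variable in the script keeps the training command untouched, at the cost of having to remember that the assignment must come before the first TensorFlow import.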
To force the eval job to run on your CPU (and prevent it from taking precious GPU-memory), create one virtualenv where you install tensorflow-gpu which you use for training (named e.g. virtual_tf_gpu), and another one where you install tensorflow WITHOUT gpu support (e.g. virtual_tf). Activate your two virtualenvs in two separate terminal windows and start training in your GPU-supported environment and evaluation in your CPU-supported environment.
Good luck!!!

Tensorflow crashes with CUBLAS_STATUS_ALLOC_FAILED

I'm running tensorflow-gpu on Windows 10 using a simple MNIST neural network program. When it tries to run, it encounters a CUBLAS_STATUS_ALLOC_FAILED error. A Google search doesn't turn up anything.
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 970
major: 5 minor: 2 memoryClockRate (GHz) 1.253
pciBusID 0000:0f:00.0
Total memory: 4.00GiB
Free memory: 3.31GiB
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:906] DMA: 0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:916] 0: Y
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 970, pci bus id: 0000:0f:00.0)
E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\cuda\cuda_blas.cc:372] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\stream.cc:1390] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1021, in _do_call
return fn(*args)
File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1003, in _run_fn
status, run_metadata)
File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python35\lib\contextlib.py", line 66, in __exit__
next(self.gen)
File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 469, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Blas SGEMM launch failed : a.shape=(100, 784), b.shape=(784, 256), m=100, n=256, k=784
[[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](_recv_Placeholder_0/_7, Variable/read)]]
[[Node: Mean/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_35_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
For TensorFlow 2.2 none of the other answers worked when the CUBLAS_STATUS_ALLOC_FAILED problem was encountered. Found a solution on https://www.tensorflow.org/guide/gpu:
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
I ran this code before any other computation and found that the same code that had produced the CUBLAS error now worked in the same session. The sample above enables memory growth on every physical GPU, so TensorFlow allocates GPU memory as it is needed instead of reserving it all up front, and that is what resolved the error for me.
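The TensorFlow GPU guide also describes an environment variable, TF_FORCE_GPU_ALLOW_GROWTH, that requests on-demand GPU memory allocation without touching the model code. A minimal sketch, assuming it runs before TensorFlow initializes the GPU; the helper name request_memory_growth is hypothetical:

```python
import os


def request_memory_growth():
    """Ask TensorFlow to grow GPU memory on demand via the environment."""
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return os.environ["TF_FORCE_GPU_ALLOW_GROWTH"]


request_memory_growth()
# Import TensorFlow only after this point, for example:
# import tensorflow as tf
```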
The location of the "allow_growth" property of the session config seems to be different now. It's explained here: https://www.tensorflow.org/tutorials/using_gpu
So currently you'd have to set it like this:
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, ...)
tensorflow>=2.0
import tensorflow as tf
config = tf.compat.v1.ConfigProto(
    gpu_options=tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.8)
    # device_count = {'GPU': 1}
)
config.gpu_options.allow_growth = True
session = tf.compat.v1.Session(config=config)
tf.compat.v1.keras.backend.set_session(session)
I found this solution works
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
    # device_count = {'GPU': 1}
)
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
set_session(session)
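To get a feel for what per_process_gpu_memory_fraction=0.8 means on the card in the question, the arithmetic can be sketched as below. This assumes the fraction is taken relative to the device's total memory (4.00 GiB in the question's log); the function name fraction_to_mib is made up for illustration.

```python
def fraction_to_mib(total_gib, fraction):
    """Convert a fraction of a GPU's memory into MiB (1 GiB = 1024 MiB)."""
    return total_gib * 1024 * fraction


total_gib = 4.00  # "Total memory" reported for the GTX 970 in the question
free_gib = 3.31   # "Free memory" reported in the same log

requested = fraction_to_mib(total_gib, 0.8)
available = free_gib * 1024

print(f"requested {requested:.0f} MiB, free {available:.0f} MiB")
print("fits" if requested <= available else "does not fit")
```

If the requested amount were larger than the free memory, lowering the fraction would be the first thing to try.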
On Windows, TensorFlow currently does not allocate all available memory the way the documentation says it does. You can work around this error by allowing dynamic memory growth as follows (allow_growth is a field of gpu_options, not of ConfigProto itself):
tf.Session(config=tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True)))
None of these fixes worked for me, as it seems the structure of the TensorFlow libraries has changed. For TensorFlow 2.0, the only fix that worked for me was the one under "Limiting GPU memory growth" on this page: https://www.tensorflow.org/guide/gpu
For completeness and future-proofing, here's the solution from the docs - I imagine changing memory_limit may be necessary for some people - 1 GB was fine for my case.
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
Tensorflow 2.0 alpha
Allowing GPU memory growth may fix this issue. For TensorFlow 2.0 alpha / nightly there are two methods you can try to achieve this.
1.)
import tensorflow as tf
tf.config.gpu.set_per_process_memory_growth()
2.)
import tensorflow as tf
# Adjust this to the fraction of VRAM you want to give to TensorFlow.
tf.config.gpu.set_per_process_memory_fraction(0.4)
I suggest you try both, and see if it helps.
Source: https://www.tensorflow.org/alpha/guide/using_gpu
for keras:
from keras.backend.tensorflow_backend import set_session
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
set_session(session)
In my case, a stale python process was consuming memory. I killed it through task manager, and things are back to normal.
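On a machine with NVIDIA's nvidia-smi tool on the PATH, leftover processes holding GPU memory can be listed before reaching for the task manager. A minimal sketch; the function names are hypothetical, and the parser assumes the CSV output of --query-compute-apps with the three fields shown:

```python
import shutil
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' CSV rows into dicts."""
    apps = []
    for line in csv_text.splitlines():
        if line.count(",") < 2:
            continue  # blank line or a message such as "No devices were found"
        # The pid is the first field and used_memory the last, so a process
        # path that itself contains commas stays in one piece.
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        try:
            apps.append({"pid": int(pid),
                         "name": name.strip(),
                         "used_memory": used.strip()})
        except ValueError:
            continue  # first field was not a numeric pid
    return apps


def list_gpu_processes():
    """Return processes using the GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return parse_compute_apps(result.stdout)


for app in list_gpu_processes():
    print(app["pid"], app["name"], app["used_memory"])
```

A stale python process from an earlier run would show up in this list with its pid, which can then be ended from the task manager as described above.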
A bit late to the party but this resolves my issue with tensorflow 2.4.0 and a gtx 980ti.
Before limiting the memory I got an error like:
CUBLAS_STATUS_ALLOC_FAILED
My solution was this piece of code:
import tensorflow as tf
gpus = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_virtual_device_configuration(
    gpus[0],
    [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])
I found the solution here: https://www.tensorflow.org/guide/gpu