Training a segmentation model on 4 GPUs: all are working, but one fills up and I get "CUDA error: out of memory"

I'm trying to build a segmentation model, and I keep getting
"CUDA error: out of memory". After investigating, I realized that all 4 GPUs are working, but one of them keeps filling up.
Some technical details:
My Model:
The model is written in PyTorch and has 3.8M parameters.
My Hardware:
I have 4 GPUs (Titan V) with 12 GB of RAM each.
I'm trying to understand why one of my GPUs is filling up, and what I am doing wrong.
Evidence:
As can be seen in the screenshot below, all the GPUs are working, but one of them just keeps filling up until it hits its limit.
Code:
I'll try to explain what I did in the code:
First my model:
model = model.cuda()
model = nn.DataParallel(model, device_ids=None)
Second, Inputs and targets:
inputs = inputs.to('cuda')
masks = masks.to('cuda')
Those are the lines that interact with the GPUs. If I missed something and you need anything else, please let me know.
I feel like I'm missing something so basic that it will affect not only this model but also future models. I'd be more than happy for some help.
Thanks a lot!

Without knowing much of the details, I can say the following:
nvidia-smi is not the most reliable and up-to-date measurement mechanism.
The PyTorch GPU allocator does not help either: it caches blocks of memory, artificially inflating the reported usage (though that is not the issue here).
I believe there is still a "master" GPU, the one the data is loaded onto directly (and then broadcast to the other GPUs in DataParallel).
I don't know enough about PyTorch to answer reliably, but you can definitely check whether a single-GPU setup works with the batch size divided by 4, and perhaps whether you can load the model plus one batch at once (without processing it).
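If it helps, here is a minimal sketch of that single-GPU sanity check (my own illustration, reusing the model and inputs names from the question): run one forward pass on a single device with the batch size divided by 4, then read back PyTorch's allocator counters.
import torch

device = torch.device('cuda:0')
model = model.to(device)                                # the segmentation model from the question, before wrapping it in DataParallel
small_batch = inputs[:inputs.size(0) // 4].to(device)   # a quarter of the original batch

with torch.no_grad():
    _ = model(small_batch)                              # forward pass only: model + one batch in memory

print(torch.cuda.memory_allocated(device) / 1024**2, 'MB currently allocated')
print(torch.cuda.max_memory_allocated(device) / 1024**2, 'MB peak')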

Related

Can TF alone train a dataset with more than 20 million rows?

I want to train a model in TF using a dataset of more than 20 million rows. Are there any limitations or errors that may come up when doing this? Are there any methods or techniques I could try to do this effectively? The problem is a simple classification one, but I've never trained with such a large dataset. Any advice would be helpful. Thanks
TensorFlow can handle petabytes of data spread across tens of thousands of GPUs - the question is whether your code manages resources properly and whether your hardware can handle it. This is called distributed training. The topic is very broad, but you can get started by setting up a GPU, which includes installing CUDA and cuDNN. You can also refer to input data pipeline optimization.
I suggest handling all your installs via Anaconda 3, as it handles package compatibility - here's a guide or two to get started.
Lastly, your main hardware constraints are RAM and GPU memory: the former limits the maximum array size the model can process (e.g. 8 GB), and the latter limits the maximum model size the GPU can fit.
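On the input-pipeline point above, a minimal tf.data sketch for a dataset that does not fit in memory might look like the following. The shard pattern, column names, and batch size are placeholders, and it assumes all-numeric feature columns plus a label column named 'label'.
import tensorflow as tf

dataset = tf.data.experimental.make_csv_dataset(
    'train-*.csv',                 # hypothetical shard pattern; rows are streamed, not loaded into RAM
    batch_size=1024,
    label_name='label',
    num_epochs=1,
    shuffle=True,
    shuffle_buffer_size=100_000)

def pack(features, label):
    # make_csv_dataset yields a dict of columns; stack them into one feature tensor
    return tf.stack(list(features.values()), axis=-1), label

dataset = dataset.map(pack).prefetch(tf.data.experimental.AUTOTUNE)

# model = tf.keras.Sequential([...])   # your classifier
# model.fit(dataset, epochs=10)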

Tensorflow inference run time high on first data point, decreases on subsequent data points

I am running inference using one of the models from TensorFlow's object detection module. I'm looping over my test images in the same session and calling sess.run(). However, when profiling these runs, I noticed the first run always takes much longer than the subsequent runs.
I found an answer here as to why that happens, but there was no solution on how to fix it.
I'm deploying the object detection inference pipeline on an Intel i7 CPU. The time for one session.run() for the 1st, 2nd, 3rd, and 4th image looks something like this (in seconds):
1. 84.7132628
2. 1.495621681
3. 1.505012751
4. 1.501652718
Just some background on what I have tried:
I tried the TFRecords approach TensorFlow gives as a sample here. I hoped it would work better because it doesn't use a feed_dict, but since more I/O operations are involved, I'm not sure it's ideal. I tried making it work without writing to disk, but I always got an error regarding the encoding of the image.
I tried using TensorFlow datasets to feed the data, but I wasn't sure how to provide the input, since during inference I need to provide input for the "image tensor" key in the graph. Any ideas on how to use this to provide input to a frozen graph?
Any help will be greatly appreciated!
TL;DR: I'm looking to reduce the inference run time for the first image, for deployment purposes.
Even though I have seen the first inference take longer, the difference shown there (84 vs 1.5 seconds) seems a bit hard to believe. Are you also counting the time to load the model in this metric? Could that account for the large time difference? Is the topology complex enough to justify it?
My suggestions:
Try OpenVINO: see if the topology you are working on is supported in OpenVINO. OpenVINO is known to speed up inference workloads by a substantial amount due to its ability to optimize network operations. Also, the time taken to load an OpenVINO model is comparatively lower in most cases.
Regarding the TFRecords approach, could you please share the exact error and the stage at which you got it?
Regarding TensorFlow datasets, could you please check out https://github.com/tensorflow/tensorflow/issues/23523 and https://www.tensorflow.org/guide/datasets. Regarding the "image tensor" key in the graph, I hope your original inference pipeline gives you some clues.
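Since the thread never shows it, here is one hedged sketch of wiring a tf.data pipeline into a frozen graph without a feed_dict, by remapping the graph's input placeholder via input_map. It assumes TF 1.x and the usual object-detection tensor names (image_tensor:0, detection_boxes:0, etc.); the file names and graph path are placeholders.
import tensorflow as tf

def load_and_decode(path):
    img = tf.image.decode_jpeg(tf.read_file(path), channels=3)
    return tf.expand_dims(img, 0)          # image_tensor expects a batch dimension

paths = tf.data.Dataset.from_tensor_slices(['img1.jpg', 'img2.jpg'])   # placeholder file list
next_image = paths.map(load_and_decode).make_one_shot_iterator().get_next()

graph_def = tf.GraphDef()
with tf.gfile.GFile('frozen_inference_graph.pb', 'rb') as f:           # placeholder path
    graph_def.ParseFromString(f.read())

# Remap the frozen graph's image_tensor placeholder to the dataset output,
# so no feed_dict is needed at inference time.
outputs = tf.import_graph_def(
    graph_def,
    input_map={'image_tensor:0': next_image},
    return_elements=['detection_boxes:0', 'detection_scores:0', 'detection_classes:0'],
    name='')

with tf.Session() as sess:
    while True:
        try:
            boxes, scores, classes = sess.run(outputs)
        except tf.errors.OutOfRangeError:
            break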

Tensorflow fails to run on GPU from time to time

Solved this problem myself. It was because there were too many images in the CelebA dataset and my data loader was so inefficient that loading the data took too much time and caused the low speed.
But still, this does not explain why the code was running on the CPU while the GPU memory was also taken up. In the end I just switched to PyTorch.
My environment: Windows 10, CUDA 9.0, cuDNN 7.0.5, tensorflow-gpu 1.8.0.
I am working on a CycleGAN model. At first it worked fine with my toy dataset and could run on the GPU without major problems (though the first 10 iterations took an extremely long time, which suggests it might have been running on the CPU).
I later tried the CelebA dataset and only changed the folder name to load the data (I loaded the data into memory all at once, then used my own next_batch function and feed_dict to train the model). Then the problem arose: the GPU memory was still taken according to GPU-Z, but the GPU load was low (less than 10%) and the training speed was very slow (more than 10 times slower than normal), which suggests the code was running on the CPU.
Would anyone please give me some advice? Any help is appreciated, thanks.
What batch size were you trying? If it's too low (something like 2-8) for a small model, not much memory will be consumed. It all depends on your batch size, the number of parameters in your model, the model architecture, and how much of the model can be run in parallel. Maybe try increasing your batch size and re-running it?
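As an extra sanity check for the "is it actually running on the GPU?" part, device placement logging in TF 1.x prints the device chosen for every op (a generic sketch, not tied to the CycleGAN code above):
import tensorflow as tf

config = tf.ConfigProto(log_device_placement=True)   # log the device assigned to each op
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # ... run one training step and check the console for "/device:GPU:0" next to the heavy ops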

Very weird behaviour when running the same deep learning code on two different GPUs

I am training networks using the PyTorch framework. I had a K40 GPU in my computer. Last week, I added a 1080 to the same computer.
In my first experiment, I observed identical results on both GPUs. Then I tried my second code on both GPUs. In this case, I "constantly" got good results on the K40 while getting "constantly" awful results on the 1080 for "exactly the same code".
At first, I thought the only reason for such divergent outputs would be the random seeds in the code. So I fixed the seeds like this:
torch.manual_seed(3)
torch.cuda.manual_seed_all(3)
numpy.random.seed(3)
But this did not solve the issue. I believe the issue cannot be randomness, because I was "constantly" getting good results on the K40 and "constantly" getting bad results on the 1080. Moreover, I tried exactly the same code on 2 other computers and 4 other 1080 GPUs and always achieved good results. So the problem has to be the 1080 I recently plugged in.
I suspect the problem might be the driver, or the way I installed PyTorch. But it is still weird that I only get bad results for "some" of the experiments; for the other experiments, I get identical results.
Can anyone help me on this?
Q: Can you please tell us what type of experiment this is, and what NN architecture you use?
In the tips below, I will assume you are running a straight backpropagation neural net.
You say the learning in your test experiment is "unstable"? Training of a NN should not be "unstable". When it is, different processors can end up with different outcomes, influenced by numeric precision and rounding errors. Saturation could have occurred: check whether your weight values have become too large. In that case, 1) check whether your training input and output are logically consistent, and 2) add more neurons to the hidden layers and train again.
It is a good idea to check the random() calls, but take into account that in a backprop NN there are several places where random() functions can be used. Some backprop NNs also add dynamic noise to the training patterns to prevent early saturation of the weights. When this training noise is scaled wrong, you can get bizarre results. When the noise is not added or is too small, you can end up with saturation.
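For the weight-magnitude check above, a short PyTorch sketch (my own illustration, assuming the trained network from the question is called model):
import torch

# Print the largest absolute weight per parameter tensor; very large values hint at saturation
for name, param in model.named_parameters():
    print(name, param.detach().abs().max().item())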
I had the same problem. I solved it by simply changing sum to torch.sum. Please try changing all the built-in functions to their torch equivalents so they run on the GPU.
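For illustration, the kind of change this answer describes might look like the following (x is a hypothetical tensor; the built-in sum iterates over the tensor element by element, while torch.sum runs a single reduction kernel on the device):
import torch

x = torch.rand(1000, device='cuda')   # hypothetical tensor

s_slow = sum(x)          # Python built-in: element-by-element additions
s_fast = torch.sum(x)    # single fused reduction on the GPU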

TensorFlow: How to measure how much GPU memory each tensor takes?

I'm currently implementing YOLO in TensorFlow and I'm a little surprised at how much memory it is taking. On my GPU I can train YOLO using their Darknet framework with batch size 64. In TensorFlow I can only do it with batch size 6; with 8 I already run out of memory. For the test phase I can run with batch size 64 without running out of memory.
I am wondering how I can calculate how much memory is consumed by each tensor. Are all tensors saved on the GPU by default? Can I simply calculate the total memory consumption as shape * 32 bits?
I noticed that since I'm using momentum, all my tensors also have a /Momentum tensor. Could that also be using a lot of memory?
I am augmenting my dataset with a method distorted_inputs, very similar to the one defined in the CIFAR-10 tutorial. Could it be that this part is occupying a huge chunk of memory? I believe Darknet does these modifications on the CPU.
Now that 1258 has been closed, you can enable memory logging in Python by setting an environment variable before importing TensorFlow:
import os
os.environ['TF_CPP_MIN_VLOG_LEVEL']='3'
import tensorflow as tf
There will be a lot of logging as a result of this. You'll want to grep the results to find the appropriate lines. For example:
grep MemoryLogTensorAllocation train.log
Sorry for the slow reply. Unfortunately right now the only way to set the log level is to edit tensorflow/core/platform/logging.h and recompile with e.g.
#define VLOG_IS_ON(lvl) ((lvl) <= 1)
There is an open bug (1258) to control logging more elegantly.
MemoryLogTensorOutput entries are logged at the end of each Op execution, and indicate the tensors that hold the outputs of the Op. It's useful to know these tensors since the memory is not released until the downstream Op consumes the tensors, which may be much later on in a large graph.
See the description in this commit.
The raw memory allocation info is there, although it needs a script to collect the information in an easy-to-read form.
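For the "shape * 32 bits" estimate asked about in the question, a rough TF 1.x sketch over the graph's variables is shown below. This covers only variables (including the /Momentum slot copies), not activations or workspace memory, so it is a lower bound rather than a full accounting.
import numpy as np
import tensorflow as tf

total_bytes = 0
for v in tf.global_variables():                        # includes the .../Momentum slot variables
    n_elements = np.prod(v.shape.as_list())
    n_bytes = n_elements * v.dtype.base_dtype.size     # size in bytes per element (4 for float32)
    total_bytes += n_bytes
    print(v.name, v.shape, n_bytes, 'bytes')
print('total variable memory:', total_bytes / 1024.0**2, 'MB')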