GPU memory not released after closing a TensorFlow session

I have the issue that my GPU memory is not released after closing a TensorFlow session in Python. These three lines suffice to cause the problem:
import tensorflow as tf
sess=tf.Session()
sess.close()
After the third line the memory is not released. I have been up and down many forums and tried all sorts of suggestions, but nothing has worked for me. For details please also see my comment at the bottom here:
https://github.com/tensorflow/tensorflow/issues/19731
Here I have documented the ways in which I manage to kill the process and thus release the memory, but this is not useful for long-running and automated processes. I would very much appreciate any further suggestions to try. I am using Windows.
EDIT: I have now found a solution that at least allows me to do what I am trying to do. I am still NOT able to release the memory, but I am able to 'reuse' it. The code has this structure:
import tensorflow as tf
from keras import backend as K
cfg = K.tf.ConfigProto()
#cfg.gpu_options.allow_growth = True  # this is optional
cfg.gpu_options.per_process_gpu_memory_fraction = 0.8  # you can use any fraction here
# upload your data and define your model (2 layers in this case) here
for i in range(len(neuron1)):
    for j in range(len(neuron2)):
        K.set_session(K.tf.Session(config=cfg))
        # train your NN for i, j
The first time the script enters the loop the GPU memory is still allocated (80% in the above example) and thus cluttered, but this code nonetheless seems to reuse the same memory somehow. I reckon the K.set_session(K.tf.Session(config=cfg)) somehow destroys or resets the old session, allowing the memory to be 'reused' within this context at least. Note that I am not using sess.close() or K.clear_session() or resetting the default graph explicitly. Actually releasing the memory still does not work for me: when the loops are done, the GPU memory is still full.

Refer to this discussion. You can reuse your allocated memory but if you want to free the memory, then you would have to exit the Python interpreter itself.
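One way to get the effect of exiting the interpreter without killing your whole script is to run the TensorFlow work in a short-lived child process, so the GPU memory is freed when the child exits. A minimal sketch of that idea (train_model and the config values are placeholders, not from the question):
import multiprocessing as mp

def train_model(config):
    # import TF inside the child so the CUDA context lives only in this process
    import tensorflow as tf
    sess = tf.Session()
    # ... build and train your model here ...
    sess.close()

if __name__ == '__main__':  # required for multiprocessing on Windows
    for config in [{'lr': 0.01}, {'lr': 0.001}]:
        p = mp.Process(target=train_model, args=(config,))
        p.start()
        p.join()  # GPU memory is released when the child process exits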

If I'm understanding correctly, it should be as simple as:
from numba import cuda
cuda.close()
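For completeness, a slightly fuller sketch using numba. Note that this destroys the CUDA context for the selected device, so TensorFlow in the same process generally cannot use that GPU again afterwards without restarting:
from numba import cuda

cuda.select_device(0)  # pick the GPU whose memory you want to release
cuda.close()           # destroy the CUDA context and free its allocations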

Related

Why does memory usage not go down even after stopping my tensorflow object detection script?

I am training an object detection pipeline developed with the TensorFlow library. My problem is that even after stopping the script, memory usage stays really high and does not go down. Can somebody recommend a remedy for this problem?
I am using TensorFlow 2.6 and the TensorFlow Object Detection API to train on my data.
Even after I re-ran my script (model_main_tf2) after stopping the older runs, those older processes (with the same name as model_main_tf2) are still consuming a lot of memory, as could be seen in the attached screenshot.
try running:
tf.keras.backend.clear_session()
after you are finished running a model
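If you train several models in one process, a minimal sketch of that pattern (the tiny Dense model here is just a stand-in for your detection model) would be:
import gc
import tensorflow as tf

def build_model():
    # stand-in model; replace with your object detection model
    return tf.keras.Sequential([tf.keras.layers.Dense(8, input_shape=(4,))])

for run in range(3):
    model = build_model()
    model.compile(optimizer='adam', loss='mse')
    # ... model.fit(...) with your data goes here ...
    del model
    tf.keras.backend.clear_session()  # drop the Keras global graph/session state
    gc.collect()                      # encourage Python to free lingering references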

Training a segmentation model on 4 GPUs: all are working, one fills up, and I get "CUDA error: out of memory"

I'm trying to build a segmentation model, and I keep getting
"CUDA error: out of memory". After investigating, I realized that all 4 GPUs are working, but one of them keeps filling up.
Some technical details:
My Model:
The model is written in PyTorch and has 3.8M parameters.
My Hardware:
I have 4 GPUs (Titan V) with 12 GB of RAM each.
I'm trying to understand why one of my GPUs is filling up, and what I am doing wrong.
Evidence:
As can be seen in the screenshot, all the GPUs are working, but one of them just keeps filling up until it hits its limit.
Code:
I'll try to explain what I did in the code:
First my model:
model = model.cuda()
model = nn.DataParallel(model, device_ids=None)
Second, Inputs and targets:
inputs = inputs.to('cuda')
masks = masks.to('cuda')
Those are the lines that work with the GPUs; if I missed something and you need anything else, please share.
I feel like I'm missing something so basic that it will affect not only this model but also future models, so I'll be more than happy for some help.
Thanks a lot!
Without knowing much of the details, I can say the following:
nvidia-smi is not the most reliable and up-to-date measurement mechanism.
The PyTorch GPU allocator does not help either: it caches blocks of memory, artificially inflating the reported usage (though that is not the issue here).
I believe there is still a "master" GPU, which is the one data is loaded to directly (and then broadcast to the other GPUs in DataParallel).
I don't know enough about PyTorch to answer reliably, but you can definitely check whether a single-GPU setup works with the batch size divided by 4, and perhaps whether you can load the model plus one batch at once (without processing it).
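To illustrate the "master GPU" point: with nn.DataParallel, the outputs of all replicas are gathered back on device_ids[0], and if the loss is also computed there, that card fills up faster than the others. One common mitigation (a sketch, not the asker's code; the loss function and names are illustrative) is to compute the loss inside the wrapped module so it is split across GPUs as well:
import torch
import torch.nn as nn

class ModelWithLoss(nn.Module):
    # wraps the segmentation model so the loss is computed on each replica,
    # instead of gathering full-size outputs on the master GPU
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.criterion = nn.BCEWithLogitsLoss()

    def forward(self, inputs, masks):
        outputs = self.model(inputs)
        return self.criterion(outputs, masks)

model = ModelWithLoss(base_model).cuda()  # base_model: your segmentation net
model = nn.DataParallel(model)            # each replica now returns its own loss
loss = model(inputs, masks).mean()        # average the per-replica losses
loss.backward()
Another option is torch.nn.parallel.DistributedDataParallel, which tends to balance memory and speed better than DataParallel.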

keras + scikit-learn wrapper, appears to hang when GridSearchCV with n_jobs >1

UPDATE: I have to re-write this question as after some investigation I realise that this is a different problem.
Context: running Keras in a grid search setting using the KerasClassifier wrapper with scikit-learn. System: Ubuntu 16.04; libraries: Anaconda distribution 5.1, Keras 2.0.9, scikit-learn 0.19.1, TensorFlow 1.3.0 or Theano 0.9.0, using CPUs only.
Code:
I simply used the code here for testing: https://machinelearningmastery.com/use-keras-deep-learning-models-scikit-learn-python/, the second example 'Grid Search Deep Learning Model Parameters'. Pay attention to line 35, which reads:
grid = GridSearchCV(estimator=model, param_grid=param_grid)
Symptoms: when grid search uses more than 1 job (meaning CPUs?), e.g. setting 'n_jobs' on the above line to '2', as in the line below:
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=2)
will cause the code to hang indefinitely, with either TensorFlow or Theano, and there is no CPU usage (in the attached screenshot, 5 Python processes were created but none is using the CPU).
By debugging, it appears to be the following line in 'sklearn.model_selection._search' that causes the problem:
line 648: for parameters, (train, test) in product(candidate_params,
                                                    cv.split(X, y, groups)))
on which the program hangs and cannot continue.
I would really appreciate some insights as to what this means and why this could happen.
Thanks in advance
Are you using a GPU? If so, you can't have multiple threads running each variation of the params because they won't be able to share the GPU.
Here's a full example on how to use keras, sklearn wrappers in a Pipeline with GridsearchCV: Pipeline with a Keras Model
If you really want to have multiple jobs in the GridSearchCV, you can try to limit the GPU fraction used by each job (e.g. if each job only allocates 0.5 of the available GPU memory, you can run 2 jobs simultaneously)
See these issues:
Limit the resource usage for tensorflow backend
GPU memory fraction does not work in keras 2.0.9 but it works in 2.0.8
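A minimal sketch of the GPU-fraction idea for the TF 1.x / Keras 2.x versions in the question; call this inside the model-creation function so each worker caps its own session (the 0.5 fraction is just an example):
import tensorflow as tf
from keras import backend as K

def limit_gpu_memory(fraction=0.5):
    # cap this process's GPU allocation so two jobs can share one card
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = fraction
    K.set_session(tf.Session(config=config))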
I dealt with this problem too, and it really slowed me down not being able to run what is essentially trivially-parallelizable code. The issue is indeed with the TensorFlow session. If a session is created in the parent process before GridSearchCV.fit(), it will hang!
The solution for me was to keep all session/graph creation code restricted to the KerasClassifier class and the model creation function I passed to it.
Also, what Felipe said about the memory is true: you will want to restrict the memory usage of TF in either the model creation function or a subclass of KerasClassifier.
Related info:
Session hang issue with python multiprocessing
Keras + Tensorflow and Multiprocessing in Python
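As a sketch of the structure described above (the layer sizes and hyperparameters are illustrative): keep all Keras/TF graph construction inside the function handed to KerasClassifier, and create nothing TF-related at module level in the parent process before calling fit:
from sklearn.model_selection import GridSearchCV
from keras.wrappers.scikit_learn import KerasClassifier

def create_model(neurons=8):
    # all Keras/TF work happens here, inside the worker
    from keras.models import Sequential
    from keras.layers import Dense
    model = Sequential()
    model.add(Dense(neurons, input_dim=8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model = KerasClassifier(build_fn=create_model, epochs=10, batch_size=10, verbose=0)
param_grid = {'neurons': [4, 8, 16]}
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=2)
# grid.fit(X, y)  # X, y: your training data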
TL;DR Answer: You can't because your Keras model can't be serialized, and serialization is needed for parallelizing in Python with joblib.
This problem is much detailed here: https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-you-can-t-parallelize-nor-save-pipelines-using-steps-that-can-t-be-serialized-as-is-by-joblib
The solution to parallelize your code is to make your Keras estimator serializable. This can be done using savers as described at the link above.
If you're lucky enough to be using TensorFlow v2's prebuilt Keras module, the following practical code sample will prove useful to you, as you'd practically just need to take the code and adapt it to yours:
https://github.com/guillaume-chevalier/seq2seq-signal-prediction
In this example, all the saving and loading code is pre-written for you using Neuraxle-TensorFlow, and this makes it parallelizable if you use Neuraxle's AutoML methods (e.g., Neuraxle's grid search and Neuraxle's own parallelism features).

tensorflow hangs before initializing global variables in joblib

I have a multi-layer CNN in TensorFlow, running on the CPU.
I'm using the Parallel and delayed functions in joblib to learn multiple instances of my CNN, trained on the same set of data.
When I try to run this, the program will hang after a joblib worker starts its tf.Session(), but before any tensorflow variables are initialized, and before I get any output from the verbose argument of the Parallel function.
I don't really know why this would happen. So I am looking for general debugging strategies from other people who may have combined tensorflow and joblib.
I was able to get the program to work by changing the backend option of
Parallel to "threading". Apparently, the "multiprocessing" option was creating too much communication and memory overhead when exchanging input and output data.

TensorFlow: How to measure how much GPU memory each tensor takes?

I'm currently implementing YOLO in TensorFlow and I'm a little surprised at how much memory it is taking. On my GPU I can train YOLO using their Darknet framework with batch size 64. In TensorFlow I can only do it with batch size 6; with 8 I already run out of memory. For the test phase I can run with batch size 64 without running out of memory.
I am wondering how I can calculate how much memory is being consumed by each tensor. Are all tensors stored on the GPU by default? Can I simply calculate the total memory consumption as shape * 32 bits?
I noticed that since I'm using momentum, all my tensors also have a /Momentum tensor. Could that also be using a lot of memory?
I am augmenting my dataset with a method distorted_inputs, very similar to the one defined in the CIFAR-10 tutorial. Could it be that this part is occupying a huge chunk of memory? I believe Darknet does the modifications in the CPU.
Now that issue 1258 has been closed, you can enable memory logging in Python by setting an environment variable before importing TensorFlow:
import os
os.environ['TF_CPP_MIN_VLOG_LEVEL']='3'
import tensorflow as tf
There will be a lot of logging as a result of this. You'll want to grep the results to find the appropriate lines. For example:
grep MemoryLogTensorAllocation train.log
Sorry for the slow reply. Unfortunately right now the only way to set the log level is to edit tensorflow/core/platform/logging.h and recompile with e.g.
#define VLOG_IS_ON(lvl) ((lvl) <= 1)
There is an open bug (1258) to control logging more elegantly.
MemoryLogTensorOutput entries are logged at the end of each Op execution, and indicate the tensors that hold the outputs of the Op. It's useful to know these tensors since the memory is not released until the downstream Op consumes the tensors, which may be much later on in a large graph.
See the description in this commit.
The raw memory allocation info is there, although it needs a script to collect the information in an easy-to-read form.
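As a rough starting point for such a script (this assumes the logged protos contain an 'allocated_bytes:' field; check your own log output for the exact field names):
import re
import sys

# sum the allocated_bytes values reported on MemoryLogTensorAllocation lines
total = 0
pattern = re.compile(r'allocated_bytes:\s*(\d+)')
with open(sys.argv[1]) as f:  # e.g. python sum_alloc.py train.log
    for line in f:
        if 'MemoryLogTensorAllocation' in line:
            m = pattern.search(line)
            if m:
                total += int(m.group(1))
print('total allocated: %.2f MB' % (total / 1e6))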