Puzzled by OOM Error on GPU when using 15595MiB / 16125MiB - gpu

I am using Tensorflow 2.X.
My GPU memory is 16125MiB, but my model requires 15595MiB, according to nvidia-smi
With this total usage, I get an OOM after some time, even when setting the minimum batch size.
I also tried the following, but as soon as more memory is required, I still go out of memory:
config = tf.compat.v1.ConfigProto()
#config.gpu_options.per_process_gpu_memory_fraction = 0.8
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)
The total memory value should also include the (150MiB) from firefox, which I need to use once in a while, but I doubt killing it would make much difference.
My dataset is loaded through individual .h5 files (one per data sample), the intermediate computation arrays/tensors are deleted when not used.
Unfortunately, I cannot resize the model for several reasons due to an ongoing publication.
Is there any trick I can use to gain that extra 500MiB, to train my model safely, without incurring into OOM half-way?

Related

GPU Memory Spiking in Keras [duplicate]

I work in an environment in which computational resources are shared, i.e., we have a few server machines equipped with a few Nvidia Titan X GPUs each.
For small to moderate size models, the 12 GB of the Titan X is usually enough for 2–3 people to run training concurrently on the same GPU. If the models are small enough that a single model does not take full advantage of all the computational units of the GPU, this can actually result in a speedup compared with running one training process after the other. Even in cases where the concurrent access to the GPU does slow down the individual training time, it is still nice to have the flexibility of having multiple users simultaneously train on the GPU.
The problem with TensorFlow is that, by default, it allocates the full amount of available GPU memory when it is launched. Even for a small two-layer neural network, I see that all 12 GB of the GPU memory is used up.
Is there a way to make TensorFlow only allocate, say, 4 GB of GPU memory, if one knows that this is enough for a given model?
You can set the fraction of GPU memory to be allocated when you construct a tf.Session by passing a tf.GPUOptions as part of the optional config argument:
# Assume that you have 12GB of GPU memory and want to allocate ~4GB:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
The per_process_gpu_memory_fraction acts as a hard upper bound on the amount of GPU memory that will be used by the process on each GPU on the same machine. Currently, this fraction is applied uniformly to all of the GPUs on the same machine; there is no way to set this on a per-GPU basis.
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)
https://github.com/tensorflow/tensorflow/issues/1578
For TensorFlow 2.0 and 2.1 (docs):
import tensorflow as tf
tf.config.gpu.set_per_process_memory_growth(True)
For TensorFlow 2.2+ (docs):
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
tf.config.experimental.set_memory_growth(gpu, True)
The docs also list some more methods:
Set environment variable TF_FORCE_GPU_ALLOW_GROWTH to true.
Use tf.config.experimental.set_virtual_device_configuration to set a hard limit on a Virtual GPU device.
Here is an excerpt from the Book Deep Learning with TensorFlow
In some cases it is desirable for the process to only allocate a subset of the available memory, or to only grow the memory usage as it is needed by the process. TensorFlow provides two configuration options on the session to control this. The first is the allow_growth option, which attempts to allocate only as much GPU memory based on runtime allocations, it starts out allocating very little memory, and as sessions get run and more GPU memory is needed, we extend the GPU memory region needed by the TensorFlow process.
1) Allow growth: (more flexible)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, ...)
The second method is per_process_gpu_memory_fraction option, which determines the fraction of the overall amount of memory that each visible GPU should be allocated. Note: No release of memory needed, it can even worsen memory fragmentation when done.
2) Allocate fixed memory:
To only allocate 40% of the total memory of each GPU by:
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config, ...)
Note:
That's only useful though if you truly want to bind the amount of GPU memory available on the TensorFlow process.
For Tensorflow version 2.0 and 2.1 use the following snippet:
import tensorflow as tf
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(gpu_devices[0], True)
For prior versions , following snippet used to work for me:
import tensorflow as tf
tf_config=tf.ConfigProto()
tf_config.gpu_options.allow_growth=True
sess = tf.Session(config=tf_config)
All the answers above assume execution with a sess.run() call, which is becoming the exception rather than the rule in recent versions of TensorFlow.
When using the tf.Estimator framework (TensorFlow 1.4 and above) the way to pass the fraction along to the implicitly created MonitoredTrainingSession is,
opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
conf = tf.ConfigProto(gpu_options=opts)
trainingConfig = tf.estimator.RunConfig(session_config=conf, ...)
tf.estimator.Estimator(model_fn=...,
config=trainingConfig)
Similarly in Eager mode (TensorFlow 1.5 and above),
opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
conf = tf.ConfigProto(gpu_options=opts)
tfe.enable_eager_execution(config=conf)
Edit: 11-04-2018
As an example, if you are to use tf.contrib.gan.train, then you can use something similar to bellow:
tf.contrib.gan.gan_train(........, config=conf)
You can use
TF_FORCE_GPU_ALLOW_GROWTH=true
in your environment variables.
In tensorflow code:
bool GPUBFCAllocator::GetAllowGrowthValue(const GPUOptions& gpu_options) {
const char* force_allow_growth_string =
std::getenv("TF_FORCE_GPU_ALLOW_GROWTH");
if (force_allow_growth_string == nullptr) {
return gpu_options.allow_growth();
}
Tensorflow 2.0 Beta and (probably) beyond
The API changed again. It can be now found in:
tf.config.experimental.set_memory_growth(
device,
enable
)
Aliases:
tf.compat.v1.config.experimental.set_memory_growth
tf.compat.v2.config.experimental.set_memory_growth
References:
https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/config/experimental/set_memory_growth
https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth
See also:
Tensorflow - Use a GPU: https://www.tensorflow.org/guide/gpu
for Tensorflow 2.0 Alpha see: this answer
All the answers above refer to either setting the memory to a certain extent in TensorFlow 1.X versions or to allow memory growth in TensorFlow 2.X.
The method tf.config.experimental.set_memory_growth indeed works for allowing dynamic growth during the allocation/preprocessing. Nevertheless one may like to allocate from the start a specific-upper limit GPU memory.
The logic behind allocating a specific GPU memory would also be to prevent OOM memory during training sessions. For example, if one trains while opening video-memory consuming Chrome tabs/any other video consumption process, the tf.config.experimental.set_memory_growth(gpu, True) could result in OOM errors thrown, hence the necessity of allocating from the start more memory in certain cases.
The recommended and correct way in which to allot memory per GPU in TensorFlow 2.X is done in the following manner:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
# Restrict TensorFlow to only allocate 1GB of memory on the first GPU
try:
tf.config.experimental.set_virtual_device_configuration(
gpus[0],
[tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)]
Shameless plug: If you install the GPU supported Tensorflow, the session will first allocate all GPUs whether you set it to use only CPU or GPU. I may add my tip that even you set the graph to use CPU only you should set the same configuration(as answered above:) ) to prevent the unwanted GPU occupation.
And in an interactive interface like IPython and Jupyter, you should also set that configure, otherwise, it will allocate all memory and leave almost none for others. This is sometimes hard to notice.
If you're using Tensorflow 2 try the following:
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.compat.v1.Session(config=config)
For Tensorflow 2.0 this this solution worked for me. (TF-GPU 2.0, Windows 10, GeForce RTX 2070)
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
tf.config.experimental.set_memory_growth(physical_devices[0], True)
# allocate 60% of GPU memory
from keras.backend.tensorflow_backend import set_session
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.6
set_session(tf.Session(config=config))
this code has worked for me:
import tensorflow as tf
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.compat.v1.InteractiveSession(config=config)
Well I am new to tensorflow, I have Geforce 740m or something GPU with 2GB ram, I was running mnist handwritten kind of example for a native language with training data containing of 38700 images and 4300 testing images and was trying to get precision , recall , F1 using following code as sklearn was not giving me precise reults. once i added this to my existing code i started getting GPU errors.
TP = tf.count_nonzero(predicted * actual)
TN = tf.count_nonzero((predicted - 1) * (actual - 1))
FP = tf.count_nonzero(predicted * (actual - 1))
FN = tf.count_nonzero((predicted - 1) * actual)
prec = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * prec * recall / (prec + recall)
plus my model was heavy i guess, i was getting memory error after 147, 148 epochs, and then I thought why not create functions for the tasks so I dont know if it works this way in tensrorflow, but I thought if a local variable is used and when out of scope it may release memory and i defined the above elements for training and testing in modules, I was able to achieve 10000 epochs without any issues, I hope this will help..
i tried to train unet on voc data set but because of huge image size, memory finishes. i tried all the above tips, even tried with batch size==1, yet to no improvement. sometimes TensorFlow version also causes the memory issues. try by using
pip install tensorflow-gpu==1.8.0

Tensorflow ResNet model loading uses **~5 GB of RAM** - while loading from weights uses only ~200 MB

I trained a ResNet50 model using Tensorflow 2.0 by transfer learning. I slightly modified the architecture (new classification layer) and saved the model with the ModelCheckpoint callback https://keras.io/callbacks/#modelcheckpoint during training. Training was fine. The model saved by callback takes ~206 MB on the hard drive.
To predict using the model I did:
I started a Jupyter Lab notebook. I used my_model = tf.keras.models.load_model('../models_using/my_model.hdf5') for loading the model. (btw, the same occurs using IPython).
I used the free linux command line tool to measure the free RAM just before the loading and after. The model loading takes about 5 GB of RAM.
I saved the weights of the model and the config as json. This takes about 105 MB.
I loaded the model from the json config and weights. This takes about ~200 MB of RAM.
Compared the predictions of both models. Exactly the same.
I tested the same procedure with a slightly different architeture (trained the same way) and the results were the same.
Can anyone explain the huge RAM usage, and the difference in size of the models on the hard drive?
Btw, given a model in Keras, can you find out the compliation procedure ( optimizer,..)? Model.summary() does not help..
2019-12-07 - EDIT: Thanks to this answer, I conducted a series of tests:
I used the !free command in JupyterLab to measure the available memory before and after each test. Since I get_weights returns a list, I used copy.deepcopy to really copy the objects. Note, the commands below were separate Jupyter cells and the memory comments were added just for this answer.
!free
model = tf.keras.models.load_model('model.hdf5', compile=True)
# 25278624 - 21491888 = 3786.736 MB used
!free
weights = copy.deepcopy(model.get_weights())
# 21491888 - 21440272 = 51.616 MB used
!free
optimizer_weights = copy.deepcopy(model.optimizer.get_weights())
# 21440272 - 21339404 = 100.868 MB used
!free
model2 = tf.keras.models.load_model('model.hdf5', compile=False)
# 21339404 - 21140176 = 199.228 MB used
!free
Loading the model from json:
!free
# loading from json
with open('model_json.json') as f:
model_json_weights = tf.keras.models.model_from_json(f.read())
model_json_weights.load_weights('model_weights.h5')
!free
# 21132664 - 20971616 = 161.048 MB used
The difference between checkpoint and JSON+Weights is in the optimizer:
The checkpoint or model.save() save the optimizer and its weights (load_model compiles the model)
JSON + weights doesn't save the optimizer
Unless you are using a very simple optimizer, it's normal for it to have about the same number of weights as the model (a tensor of "momentum" for each weight tensor, for instance).
Some optimizers might take two times the size of the model, because it has two tensors of optimizer weights for each tensor of model weights.
Saving and loading the optimizer is important if you want to continue training. Starting training again with a new optimizer without proper weights will sort of destroy the model's performance (at least in the beginning).
Now, the 5GB is not really clear to me. But I suppose that:
There should be a lot of compression in saved weights
It might have to do with also allocating memory for all the gradient and backpropagation operations
Interesting tests:
Compression: check how much memory is used by the results of model.get_weights() and model.optimizer.get_weights(). These weights will be numpy, copied from the original tensors
Grandient/Backpropagation: check how much memory is used by:
load_model(name, compile=True)
load_model(name, compile=False)
I had a very similar issue with a recent model that I was attempting to load. All be it this is a year old issue, so I am uncertain if my solution will work for you. However, reading through the keras documentation of saved model.
I found this piece of code to be very useful:
physical_devices = tf.config.list_physical_devices('GPU')
for device in physical_devices:
tf.config.experimental.set_memory_growth(device, True)
Can anyone explain the huge RAM usage, and the difference in size of the models on the hard drive?
It turns out in my case the loaded model was using up all of the GPU memory and causing issues so this forces it to use physical device memory or at least that is my conclusion.

Tensorflow OOM error when allocating resources to GPU [duplicate]

I work in an environment in which computational resources are shared, i.e., we have a few server machines equipped with a few Nvidia Titan X GPUs each.
For small to moderate size models, the 12 GB of the Titan X is usually enough for 2–3 people to run training concurrently on the same GPU. If the models are small enough that a single model does not take full advantage of all the computational units of the GPU, this can actually result in a speedup compared with running one training process after the other. Even in cases where the concurrent access to the GPU does slow down the individual training time, it is still nice to have the flexibility of having multiple users simultaneously train on the GPU.
The problem with TensorFlow is that, by default, it allocates the full amount of available GPU memory when it is launched. Even for a small two-layer neural network, I see that all 12 GB of the GPU memory is used up.
Is there a way to make TensorFlow only allocate, say, 4 GB of GPU memory, if one knows that this is enough for a given model?
You can set the fraction of GPU memory to be allocated when you construct a tf.Session by passing a tf.GPUOptions as part of the optional config argument:
# Assume that you have 12GB of GPU memory and want to allocate ~4GB:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
The per_process_gpu_memory_fraction acts as a hard upper bound on the amount of GPU memory that will be used by the process on each GPU on the same machine. Currently, this fraction is applied uniformly to all of the GPUs on the same machine; there is no way to set this on a per-GPU basis.
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)
https://github.com/tensorflow/tensorflow/issues/1578
For TensorFlow 2.0 and 2.1 (docs):
import tensorflow as tf
tf.config.gpu.set_per_process_memory_growth(True)
For TensorFlow 2.2+ (docs):
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
tf.config.experimental.set_memory_growth(gpu, True)
The docs also list some more methods:
Set environment variable TF_FORCE_GPU_ALLOW_GROWTH to true.
Use tf.config.experimental.set_virtual_device_configuration to set a hard limit on a Virtual GPU device.
Here is an excerpt from the Book Deep Learning with TensorFlow
In some cases it is desirable for the process to only allocate a subset of the available memory, or to only grow the memory usage as it is needed by the process. TensorFlow provides two configuration options on the session to control this. The first is the allow_growth option, which attempts to allocate only as much GPU memory based on runtime allocations, it starts out allocating very little memory, and as sessions get run and more GPU memory is needed, we extend the GPU memory region needed by the TensorFlow process.
1) Allow growth: (more flexible)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, ...)
The second method is per_process_gpu_memory_fraction option, which determines the fraction of the overall amount of memory that each visible GPU should be allocated. Note: No release of memory needed, it can even worsen memory fragmentation when done.
2) Allocate fixed memory:
To only allocate 40% of the total memory of each GPU by:
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config, ...)
Note:
That's only useful though if you truly want to bind the amount of GPU memory available on the TensorFlow process.
For Tensorflow version 2.0 and 2.1 use the following snippet:
import tensorflow as tf
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(gpu_devices[0], True)
For prior versions , following snippet used to work for me:
import tensorflow as tf
tf_config=tf.ConfigProto()
tf_config.gpu_options.allow_growth=True
sess = tf.Session(config=tf_config)
All the answers above assume execution with a sess.run() call, which is becoming the exception rather than the rule in recent versions of TensorFlow.
When using the tf.Estimator framework (TensorFlow 1.4 and above) the way to pass the fraction along to the implicitly created MonitoredTrainingSession is,
opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
conf = tf.ConfigProto(gpu_options=opts)
trainingConfig = tf.estimator.RunConfig(session_config=conf, ...)
tf.estimator.Estimator(model_fn=...,
config=trainingConfig)
Similarly in Eager mode (TensorFlow 1.5 and above),
opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
conf = tf.ConfigProto(gpu_options=opts)
tfe.enable_eager_execution(config=conf)
Edit: 11-04-2018
As an example, if you are to use tf.contrib.gan.train, then you can use something similar to bellow:
tf.contrib.gan.gan_train(........, config=conf)
You can use
TF_FORCE_GPU_ALLOW_GROWTH=true
in your environment variables.
In tensorflow code:
bool GPUBFCAllocator::GetAllowGrowthValue(const GPUOptions& gpu_options) {
const char* force_allow_growth_string =
std::getenv("TF_FORCE_GPU_ALLOW_GROWTH");
if (force_allow_growth_string == nullptr) {
return gpu_options.allow_growth();
}
Tensorflow 2.0 Beta and (probably) beyond
The API changed again. It can be now found in:
tf.config.experimental.set_memory_growth(
device,
enable
)
Aliases:
tf.compat.v1.config.experimental.set_memory_growth
tf.compat.v2.config.experimental.set_memory_growth
References:
https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/config/experimental/set_memory_growth
https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth
See also:
Tensorflow - Use a GPU: https://www.tensorflow.org/guide/gpu
for Tensorflow 2.0 Alpha see: this answer
All the answers above refer to either setting the memory to a certain extent in TensorFlow 1.X versions or to allow memory growth in TensorFlow 2.X.
The method tf.config.experimental.set_memory_growth indeed works for allowing dynamic growth during the allocation/preprocessing. Nevertheless one may like to allocate from the start a specific-upper limit GPU memory.
The logic behind allocating a specific GPU memory would also be to prevent OOM memory during training sessions. For example, if one trains while opening video-memory consuming Chrome tabs/any other video consumption process, the tf.config.experimental.set_memory_growth(gpu, True) could result in OOM errors thrown, hence the necessity of allocating from the start more memory in certain cases.
The recommended and correct way in which to allot memory per GPU in TensorFlow 2.X is done in the following manner:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
# Restrict TensorFlow to only allocate 1GB of memory on the first GPU
try:
tf.config.experimental.set_virtual_device_configuration(
gpus[0],
[tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)]
Shameless plug: If you install the GPU supported Tensorflow, the session will first allocate all GPUs whether you set it to use only CPU or GPU. I may add my tip that even you set the graph to use CPU only you should set the same configuration(as answered above:) ) to prevent the unwanted GPU occupation.
And in an interactive interface like IPython and Jupyter, you should also set that configure, otherwise, it will allocate all memory and leave almost none for others. This is sometimes hard to notice.
If you're using Tensorflow 2 try the following:
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.compat.v1.Session(config=config)
For Tensorflow 2.0 this this solution worked for me. (TF-GPU 2.0, Windows 10, GeForce RTX 2070)
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
tf.config.experimental.set_memory_growth(physical_devices[0], True)
# allocate 60% of GPU memory
from keras.backend.tensorflow_backend import set_session
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.6
set_session(tf.Session(config=config))
this code has worked for me:
import tensorflow as tf
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.compat.v1.InteractiveSession(config=config)
Well I am new to tensorflow, I have Geforce 740m or something GPU with 2GB ram, I was running mnist handwritten kind of example for a native language with training data containing of 38700 images and 4300 testing images and was trying to get precision , recall , F1 using following code as sklearn was not giving me precise reults. once i added this to my existing code i started getting GPU errors.
TP = tf.count_nonzero(predicted * actual)
TN = tf.count_nonzero((predicted - 1) * (actual - 1))
FP = tf.count_nonzero(predicted * (actual - 1))
FN = tf.count_nonzero((predicted - 1) * actual)
prec = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * prec * recall / (prec + recall)
plus my model was heavy i guess, i was getting memory error after 147, 148 epochs, and then I thought why not create functions for the tasks so I dont know if it works this way in tensrorflow, but I thought if a local variable is used and when out of scope it may release memory and i defined the above elements for training and testing in modules, I was able to achieve 10000 epochs without any issues, I hope this will help..
i tried to train unet on voc data set but because of huge image size, memory finishes. i tried all the above tips, even tried with batch size==1, yet to no improvement. sometimes TensorFlow version also causes the memory issues. try by using
pip install tensorflow-gpu==1.8.0

Tensorflow - Inference time evaluation

I'm evaluating different image classification models using Tensorflow, and specifically inference time using different devices.
I was wondering if I have to use pretrained models or not.
I'm using a script generating 1000 random input images feeding them 1 by 1 to the network, and calculating mean inference time.
Thank you !
Let me start by a warning:
A proper benchmark of neural networks is done in a wrong way by most people. For GPUs there is disk I/O, memory bandwidth, PCI bandwidth, the GPU speed itself. Then there are implementation faults like using feed_dict in TensorFlow. This is also true for a efficient training of these models.
Let's start by a simple example considering a GPU
import tensorflow as tf
import numpy as np
data = np.arange(9 * 1).reshape(1, 9).astype(np.float32)
data = tf.constant(data, name='data')
activation = tf.layers.dense(data, 10, name='fc')
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
sess.run(tf.global_variables_initializer())
print sess.run(activation)
All it does is creating a const tensor and apply a fully connected layer.
All the operations are placed on the GPU:
fc/bias: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-25 09:55:01.587959: I tensorflow/core/common_runtime/placer.cc:874] fc/bias: (VariableV2)/job:localhost/replica:0/task:0/device:GPU:0
fc/bias/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-25 09:55:01.587970: I tensorflow/core/common_runtime/placer.cc:874] fc/bias/read: (Identity)/job:localhost/replica:0/task:0/device:GPU:0
fc/bias/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-25 09:55:01.587979: I tensorflow/core/common_runtime/placer.cc:874] fc/bias/Assign: (Assign)/job:localhost/replica:0/task:0/device:GPU:0
fc/kernel: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-25 09:55:01.587988: I tensorflow/core/common_runtime/placer.cc:874] fc/kernel: (VariableV2)/job:localhost/replica:0/task:0/device:GPU:0
fc/kernel/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
...
Looks good, right?
Benchmarking this graph might give a rough estimate how fast the TensorFlow graph can be executed. Just replace tf.layers.dense by your network. If you accept the overhead of using pythons time package, you are done.
But this is, unfortunately, not the entire story.
There is copying the result back from the tensor-op 'fc/BiasAdd:0' accessing device memory (GPU) and copying to host memory (CPU, RAM).
Hence there is the PCI bandwidth limitation at some point. And there is a python interpreter somewhere sitting as well, taking CPU cycles.
Further, the operations are placed on the GPU, not necessary the values themselves. Not sure, which TF version you are using. But even a tf.const gave no guarantees in older version to be placed on the GPU. Which I only noticed when writing my own Ops. Btw: see my other answer on how TF decides where to place operations.
Now, the hard part: It depends on your graph. Having a tf.cond/tf.where sitting somewhere makes things harder to benchmark. Now, you need to go through all these struggles which you need to address when efficiently training a deep network. Meaning, a simple const cannot address all cases.
A solutions starts by putting/staging some values directly into GPU memory by running
stager = data_flow_ops.StagingArea([tf.float32])
enqeue_op = stager.put([dummy])
dequeue_op = tf.reduce_sum(stager.get())
for i in range(1000):
sess.run(enqeue_op)
beforehand. But again, the TF resource manager is deciding where it puts values (And there is no guarantee about the ordering or dropping/keeping values).
To sum it up: Benchmarking is a highly complex task as benchmarking CUDA code is complex. Now, you have CUDA and additionally python parts.
And it is a highly subjective task, depending on which parts you are interested in (just graph, including disk i/o, ...)
I usually run the graph with a tf.const input as in the example and use the profiler to see whats going on in the graph.
For some general ideas on how to improve runtime performance you might want to read the Tensorflow Performance Guide
So, to clarify, you are only interested in the runtime per inference step and not in the accuracy or any ML related performance metrics?
In this case it should not matter much if you initialize your model from a pretrained checkpoint or just from scratch via the given initializers (e.g. truncated_normal or constant) assigned to each variable in your graph.
The underlying mathematical operations will be the same, mainly matrix-multiply operations for whom it doesn't matter (much) which values the underlying add and multiply operations are performed on.
This could be a bit different, if your graph contains some more advanced control-flow structures like tf.while_loop that can influence the actual size of your graph depending on the values of certain Tensors.
Of course, the time it takes to initialize your graph at the very beginning of program execution will differ depending on if you initialize from scratch or from checkpoint.
Hope this helps.

How to fix low volatile GPU-Util with Tensorflow-GPU and Keras?

I have a 4 GPU machine on which I run Tensorflow (GPU) with Keras. Some of my classification problems take several hours to complete.
nvidia-smi returns Volatile GPU-Util which never exceeds 25% on any of my 4 GPUs.
How can I increase GPU Util% and speed up my training?
If your GPU util is below 80%, this is generally the sign of an input pipeline bottleneck. What this means is that the GPU sits idle much of the time, waiting for the CPU to prepare the data:
What you want is the CPU to keep preparing batches while the GPU is training to keep the GPU fed. This is called prefetching:
Great, but if the batch preparation is still way longer than the model training, the GPU will still remain idle, waiting for the CPU to finish the next batch. To make the batch preparation faster we can parallelize the different preprocessing operations:
We can go even further by parallelizing I/O:
Now to implement this in Keras, you need to use the Tensorflow Data API with Tensorflow version >= 1.9.0. Here is an example:
Let's assume, for the sake of this example that you have two numpy arrays x and y. You can use tf.data for any type of data but this is simpler to understand.
def preprocessing(x, y):
# Can only contain TF operations
...
return x, y
dataset = tf.data.Dataset.from_tensor_slices((x, y)) # Creates a dataset object
dataset = dataset.map(preprocessing, num_parallel_calls=64) # parallel preprocessing
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(None) # Will automatically prefetch batches
....
model = tf.keras.model(...)
model.fit(x=dataset) # Since tf 1.9.0 you can pass a dataset object
tf.data is very flexible, but as anything in Tensorflow (except eager), it uses a static graph. This can be a pain sometimes but the speed up is worth it.
To go further, you can have a look at the performance guide and the Tensorflow data guide.
I've got similar issue - the memory of all the GPUs were allocated by Keras, but Volatile was around 0% and training was taking almost the same amount of time as on CPU. I was using ImageDataGenerator, which turned out to be a bottleneck. When I increased the number of workers in fit_generator method from default value 1 to all available CPUs, then the training time dropped rapidly.
You can also load the data to the memory and then use flow method to prepare batches with augmented images.