tf.function uses all CPU RAM - tensorflow2.0

I cannot understand why the function I have posted below uses up all of my RAM. I could understand if I were running it eagerly, but I thought the point of a tf.function was to create a graph that is reused, much like creating an operation and running it in tf 1.x. I am new to tensorflow 2.0 so I might have the wrong idea about what tf.function is doing.
@tf.function
def clip_w(self, weight):
    return tf.clip_by_value(weight, -0.01, 0.01)
Could anyone help me understand this? Thanks
EDIT: Here is the code where I use this function
def clip_weights(self):
    for l in self.C.layers:
        weights = l.get_weights()
        weights = [self.clip_w(w) for w in weights]


Wrong gradient from tf custom gradient - even though gradient is implemented using the inbuilt Jacobian

I'm trying to write a wrapper around a model, such that the TF model can be called as a function of its weights (and input). However, this wrapper returns different gradients than the original model. Details are in the code below (including a Colab notebook to reproduce directly), but at the core I'm using the custom gradient decorator: the gradient for each weight is computed as the upstream gradient contracted (via tensordot) with the respective Jacobian.
To make this clear: I'm computing the gradient for a model, once directly, once by using my custom wrapper. In both cases the parameters in the model are the same. The Jacobian is implemented by TF, so nothing should be wrong there. Still the resulting gradient seems to be wrong.
I'm not sure whether this is a coding mistake I made somewhere, or possibly just a numerical problem stemming from the Jacobian matmul. However, my tests on the correlation of the gradients suggest this is more than a numerical issue. The code of the function is provided below; a link to a Colab notebook reproducing the problem can be found here: Colab Notebook reproducing the problem
Why: This is important for a range of meta-learning methods, for which I'm currently trying to build a small library.
My current 'wrapper' looks something like this:
# Calls model on input x but replaces internal weights with the weights argument.
# Critically, this is supposed to compute the respective gradient for the weights tensor argument!
def call_model_with_weights(model, x, weights, dim_output=2):
    @tf.custom_gradient
    def _call_with_weights(x_and_w):
        x, weights = x_and_w
        # Be careful: this assigns weights to the model as a side effect; can be ignored for the dummy version.
        ctrls = [var.assign(val) for var, val in zip(model.trainable_weights, weights)]
        with tf.control_dependencies(ctrls):
            with tf.GradientTape() as tape:
                y = model(x)
            jacobians = tape.jacobian(y, model.trainable_weights)
        def grad(upstream, variables):
            assert len(variables) == len(weights)
            # The gradient for each weight should be upstream contracted with the respective Jacobian.
            dy_dw = [tf.tensordot(upstream, j, axes=[list(range(dim_output)), list(range(dim_output))]) for j in jacobians]
            dy_dw_weights = dy_dw
            # Returning None as the derivative w.r.t. x is wrong, but not important here right now.
            return (None, dy_dw_weights), [None for _ in dy_dw]
        return y, grad
    y = _call_with_weights((x, weights))
    return y
Thanks a lot for any help (including how this could be done in a more elegant way). Helping out means you are contributing to a package that plans to mimic PyTorch's 'higher' for TF, which I hope helps some more people <3
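To make the intended contraction concrete, here is a minimal NumPy sketch of "upstream gradient contracted with the Jacobian over all output axes". It does not reproduce or diagnose the TF wrapper above. The toy linear model, the shapes and all variable names are assumptions made up for illustration.

```python
import numpy as np

# Hypothetical toy "model": y = x @ W, with x of shape (batch, d_in)
# and W of shape (d_in, d_out). Nothing here comes from the original post.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))
y = x @ W  # shape (3, 2)

# The Jacobian dy/dW has shape y.shape + W.shape = (3, 2, 4, 2), with
# d y[b, o] / d W[i, k] = x[b, i] if o == k else 0.
jacobian = np.einsum("bi,ok->boik", x, np.eye(2))

# Upstream gradient dL/dy, same shape as y.
upstream = rng.normal(size=y.shape)

# Contract over every axis of y (here: batch axis and output axis).
n_out_axes = y.ndim
axes = [list(range(n_out_axes)), list(range(n_out_axes))]
grad_W = np.tensordot(upstream, jacobian, axes=axes)  # shape (4, 2)

# Closed-form reference for this linear model: dL/dW = x^T @ upstream.
reference = x.T @ upstream
```

For this toy case grad_W and reference agree, so the contraction pattern itself is sound as long as the contracted axes really are the axes of y. If the TF version disagrees with the direct gradient, the shapes of upstream and of each Jacobian would be the first thing I would print.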

Tensorflow: How to prefetch data on the GPU from CPU (from_generator)

I am struggling with the following. I am creating a tf.data.Dataset using the from_generator method. I perform these actions on the CPU because I don't want to overload my GPU memory.
The dataset consists of tuples, which contain a tf.bool 1-D mask (tf.Tensor) with fixed length, and a tf.float 2-D matrix (tf.Tensor) with variable size. The loss function is decorated using the following decorator, so I would not assume the variable size is the problem.
Ideally, the dataset is kept on the CPU, but then prefetched onto the GPU.
def gen():
    for i, j in zip(mask_list, wmat_list):
        yield i, j
dataset = tf.data.Dataset.from_generator(gen, output_types=(tf.bool, tf.float32))
The main training loop currently relies on tf.identity to move the data to the GPU, which is inefficient. As shown in the TensorBoard screenshot below, roughly 70% of the time is spent loading the data and moving it to the GPU.
for b, (mask, wmat) in enumerate(dataset):
    with tf.GradientTape() as tape:
        mask = tf.identity(mask)
        wmat = tf.identity(wmat)
        mean_error, loss = self.model.loss(mask, wmat)
    epoch_loss += loss.numpy()
    epoch_mean_error += mean_error.numpy()
I have tried the "prefetch_to_device" function. However, it did not move the data onto the GPU, as verified by printing e.g. mask.device in the training loop.
gpu_transform = tf.data.experimental.prefetch_to_device('/gpu')
This looks similar to a previously reported bug. However, that report is marked as solved and is over a year old.
Running TF 2.3 using the official Docker image.
I have found the solution to my own question.
The problem was that the tuples in the dataset did not contain tf.Tensors, but numpy arrays. Therefore, the pipeline was probably limited by the functionality of py_func().
The screenshot below shows that the pipeline does not block on the CPU. However, there is still a considerable MemCpy. prefetch_to_device() still does not do anything. This is likely due to a known issue, which should be fixed in TF 2.4.
The (unconfirmed) suggested workaround also did not work for me. (see edit)
with tf.device("/gpu:0"):
    ds = ds.prefetch(1)
I have further investigated this issue and filed a bug report. It now seems that the suggested workaround does something, but I am not sure whether it completely prefetches in time.

Use TensorFlow loss Global Objectives (recall_at_precision_loss) with Keras (not metrics)

I have a multi-label classification problem with 5 labels (e.g. [1 0 1 1 0]). Therefore, I want my model to improve at metrics such as fixed recall, precision-recall AUC or ROC AUC.
It doesn't make sense to use a loss function (e.g. binary_crossentropy) that is not directly related to the performance measurement I want to optimize. Therefore, I want to use TensorFlow's global_objectives.recall_at_precision_loss() or similar as loss function.
Relevant GitHub:
Relevant paper (Scalable Learning of Non-Decomposable Objectives):
Not metric
I'm not looking for implementing a tf.metrics. I already succeeded in that following:
I think my issue can be divided into 2 problems:
1) How to use global_objectives.recall_at_precision_loss() or similar?
2) How to use it in a Keras model with TF backend?
Problem 1
There is a file called on the global objectives GitHub page (same as above). However, since I don't have much experience with TF, I don't really understand how to use it. Also, Googling for TensorFlow recall_at_precision_loss example or TensorFlow Global objectives example won't give me any clearer example.
How do I use global_objectives.recall_at_precision_loss() in a simple TF example?
Problem 2
Would something like (in Keras): model.compile(loss = ??.recall_at_precision_loss, ...) be enough?
My feeling tells me it is more complex than that, due to the use of global variables used in
How to use loss functions similar to global_objectives.recall_at_precision_loss() in Keras?
Similar to Martino's answer, but will infer shape from input (setting it to a fixed batch size did not work for me).
The outside function isn't strictly necessary, but it feels a bit more natural to pass params as you configure the loss function, especially when your wrapper is defined in an external module.
import keras.backend as K
from global_objectives.loss_layers import precision_at_recall_loss
def get_precision_at_recall_loss(target_recall):
    def precision_at_recall_loss_wrapper(y_true, y_pred):
        y_true = K.reshape(y_true, (-1, 1))
        y_pred = K.reshape(y_pred, (-1, 1))
        return precision_at_recall_loss(y_true, y_pred, target_recall)[0]
    return precision_at_recall_loss_wrapper
Then, when compiling the model:
model.compile(optimizer='adam', loss=get_precision_at_recall_loss(TARGET_RECALL))
I managed to make it work by:
Explicitly reshaping tensors to BATCH_SIZE length (see code below)
Cutting the dataset size to a multiple of BATCH_SIZE
def precision_recall_auc_loss(y_true, y_pred):
    y_true = keras.backend.reshape(y_true, (BATCH_SIZE, 1))
    y_pred = keras.backend.reshape(y_pred, (BATCH_SIZE, 1))
    util.get_num_labels = lambda labels : 1
    return loss_layers.precision_recall_auc_loss(y_true, y_pred)[0]
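The second step above (cutting the dataset to a multiple of BATCH_SIZE) is not shown in the answer. A minimal sketch of one way to do it with NumPy arrays follows. The helper name trim_to_batch_multiple, the value of BATCH_SIZE and the dummy array shapes are my own assumptions, not part of the original answer.

```python
import numpy as np

BATCH_SIZE = 32  # assumed value; use whatever the model is trained with


def trim_to_batch_multiple(x, y, batch_size):
    """Drop trailing samples so that len(x) is a multiple of batch_size."""
    n_keep = (len(x) // batch_size) * batch_size
    return x[:n_keep], y[:n_keep]


# Dummy data: 1000 samples, which is not a multiple of 32.
x = np.zeros((1000, 20), dtype=np.float32)
y = np.zeros((1000, 1), dtype=np.float32)

x_trim, y_trim = trim_to_batch_multiple(x, y, BATCH_SIZE)
```

With every batch full, the fixed (BATCH_SIZE, 1) reshape in the loss never sees a smaller final batch. Shuffling before trimming would avoid always discarding the same trailing samples.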

How to run Tensorflow Estimator on multiple GPUs with data parallelism

I have a standard tensorflow Estimator with some model and want to run it on multiple GPUs instead of just one. How can this be done using data parallelism?
I searched the Tensorflow Docs but did not find an example; only sentences saying that it would be easy with Estimator.
Does anybody have a good example using the tf.learn.Estimator? Or a link to a tutorial or so?
I think tf.contrib.estimator.replicate_model_fn is a cleaner solution. The following is from the tf.contrib.estimator.replicate_model_fn documentation:
def model_fn(...):  # See `model_fn` in `Estimator`.
    loss = ...
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
    optimizer = tf.contrib.estimator.TowerOptimizer(optimizer)
    if mode == tf.estimator.ModeKeys.TRAIN:
        # See the section below on `EstimatorSpec.train_op`.
        return EstimatorSpec(mode=mode, loss=loss,
                             train_op=optimizer.minimize(loss))
    # No change for `ModeKeys.EVAL` or `ModeKeys.PREDICT`.
    return EstimatorSpec(...)
classifier = tf.estimator.Estimator(
    model_fn=tf.contrib.estimator.replicate_model_fn(model_fn))
What you need to do is wrap the optimizer with tf.contrib.estimator.TowerOptimizer and model_fn() with tf.contrib.estimator.replicate_model_fn().
I followed the description and made a TPU SqueezeNet model work on a machine with 4 GPUs. My modifications here.
I think this is all you need.
More Details:
dist_strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=NUM_GPUS)
config = tf.estimator.RunConfig(train_distribute=dist_strategy)
estimator = tf.estimator.Estimator(model_fn, model_dir, config=config)
With TF-2.0 and Keras you may use this (
The standard example is:
One way to run it data-parallel would be to loop over available GPU devices, and send chunks of your batch to copied versions of your model (all done within your model_fn), then merge the results.
You can use scope and device for that:
with tf.variable_scope(tf.get_variable_scope()):
    for i in xrange(FLAGS.num_gpus):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
                ...
Full example there:
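To illustrate the "send chunks of the batch to each copy, then merge the results" idea without any GPU code, here is a small NumPy sketch. The linear model, the loss, NUM_TOWERS and the function name tower_gradient are assumptions for illustration only. In the real setup each tower would run under its own tf.device scope.

```python
import numpy as np

NUM_TOWERS = 4  # stands in for the number of GPUs (assumed)


def tower_gradient(w, x_chunk, y_chunk):
    """Gradient of the mean squared error of a linear model on one chunk."""
    residual = x_chunk @ w - y_chunk
    return 2.0 * x_chunk.T @ residual / len(x_chunk)


rng = np.random.default_rng(0)
x = rng.normal(size=(64, 5))
y = rng.normal(size=(64,))
w = np.zeros(5)  # every tower uses the same copy of the weights

# Split the batch into equally sized chunks, one per tower.
x_chunks = np.split(x, NUM_TOWERS)
y_chunks = np.split(y, NUM_TOWERS)

# Each tower computes a gradient on its own chunk.
tower_grads = [tower_gradient(w, xc, yc) for xc, yc in zip(x_chunks, y_chunks)]

# Merge step: average the per-tower gradients.
merged_grad = np.mean(tower_grads, axis=0)

# For equally sized chunks this matches the full-batch gradient.
full_batch_grad = tower_gradient(w, x, y)
```

The averaging only reproduces the full-batch gradient when the chunks have equal size. With uneven chunks the per-tower gradients would need to be weighted by chunk size.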
You can find an example using tf.distribute.MirroredStrategy and tf.estimator.train_and_evaluate here.

How to use evaluation_loop with train_loop in tf-slim

I'm trying to implement a few different models and train them on CIFAR-10, and I want to use TF-slim to do this. It looks like TF-slim has two main loops that are useful during training: train_loop and evaluation_loop.
My question is: what is the canonical way to use these loops?
As a followup: is it possible to use early stopping with train_loop?
Currently I have a model and my training file looks like this
import ...
train_log_dir = ...
with tf.device("/cpu:0"):
    images, labels, dataset = set_up_input_pipeline_with_fancy_prefetching(
        subset='train', ... )
logits, end_points = set_up_model( images )  # Possibly using many GPUs
total_loss = set_up_loss( logits, labels, dataset )
optimizer, global_step = set_up_optimizer( dataset )
train_tensor = slim.learning.create_train_op( ... )
Which is awesome so far - my models all train and converge nicely. I can see this from the events in train_log_dir where all the metrics are going in the right direction. And going in the right direction makes me happy.
But I'd like to check that the metrics are improving on the validation set, too. I don't know of any way to do this with TF-slim that plays nicely with the training loop, so I created a second file that contains my evaluation loop.
import ...
train_log_dir = ...
with tf.device("/cpu:0"):
    images, labels, dataset = set_up_input_pipeline_with_fancy_prefetching(
        subset='validation', ... )
logits, end_points = set_up_model( images )
summary_ops, names_to_values, names_to_updates = create_metrics_and_summary_ops(
    dataset.num_classes() )
1) I currently have this model for the evaluation_loop hogging up an entire GPU, but it's rarely being used. I assume there's a better way to allocate resources. It would be pretty nice if I could use the same evaluation_loop to monitor the progress of multiple different models (checkpoints in multiple directories). Is something like this possible?
2) There's no feedback between the evaluation and training. I'm training a ton of models and would love to use early stopping to halt the models which aren't learning or are not converging. Is there a way to do this? Ideally using information from the validation set, but if it has to be just based on the training data that's okay, too.
3) Is my workflow all wrong and I should be structuring it differently? It's not clear from the documentation how to use evaluation in conjunction with training.
~~It seems that as of TF r0.11 I'm also getting a segfault when calling slim.evaluation.evaluation_loop. It only happens sometimes (for me when I dispatch my jobs to a cluster). It happens in sv.managed_session--specifically prepare_or_wait_for_session.~~
This was just due to the evaluation loop (a second instance of TensorFlow) trying to use the GPU, which was already requisitioned by the first instance.
1) evaluation_loop is meant to be used (as you are currently using it) with a single directory. If you want to be more efficient, you could use slim.evaluation.evaluate_once and add the appropriate logic for swapping directories as you find appropriate.
2) You can do this by overriding the slim.learning.train(..., train_step_fn) argument. This argument replaces the 'train_step' function with a custom function. Here, you can supply a custom training function which returns the 'total_loss' and 'should_stop' values as you see fit.
3) Your workflow looks great; this is probably the most common workflow for learning/eval using TF-Slim.
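As a rough illustration of the early-stopping idea, here is a plain-Python sketch of a wrapper that sets should_stop once the loss has stopped improving. It does not call TF-Slim. base_train_step is a stand-in for slim's train_step and is only assumed to return (total_loss, should_stop). The patience logic, min_delta and all names are my own assumptions.

```python
def make_early_stopping_step(base_train_step, patience, min_delta=0.0):
    """Wrap a train-step callable so it requests a stop when the loss stalls.

    base_train_step is assumed to return (total_loss, should_stop).
    """
    state = {"best": float("inf"), "bad_steps": 0}

    def train_step_fn(*args, **kwargs):
        total_loss, should_stop = base_train_step(*args, **kwargs)
        if total_loss < state["best"] - min_delta:
            # Loss improved: remember it and reset the counter.
            state["best"] = total_loss
            state["bad_steps"] = 0
        else:
            state["bad_steps"] += 1
        if state["bad_steps"] >= patience:
            should_stop = True
        return total_loss, should_stop

    return train_step_fn


# Stand-in for the real train step: replays a fixed list of losses.
losses = iter([1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.5])


def fake_train_step():
    return next(losses), False


step_fn = make_early_stopping_step(fake_train_step, patience=3)
history = []
while True:
    loss, stop = step_fn()
    history.append(loss)
    if stop:
        break
```

Here the loop stops after three consecutive steps without improvement, so the final 0.5 is never reached. In practice the value being tracked would more sensibly be a validation metric evaluated every N steps rather than the raw training loss.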
Thanks to #kmalakoff, the TensorFlow issue gave a brilliant solution to the problem of how to validate or test a model during tf.slim training. The main idea is to override the train_step_fn function:
import …
from tensorflow.contrib.slim.python.slim.learning import train_step
accuracy_validation = ...
accuracy_test = ...
def train_step_fn(session, *args, **kwargs):
    total_loss, should_stop = train_step(session, *args, **kwargs)
    if train_step_fn.step % FLAGS.validation_every_n_step == 0:
        accuracy = session.run(train_step_fn.accuracy_validation)
        print('your validation info')
    if train_step_fn.step % FLAGS.test_every_n_step == 0:
        accuracy = session.run(train_step_fn.accuracy_test)
        print('your test info')
    train_step_fn.step += 1
    return [total_loss, should_stop]
train_step_fn.step = 0
train_step_fn.accuracy_validation = accuracy_validation
train_step_fn.accuracy_test = accuracy_test
# run training.
Adding my 2 cents:
"I currently have this model for the evaluation_loop hogging up an entire GPU, but it's rarely being used"
Usually an evaluation model takes less GPU memory. You could prevent TF from hogging the whole GPU memory by setting the session config option allow_growth to True. This way you can use the same GPU for both training and evaluation.
Example:
# Training
session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True
# Validation
session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True