Passing batches to TensorFlow Structural Time Series

I am creating a model for time series prediction with TensorFlow Probability, following this tutorial. The examples there pass all of the data at once, which is prohibitive when dealing with big data (my case). How should I pass batches, or any other kind of lazily loaded data, to this tool?

This is a general problem for most probabilistic inference methods: gradients computed on anything less than the full batch will usually yield biased samples.
You should be able to write a target_log_prob_fn with a tf.custom_gradient that iterates over a dataset iterator. Since the target log-prob is a scalar, you can accumulate both gradients and log-probs as the function proceeds over all minibatches in the dataset.
ds = build_dataset()

def build_model(params):
    return time_series_model(..)

@tf.function  # autograph should turn the dataset loop into a tf.while_loop.
@tf.custom_gradient
def log_prob(*params):
    total_lp = 0.
    total_grad = tf.nest.map_structure(tf.zeros_like, params)
    for batch in ds:
        # Log-prob of this minibatch and its gradient w.r.t. the parameters.
        lp, grad = tfp.math.value_and_gradient(
            lambda *p: build_model(p).log_prob(batch), params)
        total_lp += lp
        total_grad = tf.nest.map_structure(lambda x, y: x + y, total_grad, grad)
    # custom_gradient expects (value, grad_fn); scale the summed gradient by the upstream dy.
    return total_lp, lambda dy: tf.nest.map_structure(lambda g: dy * g, total_grad)
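To see why accumulating per-batch values is exact rather than an approximation, here is a minimal pure-Python sketch with no TensorFlow involved. It assumes a toy model (independent unit-variance Gaussian observations with unknown mean mu) and that batches are independent of one another; the names gaussian_log_prob_and_grad and accumulate_over_batches are made up for illustration and are not part of TFP. Because the log-probability of independent observations is a sum, adding up per-batch log-probs and gradients should reproduce the full-batch result up to floating-point rounding.

```python
import math

def gaussian_log_prob_and_grad(mu, batch):
    """Log-density of `batch` under N(mu, 1) and its derivative w.r.t. mu."""
    lp = sum(-0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi) for x in batch)
    grad = sum(x - mu for x in batch)
    return lp, grad

def accumulate_over_batches(mu, data, batch_size):
    """Sum log-prob and gradient over consecutive minibatches of `data`."""
    total_lp, total_grad = 0.0, 0.0
    for start in range(0, len(data), batch_size):
        lp, grad = gaussian_log_prob_and_grad(mu, data[start:start + batch_size])
        total_lp += lp
        total_grad += grad
    return total_lp, total_grad

data = [0.3, -1.2, 2.5, 0.8, 1.1, -0.4, 0.9]
full_lp, full_grad = gaussian_log_prob_and_grad(0.5, data)
batched_lp, batched_grad = accumulate_over_batches(0.5, data, batch_size=3)
# Both comparisons should hold up to rounding error.
print(abs(full_lp - batched_lp) < 1e-9, abs(full_grad - batched_grad) < 1e-9)
```

Only one batch needs to be in memory at a time, which is the point of the custom-gradient approach above; the cost is a full pass over the dataset per log-prob evaluation.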


Why is GPflow's Scipy optimizer incompatible with decorating the optimization step with tf.function?

I am supplying different minibatches to optimize a GPflow model (SVGP). If I decorate the optimization_step with tf.function I get the following error:
NotImplementedError: Cannot convert a symbolic Tensor (concat:0) to a
numpy array. This error may indicate that you're trying to pass a
Tensor to a NumPy call, which is not supported
In order for the optimizer to run I had to remove the tf.function decorator, losing the speed-up advantages. What do I need to change so that I can keep using the tf.function decorator?
The xAndY input shapes and types are all numpy arrays.
Out[71]: tuple
Out[72]: (245760, 2)
Out[73]: (245760, 1)
Out[74]: numpy.ndarray
def run_optimizer_on_minibatch_size(model, iterations, minibatch_size, xAndY):
    """
    Utility function running a Scipy optimizer
    :param model: GPflow model
    :param iterations: number of iterations
    """
    N = xAndY[0].shape[0]
    tensor_data = tuple(map(tf.convert_to_tensor, xAndY))
    # Right-hand side was lost in the post; this reconstruction is an assumption based on N and tensor_data.
    train_dataset = tf.data.Dataset.from_tensor_slices(tensor_data).repeat().shuffle(N)
    logf = []
    train_iter = iter(train_dataset.batch(minibatch_size))
    training_loss = model.training_loss_closure(train_iter, compile=True)
    optimizer = gpflow.optimizers.Scipy()

    # @tf.function  # had to remove this decorator
    def optimization_step():
        optimizer.minimize(training_loss, model.trainable_variables)

    # step = 0
    for step in range(iterations):
        optimization_step()
        if step % 10 == 0:
            elbo = -training_loss().numpy()
            logf.append(elbo)
    return logf
from gpflow.ci_utils import ci_niter
maxiter = ci_niter(20000)
logf = run_optimizer_on_minibatch_size(m, maxiter, minibatch_size, (X,Y))
GPflow's gpflow.optimizers.Scipy() is a wrapper around Scipy's minimize(), and as it calls into non-TensorFlow operations, you cannot wrap it in tf.function. Moreover, the optimizers implemented in Scipy's minimize are second-order methods that assume that your gradients are not stochastic, and aren't compatible with minibatching.
If you want to do full-batch optimization with Scipy: The minimize() method of gpflow.optimizers.Scipy(), by default, does wrap the objective and gradient computation inside tf.function (see its compile argument with default True). It also does the full optimization, so you only have to call the minimize() method once (by default it runs until convergence or failure to continue optimization; you can supply a maximum number of iterations using the options=dict(maxiter=1000) argument).
If you want to use mini-batching: simply use one of the TensorFlow optimizers, such as tf.optimizers.Adam(), and then your code should run fine including the @tf.function decorator on your optimization_step() function (and in that case you do need to call it in a loop as in your example).

Tensorflow 2.0 Custom loss function with multiple inputs

I am trying to optimize a model with the following two loss functions
def loss_1(pred, weights, logits):
    weighted_sparse_ce = kls.SparseCategoricalCrossentropy(from_logits=True)
    policy_loss = weighted_sparse_ce(pred, logits, sample_weight=weights)
    return policy_loss

def loss_2(y_pred, y):
    return kls.mean_squared_error(y_pred, y)
however, because TensorFlow 2 expects loss function to be of the form
def fn(y_pred, y_true):
I am using a workaround for loss_1 where I pack pred and weights into a single tensor before passing it to loss_1 in the call to fit, and then unpack them inside loss_1. This is inelegant because pred and weights have different data types, so every call requires an additional cast, pack, unpack and un-cast.
Furthermore, I am aware of the sample_weight argument to fit, which is kind of like the solution to this question. This might be a workable solution were it not for the fact that I am using two loss functions and I only want the sample_weight applied to one of them. Also, even if this were a solution, it would not generalize to other types of custom loss functions.
All that being said, my question, said concisely, is:
What is the best way to create a loss function with an arbitrary number of
arguments in TensorFlow 2?
Another thing I have tried is passing a tf.tuple but that also seems to violate TensorFlow's desires for a loss function input.
This problem can be easily solved using custom training in TF2. You need only compute your two-component loss function within a GradientTape context and then call an optimizer with the produced gradients. For example, you could create a function custom_loss which computes both losses given the arguments to each:
def custom_loss(model, loss1_args, loss2_args):
    # model: tf.keras.Model
    # loss1_args: arguments to loss_1, as tuple.
    # loss2_args: arguments to loss_2, as tuple.
    with tf.GradientTape() as tape:
        l1_value = loss_1(*loss1_args)
        l2_value = loss_2(*loss2_args)
        loss_value = [l1_value, l2_value]
    # Gradients of a list of targets are summed over the targets.
    return loss_value, tape.gradient(loss_value, model.trainable_variables)

# In training loop:
loss_values, grads = custom_loss(model, loss1_args, loss2_args)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
In this way, each loss function can take an arbitrary number of eager tensors, regardless of whether they are inputs or outputs to the model. The sets of arguments to each loss function need not be disjoint as shown in this example.
To expand on Jon's answer: if you still want the benefits of a Keras Model, you can subclass the model class and write your own custom train_step:
import tensorflow as tf
from tensorflow import keras
from tensorflow.python.keras.engine import data_adapter

# custom loss function that takes two outputs of the model
# as input parameters which would otherwise not be possible
def custom_loss(gt, x, y):
    return tf.reduce_mean(x) + tf.reduce_mean(y)

class CustomModel(keras.Model):
    def compile(self, optimizer, my_loss):
        super().compile(optimizer)  # lets Keras register the optimizer
        self.my_loss = my_loss

    def train_step(self, data):
        data = data_adapter.expand_1d(data)
        input_data, gt, sample_weight = data_adapter.unpack_x_y_sample_weight(data)
        with tf.GradientTape() as tape:
            y_pred = self(input_data, training=True)
            loss_value = self.my_loss(gt, y_pred[0], y_pred[1])
        grads = tape.gradient(loss_value, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        return {"loss_value": loss_value}

model = CustomModel(inputs=input_tensor0, outputs=[x, y])
model.compile(optimizer=tf.keras.optimizers.Adam(), my_loss=custom_loss)
In tf 1.x we have the tf.nn.weighted_cross_entropy_with_logits function, which allows us to trade off recall and precision by adding extra positive weights for each class. In multi-label classification, the weights should be an (N,) tensor or numpy array. However, in tf 2.0, I haven't found a similar loss function yet, so I wrote my own loss function with an extra argument pos_w_arr.
import tensorflow as tf
from tensorflow.keras.backend import epsilon

def pos_w_loss(pos_w_arr):
    """Define positive weighted loss function"""
    def fn(y_true, y_pred):
        _epsilon = tf.convert_to_tensor(epsilon(), dtype=y_pred.dtype.base_dtype)
        # Clip predictions away from 0 and 1 so the logs stay finite.
        _y_pred = tf.clip_by_value(y_pred, _epsilon, 1. - _epsilon)
        cost = tf.multiply(tf.multiply(y_true, tf.math.log(
            _y_pred)), pos_w_arr) + tf.multiply((1 - y_true), tf.math.log(1 - _y_pred))
        return -tf.reduce_mean(cost)
    return fn
I'm not sure what you mean by it not working when using eager tensors or numpy arrays as inputs, though. Please correct me if I'm wrong.
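The closure pattern used by pos_w_loss does not depend on TensorFlow: an outer function captures the extra arguments and returns an inner function with the two-argument (y_true, y_pred) signature. Here is a minimal plain-Python sketch of the same idea, with a hypothetical make_weighted_bce factory and plain lists standing in for tensors.

```python
import math

def make_weighted_bce(pos_weights, eps=1e-7):
    """Return a (y_true, y_pred) loss that closes over per-class positive weights."""
    def loss(y_true, y_pred):
        total = 0.0
        for t, p, w in zip(y_true, y_pred, pos_weights):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            # Positive term is scaled by the class weight, negative term is not.
            total += w * t * math.log(p) + (1 - t) * math.log(1 - p)
        return -total / len(y_true)
    return loss

plain = make_weighted_bce([1.0, 1.0])
weighted = make_weighted_bce([3.0, 1.0])
y_true, y_pred = [1, 0], [0.5, 0.5]
print(plain(y_true, y_pred), weighted(y_true, y_pred))
```

Any number of extra arguments can be captured this way, while the returned function keeps the signature that compile expects.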

keras compile with dataset and flexible loss/metrics

I'm porting a bunch of code from the tf.estimator.Estimator API to tf.keras and I'm hoping to stay as close to the provided compile/fit as possible. I'm frustrated by compile's loss and metrics args.
Essentially, I'd like to use a loss function which uses multiple outputs and labels in a non-additive way, i.e. I want to provide
def custom_loss(all_labels, model_outputs):
    """
    all_labels: all labels in the dataset, as a single tensor, tuple or dict
    model_outputs: all outputs of model as a single tensor, tuple or dict
    Returns a single loss tensor to be averaged.
    """
I can't provide this to compile because, as far as I'm aware, it only supports weighted sums of per-output/label losses, and makes assumptions about the shape of each label based on the corresponding model output. I can't create it separately and use model.add_loss because I never have explicit access to a labels tensor if I want to let fit handle dataset iteration. I've considered flattening/concatenating all outputs and labels together, but then I can't monitor multiple metrics.
I can write my own training loop using model.train_on_batch, but that forces me to replicate behaviour already implemented in fit such as dataset iteration, callbacks, validation, distribution strategies etc.
As an example, I'd like to replicate the following estimator.
def model_fn(features, labels, mode):
    outputs = get_outputs(features)  # dict
    loss = custom_loss(labels, outputs)
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
    eval_metric_ops = {
        'a_mean': tf.metrics.mean(outputs['a'])
    }
    return tf.estimator.EstimatorSpec(
        loss=loss, train_op=train_op, mode=mode, eval_metric_ops=eval_metric_ops)

estimator = tf.estimator.Estimator(model_fn=model_fn)

Calculating Perplexity and Memory Issues in Keras/Tensorflow

I'd like to evaluate my model with perplexity after each training epoch. I'm using Keras with the TensorFlow backend. The problem is that after each evaluation more and more memory is used but never released, so after a few epochs my system crashes. It works without the memory issue if I don't use Keras and TensorFlow functions, but then it is way too slow.
Here is the code:
def compute_perplexity(self, modelName, sentences):
    all_labels, all_predictions = self.predictLabels_for_perplexity_evaluation(self.models[modelName], sentences)
    # add an axis to fit tensor shape
    for i in range(len(all_labels)):
        all_labels[i] = all_labels[i][:,:, np.newaxis]
    # calculate perplexity for each sentence length and each datapoint and append to list
    perplexity = []
    for i in range(10,15): #range(len(all_labels)):
        start = time.time()
        xentropy = K.sparse_categorical_crossentropy(tf.convert_to_tensor(all_labels[i]), tf.convert_to_tensor(all_predictions[i]))
        perplexity.append(K.eval(K.pow(2.0, xentropy)))
        print('time for one set of sentences. ', time.time()- start)
    # average for each datapoint
    for i in range(len(perplexity)):
        perplexity[i] = np.average(perplexity[i], axis=1)
        perplexity[i] = np.average(perplexity[i])
    return np.mean(perplexity)
There is no need to evaluate this metric using TensorFlow. What your code does is add the all_labels array to the graph each time it is called, which explains the memory usage you are seeing.
Consider implementing all this computation using numpy, or making an operation that you evaluate with new data in a session using feed_dict (without using tf.convert_to_tensor).
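As a rough illustration of the numpy route, the sketch below computes per-token cross-entropy and perplexity from integer labels and predicted probabilities without touching the TensorFlow graph. The function name and the array shapes (labels as (batch, time), predictions as (batch, time, vocab)) are assumptions, not taken from the question's helper methods. It exponentiates the natural-log cross-entropy with exp; since Keras' crossentropy is also computed with natural logs, raising 2.0 to that value would mix bases.

```python
import numpy as np

def perplexity(labels, probs, eps=1e-12):
    """Mean per-token perplexity.

    labels: int array of shape (batch, time)
    probs:  float array of shape (batch, time, vocab), rows summing to 1
    """
    labels = np.asarray(labels)
    probs = np.asarray(probs, dtype=np.float64)
    # Pick the probability assigned to the true label at every position.
    picked = np.take_along_axis(probs, labels[..., np.newaxis], axis=-1)[..., 0]
    xentropy = -np.log(np.clip(picked, eps, 1.0))  # natural-log cross-entropy
    return float(np.mean(np.exp(xentropy)))

labels = np.array([[0, 1]])
probs = np.array([[[0.5, 0.5], [0.75, 0.25]]])
print(perplexity(labels, probs))
```

Nothing here is added to a graph, so repeated calls after each epoch should not accumulate memory the way repeated tf.convert_to_tensor calls do.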

Parallelism isn't reducing the time in dataset map

The TF map function supports parallel calls, but I'm seeing no improvement from passing num_parallel_calls to map: with num_parallel_calls=1 and num_parallel_calls=10 the run time is the same. Here is a simple example:
import time

def test_two_custom_function_parallelism(num_parallel_calls=1, batch=False,
                                         batch_size=1, repeat=1, num_iterations=10):
    start = time.time()
    # The dataset source was lost in the post; Dataset.range(1000) is an assumption.
    dataset_x = tf.data.Dataset.range(1000).map(lambda x: tf.py_func(
        squarer, [x], [tf.int64]),
        num_parallel_calls=num_parallel_calls).repeat(repeat)
    if batch:
        dataset_x = dataset_x.batch(batch_size)
    dataset_y = tf.data.Dataset.range(1000).map(lambda x: tf.py_func(
        squarer, [x], [tf.int64]), num_parallel_calls=num_parallel_calls).repeat(repeat)
    if batch:
        dataset_y = dataset_y.batch(batch_size)
    X = dataset_x.make_one_shot_iterator().get_next()
    Y = dataset_y.make_one_shot_iterator().get_next()
    with tf.Session() as sess:
        i = 0
        while True:
            try:
                res = sess.run([X, Y])
                i += 1
                if i == num_iterations:
                    break
            except tf.errors.OutOfRangeError as e:
                break
Here are the timings
%timeit test_two_custom_function_parallelism(num_iterations=1000,
num_parallel_calls=2, batch_size=2, batch=True)
%timeit test_two_custom_function_parallelism(num_iterations=1000,
num_parallel_calls=5, batch_size=2, batch=True)
%timeit test_two_custom_function_parallelism(num_iterations=1000,
num_parallel_calls=10, batch_size=2, batch=True)
I used %timeit in a Jupyter notebook. What am I doing wrong?
The problem here is that the only operation in the function is a tf.py_func() op. This op calls back into the local Python interpreter to run a function in the same process. Increasing num_parallel_calls will increase the number of TensorFlow threads that attempt to call back into Python concurrently. However, Python has something called the "Global Interpreter Lock" that prevents more than one thread from executing code at once. As a result, all but one of these multiple parallel calls will be blocked waiting to acquire the Global Interpreter Lock, and there will be almost no parallel speedup (and perhaps even a slight slowdown).
Your code example didn't include the definition of the squarer() function, but it might be possible to replace tf.py_func() with pure TensorFlow ops, which are implemented in C++, and can execute in parallel. For example—and just guessing by the name—you could replace it with an invocation of tf.square(x), and you might then enjoy some parallel speedup.
Note, however, that if the amount of work in the function is small, like squaring a single integer, the speedup might not be very large. Parallelism is more useful for heavier operations, like parsing a TFRecord with tf.parse_single_example() or performing some image distortions as part of a data augmentation pipeline.
The reason may be that squarer takes less time than the parallelization overhead. I modified the code by adding a squarer function that takes 2 seconds. Then the num_parallel_calls parameter works as expected. Here is the complete code:
import tensorflow as tf
import time

def squarer(x):
    # Busy-wait until 2 seconds of wall-clock time have passed.
    t0 = time.time()
    while time.time() - t0 < 2:
        y = x ** 2
    return y

def test_two_custom_function_parallelism(num_parallel_calls=1, batch=False,
                                         batch_size=1, repeat=1, num_iterations=10):
    start = time.time()
    # The dataset source was lost in the post; Dataset.range(1000) is an assumption.
    dataset_x = tf.data.Dataset.range(1000).map(
        lambda x: tf.py_func(squarer, [x], [tf.int64]),
        num_parallel_calls=num_parallel_calls).repeat(repeat)
    # dataset_x = dataset_x.prefetch(4)
    if batch:
        dataset_x = dataset_x.batch(batch_size)
    dataset_y = tf.data.Dataset.range(1000).map(
        lambda x: tf.py_func(squarer, [x], [tf.int64]),
        num_parallel_calls=num_parallel_calls).repeat(repeat)
    # dataset_y = dataset_y.prefetch(4)
    if batch:
        dataset_y = dataset_y.batch(batch_size)
    X = dataset_x.make_one_shot_iterator().get_next()
    Y = dataset_y.make_one_shot_iterator().get_next()
    with tf.Session() as sess:
        i = 0
        while True:
            t0 = time.time()
            try:
                res = sess.run([X, Y])
                print(res)
                i += 1
                if i == num_iterations:
                    break
            except tf.errors.OutOfRangeError as e:
                break
            print('step elapse: %.4f' % (time.time() - t0))
    print('total time: %.4f' % (time.time() - start))

test_two_custom_function_parallelism(
    num_iterations=4, num_parallel_calls=1, batch_size=2, batch=True, repeat=10)
test_two_custom_function_parallelism(
    num_iterations=4, num_parallel_calls=10, batch_size=2, batch=True, repeat=10)
the output is:
[(array([0, 1]),), (array([0, 1]),)]
step elapse: 4.0204
[(array([4, 9]),), (array([4, 9]),)]
step elapse: 4.0836
[(array([16, 25]),), (array([16, 25]),)]
step elapse: 4.1529
[(array([36, 49]),), (array([36, 49]),)]
total time: 16.3374
[(array([0, 1]),), (array([0, 1]),)]
step elapse: 2.2139
[(array([4, 9]),), (array([4, 9]),)]
step elapse: 0.0585
[(array([16, 25]),), (array([16, 25]),)]
step elapse: 0.0469
[(array([36, 49]),), (array([36, 49]),)]
total time: 2.5317
So I am confused by the effect of the "Global Interpreter Lock" mentioned by @mrry.
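One possible explanation, offered as a guess rather than a confirmed diagnosis: the modified squarer spins until 2 seconds of wall-clock time have passed instead of doing a fixed amount of work. Threads that take turns holding the GIL all watch the same clock, so ten of them can each finish after about 2 seconds even though they are not computing in parallel. The standard-library sketch below (all function names are made up here) contrasts a wall-clock-bounded loop with a fixed-work loop on a thread pool. On CPython with the GIL, I would expect the first to overlap almost perfectly and the second to show little or no speedup, though exact timings depend on the machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def spin_for(seconds):
    """Busy-loop until `seconds` of wall-clock time pass; return iterations done."""
    t0, n = time.time(), 0
    while time.time() - t0 < seconds:
        n += 1
    return n

def count_to(n):
    """Do a fixed amount of pure-Python work; return the final count."""
    i = 0
    while i < n:
        i += 1
    return i

def timed_map(fn, args, workers):
    """Run fn over args on a thread pool; return (results, elapsed seconds)."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fn, args))
    return results, time.time() - start

for workers in (1, 4):
    _, t_spin = timed_map(spin_for, [0.2] * 4, workers)
    _, t_work = timed_map(count_to, [300_000] * 4, workers)
    print(f"workers={workers}: wall-clock loop {t_spin:.2f}s, fixed work {t_work:.2f}s")
```

If this reading is right, the 2-second squarer measures how well threads can wait concurrently, not how well they can compute concurrently, so it does not contradict the GIL explanation.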
I set up my own version of map that works similarly to TensorFlow's, but which uses multiple CPUs for py_functions.
Instead of
mapped_dataset = my_dataset.map(lambda x: tf.py_function(my_function, [x], [tf.float64]), num_parallel_calls=16)
with the below code, you can get a CPU parallel py_function version using
mapped_dataset = map_py_function_to_dataset(my_dataset, my_function, number_of_parallel_calls=16)
(The output type(s) for the py_function can also be specified if it's not a single tf.float32)
Internally, this creates a pool of multiprocessing workers. It still uses the regular, GIL-limited TensorFlow map, but only to pass each input to a worker and get the output back. The workers process the data in parallel on the CPU.
The function passed needs to be picklable to work with the multiprocessing pool. This should work for most cases, but some closures or whatnot may fail. Packages like dill might loosen this restriction, but I haven't looked into that.
If you pass an object's method as the function, you also need to be careful about how the object is duplicated across processes (each process will have its own copy of the object, so you can't rely on the attributes being shared).
As long as these considerations are kept in mind, this code should work for many cases.
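A quick way to check the picklability requirement before handing a function to the pool is to try pickling it. This is a small standard-library sketch; is_picklable is a hypothetical helper and not part of multiprocessing or TensorFlow.

```python
import math
import pickle

def is_picklable(obj):
    """Return True if `obj` survives pickle.dumps, which multiprocessing relies on."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

print(is_picklable(math.sqrt))     # importable by name, so it can be pickled
print(is_picklable(lambda x: x))   # lambdas cannot be looked up by name
```

Functions defined at module level generally pass this check, while lambdas and functions defined inside other functions generally do not.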
"""
Code for TensorFlow's `Dataset` class which allows for multiprocessing in CPU map functions.
"""
import multiprocessing
from typing import Callable, Union, List
import signal
import tensorflow as tf


class PyMapper:
    """A class which allows for mapping a py_function to a TensorFlow dataset in parallel on CPU."""
    def __init__(self, map_function: Callable, number_of_parallel_calls: int):
        self.map_function = map_function
        self.number_of_parallel_calls = number_of_parallel_calls
        self.pool = multiprocessing.Pool(self.number_of_parallel_calls, self.pool_worker_initializer)

    @staticmethod
    def pool_worker_initializer():
        """Used to initialize each worker process."""
        # Corrects bug where worker instances catch and throw away keyboard interrupts.
        signal.signal(signal.SIGINT, signal.SIG_IGN)

    def send_to_map_pool(self, element_tensor):
        """
        Sends the tensor element to the pool for processing.
        :param element_tensor: The element to be processed by the pool.
        :return: The output of the map function on the element.
        """
        result = self.pool.apply_async(self.map_function, (element_tensor,))
        mapped_element = result.get()
        return mapped_element

    def map_to_dataset(self, dataset: tf.data.Dataset,
                       output_types: Union[List[tf.dtypes.DType], tf.dtypes.DType] = tf.float32):
        """
        Maps the map function to the passed dataset.
        :param dataset: The dataset to apply the map function to.
        :param output_types: The TensorFlow output types of the function to convert to.
        :return: The mapped dataset.
        """
        def map_py_function(*args):
            """A py_function wrapper for the map function."""
            return tf.py_function(self.send_to_map_pool, args, output_types)
        return dataset.map(map_py_function, self.number_of_parallel_calls)


def map_py_function_to_dataset(dataset: tf.data.Dataset, map_function: Callable, number_of_parallel_calls: int,
                               output_types: Union[List[tf.dtypes.DType], tf.dtypes.DType] = tf.float32
                               ) -> tf.data.Dataset:
    """
    A one line wrapper to allow mapping a parallel py function to a dataset.
    :param dataset: The dataset whose elements the mapping function will be applied to.
    :param map_function: The function to map to the dataset.
    :param number_of_parallel_calls: The number of parallel calls of the mapping function.
    :param output_types: The TensorFlow output types of the function to convert to.
    :return: The mapped dataset.
    """
    py_mapper = PyMapper(map_function=map_function, number_of_parallel_calls=number_of_parallel_calls)
    mapped_dataset = py_mapper.map_to_dataset(dataset=dataset, output_types=output_types)
    return mapped_dataset