Parallelism isn't reducing the time in dataset map - tensorflow

TensorFlow's Dataset.map() supports parallel calls, but I'm seeing no improvement when passing num_parallel_calls to map: the run time is the same with num_parallel_calls=1 and num_parallel_calls=10. Here is a simple example:
import time
import tensorflow as tf

def test_two_custom_function_parallelism(num_parallel_calls=1, batch=False,
                                         batch_size=1, repeat=1, num_iterations=10):
    tf.reset_default_graph()
    start = time.time()
    dataset_x = tf.data.Dataset.range(1000).map(
        lambda x: tf.py_func(squarer, [x], [tf.int64]),
        num_parallel_calls=num_parallel_calls).repeat(repeat)
    if batch:
        dataset_x = dataset_x.batch(batch_size)
    dataset_y = tf.data.Dataset.range(1000).map(
        lambda x: tf.py_func(squarer, [x], [tf.int64]),
        num_parallel_calls=num_parallel_calls).repeat(repeat)
    if batch:
        dataset_y = dataset_y.batch(batch_size)
    X = dataset_x.make_one_shot_iterator().get_next()
    Y = dataset_y.make_one_shot_iterator().get_next()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        i = 0
        while True:
            try:
                res = sess.run([X, Y])
                i += 1
                if i == num_iterations:
                    break
            except tf.errors.OutOfRangeError as e:
                pass
Here are the timings:

%timeit test_two_custom_function_parallelism(num_iterations=1000, num_parallel_calls=2, batch_size=2, batch=True)
370 ms

%timeit test_two_custom_function_parallelism(num_iterations=1000, num_parallel_calls=5, batch_size=2, batch=True)
372 ms

%timeit test_two_custom_function_parallelism(num_iterations=1000, num_parallel_calls=10, batch_size=2, batch=True)
384 ms
I used %timeit in a Jupyter notebook. What am I doing wrong?

The problem here is that the only operation in the Dataset.map() function is a tf.py_func() op. This op calls back into the local Python interpreter to run a function in the same process. Increasing num_parallel_calls will increase the number of TensorFlow threads that attempt to call back into Python concurrently. However, Python has something called the "Global Interpreter Lock" that prevents more than one thread from executing code at once. As a result, all but one of these multiple parallel calls will be blocked waiting to acquire the Global Interpreter Lock, and there will be almost no parallel speedup (and perhaps even a slight slowdown).
Your code example didn't include the definition of the squarer() function, but it might be possible to replace tf.py_func() with pure TensorFlow ops, which are implemented in C++, and can execute in parallel. For example—and just guessing by the name—you could replace it with an invocation of tf.square(x), and you might then enjoy some parallel speedup.
Note however that if the amount of work in the function is small, like squaring a single integer, the speedup might not be very large. Parallel Dataset.map() is more useful for heavier operations, like parsing a TFRecord with tf.parse_single_example() or performing some image distortions as part of a data augmentation pipeline.
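For illustration, a pure-TensorFlow version of the map in the question might look like the sketch below. This assumes squarer really does nothing more than square its input; because tf.square runs in C++, the parallel calls are not serialized by the GIL.

# Sketch only: replaces the tf.py_func call with a native TensorFlow op,
# assuming `squarer` simply squares its input.
dataset_x = (tf.data.Dataset.range(1000)
             .map(tf.square, num_parallel_calls=10)
             .repeat(repeat))  # `repeat` as in the question's function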

The reason may be that squarer costs less time than the overhead. I modified the code so that the squarer function takes about 2 seconds, and then the num_parallel_calls parameter works as expected. Here is the complete code:
import tensorflow as tf
import time

def squarer(x):
    t0 = time.time()
    while time.time() - t0 < 2:
        y = x ** 2
    return y

def test_two_custom_function_parallelism(num_parallel_calls=1,
                                         batch=False,
                                         batch_size=1,
                                         repeat=1,
                                         num_iterations=10):
    tf.reset_default_graph()
    start = time.time()
    dataset_x = tf.data.Dataset.range(1000).map(
        lambda x: tf.py_func(squarer, [x], [tf.int64]),
        num_parallel_calls=num_parallel_calls).repeat(repeat)
    # dataset_x = dataset_x.prefetch(4)
    if batch:
        dataset_x = dataset_x.batch(batch_size)
    dataset_y = tf.data.Dataset.range(1000).map(
        lambda x: tf.py_func(squarer, [x], [tf.int64]),
        num_parallel_calls=num_parallel_calls).repeat(repeat)
    # dataset_y = dataset_y.prefetch(4)
    if batch:
        dataset_y = dataset_y.batch(batch_size)
    X = dataset_x.make_one_shot_iterator().get_next()
    Y = dataset_y.make_one_shot_iterator().get_next()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        i = 0
        while True:
            t0 = time.time()
            try:
                res = sess.run([X, Y])
                print(res)
                i += 1
                if i == num_iterations:
                    break
            except tf.errors.OutOfRangeError as e:
                print(i)
                break
            print('step elapse: %.4f' % (time.time() - t0))
    print('total time: %.4f' % (time.time() - start))

test_two_custom_function_parallelism(
    num_iterations=4, num_parallel_calls=1, batch_size=2, batch=True, repeat=10)
test_two_custom_function_parallelism(
    num_iterations=4, num_parallel_calls=10, batch_size=2, batch=True, repeat=10)
The output is:
[(array([0, 1]),), (array([0, 1]),)]
step elapse: 4.0204
[(array([4, 9]),), (array([4, 9]),)]
step elapse: 4.0836
[(array([16, 25]),), (array([16, 25]),)]
step elapse: 4.1529
[(array([36, 49]),), (array([36, 49]),)]
total time: 16.3374
[(array([0, 1]),), (array([0, 1]),)]
step elapse: 2.2139
[(array([4, 9]),), (array([4, 9]),)]
step elapse: 0.0585
[(array([16, 25]),), (array([16, 25]),)]
step elapse: 0.0469
[(array([36, 49]),), (array([36, 49]),)]
total time: 2.5317
So I am confused about the effect of the "Global Interpreter Lock" mentioned by @mrry.

I set up my own version of map to get something similar to TensorFlow's Dataset.map, but one which will use multiple CPUs for py_functions.
Usage
Instead of
mapped_dataset = my_dataset.map(lambda x: tf.py_function(my_function, [x], [tf.float64]), num_parallel_calls=16)
with the below code, you can get a CPU parallel py_function version using
mapped_dataset = map_py_function_to_dataset(my_dataset, my_function, number_of_parallel_calls=16)
(The output type(s) for the py_function can also be specified if it's not a single tf.float32)
Internally, this creates a pool of multiprocessing workers. It still uses the regular GIL-limited TensorFlow map, but only to pass the input to a worker and get the output back. The actual processing of the data happens in parallel across the worker processes on the CPU.
Caveats
The function passed needs to be picklable to work with the multiprocessing pool. This should work for most cases, but some closures or whatnot may fail. Packages like dill might loosen this restriction, but I haven't looked into that.
If you pass an object's method as the function, you also need to be careful about how the object is duplicated across processes (each process will have its own copy of the object, so you can't rely on the attributes being shared).
As long as these considerations are kept in mind, this code should work for many cases.
Code
"""
Code for TensorFlow's `Dataset` class which allows for multiprocessing in CPU map functions.
"""
import multiprocessing
from typing import Callable, Union, List
import signal
import tensorflow as tf
class PyMapper:
"""
A class which allows for mapping a py_function to a TensorFlow dataset in parallel on CPU.
"""
def __init__(self, map_function: Callable, number_of_parallel_calls: int):
self.map_function = map_function
self.number_of_parallel_calls = number_of_parallel_calls
self.pool = multiprocessing.Pool(self.number_of_parallel_calls, self.pool_worker_initializer)
#staticmethod
def pool_worker_initializer():
"""
Used to initialize each worker process.
"""
# Corrects bug where worker instances catch and throw away keyboard interrupts.
signal.signal(signal.SIGINT, signal.SIG_IGN)
def send_to_map_pool(self, element_tensor):
"""
Sends the tensor element to the pool for processing.
:param element_tensor: The element to be processed by the pool.
:return: The output of the map function on the element.
"""
result = self.pool.apply_async(self.map_function, (element_tensor,))
mapped_element = result.get()
return mapped_element
def map_to_dataset(self, dataset: tf.data.Dataset,
output_types: Union[List[tf.dtypes.DType], tf.dtypes.DType] = tf.float32):
"""
Maps the map function to the passed dataset.
:param dataset: The dataset to apply the map function to.
:param output_types: The TensorFlow output types of the function to convert to.
:return: The mapped dataset.
"""
def map_py_function(*args):
"""A py_function wrapper for the map function."""
return tf.py_function(self.send_to_map_pool, args, output_types)
return dataset.map(map_py_function, self.number_of_parallel_calls)
def map_py_function_to_dataset(dataset: tf.data.Dataset, map_function: Callable, number_of_parallel_calls: int,
output_types: Union[List[tf.dtypes.DType], tf.dtypes.DType] = tf.float32
) -> tf.data.Dataset:
"""
A one line wrapper to allow mapping a parallel py function to a dataset.
:param dataset: The dataset whose elements the mapping function will be applied to.
:param map_function: The function to map to the dataset.
:param number_of_parallel_calls: The number of parallel calls of the mapping function.
:param output_types: The TensorFlow output types of the function to convert to.
:return: The mapped dataset.
"""
py_mapper = PyMapper(map_function=map_function, number_of_parallel_calls=number_of_parallel_calls)
mapped_dataset = py_mapper.map_to_dataset(dataset=dataset, output_types=output_types)
return mapped_dataset
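For completeness, here is a hypothetical end-to-end usage sketch following the usage described above. The dataset, the square function, and the worker count are made up for illustration and untested, not part of the original code.

import numpy as np
import tensorflow as tf

def square(x):
    # Plain NumPy work; this runs inside one of the pool's worker processes.
    return np.square(x)

dataset = tf.data.Dataset.range(100).map(lambda x: tf.cast(x, tf.float32))
# Apply `square` to each element using 16 worker processes (output defaults to tf.float32).
mapped_dataset = map_py_function_to_dataset(dataset, square, number_of_parallel_calls=16)

for element in mapped_dataset.take(3):
    print(element.numpy())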

Related

Why is GPflow's Scipy optimizer incompatible with decorating the optimization step with tf.function?

I am supplying different minibatches to optimize a GPflow model (SVGP). If I decorate the optimization_step with tf.function I get the following error:
NotImplementedError: Cannot convert a symbolic Tensor (concat:0) to a
numpy array. This error may indicate that you're trying to pass a
Tensor to a NumPy call, which is not supported
In order for the optimizer to run I had to remove the tf.function decorator, losing the speed-up advantages. What do I need to change so that I can keep using the tf.function decorator?
The xAndY inputs are numpy arrays; their shapes and types are:
type(xAndY)
Out[71]: tuple
xAndY[0].shape
Out[72]: (245760, 2)
xAndY[1].shape
Out[73]: (245760, 1)
type(xAndY[0])
Out[74]: numpy.ndarray
def run_optimizer_on_minibatch_size(model, iterations, minibatch_size, xAndY):
    """
    Utility function running a Scipy optimizer

    :param model: GPflow model
    :param iterations: number of iterations
    """
    N = xAndY[0].shape[0]
    tensor_data = tuple(map(tf.convert_to_tensor, xAndY))
    train_dataset = tf.data.Dataset.from_tensor_slices(tensor_data).repeat().shuffle(N)
    logf = []
    train_iter = iter(train_dataset.batch(minibatch_size))
    training_loss = model.training_loss_closure(train_iter, compile=True)
    optimizer = gpflow.optimizers.Scipy()

    # @tf.function  # had to remove this decorator
    def optimization_step():
        optimizer.minimize(training_loss, model.trainable_variables)

    # step = 0
    for step in range(iterations):
        optimization_step()
        if step % 10 == 0:
            elbo = -training_loss().numpy()
            logf.append(elbo)
            print(elbo)
    return logf

from gpflow.ci_utils import ci_niter
maxiter = ci_niter(20000)
logf = run_optimizer_on_minibatch_size(m, maxiter, minibatch_size, (X, Y))
GPflow's gpflow.optimizers.Scipy() is a wrapper around Scipy's minimize(), and as it calls into non-TensorFlow operations, you cannot wrap it in tf.function. Moreover, the optimizers implemented in Scipy's minimize are second-order methods that assume that your gradients are not stochastic, and aren't compatible with minibatching.
If you want to do full-batch optimization with Scipy: The minimize() method of gpflow.optimizers.Scipy(), by default, does wrap the objective and gradient computation inside tf.function (see its compile argument with default True). It also does the full optimization, so you only have to call the minimize() method once (by default it runs until convergence or failure to continue optimization; you can supply a maximum number of iterations using the options=dict(maxiter=1000) argument).
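Concretely, the full-batch route might look something like this sketch, assuming the whole dataset fits in memory as the tuple (X, Y) and model is the GPflow model from the question:

# Sketch of full-batch optimization with GPflow's Scipy wrapper.
optimizer = gpflow.optimizers.Scipy()
full_batch_loss = model.training_loss_closure((X, Y))  # compiled with tf.function by default
optimizer.minimize(full_batch_loss,
                   model.trainable_variables,
                   options=dict(maxiter=1000))  # a single call runs the whole optimization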
If you want to use mini-batching: simply use one of the TensorFlow optimizers, such as tf.optimizers.Adam(), and then your code should run fine including the @tf.function decorator on your optimization_step() function (and in that case you do need to call it in a loop as in your example).
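And the mini-batch route, keeping the decorator, might look like this sketch; it reuses training_loss, model, and iterations from the question, and the learning rate is only an illustrative value:

optimizer = tf.optimizers.Adam(learning_rate=0.001)

@tf.function  # fine here, since Adam is a native TensorFlow optimizer
def optimization_step():
    optimizer.minimize(training_loss, model.trainable_variables)

for step in range(iterations):
    optimization_step()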

Linear regression using tf.data with federated core API and data on remote execution client

I'm trying to do a demonstration of federated learning with TFF, and I've got this far, but the error messages I get are just too confusing. The important part is that I want to demonstrate that the data lives on the remote execution engine, which is why I use tf.data.experimental.CsvDataset; I could not find anything similar in any tutorial. I've managed to do a mini experiment where the data was read on the remote site, but I can't get this larger example to work.
Currently it complains about p = x * w + b, I believe because x is not a federated value. I've tried many, many variations and just can't get it to work. The Salary.csv is from the tutorial here: https://www.kaggle.com/karthickveerakumar/salary-data-simple-linear-regression?select=Salary_Data.csv
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_federated as tff
import grpc

ip_address = '127.0.0.1'
port = 8000

channels = [grpc.insecure_channel(f'{ip_address}:{port}') for _ in range(10)]
tff.backends.native.set_remote_execution_context(channels, rpc_mode='STREAMING')

@tf.function()
def load_data():
    return tf.data.experimental.CsvDataset('data/Salary.csv', [tf.float64, tf.float64], header=True)

W_TYPE = tff.FederatedType(tf.float64, tff.CLIENTS, all_equal=True)
B_TYPE = tff.FederatedType(tf.float64, tff.CLIENTS, all_equal=True)

@tff.federated_computation(W_TYPE, B_TYPE)
def train(w, b):
    data = load_data()
    loss = tf.Variable(0.0, dtype=tf.float64)
    with tf.GradientTape() as tape:
        for x, y in data:
            p = x * w + b
            loss = loss + tf.square(p - y)
    g_w, g_b = tape.gradient(loss, [w, b])
    w.assign_sub(0.0001 * g_w)
    b.assign_sub(0.0001 * g_b)
    return [w, b]

w = tf.Variable(2.0, dtype=tf.float64)
b = tf.Variable(3.0, dtype=tf.float64)

for _ in range(1000):
    w, b = train(data, tff.federated_broadcast(w), tff.federated_broadcast(b))
TFF does not support mixing federated computation with TensorFlow. The usual paradigm in TFF goes something like:
1. Write your local functions in TensorFlow, using TFF's @tff.tf_computation decorator.
2. Inside the body of a tff.federated_computation, invoke intrinsic and communication operators (like tff.federated_broadcast above), and tff.federated_map your TF computations to federated elements.
There are many subtleties that have led to the design of a pattern like the above, but the main one is the desire for all logic to be expressible in a serialized representation, so that it can be split up and sent to the various devices in your federated system during true deployment. This is a little hand-wavey; you can think of TFF's decorators as defining a context in which the code you write will be traced, so that it can run in a totally platform-independent way.
Using this pattern, your computation above would look something like:
# These lines set up the TFF execution environment, and importantly are not used
# during the function definitions below--only upon *invocation* of your function
# in the outer Python context will the execution environment be used.
channels = [grpc.insecure_channel(f'{ip_address}:{port}') for _ in range(10)]
tff.backends.native.set_remote_execution_context(channels, rpc_mode='STREAMING')

@tf.function()
def load_data():
    # From TFF's perspective, the code written here will be represented as a
    # "blob of TensorFlow" that can be shipped out anywhere. We don't actually
    # *need* a tf_computation decorator here, since this will be invoked in the
    # body of another tf_computation and the logic will get embedded there, but
    # we could put one here if we wanted.
    return tf.data.experimental.CsvDataset(
        'data/Salary.csv', [tf.float64, tf.float64], header=True)

@tff.tf_computation(tf.float64, tf.float64)
@tf.function
def local_train(w, b):
    data = load_data()
    # We don't need a variable for the loss here, and though TFF will allow you
    # to construct a variable to use as a temporary in this fashion, tf.function
    # won't. We're pushing the TF team on that one ;).
    loss = tf.constant(0.0, dtype=tf.float64)
    with tf.GradientTape() as tape:
        # We must be inside a tf.function decorator, or the eager Python runtime,
        # to iterate over a tf.data.Dataset in this way.
        for x, y in data:
            p = x * w + b
            loss = loss + tf.square(p - y)
    g_w, g_b = tape.gradient(loss, [w, b])
    w = w - 0.0001 * g_w
    b = b - 0.0001 * g_b
    return w, b

# Making a symbol to represent these types is always a good idea
W_TYPE = tff.FederatedType(tf.float64, tff.CLIENTS, all_equal=True)
B_TYPE = tff.FederatedType(tf.float64, tff.CLIENTS, all_equal=True)

@tff.federated_computation(W_TYPE, B_TYPE)
def train(w, b):
    # Here w and b are elements of federated type, placed at clients.
    # We map the training function over these values.
    # If they were SERVER-placed instead, we would federated_broadcast them
    # out to the clients.
    updated_w, updated_b = tff.federated_map(local_train, (w, b))
    return updated_w, updated_b

# TFF's Python runtime represents federated values placed at clients as a list of
# values, with as many elements as there are clients. This number will be inferred
# by the runtime and used to distribute the work. We also technically don't need
# variables here, but I believe they won't cause any problem.
clients_placed_w = [tf.Variable(2.0, dtype=tf.float64)]
clients_placed_b = [tf.Variable(3.0, dtype=tf.float64)]

# We could move the execution environment setup lines all the way down here,
# no problem.
for _ in range(1000):
    clients_placed_w, clients_placed_b = train(clients_placed_w, clients_placed_b)

Tensorflow2.x custom data generator with multiprocessing

I just upgraded to TensorFlow 2.3 and I want to make my own data generator for training. With TensorFlow 1.x, I did this:
def get_data_generator(test_flag):
    item_list = load_item_list(test_flag)
    print('data loaded')
    while True:
        X = []
        Y = []
        for _ in range(BATCH_SIZE):
            x, y = get_random_augmented_sample(item_list)
            X.append(x)
            Y.append(y)
        yield np.asarray(X), np.asarray(Y)

data_generator_train = get_data_generator(False)
data_generator_test = get_data_generator(True)
model.fit_generator(data_generator_train, validation_data=data_generator_test,
                    epochs=10000, verbose=2,
                    use_multiprocessing=True,
                    workers=8,
                    validation_steps=100,
                    steps_per_epoch=500,
                    )
This code worked fine with TensorFlow 1.x: 8 processes were created in the system, the CPU and GPU were fully utilized, and "data loaded" was printed 8 times.
With TensorFlow 2.3 I get this warning:
WARNING: tensorflow: multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.
"data loaded" was printed once(should 8 times). GPU is not fully utilized. It also have memory leak every epoch, so traning will stops after several epochs. use_multiprocessing flag did not help.
How to make a generator / iterator in tensorflow(keras) 2.x that can easily be parallelized across multiple CPU processes? Deadlocks and data order are not important.
With a tf.data pipeline, there are several spots where you can parallelize. Depending on how your data are stored and read, you can parallelize reading. You can also parallelize augmentation, and you can prefetch data as you train, so your GPU (or other hardware) is never hungry for data.
In the code below, I have demonstrated how you can parallelize augmentation and add prefetching.
import numpy as np
import tensorflow as tf

x_shape = (32, 32, 3)
y_shape = ()  # A single item (not array).
classes = 10

# This is tf.data.experimental.AUTOTUNE in older tensorflow.
AUTOTUNE = tf.data.AUTOTUNE

def generator_fn(n_samples):
    """Return a function that takes no arguments and returns a generator."""
    def generator():
        for i in range(n_samples):
            # Synthesize an image and a class label.
            x = np.random.random_sample(x_shape).astype(np.float32)
            y = np.random.randint(0, classes, size=y_shape, dtype=np.int32)
            yield x, y
    return generator

def augment(x, y):
    return x * tf.random.normal(shape=x_shape), y

samples = 10
batch_size = 5
epochs = 2

# Create dataset.
gen = generator_fn(n_samples=samples)
dataset = tf.data.Dataset.from_generator(
    generator=gen,
    output_types=(np.float32, np.int32),
    output_shapes=(x_shape, y_shape)
)
# Parallelize the augmentation.
dataset = dataset.map(
    augment,
    num_parallel_calls=AUTOTUNE,
    # Order does not matter.
    deterministic=False
)
dataset = dataset.batch(batch_size, drop_remainder=True)
# Prefetch some batches.
dataset = dataset.prefetch(AUTOTUNE)

# Prepare model.
model = tf.keras.applications.VGG16(weights=None, input_shape=x_shape, classes=classes)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Train. Do not specify batch size because the dataset takes care of that.
model.fit(dataset, epochs=epochs)

Passing batches to TensorFlow Structural Time Series

I am creating a model for time series prediction with TensorFlow Probability, following this tutorial. In these examples I need to pass all the data at once, but this is prohibitive when dealing with big data (my case). How should I pass batches, or any other kind of lazily loaded data, to this tool?
This is a general problem for most probabilistic inference cases: using most non-full-batch gradients will yield biased samples.
You should be able to write a target_log_prob_fn with a tf.custom_gradient to iterate over a tf.data.Dataset iterator. Since the target logprob is a scalar, you can accumulate both gradients and logprobs as the function proceeds over all minibatches in the dataset.
ds = build_dataset()

def build_model(params):
    return time_series_model(...)

@tf.custom_gradient
@tf.function  # autograph should turn the dataset loop into a tf.while_loop.
def log_prob(*params):
    total_lp = 0.
    total_grad = tf.nest.map_structure(tf.zeros_like, params)
    for batch in ds:
        lp, grad = tfp.math.value_and_gradient(
            lambda *p: build_model(p).log_prob(batch),
            params)
        total_lp += lp
        total_grad = tf.nest.map_structure(lambda x, y: x + y, total_grad, grad)
    return total_lp, lambda dy: tf.nest.map_structure(lambda g: dy * g, total_grad)
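As a hypothetical follow-on, the accumulated log_prob could then be used as the target log density of an MCMC kernel. The sampler choice, step size, and initial_params below are illustrative assumptions, not part of the answer above:

import tensorflow_probability as tfp  # assumed import, matching the snippet above

# `initial_params` is a placeholder: it should have the same structure as the
# `params` that `log_prob` expects.
kernel = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=log_prob,
    step_size=0.01,
    num_leapfrog_steps=3)
samples = tfp.mcmc.sample_chain(
    num_results=500,
    current_state=initial_params,
    kernel=kernel,
    trace_fn=None)  # with trace_fn=None, only the chain states are returned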

Computing exact moving average over multiple batches in tensorflow

During training, I would like to write the average loss over the last N mini-batches to SummaryWriter as a way of smoothing the very noisy batch loss. It's easy to compute this in python and print it, but I would like to add this to a summary so that I can see it in tensorboard. Here's an overly simplified example of what I'm doing now.
losses = []
for i in range(10000):
    _, loss = session.run([train_op, loss_op])
    losses.append(loss)
    if i % 100 == 0:
        # How to produce a scalar_summary here?
        print(sum(losses) / len(losses))
        losses = []
I'm aware that I could use ExponentialMovingAverage with a decay of 1.0, but I would still need some way to reset this every N batches. Really, if all I care about is visualizing loss in tensorboard, the reset probably isn't necessary, but I'm still curious how one would go about aggregating across batches for other reasons (e.g. computing total accuracy over a test dataset that is too big to run in a single batch).
You can manually construct the Summary object, like this:
from tensorflow.core.framework import summary_pb2

def make_summary(name, val):
    return summary_pb2.Summary(value=[summary_pb2.Summary.Value(tag=name,
                                                                simple_value=val)])

summary_writer.add_summary(make_summary('myvalue', myvalue), step)
Passing data from python to a graph function like tf.scalar_summary can be done using a placeholder and feed_dict.
average_pl = tf.placeholder(tf.float32)
average_summary = tf.summary.scalar("average_loss", average_pl)
writer = tf.summary.FileWriter("/tmp/mnist_logs", sess.graph_def)

losses = []
for i in range(10000):
    _, loss = session.run([train_op, loss_op])
    losses.append(loss)
    if i % 100 == 0:
        feed = {average_pl: sum(losses) / len(losses)}
        summary_str = sess.run(average_summary, feed_dict=feed)
        writer.add_summary(summary_str, i)
        losses = []
I haven't tried it, and this was hastily copied from the visualizing data how-to, but I expect something like this would work.