TensorFlow runs extremely slow inside a Python for loop

The following code uses the TensorFlow library, and it runs terribly slowly compared to an equivalent NumPy implementation. I am aware that I am calling a function that uses TensorFlow inside a Python for loop (which I will parallelize with Python multiprocessing later), but even as written the code runs extremely slowly.
Could someone please help me make this code run faster? Thanks.
from math import *
import numpy as np
import sys
from multiprocessing import Pool
import tensorflow as tf

def Trajectory_Fun(tspan, a, b, session=None, server=None):
    # Open TensorFlow session
    if session == None:
        if server == None:
            sess = tf.Session()
        else:
            sess = tf.Session(server.target)
    else:
        sess = session
    B = np.zeros(np.size(tspan), dtype=np.float64)
    B[0] = b
    for i, t in enumerate(tspan):
        r = np.random.rand(1)
        if r > a:
            c = sess.run(tf.trace(tf.random_normal((4, 4), r, 1.0)))
        else:
            c = 0.0  # sess.run(tf.trace(tf.random_normal((4, 4), 0.0, 1.0)))
        B[i] = c
    # Close TensorFlow session
    if session == None:
        sess.close()
    return B

def main(argv):
    # Parameters
    tspan = np.arange(0.0, 1000.0)
    a = 0.1
    b = 0.0
    # Run test program
    B = Trajectory_Fun(tspan, a, b, None, None)
    print 'Done!'

if __name__ == "__main__":
    main(sys.argv[1:])

As stated in your question, this program will give poor performance because it creates several new TensorFlow graph nodes per operation. The underlying assumption in TensorFlow is (approximately) that you'll build a graph once and then call sess.run() on (various parts of) it multiple times. The first time you run a graph is relatively expensive, because TensorFlow has to build various data structures and optimize the execution of the graph across multiple devices.
However, TensorFlow caches this work, so subsequent uses are much cheaper.
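As a rough illustration of that caching (a sketch only; the shape and repeat count are arbitrary), timing the first sess.run() against later runs of the same fetch shows the one-off setup cost:

import time
import tensorflow as tf

x = tf.random_normal((1000, 1000))
y = tf.matmul(x, x)

with tf.Session() as sess:
    t0 = time.time()
    sess.run(y)        # first call pays for graph pruning, placement and kernel setup
    print('first run:   %.4f s' % (time.time() - t0))
    t0 = time.time()
    for _ in range(10):
        sess.run(y)    # later calls reuse the cached, optimised execution plan
    print('10 more runs: %.4f s' % (time.time() - t0))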
You can make this program much faster by constructing the graph once and using (e.g.) a tf.placeholder() op to feed in the value that changes in each iteration. For example, the following should do the trick:
B = np.zeros(np.size(tspan), dtype=np.float64)
B[0] = b

# Define the TensorFlow graph once and reuse it in each iteration of the for loop.
r_placeholder = tf.placeholder(tf.float32, shape=[])
out_t = tf.trace(tf.random_normal((4, 4), r_placeholder, 1.0))

with tf.Session() as sess:
    for i, t in enumerate(tspan):
        r = np.random.rand(1)
        if r > a:
            c = sess.run(out_t, feed_dict={r_placeholder: r})
        else:
            c = 0.0
        B[i] = c
return B
You could potentially make this even more efficient by using a TensorFlow loop and making fewer calls to sess.run(), but the general principle is the same: reuse the same graph multiple times to get the benefit of TensorFlow.
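As a rough sketch of that direction (purely illustrative: it reuses tspan and a from the question, moves the draw of r into the graph, and folds the r > a branch into tf.cond, so the whole trajectory comes back from a single sess.run() call):

import numpy as np
import tensorflow as tf

n = int(np.size(tspan))
a_t = tf.constant(a, dtype=tf.float32)

def body(i, samples):
    r = tf.random_uniform([])                      # draw r inside the graph
    c = tf.cond(r > a_t,
                lambda: tf.trace(tf.random_normal((4, 4), r, 1.0)),
                lambda: tf.constant(0.0))
    return i + 1, tf.concat([samples, [c]], axis=0)

_, samples = tf.while_loop(
    lambda i, samples: i < n,
    body,
    [tf.constant(0), tf.zeros([0])],
    shape_invariants=[tf.TensorShape([]), tf.TensorShape([None])])

with tf.Session() as sess:
    B = sess.run(samples)                          # one run for all iterations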

Related

Linear regression using tf.data with federated core API and data on remote execution client

I'm trying to do a demonstration of federated learning with TFF, and I've got this far, but the error messages I get are just too confusing. The important part is that I want to demonstrate that the data lives in the remote engine, which is why I use tf.data.experimental.CsvDataset; I could not find anything similar in any tutorial. I've managed to do a mini experiment where the data was read at the remote site, but I can't get this larger example to work.
Currently it complains about 'p = x * w + b', I believe because x is not a federated_value. But I've tried many variations and just can't get it to work. The Salary.csv is from a tutorial here: https://www.kaggle.com/karthickveerakumar/salary-data-simple-linear-regression?select=Salary_Data.csv
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_federated as tff
import grpc

ip_address = '127.0.0.1'
port = 8000

channels = [grpc.insecure_channel(f'{ip_address}:{port}') for _ in range(10)]
tff.backends.native.set_remote_execution_context(channels, rpc_mode='STREAMING')

@tf.function()
def load_data():
    return tf.data.experimental.CsvDataset('data/Salary.csv', [tf.float64, tf.float64], header=True)

W_TYPE = tff.FederatedType(tf.float64, tff.CLIENTS, all_equal=True)
B_TYPE = tff.FederatedType(tf.float64, tff.CLIENTS, all_equal=True)

@tff.federated_computation(W_TYPE, B_TYPE)
def train(w, b):
    data = load_data()
    loss = tf.Variable(0.0, dtype=tf.float64)
    with tf.GradientTape() as tape:
        for x, y in data:
            p = x * w + b
            loss = loss + tf.square(p - y)
    g_w, g_b = tape.gradient(loss, [w, b])
    w.assign_sub(0.0001 * g_w)
    b.assign_sub(0.0001 * g_b)
    return [w, b]

w = tf.Variable(2.0, dtype=tf.float64)
b = tf.Variable(3.0, dtype=tf.float64)

for _ in range(1000):
    w, b = train(data, tff.federated_broadcast(w), tff.federated_broadcast(b))
TFF does not support mixing federated computation directly with TensorFlow code. The usual paradigm in TFF goes something like this:
Write your local functions in TensorFlow, using TFF's @tff.tf_computation decorator.
Inside the body of a tff.federated_computation, invoke intrinsic and communication operators (like tff.federated_broadcast above), and tff.federated_map your TF computations to federated elements.
There are many subtleties that have led to the design of a pattern like the above, but the main one is the desire for all logic to be expressible in a serialized representation, so that it can be split up and sent to the various devices in your federated system during true deployment. This is a little hand-wavey; you can think of TFF's decorators as defining a context in which the code you write will be traced, so that it can run in a totally platform-independent way.
Using this pattern, your computation above would look something like:
# These lines set up the TFF execution environment, and importantly are not used
# during the function definitions below--only upon *invocation* of your function
# in the outer Python context will the execution environment be used.
channels = [grpc.insecure_channel(f'{ip_address}:{port}') for _ in range(10)]
tff.backends.native.set_remote_execution_context(channels, rpc_mode='STREAMING')

@tf.function()
def load_data():
    # From TFF's perspective, the code written here will be represented as a
    # "blob of TensorFlow" that can be shipped out anywhere. We don't actually
    # *need* a tf_computation decorator here, since this will be invoked in the
    # body of another tf_computation and the logic will get embedded there, but
    # we could put one here if we wanted.
    return tf.data.experimental.CsvDataset(
        'data/Salary.csv', [tf.float64, tf.float64], header=True)

@tff.tf_computation(tf.float64, tf.float64)
@tf.function
def local_train(w, b):
    data = load_data()
    # We don't need a variable for the loss here, and though TFF will allow you
    # to construct a variable to use as a temporary in this fashion, tf.function
    # won't. We're pushing the TF team on that one ;).
    loss = tf.constant(0.0, dtype=tf.float64)
    with tf.GradientTape() as tape:
        # We must be inside a tf.function decorator, or the eager Python runtime,
        # to iterate over a tf.data.Dataset in this way.
        for x, y in data:
            p = x * w + b
            loss = loss + tf.square(p - y)
    g_w, g_b = tape.gradient(loss, [w, b])
    w = w - 0.0001 * g_w
    b = b - 0.0001 * g_b
    return w, b

# Making a symbol to represent these types is always a good idea
W_TYPE = tff.FederatedType(tf.float64, tff.CLIENTS, all_equal=True)
B_TYPE = tff.FederatedType(tf.float64, tff.CLIENTS, all_equal=True)

@tff.federated_computation(W_TYPE, B_TYPE)
def train(w, b):
    # Here w and b are elements of federated type, placed at clients.
    # We map the training function over these values.
    # If they were SERVER-placed instead, we would federated_broadcast them
    # out to the clients.
    updated_w, updated_b = tff.federated_map(local_train, (w, b))
    return updated_w, updated_b

# TFF's Python runtime represents federated values placed at clients as a list of
# values, with as many elements as there are clients. This number will be inferred
# by the runtime and used to distribute the work. We also technically don't need
# variables here, but I believe they won't cause any problem.
clients_placed_w = [tf.Variable(2.0, dtype=tf.float64)]
clients_placed_b = [tf.Variable(3.0, dtype=tf.float64)]

# We could move the execution environment setup lines all the way down here,
# no problem.
for _ in range(1000):
    clients_placed_w, clients_placed_b = train(clients_placed_w, clients_placed_b)

Tensorflow's while loop slow on GPU?

For unknown reasons, the following code runs about twice as slowly on the GPU as on the CPU. Could anyone explain why?
import time
import tensorflow as tf

with tf.device('/device:GPU:0'):  # gpu takes: 5.132448434829712 seconds
# with tf.device('/cpu:0'):       # cpu takes: 3.440524101257324 seconds
    i = tf.constant(0)
    while_condition = lambda i: tf.less(i, 2 ** 20)
    a = tf.fill([16, 16], 1.1)
    b = tf.fill([16, 16], 2.2)

    def body(i):
        res = tf.matmul(a, b)
        # increment i
        add = tf.add(i, 1)
        return (add,)

    ini_matmul = tf.matmul(a, b)
    # do the loop:
    loop = tf.while_loop(while_condition, body, [i])

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(ini_matmul)  # force GPU to initialise anything it needs.
    t0 = time.time()
    sess.run(loop)
    t1 = time.time()
    print(t1 - t0)
    sess.close()
Note: usually the GPU run takes about 5 seconds, the CPU run about 3 seconds, and the CPU version using NumPy only about 1.5 seconds.
Hardware: the TensorFlow code runs on Google Colab; the NumPy code runs on a local Intel Core i5-7267U.
Numpy version:
import numpy as np
import time

i = 0
a = np.full([16, 16], 1.1)
b = np.full([16, 16], 2.2)

t0 = time.time()
while i < 2**20:
    a.dot(b)
    i += 1
t1 = time.time()
print(t1 - t0)
Update
This is becoming more and more weird to me, because scaling up the matrices does not really help. Here's the updated code and the data from it (running on a Titan XP card / Intel i7 CPU). Essentially, TensorFlow is running much slower.
import time
import tensorflow as tf

dimension = 11
repeat = 2**10
use_gpu = False
# Device: /device:GPU:0, Dimension 11, Repeat: 1024, Time cost: 0.00457597 seconds.
# Device: /cpu:0, Dimension 11, Repeat: 1024, Time cost: 0.00353599 seconds.

dev_name = '/device:GPU:0' if use_gpu else '/cpu:0'

with tf.device(dev_name):
    i = tf.constant(0)
    while_condition = lambda i: tf.less(i, repeat)
    a = tf.constant(1.1, shape=[2**dimension, 2**dimension])
    b = tf.constant(2.2, shape=[2**dimension, 2**dimension])

    def body(i):
        res = tf.matmul(a, b)
        add = tf.add(i, 1)
        return (add,)

    ini_matmul = tf.matmul(a, b)
    # do the loop:
    loop = tf.while_loop(while_condition, body, [i])

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(ini_matmul)  # force initialisation.
    t0 = time.time()
    sess.run(loop)
    t1 = time.time()
    print('Device: {dev}, Dimension {dim:d}, Repeat: {r:d}, Time cost: {t:.8f} seconds.'.format(
        dev=dev_name,
        dim=dimension, r=repeat,
        t=t1 - t0
    ))
    sess.close()
In the end, I figured out that the matmul operation inside the loop body is not actually executed by TensorFlow: its result res is never used, so it is an orphan node and is pruned from the graph.
This is an interesting question.
The relative slowdown you're seeing between GPU and CPU execution in the TensorFlow snippet is almost certainly due to GPU memory allocation overhead. To summarize the link, cudaMalloc is slower than malloc. This memory allocation slowdown is offset by a speedup in the requested operation (matmul in this case) if and only if the speedup exceeds the difference in memory allocation times. This is always true for matmul when the matrices are large. It is not true when the matrices are small, as is the case in your example. To validate this hypothesis, iteratively increase the size of the multiplicands and record both the CPU and GPU running times - the two should converge, then cross, if memory allocation is indeed the issue.
The difference between the NumPy running time and the CPU-only running time is likely due to a very subtle difference between the NumPy and TensorFlow code. Note that in the NumPy code you only instantiate a and b once. It looks like you do the same thing in the TensorFlow code because you only call your initialization once, but you're still populating the tensors in every iteration! To see why, note that tf.fill returns a Tensor. By definition, Tensor objects are populated each time sess.run is called on the graph that contains them. Thus, the two snippets actually do slightly different things. A more direct comparison would be to make a and b both tf.constant in the TensorFlow snippet.
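As a rough way to test the memory-allocation hypothesis above (a sketch only: the sizes are arbitrary, it assumes a GPU is actually available, and it times repeated runs of a single matmul rather than a while loop), one could sweep the matrix size on each device and look for the point where the GPU curve crosses below the CPU curve:

import time
import tensorflow as tf

for n in [16, 64, 256, 1024, 4096]:
    times = {}
    for dev in ['/cpu:0', '/device:GPU:0']:    # assumes a GPU is present
        tf.reset_default_graph()
        with tf.device(dev):
            a = tf.constant(1.1, shape=[n, n])
            b = tf.constant(2.2, shape=[n, n])
            product = tf.matmul(a, b)
        with tf.Session() as sess:
            sess.run(product)                  # warm-up: build, place and allocate once
            t0 = time.time()
            for _ in range(100):
                sess.run(product)
            times[dev] = time.time() - t0
    print(n, times)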

Parallelism isn't reducing the time in dataset map

The tf.data map function supports parallel calls, but I'm seeing no improvement from passing num_parallel_calls to map. With num_parallel_calls=1 and num_parallel_calls=10 there is no difference in run time. Here is a simple example:
import time
import tensorflow as tf

def test_two_custom_function_parallelism(num_parallel_calls=1, batch=False,
                                         batch_size=1, repeat=1, num_iterations=10):
    tf.reset_default_graph()
    start = time.time()
    dataset_x = tf.data.Dataset.range(1000).map(
        lambda x: tf.py_func(squarer, [x], [tf.int64]),
        num_parallel_calls=num_parallel_calls).repeat(repeat)
    if batch:
        dataset_x = dataset_x.batch(batch_size)
    dataset_y = tf.data.Dataset.range(1000).map(
        lambda x: tf.py_func(squarer, [x], [tf.int64]),
        num_parallel_calls=num_parallel_calls).repeat(repeat)
    if batch:
        dataset_y = dataset_x.batch(batch_size)
    X = dataset_x.make_one_shot_iterator().get_next()
    Y = dataset_x.make_one_shot_iterator().get_next()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        i = 0
        while True:
            try:
                res = sess.run([X, Y])
                i += 1
                if i == num_iterations:
                    break
            except tf.errors.OutOfRangeError as e:
                pass
Here are the timings
%timeit test_two_custom_function_parallelism(num_iterations=1000, num_parallel_calls=2, batch_size=2, batch=True)
370 ms

%timeit test_two_custom_function_parallelism(num_iterations=1000, num_parallel_calls=5, batch_size=2, batch=True)
372 ms

%timeit test_two_custom_function_parallelism(num_iterations=1000, num_parallel_calls=10, batch_size=2, batch=True)
384 ms
I used %timeit in a Jupyter notebook. What am I doing wrong?
The problem here is that the only operation in the Dataset.map() function is a tf.py_func() op. This op calls back into the local Python interpreter to run a function in the same process. Increasing num_parallel_calls will increase the number of TensorFlow threads that attempt to call back into Python concurrently. However, Python has something called the "Global Interpreter Lock" that prevents more than one thread from executing code at once. As a result, all but one of these multiple parallel calls will be blocked waiting to acquire the Global Interpreter Lock, and there will be almost no parallel speedup (and perhaps even a slight slowdown).
Your code example didn't include the definition of the squarer() function, but it might be possible to replace tf.py_func() with pure TensorFlow ops, which are implemented in C++ and can execute in parallel. For example (just guessing from the name), you could replace it with an invocation of tf.square(x), and you might then enjoy some parallel speedup.
Note however that if the amount of work in the function is small, like squaring a single integer, the speedup might not be very large. Parallel Dataset.map() is more useful for heavier operations, like parsing a TFRecord with tf.parse_single_example() or performing some image distortions as part of a data augmentation pipeline.
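As a minimal sketch of that substitution (assuming, as the answer guesses, that squarer simply squares its input; the batching mirrors the snippet above):

import tensorflow as tf

# A pure TensorFlow op in the map function runs in C++ worker threads,
# so num_parallel_calls is not serialized by the Python GIL.
dataset = (tf.data.Dataset.range(1000)
           .map(lambda x: tf.square(x), num_parallel_calls=10)
           .batch(2))

next_element = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    print(sess.run(next_element))   # e.g. [0 1]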
The reason may be that squarer costs less time than the overhead. I modified the code, adding a squarer function that takes 2 seconds per call, and then the num_parallel_calls parameter works as expected. Here is the complete code:
import tensorflow as tf
import time

def squarer(x):
    t0 = time.time()
    while time.time() - t0 < 2:
        y = x ** 2
    return y

def test_two_custom_function_parallelism(num_parallel_calls=1,
                                         batch=False,
                                         batch_size=1,
                                         repeat=1,
                                         num_iterations=10):
    tf.reset_default_graph()
    start = time.time()
    dataset_x = tf.data.Dataset.range(1000).map(
        lambda x: tf.py_func(squarer, [x], [tf.int64]),
        num_parallel_calls=num_parallel_calls).repeat(repeat)
    # dataset_x = dataset_x.prefetch(4)
    if batch:
        dataset_x = dataset_x.batch(batch_size)
    dataset_y = tf.data.Dataset.range(1000).map(
        lambda x: tf.py_func(squarer, [x], [tf.int64]),
        num_parallel_calls=num_parallel_calls).repeat(repeat)
    # dataset_y = dataset_y.prefetch(4)
    if batch:
        dataset_y = dataset_x.batch(batch_size)
    X = dataset_x.make_one_shot_iterator().get_next()
    Y = dataset_x.make_one_shot_iterator().get_next()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        i = 0
        while True:
            t0 = time.time()
            try:
                res = sess.run([X, Y])
                print(res)
                i += 1
                if i == num_iterations:
                    break
            except tf.errors.OutOfRangeError as e:
                print(i)
                break
            print('step elapse: %.4f' % (time.time() - t0))
    print('total time: %.4f' % (time.time() - start))

test_two_custom_function_parallelism(
    num_iterations=4, num_parallel_calls=1, batch_size=2, batch=True, repeat=10)
test_two_custom_function_parallelism(
    num_iterations=4, num_parallel_calls=10, batch_size=2, batch=True, repeat=10)
the output is:
[(array([0, 1]),), (array([0, 1]),)]
step elapse: 4.0204
[(array([4, 9]),), (array([4, 9]),)]
step elapse: 4.0836
[(array([16, 25]),), (array([16, 25]),)]
step elapse: 4.1529
[(array([36, 49]),), (array([36, 49]),)]
total time: 16.3374
[(array([0, 1]),), (array([0, 1]),)]
step elapse: 2.2139
[(array([4, 9]),), (array([4, 9]),)]
step elapse: 0.0585
[(array([16, 25]),), (array([16, 25]),)]
step elapse: 0.0469
[(array([36, 49]),), (array([36, 49]),)]
total time: 2.5317
So I am confused about the effect of the "Global Interpreter Lock" mentioned by @mrry.
I set up my own version of map to get something similar to TensorFlow's Dataset.map, but one which will use multiple CPU processes for py_functions.
Usage
Instead of
mapped_dataset = my_dataset.map(lambda x: tf.py_function(my_function, [x], [tf.float64]), num_parallel_calls=16)
with the below code, you can get a CPU parallel py_function version using
mapped_dataset = map_py_function_to_dataset(my_dataset, my_function, number_of_parallel_calls=16)
(The output type(s) for the py_function can also be specified if it's not a single tf.float32)
Internally, this creates a pool of multiprocessing workers. It still uses the regular, single, GIL-limited TensorFlow map, but only to pass the input to a worker and get the output back. The workers process the data in parallel on the CPU.
Caveats
The function passed needs to be picklable to work with the multiprocessing pool. This should work for most cases, but some closures or whatnot may fail. Packages like dill might loosen this restriction, but I haven't looked into that.
If you pass an object's method as the function, you also need to be careful about how the object is duplicated across processes (each process will have its own copy of the object, so you can't rely on the attributes being shared).
As long as these considerations are kept in mind, this code should work for many cases.
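As a quick illustration of the picklability caveat (a sketch only; my_function is just a stand-in name): a plain module-level function can be sent to the pool, while a lambda or local closure fails to pickle.

import pickle

def my_function(x):                 # module-level function: picklable
    return x * x

pickle.dumps(my_function)           # works

try:
    pickle.dumps(lambda x: x * x)   # lambdas are not picklable with the stdlib pickle
except (pickle.PicklingError, AttributeError) as error:
    print('not picklable:', error)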
Code
"""
Code for TensorFlow's `Dataset` class which allows for multiprocessing in CPU map functions.
"""
import multiprocessing
from typing import Callable, Union, List
import signal
import tensorflow as tf
class PyMapper:
"""
A class which allows for mapping a py_function to a TensorFlow dataset in parallel on CPU.
"""
def __init__(self, map_function: Callable, number_of_parallel_calls: int):
self.map_function = map_function
self.number_of_parallel_calls = number_of_parallel_calls
self.pool = multiprocessing.Pool(self.number_of_parallel_calls, self.pool_worker_initializer)
#staticmethod
def pool_worker_initializer():
"""
Used to initialize each worker process.
"""
# Corrects bug where worker instances catch and throw away keyboard interrupts.
signal.signal(signal.SIGINT, signal.SIG_IGN)
def send_to_map_pool(self, element_tensor):
"""
Sends the tensor element to the pool for processing.
:param element_tensor: The element to be processed by the pool.
:return: The output of the map function on the element.
"""
result = self.pool.apply_async(self.map_function, (element_tensor,))
mapped_element = result.get()
return mapped_element
def map_to_dataset(self, dataset: tf.data.Dataset,
output_types: Union[List[tf.dtypes.DType], tf.dtypes.DType] = tf.float32):
"""
Maps the map function to the passed dataset.
:param dataset: The dataset to apply the map function to.
:param output_types: The TensorFlow output types of the function to convert to.
:return: The mapped dataset.
"""
def map_py_function(*args):
"""A py_function wrapper for the map function."""
return tf.py_function(self.send_to_map_pool, args, output_types)
return dataset.map(map_py_function, self.number_of_parallel_calls)
def map_py_function_to_dataset(dataset: tf.data.Dataset, map_function: Callable, number_of_parallel_calls: int,
output_types: Union[List[tf.dtypes.DType], tf.dtypes.DType] = tf.float32
) -> tf.data.Dataset:
"""
A one line wrapper to allow mapping a parallel py function to a dataset.
:param dataset: The dataset whose elements the mapping function will be applied to.
:param map_function: The function to map to the dataset.
:param number_of_parallel_calls: The number of parallel calls of the mapping function.
:param output_types: The TensorFlow output types of the function to convert to.
:return: The mapped dataset.
"""
py_mapper = PyMapper(map_function=map_function, number_of_parallel_calls=number_of_parallel_calls)
mapped_dataset = py_mapper.map_to_dataset(dataset=dataset, output_types=output_types)
return mapped_dataset

My TensorFlow Graph is abnormally large using Edward

I have code here that I've modified from this website. Basically what I have written is this:
# import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
# from tensorflow.examples.tutorials.mnist import input_data
from edward.models import Categorical, Normal
import edward as ed
# ed.set_seed(39)
import pandas as pd
import csv

# Use the TensorFlow method to download and/or load the data.
with open("data_final.csv", "r") as csvfile:
    reader1 = csv.reader(csvfile)
    data1 = np.array(list(reader1)).astype(np.float)
# mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

N = data1.shape[0] - 1  # number of images in a minibatch.
D = 4  # number of features.
K = 4  # number of classes.

# Create a placeholder to hold the data (in minibatches) in a TensorFlow graph.
x = tf.placeholder(tf.float32, [N, D])
# Normal(0,1) priors for the variables. Note that the syntax assumes TensorFlow 1.1.
w = Normal(loc=tf.zeros([D, K]), scale=tf.ones([D, K]))
b = Normal(loc=tf.zeros(K), scale=tf.ones(K))
# Categorical likelihood for classification.
y = tf.matmul(x, w) + b

# Construct the q(w) and q(b). In this case we assume Normal distributions.
qw = Normal(loc=tf.Variable(tf.random_normal([D, K])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([D, K]))))
qb = Normal(loc=tf.Variable(tf.random_normal([K])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([K]))))

# We use a placeholder for the labels in anticipation of the training data.
y_ph = tf.placeholder(tf.float32, [N, K])
# Define the VI inference technique, i.e. minimise the KL divergence between q and p.
inference = ed.KLqp({w: qw, b: qb}, data={y: y_ph})
# Initialise the inference variables.
inference.initialize(n_iter=5000, n_print=100, scale={y: 1})

# We will use an interactive session.
sess = tf.InteractiveSession()
# Initialise all the variables in the session.
tf.global_variables_initializer().run()
I use the data linked here to run the code. I get an error after less than a second of running the code (so I have a hard time believing this actually happened) that said:
ValueError: GraphDef cannot be larger than 2GB.
I think there were other topics with the same error as mine, but those people had instantiated something like a million parameters. I have on the order of 20 parameters, so I'm unsure why I'm getting this error.
In my case, there were still variables (and likely graphs) that had not been garbage collected from previous Edward runs. Garbage collecting / resetting the console fixed the problem.
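A minimal sketch of doing that cleanup explicitly rather than restarting the console (an assumption on my part, not something the answer spells out): clear the default graph before rebuilding the model, so nodes from earlier runs do not accumulate until the serialized GraphDef exceeds the 2 GB protobuf limit.

import tensorflow as tf

# Discard every node added by previous runs in this Python session,
# then rebuild the Edward/TensorFlow model from scratch.
tf.reset_default_graph()

# ... redefine x, w, b, qw, qb, inference, etc. here ...

print(len(tf.get_default_graph().get_operations()))  # should be small again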

Passing bool to feed dict

So here is an example of using batch normalization over a 1-D input vector. Batch normalization is performed over 100 training examples xTr. I then want to test later on, say, just one example xTe.
import tensorflow as tf
import numpy as np
from tensorflow.contrib.layers import layers

if __name__ == "__main__":
    bn = layers.batch_norm
    nFeats = 3
    nObs = 100
    xTr = np.random.rand(nObs, nFeats)  # Train
    xTe = np.random.rand(1, nFeats)     # Test

    bnTrain = tf.placeholder(tf.bool)
    X = tf.placeholder(tf.float32, [None, nFeats])
    Y = bn(X, nFeats, is_training=bnTrain)  # want to be able to change is_training via a feed_dict.

    init_op = tf.initialize_all_variables()
    with tf.Session() as sess:
        sess.run(init_op)
        yTr_ = Y.eval(feed_dict={X: xTr, bnTrain: True})
        yTe_ = Y.eval(feed_dict={X: xTe, bnTrain: False})
But I can't pass a tf.Tensor to a function that expects a normal Python bool. What is the best way of going about this so that I can change the bool during a session?
The current implementation of the tf.contrib.layers.batch_norm() function is designed to accept a tf.Tensor as the is_training argument (although this fact doesn't appear to be documented), and looking at the revision history, it was added in the TensorFlow 0.10 release. If you are using an older version, please try upgrading to the latest release (currently 0.12), and your existing code should work. Among other improvements, it contains a fused implementation of batch normalization that should make a significant performance improvement.