Tensorflow's while loop slow on GPU? - tensorflow

For unknown reasons, the following code takes twice much slower on GPU then on CPU. Could anyone explain why:
import time
import tensorflow as tf
with tf.device('/device:GPU:0'): # gpu takes: 5.132448434829712 seconds
# with tf.device('/cpu:0'): # cpu takes: 3.440524101257324 seconds
i = tf.constant(0)
while_condition = lambda i: tf.less(i, 2 ** 20)
a = tf.fill([16, 16], 1.1)
b = tf.fill([16, 16], 2.2)
def body(i):
res = tf.matmul(a, b)
# increment i
add = tf.add(i, 1)
return (add,)
ini_matmul = tf.matmul(a, b)
# do the loop:
loop = tf.while_loop(while_condition, body, [i])
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
sess.run(ini_matmul) # force GPU to initilise anything it needs.
t0 = time.time()
sess.run(loop)
t1 = time.time()
print(t1 - t0)
sess.close()
Note: Usually, the GPU runs for 5 seconds, CPU runs for 3 seconds, and CPU version using numpy runs for only 1.5 seconds.
Hardware: Tensorflow code running on Google's Colab. Numpy code running on local Intel Core i5-7267U.
Numpy version:
import numpy as np
import time
i = 0
a = np.full([16,16],1.1)
b = np.full([16,16],2.2)
t0 = time.time()
while i < 2**20:
a.dot(b)
i += 1
t1 = time.time()
print(t1-t0)
Update
This is becoming more and more wired to me because scaling up the matrix does not really helps. Here's the updated code and data in it (running of Titan XP card/Intel i7 CPU). Essentially tensorflow is running much slower.
import time
import tensorflow as tf
dimension = 11
repeat = 2**10
use_gpu = False
# Device: /device:GPU:0, Dimension 11, Repeat: 1024, Time cost: 0.00457597 seconds.
# Device: /cpu:0, Dimension 11, Repeat: 1024, Time cost: 0.00353599 seconds.
dev_name = '/device:GPU:0' if use_gpu else '/cpu:0'
with tf.device(dev_name):
i = tf.constant(0)
while_condition = lambda i: tf.less(i, repeat)
a = tf.constant(1.1, shape=[2**dimension, 2**dimension])
b = tf.constant(2.2, shape=[2**dimension, 2**dimension])
def body(i):
res = tf.matmul(a, b)
add = tf.add(i, 1)
return (add,)
ini_matmul = tf.matmul(a, b)
# do the loop:
loop = tf.while_loop(while_condition, body, [i])
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
sess.run(ini_matmul) # force initialisation.
t0 = time.time()
sess.run(loop)
t1 = time.time()
print('Device: {dev}, Dimension {dim:d}, Repeat: {r:d}, Time cost: {t:.8f} seconds.'.format(
dev = dev_name,
dim = dimension, r = repeat,
t = t1 - t0
))
sess.close()

In the end, I figured out that the matmul operation is not executed by tensorflow, because it is an orphan node in the graph.

This is an interesting question.
The relative slowdown you're seeing between GPU and CPU execution in the TensorFlow snippet is almost certainly due to GPU memory allocation overhead. To summarize the link, cudaMalloc is slower than malloc. This memory allocation slowdown is offset by a speedup in the requested operation (matmul in this case) if and only if the speedup exceeds the difference in memory allocation times. This is always true for matmul when the matrices are large. It is not true when the matrices are small, as is the case in your example. To validate this hypothesis, iteratively increase the size of the multiplicands and record both the CPU and GPU running times - the two should converge, then cross, if memory allocation is indeed the issue.
The difference between the Numpy running time and the CPU-only running time is likely due to very subtle difference between the Numpy and TensorFlow code. Note that in the Numpy code you only instantiate a and b once. It looks like you do the same thing in the TensorFlow code because you only call your initialization once, but you're still populating the tensors in every iteration! To see why, note that tf.fill returns a Tensor. By definition, Tensor objects are populated each time sess.run is called on the graph that contains them. Thus, the two snippets actually do slightly different things. A more direct comparison would be to make a and b both tf.constant in the TensorFlow snippet.

Related

Tensorflow2.x custom data generator with multiprocessing

I just upgraded to tensorflow 2.3.
I want to make my own data generator for training.
With tensorflow 1.x, I did this:
def get_data_generator(test_flag):
item_list = load_item_list(test_flag)
print('data loaded')
while True:
X = []
Y = []
for _ in range(BATCH_SIZE):
x, y = get_random_augmented_sample(item_list)
X.append(x)
Y.append(y)
yield np.asarray(X), np.asarray(Y)
data_generator_train = get_data_generator(False)
data_generator_test = get_data_generator(True)
model.fit_generator(data_generator_train, validation_data=data_generator_test,
epochs=10000, verbose=2,
use_multiprocessing=True,
workers=8,
validation_steps=100,
steps_per_epoch=500,
)
This code worked fine with tensorflow 1.x. 8 processes were created in the system. The processor and video card were loaded perfectly. "data loaded" was printed 8 times.
With tensorflow 2.3 i got warning:
WARNING: tensorflow: multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.
"data loaded" was printed once(should 8 times). GPU is not fully utilized. It also have memory leak every epoch, so traning will stops after several epochs. use_multiprocessing flag did not help.
How to make a generator / iterator in tensorflow(keras) 2.x that can easily be parallelized across multiple CPU processes? Deadlocks and data order are not important.
With a tf.data pipeline, there are several spots where you can parallelize. Depending on how your data are stored and read, you can parallelize reading. You can also parallelize augmentation, and you can prefetch data as you train, so your GPU (or other hardware) is never hungry for data.
In the code below, I have demonstrated how you can parallelize augmentation and add prefetching.
import numpy as np
import tensorflow as tf
x_shape = (32, 32, 3)
y_shape = () # A single item (not array).
classes = 10
# This is tf.data.experimental.AUTOTUNE in older tensorflow.
AUTOTUNE = tf.data.AUTOTUNE
def generator_fn(n_samples):
"""Return a function that takes no arguments and returns a generator."""
def generator():
for i in range(n_samples):
# Synthesize an image and a class label.
x = np.random.random_sample(x_shape).astype(np.float32)
y = np.random.randint(0, classes, size=y_shape, dtype=np.int32)
yield x, y
return generator
def augment(x, y):
return x * tf.random.normal(shape=x_shape), y
samples = 10
batch_size = 5
epochs = 2
# Create dataset.
gen = generator_fn(n_samples=samples)
dataset = tf.data.Dataset.from_generator(
generator=gen,
output_types=(np.float32, np.int32),
output_shapes=(x_shape, y_shape)
)
# Parallelize the augmentation.
dataset = dataset.map(
augment,
num_parallel_calls=AUTOTUNE,
# Order does not matter.
deterministic=False
)
dataset = dataset.batch(batch_size, drop_remainder=True)
# Prefetch some batches.
dataset = dataset.prefetch(AUTOTUNE)
# Prepare model.
model = tf.keras.applications.VGG16(weights=None, input_shape=x_shape, classes=classes)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# Train. Do not specify batch size because the dataset takes care of that.
model.fit(dataset, epochs=epochs)

Basic TPU Cross Shard Optimizer Not Working

In general, there are some good examples that use TF optimizers for solving general (non deep learning) problems. Given:
https://databricks.com/tensorflow/training-and-convergence
https://colab.research.google.com/notebooks/tpu.ipynb#scrollTo=a_rjVo-RAoYd
We want to be able to combine the two above and make use of TPU based optimization in solving high dimensional problems.
To that end I've got a simple colab code that does this merging the two examples above:
import tensorflow as tf
import numpy as np
from tensorflow.contrib.tpu.python.tpu import tpu_function
import os
import pprint
import tensorflow as tf
if 'COLAB_TPU_ADDR' not in os.environ:
print('ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!')
else:
tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print ('TPU address is', tpu_address)
with tf.Session(tpu_address) as session:
devices = session.list_devices()
print('TPU devices:')
pprint.pprint(devices)
# Add this somewhere at the top
tpu_function.get_tpu_context().set_number_of_shards(8)
# x and y are placeholders for our training data
x = tf.placeholder("float")
y = tf.placeholder("float")
# w is the variable storing our values. It is initialised with starting "guesses"
# w[0] is the "a" in our equation, w[1] is the "b"
w = tf.Variable([1.0, 2.0,3.0, 4.0], name="w")
# Our model of y = a*x + b
y_model = tf.multiply(x, w[0]) + w[1] + w[2] +3
# Our error is defined as the square of the differences
error = tf.square(y - y_model)
# The Gradient Descent Optimizer does the heavy lifting
train_op = tf.train.AdamOptimizer(0.01)
optimizer = tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error) # TPU change 1
# Normal TensorFlow - initialize values, create a session and run the model
model = tf.global_variables_initializer()
with tf.Session(tpu_address) as session:
session.run(tf.contrib.tpu.initialize_system())
print('init')
session.run(model)
for i in range(10000):
print(i)
x_value = np.random.rand()
y_value = x_value * 2 + 6 + 5 + 3
session.run(optimizer, feed_dict={x: x_value, y: y_value})
w_value = session.run(w)
print("Predicted model: {a:.3f}x + {b:.3f}+{c:.3f}x + {d:.3f}".format(a=w_value[0], b=w_value[1], c=w_value[2], d=w_value[3]))
session.run(tpu.shutdown_system())
When I run it (in colab) as it is it just runs the first loop printing:
init
0
and then does nothing and colab just keeps spanning.
If I do not use
optimizer = tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error)
And other TPU features, then it works fine estimating the w Variable.
The questions are:
Why doesn't this work and how can we get the cross shard replicator to optimise this simple function?
How can shall I shape variable w to make use of parallel batches/shards on the TPU?
How can we make this even more efficient through use of an equivalent Dataset prefetch operation or using infeed queues?
The goal is to make use of lower level TPU APIs without TPUEstimator for example to help solve custom problems by leveraging the power of TPUs using the tensors , queues and shards only.
It doesn't work because you are overriding the number of shards without actually splitting the calculations into shards. When I run your code, I get the following error:
InternalError: From /job:tpu_worker/replica:0/task:0:
RET_CHECK failure (platforms/xla/service/jellyfish/lowering/all_reduce_emitter.cc:832) replica_id < target.ReplicaCount() Unexpected replica id in all-reduce, replica_id is 1, target has 1 replicas.
Error encountered while compiling %all-reduce.7 = f32[4]{0:T(256)} all-reduce(f32[4]{0:T(256)} %arg0.1), replica_groups={{0,1,2,3,4,5,6,7}}, to_apply=%sum.3, metadata={op_type="CrossReplicaSum" op_name="CrossReplicaSum_21"}, backend_config="{barrier_type:3}".
It is trying to perform the computations on eight shards and combine the results, but it only has one shard to work with. Take a look at tf.contrib.tpu.shard. It creates a shard context using the given number of shards and distributes a computation over those shards. So, instead of setting the number of shards manually, you can define your variables as usual and then wrap any computations with them in a function to be sharded:
# REMOVE THIS
# tpu_function.get_tpu_context().set_number_of_shards(8)
# x and y are placeholders for our training data
x_placeholder = tf.placeholder("float")
y_placeholder = tf.placeholder("float")
# w is the variable storing our values. It is initialised with starting "guesses"
# w[0] is the "a" in our equation, w[1] is the "b"
w = tf.Variable([1.0, 2.0,3.0, 4.0], name="w")
# Wrap all of our tensorflow operations in a function we can shard
def calculations(x, y):
# Our model of y = a*x + b
y_model = tf.multiply(x, w[0]) + w[1] + w[2] +3
# Our error is defined as the square of the differences
# Average across the entire batch
error = tf.reduce_mean(tf.square(y - y_model))
# The Gradient Descent Optimizer does the heavy lifting
train_op = tf.train.AdamOptimizer(0.01)
return tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error)
# Shard the function so that its calculation is distributed
optimizer = tf.contrib.tpu.shard(calculations, inputs=[x_placeholder, y_placeholder], num_shards=8)
You don't need to shape w to make use of shards, because sharding occurs across the batch dimension and you only have one set of weights for all inputs. You'll want to add a batch dimension to your inputs so that each batch can be distributed across the cores. shard assumes the first dimension is the batch dimension, but includes an argument to change it if your data is shaped differently. According to the TPU troubleshooting page, the ideal batch size is 1024 so that there are 128 samples per TPU core. If that is too big for your model, you can go smaller as long as it is a multiple of 128. Check out the above link and the performance guide for more tips on increasing performance.
for i in range(1000):
print(i)
x_value = np.random.rand(1024) # Generate a batch of 1024 values
y_value = x_value * 2 + 6 + 5 + 3
session.run(optimizer, feed_dict={x_placeholder: x_value, y_placeholder: y_value})
Everything else should remain the same. I was able to train the model for all 10000 iterations. Keep in mind that for this simple model it will probably be slower than using CPU/GPU, but you should expect performance improvements for more complex problems with larger datasets.
I'm not familiar enough with Datasets or infeed queues to comment on this, but shard includes an argument for infeed queues so it likely has support for them. You might have to play around with it to see how it gets data to the computation function.

Parallelism isn't reducing the time in dataset map

TF Map function supports parallel calls. I'm seeing no improvements passing num_parallel_calls to map. With num_parallel_calls=1 and num_parallel_calls=10, there is no improvement in performance run time. Here is a simple code
import time
def test_two_custom_function_parallelism(num_parallel_calls=1, batch=False,
batch_size=1, repeat=1, num_iterations=10):
tf.reset_default_graph()
start = time.time()
dataset_x = tf.data.Dataset.range(1000).map(lambda x: tf.py_func(
squarer, [x], [tf.int64]),
num_parallel_calls=num_parallel_calls).repeat(repeat)
if batch:
dataset_x = dataset_x.batch(batch_size)
dataset_y = tf.data.Dataset.range(1000).map(lambda x: tf.py_func(
squarer, [x], [tf.int64]), num_parallel_calls=num_parallel_calls).repeat(repeat)
if batch:
dataset_y = dataset_x.batch(batch_size)
X = dataset_x.make_one_shot_iterator().get_next()
Y = dataset_x.make_one_shot_iterator().get_next()
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
i = 0
while True:
try:
res = sess.run([X, Y])
i += 1
if i == num_iterations:
break
except tf.errors.OutOfRangeError as e:
pass
Here are the timings
%timeit test_two_custom_function_parallelism(num_iterations=1000,
num_parallel_calls=2, batch_size=2, batch=True)
370ms
%timeit test_two_custom_function_parallelism(num_iterations=1000,
num_parallel_calls=5, batch_size=2, batch=True)
372ms
%timeit test_two_custom_function_parallelism(num_iterations=1000,
num_parallel_calls=10, batch_size=2, batch=True)
384ms
I used %timeit in Juypter notebook. What am I doing it wrong?
The problem here is that the only operation in the Dataset.map() function is a tf.py_func() op. This op calls back into the local Python interpreter to run a function in the same process. Increasing num_parallel_calls will increase the number of TensorFlow threads that attempt to call back into Python concurrently. However, Python has something called the "Global Interpreter Lock" that prevents more than one thread from executing code at once. As a result, all but one of these multiple parallel calls will be blocked waiting to acquire the Global Interpreter Lock, and there will be almost no parallel speedup (and perhaps even a slight slowdown).
Your code example didn't include the definition of the squarer() function, but it might be possible to replace tf.py_func() with pure TensorFlow ops, which are implemented in C++, and can execute in parallel. For example—and just guessing by the name—you could replace it with an invocation of tf.square(x), and you might then enjoy some parallel speedup.
Note however that if the amount of work in the function is small, like squaring a single integer, the speedup might not be very large. Parallel Dataset.map() is more useful for heavier operations, like parsing a TFRecord with tf.parse_single_example() or performing some image distortions as part of a data augmentation pipeline.
The reason maybe the squarer cost less time than overhead time. I modified the code with adding a quarter function which cost 2 seconds. Then the parameter num_parallel_calls works as expected. Here is the complete code:
import tensorflow as tf
import time
def squarer(x):
t0 = time.time()
while time.time() - t0 < 2:
y = x ** 2
return y
def test_two_custom_function_parallelism(num_parallel_calls=1,
batch=False,
batch_size=1,
repeat=1,
num_iterations=10):
tf.reset_default_graph()
start = time.time()
dataset_x = tf.data.Dataset.range(1000).map(
lambda x: tf.py_func(squarer, [x], [tf.int64]),
num_parallel_calls=num_parallel_calls).repeat(repeat)
# dataset_x = dataset_x.prefetch(4)
if batch:
dataset_x = dataset_x.batch(batch_size)
dataset_y = tf.data.Dataset.range(1000).map(
lambda x: tf.py_func(squarer, [x], [tf.int64]),
num_parallel_calls=num_parallel_calls).repeat(repeat)
# dataset_y = dataset_y.prefetch(4)
if batch:
dataset_y = dataset_x.batch(batch_size)
X = dataset_x.make_one_shot_iterator().get_next()
Y = dataset_x.make_one_shot_iterator().get_next()
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
i = 0
while True:
t0 = time.time()
try:
res = sess.run([X, Y])
print(res)
i += 1
if i == num_iterations:
break
except tf.errors.OutOfRangeError as e:
print(i)
break
print('step elapse: %.4f' % (time.time() - t0))
print('total time: %.4f' % (time.time() - start))
test_two_custom_function_parallelism(
num_iterations=4, num_parallel_calls=1, batch_size=2, batch=True, repeat=10)
test_two_custom_function_parallelism(
num_iterations=4, num_parallel_calls=10, batch_size=2, batch=True, repeat=10)
the output is:
[(array([0, 1]),), (array([0, 1]),)]
step elapse: 4.0204
[(array([4, 9]),), (array([4, 9]),)]
step elapse: 4.0836
[(array([16, 25]),), (array([16, 25]),)]
step elapse: 4.1529
[(array([36, 49]),), (array([36, 49]),)]
total time: 16.3374
[(array([0, 1]),), (array([0, 1]),)]
step elapse: 2.2139
[(array([4, 9]),), (array([4, 9]),)]
step elapse: 0.0585
[(array([16, 25]),), (array([16, 25]),)]
step elapse: 0.0469
[(array([36, 49]),), (array([36, 49]),)]
total time: 2.5317
So I am confused with the effect of "Global Interpreter Lock" mentioned by #mrry.
I setup my own version of map to get something similar to the TensorFlow's Dataset.map, but which will use multiple CPUs for py_functions.
Usage
Instead of
mapped_dataset = my_dataset.map(lambda x: tf.py_function(my_function, [x], [tf.float64]), num_parallel_calls=16)
with the below code, you can get a CPU parallel py_function version using
mapped_dataset = map_py_function_to_dataset(my_dataset, my_function, number_of_parallel_calls=16)
(The output type(s) for the py_function can also be specified if it's not a single tf.float32)
Internally, this creates a pool of multiprocessing workers. It still uses the single regular GIL limited TensorFlow map, but only to pass the input to a worker and get the output back. The workers processing the data happen in parallel on the CPU.
Caveats
The function passed needs to be picklable to work with the multiprocessing pool. This should work for most cases, but some closures or whatnot may fail. Packages like dill might loosen this restriction, but I haven't looked into that.
If you pass an object's method as the function, you also need to be careful about how the object is duplicated across processes (each process will have its own copy of the object, so you can't rely on the attributes being shared).
As long as these considerations are kept in mind, this code should work for many cases.
Code
"""
Code for TensorFlow's `Dataset` class which allows for multiprocessing in CPU map functions.
"""
import multiprocessing
from typing import Callable, Union, List
import signal
import tensorflow as tf
class PyMapper:
"""
A class which allows for mapping a py_function to a TensorFlow dataset in parallel on CPU.
"""
def __init__(self, map_function: Callable, number_of_parallel_calls: int):
self.map_function = map_function
self.number_of_parallel_calls = number_of_parallel_calls
self.pool = multiprocessing.Pool(self.number_of_parallel_calls, self.pool_worker_initializer)
#staticmethod
def pool_worker_initializer():
"""
Used to initialize each worker process.
"""
# Corrects bug where worker instances catch and throw away keyboard interrupts.
signal.signal(signal.SIGINT, signal.SIG_IGN)
def send_to_map_pool(self, element_tensor):
"""
Sends the tensor element to the pool for processing.
:param element_tensor: The element to be processed by the pool.
:return: The output of the map function on the element.
"""
result = self.pool.apply_async(self.map_function, (element_tensor,))
mapped_element = result.get()
return mapped_element
def map_to_dataset(self, dataset: tf.data.Dataset,
output_types: Union[List[tf.dtypes.DType], tf.dtypes.DType] = tf.float32):
"""
Maps the map function to the passed dataset.
:param dataset: The dataset to apply the map function to.
:param output_types: The TensorFlow output types of the function to convert to.
:return: The mapped dataset.
"""
def map_py_function(*args):
"""A py_function wrapper for the map function."""
return tf.py_function(self.send_to_map_pool, args, output_types)
return dataset.map(map_py_function, self.number_of_parallel_calls)
def map_py_function_to_dataset(dataset: tf.data.Dataset, map_function: Callable, number_of_parallel_calls: int,
output_types: Union[List[tf.dtypes.DType], tf.dtypes.DType] = tf.float32
) -> tf.data.Dataset:
"""
A one line wrapper to allow mapping a parallel py function to a dataset.
:param dataset: The dataset whose elements the mapping function will be applied to.
:param map_function: The function to map to the dataset.
:param number_of_parallel_calls: The number of parallel calls of the mapping function.
:param output_types: The TensorFlow output types of the function to convert to.
:return: The mapped dataset.
"""
py_mapper = PyMapper(map_function=map_function, number_of_parallel_calls=number_of_parallel_calls)
mapped_dataset = py_mapper.map_to_dataset(dataset=dataset, output_types=output_types)
return mapped_dataset

Back-propagation exhibiting quadratic memory consumption

I'm running into a weird problem with TensorFlow. I've set up a very simple classification problem, four input variables, one binary output variable, one layer of weights and bias, output goes through a sigmoid to 0 or 1.
The problem is, memory consumption is quadratic in the number of records of training data! With only 5,000 records, it's already 900 megabytes; at 10,000, it runs into a few gigabytes. Since I want to end up using at least a few tens of thousands of records, this is a problem.
It is happening specifically in the back propagation step; when I just try to evaluate the cost function, memory consumption is linear in the number of records, as expected.
Code follows. What am I doing wrong?
import numpy as np
import os
import psutil
import tensorflow as tf
process = psutil.Process(os.getpid())
sess = tf.InteractiveSession()
# Parameters
learning_rate = 0.01
random_seed = 1
tf.set_random_seed(random_seed)
# Data
data = np.loadtxt('train.csv', delimiter=',', dtype=np.float32)
train_X = data[:, :-1]
train_Y = data[:, -1]
rows = np.shape(train_X)[0]
cols = np.shape(train_X)[1]
# Inputs and outputs
X = tf.placeholder(np.float32, shape=(rows, cols))
Y = tf.placeholder(np.float32, shape=rows,)
# Weights
W = tf.Variable(tf.random_normal((cols, 1)))
b = tf.Variable(tf.random_normal(()))
# Model
p = tf.nn.sigmoid(tf.matmul(X, W) + b)
cost = tf.reduce_sum((p-Y)**2/rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.global_variables_initializer().run()
# Just one optimizer step is enough to demonstrate the problem
optimizer.run({X: train_X, Y: train_Y})
# Memory consumption is quadratic in number of rows
print('{0:,} bytes'.format(process.memory_info().peak_wset))
It turns out to be again the problem of shape. Using matmul the way I did there, generates output of shape (n,1). Using that in a context where shape (n,) was expected, silently generates quadratic blowup.
The solution is squeeze. Specifically, tf.squeeze(tf.matmul(X, W)).
It makes sense that memory consumption blows up like that since the backprop requires the extra memory to keep track of the gradients of each operation (though I can't figure out how it ends up being quadratic).
Solution : Mini-batches
This is usually the goto method when it comes to training models. Split up your training data into little mini-batches each containing a fixed number of samples (this is rarely more than 200 samples) at feed it to the optimizer one mini-batch at a time. So if your batch_size=64 then the train_X and train_Y fed to the optimizer will be of the shapes (64, 4) and (64,) respectively.
I would try something like this
batch_size = 64
for i in range(rows):
batch_X = train_X[i*batch_size : (i + 1)*batch_size]
batch_Y = train_Y[i*batch_size : (i + 1)*batch_size]
optimizer.run({X: batch_X, Y:batch_Y})

tensorflow runs extremely slow inside a python for loop

Following code uses tensorflow library and it runs terribly slower compared to numpy library. I am aware that I am calling a function that uses tensorflow library inside python for loop (which I will parallelize with python multiprocessing later), but the code as is, runs extremely slow.
Could someone please help how I can make this code run faster? Thanks.
from math import *
import numpy as np
import sys
from multiprocessing import Pool
import tensorflow as tf
def Trajectory_Fun(tspan, a, b, session=None, server=None):
# Open tensorflow session
if session==None:
if server==None:
sess = tf.Session()
else:
sess = tf.Session(server.target)
else:
sess = session
B = np.zeros(np.size(tspan), dtype=np.float64)
B[0] = b
for i, t in enumerate(tspan):
r = np.random.rand(1)
if r>a:
c = sess.run(tf.trace(tf.random_normal((4, 4), r, 1.0)))
else:
c = 0.0 # sess.run(tf.trace(tf.random_normal((4, 4), 0.0, 1.0)))
B[i] = c
# Close tensorflow session
if session==None:
sess.close()
return B
def main(argv):
# Parameters
tspan = np.arange(0.0, 1000.0)
a = 0.1
b = 0.0
# Run test program
B = Trajectory_Fun(tspan, a, b, None, None)
print 'Done!'
if __name__ == "__main__":
main(sys.argv[1:])
As stated in your question, this program will give poor performance because it creates several new TensorFlow graph nodes per operation. The underlying assumption in TensorFlow is (approximately) that you'll build a graph once and then call sess.run() on (various parts of) it multiple times. The first time you run a graph is relatively expensive, because TensorFlow has to build various data structures and optimize the execution of the graph across multiple devices.
However, TensorFlow caches this work, so subsequent uses are much cheaper.
You can make this program much faster by constructing the graph once and using (e.g.) a tf.placeholder() op to feed in the value that changes in each iteration. For example, the following should do the trick:
B = np.zeros(np.size(tspan), dtype=np.float64)
B[0] = b
# Define the TensorFlow graph once and reuse it in each iteration of the for loop.
r_placeholder = tf.placeholder(tf.float32, shape=[])
out_t = tf.trace(tf.random_normal((4, 4), r_placeholder, 1.0))
with tf.Session() as sess:
for i, t in enumerate(tspan):
r = np.random.rand(1)
if r > a:
c = sess.run(out_t, feed_dict={r_placeholder: r})
else:
c = 0.0
B[i] = c
return B
You could potentially make this even more efficient by using a TensorFlow loop and making fewer calls to sess.run(), but the general principle is the same: reuse same the graph multiple times to get the benefit of TensorFlow.