Optimize Tensorflow for multiple runs

I need to execute the statement, sess.run() multiple times.
I create the sess once at the beginning of my code. However, each sess.run() statement takes almost 0.5-0.8 seconds on my CPU machine. Is there any way I can optimize this? Since Tensorflow does lazy loading, is there any way I can make it not do it, and make this faster?
I'm using the Inception model from the image classifying example.
import tensorflow as tf
from tensorflow.python.platform import gfile

def load_network():
    # Read the frozen Inception graph from disk.
    with gfile.FastGFile('model.pb', 'rb') as f:
        graph_def = tf.GraphDef()
        data = f.read()
        graph_def.ParseFromString(data)
    # Feed PNG bytes in place of the graph's own JPEG decoder.
    png_data = tf.placeholder(tf.string, shape=[])
    decoded_png = tf.image.decode_png(png_data, channels=3)
    _ = tf.import_graph_def(graph_def, name='', input_map={'DecodeJpeg': decoded_png})
    return png_data

def get_pool3(sess, png_data, imgBuffer):
    pool3 = sess.graph.get_tensor_by_name('pool_3:0')
    pool3Vector = sess.run(pool3, {png_data: imgBuffer.getvalue()})
    return pool3Vector

def main():
    sess = getTensorSession()
    png_data = load_network()
    # The below line needs to be called multiple times, which is what takes
    # nearly 0.5-0.8 seconds.
    # imgBuffer contains the stored value of the image.
    pool3 = get_pool3(sess, png_data, imgBuffer)

Tensorflow runs operations lazily --- nothing is actually computed until sess.run() is called. When you call sess.run(), Tensorflow executes all of the operations in your computation graph. So if sess.run() is taking 0.5-0.8 seconds, it is likely that your computation itself is taking 0.5-0.8s.
(There is some overhead to sess.run(), but it shouldn't be anywhere near the order of half a second.)
Hope that helps!
Added:
Here are some things you might look into to speed your computation up:
use Tensorflow's profiling tools to look at what part of your computation is taking the time. They are not documented yet, but you can find some information about them in this github issue: https://github.com/tensorflow/tensorflow/issues/1824
make your computation cheaper --- reduce the complexity of your model, use smaller images, etc.
run your computation on a GPU instead of CPU.
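Before reaching for the profiler, it can help to check whether every call is slow or only the first one, since a first call can include one-time setup cost. Below is a minimal timing sketch in plain Python. The function run_once is a hypothetical stand-in for something like `lambda: sess.run(pool3, {png_data: image_bytes})`, and the helper names are my own, not part of TensorFlow.

```python
import time

def time_calls(fn, n_calls=10):
    """Return the wall-clock duration of each of n_calls invocations of fn."""
    durations = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations

def summarize(durations):
    """Separate the first (possibly warm-up) call from the steady-state mean."""
    first, rest = durations[0], durations[1:]
    steady = sum(rest) / len(rest) if rest else first
    return {"first_call": first, "steady_state_mean": steady}

# Hypothetical stand-in for the real sess.run(...) call being measured.
def run_once():
    sum(i * i for i in range(10000))

stats = summarize(time_calls(run_once, n_calls=5))
print(stats)
```

If the steady-state mean is still around half a second, the time is going into the computation itself, which is what the answer above suggests.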

Related

Why parallel_iterations in dynamic_rnn doesn't work?

I'm wondering how to use the dynamic_rnn function and make it parallel. I set gpu_options.allow_growth = True and use tf.nn.dynamic_rnn(rnn_cell, inputs=X, dtype=tf.float32, time_major=False, parallel_iterations=50) to do so. But neither the GPU memory consumption nor the run time changes when I change the value of parallel_iterations.
It is a very simple rnn, so I think there may not be data dependency.
basic_cell = BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32, parallel_iterations=50)
logits = fully_connected(states, n_outputs, activation_fn=None)
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(cross_entropy)
optimizer = tf.train.AdamOptimizer(learning_rate)
train_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
Thanks in advance! I appreciate any suggestion.
Your observations don't mean that parallel_iterations don't work.
Whenever you have an RNN, you have a data dependency since the output of n'th step is fed into (n+1)'th step. In your example with BasicRNNCell, every computation is effectively dependent on previous computation. So, there are basically no opportunities to run multiple steps in parallel. With more complex cells you might have some computation in each step that is independent of previous steps (e.g. doing some attention over constant memory). In such cases there are opportunities for parallel execution of different steps.
Even if your model would allow for parallel execution, you might not be able to see it reflected in memory usage. Memory usage depends on many factors, including when TF returns memory to the GPU; if you are computing gradients, you might need to keep most activations in memory whether you run iterations in parallel or not; the iterations that are run in parallel might not produce a lot of tensors; etc.
Similarly for CPU, if running stuff in parallel always helped performance, we would run a thousand threads in every process. parallel_iterations is simply a knob that is useful to have in some cases.
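To make the data-dependency point concrete, here is a small plain-Python sketch of a scalar RNN (not TensorFlow code; the function names and the scalar weights w_x and w_h are made up for illustration). The input projections do not depend on the state and could be computed for all time steps at once, but the recurrence itself has to run in order.

```python
import math

def rnn_sequential(xs, w_x, w_h, h0=0.0):
    """Scalar RNN: each step needs the previous state, so steps run in order."""
    h = h0
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
    return h

def rnn_with_precomputed_inputs(xs, w_x, w_h, h0=0.0):
    """Same recurrence, with the state-independent work hoisted out.

    The projections w_x * x are independent across time steps, so they are
    the only part that could run in parallel. The loop over the state h is
    still strictly sequential.
    """
    projected = [w_x * x for x in xs]  # independent across time steps
    h = h0
    for p in projected:                # depends on the previous h
        h = math.tanh(p + w_h * h)
    return h

xs = [0.5, -1.0, 0.25, 2.0]
print(rnn_sequential(xs, 0.7, 0.3))
print(rnn_with_precomputed_inputs(xs, 0.7, 0.3))
```

Both functions give the same final state. With a basic cell, almost all of the per-step work sits inside the sequential loop, which is why changing parallel_iterations has little visible effect.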

Is it possible to loop through all minibatches in a single tensorflow op using dataset/iterators?

I'm working with tf.data.dataset/iterator mechanism and trying to improve data loading performance. It occurred to me that offloading the entire minibatch loop from Python might help. My data is small enough that storing on CPU or GPU is no problem.
So, Is it possible to loop an optimizer node over a full minibatched epoch within a call to session.run?
The tensor returned by iterator.get_next() is only incremented once per session.run, which would seem to make it impossible to iterate through a dataset of minibatches... but if it could be done, my CPU would only have to touch the Python thread once per epoch.
UPDATE: @muskrat's suggestion to use tf.slice can be used for this purpose. See my subsequent non-answer with a schematic implementation of this using tf.while_loop. However, the question is whether this can be accomplished using dataset/iterators... and I'd still like to know.
From the description it seems that you already have the dataset preloaded as a constant on CPU/GPU, like at this example. That's certainly the first step.
Second, I suggest using tf.slice() to replicate the effect of the minibatch operation. In other words, just manually slice minibatches out of the preloaded constant (your dataset), and you should get the desired behavior. See for example the slice docs or this related post.
If that's not enough detail, please edit your question to include a code example (with mnist or something) and I can give more details.
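As a rough illustration of the slicing idea, the sketch below computes, in plain Python, the (begin, size) pairs that one would pass to tf.slice for each minibatch. The helper name minibatch_slices is hypothetical, and dropping the final incomplete batch is an assumption on my part, not something the suggestion requires.

```python
def minibatch_slices(num_examples, batch_size):
    """Return (begin, size) pairs for every complete minibatch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    slices = []
    batch_index = 0
    # Stop as soon as the next batch would run past the end of the data.
    while (batch_index + 1) * batch_size <= num_examples:
        slices.append((batch_index * batch_size, batch_size))
        batch_index += 1
    return slices

data = list(range(10))
for begin, size in minibatch_slices(len(data), 3):
    # A Python list slice stands in for tf.slice(data, [begin, 0], [size, -1]).
    print(data[begin:begin + size])
```

With 10 examples and a batch size of 3 this yields three batches and leaves the last example unused.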
This "answer" is an implementation of muskrat's tf.slice suggestion with the details of tf.while_loop worked out (with help from How to use tf.while_loop() in tensorflow and https://www.tensorflow.org/api_docs/python/tf/while_loop).
Unless your data and model are small enough that you're bottlenecked by Python I/O (like me!), this solution is probably academic.
Advantages:
Trains over minibatches without returning to the Python thread.
Uses only ops that have GPU implementations meaning that the entire graph can be placed in the GPU.
On my small dataset, which is presumably bottlenecked by Python I/O, this solution is twice the speed of my dataset/iterator (which touches Python once per minibatch) and four times the speed of passing minibatches through feed_dict.
Disadvantages:
tf.while_loop is treacherous. It's challenging to understand when ops inside the loop's body are evaluated and when the ops they depend on are evaluated, particularly given the (thin) official documentation and limited Stack Overflow coverage.
The missing documentation of tf.while_loop is that tensors outside the body of the loop are only evaluated once, even if inner ops depend on them. This means that optimization, model, and loss have to be defined in the loop. This limits flexibility if you'd like to e.g. be able to call validation loss ops between training epochs. Presumably this could be accomplished with tf.cond statements and the appropriate flags passed in via feed_dict. But not nearly as flexible or elegant as the dataset/iterator mechanism in tf.data.
Adding shuffling operations at each Epoch doesn't seem available on GPU.
Here's my schematic code (I've omitted the variable and model definition for brevity):
def buildModel(info, training_data, training_targets):
    graph = tf.Graph()
    with graph.as_default():
        # batch_size is passed in from Python once per epoch.
        batch_size = tf.placeholder(tf.float32, name='batch_size')
        # Initializers for loop variables for tf.while_loop
        batchCounter = tf.Variable(0, dtype=tf.float32, trainable=False)
        lossList = tf.Variable(tf.zeros([0, 1]), trainable=False)
        # In a full example, I'd normalize my data here. And possibly shuffle
        tf_training_data = tf.constant(training_data, dtype=tf.float32)
        tf_training_targets = tf.constant(training_targets, dtype=tf.float32)
        # For brevity, I'll spare the definitions of my variables. Because tf.Variables
        # are essentially treated as globals in the model and are manipulated directly (like with tf.apply)
        # they can reside outside runMinibatch, the body of tf.while_loop.
        # weights_1 =
        # biases_1 =
        # etc.

        def moreMinibatches(batchCount, lossList):
            return (batchCount + 1) * batch_size <= len(training_data)

        def runMinibatch(batchCount, lossList):
            # These tensors and ops have to be defined inside runMinibatch, otherwise they're not updated as tf.while_loop loops. This means
            # slices, model definition, loss tensor, and training op.
            dat_batch = tf.slice(tf_training_data, [tf.cast(batchCount * batch_size, tf.int32), 0], [tf.cast(batch_size, tf.int32), -1])
            targ_batch = tf.slice(tf_training_targets, [tf.cast(batchCount * batch_size, tf.int32), 0], [tf.cast(batch_size, tf.int32), -1])
            # Here's where you'd define the model as a function of weights and biases above and dat_batch
            # model = <insert here>
            loss = tf.reduce_mean(tf.squared_difference(model, targ_batch))
            optimizer = tf.train.AdagradOptimizer()  # for example (a learning_rate argument is required in practice)
            train_op = optimizer.minimize(loss, name='optimizer')
            # control_dependencies ensures that train_op is run before return
            # even though the return values don't explicitly depend on it.
            with tf.control_dependencies([train_op]):
                return batchCount + 1, tf.concat([lossList, [[loss]]], 0)

        # So, the idea is that this trains a full epoch without returning to Python.
        trainAllMinibatches = tf.while_loop(moreMinibatches, runMinibatch, [batchCounter, lossList],
                                            shape_invariants=[batchCounter.get_shape(), tf.TensorShape(None)])
    return (graph,
            {'trainAllMinibatches': trainAllMinibatches,
             'minibatchCounter': batchCounter,
             'norm_loss': norm_loss,  # not defined in this schematic excerpt
            })

numEpochs = 100  # e.g.
minibatchSize = 32
# training_dataset = <data here>
# training_targets = <targets here>
graph, ops = buildModel(info, training_dataset, training_targets)
with tf.Session(graph=graph, config=config) as session:
    tf.global_variables_initializer().run()
    for i in range(numEpochs):
        # This op trains on all complete minibatches that fit in the full dataset. finalBatchCount will be the number of
        # complete minibatches in the dataset. lossList holds the loss of each minibatch step.
        finalBatchCount, lossList = session.run(ops['trainAllMinibatches'],
                                                feed_dict={'batch_size:0': minibatchSize})
        print('minibatch losses at Epoch', i, ': ', lossList)
I implemented the tf.slice() and tf.while_loop approach suggested above to vectorize the mini-batch loop.
In my case it was about 1.86 times faster than feeding mini-batches through feed_dict, but I found that the loss values did not stabilize from epoch to epoch.
When I changed it to tf.random_shuffle the inputs every epoch, the problem was largely mitigated (and the performance gain dropped to 1.68 times).

Using tf.cond() to feed my graph for training and validation

In my TensorFlow code I want my network to take inputs from one of the two StagingArea objects depending upon whether I want to do training or testing.
A part of the graph construction code I wrote is as follows :
with tf.device("/gpu:0"):
    for i in range(numgpus):
        with tf.variable_scope(tf.get_variable_scope(), reuse=i>0) as vscope:
            with tf.device('/gpu:{}'.format(i)):
                with tf.name_scope('GPU-Tower-{}'.format(i)) as scope:
                    phase = tf.get_variable("phase", [], initializer=tf.zeros_initializer(),dtype=tf.uint8, trainable=False)
                    phaseassigntest = phase.assign(1)
                    phaseassigntrain = phase.assign(0)
                    phasetest = tf.equal(phase, 0)
                    is_training = tf.cond(phasetest, lambda: tf.constant(True), lambda: tf.constant(False))
                    trainstagingarea = tf.contrib.staging.StagingArea([tf.float32, tf.int32], shapes=[[trainbatchsize, 3, 221, 221], [trainbatchsize]], capacity=20)
                    putoptrain = trainstagingarea.put(train_iterator.get_next())
                    trainputop.append(putoptrain)
                    getoptrain = trainstagingarea.get()
                    traingetop.append(getoptrain)
                    trainclearop = trainstagingarea.clear()
                    trainstageclear.append(trainclearop)
                    trainsizeop = trainstagingarea.size()
                    trainstagesize.append(trainsizeop)
                    valstagingarea = tf.contrib.staging.StagingArea([tf.float32, tf.int32], shapes=[[valbatchsize, 3, 221, 221], [valbatchsize]], capacity=20)
                    putopval = valstagingarea.put(val_iterator.get_next())
                    valputop.append(putopval)
                    getopval = valstagingarea.get()
                    valgetop.append(getopval)
                    valclearop = valstagingarea.clear()
                    valstageclear.append(valclearop)
                    valsizeop = valstagingarea.size()
                    valstagesize.append(valsizeop)
                    #elem = valgetop[i]
                    elem = tf.cond(is_training,lambda: traingetop[i],lambda: valgetop[i])
                    img = elem[0]
                    label = elem[1]
                    labelonehot = tf.one_hot(label, depth=numclasses)
                    net, networksummaries = overfeataccurate(img,numclasses=numclasses, phase=is_training)
I have used tf.cond to make sure that the network is fed by one of the two StagingArea objects. One is meant for training and the other one is meant for validation.
Now, when I try to execute the graph as follows, I do not get any result; in fact the code just hangs and I have to kill the process.
with tf.Session(graph=g,config=config) as sess:
    sess.run(init_op)
    sess.run(tf.local_variables_initializer())
    sess.run(val_initialize)
    for i in range(20):
        sess.run(valputop)
    print(sess.run(valstagesize))
    writer = tf.summary.FileWriter('.', graph=tf.get_default_graph())
    epoch = 0
    iter = 0
    print("Performing Validation")
    sess.run(phaseassigntest)
    saver = tf.train.Saver()
    while(epoch<10):
        time_init = time.time()
        while True:
            try:
                [val_accu, _, summaries] = sess.run([towervalidation, towervalidationupdateop,validation_summary_op])
                print(val_accu)
When I directly assign elem = valgetop[i] instead of using tf.cond(), the code works just fine.
Am I missing something over here ?
What is the right way to feed my network based on whether I want to do training or testing ?
NOTE The error does not go away even if I set numgpus to 1.
Your problem
What you think tf.cond does
Based on the flag, execute what is required to put either traingetop[i] or valgetop[i] into your elem tensor.
What tf.cond actually does
Executes what is required to get both traingetop[i] and valgetop[i], then passes one of them into your elem tensor.
So
The reason it is hanging forever is because it's waiting for an element to be added to your training staging area (so that it can get that element and discard it). You're forgiven for not realising this is what it's doing; it's actually very counter-intuitive. The documentation is awfully unclear on how to deal with this.
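The behaviour described above can be mimicked in plain Python. This is only an analogy, not TensorFlow code: get_from is a hypothetical stand-in for a staging-area get op, and cond is a toy conditional. The point it illustrates is that work done before the conditional is called happens regardless of the flag, while work done inside the branch functions happens only for the selected branch. As far as I understand the tf.cond documentation, conditional execution likewise applies only to ops created inside the branch functions.

```python
log = []

def get_from(name):
    """Stand-in for a staging-area get op; records that it ran."""
    log.append(name)
    return name + "-batch"

def cond(flag, true_fn, false_fn):
    """Toy conditional: calls only the branch that is selected."""
    return true_fn() if flag else false_fn()

# Pattern from the question: both "get ops" exist before cond is called,
# and the branch functions merely return them. Both therefore run.
train_value = get_from("train")
val_value = get_from("val")
picked = cond(False, lambda: train_value, lambda: val_value)
eager_log = list(log)

# Alternative: the work is done inside the branch, so only one side runs.
log.clear()
picked_lazy = cond(False, lambda: get_from("train"), lambda: get_from("val"))
lazy_log = list(log)

print(eager_log)  # ['train', 'val']
print(lazy_log)   # ['val']
```

In the first pattern an empty training source would block even though only the validation value is used, which matches the hang described in the question.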
Recommended Solution (by Tensorflow documentation)
If you really need the queues to be in the same graph, then you need to make two copies of your ENTIRE graph, one that is fed by your training staging area, and one that is fed by your validation staging area. Then you just use the relevant tensor in your sess.run call. I recommend creating a function that takes a queue output tensor, and returns a model_output tensor. Now you have a train_time_output tensor and a validation_time_output tensor, and you can choose which one you want to execute in your sess.run.
Warning
You need to make sure that you aren't actually creating new variables to go along with these new ops. To do that take a look at the latest documentation on variables. It looks like they've simplified it from v0.12, and it essentially boils down to using tf.get_variable instead of tf.Variable to create your variables.
My preferred work around
Although that is the recommended solution (AFAIK), it is extremely unsatisfying to me; you're creating a whole other set of operations on the graph that just happen to use the same weights. It seems like there's a lot of potential for programmer error by abusing the separation between train time and test/validation time (resulting in the model acting unexpectedly different at these times). Worse; it doesn't solve the problem of tf.cond demanding the values for inputs to both branches, it just forces you to copy your whole graph, which is not always possible.
I prefer to just not have my queues in the graph like that, and treat the model as a function which can be fed an example without caring where it's from. That is, I would instantiate the model with a tf.placeholder as the input, and at execution time I would use feed_dict to actually provide the value. It would function something like this
# inside main training loop
if time_to_train:
    example = sess.run(traingetop)
else:
    example = sess.run(valgetop)
result = sess.run(model_output, {input_placeholder: example})
It's very useful to note that you can use the feed_dict to feed any value for any tensor anywhere in your model. So, you can change any model definition that, due to tf.cond would always require an input, like:
a = tf.constant(some_value)
b = tf.placeholder(tf.float32)
flag = tf.placeholder(tf.bool, [])
one_of_them = tf.cond(flag, lambda: a, lambda: b)
model_output = build_graph(one_of_them)
Into a definition that doesn't, like:
a = tf.constant(some_value)
model_output = build_graph(a)
Remembering that you can always overwrite what a is at execution time:
# In main training loop,
sess.run(train_op, {a: some_other_value})
This essentially pushes the conditional into native python land. In your code you might end up with something like:
if condition_satisfied:
    sess.run(train_op, {a:some_other_value})
else:
    sess.run(train_op)
Performance concerns
If you are using tensorflow on a single machine, the performance cost of this solution is usually small. Note, though, that sess.run returns the example as NumPy array/s in host memory, so each example is copied off the device and then fed back in on the next call.
If you are using tensorflow in a distributed fashion, then this solution would kill your performance; it would require sending the example from whatever machine it's on to the master so that it can send it back.

How to use evaluation_loop with train_loop in tf-slim

I'm trying to implement a few different models and train them on CIFAR-10, and I want to use TF-slim to do this. It looks like TF-slim has two main loops that are useful during training: train_loop and evaluation_loop.
My question is: what is the canonical way to use these loops?
As a followup: is it possible to use early stopping with train_loop?
Currently I have a model and my training file train.py looks like this
import ...
train_log_dir = ...
with tf.device("/cpu:0"):
    images, labels, dataset = set_up_input_pipeline_with_fancy_prefetching(
        subset='train', ... )
logits, end_points = set_up_model( images )  # Possibly using many GPUs
total_loss = set_up_loss( logits, labels, dataset )
optimizer, global_step = set_up_optimizer( dataset )
train_tensor = slim.learning.create_train_op(
    total_loss,
    optimizer,
    global_step=global_step,
    clip_gradient_norm=FLAGS.clip_gradient_norm,
    summarize_gradients=True)
slim.learning.train(train_tensor,
                    logdir=train_log_dir,
                    local_init_op=tf.initialize_local_variables(),
                    save_summaries_secs=FLAGS.save_summaries_secs,
                    save_interval_secs=FLAGS.save_interval_secs)
Which is awesome so far - my models all train and converge nicely. I can see this from the events in train_log_dir where all the metrics are going in the right direction. And going in the right direction makes me happy.
But I'd like to check that the metrics are improving on the validation set, too. I don't know of any way to do with TF-slim in a way that plays nicely with the training loop, so I created a second file called eval.py which contains my evaluation loop.
import ...
train_log_dir = ...
with tf.device("/cpu:0"):
    images, labels, dataset = set_up_input_pipeline_with_fancy_prefetching(
        subset='validation', ... )
logits, end_points = set_up_model( images )
summary_ops, names_to_values, names_to_updates = create_metrics_and_summary_ops(
    logits,
    labels,
    dataset.num_classes() )
slim.get_or_create_global_step()
slim.evaluation.evaluation_loop(
    '',
    checkpoint_dir=train_log_dir,
    logdir=train_log_dir,
    num_evals=FLAGS.num_eval_batches,
    eval_op=names_to_updates.values(),
    summary_op=tf.merge_summary(summary_ops),
    eval_interval_secs=FLAGS.eval_interval_secs,
    session_config=config)
Questions:
1) I currently have this model for the evaluation_loop hogging up an entire GPU, but it's rarely being used. I assume there's a better way to allocate resources. It would be pretty nice if I could use the same evaluation_loop to monitor the progress of multiple different models (checkpoints in multiple directories). Is something like this possible?
2) There's no feedback between the evaluation and training. I'm training a ton of models and would love to use early stopping to halt the models which aren't learning or are not converging. Is there a way to do this? Ideally using information from the validation set, but if it has to be just based on the training data that's okay, too.
3) Is my workflow all wrong and I should be structuring it differently? It's not clear from the documentation how to use evaluation in conjunction with training.
Update
~~It seems that as of TF r0.11 I'm also getting a segfault when calling slim.evaluation.evaluation_loop. It only happens sometimes (for me when I dispatch my jobs to a cluster). It happens in sv.managed_session--specifically prepare_or_wait_for_session.~~
This was just due to evaluation loop (a second instance of tensorflow) trying to use the GPU, which was already requisitioned by the first instance.
1) evaluation_loop is meant to be used (as you are currently using it) with a single directory. If you want to be more efficient, you could use slim.evaluation.evaluate_once and add the appropriate logic for swapping directories as you find appropriate.
2) You can do this by overriding the slim.learning.train(..., train_step_fn) argument. This argument replaces the 'train_step' function with a custom function. Here, you can supply a custom training function which returns the 'total_loss' and 'should_stop' values as you see fit.
3) Your workflow looks great; this is probably the most common workflow for learning/eval using TF-Slim.
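For the early-stopping part, the decision logic itself can be kept separate from TF-Slim. The following is a minimal sketch in plain Python under my own assumptions: the EarlyStopper class, its patience and min_delta parameters, and the idea of feeding it a validation loss are all hypothetical and not part of the TF-Slim API. A custom train_step_fn could call update() with a freshly computed loss and combine the result with the should_stop value it returns.

```python
class EarlyStopper:
    """Signal a stop when the loss has not improved for `patience` checks."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_checks = 0

    def update(self, loss):
        """Record a new loss value; return True if training should stop."""
        if loss < self.best - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best = loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience

stopper = EarlyStopper(patience=2)
for step, loss in enumerate([1.0, 0.8, 0.81, 0.79, 0.80, 0.82]):
    if stopper.update(loss):
        print("stop at step", step)
        break
```

Keeping the state in a small object avoids attaching counters to the step function itself and makes the stopping rule easy to test on its own.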
Thanks to @kmalakoff, the TensorFlow issue gives a neat solution to the problem of how to validate or test a model during tf.slim training. The main idea is to override the train_step_fn function:
import …
from tensorflow.contrib.slim.python.slim.learning import train_step
...
accuracy_validation = ...
accuracy_test = ...

def train_step_fn(session, *args, **kwargs):
    total_loss, should_stop = train_step(session, *args, **kwargs)
    if train_step_fn.step % FLAGS.validation_every_n_step == 0:
        accuracy = session.run(train_step_fn.accuracy_validation)
        print('your validation info')
    if train_step_fn.step % FLAGS.test_every_n_step == 0:
        accuracy = session.run(train_step_fn.accuracy_test)
        print('your test info')
    train_step_fn.step += 1
    return [total_loss, should_stop]

train_step_fn.step = 0
train_step_fn.accuracy_validation = accuracy_validation
train_step_fn.accuracy_test = accuracy_test
# run training.
slim.learning.train(
    train_op,
    FLAGS.logs_dir,
    train_step_fn=train_step_fn,
    graph=graph,
    number_of_steps=FLAGS.max_steps)
Adding my 2-cent:
I currently have this model for the evaluation_loop hogging up an
entire GPU, but it's rarely being used
Usually an evaluation model takes less GPU memory. You could prevent TF from hogging the whole GPU memory by setting the session config allow_growth to True. This way you can use the same GPU for both training and evaluation
Example
# Training
session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True
slim.learning.train(train_tensor,
                    logdir=train_log_dir,
                    local_init_op=tf.initialize_local_variables(),
                    save_summaries_secs=FLAGS.save_summaries_secs,
                    save_interval_secs=FLAGS.save_interval_secs,
                    session_config=session_config)
Example
# Validation
session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True
slim.evaluation.evaluation_loop(
    '',
    checkpoint_dir=train_log_dir,
    logdir=train_log_dir,
    num_evals=FLAGS.num_eval_batches,
    eval_op=names_to_updates.values(),
    summary_op=tf.merge_summary(summary_ops),
    eval_interval_secs=FLAGS.eval_interval_secs,
    session_config=session_config)

Run train op multiple times in tensorflow

I have some fairly large batch sizes on which I'd like to take multiple gradient steps. While I could easily do this with a python for loop, I imagine that there might be a more efficient method that doesn't involve transferring the data to gpu on each iteration. I've tried putting the train op in the fetch list multiple times, but I'm not sure that it's actually being run more than once (the runtime is exactly the same).
If your batches vary in size, a Variable is a bad fit for storing them; you could instead persist the data between run calls using persistent tensors. Here's a toy example:
dt = tf.int32
params = tf.Variable(tf.ones_initializer((), dtype=dt))
data_batches = [[1], [2, 3], [4, 5, 6]]
# op that uploads data to TF and saves it as a persistent Tensor
data_saver_placeholder = tf.placeholder(dt)
tensor_handle_op = tf.get_session_handle(data_saver_placeholder)
data_placeholder, data = tf.get_session_tensor(dt)
train_op = tf.assign_add(params, tf.reduce_prod(data))
init_op = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init_op)
for batch in data_batches:
    # upload tensor to TF runtime and save its handle
    tensor_handle = sess.run(tensor_handle_op, feed_dict={data_saver_placeholder: batch})
    # run train op several times reusing same data
    for i in range(3):
        sess.run(train_op, feed_dict={data_placeholder: tensor_handle.handle})
assert sess.run(params) == 382
If you do sess.run([myop,myop]) that'll only run myop once.
If you want to run the op but not fetch its results into the Python runtime, you could use a control dependency. A simple way to do this is with a group op, i.e.
sess.run(tf.group(myop))
sess.run(tf.group(myop))
If your graph is large you may get an extra overhead by constructing group op between run calls (maybe 10-100ms for >10k node graph), so you could construct it ahead of time
myop_nooutput = tf.group(myop)
sess.run(myop_nooutput)
sess.run(myop_nooutput)
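The statement above that sess.run([myop, myop]) only runs myop once can be pictured with a toy model in plain Python. This is not how TensorFlow is implemented; CountingOp and run are hypothetical names, and the sketch only mirrors the observable behaviour: within one run call each distinct op executes once and its result is returned for every position it was requested in, while separate run calls execute it again.

```python
class CountingOp:
    """Stand-in for a stateful op such as a train step."""

    def __init__(self):
        self.calls = 0

    def execute(self):
        self.calls += 1
        return self.calls

def run(fetches):
    """Toy run(): execute each distinct op once per call, then fan results out."""
    results = {}
    for op in fetches:
        if id(op) not in results:
            results[id(op)] = op.execute()
    return [results[id(op)] for op in fetches]

myop = CountingOp()
print(run([myop, myop]))  # [1, 1]: one execution, returned twice
for _ in range(3):        # separate run calls execute the op each time
    run([myop])
print(myop.calls)         # 4
```

This is why taking several gradient steps needs several run calls (or a loop inside the graph), not a repeated entry in the fetch list.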