Tensorflow slow inference speed in a loop - tensorflow

I am working on a reinforcement learning implementation using TensorFlow. After profiling the training procedure, I found something really weird.
The following code runs inside the training loop:
state_batch, \
    action_batch, \
    reward_batch, \
    next_state_batch, \
    is_episode_finished_batch = self.data_manager.get_next_batch()
state_batch = np.divide(state_batch, 10.0)
next_state_batch = np.divide(next_state_batch, 10.0)
# Calculate y for the td_error of the critic
y_batch = []
next_action_batch = self.actor_network.target_evaluate(
    next_state_batch, action_batch)
q_value_batch = self.critic_network.target_evaluate(
    next_state_batch, next_action_batch)
for i in range(0, self.batch_size):
    if is_episode_finished_batch[i]:
        y_batch.append([reward_batch[i]])
    else:
        y_batch.append(reward_batch[i] + GAMMA * q_value_batch[i])
# Now that we have the y batch, train the critic
self.critic_network.train(y_batch, state_batch, action_batch)
# Then get the action gradient batch and adapt the gradient with the gradient inverting method
action_batch_for_gradients = self.actor_network.evaluate(
    state_batch, action_batch)
q_gradient_batch = self.critic_network.get_action_gradient(
    state_batch, action_batch_for_gradients)
q_gradient_batch = self.grad_inv.invert(
    q_gradient_batch, action_batch_for_gradients)
# Now we can train the actor
self.actor_network.train(q_gradient_batch, state_batch, action_batch)
actor_network and critic_network are two classes that implement the actor and the critic of an actor-critic algorithm. Each has its own network and operations, but both live in the same graph and run within the same session. Each member function (evaluate, train, ...) wraps a session.run call and feeds in the data passed as arguments.
I observed that computing action_batch_for_gradients is extremely slow: a single inference takes 0.x seconds, much slower even than self.critic_network.train. Yet it is simply a forward pass through the actor network to get an action. When I duplicated that line, I found that only the first call, the one right after self.critic_network.train, is slow; the second runs at the normal speed of a forward pass. I suspect it has something to do with switching, within one graph, between training one network and running a forward pass in another, but I can't tell how to avoid it.
I found some discussions on Stack Overflow about speeding up TensorFlow by reusing the same graph in the loop instead of building a new one each time. But I already build the graph beforehand and only run different parts of it in the training loop, so I don't know what I am doing wrong in this loop. I am using TensorFlow 1.6.
I would appreciate your help!
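To pin down which call is slow, and whether only the first call after a training step is affected, each session.run wrapper can be timed separately. The following is a minimal sketch using only the standard library; CallTimer and its labels are hypothetical names, not part of the original code.

```python
import time
from collections import defaultdict


class CallTimer:
    """Collects wall-clock durations per label so repeated calls can be compared."""

    def __init__(self):
        self.durations = defaultdict(list)

    def timed(self, label, fn, *args, **kwargs):
        """Call fn(*args, **kwargs), record its duration under label, return its result."""
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.durations[label].append(time.perf_counter() - start)
        return result

    def report(self):
        """Return the mean duration in seconds for every label."""
        return {label: sum(v) / len(v) for label, v in self.durations.items()}
```

In the loop above one would wrap the two consecutive calls, for example timer.timed("actor_eval_first", self.actor_network.evaluate, state_batch, action_batch) followed by the same call under the label "actor_eval_second", and compare the two means after a few hundred iterations.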

Related

OOM in second round of cross-validation

What I need help with / What I was wondering
I am performing cross-validation using the keras API, and have put all the code to perform one round of CV into a single function. The first round of CV works, but then upon the second round, I get an OOM error upon trying to build the next model.
Why is this happening?
How do I properly do this type of CV from a single python process?
Is there a way to completely flush the GPU/TPU memory to control things like memory fragmentation?
import tensorflow as tf
def run_fold_training(k_fold, num_folds, batch_size):
    #clear graph
    tf.keras.backend.clear_session()
    #try to get tpu or else gpu
    try:
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
        print('Device:', tpu.master())
        tf.config.experimental_connect_to_cluster(tpu)
        tf.tpu.experimental.initialize_tpu_system(tpu)
        strategy = tf.distribute.experimental.TPUStrategy(tpu)
    except:
        strategy = tf.distribute.get_strategy()
    print('Number of replicas:', strategy.num_replicas_in_sync)
    with strategy.scope():
        # make k-fold dataset
        ds = build_dataset()
        train_ds = ds.enumerate().filter(
            lambda i, ds, num_folds=num_folds, k_fold=k_fold: i % num_folds != k_fold).map(
            lambda i, ds: ds).batch(batch_size)
        test_ds = ds.enumerate().filter(
            lambda i, ds, num_folds=num_folds, k_fold=k_fold: i % num_folds == k_fold).map(
            lambda i, ds: ds).batch(batch_size)
        # make, train, evaluate model
        model = MyModel(**model_kwargs)
        model.compile(**compile_kwargs)
        model.fit(train_ds, epochs=25)
        results = model.evaluate(test_ds, return_dict=True)
    return results["score"]
num_folds = 5
batch_size = 8
cv_loss = sum([run_fold_training(k, num_folds, batch_size) for k in range(num_folds)]) / num_folds
print(f"Final {num_folds}-fold cross validation score is: {cv_loss}")
What I've tried so far
I'm clearing the Keras backend at the start of each CV round, and I'm also creating a new distribution strategy scope per round. I've already tried batch sizes of [1,2,4,8]. For every batch size the first round runs fine, but the second round fails with OOM right at its start.
It would be nice if...
It would be great if there were lower-level control over memory management. This could come in tiers of complexity; the simplest case would be a function that frees all device memory related to a certain graph. In TF1 I would have just made a new session per CV round, and this wouldn't be a problem.
Environment information
(if applicable)
Operating System: ubuntu 18.04
Python version: 3.8
Docker: tensorflow/tensorflow:2.3.1-gpu
The answer was discovered by a friend. If there are references to graph ops or variables created outside the run_fold_training function, then clear_session will not completely work. The solution is to make sure that an entirely new graph is created after clear_session. For example, don't reuse optimizers, etc.
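One way to enforce this is to pass factories into the fold function instead of ready-made objects, so that nothing built before the reset can leak into the next round. The sketch below is framework-free; run_fold, cross_validate and the parameter names are hypothetical. In real code reset_backend would be tf.keras.backend.clear_session and the factories would return Keras models and optimizers.

```python
def run_fold(k_fold, build_model, build_optimizer, train_and_score, reset_backend=None):
    """Run one CV fold with every object created from scratch inside the call."""
    if reset_backend is not None:
        reset_backend()  # in real code: tf.keras.backend.clear_session
    # Build everything AFTER the reset so nothing refers to the previous graph.
    model = build_model()
    optimizer = build_optimizer()
    return train_and_score(model, optimizer, k_fold)


def cross_validate(num_folds, **fold_kwargs):
    """Average the scores of num_folds independent folds."""
    scores = [run_fold(k, **fold_kwargs) for k in range(num_folds)]
    return sum(scores) / num_folds
```

Applied to the question's code, this would mean that compile_kwargs holds no optimizer or metric instances created at module level; they would be constructed inside run_fold_training on every call.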

Implementing stochastic forward passes in part of a neural network in Keras?

my problem is the following:
I am working on an object detection problem and would like to use dropout during test time to obtain a distribution of outputs. The object detection network consists of a training model and a prediction model, which wraps around the training model. I would like to perform several stochastic forward passes using the training model and combine these e.g. by averaging the predictions in the prediction wrapper. Is there a way of doing this in a keras model instead of requiring an intermediate processing step using numpy?
Note that this question is not about how to enable dropout during test time
def prediction_wrapper(model):
    # Example code.
    # Arguments
    #   model: the training model
    regression = model.outputs[0]
    classification = model.outputs[1]
    predictions = ...  # TODO: perform several stochastic forward passes (dropout during train and test time) here
    avg_predictions = ...  # TODO: combine predictions here, e.g. by computing the mean
    outputs = ...  # TODO: do some processing on avg_predictions
    return keras.models.Model(inputs=model.inputs, outputs=outputs, name=name)
I use keras with a tensorflow backend.
I appreciate any help!
As I understand it, you're trying to average the weight updates for a single sample while Dropout is enabled. Since dropout is random, you would get different weight updates for the same sample.
If this understanding is correct, you could create a batch by duplicating the same sample. Here I am assuming that the dropout mask is different for each sample in a batch. Since backpropagation averages the weight updates anyway, you would get the desired behavior.
If that does not work, you could write a custom loss function and train with a batch size of one. You could update a global counter inside the custom loss function and return a non-zero loss only once you have averaged the updates the way you want. I don't know if this would work; it's just an idea.
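As a framework-free illustration of the batch-duplication idea, applied to forward passes rather than weight updates: the same sample is repeated, each copy gets its own dropout mask, and the outputs are averaged. This is a toy sketch for a single linear unit, not Keras code; all names are hypothetical, and it relies on the same assumption as above, namely that each row of a batch receives an independent mask.

```python
import random


def dropout_linear(x, weights, rate, rng):
    """One stochastic forward pass of a linear unit with inverted dropout on its inputs."""
    keep = 1.0 - rate
    total = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() < keep:      # keep this input with probability `keep`
            total += wi * xi / keep  # rescale so the expected output is unchanged
    return total


def averaged_prediction(x, weights, rate, num_passes, seed=0):
    """Duplicate x into num_passes rows, mask each row independently, average the outputs."""
    rng = random.Random(seed)
    batch = [x] * num_passes
    outputs = [dropout_linear(row, weights, rate, rng) for row in batch]
    return sum(outputs) / num_passes
```

With rate set to 0 the result equals the plain dot product; with a non-zero rate the average approaches it as num_passes grows.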

How does one move data to multiple GPU towers using Tensorflow's Dataset API

We are running multi GPU jobs on Tensorflow and evaluating a migration from the queue based model (using the string_input_producer interface) to the new Tensorflow Dataset API. The latter appears to offer an easier way to switch between Train and Validation, concurrently.
A snippet of code below shows how we are doing this.
train_dataset, train_iterator = get_dataset(train_files, batch_size, epochs)
val_dataset, val_iterator = get_dataset(val_files, batch_size, epochs)
is_validating = tf.placeholder(dtype=bool, shape=())
next_batch = tf.cond(is_validating,
                     lambda: val_iterator.get_next(),
                     lambda: train_iterator.get_next())
validation_tower = self.num_gpus - 1
tower_grads = []
for i in range(self.num_gpus):
    with tf.variable_scope(tf.get_variable_scope(),reuse=(i > 0)):
        with tf.device('/gpu:%d' % i), tf.name_scope('%s_%d' % ('gpu_', i)) as scope:
            if i == validation_tower:
                images, labels = next_batch
                # Loss funcs snipped out
            else:
                images, labels = next_batch
                # Loss funcs snipped out
The get_dataset function builds a dataset, sets a map function and a batch size. It also builds an iterator, but doesn't initialize it. Initialization of the iterator occurs before the session starts.
The is_validating boolean is supplied while the session is running, and every few steps we pass is_validating as True via a feed_dict to use the validation dataset.
The question I have is:
Let's say I have 8 GPUs, so we run training on 7 of them. Does the iterator advance from the same point for each of these 7 GPUs, hence supplying all 7 GPUs with the same data?
At present there are three main options, which have different usability and performance trade-offs:
1) In the Dataset.batch() transform, create a single large batch containing examples for all of your GPUs. Then use tf.split(..., self.num_gpus) on the output of Iterator.get_next() to create sub-batches for each GPU. This is probably the easiest approach, but it does place the splitting on the critical path.
2) In the Dataset.batch() transform, create a mini-batch that is sized for a single GPU. Then call Iterator.get_next() once per GPU to get multiple different batches. (By contrast, in your current code, the same value of next_batch is sent to each GPU, which is probably not what you wanted to happen.)
3) Create multiple iterators, one per GPU. Shard the data using Dataset.shard() early in the pipeline (e.g. on the list of files if your dataset is sharded). Note that this approach will consume more resources on the host, so you may need to dial down any buffer sizes and/or degrees of parallelism.
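To make the difference between splitting and sharding concrete, here is a plain-Python sketch of the two partitioning schemes. split_batch and shard are hypothetical helpers that only mimic, as far as I understand them, what tf.split and Dataset.shard() do to the order of elements; they do not touch TensorFlow.

```python
def split_batch(batch, num_gpus):
    """Splitting: cut one large batch into num_gpus equal, contiguous sub-batches."""
    if len(batch) % num_gpus != 0:
        raise ValueError("batch size must be divisible by the number of GPUs")
    size = len(batch) // num_gpus
    return [batch[i * size:(i + 1) * size] for i in range(num_gpus)]


def shard(elements, num_shards, index):
    """Sharding: keep every num_shards-th element, starting at position index."""
    return [e for position, e in enumerate(elements) if position % num_shards == index]


sub_batches = split_batch(list(range(8)), 4)          # one sub-batch per GPU
files_for_gpu_0 = shard(["a", "b", "c", "d", "e"], 2, 0)  # file list seen by the first iterator
```

In both schemes every GPU receives different examples, which is the property the code in the question lacks.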
Note that the current tf.data pipelines run on the CPU only, and an important aspect of an efficient pipeline is staging your training input to the GPU while the previous step is still running. See the TensorFlow CNN benchmarks for example code that shows how to stage data to GPUs efficiently. We are currently working on adding this support to the tf.data API directly.

How to use evaluation_loop with train_loop in tf-slim

I'm trying to implement a few different models and train them on CIFAR-10, and I want to use TF-slim to do this. It looks like TF-slim has two main loops that are useful during training: train_loop and evaluation_loop.
My question is: what is the canonical way to use these loops?
As a followup: is it possible to use early stopping with train_loop?
Currently I have a model and my training file train.py looks like this
import ...
train_log_dir = ...
with tf.device("/cpu:0"):
    images, labels, dataset = set_up_input_pipeline_with_fancy_prefetching(
        subset='train', ... )
logits, end_points = set_up_model( images )  # Possibly using many GPUs
total_loss = set_up_loss( logits, labels, dataset )
optimizer, global_step = set_up_optimizer( dataset )
train_tensor = slim.learning.create_train_op(
    total_loss,
    optimizer,
    global_step=global_step,
    clip_gradient_norm=FLAGS.clip_gradient_norm,
    summarize_gradients=True)
slim.learning.train(train_tensor,
                    logdir=train_log_dir,
                    local_init_op=tf.initialize_local_variables(),
                    save_summaries_secs=FLAGS.save_summaries_secs,
                    save_interval_secs=FLAGS.save_interval_secs)
Which is awesome so far - my models all train and converge nicely. I can see this from the events in train_log_dir where all the metrics are going in the right direction. And going in the right direction makes me happy.
But I'd like to check that the metrics are improving on the validation set, too. I don't know of any way to do this with TF-Slim that plays nicely with the training loop, so I created a second file, eval.py, which contains my evaluation loop.
import ...
train_log_dir = ...
with tf.device("/cpu:0"):
    images, labels, dataset = set_up_input_pipeline_with_fancy_prefetching(
        subset='validation', ... )
logits, end_points = set_up_model( images )
summary_ops, names_to_values, names_to_updates = create_metrics_and_summary_ops(
    logits,
    labels,
    dataset.num_classes() )
slim.get_or_create_global_step()
slim.evaluation.evaluation_loop(
    '',
    checkpoint_dir=train_log_dir,
    logdir=train_log_dir,
    num_evals=FLAGS.num_eval_batches,
    eval_op=names_to_updates.values(),
    summary_op=tf.merge_summary(summary_ops),
    eval_interval_secs=FLAGS.eval_interval_secs,
    session_config=config)
Questions:
1) I currently have this model for the evaluation_loop hogging up an entire GPU, but it's rarely being used. I assume there's a better way to allocate resources. It would be pretty nice if I could use the same evaluation_loop to monitor the progress of multiple different models (checkpoints in multiple directories). Is something like this possible?
2) There's no feedback between the evaluation and training. I'm training a ton of models and would love to use early stopping to halt the models which aren't learning or are not converging. Is there a way to do this? Ideally using information from the validation set, but if it has to be just based on the training data that's okay, too.
3) Is my workflow all wrong and I should be structuring it differently? It's not clear from the documentation how to use evaluation in conjunction with training.
Update
~~It seems that as of TF r0.11 I'm also getting a segfault when calling slim.evaluation.evaluation_loop. It only happens sometimes (for me when I dispatch my jobs to a cluster). It happens in sv.managed_session--specifically prepare_or_wait_for_session.~~
This was just due to evaluation loop (a second instance of tensorflow) trying to use the GPU, which was already requisitioned by the first instance.
1) evaluation_loop is meant to be used (as you are currently using it) with a single directory. If you want to be more efficient, you could use slim.evaluation.evaluate_once and add your own logic for swapping directories.
2) You can do this by overriding the train_step_fn argument of slim.learning.train(..., train_step_fn). This argument replaces the default 'train_step' function with a custom one. There you can supply a custom training function that returns the 'total_loss' and 'should_stop' values as you see fit.
3) Your workflow looks great; this is probably the most common workflow for learning/eval using TF-Slim.
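The directory-swapping logic around evaluate_once could be organised roughly as below. This is only a sketch: evaluate_new_checkpoints and its parameters are hypothetical, and the two callables are injected so that the bookkeeping can be shown without TensorFlow. In real code latest_checkpoint would be tf.train.latest_checkpoint and evaluate_once would be a small wrapper around slim.evaluation.evaluate_once.

```python
def evaluate_new_checkpoints(checkpoint_dirs, latest_checkpoint, evaluate_once, last_seen):
    """Evaluate each directory's newest checkpoint once, skipping ones already evaluated.

    last_seen maps directory -> checkpoint path evaluated last time; it is updated in place.
    Returns {directory: evaluation result} for the directories that had a new checkpoint.
    """
    results = {}
    for directory in checkpoint_dirs:
        checkpoint = latest_checkpoint(directory)
        if checkpoint is None or checkpoint == last_seen.get(directory):
            continue  # nothing new to evaluate in this directory
        results[directory] = evaluate_once(checkpoint)
        last_seen[directory] = checkpoint
    return results
```

Calling this function periodically from a single process would let one evaluation job, and hence one GPU, serve several training runs.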
Thanks to @kmalakoff, the TensorFlow issue offers a neat solution to the problem of how to validate or test a model during tf.slim training. The main idea is to override the train_step_fn function:
import …
from tensorflow.contrib.slim.python.slim.learning import train_step
...
accuracy_validation = ...
accuracy_test = ...
def train_step_fn(session, *args, **kwargs):
    total_loss, should_stop = train_step(session, *args, **kwargs)
    if train_step_fn.step % FLAGS.validation_every_n_step == 0:
        accuracy = session.run(train_step_fn.accuracy_validation)
        print('your validation info')
    if train_step_fn.step % FLAGS.test_every_n_step == 0:
        accuracy = session.run(train_step_fn.accuracy_test)
        print('your test info')
    train_step_fn.step += 1
    return [total_loss, should_stop]
train_step_fn.step = 0
train_step_fn.accuracy_validation = accuracy_validation
train_step_fn.accuracy_test = accuracy_test
# run training.
slim.learning.train(
    train_op,
    FLAGS.logs_dir,
    train_step_fn=train_step_fn,
    graph=graph,
    number_of_steps=FLAGS.max_steps)
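Building on the same override, early stopping could be sketched as follows. make_early_stopping_step, validation_metric, every_n_steps and patience are hypothetical names; base_step stands for slim's train_step, and validation_metric(session) for something like session.run(accuracy_validation). The stopping rule (stop after patience evaluations without improvement) is an assumption, not something prescribed by TF-Slim.

```python
def make_early_stopping_step(base_step, validation_metric, every_n_steps, patience):
    """Wrap a train-step function so it requests a stop when validation stops improving.

    base_step(session, *args, **kwargs) must return (total_loss, should_stop).
    validation_metric(session) must return a number where higher is better.
    """
    state = {"step": 0, "best": float("-inf"), "bad_evals": 0}

    def train_step_fn(session, *args, **kwargs):
        total_loss, should_stop = base_step(session, *args, **kwargs)
        state["step"] += 1
        if state["step"] % every_n_steps == 0:
            metric = validation_metric(session)
            if metric > state["best"]:
                state["best"] = metric
                state["bad_evals"] = 0
            else:
                state["bad_evals"] += 1
            if state["bad_evals"] >= patience:
                should_stop = True  # ask the training loop to exit
        return total_loss, should_stop

    return train_step_fn
```

The returned function would be passed as train_step_fn to slim.learning.train in place of the hand-written one.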
Adding my 2 cents:
"I currently have this model for the evaluation_loop hogging up an entire GPU, but it's rarely being used"
Usually an evaluation model takes less GPU memory. You can prevent TF from claiming the whole GPU memory by setting allow_growth to True in the session config. This way you can use the same GPU for both training and evaluation.
Example # Training
session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True
slim.learning.train(train_tensor,
                    logdir=train_log_dir,
                    local_init_op=tf.initialize_local_variables(),
                    save_summaries_secs=FLAGS.save_summaries_secs,
                    save_interval_secs=FLAGS.save_interval_secs,
                    session_config=session_config)
Example # validation
session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True
slim.evaluation.evaluation_loop(
    '',
    checkpoint_dir=train_log_dir,
    logdir=train_log_dir,
    num_evals=FLAGS.num_eval_batches,
    eval_op=names_to_updates.values(),
    summary_op=tf.merge_summary(summary_ops),
    eval_interval_secs=FLAGS.eval_interval_secs,
    session_config=session_config)

Tensorflow: Which graph statements are executed after the graph is built?

In Tensorflow, which statements within a graph definition block are executed only to build the graph vs. which are executed during training? For example:
with tf.Graph().as_default():
    weightsLayer1 = tf.Variable(tf.truncated_normal([nInputUnits, nOutputUnits]))
    weightsLayer1 = tf.div(weightsLayer1, tf.sqrt(tf.to_float(nInputUnits)))
    biasesLayer1 = tf.Variable(tf.zeros([nUnitsHiddenLayer1]))
    layer1output = tf.tanh(tf.matmul(images_placeholder, weightsLayer1) + biasesLayer1)
Intuitively, I assume the lines defining weightsLayer1 and biasesLayer1 are executed only once at startup, since they initialize weights and biases, whereas the line computing layer1output executes at every training step, since layer1output is used downstream to compute the loss that the optimizer minimizes. So how does TensorFlow know, during training, to execute only the last line and not the previous ones (which would re-initialize the weights and biases)?
You as the user are actually telling tensorflow which operations to run. During training, you typically tell tensorflow to execute operations that are provided by an optimizer. This looks something like this:
opt = tf.train.GradientDescentOptimizer(0.01)
train_step = opt.minimize(loss)
for i in range(100):
    sess.run(train_step, feed_dict=...)
Calling opt.minimize adds to the computation graph the gradients w.r.t. the trainable variables, as well as operations that update the variables using those gradients. train_step is in fact these update operations grouped using tf.group. If you (the user) run train_step, TensorFlow figures out which parts of the computation graph it needs to run in order to execute these operations.
Likewise, if you do something like sess.run(fetches=loss, feed_dict=...), you are asking tensorflow to execute all operations in the graph that are necessary to compute loss.
Finally, initialization operations like the one in weightsLayer1 = tf.Variable(tf.truncated_normal([nInputUnits, nOutputUnits])) are usually run by sess.run(tf.initialize_all_variables()).
Edit: After re-reading your question, I want to be more clear about one aspect. No operations are actually executed by the graph definition code you provided. Tensorflow operations are executed if and only if you start a session and request the execution of parts of your graph. As stated above, that includes the initialization operations.
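The separation between building a graph and running it can be imitated in a few lines of plain Python. The toy below is not how TensorFlow is implemented, and it does not model variables that keep state between runs; Node and run are hypothetical names. It only shows that constructing nodes computes nothing, and that a run executes just the nodes the requested output depends on.

```python
class Node:
    """A deferred operation: stores a function and its input nodes, computes nothing yet."""

    def __init__(self, name, fn, *inputs):
        self.name, self.fn, self.inputs = name, fn, inputs


def run(fetch, executed):
    """Evaluate fetch and only the nodes it depends on, logging each executed node's name."""
    cache = {}

    def evaluate(node):
        if node not in cache:
            args = [evaluate(dep) for dep in node.inputs]
            executed.append(node.name)
            cache[node] = node.fn(*args)
        return cache[node]

    return evaluate(fetch)


executed = []
# Graph construction: these lines only create Node objects.
weights = Node("weights", lambda: 2.0)
inputs = Node("inputs", lambda: 3.0)
output = Node("output", lambda w, x: w * x, weights, inputs)
unused = Node("unused", lambda w: w + 1.0, weights)
nothing_ran_while_building = (executed == [])  # True

# Execution: only the dependencies of `output` run; `unused` is never touched.
result = run(output, executed)  # 6.0; executed is now ["weights", "inputs", "output"]
```

Here run plays the role of sess.run, and the Node constructors play the role of the lines inside the with tf.Graph().as_default() block.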