OOM in second round of cross-validation - tensorflow

What I need help with / What I was wondering
I am performing cross-validation using the Keras API, and have put all the code for one round of CV into a single function. The first round of CV works, but on the second round I get an OOM error while trying to build the next model.
Why is this happening?
How do I properly do this type of CV from a single python process?
Is there a way to completely flush the GPU/TPU memory to control things like memory fragmentation?
import tensorflow as tf

def run_fold_training(k_fold, num_folds, batch_size):
    # clear graph
    tf.keras.backend.clear_session()
    # try to get tpu or else gpu
    try:
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
        print('Device:', tpu.master())
        tf.config.experimental_connect_to_cluster(tpu)
        tf.tpu.experimental.initialize_tpu_system(tpu)
        strategy = tf.distribute.experimental.TPUStrategy(tpu)
    except:
        strategy = tf.distribute.get_strategy()
    print('Number of replicas:', strategy.num_replicas_in_sync)
    with strategy.scope():
        # make k-fold dataset
        ds = build_dataset()
        train_ds = ds.enumerate().filter(
            lambda i, ds, num_folds=num_folds, k_fold=k_fold: i % num_folds != k_fold).map(
            lambda i, ds: ds).batch(batch_size)
        test_ds = ds.enumerate().filter(
            lambda i, ds, num_folds=num_folds, k_fold=k_fold: i % num_folds == k_fold).map(
            lambda i, ds: ds).batch(batch_size)
        # make, train, evaluate model
        model = MyModel(**model_kwargs)
        model.compile(**compile_kwargs)
        model.fit(train_ds, epochs=25)
        results = model.evaluate(test_ds, return_dict=True)
    return results["score"]

num_folds = 5
batch_size = 8
cv_loss = sum([run_fold_training(k, num_folds, batch_size) for k in range(num_folds)]) / num_folds
print(f"Final {num_folds}-fold cross validation score is: {cv_loss}")
What I've tried so far
I'm clearing the Keras backend at the start of each CV round and I'm also creating a new distribute strategy scope per round. I've already tried batch sizes of [1, 2, 4, 8]. For all batch sizes the first round works fine, but I get an OOM at the start of the next round.
It would be nice if...
It would be great if there were access to lower-level control over memory management. This could come in tiers of complexity. The simplest case would be a function that frees all device memory related to a certain graph. In TF1 I would have just made a new session per CV round, and this wouldn't be a problem.
Environment information
(if applicable)
Operating System: ubuntu 18.04
Python version: 3.8
Docker: tensorflow/tensorflow:2.3.1-gpu

The answer was discovered by a friend. If there are references to graph ops/variables created outside the run_fold_training function, then clear_session will not completely work. The solution is to make sure that the entire new graph is created after clear_session, e.g. don't reuse optimizers, etc.
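For illustration, here is a minimal, self-contained sketch of that fix, with a toy model and random data standing in for the question's MyModel and dataset (those stand-ins are my own, not the original code). Everything that creates graph state, including the optimizer, is built inside the function after clear_session rather than captured from the enclosing scope:

import tensorflow as tf

def run_fold_training_sketch(batch_size):
    tf.keras.backend.clear_session()
    strategy = tf.distribute.get_strategy()
    with strategy.scope():
        # Build everything fresh per call: model, optimizer, loss.
        # Nothing graph-related is reused from the enclosing scope.
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
        optimizer = tf.keras.optimizers.Adam()  # a new optimizer for every fold
        model.compile(optimizer=optimizer, loss="mse")
    x = tf.random.normal([64, 4])
    y = tf.random.normal([64, 1])
    model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)
    return model.evaluate(x, y, verbose=0)

scores = [run_fold_training_sketch(batch_size=8) for _ in range(5)]

With the real model, the same rule applies to anything passed in through compile_kwargs: an optimizer instance created once at module level and reused across folds keeps the old graph alive and defeats clear_session.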

Related

Basic TPU Cross Shard Optimizer Not Working

In general, there are some good examples that use TF optimizers for solving general (non-deep-learning) problems. Given:
https://databricks.com/tensorflow/training-and-convergence
https://colab.research.google.com/notebooks/tpu.ipynb#scrollTo=a_rjVo-RAoYd
We want to be able to combine the two above and make use of TPU-based optimization to solve high-dimensional problems.
To that end, I've got simple Colab code that merges the two examples above:
import os
import pprint
import numpy as np
import tensorflow as tf
from tensorflow.contrib.tpu.python.tpu import tpu_function

if 'COLAB_TPU_ADDR' not in os.environ:
    print('ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!')
else:
    tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
    print('TPU address is', tpu_address)

    with tf.Session(tpu_address) as session:
        devices = session.list_devices()
    print('TPU devices:')
    pprint.pprint(devices)

# Add this somewhere at the top
tpu_function.get_tpu_context().set_number_of_shards(8)

# x and y are placeholders for our training data
x = tf.placeholder("float")
y = tf.placeholder("float")
# w is the variable storing our values. It is initialised with starting "guesses"
# w[0] is the "a" in our equation, w[1] is the "b"
w = tf.Variable([1.0, 2.0, 3.0, 4.0], name="w")
# Our model of y = a*x + b
y_model = tf.multiply(x, w[0]) + w[1] + w[2] + 3

# Our error is defined as the square of the differences
error = tf.square(y - y_model)
# The Gradient Descent Optimizer does the heavy lifting
train_op = tf.train.AdamOptimizer(0.01)
optimizer = tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error)  # TPU change 1

# Normal TensorFlow - initialize values, create a session and run the model
model = tf.global_variables_initializer()
with tf.Session(tpu_address) as session:
    session.run(tf.contrib.tpu.initialize_system())
    print('init')
    session.run(model)
    for i in range(10000):
        print(i)
        x_value = np.random.rand()
        y_value = x_value * 2 + 6 + 5 + 3
        session.run(optimizer, feed_dict={x: x_value, y: y_value})
    w_value = session.run(w)
    print("Predicted model: {a:.3f}x + {b:.3f}+{c:.3f}x + {d:.3f}".format(
        a=w_value[0], b=w_value[1], c=w_value[2], d=w_value[3]))
    session.run(tf.contrib.tpu.shutdown_system())
When I run it (in Colab) as is, it just runs the first loop iteration, printing:
init
0
and then does nothing; Colab just keeps spinning.
If I do not use
optimizer = tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error)
and the other TPU features, then it works fine estimating the w variable.
The questions are:
Why doesn't this work and how can we get the cross shard replicator to optimise this simple function?
How should I shape the variable w to make use of parallel batches/shards on the TPU?
How can we make this even more efficient through use of an equivalent Dataset prefetch operation or using infeed queues?
The goal is to make use of lower-level TPU APIs (without TPUEstimator, for example) to help solve custom problems by leveraging the power of TPUs using only tensors, queues, and shards.
It doesn't work because you are overriding the number of shards without actually splitting the calculations into shards. When I run your code, I get the following error:
InternalError: From /job:tpu_worker/replica:0/task:0:
RET_CHECK failure (platforms/xla/service/jellyfish/lowering/all_reduce_emitter.cc:832) replica_id < target.ReplicaCount() Unexpected replica id in all-reduce, replica_id is 1, target has 1 replicas.
Error encountered while compiling %all-reduce.7 = f32[4]{0:T(256)} all-reduce(f32[4]{0:T(256)} %arg0.1), replica_groups={{0,1,2,3,4,5,6,7}}, to_apply=%sum.3, metadata={op_type="CrossReplicaSum" op_name="CrossReplicaSum_21"}, backend_config="{barrier_type:3}".
It is trying to perform the computations on eight shards and combine the results, but it only has one shard to work with. Take a look at tf.contrib.tpu.shard. It creates a shard context using the given number of shards and distributes a computation over those shards. So, instead of setting the number of shards manually, you can define your variables as usual and then wrap any computations with them in a function to be sharded:
# REMOVE THIS
# tpu_function.get_tpu_context().set_number_of_shards(8)

# x and y are placeholders for our training data
x_placeholder = tf.placeholder("float")
y_placeholder = tf.placeholder("float")

# w is the variable storing our values. It is initialised with starting "guesses"
# w[0] is the "a" in our equation, w[1] is the "b"
w = tf.Variable([1.0, 2.0, 3.0, 4.0], name="w")

# Wrap all of our tensorflow operations in a function we can shard
def calculations(x, y):
    # Our model of y = a*x + b
    y_model = tf.multiply(x, w[0]) + w[1] + w[2] + 3
    # Our error is defined as the square of the differences
    # Average across the entire batch
    error = tf.reduce_mean(tf.square(y - y_model))
    # The Gradient Descent Optimizer does the heavy lifting
    train_op = tf.train.AdamOptimizer(0.01)
    return tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error)

# Shard the function so that its calculation is distributed
optimizer = tf.contrib.tpu.shard(calculations, inputs=[x_placeholder, y_placeholder], num_shards=8)
You don't need to shape w to make use of shards, because sharding occurs across the batch dimension and you only have one set of weights for all inputs. You'll want to add a batch dimension to your inputs so that each batch can be distributed across the cores. shard assumes the first dimension is the batch dimension, but includes an argument to change it if your data is shaped differently. According to the TPU troubleshooting page, the ideal batch size is 1024 so that there are 128 samples per TPU core. If that is too big for your model, you can go smaller as long as it is a multiple of 128. Check out the above link and the performance guide for more tips on increasing performance.
for i in range(1000):
    print(i)
    x_value = np.random.rand(1024)  # Generate a batch of 1024 values
    y_value = x_value * 2 + 6 + 5 + 3
    session.run(optimizer, feed_dict={x_placeholder: x_value, y_placeholder: y_value})
Everything else should remain the same. I was able to train the model for all 10000 iterations. Keep in mind that for this simple model it will probably be slower than using CPU/GPU, but you should expect performance improvements for more complex problems with larger datasets.
I'm not familiar enough with Datasets or infeed queues to comment on this, but shard includes an argument for infeed queues so it likely has support for them. You might have to play around with it to see how it gets data to the computation function.

Calculating Perplexity and Memory Issues in Keras/Tensorflow

I'd like to evaluate my model with perplexity after each training epoch. I'm using Keras with the TensorFlow backend. The problem is that after each evaluation more and more memory is used but never released, so after a few epochs my system crashes. It would work without the memory issue if I weren't using Keras and TensorFlow functions, but then it would be waaay too slow.
Here is the code:
def compute_perplexity(self, modelName, sentences):
    all_labels, all_predictions = self.predictLabels_for_perplexity_evaluation(
        self.models[modelName], sentences)
    # add an axis to fit tensor shape
    for i in range(len(all_labels)):
        all_labels[i] = all_labels[i][:, :, np.newaxis]
    # calculate perplexity for each sentence length and each datapoint and append to list
    perplexity = []
    for i in range(10, 15):  # range(len(all_labels)):
        start = time.time()
        xentropy = K.sparse_categorical_crossentropy(
            tf.convert_to_tensor(all_labels[i]), tf.convert_to_tensor(all_predictions[i]))
        perplexity.append(K.eval(K.pow(2.0, xentropy)))
        print('time for one set of sentences. ', time.time() - start)
    # average for each datapoint
    for i in range(len(perplexity)):
        perplexity[i] = np.average(perplexity[i], axis=1)
        perplexity[i] = np.average(perplexity[i])
    return np.mean(perplexity)
There is no need to evaluate this metric using TensorFlow. What your code does is add the all_labels array to the graph each time it is called, which explains the memory usage you are seeing.
Consider implementing all of this computation using NumPy, or building the operation once and evaluating it with new data in a session using feed_dict (without tf.convert_to_tensor).
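As a rough sketch of the NumPy route, assuming each all_labels[i] is an integer array of shape (batch, time) before the added axis and each all_predictions[i] is a softmax array of shape (batch, time, vocab); these shapes are assumptions, since the question does not state them. The same per-token quantity can then be computed without touching the graph:

import numpy as np

def perplexity_numpy(labels, predictions, eps=1e-12):
    # Assumed shapes: labels (batch, time) of integer token ids,
    # predictions (batch, time, vocab) of softmax probabilities.
    batch_idx = np.arange(labels.shape[0])[:, None]
    time_idx = np.arange(labels.shape[1])[None, :]
    true_probs = predictions[batch_idx, time_idx, labels]
    # Mirror the original steps: natural-log cross-entropy per token,
    # then 2**xentropy, averaged over time and then over the batch.
    xentropy = -np.log(true_probs + eps)
    per_token = np.power(2.0, xentropy)
    return np.mean(np.mean(per_token, axis=1))

Because nothing here calls tf.convert_to_tensor, no new nodes are added to the graph and memory stays flat across epochs.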

Tensorflow slow inference speed in a loop

I am working on a reinforcement learning implementation using TensorFlow. After profiling the training procedure, I found something really weird:
The following code is in a training loop:
state_batch, \
    action_batch, \
    reward_batch, \
    next_state_batch, \
    is_episode_finished_batch = self.data_manager.get_next_batch()

state_batch = np.divide(state_batch, 10.0)
next_state_batch = np.divide(next_state_batch, 10.0)

# Calculate y for the td_error of the critic
y_batch = []
next_action_batch = self.actor_network.target_evaluate(
    next_state_batch, action_batch)
q_value_batch = self.critic_network.target_evaluate(
    next_state_batch, next_action_batch)
for i in range(0, self.batch_size):
    if is_episode_finished_batch[i]:
        y_batch.append([reward_batch[i]])
    else:
        y_batch.append(reward_batch[i] + GAMMA * q_value_batch[i])

# Now that we have the y batch, train the critic
self.critic_network.train(y_batch, state_batch, action_batch)

# Then get the action gradient batch and adapt the gradient with the gradient inverting method
action_batch_for_gradients = self.actor_network.evaluate(
    state_batch, action_batch)
q_gradient_batch = self.critic_network.get_action_gradient(
    state_batch, action_batch_for_gradients)
q_gradient_batch = self.grad_inv.invert(
    q_gradient_batch, action_batch_for_gradients)

# Now we can train the actor
self.actor_network.train(q_gradient_batch, state_batch, action_batch)
actor_network and critic_network are two classes that implement the actor and critic in an actor-critic algorithm. Each of them has its own network and operations, but both live in the same graph and run within the same session. Each member function (like evaluate, train, ...) contains a session.run and feeds the data it needs through its parameters.
I observed that action_batch_for_gradients runs extremely slowly, taking 0.x seconds for one inference, even much slower than self.critic_network.train. action_batch_for_gradients is simply an inference operation in the actor network to get an action. I then duplicated this line and found that only the first action_batch_for_gradients call, right after self.critic_network.train, is slow; the second runs at the normal speed of a forward pass. I think it has something to do with switching within a graph between training one network and doing a forward pass in another, but I can't tell how to avoid it.
I found some discussions on Stack Overflow about using the same graph in the loop, instead of building a new one each time, to speed up TensorFlow. But I already build the graph beforehand and only run different parts of it in the training loop, so I don't see how I am misusing TensorFlow in this training loop. I am using TensorFlow 1.6.
I would appreciate your help!

How do I use Batch Normalization in a multi-gpu setting in TensorFlow?

Referencing this post on How could I use Batch Normalization in TensorFlow?.
I have a multi-GPU setup similar to the CIFAR10 example. When I insert tf.contrib.layers.batch_norm into my network definition, I get a NoneType object in average_gradients. Specifically, the variable g is None.
def average_gradients(tower_grads):
    average_grads = []
    for grad_and_vars in zip(*tower_grads):
        grads = []
        for g, _ in grad_and_vars:
            expanded_g = tf.expand_dims(g, 0)
            grads.append(expanded_g)
        grad = tf.concat(0, grads)
        grad = tf.reduce_mean(grad, 0)
        v = grad_and_vars[0][1]
        grad_and_var = (grad, v)
        average_grads.append(grad_and_var)
    return average_grads
Some sample code on how to run Batch Normalization in a multi-gpu environment would help.
EDIT:
Simply removing the "batch_norm" variables solves this bug. However, the pressing question here is that each batch normalization layer has a beta and gamma on each GPU, each with their own moving averages. How are all these moving averages across the GPUs resolved at inference?
Just use BN independently across the GPUs, while using the statistics from one of the towers to update the moving mean/variance.
with tf.device('..'):
    x, y = iterator.get_next()

    # NN with variables copied over to each of the GPUs
    loss = tower_loss(..)

    # use last tower statistics to update the moving mean/variance
    batchnorm_updates = tf.get_collection(tf.GraphKeys.UPDATE_OPS, scope=scope)

apply_gradient_op = average_gradients(*grads)
batchnorm_updates_op = tf.group(*batchnorm_updates)
train_op = tf.group(apply_gradient_op, batchnorm_updates_op)
As gleaned from multiple comments here, this simple asynchronous approach works well in practice for most domains, with the exception of problems like semantic segmentation, action video recognition, etc., where the batch size is extremely small and async BN isn't able to provide the boost it normally does.

How to use evaluation_loop with train_loop in tf-slim

I'm trying to implement a few different models and train them on CIFAR-10, and I want to use TF-slim to do this. It looks like TF-slim has two main loops that are useful during training: train_loop and evaluation_loop.
My question is: what is the canonical way to use these loops?
As a followup: is it possible to use early stopping with train_loop?
Currently I have a model and my training file train.py looks like this
import ...

train_log_dir = ...

with tf.device("/cpu:0"):
    images, labels, dataset = set_up_input_pipeline_with_fancy_prefetching(
        subset='train', ... )

logits, end_points = set_up_model( images )  # Possibly using many GPUs
total_loss = set_up_loss( logits, labels, dataset )
optimizer, global_step = set_up_optimizer( dataset )

train_tensor = slim.learning.create_train_op(
    total_loss,
    optimizer,
    global_step=global_step,
    clip_gradient_norm=FLAGS.clip_gradient_norm,
    summarize_gradients=True)

slim.learning.train(train_tensor,
                    logdir=train_log_dir,
                    local_init_op=tf.initialize_local_variables(),
                    save_summaries_secs=FLAGS.save_summaries_secs,
                    save_interval_secs=FLAGS.save_interval_secs)
Which is awesome so far - my models all train and converge nicely. I can see this from the events in train_log_dir where all the metrics are going in the right direction. And going in the right direction makes me happy.
But I'd like to check that the metrics are improving on the validation set, too. I don't know of any way to do this with TF-slim that plays nicely with the training loop, so I created a second file called eval.py which contains my evaluation loop.
import ...

train_log_dir = ...

with tf.device("/cpu:0"):
    images, labels, dataset = set_up_input_pipeline_with_fancy_prefetching(
        subset='validation', ... )

logits, end_points = set_up_model( images )
summary_ops, names_to_values, names_to_updates = create_metrics_and_summary_ops(
    logits,
    labels,
    dataset.num_classes() )

slim.get_or_create_global_step()

slim.evaluation.evaluation_loop(
    '',
    checkpoint_dir=train_log_dir,
    logdir=train_log_dir,
    num_evals=FLAGS.num_eval_batches,
    eval_op=names_to_updates.values(),
    summary_op=tf.merge_summary(summary_ops),
    eval_interval_secs=FLAGS.eval_interval_secs,
    session_config=config)
Questions:
1) I currently have this model for the evaluation_loop hogging up an entire GPU, but it's rarely being used. I assume there's a better way to allocate resources. It would be pretty nice if I could use the same evaluation_loop to monitor the progress of multiple different models (checkpoints in multiple directories). Is something like this possible?
2) There's no feedback between the evaluation and training. I'm training a ton of models and would love to use early stopping to halt the models which aren't learning or are not converging. Is there a way to do this? Ideally using information from the validation set, but if it has to be just based on the training data that's okay, too.
3) Is my workflow all wrong and I should be structuring it differently? It's not clear from the documentation how to use evaluation in conjunction with training.
Update
~~It seems that as of TF r0.11 I'm also getting a segfault when calling slim.evaluation.evaluation_loop. It only happens sometimes (for me when I dispatch my jobs to a cluster). It happens in sv.managed_session--specifically prepare_or_wait_for_session.~~
This was just due to evaluation loop (a second instance of tensorflow) trying to use the GPU, which was already requisitioned by the first instance.
evaluation_loop is meant to be used (as you are currently using it) with a single directory. If you want to be more efficient, you could use slim.evaluation.evaluate_once and add the appropriate logic for swapping directories as you find appropriate.
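For example, a rough sketch of that swapping logic, reusing the metric ops, FLAGS, and session config from the question's eval.py and assuming every directory holds checkpoints compatible with that same graph (the directory list is hypothetical):

model_dirs = ['/path/to/model_a', '/path/to/model_b']  # hypothetical paths

for model_dir in model_dirs:
    checkpoint_path = tf.train.latest_checkpoint(model_dir)
    if checkpoint_path is None:
        continue  # no checkpoint written yet in this directory
    slim.evaluation.evaluate_once(
        '',
        checkpoint_path,
        logdir=model_dir,
        num_evals=FLAGS.num_eval_batches,
        eval_op=names_to_updates.values(),
        summary_op=tf.merge_summary(summary_ops),
        session_config=config)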
You can do this by overriding the slim.learning.train(..., train_step_fn) argument. This argument replaces the 'train_step' function with a custom function. Here, you can supply a custom training function which returns the 'total_loss' and 'should_stop' values as you see fit.
Your workflow looks great, this is probably the most common workflow for learning/eval using TF-Slim.
Thanks to #kmalakoff, the TensorFlow issue gave a brilliant way to solve the problem of how to validate or test a model during tf.slim training. The main idea is overriding the train_step_fn function:
import ...
from tensorflow.contrib.slim.python.slim.learning import train_step
...
accuracy_validation = ...
accuracy_test = ...

def train_step_fn(session, *args, **kwargs):
    total_loss, should_stop = train_step(session, *args, **kwargs)

    if train_step_fn.step % FLAGS.validation_every_n_step == 0:
        accuracy = session.run(train_step_fn.accuracy_validation)
        print('your validation info')

    if train_step_fn.step % FLAGS.test_every_n_step == 0:
        accuracy = session.run(train_step_fn.accuracy_test)
        print('your test info')

    train_step_fn.step += 1
    return [total_loss, should_stop]

train_step_fn.step = 0
train_step_fn.accuracy_validation = accuracy_validation
train_step_fn.accuracy_test = accuracy_test

# run training.
slim.learning.train(
    train_op,
    FLAGS.logs_dir,
    train_step_fn=train_step_fn,
    graph=graph,
    number_of_steps=FLAGS.max_steps)
Adding my 2 cents:
"I currently have this model for the evaluation_loop hogging up an entire GPU, but it's rarely being used"

Usually an evaluation model takes less GPU memory. You could prevent TF from hogging the whole GPU memory by setting the session config allow_growth to True. This way you can use the same GPU for both training and evaluation.
Example (training):

session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True

slim.learning.train(train_tensor,
                    logdir=train_log_dir,
                    local_init_op=tf.initialize_local_variables(),
                    save_summaries_secs=FLAGS.save_summaries_secs,
                    save_interval_secs=FLAGS.save_interval_secs,
                    session_config=session_config)
Example (validation):

session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True

slim.evaluation.evaluation_loop(
    '',
    checkpoint_dir=train_log_dir,
    logdir=train_log_dir,
    num_evals=FLAGS.num_eval_batches,
    eval_op=names_to_updates.values(),
    summary_op=tf.merge_summary(summary_ops),
    eval_interval_secs=FLAGS.eval_interval_secs,
    session_config=session_config)