How to set session configuration in example CNN Estimator for MNIST (built with tf.layers) - tensorflow

I am new to TensorFlow and am learning how to implement CNNs (Convolutional Neural Networks). I am using this official example (code). When I try to run it on the GPU it gives cuda_error_out_of_memory, as it tries to allocate all of the available GPU memory. I ran it on the CPU by setting the CUDA_VISIBLE_DEVICES="" environment variable and it worked fine, but took a lot of time.
I looked for a solution to cuda_error_out_of_memory and found it can be mitigated by setting config.gpu_options.allow_growth = True or config.gpu_options.per_process_gpu_memory_fraction in the tf session config.
Question: In the code I shared above, where do I set the session configuration? I don't see any session.run() type of command; I assume it is being called internally in the layer methods. So where do I set it? Is there any way I can set the session configuration globally for one file?

You can add any configuration in the Estimator's constructor through a RunConfig:
import tensorflow as tf

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
config = tf.ConfigProto(gpu_options=gpu_options)
mnist_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn, model_dir="/tmp/mnist_convnet_model",
    config=tf.estimator.RunConfig(session_config=config))

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess_config = tf.ConfigProto(gpu_options=gpu_options)
run_config = tf.estimator.RunConfig(session_config=sess_config)
mnist_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn, model_dir="/tmp/mnist_convnet_model", config=run_config)
Add this code before creating the Estimator.
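If you prefer the allow_growth option mentioned in the question, the same RunConfig pattern applies; a minimal sketch:
import tensorflow as tf

# Let the session grow its GPU allocation on demand instead of
# reserving a fixed fraction up front.
sess_config = tf.ConfigProto()
sess_config.gpu_options.allow_growth = True
run_config = tf.estimator.RunConfig(session_config=sess_config)
mnist_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn, model_dir="/tmp/mnist_convnet_model", config=run_config)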

Related

Extremely slow when saving model on Colab TPU

My situation is that saving the model is extremely slow in the Colab TPU environment.
I first encountered this issue when using the checkpoint callback, which caused training to get stuck at the end of the 1st epoch.
Then I tried taking out the callback and just saving the model using model.save_weights(), but nothing changed. Using the Colab terminal, I found that the saving speed is about 100k per 5 minutes.
TensorFlow version: 2.3
My model-fitting code is here:
with tpu_strategy.scope():  # creating the model in the TPUStrategy scope means we will train the model on the TPU
    model = create_model()

checkpoint = keras.callbacks.ModelCheckpoint('baseline_{epoch:03d}.h5',
                                             save_weights_only=True, save_freq="epoch")
hist = model.fit(get_train_ds().repeat(),
                 steps_per_epoch=100,
                 epochs=5,
                 verbose=1,
                 callbacks=[checkpoint])
model.save_weights("epoch-test.h5", overwrite=True)
I found the issue happened because I had explicitly switched to graph mode by writing
from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()
before
with tpu_strategy.scope():
    model.fit(...)
Though I still don't understand the cause, removing disable_eager_execution() solved the issue.
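If you are unsure which mode you are in, TF 2.x exposes tf.executing_eagerly(); a quick check:
import tensorflow as tf

# True by default in TF 2.x; False after disable_eager_execution()
print(tf.executing_eagerly())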

What is "cudnn_status_internal_error" meaning Tensorflow pipeline?

When we start an object_detection project using the TensorFlow Model API, we get the error "cudnn_status_internal_error".
I'm using the TensorFlow object detection pipeline to train the model.
How do I solve it?
In my case, it was because there was not enough GPU memory. To solve this issue in the pipeline, add a piece of code in "model_main.py":
session_config = tf.ConfigProto()
# alternatively: session_config.gpu_options.allow_growth = True
session_config.gpu_options.per_process_gpu_memory_fraction = 0.8
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, session_config=session_config)
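If you are on TensorFlow 2.x, where ConfigProto is no longer the front door, the equivalent knob is memory growth on the physical device; a minimal sketch:
import tensorflow as tf

# Ask TF 2.x to allocate GPU memory on demand rather than all at once.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)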

Tensorflow: How can I save a checkpoint only if the error is minimized during training?

I am running a TensorFlow program and I want to store the best model for later use. I am using an estimator (the tf.contrib.tpu.TPUEstimator module, which takes a run_config argument where I set save_checkpoints_secs=20*60) for training.
estimator.train takes a train_input_fn and num_train_steps as arguments,
e.g.: estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
Instead of saving a checkpoint every 'n' seconds, I want to store the best model, the one with minimal error during training.
Any help is welcome.
tf.estimator.BestExporter seems like it's exactly what you're looking for. According to the documentation:
This class performs a model export every time the new model is
better than any existing model.
estimator = tf.estimator.DNNClassifier(
    config=tf.estimator.RunConfig(
        model_dir='/my_model', save_summary_steps=100),
    feature_columns=[categorial_feature_a_emb, ...],
    hidden_units=[1024, 512, 256])

serving_feature_spec = tf.feature_column.make_parse_example_spec(
    [categorial_feature_a_emb])

serving_input_receiver_fn = (
    tf.estimator.export.build_parsing_serving_input_receiver_fn(
        serving_feature_spec))

exporter = tf.estimator.BestExporter(
    name="best_exporter",
    serving_input_receiver_fn=serving_input_receiver_fn,
    exports_to_keep=5)

train_spec = tf.estimator.TrainSpec(...)

eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,
    steps=100,
    exporters=exporter,
    start_delay_secs=0,
    throttle_secs=5)
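To actually drive training and evaluation with the exporter attached, the specs are passed to tf.estimator.train_and_evaluate; a minimal usage sketch:
# Runs the train/eval loop; BestExporter keeps its best models
# under <model_dir>/export/best_exporter/
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)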

How to control GPU memory size with tf.estimator

I'm trying to control the amount of GPU memory allocated for one TensorFlow estimator, tf.estimator.Estimator. The purpose is to allocate only half, so that another TensorFlow net can run on the same GPU. I found a way for the contrib version but not for the official one. Does anyone know if it's possible?
When you create an Estimator instance, you can pass a tf.estimator.RunConfig instance as the constructor's config argument.
The RunConfig has a session_config attribute you can use to set a tf.ConfigProto with the session's parameters.
In code, this translates to:
session_config = tf.ConfigProto()
session_config.gpu_options.per_process_gpu_memory_fraction = 0.5
estimator_config = tf.estimator.RunConfig(session_config=session_config)
my_estimator = tf.estimator.Estimator(..., config=estimator_config)
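Since the goal is to run another net on the same GPU, the second process can cap itself the same way through a plain session; a sketch (the tiny graph is just an illustration):
import tensorflow as tf

# The other process sharing the GPU limits itself to the remaining half.
other_config = tf.ConfigProto()
other_config.gpu_options.per_process_gpu_memory_fraction = 0.5
with tf.Session(config=other_config) as sess:
    x = tf.constant(1.0)
    print(sess.run(x))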

Tensorflow (tf-slim) Model with is_training True and False

I would like to run a given model both on the train set (is_training=True) and on the validation set (is_training=False), specifically with respect to how dropout is applied. Right now the prebuilt models expose an is_training parameter that is passed to the dropout layer when building the network. The issue is that if I call the method twice with different values of is_training, I will get two different networks that do not share weights (I think?). How do I get the two networks to share the same weights, so that I can run the network I have trained on the validation set?
I wrote a solution based on your comment to use Overfeat in train and test mode. (I couldn't test it, so can you check whether it works?)
First some imports and parameters:
import tensorflow as tf
slim = tf.contrib.slim
overfeat = tf.contrib.slim.nets.overfeat
batch_size = 32
inputs = tf.placeholder(tf.float32, [batch_size, 231, 231, 3])
dropout_keep_prob = 0.5
num_classes = 1000
In train mode, we pass a normal scope to the function overfeat:
scope = 'overfeat'
is_training = True
output = overfeat.overfeat(inputs, num_classes, is_training,
                           dropout_keep_prob, scope=scope)
Then in test mode, we create the same scope but with reuse=True:
scope = tf.VariableScope(reuse=True, name='overfeat')
is_training = False
output = overfeat.overfeat(inputs, num_classes, is_training,
                           dropout_keep_prob, scope=scope)
You can just use a placeholder for is_training:
isTraining = tf.placeholder(tf.bool)

# create nn
net = ...
net = slim.dropout(net,
                   keep_prob=0.5,
                   is_training=isTraining)
net = ...

# training
sess.run([net], feed_dict={isTraining: True})

# testing
sess.run([net], feed_dict={isTraining: False})
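For reference, a complete minimal version of this pattern (a sketch; the two slim layers are just an illustration, not from the original post):
import tensorflow as tf
slim = tf.contrib.slim

isTraining = tf.placeholder(tf.bool)
x = tf.placeholder(tf.float32, [None, 10])

# A toy network: dropout is the only piece whose behavior flips with the flag.
net = slim.fully_connected(x, 32)
net = slim.dropout(net, keep_prob=0.5, is_training=isTraining)
net = slim.fully_connected(net, 2, activation_fn=None)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = [[0.0] * 10]
    train_out = sess.run(net, feed_dict={x: batch, isTraining: True})
    test_out = sess.run(net, feed_dict={x: batch, isTraining: False})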
It depends on the case; the solutions are different.
My first option would be to use a different process to do the evaluation. You only need to check that there is a new checkpoint and load those weights into the evaluation network (with is_training=False):
checkpoint = tf.train.latest_checkpoint(self.checkpoints_path)
# wait until a new checkpoint is available
while self.lastest_checkpoint == checkpoint:
    time.sleep(30)  # sleep 30 seconds waiting for a new checkpoint
    checkpoint = tf.train.latest_checkpoint(self.checkpoints_path)
logging.info('Restoring model from {}'.format(checkpoint))
self.saver.restore(session, checkpoint)
self.lastest_checkpoint = checkpoint
The second option is to unload the graph after every epoch and create a new evaluation graph. This solution wastes a lot of time loading and unloading graphs.
The third option is to share the weights, but feeding these networks with queues or datasets can lead to issues, so you have to be very careful. I only use this for Siamese networks.
with tf.variable_scope('the_scope') as scope:
    your_model(is_training=True)
    scope.reuse_variables()
    your_model(is_training=False)
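A quick way to convince yourself the two calls really share weights (a sketch; your_model here is a hypothetical stand-in for whatever network you build):
import tensorflow as tf

def your_model(is_training):
    # hypothetical body: one dense layer followed by dropout
    x = tf.placeholder(tf.float32, [None, 10])
    net = tf.layers.dense(x, 5, name='fc')
    return tf.layers.dropout(net, rate=0.5, training=is_training)

with tf.variable_scope('the_scope') as scope:
    train_out = your_model(is_training=True)
    scope.reuse_variables()
    eval_out = your_model(is_training=False)

# Only one kernel/bias pair exists, even though the model was built twice.
print([v.name for v in tf.trainable_variables()])  # ['the_scope/fc/kernel:0', 'the_scope/fc/bias:0']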