I'm using the TensorFlow Dataset API with the switching mechanism to switch between the training and test set.
dataset_iter = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
features, labels = dataset_iter.get_next()
train_init_op = dataset_iter.make_initializer(train_dataset)
test_init_op = dataset_iter.make_initializer(test_dataset)
features and labels are used in the graph, for example:
logits = tf.layers.dense(features, units=dataset.labels.shape[-1])
loss = tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=logits)
For each training and testing epoch, the dataset is switched by calling the respective initializer (train_init_op, test_init_op).
Now I would like to use a dropout layer, but I don't know how to check, for the current run, whether the training or the test set is initialized:
is_training = ???
net = tf.layers.dropout(net, rate=0.25, training=is_training)
is_training needs to be a variable and should not be evaluated at graph building time. It should be evaluated on every run.
How to do this? I don't want to redefine the graph for test or training.
Okay, I already came up with a solution:
is_training = tf.Variable(False, dtype=tf.bool)
train_init_op = tf.group(dataset_iter.make_initializer(train_dataset), tf.assign(is_training, True))
test_init_op = tf.group(dataset_iter.make_initializer(test_dataset), tf.assign(is_training, False))
I added an additional variable that tracks the state (training / testing).
The corresponding assign op runs whenever the initializer is called, so the variable is set to the right value.
I had hoped that there was an integrated / out-of-the-box version of this.
If someone knows of such a solution, please feel free to provide an additional answer.
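One built-in option that comes close is tf.placeholder_with_default; a minimal sketch (not part of the original answer) is below. The trade-off is that the flag is set via feed_dict rather than being tied to the iterator initializers.

# Defaults to False (inference); feed True only on training runs.
is_training = tf.placeholder_with_default(False, shape=[], name="is_training")
net = tf.layers.dropout(net, rate=0.25, training=is_training)

# sess.run(train_op, feed_dict={is_training: True})  # training step
# sess.run(loss)                                      # test step, flag stays False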
Maybe we can try tf.control_dependencies:
with tf.control_dependencies([train_init_op]):
    net = tf.layers.dropout(net, rate=0.25, training=True)
I'm trying to compute the gradient of the output layer with respect to the input layer. My neural network is relatively small (the input layer is composed of 9 activation units and the output layer of 1), and training went fine, as the test gave very good accuracy. I made the NN model using Keras.
In order to solve my problem, I need to compute the gradient of the output with respect to the input. That is, I need to obtain the Jacobian, which has dimension [1x9]. The gradients function in TensorFlow should provide everything I need, but when I run the code below I obtain a different solution every time.
output_v = model.output
input_v = model.input
gradients = tf.gradients(output_v, input_v)
sess = tf.Session()
sess.run(tf.initialize_all_variables())
print(sess.run(model.input, feed_dict={model.input: x_test_N[0:1,:]}))
evaluated_gradients = sess.run(gradients, feed_dict={model.input: x_test_N[0:1,:]})
print(evaluated_gradients)
sess.close()
The first print command shows this value every time I run it (just to make sure that the input values are not modified):
[[-1.4306372 -0.1272892 0.7145787 1.338818 -1.2957293 -0.5402862 -0.7771702 -0.5787912 -0.9157122]]
But the second print shows different ones:
[[ 0.00175761, -0.0490326 , -0.05413761, 0.09952173, 0.06112418, -0.04772799, 0.06557006, -0.02473242, 0.05542536]]
[[-0.00416433, 0.08235116, -0.00930298, 0.04440641, 0.03752216, 0.06378302, 0.03508484, -0.01903783, -0.0538374 ]]
Using finite differences, evaluated_gradients[0,0] = 0.03565103, which isn't close to any of the first values previously printed.
Thanks for your time!
Alberto
Solved by creating a specific session just before training my model:
sess = tf.Session()
sess.run(tf.global_variables_initializer())
K.set_session(sess)
history = model.fit(x_train_N, y_train_N, epochs=n_epochs,
                    validation_split=split, verbose=1, batch_size=n_batch_size,
                    shuffle='true', callbacks=[early_stop, tensorboard])
And evaluating the gradient after training, while the tf.Session is still open:
evaluated_gradients = sess.run(K.gradients(model.output, model.input), feed_dict={model.input: x_test_N})
Presumably your network is set up to initialize weights to random values. When you run sess.run(tf.initialize_all_variables()), you are initializing your variables to new random values. Therefore you get different values for output_v in every run, and hence different gradients. If you want to use a model you trained before, you should replace the call to initialize_all_variables() with a restore command. I am not familiar with how this is done in Keras since I usually work directly with TensorFlow, but I would try this.
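For reference, restoring previously trained variables in plain TensorFlow would look roughly like this (a sketch; the checkpoint path is hypothetical):

saver = tf.train.Saver()
with tf.Session() as sess:
    # Load trained weights instead of re-initializing them randomly
    saver.restore(sess, "/tmp/my_model.ckpt")  # hypothetical checkpoint path
    evaluated_gradients = sess.run(gradients, feed_dict={model.input: x_test_N[0:1, :]})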
Also note that initialize_all_variables is deprecated and you should use global_variables_initializer instead.
This question has been asked here; the difference is that my problem is focused on Estimator.
Some context: we have trained a model using Estimator and got some variables defined within the Estimator's input_fn (this function preprocesses data into batches). Now we are moving to prediction. During prediction we use the same input_fn to read in and process the data, but we get an error saying the variable (word_embeddings) does not exist (the variables do exist in the checkpoint graph). Here's the relevant bit of code in input_fn:
with tf.variable_scope('vocabulary', reuse=tf.AUTO_REUSE):
    if mode == tf.estimator.ModeKeys.TRAIN:
        word_to_index, word_to_vec = load_embedding(graph_params["word_to_vec"])
        word_embeddings = tf.get_variable(initializer=tf.constant(word_to_vec, dtype=tf.float32),
                                          trainable=False,
                                          name="word_to_vec",
                                          dtype=tf.float32)
    else:
        word_embeddings = tf.get_variable("word_to_vec", dtype=tf.float32)
Basically, when it's in prediction mode, the else branch is invoked to load up the variable from the checkpoint. Failure to recognize this variable indicates either a) inappropriate usage of the scope, or b) that the graph is not restored. I don't think the scope matters that much here as long as reuse is set properly.
I suspect that is because the graph is not yet restored in the input_fn phase. Usually, the graph is restored by calling saver.restore(sess, "/tmp/model.ckpt") (reference). Investigating the Estimator source code doesn't get me anything related to restoring; the best lead is MonitoredSession, a wrapper around training. This has already stretched so far from the original problem that I'm not confident I'm on the right path, so I'm looking for help here if anyone has any insights.
One-line summary of my question: how does the graph get restored within tf.estimator, via input_fn or model_fn?
Hi, I think that your error comes simply from the fact that you didn't specify the shape in tf.get_variable (at predict time); it seems that you need to specify the shape even if the variable is going to be restored.
I've made the following test with a simple linear regressor estimator that simply needs to predict x + 5
def input_fn(mode):
    def _input_fn():
        with tf.variable_scope('all_input_fn', reuse=tf.AUTO_REUSE):
            if mode == tf.estimator.ModeKeys.TRAIN:
                var_to_follow = tf.get_variable('var_to_follow', initializer=tf.constant(20))
                x_data = np.random.randn(1000)
                labels = x_data + 5
                return {'x': x_data}, labels
            elif mode == tf.estimator.ModeKeys.PREDICT:
                var_to_follow = tf.get_variable("var_to_follow", dtype=tf.int32, shape=[])
                return {'x': [0, 10, 100, var_to_follow]}
    return _input_fn
featcols = [tf.feature_column.numeric_column('x')]
model = tf.estimator.LinearRegressor(featcols, './outdir')
This code works perfectly fine; the value of the constant is 20, and for fun I also use it in my test set to confirm :p
However, if you remove the shape=[], it breaks. You can also give another initializer such as tf.constant(500), and everything will still work and 20 will be used.
By running
model.train(input_fn(tf.estimator.ModeKeys.TRAIN), max_steps=10000)
and
preds = model.predict(input_fn(tf.estimator.ModeKeys.PREDICT))
print(next(preds))
You can visualize the graph and you'll see that a) the scoping is normal and b) the graph is restored.
Hope this will help you.
I'm trying to add image summary operations to visualize how well my network manages to reconstruct inputs from the validation set. However, since there are too many images in the validation set, I would only like to plot a small subset of them.
I managed to achieve this with a manual training loop, but I'm struggling to achieve the same with the new TensorFlow Estimator/Experiment/Datasets API. Has anyone done something like this?
The Experiment and Estimator are high level TensorFlow APIs. Although you could probably solve your issue with a hook, if you want more control over what's happening during the training process, it may be easier not to use these APIs.
That said, you can still use the Dataset API which will bring you a lot of useful features.
To solve your problem with the Dataset API, you will need to switch between train and validation datasets in your training loop.
One way to do that is to use a feedable iterator. See here for more details:
https://www.tensorflow.org/programmers_guide/datasets
You can also see a full example switching between training and validation with the Dataset API in this notebook.
In brief, after having created your train_dataset and your val_dataset, your training loop could be something like this:
# create TensorFlow Iterator objects
training_iterator = train_dataset.make_initializable_iterator()
val_iterator = val_dataset.make_initializable_iterator()

# feedable handle that selects which iterator feeds the graph
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    handle, train_dataset.output_types, train_dataset.output_shapes)

with tf.Session() as sess:
    # Initialize variables
    init = tf.global_variables_initializer()
    sess.run(init)

    # Create training data and validation data handles
    training_handle = sess.run(training_iterator.string_handle())
    validation_handle = sess.run(val_iterator.string_handle())

    for epoch in range(number_of_epochs):
        # Tell iterator to go to beginning of dataset
        sess.run(training_iterator.initializer)
        print("Starting epoch: ", epoch)

        # iterate over the training dataset and train
        while True:
            try:
                sess.run(train_op, feed_dict={handle: training_handle})
            except tf.errors.OutOfRangeError:
                # End of epoch
                break

        # Tell validation iterator to go to beginning of dataset
        sess.run(val_iterator.initializer)

        # run validation on only 10 examples
        for i in range(10):
            my_value = sess.run(my_validation_op, feed_dict={handle: validation_handle})
            # Do whatever you want with my_value
            ...
I figured out a solution that uses Estimator/Experiment API.
First you need to modify your Dataset input so that it provides not only labels and features, but also some form of identifier for each sample (in my case it was a filename). Then, in the hyperparameters dictionary (the params argument), you need to specify which of the validation samples you want to plot. You will also have to pass the model_dir in those parameters. For example:
params = tf.contrib.training.HParams(
model_dir=model_dir,
images_to_plot=["100307_EMOTION.nii.gz", "100307_FACE-SHAPE.nii.gz",
"100307_GAMBLING.nii.gz", "100307_RELATIONAL.nii.gz",
"100307_SOCIAL.nii.gz"]
)
learn_runner.run(
experiment_fn=experiment_fn,
run_config=run_config,
schedule="train_and_evaluate",
hparams=params
)
Having this set up you can create conditional Summary operations in your model_fn and an evaluation hook to include them in your outputs.
if mode == tf.contrib.learn.ModeKeys.EVAL:
    summaries = []
    for image_to_plot in params.images_to_plot:
        is_to_plot = tf.equal(tf.squeeze(filenames), image_to_plot)
        summary = tf.cond(is_to_plot,
                          lambda: tf.summary.image('predicted', predictions),
                          lambda: tf.summary.histogram("ignore_me", [0]),
                          name="%s_predicted" % image_to_plot)
        summaries.append(summary)
    evaluation_hooks = [tf.train.SummarySaverHook(
        save_steps=1,
        output_dir=os.path.join(params.model_dir, "eval"),
        summary_op=tf.summary.merge(summaries))]
else:
    evaluation_hooks = None
Note that the summaries have to be conditional - we are either plotting an image (computationally expensive) or saving a constant (computationally cheap). I opted for a histogram rather than a scalar for the dummy summaries to avoid cluttering my tensorboard dashboard.
Finally, you need to pass the hook in the return object of your model_fn:
return tf.estimator.EstimatorSpec(
mode=mode,
predictions=predictions,
loss=loss,
train_op=train_op,
evaluation_hooks=evaluation_hooks
)
Please note that this only works when your batch size is 1 during evaluation (which should not be a problem).
In my TensorFlow code I want my network to take inputs from one of the two StagingArea objects depending upon whether I want to do training or testing.
Part of the graph construction code I wrote is as follows:
with tf.device("/gpu:0"):
for i in range(numgpus):
with tf.variable_scope(tf.get_variable_scope(), reuse=i>0) as vscope:
with tf.device('/gpu:{}'.format(i)):
with tf.name_scope('GPU-Tower-{}'.format(i)) as scope:
phase = tf.get_variable("phase", [], initializer=tf.zeros_initializer(),dtype=tf.uint8, trainable=False)
phaseassigntest = phase.assign(1)
phaseassigntrain = phase.assign(0)
phasetest = tf.equal(phase, 0)
is_training = tf.cond(phasetest, lambda: tf.constant(True), lambda: tf.constant(False))
trainstagingarea = tf.contrib.staging.StagingArea([tf.float32, tf.int32], shapes=[[trainbatchsize, 3, 221, 221], [trainbatchsize]], capacity=20)
putoptrain = trainstagingarea.put(train_iterator.get_next())
trainputop.append(putoptrain)
getoptrain = trainstagingarea.get()
traingetop.append(getoptrain)
trainclearop = trainstagingarea.clear()
trainstageclear.append(trainclearop)
trainsizeop = trainstagingarea.size()
trainstagesize.append(trainsizeop)
valstagingarea = tf.contrib.staging.StagingArea([tf.float32, tf.int32], shapes=[[valbatchsize, 3, 221, 221], [valbatchsize]], capacity=20)
putopval = valstagingarea.put(val_iterator.get_next())
valputop.append(putopval)
getopval = valstagingarea.get()
valgetop.append(getopval)
valclearop = valstagingarea.clear()
valstageclear.append(valclearop)
valsizeop = valstagingarea.size()
valstagesize.append(valsizeop)
#elem = valgetop[i]
elem = tf.cond(is_training,lambda: traingetop[i],lambda: valgetop[i])
img = elem[0]
label = elem[1]
labelonehot = tf.one_hot(label, depth=numclasses)
net, networksummaries = overfeataccurate(img,numclasses=numclasses, phase=is_training)
I have used tf.cond to make sure that the network is fed by one of the two StagingArea objects. One is meant for training and the other one is meant for validation.
Now, when I try to execute the graph as follows, I do not get any result; in fact, the code just hangs and I have to kill the process.
with tf.Session(graph=g, config=config) as sess:
    sess.run(init_op)
    sess.run(tf.local_variables_initializer())
    sess.run(val_initialize)
    for i in range(20):
        sess.run(valputop)
        print(sess.run(valstagesize))
    writer = tf.summary.FileWriter('.', graph=tf.get_default_graph())
    epoch = 0
    iter = 0
    print("Performing Validation")
    sess.run(phaseassigntest)
    saver = tf.train.Saver()
    while(epoch < 10):
        time_init = time.time()
        while True:
            try:
                [val_accu, _, summaries] = sess.run([towervalidation, towervalidationupdateop, validation_summary_op])
                print(val_accu)
When, instead of using tf.cond(), I directly assign elem = valgetop[i], the code works just fine.
Am I missing something over here ?
What is the right way to feed my network based on whether I want to do training or testing ?
NOTE: The error does not go away even if I set numgpus to 1.
Your problem
What you think tf.cond does
Based on the flag, execute what is required to put either traingetop[i] or valgetop[i] into your elem tensor.
What tf.cond actually does
Executes what is required to get both traingetop[i] and valgetop[i], then passes one of them into your elem tensor.
So
The reason it is hanging forever is because it's waiting for an element to be added to your training staging area (so that it can get that element and discard it). You're forgiven for not realising this is what it's doing; it's actually very counter-intuitive. The documentation is awfully unclear on how to deal with this.
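Here is a minimal sketch (not part of the original answer) that makes the behaviour visible with tf.Print: because both branch inputs are created outside the lambdas, both are evaluated on every run, even though only one value is returned. This is the same reason the get() op of the empty training staging area blocks forever in the question's code.

a = tf.Print(tf.constant(1), [1], message="train-side input evaluated ")
b = tf.Print(tf.constant(2), [2], message="val-side input evaluated ")
flag = tf.placeholder(tf.bool, shape=[])
out = tf.cond(flag, lambda: a, lambda: b)

with tf.Session() as sess:
    sess.run(out, feed_dict={flag: True})  # both messages are printed to stderr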
Recommended Solution (by Tensorflow documentation)
If you really need the queues to be in the same graph, then you need to make two copies of your ENTIRE graph, one that is fed by your training staging area, and one that is fed by your validation staging area. Then you just use the relevant tensor in your sess.run call. I recommend creating a function that takes a queue output tensor, and returns a model_output tensor. Now you have a train_time_output tensor and a validation_time_output tensor, and you can choose which one you want to execute in your sess.run.
Warning
You need to make sure that you aren't actually creating new variables to go along with these new ops. To do that take a look at the latest documentation on variables. It looks like they've simplified it from v0.12, and it essentially boils down to using tf.get_variable instead of tf.Variable to create your variables.
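A rough sketch of that recommendation, reusing traingetop/valgetop and numclasses from the question (the model body here is a placeholder, not the actual network):

def model_output(images):
    # tf.get_variable (used internally by tf.layers) shares the weights between both copies
    with tf.variable_scope("model", reuse=tf.AUTO_REUSE):
        net = tf.layers.flatten(images)
        return tf.layers.dense(net, units=numclasses, name="logits")

train_time_output = model_output(traingetop[i][0])
validation_time_output = model_output(valgetop[i][0])
# Running train_time_output only touches the training staging area;
# running validation_time_output only touches the validation one.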
My preferred work around
Although that is the recommended solution (AFAIK), it is extremely unsatisfying to me; you're creating a whole other set of operations on the graph that just happen to use the same weights. It seems like there's a lot of potential for programmer error in abusing the separation between train time and test/validation time (resulting in the model acting unexpectedly differently at these times). Worse, it doesn't solve the problem of tf.cond demanding the values of the inputs to both branches; it just forces you to copy your whole graph, which is not always possible.
I prefer to just not have my queues in the graph like that, and treat the model as a function which can be fed an example without caring where it's from. That is, I would instantiate the model with a tf.placeholder as the input, and at execution time I would use feed_dict to actually provide the value. It would function something like this
# inside main training loop
if time_to_train:
    example = sess.run(traingetop)
else:
    example = sess.run(valgetop)
result = sess.run(model_output, {input_placeholder: example})
It's very useful to note that you can use the feed_dict to feed any value for any tensor anywhere in your model. So, you can change any model definition that, due to tf.cond, would always require an input, like:
a = tf.constant(some_value)
b = tf.placeholder(tf.float32)
flag = tf.placeholder(tf.bool, [])
one_of_them = tf.cond(flag, lambda: a, lambda: b)
model_output = build_graph(one_of_them)
Into a definition that doesn't, like:
a = tf.constant(some_value)
model_output = build_graph(a)
Remembering that you can always overwrite what a is at execution time:
# In main training loop,
sess.run(train_op, {a: some_other_value})
This essentially pushes the conditional into native python land. In your code you might end up with something like:
if condition_satisfied:
    sess.run(train_op, {a: some_other_value})
else:
    sess.run(train_op)
Performance concerns
If you are using tensorflow on a single machine, then there is practically no performance cost to this solution, as the numpy array/s put into the example python variable are actually still stored on the GPU.
If you are using tensorflow in a distributed fashion, then this solution would kill your performance; it would require sending the example from whatever machine it's on to the master so that it can send it back.
I'm trying to implement a few different models and train them on CIFAR-10, and I want to use TF-slim to do this. It looks like TF-slim has two main loops that are useful during training: train_loop and evaluation_loop.
My question is: what is the canonical way to use these loops?
As a followup: is it possible to use early stopping with train_loop?
Currently I have a model and my training file train.py looks like this
import ...
train_log_dir = ...
with tf.device("/cpu:0"):
images, labels, dataset = set_up_input_pipeline_with_fancy_prefetching(
subset='train', ... )
logits, end_points = set_up_model( images ) // Possibly using many GPUs
total_loss = set_up_loss( logits, labels, dataset )
optimizer, global_step = set_up_optimizer( dataset )
train_tensor = slim.learning.create_train_op(
total_loss,
optimizer,
global_step=global_step,
clip_gradient_norm=FLAGS.clip_gradient_norm,
summarize_gradients=True)
slim.learning.train(train_tensor,
logdir=train_log_dir,
local_init_op=tf.initialize_local_variables(),
save_summaries_secs=FLAGS.save_summaries_secs,
save_interval_secs=FLAGS.save_interval_secs)
Which is awesome so far - my models all train and converge nicely. I can see this from the events in train_log_dir where all the metrics are going in the right direction. And going in the right direction makes me happy.
But I'd like to check that the metrics are improving on the validation set, too. I don't know of any way to do this with TF-slim that plays nicely with the training loop, so I created a second file called eval.py which contains my evaluation loop.
import ...
train_log_dir = ...
with tf.device("/cpu:0"):
images, labels, dataset = set_up_input_pipeline_with_fancy_prefetching(
subset='validation', ... )
logits, end_points = set_up_model( images )
summary_ops, names_to_values, names_to_updates = create_metrics_and_summary_ops(
logits,
labels,
dataset.num_classes() )
slim.get_or_create_global_step()
slim.evaluation.evaluation_loop(
'',
checkpoint_dir=train_log_dir,
logdir=train_log_dir,
num_evals=FLAGS.num_eval_batches,
eval_op=names_to_updates.values(),
summary_op=tf.merge_summary(summary_ops),
eval_interval_secs=FLAGS.eval_interval_secs,
session_config=config)
Questions:
1) I currently have this model for the evaluation_loop hogging up an entire GPU, but it's rarely being used. I assume there's a better way to allocate resources. It would be pretty nice if I could use the same evaluation_loop to monitor the progress of multiple different models (checkpoints in multiple directories). Is something like this possible?
2) There's no feedback between the evaluation and training. I'm training a ton of models and would love to use early stopping to halt the models which aren't learning or are not converging. Is there a way to do this? Ideally using information from the validation set, but if it has to be just based on the training data that's okay, too.
3) Is my workflow all wrong and I should be structuring it differently? It's not clear from the documentation how to use evaluation in conjunction with training.
Update
~~It seems that as of TF r0.11 I'm also getting a segfault when calling slim.evaluation.evaluation_loop. It only happens sometimes (for me when I dispatch my jobs to a cluster). It happens in sv.managed_session--specifically prepare_or_wait_for_session.~~
This was just due to the evaluation loop (a second instance of TensorFlow) trying to use the GPU, which was already requisitioned by the first instance.
evaluation_loop is meant to be used (as you are currently using it) with a single directory. If you want to be more efficient, you could use slim.evaluation.evaluate_once and add the appropriate logic for swapping directories as you find appropriate.
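For example, a rough sketch of cycling over several model directories with evaluate_once (the directory list is hypothetical, the metric ops are the ones from the question, and the models are assumed to share the same architecture/graph):

for ckpt_dir in ["./logs/model_a", "./logs/model_b"]:  # hypothetical directories
    checkpoint_path = tf.train.latest_checkpoint(ckpt_dir)
    if checkpoint_path is None:
        continue  # nothing has been saved for this model yet
    slim.evaluation.evaluate_once(
        '',
        checkpoint_path,
        ckpt_dir,
        num_evals=FLAGS.num_eval_batches,
        eval_op=names_to_updates.values(),
        summary_op=tf.merge_summary(summary_ops),
        session_config=config)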
You can do this by overriding the slim.learning.train(..., train_step_fn) argument. This argument replaces the 'train_step' function with a custom function. Here, you can supply a custom training function which returns the 'total_loss' and 'should_stop' values as you see fit.
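For early stopping specifically, a rough sketch could look like the following (the accuracy_validation tensor and the patience of 1000 steps are assumptions, not from the original answer; in practice you would also gate the validation run to every N steps):

from tensorflow.contrib.slim.python.slim.learning import train_step

best_accuracy = [0.0]            # best validation accuracy seen so far
steps_without_improvement = [0]

def train_step_fn(session, *args, **kwargs):
    total_loss, should_stop = train_step(session, *args, **kwargs)
    accuracy = session.run(accuracy_validation)  # validation metric, assumed defined elsewhere
    if accuracy > best_accuracy[0]:
        best_accuracy[0] = accuracy
        steps_without_improvement[0] = 0
    else:
        steps_without_improvement[0] += 1
    # ask slim to stop once validation accuracy has stalled for too long
    return total_loss, should_stop or steps_without_improvement[0] > 1000

slim.learning.train(train_tensor, logdir=train_log_dir, train_step_fn=train_step_fn)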
Your workflow looks great, this is probably the most common workflow for learning/eval using TF-Slim.
Thanks to @kmalakoff, this TensorFlow issue gave a brilliant way to solve the problem of how to validate or test a model during tf.slim training. The main idea is to override the train_step_fn function:
import …
from tensorflow.contrib.slim.python.slim.learning import train_step
...
accuracy_validation = ...
accuracy_test = ...
def train_step_fn(session, *args, **kwargs):
    total_loss, should_stop = train_step(session, *args, **kwargs)

    if train_step_fn.step % FLAGS.validation_every_n_step == 0:
        accuracy = session.run(train_step_fn.accuracy_validation)
        print('your validation info')

    if train_step_fn.step % FLAGS.test_every_n_step == 0:
        accuracy = session.run(train_step_fn.accuracy_test)
        print('your test info')

    train_step_fn.step += 1
    return [total_loss, should_stop]

train_step_fn.step = 0
train_step_fn.accuracy_validation = accuracy_validation
train_step_fn.accuracy_test = accuracy_test
# run training.
slim.learning.train(
train_op,
FLAGS.logs_dir,
train_step_fn=train_step_fn,
graph=graph,
number_of_steps=FLAGS.max_steps)
Adding my 2 cents:
"I currently have this model for the evaluation_loop hogging up an entire GPU, but it's rarely being used"
Usually an evaluation model takes less GPU memory. You could prevent TF from hogging the whole GPU memory by setting the session config option allow_growth to True. This way you can use the same GPU for both training and evaluation.
Example # Training
session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True
slim.learning.train(train_tensor,
logdir=train_log_dir,
local_init_op=tf.initialize_local_variables(),
save_summaries_secs=FLAGS.save_summaries_secs,
save_interval_secs=FLAGS.save_interval_secs,
session_config=session_config)
Example # validation
session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True
slim.evaluation.evaluation_loop(
'',
checkpoint_dir=train_log_dir,
logdir=train_log_dir,
num_evals=FLAGS.num_eval_batches,
eval_op=names_to_updates.values(),
summary_op=tf.merge_summary(summary_ops),
eval_interval_secs=FLAGS.eval_interval_secs,
session_config=session_config)