`get_variable()` doesn't recognize existing variables for tf.estimator - tensorflow

This question has been asked here, difference is my problem is focused on Estimator.
Some context: We have trained a model using estimator and get some variable defined within Estimator input_fn, this function preprocesses data to batches. Now, we are moving to prediction. During the prediction, we use the same input_fn to read in and process the data. But got error saying variable (word_embeddings) does not exist (variables exist in the chkp graph), here's the relevant bit of code in input_fn:
with tf.variable_scope('vocabulary', reuse=tf.AUTO_REUSE):
if mode == tf.estimator.ModeKeys.TRAIN:
word_to_index, word_to_vec = load_embedding(graph_params["word_to_vec"])
word_embeddings = tf.get_variable(initializer=tf.constant(word_to_vec, dtype=tf.float32),
trainable=False,
name="word_to_vec",
dtype=tf.float32)
else:
word_embeddings = tf.get_variable("word_to_vec", dtype=tf.float32)
basically, when it's in prediction mode, else is invoked to load up variables in checkpoint. Failure of recognizing this variable indicates a) inappropriate usage of scope; b) graph is not restored. I don't think scope matters that much here as long as reuse is set properly.
I suspect that is because the graph is not yet restored at input_fn phase. Usually, the graph is restored by calling saver.restore(sess, "/tmp/model.ckpt") reference. Investigation of estimator source code doesn't get me anything relating to restore, the best shot is MonitoredSession, a wrapper of training. It's already been stretch so much from the original problem, not confident if I'm on the right path, I'm looking for help here if anyone has any insights.
One line summary of my question: How does graph get restored within tf.estimator, via input_fn or model_fn?

Hi I think that you error comes simply because you didn't specify the shape in the tf.get_variable (at predict) , it seems that you need to specify the shape even if the variable is going to be restored.
I've made the following test with a simple linear regressor estimator that simply needs to predict x + 5
def input_fn(mode):
def _input_fn():
with tf.variable_scope('all_input_fn', reuse=tf.AUTO_REUSE):
if mode == tf.estimator.ModeKeys.TRAIN:
var_to_follow = tf.get_variable('var_to_follow', initializer=tf.constant(20))
x_data = np.random.randn(1000)
labels = x_data + 5
return {'x':x_data}, labels
elif mode == tf.estimator.ModeKeys.PREDICT:
var_to_follow = tf.get_variable("var_to_follow", dtype=tf.int32, shape=[])
return {'x':[0,10,100,var_to_follow]}
return _input_fn
featcols = [tf.feature_column.numeric_column('x')]
model = tf.estimator.LinearRegressor(featcols, './outdir')
This code works perfectly fine, the value of the const is 20 and also for fun use it in my test set to confirm :p
However if you remove the shape=[] , it breaks, you can also give another initializer such as tf.constant(500) and everything will work and 20 will be used.
By running
model.train(input_fn(tf.estimator.ModeKeys.TRAIN), max_steps=10000)
and
preds = model.predict(input_fn(tf.estimator.ModeKeys.PREDICT))
print(next(preds))
You can visualize the graph and you'll see that a) the scoping is normal and b) the graph is restored.
Hope this will help you.

Related

How to check if training or test data is initialized?

I'm using the TensorFlow dataset API with the switching mechanic to switch between training and test set.
dataset_iter = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
features, labels = dataset_iter.get_next()
train_init_op = dataset_iter.make_initializer(train_dataset)
test_init_op = dataset_iter.make_initializer(test_dataset)
features and labels is used for the graph, for example:
logits = tf.layers.dense(features, units=dataset.labels.shape[-1])
loss = tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=logits)
For each epoch for testing and training the dataset is switched by calling the respective initializer (train_init_op, test_init_op).
Now I would like to use a dropout layer but I don't know how to determine how to check if for the current run the training or test set is initialized:
is_training = ???
net = tf.layers.dropout(net, rate=0.25, training=is_training)
is_training needs to be a variable and should not be evaluated on graph building time. If should be evaluated on every run.
How to do this? I don't want to redefine the graph for test or training.
Okay, I already came up with a solution:
is_training = tf.Variable(False, dtype=tf.bool)
train_init_op = tf.group(dataset_iter.make_initializer(train_dataset), tf.assign(is_training, True))
test_init_op = tf.group(dataset_iter.make_initializer(test_dataset), tf.assign(is_training, False))
I added an additional variable that tracks the state (training / testing).
This variable also gets called when the initializer is called and is set the the right value.
I hoped that there is a integrated / out of the box version of that.
If someone knows one such solution, please feel free to provide an addition answer.
Maybe we can try tf.control_dependencies :
with tf.control_dependencies([train_init_op]):
net = tf.layers.dropout(net, rate=0.25, training=True)

Variable rnn/gru_cell/gates/weights already exists, disallowed

I want to implement the code in th book Tesorflow for Machine Intelligence, the code runs well at the first time,but when run it again ,the error
"Variable rnn/gru_cell/gates/weights already exists, disallowed" occurs. when I restart the console the error disapear and it occurs after the first running or debug. the code is below:
def prediction(self):
output, _ = tf.nn.dynamic_rnn(tf.contrib.rnn.GRUCell(300),
self.data,
dtype = tf.float32,
sequence_length = self.length)
last = self._last_relevant(output, self.length)
#softmax层
num_classes =int(self.target.get_shape()[1])
weight = tf.Variable(tf.truncated_normal([self.params.rnn_hidden, num_classes], stddev = 0.01))
bias = tf.Variable(tf.constant(0.1, shape = [num_classes]))
prediction = tf.nn.softmax(tf.matmul(last, weight) + bias)
return prediction
anyone can help me with the problem?
Code that adds things to your graph (which includes pretty much everything in the function you posted) should usually only be run once. When you want to train your model or have it make a prediction, you would use something like sess.run with a feed_dict and the ops you want output from.
If you actually want to delete your graph without restarting the console, you can use tf.reset_default_graph().

Creating an image summary only for a subset of validation set images using Tensorflow Estimator API

I'm trying to add image summary operations to visualize how well my network manages to reconstruct inputs from the validation set. However, since there are too many images in the validation set I would only like to plot a small subset of them.
I managed to achieve this with manual training loop, but I struggle to achieve the same with the new Tensorflow Estimator/Experiment/Datasets API. Has anyone done something like this?
The Experiment and Estimator are high level TensorFlow APIs. Although you could probably solve your issue with a hook, if you want more control on what's happening during the training process, it may be easier not to use these APIs.
That said, you can still use the Dataset API which will bring you a lot of useful features.
To solve your problem with the Dataset API, you will need to switch between train and validation datasets in your training loop.
One way to do that is to use a feedable iterator. See here for more details:
https://www.tensorflow.org/programmers_guide/datasets
You can also see a full example switching between training and validation with the Dataset API in this notebook.
In brief, after having created your train_dataset and your val_dataset, your training loop could be something like this:
# create TensorFlow Iterator objects
training_iterator = val_dataset.make_initializable_iterator()
val_iterator = val_dataset.make_initializable_iterator()
with tf.Session() as sess:
# Initialize variables
init = tf.global_variables_initializer()
sess.run(init)
# Create training data and validation data handles
training_handle = sess.run(training_iterator.string_handle())
validation_handle = sess.run(val_iterator.string_handle())
for epoch in range(number_of_epochs):
# Tell iterator to go to beginning of dataset
sess.run(training_iterator.initializer)
print ("Starting epoch: ", epoch)
# iterate over the training dataset and train
while True:
try:
sess.run(train_op, feed_dict={handle: training_handle})
except tf.errors.OutOfRangeError:
# End of epoch
break
# Tell validation iterator to go to beginning of dataset
sess.run(val_iterator.initializer)
# run validation on only 10 examples
for i in range(10):
my_value = sess.run(my_validation_op, feed_dict={handle: validation_handle}))
# Do whatever you want with my_value
...
I figured out a solution that uses Estimator/Experiment API.
First you need to modify your Dataset input to not only provide labels and features, but also some form of an identifier for each sample (in my case it was a filename). Then in the hyperparameters dictionary (params argument) you need to specify which of the validation samples you want to plot. You also will have to pass the model_dir in those parameters. For example:
params = tf.contrib.training.HParams(
model_dir=model_dir,
images_to_plot=["100307_EMOTION.nii.gz", "100307_FACE-SHAPE.nii.gz",
"100307_GAMBLING.nii.gz", "100307_RELATIONAL.nii.gz",
"100307_SOCIAL.nii.gz"]
)
learn_runner.run(
experiment_fn=experiment_fn,
run_config=run_config,
schedule="train_and_evaluate",
hparams=params
)
Having this set up you can create conditional Summary operations in your model_fn and an evaluation hook to include them in your outputs.
if mode == tf.contrib.learn.ModeKeys.EVAL:
summaries = []
for image_to_plot in params.images_to_plot:
is_to_plot = tf.equal(tf.squeeze(filenames), image_to_plot)
summary = tf.cond(is_to_plot,
lambda: tf.summary.image('predicted', predictions),
lambda: tf.summary.histogram("ignore_me", [0]),
name="%s_predicted" % image_to_plot)
summaries.append(summary)
evaluation_hooks = [tf.train.SummarySaverHook(
save_steps=1,
output_dir=os.path.join(params.model_dir, "eval"),
summary_op=tf.summary.merge(summaries))]
else:
evaluation_hooks = None
Note that the summaries have to be conditional - we are either plotting an image (computationally expensive) or saving a constant (computationally cheap). I opted for using histogram versus scalar in for the dummy summaries to avoid cluttering my tensorboard dashboard.
Finally you need to pass the hook in the return object of your `model_fn'
return tf.estimator.EstimatorSpec(
mode=mode,
predictions=predictions,
loss=loss,
train_op=train_op,
evaluation_hooks=evaluation_hooks
)
Please note that this only works when your batch size is 1 when evaluating the model (which should not be a problem).

Using tf.cond() to feed my graph for training and validation

In my TensorFlow code I want my network to take inputs from one of the two StagingArea objects depending upon whether I want to do training or testing.
A part of the graph construction code I wrote is as follows :
with tf.device("/gpu:0"):
for i in range(numgpus):
with tf.variable_scope(tf.get_variable_scope(), reuse=i>0) as vscope:
with tf.device('/gpu:{}'.format(i)):
with tf.name_scope('GPU-Tower-{}'.format(i)) as scope:
phase = tf.get_variable("phase", [], initializer=tf.zeros_initializer(),dtype=tf.uint8, trainable=False)
phaseassigntest = phase.assign(1)
phaseassigntrain = phase.assign(0)
phasetest = tf.equal(phase, 0)
is_training = tf.cond(phasetest, lambda: tf.constant(True), lambda: tf.constant(False))
trainstagingarea = tf.contrib.staging.StagingArea([tf.float32, tf.int32], shapes=[[trainbatchsize, 3, 221, 221], [trainbatchsize]], capacity=20)
putoptrain = trainstagingarea.put(train_iterator.get_next())
trainputop.append(putoptrain)
getoptrain = trainstagingarea.get()
traingetop.append(getoptrain)
trainclearop = trainstagingarea.clear()
trainstageclear.append(trainclearop)
trainsizeop = trainstagingarea.size()
trainstagesize.append(trainsizeop)
valstagingarea = tf.contrib.staging.StagingArea([tf.float32, tf.int32], shapes=[[valbatchsize, 3, 221, 221], [valbatchsize]], capacity=20)
putopval = valstagingarea.put(val_iterator.get_next())
valputop.append(putopval)
getopval = valstagingarea.get()
valgetop.append(getopval)
valclearop = valstagingarea.clear()
valstageclear.append(valclearop)
valsizeop = valstagingarea.size()
valstagesize.append(valsizeop)
#elem = valgetop[i]
elem = tf.cond(is_training,lambda: traingetop[i],lambda: valgetop[i])
img = elem[0]
label = elem[1]
labelonehot = tf.one_hot(label, depth=numclasses)
net, networksummaries = overfeataccurate(img,numclasses=numclasses, phase=is_training)
I have used tf.cond to make sure that the network is fed by one of the two StagingArea objects. One is meant for training and the other one is meant for validation.
Now, when I try to execute the graph as follows, I do not get any result and infact the code just hangs and I have to kill the process.
with tf.Session(graph=g,config=config) as sess:
sess.run(init_op)
sess.run(tf.local_variables_initializer())
sess.run(val_initialize)
for i in range(20):
sess.run(valputop)
print(sess.run(valstagesize))
writer = tf.summary.FileWriter('.', graph=tf.get_default_graph())
epoch = 0
iter = 0
print("Performing Validation")
sess.run(phaseassigntest)
saver = tf.train.Saver()
while(epoch<10):
time_init = time.time()
while True:
try:
[val_accu, _, summaries] = sess.run([towervalidation, towervalidationupdateop,validation_summary_op])
print(val_accu)
when instead of tf.cond() I directly assign elem = valgetop[i], the code works just fine.
Am I missing something over here ?
What is the right way to feed my network based on whether I want to do training or testing ?
NOTE The error does not go away even if I set numgpus to 1.
Your problem
What you think tf.cond does
Based on the flag, execute what is required to put either traingetop[i] or valgetop[i] into your elem tensor.
What tf.cond actually does
Executes what is required to get both traingetop[i] and valgetop[i], then passes one of them into your elem tensor.
So
The reason it is hanging forever is because it's waiting for an element to be added to your training staging area (so that it can get that element and discard it). You're forgiven for not realising this is what it's doing; it's actually very counter-intuitive. The documentation is awfully unclear on how to deal with this.
Recommended Solution (by Tensorflow documentation)
If you really need the queues to be in the same graph, then you need to make two copies of your ENTIRE graph, one that is fed by your training staging area, and one that is fed by your validation staging area. Then you just use the relevant tensor in your sess.run call. I recommend creating a function that takes a queue output tensor, and returns a model_output tensor. Now you have a train_time_output tensor and a validation_time_output tensor, and you can choose which one you want to execute in your sess.run.
Warning
You need to make sure that you aren't actually creating new variables to go along with these new ops. To do that take a look at the latest documentation on variables. It looks like they've simplified it from v0.12, and it essentially boils down to using tf.get_variable instead of tf.Variable to create your variables.
My preferred work around
Although that is the recommended solution (AFAIK), it is extremely unsatisfying to me; you're creating a whole other set of operations on the graph that just happen to use the same weights. It seems like there's a lot of potential for programmer error by abusing the separation between train time and test/validation time (resulting in the model acting unexpectedly different at these times). Worse; it doesn't solve the problem of tf.cond demanding the values for inputs to both branches, it just forces you to copy your whole graph, which is not always possible.
I prefer to just not have my queues in the graph like that, and treat the model as a function which can be fed an example without caring where it's from. That is, I would instantiate the model with a tf.placeholder as the input, and at execution time I would use feed_dict to actually provide the value. It would function something like this
#inside main training loop
if time_to_train:
example = sess.run(traingettop)
else:
example = sess.run(valgettop)
result = sess.run(model_output, {input_placeholder: example})
It's very useful to note that you can use the feed_dict to feed any value for any tensor anywhere in your model. So, you can change any model definition that, due to tf.cond would always require an input, like:
a = tf.constant(some_value)
b = tf.placeholder(tf.float32)
flag = tf.placeholder(tf.bool, [])
one_of_them = tf.cond(flag, a, b)
model_output = build_graph(one_of_them)
Into a definition that doesn't, like:
a = tf.constant(some_value)
model_output = build_graph(a)
Remembering that you can always overwrite what a is at execution time:
# In main training loop,
sess.run(train_op, {a: some_other_value})
This essentially pushes the conditional into native python land. In your code you might end up with something like:
if condition_satisfied:
sess.run(train_op, {a:some_other_value})
else:
sess.run(train_op)
Performance concerns
If you are using tensorflow on a single machine, then there is practically no performance cost to this solution, as the numpy array/s put into the example python variable are actually still stored on the GPU.
If you are using tensorflow in a distributed fashion, then this solution would kill your performance; it would require sending the example from whatever machine it's on to the master so that it can send it back.

TF LSTM: Save State from training session for prediction session later

I am trying to save the latest LSTM State from training to be reused during the prediction stage later. The problem I am encountering is that in the TF LSTM model the State is passed around from one training iteration to next via a combination of a placeholder and a numpy array -- neither of which seems to be included in the Graph by default when the session is saved.
To work around this, I am creating a dedicated TF variable to hold the latest version of the state so as to add it to the Session graph, like so:
# latest State from last training iteration:
_, y, ostate, smm = sess.run([train_step, Y, H, summaries], feed_dict=feed_dict)
# now add to TF variable:
savedState = tf.Variable(ostate, dtype=tf.float32, name='savedState')
tf.variables_initializer([savedState]).run()
save_path = saver.save(sess, pathModel + '/my_model.ckpt')
This seems to add the savedState variable to the saved session graph well, and is easily recoverable later with the rest of the Session.
The problem though, is that the only way I have managed to actually use that variable later in the restored Session, is that if I initialize all variables in the session AFTER I recover it (which seems to reset all trained variables, including the weights/biases/etc.!). If I initialize variables first and THEN recover the session (which works fine in terms of preserving the trained varialbes), then I am getting an error that I'm trying to access an uninitialized variable.
I know there is a way to initialize a specific individual varialbe (which i am using while saving it originally) but the problem is that when we recover them, we refer to them by name as strings, we don't just pass the variable itself?!
# This produces an error 'trying to use an uninitialized varialbe
gInit = tf.global_variables_initializer().run()
new_saver = tf.train.import_meta_graph(pathModel + 'my_model.ckpt.meta')
new_saver.restore(sess, pathModel + 'my_model.ckpt')
fullState = sess.run('savedState:0')
What is the right way to get this done? As a workaround, I am currently saving the State to CSV just as a numpy array and then recover it the same way. It works OK, but clearly not the cleanest solution given that every other aspect of saving/restoring the TF session works perfectly.
Any suggestions appreciated!
**EDIT:
Here's the code that works well, as described in the accepted answer below:
# make sure to define the State variable before the Saver variable:
savedState = tf.get_variable('savedState', shape=[BATCHSIZE, CELL_SIZE * LAYERS])
saver = tf.train.Saver(max_to_keep=1)
# last training iteration:
_, y, ostate, smm = sess.run([train_step, Y, H, summaries], feed_dict=feed_dict)
# now save the State and the whole model:
assignOp = tf.assign(savedState, ostate)
sess.run(assignOp)
save_path = saver.save(sess, pathModel + '/my_model.ckpt')
# later on, in some other program, recover the model and the State:
# make sure to initialize all variables BEFORE recovering the model!
gInit = tf.global_variables_initializer().run()
local_saver = tf.train.import_meta_graph(pathModel + 'my_model.ckpt.meta')
local_saver.restore(sess, pathModel + 'my_model.ckpt')
# recover the state from training and get its last dimension
fullState = sess.run('savedState:0')
h = fullState[-1]
h = np.reshape(h, [1, -1])
I haven't tested yet whether this approach unintentionally initializes any other variables in the saved Session, but don't see why it should, since we only run the specific one.
The issue is that creating a new tf.Variable after the Saver was constructed means that the Saver has no knowledge of the new variable. It still gets saved in the metagraph, but not saved in the checkpoint:
import tensorflow as tf
with tf.Graph().as_default():
var_a = tf.get_variable("a", shape=[])
saver = tf.train.Saver()
var_b = tf.get_variable("b", shape=[])
print(saver._var_list) # [<tf.Variable 'a:0' shape=() dtype=float32_ref>]
initializer = tf.global_variables_initializer()
with tf.Session() as session:
session.run([initializer])
saver.save(session, "/tmp/model", global_step=0)
with tf.Graph().as_default():
new_saver = tf.train.import_meta_graph("/tmp/model-0.meta")
print(saver._var_list) # [<tf.Variable 'a:0' shape=() dtype=float32_ref>]
with tf.Session() as session:
new_saver.restore(session, "/tmp/model-0") # Only var_a gets restored!
I've annotated the quick reproduction of your issue above with the variables that the Saver knows about.
Now, the solution is relatively easy. I would suggest creating the Variable before the Saver, then using tf.assign to update its value (make sure you run the op returned by tf.assign). The assigned value will be saved in checkpoints and restored just like other variables.
This could be handled better by the Saver as a special case when None is passed to its var_list constructor argument (i.e. it could pick up new variables automatically). Feel free to open a feature request on Github for this.