End of Sequence Error when using tf.estimator and tf.data - tensorflow

I am using tf.estimator.train_and_evaluate and tf.data.Dataset to feed data to the estimator:
Input Data function:
def data_fn(data_dict, batch_size, mode, num_epochs=10):
    dataset = {}
    if mode == tf.estimator.ModeKeys.TRAIN:
        dataset = tf.data.Dataset.from_tensor_slices(data_dict['train_data'].astype(np.float32))
        dataset = dataset.cache()
        dataset = dataset.shuffle(buffer_size=batch_size * 10).repeat(num_epochs).batch(batch_size)
    else:
        dataset = tf.data.Dataset.from_tensor_slices(data_dict['valid_data'].astype(np.float32))
        dataset = dataset.cache()
        dataset = dataset.batch(batch_size)
    iterator = dataset.make_one_shot_iterator()
    next_element = iterator.get_next()
    return next_element
Train Function:
def train_model(data):
    tf.logging.set_verbosity(tf.logging.INFO)
    config = tf.ConfigProto(allow_soft_placement=True,
                            log_device_placement=False)
    config.gpu_options.allow_growth = True
    run_config = tf.contrib.learn.RunConfig(
        save_checkpoints_steps=10,
        keep_checkpoint_max=10,
        session_config=config
    )
    train_input = lambda: data_fn(data, 100, tf.estimator.ModeKeys.TRAIN, num_epochs=1)
    eval_input = lambda: data_fn(data, 1000, tf.estimator.ModeKeys.EVAL)
    estimator = tf.estimator.Estimator(model_fn=model_fn, params=hps, config=run_config)
    train_spec = tf.estimator.TrainSpec(train_input, max_steps=100)
    eval_spec = tf.estimator.EvalSpec(eval_input,
                                      steps=None,
                                      throttle_secs=30)
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
The training goes fine, but when it comes to evaluation I get this error:
OutOfRangeError (see above for traceback): End of sequence
If I don't use Dataset.batch on the evaluation dataset (by omitting the line dataset = dataset.batch(batch_size) in data_fn) I get the same error, but after a much longer time.
I can only avoid this error if I don't batch the data and use steps=1 for evaluation, but does that perform the evaluation on the whole dataset?
I don't understand what causes this error as the documentation suggests I should be able to evaluate on batches too.
Note: I get the same error when using tf.estimator.evaluate on data batches.

I posted this question as a GitHub issue and here is the response from the TensorFlow team:
https://github.com/tensorflow/tensorflow/issues/19541
Copying from "xiejw" for completeness:
If I understand correctly, the issue is: once the estimator is given an input_fn with a dataset inside, the evaluate process errors out with OutOfRangeError.
Estimator can actually handle this correctly. However, a known common root cause is a bug in the metrics defined in model_fn. We need to rule that part out first.
@mrezak, if possible, can you show the code of your model_fn? Or, if you have a minimal reproducible script, that would be extremely helpful. -- Thanks in advance.
A common cause of this problem: a metric in TensorFlow should return two ops, update_op and value_op. Estimator calls the update_op for each batch of data from the input source and, once it is exhausted, calls the value_op to get the metric values. The value_op here should depend only on reading variables.
Many model_fn implementations make value_op depend on the input pipeline, so estimator.evaluate triggers the input pipeline one more time, which errors out with OutOfRangeError.
The problem was indeed how I defined eval_metric in my model_fn. In my actual code the total loss to be optimized was composed of multiple losses (reconstruction + L2 + KL), and for evaluation I wanted to get the reconstruction loss (on the validation data), which depended on the input data pipeline. My reconstruction cost was more complex than MSE (and didn't match any of the other tf.metrics functions), so it was not straightforward to implement with the basic tf.metrics building blocks.
This is xiejw's suggestion, which fixed the issue:
my_total_loss = ...  # the loss you care about. Pay attention to how you reduce the loss.
eval_metric_ops = {'total_loss': tf.metrics.mean(my_total_loss)}
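For context, here is a minimal sketch (TF 1.x; the names reconstruction_loss, total_loss, and train_op are my placeholders, not from the original post) of how such a metric sits inside a model_fn. tf.metrics.mean keeps its own accumulator variables, so the value op read at the end of evaluation no longer depends on the input pipeline:
def model_fn(features, labels, mode, params):
    # ... build the model here; `reconstruction_loss`, `total_loss`, and `train_op`
    # are assumed to be defined above (hypothetical names).

    eval_metric_ops = None
    if mode == tf.estimator.ModeKeys.EVAL:
        # tf.metrics.mean returns (value_op, update_op). Estimator runs update_op on
        # every batch and reads value_op only after the input is exhausted, so
        # value_op must not reach back into the dataset iterator.
        eval_metric_ops = {
            'reconstruction_loss': tf.metrics.mean(reconstruction_loss),
            'total_loss': tf.metrics.mean(total_loss),
        }

    return tf.estimator.EstimatorSpec(
        mode=mode,
        loss=total_loss,
        train_op=train_op,
        eval_metric_ops=eval_metric_ops)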

Related

Training seq2seq model on Google Colab TPU with big dataset - Keras

I'm trying to train a sequence to sequence model for machine translation using Keras on Google Colab TPU.
I have a dataset which I can load into memory, but I have to preprocess it to feed it to the model. In particular, I need to convert the target words to one-hot vectors, and with many examples I can't keep the entire conversion in memory, so I need to generate batches of data.
I'm using this function as a batch generator:
def generate_batch_bert(X_ids, X_masks, y, batch_size=1024):
    ''' Generate a batch of data '''
    while True:
        for j in range(0, len(X_ids), batch_size):
            # batch of encoder and decoder data
            encoder_input_data_ids = X_ids[j:j+batch_size]
            encoder_input_data_masks = X_masks[j:j+batch_size]
            y_decoder = y[j:j+batch_size]
            # decoder target and input for teacher forcing
            decoder_input_data = y_decoder[:, :-1]
            decoder_target_seq = y_decoder[:, 1:]
            # batch of decoder target data
            decoder_target_data = to_categorical(decoder_target_seq, vocab_size_fr)
            # keep only batches with the right number of instances for training on TPU
            if encoder_input_data_ids.shape[0] == batch_size:
                yield ([encoder_input_data_ids, encoder_input_data_masks, decoder_input_data],
                       decoder_target_data)
The problem is that whenever I try to run the fit function as follows:
model.fit(x=generate_batch_bert(X_train_ids, X_train_masks, y_train, batch_size=batch_size),
          steps_per_epoch=train_samples//batch_size,
          epochs=epochs,
          callbacks=callbacks,
          validation_data=generate_batch_bert(X_val_ids, X_val_masks, y_val, batch_size=batch_size),
          validation_steps=val_samples//batch_size)
I get the following error:
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_util.py:445 make_tensor_proto
raise ValueError("None values not supported.")
ValueError: None values not supported.
Not sure what's wrong and how I can solve this problem.
EDIT
I tried loading a smaller amount of data into memory so that the one-hot conversion of the target words doesn't crash the kernel, and it actually works. So there is obviously something wrong with how I generate batches.
It's hard to tell what's wrong since you don't provide your model definition nor any sample data. However, I'm fairly certain that you're running into the same TensorFlow bug that I recently got bitten by.
The workaround is to use the tensorflow.data API, which works much better with TPUs. Like this:
from tensorflow.data import Dataset
import tensorflow as tf

def map_fn(X_id, X_mask, y):
    decoder_target_data = tf.one_hot(y[1:], vocab_size_fr)
    return (X_id, X_mask, y[:-1]), decoder_target_data

...

X_ids = Dataset.from_tensor_slices(X_ids)
X_masks = Dataset.from_tensor_slices(X_masks)
y = Dataset.from_tensor_slices(y)

ds = Dataset.zip((X_ids, X_masks, y)).map(map_fn).batch(1024)

model.fit(x=ds, ...)
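As an aside (my addition, not part of the original answer): the generator above drops incomplete batches to keep shapes static for the TPU; with tf.data the same effect can be obtained with the drop_remainder flag of batch:
# Drop the final partial batch so every batch has a static shape, which XLA/TPU requires.
ds = Dataset.zip((X_ids, X_masks, y)).map(map_fn).batch(1024, drop_remainder=True)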

Memory leak tf.data + Keras

I have a memory leak in my training pipeline and don't know how to fix it.
I use Tensorflow version: 1.9.0 and Keras (tf) version: 2.1.6-tf with Python 3.5.2
This is what my training pipeline looks like:
for i in range(num_epochs):
    training_data = training_set.make_one_shot_iterator().get_next()
    hist = model.fit(training_data[0],
                     [training_data[1], training_data[2], training_data[3]],
                     steps_per_epoch=steps_per_epoch_train, epochs=1, verbose=1,
                     callbacks=[history, MemoryCallback()])
    # custom validation
It looks like the memory of the iterator is not freed after the iterator is exhausted. I have already tried del training_data after model.fit; it didn't work.
Can anybody give some hints?
Edit:
This is how I create the dataset.
dataset = tf.data.TFRecordDataset(tfrecords_filename)
dataset = dataset.map(map_func=preprocess_fn, num_parallel_calls=8)
dataset = dataset.shuffle(100)
dataset = dataset.batch(batch_size=batch_size)
dataset = dataset.prefetch(1)
Including the repeat() method to reinitialize your iterator might solve your problem. You can take a look at the Input Pipeline Performance Guide to figure out a good, optimized order of these methods for your requirements.
dataset = dataset.shuffle(100)
dataset = dataset.repeat() # Can specify num_epochs as input if needed
dataset = dataset.batch(batch_size=batch_size)
dataset = dataset.prefetch(1)
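Putting this together, a minimal sketch (TF 1.x, assuming the same preprocess_fn, tfrecords_filename, and batch_size as above): the key point is to build the dataset with repeat() and create the iterator once, outside the epoch loop, so a fresh iterator is not allocated every epoch.
dataset = tf.data.TFRecordDataset(tfrecords_filename)
dataset = dataset.map(map_func=preprocess_fn, num_parallel_calls=8)
dataset = dataset.shuffle(100)
dataset = dataset.repeat()   # repeat indefinitely; fit() decides how many steps make an epoch
dataset = dataset.batch(batch_size=batch_size)
dataset = dataset.prefetch(1)

# Create the iterator once; reuse the same tensors for every epoch.
training_data = dataset.make_one_shot_iterator().get_next()

for i in range(num_epochs):
    hist = model.fit(training_data[0],
                     [training_data[1], training_data[2], training_data[3]],
                     steps_per_epoch=steps_per_epoch_train, epochs=1, verbose=1,
                     callbacks=[history, MemoryCallback()])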
In case you can afford to do the validation as a part of the fit method, you can use something like the code below and lose the loop altogether to make your life easier.
training_data = training_set.make_one_shot_iterator().get_next()
# val_data refers to your validation data and steps_per_epochs_val to the number of validation batches
hist = model.fit(training_data[0],
                 [training_data[1], training_data[2], training_data[3]],
                 validation_data=val_data.make_one_shot_iterator(),
                 validation_steps=steps_per_epochs_val,
                 steps_per_epoch=steps_per_epoch_train,
                 epochs=num_epochs, verbose=1,
                 callbacks=[history, MemoryCallback()])
Reference: https://github.com/keras-team/keras/blob/master/examples/mnist_dataset_api.py

How to get the same data batch multiple times using TensorFlow's `tf.data` API

Is there a way to evaluate a tensor that depends on a tf.data iterator, but temporarily pause the iterator so that it returns the previous batch?
Imagine the snippet below:
dataset = tf.data.Dataset.range(5)
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()
train_op = next_batch * 10
Every time I evaluate train_op it does so by fetching a new batch of data – which is what I want. However every N steps I'd like to do some additional stuff for debugging like evaluating accuracy on the training batch, creating a checkpoint, running things with dropout disabled etc. I'd like these operations to happen on the same data batch I have just used but I haven't found a way to pause tf.data iterator for one or multiple steps.
The obvious solution is to use placeholders instead of using next_batch directly. This means I have to evaluate next_batch first and then feed it back to the session using feed_dict to evaluate train_op. I believe this is not recommended due to the performance penalty. Is that still the case? If so, what is the recommended way to deal with these cases?
Edit: adding pseudo code for what I'm after:
for step in range(num_steps):
    sess.run(train_op)  # train_op depends on next_batch and therefore fetches a new batch
    if step % N == 0:
        # I want the line below to run on the same batch as above, but acc_op also
        # depends on next_batch and therefore fetches a new batch
        acc = sess.run([acc_op, saver_op], feed_dict={keep_drop: 1})
Doesn't it work in the following way?
dataset = tf.data.Dataset.range(5)
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()
train_op = next_batch * 10
other_ops = do_other_stuff(next_batch)
num_train_batch = 50

for ep in range(num_train_batch):
    if ep % N == 0:
        _, other_stuffs = sess.run([train_op, other_ops])
    else:
        _ = sess.run(train_op)
and you can feed the dropout value differently each time.
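The point being made (my wording, not the answerer's): when train_op and the debug ops are fetched in a single sess.run call, iterator.get_next() is evaluated only once for that call, so every fetch sees the same batch. A tiny self-contained sketch illustrating this with the range(5) dataset from the question:
import tensorflow as tf

dataset = tf.data.Dataset.range(5)
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()

train_op = next_batch * 10    # stand-in for the real training op
debug_op = next_batch + 100   # stand-in for accuracy/summary/etc.

with tf.Session() as sess:
    # Both fetches share the same evaluation of next_batch within one run call:
    # the first call returns (0, 100), the next (10, 101), and so on.
    train_val, debug_val = sess.run([train_op, debug_op])
    print(train_val, debug_val)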

Is it possible to loop through all minibatches in a single tensorflow op using dataset/iterators?

I'm working with tf.data.dataset/iterator mechanism and trying to improve data loading performance. It occurred to me that offloading the entire minibatch loop from Python might help. My data is small enough that storing on CPU or GPU is no problem.
So, is it possible to loop an optimizer node over a full minibatched epoch within a single call to session.run?
The tensor returned by iterator.get_next() is only incremented once per session.run, which would seem to make it impossible to iterate through a dataset of minibatches... but if it could be done, my CPU would only have to touch the Python thread once per epoch.
UPDATE: @muskrat's suggestion to use tf.slice can be used for this purpose. See my subsequent non-answer with a schematic implementation of this using tf.while_loop. However, the question is whether this can be accomplished using dataset/iterators... and I'd still like to know.
From the description it seems that you already have the dataset preloaded as a constant on CPU/GPU, like at this example. That's certainly the first step.
Second, I suggest using tf.slice() to replicate the effect of the minibatch operation. In other words, just manually slice minibatches out of the preloaded constant (your dataset), and you should get the desired behavior. See for example the slice docs or this related post.
If that's not enough detail, please edit your question to include a code example (with mnist or something) and I can give more details.
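To make the suggestion concrete, here is a minimal sketch (not from the original answer; TF 1.x, and my_data is a made-up in-memory array) of slicing a minibatch out of a preloaded constant:
import numpy as np
import tensorflow as tf

my_data = np.random.rand(1000, 8).astype(np.float32)   # hypothetical preloaded dataset
batch_size = 32

data_const = tf.constant(my_data)                       # the whole dataset lives in the graph once
batch_index = tf.placeholder(tf.int32, shape=[], name='batch_index')

# Take rows [batch_index * batch_size, batch_index * batch_size + batch_size) and all columns.
batch = tf.slice(data_const, [batch_index * batch_size, 0], [batch_size, -1])

with tf.Session() as sess:
    first_batch = sess.run(batch, feed_dict={batch_index: 0})
    second_batch = sess.run(batch, feed_dict={batch_index: 1})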
This "answer" is an implementation of muskrat's tf.slice suggestion with the details of tf.while_loop worked out (with help from How to use tf.while_loop() in tensorflow and https://www.tensorflow.org/api_docs/python/tf/while_loop).
Unless your data and model are small enough that you're bottlenecked by Python I/O (like me!), this solution is probably academic.
Advantages:
Trains over minibatches without returning to the Python thread.
Uses only ops that have GPU implementations, meaning that the entire graph can be placed on the GPU.
On my small dataset, which is presumably bottlenecked by Python I/O, this solution is twice the speed of my dataset/iterator (which touches Python once per minibatch) and four times the speed of passing minibatches through feed_dict.
Disadvantages:
tf.while_loop is treacherous. It's challenging to understand when ops inside the loop body are evaluated and when the ops they depend on are evaluated, particularly given the (thin) official documentation and limited Stack Overflow coverage.
What the tf.while_loop documentation doesn't mention is that tensors outside the body of the loop are only evaluated once, even if ops inside the loop depend on them. This means that the optimization, model, and loss have to be defined inside the loop. That limits flexibility if you'd like, for example, to be able to call validation-loss ops between training epochs. Presumably this could be accomplished with tf.cond statements and the appropriate flags passed in via feed_dict, but it's not nearly as flexible or elegant as the dataset/iterator mechanism in tf.data.
Shuffling the data at each epoch doesn't seem to be available on the GPU.
Here's my schematic code (I've omitted the variable and model definitions for brevity):
def buildModel(info, training_data, training_targets):
    graph = tf.Graph()
    with graph.as_default():
        # batch_size is passed in from Python once per epoch.
        batch_size = tf.placeholder(tf.float32, name='batch_size')

        # Initializers for the loop variables of tf.while_loop
        batchCounter = tf.Variable(0, dtype=tf.float32, trainable=False)
        lossList = tf.Variable(tf.zeros([0, 1]), trainable=False)

        # In a full example, I'd normalize my data here. And possibly shuffle.
        tf_training_data = tf.constant(training_data, dtype=tf.float32)
        tf_training_targets = tf.constant(training_targets, dtype=tf.float32)

        # For brevity, I'll spare the definitions of my variables. Because tf.Variables
        # are essentially treated as globals in the model and are manipulated directly
        # (like with tf.apply), they can reside outside runMinibatch, the body of tf.while_loop.
        # weights_1 =
        # biases_1 =
        # etc.

        def moreMinibatches(batchCount, lossList):
            return (batchCount + 1) * batch_size <= len(training_data)

        def runMinibatch(batchCount, lossList):
            # These tensors and ops have to be defined inside runMinibatch, otherwise they're
            # not updated as tf.while_loop loops. That means the slices, the model definition,
            # the loss tensor, and the training op.
            dat_batch = tf.slice(tf_training_data,
                                 [tf.cast(batchCount * batch_size, tf.int32), 0],
                                 [tf.cast(batch_size, tf.int32), -1])
            targ_batch = tf.slice(tf_training_targets,
                                  [tf.cast(batchCount * batch_size, tf.int32), 0],
                                  [tf.cast(batch_size, tf.int32), -1])

            # Here's where you'd define the model as a function of the weights and biases
            # above and dat_batch.
            # model = <insert here>
            loss = tf.reduce_mean(tf.squared_difference(model, targ_batch))
            optimizer = tf.train.AdagradOptimizer(0.01)  # for example
            train_op = optimizer.minimize(loss, name='optimizer')

            # control_dependencies ensures that train_op runs before the return values are
            # produced, even though they don't explicitly depend on it.
            with tf.control_dependencies([train_op]):
                return batchCount + 1, tf.concat([lossList, [[loss]]], 0)

        # The idea is that this trains a full epoch without returning to Python.
        trainAllMinibatches = tf.while_loop(moreMinibatches, runMinibatch,
                                            [batchCounter, lossList],
                                            shape_invariants=[batchCounter.get_shape(),
                                                              tf.TensorShape(None)])

    return (graph,
            {'trainAllMinibatches': trainAllMinibatches,
             'batchCounter': batchCounter,
            })

numEpochs = 100    # e.g.
minibatchSize = 32
# training_dataset = <data here>
# training_targets = <targets here>
graph, ops = buildModel(info, training_dataset, training_targets)

with tf.Session(graph=graph, config=config) as session:
    tf.global_variables_initializer().run()
    for i in range(numEpochs):
        # This op trains on all the complete minibatches that fit in the full dataset.
        # finalBatchCount is the number of complete minibatches; lossList holds the loss
        # of each minibatch step.
        finalBatchCount, lossList = session.run(ops['trainAllMinibatches'],
                                                feed_dict={'batch_size:0': minibatchSize})
        print('minibatch losses at Epoch', i, ': ', lossList)
I implemented the tf.slice() and tf.while_loop approach suggested above to vectorize the mini-batches.
In my case the performance was about 1.86 times faster than mini-batches fed through feed_dict, but I found that the loss values of each epoch were not stable.
Then I used tf.random_shuffle to shuffle the inputs every epoch, and the problem was much mitigated (the performance gain was reduced to 1.68 times).
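For reference, one way to do the per-epoch shuffle mentioned above while keeping data and targets aligned (a sketch, not the commenter's code; tf_training_data / tf_training_targets are the preloaded constants from the answer above) is to shuffle an index vector and gather both tensors with it:
# Shuffle row indices once per epoch, then gather so data and targets stay paired.
num_rows = tf.shape(tf_training_data)[0]
shuffled_idx = tf.random_shuffle(tf.range(num_rows))
shuffled_data = tf.gather(tf_training_data, shuffled_idx)
shuffled_targets = tf.gather(tf_training_targets, shuffled_idx)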

Creating an image summary only for a subset of validation set images using Tensorflow Estimator API

I'm trying to add image summary operations to visualize how well my network manages to reconstruct inputs from the validation set. However, since there are too many images in the validation set I would only like to plot a small subset of them.
I managed to achieve this with a manual training loop, but I'm struggling to achieve the same with the new TensorFlow Estimator/Experiment/Datasets API. Has anyone done something like this?
The Experiment and Estimator are high-level TensorFlow APIs. Although you could probably solve your issue with a hook, if you want more control over what's happening during the training process it may be easier not to use these APIs.
That said, you can still use the Dataset API which will bring you a lot of useful features.
To solve your problem with the Dataset API, you will need to switch between train and validation datasets in your training loop.
One way to do that is to use a feedable iterator. See here for more details:
https://www.tensorflow.org/programmers_guide/datasets
You can also see a full example switching between training and validation with the Dataset API in this notebook.
In brief, after having created your train_dataset and your val_dataset, your training loop could be something like this:
# create a feedable handle placeholder and a generic iterator from it
# (this is the `handle` fed in the feed_dicts below, following the programmer's guide linked above)
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    handle, train_dataset.output_types, train_dataset.output_shapes)
next_element = iterator.get_next()  # build train_op / my_validation_op from this

# create TensorFlow Iterator objects
training_iterator = train_dataset.make_initializable_iterator()
val_iterator = val_dataset.make_initializable_iterator()

with tf.Session() as sess:
    # Initialize variables
    init = tf.global_variables_initializer()
    sess.run(init)

    # Create training data and validation data handles
    training_handle = sess.run(training_iterator.string_handle())
    validation_handle = sess.run(val_iterator.string_handle())

    for epoch in range(number_of_epochs):
        # Tell iterator to go to beginning of dataset
        sess.run(training_iterator.initializer)
        print("Starting epoch:", epoch)

        # iterate over the training dataset and train
        while True:
            try:
                sess.run(train_op, feed_dict={handle: training_handle})
            except tf.errors.OutOfRangeError:
                # End of epoch
                break

        # Tell validation iterator to go to beginning of dataset
        sess.run(val_iterator.initializer)

        # run validation on only 10 examples
        for i in range(10):
            my_value = sess.run(my_validation_op, feed_dict={handle: validation_handle})
            # Do whatever you want with my_value
            ...
I figured out a solution that uses Estimator/Experiment API.
First you need to modify your Dataset input to provide not only labels and features but also some form of identifier for each sample (in my case it was a filename). Then in the hyperparameters dictionary (the params argument) you need to specify which of the validation samples you want to plot. You will also have to pass the model_dir in those parameters. For example:
params = tf.contrib.training.HParams(
    model_dir=model_dir,
    images_to_plot=["100307_EMOTION.nii.gz", "100307_FACE-SHAPE.nii.gz",
                    "100307_GAMBLING.nii.gz", "100307_RELATIONAL.nii.gz",
                    "100307_SOCIAL.nii.gz"]
)

learn_runner.run(
    experiment_fn=experiment_fn,
    run_config=run_config,
    schedule="train_and_evaluate",
    hparams=params
)
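For reference, a rough sketch (my field names, not from the original answer) of how the input function might expose the identifier alongside the features, assuming filenames, images, and labels are in-memory arrays/lists of equal length:
def input_fn():
    # The features dict carries the identifier ("filename") next to the image data.
    dataset = tf.data.Dataset.from_tensor_slices((
        {"image": images, "filename": filenames},
        labels))
    dataset = dataset.batch(1)  # evaluation batch size of 1, as required below
    return dataset.make_one_shot_iterator().get_next()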
Having this set up you can create conditional Summary operations in your model_fn and an evaluation hook to include them in your outputs.
if mode == tf.contrib.learn.ModeKeys.EVAL:
    summaries = []
    for image_to_plot in params.images_to_plot:
        is_to_plot = tf.equal(tf.squeeze(filenames), image_to_plot)
        summary = tf.cond(is_to_plot,
                          lambda: tf.summary.image('predicted', predictions),
                          lambda: tf.summary.histogram("ignore_me", [0]),
                          name="%s_predicted" % image_to_plot)
        summaries.append(summary)

    evaluation_hooks = [tf.train.SummarySaverHook(
        save_steps=1,
        output_dir=os.path.join(params.model_dir, "eval"),
        summary_op=tf.summary.merge(summaries))]
else:
    evaluation_hooks = None
Note that the summaries have to be conditional - we either plot an image (computationally expensive) or save a constant (computationally cheap). I opted for a histogram rather than a scalar for the dummy summaries to avoid cluttering my TensorBoard dashboard.
Finally, you need to pass the hook in the return object of your model_fn:
return tf.estimator.EstimatorSpec(
    mode=mode,
    predictions=predictions,
    loss=loss,
    train_op=train_op,
    evaluation_hooks=evaluation_hooks
)
Please note that this only works when your batch size is 1 when evaluating the model (which should not be a problem).