TensorFlow Data Pipeline that Loads on Each Batch - tensorflow

I am trying to think through this dataset pipeline for TensorFlow.
For the dataset, on the first epoch I want it to use a generator that generates synthetic data, so for each batch it will call the generator. The generator's output needs to be saved to disk so that on the next epoch it can pull the same data as the first epoch, but from disk.
In addition, for each epoch I want it to load the next batch while the current batch is being processed (https://www.tensorflow.org/guide/data_performance). So on the first epoch it will call the generator for the next batch on every batch, and on subsequent epochs it will load the next batch from disk while the current batch runs.
Has anyone done something like this?
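For what it's worth, here is a minimal sketch of how tf.data can express this pattern, assuming a hypothetical synthetic_batches() generator and cache path: cache(filename) writes each element to disk on the first pass and replays it from disk on later epochs, while prefetch() overlaps producing the next batch with training on the current one.

import numpy as np
import tensorflow as tf

def synthetic_batches():
    # Hypothetical generator producing (features, labels) batches of synthetic data.
    for _ in range(1000):
        yield (np.random.rand(32, 10).astype(np.float32),
               np.random.randint(0, 2, size=(32,), dtype=np.int64))

dataset = tf.data.Dataset.from_generator(
    synthetic_batches,
    output_signature=(
        tf.TensorSpec(shape=(32, 10), dtype=tf.float32),
        tf.TensorSpec(shape=(32,), dtype=tf.int64),
    ),
)

dataset = (
    dataset
    .cache("/tmp/synthetic_cache")   # epoch 1: generate and write to disk; later epochs: read back from disk
    .prefetch(tf.data.AUTOTUNE)      # prepare the next batch while the current one is training
)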

Related

What is an epoch, when using generators?

What is an epoch when you're using a generator for your model.fit data?
It makes sense with a standard NumPy-array dataset - an epoch is one pass over the entire dataset.
However, with a generator there's no length - hence no obvious "epochs".
Does the epoch simply represent an arbitrarily sized group of steps when using a generator dataset?
Is there something special that happens at the end of an epoch?
Yes, an epoch is an arbitrary group of steps, but generally it's one pass through the whole dataset.
However, you don't define that in the generator. You write a generator that yields batches, calculate something like steps_per_epoch = int(training_samples / batch_size), and then pass steps_per_epoch to the training/fit function (in Keras, for example).
Regarding the second question: yes, you can evaluate the model at the end of each epoch and log it to see the improvement, and you can also save model checkpoints.
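As a rough sketch of that idea (the sizes and toy model here are placeholders): an infinite generator plus steps_per_epoch is enough for fit() to draw epoch boundaries, and end-of-epoch work such as checkpointing hangs off callbacks.

import numpy as np
import tensorflow as tf

training_samples = 60000
batch_size = 32
steps_per_epoch = training_samples // batch_size

def batch_generator():
    # Infinite generator; fit() ends each "epoch" after steps_per_epoch batches.
    while True:
        yield np.random.rand(batch_size, 10), np.random.randint(0, 2, size=(batch_size,))

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid", input_shape=(10,))])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])

model.fit(batch_generator(),
          steps_per_epoch=steps_per_epoch,
          epochs=5,
          # runs at each epoch boundary, e.g. to save checkpoints
          callbacks=[tf.keras.callbacks.ModelCheckpoint("ckpt_{epoch:02d}.h5")])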

Does steps_per_epoch use up the whole dataset

I have a large training dataset created by a generator, about 60,000 batches (of size 32). Due to the time required for training, I need to use a callback to save the model periodically. However, I want to save it more frequently than once per 60,000-step epoch, because each of those takes about 2 hours on Colab.
As I understand it, setting steps_per_epoch will give me smaller epochs, say 10,000 steps. What is not clear to me from the documentation is whether this will still cycle through my whole 60k batches, or whether it will stop at 10k and just repeat that 10k, i.e. does a new epoch start from where the last one left off when using steps_per_epoch?
Thanks, Julian
While I don't know about that option specifically, it wouldn't reuse old data, because datasets are only meant to be processed forwards. If it repeated the data, it would have to store a copy of everything it had already processed somewhere, since you can't reset a generator. That wouldn't be practical for a large dataset.
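A hedged alternative to shrinking the epoch: tf.keras.callbacks.ModelCheckpoint accepts save_freq as a number of batches, so you can keep the full 60,000-step epoch and still checkpoint every 10,000 batches. The path below is a placeholder, and the single file is simply overwritten on each save.

import tensorflow as tf

checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath="checkpoints/latest.h5",  # placeholder path; overwritten on every save
    save_weights_only=True,
    save_freq=10000,                   # save every 10,000 training batches instead of every epoch
)

# model.fit(train_data, epochs=..., callbacks=[checkpoint_cb])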

Training Estimators less than one epoch using dataset API?

I am trying to train a model on a large dataset. I would like to run the evaluation step multiple times before one epoch of training has been completed. Looking at the implementation of the Dataset API with Estimators, it looks like every time I restart training after the evaluation step, the Estimator creates a fresh dataset from scratch, so training never gets through the full data.
I have written an input function very similar to the one provided on the TensorFlow website.
def train_input_fn(features, labels, batch_size):
    """An input function for training."""
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    # Repeat once and batch the examples.
    dataset = dataset.repeat(1).batch(batch_size)
    # Return the read end of the pipeline.
    return dataset
I then use tf.estimator.Estimator.train to call my input function. I call the above input function like this:
classifier.train(input_fn=lambda: train_input_fn(features, labels, batch_size),
                 steps=n_steps)
where n_steps is a number smaller than the total number of steps needed to complete one epoch.
I then call an evaluation function like this.
classifier.evaluate(input_fn=lambda: eval_input_fn())
I want to run both steps in a loop.
Every time the loop reaches the training step, it re-initializes the dataset in train_input_fn, so training is only ever applied to the first n_steps of the training data.
If you want to evaluate multiple times during training, you can check out InMemoryEvaluatorHook.
You can refer to this discussion about train_and_evaluate and InMemoryEvaluatorHook for more details on how to use them.
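A minimal sketch of the hook-based approach, assuming TF 1.13+ (where the hook lives under tf.estimator.experimental); the toy classifier and data below are placeholders standing in for the question's estimator and input functions.

import numpy as np
import tensorflow as tf

features = {"x": np.random.rand(1000, 4).astype(np.float32)}
labels = np.random.randint(0, 2, size=(1000,))

def train_input_fn():
    ds = tf.data.Dataset.from_tensor_slices((features, labels))
    return ds.shuffle(1000).repeat().batch(32)

def eval_input_fn():
    ds = tf.data.Dataset.from_tensor_slices((features, labels))
    return ds.batch(32)

classifier = tf.estimator.DNNClassifier(
    feature_columns=[tf.feature_column.numeric_column("x", shape=(4,))],
    hidden_units=[16],
)

# Evaluate every 500 training steps without interrupting (and re-creating)
# the training input pipeline.
evaluator = tf.estimator.experimental.InMemoryEvaluatorHook(
    classifier, eval_input_fn, every_n_iter=500)

# One long train() call replaces the train/evaluate loop, so the training
# dataset is built once and consumed continuously.
classifier.train(input_fn=train_input_fn, steps=5000, hooks=[evaluator])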

What is fit_generator doing with a sub-sample of training and validation data?

I'm using a data generator with fit_generator in keras (for both training and validation data).
I was getting unexpected results so I instrumented the generator to output the batch index and count the number of steps since the last epoch. I have added ['acc'] to the model metrics.
When fit_generator runs I see it do several things:
1. It queues up the validation data (but I'm guessing it doesn't evaluate it yet).
2. It iterates through all the training data and calls on_epoch_end().
3. It calls another 10 steps of training data. I assume this must be coming from a callback. What is it doing?
4. It completes iterating through the validation data and calls on_epoch_end().
5. It calls another 10 steps of validation data. Again, what is it doing?
6. fit_generator prints the train/validation loss and accuracy and returns.
on_epoch_end() is never called after the 10 extra steps at 3 and 5. This is probably a bug, since we need the generators to be reset before the next epoch.
I'm mainly interested in understanding what is going on at 3 and 5: why are the generators called, and why for only ten steps?
Versions:
print(keras.__version__)
2.2.2
print(tf.__version__)
1.9.0
Per the comments from Matias: the additional batches correspond to pre-queued batches for the next epoch. They are dropped on the floor when fit_generator returns. It's up to the user to reset the generators before calling fit_generator again.
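For reference, the ~10 extra batches line up with fit_generator's prefetch queue: in Keras 2.2.x its max_queue_size argument defaults to 10, so a background enqueuer keeps pulling batches ahead of training. A rough sketch (toy model and counting generators are placeholders) that shrinks the queue, which makes the effect easier to see in the generator's own logging:

import numpy as np
import keras  # standalone Keras 2.2.x as in the question; tf.keras behaves the same way

def counting_generator(name, batch_size=8):
    # Logs every batch it yields, so prefetched batches show up in the output.
    i = 0
    while True:
        print("{} generator yielding batch {}".format(name, i))
        i += 1
        yield np.random.rand(batch_size, 4), np.random.randint(0, 2, size=(batch_size,))

model = keras.Sequential([keras.layers.Dense(1, activation="sigmoid", input_shape=(4,))])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])

model.fit_generator(
    counting_generator("train"),
    steps_per_epoch=20,
    validation_data=counting_generator("val"),
    validation_steps=5,
    epochs=1,
    max_queue_size=1,   # prefetch at most 1 batch instead of the default 10
    workers=1,
)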

Training from CSV file - use every train example per epoch

I have a CSV file with 200,000 training samples that I would like to train my network with.
I'm using an InputProducer and DecodeCSV to get the data. I then run all the data through shuffle_batch, where I set batch_size=50, min_after_dequeue=10000 and capacity=min_after_dequeue + 3 * batch_size.
I then run a loop and call sess.run() repeatedly.
The question I have is that I now want to run this for several epochs, and in each epoch I would like to exhaust the entire training set. I don't think the current setup does this. How would I go about doing that?
I'm not even sure I have fully understood the inner workings of shuffle_batch and its parameters yet.
Thank you in advance.
The queue should block at the end of the epoch. When that happens, you will know that you have exhausted the training set. More information in this related question: Tensor Flow shuffle_batch() blocks at end of epoch
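A hedged sketch of how that epoch boundary shows up with the TF 1.x queue runners: give the input producer num_epochs, and the queue raises OutOfRangeError once every example has been served that many times. The file name and CSV column layout below are placeholders.

import tensorflow as tf  # TF 1.x queue-based API, matching the question's setup

batch_size = 50
min_after_dequeue = 10000
capacity = min_after_dequeue + 3 * batch_size

# One epoch over the file; OutOfRangeError marks the point where it is exhausted.
filename_queue = tf.train.string_input_producer(["train.csv"], num_epochs=1)
reader = tf.TextLineReader()
_, value = reader.read(filename_queue)
feature, label = tf.decode_csv(value, record_defaults=[[0.0], [0]])

features_batch, labels_batch = tf.train.shuffle_batch(
    [feature, label], batch_size=batch_size,
    capacity=capacity, min_after_dequeue=min_after_dequeue)

with tf.Session() as sess:
    # num_epochs is tracked in a local variable, so both initializers are needed.
    sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while True:
            sess.run([features_batch, labels_batch])  # one training step would go here
    except tf.errors.OutOfRangeError:
        print("Epoch finished: every training example has been seen once.")
    finally:
        coord.request_stop()
        coord.join(threads)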