Does steps_per_epoch use up the whole dataset - tensorflow

I have a large training dataset created by a generator, about 60,000 batches (of size 32). Due to the time required for training, I need to use a callback to save the model periodically. However, I want to save it more frequently than once per full epoch of 60,000 batches, because a full epoch takes about 2 hours on Colab.
As I understand it, setting steps_per_epoch will give me smaller epochs of, say, 10,000 batches. What is not clear to me from the documentation is whether this will still cycle through my whole 60k batches, or whether it will stop at 10k and just repeat that same 10k. In other words, does a new epoch start from where the last one left off when using steps_per_epoch?
Thanks, Julian

While I don't know about that option specifically, it shouldn't reuse old data, because generator-based datasets are only meant to be consumed forwards. To repeat the same 10k batches, Keras would have to store a copy of everything it had already drawn from the generator, since a Python generator cannot be reset, and that wouldn't be practical for a large dataset. So each new epoch should pick up where the previous one left off.
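As a minimal sketch of what that setup might look like, assuming a tf.keras setup where fit accepts a Python generator and ModelCheckpoint's save_freq is given as a number of batches (the generator, model, and file name here are placeholders, not the asker's actual code):

import tensorflow as tf

def batch_generator():
    # placeholder generator yielding (features, labels) batches of size 32
    while True:
        x = tf.random.normal((32, 10))
        y = tf.random.uniform((32,), maxval=2, dtype=tf.int32)
        yield x, y

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Save every 10,000 batches instead of once per (shorter) epoch.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "ckpt_{epoch:02d}.h5", save_freq=10000)

# steps_per_epoch=10000 turns the 60k-batch pass into six short "epochs";
# the generator is consumed continuously, so (per the answer above) each
# epoch continues from where the previous one stopped.
model.fit(batch_generator(), steps_per_epoch=10000, epochs=6,
          callbacks=[checkpoint])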

Related

When is .repeat used when loading a Tensorflow dataset for training?

I have seen tutorials which use .repeat() while doing the loading, shuffling, mapping, batching, prefetching etc. for a TensorFlow dataset, while others skip it completely.
I know what repeat does and how it is used, but am not able to figure out when it is used and when it is not.
Any help?
It depends. Let's use MNIST as an example. Say we build a dataset using from_tensor_slices. The training dataset has 60000 samples.
Let's say we use batch size 100 and do not use repeat. This means the dataset will provide 600 batches. Now, if we try to train a model, for example using the keras fit interface, the dataset will simply run out of samples after 600 steps! We will not be able to train more than that. Using repeat, the dataset will instead simply "start fresh" once it runs out, and we can train as long as we like.
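As a rough sketch of that fit case (the MNIST numbers follow the example above; the model itself is just a placeholder):

import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

# Without .repeat() this dataset yields exactly 600 batches of 100 and then stops.
ds = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
      .shuffle(60000)
      .batch(100)
      .repeat())            # "start fresh" whenever the 600 batches are exhausted

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# With repeat(), steps_per_epoch tells fit where one "epoch" ends;
# without it, training would stop after the first 600 steps.
model.fit(ds, steps_per_epoch=600, epochs=5)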
Other tutorials might use a manual training loop. Perhaps you have a loop like
for batch in data_set:
    ...  # process one batch
In this example, once again, the loop will simply stop after 600 batches if we do not use repeat. However, we can do this:
for epoch in range(n_epochs):
    for batch in data_set:
        ...  # process one batch
In this example, we specify the number of passes over the dataset in n_epochs. The inner loop stops after 600 batches, but then the outer loop (epoch) simply increments by 1, and the inner loop starts again. This way, we can have more than 600 batches even without using repeat.
Finally, there are of course other ways to create datasets. For example, from_generator can be used to stream a dataset from a Python generator that can run infinitely long, so repeat is not necessary at all.
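For completeness, a small sketch of that last case, assuming a TF version where from_generator accepts output_signature (the generator and shapes are made up purely for illustration):

import numpy as np
import tensorflow as tf

def infinite_samples():
    # runs forever, so the dataset never "runs out" and repeat() is unnecessary
    while True:
        yield np.random.rand(784).astype("float32"), 0

ds = tf.data.Dataset.from_generator(
    infinite_samples,
    output_signature=(
        tf.TensorSpec(shape=(784,), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
).batch(100)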
Without having seen the tutorials you are referring to, I can only guess that the differences regarding use of repeat can be explained by differences in how the training loop is coded, such as the above.

Is it a good idea to mix the validation / testing data with the training data?

I am working with a dataset that is large for a single machine: about 1,000,000 examples.
I split my dataset as follows: 80% training data, 10% validation data, 10% testing data. Every time I retrain the model, I shuffle the full dataset first, so that some of the data from the validation/testing sets ends up in the training set and vice versa.
My thinking is this:
Ideally I would want all possible available data for the model to learn. The more the better - for improved accuracy.
Even though 20% of the data is dedicated to validation and testing, that is still 100,000 examples per set, i.e. I may be missing out on some crucial data that exists only in the validation or testing set and that the training set has not accounted for.
Shuffling prevents the training set from learning order where it is not important (at least in my particular dataset).
Here is my workflow process:
The test accuracy is more or less equivalent to the validation accuracy (plus or minus 0.5%).
On each retrain, the results usually end up like this: the training accuracy keeps improving (until it reaches the total number of epochs), but the validation accuracy gets stuck at a particular percentage. I then save that model and start the retraining process again; the data is reshuffled. The training accuracy drops but the validation accuracy jumps up; the training accuracy then improves until the final epoch, while the validation accuracy converges downward (though still higher than in the previous run).
I plan on doing this until the training accuracy reaches 99%. (Note: I used Keras Tuner to find the best architecture/model for my particular problem.)
I can't help but think that I am doing something wrong here. From my perspective, this is just the model eventually learning all 1,000,000 examples, and it feels like "mild overfitting" caused by the reshuffling on each retrain.
Is it a good idea to mix the validation / testing data with the training data?
Am I wrong by doing it this way? If so, why should I not do this method? Is there a better way to approach this?
If you mix your test/validation data with your training data, you can no longer evaluate your model on that data, since it has been seen by the model. Model evaluation is based on how well the model makes predictions/classifications on data it has not seen (assuming the evaluation data comes from the same distribution as your training data). If you mix your test set into the training set, you will eventually end up with a really good test set accuracy, because the model has seen that data, but it might not perform well on new, unseen data from the same distribution.
If you are worried about the size of your test/validation data, I suggest you simply reduce it further and give the extra examples to the training set (train on 99% or even 99.9% of the data) rather than mixing the sets. Also, random shuffling within the training set will take care of exposing the model to almost every feature of your data.
After all, my point is: never evaluate your model on data it has seen before. That will always give you better-looking results (assuming you have trained your model well enough that it memorizes the training data). The validation data is used when you have multiple algorithms/models and you need to select one of them: the model which gives good results on the validation data is selected (again, you do not report validation accuracy as your model's accuracy; it is only used for model selection). Once you have selected your model based on validation-set accuracy, you evaluate it on new, unseen data (the test data) and report the prediction/classification accuracy on the test data as your model's accuracy.
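As a concrete sketch of the usual alternative, hold the split fixed once and only shuffle inside the training data (scikit-learn's train_test_split is used here purely for illustration; x and y are placeholders for your actual arrays):

import numpy as np
from sklearn.model_selection import train_test_split

# placeholders standing in for the real 1,000,000-example dataset
x = np.random.rand(1_000_000, 20).astype("float32")
y = np.random.randint(0, 2, 1_000_000)

# Split once, with a fixed seed, and never move examples across these sets again.
x_trainval, x_test, y_trainval, y_test = train_test_split(
    x, y, test_size=0.10, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(
    x_trainval, y_trainval, test_size=0.111, random_state=42)  # ~10% of the total

# Shuffling of training order happens only inside the training set, e.g.
# model.fit(x_train, y_train, validation_data=(x_val, y_val), shuffle=True)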

How to control how frequent data is being recorded in Tensorboard?

So, I created a TensorBoard callback, but I'm training for thousands of epochs, and when I view TensorBoard it is too sluggish because of the sheer amount of data to be loaded and plotted, basically millions of data points. That's because it is writing everything at the batch level. I only want results at the end of each epoch, not each batch. Can I control that?
Additionally, it is by default recording loss, validation loss, and plenty of other things that I didn't ask for. How can I control what is being recorded?
Try update_freq=10000.
This will make it write every 10,000 samples.
Or use update_freq='epoch', which will only write once the epoch ends.
https://github.com/keras-team/keras/blob/master/keras/callbacks.py#L997
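A minimal sketch of how that might look with the tf.keras TensorBoard callback (the log directory name is arbitrary, and the callback still has to be passed to model.fit):

import tensorflow as tf

# Write scalar summaries (loss, metrics) once per epoch instead of per batch.
tensorboard_cb = tf.keras.callbacks.TensorBoard(
    log_dir="logs/run1",     # arbitrary log directory
    update_freq="epoch",     # or an integer to write more often
    histogram_freq=0,        # disable weight/activation histograms
    write_graph=False)       # skip writing the graph to keep logs small

# then: model.fit(..., callbacks=[tensorboard_cb])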

TPU terminology confusion

So I know how epochs, train steps, batch sizes and this kind of stuff are defined, but it is really hard for me to get my head wrapped around the TPU terminology like train loops, iterations per loop and so on. I read this but I'm still confused.
Also, how can I benchmark the time for iterations per loop, for example?
Any explanation would help me a lot there. Thanks!
As the other answers have described, iterations_per_loop is a tuning parameter that controls the amount of work done by the TPU before checking in with it again. A lower number lets you inspect results (and benchmark them) more often, and a higher number reduces the overhead due to synchronization.
This is no different from familiar network or file buffering techniques; changing its value affects performance, but not your final result. In contrast, ML hyperparameters like num_epochs, train_steps, or train_batch_size will change your result.
EDIT: Adding an illustration in pseudocode, below. Notionally, the training loop functions like this:
def process_on_TPU(examples, train_batch_size, iterations_per_loop):
    # The TPU will run `iterations_per_loop` training iterations before returning to the host
    for i in range(0, iterations_per_loop):
        # on every iteration, the TPU will compute `train_batch_size` examples,
        # calculating the gradient from every example in the given batch
        compute(examples[i * train_batch_size : (i + 1) * train_batch_size])

# assume each entry in `examples` is a single training example
for b in range(0, train_steps, train_batch_size * iterations_per_loop):
    process_on_TPU(examples[b : b + train_batch_size * iterations_per_loop],
                   train_batch_size,
                   iterations_per_loop)
From this, it might appear that train_batch_size and iterations_per_loop are simply two different ways of accomplishing the same thing. However, this is not the case; train_batch_size affects the learning rate, since (at least in ResNet-50) the gradient is computed at each iteration from the average of the gradient of every example in the batch. Taking 50 steps per 50k examples will produce a different result from taking 1k steps per 50k examples, since the latter case calculates the gradient much more often.
EDIT 2: Below is a way to visualize what's happening, with a racing metaphor. Think of the TPU as running a race that has a distance of train_steps examples, and its stride lets it cover a batch of examples per step. The race is on a track, which is shorter than the total race distance; the length of the lap is your total number of training examples, and every lap around the track is one epoch. You can think of iterations_per_loop as being the point where the TPU can stop at a "water station" of sorts where the training is temporarily paused for a variety of tasks (benchmarking, checkpointing, other housekeeping).
By "train loop", I'm assuming it's the same meaning as "training loop". The training loop is the one that iterates through each epoch in order to feed the model.
The iterations per loop is related to how Cloud TPU handles the training loop. In order to amortize the TPU launch cost, the model training step is wrapped in a tf.while_loop, such that one Session run actually runs many iterations for a single training loop.
Because of this, Cloud TPU runs a specified number of iterations of the training loop before returning to the host. Therefore, iterations_per_loop is how many iterations will run for one session.run call.
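As a rough sketch of where this parameter is usually set, assuming the TF 1.x TPUEstimator API (tf.contrib.tpu); my_model_fn, my_input_fn, and the GCS path are placeholders:

import tensorflow as tf

tpu_config = tf.contrib.tpu.TPUConfig(
    iterations_per_loop=100,   # TPU runs 100 training steps per host round-trip
    num_shards=8)              # number of TPU cores

run_config = tf.contrib.tpu.RunConfig(
    model_dir="gs://my-bucket/model",   # placeholder path
    save_checkpoints_steps=1000,
    tpu_config=tpu_config)

estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=my_model_fn,      # placeholder model function
    config=run_config,
    use_tpu=True,
    train_batch_size=1024)

estimator.train(input_fn=my_input_fn, max_steps=100000)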
TPU literally means "Tensor Processing Unit", it's a hardware device used for computation in exactly the same way a GPU is used. The TPUs are effectively Google proprietary GPUs. There are technical differences under the hood of a GPU vs a TPU, mostly regarding speed and power consumption, and some issues of floating point precision, but you don't need to care about the details.
iterations_per_loop appears to be an effort to improve efficiency by loading the TPU with multiple training batches. There are often hardware bandwidth limitations when transferring large amounts of data from main memory to a GPU/TPU.
It appears that the code you reference is passing iterations_per_loop number of training batches to the TPU, then running iterations_per_loop number of training steps before pausing to do another data transfer from main memory to TPU memory.
I'm rather surprised to see that though, I would expect that asynchronous background data transfers would be possible by now.
My only disclaimer is that, while I'm proficient with Tensorflow, and have watched TPU evolution in papers and articles, I'm not directly experienced with the Google API or running on TPUs, so I'm inferring from what I read in the documentation you linked to.

batch size in model.fit and model.predict

In Keras, both model.fit and model.predict have a batch_size parameter. My understanding is that the batch size in model.fit is related to batch optimization; what is the physical meaning of batch_size in model.predict? Does it need to be equal to the one used by model.fit?
No, it doesn't. Imagine that inside your model there is a function which significantly increases the amount of memory required. You might therefore run into resource errors if you try to predict on all your data in one go; this is often the case when you predict on a GPU with limited memory. So instead you choose to predict on small batches at a time. The batch_size parameter in the predict function will not alter your results in any way, so you can choose any batch_size you want for prediction.
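A small sketch illustrating the point (the model and data are placeholders; only the batch_size arguments matter here):

import numpy as np
import tensorflow as tf

x = np.random.rand(10000, 20).astype("float32")
y = np.random.randint(0, 2, 10000)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

model.fit(x, y, batch_size=32, epochs=1)

# Any prediction batch size gives the same outputs (up to floating-point noise);
# a smaller value just trades speed for lower peak memory use.
preds_small = model.predict(x, batch_size=64)
preds_large = model.predict(x, batch_size=1024)
assert np.allclose(preds_small, preds_large, atol=1e-5)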
It depends on your model and whether the batch size when training must match the batch size when predicting. For example, if you're using a stateful LSTM then the batch size matters because the entire sequence of data is spread across multiple batches, i.e. it's one long sequence that transcends the batches. In that case the batch size used to predict should match the batch size when training because it's important they match in order to define the whole length of the sequence. In stateless LSTM, or regular feed-forward perceptron models the batch size doesn't need to match, and you actually don't need to specify it for predict().
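For the stateful case, a rough sketch of why the two sizes are tied together (the shapes and sizes are arbitrary; the key point is that batch_input_shape bakes the batch size into the model):

import numpy as np
import tensorflow as tf

batch_size, timesteps, features = 16, 10, 8

# For a stateful LSTM the batch size is part of the model definition,
# because the per-sequence state is indexed by position within the batch.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, stateful=True,
                         batch_input_shape=(batch_size, timesteps, features)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

x = np.random.rand(batch_size * 4, timesteps, features).astype("float32")
y = np.random.rand(batch_size * 4, 1).astype("float32")
model.fit(x, y, batch_size=batch_size, epochs=1, shuffle=False)

# Prediction must also be fed in chunks of the same batch size.
preds = model.predict(x, batch_size=batch_size)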
Just to add: this chunked behaviour of predict() is different from predict_on_batch(), where you supply a single batch of input samples and get an equal number of prediction outputs. So if you create a batch of 100 samples and submit it to predict_on_batch(), you get 100 predictions, i.e. one per sample. This can have performance benefits over issuing samples one at a time to predict().
As said above, the batch size just controls how much data is fed in at one go (a batch). Increasing it raises the chance of running out of resources, assuming you are running on your personal computer. If you are running in the cloud with more resources, you should be fine. You can adjust the number as you want, but don't jump straight to a large value; I suggest increasing it gradually. Also, you may want to read this before you increase your batch size:
https://stats.stackexchange.com/questions/164876/tradeoff-batch-size-vs-number-of-iterations-to-train-a-neural-network