How to control how frequently data is recorded in TensorBoard? - tensorflow

So, I created a TensorBoard callback, but I'm training for thousands of epochs, and when I view TensorBoard it is too sluggish because of the sheer amount of data to be loaded and plotted (basically millions of data points). That's because it is writing everything at the batch level. I only want results at the end of each epoch, not each batch. Can I control that?
Additionally, by default it records loss, validation loss, and plenty of other things that I didn't ask for. How can I control what is recorded?

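Try update_freq=10000.
This makes it update every 10000 samples (that is how the standalone Keras code linked below counts; as far as I know, tf.keras 2.x counts an integer update_freq in batches instead).
Or use update_freq='epoch', which updates after the end of each epoch.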
https://github.com/keras-team/keras/blob/master/keras/callbacks.py#L997
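A minimal sketch of what this might look like with tf.keras, assuming TensorFlow 2.x is installed. The log directory and the helper name make_tensorboard_callback are made up for this example. Besides update_freq, it also switches off histograms and the graph dump, which are two of the extras the callback can write:

```python
# Keyword arguments for tf.keras.callbacks.TensorBoard.
# "logs/run1" is a placeholder directory chosen for this example.
TENSORBOARD_KWARGS = {
    "log_dir": "logs/run1",
    "update_freq": "epoch",  # write scalars once per epoch, not per batch
    "histogram_freq": 0,     # do not compute weight histograms
    "write_graph": False,    # do not dump the model graph
}


def make_tensorboard_callback():
    """Build the callback; TensorFlow is imported only when this is called."""
    import tensorflow as tf

    return tf.keras.callbacks.TensorBoard(**TENSORBOARD_KWARGS)


# Usage (model, x and y are assumed to exist already):
# model.fit(x, y, epochs=1000, callbacks=[make_tensorboard_callback()])
```

The scalars themselves (loss, validation loss, metrics) are whatever Keras reports during fit, so as far as I know they are trimmed by changing the metrics passed to compile or by not passing validation data, rather than through this callback.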

Related

Saving model weights after each forward pass

I am doing research in explainable AI by looking at the patterns in weights as a function of model hyperparameters and input data. One of the things I'm examining is how weights progress from randomness (or initializer starting values) to stabilization, after learning completes. I'd like to, instead of saving weights every epoch, save them at every other or third forward pass. How would I do that? Must I tweak the 'period' argument to the Keras model checkpoint method? If so, what's an easy formula to set that argument? Thx and have a great day.
Just pass save_freq=3 when instantiating tf.keras.callbacks.ModelCheckpoint.
Quoting the docs:
https://keras.io/api/callbacks/model_checkpoint/
save_freq: 'epoch' or integer. When using 'epoch', the callback saves the model after each epoch. When using integer, the callback saves the model at end of this many batches. If the Model is compiled with steps_per_execution=N, then the saving criteria will be checked every Nth batch. Note that if the saving isn't aligned to epochs, the monitored metric may potentially be less reliable (it could reflect as little as 1 batch, since the metrics get reset every epoch). Defaults to 'epoch'.
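A rough sketch of such a callback, assuming a recent TensorFlow 2.x or Keras install. The file path and the helper name make_checkpoint_callback are hypothetical, and the {batch} placeholder in the file name is an assumption about recent versions; without it, each save would overwrite the previous file:

```python
# Keyword arguments for tf.keras.callbacks.ModelCheckpoint.
CHECKPOINT_KWARGS = {
    # Placeholder path; {epoch} and {batch} are filled in at save time.
    "filepath": "checkpoints/weights.e{epoch:03d}-b{batch:06d}.weights.h5",
    "save_weights_only": True,  # store the weights only, not the full model
    "save_freq": 3,             # save at the end of every 3rd batch
}


def make_checkpoint_callback():
    """Build the callback; TensorFlow is imported only when this is called."""
    import tensorflow as tf

    return tf.keras.callbacks.ModelCheckpoint(**CHECKPOINT_KWARGS)


# How the file name would be filled in for epoch 1, batch 3:
example_name = CHECKPOINT_KWARGS["filepath"].format(epoch=1, batch=3)
print(example_name)  # checkpoints/weights.e001-b000003.weights.h5

# Usage (model, x and y are assumed to exist already):
# model.fit(x, y, epochs=5, callbacks=[make_checkpoint_callback()])
```

Saving this often produces a lot of files, so it is probably worth checking disk space before a long run.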

TensorFlow: Fit gives great val_acc, but evaluate gives low acc

I have some data (reuters).
I combine the training and test data into one large data set.
I shuffle the entire data set.
I then split the data set in half: 50% for training and 50% for testing.
I then setup a sequential model (Embedding, GRU, BatchNormalization, Dense).
I compile with adam optimizer, and loss = sparse_categorical_crossentropy
I run fit on it, for 5 epochs, with validation_split = 0.2
I get a val_acc of 95%. I am very happy with this.
I then call evaluate with the test data, and get acc = 71%
When I repeat evaluate with the train data (instead of test data), just to see what happens, I get 95% acc (as I should).
I am trying to understand what is wrong with my acc score with the test data.
The data is shuffled between train and test each time, so it can not be related to the data.
I tried the checkpoint save/restore trick, that did not seem to help.
I have reduced epochs in case of overfitting, but this has not helped.
I am curious what is going on. The only thing I can think of is there is some sort of overfitting going on that is carrying over to test, but is NOT impacting val_acc.
(Side note: the 20% data saved for validation during fit should not cause an overfitting problem, correct?)
Any ideas what could be wrong with my approach?
Thank you.
Edit 1:
Okay, I might be onto something. I noticed something odd about the test data that is included with the reuters dataset, vs the training data.
In previous experiments, results were always poor evaluating against test set, vs an unused portion of the training data.
This time, I combined the training and test data sets into one set, and shuffled.
Then shuffled some more, and then generated pseudo data around it, and shuffled some more.
Then I peeled off 20% to use as test data.
This time it worked. I got 95% for training, validation and evaluation accuracy.
I tried searching for information on the test set to see if anyone came up with anything but found no results. So I'm going to presently chalk it up to test data that is significantly different from the training set.
Edit 2:
Never mind edit 1. I think I was corrupting my test data with a pseudo-generated version of the training data that was close enough to work.
The only conclusion I can draw is that there is a lot of overfitting going on during my training AND using validation data during training is misleading, as it is also being overfit.
(Why validation data is being used to help overfit, I do not know).
Overfitting can occur for many reasons. To counter it, you can try methods such as:
L1/L2 regularization.
Dropout layers: By using dropout layers in the network, we ignore a subset of units of our network with a set probability. Using dropout, we can reduce interdependent learning among units, which may have led to overfitting.
Early stopping: Monitors the performance of the model for every epoch on a held-out validation set during the training, and terminates the training conditional on the validation performance.
Feature selection: Improves the machine learning process and increases the predictive power of machine learning algorithms by selecting the most important variables and eliminating unnecessary and irrelevant features.
For more details refer to this link.
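To make the early stopping item concrete, here is a plain-Python sketch of the patience logic. It is only an illustration: the function name, the patience and min_delta parameters, and the loss values are invented for the example, and it is not the code Keras uses (Keras ships its own tf.keras.callbacks.EarlyStopping).

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: they improve for three epochs, then get worse.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.61]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch)  # 4
```

Here training stops at epoch 4, because epochs 3 and 4 both failed to beat the best loss seen at epoch 2.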

Is it a good idea to mix the validation / testing data with the training data?

I am working with a large dataset (e.g. large for a single machine) - with 1,000,000 examples.
I split my dataset as follows: 80% training data, 10% validation data, 10% testing data. Every time I retrain the model, I shuffle the data first, such that some of the data from the validation / testing set ends up in the training set and vice versa.
My thinking is this:
Ideally I would want all possible available data for the model to learn. The more the better - for improved accuracy.
Even though 20% of the data is dedicated to validation and testing, that is still 100,000 examples per piece - (i.e. I may potentially miss out on some crucial data that exists within the validation or testing set that the previous training set may not have accounted for.)
Shuffling prevents the training set from learning order where it is not important (at least in my particular dataset).
Here is my workflow process:
The Test Accuracy is more or less the equivalent to the Validation Accuracy (plus or minus 0.5%)
On each retrain, the results usually end up something like this: the training accuracy keeps improving (until it runs out of epochs), but the validation accuracy gets stuck at a particular percentage. I then save that model and start the retraining process again. The data is shuffled. The training accuracy drops, but the validation accuracy jumps up. The training accuracy improves until the last epoch. The validation accuracy converges downward (but is still greater than in the previous run).
See Example:
I plan on doing this until the training accuracy data reaches 99%. (Note: I used Keras-Tuner to find the best architecture/model for my particular problem)
I can't help but think, that I am doing something wrong by doing this. From my perspective, this is just the model eventually learning all 1,000,000 examples. It feels like "mild overfitting" because of the shuffling per each retrain.
Is it a good idea to mix the validation / testing data with the training data?
Am I wrong by doing it this way? If so, why should I not do this method? Is there a better way to approach this?
If you mix your test/validation data with training data, you then can not evaluate your model on that data, since that data has been seen by your model. The model evaluation is done on the basis of how well it is able to make predictions/classification on data which your model has not seen (assuming that the data you are using to evaluate your model is coming from the same distribution as your training data). If you also mix your test set data with training set data, you will eventually end up with really good test set accuracy since that data has been seen by your model, but it might not perform well on new unseen data coming from the same distribution.
If you are worried about the size of your test/validation data, I suggest you reduce the size of the test/validation sets further, for example by training on 99.9% of the data instead of 99%. Also, random shuffling will take care of the training set covering almost every feature of your data.
After all, my point is: never evaluate your model on data it has seen before. That will always give you better results (assuming you have trained your model until it memorizes the training data). The validation data is used when you have multiple algorithms/models and need to select one of them. The algorithm/model that gives good results on the validation data is selected (again, you do not evaluate your model based on validation set accuracy, it is only used for model selection). Once you have selected your model based on validation set accuracy, you evaluate it on new unseen data (called test data) and report the prediction/classification accuracy on the test data as your model accuracy.
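One way to follow this advice while still shuffling between retrains is to fix the split once and reshuffle only the training part. The sketch below uses plain Python on index lists; the function name fixed_split, the seed and the 1,000-example size are placeholders for illustration.

```python
import random


def fixed_split(n_examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Split indices once, with a fixed seed, into train/val/test lists."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_examples * test_frac)
    n_val = int(n_examples * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = fixed_split(1000)

# Before each retraining run, reshuffle ONLY the training indices.
for run in range(3):
    random.Random(run).shuffle(train_idx)
    # The held-out sets never leak into the training set.
    assert not set(train_idx) & set(test_idx)
    assert not set(train_idx) & set(val_idx)

print(len(train_idx), len(val_idx), len(test_idx))  # 800 100 100
```

Because the seed is fixed, calling fixed_split again in a later session gives the same test set, so its accuracy stays an honest estimate across retrains.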

Does steps_per_epoch use up the whole dataset

I have a large training dataset created by a generator, about 60,000 batches (size 32). Due to the time required for training, I need to use a callback to save the model periodically. However, I want to save it more frequently than once per epoch of 60,000, because that takes about 2 hours on Colab.
As I understand it, setting steps_per_epoch will give me smaller epochs, say 10,000 batches. What is not clear to me from the documentation is whether this will still cycle through my whole 60k batches, or stop at 10k and just repeat those 10k. In other words, does a new epoch start from where the last one left off when using steps_per_epoch?
Thanks, Julian
While I don't know about that option specifically, it wouldn't reuse old data because datasets are only meant to be processed forwards. If it repeated the data, it would have to store a copy of everything it's already processed somewhere since you can't reset a generator. That wouldn't be practical on a large dataset.
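The point about generators only moving forwards can be checked in plain Python. The sketch below is not Keras code: run_epochs is a hypothetical stand-in for a training loop that pulls steps_per_epoch batches per epoch from one generator, and the numbers are scaled down from 60,000/10,000 to 60/10.

```python
from itertools import islice


def batch_generator(n_batches):
    """Yield batch numbers 0..n_batches-1, standing in for real batches."""
    for i in range(n_batches):
        yield i


def run_epochs(gen, steps_per_epoch, epochs):
    """Pull steps_per_epoch items per epoch from the SAME generator object."""
    seen = []
    for _ in range(epochs):
        seen.append(list(islice(gen, steps_per_epoch)))
    return seen


epochs_seen = run_epochs(batch_generator(60), steps_per_epoch=10, epochs=3)
print([(e[0], e[-1]) for e in epochs_seen])  # [(0, 9), (10, 19), (20, 29)]
```

Each "epoch" picks up where the previous one stopped. The flip side is that the generator has to supply at least steps_per_epoch * epochs batches in total (or loop forever), otherwise it runs dry partway through training.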

Why and when do I need to use the global step in tensorflow

I am using tensorflow, but I am not sure why I even need the global_step variable or if it is even necessary for training. I have something like this:
gradients_and_vars = optimizer.compute_gradients(value)
train_op = optimizer.apply_gradients(gradients_and_vars)
and then in my loop inside a session I do this:
_ = sess.run([train_op])
I am using a Queue to feed my data to the graph. Do I even have to instantiate a global_step variable?
My loop looks like this:
while not coord.should_stop():
So this loop stops, when it should stop. So why do I need the global_step at all?
You don't need the global step in all cases. But sometimes people want to stop training, tweak some code and then continue training with the saved and restored model. Then often it is nice to know how long (=for how many time steps) this model had been trained so far. Thus the global step.
Also, sometimes your learning rate schedule depends on how long the model has already been trained. Say you want to decay your learning rate every 100,000 steps. If you interrupted training in between and didn't keep track of the number of steps already taken, this is difficult.
And furthermore if you are using tensorboard the global step is the central parameter for your x-axis of the charts.
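A small plain-Python sketch of the learning rate point, under made-up numbers (base rate 0.1, halved every 100,000 steps). The names decayed_learning_rate and run_steps are hypothetical, and the dict copy only stands in for saving and restoring a checkpoint. In TF1 itself, the step counter is typically a variable passed as the global_step argument of optimizer.apply_gradients, which increments it on every update.

```python
def decayed_learning_rate(global_step, base_lr=0.1, decay_rate=0.5,
                          decay_steps=100_000):
    """Staircase decay: multiply by decay_rate once per decay_steps steps."""
    return base_lr * decay_rate ** (global_step // decay_steps)


def run_steps(state, n_steps):
    """Stand-in for a training loop; it only advances the step counter."""
    for _ in range(n_steps):
        # A real loop would run the training op here with
        # decayed_learning_rate(state["global_step"]).
        state["global_step"] += 1


state = {"global_step": 0}
run_steps(state, 150_000)     # first training session
saved = dict(state)           # stands in for saving a checkpoint

restored = dict(saved)        # stands in for restoring it later
run_steps(restored, 100_000)  # second session continues the count

print(restored["global_step"])  # 250000
print(decayed_learning_rate(restored["global_step"]))
```

Without the saved counter, the second session would start again from step 0 and use the undecayed learning rate.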