How is teacher-forcing implemented for the Transformer training? - tensorflow

In this part of the TensorFlow tutorial, they mention that they are training with teacher forcing. To my knowledge, teacher forcing involves feeding the target output into the model so that it converges faster. So how is this done here? The real target is tar_real, and as far as I can see, it is only used to calculate loss and accuracy. How does this code implement teacher forcing?
Thanks in advance.

Each train_step takes in inp and tar objects from the dataset in the training loop. Teacher forcing is indeed used, since the correct example from the dataset is always used as input during training (as opposed to the model's own, possibly incorrect, output from the previous step):
tar is split into tar_inp and tar_real (offset by one position)
inp and tar_inp are used as input to the model
the model produces an output, which is compared with tar_real to calculate the loss
the model output is discarded (not used anymore)
the loop repeats
Teacher forcing is a procedure ... in which during training the model receives the ground truth output y(t) as input at time t+1.
Page 372, Deep Learning, 2016.
Source: https://github.com/tensorflow/tensorflow/issues/30852#issuecomment-513528114
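The split described above can be sketched with plain NumPy. This is only an illustration, assuming the split is a one-position shift as the answer describes; the token ids and the start/end markers below are made up.

```python
import numpy as np

# Hypothetical batch of 2 target sequences with 5 token ids each
# (1 = start token, 2 = end token; all ids are invented for illustration).
tar = np.array([[1, 7, 8, 9, 2],
                [1, 4, 5, 6, 2]])

# Decoder input: every token except the last one.
tar_inp = tar[:, :-1]
# Expected output: the same sequence shifted left by one position.
tar_real = tar[:, 1:]

# At position t the decoder is fed the ground-truth token tar_inp[:, t]
# and is trained to predict tar_real[:, t], the next ground-truth token.
for t in range(tar_inp.shape[1]):
    print("input:", tar_inp[:, t], "-> target:", tar_real[:, t])
```

The model's own predictions never appear in tar_inp, which is what makes this teacher forcing: the input at step t+1 is always the ground truth for step t.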

Related

Difference between fit and evaluate in Keras

I have used 100000 samples to train a general model in Keras and achieved good performance. Then, for one particular sample, I want to use the trained weights as initialization and continue to optimize them to further reduce the loss on that sample.
However, a problem occurred. First, I load the trained weights with the Keras API, then I evaluate the loss on the one particular sample, and it is close to the validation loss seen during training. I think that is normal. However, when I use the trained weights as the initial values and further optimize them on the one sample with model.fit(), the loss is really strange. It is much higher than the evaluate result and only gradually becomes normal after several epochs.
I find it strange that, for the same sample and the same loaded weights, model.fit() and model.evaluate() return different results. I use batch normalization layers in my model and I wonder whether that may be the reason. The result of model.evaluate() seems normal, as it is close to what I saw on the validation set before.
So what causes the difference between fit and evaluate? How can I solve it?
I think your core issue is that you are observing two different loss values during fit and evaluate. This has been discussed extensively elsewhere.
The fit() function loss includes contributions from:
Regularizers: L1/L2 regularization loss is added during training, increasing the loss value.
Batch norm variations: during training, batch normalization collects the mean and variance of the current batch and uses those statistics to perform the normalization, irrespective of whether the batch norm layer is set to trainable or not.
Multiple batches: the training loss is averaged over multiple batches. So if you take the average over the first 100 batches and evaluate on the 100th batch only, the results will be different.
Whereas evaluate just does forward propagation and returns the loss value; nothing is random there.
The bottom line is that you should not compare train and validation loss (or fit and evaluate loss). Those functions do different things. Look at other metrics to check whether your model is training fine.
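The batch norm point can be illustrated with a small NumPy sketch. It is a simplified stand-in, not the Keras implementation: the learned scale and offset are omitted, and the moving statistics and activations are invented values.

```python
import numpy as np

def batch_norm(x, mean, var, eps=1e-3):
    # Simplified normalisation step (no learned scale/offset).
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)

# Assumed statistics accumulated during the original training run.
moving_mean, moving_var = 0.0, 1.0
# Hypothetical activations for a batch of 8 samples with 1 feature.
batch = rng.normal(loc=0.5, scale=2.0, size=(8, 1))

# Inference mode (evaluate/predict): use the stored moving statistics.
out_eval = batch_norm(batch, moving_mean, moving_var)
# Training mode (fit): use the statistics of the current batch.
out_train = batch_norm(batch, batch.mean(axis=0), batch.var(axis=0))
print("max difference:", np.abs(out_eval - out_train).max())

# Edge case from the question: a "batch" of a single sample.
single = batch[:1]
out_single_train = batch_norm(single, single.mean(axis=0), single.var(axis=0))
print("single-sample training output:", out_single_train)
```

With a single sample, the batch mean equals the sample and the batch variance is zero, so the normalised activation collapses to zero in training mode. That would be one plausible reason for a much higher loss at the start of fit() on one sample.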

Purpose of batch channel in tensorflow model on forward pass of 1 input

So far I have trained a couple different models in TensorFlow (with Keras) and I see that getting the batch_size right seems to be important not just for speed of training but also the resultant accuracy of the model.
What confuses me is a case where a model has an actual batch channel as the first dimension of the input (and of the output as well). If my batch size is 32 but I'm always feeding in a single sample at run time, where does the batch channel apply? How can I make use of the rest of it if I'm inherently only using 1/batch_size of it in the forward pass?
If you are curious the model I am researching, it is this one:
https://github.com/pierluigiferrari/ssd_keras/blob/master/models/keras_ssd300.py
see:
Output shape of predictions: (batch, n_boxes_total, n_classes + 4 + 8)
predictions = Concatenate(axis=2, name='predictions')([mbox_conf_softmax, mbox_loc, mbox_priorbox])
The tensors had run through numerous other layers that had constants and such pretrained with [batch_size] as well. To me it just seems like inputs at various batch index would have to yield different results. Maybe I just need something incredibly obvious pointed out to me.
It would seem that after training you must recompile the model with a batch size of 1, then transfer the weights from the training model to the new model for evaluation. The alternative is performing 'batch_size' count of predictions at once (which of course is not always feasible per application). If there are alternatives (or if I read wrong) please feel free to add an answer.
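The weight transfer mentioned above is possible because a layer's weights do not carry a batch dimension. A minimal NumPy sketch of a dense layer, with made-up shapes and random weights standing in for trained ones, suggests why:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "trained" weights: shape (n_inputs, n_outputs), no batch axis.
W = rng.normal(size=(4, 3))
b = rng.normal(size=(3,))

def dense(x):
    # x has shape (batch, n_inputs); each row is processed independently.
    return x @ W + b

batch = rng.normal(size=(32, 4))   # a batch of 32 inputs
single = batch[:1]                 # the first input alone, shape (1, 4)

out_batch = dense(batch)
out_single = dense(single)

# The first row of the batched result matches the single-input result.
print(np.allclose(out_batch[0], out_single[0]))
```

This only covers layers that treat each row independently. Layers that compute statistics across the batch, such as batch normalization in training mode, behave differently.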

Questions about tensorflow GetStarted tutorial

So I was reading the TensorFlow "Getting Started" tutorial and I found it very hard to follow. There were a lot of explanations missing about each function and why it is necessary (or not).
1. In the tf.estimator section, what is the meaning of the "x_eval" and "y_eval" arrays? The x_train and y_train arrays give the desired output (which is the corresponding y coordinate) for a given x coordinate. But the x_eval and y_eval values are incorrect: for x=5, y should be -4, not -4.1. Where do those values come from? What do x_eval and y_eval mean? Are they necessary? How did they choose those values?
2. What is the difference between "input_fn" (what does "fn" even mean?) and "train_input_fn"? I see that the only difference is that one has
num_epochs=None, shuffle=True
while the other has
num_epochs=1000, shuffle=False
but I don't understand what "input_fn" or "train_input_fn" are or do, what the difference between the two is, or whether both are necessary.
3. In the
estimator.train(input_fn=input_fn, steps=1000)
piece of code, I don't understand the difference between "steps" and "num_epochs". What's the meaning of each one? Can you have num_epochs=1000 and steps=1000 too?
4. The final question is, how do I get the W and the b? In the previous way of doing it (not using tf.estimator) they explicitly found that W=-1 and b=1. If I was doing a more complex neural network, involving biases and weights, I think I would want to recover the actual values of the weights and biases. That's the whole point of why I'm using TensorFlow, to find the weights! So how do I recover them in the tf.estimator example?
These are just some of the questions that bugged me while reading the "Getting Started" tutorial. I personally think it leaves a lot to be desired, since it's very unclear what each thing does and you can at best guess.
I agree with you that the tf.estimator is not very well introduced in this "getting started" tutorial. I also think that some machine learning background would help with understanding what happens in the tutorial.
As for the answers to your questions:
1. In machine learning, we usually minimize the loss of the model on the training set, and then we evaluate the performance of the model on the evaluation set. This is because it is easy to overfit the training set and get 100% accuracy on it, so using a separate validation set makes it impossible to cheat in this way.
Here (x_train, y_train) corresponds to the training set, where the global minimum is obtained for W=-1, b=1.
The validation set (x_eval, y_eval) doesn't have to perfectly follow the distribution of the training set. Although we can get a loss of 0 on the training set, we obtain a small nonzero loss on the validation set because we don't have exactly y_eval = - x_eval + 1.
2. input_fn means "input function". This is to indicate that the object input_fn is a function.
In tf.estimator, you need to provide an input function if you want to train the estimator (estimator.train()) or evaluate it (estimator.evaluate()).
Usually you want different transformations for training and evaluation, so you have two functions, train_input_fn and eval_input_fn (the input_fn in the tutorial is almost equivalent to train_input_fn and is just confusing).
For instance, during training we want to train for multiple epochs (i.e. multiple passes over the dataset). For evaluation, we only need one pass over the validation data to compute the metrics we need.
3. The number of epochs is the number of times we repeat the entire dataset. For instance, if we train for 10 epochs, the model will see each input 10 times.
When we train a machine learning model, we usually use mini-batches of data. For instance, if we have 1,000 images, we can train on batches of 100 images, which gives 10 batches per epoch. Therefore, training for 10 epochs means training on 100 batches of data. Each batch is one step, which is what the steps argument of estimator.train() counts.
4. Once the estimator is trained, you can access the list of variables through estimator.get_variable_names() and the value of a variable through estimator.get_variable_value().
Usually we never need to do that, as we can for instance use the trained estimator to predict on new examples, using estimator.predict().
If you feel that the getting started is confusing, you can always submit a GitHub issue to tell the TensorFlow team and explain your point.
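The arithmetic behind the first and third answers can be checked in a few lines of plain Python. The training and evaluation arrays below are assumed values for illustration (only the point x=5, y=-4.1 is confirmed by the question), and the loss is taken to be a sum of squared errors.

```python
import math

# --- Answer 1: loss on the training set vs. the evaluation set ---
# Assumed data for illustration; only (5, -4.1) is quoted in the question.
x_train, y_train = [1, 2, 3, 4], [0, -1, -2, -3]
x_eval, y_eval = [2, 5, 8, 1], [-1.01, -4.1, -7, 0]

W, b = -1.0, 1.0  # the optimum found on the training set

def sum_squared_error(xs, ys):
    return sum((W * x + b - y) ** 2 for x, y in zip(xs, ys))

train_loss = sum_squared_error(x_train, y_train)
eval_loss = sum_squared_error(x_eval, y_eval)
print("train loss:", train_loss)
print("eval loss:", round(eval_loss, 6))

# --- Answer 3: epochs vs. steps ---
num_examples = 1000
batch_size = 100
num_epochs = 10

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print("steps per epoch:", steps_per_epoch, "total steps:", total_steps)
```

The training loss is exactly zero because the training points lie on y = -x + 1, while the evaluation points are slightly off that line, so the evaluation loss is small but nonzero.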

What is the difference between model.fit() and model.evaluate() in Keras?

I am using Keras with TensorFlow backend to train CNN models.
What is the difference between model.fit() and model.evaluate()? Which one should I ideally use? (I am using model.fit() as of now).
I know the utility of model.fit() and model.predict(). But I am unable to understand the utility of model.evaluate(). Keras documentation just says:
It is used to evaluate the model.
I feel this is a very vague definition.
fit() is for training the model with the given inputs (and corresponding training labels).
evaluate() is for evaluating the already trained model using the validation (or test) data and the corresponding labels. Returns the loss value and metrics values for the model.
predict() is for the actual prediction. It generates output predictions for the input samples.
Let us consider a simple regression example:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
# input and output: y is a noisy linear function of x
x = np.random.uniform(0.0, 1.0, (200))
y = 0.3 + 0.6*x + np.random.normal(0.0, 0.05, len(x))
Now let's apply a regression model in Keras:
# A simple regression model
model = Sequential()
model.add(Dense(1, input_shape=(1,)))
model.compile(loss='mse', optimizer='rmsprop')
# The fit() method - trains the model
# (the epochs argument was called nb_epoch in Keras 1)
model.fit(x, y, epochs=1000, batch_size=100)
Epoch 1000/1000
200/200 [==============================] - 0s - loss: 0.0023
# The evaluate() method - gets the loss statistics
model.evaluate(x, y, batch_size=200)
# returns: loss: 0.0022612824104726315
# The predict() method - predict the outputs for the given inputs
model.predict(np.expand_dims(x[:3],1))
# returns: [ 0.65680361],[ 0.70067143],[ 0.70482892]
In deep learning, you first want to train your model. You take your data and split it into two sets: the training set and the test set. It seems pretty common that 80% of your data goes into your training set and 20% goes into your test set.
Your training set gets passed into your call to fit() and your test set gets passed into your call to evaluate(). During the fit operation, a number of rows of your training data are fed into your neural net (based on your batch size). After every batch is sent, the fit algorithm does backpropagation to adjust the weights in your neural net.
After this is done, your neural net is trained. The problem is that sometimes your neural net gets overfit, which is a condition where it performs well on the training set but poorly on other data. To guard against this situation, you run the evaluate() function to send new data (your test set) through your neural net to see how it performs on data it has never seen. There is no training occurring; this is purely a test. If all goes well, the score from training is similar to the score from testing.
fit(): Trains the model for a given number of epochs (this is for training time, with the training dataset).
predict(): Generates output predictions for the input samples (this is for somewhere between training and testing time).
evaluate(): Returns the loss value & metrics values for the model in test mode (this is for testing time, with the testing dataset).
While all the above answers explain what fit(), evaluate() and predict() do, the more important point to keep in mind, in my opinion, is what data you should use for fit() and evaluate().
The clearest guideline I came across is on Machine Learning Mastery, in particular this quote:
Training set: A set of examples used for learning, that is to fit the parameters of the classifier.
Validation set: A set of examples used to tune the parameters of a classifier, for example to choose the number of hidden units in a neural network.
Test set: A set of examples used only to assess the performance of a fully-specified classifier.
By Brian Ripley, page 354, Pattern Recognition and Neural Networks, 1996
You should not use the data that you used to train the model (training data) or to tune it (validation data) for evaluating the performance (generalization) of your fully trained model (evaluate()).
The test data used for evaluate() should be unseen, i.e. not used for training (fit()), in order to be a reliable indicator of the model's generalization.
For predict(), you can use just one or a few examples that you choose (from anywhere) to get a quick check or answer from your model. I don't believe it can be used as the sole measure of generalization.
One thing which was not mentioned here, I believe needs to be specified. model.evaluate() returns a list which contains a loss figure and an accuracy figure. What has not been said in the answers above, is that the "loss" figure is the sum of ALL the losses calculated for each item in the x_test array. x_test would contain your test data and y_test would contain your labels. It should be clear that the loss figure is the sum of ALL the losses, not just one loss from one item in the x_test array.
I would say the mean of losses incurred from all iterations, not the sum. But sure, that's the most important information here, otherwise the modeler would be slightly confused.
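The difference between the two readings can be seen in a small NumPy sketch. It assumes a mean-squared-error loss and uses made-up labels and predictions in place of x_test/y_test and a real model.

```python
import numpy as np

# Hypothetical labels and model predictions for four test items.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 4.5])

# Squared error of each individual item.
per_sample_loss = (y_true - y_pred) ** 2

mean_loss = per_sample_loss.mean()  # the "mean" reading from the comment
sum_loss = per_sample_loss.sum()    # the "sum" reading from the answer

print("per-sample:", per_sample_loss)
print("mean:", mean_loss, "sum:", sum_loss)
```

Under the mean reading, the reported loss stays on the same scale regardless of how many test items there are, whereas a sum would grow with the size of the test set.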

How to get both loss and model output at once, on a batch of data in Keras?

I'm using Keras w/ Tensorflow backend to train a NN.
I'm using train_on_batch for training, which returns the loss on the given batch. How do I also get the output classification on that batch? (I'd like to do some visualisations of the output.)
To do that, I currently make another call to predict to get the model output, but that's redundant since train_on_batch has already passed the input batch forward.
In Caffe, when an image is fed forward, the intermediate layer outputs stay stored in net.blobs, but in Keras/TensorFlow it seems that if we want to get an intermediate output, we have to rerun the computational graph for each intermediate output we want to access on the CPU. Is there a way to access many or all intermediate layers' outputs without rerunning the graph for each one?
I don't mind having a tensorflow-specific workaround.
If you use the functional API, this is pretty straightforward.
In addition to @MohamedEzz's answer, you can create a custom callback that performs the operations you require during the training process. Callbacks have methods such as on_epoch_begin, on_epoch_end and on_train_end, which run your code at the corresponding points of training.
This way you can preserve the batch.