Tensorflow Neural Machine Translation Example - Loss Function

I'm stepping through the code here: https://www.tensorflow.org/tutorials/text/nmt_with_attention
as a learning method and I am confused as to when the loss function is called and what is passed. I added two print statements in the loss_function and when the training loop runs, it only prints out
(64,)
(64, 4935)
at the very start multiple times and then nothing again. I am confused on two fronts:
Why doesn't the loss_function() get called repeatedly through the training loop and print the shapes? I expected that the loss function would get called at the end of each batch, which is of size 64.
I expected the shapes of the actuals to be (batch size, time steps) and the predictions to be (batch size, time steps, vocabulary size). It looks like the loss gets called separately for every time step (64 is the batch size and 4935 is the vocabulary size).
The relevant bits I believe are reproduced below.
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    print(real.shape)
    print(pred.shape)
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask  # set padding entries to zero loss
    return tf.reduce_mean(loss_)
@tf.function
def train_step(inp, targ, enc_hidden):
    loss = 0
    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)
        dec_hidden = enc_hidden
        dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)
        # Teacher forcing - feeding the target as the next input
        for t in range(1, targ.shape[1]):
            # passing enc_output to the decoder
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
            print(targ[:, t])
            print(predictions)
            loss += loss_function(targ[:, t], predictions)
            # using teacher forcing
            dec_input = tf.expand_dims(targ[:, t], 1)
    batch_loss = (loss / int(targ.shape[1]))
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss
EPOCHS = 10
for epoch in range(EPOCHS):
    start = time.time()
    enc_hidden = encoder.initialize_hidden_state()
    total_loss = 0
    for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
        # print(batch)
        batch_loss = train_step(inp, targ, enc_hidden)
        total_loss += batch_loss
        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         batch_loss.numpy()))
    # saving (checkpoint) the model every 2 epochs
    if (epoch + 1) % 2 == 0:
        checkpoint.save(file_prefix=checkpoint_prefix)
    print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                        total_loss / steps_per_epoch))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

The loss is treated similarly to the rest of the graph. In TensorFlow, calls like tf.keras.layers.Dense and tf.nn.conv2d don't actually perform the operation; instead, they define the graph for the operations. I have another post here, How do backpropagation works in tensorflow, that explains the backprop and some of the motivation for why this is.
The loss function you have above is
def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    print(real.shape)
    print(pred.shape)
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask  # set padding entries to zero loss
    result = tf.reduce_mean(loss_)
    return result
Think of this function as a generator that returns result. result defines the graph that computes the loss. Perhaps a better name for this function would be loss_function_graph_creator ... but that's another story.
result, which is a graph containing the weights, biases, and the information needed to do both the forward propagation and the back propagation, is all model.fit needs. It no longer needs this function, and it doesn't need to run the function on every loop.
What is really happening under the covers is that, given your model (called my_model), the compile line
model.compile(loss=loss_function, optimizer='sgd')
is effectively the following lines
input = tf.keras.Input()
output = my_model(input)
loss = loss_function(input,output)
opt = tf.keras.optimizers.SGD()
gradient = opt.minimize(loss)
get_gradient_model = tf.keras.Model(input,gradient)
and there you have the gradient operation, which can be used in a loop to get the gradients; conceptually, that is what model.fit does.
Q and A
Is the fact that this function (@tf.function def train_step(inp, targ, enc_hidden):) has the tf.function decorator (and that the loss function is called in it) what makes this code run as you describe, rather than as normal Python?
No, it is not 'normal' Python. It only defines the flow of tensors through the graph of matrix operations that will (hopefully) run on your GPU. All the TensorFlow calls just set up those operations on the GPU (or a simulated GPU if you don't have one).
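A minimal sketch (not the tutorial's code) of why the prints only appear at the start: a Python print inside a tf.function-decorated function runs only while the function is being traced into a graph, whereas tf.print becomes a graph op and runs on every call.
import tensorflow as tf

@tf.function
def loss_sketch(real, pred):
    # Python-side print: fires only during tracing
    print("tracing; real shape:", real.shape, "pred shape:", pred.shape)
    # tf.print: compiled into the graph, fires on every call
    tf.print("executing; batch size:", tf.shape(real)[0])
    return tf.reduce_mean(pred)

real = tf.zeros((64,), dtype=tf.int64)
pred = tf.zeros((64, 4935))
loss_sketch(real, pred)  # first call traces, so both lines print
loss_sketch(real, pred)  # subsequent calls print only the tf.print line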
How can I tell the actual shapes being passed into loss_function (the second part of my question)?
No problem at all... simply run this code
loss_function(y, y).shape
This will compute the loss function of your expected output compared exactly to the same output. The loss will (hopefully) be zero, but actually calculating the value of the loss wasn't the point. You want the shape and this will give it to you.
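If you want to see the concrete shapes with the sizes from the question, another option (a minimal sketch, assuming the loss_function defined above is called eagerly, outside any @tf.function) is to feed it dummy tensors shaped like one decoder time step:
import tensorflow as tf

batch_size, vocab_size = 64, 4935
# targets for one time step: (batch,) integer token ids
real = tf.random.uniform((batch_size,), maxval=vocab_size, dtype=tf.int64)
# predictions for one time step: (batch, vocab) logits
pred = tf.random.normal((batch_size, vocab_size))

loss = loss_function(real, pred)  # eager call, so the print statements fire
print(loss.shape)                 # () -- a scalar, because of tf.reduce_mean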

Related

How can I replace the training data for each training loop with a callback function in TensorFlow?

I saw the DB-VAE code written in TensorFlow; the authors sample the data for each batch from the whole training set according to a distribution:
All code can be found: https://github.com/aamini/introtodeeplearning/blob/master/lab2/solutions/Part2_Debiasing_Solution.ipynb
# The training loop -- outer loop iterates over the number of epochs
for i in range(num_epochs):
    p_gep_pos = get_training_sample_probabilities(geps=all_gep_pos, dbvae=dbvae, latent_dim=latent_dim)
    # get a batch of training data and compute the training step
    loss = 0
    class_loss = 0
    for j in tqdm(range(loader.get_train_size() // batch_size)):
        # load a batch of data
        (x, y) = loader.get_batch(batch_size, p_pos=p_gep_pos)  # also got some negative samples. <-- here
        # loss optimization
        loss = debiasing_train_step(x, y, optimizer=optimizer, dbvae=dbvae)
I want to use this sampling method in my own project. My question is whether it is possible to add it via a custom callback (https://www.tensorflow.org/guide/keras/custom_callback) within the training loop, so that I can still use the compile and fit functions of a standard keras.Model.
Something may look like:
class CustomCallback(keras.callbacks.Callback):
    def on_train_batch_begin(self, batch, logs=None):
        keys = list(logs.keys())
        print("...Training: start of batch {}; got log keys: {}".format(batch, keys))
        # I don't know how to get the data for current batch and replace them by sampled data
        self.x, self.y = loader.get_batch(batch_size, p_pos=p_gep_pos)
        return self.x, self.y
model = MyModel()
model.compile(...)
model.fit(x, y, callbacks=CustomCallback()) # <-- add it here
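One caveat, and a hypothetical alternative (not from the thread): on_train_batch_begin only receives the batch index and a logs dict, and its return value is ignored, so a callback cannot swap the batch contents. Below is a sketch that keeps compile/fit but does the resampling in a plain Python generator instead; loader, get_training_sample_probabilities, all_gep_pos, dbvae, latent_dim and MyModel are the names from the snippets above and are assumed to exist.
def resampling_generator(batch_size):
    while True:
        # recompute the sampling probabilities once per pass over the data
        p_pos = get_training_sample_probabilities(geps=all_gep_pos, dbvae=dbvae,
                                                  latent_dim=latent_dim)
        for _ in range(loader.get_train_size() // batch_size):
            x, y = loader.get_batch(batch_size, p_pos=p_pos)
            yield x, y

model = MyModel()
model.compile(optimizer="adam", loss="binary_crossentropy")  # placeholder loss
model.fit(resampling_generator(batch_size=32),
          steps_per_epoch=loader.get_train_size() // 32,
          epochs=10)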

Evaluating TF model inside a TF op throws error

I am using TensorFlow 2. I am trying to optimize a function which uses the loss of a trained tensorflow model (poison).
@tf.function
def totalloss(x):
    xt = tf.multiply(x, (1.0 - m)) + tf.multiply(m, d)
    label = targetlabel * np.ones(xt.shape[0])
    loss1 = poison.evaluate(xt, label, steps=1)
    loss2 = tf.linalg.norm(m, 1)
    return loss1 + loss2
I am not able to execute this function; however, when I comment out the @tf.function line, the function works!
I need to use this function as a TensorFlow op so that I can optimize m and d.
Value Error: Unknown graph. Aborting.
This is how I am defining the model and variables:
# mask
m = tf.Variable(tf.zeros(shape=(1, 784)), name="m")
d = tf.Variable(tf.zeros(shape=(1, 784)), name="d")
# target
targetlabel = 6
poison = fcn()
poison.load_weights("MNISTP.h5")
adam = tf.keras.optimizers.Adam(lr=.002, decay=1e-6)
poison.compile(optimizer=adam, loss=tf.losses.sparse_categorical_crossentropy)
This is how I am calling the function later (executing this line results in the error listed below; however, if I comment out the @tf.function line, this command works!):
loss = totalloss(ptestdata)
This is the entire traceback call:
ValueError: in converted code:
<ipython-input-52-4841ad87022f>:5 totalloss *
loss1 = poison.evaluate(xt, label, steps=1)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:746 evaluate
use_multiprocessing=use_multiprocessing)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_arrays.py:693 evaluate
callbacks=callbacks)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_arrays.py:187 model_iteration
f = _make_execution_function(model, mode)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_arrays.py:555 _make_execution_function
return model._make_execution_function(mode)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:2034 _make_execution_function
self._make_test_function()
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:2010 _make_test_function
**self._function_kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py:3544 function
return EagerExecutionFunction(inputs, outputs, updates=updates, name=name)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py:3429 __init__
raise ValueError('Unknown graph. Aborting.')
ValueError: Unknown graph. Aborting.
The purpose of the @tf.function decorator is to convert TensorFlow operations written in Python into a TensorFlow graph to achieve better performance. The error appears when you try to use a pre-trained model that already has a serialized graph: the decorator cannot perform a graph-to-graph conversion.
I've reported this error here: https://github.com/tensorflow/tensorflow/issues/33997
A (temporary) solution is to split your loss function into two smaller functions. The decorator should only be used on the function that does not involve the pre-trained model. This way you still get better performance for the other operations, just not for the part that uses the pre-trained model.
For example:
@tf.function
def _other_ops(x):
    xt = tf.multiply(x, (1.0 - m)) + tf.multiply(m, d)
    label = targetlabel * np.ones(xt.shape[0])
    loss2 = tf.linalg.norm(m, 1)
    return xt, label, loss2

def total_loss(x):
    xt, label, loss2 = _other_ops(x)
    loss1 = poison.evaluate(xt, label, steps=1)
    return loss1 + loss2
Update:
According to the discussion in the TF issue linked above, an elegant solution is to manually pass the input through each layer of the model. You can get the list of layers in your model by calling your_model.layers.
In your case, you calculate the loss from the prediction of your output and the label in the last layer. Thus, I think you should skip the last layer and calculate the loss outside of the loop:
@tf.function
def totalloss(x):
    xt = tf.multiply(x, (1.0 - m)) + tf.multiply(m, d)
    label = targetlabel * np.ones(xt.shape[0])
    feat = xt
    # Skip the last layer which calculates loss1
    for i in range(len(poison.layers) - 1):
        layer = poison.layers[i]
        feat = layer(feat)
    # Now, calculate the loss yourself (note: y_true first, then y_pred)
    loss1 = tf.keras.losses.sparse_categorical_crossentropy(label, feat)
    loss2 = tf.linalg.norm(m, 1)
    return loss1 + loss2
The way the TF engineers explain this issue is that a model may wrap high-level processing that is not guaranteed to work under @tf.function, so putting a model inside a function decorated with @tf.function is not recommended. Thus, we need to break the computation into smaller pieces to bypass it.
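As a hypothetical follow-up (not part of the original answer): once totalloss returns a tensor instead of calling evaluate, m and d can be optimized with an ordinary GradientTape loop. The names m, d, adam and ptestdata are the ones defined in the question.
for step in range(100):
    with tf.GradientTape() as tape:
        # totalloss returns per-example cross-entropy plus the L1 term; reduce to a scalar
        loss = tf.reduce_mean(totalloss(ptestdata))
    grads = tape.gradient(loss, [m, d])
    adam.apply_gradients(zip(grads, [m, d]))
    if step % 10 == 0:
        tf.print("step", step, "loss", loss)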

How to obtain second derivatives of a Loss function with respect to the parameters of a neural network using gradient tape in Tensorflow eager mode

I am creating a basic auto-encoder for the MNIST dataset using TensorFlow eager mode. I would like to observe the second-order partial derivatives of my loss function with respect to the parameters of the network as it trains. Currently, calling tape.gradient() on the output of in_tape.gradient() returns None (where in_tape is a GradientTape nested inside the outer GradientTape called tape; I have included my code below).
I have tried calling tape.gradient() directly on the output of in_tape.gradient(), with None being returned. My next approach was to iterate over the output of in_tape.gradient() and apply tape.gradient() to each gradient individually (with respect to my model variables), again with None being returned each time.
I receive a single None value for any tape.gradient() call, not a list of None values which I believe would indicate None for a single partial derivative, which would be expected in some cases.
I am currently only trying to get the second derivatives for the first set of weights (from input to hidden layers), however, I will scale it to include all weights once I have this working.
tf.enable_eager_execution()
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((train_images.shape[0], train_images.shape[1]*train_images.shape[2])).astype(np.float32)/255
test_images = test_images.reshape((test_images.shape[0], test_images.shape[1]*test_images.shape[2])).astype(np.float32)/255

num_epochs = 200
batch_size = 100
learning_rate = 0.0003

class MNISTModel(tf.keras.Model):
    def __init__(self, device='/gpu:0'):
        super(MNISTModel, self).__init__()
        self.device = device
        self.initializer = tf.initializers.random_uniform(0.0, 0.5)
        self.hidden = tf.keras.layers.Dense(200, use_bias=False, kernel_initializer=tf.initializers.random_uniform(0.0, 0.5), name="Hidden")
        self.out = tf.keras.layers.Dense(train_images.shape[1], use_bias=False, kernel_initializer=tf.initializers.random_uniform(0.0, 0.5), name="Output")
        self.hidden.build(train_images.shape[1])
        self.out.build(200)

    def call(self, x):
        return self.out(self.hidden(x))

def loss_func(model, x, y_):
    return tf.reduce_mean(tf.losses.mean_squared_error(labels=y_, predictions=model(x)))
    # return tf.reduce_mean((y_ - model(x))**4)

model = MNISTModel()
optimizer = tf.train.GradientDescentOptimizer(learning_rate)

for epochs in range(num_epochs):
    print("Started epoch ", epochs)
    print("Num batches is: ", train_images.shape[0]/batch_size)
    for i in range(0, 1):  # (int(train_images.shape[0]/batch_size)):
        with tfe.GradientTape(persistent=True) as tape:
            tape.watch(model.variables)
            with tfe.GradientTape() as in_tape:
                in_tape.watch(model.variables)
                loss = loss_func(model, train_images[0:batch_size], train_images[0:batch_size])
        grads = tape.gradient(loss, model.variables)
        IH_partial_grads = np.array([])
        for i in range(len(grads[0])):
            collector = np.array([])
            for j in range(len(grads[0][i])):
                collector = np.append(collector, tape.gradient(grads[0][i][j], model.variables[0]))
            IH_partial_grads = np.append(IH_partial_grads, collector)
        optimizer.apply_gradients(zip(grads, model.variables), global_step=tf.train.get_or_create_global_step())
    print("Epoch test loss: ", loss_func(model, test_images, test_images))
My ultimate goal is to form the hessian matrix for the loss function with respect to all parameters of my network.
Thanks for any and all help!
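For reference, a minimal sketch of nested tapes in plain TF 2.x (not the tfe API used above): the key detail is that the inner gradient has to be computed inside the outer tape's context, otherwise the outer tape never records it and differentiating it returns None.
import tensorflow as tf

x = tf.Variable([1.0, 2.0])

with tf.GradientTape() as outer_tape:
    with tf.GradientTape() as inner_tape:
        loss = tf.reduce_sum(x ** 3)      # toy loss: x1**3 + x2**3
    grads = inner_tape.gradient(loss, x)  # 3 * x**2, recorded by outer_tape
# The Jacobian of the gradient w.r.t. x is the Hessian (diagonal here): 6 * x
hessian = outer_tape.jacobian(grads, x)
print(hessian.numpy())  # [[ 6.  0.]
                        #  [ 0. 12.]]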

Gradients are always zero

I have written an algorithm using the TensorFlow framework and ran into the problem that tf.train.Optimizer.compute_gradients(loss) returns zero for all weights. Another problem is that if I set the batch size larger than about 5, tf.histogram_summary for the weights throws an error that some of the values are NaN.
I cannot provide a reproducible example here, because my code is quite bulky and I am not good enough with TF to make it shorter. I will try to paste some fragments below.
Main loop:
images_ph = tf.placeholder(tf.float32, shape=some_shape)
labels_ph = tf.placeholder(tf.float32, shape=some_shape)

output = inference(BATCH_SIZE, images_ph)
loss = loss(labels_ph, output)
train_op = train(loss, global_step)

session = tf.Session()
session.run(tf.initialize_all_variables())

for i in xrange(MAX_STEPS):
    images, labels = train_dataset.get_batch(BATCH_SIZE, yolo.INPUT_SIZE, yolo.OUTPUT_SIZE)
    session.run([loss, train_op], feed_dict={images_ph: images, labels_ph: labels})
Train op (this is where the problem occurs):
def train(total_loss):
    opt = tf.train.AdamOptimizer()
    grads = opt.compute_gradients(total_loss)
    # Here gradients are zeros
    for grad, var in grads:
        if grad is not None:
            tf.histogram_summary("gradients/" + var.op.name, grad)
    return opt.apply_gradients(grads, global_step=global_step)
Loss (the loss is calculated correctly, since it changes from sample to sample):
def loss(labels, output):
    return tf.reduce_mean(tf.squared_difference(labels, output))
Inference: a set of convolution layers with ReLU followed by 3 fully connected layers with sigmoid activation in the last layer. All weights initialized by truncated normal rv's. All labels are vectors of fixed length with real numbers in range [0,1].
Thanks in advance for any help! If you have any hypotheses about my problem, please share them and I will try them. I can also share the whole code if you like.
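One hypothetical way to check whether the gradients are really zero, rather than only looking at the symbolic tensors, is to fetch them through session.run with a real batch. This sketch assumes the loss tensor, session, placeholders and a batch from the fragments above are in scope:
# Build the gradient tensors once (e.g. return grads_and_vars from train() as well)
opt = tf.train.AdamOptimizer()
grads_and_vars = [(g, v) for g, v in opt.compute_gradients(loss) if g is not None]

grad_values = session.run([g for g, _ in grads_and_vars],
                          feed_dict={images_ph: images, labels_ph: labels})
for (_, var), val in zip(grads_and_vars, grad_values):
    print(var.op.name, "max |grad| =", abs(val).max())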

Tensorflow: cannot obtain the same result with mini-batch SGD optimizer compared to Kaldi nnet1

I am trying to build a TensorFlow example with a simple multi-layer perceptron (MLP) with one hidden layer. However, when I tested it and compared it to other software, e.g. Kaldi nnet1, convergence during training is not efficient, or not comparable to Kaldi nnet1. I tried my best to make all the parameters the same (input, int target, batch size, learning rate, etc.), but I am still confused about where the cause could be. Some pieces of the code are as follows:
Initialization:
self.weight = [tf.Variable(tf.truncated_normal([440, 8192], stddev=0.1))]
self.bias = [tf.Variable(tf.constant(0.01, shape=8192))]
self.weight.append(tf.Variable(tf.truncated_normal([8192, 8], stddev=0.1)))
self.bias.append(tf.Variable(tf.constant(0.01, shape=8)))

self.act = [tf.nn.sigmoid(tf.matmul(self.input, self.weight[0]) + self.bias[0])]
self.nn_out = tf.matmul(self.act, self.weight[1]) + self.bias[1]
self.nn_softmax = tf.nn.softmax(self.nn_out)

self.nn_tgt = tf.placeholder("int64", shape=[None,])
self.cost_mean = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(self.nn_out, self.nn_tgt))
self.train_step = tf.train.GradientDescentOptimizer(self.learn_rate).minimize(self.cost_mean)

# saver
self.saver = tf.train.Saver()
self.sess = tf.Session()
self.sess.run(tf.initialize_all_variables())
Training:
for epoch in xrange(20):
    feats_tr, tgts_tr = shuffle(feats_tr, tgts_tr, random_state=777)
    # restore the existing model
    ckpt = tf.train.get_checkpoint_state(ckpt_dir)
    if ckpt and ckpt.model_checkpoint_path:
        self.load(ckpt.model_checkpoint_path)
    # mini-batch
    tr_loss = []
    for idx_begin in range(0, len(feats_tr), 512):
        idx_end = idx_begin + batch_size
        batch_feats, batch_tgts = feats_tr[idx_begin:idx_end], tgts_tr[idx_begin:idx_end]
        _, loss_val = self.sess.run([self.train_step, self.cost_mean],
                                    feed_dict={self.nn_in: batch_feats,
                                               self.nn_tgt: batch_tgts,
                                               self.learn_rate: learn_rate})
        tr_loss.append(loss_val)
    # cross-validation
    cv_loss = []
    for idx_begin in range(0, len(feats_cv), 512):
        idx_end = idx_begin + batch_size
        batch_feats, batch_tgts = feats[idx_begin:idx_end], tgts[idx_begin:idx_end]
        cv_loss.append(self.sess.run(self.cost_mean,
                                     feed_dict={self.nn_in: batch_feats,
                                                self.nn_tgt: batch_tgts}))
    print("Avg Loss for Training: " + str(np.mean(tr_loss)) +
          " Avg Loss for Validation: " + str(np.mean(cv_loss)))
    # save model per epoch if np.mean(cv_loss) is less than previous
    if (epoch + 1) % 1 == 0:
        if loss_new < loss:
            loss = loss_new
            print("Model accepted in epoch %d" % (epoch + 1))
            # save model to ckpt_dir with mdl_nam
            self.saver.save(self.sess, mdl_nam, global_step=epoch + 1)
        else:
            print("Model rejected in epoch %d" % (epoch + 1))
I also generated a simple annealing learning-rate control: if the average cross-validation loss does not improve by a certain threshold, the learn_rate (initially 0.008) is halved.
I checked all the parameters against Kaldi nnet1, and the only remaining difference is the initialization of the weights and biases. I am not sure whether initialization matters that much. However, convergence in terms of cv_loss during training in TensorFlow (avg. CV loss 1.99) is not as good as in Kaldi nnet1 (avg. CV loss 0.95). Can someone help point out where I did something wrong or what I missed?
Many thanks in advance !!!
At each epoch, you call self.load(ckpt.model_checkpoint_path) which seems to load previously saved weights.
Your model cannot learn if it is reset to the initial weights at each epoch.
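A minimal sketch of what the fix could look like (restore the checkpoint once, before the epoch loop, instead of inside it; names are the ones from the training snippet above):
ckpt = tf.train.get_checkpoint_state(ckpt_dir)
if ckpt and ckpt.model_checkpoint_path:
    self.load(ckpt.model_checkpoint_path)  # restore once, before training

for epoch in xrange(20):
    feats_tr, tgts_tr = shuffle(feats_tr, tgts_tr, random_state=777)
    # ... mini-batch training and cross-validation exactly as above,
    #     without reloading the checkpoint inside the loop ...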