How to obtain second derivatives of a Loss function with respect to the parameters of a neural network using gradient tape in Tensorflow eager mode - tensorflow

I am creating a basic auto-encoder for the MNIST dataset using TensorFlow eager mode. I would like to observe the second-order partial derivatives of my loss function with respect to the parameters of the network as it trains. Currently, calling tape.gradient() on the output of in_tape.gradient returns None (where in_tape is a GradientTape nested inside the outer GradientTape called tape, I have included my code below)
I have tried calling the tape.gradient() directly on the in_tape.gradient() with None being returned. My next approach was to iterate over the output of in_tape.gradient() and apply tape.gradient() to each gradient individually (with respect to my model variables) with None being returned each time.
I receive a single None value for any tape.gradient() call, not a list of None values which I believe would indicate None for a single partial derivative, which would be expected in some cases.
I am currently only trying to get the second derivatives for the first set of weights (from input to hidden layers), however, I will scale it to include all weights once I have this working.
tf.enable_eager_execution()
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((train_images.shape[0], train_images.shape[1]*train_images.shape[2])).astype(np.float32)/255
test_images = test_images.reshape((test_images.shape[0], test_images.shape[1]*test_images.shape[2])).astype(np.float32)/255
num_epochs = 200
batch_size = 100
learning_rate = 0.0003
class MNISTModel(tf.keras.Model):
def __init__(self, device='/gpu:0'):
super(MNISTModel, self).__init__()
self.device = device
self.initializer = tf.initializers.random_uniform(0.0, 0.5)
self.hidden = tf.keras.layers.Dense(200, use_bias=False, kernel_initializer=tf.initializers.random_uniform(0.0, 0.5), name="Hidden")
self.out = tf.keras.layers.Dense(train_images.shape[1], use_bias=False, kernel_initializer=tf.initializers.random_uniform(0.0, 0.5), name="Output")
self.hidden.build(train_images.shape[1])
self.out.build(200)
def call(self, x):
return self.out(self.hidden(x))
def loss_func(model, x, y_):
return tf.reduce_mean(tf.losses.mean_squared_error(labels=y_, predictions=model(x)))
#return tf.reduce_mean((y_ - model(x))**4)
model = MNISTModel()
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
for epochs in range(num_epochs):
print("Started epoch ", epochs)
print("Num batches is: ", train_images.shape[0]/batch_size)
for i in range(0,1): #(int(train_images.shape[0]/batch_size)):
with tfe.GradientTape(persistent=True) as tape:
tape.watch(model.variables)
with tfe.GradientTape() as in_tape:
in_tape.watch(model.variables)
loss = loss_func(model,train_images[0:batch_size],train_images[0:batch_size])
grads = tape.gradient(loss, model.variables)
IH_partial_grads = np.array([])
for i in range(len(grads[0])):
collector = np.array([])
for j in range(len(grads[0][i])):
collector = np.append(collector, tape.gradient(grads[0][i][j], model.variables[0]))
IH_partial_grads = np.append(IH_partial_grads, collector)
optimizer.apply_gradients(zip(grads, model.variables), global_step=tf.train.get_or_create_global_step())
print("Epoch test loss: ", loss_func(model, test_images, test_images))
My ultimate goal is to form the hessian matrix for the loss function with respect to all parameters of my network.
Thanks for any and all help!

Related

400% higher error with PyTorch compared with identical Keras model (with Adam optimizer)

TLDR:
A simple (single hidden-layer) feed-forward Pytorch model trained to predict the function y = sin(X1) + sin(X2) + ... sin(X10) substantially underperforms an identical model built/trained with Keras. Why is this so and what can be done to mitigate the difference in performance?
In training a regression model, I noticed that PyTorch drastically underperforms an identical model built with Keras.
This phenomenon has been observed and reported previously:
The same model produces worse results on pytorch than on tensorflow
CNN model in pytorch giving 30% less accuracy to Tensoflowflow model:
PyTorch Adam vs Tensorflow Adam
Suboptimal convergence when compared with TensorFlow model
RNN and Adam: slower convergence than Keras
PyTorch comparable but worse than keras on a simple feed forward network
Why is the PyTorch model doing worse than the same model in Keras even with the same weight initialization?
Why Keras behave better than Pytorch under the same network configuration?
The following explanations and suggestions have been made previously as well:
Using the same decimal precision (32 vs 64): 1, 2,
Using a CPU instead of a GPU: 1,2
Change retain_graph=True to create_graph=True in computing the 2nd derivative with autograd.grad: 1
Check if keras is using a regularizer, constraint, bias, or loss function in a different way from pytorch: 1,2
Ensure you are computing the validation loss in the same way: 1
Use the same initialization routine: 1,2
Training the pytorch model for longer epochs: 1
Trying several random seeds: 1
Ensure that model.eval() is called in validation step when training pytorch model: 1
The main issue is with the Adam optimizer, not the initialization: 1
To understand this issue, I trained a simple two-layer neural network (much simpler than my original model) in Keras and PyTorch, using the same hyperparameters and initialization routines, and following all the recommendations listed above. However, the PyTorch model results in a mean squared error (MSE) that is 400% higher than the MSE of the Keras model.
Here is my code:
0. Imports
import numpy as np
from scipy.stats import pearsonr
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from torch.utils.data import Dataset, DataLoader
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.regularizers import L2
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
1. Generate a reproducible dataset
def get_data():
np.random.seed(0)
Xtrain = np.random.normal(0, 1, size=(7000,10))
Xval = np.random.normal(0, 1, size=(700,10))
ytrain = np.sum(np.sin(Xtrain), axis=-1)
yval = np.sum(np.sin(Xval), axis=-1)
scaler = MinMaxScaler()
ytrain = scaler.fit_transform(ytrain.reshape(-1,1)).reshape(-1)
yval = scaler.transform(yval.reshape(-1,1)).reshape(-1)
return Xtrain, Xval, ytrain, yval
class XYData(Dataset):
def __init__(self, X, y):
super(XYData, self).__init__()
self.X = torch.tensor(X, dtype=torch.float32)
self.y = torch.tensor(y, dtype=torch.float32)
self.len = len(y)
def __getitem__(self, index):
return (self.X[index], self.y[index])
def __len__(self):
return self.len
# Data, dataset, and dataloader
Xtrain, Xval, ytrain, yval = get_data()
traindata = XYData(Xtrain, ytrain)
valdata = XYData(Xval, yval)
trainloader = DataLoader(dataset=traindata, shuffle=True, batch_size=32, drop_last=False)
valloader = DataLoader(dataset=valdata, shuffle=True, batch_size=32, drop_last=False)
2. Build Keras and PyTorch models with identical hyperparameters and initialization methods
class TorchLinearModel(nn.Module):
def __init__(self, input_dim=10, random_seed=0):
super(TorchLinearModel, self).__init__()
_ = torch.manual_seed(random_seed)
self.hidden_layer = nn.Linear(input_dim,100)
self.initialize_layer(self.hidden_layer)
self.output_layer = nn.Linear(100, 1)
self.initialize_layer(self.output_layer)
def initialize_layer(self, layer):
_ = torch.nn.init.xavier_normal_(layer.weight)
#_ = torch.nn.init.xavier_uniform_(layer.weight)
_ = torch.nn.init.constant(layer.bias,0)
def forward(self, x):
x = self.hidden_layer(x)
x = self.output_layer(x)
return x
def mean_squared_error(ytrue, ypred):
return torch.mean(((ytrue - ypred) ** 2))
def build_torch_model():
torch_model = TorchLinearModel()
optimizer = optim.Adam(torch_model.parameters(),
betas=(0.9,0.9999),
eps=1e-7,
lr=1e-3,
weight_decay=0)
return torch_model, optimizer
def build_keras_model():
x = layers.Input(shape=10)
z = layers.Dense(units=100, activation=None, use_bias=True, kernel_regularizer=None,
bias_regularizer=None)(x)
y = layers.Dense(units=1, activation=None, use_bias=True, kernel_regularizer=None,
bias_regularizer=None)(z)
keras_model = Model(x, y, name='linear')
optimizer = Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.9999, epsilon=1e-7,
amsgrad=False)
keras_model.compile(optimizer=optimizer, loss='mean_squared_error')
return keras_model
# Instantiate models
torch_model, optimizer = build_torch_model()
keras_model = build_keras_model()
3. Train PyTorch model for 100 epochs:
torch_trainlosses, torch_vallosses = [], []
for epoch in range(100):
# Training
losses = []
_ = torch_model.train()
for i, (x,y) in enumerate(trainloader):
optimizer.zero_grad()
ypred = torch_model(x)
loss = mean_squared_error(y, ypred)
_ = loss.backward()
_ = optimizer.step()
losses.append(loss.item())
torch_trainlosses.append(np.mean(losses))
# Validation
losses = []
_ = torch_model.eval()
with torch.no_grad():
for i, (x, y) in enumerate(valloader):
ypred = torch_model(x)
loss = mean_squared_error(y, ypred)
losses.append(loss.item())
torch_vallosses.append(np.mean(losses))
print(f"epoch={epoch+1}, train_loss={torch_trainlosses[-1]:.4f}, val_loss={torch_vallosses[-1]:.4f}")
4. Train Keras model for 100 epochs:
history = keras_model.fit(Xtrain, ytrain, sample_weight=None, batch_size=32, epochs=100,
validation_data=(Xval, yval))
5. Loss in training history
plt.plot(torch_trainlosses, color='blue', label='PyTorch Train')
plt.plot(torch_vallosses, color='blue', linestyle='--', label='PyTorch Val')
plt.plot(history.history['loss'], color='brown', label='Keras Train')
plt.plot(history.history['val_loss'], color='brown', linestyle='--', label='Keras Val')
plt.legend()
Keras records a much lower error in the training. Since this may be due to a difference in how Keras computes the loss, I calculated the prediction error on the validation set with sklearn.metrics.mean_squared_error
6. Validation error after training
ypred_keras = keras_model.predict(Xval).reshape(-1)
ypred_torch = torch_model(torch.tensor(Xval, dtype=torch.float32))
ypred_torch = ypred_torch.detach().numpy().reshape(-1)
mse_keras = metrics.mean_squared_error(yval, ypred_keras)
mse_torch = metrics.mean_squared_error(yval, ypred_torch)
print('Percent error difference:', (mse_torch / mse_keras - 1) * 100)
r_keras = pearsonr(yval, ypred_keras)[0]
r_pytorch = pearsonr(yval, ypred_torch)[0]
print("r_keras:", r_keras)
print("r_pytorch:", r_pytorch)
plt.scatter(ypred_keras, yval); plt.title('Keras'); plt.show(); plt.close()
plt.scatter(ypred_torch, yval); plt.title('Pytorch'); plt.show(); plt.close()
Percent error difference: 479.1312469426776
r_keras: 0.9115184443702814
r_pytorch: 0.21728812737220082
The correlation of predicted values with ground truth is 0.912 for Keras but 0.217 for Pytorch, and the error for Pytorch is 479% higher!
7. Other trials
I also tried:
Lowering the learning rate for Pytorch (lr=1e-4), R increases from 0.217 to 0.576, but it's still much worse than Keras (r=0.912).
Increasing the learning rate for Pytorch (lr=1e-2), R is worse at 0.095
Training numerous times with different random seeds. The performance is roughly the same, regardless.
Trained for longer than 100 epochs. No improvement was observed!
Used torch.nn.init.xavier_uniform_ instead of torch.nn.init.xavier_normal_ in the initialization of the weights. R improves from 0.217 to 0.639, but it's still worse than Keras (0.912).
What can be done to ensure that the PyTorch model converges to a reasonable error comparable with the Keras model?
The problem here is unintentional broadcasting in the PyTorch training loop.
The result of a nn.Linear operation always has shape [B,D], where B is the batch size and D is the output dimension. Therefore, in your mean_squared_error function ypred has shape [32,1] and ytrue has shape [32]. By the broadcasting rules used by NumPy and PyTorch this means that ytrue - ypred has shape [32,32]. What you almost certainly meant is for ypred to have shape [32]. This can be accomplished in many ways; probably the most readable is to use Tensor.flatten
class TorchLinearModel(nn.Module):
...
def forward(self, x):
x = self.hidden_layer(x)
x = self.output_layer(x)
return x.flatten()
which produces the following train/val curves

Why am I getting "ValueError: No gradients provided for any variable: ['Variable:0']." error?

I'm extremely new to tensorflow, and I'm trying to build a style transfer model, I understand the concept of how the model is but am having difficulty at actually implementing it, since I don't fully understand what is going on in tensorflow, yet. When I try to run the optimization for the generated image I get the "No gradients provided" error, which I don't understand since my code has:
loss = total_loss(content_feats, style_feats, output_feats)
grad = tape.gradient(loss, output_processado)
optimizer.apply_gradients(zip([grad],[output_processado]))
ValueError Traceback (most recent call
last)
in ()
8
9 grad = tape.gradient(loss, output_processado)
---> 10 optimizer.apply_gradients(zip([grad],[output_processado]))
11
12 clip = tf.clip_by_value(output_processado, min_value, max_value)
1 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py in _filter_grads(grads_and_vars) 1217 if not filtered: 1218
raise ValueError("No gradients provided for any variable: %s." %
-> 1219 ([v.name for _, v in grads_and_vars],)) 1220 if vars_with_empty_grads: 1221 logging.warning(
ValueError: No gradients provided for any variable: ['Variable:0'].
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
import numpy as np
from PIL import Image
import requests
from io import BytesIO
from keras.applications.vgg19 import VGG19
from keras.applications.vgg19 import preprocess_input
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.models import Model
import keras.backend as K
from matplotlib import pyplot as plt
from numpy import expand_dims
from tensorflow import GradientTape
ITERATIONS = 10
CHANNELS = 3
IMAGE_SIZE = 500
IMAGE_WIDTH = IMAGE_SIZE
IMAGE_HEIGHT = IMAGE_SIZE
CONTENT_WEIGHT = 0.02
STYLE_WEIGHT = 4.5
MEAN = np.array([103.939, 116.779, 123.68])
CONTENT_LAYERS = ['block4_conv2']
STYLE_LAYERS = ['block1_conv1', 'block2_conv1', 'block3_conv1', 'block4_conv1', 'block5_conv1']
input_image_path = "input.png"
style_image_path = "style.png"
output_image_path = "output.png"
combined_image_path = "combined.png"
san_francisco_image_path = "https://www.economist.com/sites/default/files/images/print-edition/20180602_USP001_0.jpg"
tytus_image_path = "http://meetingbenches.com/wp-content/flagallery/tytus-brzozowski-polish-architect-and-watercolorist-a-fairy-tale-in-warsaw/tytus_brzozowski_13.jpg"
input_image = Image.open(BytesIO(requests.get(san_francisco_image_path).content))
input_image = input_image.resize((IMAGE_WIDTH, IMAGE_HEIGHT))
input_image.save(input_image_path)
#input_image
# Style visualization
style_image = Image.open(BytesIO(requests.get(tytus_image_path).content))
style_image = style_image.resize((IMAGE_WIDTH, IMAGE_HEIGHT))
style_image.save(style_image_path)
#style_image
def obter_modelo():
modelo = VGG19(include_top = False, weights = 'imagenet', input_tensor = None)
c_layer = CONTENT_LAYERS
s_layers = STYLE_LAYERS
output_layers = [modelo.get_layer(layer).output for layer in (c_layer + s_layers)]
return Model(modelo.inputs, output_layers)
def processar_imagem(img):
imagem = img.resize((IMAGE_HEIGHT, IMAGE_WIDTH))
imagem = img_to_array(imagem)
imagem = preprocess_input(imagem)
imagem = expand_dims(imagem, axis=0)
return imagem
def desprocessar_imagem(img):
imagem = img
mean = MEAN
imagem[..., 0] += mean[0]
imagem[..., 1] += mean[1]
imagem[..., 2] += mean[2]
imagem = imagem[..., ::-1]
return imagem.astype(int)
def content_loss(c_mat, out_mat):
return 0.5 * K.sum(K.square(out_mat - c_mat))
def matriz_gram(mat):
return K.dot(mat,K.transpose(mat))
def style_loss(s_mat, out_mat):
style_feat = K.batch_flatten(K.permute_dimensions(s_mat,(2,0,1)))
output_feat = K.batch_flatten(K.permute_dimensions(out_mat,(2,0,1)))
style_gram = matriz_gram(style_feat)
output_gram = matriz_gram(output_feat)
return K.sum(K.square(style_gram - output_gram)) / (4.0 * (CHANNELS ** 2) * (IMAGE_SIZE ** 2))
def total_loss(c_layer, s_layers, out_layers):
content_layer = c_layer[0]
out_content = out_layers[0]
style_layers = s_layers[1:]
out_style = out_layers[1:]
c_loss = content_loss(content_layer[0], out_content[0])
s_loss = None
for i in range(len(style_layers)):
if s_loss is None:
s_loss = style_loss(style_layers[i][0], out_style[i][0])
else:
s_loss += style_loss(style_layers[i][0], out_style[i][0])
return CONTENT_WEIGHT * c_loss + (STYLE_WEIGHT * s_loss)/len(style_layers)
modelo = obter_modelo()
#content image
content_processado = processar_imagem(input_image)
content_feats = modelo(K.variable(content_processado))
#style image
style_processado = processar_imagem(style_image)
style_feats = modelo(K.variable(style_processado))
#output image
output_processado = preprocess_input(np.random.uniform(0,250,(IMAGE_HEIGHT, IMAGE_WIDTH,CHANNELS)))
output_processado = expand_dims(output_processado, axis=0)
output_processado = K.variable(output_processado)
optimizer = tf.optimizers.Adam(5,beta_1=.99,epsilon=1e-3)
epochs=200
melhor_loss = K.variable(2000000.0)
melhor_imagem = None
min_value = MEAN
max_value = 255 + MEAN
loss = K.variable(0.0)
for e in range(epochs):
with tf.GradientTape() as tape:
tape.watch(output_processado)
output_feats = modelo(output_processado)
loss = total_loss(content_feats, style_feats, output_feats)
grad = tape.gradient(loss, output_processado)
optimizer.apply_gradients(zip([grad],[output_processado]))
clip = tf.clip_by_value(output_processado, min_value, max_value)
output_processado.assign(clip)
print("Epoch: " + str(e) )
For tape.gradient, you have to pass (loss, model.trainable_weights), but you are passing tape.gradient(loss, output_processado). Also for optimizer.apply_gradients, you have to pass (grad, model.trainable_variables), but you are passing (zip([grad],[output_processado]).
Calling a model inside a GradientTape scope enables you to retrieve the gradients of the trainable weights of the layer with respect to a loss value. Using an optimizer instance, you can use these gradients to update these variables (which you can retrieve using model.trainable_weights).
TensorFlow provides the tf.GradientTape API for automatic differentiation - computing the gradient of a computation with respect to its input variables. Tensorflow "records" all operations executed inside the context of a tf.GradientTape onto a "tape". Tensorflow then uses that tape and the gradients associated with each recorded operation to compute the gradients of a "recorded" computation using reverse mode differentiation.
If you want to process the gradients before applying them you can instead use the optimizer in three steps:
Compute the gradients with tf.GradientTape.
Process the gradients as you wish.
Apply the processed gradients with apply_gradients().
Here is a simple example for mnist data. The comments are present in the code to explain better.
Code-
import tensorflow as tf
print(tf.__version__)
from tensorflow import keras
from tensorflow.keras import layers
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
# Preprocess the data (these are Numpy arrays)
x_train = x_train.reshape(60000, 784).astype('float32') / 255
x_test = x_test.reshape(10000, 784).astype('float32') / 255
y_train = y_train.astype('float32')
y_test = y_test.astype('float32')
# Reserve 10,000 samples for validation
x_val = x_train[-10000:]
y_val = y_train[-10000:]
x_train = x_train[:-10000]
y_train = y_train[:-10000]
# Get the model.
inputs = keras.Input(shape=(784,), name='digits')
x = layers.Dense(64, activation='relu', name='dense_1')(inputs)
x = layers.Dense(64, activation='relu', name='dense_2')(x)
outputs = layers.Dense(10, name='predictions')(x)
model = keras.Model(inputs=inputs, outputs=outputs)
# Instantiate an optimizer.
optimizer = keras.optimizers.SGD(learning_rate=1e-3)
# Instantiate a loss function.
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
# Prepare the training dataset.
batch_size = 64
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(batch_size)
epochs = 3
for epoch in range(epochs):
print('Start of epoch %d' % (epoch,))
# Iterate over the batches of the dataset.
for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
# Open a GradientTape to record the operations run
# during the forward pass, which enables autodifferentiation.
with tf.GradientTape() as tape:
# Run the forward pass of the layer.
# The operations that the layer applies
# to its inputs are going to be recorded
# on the GradientTape.
logits = model(x_batch_train, training=True) # Logits for this minibatch
# Compute the loss value for this minibatch.
loss_value = loss_fn(y_batch_train, logits)
# Use the gradient tape to automatically retrieve
# the gradients of the trainable variables with respect to the loss.
grads = tape.gradient(loss_value, model.trainable_weights)
# Run one step of gradient descent by updating
# the value of the variables to minimize the loss.
optimizer.apply_gradients(zip(grads, model.trainable_weights))
# Log every 200 batches.
if step % 200 == 0:
print('Training loss (for one batch) at step %s: %s' % (step, float(loss_value)))
print('Seen so far: %s samples' % ((step + 1) * 64))
Output -
2.2.0
Start of epoch 0
Training loss (for one batch) at step 0: 2.323657512664795
Seen so far: 64 samples
Training loss (for one batch) at step 200: 2.3156163692474365
Seen so far: 12864 samples
Training loss (for one batch) at step 400: 2.2302279472351074
Seen so far: 25664 samples
Training loss (for one batch) at step 600: 2.131979465484619
Seen so far: 38464 samples
Start of epoch 1
Training loss (for one batch) at step 0: 2.00234317779541
Seen so far: 64 samples
Training loss (for one batch) at step 200: 1.7992427349090576
Seen so far: 12864 samples
Training loss (for one batch) at step 400: 1.8583933115005493
Seen so far: 25664 samples
Training loss (for one batch) at step 600: 1.6005337238311768
Seen so far: 38464 samples
Start of epoch 2
Training loss (for one batch) at step 0: 1.6701987981796265
Seen so far: 64 samples
Training loss (for one batch) at step 200: 1.6237502098083496
Seen so far: 12864 samples
Training loss (for one batch) at step 400: 1.3603084087371826
Seen so far: 25664 samples
Training loss (for one batch) at step 600: 1.246948480606079
Seen so far: 38464 samples
You can find more about tf.GradientTape here. The example used here is taken from here.
Hope this answers your question. Happy Learning.

Weighted Absolute Error implementation doesn't work in tensorflow (keras)

I have created custom loss (Weighted Absolute error) in keras but implementation doesn't work - I get an error ValueError: No gradients provided for any variable: ['my_model/conv2d/kernel:0', 'my_model/conv2d/bias:0'].
I want to apply different weight for each pixel.
class WeightedMeanAbsoluteError(tf.keras.metrics.Metric):
def __init__(self, name='weighted_mean_absolute_error'):
super(WeightedMeanAbsoluteError, self).__init__(name=name)
self.wmae = self.add_weight(name='wmae', initializer='zeros')
def update_state(self, y_true, y_pred, loss_weights):
values = tf.math.abs(y_true - y_pred) * loss_weights
return self.wmae.assign_add(tf.reduce_sum(values))
def result(self):
return self.wmae
def reset_states(self):
# The state of the metric will be reset at the start of each epoch.
self.wmae.assign(0.)
loss_object = WeightedMeanAbsoluteError()
train_loss = WeightedMeanAbsoluteError()
I use the following code to implement a training step:
#tf.function
def train_step(input_images, output_images):
with tf.GradientTape() as tape:
# training=True is only needed if there are layers with different
# behavior during training versus inference (e.g. Dropout).
result_images = model(input_images, training=True)
loss = loss_object(output_images, result_images)
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
Also my code works just fine if I use
loss_object = tf.keras.losses.MeanAbsoluteError()
train_loss = tf.keras.metrics.MeanAbsoluteError()
The best and simple way to minimize a weighted standard loss (such mae) is using the sample_weights parameter in fit method where we pass an array with the desired weight of each sample
X = np.random.uniform(0,1, (1000,50))
y = np.random.uniform(0,1, 1000)
W = np.random.randint(1,10, 1000)
inp = Input((50))
x = Dense(64, activation='relu')(inp)
out = Dense(10)(x)
model = Model(inp, out)
model.compile('adam','mae')
model.fit(X,y, epochs=100, sample_weights=W)

Reuse the weight matrix from embedding layer with #tf.function

Without using #tf.function, the script work perfectly
I want to use it to speed up training, but it's giving me error where I reuse the weight matrix from the embedding layers.
I think the error is caused by get_weights(), because it converts tensor back to numpy
I tried to use a tf.keras.layers.Dense instead of re-using the weights from embedding, and it worked perfectly.
class Example(tf.keras.Model):
def __init__(self,):
super(Example, self).__init__()
self.embed_dim = embed_dim
self.vocab_size = vocab_size
self.embed = tf.keras.layers.Embedding(self.vocab_size, self.embed_dim)
...
def call(self, inputs, trianing):
...
embed_matrix = self.embed.get_weights()
# a dense layer
Vhid = tf.matmul(self.kernel, tf.transpose(embed_matrix[0]))
pred_w = tf.matmul(pred, Vhid) + self.bias
In my train script.
I did
#tf.function
def train_step(x, y, training=None):
with tf.GradientTape() as tape:
pred = model(x, y, training)
losses = compute_loss(y, pred)
grads = tape.gradient(losses, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
return losses
/home/thomas/projects/tf_convsent/models/.py:195 call *
embed_matrix = self.embed.get_weights() # [vocab_size, 300]
/home/thomas/.conda/envs/tf2_p37/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py:1177 get_weights
return backend.batch_get_value(params)
/home/thomas/.conda/envs/tf2_p37/lib/python3.7/site-packages/tensorflow/python/keras/backend.py:3011 batch_get_value
raise RuntimeError('Cannot get value inside Tensorflow graph function.')
RuntimeError: Cannot get value inside Tensorflow graph function.
Found the easiest solution which improved 50% training speed(122 hrs to ~65 hrs)
just change
embed_matrix = self.embed.get_weights()
to
embed_matrix = self.embed.weights
will do the trick.

Tensorflow Eager Execution does not work with Learning Rate decay

trying here to make an eager exec model work with LR decay, but no success. It seems to be a bug, since it appear that the learning rate decay tensor does not get updated. If I am missing something can you land a hand here. Thanks.
The code bellow is learning some word embeddings. However, the learning rate decay section does not work at all.
class Word2Vec(tf.keras.Model):
def __init__(self, vocab_size, embed_size, num_sampled=NUM_SAMPLED):
self.vocab_size = vocab_size
self.num_sampled = num_sampled
self.embed_matrix = tfe.Variable(tf.random_uniform(
[vocab_size, embed_size]), name="embedding_matrix")
self.nce_weight = tfe.Variable(tf.truncated_normal(
[vocab_size, embed_size],
stddev=1.0 / (embed_size ** 0.5)), name="weights")
self.nce_bias = tfe.Variable(tf.zeros([vocab_size]), name="biases")
def compute_loss(self, center_words, target_words):
"""Computes the forward pass of word2vec with the NCE loss."""
embed = tf.nn.embedding_lookup(self.embed_matrix, center_words)
loss = tf.reduce_mean(tf.nn.nce_loss(weights=self.nce_weight,
biases=self.nce_bias,
labels=target_words,
inputs=embed,
num_sampled=self.num_sampled,
num_classes=self.vocab_size))
return loss
def gen():
yield from word2vec_utils.batch_gen(DOWNLOAD_URL, EXPECTED_BYTES,
VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW,
VISUAL_FLD)
def main():
dataset = tf.data.Dataset.from_generator(gen, (tf.int32, tf.int32),
(tf.TensorShape([BATCH_SIZE]),
tf.TensorShape([BATCH_SIZE, 1])))
global_step = tf.train.get_or_create_global_step()
starter_learning_rate = 1.0
end_learning_rate = 0.01
decay_steps = 1000
learning_rate = tf.train.polynomial_decay(starter_learning_rate, global_step.numpy(),
decay_steps, end_learning_rate,
power=0.5)
train_writer = tf.contrib.summary.create_file_writer('./checkpoints')
train_writer.set_as_default()
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.95)
model = Word2Vec(vocab_size=VOCAB_SIZE, embed_size=EMBED_SIZE)
grad_fn = tfe.implicit_value_and_gradients(model.compute_loss)
total_loss = 0.0 # for average loss in the last SKIP_STEP steps
checkpoint_dir = "./checkpoints/"
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
root = tfe.Checkpoint(optimizer=optimizer,
model=model,
optimizer_step=tf.train.get_or_create_global_step())
while global_step < NUM_TRAIN_STEPS:
for center_words, target_words in tfe.Iterator(dataset):
with tf.contrib.summary.record_summaries_every_n_global_steps(100):
if global_step >= NUM_TRAIN_STEPS:
break
loss_batch, grads = grad_fn(center_words, target_words)
tf.contrib.summary.scalar('loss', loss_batch)
tf.contrib.summary.scalar('learning_rate', learning_rate)
# print(grads)
# print(len(grads))
total_loss += loss_batch
optimizer.apply_gradients(grads, global_step)
if (global_step.numpy() + 1) % SKIP_STEP == 0:
print('Average loss at step {}: {:5.1f}'.format(
global_step.numpy(), total_loss / SKIP_STEP))
total_loss = 0.0
root.save(file_prefix=checkpoint_prefix)
if __name__ == '__main__':
main()
Note that when eager execution is enabled, the tf.Tensor objects represent concrete values (as opposed to symbolic handles of computation that will occur on Session.run() calls).
As a result, in your code snippet above, the line:
learning_rate = tf.train.polynomial_decay(starter_learning_rate, global_step.numpy(),
decay_steps, end_learning_rate,
power=0.5)
is computing the decayed value once, using the global_step at the time it was invoked, and when the optimizer is being created with:
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.95)
it is being given a fixed learning rate.
To decay the learning rate, you'd want to invoke tf.train.polynomial_decay repeatedly (with updated values for global_step). One way to do this would be to replicate what is done in the RNN example, using something like this:
starter_learning_rate = 1.0
learning_rate = tfe.Variable(starter_learning_rate)
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.95)
while global_step < NUM_TRAIN_STEPS:
# ....
learning_rate.assign(tf.train.polynomial_decay(starter_learning_rate, global_step, decay_steps, end_learning_rate, power=0.5))
This way you've captured the learning_rate in a variable that can be updated. Furthermore, it's simple to include the current learning_rate in the checkpoint as well (by including it when creating the Checkpoint object).
Hope that helps.