Keras ML library: how to do weight clipping after gradient updates? TensorFlow backend

I'm trying to use Keras to implement part of an algorithm that requires weight clipping, i.e. limiting the weight values after a gradient update. I haven't found any solutions through web searches so far.
For background, this has to do with the WGANs algorithm:
https://arxiv.org/pdf/1701.07875.pdf
If you look at algorithm 1 on page 8, you'll see the following:
I've highlighted the lines that I'm trying to implement in Keras: after computing a gradient to use to update the weights in the network, I want to make sure that all the weights are clipped between some values [-c, c] that I can set.
How could I go about doing this in Keras?
For reference I am using the TensorFlow backend. I don't mind digging into things and adding messy quick-fixes for now.

While creating the optimizer object, set the parameter clipvalue. It will do precisely what you want.
from keras.optimizers import RMSprop

# all parameter gradients will be clipped to
# a maximum value of 0.5 and
# a minimum value of -0.5
rmsprop = RMSprop(clipvalue=0.5)
and then use this object when compiling the model:
model.compile(loss='mse', optimizer=rmsprop)
For more reference, check the Keras optimizer documentation.
Also, I prefer clipnorm over clipvalue, because clipnorm keeps the optimization more stable. For example, say you have two parameters and the gradients come out to be [0.1, 3]. With clipvalue the gradients become [0.1, 0.5], so there is a chance that the direction of steepest descent changes drastically. clipnorm does not have this problem: all gradient components are scaled proportionally, the direction is preserved, and the constraint on the magnitude of the gradient is still enforced.
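For instance, a minimal sketch of the clipnorm variant (same usage pattern as above):
# clip the L2 norm of each parameter's gradient tensor to at most 1.0,
# preserving the gradient direction
rmsprop = RMSprop(clipnorm=1.0)
model.compile(loss='mse', optimizer=rmsprop)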
Edit: the question asks about weight clipping, not gradient clipping:
Clipping of the weights themselves is not part of the Keras code, but a maxnorm constraint on weights is; check the Keras constraints documentation.
Having said that, weight clipping can easily be implemented as a custom constraint. Here is a very small example:
from keras.constraints import Constraint
from keras import backend as K

class WeightClip(Constraint):
    '''Clips the weights incident to each hidden unit to be inside a range.'''
    def __init__(self, c=2):
        self.c = c

    def __call__(self, p):
        return K.clip(p, -self.c, self.c)

    def get_config(self):
        return {'name': self.__class__.__name__,
                'c': self.c}
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(30, input_dim=100, W_constraint = WeightClip(2)))
model.add(Dense(1))
model.compile(loss='mse', optimizer='rmsprop')
X = np.random.random((1000,100))
Y = np.random.random((1000,1))
model.fit(X,Y)
I have tested that the above code runs, but not the validity of the constraints. You can do so by getting the model weights after training using model.get_weights() or model.layers[idx].get_weights() and checking whether they abide by the constraints.
Note: the constraint is not added to all the model weights, but only to the weights of the specific layer where it is used. Also, W_constraint adds the constraint to the W (kernel) parameter and b_constraint to the b (bias) parameter.
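As a quick sanity check (a sketch, assuming the model from the example above has been trained):
import numpy as np

W = model.layers[0].get_weights()[0]   # kernel of the constrained Dense layer
print(np.abs(W).max())                 # should not exceed 2
assert np.all(np.abs(W) <= 2.0)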

Related

Custom optimizer with multiple loss function evaluations

I want to implement a custom optimization algorithm for TF models.
I have read the following sources:
tf documentation on custom optimizers
tf SGD implementation
keras documentation on custom models
towardsdatascience guide on custom optimizers
However, a lot of questions remain.
It seems that it is not possible to evaluate the loss function multiple times (for different weight settings) before applying a gradient step when using the custom optimizer API. In a line-search type of algorithm, for example, this is necessary.
I tried to do all steps manually.
Assume I have set up my model and my optimization problem like this:
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import models

model = models.Sequential()
model.add(layers.Dense(15, input_dim=10))
model.add(layers.Dense(20))
model.add(layers.Dense(1))

x_train, y_train = get_train_data()
loss = losses.MeanSquaredError()

def val_and_grads(weights):
    model.set_weights(weights)
    with tf.GradientTape() as tape:
        val = loss(y_train, model(x_train))
    grads = tape.gradient(val, model.trainable_variables)
    return val, grads

initial_weights = model.get_weights()
optimal_weights = my_fancy_optimization_algorithm(val_and_grads, initial_weights)
However, my function val_and_grads takes a list of weight arrays and returns a list of gradients, which seems unnatural from my_fancy_optimization_algorithm's point of view.
I could wrap val_and_grads to "stack" the returned gradients and "split" the passed weights like this:
def wrapped_val_and_grad(weights):
    val, grads = val_and_grads(split_weights(weights))
    return val, stack_grads(grads)
however, that seems very inefficient.
Anyway, I do not like this approach, since it seems that I would lose out on a lot of the surrounding tensorflow infrastructure (printing of current loss function values and metrics during learning, tensorboard stuff, ...).
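(For reference, split_weights and stack_grads are not defined in the question; a possible numpy-based sketch of such helpers, assuming a flat 1-D parameter vector, could look like this:)
import numpy as np

def stack_grads(grads):
    # flatten every gradient tensor and concatenate into one 1-D vector
    return np.concatenate([g.numpy().ravel() for g in grads])

def split_weights(flat_weights):
    # cut the flat vector back into arrays matching the model's weight shapes
    shapes = [w.shape for w in model.get_weights()]
    sizes = [int(np.prod(s)) for s in shapes]
    chunks = np.split(flat_weights, np.cumsum(sizes)[:-1])
    return [c.reshape(s) for c, s in zip(chunks, shapes)]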
I could also pack the above into a custom model with a tailored train_step like this:
class CustomModel(tf.keras.Model):
    def train_step(self, data):
        x_train, y_train = data

        def val_and_grads(weights):
            self.set_weights(weights)
            with tf.GradientTape() as tape:
                val = loss(y_train, self(x_train))
            grads = tape.gradient(val, self.trainable_variables)
            return val, grads

        trainable_vars = self.trainable_variables
        old_weights = self.get_weights()
        # this can do multiple evaluations of the model
        update = my_fancy_update_finding_algorithm(val_and_grads, self.get_weights())
        self.set_weights(old_weights)  # restore the weights
        self.optimizer.apply_gradients(zip(update, trainable_vars))
Here I would need an accompanying custom optimizer that does nothing other than updating the current weights by adding the update (new_weights = current_weights + update).
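(One way to get such an "add the update" behaviour without writing a new optimizer class, sketched here as an assumption rather than something from the question: plain SGD with learning rate 1.0 applied to the negated update, since apply_gradients computes new = old - lr * grad.)
import tensorflow as tf

# new = old - 1.0 * (-update) = old + update
identity_optimizer = tf.keras.optimizers.SGD(learning_rate=1.0)
negated_update = [-u for u in update]   # 'update' and 'trainable_vars' as in the snippet above
identity_optimizer.apply_gradients(zip(negated_update, trainable_vars))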
I am still unsure if this is the best way to go.
If someone can comment on the snippets and ideas above, guide me to any other resource that I should consider or provide new approaches and other feedback I would be very glad.
Thanks all.
Franz
EDIT:
Sadly, I did not get any response here so far. Maybe my question is not concrete enough. As a first, smaller question:
Given the model and val_and_grads from the first listing, how would I efficiently calculate the norm of the WHOLE gradient? What I do so far is:
import numpy as np

_, grads = val_and_grads(model.get_weights())
norm_grads = np.linalg.norm(np.concatenate([grad.numpy().flatten() for grad in grads]))
This surely cannot be the "right" way.
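(As a side note, TensorFlow has a built-in helper for this; a minimal sketch using tf.linalg.global_norm, which returns the norm of the concatenation of all tensors:)
import tensorflow as tf

_, grads = val_and_grads(model.get_weights())
norm_grads = tf.linalg.global_norm(grads)  # sqrt of the sum of squares over all gradient tensors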

Learning a Categorical Variable with TensorFlow Probability

I would like to use TFP to write a neural network whose outputs are the probabilities of a categorical variable with 3 classes, and train it using the negative log-likelihood.
As I'm taking my first steps with TF and TFP, I started with a toy model where the input layer has only 1 unit receiving a null input, and the output layer has 3 units with a softmax activation function. The idea is that the biases should learn (up to an additive constant) the log of the probabilities.
Below is my code: true_p are the true parameters I use to generate the data and would like to learn, while learned_p is what I get from the NN.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from functions import nll
from tensorflow.keras.optimizers import SGD
import tensorflow.keras.layers as layers
import tensorflow_probability as tfp
tfd = tfp.distributions
# params
true_p = np.array([0.1, 0.7, 0.2])
n_train = 1000
# training data
x_train = np.array(np.zeros(n_train)).reshape((n_train,))
y_train = np.array(np.random.choice(len(true_p), size=n_train, p=true_p)).reshape((n_train,))
# model
input_layer = layers.Input(shape=(1,))
p_layer = layers.Dense(len(true_p), activation=tf.nn.softmax)(input_layer)
p_y = tfp.layers.DistributionLambda(tfd.Categorical)(p_layer)
model_p = keras.models.Model(inputs=input_layer, outputs=p_y)
model_p.compile(SGD(), loss=nll)
# training
hist_p = model_p.fit(x=x_train, y=y_train, batch_size=100, epochs=3000, verbose=0)
# check result
learned_p = np.round(model_p.layers[1].call(tf.constant([0], shape=(1, 1))).numpy(), 3)
learned_p
With this setup, I get the result:
>>> learned_p
array([[0.005, 0.989, 0.006]], dtype=float32)
I over-estimate the second category and can't really distinguish between the first and the third one. What's worse, if I plot the probabilities at the end of each epoch, it looks like they are converging monotonically to the vector [0, 1, 0], which doesn't make sense (it seems to me the gradient should push in the opposite direction once I start to over-estimate).
I really can't figure out what's going on here, but have the feeling I'm doing something plain wrong. Any idea? Thank you for your help!
For the record, I also tried using other optimizers like Adam or Adagrad and playing with the hyper-parameters, but with no luck.
I'm using Python 3.7.9, TensorFlow 2.3.1 and TensorFlow probability 0.11.1
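(The loss nll is imported from a local module and not shown. For a DistributionLambda output, a typical negative log-likelihood loss would look like the sketch below; this is an assumption about what the asker's code does, not a copy of it.)
def nll(y_true, dist):
    # dist is the tfd.Categorical produced by the DistributionLambda layer
    return -dist.log_prob(y_true)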
I believe the default argument to Categorical is not the vector of probabilities, but the vector of logits (values you'd take softmax of to get probabilities). This is to help maintain precision in internal Categorical computations like log_prob. I think you can simply eliminate the softmax activation function and it should work. Please update if it doesn't!
EDIT: alternatively you can replace the tfd.Categorical with
lambda p: tfd.Categorical(probs=p)
but you'll lose the aforementioned precision gains. Just wanted to clarify that passing probs is an option, just not the default.
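A minimal sketch of the first suggestion, dropping the softmax so the Dense layer outputs logits (names follow the question's code):
input_layer = layers.Input(shape=(1,))
logit_layer = layers.Dense(len(true_p))(input_layer)                  # no activation: raw logits
p_y = tfp.layers.DistributionLambda(tfd.Categorical)(logit_layer)     # Categorical interprets its input as logits by default
model_p = keras.models.Model(inputs=input_layer, outputs=p_y)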

Delayed echo of sin - cannot reproduce Tensorflow result in Keras

I am experimenting with LSTMs in Keras with little to no luck. At some point I decided to scale back to the most basic problems in order to finally achieve some positive result.
However, even with the simplest problems I find that Keras is unable to converge, while the implementation of the same problem in Tensorflow gives a stable result.
I am unwilling to just switch to Tensorflow without understanding why Keras keeps diverging on any problem I attempt.
My problem is a many-to-many sequence prediction of delayed sin echo, example below:
Blue line is a network input sequence, red dotted line is an expected output.
The experiment was inspired by this repo, and a workable Tensorflow solution was also created from it.
The relevant excerpts from my code are below, and the full version of my minimal reproducible example is available here.
Keras model:
model = Sequential()
model.add(LSTM(n_hidden,
               input_shape=(n_steps, n_input),
               return_sequences=True))
model.add(TimeDistributed(Dense(n_input, activation='linear')))
model.compile(loss=custom_loss,
              optimizer=keras.optimizers.Adam(lr=learning_rate),
              metrics=[])
Tensorflow model:
x = tf.placeholder(tf.float32, [None, n_steps, n_input])
y = tf.placeholder(tf.float32, [None, n_steps])

weights = {
    'out': tf.Variable(tf.random_normal([n_hidden, n_steps], seed=SEED))
}
biases = {
    'out': tf.Variable(tf.random_normal([n_steps], seed=SEED))
}

lstm = rnn.LSTMCell(n_hidden, forget_bias=1.0)
outputs, states = tf.nn.dynamic_rnn(lstm, inputs=x,
                                    dtype=tf.float32,
                                    time_major=False)
h = tf.transpose(outputs, [1, 0, 2])
pred = tf.nn.bias_add(tf.matmul(h[-1], weights['out']), biases['out'])
individual_losses = tf.reduce_sum(tf.squared_difference(pred, y),
                                  reduction_indices=1)
loss = tf.reduce_mean(individual_losses)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate) \
    .minimize(loss)
I claim that the other parts of the code (data generation, training) are completely identical. But the learning progress with Keras stalls early and yields unsatisfactory predictions. Graphs of logloss for both libraries and example predictions are attached below:
Logloss for Tensorflow-trained model:
Logloss for Keras-trained model:
It's not easy to read from the graph, but Tensorflow reaches target_loss=0.15 and stops early after about 10k batches, while Keras uses up all 13k batches and only reaches a loss of about 1.5. In a separate experiment where Keras ran for 100k batches, it went no further, stalling around 1.0.
Figures below contain: black line - model input signal, green dotted line - ground truth output, red line - acquired model output.
Predictions of Tensorflow-trained model:
Predictions of Keras-trained model:
Thank you for suggestions and insights, dear colleagues!
Ok, I have managed to solve this. The Keras implementation now converges steadily to a sensible solution too:
The models were in fact not identical. If you inspect the Tensorflow model from the question with extra care, you can verify for yourself that the actual Keras equivalent is the one listed below, and not what was stated in the question:
model = Sequential()
model.add(LSTM(n_hidden,
               input_shape=(n_steps, n_input),
               return_sequences=False))
model.add(Dense(n_steps, input_shape=(n_hidden,), activation='linear'))
model.compile(loss=custom_loss,
              optimizer=keras.optimizers.Adam(lr=learning_rate),
              metrics=[])
I will elaborate. The workable solution here uses the last output of size n_hidden emitted by the LSTM as an intermediate activation, which is then fed to the Dense layer.
So, in a way, the actual prediction here is made by a regular perceptron.
One extra takeaway: the source of the mistake in the original Keras solution is already evident from the inference examples attached to the question. We see there that the earlier timesteps fail utterly, while later timesteps are near perfect. These earlier timesteps correspond to the states of the LSTM when it has just been initialized on a new window and is still clueless about context.

Having trouble introducing a constant matrix to multiplication in Tensorflow

I am trying to implement a layer that is not fully connected. I have a matrix that specifies the connectivity I desire in the variable connectivity_matrix, which is a numpy array of ones and zeros.
The way I am currently trying to implement the layer is by pairwise multiplying the weights by this connectivity matrix F:
Is this the correct way to do this in tensorflow? Here is what I have so far:
import numpy as np
import tensorflow as tf
import tflearn
num_input = 10
num_layer1 = 313
num_output = 700
# For example:
connectivity_matrix = np.array(np.random.choice([0, 1], size=(num_layer1, num_output)), dtype='float32')
input = tflearn.input_data(shape=[None, num_input])
# Here is where I specify the connectivity in tensorflow
connectivity = tf.constant(connectivity_matrix, shape=[num_layer1, num_output])
# One basic, fully connected layer
layer1 = tflearn.fully_connected(input, num_layer1, activation='relu')
# Here is where I want to have a non-fully connected layer
W = tf.Variable(tf.random_uniform([num_layer1, num_output]))
b = tf.Variable(tf.zeros([num_output]))
# so take a fully connected W, and do a pairwise multiplication with my tf_connectivity matrix
W_filtered = tf.multiply(connectivity, W)
output = tf.matmul(layer1, W_filtered) + b
Masking out unwanted connections in each iteration should work, but I am not sure what the convergence properties are like. It may be okay for a small enough learning rate.
Another approach would be to penalize unwanted weights in the cost function. You would use a mask matrix with 1's at unwanted connections and 0's at wanted ones (or have a smoother transition). This would be multiplied by the weights, squared/scaled, and added to the cost function. This should converge more smoothly.
P.S.: If you've made progress on this, it would be great to hear your comments as I am also working on this problem.
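(A minimal, untested sketch of that penalty idea, reusing connectivity_matrix and W from the question; base_cost and penalty_strength are hypothetical placeholders:)
# mask with 1.0 wherever a connection is NOT wanted
penalty_mask = 1.0 - connectivity_matrix
penalty_strength = 1e-2  # hypothetical scaling factor

# squared magnitude of all unwanted weights, added to the training cost
connection_penalty = penalty_strength * tf.reduce_sum(tf.square(penalty_mask * W))
total_cost = base_cost + connection_penalty  # base_cost = your existing loss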

Tensorflow: optimize over input with gradient descent

I have a TensorFlow model (a convolutional neural network) which I successfully trained using gradient descent (GD) on some input data.
Now, in a second step, I would like to provide an input image as initialization and then optimize over this input image with fixed network parameters using GD. The loss function will be a different one, but that is a detail.
So, my main question is how to tell the gradient descent algorithm to
stop optimizing the network parameters
optimize over the input image instead
The first can probably be done with this:
Holding variables constant during optimizer
Do you guys have ideas about the second point?
I guess I can recode the gradient descent algorithm myself using the TF gradient function, but my gut feeling tells me that there should be an easier way, which also allows me to benefit from more complex GD variants (Adam etc.).
No need for your own SGD implementation. TensorFlow provides all the necessary functions:
import tensorflow as tf
import numpy as np

# some input
data_pldhr = tf.placeholder(tf.float32)
img_op = tf.get_variable('input_image', [1, 4, 4, 1], dtype=tf.float32, trainable=True)
img_assign = img_op.assign(data_pldhr)

# your starting image
start_value = (np.ones((4, 4), dtype=np.float32) + np.eye(4))[None, :, :, None]

# override variable_getter
def nontrainable_getter(getter, *args, **kwargs):
    kwargs['trainable'] = False
    return getter(*args, **kwargs)

# all variables in this scope are not trainable
with tf.variable_scope('myscope', custom_getter=nontrainable_getter):
    x = tf.layers.dense(img_op, 10)
    y = tf.layers.dense(x, 10)

# the usual stuff
cost_op = tf.losses.mean_squared_error(x, y)
train_op = tf.train.AdamOptimizer(0.1).minimize(cost_op)

# fire up the training process
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(img_assign, {data_pldhr: start_value})
    print(sess.run(img_op))
    for i in range(10):
        _, c = sess.run([train_op, cost_op])
        print(c)
    print(sess.run(img_op))
represent an image as tf.Variable with trainable=True
initialise this variable with the starting image (initial guess)
recreate the NN graph using TF variables with trainable=False and copy the weights from the trained NN graph using tf.assign
calculate the loss function
plug the loss into any TF optimiser algorithm you want
Another alternative is to use ScipyOptimizerInterface, which allows you to use scipy's minimizers. These support constrained minimization.
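(For reference, a rough sketch of the ScipyOptimizerInterface route; this class lived in tf.contrib.opt in TF 1.x, and var_list restricts the optimization to the image variable from the snippet above:)
from tensorflow.contrib.opt import ScipyOptimizerInterface

scipy_opt = ScipyOptimizerInterface(cost_op,
                                    var_list=[img_op],        # only optimize the input image
                                    method='L-BFGS-B',
                                    options={'maxiter': 100})
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(img_assign, {data_pldhr: start_value})
    scipy_opt.minimize(sess)
    print(sess.run(img_op))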
I'm looking for a solution to the same problem, but my model is not an easy one, as I have an LSTM network with cells created with MultiRNNCell. I don't think it is possible to get the weights and clone the network. Is there any workaround so that I can compute the gradient with respect to the input?