Implementation of early stopping with gradient descent - optimization

I am developing an algorithm based on gradient descent and I would like to add early stopping regularization. I have an objective function, F, which I minimize with respect to W.
This is given in the code below:
Data: X_Train, Y_Train
t = 1;
while (t < maxIteration):
    W = W - step * Grad(F, X_Train, W);
    loss(t) = computeLoss(X_Train, Y_Train, W);
    t = t + 1;
end
Now I want to add early stopping regularization: this technique consists in choosing the moment at which to stop the optimization process (break the loop). How should I choose this moment? Do I have to evaluate my model on the validation data at each iteration and keep a history?
What I'm trying to do is given below:
Data: X_Train, Y_Train, X_val, Y_val;
t = 1;
maxIteration = 100;
models = array of size maxIteration;
while (t < maxIteration):
    W = W - step * Grad(F, X_Train, W);
    loss(t) = computeLoss(X_Train, Y_Train, W);
    models(t) = W;
    t = t + 1;
end
How do I choose the W model among all those I have stored?
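A common approach, shown below as a minimal sketch on a toy least-squares problem (F, Grad and computeLoss are not given here, so the data and the loss are stand-ins), is not to store every model but to track the validation loss at each iteration, remember only the best W seen so far, and stop once the validation loss has not improved for a fixed number of iterations (the "patience"):

import numpy as np

# Toy stand-ins for X_Train/Y_Train and X_val/Y_val
rng = np.random.default_rng(0)
X_train, Y_train = rng.normal(size=(80, 5)), rng.normal(size=80)
X_val, Y_val = rng.normal(size=(20, 5)), rng.normal(size=20)

def compute_loss(X, Y, W):           # stand-in for computeLoss
    return np.mean((X @ W - Y) ** 2)

def grad(X, Y, W):                   # stand-in for Grad(F, X, W)
    return 2 * X.T @ (X @ W - Y) / len(Y)

W = np.zeros(5)
step, max_iteration, patience = 0.01, 100, 10
best_W, best_val_loss, since_improvement = W.copy(), np.inf, 0

for t in range(max_iteration):
    W = W - step * grad(X_train, Y_train, W)
    val_loss = compute_loss(X_val, Y_val, W)   # evaluate on the validation set
    if val_loss < best_val_loss:               # new best: keep a copy of these weights
        best_val_loss, best_W, since_improvement = val_loss, W.copy(), 0
    else:
        since_improvement += 1
    if since_improvement >= patience:          # no improvement for `patience` iterations
        break

# best_W is the model to keep.

With this scheme the "history" reduces to a single best-so-far copy of W and a counter, so there is no need to store all intermediate models.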

Related

GradientTape for variable weighted sum of two Sequential models in TensorFlow

Suppose we want to minimize the following objective using gradient descent:
min f(alpha * v + (1 - alpha) * w)
where v and w are the model weights and alpha, between 0 and 1, is the mixing weight for the sum, which yields the combined model v_bar or ū (here referred to as m).
alpha = tf.Variable(0.01, name='Alpha', constraint=lambda t: tf.clip_by_value(t, 0, 1))
w_weights = tff.learning.ModelWeights.from_model(w)
v_weights = tff.learning.ModelWeights.from_model(v)
m_weights = tff.learning.ModelWeights.from_model(m)
m_weights_trainable = tf.nest.map_structure(lambda v, w: alpha*v + (tf.constant(1.0) - alpha)*w, v_weights.trainable, w_weights.trainable)
tf.nest.map_structure(lambda v, t: v.assign(t), m_weights.trainable, m_weights_trainable)
In the Adaptive Personalized Federated Learning paper, the formula for the alpha update step suggests updating alpha based on the gradients of model m on a minibatch. I tried it with and without the watch, but it always leads to "No gradients provided for any variable":
with tf.GradientTape(watch_accessed_variables=False) as tape:
    tape.watch([alpha])
    outputs_m = m.forward_pass(batch)
grad = tape.gradient(outputs_m.loss, alpha)
optimizer.apply_gradients(zip([grad], [alpha]))
Some more information about the initialization of the models:
m.forward_pass(batch) is the default implementation from tff.learning.Model, obtained by creating the model with tff.learning.from_keras_model from a tf.keras.Sequential model.
def model_fn():
    keras_model = create_keras_model()
    return tff.learning.from_keras_model(
        keras_model,
        input_spec=element_spec,
        loss=tf.keras.losses.MeanSquaredError(),
        metrics=[tf.keras.metrics.MeanSquaredError(),
                 tf.keras.metrics.MeanAbsoluteError()],
    )
w = model_fn()
v = model_fn()
m = model_fn()
Some more experimenting, as suggested below by Zachary Garrett:
It seems that whenever this weighted sum is calculated and the new weights are assigned to the model, the tape loses track of the trainable variables of both summed models. Again, it leads to "No gradients provided for any variable" whenever optimizer.apply_gradients(zip([grad], [alpha])) is called; all gradients are None.
with tf.GradientTape() as tape:
    alpha = tf.Variable(0.01, name='Alpha', constraint=lambda t: tf.clip_by_value(t, 0, 1))
    m_weights_trainable = tf.nest.map_structure(
        lambda w, v: tf.math.scalar_mul(alpha, v) + tf.math.scalar_mul(tf.constant(1.0) - alpha, w),
        w.trainable,
        v.trainable)
    m_weights = tff.learning.ModelWeights.from_model(m)
    tf.nest.map_structure(lambda v, t: v.assign(t), m_weights.trainable,
                          m_weights_trainable)
    outputs_m = m.forward_pass(batch)
grad = tape.gradient(outputs_m.loss, alpha)
optimizer.apply_gradients(zip([grad], [alpha]))
Another edit:
I think I have a strategy to get it working, but it feels like bad practice, and manually setting trainable_weights or _trainable_weights does not work. Any tips on improving this?
def do_weighted_combination():
    def _mapper(target_layer, v_layer, w_layer):
        target_layer.kernel = v_layer.kernel * alpha + w_layer.kernel * (1 - alpha)
        target_layer.bias = v_layer.bias * alpha + w_layer.bias * (1 - alpha)
    tf.nest.map_structure(_mapper, m.layers, v.layers, w.layers)

with tf.GradientTape(persistent=True) as tape:
    do_weighted_combination()
    predictions = m(x_data)
    loss = m.compiled_loss(y_data, predictions)

g1 = tape.gradient(loss, v.trainable_weights)  # Not None
g2 = tape.gradient(loss, alpha)  # Not None
For TensorFlow auto-differentiation using tf.GradientTape, operations must occur within the tf.GradientTape Python context manager so that TensorFlow can "see" them.
Possibly what is happening here is that alpha is used outside/before the tape context, when setting the model variables. Then, when m.forward_pass is called, TensorFlow doesn't see any access to alpha and thus can't compute a gradient for it (instead returning None).
Moving the
alpha*v + (tf.constant(1.0) - alpha)*w, v_weights.trainable, w_weights.trainable
logic inside the tf.GradientTape context manager (possibly inside m.forward_pass) may be a solution.
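To illustrate the point, here is a minimal sketch in plain TensorFlow (not TFF; the two-variable toy model, v_vars/w_vars, and the mean-squared-error loss are assumptions made only for this example) in which the alpha-weighted combination is built inside the tape, so a gradient with respect to alpha exists:

import tensorflow as tf

alpha = tf.Variable(0.01, name='Alpha', constraint=lambda t: tf.clip_by_value(t, 0.0, 1.0))
v_vars = [tf.Variable(tf.random.normal([10, 1])), tf.Variable(tf.zeros([1]))]
w_vars = [tf.Variable(tf.random.normal([10, 1])), tf.Variable(tf.zeros([1]))]
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

x = tf.random.normal([32, 10])
y = tf.random.normal([32, 1])

with tf.GradientTape() as tape:
    # Build the mixed weights *inside* the tape so the use of alpha is recorded.
    mixed = [alpha * v + (1.0 - alpha) * w for v, w in zip(v_vars, w_vars)]
    predictions = tf.matmul(x, mixed[0]) + mixed[1]
    loss = tf.reduce_mean(tf.square(predictions - y))

grad = tape.gradient(loss, alpha)          # a scalar tensor, no longer None
optimizer.apply_gradients([(grad, alpha)])

The same structure should carry over to the TFF setup above: compute the weighted sum and run the forward pass inside one tape, and only then ask for the gradient of the loss with respect to alpha.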

How to minimise a multivariate cost function in Julia with Optim?

I am currently stuck trying to use the Optim package in Julia to minimize a cost function. The cost function is that of an L2-regularised logistic regression. It is constructed as follows:
using Optim
function regularised_cost(X, y, θ, λ)
    m = length(y)
    # Sigmoid predictions
    h = sigmoid(X * θ)
    # Left side of the cost function
    positive_class_cost = (-y)' * log.(h)
    # Right side of the cost function
    negative_class_cost = (1 .- y)' * log.(1 .- h)
    # Lambda effect
    lambda_regularization = (λ / (2 * m)) * sum(θ[2:end] .^ 2)
    # Current batch cost
    𝐉 = (1 / m) * (positive_class_cost - negative_class_cost) + lambda_regularization
    # Gradients for all the theta members with regularization, except the constant
    ∇𝐉 = (1 / m) * X' * (h - y) + (1 / m) * (λ * θ)
    ∇𝐉[1] = (1 / m) * (X[:, 1])' * (h - y)  # Exclude the constant
    return (𝐉, ∇𝐉)
end
I would like to use the LBFGS algorithm as a solver to find the best weights that minimize this function, based on my training examples and labels, which are defined as:
opt_train = [ones(size(X_train_scaled, 1)) X_train_scaled] # added intercept
initial_theta = zeros(size(opt_train, 2))
Having read the documentation, here is my current implementation, which is not working:
cost, gradient! = regularised_cost(opt_train, y_train, initial_theta, 0.01)

res = optimize(regularised_cost,
               gradient!,
               initial_theta,
               LBFGS(),
               Optim.Options(g_tol = 1e-12,
                             iterations = 1000,
                             store_trace = true,
                             show_trace = true))
How do I pass my training examples and labels along with the gradients so that the solver (LBFGS) can find me the best weights for theta?
You need to close over your training data and make a loss function that takes only the parameters as input.
As per the docs on dealing with constant parameters, it should be something like:
loss_and_grad(theta) = regularised_cost(opt_train, y_train, theta, 0.01)
loss(theta) = first(loss_and_grad(theta))
res = optimize(loss, initial_theta)
I will leave it to you to see how to hook the gradient in.
A reminder though: don't use non-const globals.
They are slow; in particular, the way they are used in the loss_and_grad function I wrote will be slow.
So you should declare opt_train and y_train as const,
or make a function that takes them and returns a loss function, etc.

LSTM from scratch in tensorflow 2

I'm trying to build an LSTM in TensorFlow 2.1 from scratch, without using the one already supplied with Keras (tf.keras.layers.LSTM), just to learn and code something. To do so, I've defined a class "Model" that, when called (like model(input)), computes the matrix multiplications of the LSTM. I'm pasting part of my code here; the other parts are on GitHub (link).
class Model(object):
    [...]
    def __call__(self, inputs):
        assert inputs.shape == (vocab_size, T_steps)
        outputs = []
        for time_step in range(T_steps):
            x = inputs[:, time_step]
            x = tf.expand_dims(x, axis=1)
            z = tf.concat([self.h_prev, x], axis=0)
            f = tf.matmul(self.W_f, z) + self.b_f
            f = tf.sigmoid(f)
            i = tf.matmul(self.W_i, z) + self.b_i
            i = tf.sigmoid(i)
            o = tf.matmul(self.W_o, z) + self.b_o
            o = tf.sigmoid(o)
            C_bar = tf.matmul(self.W_C, z) + self.b_C
            C_bar = tf.tanh(C_bar)
            C = (f * self.C_prev) + (i * C_bar)
            h = o * tf.tanh(C)
            v = tf.matmul(self.W_v, h) + self.b_v
            v = tf.sigmoid(v)
            y = tf.math.softmax(v, axis=0)
            self.h_prev = h
            self.C_prev = C
            outputs.append(y)
        outputs = tf.squeeze(tf.stack(outputs, axis=1))
        return outputs
But this neural network has three problems:
1) It is very slow during training. In comparison, a model that uses tf.keras.layers.LSTM() trains more than 10 times faster. Why is this? Maybe because I didn't use minibatch training, but stochastic training?
2) The NN seems not to learn anything at all. After just a few training examples, the loss settles down and won't decrease anymore; rather, it oscillates around the reached value. After training, I tested the NN by making it generate some text, but it just outputs nonsensical gibberish. Why isn't it learning anything?
3) The loss function outputs very high values. I've coded a categorical cross-entropy loss function, but with a 100-character-long sequence the value of the function is over 370 per training example. Shouldn't it be much lower than this?
I wrote the loss function like this:
def compute_loss(predictions, desired_outputs):
    l = 0
    for i in range(T_steps):
        l -= tf.math.log(predictions[desired_outputs[i], i])
    return l
I know these are open questions, but unfortunately I can't make it work. So any answer, even a short one that helps me figure out the problem myself, is fine :)
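On the magnitude of the loss: the function above sums the per-character negative log-likelihood over the whole sequence, so the total grows with sequence length. Near a random initialization each term is roughly log(vocab_size), so a sum in the hundreds over 100 characters is expected (370 total is about 3.7 per character). For reference, here is a minimal sketch of the same loss without the Python loop; it assumes, matching the code above, that predictions has shape (vocab_size, T_steps) with softmax already applied over axis 0 and that desired_outputs is a vector of T_steps integer character indices:

import tensorflow as tf

def compute_loss(predictions, desired_outputs):
    # Index of the correct character's probability at each time step.
    time_steps = tf.range(tf.shape(predictions)[1])
    indices = tf.stack([tf.cast(desired_outputs, tf.int32), time_steps], axis=1)
    correct_probs = tf.gather_nd(predictions, indices)
    # Total negative log-likelihood over the sequence (same value as the loop).
    return -tf.reduce_sum(tf.math.log(correct_probs))

Dividing by T_steps gives a per-character value that is easier to compare across sequence lengths.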

Soft attention from scratch for video sequences

I am trying to implement soft attention for video sequence classification. As there are a lot of implementations and examples for NLP, I tried following this schema [1], but for video. Basically an LSTM with an attention model in between.
[1] https://blog.heuritech.com/2016/01/20/attention-mechanism/
My code for the attention layer is the following, though I am not sure it is implemented correctly.
def attention_layer(self, input, context):
    # input is a Tensor of shape (seq_length, batch_size, lstm_units);
    # each feat below is therefore (batch_size, lstm_units).
    # context is an LSTMStateTuple: [batch_size, lstm_units]. hidden_state, output = StateTuple
    hidden_state, _ = context
    weights_y = tf.get_variable("att_weights_Y", [self.lstm_units, self.lstm_units],
                                initializer=tf.contrib.layers.xavier_initializer())
    weights_c = tf.get_variable("att_weights_c", [self.lstm_units, self.lstm_units],
                                initializer=tf.contrib.layers.xavier_initializer())
    z_ = []
    for feat in input:
        # Equation => M = tanh(Wc c + Wy y)
        Wcc = tf.matmul(hidden_state, weights_c)
        Wyy = tf.matmul(feat, weights_y)
        m = tf.add(Wcc, Wyy)
        m = tf.tanh(m, name='M_matrix')
        # Equation => s = softmax(m)
        s = tf.nn.softmax(m, name='softmax_att')
        z = tf.multiply(feat, s)
        z_.append(z)
    out = tf.stack(z_, axis=1)
    out = tf.reduce_sum(out, 1)
    return out, s
So, adding this layer in between my LSTMs (or at the beginning of my two LSTMs) makes training very slow. More specifically, it takes a lot of time when I declare my optimizer:
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
My questions are:
Is the implementation correct? If it is, is there a way to optimize it in order to make it train properly?
I was not able to make it work with the seq2seq APIs. Is there any API in TensorFlow that allows me to tackle this specific issue?
Does it actually make sense to use this for sequence classification?
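For comparison, here is a minimal sketch of one common formulation of soft attention over a sequence of per-frame features, written in TF 2 style with tf.keras; the shapes, the layer names, and the extra scoring layer v are assumptions made for this example, not taken from the question. The main difference from the code above is that each time step gets a single scalar score and the softmax runs over the time axis:

import tensorflow as tf

batch, seq_len, units = 4, 16, 128
features = tf.random.normal([batch, seq_len, units])   # per-frame LSTM outputs
hidden_state = tf.random.normal([batch, units])        # context (e.g. last hidden state)

W_y = tf.keras.layers.Dense(units, use_bias=False)
W_c = tf.keras.layers.Dense(units, use_bias=False)
v = tf.keras.layers.Dense(1, use_bias=False)           # scoring layer: one scalar per step

# M = tanh(Wc c + Wy y_t), one row per time step
m = tf.tanh(W_y(features) + W_c(hidden_state)[:, None, :])   # (batch, seq_len, units)
scores = v(m)                                                # (batch, seq_len, 1)
s = tf.nn.softmax(scores, axis=1)                            # softmax over time steps
context = tf.reduce_sum(s * features, axis=1)                # weighted sum: (batch, units)

The resulting context vector feeds the classifier, and s can be inspected to see which frames the model attends to.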

Updating variable values in tensorflow

I have a basic question about updating the values of tensors via the TensorFlow Python API.
Consider the code snippet:
x = tf.placeholder(shape=(None,10), ... )
y = tf.placeholder(shape=(None,), ... )
W = tf.Variable( randn(10,10), dtype=tf.float32 )
yhat = tf.matmul(x, W)
Now let's assume I want to implement some sort of algorithm that iteratively updates the value of W (e.g. some optimization algo). This will involve steps like:
for i in range(max_its):
    resid = y_hat - y
    W = f(W, resid)  # some update
The problem here is that the W on the LHS is a new tensor, not the W that is used in yhat = tf.matmul(x, W)! That is, a new tensor is created, and the value of W used in my "model" doesn't update.
Now one way around this would be:
for i in range(max_its):
    resid = y_hat - y
    W = f(W, resid)  # some update
    yhat = tf.matmul(x, W)
which results in the creation of a new "model" for each iteration of my loop!
Is there a better way to implement this (in python) without creating a whole bunch of new models for each iteration of the loop - but instead updating the original tensor W "in-place" so to speak?
Variables have an assign method. Try: W.assign(f(W, resid))
#aarbelle's terse answer is correct; I'll expand it a bit in case someone needs more info. The last two lines below are used for updating W.
x = tf.placeholder(shape=(None, 10), ...)
y = tf.placeholder(shape=(None,), ...)
W = tf.Variable(randn(10, 10), dtype=tf.float32)
yhat = tf.matmul(x, W)
...
for i in range(max_its):
    resid = y_hat - y
    update = W.assign(f(W, resid))  # do not forget to initialize tf variables
    # "update" above is just a tf op; you need to run the op to update W.
    sess.run(update)
Precisely, the answer should be sess.run(W.assign(f(W,resid))). Then use sess.run(W) to show the change.
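For completeness, in TensorFlow 2.x eager execution the same in-place update works without placeholders or a session; a minimal sketch (the update rule f below is hypothetical, invented only for the example):

import numpy as np
import tensorflow as tf

W = tf.Variable(np.random.randn(10, 10), dtype=tf.float32)

def f(W, resid):
    # Hypothetical update rule, for illustration only.
    return W - 0.01 * tf.reduce_mean(resid) * tf.ones_like(W)

x = tf.random.normal([32, 10])
y = tf.random.normal([32, 10])

for i in range(5):
    resid = tf.matmul(x, W) - y
    W.assign(f(W, resid))   # updates the existing variable in place, no session needed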