I know that optimizers in TensorFlow divide minimize into compute_gradients and apply_gradients. However, optimization algorithms like Adam generally process the gradients with momentum and some other techniques, as the following figure suggests (thanks @kmario23 for providing the figure).
I wonder when these techniques are applied to the gradients? Are they applied in compute_gradients or apply_gradients?
Update
import tensorflow as tf

sess = tf.Session()
x = tf.placeholder(tf.float32, [None, 1])
y = tf.layers.dense(x, 1)
loss = tf.losses.mean_squared_error(tf.ones_like(y), y)
opt = tf.train.AdamOptimizer()
grads = opt.compute_gradients(loss)
sess.run(tf.global_variables_initializer())
print(sess.run(grads, feed_dict={x: [[1]]}))
print(sess.run(grads, feed_dict={x: [[1]]}))
The above code outputs the same results twice; does it suggest that the moment estimates are computed in apply_gradients? Because, IMHO, if the moment estimates were computed in compute_gradients, then after the first print statement the first and second moments would be updated, which should result in a different result in the second print statement.
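One way to check this more directly (a rough sketch appended to the snippet above; it re-runs the initializer because building the apply_gradients op is what creates the moment slot variables):

train_op = opt.apply_gradients(grads)        # building this creates the m/v slots
kernel = tf.trainable_variables()[0]         # the dense layer's weight matrix
m, v = opt.get_slot(kernel, 'm'), opt.get_slot(kernel, 'v')

sess.run(tf.global_variables_initializer())
print(sess.run([m, v]))                      # zeros
sess.run(grads, feed_dict={x: [[1]]})        # only compute_gradients
print(sess.run([m, v]))                      # still zeros
sess.run(train_op, feed_dict={x: [[1]]})     # apply_gradients
print(sess.run([m, v]))                      # moments have changed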
Below is the Adam algorithm as presented in the Deep Learning book. As for your question, the important thing to note here is the parameter update Δθ (written with the symbol Δ, which resembles a Laplacian but simply denotes the change in θ) in the second-to-last step.
As for how TensorFlow computes this, it is a two-step process in the optimization (i.e. minimize):
1) compute_gradients
2) apply_gradients
In the first step all the necessary ingredients for the final gradients are computed. So, the second step is just applying the update to the parameters based on the gradients computed in the first step and the learning rate (lr).
compute_gradients computes only the gradients; all the other additional operations corresponding to the specific optimization algorithm are done in apply_gradients. The code in the update is one piece of evidence; another is the following figure cropped from TensorBoard, where the Adam scope corresponds to the apply_gradients operation.
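To make that division concrete, here is a rough NumPy sketch of the two phases (the function name and hyperparameter defaults are mine, not TensorFlow internals): compute_gradients corresponds to producing the raw gradient g, and apply_gradients corresponds to everything that follows.

import numpy as np

# Phase 2 (apply_gradients) for a single parameter tensor theta, given the raw
# gradient g from phase 1 (compute_gradients). s and r are the running first and
# second moment estimates, t is the 1-based step count.
def adam_apply(theta, g, s, r, t, lr=0.001, rho1=0.9, rho2=0.999, eps=1e-8):
    s = rho1 * s + (1 - rho1) * g              # update biased first moment
    r = rho2 * r + (1 - rho2) * g * g          # update biased second moment
    s_hat = s / (1 - rho1 ** t)                # bias correction
    r_hat = r / (1 - rho2 ** t)
    theta = theta - lr * s_hat / (np.sqrt(r_hat) + eps)   # apply the update
    return theta, s, r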
Related
When a custom loss is defined in a Keras model, online sources seem to indicate that the loss should return an array of values (a loss for each sample in the batch). Something like this
def custom_loss_function(y_true, y_pred):
    squared_difference = tf.square(y_true - y_pred)
    return tf.reduce_mean(squared_difference, axis=-1)

model.compile(optimizer='adam', loss=custom_loss_function)
In the example above, I have no idea when, or whether, the model takes the batch sum or mean with tf.reduce_sum() or tf.reduce_mean().
In another situation when we want to implement a custom training loop with a custom function, the template to follow according to Keras documentation is this
for epoch in range(epochs):
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
        with tf.GradientTape() as tape:
            y_batch_pred = model(x_batch_train, training=True)
            loss_value = custom_loss_function(y_batch_train, y_batch_pred)
        grads = tape.gradient(loss_value, model.trainable_weights)
        optimizer.apply_gradients(zip(grads, model.trainable_weights))
So by the book, if I understand correctly, we are supposed to take the mean of the batch gradients. Therefore, the loss value above should be a single value per batch.
However, the example will work with both of the following variations:
tf.reduce_mean(squared_difference, axis=-1) # array of loss for each sample
tf.reduce_mean(squared_difference) # mean loss for batch
So, why does the first option (array loss) above still work? Is apply_gradients applying small changes for each value sequentially? Is this wrong although it works?
What is the correct way without a custom loop, and with a custom loop?
Good question. In my opinion, this is not well documented in the TensorFlow/Keras API. By default, if you do not provide a scalar loss_value, TensorFlow will add them up (and the updates are not sequential). Essentially, this is equivalent to summing the losses along the batch axis.
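A quick way to convince yourself of this (TF2 eager mode, toy values): the gradient of a per-sample loss vector equals the gradient of its sum.

import tensorflow as tf

x = tf.Variable([1.0, 2.0, 3.0])
with tf.GradientTape(persistent=True) as tape:
    per_sample = tf.square(x)              # one "loss" per sample
    summed = tf.reduce_sum(per_sample)     # explicit scalar loss

g_vector = tape.gradient(per_sample, x)    # non-scalar target: TF sums it first
g_scalar = tape.gradient(summed, x)
print(g_vector.numpy(), g_scalar.numpy())  # both are [2. 4. 6.]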
Currently, the losses in the TensorFlow API include a reduction argument (for example, tf.losses.MeanSquaredError) that allows specifying how to aggregate the loss along the batch axis.
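For example (tf.keras.losses.MeanSquaredError is the same class; the reduction values below are the ones currently in the API):

import tensorflow as tf

mse_mean = tf.keras.losses.MeanSquaredError(
    reduction=tf.keras.losses.Reduction.SUM_OVER_BATCH_SIZE)  # default: mean over the batch
mse_sum = tf.keras.losses.MeanSquaredError(
    reduction=tf.keras.losses.Reduction.SUM)                  # sum over the batch
mse_none = tf.keras.losses.MeanSquaredError(
    reduction=tf.keras.losses.Reduction.NONE)                 # one loss per sample

# e.g. model.compile(optimizer='adam', loss=mse_sum), with `model` as in the question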
I'm trying to run through a simple linear regression example in Tensorflow, and it appears that the training algorithm is converging to a solution, but once it gets close to the solution, it starts bouncing around and eventually blows up.
I'm passing data for a y = 2x line, so the gradient descent optimizer should be able to easily converge to a solution.
import tensorflow as tf
M = tf.Variable([0.4], dtype=tf.float32)
b = tf.Variable([-0.4], dtype=tf.float32)
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
linear_model = M * x + b
error = linear_model - y
loss = tf.square(error)
optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)

    for i in range(100):
        sess.run(optimizer, {x: i, y: 2 * i})
        print(sess.run([M, b]))
Here's the result. I circled the portion where it gets close to a solution. Why does the gradient descent break once it gets close to the solution, or is there something that I'm doing wrong?
Your code feeds the training data one sample at a time for only one epoch. This corresponds to stochastic gradient descent, where the loss value tends to fluctuate more frequently than in batch and mini-batch gradient descent during training. Moreover, since the data is fed in increasing order of x, the gradient value also increases along with x. That is why you see larger fluctuations in the later part of an epoch.
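To illustrate, here is a hedged variation of the training loop from the question (the learning rate and epoch count are just examples): shuffle the sample order every epoch and use a much smaller step size, so the large-x samples no longer cause huge updates.

import numpy as np

optimizer = tf.train.GradientDescentOptimizer(5e-5).minimize(loss)

xs = np.arange(100, dtype=np.float32)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(10):
        np.random.shuffle(xs)                    # break the increasing-x order
        for i in xs:
            sess.run(optimizer, {x: i, y: 2 * i})
    print(sess.run([M, b]))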
This can happen if the learning rate is too high; try lowering it.
My guess would be that you have chosen a high learning rate. You can use grid search and find the optimal learning rate, then fit data using the optimal learning rate.
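As a rough sketch of that idea (reusing x, y, and loss from the question; the candidate rates are illustrative):

lr = tf.placeholder(tf.float32, shape=[])
train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss)

for candidate in [1e-2, 1e-3, 1e-4, 1e-5]:
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        final_loss = None
        for i in range(100):
            _, final_loss = sess.run([train_op, loss], {x: i, y: 2 * i, lr: candidate})
        print('lr=%g, final loss=%s' % (candidate, final_loss))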
Specifically, within one step, how does it train the model? What is the stopping condition for the gradient descent and backpropagation?
Docs here: https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator#train
e.g.
mnist_classifier = tf.estimator.Estimator(model_fn=cnn_model_fn)

train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": X_train},
    y=y_train,
    batch_size=50,
    num_epochs=None,
    shuffle=True)

mnist_classifier.train(
    input_fn=train_input_fn,
    steps=100,
    hooks=[logging_hook])
I understand that training one step means that we feed the neural network model with batch_size many data points once. My question is, within this one step, how many times does it perform gradient descent? Does it do backpropagation and gradient descent just once, or does it keep performing gradient descent until the model weights reach an optimum for this batch of data?
In addition to @David Parks' answer, using batches for performing gradient descent is referred to as (mini-batch) stochastic gradient descent. Instead of updating the weights after each training sample, you average the gradients over the batch and use this averaged gradient to update your weights.
For example, if you have 1000 training samples and use batches of 200, you calculate the average gradient over 200 samples and update your weights with it. That means you only perform 5 updates overall, instead of updating your weights 1000 times. On sufficiently big data sets, you will experience a much faster training process.
Michael Nielsen has a really nice way to explain this concept in his book.
1 step = 1 gradient update. And each gradient update step requires one forward pass and one backward pass.
The stopping condition is generally left up to you and is arguably more art than science. Commonly you will plot (tensorboard is handy here) your cost, training accuracy, and periodically your validation set accuracy. The low point on validation accuracy is generally a good point to stop. Depending on your dataset validation accuracy may drop and at some point increase again, or it may simply flatten out, at which point the stopping condition often correlates with the developer's degree of impatience.
Here's a nice article on stopping conditions; a Google search will turn up plenty more.
https://stats.stackexchange.com/questions/231061/how-to-use-early-stopping-properly-for-training-deep-neural-network
Another common approach to stopping is to drop the learning rate every time you observe that validation accuracy has not changed for some "reasonable" number of steps. When you've effectively hit 0 learning rate, you call it quits.
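This policy is not built into Estimator.train itself, but if you train with Keras instead, it maps onto existing callbacks (the monitored metric and the numbers below are illustrative):

import tensorflow as tf

callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_accuracy', factor=0.1,
                                         patience=5, min_lr=1e-6),
    tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=15,
                                     restore_best_weights=True),
]
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=callbacks)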
The input function emits batches (when num_epochs=None, num_batches is infinite):
num_batches = num_epochs * (num_samples / batch_size)
One step processes one batch; if steps > num_batches, the training will stop after num_batches steps.
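Plugging illustrative numbers into that formula (say 55,000 training samples and a finite num_epochs; with num_epochs=None as in the question, num_batches is unbounded and only steps matters):

num_samples, batch_size, num_epochs = 55000, 50, 1
num_batches = num_epochs * (num_samples // batch_size)   # 1100 batches available
steps = 100
print(min(steps, num_batches))                           # 100 gradient updates are performed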
I'm trying to work through Stanford's CS20: TensorFlow for Deep Learning Research course. The first 2 lectures provide a good introduction to the low-level plumbing and computation framework (that frankly the official introductory tutorials seem to skip right over, for reasons I can only fathom as sadism). In lecture 3, it starts performing a linear regression and makes what seems like a fairly heavy cognitive leap for me. Instead of session.run on a tensor computation, it does it on the GradientDescentOptimizer.
sess.run(optimizer, feed_dict={X: x, Y:y})
The full code is available on page 3 of the lecture 3 notes.
EDIT: code and data also available at this github - code is available in examples/03_linreg_placeholder.py and data in examples/data/birth_life_2010.txt
EDIT: code is below as per request
import tensorflow as tf
import utils

DATA_FILE = 'data/birth_life_2010.txt'

# Step 1: read in data from the .txt file
# data is a numpy array of shape (190, 2), each row is a datapoint
data, n_samples = utils.read_birth_life_data(DATA_FILE)

# Step 2: create placeholders for X (birth rate) and Y (life expectancy)
X = tf.placeholder(tf.float32, name='X')
Y = tf.placeholder(tf.float32, name='Y')

# Step 3: create weight and bias, initialized to 0
w = tf.get_variable('weights', initializer=tf.constant(0.0))
b = tf.get_variable('bias', initializer=tf.constant(0.0))

# Step 4: construct model to predict Y (life expectancy from birth rate)
Y_predicted = w * X + b

# Step 5: use the square error as the loss function
loss = tf.square(Y - Y_predicted, name='loss')

# Step 6: using gradient descent with learning rate of 0.001 to minimize loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)

with tf.Session() as sess:
    # Step 7: initialize the necessary variables, in this case, w and b
    sess.run(tf.global_variables_initializer())

    # Step 8: train the model
    for i in range(100):  # run 100 epochs
        for x, y in data:
            # Session runs train_op to minimize loss
            sess.run(optimizer, feed_dict={X: x, Y: y})

    # Step 9: output the values of w and b
    w_out, b_out = sess.run([w, b])
I've done the Coursera machine learning course, so I (think I) understand the notion of gradient descent. But I'm quite lost as to what is happening in this specific case.
What I would expect to have to happen:
Calculate the gradient (either by calculus or numerical methods)
Calculate the parameter change (alpha multiplied by the predicted vs actual over the entire dataset)
Adjust the parameters
Repeat the above N times (in this case 100 times for 100 epochs)
I understand that in practice you'd apply things like batching and subsets but in this case I believe this is just looping over the entire dataset 100 times.
I can (and have) implemented this before. But I'm struggling to fathom how the code above could be achieving this. For one thing, the optimizer is called on each data point (i.e. it's in an inner loop over each data point, inside the loop over the 100 epochs). I would have expected an optimization call which took in the entire dataset.
Question 1 - is the gradient adjustment operating over the entire data set 100 times, or over the entire data set 100 times in batches of 1 (so 100*n times, for n examples)?
Question 2 - how does the optimizer 'know' how to adjust w and b? It's only provided the loss tensor - is it reading back through the graph and just going "well, w and b are the only variables, so I'll wiggle the hell out of those"?
Question 2b - if so, what happens if you put in other variables? Or more complex functions? Does it just auto-magically calculate the gradient adjustment for every variable in the predecessor graph?
Question 2c - pursuant to that, I've tried adjusting to a quadratic expression as suggested on page 3 of the tutorial, but end up getting a higher loss. Is this normal? The tutorial seems to suggest it should be better. At the least I would expect it not to be worse - is this subject to changing hyperparameters?
EDIT: Full code for my attempts to adjust to a quadratic is here. Note that this is the same as the above, with lines 28, 29, 30 and 34 modified to use a quadratic predictor. These edits are (what I interpret to be) what's suggested in the lecture 3 notes on page 4.
""" Solution for simple linear regression example using placeholders
Created by Chip Huyen (chiphuyen#cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 03
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import time
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import utils
DATA_FILE = 'data/birth_life_2010.txt'
# Step 1: read in data from the .txt file
data, n_samples = utils.read_birth_life_data(DATA_FILE)
# Step 2: create placeholders for X (birth rate) and Y (life expectancy)
X = tf.placeholder(tf.float32, name='X')
Y = tf.placeholder(tf.float32, name='Y')
# Step 3: create weight and bias, initialized to 0
# w = tf.get_variable('weights', initializer=tf.constant(0.0)) old single weight
w = tf.get_variable('weights_1', initializer=tf.constant(0.0))
u = tf.get_variable('weights_2', initializer=tf.constant(0.0))
b = tf.get_variable('bias', initializer=tf.constant(0.0))
# Step 4: build model to predict Y
#Y_predicted = w * X + b #linear
Y_predicted = w * X * X + X * u + b #quadratic
#Y_predicted = w # test of nonsense
# Step 5: use the squared error as the loss function
# you can use either mean squared error or Huber loss
loss = tf.square(Y - Y_predicted, name='loss')
#loss = utils.huber_loss(Y, Y_predicted)
# Step 6: using gradient descent with learning rate of 0.001 to minimize loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)
start = time.time()
writer = tf.summary.FileWriter('./graphs/linear_reg', tf.get_default_graph())
with tf.Session() as sess:
# Step 7: initialize the necessary variables, in this case, w and b
sess.run(tf.global_variables_initializer())
# Step 8: train the model for 100 epochs
for i in range(100):
total_loss = 0
for x, y in data:
# Session execute optimizer and fetch values of loss
_, l = sess.run([optimizer, loss], feed_dict={X: x, Y:y})
total_loss += l
print('Epoch {0}: {1}'.format(i, total_loss/n_samples))
# close the writer when you're done using it
writer.close()
# Step 9: output the values of w and b
w_out, b_out = sess.run([w, b])
print('Took: %f seconds' %(time.time() - start))
print(f'w = {w_out}')
# plot the results
plt.plot(data[:,0], data[:,1], 'bo', label='Real data')
plt.plot(data[:,0], data[:,0] * w_out + b_out, 'r', label='Predicted data')
plt.legend()
plt.show()
For the linear predictor I get loss of (this aligns with lecture notes):
Epoch 99: 30.03552558278714
For my attempts at the quadratic I get loss of:
Epoch 99: 127.2992221294363
In the code you linked, it's 100 epochs in batches of 1 (assuming each element of data is a single input). I.e. compute the gradient of the loss with respect to a single example, update the parameters, go to the next example... until you went over the whole dataset. Do this 100 times.
A lot of things happen in that minimize call of the optimizer. Indeed, you only put in the cost: under the hood, TensorFlow will then compute gradients for all requested variables (we'll get to that in a second) that are involved in the cost computation (it can infer this from the computational graph) and return an op that "applies" the gradients. This means an op that takes all the requested variables and assigns a new value to them, something like tf.assign(var, var - learning_rate*gradient). This is related to another question you asked: minimize returns just an op; on its own, this doesn't do anything! Running this op in a session will do a "gradient step" each time.
As to which variables are actually affected by this op: You can give this as an argument to the minimize call! See here -- the argument is var_list. If this is not given, Tensorflow will simply use all "trainable variables". By default, any variable you create with tf.Variable or tf.get_variable is trainable. However you can pass trainable=False to these functions to create variables that are not (by default) going to be affected by the op returned by minimize. Play around with this! See what happens if you set some variables not to be trainable, or if you pass a custom var_list to minimize.
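A small experiment along those lines (a toy graph with my own variable names and values):

import tensorflow as tf

w = tf.get_variable('w', initializer=tf.constant(1.0))                    # trainable by default
c = tf.get_variable('c', initializer=tf.constant(1.0), trainable=False)   # never trained
loss = tf.square(w * 3.0 + c - 2.0)

# Only variables in var_list (here just w) get an update op attached.
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss, var_list=[w])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(10):
        sess.run(train_op)
    print(sess.run([w, c]))   # w has moved, c is unchanged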
In general, the "whole idea" of Tensorflow is that it can "magically" calculate gradients based on only a feedforward description of the model.
EDIT: This is possible because machine learning models (including deep learning) are composed of quite simple building blocks such as matrix multiplications and mostly pointwise nonlinearities. These simple blocks also have simple derivatives, which can be composed via the chain rule. You might want to read up on the backpropagation algorithm.
It will certainly take longer with very large models. But it is always possible as long as there is a clear "path" through the computation graph where all components have defined derivatives.
As to whether this can generate poor models: Yes, and this is a fundamental problem of deep learning. Very complex/deep models lead to highly non-convex cost functions which are difficult to optimize with methods like gradient descent.
With regards to the quadratic function: Looks like there are two problems here.
Not enough training epochs. More complex problems (in this case, we have more variables) might simply need longer to train. E.g. with your setup I can reach a cost of ~58 after about 330 epochs with the quadratic function.
The learning rate. The above is still suspicious since with more variables we should definitely be able to reach better results (as long as the inputs for those variables aren't superfluous), and since this is a simple linear regression problem gradient descent should be able to find them. In this case the learning rate is usually the problem. I changed it to 0.0001 (lowered by a factor of 10) and after about 3400 epochs reach costs below 30 (haven't tested how low it goes). Now obviously lower learning rates lead to slower training, but they are often necessary towards the end to avoid "jumping over" better solutions. This is why in practice, some kind of learning rate annealing is usually performed -- start with a large learning rate to make rapid progress in the beginning, then make it smaller and smaller as training progresses. In general, the learning rate (and its annealing schedule) is the hyperparameter that needs the most tuning in machine learning problems.
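For instance, one common way to set up such an annealing schedule in TF1 (the decay numbers are illustrative; loss is the one from the script above):

global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(0.001,             # initial learning rate
                                           global_step,
                                           decay_steps=1000,  # every 1000 steps...
                                           decay_rate=0.9)    # ...multiply the rate by 0.9
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    loss, global_step=global_step)   # minimize increments global_step on each run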
There are also methods such as Adam that use an "adaptive" learning rate. Generally an untuned adaptive method will outperform an untuned gradient descent, so they are good for quick experiments. However, well-tuned gradient descent will usually outperform them in turn.
I'm wondering how to use stop_gradient in tensorflow, and the documentation is not clear to me.
I'm currently using stop_gradient to produce the gradient of the loss function w.r.t. the word embeddings in a CBOW word2vec model. I want to just get the value, and not do backpropagation (as I'm generating adversarial examples).
Currently, I'm using the code:
lossGrad = gradients.gradients(loss, embed)[0]
real_grad = lossGrad.eval(feed_dict)
But when I run this, it does the backpropagation anyway! What am I doing wrong, and just as importantly, how can I fix this?
CLARIFICATION: To clarify by "backpropagation" I mean "calculating values and updating model parameters".
UPDATE
If I run the two lines above after the first training step, then I get a different loss after 100 training steps than when I don't run those two lines. I might be fundamentally misunderstanding something about TensorFlow.
I've tried using set_random_seed both at the beginning of the graph declaration and before each training step. The total loss is consistent between multiple runs, but not between including/excluding those two lines. So if it's not the RNG causing the disparity, and it's not unanticipated updating of the model parameters between training steps, do you have any idea what would cause this behavior?
SOLUTION
Welp, it's a bit late but here's how I solved it. I only wanted to optimize over some, but not all, variables. I thought that the way to prevent optimizing some variables would be to use stop_grad - but I never found a way to make that work. Maybe there is a way, but what worked for me was to adjust my optimizer to only optimize over a list of variables. So instead of:
opt = tf.train.GradientDescentOptimizer(learning_rate=eta)
train_op = opt.minimize(loss)
I used:
opt = tf.train.GradientDescentOptimizer(learning_rate=eta)
train_op = opt.minimize(loss, var_list=[variables to optimize over])
This prevented opt from updating the variables not in var_list. Hopefully it works for you, too!
tf.stop_gradient provides a way to not compute gradient with respect to some variables during back-propagation.
For example, in the code below, we have three variables, w1, w2, w3 and input x. The loss is square(x.dot(w1) - x.dot(w2 * w3)). We want to minimize this loss w.r.t. w1 but want to keep w2 and w3 fixed. To achieve this we can just use tf.stop_gradient(tf.matmul(x, w2 * w3)).
In the figure below, I plotted how w1, w2, and w3 evolve from their initial values as a function of the training iterations. It can be seen that w2 and w3 remain fixed while w1 changes until it becomes equal to w2 * w3.
An image showing that only w1 learns, while w2 and w3 stay fixed:
import tensorflow as tf
import numpy as np
w1 = tf.get_variable("w1", shape=[5, 1], initializer=tf.truncated_normal_initializer())
w2 = tf.get_variable("w2", shape=[5, 1], initializer=tf.truncated_normal_initializer())
w3 = tf.get_variable("w3", shape=[5, 1], initializer=tf.truncated_normal_initializer())
x = tf.placeholder(tf.float32, shape=[None, 5], name="x")
a1 = tf.matmul(x, w1)
a2 = tf.matmul(x, w2*w3)
a2 = tf.stop_gradient(a2)
loss = tf.reduce_mean(tf.square(a1 - a2))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
gradients = optimizer.compute_gradients(loss)
train_op = optimizer.apply_gradients(gradients)
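The answer stops before the training loop that would produce that plot; a possible sketch (random input data and a step count of my choosing):

import matplotlib.pyplot as plt

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    history = []
    for _ in range(200):
        sess.run(train_op, feed_dict={x: np.random.randn(32, 5)})
        history.append(sess.run([w1, w2, w3]))
    history = np.array(history)                      # shape (steps, 3, 5, 1)
    for idx, name in enumerate(['w1', 'w2', 'w3']):
        plt.plot(history[:, idx, :, 0].mean(axis=1), label=name)  # mean of each weight vector
    plt.xlabel('training iteration')
    plt.legend()
    plt.show()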
tf.gradients(loss, embed) computes the partial derivative of the tensor loss with respect to the tensor embed. TensorFlow computes this partial derivative by backpropagation, so it is expected behavior that evaluating the result of tf.gradients(...) performs backpropagation. However, evaluating that tensor does not perform any variable updates, because the expression does not include any assignment operations.
tf.stop_gradient() is an operation that acts as the identity function in the forward direction but stops the accumulated gradient from flowing through that operator in the backward direction. It does not prevent backpropagation altogether, but instead prevents an individual tensor from contributing to the gradients that are computed for an expression. The documentation for the operation has more details about the operation, and when to use it.
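A tiny illustration of that forward/backward behavior (TF1 style, to match the rest of this thread):

import tensorflow as tf

a = tf.constant(3.0)
blocked = tf.stop_gradient(a * a)   # forward value 9.0, but no gradient flows through it
y = a + blocked                     # forward value 12.0
grad = tf.gradients(y, a)[0]        # only the plain `a` term contributes -> 1.0

with tf.Session() as sess:
    print(sess.run([y, grad]))      # [12.0, 1.0]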