TensorFlow: get updates computed by the optimizer

In TensorFlow, the optimizer class only has two functions:
compute_gradients
apply_gradients
where apply_gradients returns an op that performs the update w <- w + Δw via a tf.assign_add function.
However I need direct access to the Δw itself. (or w' = w+Δw). I know that the optimizer adds nodes to the computational graph which compute this Δw for each variable. How can I access them? Or do I have to re-implement the optimizer myself?
The reason is that I need to compute gradients dw'/dw, as I am working on something related to gradient based hyperparameter optimization (cf. https://arxiv.org/abs/1703.01785)

The "delta" applied to each variable is not accessible through any common method or name. In fact, looking a bit into the source it seems rather difficult extract, as it varies from one optimizer to the other.
What you can do, at least, is to compute the differences between variable values and their updates. For example it could work like this:
import tensorflow as tf

with tf.Graph().as_default():
    # Setup example model
    x = tf.placeholder(tf.float32, [None, 1])
    y = tf.placeholder(tf.float32, [None, 2])
    w = tf.Variable([[1., 2.]], tf.float32)
    pred = x @ w
    loss = (tf.reduce_sum(tf.squared_difference(pred, y))
            / tf.cast(tf.shape(x)[0], tf.float32))
    # Record variable values before training step
    # (tf.identity should work here but it does not, so we use
    # a trivial add operation to enforce the control dependency)
    w_old = w + 0
    # Train after having recorded variable values
    with tf.control_dependencies([w_old]):
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
    # Compute deltas after training
    with tf.control_dependencies([train_op]):
        w_delta = w.read_value() - w_old
    init_op = tf.global_variables_initializer()
    # Test
    with tf.Session() as sess:
        sess.run(init_op)
        print(sess.run(w))
        # [[1. 2.]]
        _, w_delta_val = sess.run(
            [train_op, w_delta],
            feed_dict={x: [[1.], [2.]], y: [[3., 4.], [5., 6.]]})
        print(w_delta_val)
        # [[0.79999995 0.5999999 ]]
        print(sess.run(w))
        # [[1.8 2.6]]

To get the updated w', just read (e.g. print) w directly after you have executed optimizer.apply_gradients(); at that point the variable already holds w'.
Now, if you want the update Δw that was applied, just compute w' - w.
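For example, a minimal self-contained sketch of that approach (TF 1.x graph mode, with a made-up toy loss) could look like this:
import tensorflow as tf

w = tf.Variable([[1., 2.]], dtype=tf.float32)
loss = tf.reduce_sum(tf.square(w))            # toy loss, just for illustration
opt = tf.train.GradientDescentOptimizer(0.1)
train_op = opt.apply_gradients(opt.compute_gradients(loss))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    w_before = sess.run(w)       # w
    sess.run(train_op)           # applies the update
    w_after = sess.run(w)        # w'
    print(w_after - w_before)    # Δw, approximately [[-0.2 -0.4]]
Note that this yields Δw as a NumPy value outside the graph; if a graph tensor is needed (e.g. to differentiate dw'/dw as in the question), the control-dependency construction above is the more suitable route.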

Related

Meta-Gradients / Multi-Batch Backpropagation in TensorFlow

I am trying to implement a meta-gradient-based pruning-at-initialization method by Alizadeh et al. (2022) in TensorFlow. The method works roughly like this:
Take some batches from the dataset.
Mask all weights of the network with ones (e.g. tf.ones).
Perform one update of the weights, including the mask.
Unmask all weights and perform the rest of the updates through the other batches.
Compute the meta-gradient of the loss w.r.t. the mask, i.e. backpropagate through all batches and weight updates until the mask from the first iteration is "reached".
The authors implement this in PyTorch, which I typically do not use at work. I want to implement it in TensorFlow, yet I run into the following problem: TensorFlow is not designed to propagate gradients "through" assign operations. E.g., that means:
import tensorflow as tf

w = tf.Variable([4.])
c = tf.Variable([2.])
with tf.GradientTape() as tape:
    tape.watch(c)
    w.assign(w * c)
    output = 2. * w
print(output)
# >> tf.Tensor([16.], shape=(1,), dtype=float32)
print(tape.gradient(output, c))
# >> None
That being said, my "pruning loop" is looking somewhat like this:
test_factor = tf.Variable(1., dtype=tf.float32)

with tf.GradientTape(persistent=True) as outer_tape:
    outer_tape.watch(masked_model.masks)
    outer_tape.watch(test_factor)

    ## First batch
    X_batch, y_batch = wrp.non_random_batch(X_train, y_train, 0, 256)
    with tf.GradientTape() as tape1:
        y_pred = masked_model(X_batch)
        loss = test_factor * loss_fn(y_batch, y_pred)
    gradients = tape1.gradient(loss, masked_model.proper_weights)

    ## Updating weights
    for w, g in zip(masked_model.proper_weights, gradients):
        w.assign(w - 0.05*g)

    ## Unmasking
    masked_model.unmask_forward_passes()

    ## Second batch (and more)
    X_batch, y_batch = wrp.non_random_batch(X_train, y_train, 1, 256)
    with tf.GradientTape() as tape2:
        y_pred = masked_model(X_batch)
        loss = loss_fn(y_batch, y_pred)
    gradients = tape2.gradient(loss, masked_model.proper_weights)

print(outer_tape.gradient(loss, masked_model.masks))
# >> ListWrapper([None, None, ..., None])
print(outer_tape.gradient(loss, test_factor))
# >> None
After the second batch, more batches would follow in the same way.
I inserted test_factor to show that the problem is not something specific to my masks but rather to the general structure. Simply changing the line w.assign(w - 0.05*g) to w = w - 0.05*g enables the gradient to flow, but then the weights are not actually updated...
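For illustration, here is a minimal sketch of that substitution applied to the toy example above (TF 2.x eager execution assumed): with a plain tensor instead of w.assign, the multiplication stays on the tape and the gradient is no longer None, but the variable w itself is never modified.
import tensorflow as tf

w = tf.Variable([4.])
c = tf.Variable([2.])
with tf.GradientTape() as tape:
    tape.watch(c)
    w_new = w * c        # plain tensor instead of w.assign(w * c)
    output = 2. * w_new
print(tape.gradient(output, c))
# >> tf.Tensor([8.], shape=(1,), dtype=float32)
print(w.numpy())
# >> [4.]  (the variable itself was never updated)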
For the authors of the paper mentioned, this does not seem to be a problem. Is PyTorch simply more powerful in such cases, or am I missing some kind of trick to get this to work in TensorFlow?

Why do we need to pass values using feed_dict to print the loss value in TensorFlow?

Below is a small TensorFlow code snippet:
import tensorflow as tf

# Model parameters
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)

# Model input and output
x = tf.placeholder(tf.float32)
linear_model = W * x + b
y = tf.placeholder(tf.float32)

# loss
loss = tf.reduce_sum(tf.square(linear_model - y))
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

# training data
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]

# training loop
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for i in range(1000):
        sess.run(train, {x: x_train, y: y_train})
    # evaluate training accuracy
    curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x: x_train, y: y_train})
    print("W: %s b: %s loss: %s" % (curr_W, curr_b, curr_loss))
Here, in the training loop, we have the code below:
with tf.Session() as sess:
    sess.run(init)
    for i in range(1000):
        sess.run(train, {x: x_train, y: y_train})
    # evaluate training accuracy
    curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x: x_train, y: y_train})
    print("W: %s b: %s loss: %s" % (curr_W, curr_b, curr_loss))
My question is: when we run sess.run(train, {x: x_train, y: y_train}), the loss also gets calculated, so why do we need to pass feed_dict again when we want to retrieve the loss value, as below? Can anyone please help me understand this?
curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x: x_train, y: y_train})
You have defined 2 placeholders in your code: x and y. The tf.placeholder is a container which can be fed different values at each execution of the program.
When you use tf.placeholder, TensorFlow internally defines its computational graph using this container (placeholder). sess.run() runs this computational graph, but the graph by itself makes no sense because the placeholder containers are empty – they do not contain anything. Thus, whenever you use placeholders in your code, you are required to pass the values for these placeholders in your graph using the feed_dict parameter of sess.run().
The advantage of placeholders is that the values you put in them for one execution of sess.run() are not remembered. That is, the second call of sess.run() will again have empty placeholders, and you will again have to put values into them through feed_dict. This is why you have to send values for the placeholders at every call of sess.run().
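As a small illustration of this (a toy graph, separate from the question's code, TF 1.x): the first sess.run() call feeds the placeholder, while a second call that leaves it unfed fails:
import tensorflow as tf

x = tf.placeholder(tf.float32)
doubled = 2.0 * x
with tf.Session() as sess:
    print(sess.run(doubled, feed_dict={x: 3.0}))  # 6.0
    sess.run(doubled)  # raises InvalidArgumentError: a value must be fed for the placeholder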
A useful analogy might be to think of your TensorFlow computational graph as a physical machine, with input pipes (x and y) and output pipes (loss). The machine consumes data from the input pipes (so the data doesn't persist across multiple calls), and it also spits out stuff from the output pipes; if you didn't catch the output, you lost it. The machine (graph) doesn't store any value or result within it. It is just used to define a workflow which applies different operations on data.
Ops like train are levers of the machine, which when pulled do something within the machine. Now for the machine to do any work, you must put something in the input pipes. When you called sess.run(train), the machine used up the data in the placeholders, computed loss (which it sent out through the loss output pipe, which you didn't catch) and modified its internal variables via backpropagation. Now the input pipes are empty again, and the old value of loss is lost! Thus, when you wish to calculate loss, you put in the data in the input pipes, and ask the machine to output the loss through the loss pipe.
You might be tempted to do this:
loss_value, _ = sess.run([loss, train], {x: x_train, y: y_train})
but unfortunately, TensorFlow gives no guarantees as to the order in which sess.run() evaluates its ops. So in the above line of code you won't know whether the loss_value returned is the loss before running the training op or after. The only way to do this is to first run the training op, then run the loss op in 2 separate calls to sess.run() as you have done in your code.
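For example, with the names from the question, the unambiguous pattern looks like this (a sketch; it would run inside the with tf.Session() block):
loss_before = sess.run(loss, {x: x_train, y: y_train})   # loss before the step
sess.run(train, {x: x_train, y: y_train})                # take one training step
loss_after = sess.run(loss, {x: x_train, y: y_train})    # loss after the step
print("loss before: %s, after: %s" % (loss_before, loss_after))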
The loss is evaluated using y and linear_model.
Observe that:
y is a placeholder, and
the calculation of linear_model requires the placeholder x
So, once you have a placeholder, data will have to be passed in using feed_dict.
By the way, running sess.run(train, {x: x_train, y: y_train}) invokes gradient descent to optimize the loss function.
Running curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x: x_train, y: y_train}) then prints out the current value of the loss, which has been optimized by executing the train op.

How to force TensorFlow to show a simple linear regression prediction result?

I have a simple linear regression question as below:
My code is as below:
import tensorflow as tf
import numpy as np
batch_xs=np.array([[0,0,1],[1,1,1],[1,0,1],[0,1,1]])
batch_ys=np.array([[0],[1],[1],[0]])
x = tf.placeholder(tf.float32, [None, 3])
W = tf.Variable(tf.zeros([3, 1]))
b = tf.Variable(tf.zeros([1]))
y = tf.nn.sigmoid(tf.matmul(x, W) + b)
y_ = tf.placeholder(tf.float32, [None, 1])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
learning_rate = 0.05
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cross_entropy)
sess = tf.Session()
tf.global_variables_initializer().run(session=sess)
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
Prediction:
x0=np.array([[1.,0.,0.]])
x0=np.float32(x0)
y0=tf.nn.softmax(tf.matmul(x0,W) + b)
print(y0)
However, print(y0) shows Tensor("Softmax_2:0", shape=(1, 1), dtype=float32) instead of a numeric value. I expect y0 to be around 0.99.
I tried y0.eval(), but I got ValueError: Cannot evaluate tensor using 'eval()': No default session is registered..
How can I make a change to obtain the result? Thanks!
There are a couple of ways to get things to print out while writing TensorFlow code. Of course, there’s the classic Python built-in, print (Or the function print(), if we’re being Python 3 about it). And then there’s TensorFlow’s print function, tf.Print (notice the capital P).
When working with TensorFlow, it’s important to remember that everything is ultimately a graph computation. This means that if you print a TensorFlow operation using Python’s print, it will simply show a description of what that operation is, since no values have been passed through it yet. It will also often show the dimensions that are expected to be in that node, if they’re known.
If you want to print the values that are ‘flowing’ through a particular part of the graph as it’s being executed, then we need to turn to using tf.Print.
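For the concrete graph in the question, either route might look roughly like this (a sketch, assuming the sess from the question is still open):
# 1) Evaluate the tensor in the existing session to get its numeric value:
print(sess.run(y0))

# 2) Or wrap it with tf.Print, so the value is logged whenever the node executes:
y0_printed = tf.Print(y0, [y0], message="y0 = ")
sess.run(y0_printed)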

Can't learn parameters of tf.contrib.distributions.MultivariateNormalDiag via optimization

Working example:
import numpy as np
import tensorflow as tf
## construct data
np.random.seed(723888)
N,P = 50,3 # number and dimensionality of observations
Xbase = np.random.multivariate_normal(mean=np.zeros((P,)), cov=np.eye(P), size=N)
## construct model
X = tf.placeholder(dtype=tf.float32, shape=(None, P), name='X')
mu = tf.Variable(np.random.normal(loc=0.0, scale=0.1, size=(P,)), dtype=tf.float32, name='mu')
xDist = tf.contrib.distributions.MultivariateNormalDiag(loc=mu, scale_diag=tf.ones(shape=(P,), dtype=tf.float32), name='xDist')
xProbs = xDist.prob(X, name='xProbs')
## prepare optimizer
eta = 1e-3 # learning rate
loss = -tf.reduce_mean(tf.log(xProbs), name='loss')
optimizer = tf.train.AdamOptimizer(learning_rate=eta).minimize(loss)
## launch session
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    sess.run(optimizer, feed_dict={X: Xbase})
I want to optimize over the parameters of a multivariate Gaussian distribution in TensorFlow, as in my example above. I can successfully run commands like sess.run(loss, feed_dict={X: Xbase}), so I have implemented the distribution correctly. When I try to run the optimization op, I get an odd error message:
InvalidArgumentError: -1 is not between 0 and 3
[[Node: gradients_1/xDist_7/xProbs/Prod_grad/InvertPermutation = InvertPermutation[T=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"](gradients_1/xDist_7/xProbs/Prod_grad/concat)]]
Caused by op 'gradients_1/xDist_7/xProbs/Prod_grad/InvertPermutation'
That I do not understand.
I get the same error message if I use tf.contrib.distributions.MultivariateNormalFullCovariance instead of tf.contrib.distributions.MultivariateNormalDiag. I do not get the error if scale_diag and not loc is the variable being optimized over.
I'm still looking into why this is failing, but for a short-term fix, does making the following change work?
xLogProbs = xDist.log_prob(X, name='xLogProbs')
loss = -tf.reduce_mean(xLogProbs, name='loss')
Note: this is actually preferable to tf.log(xProbs) because it is never less numerically precise, and is sometimes substantially more precise. (This is true of all tf.Distributions.)
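As a rough standalone illustration of that precision point (toy values, TF 1.x with tf.contrib assumed): for a point far out in the tail, prob() underflows to 0 in float32, so tf.log returns -inf, while log_prob() stays finite:
import tensorflow as tf

dist = tf.contrib.distributions.MultivariateNormalDiag(
    loc=tf.zeros([3]), scale_diag=tf.ones([3]))
far = tf.constant([[30., 30., 30.]])
with tf.Session() as sess:
    print(sess.run(tf.log(dist.prob(far))))  # [-inf]  (prob underflows to 0)
    print(sess.run(dist.log_prob(far)))      # roughly [-1352.76]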

Basic neural network in TensorFlow

I am trying to implement a very basic neural network in TensorFlow, but I am having some problems. It is a very basic network that takes two values as input (hours of sleep and hours of study) and predicts the score on a test (I found this example on YouTube). Basically, I have only one hidden layer with three units, each computing an activation function (sigmoid); the cost function is the sum of squared errors, and I am using gradient descent to minimize it. The problem is that when I train the net with the training data and try to make predictions using that same training data, the results do not quite match, and they also look strange because they come out nearly equal to each other.
import tensorflow as tf
import numpy as np
sess = tf.InteractiveSession()

# training data: [hours of sleep, hours of study] -> test score
trX = np.matrix(([3, 5], [5, 1], [10, 2]), dtype=float)
trY = np.matrix(([85], [82], [93]), dtype=float)  # 3x1 matrix
trX = trX / np.max(trX, axis=0)
trY = trY / 100  # 100 is the maximum score allowed
teX = np.matrix(([3, 5]), dtype=float)
teY = np.matrix(([85]), dtype=float)
teX = teX / np.amax(teX, axis=0)
teY = teY / 100

def init_weights(shape):
    return tf.Variable(tf.random_normal(shape, stddev=0.01))

def model(X, w_h, w_o):
    z2 = tf.matmul(X, w_h)
    a2 = tf.nn.sigmoid(z2)  # this is a basic mlp, think 2 stacked logistic regressions
    z3 = tf.matmul(a2, w_o)
    yHat = tf.nn.sigmoid(z3)
    return yHat  # note that we don't take the softmax at the end because our cost fn does that for us

X = tf.placeholder("float", [None, 2])
Y = tf.placeholder("float", [None, 1])
W1 = init_weights([2, 3])  # create symbolic variables
W2 = init_weights([3, 1])
sess.run(tf.initialize_all_variables())
py_x = model(X, W1, W2)
cost = tf.reduce_mean(tf.square(py_x - Y))
train_op = tf.train.GradientDescentOptimizer(0.5).minimize(cost)  # construct an optimizer
predict_op = py_x
sess.run(train_op, feed_dict={X: trX, Y: trY})
print(sess.run(predict_op, feed_dict={X: trX}))
sess.close()
It yields:
[[ 0.51873487]
[ 0.51874501]
[ 0.51873082]]
and I believe it should be similar to the training data results.
I am quite new to neural nets and machine learning so pardon me for any mistakes, thanks in advance.
The main reason that your network isn't training is that the statement:
sess.run(train_op, feed_dict={X: trX, Y: trY})
…only executes once. In TensorFlow, running train_op (or whatever operation is returned from Optimizer.minimize()) only causes the network to take a single gradient descent step. You should execute it in a loop to perform iterative training, and the weights will eventually converge.
Two other tips: (i) you might achieve faster convergence if you feed a subset of your training data in each step, rather than the entire dataset; and (ii) the learning rate of 0.5 is probably too high (although this depends on the data).
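A minimal sketch of those suggestions applied to the script in the question (the learning rate and number of steps are illustrative, not tuned):
# Replace the single training call with a loop, and lower the learning rate.
train_op = tf.train.GradientDescentOptimizer(0.05).minimize(cost)
sess.run(tf.initialize_all_variables())
for i in range(1000):
    sess.run(train_op, feed_dict={X: trX, Y: trY})
print(sess.run(predict_op, feed_dict={X: trX}))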