How to fetch gradients with respect to certain occurrences of variables in tensorflow?

Since TensorFlow supports variable reuse, some parts of the computation graph may occur multiple times in both the forward and backward passes. So my question is: is it possible to update variables with respect to specific occurrences of them in the compute graph?
For example, in X_A->Y_B->Y_A->Y_B, Y_B occurs twice; how can it be updated with respect to each occurrence separately? I mean: first take the latter occurrence as constant and update the former, then do the opposite.
A simpler example: say X_A, Y_B, Y_A are all scalar variables, and let Z = X_A * Y_B * Y_A * Y_B. Here the gradient of Z with respect to either occurrence of Y_B is X_A * Y_B * Y_A, whereas the gradient of Z with respect to Y_B itself is 2 * X_A * Y_B * Y_A. In this example computing the gradients separately may seem unnecessary, but those computations are not always commutative.
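For concreteness, here is a minimal TF 1.x sketch of this scalar example (the values are arbitrary), freezing each occurrence in turn with tf.stop_gradient:
import tensorflow as tf

x_a = tf.constant(2.0)
y_a = tf.constant(3.0)
y_b = tf.Variable(5.0)

z = x_a * y_b * y_a * y_b
# Freeze the second occurrence to differentiate w.r.t. the first, and vice versa.
z_first = x_a * y_b * y_a * tf.stop_gradient(y_b)
z_second = x_a * tf.stop_gradient(y_b) * y_a * y_b

g_total = tf.gradients(z, y_b)[0]          # 2 * x_a * y_a * y_b = 60
g_first = tf.gradients(z_first, y_b)[0]    # x_a * y_a * y_b = 30
g_second = tf.gradients(z_second, y_b)[0]  # x_a * y_a * y_b = 30

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([g_total, g_first, g_second]))  # [60.0, 30.0, 30.0]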
In the first example, the gradient with respect to the latter occurrence can be computed by calling tf.stop_gradient on X_A->Y_B, but I cannot think of a way to fetch the gradient with respect to the former one. Is there a way to do this in TensorFlow's Python API?
Edit:
#Seven provided an example of how to deal with this when reusing a single variable. However, it is often a variable scope that is reused, which contains many variables and the functions that manage them. As far as I know, there is no way to reuse a variable scope while applying tf.stop_gradient to all the variables it contains.

To my understanding, when you use A = tf.stop_gradient(A), A will be treated as a constant during backpropagation. Here is an example; maybe it can help you.
import tensorflow as tf

wa = tf.get_variable('a', shape=(), dtype=tf.float32,
                     initializer=tf.constant_initializer(1.5))
b = tf.get_variable('b', shape=(), dtype=tf.float32,
                    initializer=tf.constant_initializer(7))
x = tf.placeholder(tf.float32, shape=())

# The first factor is wrapped in tf.stop_gradient, so it is treated as a
# constant when differentiating: dl/dx = stop_gradient(wa*x) * wa.
l = tf.stop_gradient(wa * x) * (wa * x + b)
op_gradient = tf.gradients(l, x)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
print(sess.run([op_gradient], feed_dict={x: 11}))
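With these values the printed gradient should be 24.75 = stop_gradient(wa*x) * wa = 16.5 * 1.5; without the tf.stop_gradient call it would also pick up the term wa * (wa*x + b) and come out to 60.0.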

I have a workaround for this question: define a custom getter for the variable scope in question, which wraps the default getter with tf.stop_gradient. This turns every variable returned in that scope into a tensor contributing no gradients, though things sometimes get complicated because a Tensor is returned instead of a Variable, such as when using tf.nn.batch_norm. A sketch is below.
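A minimal sketch of this workaround (TF 1.x); the scope name "shared" and the dense layers are placeholders for whatever the reused scope actually builds:
import tensorflow as tf

def stop_gradient_getter(getter, *args, **kwargs):
    # Wrap the default getter: every variable fetched through this scope
    # comes back behind tf.stop_gradient, i.e. as a gradient-free tensor.
    return tf.stop_gradient(getter(*args, **kwargs))

x = tf.placeholder(tf.float32, [None, 4])

# First construction: variables are created and receive gradients as usual.
with tf.variable_scope("shared"):
    y1 = tf.layers.dense(x, 4)

# Reuse of the same scope: the same variables are fetched, but treated as constants.
with tf.variable_scope("shared", reuse=True, custom_getter=stop_gradient_getter):
    y2 = tf.layers.dense(y1, 4)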

Related

How to add after each iteration in tensorflow

I am trying to achieve the following:
compute the losses in the previous 25 predictions and sum them before
computing the gradient. I have tried this:
loss_summation = tf.Variable(0, dtype=tf.dtypes.float32, name="loss")
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=next_element[1], logits=logits2, name="xentropy")
loss = tf.math.reduce_sum(tf.reduce_mean(xentropy, name="loss"))
loss_summation = tf.assign(loss_summation, loss_summation + loss)
optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate)
gvs = optimizer.compute_gradients(loss_summation, [vars])

with tf.Session() as sess:
    for i in range(25):
        b = sess.run([loss_summation])
However optimizer.compute_gradients() complains that "None values not supported". How can I get around this?
I am actually trying to implement the following function (the feedforward pass of an LSTM) in tensorflow, to predict the next word given the previous ones:
def feedforward(self, x_s, hpre, targets, p_s):
    fts, its, gts, css, ots, output, inputs = [], [], [], [], [], [], []
    losses = []
    hprev = hpre
    hts = [hprev]
    loss = 0
    previous_state = p_s
    css.append(previous_state)
    for x, y in zip(x_s, targets):
        # One-hot encode the current word and stack it with the previous hidden state.
        k = np.zeros((self.vocab_size, 1))
        k[x] = 1
        M_c = np.row_stack((hprev, k))
        # Forget, input, and candidate gates.
        ft = self.sigmoid(np.dot(self.W1, M_c) + self.b1)
        fts.append(ft)
        it = self.sigmoid(np.dot(self.W2, M_c) + self.b2)
        its.append(it)
        gt = np.tanh(np.dot(self.W3, M_c) + self.b3)
        gts.append(gt)
        # Cell state update.
        cs = (ft * previous_state) + (it * gt)
        previous_state = cs
        css.append(cs)
        # Output gate and hidden state.
        ot = self.sigmoid(np.dot(self.W4, M_c) + self.b4)
        ots.append(ot)
        ht = ot * np.tanh(cs)
        hts.append(ht)
        yt = self.softmax(np.dot(self.W5, ht) + self.b5)
        hprev = ht
        output.append(yt)
        inputs.append(M_c)
        # Accumulate the cross-entropy loss over the sequence.
        loss += -np.log(yt[y])
        losses.append(loss)
    return fts, its, gts, css, ots, output, hts, loss, hts[-1], css[-1], inputs
x_s is a list of integers representing words:
x_s = [0, 1, 2, 3, 4, 5, 6, 7, 8, ..., 24]
targets is the list of expected integers, i.e. if the current input is 0 then the next word is 1:
targets = [1, 2, 3, 4, 5, 6, 7, 8, 9, ..., 25]
The loss, which is a summation of the 25 per-word losses, is what will be minimized.
There are a few things you need to address here:
Is there a good reason not to just use larger batches? Are you trying to implement the lookahead optimizer or something?
It looks like you're getting started with TensorFlow. Consider turning on eager execution with tf.enable_eager_execution(). TensorFlow 2.0 is coming soon; don't waste your time messing with tf.Session.
Variables are not differentiable, so accumulating the losses in a variable doesn't make sense; the sketch below shows why that produces None gradients.
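A tiny graph-mode sketch of why the posted code reports None gradients (names illustrative): tf.assign has no registered gradient, so differentiating through the accumulator severs the path back to the model variables.
import tensorflow as tf

v = tf.Variable(3.0)
acc = tf.Variable(0.0)
loss = v * v
acc_op = tf.assign(acc, acc + loss)  # accumulate the loss into a variable

print(tf.gradients(acc_op, [v]))  # [None] -- hence "None values not supported"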
I would make a copy of all the model's variables and accumulate new values there. Then, after N iterations, assign those values back to the model. Something like:
model = tf.keras.Sequential(...)
vars = model.trainable_variables
weight_acc = [tf.Variable(var) for var in vars]

for n, (batch, label) in enumerate(dataset):
    with tf.GradientTape() as tape:
        pred = model(batch)
        loss = cal_loss(pred, label)
    # GradientTape's method is `gradient` (singular).
    grads = tape.gradient(loss, vars)
    # Accumulate scaled gradients into the copies.
    for g, a in zip(grads, weight_acc):
        a.assign_add(learning_rate * g)
    # Every 25 steps, blend the accumulated weights back into the model.
    if n % 25 == 0:
        for a, v in zip(weight_acc, vars):
            v.assign_add(lookahead_fraction * (a - v))

How to make tensorflow assignment op part of computational graph without explicitly running its output?

I am trying to create a custom gradient in tensorflow to implement the exponentially smoothed (unbiased) gradient of a logarithm that is suggested in this paper (https://arxiv.org/pdf/1801.04062.pdf). What I need to do is create a new variable that stores an exponentially smoothed value, which is updated and used in a custom gradient function. Additionally, I need a flag which tells me when the first gradient calculation is being done, so I can initialize the exponentially smoothed value to the appropriate (data-dependent) value. Furthermore, the output of the custom gradient function must be just the gradient, so it will be a pain in the butt to access the output of a tf.assign from inside the custom gradient. Lastly, I do not want to create a second operation that 'manually' initializes the exponential smoothing by running it separately in my training loop. Anyway, this is all too complicated, so I have an abstract but simple problem outlined below, the solution to which would solve my problem:
What I need to be able to do is update one variable in a manner conditional upon a second, and furthermore I need to update the second variable without providing it as explicit output of my function. Example code demonstrating my problem is below:
import tensorflow as tf

a = tf.get_variable(name="test", initializer=True)
b = tf.get_variable(name="testval", initializer=10.)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

def make_function(inp):
    with tf.variable_scope("", reuse=True):
        a = tf.get_variable(name="test", dtype=tf.bool)
        b = tf.get_variable(name="testval")
        iftrue = lambda: [tf.assign(b, inp), tf.assign(a, False)]
        iffalse = lambda: [tf.assign(b, (b + inp) / 2), tf.assign(a, False)]
        acond, bcond = tf.cond(a, iftrue, iffalse)
        return acond
I = tf.placeholder(tf.float32)
tcond = make_function(I)
print("{}\tThe initial values of a and b".format(sess.run([a,b])))
print("{}\t\tRun, tcond1. output is the updated value of b.".format(sess.run(tcond,{I:1})))
print("{}\tNow we see that b has been updated, but a has not.".format(sess.run([a,b])))
print("{}\t\tSo now the value is 2 instead of 1.5 like it should be.".format(sess.run(tcond,{I:2})))
The output is:
[True, 10.0] The initial values of a and b
1.0 Run, tcond1. output is the updated value of b.
[True, 1.0] Now we see that b has been updated, but a has not.
2.0 So now the value is 2 instead of 1.5 like it should be.
Now, I understand that I need to have a line like sess.run(acond), where acond is the output of the conditional within make_function, but I can't return that because my function needs to return only the value of b (not a), and I don't want to have to carry around an extra op that I need to remember to run on the first training iteration but not on the others.
So, is there a way to add the assignment op acond to the computational graph without explicitly returning it and running it with sess.run?
Add this operation to a custom collection and then create a dependency between your final op (e.g. the train_op) and your acond.
Inside the method:
tf.add_to_collection("to_run", acond)
In the definition of the final op:
to_run = tf.get_collection("to_run")
with tf.control_dependencies(to_run):
    final_op = <something>
When you run final_op, you are assured that your acond has already been executed.
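A small runnable sketch of this pattern (TF 1.x graph mode); counter and final_op are illustrative stand-ins:
import tensorflow as tf

counter = tf.get_variable("counter", initializer=0.0)
acond = tf.assign_add(counter, 1.0)  # the side-effect op to hide in the graph
tf.add_to_collection("to_run", acond)

with tf.control_dependencies(tf.get_collection("to_run")):
    final_op = tf.identity(counter)  # stands in for the real train_op

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(final_op))  # 1.0: acond ran first via the control dependency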

Weights/bias initialization with constants cntk

Is there a way to initialize weights/biases with constant matrices? E.g., instead of Dense(hidden_layers_dim_1, init=he_normal()), can I do Dense(hidden_layers_dim_1, init=W), where W is a float matrix?
Update-1:
Dense layers seem to accept a numpy array and constant values as initial weights now (cntk-2.0rc3) as per the parameter documentation here.
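A short sketch under that assumption (cntk-2.0rc3+; the dims and values here are made up):
import numpy as np
import cntk
from cntk.layers import Dense

input_dim, hidden_layers_dim_1 = 3, 4
W = np.ones((input_dim, hidden_layers_dim_1), dtype=np.float32)  # constant initial weights

x = cntk.input_variable(input_dim)
z = Dense(hidden_layers_dim_1, init=W)(x)  # per the update above, init accepts a numpy array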
Layers cannot take initial weight values yet. However, you can pass in an initial bias value using the init_bias named argument in any appropriate layer. But if you must use an initial weight value, I guess you have to create the parameters and define your own network, as you have done, i.e.:
features = <your_input_var>
W = cntk.Parameter((<proper_shape>), init=<initial_value>)
B = cntk.Parameter((<proper_shape>), init=<initial_value>)
output = features @ W + B

questions on defining and calling train.momentumoptimizer

I have some questions regarding the following code segment
def _optimizer(self, training_iters, global_step, opt_kwargs={}):
    learning_rate = self.opt_kwargs.pop("learning_rate", 0.2)
    decay_rate = self.opt_kwargs.pop("decay_rate", 0.95)
    self.learning_rate_node = tf.train.exponential_decay(learning_rate=learning_rate,
                                                         global_step=global_step,
                                                         decay_steps=training_iters,
                                                         decay_rate=decay_rate,
                                                         staircase=True)
    optimizer = tf.train.MomentumOptimizer(learning_rate=self.learning_rate_node,
                                           **self.opt_kwargs).minimize(self.net.cost,
                                                                       global_step=global_step)
The input parameter opt_kwargs is set up as opt_kwargs=dict(momentum=0.2).
Why do we need to use self.opt_kwargs.pop("learning_rate", 0.2) to assign learning_rate? My guess is that this way the learning-rate and decay-rate information can be injected into the dict structure of opt_kwargs, but I don't see the real usage here.
Secondly, regarding tf.train.MomentumOptimizer(learning_rate=self.learning_rate_node, **self.opt_kwargs): it looks like **self.opt_kwargs will pass the whole opt_kwargs dict into the MomentumOptimizer. However, according to tf.train.MomentumOptimizer.__init__(learning_rate, momentum, use_locking=False, name='Momentum', use_nesterov=False), it only needs the momentum value. Here we are passing in both the learning_rate and decay_rate included in self.opt_kwargs. Is this a correct way?
1.) The pop calls extract the learning_rate and decay_rate values and feed them to exponential_decay(), which accepts them as individual arguments. Note that pop also removes those keys from the dict, so only the remaining entries (here, momentum) are forwarded to the optimizer. 2.) It's not clean, but it is OK to feed in a dict with extra entries this way. It keeps things flexible, so that e.g. you can easily swap MomentumOptimizer for another optimizer that takes decay_rate etc. as part of its arguments.
tf.train.MomentumOptimizer.__init__(learning_rate, momentum, use_locking=False, name='Momentum', use_nesterov=False) means you need to explicitly pass a momentum value to the constructor. As for self.opt_kwargs.pop, you do not need to pass "learning_rate" or "decay_rate" to your function, since they are given the defaults 0.2 and 0.95.
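A toy sketch of the pop-then-forward pattern described above (plain Python; the values are illustrative):
opt_kwargs = dict(momentum=0.2, learning_rate=0.1, decay_rate=0.9)

# pop() both reads a value (falling back to the default) and removes the key.
learning_rate = opt_kwargs.pop("learning_rate", 0.2)  # 0.1
decay_rate = opt_kwargs.pop("decay_rate", 0.95)       # 0.9

print(opt_kwargs)  # {'momentum': 0.2} -- only momentum is left to forward,
# so tf.train.MomentumOptimizer(learning_rate=..., **opt_kwargs) receives
# exactly the keyword arguments it expects.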

How to assign values to a subset of a tensor in tensorflow?

Two parts to this question:
(1) What is the best way to update a subset of a tensor in tensorflow? I've seen several related questions:
Adjust Single Value within Tensor -- TensorFlow
and
How to update a subset of 2D tensor in Tensorflow?
and I'm aware that Variable objects can be assigned using Variable.assign() (and/or scatter_update, etc.), but it seems very strange to me that tensorflow does not have a more intuitive way to update a part of a Tensor object. I have searched through the tensorflow API docs and Stack Overflow for quite some time now and can't seem to find a simpler solution than what is presented in the links above. This seems particularly odd given that Theano has an equivalent with Tensor.set_subtensor(). Am I missing something, or is there no simple way to do this through the tensorflow API at this point?
(2) If there is a simpler way, is it differentiable?
Thanks!
I suppose the immutability of Tensors is required for the construction of a computation graph; you can't have a Tensor update some of its values without becoming another Tensor, or there would be nothing to put in the graph before it. The same issue comes up in Autograd.
It's possible to do this (but ugly) using boolean masks (make them variables and use assign, or even define them beforehand in numpy). That would be differentiable, but in practice I'd avoid having to update subtensors; a masked-update sketch follows.
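A minimal sketch of the masked-update idea, assuming the mask is fixed and known up front:
import tensorflow as tf

a = tf.constant([0., 1., 2., 3.])
mask = tf.constant([0., 1., 1., 0.])        # 1 where values should be replaced
new_full = tf.constant([0., 10., 20., 0.])  # replacement values in a full-size tensor

# Differentiable w.r.t. both a and new_full: evaluates to [0., 10., 20., 3.]
updated = a * (1. - mask) + new_full * mask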
If you really have to (and I really hope there is a better way), here is a way to do it in 1D using tf.dynamic_stitch and tf.setdiff1d:
def set_subtensor1d(a, b, slice_a, slice_b):
    # a[slice_a] = b[slice_b]
    a_range = tf.range(a.shape[0])
    _, a_from = tf.setdiff1d(a_range, a_range[slice_a])
    a_to = a_from
    b_from, b_to = tf.range(b.shape[0])[slice_b], a_range[slice_a]
    return tf.dynamic_stitch([a_to, b_to],
                             [tf.gather(a, a_from), tf.gather(b, b_from)])
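A hypothetical usage check of set_subtensor1d (TF 1.x graph mode):
import tensorflow as tf

a = tf.constant([0., 1., 2., 3., 4.])
b = tf.constant([10., 20.])

# Emulates a[1:3] = b[0:2] without mutating a.
c = set_subtensor1d(a, b, slice(1, 3), slice(0, 2))

with tf.Session() as sess:
    print(sess.run(c))  # [ 0. 10. 20.  3.  4.]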
For higher dimensions this could be generalised by abusing reshape (where nd_slice could be implemented like this but there is probably a better way):
def set_subtensornd(a, b, slice_tuple_a, slice_tuple_b):
    # a[*slice_tuple_a] = b[*slice_tuple_b]
    a_range = tf.range(tf.reduce_prod(tf.shape(a)))
    a_idxed = tf.reshape(a_range, tf.shape(a))
    a_dropped = tf.reshape(nd_slice(a_idxed, slice_tuple_a), [-1])
    _, a_from = tf.setdiff1d(a_range, a_dropped)
    a_to = a_from
    b_range = tf.range(tf.reduce_prod(tf.shape(b)))
    b_idxed = tf.reshape(b_range, tf.shape(b))
    b_from = tf.reshape(nd_slice(b_idxed, slice_tuple_b), [-1])
    b_to = a_dropped
    a_flat, b_flat = tf.reshape(a, [-1]), tf.reshape(b, [-1])
    stitched = tf.dynamic_stitch([a_to, b_to],
                                 [tf.gather(a_flat, a_from), tf.gather(b_flat, b_from)])
    return tf.reshape(stitched, tf.shape(a))
I have no idea how slow this will be. I'd guess quite slow. And, I haven't tested it much beyond running it on a couple of tensors.