Tensorflow: How to Manually Edit Gradient Values

I am reading gradient values from an outside source (i.e. the computation is done elsewhere, but I want to accumulate the different sources in a "master" network), and I would like to just use the apply_gradients() op in TensorFlow. The problem is, the gradients arrive as plain floats. Is there any way I can use the float array to apply the gradients with the built-in Optimizer functions?
In a very minimal example / test case, this is what I would essentially like to do:
W = tf.Variable(1.0)
b = tf.Variable(2.0)
trainable_variables = [W, b]
gradients = [0.05, 0.01]  # example gradients for W, b
optimizer = tf.train.GradientDescentOptimizer(0.1)  # any built-in optimizer
# ... somehow make this gradient list into tensors
optimizer.apply_gradients(zip(gradients_tensor, trainable_variables))

There are many ways of doing so. In particular, you can just create placeholders for your external gradients and combine them with the locally computed ones by simple arithmetic before calling apply_gradients:
x = tf.Variable( ... )
f = x ** 2
grads = tf.gradients(f, [x])  # list of gradient tensors, one per variable
my_gradient = {x.name: tf.placeholder( ... )}  # same shape and dtype as x
grads_and_vars = [(grad + my_gradient[var.name], var)
                  for grad, var in zip(grads, [x])]
optimizer.apply_gradients(grads_and_vars)
Now, during each optimisation step, just feed the externally computed values for the placeholders in my_gradient through feed_dict.
If they do not change over time, you could use tf.constant() instead, but I can't see any mathematical situation in ML that calls for a constant (and non-zero) gradient.
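For completeness, here is a minimal end-to-end sketch of this pattern in TF1 style (the GradientDescentOptimizer and the learning rate are just placeholders; pick whatever optimizer you actually use): one placeholder per variable, fed with the externally computed floats.

import tensorflow as tf

W = tf.Variable(1.0)
b = tf.Variable(2.0)
trainable_variables = [W, b]

# one placeholder per variable, matching its shape and (base) dtype
grad_placeholders = [tf.placeholder(v.dtype.base_dtype, shape=v.get_shape())
                     for v in trainable_variables]

optimizer = tf.train.GradientDescentOptimizer(0.1)
apply_op = optimizer.apply_gradients(zip(grad_placeholders, trainable_variables))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    external_grads = [0.05, 0.01]  # gradients computed elsewhere, as plain floats
    sess.run(apply_op, feed_dict=dict(zip(grad_placeholders, external_grads)))
    print(sess.run([W, b]))  # [0.995, 1.999]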

Related

How to map an array of values for y_true to a single value in order to compare to y_pred in a Tensorflow loss function (Tensorflow/Tensorflow Quantum)

I am trying to implement the circuits listed on page 8 of the following paper: https://arxiv.org/pdf/1905.10876.pdf using Tensorflow Quantum (TFQ). I have done so previously for a subset of circuits using Qiskit, and ended up with accuracies that can be found on page 14 of the following paper: https://arxiv.org/pdf/2003.09887.pdf. In TFQ, my accuracies are way down. I think this delta originates because in TFQ I only used one observable, a Pauli Z operator on the first qubit, and the circuits do not seem to "transfer all knowledge" to the first qubit. I put this in quotes because I am sure there is a better way to describe it. In Qiskit, on the other hand, 16 states (4^2) get mapped to 2 states.
My question: how can I get my accuracies back up?
Potential answer a): some method of "transferring all information" to a single qubit, potentially an ancilla qubit, and doing a readout on this qubit.
Potential answer b) placing a Pauli Z observable on all qubits (4 in total), mapping half of the 16 states to a label 0 and the other half to a label 1. I attempted this in the code below.
My attempt at answer b):
I have a Tensorflow Quantum (TFQ) circuit implemented in Tensorflow. The circuit has multiple observables, which I try to bring together in my loss function. I prefer to use as many standard components as possible, but I need to map my quantum states to a label in order to determine the loss. I think what I am trying to achieve is not unique to TFQ. I define my model in the following way:
def circuit():
    data_qubits = cirq.GridQubit.rect(4, 1)
    circuit = cirq.Circuit()
    ...
    return circuit, [cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]),
                     cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])]

model_circuit, model_readout = circuit()

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(), dtype=tf.string),
    # The PQC layer returns the expected value of the readout gate, range [-1,1].
    tfq.layers.PQC(model_circuit, model_readout),
])

# compile model
model.compile(
    loss=loss_mse,
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
    metrics=[])
In loss_mse (mean squared error) I receive a (32, 4) tensor for y_pred. One row could look like
[-0.2, 0.33, 0.6, 0.3]
This would first have to be mapped from [-1,1] to a binarized version of [0,1], so that it looks like:
[0, 1, 1, 1]
Then a table lookup needs to happen, which tells whether this combination is 0 or 1. Finally, the regular (y_true-y_pred)^2 can be computed for that row, followed by an np.sum over all rows. I tried to implement this:
def get_label(measurement):
    if measurement == [0,0,0,0]: return 0
    ...
    elif measurement == [1,1,1,1]: return 0
    else: return -1

def py_call(y_true, y_pred):
    # cast tensor to numpy
    y_pred_np = np.asarray(y_pred)
    loss = np.zeros(len(y_pred_np))  # could be a single variable with += within the loop
    # evaluate all 32 samples
    for pred in range(len(y_pred_np)):
        # map, binarize and look up
        y_labelled = get_label([0 if y < 0 else 1 for y in y_pred_np[pred]])
        # regular loss comparison
        loss[pred] = (y_labelled - y_true[pred])**2
    # reduce
    loss = np.sum(loss) / len(y_true)
    return loss

@tf.function
def loss_mse(y_true, y_pred):
    loss = tf.py_function(py_call, inp=[y_true, y_pred], Tout=[tf.float64])
    return loss
However, the system still appears to expect a (32, 4) tensor, whereas I would have thought I could simply return a single loss value (a float). My question: how can I map multiple values for y_true to a single number in order to compare it with a single y_pred value in a tensorflow loss function?
So it looks like there are a couple of things going on here. To answer your question:
how can I map multiple values for y_true to a single number in order to compare with a single y_pred value in a tensorflow loss function?
What you might want is some kind of tf.reduce_* function, like tf.reduce_mean or tf.reduce_sum. These functions apply a reduction operation across a given tensor axis, letting you convert a tensor of shape (32, 4) into a tensor of shape (32,) or a tensor of shape (4,). Here is a quick snippet:
@tf.function
def my_loss(y_true, y_pred):
    # y_true is shape (32, 4)
    # y_pred is shape (32, 4)
    # Scale from [-1, 1] to [0, 1]
    y_true = (y_true + 1) / 2
    y_pred = (y_pred + 1) / 2
    # These are now both (32,), with the mean taken along the second axis.
    reduced_true = tf.reduce_mean(y_true, axis=1)
    reduced_pred = tf.reduce_mean(y_pred, axis=1)
    # Now a scalar loss.
    loss = tf.reduce_mean((reduced_true - reduced_pred) ** 2)
    return loss
Now the above isn't exactly what you want, since it's not super clear, to me at least, what exact reduction rules you have in mind for taking something like [0,1,1,1] -> 0 vs [0,0,0,0] -> 1.
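If the mapping really does have to be a hard table lookup, here is a hedged sketch of the binarize-and-look-up step in pure TF ops. The 16-entry table is hypothetical; only its first and last entries follow the get_label rules shown above. Note that the thresholding is non-differentiable, so no gradient will flow through it, which is one more argument for a smooth reduction at training time:

import tensorflow as tf

# hypothetical 16-entry label table, indexed by the 4 binarized bits
table = tf.constant([0.] + [1.] * 14 + [0.])

def lookup_labels(y_pred):  # y_pred: (batch, 4), values in [-1, 1]
    bits = tf.cast(y_pred > 0, tf.int32)          # binarize to {0, 1}
    weights = tf.constant([8, 4, 2, 1])           # read the 4 bits as an integer
    idx = tf.reduce_sum(bits * weights, axis=1)   # (batch,) indices into the table
    return tf.gather(table, idx)                  # (batch,) labels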
Another thing I will also mention: if you want just the sum of the Pauli operators that you have term by term in the list [cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]), cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])], and all you care about is the final sum of these expectations, you could just as easily do:
my_operator = sum([cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]),
                   cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])])
print(my_operator)
Which should give something like:
cirq.PauliSum(cirq.LinearDict({frozenset({(cirq.GridQubit(0, 0), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 1), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 2), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 3), cirq.Z)}): (1+0j)}))
Which is also compatible as a readout operation in the PQC layer. Lastly, I would recommend reading through some of the snippets and examples here:
https://www.tensorflow.org/quantum/api_docs/python/tfq/layers/PQC
and here:
https://www.tensorflow.org/quantum/api_docs/python/tfq/layers/Expectation
These give a pretty good description of how the input and output signatures of the functions look, as well as the shapes you can expect from them.
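To make the operator-sum point concrete, here is a hedged sketch of plugging the summed operator into the PQC layer, reusing model_circuit and the 4x1 qubit grid from the question (with a single PauliSum readout, the layer outputs shape (batch, 1)):

import cirq
import tensorflow as tf
import tensorflow_quantum as tfq

data_qubits = cirq.GridQubit.rect(4, 1)
readout_op = sum(cirq.Z(q) for q in data_qubits)  # a single cirq.PauliSum

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(), dtype=tf.string),
    tfq.layers.PQC(model_circuit, readout_op),  # output shape (batch, 1)
])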

Where is backpropagation performed in this example

I have an example of DNN learning XOR (right click to open in new tab): https://colab.research.google.com/drive/1M5xFp4gaXPCbnejM8-5_yLp1B6UvwdL8
I'm confused in these 2 lines (related to backpropagation):
Grads = T.gradient(Loss,[W1,B1,W2,B2]);
Optim.apply_gradients(zip(Grads,[W1,B1,W2,B2]));
I'm guessing the backward pass happens at T.gradient, since those are the gradient values related to the loss, but I'm still not sure. The questions are:
Question 1: Is there backpropagation (the backward pass) in those 2 lines?
Question 2: If there is backpropagation, does it happen in T.gradient or in Optim.apply_gradients?
Question 3: Because backpropagation is done backward, is the order of [W1,B1,W2,B2] important? I believed, e.g., that the shuffled order [B1,W2,B2,W1] couldn't work, because backpropagation needs the layer order from output back to input.
From my experiments, the optimisation still works when I shuffle the order of the weights and biases in the variable array. But backpropagation needs the layer order from output back to input, so I don't get this.
Source code:
#!pip install tensorflow==2.0.0rc2
%tensorflow_version 2.x
%reset -f

#libs
import tensorflow as tf;

#data
X = [[0,0],[0,1],[1,0],[1,1]];
Y = [[0], [1], [1], [0] ];
X = tf.convert_to_tensor(X,tf.float32);
Y = tf.convert_to_tensor(Y,tf.float32);

#model
W1 = tf.Variable(tf.random.uniform([2,20],-1,1));
B1 = tf.Variable(tf.random.uniform([ 20],-1,1));
W2 = tf.Variable(tf.random.uniform([20,1],-1,1));
B2 = tf.Variable(tf.random.uniform([ 1],-1,1));

#@tf.function
def feedforward(X):
    H1 = tf.nn.leaky_relu(tf.matmul(X,W1) + B1);
    Out = tf.sigmoid(tf.matmul(H1,W2) + B2);
    return Out;
#end def

#train
Optim = tf.keras.optimizers.SGD(1e-1);
Steps = 1000;

for I in range(Steps):
    if I%(Steps/10)==0:
        Out = feedforward(X);
        Loss = tf.reduce_sum(tf.square(Y-Out));
        print("Loss:",Loss.numpy());
    #end if

    with tf.GradientTape() as T:
        Out = feedforward(X);
        Loss = tf.reduce_sum(tf.square(Y-Out));
    #end with

    #BACKPROPAGATION HERE?
    Grads = T.gradient(Loss,[W1,B1,W2,B2]);
    Optim.apply_gradients(zip(Grads,[W1,B1,W2,B2]));
#end for

Out = feedforward(X);
Loss = tf.reduce_sum(tf.square(Y-Out));
print("Loss:",Loss.numpy(),"(Last)");

print("\nDone.");
#eof
Let's take this one step at a time.
Step 1: Calculation of Gradients:
Grads = T.gradient(Loss,[W1,B1,W2,B2])
Here, we calculate the gradients of the loss with respect to the variables in the provided list. The list of gradients is indexed based on the indices of the variables. This means that Grads[0] will be the gradients with respect to W1, and so on.
Step 2: Next, we perform the update. This is done in:
Optim.apply_gradients(zip(Grads,[W1,B1,W2,B2]))
Here, Grads[0] are used to update W1, Grads[1] to update B1 and so on.
Note that gradient calculation and the update steps are performed separately. So as long as the variables appear in the same order in both lists, there shouldn't be any problems.
Also, GradientTape has to be used with Eager Execution.
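To see the ordering point concretely, here is a small sketch (reusing the names from the question's script, with a fresh tape) showing that a consistently shuffled order produces the same updates:

shuffled_vars = [B1, W2, B2, W1]  # any order works, as long as both lists match
with tf.GradientTape() as T2:
    Loss = tf.reduce_sum(tf.square(Y - feedforward(X)))
shuffled_grads = T2.gradient(Loss, shuffled_vars)
Optim.apply_gradients(zip(shuffled_grads, shuffled_vars))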
With TensorFlow 2 in its default eager mode, and even without the @tf.function decorator (commented out above) to build a graph, TensorFlow still tracks the relations between tensors during the calculation: https://stats.stackexchange.com/a/272000/142160
TensorFlow tracks every variable here:
with tf.GradientTape() as T:
    Out = feedforward(X);
    Loss = tf.reduce_sum(tf.square(Y-Out));
This is automatic differentiation (rather than symbolic or numerical differentiation), and thus all gradients obtained by the following call are already at their proper depths in backpropagation (just like the backward loop that calculates the errors at all layers):
Grads = T.gradient(Loss,[W1,B1,W2,B2]);
After that, the optimiser applies the gradients to change the weights and biases:
Optim.apply_gradients(zip(Grads,[W1,B1,W2,B2]));

How can I replace a variable with another one in Tensorflow's computation graph?

Problem: I have two pretrained models with variables W1,b1 and W2,b2 saved as numpy arrays.
I want to set a mixture of these two pretrained models as the variables of my model, and only update the mixture weights alpha1 and alpha2 during training.
In order to do that, I create two variables alpha1 and alpha2, load the numpy arrays, and create the mixture nodes W_new and b_new.
I want to replace W and b in the computation graph with W_new and b_new, and then only train the alpha1 and alpha2 parameters via opt.minimize(loss, var_list=[alpha1, alpha2]).
I don't know how to put W_new and b_new into the computation graph in place of W and b. I tried assigning tf.trainable_variables()[0] = W_new, but this doesn't work.
I'd appreciate it if anyone could give me some clues.
Note 1: I don't want to assign values to W and b (this would disconnect the graph from alpha1 and alpha2); I want the mixture of parameters to be a part of the graph.
Note 2: You might say that you could compute y using the new variables, but the problem is, the code here is just a toy example to simplify things. In reality, instead of linear regression I have several BiLSTMs with a CRF, so I can't manually compute the formula. I'll have to replace these variables in the graph.
import tensorflow as tf
import numpy as np

np.random.seed(7)
tf.set_random_seed(7)

#define a linear regression model with 10 params and 1 bias
with tf.variable_scope('main'):
    X = tf.placeholder(name='input', dtype=tf.float32)
    y_gold = tf.placeholder(name='output', dtype=tf.float32)
    W = tf.get_variable('W', shape=(10, 1))
    b = tf.get_variable('b', shape=(1,))
    y = tf.matmul(X, W) + b
    #loss = tf.losses.mean_squared_error(y_gold, y)

#numpy matrices saved from two different trained models with the exact same architecture
W1 = np.random.rand(10, 1).astype(np.float32)
W2 = np.random.rand(10, 1).astype(np.float32)
b1 = np.random.rand(1).astype(np.float32)
b2 = np.random.rand(1).astype(np.float32)

with tf.variable_scope('mixture'):
    alpha1 = tf.get_variable('alpha1', shape=(1,))
    alpha2 = tf.get_variable('alpha2', shape=(1,))
    W_new = alpha1 * W1 + alpha2 * W2
    b_new = alpha1 * b1 + alpha2 * b2

all_trainable_vars = tf.trainable_variables()
print(all_trainable_vars)

#replace the original W and b with the new mixture variables in the computation graph (**doesn't do what I want**)
all_trainable_vars[0] = W_new
all_trainable_vars[1] = b_new
#this doesn't work

#note that I could just do the computation for y using the new variables as y = tf.matmul(X, W_new) + b_new
#but the problem is this is just a toy example. In the real world, my model has a big architecture with several
#bilstms whose variables I want to replace with these new ones.

#Now what I need is to replace the W and b trainable parameters (items 0 and 1 in all_trainable_vars)
#with W_new and b_new in the computation graph.

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    train_writer = tf.summary.FileWriter('./' + 'graph', sess.graph)
    #print(sess.run([W, b]))
    #give the model 3 samples and predict on them
    print(sess.run(y, feed_dict={X: np.random.rand(3, 10)}))
Why do I want to do this?
Assume you have several pretrained models (in different domains), but you don't have access to any of their data. Then you have a little training data from another domain that doesn't give you much performance on its own; but if you could train the model jointly with the data that you don't have, you could get good performance.
Assuming the data is somehow represented in the trained models, we want to learn a mixture of the pretrained models by learning the mixing coefficients, using the little labelled data that we have as supervision.
We don't want to pretrain any parameters; we only want to learn a mix of pretrained models. What are the mixture weights? We need to learn that from the little supervision that we have.
Update 1:
I realised I could set the parameters of the model before I create it, as:
model = Model(W_new, b_new)
But as I said, my real model uses several tf.contrib.rnn.LSTMCell objects, so I would need to hand the LSTMCell class the new variables instead of letting it create its own. I guess I'll need to subclass the LSTMCell class and make the changes. My question now is whether there is an easy way to do this. Maybe I should ask this as a new question.
What I want to do:
W = tf.get_variable(...)
b = tf.get_variable(...)
cell_fw = tf.contrib.rnn.LSTMCell(W, b, state_is_tuple=True)
I created a separate question for this here, because it might be useful to others for different reasons.
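Not from the original post, but one standard TF1 trick for exactly this kind of substitution is a custom_getter on the variable scope: it intercepts every tf.get_variable call made inside the scope, including the ones tf.contrib.rnn.LSTMCell makes internally, so no subclassing is needed. A hedged sketch reusing the names from the toy example above (the name-matching rule and the 'mixed' scope are placeholders):

W1_c = tf.constant(W1, dtype=tf.float32)
W2_c = tf.constant(W2, dtype=tf.float32)
b1_c = tf.constant(b1, dtype=tf.float32)
b2_c = tf.constant(b2, dtype=tf.float32)

mixtures = {
    'mixed/W': alpha1 * W1_c + alpha2 * W2_c,
    'mixed/b': alpha1 * b1_c + alpha2 * b2_c,
}

def mixture_getter(getter, name, *args, **kwargs):
    # return the mixture tensor instead of creating a fresh variable
    if name in mixtures:
        return mixtures[name]
    return getter(name, *args, **kwargs)

with tf.variable_scope('mixed', custom_getter=mixture_getter):
    W_m = tf.get_variable('W', shape=(10, 1))  # actually the mixture tensor
    b_m = tf.get_variable('b', shape=(1,))
    y_mixed = tf.matmul(X, W_m) + b_m  # gradients flow only to alpha1 and alpha2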

Breaking TensorFlow gradient calculation into two (or more) parts

Is it possible to use TensorFlow's tf.gradients() function in parts? That is, to calculate the gradient of the loss w.r.t. some tensor, and of that tensor w.r.t. the weights, and then multiply them to get back the original gradient of the loss w.r.t. the weights?
For example, let W, b be some weights, let x be the input of a network, and let y0 denote the labels.
Assume a forward graph such as
h=Wx+b
y=tanh(h)
loss=mse(y-y0)
We can calculate tf.gradients(loss,W) and then apply (skipping some details) optimizer.apply_gradients() to update W.
I then try to extract an intermediate tensor, using var = tf.get_default_graph().get_tensor_by_name(...), and then calculate two gradients: g1 = tf.gradients(loss, var) and g2 = tf.gradients(var, W).
I would then, by the chain rule, expect the dimensions of g1 and g2 to work out so that I can write g = g1 * g2 in some sense, and get back tf.gradients(loss, W).
Unfortunately, this is not the case: the dimensions are incorrect. Each gradient's dimensions will be those of its "w.r.t." variable, so there is no correspondence between the first gradient and the second one. What am I missing, and how can I do this?
Thanks.
tf.gradients sums over the gradients of the input tensor. To avoid this, you have to split the tensor into scalars and apply tf.gradients to each of them:
import tensorflow as tf

x = tf.ones([1, 10])
w = tf.get_variable("w", initializer=tf.constant(0.5, shape=[10, 5]))
out = tf.matmul(x, w)
out_target = tf.constant(0., shape=[5])
loss = tf.reduce_mean(tf.square(out - out_target))

grad = tf.gradients(loss, x)[0]
part_grad_1 = tf.gradients(loss, out)[0]  # shape (1, 5)
# one gradient per scalar output, stacked into the (5, 10) Jacobian
part_grad_2 = tf.concat([tf.gradients(i, x)[0]
                         for i in tf.split(out, 5, axis=1)], axis=0)
grad_by_parts = tf.matmul(part_grad_1, part_grad_2)  # shape (1, 10)

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    print(sess.run(grad))
    print(sess.run(grad_by_parts))
From the docs, tf.gradients (emphasis mine)
constructs symbolic derivatives of sum of ys w.r.t. x in xs.
If any tensor in ys is multidimensional, it is reduce_summed before the resulting list of scalars is itself summed, before being differentiated. This is why the output gradient has the same size as the xs.
This also explains why losses can be multidimensional in tensorflow: they are implicitly summed over before differentiation.
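A two-line sketch of that implicit summation, reusing the graph above (loss_vec here is a hypothetical non-scalar loss): differentiating the raw tensor and differentiating its explicit sum give the same result.

loss_vec = tf.square(out - out_target)             # shape (1, 5), not a scalar
g_a = tf.gradients(loss_vec, x)[0]
g_b = tf.gradients(tf.reduce_sum(loss_vec), x)[0]  # identical to g_a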
For future readers: TensorFlow has made some advances, and as of TF 2.7 (and maybe even earlier versions) you can use tf.GradientTape.jacobian to avoid the sum over the target's dimensions.
https://www.tensorflow.org/guide/advanced_autodiff#jacobians
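Here is a minimal TF2 sketch of the chain-rule split using tf.GradientTape.jacobian, with the same shapes as the snippet above (the reshape collapses the Jacobian's singleton batch dimensions):

import tensorflow as tf

x = tf.Variable(tf.ones([1, 10]))
w = tf.Variable(tf.fill([10, 5], 0.5))

with tf.GradientTape(persistent=True) as tape:
    out = tf.matmul(x, w)                  # shape (1, 5)
    loss = tf.reduce_mean(tf.square(out))  # scalar

g1 = tape.gradient(loss, out)              # d loss / d out, shape (1, 5)
jac = tape.jacobian(out, x)                # d out / d x, shape (1, 5, 1, 10)
g2 = tf.reshape(jac, [5, 10])

grad_by_parts = tf.matmul(g1, g2)          # chain rule, shape (1, 10)
print(grad_by_parts)
print(tape.gradient(loss, x))              # matches grad_by_parts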

Attentive Convolution with keras

I've implemented an attentive convolution layer in keras, as described in this paper.
You can see the code for it at this gist
I'm new to implementing custom layers, and mine is still very slow. I'm using a lot of tf.map_fn, and I think this is the reason it's so slow, but I don't know another way to do it.
It would be nice if someone had some tips on how to improve the layer, or general tips on implementing custom layers, such as how to avoid backend (tensorflow) functions.
I'm using keras 2.1.3 and tensorflow 1.5 as backend.
Thanks
I don't see why you use tf.map_fn; you could avoid it everywhere.
Here are some hints (which may or may not make the code faster).
Casting
Do you really need to cast the values to float? If (at least) x[0] is an embedding, it's already a float, right? (Not sure about the nature of "context")
Lines 37 and 38:
text = x[0]
context = x[1]
Why map functions that are already supported in keras?
For instance, why do this (L42):
weighted_attentive_context = tf.map_fn(self._compute_attentive_context, (text, context), dtype=K.floatx())
When you can do this:
weighted_attentive_context = self._compute_attentive_context(text, context)
With:
def _compute_attentive_context(self, text, context):
Suggestion for _compute_attentive_context:
def _compute_attentive_context(self, text, context):
    # computes the context score for every vector, as in equation 2
    temp = tf.matmul(text, self.We)
    scores = tf.matmul(temp, K.transpose(context))
    # why not?
    scores_softmax = K.softmax(scores)
    # computes the context feature map, as in equation 4
    res = tf.matmul(scores_softmax, context)
    # why not?
    res = self._weight_for_output(res)
    return res
And why not use a K.conv1d instead of all these complicated repeats, concatenations, etc.?

def _conv(self, x):
    # if you have special reasons for what you're doing, please share them in the comments,
    # and please also share the exact shapes of the inputs and desired outputs.
    # Here, self.W1 should have shape (filter_length, em_dim, desired_output_dim)
    return K.conv1d(x, self.W1, padding='same')
Suggestion for call:
def call(self, x, mask=None):
    # x is a list of two tensors
    text = x[0]
    context = x[1]
    # applies the bilinear energy function (text * We * context)
    # and weights the computed feature map as in equation 6 (W2 * ci)
    weighted_attentive_context = self._compute_attentive_context(text, context)
    # does the actual convolution; this is still kind of hacky
    conv = K.conv1d(text, self.W1, padding='same')
    added = conv + weighted_attentive_context
    batch = K.bias_add(added, self.bias)
    return batch
Batch matrix multiplication
For those multiplications, you can use K.dot(), following this:
If batch x weights: K.dot(x, self.W)
If weights x batch: K.permute_dimensions(K.dot(self.W,x),(1,0,2))
Considering you have these shapes:
If batch x weights -> x: (batch, words, emb) | W: (emb, any)
If weights x batch -> W: (any, words) | x: (batch, words, emb)
The results will be:
If batch x weights: (batch, words, any) <- this seems the logical choice
If weights x batch: (batch, any, emb)
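As a quick sanity check of the first recipe, a tiny sketch with hypothetical shapes (batch=2, words=7, emb=5, any=3):

import numpy as np
from keras import backend as K

x = K.constant(np.random.rand(2, 7, 5))  # (batch, words, emb)
W = K.constant(np.random.rand(5, 3))     # (emb, any)

out = K.dot(x, W)                        # contracts the emb axes
print(K.int_shape(out))                  # (2, 7, 3) = (batch, words, any)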