How to compute gradients with tf.scatter_sub? - tensorflow

When implementing lambda-opt (an algorithm published at KDD '19) in TensorFlow, I ran into a problem computing gradients with tf.scatter_sub.
θ refers to an embedding matrix for docid.
The formulation is
θ(t+1) = θ(t) - α * (grad + 2*λ*θ)
delta = theta_grad_no_reg.values * lr + 2 * lr * cur_scale * cur_theta
next_theta_tensor = tf.scatter_sub(theta, theta_grad_no_reg.indices, delta)
Then I use θ(t+1) for some computation. Finally, I want to compute gradients with respect to λ, not θ.
But the gradient is None.
I wrote a demo like this:
import tensorflow as tf

w = tf.constant([[1.0], [2.0], [3.0]], dtype=tf.float32)
y = tf.constant([5.0], dtype=tf.float32)

# θ
emb_matrix = tf.get_variable("embedding_name", shape=(10, 3),
                             initializer=tf.random_normal_initializer(), dtype=tf.float32)
# get one row of the embedding
cur_emb = tf.nn.embedding_lookup(emb_matrix, [0])

# the λ matrix
doc_lambda = tf.get_variable(name='docid_lambda', shape=(10, 3),
                             initializer=tf.random_normal_initializer(), dtype=tf.float32)
# get one row of λ
cur_lambda = tf.nn.embedding_lookup(doc_lambda, [0])

# θ(t+1): Tensor("ScatterSub:0", shape=(10, 3), dtype=float32_ref)
next_emb_matrix = tf.scatter_sub(emb_matrix, [0], cur_emb * cur_lambda)
# do some computation with the θ(t+1) Tensor, not the Variable
next_cur_emb = tf.nn.embedding_lookup(next_emb_matrix, [0])

y_ = tf.matmul(next_cur_emb, w)
loss = tf.reduce_mean((y - y_) ** 2)

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
grad_var_list = optimizer.compute_gradients(loss)
print(grad_var_list)
# [(None, <tf.Variable 'embedding_name:0' shape=(10, 3) dtype=float32_ref>),
#  (None, <tf.Variable 'docid_lambda:0' shape=(10, 3) dtype=float32_ref>)]
The gradient is None here, too. It seems that the tf.scatter_sub op doesn't provide a gradient?
Thanks for your help!
If you're interested in the algorithm you can look it up, but it isn't important for this question.
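A minimal sketch of a workaround (my own addition, not from the original post), under the assumption that only row 0 of θ(t+1) is needed downstream: tf.scatter_sub updates the variable in place and, as the demo shows, no gradient flows through it, so the updated row can instead be built with ordinary non-mutating tensor ops, which keeps the path to docid_lambda differentiable:

# differentiable alternative to the tf.scatter_sub line in the demo above
next_cur_emb = cur_emb - cur_emb * cur_lambda       # same value as row 0 of θ(t+1)
y_ = tf.matmul(next_cur_emb, w)
loss = tf.reduce_mean((y - y_) ** 2)
grad_var_list = optimizer.compute_gradients(loss)   # docid_lambda now receives a gradient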

Related

How does BatchNormalization work on an example?

I am trying to understand batchnorm.
My humble example:
import numpy as np
import tensorflow as tf

layer1 = tf.keras.layers.BatchNormalization(scale=False, center=False)
x = np.array([[3., 4.]])
out = layer1(x)
print(out)
Prints
tf.Tensor([[2.99850112 3.9980015 ]], shape=(1, 2), dtype=float64)
My attempt to reproduce it:
e = 0.001
m = np.sum(x) / 2
b = np.sum((x - m)**2) / 2
x_ = (x - m) / np.sqrt(b + e)
print(x_)
It prints
[[-0.99800598 0.99800598]]
What am I doing wrong?
Two problems here.
First, batch norm has two "modes": training, where normalization is done via the batch statistics, and inference, where normalization is done via "population statistics" collected from batches during training. By default, Keras layers/models operate in inference mode, and you need to specify training=True in the call to change this (there are other ways, but that is the simplest one).
layer1 = tf.keras.layers.BatchNormalization(scale=False, center=False)
x = np.array([[3.,4.]], dtype=np.float32)
out = layer1(x, training=True)
print(out)
This prints tf.Tensor([[0. 0.]], shape=(1, 2), dtype=float32). Still not right!
Second, batch norm normalizes over the batch axis, separately for each feature. However, the way you specify the input (as a 1x2 array) is basically a single input (batch size 1) with two features. Batch norm just normalizes each feature to mean 0 (standard deviation is not defined). Instead, you want two inputs with a single feature:
layer1 = tf.keras.layers.BatchNormalization(scale=False, center=False)
x = np.array([[3.],[4.]], dtype=np.float32)
out = layer1(x, training=True)
print(out)
This prints
tf.Tensor(
[[-0.99800634]
[ 0.99800587]], shape=(2, 1), dtype=float32)
Alternatively, specify the "feature axis":
layer1 = tf.keras.layers.BatchNormalization(axis=0, scale=False, center=False)
x = np.array([[3.,4.]], dtype=np.float32)
out = layer1(x, training=True)
print(out)
Note that the input shape is "wrong", but we told batchnorm that axis 0 is the feature axis (it defaults to -1, the last axis). This will also give the desired result:
tf.Tensor([[-0.99800634 0.99800587]], shape=(1, 2), dtype=float32)
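As a sanity check (my addition, not part of the original answer), the result can be reproduced by hand with numpy by normalizing over the batch axis with the layer's default epsilon of 1e-3:

import numpy as np

x = np.array([[3.], [4.]], dtype=np.float32)  # two samples, one feature
eps = 1e-3                                    # default epsilon of keras BatchNormalization
mean = x.mean(axis=0)                         # per-feature mean over the batch axis
var = x.var(axis=0)                           # per-feature (biased) variance over the batch axis
print((x - mean) / np.sqrt(var + eps))
# ≈ [[-0.998006], [0.998006]], matching the layer output up to float32 rounding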

How to change the LSTMCell weight format from tensorflow to tf.keras

I have some old code written for TensorFlow that I want to make work with TensorFlow 2 / tf.keras. I would like to keep the same LSTM weights, but cannot figure out how to convert the format.
I have the old weights saved in a checkpoint file, and also have them saved in csv files.
My old code looks something like this:
input_placeholder = tf.placeholder(tf.float32, [None, None, input_units])
lstm_layers = [tf.nn.rnn_cell.LSTMCell(layer_size), tf.nn.rnn_cell.LSTMCell(layer_size)]
stacked = tf.contrib.rnn.MultiRNNCell(lstm_layers)
features, state = tf.nn.dynamic_rnn(stacked, input_placeholder, dtype=tf.float32)
And my new code looks something like this:
input_placeholder = tf.placeholder(tf.float32, [None, None, input_units])
lstm_layers = [tf.keras.layers.LSTMCell(layer_size),tf.keras.layers.LSTMCell(layer_size)]
stacked = tf.keras.layers.StackedRNNCells(lstm_layers)
features = stacked(input_placeholder)
... #later in the code
features.set_weights(previous_weights)
The old bias seems to match the new bias.
The old kernel seems to be the concatenation of the kernel and recurrent kernel.
I am able to load the previous_weights into the model (have explicitly checked the weights loaded correctly), however tests I have fail to produce the same result.
Digging into the source code, the kernels seem to have a different format under the hood.
Is it possible to calculate the kernel and recurrent_kernel (tf.keras) using these old saved kernel weights?
Links if they're helpful:
https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/python/ops/rnn_cell_impl.py
https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/python/keras/layers/recurrent.py
In case anyone else encounters this.
There are three differences that I found for migrating weights:
The kernel is shuffled along the gate axis (axis=1). Both implementations concatenate the weights for the four gate projections the LSTM needs into one matrix, so that they can be computed with one (or two) matrix multiplications. The catch is that the middle two quarters of this concatenated weight matrix are swapped between the two implementations.
The kernel is split along axis=0. The rnn_cell implementation has a single weight matrix that is multiplied with a concatenation of the inputs and the hidden state, whereas the keras implementation stores these as two attributes, kernel and recurrent_kernel, multiplies them separately and sums the results.
A forget bias is explicitly added in the cell computation in rnn_cell, but is folded into the cell bias in keras, where the unit_forget_bias option only modifies the initialisation.
A migration function that accounts for these three differences is
def convert_lstm_weights(tf1_kernel, tf1_bias, forget_bias=True):
    a, b, c, d = tf.split(tf1_kernel, num_or_size_splits=4, axis=1)
    lstm_kernel = tf.concat(values=[a, c, b, d], axis=1)
    kernel, recurrent_kernel = lstm_kernel[:-hps.hidden_dim], lstm_kernel[-hps.hidden_dim:]
    a, b, c, d = tf.split(tf1_bias, num_or_size_splits=4, axis=0)
    bias = tf.concat(values=[a, c + float(forget_bias), b, d], axis=0)  # + 1 to account for the forget bias
    return kernel, recurrent_kernel, bias
And two differences I've found that need to be accounted for during use:
The gate activation in tf.compat.v1.nn.rnn_cell.LSTMCell is sigmoid, but tf.keras.layers.LSTMCell defaults to hard sigmoid for its recurrent (gate) activation, so this needs to be set on initialization with recurrent_activation="sigmoid".
The states are returned in opposite orders.
output, (c_state_new, m_state_new) = tf.compat.v1.nn.rnn_cell.LSTMCell(hidden_size, state_is_tuple=True)(input, (c_state, m_state))
becomes
output, (h_state_new, c_state_new) = tf.keras.layers.LSTMCell(hidden_size, recurrent_activation="sigmoid")(input, (h_state, c_state))
where the hidden state is referred to by m in rnn_cell and h in keras.
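A hedged usage sketch (my addition; hidden_size, input_size and the source of the old arrays are placeholders, and hps.hidden_dim must be defined for convert_lstm_weights to work), assuming TF 2.x eager mode and the old weights loaded as numpy arrays:

# tf1_kernel, tf1_bias: arrays read from the old checkpoint or the csv files
kernel, recurrent_kernel, bias = convert_lstm_weights(tf1_kernel, tf1_bias, forget_bias=True)

cell = tf.keras.layers.LSTMCell(hidden_size, recurrent_activation="sigmoid")
cell.build((None, input_size))  # create the weight variables before assigning to them
cell.set_weights([w.numpy() for w in (kernel, recurrent_kernel, bias)])  # keras order: kernel, recurrent_kernel, bias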
You can split the matrix:
If you look here, the kernel matrix in TF1 has shape (input_shape[-1], self.units).
Let's say you have 20 inputs and 128 nodes in an LSTM layer:
input_units=20
layer_size = 128
input_placeholder = tf.placeholder(tf.float32, [None, None, input_units])
lstm_layers = [tf.nn.rnn_cell.LSTMCell(layer_size), tf.nn.rnn_cell.LSTMCell(layer_size)]
stacked = tf.contrib.rnn.MultiRNNCell(lstm_layers)
output, state = tf.nn.dynamic_rnn(stacked, input_placeholder, dtype=tf.float32)
Your trainable parameters will have these shapes:
[<tf.Variable 'rnn/multi_rnn_cell/cell_0/lstm_cell/kernel:0' shape=(148, 512) dtype=float32_ref>,
<tf.Variable 'rnn/multi_rnn_cell/cell_0/lstm_cell/bias:0' shape=(512,) dtype=float32_ref>,
<tf.Variable 'rnn/multi_rnn_cell/cell_1/lstm_cell/kernel:0' shape=(256, 512) dtype=float32_ref>,
<tf.Variable 'rnn/multi_rnn_cell/cell_1/lstm_cell/bias:0' shape=(512,) dtype=float32_ref>]
In TF 1.x, the kernel and the recurrent kernel of TF 2.0 are concatenated (see here):
def build(self, input_shape):
    self.kernel = self.add_weight(shape=(input_shape[-1], self.units),
                                  initializer='uniform',
                                  name='kernel')
    self.recurrent_kernel = self.add_weight(
        shape=(self.units, self.units),
        initializer='uniform',
        name='recurrent_kernel')
    self.built = True
In this new version you now have two different weight matrices:
input_placeholder = tf.placeholder(tf.float32, [None, None, input_units])
lstm_layers = [tf.keras.layers.LSTMCell(layer_size), tf.keras.layers.LSTMCell(layer_size)]
stacked = tf.keras.layers.StackedRNNCells(lstm_layers)
output = tf.keras.layers.RNN(stacked, return_sequences=True, return_state=True, dtype=tf.float32)(input_placeholder)
Thus, your trainable parameters are:
[<tf.Variable 'rnn_1/while/stacked_rnn_cells_1/kernel:0' shape=(20, 512) dtype=float32>,
 <tf.Variable 'rnn_1/while/stacked_rnn_cells_1/recurrent_kernel:0' shape=(128, 512) dtype=float32>,
 <tf.Variable 'rnn_1/while/stacked_rnn_cells_1/bias:0' shape=(512,) dtype=float32>,
 <tf.Variable 'rnn_1/while/stacked_rnn_cells_1/kernel_1:0' shape=(128, 512) dtype=float32>,
 <tf.Variable 'rnn_1/while/stacked_rnn_cells_1/recurrent_kernel_1:0' shape=(128, 512) dtype=float32>,
 <tf.Variable 'rnn_1/while/stacked_rnn_cells_1/bias_1:0' shape=(512,) dtype=float32>]
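For illustration (my addition, not from the original answer), the row split for cell_0 of the example above can be done directly on the old checkpoint array. Note that, as the first answer points out, the four gate blocks along the column axis may additionally need reordering:

import numpy as np

input_units, layer_size = 20, 128

# old 'cell_0/lstm_cell/kernel' array, shape (148, 512):
# 20 rows for the inputs followed by 128 rows for the hidden state
tf1_kernel = np.random.randn(input_units + layer_size, 4 * layer_size).astype(np.float32)

kernel = tf1_kernel[:input_units]            # (20, 512)  -> tf.keras 'kernel'
recurrent_kernel = tf1_kernel[input_units:]  # (128, 512) -> tf.keras 'recurrent_kernel'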

Neural network output issue

I built a neural network with TensorFlow; here is the code:
class DQNetwork:
    def __init__(self, state_size, action_size, learning_rate, name='DQNetwork'):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate

        with tf.variable_scope(name):
            # We create the placeholders
            self.inputs_ = tf.placeholder(tf.float32, shape=[state_size[1], state_size[0]], name="inputs")
            self.actions_ = tf.placeholder(tf.float32, [None, self.action_size], name="actions_")

            # Remember that target_Q is the R(s,a) + ymax Qhat(s', a')
            self.target_Q = tf.placeholder(tf.float32, [None], name="target")

            self.fc = tf.layers.dense(inputs=self.inputs_,
                                      units=50,
                                      kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                      activation=tf.nn.elu)

            self.output = tf.layers.dense(inputs=self.fc,
                                          units=self.action_size,
                                          kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                          activation=None)

            # Q is our predicted Q value.
            self.Q = tf.reduce_sum(tf.multiply(self.output, self.actions_))

            # The loss is the difference between our predicted Q_values and the Q_target
            # Sum(Qtarget - Q)^2
            self.loss = tf.reduce_mean(tf.square(self.target_Q - self.Q))
            self.optimizer = tf.train.AdamOptimizer(self.learning_rate).minimize(self.loss)
But I have an issue with the output:
the output should normally have the same size as action_size, and action_size is 3,
but I got an output like [[5][3]] instead of just [[3]], and I really don't understand why...
This network has 2 dense layers, one with 50 perceptrons and the other with 3 perceptrons (= action_size).
state_size is of the format [[9][5]].
If someone knows why my output is two-dimensional, I will be very thankful.
Your self.inputs_ placeholder has shape (5, 9). In the first dense layer fc1 you perform matmul(self.inputs_, fc1.w), where the weight has shape (9, 50), which results in shape (5, 50). You then apply another dense layer with shape (50, 3), which results in output shape (5, 3).
The same schematically:
matmul(shape(5, 9), shape(9, 50)) ---> shape(5, 50) # output of 1st dense layer
matmul(shape(5, 50), shape(50, 3)) ---> shape(5, 3) # output of 2nd dense layer
Usually, the first dimension of the input placeholder represents the batch size and the second dimension is the size of the input feature vector. So for each sample in the batch (batch size is 5 in your case) you get an output of size 3.
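A quick way to see this (my addition, not part of the original answer) is to build the network from the question and inspect the shape of self.output for a dummy batch; state_size=[9, 5] is an assumption consistent with the (5, 9) placeholder above:

import numpy as np
import tensorflow as tf

net = DQNetwork(state_size=[9, 5], action_size=3, learning_rate=0.001)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(net.output, feed_dict={net.inputs_: np.ones((5, 9))})
    print(out.shape)  # (5, 3): 3 Q-values for each of the 5 samples in the batch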
To get probabilities, use this:
import tensorflow as tf
import numpy as np

inputs_ = tf.placeholder(tf.float32, shape=(None, 9))
actions_ = tf.placeholder(tf.float32, shape=(None, 3))

fc = tf.layers.dense(inputs=inputs_, units=2)
output = tf.layers.dense(inputs=fc, units=3)
reduced = tf.reduce_mean(output, axis=0)
probs = tf.nn.softmax(reduced)  # <-- probabilities

inputs_vals = np.ones((5, 9))
actions_vals = np.ones((1, 3))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(probs.eval({inputs_: inputs_vals,
                      actions_: actions_vals}))
    # [0.01858923 0.01566187 0.9657489 ]

Add Placeholder to layer

I have a TensorFlow layer with 2 nodes. These are the output nodes of two larger hidden layers. Now I want to add 2 new nodes to this layer, so that I end up with 4 nodes in total, and do some final computation. The added nodes are implemented as placeholders so far, and have a dynamic shape depending on the batch size. Here is a sketch of the net:
Now I want to concatenate Nodes 3 and 4 to the nodes 1 and 2 of the previously computed layer. I know there is tf.concat for this, but I don't understand how to do this correctly.
How do I add Placeholders of the same batchsize as the original net input to a specific layer?
EDIT:
When I use tf.concat over axis=1, I end up with the following problem:
z = tf.placeholder(tf.float32, shape=[None, 2])
Weight_matrix = weight_variable([4, 2])
bias = bias_variable([4, 2])
concat = tf.concat((dnn_out, z), 1)
h_fc3 = tf.nn.relu(tf.matmul(concat, Weight_matrix) + bias)
Adding the bias to the tf.matmul result throws an InvalidArgumentError: Incompatible shapes: [20,2] vs. [4,2].
Since your data is batched, probably over the first dimension, you need to concatenate over the second (axis=1):
import tensorflow as tf
import numpy as np

dnn_output = tf.placeholder(tf.float32, (None, 2))  # replace with your DNN(input) result
additional_nodes = tf.placeholder(tf.float32, (None, 2))

concat = tf.concat((dnn_output, additional_nodes), axis=1)
print(concat)
# > Tensor("concat:0", shape=(?, 4), dtype=float32)

dense_output = tf.layers.dense(concat, units=2)
print(dense_output)
# > Tensor("dense/BiasAdd:0", shape=(?, 2), dtype=float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(dense_output, feed_dict={dnn_output: np.ones((5, 2)),
                                            additional_nodes: np.zeros((5, 2))}))
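For the manually built layer in the question's edit (my addition, not part of the original answer), the InvalidArgumentError comes from the bias shape: a layer with 2 output units needs a bias of shape [2], not [4, 2], so that it can broadcast over the batch dimension. A minimal sketch, assuming weight_variable and bias_variable are the asker's helper functions:

z = tf.placeholder(tf.float32, shape=[None, 2])
concat = tf.concat((dnn_out, z), axis=1)                      # shape (batch, 4)
Weight_matrix = weight_variable([4, 2])                       # 4 concatenated features -> 2 output units
bias = bias_variable([2])                                     # one bias per output unit, broadcasts over the batch
h_fc3 = tf.nn.relu(tf.matmul(concat, Weight_matrix) + bias)   # shape (batch, 2)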

How to have a variable number of hidden layers in Tensorflow?

Suppose that we want to try out different numbers of hidden layers and their sizes. How can we do this in TensorFlow?
Consider the following example to make it clear:
# Create a Neural Network Layer
def fc_layer(input, size_in, size_out):
    w = tf.Variable(tf.truncated_normal([None, size_in, size_out]), name="W")
    b = tf.Variable(tf.constant(0.1, shape=[size_out]))
    act = tf.matmul(input, w) + b
    return act

n_hiddenlayers = 3  # number of hidden layers
hidden_layer = tf.placeholder(tf.float32, [n_hiddenlayers, None, None])

# considering 4 as the size of inputs and outputs of all layers
sizeInpOut = 4
for i in range(n_hiddenlayers):
    hidden_layer(i,:,:) = tf.nn.sigmoid(fc_layer(X, sizeInpOut, sizeInpOut))
It results in an error about hidden_layer(i,:,:)= ...
In other words, I need a tensor of tensors.
I did this just using a list to hold the different layers, as follows; it seemed to work fine.
# inputs
x_size = 2          # first layer nodes
y_size = 1          # final layer nodes
h_size = [3, 4, 3]  # variable-length list of hidden layer nodes

# set up input and output
X = tf.placeholder(tf.float32, [None, x_size])
y_true = tf.placeholder(tf.float32, [None, y_size])

# set up parameters
W = []
b = []
layer = []

# first layer
W.append(tf.Variable(tf.random_normal([x_size, h_size[0]], stddev=0.1)))
b.append(tf.Variable(tf.zeros([h_size[0]])))

# add hidden layers (variable number)
for i in range(1, len(h_size)):
    W.append(tf.Variable(tf.random_normal([h_size[i-1], h_size[i]], stddev=0.1)))
    b.append(tf.Variable(tf.zeros([h_size[i]])))

# add final layer
W.append(tf.Variable(tf.random_normal([h_size[-1], y_size], stddev=0.1)))
b.append(tf.Variable(tf.zeros([y_size])))

# define model
layer.append(tf.nn.relu(tf.matmul(X, W[0]) + b[0]))
for i in range(1, len(h_size)):
    layer.append(tf.nn.relu(tf.matmul(layer[i-1], W[i]) + b[i]))

if self.type_in == "classification":
    y_pred = tf.nn.sigmoid(tf.matmul(layer[-1], W[-1]) + b[-1])
    loss = tf.reduce_mean(-1. * ((y_true * tf.log(y_pred)) + ((1. - y_true) * tf.log(1. - y_pred))))
    correct_prediction = tf.equal(tf.round(y_pred), tf.round(y_true))
    metric = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    metric_name = "accuracy"
Not a direct answer, but you could consider using tensorflow-slim. It's one of the many APIs distributed as part of tensorflow. It is lightweight and compatible with defining all the variables by hand as you are doing. If you look at the webpage I linked, slim.repeat and slim.stack allow you to create multiple layers of different widths in one line. To make things more complicated: I think part of slim is now the module called layers in tensorflow.
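For illustration (my addition; a sketch based on the TF-Slim README, assuming TF 1.x with tf.contrib.slim available), slim.stack builds several fully connected layers of different widths in a single call:

import tensorflow as tf
slim = tf.contrib.slim

x = tf.placeholder(tf.float32, [None, 10])
# three fully connected layers with 32, 64 and 128 units, created in one line
net = slim.stack(x, slim.fully_connected, [32, 64, 128], scope='fc')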
But maybe you just want to play directly with tf variables to understand how it works and not use a higher level API until later.
In the code you posted, since you want to create three layers, you should call fc_layer three times, but you only call it once. By the way, this implies that w and b will be created three different times, as different variables with different internal tf names. And that is what you want.
You should have some for-loop or while-loop which iterates three times. Note that the output tensor at the end of the loop will become the input tensor in the next iteration. The initial input is the true input and the very last output is the true output.
Another issue with your code is that the non-linearity (the sigmoid) should be at the end of fc_layer. You want a non-linear operation between all layers.
EDIT: some code of what would usually be done:
import tensorflow as tf

input_size = 10
output_size = 4
layer_sizes = [7, 6, 5]

def fc_layer(input, size, layer_name):
    in_size = input.shape.as_list()[1]
    w = tf.Variable(tf.truncated_normal([in_size, size]),
                    name="W" + layer_name)
    b = tf.Variable(tf.constant(0.1, shape=[size]),
                    name="b" + layer_name)
    act = tf.nn.sigmoid(tf.matmul(input, w) + b)
    return act

input = tf.placeholder(tf.float32, [None, input_size])

# output will be the intermediate activations successively and in the end the
# final activations (output).
output = input
for i, size in enumerate(layer_sizes + [output_size]):
    output = fc_layer(output, size, layer_name=str(i + 1))

print("final output var: " + str(output))
print("All vars in the tensorflow graph:")
for var in tf.global_variables():
    print(var)
With output:
final output var: Tensor("Sigmoid_3:0", shape=(?, 4), dtype=float32)
All vars in the tensorflow graph:
<tf.Variable 'W1:0' shape=(10, 7) dtype=float32_ref>
<tf.Variable 'b1:0' shape=(7,) dtype=float32_ref>
<tf.Variable 'W2:0' shape=(7, 6) dtype=float32_ref>
<tf.Variable 'b2:0' shape=(6,) dtype=float32_ref>
<tf.Variable 'W3:0' shape=(6, 5) dtype=float32_ref>
<tf.Variable 'b3:0' shape=(5,) dtype=float32_ref>
<tf.Variable 'W4:0' shape=(5, 4) dtype=float32_ref>
<tf.Variable 'b4:0' shape=(4,) dtype=float32_ref>
In your code you were using the same name for w, which creates conflicts since different variables with the same name would be created. I fixed it in my code, but even if you use the same name, tensorflow is intelligent enough and will rename each variable to a unique name by adding an underscore and a number.
EDIT: here is what I think you wanted to do:
import tensorflow as tf

hidden_size = 4
input_size = hidden_size   # equality required!
output_size = hidden_size  # equality required!
n_hidden = 3

meta_tensor = tf.Variable(tf.truncated_normal([n_hidden, hidden_size, hidden_size]),
                          name="meta")

def fc_layer(input, i_layer):
    w = meta_tensor[i_layer]
    # more verbose: w = tf.slice(meta_tensor, begin=[i_layer, 0, 0], size=[1, hidden_size, hidden_size])[0]
    b = tf.Variable(tf.constant(0.1, shape=[hidden_size]),
                    name="b" + str(i_layer))
    act = tf.nn.sigmoid(tf.matmul(input, w) + b)
    return act

input = tf.placeholder(tf.float32, [None, input_size])

# output will be the intermediate activations successively and in the end the
# final activations (output).
output = input
for i_layer in range(0, n_hidden):
    output = fc_layer(output, i_layer)

print("final output var: " + str(output))
print("All vars in the tensorflow graph:")
for var in tf.global_variables():
    print(var)
With output:
final output var: Tensor("Sigmoid_2:0", shape=(?, 4), dtype=float32)
All vars in the tensorflow graph:
<tf.Variable 'meta:0' shape=(3, 4, 4) dtype=float32_ref>
<tf.Variable 'b0:0' shape=(4,) dtype=float32_ref>
<tf.Variable 'b1:0' shape=(4,) dtype=float32_ref>
<tf.Variable 'b2:0' shape=(4,) dtype=float32_ref>
As I said, this is not standard. While coding it I also realized that it is quite limiting, since all hidden layers must have the same size. A meta-tensor can be used to store many matrices, but those must all have the same dimensions. So you could not do what I did in the first example above, where the first hidden layer has size 7, the next size 6, and the last size 5, before an output of size 4.