Tensorflow: calculate gradient for tf.multiply - tensorflow

I'm building a neural network that has the following two layers
pseudo_inputs = tf.Variable(a_numpy_ndarray)
weights = tf.Variable(tf.truncated_normal(...))
I then want to multiply them using tf.multiply (which, unlike tf.matmul multiplies corresponding indices, i.e. c_ij = a_ij * b_ij)
input = tf.multiply(pseudo_inputs, weights)
My goal is to learn weights. So I run
train_step = tf.train.AdamOptimizer(learn_rate).minimize(loss, var_list=[weights])
But it doesn't work. The network doesn't change at all.
Looking at tensorboard, I could see that 'input' has no gradient, so I'm assuming that's the problem. Any ideas how to solve this?
From reading tensorflow docs it seems like I might have to write a gradient op for tf.multiply, but I find it hard to believe no one needed to do this before.

I thought the pseudo_inputs should be set as Placeholders in the first line.
And in this line:
train_step = tf.train.AdamOptimizer(learn_rate).minimize(loss, var_list=[weights])
Since weights are to be trained in the graph by minimizing loss then it should not passed as a parameter here.
train = tf.train.AdamOptimizer(learn_rate).minimize(loss)
Then you should first run the train using the samples(you don't have labels) you have.
for x_train, y_train in samples:
sess.run(train, {pseudo_inputs:x_train, y:y_train})
And after that you can get weights by:
W_c, loss_c = sess.run([W, loss], {pseudo_inputs=x_train, y:y_train})

Related

can't reproduce model.fit with GradientTape

I've been trying to investigate into the reason (e.g. by checking weights, gradients and activations during training) why SGD with a 0.001 learning rate worked in training while Adam fails to do so. (Please see my previous post [here](Why is my loss (binary cross entropy) converging on ~0.6? (Task: Natural Language Inference)"Why is my loss (binary cross entropy) converging on ~0.6? (Task: Natural Language Inference)"))
Note: I'm using the same model from my previous post here as well.
using tf.keras, i trained the neural network using model.fit():
model.compile(optimizer=SGD(learning_rate=0.001),
loss='binary_crossentropy',
metrics=['accuracy'])
model.fit(x=ds,
epoch=80,
validation_data=ds_val)
This resulted in a epoch loss graphed below, within the 1st epoch, it's reached a train loss of 0.46 and then ultimately resulting in a train_loss of 0.1241 and val_loss of 0.2849.
I would've used tf.keras.callbacks.Tensorboard(histogram_freq=1) to train the network with both SGD(0.001) and Adam to investigate but it's throwing an InvalidArgumentError on Variable:0, something I can't decipher. So I tried to write a custom training loop using GradientTape and plotting the values.
using tf.GradientTape(), i tried to reproduce the results using the exact same model and dataset, however the epoch loss is training incredibly slowly, reaching train loss of 0.676 after 15 epochs (see graph below), is there something wrong with my implementation? (code below)
#tf.function
def compute_grads(train_batch: Dict[str,tf.Tensor], target_batch: tf.Tensor,
loss_fn: Loss, model: tf.keras.Model):
with tf.GradientTape(persistent=False) as tape:
# forward pass
outputs = model(train_batch)
# calculate loss
loss = loss_fn(y_true=target_batch, y_pred=outputs)
# calculate gradients for each param
grads = tape.gradient(loss, model.trainable_variables)
return grads, loss
BATCH_SIZE = 8
EPOCHS = 15
bce = BinaryCrossentropy()
optimizer = SGD(learning_rate=0.001)
for epoch in tqdm(range(EPOCHS), desc='epoch'):
# - accumulators
epoch_loss = 0.0
for (i, (train_batch, target_dict)) in tqdm(enumerate(ds_train.shuffle(1024).batch(BATCH_SIZE)), desc='step'):
(grads, loss) = compute_grads(train_batch, target_dict['target'], bce, model)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
epoch_loss += loss
avg_epoch_loss = epoch_loss/(i+1)
tensorboard_scalar(writer, name='epoch_loss', data=avg_epoch_loss, step=epoch) # custom helper function
print("Epoch {}: epoch_loss = {}".format(epoch, avg_epoch_loss))
Thanks in advance!
Check if you have shuffle your dataset then the problem may came from the shuffling using the tf.Dataset method. It only shuffled through the dataset one bucket at the time. Using the Keras.Model.fit yielded better results because it probably adds another shuffling.
By adding a shuffling with numpy.random.shuffle it may improve the training performance. From this reference.
The example of applying it into generation of the dataset is:
numpy_data = np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1), index_data.reshape(-1, 1)])
np.random.shuffle(numpy_data)
indexes = np.array(numpy_data[:, :2], dtype=np.uint32)
labels = np.array(numpy_data[:, 2].reshape(-1, 1), dtype=np.float32)
train_ds = data.Dataset.from_tensor_slices(
(indexes, labels)
).shuffle(100000).batch(batch_size, drop_remainder=True)
If this not work you may need to use Dataset .repeat(epochs_number) and .shuffle(..., reshuffle_each_iteration=True):
train_ds = data.Dataset.from_tensor_slices(
(np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1)]), index_data)
).shuffle(100000, reshuffle_each_iteration=True
).batch(batch_size, drop_remainder=True
).repeat(epochs_number)
for ix, (examples, labels) in train_ds.enumerate():
train_step(examples, labels)
current_epoch = ix // (len(index_data) // batch_size)
This workaround is not beautiful nor natural, for the moment you can use this to shuffle each epoch. It's a known issue and will be fixed, in the future you can use for epoch in range(epochs_number) instead of .repeat()
The solution provided here may also help a lot. You might want to check it out.
If this is not the case, you may want to speed up the TF2.0 GradientTape. This can be the solution:
TensorFlow 2.0 introduces the concept of functions, which translate eager code into graph code.
The usage is pretty straight-forward. The only change needed is that all relevant functions (like compute_loss and apply_gradients) have to be annotated with #tf.function.

tf.gradients(model.output, model.input) computes a different value each time I run it

I'm trying to compute the gradient of the output layer with respect to the input layer. My neural network is relatively small (input layer composed of 9 activation units and the output layer of 1) and the training went fine as the test provided a very good accuracy. I made the NN model using Keras.
In order to solve my problem, I need to compute the gradient of the output with respect to the input. This is, I need to obtain the Jacobian which as dimension [1x9]. The gradients function in tensorflow should provide me with everything I need, but when I run the code below I obtain a different solution every time.
output_v = model.output
input_v = model.input
gradients = tf.gradients(output_v, input_v)
sess = tf.Session()
sess.run(tf.initialize_all_variables())
print(sess.run(model.input,feed_dict={model.input:x_test_N[0:1,:]}))
evaluated_gradients = sess.run(gradients,feed_dict{model.input:x_test_N[0:1,:]})
print(evaluated_gradients)
sess.close()
The first print command shows this value every time I run it (just to make sure that the input values are not modified):
[[-1.4306372 -0.1272892 0.7145787 1.338818 -1.2957293 -0.5402862-0.7771702 -0.5787912 -0.9157122]]
But the second print shows different ones:
[[ 0.00175761, -0.0490326 , -0.05413761, 0.09952173, 0.06112418, -0.04772799, 0.06557006, -0.02473242, 0.05542536]]
[[-0.00416433, 0.08235116, -0.00930298, 0.04440641, 0.03752216, 0.06378302, 0.03508484, -0.01903783, -0.0538374 ]]
Using finite differences, evaluated_gradients[0,0] = 0.03565103, which isn't close to any of the first values previously printed.
Thanks for your time!
Alberto
Solved by creating a specific session just before training my model:
sess = tf.Session()
sess.run(tf.global_variables_initializer())
K.set_session(sess)
history = model.fit(x_train_N, y_train_N, epochs=n_epochs,
validation_split=split, verbose=1, batch_size=n_batch_size,
shuffle='true', callbacks=[early_stop, tensorboard])
And evaluating the gradient after training, while tf.session is still open:
evaluated_gradients = sess.run(K.gradients(model.output, model.input), feed_dict={model.input: x_test_N})
Presumably your network is set up to initialize weights to random values. When you run sess.run(tf.initialize_all_variables()), you are initializing your variables to new random values. Therefore you get different values for output_v in every run, and hence different gradients. If you want to use a model you trained before, you should replace the initialization with initialize_all_variables() with a restore command. I am not familiar with how this is done in Keras since I usually work directly with tensorflow, but I would try this.
Also note that initialize_all_variables is deprecated and you should use global_variables_initializer instead.

Using custom error function in tensorflow

I have a trained convolutional neural network A that outputs the propability that a given picture contains a square or a circle.
Another Network B takes images of random noise. My idea is to have a bunch of convolutional layers so that the output is a newly generated square.
As an error function I would like to feed the generated image into A and learn filters of B from the softmax tensor of A. To my understanding this is sort of a generative adversarial network, except for that A does not learn. While trying to implement this I have encountered two problems.
I have imported the Layers of A that I want to use in B as followed:
with gfile.FastGFile("shape-classifier.pb", 'rb') as f:
graph_def = tf.GraphDef()
graph_def.ParseFromString(f.read())
image_input_layer, extern_softmax_tensor = tf.import_graph_def(
graph_def, name="", return_elements=["image_input", "Softmax"])
I would like to avoid using two sess.run() three times. (Generating the random image, getting the softmax values from A, adjusting weights of B).
Is there a way to directly connect the tensors so that I only have one graph?
Calling:
logits = extern_softmax_tensor(my_generated_image_tensor)
throws:
TypeError: 'Operation' object is not callable
The "Graph-Connected" and the "Feed-Connected" approach confuse me a bit.
logits = extern_softmax_tensor(my_generated_image_tensor) # however you would call it
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=label_input,
logits=logits)
cross_entropy_mean = tf.reduce_mean(cross_entropy_tensor)
optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
learning_step = optimizer.minimize(cross_entropy_mean)
With that Logic the error will be first passed back through A.
Is there a way to use the softmax calculated by A to directly adjust Layers of B?
Leaving aside if my idea actually works, is it actually possible to build it in tensorflow? I hope I could make my problems clear.
Thank you very much
Yes, it is possible to connect two models like that.
# Generator networ
my_generated_image_tensor = ...
# Read classifier
with gfile.FastGFile("shape-classifier.pb", 'rb') as f:
graph_def = tf.GraphDef()
graph_def.ParseFromString(f.read())
# Map generator output to classifier input
extern_softmax_tensor, = tf.import_graph_def(
graph_def, name="", input_map={"image_input": my_generated_image_tensor},
return_elements=["Softmax:0"])
# Define loss and training step
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(
labels=label_input, logits=extern_softmax_tensor)
cross_entropy_mean = tf.reduce_mean(cross_entropy_tensor)
optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
learning_step = optimizer.minimize(cross_entropy_mean)
As side notes: 1) tf.nn.softmax_cross_entropy_with_logits expects logits as input, that is, the values before being passed through the softmax function, so if the Softmax tensor in the loaded model is the output of a softmax operation you should probably change it to use the input logits instead. 2) Note that tf.nn.softmax_cross_entropy_with_logits is now deprecated anyway, see tf.nn.softmax_cross_entropy_with_logits_v2.

Delayed echo of sin - cannot reproduce Tensorflow result in Keras

I am experimenting with LSTMs in Keras with little to no luck. At some moment I decided to scale back to the most basic problems in order finally achieve some positive result.
However, even with simplest problems I find that Keras is unable to converge while the implementation of the same problem in Tensorflow gives stable result.
I am unwilling to just switch to Tensorflow without understanding why Keras keeps diverging on any problem I attempt.
My problem is a many-to-many sequence prediction of delayed sin echo, example below:
Blue line is a network input sequence, red dotted line is an expected output.
The experiment was inspired by this repo and workable Tensorflow solution was also created from it too.
The relevant excerpts from the my code are below, and full version of my minimal reproducible example is available here.
Keras model:
model = Sequential()
model.add(LSTM(n_hidden,
input_shape=(n_steps, n_input),
return_sequences=True))
model.add(TimeDistributed(Dense(n_input, activation='linear')))
model.compile(loss=custom_loss,
optimizer=keras.optimizers.Adam(lr=learning_rate),
metrics=[])
Tensorflow model:
x = tf.placeholder(tf.float32, [None, n_steps, n_input])
y = tf.placeholder(tf.float32, [None, n_steps])
weights = {
'out': tf.Variable(tf.random_normal([n_hidden, n_steps], seed = SEED))
}
biases = {
'out': tf.Variable(tf.random_normal([n_steps], seed = SEED))
}
lstm = rnn.LSTMCell(n_hidden, forget_bias=1.0)
outputs, states = tf.nn.dynamic_rnn(lstm, inputs=x,
dtype=tf.float32,
time_major=False)
h = tf.transpose(outputs, [1, 0, 2])
pred = tf.nn.bias_add(tf.matmul(h[-1], weights['out']), biases['out'])
individual_losses = tf.reduce_sum(tf.squared_difference(pred, y),
reduction_indices=1)
loss = tf.reduce_mean(individual_losses)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate) \
.minimize(loss)
I claim that other parts of code (data_generation, training) are completely identical. But learning progress with Keras stalls early and yields unsatisfactory predictions. Graphs of logloss for both libraries and example predictions are attached below:
Logloss for Tensorflow-trained model:
Logloss for Keras-trained model:
It's not easy to read from graph, but Tensorflow reaches target_loss=0.15 and stops early after about 10k batches. But Keras uses up all 13k batches reaching loss about only 1.5. In a separate experiment where Keras was running for 100k batches it went no further stalling around 1.0.
Figures below contain: black line - model input signal, green dotted line - ground truth output, red line - acquired model output.
Predictions of Tensorflow-trained model:
Predictions of Keras-trained model:
Thank you for suggestions and insights, dear colleagues!
Ok, I have managed to solve this. Keras implementation now converges steadily to a sensible solution too:
The models were in fact not identical. You may inspect with extra caution the Tensorflow model version from the question and verify for yourself that actual Keras equivalent is listed below, and isn't what stated in the question:
model = Sequential()
model.add(LSTM(n_hidden,
input_shape=(n_steps, n_input),
return_sequences=False))
model.add(Dense(n_steps, input_shape=(n_hidden,), activation='linear'))
model.compile(loss=custom_loss,
optimizer=keras.optimizers.Adam(lr=learning_rate),
metrics=[])
I will elaborate. Workable solution here uses that last column of size n_hidden spat out by LSTM as an intermediate activation then fed to the Dense layer.
So, in a way, the actual prediction here is made by the regular perceptron.
One extra take away note - source of mistake in the original Keras solution is already evident from the inference examples attached to question. We see there that earlier timestamps fail utterly, while later timestamps are near perfect. These earlier timestamps correspond to the states of LSTM when it were just initialized on new window and clueless of context.

Tensorflow: optimize over input with gradient descent

I have a TensorFlow model (a convolutional neural network) which I successfully trained using gradient descent (GD) on some input data.
Now, in a second step, I would like to provide an input image as initialization then and optimize over this input image with fixed network parameters using GD. The loss function will be a different one, but this a detail.
So, my main question is how to tell the gradient descent algorithm to
stop optimizing the network parameters
to optimize over the input image
The first can probably done with this
Holding variables constant during optimizer
Do you guys have ideas about the second point?
I guess I can recode the gradient descent algorithm myself using the TF gradient function, but my gut feeling tells me that there should be an easier way, which also allows me to benefit from more complex GD variants (Adam etc.).
No need for your SDG own implementation. TensorFlow provides all functions:
import tensorflow as tf
import numpy as np
# some input
data_pldhr = tf.placeholder(tf.float32)
img_op = tf.get_variable('input_image', [1, 4, 4, 1], dtype=tf.float32, trainable=True)
img_assign = img_op.assign(data_pldhr)
# your starting image
start_value = (np.ones((4, 4), dtype=np.float32) + np.eye(4))[None, :, :, None]
# override variable_getter
def nontrainable_getter(getter, *args, **kwargs):
kwargs['trainable'] = False
return getter(*args, **kwargs)
# all variables in this scope are not trainable
with tf.variable_scope('myscope', custom_getter=nontrainable_getter):
x = tf.layers.dense(img_op, 10)
y = tf.layers.dense(x, 10)
# the usual stuff
cost_op = tf.losses.mean_squared_error(x, y)
train_op = tf.train.AdamOptimizer(0.1).minimize(cost_op)
# fire up the training process
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(img_assign, {data_pldhr: start_value})
print(sess.run(img_op))
for i in range(10):
_, c = sess.run([train_op, cost_op])
print(c)
print(sess.run(img_op))
represent an image as tf.Variable with trainable=True
initialise this variable with the starting image (initial guess)
recreate the NN graph using TF variables with trainable=False and copy the weights from the trained NN graph using tf.assign
calculate the loss function
plug the loss into any TF optimiser algorithm you want
Another alternative is to use ScipyOptimizerInterface, which allows to use scipy's minimizer. This supports constrained minimization.
I'm looking for a solution to the same problem, but my model is not an easy one as I have an LSTM network with cells created with MultiRNNCell, I don't think it is possible to get the weight and clone the network. Is there any workaround so that I can compute the gradient wrt the input?