optimizer gradient not applied in tensorflow - tensorflow

I am currently using tensorflow 1.7
For some reason the optimizer gradients are not being applied. Here is my simple code:
import tensorflow as tf
import numpy as np
print("tensorflow version={}".format(tf.__version__))
a= np.zeros((2,2), dtype="float32")
b= np.array([[6,7],[8,9]], dtype="float32")
t1= tf.placeholder(tf.float32, shape=(2,2))
label_t = tf.placeholder(tf.float32, shape=(2,2))
t2 = tf.layers.dense(t1,2, activation=tf.nn.relu)
loss = tf.reduce_sum(tf.square(t2 - label_t))
optimizer = tf.train.AdamOptimizer(0.1)
train_op = optimizer.minimize(loss)
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
steps=2
for i in range(steps):
print("trainop", train_op)
pred, loss_val, _= sess.run([t2,loss, train_op], feed_dict={t1:a, label_t:b})
print("pred={}, loss={}".format(pred, loss_val))
The output is:
trainop name: "Adam"
op: "NoOp"
input: "^Adam/update_dense/kernel/ApplyAdam"
input: "^Adam/update_dense/bias/ApplyAdam"
input: "^Adam/Assign"
input: "^Adam/Assign_1"
pred=[[0. 0.]
[0. 0.]], loss=230.0
trainop name: "Adam"
op: "NoOp"
input: "^Adam/update_dense/kernel/ApplyAdam"
input: "^Adam/update_dense/bias/ApplyAdam"
input: "^Adam/Assign"
input: "^Adam/Assign_1"
pred=[[0. 0.]
[0. 0.]], loss=230.0
I have tried separating the minimize() into compute_gradient and apply_gradient. Even so, the gradients do not seem to be applied(loss value remains the same and the trainable variables stay the same).
Can anyone give me a hint on what I am doing wrong?

as #BugKiller helped me out in the comments, the cause was my bad choice of activation function. I can see the loss diminishing if I remove the explicit activation assignment. And as #BugKiller has kindly explained in the comments, it was my bad from the beginning to blindly use the tf.nn.relu function as the activation function.

Related

Why the gradient of categorical crossentropy loss with respect to logits is 0 with gradient tape in TF2.0?

I am learning Tensorflow 2.0 and I am trying to figure out how Gradient Tapes work. I have this simple example, in which, I evaluate the cross entropy loss between logits and labels. I am wondering why the gradients with respect to logits is being zero. (Please look at the code below).
The version of TF is tensorflow-gpu==2.0.0-rc0.
logits = tf.Variable([[1, 0, 0], [1, 0, 0], [1, 0, 0]], type=tf.float32)
labels = tf.constant([[1, 0, 0], [0, 1, 0], [0, 0, 1]],dtype=tf.float32)
with tf.GradientTape(persistent=True) as tape:
loss = tf.reduce_sum(tf.losses.categorical_crossentropy(labels, logits))
grads = tape.gradient(loss, logits)
print(grads)
I am getting
tf.Tensor(
[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]], shape=(3, 3), dtype=float32)
as a result, but should not it tell me how much should I change logits in order to minimize the loss?
When calculate the cross entropy loss, set from_logits=True in the tf.losses.categorical_crossentropy(). In default, it's false, which means you are directly calculate the cross entropy loss using -p*log(q). By setting the from_logits=True, you are using -p*log(softmax(q)) to calculate the loss.
Update:
Just find one interesting results.
logits = tf.Variable([[0.8, 0.1, 0.1]], dtype=tf.float32)
labels = tf.constant([[1, 0, 0]],dtype=tf.float32)
with tf.GradientTape(persistent=True) as tape:
loss = tf.reduce_sum(tf.keras.losses.categorical_crossentropy(labels, logits, from_logits=False))
grads = tape.gradient(loss, logits)
print(grads)
The grads will be tf.Tensor([[-0.25 1. 1. ]], shape=(1, 3), dtype=float32)
Previously, I thought tensorflow will use loss=-\Sigma_i(p_i)\log(q_i) to calculate the loss, and if we derive on q_i, we will have the derivative be -p_i/q_i. So, the expected grads should be [-1.25, 0, 0]. But the output grads looks like all increased by 1. But it won't affect the optimization process.
For now, I'm still trying to figure out why the grads will be increased by one. After reading the source code of tf.categorical_crossentropy, I found that even though we set from_logits=False, it still normalize the probabilities. That will change the final gradient expression. Specifically, the gradient will be -p_i/q_i+p_i/sum_j(q_j). If p_i=1 and sum_j(q_j)=1, the final gradient will plus one. That's why the gradient will be -0.25, however, I haven't figured out why the last two gradients would be 1..
To prove that all gradients are increased by 1/sum_j(q_j),
logits = tf.Variable([[0.5, 0.1, 0.1]], dtype=tf.float32)
labels = tf.constant([[1, 0, 0]],dtype=tf.float32)
with tf.GradientTape(persistent=True) as tape:
loss = tf.reduce_sum(tf.keras.losses.categorical_crossentropy(labels, logits, from_logits=False))
grads = tape.gradient(loss, logits)
print(grads)
The grads are tf.Tensor([[-0.57142866 1.4285713 1.4285713 ]], which should be [-2,0,0].
It shows that all gradients are increased by 1/(0.5+0.1+0.1). For the p_i==1, the gradient increased by 1/(0.5+0.1+0.1) makes sense to me. But I don't understand why p_i==0, the gradient is still increased by 1/(0.5+0.1+0.1).
I finally figured that.
The keras categorical cross entropy calculates the gradient using the following way:
sum(target) / sum(input) - target / input
You just sum the values for both targets and inputs ONLY if the input(i) is different from ZERO.

How to assign an element of an tensor in tensorflow

I want to understand the error message of the following code
M = K.eye(2)
K.assign(M[0,1],1.0)
The message I got is "Tried to convert 'input' to a tensor and failed. Error: None values not supported."
You can assign an element of a variable in tensorflow. Here is an example. (I didn't find K.assign this operation in my installed tensorflow version, btw)
import tensorflow as tf
import keras.backend as K
M = tf.Variable(K.eye(2), tf.float32)
assign_op = tf.assign(M[0,1], 1.0)
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
print(sess.run(M))
sess.run(assign_op)
print(sess.run(M))
#[[1. 0.]
# [0. 1.]]
#[[1. 1.]
# [0. 1.]]

Custom Keras binary_crossentropy loss function not working

I’m trying to re-define keras’s binary_crossentropy loss function so that I can customize it but it’s not giving me the same results as the existing one.
I'm using TF 1.13.1 with Keras 2.2.4.
I went through Keras’s github code. My understanding is that the loss in model.compile(optimizer='adam', loss='binary_crossentropy', metrics =['accuracy']), is defined in losses.py, using binary_crossentropy defined in tensorflow_backend.py.
I ran a dummy data and model to test it. Here are my findings:
The custom loss function outputs the same results as keras’s one
Using the custom loss in a keras model gives different accuracy results
from numpy.random import seed
seed(1)
from tensorflow import set_random_seed
set_random_seed(2)
import tensorflow as tf
from keras import losses
import keras.backend as K
import keras.backend.tensorflow_backend as tfb
from keras.layers import Dense
from keras import Sequential
#Dummy check of loss output
def binary_crossentropy_custom(y_true, y_pred):
return K.mean(binary_crossentropy_custom_tf(y_true, y_pred), axis=-1)
def binary_crossentropy_custom_tf(target, output, from_logits=False):
"""Binary crossentropy between an output tensor and a target tensor.
# Arguments
target: A tensor with the same shape as `output`.
output: A tensor.
from_logits: Whether `output` is expected to be a logits tensor.
By default, we consider that `output`
encodes a probability distribution.
# Returns
A tensor.
"""
# Note: tf.nn.sigmoid_cross_entropy_with_logits
# expects logits, Keras expects probabilities.
if not from_logits:
# transform back to logits
_epsilon = tfb._to_tensor(tfb.epsilon(), output.dtype.base_dtype)
output = tf.clip_by_value(output, _epsilon, 1 - _epsilon)
output = tf.log(output / (1 - output))
return tf.nn.sigmoid_cross_entropy_with_logits(labels=target,
logits=output)
logits = tf.constant([[-3., -2.11, -1.22],
[-0.33, 0.55, 1.44],
[2.33, 3.22, 4.11]])
labels = tf.constant([[1., 1., 1.],
[1., 1., 0.],
[0., 0., 0.]])
custom_sigmoid_cross_entropy_with_logits = binary_crossentropy_custom(labels, logits)
keras_binary_crossentropy = losses.binary_crossentropy(y_true=labels, y_pred=logits)
with tf.Session() as sess:
print('CUSTOM sigmoid_cross_entropy_with_logits: ', sess.run(custom_sigmoid_cross_entropy_with_logits), '\n')
print('KERAS keras_binary_crossentropy: ', sess.run(keras_binary_crossentropy), '\n')
#CUSTOM sigmoid_cross_entropy_with_logits: [16.118095 10.886106 15.942386]
#KERAS keras_binary_crossentropy: [16.118095 10.886106 15.942386]
#Dummy check of model accuracy
X_train = tf.random.uniform((3, 5), minval=0, maxval=1, dtype=tf.dtypes.float32)
labels = tf.constant([[1., 0., 0.],
[0., 0., 1.],
[1., 0., 0.]])
model = Sequential()
#First Hidden Layer
model.add(Dense(5, activation='relu', kernel_initializer='random_normal', input_dim=5))
#Output Layer
model.add(Dense(3, activation='sigmoid', kernel_initializer='random_normal'))
#I ran model.fit for each model.compile below 10 times using the same X_train and provide the range of accuracy measurement
# model.compile(optimizer='adam', loss='binary_crossentropy', metrics =['accuracy']) #0.748 < acc < 0.779
# model.compile(optimizer='adam', loss=losses.binary_crossentropy, metrics =['accuracy']) #0.761 < acc < 0.778
model.compile(optimizer='adam', loss=binary_crossentropy_custom, metrics =['accuracy']) #0.617 < acc < 0.663
history = model.fit(X_train, labels, steps_per_epoch=100, epochs=1)
I'd expect the custom loss function to give similar model accuracy output but it does not. Any idea? Thanks!
Keras automatically selects which accuracy implementation to use according to the loss, and this won't work if you use a custom loss. But in this case you can just explictly use the right accuracy, which is binary_accuracy:
model.compile(optimizer='adam', loss=binary_crossentropy_custom, metrics =['binary_accuracy'])

Keras TimeDistributed Not Masking CNN Model

For the sake of example, I have an input consisting of 2 images,of total shape (2,299,299,3). I'm trying to apply inceptionv3 on each image, and then subsequently process the output with an LSTM. I'm using a masking layer to exclude a blank image from being processed (specified below).
The code is:
import numpy as np
from keras import backend as K
from keras.models import Sequential,Model
from keras.layers import Convolution2D, MaxPooling2D, ZeroPadding2D, BatchNormalization, \
Input, GlobalAveragePooling2D, Masking,TimeDistributed, LSTM,Dense,Flatten,Reshape,Lambda, Concatenate
from keras.layers import Activation, Dropout, Flatten, Dense
from keras.applications import inception_v3
IMG_SIZE=(299,299,3)
def create_base():
base_model = inception_v3.InceptionV3(weights='imagenet', include_top=False)
x = GlobalAveragePooling2D()(base_model.output)
base_model=Model(base_model.input,x)
return base_model
base_model=create_base()
#Image mask to ignore images with pixel values of -1
IMAGE_MASK = -2*np.expand_dims(np.ones(IMG_SIZE),0)
final_input=Input((2,IMG_SIZE[0],IMG_SIZE[1],IMG_SIZE[2]))
final_model = Masking(mask_value = -2.)(final_input)
final_model = TimeDistributed(base_model)(final_model)
final_model = Lambda(lambda x: x, output_shape=lambda s:s)(final_model)
#final_model = Reshape(target_shape=(2, 2048))(final_model)
#final_model = Masking(mask_value = 0.)(final_model)
final_model = LSTM(5,return_sequences=False)(final_model)
final_model = Model(final_input,final_model)
#Create a sample test image
TEST_IMAGE = np.ones(IMG_SIZE)
#Create a test sample input, consisting of a normal image and a masked image
TEST_SAMPLE = np.concatenate((np.expand_dims(TEST_IMAGE,axis=0),IMAGE_MASK))
inp = final_model.input # input placeholder
outputs = [layer.output for layer in final_model.layers] # all layer outputs
functors = [K.function([inp]+ [K.learning_phase()], [out]) for out in outputs]
layer_outs = [func([np.expand_dims(TEST_SAMPLE,0), 1.]) for func in functors]
This does not work correctly. Specifically, the model should mask the IMAGE_MASK part of the input, but it instead processes it with inception (giving a nonzero output). here are the details:
layer_out[-1] , the LSTM output is fine:
[array([[-0.15324114, -0.09620268, -0.01668587, 0.07938149, -0.00757846]], dtype=float32)]
layer_out[-2] and layer_out[-3] , the LSTM input is wrong, it should have all zeros in the second array:
[array([[[ 0.37713543, 0.36381325, 0.36197218, ..., 0.23298527,
0.43247852, 0.34844452],
[ 0.24972123, 0.2378867 , 0.11810347, ..., 0.51930511,
0.33289322, 0.33403745]]], dtype=float32)]
layer_out[-4], the input to the CNN is correctly masked:
[[ 1., 1., 1.],
[ 1., 1., 1.],
[ 1., 1., 1.],
...,
[ 1., 1., 1.],
[ 1., 1., 1.],
[ 1., 1., 1.]]],
[[[-0., -0., -0.],
[-0., -0., -0.],
[-0., -0., -0.],
...,
[-0., -0., -0.],
[-0., -0., -0.],
[-0., -0., -0.]],
Note that the code seems to work correctly with a simpler base_model such as:
def create_base():
input_layer=Input(IMG_SIZE)
base_model=Flatten()(input_layer)
base_model=Dense(2048)(base_model)
base_model=Model(input_layer,base_model)
return base_model
I have exhausted most online resources on this. Permutations of this question have been asked on Keras's github, such as here, here and here, but I can't seem to find any concrete resolution.
The links suggest that the issues seem to be stemming from a combination of TimeDistributed being applied to BatchNormalization, and the hacky fixes of either the Lambda identity layer, or Reshape layers remove errors but don't seem to output the correct model.
I've tried to force the base model to support masking via:
base_model.__setattr__('supports_masking',True)
and I've also tried applying an identity layer via:
TimeDistributed(Lambda(lambda x: base_model(x), output_shape=lambda s:s))(final_model)
but none of these seem to work. Note that I would like the final model to be trainable, in particular the CNN part of it should remain trainable.
Not entirely sure this will work, but based on the comment made here, with a newer version of tensorflow + keras it should work:
final_model = TimeDistributed(Flatten())(final_input)
final_model = Masking(mask_value = -2.)(final_model)
final_model = TimeDistributed(Reshape(IMG_SIZE))(final_model)
final_model = TimeDistributed(base_model)(final_model)
final_model = Model(final_input,final_model)
I took a look at the source code of masking, and I noticed Keras creates a mask tensor that only reduces the last axis. As long as you're dealing with 5D tensors, it will cause no problem, but when you reduce the dimensions for the LSTM, this masking tensor becomes incompatible.
Doing the first flatten step, before masking, will assure that the masking tensor works properly for 3D tensors. Then you expand the image again to its original size.
I'll probably try to install newer versions soon to test it myself, but these installing procedures have caused too much trouble and I'm in the middle of something important here.
On my machine, this code compiles, but that strange error appears in prediction time (see link at the first line of this answer).
Creating a model for predicting the intermediate layers
I'm not sure, by the code I've seen, that the masking function is kept internally in tensors. I don't know exactly how it works, but it seems to be managed separately from the building of the functions inside the layers.
So, try using a keras standard model to make the predictions:
inp = final_model.input # input placeholder
outputs = [layer.output for layer in final_model.layers] # all layer outputs
fullModel = Model(inp,outputs)
layerPredictions = fullModel.predict(np.expand_dims(TEST_SAMPLE,0))
print(layerPredictions[-2])
It seems to be working as intended. Masking in Keras doesn't produce zeros as you would expect, it instead skips the timesteps that are masked in upstream layers such as LSTM and loss calculation. In case of RNNs, Keras (at least tensorflow) is implemented such that the states from the previous step are carried over, tensorflow_backend.py. This is done in part to preserve the shapes of tensors when dynamic input is given.
If you really want zeros you will have to implement your own layer with a similar logic to Masking and return zeros explicitly. To solve your problem, you need a mask before the final LSTM layer using the final_input:
class MyMask(Masking):
"""Layer that adds a mask based on initial input."""
def compute_mask(self, inputs, mask=None):
# Might need to adjust shapes
return K.any(K.not_equal(inputs[0], self.mask_value), axis=-1)
def call(self, inputs):
# We just return input back
return inputs[1]
def compute_output_shape(self, input_shape):
return input_shape[1]
final_model = MyMask(mask_value=-2.)([final_input, final_model])
You probably can attach the mask in a simpler manner but this custom class essentially adds a mask based on your initial inputs and outputs a Keras tensor that now has a mask.
Your LSTM will ignore in your example the second image. To confirm you can return_sequences=Trueand check that the output for 2 images are identical.
I'm trying implement the same thing, I want my LSTM sequences to have variable sizes. However I can't even implement your original model. I obtain the following error: TypeError: Layer input_1 does not support masking, but was passed an input_mask: Tensor("time_distributed_1/Reshape_1:0", shape=(?, 100, 100), dtype=bool) I'm using tensorflow 1.10 and keras 2.2.2
I solved the problem by adding a second input, a mask to specify which timesteps to take into account for the LSTM. That way the image sequence always has the same number of timesteps, the CNN always generates an output, but some of them are ignored for the LSTM input. However, the missing images need to be chosen carefully so that the batch normalization is not affected.
def LSTM_CNN(params):
resnet = ResNet50(include_top=False, weights='imagenet', pooling = 'avg')
input_layer = Input(shape=(params.numFrames, params.height, params.width, 3))
input_mask = Input(shape=(params.numFrames,1))
curr_layer = TimeDistributed(resnet)(input_layer)
resnetOutput = Dropout(0.5)(curr_layer)
curr_layer = multiply([resnetOutput,input_mask])
cnn_output = curr_layer
curr_layer = Masking(mask_value=0.0)(curr_layer)
lstm_out = LSTM(256, dropout=0.5)(curr_layer)
output = Dense(output_dim=params.numClasses, activation='sigmoid')(lstm_out)
model = Model([input_layer, input_mask], output)
return model

Tensorflow tf.nn.embedding_lookup

is there a small neural network in tf.nn.embedding_lookup??
When I train some data, a value of the same index is changing.
So is it trained also? while I'm training my model
I checked the official embedding_lookup code but I can not see any tf.Variables for train embedding parameter.
But when I print all tf.Variables then I can found a Variable which is within embedding scope
Thank you.
Yes, the embedding is learned. You can look at the tf.nn.embedding_lookup operation as doing the following matrix multiplication more efficiently:
import tensorflow as tf
import numpy as np
NUM_CATEGORIES, EMBEDDING_SIZE = 5, 3
y = tf.placeholder(name='class_idx', shape=(1,), dtype=tf.int32)
RS = np.random.RandomState(42)
W_em_init = RS.randn(NUM_CATEGORIES, EMBEDDING_SIZE)
W_em = tf.get_variable(name='W_em',
initializer=tf.constant_initializer(W_em_init),
shape=(NUM_CATEGORIES, EMBEDDING_SIZE))
# Using tf.nn.embedding_lookup
y_em_1 = tf.nn.embedding_lookup(W_em, y)
# Using multiplication
y_one_hot = tf.one_hot(y, depth=NUM_CATEGORIES)
y_em_2 = tf.matmul(y_one_hot, W_em)
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
sess.run([y_em_1, y_em_2], feed_dict={y: [1.0]})
# [array([[ 1.5230298 , -0.23415338, -0.23413695]], dtype=float32),
# array([[ 1.5230298 , -0.23415338, -0.23413695]], dtype=float32)]
The variable W_em will be trained in exactly the same way irrespective of whether you use y_em_1 or y_em_2 formulation; y_em_1 is likely to be more efficient, though.