Gradients are always zero - tensorflow

I have written an algorithm using the TensorFlow framework and ran into the problem that tf.train.Optimizer.compute_gradients(loss) returns zeros for all weights. Another problem is that if I set the batch size larger than about 5, tf.histogram_summary for the weights throws an error saying that some of the values are NaN.
I cannot provide a reproducible example here, because my code is quite bulky and I am not good enough at TF to make it shorter. I will paste some fragments below.
Main loop:
images_ph = tf.placeholder(tf.float32, shape=some_shape)
labels_ph = tf.placeholder(tf.float32, shape=some_shape)
output = inference(BATCH_SIZE, images_ph)
loss = loss(labels_ph, output)
train_op = train(loss, global_step)
session = tf.Session()
session.run(tf.initialize_all_variables())
for i in xrange(MAX_STEPS):
    images, labels = train_dataset.get_batch(BATCH_SIZE, yolo.INPUT_SIZE, yolo.OUTPUT_SIZE)
    session.run([loss, train_op], feed_dict={images_ph: images, labels_ph: labels})
Train op (this is where the problem occurs):
def train(total_loss, global_step):
    opt = tf.train.AdamOptimizer()
    grads = opt.compute_gradients(total_loss)
    # Here the gradients are zeros
    for grad, var in grads:
        if grad is not None:
            tf.histogram_summary("gradients/" + var.op.name, grad)
    return opt.apply_gradients(grads, global_step=global_step)
Loss (the loss is calculated correctly, since it changes from sample to sample):
def loss(labels, output):
    return tf.reduce_mean(tf.squared_difference(labels, output))
Inference: a stack of convolutional layers with ReLU, followed by 3 fully connected layers with a sigmoid activation on the last layer. All weights are initialized from truncated normal random variables. All labels are fixed-length vectors of real numbers in the range [0, 1].
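For context, a minimal sketch of the kind of inference function described above, assuming TF 1.x graph mode. All shapes, layer counts and sizes here are made up, since the real inference code is not shown:
import tensorflow as tf

def inference(batch_size, images):
    # one conv + ReLU block (the real model has several)
    kernel = tf.Variable(tf.truncated_normal([3, 3, 3, 16], stddev=0.1))
    conv = tf.nn.relu(tf.nn.conv2d(images, kernel, strides=[1, 1, 1, 1], padding='SAME'))
    flat = tf.reshape(conv, [batch_size, -1])

    def fc(x, n_in, n_out, activation):
        w = tf.Variable(tf.truncated_normal([n_in, n_out], stddev=0.1))
        b = tf.Variable(tf.zeros([n_out]))
        return activation(tf.matmul(x, w) + b)

    # three fully connected layers; sigmoid only on the last one, so the
    # outputs land in [0, 1] like the labels
    h1 = fc(flat, 64 * 64 * 16, 256, tf.nn.relu)   # assumes 64x64 RGB input
    h2 = fc(h1, 256, 256, tf.nn.relu)
    return fc(h2, 256, 10, tf.sigmoid)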
Thanks in advance for any help! If you have any hypotheses about my problem, please share them and I will try them out. I can also share the whole code if you like.

Related

Tensorflow Neural Machine Translation Example - Loss Function

I'm stepping through the code here: https://www.tensorflow.org/tutorials/text/nmt_with_attention
as a learning exercise, and I am confused about when the loss function is called and what is passed to it. I added two print statements to loss_function, and when the training loop runs it only prints out
(64,)
(64, 4935)
at the very start multiple times and then nothing again. I am confused on two fronts:
Why doesn't loss_function() get called repeatedly throughout the training loop and print the shapes? I expected the loss function to get called at the end of each batch, which is of size 64.
I expected the shapes of the actuals to be (batch size, time steps) and the predictions to be (batch size, time steps, vocabulary size). It looks like the loss gets called separately for every time step (64 is the batch size and 4935 is the vocabulary size).
The relevant bits I believe are reproduced below.
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    print(real.shape)
    print(pred.shape)
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask  # set padding entries to zero loss
    return tf.reduce_mean(loss_)

@tf.function
def train_step(inp, targ, enc_hidden):
    loss = 0
    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)
        dec_hidden = enc_hidden
        dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)
        # Teacher forcing - feeding the target as the next input
        for t in range(1, targ.shape[1]):
            # passing enc_output to the decoder
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
            print(targ[:, t])
            print(predictions)
            loss += loss_function(targ[:, t], predictions)
            # using teacher forcing
            dec_input = tf.expand_dims(targ[:, t], 1)
    batch_loss = (loss / int(targ.shape[1]))
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss

EPOCHS = 10

for epoch in range(EPOCHS):
    start = time.time()
    enc_hidden = encoder.initialize_hidden_state()
    total_loss = 0
    for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
        # print(batch)
        batch_loss = train_step(inp, targ, enc_hidden)
        total_loss += batch_loss
        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         batch_loss.numpy()))
    # saving (checkpoint) the model every 2 epochs
    if (epoch + 1) % 2 == 0:
        checkpoint.save(file_prefix=checkpoint_prefix)
    print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                        total_loss / steps_per_epoch))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
The loss is treated similarly to the rest of the graph. In TensorFlow, calls like tf.keras.layers.Dense and tf.nn.conv2d don't actually perform the operation; instead, they define the graph for the operations. I have another post here, How do backpropagation works in tensorflow, that explains backprop and some of the motivation for why this is.
The loss function you have above is
def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    print(real.shape)
    print(pred.shape)
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask  # set padding entries to zero loss
    result = tf.reduce_mean(loss_)
    return result
Think of this function as a generator that returns result. result defines the graph that computes the loss. Perhaps a better name for this function would be loss_function_graph_creator ... but that's another story.
result, a graph that contains the weights, biases, and the information needed to do both the forward propagation and the backpropagation, is all model.fit needs. It no longer needs this function, and it doesn't need to run the function every loop.
In truth, what is happening under the covers is that, given your model (called my_model), the compile line
model.compile(loss=loss_function, optimizer='sgd')
is effectively the following lines
input = tf.keras.Input()
output = my_model(input)
loss = loss_function(input,output)
opt = tf.keras.optimizers.SGD()
gradient = opt.minimize(loss)
get_gradient_model = tf.keras.Model(input,gradient)
and there you have the gradient operation, which can be used in a loop to get the gradients; conceptually, that is what model.fit does.
Q and A
Is the fact that this function (@tf.function def train_step(inp, targ, enc_hidden):) has the tf.function decorator (and that the loss function is called inside it) what makes this code run as you describe, rather than as normal Python?
No. It is not 'normal' Python. It only defines the flow of tensors through a graph of matrix operations that will (hopefully) run on your GPU. All the TensorFlow operations just set up the operations on the GPU (or on the CPU if you don't have a GPU).
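To illustrate that point with a minimal, self-contained sketch (assuming TF 2.x; this is not the question's code): a Python print inside a tf.function runs only while the function is being traced into a graph, whereas tf.print becomes part of the graph and runs on every call.
import tensorflow as tf

@tf.function
def traced_add(x):
    print("tracing, python print, shape:", x.shape)  # runs only during tracing
    tf.print("executing, graph print")               # runs on every call
    return x + 1

a = tf.constant([1.0, 2.0])
traced_add(a)  # first call triggers tracing: both lines print
traced_add(a)  # only "executing, graph print" appears
That is why the shapes printed inside loss_function show up only a handful of times at the very start (once per trace) and then never again.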
How can I tell the actual shapes being passed into loss_function (the second part of my question)?
No problem at all... simply run this code
loss_function(y, y).shape
This computes the loss function of your expected output compared against exactly the same output. The loss will (hopefully) be zero, but actually calculating the value of the loss isn't the point: you want the shape, and this gives it to you.
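Adapting that idea to this particular loss, where the labels and predictions have different shapes, a quick check with made-up dummy tensors (shapes taken from the prints in the question, and assuming the question's loss_function is defined) looks like this:
import tensorflow as tf

batch_size, vocab_size = 64, 4935                    # taken from the question's prints
dummy_real = tf.zeros([batch_size], dtype=tf.int32)  # token ids for one time step
dummy_pred = tf.zeros([batch_size, vocab_size])      # logits for one time step

print(loss_function(dummy_real, dummy_pred).shape)   # () -- reduce_mean gives a scalar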

Tensorflow Estimators : proper way to train image grids separately

I am trying to train an object detection model as described in this paper
There are 3 fully connected layers with 512, 512, 25 neurons. The 16x55x55 feature map from the last convolutional layer is fed into the fully connected layers to retrieve the appropriate class. At this stage, every grid described by (16x1x1) is fed into the fully connected layers to classify the grid as belonging to one of the 25 classes. The structure can be seen in the picture below
fully connected layers
I am trying to adapt the code from the TF MNIST classification tutorial, and I would like to know if it is okay to just sum the losses from each grid, as in the code snippet below, and use that to train the model weights.
flat_fmap = tf.reshape(last_conv_layer, [-1, 16*55*55])

total_loss = 0
for grid in flat_fmap:  # pseudocode: loop over the 16x1x1 grids
    dense1 = tf.layers.dense(inputs=grid, units=512, activation=tf.nn.relu)
    dense2 = tf.layers.dense(inputs=dense1, units=512, activation=tf.nn.relu)
    logits = tf.layers.dense(inputs=dense2, units=25)
    total_loss += tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
train_op = optimizer.minimize(
    loss=total_loss,
    global_step=tf.train.get_global_step())

return tf.estimator.EstimatorSpec(mode=tf.estimator.ModeKeys.TRAIN, loss=total_loss, train_op=train_op)
In the code above, I think 3 new layers are being created at every iteration. However, I would like the weights to be shared when classifying one grid and then another.
Adding to the total_loss should be ok.
tf.losses.sparse_softmax_cross_entropy is also adding losses together.
It calculates the sparse softmax with the logits and then reduces the resulting array through a sum using math_ops.reduce_sum, as you can see in its source.
So you are adding them together, one way or another.
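As a toy illustration of that equivalence (made-up tensors in TF 2.x eager style, not the model's real losses): accumulating per-grid sums gives the same number as one big reduction over all the per-example losses.
import tensorflow as tf

grid_a = tf.constant([0.2, 0.5, 0.1])   # made-up per-example losses for grid A
grid_b = tf.constant([0.3, 0.4])        # made-up per-example losses for grid B

summed_per_grid = tf.reduce_sum(grid_a) + tf.reduce_sum(grid_b)
summed_at_once = tf.reduce_sum(tf.concat([grid_a, grid_b], axis=0))

print(summed_per_grid.numpy(), summed_at_once.numpy())  # both ~1.5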
The for loop in the network declaration seems unusual; it probably makes more sense to do it at run time and pass each grid through the feed_dict.
dense1 = tf.layers.dense(inputs=X, units=512, activation=tf.nn.relu)
dense2 = tf.layers.dense(inputs=dense1, units=512, activation=tf.nn.relu)
logits = tf.layers.dense(inputs=dense2, units=25)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)

total_loss = 0
with tf.Session() as sess:
    sess.run(init)
    for grid in flat_fmap:
        # grid_labels: whatever holds the labels for this particular grid
        _, l = sess.run([optimizer, loss], feed_dict={X: grid, labels: grid_labels})
        total_loss += l

Simple softmax classifier in tensorflow

So I am trying to write a simple softmax classifier in TensorFlow.
Here is the code:
# Neural network parameters
n_hidden_units = 500
n_classes = 10
# training set placeholders
input_X = tf.placeholder(dtype='float32',shape=(None,X_train.shape[1], X_train.shape[2]),name="input_X")
input_y = tf.placeholder(dtype='int32', shape=(None,), name="input_y")
# hidden layer
dim = X_train.shape[1]*X_train.shape[2] # dimension of each training data point
flatten_X = tf.reshape(input_X, shape=(-1, dim))
weights_hidden_layer = tf.Variable(initial_value=np.zeros((dim,n_hidden_units)), dtype ='float32')
bias_hidden_layer = tf.Variable(initial_value=np.zeros((1,n_hidden_units)), dtype ='float32')
hidden_layer_output = tf.nn.relu(tf.matmul(flatten_X, weights_hidden_layer) + bias_hidden_layer)
# output layer
weights_output_layer = tf.Variable(initial_value=np.zeros((n_hidden_units,n_classes)), dtype ='float32')
bias_output_layer = tf.Variable(initial_value=np.zeros((1,n_classes)), dtype ='float32')
output_logits = tf.matmul(hidden_layer_output, weights_output_layer) + bias_output_layer
predicted_y = tf.nn.softmax(output_logits)
# loss
one_hot_labels = tf.one_hot(input_y, depth=n_classes, axis = -1)
loss = tf.losses.softmax_cross_entropy(one_hot_labels, output_logits)
# optimizer
optimizer = tf.train.MomentumOptimizer(0.01, 0.5).minimize(
    loss, var_list=[weights_hidden_layer, bias_hidden_layer, weights_output_layer, bias_output_layer])
This compiles, and I have checked the shapes of all the tensors; they coincide with what I expect.
However, I tried to run the optimizer using the following code:
# running the optimizer
s = tf.InteractiveSession()
s.run(tf.global_variables_initializer())
for i in range(5):
    s.run(optimizer, {input_X: X_train, input_y: y_train})
    loss_i = s.run(loss, {input_X: X_train, input_y: y_train})
    print("loss at iter %i:%.4f" % (i, loss_i))
And the loss stayed the same across all iterations!
I must have messed up something, but I fail to see what.
Any ideas? I would also appreciate any comments regarding code style and/or TensorFlow tips.
You have made a mistake: you are initializing your weights with np.zeros. Use np.random.normal instead. You can choose the standard deviation of this Gaussian distribution based on the number of inputs going to a particular neuron. You can read more about it here.
The reason you want to initialize from a Gaussian distribution is that you want to break symmetry. If all the weights are initialized to zero, then you can work through backpropagation to see that all the weights will evolve in the same way.
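Here is a tiny sketch of that argument (TF 2.x eager style, with made-up shapes, not the question's exact network): with all-zero weights in a ReLU network like the one above, the gradients of every weight matrix come out identical, in fact identically zero, so the loss cannot move.
import tensorflow as tf

x = tf.random.normal([4, 3])                 # made-up batch of 4 inputs
y = tf.constant([0, 1, 2, 1])                # made-up class labels

W1, b1 = tf.Variable(tf.zeros([3, 5])), tf.Variable(tf.zeros([5]))
W2, b2 = tf.Variable(tf.zeros([5, 3])), tf.Variable(tf.zeros([3]))

with tf.GradientTape() as tape:
    hidden = tf.nn.relu(x @ W1 + b1)
    logits = hidden @ W2 + b2
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))

for g in tape.gradient(loss, [W1, b1, W2]):
    print(tf.reduce_max(tf.abs(g)).numpy())  # 0.0 for every weight matrix
Replacing the zeros with a random initializer (for example the Xavier initializer shown below) breaks this symmetry and lets the loss move.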
One can visualize the weight histograms using TensorBoard to make this easier. I executed your code for this. A few more lines are needed to set up the TensorBoard logging, but the histogram summary of the weights can be added easily.
Initialized to zeros
weights_hidden_layer = tf.Variable(initial_value=np.zeros((784,n_hidden_units)), dtype ='float32')
tf.summary.histogram("weights_hidden_layer",weights_hidden_layer)
Xavier initialization
initializer = tf.contrib.layers.xavier_initializer()
weights_hidden_layer = tf.Variable(initializer(shape=(784,n_hidden_units)), dtype ='float32')
tf.summary.histogram("weights_hidden_layer",weights_hidden_layer)
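For reference, the extra logging lines look roughly like this (a minimal self-contained sketch using the TF 1.x summary API; the log directory name is arbitrary):
import numpy as np
import tensorflow as tf

n_hidden_units = 500
weights_hidden_layer = tf.Variable(np.zeros((784, n_hidden_units)), dtype='float32')
tf.summary.histogram("weights_hidden_layer", weights_hidden_layer)

merged = tf.summary.merge_all()                  # collects all summary ops
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter("./logs", sess.graph)
    writer.add_summary(sess.run(merged), global_step=0)
    writer.close()
# then inspect with: tensorboard --logdir ./logs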

How to define a TensorFlow graph with more than one input of different dims, and combine layers of different dims into one layer?

After setting each layer's name, my code below runs well.
=================== old ===============
How do I define a TensorFlow graph with more than one input, each with a different dim?
For example, I have inputs (X1, X2, X3) with different dims (d1, d2, d3).
How do I define a multi-input layer that feeds hidden-1 layers of different sizes, then combine the three hidden-1 layers into a hidden-2 layer, followed by an output layer?
Thanks, all!
I tried some code like this:
def model_fn(features, labels, mode, params):
    input_layers = [tf.feature_column.input_layer(features=features, feature_columns=params["feature_columns"][i])
                    for i, fi in enumerate(FEA_DIM)]
    hidden1 = [tf.layers.dense(input_layers[i], H1_DIM[i], tf.nn.selu) for i, _ in enumerate(FEA_DIM)]
    hidden1_c = tf.concat(hidden1, -1, "concat")
    hidden2 = tf.layers.dense(inputs=hidden1_c, units=32, activation=tf.nn.selu)
    predictions = tf.layers.dense(inputs=hidden2, units=NCLASS, activation=tf.nn.softmax)
    labels = tf.contrib.layers.one_hot_encoding(labels, NCLASS)
    loss = tf.losses.sigmoid_cross_entropy(labels, predictions)
    optimizer = tf.train.AdamOptimizer(learning_rate=1)
    train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
But it doesn't work: the accuracy does not change during training.
Here is the TensorBoard model graph (the dense_xx nodes are the hidden1 tensors):
The biggest problem lies in these lines
predictions = tf.layers.dense(inputs=hidden2, units=NCLASS, activation=tf.nn.softmax)
labels = tf.contrib.layers.one_hot_encoding(labels, NCLASS)
loss = tf.losses.sigmoid_cross_entropy(labels, predictions)
First, since you have multiple classes, you should use softmax_cross_entropy, or better, sparse_softmax_cross_entropy to dispense with the one-hot encoding.
Second, the input to softmax_cross_entropy or sigmoid_cross_entropy should be unnormalized scores, so activation=tf.nn.softmax is wrong. All deep learning frameworks combine the softmax/sigmoid with cross entropy in one step because the combined operation has better performance and numeric stability, so you should not calculate the softmax yourself first.
Third, your learning rate is too high. Even 0.0025 is, under most circumstances, still too high. You should start with 0.001 and then tune it up and down from there.
Finally, I don't understand why you apply dense layers first and then concat. Why not just concatenate all the features and then transform them together?
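Putting the first three points together, the loss part of model_fn would look roughly like this (a sketch reusing the names from the question, not tested against the real feature columns):
# logits: raw, unnormalized scores (no softmax activation here)
logits = tf.layers.dense(inputs=hidden2, units=NCLASS, activation=None)

# sparse cross entropy takes the integer class labels directly (no one-hot)
# and applies the softmax internally in a numerically stable fused op
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

optimizer = tf.train.AdamOptimizer(learning_rate=0.001)  # far smaller than 1
train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())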
For how to concat the layers, here is my complete running code as an example:
input_layers = [tf.feature_column.input_layer(features=features, feature_columns=params["feature_columns"][i])
                for i, fi in enumerate(FEA_DIM)]
hidden1 = [tf.layers.dense(input_layers[i], H1_DIM[i], tf.nn.selu, name="h_1_%s" % i,
                           kernel_regularizer=tf.contrib.layers.l1_l2_regularizer(scale_l1=1e-3, scale_l2=1e-2),
                           kernel_initializer=tf.truncated_normal_initializer(stddev=1.0/math.sqrt(H1_DIM[i]+FEA_DIM[i])))
           for i, _ in enumerate(FEA_DIM)]
hidden1_c = tf.concat(hidden1, -1, "concat")
hidden2 = tf.layers.dense(inputs=hidden1_c, units=128, activation=tf.nn.selu, name="h_2",
                          kernel_regularizer=tf.contrib.layers.l2_regularizer(scale=1e-2),
                          kernel_initializer=tf.truncated_normal_initializer(stddev=1.0/math.sqrt(128+H1_DIM[i])))
predictions = tf.layers.dense(inputs=hidden2, units=NCLASS, activation=None,
                              kernel_regularizer=tf.contrib.layers.l2_regularizer(scale=1e-2),
                              kernel_initializer=tf.truncated_normal_initializer(stddev=0.1), name="output")

Adding one-hot encoding throws an error in previously working code in TensorFlow

with tf.variable_scope("rnn_seq2seq"):
    w = tf.get_variable("proj_w", [num_units, seq_width])
    w_t = tf.transpose(w)
    b = tf.get_variable("proj_b", [seq_width])
    output_projection = (w, b)
    output, state = rnn_seq2seq(enc_inputs, dec_inputs, cell, output_projection=output_projection, feed_previous=False)
    weights = [tf.ones([batch_size * dec_steps])]
    loss = []
    for i in xrange(dec_steps - 1):
        logits = tf.nn.xw_plus_b(output[i], output_projection[0], output_projection[1])
If I introduce one-hot encoding on the logits here, the program gives an error later, although both return the same dimensions. If I comment out these lines, the program does not give any error.
prev = logits
logits = tf.to_float(tf.equal(prev,tf.reduce_max(prev,reduction_indices=[1],keep_dims=True)))
print prev
print logits
Tensor("rnn_seq2seq/xw_plus_b:0", shape=TensorShape([Dimension(800), Dimension(14)]), dtype=float32)
Tensor("rnn_seq2seq/ToFloat:0", shape=TensorShape([Dimension(800), Dimension(14)]), dtype=float32)
Rest of code:
crossent = tf.nn.softmax_cross_entropy_with_logits(logits, dec_inputs[i+1], name="SequenceLoss/CrossEntropy{0}".format(i))
loss.append(crossent)

cost = tf.reduce_sum(tf.add_n(loss))
final_state = state[-1]
tvars = tf.trainable_variables()
grads, norm = tf.clip_by_global_norm(tf.gradients(cost, tvars), 5)
lr = tf.Variable(0.0, name="learningRate")
optimizer = tf.train.GradientDescentOptimizer(lr)
train_op = optimizer.apply_gradients(zip(grads, tvars))
---> 23 grads,norm = tf.clip_by_global_norm(tf.gradients(cost,tvars),5)
ValueError: List argument 'values' to 'Pack' Op with length 0 shorter
than minimum length 1.
Neural networks can only be trained if all the operations they perform are differentiable. The "one-hot" step you apply is not differentiable, and hence such a neural network cannot be trained using any gradient-descent-based optimizer (i.e., any optimizer that TensorFlow implements).
The general approach is to use softmax (which is differentiable) during training to approximate the one-hot encoding (and your model already applies softmax after computing the logits, so commenting out the "one-hot" step is actually all you need to do).
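Concretely, based on the snippet above and reusing the question's variable names (a sketch, not tested): the training path keeps the raw logits, and any hard one-hot computation stays outside the loss.
logits = tf.nn.xw_plus_b(output[i], output_projection[0], output_projection[1])

# differentiable path: softmax and cross entropy are fused inside this op
crossent = tf.nn.softmax_cross_entropy_with_logits(
    logits, dec_inputs[i + 1], name="SequenceLoss/CrossEntropy{0}".format(i))
loss.append(crossent)

# if a hard prediction is needed (e.g. for decoding or inspection), compute it
# outside the loss; gradients simply do not flow through this branch
hard_prediction = tf.to_float(
    tf.equal(logits, tf.reduce_max(logits, reduction_indices=[1], keep_dims=True)))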