Loss function works with reduce_mean but not reduce_sum - tensorflow

I'm new to TensorFlow and have been looking at the examples here. I wanted to rewrite the multilayer perceptron classification model as a regression model. However, I ran into some strange behaviour when modifying the loss function. It works fine with tf.reduce_mean, but if I try tf.reduce_sum it gives NaNs in the output. This seems very strange, because the functions are very similar; the only difference is that the mean divides the sum by the number of elements, so I can't see how NaNs could be introduced by this change.
import numpy as np
import tensorflow as tf

# Parameters
learning_rate = 0.001

# Network Parameters
n_hidden_1 = 32  # 1st layer number of features
n_hidden_2 = 32  # 2nd layer number of features
n_input = 2      # number of inputs
n_output = 1     # number of outputs

# Make artificial data
SAMPLES = 1000
X = np.random.rand(SAMPLES, n_input)
T = np.c_[X[:, 0]**2 + np.sin(X[:, 1])]

# tf Graph input
x = tf.placeholder("float", [None, n_input])
y = tf.placeholder("float", [None, n_output])

# Create model
def multilayer_perceptron(x, weights, biases):
    # Hidden layer with tanh activation
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.tanh(layer_1)
    # Hidden layer with tanh activation
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    layer_2 = tf.nn.tanh(layer_2)
    # Output layer with linear activation
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    return out_layer

# Store layers' weights & biases
weights = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_hidden_2, n_output]))
}
biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'b2': tf.Variable(tf.random_normal([n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_output]))
}

pred = multilayer_perceptron(x, weights, biases)

# Define loss and optimizer
#se = tf.reduce_sum(tf.square(pred - y))   # Why does this give NaNs?
mse = tf.reduce_mean(tf.square(pred - y))  # When this doesn't?
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(mse)

# Initializing the variables
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

training_epochs = 10
display_step = 1

# Training cycle
for epoch in range(training_epochs):
    avg_cost = 0.
    # Loop over all batches
    for i in range(100):
        # Run optimization op (backprop) and cost op (to get loss value)
        _, msev = sess.run([optimizer, mse], feed_dict={x: X, y: T})
    # Display logs per epoch step
    if epoch % display_step == 0:
        print("Epoch:", '%04d' % (epoch + 1), "mse=", "{:.9f}".format(msev))
The problematic variable se is commented out; it should be used in place of mse to reproduce the problem.
With mse the output looks like this:
Epoch: 0001 mse= 0.051669389
Epoch: 0002 mse= 0.031438075
Epoch: 0003 mse= 0.026629323
...
and with se it ends up like this:
Epoch: 0001 se= nan
Epoch: 0002 se= nan
Epoch: 0003 se= nan
...

Summing the loss across the batch makes it 1000 times larger (from skimming the code, your training batch size appears to be 1000), so your gradients and parameter updates are also 1000 times larger. The larger updates apparently lead to NaNs.
Learning rates are generally expressed per example, so the loss used to compute the gradients for the updates should be per example as well. If the loss is per batch, then the learning rate needs to be divided by the batch size to get comparable training results.
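As a minimal sketch of the scaling involved, reusing the names from the question (pred, y, learning_rate, SAMPLES), the summed loss can be made to behave like the mean loss by dividing the learning rate by the batch size:

se = tf.reduce_sum(tf.square(pred - y))    # batch loss, ~SAMPLES times the mean
mse = tf.reduce_mean(tf.square(pred - y))  # per-example (mean) loss

# Gradients of `se` are SAMPLES times those of `mse`, so scaling the learning
# rate down by the batch size gives comparable updates:
opt_sum = tf.train.GradientDescentOptimizer(learning_rate / SAMPLES).minimize(se)
opt_mean = tf.train.GradientDescentOptimizer(learning_rate).minimize(mse)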

If you use reduce_sum instead of reduce_mean, the gradient is much larger. Therefore, you should scale down the learning rate correspondingly to make sure the training process can proceed properly.

In most of the literature, the loss is expressed as the mean of the losses over the batch. If the loss is calculated with reduce_mean(), the learning rate should be regarded as per batch, and it should be larger.
It seems that in tensorflow.keras.losses you can still choose between mean and sum. For example, tf.keras.losses.Huber defaults to the mean, but you are allowed to set it to the sum.
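For instance, a small illustration with the standard tf.keras API (assuming TF 2.x eager execution; the tensors are made up for the example):

import tensorflow as tf

huber_mean = tf.keras.losses.Huber()  # default reduction averages over the batch
huber_sum = tf.keras.losses.Huber(reduction=tf.keras.losses.Reduction.SUM)

y_true = tf.constant([[0.0], [1.0], [2.0], [3.0]])
y_pred = tf.constant([[0.5], [0.5], [2.5], [2.5]])

print(float(huber_mean(y_true, y_pred)))  # mean of the per-example Huber losses
print(float(huber_sum(y_true, y_pred)))   # 4x the mean for this batch of 4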

Related

Why is tf.GradientTape.jacobian giving None?

I'm using the Iris dataset, and am following this official tutorial: Custom training: walkthrough.
In the training loop, I am trying to gather the model outputs and weights at every epoch where epoch % 50 == 0 into the lists m_outputs_mod50 and gather_weights respectively:
# Keep results for plotting
train_loss_results = []
train_accuracy_results = []
m_outputs_mod50 = []
gather_weights = []

num_epochs = 201

for epoch in range(num_epochs):
    epoch_loss_avg = tf.keras.metrics.Mean()
    epoch_accuracy = tf.keras.metrics.SparseCategoricalAccuracy()
    # gather_kernel(model)

    # Training loop - using batches of 32
    for x, y in train_dataset:
        # Optimize the model
        loss_value, grads = grad(model, x, y)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Track progress
        epoch_loss_avg.update_state(loss_value)  # Add current batch loss
        # Compare predicted label to actual label
        # training=True is needed only if there are layers with different
        # behavior during training versus inference (e.g. Dropout).
        epoch_accuracy.update_state(y, model(x, training=True))

    # End epoch
    train_loss_results.append(epoch_loss_avg.result())
    train_accuracy_results.append(epoch_accuracy.result())
    # pred_hist.append(model.predict(x))

    if epoch % 50 == 0:
        m_outputs_mod50.append(model(x))
        gather_weights.append(model.weights)
        print("Epoch {:03d}: Loss: {:.3f}, Accuracy: {:.3%}".format(
            epoch, epoch_loss_avg.result(), epoch_accuracy.result()))
Running the above, and trying to get the jacobian even at epoch 0 (using m_outputs_mod50[0] and gather_weights[0]) via
with tf.GradientTape() as tape:
    print(tape.jacobian(target=m_outputs_mod50[0], sources=gather_weights[0]))
I get a list of None as the output.
Why?
You need to understand how the GradientTape operates. For that, you can follow the guide: Introduction to gradients and automatic differentiation. Here is an excerpt:
TensorFlow provides the tf.GradientTape API for automatic
differentiation; that is, computing the gradient of a computation with
respect to some inputs, usually tf.Variables. TensorFlow "records"
relevant operations executed inside the context of a tf.GradientTape
onto a "tape". TensorFlow then uses that tape to compute the gradients
of a "recorded" computation using reverse mode differentiation.
To compute a gradient (or a jacobian), the tape needs to record the operations that are executed in its context. Then, once the forward pass has been executed, it's possible to use the tape outside its context to compute the gradient/jacobian.
You could use something like this:
if epoch % 50 == 0:
    with tf.GradientTape() as tape:
        out = model(x)
    jacobian = tape.jacobian(out, model.weights)

Why doesn't custom training loop average loss over batch_size?

The code snippet below is the custom training loop from the official TensorFlow tutorial https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch. Another tutorial also does not average the loss over batch_size, as shown here: https://www.tensorflow.org/tutorials/customization/custom_training_walkthrough
Why is loss_value not averaged over batch_size at the line loss_value = loss_fn(y_batch_train, logits)? Is this a bug? According to another question here, Loss function works with reduce_mean but not reduce_sum, reduce_mean is indeed needed to average the loss over batch_size.
The loss_fn is defined in the tutorial as below. It obviously does not average over batch_size.
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
From the documentation, keras.losses.SparseCategoricalCrossentropy sums the loss over the batch without averaging. Thus, this is essentially reduce_sum instead of reduce_mean!
Type of tf.keras.losses.Reduction to apply to loss. Default value is AUTO. AUTO indicates that the reduction option will be determined by the usage context. For almost all cases this defaults to SUM_OVER_BATCH_SIZE.
The code is shown below.
epochs = 2
for epoch in range(epochs):
    print("\nStart of epoch %d" % (epoch,))

    # Iterate over the batches of the dataset.
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):

        # Open a GradientTape to record the operations run
        # during the forward pass, which enables auto-differentiation.
        with tf.GradientTape() as tape:

            # Run the forward pass of the layer.
            # The operations that the layer applies
            # to its inputs are going to be recorded
            # on the GradientTape.
            logits = model(x_batch_train, training=True)  # Logits for this minibatch

            # Compute the loss value for this minibatch.
            loss_value = loss_fn(y_batch_train, logits)

        # Use the gradient tape to automatically retrieve
        # the gradients of the trainable variables with respect to the loss.
        grads = tape.gradient(loss_value, model.trainable_weights)

        # Run one step of gradient descent by updating
        # the value of the variables to minimize the loss.
        optimizer.apply_gradients(zip(grads, model.trainable_weights))

        # Log every 200 batches.
        if step % 200 == 0:
            print(
                "Training loss (for one batch) at step %d: %.4f"
                % (step, float(loss_value))
            )
            print("Seen so far: %s samples" % ((step + 1) * 64))
I've figured it out: loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True) does indeed average the loss over batch_size by default.
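A quick way to confirm this (assuming TF 2.x eager execution; the tensors below are made up for the check):

import tensorflow as tf
from tensorflow import keras

y_true = tf.constant([0, 1, 2])
logits = tf.random.normal([3, 5])

loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
per_example = keras.losses.sparse_categorical_crossentropy(y_true, logits, from_logits=True)

# The class default reduction (AUTO -> SUM_OVER_BATCH_SIZE) matches an explicit mean:
print(float(loss_fn(y_true, logits)))
print(float(tf.reduce_mean(per_example)))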

Why does the loss of vgg16 equal nan, but behave normally when adding an extra Softmax layer?

I am coding a VGG16 net with the TensorFlow low-level API. The model is tested on the ImageNet-12 dataset. Due to the computation cost, I split the validation set into 80% training data and 20% test data.
First, the last layer fc8 outputs logits without a softmax activation, and I use tf.nn.softmax_cross_entropy_with_logits_v2(labels, logits) to compute the loss. It eventually outputs nan during training.
Then I try adding a softmax layer after fc8, but still use tf.nn.softmax_cross_entropy_with_logits_v2(labels, logits) to compute the loss. Surprisingly, the loss then comes out normal rather than nan.
Here is the code before adding the softmax layer:
def vgg16():
    ...
    fc8_layer = FullConnectedLayer(y, self.weight_dict, regularizer_fc=self.regularizer_fc)
    self.op_logits = fc8_layer.layer_output

def loss(self):
    entropy = tf.nn.softmax_cross_entropy_with_logits_v2(labels=self.Y, logits=self.op_logits)
    l2_loss = tf.losses.get_regularization_loss()
    self.op_loss = tf.reduce_mean(entropy, name='loss') + l2_loss
and I change the vgg16 output like this:
def vgg16():
    ...
    fc8_layer = FullConnectedLayer(y, self.weight_dict, regularizer_fc=self.regularizer_fc)
    self.op_logits = tf.nn.softmax(fc8_layer.layer_output)
Besides, here is my optimizer:
def optimize(self):
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
            self.opt = tf.train.MomentumOptimizer(learning_rate=self.config.learning_rate,
                                                  momentum=0.9, use_nesterov=True)
            self.op_opt = self.opt.minimize(self.op_loss)
I don't understand why adding the softmax layer works. In my understanding, two softmax layers shouldn't affect the final loss, since a second softmax doesn't change the proportions of the output units.
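For reference, this premise can be checked numerically; the snippet below is a small illustration (TF 1.x style, numbers made up), not part of the original post. Applying softmax a second time does compress the distribution, and it bounds the values that reach the loss to [0, 1]:

import tensorflow as tf

logits = tf.constant([[10.0, 0.0, -10.0]])
once = tf.nn.softmax(logits)
twice = tf.nn.softmax(once)

with tf.Session() as sess:
    print(sess.run(once))   # approx. [[1.0, 4.5e-05, 2.1e-09]]
    print(sess.run(twice))  # approx. [[0.576, 0.212, 0.212]] -- much flatter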

TensorFlow LSTM for sentiment analysis not learning. UPDATED

UPDATED:
I'm building a neural network for my final project and I need some help with it.
I'm trying to build an RNN to do sentiment analysis over Spanish text. I have about 200,000 labeled tweets, and I vectorized them using word2vec with a Spanish embedding.
Dataset & vectorization:
I erased duplicates and split the dataset into training and testing sets.
Padding, unknown, and end-of-sentence tokens are applied when vectorizing.
I mapped the #mentions to known names in the word2vec model. Example: #iamthebest => "John"
My model:
My data tensor has shape = (batch_size, 20, 300).
I have 3 classes: neutral, positive and negative, so my target tensor has shape = (batch_size, 3).
I use BasicLSTMCell cells and dynamic_rnn to build the net.
I use the Adam optimizer and softmax cross-entropy for the loss calculation.
I use a dropout wrapper to decrease overfitting.
Last run:
I have tried different configurations and none of them seem to work.
Last setup: 2 layers, batch size of 512, 15 epochs and a learning rate of 0.001.
Weak points for me:
I'm worried about the final layer and the handling of the final state in dynamic_rnn.
Code:
import datetime
import math
import numpy as np
import tensorflow as tf

# set variables
num_epochs = 15
tweet_size = 20
hidden_size = 200
vec_size = 300
batch_size = 512
number_of_layers = 1
number_of_classes = 3
learning_rate = 0.001
TRAIN_DIR = "/checkpoints"

tf.reset_default_graph()

# Create a session
session = tf.Session()

# Inputs placeholders
tweets = tf.placeholder(tf.float32, [None, tweet_size, vec_size], "tweets")
labels = tf.placeholder(tf.float32, [None, number_of_classes], "labels")

# Placeholder for dropout
keep_prob = tf.placeholder(tf.float32)

# make the lstm cells, and wrap them in MultiRNNCell for multiple layers
def lstm_cell():
    cell = tf.contrib.rnn.BasicLSTMCell(hidden_size)
    return tf.contrib.rnn.DropoutWrapper(cell=cell, output_keep_prob=keep_prob)

multi_lstm_cells = tf.contrib.rnn.MultiRNNCell([lstm_cell() for _ in range(number_of_layers)],
                                               state_is_tuple=True)

# Creates a recurrent neural network
outputs, final_state = tf.nn.dynamic_rnn(multi_lstm_cells, tweets, dtype=tf.float32)

with tf.name_scope("final_layer"):
    # weight and bias to shape the final layer
    W = tf.get_variable("weight_matrix", [hidden_size, number_of_classes], tf.float32,
                        tf.random_normal_initializer(stddev=1.0 / math.sqrt(hidden_size)))
    b = tf.get_variable("bias", [number_of_classes], initializer=tf.constant_initializer(1.0))
    sentiments = tf.matmul(final_state[-1][-1], W) + b
    prob = tf.nn.softmax(sentiments)
    tf.summary.histogram('softmax', prob)

with tf.name_scope("loss"):
    # define cross entropy loss function
    losses = tf.nn.softmax_cross_entropy_with_logits(logits=sentiments, labels=labels)
    loss = tf.reduce_mean(losses)
    tf.summary.scalar("loss", loss)

with tf.name_scope("accuracy"):
    # round our actual probabilities to compute error
    accuracy = tf.to_float(tf.equal(tf.argmax(prob, 1), tf.argmax(labels, 1)))
    accuracy = tf.reduce_mean(tf.cast(accuracy, dtype=tf.float32))
    tf.summary.scalar("accuracy", accuracy)

# define our optimizer to minimize the loss
with tf.name_scope("train"):
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)

# tensorboard summaries
merged_summary = tf.summary.merge_all()
logdir = "tensorboard/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + "/"
writer = tf.summary.FileWriter(logdir, session.graph)

# initialize any variables
tf.global_variables_initializer().run(session=session)

# Create a saver for writing training checkpoints.
saver = tf.train.Saver()

# load our data and separate it into tweets and labels
train_tweets = np.load('data_es/train_vec_tweets.npy')
train_labels = np.load('data_es/train_vec_labels.npy')
test_tweets = np.load('data_es/test_vec_tweets.npy')
test_labels = np.load('data_es/test_vec_labels.npy')

**HERE I HAVE THE LOOP FOR TRAINING AND TESTING, I KNOW IT'S FINE**
I have already solved my problem. After reading some papers and more trial and error, I figured out what my mistakes were.
1) Dataset: I had a large dataset, but I didn't format it properly.
I checked the distribution of tweet labels (neutral, positive and negative), realized there was a disparity in their distribution, and normalized it.
I cleaned it up further by erasing URLs, hashtags and unnecessary punctuation.
I shuffled prior to vectorization.
2) Initialization:
I initialized the MultiRNNCell with zeros and changed my custom final layer to tf.contrib.layers.fully_connected. I also added the initialization of the bias and weight matrix. (By fixing this, I started to see better loss and accuracy plots in TensorBoard.) See the sketch after this list.
3) Dropout:
I read the paper Recurrent Dropout without Memory Loss and changed my dropouts accordingly; I started seeing improvements in the loss and accuracy.
4) Decaying the learning rate:
I added an exponential learning-rate decay after 10,000 steps to control overfitting (also in the sketch below).
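Roughly, points 2 and 4 look like the sketch below (TF 1.x / tf.contrib; the exact arguments, the 0.96 decay rate and the use of the top layer's hidden state are illustrative choices, not my exact code):

# Point 2: explicit zero initial state and a library-provided final layer
initial_state = multi_lstm_cells.zero_state(tf.shape(tweets)[0], tf.float32)
outputs, final_state = tf.nn.dynamic_rnn(multi_lstm_cells, tweets,
                                         initial_state=initial_state)
sentiments = tf.contrib.layers.fully_connected(final_state[-1].h, number_of_classes,
                                               activation_fn=None)

# Point 4: exponential learning-rate decay over 10,000-step intervals
global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(start_learning_rate, global_step,
                                           decay_steps=10000, decay_rate=0.96,
                                           staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=global_step)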
Final results:
After applying all of these changes, I achieved a test accuracy of 84%, which is acceptable because my data set still sucks.
My final network config was:
num_epochs = 20
tweet_size = 20
hidden_size = 400
vec_size = 300
batch_size = 512
number_of_layers = 2
number_of_classes = 3
start_learning_rate = 0.001

Linear regression model in TensorFlow can't learn bias

I am trying to train a linear regression model in TensorFlow using some generated data. The model seems to learn the slope of the line, but is unable to learn the bias.
I have tried changing the number of epochs, the weight (slope) and the bias, but every time the bias learned by the model comes out to be zero. I don't know where I am going wrong, and some help would be appreciated.
Here is the code.
import numpy as np
import tensorflow as tf

# assume the linear model to be Y = W*X + b
X = tf.placeholder(tf.float32, [None, 1])
Y = tf.placeholder(tf.float32, [None, 1])

# the weight and biases
W = tf.Variable(tf.zeros([1, 1]))
b = tf.Variable(tf.zeros([1]))

# the model
prediction = tf.matmul(X, W) + b

# the cost function
cost = tf.reduce_mean(tf.square(Y - prediction))

# Use gradient descent
learning_rate = 0.000001
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

steps = 1000
epochs = 10
Verbose = False

# In the end, the model should learn these values
test_w = 3
bias = 10

for _ in xrange(epochs):
    for i in xrange(steps):
        # make fake data for the model
        # feed one example at a time
        # stochastic gradient descent, because we only use one example at a time
        x_temp = np.array([[i]])
        y_temp = np.array([[test_w * i + bias]])

        # train the model using the data
        feed_dict = {X: x_temp, Y: y_temp}
        sess.run(train_step, feed_dict=feed_dict)
        if Verbose and i % 100 == 0:
            print("Iteration No: %d" % i)
            print("W = %f" % sess.run(W))
            print("b = %f" % sess.run(b))

print("Finally:")
print("W = %f" % sess.run(W))
print("b = %f" % sess.run(b))
# These values should be close to the values we used to generate data
https://github.com/HarshdeepGupta/tensorflow_notebooks/blob/master/Linear%20Regression.ipynb
The outputs are printed in the last lines of the code.
The model needs to learn test_w and bias (in the linked notebook, they are in the 3rd cell, after the first comment), which are set to 3 and 10 respectively.
The model correctly learns the weight (slope), but is unable to learn the bias. Where is the error?
The main problem is that you are feeding just one sample at a time to the model. This makes your optimizer very unstable, which is why you have to use such a small learning rate. I suggest you feed more samples in each step.
If you insist on feeding one sample at a time, consider using an optimizer with momentum, such as tf.train.AdamOptimizer(learning_rate). That way you can increase the learning rate and reach convergence.
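As a minimal sketch of the batching suggestion, reusing the placeholders and graph from the question (X, Y, train_step, W, b, test_w, bias); the batch size, step count and input range are arbitrary choices for illustration:

batch_size = 32
for step in range(1000):
    # sample a whole batch of inputs and compute the matching targets
    x_batch = np.random.rand(batch_size, 1)  # inputs in [0, 1)
    y_batch = test_w * x_batch + bias
    sess.run(train_step, feed_dict={X: x_batch, Y: y_batch})
    # with mini-batches, the learning rate can also be raised well above 1e-6

print(sess.run(W), sess.run(b))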