TensorFlow convolution tutorial: scale of logits

I am trying to build my own model by adding some code to cifar10.py, and here is my question.
In cifar10.py, the tutorial says:
EXERCISE: The output of inference are un-normalized logits. Try editing the network architecture to return normalized predictions using tf.nn.softmax().
So I fed the output of "local4" directly into tf.nn.softmax(). This gives me the scaled logits, meaning they sum to 1.
But in the loss function, the cifar10.py code uses:
tf.nn.sparse_softmax_cross_entropy_with_logits()
and the description of this function says:
WARNING: This op expects unscaled logits, since it performs a softmax on logits internally for efficiency. Do not call this op with the output of softmax, as it will produce incorrect results.
Also, according to the description, the logits passed to this function must have shape [batch_size, num_classes], which means they should be the unscaled (pre-softmax) values, as in the sample code below that computes the unnormalized logits.
# softmax, i.e. softmax(WX + b)
with tf.variable_scope('softmax_linear') as scope:
  weights = _variable_with_weight_decay('weights', [192, NUM_CLASSES],
                                        stddev=1/192.0, wd=0.0)
  biases = _variable_on_cpu('biases', [NUM_CLASSES],
                            tf.constant_initializer(0.0))
  softmax_linear = tf.add(tf.matmul(local4, weights), biases, name=scope.name)
  _activation_summary(softmax_linear)
Does this mean I don't have to use tf.nn.softmax in the code?

You can use tf.nn.softmax in the code if you want, but then you will have to compute the loss yourself (here assuming one-hot labels):
softmax_logits = tf.nn.softmax(logits)
loss = tf.reduce_mean(-tf.reduce_sum(labels * tf.log(softmax_logits), axis=1))
In practice, you don't use tf.nn.softmax when computing the loss. However, you do need tf.nn.softmax if, for instance, you want to compute the predictions of your model and compare them to the true labels (to compute the accuracy).
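For example, a minimal sketch of that evaluation step (assuming logits of shape [batch_size, num_classes] and integer labels of shape [batch_size], as in cifar10.py):
probabilities = tf.nn.softmax(logits)           # normalized predictions
predicted_class = tf.argmax(probabilities, 1)   # most likely class per example
correct = tf.equal(predicted_class, tf.cast(labels, tf.int64))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
Since softmax is monotonic, tf.argmax on the raw logits yields the same predicted classes; the softmax is only needed when you want the actual probabilities.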

Related

Custom loss function with regularization cost added in TensorFlow

I wrote a custom loss function that adds the regularization loss to the total loss. I added an L2 regularizer to the kernels only, but when I called model.fit() a warning appeared stating that gradients do not exist for the biases, and the biases were not updated. Also, if I remove the regularizer from the kernel of one of the layers, the gradient for that kernel does not exist either.
I tried adding a bias regularizer to each layer and everything worked correctly, but I don't want to regularize the biases, so what should I do?
Here is my loss function:
def _loss_function(y_true, y_pred):
    # convert tensors to numpy arrays
    y_true_n = y_true.numpy()
    y_pred_n = y_pred.numpy()
    # modify probabilities for the Knowledge Distillation loss
    # we do this for old tasks only
    old_y_true = np.float_power(y_true_n[:, :-1], 0.5)
    old_y_true = old_y_true / np.sum(old_y_true)
    old_y_pred = np.float_power(y_pred_n[:, :-1], 0.5)
    old_y_pred = old_y_pred / np.sum(old_y_pred)
    # define the loss that we will use for new and old tasks
    bce = tf.keras.losses.BinaryCrossentropy()
    # compute the loss on old tasks
    old_loss = bce(old_y_true, old_y_pred)
    # compute the loss on the new task
    new_loss = bce(y_true_n[:, -1], y_pred_n[:, -1])
    # compute the regularization loss
    reg_loss = tf.compat.v1.losses.get_regularization_loss()
    assert reg_loss is not None
    # convert all tensors to float64
    old_loss = tf.cast(old_loss, dtype=tf.float64)
    new_loss = tf.cast(new_loss, dtype=tf.float64)
    reg_loss = tf.cast(reg_loss, dtype=tf.float64)
    return old_loss + new_loss + reg_loss
In Keras, a loss function should return the loss value without the regularization losses. The regularization losses are added automatically when you set kernel_regularizer or bias_regularizer on the Keras layers.
In other words, when you write your custom loss function, you don't have to take care of the regularization losses.
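For instance, a minimal sketch (the layer and loss below are illustrative, not taken from the question):
import tensorflow as tf

# the L2 penalty is declared on the layer itself; Keras adds it to the total loss automatically
dense = tf.keras.layers.Dense(64, activation='relu',
                              kernel_regularizer=tf.keras.regularizers.l2(1e-4))

# the custom loss therefore only returns the data term
def my_loss(y_true, y_pred):
    return tf.keras.losses.binary_crossentropy(y_true, y_pred)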
Edit: the reason you got warning messages that gradients do not exist is the use of numpy() in your loss function; numpy() stops any gradient propagation.
That the warning messages disappeared after you added regularizers to the layers does not imply that the gradients were then computed correctly; they would only include the gradients from the regularizers, not from the data. numpy() should be removed from the loss function in order to get the correct gradients.
One solution is to keep everything as tensors and use the tf.math library, e.g. tf.pow to replace np.float_power and tf.reduce_sum to replace np.sum.
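A sketch of the loss function rewritten with tensor ops only (same structure as the question's code; the regularization term is dropped because, as noted above, Keras adds it automatically):
def _loss_function(y_true, y_pred):
    # keep everything as tensors so gradients can propagate
    old_y_true = tf.pow(y_true[:, :-1], 0.5)
    old_y_true = old_y_true / tf.reduce_sum(old_y_true)
    old_y_pred = tf.pow(y_pred[:, :-1], 0.5)
    old_y_pred = old_y_pred / tf.reduce_sum(old_y_pred)
    # binary cross-entropy for the old tasks and the new task
    bce = tf.keras.losses.BinaryCrossentropy()
    old_loss = bce(old_y_true, old_y_pred)
    new_loss = bce(y_true[:, -1], y_pred[:, -1])
    return old_loss + new_loss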

Why does the loss of vgg16 equal nan, but performs normally when adding an extra Softmax layer?

I am coding a VGG16 net with the TensorFlow low-level API. The model is tested on the imagenet12 dataset. Due to the computation cost, I split the validation set into 80% training data and 20% test data.
First, the last layer fc8 outputs raw values without a softmax activation, and I use tf.nn.softmax_cross_entropy_with_logits_v2(labels, logits) to compute the loss. It eventually outputs nan during training.
Then I tried adding a softmax layer after fc8 while still using tf.nn.softmax_cross_entropy_with_logits_v2(labels, logits) to compute the loss. Surprisingly, the loss comes out normal rather than nan.
Here is the code before adding the softmax layer:
def vgg16():
    ...
    fc8_layer = FullConnectedLayer(y, self.weight_dict, regularizer_fc=self.regularizer_fc)
    self.op_logits = fc8_layer.layer_output

def loss(self):
    entropy = tf.nn.softmax_cross_entropy_with_logits_v2(labels=self.Y, logits=self.op_logits)
    l2_loss = tf.losses.get_regularization_loss()
    self.op_loss = tf.reduce_mean(entropy, name='loss') + l2_loss
and I changed the vgg16 output like this:
def vgg16():
    ...
    fc8_layer = FullConnectedLayer(y, self.weight_dict, regularizer_fc=self.regularizer_fc)
    self.op_logits = tf.nn.softmax(fc8_layer.layer_output)
Besides, here is my optimizer:
def optimize(self):
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
            self.opt = tf.train.MomentumOptimizer(learning_rate=self.config.learning_rate,
                                                  momentum=0.9, use_nesterov=True)
            self.op_opt = self.opt.minimize(self.op_loss)
I don't understand why adding a softmax layer works. In my understanding, two softmax layers won't affect the final loss, since it doesn't change the proportions among the output units.

Keras dense layer outputs are 'nan'

I'm using Keras to build an RNN model with CTC loss.
I found that when a tensor is passed to a Dense layer with activation=None, the outputs of this layer are all nan.
But when I set activation='softmax', the outputs are normal, not nan.
Problem code (elements of logits are all nan):
logits = Dense(out_shape, activation=None, name="logits")(x_permute)  # x_permute is a tensor with shape (?, 1876, 96)
loss_ctc = Lambda(ctc_lambda_func, name='ctc_my')([logits, labels, x_len, lab_len])
model = Model(inputs=[x, labels, x_len, lab_len], outputs=[loss_ctc])
model.compile(loss={'ctc_my': lambda y_true, y_pred: y_pred}, optimizer='adadelta')
Normal code (elements of logits are not nan):
logits = Dense(out_shape, activation=None, name="logits")(x_permute)  # x_permute is a tensor with shape (?, 1876, 96)
output = Activation(activation="softmax", name="softmax")(logits)
loss_ctc = Lambda(ctc_lambda_func, name='ctc_my')([output, labels, x_len, lab_len])
model = Model(inputs=[x, labels, x_len, lab_len], outputs=[loss_ctc])
model.compile(loss={'ctc_my': lambda y_true, y_pred: y_pred}, optimizer='adadelta')

def ctc_lambda_func(args):
    y_pred, y_true, input_length, label_length = args
    return ctc_batch_cost(y_true, y_pred, input_length, label_length)
Can anyone help? Many thanks.
I may be misunderstanding you, but why would you want activation=None?
Maybe what you want is the linear activation?
Have a look at the Keras Activation Functions.
as per Klemen Grm
your neural network is completely linear. You might consider different activation functions (e.g. tanh, sigmoid, linear) for your hidden and output layers. This both lets you constrain the output range, and will probably improve the learning properties of your network.
In addition to what Klemen says, for the last layer you want a softmax, which normalizes the outputs into probabilities.
Neural networks have to implement complex mapping functions, so they need non-linear activation functions to bring in the non-linearity that enables them to approximate any function. A neuron without an activation function is equivalent to a neuron with a linear activation function.
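As an illustration, a minimal Keras sketch (layer sizes and names are made up): activation=None gives a purely linear layer, whereas a softmax on the last layer normalizes the outputs into probabilities.
from tensorflow.keras.layers import Dense, Input

inputs = Input(shape=(96,))
hidden = Dense(128, activation='tanh')(inputs)     # non-linear hidden layer
outputs = Dense(10, activation='softmax')(hidden)  # probabilities summing to 1
# Dense(10, activation=None) would be equivalent to Dense(10, activation='linear')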

What is the right way to Computing correct predictions in Tensorflow?

I'm feeding a simple ConvNet in TensorFlow using a tfrecords file containing grayscale images as inputs and integer class labels.
My loss is defined as loss = tf.nn.sparse_softmax_cross_entropy_with_logits(y_conv, label_batch)
where y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2
and label_batch is a tensor of size [batch_size].
I'm trying to compute the accuracy by using
correct_prediction = tf.equal(tf.argmax(label_batch,1),tf.argmax(y_conv, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
This correct_prediction statement is giving an error:
InvalidArgumentError (see above for traceback): Minimum tensor rank: 2 but got: 1
I'm a bit confused as to how exactly one computes correct predictions in TF.
You probably want to use 0 for the dimension argument to tf.argmax since label_batch and y_conv are vectors. Using dimension=1 implies a tensor rank of at least 2. See the documentation for the dimension parameter of argmax here.
I hope that helps!
For your y_conv you do everything right: it is a matrix of shape (batch_size, n_classes) where, for each sample and each class, you have a score for how likely that class is. So to get the actual predicted class you need to call argmax.
However, your labels are integers with shape (batch_size,): because the class of an image is known, there's no reason to supply n_classes probabilities when a single integer holds the actual class just as well. So you don't need to call argmax on it to convert probabilities to a class, because it already is the class. To fix it, just do
correct_prediction = tf.equal(label_batch, tf.argmax(y_conv, 1))
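One detail worth noting: tf.argmax returns an int64 tensor, so depending on the dtype of label_batch a cast may be needed (a minimal sketch, assuming int32 labels as in the question):
correct_prediction = tf.equal(tf.cast(label_batch, tf.int64), tf.argmax(y_conv, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))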

Using Tensorflow's Connectionist Temporal Classification (CTC) implementation

I'm trying to use TensorFlow's CTC implementation under the contrib package (tf.contrib.ctc.ctc_loss) without success.
First of all, does anyone know where I can read a good step-by-step tutorial? TensorFlow's documentation is very poor on this topic.
Do I have to provide ctc_loss with the labels with the blank label interleaved or not?
I have not been able to overfit my network, even using a training dataset of length 1 over 200 epochs. :(
How can I calculate the label error rate using tf.edit_distance?
Here is my code:
with graph.as_default():
    max_length = X_train.shape[1]
    frame_size = X_train.shape[2]
    max_target_length = y_train.shape[1]

    # Batch size x time steps x data width
    data = tf.placeholder(tf.float32, [None, max_length, frame_size])
    data_length = tf.placeholder(tf.int32, [None])

    # Batch size x max_target_length
    target_dense = tf.placeholder(tf.int32, [None, max_target_length])
    target_length = tf.placeholder(tf.int32, [None])

    # Generating sparse tensor representation of target
    target = ctc_label_dense_to_sparse(target_dense, target_length)

    # Applying LSTM, returning output for each timestep (y_rnn1,
    # [batch_size, max_time, cell.output_size]) and the final state of shape
    # [batch_size, cell.state_size]
    y_rnn1, h_rnn1 = tf.nn.dynamic_rnn(
        tf.nn.rnn_cell.LSTMCell(num_hidden, state_is_tuple=True, num_proj=num_classes),
        data,
        dtype=tf.float32,
        sequence_length=data_length,
    )

    # For sequence labelling, we want a prediction for each timestep.
    # However, we share the weights for the softmax layer across all timesteps.
    # How do we do that? By flattening the first two dimensions of the output tensor.
    # This way time steps look the same as examples in the batch to the weight matrix.
    # Afterwards, we reshape back to the desired shape

    # Reshaping
    logits = tf.transpose(y_rnn1, perm=(1, 0, 2))

    # Get the loss by calculating ctc_loss
    # Also calculates the gradient. This op performs the softmax operation for you,
    # so inputs should be e.g. linear projections of outputs by an LSTM.
    loss = tf.reduce_mean(tf.contrib.ctc.ctc_loss(logits, target, data_length))

    # Define our optimizer with learning rate
    optimizer = tf.train.RMSPropOptimizer(learning_rate).minimize(loss)

    # Decoding using beam search
    decoded, log_probabilities = tf.contrib.ctc.ctc_beam_search_decoder(logits, data_length, beam_width=10, top_paths=1)
Thanks!
Update (06/29/2016)
Thank you, @jihyeon-seo! So, at the input of the RNN we have something like [num_batch, max_time_step, num_features]. We use dynamic_rnn to perform the recurrent calculation given the input, outputting a tensor of shape [num_batch, max_time_step, num_hidden]. After that, we need to do an affine projection in each timestep with weight sharing, so we have to reshape to [num_batch*max_time_step, num_hidden], multiply by a weight matrix of shape [num_hidden, num_classes], add a bias, undo the reshape, and transpose (so we will have [max_time_steps, num_batch, num_classes] as the ctc_loss input), and this result will be the input of the ctc_loss function. Did I do everything correctly?
This is the code:
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)
h_rnn1, self.last_state = tf.nn.dynamic_rnn(cell, self.input_data, self.sequence_length, dtype=tf.float32)
# Reshaping to share weights across timesteps
x_fc1 = tf.reshape(h_rnn1, [-1, num_hidden])
self._logits = tf.matmul(x_fc1, self._W_fc1) + self._b_fc1
# Reshaping
self._logits = tf.reshape(self._logits, [max_length, -1, num_classes])
# Calculating loss
loss = tf.contrib.ctc.ctc_loss(self._logits, self._targets, self.sequence_length)
self.cost = tf.reduce_mean(loss)
Update (07/11/2016)
Thank you, @Xiv. Here is the code after the bug fix:
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)
h_rnn1, self.last_state = tf.nn.dynamic_rnn(cell, self.input_data, self.sequence_length, dtype=tf.float32)
# Reshaping to share weights across timesteps
x_fc1 = tf.reshape(h_rnn1, [-1, num_hidden])
self._logits = tf.matmul(x_fc1, self._W_fc1) + self._b_fc1
# Reshaping
self._logits = tf.reshape(self._logits, [-1, max_length, num_classes])
self._logits = tf.transpose(self._logits, (1,0,2))
# Calculating loss
loss = tf.contrib.ctc.ctc_loss(self._logits, self._targets, self.sequence_length)
self.cost = tf.reduce_mean(loss)
Update (07/25/16)
I published part of my code on GitHub, working with one utterance. Feel free to use it! :)
I'm trying to do the same thing.
Here's what I found that you may be interested in.
It was really hard to find a tutorial for CTC, but this example was helpful.
And for the blank label, the CTC layer assumes that the blank index is num_classes - 1, so you need to provide an additional class for the blank label.
Also, the CTC loss performs the softmax itself. In your code, the RNN layer is connected directly to the CTC loss layer. The output of the RNN layer is internally activated, so you need to add one more hidden layer (it could be the output layer) without an activation function, and then add the CTC loss layer.
See here for an example with bidirectional LSTM, CTC, and edit distance implementations, training a phoneme recognition model on the TIMIT corpus. If you train on that corpus's training set, you should be able to get phoneme error rates down to 20-25% after 120 epochs or so.
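As for the label error rate question, a common sketch uses tf.edit_distance on the decoder output (assuming the decoded and target tensors from the question's graph; decoded[0] is a SparseTensor with int64 values, so it is cast to match the int32 targets):
# mean normalized edit distance between decoded sequences and ground-truth labels
label_error_rate = tf.reduce_mean(
    tf.edit_distance(tf.cast(decoded[0], tf.int32), target))
With normalize=True (the default), tf.edit_distance divides each distance by the target length, so the mean can be read as a per-sequence label error rate.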