Logits representation in TensorFlow’s sparse_softmax_cross_entropy - tensorflow

I have a question regarding the sparse_softmax_cross_entropy cost function in TensorFlow.
I want to use it in a semantic segmentation context where I use an autoencoder architecture which uses typical convolution operations to downsample images to create a feature vector. This vector is then upsampled (using conv2d_transpose and one-by-one convolutions) to create an output image.
Hence, my input consists of single-channel images with shape (1,128,128,1), where the first index represents the batch size and the last one the number of channels. The pixels of the image are currently either 0 or 1, so each pixel is mapped to a class. The output image of the autoencoder follows the same rules. Hence, I can’t use any predefined cost function other than MSE or the one mentioned above.
The network works fine with MSE, but I can’t get it working with sparse_softmax_cross_entropy. It seems that this is the correct cost function in this context, but I’m a bit confused about the representation of the logits. The official doc says that the logits should have the shape (d_0,...,d_{r-1},num_classes). I tried to ignore the num_classes part, but this causes an error which says that only the interval [0,1) is allowed. Of course, I need to specify the number of classes, which would turn the allowed interval into [0,2) because the exclusive upper bound is obviously num_classes.
Could someone please explain how to turn my output image into the required logits?
The current code for the cost function is:
self._loss_op = tf.reduce_mean((tf.nn.sparse_softmax_cross_entropy_with_logits(labels=tf.squeeze(self._target_placeholder, [3]), logits=self._model, name="Loss")))
The squeeze removes the last dimension of the label input to create a shape for the labels of [1 128 128]. This causes the following exception:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Received a label value of 1 which is outside the valid range of [0, 1).
Edit:
As requested, here's a minimal example to verify the behavior of the cost function in the context of fully convolutional nets:
Constructor snippet:
def __init__(self, img_channels=1, img_width=128, img_height=128):
    ...
    self._loss_op = None
    self._learning_rate_placeholder = tf.placeholder(tf.float32, [], 'lr')
    self._input_placeholder = tf.placeholder(tf.float32, [None, img_width, img_height, img_channels], 'x')
    self._target_placeholder = tf.placeholder(tf.float32, [None, img_width, img_height, img_channels], 'y')
    self._model = self.build_model()
    self.init_optimizer()
build_model() snippet:
def build_model(self):
    with tf.variable_scope('conv1', reuse=tf.AUTO_REUSE):
        # not necessary
        x = tf.reshape(self._input_placeholder, [-1, self._img_width, self._img_height, self._img_channels])
        conv1 = tf.layers.conv2d(x, 32, 5, activation=tf.nn.relu)
        conv1 = tf.layers.max_pooling2d(conv1, 2, 2)
    with tf.variable_scope('conv2', reuse=tf.AUTO_REUSE):
        conv2 = tf.layers.conv2d(conv1, 64, 3, activation=tf.nn.relu)
        conv2 = tf.layers.max_pooling2d(conv2, 2, 2)
    with tf.variable_scope('conv3_red', reuse=tf.AUTO_REUSE):
        conv3 = tf.layers.conv2d(conv2, 1024, 30, strides=1, activation=tf.nn.relu)
    with tf.variable_scope('conv4_red', reuse=tf.AUTO_REUSE):
        conv4 = tf.layers.conv2d(conv3, 64, 1, strides=1, activation=tf.nn.relu)
    with tf.variable_scope('conv5_up', reuse=tf.AUTO_REUSE):
        conv5 = tf.layers.conv2d_transpose(conv4, 32, (128, 128), strides=1, activation=tf.nn.relu)
    with tf.variable_scope('conv6_1x1', reuse=tf.AUTO_REUSE):
        conv6 = tf.layers.conv2d(conv5, 1, 1, strides=1, activation=tf.nn.relu)
    return conv6
init_optimizer() snippet:
def init_optimizer(self):
    self._loss_op = tf.reduce_mean((tf.nn.sparse_softmax_cross_entropy_with_logits(labels=tf.squeeze(self._target_placeholder, [3]), logits=self._model, name="Loss")))
    optimizer = tf.train.AdamOptimizer(learning_rate=self._learning_rate_placeholder)
    self._train_op = optimizer.minimize(self._loss_op)

By definition, a logit is an unnormalized log-probability (strictly speaking, log-odds), or simply put, any real number. A sequence of logits of length num_classes can be interpreted as an unnormalized probability distribution. For example, in your case num_classes=2, so logits=[125.0, -10.0] is an unnormalized distribution for one pixel (which clearly favors class 0 over class 1). This array can be squashed to a valid distribution by a softmax, and this is what tf.nn.sparse_softmax_cross_entropy_with_logits does internally. For [125.0, -10.0] the squashed distribution will be very close to [1.0, 0.0].
Once again, this array of shape [2] describes a single pixel.
If you want to compute the cross-entropy over the entire image, the network has to output the binary distribution for all pixels and all images in a batch, i.e. output a [batch_size, 128, 128, 2] tensor. In your model the last layer has a single filter, so the logits have shape [batch_size, 128, 128, 1]; TensorFlow then infers num_classes=1, which is why the label value 1 is rejected as outside [0, 1). The term sparse in the name of the loss refers to the fact that the labels are not one-hot encoded. This is most useful when the number of classes is large, i.e. when one-hot encoding becomes too inefficient in terms of memory, but in your case the difference is insignificant. If you decide to use tf.nn.sparse_softmax_cross_entropy_with_logits, the labels must have shape [batch_size, 128, 128], must be tf.int32 or tf.int64, and must contain valid class indices, zero or one. That's it: TensorFlow can compute the cross-entropy between these two arrays.
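To make the shapes concrete, here is a minimal pure-Python sketch of what a sparse softmax cross-entropy computes per pixel. It is an illustration only, not TensorFlow's implementation; the helper names softmax and sparse_softmax_cross_entropy and the tiny 2x2 "image" are assumptions made up for this example.

```python
import math

def softmax(logits):
    """Squash a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_softmax_cross_entropy(logits, label):
    """Cross-entropy for one pixel; `label` is a class index, not one-hot."""
    num_classes = len(logits)  # inferred from the last dimension of the logits
    if not 0 <= label < num_classes:
        raise ValueError(
            "label %d is outside the valid range [0, %d)" % (label, num_classes))
    return -math.log(softmax(logits)[label])

# Two classes: logits for one pixel that strongly favor class 0.
probs = softmax([125.0, -10.0])
loss_correct = sparse_softmax_cross_entropy([125.0, -10.0], 0)

# A tiny 2x2 "image": logits have shape [2, 2, 2], labels have shape [2, 2].
logits_img = [[[2.0, -1.0], [0.5, 0.5]],
              [[-3.0, 3.0], [0.0, 4.0]]]
labels_img = [[0, 1],
              [1, 1]]
pixel_losses = [sparse_softmax_cross_entropy(logits_img[r][c], labels_img[r][c])
                for r in range(2) for c in range(2)]
mean_loss = sum(pixel_losses) / len(pixel_losses)

# With a single output channel, num_classes is 1, so label 1 is rejected.
try:
    sparse_softmax_cross_entropy([0.7], 1)
    single_channel_ok = True
except ValueError:
    single_channel_ok = False
```

The last few lines reproduce the situation from the question in miniature: when the final dimension of the logits has size 1, the only admissible label is 0.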

Related

Extracting gradient of Keras Embedding layer

I want to extract the gradient of an RNN model starting with an embedding layer using TensorFlow's GradientTape (using TensorFlow 1.14 with eager execution). The model is a simple LSTM binary classifier, which is trained with a binary cross-entropy loss:
inputs = Input(name='inputs', shape=[150])
layer = Embedding(2000, 50, input_length=150)(inputs)
layer = LSTM(64)(layer)
layer = Dense(256, name='FC1')(layer)
layer = Activation('relu')(layer)
layer = Dropout(0.5)(layer)
layer = Dense(1, name='out_layer')(layer)
layer = Activation('sigmoid')(layer)
model = Model(inputs=inputs, outputs=layer)
GradientTape should return "... a list or nested structure of Tensors (or IndexedSlices, or None, or CompositeTensor), one for each element in sources". What is the correct way to use it to recover (and apply) the gradient?
I tried the following code:
with tf.GradientTape() as tape:
    y_ = model(inputs)
    loss_value = BinaryCrossEntropy()(y_true=targets, y_pred=y_)
grads = tape.gradient(loss_value, model.trainable_variables)
# some custom processing
optimizer = RMSprop(learning_rate=0.001, name="context")
optimizer.apply_gradients(list(zip(grads, model.trainable_variables)), name="context")
I would expect the returned gradient to be of size (2000,50), i.e., the shape of the weights of the embedding layer. Instead, its size depends on the batch size, and it cannot be used (at least with the code above) with apply_gradients. Changing the number of inputs consistently changes the first dimension of the gradient to batch_size * 150, while the shape of the trainable variables stays correct. If using 8 inputs, for example, I get the following result:
input shape: (8, 150), output shape: (8, 1)
model.trainable_variables shapes: (2000, 50),(50, 256),(64, 256),(256,),(64, 256),(256,),(256, 1),(1,)
tape.gradient shapes: (1200, 50),(50, 256),(64, 256),(256,),(64, 256),(256,),(256, 1),(1,)
With a batch size of 32, the first component would be (4800, 50), and so on. This doesn't match my understanding of GradientTape.gradient, since the returned gradient doesn't have the same size as the sources parameter. What did I miss?
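One plausible reading of these shapes, assuming the embedding gradient comes back as a tf.IndexedSlices (one of the return types listed in the quoted documentation): it holds one gradient row per looked-up token, i.e. batch_size * 150 rows, plus the index of the embedding row each one belongs to. The pure-Python sketch below shows how such a sparse gradient maps back to the dense shape of the weights by scatter-adding rows. The densify helper and the toy sizes are hypothetical; in TensorFlow itself, tf.convert_to_tensor(grad) should perform the equivalent conversion, but treat that as an assumption to verify for your version.

```python
def densify(indices, values, num_rows):
    """Scatter-add sparse gradient rows into a dense [num_rows, dim] gradient."""
    dim = len(values[0])
    dense = [[0.0] * dim for _ in range(num_rows)]
    for row, grad in zip(indices, values):
        for j in range(dim):
            dense[row][j] += grad[j]  # repeated tokens accumulate
    return dense

# Toy sizes standing in for vocab=2000, dim=50, batch=8, length=150.
vocab_size, embed_dim = 6, 2
batch_tokens = [[1, 4, 1], [0, 4, 5]]  # shape (batch=2, length=3)

# One gradient row per looked-up token: batch * length rows in total.
indices = [tok for seq in batch_tokens for tok in seq]
values = [[1.0, 0.5] for _ in indices]  # shape (6, 2), made-up numbers

# The dense gradient has the same shape as the weights: (vocab_size, embed_dim).
dense_grad = densify(indices, values, vocab_size)
```

Tokens that appear several times in the batch contribute several rows, which is why the first dimension grows with the batch size while the dense result does not.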

Image Segmentation Tensorflow tutorials

In this TensorFlow tutorial, the U-Net model is divided into two parts. The first is the contracting path, for which they use MobileNet, and it is not trainable. In the second part, I'm not able to understand which layers are being trained. As far as I can see, only the last layer, Conv2DTranspose, seems trainable. Am I right?
And if I am, how can a single layer do such a complex task as segmentation?
Tutorial link: https://www.tensorflow.org/tutorials/images/segmentation
The code for the Image Segmentation Model, from the Tutorial is shown below:
def unet_model(output_channels):
    inputs = tf.keras.layers.Input(shape=[128, 128, 3])
    x = inputs
    # Downsampling through the model
    skips = down_stack(x)
    x = skips[-1]
    skips = reversed(skips[:-1])
    # Upsampling and establishing the skip connections
    for up, skip in zip(up_stack, skips):
        x = up(x)
        concat = tf.keras.layers.Concatenate()
        x = concat([x, skip])
    # This is the last layer of the model
    last = tf.keras.layers.Conv2DTranspose(
        output_channels, 3, strides=2,
        padding='same')  # 64x64 -> 128x128
    x = last(x)
    return tf.keras.Model(inputs=inputs, outputs=x)
The first part of the model, the downsampling path, uses not the entire MobileNet architecture but only the layers
'block_1_expand_relu', # 64x64
'block_3_expand_relu', # 32x32
'block_6_expand_relu', # 16x16
'block_13_expand_relu', # 8x8
'block_16_project'
of the pre-trained MobileNet model, which are non-trainable.
The second part of the model (the part you are asking about) is the upsampling path that comes before the final Conv2DTranspose layer. It is defined by the list
up_stack = [
    pix2pix.upsample(512, 3),  # 4x4 -> 8x8
    pix2pix.upsample(256, 3),  # 8x8 -> 16x16
    pix2pix.upsample(128, 3),  # 16x16 -> 32x32
    pix2pix.upsample(64, 3),   # 32x32 -> 64x64
]
This means it calls a function named upsample from the pix2pix module. The code for the pix2pix module is available on GitHub.
Code for the function, upsample is shown below:
def upsample(filters, size, norm_type='batchnorm', apply_dropout=False):
    """Upsamples an input.
    Conv2DTranspose => Batchnorm => Dropout => Relu
    Args:
      filters: number of filters
      size: filter size
      norm_type: Normalization type; either 'batchnorm' or 'instancenorm'.
      apply_dropout: If True, adds the dropout layer
    Returns:
      Upsample Sequential Model
    """
    initializer = tf.random_normal_initializer(0., 0.02)
    result = tf.keras.Sequential()
    result.add(
        tf.keras.layers.Conv2DTranspose(filters, size, strides=2,
                                        padding='same',
                                        kernel_initializer=initializer,
                                        use_bias=False))
    if norm_type.lower() == 'batchnorm':
        result.add(tf.keras.layers.BatchNormalization())
    elif norm_type.lower() == 'instancenorm':
        result.add(InstanceNormalization())
    if apply_dropout:
        result.add(tf.keras.layers.Dropout(0.5))
    result.add(tf.keras.layers.ReLU())
    return result
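This means that the second part of the model consists of the upsampling layers, whose functionality is defined above, with 512, 256, 128 and 64 filters respectively. Each of these blocks contains its own Conv2DTranspose and normalization layers with trainable weights, so the final Conv2DTranspose is not the only layer being trained.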
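As a quick check of the size comments in up_stack, the sketch below traces the spatial size through the four upsampling blocks and the final layer. It assumes the usual rule that a transposed convolution with padding='same' multiplies the spatial size by its stride, and takes the 4x4 starting size from the comments in the list. The helper name transposed_conv_same_size is hypothetical.

```python
def transposed_conv_same_size(input_size, stride):
    """Spatial output size of a transposed convolution with padding='same'."""
    return input_size * stride

up_filters = [512, 256, 128, 64]  # filter counts of the four upsample blocks
size = 4                          # spatial size entering the first block

trace = []
for filters in up_filters:
    size = transposed_conv_same_size(size, stride=2)
    trace.append((size, filters))  # (spatial size, filters) after each block

# The last Conv2DTranspose (strides=2) restores the input resolution.
final_size = transposed_conv_same_size(size, stride=2)

for spatial, filters in trace:
    print("%dx%d with %d filters" % (spatial, spatial, filters))
print("final output: %dx%d" % (final_size, final_size))
```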

Tensorflow avoid shape information with crop

I have another issue with TensorFlow. I am using an FCN model and need to apply a random crop to limit memory usage.
tf.random_crop(combined, size=[512, 512, 4])
Unfortunately, the new size now "sticks" to the tensor and I cannot get rid of it.
The problem this causes is that the resulting model only accepts input of size 512x512, and as far as I know there is no clean workaround.
Is there any way either to remove the shape information added by random_crop or to easily adapt the size after obtaining a trained model?
Thank you in advance.
I don't know if it will completely suit your use-case, but the size parameter of tf.random_crop() can be a tensor, so you can for instance use a placeholder as shown in the example below.
import tensorflow as tf
import numpy as np
image = tf.placeholder(tf.float64, [None, None, 4])
cropped_size = tf.placeholder(tf.int32, [2])
cropped_image = tf.random_crop(image, size=[cropped_size[0], cropped_size[1], 4])
print(cropped_image.get_shape().as_list())
# [None, None, 4]
with tf.Session() as sess:
    res = sess.run(cropped_image,
                   feed_dict={image: np.random.rand(900, 600, 4), cropped_size: [512, 512]})
    print(res.shape)
    # (512, 512, 4)
EDIT:
There may be other ways to assign the value of cropped_size without using a feed_dict, depending on how the crop dimensions are stored, e.g. using TF file readers (the values would stay unknown until read).
Otherwise, another simple hack: take advantage of tf.placeholder_with_default(default_val, shape), passing the crop dimensions, however acquired, as default_val. Because the value of tf.placeholder_with_default() isn't actually assigned until runtime (in case you want to feed this placeholder with a different value), your dimensions stay None in the graph:
import tensorflow as tf
image = tf.random_uniform((900, 600, 4)) # image tensor, acquired anyhow e.g. from tf.data
cropped_size_for_this_run = [512, 512] # crop dimensions, acquired anyhow
cropped_size = tf.placeholder_with_default(cropped_size_for_this_run, shape=[2])
cropped_image = tf.random_crop(image, size=[cropped_size[0], cropped_size[1], 4])
print(cropped_image.get_shape().as_list())
# [None, None, 4]
with tf.Session() as sess:
    # You can leave cropped_size with its default value assigned at runtime:
    res = sess.run(cropped_image)
    print(res.shape)
    # (512, 512, 4)
    # ... or you can specify a new one if you wish so:
    res = sess.run(cropped_image, feed_dict={cropped_size: [256, 256]})
    print(res.shape)
    # (256, 256, 4)
    # ... It would switch back to the default value if you don't feed one:
    res = sess.run(cropped_image)
    print(res.shape)
    # (512, 512, 4)

Parameters in tf.contrib.seq2seq.sequence_loss

I'm trying to use the tf.contrib.seq2seq.sequence_loss function in an RNN model to calculate the loss.
According to the API documentation, this function requires at least three parameters: logits, targets and weights.
sequence_loss(
    logits,
    targets,
    weights,
    average_across_timesteps=True,
    average_across_batch=True,
    softmax_loss_function=None,
    name=None
)
logits: A Tensor of shape [batch_size, sequence_length, num_decoder_symbols] and dtype float. The logits correspond to the prediction across all classes at each timestep.
targets: A Tensor of shape [batch_size, sequence_length] and dtype int. The target represents the true class at each timestep.
weights: A Tensor of shape [batch_size, sequence_length] and dtype float. weights constitutes the weighting of each prediction in the sequence. When using weights as masking, set all valid timesteps to 1 and all padded timesteps to 0, e.g. a mask returned by tf.sequence_mask.
average_across_timesteps: If set, sum the cost across the sequence dimension and divide the cost by the total label weight across timesteps.
average_across_batch: If set, sum the cost across the batch dimension and divide the returned cost by the batch size.
softmax_loss_function: Function (labels, logits) -> loss-batch to be used instead of the standard softmax (the default if this is None). Note that to avoid confusion, it is required for the function to accept named arguments.
name: Optional name for this operation, defaults to "sequence_loss".
My understanding is that logits is my prediction after applying Xw+b, so its shape should be [batch_size, sequence_length, output_size]. The targets should then be my labels, but the required shape is [batch_size, sequence_length]. I assumed my labels should have the same shape as the logits.
So how do I convert the 3D labels to 2D? Thanks in advance.
Your targets (labels) don't need to have the same shape as the logits.
If we ignore batch_size (which is not relevant to your question) for a moment, this API simply calculates the loss between two sequences as a weighted sum of the per-word losses. Suppose vocab_size is 5 and the target word is 3; the logits provide a prediction for this target as a vector such as [0.2, 0.1, 0.15, 0.4, 0.15].
To calculate the loss between target and prediction, the target does not need to be expanded to the same shape as the prediction, i.e. [0, 0, 0, 1, 0]; TensorFlow does this internally.
You may refer to the distinction between the two APIs softmax_cross_entropy_with_logits and sparse_softmax_cross_entropy_with_logits.
Your labels should be a 2D matrix of shape [batch_size, sequence_length], and your logits should be a 3D tensor of shape [batch_size, sequence_length, output_size]. Therefore you don't need to extend your labels' dimensions if your label variable already has shape [batch_size, sequence_length].
In case you do want to extend the dimension, you can do it like this: expanded_variable = tf.expand_dims(the_variable_you_wanna_expand, axis=-1)
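The following pure-Python sketch illustrates the shapes involved: 3D logits, 2D integer targets and 2D weights used as a padding mask. It approximates the default behavior (both averaging flags set) as a weighted mean of per-step cross-entropies and is not the library's code; the function names and the example values are assumptions.

```python
import math

def log_softmax_at(logits, target):
    """Log-probability that softmax(logits) assigns to class `target`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return logits[target] - log_sum

def sequence_loss(logits, targets, weights):
    """Weighted mean cross-entropy over all (batch, timestep) positions."""
    total_loss = 0.0
    total_weight = 0.0
    for seq_logits, seq_targets, seq_weights in zip(logits, targets, weights):
        for step_logits, target, weight in zip(seq_logits, seq_targets, seq_weights):
            total_loss += -log_softmax_at(step_logits, target) * weight
            total_weight += weight
    return total_loss / (total_weight + 1e-12)  # guard against an all-zero mask

# batch_size=1, sequence_length=2, vocab_size=5
logits = [[[0.2, 0.1, 0.15, 0.4, 0.15], [0.0, 0.0, 0.0, 0.0, 0.0]]]
targets = [[3, 0]]        # class indices, shape [batch_size, sequence_length]
weights = [[1.0, 0.0]]    # second timestep is padding and is masked out

masked_loss = sequence_loss(logits, targets, weights)
unmasked_loss = sequence_loss(logits, targets, [[1.0, 1.0]])
```

Note that the targets are plain class indices per timestep; no one-hot expansion to the shape of the logits is needed.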
tf.contrib.seq2seq.sequence_loss is deprecated; use tfa.seq2seq.sequence_loss from TensorFlow Addons instead:
import tensorflow as tf
import tensorflow_addons as tfa
tfa.seq2seq.sequence_loss(
    logits: tfa.types.TensorLike,
    targets: tfa.types.TensorLike,
    weights: tfa.types.TensorLike,
    average_across_timesteps: bool = True,
    average_across_batch: bool = True,
    sum_over_timesteps: bool = False,
    sum_over_batch: bool = False,
    softmax_loss_function: Optional[Callable] = None,
    name: Optional[str] = None
) -> tf.Tensor
https://www.tensorflow.org/addons/api_docs/python/tfa/seq2seq/sequence_loss

Initialize a variable with placeholder as shape

I want to initialize the weights variable with a batch-size dimension, which will differ between the training and prediction stages. I tried using a placeholder for that, but it doesn't seem to work:
batchsize = tf.placeholder(tf.int32, name='batchsize', shape=[])
...
output, state = tf.nn.dynamic_rnn(multicell, X, dtype=tf.float32, initial_state=inState)
weights = tf.Variable(tf.truncated_normal([batchsize, CELL_SIZE, 1], 0.0, 1.0), name='weights')
bias = tf.Variable(tf.zeros(1), name='bias')
preds = tf.add(tf.matmul(output, weights), bias, name='preds')
loss = tf.reduce_mean(tf.squared_difference(preds, Y_))
train_step = tf.train.AdamOptimizer(LR).minimize(loss)
I can get it to work by specifying batchsize as a constant for the weights variable dimension, as opposed to a placeholder, but this way I get an error when I try to recover the session for the Prediction stage, because there the batchsize is 1. If I specify the placeholder, I get the error:
ValueError: initial_value must have a shape specified: Tensor("truncated_normal:0", shape=(?, 32, 1), dtype=float32)
Even though I do pass the value for the batchsize placeholder into the feed_dict when running this part of the graph.
If I specify the option validate_shape=False while creating the weights variable, that stage of the graph works, but later I get this error in AdamOptimizer:
ValueError: as_list() is not defined on an unknown TensorShape.
How can I get this to work? My ultimate goal is to reduce the Cell-Size dimension of the dynamic_rnn output down to 1 to predict the output at each time-step of the RNN.
Create the variable at its full size, then use tf.gather to select the part of the variable that corresponds to the current batch:
self.model_X = tf.placeholder(dtype=tf.float32, shape=[None, 100], name='X')
real_batch_size = tf.cast(tf.shape(self.model_X)[0],tf.int32)
self.y_dk = tf.get_variable(name="y_dk",initializer=tf.truncated_normal(shape=[self.num_doc, self.num_topic], mean=0, stddev=tf.truediv(1.0,self.lambda_y)), dtype=tf.float32)
batch_y_dk = tf.reshape(tf.gather(self.y_dk, self.model_batch_data_idx), [real_batch_size, self.num_topic])
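The idea can be sketched without TensorFlow: the variable keeps a fixed, fully known shape, and only the slice gathered for each batch varies in size. This is an illustration under assumptions; gather_rows is a hypothetical stand-in for tf.gather along the first axis, and the sizes and index lists are made up.

```python
def gather_rows(table, row_indices):
    """Select rows of a 2D table by index, like a gather along axis 0."""
    return [table[i] for i in row_indices]

# Full-size "variable" with a static shape [num_doc, num_topic].
num_doc, num_topic = 5, 3
y_dk = [[float(d * 10 + t) for t in range(num_topic)] for d in range(num_doc)]

# Training step: a batch of three documents.
train_batch = gather_rows(y_dk, [0, 2, 4])   # shape [3, num_topic]

# Prediction step: a batch of one document, same variable.
predict_batch = gather_rows(y_dk, [3])       # shape [1, num_topic]
```

Because the variable itself never depends on the batch size, it can be initialized and restored with a static shape, while the batch dimension only appears in the gathered result.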