I started to play with TensorFlow two days ago and I'm wondering if there is the triplet and the contrastive losses implemented.
I've been looking at the documentation, but I haven't found any example or description about these things.
Update (2018/03/19): I wrote a blog post detailing how to implement triplet loss in TensorFlow.
You need to implement yourself the contrastive loss or the triplet loss, but once you know the pairs or triplets this is quite easy.
Contrastive Loss
Suppose you have as input the pairs of data and their label (positive or negative, i.e. same class or different class). For instance you have images as input of size 28x28x1:
left = tf.placeholder(tf.float32, [None, 28, 28, 1])
right = tf.placeholder(tf.float32, [None, 28, 28, 1])
label = tf.placeholder(tf.int32, [None, 1]). # 0 if same, 1 if different
margin = 0.2
left_output = model(left) # shape [None, 128]
right_output = model(right) # shape [None, 128]
d = tf.reduce_sum(tf.square(left_output - right_output), 1)
d_sqrt = tf.sqrt(d)
loss = label * tf.square(tf.maximum(0., margin - d_sqrt)) + (1 - label) * d
loss = 0.5 * tf.reduce_mean(loss)
Triplet Loss
Same as with contrastive loss, but with triplets (anchor, positive, negative). You don't need labels here.
anchor_output = ... # shape [None, 128]
positive_output = ... # shape [None, 128]
negative_output = ... # shape [None, 128]
d_pos = tf.reduce_sum(tf.square(anchor_output - positive_output), 1)
d_neg = tf.reduce_sum(tf.square(anchor_output - negative_output), 1)
loss = tf.maximum(0., margin + d_pos - d_neg)
loss = tf.reduce_mean(loss)
The real trouble when implementing triplet loss or contrastive loss in TensorFlow is how to sample the triplets or pairs. I will focus on generating triplets because it is harder than generating pairs.
The easiest way is to generate them outside of the Tensorflow graph, i.e. in python and feed them to the network through the placeholders. Basically you select images 3 at a time, with the first two from the same class and the third from another class. We then perform a feedforward on these triplets, and compute the triplet loss.
The issue here is that generating triplets is complicated. We want them to be valid triplets, triplets with a positive loss (otherwise the loss is 0 and the network doesn't learn).
To know whether a triplet is good or not you need to compute its loss, so you already make one feedforward through the network...
Clearly, implementing triplet loss in Tensorflow is hard, and there are ways to make it more efficient than sampling in python but explaining them would require a whole blog post !
Triplet loss with semihard negative mining is now implemented in tf.contrib, as follows:
triplet_semihard_loss(
labels,
embeddings,
margin=1.0
)
where:
Args:
labels: 1-D tf.int32 Tensor with shape [batch_size] of multiclass
integer labels.
embeddings: 2-D float Tensor of embedding vectors.Embeddings should
be l2 normalized.
margin: Float, margin term in theloss definition.
Returns:
triplet_loss: tf.float32 scalar.
For further information, check the link bellow:
https://www.tensorflow.org/versions/master/api_docs/python/tf/contrib/losses/metric_learning/triplet_semihard_loss
Tiago, I don't think you are using the same formula Olivier gave.
Here is the right code (not sure it will work though, just fixing the formula) :
def compute_euclidean_distance(x, y):
"""
Computes the euclidean distance between two tensorflow variables
"""
d = tf.reduce_sum(tf.square(tf.sub(x, y)),1)
return d
def compute_contrastive_loss(left_feature, right_feature, label, margin):
"""
Compute the contrastive loss as in
L = 0.5 * Y * D^2 + 0.5 * (Y-1) * {max(0, margin - D)}^2
**Parameters**
left_feature: First element of the pair
right_feature: Second element of the pair
label: Label of the pair (0 or 1)
margin: Contrastive margin
**Returns**
Return the loss operation
"""
label = tf.to_float(label)
one = tf.constant(1.0)
d = compute_euclidean_distance(left_feature, right_feature)
d_sqrt = tf.sqrt(compute_euclidean_distance(left_feature, right_feature))
first_part = tf.mul(one-label, d)# (Y-1)*(d)
max_part = tf.square(tf.maximum(margin-d_sqrt, 0))
second_part = tf.mul(label, max_part) # (Y) * max(margin - d, 0)
loss = 0.5 * tf.reduce_mean(first_part + second_part)
return loss
Related
I am trying to learn a latent representation for text sequence (multiple features (3)) by doing reconstruction USING AUTOENCODER. As some of the sequences are shorter than the maximum pad length or a number of time steps I am considering (seq_length=15), I am not sure if reconstruction will learn to ignore the timesteps or not for calculating loss or accuracies.
I followed suggestions from this answer to crop the outputs but my losses are nan and several of accuracies as well.
input1 = keras.Input(shape=(seq_length,),name='input_1')
input2 = keras.Input(shape=(seq_length,),name='input_2')
input3 = keras.Input(shape=(seq_length,),name='input_3')
input1_emb = layers.Embedding(70,32,input_length=seq_length,mask_zero=True)(input1)
input2_emb = layers.Embedding(462,192,input_length=seq_length,mask_zero=True)(input2)
input3_emb = layers.Embedding(84,36,input_length=seq_length,mask_zero=True)(input3)
merged = layers.Concatenate()([input1_emb, input2_emb,input3_emb])
activ_func = 'tanh'
encoded = layers.LSTM(120,activation=activ_func,input_shape=(seq_length,),return_sequences=True)(merged) #
encoded = layers.LSTM(60,activation=activ_func,return_sequences=True)(encoded)
encoded = layers.LSTM(15,activation=activ_func)(encoded)
# Decoder reconstruct inputs
decoded1 = layers.RepeatVector(seq_length)(encoded)
decoded1 = layers.LSTM(60, activation= activ_func , return_sequences=True)(decoded1)
decoded1 = layers.LSTM(120, activation= activ_func , return_sequences=True,name='decoder1_last')(decoded1)
Decoder one has an output shape of (None, 15, 120).
input_copy_1 = layers.TimeDistributed(layers.Dense(70, activation='softmax'))(decoded1)
input_copy_2 = layers.TimeDistributed(layers.Dense(462, activation='softmax'))(decoded1)
input_copy_3 = layers.TimeDistributed(layers.Dense(84, activation='softmax'))(decoded1)
For each output, I am trying to crop the O padded timesteps as suggested by this answer. padding has 0 where actual input was missing (had zero due to padding) and 1 otherwise
#tf.function
def cropOutputs(x):
#x[0] is softmax of respective feature (time distributed) on top of decoder
#x[1] is the actual input feature
padding = tf.cast( tf.not_equal(x[1][1],0), dtype=tf.keras.backend.floatx())
print(padding)
return x[0]*tf.tile(tf.expand_dims(padding, axis=-1),tf.constant([1,x[0].shape[2]], tf.int32))
Applying crop function to all three outputs.
input_copy_1 = layers.Lambda(cropOutputs, name='input_copy_1', output_shape=(None, 15, 70))([input_copy_1,input1])
input_copy_2 = layers.Lambda(cropOutputs, name='input_copy_2', output_shape=(None, 15, 462))([input_copy_2,input2])
input_copy_3 = layers.Lambda(cropOutputs, name='input_copy_3', output_shape=(None, 15, 84))([input_copy_3,input3])
My logic is to crop timesteps of each feature (all 3 features for sequence have the same length, meaning they miss timesteps together). But for timestep, they have been applied softmax as per their feature size (70,462,84) so I have to zero out timestep by making a multi-dimensional mask array of zeros or ones equal to this feature size with help of mask padding, and multiply by respective softmax representation using this using multi-dimensional mask array.
I am not sure I am doing this right or not as I have Nan losses for these inputs as well as other accuracies have that I am learning jointly with this task (it happens only with this cropping thing).
If it helps someone, I end up cropping the padded entries from the loss directly (taking some keras code pointer from these answers).
#tf.function
def masked_cc_loss(y_true, y_pred):
mask = tf.keras.backend.all(tf.equal(y_true, masked_val_hotencoded), axis=-1)
mask = 1 - tf.cast(mask, tf.keras.backend.floatx())
loss = tf.keras.losses.CategoricalCrossentropy()(y_true, y_pred) * mask
return tf.keras.backend.sum(loss) / tf.keras.backend.sum(mask) # averaging by the number of unmasked entries
Is this model using one relu in two places, or are gradients computed by doing a matrix multiplication of layers on both sides of one layer?
In the last layer of this simple neural net (below) during back prop it calculates the gradient for the last layer w2 by doing a matrix multiplication of y prediction - y and h_relu, which I thought was only between layers w1 and w2 not between w2 and y_pred
The line in question is near the bottom. It is grad_w2 = h_relu.t().mm(grad_y_pred).
I am confused because I thought everything was supposed to go in order forward and go in order backwards. Is this relu being used in two places?
Here is an attempt at a visual illustration of the model.
This example is from the Pytorch website. It is the second block of code on the page.
grad_w2 = h_relu.t().mm(grad_y_pred)
import torch
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)
learning_rate = 1e-6
for t in range(500):
# Forward pass: compute predicted y
h = x.mm(w1)
h_relu = h.clamp(min=0)
y_pred = h_relu.mm(w2)
# Compute and print loss
loss = (y_pred - y).pow(2).sum().item()
if t % 100 == 99:
print(t, loss)
# Backprop to compute gradients of w1 and w2 with respect to loss
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.t().mm(grad_y_pred)
grad_h_relu = grad_y_pred.mm(w2.t())
grad_h = grad_h_relu.clone()
grad_h[h < 0] = 0
grad_w1 = x.t().mm(grad_h)
# Update weights using gradient descent
w1 -= learning_rate * grad_w1
w2 -= learning_rate * grad_w2
I appreciate your patience looking at this and trying to clear this up for me.
If you can try adding another layer of whieghts in the middle with another relu that might help me understand. This is what I was trying to do.
Consider the following diagram which represents the network in question. The concept of back-propagation is simply a way to quickly and intuitively apply the chain rule on a complex sequence of operations to compute the gradient of an output w.r.t. a tensor. Usually we are interested in computing the gradients of leaf tensors (tensors which are not derived from other tensors) with respect to a loss or objective. All the leaf tensors are represented as circles in the following diagram and the loss is represented by the rectangle with the L label.
Using the backward diagram we can follow the path from L to w1 and w2 in order to determine which partial derivatives we need in order to compute the gradient of L w.r.t. w1 and w2. For simplicity we will assume that all the leaf tensors are scalars so as to avoid getting into the complexities of multiplying vectors and matrices.
Using this approach the gradients of L w.r.t. w1 and w2 are
and
Something to notice is that since w2 is a leaf tensor, we only use dy/dw2 (aka grad_w2) during computation of dL/dw2 since it isn't part of the path from L to w1.
This question already has answers here:
Weighted cost function in tensorflow
(2 answers)
Closed 4 years ago.
I have a neural network with MSE loss function being implemented something like this:
# input x_ph is of size Nx1 and output should also be of size Nx1
def train_neural_network_batch(x_ph, predict=False):
prediction = neural_network_model(x_ph)
# MSE loss function
cost = tf.reduce_mean(tf.square(prediction - y_ph))
optimizer = tf.train.AdamOptimizer(learn_rate).minimize(cost)
# mini-batch optimization here
I'm fairly new to neural networks and Python, but I understand that each iteration, a sample of training points will be fed into the neural network and the loss function evaluated at the points in this sample. However, I would like to be able to modify the loss function so that it weights certain data more heavily. Some pseudocode of what I mean
# manually compute the MSE of the data without the first sampled element
cost = 0.0
for ii in range(1,len(y_ph)):
cost += tf.square(prediction[ii] - y_ph[ii])
cost = cost/(len(y_ph)-1.0)
# weight the first sampled data point more heavily according to some parameter W
cost += W*(prediction[0] - y_ph[0])
I might have more points I wish to weight differently as well, but for now, I'm just wondering how I can implement something like this in tensorflow. I know len(y_ph) is invalid as y_ph is just a placeholder, and I can't just do something like y_ph[i] or prediction[i].
You can do this in multiple ways:
1) If some of your data instances weighting are simply 2 times or 3 times more than normal instance, you may just copy those instance multiple times in your data set. Thus they would occupy more weight in loss, hence satisfy your intention. This is the simplest way.
2) If your weighting is more complex, say a float weighting. You can define a placeholder for weighting, multiply it to loss, and use feed_dict to feed the weighting in session together with x batch and y batch. Just make sure instance_weight is the same size with batch_size
E.g.
import tensorflow as tf
import numpy as np
with tf.variable_scope("test", reuse=tf.AUTO_REUSE):
x = tf.placeholder(tf.float32, [None,1])
y = tf.placeholder(tf.float32, [None,1])
instance_weight = tf.placeholder(tf.float32, [None,1])
w1 = tf.get_variable("w1", shape=[1, 1])
prediction = tf.matmul(x, w1)
cost = tf.square(prediction - y)
loss = tf.reduce_mean(instance_weight * cost)
opt = tf.train.AdamOptimizer(0.5).minimize(loss)
with tf.Session() as sess:
x1 = [[1.],[2.],[3.]]
y1 = [[2.],[4.],[3.]]
instance_weight1 = [[10.0], [10.0], [0.1]]
sess.run(tf.global_variables_initializer())
print (x1)
print (y1)
print (instance_weight1)
for i in range(1000):
_, loss1, prediction1 = sess.run([opt, loss, prediction], feed_dict={instance_weight : instance_weight1, x : x1, y : y1 })
if (i % 100) == 0:
print(loss1)
print(prediction1)
NOTE instance_weight1, you may change instance_weight1 to see the difference (here batch_size is set to 3)
Where x1,y1 and x2,y2 follow the rule y=2*x
Whereas x3,y3 follow the rule y=x
But with different weight as [10,10,0.1], the prediction1 coverage to y1,y2 rule and almost ignored y3, the output are as:
[[1.9823183]
[3.9646366]
[5.9469547]]
PS: in tensorflow graph, it's highly recommended not to use for loops, but use matrix operator instead to parallel the calculation.
I have a convolutional neural network with three images as inputs:
x_anchor = tf.placeholder('float', [None, 4900], name='x_anchor')
x_positive = tf.placeholder('float', [None, 4900], name='x_positive')
x_negative = tf.placeholder('float', [None, 4900], name='x_negative')
Within a train function, I feed the placeholders with the actual images:
input1, input2, input3 = training.next_batch(start,end)
....some other operations...
loss_value = sess.run([cost], feed_dict={x_anchor:input1, x_positive:input2, x_negative:input3})
I'm using a triplet loss function on these three inputs (that's actually the cost variable above):
def triplet_loss(d_pos, d_neg):
margin = 0.2
loss = tf.reduce_mean(tf.maximum(0., margin + d_pos - d_neg))
return loss
How can I filter the losses, so only the images with loss_value > 0 will be used to train the network?
How can I implement something like:
if(loss_value for input1, input2, input3 > 0)
use inputs to train network
else
do nothing/try another input
What I have tried so far:
I took the images one by one (input1[0], input2[0], input3[0]), calculated the loss, and if the loss was positive I would calculate (and apply) the gradients. But the problem is I use dropout in my model and I have to apply the model twice on my inputs:
First to calculate the loss and verify whether it's greater than 0
Second to run the optimizer: this is when things go wrong. As I mentioned before, I use dropout, so the results of the model on my inputs are different, so the new loss will sometimes be 0 even if the loss determined at step 1 is greater than 0.
I also tried to use tf.py_func but got stuck.
There's a new TensorFlow feature called “AutoGraph”. AutoGraph converts Python code, including control flow, print() and other Python-native features, into pure TensorFlow graph code. For example:
#autograph.convert()
def huber_loss(a):
if tf.abs(a) <= delta:
loss = a * a / 2
else:
loss = delta * (tf.abs(a) - delta / 2)
return loss
becomes this code at execution time due to the decorator:
def tf__huber_loss(a):
with tf.name_scope('huber_loss'):
def if_true():
with tf.name_scope('if_true'):
loss = a * a / 2
return loss,
def if_false():
with tf.name_scope('if_false'):
loss = delta * (tf.abs(a) - delta / 2)
return loss,
loss = ag__.utils.run_cond(tf.less_equal(tf.abs(a), delta), if_true,
if_false)
return loss
What you wanted to do could have been implemented before using tf.cond().
I found out about this through this medium post.
I was wondering how to penalize less represented classes more then other classes when dealing with a really imbalanced dataset (10 classes over about 20000 samples but here is th number of occurence for each class : [10868 26 4797 26 8320 26 5278 9412 4485 16172 ]).
I read about the Tensorflow function : weighted_cross_entropy_with_logits (https://www.tensorflow.org/api_docs/python/tf/nn/weighted_cross_entropy_with_logits) but I am not sure I can use it for a multi label problem.
I found a post that sum up perfectly the problem I have (Neural Network for Imbalanced Multi-Class Multi-Label Classification) and that propose an idea but it had no answers and I thought the idea might be good :)
Thank you for your ideas and answers !
First of all, there is my suggestion you can modify your cost function to use in a multi-label way. There is code which show how to use Softmax Cross Entropy in Tensorflow for multilabel image task.
With that code, you can multiple weights in each row of loss calculation. Here is the example code in case you have multi-label task: (i.e, each image can have two labels)
logits_split = tf.split( axis=1, num_or_size_splits=2, value= logits )
labels_split = tf.split( axis=1, num_or_size_splits=2, value= labels )
weights_split = tf.split( axis=1, num_or_size_splits=2, value= weights )
total = 0.0
for i in range ( len(logits_split) ):
temp = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits( logits=logits_split[i] , labels=labels_split[i] ))
total += temp * tf.reshape(weights_split[i],[-1])
I think you can just use tf.nn.weighted_cross_entropy_with_logits for multiclass classification.
For example, for 4 classes, where the ratios to the class with the largest number of members are [0.8, 0.5, 0.6, 1], You would just give it a weight vector in the following way:
cross_entropy = tf.nn.weighted_cross_entropy_with_logits(
targets=ground_truth_input, logits=logits,
pos_weight = tf.constant([0.8,0.5,0.6,1]))
So I am not entirely sure that I understand your problem given what you have written. The post you link to writes about multi-label AND multi-class, but that doesn't really make sense given what is written there either. So I will approach this as a multi-class problem where for each sample, you have a single label.
In order to penalize the classes, I implemented a weight Tensor based on the labels in the current batch. For a 3-class problem, you could eg. define the weights as the inverse frequency of the classes, such that if the proportions are [0.1, 0.7, 0.2] for class 1, 2 and 3, respectively, the weights will be [10, 1.43, 5]. Defining a weight tensor based on the current batch is then
weight_per_class = tf.constant([10, 1.43, 5]) # shape (, num_classes)
onehot_labels = tf.one_hot(labels, depth=3) # shape (batch_size, num_classes)
weights = tf.reduce_sum(
tf.multiply(onehot_labels, weight_per_class), axis=1) # shape (batch_size, num_classes)
reduction = tf.losses.Reduction.MEAN # this ensures that we get a weighted mean
loss = tf.losses.softmax_cross_entropy(
onehot_labels=onehot_labels, logits=logits, weights=weights, reduction=reduction)
Using softmax ensures that the classification problem is not 3 independent classifications.