I am working on this tutorial and I noticed the following in the code below: when evaluating predictions, he runs accuracy, which depends on the correct variable, which in turn depends on prediction, which will reinitialize the weights with random values and rebuild the NN model again. How can this be right? What am I missing?
def neural_network_model(data):
    hidden_1_layer = {'weights': tf.Variable(tf.random_normal([784, n_nodes_hl1])),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl1]))}

    hidden_2_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl2]))}

    hidden_3_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl2, n_nodes_hl3])),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl3]))}

    output_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl3, n_classes])),
                    'biases': tf.Variable(tf.random_normal([n_classes]))}

    l1 = tf.add(tf.matmul(data, hidden_1_layer['weights']), hidden_1_layer['biases'])
    l1 = tf.nn.relu(l1)

    l2 = tf.add(tf.matmul(l1, hidden_2_layer['weights']), hidden_2_layer['biases'])
    l2 = tf.nn.relu(l2)

    l3 = tf.add(tf.matmul(l2, hidden_3_layer['weights']), hidden_3_layer['biases'])
    l3 = tf.nn.relu(l3)

    output = tf.matmul(l3, output_layer['weights']) + output_layer['biases']
    return output
def train_neural_network(x):
    prediction = neural_network_model(x)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(prediction, y))
    optimizer = tf.train.AdamOptimizer().minimize(cost)

    hm_epochs = 10
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        for epoch in range(hm_epochs):
            epoch_loss = 0
            for _ in range(int(mnist.train.num_examples / batch_size)):
                epoch_x, epoch_y = mnist.train.next_batch(batch_size)
                _, c = sess.run([optimizer, cost], feed_dict={x: epoch_x, y: epoch_y})
                epoch_loss += c

            print('Epoch', epoch, 'completed out of', hm_epochs, 'loss:', epoch_loss)

        correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
        accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
        print('Accuracy:', accuracy.eval({x: mnist.test.images, y: mnist.test.labels}))

train_neural_network(x)
You almost got it right. The accuracy tensor indirectly depends on the prediction tensor, which in turn depends on a tensor x. Your code snippet does not include what x actually is; however, from the linked tutorial:
x = tf.placeholder('float', [None, 784])
y = tf.placeholder('float')
So x is a placeholder, i.e. a tensor that obtains its value directly from the user. What is not entirely clear from the last line,
train_neural_network(x)
is that this is not a transformation function that takes an x and processes it on the fly, as you would expect from a regular function. Rather, the function uses a reference to the previously defined placeholders - dummies, really - in order to define a computation graph, which it then executes directly using a session.
The graph, however, is only constructed once by neural_network_model(x) and then queried for a given number of epochs.
What you missed is this:
_, c = sess.run([optimizer, cost], feed_dict={x: epoch_x, y: epoch_y})
This queries the result of the optimizer operation and the cost tensor given that the input values are epoch_x for x and epoch_y for y, pulling data through all defined computation nodes, all the way back "down" to x. In order to obtain the cost, y is needed as well. Both are provided by the caller. The AdamOptimizer will update all trainable variables as part of its execution, changing the network's weights.
After that,
accuracy.eval({x: mnist.test.images, y: mnist.test.labels})
or, equivalently
sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels})
then issues another evaluation of the same graph - without changing it - but this time using the inputs mnist.test.images for x and mnist.test.labels for y.
It works because prediction itself depends on x, which is overridden to the user-provided values on each call to sess.run(...).
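If it helps, here is a minimal, self-contained sketch of that define-once, run-many-times pattern (TF 1.x style; the names are mine, not from the tutorial). The variable is created exactly once when the graph is built, and each sess.run merely feeds a new value into the placeholder:
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 1])
w = tf.Variable(tf.random_normal([1, 1]))  # created once, when the graph is defined
pred = tf.matmul(x, w)                     # plays the role of `prediction`

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Two evaluations of the same graph; w is NOT re-created or re-initialized.
    print(sess.run(pred, feed_dict={x: [[1.0]]}))
    print(sess.run(pred, feed_dict={x: [[2.0]]}))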
Here is what the graph looks like in TensorBoard. It's hard to tell, but the two placeholder nodes are on the bottom left, next to the orange node called "Variable", and in the center right, below the green "Slice_1".
Here's how the relevant part of the graph looks; I exported it from TensorBoard. It's a bit hard to read since the nodes are not manually labeled (except for a couple I labeled myself), but there are six relevant points. Placeholders are yellow: on the bottom right you'll find x, and y is on the center left.
Green are the intermediate values that make sense to us: on the left is the prediction tensor, on the right there's the tensor called correct. The blue parts are endpoints of the graph: on the top left there's the cost tensor, and on the top right you'll find accuracy. In essence, data flows from the bottom to the top.
So, whenever you say "evaluate prediction given x", "evaluate accuracy given x and y" or "optimize my network given x and y", you really just provide values on the yellow ends and observe the outcome on the green or blue ones.
I have followed the tutorial available at: https://www.tensorflow.org/quantum/tutorials/mnist. I have modified this tutorial to the simplest example I could think of: an input set in which x increases linearly from 0 to 1 and y = x < 0.3. I then use a PQC with a single Rx gate with a symbol, and a readout using a Z gate.
When retrieving the optimized symbol and adjusting it manually, I can easily find a value that provides 100% accuracy, but when I let the Adam optimizer run, it converges to either always predicting 1 or always predicting -1. Does anybody spot what I am doing wrong? (And I apologize for not being able to break the code down into a smaller example.)
import tensorflow as tf
import tensorflow_quantum as tfq

import cirq
import sympy
import numpy as np

# used to embed classical data in quantum circuits
def convert_to_circuit_cont(image):
    """Encode truncated classical image into quantum datapoint."""
    values = np.ndarray.flatten(image)
    qubits = cirq.GridQubit.rect(4, 1)
    circuit = cirq.Circuit()
    for i, value in enumerate(values):
        if value:
            circuit.append(cirq.rx(value).on(qubits[i]))
    return circuit

# define classical dataset
length = 1000
np.random.seed(42)
# create a linearly increasing set for x from 0 to 1 in 1/length steps
x_train_sorted = np.asarray([[x/length] for x in range(0, length)], dtype=np.float32)
# p is used to shuffle x and y similarly
p = np.random.permutation(len(x_train_sorted))
x_train = x_train_sorted[p]
# y = x < 0.3 in {-1, 1} for Hinge loss
y_train_sorted = np.asarray([1 if (x/length) < 0.30 else -1 for x in range(0, length)])
y_train = y_train_sorted[p]
# test == train for this example
x_test = x_train_sorted[:]
y_test = y_train_sorted[:]

# convert classical data into quantum circuits
x_train_circ = [convert_to_circuit_cont(x) for x in x_train]
x_test_circ = [convert_to_circuit_cont(x) for x in x_test]
x_train_tfcirc = tfq.convert_to_tensor(x_train_circ)
x_test_tfcirc = tfq.convert_to_tensor(x_test_circ)

# define the PQC circuit, consisting of 1 qubit with 1 gate (Rx) and 1 parameter
def create_quantum_model():
    data_qubits = cirq.GridQubit.rect(1, 1)
    circuit = cirq.Circuit()
    a = sympy.Symbol("a")
    circuit.append(cirq.rx(a).on(data_qubits[0]))
    return circuit, cirq.Z(data_qubits[0])

model_circuit, model_readout = create_quantum_model()

# Build the Keras model.
model = tf.keras.Sequential([
    # The input is the data-circuit, encoded as a tf.string
    tf.keras.layers.Input(shape=(), dtype=tf.string),
    # The PQC layer returns the expected value of the readout gate, range [-1, 1].
    tfq.layers.PQC(model_circuit, model_readout),
])

# used for logging progress during optimization
def hinge_accuracy(y_true, y_pred):
    y_true = tf.squeeze(y_true) > 0.0
    y_pred = tf.squeeze(y_pred) > 0.0
    result = tf.cast(y_true == y_pred, tf.float32)
    return tf.reduce_mean(result)

# compile the model with Hinge loss and Adam, as done in the example. Have tried various learning_rates
model.compile(
    loss=tf.keras.losses.Hinge(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
    metrics=[hinge_accuracy])

EPOCHS = 20
BATCH_SIZE = 32
NUM_EXAMPLES = 1000

# fit the model
qnn_history = model.fit(
    x_train_tfcirc, y_train,
    batch_size=32,
    epochs=EPOCHS,
    verbose=1,
    validation_data=(x_test_tfcirc, y_test),
    use_multiprocessing=False)

results = model.predict(x_test_tfcirc)
results_mapped = [-1 if x <= 0 else 1 for x in results[:, 0]]
print(np.sum(np.equal(results_mapped, y_test)))
After 20 epochs of optimization, I get the following:
1000/1000 [==============================] - 0s 410us/sample - loss: 0.5589 - hinge_accuracy: 0.6982 - val_loss: 0.5530 - val_hinge_accuracy: 0.7070
This results in 700 samples out of 1000 predicted correctly. When looking at the mapped results, this is because all results are predicted as -1. When looking at the raw results, they linearly increase from -0.5484014 to -0.99996257.
When retrieving the weight with w = model.layers[0].get_weights(), subtracting 0.8, and setting it again with model.layers[0].set_weights(w), I get 920/1000 correct. Fine-tuning this process allows me to achieve 1000/1000.
Update 1:
I have also printed the update of the weight over the various epochs:
4.916246, 4.242602, 3.3765688, 2.6855211, 2.3405066, 2.206207, 2.1734586, 2.1656137, 2.1510274, 2.1634471, 2.1683235, 2.188944, 2.1510284, 2.1591303, 2.1632445, 2.1542525, 2.1677444, 2.1702878, 2.163104, 2.1635907
I set the weight to 1.36, a value which gives 908/1000 (as opposed to 700/1000). The optimizer moves away from it:
1.7992111, 2.0727847, 2.1370323, 2.15711, 2.1686404, 2.1603785, 2.183334, 2.1563332, 2.156857, 2.169908, 2.1658351, 2.170673, 2.1575692, 2.1505954, 2.1561477, 2.1754034, 2.1545155, 2.1635509, 2.1464484, 2.1707492
One thing that I noticed is that the hinge accuracy was 0.75 with the weight 1.36, which is higher than the 0.7 for 2.17. If that is the case, I am either in an unlucky part of the optimization landscape where the minimum of the loss does not correspond to the maximum of the accuracy, or the loss value is determined incorrectly. This is what I will be investigating next.
The minimum of the hinge loss for this example does not coincide with the maximum of the number of correctly classified examples; see the plot of both as a function of the parameter value. Given that the optimizer works towards the minimum of the loss, not the maximum of the number of correctly classified examples, the code (and framework/optimizer) does what it is supposed to do. Alternatively, one could use a different loss function to try to find a better fit, for example a binarized l1 loss. This function would have the same global optimum, but would likely have a very flat landscape.
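As a rough sketch of what such a binarized l1 loss could look like as a custom Keras loss (my own interpretation, not from the original post), note that the thresholding has zero gradient almost everywhere, which is exactly the flat-landscape caveat above:
import tensorflow as tf

def binarized_l1(y_true, y_pred):
    y_true = tf.cast(tf.squeeze(y_true), tf.float32)
    y_pred = tf.squeeze(y_pred)
    # threshold the [-1, 1] readout to a hard {-1, +1} prediction before the l1 distance;
    # this step is piecewise constant, so it provides (almost) no gradient signal
    y_pred_binary = tf.where(y_pred > 0.0, tf.ones_like(y_pred), -tf.ones_like(y_pred))
    return tf.reduce_mean(tf.abs(y_true - y_pred_binary))

# model.compile(loss=binarized_l1,
#               optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
#               metrics=[hinge_accuracy])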
Is this model using one relu in two places, or are gradients computed by doing a matrix multiplication of layers on both sides of one layer?
In the last layer of this simple neural net (below), during backprop it calculates the gradient for the last-layer weights w2 by doing a matrix multiplication of (y_pred - y) and h_relu, which I thought happened only between layers w1 and w2, not between w2 and y_pred.
The line in question is near the bottom. It is grad_w2 = h_relu.t().mm(grad_y_pred).
I am confused because I thought everything was supposed to proceed in order going forward and then in order going backwards. Is this ReLU being used in two places?
Here is an attempt at a visual illustration of the model.
This example is from the Pytorch website. It is the second block of code on the page.
grad_w2 = h_relu.t().mm(grad_y_pred)
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
I appreciate your patience looking at this and trying to clear this up for me.
If you could try adding another layer of weights in the middle with another ReLU, that might help me understand. This is what I was trying to do.
Consider the following diagram which represents the network in question. The concept of back-propagation is simply a way to quickly and intuitively apply the chain rule on a complex sequence of operations to compute the gradient of an output w.r.t. a tensor. Usually we are interested in computing the gradients of leaf tensors (tensors which are not derived from other tensors) with respect to a loss or objective. All the leaf tensors are represented as circles in the following diagram and the loss is represented by the rectangle with the L label.
Using the backward diagram we can follow the path from L to w1 and w2 in order to determine which partial derivatives we need in order to compute the gradient of L w.r.t. w1 and w2. For simplicity we will assume that all the leaf tensors are scalars so as to avoid getting into the complexities of multiplying vectors and matrices.
Using this approach, the gradients of L w.r.t. w1 and w2 are
dL/dw2 = dL/dy_pred * dy_pred/dw2
and
dL/dw1 = dL/dy_pred * dy_pred/dh_relu * dh_relu/dh * dh/dw1
Something to notice is that since w2 is a leaf tensor, the partial derivative dy_pred/dw2 is used only when computing dL/dw2 (grad_w2 in the code); it plays no part in dL/dw1, because w2 isn't on the path from L to w1.
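To address the request above about adding another layer of weights in the middle with another ReLU, here is a minimal sketch of my own (not from the original answer) that extends the same manual pattern by one forward/backward step; each weight's gradient pairs the upstream gradient with the activation that fed that weight on the forward pass:
import torch

# One hidden layer more than the tutorial code above: x -> w1 -> ReLU -> w2 -> ReLU -> w3 -> y_pred
N, D_in, H1, H2, D_out = 64, 1000, 100, 50, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
w1 = torch.randn(D_in, H1)
w2 = torch.randn(H1, H2)
w3 = torch.randn(H2, D_out)
learning_rate = 1e-6

# Forward pass
h1 = x.mm(w1)
h1_relu = h1.clamp(min=0)
h2 = h1_relu.mm(w2)
h2_relu = h2.clamp(min=0)
y_pred = h2_relu.mm(w3)
loss = (y_pred - y).pow(2).sum()

# Backward pass: one more link of the chain per extra layer
grad_y_pred = 2.0 * (y_pred - y)
grad_w3 = h2_relu.t().mm(grad_y_pred)        # uses h2_relu, analogous to grad_w2 above

grad_h2_relu = grad_y_pred.mm(w3.t())        # push the gradient back through w3
grad_h2 = grad_h2_relu.clone()
grad_h2[h2 < 0] = 0                          # ReLU gate for the second hidden layer
grad_w2 = h1_relu.t().mm(grad_h2)

grad_h1_relu = grad_h2.mm(w2.t())            # push the gradient back through w2
grad_h1 = grad_h1_relu.clone()
grad_h1[h1 < 0] = 0                          # ReLU gate for the first hidden layer
grad_w1 = x.t().mm(grad_h1)

# One gradient descent step
w1 -= learning_rate * grad_w1
w2 -= learning_rate * grad_w2
w3 -= learning_rate * grad_w3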
I have a neural network with an MSE loss function, implemented something like this:
# input x_ph is of size Nx1 and output should also be of size Nx1
def train_neural_network_batch(x_ph, predict=False):
    prediction = neural_network_model(x_ph)

    # MSE loss function
    cost = tf.reduce_mean(tf.square(prediction - y_ph))
    optimizer = tf.train.AdamOptimizer(learn_rate).minimize(cost)

    # mini-batch optimization here
I'm fairly new to neural networks and Python, but I understand that at each iteration a sample of training points is fed into the neural network and the loss function is evaluated at the points in this sample. However, I would like to be able to modify the loss function so that it weights certain data more heavily. Some pseudocode of what I mean:
# manually compute the MSE of the data without the first sampled element
cost = 0.0
for ii in range(1, len(y_ph)):
    cost += tf.square(prediction[ii] - y_ph[ii])
cost = cost/(len(y_ph) - 1.0)

# weight the first sampled data point more heavily according to some parameter W
cost += W*tf.square(prediction[0] - y_ph[0])
I might have more points I wish to weight differently as well, but for now, I'm just wondering how I can implement something like this in tensorflow. I know len(y_ph) is invalid as y_ph is just a placeholder, and I can't just do something like y_ph[i] or prediction[i].
You can do this in multiple ways:
1) If some of your data instances should simply be weighted 2 or 3 times more than a normal instance, you can just copy those instances multiple times in your data set. They would then occupy more weight in the loss, which satisfies your intention. This is the simplest way.
2) If your weighting is more complex, say a float weighting, you can define a placeholder for the weights, multiply it into the loss, and use feed_dict to feed the weights in the session together with the x batch and y batch. Just make sure instance_weight has the same size as the batch.
E.g.
import tensorflow as tf
import numpy as np

with tf.variable_scope("test", reuse=tf.AUTO_REUSE):
    x = tf.placeholder(tf.float32, [None, 1])
    y = tf.placeholder(tf.float32, [None, 1])
    instance_weight = tf.placeholder(tf.float32, [None, 1])

    w1 = tf.get_variable("w1", shape=[1, 1])
    prediction = tf.matmul(x, w1)
    cost = tf.square(prediction - y)
    loss = tf.reduce_mean(instance_weight * cost)
    opt = tf.train.AdamOptimizer(0.5).minimize(loss)

with tf.Session() as sess:
    x1 = [[1.], [2.], [3.]]
    y1 = [[2.], [4.], [3.]]
    instance_weight1 = [[10.0], [10.0], [0.1]]

    sess.run(tf.global_variables_initializer())
    print(x1)
    print(y1)
    print(instance_weight1)

    for i in range(1000):
        _, loss1, prediction1 = sess.run([opt, loss, prediction],
                                         feed_dict={instance_weight: instance_weight1, x: x1, y: y1})
        if (i % 100) == 0:
            print(loss1)
            print(prediction1)
Note instance_weight1; you may change it to see the difference (here the batch size is set to 3).
The first two points, (1, 2) and (2, 4), follow the rule y = 2*x,
whereas the third point, (3, 3), follows the rule y = x.
With the weights set to [10, 10, 0.1], prediction1 converges to the y = 2*x rule and almost ignores the third point. The output is:
[[1.9823183]
[3.9646366]
[5.9469547]]
PS: in a TensorFlow graph, it's highly recommended not to use Python for loops, but to use matrix operators instead so the calculation can be parallelized.
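Following that PS, here is a rough sketch (names, shapes, and the weight W are my own assumptions, mirroring the asker's pseudocode) of weighting the first sample by W and the rest by 1 without a Python loop:
import tensorflow as tf

x_ph = tf.placeholder(tf.float32, [None, 1])
y_ph = tf.placeholder(tf.float32, [None, 1])
prediction = tf.layers.dense(x_ph, 1)

W = 5.0  # assumed extra weight for the first sampled point
n = tf.shape(y_ph)[0]
# weight vector [W, 1, 1, ..., 1] built inside the graph, so it adapts to the batch size
weights = tf.concat([[[W]], tf.ones([n - 1, 1])], axis=0)
cost = tf.reduce_mean(weights * tf.square(prediction - y_ph))
optimizer = tf.train.AdamOptimizer(0.001).minimize(cost)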
In TensorFlow it seems that the entire backpropagation algorithm is performed by a single run of an optimizer on a certain cost function, which is the output of some MLP or CNN.
I do not fully understand how TensorFlow knows from the cost alone that it is indeed the output of a certain NN. A cost function can be defined for any model. How should I "tell" it that a certain cost function derives from a NN?
Question
How should I "tell" tf that a certain cost function derives from a NN?
(short) Answer
This is done by simply configuring your optimizer to minimize (or maximize) a tensor. For example, if I have a loss function like so
loss = tf.reduce_sum( tf.square( y0 - y_out ) )
where y0 is the ground truth (or desired output) and y_out is the calculated output, then I could minimize the loss by defining my training function like so
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
This tells Tensorflow that when train is calculated, it is to apply gradient descent on loss to minimize it, and loss is calculated using y0 and y_out, and so gradient descent will also affect those (if they are trainable variables), and so on.
The variables y0, y_out, loss, and train are not standard Python variables but instead descriptions of a computation graph. TensorFlow uses information about that computation graph to unroll it while applying gradient descent.
Specifically how it does that is beyond the scope of this answer. Here and here are two good starting points for more specifics.
Code Example
Let's walk through a code example. First the code.
### imports
import tensorflow as tf

### constant data
x  = [[0., 0.], [1., 1.], [1., 0.], [0., 1.]]
y_ = [[0.], [0.], [1.], [1.]]

### induction
# 1x2 input -> 2x3 hidden sigmoid -> 3x1 sigmoid output

# Layer 0 = the x2 inputs
x0 = tf.constant(x, dtype=tf.float32)
y0 = tf.constant(y_, dtype=tf.float32)

# Layer 1 = the 2x3 hidden sigmoid
m1 = tf.Variable(tf.random_uniform([2, 3], minval=0.1, maxval=0.9, dtype=tf.float32))
b1 = tf.Variable(tf.random_uniform([3], minval=0.1, maxval=0.9, dtype=tf.float32))
h1 = tf.sigmoid(tf.matmul(x0, m1) + b1)

# Layer 2 = the 3x1 sigmoid output
m2 = tf.Variable(tf.random_uniform([3, 1], minval=0.1, maxval=0.9, dtype=tf.float32))
b2 = tf.Variable(tf.random_uniform([1], minval=0.1, maxval=0.9, dtype=tf.float32))
y_out = tf.sigmoid(tf.matmul(h1, m2) + b2)

### loss
# loss : sum of the squares of y0 - y_out
loss = tf.reduce_sum(tf.square(y0 - y_out))

# training step : gradient descent (1.0) to minimize loss
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

### training
# run 500 times using all the X and Y
# print out the loss and any other interesting info
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(500):
        sess.run(train)

    results = sess.run([m1, b1, m2, b2, y_out, loss])
    labels = "m1,b1,m2,b2,y_out,loss".split(",")
    for label, result in zip(labels, results):
        print("")
        print(label)
        print(result)

print("")
Let's go through it, but in reverse order starting with
sess.run(train)
This tells tensorflow to look up the graph node defined by train and calculate it. Train is defined as
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
To calculate this tensorflow must compute the automatic differentiation for loss, which means walking the graph. loss is defined as
loss = tf.reduce_sum( tf.square( y0 - y_out ) )
This is really TensorFlow applying automatic differentiation: it unrolls first tf.reduce_sum, then tf.square, then y0 - y_out, which in turn means walking the graph for both y0 and y_out.
y0 = tf.constant( y_ , dtype=tf.float32 )
y0 is a constant and will not be updated.
y_out = tf.sigmoid( tf.matmul( h1,m2 ) + b2 )
y_out will be processed similarly to loss: first tf.sigmoid will be processed, and so on...
All in all, each operation (such as tf.sigmoid or tf.square) not only defines the forward operation (apply sigmoid or square) but also the information necessary for automatic differentiation. This is different from standard Python math such as
x = 7 + 9
The statement above does nothing except assign a value to x, whereas
z = y0 - y_out
encodes a graph node for subtracting y_out from y0, storing in z both the forward operation and enough information to perform automatic differentiation.
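A tiny sketch of that difference (TF 1.x style, my own toy example): the Python expression produces a number immediately, while the TF expression only produces a graph node that a session must evaluate.
import tensorflow as tf

a = 7 + 9                        # plain Python: a is just the number 16
y0 = tf.constant([1.0, 2.0])
y_out = tf.constant([0.5, 1.5])
z = y0 - y_out                   # TF: z is a graph node describing the subtraction

print(a)                         # 16
with tf.Session() as sess:
    print(sess.run(z))           # [0.5 0.5] -- the node is evaluated only here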
Backpropagation was created by Rumelhart and Hinton et al. and published in Nature in 1986.
As stated in Section 6.5: Back-Propagation and Other Differentiation Algorithms of the Deep Learning book, there are two types of approaches for back-propagating gradients through computational graphs: symbol-to-number differentiation and symbol-to-symbol derivatives. The one more relevant to TensorFlow, as stated in the paper A Tour of TensorFlow, is the latter, which can be illustrated using this diagram:
Source: Section II Part D of A Tour of TensorFlow
On the left side of Fig. 7 above, w represents the weights (or Variables) in TensorFlow, and x and y are two intermediary operations (or nodes; w, x, y and z are all operations) leading to the scalar loss z.
TensorFlow will add a gradient node for each node in the graph (if we print the variable names in a checkpoint we can see some additional variables for such nodes, and they are eliminated if we freeze the model into a protocol buffer file for deployment), as shown in diagram (b) on the right side: dz/dy, dy/dx, dx/dw.
During the backward traversal, at each node we multiply its gradient with that of the previous one, finally obtaining a symbolic handle to the overall target derivative dz/dw = dz/dy * dy/dx * dx/dw, which is exactly the chain rule. Once the gradient is worked out, w can update itself using a learning rate.
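As a small sketch of the symbol-to-symbol idea (my own toy example, not from the paper): tf.gradients adds gradient nodes to the graph, and evaluating them simply means running that extended graph.
import tensorflow as tf

w = tf.Variable(2.0)
x = 3.0 * w                  # intermediary node
y = tf.square(x)             # another intermediary node
z = tf.reduce_sum(y)         # scalar "loss"

# Adds new symbolic nodes computing dz/dw = dz/dy * dy/dx * dx/dw to the graph
grad_w = tf.gradients(z, w)[0]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad_w))  # dz/dw = 18 * w = 36.0 at w = 2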
For more detailed information please read this paper: TensorFlow:
Large-Scale Machine Learning on Heterogeneous Distributed Systems
I started to play with TensorFlow two days ago and I'm wondering whether the triplet and contrastive losses are implemented.
I've been looking at the documentation, but I haven't found any example or description about these things.
Update (2018/03/19): I wrote a blog post detailing how to implement triplet loss in TensorFlow.
You need to implement the contrastive loss or the triplet loss yourself, but once you know the pairs or triplets this is quite easy.
Contrastive Loss
Suppose you have as input the pairs of data and their label (positive or negative, i.e. same class or different class). For instance you have images as input of size 28x28x1:
left = tf.placeholder(tf.float32, [None, 28, 28, 1])
right = tf.placeholder(tf.float32, [None, 28, 28, 1])
label = tf.placeholder(tf.int32, [None, 1])  # 0 if same, 1 if different
margin = 0.2

left_output = model(left)    # shape [None, 128]
right_output = model(right)  # shape [None, 128]

d = tf.reduce_sum(tf.square(left_output - right_output), 1)
d_sqrt = tf.sqrt(d)

loss = label * tf.square(tf.maximum(0., margin - d_sqrt)) + (1 - label) * d

loss = 0.5 * tf.reduce_mean(loss)
Triplet Loss
Same as with contrastive loss, but with triplets (anchor, positive, negative). You don't need labels here.
anchor_output = ... # shape [None, 128]
positive_output = ... # shape [None, 128]
negative_output = ... # shape [None, 128]
d_pos = tf.reduce_sum(tf.square(anchor_output - positive_output), 1)
d_neg = tf.reduce_sum(tf.square(anchor_output - negative_output), 1)
loss = tf.maximum(0., margin + d_pos - d_neg)
loss = tf.reduce_mean(loss)
The real trouble when implementing triplet loss or contrastive loss in TensorFlow is how to sample the triplets or pairs. I will focus on generating triplets because it is harder than generating pairs.
The easiest way is to generate them outside of the Tensorflow graph, i.e. in python and feed them to the network through the placeholders. Basically you select images 3 at a time, with the first two from the same class and the third from another class. We then perform a feedforward on these triplets, and compute the triplet loss.
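For illustration, here is a rough numpy sketch of that offline sampling (my own, not from the answer): pick the anchor and positive from one class and the negative from a different class.
import numpy as np

def sample_triplet(images, labels):
    # pick two distinct classes: one for anchor/positive, one for the negative
    pos_class, neg_class = np.random.choice(np.unique(labels), 2, replace=False)
    pos_idx = np.where(labels == pos_class)[0]
    neg_idx = np.where(labels == neg_class)[0]
    anchor_i, positive_i = np.random.choice(pos_idx, 2, replace=False)
    negative_i = np.random.choice(neg_idx)
    return images[anchor_i], images[positive_i], images[negative_i]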
The issue here is that generating triplets is complicated. We want them to be valid triplets, triplets with a positive loss (otherwise the loss is 0 and the network doesn't learn).
To know whether a triplet is good or not you need to compute its loss, so you already make one feedforward through the network...
Clearly, implementing triplet loss in TensorFlow is hard, and there are ways to make it more efficient than sampling in Python, but explaining them would require a whole blog post!
Triplet loss with semihard negative mining is now implemented in tf.contrib, as follows:
triplet_semihard_loss(
    labels,
    embeddings,
    margin=1.0
)
where:
Args:
  labels: 1-D tf.int32 Tensor with shape [batch_size] of multiclass integer labels.
  embeddings: 2-D float Tensor of embedding vectors. Embeddings should be l2 normalized.
  margin: Float, margin term in the loss definition.
Returns:
  triplet_loss: tf.float32 scalar.
For further information, check the link below:
https://www.tensorflow.org/versions/master/api_docs/python/tf/contrib/losses/metric_learning/triplet_semihard_loss
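For context, here is a rough usage sketch (TF 1.x with contrib; the tiny linear embedding model is my own placeholder, not part of the API docs):
import tensorflow as tf

inputs = tf.placeholder(tf.float32, [None, 784])
labels = tf.placeholder(tf.int32, [None])        # multiclass integer labels, shape [batch_size]

w = tf.Variable(tf.random_normal([784, 128]))
embeddings = tf.matmul(inputs, w)                # 2-D float embeddings
embeddings = tf.nn.l2_normalize(embeddings, 1)   # the loss expects l2-normalized embeddings

loss = tf.contrib.losses.metric_learning.triplet_semihard_loss(
    labels=labels, embeddings=embeddings, margin=1.0)
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)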
Tiago, I don't think you are using the same formula Olivier gave.
Here is the right code (not sure it will work though, just fixing the formula):
def compute_euclidean_distance(x, y):
    """
    Computes the squared euclidean distance between two tensorflow variables
    """
    d = tf.reduce_sum(tf.square(tf.sub(x, y)), 1)
    return d

def compute_contrastive_loss(left_feature, right_feature, label, margin):
    """
    Compute the contrastive loss as in

    L = 0.5 * (1-Y) * D^2 + 0.5 * Y * {max(0, margin - D)}^2

    **Parameters**
     left_feature: First element of the pair
     right_feature: Second element of the pair
     label: Label of the pair (0 if same, 1 if different)
     margin: Contrastive margin

    **Returns**
     Return the loss operation
    """
    label = tf.to_float(label)
    one = tf.constant(1.0)

    d = compute_euclidean_distance(left_feature, right_feature)                 # D^2
    d_sqrt = tf.sqrt(compute_euclidean_distance(left_feature, right_feature))   # D
    first_part = tf.mul(one - label, d)                   # (1-Y) * D^2
    max_part = tf.square(tf.maximum(margin - d_sqrt, 0))
    second_part = tf.mul(label, max_part)                 # Y * {max(0, margin - D)}^2
    loss = 0.5 * tf.reduce_mean(first_part + second_part)
    return loss