Weighted cost function in tensorflow - tensorflow

I'm trying to introduce weighting into the following cost function:
_cost = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=_logits, labels=y))
But without having to do the softmax cross entropy myself. So I was thinking of breaking the cost calc up into cost1 and cost2 and feeding in a modified version of my logits and y values to each one.
I want to do something like this but not sure what is the correct code:
mask=(y==0)
y0 = tf.boolean_mask(y,mask)*y1Weight
(This gives the error that mask cannot be scalar)

The weight masks can be computed using tf.where. Here is the weighted cost example:
batch_size = 100
y1Weight = 0.25
y0Weight = 0.75
_logits = tf.Variable(tf.random_normal(shape=(batch_size, 2), stddev=1.))
y = tf.random_uniform(shape=(batch_size,), maxval=2, dtype=tf.int32)
_cost = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=_logits, labels=y)
# Weight mask: the weight for label=0 is y0Weight and for label=1 is y1Weight
y_w = tf.where(tf.cast(y, tf.bool), tf.ones((batch_size,))*y1Weight, tf.ones((batch_size,))*y0Weight)
# New weighted cost
cost_w = tf.reduce_mean(tf.multiply(_cost, y_w))
As suggested by @user1761806, a simpler solution is to use tf.losses.sparse_softmax_cross_entropy(), which accepts a weights argument, so per-example weights such as y_w can be passed in directly.
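To make the arithmetic concrete, here is a minimal plain-Python sketch (no TensorFlow) of what a class-weighted softmax cross-entropy computes. The helper names softmax_cross_entropy and weighted_cost are made up for illustration and are not part of any library.

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy of one example given raw logits and an integer label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]

def weighted_cost(batch_logits, labels, class_weights):
    """Mean over the batch of each example's loss times the weight of its class."""
    losses = [softmax_cross_entropy(l, y) for l, y in zip(batch_logits, labels)]
    weights = [class_weights[y] for y in labels]
    return sum(w * c for w, c in zip(weights, losses)) / len(losses)

batch_logits = [[2.0, 0.5], [0.1, 1.2], [0.3, 0.3]]
labels = [0, 1, 0]
class_weights = {0: 0.75, 1: 0.25}  # y0Weight and y1Weight from the answer above

print(weighted_cost(batch_logits, labels, class_weights))
```

Note that the mean is still taken over the whole batch, so the weights rescale each example's contribution rather than renormalizing the loss.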

You can calculate the weighted cost as follows: use a predefined weights_per_class tensor with shape (num_classes, 1), and one-hot encode the labels.
# labels_one_hot has shape [batch_size, num_classes]; obtained using tf.one_hot
labels_one_hot = tf.one_hot(y, depth=num_classes)
_cost = tf.nn.softmax_cross_entropy_with_logits(logits=_logits, labels=labels_one_hot)
# Here you can define a deterministic weights tensor.
# weights_per_class = tf.constant(np.array([[y0weights], [y1weights], ...]), dtype=tf.float32)
weights_per_class = tf.random_normal(shape=(num_classes, 1), dtype=tf.float32)
# Pick each example's class weight, then use it to compute the weighted loss
example_weights = tf.squeeze(tf.matmul(labels_one_hot, weights_per_class), axis=1)
_weighted_cost = tf.reduce_mean(_cost * example_weights)

Related

Is this Neural Net example I'm looking at a mistake or am I not understanding backprop?

Is this model using one relu in two places, or are gradients computed by doing a matrix multiplication of layers on both sides of one layer?
In the last layer of this simple neural net (below) during back prop it calculates the gradient for the last layer w2 by doing a matrix multiplication of y prediction - y and h_relu, which I thought was only between layers w1 and w2 not between w2 and y_pred
The line in question is near the bottom. It is grad_w2 = h_relu.t().mm(grad_y_pred).
I am confused because I thought everything was supposed to go in order forward and go in order backwards. Is this relu being used in two places?
Here is an attempt at a visual illustration of the model.
This example is from the Pytorch website. It is the second block of code on the page.
import torch
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)
    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)
    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
I appreciate your patience looking at this and trying to clear this up for me.
If you could try adding another layer of weights in the middle with another relu, that might help me understand. This is what I was trying to do.
Consider the following diagram which represents the network in question. The concept of back-propagation is simply a way to quickly and intuitively apply the chain rule on a complex sequence of operations to compute the gradient of an output w.r.t. a tensor. Usually we are interested in computing the gradients of leaf tensors (tensors which are not derived from other tensors) with respect to a loss or objective. All the leaf tensors are represented as circles in the following diagram and the loss is represented by the rectangle with the L label.
Using the backward diagram we can follow the path from L to w1 and w2 in order to determine which partial derivatives we need in order to compute the gradient of L w.r.t. w1 and w2. For simplicity we will assume that all the leaf tensors are scalars so as to avoid getting into the complexities of multiplying vectors and matrices.
Using this approach the gradients of L w.r.t. w1 and w2 are (writing y for y_pred)
dL/dw1 = dL/dy * dy/dh_relu * dh_relu/dh * dh/dw1
and
dL/dw2 = dL/dy * dy/dw2
Something to notice is that since w2 is a leaf tensor, we only use dy/dw2 (aka grad_w2) during computation of dL/dw2 since it isn't part of the path from L to w1.
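Since the question asked for a version with another layer of weights and another relu, here is a scalar sketch of that case in plain Python. It is not the PyTorch tutorial code: the third weight w3, the helper names, and the input values are assumptions made for illustration. Each weight's gradient multiplies the incoming gradient by the activation that fed that weight in the forward pass, which is why h_relu appears in grad_w2 in the original code. The result is compared against finite differences.

```python
def relu(v):
    return max(v, 0.0)

def relu_grad(v):
    return 1.0 if v > 0 else 0.0

def forward(x, y, w1, w2, w3):
    h1 = x * w1
    a1 = relu(h1)
    h2 = a1 * w2
    a2 = relu(h2)
    y_pred = a2 * w3
    loss = (y_pred - y) ** 2
    return h1, a1, h2, a2, y_pred, loss

def backward(x, y, w1, w2, w3):
    h1, a1, h2, a2, y_pred, _ = forward(x, y, w1, w2, w3)
    grad_y_pred = 2.0 * (y_pred - y)     # dL/dy_pred
    grad_w3 = grad_y_pred * a2           # a2 is the input that w3 multiplied
    grad_a2 = grad_y_pred * w3
    grad_h2 = grad_a2 * relu_grad(h2)
    grad_w2 = grad_h2 * a1               # a1 is the input that w2 multiplied
    grad_a1 = grad_h2 * w2
    grad_h1 = grad_a1 * relu_grad(h1)
    grad_w1 = grad_h1 * x                # x is the input that w1 multiplied
    return [grad_w1, grad_w2, grad_w3]

def numeric_grads(x, y, weights, eps=1e-6):
    """Central finite differences of the loss w.r.t. each weight."""
    grads = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += eps
        down[i] -= eps
        diff = forward(x, y, *up)[-1] - forward(x, y, *down)[-1]
        grads.append(diff / (2 * eps))
    return grads

x, y = 1.5, 0.5
weights = [0.8, 1.1, -0.3]
print(backward(x, y, *weights))
print(numeric_grads(x, y, weights))
```

The two printed lists should agree to several decimal places, which is a quick way to check a hand-written backward pass.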

Accessing elements of a placeholder in tensorflow [duplicate]

I have a neural network with MSE loss function being implemented something like this:
# input x_ph is of size Nx1 and output should also be of size Nx1
def train_neural_network_batch(x_ph, predict=False):
    prediction = neural_network_model(x_ph)
    # MSE loss function
    cost = tf.reduce_mean(tf.square(prediction - y_ph))
    optimizer = tf.train.AdamOptimizer(learn_rate).minimize(cost)
    # mini-batch optimization here
I'm fairly new to neural networks and Python, but I understand that each iteration, a sample of training points will be fed into the neural network and the loss function evaluated at the points in this sample. However, I would like to be able to modify the loss function so that it weights certain data more heavily. Some pseudocode of what I mean
# manually compute the MSE of the data without the first sampled element
cost = 0.0
for ii in range(1,len(y_ph)):
    cost += tf.square(prediction[ii] - y_ph[ii])
cost = cost/(len(y_ph)-1.0)
# weight the first sampled data point more heavily according to some parameter W
cost += W*(prediction[0] - y_ph[0])
I might have more points I wish to weight differently as well, but for now, I'm just wondering how I can implement something like this in tensorflow. I know len(y_ph) is invalid as y_ph is just a placeholder, and I can't just do something like y_ph[i] or prediction[i].
You can do this in multiple ways:
1) If some of your data instances should simply weigh 2 or 3 times more than a normal instance, you can copy those instances multiple times in your data set. They then carry more weight in the loss, which satisfies your intention. This is the simplest way.
2) If your weighting is more complex, say a float weighting, you can define a placeholder for the weights, multiply it with the per-example loss, and use feed_dict to feed the weights into the session together with the x batch and y batch. Just make sure instance_weight has the same size as the batch.
E.g.
import tensorflow as tf
import numpy as np
with tf.variable_scope("test", reuse=tf.AUTO_REUSE):
    x = tf.placeholder(tf.float32, [None,1])
    y = tf.placeholder(tf.float32, [None,1])
    instance_weight = tf.placeholder(tf.float32, [None,1])
    w1 = tf.get_variable("w1", shape=[1, 1])
    prediction = tf.matmul(x, w1)
    cost = tf.square(prediction - y)
    loss = tf.reduce_mean(instance_weight * cost)
    opt = tf.train.AdamOptimizer(0.5).minimize(loss)
with tf.Session() as sess:
    x1 = [[1.],[2.],[3.]]
    y1 = [[2.],[4.],[3.]]
    instance_weight1 = [[10.0], [10.0], [0.1]]
    sess.run(tf.global_variables_initializer())
    print (x1)
    print (y1)
    print (instance_weight1)
    for i in range(1000):
        _, loss1, prediction1 = sess.run([opt, loss, prediction], feed_dict={instance_weight : instance_weight1, x : x1, y : y1 })
        if (i % 100) == 0:
            print(loss1)
            print(prediction1)
Note instance_weight1: you can change it to see the difference (here batch_size is 3).
The first two data points, (1, 2) and (2, 4), follow the rule y = 2*x,
whereas the third, (3, 3), follows the rule y = x.
With the weights [10, 10, 0.1], prediction1 converges to the rule followed by the first two points and almost ignores the third. The output is:
[[1.9823183]
[3.9646366]
[5.9469547]]
PS: in a TensorFlow graph, it is highly recommended to avoid Python for loops and use matrix operations instead, so that the calculation can run in parallel.
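As a sanity check on the numbers above: for a one-parameter model prediction = w * x, the weighted squared error has a closed-form minimizer, so the value the optimizer should approach can be computed directly. This is a plain-Python sketch; the function name is made up for illustration.

```python
def weighted_least_squares_slope(xs, ys, weights):
    """Minimize sum_i weights[i] * (w * xs[i] - ys[i])**2 over the scalar w."""
    numerator = sum(wt * x * y for wt, x, y in zip(weights, xs, ys))
    denominator = sum(wt * x * x for wt, x in zip(weights, xs))
    return numerator / denominator

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 3.0]
weights = [10.0, 10.0, 0.1]

w = weighted_least_squares_slope(xs, ys, weights)
print(w)                      # close to 1.98, matching the first prediction above
print([w * x for x in xs])    # compare with the printed prediction1
```

With equal weights the same function gives a slope of 19/14 (about 1.36), which shows how strongly the heavy weights pull the fit toward y = 2*x.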

How to implement a filter in tensorflow?

I have a convolutional neural network with three images as inputs:
x_anchor = tf.placeholder('float', [None, 4900], name='x_anchor')
x_positive = tf.placeholder('float', [None, 4900], name='x_positive')
x_negative = tf.placeholder('float', [None, 4900], name='x_negative')
Within a train function, I feed the placeholders with the actual images:
input1, input2, input3 = training.next_batch(start,end)
....some other operations...
loss_value = sess.run([cost], feed_dict={x_anchor:input1, x_positive:input2, x_negative:input3})
I'm using a triplet loss function on these three inputs (that's actually the cost variable above):
def triplet_loss(d_pos, d_neg):
    margin = 0.2
    loss = tf.reduce_mean(tf.maximum(0., margin + d_pos - d_neg))
    return loss
How can I filter the losses, so only the images with loss_value > 0 will be used to train the network?
How can I implement something like:
if(loss_value for input1, input2, input3 > 0)
    use inputs to train network
else
    do nothing/try another input
What I have tried so far:
I took the images one by one (input1[0], input2[0], input3[0]), calculated the loss, and if the loss was positive I would calculate (and apply) the gradients. But the problem is I use dropout in my model and I have to apply the model twice on my inputs:
First to calculate the loss and verify whether it's greater than 0
Second to run the optimizer: this is when things go wrong. As I mentioned before, I use dropout, so the results of the model on my inputs are different, so the new loss will sometimes be 0 even if the loss determined at step 1 is greater than 0.
I also tried to use tf.py_func but got stuck.
There's a new TensorFlow feature called “AutoGraph”. AutoGraph converts Python code, including control flow, print() and other Python-native features, into pure TensorFlow graph code. For example:
@autograph.convert()
def huber_loss(a):
    if tf.abs(a) <= delta:
        loss = a * a / 2
    else:
        loss = delta * (tf.abs(a) - delta / 2)
    return loss
becomes this code at execution time due to the decorator:
def tf__huber_loss(a):
    with tf.name_scope('huber_loss'):
        def if_true():
            with tf.name_scope('if_true'):
                loss = a * a / 2
                return loss,
        def if_false():
            with tf.name_scope('if_false'):
                loss = delta * (tf.abs(a) - delta / 2)
                return loss,
        loss = ag__.utils.run_cond(tf.less_equal(tf.abs(a), delta), if_true,
            if_false)
        return loss
Before AutoGraph, what you want to do could already be implemented using tf.cond().
I found out about this through this medium post.
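A different way to look at the filtering, sketched here in plain Python rather than TensorFlow: compute the hinge loss per triplet in a single forward pass and average only over the triplets whose loss is positive. Triplets with zero loss contribute no gradient anyway, so the filter only changes the normalization, and running the model once avoids the dropout mismatch described in the question. The function name is an assumption for illustration; in a graph the same mask could presumably be built with tf.greater and tf.cast, though that is not tested here.

```python
def active_triplet_loss(d_pos, d_neg, margin=0.2):
    """Mean hinge loss over only the triplets whose loss is positive."""
    losses = [max(0.0, margin + p - n) for p, n in zip(d_pos, d_neg)]
    active = [l for l in losses if l > 0.0]
    if not active:  # every triplet already satisfies the margin
        return 0.0, 0
    return sum(active) / len(active), len(active)

d_pos = [0.5, 0.1, 0.9, 0.3]
d_neg = [0.4, 0.8, 1.0, 0.2]
loss, num_active = active_triplet_loss(d_pos, d_neg)
print(loss, num_active)
```

Here the second triplet is dropped because its negative is already far enough away, so the mean is taken over the remaining three.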

How does backpropagation work in tensorflow

In tensorflow it seems that the entire backpropagation algorithm is performed by a single running of an optimizer on a certain cost function, which is the output of some MLP or a CNN.
I do not fully understand how tensorflow knows from the cost that it is indeed an output of a certain NN? A cost function can be defined for any model. How should I "tell" it that a certain cost function derives from a NN?
Question
How should I "tell" tf that a certain cost function derives from a NN?
(short) Answer
This is done by simply configuring your optimizer to minimize (or maximize) a tensor. For example, if I have a loss function like so
loss = tf.reduce_sum( tf.square( y0 - y_out ) )
where y0 is the ground truth (or desired output) and y_out is the calculated output, then I could minimize the loss by defining my training function like so
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
This tells Tensorflow that when train is calculated, it is to apply gradient descent on loss to minimize it, and loss is calculated using y0 and y_out, and so gradient descent will also affect those (if they are trainable variables), and so on.
The variable y0, y_out, loss, and train are not standard python variables but instead descriptions of a computation graph. Tensorflow uses information about that computation graph to unroll it while applying gradient descent.
Specifically how it does that is beyond the scope of this answer. Here and here are two good starting points for more information about more specifics.
Code Example
Let's walk through a code example. First the code.
### imports
import tensorflow as tf
### constant data
x = [[0.,0.],[1.,1.],[1.,0.],[0.,1.]]
y_ = [[0.],[0.],[1.],[1.]]
### induction
# 1x2 input -> 2x3 hidden sigmoid -> 3x1 sigmoid output
# Layer 0 = the x2 inputs
x0 = tf.constant( x , dtype=tf.float32 )
y0 = tf.constant( y_ , dtype=tf.float32 )
# Layer 1 = the 2x3 hidden sigmoid
m1 = tf.Variable( tf.random_uniform( [2,3] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
b1 = tf.Variable( tf.random_uniform( [3] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
h1 = tf.sigmoid( tf.matmul( x0,m1 ) + b1 )
# Layer 2 = the 3x1 sigmoid output
m2 = tf.Variable( tf.random_uniform( [3,1] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
b2 = tf.Variable( tf.random_uniform( [1] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
y_out = tf.sigmoid( tf.matmul( h1,m2 ) + b2 )
### loss
# loss : sum of the squares of y0 - y_out
loss = tf.reduce_sum( tf.square( y0 - y_out ) )
# training step : gradient descent (1.0) to minimize loss
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
### training
# run 500 times using all the X and Y
# print out the loss and any other interesting info
with tf.Session() as sess:
    sess.run( tf.global_variables_initializer() )
    for step in range(500) :
        sess.run(train)
    results = sess.run([m1,b1,m2,b2,y_out,loss])
    labels = "m1,b1,m2,b2,y_out,loss".split(",")
    for label,result in zip(labels,results) :
        print("")
        print(label)
        print(result)
print("")
Let's go through it, but in reverse order starting with
sess.run(train)
This tells tensorflow to look up the graph node defined by train and calculate it. Train is defined as
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
To calculate this tensorflow must compute the automatic differentiation for loss, which means walking the graph. loss is defined as
loss = tf.reduce_sum( tf.square( y0 - y_out ) )
Which is really tensorflow applying automatic differentiation to unroll first tf.reduce_sum, then tf.square, then y0 - y_out, which leads to then having to walk the graph for both y0 and y_out.
y0 = tf.constant( y_ , dtype=tf.float32 )
y0 is a constant and will not be updated.
y_out = tf.sigmoid( tf.matmul( h1,m2 ) + b2 )
y_out will be processed similar to loss, first tf.sigmoid will be processed, etc...
All in all, each operation ( such as tf.sigmoid, tf.square ) not only defines the forward operation ( apply sigmoid or square ) but also information necessary for automatic differentiation. This is different than standard python math such as
x = 7 + 9
The above equation encodes nothing except how to update x, whereas
z = y0 - y_out
encodes the graph of subtracting y_out from y0 and stores both the forward operation and enough to do automatic differentiation in z
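To illustrate the idea that every operation stores both its forward result and what is needed to differentiate it, here is a toy scalar reverse-mode sketch in plain Python. It is not how TensorFlow is implemented; the Node class and helper functions are invented for this example.

```python
import math

class Node:
    """A scalar graph node storing its value and how to push gradients back."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent_node, local_derivative)
        self.grad = 0.0

    def __sub__(self, other):
        return Node(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))

def square(a):
    return Node(a.value ** 2, ((a, 2.0 * a.value),))

def sigmoid(a):
    s = 1.0 / (1.0 + math.exp(-a.value))
    return Node(s, ((a, s * (1.0 - s)),))

def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every upstream node."""
    # Topological order, so a node's grad is complete before it is propagated.
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local

x = Node(1.0)
m = Node(0.5)    # plays the role of a trainable variable
y0 = Node(1.0)   # plays the role of the constant target
y_out = sigmoid(x * m)
loss = square(y0 - y_out)
backward(loss)
print(loss.value, m.grad)
```

Walking from loss back to m visits square, then the subtraction, then sigmoid, then the multiplication, which mirrors the order described in the walkthrough above.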
Backpropagation was described by Rumelhart, Hinton, et al. in a paper published in Nature in 1986.
As stated in Section 6.5, "Back-Propagation and Other Differentiation Algorithms", of the Deep Learning book, there are two approaches to back-propagating gradients through computational graphs: symbol-to-number differentiation and symbol-to-symbol derivatives. The one more relevant to TensorFlow, as stated in the paper A Tour of TensorFlow, is the latter, which can be illustrated using this diagram:
Source: Section II Part D of A Tour of TensorFlow
On the left side of Fig. 7 above, w represents the weights (or Variables) in TensorFlow, and x and y are two intermediate operations (w, x, y and z are all operations, i.e. nodes) that lead to the scalar loss z.
TensorFlow adds a gradient node for each node in the graph, as shown in diagram (b) on the right side: dz/dy, dy/dx, dx/dw. (If we print the names of the variables in a checkpoint we can see some additional variables for such nodes; they are eliminated when we freeze the model to a protocol buffer file for deployment.)
During the backward traversal, at each node we multiply its gradient by that of the previous one, finally obtaining a symbolic handle to the overall target derivative dz/dw = dz/dy * dy/dx * dx/dw, which is exactly the chain rule. Once the gradient is worked out, w can be updated using the learning rate.
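The symbol-to-symbol idea can be sketched with a toy expression graph in plain Python: differentiating returns a new graph, which is only evaluated later when concrete values are supplied. This is an illustration under simplifying assumptions (only addition and multiplication, forward application of the sum and product rules), not TensorFlow's actual mechanism, and all class and function names are invented here.

```python
class Var:
    def __init__(self, name):
        self.name = name

class Const:
    def __init__(self, value):
        self.value = value

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

class Mul:
    def __init__(self, left, right):
        self.left, self.right = left, right

def diff(expr, var):
    """Return a new expression graph for d(expr)/d(var)."""
    if isinstance(expr, Var):
        return Const(1.0 if expr is var else 0.0)
    if isinstance(expr, Const):
        return Const(0.0)
    if isinstance(expr, Add):
        return Add(diff(expr.left, var), diff(expr.right, var))
    if isinstance(expr, Mul):  # product rule
        return Add(Mul(diff(expr.left, var), expr.right),
                   Mul(expr.left, diff(expr.right, var)))
    raise TypeError("unknown node type")

def evaluate(expr, env):
    """Evaluate an expression graph given values for its variables."""
    if isinstance(expr, Var):
        return env[expr.name]
    if isinstance(expr, Const):
        return expr.value
    if isinstance(expr, Add):
        return evaluate(expr.left, env) + evaluate(expr.right, env)
    if isinstance(expr, Mul):
        return evaluate(expr.left, env) * evaluate(expr.right, env)
    raise TypeError("unknown node type")

w = Var("w")
x = Mul(w, w)   # x = w^2
y = Add(x, w)   # y = x + w
z = Mul(y, y)   # z = y^2

dz_dw = diff(z, w)                   # built once, as a graph
print(evaluate(dz_dw, {"w": 2.0}))   # evaluated later with a concrete value
print(evaluate(dz_dw, {"w": 1.0}))
```

Because dz_dw is itself a graph, it can be evaluated for any value of w without redoing the differentiation, which is the property the diagram is meant to convey.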
For more detailed information please read this paper: TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

Implementing contrastive loss and triplet loss in Tensorflow

I started to play with TensorFlow two days ago and I'm wondering if there is the triplet and the contrastive losses implemented.
I've been looking at the documentation, but I haven't found any example or description about these things.
Update (2018/03/19): I wrote a blog post detailing how to implement triplet loss in TensorFlow.
You need to implement yourself the contrastive loss or the triplet loss, but once you know the pairs or triplets this is quite easy.
Contrastive Loss
Suppose you have as input the pairs of data and their label (positive or negative, i.e. same class or different class). For instance you have images as input of size 28x28x1:
left = tf.placeholder(tf.float32, [None, 28, 28, 1])
right = tf.placeholder(tf.float32, [None, 28, 28, 1])
label = tf.placeholder(tf.float32, [None])  # 0 if same, 1 if different
margin = 0.2
left_output = model(left) # shape [None, 128]
right_output = model(right) # shape [None, 128]
d = tf.reduce_sum(tf.square(left_output - right_output), 1)
d_sqrt = tf.sqrt(d)
loss = label * tf.square(tf.maximum(0., margin - d_sqrt)) + (1 - label) * d
loss = 0.5 * tf.reduce_mean(loss)
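For reference, the same contrastive loss written out in plain Python on small lists, which makes it easy to check individual pairs by hand. It follows the label convention above (0 if same, 1 if different); the function name and sample embeddings are assumptions for illustration.

```python
import math

def contrastive_loss(left, right, labels, margin=0.2):
    """labels[i] is 0 for a same-class pair and 1 for a different-class pair."""
    total = 0.0
    for a, b, label in zip(left, right, labels):
        d = sum((u - v) ** 2 for u, v in zip(a, b))  # squared distance
        d_sqrt = math.sqrt(d)
        total += label * max(0.0, margin - d_sqrt) ** 2 + (1 - label) * d
    return 0.5 * total / len(labels)

left = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
right = [[0.3, 0.4], [0.06, 0.08], [0.3, 0.4]]
labels = [0, 1, 1]
print(contrastive_loss(left, right, labels))
```

The first pair is penalized for being apart even though it is labeled the same, the second for being closer than the margin even though it is labeled different, and the third contributes nothing because it is already farther apart than the margin.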
Triplet Loss
Same as with contrastive loss, but with triplets (anchor, positive, negative). You don't need labels here.
anchor_output = ... # shape [None, 128]
positive_output = ... # shape [None, 128]
negative_output = ... # shape [None, 128]
d_pos = tf.reduce_sum(tf.square(anchor_output - positive_output), 1)
d_neg = tf.reduce_sum(tf.square(anchor_output - negative_output), 1)
loss = tf.maximum(0., margin + d_pos - d_neg)
loss = tf.reduce_mean(loss)
The real trouble when implementing triplet loss or contrastive loss in TensorFlow is how to sample the triplets or pairs. I will focus on generating triplets because it is harder than generating pairs.
The easiest way is to generate them outside of the Tensorflow graph, i.e. in python and feed them to the network through the placeholders. Basically you select images 3 at a time, with the first two from the same class and the third from another class. We then perform a feedforward on these triplets, and compute the triplet loss.
The issue here is that generating triplets is complicated. We want them to be valid triplets, triplets with a positive loss (otherwise the loss is 0 and the network doesn't learn).
To know whether a triplet is good or not you need to compute its loss, so you already make one feedforward through the network...
Clearly, implementing triplet loss in Tensorflow is hard, and there are ways to make it more efficient than sampling in python, but explaining them would require a whole blog post!
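The simple approach of generating triplets outside the graph can be sketched as follows: pick two indices from one class and a third from a different class, purely at random. The function name, the seed parameter, and the sample labels are assumptions for illustration, and this sampler makes no attempt to find triplets with a positive loss.

```python
import random

def sample_triplets(labels, num_triplets, seed=0):
    """Return (anchor, positive, negative) index triples drawn at random."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    # Anchors need a class with at least two members.
    anchor_classes = [c for c, idx in by_class.items() if len(idx) >= 2]
    if not anchor_classes or len(by_class) < 2:
        raise ValueError("need a class with two examples and a second class")
    triplets = []
    for _ in range(num_triplets):
        pos_class = rng.choice(anchor_classes)
        neg_class = rng.choice([c for c in by_class if c != pos_class])
        anchor, positive = rng.sample(by_class[pos_class], 2)
        negative = rng.choice(by_class[neg_class])
        triplets.append((anchor, positive, negative))
    return triplets

labels = [0, 0, 1, 1, 2, 0]
for a, p, n in sample_triplets(labels, 4):
    print(a, p, n, "->", labels[a], labels[p], labels[n])
```

The returned indices would then be used to gather the three image batches that are fed to the placeholders.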
Triplet loss with semihard negative mining is now implemented in tf.contrib, as follows:
triplet_semihard_loss(
    labels,
    embeddings,
    margin=1.0
)
where:
Args:
    labels: 1-D tf.int32 Tensor with shape [batch_size] of multiclass
        integer labels.
    embeddings: 2-D float Tensor of embedding vectors. Embeddings should
        be l2 normalized.
    margin: Float, margin term in the loss definition.
Returns:
    triplet_loss: tf.float32 scalar.
For further information, check the link below:
https://www.tensorflow.org/versions/master/api_docs/python/tf/contrib/losses/metric_learning/triplet_semihard_loss
Tiago, I don't think you are using the same formula Olivier gave.
Here is the right code (not sure it will work though, just fixing the formula) :
def compute_euclidean_distance(x, y):
    """
    Computes the squared euclidean distance between two tensorflow variables
    """
    d = tf.reduce_sum(tf.square(tf.subtract(x, y)), 1)
    return d
def compute_contrastive_loss(left_feature, right_feature, label, margin):
    """
    Compute the contrastive loss as in
    L = 0.5 * (1-Y) * D^2 + 0.5 * Y * {max(0, margin - D)}^2
    **Parameters**
    left_feature: First element of the pair
    right_feature: Second element of the pair
    label: Label of the pair (0 or 1)
    margin: Contrastive margin
    **Returns**
    Return the loss operation
    """
    label = tf.to_float(label)
    one = tf.constant(1.0)
    d = compute_euclidean_distance(left_feature, right_feature)  # D^2
    d_sqrt = tf.sqrt(d)  # D
    first_part = tf.multiply(one - label, d)  # (1-Y) * D^2
    max_part = tf.square(tf.maximum(margin - d_sqrt, 0.))
    second_part = tf.multiply(label, max_part)  # Y * max(margin - D, 0)^2
    loss = 0.5 * tf.reduce_mean(first_part + second_part)
    return loss