I am following the MNIST softmax tutorial: https://www.tensorflow.org/tutorials/mnist/beginners/
According to the document, the model should be
y = tf.nn.softmax(tf.matmul(x, W) + b)
but in the sample source code, as you can see:
# Create the model
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.matmul(x, W) + b
softmax is not used. I think it needs to be changed to
y = tf.nn.softmax(tf.matmul(x, W) + b)
I assume that, since the testing function uses argmax, the output doesn't need to be normalized to a 0~1.0 value. But it can be confusing for developers.
Any ideas on this?
Softmax is used, on line 57:
# So here we use tf.nn.softmax_cross_entropy_with_logits on the raw
# outputs of 'y', and then average across the batch.
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y, y_))
See softmax_cross_entropy_with_logits for more details.
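For clarity, here is a minimal sketch (using the same TF 1.x API as the tutorial) showing that the softmax can still be applied separately when you want probabilities, and that tf.argmax gives the same predicted class on raw logits as on softmax outputs, since softmax is monotonic per row:
import tensorflow as tf

# Logits as produced by the model, i.e. y = tf.matmul(x, W) + b
logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.5, 2.5, 0.3]])

# Softmax is only needed if you want actual probabilities.
probs = tf.nn.softmax(logits)

# argmax is the same on logits and on probabilities.
pred_from_logits = tf.argmax(logits, 1)
pred_from_probs = tf.argmax(probs, 1)

with tf.Session() as sess:
    print(sess.run(probs))             # rows sum to 1
    print(sess.run(pred_from_logits))  # [0 1]
    print(sess.run(pred_from_probs))   # same classes: [0 1]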
Related
The XOR problem is known to be solvable by a multi-layer perceptron: given all 4 boolean inputs and outputs, it trains and memorizes the weights needed to reproduce the I/O. E.g.:
import numpy as np
np.random.seed(0)

def sigmoid(x):  # Squashes values into the (0, 1) range.
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)

# Cost function.
def cost(predicted, truth):
    return truth - predicted

xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
xor_output = np.array([[0,1,1,0]]).T

X = xor_input
Y = xor_output

# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Let's set the dimensions for the intermediate layer.
hidden_dim = 5
# Initialize weights between the input layer and the hidden layer.
W1 = np.random.random((input_dim, hidden_dim))

# Define the shape of the output vector.
output_dim = len(Y.T)
# Initialize weights between the hidden layer and the output layer.
W2 = np.random.random((hidden_dim, output_dim))

num_epochs = 10000
learning_rate = 1.0

for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))

    # Back propagation (Y -> layer2)
    # How much did we miss in the predictions?
    layer2_error = cost(layer2, Y)
    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    layer2_delta = layer2_error * sigmoid_derivative(layer2)

    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # Update weights.
    W2 += learning_rate * np.dot(layer1.T, layer2_delta)
    W1 += learning_rate * np.dot(layer0.T, layer1_delta)
We see that we've fully trained the network to memorize the outputs for XOR:
# On the training data
[int(prediction > 0.5) for prediction in layer2]
[out]:
[0, 1, 1, 0]
If we re-feed the same inputs, we get the same output:
for x, y in zip(X, Y):
    layer1_prediction = sigmoid(np.dot(W1.T, x))  # Feed the input into the trained W1.
    prediction = layer2_prediction = sigmoid(np.dot(W2.T, layer1_prediction))  # Feed the hidden activations into the trained W2.
    print(int(prediction > 0.5), y)
[out]:
0 [0]
1 [1]
1 [1]
0 [0]
But if we retrain the parameters (W1 and W2) without one of the data points, i.e.
xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
xor_output = np.array([[0,1,1,0]]).T
Let's drop the last row of data and use it as the unseen test point.
X = xor_input[:-1]
Y = xor_output[:-1]
With the rest of the code unchanged, regardless of how I change the hyperparameters, it is unable to learn the XOR function and reproduce the I/O:
for x, y in zip(xor_input, xor_output):
    layer1_prediction = sigmoid(np.dot(W1.T, x))  # Feed the unseen input into trained W.
    prediction = layer2_prediction = sigmoid(np.dot(W2.T, layer1_prediction))  # Feed the unseen input into trained W.
    print(int(prediction > 0.5), y)
[out]:
0 [0]
1 [1]
1 [1]
1 [0]
Even if we shuffle the inputs/outputs:
import random  # needed for random.shuffle

# Shuffle the order of the inputs
_temp = list(zip(X, Y))
random.shuffle(_temp)
xor_input_shuff, xor_output_shuff = map(np.array, zip(*_temp))
We can't learn the XOR function fully:
for x, y in zip(xor_input, xor_output):
    layer1_prediction = sigmoid(np.dot(W1.T, x))  # Feed the unseen input into trained W.
    prediction = layer2_prediction = sigmoid(np.dot(W2.T, layer1_prediction))  # Feed the unseen input into trained W.
    print(x, int(prediction > 0.5), y)
[out]:
[0 0] 1 [0]
[0 1] 1 [1]
[1 0] 1 [1]
[1 1] 0 [0]
So when the literature states that the multi-layer perceptron (a.k.a. the basic deep learning model) solves XOR, does it mean that it can fully learn and memorize the weights given the full set of inputs/outputs, but cannot generalize the XOR problem when one of the data points is missing?
Here's a Kaggle link where answerers can test the network for themselves: https://www.kaggle.com/alvations/xor-with-mlp/
I think learning (generalizing) XOR and memorizing XOR are different things.
A two-layer perceptron can memorize XOR, as you have seen; that is, there exists a combination of weights where the loss is at its minimum and equal to 0 (the absolute minimum).
If the weights are randomly initialized, you might end up in the situation where you have actually learned XOR and not only memorized it.
Note that multi-layer perceptrons are non-convex functions, so there can be multiple minima (even multiple global minima). When the data is missing one input, there are multiple minima (all equal in value), and among them there exist minima where the missing point would be classified correctly. Hence, an MLP can learn XOR (though finding that weight combination might be hard with a missing point).
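To illustrate, here is a small sketch (the training loop above packaged into a hypothetical train_mlp helper) that retrains on the three remaining points with several random seeds and checks whether the held-out point [1, 1] happens to be classified correctly for some initializations:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def train_mlp(X, Y, hidden_dim=5, num_epochs=10000, learning_rate=1.0, seed=0):
    # Same update rule as the training loop above, packaged as a function.
    rng = np.random.RandomState(seed)
    W1 = rng.random_sample((X.shape[1], hidden_dim))
    W2 = rng.random_sample((hidden_dim, Y.shape[1]))
    for _ in range(num_epochs):
        layer1 = sigmoid(np.dot(X, W1))
        layer2 = sigmoid(np.dot(layer1, W2))
        layer2_delta = (Y - layer2) * layer2 * (1 - layer2)
        layer1_delta = np.dot(layer2_delta, W2.T) * layer1 * (1 - layer1)
        W2 += learning_rate * np.dot(layer1.T, layer2_delta)
        W1 += learning_rate * np.dot(X.T, layer1_delta)
    return W1, W2

X = np.array([[0, 0], [0, 1], [1, 0]])  # train on 3 of the 4 points
Y = np.array([[0, 1, 1]]).T
held_out_x, held_out_y = np.array([1, 1]), 0

for seed in range(10):
    W1, W2 = train_mlp(X, Y, seed=seed)
    pred = sigmoid(np.dot(sigmoid(np.dot(held_out_x, W1)), W2))
    print(seed, int(pred > 0.5), held_out_y)
# Some seeds may happen to classify [1, 1] correctly and others not;
# which minimum is reached depends on the random initialization.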
It is quite often argued that neural networks are universal function approximators and can fit even nonsense labels. In that light, you might want to look at this work: https://arxiv.org/abs/1611.03530
In TensorFlow it seems that the entire backpropagation algorithm is performed by a single run of an optimizer on a certain cost function, which is the output of some MLP or CNN.
I do not fully understand how TensorFlow knows from the cost alone that it is indeed the output of a certain NN. A cost function can be defined for any model. How should I "tell" it that a certain cost function derives from a NN?
Question
How should I "tell" tf that a certain cost function derives from a NN?
(short) Answer
This is done by simply configuring your optimizer to minimize (or maximize) a tensor. For example, if I have a loss function like so
loss = tf.reduce_sum( tf.square( y0 - y_out ) )
where y0 is the ground truth (or desired output) and y_out is the calculated output, then I could minimize the loss by defining my training function like so
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
This tells TensorFlow that when train is computed, it should apply gradient descent to loss in order to minimize it. Since loss is computed from y0 and y_out, gradient descent will also affect whatever those depend on (if they are trainable variables), and so on.
The variables y0, y_out, loss, and train are not standard Python values but descriptions of nodes in a computation graph. TensorFlow uses information about that computation graph to unroll it while applying gradient descent.
Specifically how it does that is beyond the scope of this answer. Here and here are two good starting points for more information about more specifics.
Code Example
Let's walk through a code example. First the code.
### imports
import tensorflow as tf
### constant data
x = [[0.,0.],[1.,1.],[1.,0.],[0.,1.]]
y_ = [[0.],[0.],[1.],[1.]]
### induction
# 1x2 input -> 2x3 hidden sigmoid -> 3x1 sigmoid output
# Layer 0 = the x2 inputs
x0 = tf.constant( x , dtype=tf.float32 )
y0 = tf.constant( y_ , dtype=tf.float32 )
# Layer 1 = the 2x3 hidden sigmoid
m1 = tf.Variable( tf.random_uniform( [2,3] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
b1 = tf.Variable( tf.random_uniform( [3] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
h1 = tf.sigmoid( tf.matmul( x0,m1 ) + b1 )
# Layer 2 = the 3x1 sigmoid output
m2 = tf.Variable( tf.random_uniform( [3,1] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
b2 = tf.Variable( tf.random_uniform( [1] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
y_out = tf.sigmoid( tf.matmul( h1,m2 ) + b2 )
### loss
# loss : sum of the squares of y0 - y_out
loss = tf.reduce_sum( tf.square( y0 - y_out ) )
# training step : gradient decent (1.0) to minimize loss
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
### training
# run 500 times using all the X and Y
# print out the loss and any other interesting info
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(500):
        sess.run(train)

    results = sess.run([m1, b1, m2, b2, y_out, loss])
    labels = "m1,b1,m2,b2,y_out,loss".split(",")
    for label, result in zip(labels, results):
        print("")
        print(label)
        print(result)
print("")
Let's go through it, but in reverse order starting with
sess.run(train)
This tells TensorFlow to look up the graph node defined by train and compute it. train is defined as
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
To compute this, TensorFlow must apply automatic differentiation to loss, which means walking the graph. loss is defined as
loss = tf.reduce_sum( tf.square( y0 - y_out ) )
Here TensorFlow applies automatic differentiation to unroll first tf.reduce_sum, then tf.square, then y0 - y_out, which in turn requires walking the graph for both y0 and y_out.
y0 = tf.constant( y_ , dtype=tf.float32 )
y0 is a constant and will not be updated.
y_out = tf.sigmoid( tf.matmul( h1,m2 ) + b2 )
y_out will be processed similarly to loss: first tf.sigmoid will be processed, and so on.
All in all, each operation (such as tf.sigmoid or tf.square) defines not only the forward operation (apply sigmoid or square) but also the information necessary for automatic differentiation. This is different from standard Python math such as
x = 7 + 9
The above expression encodes nothing except the value to assign to x, whereas
z = y0 - y_out
encodes the graph of subtracting y_out from y0, and stores in z both the forward operation and enough information to do automatic differentiation.
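As a small illustration (a sketch using the same TF 1.x graph API as above), tf.gradients can be called on any such graph node to obtain the symbolic derivatives that the optimizer uses internally:
import tensorflow as tf

y0 = tf.constant([[1.0], [0.0]])
w = tf.Variable([[0.5]])
y_out = tf.sigmoid(tf.matmul(tf.constant([[1.0], [2.0]]), w))
loss = tf.reduce_sum(tf.square(y0 - y_out))

# The graph knows how to differentiate loss with respect to w,
# because every op recorded its backward rule when it was created.
grad_w = tf.gradients(loss, [w])[0]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad_w))  # d(loss)/d(w), computed by walking the graph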
Backpropagation was popularized by Rumelhart, Hinton, and Williams in a paper published in Nature in 1986.
As stated in Section 6.5 (Back-Propagation and Other Differentiation Algorithms) of the Deep Learning book, there are two types of approaches for back-propagating gradients through computational graphs: symbol-to-number differentiation and symbol-to-symbol derivatives. The more relevant one to TensorFlow, as stated in the paper A Tour of TensorFlow, is the latter, which can be illustrated with this diagram:
Source: Section II Part D of A Tour of TensorFlow
On the left side of Fig. 7 above, w represents the weights (or Variables) in TensorFlow, and x and y are two intermediate operations (or nodes; w, x, y and z are all operations) leading to the scalar loss z.
TensorFlow will add a gradient node for each node in the graph (if we print the variable names in a checkpoint we can see some additional variables for such nodes; they are eliminated if we freeze the model into a protocol buffer file for deployment), which can be seen in diagram (b) on the right side: dz/dy, dy/dx, dx/dw.
During the backward traversal, at each node we multiply its gradient by that of the previous node, and we finally get a symbolic handle to the overall target derivative dz/dw = dz/dy * dy/dx * dx/dw, which is exactly the chain rule. Once the gradient is worked out, w can update itself using a learning rate.
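As a quick sketch of the symbol-to-symbol idea (again assuming the TF 1.x API), the symbolic gradient of a chained computation matches the product of the individual derivatives from the chain rule:
import tensorflow as tf

w = tf.constant(2.0)
x = 3.0 * w        # dx/dw = 3
y = tf.square(x)   # dy/dx = 2x
z = 5.0 * y        # dz/dy = 5

# tf.gradients adds symbolic gradient ops to the graph (symbol-to-symbol).
dz_dw = tf.gradients(z, [w])[0]

with tf.Session() as sess:
    # Chain rule: dz/dw = dz/dy * dy/dx * dx/dw = 5 * (2 * 6) * 3 = 180
    print(sess.run(dz_dw))  # 180.0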
For more detailed information, please read this paper: TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems.
I need some clarification on how Tensorflow treats the shape of its tensors. This is taken from the MNIST example:
I define a placeholder that will at some later point be fed with some of my training data:
x = tf.placeholder(tf.float32, shape=[None, 784])
During runtime I feed it in batches of 100, so its shape during runtime is (100, 784). I also define weights and biases:
W = tf.Variable(tf.zeros([784,10]))
b = tf.Variable(tf.zeros([10]))
W is of shape (784, 10) and b is of shape (10). Now I compute
y = tf.matmul(x,W) + b
And this is where I am stuck. The matrix product of x and W is of shape (None, 10), or (100, 10) at runtime. However, I can add the vector b to it without error. This confuses me. How can this work? And is there some better documentation for this?
The + operator in tf.matmul(x, W) + b is actually shorthand for tf.add(tf.matmul(x, W), b) (operator overloading).
The documentation for tf.add mentions that it supports broadcasting, which means that when you add a tensor with shape (10) to a tensor with shape (100, 10), it's the equivalent of adding the (10) tensor to each row of the (100, 10) tensor.
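For instance, here is a quick sketch of the broadcast (the same rule applies in NumPy), using a toy batch of 2 rows and 3 columns:
import tensorflow as tf

batch = tf.constant([[1., 2., 3.],
                     [4., 5., 6.]])  # shape (2, 3), analogous to (100, 10)
bias = tf.constant([10., 20., 30.])  # shape (3,), analogous to (10,)

# Broadcasting: bias is added to every row of batch.
out = batch + bias  # same as tf.add(batch, bias)

with tf.Session() as sess:
    print(sess.run(out))
    # [[11. 22. 33.]
    #  [14. 25. 36.]]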
Hope that helps
I have just been studying softmax regression and I have a question that really needs your help. I began with MNIST softmax regression, and in that kind of problem the tutorial only calculates the accuracy without mentioning how to predict on new data.
But my problem is different:
My training data has a different form (inputs with 2 features and outputs with 3 classes, as in the variables below), and I would like to predict the output for a given input.
Therefore, for my data, I define the following variables
x = tf.placeholder(tf.float32, [None, 2])
W = tf.Variable(tf.zeros([2, 3]))
b = tf.Variable(tf.zeros([3]))
y = tf.nn.softmax(tf.matmul(x, W) + b)
y_ = tf.placeholder(tf.float32, [None, 3])
After training, I get W and b, but I don't know how to define the function that predicts the output if my input is now
x= [[11, 7],[3, 4],[1, 0]]
Could you help me figure it out?
Thanks very much.
The function you use to predict should be the same one you used at training time. I do not understand the question.
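In other words, a minimal sketch, assuming the graph from the question (so tf, x, W, b and y already exist) and a session sess in which W and b have already been trained:
import numpy as np

new_x = np.array([[11, 7], [3, 4], [1, 0]], dtype=np.float32)

# y = tf.nn.softmax(tf.matmul(x, W) + b) already defines the prediction;
# just feed the new inputs into the same placeholder x.
probabilities = sess.run(y, feed_dict={x: new_x})                  # shape (3, 3)
predicted_class = sess.run(tf.argmax(y, 1), feed_dict={x: new_x})  # most likely class per row

print(probabilities)
print(predicted_class)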
I started to play with TensorFlow two days ago and I'm wondering whether the triplet and contrastive losses are implemented.
I've been looking at the documentation, but I haven't found any example or description about these things.
Update (2018/03/19): I wrote a blog post detailing how to implement triplet loss in TensorFlow.
You need to implement the contrastive loss or the triplet loss yourself, but once you know the pairs or triplets this is quite easy.
Contrastive Loss
Suppose you have as input the pairs of data and their label (positive or negative, i.e. same class or different class). For instance you have images as input of size 28x28x1:
left = tf.placeholder(tf.float32, [None, 28, 28, 1])
right = tf.placeholder(tf.float32, [None, 28, 28, 1])
label = tf.placeholder(tf.float32, [None])  # 0.0 if same, 1.0 if different (float and 1-D so it multiplies directly with d)
margin = 0.2
left_output = model(left) # shape [None, 128]
right_output = model(right) # shape [None, 128]
d = tf.reduce_sum(tf.square(left_output - right_output), 1)
d_sqrt = tf.sqrt(d)
loss = label * tf.square(tf.maximum(0., margin - d_sqrt)) + (1 - label) * d
loss = 0.5 * tf.reduce_mean(loss)
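As a sketch of how this could be wired into a training step (assuming the hypothetical model embedding function above, and NumPy arrays batch_left, batch_right, batch_labels prepared outside the graph):
optimizer = tf.train.AdamOptimizer(1e-3)
train_op = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # batch_left, batch_right: arrays of shape (batch_size, 28, 28, 1)
    # batch_labels: array of shape (batch_size,), 0.0 for same class, 1.0 for different
    _, batch_loss = sess.run(
        [train_op, loss],
        feed_dict={left: batch_left, right: batch_right, label: batch_labels})
    print(batch_loss)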
Triplet Loss
Same as with contrastive loss, but with triplets (anchor, positive, negative). You don't need labels here.
anchor_output = ... # shape [None, 128]
positive_output = ... # shape [None, 128]
negative_output = ... # shape [None, 128]
d_pos = tf.reduce_sum(tf.square(anchor_output - positive_output), 1)
d_neg = tf.reduce_sum(tf.square(anchor_output - negative_output), 1)
loss = tf.maximum(0., margin + d_pos - d_neg)
loss = tf.reduce_mean(loss)
The real trouble when implementing triplet loss or contrastive loss in TensorFlow is how to sample the triplets or pairs. I will focus on generating triplets because it is harder than generating pairs.
The easiest way is to generate them outside of the TensorFlow graph, i.e. in Python, and feed them to the network through the placeholders. Basically you select images 3 at a time, with the first two from the same class and the third from another class. We then perform a feedforward pass on these triplets and compute the triplet loss.
The issue here is that generating triplets is complicated. We want them to be valid triplets, triplets with a positive loss (otherwise the loss is 0 and the network doesn't learn).
To know whether a triplet is good or not you need to compute its loss, so you already make one feedforward through the network...
Clearly, implementing triplet loss in TensorFlow is hard, and there are ways to make it more efficient than sampling in Python, but explaining them would require a whole blog post!
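For reference, here is a naive offline sampling sketch (random triplets, without the "valid triplet" mining described above; labels is a hypothetical NumPy array of integer class labels):
import numpy as np

def sample_random_triplets(labels, num_triplets, seed=0):
    # Return (anchor_idx, positive_idx, negative_idx) index arrays.
    # Naive sampling: anchor and positive share a class, negative does not.
    # No guarantee the triplets have a positive loss (no mining).
    rng = np.random.RandomState(seed)
    label_to_indices = {c: np.where(labels == c)[0] for c in np.unique(labels)}
    classes = list(label_to_indices.keys())
    anchors, positives, negatives = [], [], []
    for _ in range(num_triplets):
        c_pos, c_neg = rng.choice(classes, size=2, replace=False)
        a, p = rng.choice(label_to_indices[c_pos], size=2, replace=False)
        n = rng.choice(label_to_indices[c_neg])
        anchors.append(a); positives.append(p); negatives.append(n)
    return np.array(anchors), np.array(positives), np.array(negatives)

# Example: 10 random triplets from toy labels.
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
a_idx, p_idx, n_idx = sample_random_triplets(labels, 10)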
Triplet loss with semihard negative mining is now implemented in tf.contrib, as follows:
triplet_semihard_loss(
    labels,
    embeddings,
    margin=1.0
)
where:

Args:
  labels: 1-D tf.int32 Tensor with shape [batch_size] of multiclass
    integer labels.
  embeddings: 2-D float Tensor of embedding vectors. Embeddings should
    be l2 normalized.
  margin: Float, margin term in the loss definition.

Returns:
  triplet_loss: tf.float32 scalar.
For further information, check the link below:
https://www.tensorflow.org/versions/master/api_docs/python/tf/contrib/losses/metric_learning/triplet_semihard_loss
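A minimal usage sketch (assuming TF 1.x with tf.contrib available, and a 128-dimensional embedding tensor produced by your network):
import tensorflow as tf

# Hypothetical inputs: embeddings from your network and integer class labels.
embeddings = tf.placeholder(tf.float32, [None, 128])
labels = tf.placeholder(tf.int32, [None])

# The loss expects l2-normalized embeddings.
normalized = tf.nn.l2_normalize(embeddings, 1)

loss = tf.contrib.losses.metric_learning.triplet_semihard_loss(
    labels=labels, embeddings=normalized, margin=1.0)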
Tiago, I don't think you are using the same formula Olivier gave.
Here is the right code (not sure it will work though, just fixing the formula):
def compute_euclidean_distance(x, y):
    """
    Computes the squared euclidean distance between two tensorflow tensors.
    """
    d = tf.reduce_sum(tf.square(tf.subtract(x, y)), 1)
    return d

def compute_contrastive_loss(left_feature, right_feature, label, margin):
    """
    Compute the contrastive loss as in

    L = 0.5 * (1-Y) * D^2 + 0.5 * Y * {max(0, margin - D)}^2

    **Parameters**
      left_feature: First element of the pair
      right_feature: Second element of the pair
      label: Label of the pair (0 or 1)
      margin: Contrastive margin

    **Returns**
      Return the loss operation
    """
    label = tf.to_float(label)
    one = tf.constant(1.0)

    d = compute_euclidean_distance(left_feature, right_feature)
    d_sqrt = tf.sqrt(d)
    first_part = tf.multiply(one - label, d)  # (1-Y) * D^2
    max_part = tf.square(tf.maximum(margin - d_sqrt, 0))
    second_part = tf.multiply(label, max_part)  # Y * {max(0, margin - D)}^2

    loss = 0.5 * tf.reduce_mean(first_part + second_part)

    return loss
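A quick usage sketch on toy constant features (assuming TF 1.x), just to show the loss being evaluated:
import tensorflow as tf

# Toy 4-dimensional features for a batch of 2 pairs.
left_feature = tf.constant([[0.0, 1.0, 2.0, 3.0],
                            [1.0, 1.0, 1.0, 1.0]])
right_feature = tf.constant([[0.0, 1.0, 2.0, 3.5],
                             [5.0, 5.0, 5.0, 5.0]])
label = tf.constant([0, 1])  # first pair: same class, second pair: different

loss = compute_contrastive_loss(left_feature, right_feature, label, margin=2.0)

with tf.Session() as sess:
    print(sess.run(loss))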