TensorFlow prediction of binary strings

I'm trying to create a convolutional neural network which predicts whether or not to sell for a hydropower dam. The issue I am having is with the output. I feed in two inputs: price (a normalized float) and water inflow (either 1 or 0 at this point).
My issue is that when I run this and try to get the answer as a set of 0/1 actions, I get floats that only make sense if the output is interpreted as a single number identifying the action instead of as the set of actions. This is fine when the number of actions is small, but will be horrible later on when the number of actions grows.
Does anyone know how I can make it output each action as either 0 or 1, instead of floats that seem to be the certainty of the prediction?
Meaning, if there are 4 actions and the correct answer is 0, 1, 0, 1, then the predictions should be in the same form (4 actions, each either 0 or 1).
Any help would be much appreciated

Binary output from Normalized Probability
What you are looking for is a method of converting your normalized probability output to a binary one.
This is very straightforward in TensorFlow and involves adding a tf.round operation. The trick is to make sure you do not use the output of tf.round in training. This is best demonstrated with a working code example.
Working code example
This code calculates the XOR function using a neural net. The outputs are y_out (the probability output) and y_binary (the casting of the probability output to binary)
### imports
import tensorflow as tf
import numpy as np
### constant data
x = [[0.,0.],[1.,1.],[1.,0.],[0.,1.]]
y_ = [[1.,0.],[1.,0.],[0.,1.],[0.,1.]]
### induction
# 1x2 input -> 2x3 hidden sigmoid -> 3x2 softmax output
# Layer 0 = the x2 inputs
x0 = tf.placeholder( dtype=tf.float32 , shape=[None,2] )
y0 = tf.placeholder( dtype=tf.float32 , shape=[None,2] )
# Layer 1 = the 2x3 hidden sigmoid
m1 = tf.Variable( tf.random_uniform( [2,3] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
b1 = tf.Variable( tf.random_uniform( [3] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
h1 = tf.sigmoid( tf.matmul( x0,m1 ) + b1 )
# Layer 2 = the 3x2 softmax output
m2 = tf.Variable( tf.random_uniform( [3,2] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
b2 = tf.Variable( tf.random_uniform( [2] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
y_logit = tf.matmul( h1,m2 ) + b2
y_out = tf.nn.softmax( y_logit )
y_binary = tf.round( y_out )
### loss
# loss : a loss function that uses y_logit or y_out , but NOT y_binary
loss = tf.reduce_sum( tf.square( y0 - y_out ) )
# training step
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
### training
# run 500 times using all the X and Y
# print out the loss and any other interesting info
with tf.Session() as sess:
    sess.run( tf.global_variables_initializer() )
    print "\nloss"
    for step in range(500) :
        sess.run(train, feed_dict={x0:x,y0:y_})
        if (step + 1) % 100 == 0 :
            print sess.run(loss, feed_dict={x0:x,y0:y_})
    y_out_value , y_binary_value = sess.run([y_out,y_binary], feed_dict={x0:x,y0:y_})
    print "\nThe expected output is :"
    print np.array(y_)
    print "\nThe softmax output is :"
    print np.array(y_out_value)
    print "\nThe binary output is :"
    print np.array(y_binary_value)
    print ""
Output
The expected output is :
[[ 1. 0.]
[ 1. 0.]
[ 0. 1.]
[ 0. 1.]]
The softmax output is :
[[ 0.96538627 0.03461381]
[ 0.81609273 0.18390732]
[ 0.11534476 0.88465524]
[ 0.0978259 0.90217412]]
The binary output is :
[[ 1. 0.]
[ 1. 0.]
[ 0. 1.]
[ 0. 1.]]
As you can see, you can retrieve the probability outputs OR the probabilities cast as binary and still have all the benefits of classic logits.
Cheers.

I guess it is important to note that, for a typical classification problem, the output of a neural net is actually the posterior probability computed for each of the classes present.
The figures returned tell you how likely the output is to be of class A, B, or C given the input x, so you cannot expect to always get 0 or 1.
# An example: suppose for input x I get
#   Output = [0.5, 0.2, 0.3]
# I predict the class should be A because it has the highest posterior
# (0.5, the highest of the 3 values returned):
#   Class = A (0.5)
# Or I might as well round it. TensorFlow can do this for you.
So I guess you should take the output and apply a decision rule that fits your model, for example taking the highest value in the returned predictions as the class the input belongs to.
You should not expect an absolute one-or-zero prediction from the network itself.
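As a minimal sketch of that decision rule (plain NumPy here; preds is a hypothetical posterior output like the one above, not something returned by your model):
import numpy as np
# Hypothetical posterior output for three classes (A, B, C) given one input x.
preds = np.array([[0.5, 0.2, 0.3]])
# Pick the class with the highest posterior; here index 0, i.e. class A.
class_index = np.argmax(preds, axis=1)
print(class_index)                             # [0]
# If a 0/1 vector is needed, one-hot encode that choice.
one_hot = np.eye(preds.shape[1])[class_index]
print(one_hot)                                 # [[1. 0. 0.]]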
Be careful about this fact; it is a common mistake. And please do read the paper below. Once you have posteriors, you can build further models on top of them; there is no limit to what you can achieve!
For example, you can apply Gaussian mixture models, Markov models, decision trees, or combinations of expert systems on top of the output; those are elegant and principled approaches.
Read this paper for more info.
http://www.ee.iisc.ac.in/people/faculty/prasantg/downloads/NeuralNetworksPosteriors_Lippmann1991.pdf
Hope it helps!

Related

Tensorflow - Is there a simple way to zero out the losses of the samples with the highest losses in a mini-batch?

I am training a neural network for classification. In the context of my research, I would like to zero out the k highest losses in each minibatch. I couldn't figure out a simple way to perform this procedure without relying on numpy at some level.
I have tried the following procedure:
1. Compute the argsort indices of the losses tensor -- this returns a tf Tensor
2. Slice the losses tensor with that indices tensor
The issue is that the slicing cannot be performed using a tf Tensor as the index.
# losses is tf.Tensor
ind_sorted = tf.argsort(losses)
losses_sorted = losses[ind_sorted] # Error mentioned above
# The issue is that ind_sorted depends on the output of the neural network. I couldn't find an equivalent of the detach method in pytorch
k_smallest_losses = losses_sorted[:k] # Keeping only the k smallest losses
loss = tf.reduce_sum(k_smallest_losses) # Summing the k smallest losses
Probably you want to use tf.nn.top_k, which returns both the values and the indices of the top k items. (Note that to get the smallest losses, I negate the losses and convert them back when done.)
batch = 2
max_len = 6
losses = tf.random.uniform(shape=[batch, max_len], minval=0, maxval=2, dtype = tf.float32)
bottom_losses_values, bottom_losses_indices = tf.nn.top_k(-losses, k=3)
total = tf.reduce_sum(-bottom_losses_values, axis=-1)
with tf.Session() as sess:
    losses, bottom_losses_values, bottom_losses_indices, total = sess.run([losses, bottom_losses_values, bottom_losses_indices, total])
    print 'original losses\n', losses
    print 'bottom 3 loss values\n', -bottom_losses_values
    print 'bottom 3 loss indices\n', bottom_losses_indices
    print 'total\n', total
Results:
original losses
[[ 1.45301318 1.65069246 1.31003475 1.71488905 1.71400714 0.0543921 ]
[ 0.09954047 0.12081003 0.24793792 1.51561213 1.73758292 1.43859148]]
bottom 3 loss values
[[ 0.0543921 1.31003475 1.45301318]
[ 0.09954047 0.12081003 0.24793792]]
bottom 3 loss indices
[[5 2 0]
[0 1 2]]
total
[ 2.81744003 0.46828842]
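If you actually want to keep everything except the k highest losses (which is what the question asks), a minimal sketch of the same trick would be the following (batch, max_len and k are made-up sizes, not values from the question):
import tensorflow as tf
batch, max_len, k = 2, 6, 2
losses = tf.random.uniform(shape=[batch, max_len], minval=0, maxval=2, dtype=tf.float32)
# Keep the (max_len - k) smallest losses per row, i.e. drop the k highest ones.
kept_values, kept_indices = tf.nn.top_k(-losses, k=max_len - k)
loss = tf.reduce_sum(-kept_values, axis=-1)
with tf.Session() as sess:
    print(sess.run([losses, loss]))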

Solving XOR with 3 data points using Multi-Layered Perceptron

The XOR problem is known to be solvable by a multi-layer perceptron: given all 4 boolean inputs and outputs, it trains and memorizes the weights needed to reproduce the I/O. E.g.
import numpy as np
np.random.seed(0)
def sigmoid(x): # Squashes each value into the range (0, 1).
    return 1 / (1 + np.exp(-x))
def sigmoid_derivative(sx):
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)
# Cost functions.
def cost(predicted, truth):
    return truth - predicted
xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
xor_output = np.array([[0,1,1,0]]).T
X = xor_input
Y = xor_output
# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Let's set the dimensions for the intermediate layer.
hidden_dim = 5
# Initialize weights between the input layer and the hidden layer.
W1 = np.random.random((input_dim, hidden_dim))
# Define the shape of the output vector.
output_dim = len(Y.T)
# Initialize weights between the hidden layer and the output layer.
W2 = np.random.random((hidden_dim, output_dim))
num_epochs = 10000
learning_rate = 1.0
for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.
    # Inside the perceptron, Step 2.
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))
    # Back propagation (Y -> layer2)
    # How much did we miss in the predictions?
    layer2_error = cost(layer2, Y)
    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    layer2_delta = layer2_error * sigmoid_derivative(layer2)
    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)
    # Update weights.
    W2 += learning_rate * np.dot(layer1.T, layer2_delta)
    W1 += learning_rate * np.dot(layer0.T, layer1_delta)
We see that we've fully trained the network to memorize the outputs for XOR:
# On the training data
[int(prediction > 0.5) for prediction in layer2]
[out]:
[0, 1, 1, 0]
If we re-feed the same inputs, we get the same output:
for x, y in zip(X, Y):
    layer1_prediction = sigmoid(np.dot(W1.T, x)) # Feed the input into the trained W1.
    prediction = layer2_prediction = sigmoid(np.dot(W2.T, layer1_prediction)) # Feed the hidden activation into the trained W2.
    print(int(prediction > 0.5), y)
[out]:
0 [0]
1 [1]
1 [1]
0 [0]
But if we retrain the parameters (W1 and W2) without one of the data points, i.e.
xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
xor_output = np.array([[0,1,1,0]]).T
Let's drop the last row of data and use that as unseen test.
X = xor_input[:-1]
Y = xor_output[:-1]
And with the rest of the same code unchanged, regardless of how I change the hyperparameters, it is unable to learn the XOR function and reproduce the I/O.
for x, y in zip(xor_input, xor_output):
    layer1_prediction = sigmoid(np.dot(W1.T, x)) # Feed the unseen input into trained W.
    prediction = layer2_prediction = sigmoid(np.dot(W2.T, layer1_prediction)) # Feed the unseen input into trained W.
    print(int(prediction > 0.5), y)
[out]:
0 [0]
1 [1]
1 [1]
1 [0]
Even if we shuffle the in-/output:
import random
# Shuffle the order of the inputs
_temp = list(zip(X, Y))
random.shuffle(_temp)
xor_input_shuff, xor_output_shuff = map(np.array, zip(*_temp))
We still can't learn the XOR function fully:
for x, y in zip(xor_input, xor_output):
    layer1_prediction = sigmoid(np.dot(W1.T, x)) # Feed the unseen input into trained W.
    prediction = layer2_prediction = sigmoid(np.dot(W2.T, layer1_prediction)) # Feed the unseen input into trained W.
    print(x, int(prediction > 0.5), y)
[out]:
[0 0] 1 [0]
[0 1] 1 [1]
[1 0] 1 [1]
[1 1] 0 [0]
So when the literature states that the multi-layered perceptron (a.k.a. basic deep learning) solves XOR, does it mean that it can fully learn and memorize the weights given the full set of in-/outputs, but cannot generalize the XOR problem when one of the data points is missing?
Here's the Kaggle link so that answerers can test the network for themselves: https://www.kaggle.com/alvations/xor-with-mlp/
I think learning (generalizing) XOR and memorizing XOR are different things.
A two-layer perceptron can memorize XOR, as you have seen; that is, there exists a combination of weights where the loss is at its minimum and equal to 0 (the absolute minimum).
If the weights are randomly initialized, you might end up in a situation where you have actually learned XOR and not only memorized it.
Note that multi-layer perceptrons are non-convex functions, so there could be multiple minima (even multiple global minima). When the data is missing one input, there are multiple minima (all equal in value), and there exist minima where the missing point would be classified correctly. Hence, an MLP can learn XOR (though finding that weight combination might be hard with a missing point).
It is quite often argued that neural networks are universal function approximators and can even fit nonsense labels. In that light, you might want to look at this work: https://arxiv.org/abs/1611.03530

How does BatchNormalization in Keras work?

I want to know how BatchNormalization works in keras, so I write the code:
X_input = keras.Input((2,))
X = keras.layers.BatchNormalization(axis=1)(X_input)
model1 = keras.Model(inputs=X_input, outputs=X)
The input is a batch of two-dimensional vectors, normalized along axis=1. Then I print the output:
a = np.arange(4).reshape((2,2))
print('a=')
print(a)
print('output=')
print(model1.predict(a,batch_size=2))
and the output is:
a=
array([[0, 1],
[2, 3]])
output=
array([[ 0. , 0.99950039],
[ 1.99900079, 2.9985013 ]], dtype=float32)
I cannot figure out these results. As far as I know, the mean of the batch should be ([0,1] + [2,3])/2 = [1,2] and the variance 1/2*(([0,1] - [1,2])^2 + ([2,3] - [1,2])^2) = [1,1]. Normalizing with (x - mean)/sqrt(var) should therefore give [-1, -1] and [1, 1]. Where am I wrong?
BatchNormalization will subtract the mean, divide by the standard deviation, apply a factor gamma and an offset beta. If these parameters were actually the mean and variance of your batch, the result would be centered around zero with variance 1.
But they are not. The Keras BatchNormalization layer stores these as weights, called moving_mean, moving_variance, beta and gamma (beta and gamma are trainable; the moving statistics are updated during training). They are initialized as beta=0, gamma=1, moving_mean=0 and moving_variance=1. Since you don't have any training steps, BatchNorm does not change your values.
So, why don't you get exactly your input values? Because there is another parameter, epsilon (a small number), which gets added to the variance. Therefore, all values are divided by sqrt(1 + epsilon) and end up a little bit below their input values.
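A minimal NumPy sketch of what the untrained layer computes at prediction time (assuming the Keras defaults moving_mean=0, moving_variance=1, gamma=1, beta=0 and epsilon=1e-3):
import numpy as np
a = np.arange(4).reshape((2, 2)).astype(np.float32)
moving_mean, moving_variance = 0.0, 1.0   # untrained defaults
gamma, beta = 1.0, 0.0                    # untrained defaults
epsilon = 1e-3
# Inference-time formula: scale by gamma / sqrt(var + eps), shift by beta.
out = gamma * (a - moving_mean) / np.sqrt(moving_variance + epsilon) + beta
print(out)  # approximately [[0. 0.9995] [1.999 2.9985]], matching model1.predict above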

How does backpropagation work in TensorFlow

In TensorFlow it seems that the entire backpropagation algorithm is performed by a single run of an optimizer on a certain cost function, which is the output of some MLP or CNN.
I do not fully understand how TensorFlow knows from the cost alone that it is indeed the output of a certain NN. A cost function can be defined for any model. How should I "tell" it that a certain cost function derives from a NN?
Question
How should I "tell" tf that a certain cost function derives from a NN?
(short) Answer
This is done by simply configuring your optimizer to minimize (or maximize) a tensor. For example, if I have a loss function like so
loss = tf.reduce_sum( tf.square( y0 - y_out ) )
where y0 is the ground truth (or desired output) and y_out is the calculated output, then I could minimize the loss by defining my training function like so
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
This tells TensorFlow that when train is computed, it should apply gradient descent on loss to minimize it; loss is calculated using y0 and y_out, so gradient descent will also affect those (if they are trainable variables), and so on.
The variables y0, y_out, loss, and train are not standard Python values but descriptions of a computation graph. TensorFlow uses information about that computation graph to unroll it while applying gradient descent.
Specifically how it does that is beyond the scope of this answer. Here and here are two good starting points for more information about more specifics.
Code Example
Let's walk through a code example. First the code.
### imports
import tensorflow as tf
### constant data
x = [[0.,0.],[1.,1.],[1.,0.],[0.,1.]]
y_ = [[0.],[0.],[1.],[1.]]
### induction
# 1x2 input -> 2x3 hidden sigmoid -> 3x1 sigmoid output
# Layer 0 = the x2 inputs
x0 = tf.constant( x , dtype=tf.float32 )
y0 = tf.constant( y_ , dtype=tf.float32 )
# Layer 1 = the 2x3 hidden sigmoid
m1 = tf.Variable( tf.random_uniform( [2,3] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
b1 = tf.Variable( tf.random_uniform( [3] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
h1 = tf.sigmoid( tf.matmul( x0,m1 ) + b1 )
# Layer 2 = the 3x1 sigmoid output
m2 = tf.Variable( tf.random_uniform( [3,1] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
b2 = tf.Variable( tf.random_uniform( [1] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
y_out = tf.sigmoid( tf.matmul( h1,m2 ) + b2 )
### loss
# loss : sum of the squares of y0 - y_out
loss = tf.reduce_sum( tf.square( y0 - y_out ) )
# training step : gradient decent (1.0) to minimize loss
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
### training
# run 500 times using all the X and Y
# print out the loss and any other interesting info
with tf.Session() as sess:
    sess.run( tf.global_variables_initializer() )
    for step in range(500) :
        sess.run(train)
    results = sess.run([m1,b1,m2,b2,y_out,loss])
    labels = "m1,b1,m2,b2,y_out,loss".split(",")
    for label,result in zip(*(labels,results)) :
        print ""
        print label
        print result
print ""
Let's go through it, but in reverse order starting with
sess.run(train)
This tells TensorFlow to look up the graph node defined by train and compute it. train is defined as
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
To compute this, TensorFlow must perform automatic differentiation on loss, which means walking the graph. loss is defined as
loss = tf.reduce_sum( tf.square( y0 - y_out ) )
This is really TensorFlow applying automatic differentiation to unroll first tf.reduce_sum, then tf.square, then y0 - y_out, which in turn requires walking the graph for both y0 and y_out.
y0 = tf.constant( y_ , dtype=tf.float32 )
y0 is a constant and will not be updated.
y_out = tf.sigmoid( tf.matmul( h1,m2 ) + b2 )
y_out will be processed similarly to loss: first tf.sigmoid will be processed, and so on...
All in all, each operation (such as tf.sigmoid or tf.square) defines not only the forward computation (apply sigmoid or square) but also the information necessary for automatic differentiation. This is different from standard Python math such as
x = 7 + 9
The above statement encodes nothing except how to update x, whereas
z = y0 - y_out
encodes the graph of subtracting y_out from y0, and stores in z both the forward operation and enough information to do automatic differentiation.
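As a minimal sketch of what the optimizer does under the hood (this assumes m1, b1, m2, b2 and loss from the code example above are in scope), you can ask the graph for those derivatives yourself with tf.gradients:
grads = tf.gradients(loss, [m1, b1, m2, b2])  # symbolic d(loss)/d(variable) for each variable
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    grad_values = sess.run(grads)             # concrete gradient arrays for the current weights
    print([g.shape for g in grad_values])     # [(2, 3), (3,), (3, 1), (1,)]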
Backpropagation was introduced by Rumelhart, Hinton et al. and published in Nature in 1986.
As stated in Section 6.5, Back-Propagation and Other Differentiation Algorithms, of the Deep Learning book, there are two approaches for back-propagating gradients through computational graphs: symbol-to-number differentiation and symbol-to-symbol derivatives. The one more relevant to TensorFlow, as stated in the paper A Tour of TensorFlow, is the latter, which can be illustrated with this diagram:
[Figure 7: symbol-to-symbol derivative graph. Source: Section II, Part D of A Tour of TensorFlow]
On the left side of Fig. 7 above, w represents the weights (or Variables) in TensorFlow, and x and y are two intermediate operations (or nodes; w, x, y and z are all operations) on the way to the scalar loss z.
TensorFlow will add a gradient node for each node in the graph (if we print the variable names in a checkpoint we can see some additional variables for such nodes; they are eliminated if we freeze the model into a protocol buffer file for deployment), which can be seen in diagram (b) on the right side: dz/dy, dy/dx, dx/dw.
During the back-propagation traversal, at each node we multiply its gradient with that of the previous one, finally obtaining a symbolic handle to the overall target derivative dz/dw = dz/dy * dy/dx * dx/dw, which is exactly the chain rule applied. Once the gradient is worked out, w can update itself using a learning rate.
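As a toy sketch of that chain rule (hypothetical scalar ops standing in for the w -> x -> y -> z chain in the figure, not the model from the question):
import tensorflow as tf
w = tf.Variable(2.0)
x = w * w        # dx/dw = 2w
y = 3.0 * x      # dy/dx = 3
z = y * y        # dz/dy = 2y
# Symbol-to-symbol differentiation: TensorFlow adds graph nodes for dz/dw.
dz_dw = tf.gradients(z, w)[0]  # chain rule: 2y * 3 * 2w
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(dz_dw))     # 288.0, i.e. (2*12) * 3 * (2*2)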
For more detailed information please read this paper: TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems.

How to quantize the values of tf.Variables in Tensorflow

I have a training model like
Y = w * X + b
where Y and X are the output and input placeholders, and w and b are vectors.
I already know the value of w can only be 0 or 1, while b is still tf.float32.
How could I quantize the range of variable w when I define it?
or
Can I have two different learning rates? The rate for w is 1 or -1 and the rate for b is 0.0001 as usual.
There is no way to restrict the values a variable can take during optimization itself. But what you can do is limit it after each iteration. Here is one way to do this with tf.where():
import tensorflow as tf
a = tf.random_uniform(shape=(3, 3))
b = tf.where(
    tf.less(a, tf.zeros_like(a) + 0.5),
    tf.zeros_like(a),
    tf.ones_like(a)
)
with tf.Session() as sess:
    A, B = sess.run([a, b])
    print A, '\n'
    print B
This converts everything at or above 0.5 to 1 and everything below to 0:
[[ 0.2068541 0.12682056 0.73839438]
[ 0.00512838 0.43465161 0.98486936]
[ 0.32126224 0.29998791 0.31065524]]
[[ 0. 0. 1.]
[ 0. 0. 1.]
[ 0. 0. 0.]]
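To actually limit w after each iteration, as described above, one hedged sketch is to round and clip the variable with an explicit assign op and run it after every training step (w, train and the rest of the model are assumed to be defined as in your setup):
clamp_w = tf.assign(w, tf.round(tf.clip_by_value(w, 0.0, 1.0)))  # force w back to {0, 1}
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        sess.run(train)    # normal float update of w and b
        sess.run(clamp_w)  # snap w back to 0/1 after the update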
One method I have used to limit variables to a particular range is to add a constraint to my loss equation. If the variable goes outside of the desired range, then the loss will get bigger and the optimizer will push it back within the desired range.
For example:
#initialize variable to be between 0 and 1
variable = tf.Variable(tf.random_uniform([self.numOutputs], 0, 1))
#Clip the variable to force the result to be between 0 and 1 during training
variableClipped = tf.clip_by_value(variable, 0, 1)
#Set the loss to be the difference between the clipped variable and actual variable.
#Anytime it goes outside the variable range the loss will increase,
#and the optimizer will push it back within the desired range.
loss = originalLossEquation + tf.reduce_sum((variable - variableClipped)**2)