Loss functions in TensorFlow (with an if-else)

I am trying different loss functions in TensorFlow.
The loss function I want is a kind of epsilon-insensitive function (applied componentwise):
if |yData - yModel| < epsilon:
    loss = 0
else:
    loss = |yData - yModel|
I tried this solution:
yData = tf.placeholder("float", [None, numberOutputs])
yModel = model(...
epsilon = 0.2
epsilonTensor = epsilon * tf.ones_like(yData)
loss = tf.maximum(tf.abs(yData - yModel) - epsilonTensor, tf.zeros_like(yData))
optimizer = tf.train.GradientDescentOptimizer(0.25)
train = optimizer.minimize(loss)
I also used
optimizer = tf.train.MomentumOptimizer(0.001,0.9)
I cannot find any error in the implementation. However, it does not converge, while loss = tf.square(yData - yModel) converges, and loss = tf.maximum(tf.square(yData - yModel) - epsilonTensor, tf.zeros_like(yData)) also converges.
So I also tried something simpler, loss = tf.abs(yData - yModel), and it does not converge either. Am I making a mistake, running into problems with the non-differentiability of abs at zero, or something else? What is happening with the abs function?

When your loss is something like Loss(x) = abs(x - y), the solution is an unstable fixed point of SGD: start your minimization at a point arbitrarily close to the solution, and the next step will increase the loss.
Having a stable fixed point is a requirement for the convergence of an iterative procedure like SGD. In practice this means your optimization will move towards a local minimum, but after getting close enough it will jump around the solution with steps proportional to the learning rate. Here's a toy TensorFlow program that illustrates the problem:
from matplotlib import pyplot
import tensorflow as tf

x = tf.Variable(0.)
loss_op = tf.abs(x - 1.05)
opt = tf.train.GradientDescentOptimizer(0.1)
train_op = opt.minimize(loss_op)
sess = tf.InteractiveSession()
sess.run(tf.initialize_all_variables())
xvals = []
for i in range(20):
    unused, loss, xval = sess.run([train_op, loss_op, x])
    xvals.append(xval)
pyplot.plot(xvals)
Some solutions to the problem:
1. Use a more robust solver such as the Proximal Gradient Method
2. Use a more SGD-friendly loss function, such as the Huber loss
3. Use a learning rate schedule to gradually decrease the learning rate
Here's a way to implement (3) on the toy problem above:
x = tf.Variable(0.)
loss_op = tf.abs(x - 1.05)
step = tf.Variable(0)
learning_rate = tf.train.exponential_decay(
    0.2,   # Base learning rate.
    step,  # Current index into the dataset.
    1,     # Decay step.
    0.9    # Decay rate.
)
opt = tf.train.GradientDescentOptimizer(learning_rate)
train_op = opt.minimize(loss_op, global_step=step)
sess = tf.InteractiveSession()
sess.run(tf.initialize_all_variables())
xvals = []
for i in range(40):
    unused, loss, xval = sess.run([train_op, loss_op, x])
    xvals.append(xval)
pyplot.plot(xvals)
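And here's a minimal sketch of (2) on the same toy problem, using tf.losses.huber_loss from TF 1.x (the delta value below is an arbitrary illustrative choice):
# Sketch of (2): the Huber loss is quadratic near the minimum, so the gradient
# shrinks as x approaches the target instead of keeping a constant magnitude.
x = tf.Variable([0.])
target = tf.constant([1.05])
loss_op = tf.losses.huber_loss(labels=target, predictions=x, delta=0.1)
opt = tf.train.GradientDescentOptimizer(0.1)
train_op = opt.minimize(loss_op)
sess = tf.InteractiveSession()
sess.run(tf.initialize_all_variables())
xvals = []
for i in range(40):
    unused, xval = sess.run([train_op, x])
    xvals.append(xval[0])
pyplot.plot(xvals)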

Related

Why does my gradient descent optimizer blow up after getting close to a solution?

I'm trying to run through a simple linear regression example in Tensorflow, and it appears that the training algorithm is converging to a solution, but once it gets close to the solution, it starts bouncing around and eventually blows up.
I'm passing data for a y = 2x line, so the gradient descent optimizer should be able to easily converge to a solution.
import tensorflow as tf

M = tf.Variable([0.4], dtype=tf.float32)
b = tf.Variable([-0.4], dtype=tf.float32)
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
linear_model = M * x + b
error = linear_model - y
loss = tf.square(error)
optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    for i in range(100):
        sess.run(optimizer, {x: i, y: 2 * i})
        print(sess.run([M, b]))
Here's the result (plot not included); I circled the portion where it gets close to a solution. Why does gradient descent break down once it gets close to the solution, or is there something I'm doing wrong?
Your code feeds the training data one point at a time, for only one epoch. This corresponds to stochastic gradient descent, where the loss value tends to fluctuate more than with batch or mini-batch gradient descent during training. Moreover, since the data is fed in increasing order of x, the gradient magnitude also grows along with x. That is why you see larger fluctuations in the later part of the epoch.
This can happen if the learning rate is too high; try lowering it.
My guess would be that you have chosen a high learning rate. You can use grid search to find the optimal learning rate, then fit the data with it.
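A minimal sketch of that grid-search idea on the toy model above (the candidate learning rates and the helper function are hypothetical, not from the original post):
import tensorflow as tf

# Hypothetical helper: train the same toy model with a given learning rate
# and report the loss on one held-out point, so the rates can be compared.
def final_loss(learning_rate):
    tf.reset_default_graph()
    M = tf.Variable([0.4], dtype=tf.float32)
    b = tf.Variable([-0.4], dtype=tf.float32)
    x = tf.placeholder(tf.float32)
    y = tf.placeholder(tf.float32)
    loss = tf.square(M * x + b - y)
    train = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for i in range(100):
            sess.run(train, {x: i, y: 2 * i})
        return sess.run(loss, {x: 10.0, y: 20.0})

# rates that diverge will show up as inf/nan final losses
for lr in [0.01, 0.001, 0.0001, 0.00001]:
    print(lr, final_loss(lr))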

What does batch normalization do if the batch size is one?

I'm currently reading the paper by Ioffe and Szegedy on Batch Normalization, and I'm wondering what happens if the batch size is set to one. The computation of the mini-batch mean (which would basically be the value of the activation itself) and variance (which should be zero, plus the constant epsilon) would lead to a normalized output of zero.
Yet this small example in TensorFlow shows that something different is happening:
import numpy as np
import tensorflow as tf

test_img = np.array([[[[50], [100]],
                      [[150], [200]]]], np.float32)
gt_img = np.array([[[[60], [130]],
                    [[180], [225]]]], np.float32)

test_img_op = tf.convert_to_tensor(test_img, tf.float32)
norm_op = tf.layers.batch_normalization(test_img_op)
loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=gt_img,
                                                                 logits=norm_op))

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    optimizer_obj = tf.train.AdamOptimizer(0.01).minimize(loss_op)

with tf.Session() as sess:
    sess.run(tf.group(tf.global_variables_initializer(),
                      tf.local_variables_initializer()))
    print(test_img)
    while True:
        new_img, op, lossy, trainable = sess.run([norm_op, optimizer_obj, loss_op, tf.trainable_variables()])
        print(trainable)
        print(new_img)
So what is TensorFlow doing differently (a moving average?)?
Thank you!
Because of beta, the learnable translation parameter, which is enabled by default, the normalized output will not necessarily be zero.
Moving averages of the input mean and variance are computed during training and can be used at test time (if you set is_training accordingly).
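To make the batch-size-one intuition concrete, here is a small NumPy sketch of the batch-norm formula for a single activation normalized over the batch dimension only (the gamma, beta, and epsilon values are made up for illustration):
import numpy as np

# batch norm: y = gamma * (x - mean) / sqrt(var + eps) + beta
x = np.array([50.0])               # a batch containing a single activation
gamma, beta, eps = 1.0, 0.5, 1e-3  # hypothetical learned/constant parameters
mean = x.mean()                    # equals the activation itself for batch size 1
var = x.var()                      # equals zero for batch size 1
y = gamma * (x - mean) / np.sqrt(var + eps) + beta
print(y)                           # [0.5] -- the output collapses to beta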

TensorFlow: Linear regression loss increasing (instead of decreasing) with successive epochs

I'm learning TensorFlow and trying to apply it to a simple linear regression problem. data is a numpy.ndarray of shape [42, 2].
I'm a bit puzzled why the loss increases after each successive epoch. Isn't the loss expected to go down with each successive epoch?
Here is my code (let me know if you'd like me to share the output as well). Thanks a lot for taking the time to answer.
1) created the placeholders for dependent / independent variables
X = tf.placeholder(tf.float32, name='X')
Y = tf.placeholder(tf.float32,name='Y')
2) created vars for weight, bias, total_loss (after each epoch)
w = tf.Variable(0.0,name='weights')
b = tf.Variable(0.0,name='bias')
3) defined loss function & optimizer
Y_pred = X * w + b
loss = tf.reduce_sum(tf.square(Y - Y_pred), name = 'loss')
optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.001).minimize(loss)
4) created summary events & event file writer
tf.summary.scalar(name = 'weight', tensor = w)
tf.summary.scalar(name = 'bias', tensor = b)
tf.summary.scalar(name = 'loss', tensor = loss)
merged = tf.summary.merge_all()
evt_file = tf.summary.FileWriter('def_g')
evt_file.add_graph(tf.get_default_graph())
5) and execute all in a session
with tf.Session() as sess1:
    sess1.run(tf.variables_initializer(tf.global_variables()))
    for epoch in range(10):
        summary, _, l = sess1.run([merged, optimizer, loss], feed_dict={X: data[:,0], Y: data[:,1]})
        evt_file.add_summary(summary, epoch+1)
        evt_file.flush()
        print(" new_loss: {}".format(sess1.run(loss, feed_dict={X: data[:,0], Y: data[:,1]})))
Cheers!
The short answer is that your learning rate is too big. I was able to get reasonable results by changing it from 0.001 to 0.0001, but I only used the 23 points from your second-last comment (I initially didn't notice your last comment), so using all the data might require an even lower number.
0.001 seems like a really low learning rate. However, the real problem is that your loss function is using reduce_sum instead of reduce_mean. This causes your loss to be a large number, which sends a very strong signal to the GradientDescentOptimizer, so it's overshooting despite the low learning rate. The problem would only get worse if you added more points to your training data. So use reduce_mean to get the average squared error and your algorithms will be much better behaved.
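A sketch of that suggested change, applied to step 3 above:
# average the squared error instead of summing it, so the gradient
# magnitude does not scale with the number of data points
loss = tf.reduce_mean(tf.square(Y - Y_pred), name='loss')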

Linear regression model in TensorFlow can't learn the bias

I am trying to train a linear regression model in TensorFlow using some generated data. The model seems to learn the slope of the line, but is unable to learn the bias.
I have tried changing the number of epochs, the weight (slope), and the bias, but every time the bias learnt by the model comes out as zero. I don't know where I am going wrong, and some help would be appreciated.
Here is the code.
import numpy as np
import tensorflow as tf

# assume the linear model to be Y = W*X + b
X = tf.placeholder(tf.float32, [None, 1])
Y = tf.placeholder(tf.float32, [None, 1])

# the weight and bias
W = tf.Variable(tf.zeros([1, 1]))
b = tf.Variable(tf.zeros([1]))

# the model
prediction = tf.matmul(X, W) + b

# the cost function
cost = tf.reduce_mean(tf.square(Y - prediction))

# Use gradient descent
learning_rate = 0.000001
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

steps = 1000
epochs = 10
Verbose = False

# In the end, the model should learn these values
test_w = 3
bias = 10

for _ in xrange(epochs):
    for i in xrange(steps):
        # make fake data for the model
        # feed one example at a time
        # stochastic gradient descent, because we only use one example at a time
        x_temp = np.array([[i]])
        y_temp = np.array([[test_w * i + bias]])
        # train the model using the data
        feed_dict = {X: x_temp, Y: y_temp}
        sess.run(train_step, feed_dict=feed_dict)
        if Verbose and i % 100 == 0:
            print("Iteration No: %d" % i)
            print("W = %f" % sess.run(W))
            print("b = %f" % sess.run(b))

print("Finally:")
print("W = %f" % sess.run(W))
print("b = %f" % sess.run(b))
# These values should be close to the values we used to generate data
https://github.com/HarshdeepGupta/tensorflow_notebooks/blob/master/Linear%20Regression.ipynb
The outputs are printed by the last lines of code.
The model needs to learn test_w and bias (in the linked notebook, they are in the 3rd cell, after the first comment), which are set to 3 and 10 respectively.
The model correctly learns the weight (slope), but is unable to learn the bias. Where is the error?
The main problem is that you are feeding just one sample at a time to the model. This makes your optimizer very unstable, which is why you have to use such a small learning rate. I suggest feeding more samples in each step.
If you insist on feeding one sample at a time, consider using an optimizer with momentum, such as tf.train.AdamOptimizer(learning_rate). That way you can increase the learning rate and reach convergence.
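Here is a rough sketch of the mini-batch suggestion, reusing the names from the question's code; the batch size and the random sampling are illustrative choices, not from the original post:
# Sketch: feed a whole mini-batch per step instead of a single example.
batch_size = 64
for _ in xrange(epochs * steps):
    # sample a mini-batch of x values in the same range as before
    x_temp = np.random.uniform(0, steps, size=(batch_size, 1)).astype(np.float32)
    y_temp = test_w * x_temp + bias
    # one update per mini-batch (combine with a larger learning rate or Adam, as suggested above)
    sess.run(train_step, feed_dict={X: x_temp, Y: y_temp})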

TensorFlow gradient descent optimizer returns NaN, while the cross-entropy cost returns a number

import tensorflow as tf

tf_X = tf.placeholder("float", [None, 3073])
tf_Y = tf.placeholder("float", [None, 10])
tf_W = tf.Variable(0.001 * tf.random_normal([3073, 10]))
# tf.random_uniform([3073,10],-0.1,0.1)
tf_learning_rate = 0.0001

hypothesis = tf.nn.softmax(tf.matmul(tf_X, tf_W))  # output is the softmax value for each class
cost = tf.reduce_mean(tf.reduce_sum(tf_Y * -tf.log(hypothesis), reduction_indices=1))
optimizer = tf.train.GradientDescentOptimizer(tf_learning_rate).minimize(cost)

init = tf.initialize_all_variables()
with tf.Session() as sess:
    sess.run(init)
    print sess.run(cost, feed_dict={tf_X: X_dev, tf_Y: onehot_y_dev})
    for step in xrange(400):
        sess.run(optimizer, feed_dict={tf_X: X_dev, tf_Y: onehot_y_dev})  # we have to use one-hot coding for y
        if step % 200 == 0:
            print step, sess.run(cost, feed_dict={tf_X: X_dev, tf_Y: onehot_y_dev})
I'm trying to implement softmax cross-entropy in TensorFlow.
When I sess.run(cost) it returns a number (2.322),
but after I run the GradientDescentOptimizer, the cost returned is NaN.
What's happening here? Did I implement the optimizer step wrong?
OK, I found the problem myself.
I had always heard of exploding gradients, but the problem here is an exploding cost; I did not know that was a thing.
After my first gradient-descent step, my high learning rate forced the cost to explode into -log(a number very close to zero).
If you have the same problem as I had, lower your learning rate, or clip the values (e.g. with tf.clip_by_value) to mitigate this type of problem.
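A common way to avoid the -log(0) problem (not part of the original answer, but a standard alternative) is to let TensorFlow compute the softmax and the cross-entropy together from the raw logits, which is numerically stable:
# compute the cross-entropy from the logits directly, so the log of a value
# near zero never appears explicitly in the graph
logits = tf.matmul(tf_X, tf_W)
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=tf_Y, logits=logits))
optimizer = tf.train.GradientDescentOptimizer(tf_learning_rate).minimize(cost)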