How to set bounds and constraints on Tensorflow Variables (tf.Variable) - tensorflow

I am using Tensorflow to minimize a function. The function takes about 10 parameters. Every single parameter has bounds, e.g. a minimum and a maximum value the parameter is allowed to take. For example, the parameter x1 needs to be between 1 and 10.
I also have a pair of parameters that need to have the following constraint x2 > x3. In other words, x2 must always be bigger than x3. (In addition to this, x2 and x3 also have bounds, similarly to the example of x1 above.)
I know that tf.Variable has a "constraint" argument, however I can't really find any examples or documentation on how to use this to achieve the bounds and constraints as mentioned above.
Thank you!

It seems to me (I can be mistaken) that constrained optimization (you can google for it in tensorflow) is not exactly the case for which tensroflow was designed. You may want to take a look at this repo, it may satisfy your needs, but as far as I understand, it's still not solving arbitrary constrained optimization, just some classification problems with labels and features, compatible with precision/recall scores.
If you want to use constraints on the tensorflow variable (i.e. some function applied after gradient step - which you can do manually also - by taking variable values, doing manipulations, and reassigning then), it means that you will be cutting variables after each step done using gradient in general space. It's a question whether you will successfully reach the right optimization goal this way, or your variables will stuck at boundaries, because general gradient will point somewhere outside.
My approach 1
If your problem is simple enough. you can try to parametrize your x2 and x3 as x2 = x3 + t, and then try to do cutting in the graph:
x3 = tf.get_variable('x3',
dtype=tf.float32,
shape=(1,),
initializer=tf.random_uniform_initializer(minval=1., maxval=10.),
constraint=lambda z: tf.clip_by_value(z, 1, 10))
t = tf.get_variable('t',
dtype=tf.float32,
shape=(1,),
initializer=tf.random_uniform_initializer(minval=1., maxval=10.),
constraint=lambda z: tf.clip_by_value(z, 1, 10))
x2 = x3 + t
Then, on a separate call additionally do
sess.run(tf.assign(x2, tf.clip_by_value(x2, 1.0, 10.0)))
But my opinion is that it won't work well.
My approach 2
I would also try to invent some loss terms to keep variables within constraints, which is more likely to work. For example, constraint for x2 to be in the interval [1,10] will be:
loss += alpha*tf.abs(tf.math.tan(((x-5.5)/4.5)*pi/2))
Here the expression under tan is brought to -pi/2,pi/2 and then tan function is used to make it grow very rapidly when it reaches boundaries. In this case I think you're more likely to find your optimum, but again the loss weight alpha might be too big and training will stuck somewhere nearby, if required value of x2 lies near the boundary. In this case you can try to use smaller alpha.

In addition to the answer by Slowpoke, reparameterization is another option. E.g. let's say you have a param p which should be bounded in [lower_bound,upper_bound], you could write:
p_inner = tf.Variable(...) # unbounded
p = tf.sigmoid(p_inner) * (upper_bound - lower_bound) + lower_bound
However, this will change the behavior of gradient descent.

Related

Verify that points lie on a grid of specified pitch

While I am trying to solve this problem in a context where numpy is used heavily (and therefore an elegant numpy-based solution would be particularly welcome) the fundamental problem has nothing to do with numpy (or even Python) as such.
The task is to create an automated test for an algorithm which is supposed to produce points distributed on a grid whose pitch is specified as an input to the algorithm. The absolute positions of the points do not matter, but their relative positions do. For example, following
collection_of_points = algorithm(data, pitch=[1.3, 1.5, 2])
collection_of_points should contain only points whose x-coordinates differ by multiples of 1.3, whose y-coordinates differ by multiples of 1.5 and whose z-coordinates differ by multiples of 2.
The test should verify that this condition is satisfied.
One thing that I have tried, which doesn't seem too ugly, but doesn't work is
points = algo(data, pitch=requested_pitch)
for p1, p2 in itertools.combinations(points, 2):
distance_between_points = np.array(p2) - np.array(p1)
assert np.allclose(distance_between_points % requested_pitch, 0)
[ Aside for those unfamiliar with python or numpy:
itertools.combinations(points, 2) is a simple way of iterating through all pairs of points
Arithmetic operations on np.arrays are performed elementwise, so np.array([5,6,7]) % np.array([2,3,4]) evaluates to np.array([1, 0, 3]) via np.array([5%2, 6%3, 7%4])
np.allclose checks whether all corresponding elements in the two inputs arrays are approximately equal, and numpy automatically pretends that the 0 which is passed in as the second argument, was really an all-zero array of the correct size
]
To see why the idea shown above fails, consider a desired pitch of 3 and two points which are separated by 8.9999999 in the relevant dimension. 8.999999 % 3 is around 2.999999 which is nowhere near the required 0.
In all of this, I can't help feeling that I'm missing something obvious or that I'm re-inventing some wheel.
Can you suggest an elegant way of writing such a check?
Change your assertion to:
np.all(np.logical_or(np.isclose(x % y, 0), np.isclose((x % y) - y, 0)))
If you want to make it more readable, you should functionalize the statement. Something like:
def is_multiple(x, y, rtol=1e-05, atol=1e-08):
"""
Test if x is a multiple of y.
"""
remainder = x % y
is_zero = np.isclose(remainder, 0., rtol, atol)
is_y = np.isclose(remainder, y, rtol, atol)
return np.logical_or(is_zero, is_y)
And then:
assert np.all(is_multiple(distance_between_points, requested_pitch))

Non-Convex Loss Function

I am trying to understand gradient descent algorithm by plotting the error vs value of parameters in the function. What would be an example of a simple function of the form y = f(x) with just just one input variable x and two parameters w1 and w2 such that it has a non-convex loss function ? Is y = w1.tanh(w2.x) an example ? What i am trying to achieve is this :
How does one know if the function has a non-convex loss function without plotting the graph ?
In iterative optimization algorithms such as gradient descent or Gauss-Newton, what matters is whether the function is locally convex. This is correct (on a convex set) if and only if the Hessian matrix (Jacobian of gradient) is positive semi-definite. As for a non-convex function of one variable (see my Edit below), a perfect example is the function you provide. This is because its second derivative, i.e Hessian (which is of size 1*1 here) can be computed as follows:
first_deriv=d(w1*tanh(w2*x))/dx= w1*w2 * sech^2(w2*x)
second_deriv=d(first_deriv)/dx=some_const*sech^2(w2*x)*tanh(w2*x)
The sech^2 part is always positive, so the sign of second_deriv depends on the sign of tanh, which can vary depending on the values you supply as x and w2. Therefore, we can say that it is not convex everywhere.
Edit: It wasn't clear to me what you meant by one input variable and two parameters, so I assumed that w1 and w2 were fixed beforehand, and computed the derivative w.r.t x. But I think that if you want to optimize w1 and w2 (as I suppose it makes more sense if your function is from a toy neural net), then you can compute the 2*2 Hessian in a similar way.
The same way as in high-school algebra: the second derivative tells you the direction of flex. If that's negative in all orientations, then the function is convex.

Update equation for gradient descent

If we have a approximation function y = f(w,x), where x is input, y is output, and w is the weight. According to gradient descent rule, we should update the weight according to w = w - df/dw. But is that possible that we update the weight according to w = w - w * df/dw instead? Has anyone seen this before? The reason I want to do this is because it is easier for me to do it this way in my algorithm.
Recall, gradient descent is based on the Taylor expansion of f(w, x) in the close vicinity of w, and has its purpose---in your context---in repeatedly modifying the weight in small steps. The reverse gradient direction is just a search direction, based upon very local knowledge of the function f(w, x).
Usually the iterative of the weight includes a step length, yielding the expression
w_(i+1) = w_(i) - nu_j df/dw,
where the value of the step length nu_j is found by using line search, see e.g. https://en.wikipedia.org/wiki/Line_search.
Hence, based on the discussion above, to answer your question: no, it is not a good idea to update according to
w_(i+1) = w_(i) - w_(i) df/dw.
Why? If w_(i) is large (in context), we'll take a huge step based on very local information, and we would be using something very different than the fine-stepped gradient descent method.
Also, as lejlot points out in the comments below, a negative value of w(i) would mean you traverse in the (positive) direction of the gradient, i.e., in the direction in which the function grows most rapidly, which is, locally, the worst possible search direction (for minimization problems).

What is the best way to implement weight constraints in TensorFlow?

Suppose we have weights
x = tf.Variable(np.random.random((5,10)))
cost = ...
And we use the GD optimizer:
upds = tf.train.GradientDescentOptimizer(lr).minimize(cost)
session.run(upds)
How can we implement for example non-negativity on weights?
I tried clipping them:
upds = tf.train.GradientDescentOptimizer(lr).minimize(cost)
session.run(upds)
session.run(tf.assign(x, tf.clip_by_value(x, 0, np.infty)))
But this slows down my training by a factor of 50.
Does anybody know a good way to implement such constraints on the weights in TensorFlow?
P.S.: in the equivalent Theano algorithm, I had
T.clip(x, 0, np.infty)
and it ran smoothly.
You can take the Lagrangian approach and simply add a penalty for features of the variable you don't want.
e.g. To encourage theta to be non-negative, you could add the following to the optimizer's objective function.
added_loss = -tf.minimum( tf.reduce_min(theta),0)
If any theta are negative, then add2loss will be positive, otherwise zero. Scaling that to a meaningful value is left as an exercise to the reader. Scaling too little will not exert enough pressure. Too much may make things unstable.
As of TensorFlow 1.4, there is a new argument to tf.get_variable that allows to pass a constraint function that is applied after the update of the optimizer. Here is an example that enforces a non-negativity constraint:
with tf.variable_scope("MyScope"):
v1 = tf.get_variable("v1", …, constraint=lambda x: tf.clip_by_value(x, 0, np.infty))
constraint: An optional projection function to be applied to the
variable
after being updated by an Optimizer (e.g. used to implement norm
constraints or value constraints for layer weights). The function must
take as input the unprojected Tensor representing the value of the
variable and return the Tensor for the projected value
(which must have the same shape). Constraints are not safe to
use when doing asynchronous distributed training.
By running
sess.run(tf.assign(x, tf.clip_by_value(x, 0, np.infty)))
you are consistently adding nodes to the graph and making it slower and slower.
Actually you may just define a clip_op when building the graph and run it each time after updating the weights:
# build the graph
x = tf.Variable(np.random.random((5,10)))
loss = ...
train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss)
clip_op = tf.assign(x, tf.clip(x, 0, np.infty))
# train
sess.run(train_op)
sess.run(clip_op)
I recently had this problem as well. I discovered that you can import keras which has nice weight constraint functions as use them directly in the kernen constraint in tensorflow. Here is an example of my code. You can do similar things with kernel regularizer
from keras.constraints import non_neg
conv1 = tf.layers.conv2d(
inputs=features['x'],
filters=32,
kernel_size=[5,5],
strides = 2,
padding='valid',
activation=tf.nn.relu,
kernel_regularizer=None,
kernel_constraint=non_neg(),
use_bias=False)
There is a practical solution: Your cost function can be written by you, to put high cost onto negative weights. I did this in a matrix factorization model in TensorFlow with python, and it worked well enough. Right? I mean it's obvious. But nobody else mentioned it so here you go. EDIT: I just saw that Mark Borderding also gave another loss and cost-based solution implementation before I did.
And if "the best way" is wanted, as the OP asked, what then? Well "best" might actually be application-specific, in which case you'd need to try a few different ways with your dataset and consider your application requirements.
Here is working code for increasing the cost for unwanted negative solution variables:
cost = tf.reduce_sum(keep_loss) + Lambda * reg # Cost = sum of losses for training set, except missing data.
if prefer_nonneg: # Optionally increase cost for negative values in rhat, if you want that.
negs_indices = tf.where(rhat < tf.constant(0.0))
neg_vals = tf.gather_nd(rhat, negs_indices)
cost += 2. * tf.reduce_sum(tf.abs(neg_vals)) # 2 is a magic number (empirical parameter)
You are free to use my code but please give me some credit if you choose to use it. Give a link to this answer on stackoverflow.com please.
This design would be considered a soft constraint, because you can still get negative weights, if you let it, depending on your cost definition.
It seems that constraint= is also available in TF v1.4+ as a parameter to tf.get_variable(), where you can pass a function like tf.clip_by_value. This seems like another soft constraint, not hard constraint, in my opinion, because it depends on your function to work well or not. It also might be slow, as the other answerer tried the same function and reported it was slow to converge, although they didn't use the constraint= parameter to do this. I don't see any reason why one would be any faster than the other since they both use the same clipping approach. So if you use the constraint= parameter then you should expect slow convergence in the context of the original poster's application.
It would be nicer if also TF provided true hard constraints to the API, and let TF figure out how to both implement that as well as make it efficient on the back end. I mean, I have seen this done in linear programming solvers already for a long time. The application declares a constraint, and the back end makes it happen.

Normal Distribution function

edit
So based on the answers so far (thanks for taking your time) I'm getting the sense that I'm probably NOT looking for a Normal Distribution function. Perhaps I'll try to re-describe what I'm looking to do.
Lets say I have an object that returns a number of 0 to 10. And that number controls "speed". However instead of 10 being the top speed, I need 5 to be the top speed, and anything lower or higher would slow down accordingly. (with easing, thus the bell curve)
I hope that's clearer ;/
-original question
These are the times I wish I remembered something from math class.
I'm trying to figure out how to write a function in obj-C where I define the boundries, ex (0 - 10) and then if x = foo y = ? .... where x runs something like 0,1,2,3,4,5,6,7,8,9,10 and y runs 0,1,2,3,4,5,4,3,2,1,0 but only on a curve
Something like the attached image.
I tried googling for Normal Distribution but its way over my head. I was hoping to find some site that lists some useful algorithms like these but wasn't very successful.
So can anyone help me out here ? And if there is some good sites which shows useful mathematical functions, I'd love to check them out.
TIA!!!
-added
I'm not looking for a random number, I'm looking for.. ex: if x=0 y should be 0, if x=5 y should be 5, if x=10 y should be 0.... and all those other not so obvious in between numbers
alt text http://dizy.cc/slider.gif
Okay, your edit really clarifies things. You're not looking for anything to do with the normal distribution, just a nice smooth little ramp function. The one Paul provides will do nicely, but is tricky to modify for other values. It can be made a little more flexible (my code examples are in Python, which should be very easy to translate to any other language):
def quarticRamp(x, b=10, peak=5):
if not 0 <= x <= b:
raise ValueError #or return 0
return peak*x*x*(x-b)*(x-b)*16/(b*b*b*b)
Parameter b is the upper bound for the region you want to have a slope on (10, in your example), and peak is how high you want it to go (5, in the example).
Personally I like a quadratic spline approach, which is marginally cheaper computationally and has a different curve to it (this curve is really nice to use in a couple of special applications that don't happen to matter at all for you):
def quadraticSplineRamp(x, a=0, b=10, peak=5):
if not a <= x <= b:
raise ValueError #or return 0
if x > (b+a)/2:
x = a + b - x
z = 2*(x-a)/b
if z > 0.5:
return peak * (1 - 2*(z-1)*(z-1))
else:
return peak * (2*z*z)
This is similar to the other function, but takes a lower bound a (0 in your example). The logic is a little more complex because it's a somewhat-optimized implementation of a piecewise function.
The two curves have slightly different shapes; you probably don't care what the exact shape is, and so could pick either. There are an infinite number of ramp functions meeting your criteria; these are two simple ones, but they can get as baroque as you want.
The thing you want to plot is the probability density function (pdf) of the normal distribution. You can find it on the mighty Wikipedia.
Luckily, the pdf for a normal distribution is not difficult to implement - some of the other related functions are considerably worse because they require the error function.
To get a plot like you showed, you want a mean of 5 and a standard deviation of about 1.5. The median is obviously the centre, and figuring out an appropriate standard deviation given the left & right boundaries isn't particularly difficult.
A function to calculate the y value of the pdf given the x coordinate, standard deviation and mean might look something like:
double normal_pdf(double x, double mean, double std_dev) {
return( 1.0/(sqrt(2*PI)*std_dev) *
exp(-(x-mean)*(x-mean)/(2*std_dev*std_dev)) );
}
A normal distribution is never equal to 0.
Please make sure that what you want to plot is indeed a
normal distribution.
If you're only looking for this bell shape (with the tangent and everything)
you can use the following formula:
x^2*(x-10)^2 for x between 0 and 10
0 elsewhere
(Divide by 125 if you need to have your peek on 5.)
double bell(double x) {
if ((x < 10) && (x>0))
return x*x*(x-10.)*(x-10.)/125.;
else
return 0.;
}
Well, there's good old Wikipedia, of course. And Mathworld.
What you want is a random number generator for "generating normally distributed random deviates". Since Objective C can call regular C libraries, you either need a C-callable library like the GNU Scientific Library, or for this, you can write it yourself following the description here.
Try simulating rolls of dice by generating random numbers between 1 and 6. If you add up the rolls from 5 independent dice rolls, you'll get a surprisingly good approximation to the normal distribution. You can roll more dice if you'd like and you'll get a better approximation.
Here's an article that explains why this works. It's probably more mathematical detail than you want, but you could show it to someone to justify your approach.
If what you want is the value of the probability density function, p(x), of a normal (Gaussian) distribution of mean mu and standard deviation sigma at x, the formula is
p(x) = exp( ((x-mu)^2)/(2*sigma^2) ) / (sigma * 2 * sqrt(pi))
where pi is the area of a circle divided by the square of its radius (approximately 3.14159...). Using the C standard library math.h, this is:
#include <math>
double normal_pdf(double x, double mu, double sigma) {
double n = sigma * 2 * sqrt(M_PI); //normalization factor
p = exp( -pow(x-mu, 2) / (2 * pow(sigma, 2)) ); // unnormalized pdf
return p / n;
}
Of course, you can do the same in Objective-C.
For reference, see the Wikipedia or MathWorld articles.
It sounds like you want to write a function that yields a curve of a specific shape. Something like y = f(x), for x in [0:10]. You have a constraint on the max value of y, and a general idea of what you want the curve to look like (somewhat bell-shaped, y=0 at the edges of the x range, y=5 when x=5). So roughly, you would call your function iteratively with the x range, with a step that gives you enough points to make your curve look nice.
So you really don't need random numbers, and this has nothing to do with probability unless you want it to (as in, you want your curve to look like a the outline of a normal distribution or something along those lines).
If you have a clear idea of what function will yield your desired curve, the code is trivial - a function to compute f(x) and a for loop to call it the desired number of times for the desired values of x. Plot the x,y pairs and you're done. So that's your algorithm - call a function in a for loop.
The contents of the routine implementing the function will depend on the specifics of what you want the curve to look like. If you need help on functions that might return a curve resembling your sample, I would direct you to the reading material in the other answers. :) However, I suspect that this is actually an assignment of some sort, and that you have been given a function already. If you are actually doing this on your own to learn, then I again echo the other reading suggestions.
y=-1*abs(x-5)+5