How to understand the bias parameter in LIBLINEAR?

I don't understand the meaning of the bias parameter in the API of LIBLINEAR. Why is it specified by the user during training? Shouldn't it just be the distance from the separating hyperplane to the origin, which is a parameter of the learned model?
This is from the README:
struct problem
{
    int l, n;
    int *y;
    struct feature_node **x;
    double bias;
};
If bias >= 0, we assume that one additional feature is added to the end of each data instance.
What is this additional feature?

Let's look at the equation for the separating hyperplane:
w_1 * x_1 + w_2 * x_2 + w_3 * x_3 + ... + w_bias * x_bias = 0
where the x are the feature values and the w are the trained "weights". The additional feature x_bias is a constant whose value is equal to the bias. If bias = 0, you get a separating hyperplane that passes through the origin (0,0,0,...). You can imagine many cases where such a hyperplane is not the optimal separator.
The value of the bias affects the margin through the scaling of w_bias. Therefore the bias is a tuning parameter, which is usually determined through cross-validation, like the other parameters.
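To make the mechanism concrete, here is a minimal numpy sketch (illustrative data only, not LIBLINEAR's internals) of what "one additional feature is added to the end of each data instance" means:
import numpy as np

# Hypothetical data: 3 instances with 2 features each.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

bias = 1.0  # the value passed as the bias parameter

# With bias >= 0, a constant feature equal to `bias` is appended to every
# instance, and the model learns one extra weight w_bias for it.
X_augmented = np.hstack([X, np.full((X.shape[0], 1), bias)])
print(X_augmented)
# [[1. 2. 1.]
#  [3. 4. 1.]
#  [5. 6. 1.]]
The decision value is then w_1*x_1 + w_2*x_2 + w_bias*bias, so with bias = 1 the learned w_bias plays the role of the usual intercept.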

Related

Why is the gradient of tanh in TensorFlow `grad = dy * (1 - y*y)`?

tf.raw_ops.TanhGrad says that grad = dy * (1 - y*y), where y = tanh(x).
But I think since dy / dx = 1 - y*y, where y = tanh(x), grad should be dy / (1 - y*y). Where am I wrong?
An expression like dy / dx is a mathematical notation for the derivative, it is not an actual fraction. It is meaningless to move dy or dx around individually as you would with a numerator and denominator.
Mathematically, it is known that d(tanh(x))/dx = 1 - (tanh(x))^2. TensorFlow computes gradients "backwards" (what is called backpropagation, or more generally reverse-mode automatic differentiation). That means that, in general, we only reach the computation of the gradient of tanh(x) after the step where we compute the gradient of an "outer" function g(tanh(x)). Here g represents all the operations that are applied to the output of tanh to reach the value for which the gradient is computed.
The derivative of this composition, according to the chain rule, is d(g(tanh(x)))/dx = d(g(tanh(x)))/d(tanh(x)) * d(tanh(x))/dx. The first factor, d(g(tanh(x)))/d(tanh(x)), is the gradient accumulated backwards up to tanh, that is, the derivative of all those later operations; this is the value called dy in the documentation of the op. Therefore, you only need to compute d(tanh(x))/dx, which is (1 - y * y) because y = tanh(x), and multiply it by the given dy. The resulting value is then propagated further back to the operation that produced the input x to tanh in the first place, becomes the dy value in the computation of that gradient, and so on until the gradient sources are reached.
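As an illustrative check (a small sketch, not TensorFlow's internal implementation), you can compare the gradient computed by tf.GradientTape with the manual chain-rule product dy * (1 - y*y), where dy is the gradient of an arbitrary outer function g with respect to y = tanh(x):
import tensorflow as tf

x = tf.constant([0.5, -1.0, 2.0])
with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.tanh(x)
    loss = tf.reduce_sum(3.0 * y)  # g(tanh(x)): an arbitrary "outer" function

# d(loss)/dx as computed by reverse-mode autodiff
grad_autodiff = tape.gradient(loss, x)

# Manual chain rule: dy is the gradient flowing back into tanh,
# i.e. d(loss)/dy = 3.0 for every element.
dy = tf.fill(tf.shape(y), 3.0)
grad_manual = dy * (1.0 - y * y)

print(grad_autodiff.numpy())  # matches grad_manual
print(grad_manual.numpy())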

How to apply bounds on a variable when performing optimisation in Pytorch?

I am trying to use PyTorch for non-convex optimisation, trying to maximise my objective (so minimise in SGD). I would like to bound my dependent variable x > 0, and also have the sum of my x values be less than 1000.
I think I have the penalty implemented correctly in the form of a ramp penalty, but am struggling with the bounding of the x variable. In PyTorch you can set bounds using clamp, but it doesn't seem appropriate in this case; I think this is because optim needs the gradients to remain free under the hood. Full working example:
import torch
from torch.autograd import Variable
import numpy as np

def objective(x, a, b, c):  # Want to maximise this quantity (so minimise in SGD)
    d = 1 / (1 + torch.exp(-a * (x)))
    # Checking constraint
    exceeded_limit = constraint(x).item()
    #print(exceeded_limit)
    obj = torch.sum(d * (b * c - x))
    # If over the limit, add ramp penalty
    if exceeded_limit < 0:
        obj = obj - (exceeded_limit * 10)
        print("Exceeded limit")
    return -obj

def constraint(x, limit=1000):  # Must be > 0
    return limit - x.sum()

N = 1000

# x is the variable to optimise for
x = Variable(torch.Tensor([1 for ii in range(N)]), requires_grad=True)
a = Variable(torch.Tensor(np.random.uniform(0, 100, N)), requires_grad=True)
b = Variable(torch.Tensor(np.random.rand(N)), requires_grad=True)
c = Variable(torch.Tensor(np.random.rand(N)), requires_grad=True)

# Would like to include the clamp
# x = torch.clamp(x, min=0)

# Non-convex method
opt = torch.optim.SGD([x], lr=.01)

for i in range(10000):
    # Zeroing gradients
    opt.zero_grad()
    # Evaluating the objective
    obj = objective(x, a, b, c)
    # Calculate gradients
    obj.backward()
    opt.step()
    if i % 1000 == 0: print("Objective: %.1f" % -obj.item())

print("\nObjective: {}".format(-obj))
print("Limit: {}".format(constraint(x).item()))
if torch.sum(x < 0) > 0: print("Bounds not met")
if constraint(x).item() < 0: print("Constraint not met")
Any suggestions as to how to impose the bounds would be appreciated, either using clamp or otherwise, or general advice on non-convex optimisation using PyTorch. This is a much simpler, scaled-down version of the problem I'm working on, so I am trying to find a lightweight solution if possible. I am considering a workaround such as transforming the x variable with an exponential function, but then you would have to scale the function to avoid the positive values becoming infinite, and I want some flexibility in being able to set the constraint.
I ran into the same problem: I also wanted to apply bounds on a variable in PyTorch, and I solved it with Way 3 below.
Your example is a little complex (and I am still learning English), so I give a simpler one below.
For example, suppose there is a trainable variable v whose bounds are (-1, 1):
v = torch.tensor((0.5, ), requires_grad=True)
v_loss = xxxx
optimizer.zero_grad()
v_loss.backward()
optimizer.step()
Way 1. RuntimeError: a leaf Variable that requires grad has been used in an in-place operation.
v.clamp_(-1, 1)
Way 2. RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed.
v = torch.clamp(v, -1, +1)  # equal to v = v.clamp(-1, +1)
Way 3. No error. This is the way I solved the problem.
with torch.no_grad():
    v[:] = v.clamp(-1, +1)  # you must use v[:] = ... instead of v = ...
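Putting Way 3 back into a training loop, a minimal sketch (toy objective and sizes, assumed for illustration) of projecting the variable back into its bounds after every optimizer step could look like this:
import torch

# Toy stand-in for the real problem: keep every entry of x non-negative.
x = torch.ones(10, requires_grad=True)
opt = torch.optim.SGD([x], lr=0.01)

for _ in range(1000):
    opt.zero_grad()
    loss = ((x - 2.0) ** 2).sum()  # placeholder for -objective(x, a, b, c)
    loss.backward()
    opt.step()
    # Projection step ("Way 3"): modify x in place without tracking gradients.
    with torch.no_grad():
        x.clamp_(min=0)
Because the in-place clamp happens under torch.no_grad(), x stays a leaf tensor that the optimizer keeps updating, which avoids the errors from Way 1 and Way 2.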

Custom Weighted Cross Entropy loss in Keras

Ok, so I have a neural network that classifies fire size into 3 groups: 0-1, 1-100, and over 100 acres. I need a loss function that weights the loss as double when the classifier guesses a class that is off by 2 (e.g. actual = first group, predicted = third group).
I need a loss function that weights the loss as double when the classifier guesses a class that is off by 2 (e.g. actual = first group, predicted = third group)
Double of what?
A) Is it double the loss value when the classifier guesses correctly?
B) Or double the loss value when the classifier is off by 1?
C) Can we relax this 'double' constraint, and can we assume that any suitable higher power would suffice?
Let us assume A).
Let f(x) denote the probability assigned to a particular class, where x is the absolute value of the difference between that class and the true class (in categorical value).
We will see below that f(0) = 0.5 is a solution for assumption A, which means that f(1) = 0.25 and f(2) = 0.25. By the way, the fact that f(1) == f(2) doesn't look natural.
Assume that your classifier calculates a function f(x), and uses it as follows.
def classifier_output(firesize):
    if firesize >= 0 and firesize < 1.0:
        return [f(0), f(1), f(2)]
    elif firesize >= 1.0 and firesize < 100.0:
        return [f(1), f(0), f(1)]
    else:
        assert firesize >= 100.0
        return [f(2), f(1), f(0)]
The constraints are:
C1) f(x) >= 0.
C2) The components of your output vector should always sum to 1.0, i.e. the sum of all three components of the return value should always be 1.
C3) When the true class and the predicted class differ by 2, the one-hot cross-entropy loss will be -log(f(2)). According to assumption A, this should equal -2*log(f(0)), i.e.:
log(f(2)) = 2*log(f(0))
This translates to
f(2) = f(0)*f(0)
Let us put z = f(0). Now f(2) = z*z. We don't know f(1); let us assume f(1) = y.
From constraint C2, we have the following equations:
z + z*z + y = 1    (from the first and third return vectors)
z + 2*y = 1        (from the middle return vector)
A solution to the above is z = 0.5, y = 0.25.
If you assume B), you won't be able to find such a function.
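As a quick numeric check of the derivation (a small numpy sketch, using the values from the solution above):
import numpy as np

z, y = 0.5, 0.25           # z = f(0), y = f(1); then f(2) = z*z = 0.25
f = {0: z, 1: y, 2: z * z}

# C2: every output vector of classifier_output sums to 1.
print(np.isclose(f[0] + f[1] + f[2], 1.0))  # first and third branches
print(np.isclose(f[1] + f[0] + f[1], 1.0))  # middle branch

# C3 / assumption A: the cross-entropy loss for an off-by-2 prediction
# (-log(f(2))) is double the loss for a correct prediction (-log(f(0))).
print(np.isclose(-np.log(f[2]), 2 * -np.log(f[0])))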

How to handle log(0) when using cross entropy

In order to make the case simple and intuitive, I will use binary (0 and 1) classification for illustration.
Loss function
loss = np.multiply(np.log(predY), Y) + np.multiply((1 - Y), np.log(1 - predY)) #cross entropy
cost = -np.sum(loss)/m #num of examples in batch is m
Probability of Y
predY is computed using the sigmoid, and the logits can be thought of as the output of the neural network before it reaches the classification step
predY = sigmoid(logits) #binary case
def sigmoid(X):
    return 1/(1 + np.exp(-X))
Problem
Suppose we are running a feed-forward net.
Inputs: [3, 5]: 3 is the number of examples and 5 is the feature size (fabricated data)
Num of hidden units: 100 (only 1 hidden layer)
Iterations: 10000
Such an arrangement is set up to overfit. When it overfits, we can predict the probabilities for the training examples perfectly; in other words, the sigmoid outputs exactly 1 or 0, because the exponential overflows or underflows. In that case np.log(0) is undefined. How do you usually handle this issue?
If you don't mind the dependency on scipy, you can use scipy.special.xlogy. You would replace the expression
np.multiply(np.log(predY), Y) + np.multiply((1 - Y), np.log(1 - predY))
with
xlogy(Y, predY) + xlogy(1 - Y, 1 - predY)
If you expect predY to contain very small values, you might get better numerical results using scipy.special.xlog1py in the second term:
xlogy(Y, predY) + xlog1py(1 - Y, -predY)
Alternatively, knowing that the values in Y are either 0 or 1, you can compute the cost in an entirely different way:
Yis1 = Y == 1
cost = -(np.log(predY[Yis1]).sum() + np.log(1 - predY[~Yis1]).sum())/m
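For instance, a small sketch (with toy Y and predY values that deliberately contain exact 0s and 1s) showing that the xlogy form stays finite:
import numpy as np
from scipy.special import xlogy, xlog1py

Y = np.array([1.0, 0.0, 1.0])
predY = np.array([1.0, 0.0, 0.8])  # saturated sigmoid outputs on purpose
m = len(Y)

# xlogy(0, 0) and xlog1py(0, -1) are defined to be 0, so the 0*log(0) terms drop out.
loss = xlogy(Y, predY) + xlog1py(1 - Y, -predY)
cost = -loss.sum() / m
print(cost)  # finite, no nan or -inf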
How do you usually handle this issue?
Add a small number (something like 1e-15) to predY; this doesn't throw the predictions off by much, and it solves the log(0) issue.
BTW, if your algorithm outputs zeros and ones it might be useful to check the histogram of the returned probabilities; when an algorithm is that sure something is happening, it can be a sign of overfitting.
One common way to deal with log(x) and y / x where x is always non-negative but can become 0 is to add a small constant (as written by Jakub).
You can also clip the value (e.g. tf.clip_by_value or np.clip).
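As a concrete sketch of the add-epsilon/clipping approach (toy values assumed, reusing the loss expression from the question):
import numpy as np

Y = np.array([1.0, 0.0, 1.0])
predY = np.array([1.0, 0.0, 0.8])  # saturated sigmoid outputs
m = len(Y)

eps = 1e-15
predY_clipped = np.clip(predY, eps, 1 - eps)  # keeps both log() arguments away from 0

loss = np.multiply(np.log(predY_clipped), Y) + np.multiply((1 - Y), np.log(1 - predY_clipped))
cost = -np.sum(loss) / m
print(cost)  # finite instead of nan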

What is the behavior of SAME padding when stride is greater than 1?

My understanding of SAME padding in Tensorflow is that padding is added such that the output dimensions (for width and height) will be the same as the input dimensions. However, this understanding only really makes sense when stride=1, because if stride is >1 then output dimensions will almost certainly be lower.
So I'm wondering what the algorithm is for calculating padding in this case. Is it simply that padding is added so that the filter is applied to every input value, rather than leaving some off on the right?
There is a formula for that:
n' = floor((n + 2*p - f) / s) + 1
where n' is the output size, n is the input size, p is the padding on each side, f is the filter size, and s is the stride.
If you are using SAME padding with stride > 1, p will be the minimum padding needed so that the output size n' equals ceil(n/s). Note: p could be a decimal, as it is averaged over the two sides of the image.
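For example (numbers chosen purely for illustration): with n = 224, f = 7, s = 2 and total padding 2*p = 6, the formula gives n' = floor((224 + 6 - 7) / 2) + 1 = floor(111.5) + 1 = 112, which is exactly ceil(224 / 2), the SAME output size.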
Peter's answer is true but might lack a few details. Let me add on top of it.
Autopadding = SAME means that: o = ceil(i/s), where o = output size, i = input size, s = stride.
In addition, the generic output size formula is:
o = floor((i + p - k) / s) + 1
where the new terms are p (the total padding over both sides) and k, the effective kernel size (including dilation, or just the kernel size if dilation is disabled).
If you develop that formula to solve for p, you get:
p_min = (o - 1)*s - i + k   # i.e., when the floor is removed from the previous equation
p_max = o*s - i + k - 1     # i.e., when the remainder of the floor's numerator modulo s is s - 1
Any padding value p in the range [p_min, p_max] will satisfy the condition o = ceil(i/s), meaning that for a stride s there are s total solutions satisfying the formula.
It is the norm to use p_min as the padding, so you can ignore the other s - 1 solutions.
PS: This would be for 1D, but for nD, simply repeat these formulas independently for each dimension, i.e.,
p_min[dimension_index] = (o[dimension_index] - 1)*s[dimension_index] - i[dimension_index] + k[dimension_index]
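To make the rule concrete, here is a small Python sketch (assuming no dilation, and assuming TensorFlow's convention of putting the extra pixel of padding on the bottom/right) that computes the SAME padding along one dimension:
import math

def same_padding_1d(i, k, s):
    # Padding that SAME would use along one dimension, following the rule above.
    o = math.ceil(i / s)                   # SAME output size
    p_total = max((o - 1) * s + k - i, 0)  # p_min from the formula above
    # Split the total padding, with the extra pixel (if any) on the right/bottom.
    return p_total // 2, p_total - p_total // 2

print(same_padding_1d(i=7, k=3, s=2))  # (1, 1): output size ceil(7/2) = 4
print(same_padding_1d(i=8, k=3, s=2))  # (0, 1): asymmetric padding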
For references, these links are really useful:
https://arxiv.org/abs/1603.07285
https://towardsdatascience.com/a-comprehensive-introduction-to-different-types-of-convolutions-in-deep-learning-669281e58215
https://mmuratarat.github.io/2019-01-17/implementing-padding-schemes-of-tensorflow-in-python