calculate gradient for Contrastive loss - numpy

I have come across this loss function
L = y*d**2 + (1-y)*max(margin-d,0)**2
Is it possible to calculate the gradient of this type of loss, with the max function inside it, and if so, how should it be done?
I tried finding the gradient of simpler functions in numpy and was successful, but with this one I don't understand how to go about it.

On a high level, yes: there is nothing too special about a pointwise maximum, since
d/dx max(x, a) = 1 if x > a, and 0 otherwise
In your case a = 0, and max(x, 0) is nothing but the ReLU that is widely used in deep learning these days.
CS approach
Let's just use JAX:
import jax
from jax import numpy as jnp

def loss(d, y, m):
    return y*d**2 + (1-y)*jnp.maximum(m-d, 0)**2

loss_grad = jax.grad(loss)
Math approach
We can just compute it analytically:
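Differentiating with respect to d, the max term behaves like a ReLU: its derivative is 1 where margin - d > 0 and 0 elsewhere, so
dL/dd = 2*y*d - 2*(1-y)*max(margin - d, 0)
A plain numpy version of this gradient (a minimal sketch, which you can check against the jax.grad result above) could be:
import numpy as np

def contrastive_loss_grad_d(d, y, margin):
    # d(y*d**2)/dd = 2*y*d
    # d((1-y)*max(margin-d, 0)**2)/dd = -2*(1-y)*max(margin-d, 0)
    return 2*y*d - 2*(1-y)*np.maximum(margin - d, 0)

# e.g. contrastive_loss_grad_d(1.5, 0.0, 2.0) == -1.0, matching loss_grad(1.5, 0.0, 2.0)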


Creating a matrix with certain conditions

I am trying to create a matrix of size 32x10x1 using PyTorch.
The conditions that I need to fulfill are:
torch.mean(a, dim=0) # size is 10x1 and should be almost 0
torch.mean(a, dim=1) # size is 32x1 and should be almost 0
This is a noise matrix for GANs and I am trying to sample it from a Normal distribution. I tried using torch.MultiVariateNormal() but it didn't give me a matrix of that shape.
Is there any other function, in numpy or scikit, to get this kind of matrix?
Use numpy.random.normal
import numpy.random as npr
mean = 0
std_dev = 0.1
size = (32, 10, 1)
mat = npr.normal(loc=mean, scale=std_dev, size=size)
and set the mean and standard deviation as desired to keep the values close to 0.
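As a quick sanity check on mat (a minimal sketch; if you need a tensor, torch.from_numpy(mat) would convert it):
import numpy as np

print(mat.mean(axis=0).shape)          # (10, 1)
print(mat.mean(axis=1).shape)          # (32, 1)
print(np.abs(mat.mean(axis=0)).max())  # small, and shrinks further as std_dev decreases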
Here you can see the effect of changing the mean and standard deviation on the graph:
[Figure: normal distribution density curves for several means and standard deviations. By Inductiveload, Public Domain, https://commons.wikimedia.org/w/index.php?curid=3817954]

Tensorflow Probability VI: Discrete + Continuous RVs inference: gradient estimation?

See this tensorflow-probability issue
tensorflow==2.7.0
tensorflow-probability==0.14.1
TLDR
To perform VI on discrete RVs, should I use:
A- the REINFORCE gradient estimator
B- the Gumbel-Softmax reparametrization
C- another solution
and how should I implement it?
Problem statement
Sorry in advance for the long issue, but I believe the problem requires some explaining.
I want to implement a Hierarchical Bayesian Model involving both continuous and discrete Random Variables. A minimal example is a Gaussian Mixture model:
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
tfb = tfp.bijectors

G = 2
p = tfd.JointDistributionNamed(
    model=dict(
        mu=tfd.Sample(
            tfd.Normal(0., 1.),
            sample_shape=(G,)
        ),
        z=tfd.Categorical(
            probs=tf.ones((G,)) / G
        ),
        x=lambda mu, z: tfd.Normal(
            loc=mu[z],
            scale=1.
        )
    )
)
In this example I purposely don't use the tfd.Mixture API, in order to expose the Categorical label. I want to perform Variational Inference in this context, and, for instance, given an observed x, fit a Categorical distribution with parametric probabilities over the posterior of z:
q_probs = tfp.util.TransformedVariable(
    tf.ones((G,)) / G,
    tfb.SoftmaxCentered(),
    name="q_probs"
)
q_loc = tf.Variable(0., name="q_loc")
q_scale = tfp.util.TransformedVariable(
    1.,
    tfb.Exp(),
    name="q_scale"
)
q = tfd.JointDistributionNamed(
    model=dict(
        mu=tfd.Normal(q_loc, q_scale),
        z=tfd.Categorical(probs=q_probs)
    )
)
The issue is: when computing the ELBO and trying to optimize for the optimal q_probs I cannot use the reparameterization gradient estimators: this is AFAIK because z is a discrete RV:
def log_prob_fn(**kwargs):
    return p.log_prob(
        **kwargs,
        x=tf.constant([2.])
    )

optimizer = tf.optimizers.SGD()

@tf.function
def fit_vi():
    return tfp.vi.fit_surrogate_posterior(
        target_log_prob_fn=log_prob_fn,
        surrogate_posterior=q,
        optimizer=optimizer,
        num_steps=10,
        sample_size=8
    )

_ = fit_vi()
# This last line raises:
# ValueError: Distribution `surrogate_posterior` must be reparameterized, i.e., a diffeomorphic transformation
# of a parameterless distribution. (Otherwise this function has a biased gradient.)
I'm looking into a way to make this work. I've identified at least 2 ways to circumvent the issue: using REINFORCE gradient estimator or the Gumbel-Softmax reparameterization.
A- REINFORCE gradient
cf. this TFP API link: a classical result in VI is that the REINFORCE gradient can deal with a non-differentiable objective function, for instance one involving discrete RVs.
Can I use a tfp.vi.GradientEstimators.SCORE_FUNCTION estimator instead of the tfp.vi.GradientEstimators.REPARAMETERIZATION one, using the lower-level tfp.vi.monte_carlo_variational_loss function?
Using the REINFORCE gradient, I only need the log_prob method of q to be differentiable; the sample method needn't be differentiated.
As far as I understand it, the sample method of a Categorical distribution implies a gradient break, but the log_prob method does not. Am I correct to assume that this could help with my issue? Am I missing something here?
Also I wonder: why is this possibility not exposed in the tfp.vi.fit_surrogate_posterior API? Is the performance bad, meaning is the variance of the estimator too large for practical purposes?
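Concretely, I imagine something along these lines (an untested sketch; the gradient_estimator keyword is my assumption based on the API link above, so the exact signature may differ in tensorflow-probability==0.14.1):
# Rough sketch of a score-function (REINFORCE) training step,
# reusing log_prob_fn, q and optimizer defined above.
def score_function_step():
    with tf.GradientTape() as tape:
        loss = tfp.vi.monte_carlo_variational_loss(
            target_log_prob_fn=log_prob_fn,
            surrogate_posterior=q,
            sample_size=8,
            # assumed keyword, cf. the API link above
            gradient_estimator=tfp.vi.GradientEstimators.SCORE_FUNCTION
        )
    grads = tape.gradient(loss, q.trainable_variables)
    optimizer.apply_gradients(zip(grads, q.trainable_variables))
    return loss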
B- Gumbel-Softmax reparameterization
cf. this TFP API link: I could also reparameterize z as a variable y = tfd.RelaxedOneHotCategorical(...). The issue is: I need a proper categorical label to use in the definition of x, so AFAIK I need to do the following:
p_GS = tfd.JointDistributionNamed(
    model=dict(
        mu=tfd.Sample(
            tfd.Normal(0., 1.),
            sample_shape=(G,)
        ),
        y=tfd.RelaxedOneHotCategorical(
            temperature=1.,
            probs=tf.ones((G,)) / G
        ),
        x=lambda mu, y: tfd.Normal(
            loc=mu[tf.argmax(y)],
            scale=1.
        )
    )
)
...but this would just move the gradient-breaking problem to tf.argmax. This is where I may be missing something. Following the Gumbel-Softmax paper (Jang et al., 2016), I could then use the "STRAIGHT-THROUGH" (ST) strategy and "plug" the gradients of the variable tf.one_hot(tf.argmax(y)) (the "discrete y") onto y (the "continuous y").
But again I wonder: how to do this properly? I don't want to mix and match the gradients by hand, and I guess an autodiff backend is precisely meant to spare me this issue. How could I create a distribution that differentiates the forward direction (sampling a "discrete y") from the backward direction (gradient computed using the "continuous y")? I guess this is the intended usage of the tfd.RelaxedOneHotCategorical distribution, but I don't see this implemented anywhere in the API.
Should I implement this myself? How? Could I use something along the lines of tf.custom_gradient?
Actual question
Which solution (A, B, or another) is meant to be used in the TFP API, if any? How should I implement said solution efficiently?
So the idea was not to make this a Q&A, but I looked into this issue for a couple of days and here are my conclusions:
solution A (REINFORCE) is a possibility; it doesn't introduce any bias, but as far as I understood it, it has high variance in its vanilla form, making it prohibitively slow for most real-world tasks. As detailed a bit below, control variates can help tackle the variance issue;
solution B (Gumbel-Softmax) exists as well in the API, but I did not find any native way to make it work for hierarchical tasks. Below is my implementation.
First off, we need to reparameterize the joint distribution p as the KL between a discrete and a continuous distribution is ill-defined (as explained in the Maddison et al. (2017) paper). To not break the gradients, I implemented a simple one_hot_straight_through operation that converts the continuous RV y into a discrete RV z:
G = 2

@tf.custom_gradient
def one_hot_straight_through(y):
    depth = y.shape[-1]
    z = tf.one_hot(
        tf.argmax(
            y,
            axis=-1
        ),
        depth=depth
    )

    def grad(upstream):
        return upstream

    return z, grad
p = tfd.JointDistributionNamed(
    model=dict(
        mu=tfd.Sample(
            tfd.Normal(0., 1.),
            sample_shape=(G,)
        ),
        y=tfd.RelaxedOneHotCategorical(
            temperature=1.,
            probs=tf.ones((G,)) / G
        ),
        x=lambda mu, y: tfd.Normal(
            loc=tf.reduce_sum(
                one_hot_straight_through(y)
                * mu
            ),
            scale=1.
        )
    )
)
The variational distribution q follows the same reparameterization and the following code bit does work:
q_probs = tfp.util.TransformedVariable(
    tf.ones((G,)) / G,
    tfb.SoftmaxCentered(),
    name="q_probs"
)
q_loc = tf.Variable(tf.zeros((2,)), name="q_loc")
q_scale = tfp.util.TransformedVariable(
    1.,
    tfb.Exp(),
    name="q_scale"
)
q = tfd.JointDistributionNamed(
    model=dict(
        mu=tfd.Independent(
            tfd.Normal(q_loc, q_scale),
            reinterpreted_batch_ndims=1
        ),
        y=tfd.RelaxedOneHotCategorical(
            temperature=1.,
            probs=q_probs
        )
    )
)
def log_prob_fn(**kwargs):
    return p.log_prob(
        **kwargs,
        x=tf.constant([2.])
    )

optimizer = tf.optimizers.SGD()

@tf.function
def fit_vi():
    return tfp.vi.fit_surrogate_posterior(
        target_log_prob_fn=log_prob_fn,
        surrogate_posterior=q,
        optimizer=optimizer,
        num_steps=10,
        sample_size=8
    )

_ = fit_vi()
Now there are several issues with that design:
first off, we needed to reparameterize not only q but also p, so we "modify our target model". This results in our models p and q not outputting discrete RVs as originally intended but continuous RVs. I think the introduction of a hard option, like in the PyTorch implementation (sketched after this list), could be a nice addition to overcome this issue;
second, we introduce the burden of setting the temperature parameter, which makes the continuous RV y smoothly converge to its discrete counterpart z. An annealing strategy (reducing the temperature to reduce the bias introduced by the relaxation, at the cost of a higher variance) can be implemented, or the temperature can be learned online, akin to an entropy regularization (see Maddison et al. (2017) and Jang et al. (2017));
the gradients obtained with this estimator are biased, which is probably acceptable for most applications but is an issue in theory.
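For reference, the hard option mentioned in the first point is the one exposed by PyTorch's torch.nn.functional.gumbel_softmax (a minimal sketch of that PyTorch API, shown only for comparison; nothing equivalent is used in the TFP code above):
import torch
import torch.nn.functional as F

logits = torch.zeros(2, requires_grad=True)
# hard=True: the forward pass returns a one-hot (discrete) sample, while the backward
# pass uses the relaxed (continuous) sample, i.e. a straight-through estimator
y_hard = F.gumbel_softmax(logits, tau=1.0, hard=True)
(y_hard * torch.tensor([1.0, 2.0])).sum().backward()
print(y_hard, logits.grad)  # one-hot sample, and gradients still flow back to logits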
Recent methods like REBAR (Tucker et al., 2017) or RELAX (Grathwohl et al., 2018) can instead obtain unbiased estimators with a lower variance than the original REINFORCE, but they do so at the cost of introducing (learnable) control variates with separate losses. Modifications of the one_hot_straight_through function could probably implement this.
In conclusion, my opinion is that TensorFlow Probability's support for optimization over discrete RVs is too scarce at the moment, and that the API lacks native functions and tutorials to make it easier for the user.

Automatic Differentiation with respect to rank-based computations

I'm new to automatic differentiation programming, so this may be a naive question. Below is a simplified version of what I'm trying to solve.
I have two input arrays: a vector A of size N and a matrix B of shape (N, M), as well as a parameter vector theta of size M. I define a new array C(theta) = B * theta to get a new vector of size N. I then obtain the indices of elements that fall in the upper and lower quartile of C, and use them to create a new array A_low(theta) = A[lower quartile indices of C] and A_high(theta) = A[upper quartile indices of C]. Clearly these two do depend on theta, but is it possible to differentiate A_low and A_high w.r.t. theta?
My attempts so far seem to suggest no: I have tried the Python libraries autograd, JAX, and TensorFlow, but they all return a gradient of zero. (The approaches I have tried so far involve using argsort or extracting the relevant sub-arrays using tf.top_k.)
What I'm seeking help with is either a proof that the derivative is not defined (or cannot be analytically computed) or, if it does exist, a suggestion on how to estimate it. My eventual goal is to minimize some function f(A_low, A_high) w.r.t. theta.
This is the JAX computation that I wrote based on your description:
import numpy as np
import jax.numpy as jnp
import jax
from jax import lax

N = 10
M = 20

rng = np.random.default_rng(0)
A = jnp.array(rng.random((N,)))
B = jnp.array(rng.random((N, M)))
theta = jnp.array(rng.random(M))

def f(A, B, theta, k=3):
    C = B @ theta
    _, i_upper = lax.top_k(C, k)
    _, i_lower = lax.top_k(-C, k)
    return A[i_lower], A[i_upper]

x, y = f(A, B, theta)
dx_dtheta, dy_dtheta = jax.jacobian(f, argnums=2)(A, B, theta)
The derivatives are all zero, and I believe this is correct, because the change in value of the outputs does not depend on the change in value of theta.
But, you might ask, how can this be? After all, theta enters into the computation, and if you put in a different value for theta, you get different outputs. How could the gradient be zero?
What you must keep in mind, though, is that differentiation doesn't measure whether an input affects an output. It measures the change in output given an infinitesimal change in input.
Let's use a slightly simpler function as an example:
import jax
import jax.numpy as jnp
A = jnp.array([1.0, 2.0, 3.0])
theta = jnp.array([5.0, 1.0, 3.0])
def f(A, theta):
    return A[jnp.argmax(theta)]
x = f(A, theta)
dx_dtheta = jax.grad(f, argnums=1)(A, theta)
Here the result of differentiating f with respect to theta is all zero, for the same reasons as above. Why? If you make an infinitesimal change to theta, it will in general not affect the sort order of theta. Thus, the entries you choose from A do not change given an infinitesimal change in theta, and thus the derivative with respect to theta is zero.
Now, you might argue that there are circumstances where this is not the case: for example, if two values in theta are very close together, then certainly perturbing one even infinitesimally could change their respective rank. This is true, but the gradient resulting from this procedure is undefined (the change in output is not smooth with respect to the change in input). The good news is this discontinuity is one-sided: if you perturb in the other direction, there is no change in rank and the gradient is well-defined. In order to avoid undefined gradients, most autodiff systems will implicitly use this safer definition of a derivative for rank-based computations.
The result is that the value of the output does not change when you infinitesimally perturb the input, which is another way of saying the gradient is zero. And this is not a failure of autodiff – it is the correct gradient given the definition of differentiation that autodiff is built on. Moreover, were you to try changing to a different definition of the derivative at these discontinuities, the best you could hope for would be undefined outputs, so the definition that results in zeros is arguably more useful and correct.
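To see this concretely, here is a quick check with the small example above (a minimal sketch): perturbing theta by a small finite amount leaves the selected entry, and hence the output, unchanged, which is exactly what a zero gradient expresses.
eps = 1e-3
theta_perturbed = theta.at[0].add(eps)  # nudge one entry of theta

print(f(A, theta))            # 1.0, since argmax(theta) == 0 and A[0] == 1.0
print(f(A, theta_perturbed))  # still 1.0: the argmax, hence the output, did not change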

Keras custom loss function with binary (round) with tensorflow backend

I'm currently trying to implement a custom loss function (precision) with a binary outcome, but the TensorFlow backend refuses to use the round function, which would be needed to generate a '0' or '1'.
As far as I have investigated, this is because TensorFlow defines the gradient of round as None, and the loss function can't return None.
I have currently implemented this custom loss to get as close as possible to '0' or '1' in the R Keras interface:
precision_loss <- function(y_true, y_pred){
  y_pred_pos = K$clip(y_pred, 0, 1)
  # Custom sigmoid to generate '0' '1'
  y_pred_pos = K$maximum(0, K$minimum(1, (y_pred_pos + 0.0625) / 0.125))
  y_pred_neg = 1 - y_pred_pos
  y_pos = K$clip(y_true, 0, 1)
  # Custom sigmoid to generate '0' '1'
  y_pos = K$maximum(0, K$minimum(1, (y_pos + 0.0625) / 0.125))
  y_neg = 1 - y_pos
  # Generate confusion matrix counts
  tp = K$sum(y_pos * y_pred_pos)
  tn = K$sum(y_neg * y_pred_neg)
  fp = K$sum(y_neg * y_pred_pos)
  fn = K$sum(y_pos * y_pred_neg)
  return(1 - (tp / (tp + fp + K$epsilon())))
}
Notice the "sigmoid": K$maximum(0, K$minimum(1, (y_pos + 0.0625) / 0.125))
What I wanted to implement is a workaround for this one:
precision_loss <- function(y_true, y_pred){
  y_pred_pos = K$round(K$clip(y_pred, 0, 1))
  y_pred_neg = 1 - y_pred_pos
  y_pos = K$round(K$clip(y_true, 0, 1))
  y_neg = 1 - y_pos
  # Generate confusion matrix counts
  tp = K$sum(K$clip(y_pos * y_pred_pos, 0, 1))
  tn = K$sum(K$clip(y_neg * y_pred_neg, 0, 1))
  fp = K$sum(K$clip(y_neg * y_pred_pos, 0, 1))
  fn = K$sum(K$clip(y_pos * y_pred_neg, 0, 1))
  return(1 - (tp / (tp + fp + K$epsilon())))
}
Does anyone have an alternative implementation that generates binary outcomes in the loss function without using round?
PS: in custom metric functions, round is allowed.
In order to build a binary loss function, it wouldn't be enough to just build the custom loss function itself. You would also have to pre-define the gradients.
Your high-dimensional loss function would be zero for some points and one for all others. At every point of discontinuity in this space it would be impossible to analytically compute a gradient (the concept of a gradient doesn't even exist there), so you would have to define one yourself. And at all the continuous points in this space (e.g. an open set in which all loss values are 1), the gradient would exist but would be zero, so you would also have to pre-define the gradient values, otherwise your weights wouldn't move at all.
That means either way you would have to define your own custom "gradient" computation function that replaces Keras' (i.e. TensorFlow's) automatic differentiation engine for that particular node in the graph (the loss function node).
You could certainly achieve this by modifying your local copy of Keras or TensorFlow, but nothing good can come from it.
Also, even if you managed to do this, consider this: If your loss function returns only 0 or 1, that means it can only distinguish between two states: The model's prediction is either 100% correct (0 loss) or it is not 100% correct (1 loss). The magnitude of the gradient would have to be the same for all non-100% cases. Is that a desirable property?
Your quasi-binary sigmoid solution has the same problem: The gradient will be almost zero almost everywhere, and in the few points where it won't be almost zero, it will be almost infinity. If you try to train a model with that loss function, it won't learn anything.
As you noticed, a custom loss function needs to be based on functions that have their gradients defined (in order to minimise the loss), which is not necessary for a simple metric. Some functions like "round" and "sign" are difficult to use in a loss function since their gradients are either null almost everywhere or infinite, which is not helpful for minimisation. That's probably why their gradients are not defined by default.
Then, you have two options:
Option 1: you use the round function but you need to add your custom gradient for round, to substitute it in backend.
Option 2: you define another loss function without using round
You chose option 2, which I think is the best option. But your "sigmoid" is very linear, so it's probably not a good approximation of your "round" function. You could use an actual sigmoid (slower, due to the exponential), or you could obtain a similar result with a modified softsign:
max_gradient=100
K$maximum(0,K$minimum(1,0.5*(1+(max_gradient*y_pos)/(1+ max_gradient*abs(y_pos)))))
The max_gradient coefficient can be used to make the edge sharper around 0.5. It defines the maximum gradient at 0.5.

Constrained np.polyfit

I am trying to fit a quadratic to some experimental data using polyfit in numpy. I am looking to get a concave curve, and hence want to make sure that the coefficient of the quadratic term is negative. Also, the fit itself is weighted, as in there are weights on the points. Is there an easy way to do that? Thanks.
The use of weights is described here (numpy.polyfit).
Basically, you need a weight vector with the same length as x and y.
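For example (a minimal sketch with made-up data and weights):
import numpy as np

x = np.array([0., 1., 2., 3., 4.])
y = np.array([0.1, 0.9, 1.6, 1.8, 1.7])
w = np.array([1., 1., 2., 2., 1.])     # one weight per point

coeffs = np.polyfit(x, y, deg=2, w=w)  # coeffs[0] is the quadratic coefficient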
To avoid the wrong sign in the coefficient, you could use a fit function definition like
def fitfunc(x, a, b, c):
    return -1 * abs(a) * x**2 + b * x + c
This will give you a negative coefficient for x**2 at all times.
You can use curve_fit (scipy.optimize.curve_fit) with a function like this.
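A minimal sketch of how that might look, reusing the data from the example above (curve_fit takes per-point uncertainties via sigma, so under the usual convention a weight w maps to sigma = 1/sqrt(w)):
import numpy as np
from scipy.optimize import curve_fit

def fitfunc(x, a, b, c):
    # -abs(a) forces the quadratic coefficient to be non-positive (concave fit)
    return -1 * abs(a) * x**2 + b * x + c

x = np.array([0., 1., 2., 3., 4.])
y = np.array([0.1, 0.9, 1.6, 1.8, 1.7])
w = np.array([1., 1., 2., 2., 1.])

popt, pcov = curve_fit(fitfunc, x, y, sigma=1.0 / np.sqrt(w))
a, b, c = popt
quad_coeff = -abs(a)  # the effective x**2 coefficient, guaranteed <= 0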
Or you can run polyfit with degree 2, and if the quadratic coefficient comes out greater than 0, run a linear polyfit instead (polyfit with degree 1).