TensorFlow: What exact formula is applied in `tf.nn.sparse_softmax_cross_entropy_with_logits`?

I tried to manually recompute the outputs of this function so I created a minimal example:
logits = tf.pack(np.array([[[[0,1,2]]]],dtype=np.float32)) # img of shape (1, 1, 1, 3)
labels = tf.pack(np.array([[[1]]],dtype=np.int32)) # gt of shape (1, 1, 1)
softmaxCrossEntropie = tf.nn.sparse_softmax_cross_entropy_with_logits(logits,labels)
softmaxCrossEntropie.eval() # --> output is [1.41]
According to my own calculation, however, I only get [1.23].
When calculating manually, I simply apply softmax,
$$\sigma(x_j) = \frac{e^{x_j}}{\sum_k e^{x_k}},$$
and cross-entropy,
$$H(p, q) = -\sum_x p(x) \log q(x),$$
where q(x) = sigma(x_j) or (1-sigma(x_j)), depending on whether j is the correct ground-truth class or not, and p(x) = labels, which are then one-hot encoded.
I'm not sure where the difference originates. I can't imagine that some epsilon causes such a big difference. Does anyone know where I can look up which exact formula TensorFlow uses?
Is the source code of that exact part available?
I could only find nn_ops.py, but it only uses another function called gen_nn_ops._sparse_softmax_cross_entropy_with_logits which I couldn't find on github...

Usually, p(x) in the cross-entropy equation is the true distribution, while q(x) is the distribution obtained from softmax. So if p(x) is one-hot (and it must be, otherwise sparse cross-entropy could not be applied), the cross-entropy is just the negative log of the probability assigned to the true category.
In your example, softmax(logits) is a vector with values [0.09003057, 0.24472847, 0.66524096], so the loss is -log(0.24472847) = 1.4076059 which is exactly what you got as output.
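The arithmetic above can be checked without TensorFlow. The following is a minimal pure-Python sketch; the helper names `softmax` and `sparse_softmax_cross_entropy` are chosen for illustration and are not TensorFlow's internal implementation.

```python
import math

def softmax(logits):
    # Subtracting the max is a standard stability trick; it does not change the result.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_softmax_cross_entropy(logits, label):
    # With a one-hot target, cross-entropy reduces to -log(q[label]).
    return -math.log(softmax(logits)[label])

probs = softmax([0.0, 1.0, 2.0])
loss = sparse_softmax_cross_entropy([0.0, 1.0, 2.0], 1)
print(probs)  # probabilities for the three classes
print(loss)   # loss for ground-truth class 1
```

The only term that contributes is the one for the true class; the (1 - sigma) terms for the other classes do not appear in this formula.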

Related

Automatic Differentiation with respect to rank-based computations

I'm new to automatic differentiation programming, so this may be a naive question. Below is a simplified version of what I'm trying to solve.
I have two input arrays - a vector A of size N and a matrix B of shape (N, M), as well a parameter vector theta of size M. I define a new array C(theta) = B * theta to get a new vector of size N. I then obtain the indices of elements that fall in the upper and lower quartile of C, and use them to create a new array A_low(theta) = A[lower quartile indices of C] and A_high(theta) = A[upper quartile indices of C]. Clearly these two do depend on theta, but is it possible to differentiate A_low and A_high w.r.t theta?
My attempts so far seem to suggest no: I have tried the Python libraries autograd, JAX and TensorFlow, but they all return a gradient of zero. (The approaches I have tried so far involve using argsort or extracting the relevant sub-arrays using tf.top_k.)
What I'm seeking help with is either a proof that the derivative is not defined (or cannot be analytically computed) or if it does exist, a suggestion on how to estimate it. My eventual goal is to minimize some function f(A_low, A_high) wrt theta.
This is the JAX computation that I wrote based on your description:
import numpy as np
import jax
import jax.numpy as jnp
from jax import lax
N = 10
M = 20
rng = np.random.default_rng(0)
A = jnp.array(rng.random((N,)))
B = jnp.array(rng.random((N, M)))
theta = jnp.array(rng.random(M))
def f(A, B, theta, k=3):
    C = B @ theta  # matrix-vector product, shape (N,)
    _, i_upper = lax.top_k(C, k)   # indices of the k largest entries of C
    _, i_lower = lax.top_k(-C, k)  # indices of the k smallest entries of C
    return A[i_lower], A[i_upper]
x, y = f(A, B, theta)
dx_dtheta, dy_dtheta = jax.jacobian(f, argnums=2)(A, B, theta)
The derivatives are all zero, and I believe this is correct, because the change in value of the outputs does not depend on the change in value of theta.
But, you might ask, how can this be? After all, theta enters into the computation, and if you put in a different value for theta, you get different outputs. How could the gradient be zero?
What you must keep in mind, though, is that differentiation doesn't measure whether an input affects an output. It measures the change in output given an infinitesimal change in input.
Let's use a slightly simpler function as an example:
import jax
import jax.numpy as jnp
A = jnp.array([1.0, 2.0, 3.0])
theta = jnp.array([5.0, 1.0, 3.0])
def f(A, theta):
    return A[jnp.argmax(theta)]
x = f(A, theta)
dx_dtheta = jax.grad(f, argnums=1)(A, theta)
Here the result of differentiating f with respect to theta is all zero, for the same reasons as above. Why? If you make an infinitesimal change to theta, it will in general not affect the sort order of theta. Thus, the entries you choose from A do not change given an infinitesimal change in theta, and thus the derivative with respect to theta is zero.
Now, you might argue that there are circumstances where this is not the case: for example, if two values in theta are very close together, then certainly perturbing one even infinitesimally could change their respective rank. This is true, but the gradient resulting from this procedure is undefined (the change in output is not smooth with respect to the change in input). The good news is this discontinuity is one-sided: if you perturb in the other direction, there is no change in rank and the gradient is well-defined. In order to avoid undefined gradients, most autodiff systems will implicitly use this safer definition of a derivative for rank-based computations.
The result is that the value of the output does not change when you infinitesimally perturb the input, which is another way of saying the gradient is zero. And this is not a failure of autodiff – it is the correct gradient given the definition of differentiation that autodiff is built on. Moreover, were you to try changing to a different definition of the derivative at these discontinuities, the best you could hope for would be undefined outputs, so the definition that results in zeros is arguably more useful and correct.
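The same argument can be illustrated numerically with finite differences. This is a small pure-Python sketch (the helper names `pick` and `finite_difference` are made up for this illustration): a tiny perturbation of theta never changes which entry of A is selected, so the difference quotient is zero, while a large perturbation produces a jump that depends on the step size and is not a derivative.

```python
def pick(A, theta):
    # Return the entry of A at the position of the largest theta.
    i_max = max(range(len(theta)), key=lambda i: theta[i])
    return A[i_max]

def finite_difference(A, theta, eps):
    # Forward-difference estimate of d pick / d theta[i] for every i.
    base = pick(A, theta)
    grads = []
    for i in range(len(theta)):
        bumped = list(theta)
        bumped[i] += eps
        grads.append((pick(A, bumped) - base) / eps)
    return grads

A = [1.0, 2.0, 3.0]
theta = [5.0, 1.0, 3.0]

small = finite_difference(A, theta, eps=1e-6)  # the ranking is unchanged
large = finite_difference(A, theta, eps=3.0)   # theta[2] overtakes theta[0]
print(small)
print(large)
```

The nonzero entry in `large` scales with the chosen step, which is why it says nothing about the derivative at theta.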

Why does the tf.contrib.layers.instance_norm layer contain a StopGradient operation?

Why does the tf.contrib.layers.instance_norm layer contain a StopGradient operation, i.e. why is it needed?
There seems to be a StopGradient even in the simpler tf.nn.moments (which can be a building block of tf.contrib.layers.instance_norm).
x_m, x_v = tf.nn.moments(x, [1, 2], keep_dims=True)
I also found a note on StopGradient in the tf.nn.moments source code:
# The dynamic range of fp16 is too limited to support the collection of
# sufficient statistics. As a workaround we simply perform the operations
# on 32-bit floats before converting the mean and variance back to fp16
y = math_ops.cast(x, dtypes.float32) if x.dtype == dtypes.float16 else x
# Compute true mean while keeping the dims for proper broadcasting.
mean = math_ops.reduce_mean(y, axes, keepdims=True, name="mean")
# sample variance, not unbiased variance
# Note: stop_gradient does not change the gradient that gets
# backpropagated to the mean from the variance calculation,
# because that gradient is zero
variance = math_ops.reduce_mean(
    math_ops.squared_difference(y, array_ops.stop_gradient(mean)),
    axes,
    keepdims=True,
    name="variance")
So is it a kind of optimisation, because the gradient is always zero?
Attempt at an answer.
This design tells us that when minimizing the second moment we would not want to propagate gradients through the first moment. Does that make sense? If we tried to minimize E[x^2]-E[x]^2, we would minimize E[x^2] while simultaneously maximizing E[x]^2. The first term would decrease the absolute value of each element (drag them toward the center). The second term would shift all values by the same gradient, which would do nothing to minimize the variance but might negatively affect other gradient paths.
So we don't propagate the gradient of the second moment through the first moment, because this gradient would not affect the second moment at all, at least when using plain SGD.
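The comment in the quoted source ("that gradient is zero") can be checked numerically. For the variance (1/N) * sum((y_i - m)^2), the partial derivative with respect to m is -(2/N) * sum(y_i - m), which vanishes when m is the mean. The sketch below is pure Python with made-up helper names and sample values; it compares finite-difference gradients with the mean recomputed from y and with the mean held fixed (mimicking stop_gradient).

```python
def variance(y, mean=None):
    # Biased (population) variance. If `mean` is given it is held fixed,
    # mimicking stop_gradient(mean); otherwise it is recomputed from y.
    m = sum(y) / len(y) if mean is None else mean
    return sum((v - m) ** 2 for v in y) / len(y)

def grad(fn, y, eps=1e-6):
    # Central finite differences with respect to each element of y.
    out = []
    for i in range(len(y)):
        up = list(y)
        up[i] += eps
        down = list(y)
        down[i] -= eps
        out.append((fn(up) - fn(down)) / (2 * eps))
    return out

y = [0.5, -1.0, 2.0, 3.5]  # arbitrary sample values
fixed_mean = sum(y) / len(y)

g_full = grad(lambda v: variance(v), y)              # mean depends on y
g_stop = grad(lambda v: variance(v, fixed_mean), y)  # mean treated as a constant
g_analytic = [2 * (v - fixed_mean) / len(y) for v in y]
print(g_full)
print(g_stop)
```

Both gradients agree with 2 * (y_i - mean) / N, so in this sketch stopping the gradient through the mean changes nothing about the result and only removes a backward path.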

Is tf.fake_quant_with_min_max_vars a differentiable function?

Quantization schemes are generally non-differentiable because they pass through a thresholding function such as round or sign. This means we cannot obtain useful gradients for the trainable variables via the chain rule.
Instead, we can use a trick called the 'straight-through estimator', which enables us to back-propagate gradients to the individual trainable variables.
One such method is tf.fake_quant_with_min_max_vars. The advantages of this format are that it can represent arbitrary magnitudes of ranges, the ranges don't have to be symmetrical, it can represent signed and unsigned values, and the linear spread makes doing multiplications straightforward (Blog, Paper).
So my question is: can we differentiate the fake_quant function? And if so, does this function apply the 'straight-through estimator'?
I experimented a little with this snippet:
x = tf.cast(np.random.normal(0, 1, (10, 10)), tf.float32)
x_q = tf.fake_quant_with_min_max_vars(x, min=tf.reduce_min(x), max=tf.reduce_max(x), num_bits=3)
grad = tf.gradients(x_q, x)
In that case, almost every element of grad has the value 1, which means the gradient is passed straight through.
However, a few samples sometimes have gradient 0, or another constant such as 2, 3, 4...
What am I missing?
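For reference, the straight-through estimator mentioned in the question can be sketched in a few lines of pure Python. This is a simplified illustration with hypothetical helper names (`fake_quant`, `ste_grad`) and a fixed range; it is not TensorFlow's implementation and ignores details such as how the range endpoints are adjusted or how gradients flow to min and max.

```python
def fake_quant(x, lo, hi, num_bits=3):
    # Forward pass: clip to [lo, hi], then snap to one of 2**num_bits levels.
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels
    clipped = min(max(x, lo), hi)
    return lo + round((clipped - lo) / scale) * scale

def ste_grad(x, lo, hi):
    # Backward pass under a common straight-through convention (an assumption
    # of this sketch): treat the quantizer as the identity inside the range
    # and as a constant outside it.
    return 1.0 if lo <= x <= hi else 0.0

def true_slope(x, lo, hi, eps=1e-6):
    # Finite-difference slope of the actual staircase function.
    return (fake_quant(x + eps, lo, hi) - fake_quant(x - eps, lo, hi)) / (2 * eps)

q = fake_quant(0.3, -1.0, 1.0)
print(q, true_slope(0.3, -1.0, 1.0), ste_grad(0.3, -1.0, 1.0))
print(fake_quant(1.5, -1.0, 1.0), ste_grad(1.5, -1.0, 1.0))
```

The staircase itself has slope zero away from the level boundaries, while the straight-through rule reports 1 inside the range and 0 outside it.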

Keras custom loss function with binary (round) with tensorflow backend

I'm currently trying to implement a custom loss function (precision) with a binary outcome, but the TensorFlow backend refuses to use the round function, which is needed to generate a '0' or '1'.
As far as I have investigated, this is because TensorFlow defines the gradient of round as None, and the loss function can't return None.
I have currently implemented this custom loss to get as close as possible to '0' or '1' in the R Keras interface.
precision_loss<-function(y_true,y_pred){
  y_pred_pos = K$clip(y_pred, 0, 1)
  #Custom sigmoid to generate '0' '1'
  y_pred_pos = K$maximum(0,K$minimum(1,(y_pred_pos+0.0625)/0.125))
  y_pred_neg = 1 - y_pred_pos
  y_pos = K$clip(y_true, 0, 1)
  #Custom sigmoid to generate '0' '1'
  y_pos = K$maximum(0,K$minimum(1,(y_pos+0.0625)/0.125))
  y_neg = 1 - y_pos
  #Generate confusion matrix counts
  tp = K$sum(y_pos*y_pred_pos)
  tn = K$sum(y_neg*y_pred_neg)
  fp = K$sum(y_neg*y_pred_pos)
  fn = K$sum(y_pos*y_pred_neg)
  return(1-(tp/(tp+fp+K$epsilon())))
}
Notice the "sigmoid" : K$maximum(0,K$minimum(1,(y_pos+0.0625)/0.125))
What I wanted to implement is a workaround for this one:
precision_loss<-function(y_true, y_pred){
  y_pred_pos = K$round(K$clip(y_pred, 0, 1))
  y_pred_neg = 1 - y_pred_pos
  y_pos = K$round(K$clip(y_true, 0, 1))
  y_neg = 1 - y_pos
  #Generate confusion matrix counts
  tp = K$sum(K$clip(y_pos * y_pred_pos,0,1))
  tn = K$sum(K$clip(y_neg * y_pred_neg,0,1))
  fp = K$sum(K$clip(y_neg * y_pred_pos,0,1))
  fn = K$sum(K$clip(y_pos * y_pred_neg,0,1))
  return(1-(tp/(tp+fp+K$epsilon())))
}
Does anyone have an alternative implementation that generates binary outcomes in the loss function without using round?
P.S.: In a custom metric function, round is allowed.
In order to build a binary loss function, it wouldn't be enough to just build the custom loss function itself. You would also have to pre-define the gradients.
Your high-dimensional loss function would be zero for some points and one for all others. For all non-continuous points in this space, it would be impossible to analytically compute a gradient (i.e. the concept of a gradient doesn't even exist for such points), so you would have to just define one. And for all the continuous points in this space (e.g. an open set in which all loss values are 1), the gradient would exist, but it would be zero, so you would also have to pre-define the gradient values, otherwise your weights wouldn't move at all.
That means either way you would have to define your own custom "gradient" computation function that replaces Keras' (i.e. TensorFlow's) automatic differentiation engine for that particular node in the graph (the loss function node).
You could certainly achieve this by modifying your local copy of Keras or TensorFlow, but nothing good can come from it.
Also, even if you managed to do this, consider this: If your loss function returns only 0 or 1, that means it can only distinguish between two states: The model's prediction is either 100% correct (0 loss) or it is not 100% correct (1 loss). The magnitude of the gradient would have to be the same for all non-100% cases. Is that a desirable property?
Your quasi-binary sigmoid solution has the same problem: The gradient will be almost zero almost everywhere, and in the few points where it won't be almost zero, it will be almost infinity. If you try to train a model with that loss function, it won't learn anything.
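The vanishing gradient of the clipped-linear "sigmoid" from the question can be seen by evaluating its slope at a few points. The sketch below is a plain-Python transcription of that one expression (the function names are made up for illustration) and uses finite differences rather than Keras.

```python
def quasi_binary(y):
    # Python transcription of the question's expression:
    # K$maximum(0, K$minimum(1, (y + 0.0625) / 0.125))
    return max(0.0, min(1.0, (y + 0.0625) / 0.125))

def slope(fn, y, eps=1e-6):
    # Central finite-difference estimate of d fn / d y.
    return (fn(y + eps) - fn(y - eps)) / (2 * eps)

for y in (0.03, 0.2, 0.5, 0.8):
    print(y, quasi_binary(y), slope(quasi_binary, y))
```

For inputs already clipped to [0, 1], the slope is nonzero only on the narrow interval below 0.0625 and exactly zero above it, so most predictions would receive no gradient signal from this term.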
As you noticed, a custom loss function needs to be based on functions that have their gradients defined (in order to minimise the loss), which is not necessary for a simple metric. Some functions like “round” and “sign” are difficult to use in a loss function since their gradients are either zero everywhere or infinite, which is not helpful for minimisation. That’s probably why their gradients are not defined by default.
Then, you have two options:
Option 1: you use the round function but you need to add your custom gradient for round, to substitute it in backend.
Option 2: you define another loss function without using round
You chose option 2, which I think is the best option. But your “sigmoid” is very linear, so it is probably not a good approximation of your “round” function. You could use an actual sigmoid, which is slower due to the exponential, but you could obtain a similar result with a modified softsign:
max_gradient=100
K$maximum(0,K$minimum(1,0.5*(1+(max_gradient*y_pos)/(1+ max_gradient*abs(y_pos)))))
The max_gradient coefficient can be used to make the edge around 0.5 sharper. It defines the maximum gradient at 0.5.
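To see how such a softsign-based step behaves, here is a small pure-Python sketch. Note one assumption that is not in the R snippet above: the input is shifted by 0.5 so that the steep edge actually sits at 0.5 for inputs in [0, 1]. The name `soft_round` is hypothetical.

```python
def soft_round(y, max_gradient=100.0):
    # Softsign-based approximation of round() for y in [0, 1].
    # The shift by 0.5 is an assumption of this sketch; it centres the edge at 0.5.
    x = y - 0.5
    value = 0.5 * (1.0 + (max_gradient * x) / (1.0 + max_gradient * abs(x)))
    return max(0.0, min(1.0, value))

for y in (0.0, 0.1, 0.45, 0.5, 0.55, 0.9, 1.0):
    print(y, soft_round(y))
```

In this form the slope at y = 0.5 is max_gradient / 2, and unlike round the function keeps a small nonzero gradient everywhere, which is what makes it usable inside a loss.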

pymc python change point detection for small probabilities. ZeroProbability Error

I am trying to use pymc to find a change point in a time-series. The value I am looking at over time is probability to "convert" which is very small, 0.009 on average with a range of 0.001-0.016.
I give the two probabilities a uniform distribution as a prior between zero and the max observation.
alpha = df.cnvrs.max() # Set upper uniform
center_1_c = pm.Uniform("center_1_c", 0, alpha)
center_2_c = pm.Uniform("center_2_c", 0, alpha)
day_c = pm.DiscreteUniform("day_c", lower=1, upper=n_days)
@pm.deterministic
def lambda_(day_c=day_c, center_1_c=center_1_c, center_2_c=center_2_c):
    out = np.zeros(n_days)
    out[:day_c] = center_1_c  # rate before the change point
    out[day_c:] = center_2_c  # rate after the change point
    return out
observation = pm.Uniform("obs", lambda_, value=df.cnvrs.values, observed=True)
When I run this code I get:
ZeroProbability: Stochastic obs's value is outside its support,
or it forbids its parents' current values.
I'm pretty new to pymc so not sure if I'm missing something obvious. My guess is I might not have appropriate distributions for modelling small probabilities.
It's impossible to tell where you've introduced this bug—and programming is off-topic here, in any case—without more of your output. But there is a statistical issue here: You've somehow constructed a model that cannot produce either the observed variables or the current sample of latent ones.
To give a simple example, say you have a dataset with negative values, and you've assumed it to be gamma distributed; this will produce an error, because the data has zero probability under a gamma. Similarly, an error will be thrown if an impossible value is sampled during an MCMC chain.
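The "outside its support" condition can be made concrete with a uniform likelihood. The sketch below is plain Python with a hypothetical helper `uniform_logp` and made-up observations; it is not PyMC's API, only an illustration of why a single observation outside the bounds drives the log-probability to minus infinity.

```python
import math

def uniform_logp(values, lower, upper):
    # Log-density of independent Uniform(lower, upper) observations.
    # Any value outside [lower, upper] has zero probability, i.e. log p = -inf.
    if upper <= lower or any(v < lower or v > upper for v in values):
        return -math.inf
    return -len(values) * math.log(upper - lower)

observed = [0.001, 0.009, 0.016]  # made-up conversion rates

inside = uniform_logp(observed, 0.0, 0.016)     # every value lies within the bounds
outside = uniform_logp(observed, 0.005, 0.016)  # 0.001 falls below the lower bound
print(inside)
print(outside)
```

A sampler that evaluates such a likelihood at parameter values excluding even one data point has nothing to work with, which is the situation the error message describes.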