I'd like to make an algorithm, f, that takes a, x and y, and returns b in base y as opposed to a in base x. I can't seem to make sense of how to do this, and I've made multiple attempts. How does one go about that?
def f(a, x, y):
    pass  # should convert (a in base x) to (b in base y)
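One common way to approach this, shown here only as a minimal sketch: first read the digits of a in base x into an ordinary integer, then peel off base-y digits with repeated division. The sketch assumes a is a non-negative integer given as a string, that both bases lie between 2 and 36, and that digits run 0-9 then a-z; the DIGITS name is illustrative and not part of the question.

```python
import string

DIGITS = string.digits + string.ascii_lowercase  # assumed alphabet: 0-9 then a-z

def f(a, x, y):
    """Convert the digit string `a`, written in base `x`, to a digit string in base `y`."""
    if not (2 <= x <= 36 and 2 <= y <= 36):
        raise ValueError("bases must be between 2 and 36")
    # Step 1: base x -> integer, accumulating one digit at a time.
    n = 0
    for ch in a.lower():
        digit = DIGITS.index(ch)  # raises ValueError for unknown characters
        if digit >= x:
            raise ValueError(f"digit {ch!r} is not valid in base {x}")
        n = n * x + digit
    # Step 2: integer -> base y, collecting remainders (least significant first).
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, r = divmod(n, y)
        out.append(DIGITS[r])
    return "".join(reversed(out))
```

For instance, f("ff", 16, 2) goes through the integer 255 on its way to the binary digit string.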
This is a question about floating point analysis and numerical stability. Say I have two [d x 1] vectors a and x and a scalar b such that a.T @ x < b (where @ denotes a dot product).
I additionally have a unit [d x 1] vector d. I want to derive the maximum scalar s so that a.T @ (x + s * d) < b. Without floating point errors this is trivial:
s = (b - a.T @ x) / (a.T @ d).
With floating point errors, though, this s is not guaranteed to satisfy a.T @ (x + s * d) < b.
Currently my solution is to use a stabilized division, which helps:
s = sign(a.T @ x) * sign(a.T @ d) * exp(log(abs(a.T @ x) + eps) - log(abs(a.T @ d) + eps)).
But this s still does not always satisfy the inequality. I can check how much this fails by:
diff = a.T @ (x + s * d) - b
And then "push" that diff back through: (x + s * d - a.T @ (diff + eps2)). Even with both the stable division and pushing the diff back, the solution sometimes fails to satisfy the inequality. So these attempts at a solution are both hacky and do not actually work. I think there is probably some way to do this that would work and be guaranteed to minimally satisfy the inequality under floating point imprecision, but I'm not sure what it is. The solution needs to be very efficient because this operation will be run trillions of times.
Edit: Here is an example in numpy of this issue coming into play, because a commenter had some trouble replicating this problem.
import numpy as np

np.random.seed(1)
p, n = 10, 1
k = 3
x = np.random.normal(size=(p, n))
d = np.random.normal(size=(p, n))
d /= np.sum(d, axis=0)
a, b = np.hstack([np.zeros(p - k), np.ones(k)]), 1
s = (b - a.T @ x) / (a.T @ d)
Running this code gives a case where a.T @ (s * d + x) > b, which fails to satisfy the constraint. Instead we have:
>>> diff = a.T @ (x + s * d) - b
>>> diff
array([8.8817842e-16])
The question is about how to avoid this overflow.
The problem you are dealing with appears to be mainly a rounding issue rather than a numerical stability issue. When a floating-point operation is performed, the result has to be rounded to fit in the standard floating-point representation. The IEEE-754 standard specifies multiple rounding modes. The default one is typically round-to-nearest.
This means (b - a.T @ x) / (a.T @ d) and a.T @ (x + s * d) can each be rounded to the previous or next floating-point value. As a result, a slight imprecision is introduced in the computation. This imprecision is typically 1 unit of least precision (ULP). 1 ULP basically means a relative error of 1.1e-16 for double-precision numbers.
In practice, every operation can be rounded, not just the whole expression, so the error is typically a few ULP. For operations like addition, the rounding tends to mitigate the error, while for others like subtraction, the error can increase dramatically. In your case, the error seems to be due only to the accumulation of small errors in each operation.
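To make the size of these roundings concrete, here is a small sketch using only the Python standard library (math.ulp needs Python 3.9 or later). It shows the spacing of doubles near 1.0, the half-spacing that corresponds to the 1.1e-16 relative error mentioned above, and a short expression where per-operation rounding is already visible. The variable names are illustrative.

```python
import math
import sys

# Spacing between 1.0 and the next double: 2**-52, about 2.2e-16.
eps = sys.float_info.epsilon

# Round-to-nearest is off by at most half that spacing in relative terms,
# i.e. 2**-53, about 1.1e-16.
unit_roundoff = eps / 2

# Each operation rounds on its own, so even one addition can miss its target.
lhs = 0.1 + 0.2
rounding_visible = lhs != 0.3

# Size of the miss, measured in spacings of doubles near 0.3.
error_in_ulps = abs(lhs - 0.3) / math.ulp(0.3)

print(eps, unit_roundoff, rounding_visible, error_in_ulps)
```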
The floating-point units of processors can be controlled in low-level languages. Numpy also provides a way to find the next/previous floating-point value. Based on this, you can round the value up or down for some parts of the expression so that s is smaller than the target theoretical value. That being said, this is not so easy, since some of the computed values can be negative, which reverses the direction of the rounding needed. One can round positive and negative values differently, but the resulting code will certainly not be efficient in the end.
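As a rough illustration of the next/previous-value idea, the sketch below uses np.nextafter to move an already computed s to the neighbouring double toward minus infinity. It assumes a @ d > 0, so that a smaller s can only lower the exact left-hand side; the helper name step_down and the sample value 11.25 are made up for this example, and a single step is not guaranteed to be enough.

```python
import numpy as np

def step_down(s, n_ulps=1):
    """Return the double that lies n_ulps representable values below s."""
    for _ in range(n_ulps):
        s = np.nextafter(s, -np.inf)  # neighbouring double toward -infinity
    return s

s = 11.25                 # hypothetical value of the naive quotient
s_prev = step_down(s)     # largest double strictly below s
gap = s - s_prev          # one ULP at this magnitude
```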
An alternative solution is to compute the theoretical error bound and subtract it from s. That being said, this error depends on the computed values and on the actual algorithm used for the summation (e.g. naive sum, pairwise, Kahan, etc.). For example, the naive algorithm and the pairwise one (used by Numpy) are sensitive to the standard deviation of the input values: the higher the std-dev, the bigger the resulting error. This solution only works if you know exactly the distribution of the input values and/or their bounds. Another issue is that it tends to over-estimate the error bounds and gives just an estimate of the average error.
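For reference, one textbook worst-case bound for a length-n floating-point dot product is |fl(a.x) - a.x| <= g_n * sum(|a_i| * |x_i|), with g_n = n*u / (1 - n*u) and u = 2**-53 for doubles. The sketch below computes it. This is a generic bound rather than anything taken from the question, the function name is illustrative, and it covers only the dot product itself, not the rounding in x + s * d or in the division.

```python
import numpy as np

def dot_error_bound(a, x):
    """Worst-case rounding error bound for the computed dot product a @ x."""
    n = a.shape[0]
    u = np.finfo(np.float64).eps / 2      # unit roundoff, 2**-53
    gamma_n = n * u / (1 - n * u)
    # |a| @ |x| is itself a rounded dot product; the extra factor keeps
    # the returned value on the conservative side.
    return gamma_n * (1 + 2 * gamma_n) * float(np.abs(a) @ np.abs(x))
```

As the answer points out, a bound of this kind is usually far larger than the error actually observed.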
Another alternative method is to rewrite the expression by replacing s with s+h or s*h and to find the value of h based on the already computed s and the other parameters. This method is a bit like a predictor-corrector. Note that h may itself be imprecise due to floating-point errors.
With the absolute correction method we get:
h_abs = (b - a @ (x + s * d)) / (a @ d)
s += h_abs
With the relative correction method we get:
h_rel = (b - a @ x) / (a @ (s * d))
s *= h_rel
Here are the absolute differences obtained with the initial computation and with the two correction methods:
Initial method: 8.8817842e-16 (8 ULP)
Absolute method: -8.8817842e-16 (8 ULP)
Relative method: -8.8817842e-16 (8 ULP)
I am not sure either of the two methods is guaranteed to fulfil the requirements, but a robust approach could be to select the smaller of the two s values. At least, the results are quite encouraging, since the requirement is fulfilled by both methods with the provided inputs.
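A minimal sketch of that "take the smaller of the two" idea follows. It assumes 1-D arrays a, x and d (so every dot product is a scalar), a @ x < b, and a @ d > 0; the function name corrected_step is hypothetical. Like the two corrections themselves, it comes with no guarantee that the constraint holds afterwards.

```python
import numpy as np

def corrected_step(a, x, d, b):
    """Apply both corrections to the naive s and keep the smaller result."""
    s = (b - a @ x) / (a @ d)                  # naive predictor
    h_abs = (b - a @ (x + s * d)) / (a @ d)    # additive corrector
    h_rel = (b - a @ x) / (a @ (s * d))        # multiplicative corrector
    return min(s + h_abs, s * h_rel)
```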
A good way to generate more precise results is to use the decimal package, which provides arbitrary precision at the expense of much slower execution. This is particularly useful for comparing practical results with more precise ones.
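A possible way to set up such a reference computation is sketched below. It relies on the fact that Decimal(float) converts a double exactly, and it picks 80 significant digits, an arbitrary choice that is much more than a double carries. The function name reference_diff is illustrative, and inputs are assumed to be plain sequences or 1-D arrays of floats.

```python
from decimal import Decimal, getcontext

getcontext().prec = 80  # assumed precision; far beyond double precision

def reference_diff(a, x, d, s, b):
    """Evaluate a . (x + s*d) - b in high-precision decimal arithmetic."""
    s_dec = Decimal(float(s))  # exact conversion of the double s
    total = Decimal(0)
    for ai, xi, di in zip(a, x, d):
        total += Decimal(float(ai)) * (Decimal(float(xi)) + s_dec * Decimal(float(di)))
    return total - Decimal(float(b))
```

A positive return value means the candidate s overshoots the constraint even before any rounding in the float evaluation is taken into account.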
Finally, a last solution is to increase/decrease s one ULP at a time so as to find the best result. Depending on the actual algorithm used for the summation and on the inputs, results can change. The exact expression used to compute the difference also matters. Moreover, the result is certainly not monotonic because of the way floating-point arithmetic behaves. This means one may need to increase/decrease s by many ULP to be able to perform the optimization. This solution is not very efficient (at least, unless big steps are used).
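One way the "big steps" variant might look is sketched below: start from the naive s, test the constraint with the same expression the caller will use, and back off by 1, 2, 4, ... ULP until it holds. The sketch assumes 1-D arrays, a @ x < b and a @ d > 0; the function name and the max_doublings parameter are hypothetical. It returns a feasible s close to the naive one, not necessarily the largest feasible double.

```python
import numpy as np

def largest_safe_step(a, x, d, b, max_doublings=64):
    """Shrink the naive s until the computed a @ (x + s*d) < b holds."""
    s = (b - a @ x) / (a @ d)
    step = abs(np.spacing(s))        # one ULP at the magnitude of s
    for _ in range(max_doublings):
        if a @ (x + s * d) < b:      # same expression the caller will evaluate
            return s
        s -= step                    # back off, then widen the step
        step *= 2
    raise ArithmeticError("no feasible s found; is a @ x < b and a @ d > 0?")
```

Doubling the step keeps the number of constraint evaluations small, at the cost of possibly backing off a few ULP more than strictly necessary.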
I'm new to automatic differentiation programming, so this may be a naive question. Below is a simplified version of what I'm trying to solve.
I have two input arrays - a vector A of size N and a matrix B of shape (N, M), as well as a parameter vector theta of size M. I define a new array C(theta) = B * theta to get a new vector of size N. I then obtain the indices of elements that fall in the upper and lower quartile of C, and use them to create a new array A_low(theta) = A[lower quartile indices of C] and A_high(theta) = A[upper quartile indices of C]. Clearly these two do depend on theta, but is it possible to differentiate A_low and A_high w.r.t theta?
My attempts so far seem to suggest no - I have tried using the python libraries autograd, JAX and tensorflow, but they all return a gradient of zero. (The approaches I have tried so far involve using argsort or extracting the relevant sub-arrays using tf.top_k.)
What I'm seeking help with is either a proof that the derivative is not defined (or cannot be analytically computed) or if it does exist, a suggestion on how to estimate it. My eventual goal is to minimize some function f(A_low, A_high) wrt theta.
This is the JAX computation that I wrote based on your description:
import numpy as np
import jax.numpy as jnp
import jax
from jax import lax

N = 10
M = 20
rng = np.random.default_rng(0)
A = jnp.array(rng.random((N,)))
B = jnp.array(rng.random((N, M)))
theta = jnp.array(rng.random(M))

def f(A, B, theta, k=3):
    C = B @ theta
    _, i_upper = lax.top_k(C, k)   # indices of the k largest entries of C
    _, i_lower = lax.top_k(-C, k)  # indices of the k smallest entries of C
    return A[i_lower], A[i_upper]

x, y = f(A, B, theta)
dx_dtheta, dy_dtheta = jax.jacobian(f, argnums=2)(A, B, theta)
The derivatives are all zero, and I believe this is correct, because the change in value of the outputs does not depend on the change in value of theta.
But, you might ask, how can this be? After all, theta enters into the computation, and if you put in a different value for theta, you get different outputs. How could the gradient be zero?
What you must keep in mind, though, is that differentiation doesn't measure whether an input affects an output. It measures the change in output given an infinitesimal change in input.
Let's use a slightly simpler function as an example:
import jax
import jax.numpy as jnp

A = jnp.array([1.0, 2.0, 3.0])
theta = jnp.array([5.0, 1.0, 3.0])

def f(A, theta):
    return A[jnp.argmax(theta)]

x = f(A, theta)
dx_dtheta = jax.grad(f, argnums=1)(A, theta)
Here the result of differentiating f with respect to theta is all zero, for the same reasons as above. Why? If you make an infinitesimal change to theta, it will in general not affect the sort order of theta. Thus, the entries you choose from A do not change given an infinitesimal change in theta, and thus the derivative with respect to theta is zero.
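The same point can be checked numerically without any autodiff library. The sketch below, written in plain NumPy with an illustrative step size h, nudges each entry of theta in turn and looks at how the selected entry of A responds; it also applies one large change that does reorder theta, for contrast.

```python
import numpy as np

A = np.array([1.0, 2.0, 3.0])
theta = np.array([5.0, 1.0, 3.0])

def f(A, theta):
    # Select the entry of A at the position of the largest theta.
    return A[np.argmax(theta)]

h = 1e-6  # assumed small step, far below the gap between 5.0 and 3.0
finite_diff = np.array([
    (f(A, theta + h * np.eye(3)[i]) - f(A, theta)) / h
    for i in range(3)
])

# A large change that reorders theta does change the output, in a jump.
jumped = f(A, theta + np.array([0.0, 0.0, 3.0]))
```

Every entry of finite_diff is zero because the argmax does not move under the small nudges, while jumped picks a different entry of A: the function is piecewise constant in theta.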
Now, you might argue that there are circumstances where this is not the case: for example, if two values in theta are very close together, then certainly perturbing one even infinitesimally could change their respective rank. This is true, but the gradient resulting from this procedure is undefined (the change in output is not smooth with respect to the change in input). The good news is this discontinuity is one-sided: if you perturb in the other direction, there is no change in rank and the gradient is well-defined. In order to avoid undefined gradients, most autodiff systems will implicitly use this safer definition of a derivative for rank-based computations.
The result is that the value of the output does not change when you infinitesimally perturb the input, which is another way of saying the gradient is zero. And this is not a failure of autodiff: it is the correct gradient given the definition of differentiation that autodiff is built on. Moreover, were you to try changing to a different definition of the derivative at these discontinuities, the best you could hope for would be undefined outputs, so the definition that results in zeros is arguably more useful and correct.
Background: I have a simulation model which has unobserved parameters. I created a metamodel using artificial neural networks (ANN) because the runtime was very long for the simulation model. I am trying to estimate the unobserved parameters using Bayesian calibration, where priors are based on current knowledge, and the likelihood of observing data is being estimated from the metamodel.
Query: I have two random variables X and Y for which I am trying to get the posterior distribution using STAN. The prior distribution of X is uniform, U(0,2). The prior for Y is also uniform, but it will always exceed X i.e., Y ~ U(X,2). Since Y is linked to X, how can I define the prior distribution for Y in STAN such that the constraint Y>X holds? I am new to STAN, so I would appreciate any suggestions or guidance on how to proceed. Thank you so much!
Stan's ordered vectors are what you need. Create an ordered vector of length 2 (I'll call it beta) in the parameters block, like this:
parameters {
  ordered<lower=0,upper=2>[2] beta;
}
Ordered vectors are constrained such that each element is greater than the previous element. So beta[1] will be your estimate of X and beta[2] will be your estimate of Y.
(To make sure I understand your model correctly: you have two parameters, X and Y, and your only prior knowledge about them is that they both lie in [0, 2] and Y > X. X and Y describe some aspect of the distribution of your data - for example, maybe X is the mean of some other random variable Z, for which you have observations. Do I have that right?)
I believe Stan's priors are uniform by default, but you can make sure of this by specifying a prior for beta in the model block:
model {
  beta ~ uniform(0, 2);
  ...
}
I have n (around 5 million) sets of specific (k,m,v,z)* parameters that describe some linear relationships. I want to find the optimal positive a,b and c coefficients that minimize the addition of their absolute values as shown below:
I know beforehand the range for each a, b and c and so, I could use it to make things a bit faster. However, I do not know how to properly implement this problem to best take advantage of Numpy (or Scipy/etc).
I was thinking of iteratively making checks using different a, b and c coefficients (based on a step) and in the end keeping the combination that would provide the minimum sum. But properly implementing this in Numpy is another thing.
* (k,m,v are either 0 or positive and are in fact k,m,v,i,j,p)
(z can be negative too)
Any tips are welcome!
Either I am missing something, or a == b == c == 0 is optimal. So, a positive solution for (a,b,c) does not exist in general. You can verify this explicitly by posing the minimization problem as a quantile regression of 0 on (k, m, v) with the quantile set to 0.5.
import numpy as np
from statsmodels.regression.quantile_regression import QuantReg
x = np.random.rand(1000, 3)
a, b, c = QuantReg(np.zeros(x.shape[0]), x).fit(0.5).params
assert np.allclose([a, b, c], 0)
If I call x,y = sess.run([X,f(X)]), is X computed once or twice? I'm asking because in my case the value of X is not deterministic, and it's necessary that f be evaluated on the same 'instance' of X.
To make sure that f uses the current X you can set up dependencies.
with tf.control_dependencies([X]):
    y = f(X)
x, y_ = sess.run([X, y])
It will only compute it once. It would not make sense if it recomputed the dependent variables. Just about all variables in a tensorflow model are dependent on one another.