Quantile of a vector in stan - bayesian

I would like to use the quantile of a vector in stan but the function quantile doesn't seem to work. See the ** ** in the following example.
data{
vector[10] y;
vector[10] x;
}
parameters{
real a;
real b;
}
model{
vector[10] mu;
real Q;
mu = a*x+b;
**Q = quantile(y-mu, 0.66);**
}

The quantile function does not exist in the Stan language.
Stan's language focuses on expressing differentiable statistical models. The quantile function is not differentiable.

Related

Automatic Differentiation with respect to rank-based computations

I'm new to automatic differentiation programming, so this maybe a naive question. Below is a simplified version of what I'm trying to solve.
I have two input arrays - a vector A of size N and a matrix B of shape (N, M), as well a parameter vector theta of size M. I define a new array C(theta) = B * theta to get a new vector of size N. I then obtain the indices of elements that fall in the upper and lower quartile of C, and use them to create a new array A_low(theta) = A[lower quartile indices of C] and A_high(theta) = A[upper quartile indices of C]. Clearly these two do depend on theta, but is it possible to differentiate A_low and A_high w.r.t theta?
My attempts so far seem to suggest no - I have using the python libraries of autograd, JAX and tensorflow, but they all return a gradient of zero. (The approaches I have tried so far involve using argsort or extracting the relevant sub-arrays using tf.top_k.)
What I'm seeking help with is either a proof that the derivative is not defined (or cannot be analytically computed) or if it does exist, a suggestion on how to estimate it. My eventual goal is to minimize some function f(A_low, A_high) wrt theta.
This is the JAX computation that I wrote based on your description:
import numpy as np
import jax.numpy as jnp
import jax
N = 10
M = 20
rng = np.random.default_rng(0)
A = jnp.array(rng.random((N,)))
B = jnp.array(rng.random((N, M)))
theta = jnp.array(rng.random(M))
def f(A, B, theta, k=3):
C = B # theta
_, i_upper = lax.top_k(C, k)
_, i_lower = lax.top_k(-C, k)
return A[i_lower], A[i_upper]
x, y = f(A, B, theta)
dx_dtheta, dy_dtheta = jax.jacobian(f, argnums=2)(A, B, theta)
The derivatives are all zero, and I believe this is correct, because the change in value of the outputs does not depend on the change in value of theta.
But, you might ask, how can this be? After all, theta enters into the computation, and if you put in a different value for theta, you get different outputs. How could the gradient be zero?
What you must keep in mind, though, is that differentiation doesn't measure whether an input affects an output. It measures the change in output given an infinitesimal change in input.
Let's use a slightly simpler function as an example:
import jax
import jax.numpy as jnp
A = jnp.array([1.0, 2.0, 3.0])
theta = jnp.array([5.0, 1.0, 3.0])
def f(A, theta):
return A[jnp.argmax(theta)]
x = f(A, theta)
dx_dtheta = jax.grad(f, argnums=1)(A, theta)
Here the result of differentiating f with respect to theta is all zero, for the same reasons as above. Why? If you make an infinitesimal change to theta, it will in general not affect the sort order of theta. Thus, the entries you choose from A do not change given an infinitesimal change in theta, and thus the derivative with respect to theta is zero.
Now, you might argue that there are circumstances where this is not the case: for example, if two values in theta are very close together, then certainly perturbing one even infinitesimally could change their respective rank. This is true, but the gradient resulting from this procedure is undefined (the change in output is not smooth with respect to the change in input). The good news is this discontinuity is one-sided: if you perturb in the other direction, there is no change in rank and the gradient is well-defined. In order to avoid undefined gradients, most autodiff systems will implicitly use this safer definition of a derivative for rank-based computations.
The result is that the value of the output does not change when you infinitesimally perturb the input, which is another way of saying the gradient is zero. And this is not a failure of autodiff – it is the correct gradient given the definition of differentiation that autodiff is built on. Moreover, were you to try changing to a different definition of the derivative at these discontinuities, the best you could hope for would be undefined outputs, so the definition that results in zeros is arguably more useful and correct.

STAN - Defining priors for dependent random variables

Background: I have a simulation model which has unobserved parameters. I created a metamodel using artificial neural networks (ANN) because the runtime was very long for the simulation model. I am trying to estimate the unobserved parameters using Bayesian calibration, where priors are based on current knowledge, and the likelihood of observing data is being estimated from the metamodel.
Query: I have two random variables X and Y for which I am trying to get the posterior distribution using STAN. The prior distribution of X is uniform, U(0,2). The prior for Y is also uniform, but it will always exceed X i.e., Y ~ U(X,2). Since Y is linked to X, how can I define the prior distribution for Y in STAN such that the constraint Y>X holds? I am new to STAN, so I would appreciate any suggestions or guidance on how to proceed. Thank you so much!
Stan's ordered vectors are what you need. Create an ordered vector of length 2 (I'll call it beta) in the parameters block, like this:
parameters {
ordered<lower=0,upper=2>[2] beta;
}
Ordered vectors are constrained such that each element is greater than the previous element. So beta[1] will be your estimate of X and beta[2] will be your estimate of Y.
(To make sure I understand your model correctly: you have two parameters, X and Y, and your only prior knowledge about them is that they both lie in [0, 2] and Y > X. X and Y describe some aspect of the distribution of your data - for example, maybe X is the mean of some other random variable Z, for which you have observations. Do I have that right?)
I believe Stan's priors are uniform by default, but you can make sure of this by specifying a prior for beta in the model block:
model {
beta ~ uniform(0, 2);
...
}

GAMS - Unit step function

I need to use the step function in order to count the number of non-zero elements in a parameter. The step function that I am considering is the following:
After searching on the internet for solution, I realized we can create stepwise functions in GAMS, but I need a continuous function for x > 1.
I tried the following code to reproduce a step-like function:
round(1 / (1 + exp(-x)) - 0.01)
which is:
Unfortunately, this formula does not work with GAMS. When I try to run the code, I got this error:
Endogenous function argument(s) not allowed in linear models
I am working with a MIP (Mixed Integer Linear Program) model. Is there a way to use a step function in GAMS?
I assume, that x is a variable in your code? Then you can try something like this (if x would be a parameter, it would be easier):
Equation a, b;
Variable x;
Binary Variable y;
Scalar BigM / 1e3/
SmallM /1e-3/;
a.. y*BigM =g= x;
b.. y*SmallM =l= x;
So, if x=0, y will be 0 as well because of equation b. And if x>0, y will become 1 because of equation a. The BigM you should choose as small as possible and as big as necessary (so it should be the maximum value x can take) and SmallM the other way around. This assumes of course, that there is something like a lower and upper bound for x, if it is not 0...
Hope that helps!
Lutz

Stan. Using target += syntax

I'm starting to learn Stan.
Could anyone explain when and how to use syntax such as... ?
target +=
instead of just:
y ~ normal(mu, sigma)
For example in Stan manual you can find the following example.
model {
real ps[K]; // temp for log component densities
sigma ~ cauchy(0, 2.5);
mu ~ normal(0, 10);
for (n in 1:N) {
for (k in 1:K) {
ps[k] = log(theta[k])
+ normal_lpdf(y[n] | mu[k], sigma[k]);
}
target += log_sum_exp(ps);
}
}
I think the target line increases the target value, that I think it's the logarithm of the posterior density.
But the posterior density for what parameter?
When is it updated and initialized?
After Stan finishes (and converges), how do you access its value and how I use it?
Other examples:
data {
int<lower=0> J; // number of schools
real y[J]; // estimated treatment effects
real<lower=0> sigma[J]; // s.e. of effect estimates
}
parameters {
real mu;
real<lower=0> tau;
vector[J] eta;
}
transformed parameters {
vector[J] theta;
theta = mu + tau * eta;
}
model {
target += normal_lpdf(eta | 0, 1);
target += normal_lpdf(y | theta, sigma);
}
the example above uses target twice instead of just once.
another example.
data {
int<lower=0> N;
vector[N] y;
}
parameters {
real mu;
real<lower=0> sigma_sq;
vector<lower=-0.5, upper=0.5>[N] y_err;
}
transformed parameters {
real<lower=0> sigma;
vector[N] z;
sigma = sqrt(sigma_sq);
z = y + y_err;
}
model {
target += -2 * log(sigma);
z ~ normal(mu, sigma);
}
This last example even mixes both methods.
To do it even more difficult I've read that
y ~ normal(0,1);
has the same effect than
increment_log_prob(normal_log(y,0,1));
Could anyone explain why, please?
Could anyone provide a simple example written in two different ways, with "target +=" and in the regular simpler "y ~" way, please?
Regards
The syntax
target += u;
adds u to the target log density.
The target density is the density from which the sampler samples and it needs to be equal to the joint density of all the parameters given the data up to a constant (which is usually achieved via Bayes's rule by coding as the joint density of parameters and modeled data up to a constant). You access it as lp__ in the posterior, but be careful, as it also contains the Jacobians arising from the constraints and drops constants in sampling statements---you do not want to use it for model comparison.
From a sampling perspective, writing
target += normal_lpdf(y | mu, sigma);
has the same effect as
y ~ normal(mu, sigma);
The _lpdf signals it's the log probability density function for the normal, which is implicit in the sampling notation. The sampling notation is just shorthand for the target += syntax, and in addition, drops constant terms in the log density.
It's explained in the statements section of the language reference (the second part of the manual) and used in multiple examples through the programmer's guide (the first part of the manual).
I am just starting to learn Stan and Bayesian statistics, and mainly rely on John Kruschke's book "Doing Bayesian Data Analysis". Here, in chapter 14.3.3, he explains:
Thus, the essence of computation in Stan is dealing with the logarithm
of the posterior probability density and its gradient; there is no
direct random sampling of parameters from distributions.
As a result (still rephrasing Kruschke), a
model [...] like y ∼ normal(mu,sigma) [actuallly] means to multiply the current posterior probability by the density of the normal distribution at the datum value y.
Following logarithm calculation rules, this multiplication is equal to add the log probability density of a given data y to the current log-probability. (log(a*b) = log(a) + log(b), hence the equality of multiplication and sum).
I concede that I don't grasp the full implications of that, but I think it points into the right direction into what, mathematically speaking, the targer += does.

How to estimate confidence of nonlinear regression?

I use Levenberg -- Marquardt algorithm to fit my nonlinear function f(x,b) (x:Nx1, b:Mx1) to data X:NxK.
Now I want to estimate goodness (confidence) of solution b.
This post says that I should not try to find R-squared in nonlinear case. What should I do then? Are there any reliable universal metrics at all? I could not google any answer for this.
Standard errors are usually calculated as:
s.e. = sigma^2 inv(J'J)
or as
s.e. = sigma^2 inv(H)
where
J : Jacobian matrix
H : Hessian matrix
sigma^2 = SSE/df = sum of squared errors / (n-p)
A confidence interval is then
b +- s.e. * t(n-p,alpha/2)
where t is the critical value for the Student’s t distribution