TensorFlow Bijectors with non-invertible transformations? - tensorflow

I'm trying to understand the variational inference module in TensorFlow; I have a particular use case I'm hoping to use it for.
I want to make a custom distribution whose random variable is a transformation of a vector of independent gamma RVs. This transformation removes one degree of freedom.
For simplicity's sake, let's consider the Dirichlet distribution. If x is a vector of independent gamma RVs with shape parameter vector a, then y = x / sum(x) is a Dirichlet vector with the same parameter vector, and its components sum to 1. Thus one degree of freedom is lost in the transformation.
Let's say I want to implement this distribution as a tfp.distributions.TransformedDistribution. Would that be possible? The Bijector class assumes that both the forward and the inverse transformation are implemented, and the inverse is no longer available once the sum has been integrated out.
How would I go about implementing the Dirichlet in TransformedDistribution?
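For reference, the construction described in the question can be checked numerically with plain NumPy. This is only a sketch of the gamma-normalization step, not an answer that uses TransformedDistribution or Bijector; the parameter values and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([2.0, 3.0, 5.0])   # shape parameters (illustrative values)

# Independent gamma draws, one column per component; the scale is left at 1.
x = rng.gamma(shape=a, size=(100_000, a.size))

# Normalizing each row maps 3 free coordinates onto the simplex (2 free coordinates).
y = x / x.sum(axis=1, keepdims=True)

row_sums = y.sum(axis=1)
empirical_mean = y.mean(axis=0)
dirichlet_mean = a / a.sum()    # known mean of Dirichlet(a)
```

Every row of y sums to 1, and the empirical mean should land close to a / sum(a), which is what a Dirichlet(a) sample would give. The map x -> y is many-to-one (any rescaling of x gives the same y), which is the non-invertibility the question is about.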

Related

Does variational autoencoder make distribution based on only latent representation?

If the latent representation of my variational autoencoder (VAE) is r and my dataset is x, does the VAE's latent representation follow a distribution based on r or on x?
If r = 10, does that mean it has 10 means and variances (a multivariate Gaussian), with the distribution coming from the whole dataset x?
Or does r = 10 construct one distribution based on r, which every sample then tries to follow?
I'm confused about which one is correct.
A VAE constructs a mapping e(x) -> Z (encoder) and d(z) -> X (decoder). This means that every element of your input space x is mapped through the encoder e(x) into a single r-dimensional Gaussian. It is not a "mixture"; it is just a single Gaussian with a diagonal covariance matrix.
I'll add my 2 cents to @lejlot's answer.
The encoder in your VAE maps each sample to a distribution, which in your case has 10 dimensions. That distribution is used to say "ok, my best estimate of this property of this sample is mu, but I'm not too sure, so consider that it might have variance sigma".
Therefore, you have a distribution for each sample.
However, to make sampling easier, we ask the VAE to keep these distributions as close as possible to a known one, the standard normal distribution. That way we know where the distributions are located; if you check the latent space of a plain autoencoder, you will see groups that lie far from each other.
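To make "one distribution per sample" concrete, here is a minimal NumPy sketch. The encoder outputs are random stand-ins rather than a trained network, and the names batch and kl_to_standard_normal are made up for illustration. It shows the per-sample mean and log-variance vectors, a reparameterized draw, and the usual KL term that pulls each per-sample Gaussian toward the standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 10       # latent dimension from the question
batch = 4    # hypothetical number of input samples

# Stand-ins for encoder outputs: one mean vector and one log-variance vector per sample.
mu = rng.normal(size=(batch, r))
log_var = rng.normal(scale=0.1, size=(batch, r))

# Reparameterized draw: z = mu + sigma * eps with eps ~ N(0, I).
eps = rng.standard_normal(size=(batch, r))
z = mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per sample."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0, axis=1)

kl = kl_to_standard_normal(mu, log_var)
```

Each row of mu and log_var belongs to one input sample, so there are batch separate r-dimensional Gaussians, and kl has one entry per sample. The KL term is zero only when a sample's distribution is exactly N(0, I).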

STAN - Defining priors for dependent random variables

Background: I have a simulation model which has unobserved parameters. I created a metamodel using artificial neural networks (ANN) because the runtime was very long for the simulation model. I am trying to estimate the unobserved parameters using Bayesian calibration, where priors are based on current knowledge, and the likelihood of observing data is being estimated from the metamodel.
Query: I have two random variables X and Y for which I am trying to get the posterior distribution using Stan. The prior distribution of X is uniform, U(0,2). The prior for Y is also uniform, but Y always exceeds X, i.e., Y ~ U(X,2). Since Y is linked to X, how can I define the prior distribution for Y in Stan such that the constraint Y > X holds? I am new to Stan, so I would appreciate any suggestions or guidance on how to proceed. Thank you so much!
Stan's ordered vectors are what you need. Create an ordered vector of length 2 (I'll call it beta) in the parameters block, like this:
parameters {
  ordered<lower=0,upper=2>[2] beta;
}
Ordered vectors are constrained such that each element is greater than the previous element. So beta[1] will be your estimate of X and beta[2] will be your estimate of Y.
(To make sure I understand your model correctly: you have two parameters, X and Y, and your only prior knowledge about them is that they both lie in [0, 2] and Y > X. X and Y describe some aspect of the distribution of your data - for example, maybe X is the mean of some other random variable Z, for which you have observations. Do I have that right?)
I believe Stan's priors are uniform by default, but you can make sure of this by specifying a prior for beta in the model block:
model {
  beta ~ uniform(0, 2);
  ...
}
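One thing that may be worth checking is whether a flat prior over the ordered region 0 < X < Y < 2 is the same as the sequential prior X ~ U(0, 2), Y ~ U(X, 2) from the question. The sketch below is plain NumPy, not Stan, and assumes the ordered-vector prior amounts to a constant joint density on that region; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# (a) Sequential prior from the question: X ~ U(0, 2), then Y | X ~ U(X, 2).
x_seq = rng.uniform(0.0, 2.0, size=n)
y_seq = rng.uniform(x_seq, 2.0)

# (b) Flat joint prior on the ordered region 0 < X < Y < 2:
#     sort two independent U(0, 2) draws.
pair = np.sort(rng.uniform(0.0, 2.0, size=(n, 2)), axis=1)
x_ord, y_ord = pair[:, 0], pair[:, 1]

print(x_seq.mean(), y_seq.mean())   # theoretical means: 1 and 3/2
print(x_ord.mean(), y_ord.mean())   # theoretical means: 2/3 and 4/3
```

Under these assumptions the two specifications give different marginals for X, so if the sequential form is what is intended, it may need to be written out explicitly rather than relying on the ordering constraint alone.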

Implementation of Isotropic squared exponential kernel with numpy

I've come across a from-scratch implementation of Gaussian processes:
http://krasserm.github.io/2018/03/19/gaussian-processes/
There, the isotropic squared exponential kernel is implemented in numpy. It looks like:
$$\kappa(x_i, x_j) = \sigma_f^2 \exp\left(-\frac{1}{2 l^2} (x_i - x_j)^T (x_i - x_j)\right)$$
The implementation is:
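import numpy as np

def kernel(X1, X2, l=1.0, sigma_f=1.0):
    # Pairwise squared distances between the rows of X1 (m x d) and X2 (n x d); shape (m, n).
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * np.dot(X1, X2.T)
    return sigma_f**2 * np.exp(-0.5 / l**2 * sqdist)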
consistent with the implementation of Nando de Freitas: https://www.cs.ubc.ca/~nando/540-2013/lectures/gp.py
However, I am not quite sure how this implementation matches the provided formula, especially in the sqdist part. In my opinion, it is wrong but it works (and delivers the same results as scipy's cdist with squared Euclidean distance). Why do I think it is wrong? If you multiply out the product of the two matrices, you get
$$(x_i - x_j)^T (x_i - x_j) = x_i^T x_i - 2 x_i^T x_j + x_j^T x_j$$
which equals either a scalar or an n x n matrix for a vector x_i, depending on whether you define x_i to be a column vector or not. The implementation, however, gives back an n x 1 vector with the squared values.
I hope that anyone can shed light on this.
I found out: the implementation is correct. I just was not aware of the fuzzy notation (in my opinion) that is sometimes used in ML contexts. The goal is a distance matrix: each row vector of matrix A is compared with each row vector of matrix B to build the covariance matrix, not (as I had somehow guessed) the direct distance between two matrices/vectors.
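The row-by-row reading can be confirmed with a small check: apply the scalar formula to every pair of rows in an explicit loop and compare the result with the vectorized kernel. The helper kernel_loop and the test matrices A and B are made up for this sketch.

```python
import numpy as np

def kernel(X1, X2, l=1.0, sigma_f=1.0):
    # Vectorized version from the question.
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * np.dot(X1, X2.T)
    return sigma_f**2 * np.exp(-0.5 / l**2 * sqdist)

def kernel_loop(X1, X2, l=1.0, sigma_f=1.0):
    """Reference version: apply the scalar formula to every pair of rows."""
    K = np.empty((X1.shape[0], X2.shape[0]))
    for i, xi in enumerate(X1):
        for j, xj in enumerate(X2):
            d = xi - xj
            K[i, j] = sigma_f**2 * np.exp(-0.5 / l**2 * (d @ d))
    return K

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))   # 5 points in 3 dimensions (illustrative)
B = rng.normal(size=(4, 3))   # 4 points in 3 dimensions

K_fast = kernel(A, B)
K_ref = kernel_loop(A, B)
print(np.allclose(K_fast, K_ref))
```

The n x 1 column from reshape(-1, 1) holds the x_i^T x_i terms; broadcasting it against the row of x_j^T x_j terms and subtracting 2 * X1 X2^T fills in all pairwise squared distances at once.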

How does TensorFlow handle complex gradients?

Let z be a complex variable and C(z) its conjugate.
In complex analysis, the derivative of C(z) w.r.t. z does not exist. But in TensorFlow, we can calculate dC(z)/dz, and the result is just 1.
Here is an example:
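import numpy as np
import tensorflow as tf  # TensorFlow 1.x graph-mode API (tf.placeholder, tf.Session)

x = tf.placeholder('complex64',(2,2))
y = tf.reduce_sum(tf.conj(x))
z = tf.gradients(y,x)
sess = tf.Session()
X = np.random.rand(2,2)+1.j*np.random.rand(2,2)
X = X.astype('complex64')
Z = sess.run(z,{x:X})[0]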
The input X is
[[0.17014372+0.71475762j 0.57455420+0.00144318j]
[0.57871044+0.61303568j 0.48074263+0.7623235j ]]
and the result Z is
[[1.-0.j 1.-0.j]
[1.-0.j 1.-0.j]]
I don't understand why the gradient comes out as 1.
I would also like to know how TensorFlow handles complex gradients in general.
The equation used by TensorFlow for the gradient is:
$$\nabla f = \left(\frac{\partial f}{\partial z} + \frac{\partial f^*}{\partial z}\right)^*$$
where the '*' means conjugate.
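The partial derivatives with respect to z and z* are defined using Wirtinger calculus, which makes it possible to differentiate non-holomorphic functions with respect to a complex variable. With z = x + iy, the Wirtinger definition is:
$$\frac{\partial f}{\partial z} = \frac{1}{2}\left(\frac{\partial f}{\partial x} - i\frac{\partial f}{\partial y}\right), \qquad \frac{\partial f}{\partial z^*} = \frac{1}{2}\left(\frac{\partial f}{\partial x} + i\frac{\partial f}{\partial y}\right)$$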
Why this definition?
When using, for example, complex-valued neural networks (CVNNs), the gradients are taken of a non-holomorphic, real-valued scalar function of one or several complex variables. Since f* = f for a real-valued f, TensorFlow's definition of the gradient can then be written as:
$$\nabla f = 2\frac{\partial f}{\partial z^*} = \frac{\partial f}{\partial x} + i\frac{\partial f}{\partial y}$$
This definition agrees with the CVNN literature, for example chapter 4, section 4.3 of this book or Amin et al. (among countless examples).
Bit late, but I came across this issue recently too.
The key point is that TensorFlow defines the "gradient" of a complex-valued function f(z) of a complex variable as "the gradient of the real map F: (x,y) -> Re(f(x+iy)), expressed as a complex number" (the gradient of that real map is a vector in R^2, so we can express it as a complex number in the obvious way).
Presumably the reason for that definition is that in TF one is usually concerned with gradients for the purpose of running gradient descent on a loss function, and in particular for identifying the direction of maximum increase/decrease of that loss function. Using the above definition of gradient means that a complex-valued function of complex variables can be used as a loss function in a standard gradient descent algorithm, and the result will be that the real part of the function gets minimised (which seems to me a somewhat reasonable interpretation of "optimise this complex-valued function").
Now, to your question, an equivalent way to write that definition of gradient is
gradient(f) := dF/dx + idF/dy = conj(df/dz + dconj(f)/dz)
(you can easily verify that using the definition of d/dz). That's how TensorFlow handles complex gradients. As for the case of f(z):=conj(z), we have df/dz=0 (as you mention) and dconj(f)/dz=1, giving gradient(f)=1.
I wrote up a longer explanation here, if you're interested: https://github.com/tensorflow/tensorflow/issues/3348#issuecomment-512101921
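The definition gradient(f) = dF/dx + i dF/dy with F = Re(f) can be sanity-checked without TensorFlow by using finite differences. The helper tf_style_gradient below is a made-up name for this sketch and is not a TensorFlow function; it only approximates the definition numerically.

```python
import numpy as np

def tf_style_gradient(f, z, h=1e-6):
    """Finite-difference estimate of dF/dx + i*dF/dy with F(x, y) = Re f(x + iy)."""
    dF_dx = (np.real(f(z + h)) - np.real(f(z - h))) / (2 * h)
    dF_dy = (np.real(f(z + 1j * h)) - np.real(f(z - 1j * h))) / (2 * h)
    return dF_dx + 1j * dF_dy

z0 = 0.3 + 0.7j   # arbitrary test point

g_conj = tf_style_gradient(np.conj, z0)            # f(z) = conj(z)
g_square = tf_style_gradient(lambda z: z**2, z0)   # f(z) = z^2, holomorphic

print(g_conj)     # approximately 1, matching the result in the question
print(g_square)   # approximately conj(2 * z0)
```

For f(z) = conj(z), F(x, y) = x, so the estimate is 1 at every point. For the holomorphic f(z) = z^2, F = x^2 - y^2 and the estimate is the conjugate of the ordinary derivative 2z, which agrees with the conj(df/dz + dconj(f)/dz) form above.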

PyMC: How can I describe a state space model?

I used to code my MCMC using C. But I'd like to give PyMC a try.
Suppose X_n is the underlying state, whose dynamics follow a Markov chain, and Y_n is the observed data. In particular,
Y_n has Poisson distribution with mean depending on X_n and a multidimensional unknown parameter theta
X_n | X_{n-1} has distribution depending on theta
How should I describe this model using PyMC?
Another question: I can find conjugate priors for theta but not for X_n. Is it possible to specify which posteriors are updated using conjugate priors and which using MCMC?
Here is an example of a state-space model in PyMC on the PyMC wiki. It basically involves populating a list and allowing PyMC to treat it as a container of PyMC nodes.
As for the second part of the question, you could certainly calculate some of your conjugate posteriors ahead of time and put them into the model. For example, if you observed binomial data x=4, n=10 you could insert a Beta node p = Beta('p', 5, 7) to represent that posterior (it's really just a prior, as far as the model is concerned, but it is the posterior given data x). Then PyMC would draw a sample from this posterior at every iteration, to be used wherever it is needed in the model.
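The Beta(5, 7) figure can be reproduced with the standard Beta-binomial conjugate update. The sketch below is plain Python rather than PyMC; the function name is made up, and it assumes a flat Beta(1, 1) prior, which is not stated in the answer but is the prior that gives Beta(5, 7) for x=4, n=10.

```python
def beta_binomial_posterior(alpha_prior, beta_prior, successes, trials):
    """Conjugate update: Beta prior + binomial data -> Beta posterior parameters."""
    return alpha_prior + successes, beta_prior + (trials - successes)

# Assumption: a flat Beta(1, 1) prior, which reproduces the Beta(5, 7) in the answer.
alpha_post, beta_post = beta_binomial_posterior(1, 1, successes=4, trials=10)
posterior_mean = alpha_post / (alpha_post + beta_post)

print(alpha_post, beta_post)   # 5 7
```

The resulting pair is what would be passed as the two shape arguments of the Beta node in the model.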