Derivative of a sum in gradient descent optimizer - sum

Im trying to replicate a fairly complex algorithm on Matlab based on a research paper. (https://doi.org/10.1117/1.JEI.23.1.013007)
The algorithm uses gradient descent to find optimal connections.
p = [p1, p2, p3, ... , pn], are probabilities that possible connections are true
q = [q1, q2, q3, ... , qn], is the gradient vector
rij is a symmetric matrix that expresses the compatibility relations between the possible connections
The probabilities are found by minimizing E(p)
qi = - d/dpi E(p),
I'm having trouble with the partial derivative of Equation 5 of the paper. In Eq. 5 I have substituted ci with Eq. 12 and then partially derivated the equation. In the uploaded image you can see the derivatives solved.
Equations
I'm trying write a matlab script that would output the q vector [q1, q2, q3, .. , qn] but I don't understand for example how to get q5.

Related

Does variational autoencoder make distribution based on only latent representation?

If my latent representation of variational autoencoder(VAE) is r, and my dataset is x, does vae's latent representation follows normalization based on r or x?
If r= 10, that means it has 10 means and variance (multi-gussain) and distribution comes from data whole data x?
Or r = 10 constructs one distribution based on r, and every sample try to follow this distribution
I'm confused about which one is correct
VAE constructs a mapping e(x) -> Z (encoder), and d(z) -> X (decoder). This means that every elements of your input space x will be mapped through an encoder e(x) into a single, r-dimensional Gaussian. It is not a "mixture", it is just a single gaussian with diagonal covariance matrix.
I'll add my 2 cents to #lejlot answer.
Your encoder in VAE will map your sample to a distribution, that in your case has 10 dimensions... that distribution is used to say "ok my best estimate of this property of this sample is mu, but I'm not too sure, so consider that it might have variance sigma"
Therefore, you have a distribution for each sample.
However, in order to make sampling easier in VAE, we ask the VAE to keep the distributions as close to a known one, that is the standard normal distribution (we know "where the distributions are located", if you check the latent space in a normal AE you will see that you will have groups far from eachother).

Tensorflow Probability VI: Discrete + Continuous RVs inference: gradient estimation?

See this tensorflow-probability issue
tensorflow==2.7.0
tensorflow-probability==0.14.1
TLDR
To perform VI on discrete RVs, should I use:
A- the REINFORCE gradient estimator
B- the Gumbel-Softmax reparametrization
C- another solution
and how to implement it ?
Problem statement
Sorry in advance for the long issue, but I believe the problem requires some explaining.
I want to implement a Hierarchical Bayesian Model involving both continuous and discrete Random Variables. A minimal example is a Gaussian Mixture model:
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors
G = 2
p = tfd.JointDistributionNamed(
model=dict(
mu=tfd.Sample(
tfd.Normal(0., 1.),
sample_shape=(G,)
),
z=tfd.Categorical(
probs=tf.ones((G,)) / G
),
x=lambda mu, z: tfd.Normal(
loc=mu[z],
scale=1.
)
)
)
In this example I don't use the tfd.Mixture API on purpose to expose the Categorical label. I want to perform Variational Inference in this context, and for instance given an observed x fit over the posterior of z a Categorical distribution with parametric probabilities:
q_probs = tfp.util.TransformedVariable(
tf.ones((G,)) / G,
tfb.SoftmaxCentered(),
name="q_probs"
)
q_loc = tf.Variable(0., name="q_loc")
q_scale = tfp.util.TransformedVariable(
1.,
tfb.Exp(),
name="q_scale"
)
q = tfd.JointDistributionNamed(
model=dict(
mu=tfd.Normal(q_loc, q_scale),
z=tfd.Categorical(probs=q_probs)
)
)
The issue is: when computing the ELBO and trying to optimize for the optimal q_probs I cannot use the reparameterization gradient estimators: this is AFAIK because z is a discrete RV:
def log_prob_fn(**kwargs):
return p.log_prob(
**kwargs,
x=tf.constant([2.])
)
optimizer = tf.optimizers.SGD()
#tf.function
def fit_vi():
return tfp.vi.fit_surrogate_posterior(
target_log_prob_fn=log_prob_fn,
surrogate_posterior=q,
optimizer=optimizer,
num_steps=10,
sample_size=8
)
_ = fit_vi()
# This last line raises:
# ValueError: Distribution `surrogate_posterior` must be reparameterized, i.e.,a diffeomorphic transformation
# of a parameterless distribution. (Otherwise this function has a biased gradient.)
I'm looking into a way to make this work. I've identified at least 2 ways to circumvent the issue: using REINFORCE gradient estimator or the Gumbel-Softmax reparameterization.
A- REINFORCE gradient
cf this TFP API link a classical result in VI is that the REINFORCE gradient can deal with a non-differentiable objective function, for instance due to discrete RVs.
I can use a tfp.vi.GradientEstimators.SCORE_FUNCTION estimator instead of the tfp.vi.GradientEstimators.REPARAMETERIZATION one using the lower-level tfp.vi.monte_carlo_variational_loss function?
Using the REINFORCE gradient, In only need the log_prob method of q to be differentiable, but the sample method needn't be differentiated.
As far as I understood it, the sample method for a Categorical distribution implies a gradient break, but the log_prob method does not. Am I correct to assume that this could help with my issue? Am I missing something here?
Also I wonder: why is this possibility not exposed in the tfp.vi.fit_surrogate_posterior API ? Is the performance bad, meaning is the variance of the estimator too large for practical purposes ?
B- Gumbel-Softmax reparameterization
cf this TFP API link I could also reparameterize z as a variable y = tfd.RelaxedOneHotCategorical(...) . The issue is: I need to have a proper categorical label to use for the definition of x, so AFAIK I need to do the following:
p_GS = tfd.JointDistributionNamed(
model=dict(
mu=tfd.Sample(
tfd.Normal(0., 1.),
sample_shape=(G,)
),
y=tfd.RelaxedOneHotCategorical(
temperature=1.,
probs=tf.ones((G,)) / G
),
x=lambda mu, y: tfd.Normal(
loc=mu[tf.argmax(y)],
scale=1.
)
)
)
...but his would just move the gradient breaking problem to tf.argmax. This is where I maybe miss something. Following the Gumbel-Softmax (Jang et al., 2016) paper, I could then use the "STRAIGHT-THROUGH" (ST) strategy and "plug" the gradients of the variable tf.one_hot(tf.argmax(y)) -the "discrete y"- onto y -the "continuous y".
But again I wonder: how to do this properly ? I don't want to mix and match the gradients by hand, and I guess an autodiff backend is precisely meant to avoid me this issue. How could I create a distribution that differentiates the forward direction (sampling a "discrete y") from the backward direction (gradient computed using the "continuous y") ? I guess this is the meant usage of the tfd.RelaxedOneHotCategorical distribution, but I don't see this implemented anywhere in the API.
Should I implement this myself ? How ? Could I use something in the lines of tf.custom_gradient?
Actual question
Which solution -A or B or another- is meant to be used in the TFP API, if any? How should I implement said solution efficiently?
So the ides was not to make a Q&A but I looked into this issue for a couple days and here are my conclusions:
solution A -REINFORCE- is a possibility, it doesn't introduce any bias, but as far as I understood it it has high variance in its vanilla form -making it prohibitively slow for most real-world tasks. As detailed a bit below, control variates can help tackle the variance issue;
solution B, Gumbell-Softmax, exists as well in the API, but I did not find any native way to make it work for hierarchical tasks. Below is my implementation.
First off, we need to reparameterize the joint distribution p as the KL between a discrete and a continuous distribution is ill-defined (as explained in the Maddison et al. (2017) paper). To not break the gradients, I implemented a simple one_hot_straight_through operation that converts the continuous RV y into a discrete RV z:
G = 2
#tf.custom_gradient
def one_hot_straight_through(y):
depth = y.shape[-1]
z = tf.one_hot(
tf.argmax(
y,
axis=-1
),
depth=depth
)
def grad(upstream):
return upstream
return z, grad
p = tfd.JointDistributionNamed(
model=dict(
mu=tfd.Sample(
tfd.Normal(0., 1.),
sample_shape=(G,)
),
y=tfd.RelaxedOneHotCategorical(
temperature=1.,
probs=tf.ones((G,)) / G
),
x=lambda mu, y: tfd.Normal(
loc=tf.reduce_sum(
one_hot_straight_through(y)
* mu
),
scale=1.
)
)
)
The variational distribution q follows the same reparameterization and the following code bit does work:
q_probs = tfp.util.TransformedVariable(
tf.ones((G,)) / G,
tfb.SoftmaxCentered(),
name="q_probs"
)
q_loc = tf.Variable(tf.zeros((2,)), name="q_loc")
q_scale = tfp.util.TransformedVariable(
1.,
tfb.Exp(),
name="q_scale"
)
q = tfd.JointDistributionNamed(
model=dict(
mu=tfd.Independent(
tfd.Normal(q_loc, q_scale),
reinterpreted_batch_ndims=1
),
y=tfd.RelaxedOneHotCategorical(
temperature=1.,
probs=q_probs
)
)
)
def log_prob_fn(**kwargs):
return p.log_prob(
**kwargs,
x=tf.constant([2.])
)
optimizer = tf.optimizers.SGD()
#tf.function
def fit_vi():
return tfp.vi.fit_surrogate_posterior(
target_log_prob_fn=log_prob_fn,
surrogate_posterior=q,
optimizer=optimizer,
num_steps=10,
sample_size=8
)
_ = fit_vi()
Now there are several issues with that design:
first off we needed to reparameterize not only q but also p so we "modify our target model". This results in our models p and q not outputing discrete RVs like originally intended but continuous RVs. I think that the introduction of a hard option like in the torch implem could be a nice addition to overcome this issue;
second we introduce the burden of setting up the temperature parameter. The latter make the continuous RV y smoothly converge to its discrete counterpart z. An annealing strategy, reducing the temperature to reduce the bias introduced by the relaxation at the cost of a higher variance can be implemented. Or the temperature can be learned online, akin to an entropy regularization (see Maddison et al. (2017) and Jang et al. (2017));
the gradient obtained with this estimator are biased, which probably can be acceptable for most applications but is an issue in theory.
Recent methods like REBAR (Tucker et al. (2017)) or RELAX (Grathwohl et al. (2018)) can instead obtain unbiased estimators with a lower variance than the original REINFORCE. But they do so at the cost of introducing -learnable- control variates with separate losses. Modifications of the one_hot_straight_through functions could probably implement this.
In conclusion my opinion is that the tensorflow probability support for discrete RVs optimization is too scarce at the moment and that the API lacks native functions and tutorials to make it easier for the user.

Question about inconsistency between tensorflow lite quantization code, paper and documentation

In this paper (Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference) published by google, quantization scheme is described as follows:
Where,
M = S1 * S2 / S3
S1, S2 and S3 are scales of inputs and output respectively.
Both S1 (and zero point Z1) and S2 (and zero point Z2) can be determined easily, whether "offline" or "online". But what about S3 (and zero point Z3)? These parameters are dependent on "actual" output scale (i.e., the float value without quantization). But output scale is unknown before it is computed.
According to the tensorflow documentation:
At inference, weights are converted from 8-bits of precision to floating point and computed using floating-point kernels. This conversion is done once and cached to reduce latency.
But the code below says something different:
tensor_utils::BatchQuantizeFloats(
input_ptr, batch_size, input_size, quant_data, scaling_factors_ptr,
input_offset_ptr, params->asymmetric_quantize_inputs);
for (int b = 0; b < batch_size; ++b) {
// Incorporate scaling of the filter.
scaling_factors_ptr[b] *= filter->params.scale;
}
// Compute output += weight * quantized_input
int32_t* scratch = GetTensorData<int32_t>(accum_scratch);
tensor_utils::MatrixBatchVectorMultiplyAccumulate(
filter_data, num_units, input_size, quant_data, scaling_factors_ptr,
batch_size, GetTensorData<float>(output), /*per_channel_scale=*/nullptr,
input_offset_ptr, scratch, row_sums_ptr, &data->compute_row_sums,
CpuBackendContext::GetFromContext(context));
Here we can see:
scaling_factors_ptr[b] *= filter->params.scale;
I think this means:
S1 * S2 is computed.
The weights are still integers. Just the final results are floats.
It seems S3 and Z3 don't have to be computed. But if so, how can the final float results be close to the unquantized results?
This inconsistency between paper, documentation and code makes me very confused. I can't tell what I miss. Can anyone help me?
Let me answer my own question. All of a sudden I saw what I missed when I was
riding bicycle. The code in the question above is from the function
tflite::ops::builtin::fully_connected::EvalHybrid(). Here the
name has explained everything! Value in the output of matrix multiplication is
denoted as r3 in section 2.2 of the paper. In terms of equation
(2) in section 2.2, we have:
If we want to get the float result of matrix multiplication, we can use equation (4) in section 2.2, then convert the result back to floats, OR we can use equation (3) with the left side replaced with r3, as in:
If we choose all the zero points to be 0, then the formula above becomes:
And this is just what EvalHybrid() does (ignoring the bias for the moment). Turns out the paper gives an outline of the quantization algorithm, while the implementation uses different variants.

how tensorflow handles complex gradient?

Let z is a complex variable, C(z) is its conjugation.
In complex analysis theory, the derivative of C(z) w.r.t z don't exist. But in tesnsorflow, we can calculate dC(z)/dz and the result is just 1.
Here is an example:
x = tf.placeholder('complex64',(2,2))
y = tf.reduce_sum(tf.conj(x))
z = tf.gradients(y,x)
sess = tf.Session()
X = np.random.rand(2,2)+1.j*np.random.rand(2,2)
X = X.astype('complex64')
Z = sess.run(z,{x:X})[0]
The input X is
[[0.17014372+0.71475762j 0.57455420+0.00144318j]
[0.57871044+0.61303568j 0.48074263+0.7623235j ]]
and the result Z is
[[1.-0.j 1.-0.j]
[1.-0.j 1.-0.j]]
I don't understand why the gradient is set to be 1?
And I want to know how tensorflow handles the complex gradients in general.
How?
The equation used by Tensorflow for the gradient is:
Where the '*' means conjugate.
When using the definition of the partial derivatives wrt z and z* it uses Wirtinger Calculus. Wirtinger calculus enables to calculate the derivative wrt a complex variable for non-holomorphic functions. The Wirtinger definition is:
Why this definition?
When using for example Complex-Valued Neural Networks (CVNN) the gradients will be used over non-holomorphic, real-valued scalar function of one or several complex variables, tensorflow definition of a gradient can then be written as:
This definition corresponds with the literature of CVNN like for example chapter 4 section 4.3 of this book or Amin et al. (between countless examples).
Bit late, but I came across this issue recently too.
The key point is that TensorFlow defines the "gradient" of a complex-valued function f(z) of a complex variable as "the gradient of the real map F: (x,y) -> Re(f(x+iy)), expressed as a complex number" (the gradient of that real map is a vector in R^2, so we can express it as a complex number in the obvious way).
Presumably the reason for that definition is that in TF one is usually concerned with gradients for the purpose of running gradient descent on a loss function, and in particular for identifying the direction of maximum increase/decrease of that loss function. Using the above definition of gradient means that a complex-valued function of complex variables can be used as a loss function in a standard gradient descent algorithm, and the result will be that the real part of the function gets minimised (which seems to me a somewhat reasonable interpretation of "optimise this complex-valued function").
Now, to your question, an equivalent way to write that definition of gradient is
gradient(f) := dF/dx + idF/dy = conj(df/dz + dconj(f)/dz)
(you can easily verify that using the definition of d/dz). That's how TensorFlow handles complex gradients. As for the case of f(z):=conj(z), we have df/dz=0 (as you mention) and dconj(f)/dz=1, giving gradient(f)=1.
I wrote up a longer explanation here, if you're interested: https://github.com/tensorflow/tensorflow/issues/3348#issuecomment-512101921

Which objective is optimized Intra-Cluster sum of distances or MSE?

In the cluster analysis papers using meta-heuristic algorithms, many has optimized Mean-Squared Quantization Error (MSE). For example in
[1] and [2] .
I have a confusion with the results. They have told that they have used the MSE as the objective function. But they have reported the result values in intra-cluster sum of Euclidean distances.
K-Means minimizes Within-Cluster Sum of Squares (WCSS) (from wiki) [3]. I could not find what is the difference between WCSS and MSE, when Euclidean distance is used in the case of the difference metric when calculating MSE.
In the case of K-Means the WCSS is minimized, and if we use the same MSE function with the meta-heuristics algorithms they will also minimize it. In this case how the sum of Euclidean distances for the K-Means and the other vary?
I can reproduce the results shown in the papers if I optimize the intra-cluster sum of Euclidean distances.
I think I am doing something wrong here. Can anyone help me with this.
Main question: What objectives did the referenced papers [1] and [2] optimize, and which function's values are shown in the table?
K-means optimizes the (sum of within-cluster-) sum of squares aka variance aka sum of squared Euclidean distances.
This is easy to see if you study the convergence proof.
I can't study the two papers you referenced. They're with crappy Elsevier and paywalled, and I'm not going to pay $36+$32 to answer your question.
Update: I managed to get a free copy of one of them. They call it "MSE, mean-square quantization error", but their equation is the usual within-cluster-sum-of-squares, no mean involved; with a shady self-citation attached to this statement, and half of the references being self-citations... it seems like it's more this author that likes to call it different than everybody else. Looks bit like "reinventing the wheel with a different name" to me. I'd carefully double-check their results. I'm not saying they are false, I havn't checked in more detail. But the "mean-square error" doesn't involve a mean for sure; it's the sum of squared errors.
Update: if "intra-cluster sum" means sum of pairwise distances of any two objects, consider the following:
Without loss of generality, move the data such that the mean is 0. (Translation doesn't change Euclidean or squared Euclidean distances).
sum_x sum_y sum_i (x_i-y_i)^2
= sum_x sum_y [ sum_i (x_i)^2 + sum_i (y_i)^2 - 2 sum_i (x_i*y_i) ]
= n * sum_x sum_i (x_i)^2 + n * sum_y sum_i (y_i)
- 2 * sum_i [sum_x x_i * sum_y y_i]
The first two summands are the same. So we have 2n times the WCSS.
But since mu_i = 0, sum_x x_i = sum_y y_i = 0, and the third term disappears.
If I didn't screw up this computation, then the mean, asymmetric pairwise squared Euclidean distance within a cluster is the same as WCSS.