Explaining Variational Autoencoder gaussian parameterization - tensorflow

In the original Auto-Encoding Variational Bayes paper, the authors describe the "reparameterization trick" in section 2.4. The trick is to break up your latent state z into a learnable mean and sigma (learned by the encoder) and add Gaussian noise. You then sample a datapoint from z (essentially you generate an encoded image) and let the decoder map the encoded datapoint back to the original image.
I have a hard time getting over how strange this is. Could someone explain a bit more about the latent variable model, specifically:
Why are we assuming the latent state is Gaussian?
How is it possible that a Gaussian can generate an image?
And how does backprop push the encoder to learn a Gaussian function as opposed to an arbitrary non-linear function?
Here is an example implementation of the latent model from here in TensorFlow.
...neural net code maps input to hidden layers z_mean and z_log_sigma
self.z_mean, self.z_log_sigma_sq = \
    self._recognition_network(network_weights["weights_recog"],
                              network_weights["biases_recog"])
# Draw one sample z from the Gaussian distribution
n_z = self.network_architecture["n_z"]
eps = tf.random_normal((self.batch_size, n_z), 0, 1, dtype=tf.float32)
# z = mu + sigma*epsilon  (the reparameterization trick)
self.z = tf.add(self.z_mean,
                tf.multiply(tf.sqrt(tf.exp(self.z_log_sigma_sq)), eps))  # tf.mul in older TF versions
...neural net code maps z to output

They are not assuming that the activations of the encoder follow a Gaussian distribution; they are enforcing that, out of the possible solutions, the encoder chooses one that resembles a Gaussian.
The image is generated by decoding an activation/feature, and the activations are distributed so that they resemble a Gaussian.
They minimize the KL divergence between the distribution of the activations and a Gaussian one.
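For concreteness, here is a minimal sketch of that KL term (assuming the z_mean and z_log_sigma_sq tensors from the snippet above, i.e. a diagonal Gaussian encoder and a standard-normal prior):
# Sketch: KL( N(mu, sigma^2) || N(0, 1) ), summed over the n_z latent dimensions,
# giving one value per example in the batch.
latent_loss = -0.5 * tf.reduce_sum(
    1.0 + self.z_log_sigma_sq
    - tf.square(self.z_mean)
    - tf.exp(self.z_log_sigma_sq),
    axis=1)
Adding this term to the reconstruction loss is what enforces the Gaussian shape; nothing about the network architecture itself assumes it.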

Related

GPflow multi-class: How to squeeze many gp.predict_f_samples to obtain their probabilities?

I classify MNIST digits and I want to sample the probabilities (not the latent function values) for each class multiple times. However, gp.predict_y gives the probabilities just for one case.
So I take f_samples = gp.predict_f_samples, which returns numerous samples from the underlying latent function.
Now, how do I 'squeeze' the f_samples through the robust_max likelihood?
Code for my gp:
kernel = gpflow.kernels.Matern52(input_dim=128, ARD=ARD, active_dims=np.arange(128)) \
         + gpflow.kernels.White(input_dim=128, active_dims=np.arange(128))
# Robustmax multiclass likelihood
invlink = gpflow.likelihoods.RobustMax(10)  # Robustmax inverse link function
likelihood = gpflow.likelihoods.MultiClass(10, invlink=invlink)  # Multiclass likelihood
Z = x_train[::5].copy()  # inducing inputs
gp = gpflow.models.SVGP(x_train, y_train, num_latent=10,
                        kern=kernel, Z=Z, likelihood=likelihood,
                        whiten=True, q_diag=True)
GPflow version: 1.5.1
Once you've sampled you're no longer working with a probability distribution - you have actual values for each of your 10 latent functions. To convert a sample to probabilities over the classes you can just apply the RobustMax function (probability 1-epsilon for the largest latent function, epsilon/9 for all the others) to the 10 values you get. E.g.
eps = 0.001
f_samples = gp.predict_f_samples(x_test, num_samples)
largests = np.argmax(f_samples, axis=2)
prob_samples = np.eye(10)[largests] * (1 - eps - eps/9) + eps/9
Note that the probabilities you get will be 1 - eps = 0.999 for one class and eps/9 ≈ 0.0001 for all the others - that's what RobustMax is. If you're intending to average over your samples, you probably just want to call gp.predict_y(), which actually integrates the RobustMax over the probability distribution of the latent functions and can give you some smoother class probabilities if the latent means are close.
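If you do want a Monte-Carlo estimate from the samples rather than the analytic predict_y, a minimal sketch (assuming prob_samples from above has shape [num_samples, N, 10]) would be:
# Average the per-sample RobustMax probabilities over the sample dimension
# to get a Monte-Carlo estimate of the class probabilities for each test point.
mc_probs = prob_samples.mean(axis=0)  # shape [N, 10]
# For comparison: predict_y integrates the RobustMax analytically over the latent Gaussian.
p_y, _ = gp.predict_y(x_test)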

Considering Gaussian decoder for Variational autoencoders

I am trying to implement a variational auto-encoder for real-valued data where both the encoder and the decoder are modeled via multivariate Gaussians. I have found several implementations online for the case where the encoder is Gaussian and the decoder is Bernoulli, but nothing for the Gaussian decoder case. For the Bernoulli decoder the reconstruction loss can be defined as follows
reconstr_loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=x,logits=x_out_logit)
where x_out_logit is modeled by a DNN. I am not sure how to write the reconstruction loss for the Gaussian case. I assumed the decoder should also output a mean (gz_mean) and a variance (gz_log_sigma_sq) (similar to the Gaussian encoder), and since the reconstruction loss is the Gaussian probability of the data, I defined it to be
mvn = tf.contrib.distributions.MultivariateNormalDiag(
    loc=self.gz_mean,
    scale_diag=tf.sqrt(tf.exp(self.gz_log_sigma_sq)))
reconstr_loss = tf.log(1e-20 + mvn.prob(self.x))
However, this loss does not seem to work: mvn.prob(self.x) is always zero, no matter the training step. Please let me know of any ideas or any GitHub source which considers this case.
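One likely culprit, sketched here under the assumption that gz_mean and gz_log_sigma_sq are as described above: the density of a high-dimensional Gaussian evaluated at a point easily underflows to zero in float32, so tf.log(1e-20 + mvn.prob(self.x)) stays constant. Working in log space with log_prob avoids the underflow, along these lines:
# Sketch: Gaussian log-likelihood of x under the decoder's output distribution,
# computed in log space to avoid the underflow that makes mvn.prob(x) return 0.
mvn = tf.contrib.distributions.MultivariateNormalDiag(
    loc=self.gz_mean,
    scale_diag=tf.sqrt(tf.exp(self.gz_log_sigma_sq)))
log_likelihood = mvn.log_prob(self.x)            # log p(x | z), one value per example
reconstr_loss = -tf.reduce_mean(log_likelihood)  # negative log-likelihood to minimize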

How to constrain a layer to be a probability matrix?

I recently read this paper which deals with noisy labels in convolutional neural networks.
They model label noise by a probability transition matrix which forms a simple
constrained linear layer after the softmax output.
So as an example we may have a 3-by-3 probability transition matrix (3 classes):
(Image: example 3-by-3 probability transition matrix - the sum of each column has to be 1.)
This matrix Q is basically trained in the same way as the rest of the network via backpropagation. But it needs to be constrained to be a probability matrix. Quote from the paper:
After taking a gradient step with the Q and the model
weights, we project Q back to the subspace of probability matrices because it represents conditional probabilities.
Now I am wondering what the best way is to implement such a layer in TensorFlow.
I have some ideas but I'm not sure what would work or what the best procedure is.
1) Hard code the constraint in the model before any training is done, something like:
# ... build conv model without Q
[...]
# shape of y_conv (output CNN) assumed to be a [3,1] vector
y_conv = tf.nn.softmax(y_conv, 0)
# add linear layer representing Q, no bias
W_Q = weight_variable([3, 3])
# add constraint: columns are valid probability distribution
W_Q = tf.nn.softmax(W_Q, 0)
# output of model:
Q_out = tf.matmul(W_Q, y_conv)
# now compute loss, gradients and start training
2) Compute and apply gradients to the whole model (Q included), then apply the constraint
train_op = ...
constraint_op = tf.assign(W_Q, tf.nn.softmax(W_Q, 0))
sess = tf.Session()
# compute and apply gradients in form of a train_op
sess.run(train_op)
sess.run(constraint_op)
I think the second approach is closer to the paper quote, but I am not sure to what extent external assignments confuse training.
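For example, a sketch (using the W_Q and train_op from above) of tying the projection to the train op via a control dependency, so that a single run call performs the gradient step and then the projection:
with tf.control_dependencies([train_op]):
    project_Q_op = tf.assign(W_Q, tf.nn.softmax(W_Q, 0))
# One sess.run now does the gradient update followed by the projection:
sess.run(project_Q_op)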
Or maybe my ideas are bananas. I hope you can give me some advice!

Does Stochastic Gradient Descent even work with TensorFlow?

I designed an MLP, fully connected, with two hidden layers and one output layer.
I get a nice learning curve if I use batch or mini-batch gradient descent,
but I get a straight line (the violet curve) while performing stochastic gradient descent.
What did I get wrong?
In my understanding, I am doing stochastic gradient descent with TensorFlow if I provide just one training example each train step, like:
X = tf.placeholder("float", [None, amountInput],name="Input")
Y = tf.placeholder("float", [None, amountOutput],name="TeachingInput")
...
m, i = sess.run([merged, train_op], feed_dict={X:[input],Y:[label]})
where input is a 10-component vector and label is a 20-component vector.
For testing I run 1000 iterations, and each iteration uses one of 50 prepared training examples.
I expected an overfitted network. But as you can see, it doesn't learn :(
Because the network will run in an online-learning environment, mini-batch or batch gradient descent isn't an option.
Thanks for any hints.
The batch size influences the effective learning rate.
If you think about the update formula for a single parameter, you'll see that it is updated by averaging the various values computed for that parameter over every element in the input batch.
This means that if you're working with a batch of size n, your "real" per-parameter learning rate is about learning_rate/n.
Thus, if the model you trained with batches of size n learned without issues, this is because the learning rate was right for that batch size.
If you use pure stochastic gradient descent, you have to lower the learning rate (usually by a factor of some power of 10).
So, for example, if your learning rate was 1e-4 with a batch size of 128, try a learning rate of 1e-4 / 128.0 and see if the network learns (it should).
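A minimal sketch of that adjustment (the optimizer choice and the loss variable are placeholders; 1e-4 is assumed to be the rate that worked with mini-batches):
batch_size = 128
base_lr = 1e-4                   # learning rate that worked with mini-batches of 128
sgd_lr = base_lr / batch_size    # scaled-down rate for one-example updates
optimizer = tf.train.GradientDescentOptimizer(learning_rate=sgd_lr)
train_op = optimizer.minimize(loss)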

How does word2vec give one hot word vector from the embedding vector?

I understand how word2vec works.
I want to use word2vec (skip-gram) as input for an RNN. The input is an embedded word vector, and the output is also an embedded word vector generated by the RNN.
Here's my question: how can I convert the output vector into a one-hot word vector? I would need the inverse of the embedding matrix, but I don't have one!
The output of an RNN is not an embedding. We convert the output from the last layer of the RNN cell into a vector of size vocabulary_size by multiplying it with an appropriate matrix.
Take a look at the PTB Language Model example to get a better idea. Specifically look at lines 133-136:
softmax_w = tf.get_variable("softmax_w", [size, vocab_size], dtype=data_type())
softmax_b = tf.get_variable("softmax_b", [vocab_size], dtype=data_type())
logits = tf.matmul(output, softmax_w) + softmax_b
The above operation will give you logits. Passing the logits through a softmax gives a probability distribution over your vocabulary, and numpy.random.choice can then be used to sample a prediction from it.
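A small sketch of that last step (logits_val is assumed to be the logits row for a single time step, fetched from the session):
import numpy as np

def sample_word_id(logits_val):
    # Numerically stable softmax turns the logits into a probability distribution,
    # then a word id is drawn from the vocabulary according to those probabilities.
    exp = np.exp(logits_val - np.max(logits_val))
    probs = exp / exp.sum()
    return np.random.choice(len(probs), p=probs)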