Variational autoencoder cannot train with small input values - tensorflow

I am using a variational autoencoder to reconstruct images in TensorFlow 2.0 with the Keras API. My model's architecture looks like this:
The lambda layer uses a function that samples from a normal distribution:
def sampling(args):
    z_mean, z_log_var = args
    epsilon = K.random_normal(shape=(1, 1, 16))
    return z_mean + K.exp(0.5 * z_log_var) * epsilon
My hyperparameters are as follows:
epochs = 50
batch size = 16
num_training = 1800
num_val = 100
num_test = 100
learning rate = 0.001
exponential decay = 0.9 * initial learning rate (calculated every 5 epochs)
optimizer = Adam
shuffle = True
I am using the following loss:
def vae_loss(y_pred, y_gt):
    mse_loss = mse(y_pred, y_gt)
    z_mean = model.get_layer('z_mean_layer').output
    z_log_var = model.get_layer('z_log_var_layer').output
    kl_loss = 1 + z_log_var - K.square(z_mean) - K.exp(z_log_var)
    kl_loss = K.sum(kl_loss, axis=-1)
    kl_loss *= -0.5
    return K.mean(mse_loss + kl_loss)
My weights are initialized the default way: kernel_initializer='glorot_uniform', bias_initializer='zeros'.
Each image in my dataset contains a randomly placed circle, like this:
The background has the value 0 and the circle's value is sampled from a uniform distribution between -1 and 1, e.g. 0.987 for all circle pixels.
When I train with this configuration, I get the following loss.
The KL divergence is of magnitude 1e-8, whereas the MSE loss stays at 0.101.
I also always get the same reconstruction regardless of the input: an image with a constant pixel intensity.
Now, if I multiply all input images by 500 (i.e. the background stays zero and the circle pixel values are uniformly distributed in the range (-500, 500)), the network miraculously starts to learn.
In the last epochs the KL loss is of magnitude 50 and the MSE loss of magnitude 250.
The image reconstruction also works well. The MSE metric is high, but the circle contour is positioned in the right place.
My question is: why can the network not reconstruct images in the range (-1, 1), when it can do so in the range (-500, 500)?
Machine precision is set to float32.
I have used numerous learning rates, e.g. 0.00001, but this does not solve the problem. I have also trained for many epochs, e.g. 200, still no result.

As mentioned in the comments, the problem is probably the scaling of the loss. Your current MSE loss takes the mean of the squared differences, which is fairly small. Instead of the mean, try using the sum of the squared differences over the image. The Keras VAE example does this by scaling the computed MSE loss by the original image size (in PyTorch this can be specified directly).
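To make the scaling point concrete, here is a minimal NumPy sketch. It assumes a hypothetical 64x64 single-channel image (the question does not state the image size) and a constant reconstruction. The per-pixel mean and the per-image sum of squared errors differ by exactly the number of pixels, so switching from the mean to the sum multiplies the weight of the reconstruction term relative to the KL term by that factor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical image size; the question does not state it.
height, width = 64, 64
y_true = rng.uniform(-1.0, 1.0, size=(height, width))
y_pred = np.zeros_like(y_true)  # a constant reconstruction, as in the collapsed model

squared_error = (y_true - y_pred) ** 2
mse_mean = squared_error.mean()  # averaged over pixels
mse_sum = squared_error.sum()    # summed over the whole image

# The ratio of the two is the number of pixels in the image.
print(mse_mean, mse_sum, mse_sum / mse_mean)
```

Multiplying the inputs by 500 has a similar effect on the balance between the two terms, since it scales the squared error by 500**2 while leaving the form of the KL term untouched.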


Reconstruction loss function of VAE

I am learning from an example given in the TensorFlow documentation:
VAEs train by maximizing the evidence lower bound (ELBO) on the
marginal log-likelihood.
In practice, optimize the single sample Monte Carlo estimate of this
expectation: logp(x|z) + logp(z) - logq(z|x).
The loss function was implemented as:
def log_normal_pdf(sample, mean, logvar, raxis=1):
    # Log-density of a diagonal Gaussian, summed over the latent axis.
    log2pi = tf.math.log(2. * np.pi)
    return tf.reduce_sum(
        -.5 * ((sample - mean) ** 2. * tf.exp(-logvar) + logvar + log2pi),
        axis=raxis)

def compute_loss(model, x):
    mean, logvar = model.encode(x)
    z = model.reparameterize(mean, logvar)
    x_logit = model.decode(z)
    # Bernoulli reconstruction term, summed over height, width and channels.
    cross_ent = tf.nn.sigmoid_cross_entropy_with_logits(logits=x_logit, labels=x)
    logpx_z = -tf.reduce_sum(cross_ent, axis=[1, 2, 3])
    logpz = log_normal_pdf(z, 0., 0.)
    logqz_x = log_normal_pdf(z, mean, logvar)
    # Negative single-sample ELBO estimate, averaged over the batch.
    return -tf.reduce_mean(logpx_z + logpz - logqz_x)
Since this example uses the MNIST dataset, x can be normalized to [0, 1], which is why sigmoid_cross_entropy_with_logits is used here.
My questions are:
What if x > 1, what kind of loss could be used?
Can we use other loss functions as the reconstruction loss in a VAE, such as Huber loss?
Another example uses MSE loss (as follows). Is MSE loss a valid ELBO loss to measure p(x|z)?
# Iterate over the batches of the dataset.
for step, x_batch_train in enumerate(train_dataset):
    with tf.GradientTape() as tape:
        reconstructed = vae(x_batch_train)
        # Compute reconstruction loss
        loss = mse_loss_fn(x_batch_train, reconstructed)
        loss += sum(vae.losses)  # Add KLD regularization loss
In the loss function of a variational autoencoder, you jointly optimize two terms:
The reconstruction loss between prediction and label, like in a normal autoencoder
The distance between the parametrized probability distribution and the assumed true probability distribution. In practice, the true distribution is usually assumed to be Gaussian and distance is measured in terms of Kullback-Leibler divergence
For the reconstruction loss part, you can pick any loss function that fits your data, including MSE and Huber. It is generally still a good idea to normalize your input features though.
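As a sketch of how MSE fits into the ELBO: if p(x|z) is assumed to be a Gaussian whose mean is the decoder output and whose variance sigma^2 is fixed (an assumption; neither example above states it), then -log p(x|z) is the sum of squared errors divided by 2*sigma^2, plus a constant that does not depend on the reconstruction. The helper name gaussian_nll below is hypothetical.

```python
import numpy as np

def gaussian_nll(x, x_hat, sigma=1.0):
    """Negative log-likelihood of x under N(x_hat, sigma^2 I), summed over elements."""
    var = sigma ** 2
    return np.sum(0.5 * ((x - x_hat) ** 2 / var + np.log(2.0 * np.pi * var)))

rng = np.random.default_rng(0)
x = rng.normal(size=(28, 28))      # stand-in for an input image
x_hat = rng.normal(size=(28, 28))  # stand-in for a decoder output

sigma = 1.0
sse = np.sum((x - x_hat) ** 2)
constant = 0.5 * x.size * np.log(2.0 * np.pi * sigma ** 2)

# The NLL is the scaled sum of squared errors plus a reconstruction-independent constant.
print(gaussian_nll(x, x_hat, sigma), sse / (2.0 * sigma ** 2) + constant)
```

Under this reading, the choice of sigma sets the relative weight of the reconstruction term and the KL term, and it is the sum over pixels, not the mean, that matches the log-likelihood.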

Keras dense model gradient explosion

I have a very simple dense model that takes 10 input values, with 20 units in the hidden layer, 1 unit in the output layer, "relu" as the activation function, and the Adam optimizer with a learning rate of 0.01.
layer_dense(densemodel, input_shape=ncol(trainingX), units=20, activation="relu")
layer_dropout(densemodel, rate=0.1)
layer_dense(densemodel, units=1, activation="relu")
compile(densemodel, optimizer=optimizer, loss="logcosh", metrics = list("mean_squared_error"))
I trained the model on n = 2e4 training records and ran into severe gradient explosion, which I eventually confirmed was caused by a few outliers (n < 10) in the training records.
Without removing the outlier records, none of the following strategies, alone or in combination, fixed the gradient explosion:
kernel_regularizer, bias_regularizer, activity_regularizer, clipnorm=1, clipvalue=0.5 or 0.1, setting the learning rate to 1e-5, adding a dropout layer, increasing the batch size.
I expected at least clipnorm or clipvalue to work, since according to their definitions:
clipnorm: Gradients will be clipped when their L2 norm exceeds this value.
clipvalue: Gradients will be clipped when their absolute value exceeds this value.
So why did they fail?
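For reference, here is a NumPy sketch of what the two quoted definitions describe. It is an illustration of the definitions, not the Keras implementation; the helper names and the gradient values are hypothetical.

```python
import numpy as np

def clip_by_norm(grad, clipnorm):
    """Rescale grad so that its L2 norm is at most clipnorm."""
    norm = np.linalg.norm(grad)
    if norm > clipnorm:
        return grad * (clipnorm / norm)
    return grad

def clip_by_value(grad, clipvalue):
    """Clamp each component of grad to [-clipvalue, clipvalue]."""
    return np.clip(grad, -clipvalue, clipvalue)

grad = np.array([300.0, -400.0, 0.5])  # hypothetical gradient from an outlier record
by_norm = clip_by_norm(grad, clipnorm=1.0)
by_value = clip_by_value(grad, clipvalue=0.5)

print(np.linalg.norm(by_norm))  # L2 norm after clipping by norm
print(by_value)                 # components after clipping by value
```

Both operations act only on the gradient handed to the optimizer. They bound the size of an update, but they do not change the loss value or the activations that an outlier record produces in the forward pass.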

Neural Network Input scaling

I trained a simple fully connected network on the CIFAR-10 dataset:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(3*32*32, 300, bias=False)
        self.fc2 = nn.Linear(300, 10, bias=False)

    def forward(self, x):
        x = x.reshape(250, -1)  # assumes a fixed batch size of 250
        self.x2 = F.relu(self.fc1(x))
        x = self.fc2(self.x2)
        return x

def train():
    # The output of torchvision datasets are PILImage images of range [0, 1].
    transform = transforms.Compose([transforms.ToTensor()])
    trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=250, shuffle=True, num_workers=4)
    testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
    # batch_size=250 is required here by the reshape in Net.forward
    testloader = torch.utils.data.DataLoader(testset, batch_size=250, shuffle=False, num_workers=4)
    net = Net()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(net.parameters(), lr=0.02, momentum=0.9, weight_decay=0.0001)
    for epoch in range(20):
        correct = 0
        total = 0
        for data in trainloader:
            inputs, labels = data
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            # Update step (missing from the snippet as posted)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
        acc = 100. * correct / total
This network gets to ~50% test accuracy with the parameters specified, after 20 epochs.
Note that I didn't do any whitening of the inputs (no per channel mean subtraction)
Next I scaled up the model inputs by 255, by replacing outputs = net(inputs) with outputs = net(inputs*255). After this change, the network no longer converges. I looked at the gradients and they seem to grow explosively after just a few iterations, leading to all model outputs being zero. I'd like to understand why this is happening.
Also, I tried scaling down the learning rate by 255. This helps, but the network only gets to ~43% accuracy. Again, I don't understand why this helps, and more importantly why the accuracy is still degraded compared to the original settings.
EDIT: forgot to mention that I don't use biases in this network.
EDIT2: I can recover the original accuracy if I scale down the initial weights in both layers by 255 (in addition to scaling down the learning rate). I also tried scaling down the initial weights in the first layer only, but the network had trouble learning (even when I scaled down the learning rate in both layers). Then I tried scaling down the learning rate in the first layer only, which also didn't help. Finally, I reduced the learning rate in both layers even further (by 255*255), and this suddenly worked. This does not make sense to me: scaling down the initial weights by the same factor by which the inputs were scaled up should have completely eliminated any difference from the original network, since the input to the second layer is identical. At that point the learning rate should need scaling down in the first layer only, but in practice both layers need a significantly lower learning rate...
Scaling up the inputs leads to exploding gradients for a few reasons:
The learning rate is common to all the weights in a given update step.
Hence, the same scaling factor (i.e. the learning rate) is applied to every weight's cost derivative regardless of its magnitude, so large and small weights are updated on the same scale.
When the loss landscape is highly erratic, this leads to exploding gradients. It works like a snowball effect: one overshot update along, say, the axis of one particular weight causes an overshoot in the opposite direction in the next update, and so on.
The raw pixel values range from 0 to 255, so dividing the data by 255 ensures all inputs lie between 0 and 1, which gives smoother convergence because the gradients are more uniform with respect to the learning rate. Here you scaled the learning rate instead, which addresses some of the problems above but is not as effective as scaling the data itself. A lower learning rate also makes convergence slower, which might be why the network only reaches 43% after 20 epochs; it may simply need more epochs.
CIFAR-10 is a significant step up from something like the MNIST dataset, hence, fully connected neural networks do not have the representation power needed to accurately predict these images. CNNs are the way to go for any image classification task beyond MNIST. ~50% accuracy is the max you can get with a fully connected neural network unfortunately.
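The scaling argument can be checked numerically for the first layer alone. The sketch below assumes a bias-free linear layer h = W x, for which the weight gradient is the outer product of the upstream gradient and the input; the weight values, the upstream gradient delta and the variable names are all hypothetical. It shows that multiplying the input by k multiplies the weight gradient by k, and that if the weights are also divided by k, the pre-activations are unchanged but a plain gradient descent step is k**2 times larger relative to the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 255.0

x = rng.uniform(0.0, 1.0, size=3072)          # one flattened 3x32x32 input in [0, 1]
W = rng.normal(scale=0.01, size=(300, 3072))  # hypothetical first-layer weights
delta = rng.normal(size=300)                  # hypothetical upstream gradient dL/d(Wx)

grad = np.outer(delta, x)                     # dL/dW for the original input

# Inputs scaled by k: the weight gradient grows by the same factor
# (delta is held fixed here for illustration).
grad_scaled_input = np.outer(delta, k * x)

# Inputs scaled by k and weights scaled by 1/k: the pre-activations are unchanged,
# so delta really is the same as in the original network.
W_small = W / k
same_preactivation = np.allclose(W_small @ (k * x), W @ x)

relative_step = np.linalg.norm(grad) / np.linalg.norm(W)
relative_step_rescaled = np.linalg.norm(grad_scaled_input) / np.linalg.norm(W_small)

print(same_preactivation)
print(relative_step_rescaled / relative_step)  # ratio of relative update sizes
```

This is consistent with the 255*255 learning rate reduction reported in EDIT2 for the first layer. It does not by itself explain why the second layer also needed the lower rate, and it ignores momentum and weight decay.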
Maybe decrease the learning rate by 1/255 ... just a guess

Cosine similarity loss causes weight values to explode

Suppose my data consists of images of bubbles, and the labels are histograms describing the distribution of sizes, for example:
0-10mm 10%
10-20mm 30%
20-30mm 40%
30-40mm 20%
It is important to note that -
All size percentages sum to 100% (or 1.0 to be more precise).
I don't have annotated data, so I can't train an object detector and then calculate the distribution by counting the detected objects. However, I do have a feature extractor trained on my data.
I implemented a simple CNN that consists of -
Resnet50 backbone.
Global max pooling.
1x1 convolution of 6 filters (6 distribution bins in labels).
After some experiments I came to the conclusion that softmax with cross entropy as the loss function does not suit my problem and needs.
I thought that maybe a cosine similarity loss, with a light modification, may be a good alternative (normalization will be part of post process). This is the implementation:
def cosine_similarity_loss(logits, probs, weights=1.0, label_smoothing=0):
    x1_val = tf.sqrt(tf.reduce_sum(tf.matmul(logits, tf.transpose(logits)), axis=1))
    x2_val = tf.sqrt(tf.reduce_sum(tf.matmul(probs, tf.transpose(probs)), axis=1))
    denom = tf.multiply(x1_val, x2_val)
    num = tf.reduce_sum(tf.multiply(logits, probs), axis=1)
    cosine_sim = tf.math.divide(num, denom)
    cosine_dist = tf.math.reduce_mean(1 - tf.square(cosine_sim))  # Cosine Distance. Reduce mean for shape compatibility.
    return cosine_dist
The loss is the sum of the cosine distance and L2 regularization on the weights. After the first forward pass I got loss: 3.1267, and after the second I got loss: 96003645440.0000, meaning the weights exploded (logits: [[-785595.812 -553858.625 -545579.625 -148547.875 -12845.8633 19871.1055]] while probs: [[0.466 0.297 0.19 0.047 0 0]]).
What could be the reason for such rapid and extreme increase?
My guess is that the cosine distance internally normalises the logits, removing their magnitude, so there is no gradient to propagate that opposes the values increasing. BTW, the weights argument is not used in your implementation.
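The scale-invariance part of this guess can be checked with a small NumPy sketch. The logits below are hypothetical, the probs are the ones from the question, and the helper names are made up for illustration. The gradient uses the standard closed form for the cosine similarity of two vectors.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine similarity between two 1-D vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def cosine_similarity_grad(x, y):
    """Analytic gradient of cosine_similarity(x, y) with respect to x."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    return y / (nx * ny) - cosine_similarity(x, y) * x / nx ** 2

logits = np.array([2.0, -1.0, 0.5, 3.0, -0.2, 1.0])     # hypothetical values
probs = np.array([0.466, 0.297, 0.19, 0.047, 0.0, 0.0])  # labels from the question

# Scaling the logits leaves the similarity unchanged ...
sim = cosine_similarity(logits, probs)
sim_scaled = cosine_similarity(1e5 * logits, probs)

# ... so the gradient has no component along the logits themselves.
radial_component = np.dot(cosine_similarity_grad(logits, probs), logits)

print(sim, sim_scaled, radial_component)
```

Since 1 - cos^2 depends on the logits only through the cosine, the same holds for the loss in the question: nothing in this term pushes the magnitude of the logits back down. Separately, summing tf.matmul(logits, tf.transpose(logits)) over axis=1 equals the squared row norm only when the batch contains a single row; for larger batches it adds cross terms between different samples.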
What about plain Euclidean distance, using sigmoid instead of softmax in the last layer? Also, I would try adding one or two more dense layers (say of size 512) between the ResNet50 backbone and the output dense layer.

How can I improve my LSTM accuracy in Tensorflow

I'm trying to figure out how to decrease the error in my LSTM. It's an odd use-case because rather than classifying, we are taking in short lists (up to 32 elements long) and outputting a series of real numbers, ranging from -1 to 1 - representing angles. Essentially, we want to reconstruct short protein loops from amino acid inputs.
In the past we had redundant data in our datasets, so the accuracy reported was incorrect. Since removing the redundant data our validation accuracy has gotten much worse, which suggests our network had learned to memorise the most frequent examples.
Our dataset has 10,000 items, split 70/20/10 between train, validation and test. We use a bi-directional LSTM as follows:
x = tf.cast(tf_train_dataset, dtype=tf.float32)
output_size = FLAGS.max_cdr_length * 4
dmask = tf.placeholder(tf.float32, [None, output_size], name="dmask")
keep_prob = tf.placeholder(tf.float32, name="keepprob")
sizes = [FLAGS.lstm_size,int(math.floor(FLAGS.lstm_size/2)),int(math.floor(FLAGS.lstm_size/ 4))]
single_rnn_cell_fw = tf.contrib.rnn.MultiRNNCell( [lstm_cell(sizes[i], keep_prob, "cell_fw" + str(i)) for i in range(len(sizes))])
single_rnn_cell_bw = tf.contrib.rnn.MultiRNNCell( [lstm_cell(sizes[i], keep_prob, "cell_bw" + str(i)) for i in range(len(sizes))])
length = create_length(x)
initial_state = single_rnn_cell_fw.zero_state(FLAGS.batch_size, dtype=tf.float32)
initial_state = single_rnn_cell_bw.zero_state(FLAGS.batch_size, dtype=tf.float32)
outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw=single_rnn_cell_fw, cell_bw=single_rnn_cell_bw, inputs=x, dtype=tf.float32, sequence_length = length)
output_fw, output_bw = outputs
states_fw, states_bw = states
output_fw = last_relevant(FLAGS, output_fw, length, "last_fw")
output_bw = last_relevant(FLAGS, output_bw, length, "last_bw")
output = tf.concat((output_fw, output_bw), axis=1, name='bidirectional_concat_outputs')
test = tf.placeholder(tf.float32, [None, output_size], name="train_test")
W_o = weight_variable([sizes[-1]*2, output_size], "weight_output")
b_o = bias_variable([output_size],"bias_output")
y_conv = tf.tanh( ( tf.matmul(output, W_o)) * dmask, name="output")
Essentially, we use 3 layers of LSTM, with 256, 128 and 64 units respectively. We take the last step of both the forward and backward passes and concatenate them. These feed into a final, fully connected layer that presents the data in the form we need. We use a mask to zero out the steps we don't need.
Our cost function uses a mask again, and takes the mean of the squared difference. We build the mask from the test data. Values to ignore are set to -3.0.
def cost(goutput, gtest, gweights, FLAGS):
    mask = tf.sign(tf.add(gtest, 3.0))
    basic_error = tf.square(gtest - goutput) * mask
    basic_error = tf.reduce_sum(basic_error)
    basic_error /= tf.reduce_sum(mask)
    return basic_error
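One way to sanity-check the masking is to reproduce the same computation in NumPy on a tiny hand-made example. This is a sketch under the assumptions stated above (real targets lie in [-1, 1], ignored entries are exactly -3.0); the helper name masked_mse and the sample values are hypothetical.

```python
import numpy as np

def masked_mse(output, target, ignore_value=-3.0):
    """Mean squared error over the entries of target that are not ignore_value."""
    # 1 for real targets in [-1, 1], 0 where target equals ignore_value
    mask = np.sign(target - ignore_value)
    squared_error = (target - output) ** 2 * mask
    return squared_error.sum() / mask.sum()

target = np.array([[0.5, -0.5, -3.0, -3.0]])  # last two entries are padding
output = np.array([[0.0,  0.0,  0.9, -0.9]])  # predictions at padded steps are arbitrary

print(masked_mse(output, target))
```

If the masking works, changing the predictions at the padded positions should leave the result unchanged, and the error should be averaged over the two real entries only.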
To train the net I've used a variety of optimizers. The lowest scores have been obtained with the AdamOptimizer. The others, such as Adagrad, Adadelta, RMSProp tend to flatline around 0.3/0.4 error which is not particularly great.
Our learning rate is 0.004, batch size of 200. We use a 0.5 probability dropout layer.
I've tried adding more layers, changing learning rates, batch sizes, even the representation of the data. I've attempted batch regularisation, L1 and L2 weight regularisation (though perhaps incorrectly) and I've even considered switching to a convnet approach instead.
Nothing seems to make any difference. What has seemed to work is changing the optimizer. Adam seems noisier as it improves, but it does get closer than the other optimizers.
We need to get down to a value much closer to 0.05 or 0.01. Sometimes the training error touches 0.09 but the validation doesn't follow. I've run this network for about 500 epochs so far (about 8 hours) and it tends to settle around 0.2 validation error.
I'm not quite sure what to attempt next. Decayed learning rate might help but I suspect there is something more fundamental I need to do. It could be something as simple as a bug in the code - I need to double check the masking,