Tensorflow converge to mean - tensorflow

I am trying to predict a binary output using tensorflow. The training data has roughly 69% zeros for the output. The input features are real valued, and I normalized them by subtracting the mean and dividing by the standard deviation. Every time I run the network, no matter what techniques I've tried, I cannot get a model >69% accurate, and it looks like my Yhat is converging to all zeros.
I've tried a lot of things like different optimizers, loss functions, batch sizes, etc., but no matter what I do it converges to 69% and never goes over. I'm guessing there's a more fundamental problem with what I'm doing, but I can't seem to find it.
Here is the latest version of my code:
X = tf.placeholder(tf.float32,shape=[None,14],name='X')
Y = tf.placeholder(tf.float32,shape=[None,1],name='Y')
W1 = tf.Variable(tf.truncated_normal(shape=[14,20],stddev=0.5))
b1 = tf.Variable(tf.zeros([20]))
l1 = tf.nn.relu(tf.matmul(X,W1) + b1)
l1 = tf.nn.dropout(l1,0.5)
W2 = tf.Variable(tf.truncated_normal(shape=[20,20],stddev=0.5))
b2 = tf.Variable(tf.zeros([20]))
l2 = tf.nn.relu(tf.matmul(l1,W2) + b2)
l2 = tf.nn.dropout(l2,0.5)
W3 = tf.Variable(tf.truncated_normal(shape=[20,15],stddev=0.5))
b3 = tf.Variable(tf.zeros([15]))
l3 = tf.nn.relu(tf.matmul(l2,W3) + b3)
l3 = tf.nn.dropout(l3,0.5)
W5 = tf.Variable(tf.truncated_normal(shape=[15,1],stddev=0.5))
b5 = tf.Variable(tf.zeros([1]))
Yhat = tf.matmul(l3,W5) + b5
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=Yhat, labels=Y))
learning_rate = 0.005
l2_weight = 0.001
learner = tf.train.AdamOptimizer(learning_rate).minimize(loss)
correct_prediction = tf.equal(tf.greater(Y,0.5), tf.greater(Yhat,0.5))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

When you calculate your correct_prediction
correct_prediction = tf.equal(tf.greater(Y,0.5), tf.greater(Yhat,0.5))
It seems that Yhat is still logits; you're supposed to compute a Y_pred by applying a sigmoid, and use Y_pred to calculate your correct_prediction:
Y_pred = tf.nn.sigmoid(Yhat)
correct_prediction = tf.equal(tf.greater(Y,0.5), tf.greater(Y_pred,0.5))

You are using a constant dropout.
l3 = tf.nn.dropout(l3,0.5)
Dropout should be used only while training and not while checking accuracy or during prediction.
keep_prob = tf.placeholder(tf.float32)
l3 = tf.nn.dropout(l3,keep_prob)
The placeholder should be given an appropriate value (for example 0.5) during training and 1.0 during testing/prediction.
You also have dropout at every layer; I am not sure you need that many dropout layers for such a small network. Hope this helps.
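Putting both fixes together, here is a minimal sketch of the changed parts of the graph. It reuses the tensors defined in the question (X, Y, W1, b1, ..., W5, b5); the 0.5 keep probability during training just mirrors the question's original rate:
# Dropout controlled by a placeholder so it can be switched off at test time
keep_prob = tf.placeholder(tf.float32, name='keep_prob')

l1 = tf.nn.relu(tf.matmul(X, W1) + b1)
l1 = tf.nn.dropout(l1, keep_prob)
# ... same pattern for l2 and l3 ...

Yhat = tf.matmul(l3, W5) + b5   # logits
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=Yhat, labels=Y))

# Convert logits to probabilities before thresholding for accuracy
Y_pred = tf.nn.sigmoid(Yhat)
correct_prediction = tf.equal(tf.greater(Y, 0.5), tf.greater(Y_pred, 0.5))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Training step: feed keep_prob=0.5; evaluation: feed keep_prob=1.0
# sess.run(learner, feed_dict={X: x_batch, Y: y_batch, keep_prob: 0.5})
# sess.run(accuracy, feed_dict={X: x_test, Y: y_test, keep_prob: 1.0})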

Related

VAE reconstruction loss (MSE) not decreasing, but KL Divergence is

I've been trying to create an LSTM VAE to reconstruct multivariate time-series data on Tensorflow. To start off I attempted to adapt (changed to Functional API, changed layers) the approach taken here and came up with the following code:
input_shape = 13
latent_dim = 2
prior = tfd.Independent(tfd.Normal(loc=tf.zeros(latent_dim), scale=1), reinterpreted_batch_ndims=1)
input_enc = Input(shape=[512, input_shape])
lstm1 = LSTM(latent_dim * 16, return_sequences=True)(input_enc)
lstm2 = LSTM(latent_dim * 8, return_sequences=True)(lstm1)
lstm3 = LSTM(latent_dim * 4, return_sequences=True)(lstm2)
lstm4 = LSTM(latent_dim * 2, return_sequences=True)(lstm3)
lstm5 = LSTM(latent_dim, return_sequences=True)(lstm4)
lat = Dense(tfpl.MultivariateNormalTriL.params_size(latent_dim))(lstm5)
reg = tfpl.MultivariateNormalTriL(latent_dim, activity_regularizer= tfpl.KLDivergenceRegularizer(prior, weight=1.0))(lat)
lstm6 = LSTM(latent_dim, return_sequences=True)(reg)
lstm7 = LSTM(latent_dim * 2, return_sequences=True)(lstm6)
lstm8 = LSTM(latent_dim * 4, return_sequences=True)(lstm7)
lstm9 = LSTM(latent_dim * 8, return_sequences=True)(lstm8)
lstm10 = LSTM(latent_dim * 16, return_sequences=True)(lstm9)
output_dec = TimeDistributed(Dense(input_shape))(lstm10)
enc = Model(input_enc, reg)
vae = Model(input_enc, output_dec)
vae.compile(optimizer='adam',
            loss='mse',
            metrics='mse')
es = callbacks.EarlyStopping(monitor='val_loss',
                             mode='min',
                             verbose=1,
                             patience=5,
                             restore_best_weights=True)
vae.fit(tf_train,
        epochs=1000,
        callbacks=[es],
        validation_data=tf_val,
        shuffle=True)
By observing the MSE as a metric I've noticed that it does not change during training; only the KL divergence goes down. Then I set the activity_regularizer argument to None and, indeed, the MSE did go down. So it seems that the KL divergence is preventing the reconstruction error from being optimised.
Why is that? Am I doing anything obviously wrong?
Any help greatly appreciated!
(I'm aware the latent dimension is rather small, I set it to two to easily visualise it, though this behaviour still occurs with larger latent dimensions, hence I don't think the problem lies there.)
Could it be that you are using an autoencoder whose loss contains a KL divergence term? In a (beta-)VAE the loss is Loss = MSE + beta * KL.
Since beta = 1 would be a normal VAE, you could try making beta smaller than one. This gives more weight to the MSE and less to the KL divergence, which should help the reconstruction but is bad if you would like a disentangled latent space.
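In the code from the question, the natural place to try this is the weight argument of the KLDivergenceRegularizer, which plays the role of beta here. The 0.001 below is only an illustrative starting value, not a recommendation:
beta = 0.001  # beta < 1 down-weights the KL term relative to the MSE reconstruction loss
reg = tfpl.MultivariateNormalTriL(
    latent_dim,
    activity_regularizer=tfpl.KLDivergenceRegularizer(prior, weight=beta)
)(lat)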

Is this Neural Net example I'm looking at a mistake or am I not understanding backprop?

Is this model using one relu in two places, or are gradients computed by doing a matrix multiplication of layers on both sides of one layer?
In the last layer of this simple neural net (below), during backprop it calculates the gradient for the last layer w2 by doing a matrix multiplication of y_pred - y and h_relu, which I thought was something that only happened between layers w1 and w2, not between w2 and y_pred.
The line in question is near the bottom. It is grad_w2 = h_relu.t().mm(grad_y_pred).
I am confused because I thought everything was supposed to go in order forward and go in order backwards. Is this relu being used in two places?
Here is an attempt at a visual illustration of the model.
This example is from the Pytorch website. It is the second block of code on the page.
grad_w2 = h_relu.t().mm(grad_y_pred)
import torch
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
I appreciate your patience looking at this and trying to clear this up for me.
If you could try adding another layer of weights in the middle with another ReLU, that might help me understand. This is what I was trying to do.
Consider the following diagram which represents the network in question. The concept of back-propagation is simply a way to quickly and intuitively apply the chain rule on a complex sequence of operations to compute the gradient of an output w.r.t. a tensor. Usually we are interested in computing the gradients of leaf tensors (tensors which are not derived from other tensors) with respect to a loss or objective. All the leaf tensors are represented as circles in the following diagram and the loss is represented by the rectangle with the L label.
Using the backward diagram we can follow the path from L to w1 and w2 in order to determine which partial derivatives we need in order to compute the gradient of L w.r.t. w1 and w2. For simplicity we will assume that all the leaf tensors are scalars so as to avoid getting into the complexities of multiplying vectors and matrices.
Using this approach, the gradients of L w.r.t. w1 and w2 are

dL/dw1 = (dL/dy_pred) * (dy_pred/dh_relu) * (dh_relu/dh) * (dh/dw1)

and

dL/dw2 = (dL/dy_pred) * (dy_pred/dw2)
Something to notice is that since w2 is a leaf tensor, the term dy_pred/dw2 (which is the h_relu factor appearing in grad_w2) is only used when computing dL/dw2; it plays no part in dL/dw1, because w2 is not on the path from L to w1.
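Since the question asked what adding another weight matrix and ReLU in the middle would look like, here is a sketch of the same manual-backprop example extended to three weight matrices. The hidden sizes, scaled initialization, and learning rate are arbitrary choices of mine; the point is the pattern: each grad_wK multiplies the activation feeding into wK by the gradient flowing back into wK's output, exactly like grad_w2 = h_relu.t().mm(grad_y_pred) in the original.
import torch

N, D_in, H1, H2, D_out = 64, 1000, 100, 50, 10

x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Scaled random initialization so the deeper net doesn't blow up immediately
w1 = torch.randn(D_in, H1) * (2.0 / D_in) ** 0.5
w2 = torch.randn(H1, H2) * (2.0 / H1) ** 0.5
w3 = torch.randn(H2, D_out) * (2.0 / H2) ** 0.5

learning_rate = 1e-4
for t in range(500):
    # Forward pass: two hidden ReLU layers
    h1 = x.mm(w1)
    h1_relu = h1.clamp(min=0)
    h2 = h1_relu.mm(w2)
    h2_relu = h2.clamp(min=0)
    y_pred = h2_relu.mm(w3)

    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backward pass: walk the same chain in reverse
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w3 = h2_relu.t().mm(grad_y_pred)   # activation into w3 x upstream gradient
    grad_h2_relu = grad_y_pred.mm(w3.t())
    grad_h2 = grad_h2_relu.clone()
    grad_h2[h2 < 0] = 0                     # gate of the second ReLU
    grad_w2 = h1_relu.t().mm(grad_h2)       # activation into w2 x upstream gradient
    grad_h1_relu = grad_h2.mm(w2.t())
    grad_h1 = grad_h1_relu.clone()
    grad_h1[h1 < 0] = 0                     # gate of the first ReLU
    grad_w1 = x.t().mm(grad_h1)             # input x upstream gradient

    # Gradient descent update
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
    w3 -= learning_rate * grad_w3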

Weighted cost function in tensorflow

I'm trying to introduce weighting into the following cost function:
_cost = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=_logits, labels=y))
But without having to do the softmax cross entropy myself. So I was thinking of breaking the cost calc up into cost1 and cost2 and feeding in a modified version of my logits and y values to each one.
I want to do something like this but not sure what is the correct code:
mask=(y==0)
y0 = tf.boolean_mask(y,mask)*y1Weight
(This gives the error that mask cannot be scalar)
The weight masks can be computed using tf.where. Here is the weighted cost example:
batch_size = 100
y1Weight = 0.25
y0Weight = 0.75
_logits = tf.Variable(tf.random_normal(shape=(batch_size, 2), stddev=1.))
y = tf.random_uniform(shape=(batch_size,), maxval=2, dtype=tf.int32)
_cost = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=_logits, labels=y)
# Weight mask: the weight for label=0 is y0Weight and for label=1 is y1Weight
y_w = tf.where(tf.cast(y, tf.bool), tf.ones((batch_size,))*y1Weight, tf.ones((batch_size,))*y0Weight)
# New weighted cost
cost_w = tf.reduce_mean(tf.multiply(_cost, y_w))
As suggested by @user1761806, the simpler solution would be to use tf.losses.sparse_softmax_cross_entropy(), which allows weighting of the classes.
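For reference, a minimal sketch of that route, reusing _logits, y and y_w from above; the function applies the per-example weights and performs the reduction for you:
cost_w = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=_logits, weights=y_w)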
You can also calculate the weighted cost using a predefined weights_per_class tensor of shape (num_classes, 1) together with one-hot encoded labels:
# labels_one_hot has shape [batch_size, num_classes]; obtained e.g. with tf.one_hot(y, num_classes)
_cost = tf.nn.softmax_cross_entropy_with_logits(logits=_logits, labels=labels_one_hot)  # shape (batch_size,)
# Here you can define a deterministic weights tensor.
# weights_per_class = tf.constant(np.array([[y0Weight], [y1Weight], ...]))
weights_per_class = tf.random_normal(shape=(num_classes, 1), dtype=tf.float32)
# The one-hot labels pick out each example's class weight; use it to weight the per-example cost
example_weights = tf.matmul(labels_one_hot, weights_per_class)   # shape (batch_size, 1)
_weighted_cost = tf.reduce_mean(tf.squeeze(example_weights, axis=1) * _cost)

Tensor Flow MNIST Evaluating predictions

I am working on this tutorial and I found the following code: when evaluating predictions he runs accuracy, which runs the correct variable, which in turn runs prediction, which will reinitialize the weights with randoms again and reconstruct the NN model. How is this right? What am I missing?
def neural_network_model(data):
    hidden_1_layer = {'weights': tf.Variable(tf.random_normal([784, n_nodes_hl1])),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl1]))}
    hidden_2_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl2]))}
    hidden_3_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl2, n_nodes_hl3])),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl3]))}
    output_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl3, n_classes])),
                    'biases': tf.Variable(tf.random_normal([n_classes]))}

    l1 = tf.add(tf.matmul(data, hidden_1_layer['weights']), hidden_1_layer['biases'])
    l1 = tf.nn.relu(l1)
    l2 = tf.add(tf.matmul(l1, hidden_2_layer['weights']), hidden_2_layer['biases'])
    l2 = tf.nn.relu(l2)
    l3 = tf.add(tf.matmul(l2, hidden_3_layer['weights']), hidden_3_layer['biases'])
    l3 = tf.nn.relu(l3)
    output = tf.matmul(l3, output_layer['weights']) + output_layer['biases']
    return output

def train_neural_network(x):
    prediction = neural_network_model(x)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(prediction, y))
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    hm_epochs = 10
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(hm_epochs):
            epoch_loss = 0
            for _ in range(int(mnist.train.num_examples/batch_size)):
                epoch_x, epoch_y = mnist.train.next_batch(batch_size)
                _, c = sess.run([optimizer, cost], feed_dict={x: epoch_x, y: epoch_y})
                epoch_loss += c
            print('Epoch', epoch, 'completed out of', hm_epochs, 'loss:', epoch_loss)

        correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
        accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
        print('Accuracy:', accuracy.eval({x: mnist.test.images, y: mnist.test.labels}))

train_neural_network(x)
You almost got it right. The accuracy tensor indirectly depends on the prediction tensor, which in turn depends on a tensor x. In your code snippet you did not include what x actually is; however, from the linked tutorial:
x = tf.placeholder('float', [None, 784])
y = tf.placeholder('float')
So x is a placeholder, i.e. a tensor that obtains its value directly from the user. From the last line,
train_neural_network(x)
it might look like he is calling a transformation function train_neural_network(x) that takes an x and processes it on the fly, like you would expect from a regular function. In fact the function only uses a reference to the previously defined placeholder variables - dummies, really - in order to define a computation graph, which it then directly executes using a session.
The graph, however, is only constructed once using neural_network_model(x) and then queried for a given number of epochs.
What you missed is this:
_, c = sess.run([optimizer, cost], feed_dict={x: epoch_x, y: epoch_y})
This queries the result of the optimizer operation and the cost tensor given that the input values are epoch_x for x and epoch_y for y, pulling data through all defined computation nodes, all the way back "down" to x. In order to obtain the cost, y is needed as well. Both are provided by the caller. The AdamOptimizer will update all trainable variables as part of its execution, changing the network's weights.
After that,
accuracy.eval({x: mnist.test.images, y: mnist.test.labels})
or, equivalently
sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels})
then issues another evaluation of the same graph - without changing it - but this time using the inputs mnist.test.images for x and mnist.test.labels for y.
It works because prediction itself depends on x, which is overridden to the user-provided values on each call to sess.run(...).
Here's what the relevant part of the network's graph looks like; I exported this using TensorBoard. It's a bit hard to read since the nodes are not manually labeled (except for a couple I labeled myself), but there are six relevant points here. Placeholders are yellow: on the bottom right you'll find x, and y is on the center left.
Green are the intermediate values that make sense to us: On the left is the prediction tensor, on the right there's the tensor called correct. The blue parts are endpoints of the graph: On the top left there's the cost tensor and on the top right you'll find accuracy. In essence, data flows from the bottom to the top.
So, whenever you say "evaluate prediction given x", "evaluate accuracy given x and y" or "optimize my network given x and y", you really just provide values on the yellow ends and observe the outcome on the green or blue ones.
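To make the flow concrete, here is a stripped-down sketch of the pattern the tutorial follows, reusing its names (epoch_x/epoch_y stand for one training batch). The graph and its variables are created exactly once; only the feed_dict changes between training and evaluation:
prediction = neural_network_model(x)    # variables created once, here
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
optimizer = tf.train.AdamOptimizer().minimize(cost)
correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct, 'float'))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())   # weights randomized once, here
    # Training: feed a batch through the placeholders; Adam mutates the variables
    sess.run([optimizer, cost], feed_dict={x: epoch_x, y: epoch_y})
    # Evaluation: same graph, same (now trained) variables, different feed
    sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels})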

Tensorflow - loss starts high and does not decrease

I started writing neural networks with TensorFlow, and there is one problem I seem to face in each of my example projects.
My loss always starts at something like 50 or higher and does not decrease, or if it does, it does so slowly that after all my epochs I do not even get near an acceptable loss.
Things I already tried (none of which affected the result very much):
- tested for overfitting, but in the following example you can see that I have 15000 training and 15000 test datasets and something like 900 neurons
- tested different optimizers and optimizer values
- tried increasing the training data by using the test data as training data as well
- tried increasing and decreasing the batch size
I built the network based on the knowledge from https://youtu.be/vq2nnJ4g6N0
But let's have a look at one of my test projects:
I have a list of names and wanted to guess the gender, so my raw data looks like this:
names=["Maria","Paul","Emilia",...]
genders=["f","m","f",...]
To feed it into the network, I transform the names into arrays of char codes (padded to a max length of 30) and the genders into one-hot arrays:
names=[[77.,97. ,114.,105.,97. ,0. ,0.,...]
[80.,97. ,117.,108.,0. ,0. ,0.,...]
[69.,109.,105.,108.,105.,97.,0.,...]]
genders=[[1.,0.]
[0.,1.]
[1.,0.]]
I built the network with 3 hidden layers [30,20],[20,10],[10,10] and [10,2] for the output layer. All hidden layers have a ReLU as activation function. The output layer has a softmax.
# Input Layer
x = tf.placeholder(tf.float32, shape=[None, 30])
y_ = tf.placeholder(tf.float32, shape=[None, 2])
# Hidden Layers
# H1
W1 = tf.Variable(tf.truncated_normal([30, 20], stddev=0.1))
b1 = tf.Variable(tf.zeros([20]))
y1 = tf.nn.relu(tf.matmul(x, W1) + b1)
# H2
W2 = tf.Variable(tf.truncated_normal([20, 10], stddev=0.1))
b2 = tf.Variable(tf.zeros([10]))
y2 = tf.nn.relu(tf.matmul(y1, W2) + b2)
# H3
W3 = tf.Variable(tf.truncated_normal([10, 10], stddev=0.1))
b3 = tf.Variable(tf.zeros([10]))
y3 = tf.nn.relu(tf.matmul(y2, W3) + b3)
# Output Layer
W = tf.Variable(tf.truncated_normal([10, 2], stddev=0.1))
b = tf.Variable(tf.zeros([2]))
y = tf.nn.softmax(tf.matmul(y3, W) + b)
Now the calculation for the loss, accuracy and the training operation:
# Loss
cross_entropy = -tf.reduce_sum(y_*tf.log(y))
# Accuracy
is_correct = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))
# Training
train_operation = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)
I train the network in batches of 100
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(150):
    bs = 100
    index = i*bs
    inputBatch = inputData[index:index+bs]
    outputBatch = outputData[index:index+bs]
    sess.run(train_operation, feed_dict={x: inputBatch, y_: outputBatch})
    accuracyTrain, lossTrain = sess.run([accuracy, cross_entropy], feed_dict={x: inputBatch, y_: outputBatch})
    if i%(bs/10) == 0:
        print("step %d loss %.2f accuracy %.2f" % (i, lossTrain, accuracyTrain))
And I get the following result:
step 0 loss 68.96 accuracy 0.55
step 10 loss 69.32 accuracy 0.50
step 20 loss 69.31 accuracy 0.50
step 30 loss 69.31 accuracy 0.50
step 40 loss 69.29 accuracy 0.51
step 50 loss 69.90 accuracy 0.53
step 60 loss 68.92 accuracy 0.55
step 70 loss 68.99 accuracy 0.55
step 80 loss 69.49 accuracy 0.49
step 90 loss 69.25 accuracy 0.52
step 100 loss 69.39 accuracy 0.49
step 110 loss 69.32 accuracy 0.47
step 120 loss 67.17 accuracy 0.61
step 130 loss 69.34 accuracy 0.50
step 140 loss 69.33 accuracy 0.47
What am I doing wrong?
Why does it start at ~69 in my project and not lower?
Thank you very much, guys!
There's nothing wrong with 0.69 nats of entropy per sample as a starting point for a binary classification.
If you convert to base 2, 0.69/log(2), you'll see that it's almost exactly 1 bit per sample which is exactly what you would expect if you're unsure about a binary classification.
I usually use the mean loss instead of the sum so things are less sensitive to batch size.
You should also not compute the cross entropy directly yourself, because that formulation breaks easily (for example, log(0) when the softmax saturates). You probably want tf.nn.sigmoid_cross_entropy_with_logits.
I also like starting with the Adam Optimizer instead of pure gradient descent.
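Applied to the code in the question (reusing x, y3, W, b and y_), a minimal sketch of those suggestions could look like the following. Since the labels here are two-element one-hot vectors fed through a softmax, the softmax variant of the built-in cross entropy is the direct drop-in (sigmoid_cross_entropy_with_logits would be the choice with a single 0/1 label column); the 0.001 Adam learning rate is only an assumed starting point:
logits = tf.matmul(y3, W) + b
y = tf.nn.softmax(logits)   # keep only if you still want probabilities for inspection

# Mean cross entropy computed from logits: numerically stable and batch-size independent
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))

train_operation = tf.train.AdamOptimizer(0.001).minimize(cross_entropy)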
Here are two reasons you might be having some trouble with this problem:
1) Character codes are ordered, but the order doesn't mean anything. Your inputs would be easier for the network to digest if they were encoded as one-hot vectors, so your input would be a 26x30 = 780 element vector (see the sketch after these points). Without that, the network has to waste a bunch of capacity learning the boundaries between letters.
2) You've only got fully connected layers. This makes it impossible for the network to learn a fact independent of its absolute position in the name. 6 of the top 10 girls' names in 2015 ended in 'a', while 0 of the top 10 boys' names did. As currently written, your network needs to re-learn "usually it's a girl's name if it ends in 'a'" independently for each name length. Using some convolution layers would allow it to learn facts once across all name lengths.
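As an illustration of the first point, here is one possible encoding written as a small NumPy helper. The helper name, the lowercasing, and the extra padding slot (which makes each vector 30 x 27 = 810 elements rather than 780) are my own assumptions, not part of the answer:
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"   # 26 letters; slot 0 is reserved for padding
MAX_LEN = 30

def name_to_one_hot(name, max_len=MAX_LEN):
    """Encode a name as a flat one-hot vector of shape (max_len * 27,)."""
    vec = np.zeros((max_len, len(ALPHABET) + 1), dtype=np.float32)
    vec[:, 0] = 1.0   # start with every position marked as padding
    for i, ch in enumerate(name.lower()[:max_len]):
        idx = ALPHABET.find(ch)
        if idx >= 0:
            vec[i, 0] = 0.0
            vec[i, idx + 1] = 1.0
    return vec.reshape(-1)

print(name_to_one_hot("Maria").shape)   # (810,)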