Tensorflow: my rnn always output same value, weights of rnn are not trained - tensorflow

I used tensorflow to implement a simple RNN model to learn possible trends of time series data and predict future values. However, the model always produces same values after training. Actually, the best model it got is:
y = b.
The RNN structure is:
InputLayer -> BasicRNNCell -> Dense -> OutputLayer
RNN code:
def RNN(n_timesteps, n_input, n_output, n_units):
tf.reset_default_graph()
X = tf.placeholder(dtype=tf.float32, shape=[None, n_timesteps, n_input])
cells = [tf.contrib.rnn.BasicRNNCell(num_units=n_units)]
stacked_rnn = tf.contrib.rnn.MultiRNNCell(cells)
stacked_output, states = tf.nn.dynamic_rnn(stacked_rnn, X, dtype=tf.float32)
stacked_output = tf.layers.dense(stacked_output, n_output)
return X, stacked_output
while in training, n_timesteps=1, n_input=1, n_output=1, n_units=2, learning_rate=0.0000001. And loss is calculated by mean squared error.
Input is a sequence of data in continuous days. Output is the data after the days of input.
(Maybe these are not good settings. But no matter how I change them, the results are almost same. So I just set these to help show them later.)
And I found out this is because weights and bias of BasicRNNCell are not trained. They keep same from beginning. And only the weights and bias of Dense keep changing. So in training, I got a prediction like these:
In the beginning:
loss: 1433683500.0
rnn/multi_rnn_cell/cell_0/cell0/kernel:0 [KEEP UNCHANGED]
rnn/multi_rnn_cell/cell_0/cell0/bias:0 [KEEP UNCHANGED]
dense/kernel:0 [CHANGING]
dense/bias:0 [CHANGING]
After a while:
loss: 175372340.0
rnn/multi_rnn_cell/cell_0/cell0/kernel:0 [KEEP UNCHANGED]
rnn/multi_rnn_cell/cell_0/cell0/bias:0 [KEEP UNCHANGED]
dense/kernel:0 [CHANGING]
dense/bias:0 [CHANGING]
The orange line indicates the true data, the blue line indicates results of my code. Through training, the blue line will keep going up until model gets a stable loss.
So I doubt whether I did a wrong implementation, so I generate a group of data with y = 10x + 5 for testing. This time, My model learns the correct results.
In the beginning:
In the end:
I have tried:
add more layers of both BasicRNNCell and Dense
increase rnn cell hidden num(n_units) to 128
decrease learning_rate to 1e-10
increase timesteps to 60
They all dont work.
So, my questions are:
Is it because my model is too simple? But I think the trend of my data is not so complicated to learn. At least something like y = ax + b will produce a smaller loss than y = b.
What may lead to these results?
Or how should I go on debugging?
And now, I double maybe BasicRNNCell is not fully realized, users should implement some functions of it? I have no experience with tensorflow before.

It seems your net is just not fit for that kind of data, or from another point of view, your data is badly scaled. When adding the 4 lines below after the split_data, I get some sort of learning behavior, similar to the one with the a*x+b case
data = read_data(work_dir, input_file)
plot_data(data)
input_data, output_data, n_batches = split_data(data, n_timesteps, n_input, n_output)
# scale input and output data
input_data = input_data-input_data[0]
input_data = input_data/np.max(input_data)*1000
output_data = output_data-output_data[0]
output_data = output_data/np.max(output_data)*1000

Related

tf.keras.layers.BatchNormalization with trainable=False appears to not update its internal moving mean and variance

I am trying to find out, how exactly does BatchNormalization layer behave in TensorFlow. I came up with the following piece of code which to the best of my knowledge should be a perfectly valid keras model, however the mean and variance of BatchNormalization doesn't appear to be updated.
From docs https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization
in the case of the BatchNormalization layer, setting trainable = False on the layer means that the layer will be subsequently run in inference mode (meaning that it will use the moving mean and the moving variance to normalize the current batch, rather than using the mean and variance of the current batch).
I expect the model to return a different value with each subsequent predict call.
What I see, however, are the exact same values returned 10 times.
Can anyone explain to me why does the BatchNormalization layer not update its internal values?
import tensorflow as tf
import numpy as np
if __name__ == '__main__':
np.random.seed(1)
x = np.random.randn(3, 5) * 5 + 0.3
bn = tf.keras.layers.BatchNormalization(trainable=False, epsilon=1e-9)
z = input = tf.keras.layers.Input([5])
z = bn(z)
model = tf.keras.Model(inputs=input, outputs=z)
for i in range(10):
print(x)
print(model.predict(x))
print()
I use TensorFlow 2.1.0
Okay, I found the mistake in my assumptions. The moving average is being updated during training not during inference as I thought. This makes perfect sense, as updating the moving averages during inference would likely result in an unstable production model (for example a long sequence of highly pathological input samples [e.g. such that their generating distribution differs drastically from the one on which the network was trained] could potentially bias the network and result in worse performance on valid input samples).
The trainable parameter is useful when you're fine-tuning a pretrained model and want to freeze some of the layers of the network even during training. Because when you call model.predict(x) (or even model(x) or model(x, training=False)), the layer automatically uses the moving averages instead of batch averages.
The code below demonstrates this clearly
import tensorflow as tf
import numpy as np
if __name__ == '__main__':
np.random.seed(1)
x = np.random.randn(10, 5) * 5 + 0.3
z = input = tf.keras.layers.Input([5])
z = tf.keras.layers.BatchNormalization(trainable=True, epsilon=1e-9, momentum=0.99)(z)
model = tf.keras.Model(inputs=input, outputs=z)
# a dummy loss function
model.compile(loss=lambda x, y: (x - y) ** 2)
# a dummy fit just to update the batchnorm moving averages
model.fit(x, x, batch_size=3, epochs=10)
# first predict uses the moving averages from training
pred = model(x).numpy()
print(pred.mean(axis=0))
print(pred.var(axis=0))
print()
# outputs the same thing as previous predict
pred = model(x).numpy()
print(pred.mean(axis=0))
print(pred.var(axis=0))
print()
# here calling the model with training=True results in update of moving averages
# furthermore, it uses the batch mean and variance as in training,
# so the result is very different
pred = model(x, training=True).numpy()
print(pred.mean(axis=0))
print(pred.var(axis=0))
print()
# here we see again that the moving averages are used but they differ slightly after
# the previous call, as expected
pred = model(x).numpy()
print(pred.mean(axis=0))
print(pred.var(axis=0))
print()
In the end, I found that the documentation (https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization) mentions this:
When performing inference using a model containing batch normalization, it is generally (though not always) desirable to use accumulated statistics rather than mini-batch statistics. This is accomplished by passing training=False when calling the model, or using model.predict.
Hopefully this will help someone with similar misunderstanding in the future.

How to adjust Model for rare binary outcome with Tensorflow or GBM

I'm currently working on data with rare binary outcome, i.e. the response vector contains mostly 0 and only a few 1 (approximately 1.5% ones). I've got about 20 continuous explanatory variables. I tried to train models using GBM, Random Forests, TensorFlow with Keras backend.
I observed a special behavior of the models, regardless which method I used:
The accuracy is high (~98%) but the model predicts probabilities for class "0" for all outcomes as ~98.5% and for class "1" ~1,5%.
How can I prevent this behavior?
I'm using RStudio. For Example a TF model with Keras would be:
model <- keras_model_sequential()
model %>%
layer_dense(units = 256, activation = "relu", input_shape = c(20)) %>%
layer_dense(units = 256, activation = "relu") %>%
layer_dense(units = 2, activation = "sigmoid")
parallel_model <- multi_gpu_model(model, gpus=2)
parallel_model %>% compile(
optimizer = "adam",
loss = "binary_crossentropy",
metrics = "binary_accuracy")
histroy <- parallel_model %>% fit(
x_train, y_train,
batch_size = 64,
epochs = 100,
class_weight = list("0"=1,"1"=70),
verbose = 1,
validation_split = 0.2
)
But my observation is not limited to TF. This makes my question more general. I'm not asking for specific adjustments for the model above, rather I'd like to discuss at what point all outcomes are assigned the same probability.
I can guess, the issue is connected to the loss-function.
I know there is no way to use AUC as loss functions, since it's not differentiable. If I test models with AUC with unknown data, the result is not better than random guessing.
I don't mind answers with code in Python, since this isn't a problem about coding rather than general behavior and algorithms.
When your problem has unbalanced classes, I suggest using SMOTE (on the training data only!!! never use smote on your testing data!!!) before training the model.
For example:
from imblearn.over_sampling import SMOTE
X_trn_balanced, Y_trn_balanced = SMOTE(random_state=1, ratio=1).fit_sample(X_trn, Y_trn)
#next fit the model with the balanced data
model.fit(X_trn_balanced, Y_trn_balanced )
In my (not so big) experience with AUC problems and rare positives, I see models with one class (not two). It's either "result is positive (1)" or "result is negative (0)".
Metrics like accuracy are useless for these problems, you should use AUC based metrics with big batch sizes.
For these problems, it doesn't matter whether the outcome probabilities are too little, as long as there is a difference between them. (Forests, GBM, etc. will indeed output these little values, but this is not a problem)
For neural networks, you can try to use class weights to increase the output probabilities. But notice that if you split the result in two separate classes (considering only one class should be positive), it doesn't matter if you use weights, because:
For the first class, low weights: predict all ones is good
For the second class, high weights: predict all zeros is good (weighted to very good)
So, as an initial solution, you can:
Use a 'softmax' activation (to guarantee your model will have only one correct output) and a 'categorical_crossentropy' loss.
(Or, preferrably) Use a model with only one class and keep 'sigmoid' with 'binary_crossentropy'.
I always work with the preferrable option above. In this case, if you use batch sizes that are big enough to contain one or two positive examples (batch size around 100 for you), weights may even be discarded. If the batch sizes are too little and many batches don't contain positive results, you may have too many weight updates towards plain zeros, which is bad.
You may also resample your data and, for instance, multiply by 10 the number of positive examples, so your batches contain more positives and training becomes easier.
Example of AUC metric to determine when training should end:
#in python - considering outputs with only one class
def aucMetric(true, pred):
true= K.flatten(true)
pred = K.flatten(pred)
totalCount = K.shape(true)[0]
values, indices = tf.nn.top_k(pred, k = totalCount)
sortedTrue = K.gather(true, indices)
tpCurve = K.cumsum(sortedTrue)
negatives = 1 - sortedTrue
auc = K.sum(tpCurve * negatives)
totalCount = K.cast(totalCount, K.floatx())
positiveCount = K.sum(true)
negativeCount = totalCount - positiveCount
totalArea = positiveCount * negativeCount
return auc / totalArea

How can I improve my LSTM accuracy in Tensorflow

I'm trying to figure out how to decrease the error in my LSTM. It's an odd use-case because rather than classifying, we are taking in short lists (up to 32 elements long) and outputting a series of real numbers, ranging from -1 to 1 - representing angles. Essentially, we want to reconstruct short protein loops from amino acid inputs.
In the past we had redundant data in our datasets, so the accuracy reported was incorrect. Since removing the redundant data our validation accuracy has gotten much worse, which suggests our network had learned to memorise the most frequent examples.
Our dataset is 10,000 items, split 70/20/10 between train, validation and test. We use a bi-directional, LSTM as follows:
x = tf.cast(tf_train_dataset, dtype=tf.float32)
output_size = FLAGS.max_cdr_length * 4
dmask = tf.placeholder(tf.float32, [None, output_size], name="dmask")
keep_prob = tf.placeholder(tf.float32, name="keepprob")
sizes = [FLAGS.lstm_size,int(math.floor(FLAGS.lstm_size/2)),int(math.floor(FLAGS.lstm_size/ 4))]
single_rnn_cell_fw = tf.contrib.rnn.MultiRNNCell( [lstm_cell(sizes[i], keep_prob, "cell_fw" + str(i)) for i in range(len(sizes))])
single_rnn_cell_bw = tf.contrib.rnn.MultiRNNCell( [lstm_cell(sizes[i], keep_prob, "cell_bw" + str(i)) for i in range(len(sizes))])
length = create_length(x)
initial_state = single_rnn_cell_fw.zero_state(FLAGS.batch_size, dtype=tf.float32)
initial_state = single_rnn_cell_bw.zero_state(FLAGS.batch_size, dtype=tf.float32)
outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw=single_rnn_cell_fw, cell_bw=single_rnn_cell_bw, inputs=x, dtype=tf.float32, sequence_length = length)
output_fw, output_bw = outputs
states_fw, states_bw = states
output_fw = last_relevant(FLAGS, output_fw, length, "last_fw")
output_bw = last_relevant(FLAGS, output_bw, length, "last_bw")
output = tf.concat((output_fw, output_bw), axis=1, name='bidirectional_concat_outputs')
test = tf.placeholder(tf.float32, [None, output_size], name="train_test")
W_o = weight_variable([sizes[-1]*2, output_size], "weight_output")
b_o = bias_variable([output_size],"bias_output")
y_conv = tf.tanh( ( tf.matmul(output, W_o)) * dmask, name="output")
Essentially, we use 3 layers of LSTM, with 256, 128 and 64 units each. We take the last step of both the Forward and Backward passes and concatenate them together. These feed into a final, fully connected layer that presents the data in the way we need it. We use a mask to set these steps we don't need to zero.
Our cost function uses a mask again, and takes the mean of the squared difference. We build the mask from the test data. Values to ignore are set to -3.0.
def cost(goutput, gtest, gweights, FLAGS):
mask = tf.sign(tf.add(gtest,3.0))
basic_error = tf.square(gtest-goutput) * mask
basic_error = tf.reduce_sum(basic_error)
basic_error /= tf.reduce_sum(mask)
return basic_error
To train the net I've used a variety of optimizers. The lowest scores have been obtained with the AdamOptimizer. The others, such as Adagrad, Adadelta, RMSProp tend to flatline around 0.3/0.4 error which is not particularly great.
Our learning rate is 0.004, batch size of 200. We use a 0.5 probability dropout layer.
I've tried adding more layers, changing learning rates, batch sizes, even the representation of the data. I've attempted batch regularisation, L1 and L2 weight regularisation (though perhaps incorrectly) and I've even considered switching to a convnet approach instead.
Nothing seems to make any difference. What has seemed to work is changing the optimizer. Adam seems noisier as it improves, but it does get closer than the other optimizers.
We need to get down to a value much closer to 0.05 or 0.01. Sometimes the training error touches 0.09 but the validation doesn't follow. I've run this network for about 500 epochs so far (about 8 hours) and it tends to settle around 0.2 validation error.
I'm not quite sure what to attempt next. Decayed learning rate might help but I suspect there is something more fundamental I need to do. It could be something as simple as a bug in the code - I need to double check the masking,

Strange sequence classification performance after shuffling sequence elements

I have one million sequences I'm trying to classify as either 0 or 1. The outcome is fairly well balanced (class 0:70%, class 1:30%). Maximum sequence length is 50, and I've post-padded by sequences with zeroes. There are 100 unique sequence symbols. Embedding length is 30. It's an LSTM NN trained on two outputs (one is the main output node, and the other is right after the LSTM). The code is below.
As a sanity check, I ran three versions of this: One in which I randomize the outcome labels (I expect terrible performance), another one where the labels are correct but I randomize the sequence of events in each sequence but the outcome labels are correct (I also expected bad performance), and finally one where everything is left unshuffled (I expected good performance).
Instead I found the following:
Shuffled labels: Accuracy = 69.5% (Model predicts every sequence is class 0)
Shuffled sequence symbols: Accuracy = 88%!
Nothing is shuffled: Accuracy = 90%
What do you make of this? All I can think of is that there is little signal to be gained from analyzing the sequences, and maybe most of the signal is from the presence or lack of presence of symbols in the sequence. Maybe RNNs and LSTMs are overkill here?
# Input 1: event type sequences
# Take the event integer sequences, run them through an embedding layer to get float vectors, then run through LSTM
main_input = Input(shape =(max_seq_length,), dtype = 'int32', name = 'main_input')
x = Embedding(output_dim = embedding_length, input_dim = num_unique_event_symbols, input_length = max_seq_length, mask_zero=True)(main_input)
lstm_out = LSTM(32)(x)
# Auxiliary loss here from first input
auxiliary_output = Dense(1, activation='sigmoid', name='aux_output')(lstm_out)
# An abitrary number of dense, hidden layers here
x = Dense(64, activation='relu')(lstm_out)
# The main output node
main_output = Dense(1, activation='sigmoid', name='main_output')(x)
## Compile and fit the model
model = Model(inputs=[main_input], outputs=[main_output, auxiliary_output])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'], loss_weights=[1., 0.2])
print(model.summary())
np.random.seed(21)
model.fit([train_X1], [train_Y, train_Y], epochs=1, batch_size=200)
Assuming you've played around with the size of the LSTM, your conclusion seems reasonable. Beyond that, it's hard to say as it depends what the dataset is. For example, it could be that shorter sequences are more unpredictable, and if most of your sequences are short, then this would support the conclusion as well.
It's worth it to also try truncating your sequences in length, to say the first 25 entries.

TensfoFlow: Linear Regression loss increasing (instead decreasing) with successive epochs

I'm learning TensorFlow and trying to apply it on a simple linear regression problem. data is numpy.ndarray of shape [42x2].
I'm a bit puzzled why after each succesive epoch the loss is increasing. Isn't the loss expected to to go down with each successive epoch!
Here is my code (let me know, if you'd like me to share the output as well!): (Thanks a lot for taking your time to answer to it.)
1) created the placeholders for dependent / independent variables
X = tf.placeholder(tf.float32, name='X')
Y = tf.placeholder(tf.float32,name='Y')
2) created vars for weight, bias, total_loss (after each epoch)
w = tf.Variable(0.0,name='weights')
b = tf.Variable(0.0,name='bias')
3) defined loss function & optimizer
Y_pred = X * w + b
loss = tf.reduce_sum(tf.square(Y - Y_pred), name = 'loss')
optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.001).minimize(loss)
4) created summary events & event file writer
tf.summary.scalar(name = 'weight', tensor = w)
tf.summary.scalar(name = 'bias', tensor = b)
tf.summary.scalar(name = 'loss', tensor = loss)
merged = tf.summary.merge_all()
evt_file = tf.summary.FileWriter('def_g')
evt_file.add_graph(tf.get_default_graph())
5) and execute all in a session
with tf.Session() as sess1:
sess1.run(tf.variables_initializer(tf.global_variables()))
for epoch in range(10):
summary, _,l = sess1.run([merged,optimizer,loss],feed_dict={X:data[:,0],Y:data[:,1]})
evt_file.add_summary(summary,epoch+1)
evt_file.flush()
print(" new_loss: {}".format(sess1.run(loss,feed_dict={X:data[:,0],Y:data[:,1]})))
Cheers!
The short answer is that your learning rate is too big. I was able to get reasonable results by changing it from 0.001 to 0.0001, but I only used the 23 points from your second-last comment (I initially didn't notice your last comment), so using all the data might require an even lower number.
0.001 seems like a really low learning rate. However, the real problem is that your loss function is using reduce_sum instead of reduce_mean. This causes your loss to be a large number, which sends a very strong signal to the GradientDescentOptimizer, so it's overshooting despite the low learning rate. The problem would only get worse if you added more points to your training data. So use reduce_mean to get the average squared error and your algorithms will be much better behaved.