NO convergence in bidirectional-lstm of Tensorflow - tensorflow

I am training a bi-directional LSTM network, but when I train it, I got this as below:
"
Iter 3456, Minibatch Loss= 10.305597, Training Accuracy= 0.25000
Iter 3840, Minibatch Loss= 22.018646, Training Accuracy= 0.00000
Iter 4224, Minibatch Loss= 34.245750, Training Accuracy= 0.00000
Iter 4608, Minibatch Loss= 13.833059, Training Accuracy= 0.12500
Iter 4992, Minibatch Loss= 19.687658, Training Accuracy= 0.00000
"
Even iteration is 50 0000, loss and accuracy is nearly the same. My setting is below:
# Parameters
learning_rate = 0.0001
training_iters = 20000#120000
batch_size = tf.placeholder(dtype=tf.int32)#24,128
display_step = 16#10
test_num = 275#245
keep_prob = tf.placeholder("float") #probability for dropout
kp = 1.0
# Network Parameters
n_input = 240*160 #28 # MNIST data input (img shape: 28*28)
n_steps = 16 #28 # timesteps
n_hidden = 500 # hidden layer num of features
n_classes = 20
Is this the problem of techniques or schemes?

The first thing I would try is varying the learning rate to see if you can get the loss to decrease.
It may also be helpful to compare the accuracy with some baselines (e.g. are you better than predicting the most frequent class in a classification problem).
If your loss is not decreasing at all for a wide range of learning rates I would start looking for bugs in the code (e.g. is the training op that updates weights actually run, do features and labels match, is your data randomized properly, ...).
If there is a problem with the technique (bidirection LSTM) depends on what task you are trying to accomplish. If you are actually applying this to MNIST (based on the comment in your code) then I would rather recommend some convolutional and maxpooling layers than an RNN.

Related

Different loss values computed by train_on_batch and evaluate

Evaluation before batch training, training on batch and post training returns different loss values.
pre_train_loss = model.evaluate(batch_x, batch_y, verbose=0)
train_loss = model.train_on_batch(batch_x, batch_y)
post_train_loss = model.evaluate(batch_x, batch_y, verbose=0)
Pre batch train loss : 2.3195652961730957
train_on_batch loss : 2.3300909996032715
Post batch train loss : 2.2722578048706055
I assumed train_on_batch returns loss computed before parameters update (before backpropagation). But pre_train_loss and train_loss are not exacly the same. Moreover all loss values are different.
Is my assumption of train_on_batch right? If so, why all loss values are different?
Colab example
Let me give you a detailed explanation of what is going on.
Calling model.evaluate (or model.test_on_batch) will invoke the model.make_test_function which will invoke the model.test_step and this function does following:
y_pred = self(x, training=False)
# Updates stateful loss metrics.
self.compiled_loss(
y, y_pred, sample_weight, regularization_losses=self.losses)
Calling model.train_on_batch will invoke the model.make_train_function which will invoke the model.train_step and this function does following:
with backprop.GradientTape() as tape:
y_pred = self(x, training=True)
loss = self.compiled_loss(
y, y_pred, sample_weight, regularization_losses=self.losses)
As you can see from above source code, the only difference between model.test_step and model.train_step when compute the loss is whether training=True when forward pass data to model.
Because some neural network layers behave differently during training and inference (e.g Dropout and BatchNormalization layers), so we have training argument for let those layer know which of the two "paths" it should take, e.g:
During training, dropout will randomly drop out units and correspondingly scale up activations of the remaining units.
During inference, it does nothing (since you usually don't want the randomness of dropping out units here).
Since you have dropout layer in your model, so the loss increase in training mode is expected.
If you remove the line layers.Dropout(0.5), when define the model you will see the loss is nearly identical (i.e with little floating point precision mismatch), e.g outputs of three epoch:
Epoch: 1
Pre batch train loss : 1.6852061748504639
train_on_batch loss : 1.6852061748504639
Post batch train loss : 1.6012675762176514
Pre batch train loss : 1.7325702905654907
train_on_batch loss : 1.7325704097747803
Post batch train loss : 1.6512296199798584
Epoch: 2
Pre batch train loss : 1.5149778127670288
train_on_batch loss : 1.5149779319763184
Post batch train loss : 1.4209072589874268
Pre batch train loss : 1.567994475364685
train_on_batch loss : 1.5679945945739746
Post batch train loss : 1.4767804145812988
Epoch: 3
Pre batch train loss : 1.3269715309143066
train_on_batch loss : 1.3269715309143066
Post batch train loss : 1.2274967432022095
Pre batch train loss : 1.3868262767791748
train_on_batch loss : 1.3868262767791748
Post batch train loss : 1.2916004657745361
Reference:
Documents and source code link of tf.keras.Model
What does training=True mean when calling a TensorFlow Keras model?

Decreased training speed on multi-GPU machine with dynamic RNN of Tensorflow

I have two machines available on which I can train models built with Tensorflow: a local desktop machine with one GPU (called "local" in the following) and a remote cluster with 4 GPUs (called "cluster" in the following). Even though the cluster has 4 GPUs I'm only using one GPU at a time (e.g. through CUDA_VISIBLE_DEVICES=2 python script.py). My problem is that training the exact same model on the cluster is considerably slower than on my local machine, even though the cluster has more powerful GPUs. I realize that this question might be very localized and it is difficult to find out why this is the case, however I am at loss as to what causes this behavior. In the following I try to give as many details about the configuration of both machines and the model I'm building.
Model
The model is a simple toy RNN taken from this github project. The model definition is as follows:
# Parameters
learning_rate = 0.01
training_steps = 600
batch_size = 128
display_step = 200
# Network Parameters
seq_max_len = 20 # Sequence max length
n_hidden = 64 # hidden layer num of features
n_classes = 2 # linear sequence or not
# tf Graph input
x = tf.placeholder("float", [None, seq_max_len, 1])
y = tf.placeholder("float", [None, n_classes])
# A placeholder for indicating each sequence length
seqlen = tf.placeholder(tf.int32, [None])
# Define weights
weights = {
'out': tf.Variable(tf.random_normal([n_hidden, n_classes]))
}
biases = {
'out': tf.Variable(tf.random_normal([n_classes]))
}
def dynamicRNN(x, seqlen, weights, biases):
# Prepare data shape to match `rnn` function requirements
# Current data input shape: (batch_size, n_steps, n_input)
# Required shape: 'n_steps' tensors list of shape (batch_size, n_input)
with tf.device('gpu:0'):
# Unstack to get a list of 'n_steps' tensors of shape (batch_size, n_input)
x = tf.unstack(x, seq_max_len, 1)
# Define a lstm cell with tensorflow
lstm_cell = tf.contrib.rnn.BasicLSTMCell(n_hidden)
# Get lstm cell output, providing 'sequence_length' will perform dynamic
# calculation.
outputs, states = tf.contrib.rnn.static_rnn(lstm_cell, x, dtype=tf.float32,
sequence_length=seqlen)
# When performing dynamic calculation, we must retrieve the last
# dynamically computed output, i.e., if a sequence length is 10, we need
# to retrieve the 10th output.
# However TensorFlow doesn't support advanced indexing yet, so we build
# a custom op that for each sample in batch size, get its length and
# get the corresponding relevant output.
# 'outputs' is a list of output at every timestep, we pack them in a Tensor
# and change back dimension to [batch_size, n_step, n_input]
outputs = tf.stack(outputs)
outputs = tf.transpose(outputs, [1, 0, 2])
# Hack to build the indexing and retrieve the right output.
batch_size = tf.shape(outputs)[0]
# Start indices for each sample
index = tf.range(0, batch_size)*seq_max_len+(seqlen-1)
# Indexing
outputs = tf.gather(tf.reshape(outputs, [-1, n_hidden]), index)
# Linear activation, using outputs computed above
return tf.matmul(outputs, weights['out'])+biases['out']
pred = dynamicRNN(x, seqlen, weights, biases)
# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)
# Evaluate model
correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
The complete (runnable) python script can be found here: https://pastebin.com/LnKmCiSy
Configuration on local
Tensorflow: v1.3 (pre-compiled version installed)
CUDA: v8.0.61
cuDNN: v6.0.21
GPU: GeForce GTX TITAN X
NVIDIA Driver: 375.82
OS: Ubuntu 16.04, 64-bit
Configuration on cluster
Exactly the same as for local, except:
GPU: GeForce GTX TITAN X Pascal
NVIDIA Driver: 375.66
Performance Measures
Executing the above provided toy script I get the following outputs on local:
Step 128, Minibatch Loss= 0.725320, Training Accuracy= 0.43750, Time: 0.3180224895477295
Step 25600, Minibatch Loss= 0.683126, Training Accuracy= 0.50962, Time: 0.013816356658935547
Step 51200, Minibatch Loss= 0.680907, Training Accuracy= 0.50000, Time: 0.013682842254638672
Step 76800, Minibatch Loss= 0.677346, Training Accuracy= 0.57692, Time: 0.014072895050048828
And the following on the cluster:
Step 128, Minibatch Loss= 1.536499, Training Accuracy= 0.47656, Time: 0.8308820724487305
Step 25600, Minibatch Loss= 0.693901, Training Accuracy= 0.49038, Time: 0.06193065643310547
Step 51200, Minibatch Loss= 0.689709, Training Accuracy= 0.53846, Time: 0.05762457847595215
Step 76800, Minibatch Loss= 0.685955, Training Accuracy= 0.54808, Time: 0.06454324722290039
As you can see, execution times on the cluster are about 4x higher. I tried to profile what is happening on the GPU through the use of the timeline feature. I find it difficult to interpret the output of this feature, but what I found most striking is that there are huge idle gaps on the cluster. For this, see the following images that show a trace of the timeline feature for one call to sess.run (note that the scale of the time axis is not exactly the same in both images, but the difference should still be visible).
Timeline on cluster:
Timeline on local:
Did any of you observe the same behavior? What are possible reasons that could cause this behavior and/or how can if further narrow down the issue?

Step, batch size in Gradient Descent Tensorflow

I'm studying Udacity Deep Learning class and its homework says "Demonstrate an extreme case of overfitting. Restrict your training data to just a few batches."
My question is:
1)
Why does reducing num_steps, num_batches have anything to do with over-fitting? We are not adding any variables nor increasing the size of W.
In below code, num_steps used to be 3001 and num_batches were 128 and the solution is just reducing them to 101 and 3, respectively.
num_steps = 101
num_bacthes = 3
with tf.Session(graph=graph) as session:
tf.initialize_all_variables().run()
print("Initialized")
for step in range(num_steps):
# Pick an offset within the training data, which has been randomized.
# Note: we could use better randomization across epochs.
#offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
offset = step % num_bacthes
# Generate a minibatch.
batch_data = train_dataset[offset:(offset + batch_size), :]
batch_labels = train_labels[offset:(offset + batch_size), :]
# Prepare a dictionary telling the session where to feed the minibatch.
# The key of the dictionary is the placeholder node of the graph to be fed,
# and the value is the numpy array to feed to it.
feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, beta_regul : 1e-3}
_, l, predictions = session.run(
[optimizer, loss, train_prediction], feed_dict=feed_dict)
if (step % 2 == 0):
print("Minibatch loss at step %d: %f" % (step, l))
print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
print("Validation accuracy: %.1f%%" % accuracy(
valid_prediction.eval(), valid_labels))
print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
This code is an excerpt from the solution: https://github.com/rndbrtrnd/udacity-deep-learning/blob/master/3_regularization.ipynb
2) Can someone explain the concept of "offset" in gradient descent? Why do we have to use it?
3) I've experimented with num_steps and found out that if I increase num_steps, the accuracy goes up. Why? How should I interpret num_step with learning rate?
1) It's quite typical to set early stopping conditions when you 're training neural networks in order to prevent overfitting. You're not adding new variables, but using early stopping conditions you're not able to use them intensively and badly, what is more o less equivalent.
2) In this case "offset" are the remaining observations not used in minibatch (rest of the division)
3) Think of "learning rate" as "speed" and "num_steps" as "time". If you run longer, you may drive further... but maybe if you drive faster maybe you could get crashed and not go much further...

Tensorflow dynamic_rnn training loss decreasing, validation loss increasing

I am adding my RNN text classification model. I am using last state to classify text. Dataset is small I am using glove vector for embedding.
def rnn_inputs(FLAGS, input_data):
with tf.variable_scope('rnn_inputs', reuse=True):
W_input = tf.get_variable("W_input", [FLAGS.en_vocab_size, FLAGS.num_hidden_units])
embeddings = tf.nn.embedding_lookup(W_input, input_data)
return embeddings
self.inputs_X = tf.placeholder(tf.int32, shape=[None, None, FLAGS.num_dim_input], name='inputs_X')
self.targets_y = tf.placeholder(tf.float32, shape=[None, None], name='targets_y')
self.dropout = tf.placeholder(tf.float32, name='dropout')
self.seq_leng = tf.placeholder(tf.int32, shape=[None, ], name='seq_leng')
with tf.name_scope("RNNcell"):
stacked_cell = rnn_cell(FLAGS, self.dropout)
with tf.name_scope("Inputs"):
with tf.variable_scope('rnn_inputs'):
W_input = tf.get_variable("W_input", [FLAGS.en_vocab_size, FLAGS.num_hidden_units], initializer=tf.truncated_normal_initializer(stddev=0.1))
inputs = rnn_inputs(FLAGS, self.inputs_X)
#initial_state = stacked_cell.zero_state(FLAGS.batch_size, tf.float32)
with tf.name_scope("DynamicRnn"):
# flat_inputs = tf.reshape(inputs, [FLAGS.batch_size, -1, FLAGS.num_hidden_units])
flat_inputs = tf.transpose(tf.reshape(inputs, [-1, FLAGS.batch_size, FLAGS.num_hidden_units]), perm=[1, 0, 2])
all_outputs, state = tf.nn.dynamic_rnn(cell=stacked_cell, inputs=flat_inputs, sequence_length=self.seq_leng, dtype=tf.float32)
outputs = state[0]
with tf.name_scope("Logits"):
with tf.variable_scope('rnn_softmax'):
W_softmax = tf.get_variable("W_softmax", [FLAGS.num_hidden_units, FLAGS.num_classes])
b_softmax = tf.get_variable("b_softmax", [FLAGS.num_classes])
logits = rnn_softmax(FLAGS, outputs)
probabilities = tf.nn.softmax(logits, name="probabilities")
self.accuracy = tf.equal(tf.argmax(self.targets_y,1), tf.argmax(logits,1))
with tf.name_scope("Loss"):
self.loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=self.targets_y))
with tf.name_scope("Grad"):
self.lr = tf.Variable(0.0, trainable=False)
trainable_vars = tf.trainable_variables()
grads, _ = tf.clip_by_global_norm(tf.gradients(self.loss, trainable_vars), FLAGS.max_gradient_norm)
optimizer = tf.train.AdamOptimizer(self.lr)
self.train_optimizer = optimizer.apply_gradients(zip(grads, trainable_vars))
sampling_outputs = all_outputs[0]
sampling_logits = rnn_softmax(FLAGS, sampling_outputs)
self.sampling_probabilities = tf.nn.softmax(sampling_logits)
Output printed
EPOCH 7 SUMMARY 40 STEP
Training loss 0.439
Training accuracy 0.247
----------------------
Validation loss 0.452
Validation accuracy 0.234
----------------------
Saving the model.
EPOCH 8 SUMMARY 45 STEP
Training loss 0.429
Training accuracy 0.281
----------------------
Validation loss 0.462
Validation accuracy 0.203
----------------------
Saving the model.
EPOCH 9 SUMMARY 50 STEP
Training loss 0.428
Training accuracy 0.268
----------------------
Validation loss 0.465
Validation accuracy 0.188
----------------------
Saving the model.
EPOCH 10 SUMMARY 55 STEP
Training loss 0.424
Training accuracy 0.284
----------------------
Validation loss 0.455
Validation accuracy 0.172
----------------------
Saving the model.
EPOCH 11 SUMMARY 60 STEP
Training loss 0.421
Training accuracy 0.305
----------------------
Validation loss 0.461
Validation accuracy 0.156
----------------------
Saving the model.
EPOCH 12 SUMMARY 65 STEP
Training loss 0.418
Training accuracy 0.299
----------------------
Validation loss 0.462
Validation accuracy 0.141
----------------------
Saving the model.
EPOCH 13 SUMMARY 70 STEP
Training loss 0.416
Training accuracy 0.286
----------------------
Validation loss 0.462
Validation accuracy 0.156
----------------------
Saving the model.
EPOCH 14 SUMMARY 75 STEP
Training loss 0.413
Training accuracy 0.323
----------------------
Validation loss 0.468
Validation accuracy 0.141
----------------------
Saving the model.
After 165 EPOCH
EPOCH 165 SUMMARY 830 STEP
Training loss 0.306
Training accuracy 0.544
----------------------
Validation loss 0.547
Validation accuracy 0.109
----------------------
Saving the model.
If training loss goes down, but validation loss goes up, it is likely that you are running into the problem of overfitting. This means: Generally speaking it is not that hard for a machine learning algorithm to perform exceptionally well on the training set (i.e. training loss is very low). If the algorithm just memorizes the training data set, it will produce a perfect score.
The challenge in machine learning however is to devise a model that performs well on unseen data, i.e. data that was not presented to the algorithm during training. This is what your validation set represents. If a model performs well on unseen data, we say that it generalizes well. If a model performs only well on training data, we call this overfitting. A model that does not generalize well is essentially useless, as it did not learn anything about the underlying structure of the data but just memorized the training set. This is useless because a trained model will be used on new data and probably never data used during training.
So how can you prevent that:
Reduce your model's capacity, i.e. take a simpler model and see if this can still accomplish the task. A less powerful model is less susceptible to simply memorize the data. Cf. also Occam's razor.
Use regularization: use e.g. dropout regularization in your model or add e.g. L1 or L2 norm of your trainable parameters to your loss function.
To get more information about this, search online for regularization, overfitting, etc.

Loss function works with reduce_mean but not reduce_sum

I'm new to tensor flow, and have been looking at the examples here. I wanted to rewrite the multilayer perceptron classification model to be a regression model. However I encountered some strange behaviour when modifying the loss function. It works fine with tf.reduce_mean, but if I try using tf.reduce_sum it gives nan's in the output. This seems very strange, as the functions are very similar - the only difference is that the mean divides the sum result by the number of elements? So I can't see how nan's could be introduced by this change?
import tensorflow as tf
# Parameters
learning_rate = 0.001
# Network Parameters
n_hidden_1 = 32 # 1st layer number of features
n_hidden_2 = 32 # 2nd layer number of features
n_input = 2 # number of inputs
n_output = 1 # number of outputs
# Make artificial data
SAMPLES = 1000
X = np.random.rand(SAMPLES, n_input)
T = np.c_[X[:,0]**2 + np.sin(X[:,1])]
# tf Graph input
x = tf.placeholder("float", [None, n_input])
y = tf.placeholder("float", [None, n_output])
# Create model
def multilayer_perceptron(x, weights, biases):
# Hidden layer with tanh activation
layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
layer_1 = tf.nn.tanh(layer_1)
# Hidden layer with tanh activation
layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
layer_2 = tf.nn.tanh(layer_2)
# Output layer with linear activation
out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
return out_layer
# Store layers weight & bias
weights = {
'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
'out': tf.Variable(tf.random_normal([n_hidden_2, n_output]))
}
biases = {
'b1': tf.Variable(tf.random_normal([n_hidden_1])),
'b2': tf.Variable(tf.random_normal([n_hidden_2])),
'out': tf.Variable(tf.random_normal([n_output]))
}
pred = multilayer_perceptron(x, weights, biases)
# Define loss and optimizer
#se = tf.reduce_sum(tf.square(pred - y)) # Why does this give nans?
mse = tf.reduce_mean(tf.square(pred - y)) # When this doesn't?
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(mse)
# Initializing the variables
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
training_epochs = 10
display_step = 1
# Training cycle
for epoch in range(training_epochs):
avg_cost = 0.
# Loop over all batches
for i in range(100):
# Run optimization op (backprop) and cost op (to get loss value)
_, msev = sess.run([optimizer, mse], feed_dict={x: X, y: T})
# Display logs per epoch step
if epoch % display_step == 0:
print("Epoch:", '%04d' % (epoch+1), "mse=", \
"{:.9f}".format(msev))
The problematic variable se is commented out. It should be used in place of mse.
With mse the output looks like this:
Epoch: 0001 mse= 0.051669389
Epoch: 0002 mse= 0.031438075
Epoch: 0003 mse= 0.026629323
...
and with se it ends up like this:
Epoch: 0001 se= nan
Epoch: 0002 se= nan
Epoch: 0003 se= nan
...
The loss by summing across the batch is 1000 times larger (from skimming the code I think your training batch size is 1000) so your gradients and parameter updates are also 1000 times larger. The larger updates apparently lead to nans.
Generally learning rates are expressed per example so the loss to find the gradients for updates should be per example also. If the loss is per batch then the learning rate needs to be reduced by the batch size to get comparable training results.
if you use reduce_sum instead of reduce_mean, then the gradient is much larger. Therefore, you should correspondingly narrow down the learning rate to make sure the training process can properly carry on.
In most literature, the loss is expressed as the mean of the losses over the batch. If the loss is calculated using reduce_mean(), the learning rate should be regarded as per batch which should be larger.
It seems like in tensorflow.keras.losses, people are still choosing between mean or sum. For example, in the tf.keras.losses.Huber, the default is mean. But you are allowed to set it to sum.