How to use DropoutWrapper in LSTM training and decoding

How to use DropoutWrapper in LSTM training and decoding - tensorflow

When I used dropout mechanism for lstm, the rouge score and loss of no dropout model performs better than model with dropout. So I wonder is my dropout code correct? I use tensorflow 0.12
cellClass = tf.nn.rnn_cell.LSTMCell
for layer_i in xrange(hps.enc_layers):
with tf.variable_scope('encoder%d'%layer_i), tf.device(
self._next_device()):
#bidirectional rnn cell
cell_fw = cellClass(
hps.num_hidden
,initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=123),
state_is_tuple=False
)
cell_bw = cellClass(
hps.num_hidden
,initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=113),
state_is_tuple=False
)
cell_fw = tf.nn.rnn_cell.DropoutWrapper(cell_fw, input_keep_prob=hps.input_dropout, output_keep_prob=hps.output_dropout)
cell_bw = tf.nn.rnn_cell.DropoutWrapper(cell_bw, input_keep_prob=hps.input_dropout, output_keep_prob=hps.output_dropout)
(emb_encoder_inputs, fw_state, _) = tf.nn.bidirectional_rnn(
cell_fw, cell_bw, emb_encoder_inputs, dtype=tf.float32,
sequence_length=article_lens)
#decoder
cell = cellClass(
hps.num_hidden
,initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=113),
state_is_tuple=False
)
cell=tf.nn.rnn_cell.DropoutWrapper(cell, input_keep_prob=hps.input_dropout, output_keep_prob=hps.output_dropout)
decoder_outputs, self._dec_out_state, self.cur_attns, self.cur_alpha = seq2seq.attention_decoder(
emb_decoder_inputs, self._dec_in_state, self._enc_top_states,
cell, num_heads=1, loop_function=loop_function,
initial_state_attention=initial_state_attention)
When training I set those keep prob to be the value I use like 0.5, when computing the loss of training set and validation set I keep them as 0.5, but in decoding step I use 1, which did not dropout anything. Am I correct?

Almost!
When you calculate the accuracy and validation, you need to manually set the keep_probability to 1.0 so that you don't actually drop any of your weight values when you are evaluating your network. If you don't do this, you'll essentially miscalculate the value you've trained your network to predict thus far. This could certainly negatively affect your acc/val scores. Especially with a 50% dropout rate.
Your dropout layer used in the decoding step is optional and should be experimented with. If you do use it, you'll want to set it to something other than 1.0.
Just to recap for the passerby, the idea behind dropout is to reset the values of random weights across your network weights to increase the probability that neurons don't become erroneously fixed (or whatever term you prefer) which results in over fitting of your network. Remember that, generally speaking, we are trying to approximate, or fit, our networks to a function. Since fitting networks is, in essence, an optimization problem, we have to to worry about optimizing to a local minima (or maxima... depending on which way the picture is oriented in your mind). Dropout is thus a form of regularization that helps us avoid over fitting.
If anyone has any further insights or corrections, please post!

Related

RNN Text Generation: How to balance training/test lost with validation loss?

I'm working on a short project that involves implementing a character RNN for text generation. My model uses a single LSTM layer with varying units (messing around with between 50 and 500), dropout at a rate of 0.2, and softmax activation. I'm using RMSprop with a learning rate of 0.01.
My issue is that I can't find a good way to characterize the validation loss. I'm using a validation split of 0.3 and I'm finding that the validation loss starts to become constant after only a few epochs (maybe 2-5 or so) while the training loss keeps decreasing. Does validation loss carry much weight in this sort of problem? The purpose of the model is to generate new strings, so quantifying the validation loss with other strings seems... pointless?
It's hard for me to really find the best model since qualitatively I get the sense that the best model is trained for more epochs than it takes for the validation loss to stop changing but also for fewer epochs than it takes for the training loss to start increasing. I would really appreciate any advice you have regarding this problem as well as any general advice about RNN's for text generation, especially regarding dropout and overfitting. Thanks!
This is the code for fitting the model for every epoch. The callback is a custom callback that just prints a few tests. I'm now realizing that history_callback.history['loss'] is probably the training loss isn't it...
for i in range(num_epochs):
history_callback = model.fit(x, y,
batch_size=128,
epochs=1,
callbacks=[print_callback],
validation_split=0.3)
loss_history.append(history_callback.history['loss'])
validation_loss_history.append(history_callback.history['val_loss'])
My intention for this model isn't to replicate sentences from the training data, rather, I'd like to generate sentence from the same distribution that I'm training on.

Yes history_callback.history['loss'] is Training Loss and history_callback.history['val_loss'] is the Validation Loss.
Yes, Validation Loss carries weight in this sort of problem because you just don't want to replicate the sentences which are given during Training but you want to learn the patterns from the Training Data and generate new sentences when it sees a new data.
From the information you mentioned in the question and from the insights identified from comments (thanks to Brian Bartoldson), it is understood that your model is overfitting. In addition to EarlyStopping and dropout, you can try the below mentioned techniques to mitigate overfitting problem.
3.a. Shuffle the Data, by using shuffle=True in model.fit. Code is shown below
3.b. Use recurrent_dropout. For example, If we set the value of Recurrent Dropout as 0.2 in a Recurrent Layer (LSTM), it means that it will consider only 80% of the Time Steps for that Recurrent Layer (LSTM).
3.c. Use Regularization. You can try l1 Regularization or l1_l2 Regularization as well for the arguments, kernel_regularizer, recurrent_regularizer, bias_regularizer, activity_regularizer of the LSTM Layer.
Sample code to use Shuffle, Early Stopping, Recurrent_Dropout, Regularization is shown below:
from tensorflow.keras.regularizers import l2
from tensorflow.keras.models import Sequential
model = Sequential()
Regularizer = l2(0.001)
model.add(tf.keras.layers.LSTM(units = 50, activation='relu',kernel_regularizer=Regularizer ,
recurrent_regularizer=Regularizer , bias_regularizer=Regularizer , activity_regularizer=Regularizer, dropout=0.2, recurrent_dropout=0.3))
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=15)
history_callback = model.fit(x, y,
batch_size=128,
epochs=1,
callbacks=[print_callback, callback],
validation_split=0.3, shuffle = True)
Hope this helps. Happy Learning!

Dropout with densely connected layer

Iam using a densenet model for one of my projects and have some difficulties using regularization.
Without any regularization, both validation and training loss (MSE) decrease. The training loss drops faster though, resulting in some overfitting of the final model.
So I decided to use dropout to avoid overfitting. When using Dropout, both validation and training loss decrease to about 0.13 during the first epoch and remain constant for about 10 epochs.
After that both loss functions decrease in the same way as without dropout, resulting in overfitting again. The final loss value is in about the same range as without dropout.
So for me it seems like dropout is not really working.
If I switch to L2 regularization though, Iam able to avoid overfitting, but I would rather use Dropout as a regularizer.
Now Iam wondering if anyone has experienced that kind of behaviour?
I use dropout in both the dense block (bottleneck layer) and in the transition block (dropout rate = 0.5):
def bottleneck_layer(self, x, scope):
with tf.name_scope(scope):
x = Batch_Normalization(x, training=self.training, scope=scope+'_batch1')
x = Relu(x)
x = conv_layer(x, filter=4 * self.filters, kernel=[1,1], layer_name=scope+'_conv1')
x = Drop_out(x, rate=dropout_rate, training=self.training)
x = Batch_Normalization(x, training=self.training, scope=scope+'_batch2')
x = Relu(x)
x = conv_layer(x, filter=self.filters, kernel=[3,3], layer_name=scope+'_conv2')
x = Drop_out(x, rate=dropout_rate, training=self.training)
return x
def transition_layer(self, x, scope):
with tf.name_scope(scope):
x = Batch_Normalization(x, training=self.training, scope=scope+'_batch1')
x = Relu(x)
x = conv_layer(x, filter=self.filters, kernel=[1,1], layer_name=scope+'_conv1')
x = Drop_out(x, rate=dropout_rate, training=self.training)
x = Average_pooling(x, pool_size=[2,2], stride=2)
return x

Without any regularization, both validation and training loss (MSE) decrease. The training loss drops faster though, resulting in some overfitting of the final model.
This is not overfitting.
Overfitting starts when your validation loss starts increasing, while your training loss continues decreasing; here is its telltale signature:
The image is adapted from the Wikipedia entry on overfitting - diferent things may lie in the horizontal axis, e.g. depth or number of boosted trees, number of neural net fitting iterations etc.
The (generally expected) difference between training and validation loss is something completely different, called the generalization gap:
An important concept for understanding generalization is the generalization gap, i.e., the difference between a model’s performance on training data and its performance on unseen data drawn from the same distribution.
where, practically speaking, validation data is unseen data indeed.
So for me it seems like dropout is not really working.
It can very well be the case - dropout is not expected to work always and for every problem.

Interesting problem,
I would recommend plotting the validation loss and the training loss to see if it is really overfitting. if you see that the validation loss didn't change while the training loss dropped ( you will also probably see a large gap between them ) then it is overfitting.
If it is overfitting then try to reduce the number of layers or the number of nodes ( also play a little with the Dropout rate after you do that). Reducing the number of epochs could also be helpful.
If you would like to use a different method instead of dropout I would recommend using the Gaussian Noise layer.
Keras - https://keras.io/layers/noise/
TensorFlow - https://www.tensorflow.org/api_docs/python/tf/keras/layers/GaussianNoise

MLP output of first layer is zero after one epoch

I've been running into an issue lately trying to train a simple MLP.
I'm basically trying to get a network to map the XYZ position and RPY orientation of the end-effector of a robot arm (6-dimensional input) to the angle of every joint of the robot arm to reach that position (6-dimensional output), so this is a regression problem.
I've generated a dataset using the angles to compute the current position, and generated datasets with 5k, 500k and 500M sets of values.
My issue is the MLP I'm using doesn't learn anything at all. Using Tensorboard (I'm using Keras), I've realized that the output of my very first layer is always zero (see image 1), no matter what I try.
Basically, my input is a shape (6,) vector and the output is also a shape (6,) vector.
Here is what I've tried so far, without success:
I've tried MLPs with 2 layers of size 12, 24; 2 layers of size 48, 48; 4 layers of size 12, 24, 24, 48.
Adam, SGD, RMSprop optimizers
Learning rates ranging from 0.15 to 0.001, with and without decay
Both Mean Squared Error (MSE) and Mean Absolute Error (MAE) as the loss function
Normalizing the input data, and not normalizing it (the first 3 values are between -3 and +3, the last 3 are between -pi and pi)
Batch sizes of 1, 10, 32
Tested the MLP of all 3 datasets of 5k values, 500k values and 5M values.
Tested with number of epoches ranging from 10 to 1000
Tested multiple initializers for the bias and kernel.
Tested both the Sequential model and the Keras functional API (to make sure the issue wasn't how I called the model)
All 3 of sigmoid, relu and tanh activation functions for the hidden layers (the last layer is a linear activation because its a regression)
Additionally, I've tried the very same MLP architecture on the basic Boston housing price regression dataset by Keras, and the net was definitely learning something, which leads me to believe that there may be some kind of issue with my data. However, I'm at a complete loss as to what it may be as the system in its current state does not learn anything at all, the loss function just stalls starting on the 1st epoch.
Any help or lead would be appreciated, and I will gladly provide code or data if needed!
Thank you
EDIT:
Here's a link to 5k samples of the data I'm using. Columns B-G are the output (angles used to generate the position/orientation) and columns H-M are the input (XYZ position and RPY orientation). https://drive.google.com/file/d/18tQJBQg95ISpxF9T3v156JAWRBJYzeiG/view
Also, here's a snippet of the code I'm using:
df = pd.read_csv('kinova_jaco_data_5k.csv', names = ['state0',
'state1',
'state2',
'state3',
'state4',
'state5',
'pose0',
'pose1',
'pose2',
'pose3',
'pose4',
'pose5'])
states = np.asarray(
[df.state0.to_numpy(), df.state1.to_numpy(), df.state2.to_numpy(), df.state3.to_numpy(), df.state4.to_numpy(),
df.state5.to_numpy()]).transpose()
poses = np.asarray(
[df.pose0.to_numpy(), df.pose1.to_numpy(), df.pose2.to_numpy(), df.pose3.to_numpy(), df.pose4.to_numpy(),
df.pose5.to_numpy()]).transpose()
x_train_temp, x_test, y_train_temp, y_test = train_test_split(poses, states, test_size=0.2)
x_train, x_val, y_train, y_val = train_test_split(x_train_temp, y_train_temp, test_size=0.2)
mean = x_train.mean(axis=0)
x_train -= mean
std = x_train.std(axis=0)
x_train /= std
x_test -= mean
x_test /= std
x_val -= mean
x_val /= std
n_epochs = 100
n_hidden_layers=2
n_units=[48, 48]
inputs = Input(shape=(6,), dtype= 'float32', name = 'input')
x = Dense(units=n_units[0], activation=relu, name='dense1')(inputs)
for i in range(1, n_hidden_layers):
x = Dense(units=n_units[i], activation=activation, name='dense'+str(i+1))(x)
out = Dense(units=6, activation='linear', name='output_layer')(x)
model = Model(inputs=inputs, outputs=out)
optimizer = SGD(lr=0.1, momentum=0.4)
model.compile(optimizer=optimizer, loss='mse', metrics=['mse', 'mae'])
history = model.fit(x_train,
y_train,
epochs=n_epochs,
verbose=1,
validation_data=(x_test, y_test),
batch_size=32)
Edit 2
I've tested the architecture with a random dataset where the input was a (6,) vector where input[i] is a random number and the output was a (6,) vector with output[i] = input[i]² and the network didn't learn anything. I've also tested a random dataset where the input was a random number and the output was a linear function of the input, and the loss converged to 0 pretty quickly. In short, it seems the simple architecture is unable to map a non-linear function.

the output of my very first layer is always zero.
This typically means that the network does not "see" any pattern in the input at all, which causes it to always predict the mean of the target over the entire training set, regardless of input. Your output is in the range of -𝜋 to 𝜋 probably with an expected value of 0, so it checks out.
My guess is that the model is too small to represent the data efficiently. I would suggest that you increase the number of parameters in the model by a factor of 10 or 100 and see if it starts seeing something. Limiting the number of parameters has a regularizing effect on the network, and strong regularization usually leads the the aforementioned derping to the mean.
I'm by no means a robotics expert, but I guess that there are a lot of situations where a small nudge in the output parameters causes a large change of the input. Let's say I'm trying to scratch my back with my left hand - the farther my hand goes to the left, the harder the task becomes, so at some point I might want to switch hands, which is a discontinuous configuration change. A bad analogy, sure, but I hope it demonstrates my hunch that there are certain places in the configuration space where small target changes cause large configuration changes.
Such large changes will cause a very large, very noisy gradient around those points. I'm not sure how well the network will work around these noisy gradients, but I would suggest as an experiment that you try to limit the training dataset to a set of outputs that are connected smoothly to one another in the configuration space of the arm, if that makes sense. Going further, you should remove any points from the dataset that are close to such configuration boundaries. To make up for that at inference time, you might instead want to sample several close-by points and choose the most common prediction as the final result. Hopefully some of those points will land in a smooth configuration area.
Also, adding batch normalization before each dense layer will help smooth the gradient and provide for more reliable training.
As for the rest of your hyperparameters:
A batch size of 32 is good, a very small batch size will make the gradient too noisy
The loss function is not critical, both MSE and MAE should work
The activation functions aren't critical, ReLU is a good default choice.
The default initializers a good enough.
Normalizing is important for Dense layers, so keep it
Train for as many epochs as you need as long as both the training and validation loss are dropping. If the validation loss hasn't dropped for 5-10 epochs you might as well stop early.
Adam is a good default choice. Start with a small learning rate and increase the learning rate at the beginning of training only if the training loss is dropping consistently over several epochs.
Further reading: 37 Reasons why your Neural Network is not working

I ended up replacing the first dense layer with a Conv1D layer and the network now seems to be learning decently. It's overfitting to my data, but that's territory I'm okay with.
I'm closing the thread for now, I'll spend some time playing with the architecture.

Is it relevant to use both batch_norm and dropout in estimator?

I read batch normalization and dropout are two different ways to avoid overfitting in neural networks. Is it relevant to use both in the same estimator as following ?
```
model1 = tf.estimator.DNNClassifier(feature_columns=feature_columns_complex_standardized,
hidden_units=[512,512,512],
optimizer=tf.train.AdamOptimizer(learning_rate=0.001, beta1= 0.9,beta2=0.99, epsilon = 1e-08,use_locking=False),
weight_column=weights,
dropout=0.5,
activation_fn=tf.nn.softmax,
n_classes=10,
label_vocabulary=Action_vocab,
model_dir='./Models9/Action/',
loss_reduction=tf.losses.Reduction.SUM_OVER_BATCH_SIZE,
config=tf.estimator.RunConfig().replace(save_summary_steps=10),
batch_norm=True)

There is a small problem in your understanding. Batch Normalization original intent is not to reduce overfitting but to speed up the training. Just like how you normalize the inputs while you are passing it to the first layer of your network, batch normalization achieves this action in inner (or hidden) layers. Batch normalization removes the effect of covariate shift while it is training.
But since this is applied at every batches separately, it results in a side effect of regularizing your weight parameters. This regularizing effect is quite similar to that of how you would have done had you intended to solve over-fitting.
You can apply both batch_norm and dropout together but it is advisable to reduce the dropout. Currently, your dropout rate at 0.5 is very high. I believe dropout of 0.1 to 0.2 should be enough when you are applying it together with batch_norm. Also, the value of dropout is a hyper-parameter, so there is no fixed answer to it and you may have to tune it as per your data input and network.

Both batch normalization and dropout gives the regularization effect in some way or another.
As you apply the batch normalization for normalization steps it sees all the training example in mini-batch together to reduce the internal covariate shift which helps in speeding up the training and not setting the learning rate low and also gives the regularization effect.
If batch normalization is used along the network, then the dropout regularization can be reduced or dropped in strength

How to handle the BatchNorm layer when training fully convolutional networks by finetuning?

Training fully convolutional nerworks (FCNs) for pixelwise semantic segmentation is very memory intensive. So we often use batchsize=1 for traing FCNs. However, when we finetune the pretrained networks with BatchNorm (BN) layers, batchsize=1 doesn't make sense for the BN layers. So, how to handle the BN layers?
Some options:
delete the BN layers (merge the BN layers with the preceding layers for the pretrained model)
Freeze the parameters and statistics of the BN layers
....
which is better and any demo for implementation in pytorch/tf/caffe?

Having only one element will make the batch normalization zero if epsilon is non-zero (variance is zero, mean will be same as input).
Its better to delete the BN layers from the network and try the activation function SELU (scaled exponential linear units). This is from the paper 'Self normalizing neural networks' (SNNs).
Quote from the paper:
While batch normalization requires explicit normalization, neuron
activations of SNNs automatically converge towards zero mean and
unit variance. The activation function of SNNs are “scaled
exponential linear units” (SELUs), which induce self-normalizing
properties.
The SELU is defined as:
def selu(x, name="selu"):
alpha = 1.6732632423543772848170429916717
scale = 1.0507009873554804934193349852946
return scale * tf.where(x >= 0.0, x, alpha * tf.nn.elu(x))

Batch Normalization was introduced to reduce the internal covariate shift of the input feature maps. Due to change of parameters of each layer after every optimization steps, input distribution of a layer also changes, this slow down the model convergence. By using Batch Normalization we can normalize the input distribution irrespective of the batch_size (whether batch_size =1 or larger).
BN normalizes the input distribution
For convolutional network input for intermediate layer is 4D tensor. [batch_size, width, height, num_filters]. Normalization effect all the feature maps.
delete the BN layers (merge the BN layers with the preceding layers for the pretrained model)
This may further slow down the training step and convergence mayn't be achieved.
Freeze the parameters and statistics of the BN layers
Sometime the input data distribution for retrain/finetune, may vary significantly from the original data used to train the pretrained model used for initialization, Due to which your model may end-up in non-optimal solution.

According to my experiments in PyTorch, if convolutional layer before the BN outputs more than one value (i.e. 1 x feat_nb x height x width, where height > 1 or width > 1), then the BN still works fine even when the batch size is equal to one. However, I suspect that in this case the variance estimate might be very biased since all samples that are used for variance calculation come from the same image. Therefore in my case I still decided to use small batch.

The effective batch size over convolutional layer
I think the CNN-relative section (Section 3.2) in the BN original paper could help. From the point of view of the authors, it should be OK to use batch size = 1 for convolutional layers. The "effective batch size" for convolutional layer actually is batch_size * image_height * image_width.

I do not have an exact answer, but here are my thoughts:
networks with BatchNorm (BN) layers, batchsize=1 doesn't make sense
for the BN layers
The main motivation of BN is to fix the distribution (mean/variance) of the input in the batch. In my opinion, having one element this does not make sense. Judging from the paper
you will need to calculate the mean and the variance for 1 element, which does not make sense.
You can always just remove BN but are you sure you can't afford at least 16 elements in the batch?

My observation is in contrary with Stephan's: using PyTorch on a similar input batch x feat_nb x height x width, where height > 1 or width > 1, I found adding BatchNorm after the last conv and before the last non-linear (sigmoid) actually hurts the accuracy by a big margin. Still trying to make sense out of it..
(batch size = 8)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas