Keras: Why is LSTM much faster than SimpleRNN during training?

When I tried using SimpleRNN vs. LSTM, I found that the SimpleRNN training had an ETA of 30 minutes, whereas the LSTM had an ETA of 20 seconds. But SimpleRNN should have fewer operations than LSTM. What is causing this huge difference? Am I using SimpleRNN wrong?
import tensorflow as tf

SEQUENCE_LENGTH = 80

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=2000)
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=SEQUENCE_LENGTH)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=SEQUENCE_LENGTH)

model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(2000, 128),
    tf.keras.layers.SimpleRNN(8),
    # tf.keras.layers.LSTM(8),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=["accuracy"],
)
model.fit(x_train, y_train, batch_size=32, epochs=1)

Simple RNN: the input (x_t) and the previous output (h_{t-1}) are each multiplied by a weight matrix, summed, and passed through a tanh activation function. No gates are present.
Recurrent neural networks (RNNs) have a recurrent connection in which the output is fed back into the RNN neuron rather than only passed on to the next node. Each node in the RNN model functions as a memory cell, continuing the computation from one step to the next, so an RNN carries information forward through time. These feedback loops in the recurrent layer let the network maintain information in 'memory' over time. However, it can be difficult to train standard RNNs on problems that require learning long-term temporal dependencies, because the gradient of the loss function decays exponentially with time (the vanishing gradient problem).
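As a minimal sketch of that recurrence (plain NumPy, with illustrative weight names), a SimpleRNN step is just one tanh over two matrix products:

import numpy as np

def simple_rnn_step(x_t, h_prev, W_x, W_h, b):
    # h_t = tanh(x_t @ W_x + h_prev @ W_h + b): one affine transform of the
    # input, one of the previous output, a tanh, and no gates.
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)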
LSTM: LSTMs deal with the vanishing and exploding gradient problems by introducing gates, such as the input (i) and forget (f) gates, which allow better control over the gradient flow. These gates update and regulate the cell state of an LSTM network and enable better preservation of long-range dependencies. LSTM tackles gradient vanishing by ignoring useless data/information in the network: if an input (e.g. a previous word of the sentence) carries no valuable information, the LSTM can forget it rather than carry it forward.
It contains four networks, activated by either the sigmoid function (σ) or the tanh function, each with its own set of parameters:
Forget gate layer (f): decides which information to forget from the cell state.
Input gate layer (i): could also be called a remember gate; decides which of the new candidates are relevant for this time step.
New candidate gate layer (n): creates a new set of candidates to be stored in the cell state.
Output gate layer (o): determines which parts of the cell state are output.
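For comparison, here is a minimal sketch of one LSTM step using the four gate layers above (illustrative NumPy, with hypothetical per-gate weight dictionaries W, U, b):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    f = sigmoid(x_t @ W['f'] + h_prev @ U['f'] + b['f'])  # forget gate
    i = sigmoid(x_t @ W['i'] + h_prev @ U['i'] + b['i'])  # input (remember) gate
    n = np.tanh(x_t @ W['n'] + h_prev @ U['n'] + b['n'])  # new candidate values
    o = sigmoid(x_t @ W['o'] + h_prev @ U['o'] + b['o'])  # output gate
    c_t = f * c_prev + i * n   # forget part of the old cell state, add new candidates
    h_t = o * np.tanh(c_t)     # output a gated view of the cell state
    return h_t, c_t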
Please check this link for a better understanding.

Related

Is a validation curve slightly greater or lower than the training curve good in CNN models?

Can you tell me which of the two is a good validation-vs-train plot?
Both are trained with the same Keras sequential layers, but the second one is trained on more samples, i.e. with an augmented dataset.
I'm a little confused about the zigzags in the first plot; otherwise I think it is better than the second.
In the second plot there are no zigzags, but the validation accuracy tends to be a little higher than train. Is that overfitting, or is it acceptable?
It is an image detection model where the first model's dataset size is 5170 samples and the second had 9743 samples.
The convolutional layers defined for the model building:
tf.keras.layers.Conv2D(128,(3,3), activation = 'relu', input_shape = (150,150,3)),
tf.keras.layers.MaxPool2D(2,2),
tf.keras.layers.Conv2D(64,(3,3), activation = 'relu'),
tf.keras.layers.MaxPool2D(2,2),
tf.keras.layers.Conv2D(32,(3,3), activation = 'relu'),
tf.keras.layers.MaxPool2D(2,2),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(512,activation='relu'),
tf.keras.layers.Dropout(0.5),
tf.keras.layers.Dense(128,activation='relu'),
tf.keras.layers.Dropout(0.25),
tf.keras.layers.Dense(1,activation='sigmoid')
Can the model be improved?
From the graphs, the second one, where you have more samples, is better. The reason is that with more samples the model is trained on a much wider probability distribution of images, so when validation is run you have a better chance of correctly classifying the image.
You have a lot of dropout in your model. This is good to prevent overfitting; however, it will lower the training accuracy relative to the validation accuracy. Your model seems to be doing well.
It might improve if you add additional convolution/max-pooling layers. An alternative, of course, is to use transfer learning; I would recommend EfficientNetB3. I also recommend using an adjustable learning rate. The Keras callback ReduceLROnPlateau works well for that purpose (see its documentation). The code below shows my recommended settings.
rlronp = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=2,
    verbose=1,
    mode='auto',
)
In model.fit, include callbacks=[rlronp].
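For example (with hypothetical training arrays x_train and y_train and an already compiled model):

history = model.fit(
    x_train, y_train,
    epochs=20,
    validation_split=0.2,
    callbacks=[rlronp],  # halves the learning rate after 2 epochs without val_loss improvement
)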

Should feature embeddings be taken before or after dropout layer in neural network?

I am training a binary text classification model using BERT as follows:
def create_model():
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessed_text = bert_preprocess(text_input)
    outputs = bert_encoder(preprocessed_text)
    # Neural network layers
    l1 = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
    l2 = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l1)
    # Use inputs and outputs to construct a final model
    model = tf.keras.Model(inputs=[text_input], outputs=[l2])
    return model
This code is borrowed from the example on tfhub: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4.
I want to extract feature embeddings from the penultimate layer and use them for comparison, clustering, visualization, etc. between examples. Should this be done before dropout (l1 in the model above) or after dropout (l2 in the model above)?
I am trying to figure out whether this choice makes a significant difference, or whether it is fine either way. For example, if I extract feature embeddings after dropout and compute feature similarities between two examples, the result might be affected by which nodes are randomly set to 0 (but perhaps this is okay).
In order to answer your question let's recall how a Dropout layer works:
The Dropout layer is usually used as a means to mitigate overfitting. Suppose two layers, A and B, are connected through a Dropout layer. Then during the training phase, neurons in layer A are being randomly dropped. That prevents layer B from becoming too dependent upon specific neurons in layer A, as these neurons are not always available. Therefore, layer B has to take into consideration the overall signal coming from layer A, and (hopefully) cannot cling to some noise which is specific to the training set.
An important point to note is that the Dropout mechanism is activated only during the training phase. While predicting, Dropout does nothing.
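You can verify this directly. A quick sketch (Keras uses inverted dropout, so kept values are rescaled by 1/(1 - rate) during training):

import tensorflow as tf

x = tf.ones((1, 4))
drop = tf.keras.layers.Dropout(0.5)
print(drop(x, training=True))   # roughly half the entries zeroed, the rest scaled to 2.0
print(drop(x, training=False))  # identity: [[1. 1. 1. 1.]]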
If I understand you correctly, you want to know whether to take the features before or after the Dropout (note that in your network l1 denotes the features after Dropout has been applied). If so, I would take the features before Dropout, because technically it does not really matter (Dropout is inactive during prediction) and it is more reasonable to do so (Dropout is not meaningful without a following layer).
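As a sketch of the "before Dropout" option, assuming the create_model() definition from the question, you could build a second Model on the same graph that outputs the pooled (pre-Dropout) features and shares weights with the classifier:

def create_model():
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessed_text = bert_preprocess(text_input)
    outputs = bert_encoder(preprocessed_text)
    features = outputs['pooled_output']  # penultimate features, before Dropout
    l1 = tf.keras.layers.Dropout(0.1, name="dropout")(features)
    l2 = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l1)
    model = tf.keras.Model(inputs=[text_input], outputs=[l2])
    # Shares all weights with `model`; use it to extract the embeddings.
    feature_model = tf.keras.Model(inputs=[text_input], outputs=[features])
    return model, feature_model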

tensorflow transfer learning with pre-trained model that uses batch normalization

In Tensorflow guide about transfer learning, they said:
When you unfreeze a model that contains BatchNormalization layers in order to do fine-tuning, you should keep the BatchNormalization layers in inference mode by passing training=False when calling the base model.
What I understand from this is that even when I unfreeze layers, if the pre-trained model contains a BatchNormalization layer, I should set training=False, just like the code below:
from tensorflow.keras import regularizers
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, Dropout, GlobalAveragePooling2D, Input

resnet = ResNet50(weights='imagenet', include_top=False)
resnet.trainable = True  # unfreeze
inputs = Input(shape=(150, 150, 3))
x = resnet(inputs, training=False)  # because of BN
x = GlobalAveragePooling2D()(x)
x = Dropout(0.2)(x)
outputs = Dense(150, kernel_regularizer=regularizers.l2(0.005), activation='softmax')(x)
However, I got very low accuracy and learning rarely occurred, whereas when I set training to True the accuracy was satisfactory.
So, these are my questions:
Is it wrong to set training to True when it comes to a model with BN?
What does training=False mean? I thought it related to back-propagation.
Thanks in advance!
There are four parameters in a BN layer: two are trainable scale factors (gamma and beta), and the other two are the moving mean and standard deviation of the input features (for this BN layer).
Therefore:
Generally, we set training=True during the training procedure.
However, when it comes to transfer learning it is optional; both True and False are acceptable. The former lets the BN layer adapt its statistics to the new data, while the latter keeps the BN statistics learned on the previous dataset.
training=False means the layer runs in inference mode: it normalizes with the stored moving mean and standard deviation instead of the current batch statistics, and it does not update them. (Gradient updates to gamma and beta are controlled separately by the trainable attribute.) When testing, it is necessary to set training=False; otherwise the batch statistics of the test data would leak in, making the test accuracy unreliable.
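A minimal sketch of the two separate switches, assuming a standalone BatchNormalization layer and a hypothetical input x:

import tensorflow as tf

x = tf.random.normal((8, 16))
bn = tf.keras.layers.BatchNormalization()
bn.trainable = False        # freezes the trainable scale factors (gamma, beta)
y = bn(x, training=False)   # inference mode: normalize with the stored moving
                            # mean/variance and do not update them

Note that in TF 2, setting trainable=False on a BatchNormalization layer also makes it run in inference mode by default, which is why the guide's recommendation works.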

RNN Text Generation: How to balance training/test loss with validation loss?

I'm working on a short project that involves implementing a character RNN for text generation. My model uses a single LSTM layer with varying units (messing around with between 50 and 500), dropout at a rate of 0.2, and softmax activation. I'm using RMSprop with a learning rate of 0.01.
My issue is that I can't find a good way to characterize the validation loss. I'm using a validation split of 0.3 and I'm finding that the validation loss starts to become constant after only a few epochs (maybe 2-5 or so) while the training loss keeps decreasing. Does validation loss carry much weight in this sort of problem? The purpose of the model is to generate new strings, so quantifying the validation loss with other strings seems... pointless?
It's hard for me to really find the best model since qualitatively I get the sense that the best model is trained for more epochs than it takes for the validation loss to stop changing but also for fewer epochs than it takes for the training loss to start increasing. I would really appreciate any advice you have regarding this problem as well as any general advice about RNN's for text generation, especially regarding dropout and overfitting. Thanks!
This is the code for fitting the model for every epoch. The callback is a custom callback that just prints a few tests. I'm now realizing that history_callback.history['loss'] is probably the training loss isn't it...
for i in range(num_epochs):
    history_callback = model.fit(x, y,
                                 batch_size=128,
                                 epochs=1,
                                 callbacks=[print_callback],
                                 validation_split=0.3)
    loss_history.append(history_callback.history['loss'])
    validation_loss_history.append(history_callback.history['val_loss'])
My intention for this model isn't to replicate sentences from the training data; rather, I'd like to generate sentences from the same distribution that I'm training on.
Yes, history_callback.history['loss'] is the training loss and history_callback.history['val_loss'] is the validation loss.
Yes, validation loss carries weight in this sort of problem, because you don't just want to replicate the sentences given during training; you want the model to learn the patterns in the training data and generate new sentences when it sees new data.
From the information in the question and the insights identified in the comments (thanks to Brian Bartoldson), it is understood that your model is overfitting. In addition to EarlyStopping and dropout, you can try the techniques below to mitigate the overfitting problem:
a. Shuffle the data by using shuffle=True in model.fit (code is shown below).
b. Use recurrent_dropout. For example, setting recurrent_dropout=0.2 in a recurrent layer (LSTM) drops 20% of the units in the recurrent connections during training.
c. Use regularization. You can try l1 or l1_l2 regularization as well for the kernel_regularizer, recurrent_regularizer, bias_regularizer, and activity_regularizer arguments of the LSTM layer.
Sample code to use Shuffle, Early Stopping, Recurrent_Dropout, Regularization is shown below:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.regularizers import l2

model = Sequential()
Regularizer = l2(0.001)
model.add(tf.keras.layers.LSTM(units=50, activation='relu',
                               kernel_regularizer=Regularizer,
                               recurrent_regularizer=Regularizer,
                               bias_regularizer=Regularizer,
                               activity_regularizer=Regularizer,
                               dropout=0.2, recurrent_dropout=0.3))
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=15)
history_callback = model.fit(x, y,
                             batch_size=128,
                             epochs=1,
                             callbacks=[print_callback, callback],
                             validation_split=0.3, shuffle=True)
Hope this helps. Happy Learning!

Difference between Keras and tensorflow implementation of LSTM with dropout

I was reviewing the documentation for the LSTM cell in tensorflow and Keras. In particular, I want to apply dropout as well. Here is what I have in Keras and would like to apply the same LSTM cell in tensorflow:
cell = LSTM(num_units_2, return_sequences=True, dropout=dropout, recurrent_dropout=dropout)(net)
Therefore, I know that I need to use tf.nn.rnn_cell.LSTMCell in tensorflow with num_units = num_units_2. Second, I need a DropoutWrapper as:
cell = tf.nn.rnn_cell.DropoutWrapper(cell)
Now, I want to apply dropout and recurrent_dropout in the same way as the Keras code. I found that TensorFlow's implementation of dropout applies a different dropout mask at every time step unless variational_recurrent is set to True (yet I'm not sure how variational_recurrent works in detail).
Additionally, I'm not sure whether the LSTM in Keras applies a different mask at each time step as well.
Second, I was confused about the difference between output_keep_prob and state_keep_prob, as the documentation for both says:
output_keep_prob: unit Tensor or float between 0 and 1, output keep probability; if it is constant and 1, no output dropout will be added...
Any help is much appreciated!!
What variational dropout does
As far as I know, the main novelty of variational dropout is using the same dropout mask for all unrolled steps (as you said).
Difference between output_keep_prob and the state_keep_prob
output_keep_prob is the keep probability applied to the output (h) of the LSTM cell, whereas state_keep_prob is the keep probability applied to the cell state (c).
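Putting that together, a sketch with the TF 1.x API (assuming cell is an LSTMCell):

cell = tf.nn.rnn_cell.LSTMCell(num_units=128)
cell = tf.nn.rnn_cell.DropoutWrapper(
    cell,
    output_keep_prob=0.8,        # keep probability for the output h
    state_keep_prob=0.8,         # keep probability for the cell state c
    variational_recurrent=True,  # reuse one dropout mask across all time steps
    dtype=tf.float32,            # required when variational_recurrent=True
)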
Dropout choice in Keras
Looking at the _generate_dropout_mask method in the LSTM source code and its use in Keras's LSTMCell, I think the Keras LSTM uses variational recurrent dropout only for the recurrent connections (i.e. self._recurrent_dropout_mask). But I'm not 100% confident about this.