Difference between Keras and tensorflow implementation of LSTM with dropout - tensorflow

I was reviewing the documentation for the LSTM cell in tensorflow and Keras. In particular, I want to apply dropout as well. Here is what I have in Keras and would like to apply the same LSTM cell in tensorflow:
cell = LSTM(num_units_2, return_sequences=True, dropout=dropout, recurrent_dropout=dropout)(net)
Therefore, I know that I need to use tf.nn.rnn_cell.LSTMCell in tensorflow with num_units = num_units_2. Second, I need a DropoutWrapper as:
cell = tf.nn.rnn_cell.DropoutWrapper(cell)
Now, I want to apply dropout and recurrent_dropout as in the Keras code. I found that tensorflow's implementation of dropout applies a different dropout mask at every time step unless variational_recurrent is set to True (though I'm not sure how variational_recurrent works in detail).
Additionally, I'm not sure whether the LSTM in Keras also applies a different mask at each time step.
Second, I was confused about the difference between the output_keep_prob and the state_keep_prob as both mention:
output_keep_prob: unit Tensor or float between 0 and 1, output keep probability; if it is constant and 1, no output dropout will be added...
Any help is much appreciated!!

What variational dropout does
As far as I know, the main novelty of variational dropout is using the same dropout mask for all unrolled steps (as you said).
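To make the difference concrete, here is a minimal pure-Python sketch of the two mask-sampling strategies. The function names are hypothetical and this is not how tensorflow or Keras implement it internally; it only illustrates "one mask per step" versus "one mask for all steps".

```python
import random


def sample_mask(n_units, keep_prob, rng):
    """Bernoulli keep-mask: 1 keeps a unit, 0 drops it."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(n_units)]


def per_step_masks(n_steps, n_units, keep_prob, rng):
    """Standard dropout: a fresh mask is drawn at every time step."""
    return [sample_mask(n_units, keep_prob, rng) for _ in range(n_steps)]


def variational_masks(n_steps, n_units, keep_prob, rng):
    """Variational dropout: one mask is drawn and reused at every time step."""
    mask = sample_mask(n_units, keep_prob, rng)
    return [mask for _ in range(n_steps)]


rng = random.Random(0)
standard = per_step_masks(n_steps=5, n_units=8, keep_prob=0.5, rng=rng)
variational = variational_masks(n_steps=5, n_units=8, keep_prob=0.5, rng=rng)
```

Every row of variational is the same list, whereas each row of standard is drawn independently.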
Difference between output_keep_prob and the state_keep_prob
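output_keep_prob is the keep probability (1 - dropout rate) applied to the output (h) of the LSTM cell, whereas state_keep_prob is the keep probability applied to the recurrent state that the cell passes to the next time step. Note that, by default, DropoutWrapper (via its dropout_state_filter_visitor argument) applies state dropout to every part of the state except the memory (c) of the LSTM state, so with the defaults it is the h part of the state that gets dropped, not c.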
Dropout choice in Keras
Looking at the _generate_dropout_mask method in the LSTM source code and its use for the LSTMCell of Keras, I think Keras LSTM uses variational recurrent dropout only for the recurrent connections (i.e. self._recurrent_dropout_mask). But I'm not 100% confident about this.

Related

Keras: Why is LSTM much faster than SimpleRNN during training

When I tried using SimpleRNN vs LSTM, I found that SimpleRNN training had an ETA of 30 min, whereas LSTM had an ETA of 20 seconds. But SimpleRNN should have fewer operations than LSTM. What is causing this huge difference? Am I using SimpleRNN wrong?
import tensorflow as tf
SEQUENCE_LENGTH = 80
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words = 2000)
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=SEQUENCE_LENGTH)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=SEQUENCE_LENGTH)
model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(2000, 128),
    tf.keras.layers.SimpleRNN(8),
    # tf.keras.layers.LSTM(8),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=["accuracy"],
)
model.fit(x_train, y_train, batch_size=32, epochs=1)
Simple RNN: The input (x_t) and the previous output (h_{t-1}) are each multiplied by a weight matrix, summed, and passed through a tanh activation function. No gates are present.
Recurrent neural networks (RNNs) have a recurrent connection through which the output is fed back into the RNN neuron rather than only being passed on to the next node.
Each node in the RNN model functions as a memory cell, carrying its computation forward from one step to the next. In principle, an RNN can carry information from every earlier step forward through time. RNNs have feedback loops in the recurrent layer, which let them maintain information in ‘memory’ over time. However, it can be difficult to train standard RNNs on problems that require learning long-term temporal dependencies, because the gradient of the loss function decays exponentially with time (the vanishing gradient problem).
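As a rough illustration of the simple RNN update described above, here is a pure-Python sketch of a single time step, h_t = tanh(W x_t + U h_{t-1} + b). The names W, U and b and the toy values are assumptions made for the example; this is not the Keras SimpleRNN implementation.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def simple_rnn_step(x_t, h_prev, W, U, b):
    """One ungated recurrent step: h_t = tanh(W x_t + U h_{t-1} + b)."""
    wx = matvec(W, x_t)
    uh = matvec(U, h_prev)
    return [math.tanh(a + r + bias) for a, r, bias in zip(wx, uh, b)]


# Toy parameters (hypothetical): 2 input features, 2 hidden units.
W = [[0.5, -0.2], [0.1, 0.3]]
U = [[0.0, 0.4], [-0.3, 0.2]]
b = [0.0, 0.1]

h = [0.0, 0.0]
for x_t in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h = simple_rnn_step(x_t, h, W, U, b)  # the same weights are reused at every step
```

The only state carried between steps is h, and it passes through tanh and U at every step, which is where the repeated shrinking of gradients comes from.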
LSTM: LSTMs deal with the vanishing and exploding gradient problems by introducing gates, such as the input (i) and forget (f) gates. These gates regulate how the cell state is updated, which gives better control over the gradient flow and better preserves “long-range dependencies”.
LSTM tackles vanishing gradients by ignoring useless information in the network. If there is no valuable data from other inputs (previous words of the sentence), the LSTM will forget that data and produce the result “Cut down the budget.”
An LSTM cell contains four layers, each activated by either the sigmoid function (σ) or the tanh function, and each with its own set of parameters:
Forget gate layer (f): Decides which information to forget from the cell state.
Input gate layer (i): This could also be called a remember gate. It decides which of the new candidates are relevant for this time step.
New candidate gate layer (n): Creates a new set of candidates to be stored in the cell state.
Output gate layer (o): Determines which parts of the cell state are output.
Please check this link for better understanding in this.
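The four layers listed above can be sketched in pure Python as a single LSTM time step. This assumes the standard LSTM equations; the parameter layout (a dict keyed by "f", "i", "n", "o", each holding input weights, recurrent weights and a bias) is hypothetical and chosen only to mirror the list.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x_t, h_prev):
    """Compute W x_t + U h_{t-1} + b for one layer."""
    return [
        sum(w * x for w, x in zip(w_row, x_t))
        + sum(u * h for u, h in zip(u_row, h_prev))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params maps 'f', 'i', 'n', 'o' to (W, U, b) tuples."""
    f = [sigmoid(z) for z in affine(*params["f"], x_t, h_prev)]    # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], x_t, h_prev)]    # input gate
    n = [math.tanh(z) for z in affine(*params["n"], x_t, h_prev)]  # new candidates
    o = [sigmoid(z) for z in affine(*params["o"], x_t, h_prev)]    # output gate
    # Keep part of the old cell state and add the gated candidates.
    c_t = [f_k * c_k + i_k * n_k for f_k, c_k, i_k, n_k in zip(f, c_prev, i, n)]
    # Expose a gated, squashed view of the cell state as the output.
    h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_t)]
    return h_t, c_t


# Toy parameters (hypothetical): 2 input features, 1 hidden unit.
params = {
    "f": ([[0.1, 0.2]], [[0.3]], [1.0]),
    "i": ([[0.4, -0.1]], [[0.2]], [0.0]),
    "n": ([[0.5, 0.5]], [[-0.3]], [0.0]),
    "o": ([[-0.2, 0.3]], [[0.1]], [0.0]),
}
h, c = [0.0], [0.0]
for x_t in [[1.0, 0.0], [0.0, 1.0]]:
    h, c = lstm_step(x_t, h, c, params)
```

Compared with the simple RNN, each step evaluates four weighted sums instead of one, which is why an LSTM has more operations per step on paper.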

Should feature embeddings be taken before or after dropout layer in neural network?

I am training a binary text classification model using BERT as follows:
def create_model():
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessed_text = bert_preprocess(text_input)
    outputs = bert_encoder(preprocessed_text)
    # Neural network layers
    l1 = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
    l2 = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l1)
    # Use inputs and outputs to construct a final model
    model = tf.keras.Model(inputs=[text_input], outputs=[l2])
    return model
This code is borrowed from the example on tfhub: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4.
I want to extract feature embeddings from the penultimate layer and use them for comparison, clustering, visualization, etc between examples. Should this be done before dropout (l1 in the model above) or after dropout (l2 in the model above)?
I am trying to figure out whether this choice makes a significant difference, or is it fine either way? For example, if I extract feature embeddings after dropout and compute feature similarities between two examples, this might be affected by which nodes are randomly set to 0 (but perhaps this is okay).
In order to answer your question let's recall how a Dropout layer works:
The Dropout layer is usually used as a means to mitigate overfitting. Suppose two layers, A and B, are connected through a Dropout layer. Then during the training phase, neurons in layer A are being randomly dropped. That prevents layer B from becoming too dependent upon specific neurons in layer A, as these neurons are not always available. Therefore, layer B has to take into consideration the overall signal coming from layer A, and (hopefully) cannot cling to some noise which is specific to the training set.
An important point to note is that the Dropout mechanism is activated only during the training phase. While predicting, Dropout does nothing.
If I understand you correctly, you want to know whether to take the features before or after the Dropout (note that in your network l1 denotes the features after Dropout has been applied). If so, I would take the features before Dropout, because technically it does not really matter (Dropout is inactive during prediction) and it is more reasonable to do so (Dropout is not meaningful without a following layer).
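The point that Dropout does nothing at prediction time can be illustrated with a small pure-Python sketch of inverted dropout, which is the behaviour I would assume for the Keras Dropout layer (units are zeroed and the survivors rescaled during training, and the input is passed through unchanged otherwise). The dropout function below is a hypothetical stand-in, not the Keras layer itself.

```python
import random


def dropout(values, rate, training, rng=random):
    """Inverted dropout: in training, zero each unit with probability `rate`
    and scale the survivors by 1 / (1 - rate); otherwise return the input as is."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


pooled = [0.2, -1.3, 0.7, 0.05]  # stand-in for a pooled_output feature vector

at_inference = dropout(pooled, rate=0.1, training=False)
in_training = dropout(pooled, rate=0.1, training=True, rng=random.Random(0))
```

Here at_inference equals pooled, so features taken before or after the Dropout layer are identical when predicting; only in_training contains zeroed or rescaled entries.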

What is the meaning of “recurrent dropout” in LSTM?

Recently, I tried to use “tf.contrib.rnn.LayerNormBasicLSTMCell”, but I don't know what the argument “dropout_keep_prob” means.
Then I looked at the documentation provided by Google. Their explanation is “unit Tensor or float between 0 and 1 representing the recurrent dropout probability value. If float and 1.0, no dropout will be applied.”
But I don't know the difference between “recurrent dropout” and “dropout”.
Recurrent Dropout is a regularization method for recurrent neural networks. Dropout is applied to the updates to the LSTM memory cells, i.e. it drops out units of the candidate update (the input/update term) that is added to the cell state, rather than the inputs or outputs of the cell. For more information you can refer here.
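A minimal sketch of that idea, assuming the common formulation c_t = f * c_{t-1} + i * dropout(g), where g is the candidate update. The function name and arguments are hypothetical, and this is not the actual code of LayerNormBasicLSTMCell.

```python
import random


def cell_update(c_prev, f, i, g, keep_prob, training, rng):
    """Cell-state update with dropout applied only to the candidate update g.

    The previous cell state c_prev is never masked, so the memory itself is
    preserved; only what gets written into it at this step is dropped.
    """
    if training and keep_prob < 1.0:
        g = [g_k / keep_prob if rng.random() < keep_prob else 0.0 for g_k in g]
    return [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]


rng = random.Random(0)
c_eval = cell_update([1.0], f=[0.5], i=[0.5], g=[0.8],
                     keep_prob=0.5, training=False, rng=rng)
c_train = cell_update([1.0], f=[0.5], i=[0.5], g=[0.8],
                      keep_prob=0.5, training=True, rng=rng)
```

By contrast, ordinary (non-recurrent) dropout would mask the input x_t or the output h_t of the cell.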

tensorflow - how to use variational recurrent dropout correctly

The tensorflow config dropout wrapper has three different dropout probabilities that can be set: input_keep_prob, output_keep_prob, state_keep_prob.
I want to use variational dropout for my LSTM units, by setting the variational_recurrent argument to true. However, I don't know which of the three dropout probabilities I have to use for variational dropout to function correctly.
Can someone provide help?
According to the paper https://arxiv.org/abs/1512.05287, which is the basis for the variational_recurrent dropout implementation, you can think about it as follows:
input_keep_prob - probability of keeping (i.e. not dropping out) input connections.
output_keep_prob - probability of keeping output connections.
state_keep_prob - probability of keeping recurrent connections.
See the diagram below,
If you set variational_recurrent to True, you get an RNN similar to the model on the right; otherwise, you get one similar to the model on the left.
The basic differences between the two models are:
Variational RNN repeats the same dropout mask at each time step for inputs, outputs, and recurrent layers (it drops the same network units at each time step).
Naive RNN uses different dropout masks at each time step for the inputs and outputs alone (no dropout is used with the recurrent connections, since the use of different masks with these connections leads to deteriorated performance).
In the above diagram, coloured connections represent the dropped-out connections, with different colours corresponding to different dropout masks. Dashed lines correspond to standard connections with no dropout.
Therefore, if you use a variational RNN you can set all three probability parameters according to your requirement.
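To show where the three keep probabilities act, here is a toy pure-Python recurrence with one mask each for the input, state and output connections. It is only a sketch under simplifying assumptions (an element-wise recurrence instead of a real LSTM, hypothetical function names) and does not reproduce the DropoutWrapper internals.

```python
import math
import random


def keep_mask(n_units, keep_prob, rng):
    """Scaled Bernoulli mask: kept units are multiplied by 1 / keep_prob."""
    return [1.0 / keep_prob if rng.random() < keep_prob else 0.0
            for _ in range(n_units)]


def run_toy_rnn(inputs, keep_probs, variational, seed=0):
    """keep_probs has the keys 'input', 'state' and 'output'.

    Returns the outputs and the masks that were used at each time step.
    """
    rng = random.Random(seed)
    n_units = len(inputs[0])

    def draw():
        return {name: keep_mask(n_units, p, rng) for name, p in keep_probs.items()}

    shared = draw() if variational else None  # variational: draw the masks once
    h = [0.0] * n_units
    outputs, used = [], []
    for x_t in inputs:
        masks = shared if variational else draw()  # otherwise redraw every step
        h = [math.tanh(x * m_in + h_k * m_state)
             for x, h_k, m_in, m_state in zip(x_t, h, masks["input"], masks["state"])]
        outputs.append([h_k * m_out for h_k, m_out in zip(h, masks["output"])])
        used.append(masks)
    return outputs, used


probs = {"input": 0.8, "state": 0.8, "output": 0.8}
_, variational_masks_used = run_toy_rnn([[0.5, 0.1]] * 4, probs, variational=True)
_, per_step_masks_used = run_toy_rnn([[0.5, 0.1]] * 4, probs, variational=False)
```

With variational=True all four entries of variational_masks_used are the same set of masks, which is why setting any of the three probabilities below 1 is meaningful in that mode.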
Hope this helps.

Using binary_crossentropy loss in Keras (Tensorflow backend)

In the training example in Keras documentation,
https://keras.io/getting-started/sequential-model-guide/#training
binary_crossentropy is used and a sigmoid activation is added in the network's last layer, but is it necessary to add a sigmoid in the last layer? This is what I found in the source code:
def binary_crossentropy(output, target, from_logits=False):
    """Binary crossentropy between an output tensor and a target tensor.
    Arguments:
        output: A tensor.
        target: A tensor with the same shape as `output`.
        from_logits: Whether `output` is expected to be a logits tensor.
            By default, we consider that `output`
            encodes a probability distribution.
    Returns:
        A tensor.
    """
    # Note: nn.softmax_cross_entropy_with_logits
    # expects logits, Keras expects probabilities.
    if not from_logits:
        # transform back to logits
        epsilon = _to_tensor(_EPSILON, output.dtype.base_dtype)
        output = clip_ops.clip_by_value(output, epsilon, 1 - epsilon)
        output = math_ops.log(output / (1 - output))
    return nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output)
Keras invokes sigmoid_cross_entropy_with_logits in Tensorflow, but in sigmoid_cross_entropy_with_logits function, sigmoid(logits) is calculated again.
https://www.tensorflow.org/versions/master/api_docs/python/tf/nn/sigmoid_cross_entropy_with_logits
So I don't think it makes sense to add a sigmoid at the end, but seemingly all the binary/multi-label classification examples and tutorials in Keras that I found online add a sigmoid at the end. Besides, I don't understand the meaning of
# Note: nn.softmax_cross_entropy_with_logits
# expects logits, Keras expects probabilities.
Why does Keras expect probabilities? Doesn't it use the nn.softmax_cross_entropy_with_logits function? Does it make sense?
Thanks.
You're right, that's exactly what's happening. I believe this is due to historical reasons.
Keras was created before tensorflow, as a wrapper around theano. And in theano, one has to compute sigmoid/softmax manually and then apply cross-entropy loss function. Tensorflow does everything in one fused op, but the API with sigmoid/softmax layer was already adopted by the community.
If you want to avoid unnecessary logit <-> probability conversions, call the binary_crossentropy loss with from_logits=True and don't add the sigmoid layer.
In categorical cross entropy:
if the output is a prediction (probabilities), it computes the cross entropy directly
if the output is a logit, it applies softmax cross entropy with logits
In binary cross entropy:
if the output is a prediction (probabilities), it converts it back to a logit and then applies sigmoid cross entropy with logits
if the output is a logit, it applies sigmoid cross entropy with logits directly
In Keras by default we use activation sigmoid on the output layer and then use the keras binary_crossentropy loss function, independent of the backend implementation (Theano, Tensorflow or CNTK).
If you look more in depth at the pure Tensorflow case, you find that the tensorflow backend binary_crossentropy function (which you pasted in your question) uses tf.nn.sigmoid_cross_entropy_with_logits. The latter function also applies the sigmoid activation. To avoid a double sigmoid, the tensorflow backend binary_crossentropy will by default (with from_logits=False) calculate the inverse sigmoid (logit(x) = log(x/(1-x))) to get the output back into the raw state from the network with no activation.
The extra sigmoid activation and the inverse sigmoid calculation can be avoided by using no sigmoid activation function in your last layer and then calling the tensorflow backend binary_crossentropy with the parameter from_logits=True (or directly using tf.nn.sigmoid_cross_entropy_with_logits).
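The round trip described above can be checked numerically with a small pure-Python sketch. It mirrors the structure of the backend function quoted in the question for a single scalar; the function names and the EPSILON value are assumptions made for the example, and the numerically stable formula max(x, 0) - x * z + log(1 + exp(-|x|)) is the one given in the tf.nn.sigmoid_cross_entropy_with_logits documentation.

```python
import math

EPSILON = 1e-7  # assumed clipping constant, standing in for _EPSILON


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_from_probability(p, target):
    """Textbook binary cross entropy on a probability p."""
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def sigmoid_cross_entropy_with_logits(logit, target):
    """Stable form: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def bce_keras_style(output, target, from_logits=False):
    """Scalar sketch of the backend logic: undo the sigmoid unless given logits."""
    if not from_logits:
        output = min(max(output, EPSILON), 1.0 - EPSILON)
        output = math.log(output / (1.0 - output))  # inverse sigmoid (logit)
    return sigmoid_cross_entropy_with_logits(output, target)


logit = 0.8
p = sigmoid(logit)

direct = bce_keras_style(logit, 1.0, from_logits=True)  # no sigmoid layer
roundtrip = bce_keras_style(p, 1.0)                      # sigmoid layer, default
```

Both paths give the same loss up to floating-point error, so the sigmoid layer plus the default from_logits=False is correct but does redundant work.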