Understanding states of a bidirectional LSTM in a seq2seq model (tf keras)

I am creating a language model: a seq2seq model with 2 bidirectional LSTM layers. I have got the model to train and the accuracy seems good, but while working out the inference model I've found myself a bit confused by the states that are returned by each LSTM layer.
I am using this tutorial as a guide, though the example in this link does not use bidirectional layers: https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
Note: I am using a pretrained word embedding.
lstm_units = 100
# Set up embedding layer using pretrained weights
embedding_layer = Embedding(total_words+1, emb_dimension, input_length=max_input_len, weights=[embedding_matrix], name="Embedding")
# Encoder
encoder_input_x = Input(shape=(None,), name="Enc_Input")
encoder_embedding_x = embedding_layer(encoder_input_x)
encoder_lstm_x, enc_state_h_fwd, enc_state_c_fwd, enc_state_h_bwd, enc_state_c_bwd = Bidirectional(LSTM(lstm_units, dropout=0.5, return_state=True, name="Enc_LSTM1"), name="Enc_Bi1")(encoder_embedding_x)
encoder_states = [enc_state_h_fwd, enc_state_c_fwd, enc_state_h_bwd, enc_state_c_bwd]
# Decoder
decoder_input_x = Input(shape=(None,), name="Dec_Input")
decoder_embedding_x = embedding_layer(decoder_input_x)
decoder_lstm_layer = Bidirectional(LSTM(lstm_units, return_state=True, return_sequences=True, dropout=0.5, name="Dec_LSTM1"))
decoder_lstm_x, _, _, _, _= decoder_lstm_layer(decoder_embedding_x, initial_state=encoder_states)
decoder_dense_layer = TimeDistributed(Dense(total_words+1, activation="softmax", name="Dec_Softmax"))
decoder_output_x = decoder_dense_layer(decoder_lstm_x)
model = Model(inputs=[encoder_input_x, decoder_input_x], outputs=decoder_output_x)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
I believe the diagram of the model looks like this, with 60 time steps:
I want the encoder to pass the enc_state_h_fwd and enc_state_c_fwd forward to the decoder. This connection is highlighted by the orange arrow.
But since the model is bidirectional, I have some questions:
Do I need to pass the decoder states backwards to the encoder? And how would one possibly do this? It seems like a chicken-and-egg scenario.
The encoder_states that come out of the encoder LSTM layer contain 4 states: the h and c states going forward and backward. I feel like the "backward" states are denoted in my diagram by the pink arrow going left out of the encoder. I am passing these to the decoder, but why does it need them? Am I incorrectly connecting the pink arrow on the left to the purple arrow going into the decoder from the right?

This model is not valid. It is set up as a translation model, which during inference would predict one word at a time: starting with the start-of-sequence token to predict y1, then looping and feeding in the start-of-sequence token plus y1 to get y2, and so on.
A bidirectional LSTM cannot be used for real-time prediction in a many-to-many setup unless the entire decoder input is available. In this case, the decoder input only becomes available one predicted step at a time, so the first prediction (of y1) is invalid without the rest of the sequence (y2-yt).
The decoder should therefore not be a bidirectional LSTM.
As for the states, the encoder Bidirectional LSTM does indeed output h and c states going forward (orange arrow), and h and c states going backward (pink arrow).
By concatenating these states and feeding them to the decoder, we can give the decoder more information. This is possible as we do have the entire encoder input at time of inference.
Also to be noted: the bidirectional encoder with lstm_units (e.g. 100) effectively has 200 LSTM units, half going forward and half going backward. To feed these states into the decoder, the decoder must have 200 units too.
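For what it's worth, here is a minimal sketch of that wiring (assuming tf.keras imports and reusing the variable names from the question's encoder, e.g. embedding_layer, enc_state_h_fwd): it concatenates the forward and backward states and feeds them into a unidirectional decoder LSTM with 2 * lstm_units.
from tensorflow.keras.layers import Concatenate, Input, LSTM, Dense, TimeDistributed
# concatenate forward and backward states -> shape (batch, 2 * lstm_units)
state_h = Concatenate(name="Enc_State_H")([enc_state_h_fwd, enc_state_h_bwd])
state_c = Concatenate(name="Enc_State_C")([enc_state_c_fwd, enc_state_c_bwd])
encoder_states = [state_h, state_c]
# unidirectional decoder sized to match the concatenated encoder states
decoder_input_x = Input(shape=(None,), name="Dec_Input")
decoder_embedding_x = embedding_layer(decoder_input_x)
decoder_lstm_layer = LSTM(2 * lstm_units, return_sequences=True, return_state=True, dropout=0.5, name="Dec_LSTM1")
decoder_lstm_x, _, _ = decoder_lstm_layer(decoder_embedding_x, initial_state=encoder_states)
decoder_output_x = TimeDistributed(Dense(total_words + 1, activation="softmax"), name="Dec_Softmax")(decoder_lstm_x)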

Related

Keras: Why is LSTM much faster than SimpleRNN during training

When I tried using SimpleRNN vs LSTM, I found the SimpleRNN training had an ETA of 30 min, whereas the LSTM had an ETA of 20 seconds. But SimpleRNN should have fewer operations than LSTM. What is causing this huge difference? Am I using SimpleRNN wrong?
import tensorflow as tf
SEQUENCE_LENGTH = 80
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words = 2000)
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=SEQUENCE_LENGTH)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=SEQUENCE_LENGTH)
model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(2000, 128),
    tf.keras.layers.SimpleRNN(8),
    # tf.keras.layers.LSTM(8),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=["accuracy"],
)
model.fit(x_train, y_train, batch_size=32, epochs=1)
Simple RNN: there is a simple multiplication of the input (xt) and the previous output (ht-1), passed through a tanh activation function. No gates are present.
Recurrent neural networks (RNNs) have a recurrent connection in which the output is fed back to the RNN neuron rather than only being passed on to the next node. Each node in the RNN model functions as a memory cell, continuing calculation and operation implementation. An RNN remembers every piece of information through time. RNNs have feedback loops in the recurrent layer, which lets them maintain information in 'memory' over time. But it can be difficult to train standard RNNs to solve problems that require learning long-term temporal dependencies. This is because the gradient of the loss function decays exponentially with time (the vanishing gradient problem).
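As a rough illustration (plain NumPy, parameter names are just illustrative, not Keras internals), a single SimpleRNN step is essentially:
import numpy as np

def simple_rnn_step(x_t, h_prev, W, U, b):
    # one multiplication of the input and one of the previous output, then tanh; no gates
    return np.tanh(x_t @ W + h_prev @ U + b)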
LSTM: LSTMs deal with the vanishing and exploding gradient problems by introducing new gates, such as the input (i) and forget (f) gates, which allow better control over the gradient flow, regulate how the cell state is updated, and enable better preservation of "long-range dependencies".
LSTM tackles gradient vanishing by ignoring useless data/information in the network. If there is no valuable data from the other inputs (the previous words of the sentence), the LSTM will forget that data.
It contains four networks activated by either the sigmoid function (σ) or the tanh function, each with its own set of parameters.
Forget gate layer (f): decides which information to forget from the cell state.
Input gate layer (i): this could also be a remember gate. It decides which of the new candidates are relevant for this time step.
New candidate gate layer (n): creates a new set of candidates to be stored in the cell state.
Output gate layer (o): determines which parts of the cell state are output.
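In the same rough NumPy style (illustrative only, not the actual Keras implementation), one LSTM step with those four gates looks like:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold the parameters of the four gates stacked side by side
    z = x_t @ W + h_prev @ U + b
    f, i, n, o = np.split(z, 4, axis=-1)
    f = sigmoid(f)                  # forget gate: what to drop from the cell state
    i = sigmoid(i)                  # input (remember) gate: which candidates to keep
    n = np.tanh(n)                  # new candidate values
    o = sigmoid(o)                  # output gate: which parts of the cell state to expose
    c_t = f * c_prev + i * n        # updated cell state
    h_t = o * np.tanh(c_t)          # new hidden state / output
    return h_t, c_t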

Using Pooling Layers in an LSTM Autoencoder

I am attempting to create an LSTM denoising autoencoder for use on long time series (100,000+ points) in Python using Tensorflow. I have shied away from the typical LSTM Autoencoder structure, where the information is rolled up into a single vector at the final time step and then fed into the decoder. The reason I have avoided this is that to achieve a sensible compression ratio in the encoding many neurons would be required in the final encoding layer.
My architecture uses max pooling layers after each LSTM layer to spread the compression across series length and number of neurons. The encoded representation is then taken to be the series of vectors outputted by the final layer of the encoder, not just the final vector. I have been training the model using the Adam optimiser. This architecture is perhaps better explained by the following code.
import tensorflow as tf

series_input = tf.keras.layers.Input(shape=(2**14, 64), name='series_input')
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16, return_sequences=True, activation='relu'))(series_input)
x = tf.keras.layers.MaxPool1D(4)(x)
x = tf.keras.layers.Bidirectional( tf.keras.layers.LSTM(8, return_sequences=True, activation='relu'))(x)
encoded = tf.keras.layers.MaxPool1D(4)(x)
x = tf.keras.layers.Bidirectional( tf.keras.layers.LSTM(8, return_sequences=True, activation='relu'))(encoded)
x = tf.keras.layers.UpSampling1D(4)(x)
x = tf.keras.layers.Bidirectional( tf.keras.layers.LSTM(16, return_sequences=True, activation='relu'))(x)
x = tf.keras.layers.UpSampling1D(4)(x)
x = tf.keras.layers.Bidirectional( tf.keras.layers.LSTM(1, return_sequences=True, activation='relu'))(x)
decoded = tf.keras.layers.Dense(1, activation='relu')(x)
autoencoder = tf.keras.Model(inputs=[series_input], outputs=[decoded])
autoencoder.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3,
                                                       clipvalue=10,
                                                       decay=0.005),
                    loss='mean_squared_error',
                    metrics=['accuracy'])
My initial attempts at training the model seem to show that it is a viable architecture, and able to learn sensible denoising strategies. The problem, however, is that this configuration drastically slows the training process. To add some colour, it trains significantly more slowly than other stacked LSTM structures—a variation on this architecture took around a day to complete a single epoch when running on 4 V100s. What is the reason for this? What could be done to improve the computational efficiency of this model? I am open to any suggestions and happy to hear them. Thank you for taking the time to reach the end of this post.

How to pass latent vector to decoder in LSTM Variational Autoencoder

I'm trying to write my own LSTM Variational Autoencoder for text, and have gotten an OK understanding of how the encoding step works and how I perform sampling of the latent vector Z. The problem is now how I should pass Z on to the decoder. For the input to the decoder I have a start token <s>, which leaves the hidden state h and the cell state c for the LSTM cell in the decoder.
Should I make both the initial states h and c equal to Z, just one of them, or something else?
Using RepeatVector you can repeat the latent output n times. Then, feed it into the LSTM. Here is a minimal example:
from tensorflow.keras.layers import Input, RepeatVector, LSTM
from tensorflow.keras.models import Model

# latent_dim: int, latent z-layer shape
# timesteps, intermediate_dim, input_dim: ints taken from your encoder setup
decoder_input = Input(shape=(latent_dim,))
_h_decoded = RepeatVector(timesteps)(decoder_input)   # repeat z for every time step
decoder_h = LSTM(intermediate_dim, return_sequences=True)
_h_decoded = decoder_h(_h_decoded)
decoder_mean = LSTM(input_dim, return_sequences=True)
_x_decoded_mean = decoder_mean(_h_decoded)
decoder = Model(decoder_input, _x_decoded_mean)
It is clearly explained in the Keras documentation.

Stacked autoencoders for data denoising with keras not training the encoder?

I looked at several samples on the web for building a stacked autoencoder for data denoising, but I don't seem to understand a fundamental part of the encoder:
https://blog.keras.io/building-autoencoders-in-keras.html
Following the examples I built the autoencoder like that:
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# timesteps: int, length of each input window
inputs = Input(shape=(timesteps, 50))
encoded1 = Dense(30, activation="relu")(inputs)
encoded2 = Dense(15, activation="relu")(encoded1)
encoded3 = Dense(5, activation="relu")(encoded2)
decoded1 = Dense(15, activation="relu")(encoded3)
decoded2 = Dense(30, activation="relu")(decoded1)
decoded = Dense(50, activation="sigmoid")(decoded2)
autoencoder = Model(inputs=inputs, outputs=decoded)
encoder = Model(inputs, encoded3)
autoencoder.compile(loss='mse', optimizer='rmsprop')
autoencoder.fit(trainX,
                trainX,
                epochs=epochs,
                batch_size=512,
                callbacks=callbacks,
                validation_data=(trainX, trainX))
In the examples there is mostly a model with the encoder and a separate model with the decoder. I always see that only the decoder model gets trained; the encoder is not trained. But for my use case I only need the encoder model to denoise the data. Why does the encoder need no training?
Your interpretation of the encoder-decoder is wrong. The encoder encodes your input data into some high-dimensional representation, which is abstract but very powerful if you want to use it as features for further prediction. To make sure the encoded output is as close as possible to your actual input, you have a decoder which decodes the encoded high-dimensional representation back to the original input. During training, both the encoder and the decoder are involved, i.e. the weights of the encoder layers and the decoder layers are both updated. If the encoder were not trained, how would it learn the encoding mechanism? During inference, you use only the encoder module, as you want to encode the input.
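If you want to convince yourself, here is a quick sanity check (a sketch reusing the variable names from the question): the encoder model is built from the same layer objects as the autoencoder, so fitting the autoencoder changes the encoder's weights too.
# encoder.layers[0] is the InputLayer, so layers[1] is the first Dense layer
w_before = encoder.layers[1].get_weights()[0].copy()   # kernel before training
autoencoder.fit(trainX, trainX, epochs=1, batch_size=512)
w_after = encoder.layers[1].get_weights()[0]
print((w_before != w_after).any())   # True: the shared encoder layers were updated by training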

LSTM with Condition

I'm studying LSTM with CNN in tensorflow.
I want to put some scalar label into the LSTM network as a condition.
Does anybody know of an LSTM variant that does what I mean?
If one is available, please let me know how to use it.
Thank you.
This thread might interest you: Adding Features To Time Series Model LSTM.
You have basically 3 possible ways:
Let's take an example with weather data from two different cities: Paris and San Francisco. You want to predict the next temperature based on historical data. But at the same time, you expect the weather to change based on the city. You can either:
Combine the auxiliary features with the time series data, at the beginning or at the end (ugly!).
Concatenate the auxiliary features with the output of the RNN layer. It's some kind of post-RNN adjustment since the RNN layer won't see this auxiliary info.
Or just initialize the RNN states with a learned representation of the condition (e.g. Paris or San Francisco); a minimal sketch of this option is shown right after this list.
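Here is that sketch in plain tf.keras (this is not the cond_rnn API, and the layer names and sizes are just placeholders): the condition, e.g. a one-hot city id, is mapped through Dense layers to the initial h and c states of the LSTM.
import tensorflow as tf
from tensorflow.keras import layers, Model

units = 32
series_in = layers.Input(shape=(None, 1), name="temperature_series")
city_in = layers.Input(shape=(2,), name="city_one_hot")       # Paris / San Francisco
init_h = layers.Dense(units, name="cond_to_h")(city_in)       # learned representation of the condition
init_c = layers.Dense(units, name="cond_to_c")(city_in)
x = layers.LSTM(units)(series_in, initial_state=[init_h, init_c])
next_temp = layers.Dense(1, name="next_temperature")(x)
model = Model([series_in, city_in], next_temp)
model.compile(optimizer="adam", loss="mse")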
I wrote a library to condition on auxiliary inputs. It abstracts all the complexity and has been designed to be as user-friendly as possible:
https://github.com/philipperemy/cond_rnn/
The implementation is in tensorflow (>=1.13.1) and Keras.
Hope it helps!
Here's an example of applying a CNN and an LSTM over the output probabilities of a sequence, as you asked:
def build_model(inputs):
    BATCH_SIZE = 4
    NUM_CLASSES = 2
    NUM_UNITS = 128
    H = 224
    W = 224
    C = 3
    TIME_STEPS = 4

    # inputs is assumed to be of shape (BATCH_SIZE, TIME_STEPS, H, W, C)
    # reshape your input so that you can apply the CNN to all images at once
    input_cnn_reshaped = tf.reshape(inputs, (-1, H, W, C))

    # define the CNN, for instance VGG-16
    cnn_logits_output, _ = vgg_16(input_cnn_reshaped, num_classes=NUM_CLASSES)
    cnn_probabilities_output = tf.nn.softmax(cnn_logits_output)

    # reshape back to the time-series convention
    cnn_probabilities_output = tf.reshape(cnn_probabilities_output, (BATCH_SIZE, TIME_STEPS, NUM_CLASSES))

    # run an LSTM over the per-image probabilities
    cell = tf.contrib.rnn.LSTMCell(NUM_UNITS)
    _, state = tf.nn.dynamic_rnn(cell, cnn_probabilities_output, dtype=tf.float32)

    # apply an FC layer over the last hidden state (state is an LSTMStateTuple)
    logits = tf.layers.dense(state.h, NUM_CLASSES)
    # logits is of shape (BATCH_SIZE, NUM_CLASSES)
    return logits
By the way, a better approach would be to employ the LSTM over the last hidden layer, i.e. to use the CNN as a feature extractor and make the prediction over sequences of features.
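A hedged tf.keras sketch of that suggestion (using MobileNetV2 as a stand-in feature extractor instead of VGG-16, with the same illustrative sizes as above):
import tensorflow as tf

TIME_STEPS, H, W, C, NUM_CLASSES = 4, 224, 224, 3, 2
cnn = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg", input_shape=(H, W, C))
frames = tf.keras.Input(shape=(TIME_STEPS, H, W, C))
features = tf.keras.layers.TimeDistributed(cnn)(frames)   # (batch, TIME_STEPS, feature_dim)
x = tf.keras.layers.LSTM(128)(features)                   # LSTM over per-frame feature vectors
logits = tf.keras.layers.Dense(NUM_CLASSES)(x)
model = tf.keras.Model(frames, logits)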