In Keras, what exactly am I configuring when I create a stateful `LSTM` layer with N `units`?

The first argument of a normal Dense layer is also units, and it is the number of neurons/nodes in that layer. A standard LSTM unit, however, looks like the following:
(The diagram is a reworked version of the one in "Understanding LSTM Networks".)
In Keras, when I create an LSTM object like this LSTM(units=N, ...), am I actually creating N of these LSTM units? Or is it the size of the "Neural Network" layers inside the LSTM unit, i.e., the W's in the formulas? Or is it something else?
For context, I'm working based on this example code.
The following is the documentation: https://keras.io/layers/recurrent/
It says:
units: Positive integer, dimensionality of the output space.
It makes me think it is the number of outputs from the Keras LSTM "layer" object, meaning the next layer will have N inputs. Does that mean there actually exist N of these LSTM units in the LSTM layer, or maybe that exactly one LSTM unit is run for N iterations, outputting N of these h[t] values, from, say, h[t-N] up to h[t]?
If it only defines the number of outputs, does that mean the input can still be, say, just one, or do we have to manually create lagging input variables x[t-N] to x[t], one for each LSTM unit defined by the units=N argument?
As I'm writing this, it occurs to me that this might be what the argument return_sequences does: if set to True, all N outputs are passed forward to the next layer, while if it is set to False, only the last h[t] output is passed to the next layer. Am I right?

You can check this question for further information, although it is based on the Keras 1.x API.
Basically, units is the dimension of the inner cells in the LSTM. Because in an LSTM the inner cell (C_t and C_{t-1} in the graph), the output gate (o_t in the graph) and the hidden/output state (h_t in the graph) must all have the SAME dimension, your output's dimension has to be unit-length as well.
An LSTM in Keras defines exactly one LSTM block, whose cells are of unit-length. If you set return_sequences=True, it will return something with shape (batch_size, timespan, unit). If False, it just returns the last output, with shape (batch_size, unit).
As for the input, you should provide input for every timestep. Basically, the shape is (batch_size, timespan, input_dim), where input_dim can be different from unit. If you just want to provide input at the first step, you can simply pad your data with zeros at the other time steps.
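For illustration, here is a minimal sketch (standalone Keras 2.x; the sizes are arbitrary example values, not taken from the question) showing the two output shapes:
from keras.models import Sequential
from keras.layers import LSTM

timespan, input_dim, unit = 7, 3, 5

# return_sequences=True: one h_t per timestep
m1 = Sequential()
m1.add(LSTM(unit, return_sequences=True, input_shape=(timespan, input_dim)))
print(m1.output_shape)  # (None, 7, 5) -> (batch_size, timespan, unit)

# return_sequences=False: only the last h_t
m2 = Sequential()
m2.add(LSTM(unit, return_sequences=False, input_shape=(timespan, input_dim)))
print(m2.output_shape)  # (None, 5) -> (batch_size, unit)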

Does that mean there actually exist N of these LSTM units in the LSTM layer, or maybe that exactly one LSTM unit is run for N iterations outputting N of these h[t] values, from, say, h[t-N] up to h[t]?
The first is true. In that Keras LSTM layer there are N LSTM units, or cells.
keras.layers.LSTM(units, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0, implementation=1, return_sequences=False, return_state=False, go_backwards=False, stateful=False, unroll=False)
If you plan to create a simple LSTM layer with 1 cell, this would be your model:
from keras.models import Sequential
from keras.layers import LSTM

N = 1
model = Sequential()
model.add(LSTM(N))  # a single LSTM layer whose state and output vectors have length 1
For the other models you would need N>1

How many instances of "LSTM chains" do you get?
The proper intuitive explanation of the 'units' parameter for Keras recurrent neural networks is that with units=1 you get an RNN as described in textbooks, and with units=n you get a layer which consists of n independent copies of such an RNN: they'll have identical structure, but since they'll be initialized with different weights, they'll compute something different.
Alternatively, you can consider that in an LSTM with units=1 the key values (f, i, C, h) are scalars, and with units=n they'll be vectors of length n.
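One way to check this intuition is to count the layer's weights. A minimal sketch with arbitrary example sizes, using the standard Keras LSTM parameter count 4 * units * (units + input_dim + 1):
from keras.models import Sequential
from keras.layers import LSTM

input_dim, units = 3, 8
model = Sequential()
model.add(LSTM(units, input_shape=(None, input_dim)))  # variable number of timesteps

# Four gates (f, i, candidate C, o), each producing a vector of length `units`
print(model.count_params())                 # 384
print(4 * units * (units + input_dim + 1))  # 384 as well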

"Intuitively" just like a dense layer with 100 dim (Dense(100)) will have 100 neurons. Same way LSTM(100) will be a layer of 100 'smart neurons' where each neuron is the figure you mentioned and the output will be a vector of 100 dimensions

Related

How to do operations on the hidden vector of the decoder at every timestep and append it to the input of the next LSTM unit

To implement attention in an encoder-decoder, we have to take the hidden vector of an LSTM unit of the decoder and do several operations on it to compute the attention weights. Now, my question is: how can I individually take each hidden vector out of each LSTM unit in Keras?
This is how we initialize an LSTM layer in Keras:
lstm_layer = LSTM(num_units)(inputs)
Now, there would be many LSTM units being initialized in this layer. How can I take each LSTM unit's hidden vector, do some operations on it, and concatenate it with the input to the next LSTM unit?
Note: I know that we can extract the hidden vectors of all the LSTM units by setting return_sequences=True. But I want to take the hidden vector of each LSTM unit out, do some operations on it, and concatenate it with the input to the next LSTM unit.
Edit: By "I want to take the hidden vector of all the LSTM units", what I mean is this: suppose there are n timesteps in total, so there will be n LSTM units. Now I want to take the output (i.e. the hidden vector) of the LSTM from timestep 0, do some operations on it (not to worry about this part, as it is to be implemented by the reader based on which operations you want to do), and then concatenate it with the input to the LSTM at timestep 1, and so on for every LSTM unit. So, in general: take the hidden state of the LSTM from timestep t, do some operations on it, and concatenate it with the input to the LSTM at timestep t+1.
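One common way to get this level of control without modifying Keras internals is to drive an LSTMCell manually, one timestep at a time. The sketch below (tf.keras in eager mode; my_op and all sizes are hypothetical placeholders, not from the question) transforms h_t and concatenates it onto the input at t+1:
import tensorflow as tf
from tensorflow.keras.layers import LSTMCell

num_units, input_dim, timesteps, batch = 32, 8, 10, 4
cell = LSTMCell(num_units)

def my_op(h):              # placeholder for whatever you do to the hidden vector
    return h

inputs = tf.random.normal((batch, timesteps, input_dim))
states = [tf.zeros((batch, num_units)), tf.zeros((batch, num_units))]  # [h, c]
extra = tf.zeros((batch, num_units))   # transformed h from the previous step

outputs = []
for t in range(timesteps):
    step_in = tf.concat([inputs[:, t, :], extra], axis=-1)  # concat with the next input
    h, states = cell(step_in, states)                       # one LSTM step
    extra = my_op(h)                                        # your operations on h_t
    outputs.append(h)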

Dropout layer before or after LSTM. What is the difference?

Suppose that we have an LSTM model for time series forecasting. Also, this is a multivariate case, so we're using more than one feature for training the model.
ipt = Input(shape=(shape[0], shape[1]))
x = Dropout(0.3)(ipt) ## Dropout before LSTM.
x = CuDNNLSTM(10, return_sequences = False)(x)
out = Dense(1, activation='relu')(x)
We can add a Dropout layer before the LSTM (like in the code above) or after the LSTM.
If we add it before LSTM, is it applying dropout on timesteps (different lags of time series), or different input features, or both of them?
If we add it after LSTM and because return_sequences is False, what is dropout doing here?
Is there any difference between the dropout option of the LSTM and a Dropout layer before the LSTM layer?
By default, Dropout creates a random tensor of zeros and ones. No pattern, no privileged axis. So you can't say a specific thing is being dropped, just random coordinates in the tensor. (Well, it drops features, but different features for each step, and differently for each sample.)
You can, if you want, use the noise_shape property, which will define the shape of the random tensor. Then you can select whether you want to drop steps, features or samples, or maybe a combination; see the sketch a few lines down.
Dropping time steps: noise_shape = (1,steps,1)
Dropping features: noise_shape = (1,1, features)
Dropping samples: noise_shape = (None, 1, 1)
There is also the SpatialDropout1D layer, which uses noise_shape = (input_shape[0], 1, input_shape[2]) automatically. This drops the same feature for all time steps, but treats each sample individually (each sample will drop a different group of features).
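A minimal sketch of those options (standalone Keras; the batch size is left as None, and 20 steps with 8 features are made-up sizes):
from keras.layers import Input, Dropout, SpatialDropout1D

ipt = Input(shape=(20, 8))  # (batch, steps, features)

drop_steps    = Dropout(0.3, noise_shape=(1, 20, 1))(ipt)    # drops whole time steps
drop_features = Dropout(0.3, noise_shape=(1, 1, 8))(ipt)     # drops whole features
drop_samples  = Dropout(0.3, noise_shape=(None, 1, 1))(ipt)  # drops whole samples

spatial = SpatialDropout1D(0.3)(ipt)  # same features dropped for all steps, per sample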
After the LSTM you have shape (None, 10). So you use Dropout the same way you would in any fully connected network: it drops a different group of features for each sample.
A dropout passed as an argument to the LSTM is quite different. It generates 4 different dropout masks, creating different inputs for each of the different gates. (You can see the LSTMCell code to check this.)
Also, there is the option of recurrent_dropout, which will generate 4 dropout masks, but to be applied to the states instead of the inputs, each step of the recurrent calculations.
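For comparison, a one-line sketch of those built-in arguments (the sizes are illustrative):
from keras.layers import LSTM

# `dropout` masks the inputs fed to the 4 gates; `recurrent_dropout` masks the
# recurrent state fed to the 4 gates at every step of the recurrence.
layer = LSTM(10, dropout=0.2, recurrent_dropout=0.2, return_sequences=False)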
You are confusing Dropout with its variant SpatialDropoutND (either 1D, 2D or 3D). See the documentation.
Dropout applies a random binary mask to the input, no matter the shape, except for the first dimension (batch), so it applies to both features and timesteps in this case.
Here, since return_sequences=False, you only get the output from the last timestep, so it would be of size [batch, 10] in your case. Dropout will randomly drop values from the second dimension.
Yes, there is a difference, as the LSTM's dropout is applied per time step while the LSTM processes the sequence (e.g. a sequence of 10 steps goes through the unrolled LSTM and some of the features are dropped before going into the next cell). A Dropout layer would drop random elements (except along the batch dimension). SpatialDropout1D would drop entire channels; in this case some timesteps would be entirely dropped out (in the convolution case, you could use SpatialDropout2D to drop channels, either at the input or along the network).

Difference between Dense(2) and Dense(1) as the final layer of a binary classification CNN?

In a CNN for binary classification of images, should the shape of the output be (number of images, 1) or (number of images, 2)? Specifically, here are 2 kinds of last layer in a CNN:
keras.layers.Dense(2, activation = 'softmax')(previousLayer)
or
keras.layers.Dense(1, activation = 'softmax')(previousLayer)
In the first case, for every image there are 2 output values (probability of belonging to group 1 and probability of belonging to group 2). In the second case, each image has only 1 output value, which is its label (0 or 1, label=1 means it belongs to group 1).
Which one is correct? Is there an intrinsic difference? I don't want to recognize any object in those images, just divide them into 2 groups.
Thanks a lot!
The first one is the correct solution:
keras.layers.Dense(2, activation = 'softmax')(previousLayer)
Usually, we use the softmax activation function for classification tasks, and the output width is the number of categories. This means that if you want to classify one object into three categories with the labels A, B, or C, you would need to make the Dense layer generate an output with a shape of (None, 3). Then you can use the cross-entropy loss function to calculate the loss, automatically compute the gradient, and do the back-propagation.
If you only generate one value with the Dense layer, you get a tensor with a shape of (None, 1) - a single numeric value, like in a regression task - and you use the value of the output to represent the category. That answer is correct, but it does not follow the general approach to a classification task.
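As a small illustration of the (None, 3) softmax setup described above (the layer sizes and input width are made up):
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(100,)))
model.add(Dense(3, activation='softmax'))   # output shape (None, 3) for labels A, B, C
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])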
The difference is if the class probabilities are independent of each other (multi-label classification) or not.
When there are 2 classes and you generally have P(c=1) + P(c=0) = 1 then
keras.layers.Dense(2, activation = 'softmax')
keras.layers.Dense(1, activation = 'sigmoid')
both are correct in terms of class probabilities; the only difference is how you supply the labels during training. But
keras.layers.Dense(2, activation = 'sigmoid')
is incorrect in that context. However, it is a correct implementation if you have P(c=1) + P(c=0) != 1, which is the case for multi-label classification, where an instance may belong to more than one correct class.
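A minimal sketch of the two equivalent binary setups (the input width of 100 is a made-up example); note that only the label format and the loss change:
from keras.layers import Input, Dense
from keras.models import Model

ipt = Input(shape=(100,))

# Option A: 2 outputs + softmax, labels as one-hot pairs
out2 = Dense(2, activation='softmax')(ipt)
model_a = Model(ipt, out2)
model_a.compile(optimizer='adam', loss='categorical_crossentropy')

# Option B: 1 output + sigmoid, labels as 0/1 scalars
out1 = Dense(1, activation='sigmoid')(ipt)
model_b = Model(ipt, out1)
model_b.compile(optimizer='adam', loss='binary_crossentropy')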

Getting a Keras LSTM layer to accept two inputs?

I'm working with padded sequences of maximum length 50. I have two types of sequence data:
1) A sequence, seq1, of integers (1-100) that correspond to event types (e.g. [3, 6, 3, 1, 45, 45, ..., 3]).
2) A sequence, seq2, of integers representing time, in minutes, from the last event in seq1. So the last element is zero, by definition. For example, [100, 96, 96, 45, 44, 12, ..., 0]. seq1 and seq2 are the same length, 50.
I'm trying to run the LSTM primarily on the event/seq1 data, but have the time/seq2 data strongly influence the forget gate within the LSTM. The reason for this is that I want the LSTM to really penalize older events and be more likely to forget them. I was thinking about multiplying the forget weight by the inverse of the current value of the time/seq2 sequence, or maybe 1/(seq2_element + 1), to handle cases where it's zero minutes.
I see in the Keras code (the LSTMCell class) where the change would have to be:
f = self.recurrent_activation(x_f + K.dot(h_tm1_f,self.recurrent_kernel_f))
So I need to modify the Keras LSTM code to accept multiple inputs. As an initial test, within the LSTMCell class, I changed the call function to look like this:
def call(self, inputs, states, training=None):
    time_input = inputs[1]
    inputs = inputs[0]
So that it can handle two inputs given as a list.
When I try running the model with the Functional API:
# Input 1: event type sequences
# Take the event integer sequences, run them through an embedding layer to get float vectors, then run through LSTM
main_input = Input(shape =(max_seq_length,), dtype = 'int32', name = 'main_input')
x = Embedding(output_dim = embedding_length, input_dim = num_unique_event_symbols, input_length = max_seq_length, mask_zero=True)(main_input)
## Input 2: time vectors
auxiliary_input = Input(shape=(max_seq_length,1), dtype='float32', name='aux_input')
m = Masking(mask_value = 99999999.0)(auxiliary_input)
lstm_out = LSTM(32)(x, time_vector = m)
# Auxiliary loss here from first input
auxiliary_output = Dense(1, activation='sigmoid', name='aux_output')(lstm_out)
# An arbitrary number of dense, hidden layers here
x = Dense(64, activation='relu')(lstm_out)
# The main output node
main_output = Dense(1, activation='sigmoid', name='main_output')(x)
## Compile and fit the model
model = Model(inputs=[main_input, auxiliary_input], outputs=[main_output, auxiliary_output])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'], loss_weights=[1., 0.2])
print(model.summary())
np.random.seed(21)
model.fit([train_X1, train_X2], [train_Y, train_Y], epochs=1, batch_size=200)
However, I get the following error:
An `initial_state` was passed that is not compatible with `cell.state_size`. Received `state_spec`=[InputSpec(shape=(None, 50, 1), ndim=3)]; however `cell.state_size` is (32, 32)
Any advice?
You can't pass a list of inputs to the default recurrent layers in Keras. The input_spec is fixed and the recurrent code is implemented around a single tensor input, as also pointed out in the documentation, i.e. it doesn't magically iterate over 2 inputs with the same timesteps and pass them to the cell. This is partly because of how the iterations are optimised and the assumptions made if the network is unrolled, etc.
If you want 2 inputs, you can pass constants (doc) to the cell, which will pass the tensor along as is. This is mainly meant to support attention models in the future. So 1 input will be iterated over timesteps while the other will not. If you really want 2 inputs to be iterated like a zip() in Python, you will have to implement a custom layer.
I would like to throw in a couple of different ideas here. They don't require you to modify the Keras code.
After the embedding layer of the event types, stack the embeddings with the elapsed time. The Keras layer is keras.layers.Concatenate(axis=-1). Imagine this: a single event type is mapped to an n-dimensional vector by the embedding layer. You just add the elapsed time as one more dimension after the embedding so that it becomes an (n+1)-dimensional vector.
Another idea, sort of related to your problem/question and which may help here, is 1D convolution. The convolution can happen right after the concatenated embeddings. The intuition for applying convolution to event types and elapsed time is really that of a 1x1 convolution: you linearly combine the two together and the parameters are trained. Note that in terms of convolution, the dimensions of the vectors are called channels. Of course, you can also convolve more than 1 event at a step. Just try it. It may or may not help.
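A rough sketch combining both ideas (everything except max_seq_length is an illustrative value, and mask_zero from the original snippet is left out to keep the sketch simple): concatenate the elapsed-time channel onto the event embeddings, optionally mix them with a kernel_size=1 convolution, then feed a standard LSTM, with no changes to the Keras internals.
from keras.layers import Input, Embedding, Concatenate, Conv1D, LSTM, Dense
from keras.models import Model

max_seq_length, num_unique_event_symbols, embedding_length = 50, 101, 16

events = Input(shape=(max_seq_length,), dtype='int32')
times  = Input(shape=(max_seq_length, 1), dtype='float32')

emb = Embedding(num_unique_event_symbols, embedding_length)(events)
x = Concatenate(axis=-1)([emb, times])          # (batch, 50, embedding_length + 1)
x = Conv1D(embedding_length, kernel_size=1)(x)  # optional 1x1 "mixing" convolution
x = LSTM(32)(x)
out = Dense(1, activation='sigmoid')(x)

model = Model([events, times], out)
model.compile(optimizer='adam', loss='binary_crossentropy')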

Tensorflow: What's the difference between tf.nn.dropout and tf.contrib.rnn.DropoutWrapper?

How are the following code snippets different?
with tf.contrib.rnn.DropoutWrapper
enc_cell = tf.contrib.rnn.MultiRNNCell([tf.contrib.rnn.DropoutWrapper(tf.contrib.rnn.BasicLSTMCell(rnn_size), output_keep_prob=1 - keep_prob) for _ in range(num_layers)])
_, encoding_state = tf.nn.dynamic_rnn(enc_cell, rnn_inputs, dtype=tf.float32)
with tf.nn.dropout
enc_cell = tf.contrib.rnn.MultiRNNCell([tf.contrib.rnn.BasicLSTMCell(rnn_size) for _ in range(num_layers)])
_, encoding_state = tf.nn.dynamic_rnn(enc_cell, tf.nn.dropout(rnn_inputs, 1 - keep_prob), dtype=tf.float32)
It seems that there is a difference in the number of states we get from tf.nn.dynamic_rnn: len(encoding_state) is greater with tf.nn.dropout.
An explanation will be highly appreciated.
Thank you.
The idea behind both is the same, and it is dropout: the network "drops" (i.e. does not use) some of its nodes when making a prediction. This reduces the capacity of the model during training to prevent overfitting. Thanks to dropout, the network learns not to rely exclusively on particular nodes for its prediction.
The difference between the two methods is that:
tf.nn.dropout is a generic function to apply dropout to a given input tensor. Looking at the documentation:
Computes dropout.
With probability keep_prob, outputs the input element scaled up by 1 / keep_prob, otherwise outputs 0. The scaling is so that the expected sum is unchanged.
tf.contrib.rnn.DropoutWrapper or tf.nn.rnn_cell.DropoutWrapper is a specific class to define Recurrent Neural Network cells with dropout applied both at the input and the output of the cell. Looking at the documentation:
Operator adding dropout to inputs and outputs of the given cell.
In particular, it uses tf.nn.dropout to mask the input to the cell, the state and the output.
The difference between your two pieces of code is that when you are using tf.nn.dropout you are masking the inputs of the first layer only. In the wrapper case you are masking, layer by layer, the outputs of the cells (since you are providing only the output keep probability).
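For concreteness, a hedged sketch of the wrapper's knobs (TF 1.x contrib API; the sizes are illustrative), next to the single-tensor tf.nn.dropout call:
import tensorflow as tf

rnn_size, num_layers, keep_prob = 64, 2, 0.8

# DropoutWrapper: choose which ends get masked, per layer
cells = [tf.contrib.rnn.DropoutWrapper(tf.contrib.rnn.BasicLSTMCell(rnn_size),
                                        input_keep_prob=keep_prob,
                                        output_keep_prob=keep_prob,
                                        state_keep_prob=keep_prob)
         for _ in range(num_layers)]
enc_cell = tf.contrib.rnn.MultiRNNCell(cells)

# tf.nn.dropout: masks a single tensor, here only the inputs to the first layer
# rnn_inputs = tf.nn.dropout(rnn_inputs, keep_prob)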
I think tf.nn.dropout can only mask one end, like what you did with rnn_inputs.
DropoutWrapper can mask multiple ends of an LSTM cell.