Why not use Flatten followed by a Dense layer instead of TimeDistributed? - tensorflow

I am trying to understand the Keras layers better. I am working on a sequence-to-sequence model where I embed a sentence and pass it to an LSTM that returns sequences. Afterwards, I want to apply a Dense layer to each timestep (word) in the sentence, and it seems like TimeDistributed does the job for three-dimensional tensors like this one.
In my understanding, Dense layers only work for two-dimensional tensors and TimeDistributed just applies the same dense on every timestep in three dimensions. Could one then not simply flatten the timesteps, apply a dense layer and perform a reshape to obtain the same result or are these not equivalent in some way that I am missing?

Imagine you have a batch of 4 time steps, each containing a 3-element vector. Let's represent that with this:
Now you want to transform this batch using a dense layer, so you get 5 features per time step. The output of the layer can be represented as something like this:
You consider two options: a TimeDistributed dense layer, or reshaping the input into a flat vector, applying a dense layer and reshaping back to time steps.
In the first option, you would apply a dense layer with 3 inputs and 5 outputs to every single time step. This could look like this:
Each blue circle here is a unit in the dense layer. By doing this with every input time step you get the total output. Importantly, these five units are the same for all the time steps, so you only have the parameters of a single dense layer with 3 inputs and 5 outputs.
The second option would involve flattening the input into a 12-element vector, applying a dense layer with 12 inputs and 20 outputs, and then reshaping that back. This is how it would look:
Here the input connections of only one unit are drawn for clarity, but every unit would be connected to every input. Here, obviously, you have many more parameters (those of a dense layer with 12 inputs and 20 outputs), and also note that each output value is influenced by every input value, so values in one time step would affect outputs in other time steps. Whether this is something good or bad depends on your problem and model, but it is an important difference with respect to the previous, where each time step input and output were independent. In addition to that, this configuration requires you to use a fixed number of time steps on each batch, whereas the previous works independently of the number of time steps.
You could also consider the option of having four dense layers, each applied independently to its own time step (I didn't draw it but hopefully you get the idea). That would be similar to the previous one, only each unit would receive input connections only from its respective time step inputs. I don't think there is a straightforward way to do that in Keras; you would have to split the input into four, apply a dense layer to each part and merge the outputs. Again, in this case the number of time steps would be fixed.
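To make the parameter comparison above concrete, here is a small sketch in plain Python (no Keras needed). The helper name `dense_params` is made up for illustration; it only counts the weights plus biases of a fully connected layer for the 4-step, 3-feature, 5-output example.

```python
def dense_params(n_inputs, n_outputs):
    """Trainable parameters of a dense layer: weights plus biases."""
    return n_inputs * n_outputs + n_outputs

timesteps, in_features, out_features = 4, 3, 5

# Option 1: one dense layer (3 -> 5) shared by every time step.
shared = dense_params(in_features, out_features)

# Option 2: flatten to 12 inputs, dense layer with 20 outputs, reshape back.
flattened = dense_params(timesteps * in_features, timesteps * out_features)

# Option 3: a separate dense layer (3 -> 5) for each of the 4 time steps.
per_step = timesteps * dense_params(in_features, out_features)

print(shared, flattened, per_step)  # 20 260 80
```

Only the first count is independent of the number of time steps, which is why that option also works with variable-length sequences.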

A Dense layer can act on a tensor of any rank, not only rank 2. I think the TimeDistributed wrapper does not change anything in the way the Dense layer acts: applying a Dense layer directly to a rank-3 tensor does exactly the same as applying a TimeDistributed wrapper around the Dense layer. Here is an illustration:
from tensorflow.keras.layers import *
from tensorflow.keras.models import *
model = Sequential()
model.add(Dense(5,input_shape=(50,10)))
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_5 (Dense) (None, 50, 5) 55
=================================================================
Total params: 55
Trainable params: 55
Non-trainable params: 0
_________________________________________________________________
model1 = Sequential()
model1.add(TimeDistributed(Dense(5),input_shape=(50,10)))
model1.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
time_distributed_3 (TimeDist (None, 50, 5) 55
=================================================================
Total params: 55
Trainable params: 55
Non-trainable params: 0
_________________________________________________________________

Adding to the above answers,
here are a few pictures comparing the output shapes of the two layers. Using one or the other of these layers after an LSTM (for example) would therefore give different behaviors.

"Could one then not simply flatten the timesteps, apply a dense layer and perform a reshape to obtain the same result"
No, flattening timesteps into input dimensions (input_dim) is the wrong operation. As illustrated by yuva-rajulu, if you flatten a 3D input (batch_size, timesteps, input_dim) = (1000, 50, 10), you end up with a flattened input (batch_size, input_dim) = (1000, 500), resulting in a network architecture in which timesteps interact with each other (see jdehesa). This is not what is intended (i.e., we want to apply the same dense layer to each timestep independently).
What needs to be done instead is to reshape the 3D input as (batch_size * timesteps, input_dim) = (50000, 10), then apply the dense layer to this 2D input. That way the same dense layer is applied 50000 times, once to each input vector of length 10, independently. You will end up with a (50000, n_units) output that you should reshape back to a (1000, 50, n_units) output. Fortunately, when you pass a 3D input to a dense layer, Keras does this automatically for you. See the official reference:
"If the input to the layer has a rank greater than 2, then Dense computes the dot product between the inputs and the kernel along the last axis of the inputs and axis 0 of the kernel (using tf.tensordot). For example, if input has dimensions (batch_size, d0, d1), then we create a kernel with shape (d1, units), and the kernel operates along axis 2 of the input, on every sub-tensor of shape (1, 1, d1) (there are batch_size * d0 such sub-tensors). The output in this case will have shape (batch_size, d0, units)."
Another way to see it is that Dense() computes its output simply by applying the kernel, i.e., the weight matrix of size (input_dim, n_units), to the last dimension of your 3D input, treating all other dimensions like batch dimensions, and then sizing the output accordingly.
I think there may have been a time when the TimeDistributed layer was needed in Keras with Dense() (discussion here). Today, we do not need the TimeDistributed wrapper, as Dense() and TimeDistributed(Dense()) do exactly the same thing; see Andrey Kite Gorin or mujjiga.
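The reshape argument above can be checked with a small sketch in plain Python. Nested lists stand in for tensors, and the `dense` helper is a hypothetical stand-in for a dense layer, not the Keras implementation. It compares applying one kernel to every timestep with reshaping to (batch_size * timesteps, input_dim), applying the same kernel, and reshaping back.

```python
import random

def dense(vector, kernel, bias):
    """Apply y = x @ kernel + bias to a single input vector (plain lists)."""
    return [
        sum(x * kernel[i][j] for i, x in enumerate(vector)) + bias[j]
        for j in range(len(bias))
    ]

random.seed(0)
batch_size, timesteps, input_dim, n_units = 2, 3, 4, 5

# Random 3D input of shape (batch_size, timesteps, input_dim).
inputs = [[[random.random() for _ in range(input_dim)]
           for _ in range(timesteps)]
          for _ in range(batch_size)]
# One kernel of shape (input_dim, n_units) and one bias of shape (n_units,).
kernel = [[random.random() for _ in range(n_units)] for _ in range(input_dim)]
bias = [random.random() for _ in range(n_units)]

# Way 1: apply the same dense layer to every timestep of every sample.
per_step = [[dense(step, kernel, bias) for step in sample] for sample in inputs]

# Way 2: reshape to (batch_size * timesteps, input_dim), apply, reshape back.
flat = [step for sample in inputs for step in sample]
flat_out = [dense(row, kernel, bias) for row in flat]
reshaped = [flat_out[b * timesteps:(b + 1) * timesteps]
            for b in range(batch_size)]

print(per_step == reshaped)  # True
```

Both routes use the same 4 x 5 kernel, so the number of parameters does not depend on the number of timesteps.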

Related

calculating attention scores in Bahdanau attention in tensorflow using decoder hidden state and encoder output

This question relates to the neural machine translation shown here: Neural Machine Translation
self.W1 and self.W2 are initialized to dense neural layers of 10 units each, in lines 4 and 5 in the __init__ function of class BahdanauAttention
In the code image attached, I am not sure I understand the feed forward neural network set up in line 17 and line 18. So, I broke this formula down into its parts. See line 23 and line 24.
query_with_time_axis is the input tensor to self.W1 and values is the input to self.W2. Each computes the function Z = WX + b, and the Z's are added together. The dimensions of the tensors added together are (64, 1, 10) and (64, 16, 10). I am assuming random weight initialization for both self.W1 and self.W2 is handled by Keras behind the scenes.
Question:
After adding the Z's together, a non-linearity (tanh) is applied to come up with an activation and this resulting activation is input to the next layer self.V, which is a layer with just one output and gives us the score.
For this last step, we don't apply an activation function (tanh etc) to the result of self.V(tf.nn.tanh(self.W1(query_with_time_axis) + self.W2(values))), to get a single output from this last neural network layer.
Is there a reason why an activation function was not used for this last step?
The outputs of the attention form the so-called attention energies, i.e., one scalar for each encoder output. These numbers get stacked into a vector, and this vector is normalized using softmax, yielding the attention distribution.
So, in fact, there is a non-linearity applied in the next step, which is the softmax. If you used an activation function before the softmax, you would only shrink the space of distributions that the softmax can produce.
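A minimal sketch of that last point, in plain Python with made-up energy values: squashing the energies with tanh before the softmax bounds every score to (-1, 1), so no attention weight can exceed another by more than a factor of e**2.

```python
import math

def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention energies, one per encoder output.
energies = [4.0, 0.5, -3.0]

raw_weights = softmax(energies)
squashed_weights = softmax([math.tanh(e) for e in energies])

raw_ratio = max(raw_weights) / min(raw_weights)
squashed_ratio = max(squashed_weights) / min(squashed_weights)

# Without an extra activation the distribution can be very peaked;
# with tanh the ratio between any two weights stays below e**2.
print(raw_ratio > 1000)              # True
print(squashed_ratio < math.exp(2))  # True
```

The linear output of self.V leaves the energies unbounded, so the model can learn distributions that are as sharp or as flat as it needs.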

Dropout layer before or after LSTM. What is the difference?

Suppose that we have an LSTM model for time series forecasting. Also, this is a multivariate case, so we're using more than one feature for training the model.
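from keras.layers import Input, Dropout, CuDNNLSTM, Dense  # assumes an older Keras release that still ships CuDNNLSTM
ipt = Input(shape=(shape[0], shape[1]))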
x = Dropout(0.3)(ipt) ## Dropout before LSTM.
x = CuDNNLSTM(10, return_sequences = False)(x)
out = Dense(1, activation='relu')(x)
We can add Dropout layer before LSTM (like the above code) or after LSTM.
If we add it before LSTM, is it applying dropout on timesteps (different lags of time series), or different input features, or both of them?
If we add it after LSTM and because return_sequences is False, what is dropout doing here?
Is there any difference between the dropout option in LSTM and a dropout layer before the LSTM layer?
By default, Dropout creates a random tensor of zeros and ones. No pattern, no privileged axis. So, you can't say a specific thing is being dropped, just random coordinates in the tensor. (Well, it drops features, but different features for each step, and differently for each sample.)
You can, if you want, use the noise_shape property, which will define the shape of the random tensor. Then you can select if you want to drop steps, features or samples, or maybe a combination.
Dropping time steps: noise_shape = (1,steps,1)
Dropping features: noise_shape = (1,1, features)
Dropping samples: noise_shape = (None, 1, 1)
There is also the SpatialDropout1D layer, which uses noise_shape = (input_shape[0], 1, input_shape[2]) automatically. This drops the same feature for all time steps, but treats each sample individually (each sample will drop a different group of features).
After the LSTM you have shape = (None, 10). So, you use Dropout the same way you would use in any fully connected network. It drops a different group of features for each sample.
A dropout as an argument to the LSTM has a lot of differences. It generates 4 different dropout masks, for creating different inputs for each of the different gates. (You can see the LSTMCell code to check this).
Also, there is the option of recurrent_dropout, which will generate 4 dropout masks, but to be applied to the states instead of the inputs, each step of the recurrent calculations.
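The effect of the noise_shape settings listed above can be sketched in plain Python. This is an illustration of the broadcasting idea only, not the Keras implementation: the `dropout_mask` helper is hypothetical, concrete sizes replace `None`, and the rescaling of kept values by 1 / (1 - rate) is left out.

```python
import random

def dropout_mask(input_shape, noise_shape, rate, rng):
    """Keep/drop mask of shape input_shape, built by broadcasting a random
    mask of shape noise_shape (an axis of size 1 is shared along that axis)."""
    nb, nt, nf = noise_shape
    noise = [[[rng.random() >= rate for _ in range(nf)]
              for _ in range(nt)]
             for _ in range(nb)]
    batch, steps, features = input_shape
    # Indexing with modulo broadcasts axes of size 1 over the full axis.
    return [[[noise[b % nb][t % nt][f % nf] for f in range(features)]
             for t in range(steps)]
            for b in range(batch)]

rng = random.Random(0)
samples, steps, features = 2, 4, 3
shape = (samples, steps, features)

# Drop whole time steps: one decision per step, shared by samples and features.
drop_steps = dropout_mask(shape, (1, steps, 1), 0.5, rng)
# Drop whole features: one decision per feature, shared by samples and steps.
drop_features = dropout_mask(shape, (1, 1, features), 0.5, rng)
# SpatialDropout1D-like: per sample, the same features are dropped at every step.
spatial = dropout_mask(shape, (samples, 1, features), 0.5, rng)

for row in drop_steps[0]:
    print(row)  # each row is all True or all False
```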
You are confusing Dropout with its variant SpatialDropoutND (either 1D, 2D or 3D). See the documentation (apparently you can't link to a specific class).
Dropout applies a random binary mask to the input, whatever its shape, over every dimension except the first (batch), so it applies to both features and timesteps in this case.
Here, if return_sequences=False, you only get the output from the last timestep, so it would be of size [batch, 10] in your case. Dropout will randomly drop values along the second dimension.
Yes, there is a difference, as the dropout option is applied per time step when the LSTM processes sequences (e.g. a sequence of length 10 goes through the unrolled LSTM and some of the features are dropped before going into the next cell). Dropout would drop random elements (except along the batch dimension). SpatialDropout1D would drop entire channels; in this case some features would be dropped for every timestep (in the convolution case, you could use SpatialDropout2D to drop channels, either at the input or along the network).

Changing the number of interneurons in Dense

I am a novice in TensorFlow and Keras. I have the code below, but I do not know why I get an error when I change the 1 in Dense to 10 (Dense(10)). I think I should be able to change the number of neurons in each layer arbitrarily. How should I change the number of neurons in Dense? And if I want to add more Dense layers, is there any rule for the number of neurons in them?
model=Sequential()
model.add(Dense(1029, input_dim=29))
model.add(Activation('tanh'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
# model.add(Dropout(0.2))
sgd=SGD(lr=0.1)
model.compile(loss='binary_crossentropy', optimizer=sgd)
model.fit(input, target, steps_per_epoch=4, epochs=1000)
error:
ValueError: Error when checking target: expected activation_65 to have shape (10,) but got array with shape (1,)
I figured out the problem and I will post it here for anyone who might face the same issue. The reason is that the number of neurons in the last layer must equal 1 to match my output. My input has 1029 rows and 29 columns and my target has 1029 rows. I can add other Dense layers with an arbitrary number of neurons before the last one.

Does Tensorflows tf.layers.dense flatten input dimensions?

I'm searching for a data leak in my model. I'm using tf.layers.dense before a masking operation and am concerned that the model could just learn to switch positions in the middle dimension of my input tensor.
When I have an input tensor x = tf.ones((2,3,4)) would tf.layers.dense(x,8) flatten x to a fully connected layer with 2*3*4=24 input neurons and 2*3*8=48 output neurons then reshape it again to [2,3,8], or would it create 2*3=6 fully connected layers with 4 input and 8 output neurons then concatenate them?
As for the Keras Dense layer, it has been already mentioned in another answer that its input is not flattened and instead, it is applied on the last axis of its input.
As for the TensorFlow Dense layer, it is actually inherited from Keras Dense layer and as a result, same as Keras Dense layer, it is applied on the last axis of its input.
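For the shapes in the question, this behaviour can be sketched in plain Python. The `dense` helper and the kernel values are made up for illustration; the sketch only mimics "apply one (4 -> 8) kernel along the last axis". Because every position in the middle dimension is processed on its own, swapping input positions only swaps the corresponding output rows, so the layer by itself cannot mix information between positions.

```python
def dense(vector, kernel, bias):
    """Apply y = x @ kernel + bias to a single input vector (plain lists)."""
    return [sum(x * kernel[i][j] for i, x in enumerate(vector)) + bias[j]
            for j in range(len(bias))]

# Shapes from the question: x has shape (2, 3, 4) and the layer has 8 units.
in_dim, units = 4, 8
kernel = [[0.1 * (i + 1) * (j + 1) for j in range(units)]
          for i in range(in_dim)]
bias = [0.0] * units

# One sample with 3 positions in the middle dimension.
sample = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 2.0, 0.0, 0.0],
          [0.0, 0.0, 3.0, 1.0]]
out = [dense(position, kernel, bias) for position in sample]

# Reversing the positions of the input only reverses the output rows.
swapped = [sample[2], sample[1], sample[0]]
out_swapped = [dense(position, kernel, bias) for position in swapped]
print(out_swapped == [out[2], out[1], out[0]])  # True

# Parameter count: a single (4 -> 8) layer, not a (24 -> 48) one.
n_params = in_dim * units + units
print(n_params)  # 40
```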

In Keras, what exactly am I configuring when I create a stateful `LSTM` layer with N `units`?

The first argument of a normal Dense layer is also units, and it is the number of neurons/nodes in that layer. A standard LSTM unit, however, looks like the following:
(This is a reworked version of "Understanding LSTM Networks")
In Keras, when I create an LSTM object like this LSTM(units=N, ...), am I actually creating N of these LSTM units? Or is it the size of the "Neural Network" layers inside the LSTM unit, i.e., the W's in the formulas? Or is it something else?
For context, I'm working based on this example code.
The following is the documentation: https://keras.io/layers/recurrent/
It says:
units: Positive integer, dimensionality of the output space.
It makes me think it is the number of outputs from the Keras LSTM "layer" object, meaning the next layer will have N inputs. Does that mean there actually exist N of these LSTM units in the LSTM layer, or maybe that exactly one LSTM unit is run for N iterations, outputting N of these h[t] values, from, say, h[t-N] up to h[t]?
If it only defines the number of outputs, does that mean the input still can be, say, just one, or do we have to manually create lagging input variables x[t-N] to x[t], one for each LSTM unit defined by the units=N argument?
As I'm writing this it occurs to me what the argument return_sequences does. If set to True all the N outputs are passed forward to the next layer, while if it is set to False it only passes the last h[t] output to the next layer. Am I right?
You can check this question for further information, although it is based on Keras-1.x API.
Basically, units is the dimension of the inner cells in the LSTM. In an LSTM, the inner cell state (C_t and C_{t-1} in the graph), the output mask (o_t in the graph) and the hidden/output state (h_t in the graph) all have the SAME dimension, so the dimension of your output is units as well.
An LSTM in Keras defines exactly one LSTM block, whose cells are of length units. If you set return_sequences=True, it returns something with shape (batch_size, timespan, units). If it is False, it returns only the last output, with shape (batch_size, units).
As for the input, you should provide an input for every time step. Basically, the shape is (batch_size, timespan, input_dim), where input_dim can be different from units. If you just want to provide input at the first step, you can simply pad your data with zeros at the other time steps.
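The shape rules in this answer can be summarized in a few lines of plain Python. The function `lstm_output_shape` is a hypothetical helper written for this illustration, not part of Keras.

```python
def lstm_output_shape(batch_size, timespan, units, return_sequences):
    """Output shape of an LSTM layer, following the description above."""
    if return_sequences:
        return (batch_size, timespan, units)
    return (batch_size, units)

# Made-up sizes: 32 sequences of 20 steps with 7 input features each.
# input_dim (7) does not appear in the output shape at all.
full = lstm_output_shape(32, 20, units=50, return_sequences=True)
last = lstm_output_shape(32, 20, units=50, return_sequences=False)
print(full)  # (32, 20, 50)
print(last)  # (32, 50)
```

So the number of time steps is set by the input, not by units; units only sets the length of each h_t vector.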
Does that mean there actually exists N of these LSTM units in the LSTM layer, or maybe that that exactly one LSTM unit is run for N iterations outputting N of these h[t] values, from, say, h[t-N] up to h[t]?
The first is true. In that Keras LSTM layer there are N LSTM units or cells.
keras.layers.LSTM(units, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0, implementation=1, return_sequences=False, return_state=False, go_backwards=False, stateful=False, unroll=False)
If you plan to create a simple LSTM layer with 1 cell, you will end up with this:
And this would be your model.
N=1
model = Sequential()
model.add(LSTM(N))
For the other models you would need N>1
How many instances of "LSTM chains"
An intuitive explanation of the 'units' parameter for Keras recurrent neural networks is that with units=1 you get an RNN as described in textbooks, and with units=n you get a layer that consists of n copies of such an RNN. They have identical structure, but as they are initialized with different weights, they compute something different. Note that the copies are not fully independent: at each step every unit also receives the previous outputs of all the other units through the recurrent weights.
Alternatively, you can consider that in an LSTM with units=1 the key values (f, i, C, h) are scalar; and with units=n they'll be vectors of length n.
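One way to see what units controls is to count the weights of a standard LSTM layer. The sketch below is plain Python and assumes the usual formulation with four gate/candidate transforms, each having an input kernel, a recurrent kernel and a bias; `lstm_params` is a hypothetical helper name.

```python
def lstm_params(input_dim, units):
    """Weights of a standard LSTM layer: 4 transforms (f, i, C, o), each with
    an input kernel (input_dim x units), a recurrent kernel (units x units)
    and a bias (units)."""
    return 4 * (input_dim * units + units * units + units)

single = lstm_params(input_dim=1, units=1)
wide = lstm_params(input_dim=1, units=100)
print(single)  # 12
print(wide)    # 40800
```

With units=100 the layer has many more weights than 100 separate units=1 layers would (100 * 12 = 1200). The difference comes from the units x units recurrent kernels, through which each unit sees the previous outputs of all the others.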
"Intuitively" just like a dense layer with 100 dim (Dense(100)) will have 100 neurons. Same way LSTM(100) will be a layer of 100 'smart neurons' where each neuron is the figure you mentioned and the output will be a vector of 100 dimensions