I've been trying to understand how the attention mechanism works. I'm currently looking at the tfjs-examples/date-conversion-attention example. I found that the example uses the dot-product alignment score (from Effective Approaches to Attention-based Neural Machine Translation).
That score, score(h_t, h_s) = h_t^T h_s, is represented as
let attention = tf.layers.dot({axes: [2, 2]}).apply([decoder, encoder]);
in the code.
The decoder (h_t) has a shape of [10,64] and the encoder (h_s) is [12,64] so the result will have a shape of [10,12]. So far so good.
Now I'm trying to implement the concat alignment score from the same paper, which looks like this: score(h_t, h_s) = v_a^T tanh(W_a [h_t; h_s]).
So the first thing to do is to concatenate h_t and h_s. However, they have different shapes, so I don't know how to proceed. Should I reshape the tensors somehow? If so, what would the shape be?
I've been googling around to find out how other people do this and found this.
#For concat scoring, decoder hidden state and encoder outputs are concatenated first
out = torch.tanh(self.fc(decoder_hidden+encoder_outputs))
But this doesn't seem right as they sum the values instead of concatenating.
Any guidance would be appreciated.
UPDATE Here is the model summary:
__________________________________________________________________________________________________
Layer (type) Output shape Param # Receives inputs
==================================================================================================
input1 (InputLayer) [null,12] 0
__________________________________________________________________________________________________
embedding_Embedding1 (Embedding [null,12,64] 2240 input1[0][0]
__________________________________________________________________________________________________
input2 (InputLayer) [null,10] 0
__________________________________________________________________________________________________
lstm_LSTM1 (LSTM) [null,12,64] 33024 embedding_Embedding1[0][0]
__________________________________________________________________________________________________
embedding_Embedding2 (Embedding [null,10,64] 832 input2[0][0]
__________________________________________________________________________________________________
encoderLast (GetLastTimestepLay [null,64] 0 lstm_LSTM1[0][0]
__________________________________________________________________________________________________
lstm_LSTM2 (LSTM) [null,10,64] 33024 embedding_Embedding2[0][0]
encoderLast[0][0]
encoderLast[0][0]
__________________________________________________________________________________________________
dot_Dot1 (Dot) [null,10,12] 0 lstm_LSTM2[0][0]
lstm_LSTM1[0][0]
__________________________________________________________________________________________________
attention (Activation) [null,10,12] 0 dot_Dot1[0][0]
__________________________________________________________________________________________________
context (Dot) [null,10,64] 0 attention[0][0]
lstm_LSTM1[0][0]
__________________________________________________________________________________________________
concatenate_Concatenate1 (Conca [null,10,128] 0 context[0][0]
lstm_LSTM2[0][0]
__________________________________________________________________________________________________
time_distributed_TimeDistribute [null,10,64] 8256 concatenate_Concatenate1[0][0]
__________________________________________________________________________________________________
time_distributed_TimeDistribute [null,10,13] 845 time_distributed_TimeDistributed1
==================================================================================================
Total params: 78221
Trainable params: 78221
Non-trainable params: 0
__________________________________________________________________________________________________
First, for tf.layers.dot to work, both inputs must have the same size along the axes being contracted (here the last axis, 64).
To perform a concatenation, you can use tf.concat([h_t, h_s], axis). The new shape depends on the axis along which the concatenation is performed.
Let's suppose that both h_t and h_s have the shape [a, b]. If the concatenation is done along axis 0, the new shape is [2a, b]; if it is done along axis 1, the resulting shape is [a, 2b].
Then you can apply tf.tanh to the result, or create a custom layer that does it for you.
Update:
Since tf.layers.dot operates on 3D data whose shapes do not match on the second axis (axis = 1), the concatenation can only be done along that axis, and the resulting shape would be [1, 10 + 12, 64].
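For the concat score itself, one alternative to concatenating along the time axis is to pair every decoder step with every encoder step by broadcasting. Below is a minimal NumPy sketch, assuming the shapes from the question (10 decoder steps, 12 encoder steps, 64 units) and no batch axis. The names W_a and v_a follow the paper; the attention size of 32 and the random weights are placeholders, not values taken from the tfjs example.

```python
import numpy as np

rng = np.random.default_rng(0)

T_dec, T_enc, units, attn_units = 10, 12, 64, 32  # attn_units is an assumed size

h_t = rng.standard_normal((T_dec, units))  # decoder states, [10, 64]
h_s = rng.standard_normal((T_enc, units))  # encoder states, [12, 64]

# Pair every decoder step with every encoder step.
h_t_tiled = np.broadcast_to(h_t[:, None, :], (T_dec, T_enc, units))  # [10, 12, 64]
h_s_tiled = np.broadcast_to(h_s[None, :, :], (T_dec, T_enc, units))  # [10, 12, 64]
pairs = np.concatenate([h_t_tiled, h_s_tiled], axis=-1)              # [10, 12, 128]

# score = v_a^T tanh(W_a [h_t; h_s]); random placeholders stand in for trained weights.
W_a = rng.standard_normal((2 * units, attn_units))
v_a = rng.standard_normal(attn_units)
scores = np.tanh(pairs @ W_a) @ v_a                                  # [10, 12]

# Softmax over the encoder axis turns scores into attention weights.
exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
attention = exp / exp.sum(axis=-1, keepdims=True)

print(pairs.shape, scores.shape, attention.shape)
```

The scores come out as [10, 12], the same shape the dot layer produces, so the rest of the model could stay unchanged. Since W_a [h_t; h_s] equals W_1 h_t + W_2 h_s when W_a is split into two blocks, the same score can also be computed by projecting each input separately and adding the results, without building the [10, 12, 128] tensor.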
I have trouble understanding the "timestep" parameter in the LSTM layer. I have found several explanations, but I am now very confused. Some say it is the amount of data per batch that enters the model during training. Others say it is the number of times a cell occurs within the LSTM layer, with the states being passed from one cell to the next.
The point is that I have the following form of the training data:
(sequences, number of frames per sequence, width, height, channel = 1)
(2000, 5, 80, 80, 1)
My model must predict the following sequence of frames, in this case 5 future frames. The model consists of a variational autoencoder: first I use 3D convolutional layers to compress the sequences of 5 frames, then I reshape the outputs so that they can enter the LSTM layer, which only accepts (batch, timestep, features).
Model: "sequential"
____________________________________________________________________________________________________
Layer (type) Output Shape Param #
====================================================================================================
conv3d (Conv3D) (None, 2, 27, 27, 32) 19392
____________________________________________________________________________________________________
batch_normalization (BatchNormalization) (None, 2, 27, 27, 32) 128
____________________________________________________________________________________________________
conv3d_1 (Conv3D) (None, 1, 14, 14, 32) 2654240
____________________________________________________________________________________________________
batch_normalization_1 (BatchNormalization) (None, 1, 14, 14, 32) 128
____________________________________________________________________________________________________
conv3d_2 (Conv3D) (None, 1, 7, 7, 64) 3211328
____________________________________________________________________________________________________
batch_normalization_2 (BatchNormalization) (None, 1, 7, 7, 64) 256
____________________________________________________________________________________________________
flatten (Flatten) (None, 3136) 0
____________________________________________________________________________________________________
reshape (Reshape) (None, 4, 784) 0
____________________________________________________________________________________________________
lstm (LSTM) (None, 64) 217344
____________________________________________________________________________________________________
repeat_vector (RepeatVector) (None, 4, 64) 0
____________________________________________________________________________________________________
lstm_1 (LSTM) (None, 4, 64) 33024
____________________________________________________________________________________________________
time_distributed (TimeDistributed) (None, 4, 784) 50960
____________________________________________________________________________________________________
reshape_1 (Reshape) (None, 1, 7, 7, 64) 0
____________________________________________________________________________________________________
conv3d_transpose (Conv3DTranspose) (None, 2, 14, 14, 64) 6422592
____________________________________________________________________________________________________
batch_normalization_3 (BatchNormalization) (None, 2, 14, 14, 64) 256
____________________________________________________________________________________________________
conv3d_transpose_1 (Conv3DTranspose) (None, 4, 28, 28, 32) 5308448
____________________________________________________________________________________________________
batch_normalization_4 (BatchNormalization) (None, 4, 28, 28, 32) 128
____________________________________________________________________________________________________
conv3d_transpose_2 (Conv3DTranspose) (None, 8, 84, 84, 1) 19361
____________________________________________________________________________________________________
batch_normalization_5 (BatchNormalization) (None, 8, 84, 84, 1) 4
____________________________________________________________________________________________________
cropping3d (Cropping3D) (None, 8, 80, 80, 1) 0
____________________________________________________________________________________________________
cropping3d_1 (Cropping3D) (None, 5, 80, 80, 1) 0
====================================================================================================
I finally used the Reshape layer to get into the LSTM layer, with shape (batch, 4, 784). In other words, I have set timestep = 4. I think it should be 5, or perhaps it does not have to equal the number of frames I want to predict.
What is the true meaning of timestep in this case? Do I need to order the values of my layers? I really appreciate your support.
On the other hand, I am thinking of applying the convolutional layers frame by frame instead of to the entire 5-frame sequence, then connecting the outputs of the convolutional layers to LSTM layers, and finally connecting the output states of the LSTM layers of each frame, respecting the order of the frames. In this case I would consider using timestep = 1.
I have set timestep = 4. I think it should be 5, or perhaps it does not have to equal the number of frames I want to predict.
You are right. The timestep is not equal to the number of frames you want to predict.
Let me describe it in plain language.
The timestep is, in essence, the number of past units (seconds, minutes, hours, days, video frames, etc.) used to predict the future step(s).
For example, suppose you want to predict the stock price from the last 5 days. In this case timestep = 5, where T-5 = current_day - 5, T-4 = current_day - 4, etc. Note that current_day here plays the role of the 'future' day: you are predicting today's value in advance.
You may want to predict only the stock price for the current day. In this case you would do a one-step forecast. However, you may also want to predict the stock price for the current day, tomorrow, and the day after tomorrow, that is, predict T, T+1, T+2 from T-5, T-4, T-3, T-2, T-1. This second case is called a multi-step forecast.
Notice that the timestep, which refers strictly to the "past", is independent of the number of steps you forecast.
Depending on your problem, a multi-step forecast will often benefit from a longer "past" window, i.e. more timesteps, to help your LSTM capture more correlations in the data.
To relate it to the batch size: a batch size of 2 means 2 chunks of data, each using [T-5,T-4,T-3,T-2,T-1] to predict T, i.e. 2 chunks of the form ([T-5,T-4,T-3,T-2,T-1],[T]).
When you prepare the data to predict the next frame, the past values (T-5, T-4, ...) inside a chunk must of course be in exact order. What you do not need is for consecutive chunks to come from the same video.
In other words, you can have a chunk like the one described above from video 1, a chunk from video 9, etc.
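The chunking described above can be sketched in plain Python. This is only an illustration: make_chunks, the toy prices list, and the horizon argument are names invented here, with timesteps being the length of the "past" window and horizon the number of future steps to predict.

```python
def make_chunks(series, timesteps, horizon):
    """Split an ordered series into (past, future) pairs."""
    chunks = []
    for start in range(len(series) - timesteps - horizon + 1):
        past = series[start:start + timesteps]
        future = series[start + timesteps:start + timesteps + horizon]
        chunks.append((past, future))
    return chunks


prices = list(range(10))  # toy data standing in for daily prices

one_step = make_chunks(prices, timesteps=5, horizon=1)    # predict T
multi_step = make_chunks(prices, timesteps=5, horizon=3)  # predict T, T+1, T+2

print(one_step[0])    # ([0, 1, 2, 3, 4], [5])
print(multi_step[0])  # ([0, 1, 2, 3, 4], [5, 6, 7])
print(len(one_step), len(multi_step))  # 5 3
```

The order inside each chunk is fixed, but the list of chunks can be shuffled or mixed with chunks from other series before batching.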
I have 2 layers with the following shapes.
Layer 1: LSTM
K.int_shape(x)
(None, None, 500)
Layer 2 : Conv2D -> Flatten -> Reshape
K.int_shape(y)
(None, 1, 2352)
I need to concatenate them, but I get the following error.
ValueError: A 'Concatenate' layer requires inputs with matching shapes
except for the concat axis. Got inputs shapes: [(None, None, 500),
(None, 1, 2352)]
I'm using Keras v2.1.4
I have a time series signal (n samples, each sample has 81 time steps and 3 features = n x 81 x 3).
I am using a Conv1D-LSTM network with n_timesteps = 81 and n_features = 3.
A plain LSTM specifies both n_timesteps and n_features; however, when it is combined with Conv1D, these are not specified.
How does the LSTM know how many time steps and features there are in the input to it?
How does the LSTM know the end of the sequence for each sample?
Are the time steps "stored up" and then fed into the LSTM, or are they processed and fed into the LSTM one time step at a time?
If I include the "flatten" (below) it fails. Why?
Does the number of filters in the conv1d have to correspond to the number of units in the LSTM?
model = Sequential()
model.add(Conv1D(filters=32, kernel_size=3, activation='relu', input_shape=(n_timesteps,n_features)))
model.add(Conv1D(filters=32, kernel_size=3, activation='relu'))
model.add(Dropout(0.5))
model.add(MaxPooling1D(pool_size=2))
#model.add(Flatten())
#model.add(LSTM(units=128, input_shape=(n_timesteps, n_features), return_sequences=True))
model.add(LSTM(units=128, return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(units=64, dropout=0.5, recurrent_dropout=0.5, return_sequences=True))
model.add(LSTM(units=32, dropout=0.5, recurrent_dropout=0.5))
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))
1 and 2
Everything is based on tensors (sort of like matrices, but with any number of dimensions).
The tensors have shapes and everything is based on the shapes. Your data tensors are three-dimensional: (samples, time_steps, features).
It happens that 1D convolutions also use the same 3D tensors: (samples, length, channels). So:
samples = examples = sequences
time_steps = length
features = channels
There is no secret. The data is structured and the layers will use this structure. Look at your model.summary() and see the number of steps and features for every layer's output.
3
There is no interleaving between layers.
The conv layer will process its entire input tensor and generate an output tensor.
The next conv layer will take this entire output and produce another entire output.
The LSTM layer will do the same, take an entire input and output an entire tensor.
4
If you flatten the data, your 3D tensors (samples, steps, feats) will become 2D tensors (samples, steps * feats). The step structure is gone, so the LSTM, which expects 3D input, can no longer use it.
5
There is absolutely no requirement on the number of filters or units. The only constraint is that the final output of your model needs to have the same shape as your y_train data.
Here is my model summary. It appears that the number of features has changed from the original 3 (of the input) to 32 (for the conv1d). Is it correct that the LSTM will now process the entire sequence of time steps (~81) on the 32 features of the conv1d instead of the 3 features of the input?
Example of summary:
The first LSTM will take an input shape of (None, 38,32). This means this LSTM will process:
38 steps
32 features
The convolutions are discarding border steps and the maxpooling is halving the steps.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d (Conv1D) (None, 79, 32) 320
_________________________________________________________________
conv1d_1 (Conv1D) (None, 77, 32) 3104
_________________________________________________________________
dropout (Dropout) (None, 77, 32) 0
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 38, 32) 0
_________________________________________________________________
lstm (LSTM) (None, 38, 128) 82432
_________________________________________________________________
dropout_1 (Dropout) (None, 38, 128) 0
_________________________________________________________________
lstm_1 (LSTM) (None, 38, 64) 49408
_________________________________________________________________
lstm_2 (LSTM) (None, 32) 12416
_________________________________________________________________
dense (Dense) (None, 16) 528
_________________________________________________________________
dropout_2 (Dropout) (None, 16) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 17
=================================================================
Total params: 148,225
Trainable params: 148,225
Non-trainable params: 0
_________________________________________________________________
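The step counts and parameter counts in this summary can be reproduced with a little arithmetic. The sketch below assumes the Keras defaults used in the question's model ("valid" padding, stride 1 for Conv1D, pooling stride equal to pool_size); the helper names are made up for illustration.

```python
def conv1d_valid_steps(steps, kernel_size, stride=1):
    # "valid" padding drops the border positions the kernel cannot cover
    return (steps - kernel_size) // stride + 1


def maxpool1d_steps(steps, pool_size):
    # default pooling stride equals pool_size
    return (steps - pool_size) // pool_size + 1


def conv1d_params(in_channels, filters, kernel_size):
    return (kernel_size * in_channels + 1) * filters


def lstm_params(in_features, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * units * (in_features + units + 1)


steps = 81
steps = conv1d_valid_steps(steps, 3)  # 79
steps = conv1d_valid_steps(steps, 3)  # 77
steps = maxpool1d_steps(steps, 2)     # 38
print(steps)

print(conv1d_params(3, 32, 3), conv1d_params(32, 32, 3))  # 320 3104
print(lstm_params(32, 128), lstm_params(128, 64), lstm_params(64, 32))
# 82432 49408 12416
```

Note that lstm_params has no term for the number of time steps: the same weights are reused at every step, which is why the LSTM only needs the step count from the shape of its input tensor.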
I have no problem understanding the output shape of a Dense layer that follows a Flatten layer. The output shape matches my understanding, i.e. (batch size, units).
nn= keras.Sequential()
nn.add(keras.layers.Conv2D(8,kernel_size=(2,2),input_shape=(4,5,1)))
nn.add(keras.layers.Conv2D(1,kernel_size=(2,2)))
nn.add(keras.layers.Flatten())
nn.add(keras.layers.Dense(5))
nn.add(keras.layers.Dense(1))
nn.summary()
Output is:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 3, 4, 8) 40
_________________________________________________________________
conv2d_2 (Conv2D) (None, 2, 3, 1) 33
_________________________________________________________________
flatten_1 (Flatten) (None, 6) 0
_________________________________________________________________
dense_1 (Dense) (None, 5) 35
_________________________________________________________________
dense_2 (Dense) (None, 1) 6
=================================================================
Total params: 114
Trainable params: 114
Non-trainable params: 0
_________________________________________________________________
But I am having trouble understanding the output shape of a Dense layer for multidimensional input. For the following code
nn= keras.Sequential()
nn.add(keras.layers.Conv2D(8,kernel_size=(2,2),input_shape=(4,5,1)))
nn.add(keras.layers.Conv2D(1,kernel_size=(2,2)))
#nn.add(keras.layers.Flatten())
nn.add(keras.layers.Dense(5))
nn.add(keras.layers.Dense(1))
nn.summary()
output is
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 3, 4, 8) 40
_________________________________________________________________
conv2d_2 (Conv2D) (None, 2, 3, 1) 33
_________________________________________________________________
dense_1 (Dense) (None, 2, 3, 5) 10
_________________________________________________________________
dense_2 (Dense) (None, 2, 3, 1) 6
=================================================================
Total params: 89
Trainable params: 89
I am unable to build an intuition for the output shapes of the dense_1 and dense_2 layers. Shouldn't the final output be a scalar or (batch, units)?
The following answer to a similar question tries to explain the intuition, but I cannot fully grasp the concept.
From the same answer:
That is, each output "pixel" (i, j) in the 640x959 grid is calculated as a dense combination of the 8 different convolution channels at point (i, j) from the previous layer.
Maybe an explanation with pictures would be useful.
This is tricky but it does fit with the documentation from Keras on dense layers,
Output shape
nD tensor with shape: (batch_size, ..., units). For instance, for a 2D input with shape (batch_size, input_dim), the output would have shape (batch_size, units)
The wording is not the clearest, but the ... means that all leading dimensions are kept and only the final dimension of the input shape is replaced by the number of dense units. Basically, each item along the final dimension gets a connection to each of the requested nodes in the coming dense layer.
In your case you have something which is 2 x 3 x 1. So there is "one thing" (the 2 x 3 thing) to be connected to each of the 5 dense layer nodes, hence 2 x 3 x 5. You can think of it like channels of a CNN layer in this particular case. There is a distinct 2 x 3 sheet of outputs for each of the 5 output "nodes".
In a purely 2-D case (batch_size, units) ... then each item iterated by the final dimension units is itself a scalar value, so you end up with something of exactly the size of the number of dense nodes requested.
But in a higher-dimensional case, each item you iterate along the final dimension of the input will itself still be a higher-dimensional thing, and so the output is k distinct "clones" of those higher-dimensional things, where k is the dense layer size requested, and by "clone" we mean the output for a single dense connection has the same shape as the items in the final dimension of the input.
Then the Dense-ness means that each specific element of that output has a connection to each element of the corresponding set of inputs. But be careful about this. Dense layers are defined by having "one" connection between each item of the output and each item of the input. So even though you have 5 "2x3 things" in your output, they each just have one solitary weight associated with them about how they are connected to the 2x3 thing that is the input. Keras also defaults to using a bias vector (not bias tensor), so if the dense layer has dimension k and the final dimension of the previous layer is n you should expect (n+1)k trainable parameters. These will always be used with numpy-like broadcasting to make the lesser dimensional shape of the weight and bias vectors conformable to the actual shapes of the input tensors.
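As a quick check of the (n+1)k rule against the summary in the question, here is a small sketch in plain Python. The two helper functions are illustrative names, not Keras API; they only encode the rule that Dense keeps every leading dimension and replaces the last one with the number of units.

```python
def dense_output_shape(input_shape, units):
    # every leading dimension is kept; only the last one becomes `units`
    return tuple(input_shape[:-1]) + (units,)


def dense_params(input_shape, units, use_bias=True):
    n = input_shape[-1]  # only the last dimension determines the weight count
    return (n + (1 if use_bias else 0)) * units


shape = (None, 2, 3, 1)  # output of conv2d_2 in the question
for units in (5, 1):
    params = dense_params(shape, units)
    shape = dense_output_shape(shape, units)
    print(shape, params)
# (None, 2, 3, 5) 10
# (None, 2, 3, 1) 6

# With Flatten first, the last dimension is 6 instead of 1.
print(dense_params((None, 6), 5))  # 35
```

These match the 10, 6 and 35 parameter counts shown in the two summaries.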
It is customary to use Flatten as in your first example if you want to enforce the exact size of the coming dense layer. You would use multidimensional Dense layer when you want different "(n - 1)D" groups of connections to each Dense node. This is probably extremely rare for higher dimensional inputs because you'd typically want a CNN type of operation, but I could imagine maybe in some cases where a model predicts pixel-wise values or if you are generating a full nD output, like from the decoder portion of an encoder-decoder network, you might want a dense array of cells that match the dimensions of some expected structured output type like an image or video.
I am working on a simple 1D convolution model, which is built as follows
model1= Sequential()
model1.add(Conv1D(60,32, strides=1, activation='relu',padding='causal',input_shape=(64,1)))
model1.add(Conv1D(80,10, strides=1, activation='relu',padding='causal'))
model1.add(Conv1D(100,5, strides=1, activation='relu',padding='causal'))
model1.add(MaxPooling1D(2))
model1.add(Dense(300,activation='relu'))
model1.add(Flatten())
model1.add(Dense(1,activation='relu'))
print(model1.summary())
Its model summary is as follows
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d_1 (Conv1D) (None, 64, 60) 1980
_________________________________________________________________
conv1d_2 (Conv1D) (None, 64, 80) 48080
_________________________________________________________________
conv1d_3 (Conv1D) (None, 64, 100) 40100
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 32, 100) 0
_________________________________________________________________
dense_1 (Dense) (None, 32, 300) 30300
_________________________________________________________________
flatten_1 (Flatten) (None, 9600) 0
_________________________________________________________________
dense_2 (Dense) (None, 1) 9601
=================================================================
Total params: 130,061
Trainable params: 130,061
Non-trainable params: 0
_________________________________________________________________
If I move the flatten layer before the first dense layer, as follows, I get the following model architecture. The number of parameters of this model seems to be much larger than that of the previous one. Why does the placement of the flatten layer have such a large impact? What is the correct place for the flatten layer?
model1= Sequential()
model1.add(Conv1D(60,32, strides=1, activation='relu',padding='causal',input_shape=(64,1)))
model1.add(Conv1D(80,10, strides=1, activation='relu',padding='causal'))
model1.add(Conv1D(100,5, strides=1, activation='relu',padding='causal'))
model1.add(MaxPooling1D(2))
model1.add(Flatten())
model1.add(Dense(300,activation='relu'))
model1.add(Dense(1,activation='relu'))
The difference is that in the first case the dense layer is applied independently at each position. It maps the 100 channels to 300 outputs, using 100 x 300 = 30,000 weights and 300 biases, for a total of 30,300 parameters. The same operation, with the same weights, is repeated for all 32 steps of its input from max_pooling1d_1.
In the second case you flatten the input first, so now you have 32 x 100 = 3,200 inputs and map them to 300 outputs, requiring 300 x 3,200 + 300 = 960,300 parameters.
Which one is correct is up to you. In the first case the network is much smaller, will learn quicker, and will be much less prone to overfitting, but it might not have the expressivity necessary to give usable performance on your dataset. But does it make sense to force the dense layer to treat all steps the same way? Only experiments can tell. You have to try both ways and see which one yields better results.
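The two parameter counts can be compared with a few lines of arithmetic. A minimal sketch, assuming the (None, 32, 100) output of max_pooling1d_1 from the summary above; dense_params is a helper name made up for this example.

```python
def dense_params(n_inputs, units):
    # weights plus one bias per unit
    return n_inputs * units + units


steps, channels = 32, 100  # output of max_pooling1d_1

# Case 1: Dense(300) -> Flatten -> Dense(1)
dense_then_flatten = dense_params(channels, 300) + dense_params(steps * 300, 1)

# Case 2: Flatten -> Dense(300) -> Dense(1)
flatten_then_dense = dense_params(steps * channels, 300) + dense_params(300, 1)

print(dense_then_flatten)  # 39901  (30300 + 9601)
print(flatten_then_dense)  # 960601 (960300 + 301)
```

The convolutional layers contribute the same number of parameters in both models, so the difference comes entirely from these two dense layers.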