Keras Dense Layer Propagate Mask - tensorflow

After checking the official docs (the Keras masking tutorial), it is still not clear to me whether the Keras Dense layer can propagate the mask to the layers that follow it (layers 4 and 5 in the example below).
Another question: when calculating the loss at the 5th layer, should we apply the mask?
One could argue it is not needed because the LSTM in the 2nd layer already ignores the <pad> tokens in the input sequences. However, I've read somewhere that the LSTM outputs for <pad> tokens are NOT zero; they simply repeat the output of the last valid token. That would affect the value of the loss. So do we need to apply the mask at the 5th layer?
Our inputs are padded sequences, and we have a sequential model in Keras (a minimal code sketch follows the list below):
Embedding layer with mask_zero = True //can generate mask
LSTM layer //can consume mask
Dense layer //Question: can this layer propagate the mask to the other layers in this model?
other layers...
output layer with sigmoid as activation function
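Here is a minimal sketch of that model in code; the vocabulary size, embedding dimension, and layer widths are made up purely for illustration:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Illustrative sizes only; the real values depend on the data.
model = Sequential([
    Embedding(input_dim=10000, output_dim=128, mask_zero=True),  # 1) generates the mask
    LSTM(64, return_sequences=True),                             # 2) consumes the mask
    Dense(32, activation="relu"),                                # 3) the layer in question
    Dense(16, activation="relu"),                                # 4) other layers...
    Dense(1, activation="sigmoid"),                              # 5) output layer
])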
Thanks for your kind help!

Related

Keras input layer

I am unsure if I need to add a Dense input layer before adding LSTM layers in my model. For example, with the following model:
# Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(128, input_shape=(train_x.shape[1], train_x.shape[2])))
model.add(Dense(5, activation="linear"))
Will the LSTM layer be the input layer, and the Dense layer the output layer (meaning no hidden layers)? Or does Keras create an input layer meaning the LSTM layer will be a hidden layer?
You don't need to. It depends on what you want to accomplish.
Check some cases here.
In your case, yes, the LSTM will be the first layer and the Dense layer will be the output layer.
The current configuration is okay for simple examples. Everything depends on what you want to get as results; the model and its layers change based on the target goal. So, if the model is complex, you can build a mixed model with different layers and shapes. See the reference:
Mix model layering
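For a rough idea of what such a mixed model can look like (a hypothetical sketch with made-up shapes, not taken from the reference), the functional API lets you combine branches with different layers and merge them:
from tensorflow.keras.layers import Input, LSTM, Dense, concatenate
from tensorflow.keras.models import Model

# Two inputs with different shapes, processed by different layer stacks.
seq_in = Input(shape=(20, 8))          # e.g. a time-series branch
aux_in = Input(shape=(10,))            # e.g. static features

x = LSTM(64)(seq_in)
y = Dense(32, activation="relu")(aux_in)

merged = concatenate([x, y])
out = Dense(5, activation="linear")(merged)

model = Model(inputs=[seq_in, aux_in], outputs=out)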

How to do operations on hidden vector of the decoder on every timestep and append it to the input of the next lstm unit

To implement attention in an encoder-decoder, we have to take the hidden vector of an LSTM unit of the decoder and do several operations on it to compute the attention weights. Now, my question is: how can I individually take each hidden vector out from each LSTM unit in Keras?
This is how we initialize an LSTM layer in Keras:
lstm_layer = LSTM(num_units)(inputs)
Now, there would be many LSTM units being initialized in this layer. How can I take each LSTM unit's hidden vector, do some operations on it, and concat it with the input to the next LSTM unit?
Note - I know that we can extract the hidden vectors of all the LSTM units by setting return_sequences=True. But I want to take the hidden vector of each LSTM unit out, do some operations on it, and concat it with the input to the next LSTM unit.
Edit - By "I want to take the hidden vector of all the LSTM units", what I mean is this: suppose there are n timesteps in total, so there will be n LSTM units. I want to take the output (i.e. the hidden vector) of the LSTM at timestep 0, do some operations on it (don't worry about this part; the operations depend on what the reader wants to do), and then concat it with the input to the LSTM at timestep 1, and do this for every LSTM unit. So, in general: take the hidden state of the LSTM at timestep t, do some operations on it, and concat it with the input to the LSTM at timestep t+1.
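One way to get this kind of per-timestep control is to write a custom cell that wraps LSTMCell and is driven by the generic RNN layer. The sketch below is only a rough illustration under that assumption; the Dense transform is a hypothetical placeholder for whatever operations you want to apply to the previous hidden state:
import tensorflow as tf

class StepwiseLSTMCell(tf.keras.layers.Layer):
    # Hypothetical wrapper cell: at timestep t it takes the hidden vector
    # from timestep t-1, applies some operation to it, concatenates the
    # result with the current input, and feeds that to an LSTMCell.
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.cell = tf.keras.layers.LSTMCell(units)
        self.transform = tf.keras.layers.Dense(units)  # placeholder for your operations
        self.state_size = self.cell.state_size
        self.output_size = units

    def call(self, inputs, states):
        h_prev = states[0]                          # hidden vector from timestep t-1
        context = self.transform(h_prev)            # "do some operations on it"
        x = tf.concat([inputs, context], axis=-1)   # concat with input at timestep t
        return self.cell(x, states)

# Usage: drive the cell over all timesteps with the RNN layer.
inputs = tf.keras.Input(shape=(None, 32))           # (timesteps, features); 32 is illustrative
lstm_layer = tf.keras.layers.RNN(StepwiseLSTMCell(64), return_sequences=True)(inputs)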

Keras: why must an embedding layer be used only as the first layer?

In the keras documentation it states that the embedding layer "can only be used as the first layer in a model." This makes no sense to me; I might want to do a reshape/flatten on an input before passing it to the embedding layer, but this is not allowed. Why must the embedding layer be used only as the first layer?
"can only be used as the first layer in a model." This makes no sense to me
Generally, an embedding layer maps discrete values to continuous values. In the subsequent layers we already have a continuous vector representation, which means there is no need to convert the vectors again.
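As a small illustration (the vocabulary size of 1000 and embedding size of 100 are just example numbers), the layer turns integer ids into dense float vectors, and everything after it already works on those vectors:
import numpy as np
from tensorflow.keras.layers import Input, Embedding
from tensorflow.keras.models import Model

inp = Input(shape=(30,), dtype="int32")          # 30 word ids per sentence
emb = Embedding(input_dim=1000, output_dim=100)(inp)
model = Model(inp, emb)

word_ids = np.random.randint(0, 1000, size=(2, 30))
print(model(word_ids).shape)                     # (2, 30, 100): continuous vectors from here on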
I might want to do a reshape/flatten on input before passing it to the embedding layer
Of course, you can reshape or flatten an input, but in most cases it is meaningless. For example, assume we have sentences with a length of 30 and want to flatten them before passing them to the embedding:
from tensorflow.keras.layers import Input, Flatten, Embedding

input_layer = Input(shape=(30,))
flatten = Flatten()(input_layer)
embedd = Embedding(1000, 100)(flatten)
In the above example, the Flatten layer has no effect at all; before and after flattening, our vector size is [batch, 30].
Let's look at another example: assume our input vectors are 2D with the shape [batch, 30, 2]. After flattening the input, the vectors have the size [batch, 60]. We can feed them into an Embedding layer, but in most scenarios it has no meaning; in fact, we destroy the logical relationship between the features.
input_layer = Input(shape=(30, 2))
flatten = Flatten()(input_layer)
embedd = Embedding(1000, 100)(flatten)

Does Tensorflows tf.layers.dense flatten input dimensions?

I'm searching for a data leak in my model. I'm using tf.layers.dense before a masking operation and am concerned that the model could just learn to switch positions in the middle dimension of my input tensor.
When I have an input tensor x = tf.ones((2,3,4)), would tf.layers.dense(x, 8) flatten x into a fully connected layer with 2*3*4=24 input neurons and 2*3*8=48 output neurons and then reshape it back to [2,3,8], or would it create 2*3=6 fully connected layers with 4 input and 8 output neurons and then concatenate them?
As for the Keras Dense layer, it has already been mentioned in another answer that its input is not flattened; instead, the layer is applied on the last axis of its input.
As for the TensorFlow Dense layer, it actually inherits from the Keras Dense layer, and as a result it behaves the same way: it is applied on the last axis of its input.
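A quick way to convince yourself (assuming TF 2.x, where tf.layers.dense has been folded into tf.keras.layers.Dense): the kernel has shape (4, 8), i.e. a single weight matrix shared along the last axis, not a flattened 24-to-48 mapping:
import tensorflow as tf

x = tf.ones((2, 3, 4))
layer = tf.keras.layers.Dense(8)
y = layer(x)

print(y.shape)             # (2, 3, 8)
print(layer.kernel.shape)  # (4, 8): shared across the first two axes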

What is the equivalence of Masking() Keras function in tensorflow? And does batch norm, conv, and relu support Masking?

I am training a GRU layer where the inputs don't have the same length. Therefore, I have padded the inputs' features with 0.0 to make all sequences the same length. On the other hand, I don't want to compute any loss at any time step, for any sample, whenever the input feature vector is all zeros. For example, at time step 1000 I have a batch size of 34, but samples 33 and 34 of this batch lack data or feature values at time step 1000.
I have found that we can use Masking()(inputs) in Keras as long as all subsequent layers or operations support masking. But I have implemented my model in TensorFlow. So what is the equivalent of Masking() in TensorFlow?
Second, how can I know whether batch normalization, a conv layer, and any non-linear activation function support the Masking() function in Keras?
Your help is much appreciated!!
So I found the detailed solution in danijar's blog: https://danijar.com/variable-sequence-lengths-in-tensorflow/.
The masking in Keras is used when you have incomplete sequences. So usually, you need to pad your sequences with 0.0 in the third dimension (the feature dimension, when the input has shape = [batch_size, sequence_length, num_features]). Afterwards, the Masking layer in Keras detects that value and outputs 0 for the activations at those timesteps.
In summary: he showed how to compute the sequence length for each sample in the batch using a length() function he implemented. The resulting length vector is then fed into dynamic_rnn, which will zero out the outputs past each sample's length (and copy the last state through), which is somewhat similar to what happens with the Keras Masking() function. Second, we should use a mask when computing the loss function.
All the details are discussed in this blog post.
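For reference, here is a rough sketch of the two pieces described above. The length() helper follows the idea in the blog post; the masked loss is my own hypothetical version, assuming labels and logits of shape [batch, time, features] and a 0/1 float mask of shape [batch, time]:
import tensorflow as tf

def length(sequence):
    # A timestep counts as "used" if any of its features is non-zero.
    used = tf.sign(tf.reduce_max(tf.abs(sequence), axis=2))   # [batch, time]
    return tf.cast(tf.reduce_sum(used, axis=1), tf.int32)     # [batch]

def masked_loss(labels, logits, mask):
    per_step = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
    per_step = tf.reduce_mean(per_step, axis=-1)    # collapse the feature axis -> [batch, time]
    per_step *= mask                                # zero out the padded timesteps
    return tf.reduce_sum(per_step) / tf.reduce_sum(mask)  # average over real timesteps only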
But regarding masking support in batch norm, conv layers, and non-linear activation functions: usually, if the output of the LSTM is zeros and the output activation is a sigmoid, the derivative of the output with respect to the input of the sigmoid is output * (1 - output). Hence, when the output is 0, this derivative is zero as well. And since backpropagation applies the chain rule, the gradients of the current sample with respect to any weight parameter in the network are going to be 0 as well. Hence, there is no need to worry about masking support... But the problem arises when the activation is, for example, relu; this is when the gradients should be explicitly multiplied by zeros before doing the backpropagation (I guess). Maybe doing something like this will help:
final_output = output * mask
Then the derivative of final_output with respect to output will be the mask => 0 or 1 (at any time step, for any sample). Then, backpropagate this gradient from the output of the activation function to its inputs... followed by the chain rule => the weights won't be affected in this case.
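As a quick sanity check of that argument (assuming TF 2.x eager execution; the numbers are made up), the gradient contribution from a masked-out position is indeed zero:
import tensorflow as tf

mask = tf.constant([[1.0], [0.0]])        # second sample is padding
w = tf.Variable([[0.5]])
x = tf.constant([[2.0], [3.0]])

with tf.GradientTape() as tape:
    output = tf.nn.relu(tf.matmul(x, w))  # relu activation, as discussed above
    final_output = output * mask          # apply the mask
    loss = tf.reduce_sum(final_output)

print(tape.gradient(loss, w))             # only the unmasked sample contributes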