I am trying to modify an autoencoder model so that I can add an attention layer at the "bottleneck/compressed" layer.
Suppose I have 2 encoder Dense layers, the attention layer I want to add, and 2 decoder Dense layers.
How can I build this architecture in TensorFlow?
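A minimal sketch of one way to do this, assuming MNIST-sized (784-dimensional) inputs and hypothetical layer sizes; since self-attention expects a sequence axis, the bottleneck vector is reshaped into a sequence of one-dimensional tokens before tf.keras.layers.MultiHeadAttention is applied:
import tensorflow as tf
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(784,))  # hypothetical input size
# Two encoder Dense layers
x = layers.Dense(128, activation='relu')(inputs)
x = layers.Dense(64, activation='relu')(x)
# Give the bottleneck a sequence axis so attention can attend over its units
seq = layers.Reshape((64, 1))(x)
att = layers.MultiHeadAttention(num_heads=1, key_dim=1)(seq, seq)  # self-attention
x = layers.Flatten()(att)
# Two decoder Dense layers
x = layers.Dense(128, activation='relu')(x)
outputs = layers.Dense(784, activation='sigmoid')(x)
autoencoder = models.Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='mse')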
I am new to attention mechanisms and I want to learn more about them by working through some practical examples. I came across a Keras implementation of multi-head attention on PyPI: keras-multi-head. I found two different ways to implement it in Keras.
One way is to use multi-head attention as a Keras wrapper layer with either an LSTM or a CNN.
This is a snippet implementing multi-head as a wrapper layer with an LSTM in Keras. The example is taken from the keras-multi-head page:
import keras
from keras_multi_head import MultiHead
model = keras.models.Sequential()
model.add(keras.layers.Embedding(input_dim=100, output_dim=20, name='Embedding'))
model.add(MultiHead(keras.layers.LSTM(units=64), layer_num=3, name='Multi-LSTMs'))
model.add(keras.layers.Flatten(name='Flatten'))
model.add(keras.layers.Dense(units=4, activation='softmax', name='Dense'))
model.build()
model.summary()
The other way is to use it separately as a stand-alone layer.
This is a snippet of the second implementation, multi-head as a stand-alone layer, also taken from keras-multi-head:
import keras
from keras_multi_head import MultiHeadAttention
input_layer = keras.layers.Input(shape=(2, 3), name='Input')
att_layer = MultiHeadAttention(head_num=3, name='Multi-Head')(input_layer)
model = keras.models.Model(inputs=input_layer, outputs=att_layer)
model.compile(optimizer='adam', loss='mse', metrics={})
I have been trying to find documentation that explains this, but I have not found any yet.
Update:
What I have found is that the second implementation (MultiHeadAttention) is closer to the Transformer paper "Attention Is All You Need". However, I am still struggling to understand the first implementation, the wrapper layer.
Would the first one (as a wrapper layer) combine the output of the multi-head with the LSTM?
I was wondering if someone could explain the idea behind them, especially the wrapper layer.
I understand your confusion. In my experience, what MultiHead (the wrapper) does is duplicate (or parallelize) layers to form a kind of multichannel architecture, where each channel can be used to extract different features from the input.
For instance, each channel can have a different configuration, and the channel outputs are later concatenated to make an inference. So MultiHead can be used to wrap conventional architectures to form a multihead-CNN, multihead-LSTM, etc.
Note that the attention layer is different. You may stack attention layers to form a new architecture, and you may also parallelize the attention layer (MultiHeadAttention) and configure each head as explained above. See here for different implementations of the attention layer.
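To make the wrapper's behavior concrete, here is a hedged functional-API sketch of roughly what MultiHead(keras.layers.LSTM(units=64), layer_num=3) amounts to: the same input is fed to three independent LSTM "channels" whose outputs are combined before inference (the actual package stacks the outputs; concatenation is used here for simplicity, and the input length is a hypothetical choice):
import tensorflow as tf
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(10,))  # hypothetical sequence length
emb = layers.Embedding(input_dim=100, output_dim=20)(inputs)
# Three independent LSTM "heads" applied to the same input
heads = [layers.LSTM(units=64)(emb) for _ in range(3)]
# Combine the channels before the final prediction
merged = layers.Concatenate()(heads)
outputs = layers.Dense(units=4, activation='softmax')(merged)
model = models.Model(inputs, outputs)
model.summary()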
I am trying to apply Grad-CAM to my pre-trained CNN model to generate heat maps of its layers. My custom CNN design is as follows (a rough code sketch of it appears below the figure):
- It adopts all the convolution layers and the pre-trained weights from the VGG16 model.
- Extract lower-level features (early convolution layers) from VGG16.
- Train fully connected layers on both the normal/high-level and the lower-level features from VGG16.
- Concatenate the outputs of both the normal/high-level and lower-level f.c. layers and then train more f.c. layers before the final prediction.
[model design figure]
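For reference, a rough sketch of the described design; the tapped layer ('block2_pool'), the pooling choice, the layer sizes, and the number of classes are all assumptions:
import tensorflow as tf
from tensorflow.keras import layers, models

vgg = tf.keras.applications.VGG16(include_top=False, input_shape=(224, 224, 3))
vgg.trainable = False  # keep the pre-trained convolution weights

# Low-level route: early convolution features
low = layers.GlobalAveragePooling2D()(vgg.get_layer('block2_pool').output)
# Normal/high-level route: last convolution features
high = layers.GlobalAveragePooling2D()(vgg.output)

low_fc = layers.Dense(128, activation='relu')(low)
high_fc = layers.Dense(128, activation='relu')(high)

# Concatenate both routes and train more f.c. layers
merged = layers.Concatenate()([low_fc, high_fc])
x = layers.Dense(64, activation='relu')(merged)
outputs = layers.Dense(2, activation='softmax')(x)  # hypothetical 2 classes

model = models.Model(vgg.input, outputs)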
I want to use Grad-CAM to visualize the feature maps of the low-level route and the normal/high-level route, and I have already produced such heatmaps on a non-concatenated fine-tuned VGG using the last convolutional layers. My question is: on a concatenated CNN model, can the Grad-CAM method still work using the gradient of the prediction with respect to the low- and high-level feature maps respectively? If not, are there other methods that can produce heatmap visualizations for such a model? Is using the shared fully connected layers an option?
Any ideas and suggestions are much appreciated!
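Gradients do flow back through the concatenation into both routes, so Grad-CAM should still work per route by taking the gradient of the class score with respect to each route's last convolutional feature map. A minimal sketch with tf.GradientTape, where the layer names passed in (e.g. 'low_conv', 'high_conv') are hypothetical names from the concatenated model:
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, class_index):
    # Map the input to (chosen feature map, predictions)
    grad_model = tf.keras.models.Model(
        model.inputs, [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)        # gradient w.r.t. that route's feature map
    weights = tf.reduce_mean(grads, axis=(1, 2))  # global-average-pooled channel weights
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)
    return tf.nn.relu(cam)[0].numpy()             # (H, W) heatmap for one image

# e.g. heatmap_low = grad_cam(model, img, 'low_conv', class_index=0)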
I've programmed a VGG16-based CNN and now I want to create a Faster R-CNN from it. In all the architecture diagrams I've seen, an RoI pooling layer is needed, but I don't know how to implement one. Is there a function to do this?
Keras/TensorFlow does not provide an implementation of the RoI pooling layer, so you need to code it yourself.
You can use this repository as a code reference.
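As a starting point, here is a minimal sketch of an RoI pooling layer built on tf.image.crop_and_resize (bilinear sampling, so strictly closer to RoIAlign than the original quantized RoI pooling); it assumes a batch of one feature map and ROIs given as normalized [y1, x1, y2, x2] boxes:
import tensorflow as tf

class RoiPooling(tf.keras.layers.Layer):
    def __init__(self, pool_size=7, **kwargs):
        super().__init__(**kwargs)
        self.pool_size = pool_size

    def call(self, inputs):
        feature_map, rois = inputs  # (1, H, W, C) and (num_rois, 4)
        # All ROIs come from the single image in the batch
        box_indices = tf.zeros(tf.shape(rois)[0], dtype=tf.int32)
        # Resample each ROI to a fixed pool_size x pool_size grid
        return tf.image.crop_and_resize(
            feature_map, rois, box_indices,
            crop_size=(self.pool_size, self.pool_size))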
When we add a bidirectional RNN layer, I understand that we have to concatenate the hidden states. If we use a bidirectional RNN layer in an encoder-decoder model, do we have to train the bidirectional RNN layer separately?
No. To quote from the abstract of Bidirectional Recurrent Neural Networks by Schuster and Paliwal:
The BRNN can be trained without the limitation of using input information just up to a preset future frame. This is accomplished by training it simultaneously in positive and negative time direction.
I guess you are talking about tf.nn.static_bidirectional_rnn.
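For example, in tf.keras a bidirectional encoder is trained jointly with the rest of the encoder-decoder model by the single loss; a minimal sketch with hypothetical sizes, concatenating the forward and backward states to initialize the decoder:
import tensorflow as tf
from tensorflow.keras import layers, models

enc_in = layers.Input(shape=(None, 16))  # hypothetical feature size
# Bidirectional LSTM encoder; return_state gives forward and backward states
_, fh, fc, bh, bc = layers.Bidirectional(layers.LSTM(32, return_state=True))(enc_in)
state_h = layers.Concatenate()([fh, bh])  # concatenated hidden states
state_c = layers.Concatenate()([fc, bc])

dec_in = layers.Input(shape=(None, 16))
dec_out = layers.LSTM(64, return_sequences=True)(dec_in, initial_state=[state_h, state_c])
outputs = layers.Dense(16)(dec_out)

model = models.Model([enc_in, dec_in], outputs)
model.compile(optimizer='adam', loss='mse')  # one loss trains both directions end to end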