I am trying to add an attention layer to my text classification model. The inputs are texts (e.g. movie reviews) and the output is a binary outcome (e.g. positive vs. negative).
model = Sequential()
model.add(Embedding(max_features, 32, input_length=maxlen))
model.add(Bidirectional(CuDNNGRU(16,return_sequences=True)))
##### add attention layer here #####
model.add(Dense(1, activation='sigmoid'))
After some searching, I found a couple of ready-to-use attention layers for Keras. There is the keras.layers.Attention layer, which is built into Keras. There are also the SeqWeightedAttention and SeqSelfAttention layers in the keras-self-attention package. As a person who is relatively new to the deep learning field, I have a hard time understanding the mechanism behind these layers.
What does each of these layers do? Which one would be best for my model?
Thank you very much!
If you are using an RNN, I would not recommend using the keras.layers.Attention class.
While analysing the tf.keras.layers.Attention GitHub code to better understand how to use it, the first line I came across was: "This class is suitable for Dense or CNN networks, and not for RNN networks".
There is another open-source version maintained by CyberZHG called keras-self-attention. To the best of my knowledge this is NOT part of the Keras or TensorFlow library and seems to be an independent piece of code. It contains the two classes you mentioned: SeqWeightedAttention and SeqSelfAttention. The former returns a 2D value and the latter a 3D value, so SeqWeightedAttention should work for your situation. The former seems to be loosely based on Raffel et al. and can be used for sequence classification; the latter seems to be a variation of Bahdanau.
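For illustration, this is roughly how SeqWeightedAttention would slot into your model. This is only a sketch: it assumes the keras-self-attention package is installed, and the values of max_features/maxlen are placeholders, so check the package docs for the exact constructor arguments.

from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, CuDNNGRU, Dense
from keras_self_attention import SeqWeightedAttention

max_features, maxlen = 20000, 100   # illustrative values only

model = Sequential()
model.add(Embedding(max_features, 32, input_length=maxlen))
model.add(Bidirectional(CuDNNGRU(16, return_sequences=True)))
# collapses the (batch, time, features) RNN output into a (batch, features) vector
model.add(SeqWeightedAttention())
model.add(Dense(1, activation='sigmoid'))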
In general, I would suggest you write your own sequence-to-classification model. The attention piece can be added in less than half a dozen lines of code (in its bare-bones essence)...much less than the time you would spend integrating, debugging, or understanding the code in these external libraries.
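To give a sense of the bare-bones essence I mean, here is a minimal attention-pooling layer sketch. This is my own illustrative code, loosely in the spirit of Raffel et al., not any library's implementation: it scores each timestep of the RNN output, softmax-normalises the scores, and returns the weighted sum as a fixed-size vector.

import tensorflow.keras.backend as K
from tensorflow.keras import layers

class AttentionPooling(layers.Layer):
    def build(self, input_shape):
        # one learned vector that turns each timestep's features into a scalar score
        self.w = self.add_weight(name='att_w', shape=(input_shape[-1], 1),
                                 initializer='glorot_uniform', trainable=True)
        super().build(input_shape)

    def call(self, x):  # x: (batch, time, features)
        scores = K.softmax(K.squeeze(K.dot(x, self.w), axis=-1), axis=-1)   # (batch, time)
        return K.sum(x * K.expand_dims(scores, -1), axis=1)                 # (batch, features)

You would drop this in at your placeholder comment, i.e. model.add(AttentionPooling()) right before the final Dense layer.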
Please refer: Create an LSTM layer with Attention in Keras for multi-label text classification neural network
I am new to attention mechanisms and I want to learn more about them by working through some practical examples. I came across a Keras implementation of multi-head attention on this website: Pypi keras multi-head. I found two different ways to implement it in Keras.
One way is to use multi-head attention as a Keras wrapper layer around either an LSTM or a CNN.
This is a snippet implementing multi-head as a wrapper layer around an LSTM in Keras. The example is taken from the keras multi-head website:
import keras
from keras_multi_head import MultiHead
model = keras.models.Sequential()
model.add(keras.layers.Embedding(input_dim=100, output_dim=20, name='Embedding'))
model.add(MultiHead(keras.layers.LSTM(units=64), layer_num=3, name='Multi-LSTMs'))
model.add(keras.layers.Flatten(name='Flatten'))
model.add(keras.layers.Dense(units=4, activation='softmax', name='Dense'))
model.build()
model.summary()
The other way is to use it separately as a stand-alone layer.
This is a snippet of the second implementation, multi-head as a stand-alone layer, also taken from keras multi-head:
import keras
from keras_multi_head import MultiHeadAttention
input_layer = keras.layers.Input(shape=(2, 3), name='Input')
att_layer = MultiHeadAttention(head_num=3, name='Multi-Head')(input_layer)
model = keras.models.Model(inputs=input_layer, outputs=att_layer)
model.compile(optimizer='adam', loss='mse', metrics={})
I have been trying to find documentation that explains this, but I have not found any yet.
Update:
What I have found is that the second implementation (MultiHeadAttention) is more like the Transformer paper "Attention Is All You Need". However, I am still struggling to understand the first implementation, the wrapper layer.
Does the first one (as a wrapper layer) combine the output of the multi-head with the LSTM?
I was wondering if someone could explain the idea behind them, especially the wrapper layer.
I understand your confusion. From my experience, what MultiHead (this wrapper) does is duplicate (or parallelize) layers to form a kind of multi-channel architecture, where each channel can be used to extract different features from the input.
For instance, each channel can have a different configuration, and their outputs are later concatenated to make an inference. So, MultiHead can be used to wrap conventional architectures to form a multi-head CNN, multi-head LSTM, etc.
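To make that concrete, here is a rough functional-API sketch of what I understand the wrapper to be doing (my reading of it, not the library's actual code): several copies of the wrapped layer run in parallel and their outputs are merged into one representation.

import keras

inputs = keras.layers.Input(shape=(None,))
x = keras.layers.Embedding(input_dim=100, output_dim=20)(inputs)
# three parallel "channels"/"heads", each its own LSTM with its own weights
channels = [keras.layers.LSTM(units=64)(x) for _ in range(3)]
merged = keras.layers.Concatenate()(channels)            # (batch, 3 * 64)
outputs = keras.layers.Dense(units=4, activation='softmax')(merged)
model = keras.models.Model(inputs, outputs)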
Note that the attention layer is different. You may stack attention layers to form a new architecture. You may also parallelize the attention layer (MultiHeadAttention) and configure each layer as explained above. See here for different implementations of the attention layer.
I am using Google's Dopamine framework to train a specific reinforcement learning use case. I am using an autoencoder to pre-train the convolutional layers of the Deep Q Network and then transfer those pre-trained weights to the final network.
To that end, I have created a separate model (in this case an autoencoder), which I train, and I save the resulting model and weights.
The DQN model is created using Keras's model sub-classing method, and the model used to save the trained convolutional layers' weights was built using the Sequential API. My issue arises when trying to load the pre-trained weights into my final DQN model. Depending on whether I use the load_model() or load_weights() functionality from TensorFlow's API, I get two different overall behaviors from my network, and I would like to understand why. Specifically, I have the following two scenarios (a rough sketch of both follows below):
Loading the weights into the final model with the load_weights() method. The weights are those of the encoder plus one additional layer (added just before saving the weights) to fit the architecture of the final network implemented in Dopamine, where they are loaded.
First loading the saved model with load_model() and then, when defining the new model in the __init__() method, extracting the relevant layers from the loaded model and using them in the final model.
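To make the comparison concrete, here is a rough, self-contained toy sketch of the two loading routes as I understand them; the shapes, file name and layer choices are placeholders, not the actual Dopamine/DQN code.

import tensorflow as tf
from tensorflow.keras import layers, models

def make_encoder():
    return models.Sequential([
        layers.Conv2D(32, 3, activation='relu', input_shape=(84, 84, 4)),
        layers.Conv2D(64, 3, activation='relu'),
        layers.Flatten(),
        layers.Dense(512, activation='relu'),   # stands in for the extra layer added before saving
    ])

encoder = make_encoder()
encoder.save('encoder.h5')

# Scenario 1: rebuild the same architecture, then load only the weights
model_1 = make_encoder()
model_1.load_weights('encoder.h5')

# Scenario 2: load the whole saved model and reuse its layer objects directly
loaded = tf.keras.models.load_model('encoder.h5')
model_2 = models.Sequential(loaded.layers[:-1])   # e.g. drop the last Dense and keep the rest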
Overall, I would expect the two approaches to yield similar results with regard to the average reward achieved per episode when I use the same pre-trained weights. However, the two approaches differ (1. yields a higher average reward than 2., although both use the same pre-trained weights) and I don't understand why.
Furthermore, in order to validate this behavior, I tried loading random weights with the two aforementioned approaches to see whether the behavior changes. In both cases, depending on which of the two loading methods I use, I end up with behavior very similar to the respective case with the trained weights. It seems like the pre-trained weights in each respective case have no effect on the overall training behavior. This might be irrelevant to the issue I am trying to investigate here, since it might simply be that the pre-trained weights don't offer any benefit overall, which is also possible.
Any thoughts and ideas on this would be much appreciated.
What is the difference between high-level and low-level libraries?
I understand that Keras is a high-level library and TensorFlow is a low-level library, but I'm still not familiar enough with these frameworks to understand what that means for high- vs low-level libraries.
Keras is a high-level Deep Learning (DL) 'API'. Key components of the API are:
Model - to define the Neural Network (NN).
Layers - building blocks of the NN model (e.g. Dense, Convolution).
Optimizers - different methods for doing gradient descent to learn the weights of the NN (e.g. SGD, Adam).
Losses - objective functions that the optimizer should minimize for use cases like classification and regression (e.g. categorical_crossentropy, MSE).
Moreover, it provides reasonable defaults for the APIs e.g. learning rates for Optimizers, which would work for the common use cases. This reduces the cognitive load on the user during the learning phase.
The 'Guiding Principles' section here is very informative:
https://keras.io/
The mathematical operations involved in running the neural networks themselves, like convolutions, matrix multiplications etc., are delegated to the backend. One of the backends supported by Keras is TensorFlow.
To highlight the differences with a code snippet:
Keras
# Define Neural Network
model = Sequential()
# Add Layers to the Network
model.add(Dense(512, activation='relu', input_shape=(784,)))
....
# Define objective function and optimizer
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(),
              metrics=['accuracy'])
# Train the model for a certain number of epochs by feeding train/validation data
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))
Tensorflow
It ain't a code snippet anymore :) since you need to define everything starting from the Variables that would store the weights, the connections between the layers, the training loop, creating batches of data to do the training etc.
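For a flavour of it, here is a rough TF1-style (graph-mode) sketch, illustrative only and not a complete MNIST script, with the variables, the wiring between layers and the training loop all written by hand:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])

W = tf.Variable(tf.random_normal([784, 10]))   # weights you allocate yourself
b = tf.Variable(tf.zeros([10]))
logits = tf.matmul(x, W) + b                   # you wire up the layers by hand

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits))
train_op = tf.train.AdamOptimizer().minimize(loss)

batches = []   # you also write your own minibatch iterator; left empty in this sketch
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for batch_x, batch_y in batches:
        sess.run(train_op, feed_dict={x: batch_x, y: batch_y})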
You can refer to the links below to understand the difference in code complexity when training MNIST (the DL 'Hello World' example) in Keras vs TensorFlow.
https://github.com/keras-team/keras/blob/master/examples/mnist_mlp.py
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/multilayer_perceptron.py
Considering the benefits that come with Keras, Tensorflow has made tf.keras the high-level API in Tensorflow 2.0.
https://www.tensorflow.org/tutorials/
High level means that your interactions are closer to writing English, and the code you write is essentially more understandable to humans.
An example of low level would be a language in which you would have to do things such as allocate memory, copy data from one memory address to another etc.
Keras is considered high level because you can make a neural network in just a few lines of code, the library will handle all the complexity for you.
In tensorflow (I haven't used it), you probably have to write many more lines of code to achieve the same thing, but probably have a greater degree of control. Reading tensorflow code for a NN would be less meaningful to a layman than reading keras code for a NN.
Keras sits on top of Tensorflow, and thus the framework is relatively 'higher-level' than Tensorflow itself.
A 'high-level' language or framework is typically one that sits at a greater distance from the underlying machine code, with more layers of abstraction (and dependencies) in between, relative to a lower-level language or framework.
E.g., jQuery would be considered higher-level than JavaScript, as it depends on JavaScript, whereas JavaScript would be considered higher-level than assembly code, as it is ultimately compiled down to (or interpreted as) machine-level instructions.
I have implemented a form of the LeNet model via TensorFlow and Python for a car number plate recognition system. My model was trained solely on my training data and tested on the test data. My dataset contains segmented images where every image has only one character in it. This is what my data looks like. The model I created does not perform very well, so I'm now looking for models I can use via transfer learning. Since most models are already trained on a humongous dataset, I looked over a few like AlexNet, ResNet, GoogLeNet and Inception v2. Most of these models have not been trained on the type of data that I want, which would be letters and digits.
Question: Should I still go forward with one of these models and train it on my dataset, or are there better models that would help? For such models, would Keras be a better option since it is more high-level than TensorFlow?
Question: I'd prefer to work with the LeNet model itself, since training the other models would definitely take a long time due to the insufficient specs of my laptop. So is there any implementation of the model trained on machine-printed character images that I could use, so that I only need to train the final layers of the model on my data?
To get good results you should use a model explicitly designed for text recognition.
First, (roughly) crop the input image to the region around the text.
Then, feed the image of the text into a neural network (NN) to detect the text.
A typical NN for text recognition extracts relevant features (with convolutional NN), propagates those features through the image (with recurrent NN) and finally predicts a character score for each position in the image.
Usually, those networks are trained with the CTC loss.
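As a rough illustration of that pipeline (convolutional features, a recurrent layer over the horizontal positions, per-position character scores, CTC loss), here is a Keras-style sketch. The image shape, charset size and layer sizes are made up for illustration; this is not the CRNN authors' code.

from tensorflow.keras import layers, models, backend as K

num_chars = 37                                    # e.g. 26 letters + 10 digits + CTC blank

inputs = layers.Input(shape=(32, 128, 1))         # grayscale image of the text line
x = layers.Conv2D(64, 3, padding='same', activation='relu')(inputs)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
x = layers.MaxPooling2D((2, 2))(x)                # -> (8, 32, 128)
x = layers.Permute((2, 1, 3))(x)                  # put the width (time) axis first
x = layers.Reshape((32, 8 * 128))(x)              # one feature vector per horizontal position
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
y_pred = layers.Dense(num_chars, activation='softmax')(x)   # character scores per position

# CTC loss wired in via a Lambda layer -- the usual Keras pattern for CTC training
labels = layers.Input(shape=(None,), dtype='float32')
input_len = layers.Input(shape=(1,), dtype='int64')
label_len = layers.Input(shape=(1,), dtype='int64')
ctc = layers.Lambda(lambda a: K.ctc_batch_cost(a[0], a[1], a[2], a[3]))(
    [labels, y_pred, input_len, label_len])
train_model = models.Model([inputs, labels, input_len, label_len], ctc)
train_model.compile(optimizer='adam', loss=lambda y_true, y_out: y_out)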
As a starting point I would suggest looking at the CRNN implementation (they also provide a pre-trained model) [1] and the corresponding paper [2]. There is, as far as I remember, also a TensorFlow implementation on github.
You can use any framework you like (e.g. TensorFlow, CNTK, ...) as long as it features convolutional and recurrent NNs and the CTC loss.
I once attended a presentation about CNTK where they claimed that they have a very fast implementation of recurrent NN - so maybe CNTK would be a good choice for your slow computer?
[1] CRNN implementation: https://github.com/bgshih/crnn
[2] Shi - An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition
I'm guessing that DNN in the sense used in TensorFlow means "deep neural network". But I find this deeply confusing since the notion of a "deep" neural network seems to be in wide use elsewhere to mean a network with typically several convolutional and/or associated layers (ReLU, pooling, dropout, etc).
In contrast, in the first instance where many people will encounter this term (the tf.estimator Quickstart example code), we find:
# Build 3 layer DNN with 10, 20, 10 units respectively.
classifier = tf.estimator.DNNClassifier(feature_columns=feature_columns,
                                        hidden_units=[10, 20, 10],
                                        n_classes=3,
                                        model_dir="/tmp/iris_model")
This sounds suspiciously shallow, and even more suspiciously like an old-style multilayer perceptron (MLP) network. However, there is no mention of DNN as an alternative term on that close-to-definitive source. So is a DNN in the TensorFlow tf.estimator context actually an MLP? Documentation on the hidden_units parameter suggests this is the case:
hidden_units: Iterable of number hidden units per layer. All layers are fully connected. Ex. [64, 32] means first layer has 64 nodes and second one has 32.
That has MLP written all over it. Is this understanding correct? Is DNN therefore a misnomer, and if so should DNNClassifier ideally be deprecated in favour of MLPClassifier? Or does DNN stand for something other than deep neural network?
Give me your definition of a "deep" neural network and you'll get your answer.
But yes, it is simply an MLP, and a proper name would indeed be MLPClassifier. But that does not sound as cool as the current name.
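For what it's worth, something roughly equivalent written as a plain Keras MLP would look like this. This is a sketch only; I'm assuming the 4-feature iris input from the Quickstart and the estimator's default relu activations.

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(10, activation='relu', input_shape=(4,)),   # hidden_units=[10, 20, 10]
    layers.Dense(20, activation='relu'),
    layers.Dense(10, activation='relu'),
    layers.Dense(3, activation='softmax'),                    # n_classes=3
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])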
First of all, your definition of a DNN is a bit misleading.
There are several deep neural network architectures. Indeed, a Deep Feedforward Network is nothing more than a multilayered MLP, plus some techniques that make it attractive.
Some works have used "DNN" to span all Deep Learning architectures; however, by convention, "DNN" is used to refer to architectures that use deep feedforward propagation, also called Deep Feedforward Networks.
The most important example of a Deep Learning model is the Deep Feedforward Network, or Multilayer Perceptron (MLP). An MLP is just a mathematical function that maps some set of input values to output values. The function is formed by composing many simpler functions; you can think of each application of a different function as providing a new representation of the input.
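In symbols (my notation, just for illustration), a three-layer MLP is simply a nested composition: f(x) = f3(f2(f1(x))), where each fk(h) = σ(Wk·h + bk) applies a linear map followed by a nonlinearity.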
Therefore, it makes sense that this estimator is called DNNClassifier.
My advice is to read this book here.