I'm trying to add an attention layer to improve interpretability of my multi-label classification deep learning model. The attention layer is intended to be at the very input into the network and to be fed time series input of shape (250, 5, 6). There are no recurrent relations in the network. Considering the input is 3D and time series I'm not sure what kind of Attention layers would be the most effective. I've tried simple Keras Attention and Multi-headed Attention, but I've concluded simple attention is not what I need and Multi-headed attention was taking ages to get trained. Any ideas what could be useful to look into?
Related
Below is the taken from the tensorflow model optimization guide. The 3rd bullet point is highlighted.
What are the critical layers here? Is it Conv2D, activation, BatchNorm? Is there any list? Can someone please help.
Tips for better model accuracy:
It's generally better to finetune with quantization aware training as opposed to >training from scratch.
Try quantizing the later layers instead of the first layers.
Avoid quantizing critical layers (e.g. attention mechanism).
I have been playing around with neural networks for quite a while now, and recently came across the terms "freezing" & "unfreezing" the layers before training a neural network while reading about transfer learning & am struggling with understanding their usage.
When is one supposed to use freezing/unfreezing?
Which layers are to freezed/unfreezed? For instance, when I import a pre-trained model & train it on my data, is my entire neural-net except the output layer freezed?
How do I determine if I need to unfreeze?
If so how do I determine which layers to unfreeze & train to improve model performance?
I would just add to the other answer that this is most commonly used with CNNs and the amount of layers that you want to freeze (not train) is "given" by the amount of similarity between the task that you are solving and the original one (the one that the original network is solving).
If the tasks are very similar, let's say that you are using CNN pretrained on imagenet and you just want to add some other "general" objects that the network should recognize then you might get away with training just the dense top of the network.
The more dissimilar the tasks are, the more layers of the original network you will need to unfreeze during the training.
By freezing it means that the layer will not be trained. So, its weights will not be changed.
Why do we need to freeze such layers?
Sometimes we want to have deep enough NN, but we don't have enough time to train it. That's why use pretrained models that already have usefull weights. The good practice is to freeze layers from top to bottom. For examle, you can freeze 10 first layers or etc.
For instance, when I import a pre-trained model & train it on my data, is my entire neural-net except the output layer freezed?
- Yes, that's may be a case. But you can also don't freeze a few layers above the last one.
How do I freeze and unfreeze layers?
- In keras if you want to freeze layers use: layer.trainable = False
And to unfreeze: layer.trainable = True
If so how do I determine which layers to unfreeze & train to improve model performance?
- As I said, the good practice is from top to bottom. You should tune the number of frozen layers by yourself. But take into account that the more unfrozen layers you have, the slower is training.
When training a model while transfer layer, we freeze training of certain layers due to multiple reasons, such as they might have already converged or we want to train the newly added layers to an already pre-trained models. This is a really basic concept of Transfer learning and I suggest you go through this article if you have no idea about transfer learning .
I am trying to add an attention layer for my text classification model. The inputs are texts (e.g. movie review), the output is a binary outcome (e.g. positive vs negative).
model = Sequential()
model.add(Embedding(max_features, 32, input_length=maxlen))
model.add(Bidirectional(CuDNNGRU(16,return_sequences=True)))
##### add attention layer here #####
model.add(Dense(1, activation='sigmoid'))
After some searching, I found a couple of read-to-use attention layers for keras. There is the keras.layers.Attention layer that is built-in in Keras. There is also the SeqWeightedAttention and SeqSelfAttention layer in the keras-self-attention package. As a person who is relatively new to the deep learning field, I have a hard time understanding the mechanism behind these layers.
What does each of these lays do? Which one will be the best for my model?
Thank you very much!
If you are using RNN, I would not recommend using the keras.layers.Attention class.
While analysing tf.keras.layers.Attention Github code to better understand how to use the same, the first line I could come across was - "This class is suitable for Dense or CNN networks, and not for RNN networks"
There is another open source version maintained by CyberZHG called
keras-self-attention. To the best of my knowledge this is NOT a part of the Keras or TensorFlow library and seems to be an independent piece of code. This contains the two classes you mentioned - SeqWeightedAttention & SeqSelfAttention layer classes. former returns a 2D value and latter a 3D value. So the SeqWeightedAttention should work for your situation. The former seems to be loosely based on Raffel et al and can be used for Seq classification, The latter seems to be a variation of Bahdanau.
In general, I would suggest you to write your own seq to classification model. The attention piece can be added in less than half a dozen lines of code (bare-bones essence)...much less than the time you would spend in integrating or debugging or understanding the code in these external libraries.
Please refer: Create an LSTM layer with Attention in Keras for multi-label text classification neural network
I am learning convolutional networks in Tensorflow. I wonder if there is any tutorials of using TF to investigate a pre-trained convnet model, like these excellent tutorials for Caffe: this and this. I mean, how to access middle layers, get its learned parameters and blobs, to customize input shape to accept arbitrary image size or batch size, etc.
It's not quite the same thing, but there's a codelab here that shows you how to remove the top layer of a pretrained network and train up a new one on your own data:
https://codelabs.developers.google.com/codelabs/tensorflow-for-poets/index.html?index=..%2F..%2Findex#0
It might give you some ideas on how to approach this in TensorFlow.
I'm using TensorFlow to train a Convolutional Neural Network (CNN) for a sign language application. The CNN has to classify 27 different labels, so unsurprisingly, a major problem has been addressing overfitting. I've taken several steps to accomplish this:
I've collected a large amount of high-quality training data (over 5000 samples per label).
I've built a reasonably sophisticated pre-processing stage to help maximize invariance to things like lighting conditions.
I'm using dropout on the fully-connected layers.
I'm applying L2 regularization to the fully-connected parameters.
I've done extensive hyper-parameter optimization (to the extent possible given HW and time limitations) to identify the simplest model that can achieve close to 0% loss on training data.
Unfortunately, even after all these steps, I'm finding that I can't achieve much better that about 3% test error. (It's not terrible, but for the application to be viable, I'll need to improve that substantially.)
I suspect that the source of the overfitting lies in the convolutional layers since I'm not taking any explicit steps there to regularize (besides keeping the layers as small as possible). But based on examples provided with TensorFlow, it doesn't appear that regularization or dropout is typically applied to convolutional layers.
The only approach I've found online that explicitly deals with prevention of overfitting in convolutional layers is a fairly new approach called Stochastic Pooling. Unfortunately, it appears that there is no implementation for this in TensorFlow, at least not yet.
So in short, is there a recommended approach to prevent overfitting in convolutional layers that can be achieved in TensorFlow? Or will it be necessary to create a custom pooling operator to support the Stochastic Pooling approach?
Thanks for any guidance!
How can I fight overfitting?
Get more data (or data augmentation)
Dropout (see paper, explanation, dropout for cnns)
DropConnect
Regularization (see my masters thesis, page 85 for examples)
Feature scale clipping
Global average pooling
Make network smaller
Early stopping
How can I improve my CNN?
Thoma, Martin. "Analysis and Optimization of Convolutional Neural Network Architectures." arXiv preprint arXiv:1707.09725 (2017).
See chapter 2.5 for analysis techniques. As written in the beginning of that chapter, you can usually do the following:
(I1) Change the problem definition (e.g., the classes which are to be distinguished)
(I2) Get more training data
(I3) Clean the training data
(I4) Change the preprocessing (see Appendix B.1)
(I5) Augment the training data set (see Appendix B.2)
(I6) Change the training setup (see Appendices B.3 to B.5)
(I7) Change the model (see Appendices B.6 and B.7)
Misc
The CNN has to classify 27 different labels, so unsurprisingly, a major problem has been addressing overfitting.
I don't understand how this is connected. You can have hundreds of labels without a problem of overfitting.