will non-trainable layer participate in backpropagation of other layers? - tensorflow

In the below neural network, the 2nd layer is non trainable. when calculating gradient for the 1st layer, however, will the 2nd layer participate in?
In short, when a layer is set to non-trainable, will it affect the gradient descent of other layers?
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_2 (Dense) (None, 256) 200960
my_pca_2 (my_pca) (None, 10) 2570
=================================================================
Total params: 203,530
Trainable params: 200,960
Non-trainable params: 2,570
_________________________________________________________________

The second layer in the neural network is set to non-trainable. This only means that the weights in that layer will not be updated during the training process.
However, when calculating the gradient for the first layer, the second layer will still participate. This is because in the process of calculating the gradient the outputs of the second layer are used as inputs to the first layer, and therefore have an effect on the gradients calculated for the first layer. In other words, the non-trainable status of a layer only affects its own weight updates, but not its impact on the gradients of other layers.
It is the essense of backpropagation using the chain rule that states how you even calculate the gradients of each layer. There is no way a layer could not effect the gradients of its predecessor layer.

Related

how to calculate the confidence of a softmax layer

I am working on a multi-class computer vision classification task and using a CNN with FC layers stacked on top using softmax activation, the problem is that lets say im classifying animals categories, if i predicted what a rock image is it will return a high probability for the most similar category of animals due to using softmax activation that returns a probabilistic distribution compressed between 0 and 1. what can i use to determine the confidence of my models probability output to say whether i can rely on these probabilities or not.
PS:I dont want to add a no_label class
Is it possible using keras functional api to have 2 outputs of the model the pre_softmax and the softmax output without updating the weights according to a linear activation which is the pre_softmax layer since the training would be affected
Is it possible using keras functional api to have 2 outputs of the model the pre_softmax and the softmax output without updating the weights according to a linear activation which is the pre_softmax layer since the training would be affected
Yes. You can do it like this
input = tf.keras.layers.Input((128,128,3))
x = tf.keras.layers.Conv2D(32,3)(input)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(128)(x)
non_softmax_output = tf.keras.layers.Dense(10)(x)
softmax_output = tf.keras.layers.Softmax()(non_softmax_output)
model = tf.keras.models.Model(inputs=input,outputs=[non_softmax_output,softmax_output])
model.summary()
>>>
Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_3 (InputLayer) [(None, 128, 128, 3)] 0
conv2d_1 (Conv2D) (None, 126, 126, 32) 896
max_pooling2d_1 (MaxPooling (None, 63, 63, 32) 0
2D)
flatten_1 (Flatten) (None, 127008) 0
dense_23 (Dense) (None, 128) 16257152
dense_24 (Dense) (None, 10) 1290
softmax (Softmax) (None, 10) 0
=================================================================
Total params: 16,259,338
Trainable params: 16,259,338
Non-trainable params: 0
_________________________________________________________________
The easier alternative is to just work with the predictions from the softmax layer. You don't gather much from the linear layer without the activation. Those weights by themselves do not mean much. You could instead define a function outside the model that changes the predictions based on some threshold value
Assume you define only 1 output in the above model with a softmax layer. You can define a function like this to get predictions based on some threshold value you choose
def modify_predict(test_images,threshold):
predictions = model.predict(test_images)
max_values = np.max(predictions,axis=1)
labels = np.argmax(predictions,axis=1)
new_predictions = np.where(max_values > threshold, labels, 999) #You can use any indicator here instead of 999 for your no_label class
return new_predictions
On the first part of your question, the only way you can know how your
model will behave on non-animal pictures is by having non-animal pictures
in your data.
There are two options
The first is to include non-animal pictures in the training set (and dev and test sets), and to train the model to distinguish between animal / non-animal.
You could either build a separate binary classification model to distinguish animal/non-animal (as alrady suggesetd in comments), or you could integrate it into one model by having a
'non-animal' class. (Although I recognise you indicate this last option is
not something you want to do).
The second is to include non-animal pictures in the dev and test sets, but not in the training set. You can't then train the model to distinguish between animal and non-animal, but you can at least measure how it behaves on
non-animal pictures, and perhaps create some sort of heuristic for selecting only some of your model's predictions. This seems like a worse option to me, even though it's generally accepted that dev and test sets can come from a different distribution to the training set. It's something one might do if there were only a small number of non-animal pictures available, but that surely can't be the case here.
There is, for example, a large labelled image database
available at https://www.image-net.org/index.php

How to access a particular layer of Huggingface's pre-trained BERT model?

For experimentation purposes, I need to access an Embedding layer of the encoder. That is, assuming Tensorflow implementation, the layer defined as tf.keras.layers.Embedding(...).
For example, what is a way to set 'embeddings_regularizer=' argument of the Embedding() layer in the encoder part of the transformer?
You can iterate over the BERT model in the same way as any other model, like so:
for layer in model.layers:
if isinstance(layer ,tf.keras.layers.Embedding):
layer.embeddings_regularizer = argument
isinstance checks the type of the layer, so really you can put any layer type here and change what you need.
I haven't checked specifically whether embeddings_regularizer is available, however if you want to see what methods are available to that particular layer, run a debugger and call dir(layer) inside the above function.
Updated question
The TFBertForSequenceClassification model has 3 layers:
>>> model.summary()
Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
bert (TFBertMainLayer) multiple 108310272
_________________________________________________________________
dropout_37 (Dropout) multiple 0
_________________________________________________________________
classifier (Dense) multiple 1538
=================================================================
Total params: 108,311,810
Trainable params: 108,311,810
Non-trainable params: 0
Similarly, calling model.layers gives:
[<transformers.models.bert.modeling_tf_bert.TFBertMainLayer at 0x7efda85595d0>,
<tensorflow.python.keras.layers.core.Dropout at 0x7efd6000ae10>,
<tensorflow.python.keras.layers.core.Dense at 0x7efd6000afd0>]
We can access the layers inside TFBertMainLayer:
>>> model.layers[0]._layers
[<transformers.models.bert.modeling_tf_bert.TFBertEmbeddings at 0x7efda8080f90>,
<transformers.models.bert.modeling_tf_bert.TFBertEncoder at 0x7efda855ced0>,
<transformers.models.bert.modeling_tf_bert.TFBertPooler at 0x7efda84f0450>,
DictWrapper({'name': 'bert'})]
So from the above we can access the TFBertEmbeddings layer by:
model.layers[0].embeddings
OR
model.layers[0]._layers[0]
If you check the documentation (search for the "TFBertEmbeddings" class) you can see that this inherits a standard tf.keras.layers.Layer which means you have access to all the normal regularizer methods, so you should be able to call something like:
from tensorflow.keras import regularizers
model.layers[0].embeddings.activity_regularizer = regularizers.l2(1e-5)
Or whatever argument / regularizer you need to change. See here for regularizer docs.

Why does a 1x1 convolution layer work for feature reduction in a Neural Network Regression?

I would love some insight on this question - I've tried to find explanations in the literature, but I'm stumped. So I am building a neural network (using Keras) to solve a regression problem. I have ~500,000 samples with 20,000 features each, and am trying to predict a numerical output. Think predicting a house price based on a bunch of numerical measurements of the house, yard, etc. The features are arranged alphabetically so their neighboring features are fairly meaningless.
When I first tried to create a neural network, it suffered from severe overfitting if I provided all 20,000 features - manually reducing it to 1,000 features improved performance massively.
I read about 1x1 convolutional neural networks being used for feature reduction, but it was all used for images and 2D inputs.
So I built a basic neural network with 3 layers:
model = Sequential()
model.add(Conv1D(128, kernel_size=1, activation="relu", input_shape=(n_features,1)))
model.add(Flatten())
model.add(Dense(100, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))
I also reshaped my training set as input from n_samples, n_features to:
reshaped= X_train.reshape(n_samples, n_features, 1) to conform with the expected input of Conv1D.
Contrary to normal dense neural networks, this works as though I manually selected the top performing features. My questions is - why does this work?? Replacing the convolution layer with a dense layer completely kills the performance. Does this even have anything to do with feature reduction or is something else going on entirely?
I thought 2d images use 1x1 convolutions to reduce the channel dimensions of the image - but I only have 1 channel with 1x1 convolution, so what's being reduced? Does setting my 1D convolution layer filters to 128 mean I have selected 128 features which are subsequently fed to the next layer? Are the features selected based on loss back propagation?
I'm having a lot of trouble visualizing what is happening to the information from my features.
Lastly, what if I were to then add another convolution layer down the road? Is there a way to conceptualize what would happen if I added another 1x1 layer? Is it further subsampling of features?
Thank you!
Let's augment your model with a Dense layer with 128 units and observe the summary for two models.
Conv Model
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model, Sequential
n_features = 1000 # your sequence length
model = Sequential()
model.add(Conv1D(128, kernel_size=1, activation="relu", input_shape=(n_features,1)))
model.add(Flatten())
model.add(Dense(100, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d_1 (Conv1D) (None, 1000, 128) 256
_________________________________________________________________
flatten_1 (Flatten) (None, 128000) 0
_________________________________________________________________
dense_8 (Dense) (None, 100) 12800100
_________________________________________________________________
dense_9 (Dense) (None, 1) 101
=================================================================
Total params: 12,800,457
Trainable params: 12,800,457
Non-trainable params: 0
FC Model
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model, Sequential
n_features = 1000 # your sequence length
model = Sequential()
model.add(Dense(128, activation="relu", input_shape=(n_features,1)))
model.add(Flatten())
model.add(Dense(100, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))
model.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_10 (Dense) (None, 1000, 128) 256
_________________________________________________________________
flatten_2 (Flatten) (None, 128000) 0
_________________________________________________________________
dense_11 (Dense) (None, 100) 12800100
_________________________________________________________________
dense_12 (Dense) (None, 1) 101
=================================================================
Total params: 12,800,457
Trainable params: 12,800,457
Non-trainable params: 0
_____________________________
As you can see both models have an identical number of parameters in each layer. But inherently they are completely different.
Let's say we have the inputs with length 4 only. A 1 convolution with 3 filters will use 3 separate kernels on those 4 inputs, each kernel will operate on a single element of input at a time as we have chosen kernel_size = 1. So, each kernel is just a single scalar value which will be multiplied with the input array one element at a time (bias will be added). The thing here is the 1 convolution doesn't look anywhere besides the current input meaning it doesn't have any spatial freedom, it only looks at current input point at a time. (this will become useful for later explanation)
Now, with dense/fc layer each neuron is connected to each input, meaning the fc layer has full spatial freedom, it looks everywhere. The equivalent Conv layer will be something with a kernel_size = 1000 (the actual input length).
So, why Conv1D 1 convolution is maybe performing better?
Well, it's hard to tell without actually looking into data properties. But one guess would be you're using features that don't have any spatial dependency.
You have chosen the features randomly and probably mixing them (looking at many input features at once doesn't help but learns some extra noise). This could be the reason why you're getting better performance with a Conv layer which only looks one feature at a time instead of an FC layer which looks at all of them and mixes them.

Sentiment classifier training with Keras

I am using keras (backend tensorflow) to classify sentiments from Amazon review.
It starts with an embedding layer (which uses GloVe), then LSTM layer and finally a Dense layer as output. Model summary below:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, None, 100) 2258700
_________________________________________________________________
lstm_1 (LSTM) (None, 16) 7488
_________________________________________________________________
dense_1 (Dense) (None, 5) 85
=================================================================
Total params: 2,266,273
Trainable params: 2,266,273
Non-trainable params: 0
_________________________________________________________________
Train on 454728 samples, validate on 113683 samples
When training the train and eval accuracy is about 74% and loss (train and eval) around 0.6.
I've tried with changing amount of elements in LSTM layer, as well as including dropout, recurrent dropout, regularizer, and with GRU (instead of LSTM). Then the accuracy increased a bit (~76%).
What else could I try in order to improve my results?
I have had a great a better success with sentiment analysis using Bidirectional LSTM also stacking two layers vertically i.e 2 LSTMS together forming a deep network also helped and try to increase the number of lstm elements to be around 128.

What is the definition of a non-trainable parameter?

What is the definition of non-trainable parameter in a model?
For example, while you are building your own model, its value is 0 as a default, but when you want to use an inception model, it is becoming something else rather than 0. What would be the reason behind it?
In keras, non-trainable parameters (as shown in model.summary()) means the number of weights that are not updated during training with backpropagation.
There are mainly two types of non-trainable weights:
The ones that you have chosen to keep constant when training. This means that keras won't update these weights during training at all.
The ones that work like statistics in BatchNormalization layers. They're updated with mean and variance, but they're not "trained with backpropagation".
Weights are the values inside the network that perform the operations and can be adjusted to result in what we want. The backpropagation algorithm changes the weights towards a lower error at the end.
By default, all weights in a keras model are trainable.
When you create layers, internally it creates its own weights and they're trainable. (The backpropagation algorithm will update these weights)
When you make them untrainable, the algorithm will not update these weights anymore. This is useful, for instance, when you want a convolutional layer with a specific filter, like a Sobel filter, for instance. You don't want the training to change this operation, so these weights/filters should be kept constant.
There is a lot of other reasons why you might want to make weights untrainable.
Changing parameters:
For deciding whether weights are trainable or not, you take layers from the model and set trainable:
model.get_layer(layerName).trainable = False #or True
This must be done before compilation.
There are some details that other answers do not cover.
In Keras, non-trainable parameters are the ones that are not trained using gradient descent. This is also controlled by the trainable parameter in each layer, for example:
from keras.layers import *
from keras.models import *
model = Sequential()
model.add(Dense(10, trainable=False, input_shape=(100,)))
model.summary()
This prints zero trainable parameters, and 1010 non-trainable parameters.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 10) 1010
=================================================================
Total params: 1,010
Trainable params: 0
Non-trainable params: 1,010
_________________________________________________________________
Now if you set the layer as trainable with model.layers[0].trainable = True
then it prints:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 10) 1010
=================================================================
Total params: 1,010
Trainable params: 1,010
Non-trainable params: 0
_________________________________________________________________
Now all parameters are trainable and there are zero non-trainable parameters. But there are also layers that have both trainable and non-trainable parameters, one example is the BatchNormalization layer, where the mean and standard deviation of the activations is stored for use while test time. One example:
model.add(BatchNormalization())
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 10) 1010
_________________________________________________________________
batch_normalization_1 (Batch (None, 10) 40
=================================================================
Total params: 1,050
Trainable params: 1,030
Non-trainable params: 20
_________________________________________________________________
This specific case of BatchNormalization has 40 parameters in total, 20 trainable, and 20 non-trainable. The 20 non-trainable parameters correspond to the computed mean and standard deviation of the activations that is used during test time, and these parameters will never be trainable using gradient descent, and are not affected by the trainable flag.
Non-trainable parameters are quite a broad subject. A straightforward example is to consider the case of any specific NN model and its architecture.
Say we have already setup your network definition in Keras, and your architecture is something like 256->500->500->1. Based on this definition, we seem to have a Regression Model (one output) with two hidden layers (500 nodes each) and an input of 256.
One non-trainable parameters of your model is, for example, the number of hidden layers itself (2). Other could be the nodes on each hidden layer (500 in this case), or even the nodes on each individual layer, giving you one parameter per layer plus the number of layers itself.
These parameters are "non-trainable" because you can't optimize its value with your training data. Training algorithms (like back-propagation) will optimize and update the weights of your network, which are the actual trainable parameters here (usually several thousands, depending on your connections). Your training data as it is can't help you determine those non-trainable parameters.
However, this does not mean that numberHiddenLayers is not trainable at all, it only means that in this model and its implementation we are unable to do so. We could make numberHiddenLayers trainable; the easiest way would be to define another ML algorithm that takes this model as input and trains it with several values of numberHiddenLayers. The best value is obtained with the model that outperformed the others, thus optimizing the numberHiddenLayers variable.
In other words, non-trainable parameters of a model are those that you will not be updating and optimized during training, and that have to be defined a priori, or passed as inputs.
It is clear that if you freeze any layer of the network. all params on that frozen layer turn to non-trainable. On the other hand if you design your network from the scratch, it might have some non-trainable parameters too. For instance batchnormalization layer has 4 parameter which are;
[gamma weights, beta weights, moving_mean, moving_variance]
The first two of them are trainable but last two are not. So the batch normalization layer is highly probably the reason that your custom network has non-trainable paramteres.