Shape of the LSTM layers in multilayer LSTM model - tensorflow

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
The second layer has 64 hidden units and, since return_sequences=True, it will output sequences of size 64 as well. But how can that be fed to an LSTM with 32 hidden units? Won't it cause a shape mismatch error?

Actually no, it won't. First of all, the output size of the second layer is not 64 but 128. Because you are using a Bidirectional layer, the outputs of the forward and backward passes are concatenated, so your output will be (None, None, 64+64=128).
RNN data is shaped as (batch_size, time_steps, number_of_features). When you connect two layers with different numbers of units, only the feature dimension grows or shrinks to match the number of units of the layer that produces the output; the next layer accepts whatever feature size it is given.
For your particular code, this is how the model summary will look. So, to answer in short, there won't be a mismatch.
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 64) 32000
_________________________________________________________________
bidirectional (Bidirectional (None, None, 128) 66048
_________________________________________________________________
bidirectional_1 (Bidirection (None, 64) 41216
_________________________________________________________________
dense_2 (Dense) (None, 64) 4160
_________________________________________________________________
dense_3 (Dense) (None, 1) 65
=================================================================
Total params: 143,489
Trainable params: 143,489
Non-trainable params: 0
_________________________________________________________________
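The numbers in this summary can be reproduced by hand. The following is a minimal sketch in plain Python, assuming the standard LSTM parameter formula (four gates, each with an input kernel, a recurrent kernel and a bias) and the default merge_mode of concat; the helper names are made up for illustration, and the vocabulary size of 500 is inferred from 32000 / 64 in the summary rather than stated in the question.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with an input kernel, a recurrent kernel and a bias
    return 4 * ((input_dim + units) * units + units)


def bidirectional_lstm(input_dim, units):
    # One forward and one backward LSTM; concatenation doubles the feature size
    return 2 * lstm_params(input_dim, units), 2 * units


def dense_params(input_dim, units):
    # Kernel plus bias
    return input_dim * units + units


vocab_size, embed_dim = 500, 64  # 500 is inferred from 32000 / 64 (assumption)
embedding_params = vocab_size * embed_dim

bi1_params, bi1_features = bidirectional_lstm(embed_dim, 64)
bi2_params, bi2_features = bidirectional_lstm(bi1_features, 32)
dense1_params = dense_params(bi2_features, 64)
dense2_params = dense_params(64, 1)

total = embedding_params + bi1_params + bi2_params + dense1_params + dense2_params

print(bi1_features, bi1_params)  # 128 66048
print(bi2_features, bi2_params)  # 64 41216
print(total)                     # 143489
```

The second LSTM simply sees 128 input features instead of 64, which changes its parameter count but not its validity.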

Related

ValueError: Input 0 of layer sequential_40 is incompatible with the layer: expected min_ndim=3, found ndim=2. Full shape received: (None, 58)

I am working on a dataset about student performance in a course, and I want to predict student level (low, mid, high) according to their previous year's marks. I'm using a CNN for this purpose, but when I build and fit the model I get this error:
ValueError: Input 0 of layer sequential_40 is incompatible with the layer: : expected min_ndim=3, found ndim=2. Full shape received: (None, 58)
This is the code:
#reshaping data
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1]))
#checking the shape after reshaping
print(X_train.shape)
print(X_test.shape)
#normalizing the pixel values
X_train=X_train/255
X_test=X_test/255
#defining model
model=Sequential()
#adding convolution layer
model.add(Conv1D(32,3, activation='relu',input_shape=(28,1)))
#adding pooling layer
model.add(MaxPool1D(pool_size=2))
#adding fully connected layer
model.add(Flatten())
model.add(Dense(100,activation='relu'))
#adding output layer
model.add(Dense(10,activation='softmax'))
#compiling the model
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
#fitting the model
model.fit(X_train,y_train,epochs=10)
This is the output:
Model: "sequential_40"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d_23 (Conv1D) (None, 9, 32) 128
_________________________________________________________________
max_pooling1d_19 (MaxPooling (None, 4, 32) 0
_________________________________________________________________
flatten_15 (Flatten) (None, 128) 0
_________________________________________________________________
dense_30 (Dense) (None, 100) 12900
_________________________________________________________________
dense_31 (Dense) (None, 10) 1010
=================================================================
Total params: 14,038
Trainable params: 14,038
Non-trainable params: 0
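The error message says the model received 2-D input of shape (None, 58), while Conv1D needs 3-D input of the form (batch, steps, channels). The reshape in the question keeps the data 2-D. A minimal sketch of one likely fix, assuming each of the 58 columns is one step with a single channel: add a trailing axis of size 1 and declare input_shape=(58, 1) on the first Conv1D layer instead of (28, 1). The array below is random placeholder data, not the asker's dataset, and conv1d_output_length is a hypothetical helper written only for this illustration.

```python
import numpy as np

# Placeholder for the real feature matrix: 100 samples, 58 features each
X_train = np.random.rand(100, 58)

# Add a channel axis so every sample has shape (steps, channels) = (58, 1)
X_train_3d = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
print(X_train_3d.shape)  # (100, 58, 1)


def conv1d_output_length(steps, kernel_size):
    # Output length for 'valid' padding and stride 1 (the Conv1D defaults)
    return steps - kernel_size + 1


conv_len = conv1d_output_length(58, 3)  # after Conv1D(32, 3)
pool_len = conv_len // 2                # after MaxPool1D(pool_size=2)
flat_len = pool_len * 32                # after Flatten()
print(conv_len, pool_len, flat_len)     # 56 28 896
```

Since the task has three levels (low, mid, high), the final Dense layer would presumably need 3 units rather than 10, and dividing by 255 only makes sense for pixel data; both points are guesses from the description, not something the question confirms.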

How does model.weights in tensorflow/keras work?

I have a trained model.
Its summary is as follows:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 256) 2560
dense_1 (Dense) (None, 128) 32896
dropout (Dropout) (None, 128) 0
dense_2 (Dense) (None, 1) 129
=================================================================
Total params: 35,585
Trainable params: 35,585
Non-trainable params: 0
_________________________________________________________________
I extract the weights like this:
for i, weight in enumerate(Model.weights):
    exec('w{}=np.array(weight)'.format(i))
I have test data to predict on:
x=test_data.iloc[0]
Then I predict with the model:
Model.predict(np.array(x).reshape(1,9))
which gives array([[226241.66]], dtype=float32).
Then I predict with the weights:
((x@w0+w1)@w2+w3)@w4+w5
which gives array([98039.99664026]).
Can someone explain how the weights in the model work?
And how can I reproduce the model's prediction from the weights?
Try Model.layers which will return a list of all layers in your model, each layer has a function get_weights() which will return the weights as numpy arrays. I was able to reproduce the output of a simple 3 layer feed-forward model with this approach.
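# get_weights() returns [kernel, bias] for a Dense layer and [] for layers without weights
params = [layer.get_weights() for layer in model.layers]
X = np.random.randn(1, 9)
# Layers 1, 2 and 4 are the Dense layers here; index [0] is the kernel, [1] is the bias.
# This plain chain matches only if the Dense layers have no activation;
# otherwise apply each layer's activation after adding the bias.
manual = ((X @ params[1][0] + params[1][1]) @ params[2][0] + params[2][1]) @ params[4][0] + params[4][1]
np.allclose(manual, model.predict(X)) # True
Note: In my example, layer 0 was an input layer (no weights) and layer 3 a dropout layer (no weights). When calling model.predict(), dropout is not applied, so you can ignore it in this case.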
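One likely reason the manual product in the question differs from model.predict() is that a plain chain of matrix products skips each layer's activation function. The sketch below uses NumPy only and rests on assumptions: the two hidden Dense layers are taken to use ReLU and the output layer to be linear (the question does not show the activations), and random arrays stand in for the trained weights. The function names are illustrative.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def identity(z):
    return z


def dense_forward(x, params, activations):
    # Apply each Dense layer in turn: activation(h @ kernel + bias)
    h = x
    for (kernel, bias), activation in zip(params, activations):
        h = activation(h @ kernel + bias)
    return h


rng = np.random.default_rng(0)
# Same layer sizes as in the summary: 9 -> 256 -> 128 -> 1
shapes = [(9, 256), (256, 128), (128, 1)]
params = [(rng.normal(size=s), rng.normal(size=s[1])) for s in shapes]
n_params = sum(k.size + b.size for k, b in params)
print(n_params)  # 35585, matching "Total params" in the summary

x = rng.normal(size=(1, 9))
with_activations = dense_forward(x, params, [relu, relu, identity])
without_activations = dense_forward(x, params, [identity, identity, identity])
print(with_activations.shape)  # (1, 1)
print(with_activations, without_activations)
```

With real weights, the list of (kernel, bias) pairs would come from layer.get_weights() of each Dense layer, and the activation list would follow whatever the model was built with.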

Purpose of additional parameters in Quantization Nodes of TensorFlow Quantization Aware Training

Currently, I am trying to understand quantization aware training in TensorFlow. I understand, that fake quantization nodes are required to gather dynamic range information as a calibration for the quantization operation. When I compare the same model once as "plain" Keras model and once as quantization aware model, the latter has more parameters, which makes sense since we need to store the minimum and maximum values for activations during the quantization aware training.
Consider the following example:
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.models import Model
def get_model(in_shape):
    inpt = layers.Input(shape=in_shape)
    dense1 = layers.Dense(256, activation="relu")(inpt)
    dense2 = layers.Dense(128, activation="relu")(dense1)
    out = layers.Dense(10, activation="softmax")(dense2)
    model = Model(inpt, out)
    return model
The model has the following summary:
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 784)] 0
_________________________________________________________________
dense_3 (Dense) (None, 256) 200960
_________________________________________________________________
dense_4 (Dense) (None, 128) 32896
_________________________________________________________________
dense_5 (Dense) (None, 10) 1290
=================================================================
Total params: 235,146
Trainable params: 235,146
Non-trainable params: 0
_________________________________________________________________
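However, if I make my model quantization aware, it prints the following summary:
import tensorflow_model_optimization as tfmot
quantize_model = tfmot.quantization.keras.quantize_model
# Assumed: `standard` is the plain Keras model from above; (784,) matches the summary's input shape.
standard = get_model(in_shape=(784,))
# q_aware stands for quantization aware.
q_aware_model = quantize_model(standard)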
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 784)] 0
_________________________________________________________________
quantize_layer (QuantizeLaye (None, 784) 3
_________________________________________________________________
quant_dense_3 (QuantizeWrapp (None, 256) 200965
_________________________________________________________________
quant_dense_4 (QuantizeWrapp (None, 128) 32901
_________________________________________________________________
quant_dense_5 (QuantizeWrapp (None, 10) 1295
=================================================================
Total params: 235,164
Trainable params: 235,146
Non-trainable params: 18
_________________________________________________________________
I have two questions in particular:
What is the purpose of the quantize_layer with 3 parameters after the Input layer?
Why do we have 5 additional non-trainable parameters per layer and what are they used for exactly?
I appreciate any hint or further material that helps me (and others that stumble upon this question) understand quantization aware training.
The quantize layer is used to convert the float inputs to int8. The quantization parameters are used for output min/max and zero point calculations.
Quantized Dense Layers need a few additional parameters: min/max for kernel and min/max/zero-point for output activations.
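The counts in the two summaries can be checked with simple arithmetic, and the role of a stored min/max pair can be illustrated with a toy quantize-then-dequantize step. This is a simplified sketch in plain Python: the numbers are copied from the summaries above, and fake_quantize is a hypothetical helper showing uniform 8-bit quantization in general, not the exact operation TensorFlow performs.

```python
# Parameter counts copied from the two model summaries above
plain = {"dense_3": 200960, "dense_4": 32896, "dense_5": 1290}
quantized = {
    "quantize_layer": 3,
    "quant_dense_3": 200965,
    "quant_dense_4": 32901,
    "quant_dense_5": 1295,
}

# Extra parameters each wrapped Dense layer carries
extra_per_dense = {name: quantized["quant_" + name] - n for name, n in plain.items()}
non_trainable = quantized["quantize_layer"] + sum(extra_per_dense.values())
print(extra_per_dense)  # {'dense_3': 5, 'dense_4': 5, 'dense_5': 5}
print(non_trainable)    # 18


def fake_quantize(x, x_min, x_max, num_bits=8):
    # Snap x onto one of 2**num_bits evenly spaced levels in [x_min, x_max],
    # then map it back to a float (simplified illustration)
    levels = 2 ** num_bits - 1
    scale = (x_max - x_min) / levels
    clipped = min(max(x, x_min), x_max)
    q = round((clipped - x_min) / scale)
    return q * scale + x_min


print(fake_quantize(0.3, -1.0, 1.0))  # close to 0.3, within half a quantization step
print(fake_quantize(5.0, -1.0, 1.0))  # values outside the range are clipped to about 1.0
```

The min and max values tracked during training define the range used for this kind of rounding, which is why they have to be stored alongside the ordinary weights.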

How to read Keras's model structure?

For example:
BUFFER_SIZE = 10000
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, tf.compat.v1.data.get_output_shapes(train_dataset))
test_dataset = test_dataset.padded_batch(BATCH_SIZE, tf.compat.v1.data.get_output_shapes(test_dataset))
def pad_to_size(vec, size):
    zeros = [0] * (size - len(vec))
    vec.extend(zeros)
    return vec
...
model = tf.keras.Sequential([
tf.keras.layers.Embedding(encoder.vocab_size, 64),
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=False)),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(1, activation='sigmoid')
])
print(model.summary())
The print reads as:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 64) 523840
_________________________________________________________________
bidirectional (Bidirectional (None, 128) 66048
_________________________________________________________________
dense (Dense) (None, 64) 8256
_________________________________________________________________
dense_1 (Dense) (None, 1) 65
=================================================================
Total params: 598,209
Trainable params: 598,209
Non-trainable params: 0
I have the following questions:
1) For the embedding layer, why is the output shape (None, None, 64)? I understand '64' is the vector length. Why are the other two None?
2) How is the output shape of the bidirectional layer (None, 128)? Why is it 128?
For the embedding layer, why is the output shape (None, None, 64)? I understand '64' is the vector length. Why are the other two None?
If you don't define the input_shape of the first layer of the Sequential model, the input defaults to shape (None, None) including the batch dimension (in other words, input_shape=(None,) is used as the default).
If you pass an input tensor of size (None, None) to an embedding layer, it produces a (None, None, 64) tensor, assuming the embedding dimension is 64. The first None is the batch dimension and the second is the time dimension (it corresponds to the input_length parameter). That's why you get a (None, None, 64) sized output.
How is the output shape of the bidirectional layer (None, 128)? Why is it 128?
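An embedding layer is essentially a table lookup, which makes the shape rule easy to see. Here is a minimal NumPy sketch; the random table stands in for trained embeddings, and the vocabulary size of 8185 is inferred from 523840 / 64 in the summary rather than given in the question.

```python
import numpy as np

vocab_size, embed_dim = 8185, 64  # 8185 is inferred from 523840 / 64 (assumption)
table = np.random.rand(vocab_size, embed_dim)

# Batch size and sequence length can both vary; only the last axis is fixed
for batch, steps in [(2, 5), (3, 11)]:
    token_ids = np.random.randint(0, vocab_size, size=(batch, steps))
    embedded = table[token_ids]  # one 64-dimensional row per token id
    print(token_ids.shape, "->", embedded.shape)
# (2, 5) -> (2, 5, 64)
# (3, 11) -> (3, 11, 64)
```

The two None entries in the summary are exactly the two axes that change between these runs.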
Here, you have a Bidirectional LSTM. Your LSTM layer produces a (None, 64) sized output (when return_sequences=False). When you have a Bidirectional layer it is like having two LSTM layers (one going forward, other going backwards). And you have a default merge_mode of concat meaning that the two output states from forward and backward layers will be concatenated. This gives you a (None, 128) sized output.
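The concatenation step can be sketched with NumPy alone. Random arrays stand in for the outputs of the forward and backward LSTMs here, so this only illustrates the shapes, not the recurrent computation itself; the variable names are illustrative.

```python
import numpy as np

batch, steps, units = 4, 7, 64

# return_sequences=False: each direction yields one vector per sample
forward_last = np.random.rand(batch, units)
backward_last = np.random.rand(batch, units)
merged = np.concatenate([forward_last, backward_last], axis=-1)
print(merged.shape)  # (4, 128)

# return_sequences=True: each direction yields one vector per time step
forward_seq = np.random.rand(batch, steps, units)
backward_seq = np.random.rand(batch, steps, units)
merged_seq = np.concatenate([forward_seq, backward_seq], axis=-1)
print(merged_seq.shape)  # (4, 7, 128)
```

In both cases only the last axis doubles, from 64 to 128.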

Difference of calling the Keras pretrained model without including top layers

What is the difference of calling the VGG16 model with or without including top layers of the model? I wonder, why the input parameters to the layers are not shown in the model summary when the model is called without including the top layers. I used the VGG16 model in the following two ways:
from keras.applications import vgg16
model = vgg16.VGG16(weights='imagenet', include_top=False)
model.summary()
The output shapes in the summary do not show the input dimensions, i.e. (None, None, None, 64); please see below:
Layer (type) Output Shape Param
===================================================================
block1_conv1 (Conv2D) (None, None, None, 64) 1792
block1_conv2 (Conv2D) (None, None, None, 64) 36928
block1_pool (MaxPooling2D) (None, None, None, 64) 0
However, the following code returns the input parameters
from keras.applications import vgg16
model = vgg16.VGG16()
model.summary()
In this case, the output shapes of the layers do show the input dimensions:
Layer (type) Output Shape Param
==================================================================
block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
block1_pool (MaxPooling2D) (None, 112, 112, 64) 0
I would like to understand why this is the case. Please comment.
The top layers of VGG are fully-connected layers which are connected to the output of the convolutional base. These contain a fixed number of nodes, with the option to instantiate them with weights pretrained on imagenet. When instantiating a VGG model with the top layers included, the size of the architecture is therefore fixed, and the model will only accept images with a fixed input size of (224,224,3). Feeding the network images of other sizes would change the number of weights in the dense classification layers.
When you leave out the top classifier, however, you can feed images of varying size to the network, and the output of the convolutional stack will change accordingly. In this way, you can apply the VGG architecture to images of whatever size you choose and put your own densely connected classifier on top of it. In contrast with the dense layers, the number of weights in the convolutional layers stays the same; only the shape of their output changes.
You will notice all this when you instantiate a VGG model without the top layer, but with a specific input shape:
from keras.applications import vgg16
model = vgg16.VGG16(include_top=False, input_shape=(100,100,3))
model.summary()
Will produce:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_4 (InputLayer) (None, 100, 100, 3) 0
_________________________________________________________________
block1_conv1 (Conv2D) (None, 100, 100, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, 100, 100, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, 50, 50, 64) 0
_____________________________________________________________
etc.
It's interesting to see how the output shape of the convolutional layers changes as you call the architecture with different input shapes. For the above example with input shape (100,100,3), the last layers are:
block5_conv3 (Conv2D) (None, 6, 6, 512) 2359808
_________________________________________________________________
block5_pool (MaxPooling2D) (None, 3, 3, 512) 0
=================================================================
Total params: 14,714,688
Trainable params: 14,714,688
Non-trainable params: 0
While if you would instantiate the architecture with images of shape (400,400,3), you would get this output:
_________________________________________________________________
block5_conv3 (Conv2D) (None, 25, 25, 512) 2359808
_________________________________________________________________
block5_pool (MaxPooling2D) (None, 12, 12, 512) 0
=================================================================
Total params: 14,714,688
Trainable params: 14,714,688
Non-trainable params: 0
Note how the number of weights remains the same in both cases.
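The same behaviour can be worked out by hand. The sketch below is plain Python and assumes the standard VGG16 layout: five blocks of 3x3 convolutions with 'same' padding, each followed by a 2x2 max-pool that halves height and width (rounding down), and a first dense layer of 4096 units when the top is included. The function names are made up for illustration.

```python
# (number of conv layers, filters) for each of the five VGG16 blocks
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def conv_base_params(in_channels=3, kernel=3):
    # Conv weights depend only on kernel size and channel counts, not image size
    total = 0
    for n_convs, filters in VGG16_BLOCKS:
        for _ in range(n_convs):
            total += kernel * kernel * in_channels * filters + filters
            in_channels = filters
    return total


def final_feature_map(height, width):
    # Each block ends with a 2x2 max-pool that halves both spatial dimensions
    for _ in VGG16_BLOCKS:
        height, width = height // 2, width // 2
    return height, width, VGG16_BLOCKS[-1][1]


def first_dense_params(height, width, units=4096):
    # A dense layer on the flattened feature map does depend on image size
    h, w, c = final_feature_map(height, width)
    return h * w * c * units + units


print(conv_base_params())  # 14714688
for size in (100, 224, 400):
    print(size, final_feature_map(size, size), first_dense_params(size, size))
```

The convolutional total is the same for every input size, while the weight count of a dense layer placed on top changes with the feature map, which is why the full model is tied to 224x224 inputs.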