Feed CNN features to LSTM - tensorflow

I want to build an end-to-end trainable model with the following properties:
CNN to extract features from image
The features are reshaped into a matrix
Each row of this matrix is then fed to LSTM1
Each column of this matrix is then fed to LSTM2
The output of LSTM1 and LSTM2 are concatenated for the final output
(it's more or less similar to Figure 2 in this paper: https://arxiv.org/pdf/1611.07890.pdf)
My problem now is: after the reshape, how can I feed the values of the feature matrix to an LSTM with Keras or TensorFlow?
This is my code so far with the VGG16 net:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Reshape, LSTM

# VGG16
model = Sequential()
model.add(Conv2D(64, (3, 3), activation='relu', padding='same', input_shape=(224, 224, 3)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
# block 2
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
# block 3
model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(256, (3, 3), activation='relu'))
model.add(Conv2D(256, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
# block 4
model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(512, (3, 3), activation='relu'))
model.add(Conv2D(512, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
# block 5
model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(512, (3, 3), activation='relu'))
model.add(Conv2D(512, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
# block 6
model.add(Flatten())
model.add(Dense(4096, activation='relu'))
model.add(Dense(4096, activation='relu'))
# reshape the feature 4096 = 64 * 64
model.add(Reshape((64, 64)))
# How to feed each row of this to LSTM?
# This is my first solution but it doesn’t look correct:
# model.add(LSTM(256, input_shape=(64, 1))) # 256 hidden units, sequence length = 64, feature dim = 1

Consider building your CNN model with Conv2D and MaxPooling2D layers up to your Flatten layer, because the vectorized output of the Flatten layer will be the input data to the LSTM part of your structure.
So, build your CNN model like this:
model_cnn = Sequential()
model_cnn.add(Conv2D...)
model_cnn.add(MaxPooling2D...)
...
model_cnn.add(Flatten())
Now, here is an interesting point: the current version of Keras has some incompatibilities with certain TensorFlow structures that will not let you stack all your layers in a single Sequential object.
So it's time to use the Keras Model object to complete your neural network with a trick:
input_lay = Input(shape=(None, ?, ?, ?)) #dimensions of your data
time_distribute = TimeDistributed(Lambda(lambda x: model_cnn(x)))(input_lay) # keras.layers.Lambda is essential to make our trick work :)
lstm_lay = LSTM(?)(time_distribute)
output_lay = Dense(?, activation='?')(lstm_lay)
And finally, it's time to put our two separate models together:
model = Model(inputs=[input_lay], outputs=[output_lay])
model.compile(...)
Note: you can substitute your VGG (without the top layers) for my model_cnn example, since the vectorized output of the VGG Flatten layer will be the input to the LSTM model.
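For the specific row/column structure described in the question (each row to LSTM1, each column to LSTM2, outputs concatenated), here is a minimal functional-API sketch. It assumes the CNN part, up to the Reshape((64, 64)), has already produced the feature matrix; the 256 units and the 10-way softmax are placeholder choices, not values from the paper:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Permute, Concatenate, Dense

feature_matrix = Input(shape=(64, 64))  # stands in for the reshaped CNN output

# LSTM1 reads the matrix row by row: 64 timesteps of 64 features each.
row_lstm = LSTM(256)(feature_matrix)

# Transposing the matrix turns columns into rows, so LSTM2 reads it column by column.
col_lstm = LSTM(256)(Permute((2, 1))(feature_matrix))

# Concatenate both summaries for the final output.
merged = Concatenate()([row_lstm, col_lstm])
output = Dense(10, activation='softmax')(merged)

model_rc = Model(inputs=feature_matrix, outputs=output)
To make the whole thing end-to-end trainable, replace the Input with the output tensor of the CNN's Reshape layer.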

Related

Passing output of 3DCNN layer to LSTM layer

While trying to learn about Recurrent Neural Networks (RNNs), I am trying to train an automatic lip-reading model using a 3D CNN + LSTM. I tried out code I found for this on Kaggle.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv3D, MaxPooling3D, Reshape, LSTM, Dropout, Flatten, Dense

model = Sequential()
# 1st layer group
model.add(Conv3D(32, (3, 3, 3), strides = 1, input_shape=(22, 100, 100, 1), activation='relu', padding='valid'))
model.add(MaxPooling3D(pool_size=(2, 2, 2), strides=2))
model.add(Conv3D(64, (3, 3, 3), activation='relu', strides=1))
model.add(MaxPooling3D(pool_size=(2, 2, 2), strides=2))
model.add(Conv3D(128, (3, 3, 3), activation='relu', strides=1))
model.add(MaxPooling3D(pool_size=(2, 2, 2), strides=2))
shape = model.get_output_shape_at(0)
model.add(Reshape((shape[-1],shape[1]*shape[2]*shape[3])))
# LSTMS - Recurrent Network Layer
model.add(LSTM(32, return_sequences=True))
model.add(Dropout(.5))
model.add(Flatten())
# # FC layers group
model.add(Dense(2048, activation='relu'))
model.add(Dropout(.5))
model.add(Dense(1024, activation='relu'))
model.add(Dropout(.5))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='Adam', metrics=['accuracy'])
model.summary()
However, it returns the following error:
11 model.add(MaxPooling3D(pool_size=(2, 2, 2), strides=2))
12
---> 13 shape = model.get_output_shape_at(0)
14 model.add(Reshape((shape[-1],shape[1]*shape[2]*shape[3])))
15
RuntimeError: The layer sequential_2 has never been called and thus has no defined output shape.
From my understanding, the author of the code was trying to get the model's output shape and reshape it so as to forward it to the LSTM layer.
I found a similar post, made the following change, and the error was fixed:
shape = model.layers[-1].output_shape
# shape = model.get_output_shape_at(0)
Still, I am confused as to what the code does to forward the input from the CNN layers to the LSTM layer. Any help in understanding the above is appreciated. Thank you!
The code runs from top to bottom, and the inputs flow through the graph in the same order. You are getting this error because you cannot call this function in eager mode before the model has been called on data, and TensorFlow 2.0 runs fully in eager mode. Once you fit the model and train it for at least one epoch, you can use model.get_output_shape_at(0); until then, use model.layers[-1].output_shape.
The CNN layers extract features locally, and the LSTM then learns from those features sequentially. Using convolutions together with an LSTM is a good approach, but I would recommend directly using tf.keras.layers.ConvLSTM3D. Check it here: https://www.tensorflow.org/api_docs/python/tf/keras/layers/ConvLSTM3D
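As a hypothetical sketch of the ConvLSTM idea (shown here with the 2D sibling ConvLSTM2D, since each of the 22 timesteps in this data is a 2D 100x100 frame; ConvLSTM3D exposes the same interface for sequences of 3D volumes):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import ConvLSTM2D, Flatten, Dense

# Not the original author's solution, just an illustration of the ConvLSTM family:
# each sample is a sequence of 22 frames of shape (100, 100, 1).
sketch = Sequential()
sketch.add(ConvLSTM2D(32, (3, 3), input_shape=(22, 100, 100, 1)))
sketch.add(Flatten())
sketch.add(Dense(10, activation='softmax'))
And here is the original code with the shape fix applied: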
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv3D, MaxPooling3D, Reshape, LSTM, Dropout, Flatten, Dense

tf.keras.backend.clear_session()
model = Sequential()
# 1st layer group
model.add(Conv3D(32, (3, 3, 3), strides = 1, input_shape=(22, 100, 100, 1), activation='relu', padding='valid'))
model.add(MaxPooling3D(pool_size=(2, 2, 2), strides=2))
model.add(Conv3D(64, (3, 3, 3), activation='relu', strides=1))
model.add(MaxPooling3D(pool_size=(2, 2, 2), strides=2))
model.add(Conv3D(128, (3, 3, 3), activation='relu', strides=1))
model.add(MaxPooling3D(pool_size=(2, 2, 2), strides=2))
shape = model.layers[-1].output_shape
model.add(Reshape((shape[-1],shape[1]*shape[2]*shape[3])))
# LSTMS - Recurrent Network Layer
model.add(LSTM(32, return_sequences=True))
model.add(Dropout(.5))
model.add(Flatten())
# # FC layers group
model.add(Dense(2048, activation='relu'))
model.add(Dropout(.5))
model.add(Dense(1024, activation='relu'))
model.add(Dropout(.5))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='Adam', metrics=['accuracy'])
model.summary()
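As for what the Reshape actually forwards to the LSTM, a shape trace helps (my own calculation from the layer arguments above, batch dimension omitted):
# input                           (22, 100, 100, 1)
# Conv3D(32, (3, 3, 3))        -> (20, 98, 98, 32)
# MaxPooling3D((2, 2, 2), 2)   -> (10, 49, 49, 32)
# Conv3D(64, (3, 3, 3))        -> (8, 47, 47, 64)
# MaxPooling3D((2, 2, 2), 2)   -> (4, 23, 23, 64)
# Conv3D(128, (3, 3, 3))       -> (2, 21, 21, 128)
# MaxPooling3D((2, 2, 2), 2)   -> (1, 10, 10, 128)
# Reshape                      -> (128, 100)
# LSTM(32, return_sequences=True) -> (128, 32)
So the 128 channels become 128 timesteps, and each timestep is a flattened 1 * 10 * 10 = 100-dimensional feature vector; that sequence is what the LSTM consumes.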

ValueError: Shapes (None, 1) and (None, 5) are incompatible in keras

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential()
model.add(Conv2D(128, (3, 3), activation='relu', input_shape=(64, 64, 3), padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(5, activation = 'softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=[tf.keras.metrics.Recall()])
This code works fine with metrics=['accuracy'], but it raises ValueError: Shapes (None, 1) and (None, 5) are incompatible with metrics=[tf.keras.metrics.Recall()].
Please help me. Thanks in advance.
Recall makes sense only for binary classification. Your final layer has 5 nodes, which essentially means you have 5 classes. You should change Recall to another metric; the documentation should help you choose an appropriate one for your model. Categorical accuracy should be good enough to get started.
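For instance, a compile call along those lines might look like this (a minimal sketch; since the labels here are sparse integers, the sparse variant of categorical accuracy is the matching choice):
# Keep the sparse integer labels and use a metric that matches them.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['sparse_categorical_accuracy'])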

How to improve the performance of a CNN model (Machine Learning - Deep Learning)

I am trying to train a CNN-based model.
The training data consists of images of the lunar lander, but the model's performance is not good: the accuracy is about 45%. I tried adding more layers to improve it, but it still doesn't work well. Could anyone provide some ideas on how to improve it?
Labels: up, down, left, right (0, 1, 2, 3)
Sample rate: 0.1
The training data has been converted from images to data arrays.
Here is some of my code:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same', input_shape=(CHANNELS, ROWS, COLS), activation='relu'))
model.add(Conv2D(32, (3, 3), padding='same', activation='relu'))
model.add(Conv2D(32, (3, 3), padding='same', activation='relu'))
model.add(Conv2D(32, (3, 3), padding='same', activation='relu'))
model.add(Conv2D(32, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation = 'softmax'))
model.summary()
I would experiment with a lower dropout rate. Dropout is used to prevent overfitting, and it looks like your model isn't even fitting in the first place.
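As a minimal illustration (0.2 is just a placeholder value to tune against validation accuracy), that would mean replacing the Dropout line with something like:
model.add(Dropout(0.2))  # lower than 0.5, so the Dense layer keeps more capacity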

Tensorflow returns 10% validation accuracy for VGG model (irrespective of number of epochs)?

I am trying to train a neural network on CIFAR-10 using the keras package in tensorflow. The network is VGG-16, which I borrowed directly from the official keras models.
The definition is:
def cnn_model(nb_classes=10):
    # VGG-16 official keras model
    img_input = Input(shape=(32, 32, 3))
    vgg_layer = Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv1')(img_input)
    vgg_layer = Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv2')(vgg_layer)
    vgg_layer = MaxPooling2D((2, 2), strides=(2, 2), name='block1_pool')(vgg_layer)
    # Block 2
    vgg_layer = Conv2D(64, (3, 3), activation='relu', padding='same', name='block2_conv1')(vgg_layer)
    vgg_layer = Conv2D(64, (3, 3), activation='relu', padding='same', name='block2_conv2')(vgg_layer)
    vgg_layer = MaxPooling2D((2, 2), strides=(2, 2), name='block2_pool')(vgg_layer)
    # Block 3
    vgg_layer = Conv2D(128, (3, 3), activation='relu', padding='same', name='block3_conv1')(vgg_layer)
    vgg_layer = Conv2D(128, (3, 3), activation='relu', padding='same', name='block3_conv2')(vgg_layer)
    vgg_layer = Conv2D(128, (3, 3), activation='relu', padding='same', name='block3_conv3')(vgg_layer)
    vgg_layer = MaxPooling2D((2, 2), strides=(2, 2), name='block3_pool')(vgg_layer)
    # Block 4
    vgg_layer = Conv2D(256, (3, 3), activation='relu', padding='same', name='block4_conv1')(vgg_layer)
    vgg_layer = Conv2D(256, (3, 3), activation='relu', padding='same', name='block4_conv2')(vgg_layer)
    vgg_layer = Conv2D(256, (3, 3), activation='relu', padding='same', name='block4_conv3')(vgg_layer)
    vgg_layer = MaxPooling2D((2, 2), strides=(2, 2), name='block4_pool')(vgg_layer)
    # Classification block
    vgg_layer = Flatten(name='flatten')(vgg_layer)
    vgg_layer = Dense(1024, activation='relu', name='fc1')(vgg_layer)
    vgg_layer = Dense(1024, activation='relu', name='fc2')(vgg_layer)
    vgg_layer = Dense(nb_classes, activation='softmax', name='predictions')(vgg_layer)
    return Model(inputs=img_input, outputs=vgg_layer)
However during training, I always get both train and validation accuracy as 0.1 i.e, 10%.
validation accuracy for adv. training of model for epoch 1= 0.1
validation accuracy for adv. training of model for epoch 2= 0.1
validation accuracy for adv. training of model for epoch 3= 0.1
validation accuracy for adv. training of model for epoch 4= 0.1
validation accuracy for adv. training of model for epoch 5= 0.1
As a step towards debugging: whenever I replace it with any other model (e.g., any simple CNN model), the script works perfectly well, which shows that the rest of the script is fine.
For example, the following CNN model works perfectly well and achieves an accuracy of 75% after 30 epochs.
def cnn_model(nb_classes=10, num_hidden=1024, weight_decay=0.0001, cap_factor=4):
    model = Sequential()
    input_shape = (32, 32, 3)
    model.add(Conv2D(32 * cap_factor, kernel_size=(3, 3), strides=(1, 1), kernel_regularizer=keras.regularizers.l2(weight_decay), kernel_initializer="he_normal", activation='relu', padding='same', input_shape=input_shape))
    model.add(Conv2D(32 * cap_factor, kernel_size=(3, 3), strides=(1, 1), kernel_regularizer=keras.regularizers.l2(weight_decay), kernel_initializer="he_normal", activation="relu", padding="same"))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(BatchNormalization())
    model.add(Dropout(0.25))
    model.add(Conv2D(64 * cap_factor, kernel_size=(3, 3), strides=(1, 1), kernel_regularizer=keras.regularizers.l2(weight_decay), kernel_initializer="he_normal", activation="relu", padding="same"))
    model.add(Conv2D(64 * cap_factor, kernel_size=(3, 3), strides=(1, 1), kernel_regularizer=keras.regularizers.l2(weight_decay), kernel_initializer="he_normal", activation="relu", padding="same"))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(BatchNormalization())
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(num_hidden, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(nb_classes, activation='softmax'))
    return model
It appears to me that both of these models are correctly defined; however, one works perfectly while the other doesn't learn at all. I also tried writing the VGG model as a Sequential structure, i.e., similar to the second one, but it still gave me 10% accuracy.
Even if the model doesn't update any weights, the "he_normal" initializer alone should easily obtain much better accuracy than pure chance. It appears that somehow tensorflow is computing the output logits from the model in a way that results in accuracy at pure chance.
I would be really grateful if someone could point out my mistake.
Your 10% corresponds suspiciously well with the number of classes (10). That makes me think that, regardless of the training, your model gives the same answer for every input, which constantly yields 10% accuracy on 10 classes.
Check the output of the untrained model to see whether it is always the same class.
If so, check the initial weights of the model; it is probably initialized wrongly, the gradients are zero, and it can't converge.
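A quick way to run that check, as a sketch (x_test stands for whatever held-out batch you have at hand):
import numpy as np

# Predict with the untrained model and inspect how the predicted
# classes are distributed across a batch of inputs.
probs = model.predict(x_test[:256])
print(np.unique(probs.argmax(axis=1), return_counts=True))
# If a single class dominates, inspect the initial weights and logits.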

How to connect Conv3D output to MaxPooling2D in Keras?

I'm using Keras to implement a CNN. People often use Conv2D for classification tasks; however, I want to capture relationships between two images, so I decided to try Conv3D. The problem is that I couldn't get the output dimensions of Conv3D to match the following layers.
More specifically, I want to apply a (5, 5, 2) filter to two stacked images of shape (480, 640, 2) and output a (480, 640, 1) tensor.
Original Conv2D code: (works fine)
model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same',input_shape=(480, 640, 2)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
...
Conv3D code: (I don't know how to connect Conv3D and MaxPooling2D)
model.add(Conv3D(32, 2, input_shape=(480, 640, 2, 1), data_format="channels_last"))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
...
Stack both images (remember to stack according to the backend you are using: Theano is channels_first and TensorFlow is channels_last) and pass 2 as the number of channels to Conv2D.
Or, if each image has several channels, stack them all the same way and pass the total number of channels to Conv2D.
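A minimal sketch of that stacking for the channels_last (TensorFlow) case, assuming two grayscale images img1 and img2 of shape (480, 640) each (the names are placeholders):
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Activation, MaxPooling2D

# Stack along the last axis so the two images become two channels.
x = np.stack([img1, img2], axis=-1)   # shape (480, 640, 2)
x = x[np.newaxis, ...]                # add a batch dimension: (1, 480, 640, 2)

model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same', input_shape=(480, 640, 2)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))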