Consider the following RNN architecture:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

regressor = Sequential()
# First LSTM layer: returns the hidden state at every time step
regressor.add(LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1], 1)))
regressor.add(Dropout(0.2))
regressor.add(LSTM(units=50, return_sequences=True))
regressor.add(Dropout(0.2))
regressor.add(LSTM(units=50, return_sequences=True))
regressor.add(Dropout(0.2))
# Last LSTM layer: return_sequences defaults to False, so only the final
# hidden state is returned
regressor.add(LSTM(units=50))
regressor.add(Dropout(0.2))
regressor.add(Dense(units=1))
Note that the timestep used is 60 and this architecture is developed for time series prediction (more specifically, stock price prediction). Here is the model summary:
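(Reconstructed from the architecture above; the output shapes and parameter counts follow directly from units=50 and the 60-step input.)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm (LSTM)                  (None, 60, 50)            10400
dropout (Dropout)            (None, 60, 50)            0
lstm_1 (LSTM)                (None, 60, 50)            20200
dropout_1 (Dropout)          (None, 60, 50)            0
lstm_2 (LSTM)                (None, 60, 50)            20200
dropout_2 (Dropout)          (None, 60, 50)            0
lstm_3 (LSTM)                (None, 50)                20200
dropout_3 (Dropout)          (None, 50)                0
dense (Dense)                (None, 1)                 51
=================================================================
Total params: 71,051
Trainable params: 71,051
Non-trainable params: 0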
I just need to check my understanding of this architecture. The first layer is an LSTM layer with 50 cells, so its output is a 3D tensor of shape (None, 60, 50): because return_sequences=True, it returns the hidden state of each of the 50 cells at every one of the 60 time steps. None here is a placeholder for the batch size, which is determined when calling fit or predict. These values are fully connected to the next LSTM layer, and the same applies to the second and third LSTM layers. For the fourth LSTM layer, because return_sequences is False, Keras returns only the hidden state at the final time step for each cell; since we have 50 cells, we get 50 values, hence the output shape (None, 50). These values are then connected to an output Dense layer with one neuron that computes the final value, hence the shape (None, 1). Is that correct, please?
Yes, you are right. An LSTM works on the principle of recurrence: you have to process the first element of a sequence before you can go further. One remark, though: adding Dropout layers between LSTM layers is not a good strategy; why not use dropout inside the LSTM layer itself? One more thing: use a kernel_regularizer on the last layer, which adds a penalty for large weights.
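A minimal sketch of that suggestion, keeping the 50 units and 0.2 dropout rate from the original model (the l2 factor of 0.01 is an assumption):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.regularizers import l2

regressor = Sequential()
# dropout applies to the layer inputs, recurrent_dropout to the recurrent state
regressor.add(LSTM(units=50, return_sequences=True, dropout=0.2,
                   recurrent_dropout=0.2, input_shape=(X_train.shape[1], 1)))
regressor.add(LSTM(units=50, dropout=0.2, recurrent_dropout=0.2))
# L2 penalty on the output layer's weights
regressor.add(Dense(units=1, kernel_regularizer=l2(0.01)))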
Related
I am trying to design a recurrent classification network with Keras. I have analyzed key characteristics of the frames of a video, and from them I want to identify when certain events occur during the video.
Specifically, I have a matrix (30 x 2) for each frame, which represents the positions of various given objects. From these positions, I would like the network to detect 4 different events, as well as in which frames they occur.
As an example, suppose I have the position of 30 cars in each frame already detected, and I want the network to learn to detect the frames in which:
a car stops
a car starts
two cars collide
a car turns
In each frame, one of these events or none of them (category 0) can occur, but never more than one.
Something remarkable is that to identify these 4 events it is necessary to know both the data of the previous frames and those of the later ones. For example, to know that two cars collide, it is necessary to know that beforehand both were in motion, as well as that after the collision neither moves.
Following this example, and just to clarify, suppose I have a sample of 100 frames, in which there is a crash at frames 4 and 75, a stop at 12, a start at 37, and turns at 3, 30, and 60. It would have an input of 100x30x2, and an output of 100x1.
After several hours, I get the feeling that I am not understanding something about how to describe the model to Keras.
So far I have been trying the following, with variations in the number of LSTM layers and the number of classification neurons:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
model.add(layers.LSTM(100, input_shape=(30, 2)))
model.add(layers.Dense(16, activation = 'relu'))
model.add(layers.Dense(5, activation = 'sigmoid'))
model.summary()
model.compile(loss='sparse_categorical_crossentropy', optimizer = 'Adam', metrics = ['SparseCategoricalAccuracy'])
I have also tried to introduce the variation
model3.add(layers.LSTM(100, input_shape=(30, 2), return_sequences=True))
so that not only the final output is taken into account. But if I don't add a Flatten layer it doesn't work, and I deduce that I am not understanding the matter well.
Edit 1:
After your advice, I have now the following model:
I start with the dataset, stored in an input xp and an output yp. Here I print the sizes of both variables:
xp.shape, yp.shape
((203384, 25, 2), (203384, 1))
Then I'm encoding yp with keras.utils, and I change the shape of each input element from a (25x2) matrix to a (1x50) vector:
n = len(xp)
yp_encoded = keras.utils.to_categorical(yp)
xp_reshaped = xp[0:n,:].reshape(n,1,50)
print(n, xp_reshaped.shape, yp_encoded.shape)
203384, (203384, 1, 50), (203384, 5)
Afterwards, I define the model as we discussed, with just LSTM layers:
batch_size = 10
model = keras.Sequential()
model.add(layers.LSTM(100, batch_input_shape=(batch_size, 1, 50), activation = 'relu', return_sequences = True, stateful=True))
model.add(layers.LSTM(5, stateful=True, activation = 'softmax'))
model.summary()
Model: "sequential_140"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_184 (LSTM) (10, 1, 100) 58800
lstm_185 (LSTM) (10, 5) 2120
=================================================================
Total params: 60,920
Trainable params: 60,920
Non-trainable params: 0
So, as I understand it, I have an LSTM model that takes inputs of shape (batch_size, 1, 50) and that I should fit against outputs of shape (batch_size, 5).
Then, I do the model compilation, and the fit:
model.compile(loss='categorical_crossentropy', optimizer = 'Adam', metrics = ['SparseCategoricalAccuracy'])
model.fit(xp_reshaped, yp_encoded, epochs = 5, batch_size = batch_size, shuffle = False)
I get the following error:
ValueError: Can not squeeze dim[1], expected a dimension of 1, got 5 for '{{node Squeeze}} = Squeeze[T=DT_FLOAT, squeeze_dims=[-1]](IteratorGetNext:1)' with input shapes: [10,5].
You're trying to solve a classification problem: classify frames as events with the classes start, stop, turn, collide, and the 0 class (nothing happens). You've chosen the right approach with LSTMs, though your architecture is not good.
All your layers should be LSTMs with stateful=True and return_sequences=True, you should transform your target variable to one-hot encoding, and you should set the last layer's activation to softmax; its output shape will be (n_frames, 5), one column per event class plus the 0 class.
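A minimal sketch of that advice, following it literally (including the softmax activation on the final LSTM); the layer sizes and xp_reshaped, yp_encoded, n are carried over from the question:

from tensorflow import keras
from tensorflow.keras import layers

batch_size = 10
model = keras.Sequential()
# stateful=True carries cell state across batches so earlier frames provide
# context; return_sequences=True keeps one prediction per frame
model.add(layers.LSTM(100, batch_input_shape=(batch_size, 1, 50),
                      return_sequences=True, stateful=True))
model.add(layers.LSTM(5, return_sequences=True, stateful=True,
                      activation='softmax'))
# One-hot targets need a categorical (not sparse) loss and metric, and with
# return_sequences=True they must be 3-D as well
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['categorical_accuracy'])
# Note: for stateful training the sample count must be divisible by batch_size
model.fit(xp_reshaped, yp_encoded.reshape(n, 1, 5), epochs=5,
          batch_size=batch_size, shuffle=False)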
I'm currently trying to train a custom model with tensorflow to detect 17 landmarks/keypoints on each of 2 hands shown in an image (fingertips, first knuckles, bottom knuckles, wrist, and palm), for 34 points (and therefore 68 total values to predict for x & y). However, I cannot get the model to converge, with the output instead being an array of points that are pretty much the same for every prediction.
I started off with a dataset that has images like this:
each annotated to have the red dots correlate to each keypoint. To expand the dataset to try to get a more robust model, I took photos of the hands with various backgrounds, angles, positions, poses, lighting conditions, reflectivity, etc, as exemplified by these further images:
I have about 3000 images created now, with the landmarks stored inside a csv as such:
I have a train-test split of .67 train, .33 test, with the images randomly assigned to each. I load the images with all 3 color channels, and scale both the color values and the keypoint coordinates to between 0 and 1.
I've tried a couple different approaches, each involving a CNN. The first keeps the images as they are, and uses a neural network model built as such:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Conv2D(filters=64, kernel_size=(3, 3), padding='same', activation='relu', input_shape=(225, 400, 3)))
model.add(Conv2D(filters=64, kernel_size=(3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=2))

# VGG-style blocks: (filters, number of conv layers per block)
filters_convs = [(128, 2), (256, 3), (512, 3), (512, 3)]
for n_filters, n_convs in filters_convs:
    for _ in np.arange(n_convs):
        model.add(Conv2D(filters=n_filters, kernel_size=(3, 3), padding='same', activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=2))

model.add(Flatten())
model.add(Dense(128, activation="relu"))
model.add(Dense(96, activation="relu"))
model.add(Dense(72, activation="relu"))
model.add(Dense(68, activation="sigmoid"))  # 34 keypoints x 2 coordinates in [0, 1]

opt = Adam(learning_rate=.0001)
model.compile(loss="mse", optimizer=opt, metrics=['mae'])
print(model.summary())
I've modified the various hyperparameters, yet nothing seems to make any noticeable difference.
The other thing I've tried is resizing the images to fit within a 224x224x3 array to use with a VGG-16 network, as such:
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Input, Flatten, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

vgg = VGG16(weights="imagenet", include_top=False,
            input_tensor=Input(shape=(224, 224, 3)))
vgg.trainable = False  # freeze the pretrained convolutional base

flatten = vgg.output
flatten = Flatten()(flatten)

points = Dense(256, activation="relu")(flatten)
points = Dense(128, activation="relu")(points)
points = Dense(96, activation="relu")(points)
points = Dense(68, activation="sigmoid")(points)

model = Model(inputs=vgg.input, outputs=points)
opt = Adam(learning_rate=.0001)
model.compile(loss="mse", optimizer=opt, metrics=['mae'])
print(model.summary())
This model has similar results to the first. No matter what I do, I get much the same outcome: my MSE loss bottoms out around .009, with an MAE around .07, regardless of how many epochs I run.
Furthermore, when I run predictions with the model, the predicted output is basically the same for every image, with only slight variation between each. The model seems to predict an array of coordinates that looks somewhat like a splayed hand, in the general areas where hands are most likely to be found: a catch-all solution that minimizes deviation, as opposed to a custom solution for each image. These images illustrate this, with green being the predicted points and red being the actual points for the left hand:
So, I was wondering what might be causing this, be it the model, the data, or both, because nothing I've tried with either modifying the model or augmenting the data seems to have done any good. I've even tried reducing the complexity to predict for one hand only, to predict a bounding box for each hand, and to predict a single keypoint, but no matter what I try, the results are pretty inaccurate.
Thus, any suggestions for what I could do to help the model converge to create more accurate & custom predictions for each image of hands it sees would be very greatly appreciated.
Thanks,
Sam
Usually, neural networks have a very hard time predicting exact coordinates of landmarks. A better approach is probably a fully convolutional network. This would work as follows:
You omit the dense layers at the end and thus end up with an output of shape (m, n, n_filters), with m and n being the dimensions of your downsampled feature maps (since you use max pooling at earlier stages in the network, they will be of lower resolution than your input image).
You set n_filters for the last (output) layer to the number of different landmarks you want to detect, plus one more to indicate "no landmark".
You remove some of the max pooling so that your final output has a fairly high resolution (so the m and n referenced earlier are bigger). Now your output has shape m x n x (n_landmarks + 1), and each of the m*n (n_landmarks + 1)-dimensional vectors indicates which landmark is present at the image position corresponding to that cell of the m x n grid. So the activation for your last output convolutional layer needs to be a softmax to represent probabilities.
Now you can train your network to predict the landmarks locally without having to use dense layers.
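As a minimal sketch of such a fully convolutional head (all sizes here are assumptions, not taken from the question):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D

n_landmarks = 34  # 17 keypoints on each of 2 hands
model = Sequential()
model.add(Conv2D(64, (3, 3), padding='same', activation='relu',
                 input_shape=(224, 224, 3)))
model.add(MaxPooling2D((2, 2)))  # pool sparingly so m and n stay large
model.add(Conv2D(128, (3, 3), padding='same', activation='relu'))
# One channel per landmark plus a "no landmark" channel; softmax over the
# channel axis gives a class distribution at each spatial position
model.add(Conv2D(n_landmarks + 1, (1, 1), activation='softmax'))
print(model.output_shape)  # (None, 112, 112, 35)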
This is a very simple architecture and for optimal results a more sophisticated architecture might be needed, but I think this should give you a first idea of a better approach than using the dense layers for the prediction.
As for the explanation of why your network predicts the same values every time: this is probably because your network is simply not able to learn what you want it to learn, because it is not suited to do so. If that is the case, the network will just learn to predict a value that is fairly good for most of the images (so basically the "average" position of each landmark across all of your images).
I am learning tensorflow2.0 and follow the tutorial. In the rnn example, I found the code:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.LSTM(rnn_units,
                             return_sequences=True,
                             stateful=True,
                             recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model
My question is: why does the code set the arguments return_sequences=True and stateful=True? What would happen with the default arguments?
The example in the tutorial is about text generation. This is the input that is fed to the network in a batch:
(64, 100, 65) # (batch_size, sequence_length, vocab_size)
return_sequences=True
Since the intention is to predict a character for every time step, i.e. for every character in the sequence, the next character needs to be predicted.
So return_sequences=True is set to get an output shape of (64, 100, 65). If this argument were set to False, only the last output would be returned, so for a batch of 64 the output would be (64, 65), i.e. for every sequence of 100 characters, only the last predicted character would be returned.
stateful=True
From the documentation,
"If True, the last state for each sample at index i in a batch will be used as initial state for the sample of index i in the following batch."
In the below diagram from the tutorial, you can see that setting stateful helps the LSTM make better predictions by providing the context of the previous prediction.
Return Sequences
Let's look at typical model architectures built using LSTMs.
Sequence to sequence models:
We feed in a sequence of inputs (x's), one batch at a time, and each LSTM cell returns an output (y_i). So if your input is of size batch_size x time_steps x input_size, then the LSTM output will be batch_size x time_steps x output_size. This is called a sequence-to-sequence model because an input sequence is converted into an output sequence. Typical usages of this model are taggers (POS tagger, NER tagger). In Keras this is achieved by setting return_sequences=True, as in the sketch below.
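A minimal tagger-style sketch (the step count, feature size, and tag count are assumptions):

import tensorflow as tf

# One prediction per time step: 100 steps of 32 features -> 100 tag distributions
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(100, 32)),
    tf.keras.layers.Dense(10, activation='softmax'),  # 10 hypothetical tags
])
print(model.output_shape)  # (None, 100, 10)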
Sequence classification - Many to one Architecture
In a many-to-one architecture we use the output state of only the last LSTM cell. This kind of architecture is normally used for classification problems, like predicting whether a movie review (represented as a sequence of words) is +ve or -ve. In Keras, if we set return_sequences=False, the model returns the output state of only the last LSTM cell, as in the sketch below.
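A minimal sentiment-classifier sketch (sizes again assumptions):

import tensorflow as tf

# Only the last LSTM state is kept: 100 steps of 32 features -> one score
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=False, input_shape=(100, 32)),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # P(review is +ve)
])
print(model.output_shape)  # (None, 1)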
Stateful
An LSTM cell is composed of many gates, as shown in the figure below from this blog post. The states/gates of the previous cell are used to calculate the state of the current cell. In Keras, if stateful=False the states are reset after each batch; if stateful=True the states from the previous batch for index i will be used as the initial state for index i in the next batch. So state information gets propagated between batches with stateful=True. Check this link for an explanation of the usefulness of statefulness, with an example.
Let's see the differences when playing around with the arguments:
import numpy as np
import tensorflow as tf

tf.keras.backend.clear_session()
tf.random.set_seed(42)  # TF 2.x equivalent of TF 1.x's tf.set_random_seed
X = np.array([[[1,2,3],[4,5,6],[7,8,9]],[[1,2,3],[4,5,6],[0,0,0]]], dtype=np.float32)
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(4, return_sequences=True, stateful=True,
                         recurrent_initializer='glorot_uniform')])
print(tf.keras.backend.get_value(model(X)).shape)
# (2, 3, 4)
print(tf.keras.backend.get_value(model(X)))
# [[[-0.16141939  0.05600287  0.15932009  0.15656665]
#   [-0.10788933  0.          0.23865232  0.13983202]
#   [-0.          0.          0.23865232  0.0057992 ]]
#  [[-0.16141939  0.05600287  0.15932009  0.15656665]
#   [-0.10788933  0.          0.23865232  0.13983202]
#   [-0.07900514  0.07872108  0.06463861  0.29855606]]]
So, if return_sequences is set to True, the model returns the full sequence of predictions.
tf.keras.backend.clear_session()
tf.random.set_seed(42)
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(4, return_sequences=False, stateful=True,
                         recurrent_initializer='glorot_uniform')])
print(tf.keras.backend.get_value(model(X)).shape)
# (2, 4)
print(tf.keras.backend.get_value(model(X)))
# [[-0. 0. 0.23865232 0.0057992 ]
# [-0.07900514 0.07872108 0.06463861 0.29855606]]
So, as the documentation states, if return_sequences is set to False, the model returns only the last output.
As for stateful, it is a bit harder to dive into. But essentially, when you have multiple batches of inputs, the last cell state at batch i will be used as the initial state at batch i+1. However, I think you will be more than fine going with the default settings.
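A small sketch of that carry-over, reusing X from above (the printed values will vary; only the inequality matters):

tf.keras.backend.clear_session()
tf.random.set_seed(42)
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(4, stateful=True, recurrent_initializer='glorot_uniform')])
# Two calls on the identical batch: the second starts from the final state
# of the first, so the outputs differ
out1 = tf.keras.backend.get_value(model(X))
out2 = tf.keras.backend.get_value(model(X))
print((out1 == out2).all())  # False
model.reset_states()  # back to zero initial states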
I made the following neural network model for sound recognition purposes. The flowchart is as follows:
cnn-lstm-dense-hybrid (please click here)
The idea is the following:
I have 2 different input layers, called A and B.
(i) Input A has 100 time steps, each step has a 64-dimensional feature vector
(ii) A 1D CNN layer (time-distributed) will extract features from each time step. The CNN layer contains 64 filters, each 16 taps long. Then a max-pooling layer will extract the single maximum value of each convolutional output, so a total of 64 features will be extracted at each time step.
(iii) The output of the CNN layer will be fed into an LSTM layer with 64 neurons. The number of recurrence steps is the same as the number of input time steps, i.e. 100. The LSTM layer should return a sequence of 64-dimensional outputs (sequence length == number of time steps == 100, so there should be 100*64 = 6400 numbers).
(iv) Meanwhile, input B also has 100 time steps, each step has a 65-dimensional feature vector, but they are treated differently from input A.
(v) Input B is fed into a time-distributed dense layer of 65 neurons, so it should produce a 65-dimensional output at each time step.
Now, at each time step, we have output from LSTM layer (64 neurons) and Dense layer (65 neurons), we concatenate them in a merge layer. Now we get a 129-dimensional vector at each time step.
We feed this vector into another dense layer, which produces the output (single neuron, which represents the probability of "is target sound")
A hand drawn illustration
However, I am stuck at the very beginning, trying to make step (i) work. The network-building code is below:
from tensorflow.keras.layers import (Input, TimeDistributed, Conv1D,
                                     BatchNormalization, MaxPooling1D,
                                     Dropout, LSTM, Dense, Flatten,
                                     concatenate)
from tensorflow.keras.models import Model

mfcc_input = Input(shape=(100,64), dtype='float', name='mfcc_input')
print(mfcc_input)
CNN_out = TimeDistributed(Conv1D(64, 16, activation='relu'))(mfcc_input)
CNN_out = BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True)(CNN_out)
CNN_out = TimeDistributed(MaxPooling1D(pool_size=(64-16+1), strides=None, padding='valid'))(CNN_out)
CNN_out = Dropout(0.4)(CNN_out)
LSTM_out = LSTM(64,return_sequences=True)(CNN_out)
## Auxilliary branch
delta_input = Input(shape=(100,64), dtype='float', name='delta_input')
zcr_input = Input(shape=(100,1), dtype='float', name='zcr_input')
aux_input = concatenate([delta_input, zcr_input])
aux_out = TimeDistributed(Dense(64+1))(aux_input)
### Merge branches
merged_layer = concatenate([LSTM_out, aux_out])
## Output layer
output = TimeDistributed(Dense(1))(merged_layer)
model = Model(inputs=[mfcc_input, delta_input, zcr_input], outputs=[output])
model.compile(optimizer='rmsprop', loss='binary_crossentropy',
loss_weights=[1., 0.2])
...(other code here) ...
The error at "CNN_out = TimeDistributed(Conv1D(64, 16, activation='relu'))(mfcc_input)" is: IndexError: list index out of range
Could anyone help? Greatly appreciated!
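One likely cause, offered as a hedged sketch rather than a confirmed fix: TimeDistributed(Conv1D(...)) applies the convolution to each time step separately, so each time step must itself be a 2D tensor of shape (length, channels), making the overall input 4D: (batch, time, length, channels). With Input(shape=(100, 64)) each step is a flat 64-vector, which the wrapper cannot index into, hence the error. Adding a channel axis lines the shapes up:

mfcc_input = Input(shape=(100, 64, 1), dtype='float', name='mfcc_input')
# Each time step is now a (64, 1) sequence the Conv1D can slide over
CNN_out = TimeDistributed(Conv1D(64, 16, activation='relu'))(mfcc_input)  # -> (None, 100, 49, 64)
CNN_out = TimeDistributed(MaxPooling1D(pool_size=64 - 16 + 1))(CNN_out)   # -> (None, 100, 1, 64)
# Collapse the singleton pooling axis so the LSTM receives 3D input
CNN_out = TimeDistributed(Flatten())(CNN_out)                             # -> (None, 100, 64)
LSTM_out = LSTM(64, return_sequences=True)(CNN_out)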
I am trying to train an LSTM with Keras and the TensorFlow backend, but it seems to always underfit; the loss and validation-loss curves have an initial drop and then flatten out very quickly (see image). I have tried adding more layers and more neurons, removing dropout, etc., but I can't get it anywhere near overfitting, and I do have a good amount of data (almost 4 hours at 100 samples per second; I have also tried downsampling to 50/sec).
My problem is multidimensional time series prediction with continuous values.
Any ideas would be appreciated!
Here is my basic keras architecture:
data_dim = 30 #input dimensions => each timestep has 30 features
timesteps = 200
out_dim = 30 #output dimensions => each predicted output timestep has 30 dimensions
batch_size = 50
num_epochs = 300
learning_rate = 0.0005 #tried values between around 0.001 and 0.0003
decay=0.9
#hidden layers size
h1 = 120
h2 = 340
h3 = 340
h4 = 120
model = Sequential()
model.add(LSTM(h1, return_sequences=True,input_shape=(timesteps, data_dim)))
model.add(LSTM(h2, return_sequences=True))
model.add(LSTM(h3, return_sequences=True))
model.add(LSTM(h4, return_sequences=True))
model.add(Dense(out_dim, activation='linear'))
rmsprop_optim = keras.optimizers.RMSprop(lr=learning_rate, rho=0.9, epsilon=1e-08, decay=decay)
model.compile(loss='mean_squared_error', optimizer=rmsprop_optim, metrics=['mse'])
#data preparation
[x_train, y_train] = readData()
x_train = x_train.reshape((int(num_samples/timesteps),timesteps,data_dim))
y_train = y_train.reshape((int(num_samples/timesteps),timesteps,out_dim))
history_callback = model.fit(x_train, y_train, validation_split=0.1,
batch_size=batch_size, epochs=num_epochs,shuffle=False,callbacks=[checkpointer, losses])
When you say 0.06 MSE is underfitting, this depends a lot on the data distribution. MSE is a relative term, so if the data is not normalized, 0.06 might even be overfitting. In such cases, pre-processing might help. Also, check whether there is significant noise in the data.
Using 4 LSTM layers with large sizes means a lot of parameters to learn. A smaller number of layers might be enough.
Try a non-linear activation in the final layer.
I suspect that your model only learns the weights of the Dense layer properly, but not those of the LSTM layers below it. As a quick check, what kind of performance do you get when you remove all LSTM layers and replace the Dense with a TimeDistributed(Dense(...)) layer (see the sketch below)? If your graphs look the same, training doesn't work, i.e. the error gradients with respect to the lower layer weights may be too small. Another way of checking this is to inspect the gradients directly and/or to compare the final weights after training with the initial weights. If this is indeed the problem, you can try the following: 1) standardize your inputs, 2) use a smaller learning rate (decrease logarithmically), 3) add skip-layer connections, 4) use Adam instead of RMSprop, and 5) train for more epochs.
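A minimal sketch of that quick check, reusing the dimensions defined in the question:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import TimeDistributed, Dense

# Baseline with no LSTMs: one Dense layer applied independently at each
# time step. If its loss curves match the LSTM model's, the LSTM layers
# in the original model are probably not learning.
baseline = Sequential()
baseline.add(TimeDistributed(Dense(out_dim, activation='linear'),
                             input_shape=(timesteps, data_dim)))
baseline.compile(loss='mean_squared_error', optimizer='adam', metrics=['mse'])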