Keras LSTM always underfits - tensorflow

I am trying to train an LSTM with Keras and a TensorFlow backend, but it seems to always underfit; the loss and validation loss curves have an initial drop and then flatten out very quickly (see image). I have tried adding more layers, more neurons, removing dropout, etc., but I can't get it anywhere near overfitting, and I do have a fair amount of data (almost 4 hours at 100 samples per second; I have also tried downsampling to 50/sec).
My problem is multidimensional time series prediction with continuous values.
Any ideas would be appreciated!
Here is my basic Keras architecture:
import keras
from keras.models import Sequential
from keras.layers import LSTM, Dense

data_dim = 30        # input dimensions  => each timestep has 30 features
timesteps = 200
out_dim = 30         # output dimensions => each predicted output timestep has 30 dimensions
batch_size = 50
num_epochs = 300
learning_rate = 0.0005   # tried values roughly between 0.001 and 0.0003
decay = 0.9

# hidden layer sizes
h1 = 120
h2 = 340
h3 = 340
h4 = 120

model = Sequential()
model.add(LSTM(h1, return_sequences=True, input_shape=(timesteps, data_dim)))
model.add(LSTM(h2, return_sequences=True))
model.add(LSTM(h3, return_sequences=True))
model.add(LSTM(h4, return_sequences=True))
model.add(Dense(out_dim, activation='linear'))

rmsprop_optim = keras.optimizers.RMSprop(lr=learning_rate, rho=0.9, epsilon=1e-08, decay=decay)
model.compile(loss='mean_squared_error', optimizer=rmsprop_optim, metrics=['mse'])

# data preparation
[x_train, y_train] = readData()
x_train = x_train.reshape((int(num_samples/timesteps), timesteps, data_dim))
y_train = y_train.reshape((int(num_samples/timesteps), timesteps, out_dim))
history_callback = model.fit(x_train, y_train, validation_split=0.1,
                             batch_size=batch_size, epochs=num_epochs,
                             shuffle=False, callbacks=[checkpointer, losses])

When you say 0.06 MSE is underfitting, that depends a lot on the data distribution. MSE is a relative quantity, so if the data is not normalized, 0.06 might even be overfitting. In that case, pre-processing might help. Also check whether there is significant noise in the data.
Using 4 LSTM layers with large sizes means a lot of parameters to learn; fewer layers might be enough.
Try a non-linear activation in the final layer.
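For the normalization point, here is a minimal sketch of per-feature standardization, assuming x_train is still 2-D with shape (total_timesteps, data_dim) before the reshape; the helper name is made up for illustration:
import numpy as np

def standardize(x_train, x_val=None):
    # Compute per-feature statistics on the training data only.
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0) + 1e-8   # avoid division by zero
    x_train = (x_train - mean) / std
    if x_val is not None:
        # Reuse the training statistics for validation/test data.
        x_val = (x_val - mean) / std
    return x_train, x_val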

I suspect that your model only learns the weights of the Dense layer properly, but not those of the LSTM layers below it. As a quick check (sketched below), what kind of performance do you get if you remove all LSTM layers and replace the Dense layer with a TimeDistributed(Dense(...)) layer? If your curves look the same, training isn't working, i.e. the error gradients with respect to the lower-layer weights may be too small. Another way of checking this is to inspect the gradients directly and/or to compare the final weights after training with the initial weights. If this is indeed the problem, you can try the following:
1) Standardize your inputs,
2) use a smaller learning rate (decrease it logarithmically),
3) add skip connections,
4) use Adam instead of RMSprop, and
5) train for more epochs.
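A minimal sketch of that quick check, reusing the variables from the question (h1, out_dim, timesteps, data_dim, x_train, y_train, batch_size, num_epochs); if this baseline produces a similar loss curve, the LSTM layers are probably not being trained:
from keras.models import Sequential
from keras.layers import Dense, TimeDistributed

# Baseline without any recurrent layers: a Dense layer applied
# independently to every timestep.
baseline = Sequential()
baseline.add(TimeDistributed(Dense(h1, activation='relu'),
                             input_shape=(timesteps, data_dim)))
baseline.add(TimeDistributed(Dense(out_dim, activation='linear')))
baseline.compile(loss='mean_squared_error', optimizer='adam')
baseline.fit(x_train, y_train, validation_split=0.1,
             batch_size=batch_size, epochs=num_epochs)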

Related

My model fit too slow, tringle of val_loss is 90

I have a task to write a neural network. It takes 9 input neurons and produces 4 output neurons for a multiclass classification problem. I have tried different models, and for all of them:
A drop-out mechanism is used.
Batch normalization is used.
And the resulting neural networks all overfit. Precision is <80%, and I want at least 90% precision. The median loss is 0.8.
Can you please suggest what model I should use?
Dataset:
TMS_coefficients.RData file
Part of my code:
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras import layers

(trainX, testX, trainY, testY) = train_test_split(dataset, values,
                                                  test_size=0.25, random_state=42)

# neural network model
visible = layers.Input(shape=(9,))
hidden0 = layers.Dense(64, activation="tanh")(visible)
batch0 = layers.BatchNormalization()(hidden0)
drop0 = layers.Dropout(0.3)(batch0)
hidden1 = layers.Dense(32, activation="tanh")(drop0)
batch1 = layers.BatchNormalization()(hidden1)
drop1 = layers.Dropout(0.2)(batch1)
hidden2 = layers.Dense(128, activation="tanh")(drop1)
batch2 = layers.BatchNormalization()(hidden2)
drop2 = layers.Dropout(0.5)(batch2)
hidden3 = layers.Dense(64, activation="tanh")(drop2)
batch3 = layers.BatchNormalization()(hidden3)
output = layers.Dense(4, activation="softmax")(batch3)
model = tf.keras.Model(inputs=visible, outputs=output)
model.compile(optimizer=tf.keras.optimizers.Adam(0.0001),
              loss='categorical_crossentropy',
              metrics=['Precision'])
history = model.fit(trainX, trainY, validation_data=(testX, testY),
                    epochs=5000, batch_size=256)
From the loss curve, I can say it is not overfitting at all! In fact, your model is underfitting. Why? Because when you stopped training, the validation loss curve had not flattened out yet. That means your model still has the potential to do better if it is trained longer.
A model overfits when the training loss keeps decreasing (or stays the same) while the validation loss gradually increases. That is clearly not the case here.
So, here is what you can do:
Try training longer.
Add more layers.
Try different activation functions, e.g. ReLU instead of tanh.
Use lower dropout rates (your model is probably struggling to learn with such high dropout values).
Make sure you shuffle your data before the train-test split (sklearn's train_test_split() does this by default), and check that the test data is similar to the training data and goes through the same preprocessing steps, as in the sketch below.
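A minimal sketch of that last point, assuming dataset and values are the arrays from the question; the scaler is fitted on the training portion only and then reused for the test portion:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Shuffled split (shuffle=True is the sklearn default).
trainX, testX, trainY, testY = train_test_split(
    dataset, values, test_size=0.25, random_state=42, shuffle=True)

scaler = StandardScaler()
trainX = scaler.fit_transform(trainX)
testX = scaler.transform(testX)   # same preprocessing as the training data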

How to avoid overfitting in CNN?

I'm building a model that predicts people's age from their face. I'm using this pretrained model, and I made a custom loss function and a custom metric. I get decent results, but I want to improve them. In particular, I noticed that after a few epochs the model begins to overfit on the training set and the val_loss increases. How can I avoid this? I'm already using Dropout, but that doesn't seem to be enough.
I think maybe I should use L1 and L2 regularization, but I don't know how.
def resnet_model():
    model = VGGFace(model='resnet50')  # model: {resnet50, vgg16, senet50}
    xl = model.get_layer('avg_pool').output
    x = keras.layers.Flatten(name='flatten')(xl)
    x = keras.layers.Dense(4096, activation='relu')(x)
    x = keras.layers.Dropout(0.5)(x)
    x = keras.layers.Dense(4096, activation='relu')(x)
    x = keras.layers.Dropout(0.5)(x)
    x = keras.layers.Dense(11, activation='softmax', name='predictions')(x)
    model = keras.engine.Model(model.input, outputs=x)
    return model

model = resnet_model()
initial_learning_rate = 0.0003
epochs = 20
batch_size = 110
num_steps = train_x.shape[0] // batch_size
learning_rate_fn = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    [3*num_steps, 10*num_steps, 16*num_steps, 25*num_steps],
    [1e-4, 1e-5, 1e-6, 1e-7, 5e-7]
)
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate_fn)
model.compile(loss=custom_loss, optimizer=optimizer, metrics=['accuracy', one_off_accuracy])
model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, validation_data=(test_x, test_y))
This is an example of result:
There are many regularization methods that can help you avoid overfitting your model:
Dropout:
Randomly disables neurons during training, forcing the remaining neurons to learn useful features as well.
L1/L2 penalties:
Penalize large weights, which encourages the model to spread its decision across many parameters instead of relying on just a few (a short Keras sketch follows this list).
Random Gaussian noise at the inputs:
Adds random Gaussian noise to the inputs, x = x + r, where r is a small value drawn from a normal distribution. Because every input looks slightly different in each epoch, this makes it harder for the model to memorize the training set.
Label Smoothing:
Instead of using hard targets of 0 or 1, you can smooth them (e.g. 0.1 and 0.9).
Early Stopping:
This is a quite common technique for avoiding training your model for too long. If you notice that the training loss keeps decreasing while the validation accuracy starts to drop, that is a good sign to stop training, as the model is beginning to overfit.
K-Fold Cross-Validation:
Trains and evaluates the model on several different splits of the data, which gives a much more reliable picture of how well it generalizes.
Data Augmentation:
By rotating/shifting/zooming/flipping/padding etc. the images, you force the model to learn more robust features instead of memorizing the existing dataset.
There are certainly more techniques for avoiding overfitting. This repository contains many examples of how the above techniques are applied to a dataset:
https://github.com/kochlisGit/Tensorflow-State-of-the-Art-Neural-Networks
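Regarding L1/L2 and label smoothing, here is a minimal Keras sketch; the input size and the regularization factors are placeholders, not values from the question:
import tensorflow as tf
from tensorflow.keras import layers, regularizers

inputs = layers.Input(shape=(2048,))          # placeholder feature size
# L2 penalty on the dense weights; 1e-4 is only a starting point.
x = layers.Dense(4096, activation='relu',
                 kernel_regularizer=regularizers.l2(1e-4))(inputs)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(11, activation='softmax')(x)

model = tf.keras.Model(inputs, outputs)
# Label smoothing: hard 0/1 targets are softened inside the loss.
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1),
              metrics=['accuracy'])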
You can try incorporating image augmentation into your training, which increases both the effective sample size and the diversity of your data, as @Suraj S Jain mentioned. The official tutorial is here: https://www.tensorflow.org/tutorials/images/data_augmentation
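A minimal sketch of such an augmentation block using the Keras preprocessing layers from that tutorial; the augmentation strengths are just examples, and in TensorFlow versions before 2.6 these layers live under tf.keras.layers.experimental.preprocessing instead:
import tensorflow as tf
from tensorflow.keras import layers

# On-the-fly augmentation, active only during training.
data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),   # rotate by up to +/-10% of a full turn
    layers.RandomZoom(0.1),
])

# Prepend it to an existing backbone, e.g.:
# augmented = data_augmentation(images)   # then feed `augmented` to the model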

Keras model not learning and predicting only one class out of three classes

I'm new to deep learning and currently working on this competition for predicting earthquake damage to buildings.
The model I created starts at an accuracy of 0.56 and stays there no matter how many epochs I let it run. When finished, the model only predicts one of the three classes (which I one-hot encoded into a dataframe with three columns). Changing the number of layers, the optimizer, the data preparation, or the dropout doesn't change anything. Even trying to overfit the model by over-parameterizing the network still gives the same accuracy and a non-learning model.
What am I doing wrong?
This is my code:
model = keras.models.Sequential()
model.add(keras.layers.Dense(64, input_dim=85, activation="relu"))
keras.layers.Dropout(0.3)
model.add(keras.layers.Dense(128, activation="relu"))
keras.layers.Dropout(0.3)
model.add(keras.layers.Dense(256, activation="relu"))
keras.layers.Dropout(0.3)
model.add(keras.layers.Dense(512, activation="relu"))
model.add(keras.layers.Dense(3, activation="softmax"))

adam = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
model.compile(optimizer=adam,
              loss='categorical_crossentropy',
              metrics=['accuracy'])
history = model.fit(traindata, trainlabels,
                    epochs=5,
                    validation_split=0.2,
                    verbose=1)
There's nothing visually wrong with your model, but it may be too heavy to learn any useful features.
Try normalizing your input with https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
Start with only 2 layers and a small number of neurons (see the sketch below).
Increase the batch_size and try learning-rate scheduling.
Watch the validation accuracy and stop when it starts to overfit.
Also, for a 3-class classification problem, 56% accuracy is better than the baseline; remember this is a competition, so the data is not dummy playground data where you can expect 90% accuracy from an MLP on the first try.
Finally, try hyperparameter optimization with a tuner (e.g. Keras Tuner).
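A minimal sketch combining the first two suggestions, reusing traindata and trainlabels from the question; the layer widths, epochs, and learning rate are placeholders:
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from tensorflow import keras

# Normalize the 85 input features using training statistics only.
scaler = StandardScaler()
train_scaled = scaler.fit_transform(traindata)

# Smaller starting point: two hidden layers with modest widths.
model = keras.models.Sequential([
    keras.layers.Dense(64, input_dim=85, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_scaled, trainlabels, epochs=50, batch_size=256,
          validation_split=0.2)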

Recurrent Neural Network Mini-Batch dependency after trained

Currently, I have a neural network, built in tensorflow that is used to classify time sequence data into one of 6 categories. The network is composed of:
2 fully connected layers -> LSTM unit -> softmax -> output
All layers have regularization in the form of dropout and/or layer normalization. To speed up training, I am using mini-batching of the data, where the mini-batch size = number of categories = 6. Each mini-batch contains exactly one sample for each of the 6 categories, arranged randomly within the mini-batch. Below is the feed-forward code, where x has shape [batch_size, number of time steps, number of features], and the various get* helpers are simple definitions for creating standard fully connected layers and LSTM units with regularization.
def getFullyConnected(input, hidden, dropout, layer, phase):
    weight = tf.Variable(tf.random_normal([input.shape.dims[1].value, hidden]),
                         name="weight_layer" + str(layer))
    bias = tf.Variable(tf.random_normal([1]), name="bias_layer" + str(layer))
    layer = tf.add(tf.matmul(input, weight), bias)
    layer = tf.contrib.layers.batch_norm(layer,
                                         center=True, scale=True,
                                         is_training=phase)
    layer = tf.minimum(tf.nn.relu(layer), FLAGS.relu_clip)
    layer = tf.nn.dropout(layer, (1.0 - dropout))
    return layer

def RNN(x, weights, biases, time_steps):
    # shape the input as [batch_size*time_steps, input_depth]
    x = tf.reshape(x, [-1, input_depth])
    layer1 = getFullyConnected(input=x, hidden=16, dropout=full_drop, layer=1, phase=True)
    layer2 = getFullyConnected(input=layer1, hidden=input_depth*3, dropout=full_drop, layer=2, phase=True)
    rnn_input = tf.reshape(layer2, [-1, time_steps, input_depth*3])

    # 1-layer LSTM with n_hidden units
    LSTM_cell = getLSTMcell(n_hidden)

    # generate prediction
    outputs, state = tf.nn.dynamic_rnn(LSTM_cell,
                                       rnn_input,
                                       dtype=tf.float32,
                                       time_major=False)

    # good old tensorboard saves
    tf.summary.histogram('weight', weights['out'])
    tf.summary.histogram('bias', biases['out'])

    # there are time_steps outputs, but only grab the last one for the classification
    return tf.sigmoid(tf.matmul(outputs[:, -1, :], weights['out']) + biases['out'])
Surprisingly, this network trained extremely well, giving me about 99.75% accuracy on my test data (which the trained network had never seen). However, it only scored this high when I fed the training data into the network with the same mini-batch size as during training, 6. If I fed the training data one sample at a time (mini-batch size = 1), the network scored around 60%. What is weird is that if I train the network with single samples (mini-batch size = 1), the trained network works perfectly fine with high accuracy once training is done. This leads me to the odd conclusion that the network is learning to exploit the batch size, so much so that it becomes dependent on the mini-batch to classify correctly.
Is it normal for a deep network to become dependent on the mini-batch size used during training, to the point that the final trained network requires input data with the same mini-batch size just to perform correctly?
Any ideas or thoughts would be appreciated!

The CNN model does not learn when adding one/two more convolutional layers

I am trying to implement the model in the picture in TensorFlow, except with 1000 output neurons instead of 6, so that I can use pretrained weights.
I initially implemented the full model with only one (14,14,128) layer, just for testing. Now that the program has matured, I implemented the one or two additional layers. This makes the model not learn anything: the loss is constant (up to some small noise) and the accuracy tested on the training images stays at random-guess level. Before adding the layers I could get very quickly (5-10 minutes) to an accuracy of 70-80 percent on a 1000-image subset of the training dataset. As said, this is not the case with the additional layers.
Here is the code for the additional layer, where s1 and s2 are the strides of the convolutions:
w2 = weight_variable([3,3,64,128])
b2 = bias_variable([128])
h2 = tf.nn.relu(conv2d_s2(h1_pool,w2)+b2)
h2_pool = max_pool_2x2(h2)
#Starts additional layer
w3 = weight_variable([3,3,128,128])
b3 = bias_variable([128])
h3 = tf.nn.relu(conv2d_s1(h2_pool,w3)+b3)
#Ends additional layer
w5 = weight_variable([3,3,128,256])
b5 = bias_variable([256])
h5 = tf.nn.relu(conv2d_s1(h3,w5)+b5)
h5_pool = max_pool_2x2(h5)
This extra layer makes the model worthless. I have tried different hyperparameters (learning rate, batch size, epochs) without success. Where is the problem?
Another question: does anyone know a small (and/or better) network of this size that I could implement and test? My goal is to detect grasp positions on different objects (from images of the objects).
If it helps, I use one GTX 980 with a very good Xeon.
The repository can be found at https://github.com/tnikolla/grasp-detection.
UPDATE
The problem was divergence of the loss. It was solved by lowering the learning rate.
In orange are the accuracy and loss (TensorBoard and terminal) when the program showed no learning at all. I was fooled by the loss shown in the terminal. As @hars pointed out, after checking the accuracy and loss logs I discovered that the loss in TensorBoard diverges in the first steps. By changing the learning rate from 0.01 to 0.001 the divergence disappeared, and as you can see in cyan the model was learning (it overfitted a small subset of ImageNet in 1 minute).
You have a ReLU at the end of the model, which can zero out gradients for units with negative pre-activations, followed by a softmax-with-logits loss in the training part. The model may therefore get stuck at a poor local minimum.
Try removing the tf.nn.relu in the last line of your inference and see if it trains well.
Here is the relevant part of your code:
last few lines of model :
# fc1 layer
W_fc3 = weight_variable([512, 1000])
b_fc3 = bias_variable([1000])
output = tf.nn.relu(tf.matmul(h_fc2, W_fc3) + b_fc3)
#print("output: {}".format(output.get_shape()))
return output
training part of the code:
logits = inference_redmon.inference(images)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=logits, labels=labels_one))
tf.summary.scalar('loss', loss)
correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(labels_one, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
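For reference, a sketch of the proposed change, keeping the question's helper functions and tensor names: the inference ends with raw logits, and the softmax is applied only inside the loss.
# fc3 layer, with the final ReLU removed: the model now returns raw logits
W_fc3 = weight_variable([512, 1000])
b_fc3 = bias_variable([1000])
output = tf.matmul(h_fc2, W_fc3) + b_fc3   # no tf.nn.relu here
return output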