Why is my convolutional Neural Network stuck in a local minimum? - optimization

I've heard that machine learning algorithms rarely get stuck in local minima, but my CNN (in TensorFlow) is predicting a constant output for all inputs. I am using a mean squared error loss function, so I think this must be a local minimum given the properties of MSE. I have a network with 2 convolution layers and 1 dense layer (+1 dense output layer for regression) with 24, 32 and 100 neurons respectively, but I've tried changing the number of layers/neurons and the issue is not solved. I have ReLU activations for the hidden layers and absolute value on the output layer (I know this is uncommon, but it converges faster to a lower MSE than the softplus function, which still has the same problem, and I need strictly positive outputs). I also have a 50% dropout layer between the dense and output layers and a pooling layer between the 2 convolutions. I have also tried changing the learning rate (currently 0.0001) and the batch size. I am using an Adam optimizer.
I have seen it suggested to change/add bias but I'm not sure how to initialize it in tf.layers.conv2d/tf.layers.dense (for which I have bias=True), and I can't see any options for bias with tf.nn.conv2d which I used for my first layer so I could initialize the kernel easily.
Any suggestions would be really appreciated, thanks.
Here's the section of my code with the network:
filter_shape = [3,3,12,24]
def nn_model(input):
    weights = tf.Variable(tf.truncated_normal(filter_shape, mean=10,
                                              stddev=3), name='weights')
    conv1 = tf.nn.conv2d(input, weights, [1,1,1,1], padding='SAME')
    conv2 = tf.layers.conv2d(inputs=conv1, filters=32, kernel_size=[3,3],
                             padding="same", activation=tf.nn.relu)
    pool = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2,
                                   padding='same')
    flat = tf.reshape(pool, [-1, 32*3*3])
    dense_3 = tf.layers.dense(flat, neurons, activation = tf.nn.relu)
    dropout_2 = tf.layers.dropout(dense_3, rate = rate)
    prediction = tf.layers.dense(dropout_2, 1, activation=tf.nn.softplus)
    return prediction
My inputs are 5x5 images with 12 channels of environmental data and I have ~100,000 training samples. My current MSE is ~90 on values of ~25.

I used to face the same problem with bigger images. I increased the number of convolution layers to solve it. Maybe you should try adding even more convolution layers.
In my opinion, the problem comes from not having enough parameters, which is why you get stuck in a local minimum. Increasing the number of parameters can help the updates converge to a better minimum.
Also, I can't see the optimizer you are using. Is it Adam? You can try starting with a bigger learning rate and using a decay to decrease it epoch after epoch.
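The last suggestion (start with a larger learning rate and decay it every epoch) can be sketched as a simple exponential schedule. This is only an illustration: the function name `decayed_learning_rate`, the starting rate of 0.01 and the decay factor of 0.9 are assumptions, not values taken from the question.

```python
def decayed_learning_rate(initial_lr, decay_rate, epoch):
    """Exponential decay: multiply the rate by decay_rate once per epoch."""
    return initial_lr * decay_rate ** epoch


# Hypothetical values: start 100x above the question's 0.0001
# and shrink the rate by 10% after every epoch.
schedule = [decayed_learning_rate(0.01, 0.9, epoch) for epoch in range(50)]
```

The value for the current epoch would then be fed to the optimizer. In TensorFlow 1.x, `tf.train.exponential_decay` provides the same kind of schedule driven by the global step.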

Related

Prevent exploding loss function multi-step multi-variate/output forecast ConvLSTM

I have a problem that I currently fail to solve. During training, my loss explodes and becomes inf or NaN, because the MSE over all errors becomes huge when the predictions are poor (as they are at the beginning of training). That is the normal, intended behavior and is correct. But how, and with which loss function, do I train a ConvLSTM so that it can learn a multi-step, multivariate output?
E.g. I try to use an input of shape (32, None, 200, 12) to predict an output of shape (32, None, 30, 12). 32 is the batch size, None is the number of samples (>3000), 200 is the number of input time steps and 12 is the number of features. The output has 30 time steps, also 12 features wide.
My ConvLSTM model:
input = Input(shape=(None, input_shape[1]))
conv1d_1 = Conv1D(filters=64, kernel_size=3, activation=LeakyReLU())(input)
conv1d_2 = Conv1D(filters=32, kernel_size=3, activation=LeakyReLU())(conv1d_1)
dropout = Dropout(0.3)(conv1d_2)
lstm_1 = LSTM(32, activation=LeakyReLU())(dropout)
dense_1 = Dense(forecasting_horizon * input_shape[1], activation=LeakyReLU())(lstm_1)
output = Reshape((forecasting_horizon, input_shape[1]))(dense_1)
model = Model(inputs=input, outputs=output)
My ds generation:
ds_inputs = tf.keras.utils.timeseries_dataset_from_array(df[:-forecast_horizon], None, sequence_length=window_size, sequence_stride=1,
                                                         shuffle=False, batch_size=None)
ds_targets = tf.keras.utils.timeseries_dataset_from_array(df[forecast_horizon:], None, sequence_length=forecast_horizon, sequence_stride=1,
                                                          shuffle=False, batch_size=None)
ds_inputs = ds_inputs.batch(batch_size, drop_remainder=True)
ds_targets = ds_targets.batch(batch_size, drop_remainder=True)
ds = tf.data.Dataset.zip((ds_inputs, ds_targets))
ds = ds.shuffle(buffer_size=(len(ds)))
Besides MSE, I already tried MeanAbsoluteError, MeanSquaredLogarithmicError, MeanAbsolutePercentageError and CosineSimilarity, the last of which produces nonsense. MSLE works best but does not penalize large errors strongly, and therefore the MSE (used as a metric) varies enormously during training. Additionally, after a while the network stagnates and the loss stops improving (my explanation is that the differences in loss become too small on the logarithmic scale, so the weights can no longer be adjusted well).
I can partially answer my own question. One issue is that I used ReLU/LeakyReLU, which leads to the exploding gradient problem: the RNN/LSTM layer applies the same weights at every time step, so the values add up and explode. With ReLU there is nothing to reduce them again (its minimum output is 0). With tanh as the activation, negative values are possible, which allows the internal values to shrink again and greatly reduces the chance of exploding values/predictions within the network. After some tests, the LSTM layer stays numerically stable.
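The effect described above can be illustrated with a scalar toy recurrence. This is a sketch under stated assumptions, not the internals of a Keras LSTM (which also has gates): the helper `run_recurrence`, the recurrent weight of 1.5 and the constant input of 1.0 are all hypothetical.

```python
import math


def relu(v):
    return max(0.0, v)


def run_recurrence(activation, recurrent_weight=1.5, steps=50, x=1.0):
    """Apply h_t = activation(w * h_{t-1} + x), reusing the same weight w at every step."""
    h = 0.0
    for _ in range(steps):
        h = activation(recurrent_weight * h + x)
    return h


relu_state = run_recurrence(relu)       # unbounded: grows geometrically because w > 1
tanh_state = run_recurrence(math.tanh)  # bounded: tanh squashes every step into [-1, 1]
```

With an unbounded activation the state keeps growing as long as the recurrent weight is above 1, while tanh caps it regardless of the weight.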

Hand Landmark Coordinate Neural Network Not Converging

I'm currently trying to train a custom model with tensorflow to detect 17 landmarks/keypoints on each of 2 hands shown in an image (fingertips, first knuckles, bottom knuckles, wrist, and palm), for 34 points (and therefore 68 total values to predict for x & y). However, I cannot get the model to converge, with the output instead being an array of points that are pretty much the same for every prediction.
I started off with a dataset that has images like this:
each annotated to have the red dots correlate to each keypoint. To expand the dataset to try to get a more robust model, I took photos of the hands with various backgrounds, angles, positions, poses, lighting conditions, reflectivity, etc, as exemplified by these further images:
I have about 3000 images created now, with the landmarks stored inside a csv as such:
I have a train-test split of .67 train .33 test, with the images randomly assigned to each. I load the images with all 3 color channels, and scale both the color values & keypoint coordinates between 0 & 1.
I've tried a couple different approaches, each involving a CNN. The first keeps the images as they are, and uses a neural network model built as such:
model = Sequential()
model.add(Conv2D(filters = 64, kernel_size = (3,3), padding = 'same', activation = 'relu', input_shape = (225,400,3)))
model.add(Conv2D(filters = 64, kernel_size = (3,3), padding = 'same', activation = 'relu'))
model.add(MaxPooling2D(pool_size = (2,2), strides = 2))
filters_convs = [(128, 2), (256, 3), (512, 3), (512,3)]
for n_filters, n_convs in filters_convs:
    for _ in np.arange(n_convs):
        model.add(Conv2D(filters = n_filters, kernel_size = (3,3), padding = 'same', activation = 'relu'))
    model.add(MaxPooling2D(pool_size = (2,2), strides = 2))
model.add(Flatten())
model.add(Dense(128, activation="relu"))
model.add(Dense(96, activation="relu"))
model.add(Dense(72, activation="relu"))
model.add(Dense(68, activation="sigmoid"))
opt = Adam(learning_rate=.0001)
model.compile(loss="mse", optimizer=opt, metrics=['mae'])
print(model.summary())
I've modified the various hyperparameters, yet nothing seems to make any noticeable difference.
The other thing I've tried is resizing the images to fit within a 224x224x3 array to use with a VGG-16 network, as such:
vgg = VGG16(weights="imagenet", include_top=False,
            input_tensor=Input(shape=(224, 224, 3)))
vgg.trainable = False
flatten = vgg.output
flatten = Flatten()(flatten)
points = Dense(256, activation="relu")(flatten)
points = Dense(128, activation="relu")(points)
points = Dense(96, activation="relu")(points)
points = Dense(68, activation="sigmoid")(points)
model = Model(inputs=vgg.input, outputs=points)
opt = Adam(learning_rate=.0001)
model.compile(loss="mse", optimizer=opt, metrics=['mae'])
print(model.summary())
This model has similar results to the first. No matter what I seem to do, I seem to get the same results, in that my mse loss minimizes around .009, with an mae around .07, no matter how many epochs I run:
Furthermore, when I run predictions based off the model it seems that the predicted output is basically the same for every image, with only slight variation between each. It seems the model predicts an array of coordinates that looks somewhat like what a splayed hand might, in the general areas hands might be most likely to be found. A catch-all solution to minimize deviation as opposed to a custom solution for each image. These images illustrate this, with the green being predicted points, and the red being the actual points for the left hand:
So, I was wondering what might be causing this, be it the model, the data, or both, because nothing I've tried with either modifying the model or augmenting the data seems to have done any good. I've even tried reducing the complexity to predict for one hand only, to predict a bounding box for each hand, and to predict a single keypoint, but no matter what I try, the results are pretty inaccurate.
Thus, any suggestions for what I could do to help the model converge to create more accurate & custom predictions for each image of hands it sees would be very greatly appreciated.
Thanks,
Sam
Usually, neural networks have a very hard time predicting the exact coordinates of landmarks. A better approach is probably a fully convolutional network. This would work as follows:
You omit the dense layers at the end and thus end up with an output of shape (m, n, n_filters), with m and n being the dimensions of your downsampled feature maps (since you use max pooling at some earlier stage in the network, they will be lower resolution than your input image).
You set n_filters for the last (output) layer to the number of different landmarks you want to detect, plus one more to indicate no landmark.
You remove some of the max pooling so that your final output has a fairly high resolution (so the m and n referenced earlier are bigger). Now your output has shape m x n x (n_landmarks+1), and each of the m x n (n_landmarks+1)-dimensional vectors indicates which landmark is present at the position in the image that corresponds to its position in the m x n grid. So the activation for your last (output) convolutional layer needs to be a softmax to represent probabilities.
Now you can train your network to predict the landmarks locally without having to use dense layers.
This is a very simple architecture and for optimal results a more sophisticated architecture might be needed, but I think this should give you a first idea of a better approach than using the dense layers for the prediction.
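To make the target format of this approach concrete, here is a minimal sketch of how normalized keypoints could be turned into a grid of class labels (the per-cell softmax target) and back. The helper names `keypoints_to_label_grid` and `label_grid_to_keypoints`, the 56 x 100 grid and the example points are assumptions for illustration only.

```python
def keypoints_to_label_grid(keypoints, grid_h, grid_w):
    """Build a grid_h x grid_w grid of class ids from normalized (x, y) keypoints.

    Class 0 means "no landmark"; landmark k (0-based) is stored as class k + 1.
    """
    grid = [[0] * grid_w for _ in range(grid_h)]
    for k, (x, y) in enumerate(keypoints):
        row = min(int(y * grid_h), grid_h - 1)
        col = min(int(x * grid_w), grid_w - 1)
        grid[row][col] = k + 1
    return grid


def label_grid_to_keypoints(grid, n_landmarks):
    """Return the normalized cell-center (x, y) of each landmark, or None if absent."""
    grid_h, grid_w = len(grid), len(grid[0])
    keypoints = [None] * n_landmarks
    for row in range(grid_h):
        for col in range(grid_w):
            label = grid[row][col]
            if label > 0:
                keypoints[label - 1] = ((col + 0.5) / grid_w, (row + 0.5) / grid_h)
    return keypoints


# Hypothetical example: 3 landmarks on a 56 x 100 output grid.
points = [(0.10, 0.20), (0.55, 0.40), (0.90, 0.75)]
grid = keypoints_to_label_grid(points, grid_h=56, grid_w=100)
recovered = label_grid_to_keypoints(grid, n_landmarks=3)
```

The recovered coordinates are only accurate to within one grid cell, which is why the answer recommends keeping the output resolution fairly high.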
As for why your network predicts the same values every time: this is probably because your network is not able to learn what you want it to learn, since it is not suited to the task. In that case, the network will just learn to predict a value that is fairly good for most of the images (so basically the "average" position of each landmark over all of your images).
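The "average position" behaviour follows from the loss itself: among all constant predictions, the mean of the targets gives the lowest MSE. A small sketch with made-up numbers (the five target values are hypothetical, not taken from the dataset):

```python
from statistics import mean


def mse(targets, prediction):
    """Mean squared error of predicting the same constant for every target."""
    return mean((t - prediction) ** 2 for t in targets)


# Hypothetical normalized x-coordinates of one landmark across five images.
targets = [0.20, 0.35, 0.50, 0.65, 0.80]
best_constant = mean(targets)

loss_at_mean = mse(targets, best_constant)
loss_elsewhere = [mse(targets, best_constant + shift) for shift in (-0.2, -0.05, 0.05, 0.2)]
```

Every shifted constant gives a higher loss than the mean. Comparing a model's MSE with the MSE of this mean-only baseline is a quick way to see whether it has learned anything about the individual images.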

Keras LSTM always underfits

I am trying to train an LSTM with Keras and Tensorflow backend but it seems to always underfit; the loss and validation loss curves have an initial drop and then flatten out very fast (see image). I have tried adding more layers, more neurons, no dropout, etc., but can't get it even anywhere near an overfit and I do have a good bit of data (almost 4 hours with 100 samples per second, and I have tried downsampling to 50/sec).
My problem is multidimensional time series prediction with continuous values.
Any ideas would be appreciated!
Here is my basic keras architecture:
data_dim = 30 #input dimensions => each timestep has 30 features
timesteps = 200
out_dim = 30 #output dimensions => each predicted output timestep has 30 dimensions
batch_size = 50
num_epochs = 300
learning_rate = 0.0005 #tried values between around 0.001 and 0.0003
decay=0.9
#hidden layers size
h1 = 120
h2 = 340
h3 = 340
h4 = 120
model = Sequential()
model.add(LSTM(h1, return_sequences=True,input_shape=(timesteps, data_dim)))
model.add(LSTM(h2, return_sequences=True))
model.add(LSTM(h3, return_sequences=True))
model.add(LSTM(h4, return_sequences=True))
model.add(Dense(out_dim, activation='linear'))
rmsprop_otim = keras.optimizers.RMSprop(lr=learning_rate, rho=0.9, epsilon=1e-08, decay=decay)
model.compile(loss='mean_squared_error', optimizer=rmsprop_otim,metrics=['mse'])
#data preparation
[x_train, y_train] = readData()
x_train = x_train.reshape((int(num_samples/timesteps),timesteps,data_dim))
y_train = y_train.reshape((int(num_samples/timesteps),timesteps,num_classes))
history_callback = model.fit(x_train, y_train, validation_split=0.1,
                             batch_size=batch_size, epochs=num_epochs,shuffle=False,callbacks=[checkpointer, losses])
Whether 0.06 MSE counts as underfitting depends a lot on the data distribution. MSE is a relative measure, so if the data is not normalized, 0.06 might even be an overfit. In that case, pre-processing might help. Also, check whether there is significant noise in the data.
Using 4 LSTM layers with large sizes means a lot of parameters to learn. Fewer layers might be enough.
Try a non-linear activation in the final layer.
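The point about MSE being relative to the data scale can be checked directly: rescaling the same targets and predictions by 10 multiplies the MSE by 100, while the MSE divided by the target variance stays the same. The helper `relative_mse` and all numbers below are hypothetical.

```python
from statistics import mean, pvariance


def mse(y_true, y_pred):
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def relative_mse(y_true, y_pred):
    """MSE divided by the variance of the targets (1.0 = no better than predicting the mean)."""
    return mse(y_true, y_pred) / pvariance(y_true)


# Hypothetical targets and predictions, then the same data rescaled by a factor of 10.
y_true = [0.1, 0.4, 0.2, 0.9, 0.6]
y_pred = [0.2, 0.3, 0.3, 0.7, 0.5]
scaled_true = [10 * v for v in y_true]
scaled_pred = [10 * v for v in y_pred]

raw_ratio = mse(scaled_true, scaled_pred) / mse(y_true, y_pred)
```

A scale-free number such as this one is easier to interpret than a bare 0.06 when deciding whether a model is underfitting.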
I suspect that your model only learns the weights of the Dense layer properly, but not those of the LSTM layers below. As a quick check, what kind of performance do you get when you get rid of all LSTM layers and replace the Dense with a TimeDistributed(Dense...) layer? If your graphs look the same, training doesn't work, i.e. the error gradients with respect to the lower-layer weights may be too small. Another way of checking this is to inspect the gradients directly and/or to compare the final weights after training with the initial weights. If this is indeed the problem, you can try the following: 1) standardize your inputs, 2) use a smaller learning rate (decrease logarithmically), 3) add skip-layer connections, 4) use Adam instead of RMSprop, and 5) train for more epochs.
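The weight comparison mentioned above could look like the following sketch. It works on plain flat lists so that it is self-contained. The helper `relative_weight_change` and the example numbers are hypothetical; with Keras, the actual values would come from each layer's `get_weights()` before and after `fit`.

```python
import math


def relative_weight_change(initial, final):
    """Return ||final - initial|| / ||initial|| for two flat lists of weights."""
    diff = math.sqrt(sum((f - i) ** 2 for i, f in zip(initial, final)))
    norm = math.sqrt(sum(i ** 2 for i in initial))
    return diff / norm


# Hypothetical flattened weights of two layers, before and after training.
lstm_before, lstm_after = [0.5, -0.3, 0.8], [0.5001, -0.3001, 0.7999]
dense_before, dense_after = [0.2, -0.1, 0.4], [0.9, -0.7, 1.3]

lstm_change = relative_weight_change(lstm_before, lstm_after)
dense_change = relative_weight_change(dense_before, dense_after)
```

A change close to zero in the LSTM layers next to a large change in the Dense layer would support the suspicion that only the top layer is being trained.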

Recurrent Neural Network Mini-Batch dependency after trained

Currently, I have a neural network, built in tensorflow that is used to classify time sequence data into one of 6 categories. The network is composed of:
2 fully connected layers -> LSTM unit -> softmax -> output
All layers have regularization in the form of dropout and or layer normalization. In order to speed up the training process, I am using mini-batching of the data, where the mini-batch size = # of categories = 6. Each mini-batch contains exactly one sample for each of the 6 categories, arranged randomly in the mini-batch. Below is the feed-forward code, where x is of shape [batch_size, number of time steps, number of features], and the various get commands are simple definitions for creating standard fully connected layers and LSTM units with regularization.
def getFullyConnected(input ,hidden ,dropout, layer, phase):
    weight = tf.Variable(tf.random_normal([input.shape.dims[1].value,hidden]), name="weight_layer"+str(layer))
    bias = tf.Variable(tf.random_normal([1]), name="bias_layer"+str(layer))
    layer = tf.add(tf.matmul(input, weight), bias)
    layer = tf.contrib.layers.batch_norm(layer,
                                         center=True, scale=True,
                                         is_training=phase)
    layer = tf.minimum(tf.nn.relu(layer), FLAGS.relu_clip)
    layer = tf.nn.dropout(layer, (1.0 - dropout))
    return layer
def RNN(x, weights, biases, time_steps):
    #shape the input as [batch_size*time_steps, input_depth]
    x = tf.reshape(x, [-1,input_depth])
    layer1 = getFullyConnected(input=x, hidden=16, dropout=full_drop, layer=1, phase=True)
    layer2 = getFullyConnected(input=layer1, hidden=input_depth*3, dropout=full_drop, layer=2, phase=True)
    rnn_input = tf.reshape(layer2, [-1,time_steps,input_depth*3])
    # 1-layer LSTM with n_hidden units.
    LSTM_cell = getLSTMcell(n_hidden)
    #generate prediction
    outputs, state = tf.nn.dynamic_rnn(LSTM_cell,
                                       rnn_input,
                                       dtype=tf.float32,
                                       time_major=False)
    #good old tensorboard saves
    tf.summary.histogram('weight', weights['out'])
    tf.summary.histogram('bias',biases['out'])
    #there are time_steps outputs, but only grab the last output for the classification
    return tf.sigmoid(tf.matmul(outputs[:,-1,:], weights['out']) + biases['out'])
Surprisingly, this network trained extremely well giving me about 99.75% accuracy on my test data (which the trained network had never seen). However, it only scored this high when I fed the training data into the network with a mini-batch size the same as during training, 6. If I only fed the training data one sample at a time (mini-batch size = 1), the network was scoring around 60%. What is weird is that, if I train the network with only single samples (mini-batch size = 1), the trained network works perfectly fine with high accuracy once the network is trained. This leads me to the weird conclusion that the network is almost learning to utilize the batch size in its learning, so much so that it becomes dependent on the mini-batch to classify correctly.
Is it a thing for a deep network to become dependent on the size of the mini-batch during training, so much that the final trained network will require input data to have the same mini-batch size just to perform correctly?
All ideas or thoughts would be loved!
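One possible explanation, offered as an assumption rather than a confirmed diagnosis: `getFullyConnected` is always called with `phase=True`, so the batch normalization layer would keep normalizing with the statistics of whatever batch it is given, even at test time. The toy sketch below (hypothetical helper `batch_norm_training_mode`, made-up activations, a single unit) shows how the same sample is mapped to different values depending on what else is in the batch.

```python
import math


def batch_norm_training_mode(batch, eps=1e-3):
    """Normalize each value with the mean and variance of the batch it arrives in."""
    mu = sum(batch) / len(batch)
    var = sum((v - mu) ** 2 for v in batch) / len(batch)
    return [(v - mu) / math.sqrt(var + eps) for v in batch]


# Hypothetical activations of one unit for six samples.
activations = [0.5, 1.5, 2.0, 3.0, 4.5, 6.5]

in_batch_of_six = batch_norm_training_mode(activations)[0]
alone = batch_norm_training_mode(activations[:1])[0]  # the batch mean equals the sample itself
```

In the question's network the normalization runs over `batch_size * time_steps` rows, so the effect would be less extreme than in this single-value example, but the output would still depend on the batch composition. Passing `phase=False` at inference, so that the moving averages are used instead, would be the thing to try first.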

Keras BatchNorm: Training accuracy increases while Testing accuracy decreases

I am trying to use BatchNorm in Keras. The training accuracy increases over time. From 12% to 20%, slowly but surely.
The test accuracy however decreases from 12% to 0%. Random baseline is 12%.
I very much assume this is due to the batchnorm layer (removing the batchnorm layer results in ~12% test accuracy), which maybe does not initialize parameters gamma and beta well enough. Do I have to regard anything special when applying batchnorm? I don't really understand what else could have gone wrong. I have the following model:
model = Sequential()
model.add(BatchNormalization(input_shape=(16, 8)))
model.add(Reshape((16, 8, 1)))
#1. Conv (64 filters; 3x3 kernel)
model.add(default_Conv2D())
model.add(BatchNormalization(axis=3))
model.add(Activation('relu'))
#2. Conv (64 filters; 3x3 kernel)
model.add(default_Conv2D())
model.add(BatchNormalization(axis=3))
model.add(Activation('relu'))
...
#8. Affine (NUM_GESTURES units) Output layer
model.add(default_Dense(NUM_GESTURES))
model.add(Activation('softmax'))
sgd = optimizers.SGD(lr=0.1)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
default_Conv2D and default_Dense are defined as follows:
def default_Conv2D():
    return Conv2D(
        filters=64,
        kernel_size=3,
        strides=1,
        padding='same',
        # activation=None,
        # use_bias=True,
        # kernel_initializer=RandomNormal(mean=0.0, stddev=0.01, seed=None), #RandomUniform(),
        kernel_regularizer=regularizers.l2(0.0001),
        # bias_initializer=RandomNormal(mean=0.0, stddev=0.01, seed=None), # RandomUniform(),
        # bias_regularizer=None
    )
def default_Dense(units):
    return Dense(
        units=units,
        # activation=None,
        # use_bias=True,
        # kernel_initializer=RandomNormal(mean=0.0, stddev=0.01, seed=None),#RandomUniform(),
        # bias_initializer=RandomNormal(mean=0.0, stddev=0.01, seed=None),#RandomUniform(),
        kernel_regularizer=regularizers.l2(0.0001),
        # bias_regularizer=None
    )
The issue is overfitting.
This is supported by your first 2 observations :
The training accuracy increases over time. From 12% to 20%,.. test accuracy however decreases from 12% to 0%
removing the batchnorm layer results in ~12% test accuracy
The first statement tells me that your network is memorizing the training set. The second statement tells me that when you prevent the network from memorizing the training set (or even from learning), it stops making the errors that come from memorization.
There are a few solutions to overfitting, but it is a problem larger than this post. Please treat the following list as a "top" list and not exhaustive:
add a regularizer like Dropout just before your final fully connected layer.
add a L1 or L2 regularizer on matrix weights
add a regularizer like Dropout between CONV
your network may have too many free parameters. try reducing the layers to just 1 CONV, and add one more layer at a time retraining and testing each time.
slow increase in accuracy
As a side note, you hinted that your accuracy isn't increasing as fast as you would like by saying "slowly but surely". I've had great success when I've done all of the following steps:
Change your loss function to be the average loss over all predictions for all items in the mini-batch. This makes the loss independent of the batch size. Otherwise, when you change the batch size the loss changes with it and you have to change your learning rate in SGD as well.
Your loss is then a single number that is the average of the loss over all predicted classes and all samples, so use a learning rate of 1.0. No need to scale it anymore.
Use tf.train.MomentumOptimizer with learning_rate = 1.0 and momentum = 0.5. MomentumOptimizer has been shown to be much more robust than GradientDescent.
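The first step (averaging rather than summing the loss over the mini-batch) can be illustrated with a few made-up per-sample losses. The function names and values are hypothetical.

```python
def sum_loss(per_sample_losses):
    """Loss summed over the mini-batch: grows with the batch size."""
    return sum(per_sample_losses)


def mean_loss(per_sample_losses):
    """Loss averaged over the mini-batch: independent of the batch size."""
    return sum(per_sample_losses) / len(per_sample_losses)


# Hypothetical per-sample losses; the larger batch repeats the same samples four times.
small_batch = [0.8, 1.2, 1.0, 0.6]
large_batch = small_batch * 4
```

With the summed loss, quadrupling the batch quadruples the loss (and its gradient), so the learning rate would need retuning. The averaged loss is unchanged.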
It seems that there was something broken with Keras itself.
A naive
pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps
did the trick.
@wontonimo, thanks a lot for your really great answer!