I am trying to use BatchNorm in Keras. The training accuracy increases over time, from 12% to 20%, slowly but surely.
The test accuracy, however, decreases from 12% to 0%. The random baseline is 12%.
I strongly suspect this is due to the BatchNorm layer (removing it results in ~12% test accuracy), perhaps because the gamma and beta parameters are not initialized well. Is there anything special I need to keep in mind when applying BatchNorm? I don't really understand what else could have gone wrong. I have the following model:
model = Sequential()
model.add(BatchNormalization(input_shape=(16, 8)))
model.add(Reshape((16, 8, 1)))
#1. Conv (64 filters; 3x3 kernel)
model.add(default_Conv2D())
model.add(BatchNormalization(axis=3))
model.add(Activation('relu'))
#2. Conv (64 filters; 3x3 kernel)
model.add(default_Conv2D())
model.add(BatchNormalization(axis=3))
model.add(Activation('relu'))
...
#8. Affine (NUM_GESTURES units) Output layer
model.add(default_Dense(NUM_GESTURES))
model.add(Activation('softmax'))
sgd = optimizers.SGD(lr=0.1)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
default_Conv2D and default_Dense are defined as follows:
def default_Conv2D():
    return Conv2D(
        filters=64,
        kernel_size=3,
        strides=1,
        padding='same',
        # activation=None,
        # use_bias=True,
        # kernel_initializer=RandomNormal(mean=0.0, stddev=0.01, seed=None),  # RandomUniform(),
        kernel_regularizer=regularizers.l2(0.0001),
        # bias_initializer=RandomNormal(mean=0.0, stddev=0.01, seed=None),  # RandomUniform(),
        # bias_regularizer=None
    )
def default_Dense(units):
    return Dense(
        units=units,
        # activation=None,
        # use_bias=True,
        # kernel_initializer=RandomNormal(mean=0.0, stddev=0.01, seed=None),  # RandomUniform(),
        # bias_initializer=RandomNormal(mean=0.0, stddev=0.01, seed=None),  # RandomUniform(),
        kernel_regularizer=regularizers.l2(0.0001),
        # bias_regularizer=None
    )
The issue is overfitting.
This is supported by your first two observations:
The training accuracy increases over time. From 12% to 20%,.. test accuracy however decreases from 12% to 0%
removing the batchnorm layer results in ~12% test accuracy
The first statement tells me that your network is memorizing the training set. The second statement tells me that when you prevent the network from memorizing the training set (or even learning), it stops making errors caused by that memorization.
There are a few solutions to overfitting, but it is a problem larger than this post. Please treat the following list as a short "top" list, not an exhaustive one:
add a regularizer like Dropout just before your final fully connected layer (see the sketch after this list)
add an L1 or L2 regularizer on the weight matrices
add a regularizer like Dropout between the CONV layers
your network may have too many free parameters; try reducing it to just one CONV layer, then add one more layer at a time, retraining and testing each time
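For reference, here is a minimal Keras sketch of the Dropout/L2 suggestions above, assuming the same Sequential API and the NUM_GESTURES constant from your script. It is deliberately abbreviated (one CONV block, no BatchNorm) and the dropout rates are illustrative, not tuned values:
from keras.models import Sequential
from keras.layers import Conv2D, Dense, Dropout, Flatten, Activation
from keras import regularizers

# Illustrative only: Dropout between layers and before the output,
# plus L2 regularization on the weight matrices (as you already do).
model = Sequential()
model.add(Conv2D(64, 3, padding='same',
                 kernel_regularizer=regularizers.l2(1e-4),
                 input_shape=(16, 8, 1)))
model.add(Activation('relu'))
model.add(Dropout(0.25))           # dropout between CONV blocks
model.add(Flatten())
model.add(Dropout(0.5))            # dropout just before the final dense layer
model.add(Dense(NUM_GESTURES,      # NUM_GESTURES as defined in your script
                kernel_regularizer=regularizers.l2(1e-4)))
model.add(Activation('softmax'))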
Slow increase in accuracy
As a side note, you hinted that your accuracy isn't increasing as fast as you would like by saying "slowly but surely". I've had great success when I've done all of the following:
change your loss function to be the average loss over all predictions for all items in the mini-batch. This makes your loss function independent of your batch size; otherwise, if you change the batch size and the loss changes with it, you also have to change your learning rate in SGD.
your loss is a single number that is the average of the loss for all predicted classes and all samples, so use a learning rate of 1.0. No need to scale it anymore.
use tf.train.MomentumOptimizer with learning_rate = 1.0 and momentum = 0.5. MomentumOptimizer has been shown to be much more robust than GradientDescent (see the sketch after this list).
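A minimal TF1-style sketch of points 1-3, where `logits` and `labels` are hypothetical tensors standing in for your network output and the one-hot targets:
import tensorflow as tf

# Averaging over the batch makes the loss (and thus the gradient scale)
# independent of the batch size.
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits))

train_op = tf.train.MomentumOptimizer(
    learning_rate=1.0, momentum=0.5).minimize(loss)
If you stay in Keras, note that its built-in losses (including 'categorical_crossentropy') are already averaged over the batch, so the rough equivalent there would be optimizers.SGD(lr=1.0, momentum=0.5).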
It seems that there was something broken with Keras itself.
A naive
pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps
did the trick.
@wontonimo, thanks a lot for your really great answer!
Related
I am working on a classification problem on the SemEval 2017 Task 4A dataset (which can be found here),
and I am using a deep LSTM network for it. For pre-processing, I have done lowercasing -> tokenization -> lemmatization -> stop-word removal -> punctuation removal. For word embeddings, I have used a word2vec model. There are 18,000 samples in my training set and 2,000 in my test set.
The code for my model is
model = Sequential()
model.add(Embedding(max_words, 30, input_length=max_len))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.3))
model.add(Bidirectional(LSTM(32, use_bias=True, return_sequences=True)))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Bidirectional(LSTM(32, use_bias=True, return_sequences=True), input_shape=(128, 1,64)))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(SeqSelfAttention(attention_activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
The value of max_words is 2000 and max_len is 300.
But even after this, my test accuracy is not crossing 50%. I can't figure out the problem.
PS: I am using a validation set too. The loss function is binary cross-entropy and the optimizer is Adam.
Training "LSTM" is very different with other common deep learning model.
I recommend a higher dropout rate like 0.7,0.8. and Adam optimizer is particularly unstable in LSTM with real world data. So, i recommend SGD scheduled for a momentum of 0.9 and ReduceLROnPlateau. You have to do very long training, and if spark loss is observed, the training is going very well. (Spark Loss is a word used by NVIDIA researchers. It refers to a phenomenon in which the value of Loss that appears to converge increases significantly.)
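A minimal sketch of that setup in Keras; the initial learning rate, the scheduler settings, and the X_train/y_train/X_val/y_val names are assumptions to adapt to your data:
from keras import optimizers
from keras.callbacks import ReduceLROnPlateau

# Illustrative values; tune the initial LR and scheduler settings.
sgd = optimizers.SGD(lr=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])

# Halve the learning rate whenever val_loss stops improving for 5 epochs.
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                              patience=5, min_lr=1e-5)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=200, batch_size=64, callbacks=[reduce_lr])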
I am working on a document classification problem.
It is multi-label classification with 20 different labels, 1,920 documents for training, and 480 for validation. The model is a CNN with FastText embeddings, and I use a logistic regression model with n-gram features as the baseline.
The problem is that the baseline model gives an F1-score of 0.36 while the CNN only gives 0.3.
The architecture I use is from here:
https://www.kaggle.com/vsmolyakov/keras-cnn-with-fasttext-embeddings
I have been doing some parameter tuning, and the current best parameters are: dropout 0.25, learning rate 0.001, trainable embeddings false, 128 filters, prediction threshold 0.15, and kernel size 9.
Do you have ideas about parameters to be especially aware of, changes to the architecture, or anything else that might improve the F1-score?
# Parameters
BATCH_SIZE = 16
DROP_OUT = 0.25
N_EPOCHS = 20
N_FILTERS = 128
TRAINABLE = False
LEARNING_RATE = 0.001
N_DIM = 32
KERNEL_SIZE = 9
# Create model
model = Sequential()
model.add(Embedding(NB_WORDS, EMBED_DIM, weights=[embedding_matrix],
input_length=MAX_SEQ_LEN, trainable=TRAINABLE))
model.add(Conv1D(N_FILTERS, KERNEL_SIZE, activation='relu', padding='same'))
model.add(MaxPooling1D(2))
model.add(Conv1D(N_FILTERS, KERNEL_SIZE, activation='relu', padding='same'))
model.add(GlobalMaxPooling1D())
model.add(Dropout(DROP_OUT))
model.add(Dense(N_DIM, activation='relu', kernel_regularizer=regularizers.l2(1e-4)))
model.add(Dense(N_LABELS, activation='sigmoid')) #multi-label (k-hot encoding)
adam = optimizers.Adam(lr=LEARNING_RATE, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
model.summary()
Edit
I think I ended up with some wrong hyperparameters by fixing the number of epochs to 20 during tuning. I am now trying with a stopping criterion; the model usually converges around 30-35 epochs. It seems a dropout of 0.5 works better, and I am currently tuning the batch size. If somebody has experience with the relationship between epochs and other hyperparameters, feel free to share.
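If it helps, a minimal early-stopping sketch in Keras (restore_best_weights needs Keras >= 2.2.3; the patience value and the X_train/X_val names are assumptions):
from keras.callbacks import EarlyStopping

# Stop once val_loss stops improving and keep the best weights seen.
early_stop = EarlyStopping(monitor='val_loss', patience=5,
                           restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=100, batch_size=BATCH_SIZE, callbacks=[early_stop])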
A thing you should consider in general is whether the data is imbalanced and how your model performs for each class (using, for example, sklearn.metrics.confusion_matrix).
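For the multi-label setup in this question, a per-label view is the closer fit than a single confusion matrix, e.g. via multilabel_confusion_matrix / classification_report (scikit-learn >= 0.21). A minimal sketch, where X_val/y_val are hypothetical names for your validation split and 0.15 is the prediction threshold from the question:
from sklearn.metrics import multilabel_confusion_matrix, classification_report

# Threshold the sigmoid outputs into k-hot predictions.
y_prob = model.predict(X_val)
y_pred = (y_prob >= 0.15).astype(int)

print(classification_report(y_val, y_pred))        # per-label precision/recall/F1
print(multilabel_confusion_matrix(y_val, y_pred))  # one 2x2 matrix per label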
I think the dataset (2000 documents over 20 classes) might not be big enough for deep learning to work from scratch. You could consider augmenting your dataset, or you could start by trying to fine-tune a pretrained language model for your task; see https://github.com/huggingface/pytorch-openai-transformer-lm. That could help you overcome the issue with the dataset size in general.
Has anybody trained MobileNet V1 from scratch on CIFAR-10? What was the maximum accuracy you got? I am getting stuck at 70% after 110 epochs, even though my training accuracy is above 99%. Here is how I am creating the model.
#create mobilenet layer
MobileNet_model = tf.keras.applications.MobileNet(include_top=False, weights=None)
# Must define the input shape in the first layer of the neural network
x = Input(shape=(32,32,3),name='input')
#Create custom model
model = MobileNet_model(x)
model = Flatten(name='flatten')(model)
model = Dense(1024, activation='relu',name='dense_1')(model)
output = Dense(10, activation=tf.nn.softmax,name='output')(model)
model_regular = Model(x, output,name='model_regular')
I used the Adam optimizer with LR = 0.001, amsgrad = True, and a batch size of 64. I also normalized the pixel data by dividing by 255.0. I am not using any data augmentation.
optimizer1 = tf.keras.optimizers.Adam(lr=0.001, amsgrad=True)
model_regular.compile(optimizer=optimizer1, loss='categorical_crossentropy', metrics=['accuracy'])
history = model_regular.fit(x_train, y_train_one_hot,validation_data=(x_test,y_test_one_hot),batch_size=64, epochs=100) # train the model
I think I am supposed to get at least 75% according to https://arxiv.org/abs/1712.04698
Am I doing anything wrong, or is this the expected accuracy after 100 epochs? Here is a plot of my validation accuracy.
MobileNet was designed for ImageNet, which is much larger, so training it on CIFAR-10 will inevitably result in overfitting. I would suggest you plot the loss (not accuracy) from both training and validation/evaluation, push the training until it reaches 99% training accuracy, and then observe the validation loss. If it is overfitting, you would see the validation loss actually increase after reaching its minimum.
A few things to try to reduce overfitting:
add dropout before the fully connected layer
data augmentation - random shift, crop, and rotation should be enough
use a smaller width multiplier (read the original paper; basically, just reduce the number of filters per layer), e.g. 0.75 or 0.5, to make the layers thinner
use L2 weight regularization and weight decay
Then there are some usual training tricks:
use learning rate decay, e.g. reduce the learning rate from 1e-2 to 1e-4, stepwise or exponentially (a sketch combining these suggestions follows after this list)
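A rough Keras sketch combining the suggestions above (dropout before the FC layer, shift/rotation augmentation, a 0.5 width multiplier, L2 on the dense layer, and stepwise LR decay). All values are illustrative assumptions, not the settings used here; it reuses x_train / y_train_one_hot / x_test / y_test_one_hot from the question:
import tensorflow as tf

# Thinner MobileNet via the width multiplier (alpha); weights trained from scratch.
base = tf.keras.applications.MobileNet(input_shape=(32, 32, 3), alpha=0.5,
                                        include_top=False, weights=None)
inputs = tf.keras.layers.Input(shape=(32, 32, 3))
h = base(inputs)
h = tf.keras.layers.Flatten()(h)
h = tf.keras.layers.Dropout(0.5)(h)            # dropout before the FC layer
outputs = tf.keras.layers.Dense(
    10, activation='softmax',
    kernel_regularizer=tf.keras.regularizers.l2(1e-4))(h)
model = tf.keras.Model(inputs, outputs)

# Random shift / rotation / flip augmentation.
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    width_shift_range=0.1, height_shift_range=0.1,
    rotation_range=15, horizontal_flip=True)

# Stepwise decay from 1e-2 towards 1e-4.
def lr_schedule(epoch):
    return 1e-2 * (0.1 ** (epoch // 30))

model.compile(optimizer=tf.keras.optimizers.SGD(lr=1e-2, momentum=0.9),
              loss='categorical_crossentropy', metrics=['accuracy'])
model.fit_generator(datagen.flow(x_train, y_train_one_hot, batch_size=64),
                    epochs=100, validation_data=(x_test, y_test_one_hot),
                    callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])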
With some hyperparameter search, I got an evaluation loss of 0.85. I didn't use Keras; I wrote the MobileNet myself in TensorFlow.
The OP asked about MobileNetv1. Since MobileNetv2 has been published, here is an update on training MobileNetv2 on CIFAR-10 -
1) MobileNetv2 is tuned primarily to work on ImageNet with an initial image resolution of 224x224. It has 5 convolution operations with stride 2. Thus the GlobalAvgPool2D (penultimate layer) gets a feature map of Cx7x7, where C is the number of filters (1280 for MobileNetV2).
2) For CIFAR-10, I changed the stride in the first three of these layers to 1. Thus the GlobalAvgPool2D gets a feature map of Cx8x8. Secondly, I trained with a width parameter of 0.25 (which scales the number of channels in the network). I trained with mixup in MXNet (https://gluon-cv.mxnet.io/model_zoo/classification.html). This gets me a validation accuracy of 93.27.
3) Another MobileNetV2 implementation that seems to work well for CIFAR-10 is available here - PyTorch-CIFAR
The reported accuracy is 94.43. This implementation changes the stride in the first two of the original layers which downsample the resolution to stride 1. And it uses the full width of the channels as used for ImageNet.
4) Further, I trained a MobileNetV2 on CIFAR-10 with mixup while only altering the stride in the first conv layer from 2 to 1, and I used the complete depth (width parameter == 1.0). Thus the GlobalAvgPool2D (penultimate layer) gets a feature map of Cx2x2. This gets me an accuracy of 92.31.
I've heard that machine learning algorithms rarely get stuck in local minima, but my CNN (in TensorFlow) is predicting a constant output for all inputs, and since I am using a mean squared error loss function, I think this must be a local minimum given the properties of MSE. I have a network with 2 convolution layers and 1 dense layer (plus 1 dense output layer for regression) with 24, 32, and 100 neurons respectively, but I've tried changing the number of layers/neurons and the issue is not solved. I have ReLU activations for the hidden layers and absolute value on the output layer (I know this is uncommon, but it converges faster to a lower MSE than the softplus function, which still has the same problem, and I need strictly positive outputs). I also have a 50% dropout layer between the dense and output layers and a pooling layer between the 2 convolutions. I have also tried changing the learning rate (currently 0.0001) and batch size. I am using an Adam optimizer.
I have seen it suggested to change/add the bias, but I'm not sure how to initialize it in tf.layers.conv2d / tf.layers.dense (for which bias is enabled), and I can't see any option for a bias with tf.nn.conv2d, which I used for my first layer so that I could initialize the kernel easily.
Any suggestions would be really appreciated, thanks.
Here's the section of my code with the network:
filter_shape = [3, 3, 12, 24]

def nn_model(input):
    weights = tf.Variable(tf.truncated_normal(filter_shape, mean=10,
                                               stddev=3), name='weights')
    conv1 = tf.nn.conv2d(input, weights, [1, 1, 1, 1], padding='SAME')
    conv2 = tf.layers.conv2d(inputs=conv1, filters=32, kernel_size=[3, 3],
                             padding="same", activation=tf.nn.relu)
    pool = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2,
                                   padding='same')
    flat = tf.reshape(pool, [-1, 32*3*3])
    dense_3 = tf.layers.dense(flat, neurons, activation=tf.nn.relu)
    dropout_2 = tf.layers.dropout(dense_3, rate=rate)
    prediction = tf.layers.dense(dropout_2, 1, activation=tf.nn.softplus)
    return prediction
My inputs are 5x5 images with 12 channels of environmental data and I have ~100,000 training samples. My current MSE is ~90 on values of ~25.
I used to face the same problem with bigger images. I increased the number of convolution layers to solve it. Maybe you should try adding even more convolution layers.
In my opinion, the problem comes from the fact that you don't have enough parameters and thus get stuck in a local minimum. If you increase the number of parameters, it can help the updates converge to a better minimum.
Also, I can't see the optimizer you are using. Is it Adam? You can try starting with a bigger learning rate and using a decay to decrease it epoch after epoch (a rough sketch follows below).
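A rough TF1-style sketch of the decay suggestion, where `loss` is a hypothetical tensor standing in for your MSE loss and the starting rate and decay settings are assumptions to tune:
import tensorflow as tf

# Start from a larger rate than 0.0001 and decay it as training progresses.
global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(
    learning_rate=0.001,    # bigger starting point (assumption)
    global_step=global_step,
    decay_steps=10000,      # how often to decay (assumption)
    decay_rate=0.9)

train_op = tf.train.AdamOptimizer(learning_rate).minimize(
    loss, global_step=global_step)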
I am trying to train an LSTM with Keras and the TensorFlow backend, but it seems to always underfit; the loss and validation-loss curves have an initial drop and then flatten out very fast (see image). I have tried adding more layers, more neurons, no dropout, etc., but I can't get it anywhere near overfitting, and I do have a good amount of data (almost 4 hours at 100 samples per second, and I have tried downsampling to 50/sec).
My problem is multidimensional time series prediction with continuous values.
Any ideas would be appreciated!
Here is my basic keras architecture:
data_dim = 30 #input dimensions => each timestep has 30 features
timesteps = 200
out_dim = 30 #output dimensions => each predicted output timestep
# has 30 dimensions
batch_size = 50
num_epochs = 300
learning_rate = 0.0005 #tried values between around 0.001 and 0.0003
decay=0.9
#hidden layers size
h1 = 120
h2 = 340
h3 = 340
h4 = 120
model = Sequential()
model.add(LSTM(h1, return_sequences=True,input_shape=(timesteps, data_dim)))
model.add(LSTM(h2, return_sequences=True))
model.add(LSTM(h3, return_sequences=True))
model.add(LSTM(h4, return_sequences=True))
model.add(Dense(out_dim, activation='linear'))
rmsprop_otim = keras.optimizers.RMSprop(lr=learning_rate, rho=0.9, epsilon=1e-08, decay=decay)
model.compile(loss='mean_squared_error', optimizer=rmsprop_otim,metrics=['mse'])
#data preparation
[x_train, y_train] = readData()
x_train = x_train.reshape((int(num_samples/timesteps),timesteps,data_dim))
y_train = y_train.reshape((int(num_samples/timesteps),timesteps,num_classes))
history_callback = model.fit(x_train, y_train, validation_split=0.1,
batch_size=batch_size, epochs=num_epochs,shuffle=False,callbacks=[checkpointer, losses])
When you say 0.06 MSE is underfitting, that depends a lot on the data distribution. MSE is a relative term, so if the data is not normalized, 0.06 might even be overfitting. In that case, pre-processing might help (see the sketch after these points). Also, check whether there is significant noise in the data.
Using 4 LSTM layers with large sizes means a lot of parameters to learn. A smaller number of layers might be enough.
Try a non-linear activation in the final layer.
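A minimal sketch of the pre-processing suggestion using scikit-learn, reusing the array names and 3-D shapes from the question:
from sklearn.preprocessing import StandardScaler

# x_train is (N, timesteps, data_dim), y_train is (N, timesteps, out_dim).
# Fit per-feature scalers on the flattened data, then restore the 3-D shape.
N, T, D = x_train.shape
x_scaler = StandardScaler().fit(x_train.reshape(-1, D))
x_train = x_scaler.transform(x_train.reshape(-1, D)).reshape(N, T, D)

Dy = y_train.shape[-1]
y_scaler = StandardScaler().fit(y_train.reshape(-1, Dy))
y_train = y_scaler.transform(y_train.reshape(-1, Dy)).reshape(N, T, Dy)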
I suspect that your model only learns the weights of the Dense layer properly, but not those of the LSTM layers below. As a quick check, what kind of performance do you get when you remove all LSTM layers and replace the Dense with a TimeDistributed(Dense(...)) layer? If your graphs look the same, training doesn't work, i.e. the error gradients with respect to the lower-layer weights may be too small. Another way of checking this is to inspect the gradients directly and/or to compare the final weights after training with the initial weights. If this is indeed the problem, you can try the following: 1) standardize your inputs, 2) use a smaller learning rate (decrease logarithmically), 3) add skip-layer connections, 4) use Adam instead of RMSprop, and 5) train for more epochs.
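A minimal sketch of that quick check, reusing the variable names from the question (the plain 'adam' optimizer here is an assumption, in line with point 4):
from keras.models import Sequential
from keras.layers import Dense, TimeDistributed

# Baseline without any LSTM layers: a per-timestep Dense mapping only.
# If its loss curves look like the original model's, the LSTM layers
# are probably not learning.
baseline = Sequential()
baseline.add(TimeDistributed(Dense(out_dim, activation='linear'),
                             input_shape=(timesteps, data_dim)))
baseline.compile(loss='mean_squared_error', optimizer='adam', metrics=['mse'])
baseline.fit(x_train, y_train, validation_split=0.1,
             batch_size=batch_size, epochs=num_epochs, shuffle=False)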