Loss increasing with batch normalization (tf.Keras) - tensorflow

I have a FF NN with 2 hidden layers for a regression problem. Compared to when I do not add BN, loss (MSE) is about double when training on the same number of epochs, and the execution time is also increased by about 20%. Why is that?
If I had to take a guess -- BN is not worth it on a 2-layer network, and the extra overhead introduced by BN is actually higher than whatever decrease in processing time it causes.
That would explain the execution time, but I am not sure why the loss is higher, too.
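from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization
from tensorflow.keras.losses import mean_squared_error

model = Sequential()
model.add(Dense(128, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(128, activation='relu'))
model.add(BatchNormalization())  # sits directly before the output layer
model.add(Dense(1, activation='linear'))
model.compile(loss=mean_squared_error, optimizer='adam')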
I've tried a variety of optimizers, activation functions, number of epochs, batch size, etc, but no difference.

For regression, you should not use BatchNorm before the output layer.
On the other hand, you could use BatchNorm right after the input layer and before the first Dense layer to normalize inputs.
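As a rough sketch of that advice (not a tested fix for the asker's data), the model could be rearranged like this. The `build_regressor` helper and the `n_features=10` value are made up for illustration. Keeping the BatchNorm between the two hidden layers is one reading of the advice; only the one directly before the output layer is dropped.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_regressor(n_features):
    """Hypothetical helper: BatchNorm on the inputs, none before the output."""
    model = tf.keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.BatchNormalization(),            # normalizes the raw inputs
        layers.Dense(128, activation="relu"),
        layers.BatchNormalization(),            # between the hidden layers
        layers.Dense(128, activation="relu"),   # no BatchNorm after this layer
        layers.Dense(1, activation="linear"),   # regression output
    ])
    model.compile(loss="mean_squared_error", optimizer="adam")
    return model


# n_features=10 is an assumed value; use the width of your own input data.
model = build_regressor(n_features=10)
```

Whether this actually lowers the MSE depends on the data, so it is worth comparing against the version without any BatchNorm at the same number of epochs.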

Related

Improve multiclass text classification model with LSTM and Glove, Keras and Tensorflow

I have spent some time trying to improve my F1 score for my multiclass text classification task. I am extracting aspects and sentiments from laptop reviews. Therefore there are 3 labels, B_A / I_A / O etc. I would really appreciate any suggestions to improve my network, for example additional layers or another embedding. (Maybe I should also try something other than multiclass classification for my task.)
So far I have an F1 score of about 60% with the following code:
#vocab_size=4840, embedding is glove6B, max_seq_length=100
model = Sequential()
model.add(Embedding(vocab_size, 300, weights=[embedding_vectors], input_length=max_seq_length,
trainable= False))
model.add(Dropout(0.1))
model.add(Conv1D(3000, 1, activation='relu'))
model.add(Bidirectional(LSTM(units=150, recurrent_dropout=0, return_sequences=True)))
model.add(Dense(32, activation='relu'))
model.add(Dense(n_tags, activation='softmax'))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop", metrics=["categorical_accuracy"])
model.summary()
# fit model on train data
model.fit(x_train, y_train,
batch_size=64,
epochs=10)
I don't know your data, but I do have some general suggestions for multiclass text classification with Keras:
1. Instead of a single Conv1D layer with 3000 filters, try stacking several Conv1D layers with fewer filters each.
2. Increase the number of neurons in the 32-neuron Dense layer. Often, when the layer before the output layer has too few neurons, the model loses accuracy.
3. Instead of passing activation='relu' to the layers, try a LeakyReLU layer, which would fix the dying ReLU problem if it is present.
4. Instead of adding the Dropout after the Embedding layer, add it after the Conv1D layer. I don't see the need for a Dropout after an untrainable layer that only vectorizes the inputs.
If you haven't tried these suggestions already, I would recommend trying them. I would especially try the 4th one, as a Dropout after an Embedding layer doesn't seem necessary.
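Put together, the four suggestions might look roughly like the sketch below. The filter counts (128), kernel size (3), and Dense width (128) are arbitrary illustrative choices, `n_tags = 3` is an assumption based on the labels named in the question, and the pretrained GloVe weights are left out so the snippet stays self-contained.

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, max_seq_length = 4840, 100  # values from the question
n_tags = 3                              # assumption: B_A / I_A / O

model = tf.keras.Sequential([
    layers.Input(shape=(max_seq_length,), dtype="int32"),
    # Frozen embedding; in the real model the GloVe vectors would be loaded here.
    layers.Embedding(vocab_size, 300, trainable=False),
    # Several smaller Conv1D layers instead of one with 3000 filters.
    layers.Conv1D(128, 3, padding="same"),
    layers.LeakyReLU(),                 # LeakyReLU instead of activation='relu'
    layers.Conv1D(128, 3, padding="same"),
    layers.LeakyReLU(),
    layers.Dropout(0.1),                # Dropout moved after the Conv1D layers
    layers.Bidirectional(layers.LSTM(150, return_sequences=True)),
    layers.Dense(128),                  # wider than the original 32 units
    layers.LeakyReLU(),
    layers.Dense(n_tags, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="rmsprop",
              metrics=["categorical_accuracy"])
```

The `padding="same"` setting keeps the sequence length at 100 so the per-token labels still line up with the output.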

Deep LSTM accuracy not crossing 50%

I am working on a classification problem using the SemEval-2017 Task 4A dataset (the dataset can be found here), and I am using a deep LSTM network for it. In pre-processing, I apply lowercasing -> tokenization -> lemmatization -> stop-word removal -> punctuation removal. For word embeddings, I have used a Word2Vec model. There are 18,000 samples in my training set and 2,000 samples in my test set.
The code for my model is
model = Sequential()
model.add(Embedding(max_words, 30, input_length=max_len))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.3))
model.add(Bidirectional(LSTM(32, use_bias=True, return_sequences=True)))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Bidirectional(LSTM(32, use_bias=True, return_sequences=True), input_shape=(128, 1,64)))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(SeqSelfAttention(attention_activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
The value of max_words is 2000 and max_len is 300
But even after this, my testing accuracy is not crossing 50%. I can't figure out the problem.
PS - I am using validation technique too. The loss function is 'Binary Crossentropy' and optimizer is 'Adam'.
Training an LSTM is very different from training other common deep learning models.
I recommend a higher dropout rate, such as 0.7 or 0.8. Also, the Adam optimizer is particularly unstable for LSTMs on real-world data, so I recommend SGD with a momentum of 0.9 combined with ReduceLROnPlateau. You have to train for a very long time, and if a spark loss is observed, the training is going very well. ("Spark loss" is a term used by NVIDIA researchers. It refers to a phenomenon in which a loss that appears to have converged suddenly increases significantly.)
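For reference, the optimizer and callback mentioned above could be set up roughly as follows. Only the momentum of 0.9 comes from the answer; the initial learning rate, `factor`, `patience`, and `min_lr` are placeholder values to tune.

```python
import tensorflow as tf

# SGD with momentum 0.9, as suggested; learning_rate=0.01 is an assumed start.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

# Halve the learning rate when the validation loss stops improving.
# factor, patience and min_lr are illustrative, not values from the answer.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=3, min_lr=1e-5
)

# Usage with the model from the question (x_train / y_train are the asker's data):
# model.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])
# model.fit(x_train, y_train, validation_split=0.1, epochs=100, callbacks=[reduce_lr])
```

ReduceLROnPlateau needs a validation metric to monitor, so `fit` has to be given `validation_split` or `validation_data`.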

A neural network that can't overfit?

I am fitting a model to some noisy satellite data. The labels are measurements of rock on the bars of a river. There is a noisy but significant relationship. I only have 250 points but the method would expand and eventually run off much bigger datasets. I'm looking at a mix of models (RANSAC, Huber, SVM Regression) and DNNs. My DNN results seem too good to be true. The network looks like:
model = Sequential()
model.add(Dense(128, kernel_regularizer= regularizers.l2(0.001), input_dim=NetworkDims, kernel_initializer='he_normal', activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(128, kernel_regularizer= regularizers.l2(0.001), kernel_initializer='normal', activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, kernel_regularizer= regularizers.l2(0.001), kernel_initializer='normal', activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, kernel_regularizer= regularizers.l2(0.001), kernel_initializer='normal', activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(32, kernel_regularizer= regularizers.l2(0.001), kernel_initializer='normal', activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, kernel_initializer='normal'))
# Compile model
model.compile(loss='mean_squared_error', optimizer='adam')
return model
And when I save the history and plot training loss (green dots) and validation loss (cyan line) vs epoch I get this:
Training and validation loss just creep down. With a small dataset, I was expecting the validation loss to go its own way. In fact, if I run a 10-fold cross val score with this network, the error reported by cross val score does creep down. This just looks too good to be true. It implies that I could train this thing for 1000 epochs and still improve results. If it looks too good to be true, it usually is, but why?
EDIT: More results.
So I tried to cut dropout to 0.1 at each layer and to remove the L2. Interesting. With the toned-down dropout, I get even better results:
10% dropout rate
Without the L2, there is overfitting:
No L2 reg
My guess is that the high dropout on every layer is why the network has trouble overfitting the training data. My prediction is that if you lower the dropout and regularization, it will learn the training data much faster.
I'm not sure whether the results are too good to be true, because it's hard to judge how good a model is from the loss value alone. But it should be the dropout and regularization that prevent it from overfitting within a few epochs.

Why does increasing features lead to worse neural network performance?

I have a regression problem and configured a multi-layered neural network using Keras. The original dataset had 286 features, and using 20 epochs, the NN converged to an MSE loss of ~0.0009. This is using the Adam optimizer.
I then added three more features, and using the same configuration, the NN won't converge. After 1 epoch, it gets stuck at a loss of 0.003, so significantly worse.
After checking that the new features are represented correctly, I have tried the following with no success:
- adjusting the number of layers
- adjusting the number of neurons in each layer
- including dropout layers
- adjusting the learning rate
Here is my original configuration:
model = Sequential()
model.add(Dense(300, activation='relu',
input_dim=training_set.shape[1]))
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(1, activation='linear'))
model.compile(optimizer='Adam',loss='mse')
Anybody have any ideas?

Keras BatchNorm: Training accuracy increases while Testing accuracy decreases

I am trying to use BatchNorm in Keras. The training accuracy increases over time. From 12% to 20%, slowly but surely.
The test accuracy however decreases from 12% to 0%. Random baseline is 12%.
I very much assume this is due to the batchnorm layer (removing the batchnorm layer results in ~12% test accuracy), which maybe does not initialize parameters gamma and beta well enough. Do I have to regard anything special when applying batchnorm? I don't really understand what else could have gone wrong. I have the following model:
model = Sequential()
model.add(BatchNormalization(input_shape=(16, 8)))
model.add(Reshape((16, 8, 1)))
#1. Conv (64 filters; 3x3 kernel)
model.add(default_Conv2D())
model.add(BatchNormalization(axis=3))
model.add(Activation('relu'))
#2. Conv (64 filters; 3x3 kernel)
model.add(default_Conv2D())
model.add(BatchNormalization(axis=3))
model.add(Activation('relu'))
...
#8. Affine (NUM_GESTURES units) Output layer
model.add(default_Dense(NUM_GESTURES))
model.add(Activation('softmax'))
sgd = optimizers.SGD(lr=0.1)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
default_Conv2D and default_Dense are defined as follows:
def default_Conv2D():
    return Conv2D(
        filters=64,
        kernel_size=3,
        strides=1,
        padding='same',
        # activation=None,
        # use_bias=True,
        # kernel_initializer=RandomNormal(mean=0.0, stddev=0.01, seed=None), #RandomUniform(),
        kernel_regularizer=regularizers.l2(0.0001),
        # bias_initializer=RandomNormal(mean=0.0, stddev=0.01, seed=None), # RandomUniform(),
        # bias_regularizer=None
    )

def default_Dense(units):
    return Dense(
        units=units,
        # activation=None,
        # use_bias=True,
        # kernel_initializer=RandomNormal(mean=0.0, stddev=0.01, seed=None),#RandomUniform(),
        # bias_initializer=RandomNormal(mean=0.0, stddev=0.01, seed=None),#RandomUniform(),
        kernel_regularizer=regularizers.l2(0.0001),
        # bias_regularizer=None
    )
The issue is overfitting.
This is supported by your first two observations:
"The training accuracy increases over time. From 12% to 20% ... test accuracy however decreases from 12% to 0%"
"removing the batchnorm layer results in ~12% test accuracy"
The first statement tells me that your network is memorizing the training set. The second tells me that when you prevent the network from memorizing the training set (or even from learning at all), it stops making the errors caused by memorization.
There are a few solutions to overfitting, but it is a problem larger than this post. Please treat the following list as a "top" list and not an exhaustive one:
- Add a regularizer like Dropout just before your final fully connected layer.
- Add an L1 or L2 regularizer on the weight matrices.
- Add a regularizer like Dropout between the CONV layers.
- Your network may have too many free parameters. Try reducing the layers to just 1 CONV, then add one more layer at a time, retraining and testing each time.
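A minimal sketch of the first and last items, assuming the input shape from the question: a single CONV block with Dropout just before the final fully connected layer. `NUM_GESTURES = 8` is a guess (the post only gives a ~12% random baseline), and `build_small_cnn` with its `dropout_rate` default is hypothetical.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

NUM_GESTURES = 8  # assumption: the post does not state the class count


def build_small_cnn(dropout_rate=0.5, l2_strength=1e-4):
    """Hypothetical reduced model: one CONV block, Dropout before the output."""
    return tf.keras.Sequential([
        layers.Input(shape=(16, 8)),
        layers.Reshape((16, 8, 1)),
        layers.Conv2D(64, 3, padding="same",
                      kernel_regularizer=regularizers.l2(l2_strength)),
        layers.BatchNormalization(axis=3),
        layers.Activation("relu"),
        layers.Flatten(),
        layers.Dropout(dropout_rate),   # regularizer just before the final layer
        layers.Dense(NUM_GESTURES, activation="softmax",
                     kernel_regularizer=regularizers.l2(l2_strength)),
    ])


model = build_small_cnn()
```

From here, CONV blocks can be added back one at a time while watching the gap between training and test accuracy.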
slow increase in accuracy
As a side note, you hinted that your accuracy isn't increasing as fast as you'd like by saying "slowly but surely". I've had great success when I've done all of the following steps:
1. Change your loss function to be the average loss over all predictions for all items in the mini-batch. This makes the loss independent of the batch size. Otherwise, the loss changes whenever you change the batch size, and you then have to change the SGD learning rate as well.
2. Your loss is now a single number that is the average of the loss over all predicted classes and all samples, so use a learning rate of 1.0. No need to scale it anymore.
3. Use tf.train.MomentumOptimizer with learning_rate = 1.0 and momentum = 0.5. MomentumOptimizer has been shown to be much more robust than GradientDescent.
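To see why averaging matters, here is a small plain-Python illustration (a toy one-weight model, not the asker's network). With a summed loss the gradient grows with the batch size, while with a mean loss it stays the same. The `squared_error_gradient` function and the data are made up for the example.

```python
def squared_error_gradient(w, xs, ys, reduce="mean"):
    """Gradient of the squared error of y_hat = w * x with respect to w."""
    per_sample = [2.0 * (w * x - y) * x for x, y in zip(xs, ys)]
    total = sum(per_sample)
    return total / len(per_sample) if reduce == "mean" else total


# Toy data: the large batch is the small batch repeated four times.
xs_small, ys_small = [1.0, 2.0], [2.0, 4.0]
xs_big, ys_big = xs_small * 4, ys_small * 4
w = 0.0

grad_sum_small = squared_error_gradient(w, xs_small, ys_small, reduce="sum")
grad_sum_big = squared_error_gradient(w, xs_big, ys_big, reduce="sum")
grad_mean_small = squared_error_gradient(w, xs_small, ys_small, reduce="mean")
grad_mean_big = squared_error_gradient(w, xs_big, ys_big, reduce="mean")

print(grad_sum_small, grad_sum_big)    # -20.0 -80.0 (scales with batch size)
print(grad_mean_small, grad_mean_big)  # -10.0 -10.0 (independent of batch size)
```

With the summed loss, a learning rate tuned for one batch size takes steps four times larger on a batch four times the size, which is the retuning problem described in step 1.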
It seems that there was something broken with Keras itself.
A naive
pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps
did the trick.
@wontonimo, thanks a lot for your really great answer!