Deep LSTM accuracy not crossing 50% - tensorflow

I am working on a classification problem with the SemEval 2017 Task 4A dataset (can be found here), and I am using a deep LSTM network for it. For pre-processing, I have done lower-casing -> tokenization -> lemmatization -> stop-word removal -> punctuation removal. For word embeddings, I have used a Word2Vec model. There are 18,000 samples in my training set and 2,000 in my test set.
The code for my model is
from keras.models import Sequential
from keras.layers import Embedding, BatchNormalization, Activation, Dropout, Bidirectional, LSTM, Dense
from keras_self_attention import SeqSelfAttention

model = Sequential()
model.add(Embedding(max_words, 30, input_length=max_len))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.3))
model.add(Bidirectional(LSTM(32, use_bias=True, return_sequences=True)))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.5))
# note: input_shape on a layer that is not the first layer is ignored by Keras
model.add(Bidirectional(LSTM(32, use_bias=True, return_sequences=True), input_shape=(128, 1, 64)))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(SeqSelfAttention(attention_activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
The value of max_words is 2000 and max_len is 300
But even after this, my test accuracy is not crossing 50%. I can't figure out the problem.
PS - I am using a validation split too. The loss function is binary cross-entropy and the optimizer is Adam.

Training "LSTM" is very different with other common deep learning model.
I recommend a higher dropout rate like 0.7,0.8. and Adam optimizer is particularly unstable in LSTM with real world data. So, i recommend SGD scheduled for a momentum of 0.9 and ReduceLROnPlateau. You have to do very long training, and if spark loss is observed, the training is going very well. (Spark Loss is a word used by NVIDIA researchers. It refers to a phenomenon in which the value of Loss that appears to converge increases significantly.)
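For reference, a minimal sketch of what that setup might look like. The learning rate, patience, epochs, and batch size here are assumptions rather than tested values, and X_train/y_train/X_val/y_val stand in for your own arrays:
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import ReduceLROnPlateau

# SGD with momentum instead of Adam; the initial learning rate is an assumption
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

# Halve the learning rate when the validation loss plateaus
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-5)

# Long training run, as recommended above
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=200,
          batch_size=64,
          callbacks=[reduce_lr])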

Related

How to stabilize loss when using keras for image classification

I am using Keras to perform image classification. I have 10 classes with ~900 images each. I used VGG16 and built this small network on top of it:
from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout

model = Sequential()
model.add(Flatten(input_shape=train_data.shape[1:]))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))
I am training for 50 epochs with:
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy', metrics=['accuracy'])
I get the accuracy and loss below:
[INFO] accuracy: 94.72%
[INFO] Loss: 0.45841544931342115
Yet I am not sure how to stabilize the loss. Should I increase the number of epochs, or are there other parameters I need to change?
Since the validation loss fluctuates from the first epochs, I think you forgot to freeze the main VGG model and train only the Dense stack you added on top.
Also, it's better to use 2D global average pooling instead of flattening.
If that doesn't solve the problem, try more efficient pre-trained CNN architectures such as MobileNetV2 or Xception.
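For illustration, a rough sketch of freezing the VGG16 base and swapping Flatten for global average pooling. The input size and the end-to-end setup are assumptions (the original code trains on pre-extracted VGG features), and num_classes would be 10 in this case:
from keras.applications import VGG16
from keras.models import Sequential
from keras.layers import GlobalAveragePooling2D, Dense, Dropout

base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base_model.layers:
    layer.trainable = False   # freeze the convolutional base

model = Sequential()
model.add(base_model)
model.add(GlobalAveragePooling2D())   # instead of Flatten
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy', metrics=['accuracy'])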

A neural network that can't overfit?

I am fitting a model to some noisy satellite data. The labels are measurements of rock on the bars of a river. There is a noisy but significant relationship. I only have 250 points, but the method would expand and eventually run on much bigger datasets. I'm looking at a mix of models (RANSAC, Huber, SVM regression) and DNNs. My DNN results seem too good to be true. The network looks like:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras import regularizers

def build_model(NetworkDims):
    model = Sequential()
    model.add(Dense(128, kernel_regularizer=regularizers.l2(0.001), input_dim=NetworkDims, kernel_initializer='he_normal', activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(128, kernel_regularizer=regularizers.l2(0.001), kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, kernel_regularizer=regularizers.l2(0.001), kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, kernel_regularizer=regularizers.l2(0.001), kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(32, kernel_regularizer=regularizers.l2(0.001), kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
And when I save the history and plot training loss (green dots) and validation loss (cyan line) vs epoch I get this:
Training and validation loss just creep down. With a small dataset, I was expecting the validation loss to go its own way. In fact, if I run a 10-fold cross val score with this network, the error reported by cross val score does creep down. This just looks too good to be true. It implies that I could train this thing for 1000 epochs and still improve results. If it looks too good to be true, it usually is, but why?
EDIT: More results.
So I tried cutting dropout to 0.1 at each layer and removing the L2. Interesting: with the toned-down dropout, I get even better results:
(plot: 10% dropout rate)
Without the L2, there is overfitting:
(plot: no L2 regularization)
My guess would be that it's the high dropout on every layer that is making it hard for the network to overfit the training data. My prediction is that if you lower the dropout and regularization, it will learn the training data much faster.
I'm not sure whether the results are too good to be true, because it's hard to judge how good a model is from the loss function alone. But it should be the dropout and regularization that are preventing it from overfitting within a few epochs.
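As a quick sanity check, a hypothetical build_light_model along those lines (same style of layers, 0.1 dropout, no L2; this is just the suggestion expressed in code, not a recommended final configuration):
from keras.models import Sequential
from keras.layers import Dense, Dropout

def build_light_model(NetworkDims):
    model = Sequential()
    model.add(Dense(128, input_dim=NetworkDims, kernel_initializer='he_normal', activation='relu'))
    model.add(Dropout(0.1))   # much lighter dropout so the network can actually fit the training data
    model.add(Dense(64, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.1))
    model.add(Dense(1, kernel_initializer='normal'))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model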

Loss increasing with batch normalization (tf.Keras)

I have a feed-forward NN with 2 hidden layers for a regression problem. Compared to when I do not add BN, the loss (MSE) is about double after training for the same number of epochs, and the execution time is also increased by about 20%. Why is that?
If I had to take a guess -- BN is not worth it on a 2-layer network, and the extra overhead introduced by BN is actually higher than whatever decrease in processing time it causes.
That would explain the execution time, but I am not sure why the loss is higher, too.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization

model = Sequential()
model.add(Dense(128, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(128, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(1, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')
I've tried a variety of optimizers, activation functions, number of epochs, batch size, etc, but no difference.
For regression, you should not use BatchNorm before the output layer.
On the other hand, you could use BatchNorm right after the input layer and before the first Dense layer to normalize inputs.
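A minimal sketch of that placement, assuming the input dimension is n_features (a placeholder name, not from the question):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization

model = Sequential()
model.add(BatchNormalization(input_shape=(n_features,)))  # normalize the raw inputs
model.add(Dense(128, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(128, activation='relu'))
model.add(Dense(1, activation='linear'))  # no BatchNorm right before the regression output
model.compile(loss='mean_squared_error', optimizer='adam')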

Why does increasing features lead to worse neural network performance?

I have a regression problem and configured a multi-layer neural network using Keras. The original dataset had 286 features, and using 20 epochs, the NN converged to an MSE loss of ~0.0009. This is using the Adam optimizer.
I then added three more features and, using the same configuration, the NN won't converge. After 1 epoch, it gets stuck at a loss of 0.003, which is significantly worse.
After checking that the new features are represented correctly, I have tried the following with no success:
adjusting number of layers
adjusting number of neurons in each layer
including dropout layers
adjusting the learning rate
Here is my original configuration:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(300, activation='relu',
                input_dim=training_set.shape[1]))
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(1, activation='linear'))
model.compile(optimizer='Adam', loss='mse')
Anybody have any ideas?

RNN Not Generalizing on Text Classification

I am using Keras and an RNN to classify Slack text data on whether the text is reaction-worthy or not (1 - emoji, 0 - no emoji). I have removed usernames and URLs from the text, and dropped duplicates that had different target variables.
I am not able to get the model to generalize to unseen data. The training and validation losses look good and continually decrease, but the validation accuracy only decreases.
I am using pretrained GloVe word embeddings since my training set is only about 25,000 sentences.
I have added additional layers, changed my regularization value, and increased dropout, but I get similar results. Is my model not complex enough to generalize the data? The times I added additional layers, they were much smaller but deeper, because the training time was about 2 minutes per epoch.
Any insight would be appreciated.
from keras.models import Sequential
from keras.layers import Embedding, Convolution1D, MaxPooling1D, Dropout, GRU, Dense
from keras.regularizers import l2
from keras.optimizers import Adam

embedding_layer = Embedding(len(word_index) + 1,
                            100,
                            weights=[embeddings_matrix],
                            input_length=max_message_length,
                            embeddings_regularizer=l2(0.001),
                            trainable=True)

# Creating the Model
model = Sequential()
model.add(embedding_layer)
model.add(Convolution1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.7))
model.add(GRU(128))
model.add(Dropout(0.7))
model.add(Dense(1, activation='sigmoid'))

# Compiling the model with our given Optimizer
optimizer = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.000025)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
print(model.summary())