I am working on a document classification problem.
It is multi-label classification with 20 different labels, 1,920 documents in training, and 480 in validation. The model is a CNN with FastText embeddings, and I use a logistic regression model with n-gram features as the baseline.
The problem is that the baseline model gives an F1-score of 0.36 while the CNN only gives 0.30.
The architecture I use is from here:
https://www.kaggle.com/vsmolyakov/keras-cnn-with-fasttext-embeddings
I have been doing some parameter tuning, and the current best parameters are: dropout 0.25, learning rate 0.001, trainable embeddings false, 128 filters, prediction threshold 0.15, and kernel size 9.
Do you have ideas about parameters to be especially aware of, ideas for changing the architecture, or anything else that might improve the F1-score?
# Parameters
BATCH_SIZE = 16
DROP_OUT = 0.25
N_EPOCHS = 20
N_FILTERS = 128
TRAINABLE = False
LEARNING_RATE = 0.001
N_DIM = 32
KERNEL_SIZE = 9
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dropout, Dense
from keras import optimizers, regularizers

# Create model (NB_WORDS, EMBED_DIM, embedding_matrix, MAX_SEQ_LEN and N_LABELS are defined elsewhere)
model = Sequential()
model.add(Embedding(NB_WORDS, EMBED_DIM, weights=[embedding_matrix],
input_length=MAX_SEQ_LEN, trainable=TRAINABLE))
model.add(Conv1D(N_FILTERS, KERNEL_SIZE, activation='relu', padding='same'))
model.add(MaxPooling1D(2))
model.add(Conv1D(N_FILTERS, KERNEL_SIZE, activation='relu', padding='same'))
model.add(GlobalMaxPooling1D())
model.add(Dropout(DROP_OUT))
model.add(Dense(N_DIM, activation='relu', kernel_regularizer=regularizers.l2(1e-4)))
model.add(Dense(N_LABELS, activation='sigmoid')) #multi-label (k-hot encoding)
adam = optimizers.Adam(lr=LEARNING_RATE, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
model.summary()
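For context, the 0.15 prediction threshold above is applied to the sigmoid outputs; a minimal sketch of how it can be applied and tuned against validation micro-F1 (X_val and y_val are placeholder names for the validation data):

import numpy as np
from sklearn.metrics import f1_score

# Sigmoid probabilities for the validation set, shape (n_samples, N_LABELS)
y_prob = model.predict(X_val)

# Sweep a global threshold and keep the one with the best micro-F1
best_t, best_f1 = 0.5, 0.0
for t in np.arange(0.05, 0.55, 0.05):
    score = f1_score(y_val, (y_prob >= t).astype(int), average='micro')
    if score > best_f1:
        best_t, best_f1 = t, score
print('best threshold: %.2f, micro-F1: %.3f' % (best_t, best_f1))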
Edit
I think I got some wrong hyperparameters by fixing the number of epochs to 20 during tuning. I am now trying with a stopping criterion; the model usually converges around 30-35 epochs. It seems a dropout of 0.5 works better, and I am currently tuning the batch size. If somebody has experience or knowledge about the relationship between epochs and other hyperparameters, feel free to share.
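For anyone curious, a minimal sketch of the stopping criterion I am trying (restore_best_weights needs Keras >= 2.2.3; X_train, y_train, X_val and y_val are placeholder names):

from keras.callbacks import EarlyStopping

# Stop when validation loss stops improving and roll back to the best epoch
early_stop = EarlyStopping(monitor='val_loss', patience=5,
                           restore_best_weights=True)
model.fit(X_train, y_train,
          batch_size=BATCH_SIZE,
          epochs=100,  # upper bound; early stopping decides the real number
          validation_data=(X_val, y_val),
          callbacks=[early_stop])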
A thing you should consider in general is whether the data is imbalanced and how your model performs for each class (using, for example, sklearn.metrics.confusion_matrix).
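For a multi-label setup like this one, a minimal sketch of such per-class diagnostics (multilabel_confusion_matrix needs scikit-learn >= 0.21; X_val and y_val are placeholders for the validation data):

from sklearn.metrics import classification_report, multilabel_confusion_matrix

# k-hot predictions at the question's 0.15 threshold
y_pred = (model.predict(X_val) >= 0.15).astype(int)

print(classification_report(y_val, y_pred))        # per-label precision/recall/F1 and support
print(multilabel_confusion_matrix(y_val, y_pred))  # one 2x2 confusion matrix per label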
I think the dataset (2,000 documents over 20 classes) might not be big enough for deep learning to work from scratch. You could consider augmenting your dataset, or you could start by trying to fine-tune a pretrained language model for your task; see https://github.com/huggingface/pytorch-openai-transformer-lm. That could help you overcome the issue with the dataset size in general.
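The linked repo ships its own fine-tuning scripts; purely as an illustration of the idea, here is a sketch with the newer Hugging Face transformers library (my assumption, not the linked repo's API):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=20,
    problem_type='multi_label_classification')  # sigmoid per label + BCE loss

enc = tokenizer(['some document text'], truncation=True, return_tensors='pt')
labels = torch.zeros((1, 20))            # k-hot targets for one document
loss = model(**enc, labels=labels).loss  # backpropagate this to fine-tune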
Related
I have 4 classes, each with 1,350 images. The validation set contains 20% of the total images (it is generated automatically). The training model uses the MobileNetV2 network:
import tensorflow as tf

# IMG_SHAPE is defined elsewhere, e.g. (224, 224, 3)
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE, include_top=False, weights='imagenet')
The model is created:
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.Conv2D(32, 3, activation='relu',
                           kernel_regularizer=tf.keras.regularizers.l2(0.001)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4, activation='softmax',
                          kernel_regularizer=tf.keras.regularizers.l2(0.001))
])
The model is trained for 20 epochs and then fine-tuning is done for another 15 epochs. The results are as follows:
Image of the model trained before fine-tuning
Image of the model trained after 15 epochs and fine-tuning
It's a bit difficult to tell without the numeric values of the validation loss, but I would say the results before fine-tuning are slightly overfitting, and after fine-tuning less so. A couple of additional things you could do: first, try an adjustable learning rate using the callback tf.keras.callbacks.ReduceLROnPlateau, set up to monitor validation loss (documentation is here; I set factor=.5 and patience=1). Second, replace the Flatten layer with tf.keras.layers.GlobalMaxPool2D and see if it improves the validation loss.
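A minimal sketch of both suggestions, reusing base_model from the question (the factor and patience values are just my settings):

# Adjustable learning rate: halve the LR when validation loss plateaus
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss',
                                                 factor=0.5, patience=1)

# Same head as the question, but GlobalMaxPool2D instead of MaxPool2D + Flatten
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.Conv2D(32, 3, activation='relu',
                           kernel_regularizer=tf.keras.regularizers.l2(0.001)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.GlobalMaxPool2D(),
    tf.keras.layers.Dense(4, activation='softmax',
                          kernel_regularizer=tf.keras.regularizers.l2(0.001))
])

# Pass the callback at training time, e.g.:
# model.fit(train_data, validation_data=val_data, epochs=20, callbacks=[reduce_lr])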
I am working on a classification problem with the SemEval 2017 Task 4A dataset (it can be found here),
and I am using a deep LSTM network for it. In preprocessing, I have done lowercasing -> tokenization -> lemmatization -> stop-word removal -> punctuation removal. For word embeddings, I have used a word2vec model. There are 18,000 samples in my training set and 2,000 in testing.
The code for my model is
from keras.models import Sequential
from keras.layers import Embedding, BatchNormalization, Activation, Dropout, Bidirectional, LSTM, Dense
from keras_self_attention import SeqSelfAttention  # third-party: pip install keras-self-attention

model = Sequential()
model.add(Embedding(max_words, 30, input_length=max_len))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.3))
model.add(Bidirectional(LSTM(32, use_bias=True, return_sequences=True)))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.5))
# Note: input_shape on a non-first layer is ignored by Keras
model.add(Bidirectional(LSTM(32, use_bias=True, return_sequences=True), input_shape=(128, 1, 64)))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(SeqSelfAttention(attention_activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
The value of max_words is 2000 and max_len is 300
But even after all this, my testing accuracy will not cross 50%, and I can't figure out the problem.
PS: I am using a validation set as well. The loss function is binary cross-entropy and the optimizer is Adam.
Training "LSTM" is very different with other common deep learning model.
I recommend a higher dropout rate like 0.7,0.8. and Adam optimizer is particularly unstable in LSTM with real world data. So, i recommend SGD scheduled for a momentum of 0.9 and ReduceLROnPlateau. You have to do very long training, and if spark loss is observed, the training is going very well. (Spark Loss is a word used by NVIDIA researchers. It refers to a phenomenon in which the value of Loss that appears to converge increases significantly.)
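A minimal sketch of that recipe in Keras (the exact learning rate, patience, and epoch count are illustrative):

from keras.optimizers import SGD
from keras.callbacks import ReduceLROnPlateau

sgd = SGD(lr=0.01, momentum=0.9)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3)

model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
# Expect long training runs; the schedule lowers the LR when val_loss stalls
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=200, callbacks=[reduce_lr])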
I am new to Keras and have never asked a question here, so excuse any rookie mistakes I might make.
What I am trying to do is implement a binary classifier operating on images (CTs, to be exact).
My model is based on a pretrained net that performed classification on 14 classes (see the wonderful repo here: https://github.com/jrzech/reproduce-chexnet).
As the saying goes, "crawl before you walk, walk before you run", my current humble goal is to achieve overfitting of the network on some 100 examples.
My current problem is that the net converges to a weird solution, with the output neuron (I'm using sigmoid) always very close to 50% and 100% of the predictions going to one class (so I'm stuck at about 50% accuracy). My loss and accuracy do not change at all from epoch 1 or so.
Things I tried/considered:
Using different optimizers (I used the Adam optimizer and, after that, SGD).
Trying categorical crossentropy as well (with a softmax layer at the end instead of sigmoid), since some say it might perform better (see "Keras' fit_generator() for binary classification predictions always 50%"); a sketch of this variant follows the model code below.
Adding an additional dense layer (I thought I might be underfitting somehow).
Changing the batch size to 128 (and overfitting on 1000 examples).
All failed miserably, so I'm kind of at a loss here. I would be happy to provide more details if needed, and would appreciate any help or insights you might have. Major parts of my code are attached. Note that the ModelFactory() I'm loading and using is the pretrained one.
Thanks in advance!
data generator code
from keras.preprocessing.image import ImageDataGenerator

rescale = 1./255.0
target_size = (224, 224)
batch_size = 128

train_datagen = ImageDataGenerator(
    width_shift_range=0.1,
    height_shift_range=0.1,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    rescale=rescale
)

train_generator = train_datagen.flow_from_dataframe(
    train_csv,
    directory=train_path,
    x_col='image_name',
    y_col='class',
    target_size=target_size,
    color_mode='rgb',
    class_mode='binary',
    batch_size=batch_size,
    shuffle=True,
)
my model
import keras

def get_model():
    file_name = '/content/brucechou1983_CheXNet_Keras_0.3.0_weights.h5'
    base_model = ModelFactory().get_model(class_names=[str(i) for i in range(14)],
                                          weights_path=file_name)
    x = base_model.output
    x = keras.layers.Dense(1024, activation='relu')(x)
    x = keras.layers.BatchNormalization(trainable=True)(x)
    predictions = keras.layers.Dense(1, activation='sigmoid')(x)
    model = keras.models.Model(inputs=base_model.inputs, outputs=predictions)
    # Freeze the pretrained backbone; only the new head is trained
    for layer in base_model.layers:
        layer.trainable = False
    model.summary()
    return model
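As an aside, the categorical-crossentropy variant from my list above would replace the head roughly like this (a sketch reusing x and base_model from get_model(); the generator would then need class_mode='categorical' so labels are one-hot):

# Two-unit softmax head instead of the single sigmoid unit
predictions = keras.layers.Dense(2, activation='softmax')(x)
model = keras.models.Model(inputs=base_model.inputs, outputs=predictions)
model.compile(keras.optimizers.SGD(lr=1e-6, decay=1e-6, momentum=0.9, nesterov=True),
              loss='categorical_crossentropy',
              metrics=['accuracy'])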
training the model
import numpy as np
import sklearn.utils

# compute_class_weight returns an array; Keras expects a dict mapping
# class index -> weight
weights = sklearn.utils.class_weight.compute_class_weight(
    'balanced', np.unique(train_csv['class']), train_csv['class'])
class_weight = dict(enumerate(weights))

model.compile(keras.optimizers.SGD(lr=1e-6, decay=1e-6, momentum=0.9, nesterov=True),
              loss='binary_crossentropy',
              metrics=['binary_accuracy'])

history = model.fit_generator(
    train_generator,
    steps_per_epoch=len(train_generator),
    epochs=10,
    verbose=1,
    class_weight=class_weight
)
I am new to the field of deep learning and am currently working on this competition for predicting earthquake damage to buildings.
The model I created starts at an accuracy of 0.56 and stays there for any number of epochs I let it run. When finished, the model only predicts one of the three classes (which I one-hot encoded into a dataframe with three columns). Changing the number of layers, the optimizer, the data preparation, or the dropout won't change anything. Even trying to overfit the model by over-parameterizing the neural network still yields the same accuracy and a non-learning model.
What am I doing wrong?
This is my code:
import keras

model = keras.models.Sequential()
model.add(keras.layers.Dense(64, input_dim=85, activation="relu"))
model.add(keras.layers.Dropout(0.3))  # Dropout layers must be passed to model.add() to take effect
model.add(keras.layers.Dense(128, activation="relu"))
model.add(keras.layers.Dropout(0.3))
model.add(keras.layers.Dense(256, activation="relu"))
model.add(keras.layers.Dropout(0.3))
model.add(keras.layers.Dense(512, activation="relu"))
model.add(keras.layers.Dense(3, activation="softmax"))
adam = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
model.compile(optimizer=adam,
              loss='categorical_crossentropy',
              metrics=['accuracy'])
history = model.fit(traindata, trainlabels,
                    epochs=5,
                    validation_split=0.2,
                    verbose=1)
There's nothing visually wrong with your model, but it may be too heavy to learn any useful features.
Try normalizing your input with https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
Start with only 2 layers and a small number of neurons.
Increase the batch_size and try learning-rate scheduling.
Observe the validation accuracy and stop when it starts to overfit.
For a 3-class classification problem, 56% accuracy is better than the baseline; remember, it's a competition, so the data is not dummy playground data on which you can expect to get 90% accuracy with an MLP on the first try.
Finally, try hyperparameter optimization with a tuner (e.g., Keras Tuner).
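A minimal sketch of the first suggestions (standardized input, two smaller layers, larger batch size), following the variable names from the question:

import keras
from sklearn.preprocessing import StandardScaler

# Standardize features to zero mean and unit variance
scaler = StandardScaler()
traindata_scaled = scaler.fit_transform(traindata)

# Two small hidden layers instead of four wide ones
model = keras.models.Sequential([
    keras.layers.Dense(32, input_dim=85, activation='relu'),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(3, activation='softmax')
])
model.compile(optimizer=keras.optimizers.Adam(lr=0.001),
              loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(traindata_scaled, trainlabels, epochs=20, batch_size=256,
          validation_split=0.2, verbose=1)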
I am using Keras and an RNN to classify Slack text data on whether the text is reaction-worthy or not (1 - emoji, 0 - no emoji). I have removed usernames and URLs from the text, and dropped duplicates that had different target variables.
I am not able to get the model to generalize to unseen data. The training and validation losses look good and continually decrease, but the validation accuracy only decreases.
I am using pretrained GloVe word embeddings, since my training set is only about 25,000 sentences.
I have added additional layers, changed my regularization value, and increased dropout, but I get similar results. Is my model not complex enough to generalize to the data? The times I added additional layers they were much smaller but deeper, because the training time was about 2 minutes per epoch.
Any insight would be appreciated.
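For reference, the embeddings_matrix used below is built from the GloVe vectors roughly like this (a sketch; the glove.6B.100d.txt file name and the Tokenizer's word_index are assumptions matching the 100-dimensional embedding in my model):

import numpy as np

# Load the pretrained GloVe vectors into a dict: word -> 100-d vector
embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings[values[0]] = np.asarray(values[1:], dtype='float32')

# Row i holds the vector for the word with index i; OOV words stay all-zero
embeddings_matrix = np.zeros((len(word_index) + 1, 100))
for word, i in word_index.items():
    vector = embeddings.get(word)
    if vector is not None:
        embeddings_matrix[i] = vector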
from keras.models import Sequential
from keras.layers import Embedding, Convolution1D, MaxPooling1D, Dropout, Dense
from keras import layers
from keras.regularizers import l2
from keras.optimizers import Adam

embedding_layer = Embedding(len(word_index) + 1,
                            100,
                            weights=[embeddings_matrix],
                            input_length=max_message_length,
                            embeddings_regularizer=l2(0.001),
                            trainable=True)

# Creating the Model
model = Sequential()
model.add(embedding_layer)
model.add(Convolution1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.7))
model.add(layers.GRU(128))
model.add(Dropout(0.7))
model.add(Dense(1, activation='sigmoid'))

# Compiling the model with our given Optimizer
optimizer = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.000025)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
print(model.summary())