I've seen other similar questions and followed their solutions, to little improvement. I'm making a model to identify the gender of names. As training data I'm using a list of baby names found here: https://www.ssa.gov/oact/babynames/limits.html. I extracted this data to a new data frame, keeping only one instance of those names occurring more than once, and sorted randomly.
Each name string in a column was converted to a numeric array of length max_len and normalized by the function:
import tensorflow as tf

def text_to_numeric(column, max_len):
    # Split every name into a list of its characters
    word_characters = []
    for word in column:
        word_characters.append([c for c in word])
    letters_kept = 25
    tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=letters_kept, oov_token='<UNK>')
    tokenizer.fit_on_texts(word_characters)
    word_sequence = tokenizer.texts_to_sequences(word_characters)
    # Left-pad (or truncate) every sequence to max_len
    words_pre = tf.keras.preprocessing.sequence.pad_sequences(word_sequence, maxlen=max_len, padding="pre")
    # L2-normalize each padded row
    words_pre = tf.keras.utils.normalize(words_pre)
    return list(words_pre)
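To make the encoding concrete, here is a minimal pure-Python sketch of the same three steps (character indices, left padding, L2 normalization). It is a simplified stand-in, not the Keras Tokenizer: the helper name encode_name and the fixed a-z to 1-26 mapping are assumptions made for illustration, whereas the real Tokenizer assigns indices by character frequency.

```python
import math

def encode_name(name, max_len):
    """Hypothetical helper: char indices -> left padding -> L2 normalization."""
    # Map 'a'..'z' to 1..26; 0 is reserved for padding (an assumption of this sketch).
    ids = [ord(c) - ord("a") + 1 for c in name.lower() if "a" <= c <= "z"]
    ids = ids[-max_len:]                        # keep the last max_len characters
    padded = [0] * (max_len - len(ids)) + ids   # left-pad with zeros, like padding="pre"
    norm = math.sqrt(sum(v * v for v in padded)) or 1.0
    return [v / norm for v in padded]           # scale the row to unit L2 norm

vec = encode_name("Ana", max_len=5)
print(vec)
```

Note that after normalization the same letter gets a different value depending on the other letters in the name, which may make the input harder for a dense network to use than one-hot or embedded characters.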
The expected output is an array of 2 element list where [1,0] means “Male” and [0,1] means “Female”. The model, where data_file contains processed names and labels, looks like this:
input_length, input_data, output_data = data_reader(data_file)
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(100, input_dim=input_length, activation='relu'))
model.add(tf.keras.layers.Dense(100, activation='relu'))
model.add(tf.keras.layers.Dense(2, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer="adam", metrics=['accuracy'])
model.fit(input_data, output_data, epochs=30, verbose=1, validation_split=0.1)
No matter what, I always get an accuracy of around 75%. I don't know how to choose the model parameters, but I've tried many combinations and the accuracy changes little. So far I've tried: normalizing the input, balancing the input dataset so there are the same number of men and women, changing the optimizer, defining an optimizer and changing the learning rate, changing the number of layers, the nodes per layer and the activation function, and increasing the number of epochs.
All of this with no significant change in the model's accuracy. Am I missing something or doing something completely wrong? Is this accuracy as good as it gets?
My training and loss curves look like below and yes, similar graphs have received comments like "Classic overfitting" and I get it.
My model looks like below,
input_shape_0 = keras.Input(shape=(3,100, 100, 1), name="img3")
model = tf.keras.layers.TimeDistributed(Conv2D(8, 3, activation="relu"))(input_shape_0)
model = tf.keras.layers.TimeDistributed(Dropout(0.3))(model)
model = tf.keras.layers.TimeDistributed(MaxPooling2D(2))(model)
model = tf.keras.layers.TimeDistributed(Conv2D(16, 3, activation="relu"))(model)
model = tf.keras.layers.TimeDistributed(MaxPooling2D(2))(model)
model = tf.keras.layers.TimeDistributed(Conv2D(32, 3, activation="relu"))(model)
model = tf.keras.layers.TimeDistributed(MaxPooling2D(2))(model)
model = tf.keras.layers.TimeDistributed(Dropout(0.3))(model)
model = tf.keras.layers.TimeDistributed(Flatten())(model)
model = tf.keras.layers.TimeDistributed(Dropout(0.4))(model)
model = LSTM(16, kernel_regularizer=tf.keras.regularizers.l2(0.007))(model)
# model = Dense(100, activation="relu")(model)
# model = Dense(200, activation="relu",kernel_regularizer=tf.keras.regularizers.l2(0.001))(model)
model = Dense(60, activation="relu")(model)
# model = Flatten()(model)
model = Dropout(0.15)(model)
out = Dense(30, activation='softmax')(model)
model = keras.Model(inputs=input_shape_0, outputs = out, name="mergedModel")
def get_lr_metric(optimizer):
    def lr(y_true, y_pred):
        return optimizer.lr
    return lr
opt = tf.keras.optimizers.RMSprop()
lr_metric = get_lr_metric(opt)
# merged.compile(loss='sparse_categorical_crossentropy',
#                optimizer='adam', metrics=['accuracy'])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=opt, metrics=['accuracy', lr_metric])
model.summary()
In the above model building code, please consider the commented lines as some of the approaches I have tried so far.
I have followed the suggestions given as answers and comments to this kind of question and none seems to be working for me. Maybe I am missing something really important?
Things that I have tried:
Dropouts at different places and different amounts.
Added and removed dense layers and varied their number of units.
Number of units on the LSTM layer was tried with different values (started from as low as 1 and now at 16, I have the best performance.)
Came across weight regularization techniques and applied them as shown in the code above, trying them on different layers. (I would like to know the principled way to decide where to use them, rather than the simple trial and error I have been doing, which seems wrong.)
Implemented learning rate scheduler using which I reduce the learning rate as the epochs progress after a certain number of epochs.
Tried two LSTM layers with the first one having return_sequences = true.
After all these, I still cannot overcome the overfitting problem.
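For reference, the kind of learning rate schedule described in the list above (hold the rate for some epochs, then reduce it) could look like the sketch below. The function name, hold_epochs and decay values are assumptions, not the settings actually used. The (epoch, lr) signature is the one that tf.keras.callbacks.LearningRateScheduler accepts, so a function like this could be passed to that callback.

```python
import math

def schedule(epoch, lr, hold_epochs=10, decay=0.1):
    """Keep the learning rate for hold_epochs epochs, then decay it every epoch."""
    if epoch < hold_epochs:
        return lr
    return lr * math.exp(-decay)

# Simulate how the rate evolves over 15 epochs, starting from 1e-3.
lr = 1e-3
history = []
for epoch in range(15):
    lr = schedule(epoch, lr)
    history.append(lr)
print(history[9], history[14])
```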
My data set is properly shuffled and divided in a train/val ratio of 80/20.
Data augmentation is one more thing that I found commonly suggested which I am yet to try, but I want to see if I am making some mistake so far which I can correct it and avoid diving into data augmentation steps for now. My data set has the below sizes:
Training images: 6780
Validation images: 1484
The numbers shown are samples, and each sample has 3 images. So I input 3 images at once as one sample to my time-distributed CNN, which is then followed by the other layers shown in the model description. That means my training set has 6780 * 3 images and my validation set has 1484 * 3 images. Each image is 100 * 100 with 1 channel.
I am using RMSprop as the optimizer, which performed better than Adam in my testing.
UPDATE
I tried some different architectures, with regularization and dropout at different places, and I am now able to achieve a val_acc of 59%. Below is the new model.
# kernel_regularizer=tf.keras.regularizers.l2(0.004)
# kernel_constraint=max_norm(3)
model = tf.keras.layers.TimeDistributed(Conv2D(32, 3, activation="relu"))(input_shape_0)
model = tf.keras.layers.TimeDistributed(Dropout(0.3))(model)
model = tf.keras.layers.TimeDistributed(MaxPooling2D(2))(model)
model = tf.keras.layers.TimeDistributed(Conv2D(64, 3, activation="relu"))(model)
model = tf.keras.layers.TimeDistributed(MaxPooling2D(2))(model)
model = tf.keras.layers.TimeDistributed(Conv2D(128, 3, activation="relu"))(model)
model = tf.keras.layers.TimeDistributed(MaxPooling2D(2))(model)
model = tf.keras.layers.TimeDistributed(Dropout(0.3))(model)
model = tf.keras.layers.TimeDistributed(GlobalAveragePooling2D())(model)
model = LSTM(128, return_sequences=True,kernel_regularizer=tf.keras.regularizers.l2(0.040))(model)
model = Dropout(0.60)(model)
model = LSTM(128, return_sequences=False)(model)
model = Dropout(0.50)(model)
out = Dense(30, activation='softmax')(model)
Try to perform Data Augmentation as a preprocessing step. Lack of data samples can lead to such curves. You can also try using k-fold Cross Validation.
There are many ways to prevent overfitting, according to the papers below:
Dropout layers (randomly disabling neurons). https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
Input noise (e.g. random Gaussian noise on the images). https://arxiv.org/pdf/2010.07532.pdf
Random data augmentations (e.g. rotating, shifting, scaling, etc.). https://arxiv.org/pdf/1906.11052.pdf
Adjusting the number of layers and units. https://clgiles.ist.psu.edu/papers/UMD-CS-TR-3617.what.size.neural.net.to.use.pdf
Regularization functions (e.g. L1, L2, etc.). https://www.researchgate.net/publication/329150256_A_Comparison_of_Regularization_Techniques_in_Deep_Neural_Networks
Early stopping: if you notice that for N successive epochs your model's training loss is decreasing but the model performs poorly on the validation data set, it is a good sign to stop the training.
Shuffling the training data or K-fold cross-validation is also a common way of dealing with overfitting.
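The early stopping rule from the list can be sketched in a few lines. This is only an illustration of the patience logic on a made-up list of validation losses; the function name and the numbers are assumptions. In Keras, the tf.keras.callbacks.EarlyStopping callback (with monitor="val_loss" and a patience value) does this during model.fit.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss      # new best validation loss: reset the counter
            wait = 0
        else:
            wait += 1        # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: they improve for three epochs, then get worse.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70, 0.80]
stop = early_stopping_epoch(val_losses, patience=3)
print(stop)
```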
I found this great repository, which contains examples of how to implement data augmentations:
https://github.com/kochlisGit/random-data-augmentations
Also, this repository here seems to have examples of CNNs that implement most of the above methods:
https://github.com/kochlisGit/Tensorflow-State-of-the-Art-Neural-Networks
The goal should be to get the model to predict correctly irrespective of the order in which the 3 images in the sample are arranged.
If the order of the images in each sample is not important for training, I think your model does the opposite: TimeDistributed layers followed by an LSTM take the order of the three images into account. As a first solution, you can add reordered copies of each sample's images (= augmented data). Secondly, try treating the three images as one image with three channels and removing the TimeDistributed layers (I'm not sure the three-channel version is more efficient, but you can give it a try).
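A minimal sketch of the reordering idea, assuming each sample is a sequence of 3 images with a single label. The strings stand in for image arrays and the helper name is made up for illustration. Each sample yields 3! = 6 orderings that all share the original label.

```python
from itertools import permutations

def augment_by_reordering(sample, label):
    """Yield every ordering of the images in one sample, all with the same label."""
    for order in permutations(range(len(sample))):
        yield [sample[i] for i in order], label

# Placeholder "images"; in practice these would be 100x100x1 arrays.
sample = ["img_a", "img_b", "img_c"]
augmented = list(augment_by_reordering(sample, label=7))
print(len(augmented))
```

It probably makes sense to apply this to the training split only, so that the validation set still reflects the original data.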
I am using a deep neural network model (implemented in Keras) to make predictions. Something like this:
def make_model():
    model = Sequential()
    model.add(Conv2D(20, (5, 5), activation="relu"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(20, activation="relu"))
    model.add(Lambda(lambda x: tf.expand_dims(x, axis=1)))
    model.add(SimpleRNN(50, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer=adagrad, metrics=["accuracy"])
    return model
model = make_model()
model.fit(x_train, y_train, validation_data=(x_validation, y_validation), epochs=25, batch_size=25, verbose=1)
## Prediction:
prediction = model.predict_classes(x)
probabilities = model.predict_proba(x)  # I assume these are the probabilities of the class being predicted
My problem is a classification(binary) problem. I wish to calculate the confidence score of each of these prediction i.e. I wish to know - Is my model 99% certain it is "0" or is it 58% it is "0".
I have found some views on how to do it, but can't implement them. The approach I wish to follow says: "With classifiers, when you output you can interpret values as the probability of belonging to each specific class. You can use their distribution as a rough measure of how confident you are that an observation belongs to that class."
How should I predict with something like above model so that I get its confidence about each predictions? I would appreciate some practical examples (preferably in Keras).
Softmax is a problematic way to estimate the confidence of a model's prediction.
There are a few recent papers about this topic.
You can look for "calibration" of neural networks in order to find relevant papers.
This is one example you can start with - https://arxiv.org/pdf/1706.04599.pdf
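As a rough illustration of what "calibration" measures, one commonly used metric is the expected calibration error (ECE): predictions are grouped into confidence bins, and the gap between average confidence and actual accuracy is averaged over the bins. The sketch below is a simplified version with equal-width bins; the function name and the toy numbers are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # a confidence of 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece

# Toy case: the model claims 95% confidence but is right only 3 times out of 4.
ece = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 1, 0])
print(ece)
```

A well-calibrated model has an ECE close to zero. A large value means the raw output probabilities should not be read directly as confidence.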
In Keras, there is a method called predict() that is available for both Sequential and Functional models. It will work fine in your case if you are using binary_crossentropy as your loss function and a final Dense layer with a sigmoid activation function.
Here is how to call it with one test data instance. With a single sigmoid output unit, mymodel.predict() returns one value per instance: the predicted probability of class 1. The probability of class 0 is one minus that value, so the two add up to 1.0. These values are the confidence scores that you mentioned. You can further use np.where() as shown below to determine the final class (class 1 if the probability is over 50%, class 0 otherwise).
yhat_probabilities = mymodel.predict(mytestdata, batch_size=1)
yhat_classes = np.where(yhat_probabilities > 0.5, 1, 0).squeeze().item()
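To turn a single sigmoid output into the kind of statement asked for in the question ("99% certain it is 0" versus "58% certain it is 0"), the value can be read as P(class 1) and the confidence taken as the probability of whichever class is predicted. A small sketch, with a hypothetical helper name and made-up probabilities:

```python
def class_and_confidence(p_class1, threshold=0.5):
    """Map a sigmoid output P(class 1) to (predicted class, confidence in that class)."""
    label = 1 if p_class1 > threshold else 0
    confidence = p_class1 if label == 1 else 1.0 - p_class1
    return label, confidence

# Made-up sigmoid outputs for three inputs.
for p in (0.01, 0.42, 0.58):
    label, confidence = class_and_confidence(p)
    print(f"P(class 1)={p:.2f} -> class {label} with confidence {confidence:.2f}")
```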
I've come to understand that the probabilities that are output by logistic regression can be interpreted as confidence.
Here are some links to help you come to your own conclusion.
https://machinelearningmastery.com/how-to-score-probability-predictions-in-python/
how to assess the confidence score of a prediction with scikit-learn
https://stats.stackexchange.com/questions/34823/can-logistic-regressions-predicted-probability-be-interpreted-as-the-confidence
https://kiwidamien.github.io/are-you-sure-thats-a-probability.html
How about using a softmax as the activation in the last layer? Something like this:
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer = adagrad, metrics = ["accuracy"])
In this way, for each data point, you will be given a probabilistic-ish result by the model, which tells what is the likelihood that your data point belongs to each of two classes.
For example, for a given X, if the model returns (0.3, 0.7), you will know it is more likely that X belongs to class 1 than to class 0, with estimated probabilities of 0.7 and 0.3 respectively.
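For intuition, this is what the softmax layer computes from the two raw scores (logits) of the last Dense layer. The logits below are chosen by hand so that the result is the (0.3, 0.7) pair from the example; they are not taken from any trained model.

```python
import math

def softmax(logits):
    """Turn raw scores into non-negative values that sum to 1."""
    m = max(logits)                               # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([0.0, math.log(7 / 3)])           # hand-picked logits
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```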
total train data record: 460000
total cross-validation data record: 89000
number of output class: 392
tensorflow 1.8.0 CPU installation
Each data record has 26 features, of which 25 are numeric and one is categorical; the categorical one is one-hot encoded into 19 additional features. Initially, not every feature value was present for each data record. I used the average to fill missing float-type features and the most frequent value for missing int-type features. The output can be any of 392 classes labeled 0 to 391.
Finally, all features are passed through a StandardScaler()
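The preprocessing described above can be sketched roughly as follows in plain Python. The helper names and the toy columns are assumptions. The scaling step subtracts the mean and divides by the population standard deviation, which is what StandardScaler does per feature.

```python
from statistics import mean, mode, pstdev

def impute(column, strategy):
    """Fill None entries with the column mean ("mean") or most frequent value ("mode")."""
    present = [v for v in column if v is not None]
    fill = mean(present) if strategy == "mean" else mode(present)
    return [fill if v is None else v for v in column]

def standardize(column):
    """Scale a column to zero mean and unit variance."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma if sigma else 0.0 for v in column]

float_feature = impute([1.0, None, 3.0], "mean")   # float feature: fill with the average
int_feature = impute([4, 4, None, 7], "mode")      # int feature: fill with the most frequent value
scaled = standardize(float_feature)
print(float_feature, int_feature, scaled)
```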
Here is my model:
output_class = 392
X_train, X_test, y_train, y_test = get_data()
# y_train and y_test contains int from 0-391
# Make y_train and y_test categorical
y_train = tf.keras.utils.to_categorical(y_train, output_class)
y_test = tf.keras.utils.to_categorical(y_test, output_class)
# Convert to float type
y_train = y_train.astype(np.float32)
y_test = y_test.astype(np.float32)
# tf.enable_eager_execution() # turned off to use rmsprop optimizer
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(400, activation=tf.nn.relu, input_shape=(44,)))
model.add(tf.keras.layers.Dense(40000, activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(392, activation=tf.nn.softmax))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
import logging
logging.getLogger().setLevel(logging.INFO)
model.fit(X_train, y_train, epochs=3)
loss, acc = model.evaluate(X_test, y_test)
print('Accuracy', acc)
But this model gives only 28% accuracy on both training and test data. What should I change here to get good accuracy on both training and test data? Should I go wider and deeper? Or should I consider taking more features?
Note: there were total 400 unique features in the dataset. But most of the features only appeared randomly in 5 to 10 data record. And some features have no relevance in other data records. I picked 26 features based on domain knowledge and frequency in data records.
Any suggestion is appreciated. Thanks.
EDIT: I forgot to add this in the original post. @Neb suggested a narrower, deeper network; I actually tried this. My first model had layers [44, 400, 400, 392]. It gave me around 30% accuracy in training and testing.
Your model is too wide. You have 400 nodes in the first hidden layer and 40,000 in the second, for a total of 44*400 + 400*40,000 + 40,000*392 = 31,697,600 weights (not counting biases). However, you only input 44 features!
Because of this, your net is capable of detecting even the smallest, most imperceptible variations in the inputs, and it ends up treating them as valuable information instead of noise. I'm quite sure that if you leave your network training for a long time (here I only see 3 epochs), it will end up overfitting your training set.
You have some solutions:
reduce the number of nodes per layer. You may also experiment with adding 1 or 2 new layers. A possible structure might be [44, 128, 512, 392]
Implement regularization. You have multiple ways to do this:
restrict the range in which network parameters live
implement Dropout
implement Batch normalization (which is known to have a small regularization effect)
use Adam Optimizer instead of RMSprop
If your features are somewhat correlated, you may try a CNN instead of a Fully connected network.
Then, to improve generalization you can:
explore the dataset looking for outliers and remove them. An outlier is a sample which can confuse the network or does not convey any additional information.
"randomly" initialize your parameters, e.g using Xavier's Initialization
Finally, I would say: do you really need 392 classes? Could you merge some of them?
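The size difference between the layer layouts discussed in this answer can be checked with a short sketch that counts, for each fully connected layer, fan_in * fan_out weights plus one bias per unit. The helper name is an assumption; model.summary() in Keras reports the same totals for a built model.

```python
def dense_param_count(layer_sizes, use_bias=True):
    """Count trainable parameters of a fully connected net given its layer sizes."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out      # weight matrix between two consecutive layers
        if use_bias:
            total += fan_out           # one bias per unit in the receiving layer
    return total

wide = dense_param_count([44, 400, 40000, 392])    # layout from the question
narrow = dense_param_count([44, 128, 512, 392])    # layout suggested in the answer
print(wide, narrow)
```

The suggested layout has on the order of a hundred times fewer parameters than the one in the question.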
I have one million sequences I'm trying to classify as either 0 or 1. The outcome is fairly well balanced (class 0: 70%, class 1: 30%). Maximum sequence length is 50, and I've post-padded my sequences with zeroes. There are 100 unique sequence symbols. Embedding length is 30. It's an LSTM NN trained on two outputs (one is the main output node, and the other is right after the LSTM). The code is below.
As a sanity check, I ran three versions of this: one in which I randomize the outcome labels (I expected terrible performance), another in which the outcome labels are correct but I randomize the order of events within each sequence (I also expected bad performance), and finally one where everything is left unshuffled (I expected good performance).
Instead I found the following:
Shuffled labels: Accuracy = 69.5% (Model predicts every sequence is class 0)
Shuffled sequence symbols: Accuracy = 88%!
Nothing is shuffled: Accuracy = 90%
What do you make of this? All I can think of is that there is little signal to be gained from analyzing the sequences, and maybe most of the signal is from the presence or lack of presence of symbols in the sequence. Maybe RNNs and LSTMs are overkill here?
# Input 1: event type sequences
# Take the event integer sequences, run them through an embedding layer to get float vectors, then run through LSTM
main_input = Input(shape =(max_seq_length,), dtype = 'int32', name = 'main_input')
x = Embedding(output_dim = embedding_length, input_dim = num_unique_event_symbols, input_length = max_seq_length, mask_zero=True)(main_input)
lstm_out = LSTM(32)(x)
# Auxiliary loss here from first input
auxiliary_output = Dense(1, activation='sigmoid', name='aux_output')(lstm_out)
# An arbitrary number of dense, hidden layers here
x = Dense(64, activation='relu')(lstm_out)
# The main output node
main_output = Dense(1, activation='sigmoid', name='main_output')(x)
## Compile and fit the model
model = Model(inputs=[main_input], outputs=[main_output, auxiliary_output])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'], loss_weights=[1., 0.2])
print(model.summary())
np.random.seed(21)
model.fit([train_X1], [train_Y, train_Y], epochs=1, batch_size=200)
Assuming you've played around with the size of the LSTM, your conclusion seems reasonable. Beyond that, it's hard to say as it depends what the dataset is. For example, it could be that shorter sequences are more unpredictable, and if most of your sequences are short, then this would support the conclusion as well.
It's also worth trying to truncate your sequences, say to the first 25 entries.
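One way to test the hypothesis that most of the signal comes from which symbols are present, not from their order, would be an order-free baseline: turn each sequence into a vector of symbol counts and train a simple classifier on that. The sketch below only shows the feature step. The helper name is made up, and it assumes symbols are integers 1..num_symbols with 0 used for padding.

```python
from collections import Counter
import random

def bag_of_symbols(sequence, num_symbols):
    """Count how often each symbol occurs, ignoring order and padding zeros."""
    counts = Counter(s for s in sequence if s != 0)
    return [counts.get(sym, 0) for sym in range(1, num_symbols + 1)]

seq = [5, 3, 5, 9, 0, 0]              # toy post-padded sequence
shuffled = seq[:]
random.Random(0).shuffle(shuffled)    # reorder the events

features = bag_of_symbols(seq, num_symbols=10)
print(features)
print(features == bag_of_symbols(shuffled, num_symbols=10))
```

If a model trained on such count vectors also reaches an accuracy close to 88%, that would support the conclusion that the sequence order adds only a couple of points.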
Can someone help me understand this problem a bit better? I must train a neural network which should output 200 mutually independent categories, where each category is a percentage ranging from 0 to 1. This seems to me like a binary_crossentropy problem, but every example I see on the internet uses binary_crossentropy with a single output. Since my output should be 200, would applying binary_crossentropy be correct?
This is what I have in mind, is that a correct approach or should I change it?
inputs = Input(shape=(input_shape,))
hidden = Dense(2048, activation='relu')(inputs)
hidden = Dense(2048, activation='relu')(hidden)
output = Dense(200, name='output_cat', activation='sigmoid')(hidden)
model = Model(inputs=inputs, outputs=[output])
loss_map = {'output_cat': 'binary_crossentropy'}
model.compile(loss=loss_map, optimizer="sgd", metrics=['mae', 'accuracy'])
To optimize for multiple independent binary classification problems (and not multiple category problem where you can use categorical_crossentropy) using Keras, you could do the following (here I take the example of 2 independent binary outputs, but you can extend that as much as needed):
inputs = Input(shape=(input_shape,))
hidden = Dense(2048, activation='relu')(inputs)
hidden = Dense(2048, activation='relu')(hidden)
output = Dense(units = 2, activation='sigmoid')(hidden )
here you split your output using Keras's Lambda layer:
output_1 = Lambda(lambda x: x[...,:1])(output)
output_2 = Lambda(lambda x: x[...,1:])(output)
adad = optimizers.Adadelta()
your model output becomes a list of the different independent outputs
model = Model(inputs, [output_1, output_2])
you compile the model using one loss function for each output, in a list. (In fact, if you give only one kind of loss function, I believe it will apply it to all the outputs independently)
model.compile(optimizer=adad, loss=['binary_crossentropy','binary_crossentropy'])
I know this is an old question, but I believe the accepted answer is incorrect and the most upvoted answer is workable but not optimal. The original poster's method is the correct way to solve this problem. His output is 200 independent probabilities from 0 to 1, so his output layer should be a dense layer with 200 neurons and a sigmoid activation layer. It's not a categorical_crossentropy problem because it's not 200 mutually exclusive categories. Also, there's no reason to split the output using a lambda layer when a single dense layer will do. The original poster's method is correct. Here's another way to do it using the Keras interface.
model = Sequential()
model.add(Dense(2048, input_dim=n_input, activation='relu'))
model.add(Dense(2048, activation='relu'))
model.add(Dense(200, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
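For intuition about what binary_crossentropy does with a multi-unit sigmoid output: each output unit gets its own binary cross-entropy term, and the terms are averaged over the outputs. The sketch below reproduces that computation for one sample with 3 outputs instead of 200; the numbers are made up, and fractional targets are used because the targets in the question are values between 0 and 1.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of the per-output binary cross-entropy terms for one sample."""
    losses = []
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)     # clip to avoid log(0)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))
    return sum(losses) / len(losses)

y_true = [0.0, 1.0, 0.25]   # independent targets, each between 0 and 1
y_pred = [0.1, 0.8, 0.25]   # sigmoid outputs, which need not sum to 1
loss = binary_crossentropy(y_true, y_pred)
print(loss)
```

Each term is smallest when the prediction equals its own target, independently of the other outputs, which is why a single Dense(200, activation='sigmoid') layer is enough.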
binary_crossentropy with Sigmoid activation function is used for binary (positive and negative) classification, whereas your case is multi-class classification. In the case of multi-class classification, categorical_crossentropy with softmax activation is used. The Sigmoid activation function generates the probability of input being positive class, and SoftMax generates probability corresponding to input being in each class. The class with maximum probability is assigned to the input.
For multiple category classification problems, you should use categorical_crossentropy rather than binary_crossentropy. With this, when your model classifies an input, it is going to give a distribution of probabilities across all 200 categories. The category that receives the highest probability will be the output for that particular input.
You can see this when you call model.predict(). If you were to call this function only on one input, for example, and print the results, you will see a result of 200 percentages (in total summing to 1). The hope is that one of those 200 percentages would be vastly higher than the others, which signals that the model thinks that there is a strong probability that this is the correct output (category) for this particular input.
This video may help clarify the prediction piece. Printing out the predictions starts around 3:17, but to get the full context, you'll need to start from the beginning.
When there are multiple classes, categorical_crossentropy should be used. Refer to another answer here.