Is my training data set too complex for my neural network? - tensorflow

I am new to machine learning and Stack Overflow, and I am trying to interpret two graphs from my regression model:
Training error and Validation error from my machine learning model
My case is similar to this question: Very large loss values when training multiple regression model in Keras, but my MSE and RMSE are very high.
Is my model underfitting? If yes, what can I do to solve this problem?
Here is the neural network I used for solving the regression problem:
def build_model():
    model = keras.Sequential([
        layers.Dense(128, activation=tf.nn.relu, input_shape=[len(train_dataset.keys())]),
        layers.Dense(64, activation=tf.nn.relu),
        layers.Dense(1)
    ])
    optimizer = tf.keras.optimizers.RMSprop(0.001)
    model.compile(loss='mean_squared_error',
                  optimizer=optimizer,
                  metrics=['mean_absolute_error', 'mean_squared_error'])
    return model
And my data set:
I have 500 samples, 10 features and 1 target

Quite the opposite: it looks like your model is over-fitting. When you have low error rates on your training set, it means that your model has learned the data well and can infer results accurately. If your validation error is high afterwards, however, the information learned from your training data is not successfully being applied to new data. This is because your model has 'fit' your training data too closely, and has only learned how to predict well when its input comes from that data.
To solve this, we can introduce common solutions for reducing over-fitting. A very common technique is to use Dropout layers. This randomly drops some of the nodes so that the model cannot rely on them too heavily, therefore reducing dependency on those nodes and 'learning' more through the other nodes as well. I've included an example that you can test below; try playing with the value and other techniques to see what works best. And as a side note: are you sure you need that many nodes in your dense layers? That seems like quite a lot for your data set, and it may be contributing to the over-fitting as well.
def build_model():
    model = keras.Sequential([
        layers.Dense(128, activation=tf.nn.relu, input_shape=[len(train_dataset.keys())]),
        layers.Dropout(0.2),
        layers.Dense(64, activation=tf.nn.relu),
        layers.Dense(1)
    ])
    optimizer = tf.keras.optimizers.RMSprop(0.001)
    model.compile(loss='mean_squared_error',
                  optimizer=optimizer,
                  metrics=['mean_absolute_error', 'mean_squared_error'])
    return model

Well, I think your model is overfitting.
There are several approaches that can help you:
1. Reduce the network's capacity, which you can do by removing layers or reducing the number of units in the hidden layers.
2. Dropout layers, which will randomly remove certain features by setting them to zero.
3. Regularization.
A brief explanation of each:
-Reduce the network's capacity:
Some models have a large number of trainable parameters. The higher this number, the more easily the model can memorize the target for each training sample, which is obviously not ideal for generalizing to new data. By lowering the capacity of the network, you force it to learn the patterns that matter, i.e. those that actually minimize the loss. But remember: reducing the network's capacity too much will lead to underfitting.
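For example, a lower-capacity version of the model from the question might look like the sketch below (the layer sizes are only illustrative starting points, not tuned values):
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_small_model(n_features):
    # Fewer and smaller hidden layers than the original 128/64 network,
    # so there are fewer parameters available to memorize the training set.
    model = keras.Sequential([
        layers.Dense(32, activation='relu', input_shape=[n_features]),
        layers.Dense(1)
    ])
    model.compile(loss='mean_squared_error',
                  optimizer=tf.keras.optimizers.RMSprop(0.001),
                  metrics=['mean_absolute_error', 'mean_squared_error'])
    return model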
-Regularization:
This page can help you a lot:
https://towardsdatascience.com/handling-overfitting-in-deep-learning-models-c760ee047c6e
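For example, weight regularization can be added directly to Dense layers via kernel_regularizer; the sketch below assumes the 10-feature regression setup from the question, and the 0.01 penalty is only a placeholder value to tune:
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    # l2(0.01) adds 0.01 * sum(w**2) of this layer's weights to the loss,
    # which penalizes large weights and so discourages overfitting.
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(0.01),
                 input_shape=[10]),
    layers.Dense(1)
])
model.compile(loss='mean_squared_error', optimizer='rmsprop')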
-Dropout layer:
You can use a layer like this:
model.add(layers.Dropout(0.5))
This is a dropout layer with a 50% chance of setting inputs to zero.
For more details you can see this page:
https://machinelearningmastery.com/how-to-reduce-overfitting-with-dropout-regularization-in-keras/

As mentioned in the existing answer by @omoshiroiii, your model does in fact seem to be overfitting; that is why RMSE and MSE are so high.
Your model has learned the detail and noise in the training data to the extent that it now negatively impacts the model's performance on new data.
The solution is therefore to randomly drop some of the nodes (e.g. with dropout) so that the model cannot correlate with them too heavily.

Related

Is a validation curve slightly greater or lower than the training curve good in CNN models?

Can you tell me which of the two is a good validation vs. train plot?
Both of them are trained with the same Keras sequential layers, but the second one is trained using a larger number of samples, i.e. the dataset was augmented.
I'm a little bit confused about the zigzags in the first plot; otherwise I think it is better than the second.
In the second plot there are no zigzags, but the validation accuracy tends to be a little higher than train. Is that overfitting, or is it acceptable?
It is an image detection model where the first model's dataset size is 5170 samples and the second has 9743 samples.
The convolutional layers defined for building the model:
tf.keras.layers.Conv2D(128,(3,3), activation = 'relu', input_shape = (150,150,3)),
tf.keras.layers.MaxPool2D(2,2),
tf.keras.layers.Conv2D(64,(3,3), activation = 'relu'),
tf.keras.layers.MaxPool2D(2,2),
tf.keras.layers.Conv2D(32,(3,3), activation = 'relu'),
tf.keras.layers.MaxPool2D(2,2),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(512,activation='relu'),
tf.keras.layers.Dropout(0.5),
tf.keras.layers.Dense(128,activation='relu'),
tf.keras.layers.Dropout(0.25),
tf.keras.layers.Dense(1,activation='sigmoid')
Can the model be improved?
From the graphs, the second one, where you have more samples, is better. The reason is that with more samples the model is trained on a much wider probability distribution of images, so when validation is run you have a better chance of correctly classifying an image. You have a lot of dropout in your model. This is good for preventing overfitting, but it will lower the training accuracy relative to the validation accuracy. Your model seems to be doing well. It might improve if you add additional convolution/max-pooling layers. The alternative, of course, is to use transfer learning; I would recommend EfficientNetB3. I also recommend using an adjustable learning rate. The Keras callback ReduceLROnPlateau works well for that purpose (documentation is here). The code below shows my recommended settings.
rlronp = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=2,
    verbose=1,
    mode='auto'
)
In model.fit, include callbacks=[rlronp].
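To illustrate the transfer-learning suggestion, here is a minimal sketch using EfficientNetB3 together with the ReduceLROnPlateau callback above. It assumes 150x150x3 inputs and a binary (sigmoid) output as in the question; train_data and val_data are hypothetical dataset objects, and the head sizes are only starting values:
import tensorflow as tf

base = tf.keras.applications.EfficientNetB3(include_top=False,
                                            weights='imagenet',
                                            input_shape=(150, 150, 3),
                                            pooling='max')
base.trainable = False  # freeze the pretrained backbone to begin with

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# train_data / val_data are placeholders for your own generators or tf.data datasets
# history = model.fit(train_data, validation_data=val_data, epochs=20, callbacks=[rlronp])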

Should I delete the last 7 layers of VGG16, as I am going to use it as a pretrained model for a signature verification task?

As far as I know, a CNN's last layers identify objects as a whole, which is irrelevant to a dataset of signatures. Thus, I want to remove them and add additional layers on top of the model, freezing the VGG16 from training. How would the removal of layers potentially affect the model's performance, or should I just leave them and delete only the dense layers?
I need to add additional layers on top anyway for a school report about the effect of convolutional layers' configurations on the model's performance.
P.S. My dataset is really small: it contains nearly 700 samples, which is extremely small, I know (I tried augmenting the data).
I have a dataset with Chinese signatures, but I thought it is better to train on it separately.
I am not proficient in this field and started my acquaintance with deep learning recently, so please correct me if you notice any misconception in my explanation.
The easiest way is to use VGG with include_top=False, weights='imagenet', and pooling='max'. This will instantiate the model with imagenet weights, the top classification layers removed, and the output of the VGG model as a flat vector that you can feed directly into a dense layer. My typical code for this is shown below. In the final layer, class_count is the number of classes in the training data.
base_model = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                         input_shape=img_shape, pooling='max')
x = base_model.output
x = keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001)(x)
x = Dense(256, kernel_regularizer=regularizers.l2(l=0.016),
          activity_regularizer=regularizers.l1(0.006),
          bias_regularizer=regularizers.l1(0.006), activation='relu')(x)
x = Dropout(rate=.45, seed=123)(x)
output = Dense(class_count, activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=output)
How would the removal of layers potentially affect the model's performance, or should I just leave them and delete only the dense layers?
This is hard to answer, because what performance are you talking about? VGG16 was originally built for the ImageNet problem with 1000 classes, so if you use it without any modification it probably won't work at all.
Now, if you are talking about transfer learning, then yes, the last dense layers can be replaced to classify your dataset, because the stack of CNN layers in VGG16 is a good pattern recognizer. The fully connected layers at the end work as a classifier for these patterns, and you should replace them and train them again for your specific problem. VGG16 has 3 dense layers (FC1, FC2 and FC3) at the end; Keras only allows you to remove all three, so if you want to replace just the last one, you will need to remove all three and rebuild FC1 and FC2.
The key is what you are going to train after that. You could:
Use the original (imagenet) weights in the CNN layers and start your training from there, just fine-tuning with a small learning rate. A good choice when your dataset is similar to the original and you have a good amount of it.
Use the original (imagenet) weights in the CNN layers but freeze them, and train only the weights in the dense layers you replaced. A good choice when your dataset is small; a sketch of this option follows below.
Don't use the original weights and retrain the whole model. This is usually not a good choice, because you would need to be an expert at tuning the parameters, plus tons of data and computational power, to make it work.
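A minimal sketch of the second option (imagenet weights, frozen convolutional layers, new dense head); img_shape and class_count are placeholders here, as in the earlier answer:
import tensorflow as tf
from tensorflow.keras import layers, Model

img_shape = (224, 224, 3)   # placeholder input size
class_count = 2             # placeholder number of classes

base_model = tf.keras.applications.VGG16(include_top=False, weights='imagenet',
                                         input_shape=img_shape, pooling='max')
base_model.trainable = False  # freeze the convolutional layers

x = layers.Dense(256, activation='relu')(base_model.output)
x = layers.Dropout(0.4)(x)
output = layers.Dense(class_count, activation='softmax')(x)

model = Model(inputs=base_model.input, outputs=output)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss='categorical_crossentropy',
              metrics=['accuracy'])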

Augmenting on the fly: running out of data and validation accuracy=0.5

My validation accuracy is stuck at 50% while my training accuracy manages to converge to 100%. The pitfall is that I have very little data: 46 images in the train set and 12 in the validation set.
Therefore, I am augmenting my data while training, but I am running out of data too early, and as I saw from previous answers, I should specify steps_per_epoch.
However, using steps_per_epoch=46/batch_size does not give many iterations (a maximum of 10 if I specify a very low batch size).
I assume data augmentation is not being applied? How can I be sure my data is indeed being augmented? Below is my data augmentation code:
gen = ImageDataGenerator(rotation_range=180,
                         horizontal_flip=True,
                         vertical_flip=True)
train_batches = gen.flow(x=x_train,
                         y=Y_train,
                         batch_size=5,
                         subset=None,
                         shuffle=True)
val_batches = gen.flow(x=x_val,
                       y=Y_val,
                       batch_size=3,
                       subset=None,
                       shuffle=True)
history = model.fit(train_batches,
                    batch_size=32,
                    # steps_per_epoch=len(x_train)/batch_size,
                    epochs=50,
                    verbose=2,
                    validation_data=val_batches,
                    validation_steps=len(x_val)/batch_size)
I will really appreciate your help!
I think the mistake is not in your code.
You have a very small dataset, you are using only 2 augmentations, and (I assume) you initialize your model with random weights. Your model expectedly overfits.
Here are a couple of ideas that may help you:
Add more augmentations. Vertical and horizontal flips are just not enough (with your small dataset). Think about crops, rotations, color changes, etc.; see the sketch after this list. BTW, here is a good tutorial on image augmentation where you'll find more ideas on what types of data augmentation you can use for your task: https://notrocketscience.blog/complete-guide-to-data-augmentation-for-computer-vision/
Transfer learning is a must-do for small datasets. If you are using a popular/default architecture, PyTorch and TensorFlow allow you to load model weights trained on ImageNet, for instance. If your architecture is custom, download some open-source dataset (preferably one similar to your task) and pretrain the model with that data.
Appropriate validation. Consider n-fold cross-validation, because a fixed train and test split is not a good idea for small datasets. Your validation accuracy may be low by chance (for instance, all the 'hard' images ended up in the test set), not because the model is bad.
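For the first point, here is a sketch of a richer training-time ImageDataGenerator, reusing x_train/Y_train and x_val/Y_val from your code (the parameter values are just starting points to experiment with; validation data is usually left unaugmented):
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# More augmentations than just flips for the training set.
train_gen = ImageDataGenerator(rotation_range=40,
                               width_shift_range=0.2,
                               height_shift_range=0.2,
                               zoom_range=0.2,
                               brightness_range=(0.8, 1.2),
                               horizontal_flip=True,
                               vertical_flip=True)

# Validation data is typically not augmented.
val_gen = ImageDataGenerator()

train_batches = train_gen.flow(x=x_train, y=Y_train, batch_size=5, shuffle=True)
val_batches = val_gen.flow(x=x_val, y=Y_val, batch_size=3, shuffle=False)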
Let me know if it helps!

What could be reasons for high MAE and MSE in Keras?

My MAE and MSE are quite high, even though the training data (not including the 20% test data) has (1030, 23) instances (after applying IQR and Z-score filtering). By the way, all the categorical columns have been fully encoded.
Epoch: 1900, loss:50195632.3010, mae:3622.3535, mse:50195636.0000, val_loss:65308249.2427, val_mae:4636.2290, val_mse:65308244.0000,
Below is my Keras setup.
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=[len(dftrain.keys())]),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
])
optimizer = tf.keras.optimizers.RMSprop(0.001)
model.compile(loss='mse',
              optimizer=optimizer,
              metrics=['mae', 'mse'])
EPOCHS = 2000
history = model.fit(
    normed_train_data,
    train_labels,
    epochs=EPOCHS,
    validation_split=0.2,
    verbose=0,
    callbacks=[tfdocs.modeling.EpochDots()])
What do you think?
A "high" MAE is itself relative and varies with the data, and multiple factors could be contributing to it.
If you are getting started, I'd recommend performing Exploratory Data Analysis (EDA), coming up with features, and preparing that data for training.
Once you have verified the data, try tuning the parameters of the model to suit your use case. ML is more about experimenting than about coding.
Notebooks like these on Kaggle will help you get started:
Neural Network Model for House Prices
Comprehensive data exploration with Python
There could actually be many reasons. My quick guess would be your dataset, specifically the data used for training. Is it compatible with the model's expectations (shapes, formats, etc.)? For example, in text classification, are the texts encoded before being fed to the model?
Are the labels correctly transformed to the neural network's expectations?
If yes, the rest comes down to your network definition: are you using the right loss function, layers, etc.?
Try a basic model architecture for your problem; this basic architecture can be taken from implementations for a similar problem found on the internet. It will give you a good starting point.
The other answers have already mentioned some good points, but another thing you can do is normalize your data if you haven't already. NNs are highly sensitive to this. Some methods you can try here are Batch Normalization, StandardScaler or Min-Max scaling.
Also, if your model is overfitting (training loss decreasing, but not validation loss), consider adding regularization in the form of Dropout between your layers and see if it improves.
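For example, scaling the features with scikit-learn's StandardScaler before training (a sketch only; dftrain is the training frame from the question, and dftest is a hypothetical held-out test frame):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit the scaler on the training features only, then reuse the same
# transform on test data so no information leaks from the test set.
normed_train_data = scaler.fit_transform(dftrain)
normed_test_data = scaler.transform(dftest)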
These links might be helpful:
link1
link2

Text classification issue

I'm a newbie in ML and am trying to classify text into two categories. My dataset was made with Tokenizer from medical texts; it's unbalanced, and there are 572 records for training and 471 for testing.
It's really hard for me to get a model with diverse prediction output: almost all values are the same. I've tried using models from examples like this one and tweaking the parameters myself, but the output always makes no sense.
Here are the tokenized and prepared data.
Here is the script: Gist
A sample model that I used:
sequential_model = keras.Sequential([
    layers.Dense(15, activation='tanh', input_dim=vocab_size),
    layers.BatchNormalization(),
    layers.Dense(8, activation='relu'),
    layers.BatchNormalization(),
    layers.Dense(1, activation='sigmoid')
])
sequential_model.summary()
sequential_model.compile(optimizer='adam',
                         loss='binary_crossentropy',
                         metrics=['acc'])
train_history = sequential_model.fit(train_data,
                                     train_labels,
                                     epochs=15,
                                     batch_size=16,
                                     validation_data=(test_data, test_labels),
                                     class_weight={1: 1, 0: 0.2},
                                     verbose=1)
Unfortunately, I can't share the datasets.
I also tried using keras.utils.to_categorical with the class labels, but it didn't help.
Your loss curves make sense, as we see the network overfit to the training set while we see the usual bowl-shaped validation curve.
To make your network perform better, you can always deepen it (more layers), widen it (more units per hidden layer) and/or add more nonlinear activation functions for your layers to be able to map to a wider range of values.
Also, I believe the reason why you originally got so many repeated values is the size of your network. Apparently, each of the data points has roughly 20,000 features (a pretty large feature space); the size of your network is too small, and the space of output values it can map to is consequently smaller. I did some testing with larger hidden layers (and bumped up the number of layers) and was able to see that the prediction values did vary: [0.519], [0.41], [0.37]...
It is also understandable that your network performance varies, because the number of features you have is about 50 times the size of your training set (usually you would want a smaller proportion). Keep in mind that training for many epochs (more than 10 or so) on such a small training and test dataset just to see improvements in loss is not great practice, as you can seriously overfit; it is probably a sign that your network needs to be wider/deeper.
All of these factors, such as the number of layers, hidden unit size and even the number of epochs, can be treated as hyperparameters. In other words, hold out some percentage of your training data as a validation split, go through each category of factors one by one, and optimize to get the highest validation accuracy. To be fair, your training set is not that large, but I believe you should hold out some 10-20% of the training data as a sort of validation set to tune these hyperparameters, given that you have such a large number of features per data point. At the end of this process, you should be able to determine your true test accuracy. This is how I would optimize to get the best performance from this network; a short sketch of this setup follows below. Hope this helps.
More about training, test, val split
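As a rough sketch of that tuning setup, assuming the same vocab_size, train_data/train_labels and class weights as in the question (the layer sizes and the split fraction are only illustrative values to tune):
from tensorflow import keras
from tensorflow.keras import layers

wider_model = keras.Sequential([
    layers.Dense(64, activation='relu', input_dim=vocab_size),
    layers.Dropout(0.5),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
wider_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

# Hold out 15% of the training data for hyperparameter tuning and keep
# (test_data, test_labels) untouched for the final evaluation.
history = wider_model.fit(train_data, train_labels,
                          epochs=10, batch_size=16,
                          validation_split=0.15,
                          class_weight={1: 1, 0: 0.2},
                          verbose=1)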