Augmenting on the fly: running out of data and validation accuracy=0.5 - data-augmentation

My validation accuracy is stuck at 50% while my training accuracy converges to 100%. The catch is that I have very little data: 46 images in the training set and 12 in the validation set.
Therefore, I am augmenting my data while training, but I am running out of data too early, and I saw from previous answers that I should specify steps_per_epoch.
However, using steps_per_epoch=46/batch_size does not give me many iterations (a maximum of 10, and only if I use a very low batch size).
I assume data augmentation is not being applied? How can I be sure my data is indeed being augmented? Below is my data augmentation code:
gen = ImageDataGenerator(rotation_range=180,
                         horizontal_flip=True,
                         vertical_flip=True)
train_batches = gen.flow(
    x=x_train,
    y=Y_train,
    batch_size=5,
    subset=None,
    shuffle=True
)
val_batches = gen.flow(
    x=x_val,
    y=Y_val,
    batch_size=3,
    subset=None,
    shuffle=True
)
history = model.fit(
    train_batches,
    batch_size=32,
    # steps_per_epoch=len(x_train)/batch_size,
    epochs=50,
    verbose=2,
    validation_data=val_batches,
    validation_steps=len(x_val)/batch_size)
I would really appreciate your help!

I think the mistake is not in your code.
You have a very small dataset, you are using only two augmentations, and (I assume) you initialize your model with random weights. Your model overfits, as expected.
Here are a couple of ideas that may help you:
Add more augmentations. Vertical and horizontal flips alone are just not enough with such a small dataset. Think about crops, rotations, color changes, etc.; see the sketch after this list. By the way, here is a good tutorial on image augmentation where you'll find more ideas on what types of data augmentation you can use for your task: https://notrocketscience.blog/complete-guide-to-data-augmentation-for-computer-vision/
Transfer learning is a must for small datasets. If you are using a popular/default architecture, PyTorch and TensorFlow let you load model weights pretrained on ImageNet, for instance. If your architecture is custom, download an open-source dataset (ideally one similar to your task) and pretrain the model on that data.
Appropriate validation. Consider n-fold cross-validation, because a fixed train/test split is not a good idea for small datasets. Your validation accuracy may be low by chance (for instance, all the "hard" images ended up in the test set), not because the model is bad.
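For the first point, a richer augmentation pipeline with Keras' ImageDataGenerator could look like the sketch below. This is a minimal illustration rather than your exact setup: the extra shift/zoom arguments and the separate, non-augmenting validation generator are my assumptions.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augment only the training data; the validation data is left untouched.
train_gen = ImageDataGenerator(rotation_range=180,
                               horizontal_flip=True,
                               vertical_flip=True,
                               width_shift_range=0.1,   # random horizontal shifts
                               height_shift_range=0.1,  # random vertical shifts
                               zoom_range=0.2)          # random zoom in/out
val_gen = ImageDataGenerator()  # no augmentation for validation

train_batches = train_gen.flow(x_train, Y_train, batch_size=5, shuffle=True)
val_batches = val_gen.flow(x_val, Y_val, batch_size=3, shuffle=False)
Note that flow() applies a fresh random transform every time a batch is drawn, so each epoch sees different augmented versions of the same 46 originals; the augmentation is happening even though the underlying image count stays small.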
Let me know if it helps!

Related

Validation loss and accuracy have a lot of 'jumps'

Hello everyone, so I made this CNN model.
My data:
Train folder -> 30 classes -> 800 images each -> 24000 altogether
Validation folder -> 30 classes -> 100 images each -> 3000 altogether
Test folder -> 30 classes -> 100 images each -> 3000 altogether
- I've applied data augmentation (on the train data).
- I have 5 conv layers with filters 32->64->128->128->128,
each with max pooling and batch normalization.
- I added dropout of 0.5 after the flatten layer.
The training part looks good. The validation part has a lot of 'jumps', though. Does it overfit?
Is there any way to fix this and make the validation part more stable?
Note: I plan to increase the epochs for my final model; I'm just experimenting to see what works best, since the model takes a lot of time to train. So for now I train with 20 epochs.
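For reference, the architecture described above corresponds roughly to the Keras sketch below; the input shape, kernel sizes, and softmax output are assumptions on my part, since the original code is not shown.
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(input_shape=(224, 224, 3), n_classes=30):  # input shape is assumed
    model = keras.Sequential()
    model.add(keras.Input(shape=input_shape))
    # 5 conv blocks with 32 -> 64 -> 128 -> 128 -> 128 filters,
    # each followed by max pooling and batch normalization.
    for filters in (32, 64, 128, 128, 128):
        model.add(layers.Conv2D(filters, (3, 3), activation='relu', padding='same'))
        model.add(layers.MaxPooling2D((2, 2)))
        model.add(layers.BatchNormalization())
    model.add(layers.Flatten())
    model.add(layers.Dropout(0.5))  # dropout after flattening, as described
    model.add(layers.Dense(n_classes, activation='softmax'))
    return model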
I've applied data augmentation (on the train data).
What does this mean? What kind of data did you add and how much? You might think I'm nitpicking, but if the distribution of the augmented data is different enough from the original data, then this will indeed cause your model to generalize poorly to the validation set.
Increasing your epochs isn't going to help here; your training loss is already decreasing reasonably. Training your model for longer is a good step if the validation loss is also decreasing nicely, but that's obviously not the case.
Some things I would personally try (a short sketch follows the list):
Try decreasing the learning rate.
Try training the model without the augmented data and see how the validation loss behaves.
Try splitting the augmented data so that it's also contained in the validation set and see how the model behaves.
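For the first suggestion, lowering the learning rate is a one-line change at compile time. A minimal sketch, assuming an Adam optimizer and a 30-class softmax output (both assumptions on my part):
from tensorflow import keras

# Assumed: `model` is the CNN described in the question.
optimizer = keras.optimizers.Adam(learning_rate=1e-4)  # lower than the default 1e-3
model.compile(loss='categorical_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])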
Train part looks good. Validation part has a lot of 'jumps' though. Does it overfit?
The answer is yes. The so-called 'jumps' in the validation curve may indicate that the model is not generalizing well to the validation data, and therefore your model might be overfitting.
Is there any way to fix this and make validation part more stable?
To fix this, you can try the following (see the sketch after this list):
Increase the size of your training set
Use regularization techniques
Use early stopping
Reduce the complexity of your model
Tune hyperparameters such as the learning rate
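As an illustration of the early-stopping item, here is a minimal sketch; the patience value is arbitrary, and model/train_batches/val_batches stand for your existing objects:
from tensorflow import keras

# Stop training when the validation loss stops improving and restore the best weights.
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss',
                                           patience=5,
                                           restore_best_weights=True)

history = model.fit(train_batches,
                    validation_data=val_batches,
                    epochs=20,
                    callbacks=[early_stop])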

Why is my validation loss smaller than the training loss?

I am consistently getting a higher training loss than validation loss while training a deep convolutional autoencoder. Notice that in my train data generator, I am doing data augmentation with Keras' zoom_range. If I raise the zoom range, e.g. to [0.8, 4] or [0.8, 6], the gap between training and validation loss keeps increasing.
Is it because the training loss is calculated on augmented data? Presumably more augmentation makes it harder for the model to predict (reconstruct) the input image. Or is something wrong with my training method? I have attached the code snippet for the training command as well.
checkpoint = ModelCheckpoint(model_save_dir, monitor='val_loss', save_best_only=False, mode='min')
callbacks_list = [checkpoint]
history = model.fit(train_generator, validation_data=val_generator, epochs=n_epochs, shuffle=True, callbacks=callbacks_list)
It looks like your training loss increases as you increase the data augmentation effect, and basically this is because it becomes harder for the model to learn the pattern with too much data augmentation.
In my view, the goal of data augmentation is to make realistic changes to the initial data to improve the model's robustness, like a regularization technique.
However, the validation loss stays the same, so I presume the efficiency of the learning phase is not impaired much. I would make sure that the distribution of the labels is homogeneous and that the train/val split is stratified. I would also make a test set (without any data augmentation, like the validation set) to make the comparison more meaningful.
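One way to check whether the gap comes purely from the augmentation is to evaluate the trained model on the training images without any augmentation and compare that number with the validation loss. A minimal sketch; x_train and the batch size are assumptions, and you should mirror whatever rescaling your other generators use:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Same preprocessing as the validation generator, but no zoom/augmentation.
clean_gen = ImageDataGenerator()  # add your usual rescale/preprocessing arguments here
clean_train = clean_gen.flow(x_train, x_train, batch_size=32, shuffle=False)

# Loss on un-augmented training data is directly comparable to the validation loss.
clean_train_loss = model.evaluate(clean_train, verbose=0)
print('train loss without augmentation:', clean_train_loss)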

Training dataset repeatedly - Keras

I am doing an image classification task using Keras.
I used the VGG16 architecture since I thought it would be easier; the task is to classify whether MRI images contain a tumor or not.
As usual, I read all the images, resized them to the same shape (224×224×3), and normalised them by dividing by 255. Then I did a train/test split: the test dataset is 25% and the training dataset is 75%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
Then I trained and got a val_loss of 0.64 and a val_accuracy of 0.7261.
I saved the trained model to my Google Drive.
The next day, I followed the same procedure to improve the model's performance by loading the saved model.
I didn't change the model architecture; I simply loaded the saved model that scored 0.7261 accuracy.
This time I got better performance: the val_loss is 0.58 and the val_accuracy is 0.7976.
I wondered how it got higher accuracy. Then I found that when splitting the dataset, the images are split at random, so some of the test data from the 1st training session became training data in the 2nd training session. So the model had already seen those images and predicted them well in the 2nd session.
I need to clarify: does this model truly learn the tumor patterns, or is it effectively being trained and tested on the same image samples?
Thanks
When using train_test_split and validating in different sessions, always set your random seed. Otherwise, you will be using different splits and leaking data, like you stated. The model is not "learning" more; rather, it is being validated on data that it has already trained on. You will likely get worse real-world performance.
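Concretely, fixing the random_state argument makes the split reproducible across sessions; a minimal sketch (the seed value 42 is arbitrary):
from sklearn.model_selection import train_test_split

# A fixed random_state gives the same split in every session, so test images
# never leak into the training set when you resume training later.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)  # stratify keeps class balance (optional)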

What could be reasons for high MAE and MSE in Keras?

My MAE and MSE are quite high, even though the training data (not including the 20% test data) has 1030 instances with 23 features, i.e. shape (1030, 23), after applying IQR and Z-score outlier filtering. By the way, all the categorical columns have been fully encoded.
Epoch: 1900, loss:50195632.3010, mae:3622.3535, mse:50195636.0000, val_loss:65308249.2427, val_mae:4636.2290, val_mse:65308244.0000,
Below is my setting for Keras.
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=[len(dftrain.keys())]),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
])
optimizer = tf.keras.optimizers.RMSprop(0.001)
model.compile(loss='mse',
              optimizer=optimizer,
              metrics=['mae', 'mse'])
EPOCHS = 2000
history = model.fit(
    normed_train_data,
    train_labels,
    epochs=EPOCHS,
    validation_split=0.2,
    verbose=0,
    callbacks=[tfdocs.modeling.EpochDots()])
What do you think?
"High" MAE itself is relative and varies according to the data and there could be multiple factors contributing towards it.
If you are getting started, I d recommend you to perform Exploratory Data
Analysis (EDA) and come up with features and also prepare that data for training.
Once you verify the data, try tuning the parameters of the model to suit your usecase. ML is more about experimenting than about coding.
Notebooks like these in Kaggle will help you get started.
Neural Network Model for House Prices
Comprehensive data exploration with Python
There could be many reasons, actually. My quick guess would be your dataset, i.e. the data used for training. Is it compatible with the model's expectations (shapes, formats, etc.)? For instance, in text classification, are the texts encoded before being fed to the model?
Are the labels correctly transformed into what the neural network expects?
If yes, the rest comes down to your network definition: are you using the right loss function, layers, etc.?
Try a basic model architecture for your problem; such a baseline can be taken from implementations of similar problems found on the internet. This will give you a good starting point.
The other answers have already mentioned some good points, but another thing you can do is normalize your data if you haven't already. NNs are highly sensitive to this. Some methods you can try here are batch normalization, a standard scaler, or a min-max scaler.
Also, if your model is overfitting (training loss decreasing, but not the validation loss), consider adding regularization in the form of Dropout between your layers and see if it improves.
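For instance, scaling the features before training takes only a few lines with scikit-learn. This is a generic sketch; dftrain comes from your snippet, while dftest and the use of StandardScaler are assumptions on my part:
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training features only, then apply it to both splits,
# so no information from the test set leaks into training.
scaler = StandardScaler()
normed_train_data = scaler.fit_transform(dftrain)
normed_test_data = scaler.transform(dftest)  # dftest is assumed to exist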
These links might be helpful:
link1
link2

Is my training data set too complex for my neural network?

I am new to machine learning and Stack Overflow, and I am trying to interpret two graphs from my regression model.
[Plot: training error and validation error from my machine learning model]
My case is similar to this question: Very large loss values when training multiple regression model in Keras, but my MSE and RMSE are very high.
Is my model underfitting? If yes, what can I do to solve this problem?
Here is the neural network I used for solving the regression problem:
def build_model():
    model = keras.Sequential([
        layers.Dense(128, activation=tf.nn.relu, input_shape=[len(train_dataset.keys())]),
        layers.Dense(64, activation=tf.nn.relu),
        layers.Dense(1)
    ])
    optimizer = tf.keras.optimizers.RMSprop(0.001)
    model.compile(loss='mean_squared_error',
                  optimizer=optimizer,
                  metrics=['mean_absolute_error', 'mean_squared_error'])
    return model
And my dataset: I have 500 samples, 10 features, and 1 target.
Quite the opposite: it looks like your model is overfitting. When you have low error rates on your training set, it means that your model has learned the data well and can infer results from it accurately. If your validation error is high afterwards, however, the information learned from the training data is not successfully being applied to new data. This is because your model has 'fit' your training data too closely, and has only learned how to predict well on that data.
To solve this, we can introduce common techniques for reducing overfitting. A very common one is to use Dropout layers. Dropout randomly removes some of the nodes so that the model cannot rely on them too heavily, therefore reducing dependency on those nodes and 'learning' more using the other nodes too. I've included an example that you can test below; try playing with the value and other techniques to see what works best. And as a side note: are you sure you need that many nodes in your dense layers? That seems like quite a lot for your dataset, and it may be contributing to the overfitting as well.
def build_model():
    model = keras.Sequential([
        layers.Dense(128, activation=tf.nn.relu, input_shape=[len(train_dataset.keys())]),
        layers.Dropout(0.2),  # randomly zeroes 20% of the activations during training
        layers.Dense(64, activation=tf.nn.relu),
        layers.Dense(1)
    ])
    optimizer = tf.keras.optimizers.RMSprop(0.001)
    model.compile(loss='mean_squared_error',
                  optimizer=optimizer,
                  metrics=['mean_absolute_error', 'mean_squared_error'])
    return model
Well, I think your model is overfitting.
There are several things that can help you:
1. Reduce the network's capacity, which you can do by removing layers or reducing the number of units in the hidden layers.
2. Add Dropout layers, which randomly remove certain features by setting them to zero.
3. Add regularization.
To give a brief explanation of these:
- Reduce the network's capacity:
Some models have a large number of trainable parameters. The higher this number, the more easily the model can memorize the target class for each training sample. Obviously, this is not ideal for generalizing to new data. By lowering the capacity of the network, it is forced to learn the patterns that matter, i.e. those that minimize the loss. But remember, reducing the network's capacity too much will lead to underfitting.
- Regularization:
This page can help you a lot, and a short sketch follows after this list:
https://towardsdatascience.com/handling-overfitting-in-deep-learning-models-c760ee047c6e
- Dropout layers:
You can add a layer like this:
model.add(layers.Dropout(0.5))
This is a dropout layer with a 50% chance of setting inputs to zero.
For more details you can see this page:
https://machinelearningmastery.com/how-to-reduce-overfitting-with-dropout-regularization-in-keras/
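As a concrete illustration of the regularization point above, kernel regularization in Keras is a per-layer argument. This is a minimal sketch; the L2 factor of 0.01 and the layer sizes are arbitrary example values, not tuned for this problem:
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    # The L2 penalty discourages large weights and thus reduces overfitting.
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(0.01),
                 input_shape=[10]),  # 10 features, as in the question
    layers.Dropout(0.5),             # dropout, as suggested above
    layers.Dense(32, activation='relu',
                 kernel_regularizer=regularizers.l2(0.01)),
    layers.Dense(1)
])
model.compile(loss='mean_squared_error', optimizer='rmsprop',
              metrics=['mean_absolute_error'])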
As mentioned in the existing answer by #omoshiroiii, your model does in fact seem to be overfitting; that's why the RMSE and MSE are so high.
Your model has learned the detail and noise in the training data to the extent that it now negatively impacts the model's performance on new data.
One solution is therefore to randomly drop some of the nodes (dropout) so that the model cannot rely on them too heavily.