Data augmentation in validation

I am a little bit confused about data augmentation. If I perform data augmentation on the training dataset, should the validation dataset go through the same operations?
For example:
data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}
Why do we use the Resize and CenterCrop operations for the 'val' dataset?

Since validation data is used to measure how good a trained model is, it should not change across different trained models; that is, we should use a fixed measure to evaluate them. This is why the augmentation of the validation data contains no randomness, unlike the training data augmentation.
SIDE NOTE:
Unlike test data, validation data is used to tune the hyperparameters.

I strongly disagree with Yashio Yamauchi's answer. Yes, data augmentation is commonly used on the training dataset to increase the number of samples when the dataset is small. However, there are cases where your validation dataset is also small, so you cannot actually evaluate your model.
For example, say your task is to recognise logos on T-shirts (e.g. the Adidas logo) no matter how they are rotated in an image (e.g. by 90 degrees). Then you will have to use data augmentation to ensure that your model is fed rotated T-shirts. However, if you want to measure how well your model identifies "Adidas" when it is rotated 90 degrees, then you will need images in your validation dataset that contain rotated T-shirts as well.
In such a case, data augmentation could be applied once to the validation dataset, before training happens!
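For illustration, here is a minimal sketch of that idea, assuming a torchvision-style pipeline like the one in the question and a hypothetical list of validation image paths. The rotations are fixed, so the expanded validation set stays deterministic across runs:

# Hypothetical sketch: expand the validation set ONCE with fixed rotations,
# so that the evaluation data stays deterministic across training runs.
from PIL import Image
from torchvision import transforms

base_val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def build_fixed_val_set(image_paths, angles=(0, 90, 180, 270)):
    """Return (tensor, angle) pairs; image_paths is an assumed list of file paths."""
    samples = []
    for path in image_paths:
        img = Image.open(path).convert("RGB")
        for angle in angles:
            rotated = img.rotate(angle, expand=True)  # fixed rotation, no randomness
            samples.append((base_val_transform(rotated), angle))
    return samples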

Related

Why is my validation loss smaller than the training loss?

I am consistently getting a higher training loss than validation loss while training a deep convolutional autoencoder. Note that in my training data generator I am doing data augmentation with the Keras zoom_range. If I raise the zoom range, e.g. [0.8, 4], [0.8, 6], etc., the gap between training and validation loss keeps increasing.
Is it because the training loss is calculated on augmented data? I assume more augmentation makes it harder for the model to predict (reconstruct) the input image. Or is something wrong with my training method? I have attached the code snippet for the training command as well.
checkpoint = ModelCheckpoint(model_save_dir, monitor='val_loss', save_best_only=False, mode='min')
callbacks_list = [checkpoint]
history = model.fit(train_generator, validation_data=val_generator, epochs=n_epochs, shuffle=True, callbacks=callbacks_list)
It looks like your training loss increases as you increase the data augmentation effect, and basically this is because it becomes harder for the model to learn patterns with too much data augmentation.
From my point of view, the goal of data augmentation is to make realistic changes to the initial data in order to improve the model's robustness, like a regularization technique.
However, the validation loss stays the same, so I presume the efficiency of the learning phase is not impaired that much. I would make sure that the distribution of the labels is homogeneous and that the train/val split is stratified. I would also build a test set (without any data augmentation, like the validation set) to make the comparison more meaningful.
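As a rough sketch of those last two suggestions (the names x and y are assumed to be your image arrays and labels, and the zoom range is a placeholder), the split can be stratified and the validation/test generators kept free of augmentation:

# Illustrative sketch only: stratified train/val/test split, with augmentation
# applied to the training generator and plain rescaling everywhere else.
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator

x_train, x_tmp, y_train, y_tmp = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=42)
x_val, x_test, y_val, y_test = train_test_split(
    x_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

train_gen = ImageDataGenerator(rescale=1./255, zoom_range=[0.8, 1.2])  # augmented
eval_gen = ImageDataGenerator(rescale=1./255)                          # no augmentation

train_batches = train_gen.flow(x_train, y_train, batch_size=32)
val_batches = eval_gen.flow(x_val, y_val, batch_size=32, shuffle=False)
test_batches = eval_gen.flow(x_test, y_test, batch_size=32, shuffle=False)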

Augmenting on the fly: running out of data and validation accuracy=0.5

My validation accuracy is stuck at 50% while my training accuracy manages to converge to 100%. The pitfall is that I have very little data: 46 images in the training set and 12 in the validation set.
Therefore, I am augmenting my data while training, but I am running out of data too early, and I saw from previous answers that I should specify steps_per_epoch.
However, using steps_per_epoch=46/batch_size does not give that many iterations (a maximum of 10 if I specify a very low batch size).
I assume data augmentation is not being applied? How can I be sure my data is indeed being augmented? Below is my data augmentation code:
gen = ImageDataGenerator(rotation_range=180,
                         horizontal_flip=True,
                         vertical_flip=True)
train_batches = gen.flow(
    x=x_train,
    y=Y_train,
    batch_size=5,
    subset=None,
    shuffle=True)
val_batches = gen.flow(
    x=x_val,
    y=Y_val,
    batch_size=3,
    subset=None,
    shuffle=True)
history = model.fit(
    train_batches,
    batch_size=32,
    # steps_per_epoch=len(x_train)/batch_size,
    epochs=50,
    verbose=2,
    validation_data=val_batches,
    validation_steps=len(x_val)/batch_size)
I would really appreciate your help!
I think the mistake is not in your code.
You have a very small dataset, you are using only two augmentations, and (I assume) you initialize your model with random weights. Your model, as expected, overfits.
Here are a couple of ideas that may help you:
Add more augmentations. Vertical and horizontal flips are just not enough (with your small dataset). Think about crops, rotations, color changes, etc. By the way, here is a good tutorial on image augmentation where you'll find more ideas on what types of data augmentation you can use for your task: https://notrocketscience.blog/complete-guide-to-data-augmentation-for-computer-vision/
Transfer learning is a must-do for small datasets (a short sketch combining it with richer augmentation follows after this list). If you are using a popular/default architecture, PyTorch and TensorFlow let you load model weights trained on ImageNet, for instance. If your architecture is custom, download some open-source dataset (ideally similar to your task) and pretrain the model on that data.
Appropriate validation. Consider n-fold cross-validation, because a fixed train and test set is not a good idea for small datasets. Your validation accuracy may be low by chance (for instance, all the "hard" images are in the test set), not because the model is bad.
Let me know if it helps!
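To make the first two ideas concrete, here is a rough sketch, assuming the images are (or have been resized to) 224x224 RGB and reusing the x_train/Y_train/x_val/Y_val arrays from the question; the backbone choice and layer sizes are placeholders, and the validation generator deliberately has no augmentation:

# Sketch under assumptions: 224x224 RGB inputs, binary labels, tiny dataset.
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import layers, models

train_gen = ImageDataGenerator(preprocessing_function=preprocess_input,
                               rotation_range=180,
                               horizontal_flip=True,
                               vertical_flip=True,
                               width_shift_range=0.1,
                               height_shift_range=0.1,
                               zoom_range=0.2)
val_gen = ImageDataGenerator(preprocessing_function=preprocess_input)  # no augmentation

base = MobileNetV2(weights="imagenet", include_top=False,
                   input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # freeze the pretrained backbone

model = models.Sequential([
    base,
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

history = model.fit(train_gen.flow(x_train, Y_train, batch_size=8),
                    validation_data=val_gen.flow(x_val, Y_val, batch_size=3, shuffle=False),
                    epochs=50,
                    verbose=2)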

Would adding channel_shift_range to Keras preprocessing (image augmentation) allow the model to be used in variable light situations?

I am currently creating a machine learning model with the ultimate goal of deploying the model in an iOS app. The app would be used in the field, where light conditions are highly variable compared to the testing and training set.
Would adding a high channel_shift_range to my image data generator improve my model's ability to recognize the images even when the light conditions are highly variable?
I am currently using
train_datagen = ImageDataGenerator(
    rotation_range=360,
    width_shift_range=0.2,
    height_shift_range=0.2,
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    channel_shift_range=100,
    data_format=ch_format,
    brightness_range=(0.75, 1.25),
    fill_mode='nearest')
test_datagen = ImageDataGenerator(rescale=1./255,
                                  data_format=ch_format)
for my data generators and using flow_from_directory to load my images into the model.
On a conceptual level, would this work and create the desired results?
Also, if I added a channel_shift_range to the test_datagen would it more accurately reflect my model's performance in variable light situations?
Thank you for any help!
Normally, you would want to apply all sorts of data augmentation, not only channel_shift_range.
Note that, in principle, you should not touch your test dataset. In your case, if you apply a kind of augmentation to the training set, you do not want to apply it to the test set as well.
The idea of using augmentation is to provide the model with many examples, so that it is robust when it goes 'to production'. If you augmented the true test dataset in the exact same way you augment the training set, you would just 'mini-cheat': since you gave a lot of examples with, say, channel_shift_range of value K to your training set, giving the same exact value to the test set would be almost like copying training data into your test data; you do not want to do that.
Ensure that you use numerous and relevant augmentations for your case (for instance, a color shift when comparing apples and oranges is not a wise augmentation).
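As an illustrative sketch (directory names, image size and the brightness range are placeholders, and the model is assumed to be compiled with an accuracy metric): report your headline metric on the untouched test generator, and if you want a feel for lighting sensitivity, run a separate diagnostic evaluation rather than replacing that metric:

# Illustrative only: headline metric on clean test data, plus an optional
# lighting "stress test" that is reported separately, never instead of it.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

test_datagen = ImageDataGenerator(rescale=1./255)  # untouched test data
test_flow = test_datagen.flow_from_directory(
    "data/test", target_size=(224, 224), class_mode="categorical", shuffle=False)
clean_loss, clean_acc = model.evaluate(test_flow)

stress_datagen = ImageDataGenerator(rescale=1./255, brightness_range=(0.5, 1.5))
stress_flow = stress_datagen.flow_from_directory(
    "data/test", target_size=(224, 224), class_mode="categorical", shuffle=False)
stress_loss, stress_acc = model.evaluate(stress_flow)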

Text classification issue

I'm a newbie in ML and am trying to classify text into two categories. My dataset is made with Tokenizer from medical texts; it's unbalanced, with 572 records for training and 471 for testing.
It's really hard for me to make a model with diverse prediction outputs; almost all values are the same. I've tried using models from examples like this one and tweaking the parameters myself, but the output never makes sense.
Here are tokenized and prepared data
Here is script: Gist
Sample model that I used
sequential_model = keras.Sequential([
    layers.Dense(15, activation='tanh', input_dim=vocab_size),
    layers.BatchNormalization(),
    layers.Dense(8, activation='relu'),
    layers.BatchNormalization(),
    layers.Dense(1, activation='sigmoid')
])
sequential_model.summary()
sequential_model.compile(optimizer='adam',
                         loss='binary_crossentropy',
                         metrics=['acc'])
train_history = sequential_model.fit(train_data,
                                     train_labels,
                                     epochs=15,
                                     batch_size=16,
                                     validation_data=(test_data, test_labels),
                                     class_weight={1: 1, 0: 0.2},
                                     verbose=1)
Unfortunately I can't share datasets.
Also, I've tried to use keras.utils.to_categorical with the class labels, but it didn't help.
Your loss curves make sense: we see the network overfit to the training set, along with the usual bowl-shaped validation curve.
To make your network perform better, you can always deepen it (more layers), widen it (more units per hidden layer) and/or add more nonlinear activation functions so your layers can map to a wider range of values.
Also, I believe the reason why you originally got so many repeated values is the size of your network. Apparently, each of the data points has roughly 20,000 features (a pretty large feature space); the size of your network is too small, and the space of output values it can map to is consequently smaller. I did some testing with larger hidden layers (and bumped up the number of layers) and was able to see that the prediction values did vary: [0.519], [0.41], [0.37]...
It is also understandable that your network performance varies, because the number of features you have is about 50 times the size of your training set (usually you would like a smaller proportion). Keep in mind that training for too many epochs (say, more than 10) on such a small training and test dataset just to see improvements in loss is not great practice, as you can seriously overfit; it is probably a sign that your network needs to be wider/deeper.
All of these factors, such as the number of layers, hidden unit size and even the number of epochs, can be treated as hyperparameters. In other words, hold out some percentage of your training data as a validation split, go through each category of factors one by one, and optimize to get the highest validation accuracy. To be fair, your training set is not that large, but I believe you should hold out some 10-20% of it as a sort of validation set to tune these hyperparameters, given that you have such a large number of features per data point. At the end of this process, you should be able to determine your true test accuracy. This is how I would optimize to get the best performance from this network. Hope this helps.
More about training, test, val split
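As a rough sketch of that tuning loop (the candidate widths and the 15% hold-out fraction are arbitrary placeholders; train_data, train_labels and vocab_size are the objects from the question), you could hold out part of the training data with validation_split and compare configurations on held-out accuracy:

# Illustrative sketch: hold out 15% of the training data and compare a few widths.
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hidden_units, vocab_size):
    model = keras.Sequential([
        layers.Dense(hidden_units, activation="relu", input_dim=vocab_size),
        layers.Dense(hidden_units // 2, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
    return model

best = None
for hidden_units in (32, 64, 128):  # candidate widths (placeholders)
    model = build_model(hidden_units, vocab_size)
    hist = model.fit(train_data, train_labels,
                     epochs=10, batch_size=16,
                     validation_split=0.15,  # 15% held out for tuning
                     verbose=0)
    val_acc = max(hist.history["val_acc"])
    if best is None or val_acc > best[1]:
        best = (hidden_units, val_acc)
print("best width, val accuracy:", best)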

Feeding individual examples into TensorFlow graph trained on files?

I'm new to TensorFlow and am getting a bit tripped up on the mechanics of reading data. I set up a TensorFlow graph on the MNIST data, but I'd like to modify it so that I can run one program to train it and save the model out, and run another to load said graph, make predictions, and compute test accuracy.
Where I'm getting confused is how to bypass the original I/O system in the training graph and "inject" an image to predict or an (image, label) tuple of test data for accuracy testing. To read the training data, I'm using this code:
_, input_data = util.read_examples(
    paths_to_files,
    batch_size,
    shuffle=shuffle,
    num_epochs=None)
feature_map = {
    'label': tf.FixedLenFeature(
        shape=[], dtype=tf.int64, default_value=[-1]),
    'image': tf.FixedLenFeature(
        shape=[NUM_PIXELS * NUM_PIXELS], dtype=tf.int64),
}
example = tf.parse_example(input_data, features=feature_map)
I then feed example to a convolution layer, etc. and generate the output.
Now imagine that I train my graph with that code specifying the input, save out the graph and weights, and then restore the graph and weights in another script for prediction -- I'd like to take (say) 10 images and feed them to the graph to generate predictions. How do I "inject" those 10 images so that the predictions come out the other end?
I played around with feed dictionaries and placeholders, but I'm not sure if they're the right things for me to use... it seems like they rely on having data in memory, as opposed to reading from a queue of test data, for example.
Thanks!
A feed dictionary with placeholders would make sense if you wanted to perform a small number of inferences/evaluations (i.e. enough to fit in memory), e.g. if you were serving a simple model or running small eval loops.
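A minimal sketch of that first option (TF1-style; build_inference stands in for the conv layers you already have, and ten_images for a NumPy array holding the 10 images): define a placeholder for the image batch, build the inference graph on top of it, restore the checkpoint, and feed the images at prediction time:

# Hypothetical TF1-style sketch; names like build_inference and ten_images are
# placeholders for your own graph-building code and in-memory test images.
import tensorflow as tf

images_ph = tf.placeholder(tf.float32,
                           shape=[None, NUM_PIXELS * NUM_PIXELS], name="images")
logits = build_inference(images_ph)      # same layers used in training
predictions = tf.argmax(logits, axis=1)

saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, "/path/to/checkpoint")  # placeholder checkpoint path
    preds = sess.run(predictions, feed_dict={images_ph: ten_images})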
If you specifically want to infer or evaluate on large batches, then you should use the same approach you've used for training, but with a different path to your test/eval/live data, e.g.
_, eval_data = util.read_examples(
    paths_to_files,  # CHANGE THIS BIT
    batch_size,
    shuffle=shuffle,
    num_epochs=None)
You can use this like a normal Python variable and set up successive, dependent steps that use it as a provided variable, e.g.
def get_example(data):
    return tf.parse_example(data, features=feature_map)

sess.run([get_example(path_to_your_data)])