Train and test set for an ML algorithm - testing

I have a model trained on 33 datasets with an SVM using LOOCV. I collected another 13 datasets, which I also split in a leave-one-out fashion. In the testing phase, I combine the 33 training datasets with 12 of the test datasets, train a model on those 45, and test on the remaining dataset, iterating over the 13 held-out datasets (similar to LOOCV). Is this method of testing right? All the recordings are independent of each other and can be regarded as IID.

No, LOOCV is only used for small datasets or when you want an accurate estimate of your model performance.
Let's say your train accuracy is 90%; your test accuracy may be only 50%.
This is due to overfitting from the large train size and small test size.
[Image: overfitting in ML models]
Assuming your 45 datasets are all the same size, your train/test split would be roughly 98%/2%.
The general rule of thumb for a train/test split is 80%/20%.
You could use train_test_split, k-fold cross-validation, StratifiedShuffleSplit, etc. instead.
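For example, a minimal sketch with scikit-learn (X and y here stand for your recordings stacked into feature and label arrays; the class names are scikit-learn's own, the data handling is just an assumption):
from sklearn.model_selection import train_test_split, KFold, StratifiedShuffleSplit
from sklearn.svm import SVC

# plain 80/20 hold-out split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation: every sample is used for testing exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    clf = SVC().fit(X[train_idx], y[train_idx])
    print(clf.score(X[test_idx], y[test_idx]))

# stratified 80/20 splits that preserve the class proportions
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    clf = SVC().fit(X[train_idx], y[train_idx])
    print(clf.score(X[test_idx], y[test_idx]))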

Related

tf.distribute.MirroredStrategy - suggestion for improving test mean_iou for segmentation network using distributed training

I am using tensorflow 2.5.0 and have implemented a semantic segmentation network. I used the DeepLab_v3_plus network with a ResNet101 backbone, the Adam optimizer, and categorical cross-entropy loss to train it. I first built the code for a single GPU and achieved a test accuracy (mean_iou) of 54% after training for 96 epochs. Then I added tf MirroredStrategy (one machine) to the code to support multi-GPU training. Surprisingly, with 2 GPUs and training for 48 epochs the test mean_iou is just 27%, and with 4 GPUs and 24 epochs it is around 12% on the same dataset.
Here is how I modified the code from single-GPU to multi-GPU training.
Following the tensorflow blog on distributed training, I created a mirrored strategy and created the model, the model compilation, and the dataset_generator inside the strategy scope. As per my understanding, by doing so the model.fit() method will take care of synchronizing gradients and distributing data to each GPU for training. Although the code ran without any error, and training time also dropped compared to a single GPU for the same number of images, the test mean_iou keeps getting worse with more GPUs.
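Roughly, the structure looks like this (build_model and make_dataset stand in for my actual model builder and data pipeline, not real code):
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# scale the global batch size with the number of GPUs (8 per replica is an arbitrary example)
global_batch_size = 8 * strategy.num_replicas_in_sync

with strategy.scope():
    model = build_model()  # placeholder for the DeepLab_v3_plus / ResNet101 builder
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss=tf.keras.losses.CategoricalCrossentropy(),
                  metrics=["accuracy"])

train_ds = make_dataset(global_batch_size)  # placeholder tf.data pipeline of (image, mask) batches
model.fit(train_ds, epochs=96)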
I replaced BatchNormalization with SyncBatchNormalization, but saw no improvement.
I used a learning-rate warmup with linear scaling of the learning rate by the number of GPUs, but no improvement.
In the cross-entropy loss, I tried both losses_utils.ReductionV2.AUTO and losses_utils.ReductionV2.NONE:
loss = ce(y_true, y_pred)
# reshape loss for each sample (BxHxWxC -> BxN)
# normalize the loss by the number of non-zero elements, sum for each sample, and take the mean across all samples
With the .AUTO/.NONE options I am not scaling the loss by global_batch_size, on the understanding that tf will take care of it and that I am already normalizing per GPU, but I had no luck with either option (see the sketch after this list of attempts).
I changed the data_generator to a tf.data.Dataset object. Though it helped with training time, the test mean_iou became even worse.
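For reference, a sketch of the variant with explicit global-batch scaling under ReductionV2.NONE, roughly as in the TensorFlow distributed-training tutorial (segmentation_loss and GLOBAL_BATCH_SIZE are placeholder names):
import tensorflow as tf

ce = tf.keras.losses.CategoricalCrossentropy(
    reduction=tf.keras.losses.Reduction.NONE)   # per-pixel losses, no automatic reduction

def segmentation_loss(y_true, y_pred):
    # per-pixel loss has shape (B, H, W)
    per_pixel = ce(y_true, y_pred)
    # average over the spatial dimensions to get one loss value per sample: shape (B,)
    per_sample = tf.reduce_mean(per_pixel, axis=[1, 2])
    # scale by the GLOBAL batch size so the sum across replicas gives the true mean
    return tf.nn.compute_average_loss(per_sample, global_batch_size=GLOBAL_BATCH_SIZE)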
I would appreciate any lead or suggestion for improving the test mean_iou in distributed training.
Let me know if you need any additional details.
Thank you

Training dataset repeatedly - Keras

I am doing an image classification task using Keras.
I used the VGG16 architecture because I thought it would be easier; the task is to classify whether an MRI image contains a tumor or not.
As usual, I read all the images, resized them to the same shape (224×224×3), and normalised them by dividing by 255. Then I did a train/test split, with 25% of the data for testing and 75% for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
Then I trained and got a val_loss of 0.64 and a val_accuracy of 0.7261.
I saved the trained model to my Google Drive.
The next day, I followed the same procedure to improve the model performance by loading the saved model.
I didn't change the model architecture; I simply loaded the saved model, which scored 0.7261 accuracy.
This time, I got better performance: the val_loss is 0.58 and the val_accuracy is 0.7976.
I wondered how it got this higher accuracy. Then I found that when splitting the dataset, the images are split at random, and thus some of the test data from the 1st training process becomes training data in the 2nd training process. So the model has already learned those images and predicts them well in the 2nd training process.
I have to clarify: is this model truly learning the tumor patterns, or are we effectively training and testing the model on the same image samples?
Thanks
When using train_test_split and validating in different sessions, always set your random seed. Otherwise, you will be using different splits and leaking data, as you described. The model is not "learning" more; rather, it is being validated on data that it has already trained on. You will likely get worse real-world performance.
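A minimal sketch of a reproducible split (X and y are placeholders for your images and labels; stratify is optional but keeps the class balance identical across runs):
from sklearn.model_selection import train_test_split

# a fixed random_state guarantees the same split in every session,
# so no test image can leak into the training set between runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)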

Small gap between train and test error, does that mean overfitting?

I am working on a dataset of 368,609 samples and 34 features, and I want to use a neural network to predict latency (a real value) using Keras. The model has 3 hidden layers, each with 1024 neurons, and I used dropout (50%) and L2 regularization (0.001) on each hidden layer. The problem is that I am getting a test mean absolute error of 3.5505 ms and a train mean absolute error of 3.4528 ms. The train error is smaller than the test error by only a small gap; does this mean we have an overfitting problem here?
Not really, but it is still always a good idea to check how your model generalizes to new data.
Keep 10%-20% of your original dataset as a held-out test set and try to predict the output for each record in it.
When we reuse the same validation set across many attempts to improve our model, we tend to overfit to the evaluation dataset as well.
Having 3 different datasets for training, validation, and testing (as in the sketch below) is the usual way to guard against this.
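A minimal sketch of such a three-way split with scikit-learn (the 60/20/20 proportions are just an example):
from sklearn.model_selection import train_test_split

# first carve off 20% as the final test set, then 20% of the whole for validation
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)
# result: 60% train, 20% validation, 20% test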
If you get a high accuracy on your training set and a low accuracy on your test set, it often means that you are overfitting. So in your case: no, you are probably not overfitting.
Normally you would also have a validation set, so you don't fit your model to the test set.

Time taken to train Resnet on CIFAR-10

I was writing a neural net to train a ResNet on the CIFAR-10 dataset.
The paper Deep Residual Learning For Image Recognition mentions training for around 60,000 epochs.
I was wondering: what exactly does an epoch refer to in this case? Is it a single pass through a minibatch of size 128 (which would mean the 60,000 epochs correspond to around 150 passes through the entire 50,000-image training set)?
Also, how long is this expected to take to train (assume CPU only, 20-layer or 32-layer ResNet)? With the above definition of an epoch, it seems it would take a very long time...
I was expecting something around 2-3 hours only, which is equivalent to about 10 passes through the 50000 image training set.
The paper never mentions 60,000 epochs. An epoch is generally taken to mean one pass over the full dataset; 60,000 epochs would be insane. They use 64,000 iterations on CIFAR-10, where an iteration involves processing one minibatch, then computing and applying the gradients.
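For concreteness, converting the paper's iteration count into epochs with its mini-batch size of 128:
iterations = 64_000      # from the paper's CIFAR-10 experiments
batch_size = 128         # mini-batch size used in the paper
train_images = 50_000    # CIFAR-10 training set size

epochs = iterations * batch_size / train_images
print(epochs)            # ~163.8 full passes over the training set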
You are correct in that this means >150 passes over the dataset (these are the epochs). Modern neural network models often take days or weeks to train. ResNets in particular are troublesome due to their massive size/depth. Note that in the paper they mention training the model on two GPUs, which will be much faster than on a CPU.
If you are just training some models "for fun", I would recommend scaling them down significantly; try 8 layers or so, and even this might be too much. If you are doing this for research/production use, get some GPUs.
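If it helps, a minimal sketch of a scaled-down residual network in Keras (a toy example of that suggestion, not the paper's architecture; it omits batch normalization and the downsampling stages):
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    # two 3x3 convolutions with an identity shortcut (same filter count, so no projection needed)
    shortcut = x
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.Add()([x, shortcut])
    return layers.Activation("relu")(x)

inputs = tf.keras.Input(shape=(32, 32, 3))
x = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
for _ in range(3):  # 3 residual blocks -> roughly an 8-layer network
    x = residual_block(x, 16)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(10, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])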

How to interpret the strange training curve for RNN?

I use tensorflow to train a simple two-layer RNN on my dataset. The training curve is shown as follows:
The x-axis is the steps (in one step, a batch of batch_size samples is used to update the network parameters) and the y-axis is the accuracy. The red, green, and blue lines are the accuracy on the training set, validation set, and test set, respectively. The training curve is not smooth and has some abrupt changes. Is this reasonable?
Have you tried gradient clipping, the Adam optimizer, and learning rate decay?
In my experience, gradient clipping can prevent exploding gradients, the Adam optimizer converges faster, and learning rate decay can improve generalization.
Have you shuffled the training data?
In addition, visualizing the distribution of the weights also helps with debugging the model.
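A minimal sketch of the first three suggestions in Keras (the model variable stands for your already-built RNN; the clipping and decay values are arbitrary examples):
import tensorflow as tf

# learning rate decay: exponential schedule
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.96)

# Adam with gradient clipping by global norm
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule, clipnorm=1.0)

model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])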
It's absolutely OK since you are using SGD. The general trend is that your accuracy increases as the number of minibatches used increases; however, some minibatches can differ significantly from most of the others, so accuracy can be poor on them.
The fact that your test and validation accuracy drop sharply at points 13 and 21 is suspicious. For example, point 13 drops the test score below that of epoch 1.
This implies your learning rate is probably too large: a single mini-batch shouldn't cause that much weight change.