Keras training/testing results vary greatly after multiple runs - tensorflow

I am using Keras with the TensorFlow backend. The dataset I am working with is sequence data with a continuous Y value between 0 and 1. It is split into a training set of 1900 samples and a test set of 400 samples. I am using a VGG19 architecture that I built from scratch in Keras, and I train for 30 epochs.
My question is: if I run this architecture multiple times, I get very different results, anywhere between 0.15 and 0.5 RMSE. Is this normal for this type of data? Is it because I am not running enough epochs? The loss from the network seems to stabilize around 0.024 at the end of the run. Any ideas?
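One way to put numbers on the run-to-run spread is to collect the test RMSE of each run and summarize it. The sketch below is only illustrative: the five RMSE values are made up to span the 0.15 to 0.5 range mentioned above, and the last line assumes the reported loss of 0.024 is a mean squared error, which the post does not state.

```python
import math
import statistics

def summarize_runs(rmse_per_run):
    """Return mean, sample standard deviation, min and max of per-run RMSE values."""
    return {
        "mean": statistics.mean(rmse_per_run),
        "stdev": statistics.stdev(rmse_per_run),
        "min": min(rmse_per_run),
        "max": max(rmse_per_run),
    }

# Hypothetical RMSE values from five runs (illustrative only, not real results).
example_rmse = [0.15, 0.22, 0.31, 0.40, 0.50]
summary = summarize_runs(example_rmse)
print(summary)

# Assumption: if the training loss of 0.024 were an MSE, the matching
# training RMSE would be its square root (about 0.155).
train_rmse_if_mse = math.sqrt(0.024)
print(round(train_rmse_if_mse, 3))
```

Under that assumption, the training RMSE sits at the low end of the reported test range, so the runs that land near 0.5 are much worse on the test set than on the training set.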

Related

YOLOv4 loss too high

I am using YOLOv4-tiny for a custom dataset of 26 classes that I collected from the Open Images Dataset. The dataset is almost balanced (850 images per class, but with a different number of bounding boxes). When I used YOLOv4-tiny to train on just 3 classes, the loss was near 0.5 and it was fairly accurate. But for 26 classes, as soon as the loss goes below 2, the model starts to overfit. The predictions are also very inaccurate.
I have tried changing parameters like the learning rate, the momentum and the size, but whatever I do the model becomes worse than before. Using the regular YOLOv4 model rather than YOLOv4-tiny does not help either. How can I bring the loss further down?
Have you tried training with mAP? You can take a subset of your training set and make it the validation set, in the same way you made your training and test sets. Then you can run darknet.exe detector train data/obj.data yolo-obj.cfg yolov4.conv.137 -map. This periodically computes the mAP on your validation set while training. When the validation results start to get worse, it is time to stop training to prevent overfitting (this is called early stopping).
You need to run the training for (classes * 2000) iterations. However, for the best scores, you need to train your model for at least 6000 iterations (this is the max_batches setting). Also, please remember that if you are using b&w images, you should change channels=3 to channels=1. You can stop your training once the avg loss becomes something like this: 0.XXXX.
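The iteration rule above can be written as a one-line calculation. This is a minimal sketch of that rule only; the function name darknet_max_batches is made up for illustration and is not part of darknet.

```python
def darknet_max_batches(num_classes, minimum=6000):
    """Iterations suggested by the rule above: classes * 2000, but at least `minimum`."""
    return max(num_classes * 2000, minimum)

# 3 classes: the 6000 floor applies.
print(darknet_max_batches(3))
# 26 classes, as in the question: 26 * 2000 iterations.
print(darknet_max_batches(26))
```

For the 26-class dataset in the question this gives 52000 iterations, far more than the 6000 used for the graph below.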
Here's my mAP graph for 6000 iterations that ran for 6.2 hours:
[Image: avg loss with 6000 max_batches]
Moreover, you can follow the FAQ documentation by Stéphane Charette.

TensorFlow image classification colab sheet from training material: newbie questions

Apologies if my questions are relatively simple, but I have been approaching the TensorFlow bit recently with the aim to learn new skills.
There are several things in the example that I don't understand:
In the explore data section, the sizes of the datasets are returned as 60k and 10k for train and test respectively.
Where is the train/test size declared?
Packages like scikit-learn allow this to be specified as a percentage when invoking the split methods.
In the training model part, when the 5 epochs are trained, the number 1875 appears below.
- What is that?
- I was expecting the training to run over the 60k items, but even multiplying 1875 by 5 doesn't reach 10k.
The dataset is loaded using the TensorFlow Datasets API.
The source itself is already split into 60K (train) and 10K (test) examples:
https://www.tensorflow.org/datasets/catalog/fashion_mnist
An epoch is a complete pass over all the training samples. The training is done in batches. In the example you refer to, a batch size of 32 is used, so to complete one epoch, 1875 batches (60000 / 32) are run. The number 1875 therefore counts batches, not individual samples.
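The arithmetic can be checked with a few lines of Python. This is just a sketch of the calculation, using the 60000 samples and batch size of 32 quoted above; the helper name steps_per_epoch is chosen here for illustration.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once (last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)

steps = steps_per_epoch(60000, 32)
print(steps)           # batches per epoch
print(steps * 5)       # batches over 5 epochs
print(steps * 32 * 5)  # samples processed over 5 epochs
```

So 1875 * 5 = 9375 is the number of batches processed over 5 epochs; each batch holds 32 samples, which is 300000 samples seen in total.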
Hope this helps.

Avoiding overfitting while training a neural network with Tensorflow

I am training a neural network using TensorFlow's Object Detection API to detect cars. I used the following YouTube video to learn and execute the process.
https://www.youtube.com/watch?v=srPndLNMMpk&t=65s
Parts 1 to 6 of his series.
In his video, he says to stop the training when the loss value reaches about 1 or below on average, and that this would take roughly 10000 steps.
In my case, I am at 7500 steps right now and the loss values keep fluctuating between 0.6 and 1.3.
A lot of people complained in the comment section about false positives with this series, but I think this happened because training was prolonged unnecessarily (maybe because they didn't know when to stop?), which caused overfitting!
I would like to avoid this problem. I don't need the optimal weights, just reasonably good ones, while avoiding false detections and overfitting. I am also observing the 'Total Loss' section of TensorBoard. It fluctuates between 0.8 and 1.2. When do I stop the training process?
I would also like to know, in general, which factors the decision to stop training depends on. Is it always about reaching an average loss of 1 or less?
Additional information:
My training data has ~300 images
Test data ~ 20 images
Since I am using the concept of transfer learning, I chose ssd_mobilenet_v1.model.
Tensorflow version 1.9 (on CPU)
Python version 3.6
Thank you!
You should use a validation set, different from both the training set and the test set.
At each epoch, you compute the loss on both the training set and the validation set.
If the validation loss begins to increase, stop your training. You can then test your model on your test set.
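The stopping rule can be sketched in plain Python. This is a minimal illustration, not the Object Detection API's own mechanism: the `patience` parameter (how many epochs without improvement to tolerate) is an addition that guards against noisy losses, and the validation losses are made-up numbers.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch with the best validation loss, stopping
    once the loss has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss has stopped improving
    return best_epoch

# Hypothetical validation losses per epoch (illustrative only).
val_losses = [1.30, 1.10, 0.95, 0.90, 0.93, 0.97, 1.02, 1.10]
best = early_stopping_epoch(val_losses)
print(best)
```

In practice you would keep the weights saved at the best epoch rather than the ones from the last epoch trained.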
The validation set is usually the same size as the test set. For example, the training set is 70% of the data, and the validation and test sets are 15% each.
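A 70/15/15 split like this can be sketched as a shuffle of item indices. The helper below is hypothetical and works on indices only; the total of 320 items is simply the roughly 300 training plus 20 test images from the question pooled together, and the fixed seed is an assumption made so the split is repeatable.

```python
import random

def split_indices(num_items, train_pct=70, val_pct=15, seed=0):
    """Shuffle indices 0..num_items-1 and split them into train/validation/test.

    Percentages are integers; whatever remains after train and validation
    goes to the test set.
    """
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_train = num_items * train_pct // 100
    n_val = num_items * val_pct // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train, val, test = split_indices(320)
print(len(train), len(val), len(test))
```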
Also, please note that 300 images is probably not enough for your dataset. You should collect more.
For your other question:
The loss is the sum of your errors, so its scale depends on the problem and on your data. A loss of 1 does not mean much by itself. Never rely on a fixed loss value to decide when to stop your training.

same train and eval dataset but get a different result

Versions:
TensorFlow: 1.8.0
TensorBoard: 1.8.0
What I did:
I'm training a model on an imbalanced dataset with tf.estimator.DNNClassifier. I ran the training process twice, each time from scratch (i.e., with no checkpoint) and with the same data. I got two results that are very different from each other, as shown in the following pictures.
1st-train
2nd-train
A few points to comment:
There is no difference between the two training processes (no code or data changes); both start from scratch.
The training dataset size is about 100M.
Both training results are from 6 epochs. (And each result cost $25 on google ml-engine.)
From the two pictures we can tell:
The 1st training learns nothing for 6 epochs.
The 2nd training learns (it got an AUC over 0.6).
Although the difference in AUC between the two trainings is only 0.1 (0.6 - 0.5), it makes a big difference in meaning (a random guess versus a non-random guess).
Problem:
Why does this happen: the same training data, but totally different results?

Training with RMSPROP gives different results

I'm training a neural network (a CNN on top of an RNN) using Theano, with RMSPROP for optimisation (I'm using the Lasagne implementation for that).
My problem is that every time I train the network, I get totally different results (accuracies). I'm initializing the parameters using a fixed seed, and the problem doesn't happen when I train with SGD, so I guess RMSPROP is what causes the problem.
Is this normal behaviour with RMSPROP? What is the best practice to deal with it? Should I train the network several times and take the best model?
I'm also optimising using one example at a time (my training set is small, so I'm not using mini-batches or batches). Is this good practice with RMSPROP?
Batch sizes around 40 usually give better results. In my experience, training with a batch size of 40 for 3 epochs using the default RMSprop gives around 89% accuracy. Have you tried adjusting the learning rate of the RMSprop optimizer? Try a very small value first (the default is 0.001 in the Keras implementation) and then increase it by factors of 10 or 100.
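The suggested sweep can be generated with a short helper. This is only a sketch: the starting value of 1e-5 and the number of candidates are assumptions chosen so that the sweep passes through the Keras default of 0.001, and each value would then be passed to the optimizer as its learning rate for a separate training run.

```python
def learning_rate_sweep(start=1e-5, factor=10, count=4):
    """Return `count` candidate learning rates, each `factor` times the previous one."""
    return [start * factor ** i for i in range(count)]

# Candidates from very small up to ten times the Keras default of 0.001.
rates = learning_rate_sweep()
for rate in rates:
    print(f"{rate:.0e}")
```

Comparing validation accuracy across these runs, ideally with more than one run per value, shows whether the instability shrinks at smaller learning rates.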