How to automatically judge whether the training process of a deep learning model has converged? - tensorflow

When training a deep learning model, I have to look at the loss curve and the performance curve to judge whether the training process has converged.
This costs me a lot of time, and sometimes the point of convergence judged by eye is not accurate.
Therefore, I'd like to know whether there is an algorithm or package that can automatically judge whether the training of a deep learning model has converged.
Can anyone help me?
Thanks a lot.

At the risk of disappointing you, I believe there is no such universal algorithm. In my experience, it depends on what you want to achieve, which metrics are important to you, and how much time you are willing to let the training go on for.
I have seen validation losses go up dramatically (a sign of overfitting) while other metrics (mIoU in this case) were still improving on the validation set. In these cases, you need to know what your target is.
It is possible (although very rare) that your loss goes up for a substantial amount of time before going down again and reaching better levels than before. There is no way to anticipate this.
Finally, and this is arguably a common case if you have tons of training data, your validation loss may keep going down, but more and more slowly. In this case, the best strategy if you had infinite time would be to let the training go on indefinitely. In practice this is impossible, and you need to find the right balance between performance and training time.
If you really need an algorithm, I would suggest this quite simple one:
Compute a validation metric M(i) after each i-th epoch on a fixed subset of your validation set or on the whole validation set. Let's suppose that the higher M(i) is, the better. Fix an integer k depending on the duration of one training epoch (k ~ 3 should do the trick).
If for some n you have M(n) > max(M(n+1), ..., M(n+k)), stop and keep the network you had at epoch n.
It's far from perfect, but should be enough for simple tasks.
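If you are using Keras, this rule is essentially patience-based early stopping, and something like the sketch below should do. The toy model and random data are only placeholders for your own network and validation data:

```python
import numpy as np
import tensorflow as tf

# Placeholder data standing in for your real training/validation sets.
x = np.random.rand(512, 20).astype("float32")
y = np.random.randint(0, 2, size=(512, 1)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

k = 3  # stop if the monitored metric has not improved for k epochs
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy",      # the metric M(i); higher is better
    mode="max",
    patience=k,
    restore_best_weights=True,   # roll back to the weights of the best epoch n
)

model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])
```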
[Edit] If you're not using it yet, I invite you to use TensorBoard to visualize the evolution of your metrics throughout training. Once set up, it saves a huge amount of time.
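For example, something along these lines wires TensorBoard into a Keras training run (the log directory name is just an example):

```python
import tensorflow as tf

# Log losses and metrics so they can be inspected in the TensorBoard UI.
tb_cb = tf.keras.callbacks.TensorBoard(log_dir="logs/run1", histogram_freq=1)

# Pass it alongside any other callbacks, e.g.:
#   model.fit(x, y, validation_split=0.2, epochs=100,
#             callbacks=[early_stop, tb_cb])
# Then launch the dashboard with:  tensorboard --logdir logs
```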

Related

Does changing a token name in an image caption model affect performance?

If I train an image caption model then stop to rename a few tokens:
Should I train the model from scratch?
Or can I reload the model and continue training from the last epoch with the updated vocabulary?
Will either approach affect model accuracy/performance differently?
I would go for option 2.
When training the model from scratch, you initialize the model's weights randomly and then fit them to your problem. However, if, instead of using random weights, you use weights that have already been trained on a similar problem, you may decrease the convergence time. This option is quite similar to the idea of transfer learning.
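As a rough sketch of option 2 in Keras, assuming the model was saved with model.save() and the vocabulary size is unchanged (only token names were relabeled), so the saved weights still match the architecture. The file name and the toy stand-in model are just examples:

```python
import numpy as np
import tensorflow as tf

# Toy stand-in for a previously trained model (a real caption model would be
# an encoder-decoder; this is only here to make the sketch runnable).
vocab_size = 1000
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.save("caption_model.keras")            # the "previous" training run

# Later: reload and continue training from where it stopped. Renaming a few
# tokens does not change the vocabulary size, so the saved embedding and
# output layers still have the right shapes and keep their learned weights.
model = tf.keras.models.load_model("caption_model.keras")
x = np.random.randint(0, vocab_size, size=(256, 20))   # placeholder sequences
y = np.random.randint(0, vocab_size, size=(256,))      # placeholder targets
model.fit(x, y, epochs=2)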
Just to give the other team a voice: So what is actually the difference between training from scratch and reloading a model and continuing training?
(2) will converge faster, but (1) will probably end up with better performance and should therefore be chosen. Do we actually care about training time when we trade it off against performance? Usually we do not.
The further your model has already converged towards a specific problem, the harder it gets to move it back into another optimum. You might be lucky, and the chance that you are going down the right rabbit hole rises with similar tasks and similar data, yet with a change in your setup this cannot be guaranteed.
Initializing with a few epochs on a domain other than your target domain definitely makes sense and can be beneficial, yet the question arises why you would not train on your target domain from the very beginning.
Note: For a more substantial read, I'd like to refer you to this paper, where they explain in more depth why the domain is of the essence and how transfer learning can hurt your final performance.
It depends on the number of tokens being relabeled compared to the total amount. Since you mentioned there are only a few of them, the optimal solution in my opinion is clear.
You should start the training from scratch, but initialize the weights with the values they had where the previous training stopped (again, it is crucial that the relabeled samples are not a substantial portion). This way, the model will likely converge faster than starting with random weights, and also better than trying to make it re-fit (i.e., "forget") what it managed to learn during the previous training.
Topologically speaking, you are initializing at a position where the model is closer to a global minimum but has not yet taken any steps towards a local minimum.
Hope this helps.

Can reducing the number of back propagation steps improve training performance?

I want to know how beneficial it would be if we could reduce the number of back propagation steps by 50%.
For example, let's say one neural network performs back propagation 1000 times during training, and another network performs back propagation only 500 times (let's assume both of them reach the same accuracy after training). Will the second one be significantly faster, or does it not matter much? Will it increase the speed of training?
If you can train two networks, to the same accuracy, but one of them only needs to process half as much data, then yes that is a good thing.
The resulting network will not be any faster to execute during inference time, but there are still several important benefits to the training process.
Training will take half as long. This is valuable by itself. It is extra valuable when you consider that you can now try twice as many ideas in the same amount of time. That will improve results quality for the entire process.
Faster convergence can reduce generalization error and overfitting: the optimization does not have as many opportunities to "fidget" and find ways to overfit.
Extremely fast convergence, called super-convergence, can improve the final training error while still keeping generalization error low, leading to better validation scores too.
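For reference, super-convergence is usually demonstrated with a one-cycle learning-rate policy (ramp the learning rate up, then back down over a single run). A rough epoch-level sketch in Keras, with made-up numbers:

```python
import tensorflow as tf

LR_MIN, LR_MAX, TOTAL_EPOCHS = 1e-4, 1e-2, 30   # example values only

def one_cycle(epoch, lr):
    """Ramp the learning rate up for the first half of training, then back down."""
    half = TOTAL_EPOCHS / 2
    if epoch < half:
        return LR_MIN + (LR_MAX - LR_MIN) * (epoch / half)
    return LR_MAX - (LR_MAX - LR_MIN) * ((epoch - half) / half)

lr_cb = tf.keras.callbacks.LearningRateScheduler(one_cycle)

# model.fit(..., epochs=TOTAL_EPOCHS, callbacks=[lr_cb])
```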
Speaking more generally, there is a lot of research and other activity on the topic of how to make networks train as quickly and cheaply as possible. One such benchmark is DAWNBench, which sets a target accuracy to achieve and then ranks approaches based on how fast they reach that target, and how much the GPUs or other infrastructure cost to do it.
This general idea of "cost reduction" is also one of the drivers behind the general idea of Transfer Learning.

Why do machine learning algorithms focus on speed and not accuracy?

I study ML and I see that most of the time the focus of the algorithms is run time and not accuracy: reducing features, taking a sample from the data set, using approximations, and so on.
I'm not sure why this is the focus, since once I have trained my model I don't need to train it anymore if its accuracy is high enough. Whether it takes 1 hour or 10 days to train the model does not really matter, because I do it only once and my goal is to predict my outcomes as well as I can (minimum loss).
If I train a model to tell cats from dogs, I want it to be as accurate as it can be, not the fastest, since once I have trained this model I don't need to train any more models.
I can understand why models that depend on fast-changing data need this focus on speed, but for general model training I don't understand why the focus is on speed.
Speed is a relative term, and accuracy is also relative, depending on the difficulty of the task. Currently the goal is to achieve human-like performance for applications at reasonable cost, because this will replace human labor and cut costs.
From what I have seen when reading papers, people usually focus on accuracy first to produce something that works. Then they do ablation studies - studies where pieces of the model are removed or modified - to achieve the same performance with less time or memory.
The field is largely experimentally validated. There really isn't much theory that states why CNNs work so well, other than that they can model any function given non-linear activation functions (https://en.wikipedia.org/wiki/Universal_approximation_theorem). There have been some recent efforts to explain why they work well. One I recall is MobileNetV2: Inverted Residuals and Linear Bottlenecks; the explanation of embedding data into a low-dimensional space without losing information might be worth reading.

Deep learning basic thoughts

I am trying to understand the basics of deep learning, lately reading a bit through deeplearning4j. However, I can't really find an answer to: how does training performance scale with the amount of training data?
Apparently, the cost function always depends on all the training data, since it just sums the squared error per input. Thus, I guess at each optimization step all data points have to be taken into account. I know deeplearning4j has the dataset iterator and the INDArray, where the data can live anywhere and thus (I think) doesn't limit the amount of training data. Still, doesn't that mean that the amount of training data is directly related to the computation time per step of gradient descent?
DL4J uses an iterator; Keras uses a generator. It is still the same idea: your data comes in batches and is used for SGD. So the minibatches matter, not the whole amount of data you have.
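A small sketch to illustrate the point (the sizes and the toy model are made up): the cost of one gradient step depends on the batch size, while the dataset size only determines how many steps make up an epoch.

```python
import numpy as np
import tensorflow as tf

n_samples, batch_size = 100_000, 32          # example sizes
x = np.random.rand(n_samples, 10).astype("float32")
y = np.random.rand(n_samples, 1).astype("float32")

# Batches are streamed; each training step only touches `batch_size` rows,
# so the per-step cost does not grow with n_samples.
ds = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(10_000).batch(batch_size)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

steps_per_epoch = int(np.ceil(n_samples / batch_size))
print("gradient steps per epoch:", steps_per_epoch)   # 3125 here
model.fit(ds, epochs=1)
```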
Fundamentally speaking, it doesn't (though your mileage may vary). You must find the right architecture for your problem. Adding new data records may introduce new features, which may be hard to capture with your current architecture. I would always question my net's capacity: retrain your model and check whether the metrics drop.

Is this overfitting

I'm running a machine learning algorithm to answer True/False questions.
Assume I use a classification algorithm.
After running it on 1200 data points, I got 30% accuracy.
But then I made a second algorithm that always negates the first algorithm's answer,
so its accuracy is 70%.
Is this overfitting for the second algorithm? Assume my first algorithm consistently reaches 30% accuracy.
To your question:
I feel like the answer kind of depends on the machine learning model you choose and on the training set. Most ML models make mistakes initially. In your case, if Algo 2 reaches 70%, it might just mean that Algo 1 is good at predicting the wrong thing, if I'm understanding this correctly. Although this might hold at the beginning, negating an ML model's answer is a bad idea. The better idea is to prepare your data correctly and train on a data set that is the best fit for your model.
Most machine learning models make mistakes; it is bound to happen. But the training set and all that data help you choose the right model, and data preparation is key to building your training set correctly. I know I'm bouncing all over the place; I apologize for that.
For instance, we might have a logistic regression model and want to identify the individuals who have a certain condition versus those who don't. The first thing we do is properly prepare our data and then train the model (this is the short version). My point is that training a model properly is very important; it is what allows your ML model to make accurate predictions.
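As a side note on the arithmetic in the question: for a binary task, negating every prediction simply flips the accuracy to 1 - accuracy, which by itself says nothing about overfitting. A small sketch with made-up labels:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1200)            # made-up ground truth
# A deliberately bad "classifier": wrong about 70% of the time.
y_pred = np.where(rng.random(1200) < 0.7, 1 - y_true, y_true)

acc_first = (y_pred == y_true).mean()             # ~0.30
acc_negated = ((1 - y_pred) == y_true).mean()     # ~0.70, by construction
print(acc_first, acc_negated, acc_first + acc_negated)  # the two always sum to 1
```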
I should say I really enjoy machine learning / deep learning, but I am by no means an expert. I highly recommend this class though; it's how I started off understanding the fundamentals.
Coursera Andrew Ng course