While training a deep learning model, I notice that the time taken to complete one step keeps decreasing with every additional epoch. What causes this increase in efficiency when the data are the same?
And why is the first epoch so much slower than the later ones? Any answer or reference would be appreciated.
Here is my training model screenshot:
You can see the time per step decreasing: 3 s/step, 810 ms/step, 722 ms/step, and so on.
Partial answer:
The first epoch is slower due to a variety of initialization overheads: the model's weights are initialized to the selected values or distributions, the layers are instantiated, and so on.
Later epochs may accelerate for any of a variety of reasons. The most common, in the work I do, is that various runtime analyzers learn the data and control flow of your model and adjust the execution for better performance.
This can involve input ingestion (caching), operation short-circuiting, switching to sparse-matrix computations as kernel weights "shake out" to a majority of 0.0 elements, etc.
However, without a proper example that reproduces the effect, and without profiling the execution, these ideas are only conjecture.
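If you want to move beyond conjecture, one option is to profile a few training steps and inspect the trace in TensorBoard's Profile tab. A minimal, self-contained sketch (the toy model and random data are stand-ins for the setup in the question; viewing the trace additionally requires the TensorBoard profiler plugin):

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for the model and data in the question.
x = np.random.rand(1024, 32).astype("float32")
y = np.random.randint(0, 2, size=(1024, 1))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Profile steps 10-15 so the first-step warm-up cost shows up separately from
# the steady-state step time; traces appear under logs/profile in TensorBoard.
tb = tf.keras.callbacks.TensorBoard(log_dir="logs/profile",
                                    profile_batch=(10, 15))
model.fit(x, y, batch_size=32, epochs=3, callbacks=[tb])
```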
This is very case-specific and cannot be generalized. The time taken is variable and also depends on external factors such as memory availability during the run, input sizes, etc.
When training a deep learning model, I have to look at the loss and performance curves to judge whether the training process has converged.
This costs me a lot of time, and the convergence point judged by the naked eye is sometimes not accurate.
Therefore, I'd like to know whether there is an algorithm or a package that can automatically judge whether the training of a deep learning model has converged.
Can anyone help me?
Thanks a lot.
At the risk of disappointing you, I believe there is no such universal algorithm. In my experience, it depends on what you want to achieve, which metrics are important to you, and how long you are willing to let the training run.
I have already seen validation losses go up dramatically (a sign of overfitting) while other metrics (mIoU in this case) were still improving on the validation set. In such cases, you need to know what your target is.
It is possible (although very rare) that your loss goes up for a substantial amount of time before going down again and reaching better levels than before. There is no way to anticipate this.
Finally, and this is arguably a common case if you have tons of training data, your validation loss may keep going down, but more and more slowly. In that case, the best strategy with an infinite amount of time would be to let the training run indefinitely. In practice this is impossible, and you need to find the right balance between performance and training time.
If you really need an algorithm, I would suggest this quite simple one:
Compute a validation metric M(i) after each i-th epoch on a fixed subset of your validation set (or the whole validation set). Suppose that the higher M(i) is, the better. Fix an integer k depending on the duration of one training epoch (k ≈ 3 should do the trick).
If for some n you have M(n) > max(M(n+1), ..., M(n+k)), stop and keep the network you had at epoch n.
It's far from perfect, but should be enough for simple tasks.
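If you are training with Keras, this rule is roughly what the built-in EarlyStopping callback does with patience = k and restore_best_weights=True. A minimal sketch (the metric name val_accuracy is just a placeholder for whatever your M(i) is):

```python
import tensorflow as tf

# Stop once the monitored metric has not improved for k = 3 consecutive
# epochs, and roll back to the weights from the best epoch n.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy",      # assumes "higher is better"; swap in your M(i)
    mode="max",
    patience=3,                  # the k from the rule above
    restore_best_weights=True,   # keep the network you had at epoch n
)

# Usage (model and datasets are your own):
# model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[early_stop])
```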
[Edit] If you're not using it yet, I invite you to use TensorBoard to visualize the evolution of your metrics throughout training. Once set up, it saves a huge amount of time.
If I train an image caption model then stop to rename a few tokens:
Should I train the model from scratch?
Or can I reload the model and continue training from the last epoch with the updated vocabulary?
Will either approach affect model accuracy/performance differently?
I would go for option 2.
When training the model from scratch, you initialize the model's weights randomly and then fit them to your problem. If, instead of random weights, you use weights that have already been trained on a similar problem, you may decrease the convergence time. This option is somewhat similar to the idea of transfer learning.
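A minimal sketch of option 2 in Keras, assuming the model was saved to a (hypothetical) caption_model.keras file and that renaming tokens did not change the vocabulary size, so the output layer shape stays the same:

```python
import tensorflow as tf

# Reload the checkpoint saved before the vocabulary edit; the weights and,
# since the model was compiled, the optimizer state are restored as well.
model = tf.keras.models.load_model("caption_model.keras")   # hypothetical path

# Continue training with the updated token -> id mapping applied to the inputs.
# (Datasets and epoch numbers below are placeholders.)
# model.fit(updated_train_ds, validation_data=updated_val_ds,
#           initial_epoch=20, epochs=40)
```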
Just to give the other side a voice: what is actually the difference between training from scratch (1) and reloading the model and continuing training (2)?
(2) will converge faster, but (1) will probably end up with better performance and should therefore be chosen. Do we really care about training time when it is traded off against performance? Usually we do not.
The further your model has already converged to a specific problem, the harder it becomes to move it into another optimum. You might be lucky, and the chance that you are going down the right rabbit hole rises with similar tasks and similar data, yet with a change in your setup this cannot be guaranteed.
Pre-training for a few epochs on something other than your target domain can certainly make sense and be beneficial, yet the question arises why you would not train on your target domain from the very beginning.
Note: For a more substantial read, I'd like to refer you to this paper, where they explain in more depth why the domain is of the essence and how transfer learning can hurt your final performance.
It depends on the number of tokens being relabeled compared to the total amount. Since you mention there are only a few of them, the optimal solution, in my opinion, is clear.
You should start the training from scratch but initialize the weights with the values they had when the previous training stopped (again, it is crucial that the relabeled samples are not a substantial fraction of the data). This way, the model will likely converge faster than starting from random weights, and also do better than trying to re-fit ("forget") what it managed to learn in the previous training.
In terms of the loss landscape, you are initializing at a position that is already close to a good minimum, but without having taken any optimization steps yet.
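In Keras terms, one way to read this suggestion (a sketch with a toy model standing in for the previously trained one): rebuild the architecture, copy the old weights in, and compile with a fresh optimizer so no stale optimizer state is carried over.

```python
import tensorflow as tf

# `old_model` stands for the previously trained caption model (e.g. loaded
# from a checkpoint); a tiny toy model is used here so the snippet runs.
old_model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])

new_model = tf.keras.models.clone_model(old_model)   # same architecture, fresh weights
new_model.set_weights(old_model.get_weights())       # warm-start from the old values

# Fresh optimizer state and learning-rate schedule, i.e. "from scratch" in terms
# of the optimization run, but starting near the previous solution.
new_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```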
Hope this helps.
I am trying to understand the basics of deep learning and have lately been reading a bit through deeplearning4j. However, I can't really find an answer to: how does the training performance scale with the amount of training data?
Apparently, the cost function always depends on all of the training data, since it just sums the squared error per input. So I guess that at each optimization step, all data points have to be taken into account. deeplearning4j has the dataset iterator and the INDArray, where the data can live anywhere and thus (I think) the amount of training data is not limited. Still, doesn't that mean that the amount of training data directly determines the computation time per step of gradient descent?
DL4J uses an iterator; Keras uses a generator. It is the same idea: your data comes in batches and is fed to SGD. So the minibatch size matters, not the total amount of data you have.
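A small worked example of that point (the numbers are made up): the cost of one gradient step is set by the batch size, while the dataset size only determines how many steps make up an epoch.

```python
import math

n_samples = 100_000   # hypothetical dataset size
batch_size = 32

steps_per_epoch = math.ceil(n_samples / batch_size)   # = 3125
print(steps_per_epoch)

# Each gradient step touches only `batch_size` samples, so doubling the dataset
# roughly doubles the number of steps (and time) per epoch, but leaves the cost
# of an individual step unchanged.
```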
Fundamentally speaking, it doesn't (though your mileage may vary). You must find the right architecture for your problem. Adding new data records may introduce new features that are hard to capture with your current architecture. I would always question my net's capacity: retrain your model and check whether the metrics drop.
I am planning a small TensorFlow image classification project, which is expected to run on machines with low processing power, and one of the concerns I was asked about was the time needed to train the model.
The project is still in the conception stage and no clear boundaries have been set.
But assuming that we will use TensorFlow for Python, with a simple neural network and a data set of, say, n images, is there a way to estimate or predict the time required to train the model before performing the training, given the hardware in use?
I asked one of my colleagues who works with neural networks, and he said that we could perhaps estimate the time by measuring the duration of the first epoch and estimating how many epochs will be needed afterwards. Is this a valid approach? If so, is it even possible to estimate the number of epochs needed? And in either case, is there a way to calculate it before performing any training?
There is no definite way of finding the number of epochs at which the model converges; it is one of the hyperparameters.
Apart from the type of model you are training, convergence also depends on the distribution of data, and the optimizer you are using.
You can make a rough estimate by looking at the number of parameters in your model, timing one epoch, and drawing on "experience" for the likely number of epochs. BUT you always have to look at the training and validation loss curves to check for convergence.
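A rough sketch of your colleague's suggestion, using a toy model and random data as stand-ins for your real setup (the epoch count is a pure guess, which is exactly the weak point of the approach):

```python
import time
import numpy as np
import tensorflow as tf

# Toy stand-ins for the real image data set and classification model.
x_train = np.random.rand(512, 64, 64, 3).astype("float32")
y_train = np.random.randint(0, 10, size=(512,))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

epoch_guess = 30   # pure assumption, taken "from experience" as described above

start = time.perf_counter()
model.fit(x_train, y_train, batch_size=32, epochs=1, verbose=0)  # time one epoch
one_epoch = time.perf_counter() - start

print(f"~{one_epoch:.1f} s/epoch, rough total ~{one_epoch * epoch_guess / 60:.1f} min")
# The first epoch carries one-off warm-up cost (see the first question above),
# so timing a second epoch usually gives a tighter per-epoch estimate.
```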
I am working on a comparison of deep learning models with an application in vehicular network communication security. I want to know how I can compute the complexity of these models in order to assess the performance of my proposed ones. I am using TensorFlow.
You can compare the complexity of two deep networks with respect to space and time; a small measurement sketch follows the lists below.
Regarding space complexity:
Number of parameters in your model -> this is directly proportional to the amount of memory consumed by your model.
Regarding time complexity:
Amount of time it takes to train a single batch for a given batch size.
Amount of time it takes for training to converge
Amount of time it takes to perform inference on a single sample
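A minimal sketch of how these quantities could be measured in TensorFlow/Keras (the tiny model and random data below are placeholders for the models you are comparing):

```python
import time
import numpy as np
import tensorflow as tf

# Toy stand-in; replace with each model under comparison.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Space complexity proxy: parameter count (proportional to memory footprint).
print("parameters:", model.count_params())

x = np.random.rand(256, 64).astype("float32")
y = np.random.rand(256, 1).astype("float32")

model.train_on_batch(x, y)          # warm-up call, so graph tracing is not timed
start = time.perf_counter()
model.train_on_batch(x, y)          # time one training step on a single batch
print(f"train step: {time.perf_counter() - start:.4f} s")

model.predict(x[:1], verbose=0)     # warm-up
start = time.perf_counter()
model.predict(x[:1], verbose=0)     # time inference on a single sample
print(f"single-sample inference: {time.perf_counter() - start:.4f} s")
```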
Some papers also discuss architectural complexity. For example, if GoogLeNet's accuracy is only marginally higher than VGG's, some people might prefer VGG as it is a lot easier to implement.
You can also analyze how tolerant your network is to hyperparameter tuning, i.e. how its performance varies when you change the hyperparameters.
If your model is in a distributed setting, there are other things to mention such as the communication interval as it is the bottleneck sometimes.
In summary, you can discuss pretty much anything that is implemented differently in another network and contributes additional complexity without much improvement in accuracy relative to your network.
You may not need it, but there is also an open-source project called DeepBench for benchmarking different deep network models.