I want to know how beneficial it would be if we could reduce the number of backpropagation steps by 50%.
For example, say one neural network performed backpropagation 1000 times during training, and another network needs only 500 backpropagation steps to get trained (let's assume both of them reach the same accuracy after training). Will the second one be significantly faster, or does it not matter much? I assume it will increase the speed of training.
If you can train two networks to the same accuracy, but one of them only needs to process half as much data, then yes, that is a good thing.
The resulting network will not be any faster to execute during inference time, but there are still several important benefits to the training process.
Training will take half as long. This is valuable by itself. It is extra valuable when you consider that you can now try twice as many ideas in the same amount of time, which will improve the quality of results for the entire process.
Faster convergence can reduce generalization error and overfitting. The optimization does not have as many chances to "fidget" and find ways to overfit.
Extremely fast convergence, called super-convergence, can improve the final training error while still keeping generalization error low, leading to better validation scores too.
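Super-convergence is usually obtained with a one-cycle learning-rate schedule (Leslie Smith's work). As a rough, illustrative sketch only, such a schedule could be wired into Keras like this; the epoch counts and learning rates below are placeholder assumptions, not recommendations:

```python
import tensorflow as tf

def one_cycle(total_epochs, start_lr=1e-3, max_lr=1e-2, final_lr=1e-4):
    """Triangular one-cycle schedule: ramp up to max_lr, then decay below start_lr."""
    mid = max(1, total_epochs // 2)

    def schedule(epoch, lr):
        if epoch < mid:  # linear warm-up towards the peak learning rate
            return start_lr + (max_lr - start_lr) * epoch / mid
        # linear cool-down from the peak towards final_lr
        return max_lr + (final_lr - max_lr) * (epoch - mid) / max(1, total_epochs - mid)

    return tf.keras.callbacks.LearningRateScheduler(schedule)

# usage (model and data are placeholders):
# model.fit(x_train, y_train, epochs=30, callbacks=[one_cycle(30)])
```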
Speaking more generally, there is a lot of research and other activity on the topic of how to make networks train as quickly and cheaply as possible. One such benchmark is DAWNBench, which sets a target accuracy to achieve and then ranks approaches based on how fast they reach that target, and how much the GPUs or other infrastructure cost to do it.
This idea of cost reduction is also one of the drivers behind Transfer Learning.
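As a concrete (and deliberately simplified) illustration of that cost reduction: a typical transfer-learning setup freezes a pretrained feature extractor and trains only a small head, which usually needs far fewer gradient updates than training from scratch. The sketch below uses MobileNetV2 from Keras; the input shape and class count are placeholder assumptions:

```python
import tensorflow as tf

# Reuse a pretrained feature extractor and train only a small head; this usually
# needs far fewer gradient updates than training the whole network from scratch.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the expensive-to-train part

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),  # 10 classes is a placeholder
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, epochs=5)  # train_ds is a placeholder tf.data.Dataset
```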
When training a deep learning model, I have to look at the loss curve and performance curves to judge whether the training process has converged.
This has cost me a lot of time, and sometimes the point of convergence judged by the naked eye is not accurate.
Therefore, I'd like to know whether there exists an algorithm or a package that can automatically judge whether the training of a deep learning model has converged.
Can anyone help me?
Thanks a lot.
At the risk of disappointing you, I believe there is no such universal algorithm. In my experience, it depends on what you want to achieve, which metrics are important to you, and how much time you are willing to let the training run.
I have seen validation loss dramatically go up (a sign of overfitting) while other metrics (mIoU, in that case) were still improving on the validation set. In such cases, you need to know what your target is.
It is possible (although very rare) that your loss goes up for a substantial amount of time before going down again and reaching better levels than before. There is no way to anticipate this.
Finally, and this is arguably a common case if you have tons of training data, your validation loss may keep going down, but more and more slowly. In that case, the best strategy, if you had an infinite amount of time, would be to let the training keep going indefinitely. In practice this is impossible, and you need to find the right balance between performance and training time.
If you really need an algorithm, I would suggest this quite simple one:
Compute a validation metric M(i) after each i-th epoch, on a fixed subset of your validation set or on the whole validation set. Let's suppose that the higher M(i) is, the better. Choose an integer k depending on the duration of one training epoch (k ≈ 3 should do the trick).
If for some n you have M(n) > max(M(n+1), ..., M(n+k)), stop and keep the network you had at epoch n.
It's far from perfect, but it should be enough for simple tasks.
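A plain-Python sketch of that stopping rule might look like the following; train_one_epoch, evaluate_model, and the checkpointing hook are placeholders you would supply yourself:

```python
def train_with_patience(train_one_epoch, evaluate_model, max_epochs, k=3):
    """Stop once the best metric M(n) has not been beaten for k consecutive epochs.

    train_one_epoch() runs one epoch of training (placeholder you supply).
    evaluate_model() returns the validation metric M(i); higher is better (placeholder).
    Returns the index of the best epoch, whose weights you should have checkpointed.
    """
    best_metric = float("-inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        train_one_epoch()
        metric = evaluate_model()
        if metric > best_metric:
            best_metric, best_epoch = metric, epoch
            # save_checkpoint(epoch)  # placeholder: keep the weights of the best epoch
        elif epoch - best_epoch >= k:
            break  # M(best_epoch) beat each of the next k epochs: stop here
    return best_epoch
```

If you are training with Keras, essentially the same rule is available out of the box as tf.keras.callbacks.EarlyStopping(patience=k, mode="max", restore_best_weights=True), with monitor set to your validation metric.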
[Edit] If you're not using it yet, I invite you to use TensorBoard to visualize the evolution of your metrics throughout training. Once set up, it is a huge time saver.
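If you use Keras, hooking TensorBoard in is roughly a one-liner; the log directory and the fit arguments below are placeholders:

```python
import tensorflow as tf

# "logs/run1" is a placeholder directory; one sub-directory per experiment is handy.
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir="logs/run1")

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=50, callbacks=[tensorboard_cb])
# then launch the dashboard with:  tensorboard --logdir logs
```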
I study ML, and I see that most of the time the focus of the algorithms is run time rather than accuracy: reducing features, taking samples from the data set, using approximations, and so on.
I'm not sure why that is the focus, since once I have trained my model I don't need to train it anymore if my accuracy is high enough. Whether it takes me 1 hour or 10 days to train does not really matter, because I do it only once, and my goal is to predict my outcomes as well as I can (minimum loss).
If I train a model to tell cats from dogs, I want it to be as accurate as it can be, not the fastest, since once I have trained this model I don't need to train any more models.
I can understand why models that depend on fast-changing data need this focus on speed, but for models trained once I don't understand why the focus is on speed.
Speed is a relative term, and accuracy is also relative, depending on the difficulty of the task. Currently the goal is to achieve human-like performance in applications at a reasonable cost, because this can replace human labor and cut costs.
From what I have seen in papers, people usually focus on accuracy first to produce something that works. Then they do ablation studies, i.e. studies where pieces of the model are removed or modified, to achieve the same performance in less time or with lower memory requirements.
The field is very experimentally driven. There really isn't much theory that explains why CNNs work so well, other than that a network can model any function given non-linear activation functions (https://en.wikipedia.org/wiki/Universal_approximation_theorem). There have been some recent efforts to explain why they work well; one I recall is "MobileNetV2: Inverted Residuals and Linear Bottlenecks". Its explanation of embedding data into a low-dimensional space without losing information might be worth reading.
I am working on a comparison of deep learning models with an application in vehicular network communication security. I want to know how I can compute the complexity of these models in order to assess the performance of my proposed ones. I am using TensorFlow.
You can compare the complexity of two deep networks with respect to space and time.
Regarding space complexity:
Number of parameters in your model -> this is directly proportional to the amount of memory consumed by your model.
Regarding time complexity:
Amount of time it takes to train a single batch for a given batch size.
Amount of time it takes for training to converge
Amount of time it takes to perform inference on a single sample
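As an example, with a Keras model you could get rough numbers for the points above with something like the sketch below; the input shape, batch size, and number of timing runs are assumptions, and training time per batch could be measured analogously with model.train_on_batch:

```python
import time
import numpy as np
import tensorflow as tf

def profile_model(model, input_shape=(224, 224, 3), batch_size=32, runs=10):
    """Rough space/time profile of a Keras model on random data."""
    n_params = model.count_params()  # space: total number of parameters

    batch = np.random.rand(batch_size, *input_shape).astype("float32")
    single = batch[:1]

    model.predict(batch, verbose=0)  # warm-up so graph tracing is not counted

    start = time.perf_counter()
    for _ in range(runs):
        model.predict(batch, verbose=0)  # forward pass on a full batch
    sec_per_batch = (time.perf_counter() - start) / runs

    start = time.perf_counter()
    for _ in range(runs):
        model.predict(single, verbose=0)  # single-sample inference
    sec_per_sample = (time.perf_counter() - start) / runs

    return {"params": n_params,
            "sec_per_batch_forward": sec_per_batch,
            "sec_per_sample_inference": sec_per_sample}
```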
Some papers also discuss architecture complexity. For example, if GoogLeNet's accuracy is only marginally higher than VGG-net's, some people might prefer VGG-net as it is a lot easier to implement.
You can also discuss some analysis of the tolerance of your network to hyperparameter tuning, i.e. how your performance varies when you change the hyperparameters.
If your model runs in a distributed setting, there are other things to mention, such as the communication interval, since it is sometimes the bottleneck.
In summary, you can discuss pretty much anything that you feel is implemented differently in another network and contributes additional complexity without much improvement in accuracy compared to your network.
You may not need it, but there is also an open-source project called DeepBench for benchmarking different deep network models.
I have faced a problem where I needed to solve a regression task using as few instances as possible. When I tried XGBoost, I had to feed it 4 instances to get a reasonable result. But a multilayer perceptron tuned for regression problems needs 20 instances; I tried changing the number of neurons and layers, but the answer is still 20. Is it possible to make a neural network solve regression tasks with only 2 to 4 instances? If yes, please explain what I should do to succeed. Maybe there is some correlation between how many instances are needed to train a perceptron to a reasonable result and how informative the features in the dataset are?
Thanks in advance for any help
With small numbers of samples, there are likely better methods to apply; XGBoost definitely comes to mind as a method that does quite well at avoiding overfitting.
Neural networks tend to work well with larger numbers of samples. They often overfit on small datasets and underperform other algorithms.
There is, however, an active area of research in semi-supervised techniques using neural networks with large datasets of unlabeled data and small datasets of labeled samples.
Here's a paper to start you down that path; search for 'semi-supervised learning':
http://vdel.me.cmu.edu/publications/2011cgev/paper.pdf
Another area of interest for reducing overfitting on smaller datasets is multi-task learning.
http://ruder.io/multi-task/
Multi-task learning requires the network to achieve multiple target goals for a given input. Adding more requirements tends to reduce the space of solutions the network can converge on, and it often achieves better results because of it. To say that differently: when multiple objectives are defined, the parameters necessary to do well at one task are often beneficial for the other task, and vice versa.
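To make that concrete, here is a hedged sketch of a shared-trunk, two-head model in Keras; the layer sizes, task names, and loss weights are illustrative assumptions rather than a recipe:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(64,))  # 64 input features (placeholder)
shared = tf.keras.layers.Dense(128, activation="relu")(inputs)
shared = tf.keras.layers.Dense(64, activation="relu")(shared)

# Two heads that must both be satisfied by the same shared representation.
reg_out = tf.keras.layers.Dense(1, name="regression")(shared)
cls_out = tf.keras.layers.Dense(3, activation="softmax", name="aux_class")(shared)

model = tf.keras.Model(inputs, [reg_out, cls_out])
model.compile(
    optimizer="adam",
    loss={"regression": "mse", "aux_class": "sparse_categorical_crossentropy"},
    loss_weights={"regression": 1.0, "aux_class": 0.3},  # weight the auxiliary task lower
)
# model.fit(x, {"regression": y_reg, "aux_class": y_cls}, epochs=20)
```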
Lastly, another area of open research is GANs and how they might be used in semi-supervised learning. No papers pop to the forefront of my mind on the subject just now, so I'll leave this mention as a footnote.
What are the potential reasons for a NN to output different values for the same input, especially when there aren't any random or stochastic processes involved?
This is a very broad and general question, perhaps too broad to be on here, but there are several things you should know about neural networks:
They are NOT methods for finding one perfect, optimal solution. A neural network usually learns the examples it is given and "figures out" a way to predict results reasonably well. Reasonable is relative: for some models it may mean 50% success, and for others anything short of 99.9% will be considered failure.
Their outcome is very dependent on the data they were trained on. The order of the data matters, and it's usually a good idea to shuffle data during training, but that can lead to wildly different results. The quality of the data also matters, for example if the training data is very different in nature from the test data.
The best analogy for neural networks in computing is, of course, the brain. Even with the same information and the same basic underlying biology, we can all evolve different opinions based on countless other variables. The same is true, to some extent, of computers that learn.
Some types of neural networks use dropout layers, which are specifically designed to shut off random parts of the network during training. This should not affect the final prediction process, because at prediction time that layer is usually set to let all parts of the network operate; but if you feed in data while telling the model it is "training" instead of asking it to predict, the results may vary significantly.
The sum of all this is just to say: training a neural network should be expected to yield different results from similar starting conditions, so it must be tested multiple times for every condition to determine which parts of the variation are inevitable and which are not.
It might be due to shuffling of the data. If you want to use the same vectors, you should turn the shuffle argument off.
You should try disabling dropout. Dropout randomly sets the outputs of certain neurons to 0. This will mean that your output will be different each time.
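A quick way to confirm (or rule out) this effect in Keras is to compare calls made with training=False, where dropout is inactive, against calls made with training=True; the toy model below is just for illustration, not your actual network:

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dropout(0.5),  # zeroes random activations, but only in training mode
    tf.keras.layers.Dense(1),
])

x = np.random.rand(1, 8).astype("float32")

# Inference mode: dropout is a no-op, so repeated calls match exactly.
print(model(x, training=False).numpy(), model(x, training=False).numpy())

# Training mode: dropout is active, so repeated calls differ.
print(model(x, training=True).numpy(), model(x, training=True).numpy())
```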