I am trying to understand the purpose of the ReduceLROnPlateau() callback in Keras.
I understand that this callback reduces the learning rate when there is no improvement in the validation loss. But won't this prevent the network from getting out of a local minimum? What if the network stays at a local minimum for about 5 epochs and this callback further reduces the learning rate, while increasing the learning rate would actually help the network get out of such a local minimum?
In other words, how will it understand if it has reached a local minimum or a plateau?
First up, here is a good explanation from CS231n class why learning rate decay is reasonable in general:
In training deep networks, it is usually helpful to anneal the learning rate over time. Good intuition to have in mind is that with a high learning rate, the system contains too much kinetic energy and the parameter vector bounces around chaotically, unable to settle down into deeper, but narrower parts of the loss function. Knowing when to decay the learning rate can be tricky: decay it slowly and you'll be wasting computation bouncing around chaotically with little improvement for a long time. But decay it too aggressively and the system will cool too quickly, unable to reach the best position it can.
Concerning your question: unfortunately, there is no way to know. If the optimizer hits a deep valley and can't get out of it, it simply hopes that this valley is good and worth exploring with a smaller learning rate. Currently, there's no technique to tell whether there are better valleys, i.e., whether it's a local or global minimum. So the optimizer makes a bet to explore the current one, rather than jump far away and start over. As it turns out in practice, no local minimum is much worse than the others, which is why this strategy often works.
Also note that the loss surface may appear like a plateau for some learning rate, but not for 10 times smaller learning rate. So "escape the plateau" and "escape local minimum" are different challenges, and ReduceLROnPlateau aims for the first one.
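For reference, here is a minimal sketch of how the callback is typically wired up in Keras; the factor, patience, and min_lr values below are illustrative assumptions, not recommendations:

```python
import tensorflow as tf

# Reduce the learning rate when val_loss has stopped improving.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",  # quantity to watch
    factor=0.1,          # multiply the learning rate by 0.1 on a plateau
    patience=5,          # number of epochs with no improvement before reducing
    min_lr=1e-6,         # lower bound on the learning rate
)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[reduce_lr])
```

Note that factor < 1, so this callback only ever shrinks the learning rate; it never increases it again.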
Related
I am performing an optimization using gradient descent, but sometimes it jumps over the minimum and the cost function increases. I added a condition: if the cost function value increased, then step back and make the learning rate smaller this time. It is working very well. Why am I not seeing this in the literature anywhere? I have read a lot of optimization literature about adapting the learning rate, but they never step back and modify their step. Is there something wrong with this approach?
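For concreteness, here is a minimal sketch of the scheme described above on a toy 1-D quadratic (the function, starting point, and constants are assumptions); the idea is essentially a crude backtracking line search:

```python
def cost(w):
    # Toy 1-D quadratic standing in for the real objective.
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

w = 10.0
lr = 1.2                      # deliberately too large, so the first step overshoots
prev = cost(w)
for _ in range(100):
    w_new = w - lr * grad(w)
    new = cost(w_new)
    if new > prev:            # the step jumped over the minimum and made things worse:
        lr *= 0.5             # reject the step ("step back") and shrink the learning rate
        continue
    w, prev = w_new, new

print(w, prev)                # w ends up close to 3.0, the true minimizer
```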
I have encountered this phrase a few times before, mostly in the context of neural networks and TensorFlow, but I get the impression it is something more general and not restricted to these environments.
Here, for example, they say that this "convolution warmup" process takes about 10k iterations.
Why do convolutions need to warm up? What prevents them from reaching their top speed right away?
One thing I can think of is memory allocation. If so, I would expect it to be resolved after 1 (or at least fewer than 10) iterations. Why 10k?
Edit for clarification: I understand that the warmup is a time period, or number of iterations, that has to pass until the convolution operator reaches its top speed (time per operation).
What I am asking is: why is it needed, and what happens during this time that makes the convolution faster?
Training neural networks works by presenting training data, calculating the output error, and backpropagating the error back to the individual connections. For symmetry breaking, the training doesn't start with all-zero weights, but with random connection strengths.
It turns out that with the random initialization, the first training iterations aren't really effective. The network isn't anywhere near to the desired behavior, so the errors calculated are large. Backpropagating these large errors would lead to overshoot.
A warmup phase is intended to get the initial network away from a random network, and towards a first approximation of the desired network. Once the approximation has been achieved, the learning rate can be accelerated.
This is an empirical result. The number of iterations will depend on the complexity of your problem domain, and therefore also on the complexity of the necessary network. Convolutional neural networks are fairly complex, so warmup is more important for them.
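If you want to apply this in training, a common pattern is a learning-rate warmup schedule: start with a small learning rate and ramp it up once the network has moved away from its random initialization. Here is a minimal Keras sketch; the base rate and warmup length are assumptions, not values from the question:

```python
import tensorflow as tf

class LinearWarmup(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Ramp the learning rate linearly from ~0 to base_lr over warmup_steps, then hold it."""

    def __init__(self, base_lr, warmup_steps):
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warm = self.base_lr * (step + 1.0) / float(self.warmup_steps)
        return tf.minimum(warm, self.base_lr)

# Assumed values for illustration only.
optimizer = tf.keras.optimizers.SGD(learning_rate=LinearWarmup(1e-2, 1000), momentum=0.9)
```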
You are not alone in observing that the time per iteration varies.
I ran the same example and had the same question. I can say the main reason is the varying input image shape and number of objects to detect.
I offer my test results for discussion.
I enabled tracing and got the timeline first; I found that the number of Conv2D occurrences varies between steps in the GPU "all compute" stream. Then I used export TF_CUDNN_USE_AUTOTUNE=0 to disable autotuning.
After that, there is the same number of Conv2D ops in the timeline, and the time is about 0.4s.
The time cost still differs between steps, but is much closer!
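For anyone who wants to repeat this experiment, here is a sketch of the tracing setup described above, written against the TensorFlow 1.x session API that this example presumably used; the toy conv graph is an assumption standing in for the real model:

```python
import os
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"   # disable cuDNN autotuning before TF touches the GPU

import numpy as np
import tensorflow as tf
from tensorflow.python.client import timeline

# A tiny conv graph just to have something to trace; replace with your own ops.
x = tf.placeholder(tf.float32, [1, 224, 224, 3])
y = tf.layers.conv2d(x, filters=8, kernel_size=3)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(y, options=run_options, run_metadata=run_metadata,
             feed_dict={x: np.zeros((1, 224, 224, 3), np.float32)})
    # Write a Chrome trace you can open at chrome://tracing to inspect the Conv2D ops.
    trace = timeline.Timeline(run_metadata.step_stats)
    with open("timeline_step.json", "w") as f:
        f.write(trace.generate_chrome_trace_format())
```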
I want to optimize KNN. There is a lot about SVM, RF, and XGBoost, but very little about KNN.
As far as I know, the number of neighbors is one parameter to tune.
But what other parameters should I test? Is there a good article?
Thank you
KNN is such a simple method that there is pretty much nothing to tune besides K. The whole method is literally:
for a given test sample x:
- find the K most similar samples in the training set, according to similarity measure s
- return the majority class vote among that set
Consequently, the only thing used to define KNN besides K is the similarity measure s, and that's all. There is literally nothing else in this algorithm (it has 3 lines of pseudocode). On the other hand, finding "the best similarity measure" is as hard a problem as learning a classifier itself, so there is no real method of doing so, and people usually end up using either something simple (Euclidean distance) or their domain knowledge to adapt s to the problem at hand.
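In practice that usually boils down to a small grid search over K, the distance metric, and distance weighting. A minimal scikit-learn sketch (the dataset and parameter values are just placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# The only real knobs: K, the similarity/distance measure, and (optionally) distance weighting.
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 11],
    "metric": ["euclidean", "manhattan", "chebyshev"],
    "weights": ["uniform", "distance"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```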
Lejlot pretty much summed it all up. K-NN is simple: it is an instance-based, nonparametric algorithm, which is what makes it so beautiful, and it works really well for certain specific examples. Most K-NN research is not in K-NN itself but in the computation and hardware that goes into it. If you'd like some reading on K-NN and machine learning algorithms, see Christopher Bishop's Pattern Recognition and Machine Learning. Warning: it is heavy on the mathematics, but machine learning and real computer science are all math.
If, by optimizing, you are also focusing on reducing prediction time (you should), there are other techniques you can apply to make the algorithm more efficient (though these are not parameter tuning). The major drawback of KNN is that as the number of training examples grows, the prediction time grows with it, so performance degrades.
To optimize, you can look into KNN with KD-trees, KNN with inverted lists (indexes), and KNN with locality-sensitive hashing (KNN with LSH).
These reduce the search space at prediction time, thus optimizing the algorithm.
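For example, with scikit-learn the tree-based index is a single argument; the data here is random and just for illustration (LSH would need a separate library):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X_train = rng.rand(10000, 8)
y_train = rng.randint(0, 2, size=10000)

# "kd_tree" (or "ball_tree") builds an index once, so each query searches a pruned
# region of the training set instead of scanning all N points.
knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree", leaf_size=30)
knn.fit(X_train, y_train)
print(knn.predict(rng.rand(3, 8)))
```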
I implemented the A3C network in https://arxiv.org/abs/1602.01783 in TensorFlow.
At this point I'm 90% sure the algorithm is implemented correctly. However, the network diverges after convergence. See the attached image that I got from a toy example where the maximum episode reward is 7.
When it diverges, the policy network starts giving a single action very high probability (>0.9) for most states.
What should I check for this kind of problem? Is there any reference for it?
Note that in Figure 1 of the original paper the authors say:
For asynchronous methods we average over the best 5 models from 50 experiments.
That can mean that in a lot of cases the algorithm does not work that well. From my experience, A3C often diverges, even after convergence. Careful learning-rate scheduling can help. Or do what the authors did: train several agents with different seeds and pick the one that performs best on your validation data. You could also employ early stopping when the validation error starts to increase.
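A minimal sketch of that seed-selection strategy; train_agent and validation_reward are hypothetical stand-ins for your own A3C training and evaluation code, not library functions, and the numbers are placeholders:

```python
import random

def train_agent(seed):
    """Hypothetical stand-in for one full A3C training run with the given seed."""
    random.seed(seed)
    return {"seed": seed}                 # pretend this is a trained policy

def validation_reward(agent):
    """Hypothetical stand-in for evaluating a trained policy on held-out episodes."""
    random.seed(agent["seed"])
    return random.uniform(0.0, 7.0)       # placeholder numbers, not real results

# Train several agents with different seeds and keep the one that does best on validation.
agents = [train_agent(seed) for seed in range(5)]
best = max(agents, key=validation_reward)
print("keeping the agent trained with seed", best["seed"])
```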
I'm new to this subject and am trying different things for escaping from a local minimum. I'm using a randomized learning rate and momentum, but for a small percentage of training runs it gets stuck and can't learn anything (sometimes it gets stuck at the beginning, sometimes in the middle), even with random starting weights and biases.
I tried several different settings for teaching XOR, such as:
1) Faster learning, but with a bigger chance of getting trapped in a local minimum
(learns in fewer than 1200 iterations total)
2) Slower learning, but better at evading local minima
(learns in under 40k iterations total)
3) Very steep learning with a ~50% chance of getting stuck (learns in under 300 iterations total)
Question: Is throwing several students into training and selecting the best learner worthwhile? Or do we need to concentrate on getting a 100% success rate for a single setting?
Example:
3 students (XOR candidates) learning in parallel:
- The first student learns fast (it finishes first and tells the others to stop, to save cycles)
- The other two are slow learners, to increase the success rate of training
There are many possible methods for escaping local minima. Parallel learning has been investigated in the past, with mixed results, but it never came into widespread use. Some researchers simply proposed repeated training with different parameters and/or starting points; others tried other training algorithms, such as simulated annealing, reporting good results.
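As a minimal sketch of that repeated-training idea on the XOR task from the question (the architecture, learning rates, and epoch count are assumptions), in Keras:

```python
import numpy as np
import tensorflow as tf

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([[0], [1], [1], [0]], dtype=np.float32)

def make_student(lr):
    # Each "student" gets fresh random weights and its own learning rate.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(4, activation="tanh", input_shape=(2,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr, momentum=0.9),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Train several students with different parameters/starting points and keep the best one.
best_model, best_acc = None, 0.0
for lr in (0.5, 0.1, 0.05):
    model = make_student(lr)
    model.fit(X, y, epochs=2000, verbose=0)
    acc = model.evaluate(X, y, verbose=0)[1]
    if acc > best_acc:
        best_model, best_acc = model, acc

print("best accuracy:", best_acc)
```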
The most recent methods include the so-called Extreme Learning Machines, where neural networks are trained in a heavily regularized form with the global minimum found using the Moore–Penrose pseudoinverse. If you are facing a problem with local minima in your work, I would suggest giving it a try, as it is a recent, powerful model that achieves surprisingly good results.
I do not really understand why you are referring to the XOR problem; AFAIK this problem has no local minima.
I have never heard anyone call a machine learning model a "student"; this makes the question quite odd to read. Some people use "learner", but "student"?