Neural network, local minimum evasion techniques - optimization

I'm new to this subject and am trying different ways of escaping a local minimum. I'm using a randomized learning rate and momentum, but in a small percentage of training runs the network gets stuck and cannot learn anything (sometimes at the beginning, sometimes in the middle), even with random starting weights and biases.
I tried several different settings for teaching XOR, such as:
1) Faster learning, but with a bigger chance of getting trapped locally (learns in fewer than 1200 iterations total).
2) Slow learning, but better at evading local minima (learns in under 40k iterations total).
3) Very steep learning with a ~50% chance of a pitfall (learns in under 300 iterations total).
Question: Is it worthwhile to throw several students into training and select the best learner? Or should we concentrate on getting a 100% success rate for a single setting?
Example:
3 students (XOR candidates) learning in parallel:
- The first student learns fast (it learns first and tells the others to stop, to save cycles)
- The other two are slow learners, to increase the success rate of training

There are many possible methods of escaping local minima. Parallel learning has been investigated in the past, with mixed results, but it never reached widespread use. Some researchers simply proposed repeated training with different parameters and/or starting points; others tried other training algorithms, such as simulated annealing, and reported good results.
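As a rough illustration of the repeated-training idea, here is a minimal sketch that trains several small XOR networks with different seeds and learning rates and keeps the one with the lowest error. The names `train_xor` and `best_of`, the 2-3-1 sigmoid architecture and the tolerance are all assumptions made for this example, not something taken from the question.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_xor(seed, lr, hidden=3, max_iters=5000, tol=0.01):
    """Plain gradient descent on a 2-hidden-1 sigmoid network (illustrative).
    Returns (mean squared error at the last evaluation, iterations used)."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(size=(2, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(size=(hidden, 1))
    b2 = np.zeros(1)
    for it in range(1, max_iters + 1):
        H = sigmoid(X @ W1 + b1)
        out = sigmoid(H @ W2 + b2)
        err = out - Y
        mse = float(np.mean(err ** 2))
        if mse < tol:
            break
        # Backpropagation; constant factors are absorbed into lr.
        d_out = err * out * (1.0 - out)
        d_hid = (d_out @ W2.T) * H * (1.0 - H)
        W2 -= lr * (H.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (X.T @ d_hid)
        b1 -= lr * d_hid.sum(axis=0)
    return mse, it

def best_of(candidates):
    """Train one network per (seed, lr) pair and return the lowest-error run."""
    runs = []
    for seed, lr in candidates:
        mse, iters = train_xor(seed, lr)
        runs.append({"seed": seed, "lr": lr, "mse": mse, "iters": iters})
    return min(runs, key=lambda r: r["mse"])

best = best_of([(0, 2.0), (1, 0.5), (2, 0.1)])
print(best)
```

This version runs the candidates one after another; a parallel variant could instead stop the remaining runs as soon as one of them reaches the tolerance.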
More recent methods include so-called Extreme Learning Machines, where neural networks are trained in a heavily regularized form, with the global minimum found using the Moore–Penrose pseudoinverse. If you are facing local minima in your work, I would suggest giving them a try: it is a recent, powerful model that achieves surprisingly good results.
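To make the pseudoinverse idea concrete, here is a minimal sketch of an Extreme Learning Machine style fit: the hidden layer is random and never trained, and only the output weights are solved for in closed form. The function names, the tanh activation and the hidden size are assumptions for illustration, and the regularization mentioned above is left out for brevity.

```python
import numpy as np

def fit_elm(X, y, hidden=20, seed=0):
    """Fit output weights by least squares on top of a fixed random hidden layer."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], hidden))  # random input weights, never updated
    b = rng.normal(size=hidden)                # random hidden biases, never updated
    H = np.tanh(X @ W + b)                     # hidden-layer activations
    beta = np.linalg.pinv(H) @ y               # Moore-Penrose pseudoinverse solution
    return W, b, beta

def predict_elm(model, X):
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)
model = fit_elm(X, y)
print(np.round(predict_elm(model, X), 3))
```

Because the only trained part is a linear least-squares problem, there is no iterative descent that could get stuck; the quality of the fit depends on the random hidden features instead.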
I do not really understand why you are referring to the XOR problem; AFAIK this problem has no local minima.
I have never heard anyone call a machine learning model a "student", which makes the question quite odd to read. Some people use "learner", but "student"?

Related

Neural network hyperparameter tuning - is setting random seed a good idea? [closed]

I am trying to tune a basic neural network as practice. (Based on an example from a Coursera course: Neural Networks and Deep Learning - DeepLearning.AI)
I face the issue of random weight initialization. Let's say I try to tune the number of layers in the network.
I have two options:
1. Set the random seed to a fixed value
2. Run my experiments several times without setting the seed
Both versions have pros and cons.
My biggest concern is that if I use a random seed (e.g. tf.random.set_seed(1)), then the values I settle on can be "over-fitted" to the seed and may not work well without the seed or if its value is changed (e.g. tf.random.set_seed(1) -> tf.random.set_seed(2)). On the other hand, if I run my experiments several times without a random seed, then I can inspect fewer options (due to limited computing capacity) and still only inspect a subset of the possible random weight initializations.
In both cases I feel that luck is a strong factor in the process.
Is there a best practice for handling this?
Does TensorFlow have built-in tools for this purpose? I would appreciate any pointers to descriptions or tutorials. Thanks in advance!
Tuning hyperparameters in deep learning (and in machine learning generally) is a common issue. Setting the random seed to a fixed number ensures reproducibility and a fair comparison: repeating the same experiment will lead to the same outcome. As you probably know, the best practice for avoiding over-fitting is to do a train-test split of your data and then use k-fold cross-validation to select the optimal hyperparameters. If you test multiple values of a hyperparameter, you want to make sure that the other circumstances that might influence the performance of your model (e.g. the train-test split or the weight initialization) are the same for each value, in order to compare performance fairly. Therefore I would always recommend fixing the seed.
Now, the problem with this is, as you already pointed out, that the performance of each model will still depend on the random seed, through the particular data split or, in your case, the weight initialization. To avoid this, one can do repeated k-fold cross-validation. That means you repeat the k-fold cross-validation multiple times, each time with a different seed, select the best parameters of that run, test on the test data, and average the final results to get a good estimate of performance and variance, thereby reducing the influence the seed has on the validation process.
Alternatively, you can perform k-fold cross-validation a single time and train each split n times with a different random seed (averaging out the effect of the weight initialization, but still keeping the effect of the train-test split).
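A minimal, framework-free sketch of the repeated k-fold idea might look like the following. The helper names and the `fit_and_score` callback are hypothetical; in a real setup the callback would build and train the network with the given seed and return its validation score. The majority-class baseline is only there to make the example runnable.

```python
import random
import statistics

def kfold_indices(n_samples, k, seed):
    """Shuffle the sample indices with the given seed and cut them into k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def repeated_kfold(n_samples, k, seeds, fit_and_score):
    """Run k-fold CV once per seed; return mean and std over all fold scores."""
    scores = []
    for seed in seeds:
        folds = kfold_indices(n_samples, k, seed)
        for i, val_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(fit_and_score(train_idx, val_idx, seed))
    return statistics.mean(scores), statistics.stdev(scores)

# Toy stand-in for "train a model and report validation accuracy".
labels = [0] * 30 + [1] * 20

def majority_baseline(train_idx, val_idx, seed):
    train = [labels[i] for i in train_idx]
    guess = max(set(train), key=train.count)
    return sum(labels[i] == guess for i in val_idx) / len(val_idx)

mean_acc, std_acc = repeated_kfold(len(labels), k=5, seeds=[0, 1, 2],
                                   fit_and_score=majority_baseline)
print(mean_acc, std_acc)
```

Comparing hyperparameter settings on the mean (and keeping an eye on the standard deviation) makes the choice less dependent on any single seed.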
Finally, TensorFlow has no built-in tool for this purpose. You, as the practitioner, have to take care of this.
There is no absolutely right or wrong answer to your question. You have almost answered your own question already. In what follows, however, I will try to expand on it via the following points:
The purpose of random initialization is to break the symmetry that makes neural networks fail to learn:
... the only property known with complete certainty is that the
initial parameters need to “break symmetry” between different units.
If two hidden units with the same activation function are connected to
the same inputs, then these units must have different initial
parameters. If they have the same initial parameters, then a
deterministic learning algorithm applied to a deterministic cost and
model will constantly update both of these units in the same way...
Deep Learning (Adaptive Computation and Machine Learning series)
Hence, we need the neural network components (especially the weights) to be initialized with different values. There are some rules of thumb for choosing those values, such as Xavier initialization, which samples from a normal distribution with a mean of 0 and a variance based on the number of input and output units of the layer. This is a very interesting article to read.
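For illustration, the normal variant of Xavier (Glorot) initialization can be sketched in a few lines, using a standard deviation of sqrt(2 / (fan_in + fan_out)). The function name `glorot_normal` is just a label chosen for this example; deep learning frameworks ship their own initializers.

```python
import math
import random

def glorot_normal(fan_in, fan_out, seed=None):
    """Return a fan_in x fan_out weight matrix drawn from N(0, 2 / (fan_in + fan_out))."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

weights = glorot_normal(4, 3, seed=0)
for row in weights:
    print([round(w, 3) for w in row])
```

Every unit gets different starting weights, which breaks the symmetry, while the scale shrinks as layers get wider so that activations neither blow up nor die out.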
That said, the initial values are important but not extremely critical if proper rules such as the one above are followed. They are important because large or otherwise improper values may lead to vanishing or exploding gradients. On the other hand, different "proper" weights should not hugely change the final results unless they cause those problems or get the neural network stuck at some local minimum. Please note, however, that the latter also depends on many other aspects, such as the learning rate, the activation functions used (some explode/vanish more than others: this is a great comparison), the architecture of the neural network (e.g. fully connected, convolutional, etc.: this is a cool paper) and the optimizer.
In addition, a good optimizer, other than standard stochastic gradient descent, should in theory keep the initial values from noticeably affecting the quality of the final results. A good example is Adam, which provides a very adaptive learning technique.
If you still get noticeably different results with different "proper" initial weights, there are some ways that might help make the neural network more stable, for example: use a train-test split, use GridSearchCV to find the best parameters, and use k-fold cross-validation.
In the end, the best scenario is obviously to train the same network many times with different random initial weights and then report the average result and the variance, for a more reliable judgement of overall performance. How many times? Well, if you can do it hundreds of times, all the better, yet that is clearly almost impractical (unless you have some Google-scale hardware capability and capacity). As a result, we come to the same conclusion that you reached in your question: there is a tradeoff between time and space complexity on one side and the reliability of using a seed on the other, taking into consideration the rules of thumb mentioned in the previous points. Personally, I am okay with using a seed because I believe that "It’s not who has the best algorithm that wins. It’s who has the most data" (Banko and Brill, 2001). Hence, using a seed with enough data samples (what counts as enough is subjective, but the more the better) should not cause any concern.

Neural Network optimization

I am trying to understand the purpose of the ReduceLROnPlateau() function in Keras.
I understand that this function reduces the learning rate when there is no improvement in the validation loss. But won't this prevent the network from getting out of a local minimum? What if the network stays at a local minimum for about 5 epochs and this function reduces the learning rate further, when increasing the learning rate would actually help the network get out of that local minimum?
In other words, how will it know whether it has reached a local minimum or a plateau?
First up, here is a good explanation from the CS231n class of why learning rate decay is reasonable in general:
In training deep networks, it is usually helpful to anneal the
learning rate over time. Good intuition to have in mind is that with a
high learning rate, the system contains too much kinetic energy and
the parameter vector bounces around chaotically, unable to settle down
into deeper, but narrower parts of the loss function. Knowing when to
decay the learning rate can be tricky: Decay it slowly and you’ll be
wasting computation bouncing around chaotically with little
improvement for a long time. But decay it too aggressively and the
system will cool too quickly, unable to reach the best position it
can.
Concerning your question: unfortunately, you can't know. If the optimizer hits a deep valley and can't get out of it, it simply hopes that this valley is good and worth exploring with a smaller learning rate. Currently, there is no technique to tell whether there are better valleys, i.e. whether it is a local or a global minimum. So the optimizer makes a bet on exploring the current one rather than jumping far away and starting over. As it turns out, in practice no local minimum is much worse than the others, which is why this strategy often works.
Also note that the loss surface may appear like a plateau for some learning rate, but not for 10 times smaller learning rate. So "escape the plateau" and "escape local minimum" are different challenges, and ReduceLROnPlateau aims for the first one.
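To show the mechanism, here is a small, framework-free sketch of a reduce-on-plateau rule. It is not the Keras implementation: it only mirrors the idea behind the callback's `factor`, `patience` and `min_lr` arguments and leaves out refinements such as a minimum improvement threshold or a cooldown period. The class name `PlateauScheduler` is made up for this example.

```python
class PlateauScheduler:
    """Shrink the learning rate when the monitored loss stops improving."""

    def __init__(self, lr, factor=0.1, patience=5, min_lr=0.0):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.wait = 0  # epochs since the last improvement

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the new lr."""
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr

sched = PlateauScheduler(lr=0.1, factor=0.5, patience=3)
for loss in [1.0, 0.9, 0.9, 0.9, 0.9, 0.8]:
    print(loss, sched.step(loss))
```

Note that the rule only ever sees the sequence of loss values, so it has no way to distinguish a plateau from a local minimum; it simply bets that smaller steps will help.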

What parameters to optimize in KNN?

I want to optimize KNN. There is a lot of material about SVM, RF and XGBoost, but very little about KNN.
As far as I know, the number of neighbors is one parameter to tune.
But what other parameters should I test? Is there any good article?
Thank you
KNN is such a simple method that there is pretty much nothing to tune besides K. The whole method is literally:
for a given test sample x:
- find K most similar samples from training set, according to similarity measure s
- return the majority vote of the class from the above set
Consequently, the only thing used to define KNN besides K is the similarity measure s, and that's all. There is literally nothing else in this algorithm (it is 3 lines of pseudocode). On the other hand, finding "the best similarity measure" is as hard a problem as learning a classifier itself, so there is no real method for doing it, and people usually end up either using something simple (Euclidean distance) or using their domain knowledge to adapt s to the problem at hand.
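The pseudocode above translates almost directly into Python. In this sketch the two things you can tune are exposed as the arguments `k` and `distance`; the function names and the toy data are made up for illustration, and a distance (smaller means more similar) is used in place of a similarity.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.dist(a, b)

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_X, train_y, x, k=3, distance=euclidean):
    """Return the majority label among the k training samples closest to x."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: distance(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_X, train_y, (0.2, 0.1)))
print(knn_predict(train_X, train_y, (5.5, 5.5), distance=manhattan))
```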
Lejlot pretty much summed it all up. K-NN is so simple because it is an instance-based, nonparametric algorithm; that is what makes it so beautiful, and it works really well for certain specific problems. Most K-NN research is not about K-NN itself but about the computation and hardware that go into it. If you'd like some reading on K-NN and machine learning algorithms, try Christopher Bishop's Pattern Recognition and Machine Learning. Warning: it is heavy on the mathematics, but machine learning and real computer science are all math.
If by optimizing you also mean reducing prediction time (and you should), then there are other things you can implement to make the algorithm more efficient (but these are not parameter tuning). The major drawback of KNN is that prediction time grows with the number of training examples, so performance drops.
To optimize, you can look at KNN with KD-trees, KNN with inverted lists (an index) and KNN with locality-sensitive hashing (KNN with LSH).
These reduce the search space at prediction time, thus optimizing the algorithm.
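As a rough sketch of the KD-tree option, the following builds a tree over the training points and finds the single nearest neighbor while skipping branches that cannot contain a closer point. It handles only K = 1 to stay short (a K > 1 version would keep a bounded heap of candidates), and the dictionary-based node layout is simply a choice made for this example.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split the points at the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, target, best=None):
    """Return the stored point closest to target (Euclidean distance)."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], target) < math.dist(best, target):
        best = node["point"]
    axis = node["axis"]
    diff = target[axis] - node["point"][axis]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, target, best)
    # The far side can only help if the splitting plane is closer than the best match.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (9, 2)))
```

The pruning test is where the saving comes from: whole subtrees are skipped instead of computing the distance to every training sample.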

Deep neural network diverges after convergence

I implemented the A3C network in https://arxiv.org/abs/1602.01783 in TensorFlow.
At this point I'm 90% sure the algorithm is implemented correctly. However, the network diverges after convergence. See the attached image that I got from a toy example where the maximum episode reward is 7.
When it diverges, the policy network starts giving a single action a very high probability (>0.9) for most states.
What should I check for this kind of problem? Is there any reference for it?
Note that in Figure 1 of the original paper the authors say:
For asynchronous methods we average over the best 5
models from 50 experiments.
That can mean that in a lot of cases the algorithm does not work that well. In my experience, A3C often diverges, even after convergence. Careful learning-rate scheduling can help. Or do what the authors did: train several agents with different seeds and pick the one that performs best on your validation data. You could also employ early stopping when the validation error starts to increase.
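The early-stopping suggestion can be sketched as a small helper that watches a validation score and reports which checkpoint to keep. It assumes that a higher score is better (for example, average episode reward); the function name and the `patience` value are illustrative choices, not something from the paper.

```python
def early_stopping_epoch(val_scores, patience=3):
    """Return (best_epoch, stop_epoch) for scores where higher is better.

    Training stops once `patience` epochs have passed without a new best;
    the weights saved at best_epoch are the ones to keep.
    """
    best_epoch, best = 0, float("-inf")
    for epoch, score in enumerate(val_scores):
        if score > best:
            best_epoch, best = epoch, score
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_scores) - 1

# Reward climbs to the maximum of 7 and then collapses, as in the question.
rewards = [1, 3, 5, 7, 7, 6, 4, 2, 1]
print(early_stopping_epoch(rewards))
```

With the reward curve above, the helper keeps the checkpoint from the epoch where the reward first reached 7 and stops a few epochs into the collapse.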

Using tensorflow to identify lego bricks?

Having read this article about a guy who uses TensorFlow to sort cucumbers into nine different classes, I was wondering whether this type of process could be applied to a large number of classes. My idea would be to use it to identify Lego parts.
At the moment, a site like Bricklink describes more than 40,000 different parts, so it would be a bit different from the cucumber example, but I am wondering whether it sounds suitable. There is no easy way to get hundreds of pictures of each part, but does the following process sound feasible:
take pictures of a part;
try to identify the part using TensorFlow;
if it does not identify the correct part, take more pictures and feed the neural network with them;
go on with the next part.
That way, each time we encounter a new piece we "teach" the network its reference so that it can be recognized better the next time. Proceeding like that, after hundreds of iterations monitored by a human, could we expect TensorFlow to be able to recognize the parts? At least the most common ones?
My question might sound stupid but I am not into neural networks so any advice is welcome. At the moment I have not found any way to identify a lego part based on pictures and this "cucumber example" sounds promising so I am looking for some feedback.
Thanks.
You can read about the work of Jacques Mattheij; he uses a customized version of Xception [1] running on Keras (https://keras.io/).
The introduction is "Sorting 2 Metric Tons of Lego".
In "Sorting 2 Tons of Lego, The Software Side" you can read:
The hard challenge to deal with next was to get a training set large
enough to make working with 1000+ classes possible. At first this
seemed like an insurmountable problem. I could not figure out how to
make enough images and to label them by hand in acceptable time, even
the most optimistic calculations had me working for 6 months or longer
full-time in order to make a data set that would allow the machine to
work with many classes of parts rather than just a couple.
In the end the solution was staring me in the face for at least a week
before I finally clued in: it doesn’t matter. All that matters is that
the machine labels its own images most of the time and then all I need
to do is correct its mistakes. As it gets better there will be fewer
mistakes. This very rapidly expanded the number of training images.
The first day I managed to hand-label about 500 parts. The next day
the machine added 2000 more, with about half of those labeled wrong.
The resulting 2500 parts where the basis for the next round of
training 3 days later, which resulted in 4000 more parts, 90% of which
were labeled right! So I only had to correct some 400 parts, rinse,
repeat… So, by the end of two weeks there was a dataset of 20K images,
all labeled correctly.
This is far from enough, some classes are severely under-represented
so I need to increase the number of images for those, perhaps I’ll
just run a single batch consisting of nothing but those parts through
the machine. No need for corrections, they’ll all be labeled
identically.
A recent update is Sorting 2 Tons of Lego, Many Questions, Results.
[1] Chollet, François. Xception: Deep Learning with Depthwise Separable Convolutions. arXiv preprint arXiv:1610.02357, 2016.
I have started this using IBM Watson's Visual Recognition.
I had six different bricks to be recognized on the transport belt background.
I am actually thinking about TensorFlow, since I can have it running locally.
The codelab "TensorFlow for Poets" describes almost exactly what you want to achieve.
For a demo of the Watson version:
https://www.ibm.com/developerworks/community/blogs/ibmandgoogle/entry/Lego_bricks_recognition_with_Watosn_lego_and_raspberry_pi?lang=en