stopping gradient optimizer in TensorFlow - optimization

I'm trying to build a simple neural network in TensorFlow, but I have a question about gradient optimization.
It might be a naive question, but do I have to set conditions to stop the optimizer? Below is a sample printout from my network, and you can see that after iteration 66 (each iteration is one batch gradient descent pass over all the data), the cost begins to increase again. So is it up to me to make sure the optimizer stops at this point? (Note: I didn't put all the output here, but the cost begins to increase exponentially as the number of iterations increases.)
Thanks for any guidance.
iteration 64 with average cost of 654.621 and diff of 0.462708
iteration 65 with average cost of 654.364 and diff of 0.257202
iteration 66 with average cost of 654.36 and diff of 0.00384521
iteration 67 with average cost of 654.663 and diff of -0.302368
iteration 68 with average cost of 655.328 and diff of -0.665161
iteration 69 with average cost of 656.423 and diff of -1.09497
iteration 70 with average cost of 658.011 and diff of -1.58826

That's correct - the TensorFlow tf.train.Optimizer classes expose an operation that you can run to take one (gradient descent-style) step, but they do not monitor the current value of the cost or decide when to stop, so you may see increasing cost once the network begins to overfit.
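Since the optimizer won't stop itself, a common pattern is to wrap your own stopping condition around the training loop. Below is a minimal sketch of that idea, assuming a TF1-style setup where train_op, cost, and feed already exist from your own model construction (those names are placeholders, not from your code); it stops once the cost has failed to improve for a few iterations.

# Sketch: manual stopping condition around a TF1-style training loop.
# `train_op`, `cost`, and `feed` are assumed to come from your own model setup.
import tensorflow as tf

tolerance = 1e-3          # smallest decrease still counted as progress
patience = 3              # how many non-improving iterations to tolerate
best_cost = float("inf")
bad_iters = 0

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(1000):
        _, c = sess.run([train_op, cost], feed_dict=feed)
        print("iteration %d with average cost of %g" % (i, c))
        if best_cost - c > tolerance:
            best_cost = c
            bad_iters = 0
        else:
            bad_iters += 1
            if bad_iters >= patience:
                break  # cost stopped improving; stop training here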

Related

Asynchrony loss function over an array of 1D signals

So I have an array of N 1D signals (e.g. time series) with the same number of samples per signal (all at equal resolution), and I want to define a differentiable loss function that penalizes asynchrony among them and is therefore zero if all N signals are equal to each other. I've been searching the literature for something suitable but haven't had luck yet.
A few remarks:
1 - Since N (the number of signals) could be quite large, I cannot afford to calculate the mean squared loss between every single pair, which grows combinatorially. Also, I'm not quite sure whether that would be optimal in any mathematical sense for the goal I want to achieve.
There are two naive loss functions that I could think of:
a) Total variation loss at each time sample across all signals (to push toward ideally zero variation). The problem here is that the weight needs to be very large to yield zero variation, masking any other loss term that is going to be added; also, there is no inherent order among the N signals, which makes TV loss unsuitable to begin with.
b) Minimizing the sum of the variance at each time point across all signals (sketched below). However, I believe the choice of the reference for the variance (i.e. the mean) could be crucial, as just using the sample mean might not really yield the desired result; I'm not quite sure.
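For what it's worth, option (b) is straightforward to write down. Here is a minimal sketch assuming the N signals are stacked into a tensor of shape (N, T) and using TensorFlow 2.x (function and variable names are illustrative): the variance across signals is computed at each time point and averaged over time, which is differentiable and zero exactly when all signals coincide.

# Sketch: per-time-point variance loss over N signals of length T (TensorFlow 2.x).
# `signals` has shape (N, T); the loss is zero only when all N rows are identical.
import tensorflow as tf

def asynchrony_loss(signals):
    # Variance across the signal axis (axis 0) at each time point,
    # then averaged over time; built entirely from differentiable ops.
    per_time_variance = tf.math.reduce_variance(signals, axis=0)  # shape (T,)
    return tf.reduce_mean(per_time_variance)

# Tiny usage example with three identical toy signals:
x = tf.constant([[0.0, 1.0, 2.0],
                 [0.0, 1.0, 2.0],
                 [0.0, 1.0, 2.0]])
print(asynchrony_loss(x).numpy())  # 0.0, since all signals are equal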

Initial jump in loss with TensorFlow

Suppose I have a saved model that is nearly at the minimum, but with some room for improvement. For example, the loss (as reported by tf.keras.Models.model.evaluate() ) might be 11.390, and I know that the model can go down to 11.300.
The problem is that attempts to refine this model (using tf.keras.Models.model.fit()) consistently result in the weights receiving an initial 'jolt' during the first epoch, which sends the loss way upwards. After that, it starts to decrease, but it does not always converge on the correct minimum (and may not even get back to where it started.)
It looks like this:
tf.train.RMSPropOptimizer(0.0002):
0 11.982
1 11.864
2 11.836
3 11.822
4 11.809
5 11.791
(...)
15 11.732
tf.train.AdamOptimizer(0.001):
0 14.667
1 11.483
2 11.400
3 11.380
4 11.371
5 11.365
tf.keras.optimizers.SGD(0.00001):
0 12.288
1 11.760
2 11.699
3 11.650
4 11.666
5 11.601
Dataset with 30M observations, batch size 500K in all cases.
I can mitigate this by turning the learning rate way down, but then it takes forever to converge.
Is there any way to prevent training from going "wild" at the beginning, without impacting the long-term convergence rate?
As you already tried, decreasing the learning rate is the way to go.
E.g. learning rate = 0.00001
tf.train.AdamOptimizer(0.00001)
Especially with Adam this should be promising, since the learning rate also acts as an upper bound on the step size.
On top of that, you could try learning rate scheduling, where you set the learning rate according to a predefined schedule.
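For example, a simple warm-up style schedule can be set up with the built-in tf.keras.callbacks.LearningRateScheduler. The sketch below assumes an already compiled model, and the epoch cutoff and rates are illustrative values only:

# Sketch: warm-up learning-rate schedule via a Keras callback.
# `model` is assumed to be an already-built, compiled tf.keras model.
import tensorflow as tf

def schedule(epoch, lr):
    # Keep the learning rate small for the first few epochs to avoid the
    # initial "jolt", then switch to the normal rate.
    return 1e-5 if epoch < 3 else 1e-4

lr_callback = tf.keras.callbacks.LearningRateScheduler(schedule)
# model.fit(x_train, y_train, epochs=15, batch_size=500000, callbacks=[lr_callback])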
Also, judging from the results you show with the decreased learning rate, the convergence rate does not seem too bad.
Another hyperparameter you could tune in your case is the batch size; reducing it decreases the computation cost per update.
Note:
I find the term "not the right minimum" rather misleading. To further understand nonconvex optimization for artificial neural networks, I would like to point to the Deep Learning book by Ian Goodfellow et al.

What is the meaning of "parameter optimization of SVM by PSO"?

I can change the parameters C and epsilon manually to obtain an optimised result, but I found that there is such a thing as parameter optimization of SVM by PSO (or any other optimization algorithm), with no concrete algorithm given. What does it mean: how can PSO automatically optimize the SVM parameters? I read several papers on this topic, but I'm still not sure.
Particle Swarm Optimization is a technique that uses the ML parameters (SVM parameters, in your case) as its features.
Each "particle" in the swarm is characterized by those parameter values. For instance, you might have initial coordinates of
      degree   epsilon   gamma   C
p1    3        0.001     0.25    1.0
p2    3        0.003     0.20    0.9
p3    2        0.0003    0.30    1.2
p4    4        0.010     0.25    0.5
...
pn    ...
The "fitness" of each particle (p1-p4 shown here out of a population of n particles) is measured by the accuracy of the resulting model: the PSO algorithm trains and tests a model for each particle, returning that model's error rate as the value analogous to that from the training loss function (which it how the value is computed).
On each iteration, particles move toward the fittest neighbours. The process repeats until a maximum (hopefully the global one) appears as a convergence point. The process is loosely analogous to the familiar gradient descent family, although PSO itself does not compute gradients.
There are two basic PSO variants. In gbest (global best), every particle affects every other particle, sort of a universal gravitation principle. It converges quickly, but may well miss a global max in favor of a local max that happened to be nearer to the swarm's original center. In lbest (local best), a particle responds to only its k closest neighbors. This can form localized clusters; it converges more slowly, but is more likely to find the global max in a non-convex space.
I'll try to briefly explain enough to answer your clarification questions. If that doesn't work, I'm afraid you'll probably have to find someone to discuss this in front of a white board.
To use PSO, you have to decide which SVM parameters you'll try to optimize, and how many particles you want to use. PSO is a meta-algorithm, so its features are the SVM parameters. The PSO parameters are population (how many particles you want to use), update neighbourhood (the lbest size and a distance function; gbest is the all-inclusive case), and velocity (the learning rate for the SVM parameters).
For a bit of illustration, let's assume the particle table above, extended to a population of 20 particles. We'll use lbest with a neighbourhood of 4, and a velocity of 0.1. We choose (randomly, in a grid, or however we think might give us nice results) the initial values of degree, epsilon, gamma, and C for each of the 20 particles.
Each iteration of PSO works like this:
# Train the model described by each particle's "position"
For each of the 20 particles:
Train an SVM with the SVM input and the given parameters.
Test the SVM; return the error rate as the PSO loss function value.
# Update the particle positions
For each of the 20 particles:
Find the nearest 4 neighbours (using the PSO distance function).
Identify the neighbour with the lowest loss (the SVM's error rate).
Adjust this particle's features (degree, epsilon, gamma, C) 0.1 of the way toward that neighbour's features; 0.1 is our learning rate / velocity. (Yes, I realize that changing degree is not likely to happen, since it is a discrete value, without a special case in the update routine.)
Continue iterating through PSO until the particles have converged to your liking.
gbest is simply lbest with an infinite neighbourhood; in that case, you don't need a distance function on the particle space.
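To make the loop above concrete, here is a minimal runnable sketch of the lbest-style update described here, restricted to two SVM hyperparameters (C and gamma) and using scikit-learn's SVC with cross-validated error as the fitness. The particle count, neighbourhood size, velocity, search ranges, and dataset are illustrative assumptions, and the update is the simplified "move toward the fittest neighbour" rule from this answer rather than the full PSO velocity equations.

# Sketch: lbest-style PSO over SVM hyperparameters (C, gamma) with scikit-learn.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

n_particles, n_neighbours, velocity, n_iters = 20, 4, 0.1, 10
# Particle "positions": columns are log10(C) in [-2, 2] and log10(gamma) in [-4, 0].
positions = rng.uniform([-2.0, -4.0], [2.0, 0.0], size=(n_particles, 2))

def error_rate(pos):
    # Fitness = cross-validated error of an SVM trained with these parameters.
    C, gamma = 10.0 ** pos[0], 10.0 ** pos[1]
    accuracy = cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()
    return 1.0 - accuracy

for _ in range(n_iters):
    losses = np.array([error_rate(p) for p in positions])
    new_positions = positions.copy()
    for i, p in enumerate(positions):
        # Find this particle's 4 nearest neighbours (excluding itself).
        dists = np.linalg.norm(positions - p, axis=1)
        neighbours = np.argsort(dists)[1:n_neighbours + 1]
        # Move a fraction `velocity` of the way toward the fittest neighbour.
        fittest = neighbours[np.argmin(losses[neighbours])]
        new_positions[i] = p + velocity * (positions[fittest] - p)
    positions = new_positions

best = positions[np.argmin([error_rate(p) for p in positions])]
print("best C = %.4g, gamma = %.4g" % (10.0 ** best[0], 10.0 ** best[1]))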

How to correctly interpret NuPIC output vol.2

Here is a discussion about the correct interpretation of NuPIC output, which I would like to extend. First I will provide a short summary and then ask further questions.
Consider the following output:
step,original,prediction,anomaly score
175,0,0.0,0.32500000000000001
176,62,52.0,0.65000000000000002
177,402,0.0,1.0
178,0,0.0,0.125
179,402,0.0,1.0
180,0,0.0,0.0
181,3,402.0,0.050000000000000003
182,50,52.0,0.10000000000000001
183,68,13.0,0.90000000000000002
This is the output of one-step-ahead prediction without using the inference shifter. It basically means that the prediction made at step N is for step N+1. In other words, if the prediction is perfectly right, then the prediction value at step N should correspond to the original value at step N+1.
The anomaly score can be viewed as the confidence of the prediction. For example, NuPIC might only be 23% confident in the best prediction it gives, in which case the anomaly score could be very high. This is the case at step 179, where the prediction is 0 and the original value at step 180 is 0. Note that the anomaly score at step 179 is 1.0. It means that NuPIC was not confident in the prediction, despite the prediction being correct.
The opposite situation happens at step 180, where the prediction is 0 and the original value at step 181 is 3. Note that the anomaly score at step 180 is 0. That means that NuPIC was quite confident in the prediction, but it was not correct.
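A minimal sketch of that one-step shift, assuming the output above is saved to a CSV file (the file name "nupic_output.csv" is just an illustration): pair the prediction at step N with the original value at step N+1 and compare them alongside the anomaly score.

# Sketch: align one-step-ahead predictions with the next step's original value.
# Assumes the table above is saved as "nupic_output.csv" with the header
# "step,original,prediction,anomaly score" (the file name is illustrative).
import csv

with open("nupic_output.csv") as f:
    rows = list(csv.DictReader(f))

for current, nxt in zip(rows, rows[1:]):
    predicted = float(current["prediction"])   # prediction made at step N ...
    actual = float(nxt["original"])            # ... is for the original at step N+1
    print("step %s: predicted %.1f, actual %.1f, anomaly %.3f"
          % (current["step"], predicted, actual, float(current["anomaly score"])))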
Questions:
Does the anomaly score at a given step also take into account the original value at that step? For example, does the anomaly score at this step
181,3,402.0,0.050000000000000003
take into account that 3 is the original value, or is it computed without regard to this value?
Is it possible to get some kind of debug information relating the prediction and the anomaly score? I mean something like this, from NuPIC's perspective: I'm 23% sure that the next value will be 10, I'm 27% sure that the next value will be 20, I'm 50% sure that the next value will be 30.
Is it OK to predict data zero steps forward if I'm just interested in the prediction accuracy?
Does NuPIC do some kind of look-back? I mean, if NuPIC was confident at step 180 that the next value would be 0 but it later turns out that this was a mistake, does NuPIC somehow recompute the anomaly score from step 180 for further data processing? Or is this done automatically in HTM?

How to speed up the rjags model training in Bayesian ranking?

All,
I am doing Bayesian modeling using rjags. However, when the number of observations is larger than 1000, the graph size is too big.
More specifically, I am working on a Bayesian ranking problem. Traditionally, one observation means one X[i, 1:N]-Y[i] pair, where X[i, 1:N] means the i-th item is represented by an N-sized predictor vector, and Y[i] is a response. The objective is to minimize the point-wise error of the predicted values, for example the least-squares error.
A ranking problem is different. Since we care more about the order, we use a pair-wise 1-0 indicator to represent the order between Y[i] and Y[j]: for example, when Y[i] > Y[j], I(i,j) = 1; otherwise I(i,j) = 0. We treat each 1-0 indicator as an observation. Therefore, assuming we have K items Y[1:K], the number of indicators is 0.5*K*(K-1). Hence when K is increased from 500 to 5000, the number of observations grows roughly quadratically, from about 500^2/2 to 5000^2/2 (see the quick check below). The graph size of the rjags model is large too, for example graph size > 500,000, and the log-posterior will be very small.
And it takes a long time to complete the training; I think the time consumed is >40 hours. It is not practical for me to do further experiments. Therefore, do you have any ideas for speeding up rjags? I have heard that RStan is faster than rjags. Has anyone had a similar experience?
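As a quick sanity check on that quadratic growth (plain Python arithmetic, independent of the rjags model itself):

# Quick check: number of pairwise 1-0 indicators, 0.5*K*(K-1), for K items.
for K in (500, 5000):
    pairs = K * (K - 1) // 2
    print("K = %d -> %d indicator observations" % (K, pairs))
# Prints 124750 for K = 500 and 12497500 for K = 5000, roughly a 100x increase.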