How to make scipy.optimize.differential_evolution ignore 'nan' evaluations?

I am trying to optimize a complex PDE using differential_evolution. For some parameter sets my function returns 'nan', and the optimizer takes that as the result of the evaluation. How can I tell the optimizer to ignore the result when it gets a 'nan' value?
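A common workaround (a sketch, not taken from the question; pde_objective below is a hypothetical stand-in for the real PDE evaluation) is to wrap the objective so that non-finite evaluations return a large finite penalty. differential_evolution then ranks those candidates as the worst, so they never survive selection:

import numpy as np
from scipy.optimize import differential_evolution

def pde_objective(x):
    # hypothetical stand-in for the real PDE solve; fails (NaN) for x < 0.2
    if x[0] < 0.2:
        return np.nan
    return (x[0] - 0.7) ** 2

def safe_objective(x):
    val = pde_objective(x)
    # map NaN/inf to a huge penalty so these candidates are never kept
    return 1e10 if not np.isfinite(val) else val

result = differential_evolution(safe_objective, bounds=[(0, 1)])
print(result.x, result.fun)

A large finite penalty tends to be safer than returning np.inf, since differential_evolution's convergence criterion is based on the spread of the population energies, which stays well defined with finite values.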

Related

scipy-optimize-minimize does not perform the optimization - CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL

I am trying to minimize a function defined as follows:
utility(decision) = decision * (risk - cost)
where variables take the following form:
decision = binary array
risk = array of floats
cost = constant
I know the solution will take the form of:
decision = 1 if (risk >= threshold)
decision = 0 otherwise
Therefore, in order to minimize this function, I can transform utility so that it depends only on this threshold. My direct translation to scipy is the following:
import numpy as np
from scipy.optimize import minimize

def utility(threshold, risk, cost):
    selection_list = [float(risk[i]) >= threshold for i in range(len(risk))]
    v = np.array(risk.astype(float)) - cost
    total_utility = np.dot(v, selection_list)
    return -1.0 * total_utility

result = minimize(fun=utility, x0=0.2, args=(r, c), bounds=[(0, 1)], options={"disp": True})
This gives me the following result:
      fun: array([-17750.44298655])
 hess_inv: <1x1 LbfgsInvHessProduct with dtype=float64>
      jac: array([0.])
  message: b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
     nfev: 2
      nit: 0
   status: 0
  success: True
        x: array([0.2])
However, I know the result is wrong because in this case it must be equal to the cost. On top of that, no matter what x0 I use, the optimizer always returns it unchanged as the result. Looking at the output, I observe that the Jacobian is 0 and that not a single iteration is performed (nit: 0).
Looking more thoroughly into the function, I plotted it and observed that it is not convex near the limits of the bounds, but the minimum at 0.1 is clearly visible. However, no matter how much I adjust the bounds to stay within the convex part only, the result is still the same.
What could I do to minimize this function?
The error message tells you that the gradient was at some point too small and thus numerically the same as zero. This is likely due to the thresholding that you do when you calculate your selection_list. There you say float(risk[i]) >= threshold, which has derivative 0 almost everywhere. Hence, almost every starting value will give you the warning you receive.
A solution could be to apply some smoothing to the thresholding operation. So instead of float(risk[i]) >= threshold, you would use a continuous function:
def g(x):
    return 1. / (1 + np.exp(-x))
With this function, you can express the thresholding operation as
g(a * (risk[i] - threshold)), with a parameter a. The larger a, the closer this modified function is to what you are doing so far; at something like a=20 you would probably have pretty much the same behavior that you have at the moment. You can therefore derive a sequence of solutions: start with a=1, take that solution as the starting value for the same problem with a=2, take that solution as the starting value for the problem with a=4, and so on. At some point, you will notice that changing a no longer changes the solution, and you're done.
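A minimal sketch of this continuation scheme, assuming hypothetical data r and cost c (the question does not include them); smooth_utility is the smoothed version of the utility above:

import numpy as np
from scipy.optimize import minimize

def g(x):
    return 1. / (1 + np.exp(-x))

def smooth_utility(threshold, risk, cost, a):
    # smooth replacement for the hard selection (risk >= threshold)
    selection = g(a * (risk - threshold))
    v = risk.astype(float) - cost
    return -1.0 * np.dot(v, selection)

rng = np.random.default_rng(0)
r = rng.random(1000)   # hypothetical risks in [0, 1)
c = 0.1                # hypothetical cost

x0 = 0.2
for a in [1, 2, 4, 8, 16, 32]:
    res = minimize(fun=smooth_utility, x0=x0, args=(r, c, a), bounds=[(0, 1)])
    x0 = res.x  # warm-start the next, sharper problem
print(res.x)    # should approach the true threshold, which is c = 0.1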

Will using multiple minimizing ops at once work as expected in Tensorflow?

For example, if I do:
loss_one = something
loss_two = something_else
train_one = tf.train.AdamOptimizer(0.001).minimize(loss_one)
train_two = tf.train.AdamOptimizer(0.001).minimize(loss_two)
sess.run([train_one, train_two])
Will that do what's expected? The reason I'm concerned is that I don't know exactly how gradients are accumulated. Are they stored on the optimizers themselves? Or on the variables? If it's the latter, I can imagine them interfering.
Most likely not. Presumably, both loss_one and loss_two are a measure of how close the output of your model, let's say out, is to what you expected, so they would both be a function of out and maybe something else. Both optimizers compute the variable updates from the out computed with the values that the variables had before calling session.run. So if you apply one update and then the other, the second update would not be really correct, because it has not been computed using the now-updated variables. This may not be a huge issue though, since the updates are typically small. A more complicated problem is that, depending on how exactly the optimizer is implemented, if it is something more or less like variable = variable + update, then it is not deterministic whether the variable on the right-hand side of the expression has the original or the first-updated value, so you could end up applying only one of the updates or both, non-deterministically.
There are several better alternatives:
Use only one optimizer at a time, so you call sess.run(train_one) first and sess.run(train_two) later.
Optimize the (possibly weighted) sum of both losses (tf.train.AdamOptimizer(0.001).minimize(loss_one + loss_two)).
Call compute_gradients from the optimizer for each loss value, combine the resulting gradients however you see fit (e.g. adding or averaging them) and apply them with apply_gradients (see the sketch after this list).
Use tf.control_dependencies to make sure that one optimization step always takes place after the other. However, this means that using the second optimizer will always require using the first one (this could be worked around, maybe with tf.cond, but it's more of a hassle).
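As a sketch of option 3, reusing loss_one, loss_two and sess from the question (summing the gradients is just one possible way to combine them):

optimizer = tf.train.AdamOptimizer(0.001)

# both gradient lists are computed from the same (pre-update) variable values
grads_one = optimizer.compute_gradients(loss_one)
grads_two = optimizer.compute_gradients(loss_two)

# combine per variable; here we simply add the gradients
combined = {}
for grad, var in grads_one + grads_two:
    if grad is None:
        continue
    combined[var] = grad if var not in combined else combined[var] + grad

train_op = optimizer.apply_gradients([(g, v) for v, g in combined.items()])
sess.run(train_op)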
The optimizer is mainly in charge of calculating the gradients (backpropagation). If you give it the loss twice (running it two times, as you are doing), it will update the gradients twice while performing inference only once. Not sure why you would do that, though.

Errors to fit parameters of scipy.optimize

I use the scipy.optimize.minimize (https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html) function with method='L-BFGS-B'.
An example of what it returns is shown below:
fun: 32.372210618549758
hess_inv: <6x6 LbfgsInvHessProduct with dtype=float64>
jac: array([ -2.14583906e-04, 4.09272616e-04, -2.55795385e-05,
3.76587650e-05, 1.49213975e-04, -8.38440428e-05])
message: 'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
nfev: 420
nit: 51
status: 0
success: True
x: array([ 0.75739412, -0.0927572 , 0.11986434, 1.19911266, 0.27866406,
-0.03825225])
The x value correctly contains the fitted parameters. How do I compute the errors associated with those parameters?
TL;DR: You can actually place an upper bound on how precisely the minimization routine has found the optimal values of your parameters. See the snippet at the end of this answer that shows how to do it directly, without resorting to calling additional minimization routines.
The documentation for this method says
The iteration stops when (f^k - f^{k+1})/max{|f^k|,|f^{k+1}|,1} <= ftol.
Roughly speaking, the minimization stops when the value of the function f that you're minimizing is minimized to within ftol of the optimum. (This is a relative error if f is greater than 1, and absolute otherwise; for simplicity I'll assume it's an absolute error.) In more standard language, you'll probably think of your function f as a chi-squared value. So this roughly suggests that you would expect

Δf ≲ ftol

Of course, just the fact that you're applying a minimization routine like this assumes that your function is well behaved, in the sense that it's reasonably smooth and the optimum being found is well approximated near the optimum by a quadratic function of the parameters x_i:

Δf ≈ (1/2) Σ_ij H_ij Δx_i Δx_j

where Δx_i is the difference between the found value of parameter x_i and its optimal value, and H_ij is the Hessian matrix. A little (surprisingly nontrivial) linear algebra gets you to a pretty standard result for an estimate of the uncertainty in any quantity X that's a function of your parameters x_i (dropping factors of order one):

(ΔX)^2 ≈ Δf Σ_ij (∂X/∂x_i) (H^-1)_ij (∂X/∂x_j)

which lets us write

ΔX ≈ sqrt( ftol · Σ_ij (∂X/∂x_i) (H^-1)_ij (∂X/∂x_j) )

That's the most useful formula in general, but for the specific question here, we just have X = x_i, so this simplifies to

Δx_i ≈ sqrt( ftol · (H^-1)_ii )
Finally, to be totally explicit, let's say you've stored the optimization result in a variable called res. The inverse Hessian is available as res.hess_inv, which is a function that takes a vector and returns the product of the inverse Hessian with that vector. So, for example, we can display the optimized parameters along with the uncertainty estimates with a snippet like this:
import numpy as np

ftol = 2.220446049250313e-09   # default ftol for L-BFGS-B
tmp_i = np.zeros(len(res.x))
for i in range(len(res.x)):
    tmp_i[i] = 1.0
    # i-th diagonal element of the inverse Hessian
    hess_inv_i = res.hess_inv(tmp_i)[i]
    uncertainty_i = np.sqrt(max(1, abs(res.fun)) * ftol * hess_inv_i)
    tmp_i[i] = 0.0
    print('x^{0} = {1:12.4e} ± {2:.1e}'.format(i, res.x[i], uncertainty_i))
Note that I've incorporated the max behavior from the documentation, assuming that f^k and f^{k+1} are basically just the same as the final output value, res.fun, which really ought to be a good approximation. Also, for small problems, you can just use np.diag(res.hess_inv.todense()) to get the full inverse and extract the diagonal all at once. But for large numbers of variables, I've found that to be a much slower option. Finally, I've added the default value of ftol, but if you change it in an argument to minimize, you would obviously need to change it here.
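For completeness, the small-problem variant mentioned above looks like this, assuming the same res and ftol:

import numpy as np

# materialize the full inverse Hessian and take the diagonal in one go
uncertainties = np.sqrt(max(1, abs(res.fun)) * ftol * np.diag(res.hess_inv.todense()))
print(uncertainties)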
One approach to this common problem is to run scipy.optimize.leastsq after minimize with 'L-BFGS-B', using the solution found by 'L-BFGS-B' as the starting point. That is, leastsq will (normally) include an estimate of the 1-sigma errors as well as the solution.
Of course, that approach makes several assumptions, including that leastsq can be used and is appropriate for solving the problem. From a practical view, this requires that the objective function return an array of residual values with at least as many elements as variables, not a scalar cost function.
You may find lmfit (https://lmfit.github.io/lmfit-py/) useful here: It supports both 'L-BFGS-B' and 'leastsq' and gives a uniform wrapper around these and other minimization methods, so that you can use the same objective function for both methods (and specify how to convert the residual array into the cost function). In addition, parameter bounds can be used for both methods. This makes it very easy to first do a fit with 'L-BFGS-B' and then with 'leastsq', using the values from 'L-BFGS-B' as starting values.
Lmfit also provides methods to explore confidence limits on parameter values in more detail, in case you suspect the simple but fast approach used by leastsq might be insufficient.
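A sketch of that two-stage lmfit workflow on a hypothetical exponential-decay model (the parameter names a and b and the synthetic data are made up for illustration):

import numpy as np
from lmfit import Parameters, minimize, fit_report

def residual(params, x, data):
    # residual array (not a scalar cost), as leastsq requires
    model = params['a'].value * np.exp(-params['b'].value * x)
    return data - model

x = np.linspace(0, 10, 200)
data = 3.0 * np.exp(-0.7 * x) + np.random.default_rng(1).normal(0, 0.05, x.size)

params = Parameters()
params.add('a', value=1.0, min=0, max=10)
params.add('b', value=0.1, min=0, max=5)

out1 = minimize(residual, params, args=(x, data), method='lbfgsb')        # bounded fit
out2 = minimize(residual, out1.params, args=(x, data), method='leastsq')  # adds 1-sigma errors
print(fit_report(out2))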
It really depends what you mean by "errors". There is no general answer to your question, because it depends on what you're fitting and what assumptions you're making.
The easiest case is one of the most common: when the function you are minimizing is a negative log-likelihood. In that case, the inverse of the Hessian matrix returned by the fit (hess_inv) is the covariance matrix describing the Gaussian approximation to the maximum likelihood. The parameter errors are the square roots of the diagonal elements of the covariance matrix.
Beware that if you are fitting a different kind of function or are making different assumptions, then that doesn't apply.
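As a minimal sketch of that negative log-likelihood case, assuming res came from minimize(..., method='L-BFGS-B') applied to a negative log-likelihood:

import numpy as np

# hess_inv is a LbfgsInvHessProduct; todense() gives the full matrix,
# which approximates the covariance matrix of the parameters
cov = res.hess_inv.todense()
errors = np.sqrt(np.diag(cov))  # 1-sigma parameter errors
print(errors)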

Calculating a t-test in Stata

I am currently trying to run a t test on a variable and determine if it's statistically significantly different from 1. Here is the code I am using:
ttest dm1=1
And it is spitting out this output (screenshot not included). I don't want my null hypothesis to be that the mean of dm1 is 1; I want it to be that the coefficient on dm1 is 1. When I do the calculation by hand ((Beta(dm1) - 1)/SE(Beta(dm1))), I get that the t statistic should be around -48.89. What is the code to determine whether the coefficient is statistically different from one, if this is not the proper way? (A screenshot of the regression model was also attached for reference.)
The ttest syntax you used tests the null hypothesis that the mean of dm1 is 1. It has nothing to do with the regression coefficients at all.
If I understand what you are asking, you want a Wald test:
sysuse auto
reg price mpg weight i.foreign
test mpg=1

What is the output of XGboost using 'rank:pairwise'?

I use the python implementation of XGBoost. One of the objectives is rank:pairwise, and it minimizes the pairwise loss (Documentation). However, it does not say anything about the range of the output. I see numbers between -10 and 10, but can it in principle be anywhere from -inf to inf?
Good question. You may have a look at discussions from Kaggle learning-to-rank competitions.
Actually, in the Learning to Rank field, we are trying to predict the relative score of each document for a specific query. That is, this is not a regression or classification problem. Hence, if a document attached to a query gets a negative predicted score, it means (and only means) that it is relatively less relevant to that query, compared to other documents with positive scores.
It gives a predicted score for ranking.
However, the scores are valid for ranking only within their own group.
So we must set the groups for the input data.
For easy ranking, refer to my project xgboostExtension.
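As a sketch with plain xgboost (synthetic data: two queries with 4 and 3 documents; all values are made up):

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((7, 5))                  # 7 documents, 5 features
y = np.array([2, 1, 0, 0, 1, 0, 2])    # relevance labels

dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group([4, 3])                # documents per query, in order

params = {'objective': 'rank:pairwise', 'eta': 0.1, 'max_depth': 4}
model = xgb.train(params, dtrain, num_boost_round=20)

scores = model.predict(dtrain)  # raw scores; only comparable within a group
print(scores)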
If I understand your question correctly, you mean the output of the predict function on a model fitted using rank:pairwise.
Predict gives the predicted variable (y_hat).
This is the same for reg:linear / binary:logistic etc. The only difference is that reg:linear builds trees to Min(RMSE(y, y_hat)), while rank:pairwise builds trees to Max(MAP(Rank(y), Rank(y_hat))). However, the output is always y_hat.
Depending on the values of your dependent variables, the output can be anything, but I typically expect the output to have much smaller variance than the dependent variable. This is usually the case because it is not necessary to fit extreme data values; the trees just need to produce predictions that are large/small enough to be ranked first/last within the group.