How to estimate confidence of nonlinear regression? - optimization

I use Levenberg -- Marquardt algorithm to fit my nonlinear function f(x,b) (x:Nx1, b:Mx1) to data X:NxK.
Now I want to estimate goodness (confidence) of solution b.
This post says that I should not try to find R-squared in nonlinear case. What should I do then? Are there any reliable universal metrics at all? I could not google any answer for this.

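Standard errors are usually calculated from the estimated covariance matrix of the parameters:
cov(b) = sigma^2 inv(J'J)
or
cov(b) = sigma^2 inv(H)
where
J : Jacobian matrix of the residuals with respect to b
H : Hessian matrix (of half the sum of squared errors, so that H is approximately J'J near the solution)
sigma^2 = SSE/df = sum of squared errors / (n-p), with n observations and p parameters
The standard error of each parameter is the square root of the corresponding diagonal element:
s.e. = sqrt(diag(cov(b)))
A confidence interval is then
b +- s.e. * t(n-p,alpha/2)
where t is the critical value of the Student's t distribution with n-p degrees of freedom.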
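The recipe above can be sketched with SciPy as follows. This is only an illustration under assumptions: the exponential-decay model, the synthetic data, and the starting point are hypothetical stand-ins for your own f(x,b), and the interval relies on the usual linearization around the solution, so it is approximate for strongly nonlinear models.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.stats import t as student_t

def model(x, b):
    # Hypothetical example model (an assumption): exponential decay plus offset
    return b[0] * np.exp(-b[1] * x) + b[2]

def residuals(b, x, y):
    return model(x, b) - y

# Synthetic data, generated only to make the sketch runnable
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 60)
b_true = np.array([2.0, 1.3, 0.5])
y = model(x, b_true) + 0.05 * rng.standard_normal(x.size)

# Levenberg-Marquardt fit
fit = least_squares(residuals, x0=[1.0, 1.0, 0.0], args=(x, y), method="lm")
b_hat = fit.x

J = fit.jac                                  # Jacobian of residuals at the solution, shape (n, p)
n, p = J.shape
sigma2 = np.sum(fit.fun ** 2) / (n - p)      # SSE / df
cov = sigma2 * np.linalg.inv(J.T @ J)        # approximate parameter covariance
se = np.sqrt(np.diag(cov))                   # standard errors

alpha = 0.05
t_crit = student_t.ppf(1.0 - alpha / 2.0, n - p)
ci_lower = b_hat - t_crit * se
ci_upper = b_hat + t_crit * se

for name, est, lo, hi in zip(["b0", "b1", "b2"], b_hat, ci_lower, ci_upper):
    print(f"{name} = {est:.4f}  95% CI [{lo:.4f}, {hi:.4f}]")
```

If J'J is badly conditioned, the intervals become very wide, which is itself a useful signal that some parameters are poorly determined by the data.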

Related

Parameters for numpy.random.lognormal function

I need to create a fictitious log-normal distribution of household income in a particular area. The data I have are: Average: 13,600 and Standard Deviation 7,900.
What should be the parameters in the function numpy.random.lognormal?
When I set the mean and the standard deviation as they are, most of the values in the distribution are "inf", and the values also don't make sense when I set the parameters to the log of the mean and standard deviation.
If someone can help me to figure out what the parameters are it would be great.
Thanks!
This is indeed a nontrivial task, as the moment equations of the log-normal distribution have to be solved for the unknown parameters. Looking at, say, Wikipedia, you will find the mean and variance of the log-normal distribution to be exp(mu + sigma**2/2) and [exp(sigma**2)-1]*exp(2*mu+sigma**2), respectively.
The choice of mu and sigma should solve exp(mu + sigma**2/2) = 13600 and [exp(sigma**2)-1]*exp(2*mu+sigma**2) = 7900**2. This can be solved analytically because the first equation squared gives exactly exp(2*mu+sigma**2), which eliminates the variable mu from the second equation.
A sample code is provided below. I took a large sample size to explicitly show that the mean and standard deviation of the simulated data are close to the desired numbers.
import numpy as np
# Input characteristics
DataAverage = 13600
DataStdDev = 7900
# Sample size
SampleSize = 100000
# mu and sigma of the underlying normal distribution (the parameters np.random.lognormal expects)
SigmaLogNormal = np.sqrt( np.log(1+(DataStdDev/DataAverage)**2))
MeanLogNormal = np.log( DataAverage ) - SigmaLogNormal**2/2
print(MeanLogNormal, SigmaLogNormal)
# Obtain draw from log-normal distribution
Draw = np.random.lognormal(mean=MeanLogNormal, sigma=SigmaLogNormal, size=SampleSize)
# Check
print( np.mean(Draw), np.std(Draw))
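As a deterministic cross-check that does not depend on sampling, the conversion can be wrapped in a small helper and plugged back into the moment formulas. The function name lognormal_params is a hypothetical choice for this sketch, not a NumPy function.

```python
import numpy as np

def lognormal_params(mean, std):
    """Return (mu, sigma) of the underlying normal for a log-normal
    distribution with the given mean and standard deviation."""
    sigma2 = np.log(1.0 + (std / mean) ** 2)
    mu = np.log(mean) - sigma2 / 2.0
    return mu, np.sqrt(sigma2)

mu, sigma = lognormal_params(13600.0, 7900.0)

# Plug the parameters back into the log-normal moment formulas
implied_mean = np.exp(mu + sigma ** 2 / 2.0)
implied_std = np.sqrt((np.exp(sigma ** 2) - 1.0) * np.exp(2.0 * mu + sigma ** 2))
print(implied_mean, implied_std)
```

The implied values should match the inputs up to floating-point rounding, whereas the simulated mean and standard deviation only match approximately.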

Using the piecewise function of the IBM CPLEX python API, but the problem cannot be solved

I am trying to use MILP (Mixed Integer Linear Programming) to solve the unit commitment problem (unit commitment: an optimization problem that tries to find the best scheduling of generators).
Because the relationship between generator power and cost is a quadratic function, I use a piecewise function to convert power to cost.
I modify the answer on this page:
unit commitment problem using piecewise-linear approximation become MIQP
The simple program structure is like this:
from docplex.mp.model import Model
mdl = Model(name='buses')
nbbus40 = mdl.integer_var(name='nbBus40')
nbbus30 = mdl.integer_var(name='nbBus30')
mdl.add_constraint(nbbus40*40 + nbbus30*30 >= 300, 'kids')
#after 4 buses, additional buses of a given size are cheaper
f1=mdl.piecewise(0, [(0,0),(4,2000),(10,4400)], 0.8)
f2=mdl.piecewise(0, [(0,0),(4,1600),(10,3520)], 0.8)
cost1= f1(nbbus40)
cost2 = f2(nbbus30)
mdl.minimize(cost1 + cost2)
mdl.solve()
mdl.report()
for v in mdl.iter_integer_vars():
    print(v," = ",v.solution_value)
which gives
* model buses solved with
objective = 3520.000
nbBus40 = 0
nbBus30 = 10.0
That answer works perfectly, but I cannot get the same approach to work for my own example.
I used a piecewise function to formulate a piecewise linear relationship between power and cost, got a new object (cost1), and then minimized this object.
The following is my actual code (simplified):
(min1,miny1), (pw1_1,pw1_1y),(pw1_2,pw1_2y), (max1,maxy1) are the breakpoints on the power-cost curve.
pwl_func_1phase = ucpm.piecewise(
    0,
    [
        (0, 0),
        (min1, miny1),
        (pw1_1, pw1_1y),
        (pw1_2, pw1_2y),
        (max1, maxy1),
    ],
    0
)
# df_decision_vars_spinning is a dataframe that stores the optimization variables
df_decision_vars_spinning.at[(units, period), 'variable_cost'] = pwl_func_1phase(
    df_decision_vars_spinning.at[(units, period), 'production']
)
total_variable_cost = ucpm.sum(df_decision_vars_spinning.variable_cost)
ucpm.minimize(total_variable_cost)
I don't know why this optimization problem can't be solved. Here is my complete code:
https://colab.research.google.com/drive/1JSKfOf0Vzo3E3FywsxcDdOz4sAwCgOHd?usp=sharing
With an unlimited edition of CPLEX, your model solves (though very slowly). Here are two ideas to better control what happens in solve():
use solve(log_output=True) to print the log: you'll see the gap going down
set a mip gap: setting the mip gap to 5% stops the solve at 36s
ucpm.parameters.mip.tolerances.mipgap = 0.05
ucpm.solve(log_output=True)
Not an answer, but to illustrate my comment.
Let's say we have as the cost curve
cost = α + β⋅power^2
Furthermore, we are minimizing cost.
We can approximate using a few linear curves. Here I have drawn a few:
Let's say each linear curve has the form
cost = a(i) + b(i)⋅power
for i=1,...,n (n=number of linear curves).
It is easy to see that if we write:
min cost
cost ≥ a(i) + b(i)⋅power ∀i
we have a good approximation for the quadratic cost curve. This is exactly as I said in the comment.
No binary variables were used here.
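A minimal sketch of this formulation as a plain LP, using scipy.optimize.linprog rather than CPLEX. Everything numeric here is an assumption for illustration: the coefficients α and β, the bounds on power, the demand that power must cover, and the choice of tangent lines as the linear curves.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical cost curve: cost = alpha + beta * power^2
alpha, beta = 10.0, 0.5
p_max = 10.0
demand = 4.0                       # hypothetical lower bound on power

# Tangent lines cost = a(i) + b(i) * power at a few points p_k
p_k = np.linspace(0.0, p_max, 6)   # 0, 2, 4, 6, 8, 10
b = 2.0 * beta * p_k               # slopes
a = alpha - beta * p_k ** 2        # intercepts

# Decision variables z = [power, cost]; objective: min cost
c = [0.0, 1.0]
# cost >= a(i) + b(i)*power   <=>   b(i)*power - cost <= -a(i)
A_ub = np.column_stack([b, -np.ones_like(b)])
b_ub = -a
bounds = [(demand, p_max), (None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
power_opt, cost_opt = res.x
exact_cost = alpha + beta * power_opt ** 2
print(power_opt, cost_opt, exact_cost)
```

Because the cost is convex and is being minimized, the solver pushes cost down onto the upper envelope of the lines, so only continuous variables are needed. Between tangent points the LP cost slightly underestimates the quadratic; adding more lines tightens the approximation.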

Confused by random.randn()

I am a bit confused by the numpy function random.randn() which returns random values from the standard normal distribution in an array in the size of your choosing.
My question is that I have no idea when this would ever be useful in applied practices.
For reference about me I am a complete programming noob but studied math (mostly stats related courses) as an undergraduate.
The Python function randn is incredibly useful for adding in a random noise element into a dataset that you create for initial testing of a machine learning model. Say for example that you want to create a million point dataset that is roughly linear for testing a regression algorithm. You create a million data points using
x_data = np.linspace(0.0,10.0,1000000)
You generate a million random noise values using randn
noise = np.random.randn(len(x_data))
To create your linear data set you follow the formula
y = mx + b + noise_levels with the following code (setting b = 5, m = 0.5 in this example)
y_data = (0.5 * x_data ) + 5 + noise
Finally the dataset is created with
my_data = pd.concat([pd.DataFrame(data=x_data,columns=['X Data']),pd.DataFrame(data=y_data,columns=['Y'])],axis=1)
This could be used in 3D programming to generate non-overlapping random values. This would be useful for optimization of graphical effects.
Another possible use in statistical applications would be applying a formula in order to test against spatial factors affecting a given constant. For example, you might be measuring a span of time with some formula and then need to know how effective it would be over various spans of time. This would return a statistic showing, for example, that your formula is more effective over shorter intervals or over longer intervals, etc.
np.random.randn(d0, d1, ..., dn) returns a sample (or samples) from the "standard normal" distribution (mu=0, stdev=1).
For random samples from N(mu, sigma^2), use:
sigma * np.random.randn(...) + mu
This is because if Z is a standard normal deviate, then X = sigma*Z + mu will have a normal distribution with expected value mu and standard deviation sigma.
https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.randn.html
https://en.wikipedia.org/wiki/Normal_distribution
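The scaling rule above can be checked empirically. The values of mu and sigma, the sample size, and the seed are arbitrary choices for this sketch.

```python
import numpy as np

np.random.seed(0)            # fixed seed so the run is repeatable
mu, sigma = 3.0, 2.0         # arbitrary target mean and standard deviation

# Shift and scale standard normal draws
samples = sigma * np.random.randn(200000) + mu

sample_mean = samples.mean()
sample_std = samples.std()
print(sample_mean, sample_std)   # should be close to mu and sigma
```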

Which objective is optimized Intra-Cluster sum of distances or MSE?

In cluster analysis papers that use meta-heuristic algorithms, many have optimized the Mean-Squared Quantization Error (MSE), for example [1] and [2].
I am confused by the results. The authors state that they used the MSE as the objective function, but they report the result values as intra-cluster sums of Euclidean distances.
K-Means minimizes the Within-Cluster Sum of Squares (WCSS) (from wiki) [3]. I could not find what the difference between WCSS and MSE is when Euclidean distance is used as the distance metric in calculating the MSE.
K-Means minimizes the WCSS, and if we use the same MSE function with the meta-heuristic algorithms, they will also minimize it. In that case, how can the sum of Euclidean distances differ between K-Means and the other algorithms?
I can reproduce the results shown in the papers if I optimize the intra-cluster sum of Euclidean distances.
I think I am doing something wrong here. Can anyone help me with this?
Main question: What objectives did the referenced papers [1] and [2] optimize, and which function's values are shown in the table?
K-means optimizes the (sum of within-cluster-) sum of squares aka variance aka sum of squared Euclidean distances.
This is easy to see if you study the convergence proof.
I can't study the two papers you referenced. They're with crappy Elsevier and paywalled, and I'm not going to pay $36+$32 to answer your question.
Update: I managed to get a free copy of one of them. They call it "MSE, mean-square quantization error", but their equation is the usual within-cluster sum of squares, no mean involved; with a shady self-citation attached to this statement, and half of the references being self-citations... it seems like it's more that this author likes to call it something different than everybody else. Looks a bit like "reinventing the wheel with a different name" to me. I'd carefully double-check their results. I'm not saying they are false, I haven't checked in more detail. But the "mean-square error" doesn't involve a mean for sure; it's the sum of squared errors.
Update: if "intra-cluster sum" means sum of pairwise distances of any two objects, consider the following:
Without loss of generality, move the data such that the mean is 0. (Translation doesn't change Euclidean or squared Euclidean distances).
sum_x sum_y sum_i (x_i-y_i)^2
= sum_x sum_y [ sum_i (x_i)^2 + sum_i (y_i)^2 - 2 sum_i (x_i*y_i) ]
= n * sum_x sum_i (x_i)^2 + n * sum_y sum_i (y_i)^2
- 2 * sum_i [sum_x x_i * sum_y y_i]
Since mu_i = 0, sum_x x_i = sum_y y_i = 0, and the third term disappears.
The first two summands are the same, so we have 2n times the WCSS.
If I didn't screw up this computation, then the mean, asymmetric pairwise squared Euclidean distance within a cluster is the same as WCSS.
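The factor of 2n from the derivation above can be verified numerically for a single cluster. The data here are random points made up for the check; n is the number of points in the cluster and every ordered pair (x, y) is counted, including x = y.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))     # one hypothetical cluster: 50 points in 3 dimensions
n = X.shape[0]

# Within-cluster sum of squares: squared distances to the cluster mean
wcss = np.sum((X - X.mean(axis=0)) ** 2)

# Sum of squared Euclidean distances over all ordered pairs (x, y)
diffs = X[:, None, :] - X[None, :, :]
pairwise_sum = np.sum(diffs ** 2)

print(pairwise_sum, 2 * n * wcss)   # the two values should agree
```

Note that the factor depends on the cluster size n, so when clusters have different sizes the summed pairwise objective weights them differently from the plain WCSS.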

Constrained np.polyfit

I am trying to fit a quadratic to some experimental data and using polyfit in numpy. I am looking to get a concave curve, and hence want to make sure that the coefficient of the quadratic term is negative, also the fit itself is weighted, as in there are some weights on the points. Is there an easy way to do that? Thanks.
The use of weights is described here (numpy.polyfit).
Basically, you need a weight vector with the same length as x and y.
To avoid the wrong sign in the coefficient, you could use a fit function definition like
def fitfunc(x,a,b,c):
    return -1 * abs(a) * x**2 + b * x + c
This will give you a negative coefficient for x**2 at all times.
You can use curve_fit.
Or you can run polyfit with degree 2 and, if the quadratic coefficient (the first coefficient returned by polyfit) is greater than 0, run a linear polyfit instead (polyfit with degree 1).
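One way to combine the weights and the sign constraint is a bounded fit with scipy.optimize.curve_fit, sketched below as an alternative to the abs(a) trick. The data and weights are synthetic assumptions. The weights follow the polyfit convention (they multiply the unsquared residuals), so they are passed to curve_fit as sigma = 1/w.

```python
import numpy as np
from scipy.optimize import curve_fit

def quadratic(x, a, b, c):
    return a * x ** 2 + b * x + c

# Synthetic concave data with noise, only for illustration
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 40)
y = -0.8 * x ** 2 + 1.5 * x + 2.0 + 0.2 * rng.standard_normal(x.size)

# Hypothetical weights in the np.polyfit sense: down-weight the first 10 points
w = np.ones_like(x)
w[:10] = 0.5

# Constrain a <= 0; b and c are unbounded
lower = [-np.inf, -np.inf, -np.inf]
upper = [0.0, np.inf, np.inf]

popt, pcov = curve_fit(
    quadratic, x, y,
    p0=[-1.0, 0.0, 0.0],     # starting point inside the bounds
    sigma=1.0 / w,           # curve_fit divides residuals by sigma
    bounds=(lower, upper),
)
a_fit, b_fit, c_fit = popt
print(a_fit, b_fit, c_fit)
```

If the data are actually convex, the bound becomes active and the fit returns a quadratic coefficient at (or extremely close to) zero, which amounts to a weighted linear fit.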