Fitting to Poisson histogram - numpy

I am trying to fit a curve over the histogram of a Poisson distribution that looks like this
I have modified the fit function so that it resembles a Poisson distribution, with the parameter t as a variable. But the curve_fit function can not be plotted and I am not sure why.
def histo(bsize):
N = bsize
#binwidth
bw = (dt.max()-dt.min())/(N-1.)
bin1 = dt.min()+ bw*np.arange(N)
#define the array to hold the occurrence count
bincount= np.array([])
for bin in bin1:
count = np.where((dt>=bin)&(dt<bin+bw))[0].size
bincount = np.append(bincount,count)
#bin center
binc = bin1+0.5*bw
plt.figure()
plt.plot(binc,bincount,drawstyle= 'steps-mid')
plt.xlabel("Interval[ticks]")
plt.ylabel("Frequency")
histo(30)
plt.xlim(0,.5e8)
plt.ylim(0,25000)
import numpy as np
from scipy.optimize import curve_fit
delta_t = 1.42e7
def func(x, t):
return t * np.exp(- delta_t/t)
popt, pcov = curve_fit(func, np.arange(0,.5e8),histo(30))
plt.plot(popt)

The problem with your code is that you do not know what the return values of curve_fit are. It is the parameters for the fit-function and their covariance matrix - not something you can plot directly.
Binned Least-Squares Fit
In general you can get everything much, much more easily:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.special import factorial
from scipy.stats import poisson
# get poisson deviated random numbers
data = np.random.poisson(2, 1000)
# the bins should be of integer width, because poisson is an integer distribution
bins = np.arange(11) - 0.5
entries, bin_edges, patches = plt.hist(data, bins=bins, density=True, label='Data')
# calculate bin centers
bin_centers = 0.5 * (bin_edges[1:] + bin_edges[:-1])
def fit_function(k, lamb):
'''poisson function, parameter lamb is the fit parameter'''
return poisson.pmf(k, lamb)
# fit with curve_fit
parameters, cov_matrix = curve_fit(fit_function, bin_centers, entries)
# plot poisson-deviation with fitted parameter
x_plot = np.arange(0, 15)
plt.plot(
x_plot,
fit_function(x_plot, *parameters),
marker='o', linestyle='',
label='Fit result',
)
plt.legend()
plt.show()
This is the result:
Unbinned Maximum-Likelihood fit
An even better possibility would be to not use a histogram at all
and instead to carry out a maximum-likelihood fit.
But by closer examination even this is unnecessary, because the
maximum-likelihood estimator for the parameter of the poissonian distribution is the arithmetic mean.
However, if you have other, more complicated PDFs, you can use this as example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize
from scipy.special import factorial
from scipy import stats
def poisson(k, lamb):
"""poisson pdf, parameter lamb is the fit parameter"""
return (lamb**k/factorial(k)) * np.exp(-lamb)
def negative_log_likelihood(params, data):
"""
The negative log-Likelihood-Function
"""
lnl = - np.sum(np.log(poisson(data, params[0])))
return lnl
def negative_log_likelihood(params, data):
''' better alternative using scipy '''
return -stats.poisson.logpmf(data, params[0]).sum()
# get poisson deviated random numbers
data = np.random.poisson(2, 1000)
# minimize the negative log-Likelihood
result = minimize(negative_log_likelihood, # function to minimize
x0=np.ones(1), # start value
args=(data,), # additional arguments for function
method='Powell', # minimization method, see docs
)
# result is a scipy optimize result object, the fit parameters
# are stored in result.x
print(result)
# plot poisson-distribution with fitted parameter
x_plot = np.arange(0, 15)
plt.plot(
x_plot,
stats.poisson.pmf(x_plot, result.x),
marker='o', linestyle='',
label='Fit result',
)
plt.legend()
plt.show()

Thank you for the wonderful discussion!
You might want to consider the following:
1) Instead of computing "poisson", compute "log poisson", for better numerical behavior
2) Instead of using "lamb", use the logarithm (let me call it "log_mu"), to avoid the fit "wandering" into negative values of "mu".
So
log_poisson(k, log_mu): return k*log_mu - loggamma(k+1) - math.exp(log_mu)
Where "loggamma" is the scipy.special.loggamma function.
Actually, in the above fit, the "loggamma" term only adds a constant offset to the functions being minimized, so one can just do:
log_poisson_(k, log_mu): return k*log_mu - math.exp(log_mu)
NOTE: log_poisson_() not the same as log_poisson(), but when used for minimization in the manner above, will give the same fitted minimum (the same value of mu, up to numerical issues). The value of the function being minimized will have been offset, but one doesn't usually care about that anyway.

Related

Minimization gives weird result (multi-parameter fitting)

I'm working on fitting of the experimental data. In order to fit it I use the minimization of the function of residual. Everything is quite trivial, but but this time I can't find what's wrong and why the result of fitting is so weird. The example is simplified in comparison with original problem. But anyway it gives wrong parameters even when I set used values of parameters as initial guess.
import matplotlib.pyplot as plt
import numpy as np
import csv
from scipy.optimize import curve_fit, minimize
x=np.arange(0,10,0.5)
a=0.5
b=3
ini_pars=[a, b]
def func(x, a, b):
return a*x+b
plt.plot(x, func(x,a,b))
plt.show()
def fit(pars):
A,B = pars
res = (func(x,a, b)-func(x, *pars))**2
s=sum(res)
return s
bnds=[(0.1,0.5),(1,5)]
x0=[0.1,4]
opt = minimize(fit, x0, bounds=bnds)
new_pars=[opt.x[0], opt.x[0]]
example = fit(ini_pars)
print(example)
example = fit(new_pars)
print(example)
print(new_pars)
plt.plot(x, func(x, *ini_pars))
plt.plot(x, func(x, *new_pars))
plt.show()
```[enter image description here][1]
[1]: https://i.stack.imgur.com/qc1Nu.png
It should be new_pars=[opt.x[0], opt.x[1]] instead of new_pars=[opt.x[0], opt.x[0]]. Note also that you can directly extract the values by new_pars = opt.x.

scipy weird unexpected behavior curve_fit large data set for sin wave

For some reason when I am trying to large amount of data to a sin wave it fails and fits it to a horizontal line. Can somebody explain?
Minimal working code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
# Seed the random number generator for reproducibility
import pandas
np.random.seed(0)
# Here it work as expected
# x_data = np.linspace(-5, 5, num=50)
# y_data = 2.9 * np.sin(1.05 * x_data + 2) + 250 + np.random.normal(size=50)
# With this data it breaks
x_data = np.linspace(0, 2500, num=2500)
y_data = -100 * np.sin(0.01 * x_data + 1) + 250 + np.random.normal(size=2500)
# And plot it
plt.figure(figsize=(6, 4))
plt.scatter(x_data, y_data)
def test_func(x, a, b, c, d):
return a * np.sin(b * x + c) + d
# Used to fit the correct function
# params, params_covariance = optimize.curve_fit(test_func, x_data, y_data)
# making some guesses
params, params_covariance = optimize.curve_fit(test_func, x_data, y_data,
p0=[-80, 3, 0, 260])
print(params)
plt.figure(figsize=(6, 4))
plt.scatter(x_data, y_data, label='Data')
plt.plot(x_data, test_func(x_data, *params),
label='Fitted function')
plt.legend(loc='best')
plt.show()
Does anybody know, how to fix this issue. Should I use a different fitting method not least square? Or should I reduce the number of data points?
Given your data, you can use the more robust lmfit instead of scipy.
In particular, you can use SineModel (see here for details).
SineModel in lmfit is not for "shifted" sine waves, but you can easily deal with the shift doing
y_data_offset = y_data.mean()
y_transformed = y_data - y_data_offset
plt.scatter(x_data, y_transformed)
plt.axhline(0, color='r')
Now you can fit to sine wave
from lmfit.models import SineModel
mod = SineModel()
pars = mod.guess(y_transformed, x=x_data)
out = mod.fit(y_transformed, pars, x=x_data)
you can inspect results with print(out.fit_report()) and plot results with
plt.plot(x_data, y_data, lw=7, color='C1')
plt.plot(x_data, out.best_fit+y_data_offset, color='k')
# we add the offset ^^^^^^^^^^^^^
or with the builtin plot method out.plot_fit(), see here for details.
Note that in SineModel all parameters "are constrained to be non-negative", so your defined negative amplitude (-100) will be positive (+100) in the parameters fit results. So the phase too won't be 1 but π+1 (PS: they call shift the phase)
print(out.best_values)
{'amplitude': 99.99631403054289,
'frequency': 0.010001193681616227,
'shift': 4.1400215410836605}

Using scipy.odr to fit curve

I'm trying to fit a set of data points via a fit function that depends on two variables, let's call these xdata and sdata. Problem is my curve is rather flat I want it to more or less "follow the points".
I've tried using scipy.odr to fit the curve it works rather well except that the curve is too flat:
import numpy as np
from math import pi
from math import sqrt
from math import log
from scipy import optimize
import scipy.optimize
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.odr import *
mudr=np.array([ 57.43708609, 46.26119205, 55.60688742, 33.21615894,
28.27072848, 22.54649007, 21.80662252, 11.21483444, 5.80211921])
#xdata points
dme=array([ 128662.54890776, 105265.32915726, 128652.56835434,
77968.67019573, 66273.56542068, 58464.58559543,
54570.66624991, 27286.90038703, 19480.92689266]) #xdata error
dmss22=np.array([ 4.90050000e+17, 4.90050000e+17, 4.90050000e+17,
4.90050000e+17, 4.90050000e+17, 4.90050000e+17,
4.90050000e+17, 4.90050000e+17, 4.90050000e+17]) #sdata points
dmse=np.array([ 1.09777592e+21, 1.11512117e+21, 1.13381702e+21,
1.15033267e+21, 1.14883089e+21, 1.27076265e+21,
1.22637165e+21, 1.19237598e+21, 1.64539205e+21]) # sdata error
F=np.array([ 115.01944248, 110.24354867, 112.77812389, 104.81830088,
104.35746903, 101.32016814, 100.54513274, 96.94226549,
93.00424779]) #ydata points
dF=np.array([ 72710.75386699, 72590.6256987 , 176539.40403673,
130555.27503081, 124299.52080164, 176426.64340597,
143013.52848306, 122117.93022746, 157547.78395513])#ydata error
def Ffitsso(p,X,B=2.58,Fc=92.2,mu=770,Za=0.9468): #fitfunction
temp1 = (2*B*X[0])/(4*pi*Fc)**2
temp2 = temp1*(afij[0]+afij[1]*np.log((2*B*X[0])/mu**2))
temp3 = temp1**2*(afij[2]+afij[3]*np.log((2*B*X[0])/mu**2)+\
afij[4]*(np.log((2*B*X[0])/mu**2))**2)
temp4 = temp1**3*(afij[5]+afij[6]*np.log((2*B*X[0])/mu**2)+\
afij[7]*(np.log((2*B*X[0])/mu**2))**2+\
afij[8]*(np.log((2*B*X[0])/mu**2))**3)
return Fc/Za*(1+p[0]*X[1])*(1+temp2+temp3+temp4)+p[1]
#fitting using scipy.odr
xtot=np.row_stack( (mudr, dmss22) )
etot=np.row_stack( (Ze, dmss22e) )
fitting = Model(Ffitsso)
mydata = RealData(xtot, F, sx=etot2, sy=dF)
myodr = ODR(mydata, fitting, beta0=[0, 100])
myoutput = myodr.run()
myoutput.pprint()
bet=myoutput.beta
plt.plot(mudr,F,"b^")
plt.plot(mudr,Ffitsso(bet,[mudr,dmss22]))
p[0]*X[0] in the fitfunction is supposed to be small compared to 1 but with the fit the value for p[0] is in order of e-18 whilst dmss22 values are in the order of e-17 which is not small enough.
Even worse is that it's negative meaning the function decreases which is not supposed to happen it's supposed to increase like the plotted data points.
Edit: I fixed, didn't know that it was so sensitive to initial beta values, put beta[0]=1.5*10(-15) and it works!**
Here is a graphical fitter with both curve_fit and ODR fitters using scipy's Differential Evolution (DE) genetic algorithm to supply initial parameter estimates for the non-linear solvers. The scipy implementation of DE uses the Latin Hypercube algorithm to ensure a thorough search of parameter space, and this requires parameter bounds within which to search - in this example, these bounds are taken from the data maximum and minimum values. Note that it is much easier to give bounds for the initial parameter estimates rather than individual specific values.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import scipy.odr
from scipy.optimize import differential_evolution
import warnings
xData = numpy.array([1.1, 2.2, 3.3, 4.4, 5.0, 6.6, 7.7, 0.0])
yData = numpy.array([1.1, 20.2, 30.3, 40.4, 50.0, 60.6, 70.7, 0.1])
def func(x, a, b, c, d, offset): # curve fitting function for curve_fit()
return a*numpy.exp(-(x-b)**2/(2*c**2)+d) + offset
def func_wrapper_for_ODR(parameters, x): # parameter order for ODR
return func(x, *parameters)
# function for genetic algorithm to minimize (sum of squared error)
def sumOfSquaredError(parameterTuple):
warnings.filterwarnings("ignore") # do not print warnings by genetic algorithm
val = func(xData, *parameterTuple)
return numpy.sum((yData - val) ** 2.0)
def generate_Initial_Parameters():
# min and max used for bounds
maxX = max(xData)
minX = min(xData)
maxY = max(yData)
minY = min(yData)
parameterBounds = []
parameterBounds.append([minY, maxY]) # search bounds for a
parameterBounds.append([minX, maxX]) # search bounds for b
parameterBounds.append([minX, maxX]) # search bounds for c
parameterBounds.append([minY, maxY]) # search bounds for d
parameterBounds.append([0.0, maxY]) # search bounds for Offset
# "seed" the numpy random number generator for repeatable results
result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
return result.x
geneticParameters = generate_Initial_Parameters()
##########################
# curve_fit section
##########################
fittedParameters_curvefit, pcov = curve_fit(func, xData, yData, geneticParameters)
print('Fitted parameters curve_fit:', fittedParameters_curvefit)
print()
modelPredictions_curvefit = func(xData, *fittedParameters_curvefit)
absError_curvefit = modelPredictions_curvefit - yData
SE_curvefit = numpy.square(absError_curvefit) # squared errors
MSE_curvefit = numpy.mean(SE_curvefit) # mean squared errors
RMSE_curvefit = numpy.sqrt(MSE_curvefit) # Root Mean Squared Error, RMSE
Rsquared_curvefit = 1.0 - (numpy.var(absError_curvefit) / numpy.var(yData))
print()
print('RMSE curve_fit:', RMSE_curvefit)
print('R-squared curve_fit:', Rsquared_curvefit)
print()
##########################
# ODR section
##########################
data = scipy.odr.odrpack.Data(xData,yData)
model = scipy.odr.odrpack.Model(func_wrapper_for_ODR)
odr = scipy.odr.odrpack.ODR(data, model, beta0=geneticParameters)
# Run the regression.
odr_out = odr.run()
print('Fitted parameters ODR:', odr_out.beta)
print()
modelPredictions_odr = func(xData, *odr_out.beta)
absError_odr = modelPredictions_odr - yData
SE_odr = numpy.square(absError_odr) # squared errors
MSE_odr = numpy.mean(SE_odr) # mean squared errors
RMSE_odr = numpy.sqrt(MSE_odr) # Root Mean Squared Error, RMSE
Rsquared_odr = 1.0 - (numpy.var(absError_odr) / numpy.var(yData))
print()
print('RMSE ODR:', RMSE_odr)
print('R-squared ODR:', Rsquared_odr)
print()
##########################################################
# graphics output section
def ModelsAndScatterPlot(graphWidth, graphHeight):
f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
axes = f.add_subplot(111)
# first the raw data as a scatter plot
axes.plot(xData, yData, 'D')
# create data for the fitted equation plots
xModel = numpy.linspace(min(xData), max(xData))
yModel_curvefit = func(xModel, *fittedParameters_curvefit)
yModel_odr = func(xModel, *odr_out.beta)
# now the models as line plots
axes.plot(xModel, yModel_curvefit)
axes.plot(xModel, yModel_odr)
axes.set_xlabel('X Data') # X axis data label
axes.set_ylabel('Y Data') # Y axis data label
plt.show()
plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelsAndScatterPlot(graphWidth, graphHeight)

Fit sigmoid curve in python

Thanks ahead! I am trying to fit a sigmoid curve over some data, below is my code
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
====== some code in between =======
plt.scatter(drag[0].w,drag[0].s, s = 10, label = 'drag%d'%0)
def sigmoid(x,x0,k):
y = 1.0/(1.0+np.exp(-x0*(x-k)))
return y
popt,pcov = curve_fit(sigmoid, drag[0].w, drag[0].s)
xx = np.linspace(10,1000,10)
yy = sigmoid(xx, *popt)
plt.plot(xx,yy,'r-', label='fit')
plt.legend(loc='upper left')
plt.xlabel('weight(kg)', fontsize=12)
plt.ylabel('wing span(m)', fontsize=12)
plt.show()
this is now showing the graph below which is not very rightfitting curve is the red one at bottom
What are the possible solutions?
Also I am open to other methods of fitting logistic curves on this set of data
Thanks again!
Here is an example graphical fitter using your equation with an amplitude scaling factor for my test data. This code uses scipy's Differential Evolution genetic algorithm to provide initial parameter estimates for curve_fit(), as the scipy default initial parameter estimates of all 1.0 are not always optimal. The scipy implementation of Differential Evolution uses the Latin Hypercube algorithm to ensure a thorough search of parameter space, and this requires bounds within which to search. In this example those bounds are taken from the example data I provide, when using your own data please check that the bounds seem reasonable. Note that ranges on the parameters are much easier to provide than specific values for the initial parameter estimates.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.optimize import differential_evolution
import warnings
xData = numpy.array([19.1647, 18.0189, 16.9550, 15.7683, 14.7044, 13.6269, 12.6040, 11.4309, 10.2987, 9.23465, 8.18440, 7.89789, 7.62498, 7.36571, 7.01106, 6.71094, 6.46548, 6.27436, 6.16543, 6.05569, 5.91904, 5.78247, 5.53661, 4.85425, 4.29468, 3.74888, 3.16206, 2.58882, 1.93371, 1.52426, 1.14211, 0.719035, 0.377708, 0.0226971, -0.223181, -0.537231, -0.878491, -1.27484, -1.45266, -1.57583, -1.61717])
yData = numpy.array([0.644557, 0.641059, 0.637555, 0.634059, 0.634135, 0.631825, 0.631899, 0.627209, 0.622516, 0.617818, 0.616103, 0.613736, 0.610175, 0.606613, 0.605445, 0.603676, 0.604887, 0.600127, 0.604909, 0.588207, 0.581056, 0.576292, 0.566761, 0.555472, 0.545367, 0.538842, 0.529336, 0.518635, 0.506747, 0.499018, 0.491885, 0.484754, 0.475230, 0.464514, 0.454387, 0.444861, 0.437128, 0.415076, 0.401363, 0.390034, 0.378698])
def sigmoid(x, amplitude, x0, k):
return amplitude * 1.0/(1.0+numpy.exp(-x0*(x-k)))
# function for genetic algorithm to minimize (sum of squared error)
def sumOfSquaredError(parameterTuple):
warnings.filterwarnings("ignore") # do not print warnings by genetic algorithm
val = sigmoid(xData, *parameterTuple)
return numpy.sum((yData - val) ** 2.0)
def generate_Initial_Parameters():
# min and max used for bounds
maxX = max(xData)
minX = min(xData)
maxY = max(yData)
minY = min(yData)
parameterBounds = []
parameterBounds.append([minY, maxY]) # search bounds for amplitude
parameterBounds.append([minX, maxX]) # search bounds for x0
parameterBounds.append([minX, maxX]) # search bounds for k
# "seed" the numpy random number generator for repeatable results
result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
return result.x
# by default, differential_evolution completes by calling curve_fit() using parameter bounds
geneticParameters = generate_Initial_Parameters()
# now call curve_fit without passing bounds from the genetic algorithm,
# just in case the best fit parameters are aoutside those bounds
fittedParameters, pcov = curve_fit(sigmoid, xData, yData, geneticParameters)
print('Fitted parameters:', fittedParameters)
print()
modelPredictions = sigmoid(xData, *fittedParameters)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print()
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
axes = f.add_subplot(111)
# first the raw data as a scatter plot
axes.plot(xData, yData, 'D')
# create data for the fitted equation plot
xModel = numpy.linspace(min(xData), max(xData))
yModel = sigmoid(xModel, *fittedParameters)
# now the model as a line plot
axes.plot(xModel, yModel)
axes.set_xlabel('X Data') # X axis data label
axes.set_ylabel('Y Data') # Y axis data label
plt.show()
plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)

Implementing minimization in SciPy

I am trying to implement the 'Iterative hessian Sketch' algorithm from https://arxiv.org/abs/1411.0347 page 12. However, I am struggling with step two which needs to minimize the matrix-vector function.
Imports and basic data generating function
import numpy as np
import scipy as sp
from sklearn.datasets import make_regression
from scipy.optimize import minimize
import matplotlib.pyplot as plt
%matplotlib inline
from numpy.linalg import norm
def generate_data(nsamples, nfeatures, variance=1):
'''Generates a data matrix of size (nsamples, nfeatures)
which defines a linear relationship on the variables.'''
X, y = make_regression(n_samples=nsamples, n_features=nfeatures,\
n_informative=nfeatures,noise=variance)
X[:,0] = np.ones(shape=(nsamples)) # add bias terms
return X, y
To minimize the matrix-vector function, I have tried implementing a function which computes the quanity I would like to minimise:
def f2min(x, data, target, offset):
A = data
S = np.eye(A.shape[0])
#S = gaussian_sketch(nrows=A.shape[0]//2, ncols=A.shape[0] )
y = target
xt = np.ravel(offset)
norm_val = (1/2*S.shape[0])*norm(S#A#(x-xt))**2
#inner_prod = (y - A#xt).T#A#x
return norm_val - inner_prod
I would eventually like to replace S with some random matrices which can reduce the dimensionality of the problem, however, first I need to be confident that this optimisation method is working.
def grad_f2min(x, data, target, offset):
A = data
y = target
S = np.eye(A.shape[0])
xt = np.ravel(offset)
S_A = S#A
grad = (1/S.shape[0])*S_A.T#S_A#(x-xt) - A.T#(y-A#xt)
return grad
x0 = np.zeros((X.shape[0],1))
xt = np.zeros((2,1))
x_new = np.zeros((2,1))
for it in range(1):
result = minimize(f2min, x0=xt,args=(X,y,x_new),
method='CG', jac=False )
print(result)
x_new = result.x
I don't think that this loop is correct at all because at the very least there should be some local convergence before moving on to the next step. The output is:
fun: 0.0
jac: array([ 0.00745058, 0.00774882])
message: 'Desired error not necessarily achieved due to precision loss.'
nfev: 416
nit: 0
njev: 101
status: 2
success: False
x: array([ 0., 0.])
Does anyone have an idea if:
(1) Why I'm not achieving convergence at each step
(2) I can implement step 2 in a better way?