Weird unexpected behavior of scipy curve_fit on a large data set for a sine wave

For some reason, when I try to fit a large amount of data to a sine wave, the fit fails and settles on a horizontal line. Can somebody explain?
Minimal working code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
import pandas  # imported in the original snippet but unused here
# Seed the random number generator for reproducibility
np.random.seed(0)
# Here it works as expected
# x_data = np.linspace(-5, 5, num=50)
# y_data = 2.9 * np.sin(1.05 * x_data + 2) + 250 + np.random.normal(size=50)
# With this data it breaks
x_data = np.linspace(0, 2500, num=2500)
y_data = -100 * np.sin(0.01 * x_data + 1) + 250 + np.random.normal(size=2500)
# And plot it
plt.figure(figsize=(6, 4))
plt.scatter(x_data, y_data)
def test_func(x, a, b, c, d):
    return a * np.sin(b * x + c) + d
# Used to fit the correct function
# params, params_covariance = optimize.curve_fit(test_func, x_data, y_data)
# making some guesses
params, params_covariance = optimize.curve_fit(test_func, x_data, y_data,
                                               p0=[-80, 3, 0, 260])
print(params)
plt.figure(figsize=(6, 4))
plt.scatter(x_data, y_data, label='Data')
plt.plot(x_data, test_func(x_data, *params),
         label='Fitted function')
plt.legend(loc='best')
plt.show()
Does anybody know how to fix this issue? Should I use a different fitting method instead of least squares? Or should I reduce the number of data points?

Given your data, you can use the more robust lmfit instead of scipy.
In particular, you can use SineModel (see here for details).
SineModel in lmfit has no term for a vertical offset, but you can easily deal with the shift by subtracting the mean:
y_data_offset = y_data.mean()
y_transformed = y_data - y_data_offset
plt.scatter(x_data, y_transformed)
plt.axhline(0, color='r')
Now you can fit the sine wave:
from lmfit.models import SineModel
mod = SineModel()
pars = mod.guess(y_transformed, x=x_data)
out = mod.fit(y_transformed, pars, x=x_data)
You can inspect the results with print(out.fit_report()) and plot them with
plt.plot(x_data, y_data, lw=7, color='C1')
plt.plot(x_data, out.best_fit + y_data_offset, color='k')  # add the offset back
or with the builtin plot method out.plot_fit(), see here for details.
Note that in SineModel all parameters "are constrained to be non-negative", so your negative amplitude (-100) will come out positive (+100) in the fit results. Accordingly, since -A·sin(θ) = A·sin(θ + π), the phase won't be 1 but π + 1. (PS: lmfit names the phase parameter shift.)
print(out.best_values)
{'amplitude': 99.99631403054289,
 'frequency': 0.010001193681616227,
 'shift': 4.1400215410836605}
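If you'd rather stay with plain scipy: the fit collapses to the horizontal line y ≈ 250 because the starting frequency (b = 3 versus the true 0.01) is so far off that least squares settles in a local minimum where the sine averages out to the offset. Below is a minimal sketch of a data-driven starting point, estimating the frequency from an FFT; the amplitude and offset heuristics are my assumptions, not part of the answer above:
import numpy as np
from scipy import optimize

np.random.seed(0)
x_data = np.linspace(0, 2500, num=2500)
y_data = -100 * np.sin(0.01 * x_data + 1) + 250 + np.random.normal(size=2500)

def test_func(x, a, b, c, d):
    return a * np.sin(b * x + c) + d

# Dominant frequency from the spectrum of the de-meaned signal
dt = x_data[1] - x_data[0]
freqs = np.fft.rfftfreq(len(x_data), d=dt)       # in cycles per x-unit
spectrum = np.abs(np.fft.rfft(y_data - y_data.mean()))
b_guess = 2 * np.pi * freqs[spectrum.argmax()]   # convert to angular frequency

# Rough amplitude and offset guesses straight from the data
p0 = [y_data.std() * np.sqrt(2), b_guess, 0.0, y_data.mean()]
params, _ = optimize.curve_fit(test_func, x_data, y_data, p0=p0)
print(params)  # ~[100, 0.01, 1 + pi, 250], since a*sin(t) = -a*sin(t + pi)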


Extracting BCI Geodetic and ECI coordinates of an orbit

I am using astropy to define a Tundra orbit around Earth, and subsequently I would like to extract the ECI and geodetic coordinates as the object propagates in time. I was able to get something, but it does not agree with what I would expect (ECI coordinates extracted from another software tool). The two orbits are not even in the same plane, which is clearly wrong.
Can anybody tell me if I am doing something obviously wrong?
The plot below shows the two results. Orange is with Astropy.
import astropy
from astropy import units as u
from poliastro.bodies import Earth
from astropy.coordinates import CartesianRepresentation
from poliastro.twobody import Orbit
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
epoch = astropy.time.Time('2020-01-01T00:00:00.000', scale='tt')
# Tundra
tundra1 = Orbit.from_classical(attractor=Earth,
                               a=42164 * u.km,
                               ecc=0.2684 * u.one,
                               inc=63.4 * u.deg,
                               raan=25 * u.deg,
                               argp=270 * u.deg,
                               nu=50 * u.deg,
                               # epoch=epoch
                               )
def plot_orb(orb, start_t, end_t, step_t, ax, c='k'):
    orb_list = []
    for t in np.arange(start_t, end_t, step_t):
        single_orb = orb.propagate(t * u.min)
        orb_list = orb_list + [single_orb]
    xyz = orb.sample().xyz
    ax.plot(*xyz, 'r')
    s_xyz_ar = np.zeros((len(orb_list), 3))
    for i, s_orb in enumerate(orb_list):
        s_xyz = s_orb.represent_as(CartesianRepresentation).xyz
        s_xyz_ar[i, :] = s_xyz
    ax.scatter(s_xyz_ar[:, 0], s_xyz_ar[:, 1], s_xyz_ar[:, 2], c)
    return s_xyz_ar, t
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
s_xyz_ar1, t1 = plot_orb(orb=tundra1, start_t=0, end_t=1440, step_t=10, ax=ax, c='k')
When I wrote that you can do this more efficiently I was under the mistaken assumption that Orbit.propagate can be called directly on an array of time steps like:
>>> tt = np.arange(0, 1440, 10) * u.min
>>> orb = tundra1.propagate(tt)
While this "works" in that it returns a new orbit with an array of epochs, it appears Orbit is not really designed to work with an array of epochs, and trying to do something like orb.represent_as just returns a value for the first epoch in the array. This would be a nice possible enhancement to poliastro.
However, the code you wrote for the scatter plot can still be significantly simplified to something like this:
>>> tt = np.arange(0, 1440, 10) * u.min
>>> xyz = np.vstack([tundra1.propagate(t).represent_as(CartesianRepresentation).xyz for t in tt])
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111, projection='3d')
>>> ax.scatter(*xyz.T)
>>> fig.show()
Result:
Ideally you should be able to do this without the np.vstack and instead just call tundra1.propagate(tt).represent_as(CartesianRepresentation).xyz without a for loop. But as the above demonstrates, you can still simplify a lot by using np.vstack to make an array from a list of (x, y, z) triplets.
I'm not sure this really answers your original question, though; it seems the answer you found wasn't really related to the code. Still, I hope this helps!
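As a quick sanity check on the orbit definition (my addition, not part of the original answer): a Tundra orbit is geosynchronous, so the chosen semi-major axis of 42164 km should give a period of one sidereal day:
import numpy as np
from astropy import units as u

mu_earth = 398600.4418 * u.km**3 / u.s**2  # Earth's standard gravitational parameter
a = 42164 * u.km
T = 2 * np.pi * np.sqrt(a**3 / mu_earth)   # Kepler's third law
print(T.to(u.h))                           # ~23.93 h, i.e. one sidereal day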

Using scipy.odr to fit curve

I'm trying to fit a set of data points via a fit function that depends on two variables, let's call these xdata and sdata. The problem is that my curve is rather flat, and I want it to more or less "follow the points".
I've tried using scipy.odr to fit the curve; it works rather well, except that the curve is too flat:
import numpy as np
from math import pi
import matplotlib.pyplot as plt
from scipy.odr import Model, RealData, ODR

mudr = np.array([57.43708609, 46.26119205, 55.60688742, 33.21615894,
                 28.27072848, 22.54649007, 21.80662252, 11.21483444,
                 5.80211921])  # xdata points
dme = np.array([128662.54890776, 105265.32915726, 128652.56835434,
                77968.67019573, 66273.56542068, 58464.58559543,
                54570.66624991, 27286.90038703, 19480.92689266])  # xdata error
dmss22 = np.array([4.90050000e+17, 4.90050000e+17, 4.90050000e+17,
                   4.90050000e+17, 4.90050000e+17, 4.90050000e+17,
                   4.90050000e+17, 4.90050000e+17, 4.90050000e+17])  # sdata points
dmse = np.array([1.09777592e+21, 1.11512117e+21, 1.13381702e+21,
                 1.15033267e+21, 1.14883089e+21, 1.27076265e+21,
                 1.22637165e+21, 1.19237598e+21, 1.64539205e+21])  # sdata error
F = np.array([115.01944248, 110.24354867, 112.77812389, 104.81830088,
              104.35746903, 101.32016814, 100.54513274, 96.94226549,
              93.00424779])  # ydata points
dF = np.array([72710.75386699, 72590.6256987, 176539.40403673,
               130555.27503081, 124299.52080164, 176426.64340597,
               143013.52848306, 122117.93022746, 157547.78395513])  # ydata error

# NOTE: afij (nine expansion coefficients) is used below but never defined
# in the posted snippet; it has to be supplied separately.
def Ffitsso(p, X, B=2.58, Fc=92.2, mu=770, Za=0.9468):  # fit function
    temp1 = (2*B*X[0])/(4*pi*Fc)**2
    temp2 = temp1*(afij[0] + afij[1]*np.log((2*B*X[0])/mu**2))
    temp3 = temp1**2*(afij[2] + afij[3]*np.log((2*B*X[0])/mu**2) +
                      afij[4]*(np.log((2*B*X[0])/mu**2))**2)
    temp4 = temp1**3*(afij[5] + afij[6]*np.log((2*B*X[0])/mu**2) +
                      afij[7]*(np.log((2*B*X[0])/mu**2))**2 +
                      afij[8]*(np.log((2*B*X[0])/mu**2))**3)
    return Fc/Za*(1 + p[0]*X[1])*(1 + temp2 + temp3 + temp4) + p[1]

# fitting using scipy.odr
xtot = np.vstack((mudr, dmss22))  # row_stack in the post is just an alias of vstack
etot = np.vstack((dme, dmse))     # the post used undefined names (Ze, dmss22e) here
fitting = Model(Ffitsso)
mydata = RealData(xtot, F, sx=etot, sy=dF)  # the post passed an undefined etot2
myodr = ODR(mydata, fitting, beta0=[0, 100])
myoutput = myodr.run()
myoutput.pprint()
bet = myoutput.beta
plt.plot(mudr, F, "b^")
plt.plot(mudr, Ffitsso(bet, [mudr, dmss22]))
p[0]*X[1] in the fit function is supposed to be small compared to 1, but the fitted value for p[0] is of order 1e-18 whilst the dmss22 values are of order 1e+17, so the product is not small enough.
Even worse, p[0] is negative, meaning the function decreases, which is not supposed to happen; it is supposed to increase like the plotted data points.
Edit: Fixed! I didn't know the fit was so sensitive to the initial beta values; setting beta0[0] = 1.5e-15 makes it work.
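In code, the fix from the edit is just a different starting vector handed to ODR (the first entry is the value quoted above; the second is kept from the original call):
myodr = ODR(mydata, fitting, beta0=[1.5e-15, 100])
myoutput = myodr.run()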
Here is a graphical fitter with both curve_fit and ODR fitters using scipy's Differential Evolution (DE) genetic algorithm to supply initial parameter estimates for the non-linear solvers. The scipy implementation of DE uses the Latin Hypercube algorithm to ensure a thorough search of parameter space, and this requires parameter bounds within which to search - in this example, these bounds are taken from the data maximum and minimum values. Note that it is much easier to give bounds for the initial parameter estimates rather than individual specific values.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import scipy.odr
from scipy.optimize import differential_evolution
import warnings
xData = numpy.array([1.1, 2.2, 3.3, 4.4, 5.0, 6.6, 7.7, 0.0])
yData = numpy.array([1.1, 20.2, 30.3, 40.4, 50.0, 60.6, 70.7, 0.1])
def func(x, a, b, c, d, offset):  # curve fitting function for curve_fit()
    return a*numpy.exp(-(x-b)**2/(2*c**2)+d) + offset

def func_wrapper_for_ODR(parameters, x):  # parameter order for ODR
    return func(x, *parameters)

# function for genetic algorithm to minimize (sum of squared error)
def sumOfSquaredError(parameterTuple):
    warnings.filterwarnings("ignore")  # do not print warnings by genetic algorithm
    val = func(xData, *parameterTuple)
    return numpy.sum((yData - val) ** 2.0)

def generate_Initial_Parameters():
    # min and max used for bounds
    maxX = max(xData)
    minX = min(xData)
    maxY = max(yData)
    minY = min(yData)
    parameterBounds = []
    parameterBounds.append([minY, maxY])  # search bounds for a
    parameterBounds.append([minX, maxX])  # search bounds for b
    parameterBounds.append([minX, maxX])  # search bounds for c
    parameterBounds.append([minY, maxY])  # search bounds for d
    parameterBounds.append([0.0, maxY])   # search bounds for Offset
    # "seed" the numpy random number generator for repeatable results
    result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
    return result.x
geneticParameters = generate_Initial_Parameters()
##########################
# curve_fit section
##########################
fittedParameters_curvefit, pcov = curve_fit(func, xData, yData, geneticParameters)
print('Fitted parameters curve_fit:', fittedParameters_curvefit)
print()
modelPredictions_curvefit = func(xData, *fittedParameters_curvefit)
absError_curvefit = modelPredictions_curvefit - yData
SE_curvefit = numpy.square(absError_curvefit) # squared errors
MSE_curvefit = numpy.mean(SE_curvefit) # mean squared errors
RMSE_curvefit = numpy.sqrt(MSE_curvefit) # Root Mean Squared Error, RMSE
Rsquared_curvefit = 1.0 - (numpy.var(absError_curvefit) / numpy.var(yData))
print()
print('RMSE curve_fit:', RMSE_curvefit)
print('R-squared curve_fit:', Rsquared_curvefit)
print()
##########################
# ODR section
##########################
data = scipy.odr.odrpack.Data(xData,yData)
model = scipy.odr.odrpack.Model(func_wrapper_for_ODR)
odr = scipy.odr.odrpack.ODR(data, model, beta0=geneticParameters)
# Run the regression.
odr_out = odr.run()
print('Fitted parameters ODR:', odr_out.beta)
print()
modelPredictions_odr = func(xData, *odr_out.beta)
absError_odr = modelPredictions_odr - yData
SE_odr = numpy.square(absError_odr) # squared errors
MSE_odr = numpy.mean(SE_odr) # mean squared errors
RMSE_odr = numpy.sqrt(MSE_odr) # Root Mean Squared Error, RMSE
Rsquared_odr = 1.0 - (numpy.var(absError_odr) / numpy.var(yData))
print()
print('RMSE ODR:', RMSE_odr)
print('R-squared ODR:', Rsquared_odr)
print()
##########################################################
# graphics output section
def ModelsAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)
    # first the raw data as a scatter plot
    axes.plot(xData, yData, 'D')
    # create data for the fitted equation plots
    xModel = numpy.linspace(min(xData), max(xData))
    yModel_curvefit = func(xModel, *fittedParameters_curvefit)
    yModel_odr = func(xModel, *odr_out.beta)
    # now the models as line plots
    axes.plot(xModel, yModel_curvefit)
    axes.plot(xModel, yModel_odr)
    axes.set_xlabel('X Data')  # X axis data label
    axes.set_ylabel('Y Data')  # Y axis data label
    plt.show()
    plt.close('all')  # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelsAndScatterPlot(graphWidth, graphHeight)

Fit sigmoid curve in python

Thanks ahead! I am trying to fit a sigmoid curve over some data; below is my code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
# ====== some code in between ======
plt.scatter(drag[0].w,drag[0].s, s = 10, label = 'drag%d'%0)
def sigmoid(x, x0, k):
    y = 1.0/(1.0+np.exp(-x0*(x-k)))
    return y
popt,pcov = curve_fit(sigmoid, drag[0].w, drag[0].s)
xx = np.linspace(10,1000,10)
yy = sigmoid(xx, *popt)
plt.plot(xx,yy,'r-', label='fit')
plt.legend(loc='upper left')
plt.xlabel('weight(kg)', fontsize=12)
plt.ylabel('wing span(m)', fontsize=12)
plt.show()
This now shows the graph below, which is not a very good fit; the fitted curve is the red one at the bottom.
What are the possible solutions?
I am also open to other methods of fitting logistic curves to this set of data.
Thanks again!
Here is an example graphical fitter using your equation with an amplitude scaling factor for my test data. This code uses scipy's Differential Evolution genetic algorithm to provide initial parameter estimates for curve_fit(), as the scipy default initial parameter estimates of all 1.0 are not always optimal. The scipy implementation of Differential Evolution uses the Latin Hypercube algorithm to ensure a thorough search of parameter space, and this requires bounds within which to search. In this example those bounds are taken from the example data I provide, when using your own data please check that the bounds seem reasonable. Note that ranges on the parameters are much easier to provide than specific values for the initial parameter estimates.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.optimize import differential_evolution
import warnings
xData = numpy.array([19.1647, 18.0189, 16.9550, 15.7683, 14.7044, 13.6269, 12.6040, 11.4309, 10.2987, 9.23465, 8.18440, 7.89789, 7.62498, 7.36571, 7.01106, 6.71094, 6.46548, 6.27436, 6.16543, 6.05569, 5.91904, 5.78247, 5.53661, 4.85425, 4.29468, 3.74888, 3.16206, 2.58882, 1.93371, 1.52426, 1.14211, 0.719035, 0.377708, 0.0226971, -0.223181, -0.537231, -0.878491, -1.27484, -1.45266, -1.57583, -1.61717])
yData = numpy.array([0.644557, 0.641059, 0.637555, 0.634059, 0.634135, 0.631825, 0.631899, 0.627209, 0.622516, 0.617818, 0.616103, 0.613736, 0.610175, 0.606613, 0.605445, 0.603676, 0.604887, 0.600127, 0.604909, 0.588207, 0.581056, 0.576292, 0.566761, 0.555472, 0.545367, 0.538842, 0.529336, 0.518635, 0.506747, 0.499018, 0.491885, 0.484754, 0.475230, 0.464514, 0.454387, 0.444861, 0.437128, 0.415076, 0.401363, 0.390034, 0.378698])
def sigmoid(x, amplitude, x0, k):
    return amplitude * 1.0/(1.0+numpy.exp(-x0*(x-k)))
# function for genetic algorithm to minimize (sum of squared error)
def sumOfSquaredError(parameterTuple):
    warnings.filterwarnings("ignore")  # do not print warnings by genetic algorithm
    val = sigmoid(xData, *parameterTuple)
    return numpy.sum((yData - val) ** 2.0)

def generate_Initial_Parameters():
    # min and max used for bounds
    maxX = max(xData)
    minX = min(xData)
    maxY = max(yData)
    minY = min(yData)
    parameterBounds = []
    parameterBounds.append([minY, maxY])  # search bounds for amplitude
    parameterBounds.append([minX, maxX])  # search bounds for x0
    parameterBounds.append([minX, maxX])  # search bounds for k
    # "seed" the numpy random number generator for repeatable results
    result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
    return result.x
# by default, differential_evolution polishes its best candidate with a
# local minimizer (L-BFGS-B), staying within the parameter bounds
geneticParameters = generate_Initial_Parameters()
# now call curve_fit without passing bounds from the genetic algorithm,
# just in case the best fit parameters are outside those bounds
fittedParameters, pcov = curve_fit(sigmoid, xData, yData, geneticParameters)
print('Fitted parameters:', fittedParameters)
print()
modelPredictions = sigmoid(xData, *fittedParameters)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print()
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)
    # first the raw data as a scatter plot
    axes.plot(xData, yData, 'D')
    # create data for the fitted equation plot
    xModel = numpy.linspace(min(xData), max(xData))
    yModel = sigmoid(xModel, *fittedParameters)
    # now the model as a line plot
    axes.plot(xModel, yModel)
    axes.set_xlabel('X Data')  # X axis data label
    axes.set_ylabel('Y Data')  # Y axis data label
    plt.show()
    plt.close('all')  # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
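If you'd rather not run the genetic algorithm, a lighter-weight alternative is to derive rough starting values straight from the data. The heuristics below (plateau level for the amplitude, trend sign for the slope, median x for the midpoint) are my own assumptions, reusing xData, yData, sigmoid and curve_fit from the example above, and may need tuning for other datasets:
p0 = [yData.max(),                                    # amplitude ~ upper plateau
      numpy.sign(numpy.polyfit(xData, yData, 1)[0]),  # slope direction of the trend
      numpy.median(xData)]                            # midpoint near the data centre
fittedParameters, pcov = curve_fit(sigmoid, xData, yData, p0=p0)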

How can I plot the label on the line of a lineplot?

I would like to plot labels on a line of a lineplot in matplotlib.
Minimal example
#!/usr/bin/env python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
sns.set_palette(sns.color_palette("Greens", 8))
from scipy.ndimage import gaussian_filter1d  # scipy.ndimage.filters is deprecated

for i in range(8):
    # Create data
    y = np.roll(np.cumsum(np.random.randn(1000, 1)),
                np.random.randint(0, 1000))
    y = gaussian_filter1d(y, 10)
    plt.plot(y, label=str(i))  # sns.plt was removed from seaborn; use pyplot directly
plt.legend()
plt.show()
generates
instead, I would prefer something like
Maybe a bit hacky, but does this solve your problem?
#!/usr/bin/env python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
sns.set_palette(sns.color_palette("Greens", 8))
from scipy.ndimage import gaussian_filter1d

for i in range(8):
    # Create data
    y = np.roll(np.cumsum(np.random.randn(1000, 1)),
                np.random.randint(0, 1000))
    y = gaussian_filter1d(y, 10)
    p = plt.plot(y, label=str(i))
    color = p[0].get_color()
    for x in [250, 500, 750]:
        y2 = y[x]
        plt.plot(x, y2, 'o', color='white', markersize=9)
        plt.plot(x, y2, 'k', marker="$%s$" % str(i), color=color,
                 markersize=7)
plt.legend()
plt.show()
Here's the result I get:
Edit: I gave it a little more thought and came up with a solution that automatically tries to find the best possible position for the labels in order to avoid the labels being positioned at x-values where two lines are very close to each other (which could e.g. lead to overlap between the labels):
#!/usr/bin/env python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set_style("whitegrid")
sns.set_palette(sns.color_palette("Greens", 8))
from scipy.ndimage import gaussian_filter1d

# -----------------------------------------------------------------------------
def inline_legend(lines, n_markers=1):
    """
    Take a list containing the lines of a plot (typically the result of
    calling plt.gca().get_lines()), and add the labels for those lines on the
    lines themselves; more precisely, put each label n_markers times on the
    line.
    [Source of problem: https://stackoverflow.com/q/43573623/4100721]
    """
    from scipy.interpolate import interp1d
    from math import fabs

    def chunkify(a, n):
        """
        Split list a into n approximately equally sized chunks and return the
        indices (start/end) of those chunks.
        [Idea: Props to http://stackoverflow.com/a/2135920/4100721 :)]
        """
        k, m = divmod(len(a), n)
        return list([(i * k + min(i, m), (i + 1) * k + min(i + 1, m))
                     for i in range(n)])

    # Calculate linear interpolations of every line. This is necessary to
    # compare the values of the lines if they use different x-values
    interpolations = [interp1d(_.get_xdata(), _.get_ydata())
                      for _ in lines]

    # Loop over all lines
    for idx, line in enumerate(lines):

        # Get basic properties of the current line
        label = line.get_label()
        color = line.get_color()
        x_values = line.get_xdata()
        y_values = line.get_ydata()

        # Get all lines that are not the current line, as well as the
        # functions that are linear interpolations of them
        other_lines = lines[0:idx] + lines[idx+1:]
        other_functions = interpolations[0:idx] + interpolations[idx+1:]

        # Split the x-values in chunks to get regions in which to put
        # labels. Creating 3 times as many chunks as requested and using only
        # every third ensures that no two labels for the same line are too
        # close to each other.
        chunks = list(chunkify(line.get_xdata(), 3*n_markers))[::3]

        # For each chunk, find the optimal position of the label
        for chunk_nr in range(n_markers):

            # Start and end index of the current chunk
            chunk_start = chunks[chunk_nr][0]
            chunk_end = chunks[chunk_nr][1]

            # For the given chunk, loop over all x-values of the current line,
            # evaluate the value of every other line at every such x-value,
            # and store the result.
            other_values = [[fabs(y_values[int(x)] - f(x)) for x in
                             x_values[chunk_start:chunk_end]]
                            for f in other_functions]

            # Now loop over these values and find the minimum, i.e. for every
            # x-value in the current chunk, find the distance to the closest
            # other line ("closest" meaning abs_value(value(current line at x)
            # - value(other lines at x)) being at its minimum)
            distances = [min([_ for _ in [row[i] for row in other_values]])
                         for i in range(len(other_values[0]))]

            # Now find the value of x in the current chunk where the distance
            # is maximal, i.e. the best position for the label, and add the
            # necessary offset to take into account that the index obtained
            # from "distances" is relative to the current chunk
            best_pos = distances.index(max(distances)) + chunks[chunk_nr][0]

            # Short notation for the position of the label
            x = best_pos
            y = y_values[x]

            # Actually plot the label onto the line at the calculated position
            plt.plot(x, y, 'o', color='white', markersize=9)
            plt.plot(x, y, 'k', marker="$%s$" % label, color=color,
                     markersize=7)
# -----------------------------------------------------------------------------

for i in range(8):
    # Create data
    y = np.roll(np.cumsum(np.random.randn(1000, 1)),
                np.random.randint(0, 1000))
    y = gaussian_filter1d(y, 10)
    plt.plot(y, label=str(i))

inline_legend(plt.gca().get_lines(), n_markers=3)
plt.show()
Example output of this solution (note how the x-positions of the labels are no longer all the same):
If one wants to avoid the use of scipy.interpolate.interp1d, one might consider a solution where, for a given x-value of line A, one finds the x-value of line B that is closest to it; a sketch of that lookup follows below. I think this might be problematic, though, if the lines use very different and/or sparse grids?
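A minimal sketch of that nearest-x lookup (my addition; it assumes each line's x-values are sorted, and the function name is mine):
import numpy as np

def nearest_y(x_query, x_other, y_other):
    """y-values of the other line at the grid point closest to each query x."""
    idx = np.clip(np.searchsorted(x_other, x_query), 1, len(x_other) - 1)
    # choose the closer of the two neighbouring grid points
    left_is_closer = (x_query - x_other[idx - 1]) < (x_other[idx] - x_query)
    idx = np.where(left_is_closer, idx - 1, idx)
    return y_other[idx]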

Show confidence limits and prediction limits in scatter plot

I have two arrays of data for height and weight:
import numpy as np, matplotlib.pyplot as plt
heights = np.array([50,52,53,54,58,60,62,64,66,67,68,70,72,74,76,55,50,45,65])
weights = np.array([25,50,55,75,80,85,50,65,85,55,45,45,50,75,95,65,50,40,45])
plt.plot(heights,weights,'bo')
plt.show()
How can I produce a plot similar to the following?
Here's what I put together. I tried to closely emulate your screenshot.
Given
import numpy as np
import scipy as sp
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline
# Raw Data
heights = np.array([50,52,53,54,58,60,62,64,66,67,68,70,72,74,76,55,50,45,65])
weights = np.array([25,50,55,75,80,85,50,65,85,55,45,45,50,75,95,65,50,40,45])
Two detailed options to plot confidence intervals:
def plot_ci_manual(t, s_err, n, x, x2, y2, ax=None):
    r"""Return an axes of confidence bands using a simple approach.

    Notes
    -----
    .. math:: \left| \: \hat{\mu}_{y|x0} - \mu_{y|x0} \: \right| \; \leq \; T_{n-2}^{.975} \; \hat{\sigma} \; \sqrt{\frac{1}{n}+\frac{(x_0-\bar{x})^2}{\sum_{i=1}^n{(x_i-\bar{x})^2}}}
    .. math:: \hat{\sigma} = \sqrt{\sum_{i=1}^n{\frac{(y_i-\hat{y})^2}{n-2}}}

    References
    ----------
    .. [1] M. Duarte. "Curve fitting," Jupyter Notebook.
       http://nbviewer.ipython.org/github/demotu/BMC/blob/master/notebooks/CurveFitting.ipynb
    """
    if ax is None:
        ax = plt.gca()

    ci = t * s_err * np.sqrt(1/n + (x2 - np.mean(x))**2 / np.sum((x - np.mean(x))**2))
    ax.fill_between(x2, y2 + ci, y2 - ci, color="#b9cfe7", edgecolor="")

    return ax
def plot_ci_bootstrap(xs, ys, resid, nboot=500, ax=None):
    """Return an axes of confidence bands using a bootstrap approach.

    Notes
    -----
    The bootstrap approach iteratively resamples residuals.
    It plots `nboot` number of straight lines and outlines the shape of a band.
    The density of overlapping lines indicates improved confidence.

    Returns
    -------
    ax : axes
        - Cluster of lines
        - Upper and Lower bounds (high and low) (optional)  Note: sensitive to outliers

    References
    ----------
    .. [1] J. Stults. "Visualizing Confidence Intervals," Various Consequences.
       http://www.variousconsequences.com/2010/02/visualizing-confidence-intervals.html
    """
    if ax is None:
        ax = plt.gca()

    bootindex = np.random.randint  # scipy removed its numpy aliases; use numpy directly

    for _ in range(nboot):
        resamp_resid = resid[bootindex(0, len(resid) - 1, len(resid))]
        # Make coefficients for the polynomial fit
        pc = np.polyfit(xs, ys + resamp_resid, 1)
        # Plot bootstrap cluster
        ax.plot(xs, np.polyval(pc, xs), "b-", linewidth=2, alpha=3.0 / float(nboot))

    return ax
Code
# Computations ----------------------------------------------------------------
# Modeling with Numpy
def equation(a, b):
    """Return a 1D polynomial."""
    return np.polyval(a, b)

x = heights
y = weights

p, cov = np.polyfit(x, y, 1, cov=True)  # parameters and covariance from the fit of a 1-D polynomial
y_model = equation(p, x)                # model using the fit parameters; NOTE: parameters here are coefficients
# Statistics
n = weights.size # number of observations
m = p.size # number of parameters
dof = n - m # degrees of freedom
t = stats.t.ppf(0.975, n - m) # t-statistic; used for CI and PI bands
# Estimates of Error in Data/Model
resid = y - y_model # residuals; diff. actual data from predicted values
chi2 = np.sum((resid / y_model)**2) # chi-squared; estimates error in data
chi2_red = chi2 / dof # reduced chi-squared; measures goodness of fit
s_err = np.sqrt(np.sum(resid**2) / dof) # standard deviation of the error
# Plotting --------------------------------------------------------------------
fig, ax = plt.subplots(figsize=(8, 6))
# Data
ax.plot(
    x, y, "o", color="#b9cfe7", markersize=8,
    markeredgewidth=1, markeredgecolor="b", markerfacecolor="None"
)
# Fit
ax.plot(x, y_model, "-", color="0.1", linewidth=1.5, alpha=0.5, label="Fit")
x2 = np.linspace(np.min(x), np.max(x), 100)
y2 = equation(p, x2)
# Confidence Interval (select one)
plot_ci_manual(t, s_err, n, x, x2, y2, ax=ax)
#plot_ci_bootstrap(x, y, resid, ax=ax)
# Prediction Interval
pi = t * s_err * np.sqrt(1 + 1/n + (x2 - np.mean(x))**2 / np.sum((x - np.mean(x))**2))
ax.fill_between(x2, y2 + pi, y2 - pi, color="None", linestyle="--")
ax.plot(x2, y2 - pi, "--", color="0.5", label="95% Prediction Limits")
ax.plot(x2, y2 + pi, "--", color="0.5")
#plt.show()
The following modifications are optional, originally implemented to mimic the OP's desired result.
# Figure Modifications --------------------------------------------------------
# Borders
ax.spines["top"].set_color("0.5")
ax.spines["bottom"].set_color("0.5")
ax.spines["left"].set_color("0.5")
ax.spines["right"].set_color("0.5")
ax.get_xaxis().set_tick_params(direction="out")
ax.get_yaxis().set_tick_params(direction="out")
ax.xaxis.tick_bottom()
ax.yaxis.tick_left()
# Labels
plt.title("Fit Plot for Weight", fontsize="14", fontweight="bold")
plt.xlabel("Height")
plt.ylabel("Weight")
plt.xlim(np.min(x) - 1, np.max(x) + 1)
# Custom legend
handles, labels = ax.get_legend_handles_labels()
display = (0, 1)
anyArtist = plt.Line2D((0, 1), (0, 0), color="#b9cfe7") # create custom artists
legend = plt.legend(
    [handle for i, handle in enumerate(handles) if i in display] + [anyArtist],
    [label for i, label in enumerate(labels) if i in display] + ["95% Confidence Limits"],
    loc=9, bbox_to_anchor=(0, -0.21, 1., 0.102), ncol=3, mode="expand"
)
legend.get_frame().set_edgecolor("0.5")  # set_edgecolor returns None, so there is nothing useful to assign
# Save Figure
plt.tight_layout()
plt.savefig("filename.png", bbox_extra_artists=(legend,), bbox_inches="tight")
plt.show()
Output
Using plot_ci_manual():
Using plot_ci_bootstrap():
Hope this helps. Cheers.
Details
I believe that since the legend is outside the figure, it does not show up in matplotlib's popup window. It works fine in Jupyter using %matplotlib inline.
The primary confidence interval code (plot_ci_manual()) is adapted from another source producing a plot similar to the OP. You can select a more advanced technique called residual bootstrapping by uncommenting the second option plot_ci_bootstrap().
Updates
This post has been updated with revised code compatible with Python 3.
stats.t.ppf() accepts the lower tail probability. According to the following resources, t = sp.stats.t.ppf(0.95, n - m) was corrected to t = sp.stats.t.ppf(0.975, n - m) to reflect a two-sided 95% t-statistic (or one-sided 97.5% t-statistic).
original notebook and equation
statistics reference (thanks #Bonlenfum and #tryptofan)
verified t-value given dof=17
y2 was updated to respond more flexibly with a given model (#regeneration).
An abstracted equation function was added to wrap the model function. Non-linear regressions are possible although not demonstrated. Amend appropriate variables as needed (thanks #PJW).
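A quick numerical check of that correction (my addition, not part of the original post):
from scipy import stats
print(stats.t.ppf(0.975, 17))  # 2.1098..., the two-sided 95% t-value for dof = 17
print(stats.t.ppf(0.95, 17))   # 1.7396..., would instead give a 90% two-sided band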
See Also
This post on plotting bands with statsmodels library.
This tutorial on plotting bands and computing confidence intervals with uncertainties library (install with caution in a separate environment).
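For completeness, the statsmodels route mentioned in the first See Also item yields both bands in a few lines; this sketch is my own addition, using statsmodels' OLS prediction API:
import numpy as np
import statsmodels.api as sm

heights = np.array([50,52,53,54,58,60,62,64,66,67,68,70,72,74,76,55,50,45,65])
weights = np.array([25,50,55,75,80,85,50,65,85,55,45,45,50,75,95,65,50,40,45])

X = sm.add_constant(heights)  # design matrix: intercept + slope
res = sm.OLS(weights, X).fit()
frame = res.get_prediction(X).summary_frame(alpha=0.05)
# mean_ci_* columns give the 95% confidence band, obs_ci_* the 95% prediction band
print(frame[["mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])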
You can use the seaborn plotting library to create the plot you want.
In [18]: import seaborn as sns
In [19]: heights = np.array([50,52,53,54,58,60,62,64,66,67, 68,70,72,74,76,55,50,45,65])
...: weights = np.array([25,50,55,75,80,85,50,65,85,55,45,45,50,75,95,65,50,40,45])
...:
In [20]: sns.regplot(heights,weights, color ='blue')
Out[20]: <matplotlib.axes.AxesSubplot at 0x13644f60>
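Note: recent seaborn versions require keyword arguments here, and the shaded band regplot draws is a 95% confidence interval by default (the ci parameter controls it):
sns.regplot(x=heights, y=weights, ci=95, color='blue')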
I need to do this sort of plot occasionally... this was my first time doing it with Python/Jupyter, and this post helped me a lot, especially the detailed Pylang answer.
I know there are 'easier' ways to get there, but I think this way is much more didactic and allows me to learn step by step what's going on. I even learned here that there are 'prediction intervals'! Thanks.
Below is the Pylang code in a more straightforward fashion, including the calculation of Pearson's correlation coefficient (and thus the r²) and the mean squared error (MSE). Of course, the final plot (!) must be adapted for every dataset...
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
heights = np.array([50,52,53,54,58,60,62,64,66,67,68,70,72,74,76,55,50,45,65])
weights = np.array([25,50,55,75,80,85,50,65,85,55,45,45,50,75,95,65,50,40,45])
x = heights
y = weights
slope, intercept = np.polyfit(x, y, 1) # linear model adjustment
y_model = np.polyval([slope, intercept], x) # modeling...
x_mean = np.mean(x)
y_mean = np.mean(y)
n = x.size # number of samples
m = 2 # number of parameters
dof = n - m # degrees of freedom
t = stats.t.ppf(0.975, dof) # Student's t statistic for the two-sided 95% confidence interval
residual = y - y_model
std_error = (np.sum(residual**2) / dof)**.5 # Standard deviation of the error
# calculating the r2
# https://www.statisticshowto.com/probability-and-statistics/coefficient-of-determination-r-squared/
# Pearson's correlation coefficient
numerator = np.sum((x - x_mean)*(y - y_mean))
denominator = ( np.sum((x - x_mean)**2) * np.sum((y - y_mean)**2) )**.5
correlation_coef = numerator / denominator
r2 = correlation_coef**2
# mean squared error
MSE = 1/n * np.sum( (y - y_model)**2 )
# to plot the adjusted model
x_line = np.linspace(np.min(x), np.max(x), 100)
y_line = np.polyval([slope, intercept], x_line)
# confidence interval
ci = t * std_error * (1/n + (x_line - x_mean)**2 / np.sum((x - x_mean)**2))**.5
# prediction interval
pi = t * std_error * (1 + 1/n + (x_line - x_mean)**2 / np.sum((x - x_mean)**2))**.5
############### Plotting
plt.rcParams.update({'font.size': 14})
fig = plt.figure()
ax = fig.add_axes([.1, .1, .8, .8])
ax.plot(x, y, 'o', color = 'royalblue')
ax.plot(x_line, y_line, color = 'royalblue')
ax.fill_between(x_line, y_line + pi, y_line - pi, color = 'lightcyan', label = '95% prediction interval')
ax.fill_between(x_line, y_line + ci, y_line - ci, color = 'skyblue', label = '95% confidence interval')
ax.set_xlabel('x')
ax.set_ylabel('y')
# rounding and position must be changed for each case and preference
a = str(np.round(intercept))
b = str(np.round(slope,2))
r2s = str(np.round(r2,2))
MSEs = str(np.round(MSE))
ax.text(45, 110, 'y = ' + a + ' + ' + b + ' x')
ax.text(45, 100, '$r^2$ = ' + r2s + ' MSE = ' + MSEs)
plt.legend(bbox_to_anchor=(1, .25), fontsize=12)
For a project of mine, I needed to create intervals for time-series modeling, and to make the procedure more efficient I created tsmoothie: a Python library for time-series smoothing and outlier detection in a vectorized way.
It provides different smoothing algorithms together with the possibility to compute intervals.
In the case of linear regression:
import numpy as np
import matplotlib.pyplot as plt
from tsmoothie.smoother import *
from tsmoothie.utils_func import sim_randomwalk
# generate 10 randomwalks of length 50
np.random.seed(33)
data = sim_randomwalk(n_series=10, timesteps=50,
                      process_noise=10, measure_noise=30)
# operate smoothing
smoother = PolynomialSmoother(degree=1)
smoother.smooth(data)
# generate intervals
low_pi, up_pi = smoother.get_intervals('prediction_interval', confidence=0.05)
low_ci, up_ci = smoother.get_intervals('confidence_interval', confidence=0.05)
# plot the first smoothed timeseries with intervals
plt.figure(figsize=(11,6))
plt.plot(smoother.smooth_data[0], linewidth=3, color='blue')
plt.plot(smoother.data[0], '.k')
plt.fill_between(range(len(smoother.data[0])), low_pi[0], up_pi[0], alpha=0.3, color='blue')
plt.fill_between(range(len(smoother.data[0])), low_ci[0], up_ci[0], alpha=0.3, color='blue')
In the case of regression with order bigger than 1:
# operate smoothing
smoother = PolynomialSmoother(degree=5)
smoother.smooth(data)
# generate intervals
low_pi, up_pi = smoother.get_intervals('prediction_interval', confidence=0.05)
low_ci, up_ci = smoother.get_intervals('confidence_interval', confidence=0.05)
# plot the first smoothed timeseries with intervals
plt.figure(figsize=(11,6))
plt.plot(smoother.smooth_data[0], linewidth=3, color='blue')
plt.plot(smoother.data[0], '.k')
plt.fill_between(range(len(smoother.data[0])), low_pi[0], up_pi[0], alpha=0.3, color='blue')
plt.fill_between(range(len(smoother.data[0])), low_ci[0], up_ci[0], alpha=0.3, color='blue')
I also point out that tsmoothie can carry out the smoothing of multiple time series in a vectorized way. Hope this can help someone!