best curve fitting the distribution

I tried to fit a data series with a third-degree polynomial, but it still does not fit well (some points are off in the graph shown below). I also tried adding a log term, but the result did not improve.
What would be the best curve to fit here?
Here are the raw data points I have:
x_values = [ 0.51,0.56444444,0.61888889 , 0.67333333 , 0.72777778, 0.78222222, 0.83666667, 0.89111111 , 0.94555556 , 1. ]
y_values = [0.67154591, 0.66657266, 0.65878351, 0.6488696, 0.63499979, 0.6202393, 0.59887225, 0.56689689, 0.51768976, 0.33029004]
Results with polynomial fit:

It would be better if your curve fitting were hypothesis-driven, i.e., if you already had an idea of what kind of relationship to expect. To me, the shape looks more like an exponential function:
from matplotlib import pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
#the function that describes the data
def func(x, a, b, c, d):
    return a * np.exp(b * x + c) + d
x_values = [0.51,0.56444444, 0.61888889, 0.67333333 , 0.72777778, 0.78222222, 0.83666667, 0.89111111 , 0.94555556 , 1. ]
y_values = [0.67154591, 0.66657266, 0.65878351, 0.6488696, 0.63499979, 0.6202393, 0.59887225, 0.56689689, 0.51768976, 0.33029004]
#start values [a, b, c, d]
start = [-.1, 1, 0, .1]
#curve fitting
popt, pcov = curve_fit(func, x_values, y_values, p0 = start)
#output [a, b, c, d]
print(popt)
#calculating the fit curve at a better resolution
x_fit = np.linspace(min(x_values), max(x_values), 1000)
y_fit = func(x_fit, *popt)
#plot data and fit
plt.scatter(x_values, y_values, label = "data")
plt.plot(x_fit, y_fit, label = "fit")
plt.legend()
plt.show()
This gives the following output:
This still does not look correct; the first part seems to have a linear offset. If we take this into consideration:
from matplotlib import pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
def func(x, a, b, c, d, e):
    return a * np.exp(b * x + c) + d * x + e
x_values = [0.51,0.56444444, 0.61888889, 0.67333333 , 0.72777778, 0.78222222, 0.83666667, 0.89111111 , 0.94555556 , 1. ]
y_values = [0.67154591, 0.66657266, 0.65878351, 0.6488696, 0.63499979, 0.6202393, 0.59887225, 0.56689689, 0.51768976, 0.33029004]
start = [-.1, 1, 0, .1, 1]
popt, pcov = curve_fit(func, x_values, y_values, p0 = start)
print(popt)
x_fit = np.linspace(min(x_values), max(x_values), 1000)
y_fit = func(x_fit, *popt)
plt.scatter(x_values, y_values, label = "data")
plt.plot(x_fit, y_fit, label = "fit")
plt.legend()
plt.show()
We have the following output:
This is now closer to your data points.
BUT: you should go back to your data and think about which model is most likely to reflect reality, then implement that model. You can always construct more complicated functions that fit your data better, but they do not necessarily reflect reality better.
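The caveat above can be illustrated with a minimal sketch that uses only np.polyfit on the data from the question. The choice of degrees 1, 3 and 5 is an arbitrary assumption for illustration. Because the polynomial models are nested, the residual sum of squares (RSS) can only shrink as the degree grows, which by itself says nothing about whether the more flexible model describes the underlying process:

```python
import numpy as np

x_values = np.array([0.51, 0.56444444, 0.61888889, 0.67333333, 0.72777778,
                     0.78222222, 0.83666667, 0.89111111, 0.94555556, 1.0])
y_values = np.array([0.67154591, 0.66657266, 0.65878351, 0.6488696, 0.63499979,
                     0.6202393, 0.59887225, 0.56689689, 0.51768976, 0.33029004])

# Residual sum of squares for least-squares polynomials of increasing degree
rss = {}
for degree in (1, 3, 5):
    coeffs = np.polyfit(x_values, y_values, degree)
    residuals = y_values - np.polyval(coeffs, x_values)
    rss[degree] = float(np.sum(residuals**2))
    print(f"degree {degree}: RSS = {rss[degree]:.6f}")
```

A smaller RSS is therefore not a sufficient reason to prefer a model; the choice should come from what is known about the data.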

Related

scipy weird unexpected behavior curve_fit large data set for sin wave

For some reason, when I try to fit a large amount of data to a sine wave, the fit fails and returns a horizontal line. Can somebody explain why?
Minimal working code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
# Seed the random number generator for reproducibility
np.random.seed(0)
# Here it works as expected
# x_data = np.linspace(-5, 5, num=50)
# y_data = 2.9 * np.sin(1.05 * x_data + 2) + 250 + np.random.normal(size=50)
# With this data it breaks
x_data = np.linspace(0, 2500, num=2500)
y_data = -100 * np.sin(0.01 * x_data + 1) + 250 + np.random.normal(size=2500)
# And plot it
plt.figure(figsize=(6, 4))
plt.scatter(x_data, y_data)
def test_func(x, a, b, c, d):
    return a * np.sin(b * x + c) + d
# Used to fit the correct function
# params, params_covariance = optimize.curve_fit(test_func, x_data, y_data)
# making some guesses
params, params_covariance = optimize.curve_fit(test_func, x_data, y_data,
                                               p0=[-80, 3, 0, 260])
print(params)
plt.figure(figsize=(6, 4))
plt.scatter(x_data, y_data, label='Data')
plt.plot(x_data, test_func(x_data, *params),
         label='Fitted function')
plt.legend(loc='best')
plt.show()
Does anybody know how to fix this issue? Should I use a fitting method other than least squares? Or should I reduce the number of data points?

Given your data, you can use the more robust lmfit instead of scipy.
In particular, you can use SineModel (see the lmfit documentation for details).
SineModel in lmfit is not for "shifted" sine waves, but you can easily deal with the shift by subtracting the mean first:
y_data_offset = y_data.mean()
y_transformed = y_data - y_data_offset
plt.scatter(x_data, y_transformed)
plt.axhline(0, color='r')
Now you can fit a sine wave:
from lmfit.models import SineModel
mod = SineModel()
pars = mod.guess(y_transformed, x=x_data)
out = mod.fit(y_transformed, pars, x=x_data)
You can inspect the results with print(out.fit_report()) and plot them with
plt.plot(x_data, y_data, lw=7, color='C1')
plt.plot(x_data, out.best_fit+y_data_offset, color='k')
# we add the offset ^^^^^^^^^^^^^
or with the built-in plot method out.plot_fit() (see the lmfit documentation for details).
Note that in SineModel all parameters "are constrained to be non-negative", so the negative amplitude you defined (-100) comes out positive (+100) in the fit results. For the same reason, the phase is not 1 but π+1 (lmfit calls the phase "shift"):
print(out.best_values)
{'amplitude': 99.99631403054289,
'frequency': 0.010001193681616227,
'shift': 4.1400215410836605}

Extracting BCI Geodetic and ECI coordinates of an orbit

I am using astropy to define a Tundra orbit around Earth, and I would like to extract the ECI and geodetic coordinates as the object propagates in time. I was able to get something, but it does not agree with what I would expect (ECI coordinates extracted from another software package). The two orbits are not even in the same plane, which is clearly wrong.
Can anybody tell me if I am doing something obviously wrong?
The plot below shows the two results. Orange is with Astropy.
import astropy
from astropy import units as u
from poliastro.bodies import Earth
from astropy.coordinates import CartesianRepresentation
from poliastro.twobody import Orbit
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
epoch = astropy.time.Time('2020-01-01T00:00:00.000', scale='tt')
# Tundra
tundra1 = Orbit.from_classical(attractor=Earth,
                               a = 42164 *u.km,
                               ecc = 0.2684 * u.one,
                               inc = 63.4 * u.deg,
                               raan = 25 * u.deg,
                               argp = 270 * u.deg,
                               nu = 50 * u.deg,
                               # epoch=epoch
                               )
def plot_orb(orb, start_t, end_t, step_t, ax, c='k'):
    # propagate the orbit to each time step (in minutes)
    orb_list = []
    for t in np.arange(start_t, end_t, step_t):
        single_orb = orb.propagate(t*u.min)
        orb_list = orb_list + [single_orb]
    xyz = orb.sample().xyz
    ax.plot(*xyz,'r')
    s_xyz_ar = np.zeros((len(orb_list), 3))
    for i, s_orb in enumerate(orb_list):
        s_xyz = s_orb.represent_as(CartesianRepresentation).xyz
        s_xyz_ar[i, :] = s_xyz
    # pass the color by keyword; the 4th positional argument of a 3D scatter is not the color
    ax.scatter(s_xyz_ar[:, 0], s_xyz_ar[:, 1], s_xyz_ar[:, 2], c=c)
    return s_xyz_ar, t
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
s_xyz_ar1, t1 = plot_orb(orb=tundra1, start_t=0, end_t=1440, step_t=10, ax=ax, c='k')

When I wrote that you can do this more efficiently, I was under the mistaken assumption that Orbit.propagate can be called directly on an array of time steps, like:
>>> tt = np.arange(0, 1440, 10) * u.min
>>> orb = tundra1.propagate(tt)
While this "works" in that it returns a new orbit with an array of epochs, it appears Orbit is not really designed to work with an array of epochs and trying to do something like orb.represent_as just returns a value for the first epoch in the array. This would be a nice possible enhancement to poliastro.
However, the code you wrote for the scatter plot can still be significantly simplified to something like this:
>>> tt = np.arange(0, 1440, 10) * u.min
>>> xyz = np.vstack([tundra1.propagate(t).represent_as(CartesianRepresentation).xyz for t in tt])
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111, projection='3d')
>>> ax.scatter(*xyz.T)
>>> fig.show()
Result:
Ideally you should be able to do this without the np.vstack and instead just call tundra1.propagate(tt).represent_as(CartesianRepresentation).xyz without a for loop. But as the above demonstrates you can still simplify a lot by using np.vstack to make an array from a list of (x, y, z) triplets.
I'm not sure this really answers your original question, though; it seems you found the answer yourself, and it wasn't really related to the code. Still, I hope this helps!

Calculating and plotting parametric equations in sympy

So I'm struggling with these parametric equations in SymPy:
𝑓(𝜃) = cos(𝜃) − sin(𝑎𝜃) and 𝑔(𝜃) = sin(𝜃) + cos(𝑎𝜃)
with 𝑎 ∈ ℝ∖{0}.
import matplotlib.pyplot as plt
import sympy as sp
from IPython.display import display
sp.init_printing()
%matplotlib inline
This is what I have to define them:
f = sp.Function('f')
g = sp.Function('g')
f = sp.cos(th) - sp.sin(a*th)
g = sp.sin(th) + sp.cos(a*th)
I don't know how to define a with the domain ℝ∖{0} and it gives me trouble when I want to solve the equation
𝑓(𝜃)+𝑔(𝜃)=0
The solution should be:
𝜃=[3𝜋/4,3𝜋/4𝑎,𝜋/2(𝑎−1),𝜋/(𝑎+1)]
Next I want to plot the parametric equations when a=2, a=4, a=6 and a=8. I want to have a different color for every value of a. The most efficient way will probably be with a for-loop.
I also need to use lambdify to have a list of values but I'm fairly new to this so it's a bit vague.
This is what I already have:
fig, ax = plt.subplots(1, figsize=(12, 12))
theta_range = np.linspace(0, 2*np.pi, 750)
colors = ['blue', 'green', 'orange', 'cyan']
a = [2, 4, 6, 8]
for index in range(0, 4):
    # I guess I need to use lambdify here but I don't see how
    pass
plt.show()
Thank you in advance!

You're asking two very different questions: one about solving a symbolic expression, and one about plotting curves.
First, about the symbolic expression. a can be defined as a = sp.symbols('a', real=True, nonzero=True) and theta as th = sp.symbols('theta', real=True). There is no need to define f and g as sympy functions first, since they are immediately reassigned to sympy expressions. To solve the equation, just use sp.solve(f+g, th). Sympy gives [pi, pi/a, pi/(2*(a - 1)), pi/(a + 1)] as the result.
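Symbolic solvers can return solutions of trigonometric equations that only hold under extra conditions, so it can be worth checking the candidates numerically. The sketch below is an addition, not part of the original answer. The helper name fg and the test value a = 3.7 are arbitrary assumptions. It evaluates f + g at each candidate from the list above using only the standard library:

```python
import math

def fg(theta, a):
    """Value of f(theta) + g(theta) for the expressions defined in the question."""
    return (math.cos(theta) - math.sin(a * theta)
            + math.sin(theta) + math.cos(a * theta))

a_val = 3.7  # arbitrary non-integer test value (assumption for illustration)
candidates = {
    "pi": math.pi,
    "pi/a": math.pi / a_val,
    "pi/(2*(a - 1))": math.pi / (2 * (a_val - 1)),
    "pi/(a + 1)": math.pi / (a_val + 1),
}
# Residual of the equation f + g = 0 at each candidate
residuals = {name: fg(theta, a_val) for name, theta in candidates.items()}
for name, value in residuals.items():
    print(f"theta = {name}: f + g = {value:.3e}")
```

With this test value, the last two candidates give residuals at rounding-error level while the first two do not, which suggests that pi and pi/a solve the equation only for particular values of a.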
Sympy also has a plotting function, which could be called as sp.plot(*[(f+g).subs({a:a_val}) for a_val in [2, 4, 6, 8]]). But there is very limited support for options such as color.
To have more control, matplotlib can do the plotting based on numpy functions. sp.lambdify converts the expression: sp.lambdify((th, a), f+g, 'numpy').
Then, matplotlib can do the plotting. There are many options to tune the result.
Here is some example code:
import matplotlib.pyplot as plt
import numpy as np
import sympy as sp
th = sp.symbols('theta', real=True)
a = sp.symbols('a', real=True, nonzero=True)
f = sp.cos(th) - sp.sin(a*th)
g = sp.sin(th) + sp.cos(a*th)
thetas = sp.solve(f+g, th)
print("Solutions for theta:", thetas)
fg_np = sp.lambdify((th, a), f+g, 'numpy')
fig, ax = plt.subplots(1, figsize=(12, 12))
theta_range = np.linspace(0, 2*np.pi, 750)
colors = plt.cm.Set2.colors
for a_val, color in zip([2,4,6,8], colors):
    plt.plot(theta_range, fg_np(theta_range, a_val), color=color, label=f'a={a_val}')
plt.axhline(0, color='black')
plt.xlabel("theta")
plt.ylabel(f+g)
plt.legend()
plt.grid()
plt.autoscale(enable=True, axis='x', tight=True)
plt.show()
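The example above plots f+g against theta. If the goal is the parametric curve itself, with x = f(θ) and y = g(θ), the same lambdify approach can be applied to each coordinate separately. This is a minimal sketch under that reading of the question; the names f_np, g_np and curves are hypothetical, and plotting is left as a comment:

```python
import numpy as np
import sympy as sp

th = sp.symbols('theta', real=True)
a = sp.symbols('a', real=True, nonzero=True)
f = sp.cos(th) - sp.sin(a*th)
g = sp.sin(th) + sp.cos(a*th)

# One numeric function per coordinate of the parametric curve
f_np = sp.lambdify((th, a), f, 'numpy')
g_np = sp.lambdify((th, a), g, 'numpy')

theta_range = np.linspace(0, 2*np.pi, 750)
curves = {}
for a_val in [2, 4, 6, 8]:
    # x and y coordinates of the curve for this value of a
    curves[a_val] = (f_np(theta_range, a_val), g_np(theta_range, a_val))
    # to draw: plt.plot(*curves[a_val], label=f'a={a_val}')
```

Each entry of curves holds the x and y arrays for one value of a, so a color per curve can be assigned in the same loop as in the example above.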

Get the y value of a given x

I have a simple question but have not found an answer.
Let's have a look at this code:
from matplotlib import pyplot
import numpy
x=[0,1,2,3,4]
y=[5,3,40,20,1]
pyplot.plot(x,y)
The data is plotted and all the points are linked.
Let's say I want to get the y value at x=1.3.
How can I also get the x values matching y=30? (There are two.)
Many thanks for your help.

You could use shapely to find the intersections:
import matplotlib.pyplot as plt
import numpy as np
import shapely.geometry as SG
x=[0,1,2,3,4]
y=[5,3,40,20,1]
line = SG.LineString(list(zip(x,y)))
y0 = 30
yline = SG.LineString([(min(x), y0), (max(x), y0)])
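# .geoms iterates over the points of the MultiPoint result (needed with Shapely 2.x)
coords = np.array([(p.x, p.y) for p in line.intersection(yline).geoms])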
print(coords[:, 0])
fig, ax = plt.subplots()
ax.axhline(y=y0, color='k', linestyle='--')
ax.plot(x, y, 'b-')
ax.scatter(coords[:, 0], coords[:, 1], s=50, c='red')
plt.show()
finds solutions for x at:
[ 1.72972973 2.5 ]

The following code might do what you want. Interpolating y(x) is straightforward, as the x-values are monotonically increasing. Finding the x-values for a given y is harder, because the function is not monotonic in this case. So you still need to know roughly where to expect the values to be.
import numpy as np
import scipy.interpolate
import scipy.optimize
x=np.array([0,1,2,3,4])
y=np.array([5,3,40,20,1])
#if the independent variable is monotonically increasing
print(np.interp(1.3, x, y))
# if not, as in the case of finding x(y) here,
# we need to find the zeros of an interpolating function
y0 = 30.
initial_guess = 1.5 #for the first zero,
#initial_guess = 3.0 # for the second zero
f = scipy.interpolate.interp1d(x,y,kind="linear")
fmin = lambda x: np.abs(f(x)-y0)
s = scipy.optimize.fmin(fmin, initial_guess, disp=False)
print(s)
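When a bracketing interval around each crossing is known, a root finder can be used instead of minimizing the absolute difference. This is a minimal sketch of that alternative, not the method of the answer above. It assumes the brackets [1, 2] and [2, 3] are read off the plot, and the helper name shifted is hypothetical:

```python
import numpy as np
from scipy.optimize import brentq

x = np.array([0, 1, 2, 3, 4])
y = np.array([5, 3, 40, 20, 1])
y0 = 30.0

def shifted(v):
    # Linear interpolation of y(x) minus the target, so its roots are the crossings
    return np.interp(v, x, y) - y0

# Each bracket must have opposite signs of shifted() at its two ends
roots = [brentq(shifted, 1, 2), brentq(shifted, 2, 3)]
print(roots)
```

brentq raises a ValueError if the function has the same sign at both ends of a bracket, which makes a wrong guess visible instead of silently returning a non-solution.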

I use Python 3.
print(numpy.interp(1.3, x, y))
Y = 30
eps = 1e-6
j = 0
for i, ((x0, x1), (y0, y1)) in enumerate(zip(zip(x[:-1], x[1:]), zip(y[:-1], y[1:]))):
    dy = y1 - y0
    if abs(dy) < eps:
        if y0 == Y:
            print('There are infinite number of solutions')
    else:
        t = (Y - y0)/dy
        if 0 < t < 1:
            sol = x0 + (x1 - x0)*t
            print('solution #{}: {}'.format(j, sol))
            j += 1
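The same segment-by-segment idea can be wrapped in a function that returns the crossings instead of printing them. This is a minimal sketch; the function name crossings is hypothetical, and flat segments are simply skipped here rather than reported as infinitely many solutions:

```python
def crossings(xs, ys, target):
    """Return x positions where the piecewise-linear curve through (xs, ys) equals target."""
    found = []
    for x0, x1, y0, y1 in zip(xs[:-1], xs[1:], ys[:-1], ys[1:]):
        if y0 == y1:
            continue  # flat segment: no unique crossing (skipped in this sketch)
        t = (target - y0) / (y1 - y0)
        if 0 <= t < 1:  # half-open interval so a shared vertex is counted once
            found.append(x0 + t * (x1 - x0))
    if ys[-1] == target:
        found.append(xs[-1])  # the last vertex is not covered by the half-open test
    return found

result = crossings([0, 1, 2, 3, 4], [5, 3, 40, 20, 1], 30)
print(result)
```

For the data in the question this yields two values, matching the crossings found by the other answers.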

group boxplot histogramming

I would like to group my data and plot a boxplot for each group. There are many questions and answers about that; my problem is that I want to group by a continuous variable, so I want to bin (histogram) my data.
Here is what I have done. My data:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
x = np.random.chisquare(5, size=100000)
y = np.random.normal(size=100000) / (0.05 * x + 0.1) + 2 * x
f, ax = plt.subplots()
ax.plot(x, y, '.', alpha=0.05)
plt.show()
I want to study the behaviour of y (location, width, ...) as a function of x. I am not interested in the distribution of x, so I will normalize it.
f, ax = plt.subplots()
xbins = np.linspace(0, 25, 50)
ybins = np.linspace(-20, 50, 50)
H, xedges, yedges = np.histogram2d(y, x, bins=(ybins, xbins))
norm = np.sum(H, axis = 0)
H /= norm
ax.pcolor(xbins, ybins, np.nan_to_num(H), vmax=.4)
plt.show()
I can plot the histogram, but I want a boxplot:
binning = np.concatenate(([0], np.sort(np.random.random(20) * 25), [25]))
idx = np.digitize(x, binning)
# np.digitize returns 1..len(binning)-1 for x inside the bins, so start at 1 to match midpoints
data_to_plot = [y[idx == i] for i in range(1, len(binning))]
f, ax = plt.subplots()
midpoints = 0.5 * (binning[1:] + binning[:-1])
widths = 0.9 * (binning[1:] - binning[:-1])
from matplotlib.ticker import MultipleLocator, FormatStrFormatter
majorLocator = MultipleLocator(2)
ax.boxplot(data_to_plot, positions = midpoints, widths=widths)
ax.set_xlim(0, 25)
ax.xaxis.set_major_locator(majorLocator)
ax.set_xlabel('x')
ax.set_ylabel('median(y)')
plt.show()
Is there an automatic way to do that, like ax.magic(x, y, binning)? Is there a better way to do that? (Have a look at https://root.cern.ch/root/html/TProfile.html for example, which plots the mean and the error of the mean as error bars.)
In addition, I want to minimize the memory footprint (my real data set is much larger than 100000 points). I am worried about data_to_plot: is it a copy?
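On the memory question: boolean indexing such as y[idx == i] always returns a copy in NumPy, so data_to_plot holds a second copy of the selected data. For a TProfile-like summary (mean and error of the mean per bin), the per-bin sums can be accumulated with np.bincount without building one array per bin. This is a minimal sketch under assumed uniform bin edges, not a replacement for a full boxplot; the names edges, mean and sem are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.chisquare(5, size=100_000)
y = rng.normal(size=100_000) / (0.05 * x + 0.1) + 2 * x

edges = np.linspace(0, 25, 26)   # hypothetical uniform binning
nbins = len(edges) - 1
idx = np.digitize(x, edges)      # 0 = underflow, nbins + 1 = overflow

# Per-bin sums; the slice [1:-1] drops the underflow and overflow bins
count = np.bincount(idx, minlength=nbins + 2)[1:-1]
sum_y = np.bincount(idx, weights=y, minlength=nbins + 2)[1:-1]
sum_y2 = np.bincount(idx, weights=y * y, minlength=nbins + 2)[1:-1]

with np.errstate(divide='ignore', invalid='ignore'):
    mean = sum_y / count                               # NaN for empty bins
    sem = np.sqrt((sum_y2 / count - mean**2) / count)  # error of the mean

centers = 0.5 * (edges[1:] + edges[:-1])
# to draw: ax.errorbar(centers, mean, yerr=sem, fmt='o')
```

The variance here is the population variance (no Bessel correction), which is a simplification that matters little for well-populated bins.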
