Locally weighted smoothing for binary valued random variable - numpy

I have a random variable as follows:
f(x) = 1 with probability g(x)
f(x) = 0 with probability 1-g(x)
where 0 < g(x) < 1.
Assume g(x) = x. Let's say I am observing this variable without knowing the function g and obtained 100 samples as follows:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binned_statistic
list = np.ndarray(shape=(200,2))
g = np.random.rand(200)
for i in range(len(g)):
list[i] = (g[i], np.random.choice([0, 1], p=[1-g[i], g[i]]))
print(list)
plt.plot(list[:,0], list[:,1], 'o')
Plot of 0s and 1s
Now, I would like to retrieve the function g from these points. The best I could think is to use draw a histogram and use the mean statistic:
bin_means, bin_edges, bin_number = binned_statistic(list[:,0], list[:,1], statistic='mean', bins=10)
plt.hlines(bin_means, bin_edges[:-1], bin_edges[1:], lw=2)
Histogram mean statistics
Instead, I would like to have a continuous estimation of the generating function.
I guess it is about kernel density estimation but I could not find the appropriate pointer.

straightforward without explicitly fitting an estimator:
import seaborn as sns
g = sns.lmplot(x= , y= , y_jitter=.02 , logistic=True)
plug in x= your exogenous variable and analogously y = dependent variable. y_jitter is jitter the point for better visibility if you have a lot of data points. logistic = True is the main point here. It will give you the logistic regression line of the data.
Seaborn is basically tailored around matplotlib and works great with pandas, in case you want to extend your data to a DataFrame.

Related

Linear regression to fit a power-law in Python

I have two data sets index_list and frequency_list which I plot in a loglog plot by plt.loglog(index_list, freq_list). Now I'm trying to fit a power law a*x^(-b) with linear regression. I expect the curve to follow the initial curve closely but the following code seems to output a similar curve but mirrored on the y-axis.
I suspect I am using curve_fit badly.
why is this curve mirrored on the x-axis and how I can get it to properly fit my inital curve?
Using this data
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
f = open ("input.txt", "r")
index_list = []
freq_list = []
index = 0
for line in f:
split_line = line.split()
freq_list.append(int(split_line[1]))
index_list.append(index)
index += 1
plt.loglog(index_list, freq_list)
def power_law(x, a, b):
return a * np.power(x, -b)
popt, pcov = curve_fit(power_law, index_list, freq_list)
plt.plot(index_list, power_law(freq_list, *popt))
plt.show()
The code below made the following changes:
For the scipy functions to work, it is best that both index_list and freq_list are numpy arrays, not Python lists. Also, for the power not to overflow too rapidly, these arrays should be of float type (not of int).
As 0 to a negative power causes a divide-by-zero problem, it makes sense to start the index_list with 1.
Due to the powers, also for floats an overflow can be generated. Therefore, it makes sense to add bounds to curve_fit. Especially b should be limited not to cross about 50 (the highest value is about power(100000, b) giving an overflow when be.g. is100). Also setting initial values helps to direct the fitting process (p0=...).
Drawing a plot with index_list as x and power_law(freq_list, ...) as y would generate a very weird curve. It is necessary that the same x is used for the plot and for the function.
Note that calling plt.loglog() changes both axes of the plot to logarithmic. All subsequent plots on the same axes will continue to use the logarithmic scale.
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import pandas as pd
import numpy as np
def power_law(x, a, b):
return a * np.power(x, -b)
df = pd.read_csv("https://norvig.com/google-books-common-words.txt", delim_whitespace=True, header=None)
index_list = df.index.to_numpy(dtype=float) + 1
freq_list = df[1].to_numpy(dtype=float)
plt.loglog(index_list, freq_list, label='given data')
popt, pcov = curve_fit(power_law, index_list, freq_list, p0=[1, 1], bounds=[[1e-3, 1e-3], [1e20, 50]])
plt.plot(index_list, power_law(index_list, *popt), label='power law')
plt.legend()
plt.show()

Evaluation of log density for various values of `mean`

I can evaluate the log probability density of a multivariate normal by doing
import numpy as np
import scipy.stats
scipy.stats.multivariate_normal.logpdf([0,0], mean = np.zeros(2), cov = np.eye(2))
Now, I'm interested in evaluating the log density of the point [0,0] over a variety of values of mean. Here is what I have tried
import numpy as np
import scipy.stats
grid = np.linspace(-2,2,51)
x,y = np.meshgrid(grid,grid)
scipy.stats.multivariate_normal.logpdf([0,0], mean = np.stack([x,y], axis = -1), cov = np.eye(2))
This results in an error: ValueError: Array 'mean' must be a vector of length 5202.
How can I evaluate the log density of a multivariate normal over a variety of values of mean?
As your error suggest logpdf is waiting a 1D array for the mean argument.
Since your covariance matrix is 2x2, you should give him a 2x1 array to mean.
If you want to evaluate the density for multiple mean values you can use a for loop after flattening x and y as follows :
import numpy as np
import scipy.stats
grid = np.linspace(-2,2,51)
x,y = np.meshgrid(grid,grid)
x,y = x.flatten(), y.flatten()
res = []
for i in range(len(x)):
x_i, y_i = x[i], y[i]
res.append(scipy.stats.multivariate_normal.logpdf([0,0], mean =[x_i,y_i], cov = np.eye(2)))
You can also also use list comprehension in place of the for loop :
res = [scipy.stats.multivariate_normal.logpdf([0,0], mean =[x_i,y_i], cov = np.eye(2)) for i in range(len(x))]
To visualize the result you can use matplotlib.pyplot :
import matplotlib.pyplot as plt
plt.figure()
plt.scatter(x,y,c=res)
plt.show()
But I don't see the point of trying to evaluate the multivariate gaussian logpdf over several mean values.
In the case of a multivariate normal distribution the argument x and the mean m have symmetric roles as you can see in the exponential term : (x-m)^T Sigam^(-1) (x-m).
What you are doing is equivalent to evaluate the logpdf of a multivariate gaussian of mean [0,0] and of covariance eye(2).

Estimation of t-distribution by mean of samples does not work

I am trying to create a t-distribution by taking the mean of many samples from a normal distribution (and then estimating the shape with kernel density estimation).
For some reason, I am getting pretty different results when I compare what I get with a proper t-distribution. I don't understand what is going wrong, so I think I am confused about something.
Here is the code:
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
import seaborn
inner_sample_size = 10
X = np.arange(-3, 3, 0.01)
results = [
np.mean(np.random.normal(size=inner_sample_size))
for _ in range(10000)
]
estimation = gaussian_kde(results)
plt.plot(X, estimation.evaluate(X))
t_samples = np.random.standard_t(inner_sample_size, 10000)
t_estimator = gaussian_kde(t_samples)
plt.plot(X, t_estimator.evaluate(X))
plt.ylabel("Probability density")
plt.show()
And here is the plot I get:
Where the orange line is numpy's own t-distribution, and the blue line is the one estimated by sampling.
Your assumption that the mean of Standard Normals has T distribution is incorrect. In fact, the mean of Standard Normals has Normal Distribution, which explains the shape of your blue graph. To generate one random variable T from a T distribution with k degrees of freedom, you first generate k+1 independent Standard Normals Z_i, i=0,...,k. You then compute
T = Z_0 / sqrt( sum(Z_i^2, i=1 to k)/k ).
The sum of squared Standard Normals sum(Z_i^2, i=1 to k) has Chi-Squared Distribution with k degrees of freedom, so if there is a pre-canned method to generate this, you should use it, since it's likely more efficient.

Contour plotting orbitals in pyquante2 using matplotlib

I'm currently writing line and contour plotting functions for my PyQuante quantum chemistry package using matplotlib. I have some great functions that evaluate basis sets along a (npts,3) array of points, e.g.
from somewhere import basisset, line
bfs = basisset(h2) # Generate a basis set
points = line((0,0,-5),(0,0,5)) # Create a line in 3d space
bfmesh = bfs.mesh(points)
for i in range(bfmesh.shape[1]):
plot(bfmesh[:,i])
This is fast because it evaluates all of the basis functions at once, and I got some great help from stackoverflow here and here to make them extra-nice.
I would now like to update this to do contour plotting as well. The slow way I've done this in the past is to create two one-d vectors using linspace(), mesh these into a 2D grid using meshgrid(), and then iterating over all xyz points and evaluating each one:
f = np.empty((50,50),dtype=float)
xvals = np.linspace(0,10)
yvals = np.linspace(0,20)
z = 0
for x in xvals:
for y in yvals:
f = bf(x,y,z)
X,Y = np.meshgrid(xvals,yvals)
contourplot(X,Y,f)
(this isn't real code -- may have done something dumb)
What I would like to do is to generate the mesh in more or less the same way I do in the contour plot example, "unravel" it to a (npts,3) list of points, evaluate the basis functions using my new fast routines, then "re-ravel" it back to X,Y matrices for plotting with contourplot.
The problem is that I don't have anything that I can simply call .ravel() on: I either have 1d meshes of xvals and yvals, the 2D versions X,Y, and the single z value.
Can anyone think of a nice, pythonic way to do this?
If you can express f as a function of X and Y, you could avoid the Python for-loops this way:
import matplotlib.pyplot as plt
import numpy as np
def bf(x, y):
return np.sin(np.sqrt(x**2+y**2))
xvals = np.linspace(0,10)
yvals = np.linspace(0,20)
X, Y = np.meshgrid(xvals,yvals)
f = bf(X,Y)
plt.contour(X,Y,f)
plt.show()
yields

Exponential decay curve fitting in numpy and scipy

I'm having a bit of trouble with fitting a curve to some data, but can't work out where I am going wrong.
In the past I have done this with numpy.linalg.lstsq for exponential functions and scipy.optimize.curve_fit for sigmoid functions. This time I wished to create a script that would let me specify various functions, determine parameters and test their fit against the data. While doing this I noticed that Scipy leastsq and Numpy lstsq seem to provide different answers for the same set of data and the same function. The function is simply y = e^(l*x) and is constrained such that y=1 at x=0.
Excel trend line agrees with the Numpy lstsq result, but as Scipy leastsq is able to take any function, it would be good to work out what the problem is.
import scipy.optimize as optimize
import numpy as np
import matplotlib.pyplot as plt
## Sampled data
x = np.array([0, 14, 37, 975, 2013, 2095, 2147])
y = np.array([1.0, 0.764317544, 0.647136491, 0.070803763, 0.003630962, 0.001485394, 0.000495131])
# function
fp = lambda p, x: np.exp(p*x)
# error function
e = lambda p, x, y: (fp(p, x) - y)
# using scipy least squares
l1, s = optimize.leastsq(e, -0.004, args=(x,y))
print l1
# [-0.0132281]
# using numpy least squares
l2 = np.linalg.lstsq(np.vstack([x, np.zeros(len(x))]).T,np.log(y))[0][0]
print l2
# -0.00313461628963 (same answer as Excel trend line)
# smooth x for plotting
x_ = np.arange(0, x[-1], 0.2)
plt.figure()
plt.plot(x, y, 'rx', x_, fp(l1, x_), 'b-', x_, fp(l2, x_), 'g-')
plt.show()
Edit - additional information
The MWE above includes a small sample of the dataset. When fitting the actual data the scipy.optimize.curve_fit curve presents an R^2 of 0.82, while the numpy.linalg.lstsq curve, which is the same as that calculated by Excel, has an R^2 of 0.41.
You are minimizing different error functions.
When you use numpy.linalg.lstsq, the error function being minimized is
np.sum((np.log(y) - p * x)**2)
while scipy.optimize.leastsq minimizes the function
np.sum((y - np.exp(p * x))**2)
The first case requires a linear dependency between the dependent and independent variables, but the solution is known analitically, while the second can handle any dependency, but relies on an iterative method.
On a separate note, I cannot test it right now, but when using numpy.linalg.lstsq, I you don't need to vstack a row of zeros, the following works as well:
l2 = np.linalg.lstsq(x[:, None], np.log(y))[0][0]
To expound a bit on Jaime's point, any non-linear transformation of the data will lead to a different error function and hence to different solutions. These will lead to different confidence intervals for the fitting parameters. So you have three possible criteria to use to make a decision: which error you want to minimize, which parameters you want more confidence in, and finally, if you are using the fitting to predict some value, which method yields less error in the interesting predicted value. Playing around a bit analytically and in Excel suggests that different kinds of noise in the data (e.g. if the noise function scales the amplitude, affects the time-constant or is additive) leads to different choices of solution.
I'll also add that while this trick "works" for exponential decay to 0, it can't be used in the more general (and common) case of damped exponentials (rising or falling) to values that cannot be assumed to be 0.