Estimation of t-distribution by mean of samples does not work - numpy

I am trying to create a t-distribution by taking the mean of many samples from a normal distribution (and then estimating the shape with kernel density estimation).
For some reason, I am getting pretty different results when I compare what I get with a proper t-distribution. I don't understand what is going wrong, so I think I am confused about something.
Here is the code:
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
import seaborn
inner_sample_size = 10
X = np.arange(-3, 3, 0.01)
results = [
np.mean(np.random.normal(size=inner_sample_size))
for _ in range(10000)
]
estimation = gaussian_kde(results)
plt.plot(X, estimation.evaluate(X))
t_samples = np.random.standard_t(inner_sample_size, 10000)
t_estimator = gaussian_kde(t_samples)
plt.plot(X, t_estimator.evaluate(X))
plt.ylabel("Probability density")
plt.show()
And here is the plot I get:
Where the orange line is numpy's own t-distribution, and the blue line is the one estimated by sampling.

Your assumption that the mean of Standard Normals has T distribution is incorrect. In fact, the mean of Standard Normals has Normal Distribution, which explains the shape of your blue graph. To generate one random variable T from a T distribution with k degrees of freedom, you first generate k+1 independent Standard Normals Z_i, i=0,...,k. You then compute
T = Z_0 / sqrt( sum(Z_i^2, i=1 to k)/k ).
The sum of squared Standard Normals sum(Z_i^2, i=1 to k) has Chi-Squared Distribution with k degrees of freedom, so if there is a pre-canned method to generate this, you should use it, since it's likely more efficient.

Related

Central Limit Theorem: Sample means do not follow a normal distribution

The Problem
Good evening.
I am learning about the Central Limit Theorem. As practice, I ran simulations in an attempt to find the mean of a fair die (I know, a toy problem).
I took 4000 samples, and in each sample I rolled a die 50 times (screenshot of the code at the bottom). For each of these 4000 samples I computed the mean. Then, I plotted these 4000 sample means in a histogram (with bin size 0.03) using matplotlib.
Here is the result:
Question
Why aren't the sample means normally distributed given that the conditions for CLT (sample size >= 30) were respected?
Specifically, why does the histogram look like two normal distributions superimposed on top of each other? More intriguingly, why does the "outer" distribution look "discrete" with empty spaces occurring at regular intervals?
It almost seems like the result is off in a systematic way.
All help is greatly appreciated. I am very lost.
Supplementary Code
The code I used to generate the 4000 sample means.
"""
Take multiple samples of dice rolls. For
each sample, compute the sample mean.
With the sample means, plot a histogram.
By the Central Limit Theorem, the sample
means should be normally distributed.
"""
sample_means = []
num_samples = 4000
for i in range(num_samples):
# Large enough for CLT to hold
num_rolls = 50
sample = []
for j in range(num_rolls):
observation = random.randint(1, 6)
sample.append(observation)
sample_mean = sum(sample) / len(sample)
sample_means.append(sample_mean)
When num_rolls equals 50, each possible mean will be a fraction with denominator 50. So, in reality, you are looking at a discrete distribution.
To create a histogram of a discrete distribution, the bin boundaries are best placed nicely in-between the values. Using a step size of 0.03, some bin boundaries will coincide with the values, putting the double of values into one bin compared to its neighbor. Moreover, due to subtle floating point rounding problems, the result can become unpredictable when values and boundaries coincide.
Here is some code to illustrate what is going on:
from matplotlib import pyplot as plt
import numpy as np
import random
sample_means = []
num_samples = 4000
for i in range(num_samples):
num_rolls = 50
sample = []
for j in range(num_rolls):
observation = random.randint(1, 6)
sample.append(observation)
sample_mean = sum(sample) / len(sample)
sample_means.append(sample_mean)
fig, axs = plt.subplots(2, 2, figsize=(14, 8))
random_y = np.random.rand(len(sample_means))
for (ax0, ax1), step in zip(axs, [0.03, 0.02]):
bins = np.arange(3.01, 4, step)
ax0.hist(sample_means, bins=bins)
ax0.set_title(f'step={step}')
ax0.vlines(bins, 0, ax0.get_ylim()[1], ls=':', color='r') # show the bin boundaries in red
ax1.scatter(sample_means, random_y, s=1) # show the sample means with a random y
ax1.vlines(bins, 0, 1, ls=':', color='r') # show the bin boundaries in red
ax1.set_xticks(np.arange(3, 4, 0.02))
ax1.set_xlim(3.0, 3.3) # zoom in to region to better see the ins
ax1.set_title('bin boundaries between values' if step == 0.02 else 'chaotic bin boundaries')
plt.show()
PS: Note that the code would run much, much faster if instead of Python lists, the code would work completely with numpy.

Why curve fit function does not combine all data points. How to get best fit?

I'm not familiar that how to decide the fitting function? But by looking at the trend of data points I choosed Poisson distribution as my fitting function. Green curve is quite smooth but fitting curve is is far away from first data point having position (0,0.55). I want to get smooth curve using fitting function because it is far away from my actual data points. I tried to increase number of bins but still getting same type of curve. I have doubt that may be I am not choosing proper fitting function or may be I am missing something else?
`def Poisson_fit(x,a):
return (a*np.exp(-x))
def Poisson(x):
return (np.exp(-x))
x_data =np.linspace(0,5,10)
print("x_data: ",x_data)
[0.,0.55555556, 1.11111111, 1.66666667, 2.22222222, 2.77777778, 3.33333333,
3.88888889, 4.44444444, 5.]
hist, bin_edges= np.histogram(x, bins=10, density=True)
print("hist: ",hist)
#hist:[5.41041394e-01,1.42611032e-01,3.44975130e-02,7.60221121e-03,
1.66115522e-03,3.26808028e-04,6.70741368e-05,1.14168743e-05,5.70843717e-06,
1.42710929e-06]
plt.scatter(x_data, hist,marker='o',color='red')
popt, pcov = optimize.curve_fit(Poisson_fit, x_data, hist)
plt.plot(x_data, Poisson_fit(x_data,*popt), linestyle='--',
marker='.',color='red', label='Fit')
plt.plot(x_data,Poisson(x_data),marker='.',color='green',label='Poisson')`
#Second Graph(Find best fit)
In the following graph I have fit two different distributions on data points. For me its hard to judge which is best fit. Should I print error on the fitting function to judge the best fit?
`perr = np.sqrt(np.diag(pcov))`
If all data-points need to coincide with the interpolating fit, splines (e.g. cubic splines) can be used, generally resulting in a reasonably smooth fit (only generally, because what is "reasonably smooth" depends both on the data and the application).
Example:
import numpy as np
from scipy.interpolate import CubicSpline
import pylab
x_data = np.linspace(0,5,10)
y_data = np.array([5.41041394e-01,1.42611032e-01,3.44975130e-02,
7.60221121e-03,1.66115522e-03,3.26808028e-04,
6.70741368e-05,1.14168743e-05,5.70843717e-06,
1.42710929e-06])
spline = CubicSpline(x_data, y_data)
plot_x = np.linspace(0,5,1000)
pylab.plot(x_data, y_data, 'b*', label='Data')
pylab.plot(plot_x, spline(plot_x), 'k-', label='Spline')
pylab.legend(loc='best')
pylab.show()

Locally weighted smoothing for binary valued random variable

I have a random variable as follows:
f(x) = 1 with probability g(x)
f(x) = 0 with probability 1-g(x)
where 0 < g(x) < 1.
Assume g(x) = x. Let's say I am observing this variable without knowing the function g and obtained 100 samples as follows:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binned_statistic
list = np.ndarray(shape=(200,2))
g = np.random.rand(200)
for i in range(len(g)):
list[i] = (g[i], np.random.choice([0, 1], p=[1-g[i], g[i]]))
print(list)
plt.plot(list[:,0], list[:,1], 'o')
Plot of 0s and 1s
Now, I would like to retrieve the function g from these points. The best I could think is to use draw a histogram and use the mean statistic:
bin_means, bin_edges, bin_number = binned_statistic(list[:,0], list[:,1], statistic='mean', bins=10)
plt.hlines(bin_means, bin_edges[:-1], bin_edges[1:], lw=2)
Histogram mean statistics
Instead, I would like to have a continuous estimation of the generating function.
I guess it is about kernel density estimation but I could not find the appropriate pointer.
straightforward without explicitly fitting an estimator:
import seaborn as sns
g = sns.lmplot(x= , y= , y_jitter=.02 , logistic=True)
plug in x= your exogenous variable and analogously y = dependent variable. y_jitter is jitter the point for better visibility if you have a lot of data points. logistic = True is the main point here. It will give you the logistic regression line of the data.
Seaborn is basically tailored around matplotlib and works great with pandas, in case you want to extend your data to a DataFrame.

Cubic spline interpolation drops out halfway

I am trying to make a cubic spline interpolation and for some reason, the interpolation drops off in the middle of it. It's very mysterious and I can't find any mention of similar occurrences anywhere online.
This is for my dissertation so I have excluded some labels etc. to keep it obscure intentionally, but all the relevant code is as follows. For context, this is an astronomy related plot.
from scipy.interpolate import CubicSpline
import numpy as np
import matplotlib.pyplot as plt
W = np.array([0.435,0.606,0.814,1.05,1.25,1.40,1.60])
sum_all = np.array([sum435,sum606,sum814,sum105,sum125,sum140,sum160])
sum_can = np.array([sumc435,sumc606,sumc814,sumc105,sumc125,sumc140,sumc160])
fall = CubicSpline(W,sum_all)
newallx=np.arange(0.435,1.6,0.001)
newally=fall(newallx)
fcan = CubicSpline(W,sum_can)
newcanx=np.arange(0.435,1.6,0.001)
newcany=fcan(newcanx)
#----plot
plt.plot(newallx,newally)
plt.plot(newcanx,newcany)
plt.plot(W,sum_all,marker='o',color='r',linestyle='')
plt.plot(W,sum_can,marker='o',color='b',linestyle='')
plt.yscale("log")
plt.ylabel("Flux S$_v$ [erg s$^-$$^1$ cm$^-$$^2$ Hz$^-$$^1$]")
plt.xlabel("Wavelength [n$\lambda$]")
plt.show()
The plot that I get from that comes out like this, with a clear gap in the interpolation:
And in case you are wondering, these are the values in the sum_all and sum_can arrays (I assume it doesn't matter, but just in case you want the numbers to plot it yourself):
sum_all:
[ 3.87282732e+32 8.79993191e+32 1.74866333e+33 1.59946687e+33
9.08556547e+33 6.70458731e+33 9.84832359e+33]
can_all:
[ 2.98381061e+28 1.26194810e+28 3.30328780e+28 2.90254609e+29
3.65117723e+29 3.46256846e+29 3.64483736e+29]
The gap happens between [0.606,1.26194810e+28] and [0.814,3.30328780e+28]. If I change the intervals from 0.001 to something higher, it's obvious that the plot doesn't actually break off but merely dips below 0 on the y-axis (but the plot is continuous). So why does it do that? Surely that's not a correct interpolation? Just looking with our eyes, that's clearly not a well-interpolated connection between those two points.
Any tips or comments would be extremely appreciated. Thank you so much in advance!
The reason for the breakdown can be better observed on a linear scale.
We see that the spline actually passes below 0, which is undefined on a log scale.
So I would suggest to first take the logarithm of the data, perform the spline interpolation on the logarithmically scaled data, and then scale back by the 10th power.
from scipy.interpolate import CubicSpline
import numpy as np
import matplotlib.pyplot as plt
W = np.array([0.435,0.606,0.814,1.05,1.25,1.40,1.60])
sum_all = np.array([ 3.87282732e+32, 8.79993191e+32, 1.74866333e+33, 1.59946687e+33,
9.08556547e+33, 6.70458731e+33, 9.84832359e+33])
sum_can = np.array([ 2.98381061e+28, 1.26194810e+28, 3.30328780e+28, 2.90254609e+29,
3.65117723e+29, 3.46256846e+29, 3.64483736e+29])
fall = CubicSpline(W,np.log10(sum_all))
newallx=np.arange(0.435,1.6,0.001)
newally=fall(newallx)
fcan = CubicSpline(W,np.log10(sum_can))
newcanx=np.arange(0.435,1.6,0.01)
newcany=fcan(newcanx)
plt.plot(newallx,10**newally)
plt.plot(newcanx,10**newcany)
plt.plot(W,sum_all,marker='o',color='r',linestyle='')
plt.plot(W,sum_can,marker='o',color='b',linestyle='')
plt.yscale("log")
plt.ylabel("Flux S$_v$ [erg s$^-$$^1$ cm$^-$$^2$ Hz$^-$$^1$]")
plt.xlabel("Wavelength [n$\lambda$]")
plt.show()

Exponential decay curve fitting in numpy and scipy

I'm having a bit of trouble with fitting a curve to some data, but can't work out where I am going wrong.
In the past I have done this with numpy.linalg.lstsq for exponential functions and scipy.optimize.curve_fit for sigmoid functions. This time I wished to create a script that would let me specify various functions, determine parameters and test their fit against the data. While doing this I noticed that Scipy leastsq and Numpy lstsq seem to provide different answers for the same set of data and the same function. The function is simply y = e^(l*x) and is constrained such that y=1 at x=0.
Excel trend line agrees with the Numpy lstsq result, but as Scipy leastsq is able to take any function, it would be good to work out what the problem is.
import scipy.optimize as optimize
import numpy as np
import matplotlib.pyplot as plt
## Sampled data
x = np.array([0, 14, 37, 975, 2013, 2095, 2147])
y = np.array([1.0, 0.764317544, 0.647136491, 0.070803763, 0.003630962, 0.001485394, 0.000495131])
# function
fp = lambda p, x: np.exp(p*x)
# error function
e = lambda p, x, y: (fp(p, x) - y)
# using scipy least squares
l1, s = optimize.leastsq(e, -0.004, args=(x,y))
print l1
# [-0.0132281]
# using numpy least squares
l2 = np.linalg.lstsq(np.vstack([x, np.zeros(len(x))]).T,np.log(y))[0][0]
print l2
# -0.00313461628963 (same answer as Excel trend line)
# smooth x for plotting
x_ = np.arange(0, x[-1], 0.2)
plt.figure()
plt.plot(x, y, 'rx', x_, fp(l1, x_), 'b-', x_, fp(l2, x_), 'g-')
plt.show()
Edit - additional information
The MWE above includes a small sample of the dataset. When fitting the actual data the scipy.optimize.curve_fit curve presents an R^2 of 0.82, while the numpy.linalg.lstsq curve, which is the same as that calculated by Excel, has an R^2 of 0.41.
You are minimizing different error functions.
When you use numpy.linalg.lstsq, the error function being minimized is
np.sum((np.log(y) - p * x)**2)
while scipy.optimize.leastsq minimizes the function
np.sum((y - np.exp(p * x))**2)
The first case requires a linear dependency between the dependent and independent variables, but the solution is known analitically, while the second can handle any dependency, but relies on an iterative method.
On a separate note, I cannot test it right now, but when using numpy.linalg.lstsq, I you don't need to vstack a row of zeros, the following works as well:
l2 = np.linalg.lstsq(x[:, None], np.log(y))[0][0]
To expound a bit on Jaime's point, any non-linear transformation of the data will lead to a different error function and hence to different solutions. These will lead to different confidence intervals for the fitting parameters. So you have three possible criteria to use to make a decision: which error you want to minimize, which parameters you want more confidence in, and finally, if you are using the fitting to predict some value, which method yields less error in the interesting predicted value. Playing around a bit analytically and in Excel suggests that different kinds of noise in the data (e.g. if the noise function scales the amplitude, affects the time-constant or is additive) leads to different choices of solution.
I'll also add that while this trick "works" for exponential decay to 0, it can't be used in the more general (and common) case of damped exponentials (rising or falling) to values that cannot be assumed to be 0.