How does the numpy Gaussian distribution work? - numpy-random

I am new to numpy. There is a part of numpy's Gaussian distribution that is not clear to me.
from numpy import random
x = random.normal(size=(2, 3))
print(x)
This generates x as:
[[ 1.15917179 -1.29036526 0.88074813]
[-1.07674284 2.35437249 -0.54746161]]
But I don't know what these numbers mean, how they are generated, or how we know that they represent a Gaussian distribution.
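One rough way to see what these numbers are: each entry is an independent draw from a normal distribution with mean 0 and standard deviation 1 (the defaults of random.normal). Six draws are too few to show the shape, but with many draws the sample statistics should approach the theoretical ones. The sketch below assumes the newer np.random.default_rng interface and an arbitrary seed and sample size; it is an illustrative check, not a formal test of normality.

```python
import numpy as np

# Seeded generator so the run is repeatable (the seed value is arbitrary).
rng = np.random.default_rng(0)

# Same kind of call as in the question, but with many more draws.
samples = rng.normal(size=100_000)

sample_mean = samples.mean()  # should be close to 0 (the default loc)
sample_std = samples.std()    # should be close to 1 (the default scale)

# For a normal distribution, about 68.27% of the values fall within one
# standard deviation of the mean.
within_one_sd = np.mean(np.abs(samples) < 1.0)

print(f"mean={sample_mean:.3f}, std={sample_std:.3f}, within 1 sd={within_one_sd:.3f}")
```

Plotting a histogram of samples (for example with matplotlib's plt.hist(samples, bins=100)) should show the familiar bell curve.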

Related

TensorFlow Probability (tfp) equivalent of np.quantile()

I am trying to find a TensorFlow equivalent of np.quantile(). I have found tfp.stats.quantiles() (tfp stands for TensorFlow Probability). However, its constructs are a bit different from that of np.quantile().
Consider the following example:
import tensorflow_probability as tfp
import tensorflow as tf
import numpy as np
inputs = tf.random.normal((1, 4096, 4))
print("NumPy")
print(np.quantile(inputs.numpy(), q=0.9, axis=1, keepdims=False))
I am not sure from the TFP docs how I could write the above using tfp.stats.quantiles(). I tried checking out the source code of both methods, but it didn't help.
Let me try to be more helpful here than I was on GitHub.
There is a difference in behavior between np.quantile and tfp.stats.quantiles. The key difference here is that numpy.quantile will
Compute the q-th quantile of the data along the specified axis.
where q is the
Quantile or sequence of quantiles to compute, which must be between 0 and 1 inclusive.
and tfp.stats.quantiles
Given a vector x of samples, this function estimates the cut points by returning num_quantiles + 1 cut points
So you need to tell tfp.stats.quantiles how many quantiles you want and then select the q-th one. If it isn't clear how to do this from the API alone, the source for tfp.stats.quantiles (for v0.19.0) shows how to get a return structure similar to NumPy's.
For completeness, setting up a virtual environment with
$ cat requirements.txt
numpy==1.24.2
tensorflow==2.11.0
tensorflow-probability==0.19.0
allows us to run
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
inputs = tf.random.normal((1, 4096, 4), dtype=tf.float64)
q = 0.9
numpy_quantiles = np.quantile(inputs.numpy(), q=q, axis=1, keepdims=False)
tfp_quantiles = tfp.stats.quantiles(
    inputs, num_quantiles=100, axis=1, interpolation="linear"
)[int(q * 100)]
assert np.allclose(numpy_quantiles, tfp_quantiles.numpy())
print(f"{numpy_quantiles=}")
# numpy_quantiles=array([[1.31727661, 1.2699167 , 1.28735237, 1.27137588]])
print(f"{tfp_quantiles=}")
# tfp_quantiles=<tf.Tensor: shape=(1, 4), dtype=float64, numpy=array([[1.31727661, 1.2699167 , 1.28735237, 1.27137588]])>
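You could also use tfp.stats.percentile(inputs, 90., axis=1, keepdims=False). The main difference from quantile is that q is given on a 0-100 scale (90. instead of .90). Note that percentile defaults to interpolation="nearest", so pass interpolation="linear" if you want to match the default behavior of np.quantile.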

Evaluation of log density for various values of `mean`

I can evaluate the log probability density of a multivariate normal by doing
import numpy as np
import scipy.stats
scipy.stats.multivariate_normal.logpdf([0,0], mean = np.zeros(2), cov = np.eye(2))
Now, I'm interested in evaluating the log density of the point [0,0] over a variety of values of mean. Here is what I have tried
import numpy as np
import scipy.stats
grid = np.linspace(-2,2,51)
x,y = np.meshgrid(grid,grid)
scipy.stats.multivariate_normal.logpdf([0,0], mean = np.stack([x,y], axis = -1), cov = np.eye(2))
This results in an error: ValueError: Array 'mean' must be a vector of length 5202.
How can I evaluate the log density of a multivariate normal over a variety of values of mean?
As the error suggests, logpdf expects a 1D array for the mean argument.
Since your covariance matrix is 2x2, mean must be a vector of length 2.
If you want to evaluate the density for multiple mean values, you can use a for loop after flattening x and y as follows:
import numpy as np
import scipy.stats
grid = np.linspace(-2,2,51)
x,y = np.meshgrid(grid,grid)
x,y = x.flatten(), y.flatten()
res = []
for i in range(len(x)):
    x_i, y_i = x[i], y[i]
    res.append(scipy.stats.multivariate_normal.logpdf([0,0], mean=[x_i,y_i], cov=np.eye(2)))
You can also use a list comprehension in place of the for loop:
res = [scipy.stats.multivariate_normal.logpdf([0,0], mean=[x_i,y_i], cov=np.eye(2)) for x_i, y_i in zip(x, y)]
To visualize the result you can use matplotlib.pyplot :
import matplotlib.pyplot as plt
plt.figure()
plt.scatter(x,y,c=res)
plt.show()
But I don't see the point of evaluating the multivariate Gaussian logpdf over several mean values.
For a multivariate normal distribution, the argument x and the mean m play symmetric roles, as you can see in the exponential term: (x-m)^T Sigma^(-1) (x-m).
What you are doing is equivalent to evaluating, at each grid point, the logpdf of a multivariate Gaussian with mean [0,0] and covariance eye(2).
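The symmetry argument above suggests a loop-free version: swap the roles, fix the mean at [0, 0], and pass the whole grid as the evaluation points. The following is a minimal sketch of that idea, reusing the grid from the question; the spot-check indices are arbitrary.

```python
import numpy as np
import scipy.stats

grid = np.linspace(-2, 2, 51)
x, y = np.meshgrid(grid, grid)

# Treat each grid point as the evaluation point, with the mean fixed at [0, 0].
# logpdf accepts an array whose last axis holds the components of each point.
points = np.stack([x, y], axis=-1)  # shape (51, 51, 2)
log_density = scipy.stats.multivariate_normal.logpdf(
    points, mean=np.zeros(2), cov=np.eye(2)
)  # shape (51, 51)

# Spot check against the original formulation: point [0, 0], mean at a grid point.
i, j = 3, 40  # arbitrary grid indices
reference = scipy.stats.multivariate_normal.logpdf(
    [0, 0], mean=[x[i, j], y[i, j]], cov=np.eye(2)
)
print(np.isclose(log_density[i, j], reference))
```

Because the result keeps the grid's shape, it can be passed directly to something like plt.pcolormesh(x, y, log_density).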

Estimation of t-distribution by mean of samples does not work

I am trying to create a t-distribution by taking the mean of many samples from a normal distribution (and then estimating the shape with kernel density estimation).
For some reason, I am getting pretty different results when I compare what I get with a proper t-distribution. I don't understand what is going wrong, so I think I am confused about something.
Here is the code:
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
import seaborn
inner_sample_size = 10
X = np.arange(-3, 3, 0.01)
results = [
    np.mean(np.random.normal(size=inner_sample_size))
    for _ in range(10000)
]
estimation = gaussian_kde(results)
plt.plot(X, estimation.evaluate(X))
t_samples = np.random.standard_t(inner_sample_size, 10000)
t_estimator = gaussian_kde(t_samples)
plt.plot(X, t_estimator.evaluate(X))
plt.ylabel("Probability density")
plt.show()
And here is the plot I get:
Where the orange line is numpy's own t-distribution, and the blue line is the one estimated by sampling.
Your assumption that the mean of standard normals has a t-distribution is incorrect. In fact, the mean of standard normals is itself normally distributed, which explains the shape of your blue graph. To generate one random variable T from a t-distribution with k degrees of freedom, you first generate k+1 independent standard normals Z_i, i=0,...,k. You then compute
T = Z_0 / sqrt( sum(Z_i^2, i=1 to k)/k ).
The sum of squared standard normals sum(Z_i^2, i=1 to k) has a chi-squared distribution with k degrees of freedom, so if there is a pre-canned method to generate this, you should use it, since it's likely more efficient.
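The construction above can be sketched in numpy as follows. The seed, the number of draws, and the variable names are arbitrary choices for illustration. As a sanity check, the sketch compares the sample variance with the theoretical variance k / (k - 2) of a t-distribution with k > 2 degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability
k = 10                          # degrees of freedom
n_draws = 100_000

# k + 1 independent standard normals per draw: Z_0 in column 0, Z_1..Z_k after it.
z = rng.normal(size=(n_draws, k + 1))
t_from_normals = z[:, 0] / np.sqrt(np.sum(z[:, 1:] ** 2, axis=1) / k)

# Same construction, drawing the denominator from a chi-squared generator instead.
t_from_chi2 = rng.normal(size=n_draws) / np.sqrt(rng.chisquare(k, size=n_draws) / k)

# Both sample variances should be close to k / (k - 2).
print(t_from_normals.var(), t_from_chi2.var(), k / (k - 2))
```

Either array can be passed to gaussian_kde in place of results in the question's code to compare against np.random.standard_t.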

Locally weighted smoothing for binary valued random variable

I have a random variable as follows:
f(x) = 1 with probability g(x)
f(x) = 0 with probability 1-g(x)
where 0 < g(x) < 1.
Assume g(x) = x. Let's say I am observing this variable without knowing the function g and obtained 200 samples as follows:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binned_statistic
list = np.ndarray(shape=(200,2))
g = np.random.rand(200)
for i in range(len(g)):
    list[i] = (g[i], np.random.choice([0, 1], p=[1-g[i], g[i]]))
print(list)
plt.plot(list[:,0], list[:,1], 'o')
Plot of 0s and 1s
Now, I would like to retrieve the function g from these points. The best I could think of is to draw a histogram using the mean statistic:
bin_means, bin_edges, bin_number = binned_statistic(list[:,0], list[:,1], statistic='mean', bins=10)
plt.hlines(bin_means, bin_edges[:-1], bin_edges[1:], lw=2)
Histogram mean statistics
Instead, I would like to have a continuous estimation of the generating function.
I guess it is about kernel density estimation but I could not find the appropriate pointer.
A straightforward way, without explicitly fitting an estimator:
import seaborn as sns
g = sns.lmplot(x= , y= , y_jitter=.02 , logistic=True)
Plug in your exogenous variable for x and, analogously, your dependent variable for y. y_jitter jitters the points for better visibility if you have a lot of data points. logistic=True is the main point here: it gives you the logistic regression line of the data.
Seaborn is built on top of matplotlib and works well with pandas, in case you want to put your data in a DataFrame.
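If a locally weighted estimate is preferred over a parametric logistic curve, one option is a kernel-weighted running mean (often called Nadaraya-Watson regression). This is a different technique from the seaborn call above, sketched here in plain numpy. The function name kernel_smooth, the Gaussian kernel, the bandwidth of 0.1, and the larger synthetic sample are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability

# Synthetic data in the spirit of the question: g(x) = x, binary outcomes.
n = 2000
x_obs = rng.random(n)
y_obs = (rng.random(n) < x_obs).astype(float)  # 1 with probability g(x) = x

def kernel_smooth(x_eval, x_obs, y_obs, bandwidth):
    """Locally weighted mean of y_obs at each point of x_eval (Gaussian weights)."""
    # Pairwise scaled distances, shape (len(x_eval), len(x_obs)).
    u = (x_eval[:, None] - x_obs[None, :]) / bandwidth
    weights = np.exp(-0.5 * u ** 2)
    return weights @ y_obs / weights.sum(axis=1)

x_eval = np.linspace(0.0, 1.0, 101)
g_hat = kernel_smooth(x_eval, x_obs, y_obs, bandwidth=0.1)  # hand-picked bandwidth
print(g_hat[50])  # estimate of g(0.5)
```

A smaller bandwidth follows the data more closely but is noisier; a larger one is smoother but flattens the curve, especially near the edges of the interval.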

Curve fitting a large data set

Right now, I'm trying to fit a curve to a large set of data; there are two arrays, x and y, each with 352 elements. I've fit a polynomial to the data, which works fine:
import numpy as np
import matplotlib.pyplot as plt
coeff=np.polyfit(x, y, 20)
poly=np.poly1d(coeff)
But I need a more accurately optimized curve, so I've been trying to fit a curve with scipy. Here's the code that I have so far:
import numpy as np
import scipy
import scipy.optimize as sp
coeff=np.polyfit(x, y, 20)
poly=np.poly1d(coeff)
poly_y=poly(x)
def poly_func(x): return poly(x)
param=sp.curve_fit(poly_func, x, y)
But all it returns is this:
ValueError: Unable to determine number of fit parameters.
How can I get this to work? (Or how can I fit a curve to this data?)
Your fit function does not make sense: it takes no parameters to fit.
curve_fit uses a non-linear optimizer, which needs an initial guess of the fitting parameters.
If no guess is given, it tries to determine the number of parameters via introspection (which fails for your function) and sets each initial value to one, something you almost never want.
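A minimal sketch of what curve_fit expects: a model function whose first argument is the independent variable and whose remaining arguments are the parameters to fit, plus an explicit initial guess p0. The quadratic model, its coefficients, and the synthetic data below are made up for illustration, since the original x and y are not shown.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical model: the independent variable comes first, then one
# argument per parameter to be fitted.
def quadratic(x, a, b, c):
    return a * x**2 + b * x + c

# Synthetic stand-in for the question's x and y (352 points, made-up coefficients).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 352)
y = quadratic(x, 2.0, -1.0, 0.5) + rng.normal(scale=0.05, size=x.size)

# p0 is the explicit initial guess; its length tells curve_fit how many
# parameters to fit.
popt, pcov = curve_fit(quadratic, x, y, p0=[1.0, 0.0, 0.0])
print(popt)
```

For a pure polynomial model, np.polyfit already returns the least-squares solution, so curve_fit mainly helps when the model is non-linear in its parameters.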