Equivalent of R's cor.test in Python - numpy

Is there a way I can find the r confidence interval in Python?
In R I could do something like:
cor.test(m, h)
Pearson's product-moment correlation
data: m and h
t = 0.8974, df = 4, p-value = 0.4202
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.6022868 0.9164582
sample estimates:
cor
0.4093729
In Python I can calculate r (cor) using:
r,p = scipy.stats.pearsonr(df.age, df.pets)
But that doesn't return the r confidence interval.

Here's one way to calculate the confidence interval.
First, get the correlation value (Pearson's):
In [85]: from scipy import stats
In [86]: corr = stats.pearsonr(df['col1'], df['col2'])
In [87]: corr
Out[87]: (0.551178607008175, 0.0)
Use the Fisher transformation to get z
In [88]: z = np.arctanh(corr[0])
In [89]: z
Out[89]: 0.62007264620685021
And the sigma value, i.e. the standard error of z:
In [90]: sigma = (1/((len(df.index)-3)**0.5))
In [91]: sigma
Out[91]: 0.013840913308956662
Get the two-sided 95% interval in z-space, using the quantile function (ppf) of the normal distribution:
In [92]: cint = z + np.array([-1, 1]) * sigma * stats.norm.ppf((1+0.95)/2)
Finally, take the hyperbolic tangent to map the interval back to r values for the 95% CI:
In [93]: np.tanh(cint)
Out[93]: array([ 0.53201034, 0.56978224])
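Putting the steps together in one helper (a sketch, not a drop-in replacement for R's cor.test; it relies on the same large-sample Fisher-z approximation as above):
import numpy as np
from scipy import stats

def pearsonr_ci(x, y, alpha=0.05):
    """Pearson r, p-value and an approximate (1 - alpha) confidence interval."""
    r, p = stats.pearsonr(x, y)
    z = np.arctanh(r)                        # Fisher z-transform
    sigma = 1 / np.sqrt(len(x) - 3)          # standard error of z
    half_width = sigma * stats.norm.ppf(1 - alpha / 2)
    lo, hi = np.tanh([z - half_width, z + half_width])
    return r, p, lo, hi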

How to combine scipy interp1d with mpmath quadosc

I have a density function (from quantum mechanics calculations) that needs to be multiplied by a spherical Bessel function on a momentum grid (momentum q is a 1d array, real-space distance r is a 1d array, so I need to evaluate jn(q*r) as a 2d array). The product is then integrated over real space to get a result as a function of momentum (a 1d array with the same shape as q).
The Bessel function oscillates, while the density function decays quickly beyond a threshold distance. I used the adaptive quadrature in quadpy, which is fine when the oscillation is slow but fails when it is fast (high momentum values or high Bessel orders). mpmath's quadosc could be a nice option, but currently I get "object arrays are not supported", which seems to be the same issue as in Relation between mpmath and scipy: Type Error. What would be the best way to solve this, given that the density function is calculated outside of mpmath?
import numpy as np
from mpmath import besselj, sqrt, pi, besseljzero, inf, quadosc
from scipy.interpolate import interp1d

n = 1
q = np.geomspace(1e-7, 500, 1000)

# let's create a fake Gaussian density
x = np.geomspace(1e-7, 10, 1000)
y = np.exp(-(x - 5)**2)
density = interp1d(x, y, kind='cubic', fill_value=0, bounds_error=False)

# if we just want to integrate the spherical bessel function
def spherical_jn(x, n=n):
    return besselj(n + 1 / 2, x) * sqrt(pi / 2 / x)

# this is fine
vals = quadosc(
    spherical_jn, [0, inf], zeros=lambda m: besseljzero(n + 1 / 2, m)
)

# now we want to integrate the spherical bessel function times the density
def spherical_jn_density(x, n=n):
    grid = q[..., None] * x
    return besselj(n + 1 / 2, grid) * sqrt(pi / 2 / grid) * density(x)

# this will fail
vals_density = quadosc(
    spherical_jn_density, [0, inf], zeros=lambda m: besseljzero(n + 1 / 2, m)
)
Expected: an accurate integral of a highly oscillatory spherical Bessel function times an arbitrary decaying function (one that decays to zero at large distance).
Your density is an interp1d callable, which works like:
In [33]: density(.5)
Out[33]: array(1.60522789e-09)
It does not work when given an mpmath object:
In [34]: density(mpmath.mpf(.5))
ValueError: object arrays are not supported
It's ok if x is first converted to ordinary float:
In [37]: density(float(mpmath.mpf(.5)))
Out[37]: array(1.60522789e-09)
Tweaking your function:
def spherical_jn_density(x, n=1):
    print(repr(x))
    grid = q[..., None] * x
    return besselj(n + 1 / 2, grid) * sqrt(pi / 2 / grid) * density(x)
and trying to run the quadosc (with a smaller q)
In [57]: vals_density = quadosc(
...: spherical_jn_density, [0, inf], zeros=lambda m: besseljzero(n + 1 / 2, m))
mpf('0.506414729137261838698106')
TypeError: cannot create mpf from array([[mpf('5.06414729137261815781894e-8')],
[mpf('0.000000473559111442409924364745')],
[mpf('0.00000442835129247081824275722')],
[mpf('0.0000414104484439061558283487')],
[mpf('0.000387237851532012775822723')],
[mpf('0.00362114295531604773233197')],
[mpf('0.0338620727569835882851491')],
[mpf('0.316651395857188250996884')],
[mpf('2.96107409661232278850947')],
[mpf('27.6896294168963266721213')]], dtype=object)
In other words,
besselj(n + 1 / 2, grid)
is having problems, even before trying to evaluate density(x). mpmath functions don't work with numpy arrays; and many numpy/scipy functions don't work with mpmath objects.
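One way around this (a sketch, not from the original answer): keep quadosc but integrate one momentum value at a time with a purely scalar integrand, converting the mpmath abscissa to an ordinary float before it reaches the interp1d object. It reuses the fake Gaussian density from the question; whether quadosc converges well at the extreme ends of the real q grid would still need to be checked.
import numpy as np
from mpmath import besselj, sqrt, pi, besseljzero, inf, quadosc
from scipy.interpolate import interp1d

n = 1
x = np.geomspace(1e-7, 10, 1000)
y = np.exp(-(x - 5) ** 2)
density = interp1d(x, y, kind='cubic', fill_value=0, bounds_error=False)

def make_integrand(q_val, n=n):
    qf = float(q_val)                  # plain float mixes fine with mpmath

    def integrand(r):
        rf = float(r)                  # interp1d cannot take an mpf
        rho = float(density(rf))       # 0 outside the tabulated range
        if rho == 0.0:
            return 0.0
        return besselj(n + 1 / 2, qf * r) * sqrt(pi / 2 / (qf * r)) * rho

    return integrand

q = np.geomspace(1e-7, 500, 10)        # small grid just for illustration
vals_density = [
    quadosc(make_integrand(qi), [0, inf],
            zeros=lambda m, qi=qi: besseljzero(n + 1 / 2, m) / qi)
    for qi in q
]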

How to efficiently compute an L2 distance between rows of two arrays using only basic numpy operations? [duplicate]

I have 2 lists of points as numpy.ndarrays, where each row is the coordinate of a point, like:
a = np.array([[1,0,0],[0,1,0],[0,0,1]])
b = np.array([[1,1,0],[0,1,1],[1,0,1]])
Here I want to calculate the Euclidean distance between all pairs of points in the 2 lists: for each point p_a in a, I want the distance between it and every point p_b in b. So the result is
d = np.array([[1,sqrt(3),1],[1,1,sqrt(3)],[sqrt(3),1,1]])
How to use matrix multiplication in numpy to compute the distance matrix?
Using direct numpy broadcasting, you can do this:
dist = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))
Alternatively, scipy has a routine that will compute this slightly more efficiently (particularly for large matrices):
from scipy.spatial.distance import cdist
dist = cdist(a, b)
I would avoid solutions that depend on factoring-out matrix products (of the form A^2 + B^2 - 2AB), because they can be numerically unstable due to floating point roundoff errors.
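For the example arrays in the question, both routes give the expected matrix d (a quick check, assuming the row-per-point layout above):
import numpy as np
from scipy.spatial.distance import cdist

a = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
b = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]])

broadcast = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))
print(np.allclose(broadcast, cdist(a, b)))   # True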
To compute the squared Euclidean distance between each pair of rows, Xi from X and Yj from Y, we need to find:
(Xik-Yjk)**2 = Xik**2 + Yjk**2 - 2*Xik*Yjk
and then sum along k to get the squared distance dist(Xi,Yj) for the corresponding pair.
Distributing the sum over the three terms, this reduces to:
dist(Xi,Yj) = sum_k(Xik**2) + sum_k(Yjk**2) - 2*sum_k(Xik*Yjk)
Bringing in matrix-multiplication for the last part, we would have all the distances, like so -
dist = sum_rows(X^2) + sum_rows(Y^2) - 2*matrix_multiplication(X, Y.T)
Hence, putting into NumPy terms, we would end up with the euclidean distances for our case with a and b as the inputs, like so -
np.sqrt((a**2).sum(1)[:,None] + (b**2).sum(1) - 2*a.dot(b.T))
Leveraging np.einsum, we could replace the first two summation-reductions with -
np.einsum('ij,ij->i',a,a)[:,None] + np.einsum('ij,ij->i',b,b)
More info could be found on eucl_dist package's wiki page (disclaimer: I am its author).
If you have two 1-dimensional arrays, x and y, you can convert them into matrices with repeating columns, transpose, and apply the distance formula. This assumes that x and y are coordinate pairs (x[i], y[i]). The result is a symmetric distance matrix.
x = [1, 2, 3]
y = [4, 5, 6]
xx = np.repeat(x,3,axis = 0).reshape(3,3)
yy = np.repeat(y,3,axis = 0).reshape(3,3)
dist = np.sqrt((xx-xx.T)**2 + (yy-yy.T)**2)
dist
Out[135]:
array([[0. , 1.41421356, 2.82842712],
[1.41421356, 0. , 1.41421356],
[2.82842712, 1.41421356, 0. ]])
L2 distance = (a^2 + b^2 - 2ab)^0.5
a = np.random.randn(5, 3)
b = np.random.randn(2, 3)
a2 = np.sum(np.square(a), axis = 1)[..., None]
b2 = np.sum(np.square(b), axis = 1)[None, ...]
ab = -2*np.dot(a, b.T)
dist = np.sqrt(a2 + b2 + ab)
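One caveat with this factored form, echoing the roundoff warning in an earlier answer: for nearly identical rows, a2 + b2 + ab can come out as a tiny negative number, which np.sqrt turns into NaN. A possible guard (my addition, not part of the original answer) is to clamp at zero first:
import numpy as np

a = np.random.randn(5, 3)
b = np.random.randn(2, 3)
a2 = np.sum(np.square(a), axis=1)[..., None]
b2 = np.sum(np.square(b), axis=1)[None, ...]
ab = -2 * np.dot(a, b.T)

# clamp tiny negative values from roundoff before taking the square root
dist = np.sqrt(np.maximum(a2 + b2 + ab, 0))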

Python numpy percentile vs scipy percentileofscore

I am confused as to what I am doing incorrectly.
I have the following code:
import numpy as np
from scipy import stats
df
Out[29]: array([66., 69., 67., 75., 69., 69.])
val = 73.94
z1 = stats.percentileofscore(df, val)
print(z1)
Out[33]: 83.33333333333334
np.percentile(df, z1)
Out[34]: 69.999999999
I was expecting that np.percentile(df, z1) would give me back val = 73.94
I think you're not quite understanding what percentileofscore and percentile actually do. They are not inverses of each other.
From the docs for scipy.stats.percentileofscore:
The percentile rank of a score relative to a list of scores.
A percentileofscore of, for example, 80% means that 80% of the scores in a are below the given score. In the case of gaps or ties, the exact definition depends on the optional keyword, kind.
So when you supply the value 73.94, there are 5 elements of df that fall below that score, and 5/6 gives you your 83.3333% result.
Now in the Notes for numpy.percentile:
Given a vector V of length N, the q-th percentile of V is the value q/100 of the way from the minimum to the maximum in a sorted copy of V.
The default interpolation parameter is 'linear' so:
'linear': i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
Since you have provided 83.33 as your input parameter, you're looking at the value 83.33/100 of the way from the minimum to the maximum in your array.
If you're interested in digging through the source, you can find it here, but here is a simplified look at the calculation being done here:
ap = np.asarray(sorted(df))
Nx = df.shape[0]
indices = z1 / 100 * (Nx - 1)
indices_below = np.floor(indices).astype(int)
indices_above = indices_below + 1
weight_above = indices - indices_below
weight_below = 1 - weight_above
x1 = ap[indices_below] * weight_below # 57.50000000000004
x2 = ap[indices_above] * weight_above # 12.499999999999956
x1 + x2
70.0
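Putting the simplified calculation above into a runnable form (a sketch of the default 'linear' rule, not the actual numpy source):
import numpy as np

def linear_percentile(a, q):
    """Minimal re-implementation of np.percentile's default 'linear' rule,
    just to make the interpolation above explicit (illustrative only)."""
    ap = np.sort(np.asarray(a, dtype=float))
    idx = q / 100 * (ap.size - 1)        # fractional index into the sorted data
    lo = int(np.floor(idx))
    hi = min(lo + 1, ap.size - 1)
    frac = idx - lo
    return ap[lo] * (1 - frac) + ap[hi] * frac

df = np.array([66., 69., 67., 75., 69., 69.])
print(linear_percentile(df, 83.33333333333334))   # ~70.0, matching np.percentile(df, z1)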

Loop through numpy array on indexes and apply function [duplicate]

I have two arrays that have the shapes N X T and M X T. I'd like to compute the correlation coefficient across T between every possible pair of rows n and m (from N and M, respectively).
What's the fastest, most pythonic way to do this? (Looping over N and M would seem to me to be neither fast nor pythonic.) I'm expecting the answer to involve numpy and/or scipy. Right now my arrays are numpy arrays, but I'm open to converting them to a different type.
I'm expecting my output to be an array with the shape N X M.
N.B. When I say "correlation coefficient," I mean the Pearson product-moment correlation coefficient.
Here are some things to note:
The numpy function correlate requires input arrays to be one-dimensional.
The numpy function corrcoef accepts two-dimensional arrays, but they must have the same shape.
The scipy.stats function pearsonr requires input arrays to be one-dimensional.
Correlation (default 'valid' case) between two 2D arrays:
You can simply use matrix-multiplication np.dot like so -
out = np.dot(arr_one,arr_two.T)
The correlation (default "valid" case) between each pairwise row combination (row1, row2) of the two input arrays corresponds to the matrix-multiplication result at each (row1, row2) position.
Row-wise Correlation Coefficient calculation for two 2D arrays:
def corr2_coeff(A, B):
    # Row-wise mean of input arrays & subtract from input arrays themselves
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]

    # Sum of squares across rows
    ssA = (A_mA**2).sum(1)
    ssB = (B_mB**2).sum(1)

    # Finally get corr coeff
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None], ssB[None]))
This is based upon this solution to How to apply corr2 functions in Multidimentional arrays in MATLAB
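A quick spot-check of corr2_coeff against scipy.stats.pearsonr (my addition; it assumes the function above is already defined and uses arbitrary small sizes):
import numpy as np
from scipy.stats import pearsonr

A = np.random.rand(4, 50)
B = np.random.rand(3, 50)
out = corr2_coeff(A, B)          # shape (4, 3)

# compare one entry against the scalar scipy routine
print(np.isclose(out[0, 1], pearsonr(A[0], B[1])[0]))   # True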
Benchmarking
This section compares the runtime performance of the proposed approach against the generate_correlation_map and loopy pearsonr-based approaches listed in the other answer (taken from the function test_generate_correlation_map(), without the value-correctness verification code at the end of it). Note that the timings for the proposed approach also include an initial check for an equal number of columns in the two input arrays, as is also done in that other answer. The runtimes are listed next.
Case #1:
In [106]: A = np.random.rand(1000, 100)
In [107]: B = np.random.rand(1000, 100)
In [108]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15 ms per loop
In [109]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.6 ms per loop
Case #2:
In [110]: A = np.random.rand(5000, 100)
In [111]: B = np.random.rand(5000, 100)
In [112]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 368 ms per loop
In [113]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 493 ms per loop
Case #3:
In [114]: A = np.random.rand(10000, 10)
In [115]: B = np.random.rand(10000, 10)
In [116]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 1.29 s per loop
In [117]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 1.83 s per loop
The other loopy pearsonr based approach seemed too slow, but here are the runtimes for one small datasize -
In [118]: A = np.random.rand(1000, 100)
In [119]: B = np.random.rand(1000, 100)
In [120]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15.3 ms per loop
In [121]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.7 ms per loop
In [122]: %timeit pearsonr_based(A, B)
1 loops, best of 3: 33 s per loop
@Divakar provides a great option for computing the unscaled correlation, which is what I originally asked for.
In order to calculate the correlation coefficient, a bit more is required:
import numpy as np


def generate_correlation_map(x, y):
    """Correlate each n with each m.

    Parameters
    ----------
    x : np.array
      Shape N X T.
    y : np.array
      Shape M X T.

    Returns
    -------
    np.array
      N X M array in which each element is a correlation coefficient.
    """
    mu_x = x.mean(1)
    mu_y = y.mean(1)
    n = x.shape[1]
    if n != y.shape[1]:
        raise ValueError('x and y must ' +
                         'have the same number of timepoints.')
    s_x = x.std(1, ddof=n - 1)
    s_y = y.std(1, ddof=n - 1)
    cov = np.dot(x,
                 y.T) - n * np.dot(mu_x[:, np.newaxis],
                                   mu_y[np.newaxis, :])
    return cov / np.dot(s_x[:, np.newaxis], s_y[np.newaxis, :])
Here's a test of this function, which passes:
from scipy.stats import pearsonr


def test_generate_correlation_map():
    x = np.random.rand(10, 10)
    y = np.random.rand(20, 10)
    desired = np.empty((10, 20))
    for n in range(x.shape[0]):
        for m in range(y.shape[0]):
            desired[n, m] = pearsonr(x[n, :], y[m, :])[0]
    actual = generate_correlation_map(x, y)
    np.testing.assert_array_almost_equal(actual, desired)
For those interested in computing the Pearson correlation coefficient between a 1D and 2D array, I wrote the following function, where x is a 1D array and y a 2D array.
def pearsonr_2D(x, y):
    """computes pearson correlation coefficient
       where x is a 1D and y a 2D array"""
    upper = np.sum((x - np.mean(x)) * (y - np.mean(y, axis=1)[:, None]), axis=1)
    lower = np.sqrt(np.sum(np.power(x - np.mean(x), 2)) *
                    np.sum(np.power(y - np.mean(y, axis=1)[:, None], 2), axis=1))
    rho = upper / lower
    return rho
Example run:
>>> x
Out[1]: array([1, 2, 3])
>>> y
Out[2]: array([[ 1, 2, 3],
[ 6, 7, 12],
[ 9, 3, 1]])
>>> pearsonr_2D(x, y)
Out[3]: array([ 1. , 0.93325653, -0.96076892])
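As a quick sanity check (my addition, assuming pearsonr_2D from above is defined), the values agree with scipy's scalar routine row by row:
import numpy as np
from scipy.stats import pearsonr

x = np.array([1, 2, 3])
y = np.array([[1, 2, 3], [6, 7, 12], [9, 3, 1]])

expected = np.array([pearsonr(x, row)[0] for row in y])
print(np.allclose(pearsonr_2D(x, y), expected))   # True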

numpy.fft.fft not computing dft at frequencies given by numpy.fft.fftfreq?

This is a mathematical question, but it is tied to the numpy implementation, so I decided to ask it at SO. Perhaps I'm hugely misunderstanding something, but if so I would like to be put straight.
numpy.fft.fft computes the DFT according to the equation
A_k = sum_{m=0..n-1} a_m * exp(-2j*pi*m*k/n),    k = 0, ..., n-1
and numpy.fft.fftfreq is supposed to return the frequencies at which the DFT was computed.
Say we have:
x = [0, 0, 1, 0, 0]
X = np.fft.fft(x)
freq = np.fft.fftfreq(5)
Then for signal x, its DFT transformation is X, and frequencies at which X is computed are given by freq. For example X[0] is DFT of x at frequency freq[0], X[1] is DFT of x at frequency freq[1], and so on.
But when I compute DFT of a simple signal by hand with the formula quoted above, my results indicate that X[1] is DFT of x at frequency 1, not at freq[1], X[2] is DFT of x at frequency 2, etc, not at freq[2], etc.
As an example:
In [32]: x
Out[32]: [0, 0, 1, 0, 0]
In [33]: X
Out[33]:
array([ 1.00000000+0.j        , -0.80901699-0.58778525j,
        0.30901699+0.95105652j,  0.30901699-0.95105652j,
       -0.80901699+0.58778525j])
In [34]: freq
Out[34]: array([ 0. , 0.2, 0.4, -0.4, -0.2])
If I compute DFT of above signal for k = 0.2 (or freq[1]), I get
X at freq = 0.2: 0.876 - 0.482j, which isn't X[1].
If however I compute for k = 1 I get the same results as are in X[1] or -0.809 - 0.588j.
So what am I misunderstanding? If numpy.fft.fft(x)[n] is a DFT of x at frequency n, not at frequency numpy.fft.fftfreq(len(x))[n], what is the purpose of numpy.fft.fftfreq?
I think that is because the values in the array returned by numpy.fft.fftfreq are equal to (k/n) * the sampling frequency.
The frequencies of the DFT result are equal to k/n divided by the time spacing, because a period in the time domain becomes its inverse in the frequency domain after the FFT. You can think of the digital signal as a periodic sampling function convolved with the analog signal. Convolution in the time domain means multiplication in the frequency domain, so the time spacing of the input data affects the frequency spacing of the DFT result: the frequency spacing becomes the original value divided by the time spacing. Originally, the frequency spacing of the DFT result is 1/n when the time spacing is 1. So after the DFT, the frequency spacing becomes 1/n divided by the time spacing, which equals 1/n multiplied by the sampling frequency.
To calculate that, numpy.fft.fftfreq takes two arguments: the length of the input and the time spacing, i.e. the inverse of the sampling rate. The length of the input is equal to n, and the time spacing is the value that k/n is divided by (the default is 1).
I tried k = 2, and the result is equal to X[2] in your example. In that situation, k/n * 1 is equal to freq[2].
The DFT is a dimensionless basis transform or matrix multiplication. The output or result of a DFT has nothing to do with frequencies unless you know the sampling rate represented by the input vector (samples per second, per meter, per radian, etc.)
You can compute a Goertzel filter of the same length N with k=0.2, but that result isn't contained in a DFT or FFT result of length N. A DFT only contains complex Goertzel filter results for integer k values. And to get from k to the frequency represented by X[k], you need to know the sample rate.
Yours is not a SO question
You wrote
If I compute DFT of above signal for k = 0.2 .
and I reply "You shouldn't"... the DFT can be meaningfully computed only for integer values of k.
The relationship between an index k and a frequency is given by f_k = k Δf or, if you prefer circular frequencies, ω_k = k Δω where Δf = 1/T and Δω = 2πΔf, T being the period of the signal.
The arguments of fftfreq are a bit misleading... the required one is the number of samples n and the optional argument is the sampling interval, by default d=1.0, but at any rate T=n*d and Δf = 1/(n*d)
>>> fftfreq(5) # d=1
array([ 0. , 0.2, 0.4, -0.4, -0.2])
>>> fftfreq(5,2)
array([ 0. , 0.1, 0.2, -0.2, -0.1])
>>> fftfreq(5,10)
array([ 0. , 0.02, 0.04, -0.04, -0.02])
and the different T are 5, 10, 50 and the respective Δf are 0.2, 0.1, 0.02, as (I) expected.
Why doesn't fftfreq simply require the signal's period? Because it is mainly intended as a helper in untangling the Nyquist frequency issue.
As you know, the DFT is periodic, for a signal x of length N you have that
DFT(x,k) is equal to DFT(x,k+mN) where m is an integer.
This implies that there are only N/2 distinct positive and N/2 distinct negative frequencies, and that, when N/2 < k < N, the frequency most meaningfully associated with k is not k Δf but (k-N) Δf.
To do this, fftfreq needs more information than the period T, hence the choice of requiring n and computing Δf from an assumed sampling interval.
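To make the k versus k/N bookkeeping concrete, here is a small numerical check (a sketch based on the example signal above): the manual DFT at integer bin k corresponds to a frequency of k/N cycles per sample, which is exactly what fftfreq reports for d=1.
import numpy as np

x = np.array([0, 0, 1, 0, 0], dtype=float)
N = len(x)
m = np.arange(N)
k = 1

# manual DFT at bin k: X[k] = sum_m x[m] * exp(-2j*pi*k*m/N)
manual = np.sum(x * np.exp(-2j * np.pi * k * m / N))
print(manual)                 # approx -0.80901699-0.58778525j
print(np.fft.fft(x)[k])       # the same value
print(np.fft.fftfreq(N)[k])   # 0.2, i.e. k/N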