Python numpy percentile vs scipy percentileofscore

I am confused as to what I am doing incorrectly.
I have the following code:
import numpy as np
from scipy import stats
df
Out[29]: array([66., 69., 67., 75., 69., 69.])
val = 73.94
z1 = stats.percentileofscore(df, val)
print(z1)
Out[33]: 83.33333333333334
np.percentile(df, z1)
Out[34]: 69.999999999
I was expecting that np.percentile(df, z1) would give me back val = 73.94

I think you're not quite understanding what percentileofscore and percentile actually do. They are not inverses of each other.
From the docs for scipy.stats.percentileofscore:
The percentile rank of a score relative to a list of scores.
A percentileofscore of, for example, 80% means that 80% of the scores in a are below the given score. In the case of gaps or ties, the exact definition depends on the optional keyword, kind.
So when you supply the value 73.94, there are 5 elements of df that fall below that score, and 5/6 gives you your 83.3333% result.
Now in the Notes for numpy.percentile:
Given a vector V of length N, the q-th percentile of V is the value q/100 of the way from the minimum to the maximum in a sorted copy of V.
The default interpolation parameter is 'linear' so:
'linear': i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
Since you passed 83.33 as q, you're looking at the value 83.33/100 of the way from the minimum to the maximum in your sorted array, measured by position in the array rather than by value.
If you're interested in digging through the source, you can find it here, but here is a simplified look at the calculation being done here:
ap = np.asarray(sorted(df))
Nx = df.shape[0]
indices = z1 / 100 * (Nx - 1)
indices_below = np.floor(indices).astype(int)
indices_above = indices_below + 1
weight_above = indices - indices_below
weight_below = 1 - weight_above
x1 = ap[indices_below] * weight_below # 57.50000000000004
x2 = ap[indices_above] * weight_above # 12.499999999999956
x1 + x2
70.0
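To see both functions side by side, here is a minimal sketch using the array from the question. The helper name `linear_percentile` is made up for illustration, and it only mirrors the default linear interpolation for a 1-D array; it is not numpy's actual implementation.

```python
import numpy as np
from scipy import stats

def linear_percentile(values, q):
    """Hypothetical helper: q-th percentile with linear interpolation (1-D only)."""
    ap = np.sort(np.asarray(values, dtype=float))
    pos = q / 100 * (ap.size - 1)          # fractional index into the sorted array
    lo = int(np.floor(pos))
    hi = min(lo + 1, ap.size - 1)          # clamp so q = 100 stays in bounds
    frac = pos - lo
    return ap[lo] * (1 - frac) + ap[hi] * frac

df = np.array([66., 69., 67., 75., 69., 69.])
val = 73.94

rank = stats.percentileofscore(df, val)    # percentage of scores below val
manual = linear_percentile(df, rank)       # interpolated value at that rank
builtin = np.percentile(df, rank)
print(rank, manual, builtin)
```

The rank counts how many scores lie below val, while the percentile interpolates between neighbouring sorted elements, so a round trip only returns val when val happens to sit exactly at the interpolated position.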

Related

Pandas get max delta in a timeseries for a specified period

Given a dataframe with a non-regular time series as an index, I'd like to find the max delta between the values for a period of 10 secs. Here is some code that does the same thing:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
xs = np.cumsum(np.random.rand(200))
# This function creates a general situation where the max is not always at the end or beginning
ys = xs**1.2 + 10 * np.sin(xs)
plt.plot(xs, ys, '+-')
threshold = 10
xs_thresh_ind = np.zeros_like(xs, dtype=int)
deltas = np.zeros_like(ys)
for i, x in enumerate(xs):
    # Find indices that lie within the time threshold
    period_end_ind = np.argmax(xs > x + threshold)
    # Only operate when the window is wide enough (this can be treated differently)
    if period_end_ind > 0:
        xs_thresh_ind[i] = period_end_ind
        # Find extrema in the period
        period_min = np.min(ys[i:period_end_ind + 1])
        period_max = np.max(ys[i:period_end_ind + 1])
        deltas[i] = period_max - period_min
max_ind_low = np.argmax(deltas)
max_ind_high = xs_thresh_ind[max_ind_low]
max_delta = deltas[max_ind_low]
print(
'Max delta {:.2f} is in period x[{}]={:.2f},{:.2f} and x[{}]={:.2f},{:.2f}'
.format(max_delta, max_ind_low, xs[max_ind_low], ys[max_ind_low],
max_ind_high, xs[max_ind_high], ys[max_ind_high]))
df = pd.DataFrame(ys, index=xs)
OUTPUT:
Max delta 48.76 is in period x[167]=86.10,200.32 and x[189]=96.14,249.09
Is there an efficient pandaic way to achieve something similar?
Create a Series from ys values, indexed by xs - but convert xs to be actual timedelta elements, rather than the float equivalent.
ts = pd.Series(ys, index=pd.to_timedelta(xs, unit="s"))
We want to apply a leading, 10 second window in which we calculate the difference between max and min. Because we want it to be leading, we'll sort the Series in descending order and apply a trailing window.
deltas = ts.sort_index(ascending=False).rolling("10s").agg(lambda s: s.max() - s.min())
Find the maximum delta with deltas[deltas == deltas.max()], which gives
0 days 00:01:26.104797298 48.354851
meaning a delta of 48.35 was found in the interval [86.1, 96.1)
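The same idea on a tiny hand-checkable series, as a sketch: the timestamps and values below are made up for illustration, and `rolling(...).max() - rolling(...).min()` is used in place of the lambda, which should give the same spread without calling Python code per window. It assumes a pandas version that accepts a monotonically decreasing index for time-based windows.

```python
import pandas as pd

# Toy data (assumed for illustration): irregular timestamps in seconds and their values
xs = [0.0, 4.0, 8.0, 12.0, 16.0]
ys = [1.0, 5.0, 2.0, 9.0, 30.0]
ts = pd.Series(ys, index=pd.to_timedelta(xs, unit="s"))

# Sorting descending turns the trailing 10 s window into a leading one: [t, t + 10 s)
roll = ts.sort_index(ascending=False).rolling("10s")
deltas = roll.max() - roll.min()

start = deltas.idxmax()        # start of the window with the largest spread
print(start, deltas.max())
```

Here the window starting at 8 s covers the samples at 8, 12 and 16 s, so its spread is 30 - 2 = 28.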

Python exponential curve fitting in pandas: Define function parameters per row

My dataframe is [11 x 300] (11 columns, 300 rows): the column headers are the 'x' values ([0.75,1,1.25,1.5,1.75,2,2.25,2.5,2.75,3,3.25]) and each row holds the corresponding 'y' values. Each row can be described by an exponential function in the following format: a * x^k + b.
The goal is to add three additional columns, describing a, k and b for that specific row. Just like: Python curve fitting on pandas dataframe then add coef to new columns
Instead of a polynomial function, my data needs to be described in the following format: a * x**k + b.
As I cannot find any solution to derive the coefficients by using np.polyfit, I split my dataframe into different lists.
x = np.array([0.75,1,1.25,1.5,1.75,2,2.25,2.5,2.75,3,3.25])
y1 = np.array([288.79,238.32,199.42,181.22,165.50,154.74,152.25,152.26,144.81,144.81,144.81])
y2 = np.array([309.92,255.75,214.02,194.48,177.61,166.06,163.40,163.40,155.41,155.41,155.41])
...
y300 = np.array([352.18,290.63,243.20,221.00,201.83,188.71,185.68,185.68,176.60,176.60,176.60])
from scipy.optimize import curve_fit
def func(x,a,k,b):
    return a * (x**k) + b
popt1, pcov = curve_fit(func,x,y1, p0 = (300,-0.5,0))
...
popt300, pcov = curve_fit(func,x,y300, p0 = (300,-0.5,0))
output:
popt1
[107.73727907 -1.545475 123.48621504]
...
popt300
[131.38411712 -1.5454452 150.59522147]
This works when I split the dataframe rows into separate arrays and compute popt for each one.
To avoid splitting all 300 rows, I would prefer to apply the same approach as in Python curve fitting on pandas dataframe then add coef to new columns:
my_coep_array = pd.DataFrame(np.polyfit(x, df.values,1)).T
But how can I do the equivalent of np.polyfit for a * x**k + b?
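One possible approach, sketched under assumptions: keep the data in the dataframe and call curve_fit once per row through DataFrame.apply. Only the three rows quoted in the question are used here, and the helper name `fit_row` and the column names a, k, b are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

def func(x, a, k, b):
    return a * (x ** k) + b

x = np.array([0.75, 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3, 3.25])
# Three of the rows from the question stand in for the full 300-row frame
df = pd.DataFrame(
    [[288.79, 238.32, 199.42, 181.22, 165.50, 154.74, 152.25, 152.26, 144.81, 144.81, 144.81],
     [309.92, 255.75, 214.02, 194.48, 177.61, 166.06, 163.40, 163.40, 155.41, 155.41, 155.41],
     [352.18, 290.63, 243.20, 221.00, 201.83, 188.71, 185.68, 185.68, 176.60, 176.60, 176.60]],
    columns=x,
)

def fit_row(row):
    """Hypothetical helper: fit one row and return its coefficients as a Series."""
    popt, _ = curve_fit(func, x, row.to_numpy(dtype=float), p0=(300, -0.5, 0))
    return pd.Series(popt, index=["a", "k", "b"])

coefs = df.apply(fit_row, axis=1)   # one (a, k, b) triple per row
result = df.join(coefs)             # original data plus the three new columns
print(coefs)
```

This still runs one nonlinear fit per row, so it is a convenience rather than a vectorised replacement for np.polyfit.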

How to efficiently compute an L2 distance between rows of two array using only basic numpy operations? [duplicate]

I have 2 lists of points as numpy.ndarray, each row is the coordinate of a point, like:
a = np.array([[1,0,0],[0,1,0],[0,0,1]])
b = np.array([[1,1,0],[0,1,1],[1,0,1]])
Here I want to calculate the Euclidean distance between all pairs of points in the 2 lists: for each point p_a in a, I want the distance between it and every point p_b in b. So the result is
d = np.array([[1,np.sqrt(3),1],[1,1,np.sqrt(3)],[np.sqrt(3),1,1]])
How to use matrix multiplication in numpy to compute the distance matrix?
Using direct numpy broadcasting, you can do this:
dist = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))
Alternatively, scipy has a routine that will compute this slightly more efficiently (particularly for large matrices)
from scipy.spatial.distance import cdist
dist = cdist(a, b)
I would avoid solutions that depend on factoring-out matrix products (of the form A^2 + B^2 - 2AB), because they can be numerically unstable due to floating point roundoff errors.
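As a quick sanity check, here is a sketch comparing a broadcasting expression written with explicit axes against cdist, on inputs with different numbers of rows. The random test data is an assumption for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
a = rng.standard_normal((5, 3))   # 5 points in 3-D
b = rng.standard_normal((2, 3))   # 2 points in 3-D

# (5, 1, 3) - (1, 2, 3) broadcasts to (5, 2, 3); summing the last axis leaves (5, 2)
dist_broadcast = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))
dist_cdist = cdist(a, b)
print(dist_broadcast.shape, np.allclose(dist_broadcast, dist_cdist))
```

Note that the broadcasting version materialises an intermediate array of shape (n, m, k), which can dominate memory use for large inputs.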
To compute the squared Euclidean distance for each pair of rows of X and Y, we need to find:
(Xik-Yjk)**2 = Xik**2 + Yjk**2 - 2*Xik*Yjk
and then sum along k to get the squared distance between the corresponding points, dist(Xi,Yj)**2.
Distributing the sum over the terms, it reduces to:
dist(Xi,Yj)**2 = sum_k(Xik**2) + sum_k(Yjk**2) - 2*sum_k(Xik*Yjk)
Bringing in matrix multiplication for the last part, we get all the squared distances at once, like so -
dist**2 = sum_rows(X^2)[:,None] + sum_rows(Y^2) - 2*matrix_multiplication(X, Y.T)
Hence, putting into NumPy terms, we would end up with the euclidean distances for our case with a and b as the inputs, like so -
np.sqrt((a**2).sum(1)[:,None] + (b**2).sum(1) - 2*a.dot(b.T))
Leveraging np.einsum, we could replace the first two summation-reductions with -
np.einsum('ij,ij->i',a,a)[:,None] + np.einsum('ij,ij->i',b,b)
More info could be found on eucl_dist package's wiki page (disclaimer: I am its author).
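Putting the pieces together, a minimal sketch of the expansion as one function. The name `pairwise_l2` is made up for illustration, and the clipping step is an added safeguard, not part of the expressions above: roundoff can leave slightly negative squared distances, which would turn into NaN under the square root.

```python
import numpy as np

def pairwise_l2(a, b):
    """Hypothetical helper: Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a2 = np.einsum('ij,ij->i', a, a)[:, None]   # squared row norms of a, shape (n, 1)
    b2 = np.einsum('ij,ij->i', b, b)[None, :]   # squared row norms of b, shape (1, m)
    sq = a2 + b2 - 2 * a @ b.T
    # Roundoff can leave tiny negative values where the true distance is 0
    return np.sqrt(np.maximum(sq, 0))

a = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
b = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]])
d = pairwise_l2(a, b)
print(d)
```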
If you have two 1-dimensional arrays, x and y, holding the coordinates of a single set of points (x[i], y[i]), you can tile each array into a square matrix, subtract its transpose, and apply the distance formula. The result is a symmetric matrix of the distances between those points.
x = [1, 2, 3]
y = [4, 5, 6]
xx = np.repeat(x,3,axis = 0).reshape(3,3)
yy = np.repeat(y,3,axis = 0).reshape(3,3)
dist = np.sqrt((xx-xx.T)**2 + (yy-yy.T)**2)
dist
Out[135]:
array([[0. , 1.41421356, 2.82842712],
[1.41421356, 0. , 1.41421356],
[2.82842712, 1.41421356, 0. ]])
L2 distance = (a^2 + b^2 - 2ab)^0.5
a = np.random.randn(5, 3)
b = np.random.randn(2, 3)
a2 = np.sum(np.square(a), axis = 1)[..., None]
b2 = np.sum(np.square(b), axis = 1)[None, ...]
ab = -2*np.dot(a, b.T)
dist = np.sqrt(a2 + b2 + ab)

Represent a first order differential equation in numpy

I have an equation dy/dx = x + y/5 and an initial value, y(0) = -3.
I would like to know how to plot the exact graph of this function using pyplot.
I also have x = np.linspace(0, interval, steps+1), which I would like to use as the x axis, so I'm only looking for the y-axis values.
Thanks in advance.
Just for completeness, this kind of equation can easily be integrated numerically, using scipy.integrate.odeint.
import numpy as np
from scipy.integrate import odeint
import matplotlib.pyplot as plt
# function dy/dx = x + y/5.
func = lambda y,x : x + y/5.
# Initial condition
y0 = -3 # at x=0
# values at which to compute the solution (needs to start at x=0)
x = np.linspace(0, 4, 101)
# solution
y = odeint(func, y0, x)
# plot the solution, note that y is a column vector
plt.plot(x, y[:,0])
plt.xlabel('x')
plt.ylabel('y')
plt.show()
Given that you need to solve the d.e. you might prefer doing this algebraically, with sympy. (Or you might not.)
Import the module and define the function and the dependent variable.
>>> from sympy import *
>>> f = Function('f')
>>> var('x')
x
Invoke the solver. Note that all terms of the d.e. must be transposed to the left of the equals sign, and that the y must be replaced by the designator for the function.
>>> dsolve(Derivative(f(x),x)-x-f(x)/5)
Eq(f(x), (C1 + 5*(-x - 5)*exp(-x/5))*exp(x/5))
As you would expect, the solution is given in terms of an arbitrary constant. We must solve for that using the initial value. We define it as a sympy variable.
>>> var('C1')
C1
Now we create an expression to represent this arbitrary constant as the left side of an equation that we can solve. We replace f(0) with its value in the initial condition. Then we substitute the value of x in that condition to get an equation in C1.
>>> expr = -3 - ( (C1 + 5*(-x - 5)*exp(-x/5))*exp(x/5) )
>>> expr.subs(x,0)
-C1 + 22
In other words, C1 = 22. Finally, we can use this value to obtain the particular solution of the differential equation.
>>> ((C1 + 5*(-x - 5)*exp(-x/5))*exp(x/5)).subs(C1,22)
((-5*x - 25)*exp(-x/5) + 22)*exp(x/5)
Because I'm absentminded and ever fearful of making egregious mistakes I check that this function satisfies the initial condition.
>>> (((-5*x - 25)*exp(-x/5) + 22)*exp(x/5)).subs(x,0)
-3
(Usually things are incorrect only when I forget to check them. Such is life.)
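The manual solve for C1 can probably be skipped: as a sketch, assuming a sympy version whose dsolve accepts the `ics` argument, the initial condition can be passed directly.

```python
from sympy import Derivative, Eq, Function, dsolve, simplify, symbols

x = symbols('x')
f = Function('f')

ode = Eq(Derivative(f(x), x), x + f(x) / 5)
# ics fixes the arbitrary constant from the initial condition y(0) = -3
sol = dsolve(ode, f(x), ics={f(0): -3})
y = sol.rhs

print(sol)
print(y.subs(x, 0))                         # value at x = 0
print(simplify(y.diff(x) - (x + y / 5)))    # residual of the ODE
```

The last two lines repeat the checks from the text: the initial condition and the differential equation itself.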
And I can plot this in sympy too.
>>> plot(((-5*x - 25)*exp(-x/5) + 22)*exp(x/5),(x,-1,5))
<sympy.plotting.plot.Plot object at 0x0000000008C2F780>
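To get plain y values on a numpy grid, as the question asks, the particular solution above expands to y = 22*exp(x/5) - 5*x - 25 and can be evaluated directly. The sketch below also cross-checks it against odeint; the values of `interval` and `steps` are assumptions, since the question leaves them open.

```python
import numpy as np
from scipy.integrate import odeint

def exact(x):
    # Expanded form of the particular solution: y = 22*exp(x/5) - 5*x - 25
    return 22 * np.exp(x / 5) - 5 * x - 25

interval, steps = 4, 100                # assumed values for illustration
x = np.linspace(0, interval, steps + 1)
y_exact = exact(x)

# Cross-check against numerical integration of dy/dx = x + y/5, y(0) = -3
y_num = odeint(lambda y, x: x + y / 5, -3, x)[:, 0]
print(np.max(np.abs(y_num - y_exact)))
```

The arrays x and y_exact can then be passed straight to plt.plot.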

Equivalent of R's of cor.test in Python

Is there a way I can find the r confidence interval in Python?
In R I could do something like:
cor.test(m, h)
Pearson's product-moment correlation
data: m and h
t = 0.8974, df = 4, p-value = 0.4202
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.6022868 0.9164582
sample estimates:
cor
0.4093729
In Python I can calculate r (cor) using:
r,p = scipy.stats.pearsonr(df.age, df.pets)
But that doesn't return the r confidence interval.
Here's one way to calculate the confidence interval.
First get the correlation value (Pearson's)
In [85]: from scipy import stats
In [86]: corr = stats.pearsonr(df['col1'], df['col2'])
In [87]: corr
Out[87]: (0.551178607008175, 0.0)
Use the Fisher transformation to get z
In [88]: z = np.arctanh(corr[0])
In [89]: z
Out[89]: 0.62007264620685021
And the sigma value, i.e. the standard error
In [90]: sigma = (1/((len(df.index)-3)**0.5))
In [91]: sigma
Out[91]: 0.013840913308956662
Get the two-sided 95% interval on the z scale, using the quantile function (ppf) of the standard normal distribution
In [92]: cint = z + np.array([-1, 1]) * sigma * stats.norm.ppf((1+0.95)/2)
Finally take hyperbolic tangent to get interval values for 95%
In [93]: np.tanh(cint)
Out[93]: array([ 0.53201034, 0.56978224])
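The steps above can be wrapped into one function, sketched here. The name `pearson_ci` and its signature are made up for illustration, and it takes r and the sample size directly so it can be tried on the numbers from the R output in the question.

```python
import numpy as np
from scipy import stats

def pearson_ci(r, n, level=0.95):
    """Hypothetical helper: Fisher-z confidence interval for a Pearson r from n pairs."""
    z = np.arctanh(r)                        # Fisher transformation
    sigma = 1 / np.sqrt(n - 3)               # standard error on the z scale
    half = stats.norm.ppf((1 + level) / 2) * sigma
    return np.tanh([z - half, z + half])     # back-transform to the r scale

# Numbers from the R output in the question: cor = 0.4093729, df = 4, so n = df + 2 = 6
lo, hi = pearson_ci(0.4093729, 6)
print(lo, hi)
```

With these inputs the result should agree with the interval cor.test reports (-0.6022868, 0.9164582) up to rounding of r.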