Numpy or Pandas function for "x-value-window" means or other stats? - numpy

Let's say I have x-y data samples sorted by x-value. I'm going to use Pandas as an example, but I would be perfectly happy with a Numpy/Scipy-only solution, of course.
In [24]: pd.set_option('display.max_rows', 10)
In [25]: df = pd.DataFrame(np.random.randn(100, 2), columns=['x', 'y'])
In [26]: df = df.sort_values('x')
In [27]: df
Out[27]:
x y
13 -3.403818 0.717744
49 -2.688876 1.936267
74 -2.388332 -0.121599
52 -2.185848 0.617896
90 -2.155343 -1.132673
.. ... ...
65 1.736506 -0.170502
0 1.770901 0.520490
60 1.878376 0.206113
63 2.263602 1.112115
33 2.384195 -1.877502
[100 rows x 2 columns]
Now, I want to kind of "window" it or "discretize" it and get statistics on each window. But I don't want to do the Pandas moving-window functions because they define windows by rows. I want to define windows by a span of x-values, thus "x-value-window". Specifically, let's define each x-value-window with 2 parameters:
center x-value of each window
in this example, let's say I want x = 0.0 + 0.4 * k for all positive or negative k
thus -3.2, -2.8, -2.4, ..., 1.6, 2.0, 2.4
width of each window
in this example, let's say I want W = 0.5
thus, the example windows will be [-3.2-0.25, -3.2+0.25], [-2.8-0.25, -2.8+0.25], ..., [2.4-0.25, 2.4+0.25]
note that the windows overlap, which is intended
Having thus defined the windows, I would like to ask if there's a function that will produce the following data frame (or numpy array):
x y
-3.2 mean of y-values in x-value-window centered at -3.2
-2.8 mean of y-values in x-value-window centered at -2.8
-2.4 mean of y-values in x-value-window centered at -2.4
... ...
1.6 mean of y-values in x-value-window centered at 1.6
2.0 mean of y-values in x-value-window centered at 2.0
2.4 mean of y-values in x-value-window centered at 2.4
Is there anything that will do this for me? Or do I have to totally roll my own (and probably in a very slow python loop instead of fast numpy or pandas code)?
Extra 1: It would be even better if there's support for weighted windows (such as supported by Pandas's rolling_window function) but of course the weights in this case would not be based on how far the sample's row is from the center row of the window, but rather, how far the sample's x-value is from the center of the x-value-window.
Extra 2: It would be nice if there's support for statistics other than mean on the x-value-windows, e.g. (a) variance of the y-values in each x-value-window or (b) count of the number of samples falling within each x-value-window.

I first create a range of window centers that are multiples of k. The range is wide enough that the lowest center minus the width and the highest center plus the width together capture all x values.
I then iterate through these centers, which are spaced k apart. At each center, I use loc to select the y values whose x lies within plus or minus w of the center, and take their mean. These means are used to create the result dataframe. Note that w is applied on each side of the center here, so every window spans 2 * w; to get the total width W = 0.5 from the question, set w = 0.25.
import math
import numpy as np
import pandas as pd
k = .4
w = .5
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 2), columns=['x', 'y'])
x_range = np.arange(math.floor((df.x.min() + w) / k) * k,
                    k * (math.ceil((df.x.max() - w) / k) + 1), k)
result = pd.DataFrame((df.loc[df.x.between(x - w, x + w), 'y'].mean() for x in x_range),
                      index=x_range, columns=['y_mean'])
result.index.name = 'centered_x'
>>> result
y_mean
centered_x
-2.400000e+00 0.653619
-2.000000e+00 0.733606
-1.600000e+00 0.576594
-1.200000e+00 0.150462
-8.000000e-01 0.065884
-4.000000e-01 0.022925
-8.881784e-16 0.211693
4.000000e-01 0.057527
8.000000e-01 -0.141970
1.200000e+00 0.233695
1.600000e+00 0.203570
2.000000e+00 0.306409
2.400000e+00 0.576789
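If the Python-level loop becomes a bottleneck, the same idea can be sketched in plain NumPy with a broadcast membership mask. This is only an illustration, not a built-in pandas or NumPy feature: window_stats is a hypothetical helper name, width is taken to be the full window width W from the question (so each window is center ± width / 2), and the variance is the population variance (ddof=0). It also returns the count per window, which covers the question's Extra 2.

```python
import numpy as np

def window_stats(x, y, centers, width):
    """Mean, variance and count of y for samples with |x - center| <= width / 2.

    Hypothetical helper: `width` is the FULL window width, as in the question.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # mask[i, j] is True when sample j falls inside window i
    mask = np.abs(x[None, :] - centers[:, None]) <= width / 2
    count = mask.sum(axis=1)
    # Empty windows divide by zero; silence the warning and let them become NaN
    with np.errstate(invalid="ignore", divide="ignore"):
        mean = (mask * y).sum(axis=1) / count
        var = (mask * (y - mean[:, None]) ** 2).sum(axis=1) / count
    return mean, var, count

# Tiny hand-checkable example
x = np.array([0.0, 0.1, 0.3, 1.0])
y = np.array([1.0, 2.0, 3.0, 4.0])
centers = np.array([0.0, 0.4, 2.0])
mean, var, count = window_stats(x, y, centers, width=0.5)
```

The mask has shape (number of windows, number of samples), so this trades memory for speed; for very large inputs a chunked loop over the centers may be the better choice.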

Related

changing range causes a distribution not normal

A post gives some code to plot this figure
import scipy.stats as ss
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(-10, 11)
xU, xL = x + 0.5, x - 0.5
prob = ss.norm.cdf(xU, scale = 3) - ss.norm.cdf(xL, scale = 3)
prob = prob / prob.sum() #normalize the probabilities so their sum is 1
nums = np.random.choice(x, size = 10000, p = prob)
plt.hist(nums, bins = len(x))
I modified this line
x = np.arange(-10, 11)
to this line
x = np.arange(10, 31)
I got this figure
How to fix that?
Given what you're asking Python to do, there's no error in this plot: it's a histogram of 10,000 samples from the tail (anything that rounds to an integer from 10 to 30) of a normal distribution with mean 0 and standard deviation 3. Since probabilities drop off steeply in the tail of a normal, it happens that none of the 10,000 exceeded 17, which is why you didn't get the full range up to 30.
If you just want the x-axis of the plot to cover your full intended range, you could add plt.xlim(9.5, 30.5) after plt.hist.
If you want a histogram with support over this entire range, then you'll need to adjust the mean and/or variance of the distribution. For instance, if you specify that your normal distribution has mean 20 rather than mean 0 when you obtain prob, i.e.
prob = ss.norm.cdf(xU, loc=20, scale=3) - ss.norm.cdf(xL, loc=20, scale=3)
then you'll recover a similar-looking histogram, just translated to the right by 20.
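A minimal sketch of that second fix, assuming the same discretization as in the question and only shifting the mean to 20. The seeded generator np.random.default_rng(0) and the use of np.histogram in place of plt.hist are choices made for this illustration, so the counts can be inspected without opening a plot window.

```python
import numpy as np
import scipy.stats as ss

x = np.arange(10, 31)                # integer support 10..30
xU, xL = x + 0.5, x - 0.5
# Probability mass of each unit-wide bin under N(mean=20, sd=3)
prob = ss.norm.cdf(xU, loc=20, scale=3) - ss.norm.cdf(xL, loc=20, scale=3)
prob = prob / prob.sum()             # renormalize the truncated distribution

rng = np.random.default_rng(0)       # seeded only for reproducibility
nums = rng.choice(x, size=10000, p=prob)

# One bin per integer: edges at 9.5, 10.5, ..., 30.5
edges = np.arange(9.5, 31.5)
counts, _ = np.histogram(nums, bins=edges)
```

Passing bins=edges to plt.hist would draw the same counts with one bar per integer.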

Creating a grid of polar histograms (python)

I want to create a grid of subplots that looks like the following picture.
It should contain 25 polar histograms, and I want to add them to the plot one by one.
It needs to be in Python.
I have figured out that I need to use matplotlib, but I can't work out the rest.
Thanks a lot!
You can create a grid of polar axes via projection='polar'.
hist creates a histogram, also when working with polar axes. Note that the x is in radians with a range of 2π. It works best when you give the bins explicitly as a linspace from 0 to 2π (or from -π to π, depending on the data). The third parameter of linspace should be one more than the number of bars that you'd want for the full circle.
About the exact parameters of axs[i][j].hist(x, bins=np.linspace(0, 2 * np.pi, np.random.randint(7, 30), endpoint=True), color='dodgerblue', ec='black'):
axs[i][j]: draw on the jth subplot of the ith row
.hist: create a histogram
x: the values that are put into bins
bins=: the bins to use (either a fixed number of equal-width bins between the lowest and highest x, or an explicit sequence of boundaries; the default is 10 equal-width bins)
np.random.randint(7, 30): a random whole number between 7 and 29
np.linspace(0, 2 * np.pi, n, endpoint=True): place n evenly spaced boundaries from 0 to 2π, which gives n-1 equal bins; endpoint=True (the default) puts boundaries at 0, at 2π and at n-2 positions in between; with endpoint=False there is a boundary at 0 and at n-1 positions in between, but none at the end
color='dodgerblue': the color of the histogram bars will be blueish
ec='black': the edge color of the bars will be black
import numpy as np
import matplotlib.pyplot as plt
fig, axs = plt.subplots(5, 5, figsize=(8, 8),
                        subplot_kw=dict(projection='polar'))
for i in range(5):
    for j in range(5):
        x = np.random.uniform(0, 2 * np.pi, 50)
        axs[i][j].hist(x, bins=np.linspace(0, 2 * np.pi, np.random.randint(7, 30)), color='dodgerblue', ec='black')
plt.tight_layout()
plt.show()
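The relation between the third argument of linspace and the number of bars can be checked without plotting. The sketch below uses np.histogram as a stand-in for the binning that hist performs; the value n = 9 and the seeded generator are arbitrary choices for the illustration.

```python
import numpy as np

n = 9
# endpoint=True is the default: n boundaries including both 0 and 2*pi
edges = np.linspace(0, 2 * np.pi, n)
# endpoint=False: still n values, but the last one stops short of 2*pi
open_edges = np.linspace(0, 2 * np.pi, n, endpoint=False)

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 50)

# n boundaries produce n - 1 bins, i.e. n - 1 bars around the circle
counts, _ = np.histogram(angles, bins=edges)
```

With endpoint=False the bins would not reach 2π, so any angle beyond the last boundary would be dropped from the histogram.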

Python numpy percentile vs scipy percentileofscore

I am confused as to what I am doing incorrectly.
I have the following code:
import numpy as np
from scipy import stats
df
Out[29]: array([66., 69., 67., 75., 69., 69.])
val = 73.94
z1 = stats.percentileofscore(df, val)
print(z1)
Out[33]: 83.33333333333334
np.percentile(df, z1)
Out[34]: 69.999999999
I was expecting that np.percentile(df, z1) would give me back val = 73.94
I think you're not quite understanding what percentileofscore and percentile actually do. They are not inverses of each other.
From the docs for scipy.stats.percentileofscore:
The percentile rank of a score relative to a list of scores.
A percentileofscore of, for example, 80% means that 80% of the scores in a are below the given score. In the case of gaps or ties, the exact definition depends on the optional keyword, kind.
So when you supply the value 73.94, there are 5 elements of df that fall below that score, and 5/6 gives you your 83.3333% result.
Now in the Notes for numpy.percentile:
Given a vector V of length N, the q-th percentile of V is the value q/100 of the way from the minimum to the maximum in a sorted copy of V.
The default interpolation parameter is 'linear' so:
'linear': i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
Since you have provided 83.33 as your input parameter, you're looking at the value 83.33/100 of the way from the minimum to the maximum position in the sorted array, interpolated linearly between the two neighbouring elements.
If you're interested in digging through the source, you can find it here, but this is a simplified version of the calculation:
ap = np.asarray(sorted(df))
Nx = df.shape[0]
indices = z1 / 100 * (Nx - 1)
indices_below = np.floor(indices).astype(int)
indices_above = indices_below + 1
weight_above = indices - indices_below
weight_below = 1 - weight_above
x1 = ap[indices_below] * weight_below # 57.50000000000004
x2 = ap[indices_above] * weight_above # 12.499999999999956
x1 + x2
70.0
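The same calculation can be wrapped in a small function and compared against np.percentile. This is a sketch of the default linear interpolation only; linear_percentile is a hypothetical name, and the clamp on the upper index is added here so that q = 100 does not run off the end of the array.

```python
import numpy as np

def linear_percentile(values, q):
    """Hypothetical re-implementation of linear-interpolation percentiles."""
    ordered = np.sort(np.asarray(values, dtype=float))
    pos = q / 100 * (len(ordered) - 1)        # fractional index into sorted data
    lo = int(np.floor(pos))
    hi = min(lo + 1, len(ordered) - 1)        # clamp so q = 100 stays in range
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac

scores = np.array([66., 69., 67., 75., 69., 69.])
result = linear_percentile(scores, 500 / 6)   # 83.33..., the rank from the question
```

The result interpolates between 69 and 75, the two largest sorted values, which is why it cannot return 73.94: that number never entered the computation.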

Multiple plots in a single window

I need to draw many such rows (for a0 .. a128) in a single window. I've searched FacetGrid, PairGrid and everywhere else I could think of, but couldn't find a way. Only regplot has a similar ax argument, but it doesn't plot histograms. My data is 128 real-valued features with a label column [0, 1]. I need the graphs to be shown from my Python code as a separate application on Linux.
Also, is there a way to scale this histogram to show relative values on Y such that the right curve is not skewed?
g = sns.FacetGrid(df, col="Result")
g.map(plt.hist, "a0", bins=20)
plt.show()
Just a simple example using matplotlib. The code is not optimized (ugly, but simple plot-indexing):
import numpy as np
import matplotlib.pyplot as plt
N = 5
data = np.random.normal(size=(N*N, 1000))
f, axarr = plt.subplots(N, N)  # maybe you want sharex=True, sharey=True
pi = [0, 0]  # [row, column] of the current subplot
for i in range(data.shape[0]):
    if pi[1] == N:
        pi[0] += 1  # next row
        pi[1] = 0   # first column again
    # density=True normalizes each histogram; it replaces the older
    # normed=True, which newer Matplotlib releases no longer accept
    axarr[pi[0], pi[1]].hist(data[i], density=True)
    pi[1] += 1
plt.show()
Output:
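The manual row/column bookkeeping can be avoided by iterating over the flattened axes array. A sketch under the same assumptions as the example above (random normal data, one histogram per row of data); the seeded generator is only there for reproducibility, and plt.show() is left out so the snippet can also run without a display.

```python
import numpy as np
import matplotlib.pyplot as plt

N = 5
rng = np.random.default_rng(0)
data = rng.normal(size=(N * N, 1000))

fig, axes = plt.subplots(N, N, sharex=True, sharey=True)
# axes.flat walks the grid row by row, so no index arithmetic is needed
for ax, row in zip(axes.flat, data):
    ax.hist(row, density=True)  # density=True: bar areas sum to 1 per subplot
```

Calling plt.show() afterwards opens the window with all 25 histograms.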

Transform a numpy 3D ndarray to a symmetric form with respect to a specific index

In the case of a matrix mat n x n, i can do the following
sym = 0.5 * (mat + mat.T)
the operation gives the desired result sym[i,j] = sym[j,i]
Suppose we have a 3D array ndarr[i,j,k], where i, j, k = 0, 1, ..., n-1,
so ndarr is n x n x n. The idea is to obtain the following "symmetric" form
nsym[i,j,k] = nsym[j,i,k] using ndarr. I tried this:
import numpy as np
# Generate some random matrix, n = 5
ndarr = np.random.beta(0.1,1,(5,5,5))
# First attempt to symmetrize
sym1 = np.array([0.5*(ndarr[:,:,k]+ndarr[:,:,k].T) for k in range(5)])
The problem here is that sym1[i,j,k] != sym1[j,i,k], which is what is required. In fact I obtain sym1[i,j,k] = sym1[i,k,j], symmetric under the exchange of the last two indices!
# Second attempt
sym2 = 0.5*(ndarr+ndarr.T)
Same problem here: sym2 is symmetric under the exchange of the first and last indices, sym2[i,j,k] = sym2[k,j,i].
To summarize, the goal is to find a form of the 3D array that is symmetric in the first two indices for each value of the third index, and to preserve the values on the diagonal of the original, ndarr[i,i,i].
The problem here is that you're not using the correct transpose:
sym = 0.5 * (ndarr + np.transpose(ndarr, (1, 0, 2)))
By default, np.transpose and the .T property will reverse the order of the axes. In your case, we want to only flip the first two axes: (0,1,2) -> (1,0,2).
EDIT: The reason your first attempt failed is that you were stacking the symmetrized matrices along the first axis, not the last. This is clearer if you make ndarr with shape (5, 5, 3):
In [16]: sym = np.array([0.5*(ndarr[:,:,k]+ndarr[:,:,k].T) for k in range(3)])
In [17]: sym.shape
Out[17]: (3L, 5L, 5L)
In any case, the version above with np.transpose is cleaner and more efficient.
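A quick check of both points, as a sketch: the np.transpose version is symmetric in the first two indices, and the list-comprehension attempt gives the same result once the slices are stacked along the last axis with np.stack(..., axis=-1). The shape (5, 5, 3), the seeded generator, and the swapaxes alternative are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
ndarr = rng.random((5, 5, 3))

# Swap only the first two axes: (0, 1, 2) -> (1, 0, 2)
sym = 0.5 * (ndarr + np.transpose(ndarr, (1, 0, 2)))

# Equivalent spelling: swapaxes(0, 1) exchanges the same pair of axes
sym_alt = 0.5 * (ndarr + ndarr.swapaxes(0, 1))

# The per-slice attempt works if the slices are stacked on the LAST axis
stacked = np.stack(
    [0.5 * (ndarr[:, :, k] + ndarr[:, :, k].T) for k in range(ndarr.shape[2])],
    axis=-1,
)
```

Entries with i == j are averaged with themselves, so the values ndarr[i,i,k], including the diagonal ndarr[i,i,i], are preserved.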