Why does the weighted average give an unintended result? - numpy

We know the weighted average formula, so when I use numpy:
a = np.array([1,2,3,4])
wts = np.array([1,2,3,4])
print(np.average(a, weights=wts))
it should be:
np.sum([1*1, 2*2, 3*3, 4*4]) / 4 # 7.5
but why do I get 3.0?

According to the docs for np.average, the weighted average is
avg = sum(a * weights) / sum(weights)
If you want to divide by the number of weights instead of their sum, you can simply do
a = np.array([1,2,3,4])
wts = np.array([1,2,3,4])
np.dot(a,wts) / wts.shape[0]
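For example, here is a quick sketch (reusing the arrays from the question) that contrasts the two denominators:

import numpy as np

a = np.array([1, 2, 3, 4])
wts = np.array([1, 2, 3, 4])

# numpy's definition: divide the weighted sum by the sum of the weights
print(np.average(a, weights=wts))     # 3.0, i.e. 30 / 10
print(np.sum(a * wts) / np.sum(wts))  # 3.0, the same thing

# dividing by the number of weights instead gives the value the question expected
print(np.dot(a, wts) / wts.shape[0])  # 7.5, i.e. 30 / 4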

Related

Pandas Groupby Weighted Standard Deviation

I have a dataframe:
   Type  Weights  Value  ....
0     W      0.5     15
1     C      1.2     19
2     W       12     25
3     C      7.1     15
...
...
I want to group on Type and then calculate the weighted mean and weighted standard deviation.
There seem to be solutions available for the weighted mean (groupby weighted average and sum in pandas dataframe) but none for the weighted standard deviation.
Is there a simple way to do it?
I have used the weighted standard deviation formula from the following link:
https://doc-archives.microstrategy.com/producthelp/10.7/FunctionsRef/Content/FuncRef/WeightedStDev__weighted_standard_deviation_of_a_sa.htm
However, you can modify it for a different formula.
import numpy as np

def weighted_sd(input_df):
    weights = input_df['Weights']
    vals = input_df['Value']
    numer = np.sum(weights * (vals - vals.mean())**2)
    denom = ((vals.count()-1)/vals.count())*np.sum(weights)
    return np.sqrt(numer/denom)

print(df.groupby('Type').apply(weighted_sd))
Minor correction to the weighted standard deviation formula from the previous answer.
import numpy as np

def weighted_sd(input_df):
    weights = input_df['Weights']
    vals = input_df['Value']
    weighted_avg = np.average(vals, weights=weights)
    numer = np.sum(weights * (vals - weighted_avg)**2)
    denom = ((vals.count()-1)/vals.count())*np.sum(weights)
    return np.sqrt(numer/denom)

print(df.groupby('Type').apply(weighted_sd))
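As a quick check, here is a minimal usage sketch that builds a frame from the four example rows shown above (the real data has more rows) and runs the corrected function next to a weighted mean:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Type': ['W', 'C', 'W', 'C'],
                   'Weights': [0.5, 1.2, 12.0, 7.1],
                   'Value': [15, 19, 25, 15]})

# weighted mean per group
print(df.groupby('Type').apply(lambda g: np.average(g['Value'], weights=g['Weights'])))
# weighted standard deviation per group, using weighted_sd as defined above
print(df.groupby('Type').apply(weighted_sd))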

The weighted means of groups are not equal to the total mean in pandas groupby

I have a strange problem with calculating the weighted mean of a pandas dataframe. I want to do the following steps:
(1) calculate the weighted mean of all the data
(2) calculate the weighted mean of each group of data
The issue is that when I do step 2, the mean of the group means (weighted by the number of members in each group) is not the same as the weighted mean of all the data (step 1). Mathematically it should be (here). I even thought maybe the issue is the dtype, so I set everything to float64, but the problem still exists. Below I provide a simple example that illustrates this problem:
My dataframe has data, weights and groups columns:
import numpy as np
import pandas as pd

data = np.array([
    0.20651903, 0.52607571, 0.60558061, 0.97468593, 0.10253621, 0.23869854,
    0.82134792, 0.47035085, 0.19131938, 0.92288234
])
weights = np.array([
    4.06071562, 8.82792146, 1.14019687, 2.7500913, 0.70261312, 6.27280216,
    1.27908358, 7.80508994, 0.69771745, 4.15550846
])
groups = np.array([1, 1, 2, 2, 2, 2, 3, 3, 4, 4])
df = pd.DataFrame({"data": data, "weights": weights, "groups": groups})
print(df)
>>> print(df)
       data   weights  groups
0  0.206519  4.060716       1
1  0.526076  8.827921       1
2  0.605581  1.140197       2
3  0.974686  2.750091       2
4  0.102536  0.702613       2
5  0.238699  6.272802       2
6  0.821348  1.279084       3
7  0.470351  7.805090       3
8  0.191319  0.697717       4
9  0.922882  4.155508       4
# Define a weighted mean function to apply to each group
def my_fun(x, y):
    tmp = np.average(x, weights=y)
    return tmp

# Mean of the population
total_mean = np.average(np.array(df["data"], dtype="float64"),
                        weights=np.array(df["weights"], dtype="float64"))

# Group data
group_means = df.groupby("groups").apply(lambda d: my_fun(d["data"], d["weights"]))

# number of members of each group
counts = np.array([2, 4, 2, 2], dtype="float64")

# Total mean calculated from mean of groups mean weighted by counts of each group
total_mean_from_group_means = np.average(np.array(group_means, dtype="float64"),
                                         weights=counts)
print(total_mean)
0.5070955626929458
print(total_mean_from_group_means)
0.5344436242465216
As you can see, the total mean calculated from the group means is not equal to the total mean. What am I doing wrong here?
EDIT: Fixed a typo in the code.
You compute a weighted mean within each group, so when you compute the total mean from the weighted means, the correct weight for each group is the sum of the weights within the group (and not the size of the group).
In [47]: wsums = df.groupby("groups").apply(lambda d: d["weights"].sum())
In [48]: total_mean_from_group_means = np.average(group_means, weights=wsums)
In [49]: total_mean_from_group_means
Out[49]: 0.5070955626929458
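To see why this works, here is a short sketch (reusing df from the question) that reproduces the overall weighted mean from the per-group means: each group mean is sum(w*x)/sum(w) within the group, so multiplying it back by the group's total weight and summing over groups recovers sum(w*x) over all rows.

# weight each group's weighted mean by that group's total weight
numer = (df.groupby("groups")
           .apply(lambda d: np.average(d["data"], weights=d["weights"]) * d["weights"].sum())
           .sum())
denom = df["weights"].sum()

print(numer / denom)                                  # 0.5070955626929458
print(np.average(df["data"], weights=df["weights"]))  # 0.5070955626929458, identical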

Python numpy percentile vs scipy percentileofscore

I am confused as to what I am doing incorrectly.
I have the following code:
import numpy as np
from scipy import stats
df
Out[29]: array([66., 69., 67., 75., 69., 69.])
val = 73.94
z1 = stats.percentileofscore(df, val)
print(z1)
Out[33]: 83.33333333333334
np.percentile(df, z1)
Out[34]: 69.999999999
I was expecting that np.percentile(df, z1) would give me back val = 73.94
I think you're not quite understanding what percentileofscore and percentile actually do. They are not inverses of each other.
From the docs for scipy.stats.percentileofscore:
The percentile rank of a score relative to a list of scores.
A percentileofscore of, for example, 80% means that 80% of the scores in a are below the given score. In the case of gaps or ties, the exact definition depends on the optional keyword, kind.
So when you supply the value 73.94, there are 5 elements of df that fall below that score, and 5/6 gives you your 83.3333% result.
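A quick way to see this with the array from the question:

import numpy as np

df = np.array([66., 69., 67., 75., 69., 69.])
val = 73.94
print(np.sum(df < val) / len(df) * 100)  # 83.33333333333334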
Now in the Notes for numpy.percentile:
Given a vector V of length N, the q-th percentile of V is the value q/100 of the way from the minimum to the maximum in a sorted copy of V.
The default interpolation parameter is 'linear' so:
'linear': i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
Since you have provided roughly 83.33 as your input parameter, you're looking at a value about 83.33/100 of the way from the minimum to the maximum of your array.
If you're interested in digging through the source, you can find it here, but below is a simplified look at the calculation being done:
ap = np.asarray(sorted(df))
Nx = df.shape[0]
indices = z1 / 100 * (Nx - 1)
indices_below = np.floor(indices).astype(int)
indices_above = indices_below + 1
weight_above = indices - indices_below
weight_below = 1 - weight_above
x1 = ap[indices_below] * weight_below  # 57.50000000000004
x2 = ap[indices_above] * weight_above  # 12.499999999999956
x1 + x2  # 70.0

Calculating Cosine Distance in MXNet

I want to be able to calculate the cosine distance between row vectors using MXNet. Additionally I am working with batches of samples, and would like to calculate the cosine distance for each pair of samples (i.e. cosine distance of 1st row vector of batch #1 with 1st row vector of batch #2).
Cosine distance between two vectors u and v is defined as in scipy.spatial.distance.cosine:
1 - (u . v) / (||u|| * ||v||)
You can use mx.nd.batch_dot to perform this batch-wise cosine distance:
import mxnet as mx

def batch_cosine_dist(a, b):
    a1 = mx.nd.expand_dims(a, axis=1)
    b1 = mx.nd.expand_dims(b, axis=2)
    d = mx.nd.batch_dot(a1, b1)[:, 0, 0]
    a_norm = mx.nd.sqrt(mx.nd.sum((a * a), axis=1))
    b_norm = mx.nd.sqrt(mx.nd.sum((b * b), axis=1))
    dist = 1.0 - d / (a_norm * b_norm)
    return dist
And it will return an array with batch_size number of distances.
batch_size = 3
dim = 2
a = mx.random.uniform(shape=(batch_size, dim))
b = mx.random.uniform(shape=(batch_size, dim))
dist = batch_cosine_dist(a, b)
print(dist.asnumpy())
# [ 0.04385382 0.25792354 0.10448891]
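If you want to sanity-check the result outside MXNet, a small sketch using scipy (with the same a and b batches as above) should give the same per-row distances up to floating-point noise:

from scipy.spatial.distance import cosine

a_np = a.asnumpy()
b_np = b.asnumpy()
print([cosine(a_np[i], b_np[i]) for i in range(batch_size)])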

Equivalent of R's cor.test in Python

Is there a way I can find the r confidence interval in Python?
In R I could do something like:
cor.test(m, h)
Pearson's product-moment correlation
data: m and h
t = 0.8974, df = 4, p-value = 0.4202
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.6022868 0.9164582
sample estimates:
cor
0.4093729
In Python I can calculate r (cor) using:
r,p = scipy.stats.pearsonr(df.age, df.pets)
But that doesn't return the r confidence interval.
Here's one way to calculate the confidence interval.
First, get the correlation value (Pearson's r):
In [85]: from scipy import stats
In [86]: corr = stats.pearsonr(df['col1'], df['col2'])
In [87]: corr
Out[87]: (0.551178607008175, 0.0)
Use the Fisher transformation to get z
In [88]: z = np.arctanh(corr[0])
In [89]: z
Out[89]: 0.62007264620685021
And the sigma value, i.e. the standard error:
In [90]: sigma = (1/((len(df.index)-3)**0.5))
In [91]: sigma
Out[91]: 0.013840913308956662
Get the two-sided 95% interval on the z scale, using the percent point function (inverse CDF) of the standard normal distribution:
In [92]: cint = z + np.array([-1, 1]) * sigma * stats.norm.ppf((1+0.95)/2)
Finally, take the hyperbolic tangent to transform back and get the 95% interval for r:
In [93]: np.tanh(cint)
Out[93]: array([ 0.53201034, 0.56978224])
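Putting the steps together, here is a minimal sketch of a reusable helper; the name pearsonr_ci and the alpha parameter are illustrative choices, not part of scipy:

import numpy as np
from scipy import stats

def pearsonr_ci(x, y, alpha=0.05):
    # Pearson correlation and p-value
    r, p = stats.pearsonr(x, y)
    # Fisher z-transform of r and its standard error
    z = np.arctanh(r)
    sigma = 1 / np.sqrt(len(x) - 3)
    # two-sided (1 - alpha) interval on the z scale, transformed back to r
    halfwidth = sigma * stats.norm.ppf(1 - alpha / 2)
    lo, hi = np.tanh((z - halfwidth, z + halfwidth))
    return r, p, lo, hi

# usage with the question's columns:
# r, p, lo, hi = pearsonr_ci(df['age'], df['pets'])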