The weighted means of groups are not equal to the total mean in pandas groupby - numpy

I have a strange problem with calculating the weighted mean of a pandas dataframe. I want to do the following steps:
(1) calculate the weighted mean of all the data
(2) calculate the weighted mean of each group of data
The issue is that when I do step 2, the mean of the group means (weighted by the number of members in each group) is not the same as the weighted mean of all the data (step 1). Mathematically the two should be equal. I even thought the issue might be the dtype, so I set everything to float64, but the problem still exists. Below is a simple example that illustrates the problem:
My dataframe has data, weights and groups columns:
import numpy as np
import pandas as pd

data = np.array([
    0.20651903, 0.52607571, 0.60558061, 0.97468593, 0.10253621, 0.23869854,
    0.82134792, 0.47035085, 0.19131938, 0.92288234
])
weights = np.array([
    4.06071562, 8.82792146, 1.14019687, 2.7500913, 0.70261312, 6.27280216,
    1.27908358, 7.80508994, 0.69771745, 4.15550846
])
groups = np.array([1, 1, 2, 2, 2, 2, 3, 3, 4, 4])
df = pd.DataFrame({"data": data, "weights": weights, "groups": groups})
print(df)
       data   weights  groups
0  0.206519  4.060716       1
1  0.526076  8.827921       1
2  0.605581  1.140197       2
3  0.974686  2.750091       2
4  0.102536  0.702613       2
5  0.238699  6.272802       2
6  0.821348  1.279084       3
7  0.470351  7.805090       3
8  0.191319  0.697717       4
9  0.922882  4.155508       4
# Define a weighted mean function to apply to each group
def my_fun(x, y):
    tmp = np.average(x, weights=y)
    return tmp

# Weighted mean of the whole population
total_mean = np.average(np.array(df["data"], dtype="float64"),
                        weights=np.array(df["weights"], dtype="float64"))

# Weighted mean of each group
group_means = df.groupby("groups").apply(lambda d: my_fun(d["data"], d["weights"]))

# Number of members in each group
counts = np.array([2, 4, 2, 2], dtype="float64")

# Total mean calculated from the group means, weighted by the counts of each group
total_mean_from_group_means = np.average(np.array(group_means, dtype="float64"),
                                         weights=counts)
print(total_mean)
0.5070955626929458
print(total_mean_from_group_means)
0.5344436242465216
As you can see, the total mean calculated from the group means is not equal to the total mean. What am I doing wrong here?
EDIT: Fixed a typo in the code.

You compute a weighted mean within each group, so when you compute the total mean from the weighted means, the correct weight for each group is the sum of the weights within the group (and not the size of the group).
In [47]: wsums = df.groupby("groups").apply(lambda d: d["weights"].sum())
In [48]: total_mean_from_group_means = np.average(group_means, weights=wsums)
In [49]: total_mean_from_group_means
Out[49]: 0.5070955626929458
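For reference, a minimal sketch (reusing the df from the question) that verifies this identity: weighting the group means by each group's total weight reproduces the overall weighted mean.
import numpy as np

# group-level weighted means and the total weight in each group
g = df.groupby("groups")
group_means = g.apply(lambda d: np.average(d["data"], weights=d["weights"]))
group_wsums = g["weights"].sum()

# sum_g(W_g * m_g) / sum_g(W_g) equals the overall weighted mean
total = (group_means * group_wsums).sum() / group_wsums.sum()
assert np.isclose(total, np.average(df["data"], weights=df["weights"]))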

Related

How to do an advanced multiplication with a pandas dataframe

I have a dataframe of 1802 rows and 29 columns (df in the code); each row is a person and each column is a number representing their answer to one of 29 different questions.
I have another dataframe of 29 different coefficients (seg_1 in the code).
Each column needs to be multiplied by the corresponding coefficient, and this needs to be repeated for each participant.
For example: 1802 iterations of q1 * coeff1, 1802 iterations of q2 * coeff2, etc.
So I should end up with 1802 * 29 = 52,258 values,
but the result doesn't have this length and the values aren't what I expect. I think the loop is multiplying q1-q29 by coeff1, then repeating this for coeff2, but that's not what I need.
questions = range(0, 28)
co = range(0, 28)
segment_1 = []
for a in questions:
    for b in co:
        answer = df.iloc[:, a] * seg_1[b]
        segment_1.append([answer])
Encoding the coefficients as a pandas frame makes this a one-liner
df_person['Answer'] = (df_person * df_coeffs.values).sum(1)
and avoids slow for-loops. In addition, you don't need to hard-code the number of rows (1802), so the code works unchanged even if your data grows larger.
For a minimal working example:
import pandas as pd

# answer frame
df_person = pd.DataFrame({'Question_1': [10, 20, 15],
                          'Question_2': [4, 4, 2],
                          'Question_3': [2, -2, 1]})
# coefficient list, repeated down the rows to build a coefficient frame
seg_1 = [2, 4, -1]
N = len(df_person)
df_coeffs = pd.DataFrame({'C_1': [seg_1[0]] * N,
                          'C_2': [seg_1[1]] * N,
                          'C_3': [seg_1[2]] * N})
# elementwise multiplication & row-wise summation
df_person['Answer'] = (df_person * df_coeffs.values).sum(1)
giving the coefficient table df_coeffs and the answer table df_person with the new Answer column.
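As a side note (not part of the original answer), if the coefficients are kept as a plain list or array of length 29, the same per-row weighted sum can be written as a dot product, without building df_coeffs at all. A small sketch using the example frame above:
import numpy as np
import pandas as pd

df_person = pd.DataFrame({'Question_1': [10, 20, 15],
                          'Question_2': [4, 4, 2],
                          'Question_3': [2, -2, 1]})
seg_1 = [2, 4, -1]   # one coefficient per question column, in column order

# row-wise dot product: sum_j(question_j * coeff_j) for every person
df_person['Answer'] = df_person.values @ np.asarray(seg_1)   # [34, 58, 37]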

In pandas, how to reindex (fill 0) level 2 of a MultiIndex

I have a pivot-table dataframe with a 2-level index: month and rating. The rating should be 1, 2, 3 (not to be confused with the columns 1, 2, 3). I found that for some months a rating can be missing; e.g., (Population, 2021-10) only has ratings 1 and 2. I need every month to have ratings 1, 2, 3, so I need to fill 0 for the missing rating index.
tbl = pd.pivot_table(self.df, values=['ID'], index=['month', 'risk'],
                     columns=["Factor"], aggfunc='count', fill_value=0)
tbl = tbl.droplevel(None, axis=1).rename_axis(None, axis=1).rename_axis(
    index={'month': None, 'Risk': 'Client Risk Rating'})
# show Low for rating 1, Moderate for rating 2, Potential High for rating 3
rating = {1: 'Low',
          2: 'Moderate',
          3: 'Potential High'}
pop = {'N': 'Refreshed Clients', 'Y': 'Population'}
tbl.rename(index={**rating, **pop}, inplace=True)
tbl = tbl.applymap(lambda x: x.replace(',', '')).astype(np.int64)
tbl = tbl.div(tbl.sum(axis=1), axis=0)
# client risk rating may be missing (e.g., only 1,2).
# To draw, need to fill the missing client risk rating with 0
print("before",tbl)
tbl=tbl.reindex(pd.MultiIndex.from_product(tbl.index.levels), fill_value=0)
print("after pd.MultiIndex.from_product",tbl)
I have used pd.MultiIndex.from_product, but it does not work when a rating is missing from the data entirely. For example, Population only has Moderate, and 2021-03 and 2021-04 have Low and Moderate. After pd.MultiIndex.from_product, Population gets Low and Moderate, but Potential High is still missing everywhere. What I want is for every month to have ratings 1, 2, 3; it seems the index level values are taken only from the data.
You can use pd.MultiIndex.from_product to create a full index:
>>> df
                              1         2         3
(Population)         1  0.436954  0.897747  0.387058
                     2  0.464940  0.611953  0.133941
2021-08(Refreshed)   1  0.496111  0.282798  0.048384
                     2  0.163582  0.213310  0.504647
                     3  0.008980  0.651175  0.400103
>>> df.reindex(pd.MultiIndex.from_product(df.index.levels), fill_value=0)
                              1         2         3
(Population)         1  0.436954  0.897747  0.387058
                     2  0.464940  0.611953  0.133941
                     3  0.000000  0.000000  0.000000  # New record
2021-08(Refreshed)   1  0.496111  0.282798  0.048384
                     2  0.163582  0.213310  0.504647
                     3  0.008980  0.651175  0.400103
Update
I wonder why df = df.reindex([1,2,3], level='rating', fill_value=0) doesn't work. Is it because the new index values [1, 2, 3] cannot fill the missing values of the existing rating index? Using from_product creates the product of the two index levels.
In fact it does something, just not what you expect: the method reindexes the level, not the values. Let me show you:
# It seems there is no effect, because you don't see 3 and 4 as expected?
>>> df.reindex([1, 2, 3, 4], level='ratings')
                                     0         1         2
                    ratings
(Population)        1         0.536154  0.671380  0.839362
                    2         0.729484  0.512379  0.440018
2021-08(Refreshed)  1         0.279990  0.295757  0.405536
                    2         0.864217  0.798092  0.144219
                    3         0.214566  0.407581  0.736905
# But yes, something happens
>>> df.reindex([1, 2, 3, 4], level='ratings').index.levels
FrozenList([['(Population)', '2021-08(Refreshed)'], [1, 2, 3, 4]])
                 The level has been reindexed ---^
# It's different from the values
>>> df.reindex([1, 2, 3, 4], level='ratings').index.get_level_values('ratings')
Int64Index([1, 2, 1, 2, 3], dtype='int64', name='ratings')
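If a rating is missing from the data entirely (the "Potential High missing everywhere" case above), df.index.levels will not contain it, so from_product cannot recreate it. A small sketch that instead builds the full index from an explicit list of ratings (assuming the rating level is the second level and should always contain 1, 2, 3):
import pandas as pd

months = df.index.get_level_values(0).unique()
full_index = pd.MultiIndex.from_product([months, [1, 2, 3]], names=df.index.names)
df_full = df.reindex(full_index, fill_value=0)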

Calculate statistics on subset of a dataframe based on values in dataframe (latitude and longitude)

I am looking to calculate summary statistics on subsets of a dataframe, where each subset is defined relative to values in a given row.
For example, I have a dataframe with latitude, longitude and number of people.
import pandas as pd

df = pd.DataFrame({'latitude': [40.991919, 40.992001, 40.991602, 40.989903, 40.987759],
                   'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
                   'people': [1, 2, 3, 4, 5]})
I want to know the total number of people within .05 miles of each row. This is easy to do with a loop, but as the data grows it becomes unusably slow.
Current/sample code:
from geopy.distance import distance

def distance_calc(row, focus_lat, focus_long):
    start = (row['latitude'], row['longitude'])
    stop = (focus_lat, focus_long)
    return distance(start, stop).miles

df['total_people_within_05'] = 0
df['total_rows_within_05'] = 0
for index, row in df.iterrows():
    focus_lat = df['latitude'][index]
    focus_long = df['longitude'][index]
    new_df = df.copy()
    new_df['distance'] = new_df.apply(lambda row: distance_calc(row, focus_lat, focus_long), axis=1)
    df.at[index, 'total_people_within_05'] = new_df.loc[new_df.distance <= .05]['people'].sum()
    df.at[index, 'total_rows_within_05'] = new_df.loc[new_df.distance <= .05].shape[0]
Is there any pythonic way to do this?
Cross join the dataframe to itself to get all combinations. This is expensive on larger datasets: it generates N^2 rows, 25 in this case.
Calculate the distance for each of these combinations.
Filter with query() to the distances required.
groupby() to get the total number of people; also generate a list of the indexes included in each total, for transparency.
Finally join() this back onto the original frame and you have what you want.
import pandas as pd
import geopy.distance as gd

df = pd.DataFrame({'latitude': [40.991919, 40.992001, 40.991602, 40.989903, 40.987759],
                   'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
                   'people': [1, 2, 3, 4, 5]})

df = df.join((df.reset_index().assign(foo=1).merge(df.reset_index().assign(foo=1), on="foo")
              .assign(distance=lambda dfa: dfa.apply(lambda r: gd.distance((r.latitude_x, r.longitude_x),
                                                                           (r.latitude_y, r.longitude_y)).miles, axis=1))
              .query("distance<=0.05")
              .rename(columns={"people_y": "nearby"})
              .groupby("index_x").agg({"nearby": "sum", "index_y": lambda x: list(x)})
             ))
print(df.to_markdown())
|    |   latitude |   longitude |   people |   nearby | index_y   |
|---:|-----------:|------------:|---------:|---------:|:----------|
|  0 |    40.9919 |    -106.049 |        1 |        6 | [0, 1, 2] |
|  1 |    40.992  |    -106.049 |        2 |        6 | [0, 1, 2] |
|  2 |    40.9916 |    -106.049 |        3 |        6 | [0, 1, 2] |
|  3 |    40.9899 |    -106.05  |        4 |        4 | [3]       |
|  4 |    40.9878 |    -106.049 |        5 |        5 | [4]       |
Update - use combinations instead of Cartesian product
It's been bugging me that the Cartesian product is a huge overhead, when all that is required is to calculate distances between the valid combinations:
Use itertools.combinations() to build the list of valid index pairs.
Calculate distances for this minimal set.
Filter down to only the distances we're interested in.
Build permutations of this smaller set to allow a simple join back to the actual data.
Join and aggregate.
import itertools

# get distances between all valid combinations
dfd = (pd.DataFrame(list(itertools.combinations(df.index, 2)))
       .merge(df, left_on=0, right_index=True)
       .merge(df, left_on=1, right_index=True, suffixes=("_0", "_1"))
       .assign(distance=lambda dfa: dfa.apply(lambda r: gd.distance((r.latitude_0, r.longitude_0),
                                                                    (r.latitude_1, r.longitude_1)).miles, axis=1))
       .loc[:, [0, 1, "distance"]]
       # filter down to close proximities
       .query("distance <= 0.05")
      )

# build all valid permutations of close-by combinations
dfnppl = (pd.DataFrame(itertools.permutations(pd.concat([dfd[0], dfd[1]]).unique(), 2))
          .merge(df.loc[:, "people"], left_on=1, right_index=True)
         )

# bring it all together
df = (df.reset_index().rename(columns={"index": 0}).merge(dfnppl, on=0, suffixes=("", "_near"), how="left")
      .groupby(0).agg({**{c: "first" for c in df.columns}, **{"people_near": "sum"}})
     )
|   0 |   latitude |   longitude |   people |   people_near |
|----:|-----------:|------------:|---------:|--------------:|
|   0 |    40.9919 |    -106.049 |        1 |             5 |
|   1 |    40.992  |    -106.049 |        2 |             4 |
|   2 |    40.9916 |    -106.049 |        3 |             3 |
|   3 |    40.9899 |    -106.05  |        4 |             0 |
|   4 |    40.9878 |    -106.049 |        5 |             0 |
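As a further option (not from the original answers), for larger tables the per-pair geopy calls themselves become the bottleneck. The whole distance matrix can be computed in a vectorized way with the haversine formula, at the cost of a small approximation relative to geopy's geodesic distance and still O(N^2) memory. A sketch under those assumptions, reproducing the question's two output columns (each row counts itself, as in the original loop):
import numpy as np
import pandas as pd

df = pd.DataFrame({'latitude': [40.991919, 40.992001, 40.991602, 40.989903, 40.987759],
                   'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
                   'people': [1, 2, 3, 4, 5]})

lat = np.radians(df['latitude'].to_numpy())
lon = np.radians(df['longitude'].to_numpy())

# pairwise haversine distance in miles (mean Earth radius ~3958.8 mi)
dlat = lat[:, None] - lat[None, :]
dlon = lon[:, None] - lon[None, :]
a = np.sin(dlat / 2) ** 2 + np.cos(lat[:, None]) * np.cos(lat[None, :]) * np.sin(dlon / 2) ** 2
dist = 2 * 3958.8 * np.arcsin(np.sqrt(a))

within = dist <= 0.05                                        # N x N mask, includes each row itself
df['total_people_within_05'] = (within * df['people'].to_numpy()).sum(axis=1)
df['total_rows_within_05'] = within.sum(axis=1)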

Rolling means in Pandas dataframe

I am trying to run some computations on DataFrames. I want to compute the average difference between two sets of rolling means: specifically, the average of the difference between a long-term rolling mean (windows in lst) and a shorter-term one (windows in lst_2). I am trying to combine the calculation with a double for loop as follows:
import pandas as pd
import numpy as np

def main(df):
    df = df.pct_change()
    lst = [100, 150, 200, 250, 300]
    lst_2 = [5, 10, 15, 20]
    result = pd.DataFrame(np.sum([calc(df, T, t) for T in lst for t in lst_2])) / (len(lst) + len(lst_2))
    return result

def calc(df, T, t):
    roll = pd.DataFrame(np.sign(df.rolling(t).mean() - df.rolling(T).mean()))
    return roll
Overall I should have 20 differences (5 and 100, 10 and 100, 15 and 100, ..., 20 and 300); I take the sign of each difference and I want the average of these signs at each point in time. Ideally the result would be a single dataframe.
I get the error cannot copy sequence with size 3951 to array axis with dimension 1056 when the double for loop runs. I understand that, because of the different rolling windows T and t, the dataframes do not line up when they are converted to an array (with np.sum), but I thought NaN would be used to align the dimensions.
Hope I have been clear enough. Thank you.
As requested in the comments, here is an example. Let's suppose the following
dataframe:
df = pd.DataFrame({'A': [100, 101.4636, 104.9477, 106.7089, 109.2701, 111.522,
                         113.3832, 113.8672, 115.0718, 114.6945, 111.7446, 108.8154]},
                  index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
df = df.pct_change()
and I have the following 2 sets of mean I need to compute:
lst=[8,10]
lst_1=[3,4]
Then I follow these steps:
1/
I want to compute the rolling mean(3) - rolling mean(8), and get the sign of it:
roll=np.sign(df.rolling(3).mean()-df.rolling(8).mean())
This should return the following:
roll = pd.DataFrame({'A': [np.nan]*8 + [-1, -1, -1, -1]}, index=list(range(12)))
2/
I redo step 1 with the remaining window combinations: 3-10, 4-8 and 4-10. So overall I get 4 roll dataframes.
roll_3_8  = pd.DataFrame({'A': [np.nan]*8  + [-1, -1, -1, -1]}, index=list(range(12)))
roll_3_10 = pd.DataFrame({'A': [np.nan]*10 + [-1, -1]},         index=list(range(12)))
roll_4_8  = pd.DataFrame({'A': [np.nan]*8  + [-1, -1, -1, -1]}, index=list(range(12)))
roll_4_10 = pd.DataFrame({'A': [np.nan]*10 + [-1, -1]},         index=list(range(12)))
3/
Now that I have all the diffs, I simply want their average, so I sum the 4 roll dataframes and divide by 4 (the number of differences computed). The result should be (before dropping any NaN values):
result = pd.DataFrame({'A': [np.nan]*10 + [-1, -1]}, index=list(range(12)))
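A minimal sketch of this averaging on the example data, staying in pandas so the frames remain index-aligned (this avoids pushing differently filled frames through np.sum; pick the final line depending on whether NaN positions should propagate, as in the expected result above, or be skipped):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [100, 101.4636, 104.9477, 106.7089, 109.2701, 111.522,
                         113.3832, 113.8672, 115.0718, 114.6945, 111.7446, 108.8154]})
returns = df.pct_change()

lst, lst_1 = [8, 10], [3, 4]

# one sign-of-difference frame per (short, long) window pair; all share the same index
rolls = [np.sign(returns.rolling(t).mean() - returns.rolling(T).mean())
         for T in lst for t in lst_1]

result = sum(rolls) / len(rolls)                  # NaN wherever any component is NaN
# result = pd.concat(rolls, axis=1).mean(axis=1)  # alternative: ignore NaN components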

numpy, sums of subsets with no iterations [duplicate]

I have a massive data array (500k rows) that looks like:
id value score
1 20 20
1 10 30
1 15 0
2 12 4
2 3 8
2 56 9
3 6 18
...
As you can see, there is a non-unique ID column to the left, and various scores in the 3rd column.
I'm looking to quickly add up all of the scores, grouped by IDs. In SQL this would look like SELECT sum(score) FROM table GROUP BY id
With NumPy I've tried iterating through each ID, truncating the table by each ID, and then summing the score up for that table.
table_trunc = table[(table == id).any(1)]
score = sum(table_trunc[:,2])
Unfortunately I'm finding the first command to be dog-slow. Is there any more efficient way to do this?
You can use np.bincount():
import numpy as np
ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]
print(np.bincount(ids, weights=data))
The output is [ 0. 50. 21. 18.], which means the sum for id == 0 is 0, the sum for id == 1 is 50, and so on.
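Applied to the question's three-column layout, this would look roughly like the following (a sketch, assuming table is an integer array with columns id, value, score):
import numpy as np

table = np.array([[1, 20, 20], [1, 10, 30], [1, 15, 0],
                  [2, 12, 4],  [2, 3, 8],   [2, 56, 9],
                  [3, 6, 18]])

ids, scores = table[:, 0], table[:, 2]
sums = np.bincount(ids, weights=scores)[np.unique(ids)]   # array([50., 21., 18.])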
I noticed the numpy tag, but in case you don't mind using pandas (or if you read in these data using this module), this task becomes a one-liner:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,2,2,3], 'score': [20,30,0,4,8,9,18]})
So your dataframe would look like this:
   id  score
0   1     20
1   1     30
2   1      0
3   2      4
4   2      8
5   2      9
6   3     18
Now you can use the functions groupby() and sum():
df.groupby(['id'], sort=False).sum()
which gives you the desired output:
    score
id
1      50
2      21
3      18
By default the result is sorted by the group key, so I pass the flag sort=False, which can improve speed for huge dataframes.
You can try using boolean operations:
import numpy as np
ids = np.array([1, 1, 1, 2, 2, 2, 3])
data = np.array([20, 30, 0, 4, 8, 9, 18])
[((ids == i) * data).sum() for i in np.unique(ids)]
This may be a bit more effective than using np.any, but it will clearly have trouble if you have a very large number of unique ids along with a large overall data table.
If you're looking only for sums you probably want to go with bincount. If you also need other grouping operations like product, mean, std etc., have a look at https://github.com/ml31415/numpy-groupies . It provides the fastest python/numpy grouping operations around; see the speed comparison there.
Your sum operation there would look like:
res = aggregate(id, score)
The numpy_indexed package has vectorized functionality to perform this operation efficiently, in addition to many related operations of this kind:
import numpy_indexed as npi
npi.group_by(id).sum(score)
You can use a for loop and numba:
from numba import njit

@njit
def wbcnt(b, w, k):
    bins = np.arange(k)
    bins = bins * 0
    for i in range(len(b)):
        bins[b[i]] += w[i]
    return bins
Using @HYRY's variables:
ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]
Then:
wbcnt(ids, data, 4)
array([ 0, 50, 21, 18])
Timing
%timeit wbcnt(ids, data, 4)
%timeit np.bincount(ids, weights=data)
1000000 loops, best of 3: 1.99 µs per loop
100000 loops, best of 3: 2.57 µs per loop
Maybe use itertools.groupby: you can group on the ID and then iterate over the grouped data.
(The data must be sorted by the group-by key, in this case the ID.)
>>> import itertools
>>> data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
>>> groups = itertools.groupby(data, lambda x: x[0])
>>> for i in groups:
...     for y in i:
...         if isinstance(y, int):
...             print(y)
...         else:
...             for p in y:
...                 print('-', p)
Output:
1
- (1, 20, 20)
- (1, 10, 30)
- (1, 15, 0)
2
- (2, 12, 4)
- (2, 3, 0)
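To actually get the per-ID sums with this approach, rather than just printing the groups, a small sketch on the same (sorted) data:
import itertools

data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
sums = {key: sum(row[2] for row in rows)
        for key, rows in itertools.groupby(data, key=lambda x: x[0])}
# {1: 50, 2: 4}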