how to do a proper pandas group by, where, sum - pandas

I am having a hard time using a group by + where to apply a sum to a broader range.
Given this code:
from io import StringIO
import numpy as np
import pandas as pd
f = pd.read_csv(StringIO("""
fund_id,l_s,val
fund1,L,10
fund1,L,20
fund1,S,30
fund2,L,15
fund2,L,25
fund2,L,35
"""))
# fund total - works as expected
f['fund_total'] = f.groupby('fund_id')['val'].transform(np.sum)
# fund L total - applied only to L rows.
f['fund_total_l'] = f[f['l_s'] == "L"].groupby('fund_id')['val'].transform(np.sum)
f
This code gets me close:
The numbers are correct, but I would like the fund_total_l column to show 30 for all rows of fund1 (not just the L rows). I want a fund-level summary, but with the sum filtered by the l_s column.
I know I can do this with multiple steps, but this needs to be a single operation. I can use a separate generic function if that helps.
playground: https://repl.it/repls/UnusualImpeccableDaemons

Use Series.where to create NaN values; these will be ignored in your sum:
f['val_temp'] = f['val'].where(f['l_s'] == "L")
f['fund_total_l'] = f.groupby('fund_id')['val_temp'].transform('sum')
f = f.drop(columns='val_temp')
Or in one line using assign:
f['fund_total_l'] = (
f.assign(val_temp=f['val'].where(f['l_s'] == "L"))
.groupby('fund_id')['val_temp'].transform('sum')
)
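A variant of the same idea (just a sketch, using the same f as above) skips the temporary column entirely by grouping the masked Series by the fund_id column directly:
f['fund_total_l'] = (
    f['val'].where(f['l_s'] == "L")   # NaN for non-L rows, ignored by the sum
     .groupby(f['fund_id'])
     .transform('sum')
)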
Another way would be to partly use your solution, but then use DataFrame.reindex to get the original index back, and then use ffill and bfill to fill in the NaN values:
f['fund_total_l'] = (
f[f['l_s'] == "L"]
.groupby('fund_id')['val']
.transform('sum')
.reindex(f.index)
.ffill()
.bfill()
)
fund_id l_s val fund_total_l
0 fund1 L 10 30.0
1 fund1 L 20 30.0
2 fund1 S 30 30.0
3 fund2 L 15 75.0
4 fund2 L 25 75.0
5 fund2 L 35 75.0

I think there is a more elegant solution, but I'm not able to broadcast the results back to the individual rows.
Essentially, with a boolean mask of all the "L" rows
f.groupby("fund_id").apply(lambda g:sum(g["val"]*(g["l_s"]=="L")))
you obtain
fund_id
fund1 30
fund2 75
dtype: int64
Now we can just merge after using reset_index:
pd.merge(f, f.groupby("fund_id").apply(lambda g:sum(g["val"]*(g["l_s"]=="L"))).reset_index(), on="fund_id")
to obtain
fund_id l_s val 0
0 fund1 L 10 30
1 fund1 L 20 30
2 fund1 S 30 30
3 fund2 L 15 75
4 fund2 L 25 75
5 fund2 L 35 75
However, I'd guess that the merging is not necessary and the result can be obtained directly in the apply.
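For example, one way to skip the merge (a sketch, assuming the same frame f as above) is to map the per-fund sums back onto the fund_id column:
l_sums = f.groupby("fund_id").apply(lambda g: (g["val"] * (g["l_s"] == "L")).sum())
f["fund_total_l"] = f["fund_id"].map(l_sums)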

Related

Pandas groupby custom nlargest

When trying to solve my own question here I came up with an interesting problem. Suppose I have this dataframe:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(dict(group=np.random.choice(["a", "b", "c", "d"], size=100),
                       values=np.random.randint(0, 100, size=100)))
I want to select the top values for each group, but within a range. Let's say, the top x to y values per group. If any group has fewer than x values in it, give the top(min(y - x, x)) values for that group.
In general, I am looking for a custom alternative function which could be used with groupby objects to select not the top n values, but the top x to y range of values.
EDIT: nlargest() is a special case of the solution to my problem, where x = 1 and y = n.
Any further help or guidance will be appreciated.
Adding an example with this df and top(3, 6). For every group, output the values from the top 3rd until the top 6th value:
group value
a 190
b 166
a 163
a 106
b 86
a 77
b 70
b 69
c 67
b 54
b 52
a 50
c 24
a 20
a 11
As group c has just two members, it will output top(3)
group value
a 106
a 77
a 50
b 69
b 54
b 52
c 67
c 24
There are other ways of doing this, and depending on how large your dataframe is, you may want to search for "groupby slice" or something similar. You may also need to check that my conditions are correct (<, <=, etc.).
x=3
y=6
# this gets the groups which don't meet the x minimum
df1 = df[df.groupby('group')['value'].transform('count')<x]
# this df takes those groups meeting the minimum and then shifts by x-1; does some cleanup and chooses nlargest
df2 = df[df.groupby('group')['value'].transform('count')>=x].copy()
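# note: this shift trick assumes the rows are already sorted by value in descending order within each group (as in the example above)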
df2['shifted'] = df2.groupby('group').shift(periods=-(x-1))
df2.drop('value', axis=1, inplace=True)
df2 = df2.groupby('group')['shifted'].nlargest(y-x).reset_index().rename(columns={'shifted':'value'}).drop('level_1', axis=1)
# putting it all together
df_final = pd.concat([df1, df2])
df_final
group value
8 c 67.0
12 c 24.0
0 a 106.0
1 a 77.0
2 a 50.0
3 b 70.0
4 b 69.0
5 b 54.0
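A possibly simpler alternative (only a sketch, following the answer above in using a 'value' column, and taking the x-th up to but not including the y-th largest so that it matches the output shown) is to sort once and slice each group by rank:
x, y = 3, 6

def top_range(g, x=x, y=y):
    # take the x-th up to (but not including) the y-th largest values;
    # if the group has fewer than x rows, fall back to its top min(y - x, x) values
    ranked = g.sort_values('value', ascending=False)
    if len(ranked) < x:
        return ranked.head(min(y - x, x))
    return ranked.iloc[x - 1:y - 1]

df_final = df.groupby('group', group_keys=False).apply(top_range)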

pandas – "multiplication table" for each row with custom function

I have a DataFrame with city coordinates, like this (example):
x y
A 10 20
B 20 30
C 15 60
I want to calculate their distance, sqrt(x^2 + y^2), from each other, as a sort of multiplication table (example):
A B C
A 0 20 30
B 20 0 25
C 30 25 0
How can I do this? I've tried using the apply function but need some guidance.
You can make use of the broadcasting feature in pandas, together with .apply():
df['distance'] = (df['x'] ** 2 + df['y'] ** 2).apply(np.sqrt)
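That line gives each row's distance from the origin. If the goal is the full pairwise table from the question, a sketch using NumPy broadcasting (assuming the same df with columns x and y, and numpy imported as np) could look like this:
coords = df[['x', 'y']].to_numpy()
diff = coords[:, None, :] - coords[None, :, :]   # shape (n, n, 2)
dist = pd.DataFrame(np.sqrt((diff ** 2).sum(axis=-1)),
                    index=df.index, columns=df.index)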
The easiest way is to use distance_matrix of scipy:
from scipy.spatial import distance_matrix
df = pd.DataFrame({'x':[10,20,30], 'y': [20,30,60]},index=list('ABC'))
pd.DataFrame(distance_matrix(df,df), index=df.index, columns=df.index)
Output:
A B C
A 0.000000 14.142136 40.311289
B 14.142136 0.000000 30.413813
C 40.311289 30.413813 0.000000

how to calculate percentage changes across 2 columns in a dataframe using pct_change in Python

I have a dataframe and want to use the pct_change method to calculate the % change between only two selected columns, B and C, and put the output into a new column. The code below doesn't seem to work. Can anyone help me?
df2 = pd.DataFrame(np.random.randint(0,50,size=(100, 4)), columns=list('ABCD'))
df2['new'] = df2.pct_change(axis=1)['B']['C']
Try:
df2['new'] = df2[['B','C']].pct_change(axis=1)['C']
pct_change returns the percentage change across all the columns; you can select the required column and assign it to a new column:
df2['new'] = df2.pct_change(axis=1)['C']
A B C D new
0 29 4 29 5 6.250000
1 14 35 2 40 -0.942857
2 5 18 31 10 0.722222
3 17 10 42 41 3.200000
4 24 48 47 35 -0.020833
IIUC, you can just do the following:
df2['new'] = (df2['C']-df2['B'])/df2['B']
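As a quick check (a sketch, assuming the same df2 and numpy imported as np), the two approaches should agree:
manual = (df2['C'] - df2['B']) / df2['B']
via_pct = df2[['B', 'C']].pct_change(axis=1)['C']
print(np.allclose(manual, via_pct, equal_nan=True))  # expect True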

Pandas: extract top three IDs based on values in a different column from a previous day

I have a data frame like this:
date_test = pd.DataFrame({
'id': ['q','w','e','r','t','y',
'a','s','d','f','g',
'z','x',
'b','n','m','k'],
'metric': [123,122,45,31,5,2,
634,372,312,229,110,
434,334,
256,156,44,23],
'date':['2019-11-01','2019-11-01','2019-11-01','2019-11-01','2019-11-01', '2019-11-01',
'2019-11-02','2019-11-02','2019-11-02','2019-11-02','2019-11-02',
'2019-11-04','2019-11-04',
'2019-11-05','2019-11-05','2019-11-05','2019-11-05']
})
It was sorted by date and metric. The tricky part is that I have gaps in the dates, so I cannot compute the previous day from the datestamp alone.
For each date I need to grab the top-3 ids from the previous available date. If there are fewer ids on the previous day, I should use top_1 instead.
The first date should be filled with NaNs, as there is no previous period to look at.
The result should look like this:
id metric date top_1 top_2 top_3
0 q 123 2019-11-01 None None None
1 w 122 2019-11-01 None None None
2 e 45 2019-11-01 None None None
3 r 31 2019-11-01 None None None
4 t 5 2019-11-01 None None None
5 y 2 2019-11-01 None None None
6 a 634 2019-11-02 q w e
7 s 372 2019-11-02 q w e
8 d 312 2019-11-02 q w e
9 f 229 2019-11-02 q w e
10 g 110 2019-11-02 q w e
11 z 434 2019-11-04 a s d
12 x 334 2019-11-04 a s d
13 b 256 2019-11-05 z x z
14 n 156 2019-11-05 z x z
15 m 44 2019-11-05 z x z
16 k 23 2019-11-05 z x z
I will greatly appreciate your help!
I have to make some assumptions here. It's not clear what you'd want to do if there was a tie. I also would make a separate dataframe to store the results.
# Date should be a datetime
date_test['date'] = pd.to_datetime(date_test['date'])
# Initialize a place to store results
min_date = date_test['date'].min()
max_date = date_test['date'].max()
solution = pd.DataFrame(index=pd.date_range(start=min_date, end=max_date, freq='d'))
# Iterate for results
for i in solution.index:
    mask = date_test['date'] == i
    vals = date_test[mask].sort_values('metric', ascending=False)['id'].values[:3]
    # Store results if found
    for j in range(min([3, vals.shape[0]])):
        solution.loc[i, 'top_%i' % (j + 1)] = vals[j]
If you need to offset to a previous date, you can. It is also not hard to modify this to include the metric in the solution DF.
I'm adding some information based on a comment.
If you want to fill values, you can forward-fill. The code below will populate the NA values with the last available date's values:
solution.ffill(inplace=True)
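Building on that (a hedged sketch, assuming the solution frame built above and date_test['date'] already converted to datetime): shift the daily index by one row so each date sees the previous date's leaders, forward-fill across the gap days, and join back onto the original frame. Note this does not implement the question's rule of repeating top_1 when the previous day has fewer than 3 ids.
prev = solution.shift(1).ffill()          # previous calendar day's top ids, carried across gaps
result = date_test.join(prev, on='date')  # attach top_1..top_3 to every original row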

How to subtract one dataframe from another?

First, let me set the stage.
I start with a pandas dataframe klmn, that looks like this:
In [15]: klmn
Out[15]:
K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97
Next I split klmn into two dataframes, klmn0 and klmn1, according to the value in the 'K' column:
In [16]: k0 = klmn.groupby(klmn['K'] == 0)
In [17]: klmn0, klmn1 = [klmn.loc[k0.indices[tf]] for tf in (True, False)]
In [18]: klmn0, klmn1
Out[18]:
( K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84,
K L M N
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97)
Finally, I compute the mean of the M column in klmn0, grouped by the value in the L column:
In [19]: m0 = klmn0.groupby('L')['M'].mean(); m0
Out[19]:
L
a -0.307671
b 0.451144
Name: M
Now, my question is, how can I subtract m0 from the M column of the klmn1 sub-dataframe, respecting the value in the L column? (By this I mean that m0['a'] gets subtracted from the M column of each row in klmn1 that has 'a' in the L column, and likewise for m0['b'].)
One could imagine doing this in a way that replaces the values in the M column of klmn1 with the new values (after subtracting the value from m0). Alternatively, one could imagine doing this in a way that leaves klmn1 unchanged and instead produces a new dataframe klmn11 with an updated M column. I'm interested in both approaches.
If you set the index of your klmn1 dataframe to the L column, then pandas will automatically align the indices with any series you subtract from it:
In [1]: klmn1.set_index('L')['M'] - m0
Out[1]:
L
a 0.777595
a -0.671791
b 0.779920
b -0.128690
Name: M
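For the second approach the question asks about (a sketch, leaving klmn1 unchanged and producing a new frame, called klmn11 as in the question), you could map m0 onto the L column and subtract, which keeps the original index:
klmn11 = klmn1.copy()
klmn11['M'] = klmn1['M'] - klmn1['L'].map(m0)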
Option #1:
df1.subtract(df2, fill_value=0)
Option #2:
df1.subtract(df2, fill_value=None)
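For context on those two options (a sketch with toy frames df1 and df2, which this answer does not define): fill_value controls what is used for labels that appear in only one of the frames before subtracting.
df1 = pd.DataFrame({'M': [1.0, 2.0]}, index=['a', 'b'])
df2 = pd.DataFrame({'M': [10.0]}, index=['a'])
print(df1.subtract(df2, fill_value=0))     # label 'b': 2.0 - 0 = 2.0
print(df1.subtract(df2, fill_value=None))  # label 'b': NaN (the default)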