Calculate a value inside a group - pandas

Suppose that I have a Pandas DataFrame name is df:
Origin Dest T R
0 N N 100 3
1 N A 2 6
2 A B 356 7
3 A B 789 8
4 B N 999 9
5 B A 345 2
6 N A 456 3
I want to produce a DataFrame that for each group by Origin do the following calculation:
Sum values in column 'T' then divide with sum of the values in 'R' for each groups. I want to see this result in a origin_dest matrix form.
I am trying to following, but does not work.
Matrix_Origin =df.pivot_table(values=['T','R'], index='Origin', columns ='Dest', fill_value=0, aggfunc=[lambda x: df['T'].sum()/df['R'].sum() ])
This is what I want to produce:
Origin N A B
N 33.33 50.88 0
A 0 0 76.33
B 111 172.5 0
Any help will be appreciated.

A combination of groupby, with unstack can yield your desired outcome :
res = df.groupby(["Origin", "Dest"]).sum().unstack()
#divide column T with column R
outcome = (
res["T"]
.div(res["R"])
.reindex(index=["N", "A", "B"], columns=["N", "A", "B"])
.fillna(0)
#optional
.round(2)
)
outcome
Dest N A B
Origin
N 33.33 50.89 0.00
A 0.00 0.00 76.33
B 111.00 172.50 0.00

Related

backwards average only when value in column changes

I tried to calculate the average for the last x rows in a DataFrame only when the value is changing
A and B are my inputs and C is my desired output
a = 0
def iloc_backwards (df, col):
for i in df.index:
val1 = df[col].iloc[i]
val2 = df[col].iloc[i+1]
if val1 == val2 :
a+
else: df.at[i,col] = df.rolling(window=a).mean()
A B C
1 0 0.25
2 0 0.25
3 0 0.25
4 1 0.25
5 0 0.5
6 1 0.5
If you should take the average of all values up to the first encounter of a value that is non-zero, try this code:
df['group'] = df['B'].shift().ne(0).cumsum()
df['C'] = df.groupby('group').B.transform('mean')
df[['A', 'B', 'C']]
This corresponds with your desired output.

Calculate % change in flat tables

From df1 I would like to calculate the percentage change from, which should give df2. Would you please assist me?
df1
lst=[['01012021','A',10],['01012021','B',20],['01012021','A',12],['01012021','B',23]]
df1=pd.DataFrame(lst,columns=['Date','FN','AuM'])
df2
lst=[['01012021','A',10,''],['01012021','B',20,''],['01012021','A',12,0.2],['01012021','B',23,0.15]]
df2=pd.DataFrame(lst,columns=['Date','FN','AuM','%_delta'])
Thank you
Use groupby and pct_change:
df1['%_delta'] = df1.groupby('FN')['AuM'].pct_change()
print(df1)
# Output:
Date FN AuM %_delta
0 01012021 A 10 NaN
1 01012021 B 20 NaN
2 01012021 A 12 0.20
3 01012021 B 23 0.15

Create a new pandas DataFrame Column with a groupby

I have a dataframe and I'd like to group by a column value and then do a calculation to create a new column. Below is the set up data:
import pandas as pd
df = pd.DataFrame({
'Red' : [1,2,3,4,5,6,7,8,9,10],
'Groups':['A','B','A','A','B','C','B','C','B','C'],
'Blue':[10,20,30,40,50,60,70,80,90,100]
})
df.groupby('Groups').apply(print)
What I want to do is create a 'TOTAL' column in the original dataframe. If it is the first record of the group 'TOTAL' gets a zero otherwise TOTAL will get the ['Blue'] at index subtracted by ['Red'] at index-1.
I tried to do this in a function below but it does not work.
def funct(group):
count = 0
lst = []
for info in group:
if count == 0:
lst.append(0)
count += 1
else:
num = group.iloc[count]['Blue'] - group.iloc[count-1]['Red']
lst.append(num)
count += 1
group['Total'] = lst
return group
df = df.join(df.groupby('Groups').apply(funct))
The code works for the first group but then errors out.
The desired outcome is:
df_final = pd.DataFrame({
'Red' : [1,2,3,4,5,6,7,8,9,10],
'Groups':['A','B','A','A','B','C','B','C','B','C'],
'Blue':[10,20,30,40,50,60,70,80,90,100],
'Total':[0,0,29,37,48,0,65,74,83,92]
})
df_final
df_final.groupby('Groups').apply(print)
Thank you for the help!
For each group, calculate the difference between Blue and shifted Red (Red at previous index):
df['Total'] = (df.groupby('Groups')
.apply(lambda g: g.Blue - g.Red.shift().fillna(g.Blue))
.reset_index(level=0, drop=True))
df
Red Groups Blue Total
0 1 A 10 0.0
1 2 B 20 0.0
2 3 A 30 29.0
3 4 A 40 37.0
4 5 B 50 48.0
5 6 C 60 0.0
6 7 B 70 65.0
7 8 C 80 74.0
8 9 B 90 83.0
9 10 C 100 92.0
Or as #anky has commented, you can avoid apply by shifting Red column first:
df['Total'] = (df.Blue - df.Red.groupby(df.Groups).shift()).fillna(0, downcast='infer')
df
Red Groups Blue Total
0 1 A 10 0
1 2 B 20 0
2 3 A 30 29
3 4 A 40 37
4 5 B 50 48
5 6 C 60 0
6 7 B 70 65
7 8 C 80 74
8 9 B 90 83
9 10 C 100 92

Find multiple strings in a given column

I'm not sure whether it is possible to do easily.
I have 2 dataframes. In the first one (df1) there is a column with texts ('Texts') and in the second one there are 2 columns, one with some sort texts ('subString') and the second with a score ('Score').
What I want is to sum up all the scores associated to the subString field in the second dataframe when these subString are a substring of the text column in the first dataframe.
For example, if I have a dataframe like this:
df1 = pd.DataFrame({
'ID':[1,2,3,4,5,6],
'Texts':['this is a string',
'here we have another string',
'this one is completly different',
'one more',
'this is one more',
'and the last one'],
'c':['C','C','C','C','C','C'],
'd':['D','D','D','D','NaN','NaN']
}, columns = ['ID','Texts','c','d'])
df1
Out[2]:
ID Texts c d
0 1 this is a string C D
1 2 here we have another string C D
2 3 this one is completly different C D
3 4 one more C D
4 5 this is one more C NaN
5 6 and the last one C NaN
And another dataframe like this:
df2 = pd.DataFrame({
'SubString':['This', 'one', 'this is', 'is one'],
'Score':[0.5, 0.2, 0.75, -0.5]
}, columns = ['SubString','Score'])
df2
Out[3]:
SubString Score
0 This 0.50
1 one 0.20
2 this is 0.75
3 is one -0.50
I want to get something like this:
df1['Score'] = 0.0
for index1, row1 in df1.iterrows():
score = 0
for index2, row2 in df2.iterrows():
if row2['SubString'] in row1['Texts']:
score += row2['Score']
df1.set_value(index1, 'Score', score)
df1
Out[4]:
ID Texts c d Score
0 1 this is a string C D 0.75
1 2 here we have another string C D 0.00
2 3 this one is completly different C D -0.30
3 4 one more C D 0.20
4 5 this is one more C NaN 0.45
5 6 and the last one C NaN 0.20
Is there a less garbled and faster way to do it?
Thanks!
Option 1
In [691]: np.array([np.where(df1.Texts.str.contains(x.SubString), x.Score, 0)
for _, x in df2.iterrows()]
).sum(axis=0)
Out[691]: array([ 0.75, 0. , -0.3 , 0.2 , 0.45, 0.2 ])
Option 2
In [674]: df1.Texts.apply(lambda x: df2.Score[df2.SubString.apply(lambda y: y in x)].sum())
Out[674]:
0 0.75
1 0.00
2 -0.30
3 0.20
4 0.45
5 0.20
Name: Texts, dtype: float64
Note: apply doesn't get rid of loops, it just hides them.

issue merging two dataframes with pandas by summing element by element [duplicate]

I'm trying to merge two DataFrames summing columns value.
>>> print(df1)
id name weight
0 1 A 0
1 2 B 10
2 3 C 10
>>> print(df2)
id name weight
0 2 B 15
1 3 C 10
I need to sum weight values during merging for similar values in the common column.
merge = pd.merge(df1, df2, how='inner')
So the output will be something like following.
id name weight
1 2 B 25
2 3 C 20
This solution works also if you want to sum more than one column. Assume data frames
>>> df1
id name weight height
0 1 A 0 5
1 2 B 10 10
2 3 C 10 15
>>> df2
id name weight height
0 2 B 25 20
1 3 C 20 30
You can concatenate them and group by index columns.
>>> pd.concat([df1, df2]).groupby(['id', 'name']).sum().reset_index()
id name weight height
0 1 A 0 5
1 2 B 35 30
2 3 C 30 45
In [41]: pd.merge(df1, df2, on=['id', 'name']).set_index(['id', 'name']).sum(axis=1)
Out[41]:
id name
2 B 25
3 C 20
dtype: int64
If you set the common columns as the index, you can just sum the two dataframes, much simpler than merging:
In [30]: df1 = df1.set_index(['id', 'name'])
In [31]: df2 = df2.set_index(['id', 'name'])
In [32]: df1 + df2
Out[32]:
weight
id name
1 A NaN
2 B 25
3 C 20