Calculate % change in flat tables - pandas

From df1 I would like to calculate the percentage change of AuM within each FN group, which should give df2. Would you please assist me?
df1:
import pandas as pd

lst = [['01012021', 'A', 10], ['01012021', 'B', 20], ['01012021', 'A', 12], ['01012021', 'B', 23]]
df1 = pd.DataFrame(lst, columns=['Date', 'FN', 'AuM'])

df2:
lst = [['01012021', 'A', 10, ''], ['01012021', 'B', 20, ''], ['01012021', 'A', 12, 0.2], ['01012021', 'B', 23, 0.15]]
df2 = pd.DataFrame(lst, columns=['Date', 'FN', 'AuM', '%_delta'])
Thank you

Use groupby and pct_change:
df1['%_delta'] = df1.groupby('FN')['AuM'].pct_change()
print(df1)
# Output:
       Date FN  AuM  %_delta
0  01012021  A   10      NaN
1  01012021  B   20      NaN
2  01012021  A   12     0.20
3  01012021  B   23     0.15
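For intuition, pct_change within a group is the same as comparing each value against the previous value of that group. A minimal sketch of the equivalent computation using shift (the variable name prev is mine, not from the answer):

# previous AuM within each FN group; NaN for the first row of each group
prev = df1.groupby('FN')['AuM'].shift()
df1['%_delta'] = (df1['AuM'] - prev) / prev  # same result as pct_change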

Related

regroup uneven number of rows pandas df

I need to regroup a df from the format above into the one below, but my attempt fails: the output shape is (number of unique IDs, 2). Is there a more obvious solution?
You can use groupby and pivot:
(df.assign(n=df.groupby('ID').cumcount().add(1))
   .pivot(index='ID', columns='n', values='Value')
   .add_prefix('val_')
   .reset_index()
)
Example input:
df = pd.DataFrame({'ID': [7, 7, 8, 11, 12, 18, 22, 22, 22],
                   'Value': list('abcdefghi')})
Output:
n  ID val_1 val_2 val_3
0   7     a     b   NaN
1   8     c   NaN   NaN
2  11     d   NaN   NaN
3  12     e   NaN   NaN
4  18     f   NaN   NaN
5  22     g     h     i
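If you prefer to avoid the explicit cumcount/pivot step, a rough sketch of the same reshape collects each ID's values into lists and expands them into columns (the column renaming here is my own choice):

# one row per ID, one column per position within that ID's group
wide = df.groupby('ID')['Value'].agg(list).apply(pd.Series)
wide.columns = [f'val_{i + 1}' for i in wide.columns]
wide = wide.reset_index()
print(wide)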

Calculate a value inside a group

Suppose that I have a Pandas DataFrame named df:
  Origin Dest    T  R
0      N    N  100  3
1      N    A    2  6
2      A    B  356  7
3      A    B  789  8
4      B    N  999  9
5      B    A  345  2
6      N    A  456  3
I want to produce a DataFrame that, for each group (by Origin and Dest), sums the values in column 'T' and divides by the sum of the values in 'R'. I want to see this result in an Origin x Dest matrix form.
I am trying the following, but it does not work:
Matrix_Origin = df.pivot_table(values=['T', 'R'], index='Origin', columns='Dest',
                               fill_value=0,
                               aggfunc=[lambda x: df['T'].sum() / df['R'].sum()])
This is what I want to produce:
Origin      N      A      B
N       33.33  50.88      0
A           0      0  76.33
B         111  172.5      0
Any help will be appreciated.
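For reference, a snippet to reconstruct the sample frame shown above (the dtypes are an assumption):

import pandas as pd

df = pd.DataFrame({'Origin': ['N', 'N', 'A', 'A', 'B', 'B', 'N'],
                   'Dest':   ['N', 'A', 'B', 'B', 'N', 'A', 'A'],
                   'T': [100, 2, 356, 789, 999, 345, 456],
                   'R': [3, 6, 7, 8, 9, 2, 3]})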
A combination of groupby with unstack can yield your desired outcome:
res = df.groupby(["Origin", "Dest"]).sum().unstack()

# divide column T by column R
outcome = (
    res["T"]
    .div(res["R"])
    .reindex(index=["N", "A", "B"], columns=["N", "A", "B"])
    .fillna(0)
    .round(2)  # optional
)
outcome
Dest         N       A      B
Origin
N        33.33   50.89   0.00
A         0.00    0.00  76.33
B       111.00  172.50   0.00
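As an alternative closer to the pivot_table you attempted, you can aggregate the two columns separately and then divide the resulting matrices; a sketch, where aggfunc='sum' replaces the whole-frame sums in your lambda:

# sum T and R per Origin/Dest cell first, then divide cell-wise
t = df.pivot_table(values='T', index='Origin', columns='Dest', aggfunc='sum')
r = df.pivot_table(values='R', index='Origin', columns='Dest', aggfunc='sum')
outcome = (t.div(r)
            .reindex(index=['N', 'A', 'B'], columns=['N', 'A', 'B'])
            .fillna(0)
            .round(2))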

Pandas subtract column values only when non nan

I have a dataframe df as follows with about 200 columns:
Date      Run_1  Run_295  Prc
2/1/2020                    3
2/2/2020      2             6
2/3/2020             5      2
I want to subtract column Prc from columns Run_1, Run_295, Run_300 only when they are non-NaN or non-empty, to get the following:
Date      Run_1  Run_295
2/1/2020
2/2/2020     -4
2/3/2020              3
I am not sure how to proceed with the above.
Code to reproduce the dataframe:
import pandas as pd
from io import StringIO
s = """Date,Run_1,Run_295,Prc
2/1/2020,,,3
2/2/2020,2,,6
2/3/2020,,5,2"""
df = pd.read_csv(StringIO(s))
print(df)
You can simply subtract; NaN values propagate through arithmetic, so this does exactly what you want:
df.Run_1 - df.Prc
Here is the complete code for your output:
df.Run_1 = df.Run_1 - df.Prc
df.Run_295 = df.Run_295 - df.Prc
df.drop('Prc', axis=1, inplace=True)
df
       Date  Run_1  Run_295
0  2/1/2020    NaN      NaN
1  2/2/2020   -4.0      NaN
2  2/3/2020    NaN      3.0
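Since the question mentions about 200 columns, here is a sketch that scales the same idea without naming each column; it assumes every relevant column name starts with 'Run_':

# subtract Prc from every Run_* column at once; NaNs stay NaN
run_cols = df.filter(like='Run_').columns
df[run_cols] = df[run_cols].sub(df['Prc'], axis=0)
df = df.drop(columns='Prc')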
Three steps: melt to unpivot your dataframe, then loc to handle the assignment, and groupby to remake your original df. I'm sure there is a better way to do this, but it avoids loops and apply:
cols = df.columns
s = pd.melt(df, id_vars=['Date', 'Prc'], value_name='Run Rate')
s.loc[s['Run Rate'].notna(), 'Run Rate'] = s['Run Rate'] - s['Prc']
df_new = (s.groupby(['Date', 'Prc', 'variable'])['Run Rate']
           .first()
           .unstack(-1)
           .reset_index())  # reset_index so Date and Prc come back as columns
print(df_new[cols])
variable      Date  Run_1  Run_295  Prc
0         2/1/2020    NaN      NaN    3
1         2/2/2020   -4.0      NaN    6
2         2/3/2020    NaN      3.0    2
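A side note on the loc step: since NaN minus anything is still NaN, the mask is optional and a plain subtraction gives the same result (a small simplification, not from the original answer):

# equivalent to the masked assignment above
s['Run Rate'] = s['Run Rate'] - s['Prc']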

Groupby Year and other column and calculate average based on specific condition pandas

I have a data frame as shown below
Tenancy_ID  Unit_ID    End_Date  Rental_value
         1        A  2012-04-26            10
         2        A  2012-08-27            20
         3        A  2013-04-27            50
         4        A  2014-04-27            40
         1        B  2011-06-26            10
         2        B  2011-09-27            30
         3        B  2013-04-27            60
         4        B  2015-04-27            80
From the above, I would like to prepare the data frame below.
Expected Output:
Unit_ID  Avg_2011  Avg_2012  Avg_2013  Avg_2014  Avg_2015
A             NaN        15        50        40       NaN
B              20       NaN        60       NaN        80
Steps:
Avg_2012 = average rental value in 2012.
Unit_ID A has two contracts ending in 2012, with rental values 10 and 20; hence the average is 15.
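For reference, a snippet to reconstruct the sample frame (the to_datetime parsing is an assumption about the End_Date dtype):

import pandas as pd

df = pd.DataFrame({'Tenancy_ID': [1, 2, 3, 4, 1, 2, 3, 4],
                   'Unit_ID': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'End_Date': ['2012-04-26', '2012-08-27', '2013-04-27', '2014-04-27',
                                '2011-06-26', '2011-09-27', '2013-04-27', '2015-04-27'],
                   'Rental_value': [10, 20, 50, 40, 10, 30, 60, 80]})
df['End_Date'] = pd.to_datetime(df['End_Date'])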
Use pivot_table directly with Series.dt.year; the default aggfunc is mean, so no explicit aggregation is needed:
# df['End_Date'] = pd.to_datetime(df['End_Date'])  # if End_Date is not already datetime
final = (df.pivot_table('Rental_value', 'Unit_ID', df['End_Date'].dt.year)
           .add_prefix('Avg_')
           .reset_index()
           .rename_axis(None, axis=1))
print(final)
  Unit_ID  Avg_2011  Avg_2012  Avg_2013  Avg_2014  Avg_2015
0       A       NaN      15.0      50.0      40.0       NaN
1       B      20.0       NaN      60.0       NaN      80.0
You can aggregate averages with groupby and reshape with Series.unstack, then change the column names with DataFrame.add_prefix, and finally clean up with DataFrame.reset_index and DataFrame.rename_axis:
df1 = (df.groupby(['Unit_ID', df['End_Date'].dt.year])['Rental_value']
         .mean()
         .unstack()
         .add_prefix('Avg_')
         .reset_index()
         .rename_axis(None, axis=1))
print(df1)
  Unit_ID  Avg_2011  Avg_2012  Avg_2013  Avg_2014  Avg_2015
0       A       NaN      15.0      50.0      40.0       NaN
1       B      20.0       NaN      60.0       NaN      80.0

Find multiple strings in a given column

I'm not sure whether this is easy to do.
I have 2 dataframes. In the first one (df1) there is a column with texts ('Texts'), and in the second there are 2 columns: one with some short texts ('SubString') and one with a score ('Score').
What I want is to sum up all the scores associated with the SubString values in the second dataframe whenever that SubString is a substring of the Texts column in the first dataframe.
For example, if I have a dataframe like this:
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'Texts': ['this is a string',
              'here we have another string',
              'this one is completly different',
              'one more',
              'this is one more',
              'and the last one'],
    'c': ['C', 'C', 'C', 'C', 'C', 'C'],
    'd': ['D', 'D', 'D', 'D', 'NaN', 'NaN']
}, columns=['ID', 'Texts', 'c', 'd'])
df1
Out[2]:
   ID                            Texts  c    d
0   1                 this is a string  C    D
1   2      here we have another string  C    D
2   3  this one is completly different  C    D
3   4                         one more  C    D
4   5                 this is one more  C  NaN
5   6                 and the last one  C  NaN
And another dataframe like this:
df2 = pd.DataFrame({
    'SubString': ['This', 'one', 'this is', 'is one'],
    'Score': [0.5, 0.2, 0.75, -0.5]
}, columns=['SubString', 'Score'])
df2
Out[3]:
  SubString  Score
0      This   0.50
1       one   0.20
2   this is   0.75
3    is one  -0.50
I want to get something like this:
df1['Score'] = 0.0
for index1, row1 in df1.iterrows():
    score = 0
    for index2, row2 in df2.iterrows():
        if row2['SubString'] in row1['Texts']:
            score += row2['Score']
    df1.at[index1, 'Score'] = score  # .at replaces the deprecated set_value
df1
Out[4]:
   ID                            Texts  c    d  Score
0   1                 this is a string  C    D   0.75
1   2      here we have another string  C    D   0.00
2   3  this one is completly different  C    D  -0.30
3   4                         one more  C    D   0.20
4   5                 this is one more  C  NaN   0.45
5   6                 and the last one  C  NaN   0.20
Is there a less garbled and faster way to do it?
Thanks!
Option 1
In [691]: np.array([np.where(df1.Texts.str.contains(x.SubString), x.Score, 0)
                    for _, x in df2.iterrows()]
                   ).sum(axis=0)
Out[691]: array([ 0.75,  0.  , -0.3 ,  0.2 ,  0.45,  0.2 ])
Option 2
In [674]: df1.Texts.apply(lambda x: df2.Score[df2.SubString.apply(lambda y: y in x)].sum())
Out[674]:
0    0.75
1    0.00
2   -0.30
3    0.20
4    0.45
5    0.20
Name: Texts, dtype: float64
Note: apply doesn't get rid of loops, it just hides them.
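One caveat worth adding: str.contains interprets its pattern as a regular expression by default, so substrings containing regex metacharacters (e.g. '+' or '(') would misbehave. A defensive variant of Option 1 (regex=False is a standard pandas flag; the rest mirrors the code above):

import numpy as np

# literal (non-regex) matching, otherwise identical to Option 1
df1['Score'] = np.array([
    np.where(df1.Texts.str.contains(x.SubString, regex=False), x.Score, 0)
    for _, x in df2.iterrows()
]).sum(axis=0)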