Calculate Compound Interest in Pandas - pandas

I have been trying to work out how to calculate the future value of a savings account where each month I must deposit $100.
import pandas as pd
# deposit an extra $100 per month
deposit = [100] * 4
# unbelievable rate of 10%!
rate = [0.1] * 4
df = pd.DataFrame({ 'deposit':deposit, 'rate':rate})
df['interest'] = df.deposit * df.rate
df['total'] = df.deposit.cumsum() + df.interest.cumsum()
This gives the incorrect total of $440 when it should be $464.10 due to compound interest.
total = 0
r = 0.1
d = 100
for i in range(0,4):
    total = (total * r) + total + d
    print(total)
100.0
210.0
331.0
464.1
What is the correct way to do this in Pandas?

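IIUC, interest is compounded at the end of each period. You can use Series.shift and Series.cumprod. Note that this matches the loop above only when the deposit and the rate are the same in every period: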
df['total'] = (df['deposit'] * df['rate'].shift().add(1).cumprod().fillna(1)).cumsum()
print(df)
Output:
   deposit  rate  interest  total
0      100   0.1      10.0  100.0
1      100   0.1      10.0  210.0
2      100   0.1      10.0  331.0
3      100   0.1      10.0  464.1
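If deposits or rates vary from period to period, the recurrence from the question's loop, total_i = total_(i-1) * (1 + rate_i) + deposit_i, can still be vectorized. The sketch below is one way to do it, not the answerer's method: it divides each deposit by the cumulative growth factor, takes a running sum, and scales back up. The helper name `future_value` and the sample numbers are made up for illustration, and the rates are assumed to be greater than -1 so the growth factor never reaches zero.

```python
import pandas as pd


def future_value(deposit: pd.Series, rate: pd.Series) -> pd.Series:
    """Running balance where each period's rate is applied to the
    previous balance and the period's deposit is added afterwards."""
    growth = (1 + rate).cumprod()            # cumulative growth factor
    # Discount every deposit back to period 0, accumulate, then re-inflate.
    return growth * (deposit / growth).cumsum()


# Hypothetical data with varying deposits and rates.
df = pd.DataFrame({'deposit': [100, 200, 50, 100],
                   'rate': [0.10, 0.05, 0.10, 0.02]})
df['total'] = future_value(df['deposit'], df['rate'])

# Cross-check against the explicit loop from the question.
total = 0.0
expected = []
for d, r in zip(df['deposit'], df['rate']):
    total = total * (1 + r) + d
    expected.append(total)

print(df)
```

For very long series the cumulative growth factor can become large, so the plain loop may be the numerically safer choice there.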

Related

Efficient way to do an incremental groupby in pandas

I would like to do an "incremental groupby". I have the following dataframe:
v1     increment
0.1    0
0.5    0
0.42   1
0.4    1
0.3    2
0.7    2
I would like to compute the average of column v1 by incrementally grouping by the column "increment". For instance, for the first group (increment = 0) I would get the average of the first two rows. For the second, I would get the average of the first four rows (increment = 0 and 1), and for the third, the average over increment = 0, 1 and 2.
Any idea how I could do that efficiently?
Expected output:
group  average of v1
0      0.3
1      0.355
2      0.403
You can compute the cumulative sum and the cumulative size, then divide:
g = df.groupby('increment')['v1']        # set up a grouper for efficiency
out = (g.sum().cumsum()                  # cumulative sum
        .div(g.size().cumsum())          # divide by cumulative size
        .reset_index(name='average of v1')
      )
output:
   increment  average of v1
0          0       0.300000
1          1       0.355000
2          2       0.403333
Alternatively, take the cumulative sum of the per-group sums of v1 and divide it by the cumulative sum of the group sizes:
cumsum = df.groupby('increment')['v1'].sum().cumsum()
cumsize = df.groupby('increment')['v1'].size().cumsum()
out = (cumsum.div(cumsize)
       .to_frame('average of v1')
       .reset_index())
print(out)
   increment  average of v1
0          0       0.300000
1          1       0.355000
2          2       0.403333
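For reference, here is a self-contained sketch of the same idea that computes both aggregates in a single pass with `agg`. It uses `'count'` rather than `size`, which is an assumption on my part: `count` skips missing values in v1, so the two only agree when v1 has no NaN.

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({'v1': [0.1, 0.5, 0.42, 0.4, 0.3, 0.7],
                   'increment': [0, 0, 1, 1, 2, 2]})

# Per-group sum and count, accumulated over the (sorted) group keys.
running = df.groupby('increment')['v1'].agg(['sum', 'count']).cumsum()

# Running mean = cumulative sum / cumulative count.
out = (running['sum'] / running['count']).rename('average of v1').reset_index()
print(out)
```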

Faster way to operate columns with if conditions

I need to compute a column with an if condition, as shown in my code. It takes quite a while to run. Is there a faster, cleaner way to do this?
For reference, the column "coin" has pairs like "ETH_ARS", "DAI_USD" and so on, which is why I split it.
for i in range(merged.shape[0]):
    x = merged["coin"].iloc[i]
    if x.split("_")[1] == "ARS":
        merged["total"].iloc[i] = (
            merged["price"].iloc[i]
            * merged["amount"].iloc[i]
            / merged["valueUSD"].iloc[i]
        )
    else:
        merged["total"].iloc[i] = merged["price"].iloc[i] * merged["amount"].iloc[i]
You can vectorize your code. The trick here is to set valueUSD=1 when coin column ends with USD. After that the operation is the same for all rows: total = price * amount / valueUSD.
Set up an MRE:
data = {'coin': ['ETH_ARS', 'DAI_USD'],
        'price': [10, 12],
        'amount': [3, 4],
        'valueUSD': [2, 7]}
df = pd.DataFrame(data)
print(df)
# Output:
      coin  price  amount  valueUSD
0  ETH_ARS     10       3         2
1  DAI_USD     12       4         7  # <- should be set to 1 for division
valueUSD = df['valueUSD'].mask(df['coin'].str.split('_').str[1].eq('USD'), other=1)
df['total'] = df['price'] * df['amount'] / valueUSD
print(df)
# Output:
      coin  price  amount  valueUSD  total
0  ETH_ARS     10       3         2   15.0  # = 10 * 3 / 2
1  DAI_USD     12       4         7   48.0  # = 12 * 4 / 1 (7 -> 1)
Here mask replaces valueUSD with 1 wherever the condition holds (the coin's quote currency is USD) and leaves the other rows unchanged:
>>> valueUSD
0    2
1    1  # 7 -> 1
Name: valueUSD, dtype: int64
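If the data can contain quote currencies other than ARS and USD, a variant that mirrors the original loop's condition (divide only for ARS pairs) may be safer. The following is a minimal sketch using `numpy.where`. The third row (`BTC_ARS`) is invented for illustration, and the frame is named `merged` to match the question.

```python
import numpy as np
import pandas as pd

# Small stand-in for the asker's frame; the BTC_ARS row is hypothetical.
merged = pd.DataFrame({
    'coin': ['ETH_ARS', 'DAI_USD', 'BTC_ARS'],
    'price': [10, 12, 8],
    'amount': [3, 4, 5],
    'valueUSD': [2, 7, 4],
})

quote = merged['coin'].str.split('_').str[1]    # part after the underscore
gross = merged['price'] * merged['amount']      # price * amount for every row

# Divide by valueUSD only for ARS pairs, exactly as the loop does.
merged['total'] = np.where(quote.eq('ARS'), gross / merged['valueUSD'], gross)
print(merged)
```

Both branches are evaluated for every row before `np.where` selects between them, so a zero in valueUSD on a non-ARS row would still produce a division warning even though the result for that row is unaffected.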

pandas-groupby: apply custom function which needs 2 columns as input to get one column as output

I have a dataframe with dates and a value per day. I want to see the gradient of the value, if it is growing, declining, .... The best way is to apply a linear regression with day as x and value as y:
import pandas as pd
df = pd.DataFrame({'customer':['a','a','a','b','b','b'],
'day':[1,2,4,2,3,4],
'value':[1.5,2.4,3.6,1.5,1.3,1.1]})
df:
  customer  day  value
0        a    1    1.5
1        a    2    2.4
2        a    4    3.6
3        b    2    1.5
4        b    3    1.3
5        b    4    1.1
By hand I can do a linear regression:
from sklearn.linear_model import LinearRegression
def gradient(x,y):
    return LinearRegression().fit(x,y).coef_[0]
xa = df[df.customer =='a'].day.values.reshape(-1, 1)
ya = df[df.customer =='a'].value.values.reshape(-1, 1)
xb = df[df.customer =='b'].day.values.reshape(-1, 1)
yb = df[df.customer =='b'].value.values.reshape(-1, 1)
print(gradient(xa,ya),gradient(xb,yb))
result: [0.68571429] [-0.2]
But I would like to use a groupby as in
df.groupby('customer').agg({'value':['mean','sum','gradient']})
with an output like:
         value
          mean  sum gradient
customer
a          2.5  7.5    0.685
b          1.3  3.9     -0.2
The issue is that the gradient needs two columns as input.
You can do:
# calculate gradient
v = (df
     .groupby('customer')
     .apply(lambda x: gradient(x['day'].to_numpy().reshape(-1, 1),
                               x['value'].to_numpy().reshape(-1, 1)))
    )
v.name = 'gradient'
# calculate mean, sum
d1 = df.groupby('customer').agg({'value': ['mean', 'sum']})
# join the results
d1 = d1.join(v)
# fix columns
d1.columns = d1.columns.str.join('')
print(d1)
          valuemean  valuesum  gradient
customer
a               2.5       7.5  0.685714
b               1.3       3.9 -0.200000
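As a lighter-weight alternative, the slope of a straight-line fit can also be obtained with `numpy.polyfit`, which avoids the scikit-learn dependency and returns a plain scalar per group. This is a sketch that swaps in a different fitting routine, not the answer's method, and the helper name `slope` is made up here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'customer': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'day': [1, 2, 4, 2, 3, 4],
                   'value': [1.5, 2.4, 3.6, 1.5, 1.3, 1.1]})


def slope(group: pd.DataFrame) -> float:
    """Least-squares slope of value against day for one customer."""
    # polyfit returns coefficients from highest degree down: [slope, intercept].
    return np.polyfit(group['day'].to_numpy(), group['value'].to_numpy(), 1)[0]


# Select only the two input columns so the function sees exactly what it needs.
grad = df.groupby('customer')[['day', 'value']].apply(slope).rename('gradient')
stats = df.groupby('customer')['value'].agg(['mean', 'sum'])

result = stats.join(grad)
print(result)
```

Aggregating the single `value` column with a list of functions yields flat column names (`mean`, `sum`), so no column clean-up is needed after the join.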

Pandas Dataframe how to iterate over rows and perform calculations on their values

I've started working with pandas DataFrames and am trying to figure out how to handle the task below.
I have an Excel spreadsheet that needs to be imported into a pandas DataFrame, and the calculations below need to be done to populate the PercentageOnSale, Bonus and EmployeesIncome columns.
If the sum of all SalesValues for the EmployeeID is less than 5000, the PercentageOnSale should be 5% of SalesValue.
If the sum of all SalesValues for the EmployeeID is 5000 or more, the PercentageOnSale should be 7% of SalesValue.
If the sum of all SalesValues for the EmployeeID is more than 10,000, the PercentageOnSale should be 7% of SalesValue and additionally a Bonus of 3% should be calculated.
EmployeesIncome is the sum of the PercentageOnSale and Bonus columns.
[Image: sample Excel view]
You could try groupby-apply as follows:
# Data
df = pd.DataFrame({"EmployeeID":[1,1,2,3,1,3,5,1],
"ProductSold":["P1","P2","P3","P1","P2","P3","P1","P2"],
"SalesValue":[3000,3500,4000,3000,5000,3000,3000,4000]})
# Calculations
def calculate(x):
    x = x.copy()  # work on a copy of the group
    total = x['SalesValue'].sum()
    # Calculate Bonus (3% only when the employee's total exceeds 10000)
    x['Bonus'] = 0.0
    if total > 10000:
        x['Bonus'] = 0.03*x['SalesValue']
    # Calculate PercentageOnSale (5% below 5000 in total sales, otherwise 7%)
    if total < 5000:
        x['PercentageOnSale'] = 0.05*x['SalesValue']
    else:
        x['PercentageOnSale'] = 0.07*x['SalesValue']
    # Total Income per sale
    x['EmployeesIncome'] = x['PercentageOnSale'] + x['Bonus']
    return x
# group_keys=False keeps the original row index in the result
df_final = df.groupby('EmployeeID', group_keys=False).apply(calculate)
The output is as follows:
   EmployeeID ProductSold  SalesValue  Bonus  PercentageOnSale  EmployeesIncome
0           1          P1        3000   90.0             210.0            300.0
1           1          P2        3500  105.0             245.0            350.0
2           2          P3        4000    0.0             200.0            200.0
3           3          P1        3000    0.0             210.0            210.0
4           1          P2        5000  150.0             350.0            500.0
5           3          P3        3000    0.0             210.0            210.0
6           5          P1        3000    0.0             150.0            150.0
7           1          P2        4000  120.0             280.0            400.0
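The same rules can also be expressed without `apply`, by broadcasting each employee's total back onto the rows with `groupby(...).transform('sum')` and choosing the rates with `numpy.where`. This is a sketch of that alternative, using the thresholds stated in the question (5000 and 10,000); the intermediate names `total` and `rate` are my own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"EmployeeID": [1, 1, 2, 3, 1, 3, 5, 1],
                   "ProductSold": ["P1", "P2", "P3", "P1", "P2", "P3", "P1", "P2"],
                   "SalesValue": [3000, 3500, 4000, 3000, 5000, 3000, 3000, 4000]})

# Each row receives the total sales of its employee.
total = df.groupby('EmployeeID')['SalesValue'].transform('sum')

# 5% when the employee's total is below 5000, otherwise 7%.
rate = np.where(total < 5000, 0.05, 0.07)
df['PercentageOnSale'] = rate * df['SalesValue']

# Extra 3% bonus only when the employee's total exceeds 10000.
df['Bonus'] = np.where(total > 10000, 0.03 * df['SalesValue'], 0.0)

df['EmployeesIncome'] = df['PercentageOnSale'] + df['Bonus']
print(df)
```

Because `transform` keeps the original row index, the new columns line up with the input rows directly and no regrouping or re-sorting is needed.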

Forcing dataframe recalculation after a change of a specific cell

I start with a simple
df = pd.DataFrame({'units':[30,20]})
And I get
   units
0     30
1     20
I then add a row to total the column:
my_sum = df.sum()
df = df.append(my_sum, ignore_index=True)
Finally, I add a column to calculate percentages off of the 'units' column:
df['pct'] = df.units / df.units[:-1].sum()
ending with this:
   units  pct
0     30  0.6
1     20  0.4
2     50  1.0
So far so good - but now the question: I want to change the middle number of units from 20 to, for example, 40. I can use this:
df.iloc[1, 0] = 40
or
df.iat[1, 0] = 40
which changes the cell, but the calculated values in the last row and the second column don't update to reflect it:
   units  pct
0     30  0.6
1     40  0.4
2     50  1.0
How do I force these calculated values to adjust following the change in that particular cell?
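Wrap the calculation in a function and call it again after each change. Here df is the original two-row frame, before the total row is appended:
def f(df):
    return df.append(df.sum(), ignore_index=True).assign(
        pct=lambda d: d.units / d.units.iat[-1])
df.iat[1, 0] = 40
f(df)
   units       pct
0     30  0.428571
1     40  0.571429
2     70  1.000000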
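Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the function above fails on current versions. A minimal sketch of the same idea using `pd.concat` follows; the function name `with_totals` is made up for illustration.

```python
import pandas as pd


def with_totals(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a total row appended and a pct column."""
    total_row = df.sum().to_frame().T            # one-row frame holding the column sums
    out = pd.concat([df, total_row], ignore_index=True)
    # The last row is the total, so dividing by it gives each row's share.
    return out.assign(pct=lambda d: d['units'] / d['units'].iat[-1])


df = pd.DataFrame({'units': [30, 20]})
df.iat[1, 0] = 40              # edit the raw data first...
result = with_totals(df)       # ...then rebuild the derived row and column
print(result)
```

Keeping `df` free of derived rows and regenerating them on demand means a later edit only requires calling the function again.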