I have been trying to work out how to calculate the future value of a savings account where each month I must deposit $100.
import pandas as pd
# deposit an extra $100 per month
deposit = [100] * 4
# unbelievable rate of 10%!
rate = [0.1] * 4
df = pd.DataFrame({ 'deposit':deposit, 'rate':rate})
df['interest'] = df.deposit * df.rate
df['total'] = df.deposit.cumsum() + df.interest.cumsum()
This gives the incorrect total of $440 when it should be $464.10 due to compound interest.
total = 0
r = 0.1
d = 100
for i in range(0,4):
    total = (total * r) + total + d
    print (total)
100.0
210.0
331.0
464.1
What is the correct way to do this in Pandas?
IIUC, it is compounded at the end. Using pd.Series's shift and cumprod:
df['total'] = (df['deposit'] * df['rate'].shift().add(1).cumprod().fillna(1)).cumsum()
print(df)
Output:
   deposit  rate  interest  total
0      100   0.1      10.0  100.0
1      100   0.1      10.0  210.0
2      100   0.1      10.0  331.0
3      100   0.1      10.0  464.1
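The one-liner above matches the loop for the constant deposit and rate used in the question. If deposits or rates vary from row to row, one possible vectorized formulation is to divide each deposit by the cumulative growth factor, accumulate, and multiply back. This is a minimal sketch, not part of the original answer; the helper name compound_total and the varying sample values are assumptions made for illustration, and it assumes no rate equals -1.

```python
import pandas as pd

def compound_total(deposit, rate):
    """Running balance where each period applies its rate to the previous
    balance and then adds that period's deposit (same rule as the loop)."""
    growth = (1 + rate).cumprod()          # growth factor up to each period
    return growth * (deposit / growth).cumsum()

# Hypothetical data with varying deposits and rates.
df = pd.DataFrame({'deposit': [100, 150, 100, 200],
                   'rate': [0.10, 0.05, 0.10, 0.02]})
df['total'] = compound_total(df['deposit'], df['rate'])

# Reference loop from the question, generalised to per-row values.
running = 0.0
expected = []
for d, r in zip(df['deposit'], df['rate']):
    running = running * (1 + r) + d
    expected.append(running)

# The constant case from the question.
constant = compound_total(pd.Series([100] * 4), pd.Series([0.1] * 4))
print(df)
print(constant.round(2).tolist())
```

The loop is kept only as a cross-check; for long frames the cumprod/cumsum version avoids Python-level iteration.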
I would like to do an "incremental groupby". I have the following dataframe:
v1 increment
0.1 0
0.5 0
0.42 1
0.4 1
0.3 2
0.7 2
I would like to compute the average of column v1 by incrementally grouping on the column "increment". For instance, the first group (increment = 0) gives the average of the first two rows. The second group gives the average of the first four rows (increment = 0 and 1), and the third group gives the average over increment = 0, 1 and 2.
Any idea how I could do that efficiently?
Expected output:
group  average of v1
0      0.3
1      0.355
2      0.403
You can compute the cumulated sum and the cumulated size, then divide:
g = df.groupby('increment')['v1'] # set up a grouper for efficiency
out = (g.sum().cumsum() # cumulated sum
.div(g.size().cumsum()) # divide by cumulated size
.reset_index(name='average of v1')
)
output:
   increment  average of v1
0          0       0.300000
1          1       0.355000
2          2       0.403333
Alternatively, take the cumulative sum of the per-group sums of v1 and divide it by the cumulative sum of the group sizes:
cumsum = df.groupby('increment')['v1'].sum().cumsum()
cumsize = df.groupby('increment')['v1'].size().cumsum()
out = (cumsum.div(cumsize)
.to_frame('average of v1')
.reset_index())
print(out)
   increment  average of v1
0          0       0.300000
1          1       0.355000
2          2       0.403333
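To see why dividing the cumulative sum by the cumulative size gives the incremental average, the result can be compared with a brute-force mean over all rows whose increment is at most k. The sketch below rebuilds the sample frame from the question; the name check is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'v1': [0.1, 0.5, 0.42, 0.4, 0.3, 0.7],
                   'increment': [0, 0, 1, 1, 2, 2]})

g = df.groupby('increment')['v1']
# Cumulative sum of group sums divided by cumulative sum of group sizes.
out = g.sum().cumsum().div(g.size().cumsum())

# Brute-force check: mean of every row with increment <= k.
check = pd.Series({k: df.loc[df['increment'] <= k, 'v1'].mean()
                   for k in sorted(df['increment'].unique())})

print(out)
print(check)
```

The brute-force version rescans the frame once per group, so it is only useful as a sanity check on small data.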
I need to compute a column conditionally, as shown in my code. It takes quite some time to run; is there a faster, cleaner way to do this?
For reference, the column "coin" has pairs like "ETH_ARS", "DAI_USD" and so on, which is why I split it.
for i in range(merged.shape[0]):
    x = merged["coin"].iloc[i]
    if x.split("_")[1] == "ARS":
        merged["total"].iloc[i] = (
            merged["price"].iloc[i]
            * merged["amount"].iloc[i]
            / merged["valueUSD"].iloc[i]
        )
    else:
        merged["total"].iloc[i] = merged["price"].iloc[i] * merged["amount"].iloc[i]
You can vectorize your code. The trick here is to set valueUSD=1 when coin column ends with USD. After that the operation is the same for all rows: total = price * amount / valueUSD.
Set up an MRE:
data = {'coin': ['ETH_ARS', 'DAI_USD'],
'price': [10, 12],
'amount': [3, 4],
'valueUSD': [2, 7]}
df = pd.DataFrame(data)
print(df)
# Output:
coin price amount valueUSD
0 ETH_ARS 10 3 2
1 DAI_USD 12 4 7 # <- should be set to 1 for division
valueUSD = df['valueUSD'].mask(df['coin'].str.split('_').str[1].eq('USD'), other=1)
df['total'] = df['price'] * df['amount'] / valueUSD
print(df)
# Output:
coin price amount valueUSD total
0 ETH_ARS 10 3 2 15.0 # = 10 * 3 / 2
1 DAI_USD 12 4 7 48.0 # = 12 * 4 / 1 (7 -> 1)
This works because mask replaces valueUSD with 1 wherever the coin ends with USD, and keeps the original value elsewhere:
>>> valueUSD
0 2
1 1 # 7 -> 1
Name: valueUSD, dtype: int64
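A variant that mirrors the original if/else more literally is numpy.where, which picks between two precomputed columns. This is a sketch on the same sample data; the names is_ars and gross are assumptions for illustration. Note one difference from the mask approach: here only ARS pairs are divided, whereas mask divides every pair that is not quoted in USD.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'coin': ['ETH_ARS', 'DAI_USD'],
                   'price': [10, 12],
                   'amount': [3, 4],
                   'valueUSD': [2, 7]})

# True where the quote currency (the part after "_") is ARS.
is_ars = df['coin'].str.split('_').str[1].eq('ARS')
gross = df['price'] * df['amount']

# Divide by valueUSD only for ARS pairs, as in the original loop.
df['total'] = np.where(is_ars, gross / df['valueUSD'], gross)
print(df)
```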
I have a dataframe with dates and a value per day. I want to see the gradient of the value, i.e. whether it is growing or declining. The best way seems to be a linear regression with day as x and value as y:
import pandas as pd
df = pd.DataFrame({'customer':['a','a','a','b','b','b'],
'day':[1,2,4,2,3,4],
'value':[1.5,2.4,3.6,1.5,1.3,1.1]})
df:
customer day value
0 a 1 1.5
1 a 2 2.4
2 a 4 3.6
3 b 2 1.5
4 b 3 1.3
5 b 4 1.1
By hand I can do a linear regression:
from sklearn.linear_model import LinearRegression
def gradient(x,y):
    return LinearRegression().fit(x,y).coef_[0]
xa = df[df.customer =='a'].day.values.reshape(-1, 1)
ya = df[df.customer =='a'].value.values.reshape(-1, 1)
xb = df[df.customer =='b'].day.values.reshape(-1, 1)
yb = df[df.customer =='b'].value.values.reshape(-1, 1)
print(gradient(xa,ya),gradient(xb,yb))
result: [0.68571429] [-0.2]
But I would like to use a groupby as in
df.groupby('customer').agg({'value':['mean','sum','gradient']})
with an output like:
         value
          mean  sum gradient
customer
a          2.5  7.5    0.685
b          1.3  3.9     -0.2
The issue is that the gradient needs two columns as input.
You can do:
# calculate gradient
v = (df
     .groupby('customer')
     .apply(lambda x: gradient(x['day'].to_numpy().reshape(-1, 1),
                               x['value'].to_numpy().reshape(-1, 1))))
v.name = 'gradient'
# calculate mean, sum
d1 = df.groupby('customer').agg({'value': ['mean', 'sum']})
# join the results
d1 = d1.join(v)
# fix columns
d1.columns = d1.columns.str.join('')
print(d1)
valuemean valuesum gradient
customer
a 2.5 7.5 0.685714
b 1.3 3.9 -0.200000
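If scikit-learn is not otherwise needed, the per-group slope can also be obtained with numpy.polyfit, which fits a degree-1 polynomial by least squares. The sketch below swaps that in for LinearRegression (it is not the method used in the answer above); the helper name slope is an assumption, and the single-column aggregation keeps the result columns flat so no column fix-up is required.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'customer': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'day': [1, 2, 4, 2, 3, 4],
                   'value': [1.5, 2.4, 3.6, 1.5, 1.3, 1.1]})

def slope(group):
    # Degree-1 least-squares fit; the first coefficient is the slope.
    return np.polyfit(group['day'], group['value'], 1)[0]

# Mean and sum of value per customer (flat column names).
summary = df.groupby('customer')['value'].agg(['mean', 'sum'])
# One slope per customer, aligned on the customer index.
summary['gradient'] = df.groupby('customer')[['day', 'value']].apply(slope)
print(summary)
```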
I've started to work with Pandas DataFrames and am trying to figure out how to handle the task below.
I have an Excel spreadsheet that needs to be imported into a Pandas DataFrame, and the calculations below need to be done to populate the PercentageOnSale, Bonus and EmployeesIncome columns.
If the sum of all SalesValues for the EmployeeID is less than 5000, the PercentageOnSale should be 5% of SalesValue.
If the sum of all SalesValues for the EmployeeID is 5000 or more, the PercentageOnSale should be 7% of SalesValue.
If the sum of all SalesValues for the EmployeeID is more than 10,000, the PercentageOnSale should be 7% of SalesValue and additionally a Bonus of 3% should be calculated.
EmployeesIncome is the sum of PercentageOnSale and Bonus columns.
You could try groupby-apply as follows:
# Data
df = pd.DataFrame({"EmployeeID":[1,1,2,3,1,3,5,1],
"ProductSold":["P1","P2","P3","P1","P2","P3","P1","P2"],
"SalesValue":[3000,3500,4000,3000,5000,3000,3000,4000]})
# Calculations
def calculate(x):
    # Calculate Bonus
    x['Bonus'] = 0
    if x['SalesValue'].sum() > 10000:
        x['Bonus'] = 0.03*x['SalesValue']
    # Calculate PercentageOnSale (7% once the employee's total reaches 5000)
    if x['SalesValue'].sum() < 5000:
        x['PercentageOnSale'] = 0.05*x['SalesValue']
    else:
        x['PercentageOnSale'] = 0.07*x['SalesValue']
    # Total Income per sale
    x['EmployeesIncome'] = x['PercentageOnSale'] + x['Bonus']
    return x
df_final = df.groupby('EmployeeID').apply(calculate)
The output is as follows:
   EmployeeID ProductSold  SalesValue  Bonus  PercentageOnSale  EmployeesIncome
0           1          P1        3000   90.0             210.0            300.0
1           1          P2        3500  105.0             245.0            350.0
2           2          P3        4000    0.0             200.0            200.0
3           3          P1        3000    0.0             210.0            210.0
4           1          P2        5000  150.0             350.0            500.0
5           3          P3        3000    0.0             210.0            210.0
6           5          P1        3000    0.0             150.0            150.0
7           1          P2        4000  120.0             280.0            400.0
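The same rules from the question (5% below a 5000 total, 7% from 5000, plus a 3% bonus above 10,000) can also be expressed without a Python function per group, by broadcasting each employee's total back onto the rows with transform. This is a sketch under those stated rules; the names total, pct_rate and bonus_rate are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"EmployeeID": [1, 1, 2, 3, 1, 3, 5, 1],
                   "ProductSold": ["P1", "P2", "P3", "P1", "P2", "P3", "P1", "P2"],
                   "SalesValue": [3000, 3500, 4000, 3000, 5000, 3000, 3000, 4000]})

# Per-employee total sales, repeated on every row of that employee.
total = df.groupby("EmployeeID")["SalesValue"].transform("sum")

# Rates chosen row by row from the employee total.
pct_rate = np.where(total >= 5000, 0.07, 0.05)
bonus_rate = np.where(total > 10000, 0.03, 0.0)

df["PercentageOnSale"] = pct_rate * df["SalesValue"]
df["Bonus"] = bonus_rate * df["SalesValue"]
df["EmployeesIncome"] = df["PercentageOnSale"] + df["Bonus"]
print(df)
```

Because transform keeps the original row order and index, the new columns line up with the input without any regrouping.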
I start with a simple
df = pd.DataFrame({'units':[30,20]})
And I get
units
0 30
1 20
I then add a row to total the column:
my_sum = df.sum()
df = df.append(my_sum, ignore_index=True)
Finally, I add a column to calculate percentages off of the 'units' column:
df['pct'] = df.units / df.units[:-1].sum()
ending with this:
units pct
0 30 0.6
1 20 0.4
2 50 1.0
So far so good - but now the question: I want to change the middle number of units from 20 to, for example, 40. I can use this:
df.iloc[1, 0] = 40
or
df.iat[1, 0] = 40
which change the cell, but the calculated values at both the last row and second column don't change to reflect it:
units pct
0 30 0.6
1 40 0.4
2 50 1.0
How do I force these calculated values to adjust following the change in that particular cell?
Make a function that rebuilds the total row and the percentages from the raw data, and call it again after each change:
def f(df):
    return df.append(df.sum(), ignore_index=True).assign(
        pct=lambda d: d.units / d.units.iat[-1])
df.iat[1, 0] = 40
f(df)
units pct
0 30 0.428571
1 40 0.571429
2 70 1.000000
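DataFrame.append was removed in pandas 2.0, so on recent versions the function above raises an AttributeError. A possible equivalent using pd.concat is sketched below; the name with_total is an assumption, and like f it expects the frame without the total row.

```python
import pandas as pd

def with_total(df):
    """Return a copy of df with a total row appended and a pct column
    giving each row's share of that total."""
    total_row = df[['units']].sum().to_frame().T   # one-row frame holding the sum
    out = pd.concat([df[['units']], total_row], ignore_index=True)
    out['pct'] = out['units'] / out['units'].iat[-1]
    return out

df = pd.DataFrame({'units': [30, 20]})
df.iat[1, 0] = 40          # change the raw data first
result = with_total(df)    # then recompute the derived row and column
print(result)
```

Keeping the raw units in df and deriving the total and percentages on demand avoids stale values after an edit.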