"Indexing" a price series to a starting time point (index level = 100) with pandas data frame : P(i,t) / P(i) - pandas

I have a pandas data frame, where datetime is the index of the data frame (I use t=0 for simplification, in fact there is something like 20170101 09:30:00)
datetime Stock A Stock B
t=0 5 20
t=1 6 30
t=2 8 25
t=3 4 20
and I would like to return:
datetime Stock A Stock B
t=0 100 100
t=1 120 150
t=2 160 125
t=3 80 100
in mathematical terms: Index(i, t) = 100 * P(i, t) / P(i, 0).
I tried
df_norm = df[0:] / df[0:1]
print(df_norm)
which gives me an error.
edit1: I tried option 3, which works fine. I couldn't test it on NaNs yet, but at least it does not create a NaN for the first observation (which pct_change does). I also noticed that after running it, datetime is no longer the index, which is easy to fix by re-assigning it.
Now I am trying to wrap it in a function, but I think the index is causing a problem (actually the same error as with my "first" attempt):
def norming(x):
    return x.assign(**x.drop('datetime', 1).pipe(
        lambda d: d.div(d.shift().bfill()).cumprod()))
edit2: if my column datetime is an index, i.e.
df_norm.set_index(['datetime'], inplace = True)
I'll get an error though, what would I need to change?

Option 1
df.set_index('datetime').pct_change().fillna(0) \
.add(1).cumprod().mul(100).reset_index()
datetime Stock A Stock B
0 t=0 100.0 100.0
1 t=1 120.0 150.0
2 t=2 160.0 125.0
3 t=3 80.0 100.0
Option 2
import numpy as np

def idx_me(a):
    a = np.asarray(a)
    # ratio of each price to the previous one; the first ratio is 1
    r = np.append(1, a[1:] / a[:-1])
    return r.cumprod() * 100

df.assign(**df.drop(columns='datetime').apply(idx_me))
datetime Stock A Stock B
0 t=0 100.0 100.0
1 t=1 120.0 150.0
2 t=2 160.0 125.0
3 t=3 80.0 100.0
Option 3
df.assign(**df.drop(columns='datetime').pipe(
    lambda d: d.div(d.shift().bfill()).cumprod().mul(100)))
datetime Stock A Stock B
0 t=0 100.0 100.0
1 t=1 120.0 150.0
2 t=2 160.0 125.0
3 t=3 80.0 100.0

It seems you can simply scale each price column by 100 divided by its first value:
p=100/df.iloc[0,1:]
df.iloc[:,1:]*=p
df
Out[1413]:
datetime StockA StockB
0 t=0 100 100
1 t=1 120 150
2 t=2 160 125
3 t=3 80 100
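For the case raised in edit2, where datetime is already the index, a minimal sketch is to divide every row by the first row and multiply by 100. The helper name `rebase` and its `base` parameter are made up for illustration; the sample frame reproduces the data from the question. This assumes the first row contains no NaN or zero prices.

```python
import pandas as pd

# Sample data from the question, with datetime as the index.
df = pd.DataFrame(
    {"Stock A": [5, 6, 8, 4], "Stock B": [20, 30, 25, 20]},
    index=pd.Index(["t=0", "t=1", "t=2", "t=3"], name="datetime"),
)

def rebase(prices, base=100):
    """Hypothetical helper: Index(i, t) = base * P(i, t) / P(i, 0)."""
    # prices.iloc[0] is a Series keyed by column name, so div() aligns
    # it with the columns and divides each column by its first value.
    return prices.div(prices.iloc[0]).mul(base)

indexed = rebase(df)
print(indexed)
```

Because the division aligns on column labels, the index is left untouched and no reset_index / set_index round trip is needed.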

I asked an earlier question on changing a dollar value to float and divide it, and it's worked, but it doesn't change the value in the data frame

Here was the original question:
With only being able to import numpy and pandas, I need to do the following: scale medianIncome so the values are expressed in units of $10,000 (example: 150000 will become 15, 30000 will become 3, 15000 will become 1.5, etc.)
Here's the code that works:
temp_housing['medianIncome'].replace('[(]', '-', regex=True).astype(float) / 10000
But when I call the df afterwards, it still shows the original amount instead of the 15 or 1.5. I'm not sure what I'm missing.
id medianHouseValue housingMedianAge totalBedrooms totalRooms households population medianIncome
0 23 113.903 31.0 543.0 2438.0 481.0 1016.0 17250.0
1 24 99.701 56.0 337.0 1692.0 328.0 856.0 21806.0
2 26 107.500 41.0 123.0 535.0 121.0 317.0 24038.0
3 27 93.803 53.0 244.0 1132.0 241.0 607.0 24597.0
4 28 105.504 52.0 423.0 1899.0 400.0 1104.0 18080.0
The result is:
id medianIncome
0 1.7250
1 2.1806
2 2.4038
3 2.4597
4 1.8080
Name: medianIncome, Length: 20640, dtype: float64
But then when I call the df with housing_cal, it's back to:
id medianHouseValue housingMedianAge totalBedrooms totalRooms households population medianIncome
0 23 113.903 31.0 543.0 2438.0 481.0 1016.0 17250.0
1 24 99.701 56.0 337.0 1692.0 328.0 856.0 21806.0
2 26 107.500 41.0 123.0 535.0 121.0 317.0 24038.0
3 27 93.803 53.0 244.0 1132.0 241.0 607.0 24597.0
4 28 105.504 52.0 423.0 1899.0 400.0 1104.0 18080.0
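A likely explanation, sketched here as an assumption since the thread has no answer: the expression computes a new Series but never stores it, so the data frame itself is unchanged. Assigning the result back to the column should make the change stick. The frame name `housing` and the three sample rows below are stand-ins taken from the printed table, not the real data set.

```python
import pandas as pd

# Stand-in for the real frame, using three rows from the printed table.
housing = pd.DataFrame(
    {"id": [23, 24, 26], "medianIncome": [17250.0, 21806.0, 24038.0]}
)

# The division returns a new Series; it has to be assigned back
# to the column, otherwise the original values stay in place.
housing["medianIncome"] = housing["medianIncome"].astype(float) / 10000

print(housing)
```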

Pandas cumsum only if positive else zero

I am making a table where I want to show that if there's no income, no expense can happen.
It's a cumulative sum table.
This is what I have:
Incoming Outgoing Total
0 150 -150
10 20 -160
100 30 -90
50 70 -110
Required output
Incoming Outgoing Total
0 150 0
10 20 0
100 30 70
50 70 50
I've tried
df.clip(lower=0)
and
df['new_column'].apply(lambda x : df['outgoing']-df['incoming'] if df['incoming']>df['outgoing'])
That doesn't work either.
Is there any other way?
Update:
A more straightforward approach inspired by your code using clip and without numpy:
diff = df['Incoming'].sub(df['Outgoing'])
df['Total'] = diff.mul(diff.ge(0).cumsum().clip(0, 1)).cumsum()
print(df)
# Output:
Incoming Outgoing Total
0 0 150 0
1 10 20 0
2 100 30 70
3 50 70 50
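To make the masking step easier to follow, here is the same idea split into intermediate steps. This is a sketch that swaps the `clip(0, 1)` multiplier for an equivalent boolean mask passed to `where`; the names `diff` and `started` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Incoming": [0, 10, 100, 50],
                   "Outgoing": [150, 20, 30, 70]})

# Net flow per row: -150, -10, 70, -20
diff = df["Incoming"] - df["Outgoing"]

# True from the first row whose net flow is non-negative onwards:
# False, False, True, True
started = diff.ge(0).cumsum().gt(0)

# Zero out everything before that row, then accumulate.
df["Total"] = diff.where(started, 0).cumsum()

print(df)
```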
Old answer:
Find the row where the balance is positive for the first time then compute the cumulative sum from this point:
start = np.where(df['Incoming'] - df['Outgoing'] >= 0)[0][0]
df['Total'] = df.iloc[start:]['Incoming'].sub(df.iloc[start:]['Outgoing']) \
.cumsum().reindex(df.index, fill_value=0)
Output:
>>> df
Incoming Outgoing Total
0 0 150 0
1 10 20 0
2 100 30 70
3 50 70 50
IIUC, you can check when Incoming is greater than Outgoing using np.where and assign a helper column. Then you can check when this new column is not null, using notnull(), calculate the difference, and use cumsum() on the result:
df['t'] = np.where(df['Incoming'].ge(df['Outgoing']),0,np.nan)
df['t'] = df['t'].ffill()
df['Total'] = np.where(df['t'].notnull(),(df['Incoming'].sub(df['Outgoing'])),df['t'])
df['Total'] = df['Total'].cumsum()
df.drop('t',axis=1,inplace=True)
This will give back:
Incoming Outgoing Total
0 0 150 NaN
1 10 20 NaN
2 100 30 70.0
3 50 70 50.0

Vectorize for loop and return x day high and low

Overview
For each row of a dataframe I want to calculate the x day high and low.
An x day high is higher than previous x days.
An x day low is lower than previous x days.
The for loop is explained in further detail in this post
Update:
The answer by @mozway below completes in around 20 seconds on a dataset containing 18k rows. Can this be improved with numpy, broadcasting, etc.?
Example
2020-03-20 has an x_day_low value of 1 as it is lower than the previous day.
2020-03-27 has an x_day_high value of 8 as it is higher than the previous 8 days.
See the desired output and test code below, which is calculated with a for loop in the findHighLow function. How would I vectorize findHighLow, given that the actual dataframe is considerably larger?
Test data
import numpy as np
import pandas as pd

def genMockDataFrame(days,startPrice,colName,startDate,seed=None):
    periods = days*24
    np.random.seed(seed)
    steps = np.random.normal(loc=0, scale=0.0018, size=periods)
    steps[0]=0
    P = startPrice+np.cumsum(steps)
    P = [round(i,4) for i in P]
    fxDF = pd.DataFrame({
        'ticker':np.repeat( [colName], periods ),
        'date':np.tile( pd.date_range(startDate, periods=periods, freq='H'), 1 ),
        'price':(P)})
    fxDF.index = pd.to_datetime(fxDF.date)
    fxDF = fxDF.price.resample('D').ohlc()
    fxDF.columns = [i.title() for i in fxDF.columns]
    return fxDF

#rows set to 15 for minimal example but actual dataframe contains around 18000 rows.
number_of_rows = 15
df = genMockDataFrame(number_of_rows,1.1904,'tttmmm','19/3/2020',seed=157)

def findHighLow (df):
    df['x_day_high'] = 0
    df['x_day_low'] = 0
    # walk backwards from each row while it still beats the earlier row
    for n in reversed(range(len(df['High']))):
        for i in reversed(range(n)):
            if df['High'][n] > df['High'][i]:
                df['x_day_high'][n] = n - i
            else: break
    for n in reversed(range(len(df['Low']))):
        for i in reversed(range(n)):
            if df['Low'][n] < df['Low'][i]:
                df['x_day_low'][n] = n - i
            else: break
    return df

df = findHighLow (df)
Desired output should match this:
df[["High","Low","x_day_high","x_day_low"]]
High Low x_day_high x_day_low
date
2020-03-19 1.1937 1.1832 0 0
2020-03-20 1.1879 1.1769 0 1
2020-03-21 1.1767 1.1662 0 2
2020-03-22 1.1721 1.1611 0 3
2020-03-23 1.1819 1.1690 2 0
2020-03-24 1.1928 1.1807 4 0
2020-03-25 1.1939 1.1864 6 0
2020-03-26 1.2141 1.1964 7 0
2020-03-27 1.2144 1.2039 8 0
2020-03-28 1.2099 1.2018 0 1
2020-03-29 1.2033 1.1853 0 4
2020-03-30 1.1887 1.1806 0 6
2020-03-31 1.1972 1.1873 1 0
2020-04-01 1.1997 1.1914 2 0
2020-04-02 1.1924 1.1781 0 9
Here are two solutions. Both produce the desired output, as posted in the question.
The first solution uses Numba and completes in 0.5 seconds on my machine for 20k rows. If you can use Numba, this is the way to go. The second solution uses only Pandas/Numpy and completes in 1.5 seconds for 20k rows.
Numba
import numba

@numba.njit
def count_smaller(arr):
    current = arr[-1]
    count = 0
    # walk backwards from the second-to-last element
    for i in range(arr.shape[0]-2, -1, -1):
        if arr[i] > current:
            break
        count += 1
    return count

@numba.njit
def count_greater(arr):
    current = arr[-1]
    count = 0
    for i in range(arr.shape[0]-2, -1, -1):
        if arr[i] < current:
            break
        count += 1
    return count

df["x_day_high"] = df.High.expanding().apply(count_smaller, engine='numba', raw=True)
df["x_day_low"] = df.Low.expanding().apply(count_greater, engine='numba', raw=True)
Pandas/Numpy
def count_consecutive_true(bool_arr):
    return bool_arr[::-1].cumprod().sum()

def count_smaller(arr):
    return count_consecutive_true(arr <= arr[-1]) - 1

def count_greater(arr):
    return count_consecutive_true(arr >= arr[-1]) - 1

df["x_day_high"] = df.High.expanding().apply(count_smaller, raw=True)
df["x_day_low"] = df.Low.expanding().apply(count_greater, raw=True)
This last solution is similar to mozway's. However it runs faster because it doesn't need to perform a join and uses numpy as much as possible. It also looks arbitrarily far back.
You can use rolling to get the last N days, a comparison + cumprod on the reversed boolean array to keep only the last consecutive valid values, and sum to count them. Apply on each column using agg and join the output after adding a prefix.
# number of days
N = 8
df.join(df.rolling(f'{N+1}d', min_periods=1)
.agg({'High': lambda s: s.le(s.iloc[-1])[::-1].cumprod().sum()-1,
'Low': lambda s: s.ge(s.iloc[-1])[::-1].cumprod().sum()-1,
})
.add_prefix(f'{N}_days_')
)
Output:
Open High Low Close 8_days_High 8_days_Low
date
2020-03-19 1.1904 1.1937 1.1832 1.1832 0.0 0.0
2020-03-20 1.1843 1.1879 1.1769 1.1772 0.0 1.0
2020-03-21 1.1755 1.1767 1.1662 1.1672 0.0 2.0
2020-03-22 1.1686 1.1721 1.1611 1.1721 0.0 3.0
2020-03-23 1.1732 1.1819 1.1690 1.1819 2.0 0.0
2020-03-24 1.1836 1.1928 1.1807 1.1922 4.0 0.0
2020-03-25 1.1939 1.1939 1.1864 1.1936 6.0 0.0
2020-03-26 1.1967 1.2141 1.1964 1.2114 7.0 0.0
2020-03-27 1.2118 1.2144 1.2039 1.2089 7.0 0.0
2020-03-28 1.2080 1.2099 1.2018 1.2041 0.0 1.0
2020-03-29 1.2033 1.2033 1.1853 1.1880 0.0 4.0
2020-03-30 1.1876 1.1887 1.1806 1.1879 0.0 6.0
2020-03-31 1.1921 1.1972 1.1873 1.1939 1.0 0.0
2020-04-01 1.1932 1.1997 1.1914 1.1914 2.0 0.0
2020-04-02 1.1902 1.1924 1.1781 1.1862 0.0 7.0
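Another option, not taken from any of the answers above, is a monotonic stack, which finds the nearest earlier value that is not beaten in a single pass over the data. The sketch below is plain Python over a NumPy array; the function name `streak_lengths` is made up for illustration. It follows the strict comparison of the original findHighLow loop, so ties end a streak. The High and Low values are copied from the desired-output table in the question.

```python
import numpy as np

def streak_lengths(values):
    """Count, per position, the consecutive previous values that are strictly smaller."""
    values = np.asarray(values, dtype=float)
    out = np.zeros(len(values), dtype=np.int64)
    stack = []  # indices whose values are non-increasing from bottom to top
    for n, v in enumerate(values):
        # Discard every earlier value that the current one beats.
        while stack and values[stack[-1]] < v:
            stack.pop()
        # What remains on top is the nearest earlier value >= v, if any.
        prev = stack[-1] if stack else -1
        out[n] = n - prev - 1
        stack.append(n)
    return out

high = [1.1937, 1.1879, 1.1767, 1.1721, 1.1819, 1.1928, 1.1939, 1.2141,
        1.2144, 1.2099, 1.2033, 1.1887, 1.1972, 1.1997, 1.1924]
low = [1.1832, 1.1769, 1.1662, 1.1611, 1.1690, 1.1807, 1.1864, 1.1964,
       1.2039, 1.2018, 1.1853, 1.1806, 1.1873, 1.1914, 1.1781]

x_day_high = streak_lengths(high)
# Negating the lows turns "lower than previous days" into the same problem.
x_day_low = streak_lengths(-np.asarray(low))
```

Each index is pushed and popped at most once, so the work grows linearly with the number of rows. With a data frame, the results could be assigned with something like `df["x_day_high"] = streak_lengths(df["High"].to_numpy())`.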

Pandas Dataframe how to iterate over rows and perform calculations on their values

I've started to work with Pandas Dataframe and try to figure out how to deal with the below task.
I have an excel spreadsheet that needs to be imported to Pandas DataFrame and the below calculations need to be done to populate PercentageOnSale , Bonus and EmployeesIncome columns.
If the sum of all SalesValues for the EmployeeID is less than 5000 the PercentageOnSale should be 5% of SalesValue.
If the sum of all SalesValues for the EmployeeID is equal or more than 5000 the PercentageOnSale should be 7% of SalesValue.
If the sum of all SalesValues for the EmployeeID is more than 10,000 the PercentageOnSale should be 7% of SalesValue and additionally a Bonus of 3% should be calculated.
EmployeesIncome is the sum of PercentageOnSale and Bonus columns.
You could try groupby-apply as follows:
# Data
df = pd.DataFrame({"EmployeeID":[1,1,2,3,1,3,5,1],
"ProductSold":["P1","P2","P3","P1","P2","P3","P1","P2"],
"SalesValue":[3000,3500,4000,3000,5000,3000,3000,4000]})
# Calculations
def calculate(x):
    x = x.copy()
    total = x['SalesValue'].sum()
    # Calculate Bonus (only when the employee's total exceeds 10,000)
    x['Bonus'] = 0
    if total > 10000:
        x['Bonus'] = 0.03*x['SalesValue']
    # Calculate PercentageOnSale (5% below a total of 5,000, otherwise 7%)
    if total < 5000:
        x['PercentageOnSale'] = 0.05*x['SalesValue']
    else:
        x['PercentageOnSale'] = 0.07*x['SalesValue']
    # Total Income per sale
    x['EmployeesIncome'] = x['PercentageOnSale'] + x['Bonus']
    return x
# group_keys=False keeps the original row index in the result
df_final = df.groupby('EmployeeID', group_keys=False).apply(calculate)
The output is as follows:
   EmployeeID ProductSold SalesValue Bonus PercentageOnSale EmployeesIncome
0 1 P1 3000 90.0 210.0 300.0
1 1 P2 3500 105.0 245.0 350.0
2 2 P3 4000 0.0 200.0 200.0
3 3 P1 3000 0.0 210.0 210.0
4 1 P2 5000 150.0 350.0 500.0
5 3 P3 3000 0.0 210.0 210.0
6 5 P1 3000 0.0 150.0 150.0
7 1 P2 4000 120.0 280.0 400.0
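The same rules can also be written without a per-group Python function. The sketch below swaps groupby-apply for `groupby(...).transform("sum")` plus `np.where`, using the thresholds stated in the question (5% below a total of 5,000, 7% from 5,000, and a 3% bonus above 10,000). The intermediate names `total` and `rate` are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"EmployeeID": [1, 1, 2, 3, 1, 3, 5, 1],
                   "ProductSold": ["P1", "P2", "P3", "P1", "P2", "P3", "P1", "P2"],
                   "SalesValue": [3000, 3500, 4000, 3000, 5000, 3000, 3000, 4000]})

# Total sales of each row's employee, broadcast back onto every row.
total = df.groupby("EmployeeID")["SalesValue"].transform("sum")

# 7% when the employee's total is at least 5000, otherwise 5%.
rate = np.where(total >= 5000, 0.07, 0.05)
df["PercentageOnSale"] = rate * df["SalesValue"]

# 3% bonus only when the employee's total exceeds 10000.
df["Bonus"] = np.where(total > 10000, 0.03 * df["SalesValue"], 0.0)

df["EmployeesIncome"] = df["PercentageOnSale"] + df["Bonus"]
print(df)
```

Because `transform` returns a Series aligned with the original rows, the row order and index stay as they were.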

Sorting Pandas data frame with groupby and conditions

I'm trying to sort a data frame based on groups meeting conditions.
I'm getting a syntax error for the way I'm sorting the groups.
And I'm losing the initial order of the data frame before attempting the above.
This is the order of sorting that I'm trying to achieve:
1) Sort on First and Test columns.
2) Test==1 groups, sort on Secondary then by Final column.
---Test==0 groups, sort on Final column only.
import pandas as pd
df=pd.DataFrame({"First":[100,100,100,100,100,100,200,200,200,200,200],"Test":[1,1,1,0,0,0,0,1,1,1,0],"Secondary":[.1,.1,.1,.2,.2,.3,.3,.3,.3,.4,.4],"Final":[1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.10,11.11]})
def sorter(x):
    if x["Test"]==1:
        x.sort_values(['Secondary','Final'], inplace=True)
    else:
        x=x.sort_values('Final', inplace=True)
df=df.sort_values(["First","Test"],ascending=[False, False]).reset_index(drop=True)
df.groupby(['First','Test']).apply(lambda x: sorter(x))
df
Expected result:
First Test Secondary Final
200 1 0.4 10.1
200 1 0.3* 9.9*
200 1 0.3* 8.8*
200 0 0.4 11.11*
200 0 0.3 7.7*
100 1 0.5 2.2
100 1 0.1* 3.3*
100 1 0.1* 1.1*
100 0 0.3 6.6*
100 0 0.2 5.5*
100 0 0.2 4.4*
You can try sorting each group in descending order.
With respect to the sequence you gave, the order of sorting will change. Will that work for you?
df=pd.DataFrame({"First":[100,100,100,100,100,100,200,200,200,200,200],"Test":[1,1,1,0,0,0,0,1,1,1,0],"Secondary":[.1,.5,.1,.9,.4,.1,.3,.3,.3,.4,.4],"Final":[1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.10,11.11]})
df = df.groupby(['First','Test']).apply(lambda x: x.sort_values(['First','Test','Secondary','Final'],ascending=False) if x.iloc[0]['Test']==1 else x.sort_values(['First','Test','Final'],ascending=False)).reset_index(drop=True)
df.sort_values(['First','Test'],ascending=[True,False])
Out:
Final First Secondary Test
3 2.20 100 0.5 1
4 3.30 100 0.1 1
5 1.10 100 0.1 1
0 6.60 100 0.1 0
1 5.50 100 0.4 0
2 4.40 100 0.9 0
8 10.10 200 0.4 1
9 9.90 200 0.3 1
10 8.80 200 0.3 1
6 11.11 200 0.4 0
7 7.70 200 0.3 0
The trick was to sort subsets separately and replace the values in the original df.
This came up in other solutions to pandas sorting problems.
import pandas as pd
df=pd.DataFrame({"First":[100,100,100,100,100,100,200,200,200,200,200],"Test":[1,1,1,0,0,0,0,1,1,1,0],"Secondary":[.1,.5,.1,.9,.4,.1,.3,.3,.3,.4,.4],"Final":[1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.10,11.11]})
df.sort_values(['First','Test','Secondary','Final'],ascending=False, inplace=True)
index_subset=df[df["Test"]==0].index
sorted_subset=df[df["Test"]==0].sort_values(['First','Final'],ascending=False)
df.loc[index_subset,:]=sorted_subset.values
print(df)
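A variation on the same idea, sketched here as an alternative rather than as the method used above: build a helper sort key in which Secondary is replaced by a constant for the Test==0 rows, so a single sort_values call can apply both rules. The column name `_secondary_key` is made up for illustration and is dropped again after sorting.

```python
import pandas as pd

df = pd.DataFrame({
    "First": [100, 100, 100, 100, 100, 100, 200, 200, 200, 200, 200],
    "Test": [1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0],
    "Secondary": [.1, .5, .1, .9, .4, .1, .3, .3, .3, .4, .4],
    "Final": [1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.10, 11.11],
})

# Keep Secondary only where Test == 1; elsewhere use a constant so that
# Secondary has no influence on the order of the Test == 0 rows.
key = df["Secondary"].where(df["Test"].eq(1), 0)

result = (
    df.assign(_secondary_key=key)
      .sort_values(["First", "Test", "_secondary_key", "Final"], ascending=False)
      .drop(columns="_secondary_key")
)
print(result)
```

This leaves the original df untouched and keeps each row's original index label, which makes it easy to check which rows moved.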