pandas MultiIndex assign multiple columns - pandas

I have created a dataframe with a MultiIndex like below:
import numpy as np
import pandas as pd
column_index= [np.array(['OPEN','OPEN','CLOSE','CLOSE']),np.array(['IBM','AAPL','IBM','AAPL'])]
df = pd.DataFrame(np.transpose(np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])),index=['20190101','20190102','20190103'],columns=column_index)
The result is like this:
         OPEN      CLOSE
          IBM AAPL   IBM AAPL
20190101    1    4     7   10
20190102    2    5     8   11
20190103    3    6     9   12
Now I'd like to create a new set of columns by doing something like:
df['RTN'] = df.CLOSE / df.OPEN
To get:
         OPEN      CLOSE       RTN
          IBM AAPL   IBM AAPL  IBM AAPL
20190101    1    4     7   10  7.0  2.5
20190102    2    5     8   11  4.0  2.2
20190103    3    6     9   12  3.0  2.0
That does not work. The nicest way I've been able to do this is like so:
rtn = df.CLOSE / df.OPEN
rtn = pd.concat([rtn],keys=['RTN'],axis=1)
df = pd.concat([df,rtn],axis=1)
Is there a way to do this as an assignment without the other steps?

One way is to rename the columns prior to the operations. Then it's a simple concat:
u = df.loc[:, ['CLOSE']].rename(columns={'CLOSE': 'RTN'}, level=0).divide(
df.loc[:, ['OPEN']].rename(columns={'OPEN': 'RTN'}, level=0))
# [] DataFrame selection keeps MultiIndex
pd.concat([df, u], axis=1)
Alternatively, you can stack + eval + unstack. It's concise, but perhaps not super performant for large datasets.
df.stack().eval('RTN = CLOSE/OPEN').unstack()
#df.stack().assign(RTN = lambda x: x.CLOSE/x.OPEN).unstack()
Without eval:
df.stack().assign(RTN = lambda x: x.CLOSE/x.OPEN).unstack()
#or
df = df.stack()
df['RTN'] = df.CLOSE/df.OPEN
df = df.unstack()
Output in all cases:
         OPEN      CLOSE       RTN
          IBM AAPL   IBM AAPL  IBM AAPL
20190101    1    4     7   10  7.0  2.5
20190102    2    5     8   11  4.0  2.2
20190103    3    6     9   12  3.0  2.0
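If the goal is a single expression without the intermediate reassignments, one possibility is to wrap the ratio in a dict so that pd.concat supplies the 'RTN' level itself. This is a minimal sketch on the question's data; the names rtn and out are illustrative only.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [["OPEN", "OPEN", "CLOSE", "CLOSE"], ["IBM", "AAPL", "IBM", "AAPL"]]
)
df = pd.DataFrame(
    [[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]],
    index=["20190101", "20190102", "20190103"],
    columns=columns,
)

# df["CLOSE"] and df["OPEN"] are plain DataFrames keyed by ticker,
# so the division aligns on IBM / AAPL.
rtn = df["CLOSE"] / df["OPEN"]

# The dict key becomes the new top column level.
out = pd.concat([df, pd.concat({"RTN": rtn}, axis=1)], axis=1)
print(out)
```

This still calls concat twice, but it fits on one line and leaves the original column order untouched, which the stack/unstack variants may not.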

Related

How to add Multilevel Columns and create new column?

I am trying to create a "total" column in my dataframe
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
My dataframe
  Room 1     Room 2     Room 3
      on off     on off     on off
0      1   4      3   6      5  15
1      3   2      1   5      1   7
For each room, I want to create a total column and then a on% column.
I have tried the following, however, it does not work.
df.loc[:, slice(None), "total" ] = df.xs('on', axis=1,level=1) + df.xs('off', axis=1,level=1)
Let us try something fancy ~
df.stack(0).eval('total=on + off \n on_pct=on / total').stack().unstack([1, 2])
  Room 1                     Room 2                        Room 3
     off   on total on_pct     off   on total    on_pct     off   on total on_pct
0    4.0  1.0   5.0    0.2     6.0  3.0   9.0  0.333333    15.0  5.0  20.0  0.250
1    2.0  3.0   5.0    0.6     5.0  1.0   6.0  0.166667     7.0  1.0   8.0  0.125
Oof, this was a rough one, but you can do it like this if you want to avoid explicit loops. Note that it reassigns df twice, because the percentage step needs the total columns created by the first step. It's the best I could do; if you have any questions, just comment.
df = pd.concat([y.assign(**{'Total {0}'.format(x+1): y.iloc[:,0] + y.iloc[:,1]})for x , y in df.groupby(np.arange(df.shape[1])//2,axis=1)],axis=1)
df = pd.concat([y.assign(**{'Percentage_Total{0}'.format(x+1): (y.iloc[:,0] / y.iloc[:,2])*100})for x , y in df.groupby(np.arange(df.shape[1])//3,axis=1)],axis=1)
print(df)
This groups by the column's first index (rooms) and then loops through each group to add the total and percent on. The final step is to reindex using the unique rooms:
import pandas as pd
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
# Note: groupby(..., axis=1) is deprecated in recent pandas releases (2.1+).
for room, group in df.groupby(level=0, axis=1):
    df[(room, 'total')] = group.sum(axis=1)
    df[(room, 'pct_on')] = group[(room, 'on')] / df[(room, 'total')]
result = df.reindex(columns=df.columns.get_level_values(0).unique(), level=0)
Output:
  Room 1                  Room 2                    Room 3
      on off total pct_on     on off total    pct_on     on off total pct_on
0      1   4     5    0.2      3   6     9  0.333333      5  15    20  0.250
1      3   2     5    0.6      1   5     6  0.166667      1   7     8  0.125
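For comparison, the same result can probably be reached without looping over groups, by taking cross-sections of the second column level. The sketch below is one way to do it under the question's layout (every room has exactly an 'on' and an 'off' column); the names extra and order are illustrative.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["Room 1", "Room 2", "Room 3"], ["on", "off"]])
df = pd.DataFrame([[1, 4, 3, 6, 5, 15], [3, 2, 1, 5, 1, 7]], columns=idx)

# One column per room for each state.
on = df.xs("on", axis=1, level=1)
off = df.xs("off", axis=1, level=1)
total = on + off
pct_on = on / total

# Dict keys become the outer level, so swap levels to get (room, field).
extra = pd.concat({"total": total, "pct_on": pct_on}, axis=1).swaplevel(axis=1)

# Spell out the column order explicitly instead of relying on sorting.
rooms = df.columns.get_level_values(0).unique()
order = [(room, field) for room in rooms for field in ["on", "off", "total", "pct_on"]]
result = pd.concat([df, extra], axis=1)[order]
print(result)
```

Listing the column order by hand is more verbose than reindexing, but it makes the final layout explicit.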

Quickly replace values in a Pandas DataFrame

I have the following dataframe:
df = pd.DataFrame(
{
'A':[1,2],
'B':[3,4]
}, index=['1','2'])
df.loc[:,'Sum'] = df.sum(axis=1)
df.loc['Sum'] = df.sum(axis=0)
print(df)
# A B Sum
# 1 1 3 4
# 2 2 4 6
# Sum 3 7 10
I want to:
replace 1 by 3*4/10
replace 2 by 3*6/10
replace 3 by 4*7/10
replace 4 by 7*6/10
What is the easiest way to do this? I want the solution to be able to extend to n number of rows and columns. Been cracking my head over this. TIA!
If I understood you correctly:
df = pd.DataFrame(
{
'A':[1,2],
'B':[3,4]
}, index=['1','2'])
df.loc[:,'Sum'] = df.sum(axis=1)
df.loc['Sum'] = df.sum(axis=0)
print(df)
conditions = [(df==1), (df==2), (df==3), (df==4)]
values = [(3*4)/10, (3*6)/10, (4*7)/10, (7*6)/10]
df[df.columns] = np.select(conditions, values, df)
Output:
A B Sum
1 1.2 2.8 4.2
2 1.8 4.2 6.0
Sum 2.8 7.0 10.0
Let us try creating it from the original df, before you add the Sum row and column:
import numpy as np
v = np.multiply.outer(df.sum(1).values,df.sum().values)/df.sum().sum()
out = pd.DataFrame(v,index=df.index,columns=df.columns)
out
Out[20]:
A B
1 1.2 2.8
2 1.8 4.2
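To extend this to n rows and columns and still end up with the Sum margins, the outer-product step can be wrapped in a small helper. This is only a sketch: expected_table is a hypothetical name, and it assumes the input frame holds the raw values without any Sum row or column. Unlike matching on cell values, it does not break when two cells happen to hold the same number.

```python
import numpy as np
import pandas as pd

def expected_table(counts: pd.DataFrame) -> pd.DataFrame:
    """Replace each cell by row_sum * col_sum / grand_total, then add margins."""
    row_sums = counts.sum(axis=1).to_numpy()
    col_sums = counts.sum(axis=0).to_numpy()
    values = np.outer(row_sums, col_sums) / col_sums.sum()
    out = pd.DataFrame(values, index=counts.index, columns=counts.columns)
    # Margins are recomputed from the new cells.
    out["Sum"] = out.sum(axis=1)
    out.loc["Sum"] = out.sum(axis=0)
    return out

raw = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["1", "2"])
result = expected_table(raw)
print(result)
```

The recomputed margins equal the original row and column sums, because the outer product preserves them.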

Finding greatest fall and rise in a dynamic rolling window based on index

Have a df of readings as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1000, size=100), index=range(100), columns = ['reading'])
Want to find the greatest rise and the greatest fall for each row based on its index, which theoretically may be achieved using the formula...
How can this be coded?
Tried:
df.assign(gr8Rise=df.rolling(df.index).apply(lambda x: x[-1]-x[0], raw=True).max())
...and failed with ValueError: window must be an integer
UPDATE: Based on @jezrael's dataset, the output for gr8Rise is expected as follows:
Use:
np.random.seed(2019)
df = pd.DataFrame(np.random.randint(100, size=10), index=range(10), columns = ['reading'])
df['gr8Rise'] = [df['reading'].rolling(x).apply(lambda x: x[0]-x[-1], raw=True).max()
for x in range(1, len(df)+1)]
df.loc[0, 'gr8Rise']= np.nan
print (df)
reading gr8Rise
0 72 NaN
1 31 41.0
2 37 64.0
3 88 59.0
4 62 73.0
5 24 76.0
6 29 72.0
7 15 57.0
8 12 60.0
9 16 56.0
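The rolling apply above takes the first value minus the last value of each window of size x. The same number can be obtained with a shift, since the first element of a window of size w sits w - 1 rows above its last element. The sketch below hardcodes the readings printed above rather than relying on the random seed; the name gr8 is illustrative.

```python
import numpy as np
import pandas as pd

# Readings copied from the printed output above.
reading = pd.Series([72, 31, 37, 88, 62, 24, 29, 15, 12, 16], name="reading")

# For window size w, (first - last) of every window is shift(w - 1) - reading.
# Row 0 is left as NaN, as in the answer.
gr8 = [np.nan] + [
    (reading.shift(w - 1) - reading).max() for w in range(2, len(reading) + 1)
]

df = pd.DataFrame({"reading": reading, "gr8Rise": gr8})
print(df)
```

This still does one pass per window size, so it is quadratic in the number of rows, but it avoids calling a Python lambda for every window.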

Pandas Creating Normal Dist series

I'm trying to convert an excel "normal distribution" formula into python.
(1-NORM.DIST(a+col,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))
For example: Here's my given df
Id a b c
ijk 4 3.5 12.53
xyz 12 3 10.74
My goal:
Id a b c 0 1 2 3
ijk 4 3.5 12.53 1 .93 .87 .81
xyz 12 3 10.74 1 .87 .76 .66
Here's the math behind it:
column 0: always 1
column 1: (1-NORM.DIST(a+1,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))
column 2: (1-NORM.DIST(a+2,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))
column 3: (1-NORM.DIST(a+3,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))
This is what I have so far:
df1 = pd.DataFrame(df, columns=np.arange(0,4))
result = pd.concat([df, df1], axis=1, join_axes=[df.index])
result[0] = 1
I'm not sure what to do after this.
This is how I use the normal distribution function:
https://support.office.com/en-us/article/normdist-function-126db625-c53e-4591-9a22-c9ff422d6d58
Many many thanks!
NORM.DIST(..., TRUE) means the cumulative distribution function and 1 - NORM.DIST(..., TRUE) means the survival function. These are available under scipy's stats module (see ss.norm). For example,
import scipy.stats as ss
ss.norm.cdf(4, 3.5, 12.53)
Out:
0.51591526057026538
For your case, you can first define a function:
def normalize(a, b, c, col):
    return ss.norm.sf(a+col, b, c) / ss.norm.sf(a, b, c)
and call that function with apply:
for col in range(4):
    df[col] = df.apply(lambda x: normalize(x.a, x.b, x.c, col), axis=1)
df
Out:
Id a b c 0 1 2 3
0 ijk 4 3.5 12.53 1.0 0.934455 0.869533 0.805636
1 xyz 12 3.0 10.74 1.0 0.875050 0.760469 0.656303
This is not the most efficient approach, as it recalculates the survival function for the same values and involves two loops. One level of looping can be removed by passing an array of values to ss.norm.sf:
import numpy as np
out = df.apply(
    lambda x: pd.Series(
        ss.norm.sf(x.a + np.arange(4), x.b, x.c) / ss.norm.sf(x.a, x.b, x.c)
    ), axis=1
)
Out:
0 1 2 3
0 1.0 0.934455 0.869533 0.805636
1 1.0 0.875050 0.760469 0.656303
And you can use join to add this to your original DataFrame:
df.join(out)
Out:
Id a b c 0 1 2 3
0 ijk 4 3.5 12.53 1.0 0.934455 0.869533 0.805636
1 xyz 12 3.0 10.74 1.0 0.875050 0.760469 0.656303
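The remaining apply can likely be dropped as well, because ss.norm.sf broadcasts over NumPy arrays. The following is a sketch under that assumption, rebuilding the question's two-row frame; the names offsets and ratios are illustrative.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

df = pd.DataFrame(
    {"Id": ["ijk", "xyz"], "a": [4, 12], "b": [3.5, 3.0], "c": [12.53, 10.74]}
)

offsets = np.arange(4)  # the 0..3 column offsets added to a

# Column vectors of shape (n_rows, 1) broadcast against offsets of shape (4,).
a = df["a"].to_numpy()[:, None]
b = df["b"].to_numpy()[:, None]
c = df["c"].to_numpy()[:, None]

ratios = ss.norm.sf(a + offsets, b, c) / ss.norm.sf(a, b, c)
result = df.join(pd.DataFrame(ratios, index=df.index, columns=offsets))
print(result)
```

Here the survival function is evaluated once for the whole grid instead of once per row.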

pandas dataframe transformation partial sums

I have a pandas dataframe
index A
1 3.4
2 4.5
3 5.3
4 2.1
5 4.0
6 5.3
...
95 3.4
96 1.2
97 8.9
98 3.4
99 2.7
100 7.6
from this I would like to create a dataframe B
1-5 sum(1-5)
6-10 sum(6-10)
...
96-100 sum(96-100)
Any ideas how to do this elegantly rather than brute-force?
Cheers, Mike
This will give you a series with the partial sums:
df['bin'] = (df.index - 1) // 5  # integer division; assumes the index runs 1, 2, ..., 100
bin_sums = df.groupby('bin')['A'].sum()
Then, if you want to rename the index:
bin_sums.index = ['%s-%s' % (5*i + 1, 5*(i+1)) for i in bin_sums.index]
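As a quick check of the binning idea, here is a self-contained sketch on made-up data (A holds the values 1 to 100, which are not from the question) with a 1-based index. It uses floor division so that rows 1-5 fall in bin 0, rows 6-10 in bin 1, and so on.

```python
import numpy as np
import pandas as pd

# Made-up data: A = 1.0 .. 100.0 on a 1-based index, as in the question's layout.
df = pd.DataFrame({"A": np.arange(1, 101, dtype=float)}, index=range(1, 101))

# Floor division on the shifted index gives bin numbers 0..19.
bins = (df.index.to_numpy() - 1) // 5
bin_sums = df.groupby(bins)["A"].sum()

# Label each bin with the index range it covers, e.g. "1-5".
bin_sums.index = [f"{5 * i + 1}-{5 * (i + 1)}" for i in bin_sums.index]
print(bin_sums.head())
```

Grouping by an array avoids adding a helper column to df; with a different bin width, replace the two 5s consistently.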