Mul() Broadcast levels from Multi Index - pandas

Attempting to use a multiply operation with a multi index.
import pandas as pd
import numpy as np
d = {'Alpha': [1,2,3,4,5,6,7,8,9],
     'Beta': tuple('ABCDEFGHI'),
     'C': np.random.randint(1,10,9),
     'D': np.random.randint(100,200,9)}
df = pd.DataFrame(d)
df.set_index(['Alpha','Beta'], inplace=True)
df = df.stack()  # it's now a Series
df.index.names = df.index.names[:-1] + ['Gamma']
ser = pd.Series(data=np.random.rand(9))
ser.index = pd.MultiIndex.from_tuples(list(zip(range(1,10), np.repeat('C',9))))
ser.index.names = ['Alpha','Gamma']
print(df)
print(ser)
foo = df.mul(ser, axis=0, level=['Alpha','Gamma'])
So my dataframe, which became a series, looks like:
Alpha Beta Gamma
1 A C 7
D 188
2 B C 7
D 110
3 C C 2
D 124
4 D C 4
D 153
5 E C 9
D 178
6 F C 6
D 196
7 G C 1
D 156
8 H C 1
D 184
9 I C 3
D 169
And my series looks like
Alpha Gamma
1 C 0.8731
2 C 0.6347
3 C 0.4688
4 C 0.5623
5 C 0.4944
6 C 0.5234
7 C 0.9946
8 C 0.7815
9 C 0.1219
In my multiply operation I want to broadcast on index levels 'Alpha' and 'Gamma',
but I get this error message:
TypeError: Join on level between two MultiIndex objects is ambiguous
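Since a level= join on two MultiIndex levels raises this error, one workaround (a sketch on toy data, not from the answers below) is to align ser onto df manually: build an (Alpha, Gamma) key from df's index, reindex ser against it, and multiply the values positionally.

```python
import numpy as np
import pandas as pd

# toy stand-ins for the question's df (3-level index) and ser (2-level index);
# the small values here are illustrative, not the question's random data
idx = pd.MultiIndex.from_tuples(
    [(1, 'A', 'C'), (1, 'A', 'D'), (2, 'B', 'C'), (2, 'B', 'D')],
    names=['Alpha', 'Beta', 'Gamma'])
df = pd.Series([10, 20, 30, 40], index=idx)
ser = pd.Series([0.5, 2.0],
                index=pd.MultiIndex.from_tuples([(1, 'C'), (2, 'C')],
                                                names=['Alpha', 'Gamma']))

# build an (Alpha, Gamma) key for every row of df and align ser onto it;
# keys missing from ser become NaN instead of raising the ambiguous-join error
key = pd.MultiIndex.from_arrays(
    [df.index.get_level_values('Alpha'), df.index.get_level_values('Gamma')],
    names=['Alpha', 'Gamma'])
result = pd.Series(df.values * ser.reindex(key).values, index=df.index)
```

Rows whose (Alpha, Gamma) pair does not exist in ser come back as NaN, which preserves df's full index.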

How about this? Perhaps it's the extra 'Beta' level, present in df but not in ser, that causes the problem.
(Note: this is using df as updated in #Dickster's answer, not as in the original question)
df2 = df.reset_index().set_index(['Alpha','Gamma'])
df2[0].mul(ser)
Alpha Gamma
1 C 2.503829
D NaN
2 C 5.028208
D NaN
3 C 0.842322
D NaN
4 C 0.198101
D NaN
5 C 0.800745
D NaN
6 C 1.936523
D NaN
7 C 2.507393
D NaN
8 C 4.846258
D NaN
9 C NaN
D 147.233378
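A variant of the same reset_index idea that also restores the 'Beta' level afterwards (a sketch on toy data, not from the thread): key on (Alpha, Gamma), multiply, then rebuild the full three-level index.

```python
import numpy as np
import pandas as pd

# toy stand-ins for the stacked df and for ser
idx = pd.MultiIndex.from_tuples(
    [(1, 'A', 'C'), (1, 'A', 'D'), (2, 'B', 'C'), (2, 'B', 'D')],
    names=['Alpha', 'Beta', 'Gamma'])
df = pd.Series([10, 20, 30, 40], index=idx)
ser = pd.Series([0.5, 2.0],
                index=pd.MultiIndex.from_tuples([(1, 'C'), (2, 'C')],
                                                names=['Alpha', 'Gamma']))

# move the index into columns and key on (Alpha, Gamma); for an unnamed
# Series, reset_index() stores the values in a column literally named 0
df2 = df.reset_index().set_index(['Alpha', 'Gamma'])
df2['prod'] = df2[0].mul(ser)        # aligned on the (Alpha, Gamma) index
out = df2.reset_index().set_index(['Alpha', 'Beta', 'Gamma'])['prod']
```

Unlike the two-level result above, `out` carries the original (Alpha, Beta, Gamma) index, with NaN where ser had no matching key.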

So imagine I have this, where I now have a 'D' in Gamma in the series "ser":
import pandas as pd
import numpy as np
np.random.seed(1)
d = {'Alpha': [1,2,3,4,5,6,7,8,9],
     'Beta': tuple('ABCDEFGHI'),
     'C': np.random.randint(1,10,9),
     'D': np.random.randint(100,200,9)}
df = pd.DataFrame(d)
df.set_index(['Alpha','Beta'], inplace=True)
df = df.stack()  # it's now a Series
df.index.names = df.index.names[:-1] + ['Gamma']
ser = pd.Series(data=np.random.rand(9))
idx = list(np.repeat('C',8))
idx.append('D')
ser.index = pd.MultiIndex.from_tuples(list(zip(range(1,10), idx)))
ser.index.names = ['Alpha','Gamma']
print(df)
print(ser)
df_A = df.unstack('Alpha').mul(ser).stack('Alpha').reorder_levels(df.index.names)
print(df_A)
df_dickster77 = df.unstack('Alpha').mul(ser.unstack('Alpha')).stack('Alpha').reorder_levels(df.index.names)
print(df_dickster77)
Output is this:
Alpha Beta Gamma
1 A C 6
D 120
2 B C 9
D 118
3 C C 6
D 184
4 D C 1
D 111
5 E C 1
D 128
6 F C 2
D 129
7 G C 8
D 114
8 H C 7
D 150
9 I C 3
D 168
dtype: int32
Alpha Gamma
1 C 0.417305
2 C 0.558690
3 C 0.140387
4 C 0.198101
5 C 0.800745
6 C 0.968262
7 C 0.313424
8 C 0.692323
9 D 0.876389
dtype: float64
output A: inadvertent multiplication
Gamma C D
Alpha Beta Gamma
1 A C 2.503829 NaN
D 50.076576 NaN
2 B C 5.028208 NaN
D 65.925400 NaN
3 C C 0.842322 NaN
D 25.831197 NaN
4 D C 0.198101 NaN
D 21.989265 NaN
5 E C 0.800745 NaN
D 102.495305 NaN
6 F C 1.936523 NaN
D 124.905743 NaN
7 G C 2.507393 NaN
D 35.730356 NaN
8 H C 4.846258 NaN
D 103.848392 NaN
9 I C NaN 2.629167
D NaN 147.233378
output df_dickster77: it's the correct multiplication, lining up on the C's and the D.
However, 8 D-row NaNs and 1 C-row NaN are lost:
Alpha Beta Gamma
1 A C 2.503829
2 B C 5.028208
3 C C 0.842322
4 D C 0.198101
5 E C 0.800745
6 F C 1.936523
7 G C 2.507393
8 H C 4.846258
9 I D 147.233378
dtype: float64

This is the way to do it at the moment. At some point a more concise method may be implemented.
In [21]: df.unstack('Alpha').mul(ser).stack('Alpha').reorder_levels(df.index.names)
Out[21]:
Gamma C
Alpha Beta Gamma
1 A C 6.761867
D 171.944612
2 B C 0.154139
D 6.371062
3 C C 2.311870
D 42.898041
4 D C 0.390920
D 9.479801
5 E C 3.484439
D 72.011743
6 F C 0.740913
D 50.382061
7 G C 3.459497
D 60.541203
8 H C 0.467012
D 19.030741
9 I C 0.071290
D 11.620286

Related

How can I append a Series to an existing DataFrame row with pandas?

I want to take a series and append it to an existing dataframe row. For example:
df
A B C
0 2 3 4
1 5 6 7
2 7 8 9
series
0 x
1 y
2 z
-->
A B C D E F
0 2 3 4 x y z
1 5 6 7 ...
2 7 8 9 ...
I want to do this using a for loop, appending a different series to each row of the dataframe. The series may have different lengths. Is there an easy way to accomplish this?
Use loc and the series' index as the column names:
lst = [
[2,3,4],
[5,6,7],
[7,8,9]
]
df = pd.DataFrame(lst, columns=list("ABC"))
print(df)
###
A B C
0 2 3 4
1 5 6 7
2 7 8 9
s1 = pd.Series(list("xyz"))
s1.index = list("DEF")
print(s1)
###
D x
E y
F z
dtype: object
s2 = pd.Series(list("abcd"))
s2.index = list("GHIJ")
print(s2)
###
G a
H b
I c
J d
dtype: object
for idx, s in enumerate([s1, s2]):
    df.loc[idx, s.index] = s.values
print(df)
###
A B C D E F G H I J
0 2 3 4 x y z NaN NaN NaN NaN
1 5 6 7 NaN NaN NaN a b c d
2 7 8 9 NaN NaN NaN NaN NaN NaN NaN
Try this:
df['D'], df['E'], df['F'] = s.tolist()
And now:
print(df)
Gives:
A B C D E F
0 2 3 4 x y z
1 5 6 7 x y z
2 7 8 9 x y z
Edit:
If you are not sure how many extra values there are, try:
from string import ascii_uppercase as letters
df = df.assign(**dict(zip([letters[i + len(df.columns)] for i, v in enumerate(series)], series.tolist())))
print(df)
Output:
A B C D E F
0 2 3 4 x y z
1 5 6 7 x y z
2 7 8 9 x y z

Apply a function to each sequence of rows in a column

I have a df like this:
xx
A 3
B 4
C 1
D 5
E 7
F 6
G 3
H 5
I 8
J 5
I would like to apply the pct_change function to column xx for every 5 rows,
to generate the following output:
xx
A NaN
B 0.333333
C -0.750000
D 4.000000
E 0.400000
F NaN
G -0.500000
H 0.666667
I 0.600000
J -0.375000
How could I achieve this?
Create an array with np.arange by the length of df, use integer division by 5, and pass it to the groupby function:
df = df.groupby(np.arange(len(df)) // 5).pct_change()
print (df)
xx
A NaN
B 0.333333
C -0.750000
D 4.000000
E 0.400000
F NaN
G -0.500000
H 0.666667
I 0.600000
J -0.375000
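As a check, here is the same approach with the group labels made explicit, reproducing the question's data (assuming the row labels are the letters A–J):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'xx': [3, 4, 1, 5, 7, 6, 3, 5, 8, 5]},
                  index=list('ABCDEFGHIJ'))

labels = np.arange(len(df)) // 5       # array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
out = df.groupby(labels).pct_change()  # pct_change restarts inside each block of 5
```

Each block of 5 rows gets its own group label, so the first row of every block (A and F) is NaN, matching the desired output.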

How to append a DataFrame to a multiindex DataFrame?

Suppose that I have the DataFrames
In [1]: a=pd.DataFrame([[1,2],[3,4],[5,6],[7,8]],
...: index=pd.MultiIndex.from_product([('A','B'),('d','e')]))
In [2]: a
Out[2]:
0 1
A d 1 2
e 3 4
B d 5 6
e 7 8
In [3]: b=pd.DataFrame([[9,10],[11,12]],index=('d','e'))
In [4]: b
Out[4]:
0 1
d 9 10
e 11 12
and I want to append b to a, with the subindex C, producing the
DataFrame
0 1
A d 1 2
e 3 4
B d 5 6
e 7 8
C d 9 10
e 11 12
I tried
In [5]: a.loc['C'] = b
but got
TypeError: 'int' object is not iterable
How do I do it?
Assign a new column to b, then set_index and swaplevel before concatenating with a (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
pd.concat([a, b.assign(k='C').set_index('k', append=True).swaplevel(0, 1)])
Out[33]:
0 1
A d 1 2
e 3 4
B d 5 6
e 7 8
C d 9 10
e 11 12
First update b's index to match the same levels as a, then concat:
b.index = pd.MultiIndex.from_arrays([('C','C'), ('d','e')])
pd.concat([a,b])
If you want to do it step by step:
df2 = pd.concat([a,b], ignore_index=True)
df2['i0'] = a.index.get_level_values(0).tolist() + ['C']*len(b)
df2['i1'] = a.index.get_level_values(1).tolist() + b.index.tolist()
df2.set_index(['i0', 'i1'])
Outputs
       0   1
i0 i1
A  d   1   2
   e   3   4
B  d   5   6
   e   7   8
C  d   9  10
   e  11  12
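Another concise option (a sketch, not from the answers above): pd.concat with keys= wraps b's index in the new outer level in one step before stacking it under a.

```python
import pandas as pd

a = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]],
                 index=pd.MultiIndex.from_product([('A', 'B'), ('d', 'e')]))
b = pd.DataFrame([[9, 10], [11, 12]], index=('d', 'e'))

# keys=['C'] prepends an outer level 'C' to b's index, making it a
# two-level MultiIndex that lines up with a's index for concatenation
out = pd.concat([a, pd.concat([b], keys=['C'])])
```

This avoids the assign/set_index/swaplevel round trip entirely.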

How to remove duplicate rows where two columns match?

For my graduation project, I would like to remove duplicate rows, keeping only the row where columns b and c are equal for each value in column a. I tried a lot of things (groupby, merge combinations, drop_duplicates), but nothing has worked so far. Can you please help me? Many thanks!
input:
a b c
0 1 A B
1 1 A A
2 1 A C
3 2 B A
4 2 B B
result:
a b c
1 1 A A
4 2 B B
I believe you need:
print (df)
a b c
0 1 A B
1 1 A A
2 1 A C
3 2 B A
4 2 B B
5 3 C C
6 4 C NaN
7 4 C E
7 5 NaN E
Replace NaNs by forward and back filling:
df1 = df[['b','c']].bfill(axis=1).ffill(axis=1)
print (df1)
b c
0 A B
1 A A
2 A C
3 B A
4 B B
5 C C
6 C C
7 C E
7 E E
Check the condition in df1; because it shares the same index as df, the mask can filter df directly:
df = df[df1['b'] == df1['c']]
print (df)
a b c
1 1 A A
4 2 B B
5 3 C C
6 4 C NaN
7 5 NaN E
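On the question's original five rows there are no NaNs, so the bfill/ffill step is a no-op and the same technique collapses to a direct comparison (a restatement of the answer above on the simple input):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2],
                   'b': list('AAABB'),
                   'c': list('BACAB')})

# without NaNs, bfill/ffill across the two columns changes nothing,
# so the condition reduces to comparing the columns directly
out = df[df['b'] == df['c']]
```

This keeps exactly rows 1 (A, A) and 4 (B, B), matching the expected result.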

Finding the sum of each column and combining them to find the top 3 highest values

a = pd.DataFrame(df.groupby('actor_1_name')['gross'].sum())
b = pd.DataFrame(df.groupby('actor_2_name')['gross'].sum())
c = pd.DataFrame(df.groupby('actor_3_name')['gross'].sum())
x = [a,b,c]
y = pd.concat(x)
p =['actor_1_name','actor_2_name','actor_3_name','gross']
df.loc[y.nlargest(3).index,p]
I want to find the sum of each column, then combine them to find the top 3 highest values, but I'm getting an error and I'm not sure how to fix it. I need some assistance.
I believe you need:
df = pd.DataFrame({'actor_1_name':['a','a','a','b','b','c','c','d','d','e'],
'actor_2_name':['d','d','a','c','b','c','c','d','e','e'],
'actor_3_name':['c','c','a','b','b','b','c','e','e','e'],
'gross':[1,2,3,4,5,6,7,8,9,10]})
print (df)
actor_1_name actor_2_name actor_3_name gross
0 a d c 1
1 a d c 2
2 a a a 3
3 b c b 4
4 b b b 5
5 c c b 6
6 c c c 7
7 d d e 8
8 d e e 9
9 e e e 10
a = df.groupby('actor_1_name')['gross'].sum().nlargest(3)
b = df.groupby('actor_2_name')['gross'].sum().nlargest(3)
c = df.groupby('actor_3_name')['gross'].sum().nlargest(3)
x = [a,b,c]
print (x)
[actor_1_name
d 17
c 13
e 10
Name: gross, dtype: int64, actor_2_name
e 19
c 17
d 11
Name: gross, dtype: int64, actor_3_name
e 27
b 15
c 10
Name: gross, dtype: int64]
df1 = pd.concat(x, axis=1, keys=['actor_1_name','actor_2_name','actor_3_name'])
print (df1)
actor_1_name actor_2_name actor_3_name
b NaN NaN 15.0
c 13.0 17.0 10.0
d 17.0 11.0 NaN
e 10.0 19.0 27.0
EDIT1:
a = df.groupby('actor_1_name')['gross'].sum().nlargest(3).reset_index()
b = df.groupby('actor_2_name')['gross'].sum().nlargest(3).reset_index()
c = df.groupby('actor_3_name')['gross'].sum().nlargest(3).reset_index()
x = [a,b,c]
print (x)
[ actor_1_name gross
0 d 17
1 c 13
2 e 10, actor_2_name gross
0 e 19
1 c 17
2 d 11, actor_3_name gross
0 e 27
1 b 15
2 c 10]
df1 = pd.concat(x, axis=1, keys=['a','b','c'])
df1.columns = df1.columns.map('-'.join)
print (df1)
a-actor_1_name a-gross b-actor_2_name b-gross c-actor_3_name c-gross
0 d 17 e 19 e 27
1 c 13 c 17 b 15
2 e 10 d 11 c 10
EDIT2:
a = df.groupby('actor_1_name')['gross'].sum().nlargest(3).reset_index(drop=True)
b = df.groupby('actor_2_name')['gross'].sum().nlargest(3).reset_index(drop=True)
c = df.groupby('actor_3_name')['gross'].sum().nlargest(3).reset_index(drop=True)
x = [a,b,c]
print (x)
[0 17
1 13
2 10
Name: gross, dtype: int64, 0 19
1 17
2 11
Name: gross, dtype: int64, 0 27
1 15
2 10
Name: gross, dtype: int64]
df1 = pd.concat(x, axis=1, keys=['actor_1_name','actor_2_name','actor_3_name'])
print (df1)
actor_1_name actor_2_name actor_3_name
0 17 19 27
1 13 17 15
2 10 11 10
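If the goal is a single combined top 3 across all three actor columns (one possible reading of the question, an assumption on my part), melting the actor columns into one column before grouping may be closer to the intent:

```python
import pandas as pd

df = pd.DataFrame({'actor_1_name': ['a','a','a','b','b','c','c','d','d','e'],
                   'actor_2_name': ['d','d','a','c','b','c','c','d','e','e'],
                   'actor_3_name': ['c','c','a','b','b','b','c','e','e','e'],
                   'gross': [1,2,3,4,5,6,7,8,9,10]})

# stack the three actor columns into one, so each row's gross is
# counted once per actor appearance, then sum and take the top 3
m = df.melt(id_vars='gross',
            value_vars=['actor_1_name', 'actor_2_name', 'actor_3_name'],
            value_name='actor')
top3 = m.groupby('actor')['gross'].sum().nlargest(3)
```

Note that a row's gross is counted once for each actor column it appears in, which may or may not be the desired accounting.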