MultiIndex isn't kept when pd.concating multiple subtotal rows - pandas

I lose my MultiIndex when I try to pd.concat a second subtotal. I'm able to add the first subtotal, but not the second, which is the sum over all of B0.
This is how my current df is:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
First Total 6 4 10 8
C1 D0 9 8 11 10
D1 13 12 15 14
First Total 22 20 26 24
C2 D0 17 16 19 18
After trying to add the second subtotal I get this:
lvl0 a b
lvl1 bar foo bah foo
(A0, B0, C2, First Total) 38 36 42 40
(A0, B0, C3, D0) 25 24 27 26
(A0, B0, C3, D1) 29 28 31 30
(A0, B0, C3, First Total) 54 52 58 56
(A0, B0, Second Total) 120 112 136 128
(A0, B1, C0, D0) 33 32 35 34
(A0, B1, C0, D1) 37 36 39 38
(A0, B1, C0, First Total) 70 68 74 72
(A0, B1, C1, D0) 41 40 43 42
You should be able to copy and paste the code below to test
import pandas as pd
import numpy as np

# creating the MultiIndex
def mklbl(prefix, n):
    return ["%s%s" % (prefix, i) for i in range(n)]

miindex = pd.MultiIndex.from_product([mklbl('A', 4),
                                      mklbl('B', 2),
                                      mklbl('C', 4),
                                      mklbl('D', 2)])
micolumns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'),
                                       ('b', 'foo'), ('b', 'bah')],
                                      names=['lvl0', 'lvl1'])
dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns))
                      .reshape((len(miindex), len(micolumns))),
                    index=miindex,
                    columns=micolumns).sort_index().sort_index(axis=1)
# My code STARTS HERE
# creating the first subtotal
print(dfmi.index)
df1 = dfmi.groupby(level=[0, 1, 2]).sum()
df2 = dfmi.groupby(level=[0, 1]).sum()
df1 = df1.set_index(np.array(['First Total'] * len(df1)), append=True)
dfmi = pd.concat([dfmi, df1]).sort_index(level=[0, 1])
print(dfmi)

# this is where the MultiIndex is lost
df2 = df2.set_index(np.array(['Second Total'] * len(df2)), append=True)
dfmi = pd.concat([dfmi, df2]).sort_index(level=[1])
print(dfmi)
How I would want it to look:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
First Total 6 4 10 8
C1 D0 9 8 11 10
D1 13 12 15 14
First Total 22 20 26 24
C2 D0 17 16 19 18
D1 21 20 23 22
First Total 38 36 42 40
C3 D0 25 24 27 26
D1 29 28 31 30
First Total 54 52 58 56
Second Total 120 112 136 128
B1 C0 D0 33 32 35 34
D1 37 36 39 38
First Total 70 68 74 72
C1 D0 41 40 43 42
D1 45 44 47 46
First Total 86 84 90 88
C2 D0 49 48 51 50
D1 53 52 55 54
First Total 102 100 106 104
C3 D0 57 56 59 58
D1 61 60 63 62
First Total 118 116 122 120
Second Total 376 368 392 384
The first total is the subtotal for each level-2 (C) group; the second total is the subtotal for each level-1 (B) group.

dfmi has a 4-level MultiIndex:
In [208]: dfmi.index.nlevels
Out[208]: 4
df2 has a 3-level MultiIndex. Instead, if you use
df2 = df2.set_index([np.array(['Second Total'] * len(df2)), [''] * len(df2)], append=True)
then df2 ends up with a 4-level MultiIndex. When dfmi and df2 have the same number of levels,
then pd.concat([dfmi, df2]) produces the desired result.
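To see the failure mode in isolation, here is a minimal sketch (toy data, not the OP's frame) showing that concatenating frames whose indexes have different numbers of levels flattens the result to a plain index of tuples, while padding the shorter index to the same number of levels preserves the MultiIndex:

```python
import pandas as pd

# Two-level index on one frame, one-level index on the other.
a = pd.DataFrame({'x': [1, 2]},
                 index=pd.MultiIndex.from_tuples([('A', 'B'), ('A', 'C')]))
b = pd.DataFrame({'x': [3]}, index=pd.Index(['Total']))

mismatched = pd.concat([a, b])
print(mismatched.index.nlevels)   # 1 -- MultiIndex was flattened

# Pad the shorter index with an empty string level to match.
b2 = pd.DataFrame({'x': [3]},
                  index=pd.MultiIndex.from_tuples([('Total', '')]))
matched = pd.concat([a, b2])
print(matched.index.nlevels)      # 2 -- still a MultiIndex
```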
One problem you may face when sorting by index labels is that it relies on the strings 'First' and 'Second' appearing last in alphabetical order. An alternative to sorting by index would be to assign a numeric order column and sort by that instead:
dfmi['order'] = range(len(dfmi))
df1['order'] = dfmi.groupby(level=[0,1,2])['order'].last() + 0.1
df2['order'] = dfmi.groupby(level=[0,1])['order'].last() + 0.2
...
dfmi = pd.concat([dfmi, df1, df2])
dfmi = dfmi.sort_values(by='order')
Incorporating Scott Boston's improvement, the code would then look like this:
import pandas as pd
import numpy as np

def mklbl(prefix, n):
    return ["%s%s" % (prefix, i) for i in range(n)]

miindex = pd.MultiIndex.from_product([mklbl('A', 4),
                                      mklbl('B', 2),
                                      mklbl('C', 4),
                                      mklbl('Z', 2)])
micolumns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'),
                                       ('b', 'foo'), ('b', 'bah')],
                                      names=['lvl0', 'lvl1'])
dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns))
                      .reshape((len(miindex), len(micolumns))),
                    index=miindex,
                    columns=micolumns).sort_index().sort_index(axis=1)
df1 = dfmi.groupby(level=[0,1,2]).sum()
df2 = dfmi.groupby(level=[0, 1]).sum()
dfmi['order'] = range(len(dfmi))
df1['order'] = dfmi.groupby(level=[0,1,2])['order'].last() + 0.1
df2['order'] = dfmi.groupby(level=[0,1])['order'].last() + 0.2
df1 = df1.assign(lev4='First').set_index('lev4', append=True)
df2 = df2.assign(lev3='Second', lev4='').set_index(['lev3','lev4'], append=True)
dfmi = pd.concat([dfmi, df1, df2])
dfmi = dfmi.sort_values(by='order')
dfmi = dfmi.drop(['order'], axis=1)
print(dfmi)
which yields
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 Z0 1 0 3 2
Z1 5 4 7 6
First 6 4 10 8
C1 Z0 9 8 11 10
Z1 13 12 15 14
First 22 20 26 24
C2 Z0 17 16 19 18
Z1 21 20 23 22
First 38 36 42 40
C3 Z0 25 24 27 26
Z1 29 28 31 30
First 54 52 58 56
Second 120 112 136 128
...

unutbu points out the nature of the problem: df2 has three levels of a MultiIndex and you need a fourth level.
I would use assign and set_index to create that fourth level:
df2 = df2.assign(lev3='Second Total', lev4='').set_index(['lev3','lev4'], append=True)
This avoids calculating the length of the dataframe.
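As a minimal sketch (with an assumed toy frame, not the OP's data) of why this is convenient: assign broadcasts a scalar across every row, so the new constant index level needs no explicit length:

```python
import pandas as pd

# assign() broadcasts the scalars 'Second Total' and '' to every row,
# then set_index(append=True) turns them into two extra index levels.
df = pd.DataFrame({'v': [1, 2, 3]})
out = df.assign(lev3='Second Total', lev4='').set_index(['lev3', 'lev4'],
                                                        append=True)
print(out.index.nlevels)   # 3: the original index plus the two new levels
```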

Related

Reordering a DF by category in a preset order

df = pd.DataFrame(np.random.randint(0,100,size=(15, 3)), columns=list('NMO'))
df['Catgeory1'] = ['I','I','I','I','I','G','G','G','G','G','P','P','I','I','P']
df['Catgeory2'] = ['W','W','C','C','C','W','W','W','W','W','O','O','O','O','O']
Imagining this df is much larger with many more categories, how might I sort it, retaining all the characteristics of each row, by a predetermined order? E.g. sorting the df only by 'Catgeory1', such that all the P's come first, then the I's, then the G's.
You can use categorical type:
cat_type = pd.CategoricalDtype(categories=["P", "I", "G"], ordered=True)
df['Catgeory1'] = df['Catgeory1'].astype(cat_type)
print(df.sort_values(by='Catgeory1'))
Prints:
N M O Catgeory1 Catgeory2
10 49 37 44 P O
11 72 64 66 P O
14 39 98 32 P O
0 93 12 89 I W
1 20 74 21 I W
2 25 22 24 I C
3 47 11 33 I C
4 60 16 34 I C
12 0 90 6 I O
13 13 35 80 I O
5 84 64 67 G W
6 70 47 83 G W
7 61 57 76 G W
8 19 8 3 G W
9 7 8 5 G W
For PIG order (reverse alphabetical order):
df.sort_values('Catgeory1',ascending=False)
For custom sorting:
df['Catgeory1'] = pd.Categorical(df['Catgeory1'], ['P','G','I'])
df = df.sort_values('Catgeory1')
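A further alternative, assuming pandas >= 1.1: sort_values accepts a key callable, so a one-off custom order can be expressed with a plain mapping instead of converting the column to a categorical dtype first. A sketch on toy data:

```python
import pandas as pd

# The key callable receives the column being sorted and returns sort keys;
# here a dict maps each category to its desired rank.
df = pd.DataFrame({'Catgeory1': ['I', 'G', 'P', 'I', 'P']})
order = {'P': 0, 'I': 1, 'G': 2}
out = df.sort_values('Catgeory1', key=lambda s: s.map(order))
print(out['Catgeory1'].tolist())   # ['P', 'P', 'I', 'I', 'G']
```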

Move column level to top in multi column index pandas DataFrame

What's a pythonic way to move a certain column level to the top in a pandas multi column index?
Toy example:
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_arrays(
    [
        ["a1", "a1", "a1", "a1"],
        ["b1", "b1", "b2", "b2"],
        ["x1", "x1", "x1", "x1"],
        ["c1", "c1", "c1", "c1"],
    ],
    names=(
        "unknown_level_name_0",
        "unknown_level_name_1",
        "known_level_name",
        "unknown_level_name_last",
    ),
)
df = pd.DataFrame(np.random.randint(0, 100, [5, 4]), columns=cols)
print(df)
unknown_level_name_0 a1
unknown_level_name_1 b1 b2
known_level_name x1 x1
unknown_level_name_last c1 c1 c1 c1
0 37 34 97 19
1 54 47 53 46
2 63 94 14 85
3 16 51 27 96
4 32 64 76 26
I am looking for the following result:
known_level_name x1 x1
unknown_level_name_0 a1
unknown_level_name_1 b1 b2
unknown_level_name_last c1 c1 c1 c1
0 37 34 97 19
1 54 47 53 46
2 63 94 14 85
3 16 51 27 96
4 32 64 76 26
EDIT:
There can be a variable number of levels. Most level names are unknown. However, there will always be one familiar level name (here: "known_level_name").
Using reorder_levels or swaplevel might become tricky if I don't know the exact position of "known_level_name".
Here is a generic function to move a column level (by label or index) to the top:
def move_top(df, col, inplace=False):
    if col in df.columns.names:
        idx = df.columns.names.index(col)
    elif isinstance(col, int) and 0 <= col < len(df.columns.names):
        idx = col
    else:
        raise IndexError(f'invalid index "{col}"')
    order = list(range(len(df.columns.names)))
    order.pop(idx)
    order = [idx] + order
    if inplace:
        df.columns = df.columns.reorder_levels(order=order)
    else:
        return df.reorder_levels(order, axis=1)

move_top(df, 'known_level_name')
output:
known_level_name x1
unknown_level_name_0 a1
unknown_level_name_1 b1 b2
unknown_level_name_last c1 c1 c1 c1
0 33 30 23 77
1 10 73 80 33
2 7 54 52 9
3 71 99 22 22
4 83 15 86 40
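A more compact, name-based sketch of the same idea (with an assumed small three-level column index): build the new order from the level names themselves, so the position of "known_level_name" never matters:

```python
import numpy as np
import pandas as pd

# Toy frame with the known level in the middle position.
cols = pd.MultiIndex.from_arrays(
    [["a1", "a1"], ["x1", "x1"], ["c1", "c2"]],
    names=("unknown_0", "known_level_name", "unknown_last"),
)
df = pd.DataFrame(np.zeros((2, 2)), columns=cols)

# Put the known name first, keep the rest in their original order.
order = ["known_level_name"] + [n for n in df.columns.names
                                if n != "known_level_name"]
df = df.reorder_levels(order, axis=1)
print(list(df.columns.names))
# ['known_level_name', 'unknown_0', 'unknown_last']
```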
You can just try reorder_levels
out = df.reorder_levels([2,0,1,3], axis=1)
Out[184]:
known_level_name x1
unknown_level_name_0 a1
unknown_level_name_1 b1 b2
unknown_level_name_last c1 c1 c1 c1
0 98 32 72 94
1 22 71 15 2
2 25 41 42 38
3 87 74 41 82
4 87 31 18 8
Use reorder_levels (also available as a DataFrame method):
# by index
df.columns = df.columns.reorder_levels(order=[2,0,1,3])
or:
# by name
df.columns = df.columns.reorder_levels(order=['known_level_name',
                                              'unknown_level_name_0',
                                              'unknown_level_name_1',
                                              'unknown_level_name_last'])
Alternative, multiple swaplevel:
# move level 2 up, then move it up again
df.swaplevel(2, 1, axis=1).swaplevel(1, 0, axis=1)
output:
known_level_name x1
unknown_level_name_0 a1
unknown_level_name_1 b1 b2
unknown_level_name_last c1 c1 c1 c1
0 17 68 61 88
1 6 62 81 7
2 82 16 85 92
3 40 22 48 0
4 35 46 68 60

How do I solve this kind of problem through pandas.cut()?

I have my data as
data = pd.DataFrame({'A':[3,50,50,60],'B':[49,5,37,59],'C':[15,34,43,6],'D':[35,39,10,25]})
If I use cut this way
p = ['A','S','T','U','V','C','Z']
bins = [0,30,35,40,45,50,55,60]
data['A*'] = pd.cut(data.A,bins,labels=p)
print(data)
I get
A B C D A*
0 3 49 15 35 A
1 50 5 34 39 V
2 50 37 43 10 V
3 60 59 6 25 Z
How would I cut it to get
A B C D A*
0 3 49 15 35 3A
1 50 5 34 39 50V
2 50 37 43 10 50V
3 60 59 6 25 60Z
I tried this, but it doesn't work:
for x in data.A:
    p = [str(x)+'A', str(x)+'S', str(x)+'T', str(x)+'U', str(x)+'V', str(x)+'C', str(x)+'Z']
    bins = [0,30,35,40,45,50,55,60]
It gives me this
A B C D A*
0 3 49 15 35 60A
1 50 5 34 39 60V
2 50 37 43 10 60V
3 60 59 6 25 60Z
Convert column A to strings, convert the categoricals returned by pd.cut to strings too, and join them together:
p = ['A','S','T','U','V','C','Z']
bins = [0,30,35,40,45,50,55,60]
data['A*'] = data.A.astype(str) + pd.cut(data.A,bins,labels=p).astype(str)
print(data)
A B C D A*
0 3 49 15 35 3A
1 50 5 34 39 50V
2 50 37 43 10 50V
3 60 59 6 25 60Z
EDIT:
To process all columns, it is possible to use DataFrame.apply:
data = data.apply(lambda x: x.astype(str) + pd.cut(x,bins,labels=p).astype(str))
print(data)
A B C D
0 3A 49V 15A 35S
1 50V 5A 34S 39T
2 50V 37T 43U 10A
3 60Z 59Z 6A 25A
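One caveat worth noting (a hedged addition, not from the answer above): values that fall outside the bin edges come back from pd.cut as NaN, and astype(str) turns that into the literal string 'nan', so the joined label silently becomes e.g. '65nan':

```python
import pandas as pd

# 65 is above the last bin edge (60), so cut returns NaN for it.
s = pd.Series([3, 65])
bins = [0, 30, 35, 40, 45, 50, 55, 60]
p = ['A', 'S', 'T', 'U', 'V', 'C', 'Z']
out = s.astype(str) + pd.cut(s, bins, labels=p).astype(str)
print(out.tolist())   # ['3A', '65nan']
```

If out-of-range values are possible, it may be worth validating them or extending the bins before joining.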

Dropping list of rows from multi-level pandas dataframe where first two levels have duplicate indices

I would like to drop a list of row indices from a multi-level DataFrame, where the first two levels have duplicate entries. I imagine it is possible to do this without a loop, but so far I have not found a way.
I have attempted to use DataFrame.drop by providing a list of row index combinations, though this does not have the desired effect. As an example:
import numpy as np
import pandas as pd

def mklbl(prefix, n):
    return ["%s%s" % (prefix, i) for i in range(n)]

def src_rec(n, mult):
    src = [[no] * mult for no in range(1, n)]
    src = [item for sublist in src for item in sublist]
    rec = [no for no in range(1, n)] * mult
    return src, rec

src, rec = src_rec(4, 4)
miindex = pd.MultiIndex.from_arrays([src * 2,
                                     rec * 2,
                                     mklbl('C', 24)])
dfmi = pd.DataFrame(np.arange(len(miindex) * 2)
                      .reshape((len(miindex), 2)),
                    index=miindex)
I would like to drop all rows with index values (1,2,:) and (2,3,:)
As = [1, 2]
Bs = [2, 3]
dfmi.drop(pd.MultiIndex.from_arrays([As,Bs]))
The result of this is:
0 1
1 1 C0 0 1
2 1 C18 36 37
2 C19 38 39
3 3 C20 40 41
1 C21 42 43
2 C22 44 45
3 C23 46 47
While my desired result is:
0 1
1 1 C0 0 1
3 C2 4 5
1 C3 6 7
2 2 C4 8 9
1 C6 12 13
2 C7 14 15
3 3 C8 16 17
1 C9 18 19
2 C10 20 21
3 C11 22 23
1 1 C12 24 25
3 C14 28 29
1 C15 30 31
2 2 C16 32 33
1 C18 36 37
2 C19 38 39
3 3 C20 40 41
1 C21 42 43
2 C22 44 45
3 C23 46 47
An example of this in a loop is:
for A, B in zip(As, Bs):
    dfmi_drop_idx = dfmi.loc[(A, B, slice(None)), :].index
    dfmi.drop(dfmi_drop_idx, inplace=True, errors='raise')
Use boolean indexing, testing membership with Index.isin:
m = pd.MultiIndex.from_arrays([As,Bs])
df = dfmi[~dfmi.reset_index(level=2, drop=True).index.isin(m)]
print (df)
0 1
1 1 C0 0 1
3 C2 4 5
1 C3 6 7
2 2 C4 8 9
1 C6 12 13
2 C7 14 15
3 3 C8 16 17
1 C9 18 19
2 C10 20 21
3 C11 22 23
1 1 C12 24 25
3 C14 28 29
1 C15 30 31
2 2 C16 32 33
1 C18 36 37
2 C19 38 39
3 3 C20 40 41
1 C21 42 43
2 C22 44 45
3 C23 46 47
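An equivalent sketch (on an assumed small frame, not the question's full data) using Index.droplevel directly, which avoids the reset_index round-trip:

```python
import numpy as np
import pandas as pd

# Small 3-level frame standing in for dfmi.
dfmi = pd.DataFrame(
    np.arange(8).reshape(4, 2),
    index=pd.MultiIndex.from_tuples(
        [(1, 2, 'C0'), (1, 3, 'C1'), (2, 3, 'C2'), (2, 1, 'C3')]),
)

# Drop every row whose first two levels match (1, 2) or (2, 3).
m = pd.MultiIndex.from_arrays([[1, 2], [2, 3]])
out = dfmi[~dfmi.index.droplevel(2).isin(m)]
print(out.index.tolist())   # [(1, 3, 'C1'), (2, 1, 'C3')]
```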

pandas: conditionally select a row cell for each column based on a mask

I want to be able to extract values from a pandas dataframe using a mask. However, after searching around, I cannot find a solution to my problem.
df = pd.DataFrame(np.random.randint(0,2, size=(2,10)))
mask = np.random.randint(0,2, size=(1,10))
I basically want the mask to serve as an index lookup for each column.
So if the mask was [0, 1] for columns [a, b], I want to return:
df.iloc[0,a], df.iloc[1,b]
but in a pythonic way.
I have tried e.g.:
df.apply(lambda x: df.iloc[mask[x], x] for x in range(len(mask)))
which gives a Type error that I don't understand.
A for loop can work but is slow.
With NumPy, that's covered by advanced indexing and should be pretty efficient:
df.values[mask, np.arange(mask.size)]
Sample run -
In [59]: df = pd.DataFrame(np.random.randint(11,99, size=(5,10)))
In [60]: mask = np.random.randint(0,5, size=(1,10))
In [61]: df
Out[61]:
0 1 2 3 4 5 6 7 8 9
0 17 87 73 98 32 37 61 58 35 87
1 52 64 17 79 20 19 89 88 19 24
2 50 33 41 75 19 77 15 59 84 86
3 69 13 88 78 46 76 33 79 27 22
4 80 64 17 95 49 16 87 82 60 19
In [62]: mask
Out[62]: array([[2, 3, 0, 4, 2, 2, 4, 0, 0, 0]])
In [63]: df.values[mask, np.arange(mask.size)]
Out[63]: array([[50, 13, 73, 95, 19, 77, 87, 58, 35, 87]])
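The same idea can also be expressed with np.take_along_axis, which keeps the 2-D shape of the mask explicit; a sketch on an assumed small frame:

```python
import numpy as np
import pandas as pd

# One row index per column, picked along axis 0.
df = pd.DataFrame([[10, 11, 12],
                   [20, 21, 22]])
mask = np.array([[1, 0, 1]])   # pick row 1, row 0, row 1

picked = np.take_along_axis(df.to_numpy(), mask, axis=0)
print(picked)   # [[20 11 22]]
```

Note that df.to_numpy() is the modern spelling of df.values; both return the underlying array that advanced indexing operates on, so the row/column labels are lost either way.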