Is there a way I could work with this MultiIndex? - pandas

I have a DataFrame like this one: https://i.stack.imgur.com/2Sr29.png. RBD is a code that identifies each school, LET_CUR corresponds to a class, and MRUN corresponds to the number of students in each class. What I need is the following:
I would like to know how many of the schools have at least one class with more than 45 students; so far I haven't figured out a way to do that.
Thanks.

From your DataFrame:
>>> import pandas as pd
>>> from io import StringIO
>>> df = pd.read_csv(StringIO("""
RBD,LET_CUR,MRUN
1,A,65
1,B,23
1,C,21
2,A,22
2,B,20
2,C,34
3,A,54
4,A,23
4,B,11
5,A,15
5,C,16
6,A,76"""))
>>> df = df.set_index(['RBD', 'LET_CUR'])
>>> df
             MRUN
RBD LET_CUR
1   A          65
    B          23
    C          21
2   A          22
    B          20
    C          34
3   A          54
4   A          23
    B          11
5   A          15
    C          16
6   A          76
As we want to know the number of schools with at least one class having more than 45 students, we can first filter the DataFrame on the column MRUN and then use the nunique() method to count the number of unique schools:
>>> df_filtered = df[df['MRUN'] > 45].reset_index()
>>> df_filtered['RBD'].nunique()
3

Try the following (here I build a similar DataFrame structure to yours):
df = pd.DataFrame({'RBD': [1, 1, 2, 3],
                   'COD_GRADO': ['1', '2', '1', '3'],
                   'LET_CUR': ['A', 'C', 'B', 'A'],
                   'MRUN': [65, 34, 64, 25]},
                  columns=['RBD', 'COD_GRADO', 'LET_CUR', 'MRUN'])
print(df)
# count distinct schools rather than rows, so a school with several
# large classes is only counted once
n_schools = df.loc[df['MRUN'] >= 45, 'RBD'].nunique()
print(f"Number of schools with 45+ students is {n_schools}")
And the output for my example would be (table formatted for easier reading):
   RBD COD_GRADO LET_CUR  MRUN
0    1         1       A    65
1    1         2       C    34
2    2         1       B    64
3    3         3       A    25
> Number of schools with 45+ students is 2


Pandas Column Transformation with list of dict in column

I am getting the data from a NoSQL database owned by a third party. After fetching the data, the DataFrame looks like below. I wish to explode the performance column but can't figure out a way. Is it even possible?
import pandas as pd
cols = ['name', 'performance']
data = [
    ['bob', [{'dates': '15-12-2021', 'gdp': 19},
             {'dates': '16-12-2021', 'gdp': 36},
             {'dates': '12-12-2022', 'gdp': 39},
             {'dates': '13-12-2022', 'gdp': 35},
             {'dates': '14-12-2022', 'gdp': 35}]]]
df = pd.DataFrame(data, columns=cols)
Expected output:
cols = ['name', 'dates', 'gdp']
data = [
    ['bob', '15-12-2021', 19],
    ['bob', '16-12-2021', 36],
    ['bob', '12-12-2022', 39],
    ['bob', '13-12-2022', 35],
    ['bob', '14-12-2022', 35]]
df = pd.DataFrame(data, columns=cols)
Use DataFrame.explode with DataFrame.reset_index first, and then flatten the dictionaries with json_normalize; DataFrame.pop is used to remove the column performance from the output DataFrame:
df1 = df.explode('performance').reset_index(drop=True)
df1 = df1.join(pd.json_normalize(df1.pop('performance')))
print(df1)
  name       dates  gdp
0  bob  15-12-2021   19
1  bob  16-12-2021   36
2  bob  12-12-2022   39
3  bob  13-12-2022   35
4  bob  14-12-2022   35
Another solution with a list comprehension - if the input DataFrame has only 2 columns:
L = [{**{'name':a},**x} for a, b in zip(df['name'], df['performance']) for x in b]
df1 = pd.DataFrame(L)
print(df1)
  name       dates  gdp
0  bob  15-12-2021   19
1  bob  16-12-2021   36
2  bob  12-12-2022   39
3  bob  13-12-2022   35
4  bob  14-12-2022   35
If there are multiple columns, use DataFrame.join with the original DataFrame:
L = [{**{'i':a},**x} for a, b in df.pop('performance').items() for x in b]
df1 = df.join(pd.DataFrame(L).set_index('i')).reset_index(drop=True)
print(df1)
  name       dates  gdp
0  bob  15-12-2021   19
1  bob  16-12-2021   36
2  bob  12-12-2022   39
3  bob  13-12-2022   35
4  bob  14-12-2022   35
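If the raw records are still at hand (or obtained via df.to_dict('records')), pd.json_normalize can also do the explode-and-flatten in a single step, using its record_path and meta parameters; a sketch on a shortened version of the data:

```python
import pandas as pd

cols = ['name', 'performance']
data = [['bob', [{'dates': '15-12-2021', 'gdp': 19},
                 {'dates': '16-12-2021', 'gdp': 36},
                 {'dates': '12-12-2022', 'gdp': 39}]]]
df = pd.DataFrame(data, columns=cols)

# expand each dict in 'performance' into a row, carrying 'name' along
df1 = pd.json_normalize(df.to_dict('records'),
                        record_path='performance', meta='name')
```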

Pandas Groupby -- efficient selection/filtering of groups based on multiple conditions?

I am trying to filter DataFrame groups in Pandas based on multiple (any) conditions, but I cannot seem to get to a fast Pandas 'native' one-liner.
Here I generate an example DataFrame of 2*n*n rows and 4 columns:
import itertools
import random

import pandas as pd

n = 100
lst = range(0, n)
df = pd.DataFrame(
    {'A': list(itertools.chain.from_iterable(itertools.repeat(x, n*2) for x in lst)),
     'B': list(itertools.chain.from_iterable(itertools.repeat(x, 1*2) for x in lst)) * n,
     'C': random.choices(list(range(100)), k=2*n*n),
     'D': random.choices(list(range(100)), k=2*n*n)
    })
resulting in DataFrames such as:
   A  B   C   D
0  0  0  26  49
1  0  0  29  80
2  0  1  70  92
3  0  1   7   2
4  1  0  90  11
5  1  0  19   4
6  1  1  29   4
7  1  1  31  95
I want to
select groups grouped by A and B,
filtered down to those groups where, in both columns C and D, at least one value in the group is greater than 50.
A "native" Pandas one-liner would be the following:
df.groupby([df.A, df.B]).filter(lambda x: ((x.C > 50).any() & (x.D > 50).any()))
which produces
   A  B   C   D
2  0  1  70  92
3  0  1   7   2
This is all fine for small dataframes (say n < 20).
But this solution takes quite long (for example, 4.58 s when n = 100) for large dataframes.
I have an alternative, step-by-step solution which achieves the same result, but runs much faster (28.1 ms when n = 100):
test_g = df.assign(key_C=df.C > 50, key_D=df.D > 50).groupby([df.A, df.B])
test_C_bool = test_g.key_C.transform('any')
test_D_bool = test_g.key_D.transform('any')
df[test_C_bool & test_D_bool]
but it is arguably a bit uglier. My questions are:
Is there a better "native" Pandas solution for this task? And
is there a reason for the sub-optimal performance of my version of the "native" solution?
Bonus question:
In fact I only want to extract the groups and not together with their data. I.e., I only need
A B
0 1
in the above example. Is there a way to do this with Pandas without going through the intermediate step I did above?
This is similar to your second approach, but chained together:
mask = (df[['C','D']].gt(50)            # use e.g. [50, 60] if `C`, `D` have different thresholds
        .all(axis=1)                    # check for both True on the rows
        .groupby([df['A'], df['B']])    # normal groupby
        .transform('max')               # 'any' instead of 'max' also works
       )
df.loc[mask]
If you don't want the data, you can forgo the transform:
mask = df[['C','D']].min(axis=1).gt(50).groupby([df['A'], df['B']]).any()
mask[mask].index
# out
# MultiIndex([(0, 1)],
#            names=['A', 'B'])
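For the question's exact per-column semantics (a group qualifies when C exceeds 50 in one row and D exceeds 50 in a possibly different row, unlike a row-wise .all(axis=1) mask), a transform-based sketch is possible; the mini DataFrame below is a hand-made stand-in for the generated one, reusing the sample rows from the question:

```python
import pandas as pd

# small stand-in for the generated 2*n*n-row frame
df = pd.DataFrame({
    'A': [0, 0, 0, 0, 1, 1, 1, 1],
    'B': [0, 0, 1, 1, 0, 0, 1, 1],
    'C': [26, 29, 70, 7, 90, 19, 29, 31],
    'D': [49, 80, 92, 2, 11, 4, 4, 95],
})

# per-group "any C > 50" AND "any D > 50", each column checked independently
g = df.groupby(['A', 'B'])
mask = g['C'].transform('max').gt(50) & g['D'].transform('max').gt(50)
result = df.loc[mask]  # rows 2 and 3 (group A=0, B=1)
```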

Why this inconsistency between a Dataframe and a column of it?

While debugging a nasty error in my code, I came across what looks like an inconsistency in the way DataFrames work (using pandas 1.0.3):
import pandas as pd
df = pd.DataFrame([[10*k, 11, 22, 33] for k in range(4)], columns=['d', 'k', 'c1', 'c2'])
y = df.k
X = df[['c1', 'c2']]
Then I tried to add a column to y (forgetting that y is a Series, not a Dataframe):
y['d'] = df['d']
I'm now aware that this adds a weird row to the Series; y is now:
0 11
1 11
2 11
3 11
d 0 0
1 10
2 20
3 30
Name: d, dtype...
Name: k, dtype: object
But the weird thing is that now:
>>> df.shape, df['k'].shape
((4, 4), (5,))
And df and df['k'] look like:
    d   k  c1  c2
0   0  11  22  33
1  10  11  22  33
2  20  11  22  33
3  30  11  22  33
and
0 11
1 11
2 11
3 11
d 0 0
1 10
2 20
3 30
Name: d, dtype...
Name: k, dtype: object
There are a few things at work here:
A pandas Series can store objects of arbitrary types.
y['d'] = _ adds a new object to the series y with name 'd'.
Thus, y['d'] = df['d'] adds a new object to the series y with name 'd' whose value is the series df['d'].
So you have added a series as the last entry of the series y. You can verify that
(y['d'] == y.iloc[-1]).all() == True and
(y.iloc[-1] == df['d']).all() == True.
To clarify the inconsistency between df and df.k: note that df.k, df['k'], and df.loc[:, 'k'] all return a 'view' of column k as a Series; thus, adding an entry to that Series appends it directly to the view. However, df.k shows the entire Series, whereas df only shows the Series up to a maximum length of df.shape[0]. Hence the inconsistent behavior.
I agree that this behavior is prone to bugs and should be fixed. View vs. copy is a common cause of many issues. In this case, df.iloc[:, 1] behaves correctly and should be used instead.
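Following that advice, a defensive sketch: taking an explicit copy() detaches y from df, so later writes to y (scalar edits or enlargement) can never reach the parent frame:

```python
import pandas as pd

df = pd.DataFrame([[10*k, 11, 22, 33] for k in range(4)],
                  columns=['d', 'k', 'c1', 'c2'])

# explicit copy: y is detached from df, so editing it is safe
y = df['k'].copy()
y.iloc[0] = -1

print(df['k'].shape)   # (4,) -- unchanged
print(df.loc[0, 'k'])  # 11   -- unchanged
```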

Adding the lower levels of two Pandas MultiIndex columns

I have the following DataFrame:
import pandas as pd
columns = pd.MultiIndex.from_arrays([['n1', 'n1', 'n2', 'n2'],
                                     ['p', 'm', 'p', 'm']])
values = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
df = pd.DataFrame(values, columns=columns)
  n1      n2
   p   m   p   m
0  1   2   3   4
1  5   6   7   8
2  9  10  11  12
Now I want to add another column (n3) to this DataFrame whose lower-level columns p and m should be the sums of the corresponding lower-level columns of n1 and n2:
  n1      n2      n3
   p   m   p   m   p   m
0  1   2   3   4   4   6
1  5   6   7   8  12  14
2  9  10  11  12  20  22
Here's the code I came up with:
n3 = df[['n1', 'n2']].sum(axis=1, level=1)
level1 = df.columns.levels[1]
n3.columns = pd.MultiIndex.from_arrays([['n3'] * len(level1), level1])
df = pd.concat([df, n3], axis=1)
This does what I want, but feels very cumbersome compared to code that doesn't use MultiIndex columns:
df['n3'] = df[['n1', 'n2']].sum(axis=1)
My current code also only works for a column MultiIndex consisting of two levels, and I'd be interested in doing this for arbitrary levels.
What's a better way of doing this?
One way to do so is with stack and unstack:
new_df = df.stack(level=1)
new_df['n3'] = new_df.sum(axis=1)
new_df.unstack(level=-1)
Output:
   n1     n2      n3
    m  p   m   p   m   p
0   2  1   4   3   6   4
1   6  5   8   7  14  12
2  10  9  12  11  22  20
If you first build the structure like:
df['n3', 'p'] = 1
df['n3', 'm'] = 1
then you can write:
df['n3'] = df[['n1', 'n2']].sum(axis=1, level=1)
Here's another way that I just discovered which does not reorder the columns:
# Sum column-wise on level 1
s = df.loc[:, ['n1', 'n2']].sum(axis=1, level=1)
# Prepend a column level
s = pd.concat([s], keys=['n3'], axis=1)
# Add column to DataFrame
df = pd.concat([df, s], axis=1)
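To generalize to an arbitrary number of lower levels (and avoid the since-deprecated level= argument of sum), one option is to transpose, group on every level except the top one, and transpose back; a sketch on the same data:

```python
import pandas as pd

columns = pd.MultiIndex.from_arrays([['n1', 'n1', 'n2', 'n2'],
                                     ['p', 'm', 'p', 'm']])
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
                  columns=columns)

# group the transposed frame on all levels below the top one and sum;
# works for any number of lower levels (note: may reorder p/m alphabetically)
lower = list(range(1, df.columns.nlevels))
n3 = df[['n1', 'n2']].T.groupby(level=lower).sum().T

# prepend the 'n3' level and attach the result
df = df.join(pd.concat([n3], keys=['n3'], axis=1))
```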

In Python Pandas using cumsum with groupby

I am trying to do a pandas cumsum() where I want to reset the value to 0 every time the group changes.
Say I have the below DataFrame where, after a group by, I have col2 (Group) and expect col3 (Cumsum) from using the function:
Value  Group  Cumsum
a      1      0
a      1      1
a      1      2
b      2      0
b      2      1
b      2      2
b      2      3
c      3      0
c      3      1
d      4      0
This doesn't work:
df['Cumsum'] = df['Group'].cumsum()
Please advise.
Thanks!
Hmm, this turned out more complicated than I imagined, due to getting the groups' keys back in. Perhaps someone else will find something shorter.
First, imports
import pandas as pd
import itertools
Now a DataFrame:
df = pd.DataFrame({
    'a': ['a', 'b', 'a', 'b'],
    'b': [0, 1, 2, 3]})
So now we combine a groupby-cumsum with the groups' keys. Note that rebuilding the keys with itertools from the groupby would return them in group order ([a, a, b, b]) rather than row order, silently misaligning them with the cumsum column, so we take the keys straight from df.a instead:
>>> pd.DataFrame({
...     'keys': df.a,
...     'cumsum': df.b.groupby(df.a).cumsum()})
  keys  cumsum
0    a       0
1    b       1
2    a       2
3    b       4
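Worth noting: the expected column in the original question (restarting at 0 and stepping by one within each group) is what groupby().cumcount() produces rather than a cumulative sum; a sketch on the question's groups:

```python
import pandas as pd

df = pd.DataFrame({
    'Value': list('aaabbbbccd'),
    'Group': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
})

# cumcount restarts at 0 each time the group changes,
# matching the Cumsum column shown in the question
df['Cumsum'] = df.groupby('Group').cumcount()
```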