Pandas groupby nlargest slice - pandas

I found questions with similar titles, but none of them covers my use case. I have a dataframe with groups and values, and I want to select values by their rank within each group (the example below should make this clearer).
This is my data:
group value
a 20
a 16
a 14
a 13
a 12
b 19
b 17
b 16
b 14
b 13
b 12
b 12
b 11
I want to group by group and take the slice [a:b] of each group's values in descending (nlargest) order. In other words, if a = 2 and b = 7, I want the 3rd, 4th, 5th, 6th and 7th largest values in each group. I could not find any question here covering this use case, nor could I find anything on the pandas-dev GitHub.
If a group has fewer than b elements, then b = len(of that group) should be applied. If two or more elements have the same value, they should all be selected as long as they fall within the [a:b] slice.
My desired result looks like this:
group value
a 14
a 13
a 12
b 16
b 14
b 13
b 12
b 12
Here, group a has 5 elements, which is fewer than b in the example, so the 3rd to 5th largest elements are returned. In group b, the 6th and 7th largest values are equal, so both are returned.
The closest question to mine is this question about slice but it does not use nlargest logic. It just slices the groups.
I would appreciate any guidance on this.

You could try the following:
import pandas as pd
gbg = df.groupby('group')
a=2
b=7
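res = gbg['value'].agg(lambda x: x.sort_values(ascending=False).tolist()[a:b]).to_frame().explode('value').reset_index()
# .agg will "aggregate" the groups, here it sorts each group in descending order
# (so the slice follows nlargest logic even if df is not pre-sorted) and creates the slices by group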
# .to_frame will convert results from pd.Series to pd.DataFrame
# .explode() will write the list values in rows again
# .reset_index() will restore the column 'group'
The intermediate result after .agg():
group
a [14, 13, 12]
b [16, 14, 13, 12, 12]
Name: value, dtype: object
And the full result:
group value
0 a 14
1 a 13
2 a 12
3 b 16
4 b 14
5 b 13
6 b 12
7 b 12
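A variant of the same idea, sketched with Series.nlargest so that it does not depend on the rows being pre-sorted. The input is shuffled on purpose, and the names a, b and result are only illustrative:

```python
import pandas as pd

# Data from the question, shuffled to show that row order does not matter.
df = pd.DataFrame({
    "group": list("aaaaa") + list("bbbbbbbb"),
    "value": [20, 16, 14, 13, 12, 19, 17, 16, 14, 13, 12, 12, 11],
}).sample(frac=1, random_state=0)

a, b = 2, 7

# nlargest(b) keeps at most b rows per group (fewer if the group is smaller),
# sorted in descending order; .iloc[a:] then drops the first a of them.
result = (
    df.groupby("group")["value"]
      .apply(lambda s: s.nlargest(b).iloc[a:])
      .reset_index(level=0)      # move the group key back into a column
      .reset_index(drop=True)    # discard the original row labels
)
print(result)
```

Groups with fewer than b rows are handled by nlargest itself, so no explicit length check is needed.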

Sorting the dataframe first and then taking a slice of each group with GroupBy.nth (which accepts a slice in pandas 1.4 and later) gives me the result I expected.
df.sort_values(["group", "value"], ascending=[True, False]).groupby("group").nth(slice(2, 7))
Output is
group value
a 14
a 13
a 12
b 16
b 14
b 13
b 12
b 12
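For reference, the sort-then-slice idea can also be sketched with GroupBy.cumcount, which should behave the same across pandas versions. The variable names here (ranked, rank_in_group) are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "group": list("aaaaa") + list("bbbbbbbb"),
    "value": [20, 16, 14, 13, 12, 19, 17, 16, 14, 13, 12, 12, 11],
})

a, b = 2, 7

# Sort groups alphabetically and values in descending order within each group.
ranked = df.sort_values(["group", "value"], ascending=[True, False])

# cumcount() numbers the rows of each group 0, 1, 2, ... in their current order,
# so it acts as a zero-based rank after the sort above.
rank_in_group = ranked.groupby("group").cumcount()

# Keep rows whose rank falls in the half-open range [a, b).
result = ranked[(rank_in_group >= a) & (rank_in_group < b)].reset_index(drop=True)
print(result)
```

Because the filter is a plain boolean mask, groups shorter than b simply contribute fewer rows.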

Related

the 'combine' of a split-apply-combine in pd.groupby() works brilliantly, but I'm not sure why

I have a fragment of code similar to below. It works perfectly, but I'm not sure why I am so lucky.
The groupby() is a split-apply-combine operation. So I understand why qf.groupby(qf.g).mean() returns a result with two rows, the mean() for each of a and b.
And what's brilliant is that the combine step of qf.groupby(qf.g).cumsum() reassembles all the rows into their original order, as found in the starting df.
My question is, "Why am I able to count on this behavior?" I'm glad I can, but I cannot articulate why it's possible.
# split-apply-combine
import pandas as pd
# DataFrame with a value and an arbitrary category
qf = pd.DataFrame(data=list("aaabbaaaab"), columns=['g'])
qf['val'] = [1, 2, 3, 1, 2, 3, 4, 5, 6, 9]
print("applying mean() to members in each group of a,b ")
print(qf.groupby(qf.g).mean())
print("\n\napplying cumsum() to members in each group of a,b ")
print(qf.groupby(qf.g).cumsum())  # this combines them in the original index order thankfully
qf['running_totals'] = qf.groupby(qf.g).cumsum()
print(f"\n{qf}")
yields:
applying mean() to members in each group of a,b
val
g
a 3.428571
b 4.000000
applying cumsum() to members in each group of a,b
val
0 1
1 3
2 6
3 1
4 3
5 9
6 13
7 18
8 24
9 12
g val running_totals
0 a 1 1
1 a 2 3
2 a 3 6
3 b 1 1
4 b 2 3
5 a 3 9
6 a 4 13
7 a 5 18
8 a 6 24
9 b 9 12
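One way to see why this can be relied on: cumsum() is one of the groupby transformations, which return a result indexed like the original object, whereas mean() is an aggregation that returns one row per group. The sketch below checks this and reproduces the combine step by hand; the names running, pieces and manual are only illustrative:

```python
import pandas as pd

qf = pd.DataFrame({"g": list("aaabbaaaab"),
                   "val": [1, 2, 3, 1, 2, 3, 4, 5, 6, 9]})

# Transformation: one output row per input row, carrying the original index labels.
running = qf.groupby("g")["val"].cumsum()
same_index = running.index.equals(qf.index)

# Manual split-apply-combine: apply cumsum to each piece, glue the pieces together,
# then restore the original order using the index labels each piece kept.
pieces = [grp["val"].cumsum() for _, grp in qf.groupby("g")]
manual = pd.concat(pieces).sort_index()

print(same_index)
print(manual.tolist() == running.tolist())
```

Since every piece keeps its original index labels, assigning the result back as a new column aligns row by row on those labels.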

Why this inconsistency between a Dataframe and a column of it?

While debugging a nasty error in my code, I came across what looks like an inconsistency in the way DataFrames work (using pandas = 1.0.3):
import pandas as pd
df = pd.DataFrame([[10*k, 11, 22, 33] for k in range(4)], columns=['d', 'k', 'c1', 'c2'])
y = df.k
X = df[['c1', 'c2']]
Then I tried to add a column to y (forgetting that y is a Series, not a Dataframe):
y['d'] = df['d']
I'm now aware that this adds a weird row to the Series; y is now:
0 11
1 11
2 11
3 11
d 0 0
1 10
2 20
3 30
Name: d, dtype...
Name: k, dtype: object
But the weird thing is that now:
>>> df.shape, df['k'].shape
((4, 4), (5,))
And df and df['k'] look like:
d k c1 c2
0 0 11 22 33
1 10 11 22 33
2 20 11 22 33
3 30 11 22 33
and
0 11
1 11
2 11
3 11
d 0 0
1 10
2 20
3 30
Name: d, dtype...
Name: k, dtype: object
There are a few things at work here:
A pandas Series can store objects of arbitrary types.
y['d'] = _ adds a new object to the Series y under the label 'd'.
Thus, y['d'] = df['d'] adds a new entry to y whose label is 'd' and whose value is the Series df['d'].
So you have added a Series as the last entry of the Series y. You can verify that
(y['d'] == y.iloc[-1]).all() == True and
(y.iloc[-1] == df['d']).all() == True.
To clarify the inconsistency between df and df.k: note that df.k, df['k'], or df.loc[:, 'k'] returns a 'view' of column k as a Series, so adding an entry to that Series appends it directly to the view. However, df.k shows the entire Series, whereas df only shows each column up to length df.shape[0]. Hence the inconsistent behavior.
I agree that this behavior is bug-prone and should be fixed. View vs. copy is a common cause of such issues. In this case, df.iloc[:, 1] behaves correctly and should be used instead.
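If the goal was to end up with k and d side by side, a minimal sketch that avoids the problem is to select k as a one-column DataFrame and copy it explicitly. This is an illustration of one safe pattern, not the only option:

```python
import pandas as pd

df = pd.DataFrame([[10 * k, 11, 22, 33] for k in range(4)],
                  columns=["d", "k", "c1", "c2"])

# Double brackets select a DataFrame rather than a Series;
# .copy() makes y independent of df.
y = df[["k"]].copy()

# On a DataFrame, this adds a column aligned on the index.
y["d"] = df["d"]

print(y)
print(df["k"].shape)
```

With this pattern df['k'] keeps its four rows, because y no longer shares anything with df.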

Generate list of values summing to 1 - within groupby?

In the spirit of "Generating a list of random numbers, summing to 1" from several years ago, is there a way to apply the array returned by np.random.dirichlet within a groupby on the dataframe?
For example, I can loop through the unique values of the letter column and apply one at a time:
import numpy as np
import pandas as pd
df = pd.DataFrame([['a', 1], ['a', 3], ['a', 2], ['a', 6],
                   ['b', 7], ['b', 5], ['b', 4]], columns=['letter', 'value'])
df['grp_sum'] = df.groupby('letter')['value'].transform('sum')
df['prop_of_total'] = np.random.dirichlet(np.ones(len(df)), size=1).tolist()[0]
for letter in df['letter'].unique():
    sz = len(df[df['letter'] == letter])
    df.loc[df['letter'] == letter, 'prop_of_grp'] = np.random.dirichlet(np.ones(sz), size=1).tolist()[0]
print(df)
results in:
letter value grp_sum prop_of_total prop_of_grp
0 a 1 12 0.015493 0.293481
1 a 3 12 0.114027 0.043973
2 a 2 12 0.309150 0.160818
3 a 6 12 0.033999 0.501729
4 b 7 16 0.365276 0.617484
5 b 5 16 0.144502 0.318075
6 b 4 16 0.017552 0.064442
but there's got to be a better way than iterating the unique values and filtering the dataframe for each. This is small but I'll have potentially tens of thousands of groupings of varying sizes of ~50-100 rows each, and each needs a different random distribution.
I have also considered creating a temporary dataframe for each grouping, appending to a second dataframe and finally merging the results, though that seems more convoluted than this. I have not found a solution where I can apply an array of groupby size to the groupby but I think something along those lines would do.
Thoughts? Suggestions? Solutions?
IIUC, do a transform():
def dirichlet(x, size=1):
    # draw one Dirichlet sample with as many components as the group has rows
    return np.array(np.random.dirichlet(np.ones(len(x)), size=size)[0])
df['prop_of_grp'] = df.groupby('letter')['value'].transform(dirichlet)
Output:
letter value grp_sum prop_of_total prop_of_grp
0 a 1 12 0.102780 0.127119
1 a 3 12 0.079201 0.219648
2 a 2 12 0.341158 0.020776
3 a 6 12 0.096956 0.632456
4 b 7 16 0.193970 0.269094
5 b 5 16 0.012905 0.516035
6 b 4 16 0.173031 0.214871
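A self-contained sketch of the same transform() approach, using a seeded NumPy Generator so the run is reproducible, plus a check that each group sums to 1. The helper name dirichlet_weights and the seed are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the example reproducible

df = pd.DataFrame({"letter": list("aaaabbb"),
                   "value": [1, 3, 2, 6, 7, 5, 4]})

def dirichlet_weights(values):
    # One Dirichlet draw with as many components as the group has rows;
    # the components are non-negative and sum to 1.
    return rng.dirichlet(np.ones(len(values)))

# transform() calls the function once per group and puts the returned array
# back on the rows of that group.
df["prop_of_grp"] = df.groupby("letter")["value"].transform(dirichlet_weights)

group_totals = df.groupby("letter")["prop_of_grp"].sum()
print(group_totals)
```

The function is called once per group rather than once per row, so the cost scales with the number of groups.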

Pandas sort groupby groups by arbitrary condition on their contents

Ok, this is getting ridiculous ... I've spent way too much time on something that should be trivial.
I want to group a data frame by a column, then sort the groups (not within the group) by some condition (in my case maximum over some column B in the group).
I expected something along these lines:
df.groupby('A').sort_index(lambda group_content: group_content.B.max())
I also tried:
groups = df.groupby('A')
maxx = groups['B'].max()
groups.sort_index(...)
But, of course, there is no sort_index on a groupby object.
EDIT:
I ended up using (almost) the solution suggested by @jezrael
df['max'] = df.groupby('A')['B'].transform('max')
df = df.sort_values(['max', 'B'], ascending=True).drop('max', axis=1)
groups = df.groupby('A', sort=False)
I had to add ascending=True to sort_values, but more importantly sort=False to groupby, otherwise the groups would come back sorted lexicographically (A contains strings).
If several groups may share the same maximum, use GroupBy.transform with 'max' to create a helper column, then sort with DataFrame.sort_values:
import pandas as pd
df = pd.DataFrame({
    'A': list('aaabcc'),
    'B': [7, 8, 9, 100, 20, 30]
})
# broadcast each group's maximum to all of its rows, then sort by it
df['max'] = df.groupby('A')['B'].transform('max')
df = df.sort_values(['max', 'A'])
print(df)
A B max
0 a 7 9
1 a 8 9
2 a 9 9
4 c 20 30
5 c 30 30
3 b 100 100
If the group maxima are always unique, use Series.argsort:
s = df.groupby('A')['B'].transform('max')
df = df.iloc[s.argsort()]
print (df)
A B
0 a 7
1 a 8
2 a 9
4 c 20
5 c 30
3 b 100
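A minimal sketch that puts the pieces together: reorder the rows by group maximum, then group with sort=False so that iteration follows that order. Passing kind="stable" to argsort is an addition here, intended to keep the original row order within each group; the names group_max, ordered and group_order are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"A": list("aaabcc"),
                   "B": [7, 8, 9, 100, 20, 30]})

# Maximum of B per group, broadcast to every row of that group.
group_max = df.groupby("A")["B"].transform("max")

# Stable argsort: rows with equal maxima (rows of the same group) keep their order.
ordered = df.iloc[group_max.argsort(kind="stable").to_numpy()]

# sort=False makes groupby yield groups in order of first appearance,
# which is now ascending order of the group maximum.
group_order = [key for key, _ in ordered.groupby("A", sort=False)]
print(group_order)
```

Without sort=False, groupby would sort the keys again and the ordering by maximum would be lost.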

Select from MultiIndex by labels with repetitions

I am having difficulty selecting from a MultiIndex in pandas 0.14.1 (I know that is an old version, but my choice is limited).
I need to do selection based on index labels.
For a single-level index, selection with repeated labels works fine:
(pd.DataFrame
.from_records({'A' : [1,2,3], 'B' : [11,12,13]})
.set_index('A')
).loc[idx[1,1,1,2,1], :]
B
A
1 11
1 11
1 11
2 12
1 11
For a multilevel index, selection works differently and returns only the unique values:
(pd.DataFrame
.from_records({'A' : [1,2,3], 'B' : [11,12,13], 'C' : [21,22,23]})
.set_index(['A', 'B'])
).loc[idx[[1,1,1,2,1], :], :]
C
A B
1 11 21
2 12 22
QUESTION: Is there any way to use a MultiIndex but preserve the selection behaviour of a single-level index? The expected output matches the single-index case: 5 rows returned, not 2.
Best I could come up with. Be warned, this will blow up for many reasons and I fully expect you to come back with "But this didn't work on my real data."
Option 1
This will always work, but may not always be what you expected.
pd.concat([df.xs(i, drop_level=False) for i in [1, 1, 1, 2, 1]])
C
A B
1 11 21
11 21
11 21
2 12 22
1 11 21
Option 2
This will break if your first level values aren't unique on their own.
df.iloc[df.index.get_level_values(0).searchsorted([1, 1, 1, 2, 1])]
C
A B
1 11 21
11 21
11 21
2 12 22
1 11 21
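For completeness, Option 1 as a self-contained sketch. It was written against a recent pandas rather than 0.14.1, so treat the exact behaviour on the old version as an assumption; the names labels and result are illustrative:

```python
import pandas as pd

df = (pd.DataFrame({"A": [1, 2, 3], "B": [11, 12, 13], "C": [21, 22, 23]})
        .set_index(["A", "B"]))

labels = [1, 1, 1, 2, 1]  # first-level labels, repetitions included

# xs() selects all rows for one first-level label; drop_level=False keeps
# both index levels so the pieces can be concatenated back together.
result = pd.concat(
    [df.xs(label, level="A", drop_level=False) for label in labels]
)
print(result)
```

Each label triggers a separate lookup, so this is simple but may be slow for long label lists.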