Pandas sort groupby groups by arbitrary condition on its contents - pandas

Ok, this is getting ridiculous ... I've spent way too much time on something that should be trivial.
I want to group a data frame by a column, then sort the groups (not within the group) by some condition (in my case maximum over some column B in the group).
I expected something along these lines:
df.groupby('A').sort_index(lambda group_content: group_content.B.max())
I also tried:
groups = df.groupby('A')
maxx = groups['B'].max()
groups.sort_index(...)
But, of course, there is no sort_index on a groupby object.
EDIT:
I ended up using (almost) the solution suggested by @jezrael:
df['max'] = df.groupby('A')['B'].transform('max')
df = df.sort_values(['max', 'B'], ascending=True).drop('max', axis=1)
groups = df.groupby('A', sort=False)
I had to add ascending=True to sort_values, but more importantly sort=False to groupby; otherwise the groups would come out sorted lexicographically (A contains strings).
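A minimal sketch of why sort=False matters, on a small made-up frame (the data and the names sorted_keys and ordered_keys are illustrative assumptions, not my real data). With sort=False the groups should come out in order of first appearance in the already-sorted frame, while the default re-sorts the keys:

```python
import pandas as pd

# Made-up sample: group 'b' has the largest maximum, 'a' the smallest.
df = pd.DataFrame({"A": list("aaabcc"), "B": [7, 8, 9, 100, 20, 30]})

# Order the rows by each group's maximum of B, then by B itself.
df["max"] = df.groupby("A")["B"].transform("max")
df = df.sort_values(["max", "B"], ascending=True).drop("max", axis=1)

# Default groupby re-sorts the keys; sort=False keeps first-appearance order.
sorted_keys = [key for key, _ in df.groupby("A", sort=True)]
ordered_keys = [key for key, _ in df.groupby("A", sort=False)]

print(sorted_keys)
print(ordered_keys)
```

Iterating over the sort=False groupby then visits the groups from the smallest maximum to the largest.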

If several groups can share the same maximum, use GroupBy.transform with 'max' to create a new column and then sort with DataFrame.sort_values:
import pandas as pd

df = pd.DataFrame({
    'A': list('aaabcc'),
    'B': [7, 8, 9, 100, 20, 30]
})
df['max'] = df.groupby('A')['B'].transform('max')
df = df.sort_values(['max','A'])
print (df)
A B max
0 a 7 9
1 a 8 9
2 a 9 9
4 c 20 30
5 c 30 30
3 b 100 100
If the maximum values are always unique across groups, use Series.argsort:
s = df.groupby('A')['B'].transform('max')
df = df.iloc[s.argsort()]
print (df)
A B
0 a 7
1 a 8
2 a 9
4 c 20
5 c 30
3 b 100
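One caveat, as a hedged sketch: Series.argsort defaults to kind='quicksort', which NumPy does not guarantee to be stable, so the original row order inside a group may not survive on larger frames. Passing kind='stable' should keep rows with equal keys in their original order. The frame below repeats the sample above; the name out is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": list("aaabcc"), "B": [7, 8, 9, 100, 20, 30]})

# Every row carries its group's maximum of B.
s = df.groupby("A")["B"].transform("max")

# A stable sort keeps rows with the same key (same group) in original order.
positions = s.argsort(kind="stable").to_numpy()
out = df.iloc[positions]

print(out)
```

If two groups shared the same maximum, a stable sort would still interleave their rows whenever they are interleaved in the original frame, which is why this variant assumes unique maxima.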

Related

Pandas Groupby -- efficient selection/filtering of groups based on multiple conditions?

I am trying to filter DataFrame groups in Pandas based on multiple (any) conditions, but I cannot seem to get to a fast, Pandas-"native" one-liner.
Here I generate an example dataframe of 2*n*n rows and 4 columns:
import itertools
import random
import pandas as pd

n = 100
lst = range(0, n)
df = pd.DataFrame(
    {'A': list(itertools.chain.from_iterable(itertools.repeat(x, n*2) for x in lst)),
     'B': list(itertools.chain.from_iterable(itertools.repeat(x, 1*2) for x in lst)) * n,
     'C': random.choices(list(range(100)), k=2*n*n),
     'D': random.choices(list(range(100)), k=2*n*n)
    })
resulting in dataframes such as:
A B C D
0 0 0 26 49
1 0 0 29 80
2 0 1 70 92
3 0 1 7 2
4 1 0 90 11
5 1 0 19 4
6 1 1 29 4
7 1 1 31 95
I want to select the groups, grouped by A and B, in which at least one value is greater than 50 in column C and at least one value is greater than 50 in column D.
A "native" Pandas one-liner would be the following:
df.groupby([df.A, df.B]).filter(lambda x: ((x.C>50).any() & (x.D>50).any()) )
which produces
A B C D
2 0 1 70 92
3 0 1 7 2
This is all fine for small dataframes (say n < 20).
But this solution takes quite long (for example, 4.58 s when n = 100) for large dataframes.
I have an alternative, step-by-step solution which achieves the same result, but runs much faster (28.1 ms when n = 100):
test_g = df.assign(key_C = df.C>50, key_D = df.D>50).groupby([df.A, df.B])
test_C_bool = test_g.key_C.transform('any')
test_D_bool = test_g.key_D.transform('any')
df[test_C_bool & test_D_bool]
but it is arguably a bit uglier. My questions are:
Is there a better "native" Pandas solution for this task?
Is there a reason for the sub-optimal performance of my version of the "native" solution?
Bonus question:
In fact, I only want to extract the group keys, not the groups together with their data. I.e., I only need
A B
0 1
in the above example. Is there a way to do this with Pandas without going through the intermediate step I did above?
This is similar to your second approach, but chained together:
mask = (df[['C','D']].gt(50)        # if `C` and `D` have different thresholds, pass e.g. [50, 60]
        .all(axis=1)                # True where a single row exceeds the threshold in both columns
        .groupby([df['A'],df['B']]) # normal groupby
        .transform('max')           # 'any' instead of 'max' also works
       )
df.loc[mask]
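Note that the chained mask above tests both thresholds on the same row, which is stricter than the filter in the question (any C > 50 and any D > 50, possibly on different rows). A hedged sketch of a chained variant that should reproduce the question's filter, on a small made-up frame; the names keys, per_column, mask_any and mask_row are illustrative assumptions:

```python
import pandas as pd

# Made-up sample; in group (1, 0) C and D exceed 50 on different rows.
df = pd.DataFrame({
    "A": [0, 0, 0, 0, 1, 1, 1, 1],
    "B": [0, 0, 1, 1, 0, 0, 1, 1],
    "C": [26, 29, 70, 7, 90, 19, 29, 31],
    "D": [49, 80, 92, 2, 11, 80, 4, 95],
})
keys = [df["A"], df["B"]]

# Per column: does any row of the group exceed 50? Then require both columns.
per_column = df[["C", "D"]].gt(50).groupby(keys).transform("any")
mask_any = per_column.all(axis=1)
result = df[mask_any]

# Row-wise variant for comparison: one row must exceed 50 in both columns.
mask_row = df[["C", "D"]].gt(50).all(axis=1).groupby(keys).transform("any")

# Reference: the slow groupby.filter formulation from the question.
expected = df.groupby(["A", "B"]).filter(
    lambda g: bool((g["C"] > 50).any() and (g["D"] > 50).any())
)
print(result)
```

On this sample the per-column mask keeps groups (0, 1) and (1, 0), while the row-wise mask keeps only (0, 1).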
If you don't want the data, you can forgo the transform:
mask = df[['C','D']].min(axis=1).gt(50).groupby([df['A'],df['B']]).any()
mask[mask].index
# out
# MultiIndex([(0, 1)],
# names=['A', 'B'])
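If the per-column condition from the question is what is needed (any C > 50 and any D > 50 within the group, not necessarily on the same row), the group keys can be extracted in a similar way. This is a sketch on made-up data; flags and selected are illustrative names:

```python
import pandas as pd

# Made-up sample; in group (1, 0) C and D exceed 50 on different rows.
df = pd.DataFrame({
    "A": [0, 0, 0, 0, 1, 1, 1, 1],
    "B": [0, 0, 1, 1, 0, 0, 1, 1],
    "C": [26, 29, 70, 7, 90, 19, 29, 31],
    "D": [49, 80, 92, 2, 11, 80, 4, 95],
})

# One row per group, one boolean per column: "any value above 50?"
flags = df[["C", "D"]].gt(50).groupby([df["A"], df["B"]]).any()

# Keep the keys of the groups where both columns have such a value.
selected = flags.index[flags.all(axis=1)]
print(selected.tolist())
```

Because the aggregation returns one row per group, no transform or row-level mask is needed here.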

Pandas groupby sort each group values and order dataframe groups based on max of each group

I have a dataset containing 3 columns. I’m trying to group the rows and print the groups in sorted order (based on the highest value in each group). The records within each group also have to be sorted.
Dataset looks like below.
key1,key2,val
b,y,21
c,y,25
c,z,10
b,x,20
b,z,5
c,x,17
a,x,15
a,y,18
a,z,100
df=pd.read_csv('/tmp/hello.csv')
df['max'] = df.groupby(['key1'])['val'].transform('max')
dff=df.sort_values(['max', 'val'], ascending=False).drop('max', axis=1)
I'm applying transform because it works on a per-group basis, and then sorting the values.
The above code results in my desired dataframe:
a,z,100
a,y,18
a,x,15
c,y,25
c,x,17
c,z,10
b,y,21
b,x,20
b,z,5
But the same code fails for the dataset below.
key1,key2,val
b,y,10
c,y,10
c,z,10
b,x,2
b,z,2
c,x,2
a,x,2
a,y,2
a,z,2
Below is the desired output
key1,key2,val
c,y,10
c,z,10
c,x,2
b,y,10
b,x,2
b,z,2
a,x,2
a,y,2
a,z,2
Please help me in properly grouping and sorting the dataframe for my scenario.
Add the column key1 to sort_values: in the second DataFrame several groups share the same maximum value (10), so sorting by max alone cannot distinguish the groups:
df['max'] = df.groupby(['key1'])['val'].transform('max')
dff=df.sort_values(['max','key1', 'val'], ascending=False).drop('max', axis=1)
print (dff)
key1 key2 val
8 a z 100
7 a y 18
6 a x 15
1 c y 25
5 c x 17
2 c z 10
0 b y 21
3 b x 20
4 b z 5
The same code applied to the second DataFrame:
df['max'] = df.groupby(['key1'])['val'].transform('max')
dff=df.sort_values(['max','key1', 'val'], ascending=False).drop('max', axis=1)
print (dff)
key1 key2 val
1 c y 10
2 c z 10
5 c x 2
0 b y 10
3 b x 2
4 b z 2
6 a x 2
7 a y 2
8 a z 2
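If the tie between groups should be broken in the other direction (for example b before c), sort_values also accepts one ascending flag per column. A hedged sketch using the second dataset from the question; the name out is illustrative:

```python
import pandas as pd

# Second dataset from the question: groups b and c share the maximum 10.
df = pd.DataFrame({
    "key1": list("bccbbcaaa"),
    "key2": list("yyzxzxxyz"),
    "val": [10, 10, 10, 2, 2, 2, 2, 2, 2],
})

df["max"] = df.groupby("key1")["val"].transform("max")

# max descending, key1 ascending as tie-breaker, val descending inside a group.
out = (
    df.sort_values(["max", "key1", "val"], ascending=[False, True, False])
      .drop(columns="max")
)
print(out)
```

Here key1 only decides the order among groups with equal maxima; groups with different maxima are still ordered by max.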

Generate list of values summing to 1 - within groupby?

In the spirit of "Generating a list of random numbers, summing to 1" from several years ago, is there a way to apply the array returned by np.random.dirichlet to each group of a DataFrame groupby?
For example, I can loop through the unique values of the letter column and apply one at a time:
import numpy as np
import pandas as pd

df = pd.DataFrame([['a', 1], ['a', 3], ['a', 2], ['a', 6],
                   ['b', 7], ['b', 5], ['b', 4]], columns=['letter', 'value'])
df['grp_sum'] = df.groupby('letter')['value'].transform('sum')
df['prop_of_total'] = np.random.dirichlet(np.ones(len(df)), size=1).tolist()[0]
for letter in df['letter'].unique():
    sz = len(df[df['letter'] == letter])
    df.loc[df['letter'] == letter, 'prop_of_grp'] = np.random.dirichlet(np.ones(sz), size=1).tolist()[0]
print(df)
results in:
letter value grp_sum prop_of_total prop_of_grp
0 a 1 12 0.015493 0.293481
1 a 3 12 0.114027 0.043973
2 a 2 12 0.309150 0.160818
3 a 6 12 0.033999 0.501729
4 b 7 16 0.365276 0.617484
5 b 5 16 0.144502 0.318075
6 b 4 16 0.017552 0.064442
but there's got to be a better way than iterating the unique values and filtering the dataframe for each. This is small but I'll have potentially tens of thousands of groupings of varying sizes of ~50-100 rows each, and each needs a different random distribution.
I have also considered creating a temporary dataframe for each grouping, appending to a second dataframe and finally merging the results, though that seems more convoluted than this. I have not found a solution where I can apply an array of groupby size to the groupby but I think something along those lines would do.
Thoughts? Suggestions? Solutions?
IIUC, do a transform():
def direchlet(x, size=1):
    return np.array(np.random.dirichlet(np.ones(len(x)), size=size)[0])
df['prop_of_grp'] = df.groupby('letter')['value'].transform(direchlet)
Output:
letter value grp_sum prop_of_total prop_of_grp
0 a 1 12 0.102780 0.127119
1 a 3 12 0.079201 0.219648
2 a 2 12 0.341158 0.020776
3 a 6 12 0.096956 0.632456
4 b 7 16 0.193970 0.269094
5 b 5 16 0.012905 0.516035
6 b 4 16 0.173031 0.214871
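A quick way to check the result is to sum the proportions per group. The sketch below swaps in NumPy's Generator API with a fixed seed so that runs are reproducible; the seed and the names rng, dirichlet_weights and sums are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

df = pd.DataFrame({
    "letter": list("aaaabbb"),
    "value": [1, 3, 2, 6, 7, 5, 4],
})

def dirichlet_weights(group):
    # One Dirichlet draw with as many components as the group has rows.
    return rng.dirichlet(np.ones(len(group)))

df["prop_of_grp"] = df.groupby("letter")["value"].transform(dirichlet_weights)

# Each group's proportions should sum to 1 (up to floating-point error).
sums = df.groupby("letter")["prop_of_grp"].sum()
print(sums)
```

Each group gets its own independent draw, because transform calls the function once per group.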

Pandas find columns with wildcard names

I have a pandas dataframe with column names like this:
id ColNameOrig_x ColNameOrig_y
There are many such columns; the 'x' and 'y' suffixes came about because 2 datasets with similar column names were merged.
What I need to do:
df.ColName = df.ColNameOrig_x + df.ColNameOrig_y
I am now manually repeating this line for many columns (close to 50). Is there a wildcard way of doing this?
You can use DataFrame.filter together with DataFrame.groupby, grouping the column names with a lambda function and axis=1 and aggregating with sum. Alternatively, derive the group labels with text functions such as Series.str.split with indexing, or with string slicing:
df1 = df.filter(like='_').groupby(lambda x: x.split('_')[0], axis=1).sum()
print (df1)
ColName1Orig ColName2Orig
0 3 7
1 11 15
df1 = df.filter(like='_').groupby(df.columns.str.split('_').str[0], axis=1).sum()
print (df1)
ColName1Orig ColName2Orig
0 3 7
1 11 15
df1 = df.filter(like='_').groupby(df.columns.str[:12], axis=1).sum()
print (df1)
ColName1Orig ColName2Orig
0 3 7
1 11 15
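As far as I know, passing axis=1 to DataFrame.groupby is deprecated in recent pandas releases (since 2.1), so on newer versions the same idea can be sketched by transposing, grouping the rows, and transposing back. The sample frame matches the output above; the name summed is illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    columns=["ColName1Orig_x", "ColName1Orig_y",
             "ColName2Orig_x", "ColName2Orig_y"],
)

# Transpose so the column names become the index, group on the part
# before the first underscore, sum, and transpose back.
summed = (
    df.filter(like="_")
      .T
      .groupby(lambda name: name.split("_")[0])
      .sum()
      .T
)
print(summed)
```

Transposing copies the data, so for very wide or mixed-dtype frames a plain loop over the column pairs may be cheaper.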
You can use the subscripting syntax to access column names dynamically:
col_groups = ['ColName1', 'ColName2']
for grp in col_groups:
    df[grp] = df[f'{grp}Orig_x'] + df[f'{grp}Orig_y']
Or you can aggregate by column group. For example:
df = pd.DataFrame([
    [1,2,3,4],
    [5,6,7,8]
], columns=['ColName1Orig_x', 'ColName1Orig_y', 'ColName2Orig_x', 'ColName2Orig_y'])
# Here's your opportunity to define the wildcard
col_groups = df.columns.str.extract('(.+)Orig_[xy]')[0]
df.columns = [col_groups, df.columns]
df.groupby(level=0, axis=1).sum()
Input:
ColName1Orig_x ColName1Orig_y ColName2Orig_x ColName2Orig_y
1 2 3 4
5 6 7 8
Output:
ColName1 ColName2
3 7
11 15
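If the base names are not known up front, they can be discovered from the column names themselves. A minimal sketch, assuming every pair follows the <base>Orig_x / <base>Orig_y naming from the question and that the frame also has an unrelated id column (the data are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [10, 11],
    "ColName1Orig_x": [1, 5],
    "ColName1Orig_y": [2, 6],
    "ColName2Orig_x": [3, 7],
    "ColName2Orig_y": [4, 8],
})

suffix_x, suffix_y = "Orig_x", "Orig_y"

# Iterate over a snapshot of the columns, because new ones are added in the loop.
for col in list(df.columns):
    if col.endswith(suffix_x):
        base = col[: -len(suffix_x)]   # e.g. "ColName1"
        partner = base + suffix_y      # e.g. "ColName1Orig_y"
        if partner in df.columns:
            df[base] = df[col] + df[partner]

print(df[["id", "ColName1", "ColName2"]])
```

Columns without a matching partner, such as id, are left untouched.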

pandas dataframe filter by sequence of values in a specific column

I have a dataframe
A B C
1 2 3
2 3 4
3 8 7
I want to take only the rows where column C contains the sequence 3, 4 (in this scenario, the first two rows).
What is the best way to do so?
You can use rolling for a general solution that works with any pattern:
import numpy as np

pat = np.asarray([3,4])
N = len(pat)
mask= (df['C'].rolling(window=N , min_periods=N)
              .apply(lambda x: (x==pat).all(), raw=True)
              .mask(lambda x: x == 0)
              .bfill(limit=N-1)
              .fillna(0)
              .astype(bool))
df = df[mask]
print (df)
A B C
0 1 2 3
1 2 3 4
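Explanation:
use rolling.apply to test each window of N values against the pattern (1 marks the last row of a match)
replace the 0s with NaNs using mask
use bfill with limit=N-1 to propagate each match back over the N-1 preceding rows of its window
replace the remaining NaNs with 0 using fillna
finally, cast to bool with astype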
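The same windowed comparison can also be sketched directly in NumPy with sliding_window_view (available in NumPy 1.20 and later), which is a swapped-in technique rather than the rolling approach above. The frame repeats the question's sample; windows, starts and result are illustrative names, and the column is assumed to have at least len(pat) rows:

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 8], "C": [3, 4, 7]})

pat = np.array([3, 4])
n = len(pat)

# Every length-n window of column C, one window per row of `windows`.
windows = sliding_window_view(df["C"].to_numpy(), n)

# Start positions of the windows that equal the pattern element-wise.
starts = np.flatnonzero((windows == pat).all(axis=1))

# Mark all rows covered by a matching window.
mask = np.zeros(len(df), dtype=bool)
for start in starts:
    mask[start:start + n] = True

result = df[mask]
print(result)
```

Overlapping matches are handled naturally, since each matching window simply marks its own rows.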
Use shift:
In [1085]: s = df.eq(3).any(axis=1) & df.shift(-1).eq(4).any(axis=1)
In [1086]: df[s | s.shift(fill_value=False)]
Out[1086]:
A B C
0 1 2 3
1 2 3 4
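The snippet above matches a 3 followed by a 4 in any column. If only column C should be considered, as the question asks, a hedged sketch restricted to that column could look like this (c, starts and result are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 8], "C": [3, 4, 7]})

c = df["C"]

# Rows where C is 3 and the next row's C is 4 (start of the sequence).
starts = c.eq(3) & c.shift(-1).eq(4)

# Also keep the row right after each start (the 4 itself).
mask = starts | starts.shift(1, fill_value=False)

result = df[mask]
print(result)
```

For longer patterns this needs one shifted comparison per element, at which point a windowed approach is usually easier to read.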