Pandas groupby custom nlargest

When trying to solve my own question here, I came up with an interesting problem. Suppose I have this dataframe:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(dict(group=np.random.choice(["a", "b", "c", "d"], size=100),
                       values=np.random.randint(0, 100, size=100)))
I want to select the top values for each group, but according to some range: say, the top x-th to y-th values per group. If any group has fewer than x values in it, give the top(min((y-x), x)) values for that group.
In general, I am looking for a custom-made alternative function which could be used with groupby objects to select not the top n values, but the top x to y range of values.
EDIT: nlargest() is a special case of the solution to my problem where x = 1 and y = n
Any further help or guidance will be appreciated.
Adding an example with this df and top(3, 6). For every group, output the values from the 3rd largest up to the 6th largest:
group value
a 190
b 166
a 163
a 106
b 86
a 77
b 70
b 69
c 67
b 54
b 52
a 50
c 24
a 20
a 11
As group c has just two members, it will output top(3)
group value
a 106
a 77
a 50
b 69
b 54
b 52
c 67
c 24

There are other ways of doing this, and depending on how large your dataframe is you may want to search for "groupby slice" or something similar. You may also need to check that my boundary conditions are correct (<, <=, etc.). Note that this works against the example group/value frame shown above:
x = 3
y = 6
# this gets the groups which don't meet the x minimum
df1 = df[df.groupby('group')['value'].transform('count') < x]
# this df takes the groups meeting the minimum and shifts their values up by x-1
# (relying on the example frame already being sorted by value descending),
# does some cleanup and chooses the nlargest
df2 = df[df.groupby('group')['value'].transform('count') >= x].copy()
df2['shifted'] = df2.groupby('group')['value'].shift(periods=-(x-1))
df2.drop('value', axis=1, inplace=True)
df2 = (df2.groupby('group')['shifted'].nlargest(y-x)
          .reset_index()
          .rename(columns={'shifted': 'value'})
          .drop('level_1', axis=1))
# putting it all together
df_final = pd.concat([df1, df2])
df_final
group value
8 c 67.0
12 c 24.0
0 a 106.0
1 a 77.0
2 a 50.0
3 b 70.0
4 b 69.0
5 b 54.0
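A rank-based alternative that does not rely on the frame already being sorted could look something like this. It is a minimal sketch against the same example group/value frame, using the question's top(min(y - x, x)) rule as the fallback for small groups; as above, the < / <= boundaries may need double-checking:
x, y = 3, 6
# Rank rows within each group by descending value (rank 1 = largest in its group).
ranked = df.sort_values('value', ascending=False).copy()
ranked['rank'] = ranked.groupby('group').cumcount() + 1
counts = ranked.groupby('group')['value'].transform('count')
# Groups with at least x rows contribute ranks x .. y-1;
# smaller groups fall back to their top min(y - x, x) values.
in_range = (counts >= x) & (ranked['rank'] >= x) & (ranked['rank'] < y)
fallback = (counts < x) & (ranked['rank'] <= min(y - x, x))
result = ranked.loc[in_range | fallback].drop(columns='rank')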

Related

How to mark or select rows in one dataframe, where value lies between any of the ranges in another dataframe featuring additional identifier

I am a bloody beginner and working with pandas and numpy for scientific dataframes.
I have 2 large dataframes, one with coordinates (geneX - position) and the other with ranges of coordinates (genes - start - end). I would like to select (or mark) all rows in the first dataframe where the coordinate falls into any of the ranges in the second dataframe.
For example:
import pandas as pd
import numpy as np
test = {'A': ['gene1', 'gene1', 'gene2', 'gene3', 'gene3', 'gene3', 'gene4'],
        'B': [1, 11, 21, 31, 41, 51, 61],
        'C': [10, 20, 30, 40, 50, 60, 70],
        'D': [4, 64, 25, 7, 36, 56, 7]}
df1 = pd.DataFrame(test, columns=['A', 'D'])
df2 = pd.DataFrame(test, columns=['A', 'B', 'C'])
This gives two dataframes looking like this:
df1:
A D
0 gene1 4 <-- I want this row
1 gene1 64
2 gene2 25 <-- and this row
3 gene3 7
4 gene3 36 <-- and this row
5 gene3 56 <-- and this row
6 gene4 7
df2:
A B C
0 gene1 1 10
1 gene1 11 20
2 gene2 21 30
3 gene3 31 40
4 gene3 41 50
5 gene3 51 60
6 gene4 61 70
I managed to come this far:
df_final = []
for i, j in zip(df2["B"], df2["C"]):
    x = df1[(df1["D"] >= i) & (df1["D"] <= j)]
    df_final.append(x)
df_final = pd.concat(df_final, axis=0).reset_index(drop=True)
df_final = df_final.drop_duplicates()
df_final
But this only checks whether the value in df1['D'] is in any of the ranges in df2; I fail to incorporate the "gene" identifier. Basically, for each row in df1 I need to loop through df2, first check whether the gene matches, and if that's the case, check whether the coordinate is in the range.
Can anyone help me to figure this out?
Additional question: Can I make it so that it leaves df1 intact and just adds a new column with True for the rows that match the conditions? If not, I would create a new df from the selected rows, add a column with the True label and then merge it back into the first one.
Thank you for your help. I really appreciate it !
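For reference, the row-by-row check described above could be written directly like this. It is a naive sketch using the df1/df2 from the example, with a hypothetical helper name in_any_range; the merge-based answer below is the faster route:
# Naive sketch: for each row of df1, look only at the df2 rows with the same
# gene and test whether D falls inside any of their [B, C] ranges.
def in_any_range(row):
    ranges = df2[df2['A'] == row['A']]
    return ((ranges['B'] <= row['D']) & (row['D'] <= ranges['C'])).any()

df1['flag'] = df1.apply(in_any_range, axis=1)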
Let us do it in steps:
Reset the index of df1
Merge df1 with df2 on column A (basically merge rows with same genes)
Query the merged dataframe to filter the rows where column D falls between B and C
Flag the rows by testing the membership of index of df1 in the index column of filtered rows
m = df1.reset_index().merge(df2, on='A').query('B <= D <= C')
df1['flag'] = df1.index.isin(m['index'])
print(df1)
A D flag
0 gene1 4 True
1 gene1 64 False
2 gene2 25 True
3 gene3 7 False
4 gene3 36 True
5 gene3 56 True
6 gene4 7 False

Pandas Groupby -- efficient selection/filtering of groups based on multiple conditions?

I am trying to filter dataframe groups in Pandas based on multiple (any) conditions, but I cannot seem to get to a fast Pandas 'native' one-liner.
Here I generate an example dataframe of 2*n*n rows and 4 columns:
import itertools
import random

import pandas as pd

n = 100
lst = range(0, n)
df = pd.DataFrame(
    {'A': list(itertools.chain.from_iterable(itertools.repeat(x, n*2) for x in lst)),
     'B': list(itertools.chain.from_iterable(itertools.repeat(x, 1*2) for x in lst)) * n,
     'C': random.choices(list(range(100)), k=2*n*n),
     'D': random.choices(list(range(100)), k=2*n*n)})
resulting in dataframes such as:
A B C D
0 0 0 26 49
1 0 0 29 80
2 0 1 70 92
3 0 1 7 2
4 1 0 90 11
5 1 0 19 4
6 1 1 29 4
7 1 1 31 95
I want to select groups grouped by A and B, and filter the groups down to those where any values in the group are greater than 50 in both columns C and D.
A "native" Pandas one-liner would be the following:
df.groupby([df.A, df.B]).filter(lambda x: (x.C > 50).any() & (x.D > 50).any())
which produces
A B C D
2 0 1 70 92
3 0 1 7 2
This is all fine for small dataframes (say n < 20).
But this solution takes quite long (for example, 4.58 s when n = 100) for large dataframes.
I have an alternative, step-by-step solution which achieves the same result, but runs much faster (28.1 ms when n = 100):
test_g = df.assign(key_C=df.C > 50, key_D=df.D > 50).groupby([df.A, df.B])
test_C_bool = test_g.key_C.transform('any')
test_D_bool = test_g.key_D.transform('any')
df[test_C_bool & test_D_bool]
but it is arguably a bit uglier. My questions are:
Is there a better "native" Pandas solution for this task?
Is there a reason for the sub-optimal performance of my version of the "native" solution?
Bonus question:
In fact I only want to extract the groups, not their data. I.e., I only need
A B
0 1
in the above example. Is there a way to do this with Pandas without going through the intermediate step I did above?
This is similar to your second approach, but chained together:
mask = (df[['C', 'D']].gt(50)           # pass a list like [50, 60] if C and D need different thresholds
          .all(axis=1)                  # check for both True on the rows
          .groupby([df['A'], df['B']])  # normal groupby
          .transform('max')             # 'any' instead of 'max' also works
       )
df.loc[mask]
If you don't want the data, you can forgo the transform:
mask = df[['C','D']].min(axis=1).gt(50).groupby([df['A'],df['B']]).any()
mask[mask].index
# out
# MultiIndex([(0, 1)],
# names=['A', 'B'])
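If a plain DataFrame of the qualifying (A, B) pairs is preferred over a MultiIndex, the index from the snippet above can be converted directly; a small sketch:
# Turn the MultiIndex of qualifying (A, B) group keys into a regular DataFrame.
keys = mask[mask].index.to_frame(index=False)
#    A  B
# 0  0  1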

how to do a proper pandas group by, where, sum

I am having a hard time using a group by + where to apply a sum to a broader range.
Given this code:
from io import StringIO

import pandas as pd
import numpy as np

f = pd.read_csv(StringIO("""
fund_id,l_s,val
fund1,L,10
fund1,L,20
fund1,S,30
fund2,L,15
fund2,L,25
fund2,L,35
"""))
# fund total - works as expected
f['fund_total'] = f.groupby('fund_id')['val'].transform(np.sum)
# fund L total - applied only to L rows.
f['fund_total_l'] = f[f['l_s'] == "L"].groupby('fund_id')['val'].transform(np.sum)
f
this code gets me close:
The numbers are correct, but I would like the fund_total_l column to show 30 for all rows of fund1 (not just the L rows). I want a fund-level summary, but with the sum filtered by the l_s column.
I know I can do this with multiple steps, but this needs to be a single operation. I can use a separate generic function if that helps.
playground: https://repl.it/repls/UnusualImpeccableDaemons
Use Series.where to create NaN values; these will be ignored in your sum:
f['val_temp'] = f['val'].where(f['l_s'] == "L")
f['fund_total_l'] = f.groupby('fund_id')['val_temp'].transform('sum')
f = f.drop(columns='val_temp')
Or in one line using assign:
f['fund_total_l'] = (
    f.assign(val_temp=f['val'].where(f['l_s'] == "L"))
     .groupby('fund_id')['val_temp'].transform('sum')
)
Another way would be to partly use your solution, but then use DataFrame.reindex to get the original index back, and ffill and bfill to fill up the NaN:
f['fund_total_l'] = (
    f[f['l_s'] == "L"]
    .groupby('fund_id')['val']
    .transform('sum')
    .reindex(f.index)
    .ffill()
    .bfill()
)
fund_id l_s val fund_total_l
0 fund1 L 10 30.0
1 fund1 L 20 30.0
2 fund1 S 30 30.0
3 fund2 L 15 75.0
4 fund2 L 25 75.0
5 fund2 L 35 75.0
I think there is a more elegant solution, but I'm not able to broadcast the results back to the individual rows.
Essentially, with a boolean mask of all the "L" rows
f.groupby("fund_id").apply(lambda g:sum(g["val"]*(g["l_s"]=="L")))
you obtain
fund_id
fund1 30
fund2 75
dtype: int64
Now we can just merge after using reset_index:
pd.merge(f, f.groupby("fund_id").apply(lambda g:sum(g["val"]*(g["l_s"]=="L"))).reset_index(), on="fund_id")
to obtain
fund_id l_s val 0
0 fund1 L 10 30
1 fund1 L 20 30
2 fund1 S 30 30
3 fund2 L 15 75
4 fund2 L 25 75
5 fund2 L 35 75
However, I'd guess that the merging is not necessary and the same result can be obtained directly from the apply.
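One way to sketch that idea, building on the apply above (hedged, not a definitive implementation): compute the per-fund totals once and map them back onto the rows via fund_id, which avoids the merge and keeps a proper column name.
# Per-fund sum of the L rows only, broadcast back to every row of f via map.
l_totals = f.groupby('fund_id').apply(lambda g: g.loc[g['l_s'] == 'L', 'val'].sum())
f['fund_total_l'] = f['fund_id'].map(l_totals)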

Pandas sort groupby groups by arbitrary condition on their contents

Ok, this is getting ridiculous ... I've spent way too much time on something that should be trivial.
I want to group a data frame by a column, then sort the groups (not within the group) by some condition (in my case maximum over some column B in the group).
I expected something along these lines:
df.groupby('A').sort_index(lambda group_content: group_content.B.max())
I also tried:
groups = df.groupby('A')
maxx = groups['B'].max()
groups.sort_index(...)
But, of course, there is no sort_index on a groupby object...
EDIT:
I ended up using (almost) the solution suggested by @jezrael:
df['max'] = df.groupby('A')['B'].transform('max')
df = df.sort_values(['max', 'B'], ascending=True).drop('max', axis=1)
groups = df.groupby('A', sort=False)
I had to add ascending=True to sort_values, but more importantly sort=False to groupby, otherwise the groups would be sorted lexicographically (A contains strings).
If several groups can possibly share the same max, I think you need GroupBy.transform with max to create a new column, and then sort by DataFrame.sort_values:
df = pd.DataFrame({
    'A': list('aaabcc'),
    'B': [7, 8, 9, 100, 20, 30]
})
df['max'] = df.groupby('A')['B'].transform('max')
df = df.sort_values(['max','A'])
print (df)
A B max
0 a 7 9
1 a 8 9
2 a 9 9
4 c 20 30
5 c 30 30
3 b 100 100
If the max values are always unique, use Series.argsort:
s = df.groupby('A')['B'].transform('max')
df = df.iloc[s.argsort()]
print (df)
A B
0 a 7
1 a 8
2 a 9
4 c 20
5 c 30
3 b 100
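On pandas 1.1 or newer, sort_values also accepts a key callable, which allows sorting rows by their group's maximum without a helper column; a hedged sketch using the same A/B frame as above:
# Sort rows by the max of B within their A group; kind='stable' keeps the
# original within-group order.
group_max = df.groupby('A')['B'].max()
df_sorted = df.sort_values('A', key=lambda col: col.map(group_max), kind='stable')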

Calculate row wise percentage in pandas

I have a data frame as shown below
id val1 val2 val3
a 100 60 40
b 20 18 12
c 160 140 100
For each row I want to calculate the percentage each value contributes to the row total.
The expected output as shown below
id val1 val2 val3
a 50 30 20
b 40 36 24
c 40 35 25
I tried the following code:
df['sum'] = df['val1'] + df['val2'] + df['val3']
df['val1'] = df['val1'] / df['sum']
df['val2'] = df['val2'] / df['sum']
df['val3'] = df['val3'] / df['sum']
I would like to know whether there is an easier, alternative way to do this in pandas.
We can do the following:
We slice the correct columns with iloc
Use apply with axis=1 to apply each calculation row wise
We use div, sum and mul to divide each value by the row's sum and multiply it by 100 to get the percentages as whole numbers rather than decimals
We convert our floats back to int with astype
df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: x.div(x.sum()).mul(100), axis=1).astype(int)
Output
id val1 val2 val3
0 a 50 30 20
1 b 40 36 24
2 c 40 35 25
Or a vectorized solution, accessing the numpy arrays underneath our dataframe.
Note: this method should perform better in terms of speed
vals = df.iloc[:, 1:].to_numpy()
df.iloc[:, 1:] = (vals / vals.sum(axis=1, keepdims=True) * 100).astype(int)
Or similar, but using the pandas DataFrame.div method (proposed by Jon Clements):
df.iloc[:, 1:] = df.iloc[:, 1:].div(df.iloc[:, 1:].sum(1), axis=0).mul(100)