Winsorize within groups of dataframe - pandas

I have a dataframe like this:
df = pd.DataFrame([[1,2],
[1,4],
[1,5],
[2,65],
[2,34],
[2,23],
[2,45]], columns = ['label', 'score'])
Is there an efficient way to create a column score_winsor that winsorises the score column within the groups at the 1% level?
I tried this with no success:
df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: max(x.quantile(.01), min(x, x.quantile(.99))))

You could use scipy's implementation of winsorize
df["score_winsor"] = df.groupby('label')['score'].transform(lambda row: winsorize(row, limits=[0.01,0.01]))
Output
>>> df
label score score_winsor
0 1 2 2
1 1 4 4
2 1 5 5
3 2 65 65
4 2 34 34
5 2 23 23
6 2 45 45

This works:
df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: np.maximum(x.quantile(.01), np.minimum(x, x.quantile(.99))))
Output
print(df.to_string())
label score score_winsor
0 1 2 2.04
1 1 4 4.00
2 1 5 4.98
3 2 65 64.40
4 2 34 34.00
5 2 23 23.33
6 2 45 45.00

Related

Python - Sort column ascending - using groupby

The following code:
import pandas as pd
df_original=pd.DataFrame({\
'race_num':[1,1,1,2,2,2,2,3,3],\
'race_position':[2,3,0,1,0,0,2,3,0],\
'percentage_place':[77,55,88,50,34,56,99,12,75]
})
Gives an output of:
race_num
race_position
percentage_place
1
2
77
1
3
55
1
0
88
2
1
50
2
0
34
2
0
56
2
2
99
3
3
12
3
0
75
I need to mainpulate this dataframe to keep the race_num grouped but sort the percentage place in ascending order - and the race_position is to stay aligned with the original percentage_place.
Desired out is:
race_num
race_position
percentage_place
1
0
88
1
2
77
1
3
55
2
2
99
2
0
56
2
1
50
2
0
34
3
0
75
3
3
12
My attempt is:
df_new = df_1.groupby(['race_num','race_position'])\['percentage_place'].nlargest().reset_index()
Thank you in advance.
Look into sort_values
In [137]: df_original.sort_values(['race_num', 'percentage_place'], ascending=[True, False])
Out[137]:
race_num race_position percentage_place
2 1 0 88
0 1 2 77
1 1 3 55
6 2 2 99
5 2 0 56
3 2 1 50
4 2 0 34
8 3 0 75
7 3 3 12

Keep the second entry in a dataframe

I am showing you below an example dataset and the output desired.
ID number
1 50
1 49
1 48
2 47
2 40
2 31
3 60
3 51
3 42
Example output
1 49
2 40
3 51
I want to keep the second entry for every group in my dataset. I have already grouped them by ID but not I want for each Id to keep the second entry and remove all the duplicates afterwards from ID.
Use GroupBy.nth with 1 for second rows, because python counts from 0:
df1 = df.groupby('ID', as_index=False).nth(1)
print (df1)
ID number
1 1 49
4 2 40
7 3 51
Another solution with GroupBy.cumcount for counter and filtering by boolean indexing:
df1 = df[df.groupby('ID').cumcount() == 1]
Details:
print (df.groupby('ID').cumcount())
0 0
1 1
2 2
3 0
4 1
5 2
6 0
7 1
8 2
dtype: int64
EDIT: Solution for second maximal value -s first sorting and then get second row - values has to be unique per groups:
df = (df.sort_values(['ID','number'], ascending=[True, False])
.groupby('ID', as_index=False)
.nth(1))
print (df)
ID number
1 1 49
4 2 40
7 3 51
If want second maximal value if exist duplicates add DataFrame.drop_duplicates:
print (df)
ID number
0 1 50 <-first max
1 1 50 <-first max
2 1 48 <-second max
3 2 47
4 2 40
5 2 31
6 3 60
7 3 51
8 3 42
df3 = (df.drop_duplicates(['ID','number'])
.sort_values(['ID','number'], ascending=[True, False])
.groupby('ID', as_index=False)
.nth(1))
print (df3)
ID number
2 1 48
4 2 40
7 3 51
If that is the case we can use duplicated + drop_duplicates
df=df[df.duplicated('ID')].drop_duplicates('ID')
ID number
1 1 49
4 2 40
7 3 51
Flexible solution cumcount
df[df.groupby('ID').cumcount()==1].copy()
ID number
1 1 49
4 2 40
7 3 51

Pandas get order of column value grouped by other column value

I have the following dataframe:
srch_id price
1 30
1 20
1 25
3 15
3 102
3 39
Now I want to create a third column in which I determine the price position grouped by the search id. This is the result I want:
srch_id price price_position
1 30 3
1 20 1
1 25 2
3 15 1
3 102 3
3 39 2
I think I need to use the transform function. However I can't seem to figure out how I should handle the argument I get using .transform():
def k(r):
return min(r)
tmp = train.groupby('srch_id')['price']
train['min'] = tmp.transform(k)
Because r is either a list or an element?
You can use series.rank() with df.groupby():
df['price_position']=df.groupby('srch_id')['price'].rank()
print(df)
srch_id price price_position
0 1 30 3.0
1 1 20 1.0
2 1 25 2.0
3 3 15 1.0
4 3 102 3.0
5 3 39 2.0
is this:
df['price_position'] = df.sort_values('price').groupby('srch_id').price.cumcount() + 1
Out[1907]:
srch_id price price_position
0 1 30 3
1 1 20 1
2 1 25 2
3 3 15 1
4 3 102 3
5 3 39 2

Pandas: Calculate percentage of column for each class

I have a dataframe like this:
Class Boolean Sum
0 1 0 10
1 1 1 20
2 2 0 15
3 2 1 25
4 3 0 52
5 3 1 48
I want to calculate percentage of 0/1's for each class, so for example the output could be:
Class Boolean Sum %
0 1 0 10 0.333
1 1 1 20 0.666
2 2 0 15 0.375
3 2 1 25 0.625
4 3 0 52 0.520
5 3 1 48 0.480
Divide column Sum with GroupBy.transform for return Series with same length as original DataFrame filled by aggregated values:
df['%'] = df['Sum'].div(df.groupby('Class')['Sum'].transform('sum'))
print (df)
Class Boolean Sum %
0 1 0 10 0.333333
1 1 1 20 0.666667
2 2 0 15 0.375000
3 2 1 25 0.625000
4 3 0 52 0.520000
5 3 1 48 0.480000
Detail:
print (df.groupby('Class')['Sum'].transform('sum'))
0 30
1 30
2 40
3 40
4 100
5 100
Name: Sum, dtype: int64

Finding the maximum value for each column and the corspondace vlaue of a common column

I am trying to get the maximum value from each column in a dataframe with their time that they occur.
l = [[1,6,2,6,7],[2,66,2,6,8],[3,44,2,44,8],[4,5,35,6,8],[5,3,9,6,95]]
dft = pd.DataFrame(l, columns=['Time','25','50','75','100'])
max_t = pd.DataFrame()
max_t['Max_f'] = dft.loc[:, ['25','50','75','100']].max(axis=0)
max_t
I managed to get the maximum value in each column, however, I could not figure out how to get the time.
IIUC:
In [48]: dft
Out[48]:
Time 25 50 75 100
0 1 6 2 6 7
1 2 66 2 6 8
2 3 44 2 44 8
3 4 5 35 6 8
4 5 3 9 6 95
In [49]: dft.set_index('Time').agg(['max','idxmax']).T
Out[49]:
max idxmax
25 66 2
50 35 4
75 44 3
100 95 5