applying defined function over different ranges of rows in pandas - pandas

I have a df like this with tons of rows :
BB AA FF
2 5 0
3 7 A
6 5 A
9 6 A
8 3 0
And a function like this :
def test(a,b):
# a=array col AA
# b=array col BB
return (a*b)+a
I would like that for the rows in column FF where values are != 0 to apply the function test over that slice (array) of the df that involves column BB and AA to generate the following output in the new column ZZ:
BB AA FF ZZ
2 5 0 0
3 7 A 28
6 5 A 35
9 6 A 51
8 3 0 0
I was thinking in something like:
df['zz']= df.apply(lambda x: test(df.AA,df.BB) for the range of values among zero)
But my issue is that I am not sure on how to specify de arrays in column FF to apply the column

You can use DataFrame.apply + mask:
def test(x):
return (x[0]*x[1])+x[0]
df['ZZ']=df[['AA','BB']].apply(test,axis=1).mask(df['FF'].eq('0'),0)
print(df)
BB AA FF ZZ
0 2 5 0 0
1 3 7 A 28
2 6 5 A 35
3 9 6 A 60
4 8 3 0 0
or you can use lambda function:
df['ZZ']=df.apply(lambda x: x[['BB','AA']].prod()+ x['AA'] if x['FF'] != '0' else x['FF'],axis=1)
print(df)
BB AA FF ZZ
0 2 5 0 0
1 3 7 A 28
2 6 5 A 35
3 9 6 A 60
4 8 3 0 0

Related

iteration calculation based on another dataframe

How to do iteration calculation as shown in df2 as desired output ?
any reference links for this > many thanks for helping
df1
a b c
0 1 0 5
1 9 9 2
2 2 2 8
3 6 3 0
4 6 1 7
df2 :
a b c
0 1 0 5 >> values from df1
1 19 18 9 >> values from (df1.iloc[1] * 2) + df2.iloc[0] *1)
2 23 22 25 >> values from (df1.iloc[2] * 2) + df2.iloc[1] *1)
3 35 28 25 >> values from (df1.iloc[3] * 2) + df2.iloc[2] *1)
4 47 30 39 >> values from (df1.iloc[4] * 2) + df2.iloc[3] *1)
IIUC, you can try:
df2 = df1.mul(2).cumsum().sub(df1.iloc[0])
Output:
a b c
0 1 0 5
1 19 18 9
2 23 22 25
3 35 28 25
4 47 30 39
more complex operation
If you want x[n] = x[n]*2 + x[n-1]*2, you need to iterate:
def process(s):
out = [s[0]]
for x in s[1:]:
out.append(x*2+out[-1]*3)
return out
df1.apply(process)
Output:
a b c
0 1 0 5
1 21 18 19
2 67 58 73
3 213 180 219
4 651 542 671

Required data frame after explode or other option to fill a running difference b/w two columns pandas dataframe

Input data frame as given given below,
data = {
'labels': ["A","B","A","B","A","B","M","B","M","B","M"],
'start': [0,9,13,23,47,77,81,92,100,104,118],
'stop': [9,13,23,47,77,81,92,100,104,118,145],
}
df = pd.DataFrame.from_dict(data)
labels start stop
0 A 0 9
1 B 9 13
2 A 13 23
3 B 23 47
4 A 47 77
5 B 77 81
6 M 81 92
7 B 92 100
8 M 100 104
9 B 104 118
10 M 118 145
The output data frame required is as below,
Try this:
df['start'] = df.apply(lambda x: range(x['start'] + 1, x['stop'] + 1), axis=1)
df = df.explode('start')
Output:
>>> df
labels start stop
0 A 1 9
0 A 2 9
0 A 3 9
0 A 4 9
0 A 5 9
0 A 6 9
0 A 7 9
0 A 8 9
0 A 9 9
1 B 10 13
1 B 11 13
1 B 12 13
1 B 13 13
2 A 14 23
2 A 15 23
2 A 16 23
2 A 17 23
2 A 18 23
2 A 19 23
2 A 20 23
2 A 21 23
2 A 22 23
2 A 23 23
...

Python - Sort column ascending - using groupby

The following code:
import pandas as pd
df_original=pd.DataFrame({\
'race_num':[1,1,1,2,2,2,2,3,3],\
'race_position':[2,3,0,1,0,0,2,3,0],\
'percentage_place':[77,55,88,50,34,56,99,12,75]
})
Gives an output of:
race_num
race_position
percentage_place
1
2
77
1
3
55
1
0
88
2
1
50
2
0
34
2
0
56
2
2
99
3
3
12
3
0
75
I need to mainpulate this dataframe to keep the race_num grouped but sort the percentage place in ascending order - and the race_position is to stay aligned with the original percentage_place.
Desired out is:
race_num
race_position
percentage_place
1
0
88
1
2
77
1
3
55
2
2
99
2
0
56
2
1
50
2
0
34
3
0
75
3
3
12
My attempt is:
df_new = df_1.groupby(['race_num','race_position'])\['percentage_place'].nlargest().reset_index()
Thank you in advance.
Look into sort_values
In [137]: df_original.sort_values(['race_num', 'percentage_place'], ascending=[True, False])
Out[137]:
race_num race_position percentage_place
2 1 0 88
0 1 2 77
1 1 3 55
6 2 2 99
5 2 0 56
3 2 1 50
4 2 0 34
8 3 0 75
7 3 3 12

Winsorize within groups of dataframe

I have a dataframe like this:
df = pd.DataFrame([[1,2],
[1,4],
[1,5],
[2,65],
[2,34],
[2,23],
[2,45]], columns = ['label', 'score'])
Is there an efficient way to create a column score_winsor that winsorises the score column within the groups at the 1% level?
I tried this with no success:
df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: max(x.quantile(.01), min(x, x.quantile(.99))))
You could use scipy's implementation of winsorize
df["score_winsor"] = df.groupby('label')['score'].transform(lambda row: winsorize(row, limits=[0.01,0.01]))
Output
>>> df
label score score_winsor
0 1 2 2
1 1 4 4
2 1 5 5
3 2 65 65
4 2 34 34
5 2 23 23
6 2 45 45
This works:
df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: np.maximum(x.quantile(.01), np.minimum(x, x.quantile(.99))))
Output
print(df.to_string())
label score score_winsor
0 1 2 2.04
1 1 4 4.00
2 1 5 4.98
3 2 65 64.40
4 2 34 34.00
5 2 23 23.33
6 2 45 45.00

Split a column by element and create new ones with pandas

Goal: I want to split one single column by elements (not the strings cells) and, from that division, create new columns, where the element is the title of the new column and the other values from another columns compose the respective column.
There is a way of doing that with pandas? Thanks in advance.
Example:
[IN]:
A 1
A 2
A 6
A 99
B 7
B 8
B 19
B 18
[OUT]:
A B
1 7
2 8
6 19
99 18
Just an alternative if 2 column input data:
print(df)
col1 col2
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
df1=pd.DataFrame(df.groupby('col1')['col2'].apply(list).to_dict())
print(df1)
A B
0 1 7
1 2 8
2 6 19
3 99 18
Use Series.str.split with GroupBy.cumcount for counter, then reshape by DataFrame.set_index with Series.unstack:
print (df)
col
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
df1 = df['col'].str.split(expand=True)
g = df1.groupby(0).cumcount()
df2 = df1.set_index([0, g])[1].unstack(0).rename_axis(None, axis=1)
print (df2)
A B
0 1 7
1 2 8
2 6 19
3 99 18
If 2 columns input data:
print (df)
col1 col2
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
g = df.groupby('col1').cumcount()
df2 = df.set_index(['col1', g])['col2'].unstack(0).rename_axis(None, axis=1)
print (df2)
A B
0 1 7
1 2 8
2 6 19
3 99 18