Using groupby - create a new column based on a condition on another column in pandas


I have a data frame as shown below
B_ID Session no_show cumulative_no_show u_no_show
1 s1 0.4 0.4 0.4
2 s1 0.6 1.0 1.0
3 s1 0.2 1.2 0.2
4 s1 0.1 1.3 0.3
5 s1 0.4 1.7 0.7
6 s1 0.2 1.9 0.9
7 s1 0.3 2.2 0.2
10 s2 0.3 0.3 0.3
11 s2 0.4 0.7 0.7
12 s2 0.3 1.0 1.0
13 s2 0.6 1.6 0.6
14 s2 0.2 1.8 1.8
15 s2 0.5 2.3 0.3
From the above I would like to compute a new column, slot_num, that depends on u_no_show as explained below. Within each Session, if u_no_show increases compared with the previous row, increase slot_num by one; otherwise keep it the same.
Expected Output
B_ID Session no_show cumulative_no_show u_no_show slot_num
1 s1 0.4 0.4 0.4 1
2 s1 0.6 1.0 1.0 2
3 s1 0.2 1.2 0.2 2
4 s1 0.1 1.3 0.3 3
5 s1 0.4 1.7 0.7 4
6 s1 0.2 1.9 0.9 5
7 s1 0.3 2.2 0.2 5
10 s2 0.3 0.3 0.3 1
11 s2 0.4 0.7 0.7 2
12 s2 0.3 1.0 1.0 3
13 s2 0.6 1.6 0.6 3
14 s2 0.2 1.8 0.8 4
15 s2 0.5 2.3 0.3 4

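I would do this with two groupby calls: the first flags the rows where u_no_show increased within a Session, the second takes a running count of those flags per Session: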
s = df.groupby('Session').u_no_show.diff().gt(0).astype(int)
df['slot_num'] = s.groupby(df.Session).cumsum().add(1)
Output:
B_ID Session no_show cumulative_no_show u_no_show slot_num
0 1 s1 0.4 0.4 0.4 1
1 2 s1 0.6 1.0 1.0 2
2 3 s1 0.2 1.2 0.2 2
3 4 s1 0.1 1.3 0.3 3
4 5 s1 0.4 1.7 0.7 4
5 6 s1 0.2 1.9 0.9 5
6 7 s1 0.3 2.2 0.2 5
7 10 s2 0.3 0.3 0.3 1
8 11 s2 0.4 0.7 0.7 2
9 12 s2 0.3 1.0 1.0 3
10 13 s2 0.6 1.6 0.6 3
11 14 s2 0.2 1.8 1.8 4
12 15 s2 0.5 2.3 0.3 4
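For anyone who wants to run the answer end to end, here is a minimal self-contained sketch. It assumes the u_no_show values from the expected-output table (so B_ID 14 is 0.8), keeps only the columns the computation needs, and the intermediate name increased is purely illustrative.

```python
import pandas as pd

# Sample data; u_no_show values are taken from the expected-output table.
df = pd.DataFrame({
    "B_ID": [1, 2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14, 15],
    "Session": ["s1"] * 7 + ["s2"] * 6,
    "u_no_show": [0.4, 1.0, 0.2, 0.3, 0.7, 0.9, 0.2,
                  0.3, 0.7, 1.0, 0.6, 0.8, 0.3],
})

# True where u_no_show rose relative to the previous row of the same Session.
# The first row of each Session has a NaN diff, which compares as False.
increased = df.groupby("Session")["u_no_show"].diff().gt(0)

# Running count of increases per Session, shifted so numbering starts at 1.
df["slot_num"] = increased.astype(int).groupby(df["Session"]).cumsum().add(1)

print(df)
```

The second groupby is what makes the counter restart at 1 for every Session; a plain cumsum would keep counting across sessions.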

Related

update the value of column based on groupby condition - Pandas

I have a data frame as shown below
Session slot_num ID prob
s1 1 A 0.2
s1 2 B 0.9
s1 2 B 0.4
s1 2 B 0.4
s1 3 C 0.7
s1 4 D 0.8
s1 4 D 0.3
s1 5 E 0.6
s1 6 F 0.5
s1 7 G 0.7
s2 1 A1 0.6
s2 2 B1 0.5
s2 3 C1 1.1
s2 3 C1 0.6
s2 4 D1 0.7
s2 5 E1 0.6
s2 6 F1 0.7
s2 7 G1 1.2
s2 7 G1 0.7
If several rows share the same Session and slot_num, change the ID of every row except the first one to TBF.
Expected Output:
Session slot_num ID prob
s1 1 A 0.2
s1 2 B 0.9
s1 2 TBF 0.4
s1 2 TBF 0.4
s1 3 C 0.7
s1 4 D 0.8
s1 4 TBF 0.3
s1 5 E 0.6
s1 6 F 0.5
s1 7 G 0.7
s2 1 A1 0.6
s2 2 B1 0.5
s2 3 C1 1.1
s2 3 TBF 0.6
s2 4 D1 0.7
s2 5 E1 0.6
s2 6 F1 0.7
s2 7 G1 1.2
s2 7 TBF 0.7

Use DataFrame.duplicated to build a boolean mask, then DataFrame.loc to set the values selected by that mask:
df.loc[df.duplicated(['Session','slot_num']), 'ID'] = 'TBF'
print (df)
Session slot_num ID prob
0 s1 1 A 0.2
1 s1 2 B 0.9
2 s1 2 TBF 0.4
3 s1 2 TBF 0.4
4 s1 3 C 0.7
5 s1 4 D 0.8
6 s1 4 TBF 0.3
7 s1 5 E 0.6
8 s1 6 F 0.5
9 s1 7 G 0.7
10 s2 1 A1 0.6
11 s2 2 B1 0.5
12 s2 3 C1 1.1
13 s2 3 TBF 0.6
14 s2 4 D1 0.7
15 s2 5 E1 0.6
16 s2 6 F1 0.7
17 s2 7 G1 1.2
18 s2 7 TBF 0.7
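A small runnable sketch of the same idea, on a shortened, made-up subset of the data. By default duplicated uses keep='first', so the first row of each (Session, slot_num) pair is not flagged and keeps its ID.

```python
import pandas as pd

# Shortened illustrative data, not the full table from the question.
df = pd.DataFrame({
    "Session": ["s1", "s1", "s1", "s1", "s2", "s2"],
    "slot_num": [1, 2, 2, 2, 3, 3],
    "ID": ["A", "B", "B", "B", "C1", "C1"],
})

# True for every repeat of a (Session, slot_num) pair after its first occurrence.
mask = df.duplicated(subset=["Session", "slot_num"])

# Overwrite ID only on the flagged rows.
df.loc[mask, "ID"] = "TBF"

print(df)
```

Note that duplicated looks at the whole frame, not only at adjacent rows, so this relies on each (Session, slot_num) pair appearing in one contiguous run, as it does in the question.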

Create a new column to indicate the trend on a particular column using pandas groupby

I have a data frame as shown below
Session ID cumulative_prob
s1 1 0.4
s1 3 0.9
s1 4 -0.1
s1 5 0.3
s1 8 1.2
s1 9 0.2
s2 22 0.4
s2 29 0.7
s2 31 1.4
s2 32 0.4
s2 34 0.9
s3 36 0.9
s3 37 -0.1
s3 38 0.2
s3 40 1.0
From the above I would like to create a new column which indicates the session-wise trend (increase or decrease), i.e. whether cumulative_prob dropped compared with the previous row of the same Session.
Expected Output:
Session ID cumulative_prob Decrease
s1 1 0.4 no
s1 3 0.9 no
s1 4 -0.1 yes
s1 5 0.3 no
s1 8 1.2 no
s1 9 0.2 yes
s2 22 0.4 no
s2 29 0.7 no
s2 31 1.4 no
s2 32 0.4 yes
s2 34 0.9 no
s3 36 0.9 no
s3 37 -0.1 yes
s3 38 0.2 no
s3 40 1.0 no
Note: Keep the default 'no' for the first row of each Session

IIUC, GroupBy.diff and np.where:
import numpy as np
df['Decrease'] = np.where(df.groupby('Session')['cumulative_prob']
.diff()
.lt(0),
'yes',
'no')
print(df)
Session ID cumulative_prob Decrease
0 s1 1 0.4 no
1 s1 3 0.9 no
2 s1 4 -0.1 yes
3 s1 5 0.3 no
4 s1 8 1.2 no
5 s1 9 0.2 yes
6 s2 22 0.4 no
7 s2 29 0.7 no
8 s2 31 1.4 no
9 s2 32 0.4 yes
10 s2 34 0.9 no
11 s3 36 0.9 no
12 s3 37 -0.1 yes
13 s3 38 0.2 no
14 s3 40 1.0 no
We can also use Series.map:
(df.groupby('Session')['cumulative_prob']
.diff()
.lt(0)
.map({True : 'yes' , False : 'no'}))
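Both variants can be compared side by side with a minimal sketch like the one below. The data is a shortened, made-up subset, and the names fell and Decrease_map are illustrative only. The first row of each Session gets a NaN diff, and NaN.lt(0) is False, which is why those rows default to 'no' without extra handling.

```python
import numpy as np
import pandas as pd

# Shortened illustrative data with two sessions.
df = pd.DataFrame({
    "Session": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "cumulative_prob": [0.4, 0.9, -0.1, 0.4, 0.7, 0.4],
})

# True where the value dropped relative to the previous row in the same Session.
fell = df.groupby("Session")["cumulative_prob"].diff().lt(0)

# Variant 1: np.where returns a NumPy array of labels.
df["Decrease"] = np.where(fell, "yes", "no")

# Variant 2: Series.map returns a Series aligned on the original index.
df["Decrease_map"] = fell.map({True: "yes", False: "no"})

print(df)
```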

groupby cumulative in pandas then update using numpy based specific condition

I have a data frame as shown below.
B_ID No_Show Session slot_num Patient_count
1 0.4 S1 1 1
2 0.3 S1 2 1
3 0.8 S1 3 1
4 0.3 S1 3 2
5 0.6 S1 4 1
6 0.8 S1 5 1
7 0.9 S1 5 2
8 0.4 S1 5 3
9 0.6 S1 5 4
12 0.9 S2 1 1
13 0.5 S2 1 2
14 0.3 S2 2 1
15 0.7 S2 3 1
20 0.7 S2 4 1
16 0.6 S2 5 1
17 0.8 S2 5 2
19 0.3 S2 5 3
From the above I would like to find the cumulative No_Show by Session
df['Cum_No_show'] = df.groupby(['Session'])['No_Show'].cumsum()
Now we get
B_ID No_Show Session slot_num Patient_count Cum_No_show
1 0.4 S1 1 1 0.4
2 0.3 S1 2 1 0.7
3 0.8 S1 3 1 1.5
4 0.3 S1 3 2 1.8
5 0.6 S1 4 1 2.4
6 0.8 S1 5 1 3.2
7 0.9 S1 5 2 4.1
8 0.4 S1 5 3 4.5
9 0.6 S1 5 4 5.1
12 0.9 S2 1 1 0.9
13 0.5 S2 1 2 1.4
14 0.3 S2 2 1 1.7
15 0.7 S2 3 1 2.4
20 0.7 S2 4 1 3.1
16 0.6 S2 5 1 3.7
17 0.8 S2 5 2 4.5
19 0.3 S2 5 3 4.8
From the above I would like to create two new columns, named as below
U_slot_num = Updated slot number
U_No_show = Updated cumulative no show
Whenever the cumulative no show is > 0.6, make the next slot_num the same as the current one, increase the patient count by one, and update U_No_show by subtracting 1, as shown in the expected output.
Expected output:
No_Show Session slot_num Patient_count Cum_No_show U_slot_num U_No_show
0.4 S1 1 1 0.4 1 0.4
0.3 S1 2 1 0.7 2 0.7
0.8 S1 3 1 1.5 2 0.5
0.3 S1 3 2 1.8 3 0.8
0.6 S1 4 1 2.4 3 0.4
0.8 S1 5 1 3.2 4 1.2
0.9 S1 5 2 4.1 4 0.2
0.4 S1 5 3 4.5 5 0.6
0.6 S1 5 4 5.1 6 1.2
0.9 S2 1 1 0.9 1 0.9
0.5 S2 1 2 1.4 1 0.4
0.3 S2 2 1 1.7 2 0.7
0.7 S2 3 1 2.4 2 0.4
0.7 S2 4 1 3.1 3 1.1
0.6 S2 5 1 3.7 3 0.7
0.8 S2 5 2 4.5 3 0.5
0.3 S2 5 3 4.8 4 0.8

calculate more than one column using for loop based multiple specific conditions in pandas

I have a dataframe as shown below.
B_ID No_Show Session slot_num Patient_count
1 0.2 S1 1 1
2 0.3 S1 2 1
3 0.8 S1 3 1
4 0.3 S1 3 2
5 0.6 S1 4 1
6 0.8 S1 5 1
7 0.9 S1 5 2
8 0.4 S1 5 3
9 0.6 S1 5 4
12 0.9 S2 1 1
13 0.5 S2 1 2
14 0.3 S2 2 1
15 0.7 S2 3 1
20 0.7 S2 4 1
16 0.6 S2 5 1
17 0.8 S2 5 2
19 0.3 S2 5 3
where
No_Show = Probability of no show
Assume that
p = [0.2, 0.4] and Duration for each slot = 30 (minutes)
p = threshold probabilities
From the above I would like to calculate the data frame below
Step 1
Sort the dataframe by Session, slot_num and Patient_count (ascending, so that Patient_count = 1 comes first within each slot)
df = df.sort_values(['Session', 'slot_num', 'Patient_count'])
Step 2
Calculate the cut-off for each threshold using the conditions below
if Patient_count = 1, divide No_Show by the threshold probability
Example for B_ID = 3, Patient_count = 1, cut_off = 0.8/0.2 = 4
else if Patient_count = 2, multiply the previous No_Show in the same slot by the current No_Show and divide by the threshold
Example for B_ID = 4, Patient_count = 2, cut_off = (0.3*0.8)/0.2 = 1.2
else if Patient_count = 3, multiply the previous two No_Show values in the same slot by the current No_Show and divide by the threshold
Example for B_ID = 8, Patient_count = 3, cut_off = (0.4*0.9*0.8)/0.2 = 1.44
And so on
The Expected Output:
B_ID No_Show Session slot_num Patient_count Cut_off_0.2 Cut_off_0.4
1 0.2 S1 1 1 1 0.5
2 0.3 S1 2 1 1.5 0.75
3 0.8 S1 3 1 4 2
4 0.3 S1 3 2 1.2 0.6
5 0.6 S1 4 1 3 1.5
6 0.8 S1 5 1 4 2
7 0.9 S1 5 2 3.6 1.8
8 0.4 S1 5 3 1.44 0.72
9 0.6 S1 5 4 0.864 0.432
12 0.9 S2 1 1 4.5 2.25
13 0.5 S2 1 2 2.25 1.125
14 0.3 S2 2 1 1.5 0.75
15 0.7 S2 3 1 3.5 1.75
20 0.7 S2 4 1 3.5 1.75
16 0.6 S2 5 1 3 1.5
17 0.8 S2 5 2 2.4 1.2
19 0.3 S2 5 3 0.72 0.36
I tried the code below
p = [0.2, 0.4]
for i in p:
    df['Cut_off_'+'i'] = df.groupby(['Session','slot_num'])['No_Show'].cumprod().div(i)

Your solution works if you build the new column names with f-strings, using {i} so that the value of i (not the literal string 'i') ends up in each name:
p = [0.2, 0.4]
for i in p:
    df[f'Cut_off_{i}'] = df.groupby(['Session','slot_num'])['No_Show'].cumprod().div(i)
A solution with numpy is also possible: the cumulative product is converted to a numpy array and divided by p via broadcasting, then the result is converted to a DataFrame and joined to the original.
import numpy as np
import pandas as pd
p = [0.2, 0.4]
arr = df.groupby(['Session','slot_num'])['No_Show'].cumprod().values[:, None] / np.array(p)
df = df.join(pd.DataFrame(arr, columns=p, index=df.index).add_prefix('Cut_off_'))
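To make the broadcasting step concrete, here is a minimal runnable sketch on a made-up four-row subset. The names running and cut_offs are illustrative, and to_numpy() is used in place of .values, which should be equivalent here. The column vector of shape (n, 1) divided by the threshold array of shape (2,) yields an (n, 2) array, one column per threshold.

```python
import numpy as np
import pandas as pd

# Illustrative subset: two slots of session S1.
df = pd.DataFrame({
    "Session": ["S1", "S1", "S1", "S1"],
    "slot_num": [3, 3, 5, 5],
    "No_Show": [0.8, 0.3, 0.8, 0.9],
})
p = [0.2, 0.4]

# Running product of No_Show within each (Session, slot_num) group.
running = df.groupby(["Session", "slot_num"])["No_Show"].cumprod()

# (n, 1) / (2,) broadcasts to (n, 2): one column per threshold in p.
arr = running.to_numpy()[:, None] / np.array(p)

cut_offs = pd.DataFrame(arr, index=df.index,
                        columns=[f"Cut_off_{t}" for t in p])
df = df.join(cut_offs)

print(df)
```

Passing index=df.index keeps the join aligned even when the original frame does not have a default 0..n-1 index.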

divide a column based on groupby or looping conditions in pandas

I have a data frame as shown below
B_ID No_Show Session slot_num Patient_count
1 0.2 S1 1 1
2 0.3 S1 2 1
3 0.8 S1 3 1
4 0.3 S1 3 2
5 0.6 S1 4 1
6 0.8 S1 5 1
7 0.9 S1 5 2
8 0.4 S1 5 3
9 0.6 S1 5 4
12 0.9 S2 1 1
13 0.5 S2 1 2
14 0.3 S2 2 1
15 0.7 S2 3 1
20 0.7 S2 4 1
16 0.6 S2 5 1
17 0.8 S2 5 2
19 0.3 S2 5 3
where
No_Show = Probability of no show
Assume that
threshold probability = 0.2
Duration for each slot = 30 (minutes)
From the above I would like to calculate the data frame below
Step 1
Sort the dataframe by Session, slot_num and Patient_count (ascending, so that Patient_count = 1 comes first within each slot)
df = df.sort_values(['Session', 'slot_num', 'Patient_count'])
Step 2
Calculate the cut-off using the conditions below
if Patient_count = 1
divide No_Show by the threshold probability
Example for B_ID = 3, Patient_count = 1, cut_off = 0.8/0.2 = 4
else if Patient_count = 2
multiply the previous No_Show in the same slot by the current No_Show and divide by the threshold
Example for B_ID = 4, Patient_count = 2, cut_off = (0.3*0.8)/0.2 = 1.2
else if Patient_count = 3
multiply the previous two No_Show values in the same slot by the current No_Show and divide by the threshold
Example for B_ID = 8, Patient_count = 3, cut_off = (0.4*0.9*0.8)/0.2 = 1.44
And so on
The Expected Output:
B_ID No_Show Session slot_num Patient_count Cut_off
1 0.2 S1 1 1 1
2 0.3 S1 2 1 1.5
3 0.8 S1 3 1 4
4 0.3 S1 3 2 1.2
5 0.6 S1 4 1 3
6 0.8 S1 5 1 4
7 0.9 S1 5 2 3.6
8 0.4 S1 5 3 1.44
9 0.6 S1 5 4 0.864
12 0.9 S2 1 1 4.5
13 0.5 S2 1 2 2.25
14 0.3 S2 2 1 1.5
15 0.7 S2 3 1 3.5
20 0.7 S2 4 1 3.5
16 0.6 S2 5 1 3
17 0.8 S2 5 2 2.4
19 0.3 S2 5 3 0.72

Use GroupBy.cumprod to get the running product of No_Show within each (Session, slot_num) group, then divide by the threshold probability with Series.div:
probability = 0.2
df['new'] = df.groupby(['Session','slot_num'])['No_Show'].cumprod().div(probability)
print (df)
B_ID No_Show Session slot_num Patient_count new
0 1 0.2 S1 1 1 1.000
1 2 0.3 S1 2 1 1.500
2 3 0.8 S1 3 1 4.000
3 4 0.3 S1 3 2 1.200
4 5 0.6 S1 4 1 3.000
5 6 0.8 S1 5 1 4.000
6 7 0.9 S1 5 2 3.600
7 8 0.4 S1 5 3 1.440
8 9 0.6 S1 5 4 0.864
9 12 0.9 S2 1 1 4.500
10 13 0.5 S2 1 2 2.250
11 14 0.3 S2 2 1 1.500
12 15 0.7 S2 3 1 3.500
13 20 0.7 S2 4 1 3.500
14 16 0.6 S2 5 1 3.000
15 17 0.8 S2 5 2 2.400
16 19 0.3 S2 5 3 0.720
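Because cumprod follows the current row order, the sort from step 1 matters. The sketch below is a hedged illustration on the four slot-5 rows of session S1, deliberately shuffled first; the column name Cut_off and the reset_index call are choices made for this example only.

```python
import pandas as pd

# Slot 5 of session S1, deliberately out of order.
df = pd.DataFrame({
    "B_ID": [8, 6, 7, 9],
    "No_Show": [0.4, 0.8, 0.9, 0.6],
    "Session": ["S1"] * 4,
    "slot_num": [5] * 4,
    "Patient_count": [3, 1, 2, 4],
})
probability = 0.2

# Step 1: ascending sort so Patient_count 1 comes first within each slot.
df = (df.sort_values(["Session", "slot_num", "Patient_count"])
        .reset_index(drop=True))

# Step 2: running product of No_Show per slot, divided by the threshold.
df["Cut_off"] = (df.groupby(["Session", "slot_num"])["No_Show"]
                   .cumprod()
                   .div(probability))

print(df)
```

After sorting, the rows appear as B_ID 6, 7, 8, 9, and the cut-offs match the 4, 3.6, 1.44, 0.864 shown in the expected output for that slot.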
