Using group by - create a new column based on the condition on the other column in pandas - pandas

I have a data frame as shown below
B_ID Session no_show cumulative_no_show u_no_show
1 s1 0.4 0.4 0.4
2 s1 0.6 1.0 1.0
3 s1 0.2 1.2 0.2
4 s1 0.1 1.3 0.3
5 s1 0.4 1.7 0.7
6 s1 0.2 1.9 0.9
7 s1 0.3 2.2 0.2
10 s2 0.3 0.3 0.3
11 s2 0.4 0.7 0.7
12 s2 0.3 1.0 1.0
13 s2 0.6 1.6 0.6
14 s2 0.2 1.8 1.8
15 s2 0.5 2.3 0.3
From the above I would like to compute a new column slot_num that depends on u_no_show, as explained below. Within each Session, if u_no_show increases compared with the previous row, increase slot_num by one; otherwise keep it the same.
Expected Output
B_ID Session no_show cumulative_no_show u_no_show slot_num
1 s1 0.4 0.4 0.4 1
2 s1 0.6 1.0 1.0 2
3 s1 0.2 1.2 0.2 2
4 s1 0.1 1.3 0.3 3
5 s1 0.4 1.7 0.7 4
6 s1 0.2 1.9 0.9 5
7 s1 0.3 2.2 0.2 5
10 s2 0.3 0.3 0.3 1
11 s2 0.4 0.7 0.7 2
12 s2 0.3 1.0 1.0 3
13 s2 0.6 1.6 0.6 3
14 s2 0.2 1.8 0.8 4
15 s2 0.5 2.3 0.3 4

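I would do this with two groupby calls: the first flags the rows where u_no_show increased within a Session, and the second takes the running count of those flags: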
s = df.groupby('Session').u_no_show.diff().gt(0).astype(int)
df['slot_num'] = s.groupby(df.Session).cumsum().add(1)
Output:
B_ID Session no_show cumulative_no_show u_no_show slot_num
0 1 s1 0.4 0.4 0.4 1
1 2 s1 0.6 1.0 1.0 2
2 3 s1 0.2 1.2 0.2 2
3 4 s1 0.1 1.3 0.3 3
4 5 s1 0.4 1.7 0.7 4
5 6 s1 0.2 1.9 0.9 5
6 7 s1 0.3 2.2 0.2 5
7 10 s2 0.3 0.3 0.3 1
8 11 s2 0.4 0.7 0.7 2
9 12 s2 0.3 1.0 1.0 3
10 13 s2 0.6 1.6 0.6 3
11 14 s2 0.2 1.8 1.8 4
12 15 s2 0.5 2.3 0.3 4
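To make the two steps easier to follow, here is a minimal self-contained sketch of the same idea. It rebuilds only the Session and u_no_show columns (using 0.8 for B_ID 14, as in the expected output), and the intermediate name `increased` is just an illustrative choice:

```python
import pandas as pd

df = pd.DataFrame({
    "Session": ["s1"] * 7 + ["s2"] * 6,
    "u_no_show": [0.4, 1.0, 0.2, 0.3, 0.7, 0.9, 0.2,
                  0.3, 0.7, 1.0, 0.6, 0.8, 0.3],
})

# True where u_no_show rose relative to the previous row of the same Session.
# The first row of each Session has a NaN diff, which compares as False.
increased = df.groupby("Session")["u_no_show"].diff().gt(0)

# Running count of increases per Session, shifted so numbering starts at 1.
df["slot_num"] = increased.astype(int).groupby(df["Session"]).cumsum().add(1)

print(df)
```

Because the first row of every Session contributes 0 to the running count, the `.add(1)` at the end is what makes each Session start at slot 1.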

Related

update the value of column based on groupby condition - Pandas

I have a data frame as shown below
Session slot_num ID prob
s1 1 A 0.2
s1 2 B 0.9
s1 2 B 0.4
s1 2 B 0.4
s1 3 C 0.7
s1 4 D 0.8
s1 4 D 0.3
s1 5 E 0.6
s1 6 F 0.5
s1 7 G 0.7
s2 1 A1 0.6
s2 2 B1 0.5
s2 3 C1 1.1
s2 3 C1 0.6
s2 4 D1 0.7
s2 5 E1 0.6
s2 6 F1 0.7
s2 7 G1 1.2
s2 7 G1 0.7
If Session and slot_num are the same, change the ID of every row except the first one to TBF.
Expected Output:
Session slot_num ID prob
s1 1 A 0.2
s1 2 B 0.9
s1 2 TBF 0.4
s1 2 TBF 0.4
s1 3 C 0.7
s1 4 D 0.8
s1 4 TBF 0.3
s1 5 E 0.6
s1 6 F 0.5
s1 7 G 0.7
s2 1 A1 0.6
s2 2 B1 0.5
s2 3 C1 1.1
s2 3 TBF 0.6
s2 4 D1 0.7
s2 5 E1 0.6
s2 6 F1 0.7
s2 7 G1 1.2
s2 7 TBF 0.7
Use DataFrame.duplicated to build a mask and DataFrame.loc to set values by that mask:
df.loc[df.duplicated(['Session','slot_num']), 'ID'] = 'TBF'
print (df)
Session slot_num ID prob
0 s1 1 A 0.2
1 s1 2 B 0.9
2 s1 2 TBF 0.4
3 s1 2 TBF 0.4
4 s1 3 C 0.7
5 s1 4 D 0.8
6 s1 4 TBF 0.3
7 s1 5 E 0.6
8 s1 6 F 0.5
9 s1 7 G 0.7
10 s2 1 A1 0.6
11 s2 2 B1 0.5
12 s2 3 C1 1.1
13 s2 3 TBF 0.6
14 s2 4 D1 0.7
15 s2 5 E1 0.6
16 s2 6 F1 0.7
17 s2 7 G1 1.2
18 s2 7 TBF 0.7
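As a small sketch of why this works, the example below uses a shortened, made-up subset of the data and keeps the mask in a separate variable (`mask` is only an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    "Session": ["s1", "s1", "s1", "s1", "s2", "s2"],
    "slot_num": [1, 2, 2, 2, 1, 1],
    "ID": ["A", "B", "B", "B", "A1", "A1"],
})

# duplicated() defaults to keep="first": the first row of each
# (Session, slot_num) pair is False, every later repeat is True.
mask = df.duplicated(["Session", "slot_num"])

# Overwrite ID only on the repeated rows.
df.loc[mask, "ID"] = "TBF"

print(df)
```

Note that no groupby is needed here, since duplicated already compares rows on the given column subset.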

Create a new column to indicate the trend on a particular column using pandas groupby

I have a data frame as shown below
Session ID cumulative_prob
s1 1 0.4
s1 3 0.9
s1 4 -0.1
s1 5 0.3
s1 8 1.2
s1 9 0.2
s2 22 0.4
s2 29 0.7
s2 31 1.4
s2 32 0.4
s2 34 0.9
s3 36 0.9
s3 37 -0.1
s3 38 0.2
s3 40 1.0
From the above I would like to create a new column which indicates the session-wise trend (increase or decrease).
Expected Output:
Session ID cumulative_prob Decrease
s1 1 0.4 no
s1 3 0.9 no
s1 4 -0.1 yes
s1 5 0.3 no
s1 8 1.2 no
s1 9 0.2 yes
s2 22 0.4 no
s2 29 0.7 no
s2 31 1.4 no
s2 32 0.4 yes
s2 34 0.9 no
s3 36 0.9 no
s3 37 -0.1 yes
s3 38 0.2 no
s3 40 1.0 no
Note: Keep the default 'no' for the first row of each Session.
IIUC, GroupBy.diff and np.where:
import numpy as np
df['Decrease'] = np.where(df.groupby('Session')['cumulative_prob']
                            .diff()
                            .lt(0),
                          'yes',
                          'no')
print(df)
Session ID cumulative_prob Decrease
0 s1 1 0.4 no
1 s1 3 0.9 no
2 s1 4 -0.1 yes
3 s1 5 0.3 no
4 s1 8 1.2 no
5 s1 9 0.2 yes
6 s2 22 0.4 no
7 s2 29 0.7 no
8 s2 31 1.4 no
9 s2 32 0.4 yes
10 s2 34 0.9 no
11 s3 36 0.9 no
12 s3 37 -0.1 yes
13 s3 38 0.2 no
14 s3 40 1.0 no
We can also use Series.map:
(df.groupby('Session')['cumulative_prob']
   .diff()
   .lt(0)
   .map({True : 'yes' , False : 'no'}))
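A short sketch comparing the two variants on a reduced, made-up sample follows. The names `dropped` and `Decrease_map` are only for illustration. It also shows why the first row of each Session ends up as 'no': its diff is NaN, and NaN < 0 evaluates to False.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Session": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "cumulative_prob": [0.4, 0.9, -0.1, 0.4, 0.7, 0.4],
})

# Boolean Series: True where the value fell relative to the previous
# row of the same Session; NaN diffs (first rows) become False.
dropped = df.groupby("Session")["cumulative_prob"].diff().lt(0)

# Variant 1: np.where picks 'yes' / 'no' element-wise.
df["Decrease"] = np.where(dropped, "yes", "no")

# Variant 2: Series.map translates the booleans through a dict.
df["Decrease_map"] = dropped.map({True: "yes", False: "no"})

print(df)
```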

groupby cumulative in pandas then update using numpy based specific condition

I have a data frame as shown below.
B_ID No_Show Session slot_num Patient_count
1 0.4 S1 1 1
2 0.3 S1 2 1
3 0.8 S1 3 1
4 0.3 S1 3 2
5 0.6 S1 4 1
6 0.8 S1 5 1
7 0.9 S1 5 2
8 0.4 S1 5 3
9 0.6 S1 5 4
12 0.9 S2 1 1
13 0.5 S2 1 2
14 0.3 S2 2 1
15 0.7 S2 3 1
20 0.7 S2 4 1
16 0.6 S2 5 1
17 0.8 S2 5 2
19 0.3 S2 5 3
From the above I would like to find the cumulative No_Show by Session:
df['Cum_No_show'] = df.groupby(['Session'])['No_Show'].cumsum()
Now we get
B_ID No_Show Session slot_num Patient_count Cum_No_show
1 0.4 S1 1 1 0.4
2 0.3 S1 2 1 0.7
3 0.8 S1 3 1 1.5
4 0.3 S1 3 2 1.8
5 0.6 S1 4 1 2.4
6 0.8 S1 5 1 3.2
7 0.9 S1 5 2 4.1
8 0.4 S1 5 3 4.5
9 0.6 S1 5 4 5.1
12 0.9 S2 1 1 0.9
13 0.5 S2 1 2 1.4
14 0.3 S2 2 1 1.7
15 0.7 S2 3 1 2.4
20 0.7 S2 4 1 3.1
16 0.6 S2 5 1 3.7
17 0.8 S2 5 2 4.5
19 0.3 S2 5 3 4.8
From the above I would like to create two new columns, named as below:
U_slot_num = Updated slot number
U_No_show = Updated cumulative no show
Whenever the cumulative no show is greater than 0.6, set the next slot_num to the same value as the current one, increase the patient count by one, and update U_No_show by subtracting 1, as shown in the expected output.
Expected output:
No_Show Session slot_num Patient_count Cum_No_show U_slot_num U_No_show
0.4 S1 1 1 0.4 1 0.4
0.3 S1 2 1 0.7 2 0.7
0.8 S1 3 1 1.5 2 0.5
0.3 S1 3 2 1.8 3 0.8
0.6 S1 4 1 2.4 3 0.4
0.8 S1 5 1 3.2 4 1.2
0.9 S1 5 2 4.1 4 0.2
0.4 S1 5 3 4.5 5 0.6
0.6 S1 5 4 5.1 6 1.2
0.9 S2 1 1 0.9 1 0.9
0.5 S2 1 2 1.4 1 0.4
0.3 S2 2 1 1.7 2 0.7
0.7 S2 3 1 2.4 2 0.4
0.7 S2 4 1 3.1 3 1.1
0.6 S2 5 1 3.7 3 0.7
0.8 S2 5 2 4.5 3 0.5
0.3 S2 5 3 4.8 4 0.8

calculate more than one column using for loop based multiple specific conditions in pandas

I have a dataframe as shown below.
B_ID No_Show Session slot_num Patient_count
1 0.2 S1 1 1
2 0.3 S1 2 1
3 0.8 S1 3 1
4 0.3 S1 3 2
5 0.6 S1 4 1
6 0.8 S1 5 1
7 0.9 S1 5 2
8 0.4 S1 5 3
9 0.6 S1 5 4
12 0.9 S2 1 1
13 0.5 S2 1 2
14 0.3 S2 2 1
15 0.7 S2 3 1
20 0.7 S2 4 1
16 0.6 S2 5 1
17 0.8 S2 5 2
19 0.3 S2 5 3
where
No_Show = Probability of no show
Assume that
p = [0.2, 0.4] and Duration for each slot = 30 (minutes)
p = threshold probability
From the above I would like to calculate the data frame below.
Step 1
Sort the dataframe by Session, slot_num and Patient_count:
df = df.sort_values(['Session', 'slot_num', 'Patient_count'])
Step 2
Calculate the cut-off using the conditions below:
if patient_count = 1, divide No_Show by the threshold probability
Example for B_ID = 3, Patient_count = 1, cut_off = 0.8/0.2 = 4
else if patient_count = 2, multiply the previous 1 No_Show by the current No_Show and divide by the threshold
Example for B_ID = 4, Patient_count = 2, cut_off = (0.3*0.8)/0.2 = 1.2
else if patient_count = 3, multiply the previous 2 No_Show values by the current No_Show and divide by the threshold
Example for B_ID = 8, Patient_count = 3, cut_off = (0.4*0.9*0.8)/0.2 = 1.44
And so on
The Expected Output:
B_ID No_Show Session slot_num Patient_count Cut_off_0.2 Cut_off_0.4
1 0.2 S1 1 1 1 0.5
2 0.3 S1 2 1 1.5 0.75
3 0.8 S1 3 1 4 2
4 0.3 S1 3 2 1.2 0.6
5 0.6 S1 4 1 3 1.5
6 0.8 S1 5 1 4 2
7 0.9 S1 5 2 3.6 1.8
8 0.4 S1 5 3 1.44 0.72
9 0.6 S1 5 4 0.864 0.432
12 0.9 S2 1 1 4.5 2.25
13 0.5 S2 1 2 2.25 1.125
14 0.3 S2 2 1 1.5 0.75
15 0.7 S2 3 1 3.5 1.75
20 0.7 S2 4 1 3.5 1.75
16 0.6 S2 5 1 3 1.5
17 0.8 S2 5 2 2.4 1.2
19 0.3 S2 5 3 0.72 0.36
I tried the code below:
p = [0.2, 0.4]
for i in p:
    df['Cut_off_'+'i'] = df.groupby(['Session','slot_num'])['No_Show'].cumprod().div(i)
Your solution works here if you use an f-string with {i} for the new column names:
p = [0.2, 0.4]
for i in p:
    df[f'Cut_off_{i}'] = df.groupby(['Session','slot_num'])['No_Show'].cumprod().div(i)
A solution with numpy is also possible: the cumulative product is converted to a numpy array and divided by p via broadcasting, then converted to a DataFrame and joined to the original.
import numpy as np
import pandas as pd
p = [0.2, 0.4]
arr = df.groupby(['Session','slot_num'])['No_Show'].cumprod().values[:, None] / np.array(p)
df = df.join(pd.DataFrame(arr, columns=p, index=df.index).add_prefix('Cut_off_'))
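To illustrate the broadcasting step, here is a minimal sketch on a four-row, made-up subset of the data. The intermediate names `cumprod` and `cut_offs` are only for illustration, and `to_numpy()` is used in place of `.values`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Session": ["S1", "S1", "S1", "S1"],
    "slot_num": [3, 3, 4, 5],
    "No_Show": [0.8, 0.3, 0.6, 0.8],
})
p = [0.2, 0.4]

# Running product of No_Show inside each (Session, slot_num) group.
cumprod = df.groupby(["Session", "slot_num"])["No_Show"].cumprod()

# A column of shape (n, 1) divided by a vector of shape (2,)
# broadcasts to shape (n, 2): one output column per threshold.
arr = cumprod.to_numpy()[:, None] / np.array(p)

# Label the columns by threshold, prefix them, and attach to df.
cut_offs = pd.DataFrame(arr, columns=p, index=df.index).add_prefix("Cut_off_")
df = df.join(cut_offs)

print(df)
```

Passing `index=df.index` matters when df does not have a default RangeIndex, because join aligns on the index.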

divide a column based on groupby or looping conditions in pandas

I have a data frame as shown below
B_ID No_Show Session slot_num Patient_count
1 0.2 S1 1 1
2 0.3 S1 2 1
3 0.8 S1 3 1
4 0.3 S1 3 2
5 0.6 S1 4 1
6 0.8 S1 5 1
7 0.9 S1 5 2
8 0.4 S1 5 3
9 0.6 S1 5 4
12 0.9 S2 1 1
13 0.5 S2 1 2
14 0.3 S2 2 1
15 0.7 S2 3 1
20 0.7 S2 4 1
16 0.6 S2 5 1
17 0.8 S2 5 2
19 0.3 S2 5 3
where
No_Show = Probability of no show
Assume that
threshold probability = 0.2
Duration for each slot = 30 (minutes)
From the above I would like to calculate the data frame below.
Step 1
Sort the dataframe by Session, slot_num and Patient_count:
df = df.sort_values(['Session', 'slot_num', 'Patient_count'])
Step 2
Calculate the cut-off using the conditions below:
if patient_count = 1
divide No_Show by the threshold probability
Example for B_ID = 3, Patient_count = 1, cut_off = 0.8/0.2 = 4
else if patient_count = 2
multiply the previous 1 No_Show by the current No_Show and divide by the threshold
Example for B_ID = 4, Patient_count = 2, cut_off = (0.3*0.8)/0.2 = 1.2
else if patient_count = 3
multiply the previous 2 No_Show values by the current No_Show and divide by the threshold
Example for B_ID = 8, Patient_count = 3, cut_off = (0.4*0.9*0.8)/0.2 = 1.44
And so on
The Expected Output:
B_ID No_Show Session slot_num Patient_count Cut_off
1 0.2 S1 1 1 1
2 0.3 S1 2 1 1.5
3 0.8 S1 3 1 4
4 0.3 S1 3 2 1.2
5 0.6 S1 4 1 3
6 0.8 S1 5 1 4
7 0.9 S1 5 2 3.6
8 0.4 S1 5 3 1.44
9 0.6 S1 5 4 0.864
12 0.9 S2 1 1 4.5
13 0.5 S2 1 2 2.25
14 0.3 S2 2 1 1.5
15 0.7 S2 3 1 3.5
20 0.7 S2 4 1 3.5
16 0.6 S2 5 1 3
17 0.8 S2 5 2 2.4
19 0.3 S2 5 3 0.72
Use GroupBy.cumprod and divide by the probability with Series.div:
probability = 0.2
df['new'] = df.groupby(['Session','slot_num'])['No_Show'].cumprod().div(probability)
print (df)
B_ID No_Show Session slot_num Patient_count new
0 1 0.2 S1 1 1 1.000
1 2 0.3 S1 2 1 1.500
2 3 0.8 S1 3 1 4.000
3 4 0.3 S1 3 2 1.200
4 5 0.6 S1 4 1 3.000
5 6 0.8 S1 5 1 4.000
6 7 0.9 S1 5 2 3.600
7 8 0.4 S1 5 3 1.440
8 9 0.6 S1 5 4 0.864
9 12 0.9 S2 1 1 4.500
10 13 0.5 S2 1 2 2.250
11 14 0.3 S2 2 1 1.500
12 15 0.7 S2 3 1 3.500
13 20 0.7 S2 4 1 3.500
14 16 0.6 S2 5 1 3.000
15 17 0.8 S2 5 2 2.400
16 19 0.3 S2 5 3 0.720
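One point worth keeping in mind is that cumprod follows the current row order, so the rows of a slot need to be in ascending Patient_count order for the result to match the worked examples. The sketch below uses three deliberately shuffled rows from slot 5 of S1 to show this. The names `threshold` and `Cut_off` are illustrative choices:

```python
import pandas as pd

# Rows of one slot, deliberately out of Patient_count order.
df = pd.DataFrame({
    "B_ID": [8, 6, 7],
    "No_Show": [0.4, 0.8, 0.9],
    "Session": ["S1", "S1", "S1"],
    "slot_num": [5, 5, 5],
    "Patient_count": [3, 1, 2],
})
threshold = 0.2

# Sort ascending so each row's running product covers the
# patients that precede it in the same slot.
df = df.sort_values(["Session", "slot_num", "Patient_count"]).reset_index(drop=True)

# Running product per (Session, slot_num), divided by the threshold.
df["Cut_off"] = (
    df.groupby(["Session", "slot_num"])["No_Show"].cumprod().div(threshold)
)

print(df)
```

For B_ID 8 this gives (0.8 * 0.9 * 0.4) / 0.2, which matches the 1.44 in the question up to floating-point rounding.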