groupby first as a dictionary in pandas

I have data frame as shown below.
session slot_num appt_time
s1 1 2020-01-06 09:00:00
s1 2 2020-01-06 09:20:00
s1 3 2020-01-06 09:40:00
s1 3 2020-01-06 09:40:00
s1 4 2020-01-06 10:00:00
s1 4 2020-01-06 10:00:00
s2 1 2020-01-06 08:20:00
s2 2 2020-01-06 08:40:00
s2 2 2020-01-06 08:40:00
s2 3 2020-01-06 09:00:00
s2 4 2020-01-06 09:20:00
s2 5 2020-01-06 09:40:00
s2 5 2020-01-06 09:40:00
s2 6 2020-01-06 10:00:00
s3 1 2020-01-09 13:00:00
s3 1 2020-01-09 13:00:00
s3 2 2020-01-09 13:20:00
s3 3 2020-01-09 13:40:00
From the above I want to create a dictionary with session as the key and the starting (first) appt_time of that session as the value.
Expected Output:
d = {'s1':'2020-01-06 09:00:00',
's2':'2020-01-06 08:20:00',
's3':'2020-01-09 13:00:00'}

Use DataFrame.drop_duplicates to keep the first row of each session, convert session to the index, select the appt_time column as a Series, and finally call Series.to_dict:
d = df.drop_duplicates('session').set_index('session')['appt_time'].to_dict()
print (d)
{'s1': '2020-01-06 09:00:00', 's2': '2020-01-06 08:20:00', 's3': '2020-01-09 13:00:00'}
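The same dictionary can probably also be built with groupby, as the title suggests. This is a minimal sketch, assuming the rows are already ordered so that the first row of each session holds its starting time; the small frame is a cut-down copy of the question's data, used only for illustration:

```python
import pandas as pd

# Cut-down illustrative frame (a subset of the question's rows)
df = pd.DataFrame({
    'session': ['s1', 's1', 's2', 's2', 's3'],
    'slot_num': [1, 2, 1, 2, 1],
    'appt_time': ['2020-01-06 09:00:00', '2020-01-06 09:20:00',
                  '2020-01-06 08:20:00', '2020-01-06 08:40:00',
                  '2020-01-09 13:00:00'],
})

# first() takes the first non-null appt_time within each session
d = df.groupby('session')['appt_time'].first().to_dict()
print(d)
```

If the rows are not guaranteed to be in time order, min() in place of first() should pick the earliest value instead, provided appt_time holds datetimes or ISO-formatted strings.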


Add a column value with the other date time column at minutes level in pandas

I have a data frame as shown below
ID ideal_appt_time service_time
1 2020-01-06 09:00:00 22
2 2020-01-06 09:30:00 15
1 2020-01-08 14:00:00 42
2 2020-01-12 01:30:00 5
I would like to add service_time, in minutes, to ideal_appt_time and store the result in a new column called finish.
Expected Output:
ID ideal_appt_time service_time finish
1 2020-01-06 09:00:00 22 2020-01-06 09:22:00
2 2020-01-06 09:30:00 15 2020-01-06 09:45:00
1 2020-01-08 14:00:00 42 2020-01-08 14:42:00
2 2020-01-12 01:30:00 5 2020-01-12 01:35:00
Use to_timedelta to convert the service_time column to timedeltas in minutes, then add them to the datetimes:
df['ideal_appt_time'] = pd.to_datetime(df['ideal_appt_time'])
df['finish'] = df['ideal_appt_time'] + pd.to_timedelta(df['service_time'], unit='min')
print (df)
ID ideal_appt_time service_time finish
0 1 2020-01-06 09:00:00 22 2020-01-06 09:22:00
1 2 2020-01-06 09:30:00 15 2020-01-06 09:45:00
2 1 2020-01-08 14:00:00 42 2020-01-08 14:42:00
3 2 2020-01-12 01:30:00 5 2020-01-12 01:35:00
Another way:
df = pd.DataFrame({'ideal_appt_time':['2020-01-06 09:00:00','2020-01-06 09:30:00','2020-01-08 14:00:00','2020-01-12 01:30:00'],'service_time':[22,15,42,35]})
# to_timedelta is used here because casting integers with astype('timedelta64[m]') is rejected by recent pandas versions
df['finish'] = pd.to_datetime(df['ideal_appt_time']).add(pd.to_timedelta(df['service_time'], unit='min'))
ideal_appt_time service_time finish
0 2020-01-06 09:00:00 22 2020-01-06 09:22:00
1 2020-01-06 09:30:00 15 2020-01-06 09:45:00
2 2020-01-08 14:00:00 42 2020-01-08 14:42:00
3 2020-01-12 01:30:00 35 2020-01-12 02:05:00

generate a random number between 2 and 40 with mean 20 as a column in pandas

I have a data frame as shown below
session slot_num appt_time
s1 1 2020-01-06 09:00:00
s1 2 2020-01-06 09:20:00
s1 3 2020-01-06 09:40:00
s1 3 2020-01-06 09:40:00
s1 4 2020-01-06 10:00:00
s1 4 2020-01-06 10:00:00
s2 1 2020-01-06 08:20:00
s2 2 2020-01-06 08:40:00
s2 2 2020-01-06 08:40:00
s2 3 2020-01-06 09:00:00
s2 4 2020-01-06 09:20:00
s2 5 2020-01-06 09:40:00
s2 5 2020-01-06 09:40:00
s2 6 2020-01-06 10:00:00
s3 1 2020-01-09 13:00:00
s3 1 2020-01-09 13:00:00
s3 2 2020-01-09 13:20:00
s3 3 2020-01-09 13:40:00
In the above I would like to add a column called service_time.
service_time should contain random numbers between 2 and 40 with a mean of 20 for each session.
Preferably the random numbers should follow a normal distribution with mean 20, standard deviation 10, minimum 2 and maximum 40.
Expected output:
session slot_num appt_time service_time
s1 1 2020-01-06 09:00:00 30
s1 2 2020-01-06 09:20:00 10
s1 3 2020-01-06 09:40:00 15
s1 3 2020-01-06 09:40:00 35
s1 4 2020-01-06 10:00:00 20
s1 4 2020-01-06 10:00:00 10
s2 1 2020-01-06 08:20:00 15
s2 2 2020-01-06 08:40:00 20
s2 2 2020-01-06 08:40:00 25
s2 3 2020-01-06 09:00:00 30
s2 4 2020-01-06 09:20:00 20
s2 5 2020-01-06 09:40:00 8
s2 5 2020-01-06 09:40:00 40
s2 6 2020-01-06 10:00:00 2
s3 1 2020-01-09 13:00:00 4
s3 1 2020-01-09 13:00:00 32
s3 2 2020-01-09 13:20:00 26
s3 3 2020-01-09 13:40:00 18
Note: this is just one of the random combinations that satisfy the minimum, maximum and mean criteria mentioned above.
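One possible solution with a custom function:
import numpy as np

def gen_avg(n, expected_avg=20, a=2, b=40):
    # redraw until the sample mean equals expected_avg exactly
    while True:
        # randint excludes the upper bound, so b + 1 allows b itself
        l = np.random.randint(a, b + 1, size=n)
        avg = np.mean(l)
        if avg == expected_avg:
            return l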
df['service_time'] = df.groupby('session')['session'].transform(lambda x: gen_avg(len(x)))
print (df)
session slot_num appt_time service_time
0 s1 1 2020-01-06 09:00:00 31
1 s1 2 2020-01-06 09:20:00 9
2 s1 3 2020-01-06 09:40:00 23
3 s1 3 2020-01-06 09:40:00 37
4 s1 4 2020-01-06 10:00:00 6
5 s1 4 2020-01-06 10:00:00 14
6 s2 1 2020-01-06 08:20:00 33
7 s2 2 2020-01-06 08:40:00 29
8 s2 2 2020-01-06 08:40:00 18
9 s2 3 2020-01-06 09:00:00 32
10 s2 4 2020-01-06 09:20:00 9
11 s2 5 2020-01-06 09:40:00 26
12 s2 5 2020-01-06 09:40:00 10
13 s2 6 2020-01-06 10:00:00 3
14 s3 1 2020-01-09 13:00:00 19
15 s3 1 2020-01-09 13:00:00 22
16 s3 2 2020-01-09 13:20:00 5
17 s3 3 2020-01-09 13:40:00 34
Here's a solution with NumPy's new Generator infrastructure. See the documentation for a discussion of the differences between this and the older RandomState infrastructure.
import numpy as np
from numpy.random import default_rng
# assuming df is the name of your dataframe
n = len(df)
# set up random number generator
rng = default_rng()
# sample more than enough values
vals = rng.normal(loc=20., scale=10., size=2*n)
# filter values according to cut-off conditions
vals = vals[2 <= vals]
vals = vals[vals <= 40]
# add n random values to dataframe
df['service_time'] = vals[:n]
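Sampling 2*n values is almost always enough, but after filtering there is no hard guarantee that n values remain. A minimal sketch of a variant that keeps drawing until n values have been accepted; the function name sample_bounded_normal and the fixed seed are assumptions made for illustration, not part of the answer above:

```python
import numpy as np
from numpy.random import default_rng

def sample_bounded_normal(rng, n, loc=20.0, scale=10.0, low=2.0, high=40.0):
    """Draw n normal samples, redrawing any that fall outside [low, high]."""
    out = np.empty(0)
    while out.size < n:
        vals = rng.normal(loc=loc, scale=scale, size=2 * n)
        # keep only the values inside the allowed range
        vals = vals[(vals >= low) & (vals <= high)]
        out = np.concatenate([out, vals])
    return out[:n]

rng = default_rng(0)  # seeded only to make the sketch repeatable
service_time = sample_bounded_normal(rng, 18)  # 18 = number of rows in the question
```

Because the cut-offs are not symmetric around 20, the mean of the accepted values will be close to, but not exactly, 20.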
The normal distribution has an unbounded range, so if you're bounding between 2 and 40 the distribution isn't normal. An alternative which is bounded, and avoids acceptance/rejection schemes, is to use the triangular distribution (see Wikipedia for details). Since the mean of a triangular distribution is (left + mode + right) / 3, with left = 2 and right = 40 you would set mode = 18 to get the desired mean of 20.
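A minimal sketch of the triangular approach, using NumPy's Generator.triangular. The seed and the row count n = 18 are assumptions chosen to match the question's sample data:

```python
import numpy as np
from numpy.random import default_rng

left, right, target_mean = 2, 40, 20
# mean = (left + mode + right) / 3, solved for mode
mode = 3 * target_mean - left - right  # 18

rng = default_rng(0)  # seeded only to make the sketch repeatable
n = 18                # stand-in for len(df)
service_time = rng.triangular(left, mode, right, size=n)
# df['service_time'] = service_time  # assign once df is available
```

With these three parameters the standard deviation is fixed at roughly 7.8, so the requested standard deviation of 10 cannot be matched at the same time.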

Groupby - Generate date time as sequence

I have a dataframe as shown below
session slot_num
s1 1
s1 2
s1 3
s1 3
s1 4
s1 4
s2 1
s2 2
s2 2
s2 3
s2 4
s2 5
s2 5
s2 6
s3 1
s3 1
s3 2
s3 3
from the above I would like to create a column appt_time as shown below.
Expected output
session slot_num appt_time
s1 1 2020-01-06 09:00:00
s1 2 2020-01-06 09:20:00
s1 3 2020-01-06 09:40:00
s1 3 2020-01-06 09:40:00
s1 4 2020-01-06 10:00:00
s1 4 2020-01-06 10:00:00
s2 1 2020-01-06 08:20:00
s2 2 2020-01-06 08:40:00
s2 2 2020-01-06 08:40:00
s2 3 2020-01-06 09:00:00
s2 4 2020-01-06 09:20:00
s2 5 2020-01-06 09:40:00
s2 5 2020-01-06 09:40:00
s2 6 2020-01-06 10:00:00
s3 1 2020-01-09 13:00:00
s3 1 2020-01-09 13:00:00
s3 2 2020-01-09 13:20:00
s3 3 2020-01-09 13:40:00
for session = s1, appt_start time = 2020-01-06 09:00:00, then for each increase in slot_num for that session increment appt_time by 20 minutes.
for session = s2, appt_start time = 2020-01-06 08:20:00, then for each increase in slot_num for that session increment appt_time by 20 minutes.
for session = s3, appt_start time = 2020-01-09 13:00:00, then for each increase in slot_num for that session increment appt_time by 20 minutes.
First specify the starting datetime of each session, here with a dictionary. Map it onto the session column and convert to datetimes, then add the timedeltas: subtract 1 from slot_num so that the first slot adds a zero Timedelta, convert to minutes with to_timedelta, and multiply by 20:
d = {'s1':'2020-01-06 09:00:00',
's2':'2020-01-06 08:20:00',
's3':'2020-01-09 13:00:00'}
df['appt_time'] = (pd.to_datetime(df['session'].map(d)) +
                   pd.to_timedelta(df['slot_num'].sub(1), unit='min').mul(20))
print (df)
session slot_num appt_time
0 s1 1 2020-01-06 09:00:00
1 s1 2 2020-01-06 09:20:00
2 s1 3 2020-01-06 09:40:00
3 s1 3 2020-01-06 09:40:00
4 s1 4 2020-01-06 10:00:00
5 s1 4 2020-01-06 10:00:00
6 s2 1 2020-01-06 08:20:00
7 s2 2 2020-01-06 08:40:00
8 s2 2 2020-01-06 08:40:00
9 s2 3 2020-01-06 09:00:00
10 s2 4 2020-01-06 09:20:00
11 s2 5 2020-01-06 09:40:00
12 s2 5 2020-01-06 09:40:00
13 s2 6 2020-01-06 10:00:00
14 s3 1 2020-01-09 13:00:00
15 s3 1 2020-01-09 13:00:00
16 s3 2 2020-01-09 13:20:00
17 s3 3 2020-01-09 13:40:00

create a new columns by adding minutes to date time column and another column by groupby row number - in Pandas

I have a data frame as shown below
session appt_time
s1 2020-01-06 09:00:00
s1 2020-01-06 09:20:00
s1 2020-01-06 09:40:00
s1 2020-01-06 09:40:00
s1 2020-01-06 10:00:00
s1 2020-01-06 10:00:00
s2 2020-01-06 08:20:00
s2 2020-01-06 08:40:00
s2 2020-01-06 08:40:00
s2 2020-01-06 09:00:00
s2 2020-01-06 09:20:00
s2 2020-01-06 09:40:00
s2 2020-01-06 09:40:00
s2 2020-01-06 10:00:00
s3 2020-01-09 13:00:00
s3 2020-01-09 13:00:00
s3 2020-01-09 13:20:00
s3 2020-01-09 13:40:00
From the above I would like to create two new columns called ideal_appt_time and slot_num as shown below.
session appt_time ideal_appt_time slot_num
s1 2020-01-06 09:00:00 2020-01-06 09:00:00 1
s1 2020-01-06 09:20:00 2020-01-06 09:20:00 2
s1 2020-01-06 09:40:00 2020-01-06 09:40:00 3
s1 2020-01-06 09:40:00 2020-01-06 10:00:00 4
s1 2020-01-06 10:00:00 2020-01-06 10:20:00 5
s1 2020-01-06 10:00:00 2020-01-06 10:40:00 6
s2 2020-01-06 08:20:00 2020-01-06 08:20:00 1
s2 2020-01-06 08:40:00 2020-01-06 08:40:00 2
s2 2020-01-06 08:40:00 2020-01-06 09:00:00 3
s2 2020-01-06 09:00:00 2020-01-06 09:20:00 4
s2 2020-01-06 09:20:00 2020-01-06 09:40:00 5
s2 2020-01-06 09:40:00 2020-01-06 10:00:00 6
s2 2020-01-06 09:40:00 2020-01-06 10:20:00 7
s2 2020-01-06 10:00:00 2020-01-06 10:40:00 8
s3 2020-01-09 13:00:00 2020-01-09 13:00:00 1
s3 2020-01-09 13:00:00 2020-01-09 13:20:00 2
s3 2020-01-09 13:20:00 2020-01-09 13:40:00 3
s3 2020-01-09 13:40:00 2020-01-09 14:00:00 4
Here ideal_appt_time starts at the first appt_time of the session and then increases by 20 minutes per row, whereas appt_time itself contains repeated values.
slot_num simply counts the rows (slots) within each session, in appointment-time order.
Use GroupBy.cumcount to get a per-session counter Series, convert it to timedeltas with to_timedelta and multiply by 20 to get 20-minute steps.
Then get the first timestamp per group with GroupBy.transform and 'first', add the timedeltas, and finally add 1 to the counter for the slot_num column:
df['appt_time'] = pd.to_datetime(df['appt_time'])
counts = df.groupby('session').cumcount()
td = pd.to_timedelta(counts, unit='min') * 20
df['ideal_appt_time'] = df.groupby('session')['appt_time'].transform('first') + td
df['slot_num'] = counts + 1
print (df)
session appt_time ideal_appt_time slot_num
0 s1 2020-01-06 09:00:00 2020-01-06 09:00:00 1
1 s1 2020-01-06 09:20:00 2020-01-06 09:20:00 2
2 s1 2020-01-06 09:40:00 2020-01-06 09:40:00 3
3 s1 2020-01-06 09:40:00 2020-01-06 10:00:00 4
4 s1 2020-01-06 10:00:00 2020-01-06 10:20:00 5
5 s1 2020-01-06 10:00:00 2020-01-06 10:40:00 6
6 s2 2020-01-06 08:20:00 2020-01-06 08:20:00 1
7 s2 2020-01-06 08:40:00 2020-01-06 08:40:00 2
8 s2 2020-01-06 08:40:00 2020-01-06 09:00:00 3
9 s2 2020-01-06 09:00:00 2020-01-06 09:20:00 4
10 s2 2020-01-06 09:20:00 2020-01-06 09:40:00 5
11 s2 2020-01-06 09:40:00 2020-01-06 10:00:00 6
12 s2 2020-01-06 09:40:00 2020-01-06 10:20:00 7
13 s2 2020-01-06 10:00:00 2020-01-06 10:40:00 8
14 s3 2020-01-09 13:00:00 2020-01-09 13:00:00 1
15 s3 2020-01-09 13:00:00 2020-01-09 13:20:00 2
16 s3 2020-01-09 13:20:00 2020-01-09 13:40:00 3
17 s3 2020-01-09 13:40:00 2020-01-09 14:00:00 4

create a new column based on groupby date time column at date level in pandas

I have data frame as shown below.
Doctor Appointment Booking_ID
A 2020-01-18 12:00:00 1
A 2020-01-18 12:30:00 2
A 2020-01-18 13:00:00 3
A 2020-01-18 13:00:00 4
A 2020-01-19 13:00:00 13
A 2020-01-19 13:30:00 14
B 2020-01-18 12:00:00 5
B 2020-01-18 12:30:00 6
B 2020-01-18 13:00:00 7
B 2020-01-25 12:30:00 6
B 2020-01-25 13:00:00 7
C 2020-01-19 12:00:00 19
C 2020-01-19 12:30:00 20
C 2020-01-19 13:00:00 21
C 2020-01-22 12:30:00 20
C 2020-01-22 13:00:00 21
From the above I would like to create a column called Session as shown below.
Expected Output:
Doctor Appointment Booking_ID Session
A 2020-01-18 12:00:00 1 S1
A 2020-01-18 12:30:00 2 S1
A 2020-01-18 13:00:00 3 S1
A 2020-01-18 13:00:00 4 S1
A 2020-01-19 13:00:00 13 S2
A 2020-01-19 13:30:00 14 S2
B 2020-01-18 12:00:00 5 S3
B 2020-01-18 12:30:00 6 S3
B 2020-01-18 13:00:00 7 S3
B 2020-01-25 12:30:00 6 S4
B 2020-01-25 13:00:00 7 S4
C 2020-01-19 12:00:00 19 S5
C 2020-01-19 12:30:00 20 S5
C 2020-01-19 13:00:00 21 S5
C 2020-01-22 12:30:00 20 S6
C 2020-01-22 13:00:00 21 S6
Session should differ for each doctor and for each appointment date (at day level).
I tried the following:
df = df.sort_values(['Doctor', 'Appointment'], ascending=True)
df['Appointment'] = pd.to_datetime(df['Appointment'])
dates = df['Appointment']
df['Session'] = 'S' + pd.Series(dates.factorize()[0] + 1, index=df.index).astype(str)
But it is considering session based on only dates. I would like to consider doctor as well.
IIUC, use GroupBy.ngroup, grouping by Doctor and by the date part of Appointment:
df['Session'] = 'S' + (df.groupby(['Doctor', pd.to_datetime(df['Appointment']).dt.date])
                         .ngroup().add(1).astype(str))
Doctor Appointment Booking_ID Session
0 A 2020-01-18-12:00:00 1 S1
1 A 2020-01-18-12:30:00 2 S1
2 A 2020-01-18-13:00:00 3 S1
3 A 2020-01-18-13:00:00 4 S1
4 A 2020-01-19-13:00:00 13 S2
5 A 2020-01-19-13:30:00 14 S2
6 B 2020-01-18-12:00:00 5 S3
7 B 2020-01-18-12:30:00 6 S3
8 B 2020-01-18-13:00:00 7 S3
9 B 2020-01-25-12:30:00 6 S4
10 B 2020-01-25-13:00:00 7 S4
11 C 2020-01-19-12:00:00 19 S5
12 C 2020-01-19-12:30:00 20 S5
13 C 2020-01-19-13:00:00 21 S5
14 C 2020-01-22-12:30:00 20 S6
15 C 2020-01-22-13:00:00 21 S6
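You can use sort_values and then flag each row where either the date differs from the previous row (diff is not 0) or the doctor differs from the previous row (checked with shift); the cumulative sum of these flags numbers the sessions:
df = df.sort_values(['Doctor', 'Appointment'], ascending=True)
df['Appointment'] = pd.to_datetime(df['Appointment'])
df['Session'] = 'S' + (df['Appointment'].dt.normalize().diff().ne(pd.Timedelta(0))
                       | df['Doctor'].ne(df['Doctor'].shift())).cumsum().astype(str)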
print (df)
Doctor Appointment Booking_ID Session
0 A 2020-01-18 12:00:00 1 S1
1 A 2020-01-18 12:30:00 2 S1
2 A 2020-01-18 13:00:00 3 S1
3 A 2020-01-18 13:00:00 4 S1
4 A 2020-01-19 13:00:00 13 S2
5 A 2020-01-19 13:30:00 14 S2
6 B 2020-01-18 12:00:00 5 S3
7 B 2020-01-18 12:30:00 6 S3
8 B 2020-01-18 13:00:00 7 S3
9 B 2020-01-25 12:30:00 6 S4
10 B 2020-01-25 13:00:00 7 S4
11 C 2020-01-19 12:00:00 19 S5
12 C 2020-01-19 12:30:00 20 S5
13 C 2020-01-19 13:00:00 21 S5
14 C 2020-01-22 12:30:00 20 S6
15 C 2020-01-22 13:00:00 21 S6
Another approach using idxmin with a slightly different result:
df['Session'] = 'S' + (df.groupby(
This is groupby().ngroup():
# convert to datetime
df.Appointment = pd.to_datetime(df.Appointment)
# number each (Doctor, date) group, starting from 1
df['Session'] = 'S' + (df.groupby(['Doctor', df.Appointment.dt.date]).ngroup()+1).astype(str)
Doctor Appointment Booking_ID Session
0 A 2020-01-18 12:00:00 1 S1
1 A 2020-01-18 12:30:00 2 S1
2 A 2020-01-18 13:00:00 3 S1
3 A 2020-01-18 13:00:00 4 S1
4 A 2020-01-19 13:00:00 13 S2
5 A 2020-01-19 13:30:00 14 S2
6 B 2020-01-18 12:00:00 5 S3
7 B 2020-01-18 12:30:00 6 S3
8 B 2020-01-18 13:00:00 7 S3
9 B 2020-01-25 12:30:00 6 S4
10 B 2020-01-25 13:00:00 7 S4
11 C 2020-01-19 12:00:00 19 S5
12 C 2020-01-19 12:30:00 20 S5
13 C 2020-01-19 13:00:00 21 S5
14 C 2020-01-22 12:30:00 20 S6
15 C 2020-01-22 13:00:00 21 S6
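To check that the ngroup approach and the diff/shift approach agree, here is a minimal self-contained sketch. The six-row frame is a cut-down version of the question's data and the names day, by_ngroup and by_change are introduced only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Doctor': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Appointment': pd.to_datetime([
        '2020-01-18 12:00:00', '2020-01-18 12:30:00', '2020-01-19 13:00:00',
        '2020-01-18 12:00:00', '2020-01-25 12:30:00', '2020-01-25 13:00:00',
    ]),
})
df = df.sort_values(['Doctor', 'Appointment']).reset_index(drop=True)

day = df['Appointment'].dt.normalize()  # midnight of each appointment's day

# Approach 1: number each (Doctor, day) group
by_ngroup = 'S' + (df.groupby(['Doctor', day]).ngroup() + 1).astype(str)

# Approach 2: start a new session whenever the doctor or the day changes
changed = df['Doctor'].ne(df['Doctor'].shift()) | day.ne(day.shift())
by_change = 'S' + changed.cumsum().astype(str)

print(by_ngroup.tolist())
print(by_change.tolist())
```

The two agree here because the frame is sorted by Doctor and Appointment. ngroup numbers groups by their sorted keys, while the cumulative sum follows row order, so on unsorted data the labels may differ.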