Difference between first row and current row, by group - pandas

I have a data set like this:
state,date,events_per_day
AM,2020-03-01,100
AM,2020-03-02,120
AM,2020-03-15,200
BA,2020-03-16,80
BA,2020-03-20,100
BA,2020-03-29,150
RS,2020-04-01,80
RS,2020-04-05,100
RS,2020-04-11,160
Now I need to compute the difference between the date in the first row of each group and the date in the current row.
That is, the first date of each group:
for group "AM" the first date is 2020-03-01;
for group "BA" the first date is 2020-03-16;
for group "RS" it is 2020-04-01.
In the end, the result I want is:
state,date,events_per_day,days_after_first_event
AM,2020-03-01,100,0
AM,2020-03-02,120,1 <--- 2020-03-02 - 2020-03-01
AM,2020-03-15,200,14 <--- 2020-03-15 - 2020-03-01
BA,2020-03-16,80,0
BA,2020-03-20,100,4 <--- 2020-03-20 - 2020-03-16
BA,2020-03-29,150,13 <--- 2020-03-29 - 2020-03-16
RS,2020-04-01,80,0
RS,2020-04-05,100,4 <--- 2020-04-05 - 2020-04-01
RS,2020-04-11,160,10 <--- 2020-04-11 - 2020-04-01
I found How to calculate time difference by group using pandas? and it is almost what I want. However, diff() returns the difference between consecutive rows, and I need the difference between the current row and the first row.
How can I do this?

Option 3: groupby.transform
df['days_since_first'] = df['date'] - df.groupby('state')['date'].transform('first')
Output:
state date events_per_day days_since_first
0 AM 2020-03-01 100 0 days
1 AM 2020-03-02 120 1 days
2 AM 2020-03-15 200 14 days
3 BA 2020-03-16 80 0 days
4 BA 2020-03-20 100 4 days
5 BA 2020-03-29 150 13 days
6 RS 2020-04-01 80 0 days
7 RS 2020-04-05 100 4 days
8 RS 2020-04-11 160 10 days

Preprocessing:
# convert to datetime
df['date'] = pd.to_datetime(df['date'])
# extract the first dates by states:
first_dates = df.groupby('state')['date'].first() #.min() works as well
Option 1: Index alignment
# set_index before subtraction allows index alignment
df['days_since_first'] = (df.set_index('state')['date'] - first_dates).values
Option 2: map
df['days_since_first'] = df['date'] - df['state'].map(first_dates)
Output:
state date events_per_day days_since_first
0 AM 2020-03-01 100 0 days
1 AM 2020-03-02 120 1 days
2 AM 2020-03-15 200 14 days
3 BA 2020-03-16 80 0 days
4 BA 2020-03-20 100 4 days
5 BA 2020-03-29 150 13 days
6 RS 2020-04-01 80 0 days
7 RS 2020-04-05 100 4 days
8 RS 2020-04-11 160 10 days
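If you want plain integer day counts like the days_after_first_event column in the question, you can take .dt.days on the result of any of the options above (a small follow-up, assuming the column was created as shown):
# the options above produce Timedelta values ("14 days"); .dt.days gives plain integers
df['days_since_first'] = df['days_since_first'].dt.days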

Related

Calculate the actual duration of successive or parallel task with Python Pandas

I have a pandas dataframe with many rows. Each row has an object and the duration of its machining on a certain machine (with a start time and an end time). Each object can be processed on several machines in succession. I need to find the actual duration of all jobs.
For example:
Object  Machine  T start  T end
1       A        17:26    17:57
1       B        17:26    18:33
1       C        18:56    19:46
2       A        14:00    15:00
2       C        14:30    15:00
3       A        12:00    12:30
3       C        13:00    13:45
For object 1 the actual duration is 117 minutes, for object 2 it is 60 minutes, and for object 3 it is 75 minutes.
I tried a groupby where I calculated the sum of the durations of the processes for each object and the minimum and maximum values, i.e. the first start and the last end. Then I wrote a function that compares these values, but it doesn't work for object 1 (it does work for objects 2 and 3).
Here is my solution:
Object  min    max    sumT  LT_ACTUAL
1       17:26  19:46  148   140  ERROR!
2       14:00  15:00  90    60   OK!
3       12:00  13:45  75    75   OK!
def calc_lead_time(min_t_start, max_t_end, t_sum):
    t_max_min = (max_t_end - min_t_start) / pd.Timedelta(minutes=1)
    if t_max_min <= t_sum:
        return t_max_min
    else:
        return t_sum

df['LT_ACTUAL'] = df.apply(lambda x: calc_lead_time(x['min'], x['max'], x['sumT']), axis=1)
I posted an image to explain all the cases. I need to calculate the actual duration across the tasks.
Assuming the data is sorted by start time, and that one task duration is not fully within another one, you can use:
start = pd.to_timedelta(df['T start']+':00')
end = pd.to_timedelta(df['T end']+':00')
s = start.groupby(df['Object']).shift(-1)
(end.mask(end.gt(s), s).sub(start)
.groupby(df['Object']).sum()
)
Output:
Object
1 0 days 01:57:00
2 0 days 01:00:00
3 0 days 01:15:00
dtype: timedelta64[ns]
For minutes:
start = pd.to_timedelta(df['T start']+':00')
end = pd.to_timedelta(df['T end']+':00')
s = start.groupby(df['Object']).shift(-1)
(end.mask(end.gt(s), s).sub(start)
.groupby(df['Object']).sum()
.dt.total_seconds().div(60)
)
Output:
Object
1 117.0
2 60.0
3 75.0
dtype: float64
Handling overlapping intervals
See here for the logic of the overlapping intervals grouping.
(df.assign(
     start=pd.to_timedelta(df['T start']+':00'),
     end=pd.to_timedelta(df['T end']+':00'),
     max_end=lambda d: d.groupby('Object')['end'].cummax(),
     group=lambda d: d['start'].ge(d.groupby('Object')['max_end'].shift()).cumsum()
 )
 .groupby(['Object', 'group'])
 .apply(lambda g: g['end'].max()-g['start'].min())
 .groupby(level='Object').sum()
 .dt.total_seconds().div(60)
)
Output:
Object
1 117.0
2 60.0
3 75.0
4 35.0
dtype: float64
Used input:
Object Machine T start T end
0 1 A 17:26 17:57
1 1 B 17:26 18:33
2 1 C 18:56 19:46
3 2 A 14:00 15:00
4 2 C 14:30 15:00
5 3 A 12:00 12:30
6 3 C 13:00 13:45
7 4 A 12:00 12:30
8 4 C 12:00 12:15
9 4 D 12:20 12:35
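If the per-object minutes are wanted back as a column on the original frame, the grouped Series can be mapped onto Object. A small sketch, where actual_minutes is just a name introduced here for the result of one of the chains above:
# actual_minutes: the per-Object Series of minutes produced above (name chosen for illustration)
df['LT_ACTUAL'] = df['Object'].map(actual_minutes)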
def function1(dd: pd.DataFrame):
    col1 = dd.apply(lambda ss: pd.date_range(ss["T start"] + pd.to_timedelta("1 min"), ss["T end"], freq="min"),
                    axis=1).explode()
    min = col1.min() - pd.to_timedelta("1 min")
    max = col1.max()
    sumT = col1.size
    LT_ACTUAL = col1.drop_duplicates().size
    return pd.DataFrame({"min": min.strftime('%H:%M'), "max": max.strftime('%H:%M'),
                         "sumT": sumT, "LT_ACTUAL": LT_ACTUAL}, index=[dd.name])

df1.groupby('Object').apply(function1).droplevel(0)
Output:
min max sumT LT_ACTUAL
1 17:26 19:46 148 117
2 14:00 15:00 90 60
3 12:00 13:45 75 75
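Note that pd.date_range needs datetime-like bounds, so df1 above is assumed to hold parsed times rather than the raw 'HH:MM' strings. A minimal preprocessing sketch under that assumption:
# parse the HH:MM strings once so date_range can work with them
df1 = df.copy()
df1['T start'] = pd.to_datetime(df1['T start'], format='%H:%M')
df1['T end'] = pd.to_datetime(df1['T end'], format='%H:%M')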

How to calculate slope of a dataframe, upto a specific row number?

I have this data frame that looks like this:
PE CE time
0 362.30 304.70 09:42
1 365.30 303.60 09:43
2 367.20 302.30 09:44
3 360.30 309.80 09:45
4 356.70 310.25 09:46
5 355.30 311.70 09:47
6 354.40 312.98 09:48
7 350.80 316.70 09:49
8 349.10 318.95 09:50
9 350.05 317.45 09:51
10 352.05 315.95 09:52
11 350.25 316.65 09:53
12 348.63 318.35 09:54
13 349.05 315.95 09:55
14 345.65 320.15 09:56
15 346.85 319.95 09:57
16 348.55 317.20 09:58
17 349.55 316.26 09:59
18 348.25 317.10 10:00
19 347.30 318.50 10:01
In this data frame, I would like to calculate the slope of the first and second columns separately, over the period starting from the first time (here 09:42, but it is not fixed and can vary) up to 12:00.
Please help me write it.
Computing the slope can be accomplished by use of the equation:
Slope = Rise/Run
Given that you want to compute the slope between two time entries, all you need to do is find:
the Run = the timedelta between the start and end times
the Rise = the difference between the cell entries at the start and end.
The tricky part of these calculations is making sure you properly handle the time functions:
import pandas as pd
from datetime import datetime
Thus you can define a function:
def computeSelectedSlope(df: pd.DataFrame, start: str, end: str, timecol: str, datacol: str) -> float:
    assert timecol in df.columns  # prove timecol exists
    assert datacol in df.columns  # prove datacol exists
    rise = (df[datacol][df[timecol] == datetime.strptime(end, '%H:%M:%S').time()].values[0] -
            df[datacol][df[timecol] == datetime.strptime(start, '%H:%M:%S').time()].values[0])
    run = (int(df.index[df[timecol] == datetime.strptime(end, '%H:%M:%S').time()].values) -
           int(df.index[df[timecol] == datetime.strptime(start, '%H:%M:%S').time()].values))
    return rise / run
Now given a dataframe df of the form:
A B T
0 2.632 231.229 00:00:00
1 2.732 239.026 00:01:00
2 2.748 251.310 00:02:00
3 3.018 285.330 00:03:00
4 3.090 308.925 00:04:00
5 3.366 312.702 00:05:00
6 3.369 326.912 00:06:00
7 3.562 330.703 00:07:00
8 3.590 379.575 00:08:00
9 3.867 422.262 00:09:00
10 4.030 428.148 00:10:00
11 4.210 442.521 00:11:00
12 4.266 443.631 00:12:00
13 4.335 444.991 00:13:00
14 4.380 453.531 00:14:00
15 4.402 462.531 00:15:00
16 4.499 464.170 00:16:00
17 4.553 471.770 00:17:00
18 4.572 495.285 00:18:00
19 4.665 513.009 00:19:00
You can find the slope for any time difference by:
computeSelectedSlope(df, '00:01:00', '00:15:00', 'T', 'B')
Which yields 15.964642857142858
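For the frame in the question, the call would look something like the sketch below; it assumes the time column is named time and has been converted to datetime.time values first, otherwise the equality lookups inside the function will not match anything:
# convert the HH:MM strings to datetime.time so the lookups match
df['time'] = pd.to_datetime(df['time'], format='%H:%M').dt.time
computeSelectedSlope(df, '09:42:00', '10:01:00', 'time', 'PE')
Since the rows are one minute apart, the run is the index distance (19), so this returns (347.30 - 362.30) / 19, i.e. the per-minute slope of PE over that window.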

Pandas - Filtering alternate Monday

I have a Dataframe that has sales data by day. I would like to filter the sales data down to every alternate Monday. For example, if I select June 27, the next date I would like to keep would be July 11, then July 25, and so on.
I have my Dataframe as below
sale_date, count
2022-06-27, 100
2022-07-01, 150
2022-07-07, 100
2022-07-11, 150
2022-06-20, 100
2022-07-25, 150
I would expect the output to be
sale_date, count
2022-06-27, 100
2022-07-11, 150
2022-07-25, 150
You can use:
# convert to datetime
date = pd.to_datetime(df['sale_date'])
# is the day a Monday (0 = Monday)?
m1 = date.dt.weekday.eq(0)
# is the week an "even" week?
m2 = date.dt.isocalendar().week.mod(2).eq(0)
# if both conditions are True, keep the row
out = df[m1&m2]
Output:
sale_date count
0 2022-06-27 100
3 2022-07-11 150
5 2022-07-25 150
Intermediates:
sale_date count weekday weekday.eq(0) week week.mod(2) week.mod(2).eq(0)
0 2022-06-27 100 0 True 26 0 True
1 2022-07-01 150 4 False 26 0 True
2 2022-07-07 100 3 False 27 1 False
3 2022-07-11 150 0 True 28 0 True
4 2022-06-20 100 0 True 25 1 False
5 2022-07-25 150 0 True 30 0 True
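The even-ISO-week test works here because 2022-06-27 happens to fall in an even week (26). If the starting Monday can be arbitrary, a more general sketch anchors the fortnight on that date instead (anchor is a name introduced here):
date = pd.to_datetime(df['sale_date'])
anchor = pd.Timestamp('2022-06-27')  # the Monday you want to start from
days_since = (date - anchor).dt.days
# keep rows that fall exactly on every second Monday from the anchor onwards
out = df[days_since.ge(0) & days_since.mod(14).eq(0)]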
df11 = df1.resample("2W-MON", closed="left", on="sale_date")["count"].first().reset_index()
df11.assign(sale_date=df11.sale_date - pd.Timedelta(days=7))
Output:
sale_date count
0 2022-06-27 100
1 2022-07-11 100
2 2022-07-25 150

Pandas: to get mean for each data category daily [duplicate]

I am a fairly new programmer learning Python (+pandas) and hope I can explain this well enough. I have a large time-series pandas DataFrame of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many questions like counting records per hour per day and getting average per hour over several years. However, I run into the trouble of including the 'Id' variable.
I'm looking to get the mean value of people taking a ticket for each hour, for each day of the week (mon-fri) and per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby(Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Hour']).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue in a clear way. I looking for the mean per hour per day per Id as I plan to do clustering to separate my dataset into groups before applying a predictive model on these groups.
Any help would be grateful and if possible an explanation of what I am doing wrong either code wise or my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question while short on sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per hour. And want to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years, has 3 million rows, and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
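One caveat with recent pandas: after the first groupby, date is a non-numeric column and a bare .mean() may complain about it; selecting Count explicitly side-steps that. A sketch of the same two steps under the column names used above:
# daily totals per Id / day-of-week / hour (Count only)
hourly = df.groupby(['Id', 'date', 'dow', 'hour'], as_index=False)['Count'].sum()
# mean of those daily totals per Id / day-of-week / hour
mean_counts = hourly.groupby(['Id', 'dow', 'hour'], as_index=False)['Count'].mean()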
You can use the groupby function using the 'Id' column and then use the resample function with how='sum'.
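resample no longer takes how=; in current pandas the same idea is written with .sum() after the resample. A rough sketch, assuming Start_date is the datetime index (as in the question) and Count is the per-ticket column; note that hours with no tickets come out as zero-count rows, which lowers the mean compared with grouping only on observed hours:
# hourly ticket totals per Id
hourly = df.groupby('Id').resample('H')['Count'].sum().reset_index()
# mean per Id, day of week and hour of day
out = (hourly.groupby(['Id',
                       hourly['Start_date'].dt.dayofweek.rename('dow'),
                       hourly['Start_date'].dt.hour.rename('hour')])['Count']
             .mean())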

Pandas - Find difference based on two subsequent rows of Dataframe

I have a Dataframe that captures the date a ticket was raised by a customer, stored in a column labelled date. If the ref_column of the current row is the same as that of the following row, then the aging should be the difference between the date of the current row and the date of the following row for the same cust_id. If the ref_column is not the same, then the aging should be the difference between date and ref_date of the same row.
Given below is how my data is:
cust_id,date,ref_column,ref_date
101,15/01/19,abc,31/01/19
101,17/01/19,abc,31/01/19
101,19/01/19,xyz,31/01/19
102,15/01/19,abc,31/01/19
102,21/01/19,klm,31/01/19
102,25/01/19,xyz,31/01/19
103,15/01/19,xyz,31/01/19
Expected output:
cust_id,date,ref_column,ref_date,aging(in days)
101,15/01/19,abc,31/01/19,2
101,17/01/19,abc,31/01/19,14
101,19/01/19,xyz,31/01/19,0
102,15/01/19,abc,31/01/19,16
102,21/01/19,klm,31/01/19,10
102,25/01/19,xyz,31/01/19,0
103,15/01/19,xyz,31/01/19,0
Aging (in days) is 0 for the last entry of a given cust_id.
Here's my approach:
import numpy as np
# convert dates to datetime type
# ignore if already are
df['date'] = pd.to_datetime(df['date'])
df['ref_date'] = pd.to_datetime(df['ref_date'])
# customer group
groups = df.groupby('cust_id')
# where ref_column is the same with the next:
same_ = df['ref_column'].eq(groups['ref_column'].shift(-1))
# update these ones
df['aging'] = np.where(same_,
                       -groups['date'].diff(-1).dt.days,        # same ref as next row
                       df['ref_date'].sub(df['date']).dt.days)  # different ref than next row
# update last elements in groups:
last_idx = groups['date'].idxmax()
df.loc[last_idx, 'aging'] = 0
Output:
cust_id date ref_column ref_date aging
0 101 2019-01-15 abc 2019-01-31 2.0
1 101 2019-01-17 abc 2019-01-31 14.0
2 101 2019-01-19 xyz 2019-01-31 0.0
3 102 2019-01-15 abc 2019-01-31 16.0
4 102 2019-01-21 klm 2019-01-31 10.0
5 102 2019-01-25 xyz 2019-01-31 0.0
6 103 2019-01-15 xyz 2019-01-31 0.0
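If whole-number days are preferred, as in the expected output, the column can be cast afterwards; a small follow-up, assuming no NaN is left in aging:
# cast the float day counts to plain integers
df['aging'] = df['aging'].astype(int)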