Calculate the actual duration of successive or parallel tasks with Python pandas

I have a pandas dataframe with many rows. In each row I have an object and the duration of the machining on a certain machine (with a start time and an end time). Each object can be processed in several machines in succession. I need to find the actual duration of all jobs.
For example:
Object Machine T start T end
1      A       17:26   17:57
1      B       17:26   18:33
1      C       18:56   19:46
2      A       14:00   15:00
2      C       14:30   15:00
3      A       12:00   12:30
3      C       13:00   13:45
For object 1 the actual duration is 117 minutes, for object 2 it is 60 minutes, and for object 3 it is 75 minutes.
I tried a groupby where I calculated, for each object, the sum of the process durations and the minimum and maximum values, i.e. the first start and the last end. Then I wrote a function that compares these values, but it fails for object 1 while working for objects 2 and 3.
Here my solution:
Object min   max   sumT LT_ACTUAL
1      17:26 19:46 148  140 ERROR!
2      14:00 15:00 90   60  OK!
3      12:00 13:45 75   75  OK!
def calc_lead_time(min_t_start, max_t_end, t_sum):
    t_max_min = (max_t_end - min_t_start) / pd.Timedelta(minutes=1)
    if t_max_min <= t_sum:
        return t_max_min
    else:
        return t_sum

df['LT_ACTUAL'] = df.apply(lambda x: calc_lead_time(x['min'], x['max'], x['sumT']), axis=1)
I posted an image to explain all the cases. I need to calculate the actual duration between the tasks.

Assuming the data is sorted by start time, and that no task is fully contained within another one, you can use:
start = pd.to_timedelta(df['T start'] + ':00')
end = pd.to_timedelta(df['T end'] + ':00')
s = start.groupby(df['Object']).shift(-1)

(end.mask(end.gt(s), s).sub(start)
    .groupby(df['Object']).sum()
)
Output:
Object
1 0 days 01:57:00
2 0 days 01:00:00
3 0 days 01:15:00
dtype: timedelta64[ns]
For minutes:
start = pd.to_timedelta(df['T start'] + ':00')
end = pd.to_timedelta(df['T end'] + ':00')
s = start.groupby(df['Object']).shift(-1)

(end.mask(end.gt(s), s).sub(start)
    .groupby(df['Object']).sum()
    .dt.total_seconds().div(60)
)
Output:
Object
1 117.0
2 60.0
3 75.0
dtype: float64
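Assembled into a runnable script with the question's sample data (the column names match the table above), the clipping approach gives:

```python
import pandas as pd

# sample data from the question (times as HH:MM strings)
df = pd.DataFrame({
    'Object': [1, 1, 1, 2, 2, 3, 3],
    'Machine': ['A', 'B', 'C', 'A', 'C', 'A', 'C'],
    'T start': ['17:26', '17:26', '18:56', '14:00', '14:30', '12:00', '13:00'],
    'T end': ['17:57', '18:33', '19:46', '15:00', '15:00', '12:30', '13:45'],
})

start = pd.to_timedelta(df['T start'] + ':00')
end = pd.to_timedelta(df['T end'] + ':00')

# clip each end at the next task's start within the same object,
# so overlapping time is not counted twice
s = start.groupby(df['Object']).shift(-1)
out = (end.mask(end.gt(s), s).sub(start)
          .groupby(df['Object']).sum()
          .dt.total_seconds().div(60))
print(out)
```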
Handling overlapping intervals
See here for the logic of the overlapping intervals grouping.
(df.assign(
     start=pd.to_timedelta(df['T start'] + ':00'),
     end=pd.to_timedelta(df['T end'] + ':00'),
     max_end=lambda d: d.groupby('Object')['end'].cummax(),
     group=lambda d: d['start'].ge(d.groupby('Object')['max_end'].shift()).cumsum()
 )
 .groupby(['Object', 'group'])
 .apply(lambda g: g['end'].max() - g['start'].min())
 .groupby(level='Object').sum()
 .dt.total_seconds().div(60)
)
Output:
Object
1 117.0
2 60.0
3 75.0
4 35.0
dtype: float64
Used input:
Object Machine T start T end
0 1 A 17:26 17:57
1 1 B 17:26 18:33
2 1 C 18:56 19:46
3 2 A 14:00 15:00
4 2 C 14:30 15:00
5 3 A 12:00 12:30
6 3 C 13:00 13:45
7 4 A 12:00 12:30
8 4 C 12:00 12:15
9 4 D 12:20 12:35
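The same overlap-grouping logic, assembled as a self-contained script on the used input (object 4 exercises the fully-contained case):

```python
import pandas as pd

# "used input" above, including object 4 with an interval fully inside another
df = pd.DataFrame({
    'Object': [1, 1, 1, 2, 2, 3, 3, 4, 4, 4],
    'Machine': ['A', 'B', 'C', 'A', 'C', 'A', 'C', 'A', 'C', 'D'],
    'T start': ['17:26', '17:26', '18:56', '14:00', '14:30',
                '12:00', '13:00', '12:00', '12:00', '12:20'],
    'T end': ['17:57', '18:33', '19:46', '15:00', '15:00',
              '12:30', '13:45', '12:30', '12:15', '12:35'],
})

# a new "group" starts whenever a task begins at or after the running
# maximum end of its object's previous tasks; each group is one
# contiguous busy stretch, summed per object
out = (df.assign(
           start=pd.to_timedelta(df['T start'] + ':00'),
           end=pd.to_timedelta(df['T end'] + ':00'),
           max_end=lambda d: d.groupby('Object')['end'].cummax(),
           group=lambda d: d['start'].ge(d.groupby('Object')['max_end'].shift()).cumsum())
         .groupby(['Object', 'group'])
         .apply(lambda g: g['end'].max() - g['start'].min())
         .groupby(level='Object').sum()
         .dt.total_seconds().div(60))
print(out)
```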

def function1(dd: pd.DataFrame):
    # expand each task into the minutes it covers (the start minute excluded)
    col1 = dd.apply(lambda ss: pd.date_range(ss["T start"] + pd.to_timedelta("1 min"),
                                             ss["T end"], freq="min"), axis=1).explode()
    t_min = col1.min() - pd.to_timedelta("1 min")
    t_max = col1.max()
    sumT = col1.size
    # minutes covered by parallel tasks are counted only once
    LT_ACTUAL = col1.drop_duplicates().size
    return pd.DataFrame({"min": t_min.strftime('%H:%M'), "max": t_max.strftime('%H:%M'),
                         "sumT": sumT, "LT_ACTUAL": LT_ACTUAL}, index=[dd.name])

df1.groupby('Object').apply(function1).droplevel(0)
out:
min max sumT LT_ACTUAL
1 17:26 19:46 148 117
2 14:00 15:00 90 60
3 12:00 13:45 75 75

Related

Postgres table transformation: transposing values of a column into new columns

Is there a way to transpose/flatten the following table -
userId time window    propertyId count sum avg max
1      01:00 - 02:00  a          2     5   1.5 3
1      02:00 - 03:00  a          4     15  2.5 6
1      01:00 - 02:00  b          2     5   1.5 3
1      02:00 - 03:00  b          4     15  2.5 6
2      01:00 - 02:00  a          2     5   1.5 3
2      02:00 - 03:00  a          4     15  2.5 6
2      01:00 - 02:00  b          2     5   1.5 3
2      02:00 - 03:00  b          4     15  2.5 6
to something like this -
userId time window    a_count a_sum a_avg a_max b_count b_sum b_avg b_max
1      01:00 - 02:00  2       5     1.5   3     2       5     1.5   3
1      02:00 - 03:00  4       15    2.5   6     4       15    2.5   6
2      01:00 - 02:00  2       5     1.5   3     2       5     1.5   3
2      02:00 - 03:00  4       15    2.5   6     4       15    2.5   6
Basically, I want to flatten the table by having the aggregation columns (count, sum, avg, max) per propertyId, so the new columns are a_count, a_sum, a_avg, a_max, b_count, b_sum, ... All the rows have these values per userId per time window.
Important clarification: The values in propertyId column can change and hence, the number of columns can change as well. So, if there are n different values for propertyId, then there will be n*4 aggregation columns created.
SQL does not allow a dynamic number of result columns, as a matter of principle: it demands to know the number and data types of the resulting columns at call time. The only way to make it "dynamic" is a two-step process:
Generate the query.
Execute it.
If you don't actually need separate columns, returning arrays or document-type columns (json, jsonb, xml, hstore, ...) containing a variable number of data sets would be a feasible alternative.
See:
Execute a dynamic crosstab query
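As an aside for the pandas-minded reader: on the client side the column set can be dynamic, so the same flattening is a simple pivot; a sketch (the column names such as a_count are generated from the data, and the small frame here just mirrors the example table):

```python
import pandas as pd

df = pd.DataFrame({
    'userId': [1, 1, 1, 1, 2, 2, 2, 2],
    'time_window': ['01:00 - 02:00', '02:00 - 03:00'] * 4,
    'propertyId': ['a', 'a', 'b', 'b'] * 2,
    'count': [2, 4, 2, 4] * 2,
    'sum': [5, 15, 5, 15] * 2,
    'avg': [1.5, 2.5, 1.5, 2.5] * 2,
    'max': [3, 6, 3, 6] * 2,
})

wide = df.pivot(index=['userId', 'time_window'], columns='propertyId',
                values=['count', 'sum', 'avg', 'max'])
# flatten the (stat, propertyId) MultiIndex into a_count, b_count, ...
wide.columns = [f'{prop}_{stat}' for stat, prop in wide.columns]
wide = wide.reset_index()
print(wide)
```

Because the flattened names are built from whatever values appear in propertyId, n distinct properties automatically yield n*4 columns, which is exactly the part SQL cannot do in one static query.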

How to calculate slope of a dataframe, up to a specific row number?

I have this data frame that looks like this:
PE CE time
0 362.30 304.70 09:42
1 365.30 303.60 09:43
2 367.20 302.30 09:44
3 360.30 309.80 09:45
4 356.70 310.25 09:46
5 355.30 311.70 09:47
6 354.40 312.98 09:48
7 350.80 316.70 09:49
8 349.10 318.95 09:50
9 350.05 317.45 09:51
10 352.05 315.95 09:52
11 350.25 316.65 09:53
12 348.63 318.35 09:54
13 349.05 315.95 09:55
14 345.65 320.15 09:56
15 346.85 319.95 09:57
16 348.55 317.20 09:58
17 349.55 316.26 09:59
18 348.25 317.10 10:00
19 347.30 318.50 10:01
In this data frame, I would like to calculate the slope of both the first and second columns separately, over the period starting from the first time (here 09:42, which is not fixed and can vary) up to 12:00.
Please help me write it.
Computing the slope can be accomplished by use of the equation:
Slope = Rise/Run
Given that you want to compute the slope between two time entries, all you need to find is:
the Run = the distance between the start and end entries (here measured in rows, one per minute)
the Rise = the difference between the cell entries at the start and end.
The tricky part of these calculations is making sure you properly handle the time values:
import pandas as pd
from datetime import datetime
Thus you can define a function:
def computeSelectedSlope(df: pd.DataFrame, start: str, end: str, timecol: str, datacol: str) -> float:
    assert timecol in df.columns  # prove timecol exists
    assert datacol in df.columns  # prove datacol exists
    rise = (df[datacol][df[timecol] == datetime.strptime(end, '%H:%M:%S').time()].values[0] -
            df[datacol][df[timecol] == datetime.strptime(start, '%H:%M:%S').time()].values[0])
    run = (int(df.index[df[timecol] == datetime.strptime(end, '%H:%M:%S').time()].values[0]) -
           int(df.index[df[timecol] == datetime.strptime(start, '%H:%M:%S').time()].values[0]))
    return rise / run
Now given a dataframe df of the form:
A B T
0 2.632 231.229 00:00:00
1 2.732 239.026 00:01:00
2 2.748 251.310 00:02:00
3 3.018 285.330 00:03:00
4 3.090 308.925 00:04:00
5 3.366 312.702 00:05:00
6 3.369 326.912 00:06:00
7 3.562 330.703 00:07:00
8 3.590 379.575 00:08:00
9 3.867 422.262 00:09:00
10 4.030 428.148 00:10:00
11 4.210 442.521 00:11:00
12 4.266 443.631 00:12:00
13 4.335 444.991 00:13:00
14 4.380 453.531 00:14:00
15 4.402 462.531 00:15:00
16 4.499 464.170 00:16:00
17 4.553 471.770 00:17:00
18 4.572 495.285 00:18:00
19 4.665 513.009 00:19:00
You can find the slope for any time difference by:
computeSelectedSlope(df, '00:01:00', '00:15:00', 'T', 'B')
Which yields 15.964642857142858
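For reference, a compact variant of the same rise-over-run idea, written as an illustrative sketch (it assumes the time column holds datetime.time objects and that consecutive rows are one time step apart; the function name is mine, not from the answer):

```python
import pandas as pd
from datetime import datetime, time

def slope_between(df: pd.DataFrame, start: str, end: str,
                  timecol: str, datacol: str) -> float:
    # locate the rows whose time matches the requested boundaries
    t0 = datetime.strptime(start, '%H:%M:%S').time()
    t1 = datetime.strptime(end, '%H:%M:%S').time()
    i0 = df.index[df[timecol] == t0][0]
    i1 = df.index[df[timecol] == t1][0]
    rise = df.loc[i1, datacol] - df.loc[i0, datacol]
    run = i1 - i0  # rows are one time step apart
    return rise / run

df = pd.DataFrame({
    'B': [231.229, 239.026, 251.310, 285.330],
    'T': [time(0, m) for m in range(4)],
})
print(slope_between(df, '00:01:00', '00:03:00', 'T', 'B'))
```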

Pandas: to get mean for each data category daily [duplicate]

I am a somewhat beginner programmer learning Python (and pandas) and hope I can explain this well enough. I have a large time-series pandas dataframe of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations, denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many similar questions, like counting records per hour per day and getting the average per hour over several years. However, I run into trouble when including the 'Id' variable.
I'm looking to get the mean value of people taking a ticket for each hour, for each day of the week (mon-fri) and per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby(['Id', 'Day_name_no', 'Hour']).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue in a clear way. I looking for the mean per hour per day per Id as I plan to do clustering to separate my dataset into groups before applying a predictive model on these groups.
Any help would be appreciated, and if possible an explanation of what I am doing wrong, either code-wise or in my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per hour. And want to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years with 3 million rows, contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
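Applied to the toy dataset above, the same two-step idea (sum per calendar day, then mean per day-of-week and hour) reproduces the expected means; a minimal sketch using the toy column names:

```python
import pandas as pd

# toy dataset from the question: one row per ticket taken
df = pd.DataFrame({
    'Date': ['12/12/2014'] * 5 + ['19/12/2014'] * 3 + ['26/12/2014']
            + ['27/12/2014'] * 4 + ['04/01/2015'],
    'Id': [1234] * 14,
    'Dow': [0] * 9 + [1] * 5,
    'Hour': [9] * 8 + [10] + [11] * 5,
    'Count': [1] * 14,
})

# step 1: total tickets per Id per calendar day per hour
daily = df.groupby(['Id', 'Date', 'Dow', 'Hour'], as_index=False)['Count'].sum()
# step 2: average those daily totals per Id / day-of-week / hour
mean = daily.groupby(['Id', 'Dow', 'Hour'], as_index=False)['Count'].mean()
print(mean)
```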
You can use the groupby function on the 'Id' column and then use the resample function with .sum().

Difference between first row and current row, by group

I have a data set like this:
state,date,events_per_day
AM,2020-03-01,100
AM,2020-03-02,120
AM,2020-03-15,200
BA,2020-03-16,80
BA,2020-03-20,100
BA,2020-03-29,150
RS,2020-04-01,80
RS,2020-04-05,100
RS,2020-04-11,160
Now I need to compute the difference between the date in the first row of each group and the date in the current row.
i.e. the first row of each group:
for group "AM" the first date is 2020-03-01;
for group "BA" the first date is 2020-03-16;
for group "RS" it is 2020-04-01.
In the end, the result I want is:
state,date,events_per_day,days_after_first_event
AM,2020-03-01,100,0
AM,2020-03-02,120,1 <--- 2020-03-02 - 2020-03-01
AM,2020-03-15,200,14 <--- 2020-03-15 - 2020-03-01
BA,2020-03-16,80,0
BA,2020-03-20,100,4 <--- 2020-03-20 - 2020-03-16
BA,2020-03-29,150,13 <--- 2020-03-29 - 2020-03-16
RS,2020-04-01,80,0
RS,2020-04-05,100,4 <--- 2020-04-05 - 2020-04-01
RS,2020-04-11,160,10 <--- 2020-04-11 - 2020-04-01
I found How to calculate time difference by group using pandas? and it is almost what I want. However, diff() returns the difference between consecutive lines, and I need the difference between the current line and the first line.
How can I do this?
Option 3: groupby.transform
df['days_since_first'] = df['date'] - df.groupby('state')['date'].transform('first')
output
state date events_per_day days_since_first
0 AM 2020-03-01 100 0 days
1 AM 2020-03-02 120 1 days
2 AM 2020-03-15 200 14 days
3 BA 2020-03-16 80 0 days
4 BA 2020-03-20 100 4 days
5 BA 2020-03-29 150 13 days
6 RS 2020-04-01 80 0 days
7 RS 2020-04-05 100 4 days
8 RS 2020-04-11 160 10 days
Preprocessing:
# convert to datetime
df['date'] = pd.to_datetime(df['date'])
# extract the first dates by states:
first_dates = df.groupby('state')['date'].first() #.min() works as well
Option 1: Index alignment
# set_index before subtraction allows index alignment
df['days_since_first'] = (df.set_index('state')['date'] - first_dates).values
Option 2: map:
df['days_since_first'] = df['date'] - df['state'].map(first_dates)
Output:
state date events_per_day days_since_first
0 AM 2020-03-01 100 0 days
1 AM 2020-03-02 120 1 days
2 AM 2020-03-15 200 14 days
3 BA 2020-03-16 80 0 days
4 BA 2020-03-20 100 4 days
5 BA 2020-03-29 150 13 days
6 RS 2020-04-01 80 0 days
7 RS 2020-04-05 100 4 days
8 RS 2020-04-11 160 10 days
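If plain integer days are wanted (as in the expected output), append .dt.days to any of the options; a runnable sketch of Option 3 on the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['AM', 'AM', 'AM', 'BA', 'BA', 'BA', 'RS', 'RS', 'RS'],
    'date': ['2020-03-01', '2020-03-02', '2020-03-15', '2020-03-16',
             '2020-03-20', '2020-03-29', '2020-04-01', '2020-04-05', '2020-04-11'],
    'events_per_day': [100, 120, 200, 80, 100, 150, 80, 100, 160],
})
df['date'] = pd.to_datetime(df['date'])

# subtract each group's first date, then convert the Timedelta to int days
df['days_after_first_event'] = (df['date']
                                - df.groupby('state')['date'].transform('first')).dt.days
print(df)
```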

Add a new dataframe column which counts the values in a certain column prior to the current date and time

EDITED
I want to add a new column called prev_message_left which counts, per ID, the number of messages left before the current row's date and time. Basically I want a column which says how many times we had left a message on a call to that customer prior to the current time and date. This is how my data frame looks:
date ID call_time message_left
20191101 1 8:00 0
20191102 2 9:00 1
20191030 1 16:00 1
20191103 2 10:30 1
20191105 2 14:00 0
20191030 1 15:30 0
I want to add an additional column called prev_message_left_count
date ID call_time message_left prev_message_left_count
20191101 1 8:00 0 1
20191102 2 9:00 1 0
20191030 1 16:00 1 0
20191103 2 10:30 1 1
20191105 2 14:00 0 2
20191030 1 15:30 0 0
My dataframe has 15 columns and 90k rows.
I have various other columns in this dataframe and there are columns like 'No Message Left', 'Responded' for which I will have to compute additional columns called 'Previous_no_message_left' and 'prev_responded' similar to 'prev_message_left'
Use DataFrame.sort_values to get the cumulative sum in the correct order by groups. You can create groups using DataFrame.groupby:
df['prev_message_left_count'] = (df.sort_values(['date', 'call_time'])
                                   .groupby('ID')['message_left']
                                   .apply(lambda x: x.shift(fill_value=0).cumsum()))
print(df)
date ID call_time message_left prev_message_left_count
0 20191101 1 8:00 0 1
1 20191102 2 9:00 1 0
2 20191030 1 16:00 1 0
3 20191103 2 10:30 1 1
4 20191105 2 14:00 0 2
5 20191030 1 15:30 0 0
Sometimes GroupBy.apply is slow, so it may be advisable to avoid it:
df['prev_message_left_count'] = (df.sort_values(['date', 'call_time'])
                                   .groupby('ID')
                                   .shift(fill_value=0)
                                   .groupby(df['ID'])['message_left']
                                   .cumsum())
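As a quick end-to-end check of the shift-then-cumsum idea on the question's data (using transform here, which aligns the result back to the original row order):

```python
import pandas as pd

df = pd.DataFrame({
    'date': [20191101, 20191102, 20191030, 20191103, 20191105, 20191030],
    'ID': [1, 2, 1, 2, 2, 1],
    'call_time': ['8:00', '9:00', '16:00', '10:30', '14:00', '15:30'],
    'message_left': [0, 1, 1, 1, 0, 0],
})

# sort chronologically, then per ID: shift excludes the current call,
# cumsum counts messages left on all earlier calls
df['prev_message_left_count'] = (df.sort_values(['date', 'call_time'])
                                   .groupby('ID')['message_left']
                                   .transform(lambda x: x.shift(fill_value=0).cumsum()))
print(df)
```

The same pattern can be repeated for the other flag columns mentioned in the question ('No Message Left', 'Responded') to build 'prev_no_message_left' and 'prev_responded'.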