Add a new dataframe column that counts values in a certain column occurring before each row's date and time - pandas

I want to add a new column called prev_message_left which counts the number of message_left values per ID with a date and time earlier than the current row's. Basically, I want a column that says how many times we had left a message on a call to that customer prior to the current date and time. This is how my dataframe looks:
date      ID  call_time  message_left
20191101  1   8:00       0
20191102  2   9:00       1
20191030  1   16:00      1
20191103  2   10:30      1
20191105  2   14:00      0
20191030  1   15:30      0
I want to add an additional column called prev_message_left_count
date      ID  call_time  message_left  prev_message_left_count
20191101  1   8:00       0             1
20191102  2   9:00       1             0
20191030  1   16:00      1             0
20191103  2   10:30      1             1
20191105  2   14:00      0             2
20191030  1   15:30      0             0
My dataframe has 15 columns and 90k rows.
This dataframe has various other columns, such as 'No Message Left' and 'Responded', for which I will have to compute analogous columns ('prev_no_message_left', 'prev_responded') in the same way as 'prev_message_left'.

Use DataFrame.sort_values so the cumulative sum is taken in the correct order within each group; the groups are created with DataFrame.groupby:
df['prev_message_left_count'] = (df.sort_values(['date', 'call_time'])
                                   .groupby('ID')['message_left']
                                   .apply(lambda x: x.shift(fill_value=0).cumsum()))
print(df)
       date  ID call_time  message_left  prev_message_left_count
0  20191101   1      8:00             0                        1
1  20191102   2      9:00             1                        0
2  20191030   1     16:00             1                        0
3  20191103   2     10:30             1                        1
4  20191105   2     14:00             0                        2
5  20191030   1     15:30             0                        0
GroupBy.apply can sometimes be slow, so it may be advisable to avoid it:
df['prev_message_left_count'] = (df.sort_values(['date', 'call_time'])
                                   .groupby('ID')
                                   .shift(fill_value=0)
                                   .groupby(df['ID'])['message_left']
                                   .cumsum())
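Since the question also asks for analogous columns such as 'prev_no_message_left' and 'prev_responded', the same shift-then-cumsum pattern can be applied in a loop. A minimal sketch, where the source column names ('no_message_left', 'responded') are assumptions based on the question:
# Apply the shift-then-cumsum pattern to several columns.
# Source column names are assumptions based on the question.
ordered = df.sort_values(['date', 'call_time'])
pairs = [('message_left', 'prev_message_left_count'),
         ('no_message_left', 'prev_no_message_left_count'),
         ('responded', 'prev_responded_count')]
for src, dst in pairs:
    shifted = ordered.groupby('ID')[src].shift(fill_value=0)
    df[dst] = shifted.groupby(df['ID']).cumsum()  # result aligns back on the index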

Related

Calculate the actual duration of successive or parallel task with Python Pandas

I have a pandas dataframe with many rows. Each row holds an object and its machining interval on a certain machine (a start time and an end time). Each object can be processed on several machines in succession. I need to find the actual duration of all jobs.
For example:
Object  Machine  T start  T end
1       A        17:26    17:57
1       B        17:26    18:33
1       C        18:56    19:46
2       A        14:00    15:00
2       C        14:30    15:00
3       A        12:00    12:30
3       C        13:00    13:45
For object 1 the actual duration is 117 minutes, for object 2 it is 60 minutes, and for object 3 it is 75 minutes.
I tried a groupby in which I calculated the sum of the durations of the processes for each object, plus the minimum and maximum values, i.e. the first start and the last end. Then I wrote a function that compares these values, but while it works for objects 2 and 3, it fails for object 1.
Here is my solution:
Object  min    max    sumT  LT_ACTUAL
1       17:26  19:46  148   140 ERROR!
2       14:00  15:00  90    60 OK!
3       12:00  13:45  75    75 OK!
def calc_lead_time(min_t_start, max_t_end, t_sum):
    t_max_min = (max_t_end - min_t_start) / pd.Timedelta(minutes=1)
    if t_max_min <= t_sum:
        return t_max_min
    else:
        return t_sum

df['LT_ACTUAL'] = df.apply(lambda x: calc_lead_time(x['min'], x['max'], x['sumT']), axis=1)
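For reference, the min/max/sumT aggregation feeding this function can be produced with a single groupby; a sketch, assuming 'T start' and 'T end' have already been parsed to timedeltas:
# Sketch of the aggregation behind the min/max/sumT table above
# (assumes 'T start'/'T end' are timedeltas, e.g. pd.to_timedelta(col + ':00')).
dur = (df['T end'] - df['T start']) / pd.Timedelta(minutes=1)
agg = (df.assign(dur=dur)
         .groupby('Object')
         .agg(min=('T start', 'min'),
              max=('T end', 'max'),
              sumT=('dur', 'sum'))
         .reset_index())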
I posted an image to explain all the cases. I need to calculate the actual duration across the tasks.
Assuming the data is sorted by start time, and that one task duration is not fully within another one, you can use:
start = pd.to_timedelta(df['T start'] + ':00')
end = pd.to_timedelta(df['T end'] + ':00')
# next task's start within the same Object
s = start.groupby(df['Object']).shift(-1)
# clip each end at the next start, then sum the durations per Object
(end.mask(end.gt(s), s).sub(start)
    .groupby(df['Object']).sum()
)
Output:
Object
1 0 days 01:57:00
2 0 days 01:00:00
3 0 days 01:15:00
dtype: timedelta64[ns]
For minutes:
start = pd.to_timedelta(df['T start'] + ':00')
end = pd.to_timedelta(df['T end'] + ':00')
s = start.groupby(df['Object']).shift(-1)
(end.mask(end.gt(s), s).sub(start)
    .groupby(df['Object']).sum()
    .dt.total_seconds().div(60)
)
Output:
Object
1 117.0
2 60.0
3 75.0
dtype: float64
Handling overlapping intervals
See here for the logic of the overlapping intervals grouping.
(df.assign(start=pd.to_timedelta(df['T start'] + ':00'),
           end=pd.to_timedelta(df['T end'] + ':00'),
           max_end=lambda d: d.groupby('Object')['end'].cummax(),
           group=lambda d: d['start'].ge(d.groupby('Object')['max_end'].shift()).cumsum())
   .groupby(['Object', 'group'])
   .apply(lambda g: g['end'].max() - g['start'].min())
   .groupby(level='Object').sum()
   .dt.total_seconds().div(60)
)
Output:
Object
1 117.0
2 60.0
3 75.0
4 35.0
dtype: float64
Used input:
Object Machine T start T end
0 1 A 17:26 17:57
1 1 B 17:26 18:33
2 1 C 18:56 19:46
3 2 A 14:00 15:00
4 2 C 14:30 15:00
5 3 A 12:00 12:30
6 3 C 13:00 13:45
7 4 A 12:00 12:30
8 4 C 12:00 12:15
9 4 D 12:20 12:35
def function1(dd: pd.DataFrame):
    # one timestamp per minute of machining, for every task of this object
    col1 = dd.apply(lambda ss: pd.date_range(ss["T start"] + pd.to_timedelta("1 min"),
                                             ss["T end"], freq="min"), axis=1).explode()
    t_min = col1.min() - pd.to_timedelta("1 min")
    t_max = col1.max()
    sumT = col1.size
    LT_ACTUAL = col1.drop_duplicates().size  # overlapping minutes counted once
    return pd.DataFrame({"min": t_min.strftime('%H:%M'), "max": t_max.strftime('%H:%M'),
                         "sumT": sumT, "LT_ACTUAL": LT_ACTUAL}, index=[dd.name])

df1.groupby('Object').apply(function1).droplevel(0)
Output:
min max sumT LT_ACTUAL
1 17:26 19:46 148 117
2 14:00 15:00 90 60
3 12:00 13:45 75 75

Postgres table transformation: transposing values of a column into new columns

Is there a way to transpose/flatten the following table -
userId  time window    propertyId  count  sum  avg  max
1       01:00 - 02:00  a           2      5    1.5  3
1       02:00 - 03:00  a           4      15   2.5  6
1       01:00 - 02:00  b           2      5    1.5  3
1       02:00 - 03:00  b           4      15   2.5  6
2       01:00 - 02:00  a           2      5    1.5  3
2       02:00 - 03:00  a           4      15   2.5  6
2       01:00 - 02:00  b           2      5    1.5  3
2       02:00 - 03:00  b           4      15   2.5  6
to something like this -
userId  time window    a_count  a_sum  a_avg  a_max  b_count  b_sum  b_avg  b_max
1       01:00 - 02:00  2        5      1.5    3      2        5      1.5    3
1       02:00 - 03:00  4        15     2.5    6      4        15     2.5    6
2       01:00 - 02:00  2        5      1.5    3      2        5      1.5    3
2       02:00 - 03:00  4        15     2.5    6      4        15     2.5    6
Basically, I want to flatten the table by having the aggregation columns (count, sum, avg, max) per propertyId, so the new columns are a_count, a_sum, a_avg, a_max, b_count, b_sum, ... All the rows have these values per userId per time window.
Important clarification: The values in propertyId column can change and hence, the number of columns can change as well. So, if there are n different values for propertyId, then there will be n*4 aggregation columns created.
SQL does not allow a dynamic number of result columns, as a matter of principle: it demands that the number and data types of the resulting columns be known at call time. The only way to make it "dynamic" is a two-step process:
1. Generate the query.
2. Execute it.
If you don't actually need separate columns, returning arrays or document-type columns (json, jsonb, xml, hstore, ...) containing a variable number of data sets would be a feasible alternative.
See:
Execute a dynamic crosstab query
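A minimal sketch of the two-step process driven from Python with psycopg2 (the table name agg and the connection string are assumptions for illustration):
# Step 1: generate the query from the propertyId values actually present.
import psycopg2
from psycopg2 import sql

conn = psycopg2.connect("dbname=mydb")  # assumed connection string
cur = conn.cursor()
cur.execute('SELECT DISTINCT "propertyId" FROM agg ORDER BY 1')
props = [row[0] for row in cur.fetchall()]

cols = [
    sql.SQL('max({c}) FILTER (WHERE "propertyId" = {p}) AS {a}').format(
        c=sql.Identifier(c), p=sql.Literal(p), a=sql.Identifier(f'{p}_{c}'))
    for p in props
    for c in ('count', 'sum', 'avg', 'max')
]
query = sql.SQL('SELECT "userId", {tw}, {cols} FROM agg GROUP BY "userId", {tw}').format(
    tw=sql.Identifier('time window'),
    cols=sql.SQL(', ').join(cols))

# Step 2: execute it.
cur.execute(query)
rows = cur.fetchall()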

Pandas: to get mean for each data category daily [duplicate]

I am a somewhat beginner programmer learning Python (+pandas), and I hope I can explain this well enough. I have a large time-series pandas dataframe of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many similar questions, like counting records per hour per day and getting the average per hour over several years. However, I run into trouble when including the 'Id' variable.
I'm looking to get the mean value of people taking a ticket for each hour, for each day of the week (mon-fri) and per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date, giving the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby([Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Hour']]).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue clearly. I am looking for the mean per hour, per day, per Id, as I plan to cluster the dataset into groups before applying a predictive model to them.
Any help would be appreciated, and if possible an explanation of what I am doing wrong, either in my code or in my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question while short of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per Hour, and get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years, has 3 million rows, and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
You can also group by the 'Id' column and then use the resample function to sum the counts per hour (in modern pandas, .resample('H').sum() rather than the old how='sum' argument).
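For reference, a sketch of that resample idea with the current API:
# Hourly totals per Id, then the mean per (Id, day of week, hour).
# Caveat: resample fills hours with no tickets as 0, which lowers the means
# compared with averaging only over hours that actually had tickets.
hourly = (df.set_index('Start_date')
            .groupby('Id')['Count']
            .resample('H')
            .sum())
ts = hourly.index.get_level_values('Start_date')
mean_per_id_dow_hour = hourly.groupby(['Id', ts.dayofweek, ts.hour]).mean()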

Query repeating event in a given time range

I have a set of events (Time in MM/DD/YYYY format) that can be repeated by different users:
EventId Event Time User
1 Start 06/01/2012 10:05AM 1
1 End 06/05/2012 10:45AM 1
2 Start 07/07/2012 09:55AM 2
2 End 09/07/2012 11:05AM 2
3 Start 09/01/2012 11:05AM 2
3 End 09/03/2012 11:05AM 2
I want to get, using SQL, the events a user has performed within a specified time range. For instance, given 06/06/2012 and 09/02/2012, I am expecting to get:
EventId Event Time User
2 Start 07/07/2012 09:55AM 2
2 End 09/07/2012 11:05AM 2
3 Start 09/01/2012 11:05AM 2
Any idea on how to deal with this?
A basic range query should work here:
SELECT *
FROM yourTable
WHERE Time >= '2012-06-06'::date AND Time < '2012-09-03'::date;
This assumes you want records from June 6, 2012 through September 2, 2012 inclusive; the half-open upper bound (< September 3) covers the whole of the final day.
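In application code the bounds would normally be passed as parameters rather than spliced into the SQL string; a sketch with psycopg2 placeholders (table and column names as in the answer):
# Parameterized range query; the driver handles quoting of the bounds.
cur.execute(
    'SELECT * FROM yourTable WHERE Time >= %s AND Time < %s',
    ('2012-06-06', '2012-09-03'),
)
events = cur.fetchall()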

Construct a grouping column in SQL Server 2012

I have created a table that looks something like this:
ID  TSPPLY_DT   NEXT_DT     DAYS_BTWN  TIME_TO_EVENT      CENSORED  ENDPOINT
-----------------------------------------------------------------------------
1   2014-01-01  2014-01-10  10         10                 0         0
1   2014-01-10  2014-01-21  11         21                 0         0
1   2014-01-21  NULL        NULL       21                 1         0
2   2015-04-01  2015-04-30  30         30                 0         0
2   2015-04-30  2015-05-03  1          31                 0         1
2   2015-05-03  2015-05-06  3          34 (should be 3)   0         0
2   2015-05-06  2015-05-16  10         44 (should be 13)  1         0
The TIME_TO_EVENT column, however, is not adding up correctly with my code. The idea is to keep adding up DAYS_BTWN until either the ID changes, CENSORED = 1, or ENDPOINT = 1.
I think what I need is an additional column so I can sum based on an aggregate of ID and GROUPING, with an output as follows:
ID  TSPPLY_DT   NEXT_DT     DAYS_BTWN  TIME_TO_EVENT  CENSORED  ENDPOINT  GROUPING
----------------------------------------------------------------------------------
1   2014-01-01  2014-01-10  10         10             0         0         A
1   2014-01-10  2014-01-21  11         21             0         0         A
1   2014-01-21  NULL        NULL       21             1         0         A
2   2015-04-01  2015-04-30  30         30             0         0         A
2   2015-04-30  2015-05-03  1          31             0         1         A
2   2015-05-03  2015-05-06  3          3              0         0         B
2   2015-05-06  2015-05-16  10         13             1         0         B
So any ideas on how to create the GROUPING column? It would be something like: if the next row's ID is the same as the current row's, check CENSORED and ENDPOINT; if either = 1, change the grouping to a new value for the next row. Once a new ID is reached, reset the grouping to A (or whatever arbitrary value) and run the test again.
You need to use the DATEDIFF function, like this:
DATEDIFF(d, TSPPLY_DT, NEXT_DT) AS DAYS_BTWN
Now you don't need GROUP BY.
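The GROUPING column itself is a classic gaps-and-islands problem: a new group starts on the row after one with CENSORED = 1 or ENDPOINT = 1, within the same ID. For illustration only, here is that rule sketched in pandas rather than T-SQL (column names follow the tables above; rows assumed sorted by ID and TSPPLY_DT):
import pandas as pd

# A new group begins whenever the previous row within the same ID
# had CENSORED = 1 or ENDPOINT = 1.
prev_break = (df.groupby('ID')[['CENSORED', 'ENDPOINT']]
                .shift(fill_value=0)
                .any(axis=1))
group_no = prev_break.groupby(df['ID']).cumsum()  # 0, 1, 2, ... per ID
df['GROUPING'] = group_no.map(lambda n: chr(ord('A') + int(n)))

# TIME_TO_EVENT then restarts with each (ID, GROUPING) island.
df['TIME_TO_EVENT'] = (df['DAYS_BTWN'].fillna(0)
                         .groupby([df['ID'], df['GROUPING']])
                         .cumsum())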