How to calculate average time between different events from the same table - sql

I am currently trying to calculate the average processing time of a message in Postgres. There are multiple stages in the processing lifecycle, and I would like to identify the average processing time between each stage. I have successfully calculated the average processing time for the full lifecycle using the following:
select AVG(e2.timestamp - e.timestamp) as avg_gap
from event e
join event e2 on (e.message_id = e2.message_id)
where e.event_stage = 'start'
  and e.timestamp > '2022-10-01T00:00:08.000001Z' and e.timestamp < '2022-10-31T23:59:59.999999Z'
  and e2.event_stage = 'end'
  and e2.timestamp > '2022-10-01T00:00:08.000001Z' and e2.timestamp < '2022-10-31T23:59:59.999999Z'
However I would now like to add additional event stages to the query to calculate the average processing time between each stage of the lifecycle.
As I am an SQL noob, I have tried to update my query to the below, but I receive the error "operator does not exist: interval & interval".
select AVG((e3.timestamp - e2.timestamp) & (e2.timestamp - e.timestamp)) avg_gap
from event e
join event e2 on (e.message_id = e2.message_id)
join event e3 on (e2.message_id = e3.message_id)
where e.event_stage= 'start' and e.timestamp > '2022-10-01T00:00:08.000001Z' and e.timestamp < '2022-10-31T23:59:59.999999Z'
and e2.event_stage= 'validation' and e2.timestamp > '2022-10-01T00:00:08.000001Z' and e2.timestamp < '2022-10-31T23:59:59.999999Z'
and e3.event_stage= 'end' and e3.timestamp > '2022-10-01T00:00:08.000001Z' and e3.timestamp < '2022-10-31T23:59:59.999999Z'
I was hoping that the above would provide me with an average processing time from start to validation, and validation to end.
NOTE - There are other stages that I would eventually like to include such as parsing and transforming.
Is it possible for someone to provide some input on how to add multiple stages to the query?
EDIT - table structure as per the attached image.

Can you clarify the table structure?
Along with event_stage, do you also have some kind of an ID associated with the stage?
That would help if you wanted to use lead/lag to get the timestamp for the "next" stage.
Something like this -
SELECT message_id,
       event_stage,
       timestamp,
       lead(timestamp, 1) over (partition by message_id order by event_stage_id) as next_stage_timestamp
FROM event
You could then use "next_stage_timestamp - timestamp" to get your difference and average it, grouping by event_stage.
Like this -
select event_stage,
       avg(next_stage_timestamp - timestamp) as avg_time
from above_results
group by event_stage
This is better than doing multiple self joins. However, it works only if you have some kind of an ID associated with each event stage.
So your table would be like this -
Message  StageID  StageName     Timestamp
-----------------------------------------
A        1        Start         00
A        2        Calculate     20
A        3        Intermediate  30
A        4        Validate      40
A        5        End           60
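As an aside, the error in the question comes from & being the bitwise AND operator, which is not defined for intervals; the original self-join version works if each gap is averaged in its own column, e.g. AVG(e2.timestamp - e.timestamp) and AVG(e3.timestamp - e2.timestamp). Putting the two snippets above together, a minimal sketch of the window-function version, assuming the suggested event_stage_id ordering column exists (that column name is an assumption; anything that orders the stages within a message will do):
with gaps as (
    select e.message_id,
           e.event_stage,
           e.timestamp,
           lead(e.timestamp) over (partition by e.message_id
                                   order by e.event_stage_id) as next_stage_timestamp
    from event e
    where e.timestamp >= '2022-10-01' and e.timestamp < '2022-11-01'
)
select g.event_stage,
       avg(g.next_stage_timestamp - g.timestamp) as avg_time
from gaps g
where g.next_stage_timestamp is not null
group by g.event_stage;
The where clause drops each message's final stage, which has no next stage to compare against.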

Related

Calculations based on condition in PostgreSQL

I am having trouble doing calculations in one table using conditional statements. I have a table 'df' with the following column names:
id - int
time - timestamp
correctness - boolean
subject - text
Every student (id) completes tasks on a particular subject (subject). The system stores True in the correctness column if the assignment is completed correctly and False if not; the time when the student completes the task is also saved (time).
I need to write an optimal SQL query that counts all students who completed 20 tasks successfully within an hour during March 2020.
Thanks in advance!
You can do this with no subqueries using:
select distinct df.id
from df
where df.time >= '2020-03-01' and df.time < '2020-04-01'
  and df.correctness
group by df.id, date_trunc('hour', df.time)
having count(*) >= 20;
Note: date_trunc('hour', ...) buckets by calendar hour, so this finds 20 successful tasks within the same clock hour, which is a close approximation of "within an hour".
For performance, you want an index on (time).
You need to look at each correct task and see whether there are 20 correct tasks, delivered within the hour leading up to it (counting itself).
That means you have to inner join tasks onto itself and then count.
select distinct on (tasks.id) tasks.id, tasks.time, count(*)
from tasks
inner join tasks previous_tasks
  on tasks.id = previous_tasks.id
  and previous_tasks.time <= tasks.time
  and tasks.time - previous_tasks.time < interval '1 hour'
  and previous_tasks.correctness
where tasks.correctness
  and tasks.time >= '2020-03-01' and tasks.time < '2020-04-01'
group by 1, 2
having count(*) >= 20
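For completeness, a hedged sketch of a third option: on Postgres 11 or later, a RANGE window frame expresses the sliding one-hour window directly, with no self-join (table and column names taken from the question; successful tasks in the hour straddling March 1 are ignored here):
select distinct s.id
from (
    select df.id,
           count(*) over (
               partition by df.id
               order by df.time
               range between interval '1 hour' preceding and current row
           ) as correct_in_hour
    from df
    where df.correctness
      and df.time >= '2020-03-01' and df.time < '2020-04-01'
) s
where s.correct_in_hour >= 20;
Each row's count looks back exactly one hour from that task, so it matches the "within an hour" reading literally rather than by clock hour.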

How to flag consecutive shifts in SQL efficiently

I have a dataset that contains 250,000 rows and is expected to grow at around 100,000 rows a month.
I have data that contains the following columns:
ShiftDate (Day a shift occurred on),
Shift Start Time,
Shift End Time
and Employee Number.
I would like to flag consecutive shifts with a 1 when an employee's shift end time is within 4 hours of the start time of their next shift, and otherwise flag it with a 0.
(image: my data table)
I have tried running a query that joins the table to itself, but the run time is too long. I was planning to create the flag with a case statement using 'NextStart':
select shiftdate,
shiftstarttime,
shiftendtime,
EmployeeID,
(select min(t2.shiftstarttime) from TABLE t2 where t1.EmployeeID=t2.EmployeeID and T2.shiftstarttime > t1.Shiftendtime) as NextStart
from
TABLE t1
I would love to know a more efficient way of trying to do this.
Thanks!
select shiftdate, shiftstarttime, shiftendtime, employeeid,
       (case when lead(shiftstarttime, 1) over (partition by employeeid order by shiftdate, shiftstarttime)
                  - shiftendtime < interval '4 hours'
             then 1 else 0 end) as consecutive_shift_flag
from table_name
In this query the lead() window function is used to get the next shift start time for each employee, and the difference is compared against a 4-hour interval:
lead(shiftstarttime, 1) over (partition by employeeid order by shiftdate, shiftstarttime)
If this is not what you are looking for, please share a sample of the correct output for a couple of input cases.
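If shiftstarttime and shiftendtime are time-only columns (which the separate ShiftDate column suggests), a hedged Postgres-style sketch that builds full timestamps first, so overnight shifts and day boundaries compare correctly; the end-on-the-next-day rule is an assumption:
with shifts as (
    select employeeid,
           shiftdate + shiftstarttime as start_ts,
           -- assumption: a shift that "ends before it starts" ends the next day
           case when shiftendtime >= shiftstarttime
                then shiftdate + shiftendtime
                else shiftdate + shiftendtime + interval '1 day'
           end as end_ts
    from table_name
)
select employeeid, start_ts, end_ts,
       case when lead(start_ts) over (partition by employeeid order by start_ts)
                 - end_ts < interval '4 hours'
            then 1 else 0 end as consecutive_shift_flag
from shifts;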

Calculate the sum of a field filtered by a window defined on another field

I have table event:
event_date,
num_events,
site_id
I can easily use aggregate SQL to do SELECT SUM(num_events) GROUP BY site_id.
But I also have another table site:
site_id,
target_date
I'd like to do a JOIN, showing the SUM of num_events within 60 days of the target_date, 90 days, 120 days, etc. I thought this could easily be done using a WHERE clause in the aggregate SQL. However, this is complicated by two challenges:
The target_date is not fixed, but varies for each site_id
I'd like multiple date ranges to be output in the same table; so I can't do a simple WHERE to exclude records falling outside the range from the event table.
One workaround I've thought of is to simply make several queries, one for each date range, and then use a view to paste them together. Is there a simpler, better, or more elegant way to achieve my goals?
You would do something like:
select e.site_id,
       sum(case when s.target_date - e.event_date < 30 then e.num_events else 0 end) as within_030,
       sum(case when s.target_date - e.event_date < 60 then e.num_events else 0 end) as within_060,
       sum(case when s.target_date - e.event_date < 90 then e.num_events else 0 end) as within_090
from event e join
     site s
     on e.site_id = s.site_id
group by e.site_id;
That is, you can use conditional aggregation. I am not sure exactly what "within 60 days" means; this counts days before the target date, but similar logic will work for whatever you need.
In Postgres 9.4 or later, use the aggregate FILTER clause:
Assuming an actual date data type, so we can simply add / subtract integers as days.
Interpreting "within n days" as "+/- n days":
SELECT site_id, s.target_date
, sum(e.num_events) FILTER (WHERE e.event_date BETWEEN s.target_date - 30
AND s.target_date + 30) AS sum_30
, sum(e.num_events) FILTER (WHERE e.event_date BETWEEN s.target_date - 60
AND s.target_date + 60) AS sum_60
, sum(e.num_events) FILTER (WHERE e.event_date BETWEEN s.target_date - 90
AND s.target_date + 90) AS sum_90
FROM site s
JOIN event e USING (site_id)
WHERE e.event_date BETWEEN s.target_date - 90
AND s.target_date + 90
GROUP BY 1, 2;
Also add the condition as a WHERE clause to exclude irrelevant rows early. This should be substantially faster when event holds more than a trivial number of rows outside the sum_90 range.

Select statement to show next 'event' in the future

I am trying to retrieve the record of the next upcoming event. I have used a variety of different methods but cannot seem to get the result I need: the retrieved event has to be in the future.
For example, if there was an event yesterday and there is one in three weeks' time, I would like the record of the one in three weeks' time, rather than yesterday's.
The statement I have currently is:
SELECT TOP 1 *
FROM Events
WHERE StartDate <= DATEADD(day, DATEDIFF(day,0,getdate()), 0)
ORDER BY StartDate ASC
thanks
SELECT TOP 1 E.*
FROM Events E
WHERE E.StartDate > GetDate()
ORDER BY E.StartDate ASC
http://msdn.microsoft.com/en-us/library/ms188383.aspx
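A hedged variation: if "in the future" should also include events from earlier today, compare against today's midnight rather than the exact current moment (the date cast requires SQL Server 2008+):
SELECT TOP 1 E.*
FROM Events E
WHERE E.StartDate >= CAST(GETDATE() AS date)  -- midnight today
ORDER BY E.StartDate ASC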

Sqlite3: Need to Cartesian On date

I have a table in a sqlite3 database which is a list of games that have been played. The field "datetime" is the datetime of when the game ended. The field "duration" is the number of seconds the game lasted. I want to know what percent of the past 24 hours had at least 5 games running simultaneously. I figured out how to tell how many games were running at a given time:
select count(*)
from games
where strftime('%s',datetime)+0 >= 1257173442 and
strftime('%s',datetime)-duration <= 1257173442
If I had a table that was simply a list of every second (or every 30 seconds, or something) I could do an intentional cartesian product like this:
select count(*)
from (
select count(*) as concurrent, d.second
from games g, date d
where strftime('%s',datetime)+0 >= d.second and
strftime('%s',datetime)-duration <= d.second and
d.second >= strftime('%s','now') - 24*60*60 and
d.second <= strftime('%s','now')
group by d.second) x
where concurrent >=5
Is there a way to create this date table on the fly? Or is there a way to get a similar effect without having to actually create a new table that is simply a list of all the seconds this week?
Thanks
First, I can't think of a way to approach your problem by creating a table on the fly or without the aid of an extra table. Sorry.
My suggestion is for you to rely on a static Numbers table.
Create a fixed table with the format:
CREATE TABLE Numbers (
number INTEGER PRIMARY KEY
);
Populate it with the number of seconds in 24h (24*60*60 = 86400). I would use any scripting language to do that using the insert statement:
insert into numbers default values;
Now the Numbers table has the numbers 1 through 86400. Your query will then be modified to be:
select count(*)
from (
select count(*) as concurrent, strftime('%s','now') - 86401 + n.number second
from games g, numbers n
where strftime('%s',datetime)+0 >= strftime('%s','now') - 86401 + n.number and
strftime('%s',datetime)-duration <= strftime('%s','now') - 86401 + n.number
group by second) x
where concurrent >=5
Without a procedural language in the mix, that is the best you'll be able to do, I think.
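An update for later readers, offered as a hedged sketch: SQLite 3.8.3 and later support recursive CTEs (a feature that postdates this question), which can generate the numbers on the fly with no static table:
WITH RECURSIVE seconds(second) AS (
    -- one row per second over the past 24 hours
    SELECT strftime('%s','now') - 86400
    UNION ALL
    SELECT second + 1 FROM seconds WHERE second < strftime('%s','now')
)
SELECT count(*)
FROM (
    SELECT count(*) AS concurrent, s.second
    FROM games g, seconds s
    WHERE strftime('%s', g.datetime) + 0 >= s.second
      AND strftime('%s', g.datetime) - g.duration <= s.second
    GROUP BY s.second
) x
WHERE concurrent >= 5;
Divide the resulting count by 86400.0 to get the percentage the question asks for.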
Great question!
Here's a query that I think gives you what you want without using a separate table. Note this is untested (so it probably contains errors), and I've assumed datetime is an int column holding the number of seconds, to avoid a ton of strftime's.
select sum(concurrent_period) from (
select min(end_table.datetime - begin_table.begin_time) as concurrent_period
from (
select g1.datetime, g1.num_end, count(*) as concurrent
from (
select datetime, count(*) as num_end
from games group by datetime
) g1, games g2
where g2.datetime >= g1.datetime and
g2.datetime-g2.duration < g1.datetime and
g1.datetime >= strftime('%s','now') - 24*60*60 and
g1.datetime <= strftime('%s','now')+0
) end_table, (
select g3.begin_time, g3.num_begin, count(*) as concurrent
from (
select datetime-duration as begin_time,
count(*) as num_begin
from games group by datetime-duration
) g3, games g4
where g4.datetime >= g3.begin_time and
g4.datetime-g4.duration < g3.begin_time and
g3.begin_time >= strftime('%s','now') - 24*60*60 and
g3.begin_time <= strftime('%s','now')+0
) begin_table
where end_table.datetime > begin_table.begin_time
and begin_table.concurrent < 5
and begin_table.concurrent+begin_table.num_begin >= 5
and end_table.concurrent >= 5
and end_table.concurrent-end_table.num_end < 5
group by begin_table.begin_time
) aah
The basic idea is to make two tables: one with the # of concurrent games at the begin time of each game, and one with the # of concurrent games at the end time. Then join the tables together and only take rows at "critical points" where # of concurrent games crosses 5. For each critical begin time, take the critical end time that happened soonest and that hopefully gives all the periods where at least 5 games were running concurrently.
Hope that's not too convoluted to be helpful!
Kevin rather beat me to the punchline there (+1), but I'll post this variation as it differs at least a little.
The key ideas are
Map the data in to a stream of events with attributes time and 'polarity' (=start or end of game)
Keep a running total of how many games are open at the time of each event
(this is done by forming a self-join on the event stream)
Find the event times where the number of games (as Kevin says) transitions up to 5, or down to 4
A little trick: add up all the down-to-4 times and take away the up-to-5s - the order is not important
The result is the number of seconds spent with 5 or more games open
I don't have SQLite, so I've been testing with MySQL, and I've not bothered to limit the time window, to preserve some sanity. It shouldn't be difficult to revise.
Also, and more importantly, I've not considered what to do if games are open at the beginning or end of the period!
Something tells me there's a big simplification to be had here, but I've not spotted it yet.
SELECT SUM( event_time )
FROM (
  SELECT -ga.event_type * ga.event_time AS event_time,
         SUM( ga.event_type * gb.event_type ) event_type
  FROM
  -- event stream: +1 at each game start, -1 at each game end
  -- (UNION ALL, so simultaneous events are not collapsed)
  ( SELECT UNIX_TIMESTAMP( g1.endtime ) - g1.duration AS event_time
         , 1 event_type
    FROM games g1
    UNION ALL
    SELECT UNIX_TIMESTAMP( g1.endtime )
         , -1
    FROM games g1 ) AS ga,
  ( SELECT UNIX_TIMESTAMP( g1.endtime ) - g1.duration AS event_time
         , 1 event_type
    FROM games g1
    UNION ALL
    SELECT UNIX_TIMESTAMP( g1.endtime )
         , -1
    FROM games g1 ) AS gb
  WHERE
  ga.event_time >= gb.event_time
  GROUP BY ga.event_time, ga.event_type
  HAVING SUM( ga.event_type * gb.event_type ) IN ( -4, 5 )
) AS gr
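The big simplification hinted at above arrived later in the form of window functions (MySQL 8+; Postgres works the same way). Offered as a hedged sketch of the same event-stream idea, again with the 24-hour windowing left out:
SELECT SUM(CASE WHEN open_games >= 5
                THEN next_time - event_time
                ELSE 0 END) AS seconds_with_5_plus
FROM (
    SELECT event_time,
           -- running count of games open just after this instant
           SUM(event_type) OVER (ORDER BY event_time) AS open_games,
           LEAD(event_time) OVER (ORDER BY event_time) AS next_time
    FROM (
        SELECT UNIX_TIMESTAMP(endtime) - duration AS event_time, 1 AS event_type
        FROM games
        UNION ALL
        SELECT UNIX_TIMESTAMP(endtime), -1
        FROM games
    ) events
) t;
Ties at the same second are handled because the default RANGE frame includes peer rows, so only the last row of a tie contributes a non-zero gap.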
Why don't you trim the date and keep only the time? If you filter your data to any given date, every time within it is unique. That way you'll only need a table with numbers from 1 to 86400 (or fewer if you use bigger intervals); you could create two columns, "from" and "to", to define the intervals.
I'm not familiar with SQLite's functions, but according to the manual you can use the strftime function with the format HH:MM:SS.
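For what it's worth, a hedged one-liner for the time-of-day mapping this answer describes, expressed as seconds since midnight so it lines up with a 1..86400 numbers table:
SELECT strftime('%s', datetime) - strftime('%s', date(datetime)) AS seconds_since_midnight
FROM games;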