Query to calculate average time between successive events - SQL

My question is about how to write an SQL query to calculate the average time between successive events.
I have a small table:
event Name | Time
stage 1 | 10:01
stage 2 | 10:03
stage 3 | 10:06
stage 1 | 10:10
stage 2 | 10:15
stage 3 | 10:21
stage 1 | 10:22
stage 2 | 10:23
stage 3 | 10:29
I want to build a query that returns the average of the times between stage(i) and stage(i+1).
For example,
the average time between stage 2 and stage 3 is 5 minutes:
(3+6+6)/3 = 5

Aaaaand with a sprinkle of black magic:
select a.eventName, b.eventName, AVG(DATEDIFF(MINUTE, a.[Time], b.[Time])) as Average
from (select *, row_number() over (order by [time]) rn from events) a
join (select *, row_number() over (order by [time]) rn from events) b
  on a.rn = b.rn - 1
group by a.eventName, b.eventName
This will give you rows like:
stage3 stage1 2
stage1 stage2 2
stage2 stage3 5
The first column is the starting event, the second column is the ending event. A pair like stage 3 followed by stage 1 (where a new cycle starts) will be listed as well. If you don't want that, you should provide some criteria as to which stage may follow which stage, so the times are calculated only between those (a sketch follows below).
Added: This is written as Transact-SQL (SQL Server, Sybase). Oracle and PostgreSQL also have ROW_NUMBER(), but DATEDIFF would have to be replaced with their own date arithmetic. I haven't tested it and there could still be syntax errors. It will NOT work on older versions of MySQL, which lack window functions (they were only added in MySQL 8.0).
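For example, a minimal sketch of such a criterion (untested; same events table and SQL Server syntax as above), keeping only the stage 1 → stage 2 and stage 2 → stage 3 pairs:
-- Hedged sketch: restrict the averages to the transitions we consider valid.
select a.eventName, b.eventName, AVG(DATEDIFF(MINUTE, a.[Time], b.[Time])) as Average
from (select *, row_number() over (order by [time]) rn from events) a
join (select *, row_number() over (order by [time]) rn from events) b
  on a.rn = b.rn - 1
where (a.eventName = 'stage 1' and b.eventName = 'stage 2')
   or (a.eventName = 'stage 2' and b.eventName = 'stage 3')
group by a.eventName, b.eventName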

Select Avg(differ) From (
    Select s2.time - s1.time As differ
    -- rownum must be assigned after the Order By, hence the extra nesting
    From (Select rownum As r, time From (Select time From events Order By time)) s1
    Join (Select rownum As r, time From (Select time From events Order By time)) s2
      On mod(s2.r, 3) = 2 And s2.r = s1.r + 1
    Where mod(s1.r, 3) = 1
);
The parameters can be changed as the number of stages changes. This is currently set up to find the average between stages 1 and 2 of a 3-stage process (a retargeted sketch follows below).
EDIT: fixed a couple of typos
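For instance, a hedged retargeting of the same pattern at the stage 2 → stage 3 gap (untested; events as the assumed table name):
Select Avg(differ) From (
    Select s2.time - s1.time As differ
    From (Select rownum As r, time From (Select time From events Order By time)) s1
    Join (Select rownum As r, time From (Select time From events Order By time)) s2
      On mod(s2.r, 3) = 0 And s2.r = s1.r + 1   -- rows 3, 6, 9, ...
    Where mod(s1.r, 3) = 2                      -- rows 2, 5, 8, ...
);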

Your table design is flawed. How can you tell which stage 1 goes with which stage 2? Without a way to do this, I do not think your query is possible.

The easiest way would be to order by time and use a cursor (T-SQL) to iterate over the data. Since cursors are evil, it is advisable to fetch the data ordered by time into your application code and iterate there. There are probably other ways to do this in SQL, but they will be very complicated and rely on non-standard language extensions.
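That said, here is a hedged sketch of the cursor variant in case you want to stay in T-SQL (untested; assumes an events table with a [Time] column):
-- Walk the rows in time order, accumulating the gaps between successive events.
declare @t datetime, @prev datetime
declare @total int, @n int
set @total = 0
set @n = 0
declare c cursor fast_forward for
    select [Time] from events order by [Time]
open c
fetch next from c into @t
while @@fetch_status = 0
begin
    if @prev is not null
    begin
        set @total = @total + datediff(minute, @prev, @t)
        set @n = @n + 1
    end
    set @prev = @t
    fetch next from c into @t
end
close c
deallocate c
select case when @n > 0 then @total * 1.0 / @n end as avg_minutes_between_events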

You don't say which flavour of SQL you want the answer for. This probably means you want the code in SQL Server (as [sql] commonly = [sql-server] in SO tag usage).
But just in case you (or some future seeker) are using Oracle, this kind of query is quite straightforward with analytic functions, in this case LAG(). Check it out:
select stage_range
     , avg(time_diff)/60 as average_time_diff_in_min
from
(
  select event_name
       , case when event_name = 'stage 2' then 'stage 1 to 2'
              when event_name = 'stage 3' then 'stage 2 to 3'
              else '!!!' end as stage_range
       , stage_secs - lag(stage_secs)
             over (order by ts, event_name) as time_diff
  from
  ( select event_name
         , ts
         , to_number(to_char(ts, 'sssss')) as stage_secs
    from timings )
)
where event_name in ('stage 2','stage 3')
group by stage_range
/

STAGE_RANGE   AVERAGE_TIME_DIFF_IN_MIN
------------  ------------------------
stage 1 to 2                2.66666667
stage 2 to 3                         5
The format conversion in the inner query is necessary because I have stored the TIME column as a DATE datatype, so I convert it into seconds past midnight to make the arithmetic clearer. An alternative solution would be to work with the INTERVAL DAY TO SECOND datatype instead. But this solution is really all about LAG().
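A hedged sketch of that interval alternative (assuming ts were a TIMESTAMP instead of a DATE, since subtracting TIMESTAMPs yields INTERVAL DAY TO SECOND while subtracting DATEs yields a number of days):
select event_name
     , ts - lag(ts) over (order by ts, event_name) as gap  -- INTERVAL DAY TO SECOND
from timings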
Edit: In my take on this query I have deliberately not calculated the difference between a prior stage 3 and a subsequent stage 1. That is a matter of requirements.

WITH q AS
(
SELECT 'stage 1' AS eventname, CAST('2009-01-01 10:01:00' AS DATETIME) AS eventtime
UNION ALL
SELECT 'stage 2' AS eventname, CAST('2009-01-01 10:03:00' AS DATETIME) AS eventtime
UNION ALL
SELECT 'stage 3' AS eventname, CAST('2009-01-01 10:06:00' AS DATETIME) AS eventtime
UNION ALL
SELECT 'stage 1' AS eventname, CAST('2009-01-01 10:10:00' AS DATETIME) AS eventtime
UNION ALL
SELECT 'stage 2' AS eventname, CAST('2009-01-01 10:15:00' AS DATETIME) AS eventtime
UNION ALL
SELECT 'stage 3' AS eventname, CAST('2009-01-01 10:21:00' AS DATETIME) AS eventtime
UNION ALL
SELECT 'stage 1' AS eventname, CAST('2009-01-01 10:22:00' AS DATETIME) AS eventtime
UNION ALL
SELECT 'stage 2' AS eventname, CAST('2009-01-01 10:23:00' AS DATETIME) AS eventtime
UNION ALL
SELECT 'stage 3' AS eventname, CAST('2009-01-01 10:29:00' AS DATETIME) AS eventtime
)
SELECT (
SELECT AVG(DATEDIFF(minute, '2009-01-01', eventtime))
FROM q
WHERE eventname = 'stage 3'
) -
(
SELECT AVG(DATEDIFF(minute, '2009-01-01', eventtime))
FROM q
WHERE eventname = 'stage 2'
)
This relies on the fact that you always have complete groups of stages and that they always run in the same order (stage 1, then stage 2, then stage 3).
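The same trick gives the stage 1 to stage 2 average; a hedged sketch (prepend the same WITH q AS (...) block, since a CTE only lives for one statement):
SELECT (
       SELECT AVG(DATEDIFF(minute, '2009-01-01', eventtime))
       FROM q
       WHERE eventname = 'stage 2'
       ) -
       (
       SELECT AVG(DATEDIFF(minute, '2009-01-01', eventtime))
       FROM q
       WHERE eventname = 'stage 1'
       )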

I can't comment, but I have to agree with HLGEM. While you can tell with the provided data set, the OP should be made aware that relying on only a single set of stages existing at one time may be too optimistic.
event Name | Time
stage 1 | 10:01
stage 2 | 10:03
stage 3 | 10:06
stage 1 | 10:10
stage 2 | 10:15
stage 3 | 10:21
stage 1 | 10:22
stage 2 | 10:23
stage 1 | 10:25 --- new stage 1
stage 2 | 10:28 --- new stage 2
stage 3 | 10:29
stage 3 | 10:34 --- new stage 3
We don't know the environment or what is creating the data; it is up to the OP to decide if the table is built correctly. Oracle would handle this with analytics, like Vilx's answer does.

Try this:
Select Avg(e.Time - s.Time)
From Table s
Join Table e
On e.Time =
(Select Min(Time)
From Table
Where eventname = s.eventname
And time > s.Time)
And Not Exists
(Select * From Table
Where eventname = s.eventname
And time < s.Time)
For each record representing the start of a stage, this SQL joins it to the record which represents the end, takes the difference between the end time and the start time, and averages those differences. The Not Exists ensures that the intermediate resultset of start records joined to end records only includes the start records as s, and the first join condition ensures that only one end record (the one with the same name and the next time value after the start time) is joined to it.
To see the intermediate resultset after the join, but before the average is taken, run the following:
Select s.EventName,
s.Time StartTime, e.Time EndTime,
(e.Time - s.Time) Elapsed
From Table s
Join Table e
On e.Time =
(Select Min(Time)
From Table
Where eventname = s.eventname
And time > s.Time)
And Not Exists
(Select * From Table
Where eventname = s.eventname
And time < s.Time)

Related

Running "distinct on" across all unique thresholds in a postgres table

I have a Postgres 11 table called sample_a that looks like this:
time | cat | val
------+-----+-----
1 | 1 | 5
1 | 2 | 4
2 | 1 | 6
3 | 1 | 9
4 | 3 | 2
I would like to create a query that for each unique timestep, gets the most recent values across each category at or before that timestep, and aggregates these values by taking the sum of these values and dividing by the count of these values.
I believe I have the query to do this for a given timestep. For example, for time 3 I can run the following query:
select sum(val)::numeric / count(val) as result from (
select distinct on (cat) * from sample_a where time <= 3 order by cat, time desc
) x;
and get 6.5. (This is because at time 3, the latest value from category 1 is 9 and the latest from category 2 is 4. The count of the values is 2, they sum to 13, and 13 / 2 is 6.5.)
However, I would ideally like to run a query that will give me all the results for each unique time in the table. The output of this new query would look as follows:
time | result
------+----------
1 | 4.5
2 | 5
3 | 6.5
4 | 5
This new query ideally would avoid adding another subselect clause if possible; an efficient query would be preferred. I could get these prior results by running the prior query inside my application for each timestep, but this doesn't seem efficient for a large sample_a.
What would this new query look like?
See if performance is acceptable this way. Syntax might need minor tweaks:
select t.time, avg(mr.val) as result
from (select distinct time from sample_a) t,
lateral (
select distinct on (cat) val
from sample_a a
where a.time <= t.time
order by a.cat, a.time desc
) mr
group by t.time
I think you just want cumulative window functions:
select time,
       sum(sum_val) over (order by time) / sum(num_val) over (order by time) as result
from (select time, sum(val) as sum_val, count(*) as num_val
      from sample_a a
      group by time
     ) a;
Note if val is an integer, you might need to convert to a numeric to get fractional values.
This can be expressed without a subquery as well:
select time,
sum(sum(val)) over (order by time) / sum(count(*)) over (order by time) as result
from sample_a
group by time
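For instance, a hedged example of the numeric conversion mentioned above (Postgres cast syntax):
select time,
       (sum(sum(val)) over (order by time))::numeric
           / sum(count(*)) over (order by time) as result  -- avoids integer division
from sample_a
group by time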

How does ORDER BY behave in SQL when the ORDER BY column has duplicate values?

I'm using a query to fill null values with the first_value window function.
The entire query, with example data, is below:
WITH example (date,close) AS
(VALUES
('12:00:00',3),
('12:00:01',4),
('12:00:01',5),
('12:00:03',NULL),
('12:00:04',NULL),
('12:00:05',3)
)
SELECT * INTO temporary table market_summary FROM example;
select
date,
cccc,
first_value(cccc) over (partition by grp_close) as corrected_close
from (
select date, close as cccc,
sum(case when close is not null then 1 end) over (order by date) as grp_close
from market_summary
) t
The result is:
date cccc corrected_close
1 12:00:00 3 3
2 12:00:01 4 4
3 12:00:01 5 4
4 12:00:03 NULL 4
5 12:00:04 NULL 4
6 12:00:05 3 3
Here, 'date' is used as the ORDER BY column in the query, but it contains a duplicate value, '12:00:01'. The null values are being filled with '4', which is not what I want: I want nulls filled with the previous non-null value, which in this case is '5', not '4'. The result should instead be:
date cccc corrected_close
1 12:00:00 3 3
2 12:00:01 4 4
3 12:00:01 5 5
4 12:00:03 NULL 5
5 12:00:04 NULL 5
6 12:00:05 3 3
How do I modify the query to meet my requirement?
You should change the window function so that you get the correct value:
last_value(cccc) IGNORE NULLS OVER (PARTITION BY grp_close ORDER BY date)
This is the way defined by the SQL standard, but a lot of databases don't implement the standard in this respect. Since you tagged a lot of databases, it is difficult to give a generic answer that works for all of them.
What you want is lag(... ignore nulls). But Postgres doesn't support that.
Here is one work-around:
select e.*, coalesce(close, max(close) over (partition by grp))
from (select e.*, count(close) over (order by date) as grp
from example e
) e;
You can even do this without a subquery:
select e.*,
coalesce(close,
(array_remove(array_agg(close) over (order by date), null))[array_upper(array_remove(array_agg(close) over (order by date), null), 1)]
)
from example e;

SQL Query: how to return only the first and last instance?

I have a table that shows the status of each case, with multiple jobs being performed simultaneously. I would like the results to show only the first and last instance of each job (mainly I want to know when the job was first started and what its last known status is).
I've managed to get the results with two similar min/max GROUP BY queries joined by a UNION. But is there a simpler way?
Also, would it be possible to display the two instances on one line instead of two separate lines? The date from the first instance would be the start date and the last instance the end date, and I don't really care about the first status because it's always pending; I just want the last known status.
The 1st table shows unfiltered results and the 2nd table is the desired result (but if we can combine the first and last instance on one line, that'd be even better).
ID Status Date Job Note
1 pending 1-Jul A abc
1 pending 2-Jul A xyz
1 pending 2-Jul A abc
1 done 3-Jul B xyz
1 done 4-Jul A abc
2 pending 1-Jul A abc
2 done 2-Jul A xyz
2 done 2-Jul A abc
2 pending 3-Jul C xyz
2 pending 4-Jul C xyz
2 pending 5-Jul C xyz
2 pending 6-Jul C xyz
3 pending 2-Jul D xyz
3 done 3-Jul D abc
3 pending 4-Jul D abc
3 pending 1-Jul E xyz
3 done 3-Jul E xyz
ID Status Date Job Note
1 pending 1-Jul A abc
1 done 3-Jul B xyz
1 done 4-Jul A abc
2 pending 1-Jul A abc
2 done 2-Jul A abc
2 pending 3-Jul C xyz
2 pending 6-Jul C xyz
3 pending 2-Jul D xyz
3 pending 4-Jul D abc
3 pending 1-Jul E xyz
3 done 3-Jul E xyz
Thank you very much in advance
One way to do it is to use the ROW_NUMBER function twice, in ascending and descending order, to get the first and last rows of each group. See SQL Fiddle:
WITH
CTE
AS
(
SELECT
ID
,Status
,dt
,Job
,Note
,ROW_NUMBER() OVER (PARTITION BY ID, Job ORDER BY dt ASC) AS rnASC
,ROW_NUMBER() OVER (PARTITION BY ID, Job ORDER BY dt DESC) AS rnDESC
FROM T
)
SELECT
ID
,Status
,dt
,Job
,Note
FROM CTE
WHERE rnAsc=1 OR rnDesc=1
ORDER BY ID, Job, dt
This variant would scan through the whole table, calculate row numbers and discard those rows that don't satisfy the filter.
The second variant is to use CROSS APPLY, which may be more efficient, if (a) your main table has millions of rows, (b) you have a small table with the list of all IDs and Jobs, (c) the main table has appropriate index. In this case instead of reading all rows of the main table you can do index seek for each (ID, Job) (two seeks, one for first row plus one for the last row).
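A hedged sketch of that CROSS APPLY variant (untested; IdJobs is a hypothetical small table holding every distinct ID/Job pair, and an index on T(ID, Job, dt) is assumed):
SELECT x.ID, x.Status, x.dt, x.Job, x.Note
FROM IdJobs ij
CROSS APPLY
(   -- index seek: first row for this (ID, Job)
    SELECT TOP (1) ID, Status, dt, Job, Note
    FROM T
    WHERE T.ID = ij.ID AND T.Job = ij.Job
    ORDER BY dt ASC
) x
UNION ALL
SELECT y.ID, y.Status, y.dt, y.Job, y.Note
FROM IdJobs ij
CROSS APPLY
(   -- index seek: last row for this (ID, Job)
    SELECT TOP (1) ID, Status, dt, Job, Note
    FROM T
    WHERE T.ID = ij.ID AND T.Job = ij.Job
    ORDER BY dt DESC
) y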
Try this:
SELECT A.ID, A.JOB, A.STATUS, B.START_DATE, CASE WHEN A.STATUS = 'done' THEN C.END_DATE ELSE NULL END AS END_DATE
FROM <JOBS_TABLE> A
JOIN (SELECT ID, JOB, MIN(DATE) AS START_DATE FROM <JOBS_TABLE> GROUP BY ID, JOB) B
ON A.ID = B.ID
AND A.JOB = B.JOB
JOIN (SELECT ID, JOB, MAX(DATE) AS END_DATE FROM <JOBS_TABLE> GROUP BY ID, JOB) C
ON A.ID= C.ID
AND A.JOB = C.JOB
AND A.DATE = C.END_DATE
You'll need to replace <JOBS_TABLE> with whatever your table name is. Ideally, this should combine the data from the first and last rows for each distinct set of ID and JOB values. If the job is not finished, it will not show an END_DATE.
I don't think there's much wrong with your UNION idea. Is this what you have?
select id, job, status, max(date), note, 'max' as type from test1 group by id, job UNION
select id, job, status, min(date), note, 'min' as type from test1 group by id, job;

Complex Query Involving Search for Contiguous Dates (by Month)

I have a table that contains a list of accounts by month along with a field that indicates activity. I want to search through to find when an account has "died", based on the following criteria:
the account had consistent activity for a contiguous period of months
the account had a spike of activity on a final month (spike = 200% or more of average of all previous contiguous months of activity)
the month immediately following the spike of activity and the next 12 months all had 0 activity
So the table might look something like this:
ID | Date | Activity
1 | 1/1/2010 | 2
2 | 1/1/2010 | 3.2
1 | 2/3/2010 | 3
2 | 2/3/2010 | 2.7
1 | 3/2/2010 | 8
2 | 3/2/2010 | 9
1 | 4/6/2010 | 0
2 | 4/6/2010 | 0
1 | 5/2/2010 | 0
2 | 5/2/2010 | 2
So in this case both accounts 1 and 2 have activity in months Jan - Mar. Both accounts exhibit a spike of activity in March. Both accounts have 0 activity in April. Account 2 has activity again in May, but account 1 does not. Therefore, my query should return Account 1, but not Account 2. I would want to see this as my query result:
ID | Last Date
1 | 3/2/2010
I realize this is a complicated question and I'm not expecting anyone to write the whole query for me. The current best approach I can think of is to create a series of sub-queries and join them, but I don't even know what the subqueries would look like. For example: how do I look for a contiguous series of rows for a single ID where activity is all 0 (or all non-zero)?
My fall-back if the SQL is simply too involved is to use a brute-force search using Java where I would first find all unique IDs, and then for each unique ID iterate across the months to determine if and when the ID "died".
Once again: any help to move in the right direction is very much appreciated.
Processing in Java, or partially processing in SQL and finishing in Java, is a good approach.
I'm not going to tackle how to define a spike.
I will suggest that you start with condition 3: it's easy to find the last non-zero value. Then that's the one you want to test for a spike, with consistent data before the spike.
SELECT out.*
FROM monthly_activity out
LEFT OUTER JOIN monthly_activity comp
ON out.ID = comp.ID AND out.Date < comp.Date AND comp.Activity <> 0
WHERE comp.Date IS NULL
Not bad, but you don't want the result if this is because the record is the last for the month, so instead,
SELECT out.*
FROM monthly_activity out
INNER JOIN monthly_activity comp
ON out.ID = comp.ID AND out.Date < comp.Date AND comp.Activity = 0
GROUP BY out.ID
Probably not the world's most efficient code, but I think this does what you're after:
declare #t table (AccountId int, ActivityDate date, Activity float)
insert #t
select 1, '2010-01-01', 2
union select 2, '2010-01-01', 3.2
union select 1, '2010-02-03', 3
union select 2, '2010-02-03', 2.7
union select 1, '2010-03-02', 8
union select 2, '2010-03-02', 9
union select 1, '2010-04-06', 0
union select 2, '2010-04-06', 0
union select 1, '2010-05-02', 0
union select 2, '2010-05-02', 2
select AccountId, ActivityDate LastActivityDate --, Activity
from #t a
where
--Part 2 --select only where the activity is a peak
Activity >= isnull
(
(
select 2 * avg(c.Activity)
from #t c
where c.AccountId = a.AccountId
and c.ActivityDate >= isnull
(
(
select max(d.ActivityDate)
from #t d
where d.AccountId = c.AccountId
and d.ActivityDate < c.ActivityDate
and d.Activity = 0
)
,
(
select min(e.ActivityDate)
from #t e
where e.AccountId = c.AccountId
)
)
and c.ActivityDate < a.ActivityDate
)
, Activity + 1 --Part 1 (i.e. if no activity before today don't include the result)
)
--Part 3
and not exists --select only dates which have had no activity for the following 12 months on the same account (assumption: count no record as no activity / also ignore current date in this assumption)
(
select 1
from #t b
where a.AccountId = b.AccountId
and b.Activity > 0
and b.ActivityDate between dateadd(DAY, 1, a.ActivityDate) and dateadd(YEAR, 1, a.ActivityDate)
)

MySQL Combine multiple rows

I have a table similar to the following (of course with more rows and fields):
category_id | client_id | date | step
1 1 2009-12-15 first_step
1 1 2010-02-03 last_step
1 2 2009-04-05 first_step
1 2 2009-08-07 last_step
2 3 2009-11-22 first_step
3 4 2009-11-14 first_step
3 4 2010-05-09 last_step
I would like to transform this so that I can calculate the time between the first and last steps and eventually find the average time between first and last steps, etc. Basically, I'm stumped at how to transform the above table into something like:
category_id | first_date | last_date
1 2009-12-15 2010-02-03
1 2009-04-05 2009-08-07
2 2009-11-22 NULL
3 2009-11-14 2010-05-09
Any help would be appreciated.
Updated based on question update/clarification:
SELECT t.category_id,
MIN(t.date) AS first_date,
CASE
WHEN MAX(t.date) = MIN(t.date) THEN NULL
ELSE MAX(t.date)
END AS last_date
FROM TABLE t
GROUP BY t.category_id, t.client_id
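From there, a hedged sketch of the follow-up calculation the question mentions, averaging the first-to-last gap per category (MySQL's DATEDIFF(a, b) returns a - b in days; pairs with no last_step row contribute 0 here, so filter them out if that's unwanted):
SELECT x.category_id,
       AVG(DATEDIFF(x.last_date, x.first_date)) AS avg_days  -- days between first and last step
FROM (SELECT t.category_id,
             t.client_id,
             MIN(t.date) AS first_date,
             MAX(t.date) AS last_date
      FROM TABLE t
      GROUP BY t.category_id, t.client_id
     ) x
GROUP BY x.category_id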
a simple GROUP BY should do the trick
SELECT category_id
, MIN(first_date)
, MAX(last_date)
FROM TABLE
GROUP BY category_ID
Try:
select
category_id
, min(date) as first_date
, max(date) as last_date
from
table_name
group by
category_id
, client_id
You need to do two sub-queries and then join them together - something like the following:
select
    *
from
    (select
        *,
        date as first_date
    from
        table
    where step = "first_step") a
left join (select
        *,
        date as last_date
    from
        table
    where step = "last_step") b
on (a.category_id = b.category_id and a.client_id = b.client_id)
Enjoy!
Simple answer:
SELECT first.category_id, first.date, last.date
FROM tableName first
LEFT JOIN tableName last
    ON first.category_id = last.category_id
   AND first.client_id = last.client_id
   AND last.step = 'last_step'
WHERE first.step = 'first_step'
You could also do the calculations in the query instead of just returning the two date values.
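For example, a hedged sketch of that in-query calculation, building on the query above (MySQL's DATEDIFF returns whole days):
SELECT first.category_id,
       first.date AS first_date,
       last.date AS last_date,
       DATEDIFF(last.date, first.date) AS days_between  -- NULL when there is no last_step row
FROM tableName first
LEFT JOIN tableName last
    ON first.category_id = last.category_id
   AND first.client_id = last.client_id
   AND last.step = 'last_step'
WHERE first.step = 'first_step'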