How to merge intervals based on a minimum gap of dates? - sql

My current table looks like the one below. Each patient has a visit start date and end date at a hospital, and is administered a drug between admin_startdate and admin_enddate. For example, the first two rows mean that patient PT1 had two drug administrations, one from 01/08 to 01/10 and the other from 01/12 to 01/23, during her visit from 01/01 to 01/31.
ptid visit_start_date visit_end_date admin_startdate admin_enddate
PT1 2018-01-01 2018-01-31 2018-01-08 2018-01-10
PT1 2018-01-01 2018-01-31 2018-01-12 2018-01-23
PT2 2018-01-02 2018-01-18 2018-01-06 2018-01-11
PT2 2018-01-02 2018-01-18 2018-01-14 2018-01-17
What I would like to achieve is to lump together drug administrations that are too close together, say, where the start date of the new one is at most 2 days after the end date of the previous one, and call that a single episode, like below:
ptid visit_start_date visit_end_date admin_startdate admin_enddate episode_startdate episode_enddate
PT1 2018-01-01 2018-01-31 2018-01-08 2018-01-10 2018-01-08 2018-01-23
PT1 2018-01-01 2018-01-31 2018-01-12 2018-01-23 2018-01-08 2018-01-23
PT2 2018-01-02 2018-01-18 2018-01-06 2018-01-11 2018-01-06 2018-01-11
PT2 2018-01-02 2018-01-18 2018-01-14 2018-01-17 2018-01-14 2018-01-17
You can see that PT1's two administrations are lumped together with the same episode_startdate and episode_enddate, whereas PT2's two administrations are considered two separate episodes.
I am having a hard time figuring out how to do this in PostgreSQL (Redshift).

This works in Postgres 14. Not tested for Redshift.
SELECT ptid, visit_start_date, visit_end_date, admin_startdate, admin_enddate
     , min(admin_startdate) OVER (PARTITION BY visit_id, admin) AS episode_startdate
     , max(admin_enddate)   OVER (PARTITION BY visit_id, admin) AS episode_enddate
FROM  (
   SELECT *, count(*) FILTER (WHERE gap) OVER (PARTITION BY visit_id ORDER BY admin_startdate) AS admin
   FROM  (
      SELECT *, admin_startdate - lag(admin_enddate) OVER (PARTITION BY visit_id ORDER BY admin_startdate) > 2 AS gap
      FROM  (
         SELECT *, dense_rank() OVER (ORDER BY ptid, visit_start_date, visit_end_date) AS visit_id  -- optional, to simplify
         FROM   tbl
         ) sub1
      ) sub2
   ) sub3;
The innermost subquery sub1 is only to compute a unique visit_id - which should really be in your table instead of repeating (ptid, visit_start_date, visit_end_date ) over and over. Consider normalizing your design at least that much.
The next subquery sub2 checks for a gap that's greater than two days to the previous row in the same partition.
Subquery sub3 then keeps a running count of those gaps to identify distinct administration periods (admin).
In the outer SELECT, min(admin_startdate) and max(admin_enddate) per administration period produce the desired episode dates.
See (with assorted links to more):
How to group timestamps into islands (based on arbitrary gap)?
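To make the four steps above concrete, here is a minimal Python sketch of the same gaps-and-islands logic outside the database. The function name add_episodes and the tuple layout are assumptions made for illustration only; the 2-day threshold and the sample rows come from the question.

```python
from datetime import date
from itertools import groupby

def add_episodes(rows, max_gap_days=2):
    """rows: (ptid, visit_start, visit_end, admin_start, admin_end) tuples.
    Returns the rows with (episode_start, episode_end) appended."""
    visit = lambda r: r[:3]                  # plays the role of visit_id
    ordered = sorted(rows, key=lambda r: (visit(r), r[3]))
    out = []
    for _, grp in groupby(ordered, key=visit):
        episodes = []                        # each episode is a list of rows
        for r in grp:
            prev = episodes[-1][-1] if episodes else None
            # like the gap flag: more than max_gap_days since the previous end date
            if prev is None or (r[3] - prev[4]).days > max_gap_days:
                episodes.append([r])
            else:
                episodes[-1].append(r)
        for ep in episodes:
            start = min(r[3] for r in ep)    # episode_startdate
            end = max(r[4] for r in ep)      # episode_enddate
            out.extend(r + (start, end) for r in ep)
    return out

rows = [
    ("PT1", date(2018, 1, 1), date(2018, 1, 31), date(2018, 1, 8), date(2018, 1, 10)),
    ("PT1", date(2018, 1, 1), date(2018, 1, 31), date(2018, 1, 12), date(2018, 1, 23)),
    ("PT2", date(2018, 1, 2), date(2018, 1, 18), date(2018, 1, 6), date(2018, 1, 11)),
    ("PT2", date(2018, 1, 2), date(2018, 1, 18), date(2018, 1, 14), date(2018, 1, 17)),
]
result = add_episodes(rows)
```

With the sample data, both PT1 rows should end up in one episode (2018-01-08 to 2018-01-23), while the two PT2 rows stay separate because they are 3 days apart.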

CREATE TABLE tb1 AS (
   SELECT *, admin_startdate - lag(admin_enddate) OVER (PARTITION BY visit_id ORDER BY admin_startdate) > 2 AS gap
   FROM  (
      SELECT *, dense_rank() OVER (ORDER BY ptid, visit_start_date, visit_end_date) AS visit_id  -- optional, to simplify
      FROM   tbl
      ) sub1
);
-- Running count of gaps per visit. sum(CASE ...) stands in for count(*) FILTER (WHERE gap),
-- so rows that follow a gap row keep that gap's number instead of falling back to 0.
CREATE TABLE tb2 AS (
   SELECT *, sum(CASE WHEN gap THEN 1 ELSE 0 END) OVER (PARTITION BY visit_id ORDER BY admin_startdate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS admin
   FROM   tb1
);
CREATE TABLE tb3 AS (
   SELECT ptid, visit_start_date, visit_end_date, admin_startdate, admin_enddate
        , min(admin_startdate) OVER (PARTITION BY visit_id, admin) AS episode_startdate
        , max(admin_enddate)   OVER (PARTITION BY visit_id, admin) AS episode_enddate
   FROM   tb2
);
This is an uglier version adapted from Erwin's answer for Redshift, which does not support the FILTER clause. It produced the correct result on db<>fiddle, at least.

Related

Find out per day the first trip duration and last trip duration of a bike

Find out per day first trip duration and last trip duration of a bike.
Table
trip_id bike-id trip_date trip_starttime trip_duration
1 1 2018-12-01 12:00:00.0000000 10
2 2 2018-12-01 14:00:00.0000000 25
3 1 2018-12-01 14:30:00.0000000 5
4 3 2018-12-02 05:00:00.0000000 12
5 3 2018-12-02 19:00:00.0000000 37
6 1 2018-12-02 20:30:00.0000000 20
Expected Result
trip_date bike-id first_trip_duration last_trip_duration
2018-12-01 1 10 5
2018-12-01 2 25 25
2018-12-02 1 20 20
2018-12-02 3 12 37
I tried it with below code,
select A.trip_date, A.[bike-id], A.trip_duration AS MinDuration, B.trip_duration AS MaxDuration
from (
    SELECT T1.trip_date, T1.[bike-id], T1.trip_duration
    FROM Trip T1
    INNER JOIN (
        select trip_date, [bike-id], min(trip_starttime) AS Mindate
        from Trip
        group by trip_date, [bike-id]
    ) T2
    ON T1.[bike-id] = T2.[bike-id] AND T1.trip_date = T2.trip_date AND T1.trip_starttime = T2.Mindate
) as A
inner join (
    SELECT T1.trip_date, T1.[bike-id], T1.trip_duration
    FROM Trip T1
    INNER JOIN (
        select trip_date, [bike-id], max(trip_starttime) AS Maxdate
        from Trip
        group by trip_date, [bike-id]
    ) T2
    ON T1.[bike-id] = T2.[bike-id] AND T1.trip_date = T2.trip_date AND T1.trip_starttime = T2.Maxdate
) as B
ON A.[bike-id] = B.[bike-id] AND A.trip_date = B.trip_date
order by A.trip_date, A.[bike-id]
I would like to see other approaches too; please help.
First, determine for each date/bike the first and last trip.
Then, determine the duration of these trips.
Something like this might do it (I didn't test it though):
SELECT minmax.trip_date,
       minmax.bike_id,
       first.trip_duration AS first_trip_duration,
       last.trip_duration AS last_trip_duration
FROM (SELECT trip_date,
             bike_id,
             MIN(trip_starttime) AS first_trip,
             MAX(trip_starttime) AS last_trip
      FROM trip_table
      GROUP BY trip_date,
               bike_id
     ) minmax
JOIN trip_table first
  ON minmax.trip_date = first.trip_date
 AND minmax.bike_id = first.bike_id
 AND minmax.first_trip = first.trip_starttime
JOIN trip_table last
  ON minmax.trip_date = last.trip_date
 AND minmax.bike_id = last.bike_id
 AND minmax.last_trip = last.trip_starttime
This assumes you have the necessary indexes on the table, preferably a unique index on (bike_id, trip_date, trip_starttime).
select trip_date, bike_id
     , first_value(trip_duration) over(partition by trip_date, bike_id order by trip_starttime) as first_trip_duration
     , first_value(trip_duration) over(partition by trip_date, bike_id order by trip_starttime desc) as last_trip_duration
from trip;
Assuming window functions are supported, this can be done with first_value. The query above returns one row per trip, so DISTINCT is needed to get a single row per date and bike:
select distinct
trip_date
,bike_id
,first_value(trip_duration) over(partition by trip_date,bike_id order by trip_starttime) as first_trip_duration
,first_value(trip_duration) over(partition by trip_date,bike_id order by trip_starttime desc) as last_trip_duration
from trip
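As a cross-check of what the first_value queries compute, here is a small Python sketch of the same partition-and-order idea. The function name first_last_durations and the tuple layout are assumptions for illustration; the rows are the sample data from the question, with dates and times kept as zero-padded strings so that plain string sorting matches chronological order.

```python
from itertools import groupby

# (trip_id, bike_id, trip_date, trip_starttime, trip_duration)
trips = [
    (1, 1, "2018-12-01", "12:00:00", 10),
    (2, 2, "2018-12-01", "14:00:00", 25),
    (3, 1, "2018-12-01", "14:30:00", 5),
    (4, 3, "2018-12-02", "05:00:00", 12),
    (5, 3, "2018-12-02", "19:00:00", 37),
    (6, 1, "2018-12-02", "20:30:00", 20),
]

def first_last_durations(trips):
    part = lambda t: (t[2], t[1])                 # partition by trip_date, bike_id
    ordered = sorted(trips, key=lambda t: (part(t), t[3]))  # order by trip_starttime
    result = []
    for (trip_date, bike_id), grp in groupby(ordered, key=part):
        grp = list(grp)
        # first row = earliest trip, last row = latest trip in the partition
        result.append((trip_date, bike_id, grp[0][4], grp[-1][4]))
    return result

summary = first_last_durations(trips)
```

Each partition yields exactly one output row, which is the effect DISTINCT has in the SQL version.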

How to get ' COUNT DISTINCT' over moving window

I'm working on a query to compute the distinct users of particular features of an app within a moving window. So, if there's a range from 15-20th October, I want a query to go from 8-15 Oct, 9-16 Oct etc and get the count of distinct users per feature. So for each date, it should have x rows where x is the number of features.
I have the following query so far:
WITH V1(edate, code, total) AS
(
SELECT date, featurecode,
DENSE_RANK() OVER (PARTITION BY featurecode ORDER BY accountid ASC) + DENSE_RANK() OVER (PARTITION BY featurecode ORDER BY accountid DESC) - 1
FROM....
GROUP BY edate, featurecode, appcode, accountid
HAVING appcode='sample' AND eventdate BETWEEN '15-10-2018' And '20-10-2018'
)
Select distinct date, code, total
from V1
WHERE date between '2018-10-15' AND '2018-10-20'
This returns the same set of values for all the dates. Is there any way to do this efficiently?? It's a DB2 database by the way but I'm looking for insight from postgresql users too.
Present result- All the totals are being repeated.
date code total
10/15/2018 appname-feature1 123
10/15/2018 appname-feature2 234
10/15/2018 appname-feature3 321
10/16/2018 appname-feature1 123
10/16/2018 appname-feature2 234
10/16/2018 appname-feature3 321
Desired result.
date code total
10/15/2018 appname-feature1 123
10/15/2018 appname-feature2 234
10/15/2018 appname-feature3 321
10/16/2018 appname-feature1 212
10/16/2018 appname-feature2 577
10/16/2018 appname-feature3 2345
This is not easy to do efficiently. DISTINCT counts aren't incrementally maintainable (unless you go down the route of inexact DISTINCT counts such as HyperLogLog).
It is easy to code in SQL, though, and you can try the usual indexing etc. to help.
It is (possibly) not possible, however, to code with OLAP functions, not least because you can only use RANGE BETWEEN for SUM(), COUNT(), MAX() etc., but not RANK() or DENSE_RANK(). So just use a traditional correlated sub-select.
First some data
CREATE TABLE T(D DATE,F CHAR(1),A CHAR(1));
INSERT INTO T (VALUES
('2018-10-10','X','A')
, ('2018-10-11','X','A')
, ('2018-10-15','X','A')
, ('2018-10-15','X','A')
, ('2018-10-15','X','B')
, ('2018-10-15','Y','A')
, ('2018-10-16','X','C')
, ('2018-10-18','X','A')
, ('2018-10-21','X','B')
)
;
Now a simple select
WITH B AS (
SELECT DISTINCT D, F FROM T
)
SELECT D,F
, (SELECT COUNT(DISTINCT A)
FROM T
WHERE T.F = B.F
AND T.D BETWEEN B.D - 3 DAYS AND B.D + 4 DAYS
) AS DISTINCT_A_MOVING_WEEK
FROM
B
ORDER BY F,D
;
giving, e.g.
D F DISTINCT_A_MOVING_WEEK
---------- - ----------------------
2018-10-10 X 1
2018-10-11 X 2
2018-10-15 X 3
2018-10-16 X 3
2018-10-18 X 3
2018-10-21 X 2
2018-10-15 Y 1
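The correlated sub-select can be mirrored in a few lines of Python, which may help when checking the window bounds. This is only a sketch: the function name distinct_moving and its parameters are made up for illustration, and the rows are the sample data inserted into T above. Like the query, it rescans the rows for every (D, F) pair, so it shares the same quadratic cost.

```python
from datetime import date, timedelta

# (D, F, A) rows, same as the sample table T
rows = [
    (date(2018, 10, 10), "X", "A"),
    (date(2018, 10, 11), "X", "A"),
    (date(2018, 10, 15), "X", "A"),
    (date(2018, 10, 15), "X", "A"),
    (date(2018, 10, 15), "X", "B"),
    (date(2018, 10, 15), "Y", "A"),
    (date(2018, 10, 16), "X", "C"),
    (date(2018, 10, 18), "X", "A"),
    (date(2018, 10, 21), "X", "B"),
]

def distinct_moving(rows, back=3, ahead=4):
    pairs = {(d, f) for d, f, _ in rows}          # SELECT DISTINCT D, F
    result = []
    for d, f in sorted(pairs, key=lambda p: (p[1], p[0])):   # ORDER BY F, D
        lo, hi = d - timedelta(days=back), d + timedelta(days=ahead)
        # COUNT(DISTINCT A) over the rows of the same feature inside the window
        users = {a for d2, f2, a in rows if f2 == f and lo <= d2 <= hi}
        result.append((d, f, len(users)))
    return result

moving = distinct_moving(rows)
```

Changing back and ahead adjusts the window, for example back=7, ahead=0 for a trailing week.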

SQL Query - Design struggle

I am fairly new to SQL Server (2012) but I was assigned the project where I have to use it.
The database consists of one table (counted in millions of rows) which looks mainly like this:
Number (float) Date (datetime) Status (nvarchar(255))
999 2016-01-01 14:00:00.000 Error
999 2016-01-02 14:00:00.000 Error
999 2016-01-03 14:00:00.000 Ok
999 2016-01-04 14:00:00.000 Error
888 2016-01-01 14:00:00.000 Error
888 2016-01-02 14:00:00.000 Ok
888 2016-01-03 14:00:00.000 Error
888 2016-01-04 14:00:00.000 Error
777 2016-01-01 14:00:00.000 Error
777 2016-01-02 14:00:00.000 Error
I have to create a query which will show me only the phone numbers (one number per row so probably Group by number?) that meet the conditions:
Number reappears at least 3 times
The last two occurrences (based on date; the records are not originally sorted by date) both have to be an Error
For example, in the table above the only phone number that meets the criteria is 888, because for 999 the 2nd newest status is Ok and number 777 occurs only 2 times.
I will appreciate any kind of help!
Thanks in advance!
You can use row_number() and conditional aggregation:
select number
from (select t.*,
             row_number() over (partition by number order by date desc) as seqnum
      from t
     ) t
group by number
having count(*) >= 3 and
       max(case when seqnum = 1 then status end) = 'Error' and
       max(case when seqnum = 2 then status end) = 'Error';
Note: float is a really, really bad type to use for the "number" column. In particular, two numbers can look the same but differ in low-order bits. They will produce different rows in the group by.
You should probably use varchar() for telephone numbers. That gives you the most flexibility. If you need to store the number as a number, then decimal/numeric is a much, much better choice than float.
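For readers who want to check the row_number() plus conditional aggregation logic by hand, here is a rough Python equivalent. The function name numbers_with_recent_errors and its parameters are hypothetical; the records are the sample rows from the question, with the numbers written as integers and the dates shortened to ISO strings.

```python
from collections import defaultdict

# (number, date, status) rows from the question
records = [
    (999, "2016-01-01", "Error"), (999, "2016-01-02", "Error"),
    (999, "2016-01-03", "Ok"),    (999, "2016-01-04", "Error"),
    (888, "2016-01-01", "Error"), (888, "2016-01-02", "Ok"),
    (888, "2016-01-03", "Error"), (888, "2016-01-04", "Error"),
    (777, "2016-01-01", "Error"), (777, "2016-01-02", "Error"),
]

def numbers_with_recent_errors(records, min_rows=3, last_n=2):
    by_number = defaultdict(list)
    for number, when, status in records:
        by_number[number].append((when, status))
    hits = []
    for number, events in by_number.items():
        events.sort(reverse=True)            # newest first, like ORDER BY date DESC
        recent = events[:last_n]             # rows with seqnum 1..last_n
        if len(events) >= min_rows and all(s == "Error" for _, s in recent):
            hits.append(number)
    return sorted(hits)

matching = numbers_with_recent_errors(records)
```

On the sample data only 888 qualifies: 999 has an Ok as its second-newest status, and 777 has only two rows.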
select ABC.Number
from
(
    select Number, Date, Status,
           ROW_NUMBER() OVER (partition by Number order by Date desc) as times
    from MyTable
    where Number in
    (
        select Number
        from MyTable
        group by Number
        having count(*) >= 3
    )
) as ABC
where ABC.times in (1, 2) and ABC.Status = 'Error'
group by ABC.Number
having count(*) = 2  -- both of the two newest rows must be 'Error'
with CTE as
(
    select t1.*, row_number() over(partition by t1.Number order by t1.date desc) as r_ord
    from MyTable t1
)
select C1.Number
from CTE C1
inner join
(
    select Number
    from CTE
    group by Number
    having max(r_ord) >= 3
) C2
on C1.Number = C2.Number
where C1.r_ord in (1, 2)
and C1.Status = 'Error'
group by C1.Number
having count(*) = 2  -- both of the two newest rows must be 'Error'

SQL to capture time periods that a certain condition exists

I have a table that captures daily data on users.
I want to pull the start and end dates for users when IS_AWESOME = 'Y'
I do not know how to do this using SQL
USER_ID DATE IS_AWESOME
123 2017-01-01 Y
123 2017-01-02 Y
123 2017-01-03 Y
123 2017-01-04 N
123 2017-01-05 Y
123 2017-01-06 Y
123 2017-01-07 Y
123 2017-01-08 N
123 2017-01-09 Y
123 2017-01-10 Y
123 2017-01-11 N
If I use MIN(DATE) and MAX(DATE) I will not get the intervals between those two dates.
A typical way to do this uses a difference of row_number()s (row_number() is an ANSI-standard function supported by most databases):
select user_id, min(date), max(date)
from (select t.*,
             row_number() over (partition by user_id order by date) as seqnum_u,
             row_number() over (partition by user_id, is_awesome order by date) as seqnum_uia
      from t
     ) t
where is_awesome = 'Y'
group by user_id, is_awesome, (seqnum_u - seqnum_uia);
Explaining how this works is a bit tricky. If you run the subquery, you will see how the difference of the row numbers defines each group of sequential values.
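One way to see the trick is to compute the two row numbers by hand. The Python sketch below does that for the sample data; the function name awesome_periods is made up for illustration. Within a run of consecutive 'Y' rows both counters advance together, so their difference stays constant, and every 'N' row in between bumps only the per-user counter, which shifts the difference for the next run.

```python
from collections import defaultdict

# (user_id, date, is_awesome) rows from the question
days = [
    (123, "2017-01-01", "Y"), (123, "2017-01-02", "Y"), (123, "2017-01-03", "Y"),
    (123, "2017-01-04", "N"), (123, "2017-01-05", "Y"), (123, "2017-01-06", "Y"),
    (123, "2017-01-07", "Y"), (123, "2017-01-08", "N"), (123, "2017-01-09", "Y"),
    (123, "2017-01-10", "Y"), (123, "2017-01-11", "N"),
]

def awesome_periods(rows):
    seq_u = defaultdict(int)      # row_number() over (partition by user_id)
    seq_uia = defaultdict(int)    # row_number() over (partition by user_id, is_awesome)
    groups = defaultdict(list)
    for user, day, flag in sorted(rows):          # order by date within each user
        seq_u[user] += 1
        seq_uia[user, flag] += 1
        if flag == "Y":
            # the difference is constant within one run of consecutive 'Y' rows
            groups[user, seq_u[user] - seq_uia[user, flag]].append(day)
    return sorted((user, min(ds), max(ds)) for (user, _), ds in groups.items())

periods = awesome_periods(days)
```

For user 123 the differences for the three 'Y' runs come out as 0, 1 and 2, which is what separates them into three groups.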

Get MAX count but keep the repeated calculated value if highest

I have the following table, I am using SQL Server 2008
BayNo FixDateTime FixType
1 04/05/2015 16:15:00 tyre change
1 12/05/2015 00:15:00 oil change
1 12/05/2015 08:15:00 engine tuning
1 04/05/2016 08:11:00 car tuning
2 13/05/2015 19:30:00 puncture
2 14/05/2015 08:00:00 light repair
2 15/05/2015 10:30:00 super op
2 20/05/2015 12:30:00 wiper change
2 12/05/2016 09:30:00 denting
2 12/05/2016 10:30:00 wiper repair
2 12/06/2016 10:30:00 exhaust repair
4 12/05/2016 05:30:00 stereo unlock
4 17/05/2016 15:05:00 door handle repair
For each bay number, I need to find the day with the highest number of fixes; if that highest count occurs on more than one day, each of those days should also appear in the result set.
So I would like to see the result set as follows:
BayNo FixDateTime noOfFixes
1 12/05/2015 00:15:00 2
2 12/05/2016 09:30:00 2
4 12/05/2016 05:30:00 1
4 17/05/2016 15:05:00 1
I managed to get the counts for each, but I am struggling to get the max while keeping the repeated highest value. Can someone help, please?
Use window functions.
Get the count for each day by bayno and also find the min fixdatetime for each day per bayno.
Then use dense_rank to compute the highest ranked row for each bayno based on the number of fixes.
Finally get the highest ranked rows.
select distinct bayno, minfixdatetime, no_of_fixes
from (
    select bayno, minfixdatetime, no_of_fixes
         , dense_rank() over(partition by bayno order by no_of_fixes desc) rnk
    from (
        select t.*,
               count(*) over(partition by bayno, cast(fixdatetime as date)) no_of_fixes,
               min(fixdatetime) over(partition by bayno, cast(fixdatetime as date)) minfixdatetime
        from tablename t
    ) x
) y
where rnk = 1
You are looking for rank() or dense_rank(). I would write the query like this:
select bayno, thedate, numFixes
from (select bayno, cast(fixdatetime as date) as thedate,
             count(*) as numFixes,
             rank() over (partition by bayno order by count(*) desc) as seqnum
      from t
      group by bayno, cast(fixdatetime as date)
     ) b
where seqnum = 1;
Note that this returns the date in question. The date does not have a time component.
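The count, rank and filter steps can also be traced in plain Python, which makes the tie handling easy to verify. This is a sketch only: the function name busiest_days is an assumption, the rows are the sample data from the question, and the timestamps are parsed as dd/mm/yyyy (values such as 13/05 and 20/05 rule out mm/dd). Like the first query, it reports the earliest fix time of each winning day.

```python
from collections import Counter, defaultdict
from datetime import datetime

# (BayNo, FixDateTime) rows from the question, dates in dd/mm/yyyy
fixes = [
    (1, "04/05/2015 16:15:00"), (1, "12/05/2015 00:15:00"),
    (1, "12/05/2015 08:15:00"), (1, "04/05/2016 08:11:00"),
    (2, "13/05/2015 19:30:00"), (2, "14/05/2015 08:00:00"),
    (2, "15/05/2015 10:30:00"), (2, "20/05/2015 12:30:00"),
    (2, "12/05/2016 09:30:00"), (2, "12/05/2016 10:30:00"),
    (2, "12/06/2016 10:30:00"), (4, "12/05/2016 05:30:00"),
    (4, "17/05/2016 15:05:00"),
]

def busiest_days(fixes):
    counts = Counter()     # fixes per (bay, day)
    first_fix = {}         # earliest fix time per (bay, day)
    for bay, stamp in fixes:
        ts = datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
        key = (bay, ts.date())
        counts[key] += 1
        first_fix[key] = min(ts, first_fix.get(key, ts))
    best = defaultdict(int)                # highest daily count per bay
    for (bay, _), n in counts.items():
        best[bay] = max(best[bay], n)
    # keep every day that ties for the top count, like rank() = 1
    return sorted((bay, first_fix[bay, day], n)
                  for (bay, day), n in counts.items() if n == best[bay])

top = busiest_days(fixes)
```

Bay 4 contributes two rows because both of its days tie at one fix each.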