I need a count of active IDs for every date, where an ID counts as active on each day between its date_active and date_end. When the date ranges of several IDs overlap, the counts add up. Here is the data I have right now:
TABLE CONTRACT:
ID DATE_ACTIVE DATE_END
1 05-FEB-13 08-NOV-13
1 21-DEC-18 06-OCT-19
2 05-FEB-13 27-JAN-14
3 05-FEB-13 07-NOV-13
4 06-FEB-13 02-NOV-13
4 25-OCT-14 13-APR-16
TABLE CALENDAR:
DT
05-FEB-13
06-FEB-13
07-FEB-13
08-FEB-13
09-FEB-13
..-DEC-19
What I want as output is basically this:
DT COUNT(ID)
05-FEB-13 3
06-FEB-13 4
07-FEB-13 4
08-FEB-13 4
09-FEB-13 4
10-FEB-13 4
....
03-NOV-13 3
....
08-NOV-13 2
09-NOV-13 1
....
28-JAN-14 0
....
25-OCT-14 1
....
13-APR-16 1
14-APR-16 0
....
21-DEC-18 1
....
06-OCT-19 1
07-OCT-19 0
....
....
And here is my query to get that result:
with contract as (
select * from contract
where id in ('1','2','3','4')
)
,
cal as
(
select TRUNC (SYSDATE - ROWNUM) dt
from dual
connect by rownum < sysdate - to_date('05-FEB-13')
)
select aa.dt,count(distinct bb.id)id from cal aa
left join contract bb on aa.dt >= bb.date_active and aa.dt<= bb.date_end
group by aa.dt
order by 1
The problem is that I have about 6 million IDs, and with this kind of query the result may take forever. I'm having a hard time figuring out how to get the same result with a different query. I would be grateful if somebody could help me out. Thank you so much.
If you group your events by date_active and date_end, you will get the numbers of events which have started and ended on each separate day.
Not many days have passed between 2013 and 2019 (about 2,500), so the grouped result sets will be relatively short.
Now that you have the two groups, you can notice that the number of events on each given date is the number of events which have started on or before this date, minus the number of events which have finished on or before this date (I'm assuming the end dates are non-inclusive).
In other words, the number of events on every given day is:
The number of events on the previous date,
plus the number of events started on this date,
minus the number of events ended on this date.
This can be easily done using a window function.
This will require a join between the calendar table and the two groups, but fortunately all of them are relatively short (thousands of records) and the join would be fast.
Here's the query: http://sqlfiddle.com/#!4/b21ce/5
WITH cal AS
(
SELECT TRUNC (to_date('01-NOV-13') - ROWNUM) dt
FROM dual
CONNECT BY
rownum < to_date('01-NOV-13')- to_date('01-FEB-13')
),
started_on AS
(
SELECT date_active AS dt, COUNT(*) AS cnt_start
FROM contract
GROUP BY
date_active
),
ended_on AS
(
SELECT date_end AS dt, COUNT(*) AS cnt_end
FROM contract
GROUP BY
date_end
)
SELECT dt,
SUM(COALESCE(cnt_start, 0) - COALESCE(cnt_end, 0)) OVER (ORDER BY dt) cnt
FROM cal c
LEFT JOIN
started_on s
USING (dt)
LEFT JOIN
ended_on e
USING (dt)
(I used a fixed date instead of SYSDATE to keep the resultset short, but the idea is the same)
This query requires that the calendar starts before the earliest event, otherwise every result will be off by a fixed amount, the number of events before the beginning of the calendar.
You can replace the fixed date in the calendar condition with (SELECT MIN(date_active) FROM contract) which is instant if date_active is indexed.
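The running-sum idea can also be checked outside the database. The following is a minimal Python sketch, not part of the original answer: the function name daily_active_counts and the end_inclusive flag are made up for illustration. With end_inclusive=True the decrement is moved to the day after date_end, which appears to be what the expected output in the question needs.

```python
from collections import Counter
from datetime import date, timedelta
from itertools import accumulate


def daily_active_counts(contracts, first_day, last_day, end_inclusive=False):
    """Count active contracts per day with a running sum of +1/-1 deltas.

    Assumes first_day is on or before the earliest start date, the same
    requirement the SQL query has for its calendar.
    """
    shift = timedelta(days=1) if end_inclusive else timedelta(0)
    delta = Counter()
    for start, end in contracts:
        delta[start] += 1          # one more active contract from this day on
        delta[end + shift] -= 1    # one fewer from this day on
    n_days = (last_day - first_day).days + 1
    days = [first_day + timedelta(days=i) for i in range(n_days)]
    # Reading a missing key from a Counter yields 0, so quiet days add nothing.
    running = accumulate(delta[d] for d in days)
    return dict(zip(days, running))


# Sample rows from the question as (date_active, date_end) pairs.
contracts = [
    (date(2013, 2, 5), date(2013, 11, 8)),
    (date(2018, 12, 21), date(2019, 10, 6)),
    (date(2013, 2, 5), date(2014, 1, 27)),
    (date(2013, 2, 5), date(2013, 11, 7)),
    (date(2013, 2, 6), date(2013, 11, 2)),
    (date(2014, 10, 25), date(2016, 4, 13)),
]
counts = daily_active_counts(
    contracts, date(2013, 2, 5), date(2019, 12, 31), end_inclusive=True
)
```

The work is one pass over the contracts plus one pass over the calendar, which is why the approach stays cheap even when the contract table is large.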
Update:
If contracts for the same ID can overlap and you want to collapse the overlapping ones into one continuous contract, you can use window functions to do so.
WITH cal AS
(
SELECT TRUNC (to_date('01-NOV-13') - ROWNUM) dt
FROM dual
CONNECT BY
rownum <= to_date('01-NOV-13')- to_date('01-FEB-13')
),
collapsed_contract AS
(
SELECT *
FROM (
SELECT c.*,
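-- start no earlier than date_active, so a gap between two contracts of the same id is not counted as active
GREATEST(date_active, COALESCE(LAG(date_end_effective) OVER (PARTITION BY id ORDER BY date_active), date_active)) AS date_start_effective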
FROM (
SELECT c.*,
MAX(date_end) OVER (PARTITION BY id ORDER BY date_active) AS date_end_effective
FROM contract c
) c
) c
WHERE date_start_effective < date_end_effective
),
started_on AS
(
SELECT date_start_effective AS dt, COUNT(*) AS cnt_start
FROM collapsed_contract
GROUP BY
date_start_effective
),
ended_on AS
(
SELECT date_end_effective AS dt, COUNT(*) AS cnt_end
FROM collapsed_contract
GROUP BY
date_end_effective
)
SELECT dt,
SUM(COALESCE(cnt_start, 0) - COALESCE(cnt_end, 0)) OVER (ORDER BY dt) cnt
FROM cal c
LEFT JOIN
started_on s
USING (dt)
LEFT JOIN
ended_on e
USING (dt)
http://sqlfiddle.com/#!4/adeba/1
The query might seem bulky, but that's to make it more efficient, as all these window functions can be calculated in a single pass over the table.
Note however that this single pass relies on the table being sorted on (id, date_active) so an index on these two fields is crucial.
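To make the collapsing step easier to follow, here is a small Python sketch under stated assumptions. It is not a translation of the query: instead of trimming each contract to the part not already covered, it merges overlapping or touching ranges per ID, which should lead to the same daily counts. The name collapse_contracts and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date


def collapse_contracts(rows):
    """Merge overlapping or touching (start, end) ranges separately per id."""
    by_id = defaultdict(list)
    for cid, start, end in rows:
        by_id[cid].append((start, end))

    merged = []
    for cid, spans in by_id.items():
        spans.sort()  # same ordering the window functions rely on
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlap (or contact): extend the current range if needed.
                cur_end = max(cur_end, end)
            else:
                # Gap: close the current range and start a new one.
                merged.append((cid, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((cid, cur_start, cur_end))
    return merged


# Hypothetical rows: id 1 has two overlapping contracts and a later one.
rows = [
    (1, date(2013, 1, 1), date(2013, 1, 10)),
    (1, date(2013, 1, 5), date(2013, 1, 20)),
    (1, date(2013, 3, 1), date(2013, 3, 5)),
    (2, date(2013, 1, 1), date(2013, 1, 3)),
]
collapsed = collapse_contracts(rows)
```

After this step every ID contributes at most one active range to any given day, so the start/end counting can be applied unchanged.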
First, the row_number() over (order by id, date_active) analytic function generates a unique ID for every contract row. That ID is then used in the
connect by level <= ... and prior id = id clause, which expands each contract into one row per day it covers:
with t0 as
(
select row_number() over (order by id,date_active) as id, date_active, date_end
from contract
), t1 as
(
select date_active + level - 1 as dt
from t0
connect by level <= date_end - date_active + 1
and prior id = id
and prior sys_guid() is not null
)
select dt, count(*)
from t1
group by dt
order by dt
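For comparison, the expansion approach can be sketched in Python as follows. This is an illustration only (counts_by_expansion and the two sample ranges are made up); it treats date_end as inclusive, like the level <= date_end - date_active + 1 condition does.

```python
from collections import Counter
from datetime import date, timedelta


def counts_by_expansion(contracts):
    """Expand every contract into one entry per covered day, then count."""
    counts = Counter()
    for start, end in contracts:
        for offset in range((end - start).days + 1):  # end date included
            counts[start + timedelta(days=offset)] += 1
    return counts


# Two hypothetical, overlapping contracts.
expanded = counts_by_expansion([
    (date(2013, 2, 5), date(2013, 2, 7)),
    (date(2013, 2, 6), date(2013, 2, 8)),
])
```

Note that the amount of work grows with the total number of contract-days rather than with the number of contracts, and that days without any active contract do not appear in the result at all.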
Let's say I have hospital visits in the table TestData
I would like to know which patients have had a second hospital visit within 7 days of their first hospital visit.
How would I code this in SQL?
patient_id is stored as TEXT.
date_visit is also stored as TEXT, in the format MM/DD/YYYY.
patient_id date_visit
A123B29133 07/12/2011
A123B29133 07/14/2011
A123B29133 07/20/2011
A123B29134 12/05/2016
In the above table, patient A123B29133 fulfills the condition because they were seen on 07/14/2011, which is less than 7 days after 07/12/2011.
You can use a subquery with exists:
with to_d(id, v_date) as (
select patient_id, substr(date_visit, 7, 4)||'-'||substr(date_visit, 1, 2)||'-'||substr(date_visit, 4, 2) from visits
)
select t2.id from (select t1.id, min(t1.v_date) d1 from to_d t1 group by t1.id) t2
where exists (select 1 from to_d t3 where t3.id = t2.id and t3.v_date != t2.d1 and t3.v_date <= date(t2.d1, '+7 days'))
id
A123B29133
Your date column is not in YYYY-MM-DD, the format that SQLite's date functions expect, so the substr function is used to rebuild each date in that format. JulianDay then converts the dates to numeric values, which makes the 7-day comparison easy. The MIN window function identifies the first hospital visit date for each patient. The demo fiddle and samples show the query that transforms the data and its results, followed by the final query, which filters on your requirement, i.e. < 7 days. With this window-function approach you can also retrieve the visit_date and the number of days since the first visit if desired.
You may read more about sqlite date functions here.
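The effect of the format conversion can be tried with Python's built-in sqlite3 module. This is a small sketch, not the fiddle from the answer: it loads two of the sample rows into an in-memory table named visits (the table name used in the queries here) and compares julianday() on the raw text with julianday() on the rebuilt YYYY-MM-DD text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (patient_id TEXT, date_visit TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("A123B29133", "07/12/2011"), ("A123B29133", "07/14/2011")],
)

# julianday() cannot parse MM/DD/YYYY text, so the raw column yields NULL.
raw = conn.execute("SELECT julianday(date_visit) FROM visits").fetchone()[0]

# Rebuild the text as YYYY-MM-DD, then subtract the Julian day numbers.
iso = (
    "substr(date_visit, 7, 4) || '-' || "
    "substr(date_visit, 1, 2) || '-' || "
    "substr(date_visit, 4, 2)"
)
gap_days = conn.execute(
    f"SELECT max(julianday({iso})) - min(julianday({iso})) FROM visits"
).fetchone()[0]

print(raw, gap_days)
```

Only aggregate functions are used here, so the sketch does not depend on window-function support in the bundled SQLite version.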
Query #1
SELECT
patient_id,
visit_date,
JulianDay(visit_date) -
MIN(JulianDay(visit_date)) OVER (PARTITION BY patient_id)
as num_of_days_since_first_visit
FROM
(
SELECT
*,
(
substr(date_visit,7) || '-' ||
substr(date_visit,1,2) || '-' ||
substr(date_visit,4,2)
) as visit_date
FROM
visits
) v;
patient_id visit_date num_of_days_since_first_visit
A123B29133 2011-07-12 0
A123B29133 2011-07-14 2
A123B29133 2011-07-20 8
A123B29134 2016-12-05 0
Query #2
The below is your desired query, which uses the previous query as a CTE and applies the filter for visits less than 7 days. num_of_days <> 0 is applied to remove entries where the first date is also the date of the record.
WITH num_of_days_since_first_visit AS (
SELECT
patient_id,
visit_date,
JulianDay(visit_date) - MIN(JulianDay(visit_date)) OVER (PARTITION BY patient_id) num_of_days
FROM
(
SELECT
*,
(
substr(date_visit,7) || '-' ||
substr(date_visit,1,2) || '-' ||
substr(date_visit,4,2)
) as visit_date
FROM
visits
) v
)
SELECT DISTINCT
patient_id
FROM
num_of_days_since_first_visit
WHERE
num_of_days <> 0 AND num_of_days < 7;
patient_id
A123B29133
View on DB Fiddle
Let me know if this works for you.
I would like to know which patients have had a second hospital visit within 7 days of their first hospital visit.
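You can use lag(). Assuming date_visit holds dates in YYYY-MM-DD format (the MM/DD/YYYY text would have to be converted first), the following returns every visit that falls within 7 days of the same patient's previous visit: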
select t.*
from (select t.*,
lag(date_visit) over (partition by patient_id order by date_visit) as prev_date_visit
from t
) t
where prev_date_visit >= date(date_visit, '-7 day');
If you just want the patient_ids, you can use select distinct patient_id.
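The requirement as literally stated (second visit within 7 days of the first visit) can be written down in a few lines of Python, which may help when checking the SQL results. This is a sketch under assumptions: the function name is hypothetical, "within 7 days" is read as a gap of at most 7 days, and two visits on the same day count as a match, unlike in the queries above.

```python
from datetime import datetime, timedelta


def patients_with_quick_return(visits, window_days=7):
    """Return patients whose second visit is within window_days of the first."""
    by_patient = {}
    for patient_id, text in visits:
        visit_day = datetime.strptime(text, "%m/%d/%Y").date()
        by_patient.setdefault(patient_id, []).append(visit_day)

    result = []
    for patient_id, days in by_patient.items():
        days.sort()
        # days[0] is the first visit, days[1] the second (if there is one).
        if len(days) > 1 and days[1] - days[0] <= timedelta(days=window_days):
            result.append(patient_id)
    return result


# Sample rows from the question.
visits = [
    ("A123B29133", "07/12/2011"),
    ("A123B29133", "07/14/2011"),
    ("A123B29133", "07/20/2011"),
    ("A123B29134", "12/05/2016"),
]
matches = patients_with_quick_return(visits)
```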
We are trying to port some code to run on Amazon Redshift, but Redshift won't run the recursive CTE. Does any good soul know how to port this?
with tt as (
select t.*, row_number() over (partition by id order by time) as seqnum
from t
),
recursive cte as (
select t.*, time as grp_start
from tt
where seqnum = 1
union all
select tt.*,
(case when tt.time < cte.grp_start + interval '3 second'
then tt.time
else tt.grp_start
end)
from cte join
tt
on tt.seqnum = cte.seqnum + 1
)
select cte.*,
(case when grp_start = lag(grp_start) over (partition by id order by time)
then 0 else 1
end) as isValid
from cte;
Alternatively, different code that reproduces the logic below would work too.
The result is a binary flag:
It is 1 if it is the first known value of an ID
It is 1 if it is 3 seconds or more after the previous "1" of that ID
It is 0 if it is less than 3 seconds after the previous "1" of that ID
Note 1: this is not the difference in seconds from the previous record
Note 2: there are many IDs in the data set
Note 3: original dataset has ID and Date
Desired output:
https://i.stack.imgur.com/k4KUQ.png
Dataset poc:
http://www.sqlfiddle.com/#!15/41d4b
As of this writing, Redshift does support recursive CTEs: see documentation here
Things to note when creating a recursive CTE in Redshift:
start the query with: with recursive
column names must be declared for every recursive CTE
Consider the following example, which creates a list of dates using a recursive CTE:
with recursive
start_dt as (select current_date s_dt)
, end_dt as (select dateadd(day, 1000, current_date) e_dt)
-- the recursive cte, note declaration of the column `dt`
, dates (dt) as (
-- start at the start date
select s_dt dt from start_dt
union all
-- recursive lines
select dateadd(day, 1, dt)::date dt -- converted to date to avoid type mismatch
from dates
where dt < (select e_dt from end_dt) -- stop at the end date
)
select *
from dates
The below code could help you.
SELECT id, time, CASE WHEN sec_diff is null or prev_sec_diff - sec_diff > 3
then 1
else 0
end FROM (
select id, time, sec_diff, lag(sec_diff) over(
partition by id order by time asc
)
as prev_sec_diff
from (
select id, time, date_part('s', time - lag(time) over(
partition by id order by time asc
)
)
as sec_diff from hon
) x
) y
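The rule from the question depends on the previous flagged row rather than on the previous row, which is why it is hard to express with lag() alone. A minimal Python sketch of the rule as stated (1 for the first row of an ID, 1 when at least 3 seconds have passed since the last "1" of that ID, otherwise 0) could look like this. The function name flag_rows and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def flag_rows(rows, gap=timedelta(seconds=3)):
    """Flag each (id, time) row relative to the last flagged row of its id."""
    last_flagged = {}  # id -> time of the most recent row flagged 1
    out = []
    for row_id, ts in sorted(rows):
        anchor = last_flagged.get(row_id)
        if anchor is None or ts - anchor >= gap:
            last_flagged[row_id] = ts  # this row becomes the new reference
            out.append((row_id, ts, 1))
        else:
            out.append((row_id, ts, 0))
    return out


# Hypothetical data: one id with rows at 0, 1, 2, 3, 4 and 7 seconds.
base = datetime(2020, 1, 1, 12, 0, 0)
rows = [(1, base + timedelta(seconds=s)) for s in (0, 1, 2, 3, 4, 7)]
flags = [flag for _, _, flag in flag_rows(rows)]
```

In this sample the consecutive gaps are 1, 1, 1, 1 and 3 seconds, so a comparison against the previous row alone would flag only the last row, while the rule above also flags the row at 3 seconds.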
Consider the following data with 4 persons:
ID Date (YMD)
1 2014-12-30
2 2014-12-30
3 2014-12-30
4 2014-12-30
1 2014-12-31
2 2014-12-31
3 2015-01-01
1 2015-01-01
3 2015-01-02
1 2015-01-02
3 2015-01-03
1 2015-01-03
4 2015-01-03
Now what I would like to do is detecting changes in the group of ID's per day. Initially when I thought about it, it was a relatively easy problem, but it is extremely difficult, because:
At 2014-12-30, we see that there are 4 persons.
At 2014-12-31 it should also be 4 persons, because the person with ID=3 and ID=4 don't do a transaction, but we can detect their activity later in the data, meaning that they are still in the sample.
At 2015-01-01 there are only 3 people, ID=1, ID=3, ID=4. ID=2 doesn't do anything anymore in the rest of the data.
At 2015-01-02 there are 3 people.
At 2015-01-03 there are still 3 people.
So I want the SQL to return the dates: 2014-12-30 to 2014-12-31, 2015-01-01 to 2015-01-03.
This is extremely difficult in my humble opinion and I have no idea how to solve this. Can TSQL even deal with these kind of issues?
Thanks!
This works in SQL Server 2008: SQL Fiddle
I can't tell you how efficient it is with your data size, but it shouldn't have any problems.
WITH dateGroup(gDate)
AS (
-- SEE HOW MANY DIFFERENT DATES ARE THERE
SELECT DISTINCT DATE
FROM [dbo].[testData]
), userActivity (id, dBegin, dEnd)
AS (
-- SEE THE ACTIVITY WINDOW FOR EACH USER
SELECT ID, MIN(DATE), MAX(DATE)
FROM [dbo].[testData]
GROUP BY ID
), rangeDate ( gDate, users)
AS (
-- SEE WHICH USERS ARE ACTIVE ON EACH DATE
SELECT *
FROM dateGroup as p OUTER APPLY
(SELECT STUFF(( SELECT ';' + CAST(a.id AS VARCHAR(10) )
FROM userActivity AS a
WHERE p.gDate BETWEEN a.dBegin AND a.dEnd
ORDER BY a.id
FOR XML PATH('') ), 1,1,'') AS users ) AS f
), activityWindow (users)
AS (
-- DETECT WHEN THE ACTIVE GROUP CHANGE
SELECT distinct users
FROM rangeDate
)
-- SEE THE RANGE FOR EACH GROUP.
SELECT *
FROM activityWindow as p OUTER APPLY
(SELECT STUFF(( SELECT ' ; ' + CAST(a.gDate AS VARCHAR(10) )
FROM rangeDate AS a
WHERE p.users = a.users
FOR XML PATH('') ), 1,3,'') AS activity_window ) AS f
You get not only the date range but also which users are active in that range; you can split the list on ;
You also see every day, so if there is no data on, say, a Sunday, you can see that too.
If you only want the beginning and end, split on ; and take the first and last date.
So, someone is in the data from their first appearance to the last. Here is one method with cumulative sums: SQL Fiddle
with persondates as (
select id, min(date) as dte, 1 as inc
from data
group by id
union all
select id, dateadd(day, 1, max(date)) as dte, -1 as inc
from data
group by id
)
select dte, min(cume) as actives
from (select dte, sum(inc) over (order by dte) as cume
from persondates
) d
group by dte
order by dte;
Try this:
with c as(
select min(d) as d from t group by id
union
select max(d) as d from t group by id),
u as(
select * from c
union all
select dateadd(dd, 1, d) from c
where d <> (select max(d) from c) and d <> (select min(d) from c)),
r as(select d, row_number() over(order by d) rn from u)
select r1.d, r2.d from r r1
join r r2 on r1.rn + 1 = r2.rn
where r2.rn % 2 = 0
If I am correct, the idea is to select the peak dates, i.e. the dates when someone is added or when it is someone's last day. That is done in the first CTE. The second CTE adds the day after each of those peak dates. The third CTE just numbers the rows so that the following join can pair them into intervals.
I am not completely sure if this is correct logic, but it works on provided test data http://sqlfiddle.com/#!3/2d7a6/6
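The definition used in the question (a person is in the sample from their first to their last appearance) can be checked with a short Python sketch. It is an illustration under that assumption, not a port of any of the queries above, and the name stable_group_ranges is hypothetical.

```python
from datetime import date
from itertools import groupby


def stable_group_ranges(rows):
    """Return (first_day, last_day) runs during which the active set is unchanged."""
    first, last = {}, {}
    for pid, day in rows:
        first[pid] = min(day, first.get(pid, day))
        last[pid] = max(day, last.get(pid, day))

    days = sorted({day for _, day in rows})
    # Active set per day: everyone whose first..last window covers that day.
    active = [
        (day, frozenset(p for p in first if first[p] <= day <= last[p]))
        for day in days
    ]
    ranges = []
    # groupby only merges consecutive days that share the same active set.
    for _, run in groupby(active, key=lambda item: item[1]):
        run = list(run)
        ranges.append((run[0][0], run[-1][0]))
    return ranges


# Sample rows from the question as (ID, date) pairs.
rows = [
    (1, date(2014, 12, 30)), (2, date(2014, 12, 30)), (3, date(2014, 12, 30)),
    (4, date(2014, 12, 30)), (1, date(2014, 12, 31)), (2, date(2014, 12, 31)),
    (3, date(2015, 1, 1)), (1, date(2015, 1, 1)), (3, date(2015, 1, 2)),
    (1, date(2015, 1, 2)), (3, date(2015, 1, 3)), (1, date(2015, 1, 3)),
    (4, date(2015, 1, 3)),
]
ranges = stable_group_ranges(rows)
```

On the sample data this gives the two ranges the question asks for: 2014-12-30 to 2014-12-31 and 2015-01-01 to 2015-01-03.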
I have a column of mostly continuous, unique dates in ascending order. Although the dates are mostly continuous, there are gaps: some are 3 days or less, others are longer than 3 days.
I need to create a table where each record has the start date and end date of a range whose internal gaps are all 3 days or less. A new record has to be started whenever a gap is longer than 3 days.
so if dates are:
1/2/2012
1/3/2012
1/4/2012
1/15/2012
1/16/2012
1/18/2012
1/19/2012
I need:
1/2/2012 1/4/2012
1/15/2012 1/19/2012
You can do something like this:
WITH CTE_Source AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY DT) RN
FROM dbo.Table1
)
,CTE_Recursion AS
(
SELECT *, 1 AS Grp
FROM CTE_Source
WHERE RN = 1
UNION ALL
SELECT src.*, CASE WHEN DATEADD(DD,3,rec.DT) < src.DT THEN rec.Grp + 1 ELSE Grp END AS Grp
FROM CTE_Source src
INNER JOIN CTE_Recursion rec ON src.RN = rec.RN +1
)
SELECT
MIN(DT) AS StartDT, MAX(DT) AS EndDT
FROM CTE_Recursion
GROUP BY Grp
First CTE is just to assign continuous numbers for all rows in order to join them later. Then using recursive CTE you can join on each next row assigning groups if date difference is larger than 3 days. In the end just group by grouping column and select desired results.
SQLFiddle DEMO
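The grouping rule itself is a single ordered pass over the dates, which the following Python sketch shows. It is only an illustration of the rule (a new range starts when the gap to the previous date exceeds 3 days); the function name group_dates is made up.

```python
from datetime import date, timedelta


def group_dates(dates, max_gap_days=3):
    """Group sorted dates into ranges, starting a new one after a long gap."""
    ranges = []
    for d in sorted(dates):
        if ranges and d - ranges[-1][1] <= timedelta(days=max_gap_days):
            ranges[-1][1] = d       # gap is small enough: extend current range
        else:
            ranges.append([d, d])   # first date, or gap too long: new range
    return [tuple(r) for r in ranges]


# Sample dates from the question.
dates = [
    date(2012, 1, 2), date(2012, 1, 3), date(2012, 1, 4),
    date(2012, 1, 15), date(2012, 1, 16), date(2012, 1, 18), date(2012, 1, 19),
]
grouped = group_dates(dates)
```

Each date is compared with the previous date in the range, which mirrors the DATEADD(DD,3,rec.DT) < src.DT test in the recursive CTE.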
I have this query
--Retention by DOC,Users created >= Jan 1,2012--
Select
One.Date_Of_Concern,
Two.Users,
One.Retained,
Round(One.Retained/Two.Users,4) as Perc_Retained
From
(
Select
To_Date('2012-sep-09','yyyy-mon-dd')As Date_Of_Concern,
Count(P.Player_Id) As Retained
From Player P
Where
Trunc(P.Create_Dtime) >= To_Date('2012-Jan-01','yyyy-mon-dd')
And
(To_Date('2012-sep-09','yyyy-mon-dd')-Trunc(P.Init_Dtime))<=7
) One
Inner Join
(
Select
To_Date('2012-sep-09','yyyy-mon-dd')As Date_Of_Concern,
Count(P.Player_Id) As Users
From Player P
Where
Trunc(P.Create_Dtime) >= To_Date('2012-Jan-01','yyyy-mon-dd')
) Two On One.Date_Of_Concern = Two.Date_Of_Concern
Which Gives Me a Result of 1 Row:
Date_Of_Concern USERS RETAINED PERC_RETAINED
09-Sep-12 449773 78983 0.1756
I would like to improve this query by adding in some sort of date changing methodology. That way, I won't have to run the query each time for 09-sep-12, 10-sep-12, 11-sep-12, and so on. Instead, it will all show up in the same query, like this:
Date_Of_Concern USERS RETAINED PERC_RETAINED
09-Sep-12 449773 48783 0.1756
10-Sep-12 449773 46777 0.1600
11-Sep-12 440773 44852 0.1500
12-Sep-12 349773 42584 0.1400
Well, with the given information, I don't know whether you have a table you can join to bring in those dates. If you do not, you could try this:
We have to generate rows that reproduce the dates in sequence. But first, let's look at how to generate rows:
Generate 5 rows:
SELECT rownum
FROM dual
CONNECT BY LEVEL <= 5;
ROWNUM
----------
1
2
3
4
5
Now, applying this to reproduce a data source for your dates:
SELECT to_date('2012-sep-09','yyyy-mon-dd') + (rownum -1) as Date_Of_Concern
FROM dual
CONNECT BY LEVEL <= 5;
Date_Of_Concern
----------
2012-sep-09
2012-sep-10
2012-sep-11
2012-sep-12
2012-sep-13
Obviously you will need a start date. Additionally, the number 5 has to be replaced by the number of dates you need; it can be derived from a date range, like this:
SELECT to_date('2012-sep-09','yyyy-mon-dd') + (rownum -1) AS Date_Of_Concern
FROM dual
CONNECT BY LEVEL <= (to_date('2012-sep-20','yyyy-mon-dd') - to_date('2012-sep-09','yyyy-mon-dd')) + 1; -- + 1 so the end date is included
OK, now the final result would look like this:
SELECT both.Date_Of_Concern,
both.Retained,
both.Users,
Round(both.Retained/both.Users,4) as Perc_Retained
FROM (select Date_Of_Concern,
(Select Count(P.Player_Id) As Retained
From Player P
Where Trunc(P.Create_Dtime) >= To_Date('2012-Jan-01','yyyy-mon-dd')
And (Date_Of_Concern-Trunc(P.Init_Dtime))<=7) Retained,
(Select Count(P.Player_Id) As Users
From Player P
Where Trunc(P.Create_Dtime) >= To_Date('2012-Jan-01','yyyy-mon-dd')
) Users
from (SELECT to_date('2012-sep-09','yyyy-mon-dd') + (rownum -1) Date_Of_Concern
FROM dual
CONNECT BY LEVEL <= 5)) both
I have a feeling your query can be simplified a lot. Here is an attempt to list day by day from the beginning of 2012. Depends what kind of range you are looking for.
SELECT date_of_concern
,Running_Total_Users AS Users
,Running_Total_Retained As Retained
,ROUND(Running_Total_Retained / Running_Total_Users, 4) AS Perc_Retained
FROM ( SELECT date_of_concern
,SUM(Users) OVER( ORDER BY date_of_concern
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW ) AS Running_Total_Users
,SUM(Retained) OVER( ORDER BY date_of_concern
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW ) AS Running_Total_Retained
FROM ( SELECT TRUNC(Create_Dtime) date_of_concern
,COUNT(Player_Id) Users
,SUM( CASE WHEN (TRUNC(Create_Dtime) - TRUNC(Init_Dtime)) <= 7 THEN 1 ELSE 0 END ) AS Retained
FROM player
WHERE TRUNC(Create_Dtime) >= TO_DATE('2012-01-01', 'YYYY-MM-DD')
GROUP BY TRUNC(Create_Dtime)
)
)
The innermost query is an attempt to rewrite the posted query, counting from day 1 (Jan 1 2012). The next wrapper is supposed to compute running totals for each subsequent day. The final wrapper makes Perc_Retained possible. Completely untested, of course :)
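For checking results on a small extract, the per-date calculation from the original query can be mirrored in Python. This is a sketch under assumptions: players are represented as hypothetical (create_date, init_date) pairs, the function name retention_by_day is made up, and the retained condition is copied as written in the question (date of concern minus init date is at most 7 days).

```python
from datetime import date, timedelta


def retention_by_day(players, start, n_days, cohort_start, window_days=7):
    """Return (day, users, retained, perc_retained) for n_days consecutive days."""
    # Users: everyone created on or after cohort_start (same for every day).
    cohort = [init for created, init in players if created >= cohort_start]
    rows = []
    for i in range(n_days):
        day = start + timedelta(days=i)
        retained = sum(1 for init in cohort if (day - init).days <= window_days)
        perc = round(retained / len(cohort), 4) if cohort else None
        rows.append((day, len(cohort), retained, perc))
    return rows


# Hypothetical players: (create date, init date).
players = [
    (date(2012, 1, 5), date(2012, 9, 8)),
    (date(2012, 2, 1), date(2012, 8, 1)),
    (date(2011, 12, 31), date(2012, 9, 9)),  # created before the cohort start
    (date(2012, 3, 1), date(2012, 9, 3)),
]
report = retention_by_day(players, date(2012, 9, 9), 3, date(2012, 1, 1))
```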