I need a count of IDs per date, where an ID is counted on every date between its date_active and date_end. If the date ranges of several IDs overlap, the count adds up. Here is the data I have right now:
TABLE CONTRACT:
ID DATE_ACTIVE DATE_END
1 05-FEB-13 08-NOV-13
1 21-DEC-18 06-OCT-19
2 05-FEB-13 27-JAN-14
3 05-FEB-13 07-NOV-13
4 06-FEB-13 02-NOV-13
4 25-OCT-14 13-APR-16
TABLE CALENDAR:
DT
05-FEB-13
06-FEB-13
07-FEB-13
08-FEB-13
09-FEB-13
..-DEC-19
What I want as output is basically this:
DT COUNT(ID)
05-FEB-13 3
06-FEB-13 4
07-FEB-13 4
08-FEB-13 4
09-FEB-13 4
10-FEB-13 4
....
03-NOV-13 3
....
08-NOV-13 2
09-NOV-13 1
....
28-JAN-14 0
....
25-OCT-14 1
....
13-APR-16 1
14-APR-16 0
....
21-DEC-18 1
....
06-OCT-19 1
07-OCT-19 0
....
....
And here is my query to get that result:
with contract as (
select * from contract
where id in ('1','2','3','4')
)
,
cal as
(
select TRUNC (SYSDATE - ROWNUM) dt
from dual
connect by rownum < sysdate - to_date('05-FEB-13')
)
select aa.dt,count(distinct bb.id)id from cal aa
left join contract bb on aa.dt >= bb.date_active and aa.dt<= bb.date_end
group by aa.dt
order by 1
The problem is that I have 6 million IDs, and with this kind of query the result may take forever. I'm having a hard time figuring out how to get the result with a different query. I'd be grateful if somebody could help me out. Thank you so much.
If you group your events by date_active and date_end, you will get the numbers of events which have started and ended on each separate day.
Not many days passed between 2013 and 2019 (roughly 2,500), so the grouped result sets will be relatively short.
Now that you have the two groups, you can notice that the number of events on each given date is the number of events which have started on or before this date, minus the number of events which have finished on or before this date (I'm assuming the end dates are non-inclusive).
In other words, the number of events on every given day is:
The number of events on the previous date,
plus the number of events started on this date,
minus the number of events ended on this date.
This can be easily done using a window function.
This will require a join between the calendar table and the two groups, but fortunately all of them are relatively short (thousands of records) and the join would be fast.
Here's the query: http://sqlfiddle.com/#!4/b21ce/5
WITH cal AS
(
SELECT TRUNC (to_date('01-NOV-13') - ROWNUM) dt
FROM dual
CONNECT BY
rownum < to_date('01-NOV-13')- to_date('01-FEB-13')
),
started_on AS
(
SELECT date_active AS dt, COUNT(*) AS cnt_start
FROM contract
GROUP BY
date_active
),
ended_on AS
(
SELECT date_end AS dt, COUNT(*) AS cnt_end
FROM contract
GROUP BY
date_end
)
SELECT dt,
SUM(COALESCE(cnt_start, 0) - COALESCE(cnt_end, 0)) OVER (ORDER BY dt) cnt
FROM cal c
LEFT JOIN
started_on s
USING (dt)
LEFT JOIN
ended_on e
USING (dt)
(I used a fixed date instead of SYSDATE to keep the resultset short, but the idea is the same)
This query requires that the calendar starts before the earliest event, otherwise every result will be off by a fixed amount, the number of events before the beginning of the calendar.
You can replace the fixed date in the calendar condition with (SELECT MIN(date_active) FROM contract) which is instant if date_active is indexed.
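The running-count idea above can be cross-checked outside the database. The following is a minimal Python sketch, not the query itself: the function name `daily_active_counts` and the tuple layout are assumptions made for illustration, and end dates are treated as non-inclusive, as the answer assumes.

```python
from collections import Counter
from datetime import date, timedelta

def daily_active_counts(contracts, first_day, last_day):
    """Running count of open contracts per day (end dates non-inclusive)."""
    started = Counter(start for start, _ in contracts)  # plays the role of started_on
    ended = Counter(end for _, end in contracts)        # plays the role of ended_on
    counts = {}
    running = 0
    day = first_day
    while day <= last_day:
        # previous total + contracts started today - contracts ended today
        running += started[day] - ended[day]
        counts[day] = running
        day += timedelta(days=1)
    return counts

# First contract of each ID from the question's sample data.
contracts = [
    (date(2013, 2, 5), date(2013, 11, 8)),
    (date(2013, 2, 5), date(2014, 1, 27)),
    (date(2013, 2, 5), date(2013, 11, 7)),
    (date(2013, 2, 6), date(2013, 11, 2)),
]
counts = daily_active_counts(contracts, date(2013, 2, 1), date(2014, 2, 1))
```

As in the SQL, the calendar has to start on or before the earliest date_active; otherwise the running total misses the contracts opened earlier.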
Update:
If your contract dates can overlap and you want to collapse multiple overlapping contracts into one continuous contract, you can use window functions to do so.
WITH cal AS
(
SELECT TRUNC (to_date('01-NOV-13') - ROWNUM) dt
FROM dual
CONNECT BY
rownum <= to_date('01-NOV-13')- to_date('01-FEB-13')
),
collapsed_contract AS
(
SELECT *
FROM (
SELECT c.*,
COALESCE(LAG(date_end_effective) OVER (PARTITION BY id ORDER BY date_active), date_active) AS date_start_effective
FROM (
SELECT c.*,
MAX(date_end) OVER (PARTITION BY id ORDER BY date_active) AS date_end_effective
FROM contract c
) c
) c
WHERE date_start_effective < date_end_effective
),
started_on AS
(
SELECT date_start_effective AS dt, COUNT(*) AS cnt_start
FROM collapsed_contract
GROUP BY
date_start_effective
),
ended_on AS
(
SELECT date_end_effective AS dt, COUNT(*) AS cnt_end
FROM collapsed_contract
GROUP BY
date_end_effective
)
SELECT dt,
SUM(COALESCE(cnt_start, 0) - COALESCE(cnt_end, 0)) OVER (ORDER BY dt) cnt
FROM cal c
LEFT JOIN
started_on s
USING (dt)
LEFT JOIN
ended_on e
USING (dt)
http://sqlfiddle.com/#!4/adeba/1
The query might seem bulky, but that's to make it more efficient, as all these window functions can be calculated in a single pass over the table.
Note however that this single pass relies on the table being sorted on (id, date_active) so an index on these two fields is crucial.
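To illustrate what "collapsing" means, here is a plain interval merge in Python. It is a sketch under assumptions, not a line-by-line translation of the window-function query: `collapse_contracts` and the `(id, start, end)` tuple layout are hypothetical, and touching ranges are merged because end dates are taken as non-inclusive.

```python
from collections import defaultdict
from datetime import date

def collapse_contracts(rows):
    """Merge overlapping (id, start, end) rows into continuous spans per id."""
    by_id = defaultdict(list)
    for cid, start, end in rows:
        by_id[cid].append((start, end))

    merged = []
    for cid in sorted(by_id):
        spans = sorted(by_id[cid])  # same ordering as (id, date_active)
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # overlaps or touches the current span: extend it
                cur_end = max(cur_end, end)
            else:
                merged.append((cid, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((cid, cur_start, cur_end))
    return merged

# Hypothetical rows: ID 1 has two overlapping contracts and a later one.
rows = [
    (1, date(2013, 2, 5), date(2013, 6, 1)),
    (1, date(2013, 5, 1), date(2013, 11, 8)),
    (1, date(2018, 12, 21), date(2019, 10, 6)),
    (2, date(2013, 2, 5), date(2014, 1, 27)),
]
collapsed = collapse_contracts(rows)
```

The sort per id mirrors the reason an index on (id, date_active) matters for the SQL version.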
First, the row_number() over (order by id,date_active) analytic function generates a unique ID for each contract row. Those IDs are then used in the connect by level <= ... and prior id = id clause to expand each contract into one row per day:
with t0 as
(
select row_number() over (order by id,date_active) as id, date_active, date_end
from contract
), t1 as
(
select date_active + level - 1 as dt
from t0
connect by level <= date_end - date_active + 1
and prior id = id
and prior sys_guid() is not null
)
select dt, count(*)
from t1
group by dt
order by dt
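The same row-expansion approach, sketched in Python for comparison. The name `counts_by_expansion` is an assumption; end dates are inclusive here, matching the `date_end - date_active + 1` bound in the query, and days with no contract simply do not appear in the result.

```python
from collections import Counter
from datetime import date, timedelta

def counts_by_expansion(contracts):
    """Expand every (start, end) range into its days and count rows per day."""
    per_day = Counter()
    for start, end in contracts:
        for offset in range((end - start).days + 1):  # end date inclusive
            per_day[start + timedelta(days=offset)] += 1
    return per_day

# Sample data from the question.
contracts = [
    (date(2013, 2, 5), date(2013, 11, 8)),
    (date(2018, 12, 21), date(2019, 10, 6)),
    (date(2013, 2, 5), date(2014, 1, 27)),
    (date(2013, 2, 5), date(2013, 11, 7)),
    (date(2013, 2, 6), date(2013, 11, 2)),
    (date(2014, 10, 25), date(2016, 4, 13)),
]
per_day = counts_by_expansion(contracts)
```

This is easy to read, but the work grows with the total number of contract-days, which is worth keeping in mind for millions of IDs.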
I have user logins by date. My requirement is to track the number of users that have been logged in during the past 90 days window.
I am new to both SQL in general and Teradata specifically and I can't get the window functionality to work as I need.
I need the following result, where ACTIVE_IN_WINDOW is a count of the unique USER_IDs that appear in the 90-day window preceding each date in DATES.
DATES ACTIVE_IN_WINDOW
12/06/2018 20
13/06/2018 45
14/06/2018 65
15/06/2018 73
17/06/2018 24
18/06/2018 87
19/06/2018 34
20/06/2018 51
Currently my script is as follows.
It is this line that I can't get right:
COUNT ( USER_ID) OVER (PARTITION BY USER_ID ORDER BY EVT_DT ROWS BETWEEN 90 PRECEDING AND 0 FOLLOWING)
I suspect I need a different set of functions to make this work.
SELECT b.DATES , a.ACTIVE_IN_WINDOW
FROM
(
SELECT
CAST(CALENDAR_DATE AS DATE) AS DATES FROM SYS_CALENDAR.CALENDAR
WHERE DATES BETWEEN ADD_MONTHS(CURRENT_DATE, - 10) AND CURRENT_DATE
) b
LEFT JOIN
(
SELECT USER_ID , EVT_DT
, COUNT ( USER_ID) OVER (PARTITION BY USER_ID ORDER BY EVT_DT ROWS BETWEEN 90 PRECEDING AND 0 FOLLOWING) AS ACTIVE_IN_WINDOW
FROM ENV0.R_ONBOARDING
) a
ON a.EVT_DT = b.DATES
ORDER BY b.DATES
Thank you for any assistance.
The logic is similar to Gordon's, but a non-equi-join instead of a correlated scalar subquery is usually more efficient on Teradata:
SELECT b.DATES , Count(DISTINCT USER_ID)
FROM
(
SELECT CALENDAR_DATE AS DATES
FROM SYS_CALENDAR.CALENDAR
WHERE DATES BETWEEN Add_Months(Current_Date, - 10) AND Current_Date
) b
LEFT JOIN
( -- apply DISTINCT before aggregation to reduce intermediate spool
SELECT DISTINCT USER_ID, EVT_DT
FROM ENV0.R_ONBOARDING
) AS a
ON a.EVT_DT BETWEEN Add_Months(b.DATES,-3) AND b.DATES
GROUP BY 1
ORDER BY 1
Of course this will require a large spool and much CPU.
Edit:
Switching to weeks reduces the overhead. I'm using dates instead of week numbers because that is easier to modify for other ranges:
SELECT b.Week , Count(DISTINCT USER_ID)
FROM
( -- Return only Mondays instead of DISTINCT over all days
SELECT calendar_date AS Week
FROM SYS_CALENDAR.CALENDAR
WHERE CALENDAR_DATE BETWEEN Add_Months(Current_Date, -9) AND Current_Date
AND day_of_week = 2 -- 2 = Monday
) b
LEFT JOIN
(
SELECT DISTINCT USER_ID,
-- td_monday returns the previous Monday, but we need the following monday
-- covers the previous Tuesday up to the current Monday
Td_Monday(EVT_DT+6) AS PERIOD_WEEK
FROM ENV0.R_ONBOARDING
-- You should add another condition to limit the actually covered date range, e.g.
-- where EVT_DT BETWEEN Add_Months(b.DATES,-13) AND b.DATES
) AS a
ON a.PERIOD_WEEK BETWEEN b.Week-(12*7) AND b.Week
GROUP BY 1
ORDER BY 1
Explain should show the calendar being duplicated in preparation for the product join; if not, you might need to materialize the dates in a volatile table. It is better not to use sys_calendar: it has no statistics, so the optimizer doesn't know, for example, how many days there are per week, month, or year. Check your system; there should be a calendar table designed for your company's needs (with stats on all columns).
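The week bucketing in the query maps every event date to the Monday on or after it, so one bucket covers Tuesday through the following Monday. A small Python sketch of that mapping, assuming Td_Monday behaves as the query comments describe; `week_ending_monday` is a hypothetical name:

```python
from datetime import date, timedelta

def week_ending_monday(day):
    """Return the Monday on or after `day` (a Monday maps to itself)."""
    # date.weekday() is 0 for Monday, so this adds 0..6 days.
    return day + timedelta(days=(7 - day.weekday()) % 7)

monday = week_ending_monday(date(2018, 6, 18))     # a Monday
tuesday = week_ending_monday(date(2018, 6, 12))    # the Tuesday before it
next_week = week_ending_monday(date(2018, 6, 19))  # the Tuesday after it
```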
If your data is not too big, a subquery might be the simplest method:
SELECT c.dte,
(SELECT COUNT(DISTINCT o.USER_ID)
FROM ENV0.R_ONBOARDING o
WHERE o.EVT_DT > ADD_MONTHS(dte, -3) AND
o.EVT_DT <= dte
) as three_month_count
FROM (SELECT CAST(CALENDAR_DATE AS DATE) AS dte
FROM SYS_CALENDAR.CALENDAR
WHERE CALENDAR_DATE BETWEEN ADD_MONTHS(CURRENT_DATE, - 10) AND CURRENT_DATE
) c;
You might want to start with a shorter timeframe than 3 months to see how the query performs.
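For small test sets, the rolling distinct count that both answers compute can be checked with a brute-force Python sketch. It assumes a fixed 90-day window, as the question asks, rather than the 3 calendar months used in the SQL; `active_in_window` and the `(user_id, event_date)` tuple layout are hypothetical.

```python
from datetime import date, timedelta

def active_in_window(logins, days, window=90):
    """For each day, count distinct users with a login in the preceding window."""
    result = {}
    for day in days:
        window_start = day - timedelta(days=window)
        # same bounds as the subquery: evt_dt > start AND evt_dt <= day
        users = {user for user, evt in logins if window_start < evt <= day}
        result[day] = len(users)
    return result

# Hypothetical login events.
logins = [
    ("a", date(2018, 1, 1)),
    ("b", date(2018, 3, 1)),
    ("a", date(2018, 3, 15)),
    ("c", date(2018, 6, 1)),
]
days = [date(2018, 3, 20), date(2018, 7, 1), date(2018, 9, 15)]
active = active_in_window(logins, days)
```

Like the product join, this compares every day with every login, so it is only meant for validating results on a sample.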
I can get the desired output by using while loop but since original table has thousands of record, performance is very slow.
How can I get the desired results using Common Table Expression?
Thank You.
This will produce the desired results. Not as elegant as Gordon's, but it does allow for gaps in dates and duplicate dates.
If you have a Calendar/Tally Table, the cte logic can be removed.
Example
-- Table variables are declared with @, not #
Declare @YourTable Table ([AsOfDate] Date,[SecurityID] varchar(50),[IsHeld] bit)
Insert Into @YourTable Values
('2017-05-19','S1',1)
,('2017-05-20','S1',1)
,('2017-05-21','S1',1)
,('2017-05-22','S1',1)
,('2017-05-23','S1',0)
,('2017-05-24','S1',0)
,('2017-05-25','S1',0)
,('2017-05-26','S1',1)
,('2017-05-27','S1',1)
,('2017-05-28','S1',1)
,('2017-05-29','S1',0)
,('2017-05-30','S1',0)
,('2017-05-31','S1',1)
-- cte1: date bounds; cte2: ad hoc calendar with one numbered row (R) per day
;with cte1 as ( Select D1=min(AsOfDate),D2=max(AsOfDate) From @YourTable )
,cte2 as (
Select Top (DateDiff(DAY,(Select D1 from cte1),(Select D2 from cte1))+1)
D=DateAdd(DAY,-1+Row_Number() Over (Order By (Select Null)),(Select D1 from cte1))
,R=Row_Number() over (Order By (Select Null))
From master..spt_values n1,master..spt_values n2
)
Select [SecurityID]
,[StartDate] = min(D)
,[EndDate] = max(D)
From (
-- Grp stays constant while the held dates are consecutive
Select *,Grp = dense_rank() over (partition by securityId order by asofdate )-R
From @YourTable A
Join cte2 B on AsOfDate=B.D
Where IsHeld=1
) A
Group By [SecurityID],Grp
Order By min(D)
Returns
SecurityID StartDate EndDate
S1 2017-05-19 2017-05-22
S1 2017-05-26 2017-05-28
S1 2017-05-31 2017-05-31
This is a variant of the gaps-and-islands problem. In this case, you can use date arithmetic to calculate the rows with adjacent dates:
select securityId, isheld, min(asofdate), max(asofdate)
from (select t.*,
-- date minus row number is constant within a run of consecutive dates
dateadd(day,
- row_number() over (partition by securityId, isheld
order by asofdate
),
asofdate) as grp
from t
) t
group by grp, securityId, isheld;
Note: This assumes that the dates are contiguous and have no duplicates. The query can be modified to take those factors into account.
The basic idea is that if you have a sequence of days that are increasing one at a time, then you can subtract a sequence of values and get a constant. That is what grp is. The rest is just aggregation.
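The "subtract a sequence and get a constant" idea can be seen in a few lines of Python. This is a minimal sketch with a hypothetical function name `islands`; it deduplicates the dates first, and then runs of consecutive days share the same value of date minus position.

```python
from datetime import date, timedelta
from itertools import groupby

def islands(days):
    """Group dates into runs of consecutive days, returned as (start, end) pairs."""
    ordered = sorted(set(days))
    # day minus its position is identical for all members of one consecutive run
    keyed = [(day - timedelta(days=i), day) for i, day in enumerate(ordered)]
    result = []
    for _, group in groupby(keyed, key=lambda pair: pair[0]):
        members = [day for _, day in group]
        result.append((members[0], members[-1]))
    return result

# Dates with IsHeld = 1 from the example table above.
held = [date(2017, 5, d) for d in (19, 20, 21, 22, 26, 27, 28, 31)]
runs = islands(held)
```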
Consider the following data with 4 persons:
ID Date (YMD)
1 2014-12-30
2 2014-12-30
3 2014-12-30
4 2014-12-30
1 2014-12-31
2 2014-12-31
3 2015-01-01
1 2015-01-01
3 2015-01-02
1 2015-01-02
3 2015-01-03
1 2015-01-03
4 2015-01-03
Now what I would like to do is detect changes in the group of IDs per day. When I first thought about it, it seemed a relatively easy problem, but it is extremely difficult, because:
At 2014-12-30, we see that there are 4 persons.
At 2014-12-31 it should also be 4 persons, because the persons with ID=3 and ID=4 don't make a transaction, but we can detect their activity later in the data, meaning that they are still in the sample.
At 2015-01-01 there are only 3 people, ID=1, ID=3, ID=4. ID=2 doesn't do anything anymore in the rest of the data.
At 2015-01-02 there are 3 people.
At 2015-01-03 there are still 3 people.
So I want the SQL to return the dates: 2014-12-30 to 2014-12-31, 2015-01-01 to 2015-01-03.
This is extremely difficult in my humble opinion, and I have no idea how to solve it. Can T-SQL even deal with this kind of issue?
Thanks!
This works in SQL Server 2008 (SQL Fiddle).
I can't speak to efficiency at your data size, but it shouldn't be a problem.
WITH dateGroup(gDate)
AS (
-- SEE HOW MANY DIFFERENT DATES ARE THERE
SELECT DISTINCT DATE
FROM [dbo].[testData]
), userActivity (id, dBegin, dEnd)
AS (
-- SEE THE ACTIVITY WINDOW FOR EACH USER
SELECT ID, MIN(DATE), MAX(DATE)
FROM [dbo].[testData]
GROUP BY ID
), rangeDate ( gDate, users)
AS (
-- SEE WHICH USERS ARE ACTIVE ON EACH DATE
SELECT *
FROM dateGroup as p OUTER APPLY
(SELECT STUFF(( SELECT ';' + CAST(a.id AS VARCHAR(10) )
FROM userActivity AS a
WHERE p.gDate BETWEEN a.dBegin AND a.dEnd
ORDER BY a.id
FOR XML PATH('') ), 1,1,'') AS users ) AS f
), activityWindow (users)
AS (
-- DETECT WHEN THE ACTIVE GROUP CHANGE
SELECT distinct users
FROM rangeDate
)
-- SEE THE RANGE FOR EACH GROUP.
SELECT *
FROM activityWindow as p OUTER APPLY
(SELECT STUFF(( SELECT ' ; ' + CAST(a.gDate AS VARCHAR(10) )
FROM rangeDate AS a
WHERE p.users = a.users
FOR XML PATH('') ), 1,3,'') AS activity_window ) AS f
You get not only the date range but also which users are active in that range; you can split the list on ;
You also see every day, so if there is no data on a Sunday you will notice it.
If you only want the begin and end dates, split on ; and take the first and last date.
So, someone is in the data from their first appearance to the last. Here is one method with cumulative sums: SQL Fiddle
with persondates as (
select id, min(date) as dte, 1 as inc
from data
group by id
union all
select id, dateadd(day, 1, max(date)) as dte, -1 as inc
from data
group by id
)
select dte, min(cume) as actives
from (select dte, sum(inc) over (order by dte) as cume
from persondates
) d
group by dte
order by dte;
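A Python rendering of the same cumulative-sum idea, run on the question's data. This is a sketch, with `active_counts` as a hypothetical name: each person contributes +1 on their first date and -1 on the day after their last date, and the running total gives the group size from each change date onward.

```python
from collections import defaultdict
from datetime import date, timedelta
from itertools import accumulate

def active_counts(rows):
    """Return (change_date, active_people) pairs from (id, date) rows."""
    first, last = {}, {}
    for pid, day in rows:
        first[pid] = min(first.get(pid, day), day)
        last[pid] = max(last.get(pid, day), day)

    delta = defaultdict(int)
    for pid in first:
        delta[first[pid]] += 1                      # person enters the sample
        delta[last[pid] + timedelta(days=1)] -= 1   # person leaves the next day

    days = sorted(delta)
    totals = accumulate(delta[d] for d in days)
    return list(zip(days, totals))

rows = [
    (1, date(2014, 12, 30)), (2, date(2014, 12, 30)), (3, date(2014, 12, 30)),
    (4, date(2014, 12, 30)), (1, date(2014, 12, 31)), (2, date(2014, 12, 31)),
    (3, date(2015, 1, 1)), (1, date(2015, 1, 1)), (3, date(2015, 1, 2)),
    (1, date(2015, 1, 2)), (3, date(2015, 1, 3)), (1, date(2015, 1, 3)),
    (4, date(2015, 1, 3)),
]
changes = active_counts(rows)
```

Each returned date marks the start of a new range, so the requested ranges run from one change date to the day before the next.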
Try this:
with c as(
select min(d) as d from t group by id
union
select max(d) as d from t group by id),
u as(
select * from c
union all
select dateadd(dd, 1, d) from c
where d <> (select max(d) from c) and d <> (select min(d) from c)),
r as(select d, row_number() over(order by d) rn from u)
select r1.d, r2.d from r r1
join r r2 on r1.rn + 1 = r2.rn
where r2.rn % 2 = 0
If I am correct, the idea is to select peak dates, i.e. the dates when someone is added or when it is someone's last day. That is done in the first CTE. The second CTE adds the day after each of those peak dates. The third CTE just numbers the rows so that the join can pair them into intervals.
I am not completely sure if this is correct logic, but it works on provided test data http://sqlfiddle.com/#!3/2d7a6/6
I am using iReport 3.0.0, PostgreSQL 9.1. For a report I need to compare date ranges from invoices with date ranges in filters and print for every invoice code if a filter range is covered, partially covered, etc. To complicate things, there can be multiple date ranges per invoice code.
Table Invoices
ID Code StartDate EndDate
1 111 1.5.2012 31.5.2012
2 111 1.7.2012 20.7.2012
3 111 25.7.2012 31.7.2012
4 222 1.4.2012 15.4.2012
5 222 18.4.2012 30.4.2012
Examples
Filter: 1.5.2012. - 5.6.2012.
Result that I need to get is:
code 111 - partially covered
code 222 - invoice missing
Filter: 1.5.2012. - 31.5.2012.
code 111 - fully covered
code 222 - invoice missing
Filter: 1.6.2012. - 30.6.2012.
code 111 - invoice missing
code 222 - invoice missing
After clarification in comment.
Your task as I understand it:
Check for all supplied individual date ranges (filter) whether they are covered by the combined date ranges of sets of codes in your table (invoice).
It can be done with plain SQL, but it is not a trivial task. The steps could be:
Supply date ranges as filters.
Combine date ranges in invoice table per code.
Can result in one or more ranges per code.
Look for overlaps between filters and combined invoices
Classify: fully covered / partially covered.
Can result in one full coverage, one or two partial coverages or no coverage.
Reduce to maximum level of coverage.
Display one row for every combination of (filter, code) with the resulting coverage, in a sensible sort order
Ad hoc filter ranges
WITH filter(filter_id, startdate, enddate) AS (
VALUES
(1, '2012-05-01'::date, '2012-06-05'::date) -- list filters here.
,(2, '2012-05-01', '2012-05-31')
,(3, '2012-06-01', '2012-06-30')
)
SELECT * FROM filter;
Or put them in a (temporary) table and use the table instead.
Combine overlapping / adjacent date ranges per code
WITH a AS (
SELECT code, startdate, enddate
,max(enddate) OVER (PARTITION BY code ORDER BY startdate) AS max_end
-- Calculate the cumulative maximum end of the ranges sorted by start
FROM invoice
), b AS (
SELECT *
,CASE WHEN lag(max_end) OVER (PARTITION BY code
ORDER BY startdate) + 2 > startdate
-- Compare to the cumulative maximum end of the last row.
-- Only if there is a gap, start a new group. Therefore the + 2.
THEN 0 ELSE 1 END AS step
FROM a
), c AS (
SELECT code, startdate, enddate, max_end
,sum(step) OVER (PARTITION BY code ORDER BY startdate) AS grp
-- Members of the same date range end up in the same grp
-- If there is a gap, the grp number is incremented one step
FROM b
)
SELECT code, grp
,min(startdate) AS startdate
,max(enddate) AS enddate
FROM c
GROUP BY 1, 2
ORDER BY 1, 2
Alternative final SELECT (may be faster or not, you'll have to test):
SELECT DISTINCT code, grp
,first_value(startdate) OVER w AS startdate
,last_value(enddate) OVER w AS enddate
FROM c
WINDOW W AS (PARTITION BY code, grp ORDER BY startdate
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER BY 1, 2;
Combine to one query
WITH
-- supply one or more filter values
filter(filter_id, startdate, enddate) AS (
VALUES
(1, '2012-05-01'::date, '2012-06-05'::date) -- cast values in first row
,(2, '2012-05-01', '2012-05-31')
,(3, '2012-06-01', '2012-06-30')
)
-- combine date ranges per code
,a AS (
SELECT code, startdate, enddate
,max(enddate) OVER (PARTITION BY code ORDER BY startdate) AS max_end
FROM invoice
), b AS (
SELECT *
,CASE WHEN (lag(max_end) OVER (PARTITION BY code ORDER BY startdate)
+ 2) > startdate THEN 0 ELSE 1 END AS step
FROM a
), c AS (
SELECT code, startdate, enddate, max_end
,sum(step) OVER (PARTITION BY code ORDER BY startdate) AS grp
FROM b
), i AS ( -- substitutes original invoice table
SELECT code, grp
,min(startdate) AS startdate
,max(enddate) AS enddate
FROM c
GROUP BY 1, 2
)
-- match filters
, x AS (
SELECT f.filter_id, i.code
,bool_or(f.startdate >= i.startdate
AND f.enddate <= i.enddate) AS full_cover
FROM filter f
JOIN i ON i.enddate >= f.startdate
AND i.startdate <= f.enddate -- only overlapping
GROUP BY 1,2
)
SELECT f.*, i.code
,CASE x.full_cover
WHEN TRUE THEN 'fully covered'
WHEN FALSE THEN 'partially covered'
ELSE 'invoice missing'
END AS covered
FROM (SELECT DISTINCT code FROM i) i
CROSS JOIN filter f -- all combinations of filter and code
LEFT JOIN x USING (filter_id, code) -- join in overlapping
ORDER BY filter_id, code;
Tested and works for me on PostgreSQL 9.1.
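The steps of the combined query (merge ranges per code, find overlaps, classify) can be cross-checked with a short Python sketch. It is an illustration under assumptions, not the query itself: `merge_ranges` and `coverage` are hypothetical names, and ranges that are adjacent by one day are merged, which mirrors the `+ 2 > startdate` comparison.

```python
from datetime import date, timedelta

def merge_ranges(ranges):
    """Combine overlapping or adjacent (start, end) ranges, end dates inclusive."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + timedelta(days=1):
            merged[-1][1] = max(merged[-1][1], end)  # no gap: extend last range
        else:
            merged.append([start, end])              # gap: start a new range
    return merged

def coverage(ranges, f_start, f_end):
    """Classify one filter range against the merged invoice ranges of one code."""
    overlapping = [(s, e) for s, e in merge_ranges(ranges)
                   if e >= f_start and s <= f_end]
    if not overlapping:
        return "invoice missing"
    if any(s <= f_start and f_end <= e for s, e in overlapping):
        return "fully covered"
    return "partially covered"

# Invoice ranges from the question, per code.
code_111 = [(date(2012, 5, 1), date(2012, 5, 31)),
            (date(2012, 7, 1), date(2012, 7, 20)),
            (date(2012, 7, 25), date(2012, 7, 31))]
code_222 = [(date(2012, 4, 1), date(2012, 4, 15)),
            (date(2012, 4, 18), date(2012, 4, 30))]

first = coverage(code_111, date(2012, 5, 1), date(2012, 6, 5))
second = coverage(code_111, date(2012, 5, 1), date(2012, 5, 31))
third = coverage(code_111, date(2012, 6, 1), date(2012, 6, 30))
other = coverage(code_222, date(2012, 5, 1), date(2012, 6, 5))
```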