Full History Join - sql

I am currently trying to figure out a join between two historized tables, where I want to synchronize both timelines.
As an example, I have the following two tables:
A
ID Value FROM TO
1 5 01.01.2018 31.03.2018
1 6 31.03.2018 08.04.2018
B
ID A_FK Value FROM TO
1 1 50 01.02.2018 01.04.2018
2 1 51 04.04.2018 10.04.2018
As a baseline, I want to take the timeline of table A and join table B, including NULL values so that I know, for which times there is no fitting value.
The desired result should look like this:
C
Value_A Value_B FROM TO
5 NULL 01.01.2018 01.02.2018
5 50 01.02.2018 31.03.2018
6 50 31.03.2018 01.04.2018
6 NULL 01.04.2018 04.04.2018
6 51 04.04.2018 08.04.2018
Can you help me with this? I made a start, but my attempt fails to align the two histories correctly. Here is my try:
with a as (SELECT *
FROM (VALUES (1,5,'01.01.2018','31.03.2018')
, (1,6,'31.03.2018','08.04.2018')
) A (ID, VALUE, FROM, TO)),
b as (
SELECT *
FROM (VALUES (1,1,50,'01.02.2018','01.04.2018')
, (2,1,51,'04.04.2018','10.04.2018')
) A (ID,A_FK, VALUE, FROM, TO)
)
select
a.value as value_a,
b.value as value_b,
max(a.from,b.from) as from,
min(a.to,b.to) as to
from a
left outer join b on
a.id = b.a_fk and
a.from < b.to and
a.to > b.from;
As you can see, it aligns, but not the way I expected it to.
Thank you for your help.

As I suggested in the comments, you can solve your problem with the technique from my own answer to another question.
Here is one solution.
The test data:
create table a (
id integer,
value integer,
dtfrom date,
dtto date
);
create table b(
id integer,
a_fk integer,
value integer,
dtfrom date,
dtto date
);
insert into a values
(1, 5, '2018-01-01', '2018-03-31'),
(1, 6, '2018-03-31', '2018-04-08');
insert into b values
(1, 1, 50, '2018-02-01', '2018-04-01'),
(2, 1, 51, '2018-04-04', '2018-04-10');
The tricky part of this solution is generating the date intervals that do not appear as such in either of your tables, for example 01.01.2018-01.02.2018 and 01.02.2018-31.03.2018. To do that you need all available dates in one table, so I created a VIEW called timmings to make it easier:
create or replace view timmings as
select a.dtfrom dt from a inner join b on a.id=b.a_fk
union
select a.dtto from a inner join b on a.id=b.a_fk
union
select b.dtfrom from a inner join b on a.id=b.a_fk
union
select b.dtto from a inner join b on a.id=b.a_fk;
After that you need a query to generate all available periods (starts and ends) so it will be:
select t1.dt as start,
(select min(t2.dt)
from timmings t2
where t2.dt>t1.dt) as dend
from timmings t1
order by start;
This will result in (with your sample data):
start dend
01/01/2018 01/02/2018
01/02/2018 31/03/2018
31/03/2018 01/04/2018
01/04/2018 04/04/2018
04/04/2018 08/04/2018
08/04/2018 10/04/2018
10/04/2018 null
With that you can use it to get all available values from table a that intersects with the periods:
select a.id, a.value, tm.start, tm.dend
from (select t1.dt as start,
(select min(t2.dt)
from timmings t2
where t2.dt>t1.dt) as dend
from timmings t1) tm
left join a on tm.start >= a.dtfrom and tm.dend <= a.dtto
where a.id is not null
order by tm.start;
That results in:
id value start end
1 5 01/01/2018 01/02/2018
1 5 01/02/2018 31/03/2018
1 6 31/03/2018 01/04/2018
1 6 01/04/2018 04/04/2018
1 6 04/04/2018 08/04/2018
And finally you LEFT JOIN it with b table:
select x.value as valueA,
b.value as valueB,
x.start as "from",
x.dend as "to"
from (select a.id, a.value, tm.start, tm.dend
from (select t1.dt as start,
(select min(t2.dt)
from timmings t2
where t2.dt>t1.dt) as dend
from timmings t1) tm
left join a on tm.start >= a.dtfrom and tm.dend <= a.dtto
where a.id is not null
) x
left join b on b.a_fk = x.id
and b.dtfrom <= x.start
and b.dtto >= x.dend
order by x.start;
Which will give you the result you want:
valueA valueB start end
5 null 01/01/2018 01/02/2018
5 50 01/02/2018 31/03/2018
6 50 31/03/2018 01/04/2018
6 null 01/04/2018 04/04/2018
6 51 04/04/2018 08/04/2018
See the final solution working here: http://sqlfiddle.com/#!9/36418e/1 The fiddle runs on MySQL, but since the queries are plain ANSI SQL they should work in DB2 as well.
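The steps above can be reproduced end to end in a small script. This is only a minimal sketch under some assumptions: it uses SQLite through Python's sqlite3 module instead of DB2 or MySQL, stores dates as ISO strings (which compare correctly as text), inlines the timmings view as a CTE named points, and relies on the sample data containing a single id. The function name align_histories and the column name dstart are made up for the illustration.

```python
import sqlite3

def align_histories():
    """Split both timelines at every boundary date, then attach a and b values."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        create table a (id integer, value integer, dtfrom text, dtto text);
        create table b (id integer, a_fk integer, value integer, dtfrom text, dtto text);
        insert into a values (1, 5, '2018-01-01', '2018-03-31'),
                             (1, 6, '2018-03-31', '2018-04-08');
        insert into b values (1, 1, 50, '2018-02-01', '2018-04-01'),
                             (2, 1, 51, '2018-04-04', '2018-04-10');
    """)
    return con.execute("""
        with points as (              -- every boundary date of both tables
            select dtfrom as dt from a
            union select dtto from a
            union select dtfrom from b
            union select dtto from b
        ),
        periods as (                  -- each date paired with the next one
            select p.dt as dstart,
                   (select min(q.dt) from points q where q.dt > p.dt) as dend
            from points p
        )
        select a.value, b.value, tm.dstart, tm.dend
        from periods tm
        join a on tm.dstart >= a.dtfrom and tm.dend <= a.dtto
        left join b on b.a_fk = a.id
                   and b.dtfrom <= tm.dstart and b.dtto >= tm.dend
        order by tm.dstart
    """).fetchall()

if __name__ == "__main__":
    for row in align_histories():
        print(row)
```

The inner join to a plays the role of the "left join ... where a.id is not null" in the queries above: periods outside the timeline of a are dropped, while periods without a matching b row keep a NULL for the b value.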

There is an excellent blog article about this:
"Fun with Date Ranges" by John Maenpaa
Secondly, if you have a chance to influence the DDL, I would recommend taking a closer look at Db2 temporal tables. They come with full SQL support (Time Travel SQL).

This is actually really simple if you have what's known as a Calendar table - a table with every date in it - although you can construct one on-the-fly if necessary. You can use it to turn this more obviously into a gaps-and-islands problem.
(You want one anyways, since they're one of the most useful analysis dimension tables):
SELECT valueA, valueB,
MIN(calendarDate) AS startDate,
MAX(calendarDate) + 1 DAY AS endDate
FROM (SELECT A.val AS valueA, B.val AS valueB, Calendar.calendarDate,
ROW_NUMBER() OVER(ORDER BY Calendar.calendarDate) -
ROW_NUMBER() OVER(PARTITION BY A.val, B.val ORDER BY Calendar.calendarDate) AS grouping
FROM Calendar
LEFT JOIN A
ON A.startDate <= Calendar.calendarDate
AND A.endDate > Calendar.calendarDate
LEFT JOIN B
ON B.startDate <= Calendar.calendarDate
AND B.endDate > Calendar.calendarDate
WHERE A.val IS NOT NULL
OR B.val IS NOT NULL) Groups
GROUP BY valueA, valueB, grouping
ORDER BY grouping
SQL Fiddle Example (Minor tweaks for SQL Server usage in example)
...which yields the following results. Note that there's a few extra days from the date range in table B that aren't present in table A!
valueA valueB startDate endDate
5 (null) 2018-01-01 2018-02-01
5 50 2018-02-01 2018-03-31
6 50 2018-03-31 2018-04-01
6 (null) 2018-04-01 2018-04-04
6 51 2018-04-04 2018-04-08
(null) 51 2018-04-08 2018-04-10
(This of course is trivially changeable by switching the join to A to a regular INNER JOIN, but I figured this and other cases would be important.)
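For readers without a Calendar table at hand, the idea can be tried out with one built on the fly. The following is a rough sketch, not the answer's exact query: it assumes SQLite 3.25 or later (for window functions) via Python's sqlite3 module, builds the calendar with a recursive CTE over a hard-coded range, and uses the INNER JOIN variant mentioned above so that only the timeline of A is kept. The names calendar_join and grp are illustrative.

```python
import sqlite3

def calendar_join():
    """Gaps-and-islands over a generated calendar, restricted to A's timeline."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        create table a (val integer, startDate text, endDate text);
        create table b (val integer, startDate text, endDate text);
        insert into a values (5, '2018-01-01', '2018-03-31'),
                             (6, '2018-03-31', '2018-04-08');
        insert into b values (50, '2018-02-01', '2018-04-01'),
                             (51, '2018-04-04', '2018-04-10');
    """)
    return con.execute("""
        with recursive calendar(calendarDate) as (
            select '2018-01-01'       -- assumed range covering the sample data
            union all
            select date(calendarDate, '+1 day') from calendar
            where calendarDate < '2018-04-30'
        ),
        days as (
            select a.val as valueA, b.val as valueB, c.calendarDate,
                   row_number() over (order by c.calendarDate)
                 - row_number() over (partition by a.val, b.val
                                      order by c.calendarDate) as grp
            from calendar c
            join a on a.startDate <= c.calendarDate and a.endDate > c.calendarDate
            left join b on b.startDate <= c.calendarDate and b.endDate > c.calendarDate
        )
        select valueA, valueB,
               min(calendarDate) as rangeStart,
               date(max(calendarDate), '+1 day') as rangeEnd
        from days
        group by valueA, valueB, grp
        order by rangeStart
    """).fetchall()

if __name__ == "__main__":
    for row in calendar_join():
        print(row)
```

The difference of the two row numbers stays constant while the (valueA, valueB) pair does not change from one day to the next, which is what turns each run of days into one group.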

Related

SQL - Find if column dates include at least partially a date range

I need to create a report and I am struggling with the SQL script.
The table I want to query is a company_status_history table which has entries like the following (the ones that I can't figure out)
Table company_status_history
Columns:
| id | company_id | status_id | effective_date |
Data:
| 1 | 10 | 1 | 2016-12-30 00:00:00.000 |
| 2 | 10 | 5 | 2017-02-04 00:00:00.000 |
| 3 | 11 | 5 | 2017-06-05 00:00:00.000 |
| 4 | 11 | 1 | 2018-04-30 00:00:00.000 |
I want to answer to the question "Get all companies that have been at least for some point in status 1 inside the time period 01/01/2017 - 31/12/2017"
Above are the cases that I don't know how to handle, since I need to add some logic of this type:
"If this row is status 1 and its date is before the date range, check the next row if it has a date inside the date range."
"If this row is status 1 and its date is after the date range, check the row before if it has a date inside the date range."
I think this can be handled as a gaps and islands problem. Consider the following input data: (same as sample data of OP plus two additional rows)
id company_id status_id effective_date
-------------------------------------------
1 10 1 2016-12-15
2 10 1 2016-12-30
3 10 5 2017-02-04
4 10 4 2017-02-08
5 11 5 2017-06-05
6 11 1 2018-04-30
You can use the following query:
SELECT t.id, t.company_id, t.status_id, t.effective_date, x.cnt
FROM company_status_history AS t
OUTER APPLY
(
SELECT COUNT(*) AS cnt
FROM company_status_history AS c
WHERE c.status_id = 1
AND c.company_id = t.company_id
AND c.effective_date < t.effective_date
) AS x
ORDER BY company_id, effective_date
to get:
id company_id status_id effective_date cnt
-----------------------------------------------
1 10 1 2016-12-15 0
2 10 1 2016-12-30 1
3 10 5 2017-02-04 2
4 10 4 2017-02-08 2
5 11 5 2017-06-05 0
6 11 1 2018-04-30 0
Now you can identify status = 1 islands using:
;WITH CTE AS
(
SELECT t.id, t.company_id, t.status_id, t.effective_date, x.cnt
FROM company_status_history AS t
OUTER APPLY
(
SELECT COUNT(*) AS cnt
FROM company_status_history AS c
WHERE c.status_id = 1
AND c.company_id = t.company_id
AND c.effective_date < t.effective_date
) AS x
)
SELECT id, company_id, status_id, effective_date,
ROW_NUMBER() OVER (PARTITION BY company_id ORDER BY effective_date) -
cnt AS grp
FROM CTE
Output:
id company_id status_id effective_date grp
-----------------------------------------------
1 10 1 2016-12-15 1
2 10 1 2016-12-30 1
3 10 5 2017-02-04 1
4 10 4 2017-02-08 2
5 11 5 2017-06-05 1
6 11 1 2018-04-30 2
Calculated field grp will help us identify those islands:
;WITH CTE AS
(
SELECT t.id, t.company_id, t.status_id, t.effective_date, x.cnt
FROM company_status_history AS t
OUTER APPLY
(
SELECT COUNT(*) AS cnt
FROM company_status_history AS c
WHERE c.status_id = 1
AND c.company_id = t.company_id
AND c.effective_date < t.effective_date
) AS x
), CTE2 AS
(
SELECT id, company_id, status_id, effective_date,
ROW_NUMBER() OVER (PARTITION BY company_id ORDER BY effective_date) -
cnt AS grp
FROM CTE
)
SELECT company_id,
MIN(effective_date) AS start_date,
CASE
WHEN COUNT(*) > 1 THEN DATEADD(DAY, -1, MAX(effective_date))
ELSE MIN(effective_date)
END AS end_date
FROM CTE2
GROUP BY company_id, grp
HAVING COUNT(CASE WHEN status_id = 1 THEN 1 END) > 0
Output:
company_id start_date end_date
-----------------------------------
10 2016-12-15 2017-02-03
11 2018-04-30 2018-04-30
All you need now is those records from the result above that overlap with the specified interval.
Demo here with somewhat more complicated use case.
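The final overlap check is not spelled out above. A minimal sketch of it, assuming the island rows from the last output have been stored in a table called islands (a hypothetical name) and using SQLite via Python's sqlite3 module instead of SQL Server: two closed intervals overlap when each one starts no later than the other one ends.

```python
import sqlite3

def companies_in_status_1(range_start, range_end):
    """Return company ids whose status-1 island overlaps [range_start, range_end]."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- islands: assumed to hold the output of the last query above
        create table islands (company_id integer, start_date text, end_date text);
        insert into islands values (10, '2016-12-15', '2017-02-03'),
                                   (11, '2018-04-30', '2018-04-30');
    """)
    rows = con.execute("""
        select distinct company_id
        from islands
        where start_date <= ?   -- island starts before the range ends
          and end_date >= ?     -- island ends after the range starts
        order by company_id
    """, (range_end, range_start)).fetchall()
    return [r[0] for r in rows]

if __name__ == "__main__":
    print(companies_in_status_1('2017-01-01', '2017-12-31'))
```

With the sample islands, only company 10 overlaps the year 2017, because its status-1 period runs from 2016-12-15 to 2017-02-03.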
Maybe this is what you are looking for? For this kind of question you need to join two instances of your table. In this case I am simply joining with the next record by id, which is probably not entirely correct. To do it better, you can create a new id using a window function such as ROW_NUMBER, ordering the table by whatever criteria you require.
If this row is status 1 and its date is before the date range check
the next row if it has a date inside the date range
declare @range_st date = '2017-01-01'
declare @range_en date = '2017-12-31'
select
case
when csh1.status_id=1 and csh1.effective_date<@range_st
then
case
when csh2.effective_date between @range_st and @range_en then 1
else 0
end
else NULL
end
from company_status_history csh1
left join company_status_history csh2
on csh2.id=csh1.id+1 -- csh2 is the next row by id
Implementing the second criterion:
"If this row is status 1 and its date is after the date range check
the row before if it has a date inside the date range."
declare @range_st date = '2017-01-01'
declare @range_en date = '2017-12-31'
select
case
when csh1.status_id=1 and csh1.effective_date<@range_st
then
case
when csh2.effective_date between @range_st and @range_en then 1
else 0
end
when csh1.status_id=1 and csh1.effective_date>@range_en
then
case
when csh3.effective_date between @range_st and @range_en then 1
else 0
end
else null -- ¿?
end
from company_status_history csh1
left join company_status_history csh2
on csh2.id=csh1.id+1 -- csh2 is the next row by id
left join company_status_history csh3
on csh3.id=csh1.id-1 -- csh3 is the previous row by id
I would suggest the use of a cte and the window functions ROW_NUMBER. With this you can find the desired records. An example:
DECLARE @t TABLE(
id INT
,company_id INT
,status_id INT
,effective_date DATETIME
)
INSERT INTO @t VALUES
(1, 10, 1, '2016-12-30 00:00:00.000')
,(2, 10, 5, '2017-02-04 00:00:00.000')
,(3, 11, 5, '2017-06-05 00:00:00.000')
,(4, 11, 1, '2018-04-30 00:00:00.000')
DECLARE @StartDate DATETIME = '2017-01-01';
DECLARE @EndDate DATETIME = '2017-12-31';
WITH cte AS(
SELECT *
,ROW_NUMBER() OVER (PARTITION BY company_id ORDER BY effective_date) AS rn
FROM @t
),
cteLeadLag AS(
SELECT c.*, ISNULL(c2.effective_date, c.effective_date) LagEffective, ISNULL(c3.effective_date, c.effective_date)LeadEffective
FROM cte c
LEFT JOIN cte c2 ON c2.company_id = c.company_id AND c2.rn = c.rn-1
LEFT JOIN cte c3 ON c3.company_id = c.company_id AND c3.rn = c.rn+1
)
SELECT 'Included' AS RangeStatus, *
FROM cteLeadLag
WHERE status_id = 1
AND effective_date BETWEEN @StartDate AND @EndDate
UNION ALL
SELECT 'Following' AS RangeStatus, *
FROM cteLeadLag
WHERE status_id = 1
AND effective_date > @EndDate
AND LagEffective BETWEEN @StartDate AND @EndDate
UNION ALL
SELECT 'Trailing' AS RangeStatus, *
FROM cteLeadLag
WHERE status_id = 1
AND effective_date < @StartDate -- status 1 set before the range, next change inside it
AND LeadEffective BETWEEN @StartDate AND @EndDate
I first select all records with their leading and lagging Dates and then I perform your checks on the inclusion in the desired timespan.
Try with this, self-explanatory. Responds to this part of your question:
I want to answer to the question "Get all companies that have been at
least for some point in status 1 inside the time period 01/01/2017 -
31/12/2017"
Case that you want to find those id's that have been in any moment in status 1 and have records in the period requested:
SELECT *
FROM company_status_history
WHERE company_id IN
( SELECT company_id
FROM company_status_history
WHERE status_id=1 )
AND effective_date BETWEEN '2017-01-01' AND '2017-12-31'
Case that you want to find id's in status 1 and inside the period:
SELECT *
FROM company_status_history
WHERE status_id=1
AND effective_date BETWEEN '2017-01-01' AND '2017-12-31'

Overlapping date spans

I have the following table. How can I find only the overlapping spans? In the example below, memberid 3 should not be in scope, since its date spans do not overlap with each other.
Any help is highly appreciated
MemberID fromdate todate
1 1/1/2018 12/31/2018
1 1/1/2018 12/31/2018
2 12/1/2017 1/1/2019
2 1/2/2018 2/2/2019
3 1/1/2015 12/31/2015
3 1/1/2016 12/31/2016
3 1/1/2017 12/31/2017
4 1/1/2018 1/1/2018
4 1/1/2018 1/1/2018
5 1/1/2015 1/31/2016
5 1/1/2016 7/31/2016
5 07/01/2016 12/31/2016
Expected results should be data associated with Member Ids 1,2,4 and 5 Member ID 3 should not be in the results set because date spans are not overlapping.
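Hmmm. You can get the overlapping spans with the query below. It assumes the table has a unique row identifier (called id here; the question does not show one), because otherwise every row would match itself. The comparisons are inclusive so that the single-day spans of member 4 count as overlapping:
select m.*
from members m
where exists (select 1
from members m2
where m2.memberid = m.memberid and
m2.id <> m.id and -- id is an assumed unique key
m2.todate >= m.fromdate and m2.fromdate <= m.todate
);
If you want members that don't overlap, let's use except:
select m.memberid
from members m
except
select m.memberid
from members m
where exists (select 1
from members m2
where m2.memberid = m.memberid and
m2.id <> m.id and -- id is an assumed unique key
m2.todate >= m.fromdate and m2.fromdate <= m.todate
);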
Except removes duplicates. But if you wanted to be extra sure and redundant, you could write select distinct for each query.
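The EXISTS approach can be checked against the sample data from the question. This is a hedged sketch rather than the answerer's exact query: it runs on SQLite through Python's sqlite3 module, uses SQLite's implicit rowid to keep a row from being compared with itself (the question's table shows no key, so that part is an assumption), and compares inclusively so that the identical single-day spans of member 4 count as overlapping.

```python
import sqlite3

def overlapping_members():
    """Return the member ids that have at least two overlapping date spans."""
    con = sqlite3.connect(":memory:")
    con.execute("create table members (memberid integer, fromdate text, todate text)")
    con.executemany("insert into members values (?, ?, ?)", [
        (1, '2018-01-01', '2018-12-31'), (1, '2018-01-01', '2018-12-31'),
        (2, '2017-12-01', '2019-01-01'), (2, '2018-01-02', '2019-02-02'),
        (3, '2015-01-01', '2015-12-31'), (3, '2016-01-01', '2016-12-31'),
        (3, '2017-01-01', '2017-12-31'),
        (4, '2018-01-01', '2018-01-01'), (4, '2018-01-01', '2018-01-01'),
        (5, '2015-01-01', '2016-01-31'), (5, '2016-01-01', '2016-07-31'),
        (5, '2016-07-01', '2016-12-31'),
    ])
    rows = con.execute("""
        select distinct m.memberid
        from members m
        where exists (select 1
                      from members m2
                      where m2.memberid = m.memberid
                        and m2.rowid <> m.rowid        -- never compare a row with itself
                        and m2.todate >= m.fromdate    -- inclusive overlap test
                        and m2.fromdate <= m.todate)
        order by m.memberid
    """).fetchall()
    return [r[0] for r in rows]

if __name__ == "__main__":
    print(overlapping_members())
```

Member 3 drops out because each of its spans ends the day before the next one starts.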
Try this:
;with cte as
(select memberid, convert(Varchar,fromdate,101)fromdate,convert(Varchar,todate,101)todate from #tb),
cte2 as
(select Num,memberid,todate,fromdate,Num + 1 as num2 from
(select ROW_NUMBER() over(partition by memberid order by fromdate) as Num,memberid,fromdate,todate from cte) as a),
cte3 as
(select memberid,fromdate,todate, DATEDIFF(day,fromdate,todate) as date_diff from
(select ISNULL(memberid,bnum)memberid , isnull(fromdate1,fromdate2)fromdate,isnull(fromdate2,fromdate1)todate,bnum from
(select a.num,a.fromdate,a.todate,a.num2 as num1,a.memberid,case when a.Num=b.num2 then b.todate else a.fromdate end as fromdate1,
case when a.Num=b.num2 then a.fromdate else b.todate end as fromdate2,b.num2,b.todate as todate2,b.Num as bnum from cte2 as a
full join cte2 as b
on a.num = b.num2 and a.memberid = b.memberid) as a) as a)
select distinct memberid from cte3 where date_diff<0

Joining two tables on the nearest single date

I was hoping someone might help me on this one. I have two tables that need to be joined on the nearest date (nearest before date). I have found with some searching a way to do this using the DATEDIFF and Row_Number functions, but the output is not quite what I want. Here is what i am trying to do:
CREATE TABLE #OPS ([Date] Date, [Runtime] FLOAT, [INTERVAL] INT)
INSERT INTO #OPS Values
( '2015-02-09',29540.3,12),
('2015-02-16',29661.7, 10),
('2015-03-02',29993.7,10),
('2015-03-09',30161.7,12),
('2015-03-16',30333.4,12),
('2015-03-23',30337.9,5),
('2015-03-30',30506.9,12),
('2015-04-06',30628.1,6),
('2015-04-13',30795,4),
('2015-04-20',30961.2,6)
SELECT * FROM #OPS
CREATE TABLE #APPS ([Date] DATE, [Value] INT)
INSERT INTO #APPS Values
('2015-03-05', 1000),('2015-03-27', 1040), ('2015-04-17', 1070)
;WITH Nearest_date AS
(
SELECT
t1.*, t2.Date as date2, t2.Value,
ROW_NUMBER() OVER
(
PARTITION BY t1.[Date]
ORDER BY t2.[Date] DESC
) AS RowNum
FROM #OPS t1
LEFT JOIN #APPS t2
ON t2.[Date] <= t1.[Date]
)
SELECT *
FROM Nearest_date
WHERE RowNum = 1
ORDER BY Date ASC
--This is what I get
Date Runtime INTERVAL date2 Value
2/9/2015 29540.3 12 NULL NULL
2/16/2015 29661.7 10 NULL NULL
3/2/2015 29993.7 10 NULL NULL
3/9/2015 30161.7 12 3/5/2015 1000
3/16/2015 30333.4 12 3/5/2015 1000
3/23/2015 30337.9 5 3/5/2015 1000
3/30/2015 30506.9 12 3/27/2015 1040
4/6/2015 30628.1 6 3/27/2015 1040
4/13/2015 30795 4 3/27/2015 1040
4/20/2015 30961.2 6 4/17/2015 1070
-- This is what I want
Date Runtime INTERVAL date2 Value
2/9/2015 29540.3 12 NULL NULL
2/16/2015 29661.7 10 NULL NULL
3/2/2015 29993.7 10 NULL NULL
3/9/2015 30161.7 12 3/5/2015 1000
3/16/2015 30333.4 12 NULL NULL
3/23/2015 30337.9 5 NULL NULL
3/30/2015 30506.9 12 3/27/2015 1040
4/6/2015 30628.1 6 NULL NULL
4/13/2015 30795 4 NULL NULL
4/20/2015 30961.2 6 4/17/2015 1070
You can see that I want each date from the second table to appear only once, on the row whose date is nearest to it. The query I created shows the same date for multiple rows, when only one of those rows is truly the closest. Any help would be, as always, massively appreciated. -- running MSSQL 2014
Using OUTER APPLY and LEFT JOIN:
SQL Fiddle
SELECT
o.*,
Date2 = t.Date,
t.Value
FROM #OPS o
LEFT JOIN(
SELECT
a.*, Date2 = x.Date
FROM #APPS a
OUTER APPLY(
SELECT TOP 1 *
FROM #OPS
WHERE
[Date] >= a.Date -- first #OPS row on or after the #APPS date, as in the desired output
ORDER BY [Date] ASC
)x
)t
ON t.Date2 = o.Date
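The same idea can be tried without OUTER APPLY. The sketch below is an adaptation, not the answer's T-SQL: it assumes SQLite via Python's sqlite3 module, replaces the temp tables with ordinary tables named ops and apps, renames the date columns to op_date and app_date, and reads the desired output as "attach each APPS row to the first OPS row on or after its date".

```python
import sqlite3

def nearest_following_join():
    """Attach each apps row to exactly one ops row: the first on or after its date."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        create table ops (op_date text, runtime real);
        insert into ops values
            ('2015-02-09', 29540.3), ('2015-02-16', 29661.7), ('2015-03-02', 29993.7),
            ('2015-03-09', 30161.7), ('2015-03-16', 30333.4), ('2015-03-23', 30337.9),
            ('2015-03-30', 30506.9), ('2015-04-06', 30628.1), ('2015-04-13', 30795),
            ('2015-04-20', 30961.2);
        create table apps (app_date text, value integer);
        insert into apps values
            ('2015-03-05', 1000), ('2015-03-27', 1040), ('2015-04-17', 1070);
    """)
    return con.execute("""
        select o.op_date, t.app_date, t.value
        from ops o
        left join (
            select a.app_date, a.value,
                   -- correlated subquery standing in for OUTER APPLY ... TOP 1
                   (select min(o2.op_date) from ops o2
                    where o2.op_date >= a.app_date) as first_op
            from apps a
        ) t on t.first_op = o.op_date
        order by o.op_date
    """).fetchall()

if __name__ == "__main__":
    for row in nearest_following_join():
        print(row)
```

Because the lookup starts from the smaller table, each APPS row lands on a single OPS row, and all other OPS rows keep NULLs.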

SQL Query Find x rows forward the highest value without having a lower value in between

I have a table with the two left columns.
I am trying to compute the third column based on some logic.
Logic: If we take date 1/1 and move forward, the highest score reached before the score goes down is on 3/1, with a score of 12. So as HighestAchieveScore we retrieve 12 for 1/1. And so forth.
If we are on a date where the next score goes down, my HighestAchieveScore will be the next score, as you can see at 3/01/2014.
date score HighestAchieveScore
1/01/2014 10 12
2/01/2014 11 12
3/01/2014 12 10
4/01/2014 10 11
5/01/2014 11 9
6/01/2014 9 8
7/01/2014 8 9
8/01/2014 9 9
I hope I explained it clear enough.
Thanks already for every input resolving the problem.
Let's make some test data:
DECLARE @Score TABLE
(
ScoreDate DATETIME,
Score INT
)
INSERT INTO @Score
VALUES
('01-01-2014', 10),
('01-02-2014', 11),
('01-03-2014', 12),
('01-04-2014', 10),
('01-05-2014', 11),
('01-06-2014', 9),
('01-07-2014', 8),
('01-08-2014', 9);
Now we are going to number our rows and then link to the next row to see if we are still going up
WITH ScoreRows AS
(
SELECT
s.ScoreDate,
s.Score,
ROW_NUMBER() OVER (ORDER BY ScoreDate) RN
FROM @Score s
),
ScoreUpDown AS
(
SELECT p.ScoreDate,
p.Score,
p.RN,
CASE WHEN p.Score < n.Score THEN 1 ELSE 0 END GoingUp,
ISNULL(n.Score, p.Score) NextScore
FROM ScoreRows p
LEFT JOIN ScoreRows n
ON n.RN = p.RN + 1
)
For every row we then look for the next row that sits right before a fall, and take that row's score as the max for any row that is still going up. Otherwise, we use the score of the next (falling) row.
SELECT
s.ScoreDate,
s.Score,
CASE WHEN s.GoingUp = 1 THEN d.Score ELSE s.NextScore END Test
FROM ScoreUpDown s
OUTER APPLY
(
SELECT TOP 1 * FROM ScoreUpDown d
WHERE d.ScoreDate > s.ScoreDate
AND GoingUp = 0
ORDER BY d.ScoreDate -- make TOP 1 pick the nearest following peak
) d;
Output:
ScoreDate Score Test
2014-01-01 00:00:00.000 10 12
2014-01-02 00:00:00.000 11 12
2014-01-03 00:00:00.000 12 10
2014-01-04 00:00:00.000 10 11
2014-01-05 00:00:00.000 11 9
2014-01-06 00:00:00.000 9 8
2014-01-07 00:00:00.000 8 9
2014-01-08 00:00:00.000 9 9
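To make the rule itself easier to follow, here is the same logic as a plain procedural sketch in Python instead of T-SQL. The function name highest_achieved is made up for the illustration, and it assumes the scores are already sorted by date; equal consecutive scores are treated as "not going up", as in the CASE expression above.

```python
def highest_achieved(scores):
    """For each position, return the peak of the rising run that follows it,
    or the next score if the next score is not higher (last row: its own score)."""
    n = len(scores)
    result = []
    for i, score in enumerate(scores):
        if i + 1 < n and scores[i + 1] > score:
            # Still going up: walk forward until the row right before a fall.
            j = i + 1
            while j + 1 < n and scores[j + 1] > scores[j]:
                j += 1
            result.append(scores[j])
        elif i + 1 < n:
            # Next score is not higher: use it directly.
            result.append(scores[i + 1])
        else:
            # Last row has no successor, so fall back to its own score.
            result.append(score)
    return result

if __name__ == "__main__":
    print(highest_achieved([10, 11, 12, 10, 11, 9, 8, 9]))
```

For the sample scores this reproduces the Test column shown in the output above.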
Assuming you are wanting the third column to be computed, you can create the table like this (or add the column to an existing table), using a function to determine the value of the third column:
Create Function dbo.fnGetMaxScore(@Date Date)
Returns Int
As Begin
Declare @Ret Int
Select @Ret = Max(Score)
From YourTable
Where Date > @Date
Return @Ret
End
Create Table YourTable
(
Date Date,
Score Int,
HighestAchieveScore As dbo.fnGetMaxScore(Date)
)
I'm not sure this will work.... but this is the general concept.
Self join on A.Date < B.Date to get max score, but use coalesce and a 3rd self join on a rowID assigned in a CTE to determine if the score dropped on the next record, and if it did coalesce that score in, otherwise use the max score.
NEED TO TEST but have to setup a fiddle to do so..
WITH CTE as
(SELECT Date, Score, ROW_NUMBER() OVER(ORDER BY Date ASC) AS Row FROM tableName)
SELECT A.Date, A.Score, coalesce(C.Score, Max(B.Score)) as HighestArchievedScore
FROM CTE A
LEFT JOIN CTE B
on A.Date < B.Date
LEFT JOIN CTE C
on A.Row+1=C.Row -- C is the next row, kept only when the score drops
and A.Score > C.Score
GROUP BY A.DATE,
A.SCORE,
C.Score
This should work on SQL Server 2012 but not earlier versions:
WITH cte AS (
SELECT date,
LEAD(score) OVER (ORDER BY date) nextScore
FROM yourTable
)
SELECT t.date, score,
CASE
WHEN nextScore < score THEN nextScore
ELSE (
SELECT ISNULL(MAX(t1.score), t.score)
FROM yourTable t1
JOIN cte ON t1.date = cte.date
WHERE t1.date > t.date
AND ISNULL(nextScore, 0) < score
)
END AS HighestAchieveScore
FROM yourTable t
JOIN cte ON t.date = cte.date

SQL issue - calculate max days sequence

There is a table with visits data:
uid (INT) | created_at (DATETIME)
I want to find how many days in a row a user has visited our app. So for instance:
SELECT DISTINCT DATE(created_at) AS d FROM visits WHERE uid = 123
will return:
d
------------
2012-04-28
2012-04-29
2012-04-30
2012-05-03
2012-05-04
There are 5 records and two intervals - 3 days (28 - 30 Apr) and 2 days (3 - 4 May).
My question is how to find the maximum number of days that a user has visited the app in a row (3 days in the example). Tried to find a suitable function in the SQL docs, but with no success. Am I missing something?
UPD:
Thank you guys for your answers! Actually, I'm working with vertica analytics database (http://vertica.com/), however this is a very rare solution and only a few people have experience with it. Although it supports SQL-99 standard.
Well, most of solutions work with slight modifications. Finally I created my own version of query:
-- returns starts of the visit series
SELECT t1.d as s FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', -1, t1.d))
WHERE t2.d is null GROUP BY t1.d
s
---------------------
2012-04-28 01:00:00
2012-05-03 01:00:00
-- returns ends of the visit series
SELECT t1.d as f FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', 1, t1.d))
WHERE t2.d is null GROUP BY t1.d
f
---------------------
2012-04-30 01:00:00
2012-05-04 01:00:00
So now all we need to do is join them somehow, for instance by row index.
SELECT s, f, DATEDIFF(day, s, f) + 1 as seq FROM (
SELECT t1.d as s, ROW_NUMBER() OVER () as o1 FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', -1, t1.d))
WHERE t2.d is null GROUP BY t1.d
) tbl1 LEFT JOIN (
SELECT t1.d as f, ROW_NUMBER() OVER () as o2 FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', 1, t1.d))
WHERE t2.d is null GROUP BY t1.d
) tbl2 ON o1 = o2
Sample output:
s | f | seq
---------------------+---------------------+-----
2012-04-28 01:00:00 | 2012-04-30 01:00:00 | 3
2012-05-03 01:00:00 | 2012-05-04 01:00:00 | 2
Another approach, the shortest, do a self-join:
with grouped_result as
(
select
sr.d,
sum((fr.d is null)::int) over(order by sr.d) as group_number
from tbl sr
left join tbl fr on sr.d = fr.d + interval '1 day'
)
select d, group_number, count(d) over m as consecutive_days
from grouped_result
window m as (partition by group_number)
Output:
d | group_number | consecutive_days
---------------------+--------------+------------------
2012-04-28 08:00:00 | 1 | 3
2012-04-29 08:00:00 | 1 | 3
2012-04-30 08:00:00 | 1 | 3
2012-05-03 08:00:00 | 2 | 2
2012-05-04 08:00:00 | 2 | 2
(5 rows)
Live test: http://www.sqlfiddle.com/#!1/93789/1
sr = second row, fr = first row (or perhaps previous row? ツ). Basically we are backtracking: the join simulates LAG on a database that doesn't support it. (Postgres does support LAG, but that solution is very long, because window functions cannot be nested.) So this query uses a hybrid approach: simulate LAG via a join, then apply a windowed SUM over the result, which produces the group number.
UPDATE
I forgot to include the final query. The query above illustrates the underpinnings of group numbering; it needs to be morphed into this:
with grouped_result as
(
select
sr.d,
sum((fr.d is null)::int) over(order by sr.d) as group_number
from tbl sr
left join tbl fr on sr.d = fr.d + interval '1 day'
)
select min(d) as starting_date, max(d) as end_date, count(d) as consecutive_days
from grouped_result
group by group_number
-- order by consecutive_days desc limit 1
STARTING_DATE END_DATE CONSECUTIVE_DAYS
April, 28 2012 08:00:00-0700 April, 30 2012 08:00:00-0700 3
May, 03 2012 08:00:00-0700 May, 04 2012 08:00:00-0700 2
UPDATE
I know why my other solution that uses window function became long, it became long on my attempt to illustrate the logic of group numbering and counting over the group. If I'd cut to the chase like in my MySql approach, that windowing function could be shorter. Having said that, here's my old windowing function approach, albeit better now:
with headers as
(
select
d,lag(d) over m is null or d - lag(d) over m <> interval '1 day' as header
from tbl
window m as (order by d)
)
,sequence_group as
(
select d, sum(header::int) over (order by d) as group_number
from headers
)
select min(d) as starting_date,max(d) as ending_date,count(d) as consecutive_days
from sequence_group
group by group_number
-- order by consecutive_days desc limit 1
Live test: http://www.sqlfiddle.com/#!1/93789/21
In MySQL you could do this:
SET @nextDate = CURRENT_DATE;
SET @RowNum = 1;
SELECT MAX(RowNumber) AS ConecutiveVisits
FROM ( SELECT @RowNum := IF(@NextDate = Created_At, @RowNum + 1, 1) AS RowNumber,
Created_At,
@NextDate := DATE_ADD(Created_At, INTERVAL 1 DAY) AS NextDate
FROM Visits
ORDER BY Created_At
) Visits
Example here:
http://sqlfiddle.com/#!2/6e035/8
However I am not 100% certain this is the best way to do it.
In Postgresql:
;WITH RECURSIVE VisitsCTE AS
( SELECT Created_At, 1 AS ConsecutiveDays
FROM Visits
UNION ALL
SELECT v.Created_At, ConsecutiveDays + 1
FROM Visits v
INNER JOIN VisitsCTE cte
ON 1 + cte.Created_At = v.Created_At
)
SELECT MAX(ConsecutiveDays) AS ConsecutiveDays
FROM VisitsCTE
Example here:
http://sqlfiddle.com/#!1/16c90/9
I know Postgresql has something similar to common table expressions as available in MSSQL. I'm not that familiar with Postgresql, but the code below works for MSSQL and does what you want.
create table #tempdates (
mydate date
)
insert into #tempdates(mydate) values('2012-04-28')
insert into #tempdates(mydate) values('2012-04-29')
insert into #tempdates(mydate) values('2012-04-30')
insert into #tempdates(mydate) values('2012-05-03')
insert into #tempdates(mydate) values('2012-05-04');
with maxdays (s, e, c)
as
(
select mydate, mydate, 1
from #tempdates
union all
select m.s, mydate, m.c + 1
from #tempdates t
inner join maxdays m on DATEADD(day, -1, t.mydate)=m.e
)
select MIN(o.s),o.e,max(o.c)
from (
select m1.s,max(m1.e) e,max(m1.c) c
from maxdays m1
group by m1.s
) o
group by o.e
drop table #tempdates
And here's the SQL fiddle: http://sqlfiddle.com/#!3/42b38/2
All are very good answers, but I think I should contribute by showing another approach utilizing an analytical capability specific to Vertica (after all it is part of what you paid for). And I promise the final query is short.
First, query using conditional_true_event(). From Vertica's documentation:
Assigns an event window number to each row, starting from 0, and
increments the number by 1 when the result of the boolean argument
expression evaluates true.
The example query looks like this:
select uid, created_at,
conditional_true_event( created_at - lag(created_at) > '1 day' )
over (partition by uid order by created_at) as seq_id
from visits;
And output:
uid created_at seq_id
--- ------------------- ------
123 2012-04-28 00:00:00 0
123 2012-04-29 00:00:00 0
123 2012-04-30 00:00:00 0
123 2012-05-03 00:00:00 1
123 2012-05-04 00:00:00 1
123 2012-06-04 00:00:00 2
123 2012-06-04 00:00:00 2
Now the final query becomes easy:
select uid, seq_id, count(1) num_days, min(created_at) s, max(created_at) f
from
(
select uid, created_at,
conditional_true_event( created_at - lag(created_at) > '1 day' )
over (partition by uid order by created_at) as seq_id
from visits
) as seq
group by uid, seq_id;
Final Output:
uid seq_id num_days s f
--- ------ -------- ------------------- -------------------
123 0 3 2012-04-28 00:00:00 2012-04-30 00:00:00
123 1 2 2012-05-03 00:00:00 2012-05-04 00:00:00
123 2 2 2012-06-04 00:00:00 2012-06-04 00:00:00
One final note:
num_days is actually number of rows of the inner query. If there are two '2012-04-28' visits in the original table (i.e. duplicates), you might want to work around that.
The following should be Oracle friendly, and not require recursive logic.
;WITH
visit_dates (
visit_id,
date_id,
group_id
)
AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY TRUNC(created_at)),
TRUNC(SYSDATE) - TRUNC(created_at),
TRUNC(SYSDATE) - TRUNC(created_at) - ROW_NUMBER() OVER (ORDER BY TRUNC(created_at))
FROM
visits
GROUP BY
TRUNC(created_at)
)
,
group_duration (
group_id,
duration
)
AS
(
SELECT
group_id,
MAX(date_id) - MIN(date_id) + 1 AS duration
FROM
visit_dates
GROUP BY
group_id
)
SELECT
MAX(duration) AS max_duration
FROM
group_duration
Postgresql:
with headers as
(
select
d,
lag(d) over m is null or d - lag(d) over m <> interval '1 day' as header
from tbl
window m as (order by d)
)
,sequence_group as
(
select d, sum(header::int) over m as group_number
from headers
window m as (order by d)
)
,consecutive_list as
(
select d, group_number, count(d) over m as consecutive_count
from sequence_group
window m as (partition by group_number)
)
select * from consecutive_list
Divide-and-conquer approach: 3 steps
1st step, find headers:
with headers as
(
select
d,
lag(d) over m is null or d - lag(d) over m <> interval '1 day' as header
from tbl
window m as (order by d)
)
select * from headers
Output:
d | header
---------------------+--------
2012-04-28 08:00:00 | t
2012-04-29 08:00:00 | f
2012-04-30 08:00:00 | f
2012-05-03 08:00:00 | t
2012-05-04 08:00:00 | f
(5 rows)
2nd step, designate grouping:
with headers as
(
select
d,
lag(d) over m is null or d - lag(d) over m <> interval '1 day' as header
from tbl
window m as (order by d)
)
,sequence_group as
(
select d, sum(header::int) over m as group_number
from headers
window m as (order by d)
)
select * from sequence_group
Output:
d | group_number
---------------------+--------------
2012-04-28 08:00:00 | 1
2012-04-29 08:00:00 | 1
2012-04-30 08:00:00 | 1
2012-05-03 08:00:00 | 2
2012-05-04 08:00:00 | 2
(5 rows)
3rd step, count max days:
with headers as
(
select
d,
lag(d) over m is null or d - lag(d) over m <> interval '1 day' as header
from tbl
window m as (order by d)
)
,sequence_group as
(
select d, sum(header::int) over m as group_number
from headers
window m as (order by d)
)
,consecutive_list as
(
select d, group_number, count(d) over m as consecutive_count
from sequence_group
window m as (partition by group_number)
)
select * from consecutive_list
Output:
d | group_number | consecutive_count
---------------------+--------------+-----------------
2012-04-28 08:00:00 | 1 | 3
2012-04-29 08:00:00 | 1 | 3
2012-04-30 08:00:00 | 1 | 3
2012-05-03 08:00:00 | 2 | 2
2012-05-04 08:00:00 | 2 | 2
(5 rows)
This is for MySQL, the shortest, and uses minimal variable (one variable only):
select
min(d) as starting_date, max(d) as ending_date,
count(d) as consecutive_days
from
(
select
sr.d,
IF(fr.d is null,@group_number := @group_number + 1,@group_number)
as group_number
from tbl sr
left join tbl fr on sr.d = adddate(fr.d,interval 1 day)
cross join (select @group_number := 0) as grp
) as x
group by group_number
Output:
STARTING_DATE ENDING_DATE CONSECUTIVE_DAYS
April, 28 2012 08:00:00-0700 April, 30 2012 08:00:00-0700 3
May, 03 2012 08:00:00-0700 May, 04 2012 08:00:00-0700 2
Live test: http://www.sqlfiddle.com/#!2/65169/1
For PostgreSQL 8.4 or later, there is a short and clean way with window functions and no JOIN.
I'd expect this to be the fastest solution posted so far:
WITH x AS (
SELECT created_at AS d
, lag(created_at) OVER (ORDER BY created_at) = (created_at - 1) AS nu
FROM visits
WHERE uid = 1
)
, y AS (
SELECT d, count(NULLIF(nu, TRUE)) OVER (ORDER BY d) AS seq
FROM x
)
SELECT count(*) AS max_days, min(d) AS seq_from, max(d) AS seq_to
FROM y
GROUP BY seq
ORDER BY 1 DESC
LIMIT 1;
Returns:
max_days | seq_from | seq_to
---------+------------+-----------
3 | 2012-04-28 | 2012-04-30
Assuming that created_at is a date and unique.
In CTE x: for every day our user visits, check if he was here yesterday, too.
To calculate "yesterday" just use created_at - 1. The first row is a special case and will produce NULL here.
In CTE y: calculate a running count of "days without yesterday so far" (seq) for every day. NULL values don't count, so count(NULLIF(nu, TRUE)) is the fastest and shortest way, also covering the special case.
Finally, group days per seq and count the days. While being at it I added first and last day of the sequence.
ORDER BY length of the sequence, and pick the longest one.
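The walk-through above can be mirrored in a few lines of Python, which may help if the count(NULLIF(nu, TRUE)) trick is hard to picture. This is a rough sketch under the same assumption (dates are unique); the function name longest_streak is hypothetical.

```python
from datetime import date, timedelta

def longest_streak(days):
    """Return (max_days, seq_from, seq_to), or None for empty input."""
    days = sorted(days)  # assumption: dates are unique, as in the SQL
    if not days:
        return None
    seq = 0
    streaks = {}  # seq -> (first_day, last_day, day_count)
    prev = None
    for d in days:
        # nu mirrors the SQL flag: was the previous visit exactly yesterday?
        nu = prev is not None and prev == d - timedelta(days=1)
        # seq counts "days without yesterday so far"; the first row
        # (NULL in SQL) does not count.
        if prev is not None and not nu:
            seq += 1
        first, _, n = streaks.get(seq, (d, d, 0))
        streaks[seq] = (first, d, n + 1)
        prev = d
    first, last, n = max(streaks.values(), key=lambda s: s[2])
    return n, first, last

visits = [date(2012, 4, 28), date(2012, 4, 29), date(2012, 4, 30),
          date(2012, 5, 3), date(2012, 5, 4)]
best = longest_streak(visits)
```

For these sample dates best holds the 3-day streak from 2012-04-28 to 2012-04-30, matching the row returned by the query.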
Upon seeing OP's query approach for their Vertica database, I tried making the two joins run at the same time:
The PostgreSQL and SQL Server versions below should both work in Vertica.
Postgresql version:
select
min(gr.d) as start_date,
max(gr.d) as end_date,
date_part('day', max(gr.d) - min(gr.d))+1 as consecutive_days
from
(
select
cr.d, (row_number() over() - 1) / 2 as pair_number
from tbl cr
left join tbl pr on pr.d = cr.d - interval '1 day'
left join tbl nr on nr.d = cr.d + interval '1 day'
where pr.d is null <> nr.d is null
) as gr
group by pair_number
order by start_date
Regarding pr.d is null <> nr.d is null: this is an XOR operation. It keeps a row only when exactly one of the previous row and the next row is missing, so what remains are the headers (first day) and footers (last day) of each run. Isolated dates, whose previous and next rows are both null, are removed, and so are dates in the middle of a run, where neither is null.
Once only headers and footers are left, we can pair them via row_number:
(row_number() over() - 1) / 2 as pair_number
row_number() starts at 1, so we subtract 1 (adding 1 instead works too) and then divide by two with integer division; this gives each header and its footer the same pair_number.
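To see the XOR filter and the pairing in isolation, here is a small Python sketch of the same logic. It is an illustration only: the name streak_ranges is made up, and it works on plain dates sorted explicitly rather than on joined rows.

```python
from datetime import date, timedelta

def streak_ranges(days):
    """Return (start_date, end_date, consecutive_days) per run of 2+ days."""
    present = set(days)
    one_day = timedelta(days=1)
    # XOR filter: keep a day only if exactly one neighbour is missing,
    # i.e. it is the first or the last day of a run.
    edges = sorted(
        d for d in present
        if ((d - one_day) not in present) != ((d + one_day) not in present)
    )
    # Pair the survivors by position: (0, 1), (2, 3), ...
    return [
        (edges[i], edges[i + 1], (edges[i + 1] - edges[i]).days + 1)
        for i in range(0, len(edges), 2)
    ]

tbl = [date(2012, 4, 28), date(2012, 4, 29), date(2012, 4, 30),
       date(2012, 5, 3), date(2012, 5, 4)]
ranges = streak_ranges(tbl)
```

Note that a run of a single day has both neighbours missing, so the XOR drops it and it never shows up in the result.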
Live test: http://www.sqlfiddle.com/#!1/fc440/7
This is the Sql Server version:
select
min(gr.d) as start_date,
max(gr.d) as end_date,
datediff(day, min(gr.d),max(gr.d)) +1 as consecutive_days
from
(
select
cr.d, (row_number() over(order by cr.d) - 1) / 2 as pair_number
from tbl cr
left join tbl pr on pr.d = dateadd(day,-1,cr.d)
left join tbl nr on nr.d = dateadd(day,+1,cr.d)
where
case when pr.d is null then 1 else 0 end
<> case when nr.d is null then 1 else 0 end
) as gr
group by pair_number
order by start_date
Same logic as above, apart from superficial differences in the date functions. Also, SQL Server requires an ORDER BY clause in its OVER, while PostgreSQL's OVER can be left empty.
SQL Server has no first-class boolean type, so we cannot compare booleans directly like this:
pr.d is null <> nr.d is null
In SQL Server we must write this instead:
case when pr.d is null then 1 else 0 end
<> case when nr.d is null then 1 else 0 end
Live test: http://www.sqlfiddle.com/#!3/65df2/17
There have already been several answers to this question. However the SQL statements all seem too complex. This can be accomplished with basic SQL, a way to enumerate rows, and some date arithmetic.
The key observation is that if you have a set of days and number them with a parallel sequence of integers, then subtracting each day's integer from the day gives the same date for all days in a consecutive run.
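A quick way to convince yourself of this is to try it on a handful of dates. The Python sketch below is just an illustration of the observation; the function name streaks_by_offset is made up and it assumes the days are already reduced to distinct dates.

```python
from datetime import date, timedelta
from itertools import groupby

def streaks_by_offset(days):
    """Return (start_date, num_days_in_seq) for each run of consecutive days."""
    days = sorted(set(days))
    # Subtract the 1-based row number from each date; all days of one
    # consecutive run end up with the same "group start" date.
    keyed = [(d - timedelta(days=i), d) for i, d in enumerate(days, start=1)]
    runs = []
    for _, grp in groupby(keyed, key=lambda kd: kd[0]):
        members = [d for _, d in grp]
        runs.append((members[0], len(members)))
    return runs

visits = [date(2012, 4, 28), date(2012, 4, 29), date(2012, 4, 30),
          date(2012, 5, 3), date(2012, 5, 4)]
runs = streaks_by_offset(visits)
```

Here the first three days all map to 2012-04-27 and the last two to 2012-04-29, so grouping on that value separates the two runs.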
The following query uses this observation to answer the original question:
select uid, min(d) as startdate, count(*) as numdaysinseq
from
(
select uid, d, adddate(d, interval -offset day) as groupstart
from
(
select uid, d, row_number() over (partition by uid order by d) as offset
from
(
SELECT DISTINCT uid, DATE(created_at) AS d
FROM visits
) t
) t
) t
group by uid, groupstart
Alas, MySQL before version 8.0 does not have the row_number() function. However, there is a workaround with variables (and most other databases do have this function).