Merging the two tables using sql / Spark

I have two data sets, shown below, and need to merge them based on date-range logic. Any suggestions? The driver table is A.
Table A
UID Start Date End Date A_Val
1 1980-01-01 00:00:00 1980-02-01 00:00:00 A
1 1980-02-02 00:00:00 1980-03-10 00:00:00 B
1 1980-03-11 00:00:00 1980-03-24 00:00:00 C
Table B
UID Start Date End Date B_Val
1 1980-01-10 00:00:00 1980-02-01 00:00:00 G
1 1980-02-02 00:00:00 1980-03-01 00:00:00 H
1 1980-03-02 00:00:00 1980-03-24 00:00:00 I
Result / output needed as below
UID Start Date End Date A_Val B_Val
1 1980-01-01 00:00:00 1980-01-09 00:00:00 A NULL
1 1980-01-10 00:00:00 1980-02-01 00:00:00 A G
1 1980-02-02 00:00:00 1980-03-01 00:00:00 B H
1 1980-03-02 00:00:00 1980-03-10 00:00:00 B I
1 1980-03-11 00:00:00 1980-03-24 00:00:00 C I

You can do it in several ways; here is one (the query below uses Oracle syntax):
find the minimum and maximum date of the whole set (subquery T),
create one row per day with a hierarchical query (subquery D),
left join the data from A and B,
assign group numbers to continuous periods that have the same A_VAL and B_VAL (subquery G),
group the data by the assigned group numbers.
SQLFiddle demo
with
T as (select min(start_date) sd, max(end_date) ed
from (select start_date, end_date from a union all
select start_date, end_date from b)),
D as (select sd + level - 1 dt from t connect by sd + level - 1 <= ed),
G as (select dt, a_val, b_val,
row_number() over (order by dt) -
row_number() over (partition by a_val, b_val order by dt) grp
from d
left join a on dt between a.start_date and a.end_date
left join b on dt between b.start_date and b.end_date)
select min(dt) sd, max(dt) ed, min(a_val) a_val, min(b_val) b_val
from g group by grp order by sd
Result:
SD ED A_VAL B_VAL
----------- ----------- ----- -----
1980-01-01 1980-01-09 A
1980-01-10 1980-02-01 A G
1980-02-02 1980-03-01 B H
1980-03-02 1980-03-10 B I
1980-03-11 1980-03-24 C I
If you are doing this for a single UID, filter the data first. If you are doing it for many UIDs, you have to include UID in the partitioning and grouping.
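For readers who want to trace the steps outside the database, here is a minimal Python sketch of the same idea: expand to one entry per day, look up the A and B values for each day, then collapse consecutive days that share the same pair. The function name `merge_ranges` and the tuple layout are illustrative assumptions, and the sketch handles a single UID only.

```python
from datetime import date, timedelta
from itertools import groupby


def merge_ranges(a_rows, b_rows):
    """a_rows / b_rows: lists of (start, end, value) with inclusive dates."""

    def value_on(rows, day):
        # Value of the range covering `day`, or None (the LEFT JOIN miss).
        for start, end, value in rows:
            if start <= day <= end:
                return value
        return None

    # Subquery T: overall minimum and maximum date.
    first = min(r[0] for r in a_rows + b_rows)
    last = max(r[1] for r in a_rows + b_rows)
    # Subquery D: one entry per day.
    days = [first + timedelta(days=n) for n in range((last - first).days + 1)]
    labelled = [(d, value_on(a_rows, d), value_on(b_rows, d)) for d in days]

    merged = []
    # Subquery G and the final GROUP BY: consecutive days with the same
    # (a_val, b_val) pair form one output period.
    for (a_val, b_val), run in groupby(labelled, key=lambda t: (t[1], t[2])):
        run = list(run)
        merged.append((run[0][0], run[-1][0], a_val, b_val))
    return merged


table_a = [(date(1980, 1, 1), date(1980, 2, 1), "A"),
           (date(1980, 2, 2), date(1980, 3, 10), "B"),
           (date(1980, 3, 11), date(1980, 3, 24), "C")]
table_b = [(date(1980, 1, 10), date(1980, 2, 1), "G"),
           (date(1980, 2, 2), date(1980, 3, 1), "H"),
           (date(1980, 3, 2), date(1980, 3, 24), "I")]
result = merge_ranges(table_a, table_b)
for row in result:
    print(row)
```

With the sample data this produces the same five periods as the SQL result. Expanding to days is simple but grows with the length of the date span, so it is best suited to day-granularity data over moderate ranges.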

Related

How to find the continuous range of dates in SQL Server?

How do we find the continuous ranges of dates in the following scenario?
Id modifiedDate StartDate EndDate
1 2019-01-01 2019-01-01 2019-12-31
1 2019-02-02 2019-02-01 2019-02-28
1 2019-02-27 2019-01-15 2019-03-15
1 2019-03-01 2019-03-01 2019-03-12
2 2019-01-01 2019-01-01 2019-03-01
2 2019-05-01 2019-05-01 2019-08-01
The output I want to show is:
Id StartDate EndDate
1 2019-01-01 2019-01-15
1 2019-01-15 2019-02-01
1 2019-02-01 2019-02-28
1 2019-02-28 2019-03-01
1 2019-03-01 2019-03-12
2 2019-01-01 2019-03-01
2 2019-05-01 2019-08-01
What I have tried so far is :
With X As(
Select a.StartDate,a.EndDate,b.StartDate,b.EndDate
From table a Full Join table b ON a.endDate>b.StartDate
Where a.StartDate<>b.StartDate and b.endDate<>a.Enddate
)
Select StartDate,Enddate,Min(StartDtae)
From X
Group By StartDate,EndDate
But I couldn't fill the gaps between the dates. How can I fix this?

You can try the following script, which I created with the help of a CTE and ROW_NUMBER(). It returns 2 additional rows compared with your sample output for the given input data. If your sample output is correct, you can ignore this solution.
CTEs are supported by SQL Server and Oracle, among others. For a database without CTE support, you can express the same logic with derived tables.
WITH CTE
AS
(
SELECT id,Date, ROW_NUMBER() OVER(PARTITION BY id ORDER BY Date) RN
FROM
(
-- UNION (not UNION ALL) removes duplicate dates before they are numbered
SELECT Id,StartDate Date FROM your_table
UNION
SELECT Id,EndDate FROM your_table
) A
)
SELECT A.Id, A.Date StartDate,B.Date EndDate
FROM CTE A
INNER JOIN CTE B ON A.Id = B.Id AND A.RN = B.RN - 1
Output is-
Id StartDate EndDate
1 2019-01-01 2019-01-15
1 2019-01-15 2019-02-01
1 2019-02-01 2019-02-28
1 2019-02-28 2019-03-01
1 2019-03-01 2019-03-12
1 2019-03-12 2019-03-15 -- Not exist in your expected output
1 2019-03-15 2019-12-31 -- Not exist in your expected output
Note: adding a filter as shown below will give you the exact output you posted. Decide for yourself which version best suits your requirement.
SELECT....
....
FROM CTE A
INNER JOIN CTE B ON A.Id = B.Id AND A.RN = B.RN - 1
WHERE B.DATE <= '2019-03-12'

The following query should give you the desired result:
WITH dates AS (SELECT StartDate
FROM TABLE
UNION
SELECT EndDate + 1
FROM TABLE)
SELECT StartDate
, (SELECT MIN(StartDate) - 1
FROM dates b
WHERE StartDate - 1 > a.StartDate) EndDate
FROM dates a

Just use lead() with union:
select t.id, t.dte as startdate,
lead(t.dte) over (partition by t.id order by t.dte) as enddate
from (select distinct t.id, v.dte
from t cross apply
(values (startdate), (enddate)) v(dte)
) t;
In addition to being concise, this probably has the best performance.
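The answers above share one core step: collect every distinct start and end date per Id, sort them, and pair each date with the next one. A minimal Python sketch of that step follows; the name `split_into_segments` is hypothetical, and unlike the `lead()` query it does not emit a final row with a NULL end date.

```python
from collections import defaultdict
from datetime import date


def split_into_segments(rows):
    """rows: iterable of (id, start_date, end_date)."""
    boundaries = defaultdict(set)
    for row_id, start, end in rows:
        # A set keeps each boundary date once, like SELECT DISTINCT.
        boundaries[row_id].update((start, end))

    segments = []
    for row_id in sorted(boundaries):
        points = sorted(boundaries[row_id])
        # zip pairs every date with its successor, like LEAD() or the
        # self-join on RN = RN - 1.
        segments.extend((row_id, s, e) for s, e in zip(points, points[1:]))
    return segments


sample = [
    (1, date(2019, 1, 1), date(2019, 12, 31)),
    (1, date(2019, 2, 1), date(2019, 2, 28)),
    (1, date(2019, 1, 15), date(2019, 3, 15)),
    (1, date(2019, 3, 1), date(2019, 3, 12)),
    (2, date(2019, 1, 1), date(2019, 3, 1)),
    (2, date(2019, 5, 1), date(2019, 8, 1)),
]
segments = split_into_segments(sample)
for seg in segments:
    print(seg)
```

For Id 1 this yields the same seven rows as the CTE answer. For Id 2 it also yields the segment 2019-03-01 to 2019-05-01, a gap that no input row covers; dropping such segments would need an extra coverage check against the input rows.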

Oracle SQL - Select users between two date by month

I am learning SQL and I was wondering how to select active users by month, depending on their starting and ending date (both timestamp(6)). My table looks like this:
Cust_Num | Start_Date | End_Date
1 | 2018-01-01 | 2019-01-01
2 | 2018-01-01 | NULL
3 | 2019-01-01 | 2019-06-01
4 | 2017-01-01 | 2019-03-01
So, counting the active users by month, I should have an output like:
As of. | Count
2018-06-01 | 3
...
2019-02-01 | 3
2019-07-01 | 1
So far, I do a manual operation by entering each month:
Select
201906,
count(distinct a.cust_num)
From
active_users a
Where
to_date('20190630','yyyymmdd') between a.start_date and nvl (a.end_date, '31-dec-9999')
union all
Select
201905,
count(distinct a.cust_num)
From
active_users a
Where
to_date('20190531','yyyymmdd') between a.start_date and nvl (a.end_date, '31-dec-9999')
union all
...
This is not very optimized or sustainable if I want to cover 10 years, i.e. 120 months.
Any help is welcome. Thanks a lot!

This query shows the active-user-count effective as-of the end of the month.
How it works:
Convert each input row (with its StartDate and EndDate values) into two rows, each representing a point in time at which the active-user count was incremented (on StartDate) or decremented (on EndDate). A NULL EndDate has to be converted to a far-off date value because SQL Server sorts NULL values before non-NULL values, not after them:
This makes your data look like this:
OnThisDate Change
2018-01-01 1
2019-01-01 -1
2018-01-01 1
9999-12-31 -1
2019-01-01 1
2019-06-01 -1
2017-01-01 1
2019-03-01 -1
Then we simply SUM OVER the Change values (after sorting) to get the active-user-count as of that specific date:
So first, sort by OnThisDate:
OnThisDate Change
2017-01-01 1
2018-01-01 1
2018-01-01 1
2019-01-01 1
2019-01-01 -1
2019-03-01 -1
2019-06-01 -1
9999-12-31 -1
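Then SUM OVER (shown here one row at a time; with the default RANGE window frame, rows that share the same date all show the total after that date, which does not change the end-of-month result):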
OnThisDate ActiveCount
2017-01-01 1
2018-01-01 2
2018-01-01 3
2019-01-01 4
2019-01-01 3
2019-03-01 2
2019-06-01 1
9999-12-31 0
Then we PARTITION (not group!) the rows by month and sort them by their date so we can identify the last ActiveCount row for that month (this actually happens in the WHERE of the outermost query, using ROW_NUMBER() and COUNT() for each month PARTITION):
OnThisDate ActiveCount IsLastInMonth
2017-01-01 1 1
2018-01-01 2 0
2018-01-01 3 1
2019-01-01 4 0
2019-01-01 3 1
2019-03-01 2 1
2019-06-01 1 1
9999-12-31 0 1
Then filter on IsLastInMonth = 1 (actually, on ROW_NUMBER() = COUNT(*) inside each PARTITION) to give us the final output data:
At-end-of-month Active-count
2017-01 1
2018-01 3
2019-01 3
2019-03 2
2019-06 1
9999-12 0
This does result in "gaps" in the result-set because the At-end-of-month column only shows rows where the Active-count value actually changed rather than including all possible calendar months - but that's ideal (as far as I'm concerned) because it excludes redundant data. Filling in the gaps can be done inside your application code by simply repeating output rows for each additional month until it reaches the next At-end-of-month value.
Here's the query using T-SQL on SQL Server (I don't have access to Oracle right now). And here's the SQLFiddle I used to come to a solution: http://sqlfiddle.com/#!18/ad68b7/24
SELECT
OtdYear,
OtdMonth,
ActiveCount
FROM
(
-- This query adds columns to indicate which row is the last-row-in-month ( where RowInMonth == RowsInMonth )
SELECT
OnThisDate,
OtdYear,
OtdMonth,
ROW_NUMBER() OVER ( PARTITION BY OtdYear, OtdMonth ORDER BY OnThisDate ) AS RowInMonth,
COUNT(*) OVER ( PARTITION BY OtdYear, OtdMonth ) AS RowsInMonth,
ActiveCount
FROM
(
SELECT
OnThisDate,
YEAR( OnThisDate ) AS OtdYear,
MONTH( OnThisDate ) AS OtdMonth,
SUM( [Change] ) OVER ( ORDER BY OnThisDate ASC ) AS ActiveCount
FROM
(
SELECT
StartDate AS [OnThisDate],
1 AS [Change]
FROM
tbl
UNION ALL
SELECT
ISNULL( EndDate, DATEFROMPARTS( 9999, 12, 31 ) ) AS [OnThisDate],
-1 AS [Change]
FROM
tbl
) AS sq1
) AS sq2
) AS sq3
WHERE
RowInMonth = RowsInMonth
ORDER BY
OtdYear,
OtdMonth
This query can be flattened into fewer nested queries by using aggregate and window functions directly instead of using aliases (like OtdYear, ActiveCount, etc) but that would make the query much harder to understand.
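The walk-through above (emit +1/-1 events, sort, keep a running sum, keep the last value per month) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `active_count_at_month_end` and `FAR_FUTURE` are made-up names, and the input is a list of (start, end) pairs with `None` for an open end date.

```python
from datetime import date

FAR_FUTURE = date(9999, 12, 31)  # stand-in for a NULL end date


def active_count_at_month_end(rows):
    """rows: iterable of (start_date, end_date_or_None)."""
    events = []
    for start, end in rows:
        events.append((start, +1))               # user becomes active
        events.append((end or FAR_FUTURE, -1))   # user stops being active
    events.sort()

    running = 0
    last_in_month = {}
    for day, change in events:
        running += change
        # Later events in the same month overwrite earlier ones, so each
        # month keeps the count after its last change.
        last_in_month[(day.year, day.month)] = running
    return last_in_month


users = [
    (date(2018, 1, 1), date(2019, 1, 1)),
    (date(2018, 1, 1), None),
    (date(2019, 1, 1), date(2019, 6, 1)),
    (date(2017, 1, 1), date(2019, 3, 1)),
]
result = active_count_at_month_end(users)
for (year, month), count in result.items():
    print(f"{year}-{month:02d}", count)
```

As in the SQL, only months in which the count changes appear in the result, so filling the gaps is left to the caller.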

I have created a query that returns a row for every month, from the minimum start date in the table to the maximum end date.
You can restrict the range by adding a condition to the WHERE clause.
-- table creation
CREATE TABLE ACTIVE_USERS (CUST_NUM NUMBER, START_DATE DATE, END_DATE DATE);
-- data creation
INSERT INTO ACTIVE_USERS
SELECT * FROM
(
SELECT 1, DATE '2018-01-01', DATE '2019-01-01' FROM DUAL UNION ALL
SELECT 2, DATE '2018-01-01', NULL FROM DUAL UNION ALL
SELECT 3, DATE '2019-01-01', DATE '2019-06-01' FROM DUAL UNION ALL
SELECT 4, DATE '2017-01-01', DATE '2019-03-01' FROM DUAL
);
-- data in the actual table
SELECT * FROM ACTIVE_USERS ORDER BY CUST_NUM;
CUST_NUM START_DATE END_DATE
---------- ---------- ----------
1 2018-01-01 2019-01-01
2 2018-01-01
3 2019-01-01 2019-06-01
4 2017-01-01 2019-03-01
Query to fetch desired result
WITH CTE ( START_DATE, END_DATE ) AS
(
SELECT
ADD_MONTHS( START_DATE, LEVEL - 1 ),
ADD_MONTHS( START_DATE, LEVEL ) - 1
FROM
(
SELECT
MIN( START_DATE ) AS START_DATE,
MAX( END_DATE ) AS END_DATE
FROM
ACTIVE_USERS
)
CONNECT BY LEVEL <= CEIL( MONTHS_BETWEEN( END_DATE, START_DATE ) ) + 1
)
--
--
SELECT
C.START_DATE,
COUNT(1) AS CNT
FROM
CTE C
JOIN ACTIVE_USERS D ON
(
C.END_DATE BETWEEN
D.START_DATE
AND
CASE
WHEN D.END_DATE IS NOT NULL THEN D.END_DATE
ELSE C.END_DATE
END
)
GROUP BY
C.START_DATE
ORDER BY
C.START_DATE;
-- output --
START_DATE CNT
---------- ----------
2017-01-01 1
2017-02-01 1
2017-03-01 1
2017-04-01 1
2017-05-01 1
2017-06-01 1
2017-07-01 1
2017-08-01 1
2017-09-01 1
2017-10-01 1
2017-11-01 1
2017-12-01 1
2018-01-01 3
2018-02-01 3
2018-03-01 3
2018-04-01 3
2018-05-01 3
2018-06-01 3
2018-07-01 3
2018-08-01 3
2018-09-01 3
2018-10-01 3
2018-11-01 3
2018-12-01 3
2019-01-01 3
2019-02-01 3
2019-03-01 2
2019-04-01 2
2019-05-01 2
2019-06-01 1
30 rows selected.
Cheers!!
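A rough Python equivalent of the month-by-month count may help when checking the numbers by hand. It generates each month from the earliest start date to the latest end date and counts the users whose range contains the last day of that month, treating a missing end date as still active. The names `first_of_next_month` and `active_per_month` are illustrative, not taken from the query.

```python
from datetime import date, timedelta


def first_of_next_month(d):
    """Return the first day of the month after the month containing d."""
    if d.month == 12:
        return date(d.year + 1, 1, 1)
    return date(d.year, d.month + 1, 1)


def active_per_month(users):
    """users: list of (cust_num, start_date, end_date_or_None)."""
    month = min(start for _, start, _ in users).replace(day=1)
    last = max(end for _, _, end in users if end is not None)
    counts = []
    while month <= last:
        month_end = first_of_next_month(month) - timedelta(days=1)
        active = sum(
            1 for _, start, end in users
            if start <= month_end and (end is None or month_end <= end)
        )
        if active:  # the inner join in the SQL drops months with no users
            counts.append((month, active))
        month = first_of_next_month(month)
    return counts


active_users = [
    (1, date(2018, 1, 1), date(2019, 1, 1)),
    (2, date(2018, 1, 1), None),
    (3, date(2019, 1, 1), date(2019, 6, 1)),
    (4, date(2017, 1, 1), date(2019, 3, 1)),
]
counts = active_per_month(active_users)
for month, n in counts:
    print(month, n)
```

Because the count is taken on the last day of each month, a user whose end date falls on the first of a month is not counted for that month, which matches the 2019-03 and 2019-06 rows in the output above.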

Full History Join

Currently I am trying to figure out a join between two historized tables, where I want to synchronize both timelines.
As an example, I have the following two tables:
A
ID Value FROM TO
1 5 01.01.2018 31.03.2018
1 6 31.03.2018 08.04.2018
B
ID A_FK Value FROM TO
1 1 50 01.02.2018 01.04.2018
2 1 51 04.04.2018 10.04.2018
As a baseline, I want to take the timeline of table A and join table B, including NULL values so that I know, for which times there is no fitting value.
The desired result should look like this:
C
Value_A Value_B FROM TO
5 NULL 01.01.2018 01.02.2018
5 50 01.02.2018 31.03.2018
6 50 31.03.2018 01.04.2018
6 NULL 01.04.2018 04.04.2018
6 51 04.04.2018 08.04.2018
Can you help me with this? I made a start, but I fail to align the histories correctly. Here is my attempt:
with a as (SELECT *
FROM (VALUES (1,5,'01.01.2018','31.03.2018')
, (1,6,'31.03.2018','08.04.2018')
) A (ID, VALUE, FROM, TO)),
b as (
SELECT *
FROM (VALUES (1,1,50,'01.02.2018','01.04.2018')
, (2,1,51,'04.04.2018','10.04.2018')
) A (ID,A_FK, VALUE, FROM, TO)
)
select
a.value as value_a,
b.value as value_b,
max(a.from,b.from) as from,
min(a.to,b.to) as to
from a
left outer join b on
a.id = b.a_fk and
a.from < b.to and
a.to > b.from;
As you can see, it aligns, but not the way I expected it to.
Thank you for your help.

As I suggested in the comments, you can solve your problem with the technique from my answer to another question.
Here is one solution.
The test data:
create table a (
id integer,
value integer,
dtfrom date,
dtto date
);
create table b(
id integer,
a_fk integer,
value integer,
dtfrom date,
dtto date
);
insert into a values
(1, 5, '2018-01-01', '2018-03-31'),
(1, 6, '2018-03-31', '2018-04-08');
insert into b values
(1, 1, 50, '2018-02-01', '2018-04-01'),
(2, 1, 51, '2018-04-04', '2018-04-10');
The tricky part of this solution is generating the date intervals that are not in either of your tables, such as 01.01.2018-01.02.2018 and 01.02.2018-31.03.2018. To do that, you need all available dates in one table, so I created a VIEW called timmings to make it easier:
create or replace view timmings as
select a.dtfrom dt from a inner join b on a.id=b.a_fk
union
select a.dtto from a inner join b on a.id=b.a_fk
union
select b.dtfrom from a inner join b on a.id=b.a_fk
union
select b.dtto from a inner join b on a.id=b.a_fk;
After that you need a query to generate all available periods (starts and ends) so it will be:
select t1.dt as start,
(select min(t2.dt)
from timmings t2
where t2.dt>t1.dt) as dend
from timmings t1
order by start;
This will result in (with your sample data):
start dend
01/01/2018 01/02/2018
01/02/2018 31/03/2018
31/03/2018 01/04/2018
01/04/2018 04/04/2018
04/04/2018 08/04/2018
08/04/2018 10/04/2018
10/04/2018 null
With that you can use it to get all available values from table a that intersects with the periods:
select a.id, a.value, tm.start, tm.dend
from (select t1.dt as start,
(select min(t2.dt)
from timmings t2
where t2.dt>t1.dt) as dend
from timmings t1) tm
left join a on tm.start >= a.dtfrom and tm.dend <= a.dtto
where a.id is not null
order by tm.start;
That results in:
id value start dend
1 5 01/01/2018 01/02/2018
1 5 01/02/2018 31/03/2018
1 6 31/03/2018 01/04/2018
1 6 01/04/2018 04/04/2018
1 6 04/04/2018 08/04/2018
And finally you LEFT JOIN it with b table:
select x.value as valueA,
b.value as valueB,
x.start as "from",
x.dend as "to"
from (select a.id, a.value, tm.start, tm.dend
from (select t1.dt as start,
(select min(t2.dt)
from timmings t2
where t2.dt>t1.dt) as dend
from timmings t1) tm
left join a on tm.start >= a.dtfrom and tm.dend <= a.dtto
where a.id is not null
) x
left join b on b.a_fk = x.id
and b.dtfrom <= x.start
and b.dtto >= x.dend
order by x.start;
Which will give you the result you want:
valueA valueB from to
5 null 01/01/2018 01/02/2018
5 50 01/02/2018 31/03/2018
6 50 31/03/2018 01/04/2018
6 null 01/04/2018 04/04/2018
6 51 04/04/2018 08/04/2018
See the final solution working here: http://sqlfiddle.com/#!9/36418e/1 It runs on MySQL there, but since it uses only standard SQL it should work in DB2 as well.
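The same three steps (collect all boundary dates, build the elementary periods between neighbouring dates, then look up the covering row in each table) can be sketched in Python as follows. The names `covering` and `history_join` are hypothetical, and the sketch assumes all rows belong to the same key, so the id / a_fk match is omitted.

```python
from datetime import date


def covering(rows, start, end):
    """Return the value of the row whose range contains [start, end], else None."""
    for value, dt_from, dt_to in rows:
        if dt_from <= start and end <= dt_to:
            return value
    return None


def history_join(a_rows, b_rows):
    """a_rows / b_rows: lists of (value, dt_from, dt_to)."""
    # The "timmings" view: every distinct boundary date from both tables.
    points = sorted({d for _, f, t in a_rows + b_rows for d in (f, t)})
    result = []
    # Neighbouring boundary dates form the elementary periods.
    for start, end in zip(points, points[1:]):
        a_val = covering(a_rows, start, end)
        if a_val is None:
            continue  # table A drives the result, so skip periods outside it
        result.append((a_val, covering(b_rows, start, end), start, end))
    return result


a = [(5, date(2018, 1, 1), date(2018, 3, 31)),
     (6, date(2018, 3, 31), date(2018, 4, 8))]
b = [(50, date(2018, 2, 1), date(2018, 4, 1)),
     (51, date(2018, 4, 4), date(2018, 4, 10))]
result = history_join(a, b)
for row in result:
    print(row)
```

The period 08.04.2018 to 10.04.2018 exists only in table B and is therefore dropped, just as the `where a.id is not null` filter does in the SQL.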

There is an excellent blog article about this: "Fun with Date Ranges" by John Maenpaa.
Secondly, if you have a chance to influence the DDL, I would recommend taking a closer look at Db2 temporal tables. They come with full SQL support (Time Travel SQL).

This is actually really simple if you have what's known as a Calendar table - a table with every date in it - although you can construct one on-the-fly if necessary. You can use it to turn this more obviously into a gaps-and-islands problem.
(You want one anyways, since they're one of the most useful analysis dimension tables):
SELECT valueA, valueB,
MIN(calendarDate) AS startDate,
MAX(calendarDate) + 1 DAY AS endDate
FROM (SELECT A.val AS valueA, B.val AS valueB, Calendar.calendarDate,
ROW_NUMBER() OVER(ORDER BY Calendar.calendarDate) -
ROW_NUMBER() OVER(PARTITION BY A.val, B.val ORDER BY Calendar.calendarDate) AS grouping
FROM Calendar
LEFT JOIN A
ON A.startDate <= Calendar.calendarDate
AND A.endDate > Calendar.calendarDate
LEFT JOIN B
ON B.startDate <= Calendar.calendarDate
AND B.endDate > Calendar.calendarDate
WHERE A.val IS NOT NULL
OR B.val IS NOT NULL) Groups
GROUP BY valueA, valueB, grouping
ORDER BY grouping
SQL Fiddle Example (Minor tweaks for SQL Server usage in example)
...which yields the following results. Note that there's a few extra days from the date range in table B that aren't present in table A!
valueA valueB startDate endDate
5 (null) 2018-01-01 2018-02-01
5 50 2018-02-01 2018-03-31
6 50 2018-03-31 2018-04-01
6 (null) 2018-04-01 2018-04-04
6 51 2018-04-04 2018-04-08
(null) 51 2018-04-08 2018-04-10
(This of course is trivially changeable by switching the join to A to a regular INNER JOIN, but I figured this and other cases would be important.)

How do I select start_date and corresponding end_date

The data in my Oracle table looks like the sample below: for each account number, the start_date is in one row and the end_date is in the next row. I want to align the start date and end date in the same row. I tried using the LEAD function, but it doesn't seem to work. I am using Oracle 11g. Can you please help me with this?
ACCT_NUM ACTV_TMST START_DATE END_DATE
1234 11/22/2006 2:12:13.928230 PM 11/22/2006 00:00:00 NULL
1234 11/28/2006 7:35:05.659595 AM NULL 11/28/2006
1234 12/22/2008 3:00:47.864811 PM 12/22/2008 00:00:00 NULL
1234 12/26/2008 3:34:28.776394 PM NULL 12/26/2008 00:00:00
1234 02/18/2016 9:22:35.746829 AM 02/18/2016 00:00:00 NULL
1234 02/23/2016 9:03:35.295622 AM NULL 02/23/2016 00:00:00
I need an output like
ACCT_NUM START_DATE END_DATE
1234 11/22/2006 00:00:00 11/28/2006 00:00:00
1234 12/22/2008 00:00:00 12/26/2008 00:00:00
1234 02/18/2016 00:00:00 02/23/2016 00:00:00
Thanks.

You can use Oracle's ROW_NUMBER analytic function:
SELECT s.acct_num,
max(s.start_date) as start_date,
max(s.end_date) as end_date
FROM(
SELECT t.acct_num,
t.start_date,
row_number() OVER(PARTITION BY t.acct_num ORDER BY t.start_date) as sd_rnk,
t.end_date,
row_number() OVER(PARTITION BY t.acct_num ORDER BY t.end_date) as ed_rnk
FROM YourTable t) s
GROUP BY acct_num,
CASE WHEN s.start_date is null then ed_rnk else sd_rnk end
This ranks each row: the first start_date gets 1, the second gets 2, and so on. The same goes for end_date. Oracle sorts NULLs last in ascending order, so the NULL rows do not disturb the ranks of the non-NULL ones.
Then you group by these results (acct_num plus the start_date rank or the end_date rank) and use an aggregate function to unite each pair into one row.
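To make the ranking idea concrete, here is a small Python sketch that pairs the n-th start date of an account with its n-th end date. The function name `pair_start_end` is an assumption for illustration; note that `zip` silently drops a start date that has no matching end date, whereas the SQL would return it with a NULL end date.

```python
from collections import defaultdict
from datetime import date


def pair_start_end(rows):
    """rows: iterable of (acct_num, start_date_or_None, end_date_or_None)."""
    starts, ends = defaultdict(list), defaultdict(list)
    for acct, start, end in rows:
        if start is not None:
            starts[acct].append(start)
        if end is not None:
            ends[acct].append(end)

    paired = []
    for acct in sorted(starts):
        # Sorting gives each date its rank; zip joins equal ranks.
        for s, e in zip(sorted(starts[acct]), sorted(ends[acct])):
            paired.append((acct, s, e))
    return paired


rows = [
    (1234, date(2006, 11, 22), None),
    (1234, None, date(2006, 11, 28)),
    (1234, date(2008, 12, 22), None),
    (1234, None, date(2008, 12, 26)),
    (1234, date(2016, 2, 18), None),
    (1234, None, date(2016, 2, 23)),
]
paired = pair_start_end(rows)
for row in paired:
    print(row)
```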

SQL issue - calculate max days sequence

There is a table with visits data:
uid (INT) | created_at (DATETIME)
I want to find how many days in a row a user has visited our app. So for instance:
SELECT DISTINCT DATE(created_at) AS d FROM visits WHERE uid = 123
will return:
d
------------
2012-04-28
2012-04-29
2012-04-30
2012-05-03
2012-05-04
There are 5 records and two intervals - 3 days (28 - 30 Apr) and 2 days (3 - 4 May).
My question is how to find the maximum number of days that a user has visited the app in a row (3 days in the example). Tried to find a suitable function in the SQL docs, but with no success. Am I missing something?
UPD:
Thank you guys for your answers! Actually, I'm working with the Vertica analytics database (http://vertica.com/). It is fairly rare and only a few people have experience with it, although it supports the SQL-99 standard.
Well, most of the solutions work with slight modifications. Finally I created my own version of the query:
-- returns starts of the visit series
SELECT t1.d as s FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', -1, t1.d))
WHERE t2.d is null GROUP BY t1.d
s
---------------------
2012-04-28 01:00:00
2012-05-03 01:00:00
-- returns ends of the visit series
SELECT t1.d as f FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', 1, t1.d))
WHERE t2.d is null GROUP BY t1.d
f
---------------------
2012-04-30 01:00:00
2012-05-04 01:00:00
So now all we need to do is join them somehow, for instance by row index.
SELECT s, f, DATEDIFF(day, s, f) + 1 as seq FROM (
SELECT t1.d as s, ROW_NUMBER() OVER () as o1 FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', -1, t1.d))
WHERE t2.d is null GROUP BY t1.d
) tbl1 LEFT JOIN (
SELECT t1.d as f, ROW_NUMBER() OVER () as o2 FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', 1, t1.d))
WHERE t2.d is null GROUP BY t1.d
) tbl2 ON o1 = o2
Sample output:
s | f | seq
---------------------+---------------------+-----
2012-04-28 01:00:00 | 2012-04-30 01:00:00 | 3
2012-05-03 01:00:00 | 2012-05-04 01:00:00 | 2

Another approach, the shortest, do a self-join:
with grouped_result as
(
select
sr.d,
sum((fr.d is null)::int) over(order by sr.d) as group_number
from tbl sr
left join tbl fr on sr.d = fr.d + interval '1 day'
)
select d, group_number, count(d) over m as consecutive_days
from grouped_result
window m as (partition by group_number)
Output:
d | group_number | consecutive_days
---------------------+--------------+------------------
2012-04-28 08:00:00 | 1 | 3
2012-04-29 08:00:00 | 1 | 3
2012-04-30 08:00:00 | 1 | 3
2012-05-03 08:00:00 | 2 | 2
2012-05-04 08:00:00 | 2 | 2
(5 rows)
Live test: http://www.sqlfiddle.com/#!1/93789/1
sr = second row, fr = first row (or perhaps previous row? ツ). Basically we are backtracking: this is a simulated LAG for databases that don't support LAG. (Postgres does support LAG, but that solution is very long, as window functions cannot be nested.) So this query uses a hybrid approach: simulate LAG via a join, then apply a SUM window over it, which produces the group number.
UPDATE
I forgot to include the final query. The query above illustrates the underpinnings of the group numbering; it needs to be morphed into this:
with grouped_result as
(
select
sr.d,
sum((fr.d is null)::int) over(order by sr.d) as group_number
from tbl sr
left join tbl fr on sr.d = fr.d + interval '1 day'
)
select min(d) as starting_date, max(d) as end_date, count(d) as consecutive_days
from grouped_result
group by group_number
-- order by consecutive_days desc limit 1
STARTING_DATE END_DATE CONSECUTIVE_DAYS
April, 28 2012 08:00:00-0700 April, 30 2012 08:00:00-0700 3
May, 03 2012 08:00:00-0700 May, 04 2012 08:00:00-0700 2
UPDATE
I know why my other solution that uses window functions became long: I was trying to illustrate the logic of group numbering and counting over the group. If I had cut to the chase as in my MySQL approach, the window-function version could have been shorter. Having said that, here's my old window-function approach, now improved:
with headers as
(
select
d,lag(d) over m is null or d - lag(d) over m <> interval '1 day' as header
from tbl
window m as (order by d)
)
,sequence_group as
(
select d, sum(header::int) over (order by d) as group_number
from headers
)
select min(d) as starting_date,max(d) as ending_date,count(d) as consecutive_days
from sequence_group
group by group_number
-- order by consecutive_days desc limit 1
Live test: http://www.sqlfiddle.com/#!1/93789/21

In MySQL you could do this:
SET @nextDate = CURRENT_DATE;
SET @RowNum = 1;
SELECT MAX(RowNumber) AS ConsecutiveVisits
FROM ( SELECT @RowNum := IF(@NextDate = Created_At, @RowNum + 1, 1) AS RowNumber,
Created_At,
@NextDate := DATE_ADD(Created_At, INTERVAL 1 DAY) AS NextDate
FROM Visits
ORDER BY Created_At
) Visits
Example here:
http://sqlfiddle.com/#!2/6e035/8
However I am not 100% certain this is the best way to do it.
In Postgresql:
;WITH RECURSIVE VisitsCTE AS
( SELECT Created_At, 1 AS ConsecutiveDays
FROM Visits
UNION ALL
SELECT v.Created_At, ConsecutiveDays + 1
FROM Visits v
INNER JOIN VisitsCTE cte
ON 1 + cte.Created_At = v.Created_At
)
SELECT MAX(ConsecutiveDays) AS ConsecutiveDays
FROM VisitsCTE
Example here:
http://sqlfiddle.com/#!1/16c90/9

I know Postgresql has something similar to common table expressions as available in MSSQL. I'm not that familiar with Postgresql, but the code below works for MSSQL and does what you want.
create table #tempdates (
mydate date
)
insert into #tempdates(mydate) values('2012-04-28')
insert into #tempdates(mydate) values('2012-04-29')
insert into #tempdates(mydate) values('2012-04-30')
insert into #tempdates(mydate) values('2012-05-03')
insert into #tempdates(mydate) values('2012-05-04');
with maxdays (s, e, c)
as
(
select mydate, mydate, 1
from #tempdates
union all
select m.s, mydate, m.c + 1
from #tempdates t
inner join maxdays m on DATEADD(day, -1, t.mydate)=m.e
)
select MIN(o.s),o.e,max(o.c)
from (
select m1.s,max(m1.e) e,max(m1.c) c
from maxdays m1
group by m1.s
) o
group by o.e
drop table #tempdates
And here's the SQL fiddle: http://sqlfiddle.com/#!3/42b38/2

All are very good answers, but I think I should contribute by showing another approach utilizing an analytical capability specific to Vertica (after all it is part of what you paid for). And I promise the final query is short.
First, query using conditional_true_event(). From Vertica's documentation:
Assigns an event window number to each row, starting from 0, and
increments the number by 1 when the result of the boolean argument
expression evaluates true.
The example query looks like this:
select uid, created_at,
conditional_true_event( created_at - lag(created_at) > '1 day' )
over (partition by uid order by created_at) as seq_id
from visits;
And output:
uid created_at seq_id
--- ------------------- ------
123 2012-04-28 00:00:00 0
123 2012-04-29 00:00:00 0
123 2012-04-30 00:00:00 0
123 2012-05-03 00:00:00 1
123 2012-05-04 00:00:00 1
123 2012-06-04 00:00:00 2
123 2012-06-04 00:00:00 2
Now the final query becomes easy:
select uid, seq_id, count(1) num_days, min(created_at) s, max(created_at) f
from
(
select uid, created_at,
conditional_true_event( created_at - lag(created_at) > '1 day' )
over (partition by uid order by created_at) as seq_id
from visits
) as seq
group by uid, seq_id;
Final Output:
uid seq_id num_days s f
--- ------ -------- ------------------- -------------------
123 0 3 2012-04-28 00:00:00 2012-04-30 00:00:00
123 1 2 2012-05-03 00:00:00 2012-05-04 00:00:00
123 2 2 2012-06-04 00:00:00 2012-06-04 00:00:00
One final note:
num_days is actually number of rows of the inner query. If there are two '2012-04-28' visits in the original table (i.e. duplicates), you might want to work around that.
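The behaviour quoted from the documentation (start at 0, add 1 whenever the condition is true) is easy to emulate, which also makes the duplicate problem from the final note visible. The following Python sketch is only an emulation of that description; `event_windows` and `distinct_days_per_window` are made-up names, and counting distinct calendar days is one possible workaround for duplicates.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def event_windows(timestamps, gap=timedelta(days=1)):
    """Assign a window number to each timestamp, starting from 0."""
    windows, seq_id, previous = [], 0, None
    for ts in sorted(timestamps):
        # A new window starts when the gap to the previous row exceeds `gap`.
        # The first row has no previous row, so it stays in window 0.
        if previous is not None and ts - previous > gap:
            seq_id += 1
        windows.append((ts, seq_id))
        previous = ts
    return windows


def distinct_days_per_window(windows):
    """Count distinct calendar days per window, ignoring duplicate visits."""
    days = defaultdict(set)
    for ts, seq_id in windows:
        days[seq_id].add(ts.date())
    return {seq_id: len(d) for seq_id, d in days.items()}


visits = [
    datetime(2012, 4, 28), datetime(2012, 4, 29), datetime(2012, 4, 30),
    datetime(2012, 5, 3), datetime(2012, 5, 4),
    datetime(2012, 6, 4), datetime(2012, 6, 4),  # duplicate visit
]
windows = event_windows(visits)
day_counts = distinct_days_per_window(windows)
print(day_counts)
```

Here window 2 contains two rows but only one distinct day, which is the case the final note warns about.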

The following should be Oracle friendly, and should not require recursive logic.
WITH
visit_dates (
visit_id,
date_id,
group_id
)
AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY TRUNC(created_at)),
TRUNC(SYSDATE) - TRUNC(created_at),
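-- date_id falls by 1 per day while the row number rises by 1, so their sum is constant within a run of consecutive days
TRUNC(SYSDATE) - TRUNC(created_at) + ROW_NUMBER() OVER (ORDER BY TRUNC(created_at))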
FROM
visits
GROUP BY
TRUNC(created_at)
)
,
group_duration (
group_id,
duration
)
AS
(
SELECT
group_id,
MAX(date_id) - MIN(date_id) + 1 AS duration
FROM
visit_dates
GROUP BY
group_id
)
SELECT
MAX(duration) AS max_duration
FROM
group_duration

Postgresql:
with headers as
(
select
d,
lag(d) over m is null or d - lag(d) over m <> interval '1 day' as header
from tbl
window m as (order by d)
)
,sequence_group as
(
select d, sum(header::int) over m as group_number
from headers
window m as (order by d)
)
,consecutive_list as
(
select d, group_number, count(d) over m as consecutive_count
from sequence_group
window m as (partition by group_number)
)
select * from consecutive_list
Divide-and-conquer approach: 3 steps
1st step, find headers:
with headers as
(
select
d,
lag(d) over m is null or d - lag(d) over m <> interval '1 day' as header
from tbl
window m as (order by d)
)
select * from headers
Output:
d | header
---------------------+--------
2012-04-28 08:00:00 | t
2012-04-29 08:00:00 | f
2012-04-30 08:00:00 | f
2012-05-03 08:00:00 | t
2012-05-04 08:00:00 | f
(5 rows)
2nd step, designate grouping:
with headers as
(
select
d,
lag(d) over m is null or d - lag(d) over m <> interval '1 day' as header
from tbl
window m as (order by d)
)
,sequence_group as
(
select d, sum(header::int) over m as group_number
from headers
window m as (order by d)
)
select * from sequence_group
Output:
d | group_number
---------------------+--------------
2012-04-28 08:00:00 | 1
2012-04-29 08:00:00 | 1
2012-04-30 08:00:00 | 1
2012-05-03 08:00:00 | 2
2012-05-04 08:00:00 | 2
(5 rows)
3rd step, count max days:
with headers as
(
select
d,
lag(d) over m is null or d - lag(d) over m <> interval '1 day' as header
from tbl
window m as (order by d)
)
,sequence_group as
(
select d, sum(header::int) over m as group_number
from headers
window m as (order by d)
)
,consecutive_list as
(
select d, group_number, count(d) over m as consecutive_count
from sequence_group
window m as (partition by group_number)
)
select * from consecutive_list
Output:
d | group_number | consecutive_count
---------------------+--------------+-----------------
2012-04-28 08:00:00 | 1 | 3
2012-04-29 08:00:00 | 1 | 3
2012-04-30 08:00:00 | 1 | 3
2012-05-03 08:00:00 | 2 | 2
2012-05-04 08:00:00 | 2 | 2
(5 rows)
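The three steps above (flag headers, take a running sum of the flags, count rows per group) translate almost line for line into Python. This is a sketch under the assumption that the input is a list of distinct dates; `group_numbers` is an illustrative name.

```python
from collections import Counter
from datetime import date, timedelta
from itertools import accumulate


def group_numbers(days):
    """Return (day, group_number) pairs for runs of consecutive days."""
    ordered = sorted(days)
    # Step 1: a header is the first row, or a row whose predecessor is not
    # exactly one day earlier (the lag() comparison).
    headers = [
        i == 0 or day - ordered[i - 1] != timedelta(days=1)
        for i, day in enumerate(ordered)
    ]
    # Step 2: the running sum of the header flags is the group number.
    groups = accumulate(int(h) for h in headers)
    return list(zip(ordered, groups))


visits = [date(2012, 4, 28), date(2012, 4, 29), date(2012, 4, 30),
          date(2012, 5, 3), date(2012, 5, 4)]
numbered = group_numbers(visits)
# Step 3: count the rows in each group.
consecutive = Counter(group for _, group in numbered)
print(numbered)
print(max(consecutive.values()))
```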

This is for MySQL. It is the shortest version and uses only one variable:
select
min(d) as starting_date, max(d) as ending_date,
count(d) as consecutive_days
from
(
select
sr.d,
IF(fr.d is null,@group_number := @group_number + 1,@group_number)
as group_number
from tbl sr
left join tbl fr on sr.d = adddate(fr.d,interval 1 day)
cross join (select @group_number := 0) as grp
) as x
group by group_number
Output:
STARTING_DATE ENDING_DATE CONSECUTIVE_DAYS
April, 28 2012 08:00:00-0700 April, 30 2012 08:00:00-0700 3
May, 03 2012 08:00:00-0700 May, 04 2012 08:00:00-0700 2
Live test: http://www.sqlfiddle.com/#!2/65169/1

For PostgreSQL 8.4 or later, there is a short and clean way with window functions and no JOIN.
I'd expect this to be the fastest solution posted so far:
WITH x AS (
SELECT created_at AS d
, lag(created_at) OVER (ORDER BY created_at) = (created_at - 1) AS nu
FROM visits
WHERE uid = 1
)
, y AS (
SELECT d, count(NULLIF(nu, TRUE)) OVER (ORDER BY d) AS seq
FROM x
)
SELECT count(*) AS max_days, min(d) AS seq_from, max(d) AS seq_to
FROM y
GROUP BY seq
ORDER BY 1 DESC
LIMIT 1;
Returns:
max_days | seq_from | seq_to
---------+------------+-----------
3 | 2012-04-28 | 2012-04-30
Assuming that created_at is a date and unique.
In CTE x: for every day our user visits, check if he was here yesterday, too.
To calculate "yesterday" just use created_at - 1. The first row is a special case and will produce NULL here.
In CTE y: calculate a running count of "days without yesterday so far" (seq) for every day. NULL values don't count, so count(NULLIF(nu, TRUE)) is the fastest and shortest way, also covering the special case.
Finally, group days per seq and count the days. While being at it I added first and last day of the sequence.
ORDER BY length of the sequence, and pick the longest one.

Upon seeing OP's query approach for their Vertica database, I tried making the two joins run at the same time.
These PostgreSQL and SQL Server query versions should both work in Vertica.
Postgresql version:
select
min(gr.d) as start_date,
max(gr.d) as end_date,
date_part('day', max(gr.d) - min(gr.d))+1 as consecutive_days
from
(
select
cr.d, (row_number() over() - 1) / 2 as pair_number
from tbl cr
left join tbl pr on pr.d = cr.d - interval '1 day'
left join tbl nr on nr.d = cr.d + interval '1 day'
where pr.d is null <> nr.d is null
) as gr
group by pair_number
order by start_date
Regarding pr.d is null <> nr.d is null: it is true when either the previous row or the next row is missing, but not both. This is an XOR operation. It removes the dates in the middle of a run (both neighbours exist) and the isolated dates (both neighbours are missing), so we are left with only the first and last date (header and footer) of each run of two or more consecutive days.
Since only headers and footers are left, we can now pair them via row_number:
(row_number() over() - 1) / 2 as pair_number
row_number() starts at 1, so we subtract 1 (we could also add 1 instead) and then divide by two using integer division; this gives each header and its footer the same pair number.
Live test: http://www.sqlfiddle.com/#!1/fc440/7
This is the Sql Server version:
select
min(gr.d) as start_date,
max(gr.d) as end_date,
datediff(day, min(gr.d),max(gr.d)) +1 as consecutive_days
from
(
select
cr.d, (row_number() over(order by cr.d) - 1) / 2 as pair_number
from tbl cr
left join tbl pr on pr.d = dateadd(day,-1,cr.d)
left join tbl nr on nr.d = dateadd(day,+1,cr.d)
where
case when pr.d is null then 1 else 0 end
<> case when nr.d is null then 1 else 0 end
) as gr
group by pair_number
order by start_date
Same logic as above, except for superficial differences in the date functions. Also, SQL Server requires an ORDER BY clause in its OVER, while PostgreSQL's OVER can be left empty.
SQL Server has no first-class boolean type; that's why we cannot compare booleans directly:
pr.d is null <> nr.d is null
We must do this in Sql Server:
case when pr.d is null then 1 else 0 end
<> case when nr.d is null then 1 else 0 end
Live test: http://www.sqlfiddle.com/#!3/65df2/17
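The header/footer filter and the pairing step can be checked with a short Python sketch. It assumes distinct dates and uses a set lookup in place of the two self-joins; `run_boundaries` is a hypothetical name. As in the SQL, a day with no neighbour on either side is filtered out entirely.

```python
from datetime import date, timedelta


def run_boundaries(days):
    """Return (start, end, length) for each run of two or more consecutive days."""
    present = set(days)
    one_day = timedelta(days=1)
    # XOR filter: keep a day when exactly one of its neighbours exists.
    edges = sorted(
        d for d in present
        if ((d - one_day) in present) != ((d + one_day) in present)
    )
    # The edges alternate header, footer, header, footer, so pair them
    # two at a time (the (row_number - 1) / 2 trick).
    return [
        (edges[i], edges[i + 1], (edges[i + 1] - edges[i]).days + 1)
        for i in range(0, len(edges) - 1, 2)
    ]


visits = [date(2012, 4, 28), date(2012, 4, 29), date(2012, 4, 30),
          date(2012, 5, 3), date(2012, 5, 4),
          date(2012, 5, 10)]  # isolated day, removed by the filter
runs = run_boundaries(visits)
for run in runs:
    print(run)
```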

There have already been several answers to this question. However the SQL statements all seem too complex. This can be accomplished with basic SQL, a way to enumerate rows, and some date arithmetic.
The key observation is that if you have a bunch of days and have a parallel sequence of integers, then the difference is a constant date when the days are in a sequence.
The following query uses this observation to answer the original question:
select uid, min(d) as startdate, count(*) as numdaysinseq
from
(
select uid, d, adddate(d, interval -offset day) as groupstart
from
(
select uid, d, row_number() over (partition by uid order by d) as offset
from
(
SELECT DISTINCT uid, DATE(created_at) AS d
FROM visits
) t
) t
) t
group by uid, groupstart
Alas, MySQL versions before 8.0 do not have the row_number() function. However, there is a workaround with variables (and most other databases do have this function).
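The key observation (date minus row number is constant within a run of consecutive days) is easy to verify in Python. The sketch below is a minimal illustration for a single user; `streaks` is an assumed name and the input is a list of visit dates.

```python
from datetime import date, timedelta
from itertools import groupby


def streaks(days):
    """Return (start_date, length) for each run of consecutive days."""
    ordered = sorted(set(days))  # SELECT DISTINCT ... ORDER BY d
    # Subtracting the row index from each date gives the same "group start"
    # value for every day in an unbroken run.
    keyed = [(d - timedelta(days=i), d) for i, d in enumerate(ordered)]
    result = []
    for _, run in groupby(keyed, key=lambda pair: pair[0]):
        run_days = [d for _, d in run]
        result.append((run_days[0], len(run_days)))
    return result


visits = [date(2012, 4, 28), date(2012, 4, 29), date(2012, 4, 30),
          date(2012, 5, 3), date(2012, 5, 4)]
found = streaks(visits)
longest = max(length for _, length in found)
print(found, longest)
```

Taking the maximum of the run lengths answers the original question about the longest sequence.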
