How do we group continuous dates into a single date span in SQL? - sql

Here is my data:
id  customercode  startdate  enddate
1   122           20200812   20200814
2   122           20200816   20200817
3   122           20200817   20200819
4   122           20200821   20200822
5   122           20200823   20200824
I tried the following code:
select Customercode, min(startdate) as startdate, max(enddate) as enddate
from (
select Customercode, startdate, enddate,
sum(rst) over (order by Customercode, startdate) as grp
from (
select Customercode, startdate, enddate,
case when coalesce(lag(enddate) over (partition by Customercode order by Customercode, startdate), startdate) + 1 <> startdate then 1 end rst
from tbl
) t1
) t2
group by grp, Customercode
order by startdate
My result
id  customercode  startdate  enddate
1   122           20200812   20200814
2   122           20200816   20200817
3   122           20200817   20200819
4   122           20200821   20200824
The desired output should be like this. Please share your thoughts.
id  customercode  startdate  enddate
1   122           20200812   20200814
2   122           20200816   20200819
3   122           20200821   20200824

It is unclear whether you want to group records whose start date is the same as the previous end date, or one day after it.
If you want to group on the same date, you would phrase the query as:
select customercode, min(startdate), max(enddate)
from (
select t.*,
sum(case when startdate = lag_enddate then 0 else 1 end)
over(partition by customercode order by startdate) as grp
from (
select t.*,
lag(enddate) over(partition by customercode order by startdate) as lag_enddate
from tbl t
) t
) t
group by customercode, grp
order by min(startdate)
You can also allow both cases at once by modifying the conditional window sum(). This requires a little date arithmetic, whose syntax varies across databases. In standard SQL:
sum(case when startdate <= lag_enddate + interval '1' day then 0 else 1 end)
over(partition by customercode order by startdate) as grp
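To check the combined condition against the sample data, here is a minimal sketch that runs the query in SQLite through Python's sqlite3 module. Several things here are assumptions rather than part of the answer: the dates are stored as ISO text (2020-08-12 rather than 20200812), SQLite's date(x, '+1 day') stands in for the standard-SQL interval arithmetic, the output aliases first_start and last_end are made up for the example, and window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table tbl (id integer, customercode integer, startdate text, enddate text)"
)
# Sample rows from the question, rewritten as ISO dates (an assumption).
conn.executemany(
    "insert into tbl values (?, ?, ?, ?)",
    [
        (1, 122, "2020-08-12", "2020-08-14"),
        (2, 122, "2020-08-16", "2020-08-17"),
        (3, 122, "2020-08-17", "2020-08-19"),
        (4, 122, "2020-08-21", "2020-08-22"),
        (5, 122, "2020-08-23", "2020-08-24"),
    ],
)

query = """
select customercode, min(startdate) as first_start, max(enddate) as last_end
from (
    select t.*,
           -- a row continues the current group when it starts on the previous
           -- end date or on the day after it; otherwise it opens a new group
           sum(case when startdate <= date(lag_enddate, '+1 day') then 0 else 1 end)
               over (partition by customercode order by startdate) as grp
    from (
        select t.*,
               lag(enddate) over (partition by customercode order by startdate) as lag_enddate
        from tbl t
    ) t
) t
group by customercode, grp
order by first_start
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With the sample data this collapses the five rows into the three spans listed in the desired output.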

Related

Finding correct date pair and eliminate the overlapped one in T-SQL

I have a start date in one column and an end date in another. I need to find the overlapping date ranges in the data and delete them. I already have code that finds overlapping rows; the start date and end date are passed in as parameters.
Code I am using to find overlaps:
Select * from #t
where
((cast(@StartDate as datetime2)>=StartDate and cast(@EndDate as datetime2)<=EndDate)
OR (StartDate>= cast(@StartDate as datetime2) and EndDate<= cast(@EndDate as datetime2))
OR (cast(@StartDate as datetime2)>=StartDate AND cast(@StartDate as datetime2)<=EndDate)
OR (cast(@EndDate as datetime2)>=StartDate AND cast(@EndDate as datetime2)<=EndDate))
The query above is fine for finding a simple overlap such as:
Id  Startdate   Enddate
1   01/01/2020  01/11/2020
2   01/01/2020  01/03/2021
In this case I will delete one row and keep the other.
But it fails on data like the example below. There, id 1 overlaps with 2, and 2 overlaps with both 1 and 3, so the query flags both 1 and 2 for deletion. In my case 1 and 3 should not be deleted; only 2 needs to be deleted, since 2 overlaps both of the others while 1 and 3 already form valid date periods.
For example
Id  Startdate   Enddate
1   01/01/2020  01/11/2020
2   01/01/2020  01/03/2021
3   02/11/2020  05/04/2022
In the example above we have three date ranges. Ids 1 and 3 cover correct intervals, and id 2 overlaps both of them. I need to find either the overlapping rows or the non-overlapping rows; either result is fine for me.
My Expected Result is
Id  Startdate   Enddate
2   01/01/2020  01/03/2021
Another example is
Id  Startdate   Enddate
1   01/01/2020  01/11/2020
2   02/11/2020  06/05/2022
3   02/11/2020  05/04/2022
Here ids 1 and 2 cover correct date periods, but id 3 overlaps with id 2. I want to find only that overlapping row; I don't need the other data.
Another example is
Id  Startdate   Enddate
3   02/11/2020  05/04/2022
I used the second set of data, but this should work for the first set as well. I do have a doubt about the expected output for your first record set, though. If you can clear that up, I can check again.
Create table OverlapData_1
(
id int
, Startdate date
, EndDate date
)
insert into OverlapData_1 values(1, '01/01/2020','01/11/2020')
insert into OverlapData_1 values(2, '01/01/2020','01/03/2021')
insert into OverlapData_1 values(3, '02/11/2020','05/04/2022')
SELECT A.[id]
,A.Startdate
,A.EndDate FROM
(
SELECT
CASE WHEN Startdate between LAG(StartDate) OVER ( order by id) and LAG(EndDate) OVER ( order by id) THEN 1 else 0 end as [status_1]
, CASE WHEN EndDate between LAG(StartDate) OVER ( order by id) and LAG(EndDate) OVER ( order by id) THEN 1 else 0 end as [status_2]
, CASE WHEN StartDate between LEAD(StartDate) OVER ( order by id) and LEAD(EndDate) OVER ( order by id) THEN 1 else 0 end as [status_3]
, CASE WHEN EndDate between LEAD(StartDate) OVER ( order by id) and LEAD(EndDate) OVER ( order by id) THEN 1 else 0 end as [status_4]
,*
FROM OverlapData_1
) AS A
WHERE (A.status_1 = 1 AND A.status_2 = 1)
OR (A.status_1 = 1 AND A.status_4 = 1)
Create table OverlapData_2
(
id int
, Startdate date
, EndDate date
)
insert into OverlapData_2 values(1, '01/01/2020','01/11/2020')
insert into OverlapData_2 values(2, '01/01/2020','06/05/2022')
insert into OverlapData_2 values(3, '02/11/2020','05/04/2022')
SELECT A.[id]
,A.Startdate
,A.EndDate FROM
(
SELECT
CASE WHEN Startdate between LAG(StartDate) OVER ( order by id) and LAG(EndDate) OVER ( order by id) THEN 1 else 0 end as [status_1]
, CASE WHEN EndDate between LAG(StartDate) OVER ( order by id) and LAG(EndDate) OVER ( order by id) THEN 1 else 0 end as [status_2]
, CASE WHEN StartDate between LEAD(StartDate) OVER ( order by id) and LEAD(EndDate) OVER ( order by id) THEN 1 else 0 end as [status_3]
, CASE WHEN EndDate between LEAD(StartDate) OVER ( order by id) and LEAD(EndDate) OVER ( order by id) THEN 1 else 0 end as [status_4]
,*
FROM OverlapData_2
) AS A
WHERE (A.status_1 = 1 AND A.status_2 = 1)
OR (A.status_1 = 1 AND A.status_4 = 1)
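The rule described in the question (keep a row when it starts after the last kept row ends, otherwise flag it) can also be sketched outside SQL. The following is not the LAG/LEAD query above but a plain greedy pass in Python, offered only as an illustration. It assumes the question's dates are day/month/year, that ties on the start date are broken by id, and the function name find_overlapping is made up for the example.

```python
from datetime import date


def find_overlapping(rows):
    """Return the ids of rows that overlap a row already kept.

    rows is a list of (id, start, end) tuples. Rows are visited in order of
    start date, then id, and a row is kept only if it starts after the end
    of the last kept row.
    """
    flagged = []
    last_kept_end = None
    for row_id, start, end in sorted(rows, key=lambda r: (r[1], r[0])):
        if last_kept_end is not None and start <= last_kept_end:
            flagged.append(row_id)  # overlaps the last kept period
        else:
            last_kept_end = end  # row is kept; it now bounds later rows
    return flagged


# First example from the question (dates read as day/month/year).
first_set = [
    (1, date(2020, 1, 1), date(2020, 11, 1)),
    (2, date(2020, 1, 1), date(2021, 3, 1)),
    (3, date(2020, 11, 2), date(2022, 4, 5)),
]
# Second example from the question.
second_set = [
    (1, date(2020, 1, 1), date(2020, 11, 1)),
    (2, date(2020, 11, 2), date(2022, 5, 6)),
    (3, date(2020, 11, 2), date(2022, 4, 5)),
]

print(find_overlapping(first_set))
print(find_overlapping(second_set))
```

Under those assumptions the first set flags id 2 and the second set flags id 3, which matches the expected results in the question. A different tie-breaking rule would pick a different row when two ranges share a start date.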

Minimum and maximum dates within continuous date range grouped by name

I have date ranges with a start and end date for each person, and I want to get only the continuous date ranges per person:
Input:
NAME | STARTDATE | END DATE
--------------------------------------
MIKE | **2019-05-15** | 2019-05-16
MIKE | 2019-05-17 | **2019-05-18**
MIKE | 2020-05-18 | 2020-05-19
Expected output like:
MIKE | **2019-05-15** | **2019-05-18**
MIKE | 2020-05-18 | 2020-05-19
So basically output is MIN and MAX for each continuous period for the person.
Appreciate any help.
I have tried the below query:
With N AS (
SELECT Name, StartDate, EndDate
, LastStop = MAX(EndDate)
OVER (PARTITION BY Name ORDER BY StartDate, EndDate
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
FROM Table
), B AS (
SELECT Name, StartDate, EndDate
, Block = SUM(CASE WHEN LastStop Is Null Then 1
WHEN LastStop < StartDate Then 1
ELSE 0
END)
OVER (PARTITION BY Name ORDER BY StartDate, LastStop)
FROM N
)
SELECT Name
, MIN(StartDate) DateFrom
, MAX(EndDate) DateTo
FROM B
GROUP BY Name, Block
ORDER BY Name, Block
But it's not considering the continuous period. It returns the same rows as the input.
This is a type of gap-and-islands problem. There is no need to expand the data out by day! That seems very inefficient.
Instead, determine the "islands". This is where there is no overlap -- in your case lag() is sufficient. Then a cumulative sum and aggregation:
select name, min(startdate), max(enddate)
from (select t.*,
sum(case when prev_enddate >= dateadd(day, -1, startdate) then 0 else 1 end) over
(partition by name order by startdate) as grp
from (select t.*,
lag(enddate) over (partition by name order by startdate) as prev_enddate
from t
) t
) t
group by name, grp;
Here is a db<>fiddle.
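To make the lag() plus cumulative-sum step easier to follow, here is a small sketch that prints the intermediate prev_enddate and grp values for the sample rows rather than the final aggregate. It runs in SQLite through Python's sqlite3 module, which is an assumption: date(startdate, '-1 day') replaces T-SQL's dateadd(day, -1, startdate), and window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (name text, startdate text, enddate text)")
conn.executemany(
    "insert into t values (?, ?, ?)",
    [
        ("MIKE", "2019-05-15", "2019-05-16"),
        ("MIKE", "2019-05-17", "2019-05-18"),
        ("MIKE", "2020-05-18", "2020-05-19"),
    ],
)

# Inner query only: each row with the previous end date and its island number.
query = """
select name, startdate, enddate, prev_enddate,
       sum(case when prev_enddate >= date(startdate, '-1 day') then 0 else 1 end)
           over (partition by name order by startdate) as grp
from (
    select t.*,
           lag(enddate) over (partition by name order by startdate) as prev_enddate
    from t
) t
order by name, startdate
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The first two rows share grp 1 because the second starts the day after the first ends; the third row opens grp 2. Grouping by name and grp and taking min(startdate) and max(enddate) then yields the two expected ranges.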
Here is an example using an ad-hoc tally table
Example or dbFiddle
;with cte as (
Select A.[Name]
,B.D
,Grp = datediff(day,'1900-01-01',D) - dense_rank() over (partition by [Name] Order by D)
From YourTable A
Cross Apply (
Select Top (DateDiff(DAY,StartDate,EndDate)+1) D=DateAdd(DAY,-1+Row_Number() Over (Order By (Select Null)),StartDate)
From master..spt_values n1,master..spt_values n2
) B
)
Select [Name]
,StartDate= min(D)
,EndDate = max(D)
From cte
Group By [Name],Grp
Returns
Name StartDate EndDate
MIKE 2019-05-15 2019-05-18
MIKE 2020-05-18 2020-05-19
Just to help with the Visualization, the CTE generates the following
This will give you the same result
SELECT subquery.name,min(subquery.startdate),max(subquery.enddate1)
FROM (SELECT NAME,startdate,
CASE WHEN EXISTS(SELECT yt1.startdate
FROM t yt1
WHERE yt1.startdate = DATEADD(day, 1, yt2.enddate)
) THEN null else yt2.enddate END as enddate1
FROM t yt2) as subquery
GROUP by NAME, CAST(MONTH(subquery.startdate) AS VARCHAR(2)) + '-' + CAST(YEAR(subquery.startdate) AS VARCHAR(4))
For the CASE WHEN EXISTS I referred to SQL CASE
For the group by month and year you can see this GROUP BY MONTH AND YEAR
DB_FIDDLE

SQL - unique users who are visiting for the first time

Given the following table visitorLog, write a SQL query to find the following by date.
Total_Visitors
VisitorGain - compare to previous day
VisitorLoss - compare to previous day
Total_New_Visitors - unique users who are visiting for the first time
visitorLog :
*----------------------*
| Date Visitor |
*----------------------*
| 01-Jan-2011 V1 |
| 01-Jan-2011 V2 |
| 01-Jan-2011 V3 |
| 02-Jan-2011 V2 |
| 03-Jan-2011 V2 |
| 03-Jan-2011 V4 |
| 03-Jan-2011 V5 |
*----------------------*
Expected output:
*---------------------------------------------------------------------*
| Date Total_Visitors VisitorGain VisitorLoss Total_New_Visitors |
*---------------------------------------------------------------------*
| 01-Jan-2011 3 3 0 3 |
| 02-Jan-2011 1 0 2 0 |
| 03-Jan-2011 3 2 0 2 |
*---------------------------------------------------------------------*
Here is my SQL and SQL fiddle.
with cte as
(
select
date,
total_visitors,
lag(total_visitors) over (order by date) as prev_visitors,
row_number() over (order by date ) as rnk
from
(
select
*,
count(visitor) over (partition by date) as total_visitors
from visitorLog
) val
group by
date,
total_visitors
),
cte2 as
(
select
date,
sum(case when rnk = 1 then 1 else 0 end) as total_new_visitors
from
(
select
date,
visitor,
row_number() over (partition BY visitor order by date) as rnk
from visitorLog
) t
group by
date
)
select
c.date,
sum(total_visitors) as total_visitors,
sum(
case
when rnk = 1 then total_visitors
when (rnk > 1 and prev_visitors < total_visitors) then (total_visitors - prev_visitors)
else
0
end
)visitorGain,
sum(
case
when rnk = 1 then 0
when prev_visitors > total_visitors then (prev_visitors - total_visitors)
else
0
end
) as visitorLoss,
sum(total_new_visitors) as total_new_visitors
from cte c
join cte2 c2
on c.date = c2.date
group by
c.date
order by
c.date
My solution is working as expected, but I am wondering whether I am missing any edge cases that may break my logic. Any help would be great.
This logic does what you want:
select date, count(*) as num_visitor,
greatest(count(*) - lag(count(*)::int, 1, 0) over (order by date), 0) as visitor_gain,
greatest(lag(count(*)::int, 1, 0) over (order by date) - count(*), 0) as visitor_loss,
count(*) filter (where seqnum = 1) as num_new_visitors
from (select vl.*,
row_number() over (partition by visitor order by date) as seqnum
from visitorLog vl
) vl
group by date
order by date
Here is a db<>fiddle.
I would use window functions and aggregation:
select
date,
count(*) no_visitor,
count(*) - lag(count(*), 1, 0) over(order by date) no_visitor_diff,
count(*) filter(where rn = 1) no_new_visitors
from (
select t.*, row_number() over(partition by visitor order by date) rn
from visitorLog t
) t
group by date
order by date
The subquery ranks the visits of each customer using row_number() (the first visit of each customer gets row number 1). Then, the outer query aggregates by date, and uses lag() to get the visitor count of the "previous" day.
I don't really see the point to have two distinct columns for the difference of visitors compared to the last day, so this gives you a single column, with a value that's either positive or negative depending whether customers were gained or lost.
If you really want two columns, then:
greatest(count(*) - lag(count(*), 1, 0) over(order by date), 0) visitor_gain,
- least(count(*) - lag(count(*), 1, 0) over(order by date), 0) visitor_loss
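As a quick check of the gain/loss logic against the sample log, here is a sketch that runs in SQLite through Python's sqlite3 module. The adaptations are assumptions made for SQLite, not part of either answer: the per-day counts are computed in a CTE named daily before lag() is applied, the two-argument scalar max() stands in for greatest(), a sum(case ...) stands in for count(*) filter (...), and the dates are stored as ISO text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table visitorLog (date text, visitor text)")
conn.executemany(
    "insert into visitorLog values (?, ?)",
    [
        ("2011-01-01", "V1"), ("2011-01-01", "V2"), ("2011-01-01", "V3"),
        ("2011-01-02", "V2"),
        ("2011-01-03", "V2"), ("2011-01-03", "V4"), ("2011-01-03", "V5"),
    ],
)

query = """
with daily as (
    select date,
           count(*) as total_visitors,
           -- seqnum = 1 marks a visitor's first ever visit
           sum(case when seqnum = 1 then 1 else 0 end) as total_new_visitors
    from (
        select vl.*,
               row_number() over (partition by visitor order by date) as seqnum
        from visitorLog vl
    ) vl
    group by date
)
select date,
       total_visitors,
       max(total_visitors - lag(total_visitors, 1, 0) over (order by date), 0) as visitor_gain,
       max(lag(total_visitors, 1, 0) over (order by date) - total_visitors, 0) as visitor_loss,
       total_new_visitors
from daily
order by date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The three rows returned match the expected output table in the question. Note that lag() compares against the previous row in the log, so a calendar day with no visits at all is skipped rather than counted as a loss.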

if the records are identical, display other information into new columns

I want to show the second row's data (start date, end date, and association) in new columns on a single row when the records match.
Name Id Start date End date Association
XYZ 100 1/1/2017 1/1/2022 Marketing
XYZ 100 5/1/2018 1/1/2028 Business
Result:
Name Id Start date End date Association Start date1 End date1 Association1
XYZ 100 1/1/2017 1/1/2022 Marketing 5/1/2018 1/1/2028 Business
This should solve your problem:
select Id,
name,
max(case when rn = 1 then StartDate end) StartDate,
max(case when rn = 1 then EndDate end) EndDate,
max(case when rn = 1 then Association end) Association,
max(case when rn = 2 then StartDate end) StartDate1,
max(case when rn = 2 then EndDate end) EndDate1,
max(case when rn = 2 then Association end) Association1
from
(
select id, name, StartDate, EndDate, Association,
row_number() over(partition by Id order by StartDate) rn
from Business
) src
group by id, name;
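A runnable sketch of this row_number() pivot, using SQLite through Python's sqlite3 module. Two things are assumptions: the dates are read as month/day/year and stored as ISO text, and the rows are numbered by StartDate so that the earlier record always lands in the first set of columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table Business (Name text, Id integer, StartDate text, EndDate text, Association text)"
)
conn.executemany(
    "insert into Business values (?, ?, ?, ?, ?)",
    [
        ("XYZ", 100, "2017-01-01", "2022-01-01", "Marketing"),
        ("XYZ", 100, "2018-05-01", "2028-01-01", "Business"),
    ],
)

query = """
select Id,
       Name,
       max(case when rn = 1 then StartDate end) as StartDate,
       max(case when rn = 1 then EndDate end) as EndDate,
       max(case when rn = 1 then Association end) as Association,
       max(case when rn = 2 then StartDate end) as StartDate1,
       max(case when rn = 2 then EndDate end) as EndDate1,
       max(case when rn = 2 then Association end) as Association1
from (
    -- number the rows of each Id so each one can be routed to its own columns
    select Id, Name, StartDate, EndDate, Association,
           row_number() over (partition by Id order by StartDate) as rn
    from Business
) src
group by Id, Name
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

This handles at most two rows per Id; a third matching row would need another set of rn = 3 columns.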

grouping by date range in t-sql

I'm trying to do a query on this table:
Id startdate enddate amount
1 2013-01-01 2013-01-31 0.00
2 2013-02-01 2013-02-28 0.00
3 2013-03-01 2013-03-31 245
4 2013-04-01 2013-04-30 529
5 2013-05-01 2013-05-31 0.00
6 2013-06-01 2013-06-30 383
7 2013-07-01 2013-07-31 0.00
8 2013-08-01 2013-08-31 0.00
I want to get the output:
2013-01-01 2013-02-28 0
2013-03-01 2013-06-30 1157
2013-07-01 2013-08-31 0
I wanted to get that result so I would know when money started to come in and when it stopped. I am also interested in the number of months before money started coming in (which explains the first row), and the number of months where money has stopped (which explains why I'm also interested in the 3rd row for July 2013 to Aug 2013).
I know I can use min and max on the dates and sum on amount but I can't figure out how to get the records divided that way.
Thanks!
with CT as
(
select t1.*,
( select max(endDate)
from t
where startDate<t1.StartDate and SIGN(amount)<>SIGN(t1.Amount)
) as GroupDate
from t as t1
)
select min(StartDate) as StartDate,
max(EndDate) as EndDate,
sum(Amount) as Amount
from CT
group by GroupDate
order by StartDate
SQLFiddle demo
Here's one idea (and a fiddle to go with it):
;WITH MoneyComingIn AS
(
SELECT MIN(startdate) AS startdate, MAX(enddate) AS enddate,
SUM(amount) AS amount
FROM myTable
WHERE amount > 0
)
SELECT MIN(startdate) AS startdate, MAX(enddate) AS enddate,
SUM(amount) AS amount
FROM myTable
WHERE enddate < (SELECT startdate FROM MoneyComingIn)
UNION ALL
SELECT startdate, enddate, amount
FROM MoneyComingIn
UNION ALL
SELECT MIN(startdate) AS startdate, MAX(enddate) AS enddate,
SUM(amount) AS amount
FROM myTable
WHERE startdate > (SELECT enddate FROM MoneyComingIn)
And a second, without using UNION (fiddle):
SELECT MIN(startdate), MAX(enddate), SUM(amount)
FROM
(
SELECT startdate, enddate, amount,
CASE
WHEN EXISTS(SELECT 1
FROM myTable b
WHERE b.id>=a.id AND b.amount > 0) THEN
CASE WHEN EXISTS(SELECT 1
FROM myTable b
WHERE b.id<=a.id AND b.amount > 0)
THEN 2
ELSE 1
END
ELSE 3
END AS partition_no
FROM myTable a
) x
GROUP BY partition_no
Although, as written, it assumes the Id values are in order. You could substitute ROW_NUMBER() OVER(ORDER BY startdate) for Id.
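The same three-way split can be sketched without relying on Id at all, by comparing each row's startdate with the first and last month that has a positive amount. This is a variation on the query above, not the answer's own code, and it assumes SQLite (run through Python's sqlite3 module) with ISO text dates and the table name myTable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table myTable (Id integer, startdate text, enddate text, amount real)"
)
conn.executemany(
    "insert into myTable values (?, ?, ?, ?)",
    [
        (1, "2013-01-01", "2013-01-31", 0),
        (2, "2013-02-01", "2013-02-28", 0),
        (3, "2013-03-01", "2013-03-31", 245),
        (4, "2013-04-01", "2013-04-30", 529),
        (5, "2013-05-01", "2013-05-31", 0),
        (6, "2013-06-01", "2013-06-30", 383),
        (7, "2013-07-01", "2013-07-31", 0),
        (8, "2013-08-01", "2013-08-31", 0),
    ],
)

query = """
select min(startdate) as startdate, max(enddate) as enddate, sum(amount) as amount
from (
    select a.*,
           case
               -- 1 = before money started, 3 = after it stopped, 2 = in between
               when a.startdate < (select min(startdate) from myTable where amount > 0) then 1
               when a.startdate > (select max(startdate) from myTable where amount > 0) then 3
               else 2
           end as partition_no
    from myTable a
) x
group by partition_no
order by partition_no
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

For the sample data this returns the three periods from the question, with the zero-amount month of May folded into the middle period.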
Something like this should do it:
select min(startdate), max(enddate), sum(amount) from paiements
where enddate < (select min(startdate) from paiements where amount >0)
union
select min(startdate), max(enddate), sum(amount) from paiements
where startdate >= (select min(startdate) from paiements where amount >0)
and enddate <= (select max(enddate) from paiements where amount >0)
union
select min(startdate), max(enddate), sum(amount) from paiements
where startdate > (select max(enddate) from paiements where amount >0)
But for this kind of reporting, it is probably clearer to use multiple queries.
This does what you want:
-- determine the three periods
DECLARE @StartMoneyIn INT
DECLARE @EndMoneyIn INT
SELECT @StartMoneyIn = MIN(Id)
FROM [Amounts]
WHERE amount > 0
SELECT @EndMoneyIn = MAX(Id)
FROM [Amounts]
WHERE amount > 0
-- retrieve the amounts
SELECT MIN(startdate) AS startdate, MAX(enddate) AS enddate, SUM(amount) AS amount
FROM [Amounts]
WHERE Id < @StartMoneyIn
UNION
SELECT MIN(startdate), MAX(enddate), SUM(amount)
FROM [Amounts]
WHERE Id >= @StartMoneyIn AND Id <= @EndMoneyIn
UNION
SELECT MIN(startdate), MAX(enddate), SUM(amount)
FROM [Amounts]
WHERE Id > @EndMoneyIn
If all you want to do is to see when money started coming in and when it stopped, this might work for you:
select
min(startdate),
max(enddate),
sum(amount)
from MoneyTable
where
amount > 0
This would not include the periods where there was no money coming in though.
If you don't care about the total in the period, but only want the records where you go from 0 to something and vice versa, you could do something crazy like this:
select *
from MoneyTable mt
where exists ( select *
from MoneyTable mtTemp
where mtTemp.enddate = dateadd(day, -1, mt.startDate)
and mtTemp.amount <> mt.amount
and mtTemp.amount * mt.amount = 0)
Or if you must include the first record:
select *
from MoneyTable mt
where exists ( select *
from MoneyTable mtTemp
where mtTemp.enddate = dateadd(day, -1, mt.startDate)
and mtTemp.amount <> mt.amount
and mtTemp.amount * mt.amount = 0 )
or not exists ( select *
from MoneyTable mtTemp
where mtTemp.enddate = dateadd(day, -1, mt.startDate))
Sql Fiddle