SQL - Group rows by contiguous date

I have a table:
Value Date
100 01/01/2000
110 01/05/2002
100 01/10/2003
100 01/12/2004
I want to group the data in this way:
Value StartDate EndDate
100 01/01/2000 30/04/2002
110 01/05/2002 30/09/2003
100 01/10/2003 NULL --> or value like '01/01/2099'
How can I accomplish this?
Can a CTE be useful and how?

For RDBMSs that support window functions (example on an MS SQL Server database):
with Test(value, dt) as(
select 100, cast('2000-01-01' as date) union all
select 110, cast('2002-05-01' as date) union all
select 100, cast('2003-10-01' as date) union all
select 100, cast('2004-12-01' as date)
)
select max(value) value, min(dt) startDate, max(end_dt) endDate
from (
select a.*, sum(brk) over(order by dt) grp
from (
select t.*,
case when value!=lag(value) over(order by dt) then 1 else 0 end brk,
DATEADD(DAY,-1,lead(dt,1,cast('2099-01-02' as date)) over(order by dt)) end_dt
from Test t
) a
) b
group by grp
order by startDate
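To see why this works, here is a hand-traced sketch of the intermediate columns computed by the inner subqueries for the sample data (worked by hand, not actual query output):
value  dt          brk  grp  end_dt
100    2000-01-01  0    0    2002-04-30
110    2002-05-01  1    1    2003-09-30
100    2003-10-01  1    2    2004-11-30
100    2004-12-01  0    2    2099-01-01
Grouping by grp then returns min(dt) as the start and max(end_dt) as the end of each run of equal values.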

I think the difference of row numbers is simpler in this case:
select value, min(date) as startDate,
dateadd(day, -1, lead(min(date)) over (order by min(date))) as endDate
from (select t.*,
row_number() over (order by date) as seqnum,
row_number() over (partition by value order by date) as seqnum_v
from t
) t
group by value, (seqnum - seqnum_v);
The difference of the row numbers defines the groups you want. This is a bit hard to see at first, but if you stare at the results of the subquery, you'll see how it works.
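To illustrate, here is what the subquery produces for the sample rows (hand-traced, not actual output):
value  date        seqnum  seqnum_v  seqnum - seqnum_v
100    2000-01-01  1       1         0
110    2002-05-01  2       1         1
100    2003-10-01  3       2         1
100    2004-12-01  4       3         1
Grouping by value and the difference gives three groups: the first 100 row on its own, the 110 row, and the last two 100 rows together.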

Related

Gap and Island problem - query not working for all periods

I have to create a query to find the gaps and islands between dates. This seems to be a standard gaps-and-islands problem. To show my issue I will use a sample of data. The queries are executed in Snowflake.
CREATE TABLE TEST (StartDate date, EndDate date);
INSERT INTO TEST
SELECT '8/20/2017', '8/21/2017' UNION ALL
SELECT '8/22/2017', '9/22/2017' UNION ALL
SELECT '8/23/2017', '9/23/2017' UNION ALL
SELECT '8/24/2017', '8/26/2017' UNION ALL
SELECT '8/28/2017', '9/19/2017' UNION ALL
SELECT '9/23/2017', '9/27/2017' UNION ALL
SELECT '9/25/2017', '10/10/2017' UNION ALL
SELECT '10/17/2017','10/18/2017' UNION ALL
SELECT '10/25/2017','11/3/2017' UNION ALL
SELECT '11/3/2017', '11/15/2017';
This code gives me a sample table.
Then I have the code to find gaps and islands:
SELECT
MIN(StartDate) AS IslandStartDate,
MAX(EndDate) AS IslandEndDate
FROM
(
SELECT
*,
CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,
SUM(CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId
FROM
(
SELECT
ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
StartDate,
EndDate,
LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
FROM
TEST
) Groups
) Islands
GROUP BY
IslandId
ORDER BY
IslandStartDate
The results contain 6 islands. As you can see, the problem is the period 8/28/2017 - 9/19/2017.
This period should not be a separate island, because it is contained within the period 8/23/2017 - 9/23/2017.
Do you have any idea how I can modify my query to get the correct results (so instead of 6 I should have 5 islands, as 8/28/2017 - 9/19/2017 should not be an island)? This is just example data, so I am looking for a universal solution, but so far I have not figured out the correct approach.
You can express the gaps-and-islands logic like this:
select min(startdate), max(enddate)
from (select t.*,
sum(case when prev_enddate >= startdate then 0 else 1 end) over (order by startdate) as grp
from (select t.*,
max(enddate) over (order by startdate rows between unbounded preceding and 1 preceding) as prev_enddate
from test t
) t
) t
group by grp
order by min(startdate);
Here is a db<>fiddle.
The idea is to look for the maximum enddate on all the "earlier" rows. This value is used to check if there is an overlap.
So, the innermost subquery calculates the previous enddate. The middle subquery does a cumulative sum of the beginnings of groups to assign a group identifier.
The outer query just aggregates by the group identifier.
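To make the logic concrete, here is a hand-traced sketch of the two subqueries for the sample data (not actual query output):
startdate   enddate     prev_enddate  grp
2017-08-20  2017-08-21  NULL          1
2017-08-22  2017-09-22  2017-08-21    2
2017-08-23  2017-09-23  2017-09-22    2
2017-08-24  2017-08-26  2017-09-23    2
2017-08-28  2017-09-19  2017-09-23    2
2017-09-23  2017-09-27  2017-09-23    2
2017-09-25  2017-10-10  2017-09-27    2
2017-10-17  2017-10-18  2017-10-10    3
2017-10-25  2017-11-03  2017-10-18    4
2017-11-03  2017-11-15  2017-11-03    4
The 8/28/2017 - 9/19/2017 row stays in group 2 because the running maximum enddate (9/23/2017) is already past its start, so it is absorbed into the island that begins on 8/22/2017.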
You could remove the overlapping records from the original set:
SELECT MinStart as StartDate, MaxEnd as EndDate
FROM Test data
CROSS APPLY (SELECT MIN(StartDate) MinStart, MAX(EndDate) MaxEnd FROM TEST lkp WHERE lkp.StartDate < data.EndDate AND lkp.EndDate > data.StartDate) bounds
GROUP BY MinStart, MaxEnd
StartDate   EndDate
2017-08-20  2017-08-21
2017-08-22  2017-09-23
2017-08-23  2017-10-10
2017-10-17  2017-10-18
2017-10-25  2017-11-03
2017-11-03  2017-11-15
In this current result set, no additional duplications have occurred, but in a larger recordset there would be more potential for a much larger range of contiguous records, meaning you may need to apply this lookup recursively.
Putting that together you get:
SELECT
MIN(StartDate) AS IslandStartDate,
MAX(EndDate) AS IslandEndDate
FROM
(
SELECT
*,
CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,
SUM(CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId
FROM
(
SELECT
ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
StartDate,
EndDate,
LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
FROM
(
SELECT MinStart as StartDate, MaxEnd as EndDate
FROM Test data
CROSS APPLY (SELECT MIN(StartDate) MinStart, MAX(EndDate) MaxEnd FROM TEST lkp WHERE lkp.StartDate < data.EndDate AND lkp.EndDate > data.StartDate) bounds
GROUP BY MinStart, MaxEnd
) Normalized
) Groups
) Islands
GROUP BY
IslandId
ORDER BY
IslandStartDate
This results in 4 islands, not the 5 you were originally expecting, because your 2nd and 3rd input lines together with the 6th and 7th lines create an island that spans 8/22 - 10/10:
SELECT '8/22/2017', '9/22/2017' UNION ALL
SELECT '8/23/2017', '9/23/2017' UNION ALL
...
SELECT '9/23/2017', '9/27/2017' UNION ALL
SELECT '9/25/2017', '10/10/2017' UNION ALL
IslandStartDate  IslandEndDate
2017-08-20       2017-08-21
2017-08-22       2017-10-10
2017-10-17       2017-10-18
2017-10-25       2017-11-15

Top N items in every month - BigQuery

I have the BigQuery query below:
WITH cte AS(
SELECT *
FROM (
SELECT project_name,
SUM(reward_value) AS total_reward_value,
DATE_TRUNC(date_signing, MONTH) as month,
date_signing,
Row_number() over (partition by DATE_TRUNC(date_signing, MONTH)
order by SUM(reward_value) desc) AS rank
FROM `deals`
WHERE CAST(date_signing as DATE) > '2019-12-31'
AND CAST(date_signing as DATE) < '2020-02-01'
AND target_category = 'achieved'
AND project_name IS NOT NULL
GROUP BY project_name, month, date_signing
)
)
SELECT * FROM cte WHERE rank <= 5
that returns a row for every (project_name, date_signing) combination, while I expect each unique project's reward_value to be summed within each month and then filtered to only the top 5 per month.
I get the following error if the date_signing grouping is removed:
PARTITION BY expression references column date_signing which is neither grouped nor aggregated at [16:48]
Any hints on what should be corrected would be appreciated!
One more subquery maybe then?
WITH cte AS(
SELECT project_name,
SUM(reward_value) as reward_sum,
DATE_TRUNC(date_signing, MONTH) as month
FROM `deals`
WHERE CAST(date_signing as DATE) > '2019-12-31'
AND CAST(date_signing as DATE) < '2020-02-01'
AND target_category = 'achieved'
AND project_name IS NOT NULL
GROUP BY project_name, month
),
ranks AS (
SELECT
project_name,
reward_sum,
month,
ROW_NUMBER() over (PARTITION BY month ORDER BY reward_sum DESC) AS rank
FROM cte
)
SELECT *
FROM ranks
WHERE rank <= 5
Yeah, you can't do that; you can show the last signing date instead:
WITH cte AS(
SELECT project_name,
SUM(reward_value) AS total_reward_value,
DATE_TRUNC(date_signing, MONTH) as month,
MAX(date_signing) as last_signing_date,
Row_number() over (partition by DATE_TRUNC(date_signing, MONTH)
order by SUM(reward_value) desc) AS rank
FROM `deals`
WHERE CAST(date_signing as DATE) > '2019-12-31'
AND CAST(date_signing as DATE) < '2020-02-01'
AND target_category = 'achieved'
AND project_name IS NOT NULL
GROUP BY project_name, month
)
SELECT * FROM cte WHERE rank <= 5

SQL - get counts based on rolling window per unique id

I'm working with a table that has an id and date column. For each id, there's a 90-day window where multiple transactions can be made. The 90-day window starts when the first transaction is made and the clock resets once the 90 days are over. When the new 90-day window begins triggered by a new transaction I want to start the count from the beginning at one. I would like to generate something like this with the two additional columns (window and count) in SQL:
id date window count
name1 7/7/2019 first 1
name1 12/31/2019 second 1
name1 1/23/2020 second 2
name1 1/23/2020 second 3
name1 2/12/2020 second 4
name1 4/1/2020 third 1
name2 6/30/2019 first 1
name2 8/14/2019 first 2
I think getting the rank of the window can be done with a CASE statement and MIN(date) OVER (PARTITION BY id). This is what I have in mind for that:
CASE WHEN MIN(date) OVER (PARTITION BY id) THEN 'first'
WHEN DATEDIFF(day, date, MIN(date) OVER (PARTITION BY id)) <= 90 THEN 'first'
WHEN DATEDIFF(day, date, MIN(date) OVER (PARTITION BY id)) > 90 AND DATEDIFF(day, date, MIN(date) OVER (PARTITION BY id)) <= 180 THEN 'third'
WHEN DATEDIFF(day, date, MIN(date) OVER (PARTITION BY id)) > 180 AND DATEDIFF(day, date, MIN(date) OVER (PARTITION BY id)) <= 270 THEN 'fourth'
ELSE NULL END
And incrementing the counts within the windows would be ROW_NUMBER() OVER (PARTITION BY id, window)?
You cannot solve this problem with window functions only. You need to iterate through the dataset, which can be done with a recursive query:
with
tab as (
select t.*, row_number() over(partition by id order by date) rn
from mytable t
),
cte as (
select id, date, rn, date date0 from tab where rn = 1
union all
select t.id, t.date, t.rn,
case when t.date > c.date0 + interval '90' day then t.date else c.date0 end
from cte c
inner join tab t on t.id = c.id and t.rn = c.rn + 1
)
select
id,
date,
dense_rank() over(partition by id order by date0) grp,
row_number() over(partition by id, date0 order by date) cnt
from cte
The first query in the with clause ranks records having the same id by increasing date; then, the recursive query traverses the data set and computes the starting date of each group. The last step is numbering the groups and computing the window count.
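For instance, here is a hand-worked trace of date0 for the name1 rows in the sample data (not actual query output):
rn  date        date0
1   2019-07-07  2019-07-07
2   2019-12-31  2019-12-31   (2019-12-31 is more than 90 days after 2019-07-07)
3   2020-01-23  2019-12-31
4   2020-01-23  2019-12-31
5   2020-02-12  2019-12-31
6   2020-04-01  2020-04-01   (more than 90 days after 2019-12-31)
dense_rank() over date0 then yields windows 1, 2, 2, 2, 2, 3, and numbering the rows within each (id, date0) pair gives counts 1, 1, 2, 3, 4, 1, matching the expected output.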
GMB is totally correct that a recursive CTE is needed. I offer this as an alternative form for two reasons. First, because it uses SQL Server syntax, which appears to be the database being used in the question. Second, because it directly calculates window and count without window functions:
with t as (
select t.*, row_number() over (partition by id order by date) as seqnum
from tbl t
),
cte as (
select t.id, t.date, dateadd(day, 90, t.date) as window_end, 1 as window, 1 as count, seqnum
from t
where seqnum = 1
union all
select t.id, t.date,
(case when t.date > cte.window_end then dateadd(day, 90, t.date)
else cte.window_end
end) as window_end,
(case when t.date > cte.window_end then window + 1 else window end) as window,
(case when t.date > cte.window_end then 1 else cte.count + 1 end) as count,
t.seqnum
from cte join
t
on t.id = cte.id and
t.seqnum = cte.seqnum + 1
)
select id, date, window, count
from cte
order by 1, 2;
Here is a db<>fiddle.

Segregate the row based on the date time column per each month

I have the following table in a SQL Server database environment.
The format of the start date is MM/DD/YYYY.
I need the result to be like the following table.
Based on the start date column, each record should be segregated into one row per month for the period between the start date and the end date.
You can use a recursive CTE:
with cte as (
select id, startdate as dte, enddate
from t
union all
select id,
dateadd(day, 1, eomonth(dte)),
enddate
from cte
where eomonth(dte) < enddate
)
select id, dte,
lead(dte, 1, enddate) over (partition by id order by dte)
from cte;
Thank you Gordon Linoff.
Using a CTE I have got the following result.
My code:
WITH cte
AS (SELECT 1 AS id,
Cast('2010-01-20' AS DATE) AS trg,
Cast('2010-01-20' AS DATE) AS strt_dte,
Cast('2010-03-15' AS DATE) AS end_dte
UNION ALL
SELECT id,
Dateadd(day, 1, Eomonth (trg)),
strt_dte,
end_dte
FROM cte
WHERE Eomonth(trg) < end_dte)
SELECT id,
trg,
strt_dte,
end_dte,
Lead (trg, 1, end_dte)
OVER (
partition BY id
ORDER BY trg) AS lead_result
FROM cte

How to select the user with max count by day

I have a table with three columns
UserID, Count, Date
I'd like to be able to select the userid with the highest count for each date.
I've tried a few different variations of queries with inline select statements but none have worked 100%, and I'm not too fond of having a select with three inline selects.
Is doing inline selects the only way to go without using temp tables? What's the best way to tackle this?
This solution will give you multiple records if there is a tie in Count, but it should work:
SELECT a.Date, a.UserId, a.[Count]
FROM yourTable a INNER JOIN (
SELECT MAX([Count]) as [Count], Date
FROM yourTable
GROUP BY Date
) b ON a.[Count] = b.[Count] AND a.Date = b.Date
ORDER BY a.Date
If [Date] is in fact a [Date] column with no time component:
;WITH x AS
(
SELECT [Date], [Count], UserID, rn = ROW_NUMBER() OVER
(PARTITION BY [Date] ORDER BY [Count] DESC)
FROM dbo.table
)
SELECT [Date], [Count], UserID
FROM x
WHERE rn = 1
ORDER BY [Date];
If [Date] is a DATETIME column with a time component, then:
;WITH x AS
(
SELECT [Date] = DATEADD(DAY, DATEDIFF(DAY, '19000101', [Date]), '19000101'),
[Count], UserID, rn = ROW_NUMBER() OVER
(PARTITION BY DATEADD(DAY, DATEDIFF(DAY, '19000101', [Date]), '19000101')
ORDER BY [Count] DESC)
FROM dbo.table
)
SELECT [Date], [Count], UserID
FROM x
WHERE rn = 1
ORDER BY [Date];
If you want to pick a specific row in the event of a tie, you can add a tie-breaker to the ORDER BY within the OVER() clause. If you want to include multiple rows in the case of ties, you can try changing ROW_NUMBER() to DENSE_RANK(), as sketched below.
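For example (a sketch using the same column names as above; UserID is just an arbitrary tie-breaker):
rn = ROW_NUMBER() OVER (PARTITION BY [Date] ORDER BY [Count] DESC, UserID)   -- deterministic winner on ties
rn = DENSE_RANK() OVER (PARTITION BY [Date] ORDER BY [Count] DESC)           -- keeps all tied users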
SELECT x.*
FROM (
SELECT Date
FROM atable
GROUP BY Date
) t
CROSS APPLY (
SELECT TOP 1 WITH TIES
UserID, Count, Date
FROM atable
WHERE Date = t.Date
ORDER BY Count DESC
) x
If Date is a datetime type and can have a non-zero time component, change the t table like this:
…
FROM (
SELECT Date = DATEADD(DAY, DATEDIFF(DAY, 0, Date), 0)
FROM atable
GROUP BY DATEADD(DAY, DATEDIFF(DAY, 0, Date), 0)
) t
…
References:
TOP (Transact-SQL)
Using APPLY
For SQL 2k5 and later. Note that a window function cannot be used directly in a WHERE clause, so it has to go in a derived table:
select UserID, [Count], [Date]
from (select UserID, [Count], [Date],
             rank() over (partition by [Date] order by [Count] desc, UserID desc) as rnk
      from tb) t
where rnk = 1