Please help: how can I find users who have refueled 50 or more liters every month for 3 months in a row? Customers are counted in the last month of the three-month period.
I'm new to SQL, and so far I've only managed to filter out customers who refuel 50 or more liters per month:
SELECT [Client]
,SUM([Litr]) AS Volume
,DATEPART(mm, [Date_transaction]) AS Month_num
,DATEPART(yyyy, [Date_transaction]) AS Year_num
FROM [dbo].[Транзакции]
GROUP BY [Client], DATEPART(mm, [Date_transaction]), DATEPART(yyyy, [Date_transaction])
HAVING SUM([Litr]) >= 50
I received this table:
|Client|Volume |Month_num|Year_num|
|:-----|:---------------|:--------|:-------|
|33 |52,7497194163861|8 |2019 |
|33 |58,1573308846036|9 |2019 |
|33 |148,852157943067|10 |2019 |
|33 |61,2182430364249|12 |2019 |
|55 |73,0741761044791|1 |2019 |
|55 |136,802367105397|3 |2019 |
|58 |88,0522395673911|7 |2019 |
|58 |140,965207631874|8 |2019 |
|58 |130,20099989797 |9 |2019 |
|181 |507,009488827671|6 |2019 |
Desired result:
|Month_num|COUNT(Client)|
|:--------|-------------|
|9 | 1 | (Client 58)
|10 | 1 | (Client 33)
We can group by Client and the start of each month, then filter out all rows with SUM(tr.Litr) < 50.
Now we can use the LAG window function to check whether the previous two rows for a client were the previous two months.
The final result is simply the count of clients who satisfy that condition:
SELECT
MonthStart,
NumOfClients = COUNT(*)
FROM (
SELECT
tr.Client,
v.MonthStart,
    PrevMonth1 = LAG(v.MonthStart, 1) OVER (PARTITION BY tr.Client ORDER BY v.MonthStart),
    PrevMonth2 = LAG(v.MonthStart, 2) OVER (PARTITION BY tr.Client ORDER BY v.MonthStart)
FROM dbo.Транзакции tr
CROSS APPLY (VALUES (
DATEFROMPARTS(YEAR(tr.Date_transaction), MONTH(tr.Date_transaction), 1)
) ) v(MonthStart)
GROUP BY tr.Client, v.MonthStart
HAVING SUM(tr.Litr) >= 50
) trGrouped
WHERE DATEDIFF(month, PrevMonth1, MonthStart) = 1
AND DATEDIFF(month, PrevMonth2, PrevMonth1) = 1
GROUP BY MonthStart;
Try this.
Scripts to create the table and insert the sample data (although this was really expected from @Andrey):
create table #test(client int, litre decimal(10, 5), Date_transaction date)
insert into #test values (33 ,52.7497194163861,'8/1/2019')
insert into #test values (33 ,58.1573308846036,'9/1/2019')
insert into #test values (33 ,148.852157943067,'10/1/2019')
insert into #test values (33 ,61.2182430364249,'12/1/2019')
insert into #test values (55 ,73.0741761044791,'1/1/2019')
insert into #test values (55 ,136.802367105397,'3/1/2019')
insert into #test values (58 ,88.0522395673911,'7/1/2019')
insert into #test values (58 ,140.965207631874,'8/1/2019')
insert into #test values (58 ,130.20099989797 ,'9/1/2019')
insert into #test values (181 ,507.009488827671,'6/1/2019')
Solution Script
SELECT month, COUNT(distinct client) AS client
FROM
(
    select t1.client, t1.litre, t1.Date_transaction, (DATEPART(mm, t1.Date_transaction) + 2) AS month
    from
    #test t1
    inner join #test t2 ON (t1.client = t2.client and (DATEPART(mm, t1.Date_transaction) + 1) = DATEPART(mm, t2.Date_transaction))
    inner join #test t3 ON (t1.client = t3.client and (DATEPART(mm, t1.Date_transaction) + 2) = DATEPART(mm, t3.Date_transaction))
    where t1.litre >= 50 and t2.litre >= 50 and t3.litre >= 50
) T
GROUP BY month
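Two caveats, hedged: matching months with DATEPART(mm, ...) + 1 breaks across a year boundary (December to January), and the self-join assumes at most one row per client per month, as in the sample data. A variant that joins on the month difference instead survives year boundaries:
SELECT DATEPART(mm, t3.Date_transaction) AS month, COUNT(DISTINCT t1.client) AS client
FROM #test t1
INNER JOIN #test t2 ON t1.client = t2.client
    AND DATEDIFF(month, t1.Date_transaction, t2.Date_transaction) = 1  -- month immediately after t1's
INNER JOIN #test t3 ON t1.client = t3.client
    AND DATEDIFF(month, t1.Date_transaction, t3.Date_transaction) = 2  -- two months after t1's
WHERE t1.litre >= 50 AND t2.litre >= 50 AND t3.litre >= 50
GROUP BY DATEPART(mm, t3.Date_transaction)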
How to find users who have refueled 50 or more liters every month for 3 months in a row?
I would use aggregation and lag(). To get the last month where this occurs:
SELECT t.*
FROM (SELECT t.Client, v.yyyymm,
SUM(t.Litr) AS Volume,
LAG(v.yyyymm, 2) OVER (PARTITION BY t.Client ORDER BY v.yyyymm) as prev2_yyyymm
FROM [dbo].[Транзакции] t CROSS APPLY
(VALUES (DATEFROMPARTS(YEAR(t.Date_transaction), MONTH(t.Date_transaction), 1))
) v(yyyymm)
GROUP BY t.Client, v.yyyymm
HAVING SUM(Litr) >= 50
) t
WHERE prev2_yyyymm = DATEADD(MONTH, -2, yyyymm);
If you want to aggregate by month:
SELECT yyyymm, COUNT(*)
FROM (SELECT t.Client, v.yyyymm,
SUM(t.Litr) AS Volume,
LAG(v.yyyymm, 2) OVER (PARTITION BY t.Client ORDER BY v.yyyymm) as prev2_yyyymm
FROM [dbo].[Транзакции] t CROSS APPLY
(VALUES (DATEFROMPARTS(YEAR(t.Date_transaction), MONTH(t.Date_transaction), 1))
) v(yyyymm)
GROUP BY t.Client, v.yyyymm
HAVING SUM(Litr) >= 50
) t
WHERE prev2_yyyymm = DATEADD(MONTH, -2, yyyymm)
GROUP BY yyyymm;
What is this doing? The inner query is basically your query with some refinements:
yyyymm is a date, which is more convenient than keeping the year and month in separate columns.
The CROSS APPLY is just a convenient way of defining the yyyymm alias.
prev2_yyyymm gets the second-most-recent earlier month in which the client met the condition.
The outer query implements the logic: if prev2_yyyymm is exactly two months ago, then we have three months in a row that meet the condition.
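To sanity-check this against the #test sample data posted above, here is the same query adapted to that table (a sketch; against those rows it should return 2019-09-01 and 2019-10-01 with one client each):
SELECT yyyymm, COUNT(*) AS NumOfClients
FROM (SELECT t.client, v.yyyymm,
             LAG(v.yyyymm, 2) OVER (PARTITION BY t.client ORDER BY v.yyyymm) AS prev2_yyyymm
      FROM #test t CROSS APPLY
           (VALUES (DATEFROMPARTS(YEAR(t.Date_transaction), MONTH(t.Date_transaction), 1))
           ) v(yyyymm)
      GROUP BY t.client, v.yyyymm
      HAVING SUM(t.litre) >= 50
     ) t
WHERE prev2_yyyymm = DATEADD(MONTH, -2, yyyymm)
GROUP BY yyyymm;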
Related
I have the following data set:
I want to create a new column that sums the last 7 days of sales. So the query result should look like the following:
Please help.
Thanks!
In standard SQL, you would use a window function -- assuming you have data for each day:
select t.*,
sum(sales) over (partition by itemid order by date rows between 6 preceding and current row) as sales_7
from t;
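If you can have several rows per day instead, one variation (a sketch, reusing the same hypothetical table t and columns) is to aggregate to one row per day first and then window over the daily sums:
select itemid, [date],
       -- inner sum(sales) is the per-day total; the outer sum windows over those totals
       sum(sum(sales)) over (partition by itemid
                             order by [date]
                             rows between 6 preceding and current row) as sales_7
from t
group by itemid, [date];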
use the sum() aggregate function and group by:
select country, itemid, year, monthnumber, week, sum(sales) as sales_last_7days from your_table
where date>=DATEADD(day, -7, getdate()) and date< getdate()
group by country,itemid,year,monthnumber,week
with a window:
select (list other columns here), sum(sum(sales)) over
(partition by week
order by day
rows between 6 preceding and current row)
from table
group by date, week;
Note that week doesn't change the grouping, because a date belongs to exactly one week, but it is needed in the window.
It seems you are working with SQL Server. If so, you can use apply:
select t.*, t1.[last7day]
from table t outer apply
     (select sum(t1.sales) as [last7day]
      from table t1
      where t.itemid = t1.itemid and
            t1.date between dateadd(day, -6, t.date) and t.date
     ) t1;
If you don't have exactly one day for each row, for example if you have a list of transactions...
The below example completely confused me the first time I saw it, so I've tried to comment as much as I can to explain what's happening.
Suppose we have a table tbl with date column dt and amount column amt, and for each date in tbl we want to return a rolling sum of the amount from the current day and the past 6 days.
select distinct -- see note after code on what this distinct is doing.
dt
, ( -- Has to be in brackets to denote we're returning 1 value per row.
-- for each row of T1:
select sum(b.amt) -- the sum of amounts in T2. The where clause will restrict which rows in T2 will be summed.
from tbl T2
where T2.dt between T1.dt - 6 and T1.dt -- for each row in T1, give me all rows in T2 where the date is between 6 days before this T1 row's date and T1 row's date, giving us our rolling sum
-- WARNING: CHECK YOUR VERSION OF SQL FOR HOW TO SUBTRACT DAYS FROM A DATE, I'VE MADE IT (T1.dt - 6) FOR SIMPLICITY
-- we don't need a group by, because we're returning one value for each row in T1
)
from tbl T1
We have a main version of tbl, aliased T1. We then have a secondary table, aliased T2. For each row in T1, we're going to ask for a set of rows in T2 that we're going to sum before giving it to our main query.
To understand what's happening, run the code without the distinct. You'll notice that we have the same number of rows as in tbl, because the T2 statement is happening for every row in T1.
Notes:
If you have any days for which no rows exist in your table, you will not get a calculation for that day. To be certain this doesn't happen, join your table to a table containing a distinct list of consecutive dates, and use that as your date column (see the sketch after these notes).
If you have nulls in your amount column the calculation will still work, but if the rolling window contains only nulls you will get null instead of 0 as your result. If that troubles you, convert the nulls to zeros before (or after) you run the query.
The beginning of the period will have a 'ramp up'. But this would be the same whatever method you use to do a rolling sum. If it bothers you, don't return the first 6 days.
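A minimal sketch of that first note, assuming a hypothetical dbo.Dates calendar table with one row per consecutive date:
select d.dt
     , (select sum(b.amt)
        from tbl b
        where b.dt between dateadd(dd, -6, d.dt) and d.dt
       ) as past_7_days_amt
from dbo.Dates d -- hypothetical calendar table, one row per calendar date
where d.dt between (select min(dt) from tbl) and (select max(dt) from tbl)
No distinct is needed here, because dbo.Dates already has exactly one row per date.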
Finally a worked example if you're playing along at home using SQL Server:
with tbl as (
-- a list of transactions from 1.10.2019 to 14.10.2019
select cast('2019-10-01' as date) dt, 1 amt
union select cast('2019-10-02' as date), 4
union select cast('2019-10-01' as date), 10
union select cast('2019-10-03' as date), 3
union select cast('2019-10-04' as date), 20
union select cast('2019-10-04' as date), 2
union select cast('2019-10-04' as date), 12
union select cast('2019-10-04' as date), 17
union select cast('2019-10-05' as date), null -- a whole week of null values because we all had the week off... I hope this data wasn't important
union select cast('2019-10-06' as date), null
union select cast('2019-10-07' as date), null
union select cast('2019-10-08' as date), null
union select cast('2019-10-09' as date), null
union select cast('2019-10-10' as date), null
union select cast('2019-10-10' as date), null
union select cast('2019-10-10' as date), null
union select cast('2019-10-11' as date), null
union select cast('2019-10-12' as date), 1
union select cast('2019-10-12' as date), 1
union select cast('2019-10-12' as date), 1
union select cast('2019-10-12' as date), 1
union select cast('2019-10-12' as date), 1
union select cast('2019-10-12' as date), 1
union select cast('2019-10-13' as date), 2
union select cast('2019-10-14' as date), 1000
)
select distinct
a.dt
, (
select sum(b.amt)
from tbl b
where b.dt between dateadd(dd, -6, a.dt) and a.dt
) past_7_days_amt
from tbl a
Returns:
+------------+-----------------+
| dt | past_7_days_amt |
+------------+-----------------+
| 2019-10-01 | 11 |
| 2019-10-02 | 15 |
| 2019-10-03 | 18 |
| 2019-10-04 | 69 |
| 2019-10-05 | 69 |
| 2019-10-06 | 69 |
| 2019-10-07 | 69 |
| 2019-10-08 | 58 |
| 2019-10-09 | 54 |
| 2019-10-10 | 51 |
| 2019-10-11 | NULL |
| 2019-10-12 | 1 |
| 2019-10-13 | 3 |
| 2019-10-14 | 1003 |
+------------+-----------------+
I am currently using this query (in SQL Server) to count the number of unique item each day:
SELECT Date, COUNT(DISTINCT item)
FROM myTable
GROUP BY Date
ORDER BY Date
How can I transform this to get for each date the number of unique item over the past 3 days (including the current day)?
The output should be a table with 2 columns:
one column with all the dates from the original table, and a second column with the number of unique items per date.
for instance, if the original table is:
Date Item
01/01/2018 A
01/01/2018 B
02/01/2018 C
03/01/2018 C
04/01/2018 C
With my query above I currently get the unique count for each day:
Date count
01/01/2018 2
02/01/2018 1
03/01/2018 1
04/01/2018 1
and I am looking to get, as the result, the unique count over a 3-day rolling window:
Date count
01/01/2018 2
02/01/2018 3 (because items ABC on 1st and 2nd Jan)
03/01/2018 3 (because items ABC on 1st,2nd,3rd Jan)
04/01/2018 1 (because only item C on 2nd,3rd,4th Jan)
Using an apply provides a convenient way to form sliding windows:
CREATE TABLE myTable
([DateCol] datetime, [Item] varchar(1))
;
INSERT INTO myTable
([DateCol], [Item])
VALUES
('2018-01-01 00:00:00', 'A'),
('2018-01-01 00:00:00', 'B'),
('2018-01-02 00:00:00', 'C'),
('2018-01-03 00:00:00', 'C'),
('2018-01-04 00:00:00', 'C')
;
CREATE NONCLUSTERED INDEX IX_DateCol
ON MyTable([DateCol])
;
Query:
select distinct
t1.dateCol
, oa.ItemCount
from myTable t1
outer apply (
select count(distinct t2.item) as ItemCount
from myTable t2
where t2.DateCol between dateadd(day,-2,t1.DateCol) and t1.DateCol
) oa
order by t1.dateCol ASC
Results:
| dateCol | ItemCount |
|----------------------|-----------|
| 2018-01-01T00:00:00Z | 2 |
| 2018-01-02T00:00:00Z | 3 |
| 2018-01-03T00:00:00Z | 3 |
| 2018-01-04T00:00:00Z | 1 |
There may be some performance gains from reducing to the distinct dates before applying the sliding window, like so:
select
     d.DateCol
   , oa.ItemCount
from (
    select distinct t1.DateCol
    from myTable t1
) d
outer apply (
    select count(distinct t2.item) as ItemCount
    from myTable t2
    where t2.DateCol between dateadd(day,-2,d.DateCol) and d.DateCol
) oa
order by d.DateCol ASC
;
Instead of using select distinct in that derived table you could use group by instead, but the execution plan will remain the same.
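That is, the derived table d could equivalently be written as:
select t1.DateCol
from myTable t1
group by t1.DateCol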
The most straightforward solution is to join the table with itself based on dates:
SELECT t1.DateCol, COUNT(DISTINCT t2.Item) AS C
FROM myTable AS t1
LEFT JOIN myTable AS t2 ON t2.DateCol BETWEEN DATEADD(dd, -2, t1.DateCol) AND t1.DateCol
GROUP BY t1.DateCol
ORDER BY t1.DateCol
Output:
| DateCol | C |
|-------------------------|---|
| 2018-01-01 00:00:00.000 | 2 |
| 2018-01-02 00:00:00.000 | 3 |
| 2018-01-03 00:00:00.000 | 3 |
| 2018-01-04 00:00:00.000 | 1 |
GROUP BY should be faster than DISTINCT (make sure to have an index on your Date column)
DECLARE @tbl TABLE([Date] DATE, [Item] VARCHAR(100))
;
INSERT INTO @tbl VALUES
('2018-01-01 00:00:00', 'A'),
('2018-01-01 00:00:00', 'B'),
('2018-01-02 00:00:00', 'C'),
('2018-01-03 00:00:00', 'C'),
('2018-01-04 00:00:00', 'C');
SELECT t.[Date]
--Just for control. You can take this part away
,(SELECT DISTINCT t2.[Item] AS [*]
  FROM @tbl AS t2
  WHERE t2.[Date]<=t.[Date]
    AND t2.[Date]>=DATEADD(DAY,-2,t.[Date]) FOR XML PATH('')) AS CountedItems
--This sub-select comes back with your counts
,(SELECT COUNT(DISTINCT t2.[Item])
  FROM @tbl AS t2
  WHERE t2.[Date]<=t.[Date]
    AND t2.[Date]>=DATEADD(DAY,-2,t.[Date])) AS ItemCount
FROM @tbl AS t
GROUP BY t.[Date];
The result
Date CountedItems ItemCount
2018-01-01 AB 2
2018-01-02 ABC 3
2018-01-03 ABC 3
2018-01-04 C 1
This solution takes a different approach from the others. Can you check the performance of this query on real data, compared with the other answers?
The basic idea is that each row can participate in the window for its own date, the day after, or the day after that. So the query first expands each row into three rows with those shifted dates attached, and then it can use a regular COUNT(DISTINCT) aggregated on the computed date. The HAVING clause is just to avoid returning results for dates that were only computed and are not present in the base data.
with cte(Date, Item) as (
select cast(a as datetime), b
from (values
('01/01/2018','A')
,('01/01/2018','B')
,('02/01/2018','C')
,('03/01/2018','C')
,('04/01/2018','C')) t(a,b)
)
select
[Date] = dateadd(dd, n, Date), [Count] = count(distinct Item)
from
cte
cross join (values (0),(1),(2)) t(n)
group by dateadd(dd, n, Date)
having max(iif(n = 0, 1, 0)) = 1
option (force order)
Output:
| Date | Count |
|-------------------------|-------|
| 2018-01-01 00:00:00.000 | 2 |
| 2018-01-02 00:00:00.000 | 3 |
| 2018-01-03 00:00:00.000 | 3 |
| 2018-01-04 00:00:00.000 | 1 |
It might be faster if you have many duplicate rows:
select
[Date] = dateadd(dd, n, Date), [Count] = count(distinct Item)
from
(select distinct Date, Item from cte) c
cross join (values (0),(1),(2)) t(n)
group by dateadd(dd, n, Date)
having max(iif(n = 0, 1, 0)) = 1
option (force order)
Use the GETDATE() function to get the current date, and DATEADD() to get the last 3 days:
SELECT Date, count(DISTINCT item)
FROM myTable
WHERE [Date] >= DATEADD(day,-3, GETDATE())
GROUP BY Date
ORDER BY Date
SQL
SELECT DISTINCT Date,
(SELECT COUNT(DISTINCT item)
FROM myTable t2
WHERE t2.Date BETWEEN DATEADD(day, -2, t1.Date) AND t1.Date) AS count
FROM myTable t1
ORDER BY Date;
Rextester demo: http://rextester.com/ZRDQ22190
Since COUNT(DISTINCT item) OVER (PARTITION BY [Date]) is not supported you can use dense_rank to emulate that:
SELECT Date, dense_rank() over (partition by [Date] order by [item])
+ dense_rank() over (partition by [Date] order by [item] desc)
- 1 as count_distinct_item
FROM myTable
One thing to note is that dense_rank counts NULL as a value, whereas COUNT(DISTINCT) does not.
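If [item] can be NULL and you want COUNT's behaviour, a possible workaround (a sketch, not battle-tested) is to subtract a per-date flag for the NULL 'value':
SELECT [Date],
       dense_rank() over (partition by [Date] order by [item])
       + dense_rank() over (partition by [Date] order by [item] desc)
       - 1
       -- subtract 1 when any NULL is present on this date
       - max(case when [item] is null then 1 else 0 end)
             over (partition by [Date]) as count_distinct_item
FROM myTable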
Here is a simple solution that uses myTable itself as the source of grouping dates (edited for SQLServer dateadd). Note that this query assumes there will be at least one record in myTable for every date; if any date is absent, it will not appear in the query results, even if there are records for the 2 days prior:
select
date,
(select
count(distinct item)
from (select distinct date, item from myTable) as d2
where
d2.date between dateadd(day,-2,d.date) and d.date
) as count
from (select distinct date from myTable) as d
I solve this question with math.
z (any day number) = 3x + y, where y is z mod 3.
I need the window from 3 * (x - 1) + y + 1 to 3 * (x - 1) + y + 3.
3 * (x - 1) + y + 1 = 3 * (z / 3 - 1) + z % 3 + 1 (with integer division).
In that case I can group by the range between 3 * (z / 3 - 1) + z % 3 + 1 and z. For example, for z = 11: x = 3, y = 2, so the window runs from day 9 to day 11, i.e. the trailing 3-day window ending on day 11.
SELECT iif(OrderDate between 3 * (cast(OrderDate as int) / 3 - 1) + (cast(OrderDate as int) % 3) + 1
and orderdate, Orderdate, 0)
, count(sh.SalesOrderID) FROM Sales.SalesOrderDetail shd
JOIN Sales.SalesOrderHeader sh on sh.SalesOrderID = shd.SalesOrderID
group by iif(OrderDate between 3 * (cast(OrderDate as int) / 3 - 1) + (cast(OrderDate as int) % 3) + 1
and orderdate, Orderdate, 0)
order by iif(OrderDate between 3 * (cast(OrderDate as int) / 3 - 1) + (cast(OrderDate as int) % 3) + 1
and orderdate, Orderdate, 0)
If you need a different day grouping, you can use:
declare @n int = 4 -- another day count
SELECT iif(OrderDate between @n * (cast(OrderDate as int) / @n - 1) + (cast(OrderDate as int) % @n) + 1
           and orderdate, Orderdate, 0)
     , count(sh.SalesOrderID) FROM Sales.SalesOrderDetail shd
JOIN Sales.SalesOrderHeader sh on sh.SalesOrderID = shd.SalesOrderID
group by iif(OrderDate between @n * (cast(OrderDate as int) / @n - 1) + (cast(OrderDate as int) % @n) + 1
             and orderdate, Orderdate, 0)
order by iif(OrderDate between @n * (cast(OrderDate as int) / @n - 1) + (cast(OrderDate as int) % @n) + 1
             and orderdate, Orderdate, 0)
Assume that I have a table (MyTable) as follows:
item_id | date
----------------
1 | 2016-06-08
1 | 2016-06-07
1 | 2016-06-05
1 | 2016-06-04
1 | 2016-05-31
...
2 | 2016-06-08
2 | 2016-06-06
2 | 2016-06-04
2 | 2016-05-31
...
3 | 2016-05-31
...
I would like to build a weekly summary table that reports on a running 7 day window. The window would basically say "How many unique item_ids were reported in the preceding 7 days"?
So, in this case, the output table would look something like:
date | weekly_ids
----------------------
2016-05-31| 3 # All 3 were present on the 31st
2016-06-01| 3 # All 3 were present on the 31st which is < 7 days before the 1st
2016-06-02| 3 # Same
2016-06-03| 3 # Same
2016-06-04| 3 # Same
2016-06-05| 3 # Same
2016-06-06| 3 # Same
2016-06-07| 3 # Same
2016-06-08| 2 # item 3 was not present for the entire last week so it does not add to the count.
I've tried:
SELECT
item_id,
date,
MAX(present) OVER (
PARTITION BY item_id
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS is_present
FROM (
# Inner query
SELECT
item_id,
date,
1 AS present,
FROM MyTable
)
GROUP BY date
ORDER BY date DESC
This feels like it is going in the right direction. But as it is, the window runs over the wrong time-frame when dates aren't present (too many dates) and it also doesn't output records for dates when the item_id wasn't present (even if it was present on the previous date). Is there a simple resolution to this problem?
If it's helpful and necessary:
I can hard-code an oldest date.
I can also get a table of all of the item_ids in existence.
This query will only be run on BigQuery, so BQ specific functions/syntax are fair game and SQL functions/syntax that doesn't run on BigQuery unfortunately doesn't help me ...
I have created a temp table to hold dates; however, you would probably benefit from adding a permanent calendar table to your database for these joins. Trust me, it will cause fewer headaches.
DECLARE @my_table TABLE
(
    item_id int,
    date DATETIME
)
INSERT @my_table SELECT 1,'2016-06-08'
INSERT @my_table SELECT 1,'2016-06-07'
INSERT @my_table SELECT 1,'2016-06-05'
INSERT @my_table SELECT 1,'2016-06-04'
INSERT @my_table SELECT 1,'2016-05-31'
INSERT @my_table SELECT 2,'2016-06-08'
INSERT @my_table SELECT 2,'2016-06-06'
INSERT @my_table SELECT 2,'2016-06-04'
INSERT @my_table SELECT 2,'2016-05-31'
INSERT @my_table SELECT 3,'2016-05-31'
DECLARE @TrailingDays INT=7
DECLARE @LowDate DATETIME='01/01/2016'
DECLARE @HighDate DATETIME='12/31/2016'
DECLARE @Calendar TABLE(CalendarDate DATETIME)
DECLARE @LoopDate DATETIME=@LowDate
WHILE(@LoopDate<=@HighDate) BEGIN
    INSERT @Calendar SELECT @LoopDate
    SET @LoopDate=DATEADD(DAY,1,@LoopDate)
END
SELECT
    date=HighDate,
    weekly_ids=COUNT(DISTINCT item_id)
FROM
(
    SELECT
        HighDate=C.CalendarDate,
        LowDate=LAG(C.CalendarDate, @TrailingDays, 0) OVER (ORDER BY C.CalendarDate)
    FROM
        @Calendar C
    WHERE
        CalendarDate BETWEEN @LowDate AND @HighDate
) AS X
LEFT OUTER JOIN @my_table MT ON MT.date BETWEEN LowDate AND HighDate
GROUP BY
    LowDate,
    HighDate
Try the example below. It can give you a direction to explore.
Purely GBQ - Legacy SQL
SELECT date, items FROM (
SELECT
date, COUNT(DISTINCT item_id) OVER(ORDER BY sec RANGE BETWEEN 60*60*24*2 PRECEDING AND CURRENT ROW) AS items
FROM (
SELECT
item_id, date, timestamp_to_sec(timestamp(date)) AS sec
FROM (
SELECT calendar.day AS date, MyTable.item_id AS item_id
FROM (
SELECT DATE(DATE_ADD(TIMESTAMP('2016-05-28'), pos - 1, "DAY")) AS day
FROM (
SELECT ROW_NUMBER() OVER() AS pos, *
FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + DATEDIFF(TIMESTAMP(CURRENT_DATE()), TIMESTAMP('2016-05-28')), '.'),'') AS h
FROM (SELECT NULL)),h
)))
) AS calendar
LEFT JOIN (
SELECT date, item_id
FROM
(SELECT 1 AS item_id, '2016-06-08' AS date),
(SELECT 1 AS item_id, '2016-06-07' AS date),
(SELECT 1 AS item_id, '2016-06-05' AS date),
(SELECT 1 AS item_id, '2016-06-04' AS date),
(SELECT 1 AS item_id, '2016-05-28' AS date),
(SELECT 2 AS item_id, '2016-06-08' AS date),
(SELECT 2 AS item_id, '2016-06-06' AS date),
(SELECT 2 AS item_id, '2016-06-04' AS date),
(SELECT 2 AS item_id, '2016-05-31' AS date),
(SELECT 3 AS item_id, '2016-05-31' AS date),
(SELECT 3 AS item_id, '2016-06-05' AS date)
) AS MyTable
ON calendar.day = MyTable.date
)
)
)
GROUP BY date, items
ORDER BY date
Please note:
the oldest date - 2016-05-28 - is hardcoded in the calendar subquery
the window size is controlled in RANGE BETWEEN 60*60*24*2 PRECEDING AND CURRENT ROW; if you need 7 days, the expression should be 60*60*24*6
keep in mind the specifics of COUNT(DISTINCT) in BigQuery Legacy SQL (by default it is approximate)
I have a table which contains datetime rows like below.
ID | DateTime
1 | 12:00
2 | 12:02
3 | 12:03
4 | 12:04
5 | 12:05
6 | 12:10
I want to identify those rows where there is a 'gap' of 5 minutes between rows (for example, rows 5 and 6).
I know that we need to use DATEDIFF, but how can I compare only those rows which are consecutive with each other?
You can use the LAG and LEAD window functions for this:
SELECT ID
FROM (
SELECT ID, [DateTime],
DATEDIFF(mi, LAG([DateTime]) OVER (ORDER BY ID), [DateTime]) AS prev_diff,
DATEDIFF(mi, [DateTime], LEAD([DateTime]) OVER (ORDER BY ID)) AS next_diff
FROM mytable) AS t
WHERE prev_diff >= 5 OR next_diff >= 5
Output:
ID
==
5
6
Note: the above query assumes that order is defined by the ID field. You can easily substitute any other field that specifies the order in your table, as in the sketch below.
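For example, ordering by the datetime itself (with ID as a tie-breaker) instead of by ID, a sketch against the same table:
SELECT ID
FROM (
    SELECT ID, [DateTime],
           DATEDIFF(mi, LAG([DateTime]) OVER (ORDER BY [DateTime], ID), [DateTime]) AS prev_diff,
           DATEDIFF(mi, [DateTime], LEAD([DateTime]) OVER (ORDER BY [DateTime], ID)) AS next_diff
    FROM mytable) AS t
WHERE prev_diff >= 5 OR next_diff >= 5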
You might try this (I'm not sure if it's really fast)
SELECT cur.datetime AS current_datetime,
       prev.datetime AS previous_datetime,
       DATEDIFF(minute, prev.datetime, cur.datetime) AS gap
FROM my_table cur
JOIN my_table prev
  ON prev.datetime < cur.datetime
 AND NOT EXISTS (SELECT *
                 FROM my_table others
                 WHERE others.datetime < cur.datetime
                   AND others.datetime > prev.datetime);
Update for SQL Server 2012+: use LAG
DECLARE @tbl TABLE(ID INT, T TIME)
INSERT INTO @tbl VALUES
(1,'12:00')
,(2,'12:02')
,(3,'12:03')
,(4,'12:04')
,(5,'12:05')
,(6,'12:10');
WITH TimesWithDifferenceToPrevious AS
(
SELECT ID
,T
,LAG(T) OVER(ORDER BY T) AS prev
,DATEDIFF(MI,LAG(T) OVER(ORDER BY T),T) AS MinuteDiff
FROM @tbl
)
SELECT *
FROM TimesWithDifferenceToPrevious
WHERE ABS(MinuteDiff) >=5
The result
6 12:10:00.0000000 12:05:00.0000000 5
I have a large data set which for the purpose of this question has 3 fields:
Group Identifier
From Date
To Date
On any given row the From Date will always be less than the To Date but within each group the time periods (which are in no particular order) represented by the date pairs could overlap, be contained one within another, or even be identical.
What I'd like to end up with is a query that condenses the results for each group down to just the continuous periods. For example a group that looks like this:
| Group ID | From Date | To Date |
--------------------------------------
| A | 01/01/2012 | 12/31/2012 |
| A | 12/01/2013 | 11/30/2014 |
| A | 01/01/2015 | 12/31/2015 |
| A | 01/01/2015 | 12/31/2015 |
| A | 02/01/2015 | 03/31/2015 |
| A | 01/01/2013 | 12/31/2013 |
Would result in this:
| Group ID | From Date | To Date |
--------------------------------------
| A | 01/01/2012 | 11/30/2014 |
| A | 01/01/2015 | 12/31/2015 |
I've read a number of articles on date packing but I can't quite figure out how to apply that to my data set.
How can I construct a query that would give me those results?
The solution from the book "Microsoft® SQL Server® 2012 High-Performance T-SQL Using Window Functions":
;with C1 as(
select GroupID, FromDate as ts, +1 as type, 1 as sub
from dbo.table_name
union all
select GroupID, dateadd(day, +1, ToDate) as ts, -1 as type, 0 as sub
from dbo.table_name),
C2 as(
select C1.*
, sum(type) over(partition by GroupID order by ts, type desc
rows between unbounded preceding and current row) - sub as cnt
from C1),
C3 as(
select GroupID, ts, floor((row_number() over(partition by GroupID order by ts) - 1) / 2 + 1) as grpnum
from C2
where cnt = 0)
select GroupID, min(ts) as FromDate, dateadd(day, -1, max(ts)) as ToDate
from C3
group by GroupID, grpnum;
Create table:
if object_id('table_name') is not null
drop table table_name
create table table_name(GroupID varchar(100), FromDate datetime,ToDate datetime)
insert into table_name
select 'A', '01/01/2012', '12/31/2012' union all
select 'A', '12/01/2013', '11/30/2014' union all
select 'A', '01/01/2015', '12/31/2015' union all
select 'A', '01/01/2015', '12/31/2015' union all
select 'A', '02/01/2015', '03/31/2015' union all
select 'A', '01/01/2013', '12/31/2013'
I'd use a Calendar table. This table simply has a list of dates for several decades.
CREATE TABLE [dbo].[Calendar](
[dt] [date] NOT NULL,
CONSTRAINT [PK_Calendar] PRIMARY KEY CLUSTERED
(
[dt] ASC
))
There are many ways to populate such a table.
For example, 100K rows (~270 years) starting from 1900-01-01:
INSERT INTO dbo.Calendar (dt)
SELECT TOP (100000)
DATEADD(day, ROW_NUMBER() OVER (ORDER BY s1.[object_id])-1, '19000101') AS dt
FROM sys.all_objects AS s1 CROSS JOIN sys.all_objects AS s2
OPTION (MAXDOP 1);
Once you have a Calendar table, here is how to use it.
Each original row is joined with the Calendar table to return as many rows as there are dates between From and To.
Then possible duplicates are removed.
Then classic gaps-and-islands by numbering the rows in two sequences.
Then grouping found islands together to get the new From and To.
Sample data
I added a second group.
DECLARE #T TABLE (GroupID int, FromDate date, ToDate date);
INSERT INTO #T (GroupID, FromDate, ToDate) VALUES
(1, '2012-01-01', '2012-12-31'),
(1, '2013-12-01', '2014-11-30'),
(1, '2015-01-01', '2015-12-31'),
(1, '2015-01-01', '2015-12-31'),
(1, '2015-02-01', '2015-03-31'),
(1, '2013-01-01', '2013-12-31'),
(2, '2012-01-01', '2012-12-31'),
(2, '2013-01-01', '2013-12-31');
Query
WITH
CTE_AllDates
AS
(
SELECT DISTINCT
T.GroupID
,CA.dt
FROM
#T AS T
CROSS APPLY
(
SELECT dbo.Calendar.dt
FROM dbo.Calendar
WHERE
dbo.Calendar.dt >= T.FromDate
AND dbo.Calendar.dt <= T.ToDate
) AS CA
)
,CTE_Sequences
AS
(
SELECT
GroupID
,dt
,ROW_NUMBER() OVER(PARTITION BY GroupID ORDER BY dt) AS Seq1
,DATEDIFF(day, '2001-01-01', dt) AS Seq2
,DATEDIFF(day, '2001-01-01', dt) -
ROW_NUMBER() OVER(PARTITION BY GroupID ORDER BY dt) AS IslandNumber
FROM CTE_AllDates
)
SELECT
GroupID
,MIN(dt) AS NewFromDate
,MAX(dt) AS NewToDate
FROM CTE_Sequences
GROUP BY GroupID, IslandNumber
ORDER BY GroupID, NewFromDate;
Result
+---------+-------------+------------+
| GroupID | NewFromDate | NewToDate |
+---------+-------------+------------+
| 1 | 2012-01-01 | 2014-11-30 |
| 1 | 2015-01-01 | 2015-12-31 |
| 2 | 2012-01-01 | 2013-12-31 |
+---------+-------------+------------+
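Another option is a recursive CTE: number the rows per group by From Date, then walk them in order, either extending the running date envelope or starting a new group number whenever a row does not overlap it: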
; with
cte as
(
select *, rn = row_number() over (partition by [Group ID] order by [From Date])
from tbl
),
rcte as
(
select rn, [Group ID], [From Date], [To Date], GrpNo = 1, GrpFrom = [From Date], GrpTo = [To Date]
from cte
where rn = 1
union all
select c.rn, c.[Group ID], c.[From Date], c.[To Date],
GrpNo = case when c.[From Date] between r.GrpFrom and dateadd(day, 1, r.GrpTo)
or c.[To Date] between r.GrpFrom and r.GrpTo
then r.GrpNo
else r.GrpNo + 1
end,
GrpFrom= case when c.[From Date] between r.GrpFrom and dateadd(day, 1, r.GrpTo)
or c.[To Date] between r.GrpFrom and r.GrpTo
then case when c.[From Date] > r.GrpFrom then c.[From Date] else r.GrpFrom end
else c.[From Date]
end,
GrpTo = case when c.[From Date] between r.GrpFrom and dateadd(day, 1, r.GrpTo)
or c.[To Date] between r.GrpFrom and dateadd(day, 1, r.GrpTo)
then case when c.[To Date] > r.GrpTo then c.[To Date] else r.GrpTo end
else c.[To Date]
end
from rcte r
inner join cte c on r.[Group ID] = c.[Group ID]
and r.rn = c.rn - 1
)
select [Group ID], min(GrpFrom), max(GrpTo)
from rcte
group by [Group ID], GrpNo
A Geometric Approach
Here and elsewhere I've noticed that date packing questions
don't provide a geometric approach to this problem. After all,
any range, date-ranges included, can be interpreted as a line.
So why not convert them to a sql geometry type and utilize
geometry::UnionAggregate to merge the ranges? So I took a stab
at it with your post.
Code Description
In 'numbers':
I build a table representing a sequence
Swap it out with your favorite way to make a numbers table.
For a union operation, you won't ever need more rows than in
your original table, so I just use it as the base to build it.
In 'mergeLines':
I convert the dates to floats and use those floats
to create geometrical points.
In this problem, we're working in
'integer space,' meaning there are no time considerations, and so
a begin date in one range that is one day apart from an end date
in another should be merged with that other. In order to make
that merge happen, we need to convert to 'real space', so we
add 1 to the tail of all ranges (we undo this later).
I then connect these points via STUnion and STEnvelope.
Finally, I merge all these lines via UnionAggregate. The resulting
'lines' geometry object might contain multiple lines, but if they
overlap, they turn into one line.
In the outer query:
I use the numbers CTE to extract the individual lines inside 'lines'.
I envelope the lines which here ensures that the lines are stored
only as its two endpoints.
I read the endpoint x values and convert them back to their time
representations, ensuring to put them back into 'integer space'.
The Code
with
numbers as (
select row_number() over (order by (select null)) i
from #spans -- Where I put your data
),
mergeLines as (
select groupId,
lines = geometry::UnionAggregate(line)
from #spans
cross apply (select
startP = geometry::Point(convert(float,fromDate), 0, 0),
stopP = geometry::Point(convert(float,toDate) + 1, 0, 0)
) pointify
cross apply (select line = startP.STUnion(stopP).STEnvelope()) lineify
group by groupId
)
select groupId, fromDate, toDate
from mergeLines ml
join numbers n on n.i between 1 and ml.lines.STNumGeometries()
cross apply (select line = ml.lines.STGeometryN(i).STEnvelope()) l
cross apply (select
fromDate = convert(datetime, l.line.STPointN(1).STX),
toDate = convert(datetime, l.line.STPointN(3).STX) - 1
) unprepare
order by groupId, fromDate;