Histogram of orders by range of dates - sql

I'm trying to create a histogram based on date intervals and the total number of orders, but I'm having a hard time binning the data in SQL.
A simplified table is shown below:
customer_id  Date        count_orders
1            01-01-2020  5
1            01-13-2020  26
1            02-06-2020  11
2            01-17-2020  9
3            02-04-2020  13
3            03-29-2020  24
4            04-05-2020  1
5            02-23-2020  10
6            03-15-2020  7
6            04-18-2020  32
...          ...         ...
I'm thinking of binning the data into 20-day intervals, but the only approach I can come up with is a
SUM(CASE WHEN Date BETWEEN <interval1_startdate> AND <interval1_enddate> ...)
expression per interval, which is exhausting to write out against the actual data (millions of rows). So I need help automating the binning part.
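For reference, written out in full, that manual pattern would look something like this (the table name orders and the exact bucket boundaries are illustrative only):
-- One hand-written 20-day bucket per column; every additional interval
-- means another line of code, which is what I want to avoid
SELECT
SUM(CASE WHEN Date BETWEEN '2020-01-01' AND '2020-01-20' THEN count_orders ELSE 0 END) AS bucket_01,
SUM(CASE WHEN Date BETWEEN '2020-01-21' AND '2020-02-09' THEN count_orders ELSE 0 END) AS bucket_02,
SUM(CASE WHEN Date BETWEEN '2020-02-10' AND '2020-02-29' THEN count_orders ELSE 0 END) AS bucket_03
-- ... and so on, one SUM(CASE ...) per interval
FROM orders;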
Desired output would either be
1)
interval                  total_count
01-01-2020 - 01-20-2020   31
01-21-2020 - 02-10-2020   24
02-10-2020 - 03-01-2020   10
...                       ...
or 2)
start        end          total_count
01-01-2020   01-20-2020   31
01-21-2020   02-10-2020   24
02-10-2020   03-01-2020   10
...          ...          ...
Do you have any ideas?

You can group by (current date - minimum date) / 20. For Presto, something like this:
WITH dataset (customer_id, Date, count_orders) AS (
VALUES (1, date_parse('01-01-2020', '%m-%d-%Y'), 5),
(1, date_parse('01-13-2020', '%m-%d-%Y'), 26),
(1, date_parse('02-06-2020', '%m-%d-%Y'), 11),
(2, date_parse('01-17-2020', '%m-%d-%Y'), 9),
(3, date_parse('02-04-2020', '%m-%d-%Y'), 13),
(3, date_parse('03-29-2020', '%m-%d-%Y'), 24),
(4, date_parse('04-05-2020', '%m-%d-%Y'), 1),
(5, date_parse('02-23-2020', '%m-%d-%Y'), 10),
(6, date_parse('03-15-2020', '%m-%d-%Y'), 7),
(6, date_parse('04-18-2020', '%m-%d-%Y'), 32)
)
SELECT date_add('day', 20 * grp, min(min_date)) interval_start,
date_add('day', 20 * (grp + 1) - 1, min(min_date)) interval_end,
sum(count_orders) total_count
FROM (
SELECT *,
date_diff('day', min(date) over (), date) / 20 as grp,
min(date) over () min_date
FROM dataset
)
group by grp
order by 1
Output:
interval_start            interval_end              total_count
2020-01-01 00:00:00.000   2020-01-20 00:00:00.000   40
2020-01-21 00:00:00.000   2020-02-09 00:00:00.000   24
2020-02-10 00:00:00.000   2020-02-29 00:00:00.000   10
2020-03-01 00:00:00.000   2020-03-20 00:00:00.000   7
2020-03-21 00:00:00.000   2020-04-09 00:00:00.000   25
2020-04-10 00:00:00.000   2020-04-29 00:00:00.000   32

You can get the intervals using a recursive CTE and then get the totals using CROSS APPLY.
Drop table Tbl
Create Table Tbl (customer_id Int, [date] Date, count_orders Int)
Insert Into Tbl (customer_id, [date], count_orders)
Values (1,'2020-01-01', 5),
(1,'2020-01-13',26),
(1,'2020-02-06',11),
(2,'2020-01-17',9),
(3,'2020-02-04',13),
(3,'2020-03-29',24),
(4,'2020-04-05',1),
(5,'2020-02-23',10),
(6,'2020-03-15',7),
(6,'2020-04-18',32)
;With A As (
Select Min([date]) As start, DateAdd(dd,19,Min([date])) As [end], Max([date]) As [max]
From Tbl
Union All
Select DateAdd(dd,1,[end]) As start, DateAdd(dd,20,[end]) As [end], [max]
From A
Where [end]<[max])
Select A.[start], A.[end], T.total_count
From A Cross Apply (Select SUM(count_orders) As total_count
From Tbl Where [date] between A.[start] And A.[end]) As T
Result:
start end total_count
---------- ---------- -----------
2020-01-01 2020-01-20 40
2020-01-21 2020-02-09 24
2020-02-10 2020-02-29 10
2020-03-01 2020-03-20 7
2020-03-21 2020-04-09 25
2020-04-10 2020-04-29 32
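One caveat worth noting about the recursive approach: SQL Server caps recursive CTEs at 100 recursion levels by default, so if the span between the earliest and latest [date] covers more than 100 twenty-day bins, the query fails unless the limit is raised. A sketch of the adjusted final SELECT:
Select A.[start], A.[end], T.total_count
From A Cross Apply (Select SUM(count_orders) As total_count
From Tbl Where [date] between A.[start] And A.[end]) As T
Option (MaxRecursion 0) -- 0 removes the default 100-level recursion limit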

Related

How to transform data into daily snapshot given the two date columns?

I have product data in my table which looks similar to this
product_id   user_id   sales_start   sales_end    quantity
1            12        2022-01-01    2022-02-01   15
2            234       2022-11-01    2022-12-31   123
I want to transform the table into a daily snapshot so that it would look something like this:
product_id   user_id   quantity   date
1            12        15         2022-01-01
1            12        15         2022-01-02
1            12        15         2022-01-03
...          ...       ...        ...
2            234       123        2022-12-31
I know how to do a similar thing in Pandas, but I need to do it within AWS Athena.
I thought of generating the date range and unnesting it, but I am struggling with mapping the values properly.
Any ideas on how to transform the data?
This will help you. Use sequence:
SELECT product_id, user_id, quantity, date(date) as date FROM(
VALUES
(1, 12, DATE '2022-01-01', DATE '2022-02-01', 15),
(2, 234, DATE '2022-11-01', DATE '2022-12-31', 123)
) AS t (product_id, user_id, sales_start, sales_end, quantity),
UNNEST(sequence(sales_start, sales_end, interval '1' day)) t(date)
You can use sequence to generate the date range and then unnest it:
-- sample data
with dataset(product_id, user_id, sales_start, sales_end, quantity) as (
values (1, 12 , date '2022-01-01', date '2022-01-05', 15), -- short date ranges
(2, 234, date '2022-11-01', date '2022-11-03', 123) -- short date ranges
)
-- query
select product_id, user_id, quantity, date
from dataset,
unnest(sequence(sales_start, sales_end, interval '1' day)) as t(date);
Output:
product_id   user_id   quantity   date
1            12        15         2022-01-01
1            12        15         2022-01-02
1            12        15         2022-01-03
1            12        15         2022-01-04
1            12        15         2022-01-05
2            234       123        2022-11-01
2            234       123        2022-11-02
2            234       123        2022-11-03

Function that returns MAX OR MIN dates based on ID count

I have a task in SQL Server where I need to return the RESULT_DATE column using the ID, PRODUCT_ID and DATE columns. Task criteria:
If the DATE column is filled only once for a PRODUCT_ID, then I need to return that single date (as for PRODUCT_ID 1 and 3). Let's say it's the MIN date.
If the DATE column is filled more than once (as for PRODUCT_ID 2), then I need to return the next filled DATE row.
Data:
CREATE TABLE #temp (
ID INT,
PRODUCT_ID INT,
[DATE] DATETIME
)
INSERT #temp (ID, PRODUCT_ID, DATE) VALUES
(1, 1, '2008-04-24 00:00:00.000'),
(2, 1, NULL),
(3, 2, '2015-12-09 00:00:00.000'),
(4, 2, NULL),
(5, 2, NULL),
(6, 2, '2022-01-01 13:06:45.253'),
(7, 2, NULL),
(8, 2, '2022-01-19 13:06:45.253'),
(9, 3, '2018-04-25 00:00:00.000'),
(10,3, NULL),
(11,3, NULL)
Expected output:
ID  PRODUCT_ID  DATE                     RESULT_DATE
1   1           2008-04-24 00:00:00.000  2008-04-24 00:00:00.000
2   1           NULL                     2008-04-24 00:00:00.000
3   2           2015-12-09 00:00:00.000  2022-01-01 13:06:45.253
4   2           NULL                     2022-01-01 13:06:45.253
5   2           NULL                     2022-01-01 13:06:45.253
6   2           2022-01-01 13:06:45.253  2022-01-19 13:06:45.253
7   2           NULL                     2022-01-19 13:06:45.253
8   2           2022-01-19 13:06:45.253  2022-01-19 13:06:45.253
9   3           2018-04-25 00:00:00.000  2018-04-25 00:00:00.000
10  3           NULL                     2018-04-25 00:00:00.000
11  3           NULL                     2018-04-25 00:00:00.000
I have tried different techniques, for example combinations of the LEAD and LAG functions. My latest script (which still does not work):
SELECT
COALESCE(DATE,
CAST(
SUBSTRING(
MAX(CAST(DATE AS BINARY(4)) + CAST(DATE AS BINARY(4))) OVER ( PARTITION BY PRODUCT_ID ORDER BY DATE ROWS UNBOUNDED PRECEDING)
,5,4)
AS INT)
) AS RESULT_DATE,
*
FROM TABLE
You can use a CTE to select all rows with a non-NULL date, giving each a row number per PRODUCT_ID. A second CTE then keeps, for each PRODUCT_ID, the row whose row number is the largest one below 3 (i.e. the second date if it exists, otherwise the first). Finally, join this CTE back to the original table to supply that date to every row:
Set Up
CREATE TABLE #temp (
ID INT,
PRODUCT_ID INT,
MyDATE DATETIME
)
INSERT #temp (ID, PRODUCT_ID, MyDate)
VALUES
(1, 1, '2008-04-24 00:00:00.000'),
(2, 1, NULL),
(3, 2, '2015-12-09 00:00:00.000'),
(4, 2, NULL),
(5, 2, NULL),
(6, 2, '2022-01-01 13:06:45.253'),
(7, 2, NULL),
(8, 2, '2022-01-19 13:06:45.253'),
(9, 3, '2018-04-25 00:00:00.000'),
(10,3, NULL),
(11,3, NULL);
Query:
;WITH CTE
AS
(
SELECT ID, Product_ID, MyDate,
ROW_NUMBER() OVER (PARTITION BY Product_ID ORDER BY Id) AS rn
from #temp
WHERE MyDate IS NOT NULL
),
CTE2
AS
(
SELECT *
FROM CTE C1
WHERE C1.rn < 3
AND
C1.rn =
(SELECT MAX(rn) FROM CTE WHERE Product_Id = C1.Product_Id AND rn<3)
)
SELECT T.Id, T.Product_Id, T.MyDate, C.MyDate As Result_date
FROM #temp T
INNER JOIN CTE2 C
ON T.Product_Id = C.Product_Id
ORDER BY T.Id;
Results:
Id Product_Id MyDate Result_Date
1 1 2008-04-24 00:00:00.000 2008-04-24 00:00:00.000
2 1 NULL 2008-04-24 00:00:00.000
3 2 2015-12-09 00:00:00.000 2022-01-01 13:06:45.253
4 2 NULL 2022-01-01 13:06:45.253
5 2 NULL 2022-01-01 13:06:45.253
6 2 2022-01-01 13:06:45.253 2022-01-01 13:06:45.253
7 2 NULL 2022-01-01 13:06:45.253
8 2 2022-01-19 13:06:45.253 2022-01-01 13:06:45.253
9 3 2018-04-25 00:00:00.000 2018-04-25 00:00:00.000
10 3 NULL 2018-04-25 00:00:00.000
11 3 NULL 2018-04-25 00:00:00.000

How to "unstack" data in SQL in Microsoft SQL Server

My table has this shape:
UTC_DT ID value
-----------------------------
2021-09-29 12:30:00 1 10
2021-09-29 12:30:00 2 20
2021-09-29 12:30:00 3 30
2021-09-29 12:45:00 1 11
2021-09-29 12:45:00 2 21
2021-09-29 12:45:00 3 31
I need this shape:
UTC_DT 1 2 3
------------------------------
2021-09-29 12:30:00 10 20 30
2021-09-29 12:45:00 11 21 31
I can't figure out how to do that. I thought of using PIVOT, but I can't work out the correct syntax. Please help.
You can do this with a simple conditional aggregation like so:
select utc_dt,
max(case id when 1 then value end) [1],
max(case id when 2 then value end) [2],
max(case id when 3 then value end) [3]
from t
group by utc_dt
Since you asked for a pivot example, here you go:
WITH cte AS (
SELECT * FROM (VALUES
('2021-09-29 12:30:00', 1, 10),
('2021-09-29 12:30:00', 2, 20),
('2021-09-29 12:30:00', 3, 30),
('2021-09-29 12:45:00', 1, 11),
('2021-09-29 12:45:00', 2, 21),
('2021-09-29 12:45:00', 3, 31)
) AS x(UTC_DT, ID, value)
)
SELECT pvt.*
FROM (
SELECT *
FROM cte
) AS src
PIVOT (
MAX(value)
FOR ID IN ([1], [2], [3])
) AS pvt;
It becomes rough when you don't know the set of (in this case) IDs that you want to pivot.
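If the ID list has to be discovered at runtime, one common workaround (a sketch, not part of the answer above; it assumes the data sits in a permanent table named t and SQL Server 2017+ for STRING_AGG) is to build the column list dynamically and run the pivot through sp_executesql:
DECLARE @cols nvarchar(max), @sql nvarchar(max);

-- Build a bracketed, comma-separated column list from the IDs actually present
SELECT @cols = STRING_AGG(QUOTENAME(CAST(ID AS nvarchar(10))), ', ')
FROM (SELECT DISTINCT ID FROM t) AS ids;

SET @sql = N'
SELECT UTC_DT, ' + @cols + N'
FROM t
PIVOT (MAX(value) FOR ID IN (' + @cols + N')) AS pvt;';

EXEC sys.sp_executesql @sql;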

How do I check missing values from a table which represents multiple hourly time series (DB2)?

Imagine I have a DB2 table called FOO with a time series id, a timestamp value every hour and an integer number. This is the definition:
CREATE TABLE FOO(
    Id_timeseries INTEGER NOT NULL,
    number DECIMAL(10, 3) NOT NULL,
    timestamp TIMESTAMP NOT NULL
);
I would like to know, for each time series (imagine that there are several of them), whether there are missing values between two given dates and what those missing dates are (I suppose that giving the ranges of those values would be a lot more difficult).
Example:
Id_timeseries number timestamp
1 28 2017-01-01 01:00:00
1 28 2017-01-01 02:00:00
1 28 2017-01-01 03:00:00
2 28 2017-01-01 01:00:00
2 28 2017-01-01 02:00:00
2 28 2017-01-01 03:00:00
1 28 2017-01-01 07:00:00
1 28 2017-01-01 06:00:00
And I want to know the missing hourly values from 2017-01-01 00:00:00 to 2017-01-02 00:00:00.
Output:
Id_timeseries  from                 to
1              2017-01-01 00:00:00  2017-01-01 00:00:00
1              2017-01-01 04:00:00  2017-01-01 05:00:00
1              2017-01-01 08:00:00  2017-01-01 23:00:00
2              2017-01-01 00:00:00  2017-01-01 00:00:00
2              2017-01-01 04:00:00  2017-01-01 23:00:00
Try this:
WITH FOO (Id_timeseries, number, timestamp) AS
(
VALUES
(1, 28, timestamp('2017-01-01 01:00:00'))
, (1, 28, timestamp('2017-01-01 02:00:00'))
, (1, 28, timestamp('2017-01-01 03:00:00'))
, (1, 28, timestamp('2017-01-01 06:00:00'))
, (1, 28, timestamp('2017-01-01 07:00:00'))
--, (1, 28, timestamp('2017-01-01 00:00:00'))
--, (1, 28, timestamp('2017-01-01 23:00:00'))
--, (1, 28, timestamp('2017-01-02 00:00:00'))
, (2, 28, timestamp('2017-01-01 01:00:00'))
, (2, 28, timestamp('2017-01-01 02:00:00'))
, (2, 28, timestamp('2017-01-01 03:00:00'))
)
-- Internal gaps
SELECT Id_timeseries, timestamp_prev + 1 hour as from, timestamp - 1 hour as to
FROM
(
SELECT Id_timeseries, timestamp, lag(timestamp) over (partition by Id_timeseries order by timestamp) timestamp_prev
FROM FOO
)
WHERE timestamp_prev <> timestamp - 1 hour
-- Start gap
UNION ALL
SELECT Id_timeseries, timestamp('2017-01-01 00:00:00') as from, min(timestamp) - 1 hour as to
FROM FOO
GROUP BY Id_timeseries
HAVING timestamp('2017-01-01 00:00:00') <> min(timestamp)
-- End gap
UNION ALL
SELECT Id_timeseries, max(timestamp) + 1 hour as from, timestamp('2017-01-02 00:00:00') as to
FROM FOO
GROUP BY Id_timeseries
HAVING timestamp('2017-01-02 00:00:00') <> max(timestamp)
ORDER BY Id_timeseries, from;
The result is:
|ID_TIMESERIES|FROM |TO |
|-------------|--------------------------|--------------------------|
|1 |2017-01-01-00.00.00.000000|2017-01-01-00.00.00.000000|
|1 |2017-01-01-04.00.00.000000|2017-01-01-05.00.00.000000|
|1 |2017-01-01-08.00.00.000000|2017-01-02-00.00.00.000000|
|2 |2017-01-01-00.00.00.000000|2017-01-01-00.00.00.000000|
|2 |2017-01-01-04.00.00.000000|2017-01-02-00.00.00.000000|

Determine contiguous date intervals

I have the following table structure:
id int -- more like a group id, not unique in the table
AddedOn datetime -- when the record was added
For a specific id there is at most one record each day. I have to write a query that returns contiguous (at day level) date intervals for each id.
The expected result structure is:
id int
StartDate datetime
EndDate datetime
Note that the time part of AddedOn is available but it is not important here.
To make it clearer, here is some input data:
with data as
(
select * from
(
values
(0, getdate()), --dummy record used to infer column types
(1, '20150101'),
(1, '20150102'),
(1, '20150104'),
(1, '20150105'),
(1, '20150106'),
(2, '20150101'),
(2, '20150102'),
(2, '20150103'),
(2, '20150104'),
(2, '20150106'),
(2, '20150107'),
(3, '20150101'),
(3, '20150103'),
(3, '20150105'),
(3, '20150106'),
(3, '20150108'),
(3, '20150109'),
(3, '20150110')
) as d(id, AddedOn)
where id > 0 -- exclude dummy record
)
select * from data
And the expected result:
id StartDate EndDate
1 2015-01-01 2015-01-02
1 2015-01-04 2015-01-06
2 2015-01-01 2015-01-04
2 2015-01-06 2015-01-07
3 2015-01-01 2015-01-01
3 2015-01-03 2015-01-03
3 2015-01-05 2015-01-06
3 2015-01-08 2015-01-10
Although it looks like a common problem, I couldn't find a similar enough question. I'm also getting closer to a solution and will post it when (and if) it works, but I feel there should be a more elegant one.
Here's an answer without any fancy joins, simply using GROUP BY and ROW_NUMBER, which is not only simple but also more efficient.
WITH CTE_dayOfYear
AS
(
SELECT id,
AddedOn,
DATEDIFF(DAY,'20000101',AddedOn) dyID,
ROW_NUMBER() OVER (ORDER BY ID,AddedOn) row_num
FROM data
)
SELECT ID,
MIN(AddedOn) StartDate,
MAX(AddedOn) EndDate,
dyID-row_num AS groupID
FROM CTE_dayOfYear
GROUP BY ID,dyID - row_num
ORDER BY ID,2,3
The logic is that dyID is based on the date, so it has gaps, while row_num has none. Every gap in the dates therefore changes the difference between dyID and row_num, and I simply use that difference as my groupID.
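If it helps to see the trick in isolation, here is a small self-contained sketch (only the rows for id 1, inlined so it runs on its own) that exposes the intermediate columns:
-- Sample rows for id 1 only, so dyID, row_num and their difference are easy to read
WITH data (id, AddedOn) AS (
SELECT 1, CONVERT(datetime, '20150101') UNION ALL
SELECT 1, '20150102' UNION ALL
SELECT 1, '20150104' UNION ALL
SELECT 1, '20150105' UNION ALL
SELECT 1, '20150106'
),
CTE_dayOfYear AS (
SELECT id,
AddedOn,
DATEDIFF(DAY, '20000101', AddedOn) AS dyID,
ROW_NUMBER() OVER (ORDER BY id, AddedOn) AS row_num
FROM data
)
SELECT id, AddedOn, dyID, row_num, dyID - row_num AS groupID
FROM CTE_dayOfYear
ORDER BY id, AddedOn;
-- Consecutive days ('20150101', '20150102') keep dyID - row_num constant, so they share a groupID;
-- the jump to '20150104' advances dyID by 2 but row_num by only 1, so a new group starts.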
In SQL Server 2008 it is a little bit of a pain without the LEAD and LAG functions:
WITH data
AS ( SELECT * ,
ROW_NUMBER() OVER ( ORDER BY id, AddedOn ) AS rn
FROM ( VALUES ( 0, GETDATE()), --dummy record used to infer column types
( 1, '20150101'), ( 1, '20150102'), ( 1, '20150104'),
( 1, '20150105'), ( 1, '20150106'), ( 2, '20150101'),
( 2, '20150102'), ( 2, '20150103'), ( 2, '20150104'),
( 2, '20150106'), ( 2, '20150107'), ( 3, '20150101'),
( 3, '20150103'), ( 3, '20150105'), ( 3, '20150106'),
( 3, '20150108'), ( 3, '20150109'), ( 3, '20150110') )
AS d ( id, AddedOn )
WHERE id > 0 -- exclude dummy record
),
diff
AS ( SELECT d1.* ,
CASE WHEN ISNULL(DATEDIFF(dd, d2.AddedOn, d1.AddedOn),
1) = 1 THEN 0
ELSE 1
END AS diff
FROM data d1
LEFT JOIN data d2 ON d1.id = d2.id
AND d1.rn = d2.rn + 1
),
parts
AS ( SELECT * ,
( SELECT SUM(diff)
FROM diff d2
WHERE d2.rn <= d1.rn
) AS p
FROM diff d1
)
SELECT id ,
MIN(AddedOn) AS StartDate ,
MAX(AddedOn) AS EndDate
FROM parts
GROUP BY id ,
p
Output:
id StartDate EndDate
1 2015-01-01 00:00:00.000 2015-01-02 00:00:00.000
1 2015-01-04 00:00:00.000 2015-01-06 00:00:00.000
2 2015-01-01 00:00:00.000 2015-01-04 00:00:00.000
2 2015-01-06 00:00:00.000 2015-01-07 00:00:00.000
3 2015-01-01 00:00:00.000 2015-01-01 00:00:00.000
3 2015-01-03 00:00:00.000 2015-01-03 00:00:00.000
3 2015-01-05 00:00:00.000 2015-01-06 00:00:00.000
3 2015-01-08 00:00:00.000 2015-01-10 00:00:00.000
Walkthrough:
diff
This CTE returns data:
1 2015-01-01 00:00:00.000 1 0
1 2015-01-02 00:00:00.000 2 0
1 2015-01-04 00:00:00.000 3 1
1 2015-01-05 00:00:00.000 4 0
1 2015-01-06 00:00:00.000 5 0
You join the table to itself to get the previous row, then calculate the difference in days between the current row and the previous row; if the result is 1 day, pick 0, otherwise pick 1.
parts
This CTE selects the result of the previous step and sums up the new column (a cumulative sum: the sum of all values of the new column from the start up to the current row), so you get partitions to group by:
1 2015-01-01 00:00:00.000 1 0 0
1 2015-01-02 00:00:00.000 2 0 0
1 2015-01-04 00:00:00.000 3 1 1
1 2015-01-05 00:00:00.000 4 0 1
1 2015-01-06 00:00:00.000 5 0 1
2 2015-01-01 00:00:00.000 6 0 1
2 2015-01-02 00:00:00.000 7 0 1
2 2015-01-03 00:00:00.000 8 0 1
2 2015-01-04 00:00:00.000 9 0 1
2 2015-01-06 00:00:00.000 10 1 2
2 2015-01-07 00:00:00.000 11 0 2
3 2015-01-01 00:00:00.000 12 0 2
3 2015-01-03 00:00:00.000 13 1 3
The last step is just grouping by ID and the new column and picking the min and max dates.
I took the "Islands Solution #3 from SQL MVP Deep Dives" solution from https://www.simple-talk.com/sql/t-sql-programming/the-sql-of-gaps-and-islands-in-sequences/ and applied it to your test data:
with
data as
(
select * from
(
values
(0, getdate()), --dummy record used to infer column types
(1, '20150101'),
(1, '20150102'),
(1, '20150104'),
(1, '20150105'),
(1, '20150106'),
(2, '20150101'),
(2, '20150102'),
(2, '20150103'),
(2, '20150104'),
(2, '20150106'),
(2, '20150107'),
(3, '20150101'),
(3, '20150103'),
(3, '20150105'),
(3, '20150106'),
(3, '20150108'),
(3, '20150109'),
(3, '20150110')
) as d(id, AddedOn)
where id > 0 -- exclude dummy record
)
,CTE_Seq
AS
(
SELECT
ID
,SeqNo
,SeqNo - ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo) AS rn
FROM
data
CROSS APPLY
(
SELECT DATEDIFF(day, '20150101', AddedOn) AS SeqNo
) AS CA
)
SELECT
ID
,DATEADD(day, MIN(SeqNo), '20150101') AS StartDate
,DATEADD(day, MAX(SeqNo), '20150101') AS EndDate
FROM CTE_Seq
GROUP BY ID, rn
ORDER BY ID, StartDate;
Result set
ID StartDate EndDate
1 2015-01-01 00:00:00.000 2015-01-02 00:00:00.000
1 2015-01-04 00:00:00.000 2015-01-06 00:00:00.000
2 2015-01-01 00:00:00.000 2015-01-04 00:00:00.000
2 2015-01-06 00:00:00.000 2015-01-07 00:00:00.000
3 2015-01-01 00:00:00.000 2015-01-01 00:00:00.000
3 2015-01-03 00:00:00.000 2015-01-03 00:00:00.000
3 2015-01-05 00:00:00.000 2015-01-06 00:00:00.000
3 2015-01-08 00:00:00.000 2015-01-10 00:00:00.000
I'd recommend examining the intermediate results of CTE_Seq to understand how it actually works. Just put
select * from CTE_Seq
in place of the final SELECT ... GROUP BY .... You'll get this result set:
ID SeqNo rn
1 0 -1
1 1 -1
1 3 0
1 4 0
1 5 0
2 0 -1
2 1 -1
2 2 -1
2 3 -1
2 5 0
2 6 0
3 0 -1
3 2 0
3 4 1
3 5 1
3 7 2
3 8 2
3 9 2
Each date is converted into a sequence number by DATEDIFF(day, '20150101', AddedOn). ROW_NUMBER() generates a set of sequential numbers without gaps, so when these numbers are subtracted from a sequence with gaps the difference jumps/changes. The difference stays the same until the next gap, so in the final SELECT GROUP BY ID, rn brings all rows from the same island together.
Here is a simple solution that does not use analytic functions. I tend not to use them because I work with many different DBMSs, many of which don't (yet) have them implemented, and even those that do use different syntaxes. I'm just in the habit of writing generic code whenever possible.
with
Data( ID, AddedOn )as(
select 1, convert( date, '20150101' ) union all
select 1, '20150102' union all
select 1, '20150104' union all
select 1, '20150105' union all
select 1, '20150106' union all
select 2, '20150101' union all
select 2, '20150102' union all
select 2, '20150103' union all
select 2, '20150104' union all
select 2, '20150106' union all
select 2, '20150107' union all
select 3, '20150101' union all
select 3, '20150103' union all
select 3, '20150105' union all
select 3, '20150106' union all
select 3, '20150108' union all
select 3, '20150109' union all
select 3, '20150110'
)
select d.ID, d.AddedOn StartDate, IsNull( d1.AddedOn, '99991231' ) EndDate
from Data d
left join Data d1
on d1.ID = d.ID
and d1.AddedOn =(
select Min( AddedOn )
from data
where ID = d.ID
and AddedOn > d.AddedOn );
In your situation I assume that ID and AddedOn form a composite PK and so are indexed. Thus, the query will run impressively fast even on very large tables.
Also, I used the outer join because it seemed like the last AddedOn date of each ID should be seen in the StartDate column. Instead of NULL I used a common MaxDate value. The NULL could work just as well as a "this is the latest StartDate row" flag.
Here is the output for ID=1:
ID StartDate EndDate
----------- ---------- ----------
1 2015-01-01 2015-01-02
1 2015-01-02 2015-01-04
1 2015-01-04 2015-01-05
1 2015-01-05 2015-01-06
1 2015-01-06 9999-12-31
I'd like to post my own solution too because it's yet another approach:
with data as
(
...
),
temp as
(
select d.id
,d.AddedOn
,dprev.AddedOn as PrevAddedOn
,dnext.AddedOn as NextAddedOn
FROM data d
left JOIN
data dprev on dprev.id = d.id
and dprev.AddedOn = dateadd(d, -1, d.AddedOn)
left JOIN
data dnext on dnext.id = d.id
and dnext.AddedOn = dateadd(d, 1, d.AddedOn)
),
starts AS
(
select id
,AddedOn
from temp
where PrevAddedOn is NULL
),
ends as
(
select id
,AddedOn
from temp
where NextAddedon is NULL
)
SELECT s.id as id
,s.AddedOn as StartDate
,(select min(e.AddedOn) from ends e where e.id = s.id and e.AddedOn >= s.AddedOn) as EndDate
from starts s