Aggregate log table records to avoid redundancy - sql

I have a table for product change tracking that looks like this:
CREATE TABLE ProductHistory (
ProductId INT NOT NULL,
Name NVARCHAR(50) NOT NULL,
Price MONEY NOT NULL,
StartDate DATETIME NOT NULL,
EndDate DATETIME NOT NULL
)
INSERT INTO ProductHistory VALUES
(1, 'Phone', 100, '2020-11-20 00:00', '2020-11-20 01:00'), /* initial */
(1, 'Phone', 100, '2020-11-20 01:01', '2020-11-20 02:00'), /* no change */
(1, 'Phone', 200, '2020-11-20 02:01', '2020-11-20 03:00'), /* changed, current */
(2, 'Apple', 5, '2020-11-20 00:00', '2020-11-20 01:00'), /* initial */
(2, 'Apple', 10, '2020-11-20 01:01', '2020-11-20 02:00'), /* changed */
(2, 'Pineapple', 10, '2020-11-20 02:01', '2020-11-20 03:00'), /* changed, current */
(3, 'Orange juice', 100, '2020-11-21 00:00', '2020-11-21 01:00'), /* initial */
(3, 'Orange juice', 100, '2020-11-21 01:01', '2020-11-21 02:00'), /* no change */
(3, 'Orange juice', 100, '2020-11-21 02:01', '2020-11-21 03:00') /* no change, current */
I was hoping to come up with a query to get the results below. Notice that the records without actual changes are supposed to be merged together so that there is no redundancy.
ProductId Name Price StartDate EndDate
----------- -------------- ------- -------------------------------------- --------------------------------------
1 Phone 100 2020-11-20 00:00:00.000 (first row) 2020-11-20 02:00:00.000 (second row)
1 Phone 200 2020-11-20 02:01:00.000 (third row) 2020-11-20 03:00:00.000 (third row)
2 Apple 5 2020-11-20 00:00:00.000 (first row) 2020-11-20 01:00:00.000 (first row)
2 Apple 10 2020-11-20 01:01:00.000 (second row) 2020-11-20 02:00:00.000 (second row)
2 Pineapple 10 2020-11-20 02:01:00.000 (third row) 2020-11-20 03:00:00.000 (third row)
3 Orange juice 100 2020-11-21 00:00:00.000 (first row) 2020-11-21 03:00:00.000 (third row)
The closest I got to was the following:
; WITH history AS (
SELECT
ProductId,
Name,
Price,
StartDate,
EndDate
FROM (
SELECT
ROW_NUMBER() OVER (PARTITION BY ProductId ORDER BY StartDate DESC) 'RowNumber',
*
FROM ProductHistory
) history
WHERE history.RowNumber = 1 -- select newest row per ProductId
UNION ALL
SELECT
previous.ProductId,
previous.Name,
previous.Price,
previous.StartDate,
EndDate
FROM (
SELECT
ROW_NUMBER() OVER (PARTITION BY previous.ProductId ORDER BY previous.StartDate DESC) 'RowNumber',
previous.*
FROM history [current]
INNER JOIN ProductHistory previous
ON previous.ProductId = [current].ProductId
AND previous.StartDate < [current].StartDate
AND (
previous.Name <> [current].Name
OR previous.Price <> [current].Price
)
) previous
WHERE previous.RowNumber = 1 -- select previous row of each ProductId, recursively
)
SELECT *
FROM history
ORDER BY
ProductId,
StartDate
ProductId Name Price StartDate EndDate
----------- -------------- -------- ------------------------- -------------------------
1 Phone 100,00 2020-11-20 01:01:00.000 2020-11-20 02:00:00.000
1 Phone 200,00 2020-11-20 02:01:00.000 2020-11-20 03:00:00.000
2 Apple 5,00 2020-11-20 00:00:00.000 2020-11-20 01:00:00.000
2 Apple 10,00 2020-11-20 01:01:00.000 2020-11-20 02:00:00.000
2 Pineapple 10,00 2020-11-20 02:01:00.000 2020-11-20 03:00:00.000
3 Orange juice 100,00 2020-11-21 02:01:00.000 2020-11-21 03:00:00.000
While the Name and Price column values are right, I'm not sure how to aggregate the StartDate and EndDate columns to get what I need. All code is available in this fiddle, if it's of any help.

This is a type of gaps-and-islands problem. Probably the simplest method is the difference of row numbers:
select productid, name, price, min(startdate), max(enddate)
from (select ph.*,
row_number() over (partition by productid order by startdate) as seqnum,
row_number() over (partition by productid, name, price order by startdate) as seqnum_2
from producthistory
) ph
group by productid, name, price, (seqnum - seqnum_2);
This assumes that there are no gaps in the time frames -- which seems reasonable with this data model.
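If you want to check that assumption against the data first, a quick sanity query (a sketch only, assuming the minute-level precision of the sample rows) could compare each StartDate to the previous EndDate with LAG:

```sql
-- Find rows whose StartDate is not exactly one minute after the
-- previous EndDate for the same product (i.e. gaps or overlaps).
SELECT *
FROM (
    SELECT ph.*,
           LAG(EndDate) OVER (PARTITION BY ProductId ORDER BY StartDate) AS PrevEndDate
    FROM ProductHistory ph
) t
WHERE PrevEndDate IS NOT NULL
  AND StartDate <> DATEADD(minute, 1, PrevEndDate);
```

An empty result means the time frames tile perfectly and the row-number difference is safe to use.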
Why does this work? That is a little hard to explain. But if you look at the results of the subquery, you will see how the difference between the two row numbers is constant for adjacent rows where name and price are the same.
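Running just the inner subquery for a single product makes the trick visible. For ProductId 1 from the sample data:

```sql
-- Inspect the two row numbers side by side for ProductId 1.
SELECT ProductId, Name, Price, StartDate,
       ROW_NUMBER() OVER (PARTITION BY ProductId ORDER BY StartDate) AS seqnum,
       ROW_NUMBER() OVER (PARTITION BY ProductId, Name, Price ORDER BY StartDate) AS seqnum_2
FROM ProductHistory
WHERE ProductId = 1;
-- The two (Phone, 100) rows get seqnum 1,2 and seqnum_2 1,2 -> difference 0 for both.
-- The (Phone, 200) row gets seqnum 3 but seqnum_2 restarts at 1 -> difference 2.
-- Grouping by (seqnum - seqnum_2) therefore separates the two islands.
```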

Related

How to get date ranges over equal consecutive values having a single date

I have a table like this:
CREATE TABLE Rates
(
RateGroup int NOT NULL,
Rate decimal(5, 2) NOT NULL,
DueDate date NOT NULL
);
This table contains rates which are valid from a certain due date until the day before the next due date. If no next due date is present, the rate's validity has no end. There can be multiple consecutive due dates with the same rate, and a given rate can appear on non-consecutive due dates as well.
The rates are divided into groups. A single due date can appear in multiple groups but only once per group.
Here's some example data:
INSERT INTO Rates(RateGroup, Rate, DueDate)
VALUES
(1, 1.2, '20210101'), (1, 1.2, '20210215'), (1, 1.5, '20210216'),
(1, 1.2, '20210501'), (2, 3.7, '20210101'), (2, 3.7, '20210215'),
(2, 3.7, '20210216'), (2, 3.7, '20210501'), (3, 2.9, '20210101'),
(3, 2.5, '20210215'), (3, 2.5, '20210216'), (3, 2.1, '20210501');
RateGroup  Rate  DueDate
---------  ----  ----------
1          1.20  2021-01-01
1          1.20  2021-02-15
1          1.50  2021-02-16
1          1.20  2021-05-01
2          3.70  2021-01-01
2          3.70  2021-02-15
2          3.70  2021-02-16
2          3.70  2021-05-01
3          2.90  2021-01-01
3          2.50  2021-02-15
3          2.50  2021-02-16
3          2.10  2021-05-01
Now I want a query which folds multiple consecutive rows of a rate group with the same rate to a single row containing the date range (start and end date) where the rate is valid.
This is the desired result:
RateGroup  Rate  StartDate   EndDate
---------  ----  ----------  ----------
1          1.20  2021-01-01  2021-02-15
1          1.50  2021-02-16  2021-04-30
1          1.20  2021-05-01  NULL
2          3.70  2021-01-01  NULL
3          2.90  2021-01-01  2021-02-14
3          2.50  2021-02-15  2021-04-30
3          2.10  2021-05-01  NULL
How can I achieve this?
This can be done with Common Table Expressions utilizing the OVER clause, as in the following query:
WITH
RatesWithBegin AS
(
SELECT RateGroup, Rate, DueDate,
CASE
WHEN Rate = LAG(Rate) OVER (PARTITION BY RateGroup ORDER BY DueDate)
THEN 0
ELSE 1
END AS IsBegin
FROM Rates
),
RatesFromTo AS
(
SELECT RateGroup, Rate, DueDate AS StartDate,
LEAD (DATEADD(day, -1, DueDate)) OVER
(
PARTITION BY RateGroup
ORDER BY DueDate
) AS EndDate,
SUM (IsBegin) OVER
(
PARTITION BY RateGroup
ORDER BY DueDate
ROWS UNBOUNDED PRECEDING
) AS RangeID
FROM RatesWithBegin
)
SELECT RateGroup, MAX(Rate) AS Rate, MIN(StartDate) AS StartDate,
NULLIF(MAX(COALESCE(EndDate, '99990101')), '99990101') AS EndDate
FROM RatesFromTo
GROUP BY RateGroup, RangeID
ORDER BY RateGroup, StartDate;
How does it work?
RatesWithBegin AS
(
SELECT RateGroup, Rate, DueDate,
CASE
WHEN Rate = LAG(Rate) OVER (PARTITION BY RateGroup ORDER BY DueDate)
THEN 0
ELSE 1
END AS IsBegin
FROM Rates
),
Here we are using LAG()
to compare the current rate to its predecessor. PARTITION BY RateGroup makes sure that we don't
mix rate groups, and ORDER BY DueDate determines the order in which we look at the rows.
If the current rate is equal to its predecessor, we mark the row with a 0, otherwise with a 1.
The result of this CTE would look like so for the first rate group:
RateGroup  Rate  DueDate     IsBegin
---------  ----  ----------  -------
1          1.20  2021-01-01  1
1          1.20  2021-02-15  0
1          1.50  2021-02-16  1
1          1.20  2021-05-01  1
0 and 1 are not arbitrary values; they are needed for the next step.
RatesFromTo AS
(
SELECT RateGroup, Rate, DueDate AS StartDate,
LEAD (DATEADD(day, -1, DueDate)) OVER
(
PARTITION BY RateGroup
ORDER BY DueDate
) AS EndDate,
SUM (IsBegin) OVER
(
PARTITION BY RateGroup
ORDER BY DueDate
ROWS UNBOUNDED PRECEDING
) AS RangeID
FROM RatesWithBegin
)
In this CTE we build a running total over the IsBegin column with
SUM(). Because the value is 1 at the start
of a new range and 0 within a range, the running total increments exactly at the beginning of each new
range. This yields a unique number for each range.
With LEAD() we add
the day before the next due date to the output. The result of this step for the first rate group is then:
RateGroup  Rate  StartDate   EndDate     RangeID
---------  ----  ----------  ----------  -------
1          1.20  2021-01-01  2021-02-14  1
1          1.20  2021-02-15  2021-02-15  1
1          1.50  2021-02-16  2021-04-30  2
1          1.20  2021-05-01  NULL        3
SELECT RateGroup, MAX(Rate) AS Rate, MIN(StartDate) AS StartDate,
NULLIF(MAX(COALESCE(EndDate, '99990101')), '99990101') AS EndDate
FROM RatesFromTo
GROUP BY RateGroup, RangeID
ORDER BY RateGroup, StartDate;
Now that we have a unique identifier (RangeID) for the date ranges, we can do a simple
aggregation with GROUP BY to get the desired result. Since the ranges can be open-ended
(no next due date), we use
NULLIF(MAX(COALESCE(EndDate, '99990101')), '99990101')
to make sure that NULL is always treated as the latest possible date.
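As a minimal standalone illustration of that round trip (not part of the solution itself):

```sql
-- COALESCE maps NULL to the sentinel 9999-01-01, so MAX picks the open end;
-- NULLIF then turns the sentinel back into NULL.
SELECT NULLIF(MAX(COALESCE(EndDate, '99990101')), '99990101') AS EndDate
FROM (VALUES (CAST('20210215' AS date)), (NULL)) AS v(EndDate);
-- With one dated row and one NULL row, the result is NULL (open-ended),
-- not 2021-02-15, which is exactly what we want.
```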

Is this possible in SQL? Min and Max Dates On a Total. Where it changes in between Dates

I am trying to figure out how to write a query that will give me the correct historical data between dates, using only SQL. I know it is possible by coding a loop, but I'm not sure whether this can be done in a single SQL query. Dates are DD/MM/YYYY.
An Example of Data
ID  Points  DATE
--  ------  ----------
1   10      01/01/2018
1   20      02/01/2019
1   25      03/01/2020
1   10      04/01/2021
With a simple query
SELECT ID, Points, MIN(Date), MAX(Date)
FROM table
GROUP BY ID,POINTS
The Min date for 10 points would be 01/01/2018 and the Max Date would be 04/01/2021, which would be wrong in this instance. It should be:
ID  Points  Min DATE    Max DATE
--  ------  ----------  ----------
1   10      01/01/2018  01/01/2019
1   20      02/01/2019  02/01/2020
1   25      03/01/2020  03/01/2021
1   10      04/01/2021  04/01/2021
I was thinking of using LAG, but need some ideas here. What I haven't told you is that there is a record per day, so I would need to group until the points value changes. This is to create a view from the data that I already have.
It looks like - for your sample data set - the following LEAD should suffice:
select id, points, date as MinDate,
IsNull(DateAdd(day, -1, Lead(Date,1) over(partition by Id order by Date)), Date) as MaxDate
from t
Example Fiddle
I'm guessing you want the MAX date to be 1 day before the next MIN date.
And you can use the window function LEAD to get the next MIN date.
And if you group also by the year, then the date ranges match the expected result.
SELECT ID, Points
, MIN([Date]) AS [Min Date]
, COALESCE(DATEADD(day, -1, LEAD(MIN([Date])) OVER (PARTITION BY ID ORDER BY MIN([Date]))), MAX([Date])) AS [Max Date]
FROM your_table
GROUP BY ID, Points, YEAR([Date]);
ID  Points  Min Date    Max Date
--  ------  ----------  ----------
1   10      2018-01-01  2019-01-01
1   20      2019-01-02  2020-01-02
1   25      2020-01-03  2021-01-03
1   10      2021-01-04  2021-01-04
Test on db<>fiddle here
We can do this by creating two tables, one with the minimum and one with the maximum date for each grouping, and then combining them:
CREATE TABLE dataa(
id INT,
points INT,
ddate DATE);
INSERT INTO dataa values(1 , 10 ,'2018-10-01');
INSERT INTO dataa values(1 , 20 ,'2019-01-02');
INSERT INTO dataa values(1 , 25 ,'2020-01-03');
INSERT INTO dataa values(1 , 10 ,'2021-01-04');
SELECT
mi.id, mi.points,mi.date minDate, ma.date maxDate
FROM
(select id, points, min(ddate) date from dataa group by id,points) mi
JOIN
(select id, points, max(ddate) date from dataa group by id,points) ma
ON
mi.id = ma.id
AND
mi.points = ma.points;
DROP TABLE dataa;
This gives the following output:
+------+--------+------------+------------+
| id | points | minDate | maxDate |
+------+--------+------------+------------+
| 1 | 10 | 2018-10-01 | 2021-01-04 |
| 1 | 20 | 2019-01-02 | 2019-01-02 |
| 1 | 25 | 2020-01-03 | 2020-01-03 |
+------+--------+------------+------------+
I've used the default date formatting. This could be modified if you wish.
*** See my other answer, as I don't think this answer is correct after reexamining the OP's question. Leaving this answer in place, in case it has any value.
As I understand the problem, consecutive daily values with the same value for a given ID may be ignored. This can be done by examining the prior value using the LAG() function and excluding records where the current value is unchanged from the prior.
From the remaining records, the LEAD() function can be used to look ahead to the next included record to extract the date where this value is superseded. Max Date is then calculated as one day prior.
Below is an example that includes expanded test data to cover multiple IDs and repeated Points values.
DECLARE #Data TABLE (Id INT, Points INT, Date DATE)
INSERT #Data
VALUES
(1, 10, '2018-01-01'), -- Start
(1, 20, '2019-01-02'), -- Updated
(1, 25, '2020-01-03'), -- Updated
(1, 10, '2021-01-04'), -- Updated
(2, 10, '2022-01-01'), -- Start
(2, 20, '2022-02-01'), -- Updated
(2, 20, '2022-03-01'), -- No change
(2, 20, '2022-04-01'), -- No change
(2, 20, '2022-05-01'), -- No change
(2, 25, '2022-06-01'), -- Updated
(2, 25, '2022-07-01'), -- No change
(2, 20, '2022-08-01'), -- Updated
(2, 25, '2022-09-08'), -- Updated
(2, 10, '2022-10-09'), -- Updated
(3, 10, '2022-01-01'), -- Start
(3, 10, '2022-01-02'), -- No change
(3, 20, '2022-01-03'), -- Updated
(3, 20, '2022-01-04'), -- No change
(3, 20, '2022-01-05'), -- No change
(3, 10, '2022-01-06'), -- Updated
(3, 10, '2022-01-07'); -- No change
WITH CTE AS (
SELECT *, PriorPoints = LAG(Points) OVER (PARTITION BY Id ORDER BY Date)
FROM #Data
)
SELECT ID, Points, MinDate = Date,
MaxDate = DATEADD(day, -1, (LEAD(Date) OVER (PARTITION BY Id ORDER BY Date)))
FROM CTE
WHERE (PriorPoints <> Points OR PriorPoints IS NULL) -- Exclude unchanged
ORDER BY Id, Date
Results:
ID  Points  MinDate     MaxDate
--  ------  ----------  ----------
1   10      2018-01-01  2019-01-01
1   20      2019-01-02  2020-01-02
1   25      2020-01-03  2021-01-03
1   10      2021-01-04  null
2   10      2022-01-01  2022-01-31
2   20      2022-02-01  2022-05-31
2   25      2022-06-01  2022-07-31
2   20      2022-08-01  2022-09-07
2   25      2022-09-08  2022-10-08
2   10      2022-10-09  null
3   10      2022-01-01  2022-01-02
3   20      2022-01-03  2022-01-05
3   10      2022-01-06  null
db<>fiddle
For the last value for a given ID, the calculated MaxDate is NULL indicating no upper bound to the date range. If you really want MaxDate = MinDate for this case, you can add ISNULL( ..., Date).
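Concretely, that variant of the final SELECT (against the same CTE as above) would read:

```sql
-- ISNULL falls back to the row's own date when there is no later row.
SELECT ID, Points, MinDate = Date,
       MaxDate = ISNULL(DATEADD(day, -1,
                     LEAD(Date) OVER (PARTITION BY Id ORDER BY Date)), Date)
FROM CTE
WHERE (PriorPoints <> Points OR PriorPoints IS NULL)
ORDER BY Id, Date
```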
(I am adding this as an alternative (and simpler) interpretation of the OP's question.)
Problem restatement: Given a collection of IDs, Dates, and Points values, a group is defined as any consecutive sequence of the same Points value for a given ID and ascending dates. For each such group, calculate the min and max dates.
The start of such a group can be identified as a row where the Points value changes from the preceding value, or if there is no preceding value for a given ID. If we first tag such rows (NewGroup = 1), we can then assign group numbers based on a count of preceding tagged rows (including the current row). Once we have assigned group numbers, it is then a simple matter to apply a group and aggregate operation.
Below is a sample that includes some additional test data to show multiple IDs and repeating values.
DECLARE #Data TABLE (Id INT, Points INT, Date DATE)
INSERT #Data
VALUES
(1, 10, '2018-01-01'), -- Start
(1, 20, '2019-01-02'), -- Updated
(1, 25, '2020-01-03'), -- Updated
(1, 10, '2021-01-04'), -- Updated
(2, 10, '2022-01-01'), -- Start
(2, 20, '2022-02-01'), -- Updated
(2, 20, '2022-03-01'), -- No change
(2, 20, '2022-04-01'), -- No change
(2, 20, '2022-05-01'), -- No change
(2, 25, '2022-06-01'), -- Updated
(2, 25, '2022-07-01'), -- No change
(2, 20, '2022-08-01'), -- Updated
(2, 25, '2022-09-08'), -- Updated
(2, 10, '2022-10-09'), -- Updated
(3, 10, '2022-01-01'), -- Start
(3, 10, '2022-01-02'), -- No change
(3, 20, '2022-01-03'), -- Updated
(3, 20, '2022-01-04'), -- No change
(3, 20, '2022-01-05'), -- No change
(3, 10, '2022-01-06'), -- Updated
(3, 10, '2022-01-07'); -- No change
WITH CTE AS (
SELECT *,
PriorPoints = LAG(Points) OVER (PARTITION BY Id ORDER BY Date)
FROM #Data
)
, CTE2 AS (
SELECT *,
NewGroup = CASE WHEN (PriorPoints <> Points OR PriorPoints IS NULL)
THEN 1 ELSE 0 END
FROM CTE
)
, CTE3 AS (
SELECT *, GroupNo = SUM(NewGroup) OVER(
PARTITION BY ID
ORDER BY Date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
FROM CTE2
)
SELECT Id, Points, MinDate = MIN(Date), MaxDate = MAX(Date)
FROM CTE3
GROUP BY Id, GroupNo, Points
ORDER BY Id, GroupNo
Results:
Id  Points  MinDate     MaxDate
--  ------  ----------  ----------
1   10      2018-01-01  2018-01-01
1   20      2019-01-02  2019-01-02
1   25      2020-01-03  2020-01-03
1   10      2021-01-04  2021-01-04
2   10      2022-01-01  2022-01-01
2   20      2022-02-01  2022-05-01
2   25      2022-06-01  2022-07-01
2   20      2022-08-01  2022-08-01
2   25      2022-09-08  2022-09-08
2   10      2022-10-09  2022-10-09
3   10      2022-01-01  2022-01-02
3   20      2022-01-03  2022-01-05
3   10      2022-01-06  2022-01-07
To see the intermediate results, replace the final select with SELECT * FROM CTE3 ORDER BY Id, Date.
If you wish to treat gaps in dates as group criteria, add a PriorDate calculation to CTE and add OR Date <> PriorDate to the NewGroup condition.
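That modification might look like the sketch below (CTE3 and the final aggregation are unchanged from the query above):

```sql
-- Sketch: same approach, but a skipped day also starts a new group.
WITH CTE AS (
    SELECT *,
           PriorPoints = LAG(Points) OVER (PARTITION BY Id ORDER BY Date),
           PriorDate   = LAG(Date)   OVER (PARTITION BY Id ORDER BY Date)
    FROM #Data
)
, CTE2 AS (
    SELECT *,
           NewGroup = CASE WHEN PriorPoints <> Points
                             OR PriorPoints IS NULL
                             OR Date <> DATEADD(day, 1, PriorDate)  -- gap => new group
                           THEN 1 ELSE 0 END
    FROM CTE
)
, CTE3 AS (
    SELECT *, GroupNo = SUM(NewGroup) OVER (
                  PARTITION BY Id ORDER BY Date
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    FROM CTE2
)
SELECT Id, Points, MinDate = MIN(Date), MaxDate = MAX(Date)
FROM CTE3
GROUP BY Id, GroupNo, Points
ORDER BY Id, GroupNo
```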
db<>fiddle
Caution: In your original post, you state that "this is to create a view". Beware that if the above logic is included in a view, the entire result may be recalculated every time the view is accessed, regardless of any ID or date criteria applied. It might make more sense to use the above to populate and periodically refresh a historic roll-up data table for efficient access. Another alternative is to make a stored procedure with appropriate parameters that could filter that data before feeding it into the above.
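A stored-procedure wrapper along those lines might look like this sketch (table and procedure names are placeholders, assuming the daily history lives in a permanent table rather than the table variable used above):

```sql
-- Hypothetical: filter by ID before the window functions run, so only
-- that ID's rows are scanned instead of the whole history.
CREATE PROCEDURE dbo.GetPointRanges @Id INT
AS
BEGIN
    WITH CTE AS (
        SELECT *,
               PriorPoints = LAG(Points) OVER (PARTITION BY Id ORDER BY Date)
        FROM dbo.PointHistory       -- assumed permanent source table
        WHERE Id = @Id              -- filter applied before windowing
    )
    SELECT Id, Points, MinDate = Date,
           MaxDate = DATEADD(day, -1, LEAD(Date) OVER (PARTITION BY Id ORDER BY Date))
    FROM CTE
    WHERE PriorPoints <> Points OR PriorPoints IS NULL
    ORDER BY Id, Date;
END
```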

SQL server group / partition to condense history table

Got a table of dates someone was in a particular category like this:
drop table if exists #category
create table #category (personid int, categoryid int, startdate datetime, enddate datetime)
insert into #category
select * from
(
select 1 Personid, 1 CategoryID, '01/04/2010' StartDate, '31/07/2016' EndDate union
select 1 Personid, 5 CategoryID, '07/08/2016' StartDate, '31/03/2019' EndDate union
select 1 Personid, 5 CategoryID, '01/04/2019' StartDate, '01/04/2019' EndDate union
select 1 Personid, 5 CategoryID, '02/04/2019' StartDate, '11/08/2019' EndDate union
select 1 Personid, 4 CategoryID, '12/08/2019' StartDate, '03/11/2019' EndDate union
select 1 Personid, 5 CategoryID, '04/11/2019' StartDate, '22/03/2020' EndDate union
select 1 Personid, 5 CategoryID, '23/03/2020' StartDate, NULL EndDate union
select 2 Personid, 1 CategoryID, '01/04/2010' StartDate, '09/04/2015' EndDate union
select 2 Personid, 4 CategoryID, '10/04/2015' StartDate, '31/03/2018' EndDate union
select 2 Personid, 4 CategoryID, '01/04/2018' StartDate, '31/03/2019' EndDate union
select 2 Personid, 4 CategoryID, '01/04/2019' StartDate, '23/06/2019' EndDate union
select 2 Personid, 4 CategoryID, '24/06/2019' StartDate, NULL EndDate
) x
order by personid, startdate
I'm trying to condense it so I get this:
PersonID  categoryid  startdate   EndDate
--------  ----------  ----------  ----------
1         1           01/04/2010  31/07/2016
1         5           07/08/2016  11/08/2019
1         4           12/08/2019  03/11/2019
1         5           04/11/2019  NULL
2         1           01/04/2010  09/04/2015
2         4           01/04/2015  NULL
I'm having issues with people like personid 1, where they are in (e.g.) category 5, then go into category 4 and then back into category 5.
So doing something like:
select
personid,
categoryid,
min(startdate) startdate,
max(enddate) enddate
from #category
group by
personid, categoryid
gives me the earliest date from category 5's first period, and the latest date from the second period - and means it creates an overlapping period.
So I tried partitioning it with a rownum or rank, but it still does the same thing - i.e. treats the 'category 5's as the same group:
select
rank() over (partition by personid, categoryid order by personid, startdate) rank,
c.*
from #category c
order by personid, startdate
rank  personid  categoryid  startdate                enddate
----  --------  ----------  -----------------------  -----------------------
1     1         1           2010-04-01 00:00:00.000  2016-07-31 00:00:00.000
1     1         5           2016-08-07 00:00:00.000  2019-03-31 00:00:00.000
2     1         5           2019-04-01 00:00:00.000  2019-04-01 00:00:00.000
3     1         5           2019-04-02 00:00:00.000  2019-08-11 00:00:00.000
1     1         4           2019-08-12 00:00:00.000  2019-11-03 00:00:00.000
4     1         5           2019-11-04 00:00:00.000  2020-03-22 00:00:00.000
5     1         5           2020-03-23 00:00:00.000  NULL
1     2         1           2010-04-01 00:00:00.000  2015-04-09 00:00:00.000
1     2         4           2015-04-10 00:00:00.000  2018-03-31 00:00:00.000
2     2         4           2018-04-01 00:00:00.000  2019-03-31 00:00:00.000
3     2         4           2019-04-01 00:00:00.000  2019-06-23 00:00:00.000
4     2         4           2019-06-24 00:00:00.000  NULL
You can see in the rank column that the category 5's start off 1, 2, 3, then skip a row and carry on 4, 5, so they are obviously in the same partition. I thought that adding the order by clause would force it to start a new partition when the category changed from 5 to 4 and back again.
Any thoughts?
This is a type of gaps and islands problem. However, if your data tiles perfectly (no gaps) as it does in your example data, then you can do this without any aggregation at all -- which should be the most efficient method:
select personid, categoryid, startdate,
dateadd(day, -1, lead(startdate) over (partition by personid order by startdate)) as enddate
from (select c.*,
lag(categoryid) over (partition by personid order by startdate) as prev_categoryid
from #category c
) c
where prev_categoryid is null or prev_categoryid <> categoryid;
The where clause only selects the rows where the category changes. The lead() then gets the next start date -- and subtracts 1 for your desired enddate.
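You can see which rows survive the filter by running just the inner query for one person:

```sql
-- Inner query for personid 1: each row paired with its predecessor's category.
SELECT categoryid, startdate,
       LAG(categoryid) OVER (PARTITION BY personid ORDER BY startdate) AS prev_categoryid
FROM #category
WHERE personid = 1;
-- The categories 1, 5, 5, 5, 4, 5, 5 pair with prev NULL, 1, 5, 5, 5, 4, 5,
-- so only the four rows where the category actually changes pass the WHERE,
-- one per condensed period.
```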

Given two date ranged discount tables and product price, calculate date ranged final price

I have two tables with seasonal discounts. Each table holds non-overlapping date ranges, a product id and the discount that applies in that date range. Date ranges from one table may, however, overlap with date ranges in the other table. Given a third table with product id and default price, the goal is to efficiently calculate seasonal, date-ranged prices for each product after the discounts from both tables have been applied.
Discounts multiply only in their overlapping period. For example, if a first discount is 0.9 (10%) from 2019-07-01 to 2019-07-30, and a second discount is 0.8 from 2019-07-16 to 2019-08-15, this translates to: 0.9 discount from 2019-07-01 to 2019-07-15, 0.72 discount from 2019-07-16 to 2019-07-30, and 0.8 discount from 2019-07-31 to 2019-08-15.
I have managed to come to a solution: first generate a table holding all start and end dates from both discount tables in order, then derive all smallest disjoint intervals, and then, for each interval, generate all candidate prices (default price, price with only the first table's discount applied if any, price with only the second table's discount applied if any, and price with both applied where possible) and take the MIN of the four. See the sample code below.
declare #pricesDefault table (product_id int, price decimal)
insert into #pricesDefault
values
(1, 100),
(2, 120),
(3, 200),
(4, 50)
declare #discountTypeA table (product_id int, modifier decimal(4,2), startdate datetime, enddate datetime)
insert into #discountTypeA
values
(1, 0.75, '2019-06-06', '2019-07-06'),
(1, 0.95, '2019-08-06', '2019-08-20'),
(1, 0.92, '2019-05-06', '2019-06-05'),
(2, 0.75, '2019-06-08', '2019-07-19'),
(2, 0.95, '2019-07-20', '2019-09-20'),
(3, 0.92, '2019-05-06', '2019-06-05')
declare #discountTypeB table (product_id int, modifier decimal(4,2), startdate datetime, enddate datetime)
insert into #discountTypeB
values
(1, 0.85, '2019-06-20', '2019-07-03'),
(1, 0.65, '2019-08-10', '2019-08-29'),
(1, 0.65, '2019-09-10', '2019-09-27'),
(3, 0.75, '2019-05-08', '2019-05-19'),
(2, 0.95, '2019-05-20', '2019-05-21'),
(3, 0.92, '2019-09-06', '2019-09-09')
declare #pricingPeriod table(product_id int, discountedPrice decimal, startdate datetime, enddate datetime);
with allDates(product_id, dt) as
(select distinct product_id, dta.startdate from #discountTypeA dta
union all
select distinct product_id, dta.enddate from #discountTypeA dta
union all
select distinct product_id, dtb.startdate from #discountTypeB dtb
union all
select distinct product_id, dtb.enddate from #discountTypeB dtb
),
allproductDatesWithId as
(select product_id, dt, row_number() over (partition by product_id order by dt asc) 'Id'
from allDates),
sched as
(select pd.product_id, apw1.dt startdate, apw2.dt enddate
from #pricesDefault pd
join allproductDatesWithId apw1 on apw1.product_id = pd.product_id
join allproductDatesWithId apw2 on apw2.product_id = pd.product_id and apw2.Id= apw1.Id+1
),
discountAppliedTypeA as(
select sc.product_id, sc.startdate, sc.enddate,
min(case when sc.startdate >= dta.startdate and dta.enddate >= sc.enddate then pd.price * dta.modifier else pd.price end ) 'price'
from sched sc
join #pricesDefault pd on pd.product_id = sc.product_id
left join #discountTypeA dta on sc.product_id = dta.product_id
group by sc.product_id, sc.startdate , sc.enddate ),
discountAppliedTypeB as(
select daat.product_id, daat.startdate, daat.enddate,
min(case when daat.startdate >= dta.startdate and dta.enddate >= daat.enddate then daat.price * dta.modifier else daat.price end ) 'price'
from discountAppliedTypeA daat
left join #discountTypeB dta on daat.product_id = dta.product_id
group by daat.product_id, daat.startdate , daat.enddate )
select * from discountAppliedTypeB
order by product_id, startdate
Calculating a min of all possible prices is unnecessary overhead. I'd like to generate, just one resulting price and have it as a final price.
Here is the resulting set:
product_id start_date end_date final_price
1 2019-05-06 00:00:00.000 2019-06-05 00:00:00.000 92.0000
1 2019-06-05 00:00:00.000 2019-06-06 00:00:00.000 100.0000
1 2019-06-06 00:00:00.000 2019-06-20 00:00:00.000 75.0000
1 2019-06-20 00:00:00.000 2019-07-03 00:00:00.000 63.7500
1 2019-07-03 00:00:00.000 2019-07-06 00:00:00.000 75.0000
1 2019-07-06 00:00:00.000 2019-08-06 00:00:00.000 100.0000
1 2019-08-06 00:00:00.000 2019-08-10 00:00:00.000 95.0000
1 2019-08-10 00:00:00.000 2019-08-20 00:00:00.000 61.7500
1 2019-08-20 00:00:00.000 2019-08-29 00:00:00.000 65.0000
1 2019-08-29 00:00:00.000 2019-09-10 00:00:00.000 100.0000
1 2019-09-10 00:00:00.000 2019-09-27 00:00:00.000 65.0000
2 2019-05-20 00:00:00.000 2019-05-21 00:00:00.000 114.0000
2 2019-05-21 00:00:00.000 2019-06-08 00:00:00.000 120.0000
2 2019-06-08 00:00:00.000 2019-07-19 00:00:00.000 90.0000
2 2019-07-19 00:00:00.000 2019-07-20 00:00:00.000 120.0000
2 2019-07-20 00:00:00.000 2019-09-20 00:00:00.000 114.0000
3 2019-05-06 00:00:00.000 2019-05-08 00:00:00.000 184.0000
3 2019-05-08 00:00:00.000 2019-05-19 00:00:00.000 138.0000
3 2019-05-19 00:00:00.000 2019-06-05 00:00:00.000 184.0000
3 2019-06-05 00:00:00.000 2019-09-06 00:00:00.000 200.0000
3 2019-09-06 00:00:00.000 2019-09-09 00:00:00.000 184.0000
Is there a more efficient solution that I am not seeing?
I have a large data set of ~20K rows in real product prices table, and 100K- 200K rows in both discount tables.
Indexing structure of the actual tables is following: product id is clustered index in product prices table, whilst discount tables have an Id surrogate column as clustered index (as well as primary key), and (product_id, start_date, end_date) as a non clustered index.
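Building on the index layout described above, one hedged tweak (index name hypothetical, table name per your description) is to make the non-clustered index covering, so the range probes in the joins never need a key lookup into the clustered index:

```sql
-- Sketch: cover the date-range probes by carrying modifier in the leaf pages.
-- Repeat the same pattern for the second discount table.
CREATE NONCLUSTERED INDEX IX_DiscountTypeA_Product_Dates
    ON dbo.DiscountTypeA (product_id, startdate, enddate)
    INCLUDE (modifier);
```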
You can generate the dates using union. Then bring in all discounts that are valid on that date, and calculate the total.
This looks like:
with prices as (
select a.product_id, v.dte
from #discountTypeA a cross apply
(values (a.startdate), (a.enddate)) v(dte)
union -- on purpose to remove duplicates
select b.product_id, v.dte
from #discountTypeB b cross apply
(values (b.startdate), (b.enddate)) v(dte)
),
p as (
select p.*, 1-a.modifier as a_discount, 1-b.modifier as b_discount, pd.price
from prices p left join
#pricesDefault pd
on pd.product_id = p.product_id left join
#discountTypeA a
on p.product_id = a.product_id and
p.dte >= a.startdate and p.dte < a.enddate left join
#discountTypeb b
on p.product_id = b.product_id and
p.dte >= b.startdate and p.dte < b.enddate
)
select p.product_id, price * (1 - coalesce(a_discount, 0)) * (1 - coalesce(b_discount, 0)) as price, a_discount, b_discount,
dte as startdate, lead(dte) over (partition by product_id order by dte) as enddate
from p
order by product_id, dte;
Here is a db<>fiddle.
Here is a version that works out the price for every date. You can then either use this directly, or use one of the many solutions on SO for working out date ranges.
In this example I have hard coded the date limits, but you could easily read them from your tables if you prefer.
I haven't done any performance testing on this, but give it a go. It's quite a bit simpler, and if you have the right indexes it might be quicker.
;with dates as (
select convert(datetime,'2019-05-06') as d
union all
select d+1 from dates where d<'2019-09-27'
)
select pricesDefault.product_id, d, pricesDefault.price as baseprice,
discountA.modifier as dA,
discountB.modifier as dB,
pricesDefault.price*isnull(discountA.modifier,1)*isnull(discountB.modifier,1) as finalprice
from #pricesDefault pricesDefault
cross join dates
left join #discountTypeA discountA on discountA.product_id=pricesDefault.product_id and d between discountA.startdate and discountA.enddate
left join #discountTypeB discountB on discountB.product_id=pricesDefault.product_id and d between discountB.startdate and discountB.enddate
order by pricesDefault.product_id, d
Option (MaxRecursion 1000)

Date range with minimum and maximum dates from dataset having records with continuous date range

I have a dataset with id, Status and date range of employees.
The input dataset given below holds the details of one employee.
The date ranges in the records are continuous (in exact order), such that the start date of the second row is the day after the end date of the first row.
If an employee takes leave continuously for different months, then the table is storing the info with date range as separated for different months.
For example: In the input set, the employee has taken Sick leave from '16-10-2016' to '31-12-2016' and joined back on '1-1-2017'.
So there are 3 records for this item but the dates are continuous.
In the output I need this as one record as shown in the expected output dataset.
INPUT
Id Status StartDate EndDate
1 Active 1-9-2007 15-10-2016
1 Sick 16-10-2016 31-10-2016
1 Sick 1-11-2016 30-11-2016
1 Sick 1-12-2016 31-12-2016
1 Active 1-1-2017 4-2-2017
1 Unpaid 5-2-2017 9-2-2017
1 Active 10-2-2017 11-2-2017
1 Unpaid 12-2-2017 28-2-2017
1 Unpaid 1-3-2017 31-3-2017
1 Unpaid 1-4-2017 30-4-2017
1 Active 1-5-2017 13-10-2017
1 Sick 14-10-2017 11-11-2017
1 Active 12-11-2017 NULL
EXPECTED OUTPUT
Id Status StartDate EndDate
1 Active 1-9-2007 15-10-2016
1 Sick 16-10-2016 31-12-2016
1 Active 1-1-2017 4-2-2017
1 Unpaid 5-2-2017 9-2-2017
1 Active 10-2-2017 11-2-2017
1 Unpaid 12-2-2017 30-4-2017
1 Active 1-5-2017 13-10-2017
1 Sick 14-10-2017 11-11-2017
1 Active 12-11-2017 NULL
I can't take min(startdate) and max(EndDate) grouped by id, status, because if the same employee has taken another sick leave, then that later end date ('11-11-2017' in the example) will come out as the end date.
Can anyone help me with the query in SQL Server 2014?
It suddenly hit me that this is basically a gaps and islands problem - so I've completely changed my solution.
For this solution to work, the dates do not have to be consecutive.
First, create and populate sample table (Please save us this step in your future questions):
DECLARE #T AS TABLE
(
Id int,
Status varchar(10),
StartDate date,
EndDate date
);
SET DATEFORMAT DMY; -- This is needed because how you specified your dates.
INSERT INTO #T (Id, Status, StartDate, EndDate) VALUES
(1, 'Active', '1-9-2007', '15-10-2016'),
(1, 'Sick', '16-10-2016', '31-10-2016'),
(1, 'Sick', '1-11-2016', '30-11-2016'),
(1, 'Sick', '1-12-2016', '31-12-2016'),
(1, 'Active', '1-1-2017', '4-2-2017'),
(1, 'Unpaid', '5-2-2017', '9-2-2017'),
(1, 'Active', '10-2-2017', '11-2-2017'),
(1, 'Unpaid', '12-2-2017', '28-2-2017'),
(1, 'Unpaid', '1-3-2017', '31-3-2017'),
(1, 'Unpaid', '1-4-2017', '30-4-2017'),
(1, 'Active', '1-5-2017', '13-10-2017'),
(1, 'Sick', '14-10-2017', '11-11-2017'),
(1, 'Active', '12-11-2017', NULL);
The (new) common table expression:
;WITH CTE AS
(
SELECT Id,
Status,
StartDate,
EndDate,
ROW_NUMBER() OVER(PARTITION BY Id ORDER BY StartDate)
- ROW_NUMBER() OVER(PARTITION BY Id, Status ORDER BY StartDate) As IslandId,
ROW_NUMBER() OVER(PARTITION BY Id ORDER BY StartDate DESC)
- ROW_NUMBER() OVER(PARTITION BY Id, Status ORDER BY StartDate DESC) As ReverseIslandId
FROM #T
)
The (new) query:
SELECT DISTINCT Id,
Status,
MIN(StartDate) OVER(PARTITION BY IslandId, ReverseIslandId) As StartDate,
NULLIF(MAX(ISNULL(EndDate, '9999-12-31')) OVER(PARTITION BY IslandId, ReverseIslandId), '9999-12-31') As EndDate
FROM CTE
ORDER BY StartDate
(new) Results:
Id Status StartDate EndDate
1 Active 01.09.2007 15.10.2016
1 Sick 16.10.2016 31.12.2016
1 Active 01.01.2017 04.02.2017
1 Unpaid 05.02.2017 09.02.2017
1 Active 10.02.2017 11.02.2017
1 Unpaid 12.02.2017 30.04.2017
1 Active 01.05.2017 13.10.2017
1 Sick 14.10.2017 11.11.2017
1 Active 12.11.2017 NULL
You can see a live demo on rextester.
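An equivalent formulation replaces DISTINCT over window aggregates with a plain GROUP BY - a sketch against the same CTE (Id and Status are added to the grouping keys, which is safe since they are constant within an island):
SELECT Id,
       Status,
       MIN(StartDate) As StartDate,
       NULLIF(MAX(ISNULL(EndDate, '9999-12-31')), '9999-12-31') As EndDate
FROM CTE
GROUP BY Id, Status, IslandId, ReverseIslandId
ORDER BY MIN(StartDate)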
Please note that string representations of dates in SQL should be according to ISO 8601 - meaning either yyyy-MM-dd or yyyyMMdd - as an unambiguous format will always be interpreted correctly by SQL Server.
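One caveat worth knowing: the hyphenated form is only unambiguous for the newer date/datetime2 types; for the legacy datetime type it is still affected by SET DATEFORMAT, whereas yyyyMMdd never is:
SET DATEFORMAT DMY;
SELECT CAST('2017-02-05' AS datetime); -- read as year-day-month: May 2nd, 2017
SELECT CAST('2017-02-05' AS date);     -- always February 5th, 2017
SELECT CAST('20170205' AS datetime);   -- always February 5th, 2017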
It's an example of grouping with window functions:
First you set a reset point each time the Status changes,
then a running sum of those reset points assigns a group number,
then you get the max/min dates of each group.
;with x as
(
select Id, Status, StartDate, EndDate,
iif (lag(Status) over (order by Id, StartDate) = Status, null, 1) rst
from emp
), y as
(
select Id, Status, StartDate, EndDate,
sum(rst) over (order by Id, StartDate) grp
from x
)
select Id,
MIN(Status) as Status,
MIN(StartDate) StartDate,
MAX(EndDate) EndDate
from y
group by Id, grp
order by Id, grp
GO
Id | Status | StartDate | EndDate
-: | :----- | :------------------ | :------------------
1 | Active | 01/09/2007 00:00:00 | 15/10/2016 00:00:00
1 | Sick | 16/10/2016 00:00:00 | 31/12/2016 00:00:00
1 | Active | 01/01/2017 00:00:00 | 04/02/2017 00:00:00
1 | Unpaid | 05/02/2017 00:00:00 | 09/02/2017 00:00:00
1 | Active | 10/02/2017 00:00:00 | 11/02/2017 00:00:00
1 | Unpaid | 12/02/2017 00:00:00 | 30/04/2017 00:00:00
1 | Active | 01/05/2017 00:00:00 | 13/10/2017 00:00:00
1 | Sick | 14/10/2017 00:00:00 | 11/11/2017 00:00:00
1 | Active | 12/11/2017 00:00:00 | null
dbfiddle here
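To see the reset-and-sum mechanics in isolation, you can select the intermediate columns (same emp table and column names as the query above; the trailing comment shows illustrative group numbers):
;with x as
(
select Id, Status, StartDate, EndDate,
iif (lag(Status) over (order by Id, StartDate) = Status, null, 1) rst
from emp
)
select Id, Status, StartDate, rst,
sum(rst) over (order by Id, StartDate) grp -- Active:1, Sick,Sick,Sick:2, Active:3, ...
from x
order by Id, StartDate
rst is 1 whenever the status differs from the previous row (and NULL otherwise, which SUM ignores), so the running sum stays constant within each island of equal statuses.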
Here's an alternative answer that doesn't use LAG.
First I need to take a copy of your test data:
DECLARE #table TABLE (Id INT, [Status] VARCHAR(50), StartDate DATE, EndDate DATE);
INSERT INTO #table SELECT 1, 'Active', '20070901', '20161015';
INSERT INTO #table SELECT 1, 'Sick', '20161016', '20161031';
INSERT INTO #table SELECT 1, 'Sick', '20161101', '20161130';
INSERT INTO #table SELECT 1, 'Sick', '20161201', '20161231';
INSERT INTO #table SELECT 1, 'Active', '20170101', '20170204';
INSERT INTO #table SELECT 1, 'Unpaid', '20170205', '20170209';
INSERT INTO #table SELECT 1, 'Active', '20170210', '20170211';
INSERT INTO #table SELECT 1, 'Unpaid', '20170212', '20170228';
INSERT INTO #table SELECT 1, 'Unpaid', '20170301', '20170331';
INSERT INTO #table SELECT 1, 'Unpaid', '20170401', '20170430';
INSERT INTO #table SELECT 1, 'Active', '20170501', '20171013';
INSERT INTO #table SELECT 1, 'Sick', '20171014', '20171111';
INSERT INTO #table SELECT 1, 'Active', '20171112', NULL;
Then the query is:
WITH add_order AS (
SELECT
*,
ROW_NUMBER() OVER (ORDER BY StartDate) AS order_id
FROM
#table),
links AS (
SELECT
a1.Id,
a1.[Status],
a1.order_id,
MIN(a1.order_id) AS start_order_id,
MAX(ISNULL(a2.order_id, a1.order_id)) AS end_order_id,
MIN(a1.StartDate) AS StartDate,
MAX(ISNULL(a2.EndDate, a1.EndDate)) AS EndDate
FROM
add_order a1
LEFT JOIN add_order a2 ON a2.Id = a1.Id AND a2.[Status] = a1.[Status] AND a2.order_id = a1.order_id + 1 AND a2.StartDate = DATEADD(DAY, 1, a1.EndDate)
GROUP BY
a1.Id,
a1.[Status],
a1.order_id),
merged AS (
SELECT
l1.Id,
l1.[Status],
l1.[StartDate],
ISNULL(l2.EndDate, l1.EndDate) AS EndDate,
ROW_NUMBER() OVER (PARTITION BY l1.Id, l1.[Status], ISNULL(l2.EndDate, l1.EndDate) ORDER BY l1.order_id) AS link_id
FROM
links l1
LEFT JOIN links l2 ON l2.order_id = l1.end_order_id)
SELECT
Id,
[Status],
StartDate,
EndDate
FROM
merged
WHERE
link_id = 1
ORDER BY
StartDate;
Results are:
Id Status StartDate EndDate
1 Active 2007-09-01 2016-10-15
1 Sick 2016-10-16 2016-12-31
1 Active 2017-01-01 2017-02-04
1 Unpaid 2017-02-05 2017-02-09
1 Active 2017-02-10 2017-02-11
1 Unpaid 2017-02-12 2017-04-30
1 Active 2017-05-01 2017-10-13
1 Sick 2017-10-14 2017-11-11
1 Active 2017-11-12 NULL
How does it work? First I add a sequence number, to assist with merging contiguous rows together. Then I determine the rows that can be merged together, add a number to identify the first row in each set that can be merged, and finally pick the first rows out of the final CTE. Note that I also have to handle rows that can't be merged, hence the LEFT JOINs and ISNULL statements.
Just for interest, this is what the output from the final CTE looks like, before I filter out all but the rows with a link_id of 1:
Id Status StartDate EndDate link_id
1 Active 2007-09-01 2016-10-15 1
1 Sick 2016-10-16 2016-12-31 1
1 Sick 2016-11-01 2016-12-31 2
1 Sick 2016-12-01 2016-12-31 3
1 Active 2017-01-01 2017-02-04 1
1 Unpaid 2017-02-05 2017-02-09 1
1 Active 2017-02-10 2017-02-11 1
1 Unpaid 2017-02-12 2017-04-30 1
1 Unpaid 2017-03-01 2017-04-30 2
1 Unpaid 2017-04-01 2017-04-30 3
1 Active 2017-05-01 2017-10-13 1
1 Sick 2017-10-14 2017-11-11 1
1 Active 2017-11-12 NULL 1
You could use the lag() and lead() functions together to check the previous and next status:
WITH CTE AS
(
select *,
COALESCE(LEAD(Status) OVER(PARTITION BY Id ORDER BY StartDate), '0') Nstatus,
COALESCE(LAG(Status) OVER(PARTITION BY Id ORDER BY StartDate), '0') Pstatus
from YourTable
)
SELECT * FROM CTE
WHERE Status <> Pstatus -- a row starts a new period when its status differs from the previous row's
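As written, this returns only the first row of each merged period, without the merged EndDate. One way to complete it - assuming the periods are contiguous, as in the sample data (each StartDate is the day after the previous EndDate) - is to derive EndDate from the next kept row's StartDate; window functions are evaluated after the WHERE filter, so LEAD here sees only the period-start rows:
SELECT Id, Status, StartDate,
       DATEADD(DAY, -1, LEAD(StartDate) OVER (PARTITION BY Id ORDER BY StartDate)) AS EndDate
FROM CTE
WHERE Status <> Pstatus
The last row per Id gets a NULL EndDate, which matches the open-ended current period in the expected output.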