SQL join against date ranges?

Consider two tables:
Transactions, with amounts in a foreign currency:
Date       Amount
=========  =======
1/2/2009   1500
2/4/2009   2300
3/15/2009  300
4/17/2009  2200
etc.
ExchangeRates, with the value of the primary currency (let's say dollars) in the foreign currency:
Date       Rate
=========  =======
2/1/2009   40.1
3/1/2009   41.0
4/1/2009   38.5
5/1/2009   42.7
etc.
Exchange rates can be entered for arbitrary dates - the user could enter them on a daily basis, weekly basis, monthly basis, or at irregular intervals.
In order to translate the foreign amounts to dollars, I need to respect these rules:
A. If possible, use the most recent previous rate; so the transaction on 2/4/2009 uses the rate for 2/1/2009, and the transaction on 3/15/2009 uses the rate for 3/1/2009.
B. If there isn't a rate defined for a previous date, use the earliest rate available. So the transaction on 1/2/2009 uses the rate for 2/1/2009, since there isn't an earlier rate defined.
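So with the sample data above, the expected rate for each transaction would be:
Transaction Date   Rate Used         Rule
1/2/2009           40.1 (2/1/2009)   B
2/4/2009           40.1 (2/1/2009)   A
3/15/2009          41.0 (3/1/2009)   A
4/17/2009          38.5 (4/1/2009)   A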
This works...
Select
    t.Date,
    t.Amount,
    ConvertedAmount = (
        Select Top 1 t.Amount/ex.Rate
        From ExchangeRates ex
        Where t.Date > ex.Date
        Order by ex.Date desc
    )
From Transactions t
... but (1) it seems like a join would be more efficient & elegant, and (2) it doesn't deal with Rule B above.
Is there an alternative to using the subquery to find the appropriate rate? And is there an elegant way to handle Rule B, without tying myself in knots?

You could first do a self-join on the exchange rates, ordered by date, so that you have the start and end date of each exchange rate, without any overlap or gap in the dates (maybe add that as a view to your database; in my case I'm just using common table expressions).
Now joining those "prepared" rates with the transactions is simple and efficient.
Something like:
WITH IndexedExchangeRates AS (
    SELECT Row_Number() OVER (ORDER BY Date) ix,
           Date,
           Rate
    FROM ExchangeRates
),
RangedExchangeRates AS (
    SELECT CASE WHEN IER.ix = 1 THEN CAST('1753-01-01' AS datetime)
                ELSE IER.Date
           END DateFrom,
           COALESCE(IER2.Date, GETDATE()) DateTo,
           IER.Rate
    FROM IndexedExchangeRates IER
    LEFT JOIN IndexedExchangeRates IER2
        ON IER.ix = IER2.ix - 1
)
SELECT T.Date,
       T.Amount,
       RER.Rate,
       T.Amount/RER.Rate ConvertedAmount
FROM Transactions T
LEFT JOIN RangedExchangeRates RER
    ON (T.Date > RER.DateFrom) AND (T.Date <= RER.DateTo)
Notes:
You could replace GETDATE() with a date in the far future, I'm assuming here that no rates for the future are known.
Rule (B) is implemented by setting the start date for the first known exchange rate to the minimum date supported by the SQL Server datetime type, which (assuming datetime is the type of your Date column) is the smallest value possible.
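For the sample data above, this should return approximately the following (ConvertedAmount rounded to two decimals):
Date        Amount   Rate   ConvertedAmount
1/2/2009    1500     40.1   37.41
2/4/2009    2300     40.1   57.36
3/15/2009   300      41.0   7.32
4/17/2009   2200     38.5   57.14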

Suppose you had an extended exchange rate table that contained:
Start Date  End Date    Rate
==========  ==========  =======
0001-01-01  2009-01-31  40.1
2009-02-01  2009-02-28  40.1
2009-03-01  2009-03-31  41.0
2009-04-01  2009-04-30  38.5
2009-05-01  9999-12-31  42.7
We can discuss the details of whether the first two rows should be combined, but the general idea is that it is trivial to find the exchange rate for a given date. This structure works with the SQL 'BETWEEN' operator which includes the ends of the ranges. Often, a better format for ranges is 'open-closed'; the first date listed is included and the second is excluded. Note that there is a constraint on the data rows - there are (a) no gaps in the coverage of the range of dates and (b) no overlaps in the coverage. Enforcing those constraints is not completely trivial (polite understatement - meiosis).
Now the basic query is trivial, and Case B is no longer a special case:
SELECT T.Date, T.Amount, X.Rate
FROM Transactions AS T JOIN ExtendedExchangeRates AS X
ON T.Date BETWEEN X.StartDate AND X.EndDate;
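If you adopt the open-closed convention instead (with EndDate holding the first day not covered, i.e. the next range's start), the lookup uses a half-open test; a sketch:
SELECT T.Date, T.Amount, X.Rate
  FROM Transactions AS T JOIN ExtendedExchangeRates AS X
    ON T.Date >= X.StartDate AND T.Date < X.EndDate;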
The tricky part is creating the ExtendedExchangeRate table from the given ExchangeRate table on the fly.
If it is an option, then revising the structure of the basic ExchangeRate table to match the ExtendedExchangeRate table would be a good idea; you resolve the messy stuff when the data is entered (once a month) instead of every time an exchange rate needs to be determined (many times a day).
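A minimal sketch of what that revised table might look like (the name ExchangeRateRange and the types are illustrative; the no-gap/no-overlap constraints would still need to be enforced by triggers or by the data-entry code):
CREATE TABLE ExchangeRateRange
(
    StartDate DATE NOT NULL PRIMARY KEY,
    EndDate   DATE NOT NULL,
    Rate      DECIMAL(10,5) NOT NULL,
    CHECK (StartDate <= EndDate)
);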
How to create the extended exchange rate table? If your system supports adding or subtracting 1 from a date value to obtain the next or previous day (and has a single-row table called 'Dual'), then a variation on this will work (without using any OLAP functions):
CREATE TABLE ExchangeRate
(
    Date DATE NOT NULL,
    Rate DECIMAL(10,5) NOT NULL
);
INSERT INTO ExchangeRate VALUES('2009-02-01', 40.1);
INSERT INTO ExchangeRate VALUES('2009-03-01', 41.0);
INSERT INTO ExchangeRate VALUES('2009-04-01', 38.5);
INSERT INTO ExchangeRate VALUES('2009-05-01', 42.7);
First row:
SELECT '0001-01-01' AS StartDate,
(SELECT MIN(Date) - 1 FROM ExchangeRate) AS EndDate,
(SELECT Rate FROM ExchangeRate
WHERE Date = (SELECT MIN(Date) FROM ExchangeRate)) AS Rate
FROM Dual;
Result:
0001-01-01 2009-01-31 40.10000
Last row:
SELECT (SELECT MAX(Date) FROM ExchangeRate) AS StartDate,
'9999-12-31' AS EndDate,
(SELECT Rate FROM ExchangeRate
WHERE Date = (SELECT MAX(Date) FROM ExchangeRate)) AS Rate
FROM Dual;
Result:
2009-05-01 9999-12-31 42.70000
Middle rows:
SELECT X1.Date AS StartDate,
X2.Date - 1 AS EndDate,
X1.Rate AS Rate
FROM ExchangeRate AS X1 JOIN ExchangeRate AS X2
ON X1.Date < X2.Date
WHERE NOT EXISTS
(SELECT *
FROM ExchangeRate AS X3
WHERE X3.Date > X1.Date AND X3.Date < X2.Date
);
Result:
2009-02-01 2009-02-28 40.10000
2009-03-01 2009-03-31 41.00000
2009-04-01 2009-04-30 38.50000
Note that the NOT EXISTS sub-query is rather crucial. Without it, the 'middle rows' result is:
2009-02-01 2009-02-28 40.10000
2009-02-01 2009-03-31 40.10000 # Unwanted
2009-02-01 2009-04-30 40.10000 # Unwanted
2009-03-01 2009-03-31 41.00000
2009-03-01 2009-04-30 41.00000 # Unwanted
2009-04-01 2009-04-30 38.50000
The number of unwanted rows increases dramatically as the table increases in size: for N rows there are N * (N - 1) / 2 candidate pairs but only N - 1 wanted rows, leaving (N - 1) * (N - 2) / 2 unwanted rows. (For the four-row sample: 6 pairs, 3 wanted, 3 unwanted.)
The result for ExtendedExchangeRate is the (disjoint) UNION of the three queries:
SELECT DATE '0001-01-01' AS StartDate,
(SELECT MIN(Date) - 1 FROM ExchangeRate) AS EndDate,
(SELECT Rate FROM ExchangeRate
WHERE Date = (SELECT MIN(Date) FROM ExchangeRate)) AS Rate
FROM Dual
UNION
SELECT X1.Date AS StartDate,
X2.Date - 1 AS EndDate,
X1.Rate AS Rate
FROM ExchangeRate AS X1 JOIN ExchangeRate AS X2
ON X1.Date < X2.Date
WHERE NOT EXISTS
(SELECT *
FROM ExchangeRate AS X3
WHERE X3.Date > X1.Date AND X3.Date < X2.Date
)
UNION
SELECT (SELECT MAX(Date) FROM ExchangeRate) AS StartDate,
DATE '9999-12-31' AS EndDate,
(SELECT Rate FROM ExchangeRate
WHERE Date = (SELECT MAX(Date) FROM ExchangeRate)) AS Rate
FROM Dual;
On the test DBMS (IBM Informix Dynamic Server 11.50.FC6 on MacOS X 10.6.2), I was able to convert the query into a view but I had to stop cheating with the data types - by coercing the strings into dates:
CREATE VIEW ExtendedExchangeRate(StartDate, EndDate, Rate) AS
SELECT DATE('0001-01-01') AS StartDate,
(SELECT MIN(Date) - 1 FROM ExchangeRate) AS EndDate,
(SELECT Rate FROM ExchangeRate WHERE Date = (SELECT MIN(Date) FROM ExchangeRate)) AS Rate
FROM Dual
UNION
SELECT X1.Date AS StartDate,
X2.Date - 1 AS EndDate,
X1.Rate AS Rate
FROM ExchangeRate AS X1 JOIN ExchangeRate AS X2
ON X1.Date < X2.Date
WHERE NOT EXISTS
(SELECT *
FROM ExchangeRate AS X3
WHERE X3.Date > X1.Date AND X3.Date < X2.Date
)
UNION
SELECT (SELECT MAX(Date) FROM ExchangeRate) AS StartDate,
DATE('9999-12-31') AS EndDate,
(SELECT Rate FROM ExchangeRate WHERE Date = (SELECT MAX(Date) FROM ExchangeRate)) AS Rate
FROM Dual;

I can't test this, but I think it would work. It uses coalesce with two sub-queries to pick the rate by rule A or rule B.
Select t.Date, t.Amount,
       ConvertedAmount = t.Amount/coalesce(
           (Select Top 1 ex.Rate
            From ExchangeRates ex
            Where t.Date > ex.Date
            Order by ex.Date desc),
           (Select Top 1 ex.Rate
            From ExchangeRates ex
            Order by ex.Date asc)
       )
From Transactions t

SELECT a.tranDate,
       a.Amount,
       a.Amount/a.Rate as convertedRate
FROM (
    SELECT t.date tranDate,
           e.date as rateDate,
           t.Amount,
           e.rate,
           RANK() OVER (PARTITION BY t.date
                        ORDER BY CASE WHEN DATEDIFF(day, e.date, t.date) < 0
                                      THEN DATEDIFF(day, e.date, t.date) * -100000
                                      ELSE DATEDIFF(day, e.date, t.date)
                                 END) AS diff
    FROM ExchangeRates e
    CROSS JOIN Transactions t
) a
WHERE a.diff = 1
The difference between the transaction and rate dates is calculated; negative values (rule B) are multiplied by -100000 so that they can still be ranked, while positive values (rule A) always take priority. We then select the minimum value for each transaction date using the RANK() OVER clause.
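For example, for the transaction on 1/2/2009 (which has no earlier rate), the ordering values come out as follows, so rank 1 correctly picks the earliest rate, 2/1/2009:
rateDate    DATEDIFF   ordering value
2/1/2009    -30        3,000,000
3/1/2009    -58        5,800,000
4/1/2009    -89        8,900,000
5/1/2009    -119       11,900,000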

Many solutions will work. You should find the one that works best (fastest) for your workload: do you usually look up one transaction, a list of them, or all of them?
The tie-breaker solution given your schema is:
SELECT t.Date,
t.Amount,
r.Rate
--//add your multiplication/division here
FROM "Transactions" t
INNER JOIN "ExchangeRates" r
ON r."ExchangeRateID" = (
SELECT TOP 1 x."ExchangeRateID"
FROM "ExchangeRates" x
WHERE x."SourceCurrencyISO" = t."SourceCurrencyISO" --//these are currency-related filters for your tables
AND x."TargetCurrencyISO" = t."TargetCurrencyISO" --//,which you should also JOIN on
AND x."Date" <= t."Date"
ORDER BY x."Date" DESC)
You need the right indexes for this query to be fast. Also, ideally you should not JOIN on "Date" but on an "ID"-like INTEGER field. Give me more schema info and I will create an example for you.
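For example, a covering index along these lines should let each TOP 1 probe be a single seek (the index name is illustrative; INCLUDE requires SQL Server 2005+):
CREATE INDEX IX_ExchangeRates_Lookup
    ON "ExchangeRates" ("SourceCurrencyISO", "TargetCurrencyISO", "Date" DESC)
    INCLUDE ("Rate");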

There's nothing about a join that will be more elegant than the TOP 1 correlated subquery in your original post. However, as you say, it doesn't satisfy requirement B.
These queries do work (SQL Server 2005 or later required). See the SqlFiddle for these.
SELECT
    T.*,
    ExchangeRate = E.Rate
FROM
    dbo.Transactions T
    CROSS APPLY (
        SELECT TOP 1 Rate
        FROM dbo.ExchangeRate E -- no WHERE clause: a transaction with no prior rate still gets a row
        ORDER BY
            CASE WHEN E.RateDate <= T.TranDate THEN 0 ELSE 1 END, -- prefer rates on/before the transaction (rule A)
            ABS(DATEDIFF(day, E.RateDate, T.TranDate))            -- closest prior rate, else earliest future rate (rule B)
    ) E;
Note that CROSS APPLY with a single column value is functionally equivalent to the correlated subquery in the SELECT clause as you showed. I just prefer CROSS APPLY now because it is much more flexible: it lets you reuse the value in multiple places, return multiple rows (for custom unpivoting), and return multiple columns.
SELECT
    T.*,
    ExchangeRate = Coalesce(E.Rate, E2.Rate)
FROM
    dbo.Transactions T
    OUTER APPLY (
        SELECT TOP 1 Rate
        FROM dbo.ExchangeRate E
        WHERE E.RateDate <= T.TranDate
        ORDER BY E.RateDate DESC
    ) E
    OUTER APPLY (
        SELECT TOP 1 Rate
        FROM dbo.ExchangeRate E2
        WHERE E.Rate IS NULL
        ORDER BY E2.RateDate
    ) E2;
I don't know which one might perform better, or if either will perform better than the other answers on the page. With a proper index on the Date columns, they should zing pretty well, definitely better than any Row_Number() solution.

Related

Calculating business days in Teradata

I need help with a business-days calculation.
I have two tables:
1) One table, ACTUAL_TABLE, containing an order date and a contact date, both with timestamp datatypes.
2) A second table, BUSINESS_DATES, which has each calendar date listed and a flag to indicate weekend days.
Using these two tables, I need to calculate business days, not calendar days (which is the current logic), between these two fields.
My thought process was to first get a range of dates by comparing ORDER_DATE with the TABLE_DATE field and then do a similar comparison of CONTACT_DATE to TABLE_DATE. This would get me a range from the BUSINESS_DATES table which I could then use to calculate the count of days and sum(Holiday_WKND_Flag), making the result look like:
Order# | Count(*) As DAYS | SUM(WEEKEND DATES)
100 | 25 | 8
However, this only works when I use a specific order number; I can't bring all order numbers into a sub-query.
My Query:
SELECT SUM(Holiday_WKND_Flag), COUNT(*) FROM
(
    SELECT *
    FROM BUSINESS_DATES
    WHERE TABLE_DATE BETWEEN (SELECT ORDER_DATE FROM ACTUAL_TABLE
                              WHERE ORDER# = '100')
                         AND (SELECT CONTACT_DATE FROM ACTUAL_TABLE
                              WHERE ORDER# = '100')
) TEMP
SELECT ORDER#, SUM(Holiday_WKND_Flag), COUNT(*)
FROM business_dates bd
INNER JOIN actual_table at ON bd.table_date BETWEEN at.order_date AND at.contact_date
GROUP BY ORDER#
Instead of joining on a BETWEEN (which always results in a bad product join) followed by a COUNT, you'd better assign a business day number to each date (in the best case this is calculated only once and added as a column to your calendar table). Then it's two equi-joins and no aggregation needed:
WITH cte AS
(
    SELECT
        Cast(table_date AS DATE) AS table_date,
        -- assign a consecutive number to each business day,
        -- i.e. not increased during weekends, etc.
        Sum(CASE WHEN Holiday_WKND_Flag = 1 THEN 0 ELSE 1 END)
            Over (ORDER BY table_date
                  ROWS Unbounded Preceding) AS business_day_nbr
    FROM business_dates
)
SELECT ORDER#,
       Cast(t.contact_date AS DATE) - Cast(t.order_date AS DATE) AS #_of_days,
       b2.business_day_nbr - b1.business_day_nbr AS #_of_business_days
FROM actual_table AS t
JOIN cte AS b1
    ON Cast(t.order_date AS DATE) = b1.table_date
JOIN cte AS b2
    ON Cast(t.contact_date AS DATE) = b2.table_date
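To illustrate the numbering with some hypothetical calendar rows (flag 1 marks weekend/holiday dates):
table_date  Holiday_WKND_Flag  business_day_nbr
2012-01-05  0                  4
2012-01-06  0                  5
2012-01-07  1                  5
2012-01-08  1                  5
2012-01-09  0                  6
The number of business days between two dates is then just the difference of their business_day_nbr values.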
Btw, why are table_date and order_date timestamps instead of dates?
Porting from Oracle?
You can use this query. Hope it helps
select order#,
order_date,
contact_date,
(select count(1)
from business_dates_table
where table_date between a.order_date and a.contact_date
and holiday_wknd_flag = 0
) business_days
from actual_table a

Query to apply rate from the interval of dates

Let's say I have two tables in my oracle database
Table A : stDate, endDate, salary
For example:
03/02/2010 28/02/2010 2000
05/03/2012 29/03/2012 2500
Table B : DateOfActivation, rate
For example:
01/01/2010 1.023
01/11/2011 1.063
01/01/2012 1.075
I would like a SQL query displaying the sum of the salaries of Table A, with each salary multiplied by the rate of Table B that applies given the activation date.
Here, for the first salary the correct rate is the first one (1.023), because the second rate has an activation date later than the stDate-endDate interval.
For the second salary, the third rate is applied, because that rate's activation date was before the interval of dates of the second salary.
So the sum is: 2000 * 1.023 + 2500 * 1.075 = 4733.5
I hope I am clear
Thanks
Assuming the rate must be active before the beginning of the interval (i.e. DateOfActivation < stDate), you could do something like this (see fiddle):
SELECT SUM(salary*
(SELECT rate from TableB WHERE DateOfActivation=
(SELECT MAX(DateOfActivation) FROM TableB WHERE DateOfActivation < stDate)
)) FROM TableA;
This problem becomes much easier if DateOfActivation is a true effective-dated table with rate_start_date and rate_end_date columns, such that a new row cannot be created whose start date or end date lies within an existing rate_start_date/rate_end_date pair. The currently active row would typically have a NULL rate_end_date. In addition, you would likely want an EMP_ID on the salary table to be able to sum the rows and finish the calculation. One also needs to consider the following cases:
Start Date is between rate_start and rate_end
End Date is between rate_start and rate_end
Rate_start and Rate_end are between start_date and end_date (sandwiched)
If you run the following snippet you will see we can artificially create our rate_end_dates as follows:
SELECT D.ACTIVEDATE, D.RATE, NVL(MIN(E.ACTIVEDATE)-1,SYSDATE) ENDDATE
FROM XX_DATEOFACTIVATION D, XX_DATEOFACTIVATION E
WHERE D.ACTIVEDATE<E.ACTIVEDATE(+)
GROUP BY D.ACTIVEDATE, D.RATE
ORDER BY D.ACTIVEDATE
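With the sample rates from the question, that snippet should produce something like:
ACTIVEDATE   RATE    ENDDATE
01/01/2010   1.023   31/10/2011
01/11/2011   1.063   31/12/2011
01/01/2012   1.075   (SYSDATE)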
Proposed code is as follows:
SELECT DISTINCT * FROM
(SELECT S.*, T.RATE, S.SALARY*T.RATE
FROM XX_SAL_HIST S,
(SELECT D.ACTIVEDATE, D.RATE, NVL(MIN(E.ACTIVEDATE)-1,SYSDATE) ENDDATE
FROM XX_DATEOFACTIVATION D, XX_DATEOFACTIVATION E
WHERE D.ACTIVEDATE<E.ACTIVEDATE(+)
GROUP BY D.ACTIVEDATE, D.RATE) T -- creating synthetic rate_end_date
WHERE S.STDATE BETWEEN T.ACTIVEDATE AND T.ENDDATE)
UNION
(SELECT S.*, T.RATE, S.SALARY*T.RATE
FROM XX_SAL_HIST S,
(SELECT D.ACTIVEDATE, D.RATE, NVL(MIN(E.ACTIVEDATE)-1,SYSDATE) ENDDATE
FROM XX_DATEOFACTIVATION D, XX_DATEOFACTIVATION E
WHERE D.ACTIVEDATE<E.ACTIVEDATE(+)
GROUP BY D.ACTIVEDATE, D.RATE) T -- creating synthetic rate_end_date
WHERE S.ENDDATE BETWEEN T.ACTIVEDATE AND T.ENDDATE)
UNION
(SELECT S.*, T.RATE, S.SALARY*T.RATE
FROM XX_SAL_HIST S,
(SELECT D.ACTIVEDATE, D.RATE, NVL(MIN(E.ACTIVEDATE)-1,SYSDATE) ENDDATE
FROM XX_DATEOFACTIVATION D, XX_DATEOFACTIVATION E
WHERE D.ACTIVEDATE<E.ACTIVEDATE(+)
GROUP BY D.ACTIVEDATE, D.RATE) T -- creating synthetic rate_end_date
WHERE T.ACTIVEDATE BETWEEN S.STDATE AND S.ENDDATE)
The first thing to do is to transform Table B (Table2 in the query) to have, for each row, the start and end dates:
Select DateOfActivation AS startDate
, rate
, NVL(LEAD(DateOfActivation, 1) OVER (ORDER BY DateOfActivation)
, TO_DATE('9999/12/31', 'yyyy/mm/dd')) AS endDate
From Table2
Now we can join this table with Table A (Table1 in the query):
WITH Rates AS (
Select DateOfActivation AS startDate
, rate
, NVL(LEAD(DateOfActivation, 1) OVER (ORDER BY DateOfActivation)
, TO_DATE('9999/12/31', 'yyyy/mm/dd')) AS endDate
From Table2)
Select SUM(s.salary * r.rate)
From Rates r
INNER JOIN Table1 s ON s.stDate < r.endDate AND s.endDate > r.startDate
The JOIN condition gets every row in Table A that is at least partially within the activation period of the rate; if you need it to be fully inclusive, you can alter it as in the following query:
WITH Rates AS (
Select DateOfActivation AS startDate
, rate
, NVL(LEAD(DateOfActivation, 1) OVER (ORDER BY DateOfActivation)
, TO_DATE('9999/12/31', 'yyyy/mm/dd')) AS endDate
From Table2)
Select SUM(s.salary * r.rate)
From Rates r
INNER JOIN Table1 s ON s.stDate >= r.startDate AND s.endDate <= r.endDate

How to calculate average value based on duration between measurements?

I have data similar to this:
Price  DateChanged  Product
10     2012-01-01   A
12     2012-02-01   A
30     2012-03-01   A
10     2012-09-01   A
12     2013-01-01   A
110    2012-01-01   B
112    2012-02-01   B
130    2012-03-01   B
110    2012-09-01   B
112    2013-01-01   B
I want to calculate average value, but the challenge is this:
Look at the first record: price 10 is valid for a duration of one month, price 12 is valid for one month, while price 30 is valid for six months.
So a basic average for product A, (10+12+30+10+12)/5, would result in 14.8, while taking duration into account the average price would be ~20.1.
What is the best approach to solve this?
I know I could create a sub-query with a row_number() to join against to calculate a duration, but is there a better way? SQL Server has powerful features like STDistance, so surely there is a function for this?
What you are looking for is called a weighted average, and AFAIK there is no built-in function in SQL Server that calculates it for you. However, it is not that hard to calculate by hand.
First, you need to find the weight of each data point; in this case, the duration of each price period. You might have some additional columns in your data that could enable an easier lookup, but you could do it like this as well:
SELECT p1.Product, p1.Price, p1.DateChanged AS DateStart,
isnull(min(p2.DateChanged),getdate()) AS DateEnd
INTO #PricePlanStartEnd
FROM PricePlan p1
LEFT OUTER JOIN PricePlan p2
ON p1.DateChanged < p2.DateChanged
AND p1.Product =p2.Product
GROUP BY p1.Product, p1.Price, p1.DateChanged
ORDER BY p1.Product, p1.DateChanged
This creates a #PricePlanStartEnd temporary table that has the start and the end of each price period. I've used getdate() as the end of the current time period. If you need to calculate an average only up to the last price change, just use an INNER JOIN instead of the LEFT OUTER JOIN.
After that you just need to divide the sum of (price * period) by the total length of the period, and get the answer.
Here is an SQL Fiddle with the calculation
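The final step would be something along these lines (a sketch using day-level durations against the temp table built above):
SELECT Product,
       SUM(Price * DATEDIFF(day, DateStart, DateEnd)) * 1.0
       / SUM(DATEDIFF(day, DateStart, DateEnd)) AS WeightedAvgPrice
FROM #PricePlanStartEnd
GROUP BY Product;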
Also, when you're working with months, you must remember that not all months are equal: the price for December was active longer than it would be for February.
Using a CTE and row_number() to get the monthly average up to the last dateChanged (Fiddle-Demo):
;with cte as (
select product, dateChanged, price,
row_number() over (partition by product order by datechanged) rn
from x
)
select t1.product,
sum(t1.price *1.0 * datediff(month, t1.dateChanged,t2.dateChanged))/12 monthlyAvg
from cte t1 join cte t2 on t1.product = t2.product
and t1.rn +1 = t2.rn
group by t1.product
--Results
Product MonthlyAvg
A 20.166666
B 120.166666
Or, if you need an up-to-date daily average, use a LEFT JOIN instead (Fiddle-Demo):
;with cte as (
select product, dateChanged, price,
row_number() over (partition by product order by datechanged) rn
from x
)
select t1.product,
sum(t1.price *1.0 *
datediff(day, t1.dateChanged,isnull(t2.dateChanged,getdate())))/365 dailyAvg
from cte t1 left join cte t2 on t1.product = t2.product
and t1.rn +1 = t2.rn
group by t1.product
--Results
product dailyAvg
A 21.386301
B 130.975342

query to display additional column based on aggregate value

I've been mulling over this problem for a couple of hours now with no luck, so I thought people on SO might be able to help :)
I have a table with data regarding processing volumes at stores. The first three columns shown below can be queried from that table. What I'm trying to do is to add a 4th column that's basically a flag indicating whether a store has processed >= $150, and if so, displays the corresponding date. The first instance where the store has surpassed $150 is the date that gets displayed; subsequent processing volumes don't count after the first time the activated date is hit. For example, for store 4, there's just one instance of the activated date.
store_id  sales_volume  date        activated_date
--------  ------------  ----------  --------------
2         5             03/14/2012
2         125           05/21/2012
2         30            11/01/2012  11/01/2012
3         100           02/06/2012
3         140           12/22/2012  12/22/2012
4         300           10/15/2012  10/15/2012
4         450           11/25/2012
5         100           12/03/2012
Any insights as to how to build out this fourth column? Thanks in advance!
The solution starts by calculating the cumulative sales. Then, you want the activation date only when the cumulative sales first pass through the $150 level. This happens when adding the current sales amount pushes the cumulative amount over the threshold. The following case expression handles this:
select t.store_id, t.sales_volume, t.date,
       (case when 150 > cumesales - t.sales_volume and 150 <= cumesales
             then t.date
        end) as ActivationDate
from (select t.*,
             sum(sales_volume) over (partition by store_id order by date) as cumesales
      from t
     ) t
If you have an older version of Postgres that does not support cumulative sum, you can get the cumulative sales with a subquery like:
(select sum(sales_volume) from t t2 where t2.store_id = t.store_id and t2.date <= t.date) as cumesales
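Plugged into the outer query, that older-Postgres variant would look roughly like:
select t.store_id, t.sales_volume, t.date,
       (case when 150 > cumesales - t.sales_volume and 150 <= cumesales
             then t.date
        end) as ActivationDate
from (select t.*,
             (select sum(t2.sales_volume) from t t2
              where t2.store_id = t.store_id and t2.date <= t.date) as cumesales
      from t
     ) t;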
Variant 1
You can LEFT JOIN to a table that calculates the first date surpassing the $150 limit per store:
SELECT t.*, b.activated_date
FROM tbl t
LEFT JOIN (
SELECT store_id, min(thedate) AS activated_date
FROM (
SELECT store_id, thedate
,sum(sales_volume) OVER (PARTITION BY store_id
ORDER BY thedate) AS running_sum
FROM tbl
) a
WHERE running_sum >= 150
GROUP BY 1
) b ON t.store_id = b.store_id AND t.thedate = b.activated_date
ORDER BY t.store_id, t.thedate;
The calculation of the first day has to be done in two steps, since the window function accumulating the running sum has to be applied in a separate SELECT.
Variant 2
Another window function instead of the LEFT JOIN. May or may not be faster. Test with EXPLAIN ANALYZE.
SELECT *
,CASE WHEN running_sum >= 150 AND thedate = first_value(thedate)
OVER (PARTITION BY store_id, running_sum >= 150 ORDER BY thedate)
THEN thedate END AS activated_date
FROM (
SELECT *
,sum(sales_volume)
OVER (PARTITION BY store_id ORDER BY thedate) AS running_sum
FROM tbl
) b
ORDER BY store_id, thedate;
See this sqlfiddle demonstrating both.

Merge overlapping date intervals

Is there a better way of merging overlapping date intervals?
The solution I came up with is so simple that now I wonder if someone else has a better idea of how this could be done.
/***** DATA EXAMPLE *****/
DECLARE @T TABLE (d1 DATETIME, d2 DATETIME)
INSERT INTO @T (d1, d2)
SELECT '2010-01-01','2010-03-31' UNION SELECT '2010-04-01','2010-05-31'
UNION SELECT '2010-06-15','2010-06-25' UNION SELECT '2010-06-26','2010-07-10'
UNION SELECT '2010-08-01','2010-08-05' UNION SELECT '2010-08-01','2010-08-09'
UNION SELECT '2010-08-02','2010-08-07' UNION SELECT '2010-08-08','2010-08-08'
UNION SELECT '2010-08-09','2010-08-12' UNION SELECT '2010-07-04','2010-08-16'
UNION SELECT '2010-11-01','2010-12-31' UNION SELECT '2010-03-01','2010-06-13'
/***** INTERVAL ANALYSIS *****/
WHILE (1=1) BEGIN
    UPDATE t1 SET t1.d2 = t2.d2
    FROM @T AS t1 INNER JOIN @T AS t2 ON
        DATEADD(day, 1, t1.d2) BETWEEN t2.d1 AND t2.d2
    IF @@ROWCOUNT = 0 BREAK
END
/***** RESULT *****/
SELECT StartDate = MIN(d1), EndDate = d2
FROM @T
GROUP BY d2
ORDER BY StartDate, EndDate
/***** OUTPUT *****/
/*****
StartDate EndDate
2010-01-01 2010-06-13
2010-06-15 2010-08-16
2010-11-01 2010-12-31
*****/
I was looking for the same solution and came across this post on Combine overlapping datetime to return single overlapping range record.
There is another thread on Packing Date Intervals.
I tested this with various date ranges, including the ones listed here, and it works correctly every time.
SELECT s1.d1 AS StartDate,
       MIN(t1.d2) AS EndDate
FROM @T s1
INNER JOIN @T t1 ON s1.d1 <= t1.d2
    AND NOT EXISTS(SELECT * FROM @T t2
                   WHERE t1.d2 >= t2.d1 AND t1.d2 < t2.d2)
WHERE NOT EXISTS(SELECT * FROM @T s2
                 WHERE s1.d1 > s2.d1 AND s1.d1 <= s2.d2)
GROUP BY s1.d1
ORDER BY s1.d1
The result is (note that this merges only overlapping ranges; adjacent ranges such as 2010-06-25 and 2010-06-26 remain separate):
StartDate | EndDate
2010-01-01 | 2010-06-13
2010-06-15 | 2010-06-25
2010-06-26 | 2010-08-16
2010-11-01 | 2010-12-31
You asked this back in 2010 but didn't specify any particular version.
An answer for people on SQL Server 2012+
WITH T1
AS (SELECT *,
MAX(d2) OVER (ORDER BY d1) AS max_d2_so_far
FROM @T),
T2
AS (SELECT *,
CASE
WHEN d1 <= DATEADD(DAY, 1, LAG(max_d2_so_far) OVER (ORDER BY d1))
THEN 0
ELSE 1
END AS range_start
FROM T1),
T3
AS (SELECT *,
SUM(range_start) OVER (ORDER BY d1) AS range_group
FROM T2)
SELECT range_group,
MIN(d1) AS d1,
MAX(d2) AS d2
FROM T3
GROUP BY range_group
Which returns
+-------------+------------+------------+
| range_group | d1 | d2 |
+-------------+------------+------------+
| 1 | 2010-01-01 | 2010-06-13 |
| 2 | 2010-06-15 | 2010-08-16 |
| 3 | 2010-11-01 | 2010-12-31 |
+-------------+------------+------------+
DATEADD(DAY, 1 is used because your desired results show you want a period ending on 2010-06-25 to be collapsed into one starting 2010-06-26. For other use cases this may need adjusting.
Here is a solution with just three simple scans. No CTEs, no recursion, no joins, no table updates in a loop, no "group by" — as a result, this solution should scale the best (I think).
I think number of scans can be reduced to two, if min and max dates are known in advance;
the logic itself just needs two scans — find gaps, applied twice.
declare @datefrom datetime, @datethru datetime
DECLARE @T TABLE (d1 DATETIME, d2 DATETIME)
INSERT INTO @T (d1, d2)
SELECT '2010-01-01','2010-03-31'
UNION SELECT '2010-03-01','2010-06-13'
UNION SELECT '2010-04-01','2010-05-31'
UNION SELECT '2010-06-15','2010-06-25'
UNION SELECT '2010-06-26','2010-07-10'
UNION SELECT '2010-08-01','2010-08-05'
UNION SELECT '2010-08-01','2010-08-09'
UNION SELECT '2010-08-02','2010-08-07'
UNION SELECT '2010-08-08','2010-08-08'
UNION SELECT '2010-08-09','2010-08-12'
UNION SELECT '2010-07-04','2010-08-16'
UNION SELECT '2010-11-01','2010-12-31'
select @datefrom = min(d1) - 1, @datethru = max(d2) + 1 from @t

SELECT StartDate, EndDate
FROM
(
    SELECT MAX(EndDate) OVER (ORDER BY StartDate) + 1 StartDate,
           LEAD(StartDate) OVER (ORDER BY StartDate) - 1 EndDate
    FROM
    (
        SELECT StartDate, EndDate
        FROM
        (
            SELECT MAX(EndDate) OVER (ORDER BY StartDate) + 1 StartDate,
                   LEAD(StartDate) OVER (ORDER BY StartDate) - 1 EndDate
            FROM
            (
                SELECT d1 StartDate, d2 EndDate FROM @T
                UNION ALL
                SELECT @datefrom StartDate, @datefrom EndDate
                UNION ALL
                SELECT @datethru StartDate, @datethru EndDate
            ) T
        ) T
        WHERE StartDate <= EndDate
        UNION ALL
        SELECT @datefrom StartDate, @datefrom EndDate
        UNION ALL
        SELECT @datethru StartDate, @datethru EndDate
    ) T
) T
WHERE StartDate <= EndDate
The result is:
StartDate EndDate
2010-01-01 2010-06-13
2010-06-15 2010-08-16
2010-11-01 2010-12-31
The idea is to simulate the scanning algorithm for merging intervals. My solution makes sure it works across a wide range of SQL implementations. I've tested it on MySQL, Postgres, SQL-Server 2017, SQLite and even Hive.
Assuming the table schema is the following.
CREATE TABLE t (
a DATETIME,
b DATETIME
);
We also assume the interval is half-open like [a,b).
When (a,i,j) is in the view r defined below, it shows that there are j intervals covering a, and i intervals covering the previous point.
CREATE VIEW r AS
SELECT a,
Sum(d) OVER (ORDER BY a ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS i,
Sum(d) OVER (ORDER BY a ROWS UNBOUNDED PRECEDING) AS j
FROM (SELECT a, Sum(d) AS d
FROM (SELECT a, 1 AS d FROM t
UNION ALL
SELECT b, -1 AS d FROM t) e
GROUP BY a) f;
We produce all the endpoints in the union of the intervals and pair up adjacent ones. Finally, we produce the set of intervals by only picking the odd-numbered rows.
SELECT a, b
FROM (SELECT a,
Lead(a) OVER (ORDER BY a) AS b,
Row_number() OVER (ORDER BY a) AS n
FROM r
WHERE j=0 OR i=0 OR i is null) e
WHERE n%2 = 1;
I've created a sample DB-fiddle and SQL-fiddle. I also wrote a blog post on union intervals in SQL.
A Geometric Approach
Here and elsewhere I've noticed that date packing questions don't provide a geometric approach to this problem. After all, any range, date ranges included, can be interpreted as a line. So why not convert them to a SQL geometry type and utilize geometry::UnionAggregate to merge the ranges?
Why?
This has the advantage of handling all types of overlaps, including fully nested ranges. It also works like any other aggregate query, so it's a little more intuitive in that respect. You also get the bonus of a visual representation of your results if you care to use it. Finally, it is the approach I use for simultaneous range packing (you work with rectangles instead of lines in that case, and there are many more considerations). I just couldn't get the existing approaches to work in that scenario.
This has the disadvantage of requiring more recent versions of SQL Server. It also requires a numbers table and it's annoying to extract the individually produced lines from the aggregated shape. But hopefully in the future Microsoft adds a TVF that allows you to do this easily without a numbers table (or you can just build one yourself). Also, geometrical objects work with floats, so you have conversion annoyances and precision concerns to keep in mind.
Performance-wise I don't know how it compares, but I've done a few things (not shown here) to make it work for me even with large datasets.
Code Description
In 'numbers': I build a table representing a sequence. Swap it out with your favorite way to make a numbers table. For a union operation, you won't ever need more rows than are in your original table, so I just use it as the base to build it.
In 'mergeLines': I convert the dates to floats and use those floats to create geometrical points. In this problem we're working in 'integer space', meaning there are no time considerations, so a begin date in one range that is one day apart from an end date in another should be merged with that other. In order to make that merge happen, we need to convert to 'real space', so we add 1 to the tail of all ranges (we undo this later). I then connect these points via STUnion and STEnvelope. Finally, I merge all these lines via UnionAggregate. The resulting 'lines' geometry object might contain multiple lines, but if they overlap, they turn into one line.
In the outer query: I use the numbers CTE to extract the individual lines inside 'lines'. I envelope the lines, which ensures that each line is stored only as its two endpoints. I read the endpoint x values and convert them back to their time representations, making sure to put them back into 'integer space'.
The Code
with
numbers as (
select row_number() over (order by (select null)) i
from @t
),
mergeLines as (
select lines = geometry::UnionAggregate(line)
from @t
cross apply (select line =
geometry::Point(convert(float, d1), 0, 0).STUnion(
geometry::Point(convert(float, d2) + 1, 0, 0)
).STEnvelope()
) l
)
select ap.StartDate,
ap.EndDate
from mergeLines ml
join numbers n on n.i between 1 and ml.lines.STNumGeometries()
cross apply (select line = ml.lines.STGeometryN(i).STEnvelope()) l
cross apply (select
StartDate = convert(datetime,l.line.STPointN(1).STX),
EndDate = convert(datetime,l.line.STPointN(3).STX) - 1
) ap
order by ap.StartDate;
In this solution, I created a temporary Calendar table which stores a value for every day across a range. This type of table can be made static. In addition, I'm only storing 400 some odd dates starting with 2009-12-31. Obviously, if your dates span a larger range, you would need more values.
In addition, this solution will only work with SQL Server 2005+ in that I'm using a CTE.
With Calendar As
(
Select DateAdd(d, ROW_NUMBER() OVER ( ORDER BY s1.object_id ), '1900-01-01') As [Date]
From sys.columns as s1
Cross Join sys.columns as s2
)
, StopDates As
(
Select C.[Date]
From Calendar As C
Left Join #T As T
On C.[Date] Between T.d1 And T.d2
Where C.[Date] >= ( Select Min(T2.d1) From #T As T2 )
And C.[Date] <= ( Select Max(T2.d2) From #T As T2 )
And T.d1 Is Null
)
, StopDatesInUse As
(
Select D1.[Date]
From StopDates As D1
Left Join StopDates As D2
On D1.[Date] = DateAdd(d,1,D2.Date)
Where D2.[Date] Is Null
)
, DataWithEarliestStopDate As
(
Select *
, (Select Min(SD2.[Date])
From StopDatesInUse As SD2
Where T.d2 < SD2.[Date] ) As StopDate
From #T As T
)
Select Min(d1), Max(d2)
From DataWithEarliestStopDate
Group By StopDate
Order By Min(d1)
EDIT: The problem with using dates in 2009 has nothing to do with the final query. The problem was that the Calendar table was not big enough: I originally started it at 2009-12-31. I have revised it to start at 1900-01-01.
Try this
;WITH T1 AS
(
SELECT d1, d2, ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS R
FROM #T
), NUMS AS
(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS R
FROM T1 A
CROSS JOIN T1 B
CROSS JOIN T1 C
), ONERANGE AS
(
SELECT DISTINCT DATEADD(DAY, ROW_NUMBER() OVER(PARTITION BY T1.R ORDER BY (SELECT 0)) - 1, T1.D1) AS ELEMENT
FROM T1
CROSS JOIN NUMS
WHERE NUMS.R <= DATEDIFF(DAY, d1, d2) + 1
), SEQUENCE AS
(
SELECT ELEMENT, DATEDIFF(DAY, '19000101', ELEMENT) - ROW_NUMBER() OVER(ORDER BY ELEMENT) AS rownum
FROM ONERANGE
)
SELECT MIN(ELEMENT) AS StartDate, MAX(ELEMENT) as EndDate
FROM SEQUENCE
GROUP BY rownum
The basic idea is to first unroll the existing data so you get a separate row for each day; this is done in ONERANGE.
Then, identify the relationship between how the dates increment and how the row numbers do.
The difference remains constant within an existing range/island; as soon as you get to a new island, the difference increases, because the date increments by more than 1 while the row number increments by 1. For example, for days 1, 2, 3, 7, 8 the date-minus-rownumber values are k, k, k, k+3, k+3, so the two islands fall into two groups.
This solution is similar to the first one, with an additional deletion condition. It merges the data in the main table itself instead of using a different table to store the result.
DROP TABLE IF EXISTS #SampleTable;
CREATE TABLE #SampleTable (StartTime DATETIME NULL, EndTime DATETIME NULL);
INSERT INTO #SampleTable(StartTime, EndTime)
VALUES
(N'2010-01-01T00:00:00', N'2010-03-31T00:00:00'),
(N'2010-03-01T00:00:00', N'2010-06-13T00:00:00'),
(N'2010-04-01T00:00:00', N'2010-05-31T00:00:00'),
(N'2010-06-15T00:00:00', N'2010-06-25T00:00:00'),
(N'2010-06-26T00:00:00', N'2010-07-10T00:00:00'),
(N'2010-07-04T00:00:00', N'2010-08-16T00:00:00'),
(N'2010-08-01T00:00:00', N'2010-08-05T00:00:00'),
(N'2010-08-01T00:00:00', N'2010-08-09T00:00:00'),
(N'2010-08-02T00:00:00', N'2010-08-07T00:00:00'),
(N'2010-08-08T00:00:00', N'2010-08-08T00:00:00'),
(N'2010-08-09T00:00:00', N'2010-08-12T00:00:00'),
(N'2010-11-01T00:00:00', N'2010-12-31T00:00:00');
DECLARE @RowCount INT = 0;
WHILE (1=1)
BEGIN
    SET @RowCount = 0;

    UPDATE T1
    SET T1.EndTime = T2.EndTime
    FROM dbo.#SampleTable AS T1
    INNER JOIN dbo.#SampleTable AS T2 ON DATEADD(DAY, 1, T1.EndTime) BETWEEN T2.StartTime AND T2.EndTime;

    SET @RowCount = @RowCount + @@ROWCOUNT;

    DELETE T1
    FROM dbo.#SampleTable AS T1
    INNER JOIN dbo.#SampleTable AS T2 ON T1.EndTime = T2.EndTime AND T1.StartTime > T2.StartTime;

    SET @RowCount = @RowCount + @@ROWCOUNT;

    IF @RowCount = 0
        BREAK;
END;

SELECT * FROM #SampleTable
I was inspired by the Geometric Approach given by pwilcox, but wanted to try a different tack. This uses Trino, but I hope the functions used can also be found in other SQL dialects.
WITH Geo AS (
SELECT
transform( -- 6) See below.
ST_Geometries( -- 5) Extracts an array of individual lines from the union.
geometry_union( -- 4) Returns the union of aggregated lines, melding all lines together into a single geometric multi-line.
array_agg( -- 3) Aggregation function that joins all lines together.
ST_LineString( -- 2) Makes the pairs of geometric points into lines.
ARRAY[ST_Point(0, to_unixtime(d1)), ST_Point(0, to_unixtime(d2))] -- 1) Takes unix time start and end values and makes them into an array of geometric points.
)
)
)
)
, x -> (ST_YMin(x), ST_Length(x))) AS timestamp_duration -- 6) From the array of lines, the minimum value and length of each line are extracted.
FROM #T -- The minimum value is a timestamp; the length is a duration.
WHERE d1 <> d2 -- I had errors any time this was the case.
)
-- 7) Finally, I unnest the produced array and convert the values back into timestamps.
SELECT from_unixtime(timestamp) AS StartDate
, from_unixtime(timestamp + duration) AS EndDate
FROM Geo
CROSS JOIN UNNEST(timestamp_duration) AS t(timestamp, duration)
For reference, this took my company cluster about 2 minutes to make 400k start/end timestamps into 700 distinct start/end timestamps.
It also runs in just 2 stages.