Oracle SQL: How to eliminate redundant recursive calls in CTE

The below set represents the sales of a product in consecutive weeks.
22,19,20,23,16,14,15,15,18,21,24,10,17
...
weekly sales table
date sales
week-1 : 22
week-2 : 19
week-3 : 20
...
week-12 : 10
week-13 : 17
I need to find the longest run of equal-or-higher sales figures over consecutive weeks, i.e. week-6 to week-11, represented by 14, 15, 15, 18, 21, 24.
I am trying to use a recursive CTE to move forward to the next week(s) to find whether the sales value is equal or higher. As long as the value is equal or higher, it keeps moving to the next week, recording the ROWNUMBER of the anchor member (which represents the starting week number) and the week number of the iterated row. With this approach there are redundant recursive calls. For example, when the cte is anchored at week-2, it iterates week-3 and week-4, as the sales value is higher in each of those weeks than in the previous one. After week-2, the anchor should next fire at week-6, because week-3 and week-4 have already been visited (week-5 is removed by filt_coll).
Basically, if I have already visited a row of filt_coll in my recursive calls, I do not want it to be passed to the CTE again. The rows marked as redundant below should not be produced, and the values in the actualweek column should be unique.
I know the sql below does not give a solution to my problem of finding the longest run of higher values; I can work that out from the max count per startweek. For now, I am trying to figure out how to eliminate the redundant recursive calls.
START_WEEK | SALES | SALESLAG | SALESLEAD | ACTUALWEEK
1 | 22 | 0 | -3 | 1
2 | 19 | -3 | 1 | 2
2 | 20 | 1 | 3 | 3
2 | 23 | 3 | -7 | 4
3 | 20 | 1 | 3 | 3 <-(redundant)
3 | 23 | 3 | -7 | 4 <-(redundant)
4 | 23 | 3 | -7 | 4 <-(redundant)
6 | 14 | -2 | 1 | 6
...
with
-- begin test data
raw_data (sales) as
(
select '22,19,20,23,16,14,15,15,18,21,24,10,17' from dual
)
,
derived_tbl(week, sales) as
(
select level, regexp_substr(sales, '([[:digit:]]+)(,|$)', 1, level, null, 1)
from raw_data connect by level <= regexp_count(sales,',')+1
)
-- end test data
,
coll(week, sales, saleslag, saleslead) as
(
select week, sales,
nvl(sales - (lag(sales) over (order by week)), 0),
nvl((lead(sales) over (order by week) - sales), 0)
from derived_tbl
)
,
filt_coll(week, sales, saleslag, saleslead) as
(
select week, sales, saleslag, saleslead
from coll
where not (saleslag < 0 and saleslead < 0)
)
,
cte(startweek, sales, saleslag, saleslead, actualweek) as
(
select week, sales, saleslag, saleslead, week from filt_coll
-- where week not in (select week from cte)
-- *** want to achieve the effect of the above commented out line
union all
select cte.startweek, cl.sales, cl.saleslag, cl.saleslead, cl.week
from filt_coll cl, cte
where cl.week = cte.actualweek + 1 and cl.sales >= cte.sales
)
select * from cte
order by 1,actualweek
;
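One way to avoid the redundant anchors (an untested sketch, not necessarily the only fix) is to restrict the anchor member to rows that begin a run, i.e. the first week plus any week whose sales dropped from the previous week, so every other week is reached exactly once through the recursive member. The rewrite below keeps the same test data; it assumes Oracle 12c+ for FETCH FIRST and wraps the split values in to_number, since regexp_substr returns strings.
with
derived_tbl(week, sales) as
(
select level,
to_number(regexp_substr('22,19,20,23,16,14,15,15,18,21,24,10,17', '[^,]+', 1, level))
from dual connect by level <= 13
)
,
coll(week, sales, saleslag) as
(
select week, sales,
nvl(sales - (lag(sales) over (order by week)), 0)
from derived_tbl
)
,
runs(startweek, sales, actualweek) as
(
-- anchor only at rows that begin a run (week 1, or a drop from the previous week),
-- so no week is ever passed to the recursive member twice
select week, sales, week
from coll
where week = 1 or saleslag < 0
union all
-- keep extending the run while the next week's sales are equal or higher
select runs.startweek, cl.sales, cl.week
from coll cl, runs
where cl.week = runs.actualweek + 1 and cl.sales >= runs.sales
)
select startweek, min(actualweek) as first_week, max(actualweek) as last_week, count(*) as run_length
from runs
group by startweek
order by run_length desc
fetch first 1 row only;
With the sample data this returns startweek 6, first_week 6, last_week 11 and run_length 6, i.e. the week-6 to week-11 run described above.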

Related

Get historical average and count of a value where a date could exist more than once

I have a table with multiple equal date entries and a value. I need a table that calculates the historical value and the count of entries per date. I want to use the data to create some charts in gnuplot/etc later.
Raw data:
date | value
------------+------
2017-11-26 | 5
2017-11-26 | 5
2017-11-26 | 5
2017-11-28 | 20
2017-11-28 | 5
2018-01-07 | 200
2018-01-07 | 5
2018-01-07 | 20
2018-01-15 | 5
2018-01-16 | 50
Output should be:
date       | avg    | count | manual calc explanation
-----------+--------+-------+-----------------------------------------
2017-11-26 | 5      | 3     | (5+5+5) / 3 = 5
2017-11-28 | 8      | 2     | (5+5+5+20+5) / 5 = 8
2018-01-07 | 33.125 | 3     | (5+5+5+20+5+200+5+20) / 8 = 33.125
2018-01-15 | 30     | 1     | (5+5+5+20+5+200+5+20+5) / 9 = 30
2018-01-16 | 32     | 1     | (5+5+5+20+5+200+5+20+5+50) / 10 = 32
If it is not possible to calculate two different columns, I would be fine with just the avg column. For counting only the dates I already have a solution: "SELECT DISTINCT date, COUNT(date) FROM table_name GROUP BY date ORDER BY date".
I played around with DISTINCTs, GROUP BYs, JOINs, etc., but I did not find any solution. I found some other articles on the web, but none of them covers a case where a date is listed more than once in the table.
You want a running average (total value divided by total count up to the row). This is done with window functions.
select
date,
sum(sum_value) over (order by date) as running_sum,
sum(cnt) over (order by date) as running_count,
sum(sum_value) over (order by date) /
sum(cnt) over (order by date) as running_average
from
(
select date, sum(value) as sum_value, count(*) as cnt
from mytable
group by date
) aggregated
order by date;
Demo: https://dbfiddle.uk/?rdbms=postgres_13&fiddle=fb13b63970cb096913a53075b8b5c8d7
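If you also want the per-date count shown in your expected output next to the running average, the inner aggregate can simply be exposed as well. A small, untested variation on the query above (the cast to numeric is only needed if value is an integer column, to avoid integer division):
select
date,
cnt as entries_on_date, -- per-date count: 3, 2, 3, 1, 1 for the sample data
(sum(sum_value) over (order by date))::numeric
/ sum(cnt) over (order by date) as running_average
from
(
select date, sum(value) as sum_value, count(*) as cnt
from mytable
group by date
) aggregated
order by date;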

What's the most efficient way to calculate a rolling aggregate in SQL?

I have a dataset that includes a bunch of clients and date ranges that they had a "stay." For example:
| ClientID | DateStart | DateEnd |
+----------+-----------+---------+
| 1 | Jan 1 | Jan 31 | (datediff = 30)
| 1 | Apr 4 | May 4 | (datediff = 30)
| 2 | Jan 3 | Feb 27 | (datediff = 55)
| 3 | Jan 1 | Jan 7 | (datediff = 6)
| 3 | Jan 10 | Jan 17 | (datediff = 6)
| 3 | Jan 20 | Jan 27 | (datediff = 6)
| 3 | Feb 1 | Feb 7 | (datediff = 6)
| 3 | Feb 10 | Feb 17 | (datediff = 6)
| 3 | Feb 20 | Feb 27 | (datediff = 6)
My ultimate goal is to be able to identify the dates on which a client passed a threshold of N nights in the past X time. Let's say 30 days in the last 90 days. I also need to know when they pass out of the threshold. Use case: hotel stays and a VIP status.
In the example above, Client 1 passed the threshold on Jan 31 (had 30 nights in past 90 days), and still kept meeting the threshold until April 2 (now only 29 nights in the past 90 days), but passed the threshold again on May 4.
Client 2 passed the threshold on Feb 3, and kept meeting the threshold until April 28th, at which point the earliest days are more than 90 days ago and they expire.
Client 3 passed the threshold around Feb 17.
So I would like to generate a table like this:
| ClientID | VIPStart | VIPEnd |
+----------+-----------+---------+
| 1 | Jan 31 | Apr 2 |
| 1 | May 4 | Jul 5 |
| 2 | Feb 3 | Apr 28 |
| 3 | Feb 17 | Apr 11 |
(Forgive me if the dates are slightly off, I'm doing this in my head)
Ideally I would like to generate a view, as I will need to reference it often.
What I want to know is what's the most efficient way to generate this? Assuming I have thousands of clients and hundreds of thousands of stays.
The way that I've been approaching this so far has been to use a SQL statement that includes a parameter: as of {?Date}, who had VIP status and who didn't. I do that by calculating DATEADD(day,-90,{?Date}), then excluding the records that are out of the range, then truncating the DateStarts that extend earlier and DateEnds that extend later, then calculating the DATEDIFF(day,DateStart,DateEnd) for the resulting stays using adjusted DateStart and DateEnd, then getting a SUM() of the resulting DATEDIFF() for each Client as of {?Date}. It works, but it's not pretty. And it gives me a point in time snapshot; I want the history.
It seems a little inefficient to generate a table of dates and then, for every single date, use the above method.
Another option I considered was converting the raw data into an exploded table with each record corresponding to one night, then I can count it easier. Like this:
| ClientID | StayDate |
+----------+-----------+
| 1 | Jan 1 |
| 1 | Jan 2 |
| 1 | Jan 3 |
| 1 | Jan 4 |
etc.
Then I could just add a column counting the number of days in the past 90 days, and that'll get me most of the way there.
But I'm not sure how to do that in a view. I have a code snippet that does this:
WITH DaysTally AS (
SELECT MAX(DATEDIFF(day, DateStart, DateEnd)) - 1 AS Tally
FROM Stays
UNION ALL
SELECT Tally - 1 AS Expr1
FROM DaysTally AS DaysTally_1
WHERE (Tally - 1 >= 0))
SELECT t.ClientID,
DATEADD(day, c.Tally, t.DateStart) AS "StayDate"
FROM Stays AS t
INNER JOIN DaysTally AS c ON
DATEDIFF(day, t.DateStart, t.DateEnd) - 1 >= c.Tally
OPTION (MAXRECURSION 0)
But I can't get it to work without the MAXRECURSION option, and I don't think you can save a view with MAXRECURSION.
And now I'm rambling. So the help that I'm looking for is: what is the most efficient method to pursue my goal? And if you have a code example, that would be helpful too! Thanks.
This is an interesting and pretty well-asked question. I would start by enumerating the days from the beginning of each client's first stay until 90 days after the end of their last stay with a recursive cte. You can then bring in the stays table with a left join, and use window functions to flag the "VIP" days (note that this assumes no overlapping stays for a given client, which is consistent with your sample data).
What follows is gaps-and-islands: you can use a window sum to put "adjacent" VIP days in groups, and then aggregate.
with cte as (
select clientID, min(dateStart) dt, dateadd(day, 90, max(dateEnd)) dateMax
from stays
group by clientID
union all
select clientID, dateadd(day, 1, dt), dateMax
from cte
where dt < dateMax
)
select clientID, min(dt) VIPStart, max(dt) VIPEnd
from (
select t.*, sum(isNotVip) over(partition by clientID order by dt) grp
from (
select
c.clientID,
c.dt,
case when count(s.clientID) over(
partition by c.clientID
order by c.dt
rows between 90 preceding and current row
) >= 30
then 0
else 1
end isNotVip
from cte c
left join stays s
on c.clientID = s.clientID and c.dt between s.dateStart and s.dateEnd
) t
) t
where isNotVip = 0
group by clientID, grp
order by clientID, VIPStart
option (maxrecursion 0)
This demo on DB Fiddle with your sample data produces:
clientID | VIPStart | VIPEnd
-------: | :--------- | :---------
1 | 2020-01-30 | 2020-04-01
1 | 2020-05-03 | 2020-07-04
2 | 2020-02-01 | 2020-04-28
3 | 2020-02-07 | 2020-04-20
You can put this in a view as follows:
the order by and option (maxrecursion) clauses must be omitted when creating the view
each and every query that has the view in its from clause must end with option (maxrecursion 0), as sketched below
Demo
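A minimal sketch of that wrapper pattern, with purely illustrative object names and a trivial recursive body standing in for the VIP query above:
create view dbo.Numbers_v
as
with n as (
select 1 as i
union all
select i + 1 from n where i < 500
)
select i from n;
go
-- every query that reads the view has to supply the hint itself;
-- without it the default limit of 100 recursions applies and the select fails
select i
from dbo.Numbers_v
option (maxrecursion 0);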
You can eliminate the recursion by creating a tally table in the view. The approach is then the following:
For each period, generate dates from 90 days before the period to 90 days after. These are all the "candidate days" that the period could affect.
For each row, add a flag as to whether it is in the period (as opposed to the 90 days before and after).
Aggregate by client id and date.
Use a running sum to get the days with 30+ in the previous 90 days.
Then filter for the ones with 30+ days and treat this as a gaps-and-islands problem.
Assuming that 1000 days is sufficient for the periods (including the 90 days before and after), then the query looks like this:
with n as (
select v.n
from (values (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)) v(n)
),
nums as (
select (n1.n * 100 + n2.n * 10 + n3.n) as n
from n n1 cross join n n2 cross join n n3
),
running90 as (
select clientid, dte, sum(in_period) over (partition by clientid order by dte rows between 89 preceding and current row) as running_90
from (select t.clientid, dateadd(day, n.n - 90, datestart) as dte,
max(case when dateadd(day, n.n - 90, datestart) >= datestart and dateadd(day, n.n - 90, datestart) <= t.dateend then 1 else 0 end) as in_period
from t join
nums n
on dateadd(day, n.n - 90, datestart) <= dateadd(day, 90, dateend)
group by t.clientid, dateadd(day, n.n - 90, datestart)
) t
)
select clientid, min(dte), max(dte)
from (select r.*,
row_number() over (partition by clientid order by dte) as seqnum
from running90 r
where running_90 >= 30
) r
group by clientid, dateadd(day, - seqnum, dte);
Having no recursive CTE (although one could be used for n), this is not subject to the maxrecursion issue.
Here is a db<>fiddle.
The results are slightly different from your results. This is probably due to some slight difference in the definitions: the above includes the end day as an "occupied" day, and the 90 days are the 89 preceding days plus the current day. The second-to-last query shows the running 90-day counts, and that seems correct to me.

Querying across months and days

My access logs database stores time as epoch and extracts year, month and day as integers. Further, the partitioning of the database is based on the extracted Y/m/d, and I have a 35-day retention.
If I run this query:
select *
from mydb
where year in (2017, 2018)
and month in (12, 1)
and day in (31, 1)
On the 29th of January, 2018, I will get data for 12/31/2017 and 1/1/2018.
On the 5th of January, 2018, I will get data for 12/1/2017, 12/31/2017, and 1/1/2018 (undesirable)
I also realize that I can do something like this:
select *
from mydb
where (year = 2017 and month = 12 and day = 31)
or (year = 2018 and month = 1 and day = 1)
But what I am really looking for is this: a good way to write a query where I give the year, month and day numbers as the start, then a fourth value (a number of days), and get all the data for, say, 12/31/2017 + 5 days.
Is there a native way in SQL to accomplish this? I have an enormous data set and if I don't specify the days and have to rely on the epoch to do this, the query takes forever. I also have no influence over the partitioning configuration.
With Impala as the dbms and SQL dialect you will be able to use common table expressions but not recursion; in addition, there may be problems with inserting parameters.
Below is an untested suggestion that will require you to locate some function alternatives. First it generates a set of rows with an integer from 0 to 999 (in the example); it is quite easy to expand the number of rows if required. From those rows it is possible to add the number of days to a timestamp literal using date_add(timestamp startdate, int days/interval expression) and then, with year(timestamp date), month(timestamp date) and day(timestamp date) (see the Date and Time functions), create the columns needed to match your data.
Overall you should be able to build a common table expression that has columns for year, month and day covering the wanted range, which you can then inner join to your source table, thereby implementing a date range filter.
The code below was produced using T-SQL (SQL Server) and it can be tested here.
-- produce a set of integers, adjust to suit needed number of these
;WITH
cteDigits AS (
SELECT 0 AS digit UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL
SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
)
, cteTally AS (
SELECT
d1s.digit
+ d10s.digit * 10
+ d100s.digit * 100 /* add more like this as needed */
-- + d1000s.digit * 1000 /* add more like this as needed */
AS num
FROM cteDigits d1s
CROSS JOIN cteDigits d10s
CROSS JOIN cteDigits d100s /* add more like this as needed */
-- CROSS JOIN cteDigits d1000s /* add more like this as needed */
)
, DateRange AS (
select
num
, dateadd(day,num,'20181227') dt
, year(dateadd(day,num,'20181227')) yr
, month(dateadd(day,num,'20181227')) mn
, day(dateadd(day,num,'20181227')) dy
from cteTally
where num < 10
)
select
*
from DateRange
I think these are the Impala equivalents for the function calls used above:
, DateRange AS (
select
num
, date_add(to_timestamp('20181227','yyyyMMdd'),num) dt
, year( date_add(to_timestamp('20181227','yyyyMMdd'),num) ) yr
, month( date_add(to_timestamp('20181227','yyyyMMdd'),num) ) mn
, day( date_add(to_timestamp('20181227','yyyyMMdd'),num) ) dy
from cteTally
where num < 10
)
Hopefully you can work out how to use these. Ultimately the purpose is to use the generated date range like so:
select * from mydb t
inner join DateRange on t.year = DateRange.yr and t.month = DateRange.mn and t.day = DateRange.dy
original post
Well, in the absence of knowing which database to propose solutions for, here is a suggestion using SQL Server:
This suggestion involves a recursive common table expression, which may then be used as an inner join to your source data to limit the results to a date range.
--Sql Server 2014 Express Edition
--https://rextester.com/l/sql_server_online_compiler
declare @yr as integer = 2018
declare @mn as integer = 12
declare @dy as integer = 27
declare @du as integer = 10
;with CTE as (
select
datefromparts(@yr, @mn, @dy) as dt
, @yr as yr
, @mn as mn
, @dy as dy
union all
select
dateadd(dd,1,cte.dt)
, datepart(year,dateadd(dd,1,cte.dt))
, datepart(month,dateadd(dd,1,cte.dt))
, datepart(day,dateadd(dd,1,cte.dt))
from cte
where cte.dt < dateadd(dd,@du-1,datefromparts(@yr, @mn, @dy))
)
select
*
from cte
This produces the following result:
+----+---------------------+------+----+----+
| | dt | yr | mn | dy |
+----+---------------------+------+----+----+
| 1 | 27.12.2018 00:00:00 | 2018 | 12 | 27 |
| 2 | 28.12.2018 00:00:00 | 2018 | 12 | 28 |
| 3 | 29.12.2018 00:00:00 | 2018 | 12 | 29 |
| 4 | 30.12.2018 00:00:00 | 2018 | 12 | 30 |
| 5 | 31.12.2018 00:00:00 | 2018 | 12 | 31 |
| 6 | 01.01.2019 00:00:00 | 2019 | 1 | 1 |
| 7 | 02.01.2019 00:00:00 | 2019 | 1 | 2 |
| 8 | 03.01.2019 00:00:00 | 2019 | 1 | 3 |
| 9 | 04.01.2019 00:00:00 | 2019 | 1 | 4 |
| 10 | 05.01.2019 00:00:00 | 2019 | 1 | 5 |
+----+---------------------+------+----+----+
and:
select * from mydb t
inner join cte on t.year = cte.yr and t.month = cte.mn and t.day = cte.dy
Instead of a recursive common table expression, a table of integers may be used (or a set of UNIONed SELECT queries to generate the integers) - often known as a tally table. The method one chooses will depend on the dbms type and version being used.
Again, depending on the database, it may be more efficient to persist the result seen above as a temporary table and add an index to it, as sketched below.
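As a rough, untested T-SQL sketch of that last point, the generated range can be persisted into an indexed temporary table and joined exactly like the CTE version (mydb and its year/month/day columns are from the question; everything else is illustrative):
;with n(num) as (
select v.num from (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) v(num) -- 10 days; widen as needed
)
select dateadd(day, num, '20181227') as dt
, year(dateadd(day, num, '20181227')) as yr
, month(dateadd(day, num, '20181227')) as mn
, day(dateadd(day, num, '20181227')) as dy
into #DateRange
from n;
create clustered index ix_daterange on #DateRange (yr, mn, dy);
select t.*
from mydb t
inner join #DateRange d on t.year = d.yr and t.month = d.mn and t.day = d.dy;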

Joining series of dates and counting continuous days

Let's say I have a table as below
date add_days
2015-01-01 5
2015-01-04 2
2015-01-11 7
2015-01-20 10
2015-01-30 1
What I want to do is check the days_balance, i.e. whether a date is greater or smaller than the previous date + N days (add_days), and take the cumulative sum of the day counts if they form a continuous series.
So the algorithm should work like
for i in 2:N_rows {
days_balance[i] := date[i-1] + add_days[i-1] - date[i]
if days_balance[i] >= 0 then
date[i] := date[i] + days_balance[i]
}
The expected result should be as follows
date days_balance
2015-01-01 0
2015-01-04 2
2015-01-11 -3
2015-01-20 -2
2015-01-30 0
Is it possible in pure SQL? I imagine it should be, with some conditional joins, but I cannot see how it could be implemented.
I'm posting another answer since it may be nice to compare the two, as they use different methods (this one just does an n^2-style join; the other one uses a recursive CTE). This one takes advantage of the fact that you don't have to calculate the days_balance for each previous row before calculating it for a particular row; you just need to sum things from previous days.
drop table junk
create table junk(date DATETIME, add_days int)
insert into junk values
('2015-01-01',5 ),
('2015-01-04',2 ),
('2015-01-11',7 ),
('2015-01-20',10 ),
('2015-01-30',1 )
;WITH cte as
(
select ROW_NUMBER() OVER (ORDER BY date) i, date, add_days, ISNULL(DATEDIFF(DAY, LAG(date) OVER (ORDER BY date), date), 0) days_since_prev
FROM Junk
)
, combinedWithAllPreviousDaysCte as
(
select i [curr_i], date [curr_date], add_days [curr_add_days], days_since_prev [curr_days_since_prev], 0 [prev_add_days], 0 [prev_days_since_prev] from cte where i = 1 --get first row explicitly since it has no preceding rows
UNION ALL
select curr.i [curr_i], curr.date [curr_date], curr.add_days [curr_add_days], curr.days_since_prev [curr_days_since_prev], prev.add_days [prev_add_days], prev.days_since_prev [prev_days_since_prev]
from cte curr
join cte prev on curr.i > prev.i --join to all previous days
)
select curr_i, curr_date, SUM(prev_add_days) - curr_days_since_prev - SUM(prev_days_since_prev) [days_balance]
from combinedWithAllPreviousDaysCte
group by curr_i, curr_date, curr_days_since_prev
order by curr_i
outputs:
+--------+-------------------------+--------------+
| curr_i | curr_date | days_balance |
+--------+-------------------------+--------------+
| 1 | 2015-01-01 00:00:00.000 | 0 |
| 2 | 2015-01-04 00:00:00.000 | 2 |
| 3 | 2015-01-11 00:00:00.000 | -3 |
| 4 | 2015-01-20 00:00:00.000 | -5 |
| 5 | 2015-01-30 00:00:00.000 | -5 |
+--------+-------------------------+--------------+
Well, I think I have it with a recursive CTE (sorry, I only have Microsoft SQL Server available to me at the moment, so it may not comply with PostgreSQL).
Also I think the expected results you had were off (see comment above). If not, this can probably be modified to conform to your math.
drop table junk
create table junk(date DATETIME, add_days int)
insert into junk values
('2015-01-01',5 ),
('2015-01-04',2 ),
('2015-01-11',7 ),
('2015-01-20',10 ),
('2015-01-30',1 )
;WITH cte as
(
select ROW_NUMBER() OVER (ORDER BY date) i, date, add_days, ISNULL(DATEDIFF(DAY, LAG(date) OVER (ORDER BY date), date), 0) days_since_prev
FROM Junk
)
,recursiveCte (i, date, add_days, days_since_prev, days_balance, math) as
(
select top 1
i,
date,
add_days,
days_since_prev,
0 [days_balance],
CAST('no math for initial one, just has zero balance' as varchar(max)) [math]
from cte where i = 1
UNION ALL --recursive step now
select
curr.i,
curr.date,
curr.add_days,
curr.days_since_prev,
prev.days_balance - curr.days_since_prev + prev.add_days [days_balance],
CAST(prev.days_balance as varchar(max)) + ' - ' + CAST(curr.days_since_prev as varchar(max)) + ' + ' + CAST(prev.add_days as varchar(max)) [math]
from cte curr
JOIN recursiveCte prev ON curr.i = prev.i + 1
)
select i, DATEPART(day,date) [day], add_days, days_since_prev, days_balance, math
from recursiveCTE
order by date
And the results are like so:
+---+-----+----------+-----------------+--------------+------------------------------------------------+
| i | day | add_days | days_since_prev | days_balance | math |
+---+-----+----------+-----------------+--------------+------------------------------------------------+
| 1 | 1 | 5 | 0 | 0 | no math for initial one, just has zero balance |
| 2 | 4 | 2 | 3 | 2 | 0 - 3 + 5 |
| 3 | 11 | 7 | 7 | -3 | 2 - 7 + 2 |
| 4 | 20 | 10 | 9 | -5 | -3 - 9 + 7 |
| 5 | 30 | 1 | 10 | -5 | -5 - 10 + 10 |
+---+-----+----------+-----------------+--------------+------------------------------------------------+
I don't quite get how your algorithm returns your expected results, but let me share a technique I came up with that might help.
This will only work if the end result of your data is to be exported to Excel, and even then it won't work in all scenarios depending on what format you export your dataset in, but here it is...
If you're familiar with Excel formulas, what I discovered is that if you write an Excel formula in your SQL as another field, Excel will execute that formula for you as soon as you export (the method that works best for me is just copying and pasting into Excel, so that it doesn't format the result as text).
So for your example, here's what you could do (noting again that I don't understand your algorithm, so this is probably wrong, but it's just to give you the concept):
SELECT
date
, add_days
, '=INDEX($1:$65536,ROW()-1,COLUMN()-2)'
||'+INDEX($1:$65536,ROW()-1,COLUMN()-1)'
||'-INDEX($1:$65536,ROW(),COLUMN()-2)'
AS "days_balance[i]"
,'=IF(INDEX($1:$65536,ROW(),COLUMN()-1)>=0'
||',INDEX($1:$65536,ROW(),COLUMN()-3)'
||'+INDEX($1:$65536,ROW(),COLUMN()-1))'
AS "date[i]"
FROM
myTable
ORDER BY /*Ensure to order by whatever you need for your formula to work*/
The key to making this work is using the INDEX function to select a cell based on the position of the current cell. ROW()-1 tells it to take the result from the previous record, and COLUMN()-2 means take the value from two columns to the left of the current one. You can't use plain cell references like A2+B2-A3, because the row numbers won't change on export and they would assume fixed column positions.
I used SQL string concatenation with || just so it's easier to read on screen.
I tried this in Excel; it didn't match your expected results, but if this technique works for you then just correct the Excel formula to suit.

How can I identify groups of consecutive dates in SQL?

I'm trying to write a function which identifies groups of dates and measures the size of each group.
I've been doing this procedurally in Python until now, but I'd like to move it into SQL.
For example, the list
Bill 01/01/2011
Bill 02/01/2011
Bill 03/01/2011
Bill 05/01/2011
Bill 07/01/2011
should be output into a new table as:
Bill 01/01/2011 3
Bill 02/01/2011 3
Bill 03/01/2011 3
Bill 05/01/2011 1
Bill 07/01/2011 1
Ideally this should also be able to account for weekends and public holidays - the dates in my table will always be Mon-Fri (I think I can solve this by making a new table of working days and numbering them in sequence). Someone at work suggested I try a CTE. I'm pretty new to this, so I'd appreciate any guidance anyone could provide! Thanks.
You can do this with a clever application of window functions. Consider the following:
select name, date, row_number() over (partition by name order by date)
from t
This adds a row number, which in your example would simply be 1, 2, 3, 4, 5. Now subtract that from the date, and you have a constant value within each group.
select name, date,
dateadd(d, - row_number() over (partition by name order by date), date) as val
from t
Finally, you want the number of groups in sequence. I would also add a group identifier (for instance, to distinguish between the last two).
select name, date,
count(*) over (partition by name, val) as NumInSeq,
dense_rank() over (partition by name order by val) as SeqID
from (select name, date,
dateadd(d, - row_number() over (partition by name order by date), date) as val
from t
) t
Somehow, I missed the part about weekdays and holidays. This solution does not solve that problem.
The following query accounts for weekends and holidays. The query has a provision to include the holidays on-the-fly, though for the purpose of making the query clearer, I just materialized the holidays into an actual table.
CREATE TABLE tx
(n varchar(4), d date);
INSERT INTO tx
(n, d)
VALUES
('Bill', '2006-12-29'), -- Friday
-- 2006-12-30 is Saturday
-- 2006-12-31 is Sunday
-- 2007-01-01 is New Year's Holiday
('Bill', '2007-01-02'), -- Tuesday
('Bill', '2007-01-03'), -- Wednesday
('Bill', '2007-01-04'), -- Thursday
('Bill', '2007-01-05'), -- Friday
-- 2007-01-06 is Saturday
-- 2007-01-07 is Sunday
('Bill', '2007-01-08'), -- Monday
('Bill', '2007-01-09'), -- Tuesday
('Bill', '2012-07-09'), -- Monday
('Bill', '2012-07-10'), -- Tuesday
('Bill', '2012-07-11'); -- Wednesday
create table holiday(d date);
insert into holiday(d) values
('2007-01-01');
/* query should return 7 consecutive days of good
attendance (from December 29 2006 to January 9 2007) */
/* and 3 consecutive days of attendance from July 9 2012 to July 11 2012. */
Query:
with first_date as
(
-- get the monday of the earliest date
select dateadd( ww, datediff(ww,0,min(d)), 0 ) as first_date
from tx
)
,shifted as
(
select
tx.n, tx.d,
diff = datediff(day, fd.first_date, tx.d)
- (datediff(day, fd.first_date, tx.d)/7 * 2)
from tx
cross join first_date fd
union
select
xxx.n, h.d,
diff = datediff(day, fd.first_date, h.d)
- (datediff(day, fd.first_date, h.d)/7 * 2)
from holiday h
cross join first_date fd
cross join (select distinct n from tx) as xxx
)
,grouped as
(
select *, grp = diff - row_number() over(partition by n order by d)
from shifted
)
select
d, n, dense_rank() over (partition by n order by grp) as nth_streak
,count(*) over (partition by n, grp) as streak
from grouped
where d not in (select d from holiday) -- remove the holidays
Output:
| D | N | NTH_STREAK | STREAK |
-------------------------------------------
| 2006-12-29 | Bill | 1 | 7 |
| 2007-01-02 | Bill | 1 | 7 |
| 2007-01-03 | Bill | 1 | 7 |
| 2007-01-04 | Bill | 1 | 7 |
| 2007-01-05 | Bill | 1 | 7 |
| 2007-01-08 | Bill | 1 | 7 |
| 2007-01-09 | Bill | 1 | 7 |
| 2012-07-09 | Bill | 2 | 3 |
| 2012-07-10 | Bill | 2 | 3 |
| 2012-07-11 | Bill | 2 | 3 |
Live test: http://www.sqlfiddle.com/#!3/815c5/1
The main logic of the query is to shift all the dates back by two days for each elapsed weekend. This is done by dividing the day number by 7 (integer division), multiplying the result by two, then subtracting that from the original number. For example, if a given date falls on the 15th, this is computed as 15/7 * 2 == 4; subtracting 4 from the original number gives 15 - 4 == 11, so the 15th becomes the 11th day. Likewise the 8th day becomes the 6th day; 8 - (8/7 * 2) == 6.
Weekends are not in attendance (e.g. 6, 7, 13, 14):
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15
Applying the computation to all the weekday numbers will yield these values:
1 2 3 4 5
6 7 8 9 10
11
For holidays, you need to slot them into the attendance so that the consecutiveness can be easily determined, then just remove them from the final query. The attendance above yields 11 consecutive days of good attendance.
Query logic's detailed explanation here: http://www.ienablemuch.com/2012/07/monitoring-perfect-attendance.html
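As a quick standalone sanity check of that shift formula (T-SQL here, to match the answer; the day numbers are the ones from the grids above):
select n as day_number
, n - (n / 7) * 2 as shifted_day -- integer division removes two days per elapsed weekend
from (values (1),(2),(3),(4),(5),(8),(9),(10),(11),(12),(15)) v(n);
The 8th day comes out as 6, the 9th as 7, and the 15th as 11, matching the mapping shown in the explanation.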