How to get correct year, month and day in firebird function datediff - sql

I have to ask another question about DATEDIFF in Firebird. I don't know how to get the correct result in this case:
worker x has two contracts of employment, the first from 1988-09-15 to 2000-03-16, the second from 2000-03-16 to 2005-02-28. The result I want is 16 years, 5 months and 3 days, because the first contract gives 11 years, 6 months and 1 day, and the second gives 4 years, 11 months and 2 days.
Can anyone tell me how to do this in Firebird? Most of all I would like to know how to turn a sum of months (17 months) into 5 months, with the other 12 months added to the year value. Now I have SQL like this:
SELECT
a.id_contact,
sum(floor(datediff(day, a.DATE_FROM, a.DATE_TO)/365.25)) as YEAR,
mod(sum(mod(floor(datediff(day, a.DATE_FROM, a.DATE_TO)/30.41),12)),12) as MTH
FROM KP a
group by a.id_contact
and then I get 5 months, but I don't have the 12 months added to the year value. Please help me...
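The month-overflow asked about here (how 17 accumulated months becomes 1 extra year plus 5 months) is plain integer division with remainder; a minimal Python sketch of just that arithmetic:

```python
# Splitting an accumulated month count into whole years plus leftover months
# is integer division with remainder: 17 months -> 1 year and 5 months.
total_months = 17
extra_years, months = divmod(total_months, 12)
print(extra_years, months)  # 1 5
```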

You should sum the days first, then calculate Y, M and D from that total.
SELECT
KP3.id_contact
, (KP3.D2-KP3.D1) / (12*31) AS Y
, ((KP3.D2-KP3.D1) - ((KP3.D2-KP3.D1) / (12*31)) * 12 * 31) / 31 AS M
, CAST(MOD((KP3.D2-KP3.D1) - (((KP3.D2-KP3.D1) / (12*31)) * 12 * 31), 31) AS INTEGER) AS D
FROM
(SELECT
KP2.id_contact, SUM(KP2.D1) AS D1, SUM(KP2.D2) AS D2
FROM
(
SELECT
KP.id_contact, DATEDIFF(MONTH, KP.DATE_FROM, KP.DATE_TO) / 12 AS Y, CAST(MOD(DATEDIFF(MONTH, KP.DATE_FROM, KP.DATE_TO), 12) AS INTEGER) AS M
, EXTRACT(YEAR FROM KP.DATE_FROM)*12*31+EXTRACT(MONTH FROM KP.DATE_FROM)*31+EXTRACT(DAY FROM KP.DATE_FROM) D1
, EXTRACT(YEAR FROM KP.DATE_TO)*12*31+EXTRACT(MONTH FROM KP.DATE_TO)*31+EXTRACT(DAY FROM KP.DATE_TO) D2
FROM
KP
) AS KP2
GROUP BY KP2.id_contact
) AS KP3

The proper approach seems to be: measure the DAYS spent on both assignments, sum those day counts, then convert the total into the inherently imprecise, give-or-take, street-talk-like form of years-months-days. More on this below.
Borrowing the conversion query from Livius and adjusting the coefficients to more realistic values, it develops as follows:
https://dbfiddle.uk/?rdbms=firebird_3.0&fiddle=2fba0ace6a70ae16a167ec838642dc28
Here we move step by step, building up from simple blocks into more and more complex ones, which finally gives us 16 years, 5 months and 2 days:
select rdb$get_context('SYSTEM', 'ENGINE_VERSION') as version from rdb$database;
| VERSION |
| :------ |
| 3.0.5 |
create table KP (
ID_CONTACT integer not null,
DATE_FROM date not null,
DATE_TO date not null
)
-- https://stackoverflow.com/questions/51551257/how-to-get-correct-year-month-and-day-in-firebird-function-datediff
✓
create index KP_workers on KP(id_contact)
✓
insert into KP values (1, '1988-09-15', '2000-03-16')
1 rows affected
insert into KP values (1, '2000-03-16', '2005-02-28')
1 rows affected
-- the sample data from https://stackoverflow.com/questions/60030543
-- might expose the rounding bug in my original formulae:
-- unexpected ROUNDING UP leading to NEGATIVE value for months
insert into KP values (2, '2018-02-08', '2019-12-01')
1 rows affected
insert into KP values (2, '2017-02-20', '2018-01-01')
1 rows affected
select a.*, datediff(day, a.DATE_FROM, a.DATE_TO) as DAYS_COUNT from KP a
ID_CONTACT | DATE_FROM | DATE_TO | DAYS_COUNT
---------: | :--------- | :--------- | :---------
1 | 1988-09-15 | 2000-03-16 | 4200
1 | 2000-03-16 | 2005-02-28 | 1810
2 | 2018-02-08 | 2019-12-01 | 661
2 | 2017-02-20 | 2018-01-01 | 315
-- Original answer by Livius
SELECT
KP3.id_contact
, KP3.D2-KP3.D1 as days_count
, (KP3.D2-KP3.D1) / (12*31) AS Y
, ((KP3.D2-KP3.D1) - ((KP3.D2-KP3.D1) / (12*31)) * 12 * 31) / 31 AS M
, CAST(MOD((KP3.D2-KP3.D1) - (((KP3.D2-KP3.D1) / (12*31)) * 12 * 31), 31) AS INTEGER) AS D
FROM
(SELECT
KP2.id_contact, SUM(KP2.D1) AS D1, SUM(KP2.D2) AS D2
FROM
(
SELECT
KP.id_contact, DATEDIFF(MONTH, KP.DATE_FROM, KP.DATE_TO) / 12 AS Y, CAST(MOD(DATEDIFF(MONTH, KP.DATE_FROM, KP.DATE_TO), 12) AS INTEGER) AS M
, EXTRACT(YEAR FROM KP.DATE_FROM)*12*31+EXTRACT(MONTH FROM KP.DATE_FROM)*31+EXTRACT(DAY FROM KP.DATE_FROM) D1
, EXTRACT(YEAR FROM KP.DATE_TO)*12*31+EXTRACT(MONTH FROM KP.DATE_TO)*31+EXTRACT(DAY FROM KP.DATE_TO) D2
FROM
KP
) AS KP2
GROUP BY KP2.id_contact
) AS KP3
ID_CONTACT | DAYS_COUNT | Y | M | D
---------: | :--------- | :- | :- | -:
1 | 6120 | 16 | 5 | 13
2 | 997 | 2 | 8 | 5
select ID_CONTACT, sum(DAYS_COUNT) as DAYS_COUNT
from (
select a.*, datediff(day, a.DATE_FROM, a.DATE_TO) as DAYS_COUNT from KP a
)
GROUP BY 1
ID_CONTACT | DAYS_COUNT
---------: | :---------
1 | 6010
2 | 976
-- this step taken from https://dbfiddle.uk/?rdbms=firebird_3.0&fiddle=52c1e130f589ca507c9ff185b5b2346d
-- based on the original Livius formula with non-exact integer coefficients
-- it seems not to generate negative counts, but still shows very different results
SELECT
KP_DAYS.id_contact,
KP_DAYS.DAYS_COUNT / (12*31) AS Y,
((KP_DAYS.DAYS_COUNT) - ((KP_DAYS.DAYS_COUNT) / (12*31)) * 12 * 31) / 31 AS M,
CAST(MOD((KP_DAYS.DAYS_COUNT) - (((KP_DAYS.DAYS_COUNT) / (12*31)) * 12 * 31), 31) AS INTEGER) AS D
FROM
(
select ID_CONTACT, sum(DAYS_COUNT) as DAYS_COUNT
from (
select a.*, datediff(day, a.DATE_FROM, a.DATE_TO) as DAYS_COUNT from KP a
)
GROUP BY 1
) as KP_DAYS
ID_CONTACT | Y | M | D
---------: | :- | :- | -:
1 | 16 | 1 | 27
2 | 2 | 7 | 15
SELECT
KP_DAYS.id_contact, KP_DAYS.days_count
, FLOOR(KP_DAYS.DAYS_COUNT / 365.25) AS Y
, FLOOR( (KP_DAYS.DAYS_COUNT - (FLOOR(KP_DAYS.DAYS_COUNT / 365.25) * 365.25) ) / 30.5) AS M
, CAST(MOD((KP_DAYS.DAYS_COUNT) - (((KP_DAYS.DAYS_COUNT) / 365.25) * 365.25), 30.5) AS INTEGER) AS D
FROM
(
select ID_CONTACT, sum(DAYS_COUNT) as DAYS_COUNT
from (
select a.*, datediff(day, a.DATE_FROM, a.DATE_TO) as DAYS_COUNT from KP a
)
GROUP BY 1
) as KP_DAYS
ID_CONTACT | DAYS_COUNT | Y | M | D
---------: | :--------- | :- | :- | -:
1 | 6010 | 16 | 5 | 2
2 | 976 | 2 | 8 | 1
Notice, the above is still not mathematically correct, but it should give a "gut feeling" of the time span.
The question of getting an EXACT AND PRECISE measure of a timespan in Y-M-D form is moot.
For example, you quoted 3 days while this query gives 2 days. I see no error there. Because months and years vary in length, you just can not correctly measure time DISTANCE in months. That would be like measuring geographical distance in cities.
How many New Yorks lie between London and Paris? How many Warsaws high is Elbrus mountain? You can not have any mathematically correct answer.
Thus you can only answer with NON-PRECISE estimations, suitable for give-or-take street talk. So any DATEDIFF-based query would essentially generate a perfectly valid answer of the kind "2Y 10M, give or take a few days" - an answer that IS valid for the context of "just give me an overall impression".
Marrying this simplicity of getting a feel for it with the perfectionism of mathematical accuracy is just not possible. For example, imagine you get a span of about 6Y. How many leap years should you account for? In the "6Y" from 1999 to 2004 there were TWO leap years, but in the same "6Y" from 1998 to 2003 there was only ONE. Which of those is the correct measure for "6Y"?
And then we have centuries, where 2000 was a leap year but 1900 was not. The same "sliding window" problem gives you a volatile, undefined number of leap years in timespans like "110Y". If you want to go towards the layman's perception and count timespans in "years and months", you have to accept that this makes things easy, simple and imprecise by definition. A mismatch of one or a few days over several years is the norm, and that is OK.
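The final conversion (sum the day counts, then divide by the average coefficients 365.25 and 30.5) can be sketched in Python. This mirrors the last query above, except that the day component is computed as a plain remainder rather than the query's MOD expression, so the D value can differ:

```python
def days_to_ymd(days):
    """Approximate a day count as (years, months, days) using the average
    coefficients 365.25 and 30.5 -- give-or-take by design, not calendar-exact."""
    years = int(days // 365.25)
    rem = days - years * 365.25
    months = int(rem // 30.5)
    d = int(rem - months * 30.5)
    return years, months, d

print(days_to_ymd(976))       # (2, 8, 1) -- matches the query output for id_contact 2
print(days_to_ymd(6010)[:2])  # (16, 5) -- years/months match; D depends on the remainder formula
```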

Related

What's the most efficient way to calculate a rolling aggregate in SQL?

I have a dataset that includes a bunch of clients and date ranges that they had a "stay." For example:
| ClientID | DateStart | DateEnd |
+----------+-----------+---------+
| 1 | Jan 1 | Jan 31 | (datediff = 30)
| 1 | Apr 4 | May 4 | (datediff = 30)
| 2 | Jan 3 | Feb 27 | (datediff = 55)
| 3 | Jan 1 | Jan 7 | (datediff = 6)
| 3 | Jan 10 | Jan 17 | (datediff = 6)
| 3 | Jan 20 | Jan 27 | (datediff = 6)
| 3 | Feb 1 | Feb 7 | (datediff = 6)
| 3 | Feb 10 | Feb 17 | (datediff = 6)
| 3 | Feb 20 | Feb 27 | (datediff = 6)
My ultimate goal is to be able to identify the dates on which a client passed a threshold of N nights in the past X time. Let's say 30 days in the last 90 days. I also need to know when they pass out of the threshold. Use case: hotel stays and a VIP status.
In the example above, Client 1 passed the threshold on Jan 31 (had 30 nights in past 90 days), and still kept meeting the threshold until April 2 (now only 29 nights in the past 90 days), but passed the threshold again on May 4.
Client 2 passed the threshold on Feb 3, and kept meeting the threshold until April 28th, at which point the earliest days are more than 90 days ago and they expire.
Client 3 passed the threshold on around Feb 17
So I would like to generate a table like this:
| ClientID | VIPStart | VIPEnd |
+----------+-----------+---------+
| 1 | Jan 31 | Apr 2 |
| 1 | May 4 | Jul 5 |
| 2 | Feb 3 | Apr 28 |
| 3 | Feb 17 | Apr 11 |
(Forgive me if the dates are slightly off, I'm doing this in my head)
Ideally I would like to generate a view, as I will need to reference it often.
What I want to know is what's the most efficient way to generate this? Assuming I have thousands of clients and hundreds of thousands of stays.
The way that I've been approaching this so far has been to use a SQL statement that includes a parameter: as of {?Date}, who had VIP status and who didn't. I do that by calculating DATEADD(day,-90,{?Date}), then excluding the records that are out of the range, then truncating the DateStarts that extend earlier and DateEnds that extend later, then calculating the DATEDIFF(day,DateStart,DateEnd) for the resulting stays using adjusted DateStart and DateEnd, then getting a SUM() of the resulting DATEDIFF() for each Client as of {?Date}. It works, but it's not pretty. And it gives me a point in time snapshot; I want the history.
it seems a little inefficient to generate a table of dates and then for every single date, use the above method.
Another option I considered was converting the raw data into an exploded table with each record corresponding to one night, then I can count it easier. Like this:
| ClientID | StayDate |
+----------+-----------+
| 1 | Jan 1 |
| 1 | Jan 2 |
| 1 | Jan 3 |
| 1 | Jan 4 |
etc.
Then I could just add a column counting the number of days in the past 90 days, and that'll get me most of the way there.
But I'm not sure how to do that in a view. I have a code snippet that does this:
WITH DaysTally AS (
SELECT MAX(DATEDIFF(day, DateStart, DateEnd)) - 1 AS Tally
FROM Stays
UNION ALL
SELECT Tally - 1 AS Expr1
FROM DaysTally AS DaysTally_1
WHERE (Tally - 1 >= 0))
SELECT t.ClientID,
DATEADD(day, c.Tally, t.DateStart) AS "StayDate"
FROM Stays AS t
INNER JOIN DaysTally AS c ON
DATEDIFF(day, t.DateStart, t.DateEnd) - 1 >= c.Tally
OPTION (MAXRECURSION 0)
But I can't get it to work without the MAXRECURSION and I don't think you can save a view with MAXRECURSION
And now I'm rambling. So the help that I'm looking for is: what is the most efficient method to pursue my goal? And if you have a code example, that would be helpful too! Thanks.
This is an interesting and pretty well-asked question. I would start by enumerating the days from the beginning of the first stay of each client until 90 days after the end of its last stay with a recursive cte. You can then bring in the stay table with a left join, and use window functions to flag the "VIP" days (note that this assumes no overlapping stays for a given client, which is consistent with your sample data).
What follows is gaps-and-islands: you can use a window sum to put "adjacent" VIP days in groups, and then aggregate.
with cte as (
select clientID, min(dateStart) dt, dateadd(day, 90, max(dateEnd)) dateMax
from stays
group by clientID
union all
select clientID, dateadd(day, 1, dt), dateMax
from cte
where dt < dateMax
)
select clientID, min(dt) VIPStart, max(dt) VIPEnd
from (
select t.*, sum(isNotVip) over(partition by clientID order by dt) grp
from (
select
c.clientID,
c.dt,
case when count(s.clientID) over(
partition by c.clientID
order by c.dt
rows between 90 preceding and current row
) >= 30
then 0
else 1
end isNotVip
from cte c
left join stays s
on c.clientID = s.clientID and c.dt between s.dateStart and s.dateEnd
) t
) t
where isNotVip = 0
group by clientID, grp
order by clientID, VIPStart
option (maxrecursion 0)
This demo on DB Fiddle with your sample data produces:
clientID | VIPStart | VIPEnd
-------: | :--------- | :---------
1 | 2020-01-30 | 2020-04-01
1 | 2020-05-03 | 2020-07-04
2 | 2020-02-01 | 2020-04-28
3 | 2020-02-07 | 2020-04-20
You can put this in a view as follows:
the order by and option(maxrecursion) clauses must be omitted when creating the view
each and every query that has the view in its from clause must end with option(maxrecursion 0)
Demo
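The gaps-and-islands step (a running sum of the "not VIP" flag acts as a group key, then take min/max per group) can be seen in miniature in pure Python. The per-day flags below are hypothetical sample values, shaped to mirror client 1 in the demo output above:

```python
from datetime import date, timedelta

# Hypothetical per-day VIP flags, one per consecutive day from Jan 1 2020.
start = date(2020, 1, 1)
vip = [False] * 29 + [True] * 63 + [False] * 31 + [True] * 63

# A running count of "not VIP" days is constant inside each VIP island,
# so it works as a group key -- the same idea as sum(isNotVip) over (...).
islands = {}
grp = 0
for i, flag in enumerate(vip):
    if not flag:
        grp += 1                     # a non-VIP day starts/extends a gap
    else:
        islands.setdefault(grp, []).append(start + timedelta(days=i))

for days in islands.values():
    print(min(days), max(days))      # one VIPStart/VIPEnd pair per island
```

This prints 2020-01-30 to 2020-04-01 and 2020-05-03 to 2020-07-04, the two intervals shown for client 1.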
You can eliminate the recursion by creating a tally table in the view. The approach is then the following:
For each period, generate dates from 90 days before the period to 90 days after. These are all the "candidate days" that the period could affect.
For each row, add a flag as to whether it is in the period (as opposed to the 90 days before and after).
Aggregate by client id and date.
Use a running sum to get the days with 30+ in the previous 90 days.
Then filter for the ones with 30+ days and treat this as a gaps-and-islands problem.
Assuming that 1000 days is sufficient for the periods (including the 90 days before and after), then the query looks like this:
with n as (
select v.n
from (values (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)) v(n)
),
nums as (
select (n1.n * 100 + n2.n * 10 + n3.n) as n
from n n1 cross join n n2 cross join n n3
),
running90 as (
select clientid, dte, sum(in_period) over (partition by clientid order by dte rows between 89 preceding and current row) as running_90
from (select t.clientid, dateadd(day, n.n - 90, datestart) as dte,
max(case when dateadd(day, n.n - 90, datestart) >= datestart and dateadd(day, n.n - 90, datestart) <= t.dateend then 1 else 0 end) as in_period
from t join
nums n
on dateadd(day, n.n - 90, datestart) <= dateadd(day, 90, dateend)
group by t.clientid, dateadd(day, n.n - 90, datestart)
) t
)
select clientid, min(dte), max(dte)
from (select r.*,
row_number() over (partition by clientid order by dte) as seqnum
from running90 r
where running_90 >= 30
) r
group by clientid, dateadd(day, - seqnum, dte);
Having no recursive CTE (although one could be used for n), this is not subject to the maxrecursion issue.
Here is a db<>fiddle.
The results are slightly different from yours, probably due to some slight difference in the definitions. The above includes the end day as an "occupied" day, and the 90 days are 89 preceding days plus the current day. The second-to-last query shows the running 90-day counts, and those seem correct to me.

How can I see point in time rolling five week counts of distinct values?

I am trying to see the point in time rolling five week count of distinct employees paid. For example, in week 48 I would need to see the count of distinct employees paid in weeks 44 through 48. I think I have to include something like "WHERE Week_Number BETWEEN Week_Number -5 AND Week_Number" but am not sure how to make this work. The output should just be the Year, Week Number, and count of distinct employee IDs.
SELECT Week_Number,
Year,
Account,
count(distinct EmployeeID) as EmployeeCount
FROM [Table]
GROUP BY Week_Number, Year, Account
I assume that you have a data table like this:
YearNumber | WeekNumber | Account | EmployeeID
----------------------------------------------
2019 | 51 | 101 | 1
2019 | 48 | 101 | 2
And this is the result you want to see:
YearNumber | WeekNumber | Account | Quantity
----------------------------------------------
2019 | 48 | 101 | 1
2019 | 49 | 101 | 1
2019 | 50 | 101 | 1
2019 | 51 | 101 | 2
2019 | 52 | 101 | 2
2020 | 1 | 101 | 1
2020 | 2 | 101 | 1
2020 | 3 | 101 | 1
So one person starts paying in week 48 and another in week 51, which means their payments on account 101 overlap in weeks 51 and 52; in the other weeks, only one person pays into the account.
To also answer your question in the comment: this, I think, is a good way to provide sample data and an expected result when you ask on SO.
The query which helped me produce the results above:
SELECT
d.Year + IIF((d.Week + n.Number - 1) >= 52, 1, 0) AS Year,
(d.Week + n.Number - 1) % 52 + 1 AS Week,
d.AccountID,
COUNT(d.EmployeeID) AS Quantity
FROM Data d
CROSS APPLY (SELECT * FROM Number n WHERE Number BETWEEN 0 AND 4) n
GROUP BY
d.Year + IIF((d.Week + n.Number - 1) >= 52, 1, 0), -- Year
(d.Week + n.Number - 1) % 52 + 1, -- Week
d.AccountID
This uses a Number table, which is basically a table containing the integers - it helps a lot in queries like this. The code also has minimal handling for the year turning over, but be aware that you may need to account for years containing 53 weeks.
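The cross-apply idea above (each payment row contributes to its own week and the four that follow, then count distinct employees per week) can be sketched in Python with the same two sample rows; the same minimal 52-week rollover caveat applies:

```python
from collections import defaultdict

# (year, week, account, employee_id) -- the two sample rows from above
rows = [(2019, 51, 101, 1), (2019, 48, 101, 2)]

paid = defaultdict(set)  # (year, week, account) -> distinct employee ids
for year, week, account, emp in rows:
    for offset in range(5):              # the payment week and the four after it
        w = week + offset
        y = year + (1 if w > 52 else 0)  # minimal rollover; assumes 52-week years
        paid[(y, (w - 1) % 52 + 1, account)].add(emp)

for key in sorted(paid):
    print(key, len(paid[key]))
```

This reproduces the expected result table: weeks 51 and 52 of 2019 show a count of 2, the rest show 1.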

Querying across months and days

My access logs database stores time as epoch and extracts year month and day as integers. Further, the partitioning of the database is based on the extracted Y/m/d and I have a 35 day retention.
If I run this query:
select *
from mydb
where year in (2017, 2018)
and month in (12, 1)
and day in (31, 1)
On the 29th of January, 2018, I will get data for 12/31/2017 and 1/1/2018.
On the 5th of January, 2018, I will get data for 12/1/2017, 12/31/2017, and 1/1/2018 (undesirable)
I also realize that I can do something like this:
select *
from mydb
where (year = 2017 and month = 12 and day = 31)
or (year = 2018 and month = 1 and day = 1)
But what I am really looking for is this: a good way to write a query where I give the year month and day number as the start and then a fourth value (number of days +) and then get all the data for 12/31/2017 + 5 days for example.
Is there a native way in SQL to accomplish this? I have an enormous data set and if I don't specify the days and have to rely on the epoch to do this, the query takes forever. I also have no influence over the partitioning configuration.
With Impala as the DBMS and SQL dialect you will be able to use common table expressions, but not recursion. In addition, there may be problems inserting parameters as well.
Below is an untested suggestion that will require you to locate some function alternatives. First it generates a set of rows with an integer from 0 to 999 (in the example); it is quite easy to expand the number of rows if required. From those rows it is possible to add the number of days to a timestamp literal using date_add(timestamp startdate, int days/interval expression), and then with year(timestamp date), month(timestamp date) and day(timestamp date) (see Date and Time functions) create the columns needed to match your data.
Overall, then, you should be able to build a common table expression that has columns for year, month and day covering the wanted range, and that you can inner join to your source table, thereby implementing a date range filter.
The code below was produced using T-SQL (SQL Server) and it can be tested here.
-- produce a set of integers, adjust to suit needed number of these
;WITH
cteDigits AS (
SELECT 0 AS digit UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL
SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
)
, cteTally AS (
SELECT
d1s.digit
+ d10s.digit * 10
+ d100s.digit * 100 /* add more like this as needed */
-- + d1000s.digit * 1000 /* add more like this as needed */
AS num
FROM cteDigits d1s
CROSS JOIN cteDigits d10s
CROSS JOIN cteDigits d100s /* add more like this as needed */
-- CROSS JOIN cteDigits d1000s /* add more like this as needed */
)
, DateRange AS (
select
num
, dateadd(day,num,'20181227') dt
, year(dateadd(day,num,'20181227')) yr
, month(dateadd(day,num,'20181227')) mn
, day(dateadd(day,num,'20181227')) dy
from cteTally
where num < 10
)
select
*
from DateRange
I think these are the Impala equivalents for the function calls used above:
, DateRange AS (
select
num
, date_add(to_timestamp('20181227','yyyyMMdd'),num) dt
, year( date_add(to_timestamp('20181227','yyyyMMdd'),num) ) yr
, month( date_add(to_timestamp('20181227','yyyyMMdd'),num) ) mn
, day( date_add(to_timestamp('20181227','yyyyMMdd'),num) ) dy
from cteTally
where num < 10
)
Hopefully you can work out how to use these. Ultimately the purpose is to use the generated date range like so:
select * from mydb t
inner join DateRange on t.year = DateRange.yr and t.month = DateRange.mn and t.day = DateRange.dy
original post
Well in the absence of knowing what database to propose solutions for, here is a suggestion using SQL Server:
This suggestion involves a recursive common table expression, which may then be used as an inner join to your source data to limit the results to a date range.
--Sql Server 2014 Express Edition
--https://rextester.com/l/sql_server_online_compiler
declare @yr as integer = 2018
declare @mn as integer = 12
declare @dy as integer = 27
declare @du as integer = 10
;with CTE as (
select
datefromparts(@yr, @mn, @dy) as dt
, @yr as yr
, @mn as mn
, @dy as dy
union all
select
dateadd(dd,1,cte.dt)
, datepart(year,dateadd(dd,1,cte.dt))
, datepart(month,dateadd(dd,1,cte.dt))
, datepart(day,dateadd(dd,1,cte.dt))
from cte
where cte.dt < dateadd(dd,@du-1,datefromparts(@yr, @mn, @dy))
)
select
*
from cte
This produces the following result:
+----+---------------------+------+----+----+
| | dt | yr | mn | dy |
+----+---------------------+------+----+----+
| 1 | 27.12.2018 00:00:00 | 2018 | 12 | 27 |
| 2 | 28.12.2018 00:00:00 | 2018 | 12 | 28 |
| 3 | 29.12.2018 00:00:00 | 2018 | 12 | 29 |
| 4 | 30.12.2018 00:00:00 | 2018 | 12 | 30 |
| 5 | 31.12.2018 00:00:00 | 2018 | 12 | 31 |
| 6 | 01.01.2019 00:00:00 | 2019 | 1 | 1 |
| 7 | 02.01.2019 00:00:00 | 2019 | 1 | 2 |
| 8 | 03.01.2019 00:00:00 | 2019 | 1 | 3 |
| 9 | 04.01.2019 00:00:00 | 2019 | 1 | 4 |
| 10 | 05.01.2019 00:00:00 | 2019 | 1 | 5 |
+----+---------------------+------+----+----+
and:
select * from mydb t
inner join cte on t.year = cte.yr and t.month = cte.mn and t.day = cte.dy
Instead of a recursive common table expression, a table of integers may be used (or a set of unioned select queries to generate the integers) - often known as a tally table. The method chosen will depend on the DBMS type and version being used.
Again, depending on the database, it may be more efficient to persist the result seen above as a temporary table and add an index to it.
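Since the goal is really just a list of (year, month, day) tuples covering "start date + N days", the range can also be generated client-side and pasted into the predicate; a Python sketch of that alternative:

```python
from datetime import date, timedelta

def ymd_range(year, month, day, n_days):
    """Yield (year, month, day) for the start date and the n_days - 1 days after it."""
    start = date(year, month, day)
    for i in range(n_days):
        d = start + timedelta(days=i)
        yield d.year, d.month, d.day

# Build the exact per-day predicate instead of the over-broad IN lists:
parts = [f"(year = {y} AND month = {m} AND day = {d})"
         for y, m, d in ymd_range(2017, 12, 31, 5)]
print("WHERE " + "\n   OR ".join(parts))
```

For 12/31/2017 + 5 days this emits one OR'ed term per day from 2017-12-31 through 2018-01-04, matching the partition columns exactly.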

Oracle SQL: How to eliminate redundant recursive calls in CTE

The below set represents the sales of a product in consecutive weeks.
22,19,20,23,16,14,15,15,18,21,24,10,17
...
weekly sales table
date sales
week-1 : 22
week-2 : 19
week-3 : 20
...
week-12 : 10
week-13 : 17
I need to find the longest run of higher sales figures for consecutive weeks, i.e. week-6 to week-11, represented by 14,15,15,18,21,24.
I am trying to use a recursive CTE to move forward to the next week(s) to find if the sales value is equal or higher. As long as the value is equal or higher, keep on moving to the next week, recording the ROWNUMBER of the anchor member (represents the starting week number) and the week number of the iterated row. With this approach, there are redundant recursive calls. For example, when cte is called for week-2, it iterates week-3, week-4 and week-5 as the sales values are higher on each week from its previous week. Now, after week-2, the cte should be called for week-5 as week-3, week-4 and week-5 have already been visited.
Basically, if I have already visited a row of filt_coll in my recursive calls, I do not want it to be passed to the CTE again. The rows marked as redundant should not appear, and the values in the actualweek column should be unique.
I know the SQL below does not give a solution to my problem of finding the longest run of higher values; I can work that out from the max count of the startweek column. For now, I am trying to figure out how to eliminate the redundant recursive calls.
START_WEEK | SALES | SALESLAG | SALESLEAD | ACTUALWEEK
1 | 22 | 0 | -3 | 1
2 | 19 | -3 | 1 | 2
2 | 20 | 1 | 3 | 3
2 | 23 | 3 | -7 | 4
3 | 20 | 1 | 3 | 3 <-(redundant)
3 | 23 | 3 | -7 | 4 <-(redundant)
4 | 23 | 3 | -7 | 4 <-(redundant)
6 | 14 | -2 | 1 | 6
...
with
-- begin test data
raw_data (sales) as
(
select '22,19,20,23,16,14,15,15,18,21,24,10,17' from dual
)
,
derived_tbl(week, sales) as
(
select level, regexp_substr(sales, '([[:digit:]]+)(,|$)', 1, level, null, 1)
from raw_data connect by level <= regexp_count(sales,',')+1
)
-- end test data
,
coll(week, sales, saleslag, saleslead) as
(
select week, sales,
nvl(sales - (lag(sales) over (order by week)), 0),
nvl((lead(sales) over (order by week) - sales), 0)
from derived_tbl
)
,
filt_coll(week, sales, saleslag, saleslead) as
(
select week, sales, saleslag, saleslead
from coll
where not (saleslag < 0 and saleslead < 0)
)
,
cte(startweek, sales, saleslag, saleslead, actualweek) as
(
select week, sales, saleslag, saleslead, week from filt_coll
-- where week not in (select week from cte)
-- *** want to achieve the effect of the above commented out line
union all
select cte.startweek, cl.sales, cl.saleslag, cl.saleslead, cl.week
from filt_coll cl, cte
where cl.week = cte.actualweek + 1 and cl.sales >= cte.sales
)
select * from cte
order by 1,actualweek
;
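For reference, the underlying goal (the longest run of equal-or-higher sales in consecutive weeks) needs only a single linear pass; a Python sketch over the sample data:

```python
sales = [22, 19, 20, 23, 16, 14, 15, 15, 18, 21, 24, 10, 17]

best_start, best_len = 0, 1
run_start = 0
for i in range(1, len(sales)):
    if sales[i] < sales[i - 1]:          # run broken: a new run starts here
        run_start = i
    if i - run_start + 1 > best_len:
        best_start, best_len = run_start, i - run_start + 1

# weeks are 1-based in the question
print(f"week-{best_start + 1} to week-{best_start + best_len}",
      sales[best_start:best_start + best_len])
# week-6 to week-11 [14, 15, 15, 18, 21, 24]
```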

How can I identify groups of consecutive dates in SQL?

I'm trying to write a function which identifies groups of consecutive dates and measures the size of each group.
I've been doing this procedurally in Python until now, but I'd like to move it into SQL.
For example, the list
Bill 01/01/2011
Bill 02/01/2011
Bill 03/01/2011
Bill 05/01/2011
Bill 07/01/2011
should be output into a new table as:
Bill 01/01/2011 3
Bill 02/01/2011 3
Bill 03/01/2011 3
Bill 05/01/2011 1
Bill 07/01/2011 1
Ideally this should also be able to account for weekends and public holidays - the dates in my table will always be Mon-Fri (I think I can solve this by making a new table of working days and numbering them in sequence). Someone at work suggested I try a CTE. I'm pretty new to this, so I'd appreciate any guidance anyone could provide! Thanks.
You can do this with a clever application of window functions. Consider the following:
select name, date, row_number() over (partition by name order by date)
from t
This adds a row number, which in your example would simply be 1, 2, 3, 4, 5. Now, take the difference from the date, and you have a constant value within each group.
select name, date,
dateadd(d, - row_number() over (partition by name order by date), date) as val
from t
Finally, you want the number of groups in sequence. I would also add a group identifier (for instance, to distinguish between the last two).
select name, date,
count(*) over (partition by name, val) as NumInSeq,
dense_rank() over (partition by name order by val) as SeqID
from (select name, date,
dateadd(d, - row_number() over (partition by name order by date), date) as val
from t
) t
Somehow, I missed the part about weekdays and holidays. This solution does not solve that problem.
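The date-minus-row_number trick can be seen in miniature in Python: subtracting the sequence position from each date gives a value that is constant exactly within a run of consecutive dates, so grouping on it recovers the runs:

```python
from datetime import date, timedelta
from itertools import groupby

# Bill's dates from the question (dd/mm/yyyy in the post)
dates = [date(2011, 1, 1), date(2011, 1, 2), date(2011, 1, 3),
         date(2011, 1, 5), date(2011, 1, 7)]

# date - row_number is constant within a consecutive run (the window-function trick)
keyed = [(d - timedelta(days=i), d) for i, d in enumerate(dates)]
for _, grp in groupby(keyed, key=lambda kd: kd[0]):
    run = [d for _, d in grp]
    for d in run:
        print(d, len(run))   # each date with the size of its run: 3, 3, 3, 1, 1
```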
The following query accounts for weekends and holidays. The query has a provision to include the holidays on the fly, though for the purpose of making the query clearer, I just materialized the holidays into an actual table.
CREATE TABLE tx
(n varchar(4), d date);
INSERT INTO tx
(n, d)
VALUES
('Bill', '2006-12-29'), -- Friday
-- 2006-12-30 is Saturday
-- 2006-12-31 is Sunday
-- 2007-01-01 is New Year's Holiday
('Bill', '2007-01-02'), -- Tuesday
('Bill', '2007-01-03'), -- Wednesday
('Bill', '2007-01-04'), -- Thursday
('Bill', '2007-01-05'), -- Friday
-- 2007-01-06 is Saturday
-- 2007-01-07 is Sunday
('Bill', '2007-01-08'), -- Monday
('Bill', '2007-01-09'), -- Tuesday
('Bill', '2012-07-09'), -- Monday
('Bill', '2012-07-10'), -- Tuesday
('Bill', '2012-07-11'); -- Wednesday
create table holiday(d date);
insert into holiday(d) values
('2007-01-01');
/* query should return 7 consecutive good
attendance days (from December 29 2006 to January 9 2007) */
/* and 3 consecutive attendance days from July 9 2012 to July 11 2012. */
Query:
with first_date as
(
-- get the monday of the earliest date
select dateadd( ww, datediff(ww,0,min(d)), 0 ) as first_date
from tx
)
,shifted as
(
select
tx.n, tx.d,
diff = datediff(day, fd.first_date, tx.d)
- (datediff(day, fd.first_date, tx.d)/7 * 2)
from tx
cross join first_date fd
union
select
xxx.n, h.d,
diff = datediff(day, fd.first_date, h.d)
- (datediff(day, fd.first_date, h.d)/7 * 2)
from holiday h
cross join first_date fd
cross join (select distinct n from tx) as xxx
)
,grouped as
(
select *, grp = diff - row_number() over(partition by n order by d)
from shifted
)
select
d, n, dense_rank() over (partition by n order by grp) as nth_streak
,count(*) over (partition by n, grp) as streak
from grouped
where d not in (select d from holiday) -- remove the holidays
Output:
| D | N | NTH_STREAK | STREAK |
-------------------------------------------
| 2006-12-29 | Bill | 1 | 7 |
| 2007-01-02 | Bill | 1 | 7 |
| 2007-01-03 | Bill | 1 | 7 |
| 2007-01-04 | Bill | 1 | 7 |
| 2007-01-05 | Bill | 1 | 7 |
| 2007-01-08 | Bill | 1 | 7 |
| 2007-01-09 | Bill | 1 | 7 |
| 2012-07-09 | Bill | 2 | 3 |
| 2012-07-10 | Bill | 2 | 3 |
| 2012-07-11 | Bill | 2 | 3 |
Live test: http://www.sqlfiddle.com/#!3/815c5/1
The main logic of the query is to shift all the dates two days back per elapsed week. This is done by dividing the day number by 7 (integer division), multiplying by two, then subtracting that from the original number. For example, if a given date falls on the 15th day, this is computed as 15/7 * 2 == 4; subtract 4 from the original number, 15 - 4 == 11, and the 15th day becomes the 11th day. Likewise the 8th day becomes the 6th day: 8 - (8/7 * 2) == 6.
Weekends are not in attendance (e.g. days 6, 7, 13, 14):
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15
Applying the computation to all the weekday numbers will yield these values:
1 2 3 4 5
6 7 8 9 10
11
For holidays, you need to slot them into the attendance so the consecutiveness can easily be determined, then just remove them from the final query. The above attendance yields 11 consecutive good days.
Query logic's detailed explanation here: http://www.ienablemuch.com/2012/07/monitoring-perfect-attendance.html
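The two-days-back shift described above can be checked quickly in Python: with Monday-based day numbers, subtracting two days per completed week makes the weekday numbers consecutive, exactly as in the grids shown:

```python
def shift(n):
    """Collapse weekend gaps: day number n (1 = first Monday) minus
    two days for every completed week, as in the query's diff column."""
    return n - (n // 7) * 2

# Weekday numbers only (skipping the Sat/Sun slots 6, 7, 13, 14):
weekdays = [1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 15]
print([shift(n) for n in weekdays])  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```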