Querying across months and days - sql

My access logs database stores time as epoch and extracts year month and day as integers. Further, the partitioning of the database is based on the extracted Y/m/d and I have a 35 day retention.
If I run this query:
select *
from mydb
where year in (2017, 2018)
and month in (12, 1)
and day in (31, 1)
On the 29th of January, 2018, I will get data for 12/31/2017 and 1/1/2018.
On the 5th of January, 2018, I will get data for 12/1/2017, 12/31/2017, and 1/1/2018 (undesirable)
I also realize that I can do something like this:
select *
from mydb
where (year = 2017 and month = 12 and day = 31)
or (year = 2018 and month = 1 and day = 1)
But what I am really looking for is this: a good way to write a query where I give the year month and day number as the start and then a fourth value (number of days +) and then get all the data for 12/31/2017 + 5 days for example.
Is there a native way in SQL to accomplish this? I have an enormous data set and if I don't specify the days and have to rely on the epoch to do this, the query takes forever. I also have no influence over the partitioning configuration.

With Impala as the dbms and SQL dialect you will be able to use common table expressions but not recursion. In addition there may be problems inserting parameters as well.
Below is an untested suggestion that will require you to locate some function alternatives. First it generates a set of rows with an integer from 0 to 999 (in the example). It is quite easy to expand the number of rows if required. From those rows it is possible to add the number of days to a timestamp literal using date_add(timestamp startdate, int days/interval expression) and then with year(timestamp date) and month(timestamp date) and day(timestamp date) see Date and Time functions create the columns needed to match to your data.
Overall then you should be able to build a common table expression that has columns for year, month, day that cover a wanted range, and that you can inner join to your source table and thereby implementing a date range filter.
The code below was produced using T-SQL (SQL Server) and it can be tested here.
-- produce a set of integers, adjust to suit needed number of these
;WITH
cteDigits AS (
SELECT 0 AS digit UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL
SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
)
, cteTally AS (
SELECT
d1s.digit
+ d10s.digit * 10
+ d100s.digit * 100 /* add more like this as needed */
-- + d1000s.digit * 1000 /* add more like this as needed */
AS num
FROM cteDigits d1s
CROSS JOIN cteDigits d10s
CROSS JOIN cteDigits d100s /* add more like this as needed */
-- CROSS JOIN cteDigits d1000s /* add more like this as needed */
)
, DateRange AS (
select
num
, dateadd(day,num,'20181227') dt
, year(dateadd(day,num,'20181227')) yr
, month(dateadd(day,num,'20181227')) mn
, day(dateadd(day,num,'20181227')) dy
from cteTally
where num < 10
)
select
*
from DateRange
I think these are the Impala equivalents for the function calls used above:
, DateRange AS (
select
num
, date_add(to_timestamp('20181227','yyyyMMdd'),num) dt
, year( date_add(to_timestamp('20181227','yyyyMMdd'),num) ) yr
, month( date_add(to_timestamp('20181227','yyyyMMdd'),num) ) mn
, day( date_add(to_timestamp('20181227','yyyyMMdd'),num) ) dy
from cteTally
where num < 10
Hopefully you can work out how to use these. Ultimately the purpose is to use the generated date range like so:
select * from mydb t
inner join DateRange on t.year = DateRange.yr and t.month = DateRange.mn and t.day = DateRange.dy
original post
Well in the absence of knowing what database to propose solutions for, here is a suggestion using SQL Server:
This suggestion involves a recursive common table expression, which may then be used as an inner join to your source data to limit the results to a date range.
--Sql Server 2014 Express Edition
--https://rextester.com/l/sql_server_online_compiler
declare #yr as integer = 2018
declare #mn as integer = 12
declare #dy as integer = 27
declare #du as integer = 10
;with CTE as (
select
datefromparts(#yr, #mn, #dy) as dt
, #yr as yr
, #mn as mn
, #dy as dy
union all
select
dateadd(dd,1,cte.dt)
, datepart(year,dateadd(dd,1,cte.dt))
, datepart(month,dateadd(dd,1,cte.dt))
, datepart(day,dateadd(dd,1,cte.dt))
from cte
where cte.dt < dateadd(dd,#du-1,datefromparts(#yr, #mn, #dy))
)
select
*
from cte
This produces the following result:
+----+---------------------+------+----+----+
| | dt | yr | mn | dy |
+----+---------------------+------+----+----+
| 1 | 27.12.2018 00:00:00 | 2018 | 12 | 27 |
| 2 | 28.12.2018 00:00:00 | 2018 | 12 | 28 |
| 3 | 29.12.2018 00:00:00 | 2018 | 12 | 29 |
| 4 | 30.12.2018 00:00:00 | 2018 | 12 | 30 |
| 5 | 31.12.2018 00:00:00 | 2018 | 12 | 31 |
| 6 | 01.01.2019 00:00:00 | 2019 | 1 | 1 |
| 7 | 02.01.2019 00:00:00 | 2019 | 1 | 2 |
| 8 | 03.01.2019 00:00:00 | 2019 | 1 | 3 |
| 9 | 04.01.2019 00:00:00 | 2019 | 1 | 4 |
| 10 | 05.01.2019 00:00:00 | 2019 | 1 | 5 |
+----+---------------------+------+----+----+
and:
select * from mydb t
inner join cte on t.year = cte.yr and t.month = cte.mn and t.day = cte.dy
Instead of a recursive common table expression a table of integers may be used instead (or use a set unioned select queries to generate a set of integers) - often known as a tally table. The method one chooses will depend of dbms type and version being used.
Again, depending on database, it may be more efficient to persist the result seen above as a temporary table and add an index to that.

Related

What's the most efficient way to calculate a rolling aggregate in SQL?

I have a dataset that includes a bunch of clients and date ranges that they had a "stay." For example:
| ClientID | DateStart | DateEnd |
+----------+-----------+---------+
| 1 | Jan 1 | Jan 31 | (datediff = 30)
| 1 | Apr 4 | May 4 | (datediff = 30)
| 2 | Jan 3 | Feb 27 | (datediff = 55)
| 3 | Jan 1 | Jan 7 | (datediff = 6)
| 3 | Jan 10 | Jan 17 | (datediff = 6)
| 3 | Jan 20 | Jan 27 | (datediff = 6)
| 3 | Feb 1 | Feb 7 | (datediff = 6)
| 3 | Feb 10 | Feb 17 | (datediff = 6)
| 3 | Feb 20 | Feb 27 | (datediff = 6)
My ultimate goal is to be able to identify the dates on which a client passed a threshold of N nights in the past X time. Let's say 30 days in the last 90 days. I also need to know when they pass out of the threshold. Use case: hotel stays and a VIP status.
In the example above, Client 1 passed the threshold on Jan 31 (had 30 nights in past 90 days), and still kept meeting the threshold until April 2 (now only 29 nights in the past 90 days), but passed the threshold again on May 4.
Client 2 passed the threshold on Feb 3, and kept meeting the threshold until April 28th, at which point the earliest days are more than 90 days ago and they expire.
Client 3 passed the threshold on around Feb 17
So I would like to generate a table like this:
| ClientID | VIPStart | VIPEnd |
+----------+-----------+---------+
| 1 | Jan 31 | Apr 2 |
| 1 | May 4 | Jul 5 |
| 2 | Feb 3 | Apr 28 |
| 3 | Feb 17 | Apr 11 |
(Forgive me if the dates are slightly off, I'm doing this in my head)
Ideally I would like to generate a view, as I will need to reference it often.
What I want to know is what's the most efficient way to generate this? Assuming I have thousands of clients and hundreds of thousands of stays.
The way that I've been approaching this so far has been to use a SQL statement that includes a parameter: as of {?Date}, who had VIP status and who didn't. I do that by calculating DATEADD(day,-90,{?Date}), then excluding the records that are out of the range, then truncating the DateStarts that extend earlier and DateEnds that extend later, then calculating the DATEDIFF(day,DateStart,DateEnd) for the resulting stays using adjusted DateStart and DateEnd, then getting a SUM() of the resulting DATEDIFF() for each Client as of {?Date}. It works, but it's not pretty. And it gives me a point in time snapshot; I want the history.
it seems a little inefficient to generate a table of dates and then for every single date, use the above method.
Another option I considered was converting the raw data into an exploded table with each record corresponding to one night, then I can count it easier. Like this:
| ClientID | StayDate |
+----------+-----------+
| 1 | Jan 1 |
| 1 | Jan 2 |
| 1 | Jan 3 |
| 1 | Jan 4 |
etc.
Then I could just add a column counting the number of days in the past 90 days, and that'll get me most of the way there.
But I'm not sure how to do that in a view. I have a code snippet that does this:
WITH DaysTally AS (
SELECT MAX(DATEDIFF(day, DateStart, DateEnd)) - 1 AS Tally
FROM Stays
UNION ALL
SELECT Tally - 1 AS Expr1
FROM DaysTally AS DaysTally_1
WHERE (Tally - 1 >= 0))
SELECT t.ClientID,
DATEADD(day, c.Tally, t.DateStart) AS "StayDate"
FROM Stays AS t
INNER JOIN DaysTally AS c ON
DATEDIFF(day, t.DateStart, t.DateEnd) - 1 >= c.Tally
OPTION (MAXRECURSION 0)
But I can't get it to work without the MAXRECURSION and I don't think you can save a view with MAXRECURSION
And now I'm rambling. So the help that I'm looking for is: what is the most efficient method to pursue my goal? And if you have a code example, that would be helpful too! Thanks.
This is an interesting and pretty well-asked question. I would start by enumerating the days from the beginning of the first stay of each client until 90 days after the end of its last stay with a recursive cte. You can then bring the stay table with a left join, and use window functions to flag the "VIP" days (note that this assumes no overlaping stays for a given client, which is consistent with your sample data).
What follows is gaps-and-islands: you can use a window sum to put "adjacent" VIP days in groups, and then aggregate.
with cte as (
select clientID, min(dateStart) dt, dateadd(day, 90, max(dateEnd)) dateMax
from stays
group by clientID
union all
select clientID, dateadd(day, 1, dt), dateMax
from cte
where dt < dateMax
)
select clientID, min(dt) VIPStart, max(dt) VIPEnd
from (
select t.*, sum(isNotVip) over(partition by clientID order by dt) grp
from (
select
c.clientID,
c.dt,
case when count(s.clientID) over(
partition by c.clientID
order by c.dt
rows between 90 preceding and current row
) >= 30
then 0
else 1
end isNotVip
from cte c
left join stays s
on c.clientID = s.clientID and c.dt between s.dateStart and s.dateEnd
) t
) t
where isNotVip = 0
group by clientID, grp
order by clientID, VIPStart
option (maxrecursion 0)
This demo on DB Fiddle with your sample data produces:
clientID | VIPStart | VIPEnd
-------: | :--------- | :---------
1 | 2020-01-30 | 2020-04-01
1 | 2020-05-03 | 2020-07-04
2 | 2020-02-01 | 2020-04-28
3 | 2020-02-07 | 2020-04-20
You can put this in a view as follows:
the order by and option(maxrecursion) clauses must be omitted when creating the view
each and every query that has the view in its from clause must end with option(max recursion 0)
Demo
You can eliminate the recursion by creating a tally table in the view. The approach is then the following:
For each period, generate dates from 90 days before the period to 90 days after. These are all the "candidate days" that the period could affect.
For each row, add a flag as to whether it is in the period (as opposed to the 90 days before and after).
Aggregate by client id and date.
Use a running sum to get the days with 30+ in the previous 90 days.
Then filter for the ones with 30+ days and treat this as a gaps-and-islands problem.
Assuming that 1000 days is sufficient for the periods (including the 90 days before and after), then the query looks like this:
with n as (
select v.n
from (values (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)) v(n)
),
nums as (
select (n1.n * 100 + n2.n * 10 + n3.n) as n
from n n1 cross join n n2 cross join n n3
),
running90 as (
select clientid, dte, sum(in_period) over (partition by clientid order by dte rows between 89 preceding and current row) as running_90
from (select t.clientid, dateadd(day, n.n - 90, datestart) as dte,
max(case when dateadd(day, n.n - 90, datestart) >= datestart and dateadd(day, n.n - 90, datestart) <= t.dateend then 1 else 0 end) as in_period
from t join
nums n
on dateadd(day, n.n - 90, datestart) <= dateadd(day, 90, dateend)
group by t.clientid, dateadd(day, n.n - 90, datestart)
) t
)
select clientid, min(dte), max(dte)
from (select r.*,
row_number() over (partition by clientid order by dte) as seqnum
from running90 r
where running_90 >= 30
) r
group by clientid, dateadd(day, - seqnum, dte);
Having no recursive CTE (although one could be used for n), this is not subject to the maxrecursion issue.
Here is a db<>fiddle.
The results are slightly different from your results. This is probably due to some slight difference in the definitions. The above includes the end day as an "occupied" day. The 90 days is 89 days before plus the current day in the above query. The second-to-last query shows the 90 days running days, and that seems correct to me.

How to get correct year, month and day in firebird function datediff

I have to ask another question about datediff in Firebird. I don`t know how to get correct result in this case:
worker x has two contract of employment, first in the period 1988-09-15 to 2000-03-16, second from 2000-03-16 to 2005-02-28. The result that I want to get is like this 16 years, 5 months and 3 days, because the result of first is 11 years, 6 months and 1 day, and the second result is 4 years, 11 months and 2 days.
Has anyone can tell me how to do this in firebird. Most I would like to know how from sum of months (17 months) can I do 5 months, and other 12 months add to value of year. Now I have SQL like this:
SELECT
a.id_contact,
sum(floor(datediff(day, a.DATE_FROM, a.DATE_TO)/365.25)) as YEAR,
mod(sum(mod(floor(datediff(day, a.DATE_FROM, a.DATE_TO)/30.41),12)),12) as MTH
FROM KP a
group by a.id_contact
and then I get 5 months, but I don`t have 12 months add to value of year. Please help me...
You should sum days first then sum the result and then calculate Y, M, D
SELECT
KP3.id_contact
, (KP3.D2-KP3.D1) / (12*31) AS Y
, ((KP3.D2-KP3.D1) - ((KP3.D2-KP3.D1) / (12*31)) * 12 * 31) / 31 AS M
, CAST(MOD((KP3.D2-KP3.D1) - (((KP3.D2-KP3.D1) / (12*31)) * 12 * 31), 31) AS INTEGER) AS D
FROM
(SELECT
KP2.id_contact, SUM(KP2.D1) AS D1, SUM(KP2.D2) AS D2
FROM
(
SELECT
KP.id_contact, DATEDIFF(MONTH, KP.DATE_FROM, KP.DATE_TO) / 12 AS Y, CAST(MOD(DATEDIFF(MONTH, KP.DATE_FROM, KP.DATE_TO), 12) AS INTEGER) AS M
, EXTRACT(YEAR FROM KP.DATE_FROM)*12*31+EXTRACT(MONTH FROM KP.DATE_FROM)*31+EXTRACT(DAY FROM KP.DATE_FROM) D1
, EXTRACT(YEAR FROM KP.DATE_TO)*12*31+EXTRACT(MONTH FROM KP.DATE_TO)*31+EXTRACT(DAY FROM KP.DATE_TO) D2
FROM
KP
) AS KP2
GROUP BY KP2.id_contact
) AS KP3
The proper approach seems to measure the DAYS spent on both assignments, then sum those dates, then convert it into inherently non-precise give-or-take streeet-talk-like form of years-months-days. More on this below
Borrowing the conversion query from Livius, and adjusting coefficients to more realistic, that will develop as that:
https://dbfiddle.uk/?rdbms=firebird_3.0&fiddle=2fba0ace6a70ae16a167ec838642dc28
Here, step by step we move, building up from simple blocks into more and more complex ones, which finally gives us 16 years and 5 months and 2 days:
select rdb$get_context('SYSTEM', 'ENGINE_VERSION') as version from rdb$database;
| VERSION |
| :------ |
| 3.0.5 |
create table KP (
ID_CONTACT integer not null,
DATE_FROM date not null,
DATE_TO date not null
)
-- https://stackoverflow.com/questions/51551257/how-to-get-correct-year-month-and-day-in-firebird-function-datediff
✓
create index KP_workers on KP(id_contact)
✓
insert into KP values (1, '1988-09-15', '2000-03-16')
1 rows affected
insert into KP values (1, '2000-03-16', '2005-02-28')
1 rows affected
-- the sample data from https://stackoverflow.com/questions/60030543
-- might expose the rounding bug in my original formulae:
-- unexpected ROUNDING UP leading to NEGATIVE value for months
insert into KP values (2, '2018-02-08', '2019-12-01')
1 rows affected
insert into KP values (2, '2017-02-20', '2018-01-01')
1 rows affected
select a.*, datediff(day, a.DATE_FROM, a.DATE_TO) as DAYS_COUNT from KP a
ID_CONTACT | DATE_FROM | DATE_TO | DAYS_COUNT
---------: | :--------- | :--------- | :---------
1 | 1988-09-15 | 2000-03-16 | 4200
1 | 2000-03-16 | 2005-02-28 | 1810
2 | 2018-02-08 | 2019-12-01 | 661
2 | 2017-02-20 | 2018-01-01 | 315
-- Original answer by Livius
SELECT
KP3.id_contact
, KP3.D2-KP3.D1 as days_count
, (KP3.D2-KP3.D1) / (12*31) AS Y
, ((KP3.D2-KP3.D1) - ((KP3.D2-KP3.D1) / (12*31)) * 12 * 31) / 31 AS M
, CAST(MOD((KP3.D2-KP3.D1) - (((KP3.D2-KP3.D1) / (12*31)) * 12 * 31), 31) AS INTEGER) AS D
FROM
(SELECT
KP2.id_contact, SUM(KP2.D1) AS D1, SUM(KP2.D2) AS D2
FROM
(
SELECT
KP.id_contact, DATEDIFF(MONTH, KP.DATE_FROM, KP.DATE_TO) / 12 AS Y, CAST(MOD(DATEDIFF(MONTH, KP.DATE_FROM, KP.DATE_TO), 12) AS INTEGER) AS M
, EXTRACT(YEAR FROM KP.DATE_FROM)*12*31+EXTRACT(MONTH FROM KP.DATE_FROM)*31+EXTRACT(DAY FROM KP.DATE_FROM) D1
, EXTRACT(YEAR FROM KP.DATE_TO)*12*31+EXTRACT(MONTH FROM KP.DATE_TO)*31+EXTRACT(DAY FROM KP.DATE_TO) D2
FROM
KP
) AS KP2
GROUP BY KP2.id_contact
) AS KP3
ID_CONTACT | DAYS_COUNT | Y | M | D
---------: | :--------- | :- | :- | -:
1 | 6120 | 16 | 5 | 13
2 | 997 | 2 | 8 | 5
select ID_CONTACT, sum(DAYS_COUNT) as DAYS_COUNT
from (
select a.*, datediff(day, a.DATE_FROM, a.DATE_TO) as DAYS_COUNT from KP a
)
GROUP BY 1
ID_CONTACT | DAYS_COUNT
---------: | :---------
1 | 6010
2 | 976
-- this step taken from https://dbfiddle.uk/?rdbms=firebird_3.0&fiddle=52c1e130f589ca507c9ff185b5b2346d
-- based on original Livius forumla with non-exact integer coefficients
-- it seems not be generating negative counts, but still shows very different results
SELECT
KP_DAYS.id_contact,
KP_DAYS.DAYS_COUNT / (12*31) AS Y,
((KP_DAYS.DAYS_COUNT) - ((KP_DAYS.DAYS_COUNT) / (12*31)) * 12 * 31) / 31 AS M,
CAST(MOD((KP_DAYS.DAYS_COUNT) - (((KP_DAYS.DAYS_COUNT) / (12*31)) * 12 * 31), 31) AS INTEGER) AS D
FROM
(
select ID_CONTACT, sum(DAYS_COUNT) as DAYS_COUNT
from (
select a.*, datediff(day, a.DATE_FROM, a.DATE_TO) as DAYS_COUNT from KP a
)
GROUP BY 1
) as KP_DAYS
ID_CONTACT | Y | M | D
---------: | :- | :- | -:
1 | 16 | 1 | 27
2 | 2 | 7 | 15
SELECT
KP_DAYS.id_contact, KP_DAYS.days_count
, FLOOR(KP_DAYS.DAYS_COUNT / 365.25) AS Y
, FLOOR( (KP_DAYS.DAYS_COUNT - (FLOOR(KP_DAYS.DAYS_COUNT / 365.25) * 365.25) ) / 30.5) AS M
, CAST(MOD((KP_DAYS.DAYS_COUNT) - (((KP_DAYS.DAYS_COUNT) / 365.25) * 365.25), 30.5) AS INTEGER) AS D
FROM
(
select ID_CONTACT, sum(DAYS_COUNT) as DAYS_COUNT
from (
select a.*, datediff(day, a.DATE_FROM, a.DATE_TO) as DAYS_COUNT from KP a
)
GROUP BY 1
) as KP_DAYS
ID_CONTACT | DAYS_COUNT | Y | M | D
---------: | :--------- | :- | :- | -:
1 | 6010 | 16 | 5 | 2
2 | 976 | 2 | 8 | 1
Notice, the above is still not correct mathematically. But should give a "gut feeling" of the time stamp.
The question of getting EXACT AND PRECISE measure of timespan in Y-M-D form is moot.
For example, you quoted 3 days while this query gives 2 days. I see no error there. Because months and years are different from each other you just can not correctly measure time DISTANCE in months. That would be like measuring geographical distance in cities.
How many New Yorks lie between London and Paris? How many Warsaws high is Elbrus mountain? You can not have any mathematically correct answer.
Thus you can only answer with NON-PRECISE estimations. Suitable for give-or-take kind of street talk. So, any DateDiff-based query would essentially generate a perfectly valid answer of the kind of "2Y 10M give or take few days" - an answer that IS valid for this context of "just give me overall impression".
Marrying this simplicity of getting a feel of it with perfectionism of mathematical accuracy just is not possible. For example, imagine you get the span of about 6Y. Now how many leap years should you account for? In the "6Y" from 1999 to 2004 there were TWO leap years, but in the same "6Y" from 1998 to 2003 there only was ONE leap year. Which of those is correct measure for "6Y" ???
And then we have milleniums, where 2000 was leap year but 1900 was not. And same "sliding window" problem gives you volatile undefined number of leap years in timespans like "110Y". If you want to go towards layman's perception and count timespans in "years and months" - you have to agree this makes things easy, simple and imprecise by definition. And mismatch of one or few days over several years is norm, is OK

How to write a SQL statement to sum data using group by the same day of every two neighboring months

I have a data table like this:
datetime data
-----------------------
...
2017/8/24 6.0
2017/8/25 5.0
...
2017/9/24 6.0
2017/9/25 6.2
...
2017/10/24 8.1
2017/10/25 8.2
I want to write a SQL statement to sum the data using group by the 24th of every two neighboring months in certain range of time such as : from 2017/7/20 to 2017/10/25 as above.
How to write this SQL statement? I'm using SQL Server 2008 R2.
The expected results table is like this:
datetime_range data_sum
------------------------------------
...
2017/8/24~2017/9/24 100.9
2017/9/24~2017/10/24 120.2
...
One conceptual way to proceed here is to redefine a "month" as ending on the 24th of each normal month. Using the SQL Server month function, we will assign any date occurring after the 24th as belonging to the next month. Then we can aggregate by the year along with this shifted month to obtain the sum of data.
WITH cte AS (
SELECT
data,
YEAR(datetime) AS year,
CASE WHEN DAY(datetime) > 24
THEN MONTH(datetime) + 1 ELSE MONTH(datetime) END AS month
FROM yourTable
)
SELECT
CONVERT(varchar(4), year) + '/' + CONVERT(varchar(2), month) +
'/25~' +
CONVERT(varchar(4), year) + '/' + CONVERT(varchar(2), (month + 1)) +
'/24' AS datetime_range,
SUM(data) AS data_sum
FROM cte
GROUP BY
year, month;
Note that your suggested ranges seem to include the 24th on both ends, which does not make sense from an accounting point of view. I assume that the month includes and ends on the 24th (i.e. the 25th is the first day of the next accounting period.
Demo
I would suggest dynamically building some date range rows so that you can then join you data to those for aggregation, like this example:
+----+---------------------+---------------------+----------------+
| | period_start_dt | period_end_dt | your_data_here |
+----+---------------------+---------------------+----------------+
| 1 | 24.04.2017 00:00:00 | 24.05.2017 00:00:00 | 1 |
| 2 | 24.05.2017 00:00:00 | 24.06.2017 00:00:00 | 1 |
| 3 | 24.06.2017 00:00:00 | 24.07.2017 00:00:00 | 1 |
| 4 | 24.07.2017 00:00:00 | 24.08.2017 00:00:00 | 1 |
| 5 | 24.08.2017 00:00:00 | 24.09.2017 00:00:00 | 1 |
| 6 | 24.09.2017 00:00:00 | 24.10.2017 00:00:00 | 1 |
| 7 | 24.10.2017 00:00:00 | 24.11.2017 00:00:00 | 1 |
| 8 | 24.11.2017 00:00:00 | 24.12.2017 00:00:00 | 1 |
| 9 | 24.12.2017 00:00:00 | 24.01.2018 00:00:00 | 1 |
| 10 | 24.01.2018 00:00:00 | 24.02.2018 00:00:00 | 1 |
| 11 | 24.02.2018 00:00:00 | 24.03.2018 00:00:00 | 1 |
| 12 | 24.03.2018 00:00:00 | 24.04.2018 00:00:00 | 1 |
+----+---------------------+---------------------+----------------+
DEMO
declare #start_dt date;
set #start_dt = '20170424';
select
period_start_dt, period_end_dt, sum(1) as your_data_here
from (
select
dateadd(month,m.n,start_dt) period_start_dt
, dateadd(month,m.n+1,start_dt) period_end_dt
from (
select #start_dt start_dt ) seed
cross join (
select 0 n union all
select 1 union all
select 2 union all
select 3 union all
select 4 union all
select 5 union all
select 6 union all
select 7 union all
select 8 union all
select 9 union all
select 10 union all
select 11
) m
) r
-- LEFT JOIN YOUR DATA
-- ON yourdata.date >= r.period_start_dt and data.date < r.period_end_dt
group by
period_start_dt, period_end_dt
Please don't be tempted to use "between" when it comes to joining to your data. Follow the note above and use yourdata.date >= r.period_start_dt and data.date < r.period_end_dt otherwise you could double count information as between is inclusive of both lower and upper boundaries.
I think the simplest way is to subtract 25 days and aggregate by the month:
select year(dateadd(day, -25, datetime)) as yr,
month(dateadd(day, -25, datetime)) as mon,
sum(data)
from t
group by dateadd(day, -25, datetime);
You can format yr and mon to get the dates for the specific ranges, but this does the aggregation (and the yr/mon columns might be sufficient).
Step 0: Build a calendar table. Every database needs a calendar table eventually to simplify this sort of calculation.
In this table you may have columns such as:
Date (primary key)
Day
Month
Year
Quarter
Half-year (e.g. 1 or 2)
Day of year (1 to 366)
Day of week (numeric or text)
Is weekend (seems redundant now, but is a huge time saver later on)
Fiscal quarter/year (if your company's fiscal year doesn't start on Jan. 1)
Is Holiday
etc.
If your company starts its month on the 24th, then you can add a "Fiscal Month" column that represents that.
Step 1: Join on the calendar table
Step 2: Group by the columns in the calendar table.
Calendar tables sound weird at first, but once you realize that they are in fact tiny even if they span a couple hundred years they quickly become a major asset.
Don't try to cheap out on disk space by using computed columns. You want real columns because they are much faster and can be indexed if necessary. (Though honestly, usually just the PK index is enough for even wide calendar tables.)

Joining series of dates and counting continous days

Let's say I have a table as below
date add_days
2015-01-01 5
2015-01-04 2
2015-01-11 7
2015-01-20 10
2015-01-30 1
what I want to do is to check the days_balance, i.e. if date is greater or smaller than previous date + N days (add_days) and take the cumulated sum of days count if they are a continuous series.
So the algorithm should work like
for i in 2:N_rows {
days_balance[i] := date[i-1] + add_days[i-1] - date[i]
if days_balance[i] >= 0 then
date[i] := date[i] + days_balance[i]
}
The expected result should be as follows
date days_balance
2015-01-01 0
2015-01-04 2
2015-01-11 -3
2015-01-20 -2
2015-01-30 0
Is it possible in pure SQL? I imagine it should be with some conditional joins, but cannot see how it could be implemented.
I'm posting another answer since it may be nice to compare them since they use different methods (this one just does a n^2 style join, other one used a recursive CTE). This one takes advantage of the fact that you don't have to calculate the days_balance for each previous row before calculating it for a particular row, you just need to sum things from previous days....
drop table junk
create table junk(date DATETIME, add_days int)
insert into junk values
('2015-01-01',5 ),
('2015-01-04',2 ),
('2015-01-11',7 ),
('2015-01-20',10 ),
('2015-01-30',1 )
;WITH cte as
(
select ROW_NUMBER() OVER (ORDER BY date) i, date, add_days, ISNULL(DATEDIFF(DAY, LAG(date) OVER (ORDER BY date), date), 0) days_since_prev
FROM Junk
)
, combinedWithAllPreviousDaysCte as
(
select i [curr_i], date [curr_date], add_days [curr_add_days], days_since_prev [curr_days_since_prev], 0 [prev_add_days], 0 [prev_days_since_prev] from cte where i = 1 --get first row explicitly since it has no preceding rows
UNION ALL
select curr.i [curr_i], curr.date [curr_date], curr.add_days [curr_add_days], curr.days_since_prev [curr_days_since_prev], prev.add_days [prev_add_days], prev.days_since_prev [prev_days_since_prev]
from cte curr
join cte prev on curr.i > prev.i --join to all previous days
)
select curr_i, curr_date, SUM(prev_add_days) - curr_days_since_prev - SUM(prev_days_since_prev) [days_balance]
from combinedWithAllPreviousDaysCte
group by curr_i, curr_date, curr_days_since_prev
order by curr_i
outputs:
+--------+-------------------------+--------------+
| curr_i | curr_date | days_balance |
+--------+-------------------------+--------------+
| 1 | 2015-01-01 00:00:00.000 | 0 |
| 2 | 2015-01-04 00:00:00.000 | 2 |
| 3 | 2015-01-11 00:00:00.000 | -3 |
| 4 | 2015-01-20 00:00:00.000 | -5 |
| 5 | 2015-01-30 00:00:00.000 | -5 |
+--------+-------------------------+--------------+
Well, I think I have it with a recursive CTE (sorry, I only have Microsoft SQL Server available to me at the moment, so it may not comply with PostgreSQL).
Also I think the expected results you had were off (see comment above). If not, this can probably be modified to conform to your math.
drop table junk
create table junk(date DATETIME, add_days int)
insert into junk values
('2015-01-01',5 ),
('2015-01-04',2 ),
('2015-01-11',7 ),
('2015-01-20',10 ),
('2015-01-30',1 )
;WITH cte as
(
select ROW_NUMBER() OVER (ORDER BY date) i, date, add_days, ISNULL(DATEDIFF(DAY, LAG(date) OVER (ORDER BY date), date), 0) days_since_prev
FROM Junk
)
,recursiveCte (i, date, add_days, days_since_prev, days_balance, math) as
(
select top 1
i,
date,
add_days,
days_since_prev,
0 [days_balance],
CAST('no math for initial one, just has zero balance' as varchar(max)) [math]
from cte where i = 1
UNION ALL --recursive step now
select
curr.i,
curr.date,
curr.add_days,
curr.days_since_prev,
prev.days_balance - curr.days_since_prev + prev.add_days [days_balance],
CAST(prev.days_balance as varchar(max)) + ' - ' + CAST(curr.days_since_prev as varchar(max)) + ' + ' + CAST(prev.add_days as varchar(max)) [math]
from cte curr
JOIN recursiveCte prev ON curr.i = prev.i + 1
)
select i, DATEPART(day,date) [day], add_days, days_since_prev, days_balance, math
from recursiveCTE
order by date
And the results are like so:
+---+-----+----------+-----------------+--------------+------------------------------------------------+
| i | day | add_days | days_since_prev | days_balance | math |
+---+-----+----------+-----------------+--------------+------------------------------------------------+
| 1 | 1 | 5 | 0 | 0 | no math for initial one, just has zero balance |
| 2 | 4 | 2 | 3 | 2 | 0 - 3 + 5 |
| 3 | 11 | 7 | 7 | -3 | 2 - 7 + 2 |
| 4 | 20 | 10 | 9 | -5 | -3 - 9 + 7 |
| 5 | 30 | 1 | 10 | -5 | -5 - 10 + 10 |
+---+-----+----------+-----------------+--------------+------------------------------------------------+
I don’t quite get how your algorithm returns your expected results? But let me share a technique I came up with that might help.
This will only work if the end result of your data is to be exported to Excel, and even then it won’t work in all scenarios depending on what format you export your dataset in, but here it is....
If you’ll familiar with Excel Formulas, what I discovered is that if you write an Excel formula in your SQL as another field, it will execute that formula for you as soon as you export to excel (best method that works for me is just coping and pasting it into Excel, so that it doesn’t format it as text)
So for your example, here’s what you could do (noting again I don’t understand your algorithm, so this is probably wrong, but it’s just to give you the concept)
SELECT
date
, add_days
, '=INDEX($1:$65536,ROW()-1,COLUMN()-2)'
||'+INDEX($1:$65536,ROW()-1,COLUMN()-1)'
||'-INDEX($1:$65536,ROW(),COLUMN()-2)'
AS "days_balance[i]"
,'=IF(INDEX($1:$65536,ROW(),COLUMN()-1)>=0'
||',INDEX($1:$65536,ROW(),COLUMN()-3)'
||'+INDEX($1:$65536,ROW(),COLUMN()-1))'
AS "date[i]"
FROM
myTable
ORDER BY /*Ensure to order by whatever you need for your formula to work*/
The key part to making this work is using the INDEX formula function to select a cell based on the position of the current cell. So ROW()-1 tells it get me the result of the previous record, and COLUMN()-2 means take the value from two columns to the left of the current. Because you can't use cell references like A2+B2-A3 because the row numbers won't change on export, and it assumes the position of the columns.
I used SQL string concatenation with || just so it's easier to read on screen.
I tried this one in excel; it didn’t match your expected results. But if this technique works for you then just correct the excel formula to suit.

Most Efficient Way to Compute Running Value in SQL [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Calculate a Running Total in SqlServer
Consider this data
Day | OrderCount
1 3
2 2
3 11
4 3
5 6
How can i get this accumulation of OrderCount(running value) resultset using T-SQL query
Day | OrderCount | OrderCountRunningValue
1 3 3
2 2 5
3 11 16
4 3 19
5 6 25
I Can easily do this with looping in the actual query (using #table) or in my C# codebehind but its so slow (Considering that i also get the orders per day) when im processing thousand of records so i'm looking for better / more efficient approach hopefully without loops something like recursing CTE or something else.
Any idea would be greatly appreciated. TIA
As you seem to need these results in the client rather than for use within another SQL query, you are probably better off Not doing this in SQL.
(The linked question in my comment shows 'the best' option within SQL, if that is infact necessary.)
What may be recommended is to pull the Day and OrderCount values as one result set (SELECT day, orderCount FROM yourTable ORDER BY day) and then calculate the running total in your C#.
Your C# code will be able to iterate through the dataset efficiently, and will almost certainly outperform the SQL approaches. What this does do, is to transfer some load from the SQL Server to the web-server, but at an overall (and significant) resource saving.
SELECT t.Day,
t.OrderCount,
(SELECT SUM(t1.OrderCount) FROM table t1 WHERE t1.Day <= t.Day)
AS OrderCountRunningValue
FROM table t
SELECT
t.day,
t.orderCount,
SUM(t1.orderCount) orderCountRunningValue
FROM
table t INNER JOIN table t1 ON t1.day <= t.day
group by t.day,t.orderCount
CTE's to the rescue (again):
DROP TABLE tmp.sums;
CREATE TABLE tmp.sums
( id INTEGER NOT NULL
, zdate timestamp not null
, amount integer NOT NULL
);
INSERT INTO tmp.sums (id,zdate,amount) VALUES
(1, '2011-10-24', 1 ),(1, '2011-10-25', 2 ),(1, '2011-10-26', 3 )
,(2, '2011-10-24', 11 ),(2, '2011-10-25', 12 ),(2, '2011-10-26', 13 )
;
WITH RECURSIVE list AS (
-- Terminal part
SELECT t0.id, t0.zdate
, t0.amount AS amount
, t0.amount AS runsum
FROM tmp.sums t0
WHERE NOT EXISTS (
SELECT * FROM tmp.sums px
WHERE px.id = t0.id
AND px.zdate < t0.zdate
)
UNION
-- Recursive part
SELECT p1.id AS id
, p1.zdate AS zdate
, p1.amount AS amount
, p0.runsum + p1.amount AS runsum
FROM tmp.sums AS p1
, list AS p0
WHERE p1.id = p0.id
AND p0.zdate < p1.zdate
AND NOT EXISTS (
SELECT * FROM tmp.sums px
WHERE px.id = p1.id
AND px.zdate < p1.zdate
AND px.zdate > p0.zdate
)
)
SELECT * FROM list
ORDER BY id, zdate;
The output:
DROP TABLE
CREATE TABLE
INSERT 0 6
id | zdate | amount | runsum
----+---------------------+--------+--------
1 | 2011-10-24 00:00:00 | 1 | 1
1 | 2011-10-25 00:00:00 | 2 | 3
1 | 2011-10-26 00:00:00 | 3 | 6
2 | 2011-10-24 00:00:00 | 11 | 11
2 | 2011-10-25 00:00:00 | 12 | 23
2 | 2011-10-26 00:00:00 | 13 | 36
(6 rows)