Time dimension table in SQLPlus - sql

I'm doing a data warehouse project for college with Oracle Database (SQL*Plus).
I need to create the time dimension table and to populate it. The table needs to be like this:
It needs to go from 2004 to 2019.
I've tried different things and queries that I've found, but they don't work and, sadly, I don't know enough about SQL*Plus to create one on my own (or to successfully modify one). I'm completely lost.
Thank you very much for your help and patience.

Do not store all the columns; use virtual columns instead to calculate the derived data, otherwise you will find that your columns can become inconsistent:
CREATE TABLE table_name (
  id             NUMBER(10,0)
                 GENERATED ALWAYS AS IDENTITY
                 CONSTRAINT table_name__id__pk PRIMARY KEY,
  "DATE"         DATE
                 CONSTRAINT table_name__date__nn NOT NULL
                 CONSTRAINT table_name__date__u UNIQUE
                 CONSTRAINT table_name__date__chk CHECK ( "DATE" = TRUNC( "DATE" ) ),
  id_day_of_week NUMBER(1,0)
                 GENERATED ALWAYS AS ( "DATE" - TRUNC( "DATE", 'IW' ) + 1 ),
  day_of_week    VARCHAR2(9)
                 GENERATED ALWAYS AS ( CAST( TRIM( TO_CHAR( "DATE", 'DAY', 'NLS_DATE_LANGUAGE = AMERICAN' ) ) AS VARCHAR2(9) ) ),
  is_holiday     NUMBER(1,0)
                 CONSTRAINT table_name__id_holiday__chk CHECK ( is_holiday IN ( 0, 1 ) ),
  id_month       NUMBER(2,0)
                 GENERATED ALWAYS AS ( EXTRACT( MONTH FROM "DATE" ) ),
  month          VARCHAR2(9)
                 GENERATED ALWAYS AS ( CAST( TRIM( TO_CHAR( "DATE", 'MONTH', 'NLS_DATE_LANGUAGE = AMERICAN' ) ) AS VARCHAR2(9) ) ),
  id_year        NUMBER(5,0)
                 GENERATED ALWAYS AS ( EXTRACT( YEAR FROM "DATE" ) ),
  id_total       NUMBER(1,0)
                 GENERATED ALWAYS AS ( 1 ),
  total          CHAR(5)
                 GENERATED ALWAYS AS ( 'Total' )
);
Note:
You should not name the column DATE as it's a keyword; you will need to surround it in double quotes and use the same case every time you use it.
The id_day_of_week is based on the day of the ISO 8601 week because relying on TO_CHAR( "DATE", 'D' ) depends on the NLS_TERRITORY setting as to which day of the week is the first day; this way it is independent of any settings (see the short query after these notes).
The day_of_week and month columns have a fixed language.
It is unclear what id_total and total should contain so these are generated as literal values; if you want to have non-static data in these columns then remove the GENERATED ... part of the declaration.
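For example, here is a minimal sketch of the ISO day-of-week expression on its own (assuming the same Oracle environment):
-- TRUNC( d, 'IW' ) returns the Monday of d's ISO week, so d - TRUNC( d, 'IW' ) + 1
-- yields 1 for Monday through 7 for Sunday, independent of NLS_TERRITORY.
SELECT d,
       TRUNC( d, 'IW' )         AS iso_week_start,
       d - TRUNC( d, 'IW' ) + 1 AS id_day_of_week
FROM   ( SELECT DATE '2004-01-01' + LEVEL - 1 AS d
         FROM   DUAL
         CONNECT BY LEVEL <= 7 );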
Then you can populate it using:
INSERT INTO table_name ( "DATE", is_holiday )
SELECT DATE '2004-01-01' + LEVEL - 1, 0
FROM DUAL
CONNECT BY DATE '2004-01-01' + LEVEL - 1 < DATE '2020-01-01';
And update the holiday dates using an UPDATE statement according to your territory.
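For example (a hedged sketch; the two dates below are purely illustrative, substitute your own territory's holidays):
UPDATE table_name
SET    is_holiday = 1
WHERE  "DATE" IN ( DATE '2004-01-01', DATE '2004-12-25' );  -- illustrative dates only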
Then if you do:
SELECT *
FROM table_name
ORDER BY "DATE" ASC
FETCH FIRST 32 ROWS ONLY;
The output is:
ID | DATE | ID_DAY_OF_WEEK | DAY_OF_WEEK | IS_HOLIDAY | ID_MONTH | MONTH | ID_YEAR | ID_TOTAL | TOTAL
-: | :-------- | -------------: | :---------- | ---------: | -------: | :------- | ------: | -------: | :----
1 | 01-JAN-04 | 4 | THURSDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
2 | 02-JAN-04 | 5 | FRIDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
3 | 03-JAN-04 | 6 | SATURDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
4 | 04-JAN-04 | 7 | SUNDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
5 | 05-JAN-04 | 1 | MONDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
6 | 06-JAN-04 | 2 | TUESDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
7 | 07-JAN-04 | 3 | WEDNESDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
8 | 08-JAN-04 | 4 | THURSDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
9 | 09-JAN-04 | 5 | FRIDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
10 | 10-JAN-04 | 6 | SATURDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
11 | 11-JAN-04 | 7 | SUNDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
12 | 12-JAN-04 | 1 | MONDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
13 | 13-JAN-04 | 2 | TUESDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
14 | 14-JAN-04 | 3 | WEDNESDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
15 | 15-JAN-04 | 4 | THURSDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
16 | 16-JAN-04 | 5 | FRIDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
17 | 17-JAN-04 | 6 | SATURDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
18 | 18-JAN-04 | 7 | SUNDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
19 | 19-JAN-04 | 1 | MONDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
20 | 20-JAN-04 | 2 | TUESDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
21 | 21-JAN-04 | 3 | WEDNESDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
22 | 22-JAN-04 | 4 | THURSDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
23 | 23-JAN-04 | 5 | FRIDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
24 | 24-JAN-04 | 6 | SATURDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
25 | 25-JAN-04 | 7 | SUNDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
26 | 26-JAN-04 | 1 | MONDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
27 | 27-JAN-04 | 2 | TUESDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
28 | 28-JAN-04 | 3 | WEDNESDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
29 | 29-JAN-04 | 4 | THURSDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
30 | 30-JAN-04 | 5 | FRIDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
31 | 31-JAN-04 | 6 | SATURDAY | 0 | 1 | JANUARY | 2004 | 1 | Total
32 | 01-FEB-04 | 7 | SUNDAY | 0 | 2 | FEBRUARY | 2004 | 1 | Total
db<>fiddle here

create table date_dim (
  id           number(38),
  "DATE"       date,
  id_dayofweek number(38),
  dayofweek    varchar2(100),
  id_holiday   number(38),
  id_month     number(38),
  month        varchar2(100),
  id_year      number(38),
  id_total     number(38),
  total        varchar2(100)
);
Use the statement above to create the table (DATE is a reserved word in Oracle, so that column has to be double-quoted).
Regarding the data, you can generate it with a CONNECT BY clause:
insert into date_dim
select level as id,
       to_date('31-DEC-2003', 'DD-MON-YYYY') + level as date1,
       case ltrim(rtrim(to_char(to_date('31-DEC-2003', 'DD-MON-YYYY') + level, 'Day')))
            when 'Monday'    then 2
            when 'Tuesday'   then 3
            when 'Wednesday' then 4
            when 'Thursday'  then 5
            when 'Friday'    then 6
            when 'Saturday'  then 7
            when 'Sunday'    then 1
       end as id_dayofweek,
       to_char(to_date('31-DEC-2003', 'DD-MON-YYYY') + level, 'Day') as dayofweek,
       0 as id_holiday,
       to_char(to_date('31-DEC-2003', 'DD-MON-YYYY') + level, 'MM') as id_month,
       to_char(to_date('31-DEC-2003', 'DD-MON-YYYY') + level, 'Month') as month,
       to_char(to_date('31-DEC-2003', 'DD-MON-YYYY') + level, 'YYYY') as year,
       1 as id_total,
       'Total' as Total
  from dual
connect by level <= 5844;
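A quick sanity check after loading (a minimal sketch against the date_dim table as created above):
-- the CONNECT BY above produces every date from 2004-01-01 through 2019-12-31,
-- i.e. 5844 rows (16 years, 4 of them leap years)
select min("DATE") as first_date, max("DATE") as last_date, count(*) as days
from   date_dim;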

Related

SQL Find max no of consecutive months over a period of last 12 Months

I am trying to write a query in SQL where I need to find the maximum number of consecutive months over a period of the last 12 months, excluding June and July.
So, for example, I have an initial table as follows:
+---------+--------------+-----------+------------+
| id | Payment | amount | Date |
+---------+--------------+-----------+------------+
| 1 | CJ1 | 70000 | 11/3/2020 |
| 1 | 1B4 | 36314000 | 12/1/2020 |
| 1 | I21 | 119439000 | 1/12/2021 |
| 1 | 0QO | 9362100 | 2/2/2021 |
| 1 | 1G0 | 140431000 | 2/23/2021 |
| 1 | 1G | 9362100 | 3/2/2021 |
| 1 | g5d | 9362100 | 4/6/2021 |
| 1 | rt5s | 13182500 | 4/13/2021 |
| 1 | fgs5 | 48598 | 5/18/2021 |
| 1 | sd8 | 42155 | 5/25/2021 |
| 1 | wqe8 | 47822355 | 7/20/2021 |
| 1 | cbg8 | 4589721 | 7/27/2021 |
| 1 | jlk8 | 4589721 | 8/3/2021 |
| 1 | cxn9 | 4589721 | 10/5/2021 |
| 1 | qwe | 45897210 | 11/9/2021 |
| 1 | mmm | 45897210 | 12/16/2021 |
+---------+--------------+-----------+------------+
I have written the query below:
SELECT customer_number, year, month,
       month - lag(month) OVER (PARTITION BY customer_number ORDER BY year, month) as previous_month_indicator
FROM
(
    SELECT DISTINCT Month(date) as month, Year(date) as year, CUSTOMER_NUMBER
    FROM Table1
    WHERE Month(date) not in (6,7)
      and TO_DATE(date,'yyyy-MM-dd') >= DATE_SUB('2021-12-31', 425)
      and customer_number = 1
) As C
and I get this output
+-----------------+------+-------+--------------------------+
| customer_number | year | month | previous_month_indicator |
+-----------------+------+-------+--------------------------+
| 1 | 2020 | 11 | null |
| 1 | 2020 | 12 | 1 |
| 1 | 2021 | 1 | -11 |
| 1 | 2021 | 2 | 1 |
| 1 | 2021 | 3 | 1 |
| 1 | 2021 | 4 | 1 |
| 1 | 2021 | 5 | 1 |
| 1 | 2021 | 8 | 3 |
| 1 | 2021 | 10 | 2 |
| 1 | 2021 | 11 | 1 |
+-----------------+------+-------+--------------------------+
What I want is to get a view like this
Expected output
+-----------------+------+-------+--------------------------+
| customer_number | year | month | previous_month_indicator |
+-----------------+------+-------+--------------------------+
| 1 | 2020 | 11 | 1 |
| 1 | 2020 | 12 | 1 |
| 1 | 2021 | 1 | 1 |
| 1 | 2021 | 2 | 1 |
| 1 | 2021 | 3 | 1 |
| 1 | 2021 | 4 | 1 |
| 1 | 2021 | 5 | 1 |
| 1 | 2021 | 8 | 1 |
| 1 | 2021 | 9 | 0 |
| 1 | 2021 | 10 | 1 |
| 1 | 2021 | 11 | 1 |
+-----------------+------+-------+--------------------------+
As June/July does not matter, after May, August should be considered as the next consecutive month, and since there was no record in September it appears as 0 and breaks the consecutive-months chain.
My final desired output is to get the maximum number of consecutive months in which transactions were made, which in the above case is 8 (Nov-2020 to Aug-2021).
Final Desired Output:
+-----------------+-------------------------+
| customer_number | Max_consecutive_months |
+-----------------+-------------------------+
| 1 | 8 |
+-----------------+-------------------------+
CTEs can break this down a little more easily. In the code below, the payment_streak CTE is the key bit; the start_of_streak field first marks rows that count as the start of a streak, and then takes the maximum over all previous rows (to find the start of the current streak).
The last SELECT then just compares these two dates, computes how many months lie between them (excluding June/July), and finds the best streak per customer.
WITH payments_in_context AS (
    SELECT customer_number,
           date,
           lag(date) OVER (PARTITION BY customer_number ORDER BY date) AS prev_date
    FROM Table1
    WHERE EXTRACT(month FROM date) NOT IN (6,7)
),
payment_streak AS (
    SELECT customer_number,
           date,
           max(
               CASE WHEN (prev_date IS NULL)
                      OR (EXTRACT(month FROM date) <> 8
                          AND (date - prev_date >= 62
                               OR MOD(12 + EXTRACT(month FROM date) - EXTRACT(month FROM prev_date), 12) > 1))
                      OR (EXTRACT(month FROM date) = 8
                          AND (date - prev_date >= 123
                               OR EXTRACT(month FROM prev_date) NOT IN (5,8)))
                    THEN date END
           ) OVER (PARTITION BY customer_number ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS start_of_streak
    FROM payments_in_context
)
SELECT customer_number,
       max( 1
            + 10*(EXTRACT(year FROM date) - EXTRACT(year FROM start_of_streak))
            + (EXTRACT(month FROM date) - EXTRACT(month FROM start_of_streak))
            + CASE WHEN (EXTRACT(month FROM date) > 7 AND EXTRACT(month FROM start_of_streak) < 6)
                   THEN -2
                   WHEN (EXTRACT(month FROM date) < 6 AND EXTRACT(month FROM start_of_streak) > 7)
                   THEN 2
                   ELSE 0 END
       ) AS max_consecutive_months
FROM payment_streak
GROUP BY 1;
You can use a recursive CTE to generate all the months in the twelve-month timespan for each customer id, and then find the maximum number of consecutive months, excluding June and July, in that interval:
with recursive cte(id, m, c) as (
    select cust_id, min(date), 1 from payments group by cust_id
    union all
    select c.id, c.m + interval 1 month, c.c + 1 from cte c where c.c <= 12
),
dts(id, m, f) as (
    select c.id, c.m,
           c.c = 1 or exists (
               select 1 from payments p
               where p.cust_id = c.id
                 and extract(month from p.date) = extract(month from (c.m - interval 1 month))
                 and extract(year from p.date) = extract(year from (c.m - interval 1 month))
           )
    from cte c
    where extract(month from c.m) not in (6,7)
),
result(id, f, c) as (
    select d.id, d.f,
           (select sum(d.id = d1.id and d1.m < d.m and d1.f = 0) + 1 from dts d1)
    from dts d
    where d.f != 0
)
select r1.id, max(r1.s) - 1
from (select r.id, r.c, sum(r.f) s from result r group by r.id, r.c) r1
group by r1.id

Join two columns as a date in sql

I am currently working with a report through Microsoft Query and I ran into this problem where I need to calculate the total amount of money for the past year.
The table looks like this:
Item Number | Month | Year | Amount |
...........PAST YEARS DATA...........
12345 | 1 | 2019 | 10 |
12345 | 2 | 2019 | 20 |
12345 | 3 | 2019 | 15 |
12345 | 4 | 2019 | 12 |
12345 | 5 | 2019 | 11 |
12345 | 6 | 2019 | 12 |
12345 | 7 | 2019 | 12 |
12345 | 8 | 2019 | 10 |
12345 | 9 | 2019 | 10 |
12345 | 10 | 2019 | 10 |
12345 | 11 | 2019 | 10 |
12345 | 12 | 2019 | 10 |
12345 | 1 | 2020 | 10 |
12345 | 2 | 2020 | 10 |
How would you calculate the total amount from 02-2019 to 02-2020 for the item number 12345?
Assuming that you are running SQL Server, you can recreate a date with datefromparts() and use it for filtering:
select sum(amount)
from mytable
where
itemnumber = 12345
and datefromparts(year, month, 1) >= '20190201'
and datefromparts(year, month, 1) < '20200301'
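If datefromparts() is not available (Microsoft Query can sit on top of other backends), a hedged alternative sketch is to compare on a year*12 + month serial number, using the same assumed table and column names as above:
select sum(amount)
from mytable
where itemnumber = 12345
  -- 2019*12 + 2 through 2020*12 + 2 covers Feb 2019 to Feb 2020 inclusive
  and year * 12 + month between 2019 * 12 + 2 and 2020 * 12 + 2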
You can also use this (note the outer parentheses: AND binds tighter than OR, so the two date conditions must be grouped before combining with the item filter):
SELECT sum(amount) as Amount
FROM YEARDATA
WHERE ( ( Month >= 2 and year = '2019' )
     or ( Month <= 2 and year = '2020' ) )
  and ItemNumber = '12345'

Is there a function in Google big query to find the first date and last date of the ISO Week number / Week number of a calendar year?

Let us assume a calendar week.
The week number is 02 of 2020.
I am looking for ways to find the beginning and end dates of the week.
Any pointers to built in function or any other approaches will be helpful.
I don't see a direct way, but with the existing date functions it is easy to build a lookup table which you can query:
CREATE TABLE day_of_week_table AS
SELECT
date,
EXTRACT(ISOYEAR FROM date) AS isoyear,
EXTRACT(ISOWEEK FROM date) AS isoweek,
EXTRACT(WEEK FROM date) AS week,
EXTRACT(DAYOFWEEK FROM date) AS dayOfWeek
FROM UNNEST(GENERATE_DATE_ARRAY('2020-1-1', '2021-1-1')) AS date
ORDER BY date;
Here are the first few rows of this table:
| date | isoyear | isoweek | week | dayOfWeek |
+------------+---------+---------+------+-----------+
| 2020-01-01 | 2020 | 1 | 0 | 4 |
| 2020-01-02 | 2020 | 1 | 0 | 5 |
| 2020-01-03 | 2020 | 1 | 0 | 6 |
| 2020-01-04 | 2020 | 1 | 0 | 7 |
| 2020-01-05 | 2020 | 1 | 1 | 1 |
| 2020-01-06 | 2020 | 2 | 1 | 2 |
| 2020-01-07 | 2020 | 2 | 1 | 3 |
| 2020-01-08 | 2020 | 2 | 1 | 4 |
| 2020-01-09 | 2020 | 2 | 1 | 5 |
| 2020-01-10 | 2020 | 2 | 1 | 6 |
| 2020-01-11 | 2020 | 2 | 1 | 7 |
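Then, for example, the first and last dates of ISO week 2 of 2020 can be read straight off the lookup table (a minimal sketch against the table created above):
SELECT MIN(date) AS week_start,  -- 2020-01-06
       MAX(date) AS week_end     -- 2020-01-12
FROM day_of_week_table
WHERE isoyear = 2020
  AND isoweek = 2;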

Create DateFirst in Select

I have a SELECT statement that requires DATEFIRST = 1 (Monday).
I'm in the US, so the default is 7 (Sunday).
How can I embed the DATEFIRST behaviour in the SELECT statement itself so I can create it as a view?
SET DATEFIRST 1;
SELECT
T_APPLICANT.APPL_ID AS empID,
T_APPLICANT.APPL_LASTNAME,
T_APPLICANT.APPL_FIRSTNAME,
T_APPLICANT_ASSIGNMENT.ASS_STARTDATE,
DATEPART(ww, dbo.T_APPLICANT_ASSIGNMENT.ASS_STARTDATE) AS WeekNo,
DATEPART(WEEKDAY, dbo.T_APPLICANT_ASSIGNMENT.ASS_STARTDATE) AS WeekDay,
DATEPART(ww, GETDATE()) AS CurWeekNo,
(T_APPLICANT_ASSIGNMENT.ASS_HOURS) AS Total_Assigned_hrs,
(T_APPLICANT_ASSIGNMENT.ASS_BILL) AS AvgBill_Rate,
(T_APPLICANT_ASSIGNMENT.ASS_PAY) AS AvgPay_Rate,
(T_APPLICANT_ASSIGNMENT.ASS_HOURS * T_APPLICANT_ASSIGNMENT.ASS_PAY) AS Total_AmtPaid,
(T_APPLICANT_ASSIGNMENT.ASS_HOURS * T_APPLICANT_ASSIGNMENT.ASS_BILL) AS Total_AmtBilled,
(LTRIM(STR(DATEPART(yy, T_APPLICANT_ASSIGNMENT.ASS_STARTDATE))) + '-'
+ LTRIM(STR(DATEPART(M, T_APPLICANT_ASSIGNMENT.ASS_STARTDATE)))
) AS YearMo
FROM
T_APPLICANT
RIGHT OUTER JOIN
T_APPLICANT_ASSIGNMENT
ON T_APPLICANT.APPL_ID = T_APPLICANT_ASSIGNMENT.APPL_ID
WHERE
DATEPART(ww, dbo.T_APPLICANT_ASSIGNMENT.ASS_STARTDATE)
BETWEEN DATEPART(ww, GETDATE()) AND DATEPART(ww, GETDATE()) + 1
AND DATEPART(yy, T_APPLICANT_ASSIGNMENT.ASS_STARTDATE) = DATEPART(yy, GETDATE())
AND ASS_STATUS = 'A';
Unless proven otherwise, you can't set DATEFIRST in a view.
And neither in a user defined function.
So to have a view that returns week & weekday numbers as if DATEFIRST was set to 1?
That could use different calculations.
I haven't figured out yet how to calculate the week number, regardless of the DATEFIRST setting, as if the weeks started on Monday.
That's a tricky one.
I know, one could link to a Calendar table with the week numbers.
But that's not the goal here.
However, the WEEKDAY can also be calculated without using DATEPART.
For example by combining a CASE with a FORMAT.
Because the names of the weekdays remain the same, regardless of the DATEFIRST setting.
And an ISO_WEEK also starts on Monday.
So it can be used in the WHERE clause to filter on the current week & next week.
create table testdatefirst (
id int primary key not null identity(1,1),
dt date not null
)
GO
✓
with rcte as
(
select cast('2018-12-24' as date) dt
union all
select dateadd(day, 1, dt)
from rcte
where dt < cast('2019-03-01' as date)
)
insert into testdatefirst (dt)
select *
from rcte
order by dt
GO
68 rows affected
CREATE view vw_testdatefirst AS
select dt
, FORMAT(dt,'ddd','en-GB') as [dayname]
, DATEPART(WEEKDAY, dt) as [weekday]
, DATEPART(WEEK, dt) as [week]
-- , DATEPART(ISO_WEEK, dt) as [ISO_WEEK]
, case FORMAT(dt,'ddd','en-GB')
when 'Mon' then 1
when 'Tue' then 2
when 'Wed' then 3
when 'Thu' then 4
when 'Fri' then 5
when 'Sat' then 6
when 'Sun' then 7
end as [weekday2]
, (((DATEPART(WEEKDAY, dt) + @@DATEFIRST-2)%7)+1) AS [weekday3]
from testdatefirst
where DATEPART(ISO_WEEK, dt) between DATEPART(ISO_WEEK, '2019-01-01') and DATEPART(ISO_WEEK, '2019-01-01')+1
GO
✓
set datefirst 7;
GO
✓
select @@datefirst as [datefirst];
select * from vw_testdatefirst order by dt;
GO
| datefirst |
| :-------- |
| 7 |
dt | dayname | weekday | week | weekday2 | weekday3
:------------------ | :------ | ------: | ---: | -------: | -------:
31/12/2018 00:00:00 | Mon | 2 | 53 | 1 | 1
01/01/2019 00:00:00 | Tue | 3 | 1 | 2 | 2
02/01/2019 00:00:00 | Wed | 4 | 1 | 3 | 3
03/01/2019 00:00:00 | Thu | 5 | 1 | 4 | 4
04/01/2019 00:00:00 | Fri | 6 | 1 | 5 | 5
05/01/2019 00:00:00 | Sat | 7 | 1 | 6 | 6
06/01/2019 00:00:00 | Sun | 1 | 2 | 7 | 7
07/01/2019 00:00:00 | Mon | 2 | 2 | 1 | 1
08/01/2019 00:00:00 | Tue | 3 | 2 | 2 | 2
09/01/2019 00:00:00 | Wed | 4 | 2 | 3 | 3
10/01/2019 00:00:00 | Thu | 5 | 2 | 4 | 4
11/01/2019 00:00:00 | Fri | 6 | 2 | 5 | 5
12/01/2019 00:00:00 | Sat | 7 | 2 | 6 | 6
13/01/2019 00:00:00 | Sun | 1 | 3 | 7 | 7
set datefirst 1;
GO
✓
select @@datefirst as [datefirst];
select * from vw_testdatefirst order by dt;
GO
| datefirst |
| :-------- |
| 1 |
dt | dayname | weekday | week | weekday2 | weekday3
:------------------ | :------ | ------: | ---: | -------: | -------:
31/12/2018 00:00:00 | Mon | 1 | 53 | 1 | 1
01/01/2019 00:00:00 | Tue | 2 | 1 | 2 | 2
02/01/2019 00:00:00 | Wed | 3 | 1 | 3 | 3
03/01/2019 00:00:00 | Thu | 4 | 1 | 4 | 4
04/01/2019 00:00:00 | Fri | 5 | 1 | 5 | 5
05/01/2019 00:00:00 | Sat | 6 | 1 | 6 | 6
06/01/2019 00:00:00 | Sun | 7 | 1 | 7 | 7
07/01/2019 00:00:00 | Mon | 1 | 2 | 1 | 1
08/01/2019 00:00:00 | Tue | 2 | 2 | 2 | 2
09/01/2019 00:00:00 | Wed | 3 | 2 | 3 | 3
10/01/2019 00:00:00 | Thu | 4 | 2 | 4 | 4
11/01/2019 00:00:00 | Fri | 5 | 2 | 5 | 5
12/01/2019 00:00:00 | Sat | 6 | 2 | 6 | 6
13/01/2019 00:00:00 | Sun | 7 | 2 | 7 | 7
db<>fiddle here
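As a quick worked check of the weekday3 expression (a hedged sketch, using 2019-01-07, a Monday): with DATEFIRST = 7, Monday gives DATEPART(WEEKDAY) = 2, so ((2 + 7 - 2) % 7) + 1 = 1; with DATEFIRST = 1, Monday gives 1, so ((1 + 1 - 2) % 7) + 1 = 1. Either way the expression reports Monday as 1:
SET DATEFIRST 7;
SELECT (((DATEPART(WEEKDAY, '20190107') + @@DATEFIRST - 2) % 7) + 1) AS monday_weekday;  -- 1
SET DATEFIRST 1;
SELECT (((DATEPART(WEEKDAY, '20190107') + @@DATEFIRST - 2) % 7) + 1) AS monday_weekday;  -- 1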

SQL / Oracle Aggregation Buckets Between Dates

I have a SQL-related question I would love some help with, as a suitable answer has been eluding me for some time.
Background
I'm working with a vendor product which has an Oracle Database serving as the backend. I have the ability to write any ad hoc SQL to query the underlying tables, but I cannot make any changes to their underlying structure (or to the data model itself).
The table I'm interested in currently has about 1M+ rows and essentially tracks user sessions. It has 4 columns of interest: session_id (which is a primary key and unique per session), user_name, start_date (a date which tracks the beginning of the session), and stop_date (a date which tracks the end of the session).
My goal is to perform the aggregation of data for active sessions based on month, day, and hour given a set start date and end date. I need to create a view (or 3 separate views) which can either perform the aggregation itself or serve as the intermediate object from which I can then query and perform the aggregation. I understand the eventual SQL / view may actually need to be 3 different views (one for month, one for day, one for hour), but it seems to me that the concept (once achieved) should be the same regardless of the time period.
Current table example
Table Name = web_session
| Session_id | user_name | start_date | stop_date
----------------------------------------------------------------------------
| 1 | joe | 4/20/2017 10:42:10 PM | 4/21/2017 2:42:10 AM |
| 2 | matt | 4/20/2017 5:43:10 PM | 4/20/2017 5:59:10 PM |
| 3 | matt | 4/20/2017 3:42:10 PM | 4/20/2017 5:42:10 PM |
| 4 | joe | 4/20/2017 11:20:10 AM | 4/20/2017 4:42:10 PM |
| 5 | john | 4/20/2017 8:42:10 AM | 4/20/2017 11:42:10 AM |
| 6 | matt | 4/20/2017 7:42:10 AM | 4/20/2017 11:42:10 PM |
| 7 | joe | 4/19/2017 11:20:10 PM | 4/20/2017 1:42:10 AM |
Ideal Output For Hour View
- 12:00 AM (midnight) can be either 0 or 24 for the example
| Date | HR | active_sessions | distinct_users |
------------------------------------------------------------
| 4/21/2017 | 2 | 1 | 1 |
| 4/21/2017 | 1 | 1 | 1 |
| 4/20/2017 | 0 | 1 | 1 |
| 4/20/2017 | 23 | 1 | 1 |
| 4/20/2017 | 22 | 1 | 1 |
| 4/20/2017 | 17 | 2 | 1 |
| 4/20/2017 | 16 | 2 | 2 |
| 4/20/2017 | 15 | 2 | 2 |
| 4/20/2017 | 14 | 1 | 1 |
| 4/20/2017 | 13 | 1 | 1 |
| 4/20/2017 | 12 | 1 | 1 |
| 4/20/2017 | 11 | 3 | 3 |
| 4/20/2017 | 10 | 2 | 2 |
| 4/20/2017 | 9 | 2 | 2 |
| 4/20/2017 | 8 | 2 | 2 |
| 4/20/2017 | 7 | 1 | 1 |
| 4/20/2017 | 1 | 1 | 1 |
| 4/20/2017 | 0 | 1 | 1 |
| 4/19/2017 | 23 | 1 | 1 |
End Goal and Other Options
What I am eventually trying to achieve with this output is to populate a line chart which displays the number of active sessions for either a month, day, or hour (used in the example output) between two dates. In the hour example, the date in combination with the HR would be used along the X-axis and the active sessions would be used along the Y-axis. The distinct user count would be available if a user hovered over the point on the chart. FYI Active sessions are the total number of sessions that were open at any point during the interval. Distinct users are the total number of distinct users during the interval. If I logged on and off twice in the same hour, it would be 2 active sessions, but only 1 distinct user.
Alternative Solutions
This seems to be a problem which must have come up may times before, but from all of my googling and stack overflow research I cannot seem to find the correct approach. If I am thinking about the query or ideal output incorrectly I AM OPEN TO ALTERNATE SUGGESTIONS which allow me to get the desired output to populate the chart appropriately on the front end.
Some SQL I Have Tried (Good Faith Effort)
There are many queries I've tried, but I'll start with this one as it is the closest I got; however, it is extremely slow (unusably so) and it still does not produce the result I need.
Select * FROM (
SELECT
u.YearDt, u.MonthDt, u.DayDt, u.HourDt, u.MinDt,
COUNT(Distinct u.session_id) as unique_sessions,
COUNT(Distinct u.user_name) as unique_users,
LISTAGG(u.user_name, ', ') WITHIN GROUP (ORDER BY u.user_name ASC) as users
FROM
(SELECT EXTRACT(year FROM l.start_date) as YearDt,
EXTRACT(month FROM l.start_date) as MonthDt,
EXTRACT(day FROM l.start_date) as DayDt,
EXTRACT(HOUR FROM CAST(l.start_date AS TIMESTAMP)) as HourDt,
EXTRACT(MINUTE FROM CAST(l.start_date AS TIMESTAMP)) as MinDt,
l.session_id,
l.user_name,
l.start_date as act_date,
1 as is_start
FROM web_session l
UNION ALL
SELECT EXTRACT(year FROM l.stop_date) as YearDt,
EXTRACT(month FROM l.stop_date) as MonthDt,
EXTRACT(day FROM l.stop_date) as DayDt,
EXTRACT(HOUR FROM CAST(l.stop_date AS TIMESTAMP)) as HourDt,
EXTRACT(MINUTE FROM CAST(l.stop_date AS TIMESTAMP)) as MinDt,
l.session_id,
l.user_name,
l.stop_date as act_date,
0 as is_start
FROM web_session l
) u
GROUP BY CUBE ( u.YearDt, u.MonthDt, u.DayDt, u.HourDt, u.MinDt)
) c
You can use a CTE (Query 1) or a correlated hierarchical query (Query 2) to generate the hours within the time ranges and then aggregate. This only requires a single table scan:
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE Web_Session ( Session_id, user_name, start_date, stop_date ) AS
SELECT 1, 'joe', CAST( TIMESTAMP '2017-04-20 22:42:10' AS DATE ), CAST( TIMESTAMP '2017-04-21 02:42:10' AS DATE ) FROM DUAL UNION ALL
SELECT 2, 'matt', TIMESTAMP '2017-04-20 17:43:10', TIMESTAMP '2017-04-20 17:59:10' FROM DUAL UNION ALL
SELECT 3, 'matt', TIMESTAMP '2017-04-20 15:42:10', TIMESTAMP '2017-04-20 17:42:10' FROM DUAL UNION ALL
SELECT 4, 'joe', TIMESTAMP '2017-04-20 11:20:10', TIMESTAMP '2017-04-20 16:42:10' FROM DUAL UNION ALL
SELECT 5, 'john', TIMESTAMP '2017-04-20 08:42:10', TIMESTAMP '2017-04-20 11:42:10' FROM DUAL UNION ALL
SELECT 6, 'matt', TIMESTAMP '2017-04-20 07:42:10', TIMESTAMP '2017-04-20 23:42:10' FROM DUAL UNION ALL
SELECT 7, 'joe', TIMESTAMP '2017-04-19 23:20:10', TIMESTAMP '2017-04-20 01:42:10' FROM DUAL;
Query 1:
WITH hours ( session_id, user_name, hour, duration ) AS (
SELECT session_id,
user_name,
CAST( TRUNC( start_date, 'HH24' ) AS DATE ),
( TRUNC( stop_date, 'HH24' ) - TRUNC( start_date, 'HH24' ) ) * 24
FROM web_session
UNION ALL
SELECT session_id,
user_name,
hour + INTERVAL '1' HOUR, -- There is a bug in SQLFiddle that subtracts
-- hours instead of adding so -1 is used there.
duration - 1
FROM hours
WHERE duration > 0
)
SELECT hour,
COUNT( session_id ) AS active_sessions,
COUNT( DISTINCT user_name ) AS distinct_users
FROM hours
GROUP BY hour
ORDER BY hour
Results:
| HOUR | ACTIVE_SESSIONS | DISTINCT_USERS |
|----------------------|-----------------|----------------|
| 2017-04-19T23:00:00Z | 1 | 1 |
| 2017-04-20T00:00:00Z | 1 | 1 |
| 2017-04-20T01:00:00Z | 1 | 1 |
| 2017-04-20T07:00:00Z | 1 | 1 |
| 2017-04-20T08:00:00Z | 2 | 2 |
| 2017-04-20T09:00:00Z | 2 | 2 |
| 2017-04-20T10:00:00Z | 2 | 2 |
| 2017-04-20T11:00:00Z | 3 | 3 |
| 2017-04-20T12:00:00Z | 2 | 2 |
| 2017-04-20T13:00:00Z | 2 | 2 |
| 2017-04-20T14:00:00Z | 2 | 2 |
| 2017-04-20T15:00:00Z | 3 | 2 |
| 2017-04-20T16:00:00Z | 3 | 2 |
| 2017-04-20T17:00:00Z | 3 | 1 |
| 2017-04-20T18:00:00Z | 1 | 1 |
| 2017-04-20T19:00:00Z | 1 | 1 |
| 2017-04-20T20:00:00Z | 1 | 1 |
| 2017-04-20T21:00:00Z | 1 | 1 |
| 2017-04-20T22:00:00Z | 2 | 2 |
| 2017-04-20T23:00:00Z | 2 | 2 |
| 2017-04-21T00:00:00Z | 1 | 1 |
| 2017-04-21T01:00:00Z | 1 | 1 |
| 2017-04-21T02:00:00Z | 1 | 1 |
Execution Plan:
-------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
-------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 14 | 364 | 7 | 00:00:01 |
| 1 | SORT GROUP BY | | 14 | 364 | 7 | 00:00:01 |
| 2 | VIEW | VW_DAG_0 | 14 | 364 | 7 | 00:00:01 |
| 3 | HASH GROUP BY | | 14 | 364 | 7 | 00:00:01 |
| 4 | VIEW | | 14 | 364 | 6 | 00:00:01 |
| 5 | UNION ALL (RECURSIVE WITH) BREADTH FIRST | | | | | |
| 6 | TABLE ACCESS FULL | WEB_SESSION | 7 | 245 | 3 | 00:00:01 |
| * 7 | RECURSIVE WITH PUMP | | | | | |
-------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 7 - filter("DURATION">0)
Note
-----
- dynamic sampling used for this statement
Query 2:
SELECT t.COLUMN_VALUE AS hour,
COUNT( session_id ) AS active_sessions,
COUNT( DISTINCT user_name ) AS distinct_users
FROM web_session w
CROSS JOIN
TABLE(
CAST(
MULTISET(
SELECT TRUNC( w.start_date, 'HH24' ) + ( LEVEL - 1 ) / 24
FROM DUAL
CONNECT BY TRUNC( w.start_date, 'HH24' ) + ( LEVEL - 1 ) / 24 < w.stop_date
) AS SYS.ODCIDATELIST
)
) t
GROUP BY t.COLUMN_VALUE
ORDER BY hour
Results:
| HOUR | ACTIVE_SESSIONS | DISTINCT_USERS |
|----------------------|-----------------|----------------|
| 2017-04-19T23:00:00Z | 1 | 1 |
| 2017-04-20T00:00:00Z | 1 | 1 |
| 2017-04-20T01:00:00Z | 1 | 1 |
| 2017-04-20T07:00:00Z | 1 | 1 |
| 2017-04-20T08:00:00Z | 2 | 2 |
| 2017-04-20T09:00:00Z | 2 | 2 |
| 2017-04-20T10:00:00Z | 2 | 2 |
| 2017-04-20T11:00:00Z | 3 | 3 |
| 2017-04-20T12:00:00Z | 2 | 2 |
| 2017-04-20T13:00:00Z | 2 | 2 |
| 2017-04-20T14:00:00Z | 2 | 2 |
| 2017-04-20T15:00:00Z | 3 | 2 |
| 2017-04-20T16:00:00Z | 3 | 2 |
| 2017-04-20T17:00:00Z | 3 | 1 |
| 2017-04-20T18:00:00Z | 1 | 1 |
| 2017-04-20T19:00:00Z | 1 | 1 |
| 2017-04-20T20:00:00Z | 1 | 1 |
| 2017-04-20T21:00:00Z | 1 | 1 |
| 2017-04-20T22:00:00Z | 2 | 2 |
| 2017-04-20T23:00:00Z | 2 | 2 |
| 2017-04-21T00:00:00Z | 1 | 1 |
| 2017-04-21T01:00:00Z | 1 | 1 |
| 2017-04-21T02:00:00Z | 1 | 1 |
Execution Plan:
--------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
--------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 57176 | 2115512 | 200 | 00:00:03 |
| 1 | SORT GROUP BY | | 57176 | 2115512 | 200 | 00:00:03 |
| 2 | NESTED LOOPS | | 57176 | 2115512 | 195 | 00:00:03 |
| 3 | TABLE ACCESS FULL | WEB_SESSION | 7 | 245 | 3 | 00:00:01 |
| 4 | COLLECTION ITERATOR SUBQUERY FETCH | | 8168 | 16336 | 27 | 00:00:01 |
| * 5 | CONNECT BY WITHOUT FILTERING | | | | | |
| 6 | FAST DUAL | | 1 | | 2 | 00:00:01 |
--------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 5 - filter(TRUNC(:B1,'fmhh24')+(LEVEL-1)/24<:B2)
Note
-----
- dynamic sampling used for this statement
I think something like this will work:
WITH ct ( active_dt ) AS (
  -- Build the query for the "table" of hours
  SELECT DATE '2017-04-19' + (LEVEL-1)/24 AS active_dt FROM dual
  CONNECT BY DATE '2017-04-19' + (LEVEL-1)/24 < DATE '2017-04-22'
)
SELECT active_dt AS "Date", active_hr AS "HR"
     , COUNT(session_id) AS active_sessions
     , COUNT(DISTINCT user_name) AS distinct_users
FROM (
  SELECT TRUNC(ct.active_dt) AS active_dt
       , TO_CHAR(ct.active_dt, 'HH24') AS active_hr
       , ws.session_id, ws.user_name
  FROM ct LEFT JOIN web_session ws
    ON ct.active_dt + 1/24 >= ws.start_date
   AND ct.active_dt < ws.stop_date
) GROUP BY active_dt, active_hr
ORDER BY active_dt DESC, active_hr DESC;
I may not have the conditions for the LEFT JOIN 100% correct.
Hope this helps.
Matt,
What you need to do is generate a time dimension either as a static table or dynamically at run time:
create table time_dim (
ts date primary key,
year number not null,
month number not null,
day number not null,
wday number not null,
dy varchar2(3) not null,
hr number not null
);
insert into time_dim (ts, year, month, day, wday, dy, hr)
select ts
, extract(year from ts) year
, extract(month from ts) month
, extract(day from ts) day
, to_char(ts,'d') wday
, to_char(ts,'dy') dy
, to_number(to_char(ts,'HH24')) hr
from (
select DATE '2017-01-01' + (level - 1)/24 ts
FROM DUAL connect by level <= 365*24) a;
Then outer join that to your web_sessions table:
select t.ts, t.year, t.month, t.wday, t.dy, t.hr
, count(session_id) sessions
, count(distinct user_name) users
from time_dim t
left join web_session w
on t.ts between trunc(w.start_date, 'hh24') and w.stop_date
where trunc(t.ts) between date '2017-04-19' and date '2017-04-21'
group by rollup (t.year, t.month, (t.wday, t.dy), (t.hr, t.ts));
You can change up the group by clause to get the various aggregates you're interested in.
In the above code, I'm truncating the start_date to the hour in the ON clause so that the start hour will be included in the results; otherwise, sessions that don't start exactly at the top of the hour would not get counted in that hour.
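For example, a per-day rollup is just a different GROUP BY over the same join (a minimal sketch against the time_dim and web_session tables above):
select trunc(t.ts) as day
     -- distinct, because a session active in several hours of the same day
     -- joins to several time_dim rows
     , count(distinct w.session_id) as sessions
     , count(distinct w.user_name) as users
  from time_dim t
  left join web_session w
    on t.ts between trunc(w.start_date, 'hh24') and w.stop_date
 where trunc(t.ts) between date '2017-04-19' and date '2017-04-21'
 group by trunc(t.ts)
 order by day;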