In SQL, calculating date parts versus date lookup table in group queries

Many queries group by week, month or quarter when the base table column is either a date or a timestamp.
In general, in GROUP BY queries, does it matter whether one uses
- functions on the date
- a day table that has the extraction pre-calculated
Note: this is similar to the question "DATE lookup table (1990/01/01:2041/12/31)".
For example, in PostgreSQL:
create table sale(
tran_id serial primary key,
tran_dt date not null default current_date,
sale_amt decimal(8,2) not null,
...
);
create table days(
day date primary key,
week date not null,
month date not null,
quarter date not null
);
-- week query 1: group using funcs
select
date_trunc('week',tran_dt)::date - 1 as week,
count(1) as sale_ct,
sum(sale_amt) as sale_amt
from sale
where date_trunc('week',tran_dt)::date - 1 between '2012-1-1' and '2011-12-31'
group by date_trunc('week',tran_dt)::date - 1
order by 1;
-- query 2: group using days
select
days.week,
count(1) as sale_ct,
sum(sale_amt) as sale_amt
from sale
join days on( days.day = sale.tran_dt )
where week between '2011-1-1'::date and '2011-12-31'::date
group by week
order by week;
To me, the date_trunc() function seems more organic, whereas the days table is easier to use.
Is there anything here more than a matter of taste?

-- query 3: group using instant "immediate" calendar table
WITH calender AS (
SELECT ser::date AS dd
, date_trunc('week', ser)::date AS wk
-- , date_trunc('month', ser)::date AS mon
-- , date_trunc('quarter', ser)::date AS qq
FROM generate_series( '2012-1-1' , '2012-12-31', '1 day'::interval) ser
)
SELECT
cal.wk
, count(1) as sale_ct
, sum(sa.sale_amt) as sale_amt
FROM sale sa
JOIN calender cal ON cal.dd = sa.tran_dt
-- WHERE week between '2012-1-1' and '2011-12-31'
GROUP BY cal.wk
ORDER BY cal.wk
;
Note: I fixed an apparent typo in the BETWEEN range.
UPDATE: I used Erwin's recursive CTE to squeeze out the duplicated date_trunc(). Nested CTE galore:
WITH calendar AS (
WITH RECURSIVE montag AS (
SELECT '2011-01-01'::date AS dd
UNION ALL
SELECT dd + 1 AS dd
FROM montag
WHERE dd < '2012-1-1'::date
)
SELECT mo.dd, date_trunc('week', mo.dd + 1)::date AS wk
FROM montag mo
)
SELECT
cal.wk
, count(1) as sale_ct
, sum(sa.sale_amt) as sale_amt
FROM sale sa
JOIN calendar cal ON cal.dd = sa.tran_dt
-- WHERE week between '2012-1-1' and '2011-12-31'
GROUP BY cal.wk
ORDER BY cal.wk
;

Yes, it is more than a matter of taste. The performance of the query depends on the method.
As a first approximation, the functions should be faster. They don't require joins, doing the read in a single table scan.
However, a good optimizer could make effective use of a lookup table. It would know the distribution of the target values. And an in-memory join could be quite fast.
As a database design, I think having a calendar table is very useful. Some information such as holidays just isn't going to work as a function. However, for most ad hoc queries the date functions are fine.

1. Your expression:
... between '2012-1-1' and '2011-12-31'
doesn't work. Basic BETWEEN requires the left argument to be less than or equal to the right argument. Would have to be:
... BETWEEN SYMMETRIC '2012-1-1' and '2011-12-31'
Or it's just a typo and you mean something like:
... BETWEEN '2011-1-1' and '2011-12-31'
It's unclear to me, what your queries are supposed to retrieve. I'll assume you want all weeks (Monday to Sunday) that start in 2011 for the rest of this answer. This expression generates exactly that in less than a microsecond on modern hardware (works for any year):
SELECT generate_series(
date_trunc('week','2010-12-31'::date) + interval '7d'
,date_trunc('week','2011-12-31'::date) + interval '6d'
, '1d')::date
*Note that the ISO 8601 definition of the "first week of a year" is slightly different.
2. Your second query does not work at all. No GROUP BY?
3. The question you link to did not deal with PostgreSQL, which has outstanding date / timestamp support. And it has generate_series() which can obviate the need for a separate "days" table in most cases - as demonstrated above. Your query would look like this:
In the meantime @wildplasser provided an example query that was supposed to go here.
By popular* demand, a recursive CTE version - which is actually not that far from being a serious alternative!
* and by "popular" I mean @wildplasser's very serious request.
WITH RECURSIVE days AS (
SELECT '2011-01-01'::date AS dd
,date_trunc('week', '2011-01-01'::date )::date AS wk
UNION ALL
SELECT dd + 1
,date_trunc('week', dd + 1)::date AS wk
FROM days
WHERE dd < '2011-12-31'::date
)
SELECT d.wk
,count(*) AS sale_ct
,sum(s.sale_amt) AS sale_amt
FROM days d
JOIN sale s ON s.tran_dt = d.dd
-- WHERE d.wk between '2011-01-01' and '2011-12-31'
GROUP BY 1
ORDER BY 1;
Could also be written as (compare to @wildplasser's version):
WITH RECURSIVE d AS (
SELECT '2011-01-01'::date AS dd
UNION ALL
SELECT dd + 1 FROM d WHERE dd < '2011-12-31'::date
), days AS (
SELECT dd, date_trunc('week', dd + 1)::date AS wk
FROM d
)
SELECT ...
4. If performance is of the essence, just make sure that you do not apply functions or calculations to the values of your table. This prohibits the use of plain indexes and is generally very slow, because every row has to be processed. That's why your first query is going to suck with a big table. Whenever possible, apply calculations to the values you filter with, instead.
Indexes on expressions are one way around this. If you had an index like
CREATE INDEX sale_tran_dt_week_idx ON sale (date_trunc('week', tran_dt)::date);
.. your first query could be very fast again - at some cost for write operations for index maintenance.
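The advice in point 4 can be illustrated outside the database. The following is a minimal Python sketch, not part of the original answer; the helper names week_start and week_filter_bounds are made up for illustration. It computes a plain date range equivalent to "weeks (Monday to Sunday) that start in 2011", so the range can be compared against the bare tran_dt column instead of wrapping the column in date_trunc().

```python
from datetime import date, timedelta


def week_start(d):
    """Monday of the week containing d, mirroring date_trunc('week', d)::date."""
    return d - timedelta(days=d.weekday())


def week_filter_bounds(first, last):
    """Half-open [lo, hi) range of days whose week start lies in [first, last].

    Hypothetical helper: the result is meant to be passed as parameters to a
    filter on the raw column, e.g. tran_dt >= lo AND tran_dt < hi.
    """
    # First Monday on or after `first`.
    lo = week_start(first + timedelta(days=6))
    # Day after the Sunday that ends the last week starting on or before `last`.
    hi = week_start(last) + timedelta(days=7)
    return lo, hi


lo, hi = week_filter_bounds(date(2011, 1, 1), date(2011, 12, 31))
print(lo, hi)
```

With these bounds the filter becomes tran_dt >= '2011-01-03' AND tran_dt < '2012-01-02', which a plain index on tran_dt can serve.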

Related

PL-SQL query to calculate customers per period from start and stop dates

I have a PL-SQL table with a structure as shown in the example below:
I have customers (customer_number) with insurance cover start and stop dates (cover_start_date and cover_stop_date). I also have dates of accidents for those customers (accident_date). These customers may have more than one row in the table if they have had more than one accident. They may also have no accidents. And they may also have a blank entry for the cover stop date if their cover is ongoing. Sorry I did not design the data format, but I am stuck with it.
I am looking to calculate the number of accidents (num_accidents) and number of customers (num_customers) in a given time period (period_start), and from that the number of accidents-per-customer (which will be easy once I've got those two pieces of information).
Any ideas on how to design a PL-SQL function to do this in a simple way? Ideally with the time periods not being fixed to monthly (for example, weekly or fortnightly too)? Ideally I will end up with a table like this shown below:
Many thanks for any pointers...
You seem to need a list of dates. You can generate one in the query and then use correlated subqueries to calculate the columns you want:
select d.*,
-- customers whose cover spans the whole period (open-ended cover has a null stop date)
(select count(distinct customer_number)
from t
where t.cover_start_date <= d.dte and
(t.cover_stop_date > d.dte + interval '1' month or t.cover_stop_date is null)
) as num_customers,
-- accidents that happened within the period
(select count(*)
from t
where t.accident_date >= d.dte and
t.accident_date < d.dte + interval '1' month
) as accidents,
-- customers with at least one accident within the period
(select count(distinct customer_number)
from t
where t.accident_date >= d.dte and
t.accident_date < d.dte + interval '1' month
) as num_customers_with_accident
from (select date '2020-01-01' as dte from dual union all
select date '2020-02-01' as dte from dual union all
. . .
) d;
If you want to do arithmetic on the columns, you can use this as a subquery or CTE.
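To check the counting rules before wiring them into a query, here is a small Python sketch of the same three conditions. It is only an illustration: the function name period_stats, the tuple layout (customer_number, cover_start_date, cover_stop_date, accident_date) and the sample rows are assumptions, not data from the question. Because the period is passed as an explicit start and end, weekly or fortnightly periods work the same way as monthly ones.

```python
from datetime import date


def period_stats(rows, period_start, period_end):
    """Return (num_customers, num_accidents, num_customers_with_accident).

    rows: iterable of (customer_number, cover_start, cover_stop, accident_date);
    cover_stop and accident_date may be None. The period is [period_start, period_end).
    """
    # Covered for the whole period; a missing stop date means ongoing cover.
    covered = {
        cust
        for cust, start, stop, _ in rows
        if start <= period_start and (stop is None or stop > period_end)
    }
    # One entry per accident that falls inside the period.
    accidents = [
        cust
        for cust, _, _, accident in rows
        if accident is not None and period_start <= accident < period_end
    ]
    return len(covered), len(accidents), len(set(accidents))


# Made-up sample rows for illustration only.
sample = [
    (1, date(2019, 1, 1), None, date(2020, 1, 10)),
    (1, date(2019, 1, 1), None, date(2020, 1, 20)),
    (2, date(2019, 6, 1), date(2020, 1, 15), None),
    (3, date(2020, 1, 5), None, None),
]
print(period_stats(sample, date(2020, 1, 1), date(2020, 2, 1)))
```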

Break down a date range to daily rows with high efficiency?

The data have a start_date and an end_date, like 2020-09-18 and 2020-09-28. I need to break each row down to daily rows, which here is 11 days including 2020-09-18.
My solution is to create a date table with every single day.
with cte as(
select b.fulldate,
count(1) over (partition by a,b,metric_c,metric_d) as count,
a,b,
metric_c, metric_d
from a
join dim_date b
on b.fulldate between a.start_date and a.end_date
)
select
fulldate,
a,b,
metric_c / count as metric_c, --maybe some cast or convert in here
metric_d / count as metric_d
from cte
This is what I'm using currently. But is there a more efficient way? If the table has 1,000,000 rows and maybe 10 metrics, how can I get better performance?
Thanks in advance anyway. Maybe there's some method that doesn't need an extra date table (which needs updating if it doesn't cover enough dates) and still performs really well with millions of rows. If not, I'll keep using my method.
I would keep the dim_date data model you have, as it has materialized the rows between the start_dates and end_dates.
The table DIM_DATE is an example of a conformed dimension, and it can be used across any other subject areas in your reporting application that need a date dimension.
I would check whether your DIM_DATE has an index on the key being looked up (the b.fulldate field).
I wouldn't be surprised if a recursive subquery had better performance if you have lots of dates and relatively short periods:
with cte as (
select start_date, end_date,
metric_a / (datediff(day, start_date, end_date) + 1) as metric_a,
metric_b / (datediff(day, start_date, end_date) + 1) as metric_b
from a
union all
select dateadd(day, 1, start_date), end_date, metric_a, metric_b
from cte
where start_date < end_date
)
select *
from cte;
You can just add more metrics into the CTE as needed.
If any of the periods exceed 100 days, then you need to add option (maxrecursion 0).
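The arithmetic behind both approaches (one output row per day, each carrying metric divided by the number of days) can be sketched in a few lines of Python. This is only an illustration of the expected result, not a replacement for the SQL; the function name split_daily is made up.

```python
from datetime import date, timedelta


def split_daily(start_date, end_date, metric):
    """Expand one (start_date, end_date, metric) row into one row per day.

    Both end points are inclusive, so 2020-09-18..2020-09-28 gives 11 days,
    matching datediff(day, start_date, end_date) + 1 in the query above.
    """
    n_days = (end_date - start_date).days + 1
    share = metric / n_days  # each day gets an equal share of the metric
    return [(start_date + timedelta(days=i), share) for i in range(n_days)]


rows = split_daily(date(2020, 9, 18), date(2020, 9, 28), 110.0)
print(len(rows), rows[0], rows[-1])
```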

SQL In Oracle - How to search through occurrences in an interval?

I've gotten myself stuck working in Oracle with SQL for the first time. In my library example, I need to make a query on my tables for a library member who has borrowed more than 5 books in some week during the past year. Here's my attempt:
SELECT
PN.F_NAME,
PN.L_NAME,
M.ENROLL_DATE,
COUNT(*) AS BORROWED_COUNT,
(SELECT
(BD.DATE_BORROWED + INTERVAL '7' DAY)
FROM DUAL, BORROW_DETAILS BD
GROUP BY BD.DATE_BORROWED + INTERVAL '7' DAY
HAVING COUNT(*) > 5
) AS VALID_INTERVALS
FROM PERSON_NAME PN, BORROW_DETAILS BD, HAS H, MEMBER M
WHERE
PN.PID = M.PID AND
M.PID = BD.PID AND
BD.BORROWID = H.BORROWID
GROUP BY PN.F_NAME, PN.L_NAME, M.ENROLL_DATE, DATEDIFF(DAY, BD.DATE_RETURNED, VALID_INTERVALS)
ORDER BY BORROWED_COUNT DESC;
As I'm sure you can tell, I'm really struggling with dates in Oracle. For some reason DATEDIFF won't work at all for me, and I can't find any way to evaluate VALID_INTERVALS, which should be another date...
Also apologies for the all caps.
DATEDIFF is not a valid function in Oracle; if you want the difference then subtract one date from another and you'll get a number representing the number of days (or fraction thereof) between the values.
If you want to count it for a week starting from Midnight Monday then you can TRUNCate the date to the start of the ISO week (which will be Midnight of the Monday of that week) and then group and count:
SELECT MAX( PN.F_NAME ) AS F_NAME,
MAX( PN.L_NAME ) AS L_NAME,
MAX( M.ENROLL_DATE ) AS ENROLL_DATE,
TRUNC( BD.DATE_BORROWED, 'IW' ) AS monday_of_iso_week,
COUNT(*) AS BORROWED_COUNT
FROM PERSON_NAME PN
INNER JOIN MEMBER M
ON ( PN.PID = M.PID )
INNER JOIN BORROW_DETAILS BD
ON ( M.PID = BD.PID )
GROUP BY
PN.PID,
TRUNC( BD.DATE_BORROWED, 'IW' )
HAVING COUNT(*) > 5
ORDER BY BORROWED_COUNT DESC;
You haven't given your table structures or any sample data, so it's difficult to test; but you don't appear to need to include the HAS table, and I'm assuming there is a 1:1 relationship between person and member.
You also don't want to GROUP BY names as there could be two people with the same first and last name (who happened to enrol on the same date) and should use something that uniquely identifies the person (which I assume is PID).
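The grouping idea (truncate each borrow date to the Monday of its ISO week, then count per person and week) can be checked on a few rows with a short Python sketch. The function name busy_weeks and the sample data are made up for illustration; only the PID / DATE_BORROWED pairing comes from the question.

```python
from collections import Counter
from datetime import date, timedelta


def busy_weeks(borrows, threshold=5):
    """Map (pid, monday_of_iso_week) to the borrow count, keeping counts > threshold.

    borrows: iterable of (pid, date_borrowed) pairs.
    """
    counts = Counter()
    for pid, borrowed in borrows:
        # Same result as TRUNC(date_borrowed, 'IW'): Monday of the ISO week.
        monday = borrowed - timedelta(days=borrowed.weekday())
        counts[(pid, monday)] += 1
    return {key: n for key, n in counts.items() if n > threshold}


# Made-up sample: member 1 borrows on six days of one week, member 2 once.
sample = [(1, date(2021, 3, 1) + timedelta(days=i)) for i in range(6)]
sample.append((2, date(2021, 3, 2)))
print(busy_weeks(sample))
```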

Oracle - Split a record into multiple records

I have a schedule table that holds the schedule for each month. This table also holds the days off within that month. I need a result set that tells me the working days and off days for that month.
Eg.
CREATE TABLE SCHEDULE(sch_yyyymm varchar2(6), sch varchar2(20), sch_start_date date, sch_end_date date);
INSERT INTO SCHEDULE VALUES('201703','Working Days', to_date('03/01/2017','mm/dd/yyyy'), to_date('03/31/2017','mm/dd/yyyy'));
INSERT INTO SCHEDULE VALUES('201703','Off Day', to_date('03/05/2017','mm/dd/yyyy'), to_date('03/07/2017','mm/dd/yyyy'));
INSERT INTO SCHEDULE VALUES('201703','off Days', to_date('03/08/2017','mm/dd/yyyy'), to_date('03/10/2017','mm/dd/yyyy'));
INSERT INTO SCHEDULE VALUES('201703','off Days', to_date('03/15/2017','mm/dd/yyyy'), to_date('03/15/2017','mm/dd/yyyy'));
Using SQL or PL/SQL I need to split the record with Working Days and Off Days.
From above records I need result set as:
201703 Working Days 03/01/2017 - 03/04/2017
201703 Off Days 03/05/2017 - 03/10/2017
201703 Working Days 03/11/2017 - 03/14/2017
201703 Off Days 03/15/2017 - 03/15/2017
201703 Working Days 03/16/2017 - 03/31/2017
Thank You for your help.
Edit: I've had a bit more of a think, and this approach works fine for your insert records above - however, it misses records where there are not continuous "off day" periods. I need to have a bit more of a think and will then make some changes
I've put together a test using the lead and lag functions and a self join.
The upshot is you self-join the "Off Days" onto the existing tables to find the overlaps. Then calculate the start/end dates on either side of each record. A bit of logic then lets us work out which date to use as the final start/end dates.
SQL fiddle here - I used Postgres as the Oracle function wasn't working but it should translate ok.
select sch,
/* Work out which date to use as this record's Start date */
case when prev_end_date is null then sch_start_date
else off_end_date + 1
end as final_start_date,
/* Work out which date to use as this record's end date */
case when next_start_date is null then sch_end_date
when next_start_date is not null and prev_end_date is not null then next_start_date - 1
else off_start_date - 1
end as final_end_date
from (
select a.*,
b.*,
/* Get the start/end dates for the records on either side of each working day record */
lead( b.off_start_date ) over( partition by a.sch_start_date order by b.off_start_date ) as next_start_date,
lag( b.off_end_date ) over( partition by a.sch_start_date order by b.off_start_date ) as prev_end_date
from (
/* Get all schedule records */
select sch,
sch_start_date,
sch_end_date
from schedule
) as a
left join
(
/* Get all non-working day schedule records */
select sch as off_sch,
sch_start_date as off_start_date,
sch_end_date as off_end_date
from schedule
where sch <> 'Working Days'
) as b
/* Join on "Off Days" that overlap "Working Days" */
on a.sch_start_date <= b.off_end_date
and a.sch_end_date >= b.off_start_date
and a.sch <> b.off_sch
) as c
order by final_start_date
If you had a dates table this would have been easier.
You can construct a dates table using a recursive cte and join on to it. Then use the difference of row number approach to classify rows with same schedules on consecutive dates into one group and then get the min and max of each group which would be the start and end dates for a given sch. I assume there are only 2 sch values Working Days and Off Day.
with dates(dt) as (select date '2017-03-01' from dual
union all
select dt+1 from dates where dt < date '2017-03-31')
,groups as (select sch_yyyymm,dt,sch,
row_number() over(partition by sch_yyyymm order by dt)
- row_number() over(partition by sch_yyyymm,sch order by dt) as grp
from (select s.sch_yyyymm,d.dt,
/*This condition is to avoid a given date with 2 sch values, as 03-01-2017 - 03-31-2017 are working days
on one row and there is an Off Day status for some of these days.
In such cases Off Day would be picked up as sch*/
case when count(*) over(partition by d.dt) > 1 then min(s.sch) over(partition by d.dt) else s.sch end as sch
from dates d
join schedule s on d.dt >= s.sch_start_date and d.dt <= s.sch_end_date
) t
)
select sch_yyyymm,sch,min(dt) as start_date,max(dt) as end_date
from groups
group by sch_yyyymm,sch,grp
I couldn't get the recursive cte running in Oracle. Here is a demo using SQL Server.
Sample Demo in SQL Server
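For comparison, the same "label every day, then collapse consecutive days with the same label" idea can be sketched in Python. This is not the SQL above, just an illustration of the intended result: itertools.groupby plays the role of the row-number difference, and the function name split_schedule is made up. As in the answer, it assumes only two kinds of schedule and lets an off day win when a date is covered twice.

```python
from datetime import date, timedelta
from itertools import groupby


def split_schedule(first_day, last_day, off_ranges):
    """Return [(label, start, end), ...] covering first_day..last_day.

    off_ranges: list of inclusive (start, end) date pairs marked as off days.
    """
    labelled = []
    day = first_day
    while day <= last_day:
        is_off = any(lo <= day <= hi for lo, hi in off_ranges)
        labelled.append((day, "Off Days" if is_off else "Working Days"))
        day += timedelta(days=1)

    result = []
    # groupby only merges adjacent items, so each run of equal labels is one island.
    for label, run in groupby(labelled, key=lambda item: item[1]):
        run = list(run)
        result.append((label, run[0][0], run[-1][0]))
    return result


off = [
    (date(2017, 3, 5), date(2017, 3, 7)),
    (date(2017, 3, 8), date(2017, 3, 10)),
    (date(2017, 3, 15), date(2017, 3, 15)),
]
for row in split_schedule(date(2017, 3, 1), date(2017, 3, 31), off):
    print(row)
```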

Calculate closest working day in Postgres

I need to schedule some items in a postgres query based on a requested delivery date for an order. So for example, the order has a requested delivery on a Monday (20120319 for example), and the order needs to be prepared on the prior working day (20120316).
Thoughts on the most direct method? I'm open to adding a dates table. I'm thinking there's got to be a better way than a long set of case statements using:
SELECT EXTRACT(DOW FROM TIMESTAMP '2001-02-16 20:38:40');
This gets you previous business day.
SELECT
CASE (EXTRACT(ISODOW FROM current_date)::integer) % 7
WHEN 1 THEN current_date-3
WHEN 0 THEN current_date-2
ELSE current_date-1
END AS previous_business_day
To have the previous work day:
select max(s.a) as work_day
from (
select s.a::date
from generate_series('2012-01-02'::date, '2050-12-31', '1 day') s(a)
where extract(dow from s.a) between 1 and 5
except
select holiday_date
from holiday_table
) s
where s.a < '2012-03-19'
;
If you want the next work day just invert the query.
SELECT y.d AS prep_day
FROM (
SELECT generate_series(dday - 8, dday - 1, interval '1d')::date AS d
FROM (SELECT '2012-03-19'::date AS dday) x
) y
LEFT JOIN holiday h USING (d)
WHERE h.d IS NULL
AND extract(isodow from y.d) < 6
ORDER BY y.d DESC
LIMIT 1;
It should be faster to generate only as many days as necessary. I generate one week prior to the delivery. That should cover all possibilities.
isodow as extract parameter is more convenient than dow to test for workdays.
min() / max(), ORDER BY / LIMIT 1, that's a matter of taste with the few rows in my query.
To get several candidate days in descending order, not just the top pick, change the LIMIT 1.
I put the dday (delivery day) in a subquery so you only have to input it once. You can enter any date or timestamp literal. It is cast to date either way.
CREATE TABLE Holidays (Holiday, PrecedingBusinessDay) AS VALUES
('2012-12-25'::DATE, '2012-12-24'::DATE),
('2012-12-26'::DATE, '2012-12-24'::DATE);
SELECT Day, COALESCE(PrecedingBusinessDay, PrecedingMondayToFriday)
FROM
(SELECT Day, Day - CASE DATE_PART('DOW', Day)
WHEN 0 THEN 2
WHEN 1 THEN 3
ELSE 1
END AS PrecedingMondayToFriday
FROM TestDays) AS PrecedingMondaysToFridays
LEFT JOIN Holidays ON PrecedingMondayToFriday = Holiday;
You might want to rename some of the identifiers :-).
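To sanity-check any of the queries above, the rule they implement (walk back from the delivery date until a day that is neither a weekend day nor a holiday) is easy to sketch in Python. The function name previous_business_day and the holiday set are assumptions for illustration only.

```python
from datetime import date, timedelta


def previous_business_day(day, holidays=frozenset()):
    """Latest date before `day` that is Monday to Friday and not in `holidays`."""
    candidate = day - timedelta(days=1)
    # weekday(): Monday is 0 ... Sunday is 6, so 5 and 6 are the weekend.
    while candidate.weekday() >= 5 or candidate in holidays:
        candidate -= timedelta(days=1)
    return candidate


print(previous_business_day(date(2012, 3, 19)))
print(previous_business_day(date(2012, 12, 27), {date(2012, 12, 25), date(2012, 12, 26)}))
```

The first call uses the delivery date from the question; the second uses the two holidays from the last example.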