Context:
I am working with some complicated schema and have got many CTEs and joins to get to this point. This is a watered-down version and completely different source data and example to illustrate my point (data anonymity). Hopefully it provides enough of a snapshot.
Data Overview:
I have a service which generates a production forecast looking ahead 30 days. The forecast is generated for each facility, for each shift (morning/afternoon/evening). Each forecast run covers all shifts, so the rows share a common generation_id but have a different forecast_profile_key per shift.
What I am trying to do: I want to find the SUM of the forecast error for a given forecast generation, constrained by a dynamic date range based on whether the date is a weekday or a weekend. The SUM must be grouped only on matching IDs.
Basically, the temp table provides one record per facility per date per shift with the forecast error. I want to SUM the historical error dynamically for a facility/shift/date based on whether the date is a weekday or weekend, and only SUM the error where the IDs match up (hope that makes sense!).
Specifics: I want to find the SUM grouped by 'week_part_grouping', 'forecast_profile_key', 'forecast_profile' and 'forecast_generation_id'. The part I am struggling with is that I only want to SUM the error dynamically based on the date: (a) if the date is a weekday, SUM the error from up to the 5 most recent days in a 7-day lookback period, or (b) if the date is a weekend, SUM the error from up to the 3 most recent days in a 16-day lookback period.
Ideally, having an extra column for 'total_forecast_error_in_lookback_range'.
Specific examples:
For 'facility_a', '2020-11-22' is a weekend. The lookback range is 16 days, so any date between '2020-11-05' and '2020-11-21' is eligible. The 3 most recent dates would be '2020-11-21', '2020-11-15' and '2020-11-14'. Therefore, the sum of error would be 2000+3250+1050.
For 'facility_a', '2020-11-20' is a weekday. The lookback range is 7 days, so any date between '2020-11-13' and '2020-11-19' is eligible. That works out to be '2020-11-19' through '2020-11-16', plus '2020-11-13'.
For 'facility_b', notice there is a change in the 'forecast_generation_id'. So, the error for '2020-11-20' would only be 4565.
What I have tried: I'll confess to not being quite sure how to break this portion down. I did consider a CASE statement on the week_part but then got into a nested mess. I considered using a RANK window function, but I didn't make much progress as I was unsure how to implement the dynamic lookback component. I also thought about doing a LISTAGG to get all the dates and a REGEXP wildcard lookup, but that would be very slow.
I am seeking pointers on how to go about achieving this in SQL. I don't know if I am missing something from my toolkit to break this down into something I can implement.
DROP TABLE IF EXISTS seventh__error_calc;
create temporary table seventh__error_calc
(
facility_name varchar,
shift varchar,
date_actuals date,
week_part_grouping varchar,
forecast_profile_key varchar,
forecast_profile_id varchar,
forecast_generation_id varchar,
count_dates_in_forecast bigint,
forecast_error bigint
);
Insert into seventh__error_calc
VALUES
('facility_a','morning','2020-11-22','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','1000'),
('facility_a','morning','2020-11-21','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2000'),
('facility_a','morning','2020-11-20','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','3000'),
('facility_a','morning','2020-11-19','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2500'),
('facility_a','morning','2020-11-18','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','1200'),
('facility_a','morning','2020-11-17','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','5000'),
('facility_a','morning','2020-11-16','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','4400'),
('facility_a','morning','2020-11-15','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','3250'),
('facility_a','morning','2020-11-14','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','1050'),
('facility_a','morning','2020-11-13','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-12','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-11','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-10','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-09','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-08','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_b','morning','2020-11-22','weekend','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','3400'),
('facility_b','morning','2020-11-21','weekend','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','2800'),
('facility_b','morning','2020-11-20','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','3687'),
('facility_b','morning','2020-11-19','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','4565'),
('facility_b','morning','2020-11-18','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','1262'),
('facility_b','morning','2020-11-17','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','8765'),
('facility_b','morning','2020-11-16','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','5678'),
('facility_b','morning','2020-11-15','weekend','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','2893'),
('facility_b','morning','2020-11-14','weekend','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','1928'),
('facility_b','morning','2020-11-13','weekday','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','4736')
;
SELECT *
FROM seventh__error_calc;
This achieved what I was trying to do. There were two learning points here:
Self joins. I've never used one before, but I can now see why they are powerful!
Using a CASE statement in the WHERE clause.
Hope this might help someone else some day!
select facility_name,
       forecast_profile_key,
       forecast_profile_id,
       shift,
       date_actuals,
       week_part_grouping,
       forecast_generation_id,
       sum(forecast_error) as forecast_err_calc
from (
    -- Self join: for each row (a), pull the earlier rows (b) that share the same
    -- facility, profile, week-part grouping and generation, then rank them by recency.
    select rank() over (partition by forecast_profile_id, forecast_profile_key, facility_name, a.date_actuals
                        order by b.date_actuals desc) as rnk,
           a.facility_name, a.forecast_profile_key, a.forecast_profile_id, a.shift,
           a.date_actuals, a.week_part_grouping, a.forecast_generation_id, b.forecast_error
    from seventh__error_calc a
    join seventh__error_calc b
      using (facility_name, forecast_profile_key, forecast_profile_id, week_part_grouping, forecast_generation_id)
    -- Dynamic lookback window, excluding the row's own date:
    -- 16 days back for weekends, 7 days back for weekdays.
    where case when a.week_part_grouping = 'weekend' then b.date_actuals between a.date_actuals - 16 and a.date_actuals - 1
               when a.week_part_grouping = 'weekday' then b.date_actuals between a.date_actuals - 7  and a.date_actuals - 1
          end
) src
-- Keep only the 3 most recent matching dates for weekends, 5 for weekdays.
where case when week_part_grouping = 'weekend' then rnk < 4
           when week_part_grouping = 'weekday' then rnk < 6
      end
group by facility_name, forecast_profile_key, forecast_profile_id, shift,
         date_actuals, week_part_grouping, forecast_generation_id;
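As a quick sanity check against the first worked example (a hypothetical query, not part of the solution above), the three most recent weekend rows in the 16-day window before '2020-11-22' should sum to 2000+3250+1050 = 6300:

-- Hypothetical sanity check for facility_a / 2020-11-22; expected result is 6300
select sum(forecast_error) as expected_err
from (
    select forecast_error
    from seventh__error_calc
    where facility_name = 'facility_a'
      and week_part_grouping = 'weekend'
      and forecast_generation_id = '6809dea6'
      and date_actuals between date '2020-11-22' - 16 and date '2020-11-22' - 1
    order by date_actuals desc
    limit 3
) t;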
I'm trying to model a DateRange concept for a reporting application. Some date ranges need to be absolute, March 1, 2011 - March 31, 2011. Others are relative to current date, Last 30 Days, Next Week, etc. What's the best way to store that data in SQL table?
Obviously for absolute ranges, I can have a BeginDate and EndDate. For relative ranges, having an InceptionDate and an integer RelativeDays column makes sense. How do I incorporate both of these ideas into a single table without building context into it, i.e. having all four columns mentioned and using XOR logic to populate 2 of the 4?
Two possible schemas I rejected due to having context-driven columns:
CREATE TABLE DateRange
(
BeginDate DATETIME NULL,
EndDate DATETIME NULL,
InceptionDate DATETIME NULL,
RelativeDays INT NULL
)
OR
CREATE TABLE DateRange
(
InceptionDate DATETIME NULL,
BeginDaysRelative INT NULL,
EndDaysRelative INT NULL
)
Thanks for any advice!
I don't see why your second design doesn't meet your needs unless you're in the "no NULLs never" camp. Just leave InceptionDate NULL for "relative to current date" choices so that your application can tell them apart from fixed date ranges.
(Note: not knowing your DB engine, I've left date math and current date issues in pseudocode. Also, as in your question, I've left out any text description and primary key columns).
Then, either create a view like this:
CREATE VIEW DateRangesSolved (Inception, BeginDays, EndDays) AS
SELECT CASE WHEN Inception IS NULL THEN Date() ELSE Inception END,
       BeginDays,
       EndDays
FROM DateRanges
or just use that logic when you SELECT from the table directly.
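For example, the inline version might look like this (a sketch only; COALESCE and CURRENT_DATE stand in for the pseudocode Date(), and the column names are assumed to match the view above):

-- Sketch: the same "default the inception to today" logic applied inline
SELECT COALESCE(Inception, CURRENT_DATE) AS Inception,
       BeginDays,
       EndDays
FROM DateRanges;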
You can even take it one step further:
CREATE VIEW DateRangesSolved (BeginDate, EndDate) AS
SELECT (CASE WHEN Inception IS NULL THEN Date() ELSE Inception END + BeginDays),
(CASE WHEN Inception IS NULL THEN Date() ELSE Inception END + EndDays)
FROM DateRanges
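A consumer of that second view could then filter a fact table with a simple BETWEEN (a sketch only; the Orders table and OrderDate column are assumed names, and in practice you would also filter on the description column that was left out here):

-- Hypothetical usage of DateRangesSolved; Orders/OrderDate are assumed names
SELECT o.*
FROM Orders o
JOIN DateRangesSolved r
  ON o.OrderDate BETWEEN r.BeginDate AND r.EndDate;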
Others are relative to current date, Last 30 Days, Next Week, etc.
What's the best way to store that data in SQL table?
If you store those ranges in a table, you have to update them every day. In this case, you have to update each row differently every day. That might be a big problem; it might not.
There usually aren't many rows in that kind of table, often less than 50. The table structure is obvious. Updating should be driven by a cron job (or its equivalent), and you should run very picky exception reports every day to make sure things have been updated correctly.
Normally, these kinds of reports should produce no output if things are fine. You have the added complication that driving such a report from cron will produce no output if cron isn't running. And that's not fine.
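To make that concrete, the nightly job boils down to something like this sketch (the RangeName column is an assumption, since the description columns were left out of the schemas above):

-- Hypothetical nightly refresh run from cron; RangeName is an assumed column
UPDATE DateRange
SET BeginDate = CURRENT_DATE - INTERVAL '30' DAY,
    EndDate   = CURRENT_DATE
WHERE RangeName = 'Last 30 days';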
You can also create a view, which doesn't require any maintenance. With a few dozen rows, it might be slower than a physical table, but it might still be fast enough. And it eliminates all maintenance and administrative work for these ranges. (Check for off-by-one errors, because I didn't.)
create view relative_date_ranges as
select 'Last 30 days' as range_name,
(current_date - interval '30' day)::date as range_start,
current_date as range_end
union all
select 'Last week' as range_name,
(current_date - interval '7' day)::date as range_start,
current_date as range_end
union all
select 'Next week' as range_name,
current_date as range_start,
(current_date + interval '7' day)::date as range_end
Depending on the app, you might be able to treat your "absolute" ranges the same way.
...
union all
select 'March this year' as range_name,
(extract(year from current_date) || '-03-01')::date as range_start,
(extract(year from current_date) || '-03-31')::date as range_end
Put them in separate tables. There is absolutely no reason to have them in a single table.
For the relative dates, I would go so far as to simply make the table hold the parameters you need for the date functions, i.e.
CREATE TABLE RelativeDate
(
Id INT Identity,
Date_Part varchar(25),
DatePart_Count int
)
Then you know whether it is a -2 WEEK or a 30 DAY variance and can use that in your logic.
If you need to see them both at the same time, you can combine them LOGICALLY in a query or view without needing to mess up your data structure by cramming different data elements into the same table.
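For example, a combined view might look like the sketch below (the AbsoluteDate table, the Name columns, and the convention that DatePart_Count is negative for lookback ranges are all assumptions; the CASE is needed because DATEADD only accepts a literal datepart):

-- Hypothetical combined view; AbsoluteDate and the Name columns are assumed here
CREATE VIEW AllDateRanges AS
SELECT Name, BeginDate, EndDate
FROM AbsoluteDate
UNION ALL
SELECT Name,
       -- DatePart_Count is assumed negative (e.g. -2 WEEK), so this lands in the past
       CASE Date_Part
            WHEN 'WEEK' THEN DATEADD(week, DatePart_Count, GETDATE())
            WHEN 'DAY'  THEN DATEADD(day,  DatePart_Count, GETDATE())
       END AS BeginDate,
       GETDATE() AS EndDate
FROM RelativeDate;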
Create a table containing the begindate and the offset. The precision of the offset is up to you to decide.
CREATE TABLE DateRange(
BeginDate DATETIME NOT NULL,
Offset int NOT NULL,
OffsetLabel varchar(100)
)
to insert into it:
INSERT INTO DateRange (BeginDate, Offset, OffsetLabel)
select '20110301', DATEDIFF(sec, '20110301', '20110331'), 'March 1, 2011 - March 31, 2011'
For 'Last 30 Days':
INSERT INTO DateRange (BeginDate, Offset, OffsetLabel)
select DATEADD(day, -30, current_timestamp), DATEDIFF(sec, DATEADD(day, -30, current_timestamp), current_timestamp), 'Last 30 Days'
To display the values later:
select BeginDate, EndDate = DATEADD(sec, Offset, BeginDate), OffsetLabel
from DateRange
If you want to be able to parse the "original" vague descriptions, you will have to look for a "Fuzzy Date" or "Approxidate" function. (There is something like this in the git source code.)
I'm trying to get the products that haven't been made in the last 2 years. I'm not that great with SQL, but here's what I've started with, and it doesn't work.
Let's say for this example that my schema looks like this:
prod_id, date_created, num_units_created.
I'll take any advice I can get.
select id, (select date from table
where date <= sysdate - 740) older,
(select date from table
where date >= sysdate - 740) newer
from table
where newer - older
I'm not being clear enough.
Basically I want all products that haven't been produced in the last 2 years. Whenever a product is produced, a line gets added. So if I just did date <= sysdate - 740, it would only give me the products that were produced from the beginning up until 2 years ago.
I want all products that have been produced at least once, but not in the last 2 years.
I hope that clears it up.
GROUP BY with HAVING
select id, max(date)
from table
group by id
having max(date) < add_months(sysdate,-24)
I'd use SQL's dateadd function.
where date < dateadd(year,-2,getdate())
would be a WHERE clause that selects records with a date more than 2 years before the current date.
Hope that helps.
EDIT: If you want to go by days, use dateadd(d,-740,getdate())
Maybe something like this?
select id, date
from table
where date <= (sysdate - 730);
SELECT id FROM table WHERE date + (365*2) <= sysdate;
Use SELECT id, date, other columns, ... if you need to get them at the same time.