SQL: Dynamic Join Based on Row Value
Context:
I am working with a complicated schema and have many CTEs and joins to get to this point. This is a watered-down version with completely different source data, used to illustrate my point (data anonymity). Hopefully it provides enough of a snapshot.
Data Overview:
I have a service which generates a production forecast looking ahead 30 days. The forecast is generated for each facility, for each shift (morning/afternoon). Each forecast produced covers all shifts (morning/afternoon/evening) so they share a common generation_id but different forecast_profile_key.
What I am trying to do: I want to find the SUM of the forecast error for a given forecast generation constrained by a dynamic date range based on whether the date is a weekday or weekend. The SUM must be grouped only on similar IDs.
Basically, the temp table provides one record per facility per date per shift with the forecast error. I want to SUM the historical error dynamically for a facility/shift/date based on whether the date is a weekday/weekend, and only SUM the error where the IDs match up (hope that makes sense!).
Specifics: I want to find the SUM grouped by 'week_part_grouping', 'forecast_profile_key', 'forecast_profile' and 'forecast_generation_id'. The part I am struggling with is that I only want to SUM the error dynamically based on date: (a) if the date is a weekday, I want to SUM the error from up to the 5 recent-most days in a 7 day look back period, or (b) if the date is a weekend, I want to SUM the error from up to the 3 recent-most days in a 16 day look back period.
Ideally, having an extra column for 'total_forecast_error_in_lookback_range'.
Specific examples:
For 'facility_a', '2020-11-22' is a weekend. The lookback range is 16 days, so any date between '2020-11-21' and '2020-11-05' is eligible. The 3 recent-most dates would be '2020-11-21', '2020-11-15' and '2020-11-14'. Therefore, the sum of error would be 2000+3250+1050.
For 'facility_a', '2020-11-20' is a weekday. The lookback range is 7 days, so any date between '2020-11-19' and '2020-11-13' is eligible. That would work out to be '2020-11-19' through '2020-11-16', plus '2020-11-13'.
For 'facility_b', notice there is a change in the 'forecast_generation_id'. So, the error for '2020-11-20' would only be 4565.
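The weekend rule in the first example can be sanity-checked in plain Python. This is a hypothetical sketch, not the final query: only facility_a's weekend rows for one generation_id are listed, since the grouping keys keep weekend and weekday rows apart anyway.

```python
from datetime import date, timedelta

# (date_actuals, week_part_grouping, forecast_error) for facility_a's weekend rows
rows = [
    (date(2020, 11, 21), "weekend", 2000),
    (date(2020, 11, 15), "weekend", 3250),
    (date(2020, 11, 14), "weekend", 1050),
    (date(2020, 11, 8),  "weekend", 2450),
]

target = date(2020, 11, 22)   # a weekend date
lookback = 16                 # weekend rule: 16-day window ...
keep = 3                      # ... keep the 3 recent-most rows

# Eligible: strictly before the target, within the lookback window
eligible = [r for r in rows
            if target - timedelta(days=lookback) <= r[0] < target]
eligible.sort(key=lambda r: r[0], reverse=True)
total = sum(err for _, _, err in eligible[:keep])
print(total)  # 2000 + 3250 + 1050 = 6300
```

The '2020-11-08' row falls inside the 16-day window but is the 4th most recent, so it is dropped by the top-3 cut, matching the worked example.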
What I have tried: I'll confess to not being quite sure how to break this portion down. I did consider a CASE statement on the week_part but then got into a nested mess. I considered using a RANK window function, but I didn't make much progress as I was unsure how to implement the dynamic lookback component. I also thought about doing a LISTAGG to get all the dates and a REGEXP wildcard lookup, but that would be very slow.
I am seeking pointers on how to go about achieving this in SQL. I don't know if I am missing something from my toolkit here to break this down into something I can implement.
DROP TABLE IF EXISTS seventh__error_calc;
create temporary table seventh__error_calc
(
facility_name varchar,
shift varchar,
date_actuals date,
week_part_grouping varchar,
forecast_profile_key varchar,
forecast_profile_id varchar,
forecast_generation_id varchar,
count_dates_in_forecast bigint,
forecast_error bigint
);
Insert into seventh__error_calc
VALUES
('facility_a','morning','2020-11-22','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','1000'),
('facility_a','morning','2020-11-21','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2000'),
('facility_a','morning','2020-11-20','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','3000'),
('facility_a','morning','2020-11-19','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2500'),
('facility_a','morning','2020-11-18','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','1200'),
('facility_a','morning','2020-11-17','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','5000'),
('facility_a','morning','2020-11-16','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','4400'),
('facility_a','morning','2020-11-15','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','3250'),
('facility_a','morning','2020-11-14','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','1050'),
('facility_a','morning','2020-11-13','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-12','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-11','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-10','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-09','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-08','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_b','morning','2020-11-22','weekend','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','3400'),
('facility_b','morning','2020-11-21','weekend','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','2800'),
('facility_b','morning','2020-11-20','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','3687'),
('facility_b','morning','2020-11-19','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','4565'),
('facility_b','morning','2020-11-18','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','1262'),
('facility_b','morning','2020-11-17','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','8765'),
('facility_b','morning','2020-11-16','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','5678'),
('facility_b','morning','2020-11-15','weekend','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','2893'),
('facility_b','morning','2020-11-14','weekend','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','1928'),
('facility_b','morning','2020-11-13','weekday','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','4736')
;
SELECT *
FROM seventh__error_calc
This achieved what I was trying to do. There were two learning points here.
Self Joins. I've never used one before but can now see why they are powerful!
Using a CASE statement in the WHERE clause.
Hope this might help someone else some day!
select facility_name,
forecast_profile_key,
forecast_profile_id,
shift,
date_actuals,
week_part_grouping,
forecast_generation_id,
sum(forecast_error) forecast_err_calc
from (
select rank() over (partition by forecast_profile_id, forecast_profile_key, facility_name, a.date_actuals order by b.date_actuals desc) rnk,
a.facility_name, a.forecast_profile_key, a.forecast_profile_id, a.shift, a.date_actuals, a.week_part_grouping, a.forecast_generation_id, b.forecast_error
from seventh__error_calc a
join seventh__error_calc b
using (facility_name, forecast_profile_key, forecast_profile_id, week_part_grouping, forecast_generation_id)
where case when a.week_part_grouping = 'weekend' then b.date_actuals between a.date_actuals - 16 and a.date_actuals
when a.week_part_grouping = 'weekday' then b.date_actuals between a.date_actuals - 7 and a.date_actuals
end
) src
where case when week_part_grouping = 'weekend' then rnk < 4
when week_part_grouping = 'weekday' then rnk < 6
end
Related
Adding x work days onto a date in SQL Server?
I'm a bit confused as to whether there is a simple way to do this. I have a field called receipt_date in my data table and I wish to add 10 working days to it (with bank holidays). I'm not sure if there is any sort of query I could use to join onto this table from my original to calculate 10 working days from it; I've tried a few subqueries but I couldn't get it right, or perhaps it's not possible. I didn't know if there was a way to extract the 10th row after the receipt date to get the calendar date if I only include 'Y' in the WHERE. Any help appreciated.
This is making several assumptions about your data, because we have none. One method, however, would be to create a function (an inline table-valued function here) to return the relevant row from your calendar table. Note that this assumes that the number of days must always be positive, and that if you provide a date that isn't a working day, day 0 would be the next working day. I.e. adding zero working days to 2021-09-05 would return 2021-09-06, and adding 3 would return 2021-09-09. If that isn't what you want, this should be more than enough for you to get there yourself.

CREATE FUNCTION dbo.AddWorkingDays (@Days int, @Date date)
RETURNS TABLE
AS RETURN
    WITH Dates AS (
        SELECT CalendarDate, WorkingDay
        FROM dbo.CalendarTable
        WHERE CalendarDate >= @Date
    )
    SELECT CalendarDate
    FROM Dates
    WHERE WorkingDay = 1
    ORDER BY CalendarDate
    OFFSET @Days ROWS FETCH NEXT 1 ROW ONLY;
GO

--Using the function
SELECT YT.DateColumn,
       AWD.CalendarDate AS AddedWorkingDays
FROM dbo.YourTable YT
CROSS APPLY dbo.AddWorkingDays(10, YT.DateColumn) AWD;
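The same "day 0 is the next working day" semantics can be sketched in Python, assuming a hypothetical bank-holiday set in place of the calendar table's WorkingDay flag:

```python
from datetime import date, timedelta

# Hypothetical holiday set; in the SQL version this lives in CalendarTable.WorkingDay
BANK_HOLIDAYS = {date(2021, 12, 27), date(2021, 12, 28)}

def is_working_day(d: date) -> bool:
    return d.weekday() < 5 and d not in BANK_HOLIDAYS

def add_working_days(start: date, days: int) -> date:
    # Mirrors OFFSET @Days ROWS FETCH NEXT 1 ROW ONLY over the working-day
    # calendar: day 0 is the first working day on or after `start`.
    d = start
    while not is_working_day(d):
        d += timedelta(days=1)
    for _ in range(days):
        d += timedelta(days=1)
        while not is_working_day(d):
            d += timedelta(days=1)
    return d

print(add_working_days(date(2021, 9, 5), 0))  # Sunday -> 2021-09-06
print(add_working_days(date(2021, 9, 5), 3))  # -> 2021-09-09
```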
Impala get the difference between 2 dates excluding weekends
I'm trying to get the day difference between 2 dates in Impala, but I need to exclude weekends. I know it should be something like this, but I'm not sure how the weekend piece would go...

DATEDIFF(resolution_date, created_date)

Thanks!
One approach to such a task is to enumerate each and every day in the range, and then filter out the weekends before counting. Some databases have specific features to generate date series, while others offer recursive common table expressions. Impala does not support recursive queries, so we need to look at alternative solutions. If you have a table with at least as many rows as the maximum number of days in a range, you can use row_number() to offset the starting date, and then conditional aggregation to count working days. Assuming that your table is called mytable, with column id as primary key, and that the big table is called bigtable, you would do:

select
    t.id,
    sum(
        case when dayofweek(date_add(t.created_date, n.rn)) between 2 and 6
             then 1 else 0
        end
    ) no_days
from mytable t
inner join (
    select row_number() over (order by 1) - 1 rn
    from bigtable
) n on t.resolution_date > date_add(t.created_date, n.rn)
group by id
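The enumerate-and-filter idea reduces to a few lines of plain Python, as a hedged sketch: generate every offset from created_date up to (but excluding) resolution_date and count the ones landing Monday-Friday. Impala's dayofweek() range 2-6 corresponds to weekday() 0-4 here.

```python
from datetime import date, timedelta

def weekdays_between(created: date, resolution: date) -> int:
    # Offsets 0 .. days-1, matching "resolution_date > date_add(created_date, rn)":
    # created_date itself is counted if it is a weekday, resolution_date is not.
    days = (resolution - created).days
    return sum(
        1
        for n in range(days)
        if (created + timedelta(days=n)).weekday() < 5
    )

print(weekdays_between(date(2021, 9, 6), date(2021, 9, 13)))  # Mon -> Mon: 5 weekdays
```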
Trying to UNNEST timestamp array field, but need to GROUP BY
I have a repeated field of type TIMESTAMP in a BigQuery table, and I am attempting to UNNEST this field. However, I must group or aggregate the field in order to do so. I am not knowledgeable with SQL, so I could use some help. The code snippet is part of a larger query that works when substituting subscription.future_renewal_dates with GENERATE_TIMESTAMP_ARRAY. subscription.future_renewal_dates is ARRAY<TIMESTAMP>. The TIMESTAMP array is unique (recurring subscriptions) and cannot be generated using GENERATE_TIMESTAMP_ARRAY, so I have to generate the dates before uploading to BigQuery. A UDF is too much.

SELECT
  subscription.amount AS subscription_amount,
  subscription.status AS subscription_status,
  "1" AS analytic_name,
  ARRAY (
    SELECT AS STRUCT
      FORMAT_TIMESTAMP("%x", days) AS type_value,
      subscription.amount AS analytic_name
    FROM UNNEST(subscription.future_renewal_dates) AS days
    WHERE days >= TIMESTAMP("2019-06-05T19:30:02+00:00")
      AND days <= TIMESTAMP("2019-08-01T03:59:59+00:00")
  ) AS forecast
FROM `mydataset.subscription` AS subscription
GROUP BY subscription_amount, subscription_status, analytic_name

I cannot figure out how to successfully unnest subscription.future_renewal_dates without the error 'UNNEST expression references subscription.future_renewal_dates which is neither grouped nor aggregated'.
When you do GROUP BY, all expressions and columns in the SELECT (except those in the GROUP BY list) should be used with some aggregation function, which you clearly do not have. So you need to decide what it is you are actually trying to achieve with that grouping. Below is the option I think you had in mind. It could be different, but at least you have an idea of how to fix it:

SELECT
  subscription.amount AS subscription_amount,
  subscription.status AS subscription_status,
  "1" AS analytic_name,
  ARRAY_CONCAT_AGG( ARRAY (
    SELECT AS STRUCT
      FORMAT_TIMESTAMP("%x", days) AS type_value,
      subscription.amount AS analytic_name
    FROM UNNEST(subscription.future_renewal_dates) AS days
    WHERE days >= TIMESTAMP("2019-06-05T19:30:02+00:00")
      AND days <= TIMESTAMP("2019-08-01T03:59:59+00:00")
  )) AS forecast
FROM `mydataset.subscription` AS subscription
GROUP BY subscription_amount, subscription_status, analytic_name
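What ARRAY_CONCAT_AGG does can be sketched in Python: after grouping, each group's per-row arrays are filtered and concatenated into a single array. The field names mirror the query; the data itself is made up for illustration.

```python
from itertools import groupby

# Hypothetical rows: two subscriptions in the same (amount, status) group
rows = [
    {"amount": 10, "status": "active",
     "future_renewal_dates": ["2019-06-10", "2019-09-10"]},
    {"amount": 10, "status": "active",
     "future_renewal_dates": ["2019-07-01"]},
]

lo, hi = "2019-06-05", "2019-08-01"
key = lambda r: (r["amount"], r["status"])

# Per group: filter each row's array (the inner ARRAY subquery),
# then concatenate across rows (ARRAY_CONCAT_AGG).
forecast = {
    k: [d for r in grp for d in r["future_renewal_dates"] if lo <= d <= hi]
    for k, grp in groupby(sorted(rows, key=key), key)
}
print(forecast)  # {(10, 'active'): ['2019-06-10', '2019-07-01']}
```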
Calculating a running count of Weeks
I am looking to calculate a running count of the weeks that have occurred since a starting point. The biggest problem here is that the calendar I am working with is not a traditional Gregorian calendar. The easiest dimension to reference would be something like 'TWEEK', which tells you the week of the year that the record falls into. Example data:

CREATE TABLE #foobar
(
    DateKey INT
    ,TWEEK INT
    ,CumWEEK INT
);

INSERT INTO #foobar (DateKey, TWEEK, CumWEEK)
VALUES (20150630,1,1), (20150701,1,1), (20150702,1,1), (20150703,1,1), (20150704,1,1), (20150705,1,1), (20150706,1,1),
       (20150707,2,2), (20150708,2,2), (20150709,2,2), (20150710,2,2), (20150711,2,2), (20150712,2,2), (20150713,2,2),
       (20150714,1,3), (20150715,1,3), (20150716,1,3), (20150717,1,3), (20150718,1,3), (20150719,1,3), (20150720,1,3),
       (20150721,2,4), (20150722,2,4), (20150723,2,4), (20150724,2,4), (20150725,2,4), (20150726,2,4), (20150727,2,4)

For the sake of ease, I did not go all the way to 52, but you get the point. I am trying to recreate the 'CumWEEK' column. I already have a column that tells me the correct week of the year according to the weird calendar convention ('TWEEK'). I know this will involve some kind of OVER() windowing, but I cannot seem to figure it out.
The window function LAG(), along with a summation using ORDER BY ROWS BETWEEN over the "changes", should get you close enough to work with. The caveat is that ORDER BY ROWS BETWEEN can only take an integer literal. Year rollover: I guess you could create another ranking level based on mod 52 to start the count fresh, so week 53 would become year 2, week 1, not 53.

SELECT *,
       SUM(ChangedRow) OVER (ORDER BY DateKey ROWS BETWEEN 99999 PRECEDING AND CURRENT ROW)
FROM (
    SELECT DateKey, TWEEK,
           ChangedRow = CASE WHEN LAG(TWEEK) OVER (ORDER BY DateKey) <> TWEEK THEN 1 ELSE 0 END
    FROM #foobar F2
) AS DETAIL
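The LAG-then-running-sum trick reads clearly in Python, as a sketch: flag each row where TWEEK differs from the previous row, then keep a cumulative sum of the flags. One detail to watch in the SQL version: LAG() returns NULL on the first row, so ChangedRow starts at 0 there and the running count starts at 0 rather than 1; this sketch treats the first row as a change so the output matches the sample CumWEEK column.

```python
rows = [  # (DateKey, TWEEK) ordered by DateKey; two days from each sample week
    (20150630, 1), (20150701, 1), (20150707, 2), (20150708, 2),
    (20150714, 1), (20150715, 1), (20150721, 2), (20150722, 2),
]

cum, prev_tweek, out = 0, None, []
for date_key, tweek in rows:
    if tweek != prev_tweek:   # LAG(TWEEK) <> TWEEK  ->  ChangedRow = 1
        cum += 1
    out.append((date_key, tweek, cum))
    prev_tweek = tweek
print(out)  # running count: 1,1,2,2,3,3,4,4
```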
Some minutes ago I answered a different question; in a way this is a similar question to https://stackoverflow.com/a/31303395/5089204. The idea is roughly to create a table of running numbers and find the weeks with modulo 7. This you could use as grouping in an OVER clause...

EDIT: Example

CREATE FUNCTION dbo.RunningNumber(@Counter AS INT)
RETURNS TABLE
AS RETURN
    SELECT TOP (@Counter) ROW_NUMBER() OVER (ORDER BY o.object_id) AS RunningNumber
    FROM sys.objects AS o; --take any large table here...
GO

SELECT 'test', CAST(numbers.RunningNumber / 7 AS INT)
FROM dbo.RunningNumber(100) AS numbers

Dividing by 7 "as INT" offers a quite nice grouping criterion. Hope this helps...
How to have GROUP BY and COUNT include zero sums?
I have SQL like this (where $ytoday is 5 days ago):

$sql = 'SELECT Count(*), created_at
        FROM People
        WHERE created_at >= "'. $ytoday .'"
        GROUP BY DATE(created_at)';

I want this to return a value for every day, so it would return 5 results in this case (5 days ago until today). But say Count(*) is 0 for yesterday; instead of returning a zero, it doesn't return any data at all for that date. How can I change this SQLite query so it also returns data that has a count of 0?
Without convoluted (in my opinion) queries, your output data set won't include dates that don't exist in your input data set. This means that you need a data set with the 5 days to join on. The simple version would be to create a table with the 5 dates and join on that. I typically create and keep (effectively caching) a calendar table with every date I could ever need, such as from 1900-01-01 to 2099-12-31.

SELECT Calendar.calendar_date,
       Count(People.created_at)
FROM Calendar
LEFT JOIN People
       ON Calendar.calendar_date = People.created_at
WHERE Calendar.calendar_date >= '2012-05-01'
GROUP BY Calendar.calendar_date
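The calendar-table pattern works as-is in SQLite, which the question uses. Here is a runnable sketch via Python's sqlite3, with a hypothetical 5-day calendar and a few People rows; note COUNT(People.created_at) counts non-NULL values, which is what turns the unmatched calendar dates into zeros.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Calendar (calendar_date TEXT PRIMARY KEY);
CREATE TABLE People (created_at TEXT);
INSERT INTO Calendar VALUES ('2012-05-01'),('2012-05-02'),('2012-05-03'),
                            ('2012-05-04'),('2012-05-05');
INSERT INTO People VALUES ('2012-05-01'),('2012-05-01'),('2012-05-04');
""")

rows = con.execute("""
SELECT Calendar.calendar_date, COUNT(People.created_at)
FROM Calendar
LEFT JOIN People ON Calendar.calendar_date = People.created_at
WHERE Calendar.calendar_date >= '2012-05-01'
GROUP BY Calendar.calendar_date
ORDER BY Calendar.calendar_date
""").fetchall()
print(rows)  # days with no People rows show up with a count of 0
```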
You'll need to left join against a list of dates. You can either create a table with the dates you need in it, or you can take the dynamic approach I outlined here: generate days from date range