SQL: Dynamic Join Based on Row Value
Context:
I am working with a complicated schema and have many CTEs and joins to get to this point. This is a watered-down version with completely different source data, used to illustrate my point (data anonymity). Hopefully it provides enough of a snapshot.
Data Overview:
I have a service which generates a production forecast looking ahead 30 days. The forecast is generated for each facility, for each shift (morning/afternoon). Each forecast produced covers all shifts (morning/afternoon/evening) so they share a common generation_id but different forecast_profile_key.
What I am trying to do: I want to find the SUM of the forecast error for a given forecast generation constrained by a dynamic date range based on whether the date is a weekday or weekend. The SUM must be grouped only on similar IDs.
Basically, the temp table provides one record per facility per date per shift with the forecast error. I want to SUM the historical error dynamically for a facility/shift/date based on whether the date is a weekday/weekend, and only SUM the error where the IDs match up (hope that makes sense!).
Specifics: I want to find the SUM grouped by 'week_part_grouping', 'forecast_profile_key', 'forecast_profile' and 'forecast_generation_id'. The part I am struggling with is that I only want to SUM the error dynamically based on date: (a) if the date is a weekday, I want to SUM the error from up to the 5 recent-most days in a 7 day look back period, or (b) if the date is a weekend, I want to SUM the error from up to the 3 recent-most days in a 16 day look back period.
Ideally, having an extra column for 'total_forecast_error_in_lookback_range'.
Specific examples:
For 'facility_a', '2020-11-22' is a weekend. The lookback range is 16 days, so any date between '2020-11-21' and '2020-11-05' is eligible. The 3 recent-most dates would be '2020-11-21', '2020-11-15' and '2020-11-14'. Therefore, the sum of error would be 2000+3250+1050.
For 'facility_a', '2020-11-20' is a weekday. The lookback range is 7 days, so any date between '2020-11-19' and '2020-11-13' is eligible. That would work out to be '2020-11-19' through '2020-11-16', plus '2020-11-13'.
For 'facility_b', notice there is a change in the 'forecast_generation_id'. So, the error for '2020-11-20' would only be 4565.
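The weekend rule in the first example can be sanity-checked in plain Python. This is a hypothetical sketch, not the final query: only facility_a's weekend rows for one generation_id are listed, since the grouping keys keep weekend and weekday rows apart anyway.

```python
from datetime import date, timedelta

# (date_actuals, week_part_grouping, forecast_error) for facility_a's weekend rows
rows = [
    (date(2020, 11, 21), "weekend", 2000),
    (date(2020, 11, 15), "weekend", 3250),
    (date(2020, 11, 14), "weekend", 1050),
    (date(2020, 11, 8),  "weekend", 2450),
]

target = date(2020, 11, 22)   # a weekend date
lookback = 16                 # weekend rule: 16-day window ...
keep = 3                      # ... keep the 3 recent-most rows

# Eligible: strictly before the target, within the lookback window
eligible = [r for r in rows
            if target - timedelta(days=lookback) <= r[0] < target]
eligible.sort(key=lambda r: r[0], reverse=True)
total = sum(err for _, _, err in eligible[:keep])
print(total)  # 2000 + 3250 + 1050 = 6300
```

The '2020-11-08' row falls inside the 16-day window but is the 4th most recent, so it is dropped by the top-3 cut, matching the worked example.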
What I have tried: I'll confess to not being quite sure how to break this portion down. I did consider a CASE statement on the week_part but then got into a nested mess. I considered using a RANK window function, but I didn't make much progress as I was unsure how to implement the dynamic lookback component. I also thought about doing a LISTAGG to get all the dates and a REGEXP wildcard lookup, but that would be very slow.
I am seeking pointers on how to go about achieving this in SQL. I don't know if I am missing something from my toolkit here to break this down into something I can implement.
DROP TABLE IF EXISTS seventh__error_calc;
create temporary table seventh__error_calc
(
facility_name varchar,
shift varchar,
date_actuals date,
week_part_grouping varchar,
forecast_profile_key varchar,
forecast_profile_id varchar,
forecast_generation_id varchar,
count_dates_in_forecast bigint,
forecast_error bigint
);
Insert into seventh__error_calc
VALUES
('facility_a','morning','2020-11-22','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','1000'),
('facility_a','morning','2020-11-21','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2000'),
('facility_a','morning','2020-11-20','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','3000'),
('facility_a','morning','2020-11-19','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2500'),
('facility_a','morning','2020-11-18','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','1200'),
('facility_a','morning','2020-11-17','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','5000'),
('facility_a','morning','2020-11-16','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','4400'),
('facility_a','morning','2020-11-15','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','3250'),
('facility_a','morning','2020-11-14','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','1050'),
('facility_a','morning','2020-11-13','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-12','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-11','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-10','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-09','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-08','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_b','morning','2020-11-22','weekend','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','3400'),
('facility_b','morning','2020-11-21','weekend','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','2800'),
('facility_b','morning','2020-11-20','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','3687'),
('facility_b','morning','2020-11-19','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','4565'),
('facility_b','morning','2020-11-18','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','1262'),
('facility_b','morning','2020-11-17','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','8765'),
('facility_b','morning','2020-11-16','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','5678'),
('facility_b','morning','2020-11-15','weekend','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','2893'),
('facility_b','morning','2020-11-14','weekend','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','1928'),
('facility_b','morning','2020-11-13','weekday','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','4736')
;
SELECT *
FROM seventh__error_calc
This achieved what I was trying to do. There were two learning points here.
Self Joins. I've never used one before but can now see why they are powerful!
Using a CASE statement in the WHERE clause.
Hope this might help someone else some day!
select facility_name,
forecast_profile_key,
forecast_profile_id,
shift,
date_actuals,
week_part_grouping,
forecast_generation_id,
sum(forecast_error) forecast_err_calc
from (
select rank() over (partition by forecast_profile_id, forecast_profile_key, facility_name, a.date_actuals order by b.date_actuals desc) rnk,
a.facility_name, a.forecast_profile_key, a.forecast_profile_id, a.shift, a.date_actuals, a.week_part_grouping, a.forecast_generation_id, b.forecast_error
from seventh__error_calc a
join seventh__error_calc b
using (facility_name, forecast_profile_key, forecast_profile_id, week_part_grouping, forecast_generation_id)
where case when a.week_part_grouping = 'weekend' then b.date_actuals between a.date_actuals - 16 and a.date_actuals
when a.week_part_grouping = 'weekday' then b.date_actuals between a.date_actuals - 7 and a.date_actuals
end
) src
where case when week_part_grouping = 'weekend' then rnk < 4
when week_part_grouping = 'weekday' then rnk < 6
end
Related
Adding x work days onto a date in SQL Server?
I'm a bit confused as to whether there is a simple way to do this. I have a field called receipt_date in my data table and I wish to add 10 working days to it (with bank holidays). I'm not sure if there is any sort of query I could use to join onto this table from my original to calculate 10 working days from it; I've tried a few subqueries but I couldn't get it right, or perhaps it's not possible. I didn't know if there was a way to extract the 10th row after the receipt date to get the calendar date if I only include 'Y' in the WHERE. Any help appreciated.
This is making several assumptions about your data, because we have none. One method, however, would be to create a function (an inline table-valued function here) to return the relevant row from your calendar table. Note that this assumes that the number of days must always be positive, and that if you provide a date that isn't a working day, day 0 would be the next working day. I.e. adding zero working days to 2021-09-05 would return 2021-09-06, and adding 3 would return 2021-09-09. If that isn't what you want, this should be more than enough for you to get there yourself.

CREATE FUNCTION dbo.AddWorkingDays (@Days int, @Date date)
RETURNS TABLE
AS RETURN
    WITH Dates AS (
        SELECT CalendarDate, WorkingDay
        FROM dbo.CalendarTable
        WHERE CalendarDate >= @Date
    )
    SELECT CalendarDate
    FROM Dates
    WHERE WorkingDay = 1
    ORDER BY CalendarDate
    OFFSET @Days ROWS FETCH NEXT 1 ROW ONLY;
GO

--Using the function
SELECT YT.DateColumn,
       AWD.CalendarDate AS AddedWorkingDays
FROM dbo.YourTable YT
CROSS APPLY dbo.AddWorkingDays(10, YT.DateColumn) AWD;
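The same "day 0 is the next working day" semantics can be sketched in Python, assuming a hypothetical bank-holiday set in place of the calendar table's WorkingDay flag:

```python
from datetime import date, timedelta

# Hypothetical holiday set; in the SQL version this lives in CalendarTable.WorkingDay
BANK_HOLIDAYS = {date(2021, 12, 27), date(2021, 12, 28)}

def is_working_day(d: date) -> bool:
    return d.weekday() < 5 and d not in BANK_HOLIDAYS

def add_working_days(start: date, days: int) -> date:
    # Mirrors OFFSET @Days ROWS FETCH NEXT 1 ROW ONLY over the working-day
    # calendar: day 0 is the first working day on or after `start`.
    d = start
    while not is_working_day(d):
        d += timedelta(days=1)
    for _ in range(days):
        d += timedelta(days=1)
        while not is_working_day(d):
            d += timedelta(days=1)
    return d

print(add_working_days(date(2021, 9, 5), 0))  # Sunday -> 2021-09-06
print(add_working_days(date(2021, 9, 5), 3))  # -> 2021-09-09
```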
Impala get the difference between 2 dates excluding weekends
I'm trying to get the day difference between 2 dates in Impala, but I need to exclude weekends. I know it should be something like this, but I'm not sure how the weekend piece would go...

DATEDIFF(resolution_date, created_date)

Thanks!
One approach to such a task is to enumerate each and every day in the range, and then filter out the weekends before counting. Some databases have specific features to generate date series, while others offer recursive common table expressions. Impala does not support recursive queries, so we need to look at alternative solutions. If you have a table with at least as many rows as the maximum number of days in a range, you can use row_number() to offset the starting date, and then conditional aggregation to count working days. Assuming that your table is called mytable, with column id as primary key, and that the big table is called bigtable, you would do:

select
    t.id,
    sum(
        case when dayofweek(date_add(t.created_date, n.rn)) between 2 and 6
             then 1 else 0
        end
    ) no_days
from mytable t
inner join (
    select row_number() over (order by 1) - 1 rn
    from bigtable
) n on t.resolution_date > date_add(t.created_date, n.rn)
group by id
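The enumerate-and-filter idea reduces to a few lines of plain Python, as a hedged sketch: generate every offset from created_date up to (but excluding) resolution_date and count the ones landing Monday-Friday. Impala's dayofweek() range 2-6 corresponds to weekday() 0-4 here.

```python
from datetime import date, timedelta

def weekdays_between(created: date, resolution: date) -> int:
    # Offsets 0 .. days-1, matching "resolution_date > date_add(created_date, rn)":
    # created_date itself is counted if it is a weekday, resolution_date is not.
    days = (resolution - created).days
    return sum(
        1
        for n in range(days)
        if (created + timedelta(days=n)).weekday() < 5
    )

print(weekdays_between(date(2021, 9, 6), date(2021, 9, 13)))  # Mon -> Mon: 5 weekdays
```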
Trying to UNNEST timestamp array field, but need to GROUP BY
I have a repeated field of type TIMESTAMP in a BigQuery table, and I am attempting to UNNEST this field. However, I must group or aggregate the field in order to do so. I am not knowledgeable with SQL, so I could use some help. The code snippet is part of a larger query that works when substituting subscription.future_renewal_dates with GENERATE_TIMESTAMP_ARRAY. subscription.future_renewal_dates is ARRAY<TIMESTAMP>. The TIMESTAMP array is unique (recurring subscriptions) and cannot be generated using GENERATE_TIMESTAMP_ARRAY, so I have to generate the dates before uploading to BigQuery. A UDF is too much.

SELECT
  subscription.amount AS subscription_amount,
  subscription.status AS subscription_status,
  "1" AS analytic_name,
  ARRAY (
    SELECT AS STRUCT
      FORMAT_TIMESTAMP("%x", days) AS type_value,
      subscription.amount AS analytic_name
    FROM UNNEST(subscription.future_renewal_dates) AS days
    WHERE days >= TIMESTAMP("2019-06-05T19:30:02+00:00")
      AND days <= TIMESTAMP("2019-08-01T03:59:59+00:00")
  ) AS forecast
FROM `mydataset.subscription` AS subscription
GROUP BY subscription_amount, subscription_status, analytic_name

I cannot figure out how to successfully unnest subscription.future_renewal_dates without the error 'UNNEST expression references subscription.future_renewal_dates which is neither grouped nor aggregated'.
When you do GROUP BY, all expressions and columns in the SELECT (except those in the GROUP BY list) should be used with some aggregation function, which you clearly do not have. So you need to decide what it is you are actually trying to achieve with that grouping. Below is the option I think you had in mind. It could be different, but at least you have an idea of how to fix it:

SELECT
  subscription.amount AS subscription_amount,
  subscription.status AS subscription_status,
  "1" AS analytic_name,
  ARRAY_CONCAT_AGG( ARRAY (
    SELECT AS STRUCT
      FORMAT_TIMESTAMP("%x", days) AS type_value,
      subscription.amount AS analytic_name
    FROM UNNEST(subscription.future_renewal_dates) AS days
    WHERE days >= TIMESTAMP("2019-06-05T19:30:02+00:00")
      AND days <= TIMESTAMP("2019-08-01T03:59:59+00:00")
  )) AS forecast
FROM `mydataset.subscription` AS subscription
GROUP BY subscription_amount, subscription_status, analytic_name
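What ARRAY_CONCAT_AGG does can be sketched in Python: after grouping, each group's per-row arrays are filtered and concatenated into a single array. The field names mirror the query; the data itself is made up for illustration.

```python
from itertools import groupby

# Hypothetical rows: two subscriptions in the same (amount, status) group
rows = [
    {"amount": 10, "status": "active",
     "future_renewal_dates": ["2019-06-10", "2019-09-10"]},
    {"amount": 10, "status": "active",
     "future_renewal_dates": ["2019-07-01"]},
]

lo, hi = "2019-06-05", "2019-08-01"
key = lambda r: (r["amount"], r["status"])

# Per group: filter each row's array (the inner ARRAY subquery),
# then concatenate across rows (ARRAY_CONCAT_AGG).
forecast = {
    k: [d for r in grp for d in r["future_renewal_dates"] if lo <= d <= hi]
    for k, grp in groupby(sorted(rows, key=key), key)
}
print(forecast)  # {(10, 'active'): ['2019-06-10', '2019-07-01']}
```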
Calculating a running count of Weeks
I am looking to calculate a running count of the weeks that have occurred since a starting point. The biggest problem here is that the calendar I am working with is not a traditional Gregorian calendar. The easiest dimension to reference would be something like 'TWEEK', which tells you the week of the year that the record falls into. Example data:

CREATE TABLE #foobar
(
    DateKey INT
    ,TWEEK INT
    ,CumWEEK INT
);

INSERT INTO #foobar (DateKey, TWEEK, CumWEEK)
VALUES (20150630,1,1), (20150701,1,1), (20150702,1,1), (20150703,1,1), (20150704,1,1), (20150705,1,1), (20150706,1,1),
       (20150707,2,2), (20150708,2,2), (20150709,2,2), (20150710,2,2), (20150711,2,2), (20150712,2,2), (20150713,2,2),
       (20150714,1,3), (20150715,1,3), (20150716,1,3), (20150717,1,3), (20150718,1,3), (20150719,1,3), (20150720,1,3),
       (20150721,2,4), (20150722,2,4), (20150723,2,4), (20150724,2,4), (20150725,2,4), (20150726,2,4), (20150727,2,4)

For the sake of ease, I did not go all the way to 52, but you get the point. I am trying to recreate the 'CumWEEK' column. I already have a column that tells me the correct week of the year according to the weird calendar convention ('TWEEK'). I know this will involve some kind of OVER() windowing, but I cannot seem to figure it out.
The window function LAG(), along with a summation using ORDER BY ROWS BETWEEN over the "changes", should get you close enough to work with. The caveat is that ORDER BY ROWS BETWEEN can only take an integer literal. Year rollover: I guess you could create another ranking level based on mod 52 to start the count fresh, so week 53 would become year 2, week 1, not 53.

SELECT *,
       SUM(ChangedRow) OVER (ORDER BY DateKey ROWS BETWEEN 99999 PRECEDING AND CURRENT ROW)
FROM (
    SELECT DateKey, TWEEK,
           ChangedRow = CASE WHEN LAG(TWEEK) OVER (ORDER BY DateKey) <> TWEEK THEN 1 ELSE 0 END
    FROM #foobar F2
) AS DETAIL
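The LAG-then-running-sum trick reads clearly in Python, as a sketch: flag each row where TWEEK differs from the previous row, then keep a cumulative sum of the flags. One detail to watch in the SQL version: LAG() returns NULL on the first row, so ChangedRow starts at 0 there and the running count starts at 0 rather than 1; this sketch treats the first row as a change so the output matches the sample CumWEEK column.

```python
rows = [  # (DateKey, TWEEK) ordered by DateKey; two days from each sample week
    (20150630, 1), (20150701, 1), (20150707, 2), (20150708, 2),
    (20150714, 1), (20150715, 1), (20150721, 2), (20150722, 2),
]

cum, prev_tweek, out = 0, None, []
for date_key, tweek in rows:
    if tweek != prev_tweek:   # LAG(TWEEK) <> TWEEK  ->  ChangedRow = 1
        cum += 1
    out.append((date_key, tweek, cum))
    prev_tweek = tweek
print(out)  # running count: 1,1,2,2,3,3,4,4
```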
Some minutes ago I answered a different question; in a way this is a similar question to https://stackoverflow.com/a/31303395/5089204. The idea is roughly to create a table of running numbers and find the weeks with modulo 7. This you could use as grouping in an OVER clause...

EDIT: Example

CREATE FUNCTION dbo.RunningNumber(@Counter AS INT)
RETURNS TABLE
AS RETURN
    SELECT TOP (@Counter) ROW_NUMBER() OVER (ORDER BY o.object_id) AS RunningNumber
    FROM sys.objects AS o; --take any large table here...
GO

SELECT 'test', CAST(numbers.RunningNumber / 7 AS INT)
FROM dbo.RunningNumber(100) AS numbers

Dividing by 7 "as INT" offers a quite nice grouping criterion. Hope this helps...
How to have GROUP BY and COUNT include zero sums?
I have SQL like this (where $ytoday is 5 days ago):

$sql = 'SELECT Count(*), created_at
        FROM People
        WHERE created_at >= "'. $ytoday .'"
        GROUP BY DATE(created_at)';

I want this to return a value for every day, so it would return 5 results in this case (5 days ago until today). But say Count(*) is 0 for yesterday; instead of returning a zero, it doesn't return any data at all for that date. How can I change this SQLite query so it also returns data that has a count of 0?
Without convoluted (in my opinion) queries, your output data set won't include dates that don't exist in your input data set. This means that you need a data set with the 5 days to join on. The simple version would be to create a table with the 5 dates and join on that. I typically create and keep (effectively caching) a calendar table with every date I could ever need, such as from 1900-01-01 to 2099-12-31.

SELECT Calendar.calendar_date,
       Count(People.created_at)
FROM Calendar
LEFT JOIN People
       ON Calendar.calendar_date = People.created_at
WHERE Calendar.calendar_date >= '2012-05-01'
GROUP BY Calendar.calendar_date
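The calendar-table pattern works as-is in SQLite, which the question uses. Here is a runnable sketch via Python's sqlite3, with a hypothetical 5-day calendar and a few People rows; note COUNT(People.created_at) counts non-NULL values, which is what turns the unmatched calendar dates into zeros.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Calendar (calendar_date TEXT PRIMARY KEY);
CREATE TABLE People (created_at TEXT);
INSERT INTO Calendar VALUES ('2012-05-01'),('2012-05-02'),('2012-05-03'),
                            ('2012-05-04'),('2012-05-05');
INSERT INTO People VALUES ('2012-05-01'),('2012-05-01'),('2012-05-04');
""")

rows = con.execute("""
SELECT Calendar.calendar_date, COUNT(People.created_at)
FROM Calendar
LEFT JOIN People ON Calendar.calendar_date = People.created_at
WHERE Calendar.calendar_date >= '2012-05-01'
GROUP BY Calendar.calendar_date
ORDER BY Calendar.calendar_date
""").fetchall()
print(rows)  # days with no People rows show up with a count of 0
```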
You'll need to left join against a list of dates. You can either create a table with the dates you need in it, or you can take the dynamic approach I outlined here: generate days from date range