Context:
I am working with a complicated schema and needed many CTEs and joins to get to this point. This is a watered-down version that uses completely different source data and a different example to illustrate my point (for data anonymity). Hopefully it provides enough of a snapshot.
Data Overview:
I have a service which generates a production forecast looking ahead 30 days. The forecast is generated for each facility, for each shift (morning/afternoon). Each forecast produced covers all shifts (morning/afternoon/evening) so they share a common generation_id but different forecast_profile_key.
What I am trying to do: I want to find the SUM of the forecast error for a given forecast generation constrained by a dynamic date range based on whether the date is a weekday or weekend. The SUM must be grouped only on similar IDs.
Basically, the temp table provides one record per facility per date per shift with the forecast error. I want to SUM the historical error dynamically for a facility/shift/date based on whether the date is a weekday or a weekend, and only SUM the error where the IDs match up. (Hope that makes sense!)
Specifics: I want to find the SUM grouped by 'week_part_grouping', 'forecast_profile_key', 'forecast_profile' and 'forecast_generation_id'. The part I am struggling with is that I only want to SUM the error dynamically based on date: (a) if the date is a weekday, I want to SUM the error from up to the 5 most recent days in a 7-day lookback period, or (b) if the date is a weekend, I want to SUM the error from up to the 3 most recent days in a 16-day lookback period.
Ideally, the result would have an extra column, 'total_forecast_error_in_lookback_range'.
Specific examples:
For 'facility_a', '2020-11-22' is a weekend. The lookback range is 16 days, so any date between '2020-11-21' and '2020-11-05' is eligible. The 3 most recent weekend dates would be '2020-11-21', '2020-11-15' and '2020-11-14'. Therefore, the sum of error would be 2000+3250+1050.
For 'facility_a', '2020-11-20' is a weekday. The lookback range is 7 days, so any date between '2020-11-19' and '2020-11-13' is eligible. That works out to '2020-11-19' through '2020-11-16', plus '2020-11-13'.
For 'facility_b', notice there is a change in the 'forecast_generation_id'. So the error for '2020-11-20' would only be 4565.
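To make the rule concrete, here is a minimal Python sketch of the lookback logic. It is only an illustration of the intended result, not the SQL being asked for. It assumes the window runs from N days before the target date up to the day before it, and that only rows with the same week part and generation ID are eligible. The function name `lookback_error_sum` and the tuple layout are made up for this sketch; the rows are a subset of the facility_a sample data.

```python
from datetime import date, timedelta

# (date_actuals, week_part_grouping, forecast_generation_id, forecast_error)
# Subset of the facility_a / morning sample rows.
ROWS = [
    (date(2020, 11, 22), "weekend", "6809dea6", 1000),
    (date(2020, 11, 21), "weekend", "6809dea6", 2000),
    (date(2020, 11, 20), "weekday", "6809dea6", 3000),
    (date(2020, 11, 19), "weekday", "6809dea6", 2500),
    (date(2020, 11, 18), "weekday", "6809dea6", 1200),
    (date(2020, 11, 17), "weekday", "6809dea6", 5000),
    (date(2020, 11, 16), "weekday", "6809dea6", 4400),
    (date(2020, 11, 15), "weekend", "6809dea6", 3250),
    (date(2020, 11, 14), "weekend", "6809dea6", 1050),
    (date(2020, 11, 13), "weekday", "6809dea6", 2450),
]

# week part -> (lookback window in days, maximum number of dates to sum)
RULES = {"weekday": (7, 5), "weekend": (16, 3)}


def lookback_error_sum(rows, target):
    """Sum the error of the most recent eligible dates before `target`."""
    _, week_part, generation_id, _ = next(r for r in rows if r[0] == target)
    window_days, max_dates = RULES[week_part]
    earliest = target - timedelta(days=window_days)
    eligible = [
        r for r in rows
        if earliest <= r[0] < target   # assumption: strictly before the target date
        and r[1] == week_part          # same weekday/weekend grouping
        and r[2] == generation_id      # same forecast generation
    ]
    eligible.sort(key=lambda r: r[0], reverse=True)  # most recent first
    return sum(r[3] for r in eligible[:max_dates])


weekend_total = lookback_error_sum(ROWS, date(2020, 11, 22))
weekday_total = lookback_error_sum(ROWS, date(2020, 11, 20))
print(weekend_total, weekday_total)
```

For '2020-11-22' this picks the same three dates as the weekend example (2000+3250+1050), and for '2020-11-20' it picks the five weekday dates listed above.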
What I have tried: I'll confess to not being quite sure how to break down this portion. I did consider a CASE statement on the week_part but then got into a nested mess. I considered using a RANK window function but didn't make much progress, as I was unsure how to implement the dynamic lookback component. I also thought about using LISTAGG to get all the dates and then doing a REGEXP wildcard lookup, but that would be very slow.
I am seeking pointers on how to achieve this in SQL. I don't know if I am missing something from my toolkit that would let me break this down into something I can implement.
DROP TABLE IF EXISTS seventh__error_calc;
create temporary table seventh__error_calc
(
facility_name varchar,
shift varchar,
date_actuals date,
week_part_grouping varchar,
forecast_profile_key varchar,
forecast_profile_id varchar,
forecast_generation_id varchar,
count_dates_in_forecast bigint,
forecast_error bigint
);
Insert into seventh__error_calc
VALUES
('facility_a','morning','2020-11-22','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','1000'),
('facility_a','morning','2020-11-21','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2000'),
('facility_a','morning','2020-11-20','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','3000'),
('facility_a','morning','2020-11-19','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2500'),
('facility_a','morning','2020-11-18','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','1200'),
('facility_a','morning','2020-11-17','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','5000'),
('facility_a','morning','2020-11-16','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','4400'),
('facility_a','morning','2020-11-15','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','3250'),
('facility_a','morning','2020-11-14','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','1050'),
('facility_a','morning','2020-11-13','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-12','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-11','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-10','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-09','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-08','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_b','morning','2020-11-22','weekend','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','3400'),
('facility_b','morning','2020-11-21','weekend','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','2800'),
('facility_b','morning','2020-11-20','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','3687'),
('facility_b','morning','2020-11-19','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','4565'),
('facility_b','morning','2020-11-18','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','1262'),
('facility_b','morning','2020-11-17','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','8765'),
('facility_b','morning','2020-11-16','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','5678'),
('facility_b','morning','2020-11-15','weekend','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','2893'),
('facility_b','morning','2020-11-14','weekend','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','1928'),
('facility_b','morning','2020-11-13','weekday','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','4736')
;
SELECT *
FROM seventh__error_calc
This achieved what I was trying to do. There were two learning points here:
1. Self joins. I've never used one before but can now see why they are powerful!
2. Using a CASE expression in the WHERE clause.
Hope this might help someone else some day!
select facility_name,
       forecast_profile_key,
       forecast_profile_id,
       shift,
       date_actuals,
       week_part_grouping,
       forecast_generation_id,
       sum(forecast_error) forecast_err_calc
from (
    -- self join: a is the row being scored, b supplies the candidate rows in its lookback window
    select rank() over (partition by forecast_profile_id, forecast_profile_key, facility_name, a.date_actuals
                        order by b.date_actuals desc) rnk,
           a.facility_name, a.forecast_profile_key, a.forecast_profile_id, a.shift, a.date_actuals,
           a.week_part_grouping, a.forecast_generation_id, b.forecast_error
    from seventh__error_calc a
    join seventh__error_calc b
        using (facility_name, forecast_profile_key, forecast_profile_id, week_part_grouping, forecast_generation_id)
    -- the window's upper bound includes a.date_actuals itself; use a.date_actuals - 1 to exclude it
    where case when a.week_part_grouping = 'weekend' then b.date_actuals between a.date_actuals - 16 and a.date_actuals
               when a.week_part_grouping = 'weekday' then b.date_actuals between a.date_actuals - 7 and a.date_actuals
          end
) src
-- keep only the most recent rows: 3 for weekends, 5 for weekdays
where case when week_part_grouping = 'weekend' then rnk < 4
           when week_part_grouping = 'weekday' then rnk < 6
      end
group by facility_name,
         forecast_profile_key,
         forecast_profile_id,
         shift,
         date_actuals,
         week_part_grouping,
         forecast_generation_id;
I am using Postgres, and I recently found that my code makes too many round trips to the database.
What I am doing is basically getting data from a table day by day, because I have to look for changes on a daily basis, but the function that does this job is only called once a month.
An example of my table
Amount
Id | Itemid | Amount | Date
1 | 2 | 50 | 20-5-20
Now this table can be updated to add items at any point in time, and I have to see the total amount, that is SUM(Amount), every day.
But here's the catch: I have to add interest to each day's amount at a rate of 5%.
So I can't just call the function once; I have to look at its value every day.
For example, if I add an item of $50 on the 1st of May, then the interest on that day is 5/100*50.
If I add another item worth $50 on the 5th of May, the interest on the 5th is 5/100*100.
But prior to the 5th, the interest was on only $50, so simply using SUM(Amount)*5/100 is wrong.
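To illustrate why a single SUM(Amount)*5/100 gives the wrong figure, here is a small Python sketch using the two deposits from the example (50 on the 1st of May, 50 on the 5th). It assumes the interest for a day is 5% of the balance held on that day, so the 5th is charged on the combined 100. The names `deposits` and `daily_interest` are made up for illustration.

```python
from datetime import date, timedelta

RATE = 0.05  # 5%, applied to each day's balance (assumption from the example)

# Deposits from the example: 50 on 1 May and another 50 on 5 May.
deposits = {date(2020, 5, 1): 50, date(2020, 5, 5): 50}


def daily_interest(deposits, start, end):
    """Return {day: interest}, where interest is RATE * balance on that day."""
    result = {}
    balance = 0
    day = start
    while day <= end:
        balance += deposits.get(day, 0)  # the balance only grows on deposit days
        result[day] = balance * RATE
        day += timedelta(days=1)
    return result


interest = daily_interest(deposits, date(2020, 5, 1), date(2020, 5, 5))
for day, value in interest.items():
    print(day, value)
```

Days 1 to 4 are charged on 50 and only day 5 on 100, which is why the per-day balance is needed rather than one overall sum.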
Another issue is that the dates are stored as timestamps, and I need to group by the date part of the timestamp. If I group by the timestamp itself, I get multiple rows for the same date, which I want to avoid when taking the sum.
So if there are two entries on the same date but at different hours, the query should ideally sum them up under a single date.
Example
Amount Table
Date | Amount
2020-5-5 20:8:8 | 100
2020-5-5 7:8:8 | 100
Result should be
Amount Table
Date | Amount
2020-5-5 | 200
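The grouping described here amounts to dropping the time part before summing. A minimal Python sketch with the two example rows follows; the variable names are made up for illustration. In PostgreSQL the equivalent step would be casting the timestamp column to a date before grouping.

```python
from collections import defaultdict
from datetime import datetime

# The two rows from the example: same date, different times.
rows = [
    (datetime(2020, 5, 5, 20, 8, 8), 100),
    (datetime(2020, 5, 5, 7, 8, 8), 100),
]

totals = defaultdict(int)
for ts, amount in rows:
    totals[ts.date()] += amount  # drop the time part before grouping

for day, total in sorted(totals.items()):
    print(day, total)
```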
My current code:
for i in numberofdaysinthemonth:
    amount = amount + session.query(func.sum(Amount.Amount)).filter(Amount.date<current_date).scalar() * 5/100
I want a query that gets all these values according to dates, for example
date | Sum of amount till that date
20-5-20 | 50
20-6-20 | 100
Any ideas about what I should do to avoid a loop that runs 30 times, given that the function is called only once a month?
I am supposed to get all this data in a table, day by day, aggregated as the sum of amount for each day.
That is a simple "running total"
select "date",
sum(amount) over (order by "date") as amount_til_date
from the_table
order by "date";
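For readers less familiar with window functions, this is roughly what the running total computes, sketched in Python. The rows reuse the dates from the question's expected output; the second deposit of 50 is an assumption inferred from the totals shown there, and the variable names are made up.

```python
from itertools import accumulate

# (date, amount) rows, already summed per day and sorted by date.
daily = [("20-5-20", 50), ("20-6-20", 50)]

# accumulate() yields the cumulative sum, like sum(amount) over (order by "date").
running = list(accumulate(amount for _, amount in daily))
amount_til_date = [(day, total) for (day, _), total in zip(daily, running)]

for day, total in amount_til_date:
    print(day, total)
```

The difference is that the SQL version produces every row in a single query, so no per-day loop in the application is needed.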
If you need the amount per itemid
select "date",
sum(amount) over (partition by itemid order by "date") as amount_til_date
from the_table
order by "date";
If you also need to calculate the "compound interest rate" up to that day, you can do that as well:
select itemid,
       "date",
       sum(amount) over (partition by itemid order by "date") as amount_til_date,
       sum(amount) over (partition by itemid order by "date") * power(1.05, count(*) over (partition by itemid order by "date")) as compound_interest
from the_table
order by "date";
To get that for a specific month, add a WHERE clause:
where "date" >= date '2020-06-01'
and "date" < date '2020-07-01'
In general, to avoid round trips between the application and the database, application code must be moved into the database as stored code (stored procedures and stored functions) written in a procedural language. This approach is sometimes called "thick database" in commercial databases such as Oracle Database.
PostgreSQL's default procedural language is PL/pgSQL, but you can also use Java, Perl, Python or JavaScript through PostgreSQL extensions that you would need to install.
I am trying to find employees who worked during a specific time period, along with the hours they worked during that period. My query has to join the employee table, which has the employee ID as its primary key and uses effective_date and expiration_date to bound the employee's position, to the timekeeping table, which has a pay period ID as its primary key and also uses effective and expiration dates.
The problem with the expiration date in the employee table is that if the employee is currently employed, the date is '12/31/9999'. I am looking for employees who worked in a certain year, as well as current employees, and the hours they worked, separated by pay period.
When I take this condition into account in the WHERE clause with an OR, I get duplicates: employees who worked in the time period I am looking for and beyond, as well as duplicate records for the '12/31/9999' rows and the valid employee in that time period.
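One common way to express "the row was active at some point in the period" is an interval-overlap test: the row starts on or before the period ends, and ends on or after the period starts. With that test the open-ended '12/31/9999' date needs no special case. The Python sketch below only illustrates the predicate; the function name and the three sample rows are hypothetical, and it is not a claim about what the final query should be.

```python
from datetime import date

OPEN_ENDED = date(9999, 12, 31)  # sentinel used for current employees


def overlaps(eff_dt, exp_dt, period_start, period_end):
    """True if [eff_dt, exp_dt] shares at least one day with the period."""
    return eff_dt <= period_end and exp_dt >= period_start


year_start, year_end = date(2012, 1, 1), date(2012, 12, 31)

# Hypothetical position rows: (eff_dt, exp_dt)
ended_in_2012 = (date(2012, 4, 1), date(2012, 10, 15))
current_since_2011 = (date(2011, 6, 1), OPEN_ENDED)
ended_before_2012 = (date(2010, 1, 1), date(2011, 12, 31))

results = [
    overlaps(*ended_in_2012, year_start, year_end),
    overlaps(*current_since_2011, year_start, year_end),
    overlaps(*ended_before_2012, year_start, year_end),
]
print(results)
```

The same test could be applied to the pay period's own effective and expiration dates, so that only pay periods falling in the year of interest are kept.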
This is the query I am using:
SELECT
J.EMPL_ID
,J.DEPT
,J.UNIT
,J.LAST_NM
,J.FIRST_NM
,J.TITLE
,J.EFF_DT
,J.EXP_DT
,TM1.PPRD_ID
,TM1.EMPL_ID
,TM1.EXP_DT
,TM1.EFF_DT
--PULLING IN THE DAILY HRS WORKED
,(SELECT NVL(SUM(((to_number(SUBSTR(TI.DAY_1, 1
,INSTR(TI.DAY_1, ':', 1, 1)-1),99))*60)+
(TO_NUMBER(SUBSTR(TI.DAY_1
,INSTR(TI.DAY_1,':', -1, 1)+1),99))),0)
FROM PPRD_LINE TI
WHERE
TI.PPRD_ID=TM1.PPRD_ID
) "DAY1"
---AND THE REST OF THE DAYS FOR THE WORK PERIOD
FROM PPRD_LINE TM1
JOIN EMPL J ON TM1.EMPL_ID=J.EMPL_ID
WHERE
J.EMPL_ID='some id number' --for test purposes, will need to break down to depts-
AND
J.EFF_DT >=TO_DATE('1/1/2012','MM/DD/YYYY')
AND
(
J.EXP_DT<=TO_DATE('12/31/2012','MM/DD/YYYY')
OR
J.EXP_DT=TO_DATE('12/31/9999','MM/DD/YYYY') --I think the problem might be here???
)
GROUP BY
J.EMPL_ID
,J.DEPT
,J.UNIT
,J.LAST_NM
,J.FIRST_NM
,J.TITLE
,J.EFF_DT
,J.EXP_DT
,TM1.PPRD_ID
,TM1.EMPL_ID
,TM1.DOC_ID
,TM1.EXP_DT
,TM1.EFF_DT
ORDER BY
J.EFF_DT
,TM1.EFF_DT
,TM1.EXP_DT
I'm pretty sure I'm missing something simple but at this point I can't see the forest for the trees. Can anyone out there point me in the right direction?
An example of the duplicate records, for employee 1 for the year 2012:
Empl_ID | Dept | Unit | Last | First | Title | Eff Date | Exp Date | PPRD ID | Empl_ID | Exp Date_1 | Eff Date_1
00001 | 04 | 012 | Babbage | Charles | Somejob | 4/1/2012 | 10/15/2012 | 0407123 | 00001 | 4/15/2012 | 4/1/2012
This record repeats 3 times and goes past the pay periods in 2012 to the current pay period in 2013.
I use the subquery to convert the times so that I can add hours and minutes together and compare them down the line.
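For reference, the conversion that the subquery performs can be sketched in Python as follows. This assumes the DAY_1 column holds strings of the form 'HH:MM' (inferred from the SUBSTR/INSTR calls in the query); the function name and sample values are hypothetical.

```python
def to_minutes(value):
    """Convert an 'HH:MM' string to total minutes."""
    hours, _, minutes = value.partition(":")  # split at the colon
    return int(hours) * 60 + int(minutes)


# Hypothetical DAY_1 values for one pay period.
day_1_values = ["8:00", "7:30", "0:45"]
total_minutes = sum(to_minutes(v) for v in day_1_values)
print(total_minutes)
```

Working in whole minutes avoids having to add hours and minutes separately before comparing totals.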
I'm going to take a wild guess and see if this is what you want. Remember that I could not test it, so there may be typos.
Whether or not this is what you want, you should read the FAQ on how to ask good questions. If this is what you were after, your question should have been answered within about 10 minutes; because it was not clear what you were asking, no one could answer it.
You should include inputs and outputs and EXPECTED output in your question. The data you gave was not the output of the select statement (it did not have the DAY1 column).
SELECT
J.EMPL_ID
,J.DEPT
,J.UNIT
,J.LAST_NM
,J.FIRST_NM
,J.TITLE
,J.EFF_DT
,J.EXP_DT
,TM1.PPRD_ID
,TM1.EMPL_ID
-- ,TM1.EXP_DT Can't have these if you are summing across multiple records.
-- ,TM1.EFF_DT
--PULLING IN THE DAILY HRS WORKED
,NVL(SUM(((to_number(SUBSTR(TM1.DAY_1, 1,INSTR(TM1.DAY_1, ':', 1, 1)-1),99))*60)+
(TO_NUMBER(SUBSTR(TM1.DAY_1,INSTR(TM1.DAY_1,':', -1, 1)+1),99))),0)
"DAY1"
---AND THE REST OF THE DAYS FOR THE WORK PERIOD
FROM PPRD_LINE TM1
JOIN EMPL J ON TM1.EMPL_ID=J.EMPL_ID
WHERE
J.EMPL_ID='some id number' --for test purposes, will need to break down to depts-
AND J.EFF_DT >=TO_DATE('1/1/2012','MM/DD/YYYY')
AND(J.EXP_DT<=TO_DATE('12/31/2012','MM/DD/YYYY') OR J.EXP_DT=TO_DATE('12/31/9999','MM/DD/YYYY'))
GROUP BY
J.EMPL_ID
,J.DEPT
,J.UNIT
,J.LAST_NM
,J.FIRST_NM
,J.TITLE
-- J.EFF_DT and J.EXP_DT are selected, so they must be grouped as well
,J.EFF_DT
,J.EXP_DT
,TM1.PPRD_ID
,TM1.EMPL_ID
,TM1.DOC_ID
ORDER BY
MIN(J.EFF_DT)
,MAX(TM1.EFF_DT)
,MAX(TM1.EXP_DT)