Our dataset is fundamentally joining a set of dates (weeks from the current week into the past) to a set of sections based on whether those sections started on or before and ended on or after that week. While originally this query gave us the results we expected, this week it began providing us incorrect results. After a bunch of tinkering, we discovered that if we changed the query to a LEFT JOIN and then filtered the query using a WHERE clause, it gave us correct results again.
What's the difference? Why does one work and the other doesn't? (Bonus points: why did the original query work for weeks before suddenly experiencing this error?) Performing the same inner join on Redshift delivers correct results, so it seems to be a Snowflake nuance that we don't understand.
Original query:
WITH week_list AS
(
SELECT DATEADD(week, -4, DATE_TRUNC(week, CURRENT_DATE())) AS week_value
UNION ALL
SELECT DATEADD(week, 1, week_value)
FROM week_list
WHERE DATEADD(week, 1, week_value) < CURRENT_DATE()
),
active_sections_per_week AS
(
SELECT
wl.week_value, s.id section_id
FROM week_list wl
JOIN schema.sections s ON wl.week_value >= DATE_TRUNC(week, s.starts_at)
AND wl.week_value <= DATE_TRUNC(week, s.ends_at)
)
SELECT
aspw.week_value,
COUNT(DISTINCT aspw.section_id) count_sections
FROM
active_sections_per_week aspw
GROUP BY 1
ORDER BY 1 DESC
Results: One row, dated 2019-12-30 (4 weeks ago). No data for the past three weeks.
Note: If you adjust the DATEADD in the first CTE, whatever is the first date returned will always seem to join successfully. This behavior started only within the last week--previously, this query provided the expected number of rows (in other words, the number of weeks specified in that first DATEADD).
"Fixed" query:
WITH week_list AS
(
SELECT DATEADD(week, -4, DATE_TRUNC(week, CURRENT_DATE())) AS week_value
UNION ALL
SELECT DATEADD(week, 1, week_value)
FROM week_list
WHERE DATEADD(week, 1, week_value) < CURRENT_DATE()
),
active_sections_per_week AS
(
SELECT wl.week_value, s.id section_id
FROM week_list wl
LEFT JOIN schema.sections s ON wl.week_value >= DATE_TRUNC(week, s.starts_at)
AND wl.week_value <= DATE_TRUNC(week, s.ends_at)
WHERE s.id IS NOT NULL
)
SELECT aspw.week_value, COUNT(DISTINCT aspw.section_id) count_sections
FROM active_sections_per_week aspw
GROUP BY 1
ORDER BY 1 DESC
Results: returns four rows, weeks dated 2019-12-30 to 2020-01-20, with appropriate section counts.
This is a recursive CTE on "week_list". Redshift does not support recursive CTEs.
Snowflake does support recursive CTEs, which would explain the difference in behavior.
It's hard to test this without the underlying data. If you're getting correct results in Redshift, then chances are you do not need or want a recursive CTE. You can modify it so that "week_list" does not reference itself.
As for why it worked before, it's likely the table state and recursive CTE worked only under special cases. When CURRENT_DATE() advanced, it took it out of that special case. Also, the inner join and left outer join where s.id IS NOT NULL would be equivalent if not in a recursive CTE.
You can read more about recursive CTEs here:
https://docs.snowflake.net/manuals/user-guide/queries-cte.html#recursive-ctes-and-hierarchical-data
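To illustrate the equivalence claimed above (an inner join versus a LEFT JOIN filtered on s.id IS NOT NULL), here is a minimal sketch using Python's built-in sqlite3 module rather than Snowflake. The sections rows are invented sample data, the week list is pinned to fixed dates instead of CURRENT_DATE(), and DATE_TRUNC is omitted because the sample start dates are already Mondays. It only shows what standard SQL semantics predict; it does not reproduce the Snowflake behavior described in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sections (id INTEGER PRIMARY KEY, starts_at TEXT, ends_at TEXT);
    -- Hypothetical sample data; every starts_at is a Monday.
    INSERT INTO sections VALUES
        (1, '2019-12-30', '2020-01-12'),
        (2, '2020-01-06', '2020-01-26'),
        (3, '2020-01-20', '2020-02-09');
""")

# Recursive week list, pinned to the dates from the question.
WEEK_LIST = """
WITH RECURSIVE week_list(week_value) AS (
    SELECT '2019-12-30'
    UNION ALL
    SELECT date(week_value, '+7 days')
    FROM week_list
    WHERE date(week_value, '+7 days') < '2020-01-27'
)
"""

inner_sql = WEEK_LIST + """
SELECT wl.week_value, COUNT(DISTINCT s.id) AS count_sections
FROM week_list wl
JOIN sections s
  ON wl.week_value >= s.starts_at AND wl.week_value <= s.ends_at
GROUP BY 1
ORDER BY 1 DESC
"""

left_sql = WEEK_LIST + """
SELECT wl.week_value, COUNT(DISTINCT s.id) AS count_sections
FROM week_list wl
LEFT JOIN sections s
  ON wl.week_value >= s.starts_at AND wl.week_value <= s.ends_at
WHERE s.id IS NOT NULL
GROUP BY 1
ORDER BY 1 DESC
"""

inner_rows = conn.execute(inner_sql).fetchall()
left_rows = conn.execute(left_sql).fetchall()
print(inner_rows)
print(inner_rows == left_rows)
```

If a database returns different rows for these two forms on the same data, that points at the engine or the optimizer rather than at the join logic itself.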
The recursive CTE can be avoided if the -4 weeks is a constant, with this code:
WITH week_list AS (
SELECT DATEADD(week, column1, DATE_TRUNC(week, CURRENT_DATE())) AS week_value
FROM VALUES (-4),(-3),(-2),(-1),(0)
)
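One detail worth checking before swapping the two versions: the fixed offsets always include the current week (offset 0), while the recursive version only adds a week when it is strictly before CURRENT_DATE(). A small Python sketch of the date arithmetic, assuming weeks start on Monday (which may differ from your session's week-start setting); the function names are made up for illustration:

```python
from datetime import date, timedelta

def week_start(d):
    """Monday of the week containing d (assumed DATE_TRUNC(week, ...) behavior)."""
    return d - timedelta(days=d.weekday())

def recursive_style_weeks(today, weeks_back=4):
    """Mimic the recursive CTE: anchor row, then add a week while the next value is < today."""
    weeks = [week_start(today) - timedelta(weeks=weeks_back)]
    while weeks[-1] + timedelta(weeks=1) < today:
        weeks.append(weeks[-1] + timedelta(weeks=1))
    return weeks

def fixed_offset_weeks(today, offsets=(-4, -3, -2, -1, 0)):
    """Mimic the VALUES-based list: one row per constant offset."""
    return [week_start(today) + timedelta(weeks=o) for o in offsets]

monday = date(2020, 1, 27)
wednesday = date(2020, 1, 29)

# On a Monday the recursive version stops one week short of the offsets list.
print(len(recursive_style_weeks(monday)), len(fixed_offset_weeks(monday)))   # 4 5
# On any other weekday the two lists match.
print(recursive_style_weeks(wednesday) == fixed_offset_weeks(wednesday))     # True
```

The four rows reported in the question (2019-12-30 to 2020-01-20) are consistent with the query having been run on a Monday, so the VALUES version may return one extra row for the current week on that day.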
With the JOIN, Snowflake will move the filters higher in the execution stack, and you might have found a bug. With the LEFT JOIN, even though it has an equivalent WHERE clause, it is most likely avoiding the aggressive, broken optimization.
There was a software release last night for us, but we are on an Enterprise account, so you might have been upgraded 2 days prior. This release had a number of bugs that impacted us, so we had it rolled back (for us).
Thank you for all of the feedback! The good news is you all helped me get to a solution that I think I am satisfied with. I have also followed up with Snowflake so they can investigate this behavior and see if it was user error on my part due to not understanding how recursive CTEs process, or whether it is possibly a bug introduced in a recent release.
Here's what I found: while recursion works for the use case I was applying it to (generating a list of dates based on CURRENT_DATE), it is not strictly necessary. Since we want a list of dates, I could just as easily generate a table and use the row numbers to perform the DATEADD adjustments.
It looks like this:
SELECT DATEADD(week, '-' || ROW_NUMBER() OVER (ORDER BY NULL),
DATEADD(week, 1, DATE_TRUNC(week, CURRENT_DATE()))) AS week_value
FROM table (generator(rowcount => 200))
One of the big benefits to this approach is I am no longer limited by the MAX_RECURSIONS setting in Snowflake (which is set to 100 by default). Since I am using this data to create graphs of activity over time, having 200 values gives me more than three years of history rather than just shy of 2 years of history. I also don't have to contact my Snowflake rep if I want to expand it.
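For anyone checking the arithmetic, here is a rough Python equivalent of the row-number approach. It assumes weeks start on Monday, and generated_weeks is just an illustrative name, not anything from Snowflake:

```python
from datetime import date, timedelta

def generated_weeks(today, rowcount=200):
    """Week n (1-based row number) is next week's Monday minus n weeks."""
    next_week = today - timedelta(days=today.weekday()) + timedelta(weeks=1)
    return [next_week - timedelta(weeks=n) for n in range(1, rowcount + 1)]

weeks = generated_weeks(date(2020, 1, 29))
span_years = (weeks[0] - weeks[-1]).days / 365.25

print(weeks[0])              # current week's Monday
print(round(span_years, 2))  # history covered by 200 rows, in years
```

Row 1 lands on the current week and row 200 lands 199 weeks earlier, which is where the "more than three years" figure comes from; with 100 rows the same calculation gives just under two years.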
Changing the week_list CTE to this non-recursive approach seems to fix whatever issue was causing the INNER JOIN to perform incorrectly. We still don't understand why the recursive CTE seemed to work for several weeks and then suddenly started misbehaving, but if Snowflake can shed light on that via our support ticket, I will double back here to provide an update. Thank you all for your help and guidance!
I need it to give me a total of 0 for weeks 33 - 39, but I'm really bad at joining 3 tables and I can't figure it out.
Right now it only gives me an answer for dates for which there are actual records in the tracker_weld_table.
SELECT SUM(tracker_parts_archive.weight),
WEEK(mycal.dt) as week
FROM
tracker_parts_archive, tracker_weld_archive
RIGHT JOIN
(SELECT dt FROM calendar_table WHERE dt >= '2018-7-1' AND dt <= '2018-10-1') as mycal
ON
weld_worker = '133'AND date(weld_dateandtime) = mycal.dt
WHERE
tracker_weld_archive.tracker_partsID = tracker_parts_archive.id
GROUP BY week
I think you are trying for something like this:
SELECT WEEK(c.dt) as week, COALESCE(SUM(tp.weight), 0)
FROM calendar_table c left join
tracker_weld_archive tw
on date(tw.weld_dateandtime) = c.dt left join
tracker_parts_archive tp
on tw.tracker_partsID = tp.id and tp.weld_worker = 133
WHERE c.dt >= '2018-07-01' AND c.dt <= '2018-10-01'
GROUP BY week
ORDER BY week;
Notes:
You want to keep all (matching) rows in the calendar table, so it should be first.
All subsequent joins should be LEFT JOINs.
Never use commas in the FROM clause. Always use proper, explicit, standard JOIN syntax.
Write out the full proper date constant -- YYYY-MM-DD. This is an ISO-standard format.
I am guessing that weld_worker is a number, so single quotes are not needed for the comparison.
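To see the calendar-first LEFT JOIN pattern produce zeros, here is a small self-contained sketch using Python's sqlite3 module instead of MySQL. The table layouts and rows are guesses made up for the demo (in particular, weld_worker is assumed to live in tracker_weld_archive), and SQLite's strftime('%W', ...) stands in for WEEK(), so the week numbers may not match MySQL's.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE calendar_table (dt TEXT PRIMARY KEY);
    CREATE TABLE tracker_parts_archive (id INTEGER PRIMARY KEY, weight REAL);
    CREATE TABLE tracker_weld_archive (
        tracker_partsID INTEGER, weld_worker INTEGER, weld_dateandtime TEXT);
""")

# Three full weeks of calendar rows, starting Monday 2018-07-02.
start = date(2018, 7, 2)
conn.executemany(
    "INSERT INTO calendar_table VALUES (?)",
    [((start + timedelta(days=i)).isoformat(),) for i in range(21)],
)
conn.executemany(
    "INSERT INTO tracker_parts_archive VALUES (?, ?)",
    [(1, 10.0), (2, 5.5), (3, 7.0)],
)
conn.executemany(
    "INSERT INTO tracker_weld_archive VALUES (?, ?, ?)",
    [
        (1, 133, "2018-07-03 08:15:00"),
        (2, 133, "2018-07-04 09:00:00"),
        (3, 999, "2018-07-10 10:00:00"),  # a different worker
    ],
)

rows = conn.execute("""
    SELECT strftime('%W', c.dt) AS week_no,
           COALESCE(SUM(tp.weight), 0) AS total_weight
    FROM calendar_table c
    LEFT JOIN tracker_weld_archive tw
           ON date(tw.weld_dateandtime) = c.dt
          AND tw.weld_worker = 133      -- filter in ON, not WHERE, to keep empty weeks
    LEFT JOIN tracker_parts_archive tp
           ON tp.id = tw.tracker_partsID
    WHERE c.dt >= '2018-07-02' AND c.dt < '2018-07-23'
    GROUP BY week_no
    ORDER BY week_no
""").fetchall()

totals = [total for _, total in rows]
print(rows)
```

The worker filter sits in the ON clause on purpose: moving it to the WHERE clause would throw away the calendar rows that have no matching weld, which is what made the empty weeks disappear in the first place.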
First, let's start with understanding what you want. You want totals per week. This means there will be a "GROUP BY" clause (as with any MIN(), MAX(), AVG(), SUM(), COUNT(), etc. aggregates). What is the GROUP BY basis? In this scenario, you want per week. That leads to the next part: you want a specific date range, qualified by your calendar table.
I would start with WHAT filtering criteria come first. Also, ALWAYS TRY to qualify every column as table (or alias).column in your queries so anyone after you knows where the columns are coming from, especially when multiple tables are involved. In this case, "ct" is the ALIAS for "calendar_table".
SELECT
ct.dt
from
calendar_table ct
where
ct.dt >= '2018-07-01'
AND ct.dt <= '2018-10-01'
Now, the above date looks to be INCLUSIVE of October 1 and looks like you are trying to generate a quarterly sum from July, Aug, Sept. I would change to LESS than Oct 1.
Now, your calendar has many days and you want it grouped by week, so the WEEK() function gets you that distinct reference without explicitly checking every date. Also, try NOT to use reserved keywords as final column names... makes for confusion later on sometimes.
I have aliased the column name as "WeekBasis". Here, I did a COUNT(*) just to show the total days and the group by showing it in context.
SELECT
WEEK( ct.dt ) WeekBasis,
MIN( ct.dt ) as FirstDayOfThisWeek,
MAX( ct.dt ) as LastDayOfThisWeek,
COUNT(*) as DaysInThisWeek
from
calendar_table ct
where
ct.dt >= '2018-07-01'
AND ct.dt <= '2018-10-01'
group by
WEEK( ct.dt )
So, at this point, we have 1 record per week within the date period you are concerned with, but I also grabbed the earliest and latest dates just to show other components too.
Now, let's get back to your extra tables. We know the dates in question; now we need to get the details from the other tables, which are lacking in the post. You should post critical components such as how tables are related via common / joined columns. How is tracker_parts_archive related to tracker_weld_archive?
To simplify your query, you don't even NEED your calendar table, as the welding table HAS a date field and you know your range. Just query against that directly. Note, though, that weeks with no welding entries will then not appear at all, so keep the calendar table on the left side of a LEFT JOIN if you need zero totals for those weeks. IF your worker's ID is numeric, don't add quotes around it; just leave it as a number.
SELECT
WEEK( twa.Weld_DateAndTime ) WeekBasis,
COUNT(*) WeldingEntriesDone,
SUM(tpa.weight) TotalWeight
from
tracker_weld_archive twa
JOIN tracker_parts_archive tpa
-- GUESSING on the relationship here.
-- may also be on a given date too???
-- all pieces welded by a person on a given date
ON twa.weld_worker = tpa.weld_worker
AND twa.Weld_DateAndTime = tpa.Weld_DateAndTime
where
twa.Weld_Worker = 133
AND twa.Weld_DateAndTime >= '2018-07-01'
AND twa.Weld_DateAndTime <= '2018-10-01'
group by
WEEK( twa.Weld_DateAndTime )
IF you provide the table structures AND sample data, this can be refined a bit more for you.
Okay, so I've done quite a lot of reading on the possibility of emulating the networkdays function of excel in sql, and have come to the conclusion that by far the easiest solution is to have a calendar table which will flag working days or non working days. However, due to circumstances out of my control, we don't have access to such a luxury and it's unlikely that we will any time in the near future.
Currently I have managed to bodge together what is undoubtedly a horribly inefficient query in SQL that does work - the catch is, it will only work for a single client record at a time.
SELECT O_ASSESSMENTS.ASM_ID,
O_ASSESSMENTS.ASM_START_DATE,
O_ASSESSMENTS.ASM_END_DATE,
sum(CASE
When TO_CHAR(O_ASSESSMENTS.ASM_START_DATE + rownum -1,'fmDay')
= 'Sunday' THEN 0
When TO_CHAR(O_ASSESSMENTS.ASM_START_DATE + rownum -1,'fmDay')
= 'Saturday' THEN 0
WHEN O_ASSESSMENTS.ASM_START_DATE + rownum - 1
IN ('03-01-2000','21-04-2000','24-04-2000','01-05-2000','29-05-2000','28-08-2000','25-12-2000','26-12-2000','01-01-2001','13-04-2001','16-04-2001','07-05-2001','28-05-2001','27-08-2001','25-12-2001','26-12-2001','01-01-2002','29-03-2002','01-04-2002','06-04-2002','03-06-2002','04-06-2002','26-08-2002','25-12-2002','26-12-2002','01-01-2003','18-04-2003','21-04-2003','05-05-2003','26-05-2003','25-08-2003','25-12-2003','26-12-2003','01-01-2004','09-04-2004','12-04-2004','03-05-2004','31-05-2004','30-08-2004','25-12-2004','26-12-2004','27-12-2004','28-12-2004','01-01-2005','03-01-2005','25-03-2005','28-03-2005','02-05-2005','30-05-2005','29-08-2005','27-12-2005','28-12-2005','02-01-2006','14-04-2006','17-04-2006','01-05-2006','29-05-2006','28-08-2006','25-12-2006','26-12-2006','02-01-2007','06-04-2007','09-04-2007','07-05-2007','28-05-2007','27-08-2007','25-12-2007','26-12-2007','01-01-2008','21-03-2008','24-03-2008','05-05-2008','26-05-2008','25-08-2008','25-12-2008','26-12-2008','01-01-2009','10-04-2009','13-04-2009','04-05-2009','25-05-2009','31-08-2009','25-12-2009','28-12-2009','01-01-2010','02-04-2010','05-04-2010','03-05-2010','31-05-2010','30-08-2010','24-12-2010','27-12-2010','28-12-2010','31-12-2010','03-01-2011','22-04-2011','25-04-2011','29-04-2011','02-05-2011','30-05-2011','29-08-2011','26-12-2011','27-12-2011')
THEN 0
ELSE 1
END)-1 AS Week_Day
From O_ASSESSMENTS,
ALL_OBJECTS
WHERE O_ASSESSMENTS.ASM_QSA_ID IN ('TYPE1')
AND O_ASSESSMENTS.ASM_END_DATE >= '01/01/2012'
AND O_ASSESSMENTS.ASM_ID = 'A00000'
AND ROWNUM <= O_ASSESSMENTS.ASM_END_DATE-O_ASSESSMENTS.ASM_START_DATE+1
GROUP BY
O_ASSESSMENTS.ASM_ID,
O_ASSESSMENTS.ASM_START_DATE,
O_ASSESSMENTS.ASM_END_DATE
Basically, I'm wondering if a) I should stop wasting my time on this or b) is it possible to get this to work for multiple clients? Any pointers appreciated thanks!
Edit: Further clarification - I already work out timescales using excel, but it would be ideal if we could do it in the report as the report in question is something that we would like end users to be able to run without any further manipulation.
Edit:
MarkBannister's answer works perfectly albeit slowly (though I had expected as much given it's not the preferred solution) - the challenge now lies in me integrating this into an existing report!
with
calendar_cte as (select
to_date('01-01-2000')+level-1 calendar_date,
case when to_char(to_date('01-01-2000')+level-1, 'day') in ('sunday ','saturday ') then 0 when to_date('01-01-2000')+level-1 in ('03-01-2000','21-04-2000','24-04-2000','01-05-2000','29-05-2000','28-08-2000','25-12-2000','26-12-2000','01-01-2001','13-04-2001','16-04-2001','07-05-2001','28-05-2001','27-08-2001','25-12-2001','26-12-2001','01-01-2002','29-03-2002','01-04-2002','06-04-2002','03-06-2002','04-06-2002','26-08-2002','25-12-2002','26-12-2002','01-01-2003','18-04-2003','21-04-2003','05-05-2003','26-05-2003','25-08-2003','25-12-2003','26-12-2003','01-01-2004','09-04-2004','12-04-2004','03-05-2004','31-05-2004','30-08-2004','25-12-2004','26-12-2004','27-12-2004','28-12-2004','01-01-2005','03-01-2005','25-03-2005','28-03-2005','02-05-2005','30-05-2005','29-08-2005','27-12-2005','28-12-2005','02-01-2006','14-04-2006','17-04-2006','01-05-2006','29-05-2006','28-08-2006','25-12-2006','26-12-2006','02-01-2007','06-04-2007','09-04-2007','07-05-2007','28-05-2007','27-08-2007','25-12-2007','26-12-2007','01-01-2008','21-03-2008','24-03-2008','05-05-2008','26-05-2008','25-08-2008','25-12-2008','26-12-2008','01-01-2009','10-04-2009','13-04-2009','04-05-2009','25-05-2009','31-08-2009','25-12-2009','28-12-2009','01-01-2010','02-04-2010','05-04-2010','03-05-2010','31-05-2010','30-08-2010','24-12-2010','27-12-2010','28-12-2010','31-12-2010','03-01-2011','22-04-2011','25-04-2011','29-04-2011','02-05-2011','30-05-2011','29-08-2011','26-12-2011','27-12-2011','01-01-2012','02-01-2012') then 0 else 1 end working_day
from dual
connect by level <= 1825 + sysdate - to_date('01-01-2000') )
SELECT
a.ASM_ID,
a.ASM_START_DATE,
a.ASM_END_DATE,
sum(c.working_day)-1 AS Week_Day
From
O_ASSESSMENTS a
join calendar_cte c
on c.calendar_date between a.ASM_START_DATE and a.ASM_END_DATE
WHERE a.ASM_QSA_ID IN ('TYPE1')
and a.ASM_END_DATE >= '01/01/2012'
GROUP BY
a.ASM_ID,
a.ASM_START_DATE,
a.ASM_END_DATE
There are a few ways to do this. Perhaps the simplest might be to create a CTE that produces a virtual calendar table, based on Oracle's connect by syntax, and then join it to the Assessments table, like so:
with calendar_cte as (
select to_date('01-01-2000')+level-1 calendar_date,
case when to_char(to_date('01-01-2000')+level-1, 'fmDay')
in ('Sunday','Saturday') then 0
when to_date('01-01-2000')+level-1
in ('03-01-2000','21-04-2000','24-04-2000','01-05-2000','29-05-2000','28-08-2000','25-12-2000','26-12-2000','01-01-2001','13-04-2001','16-04-2001','07-05-2001','28-05-2001','27-08-2001','25-12-2001','26-12-2001','01-01-2002','29-03-2002','01-04-2002','06-04-2002','03-06-2002','04-06-2002','26-08-2002','25-12-2002','26-12-2002','01-01-2003','18-04-2003','21-04-2003','05-05-2003','26-05-2003','25-08-2003','25-12-2003','26-12-2003','01-01-2004','09-04-2004','12-04-2004','03-05-2004','31-05-2004','30-08-2004','25-12-2004','26-12-2004','27-12-2004','28-12-2004','01-01-2005','03-01-2005','25-03-2005','28-03-2005','02-05-2005','30-05-2005','29-08-2005','27-12-2005','28-12-2005','02-01-2006','14-04-2006','17-04-2006','01-05-2006','29-05-2006','28-08-2006','25-12-2006','26-12-2006','02-01-2007','06-04-2007','09-04-2007','07-05-2007','28-05-2007','27-08-2007','25-12-2007','26-12-2007','01-01-2008','21-03-2008','24-03-2008','05-05-2008','26-05-2008','25-08-2008','25-12-2008','26-12-2008','01-01-2009','10-04-2009','13-04-2009','04-05-2009','25-05-2009','31-08-2009','25-12-2009','28-12-2009','01-01-2010','02-04-2010','05-04-2010','03-05-2010','31-05-2010','30-08-2010','24-12-2010','27-12-2010','28-12-2010','31-12-2010','03-01-2011','22-04-2011','25-04-2011','29-04-2011','02-05-2011','30-05-2011','29-08-2011','26-12-2011','27-12-2011')
then 0
else 1
end working_day
from dual
connect by level <= 36525 + sysdate - to_date('01-01-2000') )
SELECT a.ASM_ID,
a.ASM_START_DATE,
a.ASM_END_DATE,
sum(c.working_day) AS Week_Day
From O_ASSESSMENTS a
join calendar_cte c
on c.calendar_date between a.ASM_START_DATE and a.ASM_END_DATE
WHERE a.ASM_QSA_ID IN ('TYPE1') and
a.ASM_END_DATE >= '01/01/2012' -- and a.ASM_ID = 'A00000'
GROUP BY
a.ASM_ID,
a.ASM_START_DATE,
a.ASM_END_DATE
This will produce a virtual table populated with dates from 01 January 2000 to roughly 100 years after the current date (the 36525 in the connect by clause is a number of days; use about 3652 for 10 years), with all weekends marked as non-working days and all days specified in the second in clause (i.e. up to 27 December 2011) also marked as non-working days.
The drawback of this method (or any method where the holiday dates are hardcoded into the query) is that each time new holiday dates are defined, every single query that uses this approach will have to have those dates added.
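As a rough illustration of the virtual-calendar idea running over several records at once, here is a sketch using Python's sqlite3 module in place of Oracle. A recursive CTE stands in for connect by level, the assessments table and its rows are invented, and the holiday list is cut down to a single date kept in its own CTE so that it appears in only one place in the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE assessments (asm_id TEXT, asm_start_date TEXT, asm_end_date TEXT);
    -- Hypothetical sample records.
    INSERT INTO assessments VALUES
        ('A00000', '2012-01-02', '2012-01-06'),
        ('A00001', '2012-01-07', '2012-01-15');
""")

results = conn.execute("""
    WITH RECURSIVE calendar(calendar_date) AS (
        SELECT '2012-01-01'
        UNION ALL
        SELECT date(calendar_date, '+1 day')
        FROM calendar
        WHERE calendar_date < '2012-01-31'
    ),
    holidays(holiday_date) AS (
        VALUES ('2012-01-02')            -- one bank holiday, for the demo
    ),
    flagged AS (
        SELECT calendar_date,
               CASE
                   WHEN strftime('%w', calendar_date) IN ('0', '6') THEN 0  -- Sun, Sat
                   WHEN calendar_date IN (SELECT holiday_date FROM holidays) THEN 0
                   ELSE 1
               END AS working_day
        FROM calendar
    )
    SELECT a.asm_id, SUM(f.working_day) AS working_days
    FROM assessments a
    JOIN flagged f
      ON f.calendar_date BETWEEN a.asm_start_date AND a.asm_end_date
    GROUP BY a.asm_id
    ORDER BY a.asm_id
""").fetchall()

print(results)
```

Because the join is on a date range rather than on ROWNUM, each assessment gets its own set of calendar rows, which is what lets this shape of query handle more than one record per run.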
If you can't use a calendar table in Oracle, you might be better off exporting to Excel. Brute force always works.
Networkdays() "returns the number of whole working days between start_date and end_date. Working days exclude weekends and any dates identified in holidays."
Excluding weekends seems fairly straightforward. Every 7-day period will contain two weekend days. You'll just need to take some care with the leftover days.
Holidays are a different story. You have to either store them or pass them as an argument. If you could store them, you'd store them in a calendar table, and your problem would be over. But you can't do that.
So you're looking at passing them as an argument. Off the top of my head--and I haven't had any tea yet this morning--I'd consider a common table expression or a wrapper for a stored procedure.
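The weekend arithmetic described above (five working days per full week, plus some care with the leftover days) can be sketched in Python as follows. This is only an illustration of the counting logic with holidays passed in as an argument, not a drop-in replacement for Excel's function; it counts both endpoints, and the function name is simply borrowed from Excel.

```python
from datetime import date

def networkdays(start, end, holidays=()):
    """Count weekdays from start to end inclusive, minus weekday holidays."""
    if end < start:
        raise ValueError("end must not be before start")
    total_days = (end - start).days + 1
    full_weeks, leftover = divmod(total_days, 7)
    working = full_weeks * 5                      # every full week has 5 weekdays
    # The leftover days begin on the same weekday as start (Monday == 0).
    for offset in range(leftover):
        if (start.weekday() + offset) % 7 < 5:
            working += 1
    # Remove holidays that fall on a weekday inside the range.
    working -= sum(
        1 for h in set(holidays) if start <= h <= end and h.weekday() < 5
    )
    return working

print(networkdays(date(2012, 1, 2), date(2012, 1, 6)))                       # 5
print(networkdays(date(2012, 1, 2), date(2012, 1, 6), [date(2012, 1, 2)]))   # 4
```

The same two steps translate to SQL if you can pass the holiday list in, for example as a CTE or a procedure argument as suggested above.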