SQL - BigQuery - How do I fill in dates from a calendar table?

My goal is to join a sales program table to a calendar table so that the joined result covers the full trailing 52 weeks, day by day, with the sales data attached to it. The idea is that the missing days would come back as NULLs I could COALESCE after the fact. However, my problem is that I only get the rows that have matches in my sales data table, not the NULL-filled dates.
The questions I've consulted so far are:
Join to Calendar Table - 5 Business Days
Joining missing dates from calendar table, which points to:
MySQL how to fill missing dates in range?
My calendar table holds the 364 days prior to today (today being day 0). The sales data has a program field, a store field, and then a start date and an end date for the program.
Here's what I have coded:
SELECT
  CAL.DATE,
  CAL.DAY,
  SALES.ITEM,
  SALES.PROGRAM,
  SALES.SALE_DT,
  SALES.EFF_BGN_DT,
  SALES.EFF_END_DT
FROM
  CALENDAR_TABLE AS CAL
LEFT JOIN
  SALES_TABLE AS SALES
  ON CAL.DATE = SALES.SALE_DT
WHERE 1=1
  AND SALES.ITEM = 1 OR SALES.ITEM IS NULL
ORDER BY DATE ASC
What I expected was 365 records: dates with NULL values and dates with filled-in records. Instead, my query returned a few dates with NULL values but otherwise just the dates where a program exists.
DATE | ITEM | PROGRAM | SALE_DT | PRGM_BGN | PRGM_END |
----------|--------|---------|----------|-----------|-----------|
8/27/2020 | | | | | |
8/26/2020 | | | | | |
8/25/2020 | | | | | |
8/24/2020 | | | | | |
6/7/2020 | 1 | 5 | 6/7/2020 | 2/13/2016 | 6/7/2020 |
6/6/2020 | 1 | 5 | 6/6/2020 | 2/13/2016 | 6/7/2020 |
6/5/2020 | 1 | 5 | 6/5/2020 | 2/13/2016 | 6/7/2020 |
6/4/2020 | 1 | 5 | 6/4/2020 | 2/13/2016 | 6/7/2020 |
Date = Calendar day.
Item = Item number being sold.
Program = Unique numeric ID of program.
Sale_Dt = Field populated if at least one item was sold under this program.
Prgm_bgn = First day when item was eligible to be sold under this program.
Prgm_end = Last day when item was eligible to be sold under this program.
What I would have expected is records between June 7 and August 24 with only the DATE column populated for each day and NULLs elsewhere, as happens in the most recent four records.
I'm trying to understand why the calendar table and the query I've written are not producing those in-between dates.
EDIT: I've removed the request for feedback to shorten the question as well as an example I don't think added value. But please continue to give feedback as you see necessary.

I'd be more than happy to delete this whole question or have someone else give a better answer, but after staring at the logic in some of the answers in this thread (MySQL how to fill missing dates in range?) long enough, I came up with this:
SELECT
  CAL.DATE,
  t.* EXCEPT (DATE)
FROM
  CALENDAR_TABLE AS CAL
LEFT JOIN
  (SELECT
     CAL.DATE,
     CAL.DAY,
     SALES.ITEM,
     SALES.PROGRAM,
     SALES.SALE_DT,
     SALES.EFF_BGN_DT,
     SALES.EFF_END_DT
   FROM
     CALENDAR_TABLE AS CAL
   LEFT JOIN
     SALES_TABLE AS SALES
     ON CAL.DATE = SALES.SALE_DT
   WHERE 1=1
     AND SALES.ITEM = 1 OR SALES.ITEM IS NULL
   ORDER BY DATE ASC) AS t
  ON CAL.DATE = t.DATE
From what I can tell, this is what I needed. The subquery attaches the sales records to their dates, and the outer query then joins back to the calendar table solely on date, which allows the NULL rows to be created for the missing days.
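For what it's worth, a common simpler pattern is to move the item filter into the join condition, so the WHERE clause cannot discard calendar dates whose only matches are for other items. This is just a sketch, assuming BigQuery standard SQL and the same table and column names as above:
SELECT
  CAL.DATE,
  CAL.DAY,
  SALES.ITEM,
  SALES.PROGRAM,
  SALES.SALE_DT,
  SALES.EFF_BGN_DT,
  SALES.EFF_END_DT
FROM CALENDAR_TABLE AS CAL
LEFT JOIN SALES_TABLE AS SALES
  ON CAL.DATE = SALES.SALE_DT
  AND SALES.ITEM = 1   -- filter applied as part of the LEFT JOIN, before NULL-extension
ORDER BY CAL.DATE ASC
Because the filter is evaluated as part of the LEFT JOIN, every calendar date survives; dates with no matching item-1 sale simply carry NULLs in the SALES columns.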

Related

Retrieving 52 weeks after the result of a subquery

From a table that contains sales, I retrieved the last week of that table, i.e. the last week in which sales were made. 'Date' is always the first day of the month, but that doesn't matter; the really important data are week and partial_week.
The result is simple:
+------------+---------+--------------+
| Date | Week | Partial_week |
+------------+---------+--------------+
| 2020-02-01 | 2020-09 | 2020M02W09 |
+------------+---------+--------------+
Let's call it t1
I have a table with the first day of each month, every week, and every partial week from 2015 to 2025
(when a week spans two months, it is split into two partial weeks that share the same week number but have different months). It looks like this:
+------------+---------+--------------+
| Date | Week | Partial_week |
+------------+---------+--------------+
| 2020-02-01 | 2020-05 | 2020M02W05 |
| 2020-02-01 | 2020-06 | 2020M02W06 |
| 2020-02-01 | 2020-07 | 2020M02W07 |
| 2020-02-01 | 2020-08 | 2020M02W08 |
| 2020-02-01 | 2020-09 | 2020M02W09 |
| 2020-03-01 | 2020-09 | 2020M03W09 |
+------------+---------+--------------+
Let's call it t2
I now need to retrieve everything in t2 that is between 1 and 52 weeks after the week retrieved in t1 (this should get me every week and partial week until 2021-09 or so).
I thought about doing a
'select top 52 distinct week from t2'
joined to t1 with a where clause 'where t1.week < t2.week',
then joining everything back onto t2 again to get every partial week too,
but that doesn't work because t1.week comes back as null for every week (I wish t1.week could just be a variable, since it only has one row...).
Any ideas would be appreciated.
Your logic seems to be close. Put the initial query in a Scalar Subquery to handle it like a variable:
select *
from t2
where t2.week >=
( select week from t1 -- i.e. your existing query to return the latest week
)
qualify
dense_rank()
over (order by week) <= 52
You can also switch to a join:
select *
from t2
join
( select week from t1 -- i.e. your existing query to return the latest week
) as t1
on t2.week >= t1.week
qualify
dense_rank() -- next 52 week & partial weeks
over (order by t2.week) <= 52
The Explain plan of the Scalar Subquery version might be better.
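If the target engine does not support QUALIFY (Teradata, BigQuery and Snowflake do; SQL Server, for instance, does not), the same idea can be written with a derived table. A sketch, keeping the t1/t2 names from the question:
select *
from (
    select t2.*,
           dense_rank() over (order by t2.week) as week_rank  -- ranks weeks and partial weeks together
    from t2
    where t2.week >= (select week from t1)  -- the latest week, used like a variable
) ranked
where week_rank <= 52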

Creating user time report that includes zero hour weeks

I'm having a heck of a time putting together a query that I thought would be quite simple. I have a table that records total hours spent on a task and the user that reported those hours. I need to put together a query that returns how many hours a given user charged to each week of the year (including weeks where no hours were charged).
Expected Output:
|USER_ID | START_DATE | END_DATE | HOURS |
-------------------------------------------
|'JIM' | 4/28/2019 | 5/4/2019 | 6 |
|'JIM' | 5/5/2019 | 5/11/2019 | 0 |
|'JIM' | 5/12/2019 | 5/18/2019 | 16 |
I have a function that returns the start and end date of the week for each day, so I used that and joined it to the task table by date and summed up the hours. This gets me very close, but since I'm joining on date I obviously end up with NULL for the USER_ID on all zero hour rows.
Current Output:
|USER_ID | START_DATE | END_DATE | HOURS |
-------------------------------------------
|'JIM' | 4/28/2019 | 5/4/2019 | 6 |
| NULL | 5/5/2019 | 5/11/2019 | 0 |
|'JIM' | 5/12/2019 | 5/18/2019 | 16 |
I've tried a few other approaches, but each time I end up hitting the same problem. Any ideas?
Schema:
---------------------------------
| TASK_LOG |
---------------------------------
|USER_ID | DATE_ENTERED | HOURS |
-------------------------------
|'JIM' | 4/28/2019 | 6 |
|'JIM' | 5/12/2019 | 6 |
|'JIM' | 5/13/2019 | 10 |
------------------------------------
| DATE_HELPER_TABLE |
|(This is actually a function, but I|
| put it in a table to simplify) |
-------------------------------------
|DATE | START_OF_WEEK | END_OF_WEEK |
-------------------------------------
|5/3/2019 | 4/28/2019 | 5/4/2019 |
|5/4/2019 | 4/28/2019 | 5/4/2019 |
|5/5/2019 | 5/5/2019 | 5/11/2019 |
| ETC ... |
Query:
SELECT HRS.USER_ID
,DHT.START_OF_WEEK
,DHT.END_OF_WEEK
,SUM(HOURS)
FROM DATE_HELPER_TABLE DHT
LEFT JOIN (
SELECT TL.USER_ID
,TL.HOURS
,DHT2.START_OF_WEEK
,DHT2.END_OF_WEEK
FROM TASK_LOG TL
JOIN DATE_HELPER_TABLE DHT2 ON DHT2.DATE_VALUE = TL.DATE_ENTERED
WHERE TL.USER_ID = 'JIM1'
) HRS ON HRS.START_OF_WEEK = DHT.START_OF_WEEK
GROUP BY USER_ID
,DHT.START_OF_WEEK
,DHT.END_OF_WEEK
ORDER BY DHT.START_OF_WEEK
http://sqlfiddle.com/#!18/02d43/3 (note: for this sql fiddle, I converted my date helper function into a table to simplify)
Cross join the users (in question) and include them in the join condition. Use coalesce() to get 0 instead of NULL for the hours of weeks where no work was done.
SELECT u.user_id,
dht.start_of_week,
dht.end_of_week,
coalesce(sum(hrs.hours), 0)
FROM date_helper_table dht
CROSS JOIN (VALUES ('JIM1')) u (user_id)
LEFT JOIN (SELECT tl.user_id,
dht2.start_of_week,
tl.hours
FROM task_log tl
INNER JOIN date_helper_table dht2
ON dht2.date_value = tl.date_entered) hrs
ON hrs.user_id = u.user_id
AND hrs.start_of_week = dht.start_of_week
GROUP BY u.user_id,
dht.start_of_week,
dht.end_of_week
ORDER BY dht.start_of_week;
I used a VALUES clause here to list the users. If you only want to get the times for particular users you can do so too (or use any other subquery, or ...). Otherwise you can use your user table (which you didn't post, so I had to use that substitute).
However, the figures produced by this (and by your original query) look strange to me. In the fiddle your user has worked a total of 23 hours in the task_log table, yet the sums in the result are 24 and 80. That is way too much on its own, and even worse considering that 1 hour in task_log isn't even on a date listed in date_helper_table.
I suspect you'll get more accurate figures if you just join task_log, not that weird derived table.
SELECT u.user_id,
dht.start_of_week,
dht.end_of_week,
coalesce(sum(tl.hours), 0)
FROM date_helper_table dht
CROSS JOIN (VALUES ('JIM1')) u (user_id)
LEFT JOIN task_log tl
ON tl.user_id = u.user_id
AND tl.date_entered = dht.date_value
GROUP BY u.user_id,
dht.start_of_week,
dht.end_of_week
ORDER BY dht.start_of_week;
But maybe that's just me.
SQL Fiddle
http://sqlfiddle.com/#!18/02d43/65
Using your SQL fiddle, I simply updated the select statement to account for and convert null values. As far as I can tell, there is nothing in your post that makes this option not viable. Please let me know if this is not the case and I will update. (This is not intended to detract from sticky bit's answer, but to offer an alternative)
SELECT ISNULL(HRS.USER_ID, '') as [USER_ID]
,DHT.START_OF_WEEK
,DHT.END_OF_WEEK
,SUM(ISNULL(HOURS,0)) as [SUM]
FROM DATE_HELPER_TABLE DHT
LEFT JOIN (
SELECT TL.USER_ID
,TL.HOURS
,DHT2.START_OF_WEEK
,DHT2.END_OF_WEEK
FROM TASK_LOG TL
JOIN DATE_HELPER_TABLE DHT2 ON DHT2.DATE_VALUE = TL.DATE_ENTERED
WHERE TL.USER_ID = 'JIM1'
) HRS ON HRS.START_OF_WEEK = DHT.START_OF_WEEK
GROUP BY USER_ID
,DHT.START_OF_WEEK
,DHT.END_OF_WEEK
ORDER BY DHT.START_OF_WEEK
Create a dates table that includes all dates for the next 100 years in the first column, with the week of the year, day of the month, etc. in the following columns.
Then select from that dates table and LEFT JOIN everything else, using the ISNULL function to replace NULLs with zeros, as in the sketch below.
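A minimal sketch of that idea in SQL Server syntax, generating the dates inline with a recursive CTE purely for illustration (a real dates table would be created once and reused); the TASK_LOG names are from the question, and the date range and user 'JIM' are chosen just for the example:
;WITH d AS (
    SELECT CAST('2019-04-28' AS date) AS DATE_VALUE
    UNION ALL
    SELECT DATEADD(DAY, 1, DATE_VALUE)
    FROM d
    WHERE DATE_VALUE < '2019-05-18'                  -- widen the range as needed
)
SELECT 'JIM'                    AS USER_ID,          -- constant, since the join is already filtered to this user
       MIN(d.DATE_VALUE)        AS START_DATE,       -- first generated day of the calendar week
       MAX(d.DATE_VALUE)        AS END_DATE,
       ISNULL(SUM(tl.HOURS), 0) AS HOURS             -- 0 instead of NULL for weeks with no work
FROM d
LEFT JOIN TASK_LOG tl
       ON tl.DATE_ENTERED = d.DATE_VALUE
      AND tl.USER_ID = 'JIM'
GROUP BY DATEPART(YEAR, d.DATE_VALUE), DATEPART(WEEK, d.DATE_VALUE)   -- weeks per the server's DATEFIRST setting
ORDER BY MIN(d.DATE_VALUE)
OPTION (MAXRECURSION 0);                             -- lift the default 100-step recursion limit for longer ranges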

How to dynamically call date instead of hardcoding in WHERE clause?

In my code using SQL Server, I am comparing data between two months where I have the exact dates identified. I am trying to find whether the value in a certain column changes in a bunch of different scenarios. That part works, but I'd like to avoid having to go back and change the date each time I want to get the results I'm looking for. Is this possible?
My thought was to add a WITH clause, but it is giving me an aggregation error. Is there any way I can make this date problem simpler? Thanks in advance.
EDIT
Ok I'd like to clarify. In my WITH statement, I have:
select distinct
d.Date
from Database d
Which returns:
+------+-------------+
| | Date |
+------+-------------|
| 1 | 01-06-2017 |
| 2 | 01-13-2017 |
| 3 | 01-20-2017 |
| 4 | 01-27-2017 |
| 5 | 02-03-2017 |
| 6 | 02-10-2017 |
| 7 | 02-17-2017 |
| 8 | 02-24-2017 |
| 9 | ........ |
+------+-------------+
If I execute this statement, it returns just the dates from my table, as shown above. What I'd like is SQL that pulls from these date values and compares the last date of one month to the last date of the next month. In essence, it should compare the values from date 8 to the values from date 4, but it should be dynamic enough to do the same for any two dates without much tinkering.
If I haven't misunderstood your request, it seems you need a numbers table, also known as a tally table, or in this case a calendar table.
Recommended post: https://dba.stackexchange.com/questions/11506/why-are-numbers-tables-invaluable
Basically, you create a table and populate it with the week numbers of the year along with each week's start and end dates. Then join your main query to this table.
+------+-----------+----------+
| week | startDate | endDate |
+------+-----------+----------+
| 1 | 20170101 | 20170107 |
| 2 | 20170108 | 20170114 |
+------+-----------+----------+
select b.week, max(a.data)
from yourTable a
inner join calendarTable b
  on a.Date between b.startDate and b.endDate
group by b.week
Dynamic dates to filter with BETWEEN:
select dateadd(m,-1,dateadd(day,-(datepart(day,cast(getdate() as date))-1),cast(getdate() as date))) -- first date of last month
select dateadd(day,-datepart(day,cast(getdate() as date)),cast(getdate() as date)) -- last date of last month
select dateadd(day,-(datepart(day,cast(getdate() as date))-1),cast(getdate() as date)) -- first date of current month
select dateadd(day,-datepart(day,dateadd(m,1,cast(getdate() as date))),dateadd(m,1,cast(getdate() as date))) -- last date of current month
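To connect those expressions back to the question, a sketch of how they could replace hardcoded dates in a filter; the table and column names (Database, Date) are the ones from the question, and "last month" is just an example window:
select d.*
from [Database] d            -- bracketed because DATABASE is a reserved word in T-SQL
where d.Date between
      dateadd(m,-1,dateadd(day,-(datepart(day,cast(getdate() as date))-1),cast(getdate() as date)))  -- first date of last month
  and dateadd(day,-datepart(day,cast(getdate() as date)),cast(getdate() as date))                    -- last date of last month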

SQL Query for 7 Day Rolling Average in SQL Server

I have a table of hourly product usage (how many times the product is used) data –
ID (bigint) | ProductId (tinyint) | Date (int, YYYYMMDD) | Hour (tinyint) | UsageCount (int)
# | 1 | 20140901 | 0  | 10
# | 1 | 20140901 | 1  | 15
# | 1 | 20140902 | 5  | 25
# | 1 | 20140903 | 5  | 25
# | 1 | 20140904 | 3  | 25
# | 1 | 20140905 | 7  | 25
# | 1 | 20140906 | 10 | 25
# | 1 | 20140907 | 9  | 25
# | 1 | 20140908 | 5  | 25
# | 2 | 20140903 | 16 | 10
# | 2 | 20140903 | 13 | 115
Likewise, I have the usage data for 4 different products (ProductId from 1 through 4) stored for every hour in the product_usage table. As you can imagine, it is constantly growing as the nightly ETL process dumps the data for the entire previous day. If a product is not used on any hour of a day, the record for that hour won’t appear in this table. Similarly, if a product is not used for the entire day, there won’t be any record for that day in the table. I need to generate a report that gives daily usage and last 7 days’ rolling average –
For example:
ProductId | Date | DailyUsage | RollingAverage
1 | 20140901 | sum of usages of that day | (Sum of usages from 20140901 through 20140826) / 7
1 | 20140902 | sum of usages of that day | (Sum of usages from 20140902 through 20140827) / 7
2 | 20140902 | sum of usages of that day | (Sum of usages from 20140902 through 20140827) / 7
And so on..
I am planning to create an Indexed View in SQL server 2014. Can you think of an efficient SQL query to do this?
Try:
select x.*,
avg(dailyusage) over(partition by productid order by productid, date rows between 6 preceding and current row) as rolling_avg
from (select productid, date, sum(usagecount) as dailyusage
from tbl
group by productid, date) x
Fiddle:
http://sqlfiddle.com/#!6/f674a7/4/0
Replace "avg(dailusage) over...." with sum (rather than avg) if what you really want is the sum for the past week. In your title you say you want the average but later you say you want the sum. The query should be the same other than that, so use whichever you actually want.
As was pointed out by Gordon this is basically the average of the past 6 dates in which the product was used, which might be more than just the past 6 days if there are days without any rows for that product on the table because it wasn't used at all. To get around that you could use a date table and your products table.
You have to be careful if you can be missing data on some days. If I assume that there is data for some product on each day, then this approach will work:
select p.productid, d.date, sum(h.usagecount) as dailyusage,
       sum(sum(h.usagecount)) over (partition by p.productid order by d.date
                                    rows between 6 preceding and current row) as Sum7day
from (select distinct productid from hourly) p cross join
     (select distinct date from hourly) d left join
     hourly h
     on h.productid = p.productid and h.date = d.date
group by p.productid, d.date;

SQL Query X Days back excluding date ranges (Confusing!)

Ok, I have a tough SQL query, and I'm not sure how to go about writing it.
I am summing the number of "bananas collected" by an employee within the last X days, but what I could really use help on is determining X.
The "last X days" value is defined to be the last 100 days that the employee was NOT out due to Purple Fever, starting from some ChosenDate (we'll say today, 6/24/14). That is to say, if the person was sick with Purple Fever for 3 days, then I want to look back over the last 103 days from ChosenDate rather than the last 100 days. Any other reason the employee may have been out does not affect our calculation.
Table PersonOutIncident
+----------------------+----------+-------------+
| PersonOutIncidentID | PersonID | ReasonOut |
+----------------------+----------+-------------+
| 1 | Sarah | PurpleFever |
| 2 | Sarah | PaperCut |
| 3 | Jon | PurpleFever |
| 4 | Sarah | PurpleFever |
+----------------------+----------+-------------+
Table PersonOutDetail
+-------------------+----------------------+-----------+-----------+
| PersonOutDetailID | PersonOutIncidentID | BeginDate | EndDate |
+-------------------+----------------------+-----------+-----------+
| 1 | 1 | 1/1/2014 | 1/3/2014 |
| 2 | 1 | 1/7/2014 | 1/13/2014 |
| 3 | 2 | 2/1/2014 | 2/3/2014 |
| 4 | 3 | 1/15/2014 | 1/20/2014 |
| 5 | 4 | 5/1/2014 | 5/15/2014 |
+-------------------+----------------------+-----------+-----------+
The tables are established. Many PersonOutDetail records can be associated with one PersonOutIncident record, and there may be multiple PersonOutIncident records for a single employee (that is to say, there could be two or three PersonOutIncident records with an identical ReasonOut column, because each represents a particular incident or event and the not-necessarily-continuous days lost due to that incident).
The nature of this requirement complicates things, even conceptually.
The best I can think of is to check for a BeginDate/EndDate pair within the 100-day base period, then determine the number of days from BeginDate to EndDate and add that to the base 100 days. But then I would have to check that this new range doesn't overlap or contain additional BeginDate/EndDate pairs and, if so, add those days as well. I can tell already that this isn't the method I want to use, but I can't quite wrap my mind around exactly how to start or structure this query. Does anyone have an idea that might steer me in the correct direction? I realize this might not be clear and I apologize if I'm just confusing things.
One way to do this is to work with a table or WITH clause that contains a list of days. Let's say days is a table with one column that contains the last 200 days. (This means the query will break if the employee had more than 100 sick days in the last 200 days.)
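If no such table exists, here is a minimal sketch (SQL Server syntax; a permanent calendar table would work just as well) of generating the last 200 days on the fly:
WITH days AS (
    SELECT CAST(GETDATE() AS date) AS day          -- start from the chosen date; the question uses 6/24/14 as its example
    UNION ALL
    SELECT DATEADD(DAY, -1, day)
    FROM days
    WHERE day > DATEADD(DAY, -199, CAST(GETDATE() AS date))
)
SELECT day
FROM days
OPTION (MAXRECURSION 200);   -- the default limit of 100 recursions would stop the list early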
Now you can get a list of all working days of an employee like this (replace ? with the employee id):
WITH t1 AS
(
  SELECT day,
         ROW_NUMBER() OVER (ORDER BY day DESC) AS RowNumber
  FROM days d
  WHERE NOT EXISTS (SELECT *
                    FROM PersonOutDetail pd
                    INNER JOIN PersonOutIncident po ON po.PersonOutIncidentID = pd.PersonOutIncidentID
                    WHERE d.day BETWEEN pd.BeginDate AND pd.EndDate
                      AND po.ReasonOut = 'PurpleFever'
                      AND po.PersonID = ?)
)
SELECT * FROM t1
WHERE RowNumber <= 100;
Alternatively, you can obtain the '100th day' by replacing RowNumber <= 100 with RowNumber = 100.
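To finish the original task, a sketch of how the 100th day could bound the banana sum. It would replace the final SELECT of the WITH statement above, and the BananaCollection table with PersonID, CollectedDate and Quantity columns is hypothetical, since the question doesn't show where the collected bananas are stored:
SELECT SUM(bc.Quantity) AS BananasLastHundredWorkingDays
FROM BananaCollection bc          -- hypothetical table of bananas collected per day
WHERE bc.PersonID = ?             -- same employee as in the CTE above
  AND bc.CollectedDate >= (SELECT day
                           FROM t1
                           WHERE RowNumber = 100);   -- date of the 100th non-Purple-Fever day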