SQL/BigQuery running average with gaps in dates

I'm having trouble with a moving average in BigQuery SQL. I have a table SCORES and need to compute a 30-day moving average per user. The problem is that my dates aren't sequential, i.e. there are gaps between them.
Below is my current code:
SELECT user, date,
AVG(score) OVER (PARTITION BY user ORDER BY date)
FROM SCORES;
I don't know how to add the date restriction to that window, or whether this is even possible.
My current table looks like this, but of course with a lot more users:
user date score
AA 13/02/2018 2.00
AA 15/02/2018 3.00
AA 17/02/2018 4.00
AA 01/03/2018 5.00
AA 28/03/2018 6.00
Then I need it to become, this:
user date score 30D Avg
AA 13/02/2018 2.00 2.00
AA 15/02/2018 3.00 2.50
AA 17/02/2018 4.00 3.00
AA 01/03/2018 5.00 3.50
AA 28/03/2018 6.00 5.50
In the last row the average only reaches back one row because of the date (the window covers up to 30 days backwards). Is there any way to implement this in SQL, or am I asking for too much?

You want to use a RANGE BETWEEN window frame. A frame with a numeric offset needs a numeric ordering key, so convert the date to a day count first:
select s.*,
avg(score) over (partition by user
order by days
range between 29 preceding and current row
) as avg_30day
from (select s.*, date_diff(s.date, date('2000-01-01'), day) as days
from scores s
) s;
An alternative to date_diff() is unix_date():
select s.*,
avg(score) over (partition by user
order by unix_days
range between 29 preceding and current row
) as avg_30day
from (select s.*, unix_date(s.date) as unix_days
from scores s
) s;
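To make the frame semantics concrete, here is a minimal Python sketch of what RANGE BETWEEN 29 PRECEDING AND CURRENT ROW computes on the sample data from the question. The function name trailing_average and the tuple layout are illustrative assumptions, not part of the original answer, and the quadratic scan is only meant for reading, not for large tables.

```python
from datetime import date, timedelta

def trailing_average(rows, days=30):
    """rows: list of (user, date, score) tuples (hypothetical layout).

    For each row, average every score of the same user whose date lies
    in the closed interval [date - (days - 1), date].
    """
    out = []
    for user, d, score in sorted(rows, key=lambda r: (r[0], r[1])):
        start = d - timedelta(days=days - 1)  # 29 days back for a 30-day window
        window = [s for u, x, s in rows if u == user and start <= x <= d]
        out.append((user, d, score, sum(window) / len(window)))
    return out

# Sample data from the question (dates are DD/MM/YYYY there).
scores = [
    ("AA", date(2018, 2, 13), 2.0),
    ("AA", date(2018, 2, 15), 3.0),
    ("AA", date(2018, 2, 17), 4.0),
    ("AA", date(2018, 3, 1), 5.0),
    ("AA", date(2018, 3, 28), 6.0),
]

for row in trailing_average(scores):
    print(row)
```

The last row averages only 5.00 and 6.00 because 28/03/2018 minus 29 days is 27/02/2018, which excludes the three February rows.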

Below is for BigQuery Standard SQL
#standardSQL
SELECT *,
AVG(score) OVER (
PARTITION BY user
ORDER BY UNIX_DATE(PARSE_DATE('%d/%m/%Y', date))
RANGE BETWEEN 29 PRECEDING AND CURRENT ROW
) AS avg_30day
FROM `project.dataset.scores`
You can test this using the dummy data from your question:
#standardSQL
WITH `project.dataset.scores` AS (
SELECT 'AA' user, '13/02/2018' date, 2.00 score UNION ALL
SELECT 'AA', '15/02/2018', 3.00 UNION ALL
SELECT 'AA', '17/02/2018', 4.00 UNION ALL
SELECT 'AA', '01/03/2018', 5.00 UNION ALL
SELECT 'AA', '28/03/2018', 6.00
)
SELECT *,
AVG(score) OVER (
PARTITION BY user
ORDER BY UNIX_DATE(PARSE_DATE('%d/%m/%Y', date))
RANGE BETWEEN 29 PRECEDING AND CURRENT ROW
) AS avg_30day
FROM `project.dataset.scores`
Result:
Row user date score avg_30day
1 AA 13/02/2018 2.0 2.0
2 AA 15/02/2018 3.0 2.5
3 AA 17/02/2018 4.0 3.0
4 AA 01/03/2018 5.0 3.5
5 AA 28/03/2018 6.0 5.5

Related

PARTITION BY with date between 2 date

I work on Azure SQL Database (SQL Server).
I am trying to build a table with one row per day, but the individual days are not stored in the source table.
The example below explains it:
TABLE STARTER (date format: YYYY-MM-DD):
Date begin  Date End    Category  Value
2021-01-01  2021-01-03  1         0.2
2021-01-02  2021-01-03  1         0.1
2021-01-01  2021-01-02  2         0.3
For the result, I am trying to get this TABLE RESULT:
Date        Category  Value
2021-01-01  1         0.2
2021-01-01  2         0.3
2021-01-02  1         0.3 (0.2+0.1)
2021-01-02  2         0.3
2021-01-03  1         0.3 (0.2+0.1)
For each day, I want to sum the value if the day is between the beginning and the end of the date. I need to do that for each category.
In terms of SQL code I try to do something like that:
SELECT SUM(CAST(value as float)) OVER (PARTITION BY Date begin, Category) as value,
Date begin,
Category,
Value
FROM TABLE STARTER
This code only sums values that share the same Date begin; it doesn't consider every day between Date begin and Date End.
So it doesn't compute the sum for Category 1 on 2021-01-02, because that date isn't stored explicitly (it only falls between 2021-01-01 and 2021-01-03).
Is it possible to do that in SQL?
Thanks so much for your help!
You can use a recursive CTE to expand the date ranges into a list of separate days. Then it's a matter of joining and aggregating.
For example:
with
r as (
select category,
min(date_begin) as date_begin, max(date_end) as date_end
from starter
group by category
),
d as (
select category, date_begin as d from r
union all
select d.category, dateadd(day, 1, d.d)
from d
join r on r.category = d.category
where d.d < r.date_end
)
select d.d, d.category, sum(s.value) as value
from d
join starter s on s.category = d.category
and d.d between s.date_begin and s.date_end
group by d.category, d.d;
Result:
d category value
----------- --------- -----
2021-01-01 1 0.20
2021-01-01 2 0.30
2021-01-02 1 0.30
2021-01-02 2 0.30
2021-01-03 1 0.30
See running example at db<>fiddle.
Note: SQL Server 2022 adds a GENERATE_SERIES() function that can make this query much shorter.
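The expand-then-aggregate idea can also be sketched outside SQL. The Python below is an illustration only: it expands each row directly into its days instead of building a per-category calendar and joining, and the name daily_totals and the tuple layout are assumptions made for this sketch.

```python
from collections import defaultdict
from datetime import date, timedelta

def daily_totals(ranges):
    """ranges: list of (date_begin, date_end, category, value), end inclusive.

    Returns {(day, category): summed value} for every day covered by a range.
    """
    totals = defaultdict(float)
    for begin, end, category, value in ranges:
        day = begin
        while day <= end:  # walk the range one day at a time
            totals[(day, category)] += value
            day += timedelta(days=1)
    return dict(sorted(totals.items()))

# Sample rows from the question's starter table.
starter = [
    (date(2021, 1, 1), date(2021, 1, 3), 1, 0.2),
    (date(2021, 1, 2), date(2021, 1, 3), 1, 0.1),
    (date(2021, 1, 1), date(2021, 1, 2), 2, 0.3),
]

for (day, category), value in daily_totals(starter).items():
    print(day, category, round(value, 2))
```

Floating-point sums such as 0.2 + 0.1 are not exactly 0.3, hence the rounding when printing; a decimal type would avoid that.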

How to create a start and end date with no gaps from one date column and to sum a value within the dates

I am new to SQL and am working in SQL Developer.
I have a table with 4 columns: patient ID (ptid), service date (dt), insurance payment amount (insr_amt), and out-of-pocket payment amount (op_amt) (see Table 1 below).
What I would like to do is:
(1) create two columns, "start_dt" and "end_dt", from the "dt" column. If a patient's service dates have no gaps, the start and end dates are that patient's first and last dates. If there is a gap in a patient's service dates, each run of consecutive dates gets its own start/end row.
(2) sum the two payment amounts per patient within each start/end date range (see Table 2 below).
How can I do this with SQL in SQL Developer?
Thank you!
Table 1:
Ptid  dt        insr_amt  op_amt
A     1/1/2021  30        20
A     1/2/2021  30        10
A     1/3/2021  30        10
A     1/4/2021  30        30
B     1/6/2021  10        10
B     1/7/2021  20        10
C     2/1/2021  15        30
C     2/2/2021  15        30
C     2/6/2021  60        30
Table 2:
Ptid  start_dt  end_dt    total_insr_amt  total_op_amt
A     1/1/2021  1/4/2021  120             70
B     1/6/2021  1/7/2021  30              20
C     2/1/2021  2/2/2021  30              60
C     2/6/2021  2/6/2021  60              30
You didn't mention the specific database, so this solution is written for PostgreSQL. You can do:
select
ptid,
min(dt) as start_dt,
max(dt) as end_dt,
sum(insr_amt) as total_insr_amt,
sum(op_amt) as total_op_amt
from (
select *,
sum(inc) over(partition by ptid order by dt) as grp
from (
select *,
case when dt - interval '1 day' = lag(dt) over(partition by ptid order by dt)
then 0 else 1 end as inc
from t
) x
) y
group by ptid, grp
order by ptid, grp
Result:
ptid start_dt end_dt total_insr_amt total_op_amt
----- ---------- ---------- -------------- -----------
A 2021-01-01 2021-01-04 120 70
B 2021-01-06 2021-01-07 30 20
C 2021-02-01 2021-02-02 30 60
C 2021-02-06 2021-02-06 60 30
See running example at DB Fiddle 1.
EDIT for Oracle
As requested, the modified query that works in Oracle is:
select
ptid,
min(dt) as start_dt,
max(dt) as end_dt,
sum(insr_amt) as total_insr_amt,
sum(op_amt) as total_op_amt
from (
select x.*,
sum(inc) over(partition by ptid order by dt) as grp
from (
select t.*,
case when dt - 1 = lag(dt) over(partition by ptid order by dt)
then 0 else 1 end as inc
from t
) x
) y
group by ptid, grp
order by ptid, grp
See running example at db<>fiddle 2.
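Both queries use the same gaps-and-islands idea: flag a row as the start of a new group whenever its date is not exactly one day after the previous row for the same patient, then aggregate per group. A rough Python sketch of that logic follows; the name merge_consecutive_visits and the tuple layout are assumptions for illustration.

```python
from datetime import date, timedelta

def merge_consecutive_visits(visits):
    """visits: list of (ptid, dt, insr_amt, op_amt).

    Returns one (ptid, start_dt, end_dt, total_insr_amt, total_op_amt)
    tuple per run of consecutive service dates.
    """
    islands = []
    prev_ptid = prev_dt = None
    for ptid, dt, insr, op in sorted(visits):
        if ptid == prev_ptid and dt - prev_dt == timedelta(days=1):
            # Same patient, next calendar day: extend the current island.
            current = islands[-1]
            current[2] = dt
            current[3] += insr
            current[4] += op
        else:
            # New patient or a gap in dates: start a new island.
            islands.append([ptid, dt, dt, insr, op])
        prev_ptid, prev_dt = ptid, dt
    return [tuple(island) for island in islands]

# Table 1 from the question (dates read as M/D/YYYY).
visits = [
    ("A", date(2021, 1, 1), 30, 20), ("A", date(2021, 1, 2), 30, 10),
    ("A", date(2021, 1, 3), 30, 10), ("A", date(2021, 1, 4), 30, 30),
    ("B", date(2021, 1, 6), 10, 10), ("B", date(2021, 1, 7), 20, 10),
    ("C", date(2021, 2, 1), 15, 30), ("C", date(2021, 2, 2), 15, 30),
    ("C", date(2021, 2, 6), 60, 30),
]

for row in merge_consecutive_visits(visits):
    print(row)
```

As in the SQL, two visits on the same day would not count as consecutive and would start a new group.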

Uniform distribution of monthly budget to date

I have a monthly budget that I need to distribute evenly across the days of each month.
Data source:
Month  Budget
Jan    31
Feb    56
I want to smooth it out to:
Date    Budget
01-Jan  1
02-Jan  1
...     ...
01-Feb  2
02-Feb  2
...     ...
How can I do this?
Assuming the month is really a date on the first day of the month, a pretty simple method uses a recursive CTE:
with cte as (
select month as day, budget
from t
union all
select dateadd(day, 1, day), budget
from cte
where day < eomonth(day)
)
select day, budget * 1.0 / day(eomonth(day))
from cte
order by day;
Here is a db<>fiddle.
Just another option, using an ad-hoc tally/numbers table.
This assumes the source MONTH is a string and the desired year is the current year.
Example:
-- Table variables are declared with @, not #
Declare @YourTable Table ([Month] varchar(50),[Budget] money)
Insert Into @YourTable Values
('Jan',31)
,('Feb',56)
Select Date = DateFromParts(year(D),month(D),N)
,Budget = Budget / day(D)
From @YourTable A
-- D = last day of the month in the current year, so day(D) is the number of days in that month
Cross Apply ( values (EOMonth(try_convert(date,concat('01-',Month,'-',year(getdate())))))) B(D)
-- N = 1..31, limited per month by the join condition
Join (Select Top 31 N=Row_Number() Over (Order By (Select Null)) From master..spt_values n1) C
on N<=day(D)
Results
Date Budget
2021-01-01 1.00
2021-01-02 1.00
...
2021-01-30 1.00
2021-01-31 1.00
2021-02-01 2.00
...
2021-02-27 2.00
2021-02-28 2.00
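For comparison, the same even split can be sketched in Python with the standard calendar module. This assumes months are given as numbers and the year is passed explicitly; spread_budget is a hypothetical name used only for this sketch.

```python
import calendar
from datetime import date

def spread_budget(monthly, year):
    """monthly: list of (month_number, budget) pairs.

    Returns one (date, daily_budget) pair per day, where the daily budget
    is the monthly budget divided by the number of days in that month.
    """
    out = []
    for month, budget in monthly:
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, days_in_month + 1):
            out.append((date(year, month, day), budget / days_in_month))
    return out

daily = spread_budget([(1, 31), (2, 56)], 2021)
print(daily[0], daily[-1], len(daily))
```

In a leap year February has 29 days, so the same 56 would no longer divide to exactly 2 per day.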

How to shift a year-week field in bigquery

This question is about shifting values of a year-week field in bigquery.
run_id year_week value
0001 201451 13
0001 201452 6
0001 201503 3
0003 201351 8
0003 201352 5
0003 201403 1
Here, the week within each year can range from 01 to 53. For example, the last week of 2014 is 201452, but the last week of 2015 is 201553.
Now I want to shift the values for each year_week in each run_id by 5 weeks. Weeks with no value are assumed to have a value of 0. For example, the output for the table above should look like this:
run_id year_week value
0001 201504 13
0001 201505 6
0001 201506 0
0001 201507 0
0001 201508 3
0003 201404 8
0003 201405 5
0003 201406 0
0003 201407 0
0003 201408 1
Explanation of the output: In the table above for run_id 0001 the year_week 201504 has a value of 13 because in the input table we had a value of 13 for year_week 201451 which is 5 weeks before 201504.
I could create a table programmatically by creating a mapping from a year_week to a shifted year_week and then doing a join to get the output, but I was wondering if there is any other way to do it by just using sql.
#standardSQL
WITH `project.dataset.table` AS (
SELECT '001' run_id, 201451 year_week, 13 value UNION ALL
SELECT '001', 201452, 6 UNION ALL
SELECT '001', 201503, 3
), weeks AS (
SELECT 100 * year + week year_week
FROM UNNEST([2013, 2014, 2015, 2016, 2017]) year,
UNNEST(GENERATE_ARRAY(1, EXTRACT(ISOWEEK FROM DATE(year, 12, 28)))) week  -- Dec 28 always falls in the last ISO week of its year
), temp AS (
SELECT i.run_id, w.year_week, d.year_week week2, value
FROM weeks w
CROSS JOIN (SELECT DISTINCT run_id FROM `project.dataset.table`) i
LEFT JOIN `project.dataset.table` d
USING(year_week, run_id)
)
SELECT * FROM (
SELECT run_id, year_week,
SUM(value) OVER(win) value
FROM temp
WINDOW win AS (
PARTITION BY run_id ORDER BY year_week ROWS BETWEEN 5 PRECEDING AND 5 PRECEDING
)
)
WHERE NOT value IS NULL
ORDER BY run_id, year_week
with result as
Row run_id year_week value
1 001 201504 13
2 001 201505 6
3 001 201508 3
If you need to preserve the zero rows, just change this portion:
SELECT i.run_id, w.year_week, d.year_week week2, value
FROM weeks w
to
SELECT i.run_id, w.year_week, d.year_week week2, IFNULL(value, 0) value
FROM weeks w
or
SUM(value) OVER(win) value
FROM temp
to
SUM(IFNULL(value, 0)) OVER(win) value
FROM temp
If you have data in the table for all year-weeks, then you can do:
with yw as (
select year_week, row_number() over (order by year_week) as seqnum
from t
group by year_week
)
select t.*, yw5.year_week as new_year_week
from t join
yw
on t.year_week = yw.year_week left join
yw yw5
on yw5.seqnum = yw.seqnum + 5;
If you don't have a table of year weeks, then I would advise you to create such a table, so you can do such manipulations -- or a more general calendar table.
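If the year_week values follow ISO week numbering (an assumption; the question only says weeks run from 01 to 53), the shift itself can be sketched with Python's date.fromisocalendar (Python 3.8+), which handles 52- and 53-week years. The names shift_year_week and shift_and_fill are hypothetical.

```python
from datetime import date, timedelta

def shift_year_week(year_week, weeks):
    """Shift an ISO year-week such as 201451 by a number of weeks."""
    year, week = divmod(year_week, 100)
    monday = date.fromisocalendar(year, week, 1) + timedelta(weeks=weeks)
    iso_year, iso_week, _ = monday.isocalendar()
    return iso_year * 100 + iso_week

def shift_and_fill(rows, weeks=5):
    """rows: list of (run_id, year_week, value).

    Shifts every year_week by `weeks` and fills missing weeks between the
    first and last shifted week of each run_id with 0.
    """
    shifted = {}
    for run_id, year_week, value in rows:
        shifted.setdefault(run_id, {})[shift_year_week(year_week, weeks)] = value
    out = []
    for run_id, by_week in sorted(shifted.items()):
        current, last = min(by_week), max(by_week)
        while current <= last:
            out.append((run_id, current, by_week.get(current, 0)))
            current = shift_year_week(current, 1)  # step to the next ISO week
    return out

# Input table from the question.
data = [
    ("0001", 201451, 13), ("0001", 201452, 6), ("0001", 201503, 3),
    ("0003", 201351, 8), ("0003", 201352, 5), ("0003", 201403, 1),
]

for row in shift_and_fill(data):
    print(row)
```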

Duplicating records to fill gap between dates in Google BigQuery

So I've found similar resources that address how to do this in SQL, like this:
Duplicating records to fill gap between dates
I understand that BigQuery may not be the best place to do this, so I'm trying to see if it's possible at all. When I try some of the methods from the link above, I hit a wall because some of the functions aren't supported in BigQuery.
If a table exists with data structured like so:
MODIFY_DATE SKU STORE STOCK_ON_HAND
08/01/2016 00:00:00 1120010 21 100
08/05/2016 00:00:00 1120010 21 75
08/07/2016 00:00:00 1120010 21 40
How can I build a query within Google BigQuery that yields an output like the one below? A value at a given date is repeated until the next change for the dates in between:
MODIFY_DATE SKU STORE STOCK_ON_HAND
08/01/2016 00:00:00 1120010 21 100
08/02/2016 00:00:00 1120010 21 100
08/03/2016 00:00:00 1120010 21 100
08/04/2016 00:00:00 1120010 21 100
08/05/2016 00:00:00 1120010 21 75
08/06/2016 00:00:00 1120010 21 75
08/07/2016 00:00:00 1120010 21 40
I know I need to generate a table that has all the dates within a given range, but I'm having a hard time understanding if this can be done. Any ideas?
How can I build a query within Google BigQuery that yields an output like the one below? A value at a given date is repeated until the next change for the dates in between
See the example below (BigQuery Legacy SQL):
SELECT
MODIFY_DATE,
MAX(SKU_TEMP) OVER(PARTITION BY grp) AS SKU,
MAX(STORE_TEMP) OVER(PARTITION BY grp) AS STORE,
MAX(STOCK_ON_HAND_TEMP) OVER(PARTITION BY grp) AS STOCK_ON_HAND,
FROM (
SELECT
DAY AS MODIFY_DATE, SKU AS SKU_TEMP, STORE AS STORE_TEMP, STOCK_ON_HAND AS STOCK_ON_HAND_TEMP,
COUNT(SKU) OVER(ORDER BY DAY ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp,
FROM (
SELECT DATE(DATE_ADD(TIMESTAMP("2016-08-01"), pos - 1, "DAY")) AS DAY
FROM (
SELECT ROW_NUMBER() OVER() AS pos, *
FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + DATEDIFF(TIMESTAMP("2016-08-07"), TIMESTAMP("2016-08-01")), '.'),'') AS h
FROM (SELECT NULL)),h
)))
) AS DATES
LEFT JOIN (
SELECT DATE(MODIFY_DATE) AS MODIFY_DATE, SKU, STORE, STOCK_ON_HAND
FROM
(SELECT "2016-08-01" AS MODIFY_DATE, "1120010" AS SKU, 21 AS STORE, 75 AS STOCK_ON_HAND),
(SELECT "2016-08-05" AS MODIFY_DATE, "1120010" AS SKU, 22 AS STORE, 100 AS STOCK_ON_HAND),
(SELECT "2016-08-07" AS MODIFY_DATE, "1120011" AS SKU, 23 AS STORE, 40 AS STOCK_ON_HAND),
) AS TABLE_WITH_GAPS
ON TABLE_WITH_GAPS.MODIFY_DATE = DATES.DAY
)
ORDER BY MODIFY_DATE
I need to generate a table that has all the dates within a given range, but I'm having a hard time understanding if this can be done. Any ideas?
SELECT DATE(DATE_ADD(TIMESTAMP("2016-08-01"), pos - 1, "DAY")) AS DAY
FROM (
SELECT ROW_NUMBER() OVER() AS pos, *
FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + DATEDIFF(TIMESTAMP("2016-08-07"), TIMESTAMP("2016-08-01")), '.'),'') AS h
FROM (SELECT NULL)),h
)))
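The underlying fill logic, independent of the BigQuery dialect, is: generate every day in the range, then carry the last known row forward until the next change. A minimal Python sketch for a single SKU/store series follows; fill_daily and the tuple layout are assumptions for illustration, and the data is the sample from the question.

```python
from datetime import date, timedelta

def fill_daily(rows):
    """rows: list of (modify_date, sku, store, stock_on_hand) for one SKU/store.

    Returns one row per day from the first to the last date, repeating the
    most recent known values on days that have no row of their own.
    """
    rows = sorted(rows)
    if not rows:
        return []
    known = {r[0]: r for r in rows}
    out = []
    day, last_day = rows[0][0], rows[-1][0]
    current = rows[0]
    while day <= last_day:
        current = known.get(day, current)  # switch to a new row when one exists
        out.append((day,) + current[1:])
        day += timedelta(days=1)
    return out

stock = [
    (date(2016, 8, 1), "1120010", 21, 100),
    (date(2016, 8, 5), "1120010", 21, 75),
    (date(2016, 8, 7), "1120010", 21, 40),
]

for row in fill_daily(stock):
    print(row)
```

With several SKU/store combinations, the same function would be applied per combination, which corresponds to partitioning the window in SQL.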