Sliding window aggregate for year-week in bigquery - google-bigquery

My question is about sliding-window sums in BigQuery.
I have a table like the following:
run_id year_week value
001 201451 5
001 201452 8
001 201501 1
001 201505 5
003 201352 8
003 201401 1
003 201405 5
Here, for each year the week can range from 01 to 53. For example, the last week of 2014 is 201452, while the last week of 2015 is 201553. If it makes life easier, I only have 5 years (2013, 2014, 2015, 2016 and 2017) and only 2015 has weeks that go up to 53.
Now, for each run I am trying to get a sliding-window sum of the values: each year_week takes the sum of the values of the next 5 year_weeks (including itself) for the current run_id (e.g. 001). For example, the following could be an output for the table above:
run_id year_week aggregate_sum
001 201451 5+8+1+0+0
001 201452 8+1+0+0+0
001 201501 1+0+0+0+5
001 201502 0+0+0+5+0
001 201503 0+0+5+0+0
001 201504 0+5+0+0+0
001 201505 5+0+0+0+0
003 201352 8+1+0+0+0
003 201401 1+0+0+0+5
003 201402 0+0+0+5+0
003 201403 0+0+5+0+0
003 201404 0+5+0+0+0
003 201405 5+0+0+0+0
To explain what is happening: the next 5 weeks for 201451, including itself, are 201451, 201452, 201501, 201502, 201503. If there is a value for those weeks in the table for the current run_id we just sum them up, giving 5+8+1+0+0, because the value for a year_week is taken to be 0 if it is not in the table.
Is it possible to do it using sliding window operation in bigquery?

Below is for BigQuery Standard SQL
#standardSQL
WITH weeks AS (
SELECT 100* year + week year_week
FROM UNNEST([2013, 2014, 2015, 2016, 2017]) year,
UNNEST(GENERATE_ARRAY(1, IF(EXTRACT(ISOWEEK FROM DATE(1+year,1,1)) = 1, 52, 53))) week
), temp AS (
SELECT i.run_id, w.year_week, d.year_week week2, value
FROM weeks w
CROSS JOIN (SELECT DISTINCT run_id FROM `project.dataset.table`) i
LEFT JOIN `project.dataset.table` d
USING(year_week, run_id)
)
SELECT * FROM (
SELECT run_id, year_week,
SUM(value) OVER(win) aggregate_sum
FROM temp
WINDOW win AS (
PARTITION BY run_id ORDER BY year_week ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING
)
)
WHERE NOT aggregate_sum IS NULL
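Since the window trick relies on first densifying the week axis, here is a minimal Python sketch of the same idea (hypothetical helper names; assumes, per the question, that only 2015 has 53 weeks): build every year_week for 2013-2017, fill missing weeks with 0, then sum a 5-week forward-looking window.

```python
def weeks_in_year(year):
    # Per the question, only 2015 has 53 weeks in this range
    return 53 if year == 2015 else 52

def sliding_sums(values, years=(2013, 2014, 2015, 2016, 2017), window=5):
    """values: {year_week: value}; missing weeks count as 0.
    Returns {year_week: sum of the next `window` weeks incl. itself},
    keeping only non-zero sums (loosely, the IS NOT NULL filter)."""
    axis = [100 * y + w for y in years
            for w in range(1, weeks_in_year(y) + 1)]
    dense = [values.get(yw, 0) for yw in axis]
    return {yw: s for i, yw in enumerate(axis)
            if (s := sum(dense[i:i + window]))}

result = sliding_sums({201451: 5, 201452: 8, 201501: 1, 201505: 5})
```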
You can test / play with the above using the dummy data from your question, as below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT '001' run_id, 201451 year_week, 5 value UNION ALL
SELECT '001', 201452, 8 UNION ALL
SELECT '001', 201501, 1 UNION ALL
SELECT '001', 201505, 5
), weeks AS (
SELECT 100* year + week year_week
FROM UNNEST([2013, 2014, 2015, 2016, 2017]) year,
UNNEST(GENERATE_ARRAY(1, IF(EXTRACT(ISOWEEK FROM DATE(1+year,1,1)) = 1, 52, 53))) week
), temp AS (
SELECT i.run_id, w.year_week, d.year_week week2, value
FROM weeks w
CROSS JOIN (SELECT DISTINCT run_id FROM `project.dataset.table`) i
LEFT JOIN `project.dataset.table` d
USING(year_week, run_id)
)
SELECT * FROM (
SELECT run_id, year_week,
SUM(value) OVER(win) aggregate_sum
FROM temp
WINDOW win AS (
PARTITION BY run_id ORDER BY year_week ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING
)
)
WHERE NOT aggregate_sum IS NULL
-- ORDER BY run_id, year_week
with result as
Row run_id year_week aggregate_sum
1 001 201447 5
2 001 201448 13
3 001 201449 14
4 001 201450 14
5 001 201451 14
6 001 201452 9
7 001 201501 6
8 001 201502 5
9 001 201503 5
10 001 201504 5
11 001 201505 5
12 003 201348 8
13 003 201349 9
14 003 201350 9
15 003 201351 9
16 003 201352 9
17 003 201401 6
18 003 201402 5
19 003 201403 5
20 003 201404 5
21 003 201405 5
Note: this is for "I only have 5 years, 2013, 2014, 2015, 2016 and 2017", but it can easily be extended in the weeks CTE.

Related

How to find first time a price has changed in SQL

I have a table that contains an item ID, the date and the price. All items record their price for each day, but I want to select only the items that have not had their price change, and to show the number of days without change.
An example of the table is
id Price Day Month Year
asdf 10 03 11 2022
asdr1 8 03 11 2022
asdf 10 02 11 2022
asdr1 8 02 11 2022
asdf 10 01 11 2022
asdr1 7 01 11 2022
asdf 9 31 10 2022
asdr1 8 31 10 2022
asdf 8 31 10 2022
asdr1 8 31 10 2022
The output I want is:
Date id Last_Price First_Price_Appearance DaysWOchange
2022-11-03 asdf 10 2022-11-01 2
2022-11-03 asdr1 8 2022-11-02 1
The solution needs to run quickly, so what are some efficient ways to solve this, considering that the table has millions of rows and that there are items whose price has not changed in years?
The efficiency problem is that for each id I would need to scan the entire table looking for the first match where the price changed, and repeat this for thousands of items.
I am currently calculating the difference between the latest price and all of the history, but this becomes slow to process and may take several minutes for the full history.
The main concern for this problem is efficiency.
DECLARE @table TABLE (id NVARCHAR(5), Price INT, Date DATE)
INSERT INTO @table (id, Price, Date) VALUES
('asdf', 10, '2022-10-20'),
('asdr1', 8, '2022-10-15'),
('asdf', 10, '2022-11-03'),
('asdr1', 8, '2022-11-02'),
('asdf', 10, '2022-11-02'),
('asdr1', 8, '2022-11-02'),
('asdf', 10, '2022-11-01'),
('asdr1', 7, '2022-11-01'),
('asdf', 9, '2022-10-31'),
('asdr1', 8, '2022-10-31'),
('asdf', 8, '2022-10-31'),
('asdr1', 8, '2022-10-31')
Tables of data are useful, but it's even more useful if you can put the demo data into an object:
SELECT id, FirstDate, LastChange, DaysSinceChange, Price
FROM (
SELECT id, MIN(Date) OVER (PARTITION BY id ORDER BY Date) AS FirstDate, Date AS LastChange, Price,
CASE WHEN LEAD(Date,1) OVER (PARTITION BY id ORDER BY Date) IS NULL THEN DATEDIFF(DAY,Date,CURRENT_TIMESTAMP)
ELSE DATEDIFF(DAY,LAG(Date) OVER (PARTITION BY id ORDER BY Date),Date)
END AS DaysSinceChange, ROW_NUMBER() OVER (PARTITION BY id ORDER BY date DESC) AS rn
FROM @table
) a
WHERE rn = 1
This is a quick way to get what you want. If you execute the subquery by itself you can see all the history.
id FirstDate LastChange Price DaysSinceChange
-------------------------------------------------------
asdf 2022-10-20 2022-11-03 10 0
asdr1 2022-10-15 2022-11-02 8 1
SELECT id, MIN(Date) OVER (PARTITION BY id ORDER BY Date) AS FirstDate, Date AS LastChange, Price,
CASE WHEN LEAD(Date,1) OVER (PARTITION BY id ORDER BY Date) IS NULL THEN DATEDIFF(DAY,Date,CURRENT_TIMESTAMP)
ELSE DATEDIFF(DAY,LAG(Date) OVER (PARTITION BY id ORDER BY Date),Date)
END AS DaysSinceChange, ROW_NUMBER() OVER (PARTITION BY id ORDER BY date DESC) AS rn
FROM @table
id FirstDate LastChange Price DaysSinceChange rn
------------------------------------------------------
asdf 2022-10-20 2022-11-03 10 0 1
asdf 2022-10-20 2022-11-02 10 1 2
asdf 2022-10-20 2022-11-01 10 1 3
asdf 2022-10-20 2022-10-31 9 11 4
asdf 2022-10-20 2022-10-31 8 0 5
asdf 2022-10-20 2022-10-20 10 NULL 6
asdr1 2022-10-15 2022-11-02 8 1 1
asdr1 2022-10-15 2022-11-02 8 1 2
asdr1 2022-10-15 2022-11-01 7 1 3
asdr1 2022-10-15 2022-10-31 8 16 4
asdr1 2022-10-15 2022-10-31 8 0 5
asdr1 2022-10-15 2022-10-15 8 NULL 6
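The rn = 1 pick in the query above can be mimicked outside SQL. A minimal Python sketch (hypothetical names; `today` is passed in explicitly instead of using CURRENT_TIMESTAMP):

```python
from datetime import date

def last_price_rows(rows, today):
    """rows: (id, date, price) tuples. For each id, pick the most recent
    row (the rn = 1 row) and report first date, last date, last price,
    and days between the last row and today."""
    by_id = {}
    for rid, d, p in rows:
        by_id.setdefault(rid, []).append((d, p))
    out = {}
    for rid, hist in by_id.items():
        hist.sort()                       # ORDER BY date
        first_date = hist[0][0]
        last_change, price = hist[-1]
        out[rid] = (first_date, last_change, price, (today - last_change).days)
    return out

rows = [("asdf", date(2022, 10, 20), 10), ("asdf", date(2022, 11, 3), 10),
        ("asdr1", date(2022, 10, 15), 8), ("asdr1", date(2022, 11, 2), 8)]
summary = last_price_rows(rows, date(2022, 11, 3))
```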
You can use lag() and a cumulative max():
select id, date, price
from (select t.*,
max(case when price <> lag_price then date end) over (partition by id) as price_change_date
from (select t.*, lag(price) over (partition by id order by date) as lag_price
from t
) t
) t
where price_change_date is null;
This calculates the first date of a price change for each id. It then filters out all rows where a price change occurred.
The use of window functions should be highly efficient, taking advantage of indexes on (id, date) and (id, price, date).
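To see what the nested lag() query computes, the same logic can be sketched in plain Python (hypothetical data and names):

```python
from collections import defaultdict

def never_changed(rows):
    """rows: (id, date, price) tuples. Keep only the rows of ids whose
    price never changed, mirroring lag(price) plus the
    'price_change_date is null' filter."""
    by_id = defaultdict(list)
    for rid, d, p in rows:
        by_id[rid].append((d, p))
    kept = []
    for rid, hist in by_id.items():
        hist.sort()                                   # ORDER BY date
        prices = [p for _, p in hist]
        # lag(price): any adjacent pair differing means a price change
        if not any(a != b for a, b in zip(prices, prices[1:])):
            kept.extend((rid, d, p) for d, p in hist)
    return kept

rows = [("asdf", "2022-11-01", 10), ("asdf", "2022-11-02", 10),
        ("asdr1", "2022-11-01", 7), ("asdr1", "2022-11-02", 8)]
```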

ORACLE: Splitting a string into multiple rows

I am trying to split the string "HHHWWWHHHHWWWWWHHWWWWWHHWWWWWHH".
Is there any possibility to turn it into one row per character, like:
H
H
H
W
W
W
BRANCH_CODE YEAR MONTH HOLIDAY_LIST
001 2021 1 HHHWWWHHHHWWWWWHHWWWWWHHWWWWWHH
001 2021 2 WWWWWHHWWWWWHHWWWWWHHWHWWWHH
From Oracle 12, you can use:
SELECT branch_code, year, month, day, holiday
FROM branches
CROSS JOIN LATERAL (
SELECT LEVEL AS day,
SUBSTR(holiday_list, LEVEL, 1) AS holiday
FROM DUAL
CONNECT BY LEVEL <= LENGTH(holiday_list)
)
Which, for the sample data:
CREATE TABLE branches (BRANCH_CODE, YEAR, MONTH, HOLIDAY_LIST) AS
SELECT '001', 2021, 1, 'HHHWWWHHHHWWWWWHHWWWWWHHWWWWWHH' FROM DUAL UNION ALL
SELECT '001', 2021, 2, 'WWWWWHHWWWWWHHWWWWWHHWHWWWHH' FROM DUAL
Outputs:
BRANCH_CODE YEAR MONTH DAY HOLIDAY
001 2021 1 1 H
001 2021 1 2 H
001 2021 1 3 H
001 2021 1 4 W
...
001 2021 1 29 W
001 2021 1 30 H
001 2021 1 31 H
001 2021 2 1 W
001 2021 2 2 W
001 2021 2 3 W
...
001 2021 2 26 W
001 2021 2 27 H
001 2021 2 28 H
If it's Oracle:
with data AS (
select 'WWWWWHHWWWWWHHWWWWWHHWHWWWHH' AS letters
from dual
)
select substr (
letters,
level,
1
) value
from data
connect by level <=
length ( letters )
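Outside the database the same one-row-per-character split is a one-liner; a Python sketch with the question's string, numbering characters the way CONNECT BY LEVEL does:

```python
holiday_list = "HHHWWWHHHHWWWWWHHWWWWWHHWWWWWHH"

# (day, holiday) pairs, one per character, days starting at 1
rows = [(day, ch) for day, ch in enumerate(holiday_list, start=1)]
```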

Counts and divide from two different selects with dates

I have a table with this kind of structure (Sample only)
ID | STATUS | DATE |
--- -------- ------
1 OPEN 31-01-2022
2 CLOSE 15-11-2021
3 CLOSE 21-10-2021
4 OPEN 11-10-2021
5 OPEN 28-09-2021
I would like to know the ratio of CLOSE to OPEN records by week, i.e. count(close)/count(open) where close.week = open.week.
If there are no matching values it needs to return 0, of course.
I got to this query below
SELECT *
FROM
(SELECT COUNT(*) AS 'CLOSE', DATEPART(WEEK, DATE) AS 'WEEKSA', DATEPART(YEAR, DATE) AS 'YEARA' FROM TABLE
WHERE STATUS IN ('CLOSE')
GROUP BY DATEPART(WEEK, DATE),DATEPART(YEAR, DATE)) TMPA
FULL OUTER JOIN
(SELECT COUNT(*) AS 'OPEN', DATEPART(WEEK, DATE) AS 'WEEKSB', DATEPART(YEAR, DATE) AS 'YEARB' FROM TABLE
WHERE STATUS IN ('OPEN')
GROUP BY DATEPART(WEEK, DATE),DATEPART(YEAR, DATE)) TMPB
ON TMPA.WEEKSA = TMPB.WEEKSB AND TMPA.YEARA = TMPB.YEARB
My results are as below (sample only)
close | weeksa | yeara | open | weeksb | yearb |
------ -------- ------ ------- ------- ------
3 2 2021
1 3 2021
1 4 2021
2 20 2021 2 20 2021
7 22 2021
2 23 2021
7 26 2021
7 27 2021
2 28 2021 14 28 2021
2 29 2021
10 30
24 31 2021
2 32 2021 5 32
4 33 2021
1 34 2021 13 34 2021
6 35 2021
1 36 2021
1 38 2021
1 39 2021
2 41 2021
4 43 2021
1 45 2021
2 46 2021 25 46 2021
1 47 2021 5 47 2021
4 48 2021
1 49 2021 20 49 2021
1 50 2021 17 50 2021
1 51 2021
How do I do the math now?
If I wrap this in another SELECT the query fails, so I guess either my syntax is bad or the whole concept is wrong.
The required result should look like this (Sample)
WEEK | YEAR | RATIO |
----- ------ -------
2 2021 0
3 2021 0
4 2021 0
5 2021 0.93
20 2021 0.1
22 2021 0
23 2021 0
26 2021 0
1 2022 0.75
2 2022 0.23
4 2022 0.07
Cheers!
I have added some test data to check the logic; it is included in the code below.
;with cte as(
select 1 ID, 'OPEN' as STATUS, cast('2021-01-31' as DATE) DATE
union select 10 ID, 'CLOSE' as STATUS, cast('2021-01-31' as DATE) DATE
union select 11 ID, 'CLOSE' as STATUS, cast('2021-01-31' as DATE) DATE
union select 12 ID, 'CLOSE' as STATUS, cast('2021-01-31' as DATE) DATE
union select 22 ID, 'CLOSE' as STATUS, cast('2021-01-31' as DATE) DATE
union select 32 ID, 'CLOSE' as STATUS, cast('2021-01-31' as DATE) DATE
union select 2,'CLOSE',cast('2021-11-28' as DATE)
union select 3,'CLOSE',cast('2021-10-21' as DATE)
union select 8,'CLOSE',cast('2021-10-21' as DATE)
union select 9,'CLOSE',cast('2021-10-21' as DATE)
union select 4,'OPEN', cast('2021-10-11' as DATE)
union select 5,'CLOSE', cast('2021-09-28' as DATE)
union select 6,'OPEN', cast('2021-09-27' as DATE)
union select 7,'CLOSE', cast('2021-09-26' as DATE) )
, cte2 as (
select DATEPART(WEEK,date) as week_number,* from cte)
,cte3 as(
select week_number,year(date) yr,count(case when status = 'open' then 1 end)open_count,count(case when status <> 'open' then 1 end) close_count from cte2 group by week_number,year(date))
select week_number as week,yr as year,
cast(case when open_count = 0 then 1.0 else open_count end /
case when close_count = 0 then 1.0 else close_count end as numeric(3,2)) as ratio
from cte3
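The conditional-count idea above can be sketched in Python (hypothetical rows and names; this follows the asker's count(close)/count(open) definition, with a guard so an empty OPEN count does not divide by zero):

```python
from collections import defaultdict

def weekly_ratio(rows):
    """rows: (status, week, year) tuples.
    Returns {(week, year): close_count / open_count}, counting both
    statuses in a single pass over the data."""
    counts = defaultdict(lambda: [0, 0])        # (week, year) -> [open, close]
    for status, week, year in rows:
        counts[(week, year)][0 if status == "OPEN" else 1] += 1
    return {key: round(closed / (opened or 1), 2)
            for key, (opened, closed) in counts.items()}

rows = [("OPEN", 20, 2021), ("CLOSE", 20, 2021), ("CLOSE", 20, 2021),
        ("OPEN", 21, 2021)]
```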

How to query last n months average while grouping by a different field

I have the following table:
deviceid year month value
432 2019 3 2571
432 2019 2 90
432 2019 1 314
432 2018 12 0
432 2018 11 100
437 2019 2 0
437 2019 1 1
437 2018 12 0
437 2018 11 3
437 2018 10 2
437 2018 9 0
437 2018 8 2
I need a query that for each deviceId retrieves the average value for the last 3 months.
Eg:
deviceId Average
432 991.7
437 0.3
I'm quite sure I have to use partitions or something similar.
I'm thinking it should look something like this, but I can't figure out how to get the average value of just the last 3 months.
select deviceId, avg(value) from Table group by deviceid
You can use ROW_NUMBER to figure out which months are the last 3 per deviceid.
Sample Data:
DECLARE @t TABLE (deviceid INT, year INT, month INT, value NUMERIC(12,2))
INSERT INTO @t VALUES
(432, 2019 , 3 ,2571),
(432, 2019 , 2 ,90 ),
(432, 2019 , 1 ,314 ),
(432, 2018 , 12,0 ),
(432, 2018 , 11,100 ),
(437, 2019 , 2 ,0 ),
(437, 2019 , 1 ,1 ),
(437, 2018 , 12,0 ),
(437, 2018 , 11,3 ),
(437, 2018 , 10,2 ),
(437, 2018 , 9 ,0 ),
(437, 2018 , 8 ,2 )
Query:
;WITH cte AS (
SELECT deviceid, value,
ROW_NUMBER() OVER (PARTITION BY deviceid ORDER BY year DESC, month DESC) rn
FROM @t
)
SELECT deviceid, AVG(value) average
FROM cte
WHERE rn BETWEEN 1 AND 3
GROUP BY deviceid
Returns:
deviceid average
432 991.666666
437 0.333333
Another option would be to calculate a date from your year and month columns to use for comparison.
DECLARE @ComparisonDate DATE = '20190301';
--Or, perhaps, EOMONTH(GETDATE(),-1) to get the last day of last month?
SELECT
deviceId,
AVG(value) AS AvgValue
FROM
@t as t
WHERE
DATEFROMPARTS(year,month,1) >= DATEADD(MONTH,-3,@ComparisonDate)
GROUP BY
deviceId;
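Both answers boil down to "take each device's three most recent (year, month) rows and average them"; a Python sketch of that logic with the question's data (hypothetical function name):

```python
from collections import defaultdict

def last_n_avg(rows, n=3):
    """rows: (deviceid, year, month, value) tuples. Average of each
    device's n most recent (year, month) rows, like the ROW_NUMBER pick."""
    by_dev = defaultdict(list)
    for dev, year, month, value in rows:
        by_dev[dev].append((year, month, value))
    return {dev: round(sum(v for *_, v in sorted(hist, reverse=True)[:n]) / n, 1)
            for dev, hist in by_dev.items()}

rows = [(432, 2019, 3, 2571), (432, 2019, 2, 90), (432, 2019, 1, 314),
        (432, 2018, 12, 0), (432, 2018, 11, 100),
        (437, 2019, 2, 0), (437, 2019, 1, 1), (437, 2018, 12, 0)]
```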

How to shift a year-week field in bigquery

This question is about shifting values of a year-week field in bigquery.
run_id year_week value
0001 201451 13
0001 201452 6
0001 201503 3
0003 201351 8
0003 201352 5
0003 201403 1
Here, for each year the week can range from 01 to 53. For example, the last week of 2014 is 201452, while the last week of 2015 is 201553.
Now I want to shift the values for each year_week in each run_id forward by 5 weeks. Weeks that have no value are assumed to have a value of 0. For example, the output for the example table above should look like this:
run_id year_week value
0001 201504 13
0001 201505 6
0001 201506 0
0001 201507 0
0001 201508 3
0003 201404 8
0003 201405 5
0003 201406 0
0003 201407 0
0003 201408 1
Explanation of the output: In the table above for run_id 0001 the year_week 201504 has a value of 13 because in the input table we had a value of 13 for year_week 201451 which is 5 weeks before 201504.
I could create a table programmatically, mapping each year_week to its shifted year_week, and then do a join to get the output, but I was wondering whether there is a way to do it using just SQL.
#standardSQL
WITH `project.dataset.table` AS (
SELECT '001' run_id, 201451 year_week, 13 value UNION ALL
SELECT '001', 201452, 6 UNION ALL
SELECT '001', 201503, 3
), weeks AS (
SELECT 100 * year + week year_week
FROM UNNEST([2013, 2014, 2015, 2016, 2017]) year,
UNNEST(GENERATE_ARRAY(1, IF(EXTRACT(ISOWEEK FROM DATE(1+year,1,1)) = 1, 52, 53))) week
), temp AS (
SELECT i.run_id, w.year_week, d.year_week week2, value
FROM weeks w
CROSS JOIN (SELECT DISTINCT run_id FROM `project.dataset.table`) i
LEFT JOIN `project.dataset.table` d
USING(year_week, run_id)
)
SELECT * FROM (
SELECT run_id, year_week,
SUM(value) OVER(win) value
FROM temp
WINDOW win AS (
PARTITION BY run_id ORDER BY year_week ROWS BETWEEN 5 PRECEDING AND 5 PRECEDING
)
)
WHERE NOT value IS NULL
ORDER BY run_id, year_week
with result as
Row run_id year_week value
1 001 201504 13
2 001 201505 6
3 001 201508 3
If you need to "preserve" the zero rows, just change this portion:
SELECT i.run_id, w.year_week, d.year_week week2, value
FROM weeks w
to
SELECT i.run_id, w.year_week, d.year_week week2, IFNULL(value, 0) value
FROM weeks w
or
SUM(value) OVER(win) value
FROM temp
to
SUM(IFNULL(value, 0)) OVER(win) value
FROM temp
If you have data in the table for all year-weeks, then you can do:
with yw as (
select year_week, row_number() over (order by year_week) as seqnum
from t
group by year_week
)
select t.*, yw5.year_week as new_year_week
from t join
yw
on t.year_week = yw.year_week left join
yw yw5
on yw5.seqnum = yw.seqnum + 5;
If you don't have a table of year weeks, then I would advise you to create such a table, so you can do such manipulations -- or a more general calendar table.
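The week arithmetic behind all of this (carrying a shift across 52- vs 53-week year boundaries) can also be checked outside SQL. A minimal Python sketch, assuming as in the question that only 2015 has 53 weeks:

```python
def weeks_in_year(year):
    # Per the question, only 2015 has 53 weeks in this range
    return 53 if year == 2015 else 52

def shift_year_week(year_week, n):
    """Shift a YYYYWW key forward by n weeks (n >= 0)."""
    year, week = divmod(year_week, 100)
    week += n
    while week > weeks_in_year(year):
        week -= weeks_in_year(year)
        year += 1
    return 100 * year + week

# Shift the question's sample values forward by 5 weeks
shifted = {shift_year_week(yw, 5): v
           for yw, v in {201451: 13, 201452: 6, 201503: 3}.items()}
```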