Moving average of 2 columns - sql

Hello, I have a problem. I know how to calculate a moving average over the last 3 months using Oracle analytic functions, but my situation is a little different:
Month-----ProductType-----Sales----------Average(HAVE TO FIND THIS)
1---------A---------------10
1---------B---------------12
1---------C---------------17
2---------A---------------21
3---------C---------------2
3---------B---------------21
4---------B---------------23
5
6
7
8
9
So we have sales for each month and each product type. I need to calculate, for each product, the moving average over the previous 3 months.
Example:
For month 4 and product B it would be (21+0+12)/3.
Any ideas?

Another option is to use the windowing clause of analytic functions:
with my_data as (
select 1 as month, 'A' as product, 10 as sales from dual union all
select 1 as month, 'B' as product, 12 as sales from dual union all
select 1 as month, 'C' as product, 17 as sales from dual union all
select 2 as month, 'A' as product, 21 as sales from dual union all
select 3 as month, 'C' as product, 2 as sales from dual union all
select 3 as month, 'B' as product, 21 as sales from dual union all
select 4 as month, 'B' as product, 23 as sales from dual
)
select
month,
product,
sales,
nvl(sum(sales)
over (partition by product order by month
range between 3 preceding and 1 preceding),0)/3 as average_sales
from my_data
order by month, product
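To check the frame arithmetic without an Oracle instance, here is a minimal sketch that runs an equivalent query in SQLite through Python's sqlite3 module. It assumes SQLite 3.28 or later (needed for RANGE with a numeric offset) and swaps Oracle's nvl and dual for COALESCE and a plain table. The function name moving_averages is made up for this example.

```python
import sqlite3

# Sample rows from the question: (month, product, sales).
SALES = [
    (1, "A", 10), (1, "B", 12), (1, "C", 17),
    (2, "A", 21), (3, "C", 2), (3, "B", 21), (4, "B", 23),
]

# SQLite translation of the Oracle query (COALESCE instead of nvl).
QUERY = """
SELECT month, product,
       COALESCE(SUM(sales) OVER (
           PARTITION BY product ORDER BY month
           RANGE BETWEEN 3 PRECEDING AND 1 PRECEDING), 0) / 3.0 AS average_sales
FROM my_data
ORDER BY month, product
"""


def moving_averages(rows):
    """Return {(month, product): average of the previous 3 months' sales}."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE my_data (month INTEGER, product TEXT, sales INTEGER)")
    con.executemany("INSERT INTO my_data VALUES (?, ?, ?)", rows)
    result = {(month, product): avg for month, product, avg in con.execute(QUERY)}
    con.close()
    return result


averages = moving_averages(SALES)
print(averages[(4, "B")])  # months 1-3 for product B: (12 + 0 + 21) / 3
```

Because the frame is defined on the month value rather than on row positions, the missing month 2 for product B simply contributes nothing to the sum.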

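SELECT month,
productType,
sales,
-- LAG steps back by rows, not months: this matches the 3-month window only
-- when every product has a row for every month. NVL treats missing rows as 0.
(nvl(lag(sales, 3) over (partition by productType order by month), 0) +
nvl(lag(sales, 2) over (partition by productType order by month), 0) +
nvl(lag(sales, 1) over (partition by productType order by month), 0)) / 3 as moving_avg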
FROM your_table_name
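One caveat worth illustrating: LAG steps back by rows, not by months, so it agrees with a RANGE-based frame only while the gaps between a product's rows stay small. The sketch below compares the two in SQLite via Python's sqlite3 (assuming SQLite 3.28+). The month 8 row is hypothetical, added only to create a gap longer than three months, and compare_lag_and_range is an illustrative name.

```python
import sqlite3

# Product B rows from the question, plus one hypothetical row for month 8.
ROWS = [(1, "B", 12), (3, "B", 21), (4, "B", 23), (8, "B", 5)]

QUERY = """
SELECT month,
       (COALESCE(LAG(sales, 3) OVER w, 0) +
        COALESCE(LAG(sales, 2) OVER w, 0) +
        COALESCE(LAG(sales, 1) OVER w, 0)) / 3.0 AS lag_avg,
       COALESCE(SUM(sales) OVER (
           PARTITION BY product ORDER BY month
           RANGE BETWEEN 3 PRECEDING AND 1 PRECEDING), 0) / 3.0 AS range_avg
FROM my_data
WINDOW w AS (PARTITION BY product ORDER BY month)
ORDER BY month
"""


def compare_lag_and_range(rows):
    """Return {month: (lag_avg, range_avg)} for the given rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE my_data (month INTEGER, product TEXT, sales INTEGER)")
    con.executemany("INSERT INTO my_data VALUES (?, ?, ?)", rows)
    result = {month: (lag_avg, range_avg) for month, lag_avg, range_avg in con.execute(QUERY)}
    con.close()
    return result


comparison = compare_lag_and_range(ROWS)
for month, (lag_avg, range_avg) in comparison.items():
    print(month, lag_avg, range_avg)
```

For month 4 both columns agree, but for month 8 the LAG version still averages the sales of months 1, 3 and 4, while the RANGE version finds no rows in months 5 to 7.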

Related

How can i do a rolling 12 month sum when some year month values are missing?

I am calculating rolling sum as such:
select
city,
month_year,
person,
sum(total) over (partition by person,city order by month_year rows between 11 preceding and current row) rolling_one_year
from
(select
city,
month_year,
person,
sum(amount_dollar) as total
from db1 d
group by 1,2,3) ;
However, not every person has a value for every month_year. The rolling 12-month sum is correct only if the month values are consecutive.
If a month is missing for a person, e.g. 202208, the logic above sums the rows for 202201 through 202301, which spans 13 months.
How can I adapt my code above to ensure that the range of months selected is within 1 year?
A possible solution is to LEFT JOIN your data to the calendar table.
Here is a guide on how to create the calendar table if you don't have one: "Create a date table in hive".
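As a rough illustration of the calendar approach, the sketch below builds a small calendar table, LEFT JOINs the sparse monthly totals to it, and then applies the original ROWS BETWEEN 11 PRECEDING frame. It runs in SQLite through Python's sqlite3, not in Hive, and handles a single person only. The table names, the month_range helper and the sample totals (borrowed from the other answer in this thread) are assumptions for the example.

```python
import sqlite3

# Monthly totals for one person; 202208 is deliberately missing.
MONTHLY = [
    ("202201", 1), ("202202", 3), ("202203", 9), ("202204", 4),
    ("202205", 2), ("202206", 8), ("202207", 6), ("202209", 3),
    ("202210", 10), ("202211", 1), ("202212", 3), ("202301", 50),
]


def month_range(first, last):
    """Yield every 'YYYYMM' string from first to last, inclusive."""
    year, month = int(first[:4]), int(first[4:])
    while f"{year:04d}{month:02d}" <= last:
        yield f"{year:04d}{month:02d}"
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


# Every calendar month gets a row, so 11 PRECEDING rows means 11 preceding months.
QUERY = """
SELECT c.year_month,
       SUM(COALESCE(t.total, 0)) OVER (
           ORDER BY c.year_month
           ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS rolling_12m_sum
FROM calendar c
LEFT JOIN monthly_total t ON t.year_month = c.year_month
ORDER BY c.year_month
"""


def rolling_with_calendar(rows):
    """Return {year_month: rolling 12-month sum}, with gaps filled by the calendar."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE monthly_total (year_month TEXT, total INTEGER)")
    con.executemany("INSERT INTO monthly_total VALUES (?, ?)", rows)
    con.execute("CREATE TABLE calendar (year_month TEXT PRIMARY KEY)")
    months = [year_month for year_month, _ in rows]
    con.executemany(
        "INSERT INTO calendar VALUES (?)",
        [(ym,) for ym in month_range(min(months), max(months))],
    )
    result = dict(con.execute(QUERY).fetchall())
    con.close()
    return result


rolling = rolling_with_calendar(MONTHLY)
print(rolling["202301"])  # sums 202202 through 202301 only
```

With several persons and cities, the calendar would need to be cross joined with the distinct person and city combinations before the LEFT JOIN, so that each partition gets its own gap-free month list.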
You should use a logical window frame (RANGE) instead of ROWS. Consider the query below.
WITH monthly_total AS (
SELECT '201911' year_month, 4 total UNION ALL
SELECT '201912' year_month, 10 total UNION ALL
SELECT '202201' year_month, 1 total UNION ALL
SELECT '202202' year_month, 3 total UNION ALL
SELECT '202203' year_month, 9 total UNION ALL
SELECT '202204' year_month, 4 total UNION ALL
SELECT '202205' year_month, 2 total UNION ALL
SELECT '202206' year_month, 8 total UNION ALL
SELECT '202207' year_month, 6 total UNION ALL
SELECT '202209' year_month, 3 total UNION ALL
SELECT '202210' year_month, 10 total UNION ALL
SELECT '202211' year_month, 1 total UNION ALL
SELECT '202212' year_month, 3 total UNION ALL
SELECT '202301' year_month, 50 total
)
SELECT *, SUM(total) OVER w AS rolling_12m_sum
FROM monthly_total
WINDOW w AS (
ORDER BY CAST(SUBSTR(year_month, 1, 4) AS INTEGER) * 12 + CAST(SUBSTR(year_month, 5, 2) AS INTEGER)
RANGE BETWEEN 11 PRECEDING AND CURRENT ROW
) ORDER BY year_month;
I've ignored partition by person,city for simplicity.
Below would be helpful in case you're not familiar with RANGE
https://learnsql.com/blog/difference-between-rows-range-window-functions/
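To make the difference concrete, here is a small sketch that runs both frame types side by side on the sample data above. It uses SQLite through Python's sqlite3 (assuming SQLite 3.28+ for RANGE with an offset), and the function name compare_frames is made up for the example.

```python
import sqlite3

# Same sample data as in the query above; 202208 is missing.
MONTHLY = [
    ("201911", 4), ("201912", 10), ("202201", 1), ("202202", 3),
    ("202203", 9), ("202204", 4), ("202205", 2), ("202206", 8),
    ("202207", 6), ("202209", 3), ("202210", 10), ("202211", 1),
    ("202212", 3), ("202301", 50),
]

QUERY = """
SELECT year_month,
       SUM(total) OVER (
           ORDER BY year_month
           ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS rows_sum,
       SUM(total) OVER (
           ORDER BY CAST(SUBSTR(year_month, 1, 4) AS INTEGER) * 12
                  + CAST(SUBSTR(year_month, 5, 2) AS INTEGER)
           RANGE BETWEEN 11 PRECEDING AND CURRENT ROW) AS range_sum
FROM monthly_total
ORDER BY year_month
"""


def compare_frames(rows):
    """Return {year_month: (rows_sum, range_sum)}."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE monthly_total (year_month TEXT, total INTEGER)")
    con.executemany("INSERT INTO monthly_total VALUES (?, ?)", rows)
    result = {ym: (rows_sum, range_sum) for ym, rows_sum, range_sum in con.execute(QUERY)}
    con.close()
    return result


frames = compare_frames(MONTHLY)
for year_month, sums in frames.items():
    print(year_month, sums)
```

For 202301 the ROWS frame reaches back 11 rows and so picks up 202201, whereas the RANGE frame stops at 202202 because it compares month numbers rather than row positions.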

SQL count distinct over partition by cumulatively

I am using AWS Athena (Presto based) and I have this table named base:
id | category | year | month |
1 | a | 2021 | 6 |
1 | b | 2022 | 8 |
1 | a | 2022 | 11 |
2 | a | 2022 | 1 |
2 | a | 2022 | 4 |
2 | b | 2022 | 6 |
I would like to craft a query that counts the distinct values of the categories per id, cumulatively per month and year, but retaining the original columns:
id | category | year | month | sumC |
1 | a | 2021 | 6 | 1 |
1 | b | 2022 | 8 | 2 |
1 | a | 2022 | 11 | 2 |
2 | a | 2022 | 1 | 1 |
2 | a | 2022 | 4 | 1 |
2 | b | 2022 | 6 | 2 |
I've tried doing the following query with no success:
SELECT id,
category,
year,
month,
COUNT(category) OVER (PARTITION BY id ORDER BY year, month) AS sumC FROM base;
This results in 1, 2, 3, 1, 2, 3 which is not what I'm looking for. I'd rather need something like a COUNT(DISTINCT) inside a window function, though it's not supported as a construct.
I also tried the DENSE_RANK trick:
DENSE_RANK() OVER (PARTITION BY id ORDER BY category)
+ DENSE_RANK() OVER (PARTITION BY id ORDER BY category DESC)
- 1 as sumC
Though, because there is no ordering between year and month, it just results in 2, 2, 2, 2, 2, 2.
Any help is appreciated!
One option is to:
- create a new column that flags the first time each "category" is seen (partitioning on "id", "category" and ordering on "year", "month")
- compute a running sum over this column, partitioning on "id" only and keeping the same ordering
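WITH cte AS (
-- rn1 is 1 on the first row where an id meets a given category, 0 afterwards
SELECT *,
CASE WHEN ROW_NUMBER() OVER(
PARTITION BY id, category
ORDER BY year, month) = 1
THEN 1
ELSE 0
END AS rn1
FROM base
)
-- the running sum of rn1 is the cumulative count of distinct categories per id
SELECT id,
category,
year,
month,
SUM(rn1) OVER(
PARTITION BY id
ORDER BY year, month
) AS sumC
FROM cte
ORDER BY id, year, month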
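The first-seen flag plus running sum can be checked against the expected sumC column with a short sketch. It runs the same idea in SQLite through Python's sqlite3 rather than in Athena, using the base table's year and month column names. The first_seen alias and the function name are illustrative.

```python
import sqlite3

# Rows of the base table from the question: (id, category, year, month).
BASE = [
    (1, "a", 2021, 6), (1, "b", 2022, 8), (1, "a", 2022, 11),
    (2, "a", 2022, 1), (2, "a", 2022, 4), (2, "b", 2022, 6),
]

QUERY = """
WITH cte AS (
    SELECT *,
           CASE WHEN ROW_NUMBER() OVER (
                    PARTITION BY id, category ORDER BY year, month) = 1
                THEN 1 ELSE 0 END AS first_seen
    FROM base
)
SELECT id, category, year, month,
       SUM(first_seen) OVER (PARTITION BY id ORDER BY year, month) AS sumC
FROM cte
ORDER BY id, year, month
"""


def cumulative_distinct(rows):
    """Return the base rows with a cumulative distinct-category count appended."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE base (id INTEGER, category TEXT, year INTEGER, month INTEGER)")
    con.executemany("INSERT INTO base VALUES (?, ?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


counted = cumulative_distinct(BASE)
for row in counted:
    print(row)
```

The second occurrence of category a for id 1 gets a flag of 0, which is why the running sum stays at 2 instead of climbing to 3.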

take sum of last 7 days from the observed date in BigQuery

I have a table on which I want to compute the sum of revenue over the last 7 days from the observed day. Here is my table:
with temp as
(
select DATE('2019-06-29') as transaction_date, "x"as id, 0 as revenue
union all
select DATE('2019-06-30') as transaction_date, "x"as id, 80 as revenue
union all
select DATE('2019-07-04') as transaction_date, "x"as id, 64 as revenue
union all
select DATE('2019-07-06') as transaction_date, "x"as id, 64 as revenue
union all
select DATE('2019-07-11') as transaction_date, "x"as id, 75 as revenue
union all
select DATE('2019-07-12') as transaction_date, "x"as id, 0 as revenue
)
select * from temp
I want to take a sum of the last 7 days for each transaction_date. For instance, for the last record, which has transaction_date = 2019-07-12, I would like to add another column that adds up revenue for the last 7 days from 2019-07-12 (that is, back to 2019-07-05), so the value of the new rollup_revenue column would be 0 + 75 + 64 = 139. Likewise, I need to compute the rollup for all the dates for every ID.
Note: the ID may or may not appear daily.
I have tried a self join but I am unable to figure it out.
Below is for BigQuery Standard SQL
#standardSQL
SELECT *,
SUM(revenue) OVER(
PARTITION BY id ORDER BY UNIX_DATE(transaction_date)
RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
) rollup_revenue
FROM `project.dataset.temp`
You can test this using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.temp` AS (
SELECT DATE '2019-06-29' AS transaction_date, 'x' AS id, 0 AS revenue UNION ALL
SELECT '2019-06-30', 'x', 80 UNION ALL
SELECT '2019-07-04', 'x', 64 UNION ALL
SELECT '2019-07-06', 'x', 64 UNION ALL
SELECT '2019-07-11', 'x', 75 UNION ALL
SELECT '2019-07-12', 'x', 0
)
SELECT *,
SUM(revenue) OVER(
PARTITION BY id ORDER BY UNIX_DATE(transaction_date)
RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
) rollup_revenue
FROM `project.dataset.temp`
-- ORDER BY transaction_date
with the result:
Row transaction_date id revenue rollup_revenue
1 2019-06-29 x 0 0
2 2019-06-30 x 80 80
3 2019-07-04 x 64 144
4 2019-07-06 x 64 208
5 2019-07-11 x 75 139
6 2019-07-12 x 0 139
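One option uses a correlated subquery to find the rolling sum:
SELECT
id,
transaction_date,
revenue,
-- BETWEEN is inclusive at both ends, so INTERVAL 7 DAY spans 8 calendar days;
-- use INTERVAL 6 DAY for a strict 7-day window. The id match keeps IDs separate.
(SELECT SUM(t2.revenue) FROM temp t2 WHERE t2.id = t1.id AND t2.transaction_date
BETWEEN DATE_SUB(t1.transaction_date, INTERVAL 7 DAY) AND
t1.transaction_date) AS rev_7_days
FROM temp t1
ORDER BY
transaction_date;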
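The window length matters here: looking back 7 days with an inclusive BETWEEN covers 8 calendar days, while the RANGE BETWEEN 6 PRECEDING frame covers 7. The sketch below reproduces the correlated-subquery approach in SQLite through Python's sqlite3, using SQLite's date() in place of BigQuery's DATE_SUB and matching on id. The table name transactions and the function rolling_revenue are illustrative choices.

```python
import sqlite3

# Sample rows from the question: (transaction_date, id, revenue).
TRANSACTIONS = [
    ("2019-06-29", "x", 0), ("2019-06-30", "x", 80), ("2019-07-04", "x", 64),
    ("2019-07-06", "x", 64), ("2019-07-11", "x", 75), ("2019-07-12", "x", 0),
]

# The ? placeholder receives a date modifier such as '-7 days'.
QUERY = """
SELECT t1.transaction_date,
       (SELECT SUM(t2.revenue)
        FROM transactions t2
        WHERE t2.id = t1.id
          AND t2.transaction_date BETWEEN date(t1.transaction_date, ?)
                                      AND t1.transaction_date) AS rolling
FROM transactions t1
ORDER BY t1.transaction_date
"""


def rolling_revenue(days_back):
    """Return {transaction_date: revenue summed from days_back days earlier to that date}."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE transactions (transaction_date TEXT, id TEXT, revenue INTEGER)")
    con.executemany("INSERT INTO transactions VALUES (?, ?, ?)", TRANSACTIONS)
    result = dict(con.execute(QUERY, (f"-{days_back} days",)).fetchall())
    con.close()
    return result


eight_day_window = rolling_revenue(7)  # inclusive BETWEEN: 8 calendar days
seven_day_window = rolling_revenue(6)  # matches RANGE BETWEEN 6 PRECEDING
print(eight_day_window["2019-07-11"], seven_day_window["2019-07-11"])
```

On this data the two windows differ only for 2019-07-11, where the 8-day version also picks up the 2019-07-04 transaction.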

How to find the highest sales in each year in BigQuery?

The following table contains the phone name, number of items sold, month and year.
with table1 as(
select "iphone" as phone,3 as sold_out,"Jan" as month,2015 as year union all
select "iphone",10,"Feb",2015 union all
select "samsung",4,"March",2015 union all
select "Lava",14,"June",2016 union all
select "Lenova",8,"July",2016 union all
select "Lenova",10,"Sep",2016 union all
select "Motorola",8,"Jan",2017 union all
select "Nokia",7,"Jan",2017 union all
select "Nokia",3,"Feb",2017
)
and I would like to get an answer like this:
-----------------------------
year Phone sales
-----------------------------
2015 iphone 13
2016 lenova 18
2017 Nokia 10
-----------------------------
I haven't tried because honestly I don't know
Below is for BigQuery Standard SQL
#standardSQL
SELECT
year,
ARRAY_AGG(STRUCT(phone, sales) ORDER BY sales DESC LIMIT 1)[OFFSET(0)].*
FROM (
SELECT year, phone, SUM(sold_out) sales
FROM `project.dataset.table1`
GROUP BY year, phone
)
GROUP BY year
You can test / play with the above using dummy data from your question, as below:
#standardSQL
WITH `project.dataset.table1` AS(
SELECT "iphone" AS phone,3 AS sold_out,"Jan" AS month,2015 AS year UNION ALL
SELECT "iphone",10,"Feb",2015 UNION ALL
SELECT "samsung",4,"March",2015 UNION ALL
SELECT "Lava",14,"June",2016 UNION ALL
SELECT "Lenova",8,"July",2016 UNION ALL
SELECT "Lenova",10,"Sep",2016 UNION ALL
SELECT "Motorola",8,"Jan",2017 UNION ALL
SELECT "Nokia",7,"Jan",2017 UNION ALL
SELECT "Nokia",3,"Feb",2017
)
SELECT
year,
ARRAY_AGG(STRUCT(phone, sales) ORDER BY sales DESC LIMIT 1)[OFFSET(0)].*
FROM (
SELECT year, phone, SUM(sold_out) sales
FROM `project.dataset.table1`
GROUP BY year, phone
)
GROUP BY year
ORDER BY year
with the result:
Row year phone sales
1 2015 iphone 13
2 2016 Lenova 18
3 2017 Nokia 10
SELECT year AS year, phone AS Phone, sum(sold_out) AS sales
FROM table1
GROUP BY year, Phone
HAVING COUNT(Phone)=2
ORDER BY year ASC
;
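This gives the desired output for the sample data, but only because each year's top-selling phone happens to be the only one with exactly two rows. It is not a general way to find the yearly maximum.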
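A variant that does not depend on how many rows each phone has is to total the sales per year and phone, rank the totals within each year, and keep the top one. The sketch below shows this with ROW_NUMBER in SQLite through Python's sqlite3. It is an alternative technique, not the ARRAY_AGG query from the other answer, and the CTE names yearly and ranked are made up for the example.

```python
import sqlite3

# Sample rows from the question: (phone, sold_out, month, year).
TABLE1 = [
    ("iphone", 3, "Jan", 2015), ("iphone", 10, "Feb", 2015), ("samsung", 4, "March", 2015),
    ("Lava", 14, "June", 2016), ("Lenova", 8, "July", 2016), ("Lenova", 10, "Sep", 2016),
    ("Motorola", 8, "Jan", 2017), ("Nokia", 7, "Jan", 2017), ("Nokia", 3, "Feb", 2017),
]

QUERY = """
WITH yearly AS (
    SELECT year, phone, SUM(sold_out) AS sales
    FROM table1
    GROUP BY year, phone
),
ranked AS (
    SELECT year, phone, sales,
           ROW_NUMBER() OVER (PARTITION BY year ORDER BY sales DESC) AS rn
    FROM yearly
)
SELECT year, phone, sales
FROM ranked
WHERE rn = 1
ORDER BY year
"""


def top_phone_per_year(rows):
    """Return one (year, phone, sales) tuple per year for the best-selling phone."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE table1 (phone TEXT, sold_out INTEGER, month TEXT, year INTEGER)")
    con.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


top_sellers = top_phone_per_year(TABLE1)
for row in top_sellers:
    print(row)
```

If two phones tie for the top total in a year, ROW_NUMBER picks one of them arbitrarily. Using RANK instead would return both.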

Adding zero-value records in a query using cumulative analytical functions

Input and code:
with data as (
select 1 id, 'A' name, 'fruit' r_group, '2007' year, '04' month, 5 sales from dual union all
select 2 id, 'Z' name, 'fruit' r_group, '2007' year, '04' month, 99 sales from dual union all
select 3 id, 'A' name, 'fruit' r_group, '2008' year, '05' month, 10 sales from dual union all
select 4 id, 'B' name, 'vegetable' r_group, '2008' year, '07' month, 20 sales from dual
)
select year,
month,
r_group,
sum(sales) sales,
sum(opening) opening,
sum(closing) closing
from (
select t.*,
(sum(sales) over (partition by name, r_group
order by year, month
rows between unbounded preceding and current row
) -sales ) as opening,
sum(sales) over (partition by name, r_group
order by year, month
rows between unbounded preceding and current row
) as closing
from data t
)
group by year, month, r_group
order by year, month
Output:
year | month | r_group | sales | opening | closing |
2007 | 04 | fruit | 104 | 0 | 104 |
2008 | 05 | fruit | 10 | 5 | 15 |
2008 | 07 | vegetable | 20 | 0 | 20 |
I want the output to be like the following:
year | month | r_group | sales | opening | closing |
2007 | 04 | fruit | 104 | 0 | 104 |
2008 | 05 | fruit | 10 | 104 | 114 |
2008 | 07 | vegetable | 20 | 0 | 20 |
I can achieve the desired output only by adding a zero-valued record to the data for month = 05 and name = 'Z', like this:
select 1 id, 'A' name, 'fruit' r_group, '2007' year, '04' month, 5 sales from dual union all
select 2 id, 'Z' name, 'fruit' r_group, '2007' year, '04' month, 99 sales from dual union all
select 3 id, 'A' name, 'fruit' r_group, '2008' year, '05' month, 10 sales from dual union all
select 4 id, 'Z' name, 'fruit' r_group, '2008' year, '05' month, 0 sales from dual union all
select 5 id, 'B' name, 'vegetable' r_group, '2008' year, '07' month, 20 sales from dual
However, I want to know if I can do this as part of the select query without having to edit the data itself.
EDIT
The inner select statement will populate a database table with the detailed version: year, month, name, r_group, opening, closing. In other words, the result of this query will be used to populate the db table, and the aggregation done by the outer query will happen afterwards:
select t.*,
(sum(sales) over (partition by name, r_group
order by year, month
rows between unbounded preceding and current row
) -sales ) as opening,
sum(sales) over (partition by name, r_group
order by year, month
rows between unbounded preceding and current row
) as closing
from data t
Then I'll use a third-party analytical tool to aggregate on r_group only, without including the name. But the year, month, name, r_group detail must exist in the background.
EDIT 2
In other words, I'm trying to dynamically add missing data. For instance, if name = 'Z' exists in 2007,04 but does NOT exist in 2008,05, the cumulative function fails once it gets to 2008, because there is no name = 'Z' row in 2008 to carry the running total forward.
Instead of CURRENT ROW, you can end the frame at 1 PRECEDING to sum only up to the previous row, after first aggregating by year, month and r_group:
with data as (
select 1 id, 'A' name, 'fruit' r_group, '2007' year, '04' month, 5 sales from dual union all
select 2 id, 'Z' name, 'fruit' r_group, '2007' year, '04' month, 99 sales from dual union all
select 3 id, 'A' name, 'fruit' r_group, '2008' year, '05' month, 10 sales from dual union all
select 4 id, 'B' name, 'vegetable' r_group, '2008' year, '07' month, 20 sales from dual )
select t.*,
coalesce(sum(sales) over (partition by r_group order by year, month rows between unbounded preceding and 1 preceding),0) opening,
sum(sales) over (partition by r_group order by year, month rows between unbounded preceding and current row) closing
from (
select year, month, r_group, sum(sales) sales
from data
group by year, month, r_group
) t
order by 3,1,2;
year month r_group sales opening closing
---------------------------------------------------
2007 04 fruit 104 0 104
2008 05 fruit 10 104 114
2008 07 vegetable 20 0 20
Group by R_GROUP, YEAR and MONTH first then use the analytical query:
SELECT t.*,
SUM( sales ) OVER ( PARTITION BY r_group ORDER BY year, month ) - sales
AS opening,
SUM( sales ) OVER ( PARTITION BY r_group ORDER BY year, month ) AS closing
FROM (
SELECT r_group,
year,
month,
SUM( sales ) AS sales
FROM data
GROUP BY r_group, year, month
) t
ORDER BY year, month
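To see that aggregating first yields the requested opening and closing figures, here is a small check of the same query shape in SQLite through Python's sqlite3. The sample rows come from the question, while the function name opening_closing is an assumption for this sketch.

```python
import sqlite3

# Sample rows from the question: (id, name, r_group, year, month, sales).
DATA = [
    (1, "A", "fruit", "2007", "04", 5),
    (2, "Z", "fruit", "2007", "04", 99),
    (3, "A", "fruit", "2008", "05", 10),
    (4, "B", "vegetable", "2008", "07", 20),
]

# Collapse names into one row per group and month, then run the cumulative sum.
QUERY = """
SELECT year, month, r_group, sales,
       SUM(sales) OVER (PARTITION BY r_group ORDER BY year, month) - sales AS opening,
       SUM(sales) OVER (PARTITION BY r_group ORDER BY year, month) AS closing
FROM (
    SELECT r_group, year, month, SUM(sales) AS sales
    FROM data
    GROUP BY r_group, year, month
) t
ORDER BY year, month
"""


def opening_closing(rows):
    """Return (year, month, r_group, sales, opening, closing) per group and month."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE data (id INTEGER, name TEXT, r_group TEXT, "
        "year TEXT, month TEXT, sales INTEGER)"
    )
    con.executemany("INSERT INTO data VALUES (?, ?, ?, ?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


balances = opening_closing(DATA)
for row in balances:
    print(row)
```

The original query partitioned by name as well, so the 99 sold under name Z never reached the 2008 row for name A. Dropping name from the partition is what carries the full 104 forward.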
Update:
This will also include the name in the output:
SELECT t.*,
SUM( sales ) OVER ( PARTITION BY r_group, dt ) AS r_group_month_sales,
COALESCE(
SUM( sales ) OVER (
PARTITION BY r_group
ORDER BY dt
RANGE BETWEEN UNBOUNDED PRECEDING AND INTERVAL '1' MONTH PRECEDING
),
0
) AS opening,
SUM( sales ) OVER (
PARTITION BY r_group
ORDER BY dt
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS closing
FROM (
SELECT d.*,
TO_DATE( year || month, 'YYYYMM' ) AS dt
FROM data d
) t
ORDER BY dt
Output:
ID NAME R_GROUP YEAR MONTH SALES DT R_GROUP_MONTH_SALES OPENING CLOSING
-- ---- --------- ---- ----- ----- ---------- ------------------- ------- -------
1 A fruit 2007 04 5 2007-04-01 104 0 104
2 Z fruit 2007 04 99 2007-04-01 104 0 104
3 A fruit 2008 05 10 2008-05-01 10 104 114
4 B vegetable 2008 07 20 2008-07-01 20 0 20
You can then do whatever processing you want on the output of this query.
Maybe something like this:
SELECT year,
month,
r_group,
MAX( r_group_month_sales ) AS sales,
MAX( opening ) AS opening,
MAX( closing ) AS closing,
YOUR_THIRD_PARTY_AGGREGATION_FUNCTION( column_names ) AS other
FROM (
-- insert the query above
)
GROUP BY year, month, r_group
ORDER BY year, month