Running average that resets when the increasing streak stops - SQL

Original Dataset
+---------+--------+------------+
| Product | Date   | Units Sold |
+---------+--------+------------+
| Prod A  | 1/1/19 | 100        |
| Prod A  | 1/2/19 | 200        |
| Prod A  | 1/3/19 | 300        |
| Prod A  | 1/4/19 | 136        |
| Prod A  | 1/5/19 | 116        |
| Prod A  | 1/6/19 | 120        |
| Prod A  | 1/7/19 | 140        |
| Prod A  | 1/8/19 | 160        |
+---------+--------+------------+
Desired Output (the last two columns are the goal)
+---------+--------+------------+---------------------+------+--------+--------------------+
| Product | Date   | Units Sold | Previous Units Sold | Diff | Streak | Streak Running Avg |
+---------+--------+------------+---------------------+------+--------+--------------------+
| Prod A  | 1/1/19 | 100        |                     |  100 |      1 |                100 |
| Prod A  | 1/2/19 | 200        | 100                 |  100 |      2 |                150 |
| Prod A  | 1/3/19 | 300        | 200                 |  100 |      3 |                200 |
| Prod A  | 1/4/19 | 136        | 300                 | -164 |      0 |                  0 |
| Prod A  | 1/5/19 | 116        | 136                 |  -20 |      0 |                  0 |
| Prod A  | 1/6/19 | 120        | 116                 |    4 |      1 |                120 |
| Prod A  | 1/7/19 | 140        | 120                 |   20 |      2 |                130 |
| Prod A  | 1/8/19 | 160        | 140                 |   20 |      3 |                140 |
+---------+--------+------------+---------------------+------+--------+--------------------+
My goal is to calculate the running average only while the difference between the current day's and previous day's sales is positive (i.e. only calculate the running average while the streak > 0).
For example, days 1, 2, and 3 each have a positive diff of 100, so the running average for
Day 1 is 100 (100/1)
Day 2 is 150 ((100+200)/2)
Day 3 is 200 ((100+200+300)/3)
My Query
select *,
       case when flag = 1
            then sum(units) over (partition by product order by order_date)
            else 0 end as running_avg_sum
from (
    select *,
           lag(units, 1) over (partition by product order by order_date) as previous_day_units,
           case when units > previous_day_units then 1 else 0 end as flag
    from (
        select product,
               order_date,
               sum(units_sold) as units
        from product_table
        group by 1, 2
    ) tbl
) tbl2
But the above query throws:
Invalid operation: Aggregate window functions with an ORDER BY clause require a frame clause;
I know how to resolve the error by adding rows between unbounded preceding and current row to the window function, but that way it would average over all preceding rows. I'm not sure how to achieve the desired output.
If there's a way to define the frame boundaries so the average resets with each streak, that'd be really helpful.
Any help is appreciated. Thank you!
Insert commands in case you want to replicate:
CREATE TABLE product_table
(
    product varchar(200),
    order_date timestamp,
    units_sold bigint
);
INSERT INTO product_table VALUES ('Prod A', '2019-01-01', 100);
INSERT INTO product_table VALUES ('Prod A', '2019-01-02', 200);
INSERT INTO product_table VALUES ('Prod A', '2019-01-03', 300);
INSERT INTO product_table VALUES ('Prod A', '2019-01-04', 136);
INSERT INTO product_table VALUES ('Prod A', '2019-01-05', 116);
INSERT INTO product_table VALUES ('Prod A', '2019-01-06', 120);
INSERT INTO product_table VALUES ('Prod A', '2019-01-07', 140);
INSERT INTO product_table VALUES ('Prod A', '2019-01-08', 160);

This is a somewhat tricky gaps and islands problem. Using the difference in row numbers method, we can try:
WITH cte1 AS (
    SELECT *,
           LAG(units_sold, 1, 0) OVER (PARTITION BY product ORDER BY order_date) AS prev_units_sold,
           units_sold - LAG(units_sold, 1, 0) OVER (PARTITION BY product ORDER BY order_date) AS diff
    FROM product_table
),
cte2 AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY product ORDER BY order_date) AS rn1,
           ROW_NUMBER() OVER (PARTITION BY product, diff > 0 ORDER BY order_date) AS rn2
    FROM cte1
)
SELECT product, order_date, units_sold, prev_units_sold, diff,
       CASE WHEN diff > 0
            THEN ROW_NUMBER() OVER (PARTITION BY product, diff > 0, rn1 - rn2 ORDER BY order_date)
            ELSE 0 END AS streak,
       CASE WHEN diff > 0
            THEN AVG(units_sold) OVER (PARTITION BY product, diff > 0, rn1 - rn2 ORDER BY order_date)
            ELSE 0 END AS running_avg
FROM cte2
ORDER BY product, order_date;
The first CTE computes the lagged units sold per product, ordered by order date. The second CTE computes two row-number sequences: the first simply ordered by order date, the second restarted separately for positive-diff and non-positive-diff records. The difference between the first and second sequence identifies the "islands" over which we want to display the streak and streak average. For the non-positive-diff records, we just report zero for the streak and streak average.
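A quick way to sanity-check the approach outside of Redshift is Python's built-in sqlite3 module (SQLite 3.25+ supports window functions). This is an illustrative sketch, not the original engine: it loads the question's sample data into an in-memory table and also folds the sign of diff into the final partition, so two islands of opposite sign can never collide on the same row-number difference:

```python
import sqlite3

# In-memory database loaded with the question's sample data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE product_table (product TEXT, order_date TEXT, units_sold INTEGER)")
con.executemany(
    "INSERT INTO product_table VALUES (?, ?, ?)",
    [("Prod A", f"2019-01-0{d}", u)
     for d, u in enumerate([100, 200, 300, 136, 116, 120, 140, 160], start=1)],
)

sql = """
WITH cte1 AS (                       -- diff to the previous day's sales
    SELECT *,
           units_sold - LAG(units_sold, 1, 0)
               OVER (PARTITION BY product ORDER BY order_date) AS diff
    FROM product_table
),
cte2 AS (                            -- row-number difference marks the islands
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY product ORDER BY order_date)
         - ROW_NUMBER() OVER (PARTITION BY product, diff > 0 ORDER BY order_date) AS grp
    FROM cte1
)
SELECT order_date, units_sold,
       CASE WHEN diff > 0
            THEN ROW_NUMBER() OVER (PARTITION BY product, diff > 0, grp ORDER BY order_date)
            ELSE 0 END AS streak,
       CASE WHEN diff > 0
            THEN AVG(units_sold) OVER (PARTITION BY product, diff > 0, grp ORDER BY order_date)
            ELSE 0 END AS running_avg
FROM cte2
ORDER BY order_date
"""
result = con.execute(sql).fetchall()
for row in result:
    print(row)
```

The streaks come out as 1, 2, 3, 0, 0, 1, 2, 3 and the running averages as 100, 150, 200, 0, 0, 120, 130, 140, matching the desired output for the sample data.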

Related

SQL query grouping by range

I have a table A with the following data:
+------+-------+----+--------+
| YEAR | MONTH | PA | AMOUNT |
+------+-------+----+--------+
| 2020 |     1 | N  |    100 |
| 2020 |     2 | N  |    100 |
| 2020 |     3 | O  |    100 |
| 2020 |     4 | N  |    100 |
| 2020 |     5 | N  |    100 |
| 2020 |     6 | O  |    100 |
+------+-------+----+--------+
I'd like to have the following result:
+---------+---------+--------+
| FROM    | TO      | AMOUNT |
+---------+---------+--------+
| 2020-01 | 2020-02 |    200 |
| 2020-03 | 2020-03 |    100 |
| 2020-04 | 2020-05 |    200 |
| 2020-06 | 2020-06 |    100 |
+---------+---------+--------+
My DB is DB2/400.
I have tried ROW_NUMBER partitioning and subqueries, but I can't figure out how to solve this.
I understand this as a gaps-and-islands problem, where you want to group together adjacent rows that have the same PA.
Here is an approach using the difference between row numbers to build the groups:
select min(year_month) year_month_start, max(year_month) year_month_end, sum(amount) amount
from (
    select a.*, year * 100 + month year_month,
           row_number() over(order by year, month) rn1,
           row_number() over(partition by pa order by year, month) rn2
    from a
) a
group by pa, rn1 - rn2
order by year_month_start
You can try the below -
select min(year)||'-'||min(month) as from_date, max(year)||'-'||max(month) as to_date, sum(amount) as amount
from (
    select *, row_number() over(order by year, month)
            - row_number() over(partition by pa order by year, month) as grprn
    from t1
) A
group by grprn, pa
order by grprn
This works in T-SQL; guess you can adapt it to DB2/400?
SELECT MIN(Dte) [From]
     , MAX(Dte) [To]
     -- , PA
     , SUM(Amount)
FROM (
    SELECT year * 100 + month Dte
         , Pa
         , Amount
         , ROW_NUMBER() OVER (PARTITION BY pa ORDER BY year * 100 + month)
           + 10000 - (Year * 100 + Month) rn
    FROM tabA a
) b
GROUP BY Pa
       , rn
ORDER BY [From]
       , [To]
The trick is the row number function, partitioned by PA and ordered by date. It counts up by one for each month, so when added to the descending quantity 10000 - (year * 100 + month), you get the same number for consecutive months with the same PA. You then group by PA and the grouping you made, rn, to get the groups, and then Bob's your uncle.
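For anyone who wants to experiment, the row-number-difference idea from the answers above can be replayed on the sample data with Python's bundled sqlite3 (an assumption for illustration; the question is about DB2/400). Grouping by PA as well as the difference guards against two runs of different PA values landing on the same difference value:

```python
import sqlite3

# Sample table from the question, loaded into an in-memory SQLite database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE a (year INT, month INT, pa TEXT, amount INT)")
con.executemany(
    "INSERT INTO a VALUES (?, ?, ?, ?)",
    [(2020, 1, "N", 100), (2020, 2, "N", 100), (2020, 3, "O", 100),
     (2020, 4, "N", 100), (2020, 5, "N", 100), (2020, 6, "O", 100)],
)

sql = """
SELECT MIN(year_month) AS ym_from, MAX(year_month) AS ym_to, SUM(amount) AS amount
FROM (
    -- the difference of the two row-number sequences is constant
    -- within each run of adjacent rows sharing the same PA
    SELECT year * 100 + month AS year_month, pa, amount,
           ROW_NUMBER() OVER (ORDER BY year, month)
         - ROW_NUMBER() OVER (PARTITION BY pa ORDER BY year, month) AS grp
    FROM a
) t
GROUP BY pa, grp
ORDER BY ym_from
"""
result = con.execute(sql).fetchall()
for row in result:
    print(row)
```

This yields the four groups from the desired result: 202001-202002 (200), 202003-202003 (100), 202004-202005 (200), and 202006-202006 (100); formatting the year-month back to "2020-01" style is then just string work.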

How do I update a value based on the dense_rank result?

I have a query (SQL Server 2017) that finds two different discounts on the same date.
WITH CTE AS (SELECT [date_id], [good_id], [store_id], [name_promo_mech], [discount],
RN = DENSE_RANK() OVER (PARTITION BY [date_id], [good_id], [store_id], [name_promo_mech]
ORDER BY [discount]) +
DENSE_RANK() OVER (PARTITION BY [date_id], [good_id], [store_id], [name_promo_mech]
ORDER BY [discount] DESC) - 1
FROM [dbo].[ds_promo_list_by_day_new] AS PL
)
SELECT * FROM CTE
WHERE RN > 1;
GO
Query result:
+------------+----------+---------+-----------------+----------+----+
| date_id    | store_id | good_id | name_promo_mech | discount | RN |
+------------+----------+---------+-----------------+----------+----+
| 2017-01-01 |        3 |   98398 | January 2017    |       15 |  2 |
| 2017-01-01 |        3 |   98398 | January 2017    |       40 |  2 |
| 2017-01-01 |        5 |   98398 | January 2017    |       15 |  3 |
| 2017-01-01 |        5 |   98398 | January 2017    |       40 |  3 |
| 2017-01-01 |        5 |   98398 | January 2017    |       30 |  3 |
+------------+----------+---------+-----------------+----------+----+
Now I want to make the discounts the same for every unique good_id, store_id, name_promo_mech combination in the source table, using this rule: if, for example, the rows with good_id = 98398, store_id = 3, name_promo_mech = N'January 2017' had 10 entries with a 15 discount and 20 entries with a 40 discount, then the 15 discount should be replaced with 40. However, if the number of entries for each discount is the same, the maximum discount is set for all of them.
Can I do this? The source table has about 100 million rows.
What you want to do is set the value to the mode (the statistical term for the most common value) for each date and key combination. You can use window functions:
with toupdate as (
select pl.*,
first_value(discount) over (partition by date_id, good_id, store_id, name_promo_mech order by cnt desc, discount desc) as mode_discount
from (select pl.*,
count(*) over (partition by date_id, good_id, store_id, name_promo_mech, discount) as cnt
from ds_promo_list_by_day_new pl
) pl
)
update toupdate
set discount = mode_discount
where mode_discount <> discount;
The subquery counts the number of rows for each discount within each key combination on each day. The outer query gets the discount with the largest count and, in the case of ties, the larger value.
The rest is a simple update.
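The mode-per-group step is the part worth convincing yourself of. Here is a small SELECT-only sketch of it using Python's sqlite3 with made-up data (an assumption for illustration: store 3 has more 40s than 15s, and store 5 has a tie between 15 and 30, so the larger discount should win):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE ds_promo_list_by_day_new
               (date_id TEXT, good_id INT, store_id INT, name_promo_mech TEXT, discount INT)""")
rows = [("2017-01-01", 98398, 3, "January 2017", d) for d in (15, 15, 40, 40, 40)] \
     + [("2017-01-01", 98398, 5, "January 2017", d) for d in (15, 15, 30, 30)]
con.executemany("INSERT INTO ds_promo_list_by_day_new VALUES (?, ?, ?, ?, ?)", rows)

sql = """
SELECT DISTINCT store_id,
       -- first row by (count desc, discount desc) is the mode,
       -- with ties broken toward the larger discount
       FIRST_VALUE(discount) OVER (
           PARTITION BY date_id, good_id, store_id, name_promo_mech
           ORDER BY cnt DESC, discount DESC) AS mode_discount
FROM (SELECT pl.*,
             COUNT(*) OVER (PARTITION BY date_id, good_id, store_id,
                            name_promo_mech, discount) AS cnt
      FROM ds_promo_list_by_day_new pl) pl
ORDER BY store_id
"""
result = con.execute(sql).fetchall()
print(result)
```

Store 3 resolves to 40 (three 40s beat two 15s) and store 5 resolves to 30 (a 2-2 tie, so the larger discount wins), which is exactly the value the UPDATE above would write back.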

Find groups in data by relative difference between records

I have some rows that are sorted by price:
| id | price |
|----|-------|
|  1 |  2.00 |
|  2 |  2.10 |
|  3 |  2.11 |
|  4 |  2.50 |
|  5 |  2.99 |
|  6 |  3.02 |
|  7 |  9.01 |
|  8 |  9.10 |
|  9 |  9.11 |
| 10 | 13.01 |
| 11 | 13.51 |
| 12 | 14.10 |
I need to group them in "price groups". An item belongs to a different group when difference in price between it and the previous item is greater than some fixed value, say 1.50.
So the expected result is something like this:
| MIN(price) | MAX(price) |
|------------|------------|
| 2.00       | 3.02       |
| 9.01       | 9.11       |
| 13.01      | 14.10      |
I'm not even sure how to call this type of grouping. Group by "rolling difference"? Not exactly...
Can this be done in SQL (or in Postgres in particular)?
Your results are consistent with looking at the previous value and saying a group starts when the difference is greater than 1.5. You can do this with lag(), a cumulative sum, and aggregation:
select min(price), max(price)
from (select t.*,
count(*) filter (where prev_price is null or prev_price < price - 1.5) over (order by price) as grp
from (select t.*,
lag(price) over (order by price) as prev_price
from t
) t
) t
group by grp
Thanks to Gordon Linoff for his answer; it is exactly what I was after!
I ended up using this query here simply because I understand it better. I guess it is more noobish, but so am I.
Both queries sort a table of 1M rows into 34 groups in about a second. This query is a bit more performant on 11M rows, sorting them into 380 groups in 15 seconds, vs 23 seconds in Gordon's answer.
SELECT results.group_i, MIN(results.price), MAX(results.price), AVG(results.price)
FROM (
SELECT *,
SUM(new_group) OVER (ORDER BY price ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS group_i
FROM (
SELECT annotated.*,
CASE
WHEN prev_price IS NULL OR price - prev_price > 1.5 THEN 1
ELSE 0
END AS new_group
FROM (
SELECT *,
LAG(price) OVER (ORDER BY price) AS prev_price
FROM prices
) AS annotated
) AS grouppable
) AS results
GROUP BY results.group_i
ORDER BY results.group_i;

SQL - group by a change of value in a given column

Apologies for the confusing title, I was unsure how to phrase it.
Below is my dataset:
+----+-----------------------------+--------+
| Id | Date                        | Amount |
+----+-----------------------------+--------+
| 1  | 2019-02-01 12:14:08.8056282 | 10     |
| 1  | 2019-02-04 15:23:21.3258719 | 10     |
| 1  | 2019-02-06 17:29:16.9267440 | 15     |
| 1  | 2019-02-08 14:18:14.9710497 | 10     |
+----+-----------------------------+--------+
It is an example of a bank trying to collect money from a debtor: first, 10% of the owed sum is attempted; if the card is successfully charged, 15% is attempted; if that throws an error (for example, insufficient funds), 10% is attempted again.
The desired output would be:
+----+--------+---------+
| Id | Amount | Attempt |
+----+--------+---------+
| 1  | 10     | 1       |
| 1  | 15     | 2       |
| 1  | 10     | 3       |
+----+--------+---------+
I have tried:
SELECT Id, Amount
FROM table1
GROUP BY Id, Amount
I am struggling to create a new column that tracks when the value in the Amount column changes, as I assume it could be used as another grouping variable to fix this.
If you just want the rows where the value changes, use lag():
select t.id, t.amount,
row_number() over (partition by id order by date) as attempt
from (select t.*, lag(amount) over (partition by id order by date) as prev_amount
from table1 t
) t
where prev_amount is null or prev_amount <> amount
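A runnable sketch of that query using Python's sqlite3 and the sample rows (timestamps truncated for brevity): the inner query tags each row with the previous amount, the WHERE keeps only the change points, and row_number() then numbers the attempts, since window functions are evaluated after the WHERE filter:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table1 (id INT, date TEXT, amount INT)")
con.executemany("INSERT INTO table1 VALUES (?, ?, ?)",
                [(1, "2019-02-01 12:14:08", 10),
                 (1, "2019-02-04 15:23:21", 10),
                 (1, "2019-02-06 17:29:16", 15),
                 (1, "2019-02-08 14:18:14", 10)])

sql = """
SELECT id, amount,
       ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS attempt
FROM (SELECT t.*, LAG(amount) OVER (PARTITION BY id ORDER BY date) AS prev_amount
      FROM table1 t) t
WHERE prev_amount IS NULL OR prev_amount <> amount
ORDER BY id, date
"""
result = con.execute(sql).fetchall()
print(result)
```

The duplicate 10 on 2019-02-04 is dropped (its prev_amount equals its amount), leaving attempts 1, 2, 3 for amounts 10, 15, 10, which is the desired output.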

Select Rows Whose Sum Value = 80% of the Total

Here is an example of the business problem.
I have 10 sales that resulted in negative margin.
We want to review these records; we generally use the 20/80 rule in reviews.
That is, 20 percent of the sales will likely represent 80 percent of the negative margin.
So with the below records....
+----+-------+
| ID | Value |
+----+-------+
|  1 |    30 |
|  2 |    30 |
|  3 |    20 |
|  4 |    10 |
|  5 |     5 |
|  6 |     5 |
|  7 |     2 |
|  8 |     2 |
|  9 |     1 |
| 10 |     1 |
+----+-------+
I would want to return...
+----+-------+
| ID | Value |
+----+-------+
|  1 |    30 |
|  2 |    30 |
|  3 |    20 |
|  4 |    10 |
+----+-------+
The total of Value is 106, so 80% of it is 84.8.
I need all the records, sorted descending by value, whose running sum gets me to at least 84.8.
We use Microsoft APS PDW SQL, but can process on SMP if needed.
Assuming window functions are supported, you can use:
with cte as (
    select id, value,
           sum(value) over(order by value desc, id) as running_sum,
           sum(value) over() as total
    from tbl
)
select id, value from cte where running_sum < total*0.8
union all
select id, value from (
    select top 1 id, value from cte
    where running_sum >= total*0.8
    order by value desc
) t
One way is to use running totals:
select
id,
value
from
(
select
id,
value,
sum(value) over () as total,
sum(value) over (order by value desc) as till_here,
sum(value) over (order by value desc rows between unbounded preceding and 1 preceding)
as till_prev
from mytable
) summed_up
where till_here * 1.0 / total <= 0.8
or (till_here * 1.0 / total >= 0.8 and coalesce(till_prev, 0) * 1.0 / total < 0.8)
order by value desc;
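Both answers can be checked on the sample numbers with Python's sqlite3 (illustrative; TOP 1 becomes a LIMIT 1 in a derived table here). The total is 106, the 80% mark is 84.8, and the descending running sums are 30, 60, 80, 90, ..., so ids 1-3 fall below the mark and id 4 is the row that crosses it:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl (id INT, value INT)")
con.executemany("INSERT INTO tbl VALUES (?, ?)",
                [(i, v) for i, v in enumerate([30, 30, 20, 10, 5, 5, 2, 2, 1, 1], start=1)])

sql = """
WITH cte AS (
    SELECT id, value,
           SUM(value) OVER (ORDER BY value DESC, id) AS running_sum,
           SUM(value) OVER () AS total
    FROM tbl
)
SELECT id, value FROM cte WHERE running_sum < total * 0.8
UNION ALL
SELECT id, value FROM (          -- the one row that crosses the 80% mark
    SELECT id, value FROM cte WHERE running_sum >= total * 0.8
    ORDER BY value DESC LIMIT 1
) t
"""
result = con.execute(sql).fetchall()
print(sorted(result))
```

This returns ids 1-4 with values 30, 30, 20, 10, matching the desired result; the id tiebreaker in the running sum's ORDER BY keeps the cumulative total well-defined when values repeat.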
This link could be useful; it shows how to calculate running totals:
https://www.codeproject.com/Articles/300785/Calculating-simple-running-totals-in-SQL-Server