Find groups in data by relative difference between records - sql

I have some rows that are sorted by price:
| id | price |
|----|-------|
| 1 | 2.00 |
| 2 | 2.10 |
| 3 | 2.11 |
| 4 | 2.50 |
| 5 | 2.99 |
| 6 | 3.02 |
| 7 | 9.01 |
| 8 | 9.10 |
| 9 | 9.11 |
| 10 | 13.01 |
| 11 | 13.51 |
| 12 | 14.10 |
I need to group them into "price groups". An item starts a new group when the difference in price between it and the previous item is greater than some fixed value, say 1.50.
So the expected result is something like this:
| MIN(price) | MAX(price) |
|------------|------------|
| 2.00 | 3.02 |
| 9.01 | 9.11 |
| 13.01 | 14.10 |
I'm not even sure what to call this type of grouping. Group by "rolling difference"? Not exactly...
Can this be done in SQL (or in Postgres in particular)?

Your results are consistent with looking at the previous value and saying a group starts when the difference is greater than 1.5. You can do this with lag(), a cumulative sum, and aggregation:
select min(price), max(price)
from (select t.*,
             count(*) filter (where prev_price is null or prev_price < price - 1.5) over (order by price) as grp
      from (select t.*,
                   lag(price) over (order by price) as prev_price
            from t
           ) t
     ) t
group by grp
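To make the mechanics concrete, here is a minimal sketch that runs the same lag-plus-running-sum idea against the question's sample data. It uses Python's built-in sqlite3 module purely as a harness (an assumption; the question is about Postgres), swaps the Postgres-specific count(*) filter (...) for a portable sum(case ...), and assumes the bundled SQLite is 3.25 or newer so that window functions are available. The table name prices and the function name price_groups are illustrative.

```python
import sqlite3

# Sample prices from the question.
PRICES = [2.00, 2.10, 2.11, 2.50, 2.99, 3.02, 9.01, 9.10, 9.11, 13.01, 13.51, 14.10]

QUERY = """
SELECT MIN(price), MAX(price)
FROM (
    SELECT price,
           -- running count of "group start" rows = group number
           SUM(CASE WHEN prev_price IS NULL OR price - prev_price > :gap
                    THEN 1 ELSE 0 END)
               OVER (ORDER BY price) AS grp
    FROM (
        SELECT price, LAG(price) OVER (ORDER BY price) AS prev_price
        FROM prices
    )
)
GROUP BY grp
ORDER BY grp
"""


def price_groups(prices, gap=1.5):
    """Return (min, max) per price group; a new group starts after a jump > gap."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE prices (id INTEGER PRIMARY KEY, price REAL)")
    conn.executemany("INSERT INTO prices (price) VALUES (?)", [(p,) for p in prices])
    rows = conn.execute(QUERY, {"gap": gap}).fetchall()
    conn.close()
    return rows


groups = price_groups(PRICES)
for low, high in groups:
    print(low, high)
```

Lowering gap splits the data into more groups, which is a quick way to check that the threshold behaves as intended.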

Thanks to Gordon Linoff for his answer; it is exactly what I was after!
I ended up using this query here simply because I understand it better. I guess it is more noobish, but so am I.
Both queries sort a table of 1M rows into 34 groups in about a second. This query is a bit more performant on 11M rows, sorting them into 380 groups in 15 seconds, vs 23 seconds in Gordon's answer.
SELECT results.group_i, MIN(results.price), MAX(results.price), AVG(results.price)
FROM (
    SELECT *,
           -- ORDER BY price makes the running sum follow the same order as LAG
           SUM(new_group) OVER (ORDER BY price ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS group_i
    FROM (
        SELECT annotated.*,
               CASE
                   WHEN prev_price IS NULL OR price - prev_price > 1.5 THEN 1
                   ELSE 0
               END AS new_group
        FROM (
            SELECT *,
                   LAG(price) OVER (ORDER BY price) AS prev_price
            FROM prices
        ) AS annotated
    ) AS grouppable
) AS results
GROUP BY results.group_i
ORDER BY results.group_i;

Related

Running average but reset when increasing streak stops

Original Dataset
+---------+--------+------------+
| Product | Date | Units Sold |
+---------+--------+------------+
| Prod A | 1/1/19 | 100 |
| Prod A | 1/2/19 | 200 |
| Prod A | 1/3/19 | 300 |
| Prod A | 1/4/19 | 136 |
| Prod A | 1/5/19 | 116 |
| Prod A | 1/6/19 | 120 |
| Prod A | 1/7/19 | 140 |
| Prod A | 1/8/19 | 160 |
+---------+--------+------------+
Desired Output (Last two columns)
+---------+--------+------------+---------------------+------+--------+--------------------+
| Product | Date | Units Sold | Previous Units Sold | Diff | Streak | Streak Running Avg |
+---------+--------+------------+---------------------+------+--------+--------------------+
| Prod A | 1/1/19 | 100 | | 100 | 1 | 100 |
| Prod A | 1/2/19 | 200 | 100 | 100 | 2 | 150 |
| Prod A | 1/3/19 | 300 | 200 | 100 | 3 | 200 |
| Prod A | 1/4/19 | 100 | 300 | -200 | 0 | 0 |
| Prod A | 1/5/19 | 200 | 100 | 100 | 1 | 200 |
| Prod A | 1/6/19 | 300 | 200 | 100 | 2 | 250 |
| Prod A | 1/7/19 | 200 | 300 | -100 | 0 | 0 |
| Prod A | 1/8/19 | 200 | 200 | 0 | 0 | 0 |
+---------+--------+------------+---------------------+------+--------+--------------------+
My goal is to calculate the running average only while the difference between the previous day's sales and the current day's sales is positive (i.e. when the streak is > 0, calculate the running average).
For example, Day 1, 2, & 3 have diff positive i.e. 100
therefore the running average for
Day 1 is 100 (100/1)
Day 2 is 150 (100+200)/2
Day 3 is 200 (100+200+300)/3
My Query:
select *,
CASE WHEN flag=1 then sum(units) over (partition by item_name order by order_date) else 0 end as running_avg_sum
from
(select *,
lag(units, 1) over (partition by product order by order_date) as previous_day_units
CASE WHEN units > previous_day_units then 1 else 0 end as flag
from (
SELECT product,
order_date,
SUM(units_sold) units
FROM product_table
GROUP BY 1, 2
) tbl
)
But above query throws
Invalid operation: Aggregate window functions with an ORDER BY clause require a frame clause;
I'm aware that I can resolve the error by adding rows between unbounded preceding and current row to the window function, but that way it would average all preceding rows. I'm not sure how I can achieve the desired output.
If there's a way to define boundaries, that'd be really helpful
Any help is appreciated. Thank you!
Insert commands in case you want to replicate:
CREATE TABLE product_table
(
    product varchar(200),
    order_date timestamp,
    units_sold bigint
);
INSERT INTO product_table VALUES ('Prod A', '2019-01-01', 100);
INSERT INTO product_table VALUES ('Prod A', '2019-01-02', 200);
INSERT INTO product_table VALUES ('Prod A', '2019-01-03', 300);
INSERT INTO product_table VALUES ('Prod A', '2019-01-04', 136);
INSERT INTO product_table VALUES ('Prod A', '2019-01-05', 116);
INSERT INTO product_table VALUES ('Prod A', '2019-01-06', 120);
INSERT INTO product_table VALUES ('Prod A', '2019-01-07', 140);
INSERT INTO product_table VALUES ('Prod A', '2019-01-08', 160);
This is a somewhat tricky gaps and islands problem. Using the difference in row numbers method, we can try:
WITH cte1 AS (
    SELECT *, LAG(units_sold, 1, 0) OVER (PARTITION BY product ORDER BY order_date) AS prev_units_sold,
        units_sold - LAG(units_sold, 1, 0) OVER (PARTITION BY product ORDER BY order_date) diff
    FROM product_table
),
cte2 AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY product ORDER BY order_date) rn1,
        ROW_NUMBER() OVER (PARTITION BY product, diff > 0 ORDER BY order_date) rn2
    FROM cte1
)
-- diff > 0 is part of each partition so that a positive island and a
-- non-positive island with the same rn1 - rn2 value are never merged.
-- The explicit frame on AVG avoids the "require a frame clause" error.
SELECT product, order_date, units_sold, prev_units_sold, diff,
    CASE WHEN diff > 0
         THEN ROW_NUMBER() OVER (PARTITION BY product, diff > 0, rn1 - rn2 ORDER BY order_date)
         ELSE 0 END AS streak,
    CASE WHEN diff > 0
         THEN AVG(units_sold) OVER (PARTITION BY product, diff > 0, rn1 - rn2 ORDER BY order_date
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
         ELSE 0 END AS running_avg
FROM cte2
ORDER BY product, order_date;
The first CTE computes the lagged units sold and the day-over-day difference, per product, ordered by order date. The second CTE computes two row number sequences. The first is simply ordered by order date. The second is numbered separately for the positive diff records and for the non-positive diff records. The difference between the two sequences stays constant within a run of consecutive records of the same kind, so it can be used to identify the "islands" over which we want to display the streak and streak average. For the non-positive diff records, we just report zero for the streak and streak average.
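The row-number-difference idea described above can be checked against the insert data from the question with a small sketch. It runs on Python's built-in sqlite3 module (an assumption made only so the example is runnable; SQLite 3.25+ is needed for window functions), and the names is_up, island, and streaks are illustrative. One deliberate choice in this sketch: the positive/non-positive flag is part of the island's partition, because the bare difference rn1 - rn2 can take the same value for a positive run and an adjacent non-positive run.

```python
import sqlite3

# units_sold for 2019-01-01 .. 2019-01-08, taken from the question's inserts
UNITS = [100, 200, 300, 136, 116, 120, 140, 160]

QUERY = """
WITH cte1 AS (
    SELECT product, order_date, units_sold,
           units_sold - LAG(units_sold, 1, 0)
               OVER (PARTITION BY product ORDER BY order_date) AS diff
    FROM product_table
),
cte2 AS (
    SELECT *,
           diff > 0 AS is_up,
           ROW_NUMBER() OVER (PARTITION BY product ORDER BY order_date) AS rn1,
           ROW_NUMBER() OVER (PARTITION BY product, diff > 0 ORDER BY order_date) AS rn2
    FROM cte1
)
SELECT order_date,
       CASE WHEN is_up THEN ROW_NUMBER() OVER island ELSE 0 END AS streak,
       CASE WHEN is_up THEN AVG(units_sold) OVER island ELSE 0 END AS running_avg
FROM cte2
-- one island = one run of consecutive rows with the same is_up flag
WINDOW island AS (PARTITION BY product, is_up, rn1 - rn2 ORDER BY order_date)
ORDER BY product, order_date
"""


def streaks(units):
    """Return (order_date, streak, running_avg) for consecutive daily unit counts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE product_table (product TEXT, order_date TEXT, units_sold INTEGER)"
    )
    conn.executemany(
        "INSERT INTO product_table VALUES ('Prod A', ?, ?)",
        [(f"2019-01-{day:02d}", u) for day, u in enumerate(units, start=1)],
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


result = streaks(UNITS)
for row in result:
    print(row)
```

Calling streaks([100, 50, 80]) exercises the up, down, up pattern, where the third day has to start a fresh streak of 1.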

How to calculate average of values without including the last value (sql)?

I have a table. I partition it by the id and want to calculate average of the values previous to the current, without including the current value. Here is a sample table:
+----+-------+------------+
| id | Value | Date |
+----+-------+------------+
| 1 | 51 | 2020-11-26 |
| 1 | 45 | 2020-11-25 |
| 1 | 47 | 2020-11-24 |
| 2 | 32 | 2020-11-26 |
| 2 | 51 | 2020-11-25 |
| 2 | 45 | 2020-11-24 |
| 3 | 47 | 2020-11-26 |
| 3 | 32 | 2020-11-25 |
| 3 | 35 | 2020-11-24 |
+----+-------+------------+
In this case, it means calculating the average of values for dates BEFORE 2020-11-26. This is the expected result
+----+-------+
| id | Value |
+----+-------+
| 1 | 46 |
| 2 | 48 |
| 3 | 33.5 |
+----+-------+
I have calculated it using ROWS N PRECEDING but it appears that this way I average N preceding + last row, and I want to exclude the last row (which is the most recent date in my case).
Here is my query:
SELECT ID,
(avg(Value) OVER(
PARTITION BY ID
ORDER BY Date
ROWS 9 PRECEDING )) as avg9
FROM t1
Then define the window frame in full, giving both its start and its end with BETWEEN:
SELECT ID,
(AVG(Value) OVER (PARTITION BY ID ORDER BY Date ROWS BETWEEN 9 PRECEDING AND 1 PRECEDING)) AS avg9
FROM t1;
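As a quick check of the frame above, the following sketch loads the sample table into an in-memory SQLite database through Python's sqlite3 module (an assumption for runnability; SQLite 3.25+ is required) and keeps only the most recent row per id, since the window query on its own returns one value per input row. The names avg_prev, max_date, and preceding_averages are illustrative.

```python
import sqlite3

# (id, value, date) rows from the question's sample table
ROWS = [
    (1, 51, "2020-11-26"), (1, 45, "2020-11-25"), (1, 47, "2020-11-24"),
    (2, 32, "2020-11-26"), (2, 51, "2020-11-25"), (2, 45, "2020-11-24"),
    (3, 47, "2020-11-26"), (3, 32, "2020-11-25"), (3, 35, "2020-11-24"),
]

QUERY = """
SELECT id, avg_prev
FROM (
    SELECT id, date,
           -- frame ends one row before the current row, so the current value is excluded
           AVG(value) OVER (PARTITION BY id ORDER BY date
                            ROWS BETWEEN 9 PRECEDING AND 1 PRECEDING) AS avg_prev,
           MAX(date) OVER (PARTITION BY id) AS max_date
    FROM t1
)
WHERE date = max_date
ORDER BY id
"""


def preceding_averages(rows):
    """Average of up to 9 earlier values per id, evaluated at each id's latest date."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (id INTEGER, value INTEGER, date TEXT)")
    conn.executemany("INSERT INTO t1 VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


averages = preceding_averages(ROWS)
for row in averages:
    print(row)
```

An id with a single row has an empty frame, so its average comes back as NULL (None in Python) rather than as the row's own value.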
Why not just filter:
select id, avg(value)
from t1
where date < '2020-11-26'
group by id;
If you want the date to be flexible -- say, the most recent date for each id -- then:
select id, avg(value)
from (select t1.*,
max(date) over (partition by id) as max_date
from t1
) t1
where date < max_date
group by id;
Use row_number() over (partition by id order by [Date] desc). This assigns row number 1 to the row with the latest date. Wrap it in a CTE and then calculate the average for each id over the rows where the row number is greater than 1.
with a as
(
    select id, value, date,
           row_number() over (partition by id order by date desc) as rn
    from t1
)
select id, avg(value)
from a
where rn > 1
group by id;

SQL query grouping by range

I have a table A with the following data:
+------+-------+----+--------+
| YEAR | MONTH | PA | AMOUNT |
+------+-------+----+--------+
| 2020 | 1 | N | 100 |
+------+-------+----+--------+
| 2020 | 2 | N | 100 |
+------+-------+----+--------+
| 2020 | 3 | O | 100 |
+------+-------+----+--------+
| 2020 | 4 | N | 100 |
+------+-------+----+--------+
| 2020 | 5 | N | 100 |
+------+-------+----+--------+
| 2020 | 6 | O | 100 |
+------+-------+----+--------+
I'd like to have the following result:
+---------+---------+--------+
| FROM | TO | AMOUNT |
+---------+---------+--------+
| 2020-01 | 2020-02 | 200 |
+---------+---------+--------+
| 2020-03 | 2020-03 | 100 |
+---------+---------+--------+
| 2020-04 | 2020-05 | 200 |
+---------+---------+--------+
| 2020-06 | 2020-06 | 100 |
+---------+---------+--------+
My DB is DB2/400.
I have tried ROW_NUMBER partitioning and subqueries, but I can't figure out how to solve this.
I understand this as a gaps-and-islands problem, where you want to group together adjacent rows that have the same PA.
Here is an approach using the difference between row numbers to build the groups:
select min(year_month) year_month_start, max(year_month) year_month_end, sum(amount) amount
from (
    select a.*, year * 100 + month year_month,
        row_number() over(order by year, month) rn1,
        row_number() over(partition by pa order by year, month) rn2
    from a
) a
-- pa is part of the key so that islands of different PA values
-- with the same rn1 - rn2 are never merged
group by pa, rn1 - rn2
order by year_month_start
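Here is a small runnable sketch of the same row-number-difference grouping on the question's data. It uses Python's built-in sqlite3 module rather than DB2/400 (an assumption made only so the example can be executed; SQLite 3.25+ is required), and it formats the month with SQLite's printf, which would need a different expression on DB2. It also groups by pa in addition to rn1 - rn2, so that two islands with different PA values can never collapse into one. The function name pa_ranges is illustrative.

```python
import sqlite3

# (year, month, pa, amount) rows from the question
ROWS = [
    (2020, 1, "N", 100), (2020, 2, "N", 100), (2020, 3, "O", 100),
    (2020, 4, "N", 100), (2020, 5, "N", 100), (2020, 6, "O", 100),
]

QUERY = """
SELECT MIN(ym) AS from_month, MAX(ym) AS to_month, SUM(amount) AS amount
FROM (
    SELECT pa, amount,
           printf('%04d-%02d', year, month) AS ym,
           ROW_NUMBER() OVER (ORDER BY year, month) AS rn1,
           ROW_NUMBER() OVER (PARTITION BY pa ORDER BY year, month) AS rn2
    FROM a
)
-- rn1 - rn2 is constant within a run of adjacent rows sharing the same pa
GROUP BY pa, rn1 - rn2
ORDER BY from_month
"""


def pa_ranges(rows):
    """Collapse adjacent months with the same PA into (from, to, total amount)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE a (year INTEGER, month INTEGER, pa TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO a VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


ranges = pa_ranges(ROWS)
for row in ranges:
    print(row)
```

The N, O, N pattern over three consecutive months is a useful edge case: without pa in the GROUP BY, the second and third rows share the same row-number difference.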
You can try the below -
select min(year)||'-'||min(month) as from_date,max(year)||'-'||max(month) as to_date,sum(amount) as amount from
(
select *,row_number() over(order by month)-
row_number() over(partition by pa order by month) as grprn
from t1
)A group by grprn,pa order by grprn
This works in T-SQL; I guess you can adapt it to DB2/400?
SELECT MIN(Dte) [From]
, MAX(Dte) [To]
-- ,PA
, SUM(Amount)
FROM (
SELECT year * 100 + month Dte
, Pa
, Amount
, ROW_NUMBER() OVER (PARTITION BY pa ORDER BY year * 100 + month) +
10000- (YEar*100+Month) rn
FROM tabA a
) b
GROUP BY Pa
, rn
ORDER BY [From]
, [To]
The trick is the ROW_NUMBER function partitioned by PA and ordered by date. It counts up by one for each month within a PA; when added to the descending count built from year and month (10000 - (year * 100 + month)), it gives the same number for consecutive months with the same PA. You then group by PA and by that computed value, rn, to get the groups, and then Bob's your uncle.

SQL - group by a change of value in a given column

Apologies for the confusing title, I was unsure how to phrase it.
Below is my dataset:
+----+-----------------------------+--------+
| Id | Date | Amount |
+----+-----------------------------+--------+
| 1 | 2019-02-01 12:14:08.8056282 | 10 |
| 1 | 2019-02-04 15:23:21.3258719 | 10 |
| 1 | 2019-02-06 17:29:16.9267440 | 15 |
| 1 | 2019-02-08 14:18:14.9710497 | 10 |
+----+-----------------------------+--------+
It is an example of a bank trying to collect money from a debtor. First, an attempt is made to collect 10% of the owed sum; if the card is charged successfully, 15% is attempted; if that throws an error (for example, insufficient funds), 10% is attempted again.
The desired output would be:
+----+--------+---------+
| Id | Amount | Attempt |
+----+--------+---------+
| 1 | 10 | 1 |
| 1 | 15 | 2 |
| 1 | 10 | 3 |
+----+--------+---------+
I have tried:
SELECT Id, Amount
FROM table1
GROUP BY Id, Amount
I am struggling to create a new column based on when the value changes in the Amount column; I assume that column could be used as another grouping variable to fix this.
If you just want when a value changes, use lag():
select t.id, t.amount,
row_number() over (partition by id order by date) as attempt
from (select t.*, lag(amount) over (partition by id order by date) as prev_amount
from table1 t
) t
where prev_amount is null or prev_amount <> amount
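The following sketch runs that lag-based filter on the sample rows to show why the numbering comes out as 1, 2, 3: the WHERE clause removes the unchanged rows first, and row_number() is then computed over what remains. It uses Python's built-in sqlite3 module as a harness (an assumption for runnability; SQLite 3.25+ is required), and the function name attempts is illustrative.

```python
import sqlite3

# (id, date, amount) rows from the question
ROWS = [
    (1, "2019-02-01 12:14:08.8056282", 10),
    (1, "2019-02-04 15:23:21.3258719", 10),
    (1, "2019-02-06 17:29:16.9267440", 15),
    (1, "2019-02-08 14:18:14.9710497", 10),
]

QUERY = """
SELECT id, amount,
       -- evaluated after WHERE, so only the "change" rows are numbered
       ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS attempt
FROM (
    SELECT id, date, amount,
           LAG(amount) OVER (PARTITION BY id ORDER BY date) AS prev_amount
    FROM table1
)
WHERE prev_amount IS NULL OR prev_amount <> amount
ORDER BY id, attempt
"""


def attempts(rows):
    """Keep the first row of each run of equal amounts and number the runs per id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (id INTEGER, date TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


changes = attempts(ROWS)
for row in changes:
    print(row)
```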

Select Rows whose Sum Value = 80% of the Total

Here is an example of the business problem.
I have 10 sales that resulted in negative margin.
We want to review these records, and we generally use the 20/80 rule in reviews.
That is, 20 percent of the sales will likely represent 80 percent of the negative margin.
So with the below records....
+----+-------+
| ID | Value |
+----+-------+
| 1 | 30 |
| 2 | 30 |
| 3 | 20 |
| 4 | 10 |
| 5 | 5 |
| 6 | 5 |
| 7 | 2 |
| 8 | 2 |
| 9 | 1 |
| 10 | 1 |
+----+-------+
I would want to return...
+----+-------+
| ID | Value |
+----+-------+
| 1 | 30 |
| 2 | 30 |
| 3 | 20 |
| 4 | 10 |
+----+-------+
The Total of Value is 106, 80% is then 84.8.
I need all the records, sorted descending, whose summed value gets me to at least 84.8.
We use Microsoft APS PDW SQL, but can process on SMP if needed.
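Assuming window functions are supported, you can use
with cte as (select id, value
                   ,sum(value) over(order by value desc, id) as running_sum
                   ,sum(value) over() as total
             from tbl
            )
select id, value from cte where running_sum < total*0.8
union all
-- TOP 1 is wrapped in a derived table so that its ORDER BY applies to this
-- branch only and picks the first row that reaches the threshold
select id, value
from (select top 1 id, value from cte where running_sum >= total*0.8 order by running_sum) x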
One way is to use running totals:
select
id,
value
from
(
select
id,
value,
sum(value) over () as total,
sum(value) over (order by value desc) as till_here,
sum(value) over (order by value desc rows between unbounded preceding and 1 preceding)
as till_prev
from mytable
) summed_up
where till_here * 1.0 / total <= 0.8
or (till_here * 1.0 / total >= 0.8 and coalesce(till_prev, 0) * 1.0 / total < 0.8)
order by value desc;
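The running-total approach can be tried on the sample data with the sketch below. It is a slight variation rather than either answer verbatim: a row is kept when the running sum before it is still under the threshold, which keeps every row up to and including the one that crosses 80%. It runs on Python's built-in sqlite3 module (an assumption for runnability, not APS PDW; SQLite 3.25+ is required), and the names sales, share, and top_share are illustrative.

```python
import sqlite3

# Values for ids 1..10 from the question
VALUES = [30, 30, 20, 10, 5, 5, 2, 2, 1, 1]

QUERY = """
SELECT id, value
FROM (
    SELECT id, value,
           SUM(value) OVER (ORDER BY value DESC, id
                            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_sum,
           SUM(value) OVER () AS total
    FROM sales
)
-- running_sum - value is the total accumulated before the current row
WHERE running_sum - value < :share * total
ORDER BY value DESC, id
"""


def top_share(values, share=0.8):
    """Return the largest rows whose cumulative value first reaches share of the total."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, value INTEGER)")
    conn.executemany("INSERT INTO sales (value) VALUES (?)", [(v,) for v in values])
    rows = conn.execute(QUERY, {"share": share}).fetchall()
    conn.close()
    return rows


selected = top_share(VALUES)
for row in selected:
    print(row)
```

Adding id to the ORDER BY makes the running sum deterministic when several rows share the same value.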
This link could be useful, it calculates running totals:
https://www.codeproject.com/Articles/300785/Calculating-simple-running-totals-in-SQL-Server