Select Rows Whose Sum Value = 80% of the Total - SQL

Here is an example of the business problem.
I have 10 sales that resulted in negative margin.
We want to review these records; we generally use the 20/80 rule in reviews.
That is, 20 percent of the sales will likely represent 80 percent of the negative margin.
So with the below records....
+----+-------+
| ID | Value |
+----+-------+
| 1 | 30 |
| 2 | 30 |
| 3 | 20 |
| 4 | 10 |
| 5 | 5 |
| 6 | 5 |
| 7 | 2 |
| 8 | 2 |
| 9 | 1 |
| 10 | 1 |
+----+-------+
I would want to return...
+----+-------+
| ID | Value |
+----+-------+
| 1 | 30 |
| 2 | 30 |
| 3 | 20 |
| 4 | 10 |
+----+-------+
The total of Value is 106, so 80% of that is 84.8.
I need all the records, sorted descending, whose running sum of Value gets me to at least 84.8.
We use Microsoft APS PDW SQL, but can process on SMP if needed.

Assuming window functions are supported, you can use a running sum alongside the grand total:
with cte as (
    select id, value,
           sum(value) over (order by value desc, id) as running_sum,
           sum(value) over () as total
    from tbl
)
select id, value from cte where running_sum < total * 0.8
union all
select id, value
from (select top 1 id, value
      from cte
      where running_sum >= total * 0.8
      order by running_sum  -- the first row to cross the threshold
     ) boundary
The derived table is needed because T-SQL only allows an ORDER BY on an individual branch of a UNION when it is wrapped this way with TOP.
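To try this out, here is a minimal setup with the sample data (the table name tbl comes from the query above; multi-row VALUES works on SMP SQL Server, while PDW may need a distribution option on CREATE TABLE and single-row inserts):
create table tbl (id int, value int);
insert into tbl values
    (1, 30), (2, 30), (3, 20), (4, 10), (5, 5),
    (6, 5), (7, 2), (8, 2), (9, 1), (10, 1);
-- total = 106, 80% = 84.8; the query should return ids 1-4
-- (running sums 30, 60, 80, 90)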

One way is to use running totals:
select
    id,
    value
from
(
    select
        id,
        value,
        sum(value) over () as total,
        sum(value) over (order by value desc) as till_here,
        sum(value) over (order by value desc
                         rows between unbounded preceding and 1 preceding) as till_prev
    from mytable
) summed_up
where till_here * 1.0 / total <= 0.8
   or (till_here * 1.0 / total >= 0.8 and coalesce(till_prev, 0) * 1.0 / total < 0.8)
order by value desc;

This link could be useful; it shows how to calculate running totals:
https://www.codeproject.com/Articles/300785/Calculating-simple-running-totals-in-SQL-Server
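For reference, the window-function form of a running total is short enough to sketch inline (using the tbl table from the question; the id tie-breaker keeps the sum deterministic when values repeat):
-- running total of value, largest values first
select id,
       value,
       sum(value) over (order by value desc, id
                        rows between unbounded preceding and current row) as running_total
from tbl;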

Related

How to calculate average of values without including the last value (sql)?

I have a table. I partition it by the id and want to calculate the average of the values previous to the current one, i.e. without including the current value. Here is a sample table:
+----+-------+------------+
| id | Value | Date |
+----+-------+------------+
| 1 | 51 | 2020-11-26 |
| 1 | 45 | 2020-11-25 |
| 1 | 47 | 2020-11-24 |
| 2 | 32 | 2020-11-26 |
| 2 | 51 | 2020-11-25 |
| 2 | 45 | 2020-11-24 |
| 3 | 47 | 2020-11-26 |
| 3 | 32 | 2020-11-25 |
| 3 | 35 | 2020-11-24 |
+----+-------+------------+
In this case, it means calculating the average of values for dates BEFORE 2020-11-26. This is the expected result:
+----+-------+
| id | Value |
+----+-------+
| 1 | 46 |
| 2 | 48 |
| 3 | 33.5 |
+----+-------+
I have calculated it using ROWS N PRECEDING, but it appears that this way I average the N preceding rows plus the current row, and I want to exclude the current row (which is the most recent date in my case).
Here is my query:
SELECT ID,
       AVG(Value) OVER (PARTITION BY ID
                        ORDER BY Date
                        ROWS 9 PRECEDING) AS avg9
FROM t1
Then define your window frame in full, giving both the start and the end with BETWEEN:
SELECT ID,
(AVG(Value) OVER (PARTITION BY ID ORDER BY Date ROWS BETWEEN 9 PRECEDING AND 1 PRECEDING)) AS avg9
FROM t1;
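If the goal is all earlier rows rather than just the nine previous ones, an unbounded frame start does the same job (a sketch against the same table; Value * 1.0 is a hedge against integer division, since the expected 33.5 is fractional):
SELECT ID,
       AVG(Value * 1.0) OVER (PARTITION BY ID
                              ORDER BY Date
                              ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS avg_prev
FROM t1;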
Why not just filter:
select id, avg(value)
from t1
where date < '2020-11-26'
group by id;
If you want the date to be flexible -- say the most recent value for each date, then:
select id, avg(value)
from (select t1.*,
             max(date) over (partition by id) as max_date
      from t1
     ) t1
where date < max_date
group by id;
Do a row_number() over (partition by id order by [Date] desc). This will give rank = 1 to the row with the latest date. Wrap it in a CTE and then calculate the average for each id where RN > 1:
;with a as
(
    select id, value, Date,
           row_number() over (partition by id order by Date desc) as RN
    from t1
)
select id, avg(Value)
from a
where RN > 1
group by id;
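To verify the answers above, the sample table can be loaded like this (a decimal column is assumed so the averages come out fractional; each approach should return 46, 48, and 33.5 for ids 1, 2, and 3):
create table t1 (id int, Value decimal(5, 1), Date date);
insert into t1 values
    (1, 51, '2020-11-26'), (1, 45, '2020-11-25'), (1, 47, '2020-11-24'),
    (2, 32, '2020-11-26'), (2, 51, '2020-11-25'), (2, 45, '2020-11-24'),
    (3, 47, '2020-11-26'), (3, 32, '2020-11-25'), (3, 35, '2020-11-24');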

Postgres: Range Lookup with Auto increment

I have the following table: table1
begin | value | end
---------------------
1 | 3 | 10
1 | 5 | 10
1 | 2 | 10
1 | 7 | 10
11 | 19 | 20
11 | 16 | 20
11 | 14 | 20
I am looking for the following output:
begin | value | end | case
-----------------------------
1 | 3 | 10 | 1
1 | 5 | 10 | 1
1 | 2 | 10 | 1
1 | 7 | 10 | 1
11 | 19 | 20 | 2
11 | 16 | 20 | 2
11 | 14 | 20 | 2
I want to assign a unique number to the rows falling within a particular range, but I am unable to find my way around it. Any suggestions?
Hmmm. This is a gap and islands problem. You can identify where islands start by checking that there are no other rows that overlap with them. For that, you can use a cumulative max.
This gets you close:
select t.*,
       count(*) filter (where prev_end < start) over (order by start) as grp
from (select t.*,
             max("end") over (order by start
                              range between unbounded preceding and 1 preceding) as prev_end
      from t
     ) t;
However, the ties in the data mean that this has gaps. So, one more level:
select t.*, dense_rank() over (order by grp) as sequential_grp
from (select t.*,
             count(*) filter (where prev_end < start) over (order by start) as grp
      from (select t.*,
                   max("end") over (order by start
                                    range between unbounded preceding and 1 preceding) as prev_end
            from t
           ) t
     ) t;
Here is a db<>fiddle -- with the column names changed, because names like begin and end are SQL keywords and hence a bad idea for column names.
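To run the queries above as written, a test table can use start and a quoted "end" (as the answer notes, non-keyword names are the better choice in a real schema):
create table t (start int, value int, "end" int);
insert into t values
    (1, 3, 10), (1, 5, 10), (1, 2, 10), (1, 7, 10),
    (11, 19, 20), (11, 16, 20), (11, 14, 20);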

Find groups in data by relative difference between records

I have some rows that are sorted by price:
| id | price |
|----|-------|
| 1 | 2.00 |
| 2 | 2.10 |
| 3 | 2.11 |
| 4 | 2.50 |
| 5 | 2.99 |
| 6 | 3.02 |
| 7 | 9.01 |
| 8 | 9.10 |
| 9 | 9.11 |
| 10 | 13.01 |
| 11 | 13.51 |
| 12 | 14.10 |
I need to group them in "price groups". An item belongs to a different group when the difference in price between it and the previous item is greater than some fixed value, say 1.50.
So the expected result is something like this:
| MIN(price) | MAX(price) |
|------------|------------|
| 2.00 | 3.02 |
| 9.01 | 9.11 |
| 13.01 | 14.10 |
I'm not even sure what to call this type of grouping. Group by "rolling difference"? Not exactly...
Can this be done in SQL (or in Postgres in particular)?
Your results are consistent with looking at the previous value and saying a group starts when the difference is greater than 1.5. You can do this with lag(), a cumulative sum, and aggregation:
select min(price), max(price)
from (select t.*,
             count(*) filter (where prev_price is null or prev_price < price - 1.5)
                 over (order by price) as grp
      from (select t.*,
                   lag(price) over (order by price) as prev_price
            from t
           ) t
     ) t
group by grp;
Thanks to Gordon Linoff for his answer; it is exactly what I was after!
I ended up using this query here simply because I understand it better. I guess it is more noobish, but so am I.
Both queries sort a table of 1M rows into 34 groups in about a second. This query is a bit more performant on 11M rows, sorting them into 380 groups in 15 seconds, vs 23 seconds in Gordon's answer.
SELECT results.group_i, MIN(results.price), MAX(results.price), AVG(results.price)
FROM (
    SELECT *,
           SUM(new_group) OVER (ORDER BY price
                                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS group_i
    FROM (
        SELECT annotated.*,
               CASE
                   WHEN prev_price IS NULL OR price - prev_price > 1.5 THEN 1
                   ELSE 0
               END AS new_group
        FROM (
            SELECT *,
                   LAG(price) OVER (ORDER BY price) AS prev_price
            FROM prices
        ) AS annotated
    ) AS groupable
) AS results
GROUP BY results.group_i
ORDER BY results.group_i;
Note that the ORDER BY price inside the SUM window is needed; without it the running sum is taken in an unspecified row order.
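For reference, the sample rows can be loaded like this to test either query (the accepted query reads from prices; Gordon's reads from t, so adjust the name accordingly):
create table prices (id int, price numeric);
insert into prices values
    (1, 2.00), (2, 2.10), (3, 2.11), (4, 2.50),
    (5, 2.99), (6, 3.02), (7, 9.01), (8, 9.10),
    (9, 9.11), (10, 13.01), (11, 13.51), (12, 14.10);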

Sum length of overlapping intervals

I've got a table in a Redshift database that contains intervals which are grouped and that potentially overlap, like so:
| interval_id | l | u | group |
| ----------- | -- | -- | ----- |
| 1 | 1 | 10 | A |
| 2 | 2 | 5 | A |
| 3 | 5 | 15 | A |
| 4 | 26 | 30 | B |
| 5 | 28 | 35 | B |
| 6 | 30 | 31 | B |
| 7 | 44 | 45 | B |
| 8 | 56 | 58 | C |
What I would like to do is to determine the length of the union of the intervals within group. That is, for each interval take u - l, sum over all group members and then subtract off the length of the overlaps between the intervals.
Desired result:
| group | length |
| ----- | ------ |
| A | 14 |
| B | 10 |
| C | 2 |
This question has been asked before; alas, it seems that all of the solutions in that thread use features that Redshift doesn't support.
This is not difficult but requires multiple steps. The key is to define the "islands" within each group and then aggregate over those. Lots of subqueries, aggregations, and window functions.
select groupId, sum(ul)
from (select groupId, max(u) - min(l) as ul
      from (select t.*,
                   sum(case when prev_max_u < l then 1 else 0 end)
                       over (partition by groupId order by l rows unbounded preceding) as grp
            from (select t.*,
                         max(u) over (partition by groupId order by l
                                      rows between unbounded preceding and 1 preceding) as prev_max_u
                  from t
                 ) t
           ) t
      group by groupId, grp
     ) g
group by groupId;
Both window functions are partitioned by groupId so that islands from one group cannot bleed into another, and the explicit frame clauses are required by Redshift when an aggregate uses ORDER BY.
The idea is to determine if there is an overlap at the beginning of each record. For this purpose, it uses a cumulative max function on all preceding records. Then, it determines if there is an overlap by comparing the previous max with the current l -- a cumulative sum of overlaps defines a group.
The rest is just aggregation. And more aggregation.
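A minimal setup matching the sample data (groupId stands in for the reserved word group, as in the query above):
create table t (interval_id int, l int, u int, groupId varchar(1));
insert into t values
    (1, 1, 10, 'A'), (2, 2, 5, 'A'), (3, 5, 15, 'A'),
    (4, 26, 30, 'B'), (5, 28, 35, 'B'), (6, 30, 31, 'B'),
    (7, 44, 45, 'B'), (8, 56, 58, 'C');
-- expected: A = 14, B = 10, C = 2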

Select dynamic couples of lines in SQL (PostgreSQL)

My objective is to build dynamic groups of rows (of products, by TYPE & COLOR).
I don't know if it's possible with just one select query.
I want to create groups of rows (a PRODUCT is a TYPE plus a COLOR) according to the NB_PER_GROUP column, and I want to do this grouping in date order (ORDER BY DATE).
A leftover product that cannot fill a complete group (for example, a single remaining product whose NB_PER_GROUP is 2) is excluded from the final result.
Table :
-----------------------------------------------
NUM | TYPE | COLOR | NB_PER_GROUP | DATE
-----------------------------------------------
0 | 1 | 1 | 2 | ...
1 | 1 | 1 | 2 |
2 | 1 | 2 | 2 |
3 | 1 | 2 | 2 |
4 | 1 | 1 | 2 |
5 | 1 | 1 | 2 |
6 | 4 | 1 | 3 |
7 | 1 | 1 | 2 |
8 | 4 | 1 | 3 |
9 | 4 | 1 | 3 |
10 | 5 | 1 | 2 |
Results :
------------------------
GROUP_NUMBER | NUM |
------------------------
0 | 0 |
0 | 1 |
~~~~~~~~~~~~~~~~~~~~~~~~
1 | 2 |
1 | 3 |
~~~~~~~~~~~~~~~~~~~~~~~~
2 | 4 |
2 | 5 |
~~~~~~~~~~~~~~~~~~~~~~~~
3 | 6 |
3 | 8 |
3 | 9 |
If you have another way to solve this problem, I will accept it.
What about something like this?
select max(gn.group_number) as group_number, ip.num
from products ip
join (
    select date, type, color,
           row_number() over (order by date) - 1 as group_number
    from (
        select op.num, op.type, op.color, op.nb_per_group, op.date,
               (row_number() over (partition by op.type, op.color order by op.date) - 1)
                   % op.nb_per_group as group_order
        from products op
    ) sq
    where sq.group_order = 0
) gn
  on ip.type = gn.type
 and ip.color = gn.color
 and ip.date >= gn.date
group by ip.num
order by group_number, ip.num;
This may only work if your nb_per_group values are the same for each combination of type and color. It may also require unique dates, but that could probably be worked around if required.
The innermost subquery partitions the rows by type and color, orders them by date, and calculates the row numbers modulo nb_per_group; this forms a 0-based count within each group that resets to 0 every time nb_per_group rows have been consumed.
The next-level subquery finds all of the 0 values mapped in the lower subquery and assigns group numbers to them.
Finally, the outermost query ties each row in the products table to a group number, calculated as the highest group number that split off at or before this product's date.
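A minimal setup for experimenting (the DATE values below are made up, since the question leaves them elided; only their relative order matters):
create table products (num int, type int, color int, nb_per_group int, date date);
insert into products values
    (0, 1, 1, 2, '2023-01-01'), (1, 1, 1, 2, '2023-01-02'),
    (2, 1, 2, 2, '2023-01-03'), (3, 1, 2, 2, '2023-01-04'),
    (4, 1, 1, 2, '2023-01-05'), (5, 1, 1, 2, '2023-01-06'),
    (6, 4, 1, 3, '2023-01-07'), (7, 1, 1, 2, '2023-01-08'),
    (8, 4, 1, 3, '2023-01-09'), (9, 4, 1, 3, '2023-01-10'),
    (10, 5, 1, 2, '2023-01-11');
-- hypothetical dates: the question only shows that rows are in date order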