PostgreSQL get relative average with group by

I have a table as follows. The rows are in a specific order.
 id | value
----+-------
  1 |     2
  1 |     4
  1 |     3
  2 |     2
  2 |     2
  2 |     5
I want to group the rows by the column 'id' and get, for each row, the running average of value over the values so far in that group (as explained within brackets in the following example):
 id | value | RelativeAverage
----+-------+-----------------
  1 |     2 | (2/1) = 2
  1 |     4 | ((2+4)/2) = 3
  1 |     3 | ((2+4+3)/3) = 3
  2 |     2 | (2/1) = 2
  2 |     2 | ((2+2)/2) = 2
  2 |     5 | ((2+2+5)/3) = 3
Is there an approach with which I can achieve this?
Thanks in advance.

Wrong query:
select
id, value,
sum(value) over(arrangement), rank() over(arrangement),
sum(value) over(arrangement)::numeric / rank() over(arrangement)
as relative_average
from tbl
window arrangement as (partition by id order by id);
Output (wrong):
| id | value | sum | rank | relative_average |
|----|-------|-----|------|------------------|
|  1 |     2 |   9 |    1 |                9 |
|  1 |     4 |   9 |    1 |                9 |
|  1 |     3 |   9 |    1 |                9 |
|  2 |     1 |   8 |    1 |                8 |
|  2 |     2 |   8 |    1 |                8 |
|  2 |     5 |   8 |    1 |                8 |
You need something that sorts correctly in order for sum and rank to work on the actual arrangement of your data. You can use the table row's hidden ctid field, but that is Postgres-specific.
Correct query:
select
id, value,
sum(value) over(arrangement), rank() over(arrangement),
sum(value) over(arrangement)::numeric / rank() over(arrangement)
as relative_average
from tbl
window arrangement as (partition by id order by tbl.ctid);
Output (correct):
| id | value | sum | rank |   relative_average |
|----|-------|-----|------|--------------------|
|  1 |     2 |   2 |    1 |                  2 |
|  1 |     4 |   6 |    2 |                  3 |
|  1 |     3 |   9 |    3 |                  3 |
|  2 |     1 |   1 |    1 |                  1 |
|  2 |     2 |   3 |    2 |                1.5 |
|  2 |     5 |   8 |    3 | 2.6666666666666665 |
The best way is to introduce a serial primary key, so that a running total (sum over()) based on the actual arrangement of your data can be computed.
CREATE TABLE tbl
(ordered_pk serial primary key, "id" int, "value" int)
;
INSERT INTO tbl
("id", "value")
VALUES
(1, 2),
(1, 4),
(1, 3),
(2, 1),
(2, 2),
(2, 5)
;
Correct query:
select
id, value,
sum(value) over(arrangement), rank() over(arrangement),
sum(value) over(arrangement)::numeric / rank() over(arrangement)
as relative_average
from tbl
window arrangement as (partition by id order by ordered_pk);
Output (correct):
| id | value | sum | rank |   relative_average |
|----|-------|-----|------|--------------------|
|  1 |     2 |   2 |    1 |                  2 |
|  1 |     4 |   6 |    2 |                  3 |
|  1 |     3 |   9 |    3 |                  3 |
|  2 |     1 |   1 |    1 |                  1 |
|  2 |     2 |   3 |    2 |                1.5 |
|  2 |     5 |   8 |    3 | 2.6666666666666665 |
Live test: http://sqlfiddle.com/#!17/f18276/1
You can order by value, but it will yield a different result: not necessarily wrong, just different, because of the different arrangement of the values. You then also need to use row_number instead of rank/dense_rank because of possible duplication of values. Here I made an example with duplicate values.
Correct query:
select
id, value,
sum(value) over(arrangement),
row_number() over(arrangement),
rank() over(arrangement),
dense_rank() over(arrangement),
sum(value) over(arrangement)::numeric / row_number() over(arrangement)
as relative_average
from tbl
window arrangement as (partition by id order by value);
Output:
| id | value | sum | row_number | rank | dense_rank |   relative_average |
|----|-------|-----|------------|------|------------|--------------------|
|  1 |     2 |   2 |          1 |    1 |          1 |                  2 |
|  1 |     3 |   5 |          2 |    2 |          2 |                2.5 |
|  1 |     4 |   9 |          3 |    3 |          3 |                  3 |
|  2 |     1 |   1 |          1 |    1 |          1 |                  1 |
|  2 |     2 |   5 |          2 |    2 |          2 |                2.5 |
|  2 |     2 |   5 |          3 |    2 |          2 | 1.6666666666666667 |
|  2 |     5 |  10 |          4 |    4 |          3 |                2.5 |
Live test:
http://sqlfiddle.com/#!17/2b5aac/1

Not so proud of my other answer
Just use avg.
Today I learned about rows between unbounded preceding and current row. It works with the actual arrangement of the data even in the absence of a good candidate field to order by. It looks like you can at least get away with Postgres' hidden ctid field, or even with no ordering column at all, though I still recommend a serial primary key or a date-created field to order by.
Here's a better query. No need to divide, just use avg:
select
id, value,
avg(value) over(arrangement rows between unbounded preceding and current row)
from tbl
window arrangement as (partition by id);
Output:
| id | value |                avg |
|----|-------|--------------------|
|  1 |     2 |                  2 |
|  1 |     4 |                  3 |
|  1 |     3 |                  3 |
|  2 |     1 |                  1 |
|  2 |     2 |                1.5 |
|  2 |     5 | 2.6666666666666665 |
For comparison, here are the earlier sum/rank queries again with the avg column added. First, with order by id (sum/rank are still wrong, but avg over the rows frame is correct):
select
id, value,
sum(value) over(arrangement), rank() over(arrangement),
sum(value) over(arrangement)::numeric / rank() over(arrangement)
as relative_average,
avg(value) over(arrangement rows between unbounded preceding and current row)
from tbl
window arrangement as (partition by id order by id);
Output:
| id | value | sum | rank | relative_average |                avg |
|----|-------|-----|------|------------------|--------------------|
|  1 |     2 |   9 |    1 |                9 |                  2 |
|  1 |     4 |   9 |    1 |                9 |                  3 |
|  1 |     3 |   9 |    1 |                9 |                  3 |
|  2 |     1 |   8 |    1 |                8 |                  1 |
|  2 |     2 |   8 |    1 |                8 |                1.5 |
|  2 |     5 |   8 |    1 |                8 | 2.6666666666666665 |
Next, with order by tbl.ctid (both are correct):
select
id, value,
sum(value) over(arrangement), rank() over(arrangement),
sum(value) over(arrangement)::numeric / rank() over(arrangement)
as relative_average,
avg(value) over(arrangement rows between unbounded preceding and current row)
from tbl
window arrangement as (partition by id order by tbl.ctid);
Output:
| id | value | sum | rank |   relative_average |                avg |
|----|-------|-----|------|--------------------|--------------------|
|  1 |     2 |   2 |    1 |                  2 |                  2 |
|  1 |     4 |   6 |    2 |                  3 |                  3 |
|  1 |     3 |   9 |    3 |                  3 |                  3 |
|  2 |     1 |   1 |    1 |                  1 |                  1 |
|  2 |     2 |   3 |    2 |                1.5 |                1.5 |
|  2 |     5 |   8 |    3 | 2.6666666666666665 | 2.6666666666666665 |
And with order by ordered_pk (both are correct):
select
id, value,
sum(value) over(arrangement), rank() over(arrangement),
sum(value) over(arrangement)::numeric / rank() over(arrangement)
as relative_average,
avg(value) over(arrangement rows between unbounded preceding and current row)
from tbl
window arrangement as (partition by id order by ordered_pk);
Output:
| id | value | sum | rank |   relative_average |                avg |
|----|-------|-----|------|--------------------|--------------------|
|  1 |     2 |   2 |    1 |                  2 |                  2 |
|  1 |     4 |   6 |    2 |                  3 |                  3 |
|  1 |     3 |   9 |    3 |                  3 |                  3 |
|  2 |     1 |   1 |    1 |                  1 |                  1 |
|  2 |     2 |   3 |    2 |                1.5 |                1.5 |
|  2 |     5 |   8 |    3 | 2.6666666666666665 | 2.6666666666666665 |
Live test: http://sqlfiddle.com/#!17/f18276/9
rows between unbounded preceding and current row can also be written as rows unbounded preceding: http://sqlfiddle.com/#!17/f18276/11
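For instance, the avg query above rewritten with the shorthand (same tbl and ordered_pk as earlier):
select
  id, value,
  avg(value) over(arrangement rows unbounded preceding)
from tbl
window arrangement as (partition by id order by ordered_pk);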
And here's the result with order by value when value has duplicates.
select
  id, value,
  sum(value) over(arrangement),
  row_number() over(arrangement) as rn,
  rank() over(arrangement) as rank,
  dense_rank() over(arrangement) as drank,
  trunc( sum(value) over(arrangement)::numeric
         / row_number() over(arrangement), 2) as ra__rn,
  trunc( sum(value) over(arrangement)::numeric
         / rank() over(arrangement), 2) as ra__rank,
  trunc( sum(value) over(arrangement)::numeric
         / dense_rank() over(arrangement), 2) as ra__drank,
  trunc( avg(value) over(arrangement
           rows between unbounded preceding and current row), 2) as ra
from tbl
window arrangement as (partition by id order by value);
Output:
| id | value | sum | rn | rank | drank | ra__rn | ra__rank | ra__drank |   ra |
|----|-------|-----|----|------|-------|--------|----------|-----------|------|
|  1 |     2 |   2 |  1 |    1 |     1 |      2 |        2 |         2 |    2 |
|  1 |     3 |   5 |  2 |    2 |     2 |    2.5 |      2.5 |       2.5 |  2.5 |
|  1 |     4 |   9 |  3 |    3 |     3 |      3 |        3 |         3 |    3 |
|  2 |     1 |   1 |  1 |    1 |     1 |      1 |        1 |         1 |    1 |
|  2 |     2 |   5 |  2 |    2 |     2 |    2.5 |      2.5 |       2.5 |  1.5 |
|  2 |     2 |   5 |  3 |    2 |     2 |   1.66 |      2.5 |       2.5 | 1.66 |
|  2 |     5 |  10 |  4 |    4 |     3 |    2.5 |      2.5 |      3.33 |  2.5 |
Live test: http://sqlfiddle.com/#!17/2b5aac/16
And here's the result with order by ordered_pk when value has duplicates.
select
  id, value,
  sum(value) over(arrangement),
  row_number() over(arrangement) as rn,
  rank() over(arrangement) as rank,
  dense_rank() over(arrangement) as drank,
  trunc( sum(value) over(arrangement)::numeric
         / row_number() over(arrangement), 2) as ra__rn,
  trunc( sum(value) over(arrangement)::numeric
         / rank() over(arrangement), 2) as ra__rank,
  trunc( sum(value) over(arrangement)::numeric
         / dense_rank() over(arrangement), 2) as ra__drank,
  trunc( avg(value) over(arrangement
           rows between unbounded preceding and current row), 2) as ra
from tbl
window arrangement as (partition by id order by ordered_pk);
Output:
| id | value | sum | rn | rank | drank | ra__rn | ra__rank | ra__drank |   ra |
|----|-------|-----|----|------|-------|--------|----------|-----------|------|
|  1 |     2 |   2 |  1 |    1 |     1 |      2 |        2 |         2 |    2 |
|  1 |     4 |   6 |  2 |    2 |     2 |      3 |        3 |         3 |    3 |
|  1 |     3 |   9 |  3 |    3 |     3 |      3 |        3 |         3 |    3 |
|  2 |     1 |   1 |  1 |    1 |     1 |      1 |        1 |         1 |    1 |
|  2 |     2 |   3 |  2 |    2 |     2 |    1.5 |      1.5 |       1.5 |  1.5 |
|  2 |     2 |   5 |  3 |    3 |     3 |   1.66 |     1.66 |      1.66 | 1.66 |
|  2 |     5 |  10 |  4 |    4 |     4 |    2.5 |      2.5 |       2.5 |  2.5 |
Live test: http://sqlfiddle.com/#!17/baaf9/2

If I assume that you have an ordering column in the table, then what you want is:
select t.*,
avg(value) over (partition by id
order by ?
rows between unbounded preceding and current row
) as running_avg
from t;
The ? is the ordering column.
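For instance, with the ordered_pk column from the earlier answer standing in for ?, this becomes:
select t.*,
       avg(value) over (partition by id
                        order by ordered_pk
                        rows between unbounded preceding and current row
                       ) as running_avg
from tbl t;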
In other words, Postgres has a single built-in function that does exactly what you want -- and the function happens to be standard SQL.
The window frame using rows is required, because the default is range.
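A minimal sketch of the difference, using throwaway inline values with a duplicate in the ordering column:
select value,
       avg(value) over (order by value) as avg_range, -- default frame; peer rows share it
       avg(value) over (order by value
                        rows between unbounded preceding and current row) as avg_rows
from (values (1), (3), (3)) as t(value);
-- avg_range: 1, 2.33..., 2.33... (the two tied rows see the same frame)
-- avg_rows:  1, 2.00,    2.33... (each row extends the frame by exactly one)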
If you do not have an ordering column, then you should add one. I strongly advise you to NOT use ctid for this purpose. It might seem like it works on small sets of data, but it is not stable over time and it might not work on larger sets of data.
If you expect your data to be ordered by inserts, then use a serial column to capture the insert order.
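A sketch of that fix in Postgres (assuming the generic table t from the query above; note that values backfilled when the column is added follow the table scan order, not the historical insert order, so only rows inserted afterwards are reliably ordered):
alter table t add column seq serial;
-- from now on, new rows get increasing seq values; use order by seq in the window above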

Related

Get the count of longest streak including the break point

I am working on a problem where I have to get the count of the longest streak, but to get the exact result I also have to count the point where the streak breaks. My table looks like this:
+-----------------+--------+-------+
| customer_number | Months | Flags |
+-----------------+--------+-------+
|               1 |     12 |     1 |
|               1 |      1 |     1 |
|               1 |      2 |     1 |
|               1 |      3 |     1 |
|               1 |      4 |     1 |
|               1 |      5 |     1 |
|               1 |      8 |     1 |
|               1 |      9 |     1 |
|               1 |     10 |     1 |
|               1 |     11 |     1 |
|               6 |     12 |     1 |
|               6 |      1 |     1 |
|               6 |      2 |     1 |
|               6 |      3 |     1 |
|               6 |      4 |     1 |
|               6 |      5 |     4 |
|               6 |      9 |     1 |
|               6 |     10 |     1 |
|               6 |     11 |     1 |
|               7 |      5 |     1 |
|               8 |      9 |     1 |
|               8 |     10 |     1 |
|               8 |     11 |     1 |
|               9 |      9 |     1 |
|               9 |     10 |     1 |
|               9 |     11 |     1 |
|              10 |     11 |     1 |
+-----------------+--------+-------+
and my desired output is
+----------+--------------------+
| Customer | Consecutive streak |
+----------+--------------------+
|        1 |                 10 |
|        6 |                  6 |
|        7 |                  1 |
|        8 |                  3 |
|        9 |                  3 |
|       10 |                  1 |
+----------+--------------------+
The code I have:
SELECT customer_number, max(streak) max_consecutive_streak FROM (
  SELECT customer_number, COUNT(*) as streak
  FROM
    (select *,
            (row_number() over (order by customer_number) -
             row_number() over (partition by flags order by customer_number)
            ) as counts
     from table1
    ) cc
  group by customer_number, counts
) t
GROUP BY 1;
It works well, but for customer_number 6 it returns 5 and I want it to be 6: it should count the 4 as part of the longest streak, since that is the point where the streak breaks. Any idea how I can achieve that?
You can use a cte with row_number:
with cte(r, id, flag) as (
  select row_number() over (order by c.customer_number), c.customer_number, c.flags
  from customers c
),
freq(id, t, f) as (
  select c2.id, c2.f, count(*)
  from (select c.id,
               (select sum(c1.flag != c.flag) from cte c1
                where c1.id = c.id and c1.r <= c.r) as f
        from cte c) c2
  group by c2.id, c2.f
)
select id, max(f) from freq group by id;

SQL Windowing accumulative sum with grouping

I've got a table like this
| week_no | value | attribute |
-------------------------------
|       1 |     3 |         a |
|       2 |     3 |         a |
|       3 |     3 |         a |
|       1 |     4 |         b |
|       2 |     4 |         b |
|       3 |     4 |         b |
I'd like to have a running total of the value column:
| week_no | value | attribute | accum_value |
---------------------------------------------
|       1 |     3 |         a |           3 |
|       2 |     3 |         a |           6 |
|       3 |     3 |         a |           9 |
|       1 |     4 |         b |           4 |
|       2 |     4 |         b |           8 |
|       3 |     4 |         b |          12 |
I've attempted this with the following window function, though it doesn't return the correct values:
SUM(value) OVER(ORDER BY 1 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS accum_value
The correct window function would use partition by:
SUM(value) OVER (PARTITION BY attribute ORDER BY week_no
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS accum_value
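For context, the full statement would look something like this (the table name is assumed, since the question doesn't give one):
SELECT week_no, value, attribute,
       SUM(value) OVER (PARTITION BY attribute ORDER BY week_no
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                       ) AS accum_value
FROM weekly_values; -- hypothetical table name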

Aggregation by positive/negative values v.2

I've posted several topics and every query had some problems :( I've changed the table and examples for better understanding.
I have a table called PROD_COST with 5 fields
(ID, Duration, Cost, Cost_next, Cost_change).
I need an extra field called "Groups" for aggregation.
Duration = number of days the price is valid (1 day = 1 row).
Cost = product price on that day.
Cost_next = lead(cost, 1, 0).
Cost_change = Cost_next - Cost.
Example:
+----+----------+------+-------------+--------+
| ID | Duration | Cost | Cost_change | Groups |
+----+----------+------+-------------+--------+
|  1 |        1 |   10 |        -1.5 |      1 |
|  2 |        1 |  8.5 |         3.7 |      2 |
|  3 |        1 | 12.2 |           0 |      2 |
|  4 |        1 | 12.2 |        -2.2 |      3 |
|  5 |        1 |   10 |           0 |      3 |
|  6 |        1 |   10 |         3.2 |      4 |
|  7 |        1 | 13.2 |        -2.7 |      5 |
|  8 |        1 | 10.5 |        -1.5 |      5 |
|  9 |        1 |    9 |           0 |      5 |
| 10 |        1 |    9 |           0 |      5 |
| 11 |        1 |    9 |          -1 |      5 |
| 12 |        1 |    8 |         1.5 |      6 |
+----+----------+------+-------------+--------+
Now I need to populate the "Groups" field based on Cost_change, which can be positive, negative, or 0.
Some kind guy advised me this query:
select id, COST_CHANGE, sum(GRP) over (order by id asc) + 1
from
(
  select *,
         case when sign(COST_CHANGE) != sign(isnull(lag(COST_CHANGE) over (order by id asc), COST_CHANGE))
                   and Cost_change != 0
              then 1 else 0
         end as GRP
  from PROD_COST
) X
But there is a problem: if there are 0 values between two positive or two negative values, then it groups them separately, for example:
+-------------+--------+
| Cost_change | Groups |
+-------------+--------+
|       9.262 |   5777 |
|      -9.262 |   5778 |
|       9.262 |   5779 |
|       0.000 |   5779 |
|       9.608 |   5780 |
|     -11.231 |   5781 |
|      10.000 |   5782 |
+-------------+--------+
I need to have:
+-------------+--------+
| Cost_change | Groups |
+-------------+--------+
|       9.262 |   5777 |
|      -9.262 |   5778 |
|       9.262 |   5779 |
|       0.000 |   5779 |
|       9.608 |   5779 | -- Here
|     -11.231 |   5780 |
|      10.000 |   5781 |
+-------------+--------+
In other words, if there are 0 values between two positive or two negative values, they should all be in one group, because the sequence MINUS-0-0-MINUS contains no rotation. But if I had MINUS-0-0-PLUS, then Groups should be 1-1-1-2, because a positive value rotating with a negative value starts a new group.
Thank you for your attention!
I'm using SQL Server 2012.
I think the best approach is to remove the zeros, do the calculation, and then re-insert them. So:
with pcg as (
      select pc.*, min(id) over (partition by grp) as grpid
      from (select pc.*,
                   (row_number() over (order by id) -
                    row_number() over (partition by sign(cost_change) order by id)
                   ) as grp
            from prod_cost pc
            where cost_change <> 0
           ) pc
     )
select pc.*, max(g.groups) over (order by pc.id) as groups
from prod_cost pc left join
     (select pcg.*, dense_rank() over (order by grpid) as groups
      from pcg
     ) g
     on pc.id = g.id;
The CTE assigns a group identifier based on the lowest id in the group, where the groups are bounded by actual sign changes. The subquery turns this into a number. The outer query then accumulates the maximum value, to give a value to the 0 records.

SQL Increment number in select statement

I have an issue where I need to group a set of values, increasing the group number whenever the variance between 2 columns is greater than or equal to 4; please see below.
UPDATE: I added a date column so you can view the order, but I need the group to update based on the variance, not the date.
+--------+-------+-------+----------+--------------+
| Date   | Col 1 | Col 2 | Variance | Group Number |
+--------+-------+-------+----------+--------------+
| 1-Jun  |     2 |     1 |        1 |            1 |
| 2-Jun  |     1 |     1 |        0 |            1 |
| 3-Jun  |     3 |     2 |        1 |            1 |
| 4-Jun  |     4 |     1 |        3 |            1 |
| 5-Jun  |     5 |     1 |        4 |            2 |
| 6-Jun  |     1 |     1 |        0 |            2 |
| 7-Jun  |    23 |    12 |       11 |            3 |
| 8-Jun  |    12 |    11 |        1 |            3 |
| 9-Jun  |     2 |     1 |        1 |            3 |
| 10-Jun |    13 |     4 |        9 |            4 |
| 11-Jun |     2 |     1 |        1 |            4 |
+--------+-------+-------+----------+--------------+
The group number is simply one plus the number of times that 4 or greater appears in the variance column on or before the current date. You can get this using a correlated subquery:
select t.*,
       (select 1 + count(*)
        from yourtable t2
        where t2.date <= t.date and t2.variance >= 4
       ) as GroupNumber
from yourtable t;
In SQL Server 2012+, you can also do this using a cumulative sum:
select t.*,
       1 + sum(case when variance >= 4 then 1 else 0 end) over
             (order by date rows between unbounded preceding and current row
             ) as GroupNumber
from yourtable t;

Sequential Group By in sql server

For this Table:
+----+--------+-------+
| ID | Status | Value |
+----+--------+-------+
|  1 |      1 |     4 |
|  2 |      1 |     7 |
|  3 |      1 |     9 |
|  4 |      2 |     1 |
|  5 |      2 |     7 |
|  6 |      1 |     8 |
|  7 |      1 |     9 |
|  8 |      2 |     1 |
|  9 |      0 |     4 |
| 10 |      0 |     3 |
| 11 |      0 |     8 |
| 12 |      1 |     9 |
| 13 |      3 |     1 |
+----+--------+-------+
I need to sum sequential groups with the same Status to produce this result.
+--------+------------+
| Status | Sum(Value) |
+--------+------------+
|      1 |         20 |
|      2 |          8 |
|      1 |         17 |
|      2 |          1 |
|      0 |         15 |
|      1 |          9 |
|      3 |          1 |
+--------+------------+
How can I do that in SQL Server?
NB: The values in the ID column are contiguous.
Per the tag I added to your question, this is a gaps and islands problem.
The best-performing solution will likely be:
WITH T
AS (SELECT *,
ID - ROW_NUMBER() OVER (PARTITION BY [STATUS] ORDER BY [ID]) AS Grp
FROM YourTable)
SELECT [STATUS],
SUM([VALUE]) AS [SUM(VALUE)]
FROM T
GROUP BY [STATUS],
Grp
ORDER BY MIN(ID)
If the ID values were not guaranteed contiguous as stated, then you would need to use
ROW_NUMBER() OVER (ORDER BY [ID]) -
ROW_NUMBER() OVER (PARTITION BY [STATUS] ORDER BY [ID]) AS Grp
instead in the CTE definition, as in the full version below.
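So the full query for non-contiguous IDs would read:
WITH T
AS (SELECT *,
           ROW_NUMBER() OVER (ORDER BY [ID]) -
           ROW_NUMBER() OVER (PARTITION BY [STATUS] ORDER BY [ID]) AS Grp
    FROM YourTable)
SELECT [STATUS],
       SUM([VALUE]) AS [SUM(VALUE)]
FROM T
GROUP BY [STATUS],
         Grp
ORDER BY MIN(ID)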
SQL Fiddle