SQL group by changing column

Suppose I have a table sorted by date, like so:
+------------+-------+
|    DATE    | VALUE |
+------------+-------+
| 01-09-2020 |     5 |
| 01-15-2020 |     5 |
| 01-17-2020 |     5 |
| 02-03-2020 |     8 |
| 02-13-2020 |     8 |
| 02-20-2020 |     8 |
| 02-23-2020 |     5 |
| 02-25-2020 |     5 |
| 02-28-2020 |     3 |
| 03-13-2020 |     3 |
| 03-18-2020 |     3 |
+------------+-------+
I want to group rows by each change in value over that date range, adding a column with a number that increments on every change to mark the group.
I have tried a number of different things, such as using the lag function:
SELECT value, value - lag(value) over (order by date) as count
GROUP BY value
In short, I want to take the table above and have it look like:
+------------+-------+-------+
|    DATE    | VALUE | COUNT |
+------------+-------+-------+
| 01-09-2020 |     5 |     1 |
| 01-15-2020 |     5 |     1 |
| 01-17-2020 |     5 |     1 |
| 02-03-2020 |     8 |     2 |
| 02-13-2020 |     8 |     2 |
| 02-20-2020 |     8 |     2 |
| 02-23-2020 |     5 |     3 |
| 02-25-2020 |     5 |     3 |
| 02-28-2020 |     3 |     4 |
| 03-13-2020 |     3 |     4 |
| 03-18-2020 |     3 |     4 |
+------------+-------+-------+
I want to eventually have it all in one small table with the earliest date for each.
+------------+-------+-------+
|    DATE    | VALUE | COUNT |
+------------+-------+-------+
| 01-09-2020 |     5 |     1 |
| 02-03-2020 |     8 |     2 |
| 02-23-2020 |     5 |     3 |
| 02-28-2020 |     3 |     4 |
+------------+-------+-------+
Any help would be much appreciated.

You can use a combination of the row_number and dense_rank functions to get the required results, like below:
;with cte as
(
    select t.DATE, t.VALUE,
           dense_rank() over (partition by t.VALUE order by t.DATE) as d_rank,
           row_number() over (partition by t.VALUE order by t.DATE) as r_num
    from t
)
select cte.DATE, cte.VALUE, cte.d_rank as count
from cte
where cte.r_num = 1

You can use lag(), a cumulative sum, and a subquery:
SELECT date, value,
       SUM(CASE WHEN prev_value = value THEN 0 ELSE 1 END) OVER (ORDER BY date) AS count
FROM (SELECT t.*, LAG(value) OVER (ORDER BY date) AS prev_value
      FROM t
     ) t
Here is a db<>fiddle.
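If you then want the final one-row-per-group table with the earliest date, you can aggregate that result. A minimal sketch, assuming the same table t and column names as above:
-- one row per change group, keeping that group's earliest date
SELECT MIN(date) AS date, value, grp AS count
FROM (SELECT t.*,
             -- running count of value changes, as in the query above
             SUM(CASE WHEN prev_value = value THEN 0 ELSE 1 END)
                 OVER (ORDER BY date) AS grp
      FROM (SELECT t.*, LAG(value) OVER (ORDER BY date) AS prev_value
            FROM t
           ) t
     ) t
GROUP BY value, grp
ORDER BY grp;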

You can apply the lag() and then row_number() analytic functions, filtering on the inequality between each value and its lagged value:
WITH t2 AS
(
    SELECT LAG(value, 1, value - 1) OVER (ORDER BY date) AS lg,
           t.*
    FROM t
)
SELECT t2.date, t2.value, ROW_NUMBER() OVER (ORDER BY t2.date) AS count
FROM t2
WHERE value - lg != 0
Demo

Related

Selecting rows that don't have duplicates

Let's say I have the following table:
| sku | id | value | count |
|-----|----|-------|-------|
| A | 1 | 1 | 2 |
| A | 1 | 2 | 2 |
| A | 3 | 3 | 3 |
I want to select rows that don't have the same count for the same id. So my desired outcome is:
| sku | id | value | count |
|-----|----|-------|-------|
| A | 3 | 3 | 3 |
I need something that works with Postgres 10
A simple method is window functions:
select t.*
from (select t.*, count(*) over (partition by sku, id) as cnt
      from t
     ) t
where cnt = 1;
This assumes you really mean the sku/id combination.
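If you'd rather avoid window functions, a plain-aggregation equivalent should also work on Postgres 10. A sketch, assuming the table is named t:
select t.*
from t
where (sku, id) in (select sku, id
                    from t
                    group by sku, id
                    having count(*) = 1);  -- keep only unduplicated sku/id pairs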

PostgreSQL get relative average with group by

I have a table as follows. The rows are in a specific order.
id | value
------+---------------------
1 | 2
1 | 4
1 | 3
2 | 2
2 | 2
2 | 5
I want to group the rows by the column 'id' and, for each row, get the average of value over the values up to and including that row (as explained in brackets in the following example):
id | value | RelativeAverage
------+-------------+--------------------
1 | 2 | (2/1) = 2
1 | 4 | ((2+4)/2) = 3
1 | 3 | ((2+4+3)/3) = 3
2 | 2 | (2/1) = 2
2 | 2 | ((2+2)/2) = 2
2 | 5 | ((2+2+5)/3) = 3
Is there an approach with which I can achieve this?
Thanks in Advance
Wrong query:
select
id, value,
sum(value) over(arrangement), rank() over(arrangement),
sum(value) over(arrangement)::numeric / rank() over(arrangement)
as relative_average
from tbl
window arrangement as (partition by id order by id);
Output (wrong):
| id | value | sum | rank | relative_average |
|----|-------|-----|------|------------------|
| 1 | 2 | 9 | 1 | 9 |
| 1 | 4 | 9 | 1 | 9 |
| 1 | 3 | 9 | 1 | 9 |
| 2 | 1 | 8 | 1 | 8 |
| 2 | 2 | 8 | 1 | 8 |
| 2 | 5 | 8 | 1 | 8 |
You need an ordering that reflects the actual arrangement of your data for sum and rank to work properly. You can use the row's hidden ctid field, but that is Postgres-specific.
Correct query:
select
id, value,
sum(value) over(arrangement), rank() over(arrangement),
sum(value) over(arrangement)::numeric / rank() over(arrangement)
as relative_average
from tbl
window arrangement as (partition by id order by tbl.ctid);
Output (correct):
| id | value | sum | rank | relative_average |
|----|-------|-----|------|--------------------|
| 1 | 2 | 2 | 1 | 2 |
| 1 | 4 | 6 | 2 | 3 |
| 1 | 3 | 9 | 3 | 3 |
| 2 | 1 | 1 | 1 | 1 |
| 2 | 2 | 3 | 2 | 1.5 |
| 2 | 5 | 8 | 3 | 2.6666666666666665 |
The best way is to introduce a serial primary key, so that a running total (sum over ()) based on the actual arrangement of your data can be achieved.
CREATE TABLE tbl
(ordered_pk serial primary key, "id" int, "value" int)
;
INSERT INTO tbl
("id", "value")
VALUES
(1, 2),
(1, 4),
(1, 3),
(2, 1),
(2, 2),
(2, 5)
;
Correct query:
select
id, value,
sum(value) over(arrangement), rank() over(arrangement),
sum(value) over(arrangement)::numeric / rank() over(arrangement)
as relative_average
from tbl
window arrangement as (partition by id order by ordered_pk);
Output (correct):
| id | value | sum | rank | relative_average |
|----|-------|-----|------|--------------------|
| 1 | 2 | 2 | 1 | 2 |
| 1 | 4 | 6 | 2 | 3 |
| 1 | 3 | 9 | 3 | 3 |
| 2 | 1 | 1 | 1 | 1 |
| 2 | 2 | 3 | 2 | 1.5 |
| 2 | 5 | 8 | 3 | 2.6666666666666665 |
Live test: http://sqlfiddle.com/#!17/f18276/1
You can order by value, but it will yield a different result; not necessarily wrong, just different, because of the different arrangement of the values. You then also need to use row_number instead of rank/dense_rank because values can be duplicated. Here is an example with duplicate values.
Correct query:
select
id, value,
sum(value) over(arrangement),
row_number() over(arrangement),
rank() over(arrangement),
dense_rank() over(arrangement),
sum(value) over(arrangement)::numeric / row_number() over(arrangement)
as relative_average
from tbl
window arrangement as (partition by id order by value)
Output:
| id | value | sum | row_number | rank | dense_rank | relative_average |
|----|-------|-----|------------|------|------------|--------------------|
| 1 | 2 | 2 | 1 | 1 | 1 | 2 |
| 1 | 3 | 5 | 2 | 2 | 2 | 2.5 |
| 1 | 4 | 9 | 3 | 3 | 3 | 3 |
| 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 5 | 2 | 2 | 2 | 2.5 |
| 2 | 2 | 5 | 3 | 2 | 2 | 1.6666666666666667 |
| 2 | 5 | 10 | 4 | 4 | 3 | 2.5 |
Live test:
http://sqlfiddle.com/#!17/2b5aac/1
Not so proud of my other answer. Just use avg.
Today I learned about rows between unbounded preceding and current row. It works with the actual arrangement of the data even in the absence of a good candidate field for order by, so it looks like you can at least get away with Postgres' hidden ctid field, or even avoid a serial primary key altogether. I'd still recommend ordering by a serial primary key or a date-created field, though.
Here's a better query: no need to divide, just use avg.
select
id, value,
avg(value) over(arrangement rows between unbounded preceding and current row)
from tbl
window arrangement as (partition by id);
Output
| id | value | avg |
|----|-------|--------------------|
| 1 | 2 | 2 |
| 1 | 4 | 3 |
| 1 | 3 | 3 |
| 2 | 1 | 1 |
| 2 | 2 | 1.5 |
| 2 | 5 | 2.6666666666666665 |
For comparison, here is the wrong query (ordering by id) again, with the avg column added alongside:
select
id, value,
sum(value) over(arrangement), rank() over(arrangement),
sum(value) over(arrangement)::numeric / rank() over(arrangement)
as relative_average,
avg(value) over(arrangement rows between unbounded preceding and current row)
from tbl
window arrangement as (partition by id order by id);
Output:
| id | value | sum | rank | relative_average | avg |
|----|-------|-----|------|------------------|--------------------|
| 1 | 2 | 9 | 1 | 9 | 2 |
| 1 | 4 | 9 | 1 | 9 | 3 |
| 1 | 3 | 9 | 1 | 9 | 3 |
| 2 | 1 | 8 | 1 | 8 | 1 |
| 2 | 2 | 8 | 1 | 8 | 1.5 |
| 2 | 5 | 8 | 1 | 8 | 2.6666666666666665 |
And the corrected ctid-ordered query, with the avg column added:
select
id, value,
sum(value) over(arrangement), rank() over(arrangement),
sum(value) over(arrangement)::numeric / rank() over(arrangement)
as relative_average,
avg(value) over(arrangement rows between unbounded preceding and current row)
from tbl
window arrangement as (partition by id order by tbl.ctid);
Output:
| id | value | sum | rank | relative_average | avg |
|----|-------|-----|------|--------------------|--------------------|
| 1 | 2 | 2 | 1 | 2 | 2 |
| 1 | 4 | 6 | 2 | 3 | 3 |
| 1 | 3 | 9 | 3 | 3 | 3 |
| 2 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 3 | 2 | 1.5 | 1.5 |
| 2 | 5 | 8 | 3 | 2.6666666666666665 | 2.6666666666666665 |
And the ordered_pk version, with the avg column added:
select
id, value,
sum(value) over(arrangement), rank() over(arrangement),
sum(value) over(arrangement)::numeric / rank() over(arrangement)
as relative_average,
avg(value) over(arrangement rows between unbounded preceding and current row)
from tbl
window arrangement as (partition by id order by ordered_pk);
Output:
| id | value | sum | rank | relative_average | avg |
|----|-------|-----|------|--------------------|--------------------|
| 1 | 2 | 2 | 1 | 2 | 2 |
| 1 | 4 | 6 | 2 | 3 | 3 |
| 1 | 3 | 9 | 3 | 3 | 3 |
| 2 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 3 | 2 | 1.5 | 1.5 |
| 2 | 5 | 8 | 3 | 2.6666666666666665 | 2.6666666666666665 |
Live test: http://sqlfiddle.com/#!17/f18276/9
rows between unbounded preceding and current row can also be written as rows unbounded preceding: http://sqlfiddle.com/#!17/f18276/11
And here's the result with order by value when value has duplicates.
select
    id, value,
    sum(value) over (arrangement),
    row_number() over (arrangement) as rn,
    rank() over (arrangement) as rank,
    dense_rank() over (arrangement) as drank,
    trunc( sum(value) over (arrangement)::numeric
           / row_number() over (arrangement), 2) as ra__rn,
    trunc( sum(value) over (arrangement)::numeric
           / rank() over (arrangement), 2) as ra__rank,
    trunc( sum(value) over (arrangement)::numeric
           / dense_rank() over (arrangement), 2) as ra__drank,
    trunc( avg(value) over (arrangement
           rows between unbounded preceding and current row), 2) as ra
from tbl
window arrangement as (partition by id order by value)
Output:
| id | value | sum | rn | rank | drank | ra__rn | ra__rank | ra__drank | ra |
|----|-------|-----|----|------|-------|--------|----------|-----------|------|
| 1 | 2 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 2 |
| 1 | 3 | 5 | 2 | 2 | 2 | 2.5 | 2.5 | 2.5 | 2.5 |
| 1 | 4 | 9 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 5 | 2 | 2 | 2 | 2.5 | 2.5 | 2.5 | 1.5 |
| 2 | 2 | 5 | 3 | 2 | 2 | 1.66 | 2.5 | 2.5 | 1.66 |
| 2 | 5 | 10 | 4 | 4 | 3 | 2.5 | 2.5 | 3.33 | 2.5 |
Live test: http://sqlfiddle.com/#!17/2b5aac/16
And here's the result with order by ordered_pk when value has duplicates.
select
    id, value,
    sum(value) over (arrangement),
    row_number() over (arrangement) as rn,
    rank() over (arrangement) as rank,
    dense_rank() over (arrangement) as drank,
    trunc( sum(value) over (arrangement)::numeric
           / row_number() over (arrangement), 2) as ra__rn,
    trunc( sum(value) over (arrangement)::numeric
           / rank() over (arrangement), 2) as ra__rank,
    trunc( sum(value) over (arrangement)::numeric
           / dense_rank() over (arrangement), 2) as ra__drank,
    trunc( avg(value) over (arrangement
           rows between unbounded preceding and current row), 2) as ra
from tbl
window arrangement as (partition by id order by ordered_pk)
Output:
| id | value | sum | rn | rank | drank | ra__rn | ra__rank | ra__drank | ra |
|----|-------|-----|----|------|-------|--------|----------|-----------|------|
| 1 | 2 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 2 |
| 1 | 4 | 6 | 2 | 2 | 2 | 3 | 3 | 3 | 3 |
| 1 | 3 | 9 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 3 | 2 | 2 | 2 | 1.5 | 1.5 | 1.5 | 1.5 |
| 2 | 2 | 5 | 3 | 3 | 3 | 1.66 | 1.66 | 1.66 | 1.66 |
| 2 | 5 | 10 | 4 | 4 | 4 | 2.5 | 2.5 | 2.5 | 2.5 |
Live test: http://sqlfiddle.com/#!17/baaf9/2
If I assume that you have an ordering column in the table, then what you want is:
select t.*,
       avg(value) over (partition by id
                        order by ?
                        rows between unbounded preceding and current row
                       ) as running_avg
from t;
The ? is the ordering column.
In other words, Postgres has a single built-in function that does exactly what you want -- and the function happens to be standard SQL.
The window frame using rows is required, because the default is range.
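To see the difference, compare both frames on data with a duplicated ordering value, such as the tbl data with two value = 2 rows for id 2 from the answer above; this is only a sketch against that assumed table. With the default range frame, rows tied on the ordering column are peers and are averaged together:
select id, value,
       -- default frame: range between unbounded preceding and current row,
       -- so rows tied on "value" both get the same (peer-inclusive) average
       avg(value) over (partition by id order by value) as range_avg,
       -- explicit rows frame: strictly one row at a time
       avg(value) over (partition by id order by value
                        rows between unbounded preceding
                             and current row) as rows_avg
from tbl;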
If you do not have an ordering column, then you should add one. I strongly advise you to NOT use ctid for this purpose. It might seem like it works on small sets of data, but it is not stable over time and it might not work on larger sets of data.
If you expect your data to be ordered by inserts, then use a serial column to capture the insert order.
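For example, on Postgres you might backfill such a column like this. A sketch only: the column name seq is my own, and the backfill order for rows that already exist is not guaranteed to reflect their original insert order:
-- add an auto-incrementing column; new inserts get increasing values
alter table t add column seq serial;

-- then order the running average by it
select t.*,
       avg(value) over (partition by id
                        order by seq
                        rows between unbounded preceding and current row
                       ) as running_avg
from t;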

Hive window functions: last value of previous partition

Using Hive window functions, I would like to get the last value of the previous partition:
| name | rank | type |
| one | 1 | T1 |
| two | 2 | T2 |
| thr | 3 | T2 |
| fou | 4 | T1 |
| fiv | 5 | T2 |
| six | 6 | T2 |
| sev | 7 | T2 |
Following query:
SELECT name,
       rank,
       first_value(rank) over (partition by type order by rank) as new_rank
FROM my_table
Would give:
| name | rank | type | new_rank |
| one | 1 | T1 | 1 |
| two | 2 | T2 | 2 |
| thr | 3 | T2 | 2 |
| fou | 4 | T1 | 4 |
| fiv | 5 | T2 | 5 |
| six | 6 | T2 | 5 |
| sev | 7 | T2 | 5 |
But what I need is "the last value of the previous partition":
| name | rank | type | new_rank |
| one | 1 | T1 | NULL |
| two | 2 | T2 | 1 |
| thr | 3 | T2 | 1 |
| fou | 4 | T1 | 3 |
| fiv | 5 | T2 | 4 |
| six | 6 | T2 | 4 |
| sev | 7 | T2 | 4 |
This seems quite tricky. This is a variant of gaps-and-islands. Here is the idea:
Identify the "islands" where type is the same (using difference of row numbers).
Then use lag() to introduce the previous rank into the island.
Do a min scan to get the new rank that you want.
So:
with gi as (
      select t.*,
             (seqnum - seqnum_t) as grp
      from (select t.*,
                   row_number() over (partition by type order by rank) as seqnum_t,
                   row_number() over (order by rank) as seqnum
            from t
           ) t
     ),
     gi2 as (
      select gi.*, lag(rank) over (order by gi.rank) as prev_rank
      from gi
     )
select gi2.*,
       min(prev_rank) over (partition by type, grp) as new_rank
from gi2
order by rank;
Here is a SQL Fiddle (albeit using Postgres).

PostgreSQL - select count of repeated continuous sequences

I have the following table/data:
| user_id | action_id | data |
-------------------------------------
| 10 | 1 | fly |
| 10 | 2 | train |
| 10 | 3 | fly |
| 10 | 4 | fly |
| 10 | 5 | fly |
| 10 | 6 | train |
| 10 | 7 | fly |
| 10 | 8 | train |
| 10 | 9 | fly |
| 10 | 10 | fly |
Is there a way in postgresql to count repeated continuous 'fly' occurrences? In this example, the results should be:
counts
------
1
3
1
2
Yes, it's possible, using the lag window function and a cumulative sum:
with FlagCTE as (
    select t.action_id, t.data,
           case when t.data = 'fly'
                 and t.data = lag(t.data) over (order by t.action_id)
                then 0 else 1 end as Flag
    from some_table t),
GroupCTE as (
    select t.action_id,
           t.data,
           sum(t.Flag) over (order by t.action_id) as GroupId
    from FlagCTE t
    where t.data = 'fly')
select count(*) as counts
from GroupCTE t
group by t.GroupId
order by t.GroupId
SQLFiddle Demo
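An alternative sketch using the row_number-difference form of gaps-and-islands, assuming the same some_table; every island of consecutive 'fly' rows shares one grp value:
select count(*) as counts
from (select t.*,
             -- constant within each run of equal "data" values
             row_number() over (order by action_id) -
             row_number() over (partition by data
                                order by action_id) as grp
      from some_table t
     ) t
where data = 'fly'
group by data, grp
order by min(action_id);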

Aggregation by positive/negative values v.2

I've posted several topics and every query had some problems :( I've changed the table and examples for better understanding.
I have a table called PROD_COST with 5 fields
(ID, Duration, Cost, COST_NEXT, COST_CHANGE).
I need an extra field called "groups" for aggregation.
Duration = number of days the price is valid (1 day = 1 row).
Cost = product price on that day.
Cost_next = lead(cost, 1, 0).
Cost_change = Cost_next - Cost.
example:
+----+----------+------+-------------+--------+
| ID | Duration | Cost | Cost_change | Groups |
+----+----------+------+-------------+--------+
|  1 |        1 | 10   |        -1.5 |      1 |
|  2 |        1 |  8.5 |         3.7 |      2 |
|  3 |        1 | 12.2 |         0   |      2 |
|  4 |        1 | 12.2 |        -2.2 |      3 |
|  5 |        1 | 10   |         0   |      3 |
|  6 |        1 | 10   |         3.2 |      4 |
|  7 |        1 | 13.2 |        -2.7 |      5 |
|  8 |        1 | 10.5 |        -1.5 |      5 |
|  9 |        1 |  9   |         0   |      5 |
| 10 |        1 |  9   |         0   |      5 |
| 11 |        1 |  9   |        -1   |      5 |
| 12 |        1 |  8   |         1.5 |      6 |
+----+----------+------+-------------+--------+
Now I need to fill the "Groups" field by grouping on Cost_change, which can be positive, negative, or 0.
Some kind guy advised me this query:
select id, COST_CHANGE, sum(GRP) over (order by id asc) + 1
from
(
    select *,
           case when sign(COST_CHANGE) != sign(isnull(lag(COST_CHANGE)
                         over (order by id asc), COST_CHANGE))
                 and COST_CHANGE != 0
                then 1 else 0 end as GRP
    from PROD_COST
) X
But there is a problem: if there are 0 values between two positive or two negative values, it groups them separately, for example:
+-------------+--------+
| Cost_change | Groups |
+-------------+--------+
| 9.262 | 5777 |
| -9.262 | 5778 |
| 9.262 | 5779 |
| 0.000 | 5779 |
| 9.608 | 5780 |
| -11.231 | 5781 |
| 10.000 | 5782 |
+-------------+--------+
I need to have:
+-------------+--------+
| Cost_change | Groups |
+-------------+--------+
| 9.262 | 5777 |
| -9.262 | 5778 |
| 9.262 | 5779 |
| 0.000 | 5779 |
| 9.608 | 5779 | -- Here
| -11.231 | 5780 |
| 10.000 | 5781 |
+-------------+--------+
In other words, if there are 0 values between two positive or two negative values, they should all be in one group, because the sequence MINUS-0-0-MINUS has no rotation. But if I had MINUS-0-0-PLUS, then GROUPS should be 1-1-1-2, because a positive value rotates with a negative value.
Thank you for your attention!
I'm using SQL Server 2012.
I think the best approach is to remove the zeros, do the calculation, and then re-insert them. So:
with pcg as (
      select pc.*, min(id) over (partition by grp) as grpid
      from (select pc.*,
                   row_number() over (order by id) -
                   row_number() over (partition by sign(cost_change)
                                      order by id
                                     ) as grp
            from prod_cost pc
            where cost_change <> 0
           ) pc
     )
select pc.*, max(g.groups) over (order by pc.id)
from prod_cost pc left join
     (select pcg.*, dense_rank() over (order by grpid) as groups
      from pcg
     ) g
     on pc.id = g.id;
The CTE assigns a group identifier based on the lowest id in the group, where the groups are bounded by actual sign changes. The subquery turns this into a number. The outer query then accumulates the maximum value, to give a value to the 0 records.
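As a self-contained check (SQL Server 2012 syntax), here is a sketch that inlines the sample cost changes from the question; the literal rows come from the example tables above, and the rest mirrors the query:
with prod_cost as (
    -- sample rows: 9.262, -9.262, 9.262, 0, 9.608, -11.231, 10
    select * from (values
        (1, cast(  9.262 as numeric(9,3))),
        (2,  -9.262), (3,   9.262), (4, 0.000),
        (5,   9.608), (6, -11.231), (7, 10.000)
    ) v(id, cost_change)
),
pcg as (
    select pc.*, min(id) over (partition by grp) as grpid
    from (select pc.*,
                 row_number() over (order by id) -
                 row_number() over (partition by sign(cost_change)
                                    order by id) as grp
          from prod_cost pc
          where cost_change <> 0
         ) pc
)
select pc.id, pc.cost_change,
       max(g.groups) over (order by pc.id) as groups  -- zero rows inherit the running max
from prod_cost pc left join
     (select pcg.*, dense_rank() over (order by grpid) as groups
      from pcg
     ) g
     on pc.id = g.id
order by pc.id;
Per the question's desired output, this should yield groups 1, 2, 3, 3, 3, 4, 5 (matching 5777 through 5781 above).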