PostgreSQL - column value changed - select query optimization - sql

Say we have a table:
CREATE TABLE p
(
id serial NOT NULL,
val boolean NOT NULL,
PRIMARY KEY (id)
);
Populated with some rows:
insert into p (val)
values (true),(false),(false),(true),(true),(true),(false);
ID VAL
1 1
2 0
3 0
4 1
5 1
6 1
7 0
I want to determine when the value has been changed. So the result of my query should be:
ID VAL
2 0
4 1
7 0
I have a solution with joins and subqueries:
select min(id) id, val from
(
select p1.id, p1.val, max(p2.id) last_prev
from p p1
join p p2
on p2.id < p1.id and p2.val != p1.val
group by p1.id, p1.val
) tmp
group by val, last_prev
order by id;
But it is very inefficient and will work extremely slowly on tables with many rows.
I believe there could be a more efficient solution using PostgreSQL window functions?
SQL Fiddle

This is how I would do it with an analytic:
SELECT id, val
FROM (
    SELECT id, val,
           LAG(val) OVER (ORDER BY id) AS prev_val
    FROM p
) x
WHERE val <> COALESCE(prev_val, val)
ORDER BY id;
Update (some explanation):
Analytic functions operate as a post-processing step. The query result is broken into groupings (partition by) and the analytic function is applied within the context of a grouping.
In this case, the query is a selection from p. The analytic function being applied is LAG. Since there is no partition by clause, there is only one grouping: the entire result set. This grouping is ordered by id. LAG returns the value of the previous row in the grouping using the specified order. The result is each row having an additional column (aliased prev_val) which is the val of the preceding row. That is the subquery.
Then we look for rows where the val does not match the val of the previous row (prev_val). The COALESCE handles the special case of the first row which does not have a previous value.
Analytic functions may seem a bit strange at first, but a search on analytic functions turns up a lot of examples walking through how they work. For example: http://www.cs.utexas.edu/~cannata/dbms/Analytic%20Functions%20in%20Oracle%208i%20and%209i.htm Just remember that it is a post-processing step: you won't be able to filter on the value of an analytic function unless you wrap it in a subquery.
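To make that last point concrete, here is what happens if you try to filter on the analytic directly (a minimal sketch against the table above):
SELECT id, val
FROM p
WHERE LAG(val) OVER (ORDER BY id) <> val;
-- ERROR:  window functions are not allowed in WHERE
Hence the subquery (or a CTE) wrapping the LAG expression in the solution above.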

Window function
Instead of calling COALESCE, you can provide a default from the window function lag() directly. A minor detail in this case since all columns are defined NOT NULL. But this may be essential to distinguish "no previous row" from "NULL in previous row".
SELECT id, val
FROM (
SELECT id, val, lag(val, 1, val) OVER (ORDER BY id) <> val AS changed
FROM p
) sub
WHERE changed
ORDER BY id;
Compute the result of the comparison immediately, since the previous value is not of interest per se, only a possible change. Shorter and may be a tiny bit faster.
If you consider the first row to be "changed" (unlike what your demo output suggests), you need to observe NULL values - even though your columns are defined NOT NULL. Basic lag() returns NULL when there is no previous row:
SELECT id, val
FROM (
SELECT id, val, lag(val) OVER (ORDER BY id) IS DISTINCT FROM val AS changed
FROM p
) sub
WHERE changed
ORDER BY id;
Or employ the additional parameters of lag() once again:
SELECT id, val
FROM (
SELECT id, val, lag(val, 1, NOT val) OVER (ORDER BY id) <> val AS changed
FROM p
) sub
WHERE changed
ORDER BY id;
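Applied to the sample data, both of these variants flag the first row as changed, so the result is:
ID VAL
1 1
2 0
4 1
7 0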
Recursive CTE
As proof of concept. :)
Performance won't keep up with posted alternatives.
WITH RECURSIVE cte AS (
SELECT id, val
FROM p
WHERE NOT EXISTS (
SELECT 1
FROM p p0
WHERE p0.id < p.id
)
UNION ALL
SELECT p.id, p.val
FROM cte
JOIN p ON p.id > cte.id
AND p.val <> cte.val
WHERE NOT EXISTS (
SELECT 1
FROM p p0
WHERE p0.id > cte.id
AND p0.val <> cte.val
AND p0.id < p.id
)
)
SELECT * FROM cte;
With an improvement from @wildplasser.
SQL Fiddle demonstrating all.

Can even be done without window functions.
SELECT * FROM p p0
WHERE EXISTS (
SELECT * FROM p ex
WHERE ex.id < p0.id
AND ex.val <> p0.val
AND NOT EXISTS (
SELECT * FROM p nx
WHERE nx.id < p0.id
AND nx.id > ex.id
)
);
UPDATE: Self-joining a non-recursive CTE (could also be a subquery instead of a CTE)
WITH drag AS (
SELECT id
, rank() OVER (ORDER BY id) AS rnk
, val
FROM p
)
SELECT d1.*
FROM drag d1
JOIN drag d0 ON d0.rnk = d1.rnk -1
WHERE d1.val <> d0.val
;
This nonrecursive CTE approach is surprisingly fast, although it needs an implicit sort.
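To verify the sort and compare plans across the variants, standard EXPLAIN works; for example:
EXPLAIN (ANALYZE, BUFFERS)
WITH drag AS (
    SELECT id, rank() OVER (ORDER BY id) AS rnk, val
    FROM p
)
SELECT d1.*
FROM drag d1
JOIN drag d0 ON d0.rnk = d1.rnk - 1
WHERE d1.val <> d0.val;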

Using 2 row_number() computations: this is also possible to do with the usual "islands and gaps" SQL technique (which could be useful if you can't use the lag() window function for some reason):
with cte1 as (
select
*,
row_number() over(order by id) as rn1,
row_number() over(partition by val order by id) as rn2
from p
)
select *, rn1 - rn2 as g
from cte1
order by id
So this query will give you all the islands:
ID VAL RN1 RN2 G
1 1 1 1 0
2 0 2 1 1
3 0 3 2 1
4 1 4 2 2
5 1 5 3 2
6 1 6 4 2
7 0 7 3 4
You can see how the G column can be used to group these islands together:
with cte1 as (
select
*,
row_number() over(order by id) as rn1,
row_number() over(partition by val order by id) as rn2
from p
)
select
min(id) as id,
val
from cte1
group by val, rn1 - rn2
order by 1
So you'll get
ID VAL
1 1
2 0
4 1
7 0
The only thing left is to remove the first record, which can be done with the min(...) over () window function:
with cte1 as (
...
), cte2 as (
select
min(id) as id,
val,
min(min(id)) over() as mid
from cte1
group by val, rn1 - rn2
)
select id, val
from cte2
where id <> mid
And results:
ID VAL
2 0
4 1
7 0
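For reference, here is the full statement with both CTEs spelled out end to end:
with cte1 as (
    select *,
           row_number() over(order by id) as rn1,
           row_number() over(partition by val order by id) as rn2
    from p
), cte2 as (
    select min(id) as id,
           val,
           min(min(id)) over() as mid
    from cte1
    group by val, rn1 - rn2
)
select id, val
from cte2
where id <> mid
order by id;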

A simple inner join can do it, assuming the ids are sequential with no gaps. SQL Fiddle
select p2.id, p2.val
from
p p1
inner join
p p2 on p2.id = p1.id + 1
where p2.val != p1.val
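If the ids can have gaps (deleted rows, rolled-back inserts), the same pairing idea still works by numbering the rows first; a sketch:
with q as (
    select id, val, row_number() over (order by id) as rn
    from p
)
select p2.id, p2.val
from q p1
join q p2 on p2.rn = p1.rn + 1
where p2.val != p1.val;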

Related

Is there a way to collapse ordered rows by terminal values with postgres window clause

I have a table foo:
some_fk  some_field  some_date_field
1        A           1990-01-01
1        B           1990-01-02
1        C           1990-03-01
1        X           1990-04-01
2        B           1990-01-01
2        B           1990-01-05
2        Z           1991-04-11
2        C           1992-01-01
2        B           1992-02-01
2        Y           1992-03-01
3        C           1990-01-01
some_field has 6 possible values: [A,B,C,X,Y,Z]
Where [A,B,C] signify opening or continuation events and [X,Y,Z] signify closing events. How do I get each span of time between the first opening event and closing event of each span, partitioned by some_fk, as shown in the table below:
some_fk  some_date_field_start  some_date_field_end
1        1990-01-01             1990-04-01
2        1990-01-01             1991-04-11
2        1992-01-01             1992-03-01
3        1990-01-01             NULL
*Note that a non-terminated time span ends with NULL
I do have a solution that involves 3 common table expressions, but I'm wondering if there is a (better/more elegant/canonical) way to do this in PostgreSQL without nested queries.
My approach was something like:
WITH ranked AS (
SELECT
RANK() OVER (PARTITION BY some_fk ORDER BY some_date_field) AS "rank",
some_fk,
some_field,
some_date_field
FROM foo
), openers AS (
SELECT * FROM ranked WHERE some_field IN ('A','B','C')
), closers AS (
SELECT
*,
LAG("rank") OVER (PARTITION BY some_fk ORDER BY "rank") AS rank_lag
FROM ranked WHERE some_field IN ('X','Y','Z')
)
SELECT DISTINCT
openers.some_fk,
FIRST_VALUE(openers.some_date_field) OVER (PARTITION BY some_fk ORDER BY "rank")
AS some_date_field_start,
closers.some_date_field AS some_date_field_end
FROM openers
JOIN closers
ON openers.some_fk = closers.some_fk
WHERE openers."rank" BETWEEN COALESCE(closers.rank_lag, 0) AND closers."rank"
... but I feel there must be a better way.
Thanks in advance for the help.
Another approach is to create a grouping ID with a running sum of the closing events. Then, in an outer query, you can GROUP BY and pick the min() and max() dates.
Select some_fk, min(some_date_field) as some_date_field_start, max(some_date_field) as some_date_field_end
From (
    Select some_fk, some_date_field,
        Coalesce(Sum(Case When some_field in ('X','Y','Z') Then 1 Else 0 End)
            Over (Partition By some_fk Order By some_date_field
                  Rows Between Unbounded Preceding And 1 Preceding), 0)
            as some_grouping
    From foo
) as grouped
Group By some_fk, some_grouping
Order By some_fk, some_grouping
This seems a little simpler at least to me.
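To see the groups the running sum produces before aggregation, you can run the inner query on its own (same logic, just exposing some_grouping):
Select some_fk, some_date_field, some_field,
    Coalesce(Sum(Case When some_field in ('X','Y','Z') Then 1 Else 0 End)
        Over (Partition By some_fk Order By some_date_field
              Rows Between Unbounded Preceding And 1 Preceding), 0)
        as some_grouping
From foo
Order By some_fk, some_date_field;
The frame ends at the row before the current one, so each closing event still belongs to the group it terminates; the Coalesce is needed because an empty frame sums to NULL, which would otherwise split the first row of every partition into its own group.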
The basis of the query is to use LAG to determine if the previous record was a closure.
SELECT *,
LAG(some_field) OVER (PARTITION BY some_fk ORDER BY some_date_field) Previous_some_field
FROM foo
This allows you to filter down to the correct 4 records from your expected results, with the first 2 columns included. Your mistake was to put the WHERE clause onto that query directly; instead, use it as-is in a subquery and write the WHERE in the main query. From that point, you have several possibilities to finish the query.
Here is a version using a scalar subquery:
SELECT some_fk, some_date_field AS some_date_field_start,
(
SELECT MIN(some_date_field)
FROM foo
WHERE some_fk = F.some_fk AND some_date_field > F.some_date_field AND some_field IN ('X','Y','Z')
) AS some_date_field_end
FROM (
SELECT *,
LAG(some_field) OVER (PARTITION BY some_fk ORDER BY some_date_field) Previous_some_field
FROM foo
) F
WHERE some_field IN ('A','B','C')
AND COALESCE(previous_some_field,'Z') IN ('X','Y','Z')
Here is another version using a CROSS JOIN LATERAL:
SELECT some_fk, some_date_field AS some_date_field_start, some_date_field_end
FROM (
SELECT *,
LAG(some_field) OVER (PARTITION BY some_fk ORDER BY some_date_field) Previous_some_field
FROM foo
) F1
CROSS JOIN LATERAL (
SELECT MIN(some_date_field) AS some_date_field_end
FROM foo
WHERE some_fk = F1.some_fk AND some_date_field > F1.some_date_field AND some_field IN ('X','Y','Z')
) F2
WHERE some_field IN ('A','B','C')
AND COALESCE(previous_some_field,'Z') IN ('X','Y','Z')
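Both variants repeatedly look up the earliest closing event for a given some_fk, so a partial index can help; a hypothetical index (name and exact shape are assumptions, not part of the original answer):
CREATE INDEX foo_closers_idx ON foo (some_fk, some_date_field)
WHERE some_field IN ('X','Y','Z');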

select value based on max of other column

I have a few questions about a table I'm trying to make in Postgres.
The following table is my input:
id  area  count  function
1   100   20     living
1   200   30     industry
2   400   10     living
2   400   10     industry
2   400   20     education
3   150   1      industry
3   150   1      education
I want to group by id and get the dominant function based on max area, while summing up area and count. When area is equal, it should be based on max count; when both area and count are equal, it should be based on function priority (I still have to decide whether education takes priority over industry or vice versa). So the result should be:
id  area  count  function
1   300   50     industry
2   1200  40     education
3   300   2      industry
I tried a lot of things and maybe it's easy, but I don't get it. Can someone help me get the right SQL?
One method uses row_number() and conditional aggregation:
select id, sum(area), sum(count),
       max(function) filter (where seqnum = 1) as function
from (select t.*,
             row_number() over (partition by id
                                order by area desc, count desc, function desc) as seqnum
      from t
     ) t
group by id;
Another method uses distinct on:
select distinct on (id)
       id,
       sum(area) over (partition by id) as area,
       sum(count) over (partition by id) as count,
       function
from t
order by id, t.area desc, t.count desc, t.function desc;
Use a scalar sub-query for "function".
select t.id, sum(t.area), sum(t.count),
(
select "function"
from the_table
where id = t.id
order by area desc, count desc, "function" desc
limit 1
) as "function"
from the_table as t
group by t.id order by t.id;
SQL Fiddle
You can use sum() as a window function:
select distinct on (t.id)
id,
sum(area) over (partition by id) as area,
sum(count) over (partition by id) as count,
( select function from tbl_test where tbl_test.id = t.id order by count desc limit 1 ) as function
from tbl_test t
This is how you get the function for each group based on id:
select yt1.id, yt1.function
from yourtable yt1
left join yourtable yt2
  on yt1.id = yt2.id and yt1.area < yt2.area
where yt2.id is null;
(we ensure that no row yt2 exists with the same id but a higher area)
This would work nicely, but you might have several rows sharing the max area with different function values. To cope with this issue, let's ensure that exactly one is chosen:
select yt1.id, max(yt1.function) as function
from yourtable yt1
left join yourtable yt2
  on yt1.id = yt2.id and yt1.area < yt2.area
where yt2.id is null
group by yt1.id;
Now, let's join this to our main table:
select yourtable.id, sum(yourtable.area), sum(yourtable.count), t.function
from yourtable
join (
    select yt1.id, max(yt1.function) as function
    from yourtable yt1
    left join yourtable yt2
      on yt1.id = yt2.id and yt1.area < yt2.area
    where yt2.id is null
    group by yt1.id
) t
on yourtable.id = t.id
group by yourtable.id, t.function;

Sql Range Groups Start and End Id

I have a query that I want to break into 'chunks' of size 200 and return the start id and end id of each 'chunk'.
Example:
select t.id
from t
where t.x = y --this predicate will cause the ids to not be sequential
If the example was the query I'm trying to break into 'chunks' I'd want to return:
(1st ID, 200th ID), (201st ID, 400th ID)...(start of final range ID, end of range ID)
Edit: For the final range, if it is not a full 200 rows it should still supply the final id in the query.
Is there a way to do this with just SQL or will I have to resort to application processing and/or multiple queries similar to a pagination implementation?
If there is a way to do this in SQL please supply an example.
Hmmm, I think the easiest way is to use row_number():
select id
from (select t.*, row_number() over (order by id) as seqnum
from t
where t.x = y
) t
where (seqnum % 200) in (0, 1);
EDIT:
Based on your comments:
select min(id) as startid, max(id) as endid
from (select t.*,
floor((row_number() over (order by id) - 1) / 200) as grp
from t
where t.x = y
) t
group by grp;
L for Left and R for Right
WITH cte AS (
SELECT
t.id,
row_number() over (order by id) as seqnum
FROM Table t
WHERE t.x = y
)
SELECT L.id as start_id, COALESCE(R.id, (SELECT MAX(ID) FROM cte) ) as end_id
FROM cte L
LEFT JOIN cte R
ON L.seqnum = R.seqnum - 199
WHERE L.seqnum % 200 = 1
SqlFiddleDemo
The SQL Fiddle demo filters only even numbers and uses blocks of 4; here, R.seqnum - 199 (together with L.seqnum % 200 = 1) corresponds to a block size of 200.
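Both constants encode the block size and must change together; for example, chunks of 500 (same hypothetical schema as above):
WITH cte AS (
    SELECT t.id, row_number() over (order by id) as seqnum
    FROM Table t
    WHERE t.x = y
)
SELECT L.id as start_id, COALESCE(R.id, (SELECT MAX(ID) FROM cte)) as end_id
FROM cte L
LEFT JOIN cte R ON L.seqnum = R.seqnum - 499
WHERE L.seqnum % 500 = 1;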

SQL query for column threaded relationship

This is a simplified view of a table. I apologize, but I could not save a picture of the table so I hope this is ok.
c1___c2
1____a
1____b
2____a
2____b
2____c
2____d
3____e
3____a
4____z
5____d
The result is that, due to the relationships in column c2:
Group 1 would include 1, 2, 3, 5 (because they have overlapping c2 values, effectively chaining a=b=c=d=e)
Group 2 would include 4
I have millions of rows with this kind of data and currently there is a cursor job that runs x number of times to build these groups. I am able to visualize how this should work, but I have not been able to build a query that can pull out this relationship.
Any suggestions?
Thank you
Tested on SQL Server 2012:
WITH t AS (
SELECT
t.c1,
t.c2,
tm.c1_min
FROM
Test t
JOIN
(
SELECT
c2,
MIN(c1) AS c1_min
FROM
Test
GROUP BY
c2
) AS tm
ON
t.c2 = tm.c2
),
rt AS (
SELECT
c1_min,
c1,
1 AS cnt
FROM
t
UNION ALL
SELECT
rt.c1_min,
t.c1,
rt.cnt + 1 AS cnt
FROM
rt
JOIN
t
ON
rt.c1 = t.c1_min
AND
rt.c1 < t.c1
)
SELECT
SUM(t.rst) OVER (ORDER BY t.ord ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS group_number,
t.c1
FROM
(
SELECT
t.c1,
t.rst,
t.ord
FROM
(
SELECT
rt.c1,
CASE
WHEN rt.c1_min = MIN(rt.c1_min) OVER (ORDER BY rt.c1_min, rt.c1 ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) THEN 0
ELSE 1
END AS rst,
ROW_NUMBER() OVER (ORDER BY rt.c1_min, rt.c1) AS ord,
ROW_NUMBER() OVER (PARTITION BY rt.c1 ORDER BY rt.c1_min, rt.cnt) AS qfy
FROM
rt
) AS t
WHERE
t.qfy = 1
) AS t;
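A hypothetical setup matching the sample data, if you want to try the query (the answer assumes the table is named Test):
CREATE TABLE Test (c1 int, c2 varchar(10));
INSERT INTO Test (c1, c2) VALUES
(1,'a'),(1,'b'),(2,'a'),(2,'b'),(2,'c'),(2,'d'),
(3,'e'),(3,'a'),(4,'z'),(5,'d');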

Group By Retrieve 4 Values

I have the following query
SELECT Cod ,
MIN(Id) AS id_Min,
-- retrieve value min in the middle as id_Min_Middle,
-- retrieve value max in the middle as id_Max_Middle,
MAX(Id) AS id_Max,
COUNT(*) AS Tot
FROM Table a ( NOLOCK )
GROUP BY Cod
HAVING COUNT(*)=4
How could I retrieve the values between min and max as I have done for min and max?
If I use SUM(Id) - (MIN(Id) + MAX(Id)) I get the sum of the middle min and max, but not the individual values I want.
EXAMPLES
Cod | Id
Stack 10
Stack 15
Stack 11
Stack 40
Overflow 1
Overflow 120
Overflow 15
Overflow 100
Required output
Cod | Min | Min_In_The_Middle | Max_In_The_Middle | Max
Stack 10 11 15 40
Overflow 1 15 100 120
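For reference, a hypothetical setup for the sample data; the answers below use various table names (Table, MyTable, tab), so substitute accordingly:
CREATE TABLE MyTable (Cod varchar(20), Id int);
INSERT INTO MyTable (Cod, Id) VALUES
('Stack',10),('Stack',15),('Stack',11),('Stack',40),
('Overflow',1),('Overflow',120),('Overflow',15),('Overflow',100);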
Just one [Table | Clustered Index] Scan is needed (demo here):
SELECT pvt.Cod,
pvt.[1] AS MinValue,
pvt.[2] AS MinInterValue,
pvt.[3] AS MaxInterValue,
pvt.[4] AS MaxValue
FROM
(
SELECT x.Cod, x.ID, x.RowNumAsc
FROM
(
SELECT *,
ROW_NUMBER() OVER(PARTITION BY t.Cod ORDER BY t.ID ASC) RowNumAsc,
ROW_NUMBER() OVER(PARTITION BY t.Cod ORDER BY t.ID DESC) RowNumDesc
FROM MyTable t
) x
WHERE x.RowNumAsc = 1 AND x.RowNumDesc = 4
OR x.RowNumAsc = 2 AND x.RowNumDesc = 3
OR x.RowNumAsc = 3 AND x.RowNumDesc = 2
OR x.RowNumAsc = 4 AND x.RowNumDesc = 1
) y
PIVOT ( MAX(y.ID) FOR y.RowNumAsc IN ([1], [2], [3], [4]) ) pvt;
Try using this, best of luck
WITH temp AS
(SELECT cod, MIN (ID) min_id, MAX (ID) max_id
FROM tab
GROUP BY cod
HAVING COUNT (ID) = 4)
SELECT temp.cod, temp.min_id,
(SELECT MIN (ID)
FROM tab
WHERE cod = temp.cod AND ID NOT IN (temp.min_id)
GROUP BY cod) min_mid_id,
(SELECT MAX (ID)
FROM tab
WHERE cod = temp.cod AND ID NOT IN (temp.max_id)
GROUP BY cod) max_mid_id, temp.max_id
FROM temp;
I'm not sure what it means for your question to be tagged plsql and sql-server. But I'll assume you're working with a database system that supports CTEs and window functions.
To generalize what you've been trying to do, first assign row numbers to the rows, then use whatever technique you want to achieve the pivot:
;WITH OrderedValues as (
SELECT Cod, Id, ROW_NUMBER() OVER (PARTITION BY Cod ORDER BY Id) as rn,
COUNT(*) OVER (PARTITION BY Cod) as Cnt
FROM Table (NOLOCK)
), With4Values as (
SELECT * from OrderedValues where Cnt=4
)
SELECT Cod,
--However you want to do the pivot. Here I'll use MAX/CASE
MAX(CASE WHEN rn=1 THEN Id END) as Value1,
MAX(CASE WHEN rn=2 THEN Id END) as Value2,
MAX(CASE WHEN rn=3 THEN Id END) as Value3,
MAX(CASE WHEN rn=4 THEN Id END) as Value4
FROM
With4Values
GROUP BY
Cod
You can hopefully see that this is more easily extended to more columns than answering your overly specific question about 3 or 4 rows. But if you need to deal with an arbitrary number of columns, you'll have to switch to dynamic SQL.
I understand you want to exclude the extreme values and find min and max for the rest.
This is what I think of, but I had no chance to run and test it...
WITH Extremes AS ( SELECT Cod, MAX(ID) AS Id_Max, MIN(ID) AS Id_Min
FROM [Table] a GROUP BY Cod)
SELECT
e.Cod,
e.Id_Min,
MIN(a.Id) AS id_Min_Middle,
MAX(a.Id) AS id_Max_Middle,
e.Id_Max
FROM Extremes e
LEFT JOIN [Table] a ON a.Cod = e.Cod AND a.Id > e.Id_Min AND a.Id < e.Id_Max
GROUP BY e.Cod, e.Id_Min, e.Id_Max