Order query results by items in WHERE clause [duplicate] - sql

This question already has answers here:
ORDER BY the IN value list
(17 answers)
Closed 3 months ago.
How to let the query result be ordered by the exact order of passed items in the WHERE clause?
For example, using this query:
SELECT id, name FROM my_table
WHERE id in (1,3,5,2,4,6)
ORDER BY id
Result:
id | name
---------
1 | a
2 | b
3 | c
4 | d
5 | e
6 | f
What I expected:
id | name
---------
1 | a
3 | c
5 | e
2 | b
4 | d
6 | f
I noticed that there is a FIELD() function in MySQL. Is there an equivalent function in PostgreSQL?

Pass an array and use WITH ORDINALITY. That's cleanest and fastest:
SELECT id, t.name
FROM unnest ('{1,3,5,2,4,6}'::int[]) WITH ORDINALITY u(id, ord)
JOIN my_table t USING (id)
ORDER BY u.ord;
Assuming values in the passed array are distinct. Else, this solution preserves duplicates, while IN removes them. You'd have to define which behavior you want. But then the desired sort order is also ambiguous, which would make the question moot.
See:
ORDER BY the IN value list
PostgreSQL unnest() with element number

#chris Kao, use Position in postgresql.
Approach : 1
SELECT id, name FROM my_table
WHERE id in (1,3,5,2,4,6)
order by position(id::text in '1,3,5,2,4,6')
output:
id|name|
--+----+
1|a |
3|c |
5|e |
2|b |
4|d |
6|f |
Aprroach : 2
select id, name
from my_table mt
where id in (1,3,5,2,4,6)
order by array_position(array[1,3,5,2,4,6], mt.id);

Related

ORACLE SELECT DISTINCT VALUE ONLY IN SOME COLUMNS

+----+------+-------+---------+---------+
| id | order| value | type | account |
+----+------+-------+---------+---------+
| 1 | 1 | a | 2 | 1 |
| 1 | 2 | b | 1 | 1 |
| 1 | 3 | c | 4 | 1 |
| 1 | 4 | d | 2 | 1 |
| 1 | 5 | e | 1 | 1 |
| 1 | 5 | f | 6 | 1 |
| 2 | 6 | g | 1 | 1 |
+----+------+-------+---------+---------+
I need get a select of all fields of this table but only getting 1 row for each combination of id+type (I don't care the value of the type). But I tried some approach without result.
At the moment that I make an DISTINCT I cant include rest of the fields to make it available in a subquery. If I add ROWNUM in the subquery all rows will be different making this not working.
Some ideas?
My better query at the moment is this:
SELECT ID, TYPE, VALUE, ACCOUNT
FROM MYTABLE
WHERE ROWID IN (SELECT DISTINCT MAX(ROWID)
FROM MYTABLE
GROUP BY ID, TYPE);
It seems you need to select one (random) row for each distinct combination of id and type. If so, you could do that efficiently using the row_number analytic function. Something like this:
select id, type, value, account
from (
select id, type, value, account,
row_number() over (partition by id, type order by null) as rn
from your_table
)
where rn = 1
;
order by null means random ordering of rows within each group (partition) by (id, type); this means that the ordering step, which is usually time-consuming, will be trivial in this case. Also, Oracle optimizes such queries (for the filter rn = 1).
Or, in versions 12.1 and higher, you can get the same with the match_recognize clause:
select id, type, value, account
from my_table
match_recognize (
partition by id, type
all rows per match
pattern (^r)
define r as null is null
);
This partitions the rows by id and type, it doesn't order them (which means random ordering), and selects just the "first" row from each partition. Note that some analytic functions, including row_number(), require an order by clause (even when we don't care about the ordering) - order by null is customary, but it can't be left out completely. By contrast, in match_recognize you can leave out the order by clause (the default is "random order"). On the other hand, you can't leave out the define clause, even if it imposes no conditions whatsoever. Why Oracle doesn't use a default for that clause too, only Oracle knows.

Counting the total number of rows with SELECT DISTINCT ON without using a subquery

I have performing some queries using PostgreSQL SELECT DISTINCT ON syntax. I would like to have the query return the total number of rows alongside with every result row.
Assume I have a table my_table like the following:
CREATE TABLE my_table(
id int,
my_field text,
id_reference bigint
);
I then have a couple of values:
id | my_field | id_reference
----+----------+--------------
1 | a | 1
1 | b | 2
2 | a | 3
2 | c | 4
3 | x | 5
Basically my_table contains some versioned data. The id_reference is a reference to a global version of the database. Every change to the database will increase the global version number and changes will always add new rows to the tables (instead of updating/deleting values) and they will insert the new version number.
My goal is to perform a query that will only retrieve the latest values in the table, alongside with the total number of rows.
For example, in the above case I would like to retrieve the following output:
| total | id | my_field | id_reference |
+-------+----+----------+--------------+
| 3 | 1 | b | 2 |
+-------+----+----------+--------------+
| 3 | 2 | c | 4 |
+-------+----+----------+--------------+
| 3 | 3 | x | 5 |
+-------+----+----------+--------------+
My attemp is the following:
select distinct on (id)
count(*) over () as total,
*
from my_table
order by id, id_reference desc
This returns almost the correct output, except that total is the number of rows in my_table instead of being the number of rows of the resulting query:
total | id | my_field | id_reference
-------+----+----------+--------------
5 | 1 | b | 2
5 | 2 | c | 4
5 | 3 | x | 5
(3 rows)
As you can see it has 5 instead of the expected 3.
I can fix this by using a subquery and count as an aggregate function:
with my_values as (
select distinct on (id)
*
from my_table
order by id, id_reference desc
)
select count(*) over (), * from my_values
Which produces my expected output.
My question: is there a way to avoid using this subquery and have something similar to count(*) over () return the result I want?
You are looking at my_table 3 ways:
to find the latest id_reference for each id
to find my_field for the latest id_reference for each id
to count the distinct number of ids in the table
I therefore prefer this solution:
select
c.id_count as total,
a.id,
a.my_field,
b.max_id_reference
from
my_table a
join
(
select
id,
max(id_reference) as max_id_reference
from
my_table
group by
id
) b
on
a.id = b.id and
a.id_reference = b.max_id_reference
join
(
select
count(distinct id) as id_count
from
my_table
) c
on true;
This is a bit longer (especially the long thin way I write SQL) but it makes it clear what is happening. If you come back to it in a few months time (somebody usually does) then it will take less time to understand what is going on.
The "on true" at the end is a deliberate cartesian product because there can only ever be exactly one result from the subquery "c" and you do want a cartesian product with that.
There is nothing necessarily wrong with subqueries.

Get last element of an ordered set in postgresql

I am trying to get the last element of an ordered set, stored in a database table. The ordering is defined by one of the columns in the table. Also the table contains multiple sets, so I want the last one for each of the sets.
As an example consider the following table:
benchmarks=# select id,sorter from aggtest ;
id | sorter
----+--------
1 | 1
3 | 1
5 | 1
2 | 2
7 | 2
4 | 1
6 | 2
(7 rows)
Sorter 1 and 2 define each of the sets, sets are ordered by the id column. To get the last element of each set, I defined an aggregate function:
CREATE FUNCTION public.last_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
SELECT $2;
$$;
CREATE AGGREGATE public.last (
sfunc = public.last_agg,
basetype = anyelement,
stype = anyelement
);
As explained here.
However when I use this I get:
benchmarks=# select last(id),sorter from aggtest group by sorter order by sorter;
last | sorter
------+--------
4 | 1
6 | 2
(2 rows)
However, I want to get (5,1) and (7,2) as these are the last ids (numerically) in the set. Looking at how the aggregate mechanism works, I can see quite well, why the result is not what I want. The items are returned in the order I added them, and then aggregated so that the last one I added is returned.
I tried sorting by ids, so that each group is sorted independently, however that does not work:
benchmarks=# select last(id),sorter from aggtest group by sorter order by sorter,id;
ERROR: column "aggtest.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: ...(id),sorter from aggtest group by sorter order by sorter,id;
If I wrap the sorting criteria in another aggregate, I get wrong data again:
benchmarks=# select last(id),sorter from aggtest group by sorter order by sorter,last(id);
last | sorter
------+--------
4 | 1
6 | 2
(2 rows)
Also grouping by id in addition to sorter does not work obviously.
Of course there is an easier way, to get the last (highest) id for each group by using the max aggregate. However, I am not so much interested in the id but as in data associated with it (i.e. in the same row). Hence I do not to sort by id and then aggregate so that the row with the highest id is returned for each group.
What is the best way to accomplish this?
EDIT: Why does max(id) grouped by sorter not work
Assume the following complete table (unsorter represents the additional data I have in the table):
benchmarks=# select * from aggtest ;
id | sorter | unsorter
----+--------+----------
1 | 1 | 1
3 | 1 | 2
5 | 1 | 3
2 | 2 | 4
7 | 2 | 5
4 | 1 | 6
6 | 2 | 7
(7 rows)
I would like to retrieve the lines:
id | sorter | unsorter
----+--------+----------
5 | 1 | 3
7 | 2 | 5
However with max(id) and grouping by sorter I get:
benchmarks=# select max(id),sorter,unsorter from aggtest group by sorter;
ERROR: column "aggtest.unsorter" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: select max(id),sorter,unsorter from aggtest group by sorter;
Using a max(unsorter) obviously does not work either:
benchmarks=# select max(id),sorter,max(unsorter) from aggtest group by sorter;
max | sorter | max
-----+--------+-----
5 | 1 | 6
7 | 2 | 7
(2 rows)
However using distinct (the accepted answer) I get:
benchmarks=# select distinct on (sorter) id,sorter,unsorter from aggtest order by sorter, id desc;
id | sorter | unsorter
----+--------+----------
5 | 1 | 3
7 | 2 | 5
(2 rows)
Which has the correct additional data. The join approach also seems to work, by is slightly slower on the test data.
Why not use a window function:
select id, sorter
from (
select id, sorter,
row_number() over (partition by sorter order by id desc) as rn
from aggtest
) t
where rn = 1;
Or using Postgres distinct on operator which is usually faster:
select distinct on (sorter) id, sorter
from aggtest
order by sorter, id desc
You write:
Of course there is an easier way, to get the last (highest) id for
each group by using the max aggregate. However, I am not so much
interested in the id but as in data associated with it (i.e. in the
same row).
This query will give you the data associated with the highest id of each sorter group.
select a.* from aggtest a
join (
select max(id) max_id, sorter
from aggtest
group by sorter
) b on a.id = b.max_id and a.sorter = b.sorter
select distinct max(id) over (partition by sorter) id,sorter
from aggtest order by 2 asc
returns:
5;1
7;2

CTE to represent a logical table for the rows in a table which have the max value in one column

I have an "insert only" database, wherein records aren't physically updated, but rather logically updated by adding a new record, with a CRUD value, carrying a larger sequence. In this case, the "seq" (sequence) column is more in line with what you may consider a primary key, but the "id" is the logical identifier for the record. In the example below,
This is the physical representation of the table:
seq id name | CRUD |
----|-----|--------|------|
1 | 10 | john | C |
2 | 10 | joe | U |
3 | 11 | kent | C |
4 | 12 | katie | C |
5 | 12 | sue | U |
6 | 13 | jill | C |
7 | 14 | bill | C |
This is the logical representation of the table, considering the "most recent" records:
seq id name | CRUD |
----|-----|--------|------|
2 | 10 | joe | U |
3 | 11 | kent | C |
5 | 12 | sue | U |
6 | 13 | jill | C |
7 | 14 | bill | C |
In order to, for instance, retrieve the most recent record for the person with id=12, I would currently do something like this:
SELECT
*
FROM
PEOPLE P
WHERE
P.ID = 12
AND
P.SEQ = (
SELECT
MAX(P1.SEQ)
FROM
PEOPLE P1
WHERE P.ID = 12
)
...and I would receive this row:
seq id name | CRUD |
----|-----|--------|------|
5 | 12 | sue | U |
What I'd rather do is something like this:
WITH
NEW_P
AS
(
--CTE representing all of the most recent records
--i.e. for any given id, the most recent sequence
)
SELECT
*
FROM
NEW_P P2
WHERE
P2.ID = 12
The first SQL example using the the subquery already works for us.
Question: How can I leverage a CTE to simplify our predicates when needing to leverage the "most recent" logical view of the table. In essence, I don't want to inline a subquery every single time I want to get at the most recent record. I'd rather define a CTE and leverage that in any subsequent predicate.
P.S. While I'm currently using DB2, I'm looking for a solution that is database agnostic.
This is a clear case for window (or OLAP) functions, which are supported by all modern SQL databases. For example:
WITH
ORD_P
AS
(
SELECT p.*, ROW_NUMBER() OVER ( PARTITION BY id ORDER BY seq DESC) rn
FROM people p
)
,
NEW_P
AS
(
SELECT * from ORD_P
WHERE rn = 1
)
SELECT
*
FROM
NEW_P P2
WHERE
P2.ID = 12
PS. Not tested. You may need to explicitly list all columns in the CTE clauses.
I guess you already put it together. First find the max seq associated with each id, then use that to join back to the main table:
WITH newp AS (
SELECT id, MAX(seq) AS latestseq
FROM people
GROUP BY id
)
SELECT p.*
FROM people p
JOIN newp n ON (n.latestseq = p.seq)
ORDER BY p.id
What you originally had would work, or moving the CTE into the "from" clause. Maybe you want to use a timestamp field rather than a sequence number for the ordering?
Following up from #Glenn's answer, here is an updated query which meets my original goal and is on par with #mustaccio's answer, but I'm still not sure what the performance (and other) implications of this approach vs the other are.
WITH
LATEST_PERSON_SEQS AS
(
SELECT
ID,
MAX(SEQ) AS LATEST_SEQ
FROM
PERSON
GROUP BY
ID
)
,
LATEST_PERSON AS
(
SELECT
P.*
FROM
PERSON P
JOIN
LATEST_PERSON_SEQS L
ON
(
L.LATEST_SEQ = P.SEQ)
)
SELECT
*
FROM
LATEST_PERSON L2
WHERE
L2.ID = 12

select rows satisfying some criteria and with maximum value in a certain column

I have a table of metadata for updates to a software package. The table has columns id, name, version. I want to select all rows where the name is one of some given list of names and the version is maximum of all the rows with that name.
For example, given these records:
+----+------+---------+
| id | name | version |
+----+------+---------+
| 1 | foo | 1 |
| 2 | foo | 2 |
| 3 | bar | 4 |
| 4 | bar | 5 |
+----+------+---------+
And a task "give me the highest versions of records "foo" and "bar", I want the result to be:
+----+------+---------+
| id | name | version |
+----+------+---------+
| 2 | foo | 2 |
| 4 | bar | 5 |
+----+------+---------+
What I come up with so far, is using nested queries:
SELECT *
FROM updates
WHERE (
id IN (SELECT id
FROM updates
WHERE name = 'foo'
ORDER BY version DESC
LIMIT 1)
) OR (
id IN (SELECT id
FROM updates
WHERE name = 'bar'
ORDER BY version DESC
LIMIT 1)
);
This works, but feels wrong. If I want to filter on more names, I have to replicate the whole subquery multiple times. Is there a better way to do this?
select distinct on (name) id, name, version
from metadata
where name in ('foo', 'bar')
order by name, version desc
NOT EXISTS is a way to avoid unwanted sub optimal tuples:
SELECT *
FROM updates uu
WHERE uu.zname IN ('foo', 'bar')
AND NOT EXISTS (
SELECT *
FROM updates nx
WHERE nx.zname = uu.zanme
AND nx.version > uu.version
);
Note: I replaced name by zname, since it is more or less a keyword in postgresql.
Update after rereading the Q:
I want to select all rows where the name is one of some given list
of names and the version is maximum of all the rows with that name.
If there can be ties (multiple rows with the maximum version per name), you could use the window function rank() in a subquery. Requires PostgreSQL 8.4+.
SELECT *
FROM (
SELECT *, rank() OVER (PARTITION BY name ORDER BY version DESC) AS rnk
FROM updates
WHERE name IN ('foo', 'bar')
)
WHERE rnk = 1;