Parallel unnest() and sort order in PostgreSQL

I understand that with a query like
SELECT unnest(ARRAY[5,3,9]) as id
without an ORDER BY clause, the order of the result set is not guaranteed. For example, I could get:
id
--
3
5
9
But what about the following query:
SELECT
unnest(ARRAY[5,3,9]) as id,
unnest(ARRAY(select generate_series(1, array_length(ARRAY[5,3,9], 1)))) as idx
ORDER BY idx ASC
Is it guaranteed that the 2 unnest() calls (which have the same length) will unroll in parallel and that the index idx will indeed match the position of the item in the array?
I am using PostgreSQL 9.3.3.

Yes, that is a feature of Postgres and parallel unnesting is guaranteed to be in sync (as long as all arrays have the same number of elements).
Postgres 9.4 adds a clean solution for parallel unnest:
Unnest multiple arrays in parallel
The order of the resulting rows is not guaranteed, though. In practice, with a statement as simple as:
SELECT unnest(ARRAY[5,3,9]) AS id;
rows currently come back in array order, but Postgres asserts nothing. The query optimizer is free to order rows as it sees fit as long as the order is not explicitly defined. This may have side effects in more complex queries.
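A quick sketch of the pairing behavior (before Postgres 10, set-returning functions in the SELECT list with unequal row counts are cycled up to the least common multiple of the counts; from Postgres 10 on, the shorter set is padded with NULLs):
SELECT unnest(ARRAY[5,3,9]) AS id
     , unnest(ARRAY[1,2,3]) AS idx;  -- equal lengths: 3 rows, paired in sync

SELECT unnest(ARRAY[5,3,9]) AS id
     , unnest(ARRAY[1,2])   AS idx;  -- unequal lengths: 6 rows on 9.3 (LCM), 3 rows with NULL padding on 10+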
If the second query in your question is what you actually want (add an index number to unnested array elements), there is a better way with generate_subscripts():
SELECT unnest(ARRAY[5,3,9]) AS id
, generate_subscripts(ARRAY[5,3,9], 1) AS idx
ORDER BY idx;
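With the example array this returns:
id | idx
---+-----
 5 |   1
 3 |   2
 9 |   3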
Details in this related answer:
How to access array internal index with postgreSQL?
You will be interested in WITH ORDINALITY in Postgres 9.4:
PostgreSQL unnest() with element number
Then you can use:
SELECT * FROM unnest(ARRAY[5,3,9]) WITH ORDINALITY tbl(id, idx);

Short answer: no, idx will not match the array positions, if we accept the premise that unnest() output may be randomly ordered.
Demo:
Since the current implementation of unnest actually outputs the rows in element order, I suggest adding a layer on top of it to simulate a random order:
CREATE FUNCTION unnest_random(anyarray) RETURNS setof anyelement
language sql as
$$ select unnest($1) order by random() $$;
Then check out a few executions of your query with unnest replaced by unnest_random:
SELECT
unnest_random(ARRAY[5,3,9]) as id,
unnest_random(ARRAY(select generate_series(1, array_length(ARRAY[5,3,9], 1)))) as idx
ORDER BY idx ASC
Example of output:
id | idx
----+-----
3 | 1
9 | 2
5 | 3
id=3 is associated with idx=1, but 3 was in the 2nd position in the array. It's all wrong.
What's wrong with the query: it assumes that the first unnest will shuffle the elements using the same permutation as the second unnest (permutation in the mathematical sense: the mapping between order in the array and order of the rows). But this assumption contradicts the premise that the output order of unnest is unpredictable to begin with.
About this question:
Is it guaranteed that the 2 unnest() calls (which have the same
length) will unroll in parallel
In select unnest(...) X1, unnest(...) X2, with X1 and X2 being of type SETOF something and having the same number of rows, X1 and X2 will be paired in the final output so that the X1 value at row N sits next to the X2 value at the same row N.
(It's a kind of UNION for columns, as opposed to a Cartesian product.)
But I wouldn't describe this pairing as "unrolling in parallel", so I'm not sure this is what you meant.
Anyway this pairing doesn't help with the problem since it happens after the unnest calls have lost the array positions.
An alternative: In this thread from the pgsql-sql mailing list, this function is suggested:
CREATE OR REPLACE FUNCTION unnest_with_ordinality(anyarray,
    OUT value anyelement, OUT ordinality integer)
RETURNS SETOF record AS
$$
  SELECT $1[i], i
  FROM generate_series(array_lower($1,1), array_upper($1,1)) i;
$$
LANGUAGE sql IMMUTABLE;
Based on this, we can order by the second output column:
select * from unnest_with_ordinality(array[5,3,9]) order by 2;
value | ordinality
-------+------------
5 | 1
3 | 2
9 | 3
With Postgres 9.4 and above: the WITH ORDINALITY clause that can follow set-returning function calls provides this functionality in a generic way.
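For example, the custom function above becomes redundant:
select * from unnest(array[5,3,9]) with ordinality as t(value, ordinality) order by 2;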

Related

Remove duplicate entries from string array column of postgres

I have a PostgreSQL table with a column that holds an array of strings. Some rows have only unique strings in the array, while others contain duplicates. I want to remove the duplicate strings from each row where they exist.
I have tried some queries but couldn't make it happen.
Following is the table:
veh_id | vehicle_types
--------+----------------------------------------
1 | {"byd_tang","volt","viper","laferrari"}
2 | {"volt","viper"}
3 | {"byd_tang","sonata","jaguarxf"}
4 | {"swift","teslax","mirai"}
5 | {"volt","viper"}
6 | {"viper","ferrariff","bmwi8","viper"}
7 | {"ferrariff","viper","viper","volt"}
I am expecting following output:
veh_id | vehicle_types
--------+----------------------------------------
1 | {"byd_tang","volt","viper","laferrari"}
2 | {"volt","viper"}
3 | {"byd_tang","sonata","jaguarxf"}
4 | {"swift","teslax","mirai"}
5 | {"volt","viper"}
6 | {"viper","ferrariff","bmwi8"}
7 | {"ferrariff","viper","volt"}
Since each row's array is independent, a plain correlated subquery with an ARRAY constructor would do the job:
SELECT *, ARRAY(SELECT DISTINCT unnest (vehicle_types)) AS vehicle_types_uni
FROM vehicle;
See:
Why is array_agg() slower than the non-aggregate ARRAY() constructor?
Note that NULL is converted to an empty array ('{}'). We'd need to special-case it, but it is excluded in the UPDATE below anyway.
Fast and simple. But don't use this. You didn't say so, but typically you'd want to preserve original order of array elements. Your rudimentary sample suggests as much. Use WITH ORDINALITY in the correlated subquery, which becomes a bit more sophisticated:
SELECT *, ARRAY (SELECT v
FROM unnest(vehicle_types) WITH ORDINALITY t(v,ord)
GROUP BY 1
ORDER BY min(ord)
) AS vehicle_types_uni
FROM vehicle;
See:
PostgreSQL unnest() with element number
UPDATE to actually remove dupes:
UPDATE vehicle
SET vehicle_types = ARRAY (
SELECT v
FROM unnest(vehicle_types) WITH ORDINALITY t(v,ord)
GROUP BY 1
ORDER BY min(ord)
)
WHERE cardinality(vehicle_types) > 1 -- optional
AND vehicle_types <> ARRAY (
SELECT v
FROM unnest(vehicle_types) WITH ORDINALITY t(v,ord)
GROUP BY 1
ORDER BY min(ord)
); -- suppress empty updates (optional)
Both added WHERE conditions are optional performance optimizations. The 1st one is logically redundant (the 2nd covers it) but cheap to check. Each condition also excludes the NULL case. The 2nd one suppresses all empty updates.
See:
How do I (or can I) SELECT DISTINCT on multiple columns?
If you tried to do that without preserving original order, you'd likely update most rows without need, just because the order of elements changed even without dupes.
Requires Postgres 9.4 or later.
db<>fiddle here
I don't claim it's efficient, but something like this might work:
with expanded as (
select veh_id, unnest (vehicle_types) as vehicle_type
from vehicles
)
select veh_id, array_agg (distinct vehicle_type)
from expanded
group by veh_id
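To actually remove the dupes, this could be wrapped into an UPDATE - a sketch along the same lines (like the query above, it does not preserve the original element order):
update vehicles v
set vehicle_types = d.vehicle_types_uni
from (
  select veh_id, array_agg(distinct vehicle_type) as vehicle_types_uni
  from (select veh_id, unnest(vehicle_types) as vehicle_type from vehicles) e
  group by veh_id
) d
where d.veh_id = v.veh_id;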
If you really want to get fancy, you can write a custom function (note: the ANY lookup inside the loop makes this O(n²) in the worst case, not O(n)):
create or replace function unique_array(input_array text[])
returns text[] as $$
DECLARE
    output_array text[];
    i integer;
BEGIN
    output_array := array[]::text[];
    for i in 1..cardinality(input_array) loop
        -- append the element only if we haven't already kept it
        if not (input_array[i] = any (output_array)) then
            output_array := output_array || input_array[i];
        end if;
    end loop;
    return output_array;
END;
$$
language plpgsql;
Usage example:
select veh_id, unique_array(vehicle_types)
from vehicles

Postgres union of queries in loop

I have a table with two columns. Let's call them
array_column and text_column
I'm trying to write a query to find out, for K ranging from 1 to 10, in how many rows the value in text_column appears within the first K elements of array_column.
I'm expecting results like:
k | count
________________
1 | 70
2 | 85
3 | 90
...
I did manage to get these results by simply repeating the query 10 times and uniting the results, which looks like this:
SELECT 1 AS k, count(*) FROM table WHERE array_column[1:1] #> ARRAY[text_column]
UNION ALL
SELECT 2 AS k, count(*) FROM table WHERE array_column[1:2] #> ARRAY[text_column]
UNION ALL
SELECT 3 AS k, count(*) FROM table WHERE array_column[1:3] #> ARRAY[text_column]
...
But that doesn't look like the correct way to do it. What if I wanted a very large range for K?
So my question is, is it possible to perform queries in a loop, and unite the results from each query? Or, if this is not the correct approach to the problem, how would you do it?
Thanks in advance!
You could use array_positions() which returns an array of all positions where the argument was found in the array, e.g.
select t.*,
array_positions(array_column, text_column)
from the_table t;
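For example, array_positions(array['a','b','a','c'], 'a') returns {1,3}.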
This returns a different result but is a lot more efficient as you don't need to increase the overall size of the result. To only consider the first ten array elements, just pass a slice to the function:
select t.*,
array_positions(array_column[1:10], text_column)
from the_table t;
To limit the result to only rows that actually contain the value you can use:
select t.*,
array_positions(array_column[1:10], text_column)
from the_table t
where text_column = any(array_column[1:10]);
To get your desired result, you could use unnest() to turn that into rows:
select k, count(*)
from the_table t, unnest(array_positions(array_column[1:10], text_column)) as k
where text_column = any(array_column[1:10])
group by k
order by k;
You can use the generate_series function to generate a table with the expected number of rows with the expected values and then join to it within the query, like so:
SELECT t.k AS k, count(*)
FROM table
--right join ensures that you will get a value of 0 if there are no records meeting the criteria
right join (select generate_series(1,10) as k) t
on array_column[1:t.k] #> ARRAY[text_column]
group by t.k
This is probably the closest thing to using a loop to go through the results without using something like PL/pgSQL to do an actual loop in a user-defined function.

How to get index of an array value in PostgreSQL?

I have a table called pins like this:
id (int) | pin_codes (jsonb)
--------------------------------
1 | [4000, 5000, 6000]
2 | [8500, 8400, 8600]
3 | [2700, 2300, 2980]
Now, I want the row with pin_code 8600 and with its array index. The output must be like this:
pin_codes | index
------------------------------
[8500, 8400, 8600] | 2
If I want the row with pin_code 2700, the output :
pin_codes | index
------------------------------
[2700, 2300, 2980] | 0
What I've tried so far:
SELECT pin_codes FROM pins WHERE pin_codes #> '[8600]'
It only returns the row with wanted value. I don't know how to get the index on the value in the pin_codes array!
Any help would be greatly appreciated.
P.S:
I'm using PostgreSQL 10
If you were storing the array as a real array rather than as JSON, you could use array_position() to find the (first) index of a given element:
select array_position(array['one', 'two', 'three'], 'two')
returns 2
With some text mangling you can cast the JSON array into a text array:
select array_position(translate(pin_codes::text,'[]','{}')::text[], '8600')
from the_table;
That also allows you to use the ANY operator:
select *
from pins
where '8600' = any(translate(pin_codes::text,'[]','{}')::text[])
The contains operator #> expects arrays on both sides. You could use it to search for two pin codes at a time:
select *
from pins
where translate(pin_codes::text,'[]','{}')::text[] #> array['8600','8400']
Or use the overlaps operator && to find rows with any of multiple elements:
select *
from pins
where translate(pin_codes::text,'[]','{}')::text[] && array['8600','2700']
would return
id | pin_codes
---+-------------------
2 | [8500, 8400, 8600]
3 | [2700, 2300, 2980]
If you do that a lot, it would be more efficient to store the pin_codes as text[] rather than JSON - then you can also index that column to do searches more efficiently.
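A sketch of that alternative (hypothetical column and index names):
-- hypothetical text[] column alongside the jsonb one
alter table pins add column pin_codes_arr text[];
update pins set pin_codes_arr = translate(pin_codes::text,'[]','{}')::text[];
create index pins_pin_codes_arr_idx on pins using gin (pin_codes_arr);

-- the GIN index then supports e.g. the containment operator:
select * from pins where pin_codes_arr @> array['8600'];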
Use the function jsonb_array_elements_text() together with WITH ORDINALITY.
with my_table(id, pin_codes) as (
values
(1, '[4000, 5000, 6000]'::jsonb),
(2, '[8500, 8400, 8600]'),
(3, '[2700, 2300, 2980]')
)
select id, pin_codes, ordinality - 1 as index
from my_table, jsonb_array_elements_text(pin_codes) with ordinality
where value::int = 8600;
id | pin_codes | index
----+--------------------+-------
2 | [8500, 8400, 8600] | 2
(1 row)
As has been pointed out previously, the array_position function is only available in Postgres 9.5 and greater.
Here is a custom function that achieves the same, derived from nathansgreen at GitHub.
-- The array_position function was added in Postgres 9.5.
-- For older versions, you can get the same behavior with this function.
create function array_position(arr ANYARRAY, elem ANYELEMENT, pos INTEGER default 1) returns INTEGER
language sql
as $BODY$
select row_number::INTEGER
from (
select unnest, row_number() over ()
from ( select unnest(arr) ) t0
) t1
where row_number >= greatest(1, pos)
and (case when elem is null then unnest is null else unnest = elem end)
limit 1;
$BODY$;
So in this specific case, after creating the function the following worked for me.
SELECT
pin_codes,
array_position(pin_codes, 8600) AS index
FROM pins
WHERE array_position(pin_codes, 8600) IS NOT NULL;
Worth bearing in mind that it will only return the index of the first occurrence of 8600; you can use the pos argument to find whichever occurrence you like.
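For example, to find an occurrence after the first one, start the search just past it:
SELECT array_position(ARRAY[8600, 5000, 8600], 8600, 2);  -- returns 3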
In short, normalize your data structure, or don't do this in SQL. If you want this index of the sub-data element given your current data structure, then do this in your application code (take result, cast to list/array, get index).
Try to unnest the string and assign numbers as follows (caveat: row_number() without an ORDER BY does not guarantee any particular order within the partition, as discussed above):
with dat as
(
select 1 id, '8700, 5600, 2300' pins
union all
select 2 id, '2300, 1700, 1000' pins
)
select dat.*, t.rn as index
from
(
select id, t.pins, row_number() over (partition by id) rn
from
(
select id, trim(unnest(string_to_array(pins, ','))) pins from dat
) t
) t
join dat on dat.id = t.id and t.pins = '2300'
If you insist on storing arrays, I'd defer to klin's answer.
As the alternative answer and an extension to my comment... don't store relational data in arrays. 'Normalize' your data in advance and SQL will handle it significantly better. klin's answer is good, but may suffer in performance as it's outside of what SQL does best.
I'd break the array apart before storing it. If the number of pin codes is known, then simply having the table pin_id, pin1, pin2, pin3, ... is functional.
If the number of pins is unknown, a first table pin that stores the pin_id and any info columns related to that pin ID, plus a second table with pin_id, pin_seq, pin_value, is also functional (though you may need to pivot this later on to make sense of the data). In this case, select pin_seq where pin_value = 260 would work, as sketched below.
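A sketch of that second layout (hypothetical names):
create table pin (
  pin_id int primary key
  -- ... any info columns related to that pin ID
);

create table pin_code (
  pin_id    int references pin,
  pin_seq   int,  -- position within the original array
  pin_value int,
  primary key (pin_id, pin_seq)
);

-- finding the position of a value is then a plain lookup:
select pin_id, pin_seq from pin_code where pin_value = 8600;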

Browse subcolumns, but discard some

I have a table (or view) in my PostgreSQL database and want to do the following:
Query the table and feed a function in my application successive n-tuples of rows from the query, but only those that satisfy some condition. I can do the n-tuple listing using a cursor, but I don't know how to do the condition checking at the database level.
For example, the query returns:
3
2
4
2
0
1
4
6
2
And I want triples of even numbers. Here, they would be:
(2,4,2) (4,2,0) (4,6,2)
Obviously, I cannot simply discard the odd numbers from the query result. Instead of using a cursor, a query returning arrays in a similar manner would also be an acceptable solution, but I don't have a good idea of how to use them to do this.
Of course, I could check it at the application level, but I think it'd be cleaner to do it at the database level. Is that possible?
With the window function lead() (as mentioned by #wildplasser):
SELECT *
FROM (
SELECT tbl_id, i AS i1
, lead(i) OVER (ORDER BY tbl_id) AS i2
, lead(i, 2) OVER (ORDER BY tbl_id) AS i3
FROM tbl
) sub
WHERE i1%2 = 0
AND i2%2 = 0
AND i3%2 = 0;
There is no natural order of rows - assuming you want to order by tbl_id in the example.
% .. modulo operator
SQL Fiddle.
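With the sample data from the question (assuming tbl_id runs 1..9 over the values 3,2,4,2,0,1,4,6,2) this should return:
 tbl_id | i1 | i2 | i3
--------+----+----+----
      2 |  2 |  4 |  2
      3 |  4 |  2 |  0
      7 |  4 |  6 |  2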
You can also use an array aggregate for this instead of lead():
SELECT
a[1] a1, a[2] a2, a[3] a3
FROM (
SELECT
array_agg(i) OVER (ORDER BY tbl_id ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
FROM
tbl
) x(a)
WHERE a[1] % 2 = 0 AND a[2] % 2 = 0 AND a[3] % 2 = 0;
No idea if this'll be better, worse, or the same as Erwin's answer, just putting it in for completeness.

Implementing a total order ranking in PostgreSQL 8.3

The issue with 8.3 is that rank() was only introduced in 8.4.
Consider the numbers [10,6,6,2].
I wish to achieve a rank of those numbers where the rank is equal to the row number:
rank | score
-----+------
1 | 10
2 | 6
3 | 6
4 | 2
A partial solution is to self-join and count items with a higher or equal score. This produces:
1 | 10
3 | 6
3 | 6
4 | 2
But that's not what I want.
Is there a way to rank, or even just order by score somehow and then extract that row number?
If you want a row number equivalent to the window function row_number(), you can improvise in version 8.3 (or any version) with a (temporary) SEQUENCE:
CREATE TEMP SEQUENCE foo;
SELECT nextval('foo') AS rn, *
FROM (SELECT score FROM tbl ORDER BY score DESC) s;
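With the numbers from the question this yields:
 rn | score
----+-------
  1 |    10
  2 |     6
  3 |     6
  4 |     2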
db<>fiddle here
Old sqlfiddle
The subquery is necessary to order rows before calling nextval().
Note that the sequence (like any temporary object) ...
is only visible in the same session it was created in.
hides any other table object of the same name.
is dropped automatically at the end of the session.
To reuse the sequence in the same session, run before each query:
SELECT setval('foo', 1, FALSE);
There's a method using an array that works with PG 8.3. It's probably not very efficient, performance-wise, but will do OK if there aren't a lot of values.
The idea is to sort the values in a temporary array, then extract the bounds of the array, then join that with generate_series to extract the values one by one, the index into the array being the row number.
Sample query assuming the table is scores(value int):
SELECT i AS row_number,arr[i] AS score
FROM (SELECT arr,generate_series(1,nb) AS i
FROM (SELECT arr,array_upper(arr,1) AS nb
FROM (SELECT array(SELECT value FROM scores ORDER BY value DESC) AS arr
) AS s2
) AS s1
) AS s0
Do you have a PK for this table?
Just self-join and count items with a higher or equal score, breaking ties by PK.
The PK comparison will break ties and give you the desired result.
And after you upgrade to 9.1 - use row_number().
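A sketch of that approach (assuming a table scores(id int primary key, value int); written here with a correlated subquery rather than an explicit join):
SELECT (SELECT count(*)
        FROM scores s2
        WHERE s2.value > s.value
           OR (s2.value = s.value AND s2.id <= s.id)) AS rank,
       s.value AS score
FROM scores s
ORDER BY rank;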