How to shuffle an array in PostgreSQL 9.6 and also lower versions?

The following custom stored function -
CREATE OR REPLACE FUNCTION words_shuffle(in_array varchar[])
RETURNS varchar[] AS
$func$
SELECT array_agg(letters.x) FROM
(SELECT UNNEST(in_array) x ORDER BY RANDOM()) letters;
$func$ LANGUAGE sql STABLE;
was shuffling a character array in PostgreSQL 9.5.3:
words=> select words_shuffle(ARRAY['a','b','c','d','e','f']);
words_shuffle
---------------
{c,d,b,a,e,f}
(1 row)
But now, after switching to PostgreSQL 9.6.2, the function has stopped working:
words=> select words_shuffle(ARRAY['a','b','c','d','e','f']);
words_shuffle
---------------
{a,b,c,d,e,f}
(1 row)
Probably because the ORDER BY RANDOM() stopped working:
words=> select unnest(ARRAY['a','b','c','d','e','f']) order by random();
unnest
--------
a
b
c
d
e
f
(6 rows)
I am looking for a better method to shuffle a character array, one that works in the new PostgreSQL 9.6 but also in 9.5.
I need it for my word game in development, which uses PL/pgSQL functions.
UPDATE:
Reply by Tom Lane:
Expansion of SRFs in the targetlist now happens after ORDER BY.
So the ORDER BY is sorting a single dummy row and then the unnest
happens after that. See
https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=9118d03a8

Generally, a set-returning function should be placed in the FROM clause:
select array_agg(u order by random())
from unnest(array['a','b','c','d','e','f']) u;
array_agg
---------------
{d,f,b,e,c,a}
(1 row)
From the documentation (emphasis added):
Currently, functions returning sets can also be called in the select list of a query. For each row that the query generates by itself, the function returning set is invoked, and an output row is generated for each element of the function's result set. Note, however, that this capability is deprecated and might be removed in future releases.
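Putting that together, the original function can be rewritten with the set-returning function moved into the FROM clause. A sketch with the same signature (marked VOLATILE here rather than STABLE, since it relies on random()); it should behave the same way on 9.5 and 9.6:
CREATE OR REPLACE FUNCTION words_shuffle(in_array varchar[])
RETURNS varchar[] AS
$func$
SELECT array_agg(letters.x ORDER BY random())
FROM unnest(in_array) AS letters(x);
$func$ LANGUAGE sql VOLATILE;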

No doubt, this is a change due to some "improvement" in the optimizer. Given that the documentation sort of says that this works, it is frustrating.
However, I would suggest that you not depend on the subquery:
SELECT array_agg(letters.x ORDER BY random())
FROM UNNEST(in_array) letters(x);
This should also work in older versions of Postgres.
The documentation says:
Alternatively, supplying the input values from a sorted subquery will
usually work. For example:
SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;
But this syntax is not allowed in the SQL standard, and is not
portable to other database systems.
(I freely admit that "will usually work" is not a guarantee. But having a substandard code sample in the documentation is really misleading. Why doesn't it show the correct sample using the ORDER BY clause in the aggregation function?)
https://www.postgresql.org/docs/9.5/static/functions-aggregate.html
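For reference, the documentation's example rewritten with an ORDER BY inside the aggregate call (a sketch, assuming the same test table from the docs) would be:
SELECT xmlagg(x ORDER BY y DESC) FROM test;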

Related

Starting from a column type, how to find supported aggregations in Postgres?

I'm trying to figure out, from a column type, which aggregates the data type supports. There's a lot of variety amongst types; just a sample below (some of these support more aggregates, of course):
uuid    count()
text    count(), min(), max()
integer count(), min(), max(), avg(), sum()
I've been thrashing around in the system catalogs and views, but haven't found what I'm after. (See "thrashing around.") I've poked at pg_type, pg_aggregate, pg_operator, and a few more.
Is there a straightforward way to start from a column type and gather all supported aggregates?
For background, I'm writing a client-side cross-tab code generator, and the UX is better when the tool automatically prevents you from selecting an aggregation that's not supported. I've hacked in some hard-coded rules for now, but would like to improve the system.
We're on Postgres 11.4.
A plain list of available aggregate functions can be based on pg_proc like this:
SELECT oid::regprocedure::text AS agg_func_plus_args
FROM pg_proc
WHERE prokind = 'a'
ORDER BY 1;
Or with separate function name and arguments:
SELECT proname AS agg_func, pg_get_function_identity_arguments(oid) AS args
FROM pg_proc
WHERE prokind = 'a'
ORDER BY 1, 2;
pg_proc.prokind replaces proisagg in Postgres 11. In Postgres 10 or older use:
...
WHERE proisagg
...
Related:
How to drop all of my functions in PostgreSQL?
How to get function parameter lists (so I can drop a function)
To get a list of available functions for every data type (your question), start with:
SELECT type_id::regtype::text, array_agg(proname) AS agg_functions
FROM (
   SELECT proname, unnest(proargtypes::regtype[])::text AS type_id
   FROM   pg_proc
   WHERE  proisagg
   ORDER  BY 2, 1
   ) sub
GROUP BY type_id;
db<>fiddle here
Just a start. Some of the arguments are just "direct" (non-aggregated) arguments. (That's also why some functions are listed multiple times - due to those additional non-aggregate arguments; string_agg is an example.) And there are special cases for "ordered-set" and "hypothetical-set" aggregates. See the columns aggkind and aggnumdirectargs of the additional system catalog pg_aggregate. (You may want to exclude the exotic special cases for starters ...)
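One way to exclude those exotic cases is to join pg_aggregate and keep only "normal" aggregates, for example (a sketch):
SELECT p.oid::regprocedure::text AS agg_func_plus_args
FROM pg_proc p
JOIN pg_aggregate a ON a.aggfnoid = p.oid
WHERE a.aggkind = 'n'   -- excludes 'o' (ordered-set) and 'h' (hypothetical-set) aggregates
ORDER BY 1;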
And many types have an implicit cast to one of the types listed by the query. A prominent example: string_agg() works with varchar, too, but it's only listed for text above. You can extend the query with information from pg_cast to get the full picture.
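A sketch of folding in pg_cast (assumptions: only single-hop implicit casts, castcontext = 'i'), so that a type such as varchar also lists the aggregates declared for text:
SELECT castable.type_id::regtype::text AS input_type,
       array_agg(DISTINCT agg.proname) AS agg_functions
FROM (
   SELECT proname, unnest(proargtypes::regtype[])::oid AS type_id
   FROM   pg_proc
   WHERE  prokind = 'a'              -- "WHERE proisagg" in Postgres 10 and older
   ) agg
JOIN LATERAL (
   SELECT agg.type_id                -- the declared argument type itself
   UNION
   SELECT castsource                 -- plus types with an implicit cast to it
   FROM   pg_cast
   WHERE  casttarget = agg.type_id
   AND    castcontext = 'i'
   ) castable(type_id) ON true
GROUP BY 1
ORDER BY 1;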
Plus, some aggregates work for pseudo types "any", anyarray etc. You'll want to factor those in for every applicable data type.
The complication of multiple aliases for the same data type names can be eliminated easily, though: cast to regtype to get canonical names. Or use pg_typeof() which returns standard names. Related:
Type conversion. What do I do with a PostgreSQL OID value in libpq in C?
PostgreSQL syntax error in parameterized query on "date $1"
How do I translate PostgreSQL OID using python
Man, that is just stunning. Thank you. The heat death of the universe will arrive before I could have figured that out. I had to tweak one line for PG 11 compatibility... says the guy who did not say what version he was on. I've reworked the query to get close to what I'm after and included a bit of output for the archives.
with aggregates as (
   SELECT pro.proname aggregate_name,
          CASE
             WHEN array_agg(typ.typname ORDER BY proarg.position) = '{NULL}'::name[] THEN
                '{}'::name[]
             ELSE
                array_agg(typ.typname ORDER BY proarg.position)
          END aggregate_types
   FROM pg_proc pro
   CROSS JOIN LATERAL unnest(pro.proargtypes) WITH ORDINALITY proarg (oid, position)
   LEFT JOIN pg_type typ
          ON typ.oid = proarg.oid
   WHERE pro.prokind = 'a' -- I needed this for PG 11, I didn't say what version I was using.
   GROUP BY pro.oid,
            pro.proname
   ORDER BY pro.proname),
-- The *super helpful* code above is _way_ past my skill level with Postgres. So, thrashing around a bit to get close to what I'm after.
-- First up, a CTE to sort everything by aggregation and then combine the types.
aggregate_summary as (
   select aggregate_name,
          array_agg(aggregate_types) as types_array
   from aggregates
   group by 1
   order by 1)
-- Finally, the previous CTE is used to get the details and a count of the types.
select aggregate_name,
       cardinality(types_array) as types_count, -- Couldn't get array_length to work here. ¯\_(ツ)_/¯
       types_array
from aggregate_summary
limit 5;
And a bit of output:
aggregate_name  types_count  types_array
array_agg       2            {{anynonarray},{anyarray}}
avg             7            {{int8},{int4},{int2},{numeric},{float4},{float8},{interval}}
bit_and         4            {{int2},{int4},{int8},{bit}}
bit_or          4            {{int2},{int4},{int8},{bit}}
bool_and        1            {{bool}}
Still on my wish list are:
Figuring out how to handle arrays (we aren't using array fields now, and only have a few places that we ever might; at that point, I don't expect we'll try and support pivots on arrays in the cross-tab tool).
Getting all of the aliases for the various types. It seems like (?) int8, etc. can come through from pg_attribute in multiple ways. For example, timestamptz can come back as "timestamp with time zone".
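As a tiny illustration of the regtype canonicalization mentioned in the answer above (a sketch, not part of the original post):
select 'int8'::regtype::text        as canonical_1,  -- bigint
       'timestamptz'::regtype::text as canonical_2;  -- timestamp with time zone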
These results are going to be consumed by client-side code and processed, so I don't need to get Postgres to figure everything out in one query, just enough for me to get the job done.
In any case, thanks very, very much.
There's the pg_proc catalog table, which lists all functions. The column proisagg marks aggregate functions, and the column proargtypes holds an array of the OIDs of the argument types.
So, for example, to get a list of all aggregate functions with the names of their argument types, you could use:
SELECT pro.proname aggregationfunctionname,
       CASE
          WHEN array_agg(typ.typname ORDER BY proarg.position) = '{NULL}'::name[] THEN
             '{}'::name[]
          ELSE
             array_agg(typ.typname ORDER BY proarg.position)
       END aggregationfunctionargumenttypes
FROM pg_proc pro
CROSS JOIN LATERAL unnest(pro.proargtypes) WITH ORDINALITY proarg (oid, position)
LEFT JOIN pg_type typ
       ON typ.oid = proarg.oid
WHERE pro.proisagg
GROUP BY pro.oid,
         pro.proname
ORDER BY pro.proname;
Of course you may need to extend that, e.g. joining and respecting the schemas (pg_namespace) and checking for compatible types in pg_type (have a look at the typcategory column for that), etc.
Edit:
I overlooked that proisagg was removed in version 11 (I'm still mostly on 9.6), as the other answers mentioned. So, for the sake of completeness: as of version 11, replace WHERE pro.proisagg with WHERE pro.prokind = 'a'.
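A sketch of the pg_namespace join mentioned above, schema-qualifying the aggregate names (an illustration, not part of the original answer):
SELECT nsp.nspname AS schema_name,
       pro.proname AS aggregate_name
FROM pg_proc pro
JOIN pg_namespace nsp ON nsp.oid = pro.pronamespace
WHERE pro.prokind = 'a'   -- pro.proisagg before version 11
ORDER BY 1, 2;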
I've been playing around with the suggestions a bit, and want to post one adaptation based on one of Erwin's scripts:
select type_id::regtype::text as type_name,
       array_agg(proname) as aggregate_names
from (
   select proname,
          unnest(proargtypes::regtype[])::text AS type_id
   from pg_proc
   where prokind = 'a'
   order by 2, 1
   ) subquery
where type_id in ('"any"', 'bigint', 'boolean', 'citext', 'date', 'double precision', 'integer', 'interval', 'numeric', 'smallint',
                  'text', 'time with time zone', 'time without time zone', 'timestamp with time zone', 'timestamp without time zone')
group by type_id;
That brings back details on the types specified in the where clause. Not only is this useful for my current work, it's useful to my understanding generally. I've run into cases where I've had to recast something, like an integer to a double, to get it to work with an aggregate. So far, this has been pretty much trial and error. If you run the query above (or one like it), it's easier to see from the output where you need recasting between similar-seeming types.

dynamically select arbitrary n columns with window function

Abstract of my real select statement:
select
lag(somecol) over (partition by thing_id) as prev_1,
lag(somecol,2) over (partition by thing_id) as prev_2,
lag(somecol,3) over (partition by thing_id) as prev_3,
othercol,
...
In the real query the OVER clause is much more complex, which leads to pretty dense, unreadable code. Additionally, getting the last 3 rows is hardcoded (vs. n=whatever).
Is there any way in straight SQL to iteratively or recursively specify these prev_x columns, so that 1) the code is more readable and 2) you could dynamically specify the number n of prev cols?
To answer just the first question, about making the code more readable: Postgres allows you to define a window, name it, and then reference it several times in the query.
See the docs for Window Functions:
When a query involves multiple window functions, it is possible to
write out each one with a separate OVER clause, but this is
duplicative and error-prone if the same windowing behavior is wanted
for several functions. Instead, each windowing behavior can be named
in a WINDOW clause and then referenced in OVER. For example:
SELECT sum(salary) OVER w, avg(salary) OVER w
FROM empsalary
WINDOW w AS (PARTITION BY depname ORDER BY salary DESC);
I don't know if this feature is part of the SQL standard or not, but I know that SQL Server doesn't support it.
So, your query would look like this:
select
lag(somecol) over w as prev_1,
lag(somecol,2) over w as prev_2,
lag(somecol,3) over w as prev_3,
othercol,
...
from
...
WINDOW w AS (partition by thing_id)
;
Regarding your second question, how to "dynamically specify the number n of prev cols": you'll need to generate the text of the SELECT statement dynamically to achieve that. RDBMSs assume a stable schema, i.e. the number of columns in tables and queries is usually fixed, not dynamic.
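For illustration only, a sketch of that dynamic approach in PL/pgSQL (the names some_table, somecol, thing_id and the cursor name are placeholders, not from the question): build the column list with format() and string_agg(), then open a cursor for the generated statement:
CREATE OR REPLACE FUNCTION prev_n_cols(n int)
RETURNS refcursor AS
$func$
DECLARE
   cols text;
   cur  refcursor := 'prev_cur';
BEGIN
   -- one "lag(somecol, i) OVER w AS prev_i" per requested column
   SELECT string_agg(format('lag(somecol, %s) OVER w AS prev_%s', i, i), ', ' ORDER BY i)
   INTO   cols
   FROM   generate_series(1, n) i;

   OPEN cur FOR EXECUTE format(
      'SELECT %s, othercol FROM some_table WINDOW w AS (PARTITION BY thing_id)', cols);
   RETURN cur;
END
$func$ LANGUAGE plpgsql;
-- usage (inside a transaction):
-- SELECT prev_n_cols(3);
-- FETCH ALL IN prev_cur;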

What is the difference between these two similar SQL ARRAY_TO_STRING statements?

I am sniffing out an error in my SQL query that revolves around the ARRAY_AGG function. Below are the following two lines of SQL.
The below line is the correct and complete SQL version that I ultimately want to have. This behavior is the behavior that results in a correct query.
ARRAY_TO_STRING((ARRAY_AGG(R.version ORDER BY R.released_on DESC))[1:10], ', ')
And this line here I thought would be equivalent, but when I execute it I receive ERROR: more than one row returned by a subquery used as an expression:
ARRAY_TO_STRING(ARRAY_AGG((SELECT "releases"."version" FROM "releases" ORDER BY "releases"."released_on" DESC LIMIT 10)), ',')
I am arriving at this line of SQL through Arel, and it is proving to be a little challenging, but is it possible that the two lines could be equivalent? The first line is slicing an array to retrieve 10 items while the second is essentially doing the same thing, albeit in rows. Could this be tweaked, or will it need to be rewritten?
I haven't used Ruby, but it looks like the nesting is different between the two lines. Your ARRAY_AGG is getting two different sets of data.
First:
ARRAY_AGG(R.version ORDER BY R.released_on DESC)
Second:
ARRAY_AGG((SELECT "releases"."version" FROM "releases" ORDER BY "releases"."released_on" DESC LIMIT 10))
Looking there, I think your second statement is wrong; from what I see online it looks like ARRAY_AGG should be used as a column function, not on a table set. Below are some links that I was looking at, but I wasn't sure which is correct as ARRAY_AGG looks to be an SQL function and I'm not sure what flavor you are using...
https://msdn.microsoft.com/en-us/library/azure/mt763803.aspx
https://www.postgresql.org/docs/8.4/static/functions-aggregate.html
With that I think it should look like:
(SELECT ARRAY_AGG(sub."version") FROM (SELECT "version" FROM "releases" ORDER BY "released_on" DESC LIMIT 10) sub)
Just a guess..

Postgresql Writing max() Window function with multiple partition expressions?

I am trying to get the max value of column A ("original_list_price") over windows defined by 2 columns (namely - a unique identifier, called "address_token", and a date field, called "list_date"). I.e. I would like to know the max "original_list_price" of rows with both the same address_token AND list_date.
E.g.:
SELECT
address_token, list_date, original_list_price,
max(original_list_price) OVER (PARTITION BY address_token, list_date) as max_list_price
FROM table1
The query already takes >10 minutes when I use just 1 expression in the PARTITION BY (e.g. using address_token only, nothing after that). Sometimes the query times out. (I use Mode Analytics and get this error: "An I/O error occurred while sending to the backend".) So my questions are:
1) Will the Window function with multiple PARTITION BY expressions work?
2) Any other way to achieve my desired result?
3) Any way to make window functions, especially the partitioning part, run faster? E.g. use certain data types over others, or try to avoid long alphanumeric string identifiers?
Thank you!
The complexity of the window function's partitioning clause should not have a big impact on performance. Do realize that your query is returning all the rows in the table, so there might be a very large result set.
Window functions should be able to take advantage of indexes. For this query:
SELECT address_token, list_date, original_list_price,
max(original_list_price) OVER (PARTITION BY address_token, list_date) as max_list_price
FROM table1;
You want an index on table1(address_token, list_date, original_list_price).
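For completeness, the corresponding DDL would look something like this (a sketch; the index name is made up):
CREATE INDEX table1_token_date_price_idx
ON table1 (address_token, list_date, original_list_price);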
You could try writing the query as:
select t1.*,
       (select max(t2.original_list_price)
        from table1 t2
        where t2.address_token = t1.address_token and t2.list_date = t1.list_date
       ) as max_list_price
from table1 t1;
This should return results more quickly, because it doesn't have to calculate the window function value first (for all rows) before returning values.

Pairwise array sum aggregate function?

I have a table with arrays as one column, and I want to sum the array elements together:
> create table regres(a int[] not null);
> insert into regres values ('{1,2,3}'), ('{9, 12, 13}');
> select * from regres;
a
-----------
{1,2,3}
{9,12,13}
I want the result to be:
{10, 14, 16}
that is: {1 + 9, 2 + 12, 3 + 13}.
Does such a function already exist somewhere? The intagg extension looked like a good candidate, but it does not provide such a function.
The arrays are expected to be between 24 and 31 elements in length, all elements are NOT NULL, and the arrays themselves will also always be NOT NULL. All elements are basic int. There will be more than two rows per aggregate. All arrays will have the same number of elements, in a query. Different queries will have different number of elements.
My implementation target is: PostgreSQL 9.1.13
General solutions for any number of arrays with any number of elements. Individual elements or the whole array can be NULL, too:
Simpler in 9.4+ using WITH ORDINALITY
SELECT ARRAY (
   SELECT sum(elem)
   FROM  tbl t
       , unnest(t.arr) WITH ORDINALITY x(elem, rn)
   GROUP BY rn
   ORDER BY rn
   );
See:
PostgreSQL unnest() with element number
Postgres 9.3+
This makes use of an implicit LATERAL JOIN
SELECT ARRAY (
   SELECT sum(arr[rn])
   FROM  tbl t
       , generate_subscripts(t.arr, 1) AS rn
   GROUP BY rn
   ORDER BY rn
   );
See:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Postgres 9.1
SELECT ARRAY (
   SELECT sum(arr[rn])
   FROM (
      SELECT arr, generate_subscripts(arr, 1) AS rn
      FROM   tbl t
      ) sub
   GROUP BY rn
   ORDER BY rn
   );
The same works in later versions, but set-returning functions in the SELECT list are not standard SQL and were frowned upon by some. Should be OK since Postgres 10, though. See:
What is the expected behaviour for multiple set-returning functions in SELECT clause?
db<>fiddle here
Old sqlfiddle
Related:
Is there something like a zip() function in PostgreSQL that combines two arrays?
If you need better performance and can install Postgres extensions, the agg_for_vecs C extension provides a vec_to_sum function that should meet your need. It also offers various aggregate functions like min, max, avg, and var_samp that operate on arrays instead of scalars.
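Assumed usage only (I have not verified the extension's exact signature; based on the description above it appears to be an aggregate over the array column):
CREATE EXTENSION agg_for_vecs;
SELECT vec_to_sum(a) FROM regres;  -- assumption: aggregate form, table from the question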
I know the original question and answer are pretty old, but for others who find this... The most elegant and flexible solution I've found is to create a custom aggregate function. Erwin's answer presents some great simple solutions if you only need the single resulting array, but doesn't translate to a solution that could include other table columns and aggregations, in a GROUP BY for example.
With a custom array_add function and array_sum aggregate function:
CREATE OR REPLACE FUNCTION array_add(_a numeric[], _b numeric[])
RETURNS numeric[]
AS
$$
BEGIN
   RETURN ARRAY(
      SELECT coalesce(a, 0) + coalesce(b, 0)
      FROM unnest(_a, _b) WITH ORDINALITY AS x(a, b, n)
      ORDER BY n
   );
END
$$ LANGUAGE plpgsql;

CREATE AGGREGATE array_sum(numeric[])
(
   sfunc = array_add,
   stype = numeric[],
   initcond = '{}'
);
Then (using the names from your example):
SELECT array_sum(a) a_sums
FROM regres;
Returns your array of sums, and it can just as well be used anywhere other aggregate functions could be used, so if your table also had a column name you wanted to group by, and another array of numbers, column b:
SELECT name, array_sum(a) a_sums, array_sum(b) b_sums
FROM regres
GROUP BY name;
You won't get quite the performance you'd get out of the built-in sum function and just selecting sum(a[1]), sum(a[2]), sum(a[3]); you'd have to implement the array_add function as a compiled C function to get that. But in cases where you don't have the ability to add custom C functions (like a managed cloud database, e.g. AWS RDS), or you're not aggregating huge numbers of rows, the difference probably won't be noticed.