PostgreSQL - Map array aggregates into a single array in a particular order - SQL

I have a PostgreSQL table containing a column of 1 dimensional array data. I wish to perform an aggregate query on this column, obtaining min/max/mean for each element of the array as well as the group count, returning the result as a 1 dimensional array. The array lengths in the table may vary, but I can be certain that in any grouping I perform, all arrays will be of the same length.
In a simple form, say my arrays are of length 2 and have readings for x and y, I want to return the result as
{Min(x), Max(x), Mean(x), Min(y), Max(y), Mean(y), Count()}
I am able to get a result in the form {Min(x), Min(y), Max(x), Max(y), Mean(x), Mean(y), Count()} but I can't get from there to my desired result.
Here's an example showing where I am so far (this time with arrays of length 3, but without the mean aggregation, as there is no built-in mean aggregate for arrays in PostgreSQL):
(SQLFiddle here)
CREATE TABLE my_test(some_key numeric, event_data bigint[]);
INSERT INTO my_test(some_key, event_data) VALUES
  (1, '{11,12,13}'),
  (1, '{5,6,7}'),
  (1, '{-11,-12,-13}');
SELECT MIN(event_data) || MAX(event_data) || COUNT(event_data) FROM my_test GROUP BY some_key;
The above gives me
{11,12,13,-11,-12,-13,3}
However, I don't know how to transform a result like the above into what I want, which is:
{11,-11,12,-12,13,-13,3}
What function should I use to transform the above?
Note that the aggregation functions above don't exactly match those I am using to get min and max - I'm using the aggs_for_vecs extension to give me min, max and mean.
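(For reference, that pre-interleave step might look roughly like this - a sketch only, assuming aggs_for_vecs provides vec_to_min/vec_to_max/vec_to_mean element-wise aggregates, with numeric[] casts assumed so the concatenations typecheck:)
-- Sketch, assuming aggs_for_vecs is installed; the casts are assumptions.
SELECT vec_to_min(event_data)::numeric[]
    || vec_to_max(event_data)::numeric[]
    || vec_to_mean(event_data)::numeric[]
    || COUNT(*)::numeric
FROM my_test
GROUP BY some_key;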

I would recommend using array operations and aggregation:
select x.some_key,
       array_agg(u.val order by x.n, u.nn)
from (select t.some_key, ed.n, min(val) as minval, max(val) as maxval
      from my_test t cross join lateral
           unnest(t.event_data) with ordinality as ed(val, n)
      group by t.some_key, ed.n
     ) x cross join lateral
     unnest(array[x.minval, x.maxval]) with ordinality u(val, nn)
group by x.some_key;
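With the sample data above, this first query should produce (min before max for each position; count and mean would still need to be appended):
 some_key |       array_agg
----------+------------------------
        1 | {-11,11,-12,12,-13,13}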
Personally, I would prefer an array with three elements and the min/max as a record:
select x.some_key, array_agg((x.minval, x.maxval) order by x.n)
from (select t.some_key, ed.n, min(val) as minval, max(val) as maxval
      from my_test t cross join lateral
           unnest(t.event_data) with ordinality as ed(val, n)
      group by t.some_key, ed.n
     ) x
group by x.some_key;
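With the same data, the record form should look like:
 some_key |             array_agg
----------+------------------------------------
        1 | {"(-11,11)","(-12,12)","(-13,13)"}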
Here is a db<>fiddle.

Related

SQL Unnest - how to use it correctly?

Say I have some data in a table, t.
id, arr
--, ---
1, [1,2,3]
2, [4,5,6]
SELECT AVG(n) FROM UNNEST(
SELECT arr FROM t AS n) AS avg_arr
This returns the error: Mismatched input 'SELECT'. Expecting <expression>.
What is the correct way to unnest an array and aggregate the unnested values?
unnest is normally used with a join and will expand the array into a relation (i.e. for every element of the array, a row will be introduced). To calculate the average you will need to group the values back:
-- sample data
WITH dataset (id, arr) AS (
    VALUES (1, array[1,2,3]),
           (2, array[4,5,6])
)
-- query
select id, avg(n)
from dataset
cross join unnest(arr) t(n)
group by id
Output:
 id | _col1
----+-------
  1 |   2.0
  2 |   5.0
But you can also use array functions. Depending on the Presto version, either array_average:
select id, array_average(arr)
from dataset
Or, for older versions, a more cumbersome approach with manual aggregation via reduce:
select id, reduce(arr, 0.0, (s, x) -> s + x, s -> s) / cardinality(arr)
from dataset

Wrong order of elements after GROUP BY using ST_MakeLine

I have a table (well, it's a CTE) containing paths as arrays of node IDs, and a table of nodes with their geometries. I am trying to SELECT paths with their start and end nodes, and geometries, like this:
SELECT *
FROM (
    SELECT t.path_id, t.segment_num, t.start_node, t.end_node, ST_MakeLine(n.geom) AS geom
    FROM (SELECT path_id, segment_num, nodes[1] AS start_node, nodes[array_upper(nodes,1)] AS end_node, unnest(nodes) AS node_id
          FROM paths
         ) t
    JOIN nodes n ON n.id = t.node_id
    GROUP BY path_id, segment_num, start_node, end_node
) rs
This seems to be working just fine when I try it on individual path samples, but when I run it on a large dataset, a small number of the resulting geometries are bad - clearly ST_MakeLine received the points in the wrong order. I suspect parallel aggregation is causing the wrong order, but maybe I am missing something else here?
How can I ensure correct order of points into ST_MakeLine?
If I am correct about the parallel aggregation, the Postgres docs say that "Scans of common table expressions (CTEs)" are always parallel restricted, but does that mean I have to put the unnested array in a CTE and mark it AS MATERIALIZED so it does not get optimized back into the query?
Thanks for reminding me of the ST_MakeLine(geom ORDER BY something) possibility - ST_MakeLine is an aggregate function, after all. I don't have any explicit ordering column available (the order is the position in the nodes array, but one node can be present multiple times). Fortunately, unnest can be used in the FROM clause WITH ORDINALITY and therefore create an ordering column for me. Working solution:
SELECT *
FROM (
    SELECT t.path_id, t.segment_num, t.start_node, t.end_node, ST_MakeLine(n.geom ORDER BY node_order) AS geom
    FROM (SELECT path_id, segment_num, nodes[1] AS start_node, nodes[array_upper(nodes,1)] AS end_node, a.elem AS node_id, a.nr AS node_order
          FROM paths, unnest(nodes) WITH ORDINALITY a(elem, nr)
         ) t
    JOIN nodes n ON n.id = t.node_id
    GROUP BY path_id, segment_num, start_node, end_node
) rs
In order for ST_MakeLine to create a LineString in the right order, you must explicitly state it with an ORDER BY. The following examples show how the order of points makes a huge difference in the output:
Without ordering
WITH j (id, geom) AS (
    VALUES
        (3, 'SRID=4326;POINT(1 2)'::geometry),
        (1, 'SRID=4326;POINT(3 4)'::geometry),
        (0, 'SRID=4326;POINT(1 9)'::geometry),
        (2, 'SRID=4326;POINT(8 3)'::geometry)
)
SELECT ST_MakeLine(geom) FROM j;
Ordering by id:
WITH j (id, geom) AS (
    VALUES
        (3, 'SRID=4326;POINT(1 2)'::geometry),
        (1, 'SRID=4326;POINT(3 4)'::geometry),
        (0, 'SRID=4326;POINT(1 9)'::geometry),
        (2, 'SRID=4326;POINT(8 3)'::geometry)
)
SELECT ST_MakeLine(geom ORDER BY id) FROM j;
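Assuming the VALUES rows arrive in the order listed (which is not guaranteed without ORDER BY - that is the point), ST_AsText of the two results would differ like this:
-- Without ORDER BY: points in input order
LINESTRING(1 2,3 4,1 9,8 3)
-- With ORDER BY id: points ordered by id 0,1,2,3
LINESTRING(1 9,3 4,8 3,1 2)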
Demo: db<>fiddle

Aggregate arrays element-wise in presto/athena

I have a table which has an array column. The size of the array is guaranteed to be the same in all rows. Is it possible to do an element-wise aggregation on the arrays to create a new array?
For example, if my aggregation is the avg function:
Array 1: [1,3,4,5]
Array 2: [3,5,6,1]
Output: [2,4,5,3]
I would want to write queries like these:
select
timestamp_column,
avg(array_column) as new_array
from
my_table
group by
timestamp_column
The array contains close to 200 elements, so I would prefer not to hardcode each element in the query :)
This can be done by combining two lesser-known SQL constructs: UNNEST WITH ORDINALITY, and array_agg with ORDER BY.
The first step is to unpack the arrays into rows using CROSS JOIN UNNEST(a) WITH ORDINALITY. For each element in each array, it will output a row containing the element value and the position of that element in the array.
Then you use a standard GROUP BY on the ordinal and sum the values.
Finally, you reassemble the sums back into an array using array_agg(value_sum ORDER BY ordinal). The critical part of this expression is the ORDER BY clause in the array_agg call. Without it the values would be in an arbitrary order.
Here is a full example:
WITH t(a) AS (VALUES array[1, 3, 4, 5], array[3, 5, 6, 1])
SELECT array_agg(value_sum ORDER BY ordinal)
FROM (
    SELECT ordinal, sum(value) AS value_sum
    FROM t
    CROSS JOIN UNNEST(t.a) WITH ORDINALITY AS x(value, ordinal)
    GROUP BY ordinal
);
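For the two sample arrays, the element-wise sums should come out as:
 _col0
---------------
 [4, 8, 10, 6]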

Average interval between timestamps in an array

In a PostgreSQL 9.x database, I have a column which is an array of type timestamp. Each array has between 1..n timestamps.
I'm trying to extract the average interval between all elements in each array.
I understand using a window function on the source table might be the ideal way to tackle this but in this case I am trying to do it as an operation on the array.
I've looked at several other questions that are trying to calculate the moving average of another column etc or the avg (median date of a list of timestamps).
For example the average interval I'm looking for on an array with 3 elements like this:
'{"2012-10-09 17:04:05.710887"
,"2013-10-18 22:30:08.973749"
,"2014-10-22 22:18:18.885973"}'::timestamp[]
Would be:
-368d
Wondering if I need to unpack the array through a function?
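A minimal setup to test against - the tbl/arr names match the queries below, and the id column is added just for illustration:
CREATE TABLE tbl (id int, arr timestamp[]);
INSERT INTO tbl VALUES
  (1, '{"2012-10-09 17:04:05.710887","2013-10-18 22:30:08.973749","2014-10-22 22:18:18.885973"}');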
One way of many possible: unnest, join, avg in a lateral subquery:
SELECT *
FROM tbl t
LEFT JOIN LATERAL (
   SELECT avg(a2.ts - a1.ts) AS avg_intv
   FROM unnest(t.arr) WITH ORDINALITY a1(ts, ord)
   JOIN unnest(t.arr) WITH ORDINALITY a2(ts, ord) ON a2.ord = a1.ord + 1
   ) avg ON true;
db<>fiddle here
The [INNER] JOIN in the subquery produces exactly the set of combinations relevant for intervals between elements: each element paired with its direct successor. Note that the average of consecutive intervals telescopes to (last - first) / (n - 1).
I get 371 days 14:37:06.587543, not '-368d', btw.
Related, with more explanation:
PostgreSQL unnest() with element number
You can also unnest only once and use the window functions lead() or lag() (sketched below), but you were trying to avoid window functions. And you need to make sure of the original order of elements in any case ...
(There is no array function you could use directly to get what you need - in case you were hoping for that.)
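For completeness, a sketch of that window-function variant - unnest once, take lag() over the ordinality, then average the differences (the first row's NULL interval is ignored by avg()):
SELECT *
FROM tbl t
LEFT JOIN LATERAL (
   SELECT avg(intv) AS avg_intv
   FROM (
      SELECT ts - lag(ts) OVER (ORDER BY ord) AS intv
      FROM unnest(t.arr) WITH ORDINALITY a(ts, ord)
      ) sub
   ) win ON true;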
Alternative with CTE
Might be appealing to still unnest only once (even while avoiding window functions):
SELECT *
FROM tbl t
LEFT JOIN LATERAL (
   WITH a AS (SELECT * FROM unnest(t.arr) WITH ORDINALITY a1(ts, ord))
   SELECT avg(a2.ts - a1.ts) AS avg_intv
   FROM a a1
   JOIN a a2 ON a2.ord = a1.ord + 1
   ) avg ON true;
But I expect the added CTE overhead to cost more than unnesting twice. Mostly just demonstrating a WITH clause in a subquery.

Pairwise array sum aggregate function?

I have a table with arrays as one column, and I want to sum the array elements together:
> create table regres(a int[] not null);
> insert into regres values ('{1,2,3}'), ('{9, 12, 13}');
> select * from regres;
a
-----------
{1,2,3}
{9,12,13}
I want the result to be:
{10, 14, 16}
that is: {1 + 9, 2 + 12, 3 + 13}.
Does such a function already exist somewhere? The intagg extension looked like a good candidate, but such a function does not already exist.
The arrays are expected to be between 24 and 31 elements in length, all elements are NOT NULL, and the arrays themselves will also always be NOT NULL. All elements are basic int. There will be more than two rows per aggregate. All arrays will have the same number of elements, in a query. Different queries will have different number of elements.
My implementation target is: PostgreSQL 9.1.13
General solutions for any number of arrays with any number of elements. Individual elements or the whole array can be NULL, too:
Simpler in 9.4+ using WITH ORDINALITY
SELECT ARRAY (
   SELECT sum(elem)
   FROM tbl t
      , unnest(t.arr) WITH ORDINALITY x(elem, rn)
   GROUP BY rn
   ORDER BY rn
   );
See:
PostgreSQL unnest() with element number
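Adapted to the question's table regres and column a, this should return {10,14,16} for the sample data:
SELECT ARRAY (
   SELECT sum(elem)
   FROM regres r, unnest(r.a) WITH ORDINALITY x(elem, rn)
   GROUP BY rn
   ORDER BY rn
   );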
Postgres 9.3+
This makes use of an implicit LATERAL JOIN
SELECT ARRAY (
   SELECT sum(arr[rn])
   FROM tbl t
      , generate_subscripts(t.arr, 1) AS rn
   GROUP BY rn
   ORDER BY rn
   );
See:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Postgres 9.1
SELECT ARRAY (
   SELECT sum(arr[rn])
   FROM (
      SELECT arr, generate_subscripts(arr, 1) AS rn
      FROM tbl t
      ) sub
   GROUP BY rn
   ORDER BY rn
   );
The same works in later versions, but set-returning functions in the SELECT list are not standard SQL and were frowned upon by some. Should be OK since Postgres 10, though. See:
What is the expected behaviour for multiple set-returning functions in SELECT clause?
db<>fiddle here
Old sqlfiddle
Related:
Is there something like a zip() function in PostgreSQL that combines two arrays?
If you need better performance and can install Postgres extensions, the aggs_for_vecs C extension provides a vec_to_sum function that should meet your need. It also offers various aggregate functions like min, max, avg, and var_samp that operate on arrays instead of scalars.
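Usage would presumably look like this (a sketch, assuming the extension is installed and vec_to_sum aggregates int[] element-wise):
-- Sketch: vec_to_sum from aggs_for_vecs as a drop-in aggregate
CREATE EXTENSION aggs_for_vecs;
SELECT vec_to_sum(a) FROM regres;  -- expected: {10,14,16}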
I know the original question and answer are pretty old, but for others who find this... The most elegant and flexible solution I've found is to create a custom aggregate function. Erwin's answer presents some great simple solutions if you only need the single resulting array, but doesn't translate to a solution that could include other table columns and aggregations, in a GROUP BY for example.
With a custom array_add function and array_sum aggregate function:
CREATE OR REPLACE FUNCTION array_add(_a numeric[], _b numeric[])
  RETURNS numeric[]
AS $$
BEGIN
   RETURN ARRAY(
      SELECT coalesce(a, 0) + coalesce(b, 0)
      FROM unnest(_a, _b) WITH ORDINALITY AS x(a, b, n)
      ORDER BY n
   );
END
$$ LANGUAGE plpgsql;

CREATE AGGREGATE array_sum(numeric[])
(
   sfunc = array_add,
   stype = numeric[],
   initcond = '{}'
);
Then (using the names from your example):
SELECT array_sum(a) a_sums
FROM regres;
Returns your array of sums, and it can just as well be used anywhere other aggregate functions could be used, so if your table also had a column name you wanted to group by, and another array of numbers, column b:
SELECT name, array_sum(a) a_sums, array_sum(b) b_sums
FROM regres
GROUP BY name;
You won't get quite the performance of the built-in sum function and just selecting sum(a[1]), sum(a[2]), sum(a[3]); you'd have to implement the array_add function as a compiled C function to get that. But in cases where you can't add custom C functions (like a managed cloud database, e.g. AWS RDS), or you're not aggregating huge numbers of rows, the difference probably won't be noticed.
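For comparison, the hardcoded built-in-sum version mentioned above - fast, but tied to a fixed array length:
-- Per-element built-in sums; must be kept in sync with the array length:
SELECT array[sum(a[1]), sum(a[2]), sum(a[3])] AS a_sums
FROM regres;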