Joining arrays within a GROUP BY clause - SQL

We have a problem aggregating arrays into a single array.
We want to join the values from two columns into one array, and then aggregate those arrays across multiple rows.
Given the following input:
| id | name | col_1 | col_2 |
|----|------|-------|-------|
| 1 | a | 1 | 2 |
| 2 | a | 3 | 4 |
| 4 | b | 7 | 8 |
| 3 | b | 5 | 6 |
We want the following output:
| a | { 1, 2, 3, 4 } |
| b | { 5, 6, 7, 8 } |
The order of the elements is important and should correlate with the id of the aggregated rows.
We tried the array_agg() function:
SELECT array_agg(ARRAY[col_1, col_2]) FROM mytable GROUP BY name;
Unfortunately, this statement raises an error:
ERROR: could not find array type for data type character varying[]
It seems to be impossible to merge arrays in a group by clause using array_agg().
Any ideas?
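For anyone who wants to reproduce this, a minimal sketch of the table (assuming, based on the error message, that col_1 and col_2 are character varying):

CREATE TABLE mytable (
    id    int,
    name  text,
    col_1 character varying,
    col_2 character varying
);

INSERT INTO mytable VALUES
    (1, 'a', '1', '2'),
    (2, 'a', '3', '4'),
    (4, 'b', '7', '8'),
    (3, 'b', '5', '6');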

UNION ALL
You could "unpivot" with UNION ALL first:
SELECT name, array_agg(c) AS c_arr
FROM (
    SELECT name, id, 1 AS rnk, col1 AS c FROM tbl
    UNION ALL
    SELECT name, id, 2, col2 FROM tbl
    ORDER BY name, id, rnk
) sub
GROUP BY 1;
Adapted to produce the order of values you later requested. The manual:
The aggregate functions array_agg, json_agg, string_agg, and xmlagg,
as well as similar user-defined aggregate functions, produce
meaningfully different result values depending on the order of the
input values. This ordering is unspecified by default, but can be
controlled by writing an ORDER BY clause within the aggregate call, as
shown in Section 4.2.7. Alternatively, supplying the input values from
a sorted subquery will usually work.
LATERAL subquery with VALUES expression
LATERAL requires Postgres 9.3 or later.
SELECT t.name, array_agg(c) AS c_arr
FROM (SELECT * FROM tbl ORDER BY name, id) t
CROSS JOIN LATERAL (VALUES (t.col1), (t.col2)) v(c)
GROUP BY 1;
Same result. Only needs a single pass over the table.
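To see what the LATERAL VALUES expression contributes, here is its expansion in isolation for the first sample row (col1 = 1, col2 = 2); a standalone sketch:

SELECT * FROM (VALUES (1), (2)) v(c);

 c
---
 1
 2

CROSS JOIN LATERAL attaches these two rows to each source row, effectively unpivoting the two columns into one.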
Custom aggregate function
Or you could create a custom aggregate function, as discussed in these related answers:
Selecting data into a Postgres array
Is there something like a zip() function in PostgreSQL that combines two arrays?
CREATE AGGREGATE array_agg_mult (anyarray) (
SFUNC = array_cat
, STYPE = anyarray
, INITCOND = '{}'
);
Then you can:
SELECT name, array_agg_mult(ARRAY[col1, col2] ORDER BY id) AS c_arr
FROM tbl
GROUP BY 1
ORDER BY 1;
Or, typically faster, while not standard SQL:
SELECT name, array_agg_mult(ARRAY[col1, col2]) AS c_arr
FROM (SELECT * FROM tbl ORDER BY name, id) t
GROUP BY 1;
The added ORDER BY id (which can be appended to any such aggregate function) guarantees your desired result:
a | {1,2,3,4}
b | {5,6,7,8}
Or you might be interested in this alternative:
SELECT name, array_agg_mult(ARRAY[ARRAY[col1, col2]] ORDER BY id) AS c_arr
FROM tbl
GROUP BY 1
ORDER BY 1;
Which produces 2-dimensional arrays:
a | {{1,2},{3,4}}
b | {{5,6},{7,8}}
The last one can be replaced (and should be, as it's faster!) with the built-in array_agg() in Postgres 9.5 or later - with its added capability of aggregating arrays:
SELECT name, array_agg(ARRAY[col1, col2] ORDER BY id) AS c_arr
FROM tbl
GROUP BY 1
ORDER BY 1;
Same result. The manual:
input arrays concatenated into array of one higher dimension (inputs
must all have same dimensionality, and cannot be empty or null)
So it's not exactly the same as our custom aggregate function array_agg_mult().
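To make the difference concrete, a minimal sketch with array literals (array_agg here is the 9.5+ built-in):

SELECT array_agg(a)      FROM (VALUES ('{1,2}'::int[]), ('{3,4}'::int[])) t(a);  -- {{1,2},{3,4}}
SELECT array_agg_mult(a) FROM (VALUES ('{1,2}'::int[]), ('{3,4}'::int[])) t(a);  -- {1,2,3,4}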

select n, array_agg(c) as c
from (
    select n, unnest(array[c1, c2]) as c
    from t
) s
group by n
Or simpler:
select
    n,
    array_agg(c1) || array_agg(c2) as c
from t
group by n
To address the new ordering requirement:
select n, array_agg(c order by id, o) as c
from (
    select
        id, n,
        unnest(array[c1, c2]) as c,
        unnest(array[1, 2]) as o
    from t
) s
group by n
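With the sample data from the question (name mapped to n, col_1/col_2 to c1/c2), this returns:

 n |     c
---+-----------
 a | {1,2,3,4}
 b | {5,6,7,8}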

Related

Creating a category tree table from an array of categories in PostgreSQL

How to generate ids and parent_ids from the arrays of categories. The number or depth of subcategories can be anything between 1 and 10 levels.
Example PostgreSQL column (datatype: character varying[]):
data_column
----------------------------------
[root_1, child_1, childchild_1]
[root_1, child_1, childchild_2]
[root_2, child_2]
I would like to convert the column of arrays into the table shown below, which I assume is called the Adjacency List Model. (I know there are also the Nested Set Model and the Materialised Path model.)
Final output table
id | title | parent_id
------------------------------
1 | root_1 | null
2 | root_2 | null
3 | child_1 | 1
4 | child_2 | 2
5 | childchild_1 | 3
6 | childchild_2 | 3
Final output tree hierarchy
root_1
--child_1
----childchild_1
----childchild_2
root_2
--child_2
step-by-step demo: db<>fiddle
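For reference, a minimal setup the queries below assume (a table t with a single character varying[] column named data):

CREATE TABLE t (data character varying[]);

INSERT INTO t VALUES
    ('{root_1, child_1, childchild_1}'),
    ('{root_1, child_1, childchild_2}'),
    ('{root_2, child_2}');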
You can do this with a recursive CTE:
WITH RECURSIVE cte AS
( SELECT data[1] as title, 2 as idx, null as parent, data FROM t -- 1
UNION
SELECT data[idx], idx + 1, title, data -- 2
FROM cte
WHERE idx <= cardinality(data)
)
SELECT DISTINCT -- 3
title,
parent
FROM cte
1. The starting query of the recursion: get all root elements plus the data you'll need within the recursion.
2. The recursive part: get the element at the current index and increase the index.
3. After the recursion: query the columns you finally need. The DISTINCT removes duplicates (e.g. root_1, which occurs twice in the sample data).
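With the sample data, the distinct (title, parent) pairs produced by the recursion are (row order is not guaranteed):

title        | parent
-------------+---------
root_1       | null
root_2       | null
child_1      | root_1
child_2      | root_2
childchild_1 | child_1
childchild_2 | child_1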
This creates the hierarchy. Next you need the ids.
You can generate them in many different ways, for example using the row_number() window function:
WITH RECURSIVE cte AS (...)
SELECT
*,
row_number() OVER ()
FROM (
SELECT DISTINCT
title,
parent
FROM cte
) s
Now every row has its own id. The order criterion could be tweaked, but without further information there is little to base a different ordering on; the algorithm stays the same either way.
With the ids in place, we can use a self join to look up each row's parent_id via the parent title column. Because a self join repeats the select query, it makes sense to encapsulate it in a second CTE to avoid code duplication. The final query is:
WITH RECURSIVE cte AS
( SELECT data[1] as title, 2 as idx, null as parent, data FROM t
UNION
SELECT data[idx], idx + 1, title, data
FROM cte
WHERE idx <= cardinality(data)
), numbered AS (
SELECT
*,
row_number() OVER ()
FROM (
SELECT DISTINCT
title,
parent
FROM cte
) s
)
SELECT
n1.row_number as id,
n1.title,
n2.row_number as parent_id
FROM numbered n1
LEFT JOIN numbered n2 ON n1.parent = n2.title

Insert into table such that it does not create consecutive duplicates - H2

I need to create an insert query for an H2 database. The insert should be ignored whenever it would create a consecutive duplicate in the table. For example,
First name | Last name | Date
-----------+-----------+-----------
A | Z | 2018-12-02
B | Y | 2018-12-03
A | X | 2018-12-04
If I insert the row | A | W | 2018-12-01 | into the table above (kept sorted in ascending order by date), the insert has to check for a consecutive duplicate in the First name column. Since the new row would create one, as shown below, the insert should be ignored:
First name | Last name | Date
-----------+-----------+-----------
A | W | 2018-12-01
A | Z | 2018-12-02
B | Y | 2018-12-03
A | X | 2018-12-04
In H2 you can use the following SQL:
INSERT INTO tableName
SELECT * FROM (VALUES ('A', 'W', DATE '2018-12-01')) T(F, L, D)
WHERE NOT EXISTS (
    SELECT * FROM tableName
    QUALIFY "First name" = F
        AND DENSE_RANK() OVER (ORDER BY "Date")
            - DENSE_RANK(D) WITHIN GROUP (ORDER BY "Date") OVER () IN (-1, 0)
);
Here the window version of the hypothetical-set DENSE_RANK function (not to be confused with the plain window DENSE_RANK function, which is a different thing) is used to determine the insert position of the new row:
DENSE_RANK(D) WITHIN GROUP (ORDER BY "Date") OVER()
This aggregate function is part of the SQL Standard, but the Standard does not allow it to be used as a window function; H2 is less restrictive.
Then the plain DENSE_RANK window function is used to number the existing rows in the table. The difference between the number of an existing row and the number of the hypothetical row is -1 for the row just before the insert position and 0 for the row just after it.
We only need to check rows with the same "First name" value, so the whole filter criterion is
"First name" = F
AND DENSE_RANK() OVER (ORDER BY "Date")
- DENSE_RANK(D) WITHIN GROUP (ORDER BY "Date") OVER() IN (-1, 0)
The SQL Standard provides no way to filter rows after window functions are evaluated, but H2 has the non-standard QUALIFY clause (borrowed from Teradata) for exactly that purpose; in other DBMSs a subquery is needed.
The final condition to decide whether row may be inserted is
WHERE NOT EXISTS (
    SELECT * FROM tableName
    QUALIFY "First name" = F
        AND DENSE_RANK() OVER (ORDER BY "Date")
            - DENSE_RANK(D) WITHIN GROUP (ORDER BY "Date") OVER () IN (-1, 0)
);
If there are no such rows, inserting the new row will not create two consecutive rows with the same "First name" value.
This condition can be used in a plain, standard INSERT ... SELECT command.
This solution is not expected to be fast if the table has many rows. A more efficient solution can use subqueries that look up only the previous and the next row, along the lines of NOT EXISTS (SELECT * FROM ((SELECT * FROM tableName WHERE "Date" > D ORDER BY "Date" FETCH FIRST ROW ONLY) UNION (SELECT * FROM tableName WHERE "Date" < D ORDER BY "Date" DESC FETCH FIRST ROW ONLY)) WHERE "First name" = F). However, H2 doesn't allow references to outer tables from deeply nested queries, so D here needs to be replaced with a JDBC parameter (… VALUES (?1, ?2, ?3) … WHERE "Date" < ?3 …). You can try to build such a command yourself.
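A sketch of that idea (untested; ?1, ?2 and ?3 are the JDBC parameters for first name, last name and date, reused as described above):

INSERT INTO tableName
SELECT * FROM (VALUES (?1, ?2, ?3)) T(F, L, D)
WHERE NOT EXISTS (
    SELECT *
    FROM (
        (SELECT "First name" FROM tableName WHERE "Date" > ?3
         ORDER BY "Date" FETCH FIRST ROW ONLY)
        UNION ALL
        (SELECT "First name" FROM tableName WHERE "Date" < ?3
         ORDER BY "Date" DESC FETCH FIRST ROW ONLY)
    ) neighbours
    WHERE "First name" = ?1
);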

Splitting string and sorting in PostgreSQL

I have a table in postgresql with a text column that has values like this:
column
-----------
CA;TB;BA;CB
XA;VA
GA;BA;LA
I want to sort the elements that are in each value, so that the query results like this:
column
-----------
BA;CA;CB;TB
VA;XA
BA;GA;LA
I have tried with string_to_array, regexp_split_to_array, array_agg, but I don't seem to get close to it.
Thanks.
I hope this is easy to understand:
WITH tab AS (
    SELECT *
    FROM unnest(ARRAY[
        'CA;TB;BA;CB',
        'XA;VA',
        'GA;BA;LA']) AS txt
)
SELECT string_agg(val, ';')
FROM (
    SELECT
        txt,
        regexp_split_to_table(txt, ';') AS val
    FROM tab
    ORDER BY val
) AS sub
GROUP BY txt;
First I split the values into rows (regexp_split_to_table) and sort them. Then I group by the original values and concatenate again with string_agg.
Output:
BA;CA;CB;TB
BA;GA;LA
VA;XA
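Relying on the ORDER BY of the subquery is the "will usually work" case from the manual quote in the first answer above. To make the order explicit, specify it inside the aggregate call instead:

SELECT string_agg(val, ';' ORDER BY val)
FROM (
    SELECT
        txt,
        regexp_split_to_table(txt, ';') AS val
    FROM tab
) AS sub
GROUP BY txt;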
I'm probably overcomplicating it:
t=# with a(c)as (values('CA;TB;BA;CB')
,('XA;VA')
,('GA;BA;LA'))
, w as (
    select
        string_agg(v, ';') over (partition by c order by v),
        row_number() over (partition by c),
        count(1) over (partition by c)
    from a, unnest(string_to_array(a.c, ';')) v
)
select * from w where row_number = count;
string_agg | row_number | count
-------------+------------+-------
BA;CA;CB;TB | 4 | 4
BA;GA;LA | 3 | 3
VA;XA | 2 | 2
(3 rows)
and here with a little ugly hack:
with a(c)as (values
('CA;TB;BA;CB')
,('XA;VA')
,('GA;BA;LA'))
select translate(array_agg(v order by v)::text, ',{}', ';')
from a, unnest(string_to_array(a.c, ';')) v
group by c;
translate
-------------
BA;CA;CB;TB
BA;GA;LA
VA;XA
(3 rows)

How to get an array in Postgres where the array size is greater than 1

I have a table that looks like this:
val | fkey | num
------------------
1 | 1 | 1
1 | 2 | 1
1 | 3 | 1
2 | 3 | 1
What I would like to do is return a set of rows in which values are grouped by 'val', with an array of fkeys, but only where the array of fkeys has more than one element. So, in the above example, the return would look something like:
1 | [1,2,3]
I have the following query, which aggregates the arrays:
SELECT val, array_agg(fkey)
FROM mytable
GROUP BY val;
But this returns something like:
1 | [1,2,3]
2 | [3]
What would be the best way of doing this? I guess one possibility would be to use my existing query as a subquery, and do a sum / count on that, but that seems inefficient. Any feedback would really help!
Use a HAVING clause to filter out the groups having more than one fkey:
SELECT val, array_agg(fkey)
FROM mytable
GROUP BY val
HAVING count(fkey) > 1;
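With the sample data this returns only the requested row:

 val | array_agg
-----+-----------
   1 | {1,2,3}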
Using the HAVING clause as #Fireblade pointed out is probably more efficient, but you can also leverage subqueries:
SQLFiddle: Subquery
SELECT * FROM (
select val, array_agg(fkey) fkeys
from mytable
group by val
) array_creation
WHERE array_length(fkeys,1) > 1
You could also use the array_length function in the HAVING clause, but again, #Fireblade has used count(), which should be more efficient. Still:
SQLFiddle: Having Clause
SELECT val, array_agg(fkey) fkeys
FROM mytable
GROUP BY val
HAVING array_length(array_agg(fkey),1) > 1
This isn't a total loss, though: using array_length in the HAVING clause can be useful if you want a distinct list of fkeys:
SELECT val, array_agg(DISTINCT fkey) fkeys
There may still be other ways, but this method is more descriptive, which may make your SQL easier to understand when you come back to it years from now.
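A complete version of that variant (counting distinct values in the HAVING clause to match the DISTINCT aggregate):

SELECT val, array_agg(DISTINCT fkey) AS fkeys
FROM mytable
GROUP BY val
HAVING count(DISTINCT fkey) > 1;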

SQL: Limit by unknown number of occurrences

Given a SQL table consisting of the columns id and type, I want to select only the first occurrences of a type, without using WHERE (since I don't know which type will occur first) and without LIMIT (since I don't know how many rows it will be).
id | type
---------
1 | 1
2 | 1
3 | 2
4 | 2
5 | 2
E.g.:
SELECT id FROM table ORDER BY type (+ ?) should only return id 1 and 2
SELECT id FROM table ORDER BY type DESC (+ ?) should only return id 3, 4 and 5
Can this be achieved via standard and simple SQL operators?
That's easy. You must use a WHERE clause and evaluate the minimum type there.
SELECT *
FROM mytable
WHERE type = (select min(type) from mytable)
ORDER BY id;
EDIT: Do the same with max() if you want to get the maximum type records.
EDIT: In case the types are not ascending as in your example, you will have to get the type of the minimum/maximum id instead of getting the minimum/maximum type:
SELECT *
FROM mytable
WHERE type = (select type from mytable where id = (select min(id) from mytable))
ORDER BY id;
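For completeness, a window-function alternative (standard SQL, though arguably not "simple operators"): it compares each row's type with the type at the minimum id, so it also covers the non-ascending case:

SELECT id
FROM (
    SELECT
        id,
        type,
        first_value(type) OVER (ORDER BY id) AS first_type
    FROM mytable
) t
WHERE type = first_type
ORDER BY id;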