SQL UNNEST - how to use it correctly?

Say I have some data in a table, t.
id, arr
--, ---
1, [1,2,3]
2, [4,5,6]
SELECT AVG(n) FROM UNNEST(
SELECT arr FROM t AS n) AS avg_arr
This returns the error: 'Mismatched input 'SELECT'. Expecting <expression>'.
What is the correct way to unnest an array and aggregate the unnested values?

UNNEST is normally used with a join and will expand the array into a relation (i.e. for every element of the array a row will be introduced). To calculate the average you will need to group the values back:
-- sample data
WITH dataset (id, arr) AS (
VALUES (1, array[1,2,3]),
(2, array[4,5,6])
)
--query
select id, avg(n)
from dataset
cross join unnest (arr) t(n)
group by id
Output:
id | _col1
---+------
 1 |   2.0
 2 |   5.0
But you can also use array functions. Depending on the Presto version, either array_average:
select id, array_average(arr)
from dataset
Or, for older versions, a more cumbersome approach with manual aggregation via reduce:
select id, reduce(arr, 0.0, (s, x) -> s + x, s -> s) / cardinality(arr)
from dataset
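If you want a single average across all arrays rather than one per id (matching the shape of the original query), here is a minimal sketch assuming the table t(id, arr) from the question:
-- unnest every array and average all elements together
SELECT avg(n) AS avg_arr
FROM t
CROSS JOIN UNNEST(arr) AS u(n);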

Related

PartiQL/SQL: JSON-SUPER array query to extract values to table on Redshift

I have a somewhat complicated SUPER array that I brought into Redshift using a REST API. The 'API_table' currently looks like this:
One of the sample columns "values" reads as follows:
values
[{"value":[{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T17:30:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T17:45:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T18:00:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T18:15:00.000-05:00"},,{"value":"6.8","qualifiers":["P"],"dateTime":"2023-01-30T20:00:00.000-05:00"},...
I've queried the "value" data using:
SELECT c.values[0].value[0].value as v
FROM API_table c;
However, this only returns the first value "6.9" in each row and not all the "value" items in the row. The same approach doesn't work for extracting the "dateTime" items as it produced NULL values:
SELECT c.values[0].value[0].dateTime as dt
FROM API_table c;
The above example only resembles one row of the table. My question is-- are there ways to query the data in every row of the table so that all the values ("value" & "dateTime") of every row can be extracted onto a new table?
The desired result is:
v   | dt
----+------------------------------
6.9 | 2023-01-30T17:45:00.000-05:00
6.9 | 2023-01-30T18:00:00.000-05:00
6.9 | 2023-01-30T18:15:00.000-05:00
Many thanks.
I tried the following query but it only returned singular "value" results for each row.
SELECT c.values[0].value[0].value as v
FROM API_table c;
When applied to the "dateTime" items, it yielded NULL values:
SELECT c.values[0].value[0].dateTime as dt
FROM API_table c;
===================================================================
@BillWeiner thanks, I worked through both the CTE and test case examples and got the desired results (especially with the CTE). The only issue that remains is knowing how to select the original table/column that contains the entire super array so that it can be inserted into test1 (or col1 in the CTE case).
There are super arrays in every row of column 'values', so the issue remains how to select the column 'values' and extract each of the multiple value ("6.9") and dateTime objects from each row.
================================================================
I've managed to get the query going when the json strings are explicitly stated in the insert into test1 values query.
Now I'm running this query:
SET enable_case_sensitive_identifier TO true;
create table test1 (jvalues varchar(2048));
insert into test1 select c.values from ph_api c;
create table test2 as select json_parse(jvalues) as svalues from test1;
with recursive numbers(n) as
( select 0 as n
union all
select n + 1
from numbers n
where n.n < 20
),
exp_top as
( select c.svalues[n].value
from test2 c
cross join numbers n
)
,
exp_bot as
( select c.value[n]
from exp_top c
cross join numbers n
where c.value is not null
)
select *, value.value as v, value."dateTime" as dt
from exp_bot
where value is not null;
However, when I try to insert the source table with insert into test1 SELECT c.values from table c; I'm getting an error:
ERROR: column "jvalues" is of type character varying but expression is of type super
Hint: You will need to rewrite or cast the expression.
I would like to be able to SELECT this source data:
sourceinfo | variable | values
{"siteName":"YAN","siteCode":[{"value":"01"}] | {"variableCode":[{"value":"00600","network":"ID"} | [{"value":[{"value":"3.9","qualifiers":["P"],"dateTime":"2023-01-30T17:30:00.000-05:00"},{"value":"4.9","qualifiers":["P"],"dateTime":"2023-01-30T17:45:00.000-05:00"}]
{"siteName":"YAN","siteCode":[{"value":"01"}] | {"variableCode":[{"value":"00600","network":"ID"} | [{"value":[{"value":"5.9","qualifiers":["P"],"dateTime":"2023-01-30T18:00:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T18:15:00.000-05:00"}]
as the jvalues so that it could be unrolled into a desired result of:
v   | dt
----+------------------------------
3.9 | 2023-01-30T17:30:00.000-05:00
4.9 | 2023-01-30T17:45:00.000-05:00
5.9 | 2023-01-30T18:00:00.000-05:00
6.9 | 2023-01-30T18:15:00.000-05:00
================================================================
The following query worked to select the desired json strings:
with exp_top as
( select s.value
from <source_table> c, c.values s
)
select s.value, s."dateTime" from exp_top c, c.value s;
Yes. You need to expand each array element into its own row. A recursive CTE (or something similar) will be needed to expand the arrays into rows. This can be done based on the max array length in the super or with some fixed set of numbers. This set of numbers will need to be cross joined with your table to extract each array element.
I wrote up a similar answer previously - Extract value based on specific key from array of jsons in Amazon Redshift - take a look and see if this gets you unstuck. Let me know if you need help adapting this to your situation.
==============================================================
Based on the comments it looks like a more specific example is needed. This little test case should help you understand what is needed to make this work.
I've repeated your data a few times to create multiple rows and to populate the outer array with 2 inner arrays. This hopefully shows how to unroll multiple nested arrays manually (the compact Redshift unrolling method is below, but it is hard to understand if you don't get the concepts down first).
First set up the test data:
SET enable_case_sensitive_identifier TO true;
create table test1 (jvalues varchar(2048));
insert into test1 values
('[{"value":[{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T17:30:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T17:45:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T18:00:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T18:15:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T18:30:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T18:45:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T19:00:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T19:15:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T19:30:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T19:45:00.000-05:00"},{"value":"6.8","qualifiers":["P"],"dateTime":"2023-01-30T20:00:00.000-05:00"}]}, {"value":[{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T17:30:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T17:45:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T18:00:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T18:15:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T18:30:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T18:45:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T19:00:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T19:15:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T19:30:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T19:45:00.000-05:00"},{"value":"6.8","qualifiers":["P"],"dateTime":"2023-01-30T20:00:00.000-05:00"}]}]'),
('[{"value":[{"value":"5.9","qualifiers":["P"],"dateTime":"2023-01-30T17:30:00.000-05:00"},{"value":"5.9","qualifiers":["P"],"dateTime":"2023-01-30T17:45:00.000-05:00"},{"value":"8.9","qualifiers":["P"],"dateTime":"2023-01-30T18:00:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T18:15:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T18:30:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T18:45:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T19:00:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T19:15:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T19:30:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T19:45:00.000-05:00"},{"value":"6.8","qualifiers":["P"],"dateTime":"2023-01-30T20:00:00.000-05:00"}]}, {"value":[{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T17:30:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T17:45:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T18:00:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T18:15:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T18:30:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T18:45:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T19:00:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T19:15:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T19:30:00.000-05:00"},{"value":"6.9","qualifiers":["P"],"dateTime":"2023-01-30T19:45:00.000-05:00"},{"value":"6.8","qualifiers":["P"],"dateTime":"2023-01-30T20:00:00.000-05:00"}]}]');
create table test2 as select json_parse(jvalues) as svalues from test1;
Note that we have to turn on case sensitivity for the session to be able to select "dateTime" correctly.
Then unroll the arrays manually:
with recursive numbers(n) as
( select 0 as n
union all
select n + 1
from numbers n
where n.n < 20
),
exp_top as
( select row_number() over () as r, n as x, c.svalues[n].value
from test2 c
cross join numbers n
)
,
exp_bot as
( select r, x, n as y, c.value[n]
from exp_top c
cross join numbers n
where c.value is not null
)
select *, value.value as v, value."dateTime" as dt
from exp_bot
where value is not null;
This version:
- creates the numbers 0 - 19,
- expands the outer array (2 elements in each row) by cross joining with these numbers,
- expands the inner array by the same method,
- produces the desired results.
Redshift has a built-in method for doing this unrolling of super arrays, and it is defined in the FROM clause. You can produce the same results with:
with exp_top as (select inx1, s.value from test2 c, c.svalues s at inx1)
select inx1, inx2, c.value[inx2] as value, s.value, s."dateTime" from exp_top c, c.value s at inx2;
Much more compact. This code has been tested and runs as is in Redshift. If you see the "dateTime" value as NULL it is likely that you don't have case sensitivity enabled.
==========================================================
To also have the original super column in the final result:
with exp_top as (select c.svalues, inx1, s.value from test2 c, c.svalues s at inx1)
select svalues, inx1, inx2, c.value[inx2] as value, s.value, s."dateTime" from exp_top c, c.value s at inx2;
==========================================================
I think that unrolling your actual data will be simpler than the code I provided for the general question.
First, you don't need to use the test1 and test2 tables; you can query your table directly. If you still want to use test2, then use your table as the source of the "create table test2 ..." statement. But let's see if we can just use your source table.
with exp_top as (
select s.value from <your table> c, c.values s
)
select s.value, s."dateTime" from exp_top c, c.value s;
This code is untested but should work.

How does BigQuery manage a struct field in a SELECT

The following queries a struct from a public data source:
SELECT year FROM `bigquery-public-data.words.eng_gb_1gram` LIMIT 1000
(Its schema and result set were shown as screenshots in the original post: year is a repeated RECORD, and each of its fields appears as a separate column in the results.)
It seems BigQuery automatically translates a struct to all its (leaf) fields when accessed, is that correct? Or how does BigQuery handle directly calling a struct in a select statement?
Two things are going on. You have an array of structs (aka "records").
Each element of the array appears on a separate line in the result set.
Each field in the struct is a separate column.
So, your results are not for a struct but for an array of structs.
You can see what happens for a single struct using:
select year[safe_ordinal(1)]
from . . .
You will get a single row for each row in the data, with the first element of the year array in the row. It will have separate columns, with the names of year.year, year.term_frequency and so on. If you wanted these as "regular" columns, you can use:
select year[ordinal(1)].*
from . . .
Then the columns are year, term_frequency, and so on.
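If you want to experiment without the public table, here is a minimal self-contained sketch; the data is made up, and only the shape of year as an array of structs mirrors the question:
-- hypothetical data with the same shape: year is an array of structs
with t as (
  select [struct(2000 as year, 0.1 as term_frequency),
          struct(2001 as year, 0.2 as term_frequency)] as year
)
select year[safe_ordinal(1)].*  -- expands the first element into columns year, term_frequency
from t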
As you might know, a RECORD can be NULLABLE, in which case it is a STRUCT, or a RECORD can be REPEATED, in which case it is an array of records.
You can use dot-star notation with a struct to select out all its fields, just as you do with a table's rows via SELECT * FROM tbl or its equivalent SELECT t.* FROM tbl t.
So, for example, the code below
with tbl as (
select struct(1 as a, 2 as b, 3 as c) as col_struct,
[ struct(11 as x, 12 as y, 13 as z),
struct(21, 22, 23),
struct(31, 32, 33)
] as col_array
)
select col_struct.*
from tbl
produces
output as if those were the rows of a "mini" table called col_struct.
The same dot-star notation does not work for arrays; if you want to output the elements of an array separately, you first need to unnest that array, as in the example below:
with tbl as (
select struct(1 as a, 2 as b, 3 as c) as col_struct,
[ struct(11 as x, 12 as y, 13 as z),
struct(21, 22, 23),
struct(31, 32, 33)
] as col_array
)
select rec
from tbl, unnest(col_array) rec
which outputs
And now, because each row is a struct, you can use dot-star notation:
select rec.*
from tbl, unnest(col_array) rec
with output
And finally, you can combine the above as:
select col_struct.*, rec.*
from tbl t, t.col_array rec
with output
Note: from tbl t, t.col_array rec is a shortcut for from tbl, unnest(col_array) rec
One more note: if you reference a field name that is used in multiple places in your schema, the engine picks the outermost matching one. If by chance this match is inside an ARRAY, you first need to unnest that array, and if it is part of a STRUCT, you need to make sure you fully qualify the path.
For example - with above simplified data
select a from tbl                        -- will not work
select col_struct.a from tbl             -- will work
select col_array.x from tbl              -- will not work
select x from tbl, unnest(col_array)     -- will work
There is much more that can be said on the subject depending on your exact use case, but the above covers some hopefully helpful basics.

Example of table function

Is the UNNEST an example of a table-function? It seems to produce a single named column if I'm understanding it correctly. Something like:
vals
-------
[1,2,3]

unnest(vals) as v

v
---
1
2
3
with Table as (
select [1,2,3] vals
) select v from Table, UNNEST(vals) as v
Is this an example of a table-function? If not, what kind of function is it? Are there any other predefined table functions in BQ?
The UNNEST operator takes an ARRAY and returns a table, with one row for each element in the ARRAY. You can also use UNNEST outside of the FROM clause with the IN operator.
So, you may call it a table function if you wish :o)
You can read more about UNNEST in the BigQuery documentation.
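For example, a quick illustration of UNNEST with the IN operator (made-up values):
select 2 in unnest([1, 2, 3]) as found  -- returns true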
"It seems to produce a single named column if I'm understanding it correctly"
Not exactly correct. See the example below:
with Table as (
select [struct(1 as a,2 as b),struct(3, 4), struct(5, 6)] vals
)
select v.* from Table, UNNEST(vals) as v
with output

Postgresql - Map array aggregates into a single array in a particular order

I have a PostgreSQL table containing a column of 1 dimensional array data. I wish to perform an aggregate query on this column, obtaining min/max/mean for each element of the array as well as the group count, returning the result as a 1 dimensional array. The array lengths in the table may vary, but I can be certain that in any grouping I perform, all arrays will be of the same length.
In a simple form, say my arrays are of length 2 and have readings for x and y, I want to return the result as
{Min(x), Max(x), Mean(x), Min(y), Max(y), Mean(y), Count()}
I am able to get a result in the form {Min(x), Min(y), Max(x), Max(y), Mean(x), Mean(y) Count()} but I can't get from there to my desired result.
Here's an example showing where I am so far (this time with arrays of length 3, but without the mean aggregation, as there isn't one for arrays built into PostgreSQL):
(SQLFiddle here)
CREATE TABLE my_test(some_key numeric, event_data bigint[]);
INSERT INTO my_test(some_key, event_data) VALUES
(1, '{11,12,13}'),
(1, '{5,6,7}'),
(1, '{-11,-12,-13}');
SELECT MIN(event_data) || MAX(event_data) || COUNT(event_data) FROM my_test GROUP BY some_key;
The above gives me
{11,12,13,-11,-12,-13,3}
However, I don't know how to transform a result like the above into what I want, which is:
{11,-11,12,-12,13,-13,3}
What function should I use to transform the above?
Note that the aggregation functions above don't exactly match with those I am using to get min, max - I'm using the aggs_for_vecs extension to give me min, max and mean.
I would recommend using array operations and aggregation:
select x.some_key,
array_agg(u.val order by x.n, u.nn)
from (select t.some_key, ed.n, min(val) as minval, max(val) as maxval
from my_test t cross join lateral
unnest(t.event_data) with ordinality as ed(val, n)
group by t.some_key, ed.n
) x cross join lateral
unnest(array[x.minval, x.maxval]) with ordinality u(val, nn)
group by x.some_key;
Personally, I would prefer an array with three elements and the min/max as a record:
select x.some_key, array_agg((x.minval, x.maxval) order by x.n)
from (select t.some_key, ed.n, min(val) as minval, max(val) as maxval
from my_test t cross join lateral
unnest(t.event_data) with ordinality as ed(val, n)
group by t.some_key, ed.n
) x
group by x.some_key;
Here is a db<>fiddle.
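If you also need the mean and the trailing group count from the question, the same pattern extends naturally. This is only a sketch using plain avg() instead of the aggs_for_vecs extension, with the table and column names from the question:
-- per element position produce min, max, mean; append the group count at the end
select x.some_key,
       array_agg(u.val order by x.n, u.nn) || max(x.cnt)::numeric as result
from (select t.some_key, ed.n,
             min(ed.val) as minval,
             max(ed.val) as maxval,
             avg(ed.val) as meanval,
             count(*)    as cnt
      from my_test t
      cross join lateral unnest(t.event_data) with ordinality as ed(val, n)
      group by t.some_key, ed.n
     ) x
cross join lateral unnest(array[x.minval, x.maxval, x.meanval]) with ordinality as u(val, nn)
group by x.some_key;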

Aggregate arrays element-wise in presto/athena

I have a table which has an array column. The size of the array is guaranteed to be same in all rows. Is it possible to do an element-wise aggregation on the arrays to create a new array?
For e.g. if my aggregation is the avg function then:
Array 1: [1,3,4,5]
Array 2: [3,5,6,1]
Output: [2,4,5,3]
I would want to write queries like these:
select
timestamp_column,
avg(array_column) as new_array
from
my_table
group by
timestamp_column
The array contains close to 200 elements, so I would prefer not to hardcode each element in the query :)
This can be done by combining 2 lesser known SQL constructs: UNNEST WITH ORDINALITY, and array_agg with ORDER BY.
The first step is to unpack the arrays into rows using CROSS JOIN UNNEST(a) WITH ORDINALITY. For each element in each array, it will output a row containing the element value and the position of that element in the array.
Then you use a standard GROUP BY on the ordinal and sum the values.
Finally, you reassemble the sums back into an array using array_agg(value_sum ORDER BY ordinal). The critical part of this expression is the ORDER BY clause in the array_agg call. Without it the values would be in an arbitrary order.
Here is a full example:
WITH t(a) AS (VALUES array [1, 3, 4, 5], array [3, 5, 6, 1])
SELECT array_agg(value_sum ORDER BY ordinal)
FROM (
SELECT ordinal, sum(value) AS value_sum
from t
CROSS JOIN UNNEST(t.a) WITH ORDINALITY AS x(value, ordinal)
GROUP BY ordinal);
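The question's example uses avg, so the same pattern with avg and the grouping from the question looks like this (a sketch assuming the my_table, timestamp_column and array_column names given above):
SELECT timestamp_column,
       array_agg(value_avg ORDER BY ordinal) AS new_array
FROM (
    SELECT timestamp_column, ordinal, avg(value) AS value_avg
    FROM my_table
    CROSS JOIN UNNEST(array_column) WITH ORDINALITY AS x(value, ordinal)
    GROUP BY timestamp_column, ordinal
) AS agg
GROUP BY timestamp_column;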