How to concatenate arrays grouped by another column in Presto?

Is this possible in SQL (preferably Presto):
I want to reshape this table:
id, array
1, ['something']
1, ['something else']
2, ['something']
To this table:
id, array
1, ['something', 'something else']
2, ['something']

In Presto you can use array_agg. Because array collides with Presto's ARRAY[...] constructor syntax, the examples below rename the column to arr; note also that Presto array indexing is 1-based. Assuming that on input all your arrays are single-element, it would look like this:
select id, array_agg(arr[1])
from ...
group by id;
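For instance, a self-contained sketch with the question's rows (again using arr for the column name):
WITH t(id, arr) AS (
    VALUES (1, array['something']),
           (1, array['something else']),
           (2, array['something'])
)
select id, array_agg(arr[1]) as arr
from t
group by id;
-- expected: (1, ['something', 'something else']), (2, ['something'])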
If, however, your input arrays are not necessarily single-element, you can combine this with flatten, like this:
select id, flatten(array_agg(arr))
from ...
group by id;
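And a runnable sketch of the flatten variant, with made-up multi-element arrays:
WITH t(id, arr) AS (
    VALUES (1, array['a', 'b']),
           (1, array['c']),
           (2, array['d'])
)
select id, flatten(array_agg(arr)) as arr
from t
group by id;
-- expected: (1, ['a', 'b', 'c']), (2, ['d'])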

If you want an array that contains only the distinct items in the aggregated array, then this should work:
select id, array_distinct(flatten(array_agg(arr))) as arr
from ...
group by id;
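For example, with overlapping elements across rows:
WITH t(id, arr) AS (
    VALUES (1, array['a', 'b']),
           (1, array['b', 'c'])
)
select id, array_distinct(flatten(array_agg(arr))) as arr
from t
group by id;
-- expected: (1, ['a', 'b', 'c'])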

Related

SQL Unnest- how to use correctly?

Say I have some data in a table, t.
id, arr
--, ---
1, [1,2,3]
2, [4,5,6]
SELECT AVG(n) FROM UNNEST(
SELECT arr FROM t AS n) AS avg_arr
This returns the error: 'Mismatched input 'SELECT'. Expecting <expression>'.
What is the correct way to unnest an array and aggregate the unnested values?
unnest is normally used with a join and expands the array into a relation (i.e. a row is produced for every element of the array). To calculate the average you will need to group the values back:
-- sample data
WITH dataset (id, arr) AS (
VALUES (1, array[1,2,3]),
(2, array[4,5,6])
)
--query
select id, avg(n)
from dataset
cross join unnest (arr) t(n)
group by id
Output:
id, _col1
--, -----
1, 2.0
2, 5.0
But you can also use array functions. Depending on the Presto version, either array_average:
select id, array_average(arr)
from dataset
Or, for older versions, a more cumbersome approach with manual aggregation via reduce:
select id, reduce(arr, 0.0, (s, x) -> s + x, s -> s) / cardinality(arr)
from dataset
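Against the same sample data, the reduce version should return the same averages:
WITH dataset (id, arr) AS (
    VALUES (1, array[1,2,3]),
           (2, array[4,5,6])
)
select id, reduce(arr, 0.0, (s, x) -> s + x, s -> s) / cardinality(arr) as avg_arr
from dataset;
-- expected: (1, 2.0), (2, 5.0)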

sql: JSON_QUERY() function to extract objects

I have a field in my dataset that include json objects in the following format:
cars
[{"element":{"name":"honda","id":"34"}}]
[{"element":{"name":"Lexus","id":"56"}}]
I am using the following query to extract the names of the cars, but just returns empty (null) rows. Any ideas what I am doing wrong?
select JSON_QUERY(cars,"$.name") AS car_names
from myTable
limit 100
Consider the approach below:
select *,
( select string_agg(json_extract_scalar(car, '$.element.name'))
from unnest(json_extract_array(cars)) car
) car_names
from `project.dataset.table`
If applied to the sample data in your question, as in the example below:
with `project.dataset.table` as (
select '[{"element":{"name":"honda","id":"34"}}]' cars union all
select '[{"element":{"name":"Lexus","id":"56"}}]'
)
select *,
( select string_agg(json_extract_scalar(car, '$.element.name'))
from unnest(json_extract_array(cars)) car
) car_names
from `project.dataset.table`
the output is:
cars                                     | car_names
[{"element":{"name":"honda","id":"34"}}] | honda
[{"element":{"name":"Lexus","id":"56"}}] | Lexus
If you are trying to extract a scalar value, you should simply use JSON_VALUE(expression, path).
For example: suppose an object Info contains a scalar field name and a nested object address. You can get the value of name with JSON_VALUE, since it isn't an object, BUT to get address you have to use JSON_QUERY.
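A minimal BigQuery sketch of the difference, using a made-up Info value:
with sample as (
  select '{"name": "Alice", "address": {"city": "Oslo"}}' as info
)
select
  json_value(info, '$.name') as name,        -- scalar, JSON_VALUE works
  json_query(info, '$.address') as address   -- object, JSON_QUERY is needed
from sample
-- name => Alice, address => {"city":"Oslo"}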

How does BigQuery manage a struct field in a SELECT

The following queries a struct from a public data source:
SELECT year FROM `bigquery-public-data.words.eng_gb_1gram` LIMIT 1000
Its schema shows year as a REPEATED RECORD with fields such as year and term_frequency, and the result set shows each of those fields as a separate column.
It seems BigQuery automatically translates a struct to all its (leaf) fields when accessed, is that correct? Or how does BigQuery handle directly calling a struct in a select statement?
Two things are going on. You have an array of structs (aka "records").
Each element of the array appears on a separate line in the result set.
Each field in the struct is a separate column.
So, your results are not for a struct but for an array of structs.
You can see what happens for a single struct using:
select year[safe_ordinal(1)]
from . . .
You will get a single row for each row in the data, with the first element of the year array in the row. It will have separate columns, with the names of year.year, year.term_frequency and so on. If you wanted these as "regular" columns, you can use:
select year[ordinal(1)].*
from . . .
Then the columns are year, term_frequency, and so on.
As you might know, a RECORD can be NULLABLE, in which case it is a STRUCT, or it can be REPEATED, in which case it is an array of records.
You can use dot-star notation with a struct to select out all its fields, just as you do with a table's rows in SELECT * FROM tbl or its equivalent SELECT t.* FROM tbl t.
So, for example, the code below
with tbl as (
select struct(1 as a, 2 as b, 3 as c) as col_struct,
[ struct(11 as x, 12 as y, 13 as z),
struct(21, 22, 23),
struct(31, 32, 33)
] as col_array
)
select col_struct.*
from tbl
produces:
a, b, c
1, 2, 3
as if those were the rows of a "mini" table called col_struct.
The same dot-star notation does not work for arrays. If you want to output the elements of an array separately, you first need to unnest that array, as in the example below:
with tbl as (
select struct(1 as a, 2 as b, 3 as c) as col_struct,
[ struct(11 as x, 12 as y, 13 as z),
struct(21, 22, 23),
struct(31, 32, 33)
] as col_array
)
select rec
from tbl, unnest(col_array) rec
which outputs:
rec.x, rec.y, rec.z
11, 12, 13
21, 22, 23
31, 32, 33
And now, because each row is a struct, you can use dot-star notation:
select rec.*
from tbl, unnest(col_array) rec
with output:
x, y, z
11, 12, 13
21, 22, 23
31, 32, 33
And, finally, you can combine the above as:
select col_struct.*, rec.*
from tbl t, t.col_array rec
with output:
a, b, c, x, y, z
1, 2, 3, 11, 12, 13
1, 2, 3, 21, 22, 23
1, 2, 3, 31, 32, 33
Note: from tbl t, t.col_array rec is a shortcut for from tbl, unnest(col_array) rec
One more note: if you reference a field name that is used in multiple places in your schema, the engine picks the outermost matching one. If that match happens to be inside an ARRAY, you first need to unnest the array, and if it is part of a STRUCT, you need to fully qualify the path.
For example - with above simplified data
select a from tbl                     -- will not work
select col_struct.a from tbl          -- will work
select col_array.x from tbl           -- will not work
select x from tbl, unnest(col_array)  -- will work
There is much more that can be said on the subject depending on your exact use case, but the above covers some hopefully helpful basics.

Aggregate arrays element-wise in presto/athena

I have a table which has an array column. The size of the array is guaranteed to be same in all rows. Is it possible to do an element-wise aggregation on the arrays to create a new array?
For example, if my aggregation is the avg function:
Array 1: [1,3,4,5]
Array 2: [3,5,6,1]
Output: [2,4,5,3]
I would want to write queries like these:
select
timestamp_column,
avg(array_column) as new_array
from
my_table
group by
timestamp_column
The array contains close to 200 elements, so I would prefer not to hardcode each element in the query :)
This can be done by combining two lesser-known SQL constructs: UNNEST WITH ORDINALITY, and array_agg with ORDER BY.
The first step is to unpack the arrays into rows using CROSS JOIN UNNEST(a) WITH ORDINALITY. For each element in each array, it will output a row containing the element value and the position of that element in the array.
Then you use a standard GROUP BY on the ordinal, and sum the values.
Finally, you reassemble the sums back into an array using array_agg(value_sum ORDER BY ordinal). The critical part of this expression is the ORDER BY clause in the array_agg call. Without it, the values would be in an arbitrary order.
Here is a full example:
WITH t(a) AS (VALUES array[1, 3, 4, 5], array[3, 5, 6, 1])
SELECT array_agg(value_sum ORDER BY ordinal)
FROM (
    SELECT ordinal, sum(value) AS value_sum
    FROM t
    CROSS JOIN UNNEST(t.a) WITH ORDINALITY AS x(value, ordinal)
    GROUP BY ordinal
);
This returns [4, 8, 10, 6]; for the element-wise average from the question, replace sum(value) with avg(value).
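Put back into the shape of the question's query, a sketch assuming my_table(timestamp_column, array_column):
select
    timestamp_column,
    array_agg(elem_avg ORDER BY ordinal) as new_array
from (
    select timestamp_column, ordinal, avg(value) as elem_avg
    from my_table
    cross join unnest(array_column) with ordinality as x(value, ordinal)
    group by timestamp_column, ordinal
)
group by timestamp_column;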

How can I aggregate Jsonb columns in postgres using another column type

I have the following data in a postgres table, where data is a jsonb column:
count_type      | data
briefings_count | {"Design": 1, "Engineering": 1}
meetings_count  | {"Design": 13, "Engineering": 13, "Data Science": 3}
I would like to get the result as
[
{field_type: "Design", briefings_count: 1, meetings_count: 13},
{field_type: "Engineering", briefings_count: 1, meetings_count: 13},
{field_type: "Data Science", briefings_count: 0, meetings_count: 3}
]
Explanation
Use the jsonb_each_text function to extract data from the jsonb column named data. Then aggregate rows using GROUP BY to get one row for each distinct field_type. Each aggregation also needs to include the meetings and briefings counts, which is done by selecting the maximum value with a CASE expression, so that you can create two separate columns for the different counts. On top of that, apply the coalesce function to return 0 instead of NULL when some information is missing; in your example that would be briefings for Data Science.
At the outer level of the statement, now that we have the results as a table with the fields we need, we build a jsonb object and aggregate all rows into one. For that we're using jsonb_build_object, to which we pass pairs consisting of a field name and its value. That leaves us with 3 rows of data, each holding a separate jsonb value. Since we want only one row (an aggregated json) in the output, we apply jsonb_agg on top of that. This produces the result you're looking for.
Code
select
jsonb_agg(
jsonb_build_object('field_type', field_type,
'briefings_count', briefings_count,
'meetings_count', meetings_count
)
) as agg_data
from (
select
j.k as field_type
, coalesce(max(case when t.count_type = 'briefings_count' then j.v::int end),0) as briefings_count
, coalesce(max(case when t.count_type = 'meetings_count' then j.v::int end),0) as meetings_count
from tbl t,
jsonb_each_text(data) j(k,v)
group by j.k
) t
You can aggregate columns like this and then insert the data into another table:
select array_agg(data)
from the_table
Or use one of the built-in JSON functions to create a new JSON array, for example jsonb_agg(expression).
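A minimal, self-contained sketch of both (table and data are hypothetical):
with the_table(data) as (
    values ('{"a": 1}'::jsonb),
           ('{"b": 2}'::jsonb)
)
select jsonb_agg(data) as agg_data
from the_table;
-- agg_data => [{"a": 1}, {"b": 2}]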