Aggregate rows into complex json - Athena - sql

I have an Athena query that gives me the table below for given IDs:
ID     | ID_2 | description | state
First  | row  | abc         | [MN, SD]
Second | row  | xyz         | [AL, CA]
I'm using the array_agg function to merge states into an array. Within the query itself I want to convert the output into the format below:
ID    | ID_2 | custom_object
First | row  | {'description': 'abc', 'state': ['MN', 'SD']}
I've looked through the Athena docs but haven't found a function that does exactly this. I've experimented with multimap_agg and map_agg, but this seems too complex to achieve that way. How can I do this? Please help!

You can do it after aggregation by combining casts to json and creating a map:
-- sample data
WITH dataset (ID, ID_2, description, state) AS (
    VALUES
        ('First', 'row', 'abc', array['MN', 'SD']),
        ('Second', 'row', 'xyz', array['AL', 'CA'])
)
-- query
SELECT ID,
       ID_2,
       cast(
           map(
               array['description', 'state'],
               array[cast(description as json), cast(state as json)]
           ) as json
       ) AS custom_object
FROM dataset
Output:

ID     | ID_2 | custom_object
First  | row  | {"description":"abc","state":["MN","SD"]}
Second | row  | {"description":"xyz","state":["AL","CA"]}
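Not Athena SQL, but the map-to-JSON shape above can be sanity-checked with a small Python sketch (column values copied from the question; json.dumps with compact separators mimics Athena's JSON serialization):

```python
import json

# Rows as produced before the final cast: ID, ID_2, description, and
# the array_agg'ed state list (values copied from the question).
rows = [
    ("First", "row", "abc", ["MN", "SD"]),
    ("Second", "row", "xyz", ["AL", "CA"]),
]

# Mirror cast(map(array[keys], array[values]) as json): build a
# two-key object per row and serialize it compactly.
result = [
    (id_, id_2, json.dumps({"description": d, "state": s}, separators=(",", ":")))
    for id_, id_2, d, s in rows
]
for line in result:
    print(*line)
```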

Related

How to pivot table without using aggregate function SQL?

I would like to pivot a table, but to use PIVOT() I have to use an aggregation function such as max(), min(), mean(), count(), or sum(). I don't need these functions, but I need to transform my table without using them.
Source table

SOURCE | ATTRIBUTE | CATEGORY
GOOGLE | MOVIES    | 1
YAHOO  | JOURNAL   | 2
GOOGLE | MUSIC     | 1
AOL    | MOVIES    | 3
The new table should be like this:

ATTRIBUTE | GOOGLE | YAHOO | AOL
MOVIES    | 1      |       | 3
I'd be grateful if someone could help me.
The pivot syntax requires an aggregate function. It's not optional.
https://docs.snowflake.com/en/sql-reference/constructs/pivot.html
SELECT ... FROM ... PIVOT ( <aggregate_function> ( <pivot_column> )
FOR <value_column> IN ( <pivot_value_1> [ , <pivot_value_2> ... ] ) )
[ ... ]
In the Snowflake documentation, optional syntax is in square brackets. The <aggregate_function> is not in square brackets, so it's required. If you only have one value, any of the aggregate functions you listed except count will work and give the same result.
create or replace table T1("SOURCE" string, ATTRIBUTE string, CATEGORY int);
insert into T1("SOURCE", attribute, category) values
('GOOGLE', 'MOVIES', 1),
('YAHOO', 'JOURNAL', 2),
('GOOGLE', 'MUSIC', 1),
('AOL', 'MOVIES', 3);
select *
from T1
PIVOT ( sum ( CATEGORY )
for "SOURCE" in ( 'GOOGLE', 'YAHOO', 'AOL' ));
ATTRIBUTE | 'GOOGLE' | 'YAHOO' | 'AOL'
MOVIES    | 1        | null    | 3
MUSIC     | 1        | null    | null
JOURNAL   | null     | 2       | null
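For intuition, here is a plain Python sketch of what PIVOT(sum(CATEGORY) FOR "SOURCE" IN (...)) computes (data copied from the question; this is an illustration, not Snowflake code):

```python
from collections import defaultdict

rows = [
    ("GOOGLE", "MOVIES", 1),
    ("YAHOO", "JOURNAL", 2),
    ("GOOGLE", "MUSIC", 1),
    ("AOL", "MOVIES", 3),
]
sources = ["GOOGLE", "YAHOO", "AOL"]

# Group by ATTRIBUTE; sum CATEGORY into one column per pivoted SOURCE.
# Missing combinations stay None, matching the nulls in the output above.
pivot = defaultdict(lambda: dict.fromkeys(sources))
for source, attribute, category in rows:
    current = pivot[attribute][source]
    pivot[attribute][source] = category if current is None else current + category

for attribute, cols in pivot.items():
    print(attribute, cols)
```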
Here's an alternative approach utilising ARRAY_AGG(), which may be more flexible. For example, if you wanted to pivot the attribute instead of the category, easy peasy. The 'category' here appears to be more of a classification or label than something we'd want summed.
The array doesn't need to be constrained to any particular data type. This can be extended to pivots with many objects.
WITH CTE AS (
    SELECT 'GOOGLE' SOURCE, 'MOVIES' ATTRIBUTE, 1 CATEGORY UNION ALL
    SELECT 'YAHOO', 'JOURNAL', 2 UNION ALL
    SELECT 'GOOGLE', 'MUSIC', 1 UNION ALL
    SELECT 'AOL', 'MOVIES', 3
)
SELECT
    ATTRIBUTE, GOOGLE[0] GOOGLE, YAHOO[0] YAHOO, AOL[0] AOL
FROM
    CTE PIVOT (ARRAY_AGG(CATEGORY) FOR SOURCE IN ('GOOGLE', 'YAHOO', 'AOL'))
    AS A (ATTRIBUTE, GOOGLE, YAHOO, AOL);

Postgres grouping query result by row

I'm new to Postgres and am using it for the first time with Node.js.
I have a table with sensor readings (temperature in this case).
The table is something like:

sensorID | timestamp | temperature
1        | 8:22pm    | 22C
2        | 8:23pm    | 21C
3        | 8:25pm    | 24C
2        | 8:26pm    | 27C
2        | 8:28pm    | 19C
1        | 8:31pm    | 28C
Is there a way to create a single query to get the data into Node.js formatted like this:
[
{sensorID: 1,
data: [
{timestamp:8:22pm, temperature:22C},
{timestamp:8:31pm, temperature:28C}
]},
{sensorID: 2,
data: [
{timestamp:8:23pm, temperature:21C},
{timestamp:8:26pm, temperature:27C},
{timestamp:8:28pm, temperature:19C}
]},
{sensorID: 3,
data: [
{timestamp:8:25pm, temperature:24C}
]}
]
Welcome to SO.
To build and aggregate JSON objects in PostgreSQL (via SQL) you can use the functions jsonb_build_object and jsonb_agg:
WITH j AS (
SELECT
sensorid,
jsonb_build_object(
'data',
jsonb_agg(
jsonb_build_object(
'timestamp',timestamp,
'temperature',temperature)))
FROM t
GROUP BY sensorid
ORDER BY sensorid)
SELECT jsonb_agg(j) FROM j;
Demo: db<>fiddle
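The grouping the query performs can be sketched in plain Python (sample readings copied from the question; an illustration of the shape, not the Postgres code itself):

```python
import json
from itertools import groupby

readings = [
    (1, "8:22pm", "22C"), (2, "8:23pm", "21C"), (3, "8:25pm", "24C"),
    (2, "8:26pm", "27C"), (2, "8:28pm", "19C"), (1, "8:31pm", "28C"),
]

# GROUP BY sensorid, then jsonb_agg(jsonb_build_object(...)):
# one object per sensor, holding its readings as an array.
result = [
    {
        "sensorID": sid,
        "data": [{"timestamp": ts, "temperature": temp} for _, ts, temp in grp],
    }
    for sid, grp in groupby(sorted(readings), key=lambda r: r[0])
]
print(json.dumps(result, indent=2))
```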

How can I flatten table in SQL in Google Big Query?

I have this table
And tried to achieve the following output:
I found different articles (like this) about how to do it; unfortunately, they do not work with my table.
The schema of the table is the following:
Consider the approach below - it is less verbose and easy to manage if any adjustments are needed.
select * from (
select id_car, kv.element.key, kv.element.value
from `project.dataset.table`, unnest(table.keyvalue.list) as kv
)
pivot (min(value) for key in ('id', 'model', 'status', 'speed'))
If applied to the sample data in your question, the output is:
I created a table with the schema you mentioned and the data you gave.
I ran the following query on this table:
SELECT id_car,
       STRING_AGG(id, '') AS id,
       STRING_AGG(model, '') AS model,
       STRING_AGG(Status, '') AS status,
       STRING_AGG(speed, '') AS speed
FROM (
    SELECT id_car,
           IF(my.element.key = "id", my.element.value, '') AS id,
           IF(my.element.key = "model", my.element.value, '') AS model,
           IF(my.element.key = "Status", my.element.value, '') AS Status,
           IF(my.element.key = "speed", my.element.value, '') AS speed
    FROM `ProjectID.Dataset.Table`, UNNEST(table.keyvalue.list) AS my
)
GROUP BY id_car
This gives me the same output that you expect:
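Since the question's table isn't reproduced here, a Python sketch of the key/value flattening may help: each id_car carries a repeated list of (key, value) pairs, and flattening turns those pairs into columns. The sample values below are invented for illustration only:

```python
rows = [
    # (id_car, keyvalue list) - the values here are hypothetical
    (101, [("id", "A1"), ("model", "sedan"), ("Status", "ok"), ("speed", "120")]),
    (102, [("id", "B2"), ("model", "suv"), ("Status", "ok"), ("speed", "90")]),
]

# UNNEST the kv list and pivot the keys into columns: one row per id_car.
flat = [{"id_car": id_car, **dict(kv)} for id_car, kv in rows]
for row in flat:
    print(row)
```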

Spark INLINE Vs. LATERAL VIEW EXPLODE differences?

In Spark, for the following use case, I'd like to understand the main differences between using INLINE and EXPLODE. I'm not sure whether there are performance implications, whether one method is preferred over the other, or whether there are other use cases where one is appropriate and the other is not.
The use case is to select 2 fields from a complex data type (an array of structs); my instinct was to use INLINE, since it explodes an array of structs.
For example:
WITH sample AS (
SELECT 1 AS id,
array(NAMED_STRUCT('name', 'frank',
'age', 40,
'state', 'Texas'
),
NAMED_STRUCT('name', 'maria',
'age', 51,
'state', 'Georgia'
)
)
AS array_of_structs
),
inline_data AS (
SELECT id,
INLINE(array_of_structs)
FROM sample
)
SELECT id,
name AS person_name,
age AS person_age
FROM inline_data
And using LATERAL VIEW EXPLODE:
WITH sample AS (
SELECT 1 AS id,
array(NAMED_STRUCT('name', 'frank',
'age', 40,
'state', 'Texas'
),
NAMED_STRUCT('name', 'maria',
'age', 51,
'state', 'Georgia'
)
)
AS array_of_structs
)
SELECT id,
person.name,
person.age
FROM sample
LATERAL VIEW EXPLODE(array_of_structs) exploded_people as person
The documentation clearly states what each one of these do but I'd like to better understand when to pick one over the other one.
The EXPLODE UDTF generates rows of structs (a single column of type struct), and to get the person's name you need to use person.name:
WITH sample AS (
SELECT 1 AS id,
array(NAMED_STRUCT('name', 'frank',
'age', 40,
'state', 'Texas'
),
NAMED_STRUCT('name', 'maria',
'age', 51,
'state', 'Georgia'
)
)
AS array_of_structs
)
SELECT id,
person.name,
person.age
FROM sample
LATERAL VIEW explode(array_of_structs) exploded_people as person
Result:
id,name,age
1,frank,40
1,maria,51
The INLINE UDTF generates a row set with N columns (N = the number of top-level fields in the struct), so you do not need the dot notation person.name, because name and the other struct fields are already extracted by INLINE:
WITH sample AS (
SELECT 1 AS id,
array(NAMED_STRUCT('name', 'frank',
'age', 40,
'state', 'Texas'
),
NAMED_STRUCT('name', 'maria',
'age', 51,
'state', 'Georgia'
)
)
AS array_of_structs
)
SELECT id,
name,
age
FROM sample
LATERAL VIEW inline(array_of_structs) exploded_people as name, age, state
Result:
id,name,age
1,frank,40
1,maria,51
Both INLINE and EXPLODE are UDTFs and require LATERAL VIEW in Hive. In Spark they work fine without LATERAL VIEW. The only difference is that EXPLODE returns a dataset of array elements (structs, in your case), while INLINE is used to get the struct elements already extracted. In the case of INLINE you need to name all the struct fields, like this: LATERAL VIEW inline(array_of_structs) exploded_people as name, age, state
From a performance perspective, INLINE and EXPLODE work the same; you can use the EXPLAIN command to check the plan. Extracting struct elements in the UDTF or after the UDTF does not affect performance.
INLINE requires you to describe all the struct elements (in Hive) and EXPLODE does not, so EXPLODE may be more convenient if you do not need to extract all the struct elements, or if you do not need to extract elements at all. INLINE is convenient when you need all or most of the struct elements.
Your first code example works only in Spark. In Hive 2.1.1 it throws an exception because a LATERAL VIEW is required.
In Spark this will also work:
inline_data AS (
SELECT id,
EXPLODE(array_of_structs) as person
FROM sample
)
And to get the age column you need to use person.age.
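The difference in row shape between the two UDTFs can be mimicked in plain Python (structs as dicts; an illustration of the semantics, not Spark code):

```python
array_of_structs = [
    {"name": "frank", "age": 40, "state": "Texas"},
    {"name": "maria", "age": 51, "state": "Georgia"},
]

# EXPLODE: one row per array element, the struct stays a single value,
# so its fields need dot notation afterwards (person.age).
exploded = [(1, person) for person in array_of_structs]

# INLINE: one row per array element with the struct fields already
# split into top-level columns (name, age, state).
inlined = [(1, p["name"], p["age"], p["state"]) for p in array_of_structs]

print(exploded[0])
print(inlined[0])
```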

postgresql Looping through JSONB array and performing SELECTs

I have a jsonb column in one of my tables.
The jsonb looks like this:
my_data : [
{pid: 1, stock: 500},
{pid: 2, stock: 1000},
...
]
pid refers to the products table's id (which is pid)
EDIT: The products table has the following properties: pid (PK), name
I want to loop over my_data[] in my JSONB and fetch each pid's name from the products table.
I need the result to look something like this (including the product names from the second table):
my_data : [
{
product_name : "abc",
pid: 1,
stock : 500
},
...
]
How should I go about performing such a jsonb inner join?
Edit: I tried S-Man's solutions and I'm getting this error:
"invalid reference to FROM-clause entry for table \"jc\""
Here is the SQL query:
step-by-step demo:db<>fiddle
SELECT
jsonb_build_object( -- 5
'my_data',
jsonb_agg( -- 4
elems || jsonb_build_object('product_name', mot.product_name) -- 3
)
)
FROM
mytable,
jsonb_array_elements(mydata -> 'my_data') as elems -- 1
JOIN
my_other_table mot ON (elems ->> 'pid')::int = mot.pid -- 2
1. Expand the JSON array into one row per array element
2. Join the other table against the current one using the pid values (note the ::int cast, because otherwise pid would be a text value)
3. The new columns from the second table can now be converted into a JSON object, which can be concatenated onto the original one using the || operator
4. After that, recreate the array from the array elements
5. Put this array into a my_data element
Another way is to use jsonb_set() instead of step 5, to set the aggregated array back into the original object directly:
step-by-step demo:db<>fiddle
SELECT
jsonb_set(
mydata,
'{my_data}',
jsonb_agg(
elems || jsonb_build_object('product_name', mot.product_name)
)
)
FROM
mytable,
jsonb_array_elements(mydata -> 'my_data') as elems
JOIN
my_other_table mot ON (elems ->> 'pid')::int = mot.pid
GROUP BY mydata
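The enrichment both queries perform can be sketched in Python (the product names below are invented, since the products table's contents aren't shown in the question):

```python
import json

mydata = {"my_data": [{"pid": 1, "stock": 500}, {"pid": 2, "stock": 1000}]}
products = {1: "abc", 2: "def"}  # pid -> name; hypothetical values

# jsonb_array_elements + JOIN on pid + concatenation with ||,
# then jsonb_agg / jsonb_set to rebuild the array in place.
enriched = [
    {**elem, "product_name": products[elem["pid"]]}
    for elem in mydata["my_data"]
]
result = {"my_data": enriched}
print(json.dumps(result))
```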