BigQuery array of structs, flatten into one row - google-bigquery

Here is my schema: each row holds an id, a token, a result, and a prices array of structs with market, wait, and price fields.
I would like to flatten this into one row, so that all the structs in the prices array turn into columns: each column name would be the combination of "market" + " " + "wait", and the column's value would be "price". Is this possible?

Consider the query below. You need a dynamic query because you don't know in advance what values the market and wait fields hold, and a column name containing a space is not allowed in BigQuery, so an underscore is used instead.
CREATE TEMP TABLE sample_table AS
SELECT 1 id, 'token001' token,
       [STRUCT('market10' AS market, 100 AS wait, 100 AS price),
        STRUCT('market11' AS market, 101 AS wait, 101 AS price)] prices, 100 result
UNION ALL
SELECT 2 id, 'token002' token,
       [STRUCT('market20' AS market, 200 AS wait, 200 AS price),
        STRUCT('market21' AS market, 201 AS wait, 201 AS price)] prices, 200 result;

EXECUTE IMMEDIATE FORMAT("""
SELECT * FROM (
  SELECT id, token, result, market || '_' || wait AS col_name, price
  FROM sample_table, UNNEST(prices)
) PIVOT (ANY_VALUE(price) FOR col_name IN ('%s'))
""", (SELECT STRING_AGG(market || '_' || wait, "','") FROM sample_table, UNNEST(prices)));
+-----+----+----------+--------+--------------+--------------+--------------+--------------+
| Row | id | token    | result | market10_100 | market11_101 | market20_200 | market21_201 |
+-----+----+----------+--------+--------------+--------------+--------------+--------------+
| 1   | 1  | token001 | 100    | 100          | 101          | null         | null         |
| 2   | 2  | token002 | 200    | null         | null         | 200          | 201          |
+-----+----+----------+--------+--------------+--------------+--------------+--------------+
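Note: if the same market/wait combination can appear in more than one row, the plain STRING_AGG would produce duplicate column names and the PIVOT would fail; a minimal variant of the same query that guards against this with DISTINCT:
EXECUTE IMMEDIATE FORMAT("""
SELECT * FROM (
  SELECT id, token, result, market || '_' || wait AS col_name, price
  FROM sample_table, UNNEST(prices)
) PIVOT (ANY_VALUE(price) FOR col_name IN ('%s'))
""", (SELECT STRING_AGG(DISTINCT market || '_' || wait, "','") FROM sample_table, UNNEST(prices)));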

Related

Is there a way in SQL to aggregate a column across rows and potentially duplicate rows based on another field value in Redshift?

So I have a table, let's call it shipment_items, that lists by shipment_id the individual items contained within a shipment and their quantities.
+-------------+-------------+----------+
| shipment_id | item_name   | quantity |
+-------------+-------------+----------+
| 1           | cleanser    | 1        |
| 1           | moisturizer | 2        |
| 2           | cleanser    | 2        |
| 2           | body wash   | 1        |
| 3           | cleanser    | 1        |
| 3           | moisturizer | 2        |
| 4           | cleanser    | 1        |
| 4           | moisturizer | 1        |
+-------------+-------------+----------+
What I want is to return a table that looks like this
+------------------------------------+----------+
| items                              | num_ship |
+------------------------------------+----------+
| cleanser, moisturizer, moisturizer | 2        |
| body wash, cleanser, cleanser      | 1        |
| cleanser, moisturizer              | 1        |
+------------------------------------+----------+
Is there any way in SQL to do that? I'm thinking something with listagg, but the tricky part is duplicating the item_names based on the quantity field. What I'm trying to show in the new table is that there were 2 shipments that contained 2 moisturizers and 1 cleanser, and 1 shipment containing 2 cleansers and 1 body wash.
** EDIT **
Resolved thanks to @Gordon Linoff.
The new resulting table will look like this:
+-----------------------------+----------+
| items                       | num_ship |
+-----------------------------+----------+
| cleanser: 1, moisturizer: 2 | 2        |
| body wash: 1, cleanser: 2   | 1        |
| cleanser: 1, moisturizer: 1 | 1        |
+-----------------------------+----------+
You can use listagg():
select listagg(item_name, ', ') within group (order by item_name) as items,
       quantity
from t
group by quantity
order by quantity desc;
EDIT:
I think you want two levels of aggregation:
select items, count(*)
from (select shipment_id,
             listagg(distinct item_name, ', ') within group (order by item_name) as items
      from t
      group by shipment_id
     ) s
group by items
order by count(*) desc;
This does not include duplicates in the item list.
EDIT II:
For exact matches, include the quantity:
select items, count(*)
from (select shipment_id,
             listagg(distinct item_name || ':' || quantity, ', ') within group (order by item_name) as items
      from t
      group by shipment_id
     ) s
group by items
order by count(*) desc;
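If you really need each item repeated quantity times in the list (as in the original expected output), one hedged approach, assuming quantities stay small (extend nums as needed), is to fan each row out via a small numbers table before aggregating:
with nums as (
  select 1 as n union all select 2 union all
  select 3 union all select 4 union all select 5
)
select items, count(*) as num_ship
from (select s.shipment_id,
             listagg(s.item_name, ', ') within group (order by s.item_name) as items
      from shipment_items s
      join nums on nums.n <= s.quantity
      group by s.shipment_id
     ) t
group by items
order by count(*) desc;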

SQL blank rows between rows

I am trying to output a blank row after each row.
For example:
SELECT id,job,amount FROM table
+----+-----+--------+
| id | job | amount |
+----+-----+--------+
| 1  | 100 | 123    |
| 2  | 200 | 321    |
| 3  | 300 | 421    |
+----+-----+--------+
To the following:
+----+-----+--------+
| id | job | amount |
+----+-----+--------+
| 1  | 100 | 123    |
|    |     |        |
| 2  | 200 | 321    |
|    |     |        |
| 3  | 300 | 421    |
+----+-----+--------+
I know I can do similar things with a UNION like:
SELECT null AS id, null AS job, null AS amount
UNION
SELECT id,job,amount FROM table
Which gives me a blank row at the beginning, but for the life of me I can't figure out how to do it after every row. A nested SELECT/UNION? I have tried, but nothing seemed to work.
The DBMS is SQL Server 2016
This is an awkward requirement that would probably be better handled on the application side. Here is, however, one way to do it:
select id, job, amount
from (
  select id, job, amount, id order_by from mytable
  union all
  select null, null, null, id from mytable
) t
order by order_by, id desc
The trick is to add an additional column to the unioned query that keeps track of the original id and can be used to sort the records in the outer query. You can then use id desc as the second sorting criterion, which puts the null values in second position.
Demo on DB Fiddle:
with mytable as (
  select 1 id, 100 job, 123 amount
  union all select 2, 200, 321
  union all select 3, 300, 421
)
select id, job, amount
from (
  select id, job, amount, id order_by from mytable
  union all
  select null, null, null, id from mytable
) t
order by order_by, id desc;
  id |  job | amount
---: | ---: | -----:
   1 |  100 |    123
null | null |   null
   2 |  200 |    321
null | null |   null
   3 |  300 |    421
null | null |   null
In SQL Server, you can just use apply:
select v.id, v.job, v.amount
from t cross apply
     (values (id, job, amount, id, 1),
             (null, null, null, id, 2)
     ) v(id, job, amount, ord1, ord2)
order by ord1, ord2;
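For a self-contained test of the apply approach, here is a minimal sketch that inlines the sample rows in a CTE (table and column names are otherwise as above):
with t as (
  select 1 as id, 100 as job, 123 as amount
  union all select 2, 200, 321
  union all select 3, 300, 421
)
select v.id, v.job, v.amount
from t cross apply
     (values (t.id, t.job, t.amount, t.id, 1),
             (null, null, null, t.id, 2)
     ) v(id, job, amount, ord1, ord2)
order by v.ord1, v.ord2;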

Add nested column to BigQuery table, joining on value of another nested column in standard SQL

I have a reasonably complex dataset being pulled into a BigQuery table via an Airflow DAG which cannot easily be adjusted.
This job pulls data into a table with this format:
| Line_item_id | Device      |
|--------------|-------------|
| 123          | 202; 5; 100 |
| 124          | 100; 2      |
| 135          | 504; 202; 2 |
At the moment, I am using this query (written in standard SQL within the BQ Web UI) to split the device ids into individual nested rows:
SELECT
  Line_item_id,
  ARRAY(SELECT AS STRUCT(SPLIT(RTRIM(Device, ';'), '; '))) AS Device,
Output:
| Line_item_id | Device |
|--------------|--------|
| 123          | 202    |
|              | 203    |
|              | 504    |
| 124          | 102    |
|              | 2      |
| 135          | 102    |
The difficulty I am facing is that I have a separate match table containing the device ids and their corresponding names. I need to add the device names to the above table as nested values next to their corresponding ids.
The match table looks something like this (with many more rows):
| Device_id | Device_name |
|-----------|-------------|
| 202       | Smartphone  |
| 203       | AppleTV     |
| 504       | Laptop      |
The ideal output I am looking for would be:
| Line_item_id | Device_id | Device_name |
|--------------|-----------|-------------|
| 123          | 202       | Android     |
|              | 203       | AppleTV     |
|              | 504       | Laptop      |
| 124          | 102       | iphone      |
|              | 2         | Unknown     |
| 135          | 102       | iphone      |
If anybody knows how to achieve this I would be grateful for help.
EDIT:
Gordon's solution works perfectly, but in addition to this, if anybody wants to re-nest the data afterwards (so you effectively end up with the same table plus the nested names), this was the query I finally ended up with:
select t.line_item_id, ARRAY_AGG(STRUCT(d AS id, ot.name AS name)) AS device
from first_table t cross join
     unnest(split(Device, '; ')) d join
     match_table ot
     on ot.id = d
GROUP BY line_item_id
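If the original order of the device ids matters, note that ARRAY_AGG accepts an ORDER BY; a hedged variant of the query above using the split offset:
select t.line_item_id,
       ARRAY_AGG(STRUCT(d AS id, ot.name AS name) ORDER BY off) AS device
from first_table t cross join
     unnest(split(Device, '; ')) d WITH OFFSET off join
     match_table ot
     on ot.id = d
GROUP BY line_item_id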
You can move the parsing logic to the from clause and then join in what you want:
select *
from (select 124 as line_item_id, '203; 100; 6; 2' as device) t cross join
     unnest(split(device, '; ')) d join
     other_table ot
     on ot.device = d;
Below is for BigQuery Standard SQL. No GROUP BY required ...
#standardSQL
SELECT * EXCEPT(Device),
  ARRAY(
    SELECT AS STRUCT Device_id AS id, Device_name AS name
    FROM UNNEST(SPLIT(REPLACE(Device, ' ', ''), ';')) Device_id WITH OFFSET
    JOIN `project.dataset.devices`
    USING (Device_id)
    ORDER BY OFFSET
  ) Device
FROM `project.dataset.items`
If applied to the sample data from your question, the result is:
| Row | Line_item_id | Device.id | Device.name |
|-----|--------------|-----------|-------------|
| 1   | 123          | 202       | Smartphone  |
|     |              | 5         | abc         |
|     |              | 100       | xyz         |
| 2   | 124          | 100       | xyz         |
|     |              | 2         | zzz         |
| 3   | 135          | 504       | Laptop      |
|     |              | 202       | Smartphone  |
|     |              | 2         | zzz         |
FYI: I used the data below to test:
WITH `project.dataset.items` AS (
  SELECT 123 Line_item_id, '202; 5; 100' Device UNION ALL
  SELECT 124, '100; 2' UNION ALL
  SELECT 135, '504; 202; 2'
), `project.dataset.devices` AS (
  SELECT '202' Device_id, 'Smartphone' Device_name UNION ALL
  SELECT '203', 'AppleTV' UNION ALL
  SELECT '504', 'Laptop' UNION ALL
  SELECT '5', 'abc' UNION ALL
  SELECT '100', 'xyz' UNION ALL
  SELECT '2', 'zzz'
)
What you need is to UNNEST the contents of your devices array, and then to roll it back up after joining with the devices lookup table:
select
  line_item_id,
  array_agg(struct(device_id as device_id, device_name as device_name)) as devices
from (
  select
    d.line_item_id,
    dev_id as device_id,
    n.device_name
  from `mydataset.basetable` d, unnest(d.device_ids) as dev_id
  left join `mydataset.devices_table` n on n.device_id = dev_id
)
group by line_item_id
Hope this helps.
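One detail from the expected output above: unmatched ids (like 2) show as Unknown rather than null. A hedged tweak to the inner select, reusing the names from the previous query, would be:
select
  d.line_item_id,
  dev_id as device_id,
  coalesce(n.device_name, 'Unknown') as device_name
from `mydataset.basetable` d, unnest(d.device_ids) as dev_id
left join `mydataset.devices_table` n on n.device_id = dev_id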

How to create a column containing a JSON that has names defined from the value of a column in another table?

I have a source table with data in VARCHAR format like the example below.
I want to insert the data in another table in a JSON format (the result column itself can be of JSON or VARCHAR type).
For each Id, there is at least 1 JSONName/JSONValue pair.
But each Id doesn't have the same kinds and number of JSONName/JSONValue pairs.
Each Id can have maximum 50 JSONName/JSONValue pairs.
The order of the pairs in the value of the ResultJSON column doesn't matter.
SourceTable:
____________________________
| Id | JSONName | JSONValue |
|____|__________|___________|
| 1  | Name     | John      |
| 2  | Name     | Henry     |
| 2  | Age      | 32        |
| 3  | Age      | 56        |
| 3  | Location | US        |
| 4  | Age      | 24        |
| 4  | Name     | Andrew    |
| 4  | Location |           |
What I want:
Expected ResultTable:
____________________________________________________
| Id | ResultJSON |
|____|______________________________________________|
| 1 | {"Name":"John"} |
| 2 | {"Name":"Henry","Age":"32"} |
| 3 | {"Age":"56", "Location":"US"} |
| 4 | {"Age":"24","Name":"Andrew","Location":null} |
What I get with my current query:
Wrong resultTable:
__________________________________________________________________________________________________________________________
| Id | ResultJSON                                                                                                         |
|____|____________________________________________________________________________________________________________________|
| 1  | [{"JSONName":"Name","JSONValue":"John"}]                                                                           |
| 2  | [{"JSONName":"Name","JSONValue":"Henry"},{"JSONName":"Age","JSONValue":"32"}]                                      |
| 3  | [{"JSONName":"Age","JSONValue":"56"},{"JSONName":"Location","JSONValue":"US"}]                                     |
| 4  | [{"JSONName":"Age","JSONValue":"24"},{"JSONName":"Name","JSONValue":"Andrew"},{"JSONName":"Location","JSONValue":null}] |
Current query:
INSERT INTO ResultTable
(
  Id
  ,ResultJSON
)
SELECT
  SourceTable.Id
  ,JSON_AGG(SourceTable.JSONName, SourceTable.JSONValue)
FROM SourceTable
INNER JOIN OtherTable ON SourceTable.Id = OtherTable.Id
GROUP BY SourceTable.Id
Is it possible to do it with Teradata JSON functions? If not, what would be the most optimized query to do it?
You can remove the unwanted parts using a RegEx:
SELECT
  SourceTable.Id
  ,RegExp_Replace(Cast(Json_Agg(SourceTable.JSONName AS "#A", SourceTable.JSONValue AS "#B") AS VARCHAR(32000)),
                  '"#A":|,"#B"|^\[|\]$|}(?=,{")|(?<="},){')
FROM SourceTable
GROUP BY 1
The RegEx removes all of the following:
"#A":
,"#B"
a leading [
a trailing ]
} if it's followed by ,{"
{ if it's following "},
Edit:
Based on the comments, this RegEx leaves superfluous opening braces (e.g. after a null value). This seems to work better:
'"#A":|,"#B"|^\[|\]$|}(?=,)|(?<=,){'
Here is the query I got in the end:
INSERT INTO DB.RESULT_TABLE
(
  ResultId
  ,ResultJSON
)
WITH RECURSIVE MergedTable (Id, mergedList, rnk)
AS
(
  SELECT
    Id
    ,TRIM('"' || JSONName || '":' || COALESCE('"' || JSONValue || '"', 'null')) AS mergedList
    ,rnk
  FROM DB.SOURCE_TABLE
  WHERE rnk = 1
  UNION ALL
  SELECT
    SourceTable.Id
    ,MergedTable.mergedList || ',' || TRIM('"' || SourceTable.JSONName || '":' || COALESCE('"' || SourceTable.JSONValue || '"', 'null')) AS mergedList
    ,SourceTable.rnk
  FROM DB.SOURCE_TABLE SourceTable
  INNER JOIN MergedTable MergedTable
    ON MergedTable.rnk + 1 = SourceTable.rnk
    AND SourceTable.Id = MergedTable.Id
)
SELECT
  MergedTable.Id AS ResultId
  ,'{' || MergedTable.mergedList || '}' AS ResultJSON
FROM MergedTable
QUALIFY RANK() OVER (PARTITION BY ResultId ORDER BY rnk DESC) = 1
;
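Note that this assumes SOURCE_TABLE carries a rnk column numbering the pairs within each Id; if yours does not, a hedged sketch of deriving it in a prior step:
SELECT Id, JSONName, JSONValue,
       ROW_NUMBER() OVER (PARTITION BY Id ORDER BY JSONName) AS rnk
FROM DB.SOURCE_TABLE;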

Effectively joining aggregated data back to raw data in order to trace all the raw data that was used to compute the aggregation

I have some raw data like so:
rawId| Name | Quantity| AggId
-----|------|---------|------
  1  | Foo  |   10    | NULL
  2  | Foo  |   20    | NULL
  3  | Foo  |   30    | NULL
  4  | Bar  |   40    | NULL
  5  | Bar  |   50    | NULL
  6  | Bar  |   60    | NULL
I want to aggregate them:
SELECT name, sum(quantity)
FROM foobar
GROUP BY name
And store these results somewhere:
AggId| Name | Quantity
-----|------|---------
  1  | Foo  |   60
  2  | Bar  |  150
My goal here is to be able to trace which records from the raw table were used to compute the aggregation in the aggregated table. In other words, I want to update all the AggId values for foo in the raw table to 1, and all the AggId values for bar in the raw table to 2.
Currently I'm joining the aggregated table back to the raw table on the grouped columns to find which aggIds are associated with which rawIds:
SELECT a.aggId, r.rawId
FROM agg a JOIN raw r ON (a.name = r.name)
Is there a better way to accomplish this? For example, perhaps through an analytic function?
SELECT rawId, name, quantity,
       SUM(quantity) OVER (PARTITION BY name) grouped_qty
FROM raw;
Results in:
rawId| Name | Quantity| grouped_qty
-----|------|---------|------------
  1  | Foo  |   10    |     60
  2  | Foo  |   20    |     60
  3  | Foo  |   30    |     60
  4  | Bar  |   40    |    150
  5  | Bar  |   50    |    150
  6  | Bar  |   60    |    150
It would be nice if I could get the analytic function to generate a sequence Id for the aggregated set, but I'm not sure if this is possible.
Perhaps this does what you want?
SELECT name, sum(quantity), listagg(foobar.rawid, ',')
FROM foobar
GROUP BY name;
This will include a list of the raw ids that go into each row. Do note that there is a limit of 4000 characters on the length of the listagg() column.
I believe you want DENSE_RANK(). If you leave out the PARTITION BY, it will use the entire result set:
SELECT rawId, name, quantity,
       SUM(quantity) OVER (PARTITION BY name) grouped_qty,
       DENSE_RANK() OVER (ORDER BY name) as name_rank
FROM raw;
More info: https://msdn.microsoft.com/en-us/library/ms189798.aspx
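To actually write the generated ids back into the raw table, a hedged sketch in SQL Server syntax (note that DENSE_RANK ordered by name yields Bar = 1 and Foo = 2, so the generated ids may not match an aggregation table that was numbered in a different order):
UPDATE r
SET AggId = ranked.name_rank
FROM raw r
JOIN (
  SELECT rawId, DENSE_RANK() OVER (ORDER BY name) AS name_rank
  FROM raw
) ranked ON ranked.rawId = r.rawId;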