BigQuery update / insert in nested arrays and arrays of structs - SQL

Editing the question to give a better overview.
There are two tables: STAGING and CORE.
I am having trouble copying the data from STAGING to CORE.
Conditions:
If id, year and local_id match in both staging and core -> the corresponding array entry in CORE should be updated with the values from STAGING.
If id does not exist in CORE -> a new row should be inserted into CORE with the values from STAGING.
If id matches but local_id/year do not, a new entry should be appended to the data array.
BigQuery schema for STAGING
[
{
"name": "id",
"type": "STRING"
},
{
"name": "content",
"type": "STRING"
},
{
"name": "createdAt",
"type": "TIMESTAMP"
},
{
"name": "sourceFileName",
"type": "STRING"
},
{
"name": "data",
"type": "record",
"fields": [
{
"name": "local_id",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "year",
"type": "INTEGER",
"mode": "NULLABLE"
},
{
"name": "country",
"type": "STRING",
"mode": "NULLABLE"
}
]
}
]
BigQuery schema for CORE
[
{
"name": "id",
"type": "STRING"
},
{
"name": "content",
"type": "STRING"
},
{
"name": "createdAt",
"type": "TIMESTAMP"
},
{
"name": "data",
"type": "record",
"mode": "REPEATED",
"fields": [
{
"name": "local_id",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "year",
"type": "INTEGER",
"mode": "NULLABLE"
},
{
"name": "country",
"type": "STRING",
"mode": "NULLABLE"
}
]
}
]
BigQuery content for STAGING:
{"id":"1","content":"content1","createdAt":"2020-07-23 12:46:15.054410 UTC","sourceFileName":"abc.json","data":{"local_id":"123","year":2018,"country":"PL"}}
{"id":"1","content":"content3","createdAt":"2020-07-23 12:46:15.054410 UTC","sourceFileName":"abc.json","data":{"local_id":"123","year":2021,"country":"SE"}}
{"id":"2","content":"content4","createdAt":"2020-07-23 12:46:15.054410 UTC","sourceFileName":"abc.json","data":{"local_id":"334","year":2021,"country":"AZ"}}
{"id":"2","content":"content5","createdAt":"2020-07-23 12:46:15.054410 UTC","sourceFileName":"abc.json","data":{"local_id":"337","year":2021,"country":"NZ"}}
BigQuery content for CORE:
{"id":"1","content":"content1","createdAt":"2020-07-23 12:46:15.054410 UTC","data":[{"local_id":"123","year":2018,"country":"SE"},{"local_id":"33","year":2019,"country":"PL"},{"local_id":"123","year":2020,"country":"SE"}]}

Try using the MERGE statement:
MERGE `dataset.destination` D
USING (SELECT id, ARRAY(SELECT data) AS data FROM `dataset.source`) S
ON D.id = S.id
WHEN MATCHED THEN
  UPDATE SET data = S.data
WHEN NOT MATCHED THEN
  INSERT (id, data) VALUES (S.id, S.data)
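The skeleton above replaces the whole data array whenever the id matches. A sketch that extends it to the element-level rules in the question (update a matching array entry, append non-matching ones, insert brand-new ids), assuming the STAGING/CORE schemas above and that staging holds at most one row per (id, local_id, year):
MERGE `dataset.core` C
USING (
  -- collapse staging to one row per id so the MERGE source has no duplicate keys
  SELECT
    id,
    ANY_VALUE(content)   AS content,
    ANY_VALUE(createdAt) AS createdAt,
    ARRAY_AGG(data)      AS new_data
  FROM `dataset.staging`
  GROUP BY id
) S
ON C.id = S.id
WHEN MATCHED THEN
  UPDATE SET data = ARRAY_CONCAT(
    -- keep the existing entries that staging does not replace ...
    ARRAY(
      SELECT o
      FROM UNNEST(C.data) o
      WHERE NOT EXISTS (
        SELECT 1
        FROM UNNEST(S.new_data) n
        WHERE n.local_id = o.local_id AND n.year = o.year)),
    -- ... then take every staged entry: matches are replaced, the rest are appended
    S.new_data)
WHEN NOT MATCHED THEN
  INSERT (id, content, createdAt, data)
  VALUES (S.id, S.content, S.createdAt, S.new_data)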

I was finally able to nail down the problem.
To merge the two records I had to resort to subqueries that do some of the work. I still think there is room for improvement in this code.
-- INSERT IDs
INSERT `deep_test.main_table` (people_id)
(
  SELECT DISTINCT people_id
  FROM `deep_test.staging_test`
  WHERE people_id NOT IN (SELECT people_id FROM `deep_test.main_table`)
);
-- UPDATE TALENT RECORD
UPDATE `deep_test.main_table` gold
SET talent = B.talent
FROM (
  SELECT
    gold.people_id AS people_id,
    ARRAY_AGG(aggregated_stage.talent) AS talent
  FROM `deep_test.main_table` gold
  JOIN (
    SELECT
      A.people_id,
      A.talent
    FROM (
      SELECT
        ARRAY_AGG(t ORDER BY t.createdAt DESC LIMIT 1)[OFFSET(0)] A
      FROM `deep_test.staging_test` t
      GROUP BY
        t.people_id,
        t.talent.people_l_id,
        t.talent.fiscalYear
    )
  ) AS aggregated_stage
  ON gold.people_id = aggregated_stage.people_id
  WHERE aggregated_stage.talent IS NOT NULL
  GROUP BY people_id
) B
WHERE B.people_id = gold.people_id;
-- UPDATE COUNTRY CODE
UPDATE `deep_test.core` core
SET core.country_code = countries.number
FROM (
  SELECT
    people_id,
    (SELECT country FROM UNNEST(talent) AS d ORDER BY d.fiscalYear DESC LIMIT 1) AS country
  FROM `deep_test.core`
) B, `deep_test.countries` countries
WHERE core.people_id = B.people_id
  AND countries.code = B.country;
This creates a subquery and assigns its result to an alias (B), which can then be used like a table for querying and for joining the results with another table.

To create an array field, use the ARRAY() function.
To append to an array field, use the ARRAY_CONCAT() function.
This query can be used to cover the "update if present" requirement:
UPDATE `destination` d
SET d.data = ARRAY_CONCAT(d.data, ARRAY(
    SELECT s.data
    FROM `source` s
    WHERE d.id = s.id))
WHERE d.id IN (SELECT id FROM `source` s)
https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#update_using_joins
https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#creating_arrays_from_subqueries
https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#combining_arrays
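A tiny standalone illustration of the two functions, using literal values only (nothing from the question's tables):
SELECT
  ARRAY(SELECT x FROM UNNEST([1, 2, 3]) AS x WHERE x > 1 ORDER BY x) AS built_array,  -- [2, 3]
  ARRAY_CONCAT([1, 2], [3, 4]) AS appended_array                                      -- [1, 2, 3, 4]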

Related

BigQuery select rows with two (or more / less) matches in a repeated field

I have a schema that looks like:
[
{
"name": "name",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "frm",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "c",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "n",
"type": "STRING",
"mode": "REQUIRED"
}
]
},
{
"name": "",
"type": "STRING",
"mode": "NULLABLE"
}
]
With a sample record that looks like this:
I am trying to write a query that selects this row when there is an entry in frm that has c = 'X' and another entry that has c = 'Z'. Only when both conditions are true do I want to select the name of the parent row. I have no clue how to achieve this. Any suggestions?
E.g. this works, but I am unnesting frm twice; there must be a more efficient way, I guess.
SELECT name FROM `t2`
WHERE 'X' IN UNNEST(frm.c) AND 'Z' IN UNNEST(frm.c)
Consider the approach below:
select name
from your_table t
where 2 = (
  select count(distinct c)
  from t.frm
  where c in ('X', 'Z')
)
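A quick self-contained check of that approach with made-up rows; only row1 has both c = 'X' and c = 'Z' in frm, so only its name is returned:
with your_table as (
  select 'row1' as name,
         [struct('X' as c, 'n1' as n), struct('Z' as c, 'n2' as n)] as frm
  union all
  select 'row2',
         [struct('X' as c, 'n3' as n), struct('Y' as c, 'n4' as n)]
)
select name
from your_table t
where 2 = (
  select count(distinct c)
  from t.frm
  where c in ('X', 'Z')
)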

How to use JSON values in Oracle PL/SQL Inner Join SELECT statement

I have two tables T1 and T2. In T1 I have a column C1 that contains a value:
"businessKeys": [{
"name": "REF_ID",
"value": "2634",
"type": "Integer"
}, {
"name": "VERSION_REF_ID",
"value": "91950",
"type": "Integer"
}, {
"name": "SCENARIO",
"value": "test1",
"type": "String"
}, {
"name": "CYCLE",
"value": "2021Q3-1",
"type": "String"
}
]
In Table T2 I have a column C2 :
{
"businessKeys": [{
"name": "REF_ID",
"value": "2634",
"type": "Integer"
}, {
"name": "VERSION_REF_ID",
"value": "91950",
"type": "Integer"
}, {
"name": "SCENARIO",
"value": "test1",
"type": "String"
}, {
"name": "CYCLE",
"value": "2021Q3-1",
"type": "String"
}
],
"secondaryKeys": [{
"name": "EQUATION_ID",
"value": "Value1",
"type": "String"
}, {
"name": "EQUATION_NAME",
"value": "Value 2",
"type": "String"
}, {
"name": "USECASE",
"value": "Test Use Case",
"type": "String"
}, {
"name": "RECORD_DATE",
"value": "07-01-2023",
"type": "Date"
}, {
"name": "OUTPUT_VALUE",
"value": "0",
"type": "Float"
}
]
}
How do I get "secondaryKeys" from T2.C2 if I match "businessKeys"?
If these weren't JSON fields, I would use a simple SELECT:
SELECT t2.secondaryKeys from T1 t1, T2 t2
WHERE t1.businessKeys = t2.businessKeys
I also need to retrieve a certain value from secondaryKeys: OUTPUT_VALUE.
I assumed there's an additional column id in table t1, which you use to select a (unique?) row from t1. Notice where that goes: the WHERE clause at the end of subquery q1 in the WITH clause.
This solution depends critically on the JSON structure being very rigid: the businessKeys object value is always an array with four object members with exactly those keys AND exactly those values for the key name, and similarly for secondaryKeys. These can be relaxed easily in later Oracle versions, which support filter expressions in JSON paths, the JSON_EQUAL condition, etc.; in Oracle 12.1 (and even 12.2) it would be quite a bit harder.
with
  q1 (bk_ref_id, bk_version_ref_id, bk_scenario, bk_cycle) as (
    select j1.bk_ref_id, j1.bk_version_ref_id, j1.bk_scenario, j1.bk_cycle
    from   t1 cross apply
           json_table(c1, '$.businessKeys'
             columns ( bk_ref_id         integer  path '$[0].value'
                     , bk_version_ref_id integer  path '$[1].value'
                     , bk_scenario       varchar2 path '$[2].value'
                     , bk_cycle          varchar2 path '$[3].value'
                     )
           ) j1
    where  t1.id = 101   -------- INPUT ID GOES HERE --------
  ),
  q2 (bk_ref_id, bk_version_ref_id, bk_scenario, bk_cycle, sk_output_value) as (
    select j2.bk_ref_id, j2.bk_version_ref_id, j2.bk_scenario, j2.bk_cycle,
           j2.sk_output_value
    from   t2 cross apply
           json_table(c2, '$'
             columns ( sk_output_value number path '$.secondaryKeys[4].value'
                     , nested path '$.businessKeys'
                       columns ( bk_ref_id         integer  path '$[0].value'
                               , bk_version_ref_id integer  path '$[1].value'
                               , bk_scenario       varchar2 path '$[2].value'
                               , bk_cycle          varchar2 path '$[3].value'
                               )
                     )
           ) j2
  )
select q2.sk_output_value
from   q1 join q2 using (bk_ref_id, bk_version_ref_id, bk_scenario, bk_cycle)
;
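Regarding the note above about later Oracle versions: assuming Oracle 18c or newer, and assuming both C1 and C2 are full JSON documents whose businessKeys arrays list their members in the same order, the key-by-key comparison can be collapsed into a single JSON_EQUAL join. A sketch (the id filter and column sizes are illustrative):
select jt.key_value as output_value
from   t1
join   t2
  on   json_equal(json_query(t1.c1, '$.businessKeys'),
                  json_query(t2.c2, '$.businessKeys'))
cross apply
       json_table(t2.c2, '$.secondaryKeys[*]'
         columns ( key_name  varchar2(30)  path '$.name'
                 , key_value varchar2(100) path '$.value'
                 )
       ) jt
where  jt.key_name = 'OUTPUT_VALUE'
and    t1.id = 101   -------- INPUT ID GOES HERE --------
;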

Inserting data from one BigQuery table to another returns 0 rows on group by

I am trying to insert data from one BigQuery table into another by running the query shown below, but I get 0 rows in return. However, if I take out the Survey column, I get the correct number of rows.
Both nested fields have the same type of schema. I have checked and double-checked the column names too, but can't seem to figure out what's wrong with the Survey field.
INSERT INTO destination_table
(
Title, Description, Address, Survey
)
SELECT
Title as Title,
Description as Description,
[STRUCT(
ARRAY_AGG(STRUCT(Address_Instance.Field1, Address_Instance.Field2)) AS Address_Record
)]
as Address,
[STRUCT(
ARRAY_AGG(STRUCT(Survey_Instance.Field1, Survey_Instance.Field2)) AS Survey_Record
)]
as Survey
FROM
source_table,
UNNEST(Survey) AS Survey,
UNNEST(Survey_Instance) AS Survey_Instance
GROUP BY
Title,
Description
Here's what the schema of my source table looks like:
[
{
"name": "Title",
"type": "STRING"
},
{
"name": "Description",
"type": "STRING"
},
{
"name": "Address",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "Address_Instance",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "Field1",
"type": "STRING"
},
{
"name": "Field2",
"type": "STRING"
}
]
}
]
},
{
"name": "Survey",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "Survey_Instance",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "Field1",
"type": "STRING"
},
{
"name": "Field2",
"type": "STRING"
}
]
}
]
}
]
While mapping to the destination table, I rename the nested repeated records, but that's not causing any problems. I am wondering if I am overlooking something important; I basically need an extra set of eyes to help me figure out what I am doing wrong.
Would appreciate some help. Thanks in advance.
Use explicit JOINs in general. In this case, use LEFT JOIN:
FROM source_table st
LEFT JOIN UNNEST(st.Survey) Survey ON 1 = 1
LEFT JOIN UNNEST(Survey.Survey_Instance) Survey_Instance ON 1 = 1
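A minimal sketch of the asker's SELECT rewritten around that FROM clause, using the names from the question; rows whose Survey array is empty are now kept rather than dropped (their Survey_Instance fields simply come back as NULL):
SELECT
  Title,
  Description,
  [STRUCT(
    ARRAY_AGG(STRUCT(Survey_Instance.Field1, Survey_Instance.Field2)) AS Survey_Record
  )] AS Survey
FROM source_table st
LEFT JOIN UNNEST(st.Survey) Survey ON 1 = 1
LEFT JOIN UNNEST(Survey.Survey_Instance) Survey_Instance ON 1 = 1
GROUP BY Title, Description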

How to load data from query into table on BigQuery

I have the following BigQuery tables:
orders:
[
{
"name": "orders_id",
"type": "INTEGER",
"mode": "NULLABLE"
},
{
"name": "customer_id",
"type": "INTEGER",
"mode": "NULLABLE"
}
]
customers:
[
{
"name": "customer_id",
"type": "INTEGER",
"mode": "NULLABLE"
},
{
"name": "customer_name",
"type": "INTEGER",
"mode": "NULLABLE"
}
]
I want to create new_orders as follows:
[
{
"name": "orders_id",
"type": "INTEGER",
"mode": "NULLABLE"
},
{
"name": "customer_name",
"type": "INTEGER",
"mode": "NULLABLE"
}
]
So I created an empty table for new_orders and wrote this query:
SELECT o.orders_id,c.customer_name
from `project.orderswh.orders` as o
inner join `project.orderswh.customers` as c on o.customer_id = c.customer_id
My problem is how to load the data from this query result into the new table.
I have about 15M rows. To the best of my knowledge, a regular insert is expensive and incredibly slow. How can I do this as a load job?
You could do this from the BigQuery Console. Follow these steps:
1) Show Options
2) Destination Table
3) Choose the dataset and provide "new_orders" as the Table ID
4) Set "Write Preference" to "Write if empty", as this is a one-time thing as you said
If needed, also have a look at this tutorial: https://cloud.google.com/bigquery/docs/writing-results
You could use the bq command line tool:
bq query --append_table \
--nouse_legacy_sql \
--allow_large_results \
--destination_table project.orderswh.new_orders 'SELECT o.orders_id,c.customer_name
from `project.orderswh.orders` as o
inner join `project.orderswh.customers` as c on o.customer_id = c.customer_id'
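For completeness, the same load can be done purely in SQL with a CREATE TABLE ... AS SELECT statement. A sketch, assuming the project and dataset names from the question (CREATE OR REPLACE overwrites the empty new_orders table created earlier):
CREATE OR REPLACE TABLE `project.orderswh.new_orders` AS
SELECT o.orders_id, c.customer_name
FROM `project.orderswh.orders` AS o
INNER JOIN `project.orderswh.customers` AS c
  ON o.customer_id = c.customer_id;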

How to handle dynamic schema in BigQuery

My data looks like this:
row 1 - {"id": "1", "object": "user","metadata": {"name": "1234"}}
row 2 - {"id": "1", "object": "user","metadata": {"name": "1234","email": "abc@abc.com"}}
I created the table using row 1
metadata RECORD NULLABLE
metadata.tenant STRING NULLABLE
object STRING NULLABLE
id STRING NULLABLE
But my insert will fail on row 2. What should my schema look like so that it can handle changes in the metadata field?
For the example shown in your question, I would go with the schema below:
[
{
"name": "id",
"type": "INTEGER",
"mode": "NULLABLE"
},
{
"name": "object",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "metadata",
"type": "STRING",
"mode": "NULLABLE"
}
]
And below is an example of how I would process it:
#standardSQL
WITH `yourProject.yourDataset.yourTable` AS (
SELECT 1 AS id, 'user' AS object, '{"name": "BI Architect", "email": "abc@abc.com"}' AS metadata UNION ALL
SELECT 2, 'expert', '{"name": "Elliott Brossard"}'
)
SELECT
id,
object,
JSON_EXTRACT_SCALAR(metadata, '$.name') AS name,
JSON_EXTRACT_SCALAR(metadata, '$.email') AS email
FROM `yourProject.yourDataset.yourTable`
ORDER BY id
This produces the output below:
id  object  name              email
1   user    BI Architect      abc@abc.com
2   expert  Elliott Brossard  null
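With metadata stored as a plain STRING, new keys need no schema change. A sketch of inserting the question's row 2 into such a table (the table name is illustrative):
INSERT INTO `yourProject.yourDataset.yourTable` (id, object, metadata)
VALUES (1, 'user', '{"name": "1234", "email": "abc@abc.com"}');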