bigquery joins on nested repeated


I am having trouble joining on a repeated nested field while still preserving the original row structure in BigQuery.
For my example I'll call the two tables being joined A and B.
Records in table A look something like:
{
"url":"some url",
"repeated_nested": [
{"key":"some key","property":"some property"}
]
}
and records in table B look something like:
{
"key":"some key",
"property2": "another property"
}
I am hoping to find a way to join this data together to generate a row that looks like:
{
"url":"some url",
"repeated_nested": [
{
"key":"some key",
"property":"some property",
"property2":"another property"
}
]
}
The very first query I tried was:
SELECT
url, repeated_nested.key, repeated_nested.property, repeated_nested.property2
FROM A
AS lefttable
LEFT OUTER JOIN B
AS righttable
ON lefttable.key=righttable.key
This doesn't work because BQ can't join on repeated nested fields. There is not a unique identifier for each row. If I were to do a FLATTEN on repeated_nested then I'm not sure how to get the original row put back together correctly.
The data is such that a url will always have the same repeated_nested field with it. Because of that, I was able to make a workaround using a UDF to sort of roll up this repeated nested object into a JSON string and then unroll it again:
SELECT url, repeated_nested.key, repeated_nested.property, repeated_nested.property2
FROM
JS(
(
SELECT basetable.url as url, companytable.repeated_nested_json as repeated_nested_json
FROM A as basetable
LEFT JOIN (
SELECT url, CONCAT("[", GROUP_CONCAT_UNQUOTED(repeated_nested_json, ","), "]") as repeated_nested_json
FROM
(
SELECT
url,
CONCAT(
'{"key": "', repeated_nested.key, '",',
' "property": "', repeated_nested.property, '",',
' "property2": "', mapping_table.property2, '"',
'}'
) as repeated_nested_json
FROM (
SELECT
url, repeated_nested.key, repeated_nested.property
FROM A
GROUP BY url, repeated_nested.key, repeated_nested.property
) as urltable
LEFT OUTER JOIN [SDF.alchemy_to_ric]
AS mapping_table
ON urltable.repeated_nested.key=mapping_table.key
)
GROUP BY url
) as companytable
ON basetable.url = companytable.url
),
// input columns:
url, repeated_nested_json,
// output schema:
"[{'name': 'url', 'type': 'string'},
{'name': 'repeated_nested', 'type': 'RECORD', 'mode':'REPEATED', 'fields':
[ { 'name': 'key', 'type':'string' },
{ 'name': 'property', 'type':'string' },
{ 'name': 'property2', 'type':'string' }]
}]",
// UDF:
"function(row, emit) {
parsed_repeated_nested = [];
try {
if ( row.repeated_nested_json != null ) {
parsed_repeated_nested = JSON.parse(row.repeated_nested_json);
}
} catch (ex) { }
emit({
url: row.url,
repeated_nested: parsed_repeated_nested
});
}"
)
This solution works fine for small tables. But the real-life tables I'm working with have many more columns than in my example above. When there are other fields in addition to url and repeated_nested_json, they all have to be passed through the UDF. With tables in the 50 GB range everything is fine. But when I apply the UDF and query to tables that are 500-1000 GB, I get an Internal Server Error from BQ.
In the end I just need all of the data in newline-delimited JSON format in GCS. As a last-ditch effort I tried concatenating all of the fields into a JSON string (so that I only had one column) in the hope that I could export it as CSV and have what I need. However, the export process escapes the double quotes and adds double quotes around the JSON string. According to the BQ docs on jobs (https://cloud.google.com/bigquery/docs/reference/v2/jobs) there is a property configuration.query.tableDefinitions.(key).csvOptions.quote that could help me, but I can't figure out how to make it work.
Does anybody have advice on how they have dealt with this sort of situation?

I have never had to do this, but you should be able to use FLATTEN on the repeated field, then join the flattened rows against B, then use NEST to turn the joined rows back into repeated fields.
The docs state that BigQuery always flattens query results, but that appears to be false: you can choose to not have results flattened if you set a destination table. You should then be able to export that table as JSON to Storage.
See also this answer for how to get nest to work.
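The flatten, join, nest sequence can be sketched outside BigQuery to show what each step does to the row structure. This is a plain-Python illustration, not BigQuery code: the function names are hypothetical, it assumes `url` identifies a row in A (as the question states) and that `key` is unique in B, and, like a flatten, it drops rows whose repeated field is empty.

```python
import json
from collections import defaultdict

def flatten(rows):
    """Emit one flat row per (url, nested element)."""
    for row in rows:
        for item in row["repeated_nested"]:
            yield {"url": row["url"], **item}

def left_join(flat_rows, table_b):
    """Left join the flat rows to B on `key` (assumes `key` is unique in B)."""
    lookup = {b["key"]: b for b in table_b}
    for r in flat_rows:
        match = lookup.get(r["key"], {})
        yield {**r, "property2": match.get("property2")}

def nest(joined_rows):
    """Group the flat rows by url and rebuild the repeated field."""
    grouped = defaultdict(list)
    for r in joined_rows:
        grouped[r["url"]].append({k: v for k, v in r.items() if k != "url"})
    return [{"url": url, "repeated_nested": items} for url, items in grouped.items()]

table_a = [{"url": "some url",
            "repeated_nested": [{"key": "some key", "property": "some property"}]}]
table_b = [{"key": "some key", "property2": "another property"}]

result = nest(left_join(flatten(table_a), table_b))
# One JSON object per line, i.e. newline-delimited JSON.
ndjson = "\n".join(json.dumps(row) for row in result)
print(ndjson)
```

The nesting step is the part that needs a stable grouping key; here that is `url`, which only works because each url always carries the same repeated field.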

@AndrewBackes - we rolled out some fixes for UDF memory-related issues this week; there are some details on the root cause here https://stackoverflow.com/a/36275462/5265394 and here https://stackoverflow.com/a/35563562/5265394.
The UDF version of your query is now working for me; could you verify on your side?

Related

How can I convert SuiteQL query code into corresponding json format for REST API call

I have the following SuiteQL code generated using the NetSuite: Workbook Export chrome extension:
`
SELECT
BUILTIN_RESULT.TYPE_DATE(TRANSACTION.trandate) AS trandate,
BUILTIN_RESULT.TYPE_STRING(TRANSACTION.tranid) AS tranid,
BUILTIN_RESULT.TYPE_STRING(item.displayname) AS displayname,
BUILTIN_RESULT.TYPE_CURRENCY(BUILTIN.CONSOLIDATE(transactionLine.rate, 'LEDGER', 'DEFAULT', 'DEFAULT', 1, 100, 'DEFAULT'), BUILTIN.CURRENCY(BUILTIN.CONSOLIDATE(transactionLine.rate, 'LEDGER', 'DEFAULT', 'DEFAULT', 1, 100, 'DEFAULT'))) AS rate,
BUILTIN_RESULT.TYPE_FLOAT(transactionLine.quantity) AS quantity
FROM
TRANSACTION,
item,
transactionLine,
(SELECT
PreviousTransactionLink.nextdoc AS nextdoc,
PreviousTransactionLink.nextdoc AS nextdoc_join,
transaction_SUB.name_crit AS name_crit_0
FROM
PreviousTransactionLink,
(SELECT
transaction_0.ID AS ID,
transaction_0.ID AS id_join,
CUSTOMLIST234.name AS name_crit
FROM
TRANSACTION transaction_0,
CUSTOMLIST234
WHERE
transaction_0.custbody1 = CUSTOMLIST234.ID(+)
) transaction_SUB
WHERE
PreviousTransactionLink.previousdoc = transaction_SUB.ID(+)
) PreviousTransactionLink_SUB
WHERE
(((transactionLine.item = item.ID(+) AND TRANSACTION.ID = transactionLine.TRANSACTION) AND TRANSACTION.ID = PreviousTransactionLink_SUB.nextdoc(+)))
AND ((TRANSACTION.TYPE IN ('CustInvc') AND transactionLine.itemtype IN ('InvtPart', 'Kit') AND NVL(transactionLine.mainline, 'F') = ? AND (UPPER(PreviousTransactionLink_SUB.name_crit_0) NOT LIKE ? OR PreviousTransactionLink_SUB.name_crit_0 IS NULL) AND TRUNC(TRANSACTION.trandate) > TO_DATE(?, 'YYYY-MM-DD')))
`
When I try to paste it into Power Automate's API call, I get the following 400 error:
"Invalid search query. Detailed unprocessed description follows. Invalid number of parameters. Expected: 3. Provided: 0."
My call's query was formatted as follows:
`
{
"q": "SELECT BUILTIN_RESULT.TYPE_DATE(TRANSACTION.trandate) AS trandate, BUILTIN_RESULT.TYPE_STRING(TRANSACTION.tranid) AS tranid, BUILTIN_RESULT.TYPE_STRING(item.displayname) AS displayname, BUILTIN_RESULT.TYPE_CURRENCY(BUILTIN.CONSOLIDATE(transactionLine.rate, 'LEDGER', 'DEFAULT', 'DEFAULT', 1, 100, 'DEFAULT'), BUILTIN.CURRENCY(BUILTIN.CONSOLIDATE(transactionLine.rate, 'LEDGER', 'DEFAULT', 'DEFAULT', 1, 100, 'DEFAULT'))) AS rate, BUILTIN_RESULT.TYPE_FLOAT(transactionLine.quantity) AS quantity FROM TRANSACTION, item, transactionLine, (SELECT PreviousTransactionLink.nextdoc AS nextdoc, PreviousTransactionLink.nextdoc AS nextdoc_join, transaction_SUB.name_crit AS name_crit_0 FROM PreviousTransactionLink, (SELECT transaction_0.ID AS ID,transaction_0.ID AS id_join, CUSTOMLIST234.name AS name_crit FROM TRANSACTION transaction_0, CUSTOMLIST234 WHERE transaction_0.custbody1 = CUSTOMLIST234.ID(+) ) transaction_SUB WHERE PreviousTransactionLink.previousdoc = transaction_SUB.ID(+) ) PreviousTransactionLink_SUB WHERE (((transactionLine.item = item.ID(+) AND TRANSACTION.ID = transactionLine.TRANSACTION) AND TRANSACTION.ID = PreviousTransactionLink_SUB.nextdoc(+))) AND ((TRANSACTION.TYPE IN ('CustInvc') AND transactionLine.itemtype IN ('InvtPart', 'Kit') AND NVL(transactionLine.mainline, 'F') = ? AND (UPPER(PreviousTransactionLink_SUB.name_crit_0) NOT LIKE ? OR PreviousTransactionLink_SUB.name_crit_0 IS NULL) AND TRUNC(TRANSACTION.trandate) > TO_DATE(?, 'YYYY-MM-DD')))"
}
`
When I use SuiteQL in my API calls with a simple query, the calls work, so I'm pretty sure I'm getting the JSON format of the query above wrong. I tried the following simple query call and it was successful:
`
{
"q": "SELECT email, COUNT(*) as count FROM transaction GROUP BY email"
}
`
I have tried using json beautifier to try to fix my json but I haven't been able to do so successfully.
Below is a picture of the HTTP action I'm using in Power Automate to make the query:
[image: HTTP POST Action]
For context, I'm an accountant by trade trying to learn how to do some basic coding. Any hint that will help me correctly format the above query will be greatly appreciated. Thanks!

As per my comment, typically, question marks in an SQL statement are deemed as being parameters.
Code based frameworks use them as placeholders for filling the parameters to abstract them from the string itself.
Now, in Power Automate and with your question, I think it's just a little bit of not seeing the forest for the trees.
The easiest way to populate the question marks is to literally replace them in the string itself with the relevant variable or expression.
So if I took an SQL statement with a parameter like you have ...
SELECT * FROM [dbo].[Test] WHERE Parameter = ?
... I can do the following (this is an EXTREMELY basic example) ...
Using the expressions, you can populate strings within other variables or steps in the flow.
Just be careful when it comes to encapsulating your different parameters with quotes, either single or double. The SQL statement will need them wherever the values are not numbers (strings and dates, for example).
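The same substitution can be sketched in code to make the quoting rule concrete. This is an illustration in Python rather than a Power Automate expression; `quote_literal` and `fill_placeholders` are hypothetical helpers, and the sketch assumes that no `?` appears inside a string literal in the statement. Pasting values into SQL text is only reasonable for values you control.

```python
import json

def quote_literal(value):
    """Render a value as a SQL literal: numbers bare, everything else quoted."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return str(value)
    # Double any single quote so the value cannot terminate the literal early.
    return "'" + str(value).replace("'", "''") + "'"

def fill_placeholders(sql, params):
    """Replace each ? in order with the matching value from params."""
    parts = sql.split("?")
    if len(parts) - 1 != len(params):
        raise ValueError(f"expected {len(parts) - 1} parameters, got {len(params)}")
    out = [parts[0]]
    for value, tail in zip(params, parts[1:]):
        out.append(quote_literal(value))
        out.append(tail)
    return "".join(out)

sql = "SELECT * FROM [dbo].[Test] WHERE Parameter = ?"
# json.dumps takes care of escaping the query for the request body.
body = json.dumps({"q": fill_placeholders(sql, ["O'Brien"])})
print(body)
```

The count check mirrors the error in the question: a statement with three question marks needs three values, and a body that still contains the raw question marks provides none.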

Extract complex json with random key field

I am trying to extract the following JSON into its own rows, like the table below, in a Presto query. The issue is that the key (the AV engine name) is different for each entry, and I am stuck on how to extract and iterate over the keys without knowing their names in advance.
The JSON is the value of a table column:
{
"Bkav":
{
"detected": false,
"result": null
},
"Lionic":
{
"detected": true,
"result": "Trojan.Generic.3611249"
},
...
AV Engine Name | Detected Virus | Result
Bkav | false | null
Lionic | true | Trojan.Generic.3611249
I have tried to use json_extract following the documentation here https://teradata.github.io/presto/docs/141t/functions/json.html but there is no mention of extraction if we don't know the key :( I am trying to find a solution that works in both presto & hive query, is there a common query that is applicable to both?

You can cast your json to map(varchar, json) and process it with unnest to flatten:
-- sample data
WITH dataset (json_str) AS (
VALUES (
'{"Bkav":{"detected": false,"result": null},"Lionic":{"detected": true,"result": "Trojan.Generic.3611249"}}'
)
)
--query
select k "AV Engine Name", json_extract_scalar(v, '$.detected') "Detected Virus", json_extract_scalar(v, '$.result') "Result"
from (
select cast(json_parse(json_str) as map(varchar, json)) as m
from dataset
)
cross join unnest (map_keys(m), map_values(m)) t(k, v)
Output:
AV Engine Name | Detected Virus | Result
Bkav | false |
Lionic | true | Trojan.Generic.3611249
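For readers less familiar with map casts and unnest, the same idea can be sketched in Python: parse the document into a map, then emit one row per key without knowing the key names in advance. `unnest_engines` is a hypothetical helper name, and unlike json_extract_scalar, which returns text, this keeps the native JSON types.

```python
import json

def unnest_engines(json_str):
    """Return one (engine name, detected, result) tuple per top-level key."""
    engines = json.loads(json_str)  # plays the role of the cast to map(varchar, json)
    return [(name, info.get("detected"), info.get("result"))
            for name, info in engines.items()]

sample = ('{"Bkav":{"detected": false,"result": null},'
          '"Lionic":{"detected": true,"result": "Trojan.Generic.3611249"}}')
rows = unnest_engines(sample)
for row in rows:
    print(row)
```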

The Presto query suggested by @Guru works, but for Hive there is no easy way.
I had to extract the JSON,
strip some characters and brackets from it with regexp_replace,
then convert it back to a map, and repeat the process once more to get the nested values out:
SELECT
av_engine,
str_to_map(regexp_replace(engine_result, '\\}', ''),',', ':') AS output_map
FROM (
SELECT
str_to_map(regexp_replace(regexp_replace(get_json_object(raw_response, '$.scans'), '\"', ''), '\\{',''),'\\},', ':') AS key_val_map
FROM restricted_antispam.abuse_malware_scanning
) AS S
LATERAL VIEW EXPLODE(key_val_map) temp AS av_engine, engine_result

How To Query an array of JSONB

I have a table (orders) with a jsonb[] column named steps in a Postgres database.
I need to create an SQL query that selects records where Step1, Step2, and Step3 all have a success status:
[
{
"step_name"=>"Step1",
"status"=>"success",
"timestamp"=>1636120240
},
{
"step_name"=>"Step2",
"status"=>"success",
"timestamp"=>1636120275
},
{
"step_name"=>"Step3",
"status"=>"success",
"timestamp"=>1636120279
},
{
"step_name"=>"Step4",
"timestamp"=>1636120236
"status"=>"success"
}
]
table structure
id | name | steps (jsonb)

'Normalize' steps into a list of JSON items and check whether every one of them has "status":"success". BTW your example is not valid JSON. All => need to be replaced with : and a comma is missing.
select id, name from orders
where
(
select bool_and(j->>'status' = 'success')
from jsonb_array_elements(steps) j
where j->>'step_name' in ('Step1','Step2','Step3') -- if not all steps but only these are needed
);
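The intent of that check can be restated in Python. Note that this sketch is a stricter variant, not a translation: it also requires each named step to be present, whereas the bool_and query above only tests the named steps that actually exist in the array. `named_steps_succeeded` is a hypothetical name.

```python
def named_steps_succeeded(steps, wanted=("Step1", "Step2", "Step3")):
    """True only if every wanted step is present with status 'success'."""
    status_by_name = {s.get("step_name"): s.get("status") for s in steps}
    return all(status_by_name.get(name) == "success" for name in wanted)

steps = [
    {"step_name": "Step1", "status": "success", "timestamp": 1636120240},
    {"step_name": "Step2", "status": "success", "timestamp": 1636120275},
    {"step_name": "Step3", "status": "success", "timestamp": 1636120279},
    {"step_name": "Step4", "timestamp": 1636120236, "status": "success"},
]
print(named_steps_succeeded(steps))
```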

You can use the JSONB containment operator (@>) to check whether the required step/status pairs are all present:
select
*
from
test
where
steps @> '[{"step_name":"Step1","status":"success"},{"step_name":"Step2","status":"success"},{"step_name":"Step3","status":"success"}]'
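To show why a containment test answers the question, here is a simplified Python model of how Postgres's `@>` operator treats objects and arrays: an object contains another if it has every key/value pair of the other, and an array contains another if each element of the other is contained in some element of it. `jsonb_contains` is a hypothetical helper; it compares scalars with plain Python equality and leaves out the special case where an array contains a bare scalar.

```python
def jsonb_contains(left, right):
    """Simplified model of JSONB containment (left @> right)."""
    if isinstance(right, dict):
        return isinstance(left, dict) and all(
            k in left and jsonb_contains(left[k], v) for k, v in right.items())
    if isinstance(right, list):
        # Order and duplicates do not matter: each wanted element needs some match.
        return isinstance(left, list) and all(
            any(jsonb_contains(l, r) for l in left) for r in right)
    return left == right

steps = [
    {"step_name": "Step1", "status": "success", "timestamp": 1636120240},
    {"step_name": "Step2", "status": "success", "timestamp": 1636120275},
    {"step_name": "Step3", "status": "success", "timestamp": 1636120279},
]
pattern = [{"step_name": name, "status": "success"}
           for name in ("Step1", "Step2", "Step3")]
print(jsonb_contains(steps, pattern))
```

Because each pattern object must match a single array element, a step name and its status are checked together, and a missing step makes the test fail.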

Using Athena to get terminatingrule from rulegrouplist in AWS WAF logs

I followed these instructions to get my AWS WAF data into an Athena table.
I would like to query the data to find the latest requests with an action of BLOCK. This query works:
SELECT
from_unixtime(timestamp / 1000e0) AS date,
action,
httprequest.clientip AS ip,
httprequest.uri AS request,
httprequest.country as country,
terminatingruleid,
rulegrouplist
FROM waf_logs
WHERE action='BLOCK'
ORDER BY date DESC
LIMIT 100;
My issue is cleanly identifying the "terminatingrule" - the reason the request was blocked. As an example, a result has
terminatingrule = AWS-AWSManagedRulesCommonRuleSet
And
rulegrouplist = [
{
"nonterminatingmatchingrules": [],
"rulegroupid": "AWS#AWSManagedRulesAmazonIpReputationList",
"terminatingrule": "null",
"excludedrules": "null"
},
{
"nonterminatingmatchingrules": [],
"rulegroupid": "AWS#AWSManagedRulesKnownBadInputsRuleSet",
"terminatingrule": "null",
"excludedrules": "null"
},
{
"nonterminatingmatchingrules": [],
"rulegroupid": "AWS#AWSManagedRulesLinuxRuleSet",
"terminatingrule": "null",
"excludedrules": "null"
},
{
"nonterminatingmatchingrules": [],
"rulegroupid": "AWS#AWSManagedRulesCommonRuleSet",
"terminatingrule": {
"rulematchdetails": "null",
"action": "BLOCK",
"ruleid": "NoUserAgent_HEADER"
},
"excludedrules":"null"
}
]
The piece of data I would like separated into a column is rulegrouplist[terminatingrule].ruleid which has a value of NoUserAgent_HEADER
AWS provide useful information on querying nested Athena arrays, but I have been unable to get the result I want.
I have framed this as an AWS question but since Athena uses SQL queries, it's likely that anyone with good SQL skills could work this out.

It's not entirely clear to me exactly what you want, but I'm going to assume you are after the array element where terminatingrule is not "null" (I will also assume that if there are multiple you want the first).
The documentation you link to says that the type of the rulegrouplist column is array<string>. The reason it is a string and not a complex type is that there seem to be multiple different schemas for this column. For example, the terminatingrule property is either the string "null" or a struct/object, which is something that can't be described using Athena's type system.
This is not a problem, however. When dealing with JSON there's a whole set of JSON functions that can be used. Here's one way to use json_extract combined with filter and element_at to remove array elements where the terminatingrule property is the string "null" and then pick the first of the remaining elements:
SELECT
element_at(
filter(
rulegrouplist,
rulegroup -> json_extract(rulegroup, '$.terminatingrule') <> CAST('null' AS JSON)
),
1
) AS first_non_null_terminatingrule
FROM waf_logs
WHERE action = 'BLOCK'
ORDER BY date DESC
You say you want the "latest", which to me is ambiguous and could mean both first non-null and last non-null element. The query above will return the first non-null element, and if you want the last you can change the second argument to element_at to -1 (Athena's array indexing starts from 1, and -1 is counting from the end).
To return the individual ruleid element of the json:
SELECT
from_unixtime(timestamp / 1000e0) AS date,
action,
httprequest.clientip AS ip,
httprequest.uri AS request,
httprequest.country as country,
terminatingruleid,
json_extract(
element_at(
filter(
rulegrouplist,
rulegroup -> json_extract(rulegroup, '$.terminatingrule') <> CAST('null' AS JSON)
),
1
),
'$.terminatingrule.ruleid'
) AS ruleid
FROM waf_logs
WHERE action='BLOCK'
ORDER BY date DESC
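The filter-then-pick-first logic can be followed step by step in Python. This sketch assumes, as the answer does, that rulegrouplist arrives as a list of JSON strings; `first_terminating_rule_id` is a hypothetical name, and it treats any terminatingrule that is not an object (the string "null" or a real null) as absent.

```python
import json

def first_terminating_rule_id(rulegrouplist):
    """Return the ruleid of the first rule group whose terminatingrule is an object."""
    for raw in rulegrouplist:
        rule = json.loads(raw).get("terminatingrule")
        if isinstance(rule, dict):  # skips the "null" placeholder string
            return rule.get("ruleid")
    return None  # no rule group terminated the request

rulegrouplist = [
    json.dumps({"nonterminatingmatchingrules": [],
                "rulegroupid": "AWS#AWSManagedRulesAmazonIpReputationList",
                "terminatingrule": "null",
                "excludedrules": "null"}),
    json.dumps({"nonterminatingmatchingrules": [],
                "rulegroupid": "AWS#AWSManagedRulesCommonRuleSet",
                "terminatingrule": {"rulematchdetails": "null",
                                    "action": "BLOCK",
                                    "ruleid": "NoUserAgent_HEADER"},
                "excludedrules": "null"}),
]
print(first_terminating_rule_id(rulegrouplist))
```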

I had the same issue but the solution posted by Theo didn't work for me, even though the table was created according to the instructions linked to in the original post.
Here is what worked for me, which is basically the same as Theo's solution, but without the json conversion:
SELECT
from_unixtime(timestamp / 1000e0) AS date,
action,
httprequest.clientip AS ip,
httprequest.uri AS request,
httprequest.country as country,
terminatingruleid,
rulegrouplist,
element_at(filter(ruleGroupList, ruleGroup -> ruleGroup.terminatingRule IS NOT NULL),1).terminatingRule.ruleId AS ruleId
FROM waf_logs
WHERE action='BLOCK'
ORDER BY date DESC
LIMIT 100;

Using Postgres JSON Functions on table columns

I have searched extensively (in Postgres docs and on Google and SO) to find examples of JSON functions being used on actual JSON columns in a table.
Here's my problem: I am trying to extract key values from an array of JSON objects in a column, using jsonb_to_recordset(), but get syntax errors. When I pass the object literally to the function, it works fine:
Passing JSON literally:
select *
from jsonb_to_recordset('[
{ "id": 0, "name": "400MB-PDF.pdf", "extension": ".pdf",
"transferId": "ap31fcoqcajjuqml6rng"},
{ "id": 0, "name": "1000MB-PDF.pdf", "extension": ".pdf",
"transferId": "ap31fcoqcajjuqml6rng"}
]') as f(name text);
results in:
400MB-PDF.pdf
1000MB-PDF.pdf
It extracts the value of the key "name".
Here's the JSON in the column, being extracted using:
select journal.data::jsonb#>>'{context,data,files}'
from journal
where id = 'ap32bbofopvo7pjgo07g';
resulting in:
[ { "id": 0, "name": "400MB-PDF.pdf", "extension": ".pdf",
"transferId": "ap31fcoqcajjuqml6rng"},
{ "id": 0, "name": "1000MB-PDF.pdf", "extension": ".pdf",
"transferId": "ap31fcoqcajjuqml6rng"}
]
But when I try to pass jsonb#>>'{context,data,files}' to jsonb_to_recordset() like this:
select id,
journal.data::jsonb#>>::jsonb_to_recordset('{context,data,files}') as f(name text)
from journal
where id = 'ap32bbofopvo7pjgo07g';
I get a syntax error. I have tried different ways, but each time it complains about a syntax error.
Version:
PostgreSQL 9.4.10 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit

The expressions after select must evaluate to a single value. Since jsonb_to_recordset returns a set of rows and columns, you can't use it there.
The solution is a cross join lateral, which allows you to expand one row into multiple rows using a function. That gives you single rows that select can act on. For example:
select *
from journal j
cross join lateral
jsonb_to_recordset(j.data#>'{context, data, files}') as d(id int, name text)
where j.id = 'ap32bbofopvo7pjgo07g'
Note that the #>> operator returns type text, and the #> operator returns type jsonb. As jsonb_to_recordset expects jsonb as its first parameter I'm using #>.
See it working at rextester.com

jsonb_to_recordset is a set-valued function and can only be invoked in specific places. The FROM clause is one such place, which is why your first example works, but the SELECT clause is not.
In order to turn your JSON array into a "table" that you can query, you need to use a lateral join. The effect is rather like a foreach loop on the source recordset, and that's where you apply the jsonb_to_recordset function. Here's a sample dataset:
create table jstuff (id int, val jsonb);
insert into jstuff
values
(1, '[{"outer": {"inner": "a"}}, {"outer": {"inner": "b"}}]'),
(2, '[{"outer": {"inner": "c"}}]');
A simple lateral join query:
select id, r.*
from jstuff
join lateral jsonb_to_recordset(val) as r("outer" jsonb) on true;
id | outer
----+----------------
1 | {"inner": "a"}
1 | {"inner": "b"}
2 | {"inner": "c"}
(3 rows)
That's the hard part. Note that you have to define what your new recordset looks like in the AS clause -- since each element in our val array is a JSON object with a single field named "outer", that's what we give it. If your array elements contain multiple fields you're interested in, you declare those in a similar manner. Be aware also that your JSON schema needs to be consistent: if an array element doesn't contain a key named "outer", the resulting value will be null.
From here, you just need to pull the specific value you need out of each JSON object using the traversal operator as you were. If I wanted only the "inner" value from the sample dataset, I would specify select id, r.outer->>'inner'. Since it's already JSONB, it doesn't require casting.
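The "foreach loop" description of a lateral join can be made literal in Python. This sketch reuses the jstuff sample rows above; `lateral_recordset` is a hypothetical helper that, for each source row, expands the JSON array and emits one output row per element, keeping only the declared fields.

```python
import json

def lateral_recordset(rows, fields):
    """For each (id, json array) row, emit (id, field values...) per array element."""
    out = []
    for row_id, raw in rows:
        for element in json.loads(raw):
            # A key missing from an element yields None, like NULL in the SQL result.
            out.append((row_id, *(element.get(f) for f in fields)))
    return out

jstuff = [
    (1, '[{"outer": {"inner": "a"}}, {"outer": {"inner": "b"}}]'),
    (2, '[{"outer": {"inner": "c"}}]'),
]

expanded = lateral_recordset(jstuff, ["outer"])
# Equivalent of select id, r.outer->>'inner'
inner_values = [(row_id, outer["inner"]) for row_id, outer in expanded]
print(inner_values)
```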
