How to retrieve the list of dynamic nested keys of BigQuery nested records - google-bigquery

My ELT tools imports my data in bigquery and generates/extends automatically the schema for dynamic nested keys (in the schema below, under properties)
It looks like this
How can I get the list of nested keys of a repeated record ? so for example I can group by properties when those items have said property non-null ?
I have tried
select column_name
from my_schema.INFORMATION_SCHEMA.COLUMNS
where
table_name = 'my_table
But it will only list first level keys
From the picture above, I want, as a first step, a SQL query that returns
message
user_id
seeker
liker_id
rateable_id
rateable_type
from_organization
likeable_type
company
existing_attempt
...
My real goal through, is to group/count my data based on a non-null value of a 2nd level nested properties properties.filters.[filter_type]
The schema may evolve when our application adds more filters, so this need to be dynamically generated, I can't just hard-code the list of nested keys.
Note: this is very similar to this question How to extract all the keys in a JSON object with BigQuery but in my case my data is already in a shcema and it's not a JSON object
EDIT:
Suppose I have a list of such records with nested properties, how do I write a SQL query that adds a field "enabled_filters" which aggregates, for each item, the list of properties for wihch said property is not null ?
Example input (properties.x are dynamic and not known by the programmer)
search_id
properties.filters.school
properties.filters.type
1
MIT
master
2
Princetown
null
3
null
master
Example output
search_id
enabled_filters
1
["school", "type"]
2
["school"]
3
["type"]

Have you looked at COLUMN_FIELD_PATHS? It should give you the paths for all columns.
select field_path from my_schema.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS where table_name = '<table>'
[https://cloud.google.com/bigquery/docs/information-schema-column-field-paths]

The field properties is not nested by array only by structures. Then a UDF in JavaScript to parse thise field should work fast enough.
CREATE TEMP FUNCTION jsonObjectKeys(input STRING, shownull BOOL,fullname Bool)
RETURNS Array<String>
LANGUAGE js AS """
function test(input,old){
var out=[]
for(let x in input){
let te=input[x];
out=out.concat(te==null ? (shownull?[x+'==null']:[]) : typeof te=='object' ? test(te,old+x+'.') : [fullname ? old+x : x] );
}
return out;
Object.keys(JSON.parse(input));
}
return test(JSON.parse(input),"");
""";
with tbl as (select struct(1 as alpha,struct(2 as x, 3 as y,[1,2,3] as z ) as B) A from unnest(generate_array(1,10*1))
union all select struct(null,struct(null,1,[999])) )
select *,
TO_JSON_STRING (A ) as string_output,
jsonObjectKeys(TO_JSON_STRING (A),true,false) as output1,
jsonObjectKeys(TO_JSON_STRING (A),false,true) as output2,
concat('["', array_to_string(jsonObjectKeys(TO_JSON_STRING (A),false,true),'","' ) ,'"]') as output_sring,
jsonObjectKeys(TO_JSON_STRING (A.B),false,true) as outpu
from tbl

Related

Filter based on 2 JSON properties after using json_extract

Imported database tables :
id | JSON
-------------|---------
Signed 32int | Raw JSON
It is easier to search via the properties of the JSON data than by id of the row itself. Each piece of JSON data contains (for this demo):
json: {
displayProperties: {},
hash: "foo"
itemType: "bar"
}
When I select I would like to matching hash, and then filter those results by a matching itemType.
My query :
SELECT json_extract(ItemDefinition.json, '$')
FROM ItemDefinition, json_tree(ItemDefinition.json, '$')
WHERE json_tree.key = 'hash' AND json_tree.value IN ${hashList}
However this returns every item that has a matching hash value. From here, I would like to also filter by key: itemType and value: "19". So I tried :
SELECT json_extract(ItemDefinition.json, '$')
FROM ItemDefinition, json_tree(ItemDefinition.json, '$')
WHERE json_tree.key = 'hash' AND json_tree.value IN ${hashList}
AND WHERE json_tree.key = 'itemType' AND json_tree.value = 19
But this isn't syntactically correct, let alone output what I am looking for. Error:
SQLITE_ERROR: near "WHERE": syntax error
The title of the question turned out to not be accurate to what I was looking for. I miss-understood what json_tree actually did. json_tree actually builds a new object with values that are filled in by the database.
What I was actually looking for was to filter by a specific value in the json column, which can be achieved by json_extract. json_extract('{column}', $.{filterValue}) will pull the raw json object out of the json column
This is the query that is working for me now:
SELECT json_extract(ItemDefinition.json, '$')
FROM ItemDefinition, json_tree(ItemDefinition.json, '$')
WHERE json_tree.key = 'hash'
AND json_tree.value IN ${hashList}
AND json_extract(ItemDefinition.json, '$.itemType') = 19
This selects the json column from ItemDefinition
Creates a json_tree from the json column
Filters results by json tree key and value
Finally filters by the property itemType from the raw json column

how to dynamically build select list from a API payload using PyPika

I have a JSON API payload containing tablename, columnlist - how to build a SELECT query from it using pypika?
So far I have been able to use a string columnlist, but not able to do advanced querying using functions, analytics etc.
from pypika import Table, Query, functions as fn
def generate_sql (tablename, collist):
table = Table(tablename)
columns = [str(table)+'.'+each for each in collist]
q = Query.from_(table).select(*columns)
return q.get_sql(quote_char=None)
tablename = 'customers'
collist = ['id', 'fname', 'fn.Sum(revenue)']
print (generate_sql(tablename, collist)) #1
table = Table(tablename)
q = Query.from_(table).select(table.id, table.fname, fn.Sum(table.revenue))
print (q.get_sql(quote_char=None)) #2
#1 outputs
SELECT "customers".id,"customers".fname,"customers".fn.Sum(revenue) FROM customers
#2 outputs correctly
SELECT id,fname,SUM(revenue) FROM customers
You should not be trying to assemble the query in a string by yourself, that defeats the whole purpose of pypika.
What you can do in your case, that you have the name of the table and the columns coming as texts in a json object, you can use * to unpack those values from the collist and use the syntax obj[key] to get the table attribute with by name with a string.
q = Query.from_(table).select(*(table[col] for col in collist))
# SELECT id,fname,fn.Sum(revenue) FROM customers
Hmm... that doesn't quite work for the fn.Sum(revenue). The goal is to get SUM(revenue).
This can get much more complicated from this point. If you are only sending column names that you know to belong to that table, the above solution is enough.
But if you have complex sql expressions, making reference to sql functions or even different tables, I suggest you to rethink your decision of sending that as json. You might end up with something as complex as pypika itself, like a custom parser or wathever. than your better option here would be to change the format of your json response object.
If you know you only need to support a very limited set of capabilities, it could be feasible. For example, you can assume the following constraints:
all column names refer to only one table, no joins or alias
all functions will be prefixed by fn.
no fancy stuff like window functions, distinct, count(*)...
Then you can do something like:
from pypika import Table, Query, functions as fn
import re
tablename = 'customers'
collist = ['id', 'fname', 'fn.Sum(revenue / 2)', 'revenue % fn.Count(id)']
def parsed(cols):
pattern = r'(?:\bfn\.[a-zA-Z]\w*)|([a-zA-Z]\w*)'
subst = lambda m: f"{'' if m.group().startswith('fn.') else 'table.'}{m.group()}"
yield from (re.sub(pattern, subst, col) for col in cols)
table = Table(tablename)
env = dict(table=table, fn=fn)
q = Query.from_(table).select(*(eval(col, env) for col in parsed(collist)))
print (q.get_sql(quote_char=None)) #2
Output:
SELECT id,fname,SUM(revenue/2),MOD(revenue,COUNT(id)) FROM customers

Postgresql - pick up field from object array to text array

how can I pick up all id field '{"se":[{"id":"123"}, {"id":"456"}]}' and get ["123", "456"]
I tried the SQL below, but it not work, the json path always need a index.
select '{"se":[{"id":"123"}, {"id":"456"}]}'::JSONB #> '{se, id}'
only could get the first one as text
select '{"se":[{"id":"123"}, {"id":"456"}]}'::JSONB #> '{se, 0, id}'
That should be done in few separate steps:
First take out the 'se' object
then expand the array items to separate json objects
finally find the value of the id key.
If you need those ids to be a list again then wrap the results with a jsonb_agg function.
SELECT
jsonb_agg(id) id_list
FROM
(SELECT jsonb_array_elements('{"se":[{"id":"123"}, {"id":"456"}]}'::jsonb #> '{se}') -> 'id' AS id) ids
;

Get all entries for a specific json tag only in postgresql

I have a database with a json field which has multiple parts including one called tags, there are other entries as below but I want to return only the fields with "{"tags":{"+good":true}}".
"{"tags":{"+good":true}}"
"{"has_temps":false,"tags":{"+good":true}}"
"{"tags":{"+good":true}}"
"{"has_temps":false,"too_long":true,"too_long_as_of":"2016-02-12T12:28:28.238+00:00","tags":{"+good":true}}"
I can get part of the way there with this statement in my where clause trips.metadata->'tags'->>'+good' = 'true' but that returns all instances where tags are good and true including all entries above. I want to return entries with the specific statement "{"tags":{"+good":true}}" only. So taking out the two entries that begin has_temps.
Any thoughts on how to do this?
With jsonb column the solution is obvious:
with trips(metadata) as (
values
('{"tags":{"+good":true}}'::jsonb),
('{"has_temps":false,"tags":{"+good":true}}'),
('{"tags":{"+good":true}}'),
('{"has_temps":false,"too_long":true,"too_long_as_of":"2016-02-12T12:28:28.238+00:00","tags":{"+good":true}}')
)
select *
from trips
where metadata = '{"tags":{"+good":true}}';
metadata
-------------------------
{"tags":{"+good":true}}
{"tags":{"+good":true}}
(2 rows)
If the column's type is json then you should cast it to jsonb:
...
where metadata::jsonb = '{"tags":{"+good":true}}';
If I get you right, you can check text value of the "tags" key, like here:
select true
where '{"has_temps":false,"too_long":true,"too_long_as_of":"2016-02-12T12:28:28.238+00:00","tags":{"+good":true}}'::json->>'tags'
= '{"+good":true}'

CDE: multiple select

I am trying to create a dashboard that consist from two components: CCC Bar chart and Multiple select component.
I use the multiply select component for assignment param value, which is using in the datasource. (MDX query):
SELECT
NON EMPTY {[Measures].[doc_count]} ON COLUMNS,
NON EMPTY {[Dimension Usage date_publish.Hierarchy date_publish].[date_publish].Members} ON ROWS
FROM [Docs]
WHERE CrossJoin({${param_hosts}}, {[event].[active]})
So if I set (multiply select component) property value array with pairs:
( {arg:[host].[news.com] value:news.com}, {{arg:[host].[somesite.com] value:somesite.com}} ), everything work perfect.
The parameter is bound to the component receives the correct value, for example: [host]. [News.com], [host]. [Somesite.com].
But if I try to fill the multiply select component from the datasource it become unworkable.
As DataSource, I use the sql over sqlJndi with query: SELECT distinct (host) as Id, concat ('[host]. [', Host, ']') as Value FROM docs_fact where dim_event_id = 1;
The result this query is a table:
id value
news.com | [host].[news.com]
somesite.com | [host].[somesite.com]
Parameter is assigned a value: news.com, somesite.com
Changing the properties of the Value as id only affects which of the fields (id or value) will be shown to the user, and the parameter's value is not affected.
Tell me please, is it possible to specify which of the columns to be used for display to the user and which of the columns to be used to generate results?
No, but you can change the dataset on the clientside using the postFetch function on the multi-select component.
function (dataset) {
for (var i=0; i < dataset.resultset.length; i++) {
var temp = dataset.resultset[i][0];
dataset.resultset[i][0] = dataset.resultset[i][1];
dataset.resultset[i][1] = temp;
}
return dataset;
}
Or similar