I can't query repeated fields in a Google BigQuery table


I've got a table in Google BigQuery which has repeated records in it. I followed the guide at https://cloud.google.com/bigquery/docs/nested-repeated to create the table, and I populated it with some test data using
INSERT INTO `<project>.<dataset>.<table>` (<list of fields, ending with repeated record name>)
VALUES
(
"string1", false, 200.0, "string2", 0.2, 2.345, false, "2020-01-02 12:34:56",
[
("repeated field str1", CAST(2.01 AS FLOAT64), CAST(201 as NUMERIC), false),
("repeated field str2", CAST(4.01 AS FLOAT64), CAST(702 as NUMERIC), true)
]
);
(etc)
The table is populated successfully, and I can query the data with
select * from <dataset>.<table>
and all fields, repeated and non-repeated, are returned.
I can also successfully query the non-repeated fields from the table, as long as no repeated fields are specified in the query.
However, when I want to include specific repeated fields in the query (I'm following the guide at https://cloud.google.com/bigquery/docs/legacy-nested-repeated), for example
SELECT normalfield1, normalfield2, normalfield3,
repeatedData.field1, repeatedData.field2, repeatedData.field3
FROM `profile_dataset.profile_betdatamultiples`;
I get the error
Cannot access field <field name> on a value with type ARRAY<STRUCT<fieldname1 STRING, fieldname2 FLOAT64, fieldname3 NUMERIC, ...>> at [8:14]
(annoyingly GCP truncates the error message so I can't see all of it)
Are there any suggestions for how to proceed here?
Thanks!

Below is for BigQuery Standard SQL
#standardSQL
SELECT normalfield1, normalfield2, normalfield3,
data.field1, data.field2, data.field3
FROM `project.profile_dataset.profile_betdatamultiples`,
UNNEST(repeatedData) data
Applied to the sample data in your question, this returns one output row per element of repeatedData, with the non-repeated fields repeated on each row.
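To make the effect of the UNNEST join concrete, here is a minimal Python sketch (not BigQuery code) of what the comma join with UNNEST does to a row. The column names and values are illustrative assumptions loosely based on the question, and `unnest` is a hypothetical helper, not a library function.

```python
# Hypothetical rows mirroring the question's table: two scalar columns plus
# one repeated record column (names are illustrative assumptions).
rows = [
    {
        "normalfield1": "string1",
        "normalfield2": False,
        "repeatedData": [
            {"field1": "repeated field str1", "field2": 2.01},
            {"field1": "repeated field str2", "field2": 4.01},
        ],
    },
]


def unnest(rows, array_column):
    """Yield one flat row per array element, like `FROM t, UNNEST(t.col) AS data`."""
    for row in rows:
        scalars = {k: v for k, v in row.items() if k != array_column}
        for element in row[array_column]:
            # Parent columns are repeated next to each element's fields.
            yield {**scalars, **element}


flat = list(unnest(rows, "repeatedData"))
for r in flat:
    print(r)
```

In this model a row whose array is empty contributes no output rows, which is also how the comma (cross) join with UNNEST behaves.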

I recreated the table with this code:
CREATE TABLE `temp.experiment` AS
SELECT "string1" s1, false b, 200.0 i1, "string2" s2, 0.2 f1, 2.345 f2, false b2, TIMESTAMP("2020-01-02 12:34:56") t1,
[
STRUCT ("repeated field str1" AS s1, CAST(2.01 AS FLOAT64) AS f2, CAST(201 as NUMERIC) AS n1, false AS b),
STRUCT ("repeated field str2", CAST(4.01 AS FLOAT64), CAST(702 as NUMERIC), true)
] AS b1
Now I can query particular nested rows like this:
SELECT s1, b, s2
, b1[OFFSET(0)].s1 AS arr_s1, b1[OFFSET(0)].f2, b1[OFFSET(0)].n1
FROM `temp.experiment`
You might want to UNNEST instead of [OFFSET(0)], but the question doesn't say what results you are expecting.

Related

Json Arrays of objects PostgreSQL Table format

I have a JSON file (array of objects) which I have to convert into a table format using a PostgreSQL query.
The sample data is shown below.
"b", "c", "d", and "e" are to be extracted as separate tables, since each of them is an array of objects.
I have tried using json_populate_recordset(), but it only works if I have a single array:
[{a:"1",b:"2"},{a:"10",b:"20"}]
I have referred to a jsonb_array_element example and the PostgreSQL functions documentation.
Sample Data:
{
"b":[
{columnB1:value, columnB2:value},
{columnB1:value, columnB2:value},
],
"c":[
{columnC1:value, columnC2:value, columnC3:value},
{columnC1:value, columnC2:value, columnC3:value},
{columnC1:value, columnC2:value, columnC3:value}
],
"d":[
{columnD1:value, columnD2:value},
{columnD1:value, columnD2:value},
],
"e":[
{columnE1:value, columnE2:value},
]
}
Expected output:
b should be one table in which columnB1 and columnB2 are displayed with their values.
Similarly for tables c, d, and e, with their respective columns and values.

You can use jsonb_to_recordset(), but you first need to unnest your JSON. You need to do this inline, as this is a JSON processing function which cannot use derived values.
I am using validated JSON, simplified and formatted at the end of this answer.
To unnest your JSON, use the notation below, which extracts the JSON object field with the given key.
--one level
select '{"a":1}'::json->'a'
result : 1
--two levels
select '{"a":{"b":[2]}}'::json->'a'->'b'
result : [2]
We now expand this to include json_to_recordset()
select * from
json_to_recordset(
'{"a":{"b":[{"f1":2,"f2":4},{"f1":3,"f2":6}]}}'::json->'a'->'b' --inner table b
)
as x("f1" int, "f2" int); --fields from table b
Or use json_array_elements. Either way we need to list our fields. With the second solution the type will be json, not int, so you can't sum, etc.
with b as (select json_array_elements('{"a":{"b":[{"f1":2,"f2":4},{"f1":3,"f2":6}]}}'::json->'a'->'b') as jx)
select jx->'f1' as f1, jx->'f2' as f2 from b;
Output
f1 f2
2 4
3 6
We now use your data structure in jsonb_to_recordset()
select * from jsonb_to_recordset( '{"a":{"b":[{"columnname1b":"value1b","columnname2b":"value2b"},{"columnname1b":"value","columnname2b":"value"}],"c":[{"columnname1":"value","columnname2":"value"},{"columnname1":"value","columnname2":"value"},{"columnname1":"value","columnname2":"value"}]}}'::jsonb->'a'->'b') as x(columnname1b text, columnname2b text);
Output:
columnname1b columnname2b
value1b value2b
value value
For table c
select * from jsonb_to_recordset( '{"a":{"b":[{"columnname1b":"value1b","columnname2b":"value2b"},{"columnname1b":"value","columnname2b":"value"}],"c":[{"columnname1":"value","columnname2":"value"},{"columnname1":"value","columnname2":"value"},{"columnname1":"value","columnname2":"value"}]}}'::jsonb->'a'->'c') as x(columnname1 text, columnname2 text);
Output
columnname1 columnname2
value value
value value
value value
Sample JSON
{
"a": {
"b": [
{
"columnname1b": "value1b",
"columnname2b": "value2b"
},
{
"columnname1b": "value",
"columnname2b": "value"
}
],
"c": [
{
"columnname1": "value",
"columnname2": "value"
},
{
"columnname1": "value",
"columnname2": "value"
},
{
"columnname1": "value",
"columnname2": "value"
}
]
}
}
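As a rough illustration of what the path navigation plus the recordset expansion above amounts to, here is a small Python sketch using only the standard `json` module. It is an analogy, not PostgreSQL code; `to_recordset` is a hypothetical helper, and the input reuses the simplified f1/f2 document from earlier in this answer.

```python
import json

doc = json.loads("""
{"a": {"b": [{"f1": 2, "f2": 4}, {"f1": 3, "f2": 6}]}}
""")


def to_recordset(array_of_objects, columns):
    """Turn a list of JSON objects into row tuples for the named columns.

    Keys missing from an object become None in this sketch.
    """
    return [tuple(obj.get(col) for col in columns) for obj in array_of_objects]


inner = doc["a"]["b"]  # same path as ::json->'a'->'b'
table_b = to_recordset(inner, ["f1", "f2"])
print(table_b)
```

As in the SQL version, the column list has to be supplied explicitly; the function does not infer it from the data.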

Well, I came up with some ideas, here is one that worked. I was able to get one table at a time.
https://www.postgresql.org/docs/9.5/functions-json.html
I am using json_populate_recordset.
The column used in the first select statement comes from a table whose column is of JSON type; that is the data we are trying to extract into a table.
The 'tablename from column' key in the json_populate_recordset function is the table we are trying to extract, followed by the alias b with its columns and datatypes.
WITH input AS(
SELECT cast(column as json) as a
FROM tablename
)
SELECT b.*
FROM input c,
json_populate_recordset(NULL::record,c.a->'tablename from column') as b(columnname1 datatype, columnname2 datatype)

Extract complex json with random key field

I am trying to extract the following JSON into its own rows like the table below in Presto query. The issue here is the name of the key/av engine name is different for each row, and I am stuck on how I can extract and iterate on the keys without knowing the value of the key.
The json is a value of a table row
{
"Bkav":
{
"detected": false,
"result": null
},
"Lionic":
{
"detected": true,
"result": "Trojan.Generic.3611249"
},
...
AV Engine Name | Detected Virus | Result
Bkav | false | null
Lionic | true | Trojan.Generic.3611249
I have tried to use json_extract following the documentation here https://teradata.github.io/presto/docs/141t/functions/json.html but there is no mention of extraction if we don't know the key :( I am trying to find a solution that works in both presto & hive query, is there a common query that is applicable to both?

You can cast your json to map(varchar, json) and process it with unnest to flatten:
-- sample data
WITH dataset (json_str) AS (
VALUES (
'{"Bkav":{"detected": false,"result": null},"Lionic":{"detected": true,"result": "Trojan.Generic.3611249"}}'
)
)
--query
select k "AV Engine Name", json_extract_scalar(v, '$.detected') "Detected Virus", json_extract_scalar(v, '$.result') "Result"
from (
select cast(json_parse(json_str) as map(varchar, json)) as m
from dataset
)
cross join unnest (map_keys(m), map_values(m)) t(k, v)
Output:
AV Engine Name | Detected Virus | Result
Bkav | false |
Lionic | true | Trojan.Generic.3611249
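The idea behind the query (treat the top-level object as a map, then emit one row per key/value pair) can be sketched in plain Python as follows. This is only an analogy to show the shape of the transformation, not Presto code; the variable names are illustrative.

```python
import json

json_str = (
    '{"Bkav":{"detected": false,"result": null},'
    '"Lionic":{"detected": true,"result": "Trojan.Generic.3611249"}}'
)

# Parsing to a dict plays the role of the cast to map(varchar, json).
engines = json.loads(json_str)

# One row per (key, value) pair, like unnest(map_keys(m), map_values(m)).
table = [
    (name, info.get("detected"), info.get("result"))
    for name, info in engines.items()
]

for row in table:
    print(row)
```

The engine names never have to be known in advance, because they are read from the keys of the parsed object.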

The Presto query suggested by @Guru works, but for Hive there is no easy way.
I had to extract the JSON, parse it with regexp_replace to remove some characters and brackets, then convert it back to a map, and repeat this once more to get the nested values out:
SELECT
av_engine,
str_to_map(regexp_replace(engine_result, '\\}', ''),',', ':') AS output_map
FROM (
SELECT
str_to_map(regexp_replace(regexp_replace(get_json_object(raw_response, '$.scans'), '\"', ''), '\\{',''),'\\},', ':') AS key_val_map
FROM restricted_antispam.abuse_malware_scanning
) AS S
LATERAL VIEW EXPLODE(key_val_map) temp AS av_engine, engine_result

Select rows from table with jsonb column based on arbitrary jsonb filter expression

Test data
DROP TABLE t;
CREATE TABLE t(_id serial PRIMARY KEY, data jsonb);
INSERT INTO t(data) VALUES
('{"a":1,"b":2, "c":3}')
, ('{"a":11,"b":12, "c":13}')
, ('{"a":21,"b":22, "c":23}')
Problem statement: I want to receive an arbitrary JSONB parameter which acts as a filter on column t.data, such as
{ "b":{ "from":0, "to":20 }, "c":13 }
and use this to select matching rows from my test table t.
In this example, I want rows where b is between 0 and 20 and c = 13.
No error is required if the filter specifies a "column" (or "tag") which does not exist in t.data - it just fails to find a match.
I've used numeric values for simplicity but would like an approach which generalises to text as well.
What I have tried so far. I looked at the containment approach, which works for equality conditions, but am stumped on a generic way of handling range conditions:
select * from t
where t.data @> '{"c":13}'::jsonb;
Background: This problem arose when building a generic table-preview page on a website (for Admin users).
The page displays a filter based on various columns in whichever table is selected for preview.
The filter is then passed to a function in Postgres DB which applies this dynamic filter condition to the table.
It returns a jsonb array of the rows matching the filter specified by the user.
This jsonb array is then used to populate the Preview resultset.
The columns which make up the filter may change.
My Postgres version is 9.6 - thanks.

If you want to parse { "b":{ "from":0, "to":20 }, "c":13 } you need a parser. That is out of scope for the JSON functions, but you can write a "generic" query using AND and OR to filter by such JSON, e.g.:
https://www.db-fiddle.com/f/jAPBQggG3p7CxqbKLMbPKw/0
with filt(f) as (values('{ "b":{ "from":0, "to":20 }, "c":13 }'::json))
select *
from t
join filt on
(f->'b'->>'from')::int < (data->>'b')::int
and
(f->'b'->>'to')::int > (data->>'b')::int
and
(data->>'c')::int = (f->>'c')::int
;

Thanks for the comments/suggestions.
I will definitely look at GraphQL when I have more time - I'm working under a tight deadline at the moment.
It seems the consensus is that a fully generic solution is not achievable without a parser.
However, I got a workable first draft - it's far from ideal but we can work with it. Any comments/improvements are welcome ...
Test data (expanded to include dates & text fields)
DROP TABLE t;
CREATE TABLE t(_id serial PRIMARY KEY, data jsonb);
INSERT INTO t(data) VALUES
('{"a":1,"b":2, "c":3, "d":"2018-03-10", "e":"2018-03-10", "f":"Blah blah" }')
, ('{"a":11,"b":12, "c":13, "d":"2018-03-14", "e":"2018-03-14", "f":"Howzat!"}')
, ('{"a":21,"b":22, "c":23, "d":"2018-03-14", "e":"2018-03-14", "f":"Blah blah"}')
First draft of code to apply a jsonb filter dynamically, but with restrictions on what syntax is supported.
Also, it just fails silently if the syntax supplied does not match what it expects.
Timestamp handling is a bit kludgy, too.
-- Handle timestamp & text types as well as int
-- See is_timestamp(text) function at bottom
with cte as (
select t.data, f.filt, fk.key
from t
, ( values ('{ "a":11, "b":{ "from":0, "to":20 }, "c":13, "d":"2018-03-14", "e":{ "from":"2018-03-11", "to": "2018-03-14" }, "f":"Howzat!" }'::jsonb ) ) as f(filt) -- equiv to cross join
, lateral (select * from jsonb_each(f.filt)) as fk
)
select data, filt --, key, jsonb_typeof(filt->key), jsonb_typeof(filt->key->'from'), is_timestamp((filt->key)::text), is_timestamp((filt->key->'from')::text)
from cte
where
case when (filt->key->>'from') is null then
case jsonb_typeof(filt->key)
when 'number' then (data->>key)::numeric = (filt->>key)::numeric
when 'string' then
case is_timestamp( (filt->key)::text )
when true then (data->>key)::timestamp = (filt->>key)::timestamp
else (data->>key)::text = (filt->>key)::text
end
when 'boolean' then (data->>key)::boolean = (filt->>key)::boolean
else false
end
else
case jsonb_typeof(filt->key->'from')
when 'number' then (data->>key)::numeric between (filt->key->>'from')::numeric and (filt->key->>'to')::numeric
when 'string' then
case is_timestamp( (filt->key->'from')::text )
when true then (data->>key)::timestamp between (filt->key->>'from')::timestamp and (filt->key->>'to')::timestamp
else (data->>key)::text between (filt->key->>'from')::text and (filt->key->>'to')::text
end
when 'boolean' then false
else false
end
end
group by data, filt
having count(*) = ( select count(distinct key) from cte ) -- must match on all filter elements
;
create or replace function is_timestamp(s text) returns boolean as $$
begin
perform s::timestamp;
return true;
exception when others then
return false;
end;
$$ strict language plpgsql immutable;
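The matching rule that the draft implements (a plain value means equality, an object with "from" and "to" means an inclusive range, and every filter key must match) can be sketched in Python like this. It is a simplified model for reasoning about the logic, not a replacement for the SQL; `matches` is a hypothetical helper, and it assumes the filter values have the same type as the data values they are compared with.

```python
def matches(data, filt):
    """Return True if `data` satisfies every condition in `filt`.

    A plain value means equality; a dict with "from"/"to" means an
    inclusive range, mirroring BETWEEN in the SQL draft.
    """
    for key, cond in filt.items():
        if key not in data:
            return False  # unknown key: no match, no error
        value = data[key]
        if isinstance(cond, dict) and "from" in cond and "to" in cond:
            if not (cond["from"] <= value <= cond["to"]):
                return False
        elif value != cond:
            return False
    return True


rows = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 11, "b": 12, "c": 13},
    {"a": 21, "b": 22, "c": 23},
]
filt = {"b": {"from": 0, "to": 20}, "c": 13}
selected = [r for r in rows if matches(r, filt)]
print(selected)
```

ISO-formatted date strings such as "2018-03-14" compare correctly as plain strings in this sketch, so it needs no separate timestamp branch; the SQL draft casts to timestamp instead.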

Alias nested struct columns

How can I alias field1 as index and field2 as value?
The query gives me an error:
#standardsql
with q1 as (select 1 x, ARRAY<struct<id string, cd ARRAY<STRUCT<index STRING,value STRING>>>>
[struct('h1',[('1','a')
,('2','b')
])
,('h2',[('3','c')
,('4','d')
])] hits
)
Select * from q1
ORDER by x
Error: Array element type STRUCT<STRING, ARRAY<STRUCT<STRING, STRING>>> does not coerce to STRUCT<id STRING, cd ARRAY<STRUCT<index STRING, value STRING>>> at [5:26]
Thanks a lot for your time in responding
Cheers!

#standardsql
WITH q1 AS (
SELECT
1 AS x,
[
STRUCT('h1' AS id, [STRUCT('1' AS index, 'a' AS value), ('2','b')] AS cd),
STRUCT('h2', [STRUCT('3' AS index, 'c' AS value), ('4','d')] AS cd)
] AS hits
)
SELECT *
FROM q1
-- ORDER BY x
Or the version below might be even more "readable":
#standardsql
WITH q1 AS (
SELECT
1 AS x,
[
STRUCT('h1' AS id, [STRUCT<index STRING, value STRING>('1', 'a'), ('2','b')] AS cd),
STRUCT('h2', [STRUCT<index STRING, value STRING>('3', 'c'), ('4','d')] AS cd)
] AS hits
)
SELECT *
FROM q1
-- ORDER BY x

When I try to simulate data in BigQuery using Standard SQL, I usually try to name all variables and aliases everywhere possible. For instance, your data works if you build it like so:
with q1 as (
select 1 x, ARRAY<struct<id string, cd ARRAY<STRUCT<index STRING,value STRING>>>> [struct('h1' as id,[STRUCT('1' as index,'a' as value) ,STRUCT('2' as index ,'b' as value)] as cd), STRUCT('h2',[STRUCT('3' as index,'c' as value) ,STRUCT('4' as index,'d' as value)] as cd)] hits
)
select * from q1
order by x
Notice I've built structs and put aliases inside them in order for this to work. If you remove the aliases and the structs it might not work, although I found this to be rather intermittent; if you fully describe the variables, it works all the time.
Also as a recommendation, I try to build simulated data piece by piece. First I create the struct and test it to see if BigQuery accepts it. After the validator is green, then I proceed to add more values. If you try to build everything at once you might find this a somewhat challenging task.

Querying for a specific value in a String stored in a database field

{"create_channel":"1","update_comm":"1","channels":"*"}
This is the database field which I want to query.
What would my query look like if I wanted to select all the records that have "create_channel": "1" and "update_comm": "1"?
Additional question:
View the field below:
{"create_channel":"0","update_comm":"0","channels":[{"ch_id":"33","news":"1","parties":"1","questions ":"1","cam":"1","edit":"1","view_subs":"1","invite_subs":"1"},{"ch_id":"18","news":"1","parties":"1","questions ":"1","cam":"1","edit":"1","view_subs":"1","invite_subs":"1"}]}
How would I go about finding all those that are subadmins in the news, parties, questions and cam sections?

You can use the ->> operator to return a member as a string:
select *
from YourTable
where YourColumn->>'create_channel' = '1' and
YourColumn->>'update_comm' = '1'
To find a user who has news, parties, questions and cam in channel 33, you can use the @> (containment) operator to check if the channels array contains those properties:
select *
from YourTable
where YourColumn->'channels' @> '[{
"ch_id":"33",
"news":"1",
"parties":"1",
"questions ":"1",
"cam":"1"
}]';
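For intuition about what the containment check in the last query tests, here is a simplified Python model of jsonb containment. It is an approximation written for illustration (`contains` is a hypothetical helper and ignores some edge cases of the real operator), and the sample `channels` value is abbreviated from the question.

```python
def contains(left, right):
    """Simplified model of jsonb containment: does `left` contain `right`?"""
    if isinstance(right, dict):
        # Every key/value pair on the right must be contained on the left.
        return isinstance(left, dict) and all(
            k in left and contains(left[k], v) for k, v in right.items()
        )
    if isinstance(right, list):
        # Every right element must be contained in some left element.
        return isinstance(left, list) and all(
            any(contains(l, r) for l in left) for r in right
        )
    return left == right  # scalars must be equal


channels = [
    {"ch_id": "33", "news": "1", "parties": "1", "questions ": "1", "cam": "1", "edit": "1"},
    {"ch_id": "18", "news": "1", "parties": "1", "questions ": "1", "cam": "1", "edit": "1"},
]
wanted = [{"ch_id": "33", "news": "1", "parties": "1", "questions ": "1", "cam": "1"}]

print(contains(channels, wanted))
```

Extra keys on the left side (such as "edit" here) do not prevent a match; only the keys listed on the right side are checked.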
