BigQuery: Create column of JSON datatype - google-bigquery

I am trying to load json with the following schema into BigQuery:
{
key_a:value_a,
key_b:{
key_c:value_c,
key_d:value_d
}
key_e:{
key_f:value_f,
key_g:value_g
}
}
The keys under key_e are dynamic, ie in one response key_e will contain key_f and key_g and for another response it will instead contain key_h and key_i. New keys can be created at any time so I cannot create a record with nullable fields for all possible keys.
Instead I want to create a column with JSON datatype that can then be queried using the JSON_EXTRACT() function. I have tried loading key_e as a column with STRING datatype but value_e is analysed as JSON and so fails.
How can I load a section of JSON into a single BigQuery column when there is no JSON datatype?

Having your JSON as a single string column inside BigQuery is definitelly an option. If you have large volume of data this can end up with high query price as all your data will end up in one column and actually querying logic can become quite messy.
If you have luxury of slightly changing your "design" - I would recommend considering below one - here you can employ REPEATED mode
Table schema:
[
{ "name": "key_a",
"type": "STRING" },
{ "name": "key_b",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{ "name": "key",
"type": "STRING"},
{ "name": "value",
"type": "STRING"}
]
},
{ "name": "key_e",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{ "name": "key",
"type": "STRING"},
{ "name": "value",
"type": "STRING"}
]
}
]
Example of JSON to load
{"key_a": "value_a1", "key_b": [{"key": "key_c", "value": "value_c"}, {"key": "key_d", "value": "value_d"}], "key_e": [{"key": "key_f", "value": "value_f"}, {"key": "key_g", "value": "value_g"}]}
{"key_a": "value_a2", "key_b": [{"key": "key_x", "value": "value_x"}, {"key": "key_y", "value": "value_y"}], "key_e": [{"key": "key_h", "value": "value_h"}, {"key": "key_i", "value": "value_i"}]}
Please note: it should be newline delimited JSON so each row must be on one line

You can't do this directly with BigQuery, but you can make it work in two passes:
(1) Import your JSON data as a CSV file with a single string column.
(2) Transform each row to pack your "any-type" field into a string. Write a UDF that takes a string and emits the final set of columns you would like. Append the output of this query to your target table.
Example
I'll start with some JSON:
{"a": 0, "b": "zero", "c": { "woodchuck": "a"}}
{"a": 1, "b": "one", "c": { "chipmunk": "b"}}
{"a": 2, "b": "two", "c": { "squirrel": "c"}}
{"a": 3, "b": "three", "c": { "chinchilla": "d"}}
{"a": 4, "b": "four", "c": { "capybara": "e"}}
{"a": 5, "b": "five", "c": { "housemouse": "f"}}
{"a": 6, "b": "six", "c": { "molerat": "g"}}
{"a": 7, "b": "seven", "c": { "marmot": "h"}}
{"a": 8, "b": "eight", "c": { "badger": "i"}}
Import it into BigQuery as a CSV with a single STRING column (I called it 'blob'). I had to set the delimiter character to something arbitrary and unlikely (thorn -- 'รพ') or it tripped over the default ','.
Verify your table imported correctly. You should see your simple one-column schema and the preview should look just like your source file.
Next, we write a query to transform it into your desired shape. For this example, we'd like the following schema:
a (INTEGER)
b (STRING)
c (STRING -- packed JSON)
We can do this with a UDF:
// Map a JSON string column ('blob') => { a (integer), b (string), c (json-string) }
bigquery.defineFunction(
'extractAndRepack', // Name of the function exported to SQL
['blob'], // Names of input columns
[{'name': 'a', 'type': 'integer'}, // Output schema
{'name': 'b', 'type': 'string'},
{'name': 'c', 'type': 'string'}],
function (row, emit) {
var parsed = JSON.parse(row.blob);
var repacked = JSON.stringify(parsed.c);
emit({a: parsed.a, b: parsed.b, c: repacked});
}
);
And a corresponding query:
SELECT a, b, c FROM extractAndRepack(JsonAnyKey.raw)
Now you just need to run the query (selecting your desired target table) and you'll have your data in the form you like.
Row a b c
1 0 zero {"woodchuck":"a"}
2 1 one {"chipmunk":"b"}
3 2 two {"squirrel":"c"}
4 3 three {"chinchilla":"d"}
5 4 four {"capybara":"e"}
6 5 five {"housemouse":"f"}
7 6 six {"molerat":"g"}
8 7 seven {"marmot":"h"}
9 8 eight {"badger":"i"}

One way to do it, is to load this file as CSV instead of JSON (and quote the values or eliminate newlines in the middle), then it will become single STRING column inside BigQuery.
P.S. You are right that having a native JSON data type would have made this scenario much more natural, and BigQuery team is well aware of it.

Related

How to set one schema for n-1 elements and another for the nth element

For example, look at the schema:
"type": "array",
"items": {
"oneOf":[
{"type": "number"},
{"type": "string"}
]
}
}
The schema would then allow a list of numbers and the last element would be a string. and the string can be optional.
For example:
{
"list_of_items" : [ 3, 4, 5, "hello"]
}
is allowed.
{
"list_of_items" : [ 3, "hello" , 4, 5 ]
}
Is not allowed because text is only allowed on the last element.
From reading the documentation. I see how to do a constant amount of numbers and a text element.
Or all numbers,
But I don't see how to create n-1 numbers and text for the nth.
It could be that is is not something json-schema can do.
Thanks.

How to load a jsonl file into BigQuery when the file has mix data fields as columns

During my work flow, after extracting the data from API, the JSON has the following structure:
[
{
"fields":
[
{
"meta": {
"app_type": "ios"
},
"name": "app_id",
"value": 100
},
{
"meta": {},
"name": "country",
"value": "AE"
},
{
"meta": {
"name": "Top"
},
"name": "position",
"value": 1
}
],
"metrics": {
"click": 1,
"price": 1,
"count": 1
}
}
]
Then it is store as .jsonl and put on GCS. However, when I load it onto BigQuery for further extraction, the automatic schema inference return the following error:
Error while reading data, error message: JSON parsing error in row starting at position 0: Could not convert value to string. Field: value; Value: 100
I want to convert it in to the following structure:
app_type
app_id
country
position
click
price
count
ios
100
AE
Top
1
1
1
Is there a way to define manual schema on BigQuery to achieve this result? Or do I have to preprocess the jsonl file before put it to BigQuery?
One of the limitations in loading JSON data from GCS to BigQuery is that it does not support maps or dictionaries in JSON.
A invalid example would be:
"metrics": {
"click": 1,
"price": 1,
"count": 1
}
Your jsonl file should be something like this:
{"app_type":"ios","app_id":"100","country":"AE","position":"Top","click":"1","price":"1","count":"1"}
I already tested it and it works fine.
So wherever you process the conversion of the json files to jsonl files and storage to GCS, you will have to do some preprocessing.
Probably you have to options:
precreate target table with an app_id field as an INTEGER
preprocess jsonfile and enclose 100 into quotes like "100"

How to filter entities with nested arrays with CosmosDB

I have an entity like this:
{
"id": "xxxx",
"attributes": [{
"name": "name-01",
"value": "value-01"
}, {
"name": "name-02",
"value": "value-02"
}
]
}
Our "questions" to data usually: Give me entities with attribute or attribute with particular value;
in SQL it was written like as:
select *
from c
where
and array_contains(c.attributes, { "name": "name-01", "value": "value-01" }, true)
and array_contains(c.attributes, { "name": "name-02", "value": "value-02" }, true)
but I would like to extend a model to allow have suggestion of values in each attribute by transform an entity to:
{
"id": "xxxx",
"attributes": [{
"name": "name_01",
"value": "value-01",
"suggestions": ["a", "b", "c"]
}, {
"name": "name_02",
"value": "value-02",
"suggestions": ["a", "d", "e"]
}
]
}
With that structure I would like to ask: Give me all entities that has specified attribute and value equals to "XYZ" or suggestions array contains "XYZ";
In general scenario if always add value into array of suggestions the ask would be "Give me all entities that has specified attribute and suggestions contains XYZ"
N.B. Also I would like to make queries : Give me all entities that has more ALL specified attributes with constraints per each by suggestions?
Please suggest how to write such queries or rebuild a structure of entities in Cosmos DB;
P.S. We can technically switch from SQL to other protocol to better make such queries;
This should be doable using ARRAY_CONTAINS along with iterating the attributes array.
Give me items with value "value-01" or suggestion "f":
SELECT DISTINCT VALUE(c)
FROM c JOIN attr IN c.attributes
WHERE attr["value"] = "value-01" OR ARRAY_CONTAINS(attr.suggestions, "f")
Give me items with value "value-01" or both suggestions "a" and "f":
SELECT DISTINCT VALUE(c)
FROM c JOIN attr IN c.attributes
WHERE attr["value"] = "value-01" OR
ARRAY_CONTAINS(attr.suggestions, "a") AND ARRAY_CONTAINS(attr.suggestions, "f")

Convert CSV containing nested JSON rows to SQL table

I have a CSV file with several million rows, and want to load it as a PostgreSQL table. One of the rows in the column 'json_doc' as an example contains:
{"id": <>,
"base":
{"ateco":
[
{
"code": "<>",
"rootCode": "<>",
"description": "<>"
}
],
"founded": "<>",
"legalName": "<>",
"legalForms":
[
{
"name": "<>",
"level": <>
},
{
"name": "<>",
"level": <>
}
]
},
"name": "<>",
"people":
{
"items":
[
{
"name": "<>",
"givenName": "<>",
"familyName": "<>"
}
]
},
"country": "<>",
"locations": {}
}
Which as you can see has many nested dictionaries. And there are several million of these.
I'd like to get this file into an SQL table with even the sub-dictionary values in their own columns. How can I do this? It would seem I have to use some sort of name spacing technique for the nested data as there are some duplicate keys i.e. 'name'.
The data will be analysed using Pandas, but I'd like to get this straight into Postgres if possible. Any assistance greatly appreciated.
The result will look like:
id | base_ateco_code | etc | base_ateco_legalForms_name | etc |
Unless there are any ideas about this - it's a pretty open project from my employer - I just need to be able to use this information as part of a JOIN with another table.
Many thanks.

How to generate JSON array from multiple rows, then return with values of another table

I am trying to build a query which combines rows of one table into a JSON array, I then want that array to be part of the return.
I know how to do a simple query like
SELECT *
FROM public.template
WHERE id=1
And I have worked out how to produce the JSON array that I want
SELECT array_to_json(array_agg(to_json(fields)))
FROM (
SELECT id, name, format, data
FROM public.field
WHERE template_id = 1
) fields
However, I cannot work out how to combine the two, so that the result is a number of fields from public.template with the output of the second query being one of the returned fields.
I am using PostGreSQL 9.6.6
Edit, as requested more information, a definition of field and template tables and a sample of each queries output.
Currently, I have a JSONB row on the template table which I am using to store an array of fields, but I want to move fields to their own table so that I can more easily enforce a schema on them.
Template table contains:
id
name
data
organisation_id
But I would like to remove data and replace it with the field table which contains:
id
name
format
data
template_id
At the moment the output of the first query is:
{
"id": 1,
"name": "Test Template",
"data": [
{
"id": "1",
"data": null,
"name": "Assigned User",
"format": "String"
},
{
"id": "2",
"data": null,
"name": "Office",
"format": "String"
},
{
"id": "3",
"data": null,
"name": "Department",
"format": "String"
}
],
"id_organisation": 1
}
This output is what I would like to recreate using one query and both tables. The second query outputs this, but I do not know how to merge it into a single query:
[{
"id": 1,
"name": "Assigned User",
"format": "String",
"data": null
},{
"id": 2,
"name": "Office",
"format": "String",
"data": null
},{
"id": 3,
"name": "Department",
"format": "String",
"data": null
}]
The feature you're looking for is json concatenation. You can do that by using the operator ||. It's available since PostgreSQL 9.5
SELECT to_jsonb(template.*) || jsonb_build_object('data', (SELECT to_jsonb(field) WHERE template_id = templates.id)) FROM template
Sorry for poorly phrasing what I was trying to achieve, after hours of Googling I have worked it out and it was a lot more simple than I thought in my ignorance.
SELECT id, name, data
FROM public.template, (
SELECT array_to_json(array_agg(to_json(fields)))
FROM (
SELECT id, name, format, data
FROM public.field
WHERE template_id = 1
) fields
) as data
WHERE id = 1
I wanted the result of the subquery to be a column in the ouput rather than compiling the entire output table as a JSON.