JSON Schema - bq load errors - google-bigquery

This is part of the JSON file used:
{"body1": {"posts": {"children": [{"row": {"acceptedanswerid": "26", "answercount": "5", "body": "<p>Now that the Engineer update has come, there will be lots of Engineers building up everywhere. How should this best be handled?</p>\n", "commentcount": "7", "creationdate": "2010-07-07T19:06:25.043", "id": "1", "lastactivitydate": "2010-08-27T22:38:43.840", "lasteditdate": "2010-08-27T22:38:43.840", "lasteditordisplayname": "", "lasteditoruserid": "56", "owneruserid": "11", "posttypeid": "1", "score": "10", "tags": "<strategy><team-fortress-2><tactics>", "title": "In Team Fortress 2, what is a good strategy to deal with lots of engineers turtling on the other team?", "viewcount": "1166"}}, {"row": {"acceptedanswerid": "184", "answercount": "3", "body": "<p>I know I can create a Warp Gate and teleport to Pylons, but I have no idea how to make Warp Prisms or know if there's any other unit capable of transporting.</p>\n\n<p>I would in particular like this to built remote bases in 1v1</p>\n", "commentcount": "2", "creationdate": "2010-07-07T19:07:58.427", "id": "2", "lastactivitydate": "2010-07-08T00:21:13.163", "lasteditdate": "2010-07-08T00:16:46.013", "lasteditordisplayname": "", "lasteditoruserid": "68", "owneruserid": "10", "posttypeid": "1", "score": "5", "tags": "<starcraft-2><how-to><protoss>", "title": "What protoss unit can transport others?", "viewcount": "398"}}]}}}
This is the schema used:
{
"name":"body1", "type": "STRING",
"name":"posts", "type": "STRING",
"name":"children", "type":"RECORD",
"fields": [
{"name": "row", "type": "STRING"},
{"name": "acceptedanswerid", "type": "STRING"},
{"name": "answercount", "type": "STRING"},
{"name": "body", "type": "STRING"},
{"name": "commentcount", "type": "STRING"},
{"name": "creationdate", "type": "STRING"},
{"name": "id", "type": "string"},
{"name": "lasteditdate", "type": "integer"},
{"name": "lasteditordisplayname", "type": "string"},
{"name": "lasteditoruserid", "type": "string"},
{"name": "owneruserid", "type": "string"},
{"name": "posttypeid", "type": "string"},
{"name": "score", "type": "string"},
{"name": "tags", "type": "string"},
{"name": "title", "type": "string"},
{"name": "viewcount", "type": "string"}
]
}
The problem is in how the schema is defined, but I couldn't find detailed documentation on how to build it. Can anyone help me?
Following Gil's suggestion, I changed the original schema to this valid JSON:
{
"name":"body1", "type": "RECORD",
"fields": [
{"name":"posts", "type": "RECORD",
"fields": [
{"name":"children", "type": "RECORD",
"fields": [
{"name": "row", "type": "STRING"},
{"name": "acceptedanswerid", "type": "STRING"},
{"name": "answercount", "type": "STRING"},
{"name": "body", "type": "STRING"},
{"name": "commentcount", "type": "STRING"},
{"name": "creationdate", "type": "STRING"},
{"name": "id", "type": "string"},
{"name": "lasteditdate", "type": "integer"},
{"name": "lasteditordisplayname", "type": "string"},
{"name": "lasteditoruserid", "type": "string"},
{"name": "owneruserid", "type": "string"},
{"name": "posttypeid", "type": "string"},
{"name": "score", "type": "string"},
{"name": "tags", "type": "string"},
{"name": "title", "type": "string"},
{"name": "viewcount", "type": "string"}
]}]}]}
The bq command returns:
File: 0 / Offset:0 / Line:1 / Column:8 / Field:body1: no such field

Looking at the raw data you've provided, "children" is a child of "posts", which in turn is a child of "body1". Everything is nested, rather than being three fields at the same level as your schema describes.
You should create your schema to reflect this, e.g. (not tested):
{
"name":"body1", "type": "RECORD",
"fields": [
{"name":"posts", "type": "RECORD",
"fields": [
{"name":"children", "type": "RECORD",
"fields": [
{"name": "row", "type": "STRING"},
{"name": "acceptedanswerid", "type": "STRING"},
{"name": "answercount", "type": "STRING"},
{"name": "body", "type": "STRING"},
{"name": "commentcount", "type": "STRING"},
{"name": "creationdate", "type": "STRING"},
{"name": "id", "type": "string"},
{"name": "lasteditdate", "type": "integer"},
{"name": "lasteditordisplayname", "type": "string"},
{"name": "lasteditoruserid", "type": "string"},
{"name": "owneruserid", "type": "string"},
{"name": "posttypeid", "type": "string"},
{"name": "score", "type": "string"},
{"name": "tags", "type": "string"},
{"name": "title", "type": "string"},
{"name": "viewcount", "type": "string"}
]}
]}
]
}
EDIT 1
OK, I took your input example and ran it through a schema generator (https://github.com/tottokug/BigQuerySchemaGenerator), and it gave:
[
{
"name": "body1",
"type": "RECORD",
"fields": [
{
"name": "posts",
"type": "RECORD",
"fields": [
[
{
"name": "row",
"type": "RECORD",
"fields": [
{
"name": "acceptedanswerid",
"type": "STRING"
},
{
"name": "answercount",
"type": "STRING"
},
{
"name": "body",
"type": "STRING"
},
{
"name": "commentcount",
"type": "STRING"
},
{
"name": "creationdate",
"type": "STRING"
},
{
"name": "id",
"type": "STRING"
},
{
"name": "lastactivitydate",
"type": "STRING"
},
{
"name": "lasteditdate",
"type": "STRING"
},
{
"name": "lasteditordisplayname",
"type": "STRING"
},
{
"name": "lasteditoruserid",
"type": "STRING"
},
{
"name": "owneruserid",
"type": "STRING"
},
{
"name": "posttypeid",
"type": "STRING"
},
{
"name": "score",
"type": "STRING"
},
{
"name": "tags",
"type": "STRING"
},
{
"name": "title",
"type": "STRING"
},
{
"name": "viewcount",
"type": "STRING"
}
]
}
],
[
{
"name": "row",
"type": "RECORD",
"fields": [
{
"name": "acceptedanswerid",
"type": "STRING"
},
{
"name": "answercount",
"type": "STRING"
},
{
"name": "body",
"type": "STRING"
},
{
"name": "commentcount",
"type": "STRING"
},
{
"name": "creationdate",
"type": "STRING"
},
{
"name": "id",
"type": "STRING"
},
{
"name": "lastactivitydate",
"type": "STRING"
},
{
"name": "lasteditdate",
"type": "STRING"
},
{
"name": "lasteditordisplayname",
"type": "STRING"
},
{
"name": "lasteditoruserid",
"type": "STRING"
},
{
"name": "owneruserid",
"type": "STRING"
},
{
"name": "posttypeid",
"type": "STRING"
},
{
"name": "score",
"type": "STRING"
},
{
"name": "tags",
"type": "STRING"
},
{
"name": "title",
"type": "STRING"
},
{
"name": "viewcount",
"type": "STRING"
}
]
}
]
]
}
]
}
]
Does this work?
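For comparison, the nesting described in this answer can also be derived mechanically from the sample data. The following is only a rough sketch, not the generator's actual algorithm: `infer_fields` is a hypothetical helper, it looks at the first element of each array only, and it treats anything that is not an object, boolean, or number as STRING. In the sample, "children" is an array of objects, so the sketch marks it as a REPEATED RECORD containing a "row" RECORD.

```python
import json


def infer_fields(obj):
    """Infer BigQuery-style field definitions from one decoded JSON object.

    Hypothetical helper: inspects only the first element of each array
    and maps unknown or null values to STRING.
    """
    fields = []
    for name, value in obj.items():
        field = {"name": name}
        if isinstance(value, list):
            # A JSON array maps to a REPEATED field.
            field["mode"] = "REPEATED"
            value = value[0] if value else ""
        if isinstance(value, dict):
            # A JSON object maps to a RECORD with its own sub-fields.
            field["type"] = "RECORD"
            field["fields"] = infer_fields(value)
        elif isinstance(value, bool):
            field["type"] = "BOOLEAN"
        elif isinstance(value, int):
            field["type"] = "INTEGER"
        elif isinstance(value, float):
            field["type"] = "FLOAT"
        else:
            field["type"] = "STRING"
        fields.append(field)
    return fields


# Shortened version of the sample data from the question.
sample = {"body1": {"posts": {"children": [{"row": {"id": "1", "score": "10"}}]}}}

schema = infer_fields(sample)
print(json.dumps(schema, indent=2))
```

The result is a top-level JSON array of field objects, with one level of "fields" per level of nesting in the data.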

Related

Avro Schema: multiple records reference same data type issue: Unknown union branch

I have an Avro schema in which the Customer record references the CustomerAddress record:
[
{
"type": "record",
"namespace": "com.example",
"name": "CustomerAddress",
"fields": [
{ "name": "address", "type": "string" },
{ "name": "city", "type": "string" },
{ "name": "postcode", "type": ["string", "int"] },
{ "name": "type","type": {"type": "enum","name": "type","symbols": ["POBOX","RESIDENTIAL","ENTERPRISE"]}}
]
},
{
"type": "record",
"namespace": "com.example",
"name": "Customer",
"fields": [
{ "name": "first_name", "type": "string" },
{ "name": "middle_name", "type": ["null", "string"], "default": null },
{ "name": "last_name", "type": "string" },
{ "name": "age", "type": "int" },
{ "name": "height", "type": "float" },
{ "name": "weight", "type": "float" },
{ "name": "automated_email", "type": "boolean", "default": true },
{ "name": "customer_emails", "type": {"type": "array","items": "string"},"default": []},
{ "name": "customer_address", "type": "com.example.CustomerAddress" }
]
}
]
I have this JSON payload:
{
"Customer" : {
"first_name": "John",
"middle_name": null,
"last_name": "Smith",
"age": 25,
"height": 177.6,
"weight": 120.6,
"automated_email": true,
"customer_emails": ["ning.chang#td.com", "test#td.com"],
"customer_address":
{
"address": "21 2nd Street",
"city": "New York",
"postcode": "10021",
"type": "RESIDENTIAL"
}
}
}
When I run the command: java -jar avro-tools-1.8.2.jar fromjson --schema-file customer.avsc customer.json
I get the following exception:
Exception in thread "main" org.apache.avro.AvroTypeException: Unknown union branch Customer
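In your JSON data you use the key Customer, but you have to use the fully qualified name. The schema file is a top-level union of two records, and Avro's JSON encoding identifies a record branch of a union by its full name, so the key should be com.example.Customer.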
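As a rough illustration of that fix, the sketch below rewrites a shortened copy of the payload so that the top-level key carries the namespace. `qualify_top_level` is a hypothetical helper, not part of avro-tools. The sketch also wraps `postcode` as `{"string": ...}`, on the assumption that, because the field is declared as a `["string", "int"]` union, Avro's JSON encoding expects non-null union values to be wrapped with their branch name as well.

```python
import json

NAMESPACE = "com.example"  # namespace taken from the schema in the question


def qualify_top_level(payload, namespace=NAMESPACE):
    """Prefix unqualified top-level union branch names with the namespace.

    Hypothetical helper for illustration; it does not touch nested values.
    """
    return {
        (name if "." in name else f"{namespace}.{name}"): value
        for name, value in payload.items()
    }


# Shortened version of the payload from the question.
payload = {
    "Customer": {
        "first_name": "John",
        "customer_address": {"city": "New York", "postcode": "10021"},
    }
}

fixed = qualify_top_level(payload)

# Assumption: postcode is a ["string", "int"] union, so its value is wrapped
# with the name of the branch it belongs to.
address = fixed["com.example.Customer"]["customer_address"]
address["postcode"] = {"string": address["postcode"]}

print(json.dumps(fixed, indent=2))
```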

How can I update records based on conditions on nested fields?

I have a bigquery table with the following schema:
{
"fields": [
{
"name": "products",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "name",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "qty",
"type": "INTEGER",
"mode": "REQUIRED"
},
{
"name": "variant_name",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "order_id",
"type": "INTEGER",
"mode": "REQUIRED"
},
{
"name": "test_mode",
"type": "BOOLEAN",
"mode": "REQUIRED"
},
{
"name": "amount",
"type": "FLOAT",
"mode": "REQUIRED"
},
{
"name": "transaction_reference",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "id",
"type": "INTEGER",
"mode": "REQUIRED"
},
{
"name": "source",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "currency",
"type": "STRING",
"mode": "REQUIRED"
}
]
},
{
"name": "processed_at",
"type": "TIMESTAMP",
"mode": "REQUIRED",
"description": "bq-datetime"
},
{
"name": "inserted_at",
"type": "TIMESTAMP",
"mode": "REQUIRED",
"description": "bq-datetime"
}
]
}
As you can see the products field is a nested one. What I would like to achieve is an update condition like the following:
UPDATE `dataset.table`
SET processed_at = '2021-04-17T16:07:30.993806'
WHERE processed_at = '1970-01-01T00:00:00' AND products.order_id = 9366054;
Whenever I try this, I get the following error:
Cannot access field order_id on a value with type ARRAY<STRUCT<name STRING, qty INT64, variant_name STRING, ...>>
I know that to SELECT rows with the same logic I can use UNNEST, but I have not been able to apply this to UPDATE.
Try EXISTS:
WHERE processed_at = '1970-01-01T00:00:00'
  AND EXISTS (
    SELECT *
    FROM UNNEST(products)
    WHERE order_id = 9366054
  );
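To make the semantics of that condition concrete, here is a small in-memory sketch in plain Python. It does not talk to BigQuery; the rows and the `update_processed_at` helper are made up for illustration. The `any(...)` call plays the role of `EXISTS (SELECT * FROM UNNEST(products) WHERE ...)`: a row matches if at least one element of its `products` array has the requested `order_id`.

```python
# Illustrative rows shaped like the table in the question (most columns omitted).
rows = [
    {
        "processed_at": "1970-01-01T00:00:00",
        "products": [{"order_id": 9366054}, {"order_id": 1}],
    },
    {
        "processed_at": "1970-01-01T00:00:00",
        "products": [{"order_id": 2}],
    },
]


def update_processed_at(rows, order_id, old_value, new_value):
    """Set processed_at on rows where any product matches order_id."""
    updated = 0
    for row in rows:
        # Equivalent of EXISTS over the unnested products array.
        has_order = any(p["order_id"] == order_id for p in row["products"])
        if row["processed_at"] == old_value and has_order:
            row["processed_at"] = new_value
            updated += 1
    return updated


count = update_processed_at(
    rows, 9366054, "1970-01-01T00:00:00", "2021-04-17T16:07:30.993806"
)
print(count, [row["processed_at"] for row in rows])
```

Only the first row is changed, because the second row has no product with that order ID.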

Export BigQuery table schema to JSON Schema

It is possible to export a bigquery table schema to a JSON file but the resulting JSON file is a bigquery table schema and not a JSON schema.
I am looking for a way to generate a JSON schema using a bigquery table based on the standard available here: https://json-schema.org/
This looks something like this:
{
"definitions": {},
"$schema": "http://json-schema.org/draft-07/schema#",
"$id": "http://example.com/root.json",
"type": "object",
"title": "The Root Schema",
"required": [
"glossary"
],
"properties": {
"glossary": {
"$id": "#/properties/glossary",
"type": "object",
"title": "The Glossary Schema",
"required": [
"title",
"GlossDiv"
],
"properties": {
"title": {
"$id": "#/properties/glossary/properties/title",
"type": "string",
"title": "The Title Schema",
"default": "",
"examples": [
"example glossary"
],
"pattern": "^(.*)$"
},
"GlossDiv": {
"$id": "#/properties/glossary/properties/GlossDiv",
"type": "object",
"title": "The Glossdiv Schema",
"required": [
"title",
"GlossList"
],
"properties": {
"title": {
"$id": "#/properties/glossary/properties/GlossDiv/properties/title",
"type": "string",
"title": "The Title Schema",
"default": "",
"examples": [
"S"
],
"pattern": "^(.*)$"
},
"GlossList": {
"$id": "#/properties/glossary/properties/GlossDiv/properties/GlossList",
"type": "object",
"title": "The Glosslist Schema",
"required": [
"GlossEntry"
],
"properties": {
"GlossEntry": {
"$id": "#/properties/glossary/properties/GlossDiv/properties/GlossList/properties/GlossEntry",
"type": "object",
"title": "The Glossentry Schema",
"required": [
"ID",
"SortAs",
"GlossTerm",
"Acronym",
"Abbrev",
"GlossDef",
"GlossSee"
],
"properties": {
"ID": {
"$id": "#/properties/glossary/properties/GlossDiv/properties/GlossList/properties/GlossEntry/properties/ID",
"type": "string",
"title": "The Id Schema",
"default": "",
"examples": [
"SGML"
],
"pattern": "^(.*)$"
},
"SortAs": {
"$id": "#/properties/glossary/properties/GlossDiv/properties/GlossList/properties/GlossEntry/properties/SortAs",
"type": "string",
"title": "The Sortas Schema",
"default": "",
"examples": [
"SGML"
],
"pattern": "^(.*)$"
},
"GlossTerm": {
"$id": "#/properties/glossary/properties/GlossDiv/properties/GlossList/properties/GlossEntry/properties/GlossTerm",
"type": "string",
"title": "The Glossterm Schema",
"default": "",
"examples": [
"Standard Generalized Markup Language"
],
"pattern": "^(.*)$"
},
"Acronym": {
"$id": "#/properties/glossary/properties/GlossDiv/properties/GlossList/properties/GlossEntry/properties/Acronym",
"type": "string",
"title": "The Acronym Schema",
"default": "",
"examples": [
"SGML"
],
"pattern": "^(.*)$"
},
"Abbrev": {
"$id": "#/properties/glossary/properties/GlossDiv/properties/GlossList/properties/GlossEntry/properties/Abbrev",
"type": "string",
"title": "The Abbrev Schema",
"default": "",
"examples": [
"ISO 8879:1986"
],
"pattern": "^(.*)$"
},
"GlossDef": {
"$id": "#/properties/glossary/properties/GlossDiv/properties/GlossList/properties/GlossEntry/properties/GlossDef",
"type": "object",
"title": "The Glossdef Schema",
"required": [
"para",
"GlossSeeAlso"
],
"properties": {
"para": {
"$id": "#/properties/glossary/properties/GlossDiv/properties/GlossList/properties/GlossEntry/properties/GlossDef/properties/para",
"type": "string",
"title": "The Para Schema",
"default": "",
"examples": [
"A meta-markup language, used to create markup languages such as DocBook."
],
"pattern": "^(.*)$"
},
"GlossSeeAlso": {
"$id": "#/properties/glossary/properties/GlossDiv/properties/GlossList/properties/GlossEntry/properties/GlossDef/properties/GlossSeeAlso",
"type": "array",
"title": "The Glossseealso Schema",
"items": {
"$id": "#/properties/glossary/properties/GlossDiv/properties/GlossList/properties/GlossEntry/properties/GlossDef/properties/GlossSeeAlso/items",
"type": "string",
"title": "The Items Schema",
"default": "",
"examples": [
"GML",
"XML"
],
"pattern": "^(.*)$"
}
}
}
},
"GlossSee": {
"$id": "#/properties/glossary/properties/GlossDiv/properties/GlossList/properties/GlossEntry/properties/GlossSee",
"type": "string",
"title": "The Glosssee Schema",
"default": "",
"examples": [
"markup"
],
"pattern": "^(.*)$"
}
}
}
}
}
}
}
}
}
}
}
BigQuery does not use the json-schema standard for its table schemas. I found two projects that have code available to go from json-schema to a BigQuery schema:
jsonschema-bigquery
jsonschema-transpiler
You could try using those projects as reference to create the opposite transformation. Also, you could create a feature request to the BigQuery team, asking to include the json-schema standard as an output format option.
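A minimal sketch of what that opposite transformation might look like is shown here. It is not taken from either project: the type mapping in `TYPE_MAP` and the function names are assumptions, unknown BigQuery types fall back to "string", and only `mode`, `type`, and nested `fields` are considered.

```python
import json

# Assumed mapping from BigQuery column types to JSON Schema fragments.
TYPE_MAP = {
    "STRING": {"type": "string"},
    "INTEGER": {"type": "integer"},
    "INT64": {"type": "integer"},
    "FLOAT": {"type": "number"},
    "FLOAT64": {"type": "number"},
    "BOOLEAN": {"type": "boolean"},
    "BOOL": {"type": "boolean"},
    "TIMESTAMP": {"type": "string", "format": "date-time"},
}


def field_to_schema(field):
    """Convert one BigQuery field definition to a JSON Schema fragment."""
    kind = field["type"].upper()
    if kind in ("RECORD", "STRUCT"):
        schema = fields_to_schema(field.get("fields", []))
    else:
        schema = dict(TYPE_MAP.get(kind, {"type": "string"}))
    if field.get("mode") == "REPEATED":
        # REPEATED columns become arrays of the element schema.
        schema = {"type": "array", "items": schema}
    return schema


def fields_to_schema(fields):
    """Convert a list of BigQuery fields to a JSON Schema object schema."""
    schema = {
        "type": "object",
        "properties": {f["name"]: field_to_schema(f) for f in fields},
    }
    required = [f["name"] for f in fields if f.get("mode") == "REQUIRED"]
    if required:
        schema["required"] = required
    return schema


def bigquery_to_json_schema(table_schema):
    """Convert an exported table schema ({"fields": [...]}) to JSON Schema."""
    root = fields_to_schema(table_schema["fields"])
    root["$schema"] = "http://json-schema.org/draft-07/schema#"
    return root


# Small made-up table schema for demonstration.
table_schema = {
    "fields": [
        {
            "name": "products",
            "type": "RECORD",
            "mode": "REPEATED",
            "fields": [{"name": "qty", "type": "INTEGER", "mode": "REQUIRED"}],
        },
        {"name": "processed_at", "type": "TIMESTAMP", "mode": "REQUIRED"},
    ]
}

result = bigquery_to_json_schema(table_schema)
print(json.dumps(result, indent=2))
```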
No, this is not possible without writing a program to do it for you.
I filed a feature request asking for this functionality:
https://issuetracker.google.com/issues/145308573

BQ command line how to use "--noflatten_results" option to have nested fields

I need to query a table with nested and repeated fields. The bq command line gives me a flattened result, but I need the result in the original nested format.
The original schema looks like this:
{
"fields": [
{
"fields": [
{
"mode": "REQUIRED",
"name": "version",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "hash",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "header",
"type": "STRING"
},
{
"name": "organization",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "date",
"type": "TIMESTAMP"
},
{
"mode": "REQUIRED",
"name": "encoding",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "message_type",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "receiver_code",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "sender_code",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "segment_separator",
"type": "STRING"
},
{
"fields": [
{
"fields": [
{
"name": "name",
"type": "STRING"
},
{
"name": "description",
"type": "STRING"
},
{
"name": "value",
"type": "STRING"
},
{
"fields": [
{
"name": "name",
"type": "STRING"
},
{
"name": "description",
"type": "STRING"
},
{
"name": "value",
"type": "STRING"
}
],
"mode": "REPEATED",
"name": "composite_elements",
"type": "RECORD"
}
],
"mode": "REPEATED",
"name": "elements",
"type": "RECORD"
},
{
"name": "description",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "name",
"type": "STRING"
}
],
"mode": "REPEATED",
"name": "segments",
"type": "RECORD"
},
{
"mode": "REQUIRED",
"name": "message_identifier",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "element_separator",
"type": "STRING"
},
{
"name": "composite_element_separator",
"type": "STRING"
}
],
"mode": "REPEATED",
"name": "messages",
"type": "RECORD"
},
{
"mode": "REQUIRED",
"name": "syntax",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "encoding",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "file_name",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "size",
"type": "INTEGER"
}
]
}
So how can I export the data locally while keeping the nested representation?
[EDIT]
Export via Google Cloud Storage to keep the nested representation
It seems the only way to export the nested representation is to write the query result to a table, extract that table to Google Cloud Storage, and finally download the file:
bq query --destination_table=DEV.EDI_DATA_EXPORT --replace \
--allow_large_results --noflatten_results \
"select * from DEV.EDI_DATA where syntax='EDIFACT' " \
&& bq extract --destination_format=NEWLINE_DELIMITED_JSON DEV.EDI_DATA_EXPORT gs://mybucket/data.json \
&& gsutil cp gs://mybucket/data.json .
It's surprising to me...
Whenever you use --noflatten_results you also have to use --allow_large_results and --destination_table. This stores the non-flattened results in a new table.
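If the three steps need to be scripted, one option is to assemble the commands in Python and run them in order. The sketch below only builds the argument lists, using the same flags as the command in the question; `export_commands` is a hypothetical helper, and the table names and bucket are the ones from the question.

```python
import shlex


def export_commands(source, scratch_table, gcs_uri, where):
    """Build the query, extract, and download commands as argv lists."""
    query = f"select * from {source} where {where}"
    return [
        # 1. Store the non-flattened result in a scratch table.
        ["bq", "query", f"--destination_table={scratch_table}", "--replace",
         "--allow_large_results", "--noflatten_results", query],
        # 2. Extract the scratch table as newline-delimited JSON.
        ["bq", "extract", "--destination_format=NEWLINE_DELIMITED_JSON",
         scratch_table, gcs_uri],
        # 3. Download the extracted file to the current directory.
        ["gsutil", "cp", gcs_uri, "."],
    ]


commands = export_commands(
    "DEV.EDI_DATA",
    "DEV.EDI_DATA_EXPORT",
    "gs://mybucket/data.json",
    "syntax='EDIFACT'",
)
for argv in commands:
    print(shlex.join(argv))
```

Each list could then be passed to `subprocess.run(argv, check=True)`, which stops at the first failing step in the same way the `&&` chain does.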

Merge two Json Schemas

I am new to JSON and JSON schema validation.
I have the following schema to validate a single employee object:
{
"$schema":"http://json-schema.org/draft-03/schema#",
"title":"Employee Type Schema",
"type":"object",
"properties":
{
"EmployeeID": {"type": "integer","minimum": 101,"maximum": 901,"required":true},
"FirstName": {"type": "string","required":true},
"LastName": {"type": "string","required":true},
"JobTitle": {"type": "string"},
"PhoneNumber": {"type": "string","required":true},
"Email": {"type": "string","required":true},
"Address":
{
"type": "object",
"properties":
{
"AddressLine": {"type": "string","required":true},
"City": {"type": "string","required":true},
"PostalCode": {"type": "string","required":true},
"StateProvinceName": {"type": "string","required":true}
}
},
"CountryRegionName": {"type": "string"}
}
}
and I have the following schema to validate an array of the same employee object:
{
"$schema": "http://json-schema.org/draft-03/schema#",
"title": "Employee set",
"type": "array",
"items":
{
"type": "object",
"properties":
{
"EmployeeID": {"type": "integer","minimum": 101,"maximum": 301,"required":true},
"FirstName": {"type": "string","required":true},
"LastName": {"type": "string","required":true},
"JobTitle": {"type": "string"},
"PhoneNumber": {"type": "string","required":true},
"Email": {"type": "string","required":true},
"Address":
{
"type": "object",
"properties":
{
"AddressLine": {"type": "string","required":true},
"City": {"type": "string","required":true},
"PostalCode": {"type": "string","required":true},
"StateProvinceName": {"type": "string","required":true}
}
},
"CountryRegionName": {"type": "string"}
}
}
}
Can you please show me how to merge them so that I can use a single schema to validate either a single employee object or an entire collection? Thanks.
(Note: this question was also asked on the JSON Schema Google Group, and this answer is adapted from there.)
With "$ref", you can have something like this for your array:
{
"type": "array",
"items": {"$ref": "/schemas/path/to/employee"}
}
If you want something to be an array or a single item, then you can use "oneOf":
{
"oneOf": [
{"$ref": "/schemas/path/to/employee"}, // the root schema, defining the object
{
"type": "array", // the array schema.
"items": {"$ref": "/schemas/path/to/employee"}
}
]
}
The original Google Groups answer also contains some advice on using "definitions" to organise schemas so all these variants can exist in the same file.
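As a sketch of how the single-file variant might look, the example below keeps a shortened employee schema under "definitions" and points both "oneOf" branches at it. Two assumptions are worth flagging: "oneOf" was introduced in draft-04, so the sketch declares draft-04 and uses the array form of "required" rather than the draft-03 per-property flag from the question; and `is_valid` is a deliberately tiny hand-written checker covering only the keywords used here, not a real JSON Schema validator.

```python
# Shortened employee schema (only three properties kept for brevity).
EMPLOYEE = {
    "type": "object",
    "properties": {
        "EmployeeID": {"type": "integer", "minimum": 101, "maximum": 901},
        "FirstName": {"type": "string"},
        "LastName": {"type": "string"},
    },
    "required": ["EmployeeID", "FirstName", "LastName"],
}

# One schema that accepts either a single employee or an array of employees.
MERGED = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "definitions": {"employee": EMPLOYEE},
    "oneOf": [
        {"$ref": "#/definitions/employee"},
        {"type": "array", "items": {"$ref": "#/definitions/employee"}},
    ],
}


def is_valid(instance, schema, root=MERGED):
    """Check only the keywords used above; not a full validator."""
    if "$ref" in schema:
        # Resolve "#/definitions/<name>" against the root schema.
        name = schema["$ref"].rsplit("/", 1)[-1]
        return is_valid(instance, root["definitions"][name], root)
    if "oneOf" in schema:
        # Exactly one branch must match.
        return sum(is_valid(instance, s, root) for s in schema["oneOf"]) == 1
    kind = schema.get("type")
    if kind == "array":
        return isinstance(instance, list) and all(
            is_valid(item, schema.get("items", {}), root) for item in instance
        )
    if kind == "object":
        if not isinstance(instance, dict):
            return False
        if any(key not in instance for key in schema.get("required", [])):
            return False
        return all(
            is_valid(instance[key], sub, root)
            for key, sub in schema.get("properties", {}).items()
            if key in instance
        )
    if kind == "string":
        return isinstance(instance, str)
    if kind == "integer":
        ok = isinstance(instance, int) and not isinstance(instance, bool)
        return ok and (
            schema.get("minimum", instance) <= instance <= schema.get("maximum", instance)
        )
    return True


employee = {"EmployeeID": 150, "FirstName": "Ada", "LastName": "Lovelace"}
print(is_valid(employee, MERGED))
print(is_valid([employee, employee], MERGED))
print(is_valid({"FirstName": "Ada"}, MERGED))
```

In practice an established validator library would replace `is_valid`; the point of the sketch is only the shape of the merged schema.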