Correct way to create a "record" field in an Avro schema

I am trying to understand Avro schemas and am stuck on complex types (record). The problem is very simple: create a schema that contains one record field with two primitive fields (a string and a timestamp) nested inside the record. I see two options for the schema:
option 1
{
  "type": "record",
  "name": "cool_subject",
  "namespace": "com.example",
  "fields": [
    {
      "name": "field_1",
      "type": "record",
      "fields": [
        {"name": "operation", "type": "string"},
        {"name": "timestamp", "type": "long", "logicalType": "timestamp-millis"}
      ]
    }
  ]
}
option 2
{
  "type": "record",
  "name": "cool_subject",
  "namespace": "com.example",
  "fields": [
    {
      "name": "field_1",
      "type": {
        "type": "record",
        "name": "field_1_type",
        "fields": [
          {"name": "operation", "type": "string"},
          {"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-millis"}}
        ]
      }
    }
  ]
}
The difference is in the "type" attribute.
As far as I know, option 2 is the correct way. Am I right? Is option 1 valid at all?

The second one is correct. The first one is not valid.
A record schema is something that looks like this:
{
  "type": "record",
  "name": <Name of the record>,
  "fields": [...]
}
And for fields, it should be like this:
[
  {
    "name": <name of field>,
    "type": <type of field>
  },
  ...
]
So in the case of a field which contains a record, it should always look like this:
[
  {
    "name": <name of field>,
    "type": {
      "type": "record",
      "name": <Name of the record>,
      "fields": [...]
    }
  },
  ...
]
The format in the first example would make it unclear whether the name "field_1" is the name of the field or the name of the record.
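If you want to sanity-check this yourself, a quick sketch with the Python fastavro library (an assumption here; any Avro implementation would do the same job) shows that the option 2 layout parses while the option 1 layout is rejected:

# Minimal sketch (assumes "pip install fastavro"): parse both layouts and see which one is accepted.
from fastavro.schema import parse_schema

option_2 = {
    "type": "record",
    "name": "cool_subject",
    "namespace": "com.example",
    "fields": [
        {
            "name": "field_1",
            "type": {
                "type": "record",
                "name": "field_1_type",
                "fields": [
                    {"name": "operation", "type": "string"},
                    {"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-millis"}},
                ],
            },
        }
    ],
}
parse_schema(option_2)  # parses without error

option_1 = {
    "type": "record",
    "name": "cool_subject",
    "namespace": "com.example",
    "fields": [
        {
            "name": "field_1",
            "type": "record",  # bare "record" string is not a valid field type
            "fields": [{"name": "operation", "type": "string"}],
        }
    ],
}
try:
    parse_schema(option_1)
except Exception as err:  # fastavro raises a schema error for the unknown type "record"
    print("option 1 rejected:", err)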

Related

Inserting data from one BigQuery table to another returns 0 rows on group by

I am trying to insert data from one BigQuery table into another by running the query shown below, but I get 0 rows in return. However, if I take out the Survey column, I get the correct number of rows back.
Both nested fields have the same kind of schema. I have checked and double-checked the column names too, but can't seem to figure out what's wrong with the Survey field.
INSERT INTO destination_table
  (Title, Description, Address, Survey)
SELECT
  Title AS Title,
  Description AS Description,
  [STRUCT(
    ARRAY_AGG(STRUCT(Address_Instance.Field1, Address_Instance.Field2)) AS Address_Record
  )] AS Address,
  [STRUCT(
    ARRAY_AGG(STRUCT(Survey_Instance.Field1, Survey_Instance.Field2)) AS Survey_Record
  )] AS Survey
FROM
  source_table,
  UNNEST(Survey) AS Survey,
  UNNEST(Survey_Instance) AS Survey_Instance
GROUP BY
  Title,
  Description
Here's what the schema of my source table looks like:
[
  {
    "name": "Title",
    "type": "STRING"
  },
  {
    "name": "Description",
    "type": "STRING"
  },
  {
    "name": "Address",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      {
        "name": "Address_Instance",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
          {
            "name": "Field1",
            "type": "STRING"
          },
          {
            "name": "Field2",
            "type": "STRING"
          }
        ]
      }
    ]
  },
  {
    "name": "Survey",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      {
        "name": "Survey_Instance",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
          {
            "name": "Field1",
            "type": "STRING"
          },
          {
            "name": "Field2",
            "type": "STRING"
          }
        ]
      }
    ]
  }
]
While mapping to the destination table I rename the nested repeated records, but that's not causing any problems. I am wondering if I am overlooking something important and need some suggestions and advice, basically an extra set of eyes to help me figure out what I am doing wrong.
I would appreciate some help. Thanks in advance.
Use explicit JOINs in general. In this case, use LEFT JOIN: a comma join against UNNEST() behaves like a CROSS JOIN and drops any row whose array is empty, whereas a LEFT JOIN keeps those rows:
FROM source_table st LEFT JOIN
UNNEST(st.Survey) Survey
ON 1=1 LEFT JOIN
UNNEST(Survey.Survey_Instance) Survey_Instance
ON 1=1
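Folding that FROM clause back into the full statement, here is a sketch of the rewritten query run through the google-cloud-bigquery Python client. The table and column names come from the question; the Address arrays are unnested the same way here, which the original query did not show, so treat that part as an assumption:

# Sketch: the INSERT rewritten with explicit LEFT JOINs, run from Python.
# Assumes google-cloud-bigquery is installed and the question's table/column names.
from google.cloud import bigquery

client = bigquery.Client()

query = """
INSERT INTO destination_table (Title, Description, Address, Survey)
SELECT
  Title,
  Description,
  [STRUCT(ARRAY_AGG(STRUCT(Address_Instance.Field1, Address_Instance.Field2)) AS Address_Record)] AS Address,
  [STRUCT(ARRAY_AGG(STRUCT(Survey_Instance.Field1, Survey_Instance.Field2)) AS Survey_Record)] AS Survey
FROM source_table st
LEFT JOIN UNNEST(st.Address) AS Address ON 1=1
LEFT JOIN UNNEST(Address.Address_Instance) AS Address_Instance ON 1=1
LEFT JOIN UNNEST(st.Survey) AS Survey ON 1=1
LEFT JOIN UNNEST(Survey.Survey_Instance) AS Survey_Instance ON 1=1
GROUP BY Title, Description
"""

client.query(query).result()  # .result() blocks until the DML job finishes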

Load Avro file to BigQuery with nested record using customized column name

I was trying to load an Avro file with a nested record. One of the records had a union schema. When loaded into BigQuery, it created a very long name like com_mycompany_data_nestedClassname_value for each union element. That name is long. I am wondering if there is a way to specify the name without the full package name prefixed.
For example, take the following Avro schema:
{
  "type": "record",
  "name": "EventRecording",
  "namespace": "com.something.event",
  "fields": [
    {
      "name": "eventName",
      "type": "string"
    },
    {
      "name": "eventTime",
      "type": "long"
    },
    {
      "name": "userId",
      "type": "string"
    },
    {
      "name": "eventDetail",
      "type": [
        {
          "type": "record",
          "name": "Network",
          "namespace": "com.something.event",
          "fields": [
            {
              "name": "hostName",
              "type": "string"
            },
            {
              "name": "ipAddress",
              "type": "string"
            }
          ]
        },
        {
          "type": "record",
          "name": "DiskIO",
          "namespace": "com.something.event",
          "fields": [
            {
              "name": "path",
              "type": "string"
            },
            {
              "name": "bytesRead",
              "type": "long"
            }
          ]
        }
      ]
    }
  ]
}
BigQuery came up with field names like eventDetail.com_something_event_Network_value. Is it possible to make that long field name something like eventDetail.Network instead?
Avro loading is not as flexible as it should be in BigQuery (a basic example is that it does not support loading only a subset of the fields, i.e. a reader schema). Renaming columns is also not supported in BigQuery today. The only option is to recreate the table with the proper names, i.e. create a new table from your existing table.
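A sketch of that workaround, using the google-cloud-bigquery Python client and a CREATE TABLE ... AS SELECT that rebuilds the union struct with shorter names. The dataset/table names are placeholders, and the long field names below follow the pattern described in the question, so treat them as assumptions about your actual schema:

# Sketch: recreate the table with friendlier nested column names.
# Assumes google-cloud-bigquery is installed; names are illustrative only.
from google.cloud import bigquery

client = bigquery.Client()

query = """
CREATE OR REPLACE TABLE mydataset.event_recording_renamed AS
SELECT
  eventName,
  eventTime,
  userId,
  STRUCT(
    eventDetail.com_something_event_Network_value AS Network,
    eventDetail.com_something_event_DiskIO_value  AS DiskIO
  ) AS eventDetail
FROM mydataset.event_recording
"""

client.query(query).result()  # waits for the new table to be written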

JSON Schema to represent a name and value with value constrained by name

I have the following JSON snippets, which are all valid:
"units": { "name": "EU", "value": "Grams" }
"units": { "name": "EU", "value": "Kilograms" }
"units": { "name": "US", "value": "Ounces" }
"units": { "name": "US", "value": "Pounds" }
The name value can be EU or US, and the set of valid value values should depend on the name value.
It's easy to use JSON Schema enums for both these properties, but can I enforce the additional constraint using JSON Schema?
I would consider changing the overall schema so that there is a parent-child relationship between a name object and a value object, but ideally I would avoid this.
I managed to crack it using https://www.jsonschemavalidator.net/ to work through an example. The following schema provides the solution:
"units": {
"type":"object",
"oneOf": [ {
"properties": {
"name": { "enum": [ "EU" ] },
"value": { "enum" : ["Grams", "Kilograms"]}}}, {
"properties": {
"name": { "enum": [ "US" ] },
"value": { "enum": ["Ounces", "Pounds"]}}}]
}
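To see the constraint in action, here is a small sketch with the Python jsonschema package (an assumption: pip install jsonschema). The surrounding object schema is minimal and only exists to host the units property:

# Sketch: validating the name/value pairing with the jsonschema package.
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {
        "units": {
            "type": "object",
            "oneOf": [
                {
                    "properties": {
                        "name": {"enum": ["EU"]},
                        "value": {"enum": ["Grams", "Kilograms"]},
                    }
                },
                {
                    "properties": {
                        "name": {"enum": ["US"]},
                        "value": {"enum": ["Ounces", "Pounds"]},
                    }
                },
            ],
        }
    },
}

validate({"units": {"name": "EU", "value": "Grams"}}, schema)  # passes

try:
    validate({"units": {"name": "EU", "value": "Pounds"}}, schema)  # EU with a US unit
except ValidationError as err:
    print("rejected:", err.message)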

Bigquery fails to load data from Google Cloud Storage

I have a JSON record which looks like this:
{"customer_id":"2349uslvn2q3","order_id":"9sufd23rdl40",
"line_item": [{"line":"1","sku":"10","amount":10},
{"line":"2","sku":"20","amount":20}]}
I am trying to load the record above into a table which has the following schema definition:
"fields": [
{
"mode": "NULLABLE",
"name": "customer_id",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "order_id",
"type": "STRING"
},
{
"mode": "REPEATED",
"name": "line_item",
"type": "STRING"
}
]
I am getting the following error message:
JSON parsing error in row starting at position 0 at file:
gs://gcs_bucket/file0. JSON object specified for non-record field:
line_item
I want the line_item JSON, which can have more than one element, stored as an array in the line_item column of the table.
Any suggestions?
The first thing is that your input JSON shouldn't contain newline characters inside a record; each record must sit on a single line, saved like this:
{"customer_id":"2349uslvn2q3","order_id":"9sufd23rdl40", "line_item": [{"line":"1","sku":"10","amount":10}, {"line":"2","sku":"20","amount":20}]}
Here is an example of how your JSON file should look:
{"customer_id":"2349uslvn2q3","order_id":"9sufd23rdl40", "line_item": [{"line":"1","sku":"10","amount":10}, {"line":"2","sku":"20","amount":20}]}
{"customer_id":"2","order_id":"2", "line_item": [{"line":"2","sku":"20","amount":20}, {"line":"2","sku":"20","amount":20}]}
{"customer_id":"3","order_id":"3", "line_item": [{"line":"3","sku":"30","amount":30}, {"line":"3","sku":"30","amount":30}]}
Your schema is also not correct. It should be:
[
  {
    "mode": "NULLABLE",
    "name": "customer_id",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "order_id",
    "type": "STRING"
  },
  {
    "mode": "REPEATED",
    "name": "line_item",
    "type": "RECORD",
    "fields": [
      {"name": "line", "type": "STRING"},
      {"name": "sku", "type": "STRING"},
      {"name": "amount", "type": "INTEGER"}
    ]
  }
]
For a better understanding of how schemas work, I've tried writing sort of a guide in this answer. Hopefully it can be of some value.
If your data is saved, for instance, in a file called gs://gcs_bucket/file0 and your schema in schema.json, then this command should work for you:
bq load --source_format=NEWLINE_DELIMITED_JSON dataset.table gs://gcs_bucket/file0 schema.json
(assuming you are using the CLI tool, as seems to be the case in your question).
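The same load can also be done from the google-cloud-bigquery Python client; a minimal sketch, assuming that library is installed and reusing the bucket path and a placeholder dataset.table name from the question:

# Sketch: loading the newline-delimited JSON file from GCS with the nested schema.
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("customer_id", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("order_id", "STRING", mode="NULLABLE"),
    bigquery.SchemaField(
        "line_item", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("line", "STRING"),
            bigquery.SchemaField("sku", "STRING"),
            bigquery.SchemaField("amount", "INTEGER"),
        ],
    ),
]

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=schema,
)

load_job = client.load_table_from_uri(
    "gs://gcs_bucket/file0", "dataset.table", job_config=job_config
)
load_job.result()  # waits for the load job to complete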

How to create new table with nested schema entirely in BigQuery

I've got a nested table A in BigQuery with a schema as follows:
{
  "name": "page_event",
  "mode": "repeated",
  "type": "RECORD",
  "fields": [
    {
      "name": "id",
      "type": "STRING"
    }
  ]
}
I would like to enrich table A with data from another table and save the result as a new nested table. Let's say I would like to add a "description" field to table A (creating table B), so my schema would be as follows:
{
  "name": "page_event",
  "mode": "repeated",
  "type": "RECORD",
  "fields": [
    {
      "name": "id",
      "type": "STRING"
    },
    {
      "name": "description",
      "type": "STRING"
    }
  ]
}
How do I do this in BigQuery? It seems that there are no functions for creating nested structures in BigQuery SQL (except the NEST function, which produces a list, but that function doesn't seem to work, failing with "Unexpected error").
The only way of doing this I can think of is to:
1. use string concatenation functions to produce table B with a single field called "json", whose content is the enriched data from A converted to a JSON string;
2. export B to GCS as a set of files F;
3. load F as table C.
Is there an easier way to do it?
To enrich the schema of an existing table, one can use the tables patch API:
https://cloud.google.com/bigquery/docs/reference/v2/tables/patch
The request will look like the one below:
PATCH https://www.googleapis.com/bigquery/v2/projects/{project_id}/datasets/{dataset_id}/tables/{table_id}?key={YOUR_API_KEY}
{
  "schema": {
    "fields": [
      {
        "name": "page_event",
        "mode": "repeated",
        "type": "RECORD",
        "fields": [
          {
            "name": "id",
            "type": "STRING"
          },
          {
            "name": "description",
            "type": "STRING"
          }
        ]
      }
    ]
  }
}
(The original answer included screenshots of the table schema before and after the patch.)
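The same schema change can be made from the google-cloud-bigquery Python client by fetching the table, appending the new nested field, and updating it. A minimal sketch, with the table ID as a placeholder:

# Sketch: add a nested "description" field to the existing page_event record.
# Assumes google-cloud-bigquery is installed; the table ID is illustrative.
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.my_dataset.table_a")

new_schema = []
for field in table.schema:
    if field.name == "page_event":
        # Rebuild the repeated record with the extra leaf field appended.
        nested = list(field.fields) + [bigquery.SchemaField("description", "STRING")]
        field = bigquery.SchemaField("page_event", "RECORD", mode="REPEATED", fields=nested)
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])  # schema updates may only add or relax fields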