Null nested fields in Google BigQuery - sql

I'm trying to upload a JSON file to BigQuery containing a nested field which is null, but it's not being accepted.
I tried a lot of different syntaxes, but I always get the error:
File: 0 / Offset:0 / Line:1 / Column:410, missing required field(s)
I tried sending the value as many different values, listed below, and even omitting it entirely...
"quotas": []
"quotas": null
"quotas": "null"
etc...
The schema definition...
[..]
"name": "quotas",
"type": "record",
"mode": "repeated",
"fields":[
{
"name": "service",
"type": "string",
"mode": "nullable"
},
[..]
]
[..]
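For reference, a newline-delimited JSON row with an empty repeated field would look roughly like this (the id field is invented here, since the schema above is only a fragment):
{"id": 1, "quotas": []}
{"id": 2, "quotas": [{"service": "compute"}]}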

From what I can tell in the logs for the import worker for that job, the line in question is missing a required field (the field name starts with "msi"); the line is otherwise well-formed.
I've filed a bug that BigQuery should give the name of the required field or fields that are missing to make this easier to debug in the future.
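If you hit this yourself, one way to see the full error list for the load job is to ask the bq CLI for the job details (the job ID below is just a placeholder):
# Shows the job's status block, including the errors reported for a failed load
bq show --format=prettyjson -j bqjob_r1234_example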

Related

How to handle schema errors in rapidjson?

How can I detect the following error situation:
A rapidjson::SchemaDocument is constructed from a rapidjson::Document, but the JSON contained in that Document is not a proper schema; for example,
{ "type": "object", "properties": [1] }.
Currently, all I get is an access violation when I validate a document against this faulty schema.
Thanks
Hans

Is there a way to add a default to a json schema array

I just want to understand whether there is a way to add a default set of values to an array. (I don't think there is.)
So ideally I would like something like the following to work, i.e. the fileTypes element defaults to an array of ["jpg", "png"]:
"fileTypes": {
"description": "The accepted file types.",
"type": "array",
"minItems": 1,
"items": {
"type": "string",
"enum": ["jpg", "png", "pdf"]
},
"default": ["jpg", "png"]
},
Of course, all that being said... the above does actually seem to validate as JSON Schema; however, in VS Code, for example, this default value does not populate when creating documents the way other defaults (such as those for strings) do.
It appears to be valid based on the spec.
9.2. "default"
There are no restrictions placed on the value of this keyword. When multiple occurrences of this keyword are applicable to a single sub-instance, implementations SHOULD remove duplicates.
This keyword can be used to supply a default JSON value associated with a particular schema. It is RECOMMENDED that a default value be valid against the associated schema.
See https://json-schema.org/draft/2020-12/json-schema-validation.html#rfc.section.9.2
It's up to the tooling to take advantage of that keyword in the JSON Schema, and it sounds like VS Code does not.

Common type restrictions in JSON schema

Can I have common type restrictions, or a new type, which I could use for multiple properties in a JSON Schema? I am referencing some type properties, but I do not get what I would like. For instance:
{
"$schema": "http://json-schema.org/draft-04/schema#",
"description": "Common types",
"definitions": {
"YN": {
"description": "Y or N field (can be empty, too)",
"type": "string",
"minLength": 0,
"maxLength": 1,
"enum": [ "Y", "N", "" ]
},
"HHMM": {
"description": "Time in HHMM format (or empty).",
"type": "string",
"minLength": 0,
"maxLength": 4,
"pattern": "^[0-2][0-9][0-5][0-9]$|^$"
}
},
"properties" : {
"is_registered": {
"description": "User registered. (this description is overriden)",
"$ref": "#/definitions/YN"
},
"is_valid": {
"description": "User valid. (this description is overriden)",
"$ref": "#/definitions/YN"
},
"timeofday": {
"description": "User registered at HHMM. (this description is overriden)",
"$ref": "#/definitions/HHMM"
}
}
}
In the presented schema I have two strings with some restrictions (enum, pattern, etc.). I do not want to repeat these restrictions in every field of such a type, so I have defined them in definitions and reused them. If a type's constraints change, I change only the definition.
However, I have two issues.
First, the description is duplicated. If I load this into XMLSpy, only the description of the type is shown, not the description of the actual field. If the description of the type is empty, the field's description is still not used. I tried combining title and description so that the title would come from the common definition and the description from the field, but both title and description always seem to be taken from the common type definition. How can I use a common type and still keep the field's own description, which says what the field actually is?
Second, if the description is inherited from definitions, can I just use a common pattern (or any other type property) and reference only that pattern, defined somehow in definitions or somewhere else?
In order to answer your question, first consider that JSON Schema does not have inheritance, only references.
For draft-04, using a reference means that the WHOLE subschema object is replaced by the referenced schema object. This means your more specific field description is lost. (You can wrap the references in allOf, but it probably won't do what you want in terms of generating documentation.)
If you can move to draft 2019-09, $ref can be used alongside other keywords, as it is then classified as an applicator keyword. I don't know whether the tooling you're using will handle that as you expect, though.
As for your second question: that is not built into JSON Schema. $ref references can only be used when you have a schema (or subschema). If you want to de-duplicate common parts which are NOT schemas, a common approach is to use a templating engine and compile your schemas at build or run time. I've seen this done at scale using jsonnet.
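To illustrate the allOf wrapping mentioned above for draft-04, using the is_registered property from the question: the reference moves inside allOf so the field can keep its own description, though whether your tooling surfaces that description is another matter.
"is_registered": {
"description": "User registered.",
"allOf": [ { "$ref": "#/definitions/YN" } ]
}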

Loading AVRO from Bucket via CLI into BigQuery with Date partition

I'm trying to import data into BigQuery from Avro files with a date partition. When importing via the CLI, I get an error saying that the partitioning field has to be a DATE or TIMESTAMP, but it is arriving as an INTEGER.
Given an AVRO file similar to the one below:
{
    "namespace": "test_namespace",
    "name": "test_name",
    "type": "record",
    "fields": [
        {
            "name": "partition_date",
            "type": "int",
            "logicalType": "date"
        },
        {
            "name": "unique_id",
            "type": "string"
        },
        {
            "name": "value",
            "type": "double"
        }
    ]
}
I am then using the following command through the CLI to try to create a new table:
bq load \
--replace \
--source_format=AVRO \
--use_avro_logical_types=True \
--time_partitioning_field partition_date \
--clustering_fields unique_id \
mydataset.mytable \
gs://mybucket/mydata.avro
The expectation is a new table partitioned on the date column "partition_date" and clustered by "unique_id".
Edit: Please see the error below
The field specified for the time partition can only be of type TIMESTAMP or DATE. The type found is: INTEGER.
The exact command I am using is as follows:
bq load --replace --source_format=AVRO --use_avro_logical_types=True --time_partitioning_field "partition_date" --clustering_fields "unique_id" BQ_DATASET BUCKET_URI
This is the AVRO schema that I am using
{
"namespace": "example.avro",
"type": "record",
"name": "Test",
"fields": [
{ "name": "partition_date", "type": "int", "logicalType": "date" },
{ "name": "unique_id", "type": "string"},
{ "name": "value", "type": "float" }
]
}
It's worth noting that this is an old Google project (about 2-3 years old), if that is of any relevance.
I'm also on Windows 10 with the latest Google SDK.
Google finally got back to me (7 months later). By then I no longer had access to the initial project that I had issues with, but I'm documenting a successful example here for anyone finding this later with a new project.
Following a comment from the issue tracker here I found that I was not using a complex type for the logical date field.
So this:
{
"name": "partition_date",
"type": "int",
"logicalType": "date"
}
Should have been written like this (Note the nested complex object for type):
{
"name": "partition_date",
"type": {
"type": "int",
"logicalType": "date"
}
}
Although the Avro specification defines a date as the number of days from the Unix epoch (1 Jan 1970), I had to write partition_date as datetime.date(1970, 1, 1) instead of just 0.
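The post doesn't say which writer produced the file; a minimal sketch of writing such a file with Python and fastavro (an assumption, not necessarily what was used) would be:
import datetime
from fastavro import parse_schema, writer

schema = {
    "namespace": "example.avro",
    "type": "record",
    "name": "Test",
    "fields": [
        # Note the nested complex type for the logical date, as described above.
        {"name": "partition_date", "type": {"type": "int", "logicalType": "date"}},
        {"name": "unique_id", "type": "string"},
        {"name": "value", "type": "float"},
    ],
}

records = [
    # Pass a datetime.date (here the Unix epoch itself) rather than the raw int 0.
    {"partition_date": datetime.date(1970, 1, 1), "unique_id": "id-1", "value": 1.5},
]

with open("mydata.avro", "wb") as out:
    writer(out, parse_schema(schema), records)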
The commands (bq) were unchanged from the original post.
As stated, I don't know whether this would have fixed my issue with the original project, but hopefully this helps the next person.
I didn't receive any error message performing the same loading operation, generating an equivalent Avro data schema and using the desired BigQuery sink table structure.
According to the GCP documentation, you've used the --use_avro_logical_types=True flag with the bq command line properly, so the Avro date logical type should be translated to the equivalent DATE type in BigQuery.
You can refer to my BigQuery table schema to validate the table structure on your side; since you haven't provided the table structure or the error message itself, I can't suggest more so far:
$ bq show --project_id=<Project_ID> <Dataset>.<Table>
Table <Project_ID>:<Dataset>.<Table>
   Last modified            Schema            Total Rows   Total Bytes   Expiration    Time Partitioning             Clustered Fields   Labels
 ----------------- ------------------------- ------------ ------------- ------------ ------------------------------ ------------------ --------
  22 Apr 12:03:57   |- partition_date: date   3            66                          DAY (field: partition_date)   unique_id
                    |- unique_id: string
                    |- value: float
I have used the FLOAT type for value to plainly convert the Avro DOUBLE data type, as per the recommendations here.
bq CLI version:
$ bq version
This is BigQuery CLI 2.0.56
Feel free to expand the original question with more specific information on the issue you're hitting, so I can assist more accurately with a solution.
UPDATE:
I've checked the information provided, but I'm still confused by the error you're getting. Apparently, in your case the use_avro_logical_types=True flag does not perform the logical type conversion. However, I've found this PIT feature request where people are asking to "whitelist" their projects in order to get the Avro logical type functionality, e.g. this comment. Since this feature has been rolled out to the global community, it might be an oversight that some GCP projects are not enabled to use it.

BigQuery: How do I add a field to a REPEATED record?

I've got a table in Google BigQuery that consists of a few fields, then a REPEATED record which may contain one or more objects. I want to create a new table with an extra field in the REPEATED data, and copy my original data into the new table, populating the new field with the output of GENERATE_UUID() so there is one unique identifier per REPEATED line of data.
I had a similar question at How do I copy from one BigQuery Table to another when the target contains REPEATED fields? but I don't know how to adapt this to fit my current use case.
Here's my "new" Schema 1 (ie Schema 2 from the above link)
[
{"name": "id", "type": "NUMERIC", "mode": "REQUIRED"},
{"name": "name", "type": "STRING", "mode": "REQUIRED"},
{"name": "created", "type": "TIMESTAMP", "mode": "REQUIRED"},
{"name": "valid", "type": "BOOLEAN", "mode": "REQUIRED"},
{"name": "parameters", "type": "RECORD", "mode": "REPEATED", "fields":
[
{"name": "parameter1", "type": "STRING", "mode": "REQUIRED"},
{"name": "parameter2", "type": "FLOAT", "mode": "REQUIRED"},
{"name": "parameter3", "type": "BOOLEAN", "mode": "REQUIRED"}
]
}
]
and I'd like it to end up like this, Schema 2:
[
{"name": "id", "type": "NUMERIC", "mode": "REQUIRED"},
{"name": "name", "type": "STRING", "mode": "REQUIRED"},
{"name": "created", "type": "TIMESTAMP", "mode": "REQUIRED"},
{"name": "valid", "type": "BOOLEAN", "mode": "REQUIRED"},
{"name": "parameters", "type": "RECORD", "mode": "REPEATED", "fields":
[
{"name": "uuid", "type": "STRING", "mode": "REQUIRED"},
{"name": "parameter1", "type": "STRING", "mode": "REQUIRED"},
{"name": "parameter2", "type": "FLOAT", "mode": "REQUIRED"},
{"name": "parameter3", "type": "BOOLEAN", "mode": "REQUIRED"}
]
}
]
So I've got my new table (Table 2) created with this Schema. I want to copy from Table 1, and I'm trying something like this:
insert into table2_with_uuid(id, name, created, valid, parameters)
select id, name, created, valid,
[(
GENERATE_UUID(), parameters.parameter1, parameters.parameter2, parameters.parameter3
)]
from table1_no_guid;
This gives me an error saying:
Cannot access field parameter1 on a value with type ARRAY<STRUCT<parameter1 (etc.)
Does anyone have any suggestions as to how to proceed? Thanks!
I followed the Data Manipulation Language syntax procedure in the official documentation.
Basically, what you want is to update repeated records. I followed all the examples, from the inserts to the updates, up to the point where a second comment is added to the repeated record.
Then I applied the UNNEST query:
insert into `testing.followingDMLmod` (product, quantity, supply_constrained, comments)
select product, quantity, supply_constrained,
[(
GENERATE_UUID(), com.created, com.comment
)]
from `testing.followingDML` , UNNEST(comments) com;
which of course works but does not provide the desired result.
As per the official documentation, "BigQuery natively supports several schema changes such as adding a new nested field to a record or relaxing a nested field's mode." So perhaps the path is to copy the table and afterwards add the extra field.
That can be done following the managing table schemas documentation: either using the API and calling tables.patch, which was discussed in more detail in this other Stack Overflow post, or using a JSON file with the schema from the command line.
I personally followed the second approach (JSON schema file) and it worked perfectly for me. In more detail, the steps I followed are (as found here):
Use Copy table in the BigQuery UI to get a replica of your table without "id". My starting table is followingDML and the copy followingDMLmod.
Copy the schema from your table into a JSON file (here called myschema.json) by running the following command in the Cloud Shell
bq show \
--schema \
--format=prettyjson \
testing.followingDMLmod > myschema.json
Open the schema in a text editor. For example running
vim myschema.json
Now modify the schema to add the new nested column at the end of the fields array of the repeated record (comments in this example). (If you have never used vim, a very simplified explanation: "Esc" returns you to normal mode; from normal mode, pressing "i" lets you type into the opened file, ":w" saves the file, and ":q" exits.)
I included the field "id":
{
"mode": "NULLABLE",
"name": "id",
"type": "STRING"
}
Now you need to update the schema by running
bq update testing.followingDMLmod myschema.json
Finally, back on the BigQuery UI I used the query
UPDATE `testing.followingDMLmod`
SET comments = ARRAY(
SELECT AS STRUCT * REPLACE(GENERATE_UUID() AS id)
FROM UNNEST(comments)
)
WHERE true
to populate the id field after creation, following what is suggested in this Stack Overflow post. Now the end result is truly what was expected!
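For completeness, the first route mentioned above (the API / tables.patch approach) could be done from Python with the google-cloud-bigquery client along these lines; this is only a sketch using the table and field names from this example, not something run against the original project:
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("testing.followingDMLmod")

new_schema = []
for field in table.schema:
    if field.name == "comments" and field.field_type == "RECORD":
        # Rebuild the repeated record with the extra NULLABLE "id" sub-field appended.
        subfields = list(field.fields) + [bigquery.SchemaField("id", "STRING", mode="NULLABLE")]
        field = bigquery.SchemaField("comments", "RECORD", mode=field.mode, fields=subfields)
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])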
Everyone's correct, and incorrect. The UNNEST replaces the original data with one row per repeated record. Trying this query:
insert into dummydata_withuuid (id, name, created, valid, parameters)
select id, name, created, valid,
[(
GENERATE_UUID(), parameters.parameter1, parameters.parameter2, parameters.parameter3
)]
from dummydata_nouuid;
shows an error on the first parameters.parameter1: "Cannot access field parameter1 on a value with type ARRAY<STRUCT<...>> at [5:29]"
However, if I remove the insert into... line and modify the query as below, it becomes valid.
-- insert into dummydata_withuuid (id, name, created, valid, parameters)
select id, name, created, valid,
[(
GENERATE_UUID(), parameters
)]
from dummydata_nouuid;
And I can save the results as another table, which is a roundabout way of getting the answer I need. Is there something I need to modify in my insert into... line to make the query valid?
I managed to find an answer to this just before I posted, but thought it would be useful to others to share the method. Here's the query that worked:
insert into table2_with_uuid(id, name, created, valid, parameters)
select id, name, created, valid,
[(
GENERATE_UUID(), params.parameter1, params.parameter2, params.parameter3
)]
from table1_no_guid, UNNEST(parameters) params;
Hope this is useful! Please feel free to add to my result or comment to continue the conversation.
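If you want to keep one output row per input row rather than one row per unnested parameter (the flattening issue noted above), a variant that combines this insert with the ARRAY(SELECT AS STRUCT ...) pattern from the earlier answer should also work; table and column names are as in the schemas above:
insert into table2_with_uuid(id, name, created, valid, parameters)
select id, name, created, valid,
  ARRAY(
    -- one STRUCT per original parameters element, each with its own uuid
    SELECT AS STRUCT GENERATE_UUID() AS uuid, p.parameter1, p.parameter2, p.parameter3
    FROM UNNEST(parameters) AS p
  )
from table1_no_guid;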