Terraform BigQuery Table Schema in a separate JSON file - google-bigquery

I know we can define table schemas in Terraform files. My question is: is there a way to define schemas in separate files and have Terraform import them at run time? That would offer better manageability and readability.
resource "google_bigquery_table" "default" {
dataset_id = google_bigquery_dataset.default.dataset_id
table_id = "bar"
time_partitioning {
type = "DAY"
}
labels = {
env = "default"
}
**schema = <<EOF
[
{
"name": "permalink",
"type": "STRING",
"mode": "NULLABLE",
"description": "The Permalink"
}
]**
EOF
}
So basically what I am asking is: how can I move the schema portion to a separate file that Terraform imports at run time?

If the schema will not be dynamically generated, then you can use the file function for this purpose:
schema = file("${path.module}/nested_path_to/schema.json")
schema.json:
[
  {
    "name": "permalink",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "The Permalink"
  }
]
If the schema will be dynamically generated, then you should use the templatefile function for this purpose:
schema = templatefile("${path.module}/nested_path_to/schema.json.tmpl", { variable_name = variable_value } )
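For example, the template file itself stays plain JSON with Terraform interpolation sequences in it. A minimal sketch, reusing the variable_name placeholder from the call above (which field it feeds is purely illustrative):
schema.json.tmpl:
[
  {
    "name": "permalink",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "${variable_name}"
  }
]
Rendering it with templatefile("${path.module}/nested_path_to/schema.json.tmpl", { variable_name = "The Permalink" }) then yields the same JSON as the static file above.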

Related

How to create a bigquery external table with custom partitions

I would like to create a BigQuery external table for data stored in:
gs://mybucket/market/US/v1/part-001.avro
gs://mybucket/market/US/v2/part-001.avro
...
gs://mybucket/market/CA/v1/part-001.avro
gs://mybucket/market/CA/v2/part-001.avro
...
With a table definition file
{
  "avroOptions": {
    "useAvroLogicalTypes": true
  },
  "hivePartitioningOptions": {
    "mode": "CUSTOM",
    "sourceUriPrefix": "gs://mybucket/market/{cc:STRING}/{version:STRING}"
  },
  "sourceFormat": "AVRO",
  "sourceUris": [
    "gs://mybucket/market/*/*/*.avro"
  ]
}
and a schema file:
[
  {
    "mode": "NULLABLE",
    "name": "id",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "name",
    "type": "STRING"
  },
  ...
The table is created with the command:
bq mk --external_table_definition=tabledef.json ds.mytable schema.json
But when trying to query the table, the following error is reported:
Failed to expand table ds.mytable with file pattern gs://mybucket/market/*/*/*.avro: matched no files. File: gs://mybucket/market/*/*/*.avro
I also tried many variations, such as the ones below, but none of them worked.
gs://mybucket/market/*
gs://mybucket/market/**.avro
gs://mybucket/market/**/*.avro
Can anyone see what the problem might be?
Thanks

Using $vars within json schema $ref is undefined

While following the documentation for using variables in JSON Schema, I noticed that the following example fails. It looks like number-type doesn't get stored as a variable and cannot be read.
{
  "$id": "http://example.com/number#",
  "type": "object",
  "properties": {
    "type": {
      "type": "string",
      "enum": ["natural", "integer"]
    },
    "value": {
      "$ref": "#/definitions/{+number-type}",
      "$vars": {
        "number-type": {"$ref": "1/type"}
      }
    }
  },
  "required": ["type", "value"],
  "definitions": {
    "natural": {
      "type": "integer",
      "minimum": 0
    },
    "integer": {
      "type": "integer"
    }
  }
}
results in
Could not find a definition for #/definitions/{+number-type}
tl;dr: $vars is not a JSON Schema keyword. It is an implementation-specific extension.
The documentation you link to is not JSON Schema. It is documentation for a specific library which adds a preprocessing step to its JSON Schema processing model.
As such, this would only ever work when using that library, and would not create an interoperable or reusable JSON Schema, if that's a consideration.
If you are using that library specifically, it sounds like a bug, and you should file an issue in the appropriate repo. As you haven't provided any code, I can't tell which implementation you are using, so I can't be sure of that.

JSON Schema v7: formatMinimum & formatMaximum validate everything

I am using the ajv JSON Schema library (v7) and trying to validate a date based on some value. It looks pretty straightforward using formatMinimum/formatMaximum, but it seems that every date is validated when using these keywords.
Here's my schema
"some-date": {
"type": "object",
"properties": {
"data": {
"type": "object",
"properties": {
"value": {
"type": "string",
"format": "date-time",
"formatMinimum": "2021-03-10T14:25:00.000Z"
}
}
}
}
}
Here's the json:
{
  "some-date": {
    "data": {
      "value": "2011-03-10T14:25:00.000Z"
    }
  }
}
Here's how I am validating:
const ajv = new Ajv({allErrors: true})
require('ajv-formats')(ajv)
require('ajv-errors')(ajv)
require('ajv-keywords')(ajv)
const validate = ajv.validate(mySchema)
const isValid = validate(myJSON)
I've tried it on JSONSchemalint and it validates the above json with the given schema. Also, I have tried with several dates and it validates everything.
Please let me know if I am missing something.
Thanks
I'm not sure where you're getting formatMinimum and formatMaximum from, but they are not standard keywords in the JSON Schema specification, under any version. Are they documented as supported keywords in the implementation that you are using?
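That said, these comparison keywords do exist as a non-standard extension in the ajv ecosystem; for Ajv v7 they are documented in ajv-formats behind an opt-in keywords option. A minimal sketch, assuming that option is available in the ajv-formats version you have installed (check its README):
const Ajv = require("ajv")
const addFormats = require("ajv-formats")

const ajv = new Ajv({allErrors: true})
// Per the ajv-formats docs, this opts in to formatMinimum/formatMaximum (verify for your version)
addFormats(ajv, {keywords: true})

const validate = ajv.compile({
  type: "string",
  format: "date-time",
  formatMinimum: "2021-03-10T14:25:00.000Z"
})
console.log(validate("2011-03-10T14:25:00.000Z")) // expected: false, earlier than the minimum
console.log(validate("2021-04-01T00:00:00.000Z")) // expected: true
Note also that ajv.validate(schema) returns a boolean rather than a function, so the validate-then-call pattern in the question needs ajv.compile(schema), as above, to get a reusable validating function.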

JSON Schema - require all properties

The required field in JSON Schema
JSON Schema features the properties, required and additionalProperties fields. For example,
{
  "type": "object",
  "properties": {
    "elephant": {"type": "string"},
    "giraffe": {"type": "string"},
    "polarBear": {"type": "string"}
  },
  "required": [
    "elephant",
    "giraffe",
    "polarBear"
  ],
  "additionalProperties": false
}
Will validate JSON objects like:
{
  "elephant": "Johnny",
  "giraffe": "Jimmy",
  "polarBear": "George"
}
But will fail if the list of properties is not exactly elephant, giraffe, polarBear.
The problem
I often copy-paste the list of properties to the list of required, and suffer from annoying bugs when the lists don't match due to typos and other silly errors.
Is there a shorter way to denote that all properties are required, without explicitly naming them?
You can just use the "minProperties" property instead of explicitly naming all the fields.
{
  "type": "object",
  "properties": {
    "elephant": {"type": "string"},
    "giraffe": {"type": "string"},
    "polarBear": {"type": "string"}
  },
  "additionalProperties": false,
  "minProperties": 3
}
I doubt there exists a way to specify required properties other than explicitly naming them in the required array.
But if you encounter this issue very often, I would suggest writing a small script that post-processes your JSON Schema and automatically adds the required array for all defined objects.
The script just needs to traverse the JSON Schema tree and, at each level where a "properties" keyword is found, add a "required" keyword listing all keys contained in properties at that level.
Let the machines do the boring stuff.
I do this in code with a one-liner; for instance, when I want all properties required for an insert into a DB, but only want to validate against the base schema when performing an update.
prepareSchema(action) {
  const actionSchema = R.clone(schema) // deep-clone the base schema so it is not mutated
  switch (action) {
    case 'insert':
      actionSchema.$id = `/${schema.$id}-Insert`
      actionSchema.required = Object.keys(schema.properties)
      return actionSchema
    default:
      return schema
  }
}
If you are using the jsonschema library in Python, use custom validators.
First, create a custom validator:
import jsonschema
from jsonschema import Draft4Validator, ValidationError

# Custom validator for requiring all properties listed in the instance to be in the 'required' list of the instance
def allRequired(validator, allRequired, instance, schema):
    if not validator.is_type(instance, "object"):
        return
    if allRequired and "required" in instance:
        # requiring all properties to 'required'
        instanceRequired = instance["required"]
        instanceProperties = list(instance["properties"].keys())
        for property in instanceProperties:
            if property not in instanceRequired:
                yield ValidationError("%r should be required but only the following are required: %r" % (property, instanceRequired))
        for property in instanceRequired:
            if property not in instanceProperties:
                yield ValidationError("%r should be in properties but only the following are properties: %r" % (property, instanceProperties))
Then extend an existing validator:
all_validators = dict(Draft4Validator.VALIDATORS)
all_validators['allRequired'] = allRequired

customValidator = jsonschema.validators.extend(
    validator=Draft4Validator,
    validators=all_validators
)
Now test:
schema = {"allRequired": True}
instance = {"properties": {"name": {"type": "string"}}, "required": []}
v = customValidator(schema)
errors = list(v.iter_errors(instance))  # each item is a ValidationError
You will get the error:
'name' should be required but only the following are required: []
As suggested by others, here's such post-processing Python code:
def schema_to_strict(schema):
    if schema['type'] not in ['object', 'array']:
        return schema
    if schema['type'] == 'array':
        schema['items'] = schema_to_strict(schema['items'])
        return schema
    for k, v in schema['properties'].items():
        schema['properties'][k] = schema_to_strict(v)
    schema['required'] = list(schema['properties'].keys())
    schema['additionalProperties'] = False
    return schema
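For example, applied to the animal schema from earlier in this question, the result gains the full required list (illustration only, not part of the original answer):
strict = schema_to_strict({
    "type": "object",
    "properties": {
        "elephant": {"type": "string"},
        "giraffe": {"type": "string"},
        "polarBear": {"type": "string"}
    }
})
# strict["required"] is now ["elephant", "giraffe", "polarBear"]
# and strict["additionalProperties"] is False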
You can use the function below:
export function addRequiredAttributeRecursive(schema) {
  if (schema.type === 'object') {
    schema.required = [];
    Object.keys(schema.properties).forEach((key) => {
      schema.required.push(key);
      if (schema.properties[key].type === 'object') {
        schema.properties[key] = addRequiredAttributeRecursive(
          schema.properties[key],
        );
      } else if (schema.properties[key].type === 'array') {
        schema.properties[key].items = addRequiredAttributeRecursive(
          schema.properties[key].items,
        );
      }
    });
  } else if (schema.type === 'array') {
    if (schema.items.type === 'object') {
      schema.items = addRequiredAttributeRecursive(schema.items);
    }
  }
  return schema;
}
It recursively writes the required attribute for every object in the schema you have.
If you are using JavaScript, you can use a property getter.
{
  "type": "object",
  "properties": {
    "elephant": {"type": "string"},
    "giraffe": {"type": "string"},
    "polarBear": {"type": "string"}
  },
  get required() { return Object.keys(this.properties) },
  "additionalProperties": false
}

Append data from SELECT to existing table

I'm trying to append data fetched from a SELECT to another existing table but I keep getting the following error:
Provided Schema does not match Table projectId:datasetId.existingTable
Here is my request body:
{
  'projectId': projectId,
  'configuration': {
    'query': {
      'query': query,
      'destinationTable': {
        'projectId': projectId,
        'datasetId': datasetId,
        'tableId': tableId
      },
      'writeDisposition': "WRITE_APPEND"
    }
  }
}
Seems like the writeDisposition option does not get evaluated.
In order for the append to work, the schema of the existing table must exactly match the schema of the query results you're appending. Can you verify that this is the case? (One way to check would be to save this query as a table and compare its schema with the table you are appending to.)
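For example, one way to do that check from the command line with bq (the scratch table name is a placeholder; substitute your own dataset, query and target table):
# Materialize the query into a scratch table, then dump both schemas and compare
bq query --destination_table=datasetid.tmp_schema_check "<the SELECT you are appending>"
bq show --schema --format=prettyjson datasetid.tmp_schema_check
bq show --schema --format=prettyjson datasetid.existingTable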
OK, I think I got something here. That's a weird one...
Actually it does not work even when you have exactly the same schema, because of the field modes.
Here is the source table schema:
"schema": {
"fields": [
{
"name": "ID_CLIENT",
"type": "INTEGER",
"mode": "NULLABLE"
},
{
"name": "IDENTITE",
"type": "STRING",
"mode": "NULLABLE"
}
]
}
If I use the copy functionality from the browser interface (bigquery.cloud.google.com), I get the exact same schema, which is expected:
"schema": {
"fields": [
{
"name": "ID_CLIENT",
"type": "INTEGER",
"mode": "NULLABLE"
},
{
"name": "IDENTITE",
"type": "STRING",
"mode": "NULLABLE"
}
]
}
But then I cannot append the results of the following SELECT to the copied table:
SELECT ID_CLIENT + 1 AS ID_CLIENT, RIGHT(IDENTITE,12) AS IDENTITE FROM datasetid.client
Although it appears to return the same schema in the browser interface view, internally it returns the following schema:
"schema": {
"fields": [
{
"name": "ID_CLIENT",
"type": "INTEGER",
"mode": "REQUIRED"
},
{
"name": "IDENTITE",
"type": "STRING",
"mode": "NULLABLE"
}
]
}
This isn't exactly the same schema (note the mode).
And, weirder, this SELECT:
SELECT ID_CLIENT, IDENTITE FROM datasetid.client
returns this schema:
"schema": {
"fields": [
{
"name": "ID_CLIENT",
"type": "INTEGER",
"mode": "REQUIRED"
},
{
"name": "IDENTITE",
"type": "STRING",
"mode": "REQUIRED"
}
]
}
Conclusion:
Don't rely on table schema information from the browser interface; always use the Tables.get API.
Copy doesn't really work as expected...
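For reference, Tables.get can be called directly against the REST API (bq show --format=prettyjson prints essentially the same table resource); the project, dataset and table IDs below are placeholders:
# Tables.get via the BigQuery v2 REST API; uses gcloud for an access token
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/datasets/datasetid/tables/client"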
I have successfully appended data to an existing table from a CSV file using the bq command-line tool. The only difference I see here is that the configuration uses
write_disposition instead of writeDisposition as shown in the original question.
What I did was add an append flag to the bq command-line utility (Python scripts) for load, and it worked like a charm.
I had to update bq.py with the following:
Added a new flag called --append for the load function.
In the _Load class, under RunWithArgs, checked whether append was set; if so, set 'write_disposition' = 'WRITE_APPEND'.
The code changes to bq.py are as follows.
In the __init__ function of the _Load class, add the following:
flags.DEFINE_boolean(
    'append', False,
    'If true then data is appended to the existing table',
    flag_values=fv)
And in the RunWithArgs function of the _Load class, after the following statement:
if self.replace:
    opts['write_disposition'] = 'WRITE_TRUNCATE'
add the following:
if self.append:
    opts['write_disposition'] = 'WRITE_APPEND'
Now, on the command line,
> bq.py --append=true <mydataset>.<existingtable> <filename>.gz
will append the contents of the compressed (gzipped) CSV file to the existing table.