Schema evolution when adding new field - confluent-schema-registry

Imagine there are two separate apps: a producer and a consumer.
The producer code:
import os
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer
avsc_dir = os.path.dirname(os.path.realpath(__file__))
value_schema = avro.load(os.path.join(avsc_dir, "basic_schema.avsc"))
config = {'bootstrap.servers': 'localhost:9092', 'schema.registry.url': 'http://0.0.0.0:8081'}
producer = AvroProducer(config=config, default_value_schema=value_schema)
producer.produce(topic='testavro', value={'first_name': 'Andrey', 'last_name': 'Volkonsky'})
The basic_schema.avsc file is located within the producer app. Its content:
{
"name": "basic",
"type": "record",
"doc": "basic schema for tests",
"namespace": "python.test.basic",
"fields": [
{
"name": "first_name",
"doc": "first name",
"type": "string"
},
{
"name": "last_name",
"doc": "last name",
"type": "string"
}
]
}
For now it does not matter what's inside the consumer.
We run the producer once and everything is OK. Then I want to add an age field:
basic_schema.avsc:
{
"name": "basic",
"type": "record",
"doc": "basic schema for tests",
"namespace": "python.test.basic",
"fields": [
{
"name": "first_name",
"doc": "first name",
"type": "string"
},
{
"name": "last_name",
"doc": "last name",
"type": "string"
},
{
"name": "age",
"doc": "age",
"type": "int"
}
]
}
Here I get an error:
confluent_kafka.avro.error.ClientError: Incompatible Avro schema:409
They say here https://docs.confluent.io/platform/current/schema-registry/avro.html#summary that for compatibility type == BACKWARD, consumers should be updated first.
I cannot understand this technically. Do I have to copy the basic_schema.avsc file to the consumer
and run it?

If you registered the schema with BACKWARD compatibility (the default), Confluent Schema Registry simply won't allow you to make an incompatible change, such as adding a mandatory field.
You can add an optional field (one with a default value) or use FORWARD compatibility.
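For illustration, a rough sketch of how the age field could be declared in basic_schema.avsc so that the change stays backward compatible (assuming a null default is acceptable; consumers using the new schema then see age = null for records written without it):
{
"name": "age",
"doc": "age",
"type": ["null", "int"],
"default": null
}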
The rules about which side should be upgraded first are correct regardless of which changes the compatibility rule actually allows you to make.
Edit - additional info:
Don't use FORWARD compatibility simply because you might need to add mandatory fields. Use the compatibility mode that makes sense for your case based on who can update first; for example, it may be impossible to make all producers upgrade at the same time.
So if you are using BACKWARD compatibility AND need to add a mandatory field, you probably need a new version of the service, e.g. topic.v1 and topic.v2, where v2 of the service uses the schema with the new mandatory field and you can deprecate the v1 service, for example.

Related

Writing failed row inserts in a streaming job to BigQuery using the Apache Beam Java SDK?

While running a streaming job, it's always good to have logs of the rows that were not processed while inserting into BigQuery. Catching those and writing them into another BigQuery table will give an idea of what went wrong.
Below are the steps that you can try to achieve this.
Prerequisites:
apache-beam >= 2.10.0 or later
Using the getFailedInsertsWithErr() function available in the SDK, you can easily catch the failed inserts and push them to another table for performing root-cause analysis. This becomes an important feature for debugging streaming pipelines, which run indefinitely.
BigQueryInsertError is the error object thrown back by BigQuery for a failed TableRow. It contains the following:
The failed row.
The error stacktrace and error message payload.
The table reference object.
The above parameters can be captured and pushed into another BigQuery table. Example schema for the error records:
"fields": [{
"name": "timestamp",
"type": "TIMESTAMP",
"mode": "REQUIRED"
},
{
"name": "payloadString",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "errorMessage",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "stacktrace",
"type": "STRING",
"mode": "NULLABLE"
}
]
}
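As a rough Java sketch of how this might be wired up (untested; rows, outputTableSpec, and errorTableSpec are placeholder names, not from the original answer, and the exact write options depend on your pipeline):
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryInsertError;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.joda.time.Instant;

// Inside the pipeline, assuming `rows` is a PCollection<TableRow> and
// `outputTableSpec` / `errorTableSpec` are "project:dataset.table" strings.
WriteResult writeResult = rows.apply("WriteToBigQuery",
    BigQueryIO.writeTableRows()
        .to(outputTableSpec)
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        .withExtendedErrorInfo()
        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

// Map each failed insert to a row matching the error-table schema above,
// then write it to a separate table for later root-cause analysis.
writeResult.getFailedInsertsWithErr()
    .apply("ToErrorRow", MapElements.via(
        new SimpleFunction<BigQueryInsertError, TableRow>() {
          @Override
          public TableRow apply(BigQueryInsertError insertError) {
            return new TableRow()
                .set("timestamp", Instant.now().toString())
                .set("payloadString", insertError.getRow().toString())
                .set("errorMessage", insertError.getError().toString())
                .set("stacktrace", ""); // fill in if you carry a stacktrace
          }
        }))
    .apply("WriteErrorsToBigQuery",
        BigQueryIO.writeTableRows()
            .to(errorTableSpec)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));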

JSON Schema Array Validation Woes Using oneOf

Hope I might find some help with this validation issue: I have a JSON array that can have multiple object types (video, image). Within that array, the objects have a rel field value. The case I'm working on is that there can only be one object with "rel": "primaryMedia" allowed — either a video or an image.
Here are the object representations for the image and video case, both showing the "rel": "primaryMedia".
{
"media": [
{
"caption": "Caption goes here",
"id": "ncim87659842",
"rel": "primaryMedia",
"type": "image"
},
{
"description": "Shaima Swileh arrived in San Francisco after fighting for 17 months to get a waiver from the U.S. government to be allowed into the country to visit her son.",
"headline": "Yemeni mother arrives in U.S. to be with dying 2-year-old son",
"id": "mmvo1402810947621",
"rel": "primaryMedia",
"type": "video"
}
]
}
Here's a stripped-down version of the schema I created to validate this using oneOf (assuming this will handle my case). It doesn't, however, work as intended.
{
"$id": "http://example.com/schema/rockcms/article.json",
"$schema": "http://json-schema.org/draft-07/schema#",
"definitions": {},
"properties": {
"media": {
"items": {
"oneOf": [
{
"additionalProperties": false,
"properties": {
"caption": {
"type": "string"
},
"id": {
"type": "string"
},
"rel": {
"enum": [
"primaryMedia"
],
"type": "string"
},
"type": {
"enum": [
"image"
],
"type": "string"
}
},
"required": [
"caption",
"id",
"rel",
"type"
],
"type": "object"
},
{
"additionalProperties": false,
"properties": {
"description": {
"type": "string"
},
"headline": {
"type": "string"
},
"id": {
"type": "string"
},
"rel": {
"enum": [
"primaryMedia"
],
"type": "string"
},
"type": {
"enum": [
"video"
],
"type": "string"
}
},
"required": [
"description",
"headline",
"id",
"rel",
"type"
],
"type": "object"
}
]
},
"type": "array"
}
}
}
Using the JSON Schema validator at https://www.jsonschemavalidator.net, the schema validates when data is correctly presented, but it doesn't work when trying to catch errors.
In the case below, headline is added for a video, and it's missing id. This should fail because headline isn't allowed on video, and id is required.
{
"media": [
{
"headline": "Yemeni mother arrives in U.S. to be with dying 2-year-old son",
"rel": "primaryMedia",
"type": "video"
}
]
}
The results I get from the validator, however, aren't entirely expected. Seems it's conflating the two object schemas in its response.
In addition to this, I've separately found that the schema will allow the population of BOTH a video and image object in media, which isn't expected.
Been trying to figure out what I've done wrong, but am stumped. Would very much appreciate some feedback, if anyone has to offer. Thanks in advance!
Validator Response:
Message: JSON is valid against no schemas from 'oneOf'.
Schema path: #/properties/media/items/oneOf
Message: Property 'headline' has not been defined and the schema does not allow additional properties.
Schema path: #/properties/media/items/oneOf/0/additionalProperties
Message: Value "video" is not defined in enum.
Schema path: #/properties/media/items/oneOf/0/properties/type/enum
Message: Required properties are missing from object: description, id.
Schema path: #/properties/media/items/oneOf/1/required
Message: Required properties are missing from object: caption, id.
Schema path: #/properties/media/items/oneOf/0/required
Your question is formed of three parts, but the first two are linked, so I'll address those, although they don't have a "solution" as such.
How can I validate that only one object in an array has a specific key, and the others do not?
You cannot do this with JSON Schema.
The set of JSON Schema keywords which are applicable to arrays does not have a means for expressing "one of the values must match a schema"; rather, these keywords apply to either ALL of the items in the array, or A SPECIFIC item in the array (if items is an array as opposed to an object).
https://datatracker.ietf.org/doc/html/draft-handrews-json-schema-validation-01#section-6.4
The validation output is not what I expect. I expect to only see the failing branch of oneOf which relates to my object type, which
is defined by the type key.
The JSON Schema specification (as of draft-7) does not specify any format for returning errors, however the error structure you get is pretty "complete" in terms of what it's telling you (and is similar to how we are specifying errors should be returned for draft-8).
Consider, the validator knows nothing about your schema or your JSON instance in terms of your business logic.
When validating a JSON instance, a validator may step through all values, and test validation rules against all applicable subschemas. Looking at your errors, both of the schemas in oneOf are applicable to all items in your array, and so all are tested for validation. If one does not satisfy the condition, the others will be tested also. The validator cannot know, when using a oneOf, what your intent was.
You MAY be able to get around this issue by using an if / then combination. If your if schema is simply a const of the type, and your then schema is the full object type, you may get a cleaner error response, but I haven't tested this theory.
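To sketch that idea (untested, draft-07 syntax), the video branch of the oneOf could become an if / then pair like the one below, with a similar pair for the image case sitting alongside it inside an allOf under items:
{
"if": {
"properties": {
"type": {
"const": "video"
}
},
"required": ["type"]
},
"then": {
"additionalProperties": false,
"properties": {
"description": {
"type": "string"
},
"headline": {
"type": "string"
},
"id": {
"type": "string"
},
"rel": {
"enum": ["primaryMedia"],
"type": "string"
},
"type": {
"const": "video"
}
},
"required": ["description", "headline", "id", "rel", "type"]
}
}
Note that an item whose type is neither image nor video would then pass validation unless you also constrain type itself, e.g. with an enum at the items level.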
I want ALL items in the array to be one of the types. What's going on here?
From the spec...
items:
If "items" is a schema, validation succeeds if all elements in the
array successfully validate against that schema.
oneOf:
An instance validates successfully against this keyword if it
validates successfully against exactly one schema defined by this
keyword's value.
What you have done is say: each item in the array should be valid according to [schemaA]. SchemaA: the object should be valid according to one of these schemas: [the schemas inside the oneOf].
I can see how this is confusing when you think "items must be one of the following", but items applies the value schema to each of the items in the array independently.
To make it do what you mean, move the oneOf above items, and then refactor each media type into its own items schema.
Here's a pseudo-schema:
oneOf: [items: properties: type: const: image], [items: properties: type: const: video]
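Expanded into rough draft-07 JSON, that pseudo-schema would look something like the sketch below; only the type constraint is shown in each branch, and the full image and video item schemas from the original would go in their place:
{
"properties": {
"media": {
"type": "array",
"oneOf": [
{
"items": {
"properties": {
"type": {
"const": "image"
}
}
}
},
{
"items": {
"properties": {
"type": {
"const": "video"
}
}
}
}
]
}
}
}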
Let me know if any of this isn't clear or you have any follow up questions.

How to version an API in Azure API Management using Azure Resource Manager

When creating a new API in an Azure API Management Service using the portal, you can specify whether you would like the API to be versioned. However, I can't find a way to replicate this when creating an API in the Management service using ARM. Is this not currently supported, or am I missing something?
I have tried creating a versioned API in the portal and comparing the created template to the template of a non-versioned API and can't see a difference.
Thanks in advance.
To achieve this through ARM scripts you'll need to create an ApiVersionSet resource first:
{
"name": "[concat(variables('ManagementServiceName'), '/', variables('VersionSetName'))]",
"type": "Microsoft.ApiManagement/service/api-version-sets",
"apiVersion": "2017-03-01",
"properties": {
"description": "Api Description",
"displayName": "Api Name",
"versioningScheme": "Segment"
}
}
Then update the apiVersionSetId property on the Microsoft.ApiManagement/service/apis resource:
{
"type": "Microsoft.ApiManagement/service/apis",
"name": "[concat(variables('ManagementServiceName'), '/', variables('ApiName'))]",
"apiVersion": "2017-03-01",
"dependsOn": [
"[resourceId('Microsoft.ApiManagement/service/api-version-sets', variables('ManagementServiceName'), variables('VersionSetName'))]"
],
"properties": {
"displayName": "string",
"apiRevision": "1",
"description": "",
"serviceUrl": "string",
"path": "string",
"protocols": [
"https"
],
"isCurrent": true,
"apiVersion": "v1",
"apiVersionName": "v1",
"apiVersionDescription": "string",
"apiVersionSetId": "[concat('/api-version-sets', variables('VersionSetName'))]"
}
}
The resource for the api-version-sets:
{
"name": "my-api-version-sets",
"type": "api-version-sets",
"apiVersion": "2018-01-01",
"properties": {
"displayName": "Provider API",
"versioningScheme": "Segment"
},
"dependsOn": [
"[concat('Microsoft.ApiManagement/service/', variables('apiManagementServiceName'))]"
]
}
Then add the following to the apis resource:
{
"apiVersion": "2018-01-01",
"type": "apis",
"properties": {
....
"isCurrent": true,
"apiVersion": "v1",
"apiVersionSetId": "/api-version-sets/my-api-version-sets"
}
}
You can specify the version in the new Azure portal as a path segment, a header, or a query string, but the old Azure API Management portal does not support built-in versioning. Anyway, you can specify the versioning in the Web API URL suffix.
If you still have any issue, kindly add some images and describe your issue.
Screenshots: Azure ARM portal (new APIM); Azure APIM portal (old).
Thanks,
Infaaz

ARM - How can I get the access key from a storage account to use in AppSettings later in the template?

I'm creating an Azure Resource Manager template that instantiates multiple resources, including an Azure storage account and an Azure App Service with a Web App.
I'd like to be able to capture the primary access key (or the full connection string, either way is fine) from the newly-created storage account, and use that as a value for one of the AppSettings for the Web App.
Is that possible?
Use the listKeys helper function.
"appSettings": [
{
"name": "STORAGE_KEY",
"value": "[listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName')), providers('Microsoft.Storage', 'storageAccounts').apiVersions[0]).keys[0].value]"
}
]
This quickstart does something similar:
https://azure.microsoft.com/en-us/documentation/articles/cache-web-app-arm-with-redis-cache-provision/
The syntax has changed since the other answer was accepted. The error you will now hit is: Template language expression property 'key1' doesn't exist, available properties are 'keys'.
Keys are now represented as an array of keys, and the syntax is now:
"StorageAccount": "[Concat('DefaultEndpointsProtocol=https;AccountName=',variables('StorageAccountName'),';AccountKey=',listKeys(resourceId('Microsoft.Storage/storageAccounts', variables('StorageAccountName')), providers('Microsoft.Storage', 'storageAccounts').apiVersions[0]).keys[0].value)]",
See: http://samcogan.com/retrieve-azure-storage-key-in-arm-script/
I have faced this issue twice: first in 2015, and most recently today, in May 2017.
I need to add connection strings to the Web App, and I want to add them automatically from the resources generated during deployment from the ARM template. This saves having to add these values manually later.
The first time, I used the old version of the listKeys function (it looks like the old version returns the result not as an object but as a value):
"AzureWebJobsStorage": {
"type": "Custom",
"value": "[concat(variables('storageConnectionString'), listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName')), '2015-05-01-preview').key1)]"
},
Today, the latest version of the working template is:
"resources": [
{
"apiVersion": "2015-08-01",
"type": "config",
"name": "connectionstrings",
"dependsOn": [
"[resourceId('Microsoft.Web/Sites/', parameters('webSiteName'))]"
],
"properties": {
"DefaultConnection": {
"value": "[concat('Data Source=tcp:', reference(resourceId('Microsoft.Sql/servers/', parameters('sqlserverName'))).fullyQualifiedDomainName, ',1433;Initial Catalog=', parameters('databaseName'), ';User Id=', parameters('administratorLogin'), '#', parameters('sqlserverName'), ';Password=', parameters('administratorLoginPassword'), ';')]",
"type": "SQLServer"
},
"AzureWebJobsStorage": {
"type": "Custom",
"value": "[concat(variables('storageConnectionString'), listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageName')), '2016-01-01').keys[0].value)]"
},
"AzureWebJobsDashboard": {
"type": "Custom",
"value": "[concat(variables('storageConnectionString'), listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageName')), '2016-01-01').keys[0].value)]"
}
}
},
Thanks.
Below is an example of adding a storage account to ADLA (Azure Data Lake Analytics):
"storageAccounts": [
{
"name": "[parameters('DataLakeAnalyticsStorageAccountname')]",
"properties": {
"accessKey": "[listKeys(variables('storageAccountid'),'2015-05-01-preview').key1]"
}
}
],
In the variables section you can keep:
"variables": {
"apiVersion": "[providers('Microsoft.Storage', 'storageAccounts').apiVersions[0]]",
"storageAccountid": "[concat(resourceGroup().id,'/providers/','Microsoft.Storage/storageAccounts/', parameters('DataLakeAnalyticsStorageAccountname'))]"
},

Error loading file stored in Google Cloud Storage to Big Query

I have been trying to create a job to load a compressed json file from Google Cloud Storage to a Google BigQuery table. I have read/write access in both Google Cloud Storage and Google BigQuery. Also, the uploaded file belongs in the same project as the BigQuery one.
The problem happens when I access the resource behind this URL https://www.googleapis.com/upload/bigquery/v2/projects/NUMERIC_ID/jobs by means of a POST request. The content of the request to the abovementioned resource is as follows:
{
"kind" : "bigquery#job",
"projectId" : NUMERIC_ID,
"configuration": {
"load": {
"sourceUris": ["gs://bucket_name/document.json.gz"],
"schema": {
"fields": [
{
"name": "id",
"type": "INTEGER"
},
{
"name": "date",
"type": "TIMESTAMP"
},
{
"name": "user_agent",
"type": "STRING"
},
{
"name": "queried_key",
"type": "STRING"
},
{
"name": "user_country",
"type": "STRING"
},
{
"name": "duration",
"type": "INTEGER"
},
{
"name": "target",
"type": "STRING"
}
]
},
"destinationTable": {
"datasetId": "DATASET_NAME",
"projectId": NUMERIC_ID,
"tableId": "TABLE_ID"
}
}
}
}
However, the error doesn't make any sense to me; it can be found below:
{
"error": {
"errors": [
{
"domain": "global",
"reason": "invalid",
"message": "Job configuration must contain exactly one job-specific configuration object (e.g., query, load, extract, spreadsheetExtract), but there were 0: "
}
],
"code": 400,
"message": "Job configuration must contain exactly one job-specific configuration object (e.g., query, load, extract, spreadsheetExtract), but there were 0: "
}
}
I know the problem doesn't lie in either the project ID or the access token placed in the authentication header, because I have successfully created an empty table before. Also, I specify the Content-Type header as application/json, which I don't think is the issue here, because the body content should be JSON encoded.
Thanks in advance
Your HTTP request is malformed -- BigQuery doesn't recognize this as a load job at all.
You need to look into the POST request, and check the body you send.
You need to ensure that all of the above (which seems correct) is the body of the POST call. The above JSON should be on a single line, and if you are manually creating the multipart message, make sure there is an extra newline between the headers and body of each MIME part.
If you are using some sort of library, make sure the body is not expected in some other form, like resource, content, or body. I've seen libraries that use these differently.
Try out the BigQuery API explorer: https://developers.google.com/bigquery/docs/reference/v2/jobs/insert and ensure your request body matches the one generated by the explorer.