Writing failed row inserts in a streaming job to BigQuery using the Apache Beam Java SDK? - google-bigquery

While running a streaming job, it is always good to keep a record of the rows that could not be inserted into BigQuery. Catching those rows and writing them into another BigQuery table gives you an idea of what went wrong.
Below are the steps you can take to achieve this.

Pre-requisites:
apache-beam >= 2.10.0 (or the latest version)
Using the getFailedInsertsWithErr() method available in the SDK, you can easily catch the failed inserts and push them to another table for root-cause analysis (RCA). This becomes an important feature for debugging streaming pipelines, which run indefinitely.
BigQueryInsertError is the error object returned by BigQuery for a failed TableRow. It contains the following:
The failed row.
The error message payload and stack trace.
The table reference object.
The above parameters can be captured and pushed into another BigQuery table. An example schema for the error records is shown below, followed by a sketch of the pipeline wiring.
"fields": [{
"name": "timestamp",
"type": "TIMESTAMP",
"mode": "REQUIRED"
},
{
"name": "payloadString",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "errorMessage",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "stacktrace",
"type": "STRING",
"mode": "NULLABLE"
}
]
}
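A minimal sketch of the pipeline wiring, assuming a PCollection<TableRow> named rows and hypothetical table names (adapt both to your project). The key point is that withExtendedErrorInfo() makes the failures available as BigQueryInsertError objects via getFailedInsertsWithErr():

import com.google.api.services.bigquery.model.TableRow;
import java.time.Instant;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryInsertError;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;

// Stream the main rows and keep the extended error information.
WriteResult writeResult = rows.apply("WriteToBigQuery",
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.main_table")                  // hypothetical target table
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        .withExtendedErrorInfo()                                 // needed for getFailedInsertsWithErr()
        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));

// Map each BigQueryInsertError onto the error schema above and push it to the error table.
writeResult.getFailedInsertsWithErr()
    .apply("BuildErrorRow", MapElements.via(
        new SimpleFunction<BigQueryInsertError, TableRow>() {
          @Override
          public TableRow apply(BigQueryInsertError insertError) {
            // "stacktrace" is left NULL here; populate it if you capture one upstream.
            return new TableRow()
                .set("timestamp", Instant.now().toString())
                .set("payloadString", insertError.getRow().toString())
                .set("errorMessage", insertError.getError().toString());
          }
        }))
    .apply("WriteErrorsToBigQuery",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.error_records"));         // hypothetical error table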

Related

Schema evolution when adding a new field

Imagine there are two separate apps: a producer and a consumer.
The code of the producer:
import os
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer
avsc_dir = os.path.dirname(os.path.realpath(__file__))
value_schema = avro.load(os.path.join(avsc_dir, "basic_schema.avsc"))
config = {'bootstrap.servers': 'localhost:9092', 'schema.registry.url': 'http://0.0.0.0:8081'}
producer = AvroProducer(config=config, default_value_schema=value_schema)
producer.produce(topic='testavro', value={'first_name': 'Andrey', 'last_name': 'Volkonsky'})
The basic_schema.avsc file is located within the producer app. Its content:
{
"name": "basic",
"type": "record",
"doc": "basic schema for tests",
"namespace": "python.test.basic",
"fields": [
{
"name": "first_name",
"doc": "first name",
"type": "string"
},
{
"name": "last_name",
"doc": "last name",
"type": "string"
}
]
}
For now it does not matter what's inside the consumer.
We run the producer once and everything is OK. Then I want to add an age field:
basic_schema.avsc:
{
"name": "basic",
"type": "record",
"doc": "basic schema for tests",
"namespace": "python.test.basic",
"fields": [
{
"name": "first_name",
"doc": "first name",
"type": "string"
},
{
"name": "last_name",
"doc": "last name",
"type": "string"
},
{
"name": "age",
"doc": "age",
"type": "int"
}
]
}
Here I get an error:
confluent_kafka.avro.error.ClientError: Incompatible Avro schema:409
They say here https://docs.confluent.io/platform/current/schema-registry/avro.html#summary that for compatibility type == BACKWARD, consumers should be updated first.
I cannot understand this technically. I mean, do I have to copy the basic_schema.avsc file to the consumer
and run it?
If you registered the schema with BACKWARD compatibility (the default), the Confluent Schema Registry simply won't allow you to make an incompatible change such as adding a mandatory field.
You can add an optional field, i.e. one with a default value (see the example at the end of this answer), or use FORWARD compatibility.
The rule about which side should be upgraded first holds regardless of which changes the compatibility setting actually allows you to make.
Edit - additional info:
Don't use forward compatibility simply because you might need to add mandatory fields. Use the compatibility that makes sense for your case based on who can update first; for example, it may be impossible to make all producers upgrade at the same time.
So if you are using backward compatibility AND need to add a mandatory field, you probably need a new version of the service, e.g. topic.v1 and topic.v2, where v2 of the service uses the schema with the new mandatory field and you can deprecate the v1 service, for example.
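For illustration only (the null union and the default value below are my additions, not part of the original schemas), one backward-compatible way to add age is to give the field a default:
{
  "name": "age",
  "doc": "age",
  "type": ["null", "int"],
  "default": null
}
With a default in place, a consumer using the new schema can still read records written with the old schema (the default fills in the missing field), which is what the BACKWARD check requires.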

CosmosDB source returning DF-SYS-01

I've built some pipelines and had success importing data from most of my CosmosDB tables, but this one constantly gives me an error which I don't understand. I think it might be caused by the table structure, but I would highly appreciate a second opinion and possible solutions.
Table item:
{
"id": "someGuid",
"name": "someString",
"pins": [
{
"type": "someString",
"latitude": 47.03923,
"longitude": -122.89136,
"name": "someString"
},
{
"type": "someString",
"latitude": 28.53823,
"longitude": -81.37739,
"name": "someString"
}
],
"_rid": "vj04AOrfr2s8CT0AAAAAAA==",
"_self": "dbs/vj04AA==/colls/vj04AOrfr2s=/docs/vj04AOrfr2s8CT0AAAAAAA==/",
"_etag": "\"ac00ddc8-0000-0700-0000-5e7428230000\"",
"_attachments": "attachments/",
"_ts": 1584670755
}
Columns are identified correctly in the source's Projection tab, where pins is []string, but the source can't be loaded:
"{"message":"at : (StructType(StructField(area,StringType,true), StructField(date,StringType,true), StructField(resultType,StringType,true), StructField(results,ArrayType(StringType,true),true), StructField(test,StringType,true)),StringType) (of class scala.Tuple2). Details:at : (StructType(StructField(area,StringType,true), StructField(date,StringType,true), StructField(resultType,StringType,true), StructField(results,ArrayType(StringType,true),true), StructField(test,StringType,true)),StringType) (of class scala.Tuple2)","failureType":"UserError","target":"Pins","errorCode":"DFExecutorUserError"}"
I was able to solve it by: 1. restarting the ADF session, and 2. resetting the schema for the data source, which automatically picked the correct schema. Hopefully this answer will help someone in the future.

Error loading multiple files to BigQuery: too many positional args

Edited: I'm running the bq command line tool from my VM instance in Google Compute Engine.
I've been trying to load multiple CSV files to BigQuery using the bq command line tool, and I keep getting this error:
Too many positional args, still have ['/home/username/csvschema.json']
All my files contain the same schema, since I only copied, pasted, and renamed them for testing purposes (testFiles_1.csv, testFiles_2.csv, testFiles_3.csv). So I'm not sure why I keep getting this error.
These are the steps I took:
1. Created my BigQuery table and manually loaded one file into it, so I don't need to add the schema manually but can rely on auto-detect.
2. Then I typed this command:
bq load --skip_leading_rows=1 gcstransfer.testFile /home/username/testfile_*.csv /home/username/csvschema.json
My schema, obtained by running bq show --format=prettyjson dataset.table, contains:
[
{
"mode": "NULLABLE",
"name": "Channel",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "Date",
"type": "INTEGER"
},
{
"mode": "NULLABLE",
"name": "ID",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "Referral",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "Browser",
"type": "STRING"
}
]
I tried omitting the JSON part, but I get this error instead:
BigQuery error in load operation: Error decoding JSON schema from file /home/username/testfile_2.csv: No JSON object could be decoded
To specify a one-column schema, use "name:string".
It looks like you cannot use wildcards when loading from a local data source. Instead, you can upload the files to a GCS bucket and load them from there (see the example commands below). See the Limitations paragraph in the docs: https://cloud.google.com/bigquery/docs/loading-data-local
Wildcards and comma separated lists are not supported when you load
files from a local data source. Files must be loaded individually.
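A sketch of that workaround, assuming a bucket named my-bucket that you can write to (replace it with your own); bq load does accept wildcards in gs:// URIs:
gsutil cp /home/username/testfile_*.csv gs://my-bucket/
bq load --skip_leading_rows=1 gcstransfer.testFile "gs://my-bucket/testfile_*.csv" /home/username/csvschema.json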

Apache Druid segment merge task submission failure

I am using Druid 0.9.1.1 and trying to merge all the segments of a datasource per day into a single segment. However, the merge task submission fails with the error:
{"error":"Instantiation of [simple type, class io.druid.timeline.DataSegment] value failed: null (through reference chain: java.util.ArrayList[0])"}
I got the segment details from a segment metadata query. The Druid documentation is of no help here, as it only specifies the raw structure of the overall query, not the required segment detail structure. (Below is what the Druid documentation suggests.)
{
"type": "merge",
"id": <task_id>,
"dataSource": <task_datasource>,
"aggregations": <list of aggregators>,
"segments": <JSON list of DataSegment objects to merge>
}
Example query:
{
"type": "merge",
"id": "envoy_merge_task",
"dataSource": "dcap.envoy.diskmounts.kafka",
"segments": [{"id":"dcap.sermon.threshold.kafka_2017-05-22T00:00:00.000Z_2017-05-23T00:00:00.000Z_2017-05-22T07:00:02.951Z","intervals":["2017-05-22T00:00:00.000Z/2017-05-23T00:00:00.000Z"],"columns":{},"size":5460959,"numRows":41577,"aggregators":null,"queryGranularity":null},{"id":"dcap.sermon.threshold.kafka_2017-05-22T00:00:00.000Z_2017-05-23T00:00:00.000Z_2017-05-22T07:00:02.951Z_1","intervals":["2017-05-22T00:00:00.000Z/2017-05-23T00:00:00.000Z"],"columns":{},"size":5448881,"numRows":41577,"aggregators":null,"queryGranularity":null},{"id":"dcap.sermon.threshold.kafka_2017-05-22T00:00:00.000Z_2017-05-23T00:00:00.000Z_2017-05-22T07:00:02.951Z_2","intervals":["2017-05-22T00:00:00.000Z/2017-05-23T00:00:00.000Z"],"columns":{},"size":5454452,"numRows":41571,"aggregators":null,"queryGranularity":null},{"id":"dcap.sermon.threshold.kafka_2017-05-22T00:00:00.000Z_2017-05-23T00:00:00.000Z_2017-05-22T07:00:02.951Z_3","intervals":["2017-05-22T00:00:00.000Z/2017-05-23T00:00:00.000Z"],"columns":{},"size":5456267,"numRows":41569,"aggregators":null,"queryGranularity":null}] }
I have tried different structures for the "segments" key, which all result in the same error.
Example:
"segments": [{"id":"dcap.envoy.diskmounts.kafka_2017-05-21T06:00:00.000Z_2017-05-21T07:00:00.000Z_2017-05-21T06:02:43.482Z"},{"id":"dcap.envoy.diskmounts.kafka_2017-05-21T06:00:00.000Z_2017-05-21T07:00:00.000Z_2017-05-21T06:02:43.482Z_1"},{"id":"dcap.envoy.diskmounts.kafka_2017-05-21T06:00:00.000Z_2017-05-21T07:00:00.000Z_2017-05-21T06:02:43.482Z_2"},{"id":"dcap.envoy.diskmounts.kafka_2017-05-21T06:00:00.000Z_2017-05-21T07:00:00.000Z_2017-05-21T06:02:43.482Z_3"}]
What is the right structure for segment merge tasks?
The format I used for segments is:
"segments":[
{
"dataSource": "wikiticker88",
"interval": "2015-09-12T02:00:00.000Z/2015-09-12T03:00:00.000Z",
"version": "2018-01-16T07:23:16.425Z",
"loadSpec": {
"type": "local",
"path": "/home/linux/druid-0.11.0/var/druid/segments/wikiticker88/2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z/2018-01-16T07:23:16.425Z/0/index.zip"
},
"dimensions": "channel,cityName,comment,countryIsoCode,countryName,isAnonymous,isMinor,isNew,isRobot,isUnpatrolled,metroCode,namespace,page,regionIsoCode,regionName,user",
"metrics": "count,added,deleted,delta,user_unique",
"shardSpec": {
"type": "none"
},
"binaryVersion": 9,
"size": 198267,
"identifier": "wikiticker88_2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z_2018-01-16T07:23:16.425Z"
}
]
Use this endpoint to get the metadata of your segments (example curl calls follow below):
/druid/coordinator/v1/metadata/datasources/{dataSourceName}/segments?full
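For example (the coordinator and overlord host names, the default ports 8081 and 8090, and the merge_task.json file name are assumptions about your setup), you can pull the full segment descriptors from the coordinator and then POST the merge task to the overlord:
curl "http://<coordinator-host>:8081/druid/coordinator/v1/metadata/datasources/dcap.envoy.diskmounts.kafka/segments?full"
curl -X POST -H "Content-Type: application/json" -d @merge_task.json "http://<overlord-host>:8090/druid/indexer/v1/task"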

ARM - How can I get the access key from a storage account to use in AppSettings later in the template?

I'm creating an Azure Resource Manager template that instantiates multiple resources, including an Azure storage account and an Azure App Service with a Web App.
I'd like to be able to capture the primary access key (or the full connection string, either way is fine) from the newly-created storage account, and use that as a value for one of the AppSettings for the Web App.
Is that possible?
Use the listKeys helper function.
"appSettings": [
{
"name": "STORAGE_KEY",
"value": "[listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName')), providers('Microsoft.Storage', 'storageAccounts').apiVersions[0]).keys[0].value]"
}
]
This quickstart does something similar:
https://azure.microsoft.com/en-us/documentation/articles/cache-web-app-arm-with-redis-cache-provision/
The syntax has changed since the other answer was accepted. The error you will now hit is: 'Template language expression property 'key1' doesn't exist, available properties are 'keys''.
Keys are now represented as an array of keys, and the syntax is now:
"StorageAccount": "[Concat('DefaultEndpointsProtocol=https;AccountName=',variables('StorageAccountName'),';AccountKey=',listKeys(resourceId('Microsoft.Storage/storageAccounts', variables('StorageAccountName')), providers('Microsoft.Storage', 'storageAccounts').apiVersions[0]).keys[0].value)]",
See: http://samcogan.com/retrieve-azure-storage-key-in-arm-script/
I have faced this issue twice: first in 2015 and most recently in May 2017.
I need to add connection strings to the Web App - I want to add the strings automatically from the resources generated during deployment from the ARM template, so that I don't have to add these values manually later.
The first time, I used the old version of the listKeys function (it looks like the old version returned the result not as an object but as a value):
"AzureWebJobsStorage": {
"type": "Custom",
"value": "[concat(variables('storageConnectionString'), listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName')), '2015-05-01-preview').key1)]"
},
Today, the latest version of the working template is:
"resources": [
{
"apiVersion": "2015-08-01",
"type": "config",
"name": "connectionstrings",
"dependsOn": [
"[resourceId('Microsoft.Web/Sites/', parameters('webSiteName'))]"
],
"properties": {
"DefaultConnection": {
"value": "[concat('Data Source=tcp:', reference(resourceId('Microsoft.Sql/servers/', parameters('sqlserverName'))).fullyQualifiedDomainName, ',1433;Initial Catalog=', parameters('databaseName'), ';User Id=', parameters('administratorLogin'), '#', parameters('sqlserverName'), ';Password=', parameters('administratorLoginPassword'), ';')]",
"type": "SQLServer"
},
"AzureWebJobsStorage": {
"type": "Custom",
"value": "[concat(variables('storageConnectionString'), listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageName')), '2016-01-01').keys[0].value)]"
},
"AzureWebJobsDashboard": {
"type": "Custom",
"value": "[concat(variables('storageConnectionString'), listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageName')), '2016-01-01').keys[0].value)]"
}
}
},
Thanks.
Below is an example of adding a storage account to ADLA (Azure Data Lake Analytics):
"storageAccounts": [
{
"name": "[parameters('DataLakeAnalyticsStorageAccountname')]",
"properties": {
"accessKey": "[listKeys(variables('storageAccountid'),'2015-05-01-preview').key1]"
}
}
],
In the variables you can keep:
"variables": {
"apiVersion": "[providers('Microsoft.Storage', 'storageAccounts').apiVersions[0]]",
"storageAccountid": "[concat(resourceGroup().id,'/providers/','Microsoft.Storage/storageAccounts/', parameters('DataLakeAnalyticsStorageAccountname'))]"
},