Kafka Connect: How to extract a field

I'm using Debezium SQL Server Connector to stream a table into a topic. Thanks to Debezium's ExtractNewRecordState SMT, I'm getting the following message in my topic.
{
  "schema":{
    "type":"struct",
    "fields":[
      {
        "type":"int64",
        "optional":false,
        "field":"id"
      },
      {
        "type":"string",
        "optional":false,
        "field":"customer_code"
      },
      {
        "type":"string",
        "optional":false,
        "field":"topic_name"
      },
      {
        "type":"string",
        "optional":true,
        "field":"payload_key"
      },
      {
        "type":"boolean",
        "optional":false,
        "field":"is_ordered"
      },
      {
        "type":"string",
        "optional":true,
        "field":"headers"
      },
      {
        "type":"string",
        "optional":false,
        "field":"payload"
      },
      {
        "type":"int64",
        "optional":false,
        "name":"io.debezium.time.Timestamp",
        "version":1,
        "field":"created_on"
      }
    ],
    "optional":false,
    "name":"test_server.dbo.kafka_event.Value"
  },
  "payload":{
    "id":129,
    "customer_code":"DVTPRDFT411",
    "topic_name":"DVTPRDFT411",
    "payload_key":null,
    "is_ordered":false,
    "headers":"{\"kafka_timestamp\":1594566354199}",
    "payload":"MSG 18",
    "created_on":1594595154267
  }
}
After adding value.converter.schemas.enable=false, I got rid of the schema portion and only the payload part is left, as shown below.
{
  "id":130,
  "customer_code":"DVTPRDFT411",
  "topic_name":"DVTPRDFT411",
  "payload_key":null,
  "is_ordered":false,
  "headers":"{\"kafka_timestamp\":1594566354199}",
  "payload":"MSG 19",
  "created_on":1594595154280
}
I'd like to go one step further and extract only the customer_code field. I tried the ExtractField$Value SMT, but I keep getting the exception IllegalArgumentException: Unknown field: customer_code.
My configuration is as follows:
transforms=unwrap,extract
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
transforms.unwrap.drop.tombstones=true
transforms.unwrap.delete.handling.mode=drop
transforms.extract.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.extract.field=customer_code
I tried a bunch of other SMTs, including ExtractField$Key and ValueToKey, but I couldn't make it work. I'd be very grateful if you could show me what I've done wrong. According to this tutorial from Confluent, it should work, but it didn't.
** UPDATE **
I'm running Kafka Connect in standalone mode: connect-standalone worker.properties sqlserver.properties.
worker.properties
offset.storage.file.filename=C:/development/kafka_2.12-2.5.0/data/kafka/connect/connect.offsets
plugin.path=C:/development/kafka_2.12-2.5.0/plugins
bootstrap.servers=127.0.0.1:9092
offset.flush.interval.ms=10000
rest.port=10082
rest.host.name=127.0.0.1
rest.advertised.port=10082
rest.advertised.host.name=127.0.0.1
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false
sqlserver.properties
name=sql-server-connector
connector.class=io.debezium.connector.sqlserver.SqlServerConnector
database.hostname=127.0.0.1
database.port=1433
database.user=sa
database.password=dummypassword
database.dbname=STGCTR
database.history.kafka.bootstrap.servers=127.0.0.1:9092
database.server.name=wfo
table.whitelist=dbo.kafka_event
database.history.kafka.topic=db_schema_history
transforms=unwrap,extract
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
transforms.unwrap.drop.tombstones=true
transforms.unwrap.delete.handling.mode=drop
transforms.extract.type=org.apache.kafka.connect.transforms.ExtractField$Value
transforms.extract.field=customer_code

The presence of the schema and payload fields suggests the data was serialized by a JsonConverter with schemas enabled.
You can just set value.converter.schemas.enable=false to achieve your goal.
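For reference, a minimal sketch of how that converter setting combines with the extract transform from the question's update, assuming the field should be taken from the record value rather than the key (all property names are the ones already used above):
# worker.properties: emit plain JSON payloads without the schema envelope
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false
# connector properties: unwrap the Debezium envelope, then extract from the record value
transforms=unwrap,extract
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
transforms.extract.type=org.apache.kafka.connect.transforms.ExtractField$Value
transforms.extract.field=customer_code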

Related

Counting $lookup and $unwind documents filtered with $match without getting rid of parent document when all results match

I have a collection "Owners" and I want to return a list of "Owner" matching a filter (any filter), plus the count of "Pet" from the "Pets" collection for that owner, except I don't want the dead pets. (made up example)
I need the returned documents to look exactly like an "Owner" document with the addition of the "petCount" field because I'm using Java Pojos with the Mongo Java driver.
I'm using AWS DocumentDB, which does not support $lookup with filters yet. If it did, I would use this and I'd be done:
db.Owners.aggregate( [
  { $match: {_id: UUID("b13e733d-2686-4266-a686-d3dae6501887")} },
  { $lookup: { from: 'Pets', as: 'pets', 'let': { ownerId: '$_id' }, pipeline: [ { $match: { $expr: { $ne: ['$state', 'DEAD'] } } } ] } },
  { $addFields: { petCount: { $size: '$pets' } } },
  { $project: { pets: 0 } }
]).pretty()
But since it doesn't, this is what I've got so far:
db.Owners.aggregate( [
  { $match: {_id: { $in: [ UUID("cbb921f6-50f8-4b0c-833f-934998e5fbff") ] } } },
  { $lookup: { from: 'Pets', localField: '_id', foreignField: 'ownerId', as: 'pets' } },
  { $unwind: { path: '$pets', preserveNullAndEmptyArrays: true } },
  { $match: { 'pets.state': { $ne: 'DEAD' } } },
  { "$group": {
      "_id": "$_id",
      "doc": { "$first": "$$ROOT" },
      "pets": { "$push": "$pets" }
    }
  },
  { $addFields: { "doc.petCount": { $size: '$pets' } } },
  { $replaceRoot: { "newRoot": "$doc" } },
  { $project: { pets: 0 } }
]).pretty()
This works perfectly, except if an Owner only has "DEAD" pets, then the owner doesn't get returned because all the "document copies" got filtered out by the $match. I'd need the parent document to be returned with petCount = 0 when ALL of them are "DEAD". I cannot figure out how to do this.
Any ideas?
These are the supported operations for DocDB 4.0 https://docs.amazonaws.cn/en_us/documentdb/latest/developerguide/mongo-apis.html
EDIT: updated to use $filter, as $reduce is not supported by AWS DocumentDB.
You can use $filter to keep only the pets that are not DEAD in the looked-up array, then count the size of the remaining array.
Here is the Mongo playground for your reference.
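The playground pipeline isn't reproduced here, but a sketch of that approach, using the collection and field names from the question, could look like this:
db.Owners.aggregate( [
  { $match: { _id: UUID("cbb921f6-50f8-4b0c-833f-934998e5fbff") } },
  { $lookup: { from: 'Pets', localField: '_id', foreignField: 'ownerId', as: 'pets' } },
  // keep only the pets that are not DEAD, then count them; owners with no live pets get petCount = 0
  { $addFields: { petCount: { $size: { $filter: {
      input: '$pets', as: 'pet', cond: { $ne: ['$$pet.state', 'DEAD'] }
  } } } } },
  { $project: { pets: 0 } }
] ).pretty()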
$reduce version
You can use $reduce in your aggregation pipeline to do a conditional sum over the state.
Here is the Mongo playground for your reference.
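Likewise, a sketch of the $reduce variant under the same assumptions:
db.Owners.aggregate( [
  { $match: { _id: UUID("cbb921f6-50f8-4b0c-833f-934998e5fbff") } },
  { $lookup: { from: 'Pets', localField: '_id', foreignField: 'ownerId', as: 'pets' } },
  // add 1 to the running total for every pet whose state is not DEAD
  { $addFields: { petCount: { $reduce: {
      input: '$pets',
      initialValue: 0,
      in: { $add: ['$$value', { $cond: [ { $ne: ['$$this.state', 'DEAD'] }, 1, 0 ] }] }
  } } } },
  { $project: { pets: 0 } }
] ).pretty()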
As of January 2022, Amazon DocumentDB has added support for $reduce, so the solution posted above should work for you.
Reference.

Which class in AWS CDK have option to configure Dynamic partitioning for Kinesis delivery stream

I'm using a Kinesis delivery stream to send a stream from EventBridge to an S3 bucket, but I can't seem to find which class has the option to configure dynamic partitioning.
This is my code for the delivery stream:
new CfnDeliveryStream(this, `Export-delivery-stream`, {
  s3DestinationConfiguration: {
    bucketArn: bucket.bucketArn,
    roleArn: kinesisFirehoseRole.roleArn,
    prefix: `test/!{timestamp:yyyy/MM/dd}/`
  }
});
I have been working on the same issue for a few days and have finally gotten something to work. Here is an example of how it can be implemented in CDK. In short, the partitioning has to be enabled, as you have done, but you also need to set the key and the .jq expression in the so-called processingConfiguration.
Our incoming JSON data looks something like this:
{
  "data":
  {
    "timestamp":1633521266990,
    "defaultTopic":"Topic",
    "data":
    {
      "OUT1":"Inactive",
      "Current_mA":3.92
    }
  }
}
The CDK code looks like the following:
const DeliveryStream = new CfnDeliveryStream(this, 'deliverystream', {
  deliveryStreamName: 'deliverystream',
  extendedS3DestinationConfiguration: {
    cloudWatchLoggingOptions: {
      enabled: true,
    },
    bucketArn: Bucket.bucketArn,
    roleArn: deliveryStreamRole.roleArn,
    prefix: 'defaultTopic=!{partitionKeyFromQuery:defaultTopic}/!{timestamp:yyyy/MM/dd}/',
    errorOutputPrefix: 'error/!{firehose:error-output-type}/',
    bufferingHints: {
      intervalInSeconds: 60,
    },
    dynamicPartitioningConfiguration: {
      enabled: true,
    },
    processingConfiguration: {
      enabled: true,
      processors: [
        {
          type: 'MetadataExtraction',
          parameters: [
            {
              parameterName: 'MetadataExtractionQuery',
              parameterValue: '{defaultTopic: .data.defaultTopic}',
            },
            {
              parameterName: 'JsonParsingEngine',
              parameterValue: 'JQ-1.6',
            },
          ],
        },
        {
          type: 'AppendDelimiterToRecord',
          parameters: [
            {
              parameterName: 'Delimiter',
              parameterValue: '\\n',
            },
          ],
        },
      ],
    },
  },
})

Databricks Job API create job with single node cluster

I am trying to figure out why I get the following error when I use the Databricks Job API.
{
  "error_code": "INVALID_PARAMETER_VALUE",
  "message": "Cluster validation error: Missing required field: settings.cluster_spec.new_cluster.size"
}
What I did:
I created a job running on a single-node cluster using the Databricks UI.
I copy-pasted the job config JSON from the UI.
I deleted my job and tried to recreate it by sending a POST to the Job API with the copied JSON, which looks like this:
{
  "new_cluster": {
    "spark_version": "7.5.x-scala2.12",
    "spark_conf": {
      "spark.master": "local[*]",
      "spark.databricks.cluster.profile": "singleNode"
    },
    "azure_attributes": {
      "availability": "ON_DEMAND_AZURE",
      "first_on_demand": 1,
      "spot_bid_max_price": -1
    },
    "node_type_id": "Standard_DS3_v2",
    "driver_node_type_id": "Standard_DS3_v2",
    "custom_tags": {
      "ResourceClass": "SingleNode"
    },
    "enable_elastic_disk": true
  },
  "libraries": [
    {
      "pypi": {
        "package": "koalas==1.5.0"
      }
    }
  ],
  "notebook_task": {
    "notebook_path": "/pathtoNotebook/TheNotebook",
    "base_parameters": {
      "param1": "test"
    }
  },
  "email_notifications": {},
  "name": " jobName",
  "max_concurrent_runs": 1
}
The documentation of the API does not help (I can't find anything about settings.cluster_spec.new_cluster.size). The JSON is copied from the UI, so I guess it should be correct.
Thanks for your help.
Source: https://learn.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/clusters#--create
To create a Single Node cluster, include the spark_conf and custom_tags entries shown in the example and set num_workers to 0.
{
  "cluster_name": "single-node-cluster",
  "spark_version": "7.6.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 0,
  "spark_conf": {
    "spark.databricks.cluster.profile": "singleNode",
    "spark.master": "local[*]"
  },
  "custom_tags": {
    "ResourceClass": "SingleNode"
  }
}

Pubsub to BQ Dataflow template is not parsing RECORD type data

I am using this Dataflow template, "Pub/Sub Topic to BigQuery", to parse JSON with a RECORD-type data structure.
Sample message:
{
  "url":"/i?session_duration=61&app_key=123456&device_id=gdfttyty&sdk_name=javascript_native_web&sdk_version=18.04",
  "body":
  {
    "session_duration":"61",
    "app_key":"eyrttyuyyu78jkjk",
    "device_id":"h1bh41yptik1vtwr8",
    "sdk_name":"javascript_native_web",
    "sdk_version":"18.04",
    "timestamp":"1597057884636",
    "hour":"10",
    "dow":"1"
  },
  "app_key":"eyrttyuyyu78jkjk",
  "timestamp":"1597057884636",
  "ip_address":"0.0.0.0"
}
The schema defined in BigQuery is:
[
  {
    "name":"url",
    "type":"STRING",
    "mode":"NULLABLE"
  },
  {
    "name":"body",
    "type":"RECORD",
    "mode":"REPEATED",
    "fields":[
      {
        "name":"session_duration",
        "type":"STRING",
        "mode":"NULLABLE"
      },
      {
        "name":"app_key",
        "type":"STRING",
        "mode":"NULLABLE"
      },
      {
        "name":"device_id",
        "type":"STRING",
        "mode":"NULLABLE"
      },
      {
        "name":"sdk_name",
        "type":"STRING",
        "mode":"NULLABLE"
      },
      {
        "name":"sdk_version",
        "type":"STRING",
        "mode":"NULLABLE"
      },
      {
        "name":"timestamp",
        "type":"TIMESTAMP",
        "mode":"NULLABLE"
      },
      {
        "name":"hour",
        "type":"TIME",
        "mode":"NULLABLE"
      },
      {
        "name":"dow",
        "type":"STRING",
        "mode":"NULLABLE"
      }
    ]
  },
  {
    "name":"app_key",
    "type":"STRING",
    "mode":"NULLABLE"
  },
  {
    "name":"timestamp",
    "type":"STRING",
    "mode":"NULLABLE"
  },
  {
    "name":"ip_address",
    "type":"STRING",
    "mode":"NULLABLE"
  }
]
Error Message:
{"errors":[{"debugInfo":"","location":"","message":"Repeated record added outside of an array.","reason":"invalid"}],"index":0}
If I parse data without the RECORD type, it gets parsed correctly into the appropriate BigQuery table, but with the RECORD type it gets ingested into the BigQuery-generated <error_records> table.
I managed to successfully insert your sample into BigQuery using the Dataflow Pub/Sub to BigQuery template by applying some modifications:
I included the repeated field in an array by putting it inside square brackets [...].
The body.timestamp value is invalid. You can read here about the difference between the BigQuery TIMESTAMP data type and the UNIX timestamp. You have some options on how to handle this, depending on what you want to do with this timestamp. If you don't need it for analysis, you can easily change the data type of the field to INT64 or STRING, as you have done with the timestamp column of the table.
So the message should be like this:
{
  "url":"/i?session_duration=61&app_key=123456&device_id=gdfttyty&sdk_name=javascript_native_web&sdk_version=18.04",
  "body": [
    {
      "session_duration":"61",
      "app_key":"eyrttyuyyu78jkjk",
      "device_id":"h1bh41yptik1vtwr8",
      "sdk_name":"javascript_native_web",
      "sdk_version":"18.04",
      "timestamp":"1597057884636",
      "hour":"10",
      "dow":"1"
    }
  ],
  "app_key":"eyrttyuyyu78jkjk",
  "timestamp":"1597057884636",
  "ip_address":"0.0.0.0"
}
and the schema with the changed data type for the body.timestamp field like this:
[
  ...,
  {
    "name":"body",
    "type":"RECORD",
    "mode":"REPEATED",
    "fields":[
      ...,
      {
        "name":"timestamp",
        "type":"STRING",
        "mode":"NULLABLE"
      },
      ...
    ]
  },
  ...
]

AWS cloudformation : how to properly create a redis cache cluster

I want to create an elasticache instance using redis.
I think I should use it with "cluster mode disabled" because everything will fit into one server.
In order to not have a SPOF, I want to create a read replica that will be promoted by AWS in case of a failure of the master.
If possible, it would be great to balance the read only operations between master and slave, but it is not mandatory.
I created a functioning master/read-replica setup using the AWS console, then used CloudFormer to create the CloudFormation JSON configuration. CloudFormer created two unlinked AWS::ElastiCache::CacheCluster resources, but reading the docs I don't understand how to link them... For now I have this configuration:
{
  "cachehubcache001": {
    "Type": "AWS::ElastiCache::CacheCluster",
    "Properties": {
      "AutoMinorVersionUpgrade": "true",
      "AZMode": "single-az",
      "CacheNodeType": "cache.t2.small",
      "Engine": "redis",
      "EngineVersion": "3.2.4",
      "NumCacheNodes": "1",
      "PreferredAvailabilityZone": { "Fn::FindInMap" : [ "RegionMap", { "Ref" : "AWS::Region" }, "Az1B"]},
      "PreferredMaintenanceWindow": "sun:04:00-sun:05:00",
      "CacheSubnetGroupName": {
        "Ref": "cachesubnethubprivatecachesubnetgroup"
      },
      "VpcSecurityGroupIds": [
        {
          "Fn::GetAtt": [
            "sgiHubCacheSG",
            "GroupId"
          ]
        }
      ]
    }
  },
  "cachehubcache002": {
    "Type": "AWS::ElastiCache::CacheCluster",
    "Properties": {
      "AutoMinorVersionUpgrade": "true",
      "AZMode": "single-az",
      "CacheNodeType": "cache.t2.small",
      "Engine": "redis",
      "EngineVersion": "3.2.4",
      "NumCacheNodes": "1",
      "PreferredAvailabilityZone": { "Fn::FindInMap" : [ "RegionMap", { "Ref" : "AWS::Region" }, "Az1A"]},
      "PreferredMaintenanceWindow": "sun:02:00-sun:03:00",
      "CacheSubnetGroupName": {
        "Ref": "cachesubnethubprivatecachesubnetgroup"
      },
      "VpcSecurityGroupIds": [
        {
          "Fn::GetAtt": [
            "sgiHubCacheSG",
            "GroupId"
          ]
        }
      ]
    }
  }
}
I know that it is wrong, but I can't figure out how to create a correct replica. I can't understand the AWS docs; for a start, I can't figure out which type I should use between:
http://docs.aws.amazon.com/fr_fr/AWSCloudFormation/latest/UserGuide/aws-resource-elasticache-replicationgroup.html
http://docs.aws.amazon.com/fr_fr/AWSCloudFormation/latest/UserGuide/aws-properties-elasticache-cache-cluster.html
Since CloudFormer created AWS::ElastiCache::CacheCluster I'll go with it, but I've got the feeling that it should have created only one resource and used the NumCacheNodes parameter to create the two nodes.
Redis can't use NumCacheNodes, AZMode, or PreferredAvailabilityZones, so I don't know how to make this solution multi-AZ...
I managed to do this using AWS::ElastiCache::ReplicationGroup; the NumCacheClusters parameter provides the possibility to have several servers. Beware: it seems that you have to handle the connection to master/slave yourself (but in case of a master failure, AWS should normally detect it and change the DNS of a slave so that you don't have to change your configuration). Here is a sample:
"hubElastiCacheReplicationGroup" : {
"Type" : "AWS::ElastiCache::ReplicationGroup",
"Properties" : {
"ReplicationGroupDescription" : "Hub WebServer redis cache cluster",
"AutomaticFailoverEnabled" : "false",
"AutoMinorVersionUpgrade" : "true",
"CacheNodeType" : "cache.t2.small",
"CacheParameterGroupName" : "default.redis3.2",
"CacheSubnetGroupName" : { "Ref": "cachesubnethubprivatecachesubnetgroup" },
"Engine" : "redis",
"EngineVersion" : "3.2.4",
"NumCacheClusters" : { "Ref" : "ElasticacheRedisNumCacheClusters" },
"PreferredMaintenanceWindow" : "sun:04:00-sun:05:00",
"SecurityGroupIds" : [ { "Fn::GetAtt": ["sgHubCacheSG", "GroupId"] } ]
}
},
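If the application then needs to know where to connect, one option is to expose the replication group's endpoint from the template. A sketch (assuming the resource name above and the documented PrimaryEndPoint return values of AWS::ElastiCache::ReplicationGroup):
"Outputs" : {
  "HubRedisPrimaryEndpoint" : {
    "Description" : "host:port of the Redis replication group's primary endpoint",
    "Value" : { "Fn::Join" : [ ":", [
      { "Fn::GetAtt" : [ "hubElastiCacheReplicationGroup", "PrimaryEndPoint.Address" ] },
      { "Fn::GetAtt" : [ "hubElastiCacheReplicationGroup", "PrimaryEndPoint.Port" ] }
    ] ] }
  }
}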