Debezium outbox pattern | Is the schema fixed by the SMT/outbox table when using Debezium with the Confluent Schema Registry?

Debezium with the outbox pattern
Setting the context:
1. We wanted to use the schema registry to store all event schemas for the different business entities.
2. One topic can have multiple versions of the same schema.
3. One topic can have entirely different schemas bounded by business context, e.g. customerCreated, customerPhoneUpdated, customerAddressUpdated (using one of the subject name strategies).
I wanted to verify whether Debezium supports points 2 and 3 (especially 3).
Imagine I have two business events, customerCreated and orderCreated, and I want to store both in the same topic "com.business.event".
customerCreated
{
  "id": "244444",
  "name": "test",
  "address": "test 123",
  "email": "test#test.com"
}
orderCreated
{
  "id": "244444",
  "value": "1234",
  "address": "test 123",
  "phone": "3333",
  "deliverydate": "10-12-19"
}
The structure of my outbox table is as per the article below:
https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/
    Column     |          Type          | Modifiers
---------------+------------------------+-----------
 id            | uuid                   | not null
 aggregatetype | character varying(255) | not null
 aggregateid   | character varying(255) | not null
 type          | character varying(255) | not null
 payload       | jsonb                  | not null
Now when I push my business events into the table above, the customerCreated and orderCreated events are stored in the payload column as a string/JSON. If I push this to Kafka in the topic "com.business.event" using the Debezium connector, it produces the messages below (printed with the schema for illustration).
customerCreated.json
{
  "schema": {
    "type": "struct",
    "fields": [
      {
        "type": "string",
        "optional": false,
        "field": "eventType"
      },
      {
        "type": "string",
        "optional": false,
        "name": "io.debezium.data.Json",
        "version": 1,
        "field": "payload"
      }
    ],
    "optional": false
  },
  "payload": {
    "eventType": "Customer Created",
    "payload": "{\"id\": \"2971baea-e5a0-46cb-b1b1-273eaf88246a\", \"name\": \"jitender\", \"email\": \"test\", \"address\": \"700 \"}"
  }
}
orderCreated.json
{
  "schema": {
    "type": "struct",
    "fields": [
      {
        "type": "string",
        "optional": false,
        "field": "eventType"
      },
      {
        "type": "string",
        "optional": false,
        "name": "io.debezium.data.Json",
        "version": 1,
        "field": "payload"
      }
    ],
    "optional": false
  },
  "payload": {
    "eventType": "Order Created",
    "payload": "{\"id\": \"2971baea-e5a0-46cb-b1b1-273eaf88246a\", \"value\": \"123\", \"deliverydate\": \"10-12-19\", \"address\": \"test\", \"phone\": \"700 \"}"
  }
}
Problem:
As you can see in the above examples, the schema in the schema registry/Kafka remains the same even though the payload contains different business entities. When I, as a consumer, try to deserialise such a message, I need to know that the payload can contain different structures depending on the business event it was generated from. In this scenario I cannot utilise the schema registry fully, because the consumer has to know all the business entities in advance.
Questions:
What I want is for Debezium to create two different schemas under the same topic "com.business.event" using a subject name strategy (example below).
https://karengryg.io/2018/08/18/multi-schemas-in-one-kafka-topic/
Then, as a consumer, I would read the schema id from the message, fetch that schema from the schema registry, and decode the message directly with it. After decoding, I can ignore the message if I am not interested in that business event. This way I can have different schemas under the same topic using the schema registry.
Can I control the schema in the Kafka topic when I use Debezium in conjunction with the schema registry? The outbox table/outbox pattern is a must.
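For illustration, this is roughly what I would expect to end up registered in the schema registry, assuming Avro and the TopicRecordNameStrategy; the subject names and field types below are my own sketch of the desired outcome, not something Debezium produces out of the box:
Subject "com.business.event-customerCreated" (hypothetical):
{
  "type": "record",
  "name": "customerCreated",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "name", "type": "string" },
    { "name": "address", "type": "string" },
    { "name": "email", "type": "string" }
  ]
}
Subject "com.business.event-orderCreated" (hypothetical):
{
  "type": "record",
  "name": "orderCreated",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "value", "type": "string" },
    { "name": "address", "type": "string" },
    { "name": "phone", "type": "string" },
    { "name": "deliverydate", "type": "string" }
  ]
}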

Please take a look at https://issues.jboss.org/browse/DBZ-1297. This is probably the solution to your problem and questions, as it aims to unwind the opaque payload string into a Kafka Connect struct. In that case you will have the schema exposed.
It would be good if you could try it with a schema-per-subject name strategy.
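To make that concrete, here is a rough sketch of how the connector configuration could look once the payload is expanded into a typed structure. The outbox EventRouter SMT, its table.expand.json.payload option, the Avro converter and the TopicRecordNameStrategy setting come from the Debezium/Confluent documentation, but the connector name, table name and registry URL are placeholders and I have not verified this exact combination, so treat it as a starting point only:
{
  "name": "outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "table.include.list": "public.outboxevent",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.route.topic.replacement": "com.business.event",
    "transforms.outbox.table.expand.json.payload": "true",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter.value.subject.name.strategy": "io.confluent.kafka.serializers.subject.TopicRecordNameStrategy"
  }
}
Whether the two event types then end up as separate subjects depends on the schema name Debezium assigns to the expanded payload, so this is something to verify rather than a guaranteed outcome.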

Related

GraphQL: how to create a complex filter

I am creating an app where teachers can search for resources. Each resource is tagged with topics that are used to filter results. I have a complex use case and am looking for some advice.
For example:
Resource 1 is tagged with the topic "Maths". Topic "Maths" has a topic label "Subject", which is tier 1
Resource 2 is tagged with the topic "Algebra". Topic "Algebra" has a topic label "Unit", which is tier 2
Resource 2 is tagged with the topic "2019". Topic "2019" has the topic label "Year", which is tier 1
Resource 2 is tagged with the topic "Calculator". Topic "Calculator" has a topic label "Question Type", which is tier 1
Resource 3 is tagged with the topic "Algebra". Topic "Algebra" has a topic label "Unit", which is tier 2
Resource 3 is tagged with the topic "2018". Topic "2018" has a topic label "Year", which is tier 1
I am trying to write a query that allows the user to get all resources that contain the provided topics.
For example:
Get me all resources tagged with the topic "2019" and topic "Algebra". This should return only Resource 2 as it has both these tags
Get me all resources that are tagged with the topic "Algebra". This should return Resource 2 and Resource 3 as they both are tagged with the topic "Algebra."
My current attempt fails to do this as it does not differentiate between the topics. My query is shown below:
query FilterBlocks($topicIds: [bigint!]) {
  block(
    where: {
      tags: {
        topic_id: { _in: $topicIds, _is_null: false }
      }
    }
  ) {
    id
    tags {
      id
      topic {
        id
        title
      }
    }
    type
    ...
  }
}
Any advice on how to go about this would be much appreciated.
If I understand correctly, you are querying for blocks that have a specific tag with a specific topic id.
What you expect: the blocks with only those tags and topics that are actually part of your filter.
What you get: the blocks which fulfil this filter, but with all of their tags and topics.
You need to apply the filter to the subentities as well:
query FilterBlocks($topicIds: [bigint!]) {
  block(
    where: {
      tags: {
        topic_id: { _in: $topicIds, _is_null: false }
      }
    }
  ) {
    id
    tags(
      where: {
        topic_id: { _in: $topicIds, _is_null: false }
      }
    ) {
      id
      topic( where: { id: { _in: $topicIds } } ) {
        id
        title
      }
    }
    type
    ...
  }
}

AWS Glue / Hive struct with undetermined struct

I am adding data to an AWS Glue table where one of the columns is a struct in which one of the values has an undetermined form.
More specifically, there is a known key called 'name' that is a string, and another called 'metadata' that can be a dict with any structure.
Ex:
# Row 1
{
  "name": "Jane",
  "metadata": {
    "foo": 123,
    "bar": "something"
  }
}
# Row 2
{
  "name": "Bill",
  "metadata": {
    "baz": "something else"
  }
}
Note how metadata is a different dictionary in the two entries.
How can this be specified as a struct?
struct<
name:string,
metadata:?
>
I ended up doing what I mentioned in the comment, which is to make the column a string and serialize the JSON blob to a string.
SQL queries then need to deserialize the JSON blob, which is supported in several different implementations, including AWS Athena (the one I'm using).

The provided key element does not match the schema. GraphQL Mutation error

I am trying to test/run a mutation that creates a groupChat in my DynamoDB with id, groupChatName, messages, createdTime, createdUser, users. I have 2 separate tables, UserTable and GroupChatTable. The problem is I keep getting data is null and an error that says "The provided key element does not match the schema. ErrorCode: ValidationException, request ID." Resolvers are attached to my tables, so I am not sure why I am getting this error.
The weird thing is that when I check the GroupChatTable, my mutation is saved incorrectly as an input. This is what it looks like:
Ex: {"createdTime":{"S":"12:00"},"createdUser":{"S":"Me"},........
Below are the mutation, schema type, and resolver.
mutation {
  createGroupChat(input: {
    id: 4
    groupChatName: "newgroup"
    messages: "we love this group"
    createdTime: "12:00"
    createdUser: "Me"
    users: "we, me"
  }) {
    id
    groupChatName
    messages
    createdTime
    createdUser
    users
  }
}
type GroupChat {
  id: ID!
  groupChatName: String!
  messages: String
  createdTime: String!
  createdUser: String!
  users: String
}
{
  "version" : "2017-02-28",
  "operation" : "PutItem",
  "key" : {
    "id": $util.dynamodb.toDynamoDBJson($util.autoId())
  },
  "attributeValues" : $util.dynamodb.toMapValuesJson($ctx.args)
}
It looks like the way the data is being stored through the resolver is incorrect, so when it is returned it doesn't match the schema.
Instead of using $util.dynamodb.toMapValuesJson($ctx.args),
use: $util.dynamodb.toMapValuesJson($util.parseJson($util.toJson($ctx.args.input)))
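For reference, here is a sketch of the full request mapping template with that change applied, assuming the same PutItem resolver shown above:
{
  "version" : "2017-02-28",
  "operation" : "PutItem",
  "key" : {
    "id": $util.dynamodb.toDynamoDBJson($util.autoId())
  },
  ## map only the fields of the "input" argument, not the whole $ctx.args wrapper
  "attributeValues" : $util.dynamodb.toMapValuesJson($util.parseJson($util.toJson($ctx.args.input)))
}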

Is there a way to use the graphLookup aggregation pipeline stage for arrays?

I am currently working on an application that uses MongoDB as the data repository. I am mainly concerned with a graphLookup query to establish links between different people, based on which flights they took. My documents contain an array field that in turn contains key-value pairs. I need to establish the links based on one of the key:value pairs of that array.
I have already tried some aggregation pipeline queries with $graphLookup as one of the stages, and they have all worked fine. But now that I am trying to use it with an array, I am drawing a blank.
Below is the array field from the first document:
"movementSegments": [
  {
    "carrierCode": "MO269",
    "departureDateTimeMillis": 1550932676000,
    "arrivalDateTimeMillis": 1551019076000,
    "departurePort": "DOH",
    "arrivalPort": "LHR",
    "departurePortText": "HAMAD INTERNATIONAL AIRPORT",
    "arrivalPortText": "LONDON HEATHROW",
    "serviceNameText": "",
    "serviceKey": "BA007_1550932676000",
    "departurePortLatLong": "25.273056,51.608056",
    "arrivalPortLatLong": "51.4706,-0.461941",
    "departureWeeklyTemporalSpatialWindow": "DOH_8",
    "departureMonthlyTemporalSpatialWindow": "DOH_2",
    "arrivalWeeklyTemporalSpatialWindow": "LHR_8",
    "arrivalMonthlyTemporalSpatialWindow": "LHR_2"
  }
]
The other document has the field below:
"movementSegments": [
  {
    "carrierCode": "MO269",
    "departureDateTimeMillis": 1548254276000,
    "arrivalDateTimeMillis": 1548340676000,
    "departurePort": "DOH",
    "arrivalPort": "LHR",
    "departurePortText": "HAMAD INTERNATIONAL AIRPORT",
    "arrivalPortText": "LONDON HEATHROW",
    "serviceNameText": "",
    "serviceKey": "BA003_1548254276000",
    "departurePortLatLong": "25.273056,51.608056",
    "arrivalPortLatLong": "51.4706,-0.461941",
    "departureWeeklyTemporalSpatialWindow": "DOH_4",
    "departureMonthlyTemporalSpatialWindow": "DOH_1",
    "arrivalWeeklyTemporalSpatialWindow": "LHR_4",
    "arrivalMonthlyTemporalSpatialWindow": "LHR_1"
  },
  {
    "carrierCode": "MO270",
    "departureDateTimeMillis": 1548254276000,
    "arrivalDateTimeMillis": 1548340676000,
    "departurePort": "DOH",
    "arrivalPort": "LHR",
    "departurePortText": "HAMAD INTERNATIONAL AIRPORT",
    "arrivalPortText": "LONDON HEATHROW",
    "serviceNameText": "",
    "serviceKey": "BA003_1548254276000",
    "departurePortLatLong": "25.273056,51.608056",
    "arrivalPortLatLong": "51.4706,-0.461941",
    "departureWeeklyTemporalSpatialWindow": "DOH_4",
    "departureMonthlyTemporalSpatialWindow": "DOH_1",
    "arrivalWeeklyTemporalSpatialWindow": "LHR_4",
    "arrivalMonthlyTemporalSpatialWindow": "LHR_1"
  }
]
And I am running the query below:
db.person_events.aggregate([
  { $match: { eventId: "22446688" } },
  {
    $graphLookup: {
      from: 'person_events',
      startWith: '$movementSegments.carrierCode',
      connectFromField: 'carrierCode',
      connectToField: 'carrierCode',
      as: 'carrier_connections'
    }
  }
])
The above query creates the array field in the document, but there are no values in it. My expectation is that both documents get linked based on the carrier code.
Just to be clear about the query: the documents contain an eventId field, and the $match stage returns a single document.
Well, I don't know how I missed it, but here is the solution to my problem, which gives me the required results:
db.person_events.aggregate([
  { $match: { eventId: "22446688" } },
  {
    $graphLookup: {
      from: 'person_events',
      startWith: '$movementSegments.carrierCode',
      connectFromField: 'movementSegments.carrierCode',
      connectToField: 'movementSegments.carrierCode',
      as: 'carrier_connections'
    }
  }
])

Kafka Connect S3 sink - how to use the timestamp from the message itself [timestamp extractor]

I've been struggling with a problem using Kafka Connect and the S3 sink.
First the structure:
{
Partition: number
Offset: number
Key: string
Message: json string
Timestamp: timestamp
}
Normally when posting to Kafka, the timestamp should be set by the producer. Unfortunately there seem to be cases where this didn't happen, which means the Timestamp might sometimes be null.
To extract this timestamp the connector was set to the following value:
"timestamp.extractor":"Record".
It is certain, however, that the Message field itself always contains a timestamp as well.
Message:
{
  "timestamp": "2019-04-02T06:27:02.667Z",
  "metadata": {
    "creationTimestamp": "1554186422667"
  }
}
The question now is how to use that field for the timestamp.extractor.
I was thinking that this would suffice, but it doesn't seem to work:
"timestamp.extractor":"RecordField",
"timestamp.field":"message.timestamp",
This results in a NullPointerException as well.
Any ideas on how to use the timestamp from the Kafka message payload itself, instead of the default timestamp field that is set for Kafka v0.10+?
EDIT:
Full config:
{
  "name": "<name>",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "4",
    "topics": "<topic>",
    "flush.size": "100",
    "s3.bucket.name": "<bucket name>",
    "s3.region": "<region>",
    "s3.part.size": "<partition size>",
    "rotate.schedule.interval.ms": "86400000",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "false",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "locale": "ENGLISH",
    "timezone": "UTC",
    "schema.generator.class": "io.confluent.connect.storage.hive.schema.TimeBasedSchemaGenerator",
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "partition.duration.ms": "3600000",
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd",
    "timestamp.extractor": "RecordField",
    "timestamp.field": "message.timestamp",
    "max.poll.interval.ms": "600000",
    "request.timeout.ms": "610000",
    "heartbeat.interval.ms": "6000",
    "session.timeout.ms": "20000",
    "s3.acl.canned": "bucket-owner-full-control"
  }
}
EDIT 2:
Kafka message payload structure:
{
  "reference": "",
  "clientId": "",
  "gid": "",
  "timestamp": "2019-03-19T15:27:55.526Z"
}
EDIT 3:
{
  "transforms": "convert_op_creationDateTime",
  "transforms.convert_op_creationDateTime.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
  "transforms.convert_op_creationDateTime.target.type": "Timestamp",
  "transforms.convert_op_creationDateTime.field": "timestamp",
  "transforms.convert_op_creationDateTime.format": "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"
}
So I tried applying a transform to the object, but it seems I'm stuck again. The pattern is reported as invalid, even though looking around the internet it does seem to be a valid SimpleDateFormat pattern. It seems to be complaining about the 'T'. I have updated the message payload structure above as well.
Based on the schema you've shared, you should be setting:
"timestamp.extractor":"RecordField",
"timestamp.field":"timestamp",
i.e. no message prefix to the timestamp field name.
If the data is a string, then Connect will try to parse it as milliseconds - source code here.
In any case, message.timestamp assumes the data looks like { "message" : { "timestamp": ... } }, so just timestamp would be correct. Having nested fields also didn't use to be possible, so you might want to clarify which version of Connect you have.
I'm not entirely sure how you would get instanceof Date to evaluate to true when using the JsonConverter, and even if you had set schemas.enable = true, you can see in the code that there are only conditions for schema types of numbers and strings, and it still assumes the value is milliseconds.
You can try using the TimestampConverter transformation to convert your date string.
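For illustration, a minimal sketch of the relevant sink properties with that change applied (the rest of your config unchanged); how a string-valued field is parsed still depends on your Connect version, as noted above:
{
  "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
  "partition.duration.ms": "3600000",
  "path.format": "'year'=YYYY/'month'=MM/'day'=dd",
  "timestamp.extractor": "RecordField",
  "timestamp.field": "timestamp"
}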