We are using Kafka Connect [distributed, Confluent 4.0].
It works very well, except that there are always some uncommitted messages left in the topic that the connector listens to. The behavior is probably related to the S3 connector setting "flush.size": "20000". The lag on the topic is always below the flush size.
Our data comes in batches. I don't want to wait until the next batch arrives, nor reduce flush.size and create tons of files.
Is there a way to set a timeout after which the S3 connector flushes the data even if it hasn't reached 20000 events?
thanks!
"config": {
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"topics": "event",
"tasks.max": "3",
"topics.dir": "connect",
"s3.region": "some_region",
"s3.bucket.name": "some_bucket",
"s3.part.size": "5242880",
"flush.size": "20000",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
"schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
"schema.compatibility": "FULL",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"path.format": "'\''day_ts'\''=YYYYMMdd/'\''hour_ts'\''=H",
"partition.duration.ms": "3600000",
"locale": "en_US",
"timezone": "UTC",
"timestamp.extractor": "RecordField",
"timestamp.field": "time"
}
}
To flush outstanding records periodically on low-volume topics with the S3 Connector you may use the configuration property:
rotate.schedule.interval.ms
(Complete list of configs here)
Keep in mind that by using the property above you might see duplicate messages in the event of reprocessing or recovery from errors, regardless of which partitioner you are using.
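As a rough illustration of the suggestion above, the sketch below adds rotate.schedule.interval.ms to an existing connector config while leaving flush.size untouched. The helper name with_scheduled_rotation and the 10-minute interval are hypothetical choices, not values from the question. As far as I know, scheduled rotation also expects the timezone property to be set, so the helper keeps an existing value and only falls back to UTC if none is present.

```python
import json


def with_scheduled_rotation(config, interval_ms, timezone="UTC"):
    """Return a copy of an S3 sink config that also rotates on a wall-clock schedule."""
    updated = dict(config)  # do not mutate the caller's config
    # Connect config values are strings, so the interval is stringified.
    updated["rotate.schedule.interval.ms"] = str(interval_ms)
    # Assumption: scheduled rotation needs a timezone; keep the existing one if set.
    updated.setdefault("timezone", timezone)
    return updated


# Subset of the config from the question, for illustration only.
base = {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "event",
    "flush.size": "20000",
    "timezone": "UTC",
}

# Hypothetical interval: flush whatever is buffered every 10 minutes.
config = with_scheduled_rotation(base, 10 * 60 * 1000)
print(json.dumps(config, indent=2))
```

With this in place a file is written when either flush.size is reached or the scheduled interval elapses, whichever comes first, so the duplicate-message caveat above still applies.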
I have a Kafka topic and I would like to feed it with Avro data (it is currently JSON). I know the "proper" way to do it is to use the Schema Registry, but for testing purposes I would like to make it work without it.
So I am sending Avro data as Array[Byte] as opposed to regular JSON objects:
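import java.io.{ByteArrayOutputStream, File}
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory
// The writer needs a parsed Schema, not a file name.
val schema = new Schema.Parser().parse(new File("mySchema.avsc"))
val writer = new GenericDatumWriter[GenericRecord](schema)
val out = new ByteArrayOutputStream
val encoder = EncoderFactory.get.binaryEncoder(out, null)
writer.write(myAvroData, encoder) // myAvroData is a GenericData.Record
encoder.flush()
out.close()
out.toByteArray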
The schema is embedded within each record; how can I make it work with Kafka Connect? The Kafka Connect configuration currently has the following properties (data is written to S3 as json.gz files), and I want to write Parquet files:
{
"name": "someName",
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"tasks.max": "120",
"topics": "user_sync",
"s3.region": "someRegion",
"s3.bucket.name": "someBucket",
"s3.part.size": "5242880",
"s3.compression.type": "gzip",
"filename.offset.zero.pad.width": "20",
"flush.size": "5000",
"rotate.interval.ms": "600000",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"format.class": "io.confluent.connect.s3.format.json.JsonFormat",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"path.format": "YYYY/MM/dd/HH",
"timezone" : "UTC",
"locale": "en",
"partition.duration.ms": "600000",
"timestamp.extractor": "RecordField",
"timestamp.field" : "ts",
"schema.compatibility": "NONE"
I suppose I need to change "format.class" to "io.confluent.connect.hdfs.parquet.ParquetFormat"? But is that enough?
Thanks a lot!
JsonConverter will be unable to consume Avro-encoded data, since the binary format contains a schema ID from the registry that needs to be extracted before the converter can determine what the data looks like.
You'll want to use the registryless-avro-converter, which creates a Struct object that should then be convertible to a Parquet record.
I want to set an alarm that fires when any EMR cluster is terminated (caused by internal errors). I know there is an "IsIdle" metric, but my EMR clusters are designed to be persistent, so "IsIdle" does not really fit my case. Is there a health-check metric that I can use?
You can configure Amazon CloudWatch to send a "State Change" event to another service like an AWS Lambda function or an Amazon SNS topic.
To achieve this, open the CloudWatch console and, in the navigation pane, click Rules > Create rule.
Service Name: EMR
Event Type: State Change
Specific detail type(s): EMR Cluster State Change
Specific State: TERMINATED and TERMINATED_WITH_ERRORS
Targets: Put the receiving service of your choice.
Here's an example of such an event:
{
"version": "0",
"id": "8535abb0-f87e-4640-b7b6-8de000dfc30a",
"detail-type": "EMR Cluster State Change",
"source": "aws.emr",
"account": "123456789012",
"time": "2016-12-16T21:00:23Z",
"region": "us-east-1",
"resources": [],
"detail": {
"severity": "INFO",
"stateChangeReason": "{\"code\":\"USER_REQUEST\",\"message\":\"Terminated by user request\"}",
"name": "Development Cluster",
"clusterId": "j-1YONHTCP3YZKC",
"state": "TERMINATED",
"message": "Amazon EMR Cluster j-1YONHTCP3YZKC (Development Cluster) has terminated at 2016-12-16 21:00 UTC with a reason of USER_REQUEST."
}
}
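For the receiving side, here is a minimal sketch of how a target (for example the body of a Lambda handler) might inspect such an event. The function name summarize_emr_event and the shape of the returned dict are hypothetical. EVENT_PATTERN is my reading of what the rule settings listed above correspond to, so treat it as an assumption rather than something copied from the console.

```python
import json

# Assumed equivalent of the rule configured in the console above.
EVENT_PATTERN = {
    "source": ["aws.emr"],
    "detail-type": ["EMR Cluster State Change"],
    "detail": {"state": ["TERMINATED", "TERMINATED_WITH_ERRORS"]},
}

ALERT_STATES = set(EVENT_PATTERN["detail"]["state"])


def summarize_emr_event(event):
    """Return a small alert dict for a terminated cluster, or None otherwise."""
    if event.get("source") != "aws.emr":
        return None
    detail = event.get("detail", {})
    state = detail.get("state")
    if state not in ALERT_STATES:
        return None
    # stateChangeReason is itself a JSON document encoded as a string.
    reason = json.loads(detail.get("stateChangeReason") or "{}")
    return {
        "cluster_id": detail.get("clusterId"),
        "name": detail.get("name"),
        "state": state,
        "reason_code": reason.get("code"),
        # Lets the caller ignore deliberate shutdowns if it wants to.
        "user_requested": reason.get("code") == "USER_REQUEST",
    }


# Trimmed copy of the sample event shown above.
sample = {
    "detail-type": "EMR Cluster State Change",
    "source": "aws.emr",
    "detail": {
        "stateChangeReason": "{\"code\":\"USER_REQUEST\",\"message\":\"Terminated by user request\"}",
        "name": "Development Cluster",
        "clusterId": "j-1YONHTCP3YZKC",
        "state": "TERMINATED",
    },
}

alert = summarize_emr_event(sample)
print(alert)
```

Since the clusters in the question are meant to be persistent, a handler could page only when user_requested is false, or when the state is TERMINATED_WITH_ERRORS.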
I'm setting up a RabbitMQ (v3.8.0) cluster with High Availability.
To enable message persistence, I set the durable parameter of exchanges and queues to true.
{
"exchanges": [
{
"name": "my_direct_exchange",
"vhost": "my_vhost",
"type": "direct",
"durable": true,
"auto_delete": false,
"internal": false,
"arguments": {}
}
],
"queues": [
{
"name": "my_queue_direct",
"vhost": "my_vhost",
"durable": true,
"auto_delete": false,
"arguments": {}
}
]
}
Then, it seems there are two choices:
Either sending messages with delivery_mode=2
Or, setting lazy mode in queues (via policy configuration)
"policies": [
{
"vhost": "my_vhost",
"name": "my_policy",
"pattern": "",
"apply-to": "all",
"definition": {
"ha-mode": "all",
"ha-sync-mode": "automatic",
"queue-mode": "lazy"
}
}
]
Both of these choices will store messages on disk.
What is the difference between them?
To enable message persistence, I set the durable parameter of exchanges
and queues to true.
To clarify, the durable parameter for exchanges and queues does not affect individual message persistence. The durable parameter ensures that those exchanges and queues survive broker restarts. True, if you have a non-durable queue with persistent messages, and restart the broker, that queue and those messages will be lost, so the durable parameter is important.
You should use the persistent flag, even with lazy queues. Why? Because you should also be using Publisher Confirms, and when the persistent flag is set, a message is only confirmed once it has been written to disk.
NOTE: the RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.
We are using the Kafka S3 Connector for our log pipeline, as it guarantees exactly-once semantics. However, we've experienced two data loss events on different topics. We found a suspicious error message in the Kafka Connect worker's log, shown below.
[2019-04-10 08:56:22,388] ERROR WorkerSinkTask{id=s3-sink-common-log-4} Commit of offsets threw an unexpected exception for sequence number 2721: {logging_common_log-9=OffsetAndMetadata{offset=4485661604, metadata=''}, logging_common_log-8=OffsetAndMetadata{offset=4485670359, metadata=''}} (org.apache.kafka.connect.runtime.WorkerSinkTask:260)
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.sendOffsetCommitRequest(ConsumerCoordinator.java:808)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.doCommitOffsetsAsync(ConsumerCoordinator.java:641)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsAsync(ConsumerCoordinator.java:608)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitAsync(KafkaConsumer.java:1486)
at org.apache.kafka.connect.runtime.WorkerSinkTask.doCommitAsync(WorkerSinkTask.java:352)
at org.apache.kafka.connect.runtime.WorkerSinkTask.doCommit(WorkerSinkTask.java:363)
at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:432)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:209)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:193)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The configs of worker and connector are:
{
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"flush.size": "999999999",
"rotate.schedule.interval.ms": "60000",
"retry.backoff.ms": "5000",
"s3.part.retries": "3",
"s3.retry.backoff.ms": "200",
"s3.part.size": "26214400",
"tasks.max": "3",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"format.class": "io.confluent.connect.s3.format.json.JsonFormat",
"schema.compatibility": "NONE",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"timestamp.extractor": "Record",
"partition.duration.ms": "3600000",
"path.format": "YYYY/MM/dd/HH",
"timezone": "America/Los_Angeles",
"locale": "US",
"append.late.data": "false",
...
},
and
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
offset.storage.topic=connect-offsets
offset.storage.replication.factor=3
offset.storage.partitions=25
config.storage.topic=connect-configs
config.storage.replication.factor=3
status.storage.topic=connect-status
status.storage.replication.factor=3
status.storage.partitions=5
rest.port=8083
plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors,
plugin.path=/usr/share/java
The questions are:
1. What's the root cause?
2. How to prevent it?
3. How to reproduce it?
Thank you very much for any hints/advice/similar experience!
In Node-RED, I have been able to subscribe to the JSON stream ABC using the KSQL node. Now I am trying to push that stream to an S3 bucket as JSON files with kafka-s3-connector, but I am able to do this with the CLI only, not with the SQL and S3 nodes installed in Node-RED. Is there some additional node missing? Kindly help with this.
I am able to do this with the CLI only
I am not familiar with Node-RED, but you can send an HTTP POST request to a Kafka Connect distributed worker that has the S3 connector available:
curl -XPOST http://connect-server:8083/connectors \
-H "Content-Type: application/json" \
-d '{
"name": "sink-s3",
"config": {
"topics": "your_topic",
"tasks.max": "2",
"name": "sink-s3",
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"format.class": "io.confluent.connect.s3.format.json.JsonFormat",
"s3.bucket.name": "example-kafka-bucket",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"__comment": "Confluent Kafka Connect properties",
"flush.size": "200",
"s3.part.size": "5242880",
"schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
"partitioner.class": "io.confluent.connect.storage.partitioner.DefaultPartitioner",
"schema.compatibility": "BACKWARD"
}
}'
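If a curl call is awkward to make from your flow, the same request can be built with any HTTP client. Below is a minimal sketch using only Python's standard library. The helper name build_create_connector_request is hypothetical, the host connect-server:8083 is the placeholder from the curl command, and only a subset of the config is shown. The request is built but not sent, so the snippet runs without a Connect worker.

```python
import json
import urllib.request


def build_create_connector_request(base_url, name, config):
    """Build (but do not send) a POST /connectors request for the Connect REST API."""
    payload = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/connectors",
        data=payload,
        # The Connect REST API expects a JSON body with a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Subset of the connector config from the curl example above.
s3_config = {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "your_topic",
    "tasks.max": "2",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "s3.bucket.name": "example-kafka-bucket",
    "flush.size": "200",
}

request = build_create_connector_request("http://connect-server:8083", "sink-s3", s3_config)
print(request.get_method(), request.full_url)

# To actually submit it (requires a reachable Connect worker):
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode("utf-8"))
```

Any Node-RED node that can issue an HTTP request with a JSON body should be able to send the equivalent payload.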