Data Loss in Kafka S3 Connector

We are using the Kafka S3 Connector for our log pipeline because it guarantees exactly-once semantics. However, we've experienced two data-loss events on different topics, and we found a suspicious error message in the kafka-connect worker's log:
[2019-04-10 08:56:22,388] ERROR WorkerSinkTask{id=s3-sink-common-log-4} Commit of offsets threw an unexpected exception for sequence number 2721: {logging_common_log-9=OffsetAndMetadata{offset=4485661604, metadata=''}, logging_common_log-8=OffsetAndMetadata{offset=4485670359, metadata=''}} (org.apache.kafka.connect.runtime.WorkerSinkTask:260)
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.sendOffsetCommitRequest(ConsumerCoordinator.java:808)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.doCommitOffsetsAsync(ConsumerCoordinator.java:641)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsAsync(ConsumerCoordinator.java:608)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitAsync(KafkaConsumer.java:1486)
at org.apache.kafka.connect.runtime.WorkerSinkTask.doCommitAsync(WorkerSinkTask.java:352)
at org.apache.kafka.connect.runtime.WorkerSinkTask.doCommit(WorkerSinkTask.java:363)
at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:432)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:209)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:193)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The connector and worker configs are:
{
  "connector.class": "io.confluent.connect.s3.S3SinkConnector",
  "flush.size": "999999999",
  "rotate.schedule.interval.ms": "60000",
  "retry.backoff.ms": "5000",
  "s3.part.retries": "3",
  "s3.retry.backoff.ms": "200",
  "s3.part.size": "26214400",
  "tasks.max": "3",
  "storage.class": "io.confluent.connect.s3.storage.S3Storage",
  "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
  "schema.compatibility": "NONE",
  "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
  "timestamp.extractor": "Record",
  "partition.duration.ms": "3600000",
  "path.format": "YYYY/MM/dd/HH",
  "timezone": "America/Los_Angeles",
  "locale": "US",
  "append.late.data": "false",
  ...
}
and the worker properties:
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
offset.storage.topic=connect-offsets
offset.storage.replication.factor=3
offset.storage.partitions=25
config.storage.topic=connect-configs
config.storage.replication.factor=3
status.storage.topic=connect-status
status.storage.replication.factor=3
status.storage.partitions=5
rest.port=8083
plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors,
plugin.path=/usr/share/java
The questions are:
1. What's the root cause?
2. How to prevent it?
3. How to reproduce it?
Thank you very much for any hints/advice/similar experience!
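For context, the settings the error message points at are plain Kafka consumer settings; in Kafka Connect they can be overridden for all sink tasks from the worker properties using the consumer. prefix. A minimal sketch, assuming the poll loop is exceeding the defaults during S3 uploads (the values below are illustrative, not a verified fix):

# Give each sink task more time between poll() calls before the group rebalances
consumer.max.poll.interval.ms=600000
# Hand smaller batches to each task so every put()/flush cycle finishes sooner
consumer.max.poll.records=200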

Related

Make CloudWatch Events pass an integer timestamp instead of a UTC time string

I'm invoking a scheduled Step Function with a CloudWatch event. The input of the first batch job in the Step Function state machine looks like the following:
{
  "version": "0",
  "id": "sdjlafgdf05-7c32-435hf-aa3b5a8sfade815",
  "detail-type": "Scheduled Event",
  "source": "aws.events",
  "account": "xxxxxxxx",
  "time": "2022-01-14T19:46:49Z",
  "region": "us-east-1",
  "resources": [
    "arn:aws:events:us-east-1:xxxxxxxxxxxx:rule/adfnwelkqnlngqrej-SAFFJKHF734"
  ],
  "detail": {}
}
I want the "time" field to give me an integer. Specifically, instead of "2022-01-14T19:46:49Z", I want "1642189609" (epoch in seconds), so I don't need to parse it in my batch job code. I'm using CDK to build the infrastructure. Is there any way of doing this?
At the moment this is not possible directly with native Step Functions "Intrinsic Functions", so you have two options:
1. Do it via a Lambda, as explained in this other answer (see the sketch below)
2. Pass it first to EventBridge and then create a rule for EventBridge to forward it to CloudWatch
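A minimal sketch of the Lambda option, assuming a Python handler that receives the scheduled event and forwards it to the next state with an integer timestamp added (the epoch_time field name is a made-up example):

import datetime

def handler(event, context):
    # CloudWatch delivers "time" as ISO-8601, e.g. "2022-01-14T19:46:49Z"
    dt = datetime.datetime.strptime(event["time"], "%Y-%m-%dT%H:%M:%SZ")
    dt = dt.replace(tzinfo=datetime.timezone.utc)
    # Pass the original event through with the epoch seconds added
    return {**event, "epoch_time": int(dt.timestamp())}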

How to monitor whether an EMR cluster is terminated using CloudWatch

I want to set an alarm for when any EMR cluster is terminated (caused by internal errors). I know there is an "IsIdle" metric, but my EMR clusters are designed to be persistent, so "IsIdle" doesn't really fit my case. Is there a health-check metric that I can use?
You can configure Amazon CloudWatch to send a "State Change" event to another service like an AWS Lambda function or an Amazon SNS topic.
To achieve this, open the CloudWatch console and, in the navigation pane, click Rules > Create rule.
Service Name: EMR
Event Type: State Change
Specific detail type(s): EMR Cluster State Change
Specific State: TERMINATED and TERMINATED_WITH_ERRORS
Targets: Put the receiving service of your choice.
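If you define the rule via the CLI or infrastructure-as-code rather than the console, the equivalent event pattern should look roughly like this:

{
  "source": ["aws.emr"],
  "detail-type": ["EMR Cluster State Change"],
  "detail": {
    "state": ["TERMINATED", "TERMINATED_WITH_ERRORS"]
  }
}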
Here's an example of such an event:
{
  "version": "0",
  "id": "8535abb0-f87e-4640-b7b6-8de000dfc30a",
  "detail-type": "EMR Cluster State Change",
  "source": "aws.emr",
  "account": "123456789012",
  "time": "2016-12-16T21:00:23Z",
  "region": "us-east-1",
  "resources": [],
  "detail": {
    "severity": "INFO",
    "stateChangeReason": "{\"code\":\"USER_REQUEST\",\"message\":\"Terminated by user request\"}",
    "name": "Development Cluster",
    "clusterId": "j-1YONHTCP3YZKC",
    "state": "TERMINATED",
    "message": "Amazon EMR Cluster j-1YONHTCP3YZKC (Development Cluster) has terminated at 2016-12-16 21:00 UTC with a reason of USER_REQUEST."
  }
}

Handle lags in Kafka S3 Connector

We're using Kafka Connect [distributed, Confluent 4.0].
It works very well, except that there always remain uncommitted messages in the topic the connector listens to. The behavior is probably related to the S3 connector configuration "flush.size": "20000"; the lag in the topic always stays below the flush size.
Our data comes in batches; I don't want to wait until the next batch arrives, nor reduce flush.size and create tons of files.
Is there a way to set a timeout after which the S3 connector will flush the data even if it hasn't reached 20000 events?
thanks!
"config": {
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"topics": "event",
"tasks.max": "3",
"topics.dir": "connect",
"s3.region": "some_region",
"s3.bucket.name": "some_bucket",
"s3.part.size": "5242880",
"flush.size": "20000",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
"schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
"schema.compatibility": "FULL",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"path.format": "'\''day_ts'\''=YYYYMMdd/'\''hour_ts'\''=H",
"partition.duration.ms": "3600000",
"locale": "en_US",
"timezone": "UTC",
"timestamp.extractor": "RecordField",
"timestamp.field": "time"
}
}
To flush outstanding records periodically on low-volume topics with the S3 Connector, you may use the configuration property:
rotate.schedule.interval.ms
(Complete list of configs here)
Keep in mind that by using the property above you might see duplicate messages in the event of reprocessing or recovery from errors, regardless of which partitioner you are using.
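As a sketch, a one-minute wall-clock rotation added to the config above would look like this (the 60000 value is illustrative; note that scheduled rotation relies on the timezone property, which is already set to UTC in this config):

"rotate.schedule.interval.ms": "60000"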

My AKS Cluster was brought down, how can I recover?

I have been playing around with load-testing my application on a single agent cluster in AKS. During the testing, the connection to the dashboard stalled and never resumed. My application seems down as well, so I am assuming the cluster is in a bad state.
The API server is restate-f4cbd3d9.hcp.centralus.azmk8s.io
kubectl cluster-info dump shows the following error:
{
  "name": "kube-dns-v20-6c8f7f988b-9wpx9.14fbbbd6bf60f0cf",
  "namespace": "kube-system",
  "selfLink": "/api/v1/namespaces/kube-system/events/kube-dns-v20-6c8f7f988b-9wpx9.14fbbbd6bf60f0cf",
  "uid": "47f57d3c-d577-11e7-88d4-0a58ac1f0249",
  "resourceVersion": "185572",
  "creationTimestamp": "2017-11-30T02:36:34Z",
  "InvolvedObject": {
    "Kind": "Pod",
    "Namespace": "kube-system",
    "Name": "kube-dns-v20-6c8f7f988b-9wpx9",
    "UID": "9d2b20f2-d3f5-11e7-88d4-0a58ac1f0249",
    "APIVersion": "v1",
    "ResourceVersion": "299",
    "FieldPath": "spec.containers{kubedns}"
  },
  "Reason": "Unhealthy",
  "Message": "Liveness probe failed: Get http://10.244.0.4:8080/healthz-kubedns: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)",
  "Source": {
    "Component": "kubelet",
    "Host": "aks-agentpool-34912234-0"
  },
  "FirstTimestamp": "2017-11-30T02:23:50Z",
  "LastTimestamp": "2017-11-30T02:59:00Z",
  "Count": 6,
  "Type": "Warning"
}
There are also some pod sync errors in kube-system.
Example of issue:
az aks browse -g REstate.Server -n REstate
Merged "REstate" as current context in C:\Users\User\AppData\Local\Temp\tmp29d0conq
Proxy running on http://127.0.0.1:8001/
Press CTRL+C to close the tunnel...
error: error upgrading connection: error dialing backend: dial tcp 10.240.0.4:10250: getsockopt: connection timed out
You'll probably need to SSH to the node to see if the kubelet service is running. For the future, you can set resource quotas to prevent workloads from exhausting all resources on the cluster nodes.
Resource Quotas: https://kubernetes.io/docs/concepts/policy/resource-quotas/
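As an illustration, a namespace-level ResourceQuota such as the following (all values are example numbers, not recommendations) caps the total CPU and memory that pods in the namespace can request or consume:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: default
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi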

Programmatically set active session in Zed Attack Proxy

We're working with the zaproxy API and we're trying to set a session to "active" with the setActiveSession() API call, which is documented here and takes two arguments, the site and the session. The problem we're running into is that we keep getting the error:
{
  "code": "illegal_parameter",
  "message": "Provided parameter has illegal or unrecognized value",
  "detail": "session"
}
Assuming we have the following sessions from the sessions() API call:
{
  "sessions": [
    {
      "session": [
        "Session 1",
        {
          "JSESSIONID": {
            "comment": "",
            "domain": "localhost",
            "domainAttributeSpecified": false,
            "expired": false,
            "expiryDate": null,
            "name": "JSESSIONID",
            "path": "/",
            "pathAttributeSpecified": false,
            "persistent": false,
            "secure": false,
            "value": "941A60311B3C63C69C5887F531E7090A",
            "version": 0
          }
        },
        "16"
      ]
    }
  ]
}
What value do we need to send in the session field to make this API call successful? We've tried the value field of the "complex" JSESSIONID object, as well as the name "Session 1" and the "16" in the session array (under the assumption that it was an id of some sort). All of them return the same error.
[Edit] I just saw that ZAP logs the following to the terminal when we make these calls:
1055328 [ZAP-ProxyThread-106] WARN org.zaproxy.zap.extension.api.API - ApiException while handling API request:
Provided parameter has illegal or unrecognized value (illegal_parameter) : session
at org.zaproxy.zap.extension.httpsessions.HttpSessionsAPI.handleApiAction(Unknown Source)
at org.zaproxy.zap.extension.api.API.handleApiRequest(Unknown Source)
at org.parosproxy.paros.core.proxy.ProxyThread.processHttp(Unknown Source)
at org.parosproxy.paros.core.proxy.ProxyThread.run(Unknown Source)
at java.lang.Thread.run(Thread.java:748)
After a bit more trial and error in the API browser, we discovered that the correct value is indeed the session name, but we were sending it wrapped in quotation marks, i.e. "Session 1", whereas the correct way to send it is without them, i.e. Session 1.
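For anyone hitting the same error, a minimal sketch of the working call using Python's requests (the host, port, site, and API key are placeholders for your own ZAP instance):

import requests

resp = requests.get(
    "http://localhost:8080/JSON/httpSessions/action/setActiveSession/",
    params={
        "site": "localhost:8090",   # the site exactly as ZAP lists it
        "session": "Session 1",     # the session name, with no surrounding quotation marks
        "apikey": "<your-api-key>",
    },
)
print(resp.json())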