Ideal value for Kafka Connect Distributed tasks.max configuration setting? - amazon-s3

I am looking to productionize and deploy my Kafka Connect application. However, there are two questions I have about the tasks.max setting which is required and of high importance but details are vague for what to actually set this value to.
If I have a topic with n partitions that I wish to consume data from and write to some sink (in my case, I am writing to S3), what should I set tasks.max to? Should I set it to n? Should I set it to 2n? Intuitively it seems that I'd want to set the value to n and that's what I've been doing.
What if I change my Kafka topic and increase partitions on the topic? I will have to pause my Kafka Connector and increase the tasks.max if I set it to n? If I have set a value of 2n, then my connector should automatically increase the parallelism it operates?

In a Kafka Connect sink, the tasks are essentially consumer threads and receive partitions to read from. If you have 10 partitions and have tasks.max set to 5, each task with receive 2 partitions to read from and track the offsets. If you have configured tasks.max to a number above the partition count Connect will launch a number of tasks equal to the partitions of the topics it's reading.
If you change the partition count of the topic you'll have to relaunch your connect task, if tasks.max is still greater than the partition count, Connect will start that many tasks.
edit, just discovered ConnectorContext: https://kafka.apache.org/0100/javadoc/org/apache/kafka/connect/connector/ConnectorContext.html
The connector will have to be written to include this but it looks like Connect has the ability to reconfigure a connector if there's a topic change (partitions added/removed).

We had a problem with the distribution of the workload between the Kafka-Connect(5.1.2) instances, caused by the high number of tasks.max than the number of partitions.
In our case, there were 10 Kafka Connect tasks and 3 partitions of the topic which is to be consumed. 3 of those 10 workers are assigned to the 3 partitions of the topic and the other 7 are not assigned to any partitions(which is expected) but the Kafka Connect were distributing the tasks evenly, without considering their workload. So we were ending up with a task distribution to our instances where some instances are staying idle( because they are not assigned to any unempty worker ) or some instances are working more than the others.
To come up with the issue, we set tasks.max equal to number of partitions of our topics.
It is really unexpected for us to see that Kafka Connect does not consider tasks' assignments while rebalancing. Also, I couldn't find any documentation for the tasks.max setting.

Related

lock redis key when two processes are accessing the same key

Application 1 set a value in Redis.
And we have two instance of application 2 which are running and we would like only one instance should read this value from Redis (please note application 2 takes around 30 sec to 1 min process data )
Can Instance-1 application 2 acquire lock redis key which is created by application 1 , so that instance-2 of application 2 will not read and do the same operation ?
No, there's no concept of record lock in Redis. If you need to achieve some sort of locking you have to use another set of data structures to mimic that behavior. For example
List: You can use a list and then POP the item from the list or...
Redis Stream: Using Redis Stream with ConsumerGroup so that each consumer in your Group only sees a portion of the whole data the needs to be processed and it guarantees you that, when an item is delivered to a consumer, it is not going to be delivered to another one.

Apache Pulsar topic replication with increase in cluster size

I want to understand how the namespace/topic replication works in Apache Pulsar and what affect does the change in cluster size have on the replication factor of the existing and new namespaces/topics.
Consider the following scenario:
I am starting with a single node with the following broker configuration:
# Number of bookies to use when creating a ledger
managedLedgerDefaultEnsembleSize=1
# Number of copies to store for each message
managedLedgerDefaultWriteQuorum=1
# Number of guaranteed copies (acks to wait before write is complete)
managedLedgerDefaultAckQuorum=1
After a few months I decide to increase the cluster size to two with the following configuration for the new broker:
# Number of bookies to use when creating a ledger
managedLedgerDefaultEnsembleSize=2
# Number of copies to store for each message
managedLedgerDefaultWriteQuorum=2
# Number of guaranteed copies (acks to wait before write is complete)
managedLedgerDefaultAckQuorum=2
In the above scenario what will be the behaviour of the cluster:
Does this change the replication factor(RF) of the existing topics?
Do newly created topics have the old RF or the new specified RF?
How does the namespace/topic(Managed Ledger) -> Broker ownership work?
Please note that the two broker nodes have different configurations at this point.
TIA
What you are changing is the default replication settings (ensemble, write, ack). You shouldn't be using different defaults on different brokers, because then you'll get inconsistent behavior depending on which broker the client connects to.
The replication settings are controlled at namespace level. If you don't explicitly set them, you get the default settings. However, you can change the settings on individual namespaces using the CLI or the REST interface. If you start with settings of (1 ensemble, 1 write, 1 ack) on the namespace and then change to (2 ensemble, 2 write, 2 ack), then the following happens:
All new topics in the namespace use the new settings, storing 2 copies of each message
All new messages published to existing topics in the namespace use the new settings, storing 2 copies. Messages that are already stored in existing topics are not changed. They still have only 1 copy.
An important point to note is that the number of brokers doesn't affect the message replication. In Pulsar, the broker just handles the serving (producing/consuming) of the message. Brokers are stateless and can be scaled horizontally. The messages are stored on Bookkeeper nodes (bookies). The replication settings (ensemble, write, ack) refer to Bookkeeper nodes, not brokers. Here is an diagram from the Pulsar website that illustrates this:
So, to move from a setting of (1 ensemble, 1 write, 1 ack) to (2 ensemble, 2 write, 2 ack), you need to add a Bookkeeper node to your cluster (assuming you start with just 1), not another broker.

Dataflow Apache beam Python job stuck at Group by step

I am running a dataflow job, which readed from BigQuery and scans around 8 GB of data and result in more than 50,000,000 records. Now at group by step I want to group based on a key and one column need to be concatenated . But After concatenated size of concatenated column becomes more than 100 MB that why I have to do that group by in dataflow job because that group by can not be done in Bigquery level due to row size limit of 100 MB.
Now the dataflow job scales well when reading from BigQuery but stuck at Group by Step , I have 2 version of dataflow code, but both are stucking at group by step. When I checked the stack driver logs, it says, processing stuck at lull for more than 1010 sec time(similar kind of message) and Refusing to split GroupedShuffleReader <dataflow_worker.shuffle.GroupedShuffleReader object at 0x7f618b406358> kind of message
I expect the group by state to be completed within 20 mins but is stuck for more than 1 hours and never gets finished
I figured out the thing myself.
Below are the 2 changes that I did in my pipeline:
1. I added a Combine function just after the Group by Key, see screenshot
since the Group by key when running on multiple worker, does a lot of network traffic exchange, and by default the network we use, does not allow the inter network communication, so I have to create a firewall rule to allow traffic from one worker to another worker i.e. ip range to network traffic.

Apache Nifi PutElasticsearch can wait forever to fill up batch size?

I am trying to write streaming data into elasticsearch with apache-nifi.putElasticSearch processor,
PutElasticSearch has property named "Batch Size", when I set this value to 1 all events are written to elasticsearch ASAP.
But such a low "batch size" obviously not working when the load is high. So in order to have a reasonable throughput I need to set it to 1000.
My question is, does PutElasticSearch waits till the batch size of events available. If yes it can wait hours when there are 999 events waiting on processor.
I am searching to understand how logstash doing same job on elasticsearch output plugin. There may be some flushing logic implemented based on time ( if events are waiting ~2 sec flush events to elasticsearch )..
You have any idea?
Edit: I just found logstash implemented this https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-idle_flush_time :)
How can I do same functionality on nifi
According to the code the batch size parameter is a maximum number of FlowFiles from the incoming queue.
For example in case of value batch size = 1000:
1/ if incoming queue contains 1001 flow files - only 1000 will be taken in one transaction.
2/ if incoming queue contains 999 flow files - 999 will be taken in one transaction.
And everything will be processed as soon as there is something in the incoming queue and there are available threads in nifi.
references:
PutElasticsearch.java
ProcessSession.java

kafka: replicas and ISR don't match

Im moving hundreds of topics from one broker to another. The process is
Use kafka-topics.sh to generate existing partition list
Use kafka-reassign-partitions.sh to generate current list of partitions / brokers / etc
Edit this list so every instance of broker 7(to be replaced) is now brokers 7,4 (4 is new broker)
Run kafka-reassign-partitions.sh (broker list) --execute to add new broker
Wait.. watch.. (using --verify).. until complete...
Edit broker list to be brokers 4,7
Do (4) again.. do (5) again..
Run preferred leader election (in case 7 was leading anything)
Edit broker list to remove all instances of broker 7
Do (4) and (5) again
Be happy
This has worked great for hundreds and hundreds of topics.. except for one sticky one. This holdout is refusing to sync up with the new broker (missing from ISR list) even though it's included in list of replicas
Output from kafka-topics.sh (trying to replace broker 7 with broker 4):
Topic: shard_3 Partition: 7 Leader: 3 Replicas: 3,4,7 Isr: 7,3
I've run (4) above several times in hopes of getting this to complete but it doesn't seem to want to. I've waited overnight in case it's just really slow.
Suggestions on how to unstick this one?
Turns out the lead broker was upset about something and the partition list wasn't being kept up to date.
Solution:
bin/zkCli.sh -server <kafka broker in cluster>
get /controller
restart the kafka service on that controller - this passes control to another box
retry partitioning commands