kafka: replicas and ISR don't match - replication

I'm moving hundreds of topics from one broker to another. The process is:
1. Use kafka-topics.sh to generate the list of existing partitions.
2. Use kafka-reassign-partitions.sh to generate the current list of partitions / brokers / etc.
3. Edit this list so every instance of broker 7 (to be replaced) is now brokers 7,4 (4 is the new broker); see the sketch after this list.
4. Run kafka-reassign-partitions.sh (broker list) --execute to add the new broker.
5. Wait.. watch.. (using --verify).. until complete...
6. Edit the broker list to be brokers 4,7.
7. Do (4) again.. do (5) again..
8. Run a preferred leader election (in case 7 was leading anything).
9. Edit the broker list to remove all instances of broker 7.
10. Do (4) and (5) again.
11. Be happy.
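Step 3 amounts to editing the reassignment JSON that step 2 generated, and steps 4 and 5 feed that file back in. A minimal sketch of what that looks like (topic, partition, replica list, file name, and ZooKeeper address are illustrative; newer Kafka versions take --bootstrap-server instead of --zookeeper):

{"version":1,"partitions":[
  {"topic":"shard_3","partition":7,"replicas":[3,7,4]}
]}

bin/kafka-reassign-partitions.sh --zookeeper zk-host:2181 --reassignment-json-file expand-to-broker-4.json --execute
bin/kafka-reassign-partitions.sh --zookeeper zk-host:2181 --reassignment-json-file expand-to-broker-4.json --verify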
This has worked great for hundreds and hundreds of topics.. except for one sticky one. This holdout is refusing to sync up with the new broker (it's missing from the ISR list) even though it's included in the list of replicas.
Output from kafka-topics.sh (trying to replace broker 7 with broker 4):
Topic: shard_3 Partition: 7 Leader: 3 Replicas: 3,4,7 Isr: 7,3
I've run (4) above several times in hopes of getting this to complete but it doesn't seem to want to. I've waited overnight in case it's just really slow.
Suggestions on how to unstick this one?

It turns out the controller broker was upset about something and the partition state wasn't being kept up to date.
Solution:
1. bin/zkCli.sh -server <zookeeper node for the cluster>
2. get /controller (this shows which broker is currently the controller)
3. Restart the Kafka service on that controller; this passes the controller role to another broker.
4. Retry the reassignment commands.
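For reference, the controller lookup in step 2 returns a small JSON blob naming the broker id; the session looks roughly like this (host and values are illustrative):

bin/zkCli.sh -server zk-host:2181
get /controller
{"version":1,"brokerid":3,"timestamp":"1510512345000"}

Restarting Kafka on the broker with that id forces a new controller election, after which the stuck reassignment can be re-run.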

Related

lock redis key when two processes are accessing the same key

Application 1 sets a value in Redis.
We have two instances of application 2 running, and we would like only one instance to read this value from Redis (note that application 2 takes around 30 sec to 1 min to process the data).
Can instance-1 of application 2 acquire a lock on the Redis key created by application 1, so that instance-2 of application 2 will not read it and do the same operation?
No, there's no concept of a record lock in Redis. If you need to achieve some sort of locking, you have to use another set of data structures to mimic that behavior. For example:
List: you can use a list and then POP the item from the list, or...
Redis Stream: use a Redis Stream with a consumer group, so that each consumer in your group only sees a portion of the whole data that needs to be processed. It guarantees that when an item is delivered to one consumer, it is not going to be delivered to another one.
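As a rough illustration of the Stream approach (the key, group, consumer names, and entry id below are made up, not from the question):

# create a consumer group for application 2; MKSTREAM creates the stream if it doesn't exist yet
XGROUP CREATE app1:values app2-group $ MKSTREAM
# application 1 publishes the value as a stream entry instead of a plain key
XADD app1:values * payload "some-value"
# each instance of application 2 reads as a different consumer in the same group;
# an entry delivered to instance-1 is never delivered to instance-2
XREADGROUP GROUP app2-group instance-1 COUNT 1 BLOCK 5000 STREAMS app1:values >
# acknowledge once the 30 sec - 1 min of processing is done (using the id returned by XREADGROUP)
XACK app1:values app2-group 1526569495631-0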

Storing time intervals efficiently in redis

I am trying to track server uptimes using redis.
So the approach I have chosen is as follows:
Server xyz will keep sending my service a ping indicating that it was alive and working in the last 30 seconds.
My service will store a list of all time intervals during which the server was active. This is done by storing a list of {startTime, endTime} entries in Redis, with the key being the name of the server (xyz).
Depending on a user query, I will use this list to generate server uptime metrics, like the % downtime between times (T1, T2).
Example:
assume that the time is T currently.
at T+30, server sends a ping.
xyz:["{start:T end:T+30}"]
at T+60, server sends another ping
xyz:["{start:T end:T+30}", "{start:T+30 end:T+60}"]
and so on for all pings.
This works fine, but an issue is that over a large time period this list will accumulate a lot of elements. To avoid this, currently on each ping I pop the last element of the list and check whether it can be merged with the latest time interval. If it can be merged, I coalesce them and push a single time interval onto the list; if not, two time intervals are pushed.
So with this, my list becomes the following after step 2: xyz:["{start:T end:T+60}"]
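Spelled out as redis-cli commands, that merge-on-ping flow looks roughly like this (keeping the informal "{start end}" string encoding from the example):

# ping arrives at T+60: pop the most recent interval...
RPOP xyz
# ...it was "{start:T end:T+30}", which is adjacent to the new window, so push the coalesced interval back
RPUSH xyz "{start:T end:T+60}"
# if the popped interval had not been adjacent, both intervals would be pushed back instead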
Some problems I see with this approach are:
the merging is being done in my service, and not in Redis.
in case my service is distributed, the list ordering might get corrupted due to multiple readers and writers.
Is there a more efficient/elegant way to handle this, like maybe handling the merging of time intervals in Redis itself?

Apache Pulsar topic replication with increase in cluster size

I want to understand how namespace/topic replication works in Apache Pulsar and what effect a change in cluster size has on the replication factor of existing and new namespaces/topics.
Consider the following scenario:
I am starting with a single node with the following broker configuration:
# Number of bookies to use when creating a ledger
managedLedgerDefaultEnsembleSize=1
# Number of copies to store for each message
managedLedgerDefaultWriteQuorum=1
# Number of guaranteed copies (acks to wait before write is complete)
managedLedgerDefaultAckQuorum=1
After a few months I decide to increase the cluster size to two with the following configuration for the new broker:
# Number of bookies to use when creating a ledger
managedLedgerDefaultEnsembleSize=2
# Number of copies to store for each message
managedLedgerDefaultWriteQuorum=2
# Number of guaranteed copies (acks to wait before write is complete)
managedLedgerDefaultAckQuorum=2
In the above scenario what will be the behaviour of the cluster:
Does this change the replication factor (RF) of the existing topics?
Do newly created topics have the old RF or the new specified RF?
How does the namespace/topic (managed ledger) -> broker ownership work?
Please note that the two broker nodes have different configurations at this point.
TIA
What you are changing is the default replication settings (ensemble, write, ack). You shouldn't be using different defaults on different brokers, because then you'll get inconsistent behavior depending on which broker the client connects to.
The replication settings are controlled at namespace level. If you don't explicitly set them, you get the default settings. However, you can change the settings on individual namespaces using the CLI or the REST interface. If you start with settings of (1 ensemble, 1 write, 1 ack) on the namespace and then change to (2 ensemble, 2 write, 2 ack), then the following happens:
All new topics in the namespace use the new settings, storing 2 copies of each message
All new messages published to existing topics in the namespace use the new settings, storing 2 copies. Messages that are already stored in existing topics are not changed. They still have only 1 copy.
An important point to note is that the number of brokers doesn't affect the message replication. In Pulsar, the broker just handles the serving (producing/consuming) of the message. Brokers are stateless and can be scaled horizontally. The messages are stored on BookKeeper nodes (bookies). The replication settings (ensemble, write, ack) refer to BookKeeper nodes, not brokers. (There is a diagram on the Pulsar website that illustrates this.)
So, to move from a setting of (1 ensemble, 1 write, 1 ack) to (2 ensemble, 2 write, 2 ack), you need to add a BookKeeper node to your cluster (assuming you start with just 1), not another broker.
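For example, raising the policy on a single namespace via the CLI looks roughly like this (the tenant/namespace name is made up; the mark-delete rate is left unthrottled at 0):

bin/pulsar-admin namespaces set-persistence my-tenant/my-namespace \
  --bookkeeper-ensemble 2 \
  --bookkeeper-write-quorum 2 \
  --bookkeeper-ack-quorum 2 \
  --ml-mark-delete-max-rate 0
bin/pulsar-admin namespaces get-persistence my-tenant/my-namespace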

Ideal value for Kafka Connect Distributed tasks.max configuration setting?

I am looking to productionize and deploy my Kafka Connect application. However, there are two questions I have about the tasks.max setting, which is required and of high importance, but whose documentation is vague about what to actually set this value to.
If I have a topic with n partitions that I wish to consume data from and write to some sink (in my case, I am writing to S3), what should I set tasks.max to? Should I set it to n? Should I set it to 2n? Intuitively it seems that I'd want to set the value to n, and that's what I've been doing.
What if I change my Kafka topic and increase the partitions on the topic? Will I have to pause my Kafka connector and increase tasks.max if I set it to n? If I have set a value of 2n, will my connector automatically increase its parallelism?
In a Kafka Connect sink, the tasks are essentially consumer threads and receive partitions to read from. If you have 10 partitions and tasks.max set to 5, each task will receive 2 partitions to read from and track the offsets. If you have configured tasks.max to a number above the partition count, Connect will launch a number of tasks equal to the partitions of the topics it's reading.
If you change the partition count of the topic, you'll have to relaunch your connect task; if tasks.max is still greater than the partition count, Connect will start that many tasks (one per partition).
Edit: just discovered ConnectorContext: https://kafka.apache.org/0100/javadoc/org/apache/kafka/connect/connector/ConnectorContext.html
The connector will have to be written to include this, but it looks like Connect has the ability to reconfigure a connector if there's a topic change (partitions added/removed).
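As a sketch, an S3 sink connector for a 10-partition topic submitted to the Connect REST API would then carry tasks.max = 10. The connector name, topic, bucket, and format settings below are illustrative, not from the question:

{
  "name": "s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "my-topic",
    "tasks.max": "10",
    "s3.bucket.name": "my-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000"
  }
}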
We had a problem with the distribution of the workload between the Kafka Connect (5.1.2) instances, caused by tasks.max being higher than the number of partitions.
In our case, there were 10 Kafka Connect tasks and 3 partitions of the topic to be consumed. 3 of those 10 tasks were assigned to the 3 partitions of the topic and the other 7 were not assigned to any partition (which is expected), but Kafka Connect was distributing the tasks evenly, without considering their workload. So we ended up with a task distribution across our instances where some instances stayed idle (because all of their tasks were ones with no partition assigned) while others were doing more work than the rest.
To get around the issue, we set tasks.max equal to the number of partitions of our topics.
It was really unexpected for us to see that Kafka Connect does not consider the tasks' assignments while rebalancing. Also, I couldn't find any documentation on the tasks.max setting.

carbon-relay Replication across Datacenters

I recently "inherited" a Carbon/Graphite setup from a colleague which I have to redesign. The current setup is:
Datacenter 1 (DC1): 2 servers (server-DC1-1 and server-DC1-2) with 1 carbon-relay and 4 carbon caches
Datacenter 2 (DC2): 2 servers (server-DC2-1 and server-DC2-2) with 1 carbon-relay and 4 carbon caches
All 4 carbon-relays are configured with a REPLICATION_FACTOR of 2, consistent hashing, and all 16 carbon-caches as destinations (2 DCs * 2 servers * 4 caches). This had the effect that some metrics exist only on 1 server (they were probably hashed to two different caches on the same server). With over 1 million metrics, this problem affects about 8% of all metrics.
What I would like to do is a multi-tiered setup with redundancy, so that I mirror all metrics across the datacenters and inside the datacenter I use consistent hashing to distribute the metrics evenly across 2 servers.
For this I need help with the configuration (mainly) of the relays. The setup I have in mind looks like this:
The clients would send their data to the tier1relays in their respective Datacenters ("loadbalancing" would occur on client side, so that for example all clients with an even number in the hostname would send to tier1relay-DC1-1 and clients with an odd number would send to tier1relay-DC1-2).
The tier2relays use consistent hashing to distribute the data in the datacenter evenly across the 2 servers. For example the "pseudo" configuration for tier2relay-DC1-1 would look like this:
RELAY_METHOD = consistent-hashing
DESTINATIONS = server-DC1-1:cache-DC1-1-a, server-DC1-1:cache-DC1-1-b, (...), server-DC1-2:cache-DC1-2-d
What I would like to know: how do I tell tier1relay-DC1-1 and tier1relay-DC1-2 that they should send all metrics to the tier2relays in both DC1 and DC2 (replicating the metrics across the DCs) and do some kind of "load balancing" between tier2relay-DC1-1 and tier2relay-DC1-2?
On another note: I would also like to know what happens inside a carbon-relay if I use consistent hashing but one or more of the destinations are unreachable (server down). Do the metrics get hashed again (against the reachable caches), or will they simply be dropped for the time being? (Or, to ask the same question from a different angle: when a relay receives a metric, does it hash the metric against the list of all configured destinations or against the currently available destinations?)
https://github.com/grobian/carbon-c-relay does exactly what you need. It also gives you a great boost in performance.
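As a rough sketch (hostnames and ports are assumed, not from the question), a carbon-c-relay configuration for the tier1 relays that mirrors every metric to both datacenters, and load-balances/fails over between the two tier2 relays in each, could look something like this:

cluster dc1-tier2
    any_of tier2relay-DC1-1:2003 tier2relay-DC1-2:2003;
cluster dc2-tier2
    any_of tier2relay-DC2-1:2003 tier2relay-DC2-2:2003;
match *
    send to dc1-tier2 dc2-tier2
    stop;

The any_of cluster type picks a member per metric and fails over to the remaining member when one is down, while sending each match to both clusters is what mirrors the data across DC1 and DC2.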