I am new to Apache Kafka and was playing around with it. If I have 2 brokers and one topic with 4 partitions and assume one of my broker is heavily loaded, will kafka takes care of balancing the incoming traffic from producers to the other free broker ? If so how it is done ?
If you have multiple partitions, it's the producers responsibility/choice of which partition they want to send it to.
Producers publish data to the topics of their choice. The producer is responsible for choosing which message to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the message). link
In Kafka producer, a partition key can be specified to indicate the destination partition of the message. By default, a hashing-based partitioner is used to determine the partition id given the key, and people can use customized partitioners also. To reduce # of open sockets, in 0.8.0 (https://issues.apache.org/jira/browse/KAFKA-1017), when the partitioning key is not specified or null, a producer will pick a random partition and stick to it for some time (default is 10 mins) before switching to another one. link
If you specify which partition you want the data to go into, it will always go into that specific partition. If you don't specify, the producer could send it to any partition. The Kafka broker never internally moves or balances messages/partitions.
I believe this decision is to provide certain guarantees for the ordering of messages in a Kafka partition.
Kafka producer tends to distribute messages equally among all partitions unless you override this behavior, then you need to have a look if the four partitions is distributed evenly among brokers.
It depends on what do you mean by "one of the brokers is heavily loaded". if it is because of that topic or this cluster has any other topics (e.g. __consumer_offset).
You can choose the brokers in which partition resides with a cli tools with Kafka or with some kind of UI like yahoo kafka-manager.
Related
I am trying to understand how redis streams does partitioning , if a specific message can be sent to a specific partition (similar to how you can do with Kafka).
I have checked the redis-cli api , and there is nothing similar to partitioning , also nothing about this when using the StackExchangeRedis redis library.
The only method is : IDatabase.StreamAdd(key,streamKey,streamValue,messageId)
Am i missing anything ? Is the partitioning done only in a fixed way ?
P.S If partitioning can be done , can the partitioning key be composed ?
You cannot achieve Kafka partition with a single Redis Stream. Redis Stream's consumer group looks like a load balancer, which distributes messages in the stream to different consumers without any partition. If one consumer is faster than others, it will consume more messages than others, regardless of the logical partition of messages.
If you want to do partition, you have to use multiple Redis Streams, i.e. multiple keys, and each key for a different partition. And your producers add messages in a logical partition to a dedicated Redis Stream. The different consumers xread messages from different key.
We have been using ActiveMQ version 5.16.0 broker with single instances in production. Now we are planning to use cluster of AMQ brokers for HA and load distribution with consistency in message data. Currently we are using only one queue
HA can be achieved using failover but do we need to use the same datastore or it can be separated? If I use different instances for AMQ brokers then how to setup a common datastore.
Please guide me how to setup datastore for HA and load distribution
Multiple ActiveMQ servers clustered together can provide HA in a couple ways:
Scale message flow by using compute resources across multiple broker nodes
Maintain message flow during single node planned or unplanned outage of a broker node
Share data store in the event of ActiveMQ process failure.
Network of brokers solve #1 and #2. A standard 3-node cluster will give you excellent performance and ability to scale the number of producers and consumers, along with splitting the full flow across 3-nodes to provide increased capacity.
Solving for #3 is complicated-- in all messaging products. Brokers are always working to be completely empty-- so clustering the data store of a single-broker becomes an anti-pattern of sorts. Many times, relying on RAID disk with a single broker node will provide higher reliability than adding NFSv4, GFSv2, or JDBC and using shared-store.
That being said, if you must use a shared store-- follow best practices and use GFSv2 or NFSv4. JDBC is much slower and requires significant DB maintenance to keep running efficiently.
Note: [#Kevin Boone]'s note about CIFS/SMB is incorrect and CIFS/SMB should not be used. Otherwise, his responses are solid.
You can configure ActiveMQ so that instances share a message store, or so they have separate message stores. If they share a message store, then (essentially) the brokers will automatically form a master-slave configuration, such that only one broker (at a time) will accept connections from clients, and only one broker will update the store. Clients need to identify both brokers in their connection URIs, and will connect to whichever broker happens to be master.
With a shared message store like this, locks in the message store coordinate the master-slave assignment, which makes the choice of message store critical. Stores can be shared filesystems, or shared databases. Only a few shared filesystem implementations work properly -- anything based on NFS 4.x should work. CIFS/SMB stores can work, but there's so much variation between providers that it's hard to be sure. NFS v3 doesn't work, however well-implemented, because the locking semantics are inappropriate. In any case, the store needs to be robust, or replicated, or both, because the whole broker cluster depends on it. No store, no brokers.
In my experience, it's easier to get good throughput from a shared file store than a shared database although, of course, there are many factors to consider. Poor network connectivity will make it hard to get good throughput with any kind of shared store (or any kind of cluster, for that matter).
When using individual message stores, it's typical to put the brokers into some kind of mesh, with 'network connectors' to pass messages from one broker to another. Both brokers will accept connections from clients (there is no master), and the network connections will deal with the situation where messages are sent to one broker, but need to be consumed from another.
Clients' don't necessarily need to specify all brokers in their connection URIs, but generally will, in case one of the brokers is down.
A mesh is generally easier to set up, and (broadly speaking) can handle more client load, than a master-slave with shared filestore. However, (a) losing a broker amounts to losing any messages that were associated with it (until the broker can be restored) and (b) the mesh interferes with messaging patterns like message grouping and exclusive consumers.
There's really no hard-and-fast rule to determine which configuration to use. Many installers who already have some sort of shared store infrastructure (a decent relational database, or a clustered NFS, for example) will tend to want to use it. The rise in cloud deployments has had the effect that mesh operation with no shared store has become (I think) a lot more popular, because it's so symmetric.
There's more -- a lot more -- that could be said here. As a broad question, I suspect the OP is a bit out-of-scope for SO. You'll probably get more traction if you break your question up into smaller, more focused, parts.
In RabbitMQ, if two clusters are hosted on geographical different locations, then we can’t use Clustering. Then how to make them highly available I.e. if one site’s whole cluster goes down then the messages should be mirrored to other site and other site should be able to cater those messages. Note : sites are connected by WAN
See I can’t lose any message on the both sites. Publishing message to the right site can be taken care of, but if the messages are in queue(work queue) or messages are being processed by consumer and suddenly if the site goes down which includes the broker and consumer, how can those messages be catered by the other site. Like in a cluster if one node dies, the other one has all the messages mirrored and knows which were acknowledged, but how to achieve this on WAN, cause clustering cross WAN is not practical.
I think the question illustrates a conceptual problem with the design. To summarize,
There are two sites, connected via WAN
One site is the primary, while one is the active standby
There is a desire for complete replication of system state (total consistency) between site A and B, to include the status of messages in the queue and messages being processed.
Essentially, you want 100% consistency, availability, and partition tolerance. Such a design is not possible according to CAP Theorem. What RabbitMQ provides is either consistency and availability, with low partition tolerance via clustering, or availability and partition tolerance via federation or shovel. RabbitMQ does not deal very well with the case of needing consistency and partition tolerance, since message brokers really handle highly transient traffic.
Instead, what is needed is to fully scope the problem to something that can be solved using the available technologies. It sounds to me like the correct approach (since it's over a WAN) is to sacrifice availability for consistency and partition tolerance, and have your application handle the failover case. You may be able to configure RabbitMQ sufficiently in this regard - see https://www.rabbitmq.com/partitions.html.
what scaling options can we use if rabbitMQ metrices reaches a threshold?I have a VM on which RabbitMQ is running. If the queue length>90% of total queue length, can we increase the instance count by 1 and a with a separate queue such that they are to be processed on a priority basis?
In short what scaling options do we have based on different parameters for RabbitMQ
Take a look into RabbitMQ Sharding Plugin
From their README:
RabbitMQ Sharding Plugin
This plugin introduces the concept of sharded queues for RabbitMQ.
Sharding is performed by exchanges, that is, messages will be
partitioned across "shard" queues by one exchange that we should
define as sharded. The machinery used behind the scenes implies
defining an exchange that will partition, or shard messages across
queues. The partitioning will be done automatically for you, i.e: once
you define an exchange as sharded, then the supporting queues will be
automatically created on every cluster node and messages will be
sharded across them.
Auto-scaling
One interesting property of this plugin, is that if you add more nodes
to your RabbitMQ cluster, then the plugin will automatically create
more shards in the new node. Say you had a shard with 4 queues in node
a and node b just joined the cluster. The plugin will automatically
create 4 queues in node b and join them to the shard partition.
Already delivered messages will not be rebalanced, but newly arriving
messages will be partitioned to the new queues.
I would like to create a cluster for high availability and put a load balancer front of this cluster. In our configuration, we would like to create exchanges and queues manually, so one exchanges and queues are created, no client should make a call to redeclare them. I am using direct exchange with a routing key so its possible to route the messages into different queues on different nodes. However, I have some issues with clustering and queues.
As far as I read in the RabbitMQ documentation a queue is specific to the node it was created on. Moreover, we can only one queue with the same name in a cluster which should be alive in the time of publish/consume operations. If the node dies then the queue on that node will be gone and messages may not be recovered (depends on the configuration of course). So, even if I route the same message to different queues in different nodes, still I have to figure out how to use them in order to continue consuming messages.
I wonder if it is possible to handle this failover scenario without using mirrored queues. Say I would like switch to a new node in case of a failure and continue to consume from the same queue. Because publisher is just using routing key and these messages can go into more than one queue, same situation is not possible for the consumers.
In short, what can I to cope with the failures in an environment explained in the first paragraph. Queue mirroring is the best approach with a performance penalty in the cluster or a more practical solution exists?
Data replication (mirrored queues in RabbitMQ) is a standard approach to achieve high availability. I suggest to use those. If you don't replicate your data, you will lose it.
If you are worried about performance - RabbitMQ does not scale well.
The only way I know to improve performance is just to make your nodes bigger or create second cluster. Adding nodes to cluster does not really improve things. Also if you are planning to use TLS it will decrease throughput significantly as well. If you have high throughput requirement +HA I'd consider Apache Kafka.
If your use case allows not to care about HA, then just re-declare queues/exchanges whenever your consumers/publishers connect to the broker, which is absolutely fine. When you declare queue that's already exists nothing wrong will happen, queue won't be purged etc, same with exchange.
Also, check out RabbitMQ sharding plugin, maybe that will do for your usecase.