We are using Confluent's S3 connector to send Avro data from a topic to S3. We have 3 broker nodes, and on all 3 we have the Confluent S3 connector running. In the connector's configuration file we have two topics and tasks.max=1. I am new to Kafka and I have the following doubts:
Since we have three S3 connectors overall, how do they read from each topic (each topic has 3 partitions and a replication factor of 2)? Are they three independent consumers reading from the same topic, or do they all fall under a single consumer group and read data in parallel?
We have two topics in each connector. Does the connector launch different threads to read data from both topics in parallel, or does it consume sequentially (reading from one topic at a time)?
tasks.max=1
First, set that to the total number of partitions across the connector's topics.
The replication factor doesn't matter here: consumers only ever read from a partition's leader, so extra replicas add no consumer parallelism.
Connect forms a consumer group; that is the basic design of any Kafka consumer client. The tasks read in parallel, depending on all your other properties.
It sounds like you are running connect-standalone rather than connect-distributed, however.
If you have 3 machines, you should obviously use distributed mode.
And yes, tasks and threads are functionally equivalent, with the difference being that tasks are rebalanced across workers, while threads live only on a single machine.
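For this setup (2 topics × 3 partitions = 6 partitions), a distributed-mode S3 sink config could look roughly like the sketch below; the connector name, topic names, bucket, region, and flush size are placeholders:

    name=s3-sink
    connector.class=io.confluent.connect.s3.S3SinkConnector
    # 2 topics x 3 partitions = 6 partitions total, so up to 6 tasks do useful work
    tasks.max=6
    topics=topic-a,topic-b
    storage.class=io.confluent.connect.s3.storage.S3Storage
    format.class=io.confluent.connect.s3.format.avro.AvroFormat
    # placeholders: point these at your bucket and region
    s3.bucket.name=my-bucket
    s3.region=us-east-1
    # number of records to buffer before writing an object to S3
    flush.size=1000

With tasks.max=6, a distributed cluster of 3 workers ends up with roughly 2 tasks per machine, each owning its own subset of partitions.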
Related
Google didn't help me, so I want to ask here. I have seen many articles about Java heap memory and so on, but I need some guidance.
I have a lot of Kafka topics that need to go to one S3 bucket using the S3 sink connector.
How do you go about running multiple instances of the S3 sink connector? Should I create multiple systemd units, one per S3 connector, and multiple copies of the Kafka start scripts? Or use one script and run it multiple times, pointing to a different S3 connector properties config for each topic? Is this better for performance than using a single connector for all the topics, since they are all going to the same bucket?
How do I calculate the memory needed?
Let's say I have:
topic1: 5,000 messages
topic2: 2,000 messages
topic3: 500 messages
How do I balance the load requirements vs. the memory requirements vs. what is available on the server?
How much memory do I need if my server has, say, 4 GB?
What scaling options can we use if RabbitMQ metrics reach a threshold? I have a VM on which RabbitMQ is running. If the queue length exceeds 90% of the total queue length, can we increase the instance count by 1, with a separate queue, such that those messages are processed on a priority basis?
In short, what scaling options do we have for RabbitMQ, based on different parameters?
Take a look at the RabbitMQ Sharding Plugin.
From their README:
RabbitMQ Sharding Plugin
This plugin introduces the concept of sharded queues for RabbitMQ.
Sharding is performed by exchanges, that is, messages will be partitioned across "shard" queues by one exchange that we should define as sharded. The machinery used behind the scenes implies defining an exchange that will partition, or shard messages across queues. The partitioning will be done automatically for you, i.e: once you define an exchange as sharded, then the supporting queues will be automatically created on every cluster node and messages will be sharded across them.
Auto-scaling
One interesting property of this plugin, is that if you add more nodes to your RabbitMQ cluster, then the plugin will automatically create more shards in the new node. Say you had a shard with 4 queues in node a and node b just joined the cluster. The plugin will automatically create 4 queues in node b and join them to the shard partition. Already delivered messages will not be rebalanced, but newly arriving messages will be partitioned to the new queues.
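As a rough sketch of how this looks from the RabbitMQ Java client, assuming the rabbitmq_sharding plugin is enabled and a shards-per-node policy matching the exchange name has been set out of band (e.g. via rabbitmqctl, as described in the README); the exchange name and host are made up:

    import java.nio.charset.StandardCharsets;

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class ShardedPublisher {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost"); // assumption: local node with rabbitmq_sharding enabled
            try (Connection conn = factory.newConnection();
                 Channel ch = conn.createChannel()) {
                // The plugin ships an "x-modulus-hash" exchange type; a policy with
                // {"shards-per-node": N} matching this exchange name must be set
                // separately (e.g. rabbitmqctl set_policy) so the shard queues get
                // created on every cluster node.
                ch.exchangeDeclare("shard.images", "x-modulus-hash", true);
                for (int i = 0; i < 10; i++) {
                    // The routing key is hashed to pick which shard queue gets the message.
                    ch.basicPublish("shard.images", String.valueOf(i), null,
                            ("message-" + i).getBytes(StandardCharsets.UTF_8));
                }
            }
        }
    }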
Kafka and RabbitMQ are well-known message brokers. I want to build a microservice with Spring Boot, and it seems that Spring Cloud provides out-of-the-box solutions for both as the de facto choices. I know a bit of the trajectory of RabbitMQ, which has a lot of support. Kafka belongs to Apache, so it should be good. So what's the main difference between RabbitMQ and Kafka? Take into consideration that this will be used with Spring Cloud. Please share your experiences and criteria. Thanks in advance.
I certainly wouldn't consider Kafka lightweight. Kafka relies on ZooKeeper, so you'd need to add ZooKeeper to your stack as well.
Kafka is pub/sub, but you can re-read messages. If you need to process large volumes of data, Kafka performs much better, and its synergy with other big-data tools is a lot better. It specifically targets big data.
Three application-level differences are:
1. Kafka supports re-reading consumed messages, while RabbitMQ does not (see the sketch below).
2. Kafka guarantees the ordering of messages within a partition, while RabbitMQ supports ordering only under constraints such as one exchange routing to one queue, with one consumer on that queue.
3. Kafka is faster at publishing data to partitions than RabbitMQ.
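To illustrate point 1, here is a minimal sketch of replaying a topic with the Java consumer; the broker address, group id, and topic name are placeholder values:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class ReplayExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("group.id", "replay-demo");             // placeholder
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition tp = new TopicPartition("my-topic", 0); // placeholder topic
                consumer.assign(Collections.singletonList(tp));
                // Rewind to the earliest retained offset and re-read everything,
                // which RabbitMQ queues (messages deleted on ack) cannot do.
                consumer.seekToBeginning(Collections.singletonList(tp));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }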
Kafka is more than just a pub/sub messaging platform. It also includes APIs for data integration (Kafka Connect) and stream processing (Kafka Streams). These higher-level APIs make developers more productive compared with using only the lower-level pub/sub messaging APIs.
Also, Kafka added exactly-once semantics in June 2017, which is another differentiator.
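As a small taste of what the Streams API buys you, here is a sketch of a filtering topology with no hand-rolled consumer/producer plumbing; the application id, broker address, and topic names are placeholders:

    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class FilterApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "filter-demo");        // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> input = builder.stream("input-topic"); // placeholder
            // Keep only non-empty values and route them to an output topic.
            input.filter((key, value) -> value != null && !value.isEmpty())
                 .to("output-topic"); // placeholder
            new KafkaStreams(builder.build(), props).start();
        }
    }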
To start with, Kafka does more than what RabbitMQ does. A message broker is just one subset of Kafka; Kafka can also act as message storage and do stream processing. Comparing just the message broker part, Kafka is again more robust than RabbitMQ, as it supports replication (for availability) and partitioning (for scalability), allows replay of messages (if you need to reprocess), and is pull based. RabbitMQ can be scaled by using multiple consumers for a given queue, but it is push based and you lose ordering among multiple consumers.
It all depends on the use case, and your question doesn't provide the use case or performance requirements needed to suggest one over the other.
I found a nice answer in the YouTube video Apache Kafka Explained (Comprehensive Overview).
It basically says that the differences between Kafka and standard JMS systems like RabbitMQ or ActiveMQ are that:
Kafka consumers pull messages from the brokers, which allows buffering messages for as long as the retention period holds, while in most JMS systems messages are pushed to the consumers, which makes strategies like back-pressure harder to achieve.
Kafka also eases the replay of events by storing them on disk, so they can be replayed at any time.
Kafka guarantees the ordering of messages within a partition.
Kafka overall provides an easy way to build scalable and fault-tolerant systems.
Kafka is more complex and harder to understand than JMS systems.
Looking for the pros and cons of using Apache Kafka over RabbitMQ, and to decide whether I should move my existing infrastructure over to Kafka.
They are very different; some differences you might consider to begin with:
a) RabbitMQ is a FIFO queue.
Kafka is a log: your writes are appended to the tail, but you read from wherever you want.
b) Kafka is truly distributed: data is sharded and replicated, and durability guarantees and availability can be tuned.
RabbitMQ has limited support for the above.
c) Kafka also comes out of the box with consumer frameworks that allow reliable distributed processing of the log. Kafka Streams also has stream-processing semantics built into it.
In RabbitMQ the consumer is just FIFO based, reading from the HEAD and processing one by one.
d) Kafka's consumer model is extensible, allowing you to build exactly-once, at-most-once, or at-least-once delivery (see the sketch below).
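To illustrate (d), at-least-once delivery is typically built by disabling auto-commit and committing offsets only after processing; a minimal sketch with placeholder broker, group, and topic names:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class AtLeastOnceConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("group.id", "at-least-once-demo");      // placeholder
            props.put("enable.auto.commit", "false");         // we commit manually
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic")); // placeholder
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record); // your handler; a crash before commit means redelivery
                    }
                    // Committing after processing gives at-least-once semantics;
                    // committing before processing would give at-most-once instead.
                    consumer.commitSync();
                }
            }
        }

        private static void process(ConsumerRecord<String, String> record) {
            System.out.println(record.value());
        }
    }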
I am new to Apache Kafka and was playing around with it. If I have 2 brokers and one topic with 4 partitions, and one of my brokers is heavily loaded, will Kafka take care of balancing the incoming traffic from producers towards the other, free broker? If so, how is that done?
If you have multiple partitions, it's the producer's responsibility/choice which partition to send each message to.
Producers publish data to the topics of their choice. The producer is responsible for choosing which message to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the message).
In the Kafka producer, a partition key can be specified to indicate the destination partition of the message. By default, a hashing-based partitioner is used to determine the partition id given the key, and customized partitioners can also be used. To reduce the number of open sockets, in 0.8.0 (https://issues.apache.org/jira/browse/KAFKA-1017), when the partitioning key is not specified or null, a producer will pick a random partition and stick to it for some time (default is 10 minutes) before switching to another one.
If you specify which partition you want the data to go into, it will always go into that specific partition. If you don't specify, the producer could send it to any partition. The Kafka broker never internally moves or balances messages/partitions.
I believe this decision is to provide certain guarantees for the ordering of messages in a Kafka partition.
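Concretely, here is a minimal sketch of the three options with the Java producer API; the topic name, broker address, and partition number are placeholders:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PartitionChoices {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // 1. Explicit partition: always lands on partition 2.
                producer.send(new ProducerRecord<>("my-topic", 2, "key", "pinned"));
                // 2. Key, no partition: partition = hash(key) % numPartitions,
                //    so the same key always maps to the same partition.
                producer.send(new ProducerRecord<>("my-topic", "key", "hashed"));
                // 3. No key, no partition: the producer spreads records across
                //    partitions itself (round-robin or sticky, version dependent).
                producer.send(new ProducerRecord<>("my-topic", "spread"));
            }
        }
    }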
Kafka producers tend to distribute messages equally among all partitions unless you override this behavior; in that case you need to check whether the four partitions are distributed evenly among the brokers.
It depends on what you mean by "one of the brokers is heavily loaded": whether it is because of this topic, or because the cluster has other topics (e.g. __consumer_offsets).
You can choose the brokers on which a partition resides with Kafka's CLI tools or with some kind of UI like Yahoo's kafka-manager.