As part of my big data course at university, I'm required to mimic Kafka. The project involves setting up a mini-Kafka on a student's system, complete with Producers, Consumers, and a Publish-Subscribe architecture.
High-Level Overview
You are required to set up a mini-Zookeeper, multiple Kafka Brokers, one of which is a leader, and multiple Producers and Consumers.
The number of Producers and Consumers must be dynamic and not hard-coded, i.e., the user must be able to specify the number of Producers and Consumers.
The number of topics must also be dynamic: the user should be able to create and delete topics on demand.
To help you get started with the project, the following sections will provide a detailed description of all the individual modules.
Related
I have a RabbitMQ setup with multiple consumers subscribed to a single queue, and I want messages with the same hash key to be consumed by the same consumer every time. I know RabbitMQ's default behavior is to loop through all the consumers and dispatch messages one by one, round-robin style.
Does RabbitMQ have the same ability as Kafka partitions?
Thanks
Rebalancer (forked from Jack Vanlightly and improved)
Create Kafka-style consumer groups in other technologies. Rebalancer was born of the need for consumer groups with RabbitMQ. But Rebalancer is completely technology agnostic and will balance activity over any group of resources across a group of participating nodes.
Use cases
Create Kafka-like "consumer groups" with messaging technologies like RabbitMQ, SQS, etc.
Consume a group of resources, such as file shares, FTP servers, or S3 buckets, across the instances of a scaled-out application.
Single Active-Consumer / Active-Backup
Create an application cluster that consumes a single resource in a highly available manner. The cluster leader (Coordinator) consumes the single resource, and the slaves (Followers) remain idle as backups in case the leader dies.
Well, not exactly, but there is a very close equivalent.
You need to use the RabbitMQ Consistent Hash Exchange type, which becomes available once you add the rabbitmq-consistent-hash-exchange plugin. The plugin adds a consistent-hash exchange type to RabbitMQ, which uses consistent hashing to distribute messages among the bound queues. It is recommended to get a basic understanding of the concept before evaluating this plugin and its alternatives.
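For illustration, here is a minimal sketch in Python using pika, assuming the plugin has already been enabled with `rabbitmq-plugins enable rabbitmq_consistent_hash_exchange`; the exchange and queue names are made up:

```python
import pika

# Connect to a local broker (adjust host/credentials as needed).
conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# The plugin provides the "x-consistent-hash" exchange type.
ch.exchange_declare(exchange="orders", exchange_type="x-consistent-hash")

# With this exchange type, the binding key is a relative weight, not a pattern.
for q in ("orders-0", "orders-1"):
    ch.queue_declare(queue=q)
    ch.queue_bind(queue=q, exchange="orders", routing_key="1")

# Messages sharing a routing key (the "hash key") always hash to the same queue,
# so the single consumer attached to that queue sees all of them, every time.
ch.basic_publish(exchange="orders", routing_key="customer-42", body=b"event payload")
conn.close()
```

Attach exactly one consumer per bound queue and you get Kafka-partition-like stickiness per hash key.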
I have been using pubsub for a bit of asynchronous work, and was wondering why someone may create multiple subscriptions for a single topic. My default values are as follows:
```
project_id = 'project'
topic_name = 'app'
subscription_name = 'general'
```
The routing to the actual function, and how to process it, is being done in the subscriber receiver itself.
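For context, a minimal sketch of that setup with the google-cloud-pubsub client (the routing inside the callback is only hinted at):

```python
from google.cloud import pubsub_v1

project_id = 'project'
subscription_name = 'general'

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(project_id, subscription_name)

def callback(message):
    # Inspect the payload, dispatch to the right handler, then acknowledge.
    message.ack()

# Blocks while messages are pulled and handed to the callback.
future = subscriber.subscribe(sub_path, callback=callback)
future.result()
```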
What would be reasons why there would be various subscription names? The only thing I can think of is to spread items across multiple servers for processing, such as:
server1 -- `main-1`
server2 -- `main-2`
etc.
Are there any other reasons why a single subscription name would not work well?
In general, there are two paradigms for having multiple subscribers:
Load balancing: The goal is to parallelize the processing of the load by having multiple subscribers using the same subscription. In this scenario, every subscriber receives a subset of the messages. One can horizontally scale processing by creating more subscribers for the same subscription.
Fan out: The goal is to have multiple subscribers receive the entire feed of messages. This is accomplished by having multiple subscriptions. The reason to have fan out is if there are multiple downstream applications interested in the full feed of messages. Imagine there is a feed where the messages are user events on a shopping website. Perhaps one application backs up the data to files, another analyzes the feed for trends in what people are looking at, and another looks through activity to try to find potentially fraudulent transactions. In this scenario, every one of those applications acting as a subscriber needs the full feed of messages, which requires separate subscriptions.
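As a rough illustration with google-cloud-pubsub (the topic is assumed to already exist, and all names are made up): fan out means one subscription per downstream application, while load balancing means many subscriber processes sharing one subscription.

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
topic_path = "projects/project/topics/app"

# Fan out: one subscription per application; each receives the entire feed.
for name in ("backup", "analytics", "fraud-detection"):
    subscriber.create_subscription(
        request={
            "name": subscriber.subscription_path("project", name),
            "topic": topic_path,
        }
    )

# Load balancing: start several processes that all call subscribe() on the SAME
# subscription (say, "analytics"); each process then receives only a subset.
```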
We are currently starting to broadcast events from one central application to other possibly interested consumer applications, and members of our team have differing opinions about how much we should put in our published messages.
The general idea/architecture is the following :
In the producer application:
the user interacts with some entities (Aggregate Roots in the DDD sense) that can be created/modified/deleted
Based on what is happening, Domain Events are raised (e.g., EntityXCreated, EntityYDeleted, EntityZTransferred, etc., i.e., not only CRUD events, but mostly domain-meaningful ones)
Raised events are translated/converted into messages that we send to a RabbitMQ Exchange
In RabbitMQ (we are using RabbitMQ, but I believe the question is actually technology-independent):
we define a queue for each consuming application
bindings connect the exchange to the consumer queues (possibly with message filtering)
In the consuming application(s):
the application consumes and processes messages from its queue
Based on Enterprise Integration Patterns, we are trying to define the canonical format for our published messages, and we are hesitating between two approaches:
Minimalist messages / event-store-ish: for each event published by the Domain Model, generate a message that contains only the parts of the Aggregate Root that are relevant (for instance, when an update is done, only publish information about the updated section of the Aggregate Root, more or less matching the process the end user goes through when using our application)
Pros
small message size
very specialized message types
close to the "Domain Events"
Cons
problematic if delivery order is not guaranteed (i.e., what if an Update message is received before the Create message?)
consumers need to know which message types to subscribe to (possibly a big list / domain knowledge is needed)
what if consumer state and producer state get out of sync?
how to handle a new consumer that registers in the future but has no knowledge of all the past events?
Fully-contained, idempotent-ish messages: for each event published by the Domain Model, generate a message that contains a full snapshot of the Aggregate Root at that point in time, so that in practice only two kinds of messages are handled, "Create or Update" and "Delete" (plus metadata with more specific info if necessary); see the sketch after the cons list below
Pros
idempotent (declarative messages stating "this is what the truth is like, synchronize yourself however you can")
lower number of message formats to maintain/handle
allows synchronization errors on the consumer side to be corrected progressively
consumers automagically handle new Domain Events, as long as the resulting message follows the canonical data model
Cons
bigger message payload
less pure
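To make the contrast concrete, hypothetical payloads for the two approaches might look like this (all field names and values are illustrative):

```python
# Approach 1: minimalist / event-store-ish. One specialized type per Domain Event,
# carrying only the slice of the Aggregate Root that changed.
entity_z_transferred = {
    "type": "EntityZTransferred",
    "entityId": "Z-123",
    "newOwnerId": "U-7",
    "occurredAt": "2024-01-15T10:00:00Z",
}

# Approach 2: fully-contained / idempotent-ish. Only "Create or Update" and
# "Delete" types, each carrying a complete snapshot of the Aggregate Root.
entity_z_upserted = {
    "type": "EntityZCreatedOrUpdated",
    "version": 42,  # lets consumers discard stale or out-of-order snapshots
    "snapshot": {
        "entityId": "Z-123",
        "ownerId": "U-7",
        "status": "active",
        # ... every other field of the aggregate ...
    },
}
```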
Would you recommend one approach over the other?
Is there another approach we should consider?
Is there another approach we should consider?
You might also consider not leaking information out of the service acting as the technical authority for that part of the business.
Roughly, this means that your events carry identifiers, so that interested parties can know that an entity of interest has changed and can query the authority for updates to the state.
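As a sketch, such a message carries little more than identifiers, and the consumer queries back (the endpoint below is hypothetical):

```python
# Hypothetical "thin" event: identifiers and a version, but no state.
entity_changed = {
    "type": "EntityZChanged",
    "entityId": "Z-123",
    "version": 42,
}
# On receipt, an interested consumer fetches the current state from the
# authority itself, e.g. GET https://producer.example.com/entities/Z-123
```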
for each event published by the Domain Model, generate a message that contains a full snapshot of the Aggregate Root at that point in time
This also has the additional Con that any change to the representation of the aggregate also implies a change to the message schema, which is part of the API. So internal changes to aggregates start rippling out across your service boundaries. If the aggregates you are implementing represent a competitive advantage to your business, you are likely to want to be able to adapt quickly; the ripples add friction that will slow your ability to change.
what if consumer state and producer state get out of sync?
As best I can tell, this problem indicates a design error. If a consumer needs state, which is to say a view built from the history of an aggregate, then it should be fetching that view from the producer, rather than trying to assemble it from a collection of observed messages.
That is to say, if you need state, you need history (complete, ordered). All a single event really tells you is that the history has changed, and you can evict your previously cached history.
Again, responsiveness to change: if you change the implementation of the producer, and consumers are also trying to cobble together their own copy of the history, then your changes are rippling across the service boundaries.
In reading many MSDN pages about Azure Service Bus, I have seen allusions to the ability to set up a "Load Balancing" pattern with the "Topic/Subscription" model, but nothing that says how this is done.
My question is: is this possible? Essentially, we are looking to create Topics that could have any number (n) of subscribers, dynamically ramped up and down based upon incoming load. So it would not use the traditional "multicast" pattern, but would instead round-robin the messages to the subscribers. The reason we want this pattern is to take advantage of the rules and filtering that reside in Topics and Subscriptions, while still allowing for dynamic scaling.
Any ideas?
I have found this image, which is very similar to my business model. I need to split messages across several queues. For the heavy work, I can add more worker threads; for the lighter work, I can let a single consumer subscribe to those messages. But how do I do that in RabbitMQ? In the documentation I only found the single-queue, multiple-consumer model.
You can add multiple workers to a queue.
There can be multiple queues bound to an exchange.
In RabbitMQ, the producer always sends the message to an exchange, so in your case one exchange should be enough. If you want to load-balance on the consumer side, you have the two options above.
You can also read my article:
https://techietweak.wordpress.com/2015/08/14/rabbitmq-a-cloud-based-message-oriented-middleware/
RabbitMQ has a very flexible model, which enables a wide variety of routing scenarios to take place.
I need to split messages across several queues. For the heavy work, I can add more worker threads.
Yes, this is supported via a direct exchange. Publish a message using a routing key that is the same as the name of the queue. For convenience, let's say you use the fully-qualified object name (e.g. MyApp.Objects.DataTypeOne). All you need to do is subscribe multiple consuming processes to this queue, and RabbitMQ will load-balance using a round-robin approach.
But for the lighter work, I can let a single consumer subscribe to those messages.
Yes, you can do this also. Same process as in the paragraph above. Just don't attach multiple consuming processes.
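A minimal pika sketch of both options, reusing the queue name from above (the exchange name and second queue are made up):

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.exchange_declare(exchange="work", exchange_type="direct")

# Heavy work: one queue, several consuming processes -> RabbitMQ round-robins.
ch.queue_declare(queue="MyApp.Objects.DataTypeOne")
ch.queue_bind(queue="MyApp.Objects.DataTypeOne", exchange="work",
              routing_key="MyApp.Objects.DataTypeOne")

# Light work: same wiring, but only ever start one consumer on this queue.
ch.queue_declare(queue="MyApp.Objects.DataTypeTwo")
ch.queue_bind(queue="MyApp.Objects.DataTypeTwo", exchange="work",
              routing_key="MyApp.Objects.DataTypeTwo")

# Publish with the routing key equal to the target queue's name.
ch.basic_publish(exchange="work", routing_key="MyApp.Objects.DataTypeOne",
                 body=b"heavy job")
conn.close()
```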
I have found this image, which is very similar to my business model.
The diagram isn't very useful, because it lacks information about the type of messages being published. In that sense, it is only an interconnect diagram. The interesting lines are the ones connecting the queues to the exchange, as that is what you specify within RabbitMQ via Queue Bindings. You can also bind exchanges to one another, but that's a bit further than we probably need to go.
Everything else on the diagram is fully under your control as the user of the RabbitMQ/AMQP system. You can create an arbitrary number of publishers and have an arbitrary number of consuming processes each consuming from an arbitrary number of queues. There are no hard and fast limits, though there are some practical aspects you probably will want to think about to ensure your system is maintainable.