Data exchange between Data Centers - rabbitmq

We have two data centers, DC_1 and DC_2, and each data center has an application, Processor_App, which processes data and stores it in ElasticSearch. The data has to be exchanged between the data centers, either before or after it is processed by Processor_App and stored in ES in both DCs. Can RabbitMQ be used to fetch and exchange data across DCs?

Can RabbitMQ be used to fetch and exchange data across DC's?
Yes. You should read about Federation since I think that feature might fit what you need.
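As a rough illustration, federation is configured through broker parameters and policies rather than application code. The sketch below (hypothetical host names, credentials, and exchange-name pattern; it assumes the rabbitmq_federation and management plugins are enabled) uses the RabbitMQ management HTTP API from Python to declare the DC_1 broker as an upstream on the DC_2 broker and federate matching exchanges:

```python
import requests

# Hypothetical management endpoint and credentials for the DC_2 broker.
MGMT = "http://dc2-rabbit:15672/api"
AUTH = ("guest", "guest")
VHOST = "%2F"  # URL-encoded default vhost "/"

# Declare the DC_1 broker as a federation upstream on DC_2.
requests.put(
    f"{MGMT}/parameters/federation-upstream/{VHOST}/dc1-upstream",
    auth=AUTH,
    json={"value": {"uri": "amqp://user:pass@dc1-rabbit:5672"}},
).raise_for_status()

# Apply a policy so every exchange named "federated.*" pulls from the upstream.
requests.put(
    f"{MGMT}/policies/{VHOST}/federate-dc1",
    auth=AUTH,
    json={
        "pattern": r"^federated\.",
        "apply-to": "exchanges",
        "definition": {"federation-upstream-set": "all"},
    },
).raise_for_status()
```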
NOTE: the RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.

Related

RabbitMQ: Limiting consumer prefetch across multiple connections

I have two processes on two different servers connecting to RabbitMQ and consuming messages from the same queues (for active/active HA). Is it possible to ensure that a maximum total of one message in a queue is unacked at a given point in time, across two connections?
Combining the "exclusive" flag with basic.qos(1) would ensure that a maximum of one message in a queue is unacked at a given point in time, but would have only one process consuming.
Is there a way to have a consumer prefetch limit (e.g. basic.qos(1)) apply as a total across all connections while still having all connections able to consume?
It's not possible; the prefetch limit is scoped to a single channel or connection, never shared across connections. Please see the documentation for the global flag.
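To make the scope concrete, here is a small pika sketch (the host name is a placeholder; the keyword argument is global_qos in pika 1.x). The prefetch limit applies to a single channel, or with the global flag to all consumers on that one channel, but never across separate connections:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Per-consumer limit: at most one unacked message per consumer on this channel.
channel.basic_qos(prefetch_count=1)

# "Global" limit: shared by all consumers on this channel (RabbitMQ's
# interpretation of the AMQP global flag), but still confined to this
# single connection; other connections keep their own limits.
channel.basic_qos(prefetch_count=1, global_qos=True)
```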
NOTE: the RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.

System design to aggregate in near-real time, the N most shared articles over the last five minutes, last hour and last day?

I was recently asked this system design question in an interview:
Let's suppose an application allows users to share articles from 3rd
party sites with their connections. Assume all share actions go
through a common code path on the app site (served by multiple servers
in geographically diverse colos). Design a system to aggregate, in
near-real time, the N most shared articles over the last five minutes,
last hour and last day. Assume the number of unique shared articles
per day is between 1M and 10M.
So I came up with the following components:
Existing service tier that handles share events
Aggregation service
Data Store
Some Transport mechanism to send notifications of share events to aggregation service
Now I started talking about how the data from the existing service tier that handles share events would get to the aggregation servers. A possible solution was to use a messaging queue like Kafka here.
The interviewer asked why I chose Kafka, how Kafka would work here, what topics I would create, and how many partitions they would have. Since I was confused, I couldn't answer properly. Basically he was trying to probe my understanding of point-to-point vs. publish-subscribe and push vs. pull models.
Now I started talking about how the aggregation service operates. One solution I gave was to keep a collection of counters for each shared URL, in 5-minute buckets covering the last 24 hours (288 buckets per URL). As each share event happens, increment the current bucket and recompute the 5-minute, hour, and day totals, updating the Top-N lists as necessary. As each newly shared URL comes in, push out any URLs that haven't been updated in 24 hours. Now I think all of this can be done on a single machine.
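A minimal single-machine sketch of that bucketed-counter idea (the names and bucket arithmetic below are illustrative, not from the interview):

```python
import heapq
import time
from collections import defaultdict

BUCKET_SECONDS = 300        # 5-minute buckets
BUCKETS_PER_DAY = 24 * 12   # 288 buckets cover the last 24 hours

# url -> {bucket_index: share_count}
counters = defaultdict(lambda: defaultdict(int))

def record_share(url, now=None):
    """Increment the current 5-minute bucket for this URL."""
    now = now or time.time()
    bucket = int(now // BUCKET_SECONDS)
    counters[url][bucket] += 1
    # Drop buckets older than 24 hours so memory stays bounded.
    oldest = bucket - BUCKETS_PER_DAY
    for stale in [b for b in counters[url] if b < oldest]:
        del counters[url][stale]

def top_n(window_seconds, n=10, now=None):
    """Return the N most-shared URLs over the trailing window."""
    now = now or time.time()
    start = int((now - window_seconds) // BUCKET_SECONDS)
    totals = {
        url: sum(c for b, c in buckets.items() if b >= start)
        for url, buckets in counters.items()
    }
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])
```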
The interviewer asked: can this all be done on one machine? Can maintenance of 1M-10M tracked shares be done on one machine? If not, how would you partition? What happens if it crashes, and how will you recover? Basically I was confused about how the aggregation service would actually work here: how it gets data from Kafka and what it actually does with that data.
Now for the data store part, I don't think we need a persistent data store here, so I suggested we could use Redis with partitioning and redundancy.
The interviewer asked: how will you partition and add redundancy here? How will the Redis instances get updated by the overall flow, and how will Redis be structured? I was confused about this as well. I told him that we could write the output from the aggregation service to these Redis instances.
There were a few things I was not able to answer since I am confused about how the entire flow would work. Can someone help me understand how we can design a system like this in a distributed fashion, and what I should have answered to the questions the interviewer asked me?
The intention of these questions is not to get the ultimate answer to the problem, but to check the competence and thought process of the interviewee. There is no point in panicking when answering these kinds of questions or facing tough follow-up questions; the follow-ups are meant to guide you or give the interviewee a hint.
I will try to share one probable answer to this problem. Assume I have a distributed persistent store like Cassandra, and I maintain the state of sharing at any moment in that Cassandra infrastructure. In front of the persistence layer I maintain a Redis cluster for LRU caching, with buckets for 5 minutes, 1 hour, and a day; eviction is configured using expiring keys. Now my aggregator service only needs to address the small amount of data present in my Redis LRU cache. A high-throughput distributed Kafka cluster pumps data from the share handler into the Redis cluster and from there into Cassandra. To keep the output near real time, the Kafka cluster's throughput has to keep pace with the incoming share events.
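One way to make the Redis part of that answer concrete is the sketch below, assuming a redis-py client and illustrative key names: each share increments a per-URL counter in a 5-minute bucket key that expires after 24 hours, and the trailing windows are summed from those bucket keys.

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379)

BUCKET_SECONDS = 300
DAY_SECONDS = 24 * 3600

def record_share(url, now=None):
    """Increment this URL's counter in the current 5-minute bucket."""
    now = now or time.time()
    bucket = int(now // BUCKET_SECONDS)
    key = f"shares:{bucket}"
    pipe = r.pipeline()
    pipe.hincrby(key, url, 1)       # per-bucket hash: url -> count
    pipe.expire(key, DAY_SECONDS)   # buckets evict themselves after a day
    pipe.execute()

def totals(window_seconds, now=None):
    """Sum per-URL counts across the buckets covering the trailing window."""
    now = now or time.time()
    current = int(now // BUCKET_SECONDS)
    first = int((now - window_seconds) // BUCKET_SECONDS)
    result = {}
    for bucket in range(first, current + 1):
        for url, count in r.hgetall(f"shares:{bucket}").items():
            result[url] = result.get(url, 0) + int(count)
    return sorted(result.items(), key=lambda kv: kv[1], reverse=True)
```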

Load balancing for RabbitMQ server (broker), not the consumers(clients)

In this example I have a setup of 2 consumers and 2 publishers in my network. At the centre is a RabbitMQ broker, as shown in the screenshot below. For fail-safety reasons, I am wondering whether RabbitMQ supports load balancing or mirroring of the server (broker) in any way. I would just like to get rid of the star topology for two reasons:
1) If one broker fails, another broker can take over immediately
2) If one broker's network throughput is not good enough, the other takes over
Solving one or the other (or even both) would be great.
My current infrastructure
Preferred infrastructure
RabbitMQ clustering (docs) can meet your first requirement. Use three nodes and be sure your applications are coded and tested to take failure scenarios into account.
I don't know of anything out-of-the-box that can meet your second requirement. You will have to implement something that uses cluster statistics or application statistics to determine when to switch to another cluster due to lower throughput.
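For the client side of the first requirement, a common pattern (sketched here with pika; host and queue names are placeholders) is to give the application the address of every cluster node so it can connect to a surviving node when the one it was using fails:

```python
import pika

# All nodes of the RabbitMQ cluster; pika tries them in order until one accepts.
nodes = [
    pika.ConnectionParameters(host="rabbit-node-1"),
    pika.ConnectionParameters(host="rabbit-node-2"),
    pika.ConnectionParameters(host="rabbit-node-3"),
]

connection = pika.BlockingConnection(nodes)
channel = connection.channel()

# A mirrored or quorum queue keeps its contents available if one node dies;
# the application still has to reconnect and re-register its consumers.
channel.queue_declare(queue="orders", durable=True)
```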
NOTE: the RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.

RabbitMQ vs NoSQL?

I was just wondering why you would use something like RabbitMQ instead of a persistent store, especially a document store like MongoDB? Aren't they kind of the same? What's the benefit of something like RabbitMQ over a database?
Would anyone who used something like RabbitMQ elaborate on the benefits?
RabbitMQ is message broker software, i.e. a queue, not a NoSQL database!
While the trend is towards storing more and more data in scaled-up queues and processing data in real time, thus obviating the need for additional data storage, queues are not to be confused with databases:
most queues don't persist data indefinitely.
the data in queues is not available on demand by the use of queries, but accessed via an automatically triggered consumer mechanism.
the architectural intention behind queues differs tremendously from that of databases. Their purpose in a system's architecture is not data storage, but system integration and data distribution. For more good information on queue architecture, please check this article from the Kafka guys.

Synchronize one queue instance with multiple Redis instances

The Scenario:
We have multiple nodes distributed geographically on which we want to have queues collecting messages for that location. And then we want to send this collected data from every queue in every node to their corresponding queues in a central location. In the central node, we will pull out data collected in the queues (from other nodes), process it and store it persistently.
Constraints:
Data is very important to us. Therefore, we have to make sure that we are not losing data in any case.
Therefore, we need persistent queues on every node so that even if the node goes down for some random reason, when we bring it up we have the collected data safe with us and we can send it to the central node where it can be processed.
Similarly, if the central node goes down, the data must remain at all the other nodes so that when the central node comes up we can send all the data to the central node for processing.
Also, the data on the central node must not get duplicated or stored again. That is, data collected on one of the nodes should be stored on the central node only once.
The data that we are collecting is very important to us and the order of data delivery to the central node is not an issue.
Our Solution
We have considered a couple of solutions, of which I am going to list the one we thought would be best. A possible solution (in our opinion) is to use Redis to maintain queues everywhere, because Redis provides persistent storage. Then perhaps have a daemon running on all the geographically separated nodes which reads the data from the queue and sends it to the central node. The central node, on receiving the data, sends an ACK to the node it received the data from (because the data is very important to us), and on receiving the ACK the node deletes the data from the queue. Of course, there will be a timeout period within which the ACK must be received.
The Problem
The above solution (in our opinion) will work fine, but the issue is that we don't want to implement the whole synchronization protocol ourselves, for the simple reason that we might get it wrong. We were unable to find this particular kind of synchronization in Redis, so we are open to other AMQP-based queues like RabbitMQ, ZeroMQ, etc. Again, we were not able to figure out whether we can do this with those solutions.
Do these Message Queues or any other data store provide features that can be the solution to our problem? If yes, then how?
If not, then is our solution good enough?
Can anyone suggest a better solution?
Can there be a better way to do this?
What would be the best way to make it fail safe?
The data that we are collecting is very important to us and the order of data delivery to the central node is not an issue.
You could do this with RabbitMQ by setting up the central node (or cluster of nodes) to be a consumer of messages from the other nodes, and using the message acknowledgement feature. This feature means that the central node(s) can ack delivery, so that other nodes only delete messages after the ack. See for example: http://www.rabbitmq.com/tutorials/tutorial-two-python.html
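A minimal sketch of that acknowledgement pattern with pika (host and queue names are placeholders): an edge node publishes persistent messages to a durable queue, and the central consumer only acks after it has processed and stored the message, so anything unacked is redelivered rather than lost.

```python
import pika

# --- On an edge node: publish a persistent message to a durable queue. ---
conn = pika.BlockingConnection(pika.ConnectionParameters("central-broker"))
ch = conn.channel()
ch.queue_declare(queue="collected-data", durable=True)
ch.basic_publish(
    exchange="",
    routing_key="collected-data",
    body=b'{"node": "eu-west", "payload": "..."}',
    properties=pika.BasicProperties(delivery_mode=2),  # persist message to disk
)

# --- On the central node: consume with manual acknowledgements. ---
def store_persistently(body):
    # Placeholder for the central node's processing/storage step.
    print("stored:", body)

def handle(channel, method, properties, body):
    store_persistently(body)
    channel.basic_ack(delivery_tag=method.delivery_tag)  # ack only once stored

ch.basic_consume(queue="collected-data", on_message_callback=handle, auto_ack=False)
ch.start_consuming()
```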
If you have further questions, please email the rabbitmq-discuss mailing list.