Need advice on a suitable message queue for a Storm spout - RabbitMQ

I'm developing a prototype Lambda system and my data is streaming in via Flume to HDFS. I also need to get the data into Storm. Flume is a push system and Storm is more pull so I don't believe it's wise to try to connect a spout to Flume, but rather I think there should be a message queue between the two. Again this is a prototype, so I'm looking for best practices, not perfection. I'm thinking of putting an AMQP compliant queue as a Flume sink and then pulling the messages from a spout.
Is this a good approach? If so, I want to use a message queue that has relatively robust support in both the Flume world (as a sink) and the Storm world (as a spout). If I go AMQP then I assume that gives me the option to use whatever AMQP-compliant queue I want to use, correct? Thanks.

If you're going to use AMQP, I'd recommend sticking to the finalized 1.0 version of the AMQP spec. Otherwise, you're going to feel some pain when you try to upgrade to it from previous versions.

Your approach makes a lot of sense, but for us the AMQP-compliance issue looked a little less important. I will try to explain why.
We are using Kafka to get data into Storm, mainly for reasons of performance and usability. AMQP-compliant queues do not seem to be designed for holding information for a considerable time, while with Kafka this is just a matter of configuration. This allows us to keep messages for a long time and to "play back" those easily (since the position of the next message to consume is always controlled by the consumer, we can consume the same messages again and again without needing to set up an entire system for that purpose). Also, Kafka's performance is incomparable to anything else I have seen.
Storm has a very useful KafkaSpout, with which the main things to pay attention to are:
Error reporting - there is some improvement to be done there. Messages are not as clear as one would hope.
It depends on ZooKeeper (which is already there if you have Storm), and a path is required to be created for it manually.
Depending on the Storm version, pay attention to the Kafka version in use. It is documented but can easily be missed and cause unclear problems.
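The "playback" property described above, where the consumer rather than the broker controls the read position, can be sketched in a few lines. This is an illustrative in-memory model, not the real Kafka API; the class and method names are invented for the sketch:

```python
class Log:
    """Append-only message log, like a single Kafka partition."""
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1   # offset of the new message

    def read(self, offset):
        # Reading is non-destructive; the same offset can be read again.
        return self.messages[offset]

class Consumer:
    """The consumer owns its offset, so replay is just rewinding it."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        if self.offset >= len(self.log.messages):
            return None                 # nothing new yet
        msg = self.log.read(self.offset)
        self.offset += 1
        return msg

    def seek(self, offset):
        self.offset = offset            # "playback": consume the same messages again

log = Log()
for event in ["a", "b", "c"]:
    log.append(event)

c = Consumer(log)
first_pass = [c.poll() for _ in range(3)]
c.seek(0)                               # rewind to the beginning
second_pass = [c.poll() for _ in range(3)]
```

In an AMQP-style queue, consuming removes the message, so replaying it would require re-publishing; here the consumer just seeks back.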

You can have the data streamed to a broker topic first. Then Flume and a Storm spout can both consume from that topic. Flume has a JMS source that makes it easy to consume from the message broker, and there is a Storm JMS spout to get the messages into Storm.

Related

What's the difference between RabbitMQ and Kafka?

Which will fare better under different scenarios?
I know that RabbitMQ is based on the AMQP protocol and has a visualization tool for the developer.
RabbitMQ, as you noted, is an implementation of AMQP. AMQP is a messaging protocol which originated in the financial industry. Its core metaphors are Messages, Exchanges, and Queues.
Kafka was designed as a log-structured distributed datastore. It has features that make it suitable for use as part of a messaging system, but it also can accommodate other use cases, like stream processing. It originated at LinkedIn and its core metaphors are Messages, Topics, and Partitions.
Subjectively, I find RabbitMQ more pleasant to work with: its web-based management tool is nice, and there is little friction in creating, deleting, and configuring queues and exchanges. Library support is good in many languages. Since in its default configuration Rabbit only stores messages in memory, latencies are low.
Kafka is younger, the tooling feels more clunky, and it has had relatively poor support in non-JVM languages, though this is getting better. On the other hand, it has stronger guarantees in the face of network partitions and broker loss, and since it is designed to move messages to disk as soon as possible, it can accommodate a larger data set on typical deployments. (Rabbit can page messages to disk but sometimes it doesn't perform well).
In either, you can design for direct (one:one), fanout (one:many), and pub-sub (many:many) communication patterns.
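The routing patterns mentioned above can be modeled with a tiny in-memory "exchange". This is an illustrative sketch, not a real client API; the class and method names are invented:

```python
from collections import defaultdict

class Exchange:
    """Toy exchange: routes published messages to bound queues (plain lists)."""
    def __init__(self):
        self.bindings = defaultdict(list)   # routing_key -> list of queues

    def bind(self, queue, routing_key=""):
        self.bindings[routing_key].append(queue)

    def publish_direct(self, routing_key, msg):
        # direct (one:one): only queues bound with the matching key get it
        for q in self.bindings[routing_key]:
            q.append(msg)

    def publish_fanout(self, msg):
        # fanout (one:many): every bound queue gets its own copy
        for queues in self.bindings.values():
            for q in queues:
                q.append(msg)

orders, audit = [], []
ex = Exchange()
ex.bind(orders, "order")
ex.bind(audit, "audit")

ex.publish_direct("order", "o1")   # only the orders queue sees this
ex.publish_fanout("broadcast")     # both queues see this
```

In Kafka the same shapes fall out of topics and consumer groups: consumers in different groups each get a copy (fanout), while consumers in the same group split the partitions (one:one).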
If I were building a system that needed to buffer massive amounts of incoming data with strong durability guarantees, I'd choose Kafka for sure. If I were working with a JVM language or needed to do some stream processing over the data, that would only reinforce the choice.
If, on the other hand, I had a use case in which I valued latency over throughput and could handle loss of transient messages, I'd choose Rabbit.
Kafka:
Messages will always be there; you can manage this by specifying a message retention policy.
It is a distributed event streaming platform.
It can be used as a log.
With Kafka Streams, you can transform and process messages automatically.
Message priority cannot be set.
Order is retained only inside a partition. Within a partition, Kafka guarantees that a whole batch of messages either fails or succeeds.
Client support is less mature than RabbitMQ's (strongest in Scala and Java).
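The "order only inside a partition" point above comes from how producers assign messages to partitions: messages with the same key land in the same partition, so their relative order is preserved there, while there is no global order across partitions. A small illustrative model (the partitioner below is a deterministic stand-in, not Kafka's actual hash):

```python
NUM_PARTITIONS = 2
partitions = [[] for _ in range(NUM_PARTITIONS)]

def partition_for(key):
    # Kafka's default partitioner hashes the message key; a simple
    # deterministic stand-in is used here for illustration.
    return sum(key.encode()) % NUM_PARTITIONS

def produce(key, value):
    partitions[partition_for(key)].append((key, value))

# Interleave events from two "users"
for i in range(3):
    produce("user-1", f"u1-event-{i}")
    produce("user-2", f"u2-event-{i}")

# All of user-1's events sit in one partition, in publish order
p1 = [v for k, v in partitions[partition_for("user-1")] if k == "user-1"]
p2 = [v for k, v in partitions[partition_for("user-2")] if k == "user-2"]
```

A consumer reading one partition therefore sees a given key's events in order, even though events across partitions may interleave arbitrarily.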
RabbitMQ:
RabbitMQ is a queuing system, so messages are deleted as soon as they are consumed.
It is a distributed message broker.
It cannot be used as a log.
Streaming is not supported in RabbitMQ.
The priority of a message can be set, and messages can be consumed on the basis of it.
It does not guarantee atomicity, even for transactions involving a single queue.
It is a mature platform with good client compatibility across languages.
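The priority behaviour noted above (RabbitMQ's priority queues deliver higher-priority messages first) can be sketched with a heap in a few lines. This is an illustrative model, not the AMQP client API:

```python
import heapq
import itertools

class PriorityQueue:
    """Toy model of a priority queue: highest priority is consumed first."""
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker keeps FIFO within a priority

    def publish(self, msg, priority=0):
        # Negate priority: heapq is a min-heap, higher priority should pop first.
        heapq.heappush(self._heap, (-priority, next(self._seq), msg))

    def consume(self):
        return heapq.heappop(self._heap)[2]

q = PriorityQueue()
q.publish("routine", priority=1)
q.publish("urgent", priority=9)
q.publish("normal", priority=5)
order = [q.consume() for _ in range(3)]
```

In Kafka there is no equivalent knob: the log is strictly append-ordered per partition, which is exactly why priority cannot be set there.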

RabbitMQ - federated queues Vs exchange federation

I have set up a rabbit cluster and I publish messages into a fanout exchange every time something changes in a database.
I have dedicated queues bound to this exchange for some of my microservices that consume these updates and I also originally set up a dedicated queue for an external client so that they can federate it with their own rabbit infrastructure and consume a copy of every message.
Now I'm wondering whether allowing exchange federation rather than creating a new dedicated queue for each new external consumer would be a better approach since more and more users will come.
What are the pros and cons?
Thanks
As long as you manage permissions properly, the final decision is up to you. You can try all the variants first and find which fits your actual needs.
Having a local queue has its pros and cons: it allows the end user to survive an outage or network issue in their infrastructure, at the cost of your disk/memory; however, you can limit the queue length and/or size.
I'd suggest taking a look at the Shovel plugin and dynamic shovels. With a local queue it may serve you well.
Compared to federation, a shovel is much simpler: it doesn't sync content between upstream and downstream but simply moves messages from one queue to another in a reliable manner. As long as you don't need what federation provides, a shovel could be a good choice.
You may also find this Q&A useful (though it might be a bit outdated): https://stackoverflow.com/a/19357272
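The "reliable move" a shovel performs can be sketched as: take a message from the upstream queue, hand it to the downstream, and only acknowledge (remove) upstream after the downstream has accepted it, so nothing is lost mid-transfer. An illustrative model, with invented names:

```python
from collections import deque

def shovel(upstream, downstream, deliver):
    """Move messages upstream -> downstream; deliver() returns True on success."""
    moved = 0
    while upstream:
        msg = upstream[0]            # peek: don't remove until confirmed
        if deliver(downstream, msg):
            upstream.popleft()       # "ack" upstream only after confirmation
            moved += 1
        else:
            break                    # leave the message in place for a retry
    return moved

up = deque(["m1", "m2", "m3"])
down = []

def deliver(queue, msg):
    queue.append(msg)                # a real shovel would publish over AMQP here
    return True

moved = shovel(up, down, deliver)
```

If delivery fails halfway, the unconfirmed messages are still sitting upstream, which is the property that makes the shovel "reliable" despite its simplicity.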

Pros and Cons of Kafka vs RabbitMQ

Kafka and RabbitMQ are well-known message brokers. I want to build a microservice with Spring Boot, and it seems that Spring Cloud provides out-of-the-box support for both as the de facto choices. I know a bit of the trajectory of RabbitMQ, which has a lot of support. Kafka belongs to Apache, so it should be good. So what's the main difference between RabbitMQ and Kafka? Take into consideration that this will be used with Spring Cloud. Please share your experiences and criteria. Thanks in advance.
I certainly wouldn't consider Kafka lightweight. Kafka relies on ZooKeeper, so you'd need to add ZooKeeper to your stack as well.
Kafka is pubsub but you could re-read messages. If you need to process large volumes of data, Kafka performs much better and its synergy with other big-data tools is a lot better. It specifically targets big data.
Three application-level differences are:
Kafka supports re-reading consumed messages, while RabbitMQ does not.
Kafka guarantees ordering of messages within a partition, while RabbitMQ supports it only under constraints such as one exchange routing to one queue, with one consumer per queue.
Kafka is faster than RabbitMQ at publishing data to partitions.
Kafka is more than just a pub/sub messaging platform. It also includes APIs for data integration (Kafka Connect) and stream processing (Kafka Streams). These higher level APIs make developers more productive versus using only lower level pub/sub messaging APIs.
Also Kafka has just added Exactly Once Semantics in June 2017 which is another differentiator.
To start with, Kafka does more than what RabbitMQ does. Message brokering is just one subset of Kafka; it can also act as message storage and do stream processing. Comparing just the message-broker part, Kafka is again more robust than RabbitMQ, as it supports replication (for availability), partitioning (for scalability), and replay of messages (if you need to reprocess), and it is pull based. RabbitMQ can scale by using multiple consumers for a given queue, but it is push based and you lose ordering among multiple consumers.
It all depends on the use case, and your question doesn't provide the use case or performance requirements to suggest one over the other.
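The point above about losing ordering with multiple consumers on one RabbitMQ queue follows from round-robin dispatch: messages alternate between consumers, so if one consumer is slower, its messages complete out of order relative to the other's. A small illustrative model:

```python
from collections import deque

queue = deque([1, 2, 3, 4])
consumers = {"fast": [], "slow": []}

# Round-robin dispatch: msg 1 -> fast, 2 -> slow, 3 -> fast, 4 -> slow
names = ["fast", "slow"]
for i, msg in enumerate(queue):
    consumers[names[i % 2]].append(msg)

# Suppose the slow consumer only finishes its messages after the fast one
# has finished all of its own: global completion order is then 1, 3, 2, 4,
# not the publish order 1, 2, 3, 4.
completion_order = consumers["fast"] + consumers["slow"]
```

Kafka sidesteps this by assigning whole partitions to consumers, so each partition's messages are still processed by a single consumer, in order.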
I found a nice answer in this YouTube video: Apache Kafka Explained (Comprehensive Overview).
It basically says that the difference between Kafka and standard JMS systems like RabbitMQ or ActiveMQ is that:
Kafka consumers pull messages from the brokers, which allows buffering messages for as long as the retention period holds, while in most JMS systems messages are pushed to the consumers, which makes strategies like back-pressure harder to achieve.
Kafka also eases the replay of events by storing them on disk, so they can be replayed at any time.
Kafka guarantees the ordering of messages within a partition.
Kafka overall provides an easy way to build scalable and fault-tolerant systems.
Kafka is more complex and harder to understand than JMS systems.
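The pull-based back-pressure point above can be sketched simply: the broker just retains the log, and a slow consumer decides how much to take per poll, so it lags (its offset falls behind) rather than being flooded by pushed messages. An illustrative model, with invented names:

```python
log = [f"msg-{i}" for i in range(100)]   # the broker retains everything

class PullConsumer:
    """The consumer, not the broker, decides how much it takes per poll."""
    def __init__(self, log, max_batch):
        self.log = log
        self.offset = 0
        self.max_batch = max_batch

    def poll(self):
        batch = self.log[self.offset:self.offset + self.max_batch]
        self.offset += len(batch)
        return batch

    def lag(self):
        # Consumer lag: how far behind the end of the log we are.
        return len(self.log) - self.offset

slow = PullConsumer(log, max_batch=10)
first_batch = slow.poll()                # takes only what it can handle
```

With push delivery, the broker would have to track each consumer's capacity (prefetch limits, flow control) to achieve the same effect; with pull, back-pressure is the default behaviour.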

Specifying RabbitMQ messaging strategy: memory or disk

I am new to RabbitMQ and wonder about the message-saving strategy. By default, RabbitMQ keeps message queues in memory. That gives high performance, but my messages are important and should be saved to disk, because the server may go down at any time; that approach shows slower performance.
Which approach should be preferred? What is your real-world experience?
There is a whole lot regarding persistence here.
You can make queues durable; that way messages are saved to disk. Of course, only until they are acknowledged!
You didn't say what your use case is and what you need this for, but bear in mind that RabbitMQ is not a database.
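The "saved to disk only until acknowledged" behaviour above can be sketched as a toy model: each published message is written to a file before delivery and removed only on ack, so unacknowledged messages survive a restart. This is illustrative only, not RabbitMQ's actual persistence layer:

```python
import json
import os
import tempfile

class DurableQueue:
    """Toy durable queue: pending (unacked) messages are persisted to a file."""
    def __init__(self, path):
        self.path = path
        self.pending = {}
        if os.path.exists(path):            # recover state after a "crash"
            with open(path) as f:
                self.pending = json.load(f)
        self._next_id = len(self.pending)

    def _flush(self):
        with open(self.path, "w") as f:
            json.dump(self.pending, f)

    def publish(self, msg):
        msg_id = str(self._next_id)
        self._next_id += 1
        self.pending[msg_id] = msg
        self._flush()                       # on disk before delivery
        return msg_id

    def ack(self, msg_id):
        del self.pending[msg_id]            # only now is the message gone
        self._flush()

path = os.path.join(tempfile.mkdtemp(), "queue.json")
q = DurableQueue(path)
mid = q.publish("important")
restarted = DurableQueue(path)              # simulate a broker restart
```

The disk write on every publish is exactly where the performance cost of persistence comes from; once the consumer acks, the broker has no reason to keep the message, which is why RabbitMQ is not a database.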

Redis PUB/SUB and high availability

Currently I'm working on a distributed test execution and reporting system. I'm planning to use Redis PUB/SUB as a message queue and message distribution system.
I'm new to Redis, so I'm trying to read as many docs as I can and play around with it. One of the most important topics is high availability. As I said, I'm not an expert, but I'm aware of the possible options - using Sentinel, replication, clustering, etc.
What's not clear to me is how the Pub/Sub feature and the HA options are related to each other. What's the best practice to build a reliable messaging system with Redis? By reliable I mean that if my Redis message broker is down, there should be some kind of backup node (a slave?) able to take over this role.
Is there a purely server-side solution? Or do I need to create a smart wrapper around the Redis client to handle this? Will a Sentinel-driven setup help me?
Doing pub/sub in Redis with failover means thinking about additional factors on the client side. A key point to understand is that subscriptions are per-connection. If you are subscribed to a channel on a node and it fails, you will need to handle reconnecting and resubscribing. Because subscriptions are done at the connection level, they are not something that can be replicated.
Regarding the details of how it works and what you can expect to see, along with ways around it, see a post I made earlier this year: https://objectrocket.com/blog/how-to/reliable-pubsub-and-blocking-commands-during-redis-failovers
You can lower the risk surface by subscribing to slaves and publishing to the master, but you would then need non-promotable slaves to subscribe to, and you would still need to handle losing a slave - there is just as much chance of losing a given slave as of losing the master.
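The "smart wrapper" idea from the question maps directly onto the per-connection caveat above: the client must remember which channels it subscribed to and replay those subscriptions after a failover. A minimal sketch; the connection class below is an invented stand-in, not the redis-py API:

```python
class Connection:
    """Stand-in for a Redis connection; subscriptions live only here."""
    def __init__(self):
        self.subscribed = set()

    def subscribe(self, channel):
        self.subscribed.add(channel)

class ResilientSubscriber:
    """Tracks its own channel set so it can resubscribe after failover."""
    def __init__(self):
        self.channels = set()       # client-side record of subscriptions
        self.conn = Connection()

    def subscribe(self, channel):
        self.channels.add(channel)
        self.conn.subscribe(channel)

    def on_failover(self):
        # The old connection (and its subscriptions) is gone; reconnect to
        # the promoted node and replay every subscription from our record.
        self.conn = Connection()
        for channel in self.channels:
            self.conn.subscribe(channel)

sub = ResilientSubscriber()
sub.subscribe("tests")
sub.subscribe("reports")
sub.on_failover()                   # master failed; a slave was promoted
```

Note that messages published between the failure and the resubscribe are lost either way; pub/sub in Redis is fire-and-forget, which is part of why a job queue may fit better for this use case.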
IMO, Pub/Sub is not a good choice; maybe Disque (which comes from antirez, the author of Redis) fits better:
Disque, an in-memory, distributed job queue