Distributed batch processing with Spring Batch and AMQP - rabbitmq

I want to distribute the processing of large batches. The idea is to use Spring Batch to fire up a bunch of AMQP consumers in a cloud, then load cheap tasks (like item IDs) and submit them to an AMQP exchange. The consumers themselves will write the results.
Is there a ready-made library to accomplish this?
A few thoughts:
Spring Batch is totally negotiable.
The batch size is several million items. I don't want to kill my message broker by brute-force submitting all these IDs at once; I'd rather use some kind of throttling.
I do want to know which items have been processed so I can monitor the process. So the batch controlling process will have to receive replies from the consumers.

Yes, see the spring-batch-integration project. It combines Spring Batch and Spring Integration to perform what you want.
For batch 2.2.x, it's part of the spring-batch-admin distribution; in the upcoming batch 3.0.0 release it's been moved to batch proper.
Remote partitioning just sends metadata about the partitions and the workers actually fetch the data.
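To make the "metadata only" idea concrete, here is a minimal sketch in plain Python (not Spring Batch code). The function name `partition_id_range` and the dictionary keys are hypothetical; the point is only that the master sends small range descriptions and each worker fetches the rows for its own range.

```python
def partition_id_range(min_id, max_id, partitions):
    """Split the inclusive range [min_id, max_id] into contiguous sub-ranges.

    Each returned dict is the only thing that would travel over the
    transport; the worker loads the actual items for its range itself.
    """
    if partitions < 1 or max_id < min_id:
        raise ValueError("need partitions >= 1 and max_id >= min_id")
    total = max_id - min_id + 1
    size, extra = divmod(total, partitions)
    result = []
    start = min_id
    for index in range(partitions):
        # The first `extra` partitions take one additional item.
        length = size + (1 if index < extra else 0)
        if length == 0:
            break  # more partitions requested than items available
        end = start + length - 1
        result.append({"partition": index, "min_id": start, "max_id": end})
        start = end + 1
    return result
```

For several million IDs and, say, 16 workers, this produces 16 tiny messages instead of millions.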
It comes with a JMS example, but it wouldn't be hard to swap out the Spring Integration JMS gateways for Spring Integration AMQP gateways.
There's also a remote chunking version where the data is sent over the transport instead of partition metadata.
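The throttling and reply-tracking requirements from the question can be sketched independently of any framework. The following is an in-process illustration only: `queue.Queue` objects stand in for the AMQP request and reply queues, threads stand in for cloud consumers, and `run_batch`, `process` and `max_in_flight` are made-up names. It is not how spring-batch-integration implements it.

```python
import queue
import threading


def run_batch(item_ids, process, workers=4, max_in_flight=100):
    """Submit IDs while keeping at most max_in_flight of them unacknowledged."""
    requests = queue.Queue()  # stands in for the AMQP request queue
    replies = queue.Queue()   # stands in for the reply queue

    def worker():
        while True:
            item_id = requests.get()
            if item_id is None:  # sentinel: no more work (IDs are never None)
                return
            try:
                process(item_id)  # the worker writes its own result
                replies.put((item_id, True))
            except Exception:
                replies.put((item_id, False))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()

    results = {}
    pending = 0
    peak = 0

    def collect_one():
        nonlocal pending
        item_id, ok = replies.get()  # blocks until some worker replies
        results[item_id] = ok
        pending -= 1

    for item_id in item_ids:
        while pending >= max_in_flight:  # throttle: wait for replies first
            collect_one()
        requests.put(item_id)
        pending += 1
        peak = max(peak, pending)

    while pending:
        collect_one()
    for _ in threads:
        requests.put(None)
    for thread in threads:
        thread.join()
    return results, peak
```

The controller never has more than `max_in_flight` IDs outstanding, and `results` records which items succeeded, which covers the monitoring requirement.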

Related

Using both, request-reply and pub-sub for microservices communication

We are planning to introduce both pub-sub and request-reply communication models to our microservices architecture. Both communication models are needed.
One solution could be RabbitMQ, as it supports both models and provides HA, clustering and other interesting features.
RabbitMQ request-reply model requires using queues, both for input and for output messages. Only one service can read from the input queue and this increases coupling.
Is there any other recommended solution for using both request-reply and pub-sub communication models in the same system?
Could a service mesh be a better option?
It has to be supported by Node.js, Python and .NET Core.
Thank you for your help
There are multiple options that support pub-sub, request-reply and HA:
1. Kafka
Kafka relies heavily on the filesystem for storing and caching messages. All data is immediately written to a persistent log on the filesystem without necessarily flushing to disk. In effect this just means that it is transferred into the kernel’s pagecache.
Kafka is designed with failure in mind. At some point in time, web communications or storage resources fail. When a broker goes offline, one of the replicas becomes the new leader for the partition. When the broker comes back online, it has no leader partitions. Kafka keeps track of which machine is configured to be the leader. Once the original broker is back up and in a good state, Kafka restores the information it missed in the interim and makes it the partition leader once more.
See :
https://kafka.apache.org/
https://docs.cloudera.com/documentation/kafka/latest/topics/kafka_ha.html
https://docs.confluent.io/4.1.2/installation/docker/docs/tutorials/clustered-deployment.html
2. Redis
Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes with radius queries and streams. Redis has built-in replication, Lua scripting, LRU eviction, transactions and different levels of on-disk persistence, and provides high availability via Redis Sentinel and automatic partitioning with Redis Cluster.
See :
https://redis.io/
https://redislabs.com/redis-enterprise/technology/highly-available-redis/
https://redis.io/topics/sentinel
3. ZeroMQ
ZeroMQ (also known as ØMQ, 0MQ, or zmq) looks like an embeddable networking library but acts like a concurrency framework. It gives you sockets that carry atomic messages across various transports like in-process, inter-process, TCP, and multicast. You can connect sockets N-to-N with patterns like fan-out, pub-sub, task distribution, and request-reply. It's fast enough to be the fabric for clustered products. Its asynchronous I/O model gives you scalable multicore applications, built as asynchronous message-processing tasks. It has a score of language APIs and runs on most operating systems.
See :
https://zeromq.org/
http://zguide.zeromq.org/pdf-c:chapter3
http://zguide.zeromq.org/pdf-c:chapter4
4. RabbitMQ
RabbitMQ is lightweight and easy to deploy on premises and in the cloud. It supports multiple messaging protocols. RabbitMQ can be deployed in distributed and federated configurations to meet high-scale, high-availability requirements.
My preference would be to use a REST API for the request-reply pattern. This is especially applicable to internal microservices, where you control the communication mechanism. I don't understand your comment about why they are not scalable: if you define them properly, you can scale the number of service instances out and in based on demand. Be it Kafka, RabbitMQ, or any other broker, I don't think they were developed with request-reply as the primary use case. And don't forget that, whatever broker you use, a call chain that is A->B->C in REST becomes A->broker->B->broker->C->broker->A, and the broker needs to do its housekeeping.
Then for pub-sub, I would use Kafka, as it has a unified model that supports pub-sub as well as point-to-point.
But if you still want to use a broker for request-reply, I would look at Kafka, as it scales massively via partitions and a lot of near-real-time streaming applications are built on it. So it could come close to the low-latency requirement of the request-reply pattern. But then I would want a framework on top of it to correlate requests and replies, so I would consider using Spring Kafka for that.
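What "a framework to associate requests and replies" has to do can be sketched without any broker. The example below is an assumption-laden illustration: in-memory `queue.Queue` objects replace broker queues or topics, and `RequestReplyClient` and `serve` are hypothetical names, not Spring Kafka or RabbitMQ APIs. The core idea is a correlation ID attached to each request and echoed back on the reply.

```python
import queue
import threading
import uuid


class RequestReplyClient:
    """Matches replies to callers by correlation ID."""

    def __init__(self, request_queue):
        self.request_queue = request_queue
        self.reply_queue = queue.Queue()  # this client's private reply queue
        self._pending = {}
        self._lock = threading.Lock()
        threading.Thread(target=self._dispatch, daemon=True).start()

    def _dispatch(self):
        # Route each incoming reply to the caller waiting on its ID.
        while True:
            correlation_id, body = self.reply_queue.get()
            with self._lock:
                slot = self._pending.pop(correlation_id, None)
            if slot is not None:
                slot.put(body)

    def call(self, body, timeout=5.0):
        correlation_id = str(uuid.uuid4())
        slot = queue.Queue(maxsize=1)
        with self._lock:
            self._pending[correlation_id] = slot
        self.request_queue.put((correlation_id, self.reply_queue, body))
        try:
            return slot.get(timeout=timeout)
        except queue.Empty:
            with self._lock:
                self._pending.pop(correlation_id, None)
            raise TimeoutError("no reply for " + correlation_id)


def serve(request_queue, handler):
    """Server loop: reply to whichever queue the request names."""
    while True:
        correlation_id, reply_to, body = request_queue.get()
        reply_to.put((correlation_id, handler(body)))
```

Because the reply destination travels with the request, any number of server instances can read from the same request queue.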

Flow control limiting message rate on single queue

I have an exchange with only one queue bound to it. When the message publishing rate goes over some cap, RabbitMQ automatically throttles the incoming message rate.
On further investigation I found this happens due to the "flow control" throttling mechanism built into RabbitMQ. https://www.rabbitmq.com/blog/2014/04/14/finding-bottlenecks-with-rabbitmq-3-3/
According to this document, my connection and channels are in flow control but the queue is not, which means there is a CPU-bound or disk-bound limit.
My messages are not persistent, so I don't have a disk limitation. While searching, I found documents stating that a queue is limited to a single CPU. https://groups.google.com/forum/#!msg/rabbitmq-users/wzHMV7F0ugU/zhW_9b8ACQAJ
What does this mean? Does a RabbitMQ queue process use only one CPU even when multiple cores are available on the machine? What is the CPU limitation with respect to queue flow control?
A queue is handled by one and only one CPU core, which means you have to design your message flow through RabbitMQ with multiple queues in order to remain scalable.
If you use only one queue, your maximum message rate is capped no matter how many cores you have.
https://www.rabbitmq.com/queues.html#runtime-characteristics
If you have a specific need to build an architecture with only one logical queue (which is explicitly not recommended), or if you have a queue with really high traffic, you can look at sharded queues: see the sharded queues plugin on GitHub.
It's a plugin (use it with caution and test everything before going to production, especially failure and replication) that splits a logical queue name into multiple queues.
If you are running a benchmark on RabbitMQ, remember to produce to and consume from a number of queues greater than the number of CPU cores on the server.
Other benchmarking tips: try producing only, consuming only, and both at the same time, with different settings (persistence, message size, lazy queues, ...) and different ack settings.
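The idea behind splitting one logical queue into several physical ones can be illustrated with a small hash-routing sketch. This is not the plugin's code or naming scheme: `shard_queue_name`, the `.shard-N` suffix and the use of CRC32 are assumptions made for the example.

```python
import zlib


def shard_queue_name(logical_name, routing_key, shards):
    """Pick one of `shards` physical queues for a message.

    The same routing key always maps to the same shard, so each physical
    queue (and the single core serving it) sees a stable slice of traffic.
    """
    if shards < 1:
        raise ValueError("shards must be >= 1")
    index = zlib.crc32(routing_key.encode("utf-8")) % shards
    return f"{logical_name}.shard-{index}"
```

With at least as many shards as cores, publishing load is spread over several queue processes instead of one.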

What's the difference between RabbitMQ and kafka?

Which will fare better under different scenarios?
I know that RabbitMQ is based on the AMQP protocol and has visualization tools for developers.
RabbitMQ, as you noted, is an implementation of AMQP. AMQP is a messaging protocol which originated in the financial industry. Its core metaphors are Messages, Exchanges, and Queues.
Kafka was designed as a log-structured distributed datastore. It has features that make it suitable for use as part of a messaging system, but it also can accommodate other use cases, like stream processing. It originated at LinkedIn and its core metaphors are Messages, Topics, and Partitions.
Subjectively, I find RabbitMQ more pleasant to work with: its web-based management tool is nice; there is little friction in creating, deleting, and configuring queues and exchanges. Library support is good in many languages. As in its default configurations Rabbit only stores messages in memory, latencies are low.
Kafka is younger, the tooling feels more clunky, and it has had relatively poor support in non-JVM languages, though this is getting better. On the other hand, it has stronger guarantees in the face of network partitions and broker loss, and since it is designed to move messages to disk as soon as possible, it can accommodate a larger data set on typical deployments. (Rabbit can page messages to disk but sometimes it doesn't perform well).
In either, you can design for direct (one:one), fanout (one:many), and pub-sub (many:many) communication patterns.
If I were building a system that needed to buffer massive amounts of incoming data with strong durability guarantees, I'd choose Kafka for sure. If I was working with a JVM language or needed to do some stream processing over the data, that would only reinforce the choice.
If, on the other hand, I had a use case in which I valued latency over throughput and could handle loss of transient messages, I'd choose Rabbit.
Kafka:
Messages are retained after consumption; you control how long with a message retention policy.
It is a distributed event streaming platform.
It can be used as a log.
With Kafka Streams, you can transform and process messages automatically.
Message priority cannot be set.
Ordering is retained only inside a partition. Within a partition, Kafka guarantees that a whole batch of messages either fails or passes.
It has fewer mature client platforms than RMQ (mainly Scala and Java).
RabbitMQ:
RabbitMQ is a queuing system, so messages are deleted right after they are consumed.
It is a distributed message broker.
It cannot be used as a log.
Streaming is not supported in RMQ.
Message priority can be set, and messages can be consumed according to it.
It does not guarantee atomicity, even for transactions involving a single queue.
It is a mature platform (with the best compatibility across languages).
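The first contrast in the two lists (retained log versus delete-on-consume queue) can be modelled in a few lines. This is a toy model, not either broker's implementation; `RetainedLog`, `DestructiveQueue` and their methods are hypothetical names, and timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import deque


class RetainedLog:
    """Kafka-like: reads never delete; only the retention policy does."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._records = []      # list of (timestamp, message)
        self._base_offset = 0   # offset of the oldest retained record

    def append(self, message, now):
        self._records.append((now, message))
        return self._base_offset + len(self._records) - 1

    def read_from(self, offset):
        start = max(offset - self._base_offset, 0)
        return [message for _, message in self._records[start:]]

    def expire(self, now):
        # Drop records older than the retention window.
        while self._records and now - self._records[0][0] > self.retention_seconds:
            self._records.pop(0)
            self._base_offset += 1


class DestructiveQueue:
    """Classic queue: a message is gone once it has been consumed."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def consume(self):
        return self._messages.popleft() if self._messages else None
```

Reading the log twice returns the same records, while consuming from the queue twice returns the message once and then nothing.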

Pros and Cons of Kafka vs Rabbit MQ

Kafka and RabbitMQ are well-known message brokers. I want to build a microservice with Spring Boot, and it seems that Spring Cloud provides out-of-the-box solutions for them as the de facto choices. I know a bit about the trajectory of RabbitMQ, which has a lot of support. Kafka belongs to Apache, so it should be good. So what's the main difference in goals between RabbitMQ and Kafka? Take into consideration that this will be used with Spring Cloud. Please share your experiences and criteria. Thanks in advance.
I certainly wouldn't consider Kafka lightweight. Kafka relies on ZooKeeper, so you'd need to add ZooKeeper to your stack as well.
Kafka is pubsub but you could re-read messages. If you need to process large volumes of data, Kafka performs much better and its synergy with other big-data tools is a lot better. It specifically targets big data.
Three application-level differences are:
Kafka supports re-reading consumed messages, while RabbitMQ does not.
Kafka supports ordering of messages within a partition, while RabbitMQ supports it only under some constraints, such as one exchange routing to one queue with one consumer on that queue.
Kafka is faster at publishing data to a partition than RabbitMQ.
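Per-partition ordering follows from routing every message with the same key to the same append-only partition. A rough sketch, with assumptions stated up front: `PartitionedTopic` is a made-up class, and CRC32 is used as a stand-in hash (Kafka's Java client uses a different hash for keyed records).

```python
import zlib


class PartitionedTopic:
    """Toy topic: one append-only list per partition."""

    def __init__(self, partitions):
        self.partitions = [[] for _ in range(partitions)]

    def partition_for(self, key):
        # Same key -> same partition, so per-key order is preserved.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def send(self, key, value):
        index = self.partition_for(key)
        self.partitions[index].append((key, value))
        return index
```

Messages for one key stay in send order inside their partition; there is no ordering guarantee across different partitions.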
Kafka is more than just a pub/sub messaging platform. It also includes APIs for data integration (Kafka Connect) and stream processing (Kafka Streams). These higher level APIs make developers more productive versus using only lower level pub/sub messaging APIs.
Also Kafka has just added Exactly Once Semantics in June 2017 which is another differentiator.
To start with Kafka does more than what RabbitMQ does. Message broker is just one subset of Kafka but Kafka can also act as Message storage and Stream processing. Comparing just the Message broker part, again Kafka is more robust than RabbitMQ as it supports Replication (for availability) and partitioning (for scalability), Replay of messages (if needed to reprocess) and it is Pull based. RabbitMQ can be scalable by using multiple consumers for a given queue but again it is push based and you lose ordering among multiple consumers.
It all depends on the use case and your question doesn't provide the use case and performance requirements to suggest one over other.
I found a nice answer in this youtube video Apache Kafka Explained (Comprehensive Overview).
It basically says that the difference between Kafka and standard JMS systems like RabbitMQ or ActiveMQ is that
Kafka consumers pull the messages from the brokers which allows for buffering messages for as long as the retention period holds. While in most JMS systems messages are pushed to the consumers which make strategies like back-pressure harder to achieve.
Kafka also eases the replay of events by storing them on disk, so they can be replayed at any time.
Kafka guarantees the ordering of message within a partition.
Kafka overall provides an easy way for building scalable and fault-tolerant systems.
Kafka is more complex and harder to understand than JMS systems.
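The pull model described above can be sketched as a consumer that owns its own offset and decides how much to fetch. This is a simplified illustration, not the Kafka client API: `PullConsumer`, `poll` and `seek` are hypothetical here, and `max_poll_records` merely echoes the name of Kafka's `max.poll.records` setting.

```python
class PullConsumer:
    """Reads from a shared append-only log at its own pace."""

    def __init__(self, log, max_poll_records=2):
        self.log = log                      # list shared with the producer
        self.max_poll_records = max_poll_records
        self.offset = 0                     # next record this consumer reads

    def poll(self):
        # The consumer asks for data; nothing is pushed to it, so a slow
        # consumer simply polls less often (natural back-pressure).
        batch = self.log[self.offset:self.offset + self.max_poll_records]
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        # Rewinding the offset replays records that are still retained.
        self.offset = offset
```

A consumer that falls behind loses nothing as long as the records are still within the retention period.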

Difference between Redis and Kafka

Redis can be used as realtime pub-sub just as Kafka.
I am confused which one to use when.
Any use case would be a great help.
Redis pub-sub is mostly a fire-and-forget system: every message you produce is delivered to all consumers at once, and the data is kept nowhere. Redis is also limited by available memory, and the number of producers and consumers can affect its performance.
Kafka, on the other hand, is a high-throughput, distributed log that can be used as a queue. Any number of producers can produce, and consumers can consume at any time they want. It also provides persistence for the messages sent through the queue.
Final Take:
Use Redis:
If you want a fire-and-forget kind of system, where all the messages that you produce are delivered instantly to consumers.
If speed is the main concern.
If you can live with data loss.
If you don't want your system to hold messages that have been sent.
If the amount of data to be handled is not huge.
Use Kafka:
If you want reliability.
If you want your system to keep a copy of messages that have been sent, even after consumption.
If you can't live with data loss.
If speed is not a big concern.
If the data size is huge.
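The fire-and-forget behaviour described above can be shown with a toy channel. This is an in-memory model rather than a Redis client: `FireAndForgetChannel` is a made-up class, and returning the receiver count from `publish` mirrors what the Redis PUBLISH command reports.

```python
class FireAndForgetChannel:
    """Delivers each message to current subscribers only; stores nothing."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        for callback in self._subscribers:
            callback(message)
        # Number of subscribers that received the message.
        return len(self._subscribers)
```

A subscriber that joins late, or is disconnected when a message is published, never sees that message. That is the data loss the list above refers to.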
Redis 5.0+ version provides the Stream data structure. It could be considered as a log data structure with delivery guarantees. It offers a set of blocking operations allowing consumers to wait for new data added to a stream by producers, and in addition to that, a concept called Consumer Groups.
Basically, the Stream structure provides capabilities similar to Kafka's.
Here is the documentation https://redis.io/topics/streams-intro
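A rough model of what a consumer group adds on top of a plain stream is given below, assuming nothing about Redis internals. `StreamGroup` and its methods are hypothetical; in Redis the corresponding commands are XREADGROUP and XACK, and entry IDs are not simple list indexes as they are here.

```python
class StreamGroup:
    """Each entry goes to one consumer in the group and stays pending until acked."""

    def __init__(self, entries):
        self.entries = entries   # shared append-only list (the stream)
        self.next_index = 0      # first entry not yet delivered to the group
        self.pending = {}        # entry index -> consumer it was delivered to

    def read(self, consumer, count=1):
        batch = []
        while self.next_index < len(self.entries) and len(batch) < count:
            index = self.next_index
            self.pending[index] = consumer
            batch.append((index, self.entries[index]))
            self.next_index += 1
        return batch

    def ack(self, index):
        # Acknowledging removes the entry from the pending set only;
        # the entry itself stays in the stream.
        self.pending.pop(index, None)
```

Entries that remain in `pending` were delivered but never acknowledged, which is what lets another consumer pick them up after a crash.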
The two most popular Java clients that support this feature are Redisson and Jedis.
Redisson provides ReliableTopic object if reliability of delivery is required. https://github.com/redisson/redisson/wiki/6.-distributed-objects/#613-reliable-topic