Difference between Redis and Kafka - redis

Redis can be used as realtime pub-sub just as Kafka.
I am confused which one to use when.
Any use case would be a great help.

Redis pub-sub is mostly like a fire and forget system where all the messages you produced will be delivered to all the consumers at once and the data is kept nowhere. You have limitation in memory with respect to Redis. Also, the number of producers and consumers can affect the performance in Redis.
Kafka, on the other hand, is a high throughput, distributed log that can be used as a queue. Here any number of users can produce and consumers can consume at any time they want. It also provides persistence for the messages sent through the queue.
Final Take:
Use Redis:
If you want a fire and forget kind of system, where all the messages that you produce are delivered instantly to consumers.
If speed is most concerned.
If you can live up with data loss.
If you don't want your system to hold the message that has been sent.
The amount of data that is gonna be dealt with is not huge.
Use kafka:
If you want reliability.
If you want your system to have a copy of messages that has been sent even after consumption.
If you can't live up with data loss.
If Speed is not a big concern.
data size is huge

Redis 5.0+ version provides the Stream data structure. It could be considered as a log data structure with delivery guarantees. It offers a set of blocking operations allowing consumers to wait for new data added to a stream by producers, and in addition to that, a concept called Consumer Groups.
Basically Stream structure provides the same capabilities as Kafka.
Here is the documentation https://redis.io/topics/streams-intro
There are two most popular Java clients that support this feature: Redisson and Jedis
Redisson provides ReliableTopic object if reliability of delivery is required. https://github.com/redisson/redisson/wiki/6.-distributed-objects/#613-reliable-topic

Related

ActiveMQ datastore for cluster setup

We have been using ActiveMQ version 5.16.0 broker with single instances in production. Now we are planning to use cluster of AMQ brokers for HA and load distribution with consistency in message data. Currently we are using only one queue
HA can be achieved using failover but do we need to use the same datastore or it can be separated? If I use different instances for AMQ brokers then how to setup a common datastore.
Please guide me how to setup datastore for HA and load distribution
Multiple ActiveMQ servers clustered together can provide HA in a couple ways:
Scale message flow by using compute resources across multiple broker nodes
Maintain message flow during single node planned or unplanned outage of a broker node
Share data store in the event of ActiveMQ process failure.
Network of brokers solve #1 and #2. A standard 3-node cluster will give you excellent performance and ability to scale the number of producers and consumers, along with splitting the full flow across 3-nodes to provide increased capacity.
Solving for #3 is complicated-- in all messaging products. Brokers are always working to be completely empty-- so clustering the data store of a single-broker becomes an anti-pattern of sorts. Many times, relying on RAID disk with a single broker node will provide higher reliability than adding NFSv4, GFSv2, or JDBC and using shared-store.
That being said, if you must use a shared store-- follow best practices and use GFSv2 or NFSv4. JDBC is much slower and requires significant DB maintenance to keep running efficiently.
Note: [#Kevin Boone]'s note about CIFS/SMB is incorrect and CIFS/SMB should not be used. Otherwise, his responses are solid.
You can configure ActiveMQ so that instances share a message store, or so they have separate message stores. If they share a message store, then (essentially) the brokers will automatically form a master-slave configuration, such that only one broker (at a time) will accept connections from clients, and only one broker will update the store. Clients need to identify both brokers in their connection URIs, and will connect to whichever broker happens to be master.
With a shared message store like this, locks in the message store coordinate the master-slave assignment, which makes the choice of message store critical. Stores can be shared filesystems, or shared databases. Only a few shared filesystem implementations work properly -- anything based on NFS 4.x should work. CIFS/SMB stores can work, but there's so much variation between providers that it's hard to be sure. NFS v3 doesn't work, however well-implemented, because the locking semantics are inappropriate. In any case, the store needs to be robust, or replicated, or both, because the whole broker cluster depends on it. No store, no brokers.
In my experience, it's easier to get good throughput from a shared file store than a shared database although, of course, there are many factors to consider. Poor network connectivity will make it hard to get good throughput with any kind of shared store (or any kind of cluster, for that matter).
When using individual message stores, it's typical to put the brokers into some kind of mesh, with 'network connectors' to pass messages from one broker to another. Both brokers will accept connections from clients (there is no master), and the network connections will deal with the situation where messages are sent to one broker, but need to be consumed from another.
Clients' don't necessarily need to specify all brokers in their connection URIs, but generally will, in case one of the brokers is down.
A mesh is generally easier to set up, and (broadly speaking) can handle more client load, than a master-slave with shared filestore. However, (a) losing a broker amounts to losing any messages that were associated with it (until the broker can be restored) and (b) the mesh interferes with messaging patterns like message grouping and exclusive consumers.
There's really no hard-and-fast rule to determine which configuration to use. Many installers who already have some sort of shared store infrastructure (a decent relational database, or a clustered NFS, for example) will tend to want to use it. The rise in cloud deployments has had the effect that mesh operation with no shared store has become (I think) a lot more popular, because it's so symmetric.
There's more -- a lot more -- that could be said here. As a broad question, I suspect the OP is a bit out-of-scope for SO. You'll probably get more traction if you break your question up into smaller, more focused, parts.

Count inflight messages in Redis

I'm using Redis as a simple pubsub broker, managed by the redis-py library, using just the default 'main' channel. Is there a technique, in either Redis itself or the wrapping Python library to count the number of messages in this queue? I don't have deeper conceptual knowledge of Redis (in particular how it implements broker functionality) so am not sure if such a question makes sense
Exact counts, lock avoidance etc. is not necessary; I only need to check periodically (on the order of minutes) whether this queue is empty
Redis Pub/Sub doesn't hold any internal queues of messages see - https://redis.io/topics/pubsub.
If you need a more queue based publish mechanism you might to check Redis Streams. Redis Streams provides two methods that might help you XLEN and XINFO.

How to use backpressure with Redis streams?

Am I missing something, or is there no way to generate backpressure with Redis streams? If a producer is pushing data to a stream faster consumers can consume it, there's no obvious way to signal to the producer that it should stop or slow down.
I expected that there would be a blocking version of XADD, that would block the client until room became available in a capped stream (similar to the blocking version of XREAD that allows consumers to wait until data becomes available), but this doesn't seem to be the case.
How do people deal with the above scenario — signaling to a producer that it should hold off on adding more items to a stream?
I understand that some data stream systems such as Kafka do not require backpressure, but Redis doesn't appear to have a comparable solution, and it seems like this would be a relatively common problem for many Redis streams use cases.
If you have persistence (either RDB or AOF) turned on, your stream messages will be persisted, hence there's no need for backpressure.
And if you use replicas, you have another level of redudancy.
Backpressure is needed only when Redis does not have enough memory (or enough network bandwidth to the replicas) to hold the messages.
And, honestly, I have never seen this scenario.
Why would you want to ? Unless you run out of memory it is not an issue and each consumer slow and fast can read at their leisure.
Note not using consumer groups just publishing via XADD and readers read via XRANGE via position stored in a key which is closer to Kafka. Using one stream per partition.
Producer can check if the table size gets too big every 1K messages (via XLEN) to slow down if this is an issue and cant you cant throw HW at it 5 nodes with 20 Gig each is pretty easy with the streams spread across the cluster .. Don't understand this should be easy so im probably missing something.
There is also an XADD version that trims the size of the table to ensure you don't over fill with the above but that world require some pretty extreme stuff. For us this is 2 days worth for frequent stuff which sends the latest state and 9 months for others.
Another thing dont store large messages in the stream , use a blob or separate key/ store.

Redis Streams vs Kafka Streams/NATS

Redis team introduce new Streams data type for Redis 5.0. Since Streams looks like Kafka topics from first view it seems difficult to find real world examples for using it.
In streams intro we have comparison with Kafka streams:
Runtime consumer groups handling. For example, if one of three consumers fails permanently, Redis will continue to serve first and second because now we would have just two logical partitions (consumers).
Redis streams much faster. They stored and operated from memory so this one is as is case.
We have some project with Kafka, RabbitMq and NATS. Now we are deep look into Redis stream to trying using it as "pre kafka cache" and in some case as Kafka/NATS alternative. The most critical point right now is replication:
Store all data in memory with AOF replication.
By default the asynchronous replication will not guarantee that XADD commands or consumer groups state changes are replicated: after a failover something can be missing depending on the ability of followers to receive the data from the master. This one looks like point to kill any interest to try streams in high load.
Redis failover process as operated by Sentinel or Redis Cluster performs only a best effort check to failover to the follower which is the most updated, and under certain specific failures may promote a follower that lacks some data.
And the cap strategy. The real "capped resource" with Redis Streams is memory, so it's not really so important how many items you want to store or which capped strategy you are using. So each time you consumer fails you would get peak memory consumption or message lost with cap.
We use Kafka as RTB bidder frontend which handle ~1,100,000 messages per second with ~120 bytes payload. With Redis we have ~170 mb/sec memory consumption on write and with 512 gb RAM server we have write "reserve" for ~50 minutes of data. So if processing system would be offline for this time we would crash.
Could you please tell more about Redis Streams usage in real world and may be some cases you try to use it themself? Or may be Redis Streams could be used with not big amount of data?
long time no see. This feels like a discussion that belongs in the redis-db mailing list, but the use case sounds fascinating.
Note that Redis Streams are not intended to be a Kafka replacement - they provide different properties and capabilities despite the similarities. You are of course correct with regards to the asynchronous nature of replication. As for scaling the amount of RAM available, you should consider using a cluster and partition your streams across period-based key names.

What's the difference between RabbitMQ and kafka?

Which will fair better under different scenarios?
I know that RabbitMQ is based on AMQP protocol, and has visualization for the developer.
RabbitMQ, as you noted, is an implementation of AMQP. AMQP is a messaging protocol which originated in the financial industry. Its core metaphors are Messages, Exchanges, and Queues.
Kafka was designed as a log-structured distributed datastore. It has features that make it suitable for use as part of a messaging system, but it also can accommodate other use cases, like stream processing. It originated at LinkedIn and its core metaphors are Messages, Topics, and Partitions.
Subjectively, I find RabbitMQ more pleasant to work with: its web-based management tool is nice; there is little friction in creating, deleting, and configuring queues and exchanges. Library support is good in many languages. As in its default configurations Rabbit only stores messages in memory, latencies are low.
Kafka is younger, the tooling feels more clunky, and it has had relatively poor support in non-JVM languages, though this is getting better. On the other hand, it has stronger guarantees in the face of network partitions and broker loss, and since it is designed to move messages to disk as soon as possible, it can accommodate a larger data set on typical deployments. (Rabbit can page messages to disk but sometimes it doesn't perform well).
In either, you can design for direct (one:one), fanout (one:many), and pub-sub (many:many) communication patterns.
If I were building a system that needed to buffer massive amounts of incoming data with strong durability guarantees, I'd choose Kafka for sure. If I was working with a JVM language or needed to do some stream processing over the data, that would only reinforce the choice.
If, on the other hand, I had a use case in which I valued latency over throughput and could handle loss of transient messages, I'd choose Rabbit.
Kafka:
Message will be always there. You can manage this by specifying a
message retention policy.
It is distributed event streaming platform.
We can use it as a log
Kafka streaming, you can change and process the message automatically.
We can not set message priority
Retain order only inside a partition. In a partition, Kafka guarantees that the whole batch of messages either fail or pass.
Not many mature platforms like RMQ (Scala and Java)
RabbitMQ:
RabbitMQ is a queuing system so messages get deleted just after consume.
It is distributed, by a message broker.
It can not be used like this
Streaming is not supported in RMQ
We can set the priority of the message and can consume on the basis of the same.
Does not support guarantee automaticity even in relation to transactions involving a single queue.
Mature platform ( Have the best compatibility with all languages)