Is Redis Pub/Sub apt for moderate-size binary data?

I've got jobs that I'm planning to send to workers via Redis Pub/Sub. Each job involves processing an image (JPEG, 20 KB-800 KB, typically around 150 KB).
Is it a good idea to send the image directly as the message's payload?

I don't see this as a problem at all. If you are confident your subscribers/workers will be able to keep up and you won't risk running out of RAM, then I think this is a valid approach. I don't know if it's better than nginx streaming as suggested, but being an in-memory data store, Redis should scale pretty close to the hardware and network limits.
Keep in mind that Redis Pub/Sub is not durable: if an image is published to a channel that no one is currently subscribed to, it won't get picked up. The image would just go nowhere.
You could build a durable queue pretty easily using a Redis list if you need durability.
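For illustration, here is what that durable variant might look like with redis-py (a minimal sketch; the queue name image_jobs and the processing function are made up):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def process_image(jpeg_bytes: bytes) -> None:
    # stand-in for the real image processing
    print(f"processing {len(jpeg_bytes)} bytes")

def enqueue_image(jpeg_bytes: bytes) -> None:
    # LPUSH keeps the job in the list until a worker removes it,
    # so nothing is lost if no worker is listening right now
    r.lpush("image_jobs", jpeg_bytes)

def worker_loop() -> None:
    while True:
        # BRPOP blocks until a job is available, then atomically removes it
        _key, jpeg_bytes = r.brpop("image_jobs")
        process_image(jpeg_bytes)
```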

You can base64-encode the JPEG file into a string and publish the string to the channel.
Note that base64 inflates the payload to roughly 4/3 of the original size.
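For example, a minimal sketch with redis-py (the channel name images and the file path are made up; note that redis-py can also publish raw bytes directly, so base64 is only needed if a client on the channel requires text):

```python
import base64
import redis

r = redis.Redis()

# publisher: read the JPEG and base64-encode it (~4/3 the original size)
with open("photo.jpg", "rb") as f:
    payload = base64.b64encode(f.read())
r.publish("images", payload)

# subscriber: decode the payload back into raw JPEG bytes
p = r.pubsub()
p.subscribe("images")
for message in p.listen():
    if message["type"] == "message":
        jpeg_bytes = base64.b64decode(message["data"])
```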

Related

How to use backpressure with Redis streams?

Am I missing something, or is there no way to generate backpressure with Redis Streams? If a producer is pushing data to a stream faster than consumers can consume it, there's no obvious way to signal to the producer that it should stop or slow down.
I expected that there would be a blocking version of XADD that would block the client until room became available in a capped stream (similar to the blocking version of XREAD, which allows consumers to wait until data becomes available), but this doesn't seem to be the case.
How do people deal with the above scenario — signaling to a producer that it should hold off on adding more items to a stream?
I understand that some data stream systems such as Kafka do not require backpressure, but Redis doesn't appear to have a comparable solution, and it seems like this would be a relatively common problem for many Redis streams use cases.
If you have persistence (either RDB or AOF) turned on, your stream messages will be persisted, hence there's no need for backpressure.
And if you use replicas, you have another level of redundancy.
Backpressure is needed only when Redis does not have enough memory (or enough network bandwidth to the replicas) to hold the messages.
And, honestly, I have never seen this scenario.
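If you want to check this from code, here is a minimal sketch with redis-py (normally you would set appendonly yes in redis.conf rather than at runtime):

```python
import redis

r = redis.Redis()

# turn on AOF persistence so stream entries survive a restart
r.config_set("appendonly", "yes")
print(r.config_get("appendonly"))  # {'appendonly': 'yes'}
```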
Why would you want to? Unless you run out of memory it is not an issue, and every consumer, slow or fast, can read at its leisure.
Note that we are not using consumer groups: the producer publishes via XADD and readers read via XRANGE from a position stored in a key, which is closer to the Kafka model, with one stream per partition.
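A minimal sketch of that reader pattern with redis-py (stream and key names are made up; exclusive XRANGE start ids with the "(" prefix need Redis 6.2+):

```python
import redis

r = redis.Redis()

def handle(fields: dict) -> None:
    print(fields)  # stand-in for real processing

def read_batch(count: int = 100) -> None:
    # resume from the position stored in a plain key, Kafka-offset style
    last_id = r.get("reader:pos") or b"0-0"
    # "(" makes the range exclusive of the last-seen entry id
    entries = r.xrange("events", min=b"(" + last_id, count=count)
    for entry_id, fields in entries:
        handle(fields)
        last_id = entry_id
    r.set("reader:pos", last_id)
```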
The producer can check whether the stream is getting too big every 1K messages (via XLEN) and slow down if that becomes an issue, and you can always throw hardware at it: five nodes with 20 GB each is pretty easy with the streams spread across the cluster. I don't see why this should be hard, so I'm probably missing something.
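That throttling check might look like this (a minimal sketch; names and limits are illustrative, and the maxlen argument is the trimming XADD variant mentioned in the next paragraph):

```python
import time
import redis

r = redis.Redis()
MAX_PENDING = 1_000_000  # illustrative threshold

def produce(messages) -> None:
    for i, msg in enumerate(messages):
        # every 1K messages, check the stream length and back off if
        # readers have fallen too far behind (crude backpressure)
        if i % 1000 == 0:
            while r.xlen("events") > MAX_PENDING:
                time.sleep(0.1)
        # maxlen caps the stream; approximate=True trims in whole
        # macro-nodes, which is cheaper (the "~" form of XADD MAXLEN)
        r.xadd("events", {"data": msg}, maxlen=5_000_000, approximate=True)
```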
There is also an XADD variant (MAXLEN) that trims the stream so you don't overfill even with the above, but hitting that cap would require some pretty extreme volume. For us the retention is 2 days' worth for frequent data that sends the latest state and 9 months for the rest.
Another thing: don't store large messages in the stream; put the payload in a blob or a separate key/store and keep only a reference in the stream.
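A minimal sketch of that out-of-band pattern (the key scheme and TTL are made up): the payload lives under its own key and the stream entry carries only a reference:

```python
import uuid
import redis

r = redis.Redis()

def publish_large(payload: bytes) -> None:
    blob_key = f"blob:{uuid.uuid4()}"
    r.set(blob_key, payload, ex=86400)    # payload in its own key, 1-day TTL
    r.xadd("events", {"blob": blob_key})  # stream holds just the pointer

def consume_one():
    for _entry_id, fields in r.xrange("events", count=1):
        return r.get(fields[b"blob"])     # dereference to fetch the payload
    return None
```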

Redis Streams vs Kafka Streams/NATS

The Redis team introduced the new Streams data type in Redis 5.0. Since Streams looks like Kafka topics at first glance, it seems difficult to find real-world examples for using it.
In the Streams intro we have a comparison with Kafka streams:
Runtime consumer-group handling. For example, if one of three consumers fails permanently, Redis will continue to serve the first and second, because now we would have just two logical partitions (consumers).
Redis Streams is much faster, since entries are stored in and served from memory, so that one is clear-cut.
We have some projects using Kafka, RabbitMQ and NATS. Now we are looking deeply into Redis Streams, trying to use it as a "pre-Kafka cache" and in some cases as a Kafka/NATS alternative. The most critical point right now is replication:
Storing all data in memory with AOF persistence/replication.
By default the asynchronous replication will not guarantee that XADD commands or consumer group state changes are replicated: after a failover, something can be missing, depending on the ability of the followers to receive the data from the master. This looks like a point that kills any interest in trying Streams under high load.
Redis failover process as operated by Sentinel or Redis Cluster performs only a best effort check to failover to the follower which is the most updated, and under certain specific failures may promote a follower that lacks some data.
And the capping strategy: the real "capped resource" with Redis Streams is memory, so it's not really so important how many items you want to store or which capping strategy you use. So each time a consumer fails, you either hit peak memory consumption or, with a cap, lose messages.
We use Kafka as an RTB bidder frontend which handles ~1,100,000 messages per second with ~120-byte payloads. With Redis that would mean ~170 MB/sec of memory consumed on writes, so a 512 GB RAM server gives a write "reserve" of roughly 50 minutes of data. If the processing system were offline for longer than that, we would crash.
Could you tell us more about real-world Redis Streams usage, and perhaps about cases where you have tried it yourself? Or should Redis Streams only be used with smaller amounts of data?
Long time no see! This feels like a discussion that belongs on the redis-db mailing list, but the use case sounds fascinating.
Note that Redis Streams are not intended to be a Kafka replacement - they provide different properties and capabilities despite the similarities. You are of course correct with regard to the asynchronous nature of replication. As for scaling the amount of RAM available, you should consider using a cluster and partitioning your streams across period-based key names, e.g. as sketched below.
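For instance, period-based partitioning might be as simple as this sketch (the naming scheme is made up):

```python
import datetime
import redis

r = redis.Redis()

def stream_key(now=None) -> str:
    # one stream per day, e.g. "events:2018-07-15"
    now = now or datetime.datetime.utcnow()
    return f"events:{now:%Y-%m-%d}"

r.xadd(stream_key(), {"data": "..."})

# retention then becomes dropping whole periods, and the per-day keys
# spread across different slots in a Redis Cluster
r.delete("events:2018-06-01")
```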

Specifying RabbitMQ messaging strategy: memory or disk

I am new to RabbitMQ and wondering about the message-saving strategy. By default RabbitMQ keeps message queues in memory, which gives high performance. But my messages are important and should be saved to disk, because the server may go down at any time, and that approach is slower.
Which option is preferable? What is your real-world experience?
There is a whole lot regarding persistence here.
You can make queues durable and mark messages as persistent; that way messages are saved to disk. Of course, only until they are acknowledged! A minimal sketch follows below.
You didn't say what your use case is or what you need this for, but bear in mind that RabbitMQ is not a database.
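For completeness, here is what that setup could look like with pika (the queue name jobs is made up): the queue is declared durable and each message is marked persistent, so both survive a broker restart:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# durable=True: the queue definition survives a broker restart
channel.queue_declare(queue="jobs", durable=True)

# delivery_mode=2 marks the message itself persistent (written to disk)
channel.basic_publish(
    exchange="",
    routing_key="jobs",
    body=b"important payload",
    properties=pika.BasicProperties(delivery_mode=2),
)
connection.close()
```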

What is the point of Redis in the ELK stack?

I currently have an architecture with Filebeat as the log shipper, which sends logs to a Logstash indexer instance and then on to managed Elasticsearch in AWS. Because of persistent TCP connections, I cannot load-balance multiple Logstash indexer instances with AWS ELB, since Filebeat always picks one of the instances and sends everything there. So I decided to use Redis. Now, seeing how difficult it is to scale Redis and make it a highly available component of the ELK stack, I want to ask what the point of Redis even is. I have read a million times that it acts as a buffer, but if Filebeat stops sending logs when Logstash can't handle the load, why do we even need a buffer? Filebeat is smart enough to stop sending logs, and Logstash is smart enough to stop sending logs to Elasticsearch if Elasticsearch goes down; the pipeline simply pauses. I really don't understand the point of Redis acting as a buffer in every standard ELK architecture.
Redis or Kafka or XYZ can be used as a buffer in the ELK stack, as you've rightly noticed.
The ES folks published a blog post yesterday about using Kafka in the pipeline, but it could just as well have been Redis or XYZ. They make a good point about WHEN such a buffer could be needed and when it is not.
It is a good idea to have such a buffer in order to:
- handle event spikes
- deal with a potentially unreachable ES cluster

If you don't anticipate such behaviors, i.e. you know:
- your events will always come at the same rate, and/or
- you're OK with your logs being shipped a bit later in case you need to upgrade your ES cluster

...then you don't need such a buffer. What's more, that will be one less piece of software you need to manage, monitor and maintain.
When it comes to the Elastic Stack ecosystem, there's no one-size-fits-all approach, it always depends on your precise use case and requirements. You need to ask yourself what is important to you, your system(s) and your users and then design your solution accordingly.

When will LogStash exceed the queue capacity and drop messages?

I am using Logstash to collect the logs from my service. The volume of data is so large (20 GB/day) that I am afraid some of it will be dropped at peak times.
So I asked a question here and decided to add Redis as a buffer between ELB and Logstash to prevent data loss.
However, I am curious: when exactly will Logstash exceed its queue capacity and drop messages?
I ask because I've done some experiments and the results show that Logstash can completely process all the data without any loss, e.g., local file --> Logstash --> local file, netcat --> Logstash --> local file.
Can someone give me a solid example of when Logstash will eventually drop messages? Then I can better understand why we need a buffer in front of it.
As far as I know, the Logstash queue is very small. Please refer to the documentation here:
Logstash sets each queue size to 20. This means only 20 events can be pending into the next phase.
This helps reduce any data loss and in general avoids logstash trying to act as a data storage
system. These internal queues are not for storing messages long-term.
As you say, your daily log volume is 20 GB, which is quite a lot. So it is recommended to install a Redis in front of Logstash. Another advantage of installing Redis is that if your Logstash process errors out and shuts down, Redis can buffer the logs for you; otherwise all your logs would be dropped.
The maximum queue size is configurable, and the queue can be stored on disk or in memory. (I strongly advise in-memory, given the high volume.)
When the queue is full, Logstash will stop reading log messages and drop incoming logs.
For log files, Logstash will stop reading further when it can't keep up; it can resume reading later, since it keeps track of active log files and the last read position. The files basically act as an enormous buffer, so it's really unlikely to lose data (unless the files are deleted).
For TCP/UDP input, messages can be lost if the queue is full.
For other inputs/outputs, you have to check the docs: whether they support back pressure, and whether they can replay missed messages if a network connection is lost.
Generally speaking, 20 GB a day is pretty low (even in 2014 when this was originally posted); we're talking about roughly 1,000 messages a second. Logstash really doesn't need a Redis in front.
For very large deployments (multiple TB per day), it's common to encounter Kafka somewhere in the chain to buffer messages. At that scale there are typically many clients with different types of messages, flowing over a variety of protocols.