With redis streams I can read multiple streams in a chronologically forward direction.
Each stream is read from a point in time up to the newest point in that stream.
I can get the oldest item from multiple streams with
XREAD COUNT 1 STREAMS streamA streamB streamC 0-0 0-0 0-0
and from there (noting the returned IDs) move forward, consuming all the items in a selection of streams while limiting consumption to only 1 item per stream.
How can I easily consume multiple streams in reverse, starting with the newest items at the tip of each stream and stepping backwards in time, while constraining consumption to only 1 item from each stream?
You can't, since Redis doesn't provide an XREVREAD command as of version 6.0.
The closest is XREVRANGE, except it only takes one stream at a time. If you really want to, you could write a Lua script that takes multiple streams, loops over them calling XREVRANGE on each, and returns all the results at once.
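If a client-side loop is acceptable instead of a Lua script, here's a minimal sketch of the idea with the redis-py client; the stream names and connection details are placeholders, and prev_id is a helper introduced only for this example:

import redis

r = redis.Redis(decode_responses=True)  # assumes a local Redis instance

def prev_id(entry_id):
    # ID immediately before entry_id, so the next XREVRANGE call excludes
    # the entry we already consumed (on Redis 6.2+ you could instead pass
    # "(" + entry_id as an exclusive max).
    ms, seq = map(int, entry_id.split("-"))
    return f"{ms}-{seq - 1}" if seq > 0 else f"{ms - 1}-{2**64 - 1}"

streams = ["streamA", "streamB", "streamC"]
cursors = {s: "+" for s in streams}  # start at the newest entry of each stream

while True:
    got_any = False
    for s in streams:
        # At most 1 entry per stream, newest first, at or before the cursor
        entries = r.xrevrange(s, max=cursors[s], min="-", count=1)
        if entries:
            entry_id, fields = entries[0]
            print(s, entry_id, fields)
            cursors[s] = prev_id(entry_id)
            got_any = True
    if not got_any:
        break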
We are looking at using redis streams as a cluster wide messaging bus, where each node in the cluster has a unique id. The idea is that each node, when spawned, creates a consumer group with that unique id on a central redis stream, to guarantee each node in the cluster gets a copy of every message. In an orchestrated environment, cluster nodes will be spawned and removed on the fly, each having a unique id. Over time I can see this resulting in there being hundreds or even thousands of old/unused consumer groups all subscribed to the same redis stream.
My question is this - is there an upper limit to the number of consumer groups that redis can handle and does a large number of (unused) consumer groups have any real processing cost? It seems that a consumer group is just a pointer stored in redis that points to the last read entry in the stream, and is only accessed when a consumer of the group does a ranged XREADGROUP. That would lead me to assume (without diving into Redis code) that the number of consumer groups really does not matter, save for the small amount of RAM that the consumer groups pointers would eat up.
Now, I understand we should be smarter and a node should delete its own consumer groups when it is being killed or we should be cleaning this up on a scheduled basis, but if a consumer group is just a record in redis, I am not sure it is worth the effort - at least at the MVP stage of development.
TL;DR;
Is my understanding correct, that there is no practical limit on the number of consumer groups for a given stream and that they have no processing cost unless used?
Your understanding is correct, there's no practical limit to the number of CGs and these do not impact the operational performance.
That said, other than the wasted RAM (which could become significant, depending on the number of consumers in the group and PEL entries), this will add time complexity to invocations of XINFO STREAM ... FULL and XINFO GROUPS as these list the CGs. Once you have a non-trivial number of CGs, every call to these would become slow (and block the server while it is executing).
Therefore, I'd still recommend implementing some type of "garbage collection" for the "stale" CGs, perhaps as soon as the MVP is done. Like any computing resource (e.g. disk space, network, mutexes...) and given there are no free lunches, CGs need to be managed as well.
P.S. IIUC, you're planning to use a single consumer in each group, and have each CG/consumer correspond to a node in your app's cluster. If that is the case, I'm not sure that you need CGs and you can use the simpler XREAD (instead of XREADGROUP) while keeping the last ID locally in the node.
OTOH, assuming I'm missing something and that there's a real need for this use pattern, I'd imagine Redis being able to support it better by offering some form of expiry for idle groups.
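For illustration, such a cleanup could be as small as the following redis-py sketch; live_node_ids() is a hypothetical helper standing in for a query to your orchestrator, and the stream/group naming follows the scheme described in the question:

import redis

r = redis.Redis(decode_responses=True)

def live_node_ids():
    # Hypothetical helper: return the set of node ids currently alive,
    # e.g. queried from your orchestrator's API.
    return {"node-1", "node-2"}

def gc_consumer_groups(stream):
    alive = live_node_ids()
    for group in r.xinfo_groups(stream):
        # Each group is named after the node id that created it, so any
        # group whose node is gone can be destroyed.
        if group["name"] not in alive:
            r.xgroup_destroy(stream, group["name"])

gc_consumer_groups("central-stream")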
The capped streams section of the Redis streams documentation (https://redis.io/topics/streams-intro#capped-streams) mentions capping a stream to prevent memory overload:
...Sometimes it is useful to have at maximum a given number of items inside a stream, other times once a given size is reached, it is useful to move data from Redis to a storage which is not in memory...
However, it only explains Redis's capabilities for trimming the stream. I was not able to find any concept or proven way to actually move the data out of Redis. I understand I can create a consumer that moves all events to the unlimited storage, but the statement quoted above suggests that I should be able to move only old events in an efficient manner. Could you please share an idea of a solution?
IIUC, you're looking to delete consumed messages; the use case could be replaying them or storing them as historical data.
Redis as such does not offer a clean way to move data out of any Redis collection; a capped stream just means you can trim the stream, since growing it without bound could potentially lead to running out of memory.
The easiest way would be to add a consumer in an archive group that consumes from the stream and writes the data somewhere else. Such a consumer has to run for every Redis stream where archival is required; this way you will always have the data in secondary storage.
Now you would need some trim policy to trim the stream; the easiest could be to trim it periodically, e.g. daily. See my other answer:
How to define TTL for redis streams?
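As a rough sketch of the archive-then-trim pattern with redis-py (the stream name, archive_entry(), and the MAXLEN cap are placeholders, not a prescription):

import redis

r = redis.Redis(decode_responses=True)

def archive_entry(entry_id, fields):
    # Placeholder: write the entry to S3, a database, a file, etc.
    print("archiving", entry_id, fields)

last_id = "0-0"  # persist this somewhere durable between runs
while True:
    # Block up to 5 seconds waiting for new entries after last_id
    resp = r.xread({"events": last_id}, count=100, block=5000)
    if not resp:
        continue
    for _stream, entries in resp:
        for entry_id, fields in entries:
            archive_entry(entry_id, fields)
            last_id = entry_id
    # Keep only the most recent ~100k entries in Redis; older entries now
    # live in the secondary storage written by archive_entry().
    r.xtrim("events", maxlen=100000, approximate=True)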
Can XREAD (or perhaps another command) be used to atomically detect whether data was written to a Redis stream?
More specifically:
Suppose you added some data to a Redis stream in one process and saw that the data was added successfully with some auto-generated ID.
XADD somestream foo bar
After this XADD completes, you immediately run the following read in another process.
XREAD COUNT 1000 STREAMS somestream 0-0
Is this XREAD guaranteed to return data? The documentation is not clear about whether a successful XADD guarantees that readers will immediately see the added data, or whether there might be some small delay.
Redis's famous single-threaded architecture answers that question. When you execute XADD from one process (client side) and another process (client side) executes XREAD after that, the server executes the commands consecutively, which guarantees that the data will be there before the XREAD is executed.
The following quotes are from The Little Redis Book:
Every Redis command is atomic, including the ones that do multiple things. Additionally, Redis has support for transactions when using multiple commands.
You might not know it, but Redis is actually single-threaded, which is how every command is guaranteed to be atomic.
While one command is executing, no other command will run. (We’ll briefly talk about scaling in a later chapter.) This is particularly useful when you consider that some commands do multiple things.
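To make the guarantee concrete, here's a small redis-py sketch of the two steps; because the XADD has returned before the XREAD is issued, the read is guaranteed to include the new entry (stream name and fields mirror the question):

import redis

r = redis.Redis(decode_responses=True)

# Process/client 1
new_id = r.xadd("somestream", {"foo": "bar"})

# Process/client 2 (runs strictly after the XADD above has returned)
resp = r.xread({"somestream": "0-0"}, count=1000)
entries = dict(resp)["somestream"]
assert any(entry_id == new_id for entry_id, _fields in entries)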
Am I missing something, or is there no way to generate backpressure with Redis streams? If a producer is pushing data to a stream faster than consumers can consume it, there's no obvious way to signal to the producer that it should stop or slow down.
I expected that there would be a blocking version of XADD, that would block the client until room became available in a capped stream (similar to the blocking version of XREAD that allows consumers to wait until data becomes available), but this doesn't seem to be the case.
How do people deal with the above scenario — signaling to a producer that it should hold off on adding more items to a stream?
I understand that some data stream systems such as Kafka do not require backpressure, but Redis doesn't appear to have a comparable solution, and it seems like this would be a relatively common problem for many Redis streams use cases.
If you have persistence (either RDB or AOF) turned on, your stream messages will be persisted, hence there's no need for backpressure.
And if you use replicas, you have another level of redundancy.
Backpressure is needed only when Redis does not have enough memory (or enough network bandwidth to the replicas) to hold the messages.
And, honestly, I have never seen this scenario.
Why would you want to? Unless you run out of memory it is not an issue, and each consumer, slow or fast, can read at its leisure.
Note we are not using consumer groups, just publishing via XADD, with readers reading via XRANGE from a position stored in a key, which is closer to Kafka. We use one stream per partition.
The producer can check whether the stream has grown too big every 1K messages (via XLEN) and slow down if this is an issue, and you can throw hardware at it: 5 nodes with 20 GB each is pretty easy, with the streams spread across the cluster. I don't really understand the problem; this should be easy, so I'm probably missing something.
There is also a version of XADD that trims the stream as it adds, to ensure you don't overfill even with the above, but reaching that limit would require some pretty extreme volumes. For us this is 2 days' worth for the frequent streams, which send the latest state, and 9 months for others.
Another thing: don't store large messages in the stream; use a blob or a separate key/store.
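Here is what that XLEN check might look like as a sketch in Python with redis-py (the stream name, backlog threshold, and sleep interval are arbitrary placeholders):

import time
import redis

r = redis.Redis(decode_responses=True)

MAX_BACKLOG = 1_000_000  # arbitrary threshold for this sketch
produced = 0

def produce(message):
    global produced
    produced += 1
    # Every 1K messages, check the stream length and back off while it is
    # larger than what the consumers are expected to keep up with.
    if produced % 1000 == 0:
        while r.xlen("events") > MAX_BACKLOG:
            time.sleep(1)
    r.xadd("events", message)

produce({"state": "latest"})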
I'm working on an alerting solution that uses Logstash to stream AWS CloudFront logs from an S3 bucket into Graphite after doing some minor processing.
Since multiple events with the same timestamp can occur (multiple events within a second), I elected to use Carbon Aggregator to count these events per second.
The problem I'm facing is that the aggregated whisper database seems to be dropping data. The normal whisper file sees all of it, but of course it cannot account for more than 1 event per second.
I'm running this setup in docker on an EC2 instance, which isn't hitting any sort of limit (CPU, Mem, Network, Disk).
I've checked every log I could find in the Docker containers and checked docker logs, but nothing jumps out.
I've set the logstash output to display the lines on stdout (not missing any) and to send them to graphite on port 2023, which is set to be the line-by-line receiver for Carbon Aggregator:
[aggregator]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2023
aggregation-rules.conf is set to a very simple count per second:
test.<user>.total1s (1) = count test.<user>.total
storage-schemas.conf:
[default]
pattern = .*
retentions = 1s:24h
Happy to share more of my configuration if you request it.
I've hit a brick wall with this; I've been trying so many different things, but I'm not able to see all the data in the aggregated whisper db.
Any help is very much appreciated.
Carbon aggregator isn't designed to do what you are trying to do. For that use-case you'd want to use statsd to count the events per second.
https://github.com/etsy/statsd/blob/master/docs/metric_types.md#counting
Carbon aggregator is meant to aggregate across different series. For each point it sees on the input, it quantizes the point's timestamp before any aggregation happens, so you are still only going to get a single value per second with the aggregator. statsd will take any number of counter increments and total them up each interval.
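For reference, a statsd counter is just a plain-text datagram of the form <metric>:<value>|c sent over UDP; here's a minimal sketch of emitting one per event from Python (the host, port 8125, and the metric name mirror the question's example but are assumptions about your setup):

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def count_event(user):
    # statsd counter format: "<metric>:<value>|c"; statsd sums these up
    # and flushes one total per interval to Graphite.
    metric = f"test.{user}.total:1|c"
    sock.sendto(metric.encode(), ("localhost", 8125))

count_event("someuser")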