Deleting a range of entries from a Redis Stream - redis

I want to delete entries of a Redis Stream older than a particular entry ID. But the XDEL command takes each ID explicitly as input. Is there any way to specify a range of IDs, which would help when the stream holds a large number of entries? Trimming a range of entries would also help me reclaim unused memory.

Currently, there's no way.
However, XTRIM is designed to accept different trimming strategies,
even if currently only MAXLEN is implemented. Given that this is an
explicit command, it is possible that in the future it will allow
trimming by time, because a user calling this command in a
stand-alone way is supposed to know what she or he is doing.
One useful eviction strategy that XTRIM should have is probably the
ability to remove by a range of IDs. This is currently not possible,
but will be likely implemented in the future in order to more easily
use XRANGE and XTRIM together to move data from Redis to other storage
systems if needed.
You can use XTRIM to reclaim the space by giving it the desired stream length:
XTRIM mystream MAXLEN ~ 1000
Here 1000 is the approximate length of the stream that remains. With the ~ modifier, Redis may keep somewhat more than 1000 entries, but not fewer.
Ref: https://redis.io/topics/streams-intro

Related

How many keys can be deleted in a single redis del command?

I want to delete multiple Redis keys using a single delete command from a Redis client.
Is there any limit on the number of keys that can be deleted?
I will be using del key1 key2 ....
There's no hard limit on the number of keys, but the query buffer limit does provide a bound. Connections are closed when the buffer hits 1 GB, so practically speaking this is somewhat difficult to hit.
Docs:
https://redis.io/topics/clients
However! You may want to take into consideration that Redis is single-threaded: a time-consuming command will block all other commands until completed. Depending on your use-case this may make a good case for "chunking" up your deletes into groups of, say, 1000 at a time, because it allows other commands to squeeze in between. (Whether or not this is tolerable is something you'll need to determine based on your specific scenario.)
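A minimal sketch of the "chunking" idea, assuming `client` is a redis-py `redis.Redis` instance (whose `delete(*names)` returns the number of keys removed). The helper names and the default chunk size of 1000 are illustrative choices, not anything prescribed by Redis.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def delete_in_chunks(client, keys, chunk_size=1000):
    """Issue one DEL per chunk so other clients' commands can run in between."""
    removed = 0
    for chunk in chunked(keys, chunk_size):
        removed += client.delete(*chunk)
    return removed
```

Each DEL is still atomic on its own, but the deletion as a whole is not, so other clients may observe a partially deleted set of keys.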

compare redis command: multi and mget

There are two systems sharing a Redis database: one only reads from Redis, the other updates it.
The reading system is so busy that Redis can't keep up. To reduce the number of requests to Redis, I found "mget", but I also found "multi".
I'm sure mget will reduce the number of requests, but will "multi" do the same? I think "multi" forces the Redis server to keep some state about the transaction and collect the transaction's requests from the client one by one, so the total number of requests sent stays the same, but the results are returned together, right?
So if I just read KeyA, KeyB, KeyC in "multi" while the other (writing) system changes KeyB's value, what will happen?
Short answer: you should use MGET.
MULTI is used for transactions, and it does not reduce the number of requests. Also, the MULTI command might be deprecated in the future, since there is a better choice: Lua scripting.
So if I just read KeyA, KeyB, KeyC in "multi" while the other (writing) system changes KeyB's value, what will happen?
Since MULTI (with EXEC) runs the queued commands as a transaction, all three GET commands (read operations) execute atomically, with no other client's command interleaved. If the update is executed before the transaction, you'll get the new value of KeyB. Otherwise, you'll get the old value.
By the way, there's another option to reduce the number of round trips: pipelining. However, in your case, MGET should be the best option.
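For comparison, here is a rough sketch of both approaches, assuming `client` is a redis-py `redis.Redis` instance; the function names are invented for illustration. Both send the three reads in a single round trip, but only MGET is a single command on the server side.

```python
def read_with_mget(client, keys):
    """One command, one round trip: MGET returns values in key order."""
    keys = list(keys)
    return dict(zip(keys, client.mget(keys)))


def read_with_pipeline(client, keys):
    """Several GET commands batched into one round trip.

    transaction=False asks redis-py for a plain pipeline, without MULTI/EXEC.
    """
    keys = list(keys)
    pipe = client.pipeline(transaction=False)
    for key in keys:
        pipe.get(key)
    return dict(zip(keys, pipe.execute()))
```

Missing keys come back as None in both cases.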

optimization: stream of many key updates, small number of keys

I have a program that receives a constant stream of data.
From this stream of data I populate a hashtable. Every piece of data I receive
is translated into either:
a key update;
or a key insertion if the key doesn't already exist.
I store the incoming raw data in a queue before it is processed.
The number of keys in the hashtable is very small. 99% of the data I receive
corresponds to key updates.
The problem is that I have so many key updates that the queue becomes
too big for my consumers.
Obviously, from the thousands of key updates, many of them concern the same
key, so only the last one has a real value while all the others are useless.
What is the best way for me to handle this case? Which data structure should I
be using?
What can you tell us about your keys? How many are there? Are they numeric (and if so, what range of values might they take?), textual? Any limit on the number of bytes per key? What kind of hash table are you inserting to (e.g. closed hashing, open hashing)? What contention/locking is there on the hash table? How many updates per second? What programming language are you using?
How many keys
A few hundreds or maybe a few thousands. Not a lot!
Numeric keys
The keys themselves are alphanumeric, they are not very long, around 30 characters at most. The values, however, are all numbers (integers).
Limit on the number of bytes per key
My keys are 30 characters long, at most.
Kind of hash table
I'm simply using Python's defaultdict
Contention/locking
Python's dictionaries are considered thread-safe
Number of updates per second
It can go from 1 every 3 seconds to more than 100 per second
Programming language
I'm using python
Instead of using a simple queue you can use another hashtable - each incoming message could be stored in the appropriate stack based on key. You then take each element from each stack (which will be the most recent item) - you can optionally clear each stack when you pull an item out.
In .NET, ConcurrentDictionary would fit the bill nicely; in Python, a dict guarded by a lock plays a similar role.
But what you need here is a (possibly adaptive) throttling mechanism that detects when the queue is draining too slowly and starts collapsing the data.
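One way to picture the "collapse updates per key" idea in Python is a small buffer that keeps only the latest value for each key and lets the consumer drain it in one go. This is a sketch under assumptions, not a drop-in replacement for the queue: the class name and its methods are invented here, and it always collapses rather than adaptively switching on when consumers fall behind.

```python
import threading


class CoalescingBuffer:
    """Keeps only the most recent value per key until a consumer drains it."""

    def __init__(self):
        self._latest = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        # A newer update for the same key simply overwrites the older one.
        with self._lock:
            self._latest[key] = value

    def drain(self):
        # Swap in an empty dict so producers are blocked only briefly.
        with self._lock:
            snapshot, self._latest = self._latest, {}
        return snapshot


if __name__ == "__main__":
    buf = CoalescingBuffer()
    for key, value in [("a", 1), ("b", 2), ("a", 3)]:
        buf.put(key, value)
    print(buf.drain())  # only the last value for each key remains
```

With a few hundred or a few thousand keys, the pending work is bounded by the number of keys rather than by the number of updates received.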

Correct modeling in Redis for writing single entity but querying multiple

I'm trying to move data from a SQL database to Redis in order to gain much higher throughput, because the system handles a very high request rate. I'm aware of the downsides regarding persistence, storage costs, etc.
So, I have a table called "Users" with a few columns. Let's assume: ID, Name, Phone, Gender
Around 90% of the requests are writes, each updating a single row.
Around 10% of the requests are reads, each fetching 20 rows.
I'm trying to get my head around the right modeling of this in order to get the max out of it.
If there were only updates - I would use Hashes.
But because of the 10% of Reads I'm afraid it won't be efficient.
Any suggestions?
Actually, the real question is whether you need to support partial updates.
Supposing partial update is not required, you can store your record in a blob associated with a key (i.e. string datatype). All write operations can be done in one roundtrip, since the record is always written at once. Several read operations can be done in one roundtrip as well using the MGET command.
Now, supposing partial update is required, you can store your record in a dictionary associated with a key (i.e. hash datatype). All write operations can be done in one roundtrip (even if they are partial). Several read operations can also be done in one roundtrip provided HGETALL commands are pipelined.
Pipelining several HGETALL commands is a bit more CPU consuming than using MGET, but not by much. In terms of latency, it should not be significantly different, unless you execute hundreds of thousands of them per second on the Redis instance.
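The two modelling options could look roughly like this in Python. Everything here is a sketch under assumptions: `client` is assumed to be a redis-py `redis.Redis` instance, the `user:<id>` key naming and the JSON encoding of the blob are illustrative choices, and the function names are made up. The two options are alternatives; they should not be mixed on the same key.

```python
import json


def user_key(user_id):
    # Hypothetical key naming scheme.
    return f"user:{user_id}"


# Option 1: whole record as a blob (string datatype), no partial updates.
def save_user_blob(client, user_id, user):
    client.set(user_key(user_id), json.dumps(user))


def load_users_blob(client, user_ids):
    raw = client.mget([user_key(uid) for uid in user_ids])
    return [json.loads(item) if item is not None else None for item in raw]


# Option 2: record as a hash, which allows partial updates.
def update_user_fields(client, user_id, **fields):
    client.hset(user_key(user_id), mapping=fields)


def load_users_hash(client, user_ids):
    # Pipelined HGETALL: many commands, one round trip.
    pipe = client.pipeline(transaction=False)
    for uid in user_ids:
        pipe.hgetall(user_key(uid))
    return pipe.execute()
```

A read of 20 rows is then either a single MGET or 20 pipelined HGETALL commands, both in one round trip.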

Real time analytic processing system design

I am designing a system that should analyze a large number of user transactions and produce aggregated measures (such as trends, etc.).
The system should work fast, be robust and scalable.
The system is Java-based (on Linux).
The data arrives from a system that generates log files (CSV-based) of user transactions.
That system generates a file every minute, and each file contains the transactions of different users (sorted by time); each file may contain thousands of users.
A sample data structure for a CSV file:
10:30:01,user 1,...
10:30:01,user 1,...
10:30:02,user 78,...
10:30:02,user 2,...
10:30:03,user 1,...
10:30:04,user 2,...
.
.
.
The system I am planning should process the files and perform some analysis in real-time.
It has to gather the input, send it to several algorithms and other systems, and store the computed results in a database. The database does not hold the actual input records, only high-level aggregated analysis of the transactions, for example trends.
The first algorithm I am planning to use works best with at least 10 user records; if it cannot find 10 records after 5 minutes, it should use whatever data is available.
I would like to use Storm for the implementation, but I would prefer to keep this discussion at the design level as much as possible.
A list of system components:
1. A task that monitors incoming files every minute.
2. A task that reads the file, parses it, and makes it available to other system components and algorithms.
3. A component that buffers 10 records for a user (for no longer than 5 minutes). When 10 records are gathered, or 5 minutes have passed, it is time to send the data to the algorithm for further processing.
Since the requirement is to supply at least 10 records to the algorithm, I thought of using Storm Field Grouping (which means the same task gets called for the same user) and tracking the collection of each user's 10 records inside the task. Of course, I plan to have several of these tasks, each handling a portion of the users.
4. There are other components that work on a single transaction; for them I plan on creating other tasks that receive each transaction as it gets parsed (in parallel to the other tasks).
I need your help with #3.
What is the best practice for designing such a component?
It is obvious that it needs to maintain the data for 10 records per user.
A key-value map may help. Is it better to have the map managed in the task itself or to use a distributed cache?
For example Redis, a key-value store (I have never used it before).
Thanks for your help
I have worked with Redis quite a bit, so I'll comment on your idea of using Redis.
#3 has 3 requirements:
Buffer per user
Buffer for 10 records
Should expire every 5 min
1. Buffer Per User:
Redis is just a key-value store. Although it supports a wide variety of datatypes, they are always values mapped to a STRING key. So you should decide how to identify a user uniquely if you need a per-user buffer, because Redis never raises an error when you overwrite an existing key with a new value. One solution might be to check for existence before writing.
2. Buffer for 10 records: You can obviously implement a queue in Redis, but restricting its size is left to you, e.g. using LPUSH and LTRIM, or using LLEN to check the length and decide whether to trigger your process. The key associated with this queue should be the one you decided on in part 1.
3. Buffer expires in 5 min: This is the toughest task. In Redis, every key can have an expiry, irrespective of the underlying datatype of its value. But the expiry process is silent: by default you are not notified when a key expires, so you will silently lose your buffer if you rely on this property. One workaround is to keep an index that maps a timestamp to the keys that need to be expired at that timestamp. Then, in the background, you can read the index every minute, manually delete each due key (after reading it) from Redis, and call your desired process with the buffer data. For such an index you can look at Sorted Sets, where the timestamp is the score and the set members are the keys (the unique per-user key decided in part 1, which maps to a queue) you wish to delete at that timestamp. You can use ZRANGEBYSCORE to read all set members within a specified timestamp range.
Overall:
Use a Redis List to implement the queue.
Use LLEN to make sure you are not exceeding your limit of 10.
Whenever you create a new list, add an entry to the index (Sorted Set) with the score set to the current timestamp + 5 min and the member set to the list's key.
When LLEN reaches 10, remember to read the list, then remove its key from the index (Sorted Set) and from the database (delete the key and its list). Then trigger your process with the data.
Every minute, generate the current timestamp, read the index, and for every due key, read the data, remove the key from the database, and trigger your process.
This is how I might implement it. There might be a better way to model your data in Redis.
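The steps above can be sketched roughly as follows. The question's system is Java-based; Python is used here only to keep the illustration short. Assumptions: `client` is a redis-py `redis.Redis` instance, the key names (`buffer:<user>`, `buffer:deadlines`) and function names are invented for this example, and the caller passes the current time in seconds. The commands are issued one by one and are not atomic, so a real implementation would need to consider races between the writer and the periodic flush.

```python
BUFFER_SIZE = 10
MAX_AGE_SECONDS = 5 * 60
INDEX_KEY = "buffer:deadlines"  # hypothetical name of the Sorted Set index


def buffer_key(user_id):
    return f"buffer:{user_id}"  # hypothetical per-user list key


def _take(client, key):
    """Read a buffer, then remove it and its index entry."""
    records = client.lrange(key, 0, -1)
    client.delete(key)
    client.zrem(INDEX_KEY, key)
    return records


def add_record(client, user_id, record, now):
    """Append a record; return the full buffer once it holds 10 records."""
    key = buffer_key(user_id)
    client.rpush(key, record)
    # nx=True keeps the deadline that was set when the buffer was created.
    client.zadd(INDEX_KEY, {key: now + MAX_AGE_SECONDS}, nx=True)
    if client.llen(key) >= BUFFER_SIZE:
        return _take(client, key)
    return None


def flush_expired(client, now):
    """Called periodically: return the buffers whose deadline has passed."""
    due = client.zrangebyscore(INDEX_KEY, "-inf", now)
    return {key: _take(client, key) for key in due}
```

A scheduler (for example, a task running every minute) would call `flush_expired(client, time.time())` and hand each returned buffer to the algorithm, while `add_record` hands over a buffer as soon as it reaches 10 records.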
For your requirements 1 & 2: Apache Flume or Kafka.
For your requirement #3: an Esper bolt inside Storm. To accomplish this in Redis, you would have to rewrite the Esper logic.