Redis keeping a hash of individual items that expire

I'm looking for a way to maintain a hash map of values; however, I want the values (not the whole map) to expire individually after a period of time.
The reason I'd like to accomplish this is that it drastically simplifies and minimizes my dependency on a database and on time checking. Basically, I'd like to define a list of resources to poll for 10 minutes after they are first requested.
So say my list is: ItemA, ItemB, ItemC.
Once ItemA is older than 10 minutes, it should be knocked off the list, so that when I request the list again it contains only ItemB and ItemC.
I'm using Node with the standard redis library, so I'm looking for a way to do this easily with that package. If not, I'll fall back to the db method, but getting this working with Redis would be really great.
I'm already successfully using Redis expiry for session tokens, and from what I've read you can't set expirations directly on hash map values. I'm just curious what some workarounds could look like.
Thanks.
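As an illustration of what such a workaround could look like (a rough sketch, shown with redis-py for brevity; the same commands exist in the Node redis client, and the key names and TTL here are made up): keep the hash as-is, pair it with a sorted set whose scores are per-field expiry times, and prune on read.

```python
import time
import redis

r = redis.Redis()

HASH_KEY = 'resources'            # hypothetical key names, for illustration
EXPIRY_ZSET = 'resources:expiry'
TTL_SECONDS = 10 * 60

def add_item(field, value):
    # Store the value in the hash and remember when this field should expire.
    r.hset(HASH_KEY, field, value)
    r.zadd(EXPIRY_ZSET, {field: time.time() + TTL_SECONDS})

def get_items():
    # Prune any fields whose expiry time has passed, then return what's left.
    now = time.time()
    expired = r.zrangebyscore(EXPIRY_ZSET, 0, now)
    if expired:
        r.hdel(HASH_KEY, *expired)
        r.zremrangebyscore(EXPIRY_ZSET, 0, now)
    return r.hgetall(HASH_KEY)
```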

Related

Making sure (distributed) caches only store the latest value in a distributed system

Let's say I want to use a built-in solution such as Redis or Memcached to cache database rows (as an example), to avoid recurrent costly trips to the database.
For the sake of argument, let's assume I have a TABLE(id, x, y) and that I want to cache all rows so I never have to read directly from the database.
Questions:
1. Consider the following case: NodeA tries to update a given row's field x while NodeB tries to update y, and then both simultaneously try to update the cached row. If each node "manually" updates only the field it just changed in the cached row, then under typical last-write-wins one of the fields is going to be discarded, which is catastrophic. This makes me think I need to always fill the cache's rows with a full row read from the database.
2. But this by itself won't necessarily help me. If NodeA writes x and reads the entire row into memory, and then NodeB writes y and reads the entire row into memory, and NodeB writes to the cache before NodeA, then NodeB's changes will be overwritten! This makes me believe I need to somehow version the rows both in the DB and in the cache. Is this the case? Memcached seems to have a compare-and-set primitive, but I see no such thing in Redis.
3. Even if 1 and 2 are not an issue, I still need to guarantee that my reads and writes have read-after-write consistency; otherwise what I'm reading and intending to put in the cache may not be the most up-to-date version. If that's the case, how can I make sure of this? By requiring w + r > n?
This seems to be a very common use case, so I'd guess it's pretty much a solved problem. What can I try in order to resolve this?
Key-value stores such as Redis support advanced data structures, such as hashes.
If you're doing partial updates to cached entities (only a subset of the fields is updated at a time), and given that your goal is to avoid time-consuming database reads, simply save the table row as a hash of key/value pairs (using HSET) and then use HGETALL for reading.
Redis operations are atomic by nature, so that should solve your problems, if I understood them correctly.
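For example, a minimal redis-py sketch of the idea (the key and field names are made up, for illustration):

```python
import redis

r = redis.Redis()

# Initial fill of the cached row (hypothetical key/fields).
r.hset('row:42', mapping={'x': 1, 'y': 2})

# NodeA updates only x, NodeB updates only y; neither clobbers the other field.
r.hset('row:42', 'x', 10)
r.hset('row:42', 'y', 20)

# Reading the full row back returns both updates.
print(r.hgetall('row:42'))   # {b'x': b'10', b'y': b'20'}
```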
On a side note, if you're caching an entire entity yet doing partial updates, you should consider a simpler caching approach, such as read-through (making cache validity a reader-only concern).
As opposed to database accesses, Redis cache accesses from different locations will, unless somehow serialized, always have the potential of arriving out of order in a distributed system, since the execution environment (network, threading) can always introduce delays.
Doing read-through caching will ensure data is always updated after the most recent write without the need to synchronize anything else.
This is how Facebook solved the issue with Memcached: http://nil.csail.mit.edu/6.824/2020/papers/memcache-faq.txt.
The idea is to use the concept of a lease: when a request for a cached value is received and there is no data for that key, a lease token (a 64-bit id) is returned.
When the web server fetches the data from the database, it can then store the data in the cache along with that token. Every time an invalidation request is issued for a key, a new lease token is created, so if a put is attempted with an old token, the put is rejected.
As far as I understand, it's not really possible to (easily) replicate this behavior with Redis without resorting to Lua scripts.
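To make the lease idea concrete, here is a rough sketch of what such a script could look like, driven from redis-py. The key layout ('lease:*' / 'cache:*') and helper names are made up; this only illustrates the concept, not Facebook's actual implementation.

```python
import uuid
import redis

r = redis.Redis()

# Lua: store the value only if the caller's lease token is still the current one.
put_if_lease_valid = r.register_script("""
if redis.call('GET', KEYS[1]) == ARGV[1] then
    redis.call('SET', KEYS[2], ARGV[2])
    return 1
end
return 0
""")

def invalidate(key):
    # Rotating the lease token rejects any in-flight put made under the old token.
    r.set('lease:' + key, str(uuid.uuid4()))
    r.delete('cache:' + key)

def get_or_lease(key):
    value = r.get('cache:' + key)
    if value is not None:
        return value, None
    # Cache miss: create a lease token for this key if none exists yet,
    # then hand back the current one so the caller may fill the cache.
    r.setnx('lease:' + key, str(uuid.uuid4()))
    return None, r.get('lease:' + key)

def put_with_lease(key, token, value):
    # Returns 1 if the token was still valid and the value was stored, 0 otherwise.
    return put_if_lease_valid(keys=['lease:' + key, 'cache:' + key],
                              args=[token, value])
```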

Max value size for Redis

I've been trying to make a replay system. Basically, when a player moves, the system saves his data (movements, location, animation, etc.) into a JSON file. At the end of the recording, the JSON file may be over 50 MB. I'd want to save this data into Redis with an expiry (24-48 hours).
My questions are:
Is it bad to save over 50 MB into Redis with an expiry?
How many values over 50 MB can Redis handle without performance loss?
If players make 500 recordings in 48 hours, could that be bad for Redis?
How many milliseconds does it take to fetch 50 MB of data from Redis on an average VDS/VPS?
Storing a large object (in terms of size) is not good practice. You may read about it here. One of the problems is the network: you need to send a 50 MB payload to the Redis server in a single call. Also, if you save it as one big object, then to retrieve or update it (a single field, element, etc.) you need to get 50 MB back from the server, parse it to find the single field, update it and send it back to the server. That's a serious problem in terms of network.
Instead of Redis strings, you may prefer sorted sets or lists, depending on your use case. If you are going to store events with timestamps and fetch the range of events between two timestamps, then sorted sets may be an ideal solution for you. They're also good for pagination, etc. One crucial drawback is that the complexity of adding a new element is O(log(N)).
Lists may also provide a good playground for your case. You may use LPUSH/RPUSH to add new events to your list, and since Redis lists are implemented with linked lists, adding a message to the beginning or to the end of the list costs the same, O(1), which is great.
Whenever an event happens, you call either ZADD or RPUSH/LPUSH to send the event to Redis. If you need to query the events, you may use functions such as ZRANGEBYSCORE or LRANGE, depending on your choice.
While designing your keys you may use an identifier such as the user id, just as you mentioned in the comments. You will not have the problems with lists/sorted sets that you would have with strings. But choosing which one is most suitable for you depends on your read/write patterns and business rules.
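For instance, a rough redis-py sketch of the sorted-set approach (the key layout and the 48-hour TTL are just placeholders for illustration):

```python
import json
import time
import redis

r = redis.Redis()

def record_event(player_id, event):
    # One sorted set per recording, scored by timestamp; key layout is made up.
    # Note: identical payloads collapse into one member, so include a sequence
    # number in the event if exact duplicates can occur.
    key = 'replay:%s' % player_id
    r.zadd(key, {json.dumps(event): time.time()})
    r.expire(key, 48 * 3600)   # the whole recording disappears after 48 hours

def events_between(player_id, start_ts, end_ts):
    key = 'replay:%s' % player_id
    return [json.loads(e) for e in r.zrangebyscore(key, start_ts, end_ts)]
```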
Here are some useful links to read:
Redis data types intro
Redis data types
Redis labs documentation about data types

Redis: Expire Values in a List or Set

Allow me to preface this by saying I'm fairly new to Redis. I have used Redis in the context of Resque.
I have a service that dispatches jobs to multiple other services. Those jobs either succeed or fail. Regardless of the outcome, I'd like to send the results of a given job to a client that can then store the jobs in some sort of logical way, for example: JobType1Success, JobTypeOneFailure, etc.
I know I can create lists or sets with Redis and easily add some string representation of the data as values to the lists. Additionally, I know that with traditional key/string values in Redis you can set an expiration in seconds. In my ideal world I would create several lists such as the ones mentioned above. Each job would then be prepended to its appropriate list, and after 7 days in the list a value would expire. Right now it seems fairly trivial to add the string values to a given list, but I am unable to find anything on whether I can expire values of a certain age from a list, and if so, how to do that. I am working with a Node stack and using the Node Redis library. Any help here would be enormously appreciated.
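One shape this could take (a sketch only, written with redis-py for brevity since the same commands exist in the Node Redis library; the key names are made up): instead of plain lists, use one sorted set per job type, scored by timestamp, and trim entries older than 7 days whenever you write or read.

```python
import time
import redis

r = redis.Redis()
WEEK_SECONDS = 7 * 24 * 3600

def record_result(job_type, payload):
    # e.g. job_type = 'JobType1Success'; the member is the job's string payload.
    key = 'results:%s' % job_type
    now = time.time()
    r.zadd(key, {payload: now})
    r.zremrangebyscore(key, 0, now - WEEK_SECONDS)   # drop anything older than 7 days

def recent_results(job_type):
    key = 'results:%s' % job_type
    r.zremrangebyscore(key, 0, time.time() - WEEK_SECONDS)
    return r.zrange(key, 0, -1)
```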

Data structure for efficient access of random slices of data from an API call

We are writing a library for an API which pulls down an ordered stream of data. Through this API you can request slices of the data. For instance, if I want items 15-25 I can make an API call for that.
The library we are writing will allow the client to call for any slice of data as well, but we want it to be as efficient with these API calls as possible. So if I've already asked for items 21-30, I don't ever want to request those individual data items again. If someone asks the library for 15-25, we want to call the API only for 15-20. We need to determine which data we already have and avoid requesting it again.
What is the most efficient data structure for storing the results of these API calls? The data sets will not be huge, so search time in local memory isn't that big of a deal. We are looking for simplicity and cleanliness of code. There are several obvious answers to this problem, but I'm curious whether any data structure nerds out there have an elegant solution that isn't coming to mind.
For reference we are coding in Python but are really just looking for a data structure that solves this problem elegantly.
I'd use a balanced binary tree (e.g. http://pypi.python.org/pypi/bintrees/0.4.0) to map begin -> (end, data). When a new request comes in for a [b, e) range, search for b (stepping back to the previous record if b != key), do another search for e (also stepping back), scan all entries between the resulting keys, pull down the missing ranges, and merge all the from-cache intervals and the new data into one interval. For N intervals in the cache, you'll get an amortized O(log N) cost per cache update.
You can also simply keep a list of (begin, end, data) tuples, ordered by begin, and use bisect_right to search. Cost: O(N), where N is the number of cached intervals, per update in the worst case; but if the clients tend to request data in increasing order, each cache update will be O(1).
Cache search itself is O(log N) in either case.
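A sketch of the simpler list-plus-bisect variant (the class name is made up, and the fetch callable stands in for the real API; it assumes the API returns exactly one item per index):

```python
import bisect

class SliceCache:
    """Caches half-open [begin, end) slices as disjoint (begin, end, data) tuples."""

    def __init__(self, fetch):
        self.fetch = fetch        # fetch(begin, end) -> list of items; stand-in for the API
        self.intervals = []       # sorted by begin, non-overlapping

    def get(self, begin, end):
        # Locate cached intervals that overlap the requested range.
        starts = [iv[0] for iv in self.intervals]
        i = bisect.bisect_right(starts, begin)
        if i > 0 and self.intervals[i - 1][1] > begin:
            i -= 1                # the previous interval overlaps the request

        overlapping = []
        while i < len(self.intervals) and self.intervals[i][0] < end:
            overlapping.append(self.intervals.pop(i))

        # Pull down only the gaps, stitching cached data and new data together.
        lo = min([begin] + [b for b, _, _ in overlapping])
        hi = max([end] + [e for _, e, _ in overlapping])
        merged, cursor = [], lo
        for b, e, data in overlapping:
            if cursor < b:
                merged.extend(self.fetch(cursor, b))
            merged.extend(data)
            cursor = e
        if cursor < hi:
            merged.extend(self.fetch(cursor, hi))

        # Store the merged interval and return just the requested slice.
        bisect.insort(self.intervals, (lo, hi, merged))
        return merged[begin - lo:end - lo]

# Dummy fetch for illustration: one integer item per index.
cache = SliceCache(fetch=lambda b, e: list(range(b, e)))
cache.get(21, 30)
cache.get(15, 25)   # only 15-20 is fetched the second time
```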
The canonical data structure often used to solve this problem is an interval tree. (See this Wikipedia article.) Your problem can be thought of as needing to know which things you've already fetched (which intervals) overlap with what you're trying to fetch; then cut out the intervals that intersect with the request (which is linear in the number of overlapping intervals you find) and you're there. The "augmented" tree halfway down the Wikipedia article looks simpler to implement, though, so I'd stick with that. It should be O(log N) time complexity, amortized or not.

Redis and Object Versioning

How are people coping with changes to redis object schemas - adding or removing properties from objects?
Sharing from my own experience (a one-year-old project with thousands of user requests per second).
Usually, there were three scenarios for me:
1. Add new information to existing structures (like an "email" field on a user)
2. Remove or change existing values in existing structures (like changing the format of some field)
3. Drop stuff from the database
For 1 I follow a simple strategy: degrade gracefully, e.g. if a user doesn't have an email record, treat it as an empty email. This has worked every time.
For 2 and 3 it depends on whether the data can be changed/calculated/fixed before the release or only after it. I run a job on the database that does all the work for me; for a few million keys it takes considerable time (minutes). If that job can only be run after I release the new code, then degrading gracefully helps a lot: I simply release and then run the job.
PS: If you affect a lot of keys in Redis then it is very important to use pipelining (http://redis.io/topics/pipelining). It saves a lot of time.
Fetch all the affected keys or records (i.e. the ones you want to fix in any way) through a pipeline
Do whatever you want with them. If possible, try to queue the write operations into a pipeline too
Send the queued operations to Redis.
It is also very important to build indexes for your structures. I keep sets with ids and then simply iterate over SMEMBERS(set_with_ids).
That is much, much better than iterating with the KEYS command.
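Put together, that workflow could look roughly like this in redis-py (the 'users' index set and the 'user:{id}' hash layout are made-up names, and the migration shown is just the "add an email field" example from above):

```python
import redis

r = redis.Redis()

# Iterate the index set instead of using the KEYS command.
user_ids = r.smembers('users')

# Queue all the fixes into one pipeline instead of issuing them one by one.
pipe = r.pipeline(transaction=False)
for uid in user_ids:
    # Example migration: add an empty "email" field wherever it is missing.
    pipe.hsetnx('user:%s' % uid.decode(), 'email', '')
pipe.execute()
```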
For extremely simple versioning, you could use different database numbers. This can be quite limiting in cases where almost everything is the same between two versions, but it's also a very clean way to do it if it works for you.
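For instance (a trivial sketch; the mapping of schema versions to database numbers is made up):

```python
import redis

# Schema v1 lives in database 0, schema v2 in database 1;
# each deployed version of the code talks to its own database number.
r_v1 = redis.Redis(db=0)
r_v2 = redis.Redis(db=1)
```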