Max value size for Redis - redis

I've been trying to build a replay system. Basically, when a player moves, the system saves their data (movements, location, animation, etc.) into a JSON file. By the end of a recording, the JSON file may be over 50 MB. I'd like to save this data in Redis with an expiry (24-48 hours).
My questions are:
Is it bad to store values over 50 MB in Redis with an expiry?
How many values over 50 MB can Redis handle without a performance loss?
If players make 500 recordings in 48 hours, could that be a problem for Redis?
How many milliseconds does it take to fetch 50 MB of data from Redis on an average VDS/VPS?

Storing a large object (in terms of size) is not good practice. You may read about it here. One of the problems is the network: you need to send a 50 MB payload to the Redis server in a single call. Also, if you save the data as one big object, then to retrieve or update a single field or element you need to get all 50 MB back from the server, parse it, update the field, and send everything back to the server. That's a serious problem in terms of network traffic.
Instead of Redis strings, you may prefer sorted sets or lists, depending on your use case. If you are going to store events with timestamps and fetch the range of events between two timestamps, then sorted sets may be an ideal solution for you. They are good for pagination, etc. One important drawback is that adding a new element has O(log(N)) complexity.
Lists may also be a good fit for your case. You may use LPUSH/RPUSH to add new events to your list, and since Redis lists are implemented as linked lists, adding an element at the beginning or at the end of the list costs the same, O(1), which is great.
Whenever an event happens, you call either ZADD or RPUSH/LPUSH to send the event to Redis. If you need to query the events, you may use commands such as ZRANGEBYSCORE or LRANGE, depending on your choice.
While designing your keys you may use an identifier such as the user ID, just like you mentioned in the comments. With lists or sorted sets you will not have the problems you would have with strings. Which one is most suitable for you depends on your read/write patterns and business rules.
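A minimal sketch of the sorted-set variant in Python, assuming a redis-py style client object is passed in as `client`. The key layout `replay:<player>:<record>`, the 48-hour TTL constant, and the function names are illustrative assumptions, not something from the question.

```python
import json

REPLAY_TTL_SECONDS = 48 * 60 * 60  # assumed retention window (48 hours)


def replay_key(player_id, record_id):
    """Build a per-recording key; this naming scheme is an assumption."""
    return f"replay:{player_id}:{record_id}"


def append_event(client, key, timestamp_ms, event):
    """Store one event in a sorted set, scored by its timestamp (ZADD)."""
    # The timestamp is part of the member so that two identical events
    # recorded at different times stay distinct (set members are unique).
    member = json.dumps({"t": timestamp_ms, **event}, sort_keys=True)
    client.zadd(key, {member: timestamp_ms})
    # Refresh the expiry; the recording vanishes 48 hours after its last event.
    client.expire(key, REPLAY_TTL_SECONDS)


def events_between(client, key, start_ms, end_ms):
    """Fetch only the events in [start_ms, end_ms] (ZRANGEBYSCORE)."""
    raw = client.zrangebyscore(key, start_ms, end_ms)
    return [json.loads(item) for item in raw]
```

With a real server, `client` would be something like `redis.Redis()`. A list-based variant would swap `zadd`/`zrangebyscore` for `rpush`/`lrange` and address events by position instead of by timestamp.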
Here are some useful links to read:
Redis data types intro
Redis data types
Redis labs documentation about data types

Related

O(1) complexity Python structure for storing data in memory or on disk depending on the number of items

At this moment I'm working on this library:
https://pypi.org/project/daffi/
It is supposed to be a kind of multiprocess RPC communication framework with the ability to execute sync or async tasks on remotes.
The communication process includes a server which temporarily stores some message metadata about the receiver/transmitter processes and keeps it until the message is returned to the process that sent it.
Generally speaking, it is not a problem to keep 1k or 10k metadata items in memory, as they are small and communication is typically fast, so I haven't even seen a state where the server keeps that many items.
But I'd like to be protected against such cases and reduce memory consumption on the server side when they happen.
So my question is: would you suggest any library or algorithm for storing metadata in a dict-like object with the ability to move items to disk under certain conditions?
The criteria are the following:
Let's say the number of items in the dict is less than 1k. In this case it acts like a regular dict and stores items in RAM.
If the number of items becomes greater than 1k, it starts serializing them and storing them on disk, with the ability to fetch them by key and deserialize them.
If, after a spike, the number of items returns to normal (< 1k), it goes back to normal dict-like behavior.
Speed is very important, so I'd like to keep O(1) complexity if possible.
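The criteria above can be sketched with the standard library alone. This is only one possible shape, not a recommendation of a specific library: `SpillDict` is a hypothetical name, keys are assumed to be strings (a `shelve` requirement), and lookups on the disk side are hash-based but still pay for disk I/O, so "O(1)" holds only in the average-case sense.

```python
import os
import shelve
import tempfile
from collections.abc import MutableMapping


class SpillDict(MutableMapping):
    """Dict-like store: up to `limit` items live in RAM, overflow goes to disk."""

    def __init__(self, limit=1000):
        self.limit = limit
        self._ram = {}
        self._dir = tempfile.mkdtemp()  # overflow file lives in a temp directory
        self._disk = shelve.open(os.path.join(self._dir, "overflow"))
        self._on_disk = 0               # tracked by hand to avoid counting the file

    def __setitem__(self, key, value):
        if key in self._ram:
            self._ram[key] = value
        elif self._on_disk and key in self._disk:
            self._disk[key] = value     # shelve pickles the value for us
        elif len(self._ram) < self.limit:
            self._ram[key] = value
        else:
            self._disk[key] = value
            self._on_disk += 1

    def __getitem__(self, key):
        if key in self._ram:
            return self._ram[key]
        if self._on_disk:
            return self._disk[key]      # raises KeyError if the key is absent
        raise KeyError(key)

    def __delitem__(self, key):
        if key in self._ram:
            del self._ram[key]
        elif self._on_disk and key in self._disk:
            del self._disk[key]
            self._on_disk -= 1
        else:
            raise KeyError(key)

    def __iter__(self):
        yield from self._ram
        yield from self._disk

    def __len__(self):
        return len(self._ram) + self._on_disk

    def close(self):
        self._disk.close()
```

While nothing has spilled, every operation touches only the in-memory dict. After a spike, new items go back to RAM as soon as there is room; items already on disk stay there until they are popped, which fits metadata that is removed once the reply has been delivered.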

What is a recommended scalable DB platform to use in AWS for large amounts of volatile data sets - elasticsearch, Redis or DynamoDB?

Users of our platform will have large amounts of stored data on our system. Through an application, once connected, that data will be transferred to them and no longer need to remain on our servers. There could potentially be hundreds or thousands of users connected at any given time, performing their downloads.
Here's the proposed architecture:
User management, configuration, and data download statistics will be maintained in a SQL Server database, while using either Redis or DynamoDB for the large data sets.
The reason for choosing either Redis or DynamoDB is based on cost - cheaper than running another SQL Server instance, and performance. The data format will be similar to a datamart - flat table with no joins.
Initially the queries would be simple - get all data for user X between a date range, and optionally delete.
Since we may want to add free-text searching for certain fields of that data, using Elasticsearch may be a better option from the get-go.
I want this to be auto-scaling, but I'm not sure which database would be best for this scenario.
Here's some great discussion on Database + Search tier from AWS ReInvent:
https://youtu.be/K7o5OlRLtvU?t=1574
I would not use Elasticsearch alone because it does not provide auto-scaling for write capacity. In fact, it's not trivial to increase the number of shards of an index. Secondly, it can only handle the JSON format, which could be an issue for you.
Redis could be a good idea because it is really fast, everything is done in RAM, and it provides keys with a limited time-to-live, which could be interesting for you. Unfortunately, if your data size exceeds the RAM capacity of your Amazon instance, you will have to shard your Redis database. Redis does not support that itself, so you will have to handle it in your application code. Moreover, as far as I know, Redis does not handle complex queries. You will also need to fit your data into a Redis data structure, which could be an issue for you.
DynamoDB handles auto-scaling really well but on the other hand it is a key/value database so it does not allow you to make queries like "get all data for user X between a date range". DynamoDB also allows you to save your data in any format.
The solution would be to use either DynamoDB or Redis, depending on the size of your data, and to use Elasticsearch to index your keys with only the metadata (user and dates). That way your index will be small, and if you lose the ability to index because Elasticsearch gets too busy, you keep the ability to save users' data.

Dump the whole redis database instance using hiredis

I have a buffer that needs to read all values (hashes, fields and values) from Redis after a reboot. Is there a fast way to do that? I have approximately 100,000 hashes with 4 fields each.
Thanks!
EDIT:
The Slow way: Current Implementation is getting all the hashes using
Keys *
then
HGETALL xxx
to get all the fields' values.
There are two ways to approach this problem.
The first one is to try to optimize the KEYS/HGETALL combination you have described. Because you do not have millions of keys (100K is not that many by Redis standards), the KEYS command will not block the instance for a long time, and the output buffer size required to return 100K items is probably acceptable. Once the list of keys has been received by your program, the next challenge is to run many HGETALL commands as fast as possible. The key is to pipeline them (for instance in synchronous batches of 1000 items), which is quite easy to implement with hiredis (just use redisAppendCommand / redisGetReply). The 100K items will then be retrieved in only 100 roundtrips. Because most Redis instances can sustain 100K op/s or more, it should not take more than a few seconds. A more efficient solution would be to use the asynchronous interface of hiredis to try to maximize the throughput, but it is more complex to implement. I'm not sure it is worth it for 100K items.
The second approach is to use a BGSAVE command to take a snapshot of Redis content, retrieve the generated dump file, and then parse the file to extract the data. You can have a look at the excellent redis-rdb-tools package for a Python implementation. The main benefit of this approach is there is no impact on the Redis instance (no KEYS command to block the event loop) while still retrieving consistent data.
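The first approach above (KEYS followed by HGETALL pipelined in batches) can be sketched in Python. This uses a redis-py style client rather than hiredis, so it illustrates the batching idea and not the C API; `load_all_hashes` is a hypothetical helper name, and the sketch assumes every matching key holds a hash.

```python
def load_all_hashes(client, pattern="*", batch_size=1000):
    """Return {key: {field: value}} for every hash matching `pattern`."""
    keys = client.keys(pattern)  # one KEYS call, acceptable for ~100K keys
    result = {}
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        # transaction=False gives plain pipelining without MULTI/EXEC.
        pipe = client.pipeline(transaction=False)
        for key in batch:
            pipe.hgetall(key)    # queued locally, nothing is sent yet
        # execute() sends the whole batch and returns the replies
        # in the same order as the queued commands.
        for key, fields in zip(batch, pipe.execute()):
            result[key] = fields
    return result
```

With 100,000 keys and the default batch size this issues 100 pipelined batches, matching the roundtrip count in the answer.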

Redis set vs hash

In many Redis tutorials (such as this one), data is stored in a set, but with multiple values combined together in a string (i.e. a user account might be stored in the set as two entries, "user:1000:username" and "user:1000:password").
However, Redis also has hashes. It seems that it would make more sense to have a "user:1000" hash, which contains a "username" entry and a "password" entry. Rather than concatenating strings to access a particular value, you just access them directly in the hash.
So why isn't it used as much? Are these just old tutorials? Or do Redis hashes have performance issues?
Redis hashes are good for storing more complex data, like you suggest in your question. I use them for exactly that - to store objects with multiple attributes that need to be cached (specifically, inventory data for a particular product on an e-commerce site). Sure, I could use a concatenated string - but that adds unneeded complexity to my client code, and updating an individual field is not possible.
You may be right - the tutorials may simply be from before Hashes were introduced. They were clearly designed for storing Object representations: http://oldblog.antirez.com/post/redis-weekly-update-1.html
I suppose one concern would be the number of commands Redis must service when a new item is inserted (n number of commands, where n is the number of fields in the Hash) when compared to a simple String SET command. I haven't found this to be a problem yet on a service which hits Redis about 1 million times per day. Using the right data structure to me is more important than a negligible performance impact.
(Also, please see my comment regarding Redis Sets vs. Redis Strings - I think your question is referring to Strings but correct me if I'm wrong!)
Hashes are one of the most efficient ways to store data in Redis; the documentation even goes so far as to recommend using them whenever possible.
http://redis.io/topics/memory-optimization
Use hashes when possible
Small hashes are encoded in a very small space, so you should try representing your data using hashes every time it is possible. For instance if you have objects representing users in a web application, instead of using different keys for name, surname, email, password, use a single hash with all the required fields.
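To make the two layouts concrete, here is a small Python sketch, assuming a redis-py style client passed in as `client`. The function names are hypothetical; the key names follow the `user:1000` convention from the question.

```python
def save_user_as_strings(client, user_id, fields):
    """One string key per attribute, e.g. user:1000:username."""
    for name, value in fields.items():
        client.set(f"user:{user_id}:{name}", value)


def save_user_as_hash(client, user_id, fields):
    """One hash per user, e.g. user:1000, written with a single HSET."""
    client.hset(f"user:{user_id}", mapping=fields)


def get_user_field(client, user_id, name):
    """Read a single attribute directly from the hash (HGET)."""
    return client.hget(f"user:{user_id}", name)
```

The string layout costs one command and one top-level key per attribute, while the hash layout keeps the object under one key. On Redis 4.0 and later, HSET accepts several field/value pairs at once, so writing the whole object is a single command.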
Use case comparison:
Sets provide with a semantic interface to store data as a set in Redis server. The use cases for this kind of data would be more for an analytics purpose, for example how many people browse the product page and how many end up purchasing the product.
Hashes provide a semantic interface to store simple and complex data objects in the Redis server. For example, user profile, product catalog, and so on.
Ref: Learning Redis
Use cases for SETS
Uniqueness:
Our application has to make sure every username can be used by only one person. When someone signs up with a username, we first look it up in the set of usernames:
SISMEMBER setOfUsernames newUsername
Creating relationships between different records:
Imagine you have Like functionality in your app. You might have a separate set for every single user and store the IDs of the images that user has liked so far.
Find common attributes that people like
In dating apps, users usually pick different attributes, and those attributes are stored in sets. And to help people match easily, our app might check the intersection of those common attributes
SINTER user#45:likesSet user#34:likesSet
When we have lists of items and order does not matter
For example, if you want to restrict the addresses that are allowed to reach your app, or block certain email addresses from sending you mail, you can store them in a set.
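The set use cases above map onto a few client calls. A minimal Python sketch, assuming a redis-py style client passed in as `client`; the key names are taken from the commands above, while the function names are hypothetical.

```python
def username_taken(client, username):
    """Uniqueness check: is the name already in the set? (SISMEMBER)"""
    return bool(client.sismember("setOfUsernames", username))


def register_username(client, username):
    """Add the name to the set of known usernames (SADD)."""
    client.sadd("setOfUsernames", username)


def record_like(client, user_id, item_id):
    """Relationship: remember that this user liked this item (SADD)."""
    client.sadd(f"user#{user_id}:likesSet", item_id)


def common_likes(client, user_a, user_b):
    """Common attributes: items liked by both users (SINTER)."""
    return client.sinter(f"user#{user_a}:likesSet", f"user#{user_b}:likesSet")
```

Adding the same member twice has no effect, which is what makes sets a natural fit for uniqueness and membership checks.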
Use cases for Hash
Redis Hashes are usually used to store complex data objects: sessions, users etc. Hashes are more memory-optimized.

Data structure for efficient access of random slices of data from an API call

We are writing a library for an API which pulls down an ordered stream of data. Through this API you can request data by slices. For instance, if I want items 15-25 I can make an API call for that.
The library we are writing will allow the client to call for any slice of data as well, but we want the library to be as efficient with these api calls as possible. So if I've already asked for items 21-30, I don't want to ever request those individual data items again. If someone asks the library for 15-25 we want to call the api for 15-20. We will need to search for what data we already have and avoid requesting that data again.
What is the most efficient data structure for storing the results of these api calls? The data sets will not be huge so search time in local memory isn't that big of a deal. We are looking for simplicity and cleanliness of code. There are several obvious answers to this problem but I'm curious if any data structure nerds out there have an elegant solution that isn't coming to mind.
For reference we are coding in Python but are really just looking for a data structure that solves this problem elegantly.
I'd use a balanced binary tree (e.g. http://pypi.python.org/pypi/bintrees/0.4.0) to map begin -> (end, data). When a new request comes in for [b, e) range, do a search for b (followed by move to previous record if b != key), another search for e (also step back), scan all entries between the resulting keys, pull down missing ranges, and merge all from-cache intervals and the new data into one interval. For N intervals in the cache, you'll get amortized O(log-N) cost of each cache update.
You can also simply keep a list of (begin, end, data) tuples, ordered by begin, and use bisect_right to search. Cost: O(N=number of cached intervals) for every update in the worst case, but if the clients tend to request data in increasing order, the cache update will be O(1).
Cache search itself is O(log-N) in either case.
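The sorted-list variant can be sketched as follows. It is a minimal illustration under stated assumptions: ranges are half-open `[begin, end)`, `fetch(begin, end)` is a hypothetical stand-in for the real API call and returns exactly `end - begin` items, and `SliceCache` is an illustrative name.

```python
from bisect import bisect_left, bisect_right


class SliceCache:
    """Caches items fetched by half-open index ranges [begin, end)."""

    def __init__(self, fetch):
        self._fetch = fetch  # assumed callable: fetch(begin, end) -> list
        self._starts = []    # sorted starts of disjoint cached intervals
        self._ends = []      # matching ends (sorted too, as intervals are disjoint)
        self._items = []     # matching lists of cached items

    def get(self, begin, end):
        if begin >= end:
            return []
        # Cached intervals lo..hi-1 overlap or touch the requested range.
        lo = bisect_left(self._ends, begin)
        hi = bisect_right(self._starts, end)
        merged_begin = cursor = begin
        if lo < hi and self._starts[lo] < begin:
            merged_begin = cursor = self._starts[lo]
        merged = []
        for k in range(lo, hi):
            if cursor < self._starts[k]:
                # Gap before the next cached interval: fetch only the gap.
                merged += self._fetch(cursor, self._starts[k])
            merged += self._items[k]
            cursor = self._ends[k]
        if cursor < end:
            merged += self._fetch(cursor, end)
            cursor = end
        # Replace the touched intervals with one merged interval.
        self._starts[lo:hi] = [merged_begin]
        self._ends[lo:hi] = [cursor]
        self._items[lo:hi] = [merged]
        offset = begin - merged_begin
        return merged[offset:offset + (end - begin)]
```

For the example in the question, after `get(21, 31)` a call to `get(15, 26)` only triggers `fetch(15, 21)`. The search is O(log N) via bisect; the slice replacement is O(N) in the worst case, as noted above.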
The canonical data structure often used to solve this problem is an interval tree. (See the Wikipedia article on interval trees.) Your problem can be thought of as needing to know which things you've already sent (which intervals) overlap with what you're trying to send. You then cut out the intervals that intersect with what you're trying to send (which takes linear time in the number of overlapping intervals you find), and you're there. The "augmented" tree halfway down the Wikipedia article looks simpler to implement, though, so I'd stick with that. It should have O(log N) time complexity, amortized or not.