Can I create multiple full sets with 2^32 data in Redis? - redis

Actually, the question is about the capacity of a single Redis instance, regardless of the memory size.
The reference said:
Redis can handle up to 2^32 keys, and was tested in practice to handle
at least 250 million keys per instance. Every hash, list, set, and
sorted set, can hold 2^32 elements. In other words your limit is likely
the available memory in your system.
So regardless of the server's memory size, can I create 4 "sets" and fill each of them with almost 2^32 keys in a single instance of Redis? That would mean 4*(2^32) keys in total.

Sets do not contain keys, they contain strings.
Redis Sets are an unordered collection of Strings.
Of course, your string could happen to share the same characters as one of your keys, but there's nothing special about that. So, yes, you could have four sets containing up to 4 * (2^32) strings, but the total number of keys would still be limited to 2^32.
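To make that concrete, here is a minimal sketch (assuming a local Redis server and the redis-py client; the key and member names are made up). Four sets occupy only four slots in the keyspace, no matter how many members each holds:

# Sketch only: assumes a local Redis server and the redis-py client.
import redis

r = redis.Redis()

# Four sets -> four keys in the keyspace, regardless of how many members
# each set holds (each set can grow to 2^32 members).
for i in range(4):
    r.sadd(f"myset:{i}", *(f"member:{n}" for n in range(1000)))

print(r.dbsize())          # 4 keys (assuming an otherwise empty database)
print(r.scard("myset:0"))  # 1000 members in the first set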

Related

Using HSET or SETBIT for storing 6 billion SHA256 hashes in Redis

Problem set: I am looking to store 6 billion SHA256 hashes. I want to check if a hash exists and, if so, perform an action. When it comes to storing the SHA256 hash (a 64-character hex string) just to check whether the key exists, I've come across two pairs of commands to use:
HSET/HEXISTS and SETBIT/GETBIT
I want to make sure I take the least amount of memory, but also want to make sure lookups are quick.
The Use case will be "check if sha256 hash exist"
The problem,
I want to understand how to store this data, as currently I see a 200% size increase going from text to Redis. I also want to understand what the best sharding options would be using the hash-max-ziplist-entries and hash-max-ziplist-value settings, and how to split the hash effectively so the ziplist encoding is maximised.
I've tried setting the ziplist entries to 16^4 (65536) and the value to 60, based on splitting the hash 4:60.
Any help understanding the options and techniques to keep the footprint as small as possible while still running quick lookups would be appreciated.
Thanks
A bit late to the party but you can just use plain Redis keys for this:
# Store a given SHA256 hash
> SET 9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08 ""
OK
# Check whether a specific hash exists
> EXISTS 2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae
0
Where both SET and EXISTS have a time complexity of O(1) for single keys.
As Redis can handle a maximum of 2^32 keys per instance (and 6 billion exceeds that), you should split your dataset across two or more Redis servers / clusters, also depending on the number of nodes and the total combined memory available to them.
I would also suggest using the binary form of your hashes (32 bytes) instead of their hexadecimal representation (64 characters), as that saves roughly 50% of the memory used for the keys in Redis.
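As a rough sketch of both suggestions combined (splitting across instances and storing binary keys), assuming the redis-py client; the host names, shard count, and routing-by-first-byte scheme are illustrative assumptions, not part of the answer above:

# Sketch only: assumes the redis-py client; host names are made up.
import redis

# Two shards, because 6 billion keys exceed the 2^32 keys-per-instance limit.
shards = [redis.Redis(host="redis-0"), redis.Redis(host="redis-1")]

def shard_for(digest: bytes) -> redis.Redis:
    # Route by the first byte of the binary digest.
    return shards[digest[0] % len(shards)]

def store(hex_hash: str) -> None:
    digest = bytes.fromhex(hex_hash)   # 32 raw bytes instead of 64 hex characters
    shard_for(digest).set(digest, "")

def exists(hex_hash: str) -> bool:
    digest = bytes.fromhex(hex_hash)
    return bool(shard_for(digest).exists(digest))

With SET and EXISTS both O(1), lookups stay fast regardless of how many hashes each shard holds.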

Storing 13 million floats and integers in Redis

I have a file with 13 million floats, each of which has an associated integer index. The original size of the file is 80MB.
We want to pass multiple indexes to get the float data. The only reason I needed a hashmap (field and value) is that a List does not support getting multiple indexes in a single call.
I stored them as a hash in Redis, with the index as the field and the float as the value. On checking memory usage, it was about 970MB.
Storing the 13 million values as a list uses 280MB.
Is there any optimization I can use?
Thanks in advance
Running on AWS ElastiCache.
You can get a really good optimization by creating buckets of index-to-float values.
Hashes are very memory-optimized internally.
So assume your data in original file looks like this:
index, float_value
2,3.44
5,6.55
6,7.33
8,34.55
And you have currently stored them as one index to one float value, in a hash or a list.
You can apply this optimization by bucketing the values:
The hash key would be index % 1000, the field (sub-key) would be the index, and the value would be the float value.
More details here as well :
At first, we decided to use Redis in the simplest way possible: for
each ID, the key would be the media ID, and the value would be the
user ID:
SET media:1155315 939
GET media:1155315
"939"
While prototyping this solution, however, we found that Redis needed about 70 MB to store 1,000,000 keys this way. Extrapolating to
the 300,000,000 we would eventually need, it was looking to be around
21GB worth of data — already bigger than the 17GB instance type on
Amazon EC2.
We asked the always-helpful Pieter Noordhuis, one of Redis’ core
developers, for input, and he suggested we use Redis hashes. Hashes in
Redis are dictionaries that can be encoded in memory very
efficiently; the Redis setting ‘hash-zipmap-max-entries’ configures
the maximum number of entries a hash can have while still being
encoded efficiently. We found this setting was best around 1000; any
higher and the HSET commands would cause noticeable CPU activity. For
more details, you can check out the zipmap source file.
To take advantage of the hash type, we bucket all our Media IDs into
buckets of 1000 (we just take the ID, divide by 1000 and discard the
remainder). That determines which key we fall into; next, within the
hash that lives at that key, the Media ID is the lookup key within
the hash, and the user ID is the value. An example, given a Media ID
of 1155315, which means it falls into bucket 1155 (1155315 / 1000 =
1155):
HSET "mediabucket:1155" "1155315" "939"
HGET "mediabucket:1155" "1155315"
"939"
The size difference was pretty striking; with our 1,000,000 key prototype (encoded into 1,000 hashes of 1,000 sub-keys each),
Redis only needs 16MB to store the information. Expanding to 300
million keys, the total is just under 5GB — which in fact, even fits
in the much cheaper m1.large instance type on Amazon, about 1/3 of the
cost of the larger instance we would have needed otherwise. Best of
all, lookups in hashes are still O(1), making them very quick.
If you’re interested in trying these combinations out, the script we
used to run these tests is available as a Gist on GitHub (we also
included Memcached in the script, for comparison — it took about 52MB
for the million keys)
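A minimal sketch of the bucketing approach for the float question, assuming the redis-py client and following the divide-by-1000 scheme from the quoted post (the key prefix floats: is made up):

# Sketch only: assumes a local Redis server and the redis-py client.
import redis

r = redis.Redis()

def store_float(index: int, value: float) -> None:
    bucket = index // 1000                     # at most 1000 fields per hash,
    r.hset(f"floats:{bucket}", index, value)   # so the compact hash encoding applies

def get_floats(indexes):
    # Group the requested indexes per bucket and fetch each group with one HMGET,
    # which also covers the "pass multiple indexes" requirement from the question.
    by_bucket = {}
    for i in indexes:
        by_bucket.setdefault(i // 1000, []).append(i)
    result = {}
    for bucket, idxs in by_bucket.items():
        values = r.hmget(f"floats:{bucket}", idxs)
        result.update({i: (float(v) if v is not None else None) for i, v in zip(idxs, values)})
    return result

Note that hash-max-ziplist-entries must be set to at least 1000 for buckets of this size to stay in the compact encoding.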

Which approach is better when using Redis?

I'm facing the following problem:
I want to keep track of tasks given to users, and I want to store this state in Redis.
I can do:
1) create a list called "dispatched_tasks" holding many objects (username, task)
2) create many (potentially thousands of) lists called dispatched_tasks:username, each usually holding a few objects (task)
Which approach is better? If I only thought of my own comfort, I would choose the second one, as from time to time I will have to search for a particular user's tasks, and the second approach gives me this for free.
But how about Redis? Which approach will be more performant?
Thanks for any help.
Redis supports different kinds of data structures as shown here. There are different approaches you can take:
Scenario 1:
Using a list data type, your list will contain all the task/user combinations for your problem. However, accessing and deleting a task runs in O(n) time (Redis has to traverse the list to get to the element). This can have an impact on performance if your users have a lot of tasks.
Using sets:
Similar to lists, but you can add/delete/check for existence in O(1), and set elements are unique. So if you add another username/task pair that already exists, it won't be added again.
Scenario 2:
The data types do not change. The only difference is that there will be a lot more keys in Redis, which can increase the memory footprint.
From the FAQ:
What is the maximum number of keys a single Redis instance can hold? And what is the max number of elements in a Hash, List, Set, Sorted Set?
Redis can handle up to 2^32 keys, and was tested in practice to handle at least 250 million keys per instance.
Every hash, list, set, and sorted set, can hold 2^32 elements.
In other words your limit is likely the available memory in your
system.
What's the Redis memory footprint?
To give you a few examples (all obtained using 64-bit instances):
An empty instance uses ~ 3MB of memory.
1 Million small Keys -> String Value pairs use ~ 85MB of memory.
1 Million Keys -> Hash value, representing an object with 5 fields, use ~ 160 MB of memory.
To test your use case is trivial using the redis-benchmark utility to generate random data sets, then checking the space used with the INFO memory command.
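To make the two scenarios concrete, here is a minimal sketch of approach 2 with per-user sets (assuming the redis-py client; key names are made up):

# Sketch only: assumes a local Redis server and the redis-py client.
import redis

r = redis.Redis()

def dispatch(username: str, task: str) -> None:
    # Approach 2: one set per user -> O(1) add, duplicates are ignored.
    r.sadd(f"dispatched_tasks:{username}", task)

def complete(username: str, task: str) -> None:
    r.srem(f"dispatched_tasks:{username}", task)   # O(1) delete

def tasks_for(username: str):
    return r.smembers(f"dispatched_tasks:{username}")

# Approach 1 would instead push "username:task" entries onto a single list,
# e.g. r.lpush("dispatched_tasks", f"{username}:{task}"), but finding or
# removing one user's tasks then requires an O(n) scan of the whole list.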

optimization: stream of many key updates, small number of keys

I have a program that receives a constant stream of data.
From this stream of data I populate a hashtable. Every piece of data I receive
is translated into either:
a key update;
or a key insertion, if the key doesn't already exist.
I store the incoming raw data in a queue before it is processed.
The number of keys in the hashtable is very small. 99% of the data I receive
corresponds to key updates.
The problem is that I have so many key updates that the queue becomes
too big for my consumers.
Obviously, of the thousands of key updates, many concern the same key, so only the last one matters while all the others are useless.
What is the best way for me to handle this case? Which data structure should I
be using?
What can you tell us about your keys? How many are there? Are they numeric (and if so, what range of values might they take?), textual? Any limit on the number of bytes per key? What kind of hash table are you inserting to (e.g. closed hashing, open hashing)? What contention/locking is there on the hash table? How many updates per second? What programming language are you using?
How many keys
A few hundred, or maybe a few thousand. Not a lot!
Numeric keys
The keys themselves are alphanumeric and not very long, around 30 characters at most. The values, however, are all numbers (integers).
Limit on the number of bytes per key
My keys are 30 characters long, at most.
Kind of hash table
I'm simply using Python's defaultdict
Contention/locking
Python's dictionaries are generally considered thread-safe for single operations (thanks to the GIL).
Number of updates per second
It can go from 1 every 3 seconds to more than 100 per second.
Programming language
I'm using Python.
Instead of using a simple queue, you can use another hashtable: each incoming message is stored in the appropriate per-key stack. You then take the top element from each stack (which will be the most recent item for that key); you can optionally clear each stack when you pull an item out.
ConcurrentDictionary should fit the bill nicely.
But what you need here is a (maybe adaptive) throttling mechanism that detects when the queue is growing faster than the consumers can drain it and starts collapsing the data.
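A minimal sketch of that collapsing idea in plain Python (the class and names are made up): keep only the latest value per key in a lock-protected dict, and let the consumer drain a snapshot instead of reading a per-message queue.

# Sketch only: conflates updates so that only the latest value per key survives.
import threading

class CoalescingBuffer:
    def __init__(self):
        self._lock = threading.Lock()
        self._latest = {}                  # key -> most recent value

    def put(self, key, value):
        # Producer side: overwrites any pending update for this key.
        with self._lock:
            self._latest[key] = value

    def drain(self):
        # Consumer side: atomically take everything that is pending.
        with self._lock:
            pending, self._latest = self._latest, {}
        return pending

buf = CoalescingBuffer()
buf.put("AAPL", 101.2)
buf.put("AAPL", 101.5)                     # supersedes the previous update
buf.put("GOOG", 95.0)
print(buf.drain())                         # {'AAPL': 101.5, 'GOOG': 95.0}
print(buf.drain())                         # {}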

Efficient Hashmap Use

What is the more efficient approach for using hashmaps?
A) Use multiple smaller hashmaps, or
B) store all objects in one giant hashmap?
(Assume that the hashing algorithm for the keys is fairly efficient, resulting in few collisions)
CLARIFICATION: Option A implies segregation by primary key -- i.e. no additional lookup is necessary to determine which actual hashmap to use. (For example, if the lookup keys are alphanumeric, Hashmap 1 stores the A's, Hashmap 2 stores the B's, and so on.)
Definitely B. The advantage of hash tables is that the average number of comparisons per lookup is independent of the size.
If you split your map into N smaller hashmaps, you will have to search half of them on average for each lookup. If the smaller hashmaps have the same load factor as the larger map would have had, you will increase the total number of comparisons by a factor of approximately N/2.
And if the smaller hashmaps have a smaller load factor, you are wasting memory.
All that is assuming you distribute the keys randomly between the smaller hashmaps. If you distribute them according to some function of the key (e.g. a string prefix) then what you have created is a trie, which is efficient for some applications (e.g. auto-complete in web forms.)
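A toy sketch of the comparison-count argument in Python (the key names and split are made up): with a random split across N maps you don't know which sub-map holds a key, so a lookup may probe several of them, whereas the single map needs exactly one probe.

# Sketch only: contrasts one big dict with N randomly-assigned smaller dicts.
import random

keys = [f"key{i}" for i in range(10_000)]

# Option B: one giant map -> exactly one probe per lookup.
big = {k: len(k) for k in keys}

# Option A with a *random* split: a lookup has to try sub-maps until it hits.
N = 8
small = [dict() for _ in range(N)]
for k in keys:
    small[random.randrange(N)][k] = len(k)

def lookup_random_split(key):
    probes = 0
    for m in small:
        probes += 1
        if key in m:
            return m[key], probes
    return None, probes

_, probes = lookup_random_split("key1234")
print(probes)   # about N/2 sub-maps probed on average, vs. exactly 1 for the big dict

If the split is instead by a function of the key (the prefix scheme from the clarification), only one sub-map is ever probed, which is the trie-like case described above.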
Are these maps used in logically distinct places? For instance, I wouldn't have one map containing users, cached query results, loggers etc, just because you happen to know the keys won't clash. However, I equally wouldn't split up a single map into multiple maps.
Keep one hashmap for each logical mapping from key to value.
In addition to @Jon's answer, there can be practical reasons why you want to maintain separate hash tables.
If you have separate tables for different mappings you can 'clear' each of the mappings independently; e.g. by calling 'clear' or getting rid of the reference to the corresponding table.
If the separate tables hold mappings to cached entries, you can use different strategies to 'age' the respective entries.
If the application is multi-threaded, using separate tables may reduce lock contention, and may (for some processor architectures) increase processor memory cache hit ratios.
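For instance, a small sketch in plain Python (names made up) of keeping logically distinct mappings in separate tables, so one can be cleared or locked without touching the other:

# Sketch only: separate tables for logically distinct mappings.
import threading

users = {}                      # user_id -> user record
query_cache = {}                # query string -> cached result
query_cache_lock = threading.Lock()

def evict_query_cache():
    # The cache can be aged or cleared on its own schedule...
    with query_cache_lock:
        query_cache.clear()
    # ...without ever touching (or locking) the users table.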