Storing 13 million floats and integer indexes in Redis

I have a file with 13 million floats, each with an associated integer index. The original file size is 80MB.
We want to pass multiple indexes and get the corresponding float data back. The only reason I needed hash fields and values is that a list does not support fetching multiple arbitrary indexes in one call.
I stored them as a hash in Redis, with the index as the field and the float as the value. On checking memory usage, it was about 970MB.
Storing the 13 million floats as a list uses 280MB.
Is there any optimization I can use?
Thanks in advance.
This is running on ElastiCache.

You can get a significant optimization by bucketing the index/float values.
Hashes are very memory-optimized internally when they are small.
So assume your data in the original file looks like this:
index, float_value
2,3.44
5,6.55
6,7.33
8,34.55
And you have currently stored them one index to one float value, either in a hash or in a list.
You can apply this bucketing optimization instead:
The hash key would be index / 1000 (integer division, so each bucket holds at most 1000 consecutive indexes), the field would be the index itself, and the value would be the float, as in the sketch below.
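A minimal sketch of this bucketing, assuming the redis-py client and a made-up key prefix floats:<bucket>; any client with HSET/HMGET works the same way:

import redis  # assuming the redis-py client

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

BUCKET_SIZE = 1000  # keep each hash small enough for the compact encoding

def store(index, value):
    # integer division groups 1000 consecutive indexes into one hash
    r.hset(f"floats:{index // BUCKET_SIZE}", index, value)

def fetch(indexes):
    # group requested indexes by bucket so each bucket needs only one HMGET,
    # then pipeline all HMGETs into a single round trip
    by_bucket = {}
    for i in indexes:
        by_bucket.setdefault(i // BUCKET_SIZE, []).append(i)
    pipe = r.pipeline()
    for bucket, ids in by_bucket.items():
        pipe.hmget(f"floats:{bucket}", ids)
    result = {}
    for ids, values in zip(by_bucket.values(), pipe.execute()):
        for i, v in zip(ids, values):
            result[i] = float(v) if v is not None else None
    return result

store(2, 3.44)
store(5, 6.55)
print(fetch([2, 5]))  # {2: 3.44, 5: 6.55}

With decode_responses=True the stored values come back as strings, so float() can parse them directly; grouping the requested indexes by bucket keeps the lookup to one HMGET per bucket.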
More details in the Instagram engineering post on this exact technique, quoted below:
At first, we decided to use Redis in the simplest way possible: for
each ID, the key would be the media ID, and the value would be the
user ID:
SET media:1155315 939
GET media:1155315
> 939
While prototyping this solution, however, we found that Redis needed about 70 MB to store 1,000,000 keys this way. Extrapolating to
the 300,000,000 we would eventually need, it was looking to be around
21GB worth of data — already bigger than the 17GB instance type on
Amazon EC2.
We asked the always-helpful Pieter Noordhuis, one of Redis’ core
developers, for input, and he suggested we use Redis hashes. Hashes in
Redis are dictionaries that can be encoded in memory very
efficiently; the Redis setting ‘hash-zipmap-max-entries’ configures
the maximum number of entries a hash can have while still being
encoded efficiently. We found this setting was best around 1000; any
higher and the HSET commands would cause noticeable CPU activity. For
more details, you can check out the zipmap source file.
To take advantage of the hash type, we bucket all our Media IDs into
buckets of 1000 (we just take the ID, divide by 1000 and discard the
remainder). That determines which key we fall into; next, within the
hash that lives at that key, the Media ID is the lookup key within
the hash, and the user ID is the value. An example, given a Media ID
of 1155315, which means it falls into bucket 1155 (1155315 / 1000 =
1155):
HSET "mediabucket:1155" "1155315" "939" HGET "mediabucket:1155"
"1155315"
"939" The size difference was pretty striking; with our 1,000,000 key prototype (encoded into 1,000 hashes of 1,000 sub-keys each),
Redis only needs 16MB to store the information. Expanding to 300
million keys, the total is just under 5GB — which in fact, even fits
in the much cheaper m1.large instance type on Amazon, about 1/3 of the
cost of the larger instance we would have needed otherwise. Best of
all, lookups in hashes are still O(1), making them very quick.
If you’re interested in trying these combinations out, the script we
used to run these tests is available as a Gist on GitHub (we also
included Memcached in the script, for comparison — it took about 52MB
for the million keys).
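For completeness, a small redis-py sketch (against a self-managed Redis) showing how you could verify that a bucket still has the compact encoding. The modern setting name is hash-max-ziplist-entries (hash-max-listpack-entries on Redis 7+), and on managed services such as ElastiCache the CONFIG commands are usually blocked, so these values are set through parameter groups instead:

import redis  # assuming the redis-py client

r = redis.Redis()

# allow up to 1000 fields per hash before Redis switches to the plain hashtable encoding
r.config_set("hash-max-ziplist-entries", 1000)

r.hset("mediabucket:1155", "1155315", "939")
print(r.object("encoding", "mediabucket:1155"))  # b'listpack' (or b'ziplist' on older Redis)
print(r.hget("mediabucket:1155", "1155315"))     # b'939'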

Related

Redis multi Key or multi Hash field

I have about 300k rows of data keyed like this: Session:Hist:[account]
Session:Hist:100000
Session:Hist:100001
Session:Hist:100002
.....
Each has 5-10 children of the form [session]:[time]
b31c2a43-e61b-493a-b8d4-ff0729fe89de:1846971068807
5552daa2-c9f6-4635-8a7c-6f027b4aa1a3:1846971065461
.....
I have 2 options:
Option 1: Using a hash per account, where the key is Session:Hist:[account], the field is [session], and the value is [time]
Option 2: Using one flat hash for all accounts, where the key is Session:Hist, the field is [account]:[session], and the value is [time]
My Redis setup has 1 master and 4-5 slaves, used to store and push sessions (about 300k * 5 within 2 hours) every day, and cleared at the end of the day.
So the question is: which option is better for performance (faster master-slave sync, smaller memory, faster under heavy request load)? Thanks for your help!
Comparing the two options mentioned, option #2 is less optimal.
According to official Redis documentation:
It is worth noting that small hashes (i.e., a few elements with small values) are encoded in a special way in memory that makes them very memory efficient.
More details are in the Redis documentation on memory optimization.
So having one huge hash at the key Session:Hist would hurt memory consumption. It would also affect clustering (sharding), since you would have one hash (a hot spot) located on one instance, which would get hammered.
Option #1 does not suffer from the problems mentioned above, as long as you have many well-distributed hashes keyed as Session:Hist:[account] (i.e. all accounts have a similar number of sessions, rather than a few dominant accounts with huge numbers of sessions).
If, however, there is a possibility for uneven distribution of sessions into accounts, you could try (and measure) the efficiency of option 1a:
Key: Session:Hist:[account]:[session with its last two characters removed]
field: [session's last two characters]
value: [time]
Example:
Key: Session:Hist:100000:b31c2a43-e61b-493a-b8d4-ff0729fe89
field: de
value: 1846971068807
This way, each hash will only contain up to 256 fields (assuming the last 2 characters of a session are hex, there are 256 possible combinations). This would be optimal if redis.conf sets hash-max-ziplist-entries to 256 (the setting was named hash-max-zipmap-entries in older Redis versions).
Obviously, option 1a would require some modifications in your application, but with proper benchmarking (i.e. measuring the memory savings) you could decide whether it's worth the effort.
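A rough redis-py sketch of option 1a (the helper names are made up for illustration):

import redis  # assuming the redis-py client

r = redis.Redis()

def record_session(account, session, time_ms):
    # the last two characters of the session id become the hash field,
    # so each hash holds at most 256 fields and stays compactly encoded
    r.hset(f"Session:Hist:{account}:{session[:-2]}", session[-2:], time_ms)

def session_time(account, session):
    return r.hget(f"Session:Hist:{account}:{session[:-2]}", session[-2:])

record_session(100000, "b31c2a43-e61b-493a-b8d4-ff0729fe89de", 1846971068807)
print(session_time(100000, "b31c2a43-e61b-493a-b8d4-ff0729fe89de"))  # b'1846971068807'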

Using HSET or SETBIT for storing 6 billion SHA256 hashes in Redis

Problem set: I am looking to store 6 billion SHA256 hashes. I want to check whether a hash exists and, if so, perform an action. When it comes to storing the SHA256 hash (a 64-character hex string) just to check whether the key exists, I've come across two pairs of commands to use:
HSET/HEXISTS and GETBIT/SETBIT
I want to make sure I take the least amount of memory, but also want to make sure lookups are quick.
The Use case will be "check if sha256 hash exist"
The problem:
I want to understand how to store this data, as currently I see a 200% size increase going from text to Redis. I want to understand what the best sharding options would be using the ziplist entries and ziplist value settings, and how to split each hash so that the ziplist encoding is actually used.
I've tried setting the ziplist entries to 16^4 (65536) and the value to 60, based on splitting each hash 4:60 (the first 4 hex characters as the bucket key, the remaining 60 as the field).
Any help understanding the options and techniques to make the footprint as small as possible while keeping lookups quick would be appreciated.
Thanks
A bit late to the party but you can just use plain Redis keys for this:
# Store a given SHA256 hash
> SET 9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08 ""
OK
# Check whether a specific hash exists
> EXISTS 2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae
0
Where both SET and EXISTS have a time complexity of O(1) for single keys.
As a single Redis instance can handle a maximum of 2^32 keys (about 4.3 billion, less than your 6 billion hashes), you should split your dataset across two or more Redis servers / clusters, also depending on the number of nodes and the total combined memory available to your servers / clusters.
I would also suggest using the binary form of your hashes (32 bytes) instead of their hex representation (64 characters), as that would save roughly 50% of the memory spent on key names in Redis.
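A minimal redis-py sketch of this approach (the helper names are illustrative), storing the 32-byte binary digest as the key name with an empty value:

import hashlib
import redis  # assuming the redis-py client

r = redis.Redis()

def add_hash(hex_digest):
    # 32 raw bytes instead of a 64-character hex string roughly halves the key size
    r.set(bytes.fromhex(hex_digest), b"")

def hash_exists(hex_digest):
    return r.exists(bytes.fromhex(hex_digest)) == 1

digest = hashlib.sha256(b"hello world").hexdigest()
add_hash(digest)
print(hash_exists(digest))  # True

As noted above, 6 billion keys exceed what one instance can hold, so the keyspace would still need to be split across several servers or a cluster.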

Which approach is better when using Redis?

I'm facing the following problem:
I want to keep track of tasks given to users, and I want to store this state in Redis.
I can do:
1) create one list called "dispatched_tasks" holding many objects (username, task)
2) create many (potentially thousands of) lists called dispatched_tasks:username, each usually holding a few objects (task)
Which approach is better? If I only thought of my own convenience, I would choose the second one, as from time to time I will have to search for a particular user's tasks, and the second approach gives me that for free.
But how about Redis? Which approach will be more performant?
Thanks for any help.
Redis supports different kinds of data structures, as described in its documentation. There are different approaches you can take:
Scenario 1:
Using a list data type, your list will contain all the task/user combinations for your problem. However, accessing and deleting a task runs in O(n) time complexity (it has to traverse the list to get to the element). This can impact performance if your users have a lot of tasks.
Using sets:
Similar to lists, but you can add/delete/check for existence in O(1), and set elements are unique. So if you add another username/task pair that already exists, it won't be added again.
Scenario 2:
The data types do not change. The only difference is that there will be a lot more keys in Redis, which can increase the memory footprint due to per-key overhead.
From the FAQ:
What is the maximum number of keys a single Redis instance can hold? and what the max number of elements in a Hash, List, Set, Sorted
Set?
Redis can handle up to 2^32 keys, and was tested in practice to handle
at least 250 million keys per instance.
Every hash, list, set, and sorted set, can hold 2^32 elements.
In other words your limit is likely the available memory in your
system.
What's the Redis memory footprint?
To give you a few examples (all obtained using 64-bit instances):
An empty instance uses ~3MB of memory.
1 million small key -> string value pairs use ~85MB of memory.
1 million keys -> hash values, representing an object with 5 fields, use ~160MB of memory.
Testing your use case is trivial: use the redis-benchmark utility to generate random data sets and check the space used with the INFO memory command.
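A short redis-py sketch combining the second approach with the set suggestion above (one set per user, keyed dispatched_tasks:<username> as in the question; the helper names are made up):

import redis  # assuming the redis-py client

r = redis.Redis(decode_responses=True)

def dispatch(username, task):
    # SADD is O(1) and silently ignores duplicates
    r.sadd(f"dispatched_tasks:{username}", task)

def complete(username, task):
    # SREM is O(1) as well, no list traversal needed
    r.srem(f"dispatched_tasks:{username}", task)

def user_tasks(username):
    return r.smembers(f"dispatched_tasks:{username}")

dispatch("alice", "task-42")
print(user_tasks("alice"))  # {'task-42'}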

Does redis key size also include the data size for that key or just the key itself?

I'm trying to analyse the size of our Redis database and tweak how we store our data, following a few articles such as https://davidcel.is/posts/the-story-of-my-redis-database/
and https://engineering.instagram.com/storing-hundreds-of-millions-of-simple-key-value-pairs-in-redis-1091ae80f74c
I've read documentation about "key sizes" (i.e. https://redis.io/commands/object)
and tried running various tools like:
redis-cli --bigkeys
and also tried to read the output from the redis-cli:
INFO memory
The size semantics are not clear to me.
Does the reported size reflect ONLY the size of the key itself, i.e. if my key is "abc" and the value is "value1", is the reported size only for the "abc" portion? And the same question with respect to complex data structures for that key, such as a hash or a list.
Trial and error doesn't seem to give me a clear result.
Different tools give different answers.
First, read about --bigkeys: it reports big value sizes in the keyspace, excluding the space taken by the key's name. Note that in this case the size of the value means something different for each data type, i.e. Strings are sized by their STRLEN (bytes) whereas all others by the number of their nested elements.
So it basically gives little indication of actual memory usage, but rather does what it is intended to do: find big keys (not big key names, only estimated big values).
INFO MEMORY is a different story. The used_memory is reported in bytes and reflects the entire RAM consumption of key names, their values and all associated overheads of the internal data structures.
There is also DEBUG OBJECT, but note that its output is not a reliable way to measure the memory consumption of a key in Redis: the serializedlength field is the number of bytes needed for persisting the object, not the actual footprint in memory, which includes various administrative overheads on top of the data itself.
Lastly, as of v4 we have the MEMORY USAGE command that does a much better job - see https://github.com/antirez/redis-doc/pull/851 for the details.
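A short redis-py sketch contrasting the measurements discussed above (the user:1000 key and its fields are made up for illustration, and a reasonably recent redis-py is assumed):

import redis  # assuming the redis-py client

r = redis.Redis()
r.hset("user:1000", mapping={"name": "abc", "email": "abc@example.com"})

# MEMORY USAGE (Redis 4+): RAM actually used by the key name, its value and overheads, in bytes
print(r.memory_usage("user:1000"))

# DEBUG OBJECT: serializedlength is the persisted size, not the in-memory footprint
print(r.debug_object("user:1000")["serializedlength"])

# INFO memory: RAM usage of the whole instance
print(r.info("memory")["used_memory_human"])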

Redis: Multiple unique keys versus bucketing through Hash

I have a total of six types of keys, say a, b, ..., f, each having around a million subkeys, like a1, a2, ..., a99999 (different in each bucket). What is the faster way to access them?
Having separate keys that combine the bucket name and the subkey, like a_a1, b_b1, etc.?
Or using a hash for each of the 6 keys as buckets, and then having about 1 million fields in each?
I searched Stack Overflow and couldn't find such a comparison for the case of a few buckets with a huge number of keys.
Edit 1: Every key and value is a string of at most 100 characters. I would access them using the Jedis library for Java, making transactions.
Your question reminds me of this article. It doesn't contain performance benchmarks, but it seems like your second case (with buckets of keys) will have adequate performance and a small memory footprint.
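A small redis-py sketch of the two access patterns from the question (key names follow the a_a1 / a examples). One caveat, per the bucketing discussion earlier: the compact hash encoding only applies while a hash stays under the hash-max-ziplist-entries threshold, so six hashes with a million fields each will fall back to the regular hashtable encoding:

import redis  # assuming the redis-py client

r = redis.Redis(decode_responses=True)

# Option 1: separate keys combining bucket name and subkey
r.set("a_a1", "value-1")
print(r.get("a_a1"))             # 'value-1'
print(r.mget(["a_a1", "b_b1"]))  # multi-get across combined key names

# Option 2: one hash per bucket, subkeys as fields
r.hset("a", "a1", "value-1")
print(r.hget("a", "a1"))           # 'value-1'
print(r.hmget("a", ["a1", "a2"]))  # multi-get within one bucket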