Redis multi Key or multi Hash field

I have about 300k rows of data, keyed like Session:Hist:[account]:
Session:Hist:100000
Session:Hist:100001
Session:Hist:100002
.....
Each has 5-10 child entries of the form [session]:[time]:
b31c2a43-e61b-493a-b8d4-ff0729fe89de:1846971068807
5552daa2-c9f6-4635-8a7c-6f027b4aa1a3:1846971065461
.....
I have 2 options:
Using a Hash per account: the key is Session:Hist:[account], the field is [session], the value is [time]
Using one flat Hash for all accounts: the key is Session:Hist, the field is [account]:[session], the value is [time]
My Redis setup has 1 master and 4-5 slaves, used to store and push sessions (about 300k * 5 within 2 hours) every day, and cleared at the end of the day.
So the question is: which option is better for performance (faster master-slave sync, smaller memory footprint, faster under heavy request load)? Thanks for your help!

Comparing the two options mentioned, option #2 is less optimal.
According to the official Redis documentation:
It is worth noting that small hashes (i.e., a few elements with small values) are encoded in special way in memory that make them very memory efficient.
More details are in the Redis documentation on memory optimization.
So having one huge hash with key Session:Hist would affect memory consumption. It would also affect clustering (sharding) since you would have one hash (hot-spot) located on one instance which would get hammered.
Option #1 does not suffer from the problems mentioned above, as long as you have many well-distributed hashes keyed as Session:Hist:[account] (i.e. all accounts have a similar number of sessions, rather than a few dominant accounts with a huge number of sessions).
If, however, there is a possibility for uneven distribution of sessions into accounts, you could try (and measure) the efficiency of option 1a:
Key: Session:Hist:[account]:[session without its last two characters]
field: [session's last two characters]
value: [time]
Example:
Key: Session:Hist:100000:b31c2a43-e61b-493a-b8d4-ff0729fe89
field: de
value: 1846971068807
This way, each hash will only contain up to 256 fields (assuming the last 2 characters of a session are hex, all possible combinations would be 256). This would be optimal if redis.conf sets hash-max-zipmap-entries (hash-max-ziplist-entries in newer Redis versions) to at least 256.
Obviously, option 1a would require some modifications in your application, but with proper benchmarking (i.e. measuring the memory savings) you can decide whether it's worth the effort.
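If it helps to see the layouts side by side, here is a minimal redis-py sketch of option #1 and option 1a; the connection settings and function names are illustrative and not part of the original question:

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Option #1: one hash per account, field = session, value = time.
def store_option_1(account, session, time_ms):
    r.hset(f"Session:Hist:{account}", session, time_ms)

# Option 1a: move all but the last two characters of the session into the key,
# so each hash holds at most 256 fields (one per possible two-hex-char suffix).
def store_option_1a(account, session, time_ms):
    prefix, suffix = session[:-2], session[-2:]
    r.hset(f"Session:Hist:{account}:{prefix}", suffix, time_ms)

store_option_1("100000", "b31c2a43-e61b-493a-b8d4-ff0729fe89de", 1846971068807)
store_option_1a("100000", "b31c2a43-e61b-493a-b8d4-ff0729fe89de", 1846971068807)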

Related

Using HSET or SETBIT for storing 6 billion SHA256 hashes in Redis

Problem set: I am looking to store 6 billion SHA256 hashes. I want to check if a hash exists and, if so, an action will be performed. When it comes to storing the SHA256 hash (a 64-character hex string) just to check whether the key exists, I've come across two approaches to use:
HSET/HEXISTS and GETBIT/SETBIT
I want to make sure I take the least amount of memory, but also want to make sure lookups are quick.
The use case will be "check if a SHA256 hash exists".
The problem: I want to understand how to store this data, as currently I see a 200% size increase going from text to Redis. I want to understand what the best sharding options would be for the ziplist-entries and ziplist-value settings, and how to split the hashes effectively so the ziplist encoding is maximised.
I've tried setting the ziplist entries to 16^4 (65536) and the value to 60, based on splitting the hash 4:60.
Any help understanding the options and techniques to make the footprint as small as possible, while keeping lookups quick enough, would be appreciated.
Thanks
A bit late to the party but you can just use plain Redis keys for this:
# Store a given SHA256 hash
> SET 9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08 ""
OK
# Check whether a specific hash exists
> EXISTS 2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae
0
Where both SET and EXISTS have a time complexity of O(1) for single keys.
As Redis can handle a maximum of 2^32 keys, you should split your dataset into two or more Redis servers / clusters, also depending on the number of nodes and the total combined memory available to your servers / clusters.
I would also suggest using the binary sequence of your hashes instead of their textual representation, as that would save roughly 50% of the memory used for storing your keys in Redis.
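As a rough illustration of that last point, here is a hedged redis-py sketch that stores the raw 32-byte digest as the key instead of the 64-character hex string; the function names are made up for the example:

import redis

r = redis.Redis(host="localhost", port=6379)

def add_hash(hex_digest):
    # Store the 32-byte binary digest as the key; the empty value carries no information.
    r.set(bytes.fromhex(hex_digest), b"")

def hash_exists(hex_digest):
    return r.exists(bytes.fromhex(hex_digest)) == 1

add_hash("9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08")
print(hash_exists("2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae"))  # False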

Storing 13 Million floats and integer in redis

I have a file with 13 million floats, each of which has an associated integer index. The original size of the file is 80MB.
We want to pass multiple indexes to get float data. The only reason I need hash fields and values is that a List does not support getting multiple indexes at once.
I stored them as a hash in Redis, with the index as the field and the float as the value. On checking memory usage, it was about 970MB.
Storing the 13 million values as a list uses 280MB.
Is there any optimization I can use?
Thanks in advance
(running on ElastiCache)
You can get a really good optimization by creating buckets of index-to-float values.
Small hashes are very memory-optimized internally.
So assume your data in original file looks like this:
index, float_value
2,3.44
5,6.55
6,7.33
8,34.55
And currently you have stored them one index to one float value, in a hash or a list.
You can apply this optimization of bucketing the values:
The hash key would be index/1000 (integer division, i.e. the bucket number), the field would be the full index, and the value would be the float value.
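A minimal redis-py sketch of that bucketing scheme, assuming 1,000 indexes per bucket and an illustrative "float:" key prefix:

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def store(index, value):
    bucket = index // 1000                    # integer division picks the bucket
    r.hset(f"float:{bucket}", index, value)   # field = full index, value = float

def fetch(indexes):
    # Group the requested indexes by bucket and issue one HMGET per bucket.
    by_bucket = {}
    for i in indexes:
        by_bucket.setdefault(i // 1000, []).append(i)
    result = {}
    for bucket, idxs in by_bucket.items():
        values = r.hmget(f"float:{bucket}", idxs)
        result.update(dict(zip(idxs, values)))
    return result

store(1155315, 3.44)
print(fetch([1155315]))

Each hash then holds at most 1,000 fields, which matches the bucket size used in the excerpt below.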
More details here as well, from the Instagram Engineering post on storing hundreds of millions of simple key-value pairs in Redis:

At first, we decided to use Redis in the simplest way possible: for each ID, the key would be the media ID, and the value would be the user ID:

SET media:1155315 939
GET media:1155315
> 939

While prototyping this solution, however, we found that Redis needed about 70 MB to store 1,000,000 keys this way. Extrapolating to the 300,000,000 we would eventually need, it was looking to be around 21GB worth of data — already bigger than the 17GB instance type on Amazon EC2.

We asked the always-helpful Pieter Noordhuis, one of Redis' core developers, for input, and he suggested we use Redis hashes. Hashes in Redis are dictionaries that can be encoded in memory very efficiently; the Redis setting 'hash-zipmap-max-entries' configures the maximum number of entries a hash can have while still being encoded efficiently. We found this setting was best around 1000; any higher and the HSET commands would cause noticeable CPU activity. For more details, you can check out the zipmap source file.

To take advantage of the hash type, we bucket all our Media IDs into buckets of 1000 (we just take the ID, divide by 1000 and discard the remainder). That determines which key we fall into; next, within the hash that lives at that key, the Media ID is the lookup key within the hash, and the user ID is the value. An example, given a Media ID of 1155315, which means it falls into bucket 1155 (1155315 / 1000 = 1155):

HSET "mediabucket:1155" "1155315" "939"
HGET "mediabucket:1155" "1155315"
> "939"

The size difference was pretty striking; with our 1,000,000 key prototype (encoded into 1,000 hashes of 1,000 sub-keys each), Redis only needs 16MB to store the information. Expanding to 300 million keys, the total is just under 5GB — which in fact, even fits in the much cheaper m1.large instance type on Amazon, about 1/3 of the cost of the larger instance we would have needed otherwise. Best of all, lookups in hashes are still O(1), making them very quick.

If you're interested in trying these combinations out, the script we used to run these tests is available as a Gist on GitHub (we also included Memcached in the script, for comparison — it took about 52MB for the million keys).

Identifying Differences Efficiently

Every day, we receive huge files from various vendors in different formats (CSV, XML, custom) which we need to upload into a database for further processing.
The problem is that these vendors will send the full dump of their data and not just the updates. We have some applications where we need only send the updates (that is, the changed records only). What we do currently is to load the data into a staging table and then compare it against previous data. This is painfully slow as the data set is huge and we are occasionally missing SLAs.
Is there a quicker way to resolve this issue? Any suggestions or help would be greatly appreciated. Our programmers are running out of ideas.
There are a number of patterns for detecting deltas, i.e. changed records, new records, and deleted records, in full dump data sets.
One of the more efficient ways I've seen is to create hash values of the rows of data you already have, create hashes of the import once it's in the database, then compare the existing hashes to the incoming hashes.
Primary key match + hash match = Unchanged row
Primary key match + hash mismatch = Updated row
Primary key in incoming data but missing from existing data set = New row
Primary key not in incoming data but in existing data set = Deleted row
How to hash varies by database product, but all of the major providers have some sort of hashing available in them.
The advantage comes from only having to compare a small number of fields (the primary key column(s) and the hash) rather than doing a field by field analysis. Even pretty long hashes can be analyzed pretty fast.
It'll require a little rework of your import processing, but the time spent will pay off over and over again in increased processing speed.
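To make the bookkeeping concrete, here is a small, hedged sketch of the comparison step, done in Python on plain dictionaries that map primary key to row hash (the database-side details are left out):

def diff(existing, incoming):
    # existing/incoming: dict mapping primary key -> row hash
    new_rows     = [k for k in incoming if k not in existing]
    deleted_rows = [k for k in existing if k not in incoming]
    updated_rows = [k for k in incoming
                    if k in existing and incoming[k] != existing[k]]
    return new_rows, updated_rows, deleted_rows

new, updated, deleted = diff(
    existing={"A1": "hash-old", "A2": "hash-same"},
    incoming={"A1": "hash-new", "A2": "hash-same", "A3": "hash-brand-new"},
)
# new == ["A3"], updated == ["A1"], deleted == []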
The standard solution to this is hash functions. What you do is take each row and calculate an identifier plus a hash of its contents. Now you compare hashes, and if the hashes are the same, you assume the row is the same. This is imperfect: it is theoretically possible that different values will give the same hash value. But in practice you have more to worry about from cosmic rays causing random bit flips in your computer than you do about hash functions failing to work as promised.
Both rsync and git are examples of widely used software that use hashes in this way.
In general calculating a hash before you put it in the database is faster than performing a series of comparisons inside of the database. Furthermore it allows processing to be spread out across multiple machines, rather than bottlenecked in the database. And comparing hashes is less work than comparing many fields, whether you do it in the database or out.
There are many hash functions that you can use. Depending on your application, you might want to use a cryptographic hash though you probably don't have to. More bits is better than fewer, but a 64 bit hash should be fine for the application that you describe. After processing a trillion deltas you would still have less than 1 chance in 10 million of having made an accidental mistake.
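If you want to see what such a row hash might look like, here is a minimal sketch that derives a 64-bit value from a cryptographic digest; the column layout and separator are illustrative assumptions:

import hashlib

def row_hash_64(row, columns):
    # Join the fields in a fixed column order so identical rows hash identically.
    joined = "|".join(str(row[c]) for c in columns)
    digest = hashlib.sha256(joined.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")   # keep 64 bits, enough per the estimate above

print(row_hash_64({"id": 42, "name": "Acme", "amount": 19.99},
                  ["id", "name", "amount"]))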

Which approach is better when using Redis?

I'm facing following problem:
I want to keep track of tasks given to users, and I want to store this state in Redis.
I can do:
1) create a list called "dispatched_tasks" holding many objects (username, task)
2) create many (potentially thousands of) lists called dispatched_tasks:username, each usually holding a few objects (task)
Which approach is better? If I only thought of my own convenience, I would choose the second one, as from time to time I will have to search for a particular user's tasks, and the second approach gives me this for free.
But how about Redis? Which approach will be more performant?
Thanks for any help.
Redis supports different kinds of data structures, as shown in the documentation. There are different approaches you can take:
Scenario 1:
Using a list data type, your list will contain all the task/user combinations for your problem. However, accessing and deleting a task runs in O(n) time complexity (it has to traverse the list to get to the element). This can have an impact on performance if your users have a lot of tasks.
Using sets:
Similar to lists, but you can add/delete/check for existence in O(1), and set elements are unique. So if you add another username/task pair that already exists, it won't be added again.
Scenario 2:
The data types do not change. The only difference is that there will be a lot more keys in Redis, which can increase the memory footprint.
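For illustration, here is a minimal redis-py sketch of the two layouts using sets; the key names follow the question, everything else is assumed:

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Scenario 1: one global set holding "username:task" members.
r.sadd("dispatched_tasks", "alice:task42")
r.sismember("dispatched_tasks", "alice:task42")   # O(1) membership check
r.srem("dispatched_tasks", "alice:task42")        # O(1) removal

# Scenario 2: one set per user, so per-user lookups come for free.
r.sadd("dispatched_tasks:alice", "task42")
print(r.smembers("dispatched_tasks:alice"))       # all tasks for one user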
From the FAQ:
What is the maximum number of keys a single Redis instance can hold? And what is the max number of elements in a Hash, List, Set, or Sorted Set?

Redis can handle up to 2^32 keys, and was tested in practice to handle at least 250 million keys per instance.

Every hash, list, set, and sorted set can hold 2^32 elements.

In other words, your limit is likely the available memory in your system.

What's the Redis memory footprint?

To give you a few examples (all obtained using 64-bit instances):

An empty instance uses ~3MB of memory.
1 Million small Keys -> String Value pairs use ~85MB of memory.
1 Million Keys -> Hash value, representing an object with 5 fields, use ~160MB of memory.

To test your use case it is trivial to use the redis-benchmark utility to generate random data sets, then check the space used with the INFO memory command.

In Redis, how many HSET keys's keys is too many?

I've carefully read https://redis.io/topics/memory-optimization but I'm still confused. Basically, it says to cap the number of top-level keys by grouping them into hash maps (HSET). But how many keys within each HSET is too many?
If I have 1,000,000 keys for a certain prefix, each with a unique value, and suppose they're integer-looking, like "12345689". If I "shard" the keys by taking the first two characters (e.g. "12") as the hash key and the remainder (e.g. "345689") as the "sub-key", then each hash will have 1,000,000 / 100 = 10,000 keys (theoretically). Is that too many?
My (default) config is:
redis-store:6379> config get hash-*
1) "hash-max-ziplist-entries"
2) "512"
3) "hash-max-ziplist-value"
4) "64"
So, if I shard up the 1,000,000 keys per prefix this way, I'll have fewer than 512 hash keys. Actually, I'll have 100 (e.g. "12" or "99"). But what about within each one? There'll theoretically be 10,000 keys each. Does that mean I break the limit and can't benefit from the space optimization that hash maps offer?
You can use the following formula to estimate the internal HASH data overhead for each key:
3 * next_power(n) * size_of(pointer)
Here n is the number of keys in your HASH. I assume you are using the x64 version of Redis, so size_of(pointer) is 8. So for each 10,000 keys in your HASH you would have at least 240,000 bytes of overhead.
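As a quick sanity check of that estimate, here is the formula in Python (next_power rounds up to the next power of two, which is how Redis sizes its dict hash tables):

def next_power(n):
    # Smallest power of two >= n.
    p = 1
    while p < n:
        p *= 2
    return p

def hash_overhead_bytes(n_keys, pointer_size=8):
    return 3 * next_power(n_keys) * pointer_size

print(hash_overhead_bytes(10_000))   # 393216, so "at least 240,000 bytes" holds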
UPDATED
Please keep in mind that hash-max-ziplist-entries is not a silver bullet. Have a look at the article "Under the hood of Redis #2": a ziplist's overhead can be estimated as roughly 21 * n, but at the same time, while saving up to 10x the RAM, you get a write-speed slowdown of up to 30x and a read slowdown of up to 100x. So with a total of 1,000,000 entries in a single HASH you could hit a critical performance breakdown.
You can read more about Redis HASH internals in "Under the hood of Redis #1".
After some extensive research I've finally understood how hash-max-ziplist-entries works.
https://www.peterbe.com/plog/understanding-redis-hash-max-ziplist-entries
Basically, you use just 1 hash map, or you break it up into multiple hash maps if you need to store more keys within one than hash-max-ziplist-entries is set to.
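To tie that back to the question, here is a minimal redis-py sketch of the two-character sharding scheme described above; note that with ~10,000 fields per hash you would also need to raise hash-max-ziplist-entries well above the default 512 for the compact encoding to apply (the key prefix and function names are illustrative):

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def put(prefix, key, value):
    bucket, field = key[:2], key[2:]        # "12345689" -> bucket "12", field "345689"
    r.hset(f"{prefix}:{bucket}", field, value)

def get(prefix, key):
    return r.hget(f"{prefix}:{key[:2]}", key[2:])

put("myprefix", "12345689", "some value")
print(get("myprefix", "12345689"))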