How to calculate hash-max-ziplist-value properly? - redis

My question example: HMSET myhash field1 value1 field2 value2 and myhash only has these two fields.
The main question is how to calculate the size that hash-max-ziplist-value is compared against, so that my hash stays within the threshold and keeps the compact encoding that minimizes memory usage.
Thank "Kevin Christopher Henry" very much for his detail explanation, help and time. Due to my limited English, I will summarize Kevin's answer here. Please correct me if what I understand is wrong.
(1) To meet hash-max-ziplist-value, I need to calculate max(field1, value1, field2, value2). Let's assume value1 has the biggest size. Then I just need to make sure the size of value1 does not exceed hash-max-ziplist-value.
(2) To measure value1, I just need its size in bytes, because hash-max-ziplist-value is the number of bytes of the string value before any compression.
(3) To calculate the number of bytes for value1 there are many ways; one of them is as follows: first, convert value1 to UTF-8 if it is not already; second, check its length using the client language, because the length of a UTF-8 encoded string is its size in bytes (for instance: utf8.encode(value1).length).
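A minimal sketch of step (3), assuming Python as the client language (the original example, utf8.encode(value1).length, is from a different client); it checks every field name and value against the default 64-byte threshold:

HASH_MAX_ZIPLIST_VALUE = 64   # default threshold from redis.conf

def fits_ziplist(mapping, threshold=HASH_MAX_ZIPLIST_VALUE):
    # Redis stores binary strings; encoding to UTF-8 gives the size in bytes,
    # which can be larger than len(s) when non-ASCII characters are present.
    longest = max(len(s.encode("utf-8")) for pair in mapping.items() for s in pair)
    return longest <= threshold

print(fits_ziplist({"field1": "value1", "field2": "value2"}))  # True
print(fits_ziplist({"field1": "x" * 100}))                     # False: 100 bytes > 64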
Original Post
For example, HMSET myhash field1 value1 field2 value2
First, I want to clarify what hash-max-ziplist-entries really means.
Is the above example one entry or two entries because it has two fields?
What is hash-max-ziplist-value? Is that the size in bytes for
(a) MEMORY USAGE myhash
(b) the sum size of field1, value1, field2, value2
(c) the sum size of value1 and value2.
(d) max(value1, value2)?
(e) max(field1+value1, field2+value2)
I don't know how I can calculate the size of my hash values to match against hash-max-ziplist-value. Is hash-max-ziplist-value measured in bytes? Is that the UTF-8 encoded string length? Is there an existing command in redis for this calculation?
Thank you very much for your help.

These values are briefly described in the redis.conf file, as well as the memory optimization documentation.
# Hashes are encoded using a memory efficient data structure when they have a
# small number of entries, and the biggest entry does not exceed a given
# threshold. These thresholds can be configured using the following directives.
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
Using these default values as an example, if the hash has 512 or fewer entries, and each is 64 bytes or smaller, the hash will be encoded using a ziplist.
Although the documentation doesn't say exactly how the size of a hash entry is calculated, a look at the source code indicates that both the field name and the value must have a size less than or equal to the threshold. You should be able to determine the size by computing the length, in bytes, of the binary string.
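As a quick way to see the threshold in action, here is a hedged sketch assuming the redis-py client and a local Redis instance; note that recent Redis versions report the compact hash encoding as listpack rather than ziplist:

import redis

r = redis.Redis()   # assumes a local Redis instance with default settings

r.delete("myhash")
r.hset("myhash", mapping={"field1": "value1", "field2": "value2"})
print(r.object("encoding", "myhash"))   # b'ziplist' (b'listpack' on Redis 7+)

r.hset("myhash", "field3", "x" * 100)   # one value longer than 64 bytes
print(r.object("encoding", "myhash"))   # b'hashtable'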
To answer some of your specific questions:
Is the above example one entry or two entries because it has two fields?
Two.
What is hash-max-ziplist-value?
Using your terminology, this would be max(field1, value1, field2, value2).
Is that the UTF-8 encoded string length?
Redis works with binary strings. It's up to you (or your client) to decide what encoding to use.
Is there any easy way to calculate myhash value in bytes for hash-max-ziplist-value? Is there an existing command in redis for this calculation?
Not that I know of, but the length of the binary string representation of the value should be approximately right.

Related

Reducing the hash value of SHA 256 using modulus

I am trying to create an alternative to a bloom filter. I am using an array of bits that has the capacity to hold 100 billion bits (around 25 GB). Initially, all the bits will be set to zero. The steps I will take to create it are as follows:
I will take an input and generate a hash using SHA-256(due to less chances of collision) and perform modulus operation with 100 billion on the generated hash to obtain a value say N.
I will set the bit on the Nth position in the array to 1.
If the bit is already set on the Nth position, then I will add the input to a bucket specific for that bit.
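A hedged Python sketch of the steps described above; the bit-array size is shrunk here so the sketch can run locally, and the names are made up for illustration:

import hashlib

NUM_BITS = 1_000_000   # the question uses 100 billion positions; smaller here for a runnable sketch
bit_array = bytearray(NUM_BITS // 8)
buckets = {}           # overflow storage for inputs that land on an already-set bit

def add(item: bytes):
    digest = hashlib.sha256(item).digest()
    n = int.from_bytes(digest, "big") % NUM_BITS   # fold the 256-bit hash down to a position N
    byte_index, bit_mask = n // 8, 1 << (n % 8)
    if bit_array[byte_index] & bit_mask:
        buckets.setdefault(n, []).append(item)     # bit already set: keep the input in a bucket
    else:
        bit_array[byte_index] |= bit_mask

add(b"some input")
add(b"another input")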
How do I find the increase in the number of collisions as a result of performing modulus on the hash value ?
If I have 40 billion entries as the input, what are the chances of collisions using the proposed method?

Redis: Memory Optimization

I have around 256 keys. Against each key I have to store a large number of non-repetitive integers.
Following are the top 7 keys with the total number of values (entries) against each key. Each value is a unique, large integer.
Key No. of integers (values) in the list
Key 1 3394967
Key 2 3385081
Key 3 2172866
Key 4 2171779
Key 5 1776702
Key 6 1772936
Key 7 1748858
By default Redis consumes a lot of memory to store this data. I read that changing the following parameters can greatly reduce memory usage.
list-max-zipmap-entries 512
list-max-zipmap-value 64
Can anyone please explain these configuration directives to me (are 512 and 64 in bytes?) and what changes I can make to the above configuration settings in my case to reduce memory usage?
What should be kept in mind while selecting the values for entries and value in the above directives?
list-max-zipmap-entries 512
list-max-zipmap-value 64
If the number of entries in a List exceeds 512, or if the size of any given element in the list > 64 bytes, Redis will switch to a less-efficient in-memory storage structure. More specifically, below those thresholds it will use a ziplist, and above it will use a linked list.
So in your case, you would need to use an entries value of > 1748858 to see any change (and then only in keys 8-end). Also note that for Redis to re-encode them to the smaller object size you would also need to make the change in the config and restart Redis as it doesn't re-encode down automatically.
To verify whether a given key is using a ziplist vs. a linked list, use the OBJECT command.
For more details, see Redis Memory Optimization
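A hedged sketch of that check, assuming the redis-py client; the exact encoding names vary by Redis version (older versions report ziplist/linkedlist, 3.2+ reports quicklist, and 7.x reports listpack for small lists):

import redis

r = redis.Redis()   # assumes a local Redis instance

r.delete("Key1")
r.rpush("Key1", *range(100))
print(r.object("encoding", "Key1"))    # compact encoding while the list is small

r.rpush("Key1", *range(100, 10_000))   # grow well past the entries threshold
print(r.object("encoding", "Key1"))    # switches to the general-purpose encoding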
IMO you can't benefit from Redis's memory optimization here. In your case each list/set has around 3 million entries; to get the optimized encoding you would have to set list-max-zipmap-entries to around 3 million.
Redis doc says,
This operation is very fast for small values, but if you change the
setting in order to use specially encoded values for much larger
aggregate types the suggestion is to run some benchmark and test to
check the conversion time.
Per that note, encoding and decoding will take more time/CPU for such a huge number of entries, so it is better to run a benchmark test first and then decide.
As an alternative suggestion: if you only look up these sets to check whether a value is present or not, then you can change the structure to a bucket-style scheme.
For example, the value 123456 for key1 can be stored like this:
Sadd key1:bucket:123 456
123 = 123456/1000
456 = 123456%1000
Note this won't work if you want to retrieve all the values for key1; in that case you would be looping through the 1000 sets. Similarly, for the total size of key1 you would have to loop through the 1000 keys.
But the memory usage will be reduced roughly tenfold.
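A hedged Python sketch of that bucketing idea, assuming the redis-py client (the key name and the bucket size of 1000 come from the example above):

import redis

r = redis.Redis()   # assumes a local Redis instance
BUCKET_SIZE = 1000  # many small sets per key instead of one huge set

def bucketed_add(key, value):
    # 123456 -> bucket 123, member 456 (mirrors "Sadd key1:bucket:123 456")
    r.sadd(f"{key}:bucket:{value // BUCKET_SIZE}", value % BUCKET_SIZE)

def bucketed_contains(key, value):
    return r.sismember(f"{key}:bucket:{value // BUCKET_SIZE}", value % BUCKET_SIZE)

bucketed_add("key1", 123456)
print(bucketed_contains("key1", 123456))  # True
print(bucketed_contains("key1", 123457))  # False

Keeping each bucket small also keeps it within the compact set encoding, which is where the memory saving comes from.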

SQL Server : Taking Numerical Characters and Hashing them with a max length of 20 characters

Hello, I was trying to find a good way to hash a set of numerical values so that the output would be under 20 characters, positive, and unique. Anyone have any suggestions?
For hashing in general, I'd use the HASHBYTES function. You can then convert the binary data to a string and just pick the first 20 characters; that should still be unique enough.
To get around HASHBYTES limitations (8000 bytes of input, for instance), you can hash incrementally, e.g. for each value concatenate the previous hash with the value to be added and hash that again. This makes the result depend on the values and their order, and unless you append close to 8000 bytes in one value it will not cause data truncation for the hashing.
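As an illustration only, here is a Python analogue of that approach (hashlib standing in for HASHBYTES); it shows the truncation to 20 characters and the incremental chaining, with function names made up for the sketch:

import hashlib

def short_hash(value):
    # Analogue of HASHBYTES plus hex conversion: keep the first 20 hex characters.
    # 20 hex characters = 80 bits, so collisions are unlikely but not impossible.
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:20]

def chained_hash(values):
    # Incremental hashing: fold each value into the previous digest,
    # so the result depends on the values and their order.
    digest = ""
    for v in values:
        digest = short_hash(digest + str(v))
    return digest

print(short_hash(1234567890))
print(chained_hash([123, 456, 789]))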

Parallelizable hashing algorithm where size and order of sub-strings is irrelevant

EDIT
Here is the problem I am trying to solve:
I have a string broken up into multiple parts. These parts are not of equal or predictable length. Each part will have a hash value. When I concatenate parts I want to be able to use the hash values from each part to quickly get the hash value for the parts together. In addition, the hash generated by putting the parts together must match the hash generated if the string were hashed as a whole.
Basically I want a hashing algorithm where the parts of the data being hashed can be hashed in parallel, and I do not want the order or length of the pieces to matter. I am not breaking up the string, but rather receiving it in unpredictable chunks in an unpredictable order.
I am willing to ensure an elevated collision rate, so long as it is not too elevated. I am also ok with a slightly slower algorithm as it is hardly noticeable on small strings, and done in parallel for large strings.
I am familiar with a few hashing algorithms, however I currently have a use-case for a hash algorithm with the property that the sum of two hashes is equal to a hash of the sum of the two items.
Requirements/givens
This algorithm will be hashing byte-strings with length of at least 1 byte
hash("ab") = hash('a') + hash('b')
Collisions between strings with the same characters in a different order are OK
Generated hash should be an integer of native size (usually 32/64 bits)
String may contain any byte value from 0-255 (length is known, not \0 terminated)
The ascii alpha-numeric characters will be by far the most used
A disproportionate number of strings will be 1-8 ASCII characters
A very tiny percentage of the strings will actually contain bytes with values at or above 127
If this is a type of algorithm that has terminology associated with it, I would love to know that terminology. If I knew what a proper term/name for this type of hashing algorithm was it would be much easier to google.
I am thinking the simplest way to achieve this is:
Any byte's hash should be its value, normalized to <128 (if >128 subtract 128)
To get the hash of a string you normalize each byte to <128 and add it to the key
Depending on key size I may need to limit how many characters are used to hash to avoid overflow
I don't see anything wrong with just adding each (unsigned) byte value to create a hash which is just the sum of all the characters. There is nothing wrong with having an overflow: even if you reach the 32/64 bit limit (and it would have to be a VERY/EXTREMELY long string to do this) the overflow into a negative number won't matter in 2's complement arithmetic. As this is a linear process it doesn't matter how you split your string.
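A small Python sketch of that additive hash, using an explicit mask to emulate 64-bit unsigned overflow (Python integers do not wrap on their own):

MASK = (1 << 64) - 1   # emulate a 64-bit unsigned register

def add_hash(data: bytes) -> int:
    # Sum of all byte values; neither the order nor the chunking of the input matters.
    return sum(data) & MASK

a, b = b"hello ", b"world"
assert add_hash(a + b) == (add_hash(a) + add_hash(b)) & MASK   # hash(ab) == hash(a) + hash(b)
assert add_hash(b"ab") == add_hash(b"ba")                      # order-insensitive, so these collide
print(add_hash(a + b))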

Creating unique hash code (string) in SQL Server from a combination of two or more columns (of different data types)

I would like to create unique string columns (32 characters in length) from a combination of columns with different data types in SQL Server 2005.
I have found a solution elsewhere on StackOverflow:
SELECT SUBSTRING(master.dbo.fn_varbintohexstr(HashBytes('MD5', 'HelloWorld')), 3, 32)
The answer thread is here
With HASHBYTES you can create SHA1 hashes, which are 20 bytes, and MD5 hashes, which are 16 bytes. There are various combination algorithms that can produce arbitrary-length material by repeated hash operations, like the PRF of TLS (see RFC 2246).
This should be enough to get you started. You need to define what '32 characters' means, since hash functions produce bytes, not characters. Also, you need to internalize that no algorithm can possibly produce fixed-length hashes without collisions (guaranteed 'unique'). Although at a 32-byte length (assuming that by 'characters' you mean bytes) the theoretical 50% collision probability is only reached at about 4 x 10^38 hashed elements (see the birthday problem), that assumes a perfect distribution for your 32-byte output hash function, which you're not going to achieve.
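For reference, the standard birthday-problem approximation behind that figure, assuming a uniformly distributed 256-bit (32-byte) output:

n_{50\%} \approx 1.1774\,\sqrt{N} = 1.1774\,\sqrt{2^{256}} = 1.1774 \cdot 2^{128} \approx 4 \times 10^{38}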