Storing integers in a redis ordered set? - redis

I have a system which deals with keys that have been turned into unsigned long integers (by packing short sequences into byte strings). I want to try storing these in Redis, and I want to do it in the best way possible. My concern is mainly memory efficiency.
From playing with the online REPL I notice that the following two commands are identical:
zadd myset 1.0 "123"
zadd myset 1.0 123
This means that even if I know I want to store an integer, it has to be set as a string. I notice from the documentation that keys are just stored as char*s and that commands like SETBIT indicate that Redis is not averse to treating strings as bytestrings in the client. This hints at a slightly more efficient way of storing unsigned longs than as their string representation.
What is the best way to store unsigned longs in sorted sets?

Thanks to Andre for his answer. Here are my findings.
Storing ints directly
Redis keys must be strings. If you want to pass an integer, it has to be passed as some kind of string. For small, well-defined sets of values, Redis will parse the string into an integer, if it is one. My guess is that it uses this int to tailor its hash function (or even statically dimension a hash table based on the value). This works for small values (the examples being the defaults of 64 entries with values of up to 512). I will test for larger values during my investigation.
http://redis.io/topics/memory-optimization
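For what it's worth, a quick way to check whether Redis has chosen the compact encoding for a small sorted set is OBJECT ENCODING. A minimal sketch, assuming a recent redis-py client and a local server (the key and member names are just examples):

import redis

r = redis.Redis()
r.delete("myset")
r.zadd("myset", {"123": 1.0})          # integer-looking member, still passed as a string
print(r.object("encoding", "myset"))   # small sets report the compact encoding,
                                       # typically b'ziplist' (b'listpack' on newer servers)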
Storing as strings
The alternative is squashing the integer so it looks like a string.
It looks like it is possible to use any byte string as a key.
For my application's case it actually didn't make much difference whether I stored the strings or the integers. I imagine that the structure in Redis undergoes some kind of alignment, so there may be some pre-wasted bytes regardless. The value is hashed in any case.
I was using Python for my testing, so I was able to create the values using struct.pack. Long longs weigh in at 8 bytes, which is quite large. Given the distribution of my integer values, I discovered that it could actually be advantageous to store the strings, especially when coded in hex.
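To make the trade-off concrete, here is a rough size comparison (plain Python, no Redis involved) of the three representations discussed: decimal string, hex string, and a fixed 8-byte pack. The example value is arbitrary.

from struct import pack

n = 281474976710655             # 2**48 - 1, an arbitrary example value
print(len(str(n)))              # 15 bytes as a decimal string
print(len(format(n, "x")))      # 12 bytes as a hex string
print(len(pack("!Q", n)))       # always 8 bytes as a packed unsigned long long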
As redis strings are "Pascal-style":
struct sdshdr {
    long len;
    long free;
    char buf[];
};
and given that we can store anything in there, I did a bit of extra Python to code the type into the shortest possible type:
from struct import pack

def do_pack(prefix, number):
    """
    Pack the number into the shortest possible string, with a prefix char.
    (prefix must be a single byte/character, e.g. b"z".)
    """
    # unsigned char (1 byte)
    if number < (1 << 8 * 1):
        return pack("!cB", prefix, number)
    # unsigned short (2 bytes)
    elif number < (1 << 8 * 2):
        return pack("!cH", prefix, number)
    # unsigned int (4 bytes)
    elif number < (1 << 8 * 4):
        return pack("!cI", prefix, number)
    # unsigned long long (8 bytes)
    elif number < (1 << 8 * 8):
        return pack("!cQ", prefix, number)
    raise ValueError("number does not fit in 64 bits")
This appears to make an insignificant saving (or none at all), probably due to struct padding in Redis. It also drives Python CPU usage through the roof, making it somewhat unattractive.
The data I was working with was 200,000 zsets of consecutive integer => (weight, random integer) × 100, plus some inverted indexes (based on random data). dbsize yields 1,200,001 keys.
Final memory use of the server: 1.28 GB RAM, 1.32 GB virtual. Various tweaks made a difference of no more than 10 megabytes either way.
So my conclusion:
Don't bother encoding into fixed-size data types. Just store the integer as a string, in hex if you want. It won't make all that much difference.
References:
http://docs.python.org/library/struct.html
http://redis.io/topics/internals-sds

I'm not sure about this answer; it's more of a suggestion than anything else. I'd have to give it a try and see if it works.
As far as I can tell, Redis only supports UTF-8 strings.
I would suggest grabbing the bit representation of your long integer and padding it accordingly to fill up the nearest byte. Encode each set of 8 bits as a UTF-8 character (ending up with an 8-character UTF-8 string) and store that in Redis. The fact that they're unsigned means that you don't care about that first bit, but if you did, you could add a flag to the string.
Upon retrieving the data, you have to remember to pad each character back out to 8 bits, as UTF-8 will use fewer bytes for the representation if the character can be stored with fewer bytes.
End result is that you store a maximum of 8 x 8 byte characters instead of (possibly) a maximum of 64 x 8 byte characters.
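If I'm reading the suggestion correctly, it amounts to something like the following Python sketch (the function names are mine, and the "8 bits per character" reading is an assumption):

def pack_ulong(n):
    # pad the bit representation out to whole bytes, then map each
    # 8-bit group to a character and encode the result as UTF-8
    nbytes = max(1, (n.bit_length() + 7) // 8)
    return "".join(chr(b) for b in n.to_bytes(nbytes, "big")).encode("utf-8")

def unpack_ulong(data):
    # decode back to characters, then treat each code point as 8 bits again
    chars = data.decode("utf-8")
    return int.from_bytes(bytes(ord(c) for c in chars), "big")

assert unpack_ulong(pack_ulong(123456789)) == 123456789

Note that code points above 127 occupy two bytes once UTF-8 encoded, which is the padding/size issue mentioned above.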

Related

Is there a way to wrap an integer value into an integer range [min,max] without division or modulo?

My situation is the same as in this question, but instead of using floats, I'm using integers. I feel like this difference might be significant because there might be some bit twiddling hacks that can be used to wrap the value into the range.
One use case for wrapping a value into a range is assigning a value to a bucket in a hash table, like this (range of [0, len(buckets))):
bucket = buckets[hash(key) % len(buckets)]
If we are continuously adding values to a hash table, it might be worthwhile to optimize this operation.
Since modulo isn't actually needed, there is a shortcut to mapping a hash into a new range. See Lemire's FastRange.
// map a **full-width 32-bit** hash to [0,p) without division.
uint32_t fastrange32(uint32_t word, uint32_t p) {
    return (uint32_t)(((uint64_t)word * (uint64_t)p) >> 32);
}
Simply take the high half of a 32x32 -> 64-bit multiply of word * p. If all your word values are smallish, all your results will be 0, so this is only good for word values that are fairly uniformly distributed over the full 32-bit range. Hash functions are normally fine.
I guess that is slightly biased, but that can be worked around; see https://github.com/apple/swift/pull/39143
Often the number of buckets is kept as a power of 2. This allows unwanted bits to just get masked-off.
(e.g. If there are 16 buckets then bucket_id = hash & 0xF.)
The downside is that it requires the number of buckets to double each time the table is resized.
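Both mappings are easy to try out in Python (the function names are mine; the 32-bit mask stands in for C's fixed-width arithmetic):

def fastrange32(word, p):
    # high half of a 32x32 -> 64-bit multiply; assumes word is a
    # full-width 32-bit hash value
    return ((word & 0xFFFFFFFF) * p) >> 32

def bucket_pow2(word, num_buckets):
    # num_buckets must be a power of two; unwanted bits are masked off
    return word & (num_buckets - 1)

print(fastrange32(0xDEADBEEF, 10))    # a bucket in [0, 10)
print(bucket_pow2(0xDEADBEEF, 16))    # hash & 0xF -> a bucket in [0, 16)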

How to manipulate bits in Smalltalk?

I am currently working on a file compressor based on Huffman decoding. So I have a decoding tree, and I have to encode this tree to an output file by following certain criteria:
"for each leaf, write out a 0 bit, followed by the 8 bits of
the corresponding character. Write out the bits in the order bit 7, bit 6, . . ., bit 0, that is high bit first. As a special case, if the byte is 0, write out bit 8, which will be a 0 for a byte value of 0, and 1 for a byte value of 256 (the EOF marker)." For an internal node, just write a bit 1.
So what I plan to do is to create a bit array and add to it the corresponding bits in the specified format. The problem is that I don't know how to convert a number to binary in Smalltalk.
For example, if I want to encode the first leaf, I would want to do something like 01101011, i.e., 0 followed by the bit representation of k, and then add every bit one by one into the array.
I don't know which dialect you are using exactly, but generally, you can access the bits of an Integer. They are modelled as if the representation were two's complement, with an infinite sequence of bits.
2 is ....0000000000010
1 is ....0000000000001
0 is ....0000000000000 with infinitely many 0 on the left
-1 is ....1111111111111 with infinitely many 1 on the left
-2 is ....1111111111110
This is also true for LargeIntegers: even though they are generally implemented as sign-magnitude (the class encodes the sign), two's complement is emulated.
Then you can operate with bitAnd:, bitOr:, bitXor:, bitInvert, bitShift:, and in some flavours bitAt:put:.
You can access the bits with (2 bitAt: index), where the index starts at 1 for the least significant bit and grows towards the more significant bits. If it's missing, implement it with bitAnd: and bitShift:.
For positive integers, you can ask for the rank of the high bit (2 highBit).
All these operations create a new integer (there's no in-place modification possible).
Conceptually, a ByteArray is a collection of unsigned 8-bit integers (between 0 and 255), so you can implement a bit array with them (if it does not already exist in the dialect). Or you can use an Integer (but you won't be able to control the size, which will be infinite, nor modify it in place; operations will cost a copy).
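The "bit array on top of a byte array" idea is dialect-independent; here is a minimal sketch of the same structure in Python, with a bytearray playing the role of the ByteArray, just to illustrate the indexing arithmetic (unlike the Integer operations above, it modifies in place):

class BitArray:
    def __init__(self, nbits):
        self.bits = bytearray((nbits + 7) // 8)   # 8 bits per byte, rounded up

    def at_put(self, index, value):
        byte, bit = divmod(index, 8)              # index starts at 0 here
        if value:
            self.bits[byte] |= (1 << bit)
        else:
            self.bits[byte] &= ~(1 << bit)

    def at(self, index):
        byte, bit = divmod(index, 8)
        return (self.bits[byte] >> bit) & 1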

Dealing with Int64 value with Booksleeve

I have a question about Marc Gravell's Booksleeve library.
I tried to understand how BookSleeve deals with Int64 values (I actually have billions of long values in Redis).
I used reflection to understand the Set long value overrides.
// BookSleeve.RedisMessage
protected static void WriteUnified(Stream stream, long value)
{
    if (value >= 0L && value <= 99L)
    {
        int i = (int)value;
        if (i <= 9)
        {
            stream.Write(RedisMessage.oneByteIntegerPrefix, 0, RedisMessage.oneByteIntegerPrefix.Length);
            stream.WriteByte((byte)(48 + i));
        }
        else
        {
            stream.Write(RedisMessage.twoByteIntegerPrefix, 0, RedisMessage.twoByteIntegerPrefix.Length);
            stream.WriteByte((byte)(48 + i / 10));
            stream.WriteByte((byte)(48 + i % 10));
        }
    }
    else
    {
        byte[] bytes = Encoding.ASCII.GetBytes(value.ToString());
        stream.WriteByte(36);
        RedisMessage.WriteRaw(stream, (long)bytes.Length);
        stream.Write(bytes, 0, bytes.Length);
    }
    stream.Write(RedisMessage.Crlf, 0, 2);
}
I don't understand why, for an Int64 with more than two digits, the long is encoded in ASCII.
Why not use byte[]? I know that I can use the byte[] overrides to do this, but I just want to understand this implementation so that I can optimize mine. There may be a relationship with the way Redis stores the value.
Thanks in advance, Marc :)
P.S.: I'm still very enthusiastic about your next major version, so that I can use long values as keys instead of strings.
It writes it in ASCII because that is what the redis protocol demands.
If you look carefully, it is always encoded as ASCII - but for the most common cases (0-9, 10-99) I've special-cased it, as these are very simple results:
x => $1\r\nX\r\n
xy => $2\r\nXY\r\n
where x and y are the first two digits of a number in the range 0-99, and X and Y are those digits (as numbers) offset by 48 ('0') - so decimal 17 becomes the byte sequence (in hex):
24-32-0D-0A-31-37-0D-0A
Of course, that could also be achieved simply by writing each digit sequentially, offsetting each digit value by 48 ('0'), and handling the negative sign - I guess the answer there is simply "because I coded it the simple but obviously correct way". Consider the value -123 - which is encoded as $4\r\n-123\r\n (hey, don't look at me - I didn't design the protocol). It is slightly awkward because it needs to calculate the buffer length first, then write that buffer length, then write the value - remembering to write in the order 100s, 10s, 1s (which is much harder than writing the other way around).
Perfectly willing to revisit it - simply: it works.
Of course, it becomes trivial if you have a scratch buffer available - you just write it in the simple order, then reverse the portion of the scratch buffer. I'll check to see if one is available (and if not, it wouldn't be unreasonable to add one).
I should also clarify: there is also the integer type, which would encode -123 as :-123\r\n - however, from memory there are a lot of places this simply does not work.
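For reference, the bulk-string framing described above is easy to reproduce outside the library; a rough Python sketch of the same wire format (not BookSleeve code, just the redis protocol shapes described in this answer):

def encode_bulk(value):
    # bulk-string form: $<length>\r\n<ASCII digits>\r\n (the '-' sign counts toward the length)
    digits = str(value).encode("ascii")
    return b"$" + str(len(digits)).encode("ascii") + b"\r\n" + digits + b"\r\n"

def encode_integer(value):
    # the integer type mentioned above: :<value>\r\n
    return b":" + str(value).encode("ascii") + b"\r\n"

print(encode_bulk(17))       # b'$2\r\n17\r\n' -- hex 24-32-0D-0A-31-37-0D-0A
print(encode_bulk(-123))     # b'$4\r\n-123\r\n'
print(encode_integer(-123))  # b':-123\r\n'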

How do I limit BitConverter.GetBytes() to return only a certain amount of bytes using VB.NET?

I do:
Dim BytArr() As Byte = BitConverter.GetBytes(1234)
Since the value is a 32-bit Integer by default, it returns 4 byte elements.
I want to be able to control it to return only like two bytes. Maybe only three bytes. Are there any built-in functions to control it?
I don't want to rely on using shifting >> 8 >> 16 >> 24 >> 32, etc..
I also don't want to rely on type casting the data in GetBytes() to a specific datatype.
It is not that GetBytes defaults to 32 bits; it is that GetBytes returns an array of the size required to hold the data type. If you pass a Long then you will get 8 elements in your array.
The best way to control this is indeed casting the data you pass in. Otherwise you could truncate some of the number.
That being said, you could do something like this:
Dim BytArr() As Byte = BitConverter.GetBytes(1234)
Array.Resize(BytArr, 2) ' Array.Resize is a Sub that takes the array ByRef, so it can't be used inline
But if the value you passed in exceeded what could be stored in 2 bytes (in this case) then you will have some very broken code.
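As a side note, the same truncation hazard shows up in other languages too; Python's int.to_bytes, for instance, refuses outright when the value doesn't fit (shown here purely as an illustration):

print((1234).to_bytes(2, "little"))   # b'\xd2\x04' -- 1234 fits in two bytes
try:
    (123456).to_bytes(2, "little")    # does not fit in two bytes
except OverflowError as e:
    print(e)                          # int too big to convert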

Is there a practical limit to the size of bit masks?

There's a common way to store multiple values in one variable, by using a bitmask. For example, if a user has read, write and execute privileges on an item, that can be converted to a single number by saying read = 4 (2^2), write = 2 (2^1), execute = 1 (2^0) and then add them together to get 7.
I use this technique in several web applications, where I'd usually store the variable into a field and give it a type of MEDIUMINT or whatever, depending on the number of different values.
What I'm interested in, is whether or not there is a practical limit to the number of values you can store like this? For example, if the number was over 64, you couldn't use (64 bit) integers any more. If this was the case, what would you use? How would it affect your program logic (ie: could you still use bitwise comparisons)?
I know that once you start getting really large sets of values, a different method would be the optimal solution, but I'm interested in the boundaries of this method.
Off the top of my head, I'd write a set_bit and get_bit function that could take an array of bytes and a bit offset in the array, and use some bit-twiddling to set/get the appropriate bit in the array. Something like this (in C, but hopefully you get the idea):
// sets the n-th bit in |bytes|. num_bytes is the number of bytes in the array
// result is 0 on success, non-zero on failure (offset out-of-bounds)
int set_bit(char* bytes, unsigned long num_bytes, unsigned long offset)
{
    // make sure offset is valid (offset is unsigned, so only the upper bound needs checking)
    if(offset >= (num_bytes << 3)) { return -1; }
    // set the right bit
    bytes[offset >> 3] |= (1 << (offset & 0x7));
    return 0; // success
}

// gets the n-th bit in |bytes|. num_bytes is the number of bytes in the array
// returns (-1) on error, 0 if bit is "off", positive number if "on"
int get_bit(char* bytes, unsigned long num_bytes, unsigned long offset)
{
    // make sure offset is valid
    if(offset >= (num_bytes << 3)) { return -1; }
    // get the right bit
    return (bytes[offset >> 3] & (1 << (offset & 0x7)));
}
I've used bit masks in filesystem code where the bit mask is many times bigger than a machine word. Think of it like an "array of booleans" (journalling masks in flash memory, if you want to know).
Many compilers know how to do this for you. Add a bit of OO code to have types that operate sensibly, and then your code starts looking like its intent, not some bit-banging.
My 2 cents.
With a 64-bit integer, you can store values up to 2^64-1, and 64 is only 2^6. So yes, there is a limit, but if you need more than 64 bits' worth of flags, I'd be very interested to know what they were all doing :)
How many states do you need to potentially think about? If you have 64 potential states, the number of combinations they can exist in is the full size of a 64-bit integer.
If you need to worry about 128 flags, then a pair of bit vectors would suffice (2^64 * 2).
Addition: in Programming Pearls, there is an extended discussion of using a bit array of length 10^7, implemented in integers (for holding used 800 numbers) - it's very fast, and very appropriate for the task described in that chapter.
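For what it's worth, in a language with arbitrary-precision integers the 64-bit boundary simply disappears and the bitwise logic stays the same; a quick Python illustration with a flag well past bit 63 (the flag number is arbitrary):

flags = 0
flags |= 1 << 127                  # set flag 127, beyond any 64-bit word
print(bool(flags & (1 << 127)))    # True -- the test is still a plain bitwise AND
flags &= ~(1 << 127)               # clear it again
print(flags)                       # 0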
Some languages (I believe Perl does, but I'm not sure) permit bitwise arithmetic on strings, giving you a much greater effective range ((strlen * 8-bit chars) combinations).
However, I wouldn't use a single value for superimposition of more than one /type/ of data. The basic r/w/x triplet of 3-bit ints would probably be the upper "practical" limit, not for space-efficiency reasons, but for practical development reasons.
(PHP uses this system to control its error messages, and I have already found that it's a bit over-the-top when you have to define values where PHP's constants are not available and you have to generate the integer by hand; and to be honest, if chmod didn't support the 'ugo+rwx' style syntax I'd never want to use it, because I can never remember the magic numbers.)
The instant you have to crack open a constants table to debug code you know you've gone too far.
Old thread, but it's worth mentioning that there are cases requiring bloated bit masks, e.g., molecular fingerprints, which are often generated as 1024-bit arrays that we have packed into 32 bigint fields (SQL Server does not support UInt32). Bitwise operations work fine - until your table starts to grow and you realize the sluggishness of separate function calls. The binary data type would work, were it not for T-SQL's ban on bitwise operators having two binary operands.
For example, .NET uses an array of integers as the internal storage for its BitArray class.
Practically, there's no other way around it.
That being said, in SQL you will need more than one column (or use BLOBs) to store all the states.
You tagged this question SQL, so I think you need to consult with the documentation for your database to find the size of an integer. Then subtract one bit for the sign, just to be safe.
Edit: Your comment says you're using MySQL. The documentation for MySQL 5.0 Numeric Types states that the maximum size of a NUMERIC is 64 or 65 digits. That's 212 bits for 64 digits.
Remember that your language of choice has to be able to work with those digits, so you may be limited to a 64-bit integer anyway.