How to manipulate bits in Smalltalk? - smalltalk

I am currently working on a file compressor based on Huffman decoding. So I have a decoding tree like so:
and I have to encode this tree on an output file by following a certain criteria:
"for each leaf, write out a 0 bit, followed by the 8 bits of
the corresponding character. Write out the bits in the order bit 7, bit 6, . . ., bit 0, that is high bit first. As a special case, if the byte is 0, write out bit 8, which will be a 0 for a byte value of 0, and 1 for a byte value of 256 (the EOF marker)." For an internal node, just write a bit 1.
So what I plan to do is to create a bit array and add to it the corresponding bits in the specified format. The problem is that I don't know how to convert a number to binary in smalltalk.
For example, if I want to encode the first leaf, I would want to do something like 01101011 i.e 0 followed by the bit representation of k and then add every bit one by one into the array.

I don't know which dialect you are using exactly, but generally, you can access the bits of Integer. They are modelled as if the representation was in two-complement, with an infinite sequence of bits.
2 is ....0000000000010
1 is ....0000000000001
0 is ....0000000000000 with infinitely many 0 on the left
-1 is ....1111111111111 with infinitely many 1 on the left
-2 is ....1111111111110
This is also true for LargeIntegers, even though they are generally implemented as sign magnitude (the class encodes the sign), two-complement will be emulated.
Then you can operate with bitAnd: bitOr: bitXor: bitInvert bitShift:, and in some flavours bitAt:put:
You can access the bits with (2 bitAt: index) where the index starts at 1 for least significant bit, or grows higher. If it's missing, implement it with bitAnd: and bitShift:...
For positive, you can ask for the rank of high bit (2 highBit).
All these operations should create a new integer (there's no in place modification possible).
Conceptually, a ByteArray is a collection of unsigned integers on 8 bits (between 0 and 255), so you can implement a bit Array with them (if it does not already exist in the dialect). Or you can use an Integer (but won't be able to control size which will be infinite, nor in place mofifications, operations will cost a copy).

Related

Size of a serialized complete binary tree

I'm tryin to work out the size of a serialized binary tree having N nodes (also mentioned in Leetcode). This is how I calculate the size:
If we assume the storage required to store values be V bits for each node, then the storage needed to store N nodes will be N.V. We also need to store NULL for the leaves; since there are exactly Ceiling(N/2) leaves in a complete tree, and assuming only one bit is enough to represent NULL, then an additional of 2 x Ceiling(N/2) bits will be required. 2 x Ceiling(N/2) translates to N+1 as in a complete tree N is always an odd number.
So, N.V + (N+1) bit is required in total.
However, I can see that in Leetcode and some other places (e.g. this), it's calculated as N.V + 2N.
What am I missing?
What am I missing?
The two references you provided (LeetCode and blog article) deal with arbitrary binary trees, not necessarily complete. So let me first deal with arbitrary binary trees:
Although a NULL reference could be represented with one bit (e.g. with value 0), you also need to store the fact that a reference is not a NULL (value 1). You cannot just omit the bit, as then the next bit (belonging to a node value) could be misinterpreted as indicating a NULL reference. So you should not only count that bit for each NULL reference, but count it in for all branches.
The serialised format would for each node represent:
The node's value (𝑉 bits)
The fact whether or not its left child is a NULL (1 bit)
The fact whether or not its right child is a NULL (1 bit)
Example:
Let 𝑉 be 4
Tree to serialise:
10
/ \
7 13
\
14
Serialisation process (level order):
node value
has left
has right
serialised
without spacing
10
yes
yes
1010 1 1
101011
7
no
no
0111 0 0
011100
13
no
yes
1101 0 1
110101
14
no
no
1110 0 0
111000
Complete:
101011011100110101111000
If we were only to store the 0 when there is a NULL, then we would get this:
101001110011010111000
^
But now the bit at the indicated position is ambiguous, because that bit could be interpreted as representing a NULL reference, but actually it is the first of 𝑉 bits 0111 representing the value 7.
It is however possible to reduce the serialised string with 2 bits: the very last 2 bits will always be 0 in a tree traversal that is guaranteed to end with a leaf. This is for example the case for level-order and pre-order traversals. So then you could just omit those 2 bits.
The case for complete binary trees
First of all about the definition of a complete binary tree. You write:
in a complete tree N is always an odd number.
I suppose then your definition of a complete tree is what at Wikipedia is called a perfect tree. We can however also look at (nearly) complete binary trees (and then 𝑁 is not necessarily odd).
For complete binary trees the case is simpler, as a level order traversal of a complete binary tree will never include NULLs, i.e. there are no "gaps" in such a traversal.
So you can just serialise the node's values in that order, giving each 𝑉 bits. This is actually the array representation that is used for binary heaps:
The parent / child relationship is defined implicitly by the elements' indices in the array.
If serialisation happens in a string data type that implicitly has a length attribute, then that's it. If there is no such meta data, then you need to prefix the value of 𝑁 in the serialisation, reserving a predefined number of bits for it. Alternatively, if there is a special value of 𝑉 bits that will never occur as actual node value, you could append it as a terminator (much like \0 in C-strings).

Structure Packing

I'm currently learning C# and my first project (as a learning experiment) is to create a DBF reader. I'm having some difficulty understanding "packing" according to this: http://www.developerfusion.com/pix/articleimages/dec05/structs1.jpg
If I specified a packing of 2, wouldn't all structure elements begin on a 2-byte boundary, and if I specified a packing of 4, wouldn't all structure elements begin on a 4-byte boundary, and also consume a minimum of 4 bytes each?
For instance, a byte element would be placed on a 4 byte boundary, and the element following it (in a sequential layout) would be located on the next 4-byte boundary (losing 3 bytes to padding)?
In the image shown, in the "pack=4" it shows a byte that is on a 2 byte boundary, following a short.
If I understand the picture correctly, pack equal to n means that one variable cannot be stored "between" two packs of lengths n. In other words, bytes which compose a variable cannot cross one pack's boundary. This is only true if the size of a variable is less or equal to the size of a pack.
Let's take Pack = 4 as an example. Here, we can safely store a byte and a short in one pack, because they require 3 bytes of memory together. But since there is only one byte in the pack left, it requires one byte of padding to be able to store an int into the data structure, because what's left in the pack is too little to store the whole int.
I hope the explanation makes sense.
Looking at the picture again, I think it would be better if all data were aligned to the same side of a pack, either to bottom or top. This would make it clearer what's going on.

Advice for bit level manipulation

I'm currently working on a project that involves a lot of bit level manipulation of data such as comparison, masking and shifting. Essentially I need to search through chunks of bitstreams between 8kbytes - 32kbytes long for bit patterns between 20 - 40bytes long.
Does anyone know of general resources for optimizing for such operations in CUDA?
There has been a least a couple of questions on SO on how to do text searches with CUDA. That is, finding instances of short byte-strings in long byte-strings. That is similar to what you want to do. That is, a byte-string search is much like a bit-string search where the number of bits in the byte-string can only be a multiple of 8, and the algorithm only checks for matches every 8 bits. Search on SO for CUDA string searching or matching, and see if you can find them.
I don't know of any general resources for this, but I would try something like this:
Start by preparing 8 versions of each of the search bit-strings. Each bit-string shifted a different number of bits. Also prepare start and end masks:
start
01111111
00111111
...
00000001
end
10000000
11000000
...
11111110
Then, essentially, perform byte-string searches with the different bit-strings and masks.
If you're using a device with compute capability >= 2.0, store the shifted bit-strings in global memory. The start and end masks can probably just be constants in your program.
Then, for each byte position, launch 8 threads that each checks a different version of the 8 shifted bit-strings against the long bit-string (which you now treat like a byte-string). In each block, launch enough threads to check, for instance, 32 bytes, so that the total number of threads per block becomes 32 * 8 = 256. The L1 cache should be able to hold the shifted bit-strings for each block, so that you get good performance.

Storing integers in a redis ordered set?

I have a system which deals with keys that have been turned into unsigned long integers (by packing short sequences into byte strings). I want to try storing these in Redis, and I want to do it in the best way possible. My concern is mainly memory efficiency.
From playing with the online REPL I notice that the two following are identical
zadd myset 1.0 "123"
zadd myset 1.0 123
This means that even if I know I want to store an integer, it has to be set as a string. I notice from the documentation that keys are just stored as char*s and that commands like SETBIT indicate that Redis is not averse to treating strings as bytestrings in the client. This hints at a slightly more efficient way of storing unsigned longs than as their string representation.
What is the best way to store unsigned longs in sorted sets?
Thanks to Andre for his answer. Here are my findings.
Storing ints directly
Redis keys must be strings. If you want to pass an integer, it has to be some kind of string. For small, well-defined sets of values, Redis will parse the string into an integer, if it is one. My guess is that it will use this int to tailor its hash function (or even statically dimension a hash table based on the value). This works for small values (examples being the default values of 64 entries of a value of up to 512). I will test for larger values during my investigation.
http://redis.io/topics/memory-optimization
Storing as strings
The alternative is squashing the integer so it looks like a string.
It looks like it is possible to use any byte string as a key.
For my application's case it actually didn't make that much difference storing the strings or the integers. I imagine that the structure in Redis undergoes some kind of alignment anyway, so there may be some pre-wasted bytes anyway. The value is hashed in any case.
Using Python for my testing, so I was able to create the values using the struct.pack. long longs weigh in at 8 bytes, which is quite large. Given the distribution of integer values, I discovered that it could actually be advantageous to store the strings, especially when coded in hex.
As redis strings are "Pascal-style":
struct sdshdr {
long len;
long free;
char buf[];
};
and given that we can store anything in there, I did a bit of extra Python to code the type into the shortest possible type:
def do_pack(prefix, number):
"""
Pack the number into the best possible string. With a prefix char.
"""
# char
if number < (1 << 8*1):
return pack("!cB", prefix, number)
# ushort
elif number < (1 << 8*2):
return pack("!cH", prefix, number)
# uint
elif number < (1 << 8*4):
return pack("!cI", prefix, number)
# ulonglong
elif number < (1 << 8*8):
return pack("!cQ", prefix, number)
This appears to make an insignificant saving (or none at all). Probably due to struct padding in Redis. This also drives Python CPU through the roof, making it somewhat unattractive.
The data I was working with was 200000 zsets of consecutive integer => (weight, random integer) × 100, plus some inverted index (based on random data). dbsize yields 1,200,001 keys.
Final memory use of server: 1.28 GB RAM, 1.32 Virtual. Various tweaks made a difference of no more than 10 megabytes either way.
So my conclusion:
Don't bother encoding into fixed-size data types. Just store the integer as a string, in hex if you want. It won't make all that much difference.
References:
http://docs.python.org/library/struct.html
http://redis.io/topics/internals-sds
I'm not sure of this answer, it's more of a suggestion than anything else. I'd have to give it a try and see if it works.
As far as I can tell, Redis only supports UTF-8 strings.
I would suggest grabbing a bit representation of your long integer and pad it accordingly to fill up the nearest byte. Encode each set of 8 bytes to a UTF-8 string (ending up with 8x*utf8_char* string) and store that in Redis. The fact that they're unsigned means that you don't care about that first bit but if you did, you could add a flag to the string.
Upon retrieving the data, you have to remember to pad each character to 8 bytes again as UTF-8 will use less bytes for the representation if the character can be stored with less bytes.
End result is that you store a maximum of 8 x 8 byte characters instead of (possibly) a maximum of 64 x 8 byte characters.

Is there a practical limit to the size of bit masks?

There's a common way to store multiple values in one variable, by using a bitmask. For example, if a user has read, write and execute privileges on an item, that can be converted to a single number by saying read = 4 (2^2), write = 2 (2^1), execute = 1 (2^0) and then add them together to get 7.
I use this technique in several web applications, where I'd usually store the variable into a field and give it a type of MEDIUMINT or whatever, depending on the number of different values.
What I'm interested in, is whether or not there is a practical limit to the number of values you can store like this? For example, if the number was over 64, you couldn't use (64 bit) integers any more. If this was the case, what would you use? How would it affect your program logic (ie: could you still use bitwise comparisons)?
I know that once you start getting really large sets of values, a different method would be the optimal solution, but I'm interested in the boundaries of this method.
Off the top of my head, I'd write a set_bit and get_bit function that could take an array of bytes and a bit offset in the array, and use some bit-twiddling to set/get the appropriate bit in the array. Something like this (in C, but hopefully you get the idea):
// sets the n-th bit in |bytes|. num_bytes is the number of bytes in the array
// result is 0 on success, non-zero on failure (offset out-of-bounds)
int set_bit(char* bytes, unsigned long num_bytes, unsigned long offset)
{
// make sure offset is valid
if(offset < 0 || offset > (num_bytes<<3)-1) { return -1; }
//set the right bit
bytes[offset >> 3] |= (1 << (offset & 0x7));
return 0; //success
}
//gets the n-th bit in |bytes|. num_bytes is the number of bytes in the array
// returns (-1) on error, 0 if bit is "off", positive number if "on"
int get_bit(char* bytes, unsigned long num_bytes, unsigned long offset)
{
// make sure offset is valid
if(offset < 0 || offset > (num_bytes<<3)-1) { return -1; }
//get the right bit
return (bytes[offset >> 3] & (1 << (offset & 0x7));
}
I've used bit masks in filesystem code where the bit mask is many times bigger than a machine word. think of it like an "array of booleans";
(journalling masks in flash memory if you want to know)
many compilers know how to do this for you. Adda bit of OO code to have types that operate senibly and then your code starts looking like it's intent, not some bit-banging.
My 2 cents.
With a 64-bit integer, you can store values up to 2^64-1, 64 is only 2^6. So yes, there is a limit, but if you need more than 64-its worth of flags, I'd be very interested to know what they were all doing :)
How many states so you need to potentially think about? If you have 64 potential states, the number of combinations they can exist in is the full size of a 64-bit integer.
If you need to worry about 128 flags, then a pair of bit vectors would suffice (2^64 * 2).
Addition: in Programming Pearls, there is an extended discussion of using a bit array of length 10^7, implemented in integers (for holding used 800 numbers) - it's very fast, and very appropriate for the task described in that chapter.
Some languages ( I believe perl does, not sure ) permit bitwise arithmetic on strings. Giving you a much greater effective range. ( (strlen * 8bit chars ) combinations )
However, I wouldn't use a single value for superimposition of more than one /type/ of data. The basic r/w/x triplet of 3-bit ints would probably be the upper "practical" limit, not for space efficiency reasons, but for practical development reasons.
( Php uses this system to control its error-messages, and I have already found that its a bit over-the-top when you have to define values where php's constants are not resident and you have to generate the integer by hand, and to be honest, if chmod didn't support the 'ugo+rwx' style syntax I'd never want to use it because i can never remember the magic numbers )
The instant you have to crack open a constants table to debug code you know you've gone too far.
Old thread, but it's worth mentioning that there are cases requiring bloated bit masks, e.g., molecular fingerprints, which are often generated as 1024-bit arrays which we have packed in 32 bigint fields (SQL Server not supporting UInt32). Bit wise operations work fine - until your table starts to grow and you realize the sluggishness of separate function calls. The binary data type would work, were it not for T-SQL's ban on bitwise operators having two binary operands.
For example .NET uses array of integers as an internal storage for their BitArray class.
Practically there's no other way around.
That being said, in SQL you will need more than one column (or use the BLOBS) to store all the states.
You tagged this question SQL, so I think you need to consult with the documentation for your database to find the size of an integer. Then subtract one bit for the sign, just to be safe.
Edit: Your comment says you're using MySQL. The documentation for MySQL 5.0 Numeric Types states that the maximum size of a NUMERIC is 64 or 65 digits. That's 212 bits for 64 digits.
Remember that your language of choice has to be able to work with those digits, so you may be limited to a 64-bit integer anyway.