Size of a serialized complete binary tree - serialization

I'm trying to work out the size of a serialized binary tree having N nodes (also mentioned on LeetCode). This is how I calculate the size:
If we assume the storage required for each node's value is V bits, then the storage needed to store N nodes is N·V bits. We also need to store NULL for the leaves; since there are exactly Ceiling(N/2) leaves in a complete tree, and assuming one bit is enough to represent NULL, an additional 2 × Ceiling(N/2) bits are required. 2 × Ceiling(N/2) works out to N+1, since in a complete tree N is always an odd number.
So N·V + (N+1) bits are required in total.
However, I can see that on LeetCode and in some other places (e.g. this), it's calculated as N·V + 2N.
What am I missing?

What am I missing?
The two references you provided (LeetCode and blog article) deal with arbitrary binary trees, not necessarily complete. So let me first deal with arbitrary binary trees:
Although a NULL reference could be represented with one bit (e.g. with value 0), you also need to store the fact that a reference is not NULL (value 1). You cannot just omit the bit, as then the next bit (belonging to a node value) could be misinterpreted as indicating a NULL reference. So you should count that bit not only for each NULL reference, but for all branches.
The serialised format would represent, for each node:
The node's value (V bits)
Whether or not its left child is NULL (1 bit)
Whether or not its right child is NULL (1 bit)
Example:
Let V be 4
Tree to serialise:

   10
  /  \
 7    13
        \
        14
Serialisation process (level order):

node value | has left | has right | serialised | without spacing
10         | yes      | yes       | 1010 1 1   | 101011
7          | no       | no        | 0111 0 0   | 011100
13         | no       | yes       | 1101 0 1   | 110101
14         | no       | no        | 1110 0 0   | 111000
Complete:
101011011100110101111000
If we were only to store the 0 when there is a NULL, then we would get this:
101001110011010111000
    ^
But now the bit at the indicated position is ambiguous, because it could be interpreted as representing a NULL reference, while actually it is the first of the V bits 0111 representing the value 7.
It is however possible to reduce the serialised string by 2 bits: the last 2 bits will always be 0 in a tree traversal that is guaranteed to end with a leaf, which is the case for level-order and pre-order traversals. So you could just omit those 2 bits.
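To make this concrete, here is a minimal Python sketch of the level-order serialisation described above (the Node class and the bit-string output are my own illustration, not part of any standard format):

from collections import deque

class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def serialise(root, v_bits=4):
    """Level order: V bits for each value, then one flag bit per child."""
    out = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        out.append(format(node.value, f"0{v_bits}b"))  # the node's value
        out.append("1" if node.left else "0")          # left child is not NULL?
        out.append("1" if node.right else "0")         # right child is not NULL?
        if node.left:
            queue.append(node.left)
        if node.right:
            queue.append(node.right)
    return "".join(out)

# The example tree: 10 with children 7 and 13; 13 has a right child 14.
tree = Node(10, Node(7), Node(13, right=Node(14)))
print(serialise(tree))  # 101011011100110101111000

The printed string matches the "Complete" serialisation worked out by hand above.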
The case for complete binary trees
First of all about the definition of a complete binary tree. You write:
in a complete tree N is always an odd number.
I suppose then that your definition of a complete tree is what Wikipedia calls a perfect tree. We can however also look at (nearly) complete binary trees (and then N is not necessarily odd).
For complete binary trees the case is simpler, as a level order traversal of a complete binary tree will never include NULLs, i.e. there are no "gaps" in such a traversal.
So you can just serialise the nodes' values in that order, giving each V bits. This is in fact the array representation used for binary heaps:
The parent / child relationship is defined implicitly by the elements' indices in the array.
If serialisation happens in a string data type that implicitly has a length attribute, then that's it. If there is no such metadata, then you need to prefix the value of N to the serialisation, reserving a predefined number of bits for it. Alternatively, if there is a special value of V bits that will never occur as an actual node value, you could append it as a terminator (much like \0 in C strings).
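As a sketch of that idea in Python (the fixed-width N prefix and the bit-string output are my own illustration):

def serialise_complete(values, v_bits=4, n_bits=8):
    """values: the node values in level order (the binary-heap array layout)."""
    out = [format(len(values), f"0{n_bits}b")]          # prefix N in a fixed number of bits
    out += [format(v, f"0{v_bits}b") for v in values]   # no NULL flags needed
    return "".join(out)

# The node at index i (0-based) has its children at 2*i + 1 and 2*i + 2.
print(serialise_complete([10, 7, 13]))  # 00000011101001111101 (N=3, then 10, 7, 13)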

Related

DEFLATE: how to handle "no distance codes" case?

I mostly get RFC 1951; however, I'm not too clear on how to handle the case where (when using dynamic Huffman tables) no distance codes are needed or present. For example, let's take the input:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890987654321ZYXWVUTSR
where no backreference is possible since there are no repetitions of length >= 3.
According to RFC 1951, at least one distance code must be present regardless, otherwise it wouldn't be possible to encode HDIST - 1. I understand, according to the reference, that such a code should be of zero bits to signal "no distance codes":
One distance code of zero bits means that there are no distance codes
used at all (the data is all literals).
In infgen symbols, I'd expect to see a dist 0 0.
Analyzing what gzip does with infgen, however, I see that TWO distance codes are emitted (each 1 bit long) for the above input (even though none is actually used then):
! infgen 2.4 output
!
gzip
!
last
dynamic
litlen 48 6
litlen 49 6
litlen 50 6
...cut...
litlen 121 6
litlen 122 6
litlen 256 6
dist 0 1
dist 1 1
literal 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890987654321Z
literal 'YXWVUTSR
end
!
crc
length
So what's the correct behavior in these cases?
If there are no matches in the deflate block, there will be no lengths from the length/literal code, and so the decoder will never look for a distance code. In that case, what would make the most sense is to provide no information at all about a distance code.
However, the format does not permit that, since the 5-bit HDIST value in the header is interpreted as 1 to 32 distance codes, for which lengths must be provided in the header. You must provide at least one distance code length, even though it will never be used.
There are several valid things you can do in that case. RFC 1951 notes you can provide a single distance code (HDIST == 0, meaning one length), with length zero, which would be just one zero in the list of lengths.
It is also permitted to provide a single code of length one, or you could do as zlib is doing, which is to provide two codes of length one. You can actually put any valid distance code description you like there, and it will still be accepted.
As to why zlib's deflate is choosing to define two codes there, I can only guess that Jean-loup was being conservative, writing something he knew that even an over-simplified inflator would have to accept. Both gzip and zopfli do the same thing. They all do the same thing when there is only one distance code used. They could emit just the single one-bit distance code, per the RFC, but they emit two single-bit distance codes, one of which is never used.
Really the right thing to do would be to write a single zero length as noted in the RFC, which would take the fewest bits in the header. I will consider updating zlib to do that, to eke out a few more bits of compression.
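If you want to poke at this yourself, here is a small Python sketch (my own illustration, standard zlib module only) that compresses the example input and reads the block header fields off the front of the raw deflate stream, least significant bit first as RFC 1951 prescribes:

import zlib

data = b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890987654321ZYXWVUTSR"
raw = zlib.compress(data, 9)[2:]  # drop the 2-byte zlib wrapper; the trailer is never reached

def bit_reader(buf):
    for byte in buf:
        for i in range(8):
            yield (byte >> i) & 1  # deflate packs bits least significant first

def take(bits, n):
    value = 0
    for i in range(n):
        value |= next(bits) << i
    return value

bits = bit_reader(raw)
print("BFINAL:", take(bits, 1))
btype = take(bits, 2)
print("BTYPE :", btype)  # 2 = dynamic Huffman codes
if btype == 2:
    print("HLIT  :", take(bits, 5) + 257)  # lit/len code lengths in the header
    print("HDIST :", take(bits, 5) + 1)    # distance code lengths: always 1..32, never 0

Since HDIST is stored as the count minus 1 in 5 bits, the header can never say "zero distance codes", which is exactly the constraint discussed above.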

How to manipulate bits in Smalltalk?

I am currently working on a file compressor based on Huffman decoding. I have a decoding tree, and I have to encode this tree to an output file according to the following criteria:
"for each leaf, write out a 0 bit, followed by the 8 bits of
the corresponding character. Write out the bits in the order bit 7, bit 6, . . ., bit 0, that is high bit first. As a special case, if the byte is 0, write out bit 8, which will be a 0 for a byte value of 0, and 1 for a byte value of 256 (the EOF marker)." For an internal node, just write a bit 1.
So what I plan to do is to create a bit array and add the corresponding bits to it in the specified format. The problem is that I don't know how to convert a number to binary in Smalltalk.
For example, if I want to encode the first leaf, I would want to produce something like 01101011, i.e. 0 followed by the bit representation of k, and then add every bit one by one into the array.
I don't know which dialect you are using exactly, but generally you can access the bits of an Integer. They are modelled as if the representation were two's complement, with an infinite sequence of bits:
2 is ....0000000000010
1 is ....0000000000001
0 is ....0000000000000 with infinitely many 0 on the left
-1 is ....1111111111111 with infinitely many 1 on the left
-2 is ....1111111111110
This is also true for LargeIntegers: even though they are generally implemented as sign-magnitude (the class encodes the sign), two's complement is emulated.
Then you can operate with bitAnd:, bitOr:, bitXor:, bitInvert, bitShift:, and in some flavours bitAt:put:.
You can access the bits with (2 bitAt: index), where the index starts at 1 for the least significant bit and grows upward from there. If bitAt: is missing, implement it with bitAnd: and bitShift:.
For positive integers, you can ask for the position of the highest bit (2 highBit).
All these operations create a new integer (no in-place modification is possible).
Conceptually, a ByteArray is a collection of unsigned 8-bit integers (between 0 and 255), so you can implement a bit array on top of one (if it does not already exist in your dialect). Or you can use an Integer, but then you can neither control the size (which is conceptually infinite) nor modify it in place: every operation costs a copy.
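Since the underlying idea is language-independent, here is a rough Python analogue of that bit-array-over-bytes approach (the class and method names are mine, not a Smalltalk API); each write_bit call corresponds to the bitAt:/bitShift: manipulations described above:

class BitWriter:
    """Accumulates single bits, high bit of each byte first, in a bytearray."""
    def __init__(self):
        self.buf = bytearray()
        self.nbits = 0
    def write_bit(self, b):
        if self.nbits % 8 == 0:
            self.buf.append(0)  # start a new byte
        self.buf[-1] |= (b & 1) << (7 - self.nbits % 8)
        self.nbits += 1
    def write_value(self, value, width):
        for i in range(width - 1, -1, -1):  # bit 7, bit 6, ..., bit 0: high bit first
            self.write_bit((value >> i) & 1)

w = BitWriter()
w.write_bit(0)              # 0 marks a leaf
w.write_value(ord("k"), 8)  # followed by the 8 bits of the character
print(w.buf.hex(), w.nbits) # 3580 9 -- nine bits packed into two bytes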

Structure Packing

I'm currently learning C# and my first project (as a learning experiment) is to create a DBF reader. I'm having some difficulty understanding "packing" according to this: http://www.developerfusion.com/pix/articleimages/dec05/structs1.jpg
If I specified a packing of 2, wouldn't all structure elements begin on a 2-byte boundary, and if I specified a packing of 4, wouldn't all structure elements begin on a 4-byte boundary, and also consume a minimum of 4 bytes each?
For instance, a byte element would be placed on a 4 byte boundary, and the element following it (in a sequential layout) would be located on the next 4-byte boundary (losing 3 bytes to padding)?
In the image shown, in the pack=4 case there is a byte on a 2-byte boundary, following a short.
If I understand the picture correctly, a pack of n means that a variable cannot be stored across two packs of length n. In other words, the bytes which compose a variable cannot cross a pack boundary. This is only true if the size of the variable is less than or equal to the size of a pack.
Let's take Pack = 4 as an example. Here we can safely store a byte and a short in one pack, because together they require 3 bytes of memory. But since only one byte is left in the pack, one byte of padding is needed before an int can be stored, because what's left in the pack is too little to hold the whole int.
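You can watch this behaviour from Python with ctypes (my own demonstration; the field order short, byte, int mirrors the example above):

import ctypes

class Pack1(ctypes.Structure):
    _pack_ = 1  # like Pack = 1: no padding at all
    _fields_ = [("s", ctypes.c_uint16), ("b", ctypes.c_uint8), ("i", ctypes.c_uint32)]

class Pack4(ctypes.Structure):
    _pack_ = 4  # like Pack = 4 in the article's picture
    _fields_ = [("s", ctypes.c_uint16), ("b", ctypes.c_uint8), ("i", ctypes.c_uint32)]

for cls in (Pack1, Pack4):
    print(cls.__name__, "size:", ctypes.sizeof(cls),
          "offsets:", cls.s.offset, cls.b.offset, cls.i.offset)
# Pack1 size: 7 offsets: 0 2 3  -- the int may straddle pack boundaries
# Pack4 size: 8 offsets: 0 2 4  -- one padding byte before the int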
I hope the explanation makes sense.
Looking at the picture again, I think it would be better if all data were aligned to the same side of a pack, either to the bottom or the top. That would make it clearer what's going on.

Binary Search Tree Minimum Value

I am new to the binary search tree data structure. One thing I don't understand is why the leftmost node is the smallest:
      10
     /  \
    5    12
   / \   / \
  1   6 0   14
In the above instance, 0 is the smallest value, not 1.
Let me know where I got mixed up.
Thank you!
That tree is not a binary search tree.
Creating a binary search tree is a process that starts with adding elements. You can do it with an array.
At first there is no element, so make the first value the root. Then start adding elements as nodes: comparing against the node at index n, if the new value is bigger, add it at array[2n + 1]; if it is smaller, add it at array[2n]. So all values left of any node are smaller than it, and all values right of any node are bigger than it. This even constrains 10 and 6's placement: 6 could not be 11, since it sits in 10's left subtree (and in your tree it indeed isn't). That's all!
For a tree to be considered as a binary search tree, it must satisfy the following property:
... the key in each node must be greater than all keys stored in the left sub-tree, and smaller than all keys in right sub-tree
Source: https://en.wikipedia.org/wiki/Binary_search_tree
The tree you posted is not a binary search tree, because the root node (10) is not smaller than all keys in its right sub-tree (node 0).
I'm not really sure of your question, but binary search works by comparing the search-value to the value of the node, starting with the root node (value 10 here). If the search-value is less, it then looks at the left node of the root (value 5), otherwise it looks next at the right node (12).
It doesn't matter so much where in the tree the value is as long as the less and greater rule is followed.
In fact, you want trees set up like this (except for the misplaced 0 node), because the more balanced a tree is (number of nodes on the left vs. the right), the faster your search will be!
A tree balancing algorithm might, for example, look for the median value in a list of values and make that the value of the root node.
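A short Python sketch (the Node class and helper names are my own) shows the rule in action: inserted with proper comparisons, 0 travels all the way down the left spine instead of ending up under 12, so the leftmost node really is the minimum:

class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)    # smaller keys go left
    else:
        root.right = insert(root.right, key)  # larger keys go right
    return root

def find_min(root):
    while root.left is not None:  # keep walking left
        root = root.left
    return root.key

root = None
for key in (10, 5, 12, 1, 6, 0, 14):
    root = insert(root, key)
print(find_min(root))  # 0 -- inserted left of 1, not under 12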

Why is "Yes" a value of -1 in MS Access database?

I'm looking at linked data in MS Access.
The "Yes/No" fields contain the value -1 for YES and 0 for NO. Can someone explain why such a counter-intuitive value is used for "Yes"? (Obviously, it should be 1 and 0)
I imagine there must be a good reason, and I would like to know it.
The binary representation of False is 0000000000000000 (how many bits are used depends on the implementation). If you perform a binary NOT operation on it, it becomes 1111111111111111, i.e. True; but this is also the binary representation of the signed integer -1.
A 1 bit in the most significant position signals a negative number for signed types. Changing the sign of a number is done by inverting all the bits and adding 1. This is called two's complement.
Let us change the sign of 1111111111111111. First invert; we get:
0000000000000000
Then add one:
0000000000000001, this is 1.
This is the proof that 1111111111111111 was the binary representation of -1.
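The same proof can be carried out in Python (choosing 16 bits for the illustration, as above):

BITS = 16
MASK = (1 << BITS) - 1

true_val = ~0 & MASK             # binary NOT of all zeros, kept to 16 bits
print(format(true_val, "016b"))  # 1111111111111111

# Reinterpret that bit pattern as a signed 16-bit integer (two's complement):
signed = true_val - (1 << BITS) if true_val & (1 << (BITS - 1)) else true_val
print(signed)                    # -1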
UPDATE
Also, when comparing these values, do not compare

x = -1

or

x = 1

Instead, compare

x <> 0

This always gives the correct result, independently of the convention used. Most implementations treat any nonzero value as True.
"Yes" is -1 because it isn't anything else.
When dealing with Microsoft products, especially one as old as Access, don't assume that there is a good reason for any design choice.