Can DEFLATE only compress duplicate strings up to 32 KiB apart?

According to the DEFLATE spec:
Compressed representation overview
A compressed data set consists of a series of blocks, corresponding to successive blocks of input
data. The block sizes are arbitrary, except that non-compressible
blocks are limited to 65,535 bytes.
Each block is compressed using a combination of the LZ77 algorithm and
Huffman coding. The Huffman trees for each block are independent of
those for previous or subsequent blocks; the LZ77 algorithm may use a
reference to a duplicated string occurring in a previous block, up to
32K input bytes before.
Each block consists of two parts: a pair of Huffman code trees that
describe the representation of the compressed data part, and a
compressed data part. (The Huffman trees themselves are compressed
using Huffman encoding.) The compressed data consists of a series of
elements of two types: literal bytes (of strings that have not been
detected as duplicated within the previous 32K input bytes), and
pointers to duplicated strings, where a pointer is represented as a
pair <length, backward distance>. The representation used in the
"deflate" format limits distances to 32K bytes and lengths to 258
bytes, but does not limit the size of a block, except for
uncompressible blocks, which are limited as noted above.
So pointers to duplicate strings only go back 32 KiB, but since the block size is not limited, could the Huffman code tree assign the same code to two duplicate strings more than 32 KiB apart? Is the limiting factor then the block size?

The Huffman tree for distances contains codes 0 to 29 (table below); code 29, followed by 8191 in "plain" extra bits, means "distance 32768" (24577 + 8191). That is a hard limit in the definition of Deflate. The block size is not the limiting factor; in fact the block size is not stored anywhere: a block is a potentially unbounded stream of symbols, and to end it you send an end-of-block code.
Distance Codes
--------------
      Extra             Extra               Extra                Extra
 Code Bits Dist   Code Bits  Dist     Code Bits  Distance   Code Bits  Distance
 ---- ---- ----   ---- ---- -------   ---- ---- ---------   ---- ---- -----------
   0   0    1       8   3    17-24     16   7    257-384     24  11    4097-6144
   1   0    2       9   3    25-32     17   7    385-512     25  11    6145-8192
   2   0    3      10   4    33-48     18   8    513-768     26  12    8193-12288
   3   0    4      11   4    49-64     19   8    769-1024    27  12   12289-16384
   4   1   5,6     12   5    65-96     20   9   1025-1536    28  13   16385-24576
   5   1   7,8     13   5    97-128    21   9   1537-2048    29  13   24577-32768
   6   2   9-12    14   6   129-192    22  10   2049-3072
   7   2  13-16    15   6   193-256    23  10   3073-4096
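
To make the table concrete, here is a minimal sketch (in Python; my illustration, not part of the original answer) of how a decoder turns a distance code plus its extra bits into an actual backward distance. The base distances and extra-bit counts are transcribed from the table above:

# Base distance and extra-bit count for distance codes 0-29 (from the table above).
DIST_BASE = [1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
             257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
             8193, 12289, 16385, 24577]
DIST_EXTRA = [0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6,
              7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13]

def distance(code, extra):
    # 'extra' is the value read from the DIST_EXTRA[code] "plain" bits
    assert 0 <= extra < (1 << DIST_EXTRA[code])
    return DIST_BASE[code] + extra

print(distance(29, 8191))  # 32768 -- the hard maximum

So the largest expressible distance is DIST_BASE[29] + 8191 = 24577 + 8191 = 32768, no matter how large the block is.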

To add to Zerte's answer, the references to previous sequences have nothing to do with blocks or block boundaries. Such references can be within blocks, across blocks, and the referenced sequence can cross a block boundary.

Related

Store non-binary values into a unique integer

In 8 bits we can store 8 values of 0 or 1 each; in other words, we can store 8 different pieces of data, and the byte as a whole covers a range from 0 to 255.
0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 → 8 different piece of data
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
[bit] [bit] [bit] [bit] [bit] [bit] [bit] [bit] → 8 bits total
If instead of storing 0 or 1 we need to store 0, 1 or 2, then we will of course end up using more bits. With 2 bits per value we can actually store 0, 1, 2 or 3; the fourth state is more than I need, but it occupies the same amount of space.
0/1/2 0/1/2 0/1/2 0/1/2 → 4 different pieces of data (what I actually need)
0/1/2/3 0/1/2/3 0/1/2/3 0/1/2/3 → 4 different pieces of data (more states than I need)
↓ ↓ ↓ ↓
[2bits] [2bits] [2bits] [2bits] → 8 bits total
We can agree that we are "losing" storage capacity. The solution is to do a radix conversion (I don't know if that's the right term) using base 3: that way we achieve the goal of storing exclusively 0, 1 or 2, and in the same 8 bits we can store more information (although, to be honest, I don't know how the bits organize themselves for this to work).
Examples:
00000₃ → int 0
01212₃ → int 50
11111₃ → int 121
12121₃ → int 151
22222₃ → int 242 (max)
In this conversion, we are able to store 5 pieces of information composed of 0, 1 or 2, instead of just 4, since 3⁵ = 243 ≤ 256. And there's still a little "space" left over (243 through 255 are never used), and that's what I'd like to talk about.
I was wondering if there is any way to store mixed-base data instead of fixed-base data (e.g. base 3, as exemplified above). To be clearer: suppose I wanted to store two pieces of data composed of 0, 1 or 2 each, and one piece of data that is a number between 0 and 8. In binary, and still limited to 256, we can do it as follows:
0/1/2 0/1/2 0/1/2/3/4/5/6/7/8 → 3 different pieces of data
↓ ↓ ↓
[2bits] [2bits] [ 4 bits ] → 8 bits total
But it seems to me that we have empty "spaces" left again (3 × 3 × 9 = 81 states out of 256), since this doesn't use the full capacity of the 8 bits, and maybe it's still possible to fit more data by using number-base conversion instead.
The problem is: I don't know how to handle the data this way, so that it can be reliably merged and split again.
In the real world: I need to transfer a lot of data where each value carries very little information (e.g. numbers 0 to 2, 0 to 5, 0 to 10, etc.). I'm currently doing bitwise packing of this data, and in my case every optimized byte is a very nice gain (if this is really possible, it could mean a 20~40% data saving).
I understand that it might cost more processing for the rebasing, merging and splitting conversions, but that won't be an issue, because this processing will be done on the client side (merge when sending, split when receiving), and the server handles the same data already optimized (no additional processing).
--
Possible solutions:
Let's imagine that I have a set of three numbers that can each be 0, 1 or 2, and another set of three numbers that can each go from 0 to 8 (so 9 possibilities each). The goal is to merge all these numbers into a single integer; since there are 3³ × 9³ = 19683 possible combinations, it should fit in about 2 bytes (which is what my best current method uses).
Solution 1: each number being stored in a single complete byte:
byte A from 0 to 2
byte B from 0 to 2
byte C from 0 to 2
byte D from 0 to 8
byte E from 0 to 8
byte F from 0 to 8
Problem: it will be necessary to consume 6 bytes to store these 6 numbers.
Solution 2: storage through bits:
2 bits to store A
2 bits to store B
2 bits to store C
4 bits to store D
4 bits to store E
4 bits to store F
6 unused bits to complete the last byte
Problems: besides 2 bits being "more than necessary" to store numbers from 0 to 2 (A, B and C), and likewise 4 bits for numbers from 0 to 8 (D, E and F), we will use a total of 3 bytes and still have 6 "dead" (unused) bits at the end to complete the last byte.
Solution 3: Separate into two sets and convert their respective bases individually:
A, B and C to base 3 consumes 1 byte
D, E and F to base 9 consumes 2 bytes
Problem: although this is a good solution for now (and, despite the example above, in more complex situations it can be more compact than solution 2), and it's what I'm using, I believe there's still a lot of "left over" space in this union, and maybe it's possible to squeeze everything into 2 bytes only.
For example, the conversion of 222₃ consumes at most 5 bits, which means that in its byte we still have 3 unused bits. The 888₉ conversion consumes 10 bits. This makes me see the possibility of using only 15 bits, with only 1 unused bit (2 bytes).
Then we come to the next solution.
Solution 4: move the bits to further optimize space:
the higher bits store A, B and C
the lower bits store D, E and F
Example:
[5 bits of numbers of base 3] +
[10 bits of numbers of base 9] +
[1 unused bit]
Problem: this solution is already better than the one I'm currently using. However, I still see room for improvement in situations where more information could fit through number-base conversion and union.
You pack your values in the following way:
Example:
possible values for
a = [0,2] -> 3 states
b = [0,2] -> 3 states
c = [0,8] -> 9 states
let's assume you have the following values
a := 1
b := 2
c := 8
then you can calculate the final number in the following way
int number = 0;
int multi = 1;
number += multi * a;  // a goes in the lowest "digit" (base 3)
multi *= 3;           // shift by one base-3 digit
number += multi * b;  // b in the next base-3 digit
multi *= 3;
number += multi * c;  // c in the top digit (base 9)
number now holds your packed values.
To unpack, you just need to do:
a = number % 3;
number = number / 3;
b = number % 3;
number = number / 3;
c = number % 9;
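
The same idea generalizes to any list of bases, i.e. a mixed-radix number system. A short sketch of that generalization (mine, in Python, not part of the answer above):

def pack(values, bases):
    # each value is a "digit"; digit i is in base bases[i]
    number, multi = 0, 1
    for v, b in zip(values, bases):
        number += v * multi
        multi *= b
    return number

def unpack(number, bases):
    values = []
    for b in bases:
        values.append(number % b)
        number //= b
    return values

bases = [3, 3, 3, 9, 9, 9]           # A, B, C in base 3; D, E, F in base 9
packed = pack([1, 2, 0, 8, 4, 7], bases)
print(packed.bit_length())            # at most 15 bits: 3*3*3*9*9*9 = 19683 <= 2**15
print(unpack(packed, bases))          # [1, 2, 0, 8, 4, 7]

This achieves the 2-byte goal from the question: the packed value is always below 19683, so it always fits in 15 bits.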

No of Passes in a Bubble Sort

For the list of items in an array, i.e. {23, 12, 8, 15, 21}, the number of passes I can see is only 3, contrary to the (n-1) passes for n elements. I also see that the (n-1) passes for n elements is the worst case, where all the elements are in descending order. So I have two conclusions, and I'd like to know whether my understanding of them is correct.
Conclusion 1
(n-1) passes for n elements can occur in the following scenarios:
all elements in the array are in descending order which is the worst case
when it is not the best case (the number of passes is 1 for an already-sorted array)
Conclusion 2
(n-1) passes for n elements is more like a theoretical upper bound and may not hold in all cases, as in this example {23, 12, 8, 15, 21}, where the number of passes is (n-2).
In a classic, one-directional bubble sort, no element is moved more than one position to the “left” (to a smaller index) in a pass. Therefore you must make n-1 passes when (and only when) the smallest element is in the largest position. Example:
12 15 21 23 8 // initial array
12 15 21 8 23 // after pass 1
12 15 8 21 23 // after pass 2
12 8 15 21 23 // after pass 3
8 12 15 21 23 // after pass 4
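For reference, here is a plain one-directional bubble sort that counts passes (my sketch, in Python; it counts only passes that perform at least one swap and stops after the first clean pass):

def bubble_sort_passes(a):
    passes = 0
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(a) - 1):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]  # bubble the larger element right
                swapped = True
        if swapped:
            passes += 1
    return passes

print(bubble_sort_passes([12, 15, 21, 23, 8]))  # 4 -- the worst case above
print(bubble_sort_passes([23, 12, 8, 15, 21]))  # 2 -- the asker's example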
Let's define L(i) as the number of elements to the left of element i that are larger than element i.
The array is sorted when L(i) = 0 for all i.
A bubble sort pass decreases every non-zero L(i) by one.
Therefore the number of passes required is max(L(0), L(1), ..., L(n-1)). In your example:
23 12 8 15 21 // elements
0 1 2 1 1 // L(i) for each element
The max L(i) is L(2): the element at index 2 is 8 and there are two elements left of 8 that are larger than 8.
The bubble sort process for your example is
23 12 8 15 21 // initial array
12 8 15 21 23 // after pass 1
8 12 15 21 23 // after pass 2
which takes max(L(i)) = L(2) = 2 passes.
For a detailed analysis of bubble sort, see The Art of Computer Programming Volume 3: Sorting and Searching section 5.2.2. In the book, what I called the L(i) function is called the “inversion table” of the array.
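The same count can be computed directly from the inversion table, without sorting (again my sketch, using the L(i) definition above):

def passes_needed(a):
    # L(i): number of elements to the left of a[i] that are larger than a[i]
    L = [sum(a[j] > a[i] for j in range(i)) for i in range(len(a))]
    return max(L)

print(passes_needed([23, 12, 8, 15, 21]))  # 2

Note this counts swapping passes only; an implementation that detects sortedness needs one extra pass to confirm that no swaps remain.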
re: Conclusion 1
all elements in the array are in descending order which is the worst case
Yep, this is when you'll have to do all (n-1) "passes".
when it is not the best case (the number of passes is 1 for an already-sorted array)
No. When you don't have the best case, you'll need more than 1 pass; and as long as it's not the worst case, you'll need fewer than (n-1) passes. So it's somewhere in between.
re: Conclusion 2
There's nothing theoretical about it at all. You provide an example of a middle-ground case (not fully reversed, but not fully sorted either), and you end up needing a middle-ground number of passes through it. What's theoretical about it?

gnuplot: Spurious data points in plots when using index

I'm trying to use gnuplot 4.6 patchlevel 6 to visualize some data from a file test.dat which looks like this:
#Pkg 1
type min max avg
small 1 10 5
medium 5 15 7
large 10 20 15
#Pkg 2
small 3 9 5
medium 5 13 6
large 11 17 13
(Note that the values are actually separated by tabs even though it shows as spaces here.)
My gnuplot commands are
reset
set datafile separator "\t"
plot 'test.dat' index 0 using 2:xticlabels(1) title col, '' using 3 title col, '' using 4 title col
This works fine as long as there is only a single data block in test.dat. When I add the second block, spurious data points appear. Why is that, and how can it be fixed?
YFTR: Using stat on the file yields only expected results: it reports two data blocks for the full file, and correct values (for min, max and sum) when I specify one of the two blocks using index.
As mentioned in the comment to the question, one has to explicitly repeat the index 0 specification in all parts of the plot command:
plot 'test.dat' index 0 using 2, '' index 0 using 3, ...
Otherwise '' refers to all blocks in the data file.
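Applied to the full plot command from the question, that would presumably read:
plot 'test.dat' index 0 using 2:xticlabels(1) title col, '' index 0 using 3 title col, '' index 0 using 4 title col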

Is there a way to represent a number in binary where bits have approximately uniform significance?

I'm wondering if it is possible to represent a number as a sequence of bits, each having approximately the same significance, such that if we flip one of the bits, the overall value does not change by much.
For example, we can use groups of 4 bits, where each group represents a value from 0 to 15 and the overall value is the sum of all these values.
0110 0101 1101 1010 1011 → 6 + 5 + 13 + 10 + 11 = 45
Now flipping any single bit can only cause a maximum difference of 8 in the final value.
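A quick way to check these properties (my sketch, in Python): evaluate the representation, then brute-force every single-bit flip:

def value(bits):
    # sum of the 4-bit groups in a bit string
    return sum(int(bits[i:i+4], 2) for i in range(0, len(bits), 4))

s = "01100101110110101011"
print(value(s))  # 6 + 5 + 13 + 10 + 11 = 45
deltas = [abs(value(s[:i] + ("1" if s[i] == "0" else "0") + s[i+1:]) - value(s))
          for i in range(len(s))]
print(max(deltas))  # 8: one flip changes a single group by 1, 2, 4 or 8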
Some drawbacks obviously exist with this approach:
values have multiple representations, with some values having more representations than others (for example, there are 39280 distinct representations for the number 38, and only 1 for the number 0);
the amount of values that can be represented is greatly reduced (this representation allows for integers from 0 to 75, while 20 bits could normally represent 2²⁰ ≈ 1 million different integers).
Are there any resources I can find concerning this problem? I can't seem to find anything online, but maybe I'm not searching with the right keywords. What other alternatives exist to my approach? Do they improve on its disadvantages?

SQL Server 2008, how much space does this occupy?

I am trying to calculate how much space (MB) this would occupy. In the database table there are 7 bit columns, 2 tinyint columns and 1 GUID column.
I'm trying to calculate the amount of space that 16,000 rows would occupy.
My line of thought was that the 7 bit columns consume 1 byte, the 2 tinyints consume 2 bytes and the GUID consumes 16 bytes, for a total of 19 bytes per row. That would mean 304,000 bytes for 16,000 rows, or ~0.3 MB. Is that correct? Is there a metadata byte as well?
There are several estimators out there which take away the donkey work.
You have to take into account the null bitmap (which will be 3 bytes in this case), the number of rows per page, the row header, row versioning, pointers, and all the stuff described here:
Inside the Storage Engine: Anatomy of a record
Edit:
Your 19 bytes of actual data
have 11 bytes of overhead
for a total of 30 bytes per row
around 269 rows per page (8096 / 30)
requires 60 pages (16000 / 269)
around 490 KB of space (60 × 8192)
plus a few KB for the index structure of the primary key
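
Putting those figures together (a back-of-envelope sketch in Python, using the numbers above; actual storage depends on the record anatomy described in the linked article):

import math

row_bytes = 19 + 11                  # actual data + per-row overhead
rows_per_page = 8096 // row_bytes    # usable bytes in an 8 KB data page
pages = math.ceil(16000 / rows_per_page)
print(rows_per_page)                 # 269
print(pages)                         # 60
print(pages * 8192)                  # 491520 bytes, i.e. around 490 KB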