Deflate: code lengths of > 7 bits for top-level HCLEN?

RFC 1951 specifies that the header of a dynamically compressed block contains (HCLEN + 4) 3-bit values, which encode the lengths of the next level of Huffman codes (the code length codes). Since these are 3-bit values, it follows that no code at that level can be longer than 7 bits (111 in binary).
However, there seem to be corner cases which (at least with the "classical" algorithm for building Huffman codes, using a priority queue) apparently generate codes of 8 bits, which of course cannot be encoded.
An example I came up with is the following (this represents the 19 possible symbols of the code length alphabet used for the run-length encoding: the code lengths 0-15 plus the repeat codes 16, 17 and 18):
symbol | frequency
-------+----------
0 | 15
1 | 14
2 | 6
3 | 2
4 | 18
5 | 5
6 | 12
7 | 26
8 | 3
9 | 20
10 | 79
11 | 94
12 | 17
13 | 7
14 | 8
15 | 4
16 | 16
17 | 1
18 | 13
According to various online calculators (e.g. https://people.ok.ubc.ca/ylucet/DS/Huffman.html), and also when building the tree by hand, some symbols in the above table (namely 3 and 17) get 8-bit Huffman codes. The resulting tree looks OK to me, with 19 leaf nodes and 18 internal nodes.
So, is there a special way to calculate Huffman codes for use in DEFLATE?

Yes. deflate uses length-limited Huffman codes. You need either a modified Huffman algorithm that limits the length, or an algorithm that shortens a Huffman code that has exceeded the length. (zlib does the latter.)
In addition to the code lengths code being limited to seven bits, the literal/length and distance codes are limited to 15 bits. It is not at all uncommon to exceed those limits when applying Huffman's algorithm to sets of frequencies encountered during compression.
Though your example is not a valid or possible set of frequencies for that code. Here is a valid example that results in a 9-bit Huffman code, which would then need to be squashed down to seven bits:
3 0 0 5 5 1 9 31 58 73 59 28 9 1 2 0 6 0 0
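To make the "shorten an overlong code" idea concrete, here is a rough sketch in Python. It is not zlib's actual algorithm (zlib reshuffles the tree internally); it just builds plain Huffman code lengths, clamps anything longer than the limit, and then repairs the Kraft inequality. The result is a valid length-limited code, though not necessarily the optimal one (package-merge would give that).

import heapq

def huffman_lengths(freqs):
    # Plain (unlimited) Huffman code lengths for all symbols with freq > 0.
    heap = [(f, [sym]) for sym, f in enumerate(freqs) if f > 0]
    lengths = [0] * len(freqs)
    if len(heap) == 1:                        # a single used symbol still gets a 1-bit code
        lengths[heap[0][1][0]] = 1
        return lengths
    heapq.heapify(heap)
    while len(heap) > 1:
        fa, a = heapq.heappop(heap)
        fb, b = heapq.heappop(heap)
        for sym in a + b:                     # every leaf under the merged node goes one bit deeper
            lengths[sym] += 1
        heapq.heappush(heap, (fa + fb, a + b))
    return lengths

def limit_lengths(lengths, max_len):
    # Clamp to max_len, then restore the Kraft inequality (sum of 2**-len <= 1)
    # by deepening the longest codes that are still below the limit.
    lengths = [min(l, max_len) for l in lengths]
    kraft = sum(1 << (max_len - l) for l in lengths if l)
    while kraft > (1 << max_len):
        deepest = max((s for s, l in enumerate(lengths) if 0 < l < max_len),
                      key=lambda s: lengths[s])
        kraft -= 1 << (max_len - lengths[deepest] - 1)   # one bit deeper halves its share
        lengths[deepest] += 1
    return lengths

freqs = [3, 0, 0, 5, 5, 1, 9, 31, 58, 73, 59, 28, 9, 1, 2, 0, 6, 0, 0]
raw = huffman_lengths(freqs)       # the two frequency-1 symbols come out at 9 bits
capped = limit_lengths(raw, 7)     # every used symbol now fits in at most 7 bits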


Group rows using the cumulative sum of a third column

I have a table with two columns:
sort_column = A column I use for sorting
value_column = My metric of interest (a positive integer)
Using SQL, I need to create contiguous groups of rows, ordered by sort_column, such that the sum of value_column within each group is as large as possible while staying strictly below 100 (100 itself not included).
Find below an example of my desired result.
Thanks
sort_column | value_column | desired_result
------------+--------------+---------------
1 | 53 | 1
2 | 25 | 1
3 | 33 | 2
4 | 25 | 2
5 | 10 | 2
6 | 46 | 3
7 | 9 | 3
8 | 49 | 4
9 | 48 | 4
10 | 53 | 5
11 | 33 | 5
12 | 52 | 6
13 | 29 | 6
14 | 16 | 6
15 | 66 | 7
16 | 1 | 7
17 | 62 | 8
18 | 57 | 9
19 | 47 | 10
20 | 12 | 10
OK, so after a few lengthy attempts, I came to the conclusion that the task is impossible with pure SQL: a given value of the desired column depends on previous values of that same column in a way that cannot be derived from the first two columns alone. So the problem cannot be tackled without a recursive CTE, which BigQuery does not support.
I solved the issue by writing a JavaScript UDF for the task. It seems to be working fine and produces the expected results.
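For reference, the logic the UDF has to implement boils down to a single greedy pass over the rows in sort order. A sketch of that pass in Python (illustrative only, not the actual BigQuery UDF):

def assign_groups(values, limit=100):
    # Start a new group whenever adding the next value would push the
    # running total to `limit` or beyond (the limit itself is excluded).
    result, group, running = [], 1, 0
    for v in values:
        if running and running + v >= limit:
            group += 1
            running = 0
        running += v
        result.append(group)
    return result

# value_column from the example above, already ordered by sort_column
values = [53, 25, 33, 25, 10, 46, 9, 49, 48, 53, 33, 52, 29, 16, 66, 1, 62, 57, 47, 12]
print(assign_groups(values))
# -> [1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 9, 10, 10]  (the desired_result column)

This is exactly the sequential dependency described above: each group label depends on the running total accumulated so far within the current group.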
Many thanks everyone!

How to select half precision (BFLOAT16 vs FLOAT16) for your trained model?

How do you decide which precision works best for your inference model? Both BF16 and FP16 take two bytes, but they use a different number of bits for the fraction and the exponent.
The ranges are different, but I am trying to understand why one would choose one over the other.
Thank you
|--------+------+----------+----------|
| Format | Bits | Exponent | Fraction |
|--------+------+----------+----------|
| FP32   |   32 |        8 |       23 |
| FP16   |   16 |        5 |       10 |
| BF16   |   16 |        8 |        7 |
|--------+------+----------+----------|
Range
bfloat16: ~1.18e-38 … ~3.40e38, with about 3 significant decimal digits.
float16: ~5.96e-8 (smallest subnormal; smallest normal ~6.10e-5) … 65504, with about 4 significant decimal digits of precision.
bfloat16 is generally easier to use, because it works as a drop-in replacement for float32. If your code doesn't create nan/inf numbers or turn a non-0 into a 0 with float32, then it shouldn't do it with bfloat16 either, roughly speaking. So, if your hardware supports it, I'd pick that.
Check out AMP (automatic mixed precision) if you choose float16.
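A quick way to see both effects (range vs. precision) is to round-trip a few values through each dtype. The sketch below uses PyTorch simply because it ships both dtypes; any framework with bfloat16 support would do:

import torch

print(torch.finfo(torch.float16).max)    # 65504.0   -> narrow range, overflows easily
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38  -> same exponent range as float32

x = torch.tensor(70000.0)
print(x.to(torch.float16))               # inf       (out of float16 range)
print(x.to(torch.bfloat16))              # 70144.0   (in range, but coarsely rounded)

y = torch.tensor(1.001)
print(y.to(torch.float16))               # ~1.0010   (10 fraction bits keep the difference)
print(y.to(torch.bfloat16))              # 1.0       (7 fraction bits round it away)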

bit varying in Postgres to be queried by sub-string pattern

The following Postgres table contains some sample content where the binary data is stored as bit varying (https://www.postgresql.org/docs/10/datatype-bit.html):
ID | Binary data
----------------------
1 | 01110
2 | 0111
3 | 011
4 | 01
5 | 0
6 | 00011
7 | 0001
8 | 000
9 | 00
10 | 0
11 | 110
12 | 11
13 | 1
Q: Is there any query (either native SQL or a Postgres function) to return all rows whose binary data field equals one of the leading sub-strings (prefixes) of the target bit array? To make it clearer, let's look at the example search value 01101:
01101 -> no result
0110 -> no result
011 -> 3
01 -> 4
0 -> 5, 10
The result returned should contain the rows: 3, 4, 5 and 10.
Edit:
The working query is (thanks to Laurenz Albe):
SELECT * FROM table WHERE '01101' LIKE (table.binary_data::text || '%')
Furthermore I found this discussion about Postgres bit with fixed size vs bit varying helpful:
PostgreSQL Bitwise operators with bit varying "cannot AND bit strings of different sizes"
How about
WHERE '01101' LIKE (col2::text || '%')
I think you are looking for bitwise and:
where col2 & B'01101' = col2

Question on the structure of RTP Extension headers as explained in RFC 8285

In RFC 8285, which deals with RTP Header Extensions, the structure for a 1-byte header extension is as shown below (Section 4.2):
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       0xBE    |    0xDE       |           length=3            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  ID   | L=0   |     data      |  ID   |  L=1  |   data...
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      ...data   |    0 (pad)    |    0 (pad)    |  ID   | L=3   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          data                                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
I understand the 0xBEDE, which is explained in the RFC. Then comes the "length=3" field, which is followed by the actual extension elements. Each element consists of an ID followed by a length. A similar structure is defined for two-byte header extensions.
In both types of headers, I do not understand the "length=3" field. Is it just padding for the 32-bit boundary? If so, what purpose does this serve? Ease of parsing? Why not have the extension elements start immediately after the 0xBEDE? That would certainly have been more space-efficient.
Maybe I am missing something basic.
This probably dates back to RFC 3550. Specifying the length field explicitly like this allows clients that do not understand extensions to skip them more easily.
Also note that until it was extended by RFC 5285 (updated by RFC 8285) there could only be a single extension, so what you see is a backwards-compatibility hack.
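To illustrate why the explicit length word helps: a receiver that does not care about extensions just reads it and skips length*4 bytes in one step, while a receiver that does care walks the elements inside. A rough sketch of that walk in Python (a hypothetical helper, not taken from any RTP library):

def parse_one_byte_extensions(ext_block):
    # ext_block is the length*4 bytes that follow the 0xBEDE / length word,
    # which is exactly what an uninterested client can skip wholesale.
    i = 0
    while i < len(ext_block):
        byte = ext_block[i]
        if byte == 0:                    # 0x00 bytes are padding up to the 32-bit boundary
            i += 1
            continue
        ext_id = byte >> 4               # upper 4 bits: element ID
        size = (byte & 0x0F) + 1         # lower 4 bits: data length minus one ("L=0" means 1 byte)
        yield ext_id, ext_block[i + 1:i + 1 + size]
        i += 1 + size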

How to Create a CDF out of a PDF in SQL

So I have a datatable that looks something like the following. ID represents an object, bin represents how I am segmenting the data, and percent is how much of the data falls into that bin.
id bin percent
2 8 0.20030698388
2 16 0.14504988488
2 24 0.12356101304
2 32 0.09976976208
2 40 0.09056024558
2 48 0.07137375287
2 56 0.04067536454
2 64 0.03914044512
2 72 0.02916346891
2 80 0.16039907904
3 8 0.36316695352
3 16 0.03958691910
3 24 0.11876075731
3 32 0.13253012048
3 40 0.03098106712
3 48 0.07228915662
3 56 0.07745266781
3 64 0.02581755593
3 72 0.02065404475
3 80 0.11876075731
I am looking for a function to turn this dataset into a CDF, partitioned by id. I have tried CUME_DIST and PERCENT_RANK, but they do not appear to do what I want.
I am facing a similar problem and found this great tutorial for doing exactly that:
https://dwaincsql.com/2015/05/14/excel-in-t-sql-part-2-the-normal-distribution-norm-dist-density-functions/
It tries to rebuild the Excel NORM.DIST function, which gives you either the PDF (cumulative flag set to FALSE) or the CDF (cumulative flag set to TRUE). I assumed that CUME_DIST would do the exact same thing in SQL. However, it turns out that the latter computes its distribution by counting elements, whereas Excel works with the relative differences between the values.
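For what it's worth, for binned data like in the question the empirical CDF is just the running sum of the percent column within each id, taken in bin order (in SQL terms a windowed running SUM partitioned by id, rather than CUME_DIST). A small Python sketch of that computation on a few of the sample rows above:

from itertools import accumulate

rows = [  # (id, bin, percent) -- a few of the sample rows from the question
    (2, 8, 0.20030698388), (2, 16, 0.14504988488), (2, 24, 0.12356101304),
    (3, 8, 0.36316695352), (3, 16, 0.03958691910), (3, 24, 0.11876075731),
]

by_id = {}
for id_, bin_, pct in sorted(rows):              # order by (id, bin)
    by_id.setdefault(id_, []).append((bin_, pct))

for id_, pairs in by_id.items():
    running = accumulate(p for _, p in pairs)    # running sum of the PDF = CDF
    for (bin_, _), cdf in zip(pairs, running):
        print(id_, bin_, round(cdf, 5))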