Bzip2 block header: 1AY&SY - block

This is a question about the bzip2 archive format. Every bzip2 archive consists of a file header, one or more blocks, and a tail structure. Every block starts with "1AY&SY", the 6 bytes 0x314159265359, which are the digits of pi in BCD encoding. According to the bzip2 source:
/*--
A 6-byte block header, the value chosen arbitrarily
as 0x314159265359 :-). A 32 bit value does not really
give a strong enough guarantee that the value will not
appear by chance in the compressed datastream. Worst-case
probability of this event, for a 900k block, is about
2.0e-3 for 32 bits, 1.0e-5 for 40 bits and 4.0e-8 for 48 bits.
For a compressed file of size 100Gb -- about 100000 blocks --
only a 48-bit marker will do. NB: normal compression/
decompression do *not* rely on these statistical properties.
They are only important when trying to recover blocks from
damaged files.
--*/
The question is: is it true that every block in a bzip2 archive starts on a byte boundary? I mean archives created by the reference implementation, the bzip2-1.0.5+ utility.
I suspect that bzip2 parses the stream as a bit stream rather than a byte stream (the block itself is Huffman-encoded, which is not byte-aligned by design).
In other words: is the count reported by grep -c 1AY&SY always greater than (Huffman coding may produce 1AY&SY inside a block) or equal to the number of bzip2 blocks in the file?

BZIP2 looks at a bit stream.
From http://blastedbio.blogspot.com/2011/11/random-access-to-bzip2.html:
Anyway, the important bits are that a BZIP2 file contains one or more
"streams", which are byte aligned, each containing one (zero?) or more
"blocks", which are not byte aligned, followed by an end of stream
marker (the six bytes 0x177245385090 which is the square root of pi as
a binary coded decimal (BCD), a four byte checksum, and empty bits for
byte alignment).
The bzip2 Wikipedia article also alludes to bit-level block alignment (see the File Format section), which is in line with what I remember from school (I had to implement the algorithm...).
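Because blocks are not byte aligned, a byte-oriented search such as grep can miss block headers. A minimal sketch of a bit-level search follows, in Python; the function name find_block_magics is made up for this illustration, and it assumes bzip2 packs bits most-significant-bit first within each byte.

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # "1AY&SY", the 48-bit block header


def find_block_magics(stream: bytes) -> list[int]:
    """Return the bit offsets at which the 48-bit block magic occurs.

    Slides a 48-bit window over the stream one bit at a time,
    reading each byte most-significant bit first.
    """
    mask = (1 << 48) - 1
    window = 0
    bits_seen = 0
    offsets = []
    for byte in stream:
        for shift in range(7, -1, -1):
            window = ((window << 1) | ((byte >> shift) & 1)) & mask
            bits_seen += 1
            if bits_seen >= 48 and window == BLOCK_MAGIC:
                offsets.append(bits_seen - 48)
    return offsets


# Level 1 uses blocks of about 100k input bytes, so 320000 bytes need several blocks.
stream = bz2.compress(b"abcdefgh" * 40000, 1)
offsets = find_block_magics(stream)
byte_aligned = [o for o in offsets if o % 8 == 0]
print(len(offsets), "magic(s) found,", len(byte_aligned), "of them byte aligned")
```

The first block always follows the 4-byte stream header, so it is byte aligned; later blocks start wherever the previous block's bits happened to end.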

Related

What is the bit width of a single webassembly instruction?

I know that WebAssembly currently supports a 32-bit architecture, so I am supposing that, like RISCV32, its base instruction set has instructions that are 32 bits wide (of course, RISCV32 also supports 16-bit compressed instructions and 48-bit ones). RISC-V's instructions are interpreted mostly as little-endian (in terms of bit indices).
For example, in RISC-V, we can have an instruction like lui (load upper immediate to register), which embeds a 20-bit immediate into the instruction, has a 5-bit field to encode the destination register, and a 7-bit field to specify the opcode. Among other things, the opcode contains two bits at the beginning that indicate whether the instruction is compressed or not. This is laid out in the specification, where lui has the LUI opcode.
RISC-V instructions have a variety of layouts defined in the specification as well; for example, the lui instruction takes the "U" format, so we know exactly where the 20-bit field and the 5-bit destination register sit in the encoding.
What is the bit width of a wasm instruction? What are the possible layouts of a wasm instruction? Are there compressed instruction formats for webassembly, such as 16-bit instructions for very common operations?
If webassembly instructions are variable-width, how is the width of an instruction encoded for the interpreter?
Binary WASM bytecode has variable-length instructions, not fixed-width like a RISC CPU. https://en.wikipedia.org/wiki/WebAssembly#Code_representation has an example.
It's not intended to be executed directly, but rather JITed into native machine code, so a fixed-width format that would require multiple instructions for some 32- or 64-bit constants would make more work for the JIT optimizer. It would also be less compact in the WASM binary format and mean more instructions to parse.
Much better for the JIT optimizer to know the ultimate goal is to materialize a whole constant, since some ISAs will be able to do that in one instruction, and others will need it split up in different parts depending on the ISA. e.g. 20:12 for RISC-V, 16:16 for ARM movw/movk or MIPS, or if the constant only has set bits in a narrow region, ARM rotated immediates can maybe still use one instruction. Or AArch64 bit-pattern immediates can materialize a constant like 0x01010101 (or 0x0101010101010101) in a single 32-bit instruction.
TL:DR: Don't make the JIT put the pieces back together before breaking back down into asm that works for the target machine.
And in general, variable-length isn't much of a problem for a stream that will be parsed once by software anyway, not decoded repeatedly by hardware every time through a loop.
Examples
Many WebAssembly instructions take up one byte. For example, the left shift instructions i32.shl and i64.shl have the single-byte opcodes 0x74 and 0x86 with no subsequent values, while the i32.const instruction starts with 0x41 and takes from 2 to 6 bytes in total.
Instruction   Opcode
i32.const     0x41
i64.const     0x42
f32.const     0x43
f64.const     0x44
-             -
i32.shl       0x74
i64.shl       0x86
-             -
i32.eqz       0x45
i32.eq        0x46
i64.eqz       0x50
i64.eq        0x51
And so on. The values here are taken from the MDN website. See the Numeric Instructions.
Encoding Numbers
Some instructions, such as the const instructions above, take an immediate, which increases the overall size of the instruction. Integer immediates are encoded in LEB128, and the variant (signed or unsigned) depends on the integer type; the specification states which one applies.
LEB128 works roughly like this: the bits are padded to a multiple of seven and split into 7-bit groups, emitted least-significant group first, one per byte; the high bit of each byte signals whether more bytes follow. The encoded length is bounded by the maximum width of the type. Floating-point numbers are encoded in IEEE 754 format.
The const instructions are followed by the respective literal.
All other numeric instructions are plain opcodes without any immediates.
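The LEB128 scheme described above can be sketched in a few lines of Python. This is an illustration only; the function names encode_uleb128 and encode_sleb128 are made up here and do not come from any WebAssembly toolchain.

```python
def encode_uleb128(value: int) -> bytes:
    """Unsigned LEB128: 7 payload bits per byte, high bit set while more bytes follow."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative numbers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more groups follow
        else:
            out.append(byte)         # last group, high bit clear
            return bytes(out)


def encode_sleb128(value: int) -> bytes:
    """Signed LEB128: stops once the remaining bits are pure sign extension."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7  # arithmetic shift, keeps the sign
        sign_bit_set = bool(byte & 0x40)
        done = (value == 0 and not sign_bit_set) or (value == -1 and sign_bit_set)
        out.append(byte if done else byte | 0x80)
        if done:
            return bytes(out)


# i32.const is opcode 0x41 followed by a signed LEB128 immediate.
shortest = b"\x41" + encode_sleb128(0)
longest = b"\x41" + encode_sleb128(-2**31)
print(len(shortest), len(longest))
```

The two lengths printed at the end correspond to the smallest and largest i32.const encodings, which is where the "2 to 6 bytes" range comes from.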
Source: https://webassembly.github.io/spec/core/binary/instructions.html#numeric-instructions
Wasm instructions are represented with a unique opcode (typically 1 byte, more for newer instructions), followed by the encodings of immediate operands, for instructions that have them. There is no specific length; it depends on both the opcode and the immediate values.
For example:
i32.add is opcode 0x6A with no immediates;
i64.const i is opcode 0x42, followed by a variable-length encoding of i in LEB128 format;
br_table l* ld is opcode 0x0E, followed by a variable-length encoding of the length of l* in LEB128, followed by as many variable-length encodings of the label indices in l*, followed by the variable-length encoding of label index ld.
See the binary grammar in the specification for details. A Wasm decoder is essentially "parsing" the binary input according to this grammar.
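To make the "length depends on the opcode and its immediates" point concrete, here is a rough Python sketch that measures the length of a single instruction for a handful of opcodes mentioned in this thread. It is far from a full decoder; the function names and the NO_IMMEDIATE set are invented for this example, and any opcode outside the small subset is rejected.

```python
# Opcodes from this thread that carry no immediates.
NO_IMMEDIATE = {0x6A, 0x74, 0x86, 0x45, 0x46, 0x50, 0x51}


def skip_leb128(code: bytes, pos: int) -> int:
    """Return the position just past the LEB128 value starting at pos."""
    while code[pos] & 0x80:
        pos += 1
    return pos + 1


def read_uleb128(code: bytes, pos: int) -> tuple[int, int]:
    """Decode an unsigned LEB128 value; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = code[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def instruction_length(code: bytes, pos: int = 0) -> int:
    """Length in bytes of the instruction starting at pos (small subset only)."""
    opcode = code[pos]
    end = pos + 1
    if opcode in (0x41, 0x42):      # i32.const / i64.const: one LEB128 immediate
        end = skip_leb128(code, end)
    elif opcode == 0x43:            # f32.const: 4 bytes of IEEE 754
        end += 4
    elif opcode == 0x44:            # f64.const: 8 bytes of IEEE 754
        end += 8
    elif opcode == 0x0E:            # br_table: vector of labels, then default label
        count, end = read_uleb128(code, end)
        for _ in range(count + 1):
            end = skip_leb128(code, end)
    elif opcode not in NO_IMMEDIATE:
        raise ValueError(f"opcode {opcode:#04x} is outside this sketch's subset")
    return end - pos


print(instruction_length(bytes([0x6A])))                          # i32.add
print(instruction_length(bytes([0x0E, 0x02, 0x00, 0x01, 0x02])))  # br_table 0 1 2
```

A real decoder walks the whole function body this way, one instruction after another, which is why no separate length field is needed.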
Here are some citations from the current specification v2.0 related to the instructions (as "seen" by the specification itself):
some instructions also have static immediate arguments, typically
indices or type annotations, which are part of the instruction itself.
Some instructions are structured in that they bracket nested sequences of instructions.
In relation to the nesting:
Implementations typically impose additional restrictions on a number of aspects of a WebAssembly module or execution
Then, one of the noted implementation limitations is:
the nesting depth of structured control instructions
As the nesting depth of instructions is not strictly defined by the specification but left to the implementation, the specification itself places no limit on the length of an instruction, whether it is encoded as binary or text.
Even if we ignore the structured instructions (which we should not), many instructions take vectors as arguments. The vector length is limited to 2^32-1. If my memory serves me right, there was also an instruction taking a vector of vectors as an argument.

How to read a binary file with TCL

So I have a function I'm using to read data from a file. It works fine if the file is plain text, but when I try to read a binary file, like a png, it returns a different text (diff confirms that). I opened a hex editor to see what was wrong and found out it is putting some c2 bytes along with the file (I don't know if the position is random or if there are other bytes except this c2 one).
This is my function. I just want it to read and save to a variable.
proc read_file {path} {
    set channel [open $path r]
    fconfigure $channel -translation binary
    set return_string "[read $channel]"
    close $channel
    return "$return_string"
}
To actually print, I'm doing this:
puts -nonewline [read_file file.png]
When you open a file, it defaults to being in text mode. In text mode (which is really a combination of options) the IO layer translates characters from whatever encoding they are in into Tcl's internal encoding, and does the reverse operation on output. The default encoding scheme is platform specific, but in your case it sounds like it is UTF-8. (Tcl uses a complex internal system of encodings; it doesn't expose those to the outside world.)
By contrast, when you put the channel into binary mode, the bytes on the outside are directly mapped to characters in the range 0-255 (and vice versa on output). You get a perfect copy, provided you put both input and output channels in binary mode. (There are other optimisations for binary mode, but they don't matter here.)
When you only put one of the channels in binary mode, you get what looks like corruption. It isn't random though. In particular, when the input is binary but the output is UTF-8, input bytes in the range 128-255 get converted into multiple output bytes, where the first of those bytes is in the sort of range you observed. There are other combinations that mess things up; the whole range of problems is collectively known as mojibake.
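The effect is easy to reproduce outside Tcl. The following Python sketch only models the behaviour described above: a binary-mode read is approximated by a Latin-1 decode (each byte becomes the character with the same number) and a text-mode write by a UTF-8 encode. The helper name binary_in_utf8_out is made up for this illustration.

```python
def binary_in_utf8_out(raw: bytes) -> bytes:
    """Model of reading in binary mode and writing through a UTF-8 text channel."""
    chars = raw.decode("latin-1")   # bytes 0-255 map one-to-one to characters 0-255
    return chars.encode("utf-8")    # characters 128-255 become two bytes each


png_start = bytes([0x89, 0x50, 0x4E, 0x47])  # first four bytes of the PNG signature
print(binary_in_utf8_out(png_start).hex(" "))
```

The 0x89 byte at the start of every PNG comes out as the pair c2 89, which matches the stray c2 bytes seen in the hex editor. In Tcl itself, the likely fix for the script in the question is to put the output channel in binary mode as well, for example with fconfigure stdout -translation binary before the puts call.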
tl;dr Don't mix up binary and text data unless you're very careful. The results of getting it wrong are "surprising".

How to manually construct a gzip so that compressed file is larger than original?

Suppose there is a 1KB file called data.bin. Is it possible to construct a gzip of it, data.bin.gz, that is much larger than the original, and if so, how?
How much larger could we theoretically get in GZIP format?
You can make it arbitrarily large. Take any gzip file and insert as many repetitions as you like of the five bytes: 00 00 00 ff ff after the gzip header and before the deflate data.
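A small Python sketch of this trick follows. To keep the header a known 10 bytes, it builds the gzip member by hand with no optional fields; the helper names make_gzip and pad_gzip are invented for this example.

```python
import gzip
import struct
import zlib

EMPTY_STORED_BLOCK = bytes.fromhex("000000ffff")  # non-final stored block, LEN=0


def make_gzip(data: bytes) -> bytes:
    """Build a gzip member with a bare 10-byte header (FLG=0, no optional fields)."""
    header = b"\x1f\x8b\x08\x00" + b"\x00" * 4 + b"\x00\xff"
    compressor = zlib.compressobj(9, zlib.DEFLATED, -15)  # raw deflate, no zlib wrapper
    body = compressor.compress(data) + compressor.flush()
    trailer = struct.pack("<II", zlib.crc32(data), len(data) & 0xFFFFFFFF)
    return header + body + trailer


def pad_gzip(gz: bytes, repeats: int) -> bytes:
    """Insert empty stored blocks between the gzip header and the deflate data."""
    if gz[3] != 0:
        raise ValueError("this sketch only handles headers without optional fields")
    return gz[:10] + EMPTY_STORED_BLOCK * repeats + gz[10:]


data = b"x" * 1024
small = make_gzip(data)
large = pad_gzip(small, 100_000)
print(len(small), len(large), gzip.decompress(large) == data)
```

Each inserted block adds 5 bytes and contributes nothing to the output, so the archive can grow without bound while still decompressing to the original 1KB.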
Summary:
With header fields/general structure: effect is unlimited unless it runs into software limitations
Empty blocks: unlimited effect by format specification
Uncompressed blocks: effect is limited to 6x
Compressed blocks: with apparent means, the maximum effect is estimated at 1.125x and is very hard to achieve
Take the gzip format (RFC1952 (metadata), RFC1951 (deflate format), additional notes for GNU gzip) and play with it as much as you like.
Header
There are a whole bunch of places to exploit:
use optional fields (original file name, file comment, extra fields)
bluntly append garbage (GNU gzip will issue a warning when decompressing)
concatenate multiple gzip archives (the format allows that; the resulting uncompressed data is, likewise, the concatenation of all chunks).
An interesting side effect (a bug in GNU gzip, apparently): gzip -l takes the reported uncompressed size from the last chunk only (even if it's garbage) rather than adding up values from all. So you can make it look like the archive is (absurdly) larger/smaller than raw data.
These are the ones that are immediately apparent, you may be able to find yet other ways.
Data
The general layout of "deflate" format is (RFC1951):
A compressed data set consists of a series of blocks, corresponding to
successive blocks of input data. The block sizes are arbitrary,
except that non-compressible blocks are limited to 65,535 bytes.
<...>
Each block consists of two parts: a pair of Huffman code trees that
describe the representation of the compressed data part, and a
compressed data part. (The Huffman trees themselves are compressed
using Huffman encoding.) The compressed data consists of a series of
elements of two types: literal bytes (of strings that have not been
detected as duplicated within the previous 32K input bytes), and
pointers to duplicated strings, where a pointer is represented as a
pair <length, backward distance>. The representation used in the
"deflate" format limits distances to 32K bytes and lengths to 258
bytes, but does not limit the size of a block, except for
uncompressible blocks, which are limited as noted above.
Empty blocks
The 00 00 00 ff ff that Mark Adler suggests is essentially an empty, non-final block (RFC1951 section 3.2.3. for the 1st byte, 3.2.4. for the uncompressed block itself).
Btw, according to gzip overview at the official site and the source code, Mark is the author of the decompression part...
Uncompressed blocks
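With non-empty uncompressed blocks (see the previous section for references), the most you can do is create one block per input byte. Each stored block carries 5 bytes of overhead (one header byte plus the 4-byte LEN/NLEN pair), so every input byte costs 6 output bytes and the effect is limited to 6x.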
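The one-block-per-byte layout can be written out directly. A minimal Python sketch, with the helper name stored_block_per_byte invented for this example, producing a raw deflate stream that zlib accepts:

```python
import struct
import zlib


def stored_block_per_byte(data: bytes) -> bytes:
    """Raw deflate stream that wraps every input byte in its own stored block."""
    if not data:
        return bytes.fromhex("010000ffff")  # a single empty, final stored block
    out = bytearray()
    last = len(data) - 1
    for i, value in enumerate(data):
        out.append(1 if i == last else 0)      # BFINAL bit, BTYPE=00, padding
        out += struct.pack("<HH", 1, 0xFFFE)   # LEN=1 and its one's complement
        out.append(value)                      # the single literal byte
    return bytes(out)


data = b"hello, deflate"
raw = stored_block_per_byte(data)
print(len(data), len(raw), zlib.decompress(raw, wbits=-15) == data)
```

Every input byte turns into exactly six output bytes, which is the 6x ceiling for non-empty stored blocks.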
Compressed blocks
In a nutshell: some inflation is achievable but it's very hard and the achievable effect is limited. Don't waste your time on them unless you have a very good reason.
Inside compressed blocks (section 3.2.5.), each chunk is [<encoded character (8-9 bits)>|<encoded chunk length (7-11 bits)><distance back to data (5-18 bits)>], with lengths starting at 3. A 7-9-bit code unambiguously resolves to a literal character or a specific range of lengths. Longer codes correspond to larger lengths/distances. No space/meaningless stuff is allowed between chunks.
So the maximum for raw byte chunks is 9/8 (1.125x) - if all the raw bytes are with codes 144 - 255.
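The 9/8 figure can be checked by hand-encoding a fixed-Huffman block made of literals only. The Python sketch below follows the fixed code table of RFC 1951 section 3.2.6 (8-bit codes for literals 0-143, 9-bit codes for 144-255, a 7-bit end-of-block code); the function name fixed_huffman_literals is invented for this example.

```python
import zlib


def fixed_huffman_literals(data: bytes) -> bytes:
    """Encode data as one final fixed-Huffman deflate block using literals only."""
    bits = []  # individual bits in stream order

    def put_lsb_first(value: int, width: int) -> None:
        # Header fields are written least-significant bit first.
        for i in range(width):
            bits.append((value >> i) & 1)

    def put_code(code: int, width: int) -> None:
        # Huffman codes are written most-significant bit first.
        for i in reversed(range(width)):
            bits.append((code >> i) & 1)

    put_lsb_first(1, 1)  # BFINAL = 1
    put_lsb_first(1, 2)  # BTYPE = 01, fixed Huffman codes
    for value in data:
        if value < 144:
            put_code(0x30 + value, 8)            # 00110000 .. 10111111
        else:
            put_code(0x190 + (value - 144), 9)   # 110010000 .. 111111111
    put_code(0, 7)       # end-of-block symbol 256

    while len(bits) % 8:
        bits.append(0)   # pad the last byte
    out = bytearray()
    for i in range(0, len(bits), 8):
        out.append(sum(bit << j for j, bit in enumerate(bits[i:i + 8])))
    return bytes(out)


data = bytes([200]) * 1000   # every byte falls in the 9-bit range 144-255
raw = fixed_huffman_literals(data)
print(len(raw) / len(data), zlib.decompress(raw, wbits=-15) == data)
```

The ratio approaches 1.125 as the input grows, since the 10 bits of block header and end-of-block code become negligible.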
Playing with reference chunks isn't going to do any good for you: even a reference to a 3-byte sequence gives 25/24 (1.04x) at most.
That's it for static Huffman tables. Looking through the docs on dynamic ones, it optimizes the aforementioned encoding for the specific data or something. So, it should allow to make the ratio for the given data closer to the achievable maximum, but that's it.

Redis int representation of a string is bigger when the string is more than 7 bytes but smaller otherwise

I'm trying to reduce Redis's objects size as much as I can and I've taken this whole week to experiment with it.
While testing different data representations I found out that an int representation of the string "hello" results in a smaller object. It may not look like much, but with a lot of data it can make the difference between using a few GB of memory and dozens of GB.
Look at the following example (you can try it yourself if you want):
> SET test:1 "hello"
> debug object test:1
> Value at:0xb6c9f380 refcount:1 encoding:raw serializedlength:6 lru:9535350 lru_seconds_idle:7
In particular you should look at serializedlength which is 6 (bytes) in this case.
Now, look at the following int representation of it:
> SET test:2 "857715"
> debug object test:2
> Value at:0xb6c9f460 refcount:1 encoding:int serializedlength:5 lru:9535401 lru_seconds_idle:2
As you see, it results in a byte shorter object (note also encoding:int which I think is suggesting that ints get handled in a more efficient way).
With the string "hello w" (you'll see in a few moments why I didn't use "hello world" instead) we get an even bigger saving when it's represented as an int:
> SET test:3 "hello w"
> SET test:4 "857715023" <- Int representation. Notice that I inserted a "0", if I don't, it results in a bigger object and the encoding is set to "raw" instead (after all a space is not an int).
>
> debug object test:3
> Value at:0xb6c9f3a0 refcount:1 encoding:raw serializedlength:8 lru:9535788 lru_seconds_idle:6
> debug object test:4
> Value at:0xb6c9f380 refcount:1 encoding:int serializedlength:5 lru:9535809 lru_seconds_idle:5
It looks cool as long as you don't exceed a 7-byte string. Look at what happens with the int representation of "hello wo":
> SET test:5 "hello wo"
> SET test:6 "85771502315"
>
> debug object test:5
> Value at:0xb6c9f430 refcount:1 encoding:raw serializedlength:9 lru:9535907 lru_seconds_idle:9
> debug object test:6
> Value at:0xb6c9f470 refcount:1 encoding:raw serializedlength:12 lru:9535913 lru_seconds_idle:5
As you can see the int (12 bytes) is bigger than the string representation (9 bytes).
My question here is, what's going on behind the scenes when you represent a string as an int, that it is smaller until you reach 7 bytes?
Is there a way to increase this limit as you do with "list-max-ziplist-entries/list-max-ziplist-value" or a clever way to optimize this process so that it always (or nearly) results in a smaller object than a string?
UPDATE
I've further experimented with other tricks, and you can actually have smaller ints than strings, regardless of size, but that involves a little more work on the data structure modelling side.
I've found out that if you split the int representation of a string in chunks of ~8 numbers each, it ends up being smaller.
Take as an example the word "Hello World Hi Universe" and create both a string and int SET:
> HMSET test:7 "Hello" "World" "Hi" "Universe"
> HMSET test:8 "74111114" "221417113" "78" "2013821417184"
The results are as follows:
> debug object test:7
> Value at:0x7d12d600 refcount:1 encoding:ziplist serializedlength:40 lru:9567096 lru_seconds_idle:296
>
> debug object test:8
> Value at:0x7c17d240 refcount:1 encoding:ziplist serializedlength:37 lru:9567531 lru_seconds_idle:2
As you can see we got the int set smaller by 3 bytes.
The problem in this will be how to organize such a thing, but it shows that it's possible nonetheless.
Still, don't know where this limit is set. The ~700K persistent use of memory (even when you have no data inside) makes me think that there is a pre-defined "pool" dedicated to the optimization of int sets.
UPDATE2
I think I've found where this intset "pool" is defined in Redis source.
At line 81 in the file redis.h there is the define REDIS_SHARED_INTEGERS, set to 10000:
REDIS_SHARED_INTEGERS
I suspect it's the one defining the limit of an intset byte length.
I have to try to recompile it with a higher value and see if I can use a longer int value (it'll most probably allocate more memory if it's the one I think of).
UPDATE3
I want to thank Antirez for the reply! Didn't expect that.
As he made me notice, len != memory usage.
I got further in my experiment and saw that the objects get already slightly compressed (serialized). I may have missed something from the Redis documentation.
The confirmation comes from analyzing a Redis key with the command redis-memory-for-key key, which actually returns the memory usage and not the serialized length.
For example, let's take the "hello" string and int we used before, and see what's the result:
~ # redis-memory-for-key test:1
Key "test:1"
Bytes 101
Type string
~ #
~ # redis-memory-for-key test:2
Key "test:2"
Bytes 87
Type string
As you can notice the intset is smaller (87 bytes) than the string (101 bytes) anyway.
UPDATE4
Surprisingly, a longer intset seems to affect its serializedlength but not its memory usage.
This makes it possible to build a 2-digit-per-char mapping that is still more memory efficient than a string, without even chunking it.
By 2-digit-per-char mapping I mean that instead of mapping "hello" to "85121215" we map each character to a fixed two digits, prefixing a "0" when the number is below 10, giving "0805121215".
A custom script would then take the digits two at a time and convert each pair to its equivalent char:
08 05 12 12 15
\ | | | /
h e l l o
This is enough to avoid ambiguity (like "o" and "ae", which both result in "15").
I'll show you this works by creating another set and therefore analyzing its memory usage like I did before:
> SET test:9 "0805070715"
Unix shell
----------
~ # redis-memory-for-key test:9
Key "test:9"
Bytes 87
Type string
You can see that we have a memory win here.
The same "hello" string compressed with Smaz for comparison:
>>> smaz.compress('hello')
'\x10\x98\x06'
// test:10 would be unfair as it results in a byte longer object
SET post:1 "\x10\x98\x06"
~ # redis-memory-for-key post:1
Key "post:1"
Bytes 99
Type string
My question here is, what's going on behind the scenes when you represent a
string as an int, that it is smaller until you reach 7 bytes?
Notice that the integer you supplied as test #6 is no longer actually encoded
as an integer, but as raw:
SET test:6 "85771502315"
Value at:0xb6c9f470 refcount:1 encoding:raw serializedlength:12 lru:9535913 lru_seconds_idle:
So we see that a "raw" value occupies one byte plus the length of its string representation. In memory
you get that plus the overhead of the value.
The integer encoding, I suspect, encodes a number as a 32-bit integer; then it will always
need five bytes, one to tell its type, and four to store those 32 bits.
As soon as you overflow the maximum representable integer in 32 bits, which is either 2 billions or 4 depending on whether you use a sign or not, you need to revert to raw encoding.
So probably
2147483647 -> five bytes (TYPE_INT 0x7F 0xFF 0xFF 0xFF)
2147483649 -> eleven bytes (TYPE_RAW '2' '1' '4' '7' '4' '8' '3' '6' '4' '9')
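That guess can be written down as a tiny model and compared with the serializedlength values reported in the question. This Python sketch encodes only the hypothesis above, not Redis's actual serialization code; the function name guessed_serialized_length is invented for this example.

```python
def guessed_serialized_length(value: str) -> int:
    """Model: 5 bytes if the value is a canonical signed 32-bit integer, else 1 + length."""
    try:
        number = int(value)
    except ValueError:
        number = None
    fits_int32 = (
        number is not None
        and str(number) == value          # rejects spaces, leading zeros, "+5", etc.
        and -2**31 <= number < 2**31
    )
    if fits_int32:
        return 5                          # one type byte plus four value bytes
    return 1 + len(value.encode())        # one length byte plus the raw string


for sample in ["hello", "857715", "hello w", "857715023", "hello wo", "85771502315"]:
    print(repr(sample), guessed_serialized_length(sample))
```

For the six values in the question the model yields 6, 5, 8, 5, 9 and 12, the same numbers DEBUG OBJECT reported, which supports the 32-bit explanation without proving it.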
Now, how can you squeeze a string representation PROVIDED THAT YOU ONLY USE AN ASCII SET?
You can get the string (140 characters):
When in the Course of human events it becomes necessary for one people
to dissolve the political bands which have connected them with another
and convert each character to a six-bit representation; basically its index in the string
"ABCDEFGHIJKLMNOPQRSTUVWXYZ01234 abcdefghijklmnopqrstuvwxyz56789."
which is the set of all the characters you can use.
You can now encode four such "text-only characters" in three "binary characters", a sort of "reverse base 64 encoding"; base64 encoding will get three binary characters and create a four-byte sequence of ASCII characters.
If we were to code it as groups of integers, we would save a few bytes - maybe get it down
to 130 bytes - at the cost of a larger overhead.
With this type of "reverse base64" encoding, we can get 140 character to 35 groups of four characters, which become a string of 35x3 = 105 binary characters, raw encoded to 106 bytes.
As long, I repeat, as you never use characters outside the range above. If you do, you can
enlarge the range to 128 characters and 7 bits, thus saving 12.5% instead of 25%; 140 characters will then become 123 bytes (140 x 7 / 8 = 122.5, rounded up), raw encoded to 124 bytes, and you save (141-124) = 17 bytes.
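The six-bit "reverse base64" packing can be sketched as follows in Python. The alphabet is the 64-character string given above; the function names pack6 and unpack6 are made up for this illustration, and the decoder is assumed to know the original length so it can ignore padding bits.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ01234 abcdefghijklmnopqrstuvwxyz56789."


def pack6(text: str) -> bytes:
    """Pack each character into 6 bits, so four characters fill three bytes."""
    acc = 0      # bit accumulator
    nbits = 0    # number of pending bits in acc
    out = bytearray()
    for ch in text:
        acc = (acc << 6) | ALPHABET.index(ch)  # ValueError if ch is not allowed
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # zero-pad the final byte
    return bytes(out)


def unpack6(packed: bytes, length: int) -> str:
    """Reverse pack6; length is the number of characters originally packed."""
    acc = 0
    nbits = 0
    chars = []
    for byte in packed:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 6 and len(chars) < length:
            nbits -= 6
            chars.append(ALPHABET[(acc >> nbits) & 0x3F])
        acc &= (1 << nbits) - 1
    return "".join(chars)


text = "When in the Course of human events"
packed = pack6(text)
print(len(text), len(packed), unpack6(packed, len(text)) == text)
```

The packed bytes are arbitrary binary, so on the Redis side they would be stored as a raw string value.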
Compression
If you have much longer strings, you can compress them (i.e., use a function such as deflate(), gzencode() or gzcompress()). Compressed directly, the above string becomes 123 bytes. Easy to do.
Compressing many small strings: the Rube Goldberg approach
Since compression algorithms learn as they go and can assume nothing at the start, small strings will not compress well. They're "all beginning", so to speak. Just like an engine running cold, performance is worse.
If you have a "corpus" of text these strings come from, you can use a time-consuming trick that "warms up" the compression engine and may double (or better) its performances.
Suppose you have two strings, COMMON and TARGET (the second one is the one you're interested in). If you z-compressed COMMON you would get, say, ZCMN. If you compressed TARGET you would get ZTRGT.
But as I said, since the gz compression algorithm is stream oriented, and it learns as it goes by, the compression ratio of the second half of any text (provided there aren't freakish statistical distribution changes between halves) is always appreciably higher than that of the first half.
So if you were to compress, say, COMMONTARGET, you'd get ZCMGHQI.
Notice that the first part of the string, as far as almost the end, is the same as before. Indeed if you compressed COMMONFOOBAR, you'd get something like ZCMQKL. And the second part is compressed better than before, even if we count the area of overlap as belonging entirely to the second string.
And this is the trick. Given a family of strings (TARGET, FOOBAR, CASTLE BRAVO), we compress not the strings, but the concatenation of those strings with a large prefix. Then we discard from the result the common compressed prefix. Thus TARGET is taken from the compression of COMMONTARGET (which is ZCMGHQI), and becomes GHQI instead of ZTRGT, with a 20% gain.
The decoder does the reverse: given GHQI, it first applies the common compressed prefix ZCM (which it must know); then it decodes the result, and finally discards the common uncompressed prefix, of which it need only know the length beforehand.
So the first sentence above (140 characters) becomes 123 when compressed by itself; if I take the rest of the Declaration and use it as a prefix, it compresses to 3355 bytes. This prefix plus my 140 bytes becomes 3409 bytes, of which 3352 are common, leaving 57 bytes.
At the cost of storing once the uncompressed prefix in the encoder, and the compressed prefix once in the decoder, and the whole thingamajig running five times as slow, I can now get those 140 bytes down to 57 instead of 123 - less than half of before.
This trick works great for small strings; for larger ones, the advantage isn't worth the pain. Also, different prefixes yield different results. The best prefixes are those that contain most of the sequences that are likely to appear in the string pool, ordered by increasing length.
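A closely related way to get the same "warm engine" effect is zlib's preset dictionary feature. Note that this is a different technique from the cut-the-common-compressed-prefix procedure described above, swapped in because it is directly supported: the shared prefix is handed to both compressor and decompressor as a dictionary instead of being compressed and stripped. A minimal Python sketch, with helper names invented for this example:

```python
import zlib


def compress_with_prefix(target: bytes, prefix: bytes) -> bytes:
    """Compress target with prefix preloaded as a zlib preset dictionary."""
    compressor = zlib.compressobj(level=9, zdict=prefix)
    return compressor.compress(target) + compressor.flush()


def decompress_with_prefix(packed: bytes, prefix: bytes) -> bytes:
    """Reverse compress_with_prefix; the same prefix must be supplied."""
    decompressor = zlib.decompressobj(zdict=prefix)
    return decompressor.decompress(packed) + decompressor.flush()


prefix = (b"When in the Course of human events it becomes necessary for one people "
          b"to dissolve the political bands which have connected them with another")
target = b"it becomes necessary for one people to dissolve the political bands"

plain = zlib.compress(target, 9)
primed = compress_with_prefix(target, prefix)
print(len(target), len(plain), len(primed))
assert decompress_with_prefix(primed, prefix) == target
```

As with the approach above, both sides must hold the same prefix, and the gain depends on how much of the target's content already appears in it.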
Added bonus: the compressed prefix also doubles as a sort of weak encryption, as without that, you can't easily decode the compressed strings, even if you might be able to recover some pieces thereof.

dll files compared to gzip files

Okay, the title isn't very clear.
Given a byte array (read from a database blob) that represents EITHER the sequence of bytes contained in a .dll or the sequence of bytes representing the gzip'd version of that dll, is there a (relatively) simple signature that I can look for to differentiate between the two?
I'm trying to puzzle this out on my own, but I've discovered I can save a lot of time by asking for help. Thanks in advance.
Check whether its first two bytes are the gzip magic number 0x1f 0x8b (see RFC 1952). Or just try to gunzip it; the operation will fail if the data is not gzip'd.
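A minimal sketch of the signature check, in Python for illustration. It also tests for the "MZ" signature that DLLs (like all PE files) begin with; the function name classify_blob is invented for this example.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # ID1, ID2 from RFC 1952
PE_MAGIC = b"MZ"          # DOS header signature at the start of a DLL


def classify_blob(blob: bytes) -> str:
    """Return "gzip", "dll" or "unknown" based on the first two bytes."""
    if blob[:2] == GZIP_MAGIC:
        return "gzip"
    if blob[:2] == PE_MAGIC:
        return "dll"
    return "unknown"


fake_dll = b"MZ\x90\x00" + b"\x00" * 60  # stand-in for real DLL bytes
print(classify_blob(fake_dll), classify_blob(gzip.compress(fake_dll)))
```

Since the two signatures differ in the very first byte, the check cannot confuse a raw DLL with a gzip'd one.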
A gzip file should be fairly straightforward to identify, as it consists of a header, a footer and some other distinguishable elements in between.
From Wikipedia:
"gzip" is often also used to refer to the gzip file format, which is:
a 10-byte header, containing a magic number, a version number and a time stamp
optional extra headers, such as the original file name
a body, containing a DEFLATE-compressed payload
an 8-byte footer, containing a CRC-32 checksum and the length of the original uncompressed data
You might also try determining if the gzip contains any records/entries as each will also have their own header.
You can find specific information on this file format (specifically the member header which is linked) here.