What does a zlib header look like?

In my project I need to know what a zlib header looks like. I've heard it's rather simple but I cannot find any description of the zlib header.
For example, does it contain a magic number?

zlib magic headers
78 01 - No Compression/low
78 9C - Default Compression
78 DA - Best Compression

Link to RFC
0 1
+---+---+
|CMF|FLG|
+---+---+
CMF (Compression Method and flags)
This byte is divided into a 4-bit compression method and a 4-
bit information field depending on the compression method.
bits 0 to 3 CM Compression method
bits 4 to 7 CINFO Compression info
CM (Compression method)
This identifies the compression method used in the file. CM = 8
denotes the "deflate" compression method with a window size up
to 32K. This is the method used by gzip and PNG and almost everything else.
CM = 15 is reserved.
CINFO (Compression info)
For CM = 8, CINFO is the base-2 logarithm of the LZ77 window
size, minus eight (CINFO=7 indicates a 32K window size). Values
of CINFO above 7 are not allowed in this version of the
specification. CINFO is not defined in this specification for
CM not equal to 8.
In practice, this means the first byte is almost always 78 (hex)
FLG (FLaGs)
This flag byte is divided as follows:
bits 0 to 4 FCHECK (check bits for CMF and FLG)
bit 5 FDICT (preset dictionary)
bits 6 to 7 FLEVEL (compression level)
The FCHECK value must be such that CMF and FLG, when viewed as
a 16-bit unsigned integer stored in MSB order (CMF*256 + FLG),
is a multiple of 31.
FLEVEL (Compression level)
These flags are available for use by specific compression
methods. The "deflate" method (CM = 8) sets these flags as
follows:
0 - compressor used fastest algorithm
1 - compressor used fast algorithm
2 - compressor used default algorithm
3 - compressor used maximum compression, slowest algorithm
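As a quick illustration of the FCHECK rule quoted above, here is a minimal C sketch (not from the RFC; the helper name is made up) that validates the two header bytes:

#include <stdio.h>

/* Returns 1 if the two bytes form a valid zlib header (RFC 1950). */
static int is_valid_zlib_header(unsigned char cmf, unsigned char flg)
{
    if ((cmf & 0x0f) != 8)      /* CM must be 8 (deflate) */
        return 0;
    if ((cmf >> 4) > 7)         /* CINFO must be <= 7 */
        return 0;
    return ((cmf * 256u + flg) % 31) == 0;  /* FCHECK rule */
}

int main(void)
{
    printf("%d\n", is_valid_zlib_header(0x78, 0x9c)); /* 1: default compression */
    printf("%d\n", is_valid_zlib_header(0x78, 0x9d)); /* 0: bad FCHECK */
    return 0;
}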

ZLIB/GZIP headers
Level | ZLIB | GZIP
1 | 78 01 | 1F 8B
2 | 78 5E | 1F 8B
3 | 78 5E | 1F 8B
4 | 78 5E | 1F 8B
5 | 78 5E | 1F 8B
6 | 78 9C | 1F 8B
7 | 78 DA | 1F 8B
8 | 78 DA | 1F 8B
9 | 78 DA | 1F 8B
Deflate itself doesn't have a common header; a raw deflate stream begins directly with compressed data.

The ZLIB header (as defined in RFC1950) is a 16-bit, big-endian value - in other words, it is two bytes long, with the higher bits in the first byte and the lower bits in the second.
It contains these bitfields from most to least significant:
CINFO (bits 12-15, first byte)
Indicates the window size as a power of two, from 0 (256 bytes) to 7 (32768 bytes). This will usually be 7. Higher values are not allowed.
CM (bits 8-11)
The compression method. Only Deflate (8) is allowed.
FLEVEL (bits 6-7, second byte)
Roughly indicates the compression level, from 0 (fast/low) to 3 (slow/high)
FDICT (bit 5)
Indicates whether a preset dictionary is used. This is usually 0.
(1 is technically allowed, but I don't know of any Deflate formats that define preset dictionaries.)
FCHECK (bits 0-4)
A checksum (5 bits, 0..31), whose value is calculated such that the entire 16-bit value (CMF*256 + FLG) is an exact multiple of 31.*
Typically, only the CINFO and FLEVEL fields can be freely changed, and FCHECK must be calculated based on the final value. Assuming no preset dictionary, there is no choice in what the other fields contain, so a total of 32 possible headers are valid. Here they are:
FLEVEL:    0      1      2      3
CINFO:
  0      08 1D  08 5B  08 99  08 D7
  1      18 19  18 57  18 95  18 D3
  2      28 15  28 53  28 91  28 CF
  3      38 11  38 4F  38 8D  38 CB
  4      48 0D  48 4B  48 89  48 C7
  5      58 09  58 47  58 85  58 C3
  6      68 05  68 43  68 81  68 DE
  7      78 01  78 5E  78 9C  78 DA
The CINFO field is rarely, if ever, set by compressors to be anything other than 7 (indicating the maximum 32KB window), so the only values you are likely to see in the wild are the four in the bottom row (beginning with 78).
* (You might wonder if there's a small amount of leeway on the value of FCHECK - could it be set to either of 0 or 31 if both pass the checksum? In practice though, this can only occur if FDICT=1, so it doesn't feature in the above table.)
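For reference, the table can be reproduced mechanically. A short C sketch that enumerates all 32 valid headers (assuming FDICT = 0, and picking FCHECK = 0 when the value is already a multiple of 31):

#include <stdio.h>

int main(void)
{
    for (unsigned cinfo = 0; cinfo <= 7; cinfo++) {
        unsigned cmf = (cinfo << 4) | 8;        /* CM = 8 (deflate) */
        for (unsigned flevel = 0; flevel <= 3; flevel++) {
            unsigned flg = flevel << 6;         /* FDICT = 0, FCHECK = 0 */
            unsigned rem = (cmf * 256 + flg) % 31;
            if (rem)
                flg += 31 - rem;                /* fix up FCHECK */
            printf("%02X %02X  ", cmf, flg);
        }
        printf("\n");
    }
    return 0;
}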

Following is the Zlib compressed data format.
+---+---+
|CMF|FLG| (2 bytes - Defines the compression mode - More details below)
+---+---+
+---+---+---+---+
| DICTID | (4 bytes. Present only when FLG.FDICT is set.) - Mostly not set
+---+---+---+---+
+=====================+
|...compressed data...| (variable size of data)
+=====================+
+---+---+---+---+
| ADLER32 | (4 bytes: Adler-32 checksum of the uncompressed data)
+---+---+---+---+
Mostly, FLG.FDICT (the dictionary flag) is not set. In such cases the DICTID is simply not present, so the total header is just 2 bytes.
The header values (CMF and FLG) with no dictionary are defined as follows.
CMF | FLG
0x78 | 0x01 - No Compression/low
0x78 | 0x9C - Default Compression
0x78 | 0xDA - Best Compression
More at ZLIB RFC
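As a rough sketch of walking this framing in C (a hypothetical helper, assuming the whole stream is in memory and no error recovery is needed):

#include <stddef.h>

/* Locate the deflate payload inside a zlib-wrapped buffer.
 * Returns a pointer to the compressed data and stores its length;
 * the last 4 bytes of the buffer are the big-endian ADLER32. */
static const unsigned char *zlib_payload(const unsigned char *buf, size_t len,
                                         size_t *payload_len)
{
    size_t skip = 2;                                  /* CMF + FLG */
    if (len < 6 || (buf[0] & 0x0f) != 8 || (buf[0] * 256u + buf[1]) % 31)
        return NULL;                                  /* not a zlib stream */
    if (buf[1] & 0x20)                                /* FLG.FDICT: skip DICTID */
        skip += 4;
    if (len < skip + 4)
        return NULL;
    *payload_len = len - skip - 4;                    /* drop trailing ADLER32 */
    return buf + skip;
}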

All the answers here are most probably correct; however, if you want to manipulate a zlib compression stream directly, and it was produced by the gz_open, gzwrite, gzclose functions, then there is an extra 10-byte header before the compressed data. It is produced by gz_open and looks like this:
fprintf(s->file, "%c%c%c%c%c%c%c%c%c%c", gz_magic[0], gz_magic[1],
Z_DEFLATED, 0 /*flags*/, 0,0,0,0 /*time*/, 0 /*xflags*/, OS_CODE);
This results in the following hex dump: 1F 8B 08 00 00 00 00 00 00 0B,
followed by the compressed data (a raw deflate stream, with no 2-byte zlib header).
There are also 8 trailing bytes: a uLong CRC-32 over the whole uncompressed file, then a uLong uncompressed file size. Look for the following at the end of the stream:
putLong (s->file, s->crc);
putLong (s->file, (uLong)(s->in & 0xffffffff));
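A small sketch of decoding those 8 trailing bytes (putLong writes little-endian, so the fields are read back the same way):

#include <stdint.h>

/* Decode the 8-byte gzip trailer: CRC-32, then uncompressed size. */
static void read_gzip_trailer(const unsigned char *end8,
                              uint32_t *crc, uint32_t *isize)
{
    *crc   = (uint32_t)end8[0]       | (uint32_t)end8[1] << 8 |
             (uint32_t)end8[2] << 16 | (uint32_t)end8[3] << 24;
    *isize = (uint32_t)end8[4]       | (uint32_t)end8[5] << 8 |
             (uint32_t)end8[6] << 16 | (uint32_t)end8[7] << 24;
}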

Related

How do I calculate a webm's duration metadata value as hex?

I basically want to take a time value, convert it into a hexadecimal format value that the webm metadata can read.
Here is an example value:
44 89 88 40 D5 51 C0 00 00 00 00 = 00:00:21.831000000
How would I calculate this, can you provide examples?
It's a 64-bit double in milliseconds; it can be calculated with a converter such as binaryconvert.
Input example: 48000 would be 48 seconds flat, 100000 would be 1 minute 40 seconds, etc.
The hex output is shown on the left side of the site's output when you convert the value as binary.
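A small C sketch of the same conversion (assuming the duration is in milliseconds, which matches the example above, and that doubles are IEEE-754):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    double ms = 21831.0;                 /* 00:00:21.831, in milliseconds */
    uint64_t bits;
    memcpy(&bits, &ms, sizeof bits);     /* grab the raw IEEE-754 bits */

    /* 44 89 is the Duration element ID; 88 is the EBML size (8 bytes). */
    printf("44 89 88");
    for (int i = 7; i >= 0; i--)         /* the double is stored big-endian */
        printf(" %02X", (unsigned)((bits >> (i * 8)) & 0xff));
    printf("\n");                        /* 44 89 88 40 D5 51 C0 00 00 00 00 */
    return 0;
}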

Traversing a fat32 file system

I have formatted a thumbdrive with Fat32 and placed a file in the root directory named sampleFile.txt and with the contents "oblique". I looked at the drive in Disk Investigator and I found in the RootDirSector: sector 4096 the following
0040 53 41 4D 50 4C 45 7E 31 S A M P L E ~ 1 83 65 77 80 76 69 126 49
0048 54 58 54 20 00 36 81 5B T X T . 6 . [ 84 88 84 32 0 54 129 91
0050 2E 45 2E 45 00 00 89 5B . E . E . . . [ 46 69 46 69 0 0 137 91
0058 2E 45 03 00 07 00 00 00 . E . . . . . . 46 69 3 0 7 0 0 0
How do I find the location of the sector cluster where the actual data of the file is located? Here is some additional info:
Logical drive: G
Size: 3 Gb (approximately)
Logical sectors: 3889016
Bytes per sector: 1024
Sectors per Cluster: 8
Cluster size: 8192
File system: FAT32
Number of copies of FAT: 2
Sectors per FAT: 1899
Start sector for FAT1: 298
Start sector for FAT2: 2197
Root DIR Sector: 4096
Root DIR Cluster: 2
2-nd Cluster Start Sector: 4096
Ending Cluster: 485616
Media Descriptor: 248
Root Entries: 0
Heads: 255
Hidden sectors: 0
Backup boot sector: 6
Reserved sectors: 298
FS Info sector: 1
Sectors per track: 63
File system version: 0
SerialVolumeID: 4A95395B
Volume Label: NO NAME
The "Short File Name Entry" contains the starting cluster of the file. Because the test file is very small, it only requires a cluster disk space.
In this case, 8192 bytes for a 7 byte string. So therefore, the FAT does not matter, because the file is not span multiple clusters. However, your file entry is incomplete. A FAT32 file name entry is 32 bytes long.
Offset 1Ah contains the starting cluster (2 bytes length). If offset 14h (2 bytes length) contains a value, then 1Ah is the low word, 14h the high word of the starting cluster.
I'm not sure, but I think the system area is counted sector wise, the data area cluster wise. The data area begins after the fat2. Unusually, your disk has a sector size of 1024 bytes.
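A small C sketch of the sector calculation, using the numbers from the question (the directory entry gives low word 03 00 at offset 1Ah and high word 00 00 at offset 14h, i.e. starting cluster 3):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Values taken from the disk info in the question. */
    uint32_t reserved_sectors    = 298;
    uint32_t num_fats            = 2;
    uint32_t sectors_per_fat     = 1899;
    uint32_t sectors_per_cluster = 8;

    /* Starting cluster: low word at 1Ah (03 00), high word at 14h (00 00). */
    uint32_t cluster = (0x0000u << 16) | 0x0003u;

    uint32_t first_data_sector = reserved_sectors + num_fats * sectors_per_fat;
    uint32_t sector = first_data_sector + (cluster - 2) * sectors_per_cluster;

    printf("data starts at sector %u\n", sector);  /* 4096 + 8 = 4104 */
    return 0;
}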

reading a pdf version >= 1.5, how to handle Cross Reference Stream Dictionary

I'm trying to read the xref table of a pdf version >= 1.5.
the xref table is an object:
58 0 obj
<</DecodeParms<</Columns 4/Predictor 12>>/Filter/FlateDecode/ID[<CB05990F613E2FCB6120F059A2BCA25B><E2ED9D17A60FB145B03010B70517FC30>]/Index[38 39]/Info 37 0 R/Length 96/Prev 67529/Root 39 0 R/Size 77/Type/XRef/W[1 2 1]>>stream
hÞbbd``b`:$AD`­Ì ‰Õ Vˆ8âXAÄ×HÈ$€t¨ – ÁwHp·‚ŒZ$ìÄb!&F†­ .#5‰ÿŒ>(more here but can't paste)
endstream
endobj
as you can see:
/Filter /FlateDecode
/Index [38 39], that is, 39 entries in the stream
/W [1 2 1], that is, each entry is 1 + 2 + 1 = 4 bytes long
/Root 39 0 R, that is, the root object is number 39
BUT :
the decompressed stream is 195 bytes long (39 * 5 = 195). So is the length of an entry 4 or 5?
Here is the first inflated bytes
02 01 00 10 00 02 00 02 cd 00 02 00 01 51 00 02 00 01 70 00 02 00 05 7a 00 02
^^
if the entry length is 4, then the root entry is a free object (see the ^^)!
if the entry length is 5: how should the fields of one entry be interpreted (referring to the PDF Reference, chapter 3.4.7, table 3.16)?
For object 38, the first in the stream: it seems, as it is of type 2, to be the 16th object of object stream number 256, but there is no object 256 in my pdf file!
The question is: how shall I handle the 195 bytes ?
A compressed xref table may have been compressed with one of the PNG filters. If the /Predictor value is set to '10' or greater ("a Predictor value greater than or equal to 10 merely indicates that a PNG predictor is in use; the specific predictor function used is explicitly encoded in the incoming data")1, PNG row filters are supplied inside the compressed data "as usual" (i.e., in the first byte of each 'row', where the 'row' is of the width in /W).
Width [1 2 1] plus Predictor byte:
02 01 00 10 00
02 00 02 cd 00
02 00 01 51 00
02 00 01 70 00
02 00 05 7a 00
02 .. .. .. ..
After applying the row filters ('2', or 'up', for all of these rows), you get this:
01 00 10 00
01 02 dd 00
01 03 2e 00
01 04 9e 00
01 09 18 00
.. .. .. ..
Note: calculated by hand; I might have made the odd mistake here and there. Note that the PNG 'up' filter is a byte filter, and the result of the "up" filter is truncated to 8 bits for each addition.
This leads to the following Type 1 XRef references ("type 1 entries define objects that are in use but are not compressed (corresponding to n entries in a cross-reference table)."):2
#38 type 1: offset 10h, generation 0
#39 type 1: offset 2DDh, generation 0
#40 type 1: offset 32Eh, generation 0
#41 type 1: offset 49Eh, generation 0
#42 type 1: offset 918h, generation 0
1 See LZW and Flate Predictor Functions in PDF Reference 1.7, 6th Ed, Section 3.3: Filters.
2 As described in your Table 3.16 in PDF Ref 1.7.
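To make the unfiltering step concrete, here is a minimal C sketch (hypothetical helper names; it hard-codes the 'up' filter seen in this stream, while a general implementation would dispatch on each row's first byte):

#include <stdio.h>
#include <string.h>

#define ROW_W 4  /* sum of /W [1 2 1] */

/* Undo the PNG "up" filter on decoded xref-stream rows.
 * Each input row is 1 predictor byte + ROW_W data bytes. */
static void unfilter_up(const unsigned char *in, unsigned char *out, int rows)
{
    unsigned char prev[ROW_W] = {0};   /* row above the first row is all zero */
    for (int r = 0; r < rows; r++) {
        const unsigned char *row = in + r * (ROW_W + 1); /* row[0] = filter type */
        for (int c = 0; c < ROW_W; c++)
            prev[c] = (unsigned char)(row[1 + c] + prev[c]); /* mod-256 add */
        memcpy(out + r * ROW_W, prev, ROW_W);
    }
}

int main(void)
{
    const unsigned char in[] = {       /* first three rows from the question */
        0x02, 0x01, 0x00, 0x10, 0x00,
        0x02, 0x00, 0x02, 0xcd, 0x00,
        0x02, 0x00, 0x01, 0x51, 0x00,
    };
    unsigned char out[3 * ROW_W];
    unfilter_up(in, out, 3);
    for (int i = 0; i < (int)sizeof out; i++)
        printf("%02x%s", out[i], (i % ROW_W == ROW_W - 1) ? "\n" : " ");
    return 0;
}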

Extracting data from a .DLL: unknown file offsets

I'm currently trying to extract some data from a .DLL library - I've figured out the file structure (there are 1039 data blocks compressed with zlib, starting at offset 0x3c00, the last one being the fat table). The fat table itself is divided into 1038 "blocks" (8 bytes + a base64 encoded string - the filename). As far as I've seen, byte 5 is the length of the filename.
My problem is that I can't seem to understand what bytes 1-4 are used for: my first guess was that they were an offset to locate the file block inside the .DLL (mainly because the values are increasing throughout the table), but for instance, in this case, the first "block" is:
Supposed offset: 2E 78 00 00
Filename length: 30 00 00 00
Base64 encoded filename: 59 6D 46 30 64 47 78 6C 58 32 6C 75 64 47 56 79 5A 6D 46 6A 5A 56 78 42 59 33 52 70 64 6D 56 51 5A 58 4A 72 63 31 4E 6F 62 33 63 75 59 77 3D 3D
yet, as I said earlier, the block itself is at 0x3c00, so things don't match. Same goes for the second block (starting at 0x3f0b, whereas the table supposed offset is 0x167e)
Any ideas?
Answering my own question lol
Anyway, those numbers are the actual offsets of the file blocks, except that the first one starts from some arbitrary number rather than from the actual location of the first block. Aside from that, the differences between consecutive offsets do match the lengths of the corresponding blocks.
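A short C sketch of that rebasing idea (the second stored offset below is hypothetical, chosen so the output matches the observed 0x3f0b; only the constant delta matters):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Stored table offsets (little-endian in the file). The first block is
     * known to live at 0x3c00, so everything is shifted by a fixed delta. */
    uint32_t stored[]    = { 0x782e, 0x7b39 };   /* 0x7b39 is hypothetical */
    uint32_t first_block = 0x3c00;
    uint32_t delta       = stored[0] - first_block;

    for (int i = 0; i < 2; i++)
        printf("block %d at 0x%x\n", i, stored[i] - delta);
    return 0;
}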

What are the parts ECDSA entry in the 'known_hosts' file?

I'm trying to extract an ECDSA public key from my known_hosts file that ssh uses to verify a host. I have one below as an example.
This is the entry for "127.0.0.1 ecdsa-sha2-nistp256" in my known_hosts file:
AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBF3QCzKPRluwunLRHaFVEZNGCPD/rT13yFjKiCesA1qoU3rEp9syhnJgTbaJgK70OjoT71fDGkwwcnCZuJQPFfo=
I ran it through a Base64 decoder to get this:
���ecdsa-sha2-nistp256���nistp256���A]2F[rUF=wXʈ'ZSzħ2r`M::WL0rp
So I'm assuming those question marks are some kind of separator (no, those are lengths). I figured that nistp256 is the elliptical curve used, but what exactly is that last value?
From what I've been reading, the public key for ECDSA has a pair of values, x and y, which represent a point on the curve. Is there some way to extract x and y from there?
I'm trying to convert it into a Java public key object, but I need x and y in order to do so.
Not all of the characters are shown, since they are binary. Write the Base64-decoded value to a file and open it in a hex editor.
The public key for a P-256 curve should be a 65-byte array, starting with the byte value 4 (which means an uncompressed point). The next 32 bytes are the x value, and the following 32 the y value.
Here is the result in hexadecimal:
Signature algorithm:
00 00 00 13
65 63 64 73 61 2d 73 68 61 32 2d 6e 69 73 74 70 32 35 36
(ecdsa-sha2-nistp256)
Name of domain parameters:
00 00 00 08
6e 69 73 74 70 32 35 36
(nistp256)
Public key value:
00 00 00 41
04
5d d0 0b 32 8f 46 5b b0 ba 72 d1 1d a1 55 11 93 46 08 f0 ff ad 3d 77 c8 58 ca 88 27 ac 03 5a a8
53 7a c4 a7 db 32 86 72 60 4d b6 89 80 ae f4 3a 3a 13 ef 57 c3 1a 4c 30 72 70 99 b8 94 0f 15 fa
So you first have the name of the digital signature algorithm to use, then the name of the curve and then the public component of the key, represented by an uncompressed EC point. Uncompressed points start with 04, then the X coordinate (same size as the key size) and then the Y coordinate.
As you can see, all field values are preceded by four bytes indicating the size of the field. All values and fields are using big-endian notation.
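As a concrete illustration, here is a small C sketch that walks those length-prefixed fields (using the exact bytes from the dump above; helper names are made up) and prints x and y:

#include <stdio.h>
#include <stdint.h>

/* The Base64-decoded blob, byte for byte as in the dump above. */
static const unsigned char blob[] = {
    0x00,0x00,0x00,0x13,
    'e','c','d','s','a','-','s','h','a','2','-','n','i','s','t','p','2','5','6',
    0x00,0x00,0x00,0x08,
    'n','i','s','t','p','2','5','6',
    0x00,0x00,0x00,0x41,
    0x04,
    0x5d,0xd0,0x0b,0x32,0x8f,0x46,0x5b,0xb0,0xba,0x72,0xd1,0x1d,0xa1,0x55,0x11,0x93,
    0x46,0x08,0xf0,0xff,0xad,0x3d,0x77,0xc8,0x58,0xca,0x88,0x27,0xac,0x03,0x5a,0xa8,
    0x53,0x7a,0xc4,0xa7,0xdb,0x32,0x86,0x72,0x60,0x4d,0xb6,0x89,0x80,0xae,0xf4,0x3a,
    0x3a,0x13,0xef,0x57,0xc3,0x1a,0x4c,0x30,0x72,0x70,0x99,0xb8,0x94,0x0f,0x15,0xfa,
};

/* Read a 4-byte big-endian length, advance past the field, return its start. */
static const unsigned char *next_field(const unsigned char **p, uint32_t *len)
{
    const unsigned char *q = *p;
    *len = (uint32_t)q[0] << 24 | (uint32_t)q[1] << 16 |
           (uint32_t)q[2] << 8  | (uint32_t)q[3];
    *p = q + 4 + *len;
    return q + 4;
}

int main(void)
{
    const unsigned char *p = blob;
    uint32_t len;
    next_field(&p, &len);                              /* "ecdsa-sha2-nistp256" */
    next_field(&p, &len);                              /* "nistp256" */
    const unsigned char *point = next_field(&p, &len); /* the 65-byte EC point */
    if (len != 65 || point[0] != 0x04)                 /* 04 = uncompressed */
        return 1;
    printf("x = ");
    for (int i = 0; i < 32; i++) printf("%02x", point[1 + i]);
    printf("\ny = ");
    for (int i = 0; i < 32; i++) printf("%02x", point[33 + i]);
    printf("\n");
    return 0;
}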