What does the "Apollo file type information" extra field in GZIP do? - gzip

In GZIP there is an optional extra field. The only subfield type defined in RFC 1952 is "Apollo file type information". The RFC gives its SI1 and SI2 bytes ('A', 'P'), but it says nothing about its content. I couldn't find any information on Google either.
What is its content? Can you explain its format to me?
Thanks

See https://jim.rees.org/apollo-archive/ . In there you can find https://jim.rees.org/apollo-archive/gzip-1.2.4-patch.tar.Z , which patches gzip to process the Apollo extra field in order to save and restore Apollo-unique file attributes.
By the way, there is a typo in RFC 1952 for that ID. It says:
0x41 ('A') 0x70 ('P')
The hex values are correct, but the second letter is not. It should read:
0x41 ('A') 0x70 ('p')
This is all for an ancient workstation.

Related

What GZip extra field subfields exist?

RFC 1952 (GZIP File Format Specification) section 2.3.1.1 reads:
2.3.1.1. Extra field
If the FLG.FEXTRA bit is set, an "extra field" is present in
the header, with total length XLEN bytes. It consists of a
series of subfields, each of the form:
+---+---+---+---+==================================+
|SI1|SI2|  LEN  |... LEN bytes of subfield data ...|
+---+---+---+---+==================================+
SI1 and SI2 provide a subfield ID, typically two ASCII letters
with some mnemonic value. Jean-Loup Gailly
<email#hidden> is maintaining a registry of subfield
IDs; please send him any subfield ID you wish to use. Subfield
IDs with SI2 = 0 are reserved for future use. The following
IDs are currently defined:
SI1         SI2         Data
----------  ----------  ----
0x41 ('A')  0x70 ('P')  Apollo file type information
LEN gives the length of the subfield data, excluding the 4
initial bytes.
Do any subfield types exist beyond the AP given in the RFC? A web search doesn't find a list; neither is there any mention on GZip's Wikipedia page, the GNU homepage, in the gzip source code, or on Stack Overflow.
As far as I know, there is no such registry being maintained. Jean-loup no longer works on gzip.
Here is one more subfield in use:
The BGZF format, which is gzip-conformant and was developed for use in bioinformatics, uses the subfield type "BC" to indicate the size of the current block. This makes parallel decompression easy.
From the specification at http://samtools.github.io/hts-specs/SAMv1.pdf :
Each BGZF block contains a standard gzip file header with the following standard-compliant extensions:
The F.EXTRA bit in the header is set to indicate that extra fields are present.
The extra field used by BGZF uses the two subfield ID values 66 and 67 (ASCII ‘BC’).
The length of the BGZF extra field payload (field LEN in the gzip specification) is 2 (two bytes of payload).
The payload of the BGZF extra field is a 16-bit unsigned integer in little endian format. This integer gives the size of the containing BGZF block minus one.
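As a rough illustration of the subfield layout quoted from RFC 1952 and the BGZF "BC" payload, here is a minimal Python sketch. The function names iter_extra_subfields and bgzf_block_size are hypothetical, and the sketch assumes the input starts at the first byte of a gzip member.

```python
import struct

FEXTRA = 0x04  # FLG bit that signals an extra field (RFC 1952)


def iter_extra_subfields(extra: bytes):
    """Yield (subfield_id, data) pairs from the bytes of a gzip extra field."""
    pos = 0
    while pos + 4 <= len(extra):
        # Each subfield is SI1, SI2, then a 16-bit little-endian LEN.
        si1, si2, length = struct.unpack_from("<BBH", extra, pos)
        pos += 4
        data = extra[pos:pos + length]
        if len(data) != length:
            raise ValueError("truncated subfield")
        yield bytes([si1, si2]), data
        pos += length


def bgzf_block_size(block: bytes) -> int:
    """Return the total size of a BGZF block, read from its 'BC' subfield."""
    if block[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip member")
    if not block[3] & FEXTRA:
        raise ValueError("FEXTRA is not set")
    # The fixed header is 10 bytes; XLEN follows as 16-bit little-endian.
    (xlen,) = struct.unpack_from("<H", block, 10)
    extra = block[12:12 + xlen]
    for subfield_id, data in iter_extra_subfields(extra):
        if subfield_id == b"BC" and len(data) == 2:
            (size_minus_one,) = struct.unpack("<H", data)
            return size_minus_one + 1
    raise ValueError("no BC subfield found")
```

A reader that processes blocks in parallel could call bgzf_block_size on the first bytes of each block to find where the next block starts, without inflating anything.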

Where are named pdf characters defined like "f_f", "uni00D0" and "a204"?

I'm trying to read the official pdf specification "Document management — Portable document format — Part 1: PDF 1.7" (PDF32000_2008.pdf) as bytes and then interpret them according to that specification.
In Annex D, Character Sets and Encodings, there is a list of all named characters.
When I parse PDF32000_2008.pdf, there are also named characters like "f_f", "uni00D0" and "a204", which are missing in that specification.
My guess is that "f_f" is a symbol for two 'f' characters, which might get printed with a special glyph. There is a Unicode "Latin Small Ligature Ff" for 'ff'.
For example, there is also "f_i" in that file, which I expect to mean 'fi', one glyph showing the two characters 'f' and 'i'. However, the PDF specification already has 'fi' as the named character "fi", so what is the point of having two named characters pointing to the same symbol?
I can imagine that "uni00D0" means the Unicode character 'Ð'. However, PDF already defines it as the named character "Eth".
What could "a204" be? Maybe ANSI 204 'Ì', which also already has a named character, "Igrave"?
Why do they also use "a62", which would be just a '<'?
However, my main question is: where can I find a specification for these additional named characters?
Of course, Adobe Acrobat understands them, but Gmail also seems to have no problem with them. So I guess their meaning must be specified somewhere.

What's the difference between zlib and zlib@openssh.com?

While debugging ssh, I found that it offers two compression methods: zlib and zlib@openssh.com.
debug2:compression ctos: none, zlib@openssh.com,zlib
debug2:compression stoc: none, zlib@openssh.com,zlib
Is there any difference between the two?
From RFC 4251:
There are two formats for algorithm and method names:
Names that do not contain an at-sign ("@") are reserved to be
assigned by IETF CONSENSUS. Examples include "3des-cbc", "sha-1",
"hmac-sha1", and "zlib" (the doublequotes are not part of the
name). Names of this format are only valid if they are first
registered with the IANA. Registered names MUST NOT contain an
at-sign ("@"), comma (","), whitespace, control characters (ASCII
codes 32 or less), or the ASCII code 127 (DEL). Names are case-
sensitive, and MUST NOT be longer than 64 characters.
Anyone can define additional algorithms or methods by using names
in the format name@domainname, e.g., "ourcipher-cbc@example.com".
The format of the part preceding the at-sign is not specified;
however, these names MUST be printable US-ASCII strings, and MUST
NOT contain the comma character (","), whitespace, control
characters (ASCII codes 32 or less), or the ASCII code 127 (DEL).
They MUST have only a single at-sign in them. The part following
the at-sign MUST be a valid, fully qualified domain name [RFC1034]
controlled by the person or organization defining the name. Names
are case-sensitive, and MUST NOT be longer than 64 characters. It
is up to each domain how it manages its local namespace. It
should be noted that these names resemble STD 11 [RFC0822] email
addresses. This is purely coincidental and has nothing to do with
STD 11 [RFC0822].
In short, the name without an at-sign is the standard one registered with the IANA, and the other is an additional method defined by OpenSSH.
From https://www.openssh.com/manual.html
OpenSSH implemented a compression method "zlib@openssh.com" that delays starting compression until after user authentication, to eliminate the risk of pre-authentication attacks against the compression code. It is described in draft-miller-secsh-compression-delayed-00.txt.
So it performs the same zlib compression, but starts compressing only after successful authentication, which prevents pre-authentication attacks against the compression code.

dll files compared to gzip files

Okay, the title isn't very clear.
Given a byte array (read from a database blob) that represents EITHER the sequence of bytes contained in a .dll or the sequence of bytes representing the gzip'd version of that dll, is there a (relatively) simple signature that I can look for to differentiate between the two?
I'm trying to puzzle this out on my own, but I've discovered I can save a lot of time by asking for help. Thanks in advance.
Check whether its first two bytes are the gzip magic number, 0x1f 0x8b (see RFC 1952). Or just try to gunzip it; the operation will fail if the data is not gzip'd.
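A minimal Python sketch of the magic-number check. The "MZ" test for the uncompressed DLL is an addition that the answer above does not mention (PE files begin with the DOS "MZ" signature), and classify_blob is a hypothetical name.

```python
GZIP_MAGIC = b"\x1f\x8b"  # ID1, ID2 from RFC 1952
MZ_MAGIC = b"MZ"          # assumption: the DLL is a PE file with a DOS "MZ" stub


def classify_blob(blob: bytes) -> str:
    """Guess whether a blob is gzip data, a raw DLL, or something else."""
    prefix = blob[:2]
    if prefix == GZIP_MAGIC:
        return "gzip"
    if prefix == MZ_MAGIC:
        return "dll"
    return "unknown"
```

Since the two signatures differ in the very first byte, a blob can never match both, so the order of the checks does not matter.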
A gzip file should be fairly straightforward to identify, as it consists of a header, a footer and some other distinguishable elements in between.
From Wikipedia:
"gzip" is often also used to refer to the gzip file format, which is:
a 10-byte header, containing a magic number, a version number and a time stamp
optional extra headers, such as the original file name
a body, containing a DEFLATE-compressed payload
an 8-byte footer, containing a CRC-32 checksum and the length of the original uncompressed data
You might also try determining if the gzip contains any records/entries as each will also have their own header.
You can find specific information on this file format (specifically the member header which is linked) here.

MIME "From:" header with national characters

What is the correct format of "From:" header when From Name contains national characters and dot (.) character?
We generate (using C# Chilkat lib) this:
From: =?utf-8?Q?Micha=C5=82_from_domain.com?= <abcdef@domain.com>
(where From Name = Michał from domain.com)
This works OK in most cases. However, we encountered an email provider which marks this header as invalid and uses Return-Path header instead (which is machine-readable only).
The error is:
Illegal-Object: Syntax error in From: address found on ps11.m5r2.onet:
From: =?utf-8?Q?Micha=C5=82_from_domain.com?=<abcdef@domain.com>
^-missing end of mailbox
The provider insists that the problem is the lack of a space between the name and the email address. This is not the case on our end (see the header we generate above).
That email provider has a broken MTA. Unfortunately, you have to deal with it.
You're already formatting your non-ASCII "From" personal-part as an RFC 2047 encoded-word. Since you're using Q as the encoding, you can take advantage of the flexibility in the quoted-printable encoding and encode the . as well:
From: =?utf-8?Q?Micha=C5=82_from_domain=2Ecom?= <abcdef@domain.com>
(Note that the . has been replaced by its quoted-printable encoding, =2E.)
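To show the idea outside of any particular mail library, here is a small Python sketch that Q-encodes a display name and escapes everything except ASCII letters and digits, so the dot is escaped too. The function name q_encode_phrase is hypothetical, this is not the Chilkat API, and the sketch ignores the 75-character limit on a single encoded-word.

```python
def q_encode_phrase(text: str, charset: str = "utf-8") -> str:
    """Build an RFC 2047 Q-encoded word, escaping all but ASCII letters and digits."""
    parts = []
    for byte in text.encode(charset):
        ch = chr(byte)
        if ch == " ":
            parts.append("_")             # Q encoding represents a space as "_"
        elif ch.isascii() and ch.isalnum():
            parts.append(ch)              # safe to send literally
        else:
            parts.append(f"={byte:02X}")  # everything else, including ".", is escaped
    return f"=?{charset}?Q?{''.join(parts)}?="


print(q_encode_phrase("Michał from domain.com"))
```

Escaping more characters than strictly necessary is still valid Q encoding; the result is just a little longer.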