Why can't deflate (zlib) compress two identical strings concatenated together?

Or more accurately stated, when two identical strings are concatenated to each other, why can't zlib deflate the entire second string? It seems that when a matching string starts immediately after the previous instance of the same string, zlib emits the first character as a string literal and then emits a backwards reference to the previous string minus the first character.
For example, if I use zlib to deflate the string latelate, the output is 5 string literals followed by a back reference...
l a t e l <len=3, dist=4>
or huffman encoded...
0000000 cb 49 2c 49 cd 01 62 00
0000010
where I've simplified the output by using a "raw" deflate stream (i.e. windowBits = -15) and the fixed Huffman encoding (i.e. the compression strategy is Z_FIXED).
Why must zlib emit the second literal character 'l' before using a back reference to "ate"?
In other words, why can't it output...?
l a t e <len=4, dist=4>
I tried forcing the second version with my own deflate implementation, but zlib won't inflate the output. I get the error "invalid or incomplete deflate data".

Let's separate DEFLATE, the compressed bitstream format described in https://www.ietf.org/rfc/rfc1951.txt, from zlib, which is one implementation of algorithms that encode and decode such a bitstream.
DEFLATE certainly can represent, and compress, the concatenation of two identical strings. Why doesn't zlib do that? Because match searching for LZ77 compression is an inherently heuristic task, so some choices are never explored, even ones that seem obvious to a human.
With a trivial hash-based LZ77 encoder, the doubled-string match is easily found:
L6c # l
L61 # a
L74 # t
L65 # e
C-4,4
This sequence can be encoded with the static (fixed) Huffman code without any problem; the result is:
CB 49 2C 49 05 61 00
This bitstream also can be decoded without problem by zlib. You can try that using Python:
import binascii
import zlib

# wbits=-15 means a raw deflate stream with no zlib or gzip wrapper
print(zlib.decompress(binascii.unhexlify("CB492C49056100"), -15))  # b'latelate'
So, what version of zlib did you use? Maybe it was too old?
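To double-check those bytes, here is a minimal sketch in Python that packs the sequence l, a, t, e, <len=4, dist=4> into a fixed-Huffman raw deflate block by hand and then hands it to zlib. BitWriter and encode_latelate are illustrative names of my own, not part of zlib, and the code values are hard-wired to only the symbols this one example needs.

```python
import zlib


class BitWriter:
    """Collects bits in the order deflate stores them (bit 0 of each byte first)."""

    def __init__(self):
        self.bits = []

    def header(self, value, nbits):
        # Header fields are written least significant bit first.
        for i in range(nbits):
            self.bits.append((value >> i) & 1)

    def huffman(self, code, nbits):
        # Huffman codes are written most significant bit first.
        for i in reversed(range(nbits)):
            self.bits.append((code >> i) & 1)

    def to_bytes(self):
        out = bytearray()
        for start in range(0, len(self.bits), 8):
            chunk = self.bits[start:start + 8]
            out.append(sum(bit << pos for pos, bit in enumerate(chunk)))
        return bytes(out)


def encode_latelate():
    w = BitWriter()
    w.header(1, 1)                 # BFINAL = 1 (last block)
    w.header(1, 2)                 # BTYPE = 01 (fixed Huffman codes)
    for byte in b"late":
        w.huffman(0x30 + byte, 8)  # literals 0..143: 8-bit codes starting at 00110000
    w.huffman(258 - 256, 7)        # length symbol 258 = match length 4, no extra bits
    w.huffman(3, 5)                # distance symbol 3 = distance 4, no extra bits
    w.huffman(0, 7)                # symbol 256 = end of block
    return w.to_bytes()


stream = encode_latelate()
print(stream.hex())                  # cb492c49056100
print(zlib.decompress(stream, -15))  # b'latelate'
```

If a hand-rolled encoder produces different bytes for the same symbol sequence, the usual suspects are the bit order of the Huffman codes (most significant bit first) versus the header fields (least significant bit first).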

Related

Need Online Tool to Convert GZip compression text to ASCII (Readable) text

I am trying to view data in a Redis database.
The data was compressed using the compression support in Lettuce 6.1.1.
It uses the GZIP compression type.
I have tried several online tools to convert the GZIP text to a readable ASCII format.
The tools fail because they do not recognize the text as GZIP data. Maybe it has something to do with the compression algorithm Lettuce uses to compress the data.
Can anyone point me to a tool where I can decompress this data to readable ascii text?
Here is an example of the compressed data:
\x1F\x8B\x08\x00\x00\x00\x00\x00\x00\x00\xABVN-\xCBLNu,JM\xF4+\xCDMJ-R\xB2R2604\xB44Q\xAA\x05\x00\x190\x9B\xD1\x1E\x00\x00\x00
This should translate to a number: 301194
Here is a second example:
1.\x1F\x8B\x08\x00\x00\x00\x00\x00\x00\x003602\xB04\x01\x00\x93\xC0t\xC3\x06\x00\x00\x00
2.\x1F\x8B\x08\x00\x00\x00\x00\x00\x00\x003602\xB0\xB0\x04\x00o\x8D\xDE\xA4\x06\x00\x00\x00
3.\x1F\x8B\x08\x00\x00\x00\x00\x00\x00\x003602\xB04\x07\x00)\x91}Z\x06\x00\x00\x00
4.\x1F\x8B\x08\x00\x00\x00\x00\x00\x00\x003602\xB04\x03\x00\xBF\xA1z-\x06\x00\x00\x00
5.\x1F\x8B\x08\x00\x00\x00\x00\x00\x00\x003602\xB04\x00\x00\x8A\x04\x19\xC4\x06\x00\x00\x00
6.\x1F\x8B\x08\x00\x00\x00\x00\x00\x00\x003602\xB04\x02\x00\xA6e\x17*\x06\x00\x00\x00
7.\x1F\x8B\x08\x00\x00\x00\x00\x00\x00\x003604\xB44\x01\x00J\x05\x03\xD0\x06\x00\x00\x00
This should be a list of 7 service area numbers.
Not sure of the order but the values should be:
302090
302092
302097
302094
302096
302089
301194
I tried using this online tool:
https://codebeautify.org/gzip-decompress-online
There is no translation that appears in the translation window and no error is shown.
I also tried this website:
https://www.multiutil.com/gzip-to-text-decompress/
I get the error: Invalid compression text
UPDATE
The RedisInsight screenshot below shows the key-value information. The value information that is compressed as gzip I would like to translate.
I wanted to copy the value that I have highlighted and decompress it so I can document what is stored in the database.
There is nothing wrong with your examples 1 through 7. They are all valid gzip streams that decompress to:
302094
302089
302097
302096
302090
302092
301194
The first example in your question, however, has an error in the integrity check at the end. It decodes to:
{#eviceAreaNumber":"301194"}
While the deflate compressed data in the gzip stream is valid, the CRC that follows it is not. The uncompressed length after that is incorrect as well.
The online tools you point to expect Base64-encoded data, not the partially hex-escaped text you are pasting into them.
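If you would rather decode the values locally than rely on an online tool, something along these lines should work. It is only a sketch: unescape, gunzip_escaped and inflate_only are names I made up, the unescaping handles nothing but \xNN escapes (other escapes such as \\ or \n are not treated specially), and inflate_only assumes a plain 10-byte gzip header with no optional fields.

```python
import gzip
import re
import zlib


def unescape(text):
    """Convert text containing \\xNN escapes (everything else taken as ASCII) to bytes."""
    raw = text.encode("latin-1")
    return re.sub(rb"\\x([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), raw)


def gunzip_escaped(text):
    """Full gzip decode; raises if the CRC-32 or length trailer is wrong."""
    return gzip.decompress(unescape(text))


def inflate_only(text):
    """Decode just the deflate payload, skipping the CRC-32 and length checks.

    Assumes a 10-byte header (FLG = 0) and the usual 8-byte trailer.
    """
    data = unescape(text)
    return zlib.decompress(data[10:-8], -15)


# Example 1 from the question, copied as displayed.
sample = r"\x1F\x8B\x08\x00\x00\x00\x00\x00\x00\x003602\xB04\x01\x00\x93\xC0t\xC3\x06\x00\x00\x00"
print(inflate_only(sample))  # b'302094'
```

gunzip_escaped is the one to use normally. inflate_only is handy for a value whose trailer is damaged, since it still shows whatever the deflate data itself contains.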

What GZip extra field subfields exist?

RFC 1952 (GZIP File Format Specification) section 2.3.1.1 reads:
2.3.1.1. Extra field
If the FLG.FEXTRA bit is set, an "extra field" is present in
the header, with total length XLEN bytes. It consists of a
series of subfields, each of the form:
+---+---+---+---+==================================+
|SI1|SI2| LEN |... LEN bytes of subfield data ...|
+---+---+---+---+==================================+
SI1 and SI2 provide a subfield ID, typically two ASCII letters
with some mnemonic value. Jean-Loup Gailly
<email#hidden> is maintaining a registry of subfield
IDs; please send him any subfield ID you wish to use. Subfield
IDs with SI2 = 0 are reserved for future use. The following
IDs are currently defined:
SI1 SI2 Data
---------- ---------- ----
0x41 ('A') 0x70 ('P') Apollo file type information
LEN gives the length of the subfield data, excluding the 4
initial bytes.
Do any subfield types exist beyond the AP given in the RFC? A web search doesn't find a list; neither is there any mention on GZip's Wikipedia page, the GNU homepage, in the gzip source code, or on Stack Overflow.
As far as I know, there is no such registry being maintained. Jean-loup no longer works on gzip.
Here is one more subfield in use:
The BGZF format (which is gzip-conformant), developed for use in bioinformatics, uses the subfield type "BC" to indicate the size of the current block. This makes parallel decompression easy.
From the specification at http://samtools.github.io/hts-specs/SAMv1.pdf :
Each BGZF block contains a standard gzip file header with the following standard-compliant extensions:
The F.EXTRA bit in the header is set to indicate that extra fields are present.
The extra field used by BGZF uses the two subfield ID values 66 and 67 (ASCII ‘BC’).
The length of the BGZF extra field payload (field LEN in the gzip specification) is 2 (two bytes of
payload).
The payload of the BGZF extra field is a 16-bit unsigned integer in little endian format. This integer
gives the size of the containing BGZF block minus one.
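To make the subfield layout concrete, here is a rough Python sketch that walks the extra field of a gzip header and returns each subfield's ID and payload. parse_extra_subfields is a hypothetical helper, not part of any library, and the header below is built by hand with a made-up block size purely for illustration.

```python
import struct


def parse_extra_subfields(header):
    """Return [(subfield_id, data), ...] from the FEXTRA part of a gzip header."""
    if header[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip header")
    flags = header[3]
    if not flags & 0x04:              # FLG.FEXTRA not set: no extra field
        return []
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12:12 + xlen]
    subfields = []
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, length = struct.unpack_from("<BBH", extra, pos)
        subfields.append((bytes([si1, si2]), extra[pos + 4:pos + 4 + length]))
        pos += 4 + length             # LEN excludes the 4 initial bytes
    return subfields


# Hand-built BGZF-style header: FLG = 4 (FEXTRA), XLEN = 6, subfield "BC", LEN = 2.
bsize = 27                            # illustrative "block size minus one"
header = (b"\x1f\x8b\x08\x04" + b"\x00" * 4 + b"\x00\xff"
          + struct.pack("<H", 6) + b"BC" + struct.pack("<HH", 2, bsize))

fields = parse_extra_subfields(header)
print(fields)                                    # [(b'BC', b'\x1b\x00')]
print(struct.unpack("<H", fields[0][1])[0] + 1)  # 28, the total block size
```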

How to manually construct a gzip so that compressed file is larger than original?

Suppose there is a 1KB file called data.bin. Is it possible to construct a gzip of it, data.bin.gz, that is much larger, and if so, how?
How much larger could we theoretically get in GZIP format?
You can make it arbitrarily large. Take any gzip file and insert as many repetitions as you like of the five bytes: 00 00 00 ff ff after the gzip header and before the deflate data.
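A quick way to try this out in Python. This is a sketch: pad_gzip is a made-up helper that writes a minimal 10-byte gzip header itself (no optional fields), so that the position "after the gzip header" is known exactly.

```python
import gzip
import struct
import zlib

EMPTY_STORED_BLOCK = b"\x00\x00\x00\xff\xff"  # non-final stored block with LEN = 0


def pad_gzip(data, empty_blocks):
    """Build a gzip stream for data with empty_blocks empty stored blocks up front."""
    header = b"\x1f\x8b\x08\x00" + b"\x00" * 4 + b"\x00\xff"  # FLG = 0, MTIME = 0
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)            # raw deflate
    deflate = comp.compress(data) + comp.flush()
    trailer = struct.pack("<II", zlib.crc32(data), len(data) & 0xFFFFFFFF)
    return header + EMPTY_STORED_BLOCK * empty_blocks + deflate + trailer


original = b"hello " * 100
small = pad_gzip(original, 0)
large = pad_gzip(original, 100000)
print(len(large) - len(small))              # 500000
print(gzip.decompress(large) == original)   # True
```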
Summary:
With header fields/general structure: effect is unlimited unless it runs into software limitations
Empty blocks: unlimited effect by format specification
Uncompressed blocks: effect is limited to 6x
Compressed blocks: with apparent means, the maximum effect is estimated at 1.125x and is very hard to achieve
Take the gzip format (RFC1952 (metadata), RFC1951 (deflate format), additional notes for GNU gzip) and play with it as much as you like.
Header
There are a whole bunch of places to exploit:
use optional fields (original file name, file comment, extra fields)
bluntly append garbage (GNU gzip will issue a warning when decompressing)
concatenate multiple gzip archives (the format allows that; the resulting uncompressed data is, likewise, the concatenation of all chunks).
An interesting side effect (a bug in GNU gzip, apparently): gzip -l takes the reported uncompressed size from the last chunk only (even if it's garbage) rather than adding up values from all. So you can make it look like the archive is (absurdly) larger/smaller than raw data.
These are the ones that are immediately apparent; you may be able to find other ways.
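The concatenation trick, and the size field that gzip -l ends up reading, can be seen with a few lines of Python. This is just an illustration using the standard gzip module, with made-up sample data.

```python
import gzip
import struct

first = b"x" * 1000
second = b"yz"

# Two complete gzip members back to back form one valid gzip stream.
blob = gzip.compress(first) + gzip.compress(second)
print(gzip.decompress(blob) == first + second)  # True

# The last four bytes hold the uncompressed size of the LAST member only.
(isize,) = struct.unpack("<I", blob[-4:])
print(len(first + second), isize)               # 1002 2
```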
Data
The general layout of "deflate" format is (RFC1951):
A compressed data set consists of a series of blocks, corresponding to
successive blocks of input data. The block sizes are arbitrary,
except that non-compressible blocks are limited to 65,535 bytes.
<...>
Each block consists of two parts: a pair of Huffman code trees that
describe the representation of the compressed data part, and a
compressed data part. (The Huffman trees themselves are compressed
using Huffman encoding.) The compressed data consists of a series of
elements of two types: literal bytes (of strings that have not been
detected as duplicated within the previous 32K input bytes), and
pointers to duplicated strings, where a pointer is represented as a
pair <length, backward distance>. The representation used in the
"deflate" format limits distances to 32K bytes and lengths to 258
bytes, but does not limit the size of a block, except for
uncompressible blocks, which are limited as noted above.
Empty blocks
The 00 00 00 ff ff that Mark Adler suggests is essentially an empty, non-final block (RFC1951 section 3.2.3. for the 1st byte, 3.2.4. for the uncompressed block itself).
Btw, according to gzip overview at the official site and the source code, Mark is the author of the decompression part...
Uncompressed blocks
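Using non-empty uncompressed (stored) blocks (see the previous section for references), you can at most create one block for each byte of input. Each stored block costs 5 bytes of overhead (one byte holding the block header bits, then the 2-byte LEN and 2-byte NLEN fields) on top of its 1 byte of data, so the effect is limited to 6x.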
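For the curious, the one-block-per-byte layout looks roughly like this in Python. It is a sketch producing a raw deflate stream only (no gzip header or trailer), and one_stored_block_per_byte is my own name for it.

```python
import struct
import zlib


def one_stored_block_per_byte(data):
    """Raw deflate stream that puts every input byte in its own stored block."""
    if not data:
        return b"\x01\x00\x00\xff\xff"        # a single empty, final stored block
    out = bytearray()
    for i, byte in enumerate(data):
        final = 1 if i == len(data) - 1 else 0
        out.append(final)                     # BFINAL bit, BTYPE = 00, padded to a byte
        out += struct.pack("<HH", 1, 0xFFFE)  # LEN = 1, NLEN = one's complement of LEN
        out.append(byte)
    return bytes(out)


data = bytes(range(256))
blob = one_stored_block_per_byte(data)
print(len(blob) / len(data))                  # 6.0
print(zlib.decompress(blob, -15) == data)     # True
```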
Compressed blocks
In a nutshell: some inflation is achievable but it's very hard and the achievable effect is limited. Don't waste your time on them unless you have a very good reason.
Inside compressed blocks (section 3.2.5.), each chunk is either <encoded character (8-9 bits)> or <encoded chunk length (7-13 bits: a 7-8-bit code plus 0-5 extra bits)><distance back to data (5-18 bits)>, with lengths starting at 3. A 7-9-bit code unambiguously resolves to a literal character or a specific range of lengths. Longer codes correspond to larger lengths/distances. No space/meaningless stuff is allowed between chunks.
So the maximum for raw byte chunks is 9/8 (1.125x) - if all the raw bytes are with codes 144 - 255.
Playing with reference chunks isn't going to do any good for you: even a reference to a 3-byte sequence gives 25/24 (1.04x) at most.
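The 9/8 figure for literals can be checked with a small literal-only encoder for the fixed Huffman code. This is a sketch under the fixed-table rules of RFC 1951 section 3.2.6; literal_only_deflate is a made-up name, and it never emits length/distance pairs.

```python
import zlib


def literal_only_deflate(data):
    """Encode data as one fixed-Huffman block made of literals only."""
    acc = 0      # bit accumulator; bit k of acc is the k-th bit of the stream
    nbits = 0

    def put(code, length, msb_first):
        nonlocal acc, nbits
        for i in range(length):
            shift = length - 1 - i if msb_first else i
            acc |= ((code >> shift) & 1) << nbits
            nbits += 1

    put(1, 1, False)                          # BFINAL = 1
    put(1, 2, False)                          # BTYPE = 01 (fixed Huffman)
    for byte in data:
        if byte < 144:
            put(0x30 + byte, 8, True)         # 8-bit codes for literals 0..143
        else:
            put(0x190 + byte - 144, 9, True)  # 9-bit codes for literals 144..255
    put(0, 7, True)                           # end-of-block symbol 256
    return acc.to_bytes((nbits + 7) // 8, "little")


high = bytes([0xFF]) * 800
low = b"a" * 800
print(len(literal_only_deflate(high)))  # 902, about 9/8 of 800
print(len(literal_only_deflate(low)))   # 802, about 8/8 of 800
print(zlib.decompress(literal_only_deflate(high), -15) == high)  # True
```

The extra two bytes in each case are the 3 header bits and the 7-bit end-of-block code, rounded up to a whole byte.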
That's it for the static (fixed) Huffman tables. As far as I can tell from the docs, dynamic tables tailor the same encoding to the specific data. So they should let the ratio for given data get closer to the achievable maximum, but no more than that.

Detect if Base 64 string is image or text

Is there a way to detect if the Base 64 string contained in an NSData instance is an image or a text or any other object?
You can't generally just look at the base 64 string and decide, but you can decode the first few bytes of data, look at the hex codes (you can do this by decoding your base-64 string into an NSData and just NSLog it or examine it in the debugger), and draw some conclusions. For example:
Image files generally start with special byte sequences: JPEG files start with the hex bytes FF D8, and PNG files start with the hex bytes 89 50 4E 47 0D 0A 1A 0A (i.e. 89, "PNG", CR, LF, EOF, LF). Note, there are a dizzying number of different image formats, so this is a non-trivial exercise, but sometimes you can get lucky and it will be self-evident that it's one of these common formats when you glance at the first few bytes.
NSKeyedArchiver archives generally start with the string "bplist".
Printable ASCII text consists of codes between 20 and 7E (with linefeeds represented by 0A; carriage return plus linefeed by 0D 0A; tab characters as 09; etc.). Then again, if it were text, it's unlikely they'd be base-64 encoding it.
If it was UTF-8 it would conform to the coding pattern outlined here. For example, you can look at the first few high bits of the first byte that might conceivably represent a UTF-8 character, and conclude (a) how many bytes the character is represented by and (b) what high bits will be turned on in those subsequent bytes. You can often quickly look at it and confirm whether the data conforms to this UTF-8 pattern or not (especially easy to do for most western languages).
If the first three bytes were EF BB BF, that often indicates a UTF-8 byte order mark.
This is, by no means, an exhaustive list of codes, but just a few that leapt out at me.
To do this programmatically and do so exhaustively would be a non-trivial exercise. But if you're just "eye-balling" a base-64 string and trying to draw some logical inferences, decode it and look at the hex bytes and you can quickly narrow down the possibilities, at the very least. If you're unsure about how to interpret it, update your question with the hex representation of the decoded base-64 string (just the first 16-32 bytes, please), and we might be able to point you in the right direction.
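As a starting point for doing the same checks in code, here is a short sketch. It is written in Python rather than Objective-C so it can be run anywhere; sniff_base64 is a made-up name, and it only knows the handful of signatures mentioned above, so treat "unknown binary" as "not on this short list".

```python
import base64


def sniff_base64(b64_text):
    """Decode base-64 input and guess the payload type from its first bytes."""
    data = base64.b64decode(b64_text)
    if data.startswith(b"\xff\xd8"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"bplist"):
        return "bplist"
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8 text (with BOM)"
    try:
        data.decode("utf-8")          # does the whole payload follow the UTF-8 pattern?
    except UnicodeDecodeError:
        return "unknown binary"
    return "utf-8 text"


print(sniff_base64(base64.b64encode(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8)))  # png
print(sniff_base64(base64.b64encode("héllo".encode("utf-8"))))            # utf-8 text
```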
It is impossible to reliably distinguish a plain text string from a Base64-encoded image. The only thing you can do is check whether your string is valid Base 64. If it is, it is probably an image. If not, you can be sure it is text.
For how to check whether a string is valid Base 64, see: How to check whether the string is base64 encoded or not.
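A validity check of that kind might look like this. It is a sketch in Python for easy testing, and is_valid_base64 is a made-up name. Keep in mind that short plain words can happen to be valid Base 64 too, so a positive result is only a hint.

```python
import base64
import binascii


def is_valid_base64(text):
    """True if text decodes as strict Base 64 (alphabet and padding both checked)."""
    try:
        base64.b64decode(text, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


print(is_valid_base64("aGVsbG8="))      # True
print(is_valid_base64("hello world!"))  # False
print(is_valid_base64("text"))          # True, although it is just a word
```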

Xcode UTF-8 literals

Suppose I have the MUSICAL SYMBOL G CLEF symbol, 𝄞, that I wish to have in a string literal in my Objective-C source file.
The OS X Character Viewer says that the CLEF is UTF8 F0 9D 84 9E and Unicode 1D11E(D834+DD1E) in their terms.
After some futzing around, and using the ICU UNICODE Demonstration Page, I did get the following code to work:
NSString *uni=@"\U0001d11e";
NSString *uni2=[[NSString alloc] initWithUTF8String:"\xF0\x9D\x84\x9E"];
NSString *uni3=@"𝄞";
NSLog(@"unicode: %@ and %@ and %@",uni, uni2, uni3);
My questions:
Is it possible to streamline the way I am doing UTF-8 literals? That seems kludgy to me.
Is the @"\U0001d11e" part UTF-32?
Why does cutting and pasting the CLEF from Character Viewer actually work? I thought Objective-C files had to be UTF-8?
I would prefer the way you did it in uni3, but sadly that is not recommended. Failing that, I would prefer the method in uni to that in uni2. Another option would be [NSString stringWithFormat:@"%C", 0x1d11e], though note that %C takes a 16-bit unichar, so it is only a safe choice for code points up to U+FFFF.
It is a "universal character name", introduced in C99 (section 6.4.3) and imported into Objective-C as of OS X 10.5. Technically this doesn't have to give you UTF-8 (it's up to the compiler), but in practice UTF-8 is probably what you'll get.
The encoding of the source code file is probably UTF-8, matching what the runtime expects, so everything happens to work. It's also possible the source file is UTF-16 or UTF-32 and the compiler is doing the Right Thing when compiling it. Nonetheless, Apple does not recommend this.
Answers to your questions (same order):
Why choose? Xcode uses C99 in default setup. Refer to the C0X draft specification 6.4.3 on Universal Character Names. See below.
More technically, the @"\U0001d11e" is the 32 bit Unicode code point for that character in the ISO 10646 character set.
I would not count on this behavior working. You should absolutely, positively, without question have all the characters in your source file be 7 bit ASCII. For string literals, use an encoding or, preferably, a suitable external resource able to handle binary data.
Universal Character Names (from the WG14/N1256 C0X Draft which CLANG follows fairly well):
Universal Character Names may be used
in identifiers, character constants,
and string literals to designate
characters that are not in the basic
character set.
The universal
character name \Unnnnnnnn designates
the character whose eight-digit short
identifier (as specified by ISO/IEC
10646) is nnnnnnnn. Similarly, the
universal character name \unnnn
designates the character whose
four-digit short identifier is nnnn
(and whose eight-digit short
identifier is 0000nnnn).
Therefore, you can produce your character or string in a natural, mixed way:
char *utf8CStr =
"May all your CLEF's \xF0\x9D\x84\x9E be left like this: \U0001d11e";
NSString *uni4=[[NSString alloc] initWithUTF8String:utf8CStr];
The \Unnnnnnnn form allows you to select any Unicode code point, and this is the same value as the "Unicode" field at the bottom left of the Character Viewer. The direct entry of \Unnnnnnnn in the C99 source file is handled appropriately by the compiler. Note that there are only two forms: \unnnn, which takes exactly four hex digits and so covers code points up to U+FFFF, and \Unnnnnnnn, which takes eight hex digits and can name any Unicode code point. You need to pad on the left with 0's if your value has fewer than the 4 digits of \u or the 8 digits of \U.
The form of \xF0\x9D\x84\x9E in the same string literal is more interesting. This is inserting the raw UTF-8 encoding of the same character. Once passed to the initWithUTF8String method, both the universal character name and the raw byte escapes end up as the same encoded UTF-8.
It may, arguably, be a violation of 130 of section 5.1.1.2 to use raw bytes in this way. Given that a raw UTF-8 string would be encoded similarly, I think you are OK.
You can write the clef character in your string literal, too:
NSString *uni2=[[NSString alloc] initWithUTF8String:"𝄞"];
The \U0001d11e matches the unicode code point for the G clef character. The UTF-32 form of a character is the same as its codepoint, so you can think of it as UTF-32 if you want to. Here's a link to the unicode tables for musical symbols.
Your file probably is UTF-8. The G clef is a valid UTF8 character - check out the output from hexdump for your file:
00 4e 53 53 74 72 69 6e 67 20 2a 75 6e 69 33 3d 40 |NSString *uni3=@|
10 22 f0 9d 84 9e 22 3b 0a 20 20 4e 53 4c 6f 67 28 |"....";.  NSLog(|
As you can see, the correct UTF-8 representation of that character is in the file right where you'd expect it. It's probably safer to use one of your other methods and try to keep the source file in the ASCII range.
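If you want to see how the numbers in this thread relate to each other (the code point 1D11E, the UTF-8 bytes F0 9D 84 9E, and the D834+DD1E pair from Character Viewer), a few lines of Python show it. This is Python rather than Objective-C purely so it is easy to run; the surrogate arithmetic is the standard UTF-16 rule.

```python
clef = "\U0001d11e"  # MUSICAL SYMBOL G CLEF, same escape syntax as in C99

print(hex(ord(clef)))                  # 0x1d11e
print(clef.encode("utf-8").hex())      # f09d849e
print(clef.encode("utf-16-be").hex())  # d834dd1e
print(clef.encode("utf-32-be").hex())  # 0001d11e

# UTF-16 surrogate pair computed by hand from the code point.
offset = ord(clef) - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)
print(hex(high), hex(low))             # 0xd834 0xdd1e
```

The UTF-32 bytes are simply the code point itself, which is why \U0001d11e can be read as "UTF-32" even though the compiler decides the actual encoding of the literal.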
I created some utility classes to convert easily between unicode code points, UTF-8 byte sequences and NSString. You can find the code on Github, maybe it is of some use to someone.