Is it possible to create a valid gzip file with a static dictionary?

I am trying to create a valid gzip file (one that can be decompressed with standard Linux gzip) whose data is encoded with the DEFLATE algorithm using a static/preset dictionary.
I've read both the DEFLATE and gzip specifications, and it looks like it is impossible. As I understand the DEFLATE specification, there are two types of encoding for compressed data blocks:
Compressed with a dynamic dictionary (sliding window); such blocks start with a header with the FDICT flag set to 0.
Compressed with a static (preset) dictionary, with FDICT = 1.
But I have found no way to actually write such a dictionary to the file. Is it possible to add a header with my dictionary (or dictionaries), or in some other way make gzip decompress data encoded with FDICT = 1?
P.S. I am trying to accomplish this using Java's Deflater class, but I am interested in whether gzip itself actually supports blocks compressed in such a way.

You are conflating two different concepts, so I'm not sure which you are talking about.
There are deflate blocks which use a static Huffman code, which are generally used when compressing very small amounts of data. Normally dynamic Huffman codes are used, where the code optimized for that particular block is sent at the start of the block. For small amounts of data, e.g. 100 bytes, the overhead of that code description would dominate the size of the output. Instead a static code would be used, which avoids the overhead at the cost of less compression. But overall, the result is smaller. All deflate applications (gzip, zlib, png, etc.) support all deflate block types.
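To see that trade-off in practice, here is a minimal C sketch using zlib (not part of the original answer; the sample text, buffer size, and lack of error handling are all simplifications). zlib normally picks the cheaper block type on its own; the Z_FIXED strategy forces static Huffman codes so the two output sizes can be compared on a small input.

#include <stdio.h>
#include <string.h>
#include "zlib.h"

/* Deflate len bytes with the given strategy and return the compressed size. */
static uLong squeeze(const unsigned char *data, uInt len, int strategy)
{
    unsigned char out[1024];
    z_stream s;
    memset(&s, 0, sizeof(s));
    deflateInit2(&s, Z_BEST_COMPRESSION, Z_DEFLATED, 15, 8, strategy);
    s.next_in = (Bytef *)data;   s.avail_in = len;
    s.next_out = out;            s.avail_out = sizeof(out);
    deflate(&s, Z_FINISH);                  /* single call; out[] is large enough */
    uLong produced = sizeof(out) - s.avail_out;
    deflateEnd(&s);
    return produced;
}

int main(void)
{
    const unsigned char msg[] =
        "about one hundred bytes of text, short enough that describing a "
        "dynamic Huffman code would cost more than it saves";
    printf("default strategy (zlib chooses): %lu bytes\n",
           squeeze(msg, sizeof(msg) - 1, Z_DEFAULT_STRATEGY));
    printf("forced static codes (Z_FIXED):   %lu bytes\n",
           squeeze(msg, sizeof(msg) - 1, Z_FIXED));
    return 0;
}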
The other concept is a pre-defined dictionary, which is a chunk of 32K of data that preloads the sliding dictionary in which matching strings are searched for. That is only supported by zlib. It is not possible to provide a pre-defined dictionary for a gzip stream. Your link for "deflate" is actually a link to the zlib format, which is where FDICT is defined.
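As a concrete illustration of the zlib-only preset dictionary (a sketch, not anything gzip can read; the dictionary text, buffer sizes, and lack of error handling are simplifications): deflateSetDictionary() makes the zlib header carry FDICT plus the dictionary's Adler-32, and inflate() then reports Z_NEED_DICT until the same dictionary is supplied.

#include <stdio.h>
#include <string.h>
#include "zlib.h"

/* Sample preset dictionary: strings likely to occur in the data to compress. */
static const unsigned char dict[] = "the quick brown fox jumps over the lazy dog";

int main(void)
{
    const unsigned char msg[] = "the quick brown fox was here";
    unsigned char comp[256], plain[256];
    z_stream def, inf;
    memset(&def, 0, sizeof(def));
    memset(&inf, 0, sizeof(inf));

    /* Compress as a zlib stream; setting the dictionary before the first
       deflate() call sets FDICT in the header. */
    deflateInit(&def, Z_BEST_COMPRESSION);
    deflateSetDictionary(&def, dict, sizeof(dict) - 1);
    def.next_in = (Bytef *)msg;   def.avail_in = sizeof(msg) - 1;
    def.next_out = comp;          def.avail_out = sizeof(comp);
    deflate(&def, Z_FINISH);
    uInt clen = sizeof(comp) - def.avail_out;
    deflateEnd(&def);

    /* Decompress: inflate() stops with Z_NEED_DICT until the same dictionary
       (identified by its Adler-32) is provided. gzip has no equivalent step. */
    inflateInit(&inf);
    inf.next_in = comp;     inf.avail_in = clen;
    inf.next_out = plain;   inf.avail_out = sizeof(plain);
    if (inflate(&inf, Z_FINISH) == Z_NEED_DICT) {
        inflateSetDictionary(&inf, dict, sizeof(dict) - 1);
        inflate(&inf, Z_FINISH);
    }
    printf("%u compressed bytes -> %.*s\n",
           clen, (int)(sizeof(plain) - inf.avail_out), plain);
    inflateEnd(&inf);
    return 0;
}

Wrapping the same deflate data in a gzip container instead would not work, because the gzip header has no field in which to record the dictionary ID.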

Related

How to continue the decompression step across system reboots with zlib?

Say I have a 100GB compressed file. After 76% of the decompression, my device gets rebooted by some other event, and I simply want to resume the decompression from that 76% mark where I left off. That's it.
To help with this, I can control how the files are compressed and archived.
But while decompressing on the device there is no command line; only the zlib APIs are available (or any new APIs this may require).
This is a reworded repost for clarity; I apologize for that. Previously Z_FULL_FLUSH was suggested, but I didn't understand how I would use the offset at that 76% mark to initialize zlib.
Any feedback is much appreciated.
Thanks
Read through zlib's FAQ and the annotated usage example for a better understanding of how deflate and inflate work together on a compressed stream.
For this, you don't even need to specially prepare the gzip file. You can save the state of inflation periodically. If interrupted, roll back to the previous saved state and start from there.
You can use Z_BLOCK to get inflate() to return at deflate block boundaries. This will be noted in data_type, as documented in zlib.h. You would pick an amount of uncompressed data after which to save a new state, e.g. 16 MB. Upon reaching that amount, at the next deflate block boundary you would save the location in the compressed data (a byte offset plus a bit offset within that byte), the location in the uncompressed data you have written up to, and the last 32K of uncompressed data, which you can get using inflateGetDictionary().
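A rough sketch of that bookkeeping in C, loosely modeled on zlib's examples/zran.c (the checkpoint struct, the save_checkpoint() helper, and the 16 MB span are illustrative, error handling is abbreviated, and inflateGetDictionary() needs zlib 1.2.8 or later):

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include "zlib.h"

#define CHUNK 16384
#define SPAN  (16L * 1024 * 1024)      /* save a checkpoint about every 16 MB of output */

struct checkpoint {
    off_t in_ofs;                      /* compressed offset of the block boundary */
    int bits;                          /* bits of the byte at in_ofs - 1 still unused */
    off_t out_ofs;                     /* uncompressed bytes written so far */
    unsigned char window[32768];       /* last 32K of uncompressed data */
    unsigned have;                     /* valid bytes in window[] */
};

void save_checkpoint(const struct checkpoint *cp);   /* hypothetical: persist to storage */

int inflate_with_checkpoints(FILE *in, FILE *out)
{
    z_stream strm;
    unsigned char inbuf[CHUNK], outbuf[CHUNK];
    off_t totin = 0, totout = 0, last = 0;
    struct checkpoint cp;
    int ret = Z_OK;

    memset(&strm, 0, sizeof(strm));
    if (inflateInit2(&strm, 15 + 16) != Z_OK)       /* 15 + 16: expect a gzip wrapper */
        return Z_MEM_ERROR;

    do {
        strm.avail_in = fread(inbuf, 1, CHUNK, in);
        if (strm.avail_in == 0)
            break;                                  /* truncated input */
        strm.next_in = inbuf;
        do {
            strm.avail_out = CHUNK;
            strm.next_out = outbuf;
            totin += strm.avail_in;
            totout += strm.avail_out;
            ret = inflate(&strm, Z_BLOCK);          /* stop at deflate block boundaries */
            totin -= strm.avail_in;
            totout -= strm.avail_out;
            if (ret < 0 || ret == Z_NEED_DICT)
                goto done;
            fwrite(outbuf, 1, CHUNK - strm.avail_out, out);

            /* data_type: bit 7 set when inflate() stopped at the end of a block,
               bit 6 set while in the last block, bits 0..2 = unused bits in the
               last input byte consumed. */
            if ((strm.data_type & 128) && !(strm.data_type & 64) &&
                totout - last >= SPAN) {
                cp.in_ofs = totin;
                cp.bits = strm.data_type & 7;
                cp.out_ofs = totout;
                inflateGetDictionary(&strm, cp.window, &cp.have);
                save_checkpoint(&cp);
                last = totout;
            }
        } while (strm.avail_in != 0 && ret != Z_STREAM_END);
    } while (ret != Z_STREAM_END);
done:
    inflateEnd(&strm);
    return ret == Z_STREAM_END ? Z_OK : ret;
}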
To restart from the last state, do a raw inflate, use inflatePrime() to feed the bits from the byte at the compressed data offset, and use inflateSetDictionary() to provide the 32K of history. Seek to the saved offset in your output file to start writing from there. Then continue inflating.
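And a matching sketch of the restart path, again patterned after zlib's examples/zran.c (load-from-storage is left out, the struct checkpoint is the one from the sketch above, and note that a raw inflate skips the gzip trailer, so the CRC of the resumed portion is not verified here; compile with large-file support for files this big):

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include "zlib.h"

#define CHUNK 16384

/* Resume from a saved state. The output file is assumed to already contain
   the first cp->out_ofs bytes produced before the interruption. */
int resume_inflate(FILE *in, FILE *out, const struct checkpoint *cp)
{
    z_stream strm;
    unsigned char inbuf[CHUNK], outbuf[CHUNK];
    int ret = Z_OK;

    memset(&strm, 0, sizeof(strm));
    if (inflateInit2(&strm, -15) != Z_OK)          /* raw inflate: no gzip header now */
        return Z_MEM_ERROR;

    /* Reposition the input just before the saved block boundary, feed any
       leftover bits of the straddling byte, and restore the 32K history. */
    fseeko(in, cp->in_ofs - (cp->bits ? 1 : 0), SEEK_SET);
    if (cp->bits) {
        int ch = getc(in);
        inflatePrime(&strm, cp->bits, ch >> (8 - cp->bits));
    }
    inflateSetDictionary(&strm, cp->window, cp->have);

    fseeko(out, cp->out_ofs, SEEK_SET);            /* continue writing from here */

    do {
        strm.avail_in = fread(inbuf, 1, CHUNK, in);
        if (strm.avail_in == 0)
            break;
        strm.next_in = inbuf;
        do {
            strm.avail_out = CHUNK;
            strm.next_out = outbuf;
            ret = inflate(&strm, Z_NO_FLUSH);
            if (ret < 0 || ret == Z_NEED_DICT)
                goto done;
            fwrite(outbuf, 1, CHUNK - strm.avail_out, out);
        } while (strm.avail_out == 0 && ret != Z_STREAM_END);
    } while (ret != Z_STREAM_END);                 /* gzip trailer bytes are ignored */
done:
    inflateEnd(&strm);
    return ret == Z_STREAM_END ? Z_OK : ret;
}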

What is a suitable compression algorithm for a large set of multilingual UTF-16 strings?

I need to compress a large number of short multilingual strings (<1000 bytes each). I have tried implementing LZW with a separate dictionary for each language. Is there a better solution for this? The strings are stored in a set, so the ordering doesn't matter.
Using a Base64 library to create an encoded binary representation, plus zlib compression, might help.

gzip partial modification and re-compression

I am unfamiliar with compression algorithms. Is it possible with zlib or some other library to decompress, modify and recompress only the beginning of a gzip stream and then concatenate it with the compressed remainder of the stream? This would be done in a case where, for example, I need to modify the first bytes of user data (not headers) of a 10GB gzip file so as to avoid decompressing and recompressing the entire file.
No. Compression will generally make use of the preceding data in compressing the subsequent data. So you can't change the preceding data without recompressing the remaining data.
An exception would be if there were breakpoints put in the compressed data originally that reset the history at each breakpoint. In zlib this is accomplished with Z_FULL_FLUSH during compression.
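For completeness, here is a rough sketch of how such breakpoints might be written with zlib (the function name, CHUNK size, and segment length are illustrative, and error handling is abbreviated). Each Z_FULL_FLUSH empties the 32K history window, so data after a breakpoint makes no references across it; to actually splice in a modified first segment you would still need to record where the breakpoints fall in the compressed output and regenerate the gzip trailer (CRC-32 and length) afterwards.

#include <stdio.h>
#include <string.h>
#include "zlib.h"

#define CHUNK 16384

int gzip_with_breakpoints(FILE *in, FILE *out, long segment_bytes)
{
    z_stream strm;
    unsigned char inbuf[CHUNK], outbuf[CHUNK];
    long since_flush = 0;
    int ret, flush;

    memset(&strm, 0, sizeof(strm));
    /* windowBits 15 + 16 asks deflate for a gzip wrapper instead of a zlib one. */
    ret = deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                       15 + 16, 8, Z_DEFAULT_STRATEGY);
    if (ret != Z_OK) return ret;

    do {
        strm.avail_in = fread(inbuf, 1, CHUNK, in);
        strm.next_in = inbuf;
        since_flush += strm.avail_in;

        if (feof(in))
            flush = Z_FINISH;
        else if (since_flush >= segment_bytes) {
            flush = Z_FULL_FLUSH;           /* breakpoint: history is reset here */
            since_flush = 0;
        } else
            flush = Z_NO_FLUSH;

        do {
            strm.avail_out = CHUNK;
            strm.next_out = outbuf;
            ret = deflate(&strm, flush);
            fwrite(outbuf, 1, CHUNK - strm.avail_out, out);
        } while (strm.avail_out == 0);      /* keep going until the flush completes */
    } while (flush != Z_FINISH);

    deflateEnd(&strm);
    return ret == Z_STREAM_END ? Z_OK : Z_STREAM_ERROR;
}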

Why doesn't the PNG size change after using HTTP gzip compression?

I use the following .htaccess directive to enable gzip compression:
AddOutputFilterByType DEFLATE text/html image/png image/jpeg text/css text/javascript
Please check this URL: http://www.coinex.com/cn/silver_panda/proof/china_1984_27_gram_silver_panda_coin/
The gzip compression works for HTML, CSS, JS, and JPG, but it is not working for PNG (which really surprises me).
PNG is already a compressed data format. Compressing it with GZIP is not likely to decrease the size, and can in fact make it larger.
I'm surprised you're seeing benefits when GZIP-ing JPGs, as they are also compressed.
See Google's tips on using GZIP; they recommend not applying it to images.
The PNG image format already uses deflate compression internally. So you will not usually see any appreciable decrease in transmitted size by using HTTP compression on top of that. Therefore you should remove image/png from the list you mentioned to avoid wasting CPU cycles at the server and client on a redundant compression step.
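If you follow that advice, the directive from the question would simply drop the image types (assuming the rest of your mod_deflate configuration stays as it is):

AddOutputFilterByType DEFLATE text/html text/css text/javascript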
If you want to make your PNGs smaller, use https://tinypng.com/ or another PNG optimizer. (Yes, it fully supports the alpha channel too.)
PNG is a lossless image compression format. Basically, it uses spatial prediction (filtering) plus deflate compression while fully preserving the original image data. It generally cannot be compressed much further without loss of quality (you would need to try another lossless format to see whether it does better).
There is no need to use GZIP (or an equivalent), as it will just add client-side decompression work for the images.
For JPEG, the best you can do is make sure you use the correct resolution and quality settings for your purpose; GZIP produces mixed results at best. Also make sure you strip all metadata from the images (unless you need that information client side, though you would be better off holding that data in a database).

Are all PDF files compressed?

So there are some threads here on PDF compression saying that there is some, but not a lot of, gain in compressing PDFs as PDFs are already compressed.
My question is: Is this true for all PDFs including older version of the format?
Also, I'm sure it's possible for someone (an idiot, maybe) to place uncompressed bitmaps into a PDF rather than JPEGs, etc. Our company has a lot of PDFs in its databases (possibly including some in older formats). We are considering using gzip to compress them during transmission but don't know if it's worth the hassle.
PDFs in general use internal compression for the objects they contain. But this compression is by no means compulsory according to the file format specifications. All (or some) objects may appear completely uncompressed, and they would still make a valid PDF.
There are commandline tools out there which are able to decompress most (if not all) of the internal object streams (even of the most modern versions of PDFs) -- and the new, uncompressed version of the file will render exactly the same on screen or on paper (if printed).
So to answer your question: No, you cannot assume that a gzip compression is adding only hassle and no benefit. You have to test it with a representative sample set of your files. Just gzip them and take note of the time used and of the space saved.
It also depends on the type of PDF-producing software that was used...
Instead of applying gzip compression, you would get much better gain by using PDF utilities to apply compression to the contents within the format as well as remove things like unneeded embedded fonts. Such utilities can downsample images and apply the proper image compression, which would be far more effective than gzip. JBIG2 can be applied to bilevel images and is remarkably effective, and JPEG can be applied to natural images with the quality level selected to suit your needs. In Acrobat Pro, you can use Advanced -> PDF Optimizer to see where space is used and selectively attack those consumers. There is also a generic Document -> Reduce File Size to automatically apply these reductions.
Update:
Ika's answer has a link to a PDF optimization utility that can be used from Java. You can look at their sample Java code there. That code lists exactly the things I mentioned:
Remove duplicated fonts, images, ICC profiles, and any other data stream.
Optionally convert high-quality or print-ready PDF files to small, efficient and web-ready PDF.
Optionally down-sample large images to a given resolution.
Optionally compress or recompress PDF images using JBIG2 and JPEG2000 compression formats.
Compress uncompressed streams and remove unused PDF objects.