gzip partial modification and re-compression

I am unfamiliar with compression algorithms. Is it possible with zlib or some other library to decompress, modify and recompress only the beginning of a gzip stream and then concatenate it with the compressed remainder of the stream? This would be done in a case where, for example, I need to modify the first bytes of user data (not headers) of a 10GB gzip file so as to avoid decompressing and recompressing the entire file.

No. Compression will generally make use of the preceding data in compressing the subsequent data. So you can't change the preceding data without recompressing the remaining data.
An exception would be if there were breakpoints put in the compressed data originally that reset the history at each breakpoint. In zlib this is accomplished with Z_FULL_FLUSH during compression.
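As an illustration, here is a minimal sketch of writing such breakpoints with zlib, issuing Z_FULL_FLUSH at a fixed interval (the 1 MB interval and the function name are assumptions, not something from the question):

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    #define CHUNK (1 << 20)   /* 1 MB between breakpoints (assumption) */

    int compress_with_breakpoints(FILE *in, FILE *out)
    {
        z_stream strm;
        static unsigned char inbuf[CHUNK], outbuf[CHUNK];
        int ret;

        memset(&strm, 0, sizeof(strm));
        /* windowBits 15 + 16 asks zlib for a gzip wrapper */
        ret = deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                           15 + 16, 8, Z_DEFAULT_STRATEGY);
        if (ret != Z_OK)
            return ret;

        for (;;) {
            size_t got = fread(inbuf, 1, CHUNK, in);
            /* Z_FULL_FLUSH resets the history at this point in the
               stream; Z_FINISH terminates the gzip member at EOF */
            int flush = feof(in) ? Z_FINISH : Z_FULL_FLUSH;

            strm.next_in = inbuf;
            strm.avail_in = (uInt)got;
            do {
                strm.next_out = outbuf;
                strm.avail_out = CHUNK;
                deflate(&strm, flush);   /* cannot fail with these args */
                fwrite(outbuf, 1, CHUNK - strm.avail_out, out);
            } while (strm.avail_out == 0);

            if (flush == Z_FINISH)
                break;
        }
        deflateEnd(&strm);
        return Z_OK;
    }

Everything after a Z_FULL_FLUSH is independent of what came before it, so a modified prefix could be recompressed and spliced back on at such a point, at the cost of slightly worse compression per breakpoint. Note that the gzip trailer (a CRC-32 and length of the uncompressed data) would still need to be recomputed after modifying the data.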


How to continue decompression over system reboots with zlib?

Say I have a 100 GB compressed file. After 76% of the decompression, my device gets rebooted by other events; I simply want to resume the decompression from that 76% mark where I left off. That's it.
To help with this, I could control how files are compressed and archived.
But while decompressing on the device there is no command line; only the zlib APIs are available (plus any new APIs that may be required).
This is a repost, a question reworded for clarity; I apologize for that. Previously Z_FULL_FLUSH was suggested, but I didn't understand how I would use that 76% mark's offset to initialize zlib.
Any feedback is much appreciated.
Thanks
Read through zlib's FAQ and the annotated usage example for a better understanding of how deflate and inflate work together on a compressed stream.
For this, you don't even need to specially prepare the gzip file. You can save the state of inflation periodically. If interrupted, roll back to the previous saved state and start from there.
You can use Z_BLOCK to get inflate() to return at deflate block boundaries. This will be noted in data_type as documented in zlib.h. You would pick an amount of uncompressed data after which to save a new state, e.g. 16 MB. Upon reaching that amount, at the next deflate block boundary you would save the location in the compressed data (which is both a byte offset and a bit offset within that byte), the location in the uncompressed data you have saved up to, and the last 32K of uncompressed data, which you can get using inflateGetDictionary().
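A minimal sketch of the state-saving side, assuming zlib 1.2.8 or later for inflateGetDictionary(); the struct, the 16 MB interval, and the function names are illustrative:

    #include <zlib.h>

    #define WINSIZE 32768U       /* deflate history window */
    #define SPAN (16L << 20)     /* save a state every ~16 MB (assumption) */

    struct checkpoint {
        long long cmp_off;       /* compressed bytes consumed so far */
        int bits;                /* unused bits in the last consumed byte */
        long long uncmp_off;     /* uncompressed bytes produced so far */
        unsigned char window[WINSIZE];
        uInt win_len;
    };

    /* Call after each inflate(strm, Z_BLOCK) return; the caller
       initializes *next_mark (e.g. to SPAN). If inflate stopped at a
       block boundary (data_type bit 7) that is not the last block
       (bit 6), and the next mark has been passed, record the state.
       Note: track the offsets yourself for files over 4 GB, since
       total_in/total_out may be only 32 bits wide. */
    static void maybe_save(z_stream *strm, long long *next_mark,
                           struct checkpoint *cp)
    {
        if ((strm->data_type & 128) && !(strm->data_type & 64) &&
            (long long)strm->total_out >= *next_mark) {
            cp->cmp_off = strm->total_in;
            cp->bits = strm->data_type & 7;
            cp->uncmp_off = strm->total_out;
            inflateGetDictionary(strm, cp->window, &cp->win_len);
            /* ...persist *cp to stable storage here... */
            *next_mark = (long long)strm->total_out + SPAN;
        }
    }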
To restart from the last state, do a raw inflate, use inflatePrime() to feed the bits from the byte at the compressed data offset, and use inflateSetDictionary() to provide the 32K of history. Seek to the saved offset in your output file to start writing from there. Then continue inflating.
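And a sketch of the restart side, using the checkpoint struct from the previous sketch (raw inflate, so the gzip header is not expected again; fseeko is POSIX):

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    /* Reopen the compressed file, seek to the checkpoint, and prepare a
       raw inflate to continue from there. */
    int restart_inflate(z_stream *strm, FILE *in, const struct checkpoint *cp)
    {
        unsigned char ch;
        int ret;

        memset(strm, 0, sizeof(*strm));
        ret = inflateInit2(strm, -15);      /* raw: no gzip/zlib wrapper */
        if (ret != Z_OK)
            return ret;

        /* If the block boundary fell mid-byte, re-read that byte and feed
           its remaining high-order bits to inflate with inflatePrime(). */
        if (fseeko(in, cp->cmp_off - (cp->bits ? 1 : 0), SEEK_SET) != 0)
            return Z_ERRNO;
        if (cp->bits) {
            if (fread(&ch, 1, 1, in) != 1)
                return Z_ERRNO;
            inflatePrime(strm, cp->bits, ch >> (8 - cp->bits));
        }

        /* Restore the 32K of history that later blocks may refer to. */
        inflateSetDictionary(strm, cp->window, cp->win_len);

        /* Caller seeks to cp->uncmp_off in the output file and resumes the
           usual fread/inflate loop from here. */
        return Z_OK;
    }

This is essentially the access-point technique used by zlib's examples/zran.c, which is worth reading alongside the FAQ.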

.gz (gzip) file analysis

According to RFC 1952 ("GZIP File Format Specification"), a gzip file consists of a series of "members" (compressed data sets).
Is it possible to analyze a gzip file without decompressing it: for example, to count the number of members and index their locations within the file, or to go into the middle of the file and decompress just one of the members?
No. To find when a member ends, you have to decompress it. You don't have to write out the decompressed result — just process the input to find where the members start.
Once you know where the members start, then yes, you can start decompression from any one of those locations.
Note that the vast majority of gzip files have just one member.
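A sketch of that scan, inflating through the file while discarding the output and recording where each member starts (the file name is illustrative):

    #include <stdio.h>
    #include <zlib.h>

    #define CHUNK 65536

    int main(void)
    {
        FILE *in = fopen("file.gz", "rb");   /* illustrative name */
        static unsigned char inbuf[CHUNK], outbuf[CHUNK];
        z_stream strm = {0};
        long long base = 0;      /* file offset of inbuf[0] */
        long members = 0;
        int fresh = 1;           /* about to start a new member */

        if (!in || inflateInit2(&strm, 15 + 16) != Z_OK)  /* gzip wrapper */
            return 1;

        for (;;) {
            size_t got = fread(inbuf, 1, CHUNK, in);
            if (got == 0)
                break;
            strm.next_in = inbuf;
            strm.avail_in = (uInt)got;
            while (strm.avail_in > 0) {
                if (fresh) {     /* input present: a member starts here */
                    printf("member %ld starts at offset %lld\n", ++members,
                           base + (long long)(strm.next_in - inbuf));
                    fresh = 0;
                }
                strm.next_out = outbuf;   /* decompressed data discarded */
                strm.avail_out = CHUNK;
                int ret = inflate(&strm, Z_NO_FLUSH);
                if (ret == Z_STREAM_END) {   /* trailer consumed: done */
                    inflateReset(&strm);
                    fresh = 1;
                } else if (ret != Z_OK) {
                    fprintf(stderr, "inflate error %d\n", ret);
                    return 1;
                }
            }
            base += got;
        }
        printf("%ld member(s)\n", members);
        inflateEnd(&strm);
        fclose(in);
        return 0;
    }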

Is there a way to read the top lines of a tar.gz without decompression?

I have 1000+ *.tar.gz files, 4 GB+ each. But the only thing that I need is the top 5 lines of each file. I am wondering whether there is a fast way to read these lines without decompressing them (it takes 3-5 minutes to decompress a single file).
My platform is Linux.
No, there isn't any faster way.
The issue is that a .tar file is a stream of concatenated original files (with some meta information), and gzip then compresses the full archive. Therefore, even just to get the list of files, the archive has to be decompressed first.
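Decompression cannot be skipped, but it can stop as soon as the wanted lines have been read, which avoids inflating the remaining gigabytes. A sketch using zlib's gzFile API, assuming the first tar entry is the plain text file of interest (a plain ustar entry, no pax/extended headers):

    #include <stdio.h>
    #include <zlib.h>

    int main(int argc, char **argv)
    {
        gzFile gz = gzopen(argc > 1 ? argv[1] : "archive.tar.gz", "rb");
        char header[512];        /* tar header block of the first entry */
        char line[4096];
        int i;

        if (!gz)
            return 1;
        if (gzread(gz, header, 512) != 512)
            return 1;            /* skip the tar header, not the file data */
        for (i = 0; i < 5 && gzgets(gz, line, sizeof(line)) != NULL; i++)
            fputs(line, stdout); /* decompression stops after these lines */
        gzclose(gz);
        return 0;
    }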

Are all PDF files compressed?

So there are some threads here on PDF compression saying that there is some, but not a lot of, gain in compressing PDFs as PDFs are already compressed.
My question is: Is this true for all PDFs including older version of the format?
Also I'm sure it's possible for someone (an idiot maybe) to place bitmaps into the PDF rather than JPEGs etc. Our company has a lot of PDFs in its DBs (some older formats maybe). We are considering using gzip to compress during transmission but don't know if it's worth the hassle.
PDFs in general use internal compression for the objects they contain. But this compression is by no means compulsory according to the file format specifications. All (or some) objects may appear completely uncompressed, and they would still make a valid PDF.
There are command-line tools out there which are able to decompress most (if not all) of the internal object streams (even of the most modern PDF versions), and the new, uncompressed version of the file will render exactly the same on screen or on paper (if printed).
So to answer your question: no, you cannot assume that gzip compression adds only hassle and no benefit. You have to test it with a representative sample set of your files. Just gzip them and take note of the time used and of the space saved.
It also depends on the type of PDF-producing software that was used...
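If you want to run that test programmatically rather than with the gzip command, a sketch using zlib's compress2() (whose zlib-wrapped output is within a few header bytes of real gzip output; the file name is illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <zlib.h>

    int main(int argc, char **argv)
    {
        FILE *f = fopen(argc > 1 ? argv[1] : "sample.pdf", "rb");
        if (!f)
            return 1;

        fseek(f, 0, SEEK_END);
        long n = ftell(f);       /* whole file in memory: fine for a test */
        rewind(f);

        unsigned char *src = malloc(n);
        uLongf dlen = compressBound(n);
        unsigned char *dst = malloc(dlen);
        if (!src || !dst || fread(src, 1, n, f) != (size_t)n)
            return 1;

        if (compress2(dst, &dlen, src, n, Z_DEFAULT_COMPRESSION) != Z_OK)
            return 1;
        printf("%ld -> %lu bytes (%.1f%% of original)\n",
               n, (unsigned long)dlen, 100.0 * dlen / n);
        return 0;
    }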
Instead of applying gzip compression, you would get much better gain by using PDF utilities to apply compression to the contents within the format as well as remove things like unneeded embedded fonts. Such utilities can downsample images and apply the proper image compression, which would be far more effective than gzip. JBIG2 can be applied to bilevel images and is remarkably effective, and JPEG can be applied to natural images with the quality level selected to suit your needs. In Acrobat Pro, you can use Advanced -> PDF Optimizer to see where space is used and selectively attack those consumers. There is also a generic Document -> Reduce File Size to automatically apply these reductions.
Update:
Ika's answer has a link to a PDF optimization utility that can be used from Java. You can look at their sample Java code there. That code lists exactly the things I mentioned:
Remove duplicated fonts, images, ICC profiles, and any other data stream.
Optionally convert high-quality or print-ready PDF files to small, efficient and web-ready PDF.
Optionally down-sample large images to a given resolution.
Optionally compress or recompress PDF images using JBIG2 and JPEG2000 compression formats.
Compress uncompressed streams and remove unused PDF objects.

Write multiple streams to a single file without knowing the length of the streams?

For performance of reading and writing a large dataset, we have multiple threads compressing and writing out separate files to a SAN. I'm making a new file spec that will instead have all these files appended together into a single file. I will refer to each of these smaller blocks of a data as a subset.
Since each subset will be an unknown size after compression, there is no way to know what byte offset to write to. Without compression, each writer can write to a predictable address.
Is there a way to append files together on the file-system level without requiring a file copy?
I'll write an example here of how I would expect the result to be on disk. Although I'm not sure how helpful it is to write it this way.
single-dataset.raw
[header 512B][data1-45MB][data2-123MB][data3-4MB][data5-44MB]
I expect the SAN to be NTFS for now in case there are any special features of certain file-systems.
If I make the subsets small enough to fit into RAM, I will know the size after compression, but keeping them smaller has other performance drawbacks.
Use sparse files. Just position each subset at some offset "guaranteed" to be beyond the last subset. Your header can then contain the offset of each subset and the filesystem handles the big "empty" chunks for you.
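A minimal POSIX sketch of the sparse-file approach (slot size and names are assumptions; on NTFS you would additionally mark the file sparse with FSCTL_SET_SPARSE so the holes are not allocated):

    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define HEADER_SIZE 512
    #define SLOT_SIZE   (256LL << 20)  /* 256 MB per subset slot (assumption) */

    /* Write one compressed subset into its fixed slot. Safe to call from
       several threads at once, since pwrite() carries its own offset and
       does not touch the shared file position. */
    ssize_t write_subset(int fd, int subset_idx, const void *buf, size_t len)
    {
        off_t off = HEADER_SIZE + (off_t)subset_idx * SLOT_SIZE;
        return pwrite(fd, buf, len, off);
    }

    int open_dataset(const char *path)
    {
        /* No pre-allocation: the unwritten gaps between slots remain
           holes and consume no disk space on a sparse-capable filesystem. */
        return open(path, O_CREAT | O_WRONLY, 0644);
    }

The header at offset 0 then records each subset's actual compressed length, so readers know how much of each slot is real data.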
The cooler solution is to write out each subset as a separate file and then use low-level filesystem functions to join the files by chaining the first block of the next file to the last block of the previous file (along with deleting the directory entries for all but the first file).