How to get uncompressed length of GZIP-compressed string? [duplicate] - gzip

Is there any reliable way of getting the uncompressed length of a compressed string (specifically one compressed with GZIP) without decompressing the string? I have no control over the compression process, i.e. no agreed-upon flags or tail data. I've read that you can check the last four bytes, but that seems to require flags to be enabled which, again, I have no control over.

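The raw deflate stream itself does not contain the information you require. But if your data is wrapped in the gzip format (header and trailer), it does: the last four bytes of the trailer store the uncompressed length modulo 2^32.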
More info here.
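As a minimal sketch of reading that trailer field: the gzip format (RFC 1952) ends each member with a 4-byte CRC-32 followed by a 4-byte little-endian ISIZE, the uncompressed length modulo 2^32. The function name `gzip_isize` is made up for this illustration, and it assumes the input is a single complete gzip member with nothing appended after it.

```python
import gzip
import struct


def gzip_isize(gz_bytes: bytes) -> int:
    """Return the ISIZE field: uncompressed length modulo 2**32.

    Assumes gz_bytes is one complete gzip member (hypothetical helper).
    """
    # 10-byte fixed header + 8-byte trailer is the bare minimum.
    if len(gz_bytes) < 18 or gz_bytes[:2] != b"\x1f\x8b":
        raise ValueError("not a complete gzip member")
    (isize,) = struct.unpack("<I", gz_bytes[-4:])
    return isize


if __name__ == "__main__":
    payload = b"hello world" * 1000
    print(gzip_isize(gzip.compress(payload)), len(payload))
```

Two caveats follow from the format: for data of 4 GiB or more the value wraps around, and for a file made of several concatenated gzip members it only describes the last member.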

Related

How to continue the decompressing step over system reboots ? zlib

Say I have a 100GB compressed file and, 76% of the way through decompressing it, my device is rebooted by some other event. I then simply want to resume decompression from that 76% mark where I left off. That's it.
To help with this, I can control how the files are compressed and archived.
But when decompressing on the device there is no command line; only the zlib APIs are available, or any new APIs that may be required.
This is a repost of a reworded question, for clarity; I apologize for that. Previously Z_FULL_FLUSH was suggested, but I didn't understand how I would use the offset of that 76% mark to initialize zlib.
I would much appreciate any feedback.
Thanks
Read through the zlib FAQ and the annotated usage page for a better understanding of how deflate and inflate work together in a compressed stream.
For this, you don't even need to specially prepare the gzip file. You can save the state of inflation periodically. If interrupted, roll back to the previous saved state and start from there.
You can use Z_BLOCK to get inflate() to return at deflate block boundaries. This will be noted in data_type as documented in zlib.h. You would pick an amount of uncompressed data after which to save a new state, e.g. 16 MB. Upon reaching that amount, at the next deflate block boundary you would save three things: the location in the compressed data, which is both a byte offset and a bit offset within that byte; the location in the uncompressed data you saved up to; and the last 32K of uncompressed data, which you can get using inflateGetDictionary().
To restart from the last state, do a raw inflate, use inflatePrime() to feed the bits from the byte at the compressed data offset, and use inflateSetDictionary() to provide the 32K of history. Seek to the saved offset in your output file to start writing from there. Then continue inflating.
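The restart step can be illustrated in Python, with one important substitution. Python's zlib module does not expose inflatePrime() or block-boundary stops during inflation, so this sketch does not use Z_BLOCK. Instead it assumes the checkpoint sits at a byte-aligned boundary created with Z_SYNC_FLUSH at compression time (the question allows controlling compression). What it does show is the other half of the recipe: a raw inflate primed with the last 32K of already-written output, which is what the `zdict` argument does for a raw stream. The function names are hypothetical.

```python
import zlib

WINDOW = 32 * 1024  # deflate history size


def compress_with_checkpoint(part1: bytes, part2: bytes):
    """Raw deflate with a byte-aligned checkpoint between the two parts.

    Z_SYNC_FLUSH aligns the output to a byte boundary but keeps the
    history, so part2 may still refer back to part1.
    """
    c = zlib.compressobj(wbits=-15)  # negative wbits = raw deflate
    first = c.compress(part1) + c.flush(zlib.Z_SYNC_FLUSH)
    rest = c.compress(part2) + c.flush(zlib.Z_FINISH)
    return first, rest


def resume(rest: bytes, already_written: bytes) -> bytes:
    """Continue inflating at the checkpoint.

    already_written stands in for the saved uncompressed output; only
    its last 32K is needed as history.
    """
    d = zlib.decompressobj(wbits=-15, zdict=already_written[-WINDOW:])
    return d.decompress(rest) + d.flush()


if __name__ == "__main__":
    part1 = b"".join(b"record %06d\n" % i for i in range(10000))
    part2 = b"".join(b"record %06d\n" % i for i in range(5000, 15000))
    first, rest = compress_with_checkpoint(part1, part2)
    print(resume(rest, part1) == part2)
```

In a real restart, `len(first)` would be the saved compressed offset and `len(part1)` the saved output offset to seek to. With zlib's C API and Z_BLOCK the checkpoint need not be byte-aligned, which is why the answer also saves a bit offset and uses inflatePrime().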

gzip partial modification and re-compression

I am unfamiliar with compression algorithms. Is it possible with zlib or some other library to decompress, modify and recompress only the beginning of a gzip stream and then concatenate it with the compressed remainder of the stream? This would be done in a case where, for example, I need to modify the first bytes of user data (not headers) of a 10GB gzip file so as to avoid decompressing and recompressing the entire file.
No. Compression will generally make use of the preceding data in compressing the subsequent data. So you can't change the preceding data without recompressing the remaining data.
An exception would be if there were breakpoints put in the compressed data originally that reset the history at each breakpoint. In zlib this is accomplished with Z_FULL_FLUSH during compression.
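A minimal sketch of that exception, assuming the original stream was written as raw deflate with a Z_FULL_FLUSH after the part you want to change (the helper names are made up for the example). Because a full flush resets the history and ends on a byte boundary, the compressed tail does not depend on the head and can be concatenated after a freshly compressed replacement.

```python
import zlib


def compress_segments(segments):
    """Raw-deflate each segment, separated by Z_FULL_FLUSH breakpoints.

    Returns one compressed piece per segment; joined together they form
    a single valid raw deflate stream.
    """
    c = zlib.compressobj(wbits=-15)
    pieces = []
    for i, seg in enumerate(segments):
        last = i == len(segments) - 1
        mode = zlib.Z_FINISH if last else zlib.Z_FULL_FLUSH
        pieces.append(c.compress(seg) + c.flush(mode))
    return pieces


def recompress_head(new_head: bytes) -> bytes:
    """Compress a replacement for a non-final segment.

    It must also end with a full flush so the next piece starts on a
    byte boundary with empty history.
    """
    c = zlib.compressobj(wbits=-15)
    return c.compress(new_head) + c.flush(zlib.Z_FULL_FLUSH)


if __name__ == "__main__":
    head, tail = b"HEADER v1\n" * 100, b"payload line\n" * 100000
    pieces = compress_segments([head, tail])
    patched = recompress_head(b"HEADER v2!\n" * 100) + pieces[1]
    out = zlib.decompress(patched, -15)
    print(out.endswith(tail), len(out))
```

For an actual gzip file there is more to do than this sketch shows: the trailer holds a CRC-32 and length of the whole uncompressed data, so both would have to be updated after the head changes.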

Is there a way to read tar.gz top lines without uncompression?

I have 1000+ *.tar.gz files, each 4GB or larger, but the only thing I need is the top 5 lines of each file. I am wondering whether there is a fast way to read these lines without decompressing everything (it takes 3-5 minutes to decompress a single file).
My platform is Linux.
No, there isn't any faster way.
The issue is that a .tar file is a stream of the concatenated original files (with some metadata), and gzip then compresses the archive as a whole. Therefore, even just to get the list of files, the archive has to be decompressed first.

Detect if a file is an MP3 file?

I'm writing a C++ library for decoding and encoding audio between different formats/codecs. I have a routine for quickly detecting the format before loading the required codec library.
For WAV files one can simply look for the ASCII values "RIFF" and "WAVE" at the start of the file. The same applies to FLAC: we can simply read in the first 4 bytes, which will be "fLaC".
But how can I quickly detect if a file is MP3? I can't rely on the file extension. I also can't try to decode the first MP3 frame, since there might be additional data at the start of the file (e.g. ID3, cover image, etc.).
Detecting if a file is an MP3 is more complicated than searching for a fixed pattern in the file.
Some concepts
(See http://www.codeproject.com/Articles/8295/MPEG-Audio-Frame-Header for details)
An MP3 file consists of a series of frames, and each frame has a header at the beginning.
The header starts at a byte boundary with an 11-bit sync word, which is all 1s. Hence the first 12 bits of the header are either 0xFFE or 0xFFF.
The length of each frame is calculated from the header parameters.
Algorithm to determine if a file is MP3 or not
1. Search for the sync word in the file (0xFFF or 0xFFE).
2. Parse the header parameters.
3. Determine the frame length using the header parameters.
4. Seek to the next frame using the frame length.
5. If you find another sync word after seeking, then the file is most likely an MP3 file.
6. To be sure, repeat the process to find N consecutive MP3 frames. N can be increased for a better hit rate.
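The steps above can be sketched as follows. This is a simplified illustration, not a complete detector: it only accepts MPEG-1 Layer III headers (the most common kind of MP3), ignores MPEG-2/2.5 and free-format bitrates, and works on bytes already read into memory. The function names and the default of 3 frames are choices made for the example.

```python
# Lookup tables for MPEG-1 Layer III; None marks free-format/reserved values.
BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96, 112,
                 128, 160, 192, 224, 256, 320, None]
SAMPLE_RATES_HZ = [44100, 48000, 32000, None]


def frame_length(header: bytes):
    """Return the frame length in bytes, or None if not a valid header."""
    if len(header) < 4:
        return None
    b0, b1, b2 = header[0], header[1], header[2]
    # 11-bit sync word: first byte 0xFF, top three bits of the second set.
    if b0 != 0xFF or (b1 & 0xE0) != 0xE0:
        return None
    version = (b1 >> 3) & 0x3  # 0b11 = MPEG-1
    layer = (b1 >> 1) & 0x3    # 0b01 = Layer III
    if version != 0b11 or layer != 0b01:
        return None
    bitrate = BITRATES_KBPS[b2 >> 4]
    sample_rate = SAMPLE_RATES_HZ[(b2 >> 2) & 0x3]
    if bitrate is None or sample_rate is None:
        return None
    padding = (b2 >> 1) & 0x1
    return 144 * bitrate * 1000 // sample_rate + padding


def looks_like_mp3(data: bytes, n_frames: int = 3) -> bool:
    """True if n_frames consecutive valid frame headers are found."""
    pos = 0
    while True:
        pos = data.find(b"\xff", pos)  # candidate sync position
        if pos < 0:
            return False
        offset, count = pos, 0
        while count < n_frames:
            length = frame_length(data[offset:offset + 4])
            if length is None:
                break
            count += 1
            offset += length  # jump to where the next header should be
        if count >= n_frames:
            return True
        pos += 1  # false sync; keep scanning


if __name__ == "__main__":
    frame = b"\xff\xfb\x90\x00" + bytes(413)  # 128 kbps, 44.1 kHz
    print(looks_like_mp3(b"leading junk" + frame * 3))
```

Scanning from every 0xFF byte means leading data such as an ID3 tag is skipped without being parsed, at the cost of possibly reading a long tag byte by byte.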

Reading last lines of gzipped text file

Let's say file.txt.gz is 2GB, and I want to see the last 100 lines or so. zcat <file.txt.gz | tail -n 100 would go through all of it.
I understand that compressed files cannot be randomly accessed, and if I cut let's say the last 5MB of it, then data just after the cut will be garbage - but can gzip resync and decode rest of the stream?
If I understand it correctly gzip stream is a straightforward stream of commands describing what to output - it should be possible to sync with that. Then there's 32kB sliding window of the most recent uncompressed data - which starts as garbage of course if we start in the middle, but I'd guess it would normally get filled with real data quickly, and from that point decompression is trivial (well, it's possible that something gets recopied over and over again from start of file to the end, and so the sliding window never clears - it would surprise me if it was all that common - and if that happens we just process the whole file).
I'm not terribly eager to do this kind of gzip hackery myself - hasn't anybody done it before, for dealing with corrupted files if nothing else?
Alternatively - if gzip really cannot do that, are there perhaps any other stream compression programs that work pretty much like it, except they allow resyncing mid-stream?
EDIT: I found a pure Ruby reimplementation of zlib and hacked it to print the ages of bytes within the sliding window. It turns out that things do get copied over and over again a lot, and even after 5MB+ the sliding window still contains stuff from the first 100 bytes, and from random places throughout the file.
We cannot even get around that by reading the first few blocks and the last few blocks, as those first bytes are not referenced directly, it's just a very long chain of copies, and the only way to find out what it's referring to is by processing it all.
Essentially, with default options what I wanted is probably impossible.
On the other hand, zlib has a Z_FULL_FLUSH option that clears this sliding window for the purpose of syncing. So the question still stands. Assuming that zlib syncs every now and then, are there any tools for reading just the end of the file without processing it all?
Z_FULL_FLUSH emits a known byte sequence (00 00 FF FF) that you can use to synchronize. This link may be useful.
This is analogous to the difference between block and stream ciphers. gzip is not a cipher, but it compresses the data as one continuous stream, so you might need the whole file up to a certain point to decompress the bytes at that point.
As you mention, when the window is cleared, you're golden. But there's no guarantee that zlib actually does this often enough for you... I suggest you seek backwards from the end of the file and find the marker for a full flush.
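A rough sketch of that backwards search, assuming the file is a single gzip member already read into memory and that it was written with occasional Z_FULL_FLUSH calls. The function name is made up. Note that the same four bytes are also emitted by Z_SYNC_FLUSH (which does not reset the window) and can occur by chance inside compressed data, so each candidate is verified by trying a raw inflate from just after it, and on failure the search moves to an earlier candidate.

```python
import zlib

MARKER = b"\x00\x00\xff\xff"  # trailing bytes of a flush point


def tail_after_last_full_flush(gz_bytes: bytes):
    """Decompress only the data after the last usable flush marker.

    Returns the decompressed tail, or None if no candidate marker
    leads to a stream that inflates cleanly to its end.
    """
    end = len(gz_bytes)
    while True:
        pos = gz_bytes.rfind(MARKER, 0, end)
        if pos < 0:
            return None
        # Raw inflate: we start mid-stream, so there is no header to
        # parse. The gzip trailer ends up in d.unused_data.
        d = zlib.decompressobj(wbits=-15)
        try:
            out = d.decompress(gz_bytes[pos + len(MARKER):])
            if d.eof:
                return out
        except zlib.error:
            pass  # not a real reset point; try an earlier candidate
        end = pos + len(MARKER) - 1  # only look strictly before pos


if __name__ == "__main__":
    c = zlib.compressobj(wbits=31)  # 31 = gzip wrapper
    gz = c.compress(b"old log line\n" * 1000) + c.flush(zlib.Z_FULL_FLUSH)
    gz += c.compress(b"the last lines\n" * 5) + c.flush()
    print(tail_after_last_full_flush(gz))
```

For a 2GB file you would read a window from the end of the file rather than the whole thing, and widen it if no marker is found. A clean inflate to the end of the stream is strong evidence but not proof that the marker was a real reset point.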