How to continue decompression across system reboots? zlib - gzip

Say I have a 100 GB compressed file. After 76% of the decompression, my device gets rebooted by other events, and I simply want to resume the decompression from that 76% mark where I last left off. That's it.
To help with this, I could control how files are compressed and archived.
But while decompressing on the device there is no command line; only the zlib APIs are available, plus any new APIs that may be required.
This is a repost, a reworded question for clarity; I apologize for that. Previously Z_FULL_FLUSH was suggested, but I didn't understand how I would use the offset at that 76% mark to initialize zlib.
Any feedback is much appreciated.
Thanks
Read through zlib's FAQ and the annotated usage example for a better understanding of how deflate and inflate work together on a compressed stream.

For this, you don't even need to specially prepare the gzip file. You can save the state of inflation periodically. If interrupted, roll back to the previous saved state and start from there.
You can use Z_BLOCK to get inflate() to return at deflate block boundaries. This will be noted in data_type, as documented in zlib.h. You would pick an amount of uncompressed data after which to save a new state, e.g. 16 MB. Upon reaching that amount, at the next deflate block boundary you would save the location in the compressed data (both a byte offset and a bit offset within that byte), the location in the uncompressed data you have saved up to, and the last 32K of uncompressed data, which you can get using inflateGetDictionary().
To restart from the last state, do a raw inflate, use inflatePrime() to feed the bits from the byte at the compressed data offset, and use inflateSetDictionary() to provide the 32K of history. Seek to the saved offset in your output file to start writing from there. Then continue inflating.
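Below is a minimal sketch of that scheme in C, modeled closely on zlib's zran.c example. The struct checkpoint and the two helper functions are illustrative names of my own, not part of zlib; inflateGetDictionary() needs zlib 1.2.8 or later.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>   /* off_t */
#include <zlib.h>

#define WINSIZE 32768U   /* deflate history window size */

struct checkpoint {
    off_t in_offset;     /* compressed bytes consumed when the state was saved */
    int   bits;          /* bits of the byte at in_offset belonging to the next block */
    off_t out_offset;    /* uncompressed bytes produced when the state was saved */
    uInt  dict_len;      /* valid bytes in window[] */
    unsigned char window[WINSIZE];
};

/* Call right after inflate(&strm, Z_BLOCK) returned while sitting on a block
 * boundary, i.e. (strm.data_type & 0xC0) == 0x80.  total_in/total_out are
 * running totals kept by the caller (zlib's own counters are only uLong). */
static void save_checkpoint(z_stream *strm, off_t total_in, off_t total_out,
                            struct checkpoint *cp)
{
    cp->in_offset  = total_in;
    cp->bits       = strm->data_type & 7;   /* bit position inside the last byte */
    cp->out_offset = total_out;
    cp->dict_len   = WINSIZE;
    inflateGetDictionary(strm, cp->window, &cp->dict_len);
    /* ...then persist *cp to stable storage (file, flash, ...). */
}

/* After a reboot: reload *cp from storage and rebuild the inflate state. */
static int resume_from_checkpoint(z_stream *strm, FILE *in,
                                  const struct checkpoint *cp)
{
    int ret = inflateInit2(strm, -15);      /* raw inflate: gzip header already read */
    if (ret != Z_OK)
        return ret;

    /* Reposition the compressed input; back up one byte if the boundary
     * falls in the middle of a byte, then feed those bits with inflatePrime(). */
    if (fseeko(in, cp->in_offset - (cp->bits ? 1 : 0), SEEK_SET) != 0)
        return Z_ERRNO;
    if (cp->bits) {
        int ch = getc(in);
        if (ch == EOF)
            return Z_ERRNO;
        inflatePrime(strm, cp->bits, ch >> (8 - cp->bits));
    }

    /* Restore the last 32K of history so back-references resolve correctly. */
    inflateSetDictionary(strm, cp->window, cp->dict_len);

    /* The caller then seeks its output file to cp->out_offset and continues
     * the usual inflate() loop, reading compressed data from `in`. */
    return Z_OK;
}
```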

Related

gzip partial modification and re-compression

I am unfamiliar with compression algorithms. Is it possible with zlib or some other library to decompress, modify and recompress only the beginning of a gzip stream and then concatenate it with the compressed remainder of the stream? This would be done in a case where, for example, I need to modify the first bytes of user data (not headers) of a 10GB gzip file so as to avoid decompressing and recompressing the entire file.
No. Compression will generally make use of the preceding data in compressing the subsequent data. So you can't change the preceding data without recompressing the remaining data.
An exception would be if there were breakpoints put in the compressed data originally that reset the history at each breakpoint. In zlib this is accomplished with Z_FULL_FLUSH during compression.
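For reference, here is a minimal sketch of how such breakpoints could be inserted at compression time with zlib; the constants and the function name are illustrative choices, and error handling is trimmed for brevity.

```c
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define CHUNK      16384
#define FLUSH_SPAN (4 * 1024 * 1024)  /* reset history roughly every 4 MB of input */

int deflate_with_breakpoints(FILE *src, FILE *dst)
{
    z_stream strm;
    unsigned char in[CHUNK], out[CHUNK];
    unsigned long since_flush = 0;

    memset(&strm, 0, sizeof strm);
    /* windowBits 15 + 16 selects a gzip wrapper around the deflate stream. */
    if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     15 + 16, 8, Z_DEFAULT_STRATEGY) != Z_OK)
        return Z_STREAM_ERROR;

    for (;;) {
        strm.avail_in = (uInt)fread(in, 1, CHUNK, src);
        strm.next_in = in;
        since_flush += strm.avail_in;
        int eof = feof(src);

        /* Z_FULL_FLUSH resets the history, so later data cannot refer back
         * across the breakpoint; Z_FINISH terminates the stream.           */
        int flush = eof ? Z_FINISH :
                    (since_flush >= FLUSH_SPAN ? Z_FULL_FLUSH : Z_NO_FLUSH);
        if (flush == Z_FULL_FLUSH)
            since_flush = 0;

        do {
            strm.avail_out = CHUNK;
            strm.next_out = out;
            deflate(&strm, flush);
            fwrite(out, 1, CHUNK - strm.avail_out, dst);
        } while (strm.avail_out == 0);

        if (eof)
            break;
    }
    deflateEnd(&strm);
    return Z_OK;
}
```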

For linearized PDF how to determine the length of the cross-reference stream in advance?

When generating a linearized PDF, a cross-reference table should be stored at the very beginning of the file. If it is a cross-reference stream, this means the content of the table will be compressed, and the actual size of the cross-reference stream after compression is unpredictable.
So my question is:
How to determine the actual size of this cross-reference stream in advance?
If the actual size of the stream is unpredictable, then after the offsets of objects are written into the stream and the stream is written into the file, it will change the actual offsets of the following objects again, won't it? Am I missing something here?
Any hints are appreciated.
How to determine the actual size of this cross-reference stream in advance?
First of all you don't. At least not exactly. You described why.
But it suffices to have an estimate. Just add some bytes to the estimate and later on pad with whitespace. @VadimR pointed out that such padding can regularly be observed in linearized PDFs.
You can either use a rough estimate as in the QPDF source @VadimR referenced, or you can try for a better one.
You could, e.g. make use of predictors:
At the time you eventually have to create the cross reference streams, all PDF objects can already be serialized in the order you need with the exception of the cross reference streams and the linearization dictionary (which contains the final size of the PDF and some object offsets). Thus, you already know the differences between consecutive xref entry values for most of the entries.
If you use Up predictors, you essentially only store those differences. So you already know most of the data to compress, and changes in a few entries won't change the compressed result much. This probably gives you a better estimate.
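To make that concrete, here is a rough sketch (my own illustration, not code from any particular PDF library) of estimating the compressed length: apply the PNG Up filter, which is what PDF's /Predictor 12 uses, to the fixed-width xref rows and run them through zlib.

```c
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

/* rows: num_rows * row_len bytes of raw cross-reference entries.
 * Returns an estimate of the FlateDecode'd stream length, or 0 on error. */
unsigned long estimate_xref_stream_len(const unsigned char *rows,
                                       size_t num_rows, size_t row_len)
{
    size_t filtered_len = num_rows * (row_len + 1);   /* +1 filter byte per row */
    unsigned char *filtered = malloc(filtered_len);
    if (filtered == NULL)
        return 0;

    for (size_t r = 0; r < num_rows; r++) {
        unsigned char *out = filtered + r * (row_len + 1);
        const unsigned char *cur = rows + r * row_len;
        const unsigned char *above = r ? cur - row_len : NULL;
        out[0] = 2;                                   /* PNG filter type: Up */
        for (size_t i = 0; i < row_len; i++)
            out[i + 1] = cur[i] - (above ? above[i] : 0);
    }

    uLongf dest_len = compressBound(filtered_len);
    unsigned char *dest = malloc(dest_len);
    if (dest == NULL || compress(dest, &dest_len, filtered, filtered_len) != Z_OK)
        dest_len = 0;

    free(dest);
    free(filtered);
    return dest_len;   /* add some slack and pad with whitespace later */
}
```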
Furthermore, as the first cross reference stream does not contain too many entries in general, you can try compressing that stream multiple times for different numbers of reserved bytes.
PS: I have no idea what Adobe uses in its linearization code. And I don't know whether it makes sense to fight over a few bytes more or less here; after all, linearization matters most for big documents, for which a few bytes more or less hardly count.

How to detect silence and cut mp3 file without re-encoding using NAudio and .NET

I've been looking for an answer everywhere and I was only able to find some bits and pieces. What I want to do is to load multiple mp3 files (kind of temporarily merge them) and then cut them into pieces using silence detection.
My understanding is that I can use Mp3FileReader for this but the questions are:
1. How do I read, say, 20 seconds of audio from an mp3 file? Do I need to read 20 times reader.WaveFormat.AverageBytesPerSecond bytes? Or maybe keep reading frames until the sum of Mp3Frame.SampleCount / Mp3Frame.SampleRate exceeds 20 seconds?
2. How do I actually detect the silence? I would look at an appropriate number of consecutive samples to check whether they are all below some threshold. But how do I access the samples regardless of whether they are 8- or 16-bit, mono or stereo, etc.? Can I decode an MP3 frame directly?
3. After I have detected silence at say sample 10465, how do I map it back to the mp3 frame index to perform the cutting without re-encoding?
Here's the approach I'd recommend (which does involve re-encoding)
Use AudioFileReader to get your MP3 as floating point samples directly in the Read method
Find an open source noise gate algorithm, port it to C#, and use that to detect silence (i.e. when noise gate is closed, you have silence. You'll want to tweak threshold and attack/release times)
Create a derived ISampleProvider that uses the noise gate, and in its Read method, does not return samples that are in silence
Either: pass the output into WaveFileWriter to create a WAV file, and then encode the WAV file to MP3
Or: use NAudio.Lame to encode directly without a WAV step. You'll probably need to go from SampleProvider back down to 16 bit WAV provider first
BEFORE READING BELOW: Mark's answer is far easier to implement, and you'll almost certainly be happy with the results. This answer is for those who are willing to spend an inordinate amount of time on it.
So with that said, cutting an MP3 file based on silence without re-encoding or full decoding is actually possible... Basically, you can look at each frame's side info and each granule's gain & huffman data to "estimate" the silence.
Find the silence
Copy all the frames from before the silence to a new file
now it gets tricky...
Pull the audio data from the frames after the silence, keeping track of which frame header goes with what audio data.
Start writing the second new file, but as you write out the frames, update the main_data_begin field so the bit reservoir is in sync with where the audio data really is.
MP3 is a compressed audio format. You can't just cut bits out and expect the remainder to still be a valid MP3 file. In fact, since it's a DCT-based transform, the bits are in the frequency domain instead of the time domain. There simply are no bits for sample 10465. There's a frame which contains sample 10465, and there's a set of bits describing all frequencies in that frame.
Plain cutting the audio at sample 10465 and continuing with some random other sample probably causes a discontinuity, which means the number of frequencies present in the resulting frame skyrockets. So that definitely means a full recode. The better way is to smooth the transition, but that's not a trivial operation. And the result is of course slightly different than the input, so it still means a recode.
I don't understand why you'd want to read 20 seconds of audio anyway. Where's that number coming from? You usually want to read everything.
Sound is a wave; it's entirely expected that it crosses zero. So a sample being close to zero isn't special by itself. For a 20 Hz wave (the lower limit of hearing), zero crossings happen 40 times per second, and each time you'll have multiple samples near zero. So you basically need multiple samples that are all close to zero, on both sides of the crossing. Values like 5, 6, 7 aren't much for 16-bit sound, but they might very well be part of a wave that peaks at 10000. You really should check at least 0.05 seconds to catch those 20 Hz sounds.
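As a language-neutral illustration of that idea, here is a C sketch over plain 16-bit mono PCM; the threshold value and the function name are made up, and the NAudio plumbing is left out entirely.

```c
#include <stdlib.h>
#include <stdint.h>

/* Returns the index of the first sample of a silent stretch of at least
 * ~50 ms in which every sample stays under the threshold, or -1 if none. */
long find_silence(const int16_t *samples, size_t count, int sample_rate)
{
    const int16_t threshold = 300;              /* ~1% of full scale, tune to taste */
    const size_t  min_run   = sample_rate / 20; /* 50 ms worth of samples */
    size_t run_start = 0, run_len = 0;

    for (size_t i = 0; i < count; i++) {
        if (abs(samples[i]) < threshold) {
            if (run_len == 0)
                run_start = i;
            if (++run_len >= min_run)
                return (long)run_start;
        } else {
            run_len = 0;
        }
    }
    return -1;
}
```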
Since you detected silence in a 50-millisecond interval, you have a "position" that's several hundred samples wide. With a bit of luck, there's a frame boundary in there. Cut there. Otherwise it's time for re-encoding.

Is it possible to memory map a compressed file?

We have large files with zlib-compressed binary data that we would like to memory map.
Is it even possible to memory map such a compressed binary file and access those bytes in an effective manner?
Are we better off just decompressing the data, memory mapping it, then after we're done with our operations compress it again?
EDIT
I think I should probably mention that these files can be appended to at regular intervals.
Currently, this data on disk gets loaded via NSMutableData and decompressed. We then have some arbitrary read/write operations on this data. Finally, at some point we compress and write the data back to disk.
Memory mapping is all about the 1:1 mapping of memory to disk. That's not compatible with automatic decompression, since it breaks the 1:1 mapping.
I assume these files are read-only, since random-access writing to a compressed file is generally impractical. I would therefore assume that the files are somewhat static.
I believe this is a solvable problem, but it's not trivial, and you will need to understand the compression format. I don't know of any easily reusable software to solve it (though I'm sure many people have solved something like it in the past).
You could memory map the file and then provide a front-end adapter interface to fetch bytes at a given offset and length. You would scan the file once, decompressing as you went, and create a "table of contents" file that maps periodic nominal offsets to real offsets (this is just an optimization; you could "discover" this table of contents as you fetch data). Then the algorithm would look something like:
Given nominal offset n, look up greatest real offset m that maps to less than n.
Read from m-32k into a buffer (32k is the largest allowed back-reference distance in DEFLATE).
Begin DEFLATE algorithm at m. Count decompressed bytes until you get to n.
Obviously you'd want to cache your solutions. NSCache and NSPurgeableData are ideal for this. Doing this really well and maintaining good performance would be challenging, but if it's a key part of your application it could be very valuable.
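Here is a rough sketch of the lookup step only, assuming the one-time scan has already produced a sorted table of access points, much like zlib's zran.c example builds; the struct and function names are illustrative.

```c
#include <stddef.h>
#include <sys/types.h>

struct access_point {
    off_t nominal;   /* uncompressed ("nominal") offset of this point */
    off_t real;      /* compressed ("real") offset in the mapped file */
};

/* Binary search: greatest access point whose nominal offset is <= n.
 * The caller then starts decompressing at points[i].real and counts
 * decompressed bytes until it reaches n.                              */
const struct access_point *find_start(const struct access_point *points,
                                      size_t count, off_t n)
{
    size_t lo = 0, hi = count;   /* invariant: answer index lies in [lo, hi) */
    if (count == 0 || points[0].nominal > n)
        return NULL;
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (points[mid].nominal <= n)
            lo = mid;
        else
            hi = mid;
    }
    return &points[lo];
}
```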

Reading last lines of gzipped text file

Let's say file.txt.gz is 2 GB and I want to see the last 100 lines or so. zcat <file.txt.gz | tail -n 100 would go through all of it.
I understand that compressed files cannot be randomly accessed, and if I cut off, let's say, the last 5 MB of it, the data just after the cut will be garbage - but can gzip resync and decode the rest of the stream?
If I understand it correctly, a gzip stream is a straightforward stream of commands describing what to output, so it should be possible to sync with it. Then there's a 32 kB sliding window of the most recent uncompressed data, which of course starts out as garbage if we start in the middle, but I'd guess it would normally get filled with real data quickly, and from that point on decompression is trivial. (It's possible that something gets recopied over and over again from the start of the file to the end, so the sliding window never clears; it would surprise me if that were common, and if it happens we just process the whole file.)
I'm not terribly eager to do this kind of gzip hackery myself - hasn't anybody done it before, for dealing with corrupted files if nothing else?
Alternatively - if gzip really cannot do that, are there perhaps any other stream compression programs that work pretty much like it, except they allow resyncing mid-stream?
EDIT: I found a pure Ruby reimplementation of zlib and hacked it to print the ages of bytes within the sliding window. It turns out that things do get copied over and over again a lot, and even after 5 MB+ the sliding window still contains stuff from the first 100 bytes, and from random places throughout the file.
We cannot even get around that by reading the first few blocks and the last few blocks, as those first bytes are not referenced directly; it's just a very long chain of copies, and the only way to find out what it refers to is to process it all.
Essentially, with default options what I wanted is probably impossible.
On the other hand, zlib has the Z_FULL_FLUSH option, which clears this sliding window for the purpose of syncing. So the question still stands: assuming that zlib syncs every now and then, are there any tools for reading just the end of the file without processing it all?
Z_FULL_FLUSH emits a known byte sequence (00 00 FF FF) that you can use to synchronize. This link may be useful.
This is analogous to the difference between block and stream ciphers: because gzip works as a stream, you might need the whole file up to a certain point to decode the bytes at that point.
As you mention, when the window is cleared, you're golden. But there's no guarantee that zlib actually does this often enough for you... I suggest you seek backwards from the end of the file and find the marker for a full flush.
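Putting those two answers together, here is a hedged sketch in C: read the tail of the file into memory, scan backwards for the byte-aligned 00 00 FF FF marker that a sync/full flush leaves behind, and raw-inflate from just past it. The marker can also occur by coincidence inside compressed data, so a real tool should verify a candidate by checking that inflate() proceeds cleanly; the function names here are illustrative.

```c
#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Find the last 00 00 FF FF in buf[0..len) and return the offset just past
 * it, or -1 if it is not found. */
static long find_last_flush_marker(const unsigned char *buf, size_t len)
{
    static const unsigned char marker[4] = { 0x00, 0x00, 0xFF, 0xFF };
    for (size_t i = len; i >= 4; i--)
        if (memcmp(buf + i - 4, marker, 4) == 0)
            return (long)i;
    return -1;
}

/* Decompress everything after the last flush point in buf (the tail of the
 * .gz file) to stdout.  A full flush empties the history, so no dictionary
 * is needed; a plain raw inflate from that point suffices. */
static int dump_tail(unsigned char *buf, size_t len)
{
    long start = find_last_flush_marker(buf, len);
    if (start < 0)
        return Z_DATA_ERROR;

    z_stream strm;
    unsigned char out[16384];
    memset(&strm, 0, sizeof strm);
    if (inflateInit2(&strm, -15) != Z_OK)   /* raw deflate, no gzip header */
        return Z_STREAM_ERROR;

    strm.next_in = buf + start;
    strm.avail_in = (uInt)(len - (size_t)start);
    int ret;
    do {
        strm.next_out = out;
        strm.avail_out = sizeof out;
        ret = inflate(&strm, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END && ret != Z_BUF_ERROR)
            break;                          /* likely a false marker match */
        fwrite(out, 1, sizeof out - strm.avail_out, stdout);
    } while (ret == Z_OK);

    inflateEnd(&strm);
    return ret == Z_STREAM_END ? Z_OK : ret;
}
```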