Reading last lines of gzipped text file - gzip

Let's say file.txt.gz is 2 GB and I want to see the last 100 lines or so. zcat < file.txt.gz | tail -n 100 would go through all of it.
I understand that compressed files cannot be randomly accessed, and that if I cut off, say, the last 5 MB of it, the data just after the cut will be garbage - but can gzip resync and decode the rest of the stream?
If I understand it correctly, a gzip stream is a straightforward stream of commands describing what to output, so it should be possible to sync with it. Then there's the 32 kB sliding window of the most recent uncompressed data. It starts out as garbage if we begin in the middle, of course, but I'd guess it would normally fill up with real data quickly, and from that point on decompression is trivial. (It's possible that something keeps getting recopied from the start of the file all the way to the end, so the sliding window never clears - it would surprise me if that were common, and if it happens we just process the whole file.)
I'm not terribly eager to do this kind of gzip hackery myself - hasn't anybody done it before, for dealing with corrupted files if nothing else?
Alternatively - if gzip really cannot do that, are there perhaps any other stream compression programs that work pretty much like it, except they allow resyncing mid-stream?
EDIT: I found a pure Ruby reimplementation of zlib and hacked it to print the ages of bytes within the sliding window. It turns out that things do get copied over and over again a lot: even after 5 MB+, the sliding window still contains data from the first 100 bytes of the file, and from random places throughout it.
We cannot even get around that by reading just the first few blocks and the last few blocks, since those first bytes are not referenced directly - it's just a very long chain of copies, and the only way to find out what it refers to is to process it all.
Essentially, with default options what I wanted is probably impossible.
On the other hand, zlib has a Z_FULL_FLUSH option that clears the sliding window for syncing purposes. So the question still stands: assuming that zlib syncs every now and then, are there any tools for reading just the end of the stream without processing it all?

Z_FULL_FLUSH emits a known byte sequence (00 00 FF FF) that you can use to synchronize.
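For illustration, here is a minimal sketch of the writing side, assuming you control the compressor: it gzip-compresses stdin to stdout and issues a Z_FULL_FLUSH once per megabyte of input so the output contains those sync points. SYNC_EVERY and the chunk size are arbitrary choices, and error handling is trimmed.

/* Sketch: gzip-compress stdin to stdout with a full flush once per
   SYNC_EVERY bytes of input, leaving 00 00 FF FF sync points behind. */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define CHUNK 16384
#define SYNC_EVERY (1L << 20)   /* one sync point per MiB of input */

int main(void) {
    z_stream s;
    unsigned char in[CHUNK], out[CHUNK];
    long since_sync = 0;
    size_t n;
    int ret;

    memset(&s, 0, sizeof(s));
    /* windowBits 15 + 16 asks deflate for a gzip wrapper, not zlib's */
    deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 15 + 16, 8,
                 Z_DEFAULT_STRATEGY);

    while ((n = fread(in, 1, CHUNK, stdin)) > 0) {
        int flush = Z_NO_FLUSH;
        since_sync += (long)n;
        if (since_sync >= SYNC_EVERY) {
            flush = Z_FULL_FLUSH;   /* byte-aligns and clears the window */
            since_sync = 0;
        }
        s.next_in = in;
        s.avail_in = (uInt)n;
        do {
            s.next_out = out;
            s.avail_out = CHUNK;
            deflate(&s, flush);
            fwrite(out, 1, CHUNK - s.avail_out, stdout);
        } while (s.avail_out == 0);
    }
    do {                            /* finish the stream */
        s.next_out = out;
        s.avail_out = CHUNK;
        ret = deflate(&s, Z_FINISH);
        fwrite(out, 1, CHUNK - s.avail_out, stdout);
    } while (ret != Z_STREAM_END);
    deflateEnd(&s);
    return 0;
}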

This is analogous to the difference between block and stream ciphers: because gzip compresses as a single stream, you may need the whole file up to a given point in order to decompress the bytes at that point.
As you mention, once the window has been cleared you're golden. But there's no guarantee that zlib actually does this often enough for you... I suggest you seek backwards from the end of the file and look for the marker of a full flush.
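Here is a sketch of that, assuming the stream was written with periodic Z_FULL_FLUSH and that the caller has already read the tail of the file into buf. One caveat: 00 00 FF FF can also occur by coincidence inside compressed data, so when inflate reports an error you should retry from the next marker further back.

/* Sketch: inflate the tail of a stream written with periodic
   Z_FULL_FLUSH.  Searches buf backwards for the 00 00 FF FF marker and
   raw-inflates from the byte after it, writing the result to dst. */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* offset just past the last 00 00 FF FF in buf[0..len), or -1 */
static long find_sync(const unsigned char *buf, long len) {
    for (long i = len - 4; i >= 0; i--)
        if (buf[i] == 0 && buf[i+1] == 0 &&
            buf[i+2] == 0xFF && buf[i+3] == 0xFF)
            return i + 4;
    return -1;
}

int inflate_tail(const unsigned char *buf, long len, FILE *dst) {
    long start = find_sync(buf, len);
    if (start < 0) return -1;

    z_stream s;
    memset(&s, 0, sizeof(s));
    if (inflateInit2(&s, -15) != Z_OK)   /* raw deflate, no header */
        return -1;
    s.next_in  = (unsigned char *)buf + start;
    s.avail_in = (uInt)(len - start);

    unsigned char out[16384];
    int ret;
    do {
        s.next_out  = out;
        s.avail_out = sizeof(out);
        ret = inflate(&s, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END)
            break;                /* false marker: retry further back */
        fwrite(out, 1, sizeof(out) - s.avail_out, dst);
    } while (ret != Z_STREAM_END && (s.avail_in > 0 || s.avail_out == 0));
    inflateEnd(&s);
    return ret == Z_STREAM_END ? 0 : -1;
}

Raw inflate (windowBits of -15) is used because after a full flush there is no zlib or gzip header to parse; inflate simply stops at the end of the deflate data and leaves the gzip trailer unread.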

Related

How to continue decompressing over system reboots? - zlib

Say I have a 100 GB compressed file. After 76% of the decompression, my device got rebooted by other events; I simply want to resume the decompression from that 76% mark where I last left off. That's it.
To help with this, I could control how files are compressed and archived.
But while decompressing on the device there is no command line; only the zlib APIs are available, plus any new APIs this may require.
This is a repost, a question reworded for clarity; I apologize for that. Previously Z_FULL_FLUSH was suggested, but I didn't understand how to use that 76% mark's offset to initialize zlib.
Any feedback is much appreciated.
Thanks
Read through zlib's FAQ and the annotated usage example for a better understanding of how deflate and inflate work together on a compressed stream.
For this, you don't even need to specially prepare the gzip file. You can save the state of inflation periodically. If interrupted, roll back to the previous saved state and start from there.
You can use Z_BLOCK to get inflate() to return at deflate block boundaries; this is noted in data_type, as documented in zlib.h. Pick an amount of uncompressed data after which to save a new state, e.g. 16 MB. Upon reaching that amount, at the next deflate block boundary, save: the location in the compressed data (which is both a byte offset and a bit offset within that byte), the location in the uncompressed data you have saved up to, and the last 32K of uncompressed data, which you can get using inflateGetDictionary().
To restart from the last state, do a raw inflate: use inflatePrime() to feed the bits from the byte at the compressed data offset, and use inflateSetDictionary() to provide the 32K of history. Seek to the saved offset in your output file so you start writing from there, then continue inflating.
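Here is a sketch of the restart side, mirroring what zlib's examples/zran.c does. The checkpoint struct is a made-up name holding exactly what is described above; error handling is trimmed.

/* Restart inflate from a saved state.  `checkpoint` is hypothetical:
   it holds what the answer above says to save at a block boundary. */
#include <sys/types.h>
#include <stdio.h>
#include <string.h>
#include <zlib.h>

struct checkpoint {
    off_t comp_off;              /* byte offset in the compressed file;
                                    if bits != 0, the boundary lies in
                                    the byte just before this offset   */
    int   bits;                  /* bit offset at the boundary (0..7)  */
    off_t uncomp_off;            /* uncompressed bytes written so far  */
    unsigned char window[32768]; /* from inflateGetDictionary()        */
    unsigned win_len;
};

int resume_inflate(FILE *in, FILE *out, const struct checkpoint *cp) {
    z_stream s;
    memset(&s, 0, sizeof(s));
    if (inflateInit2(&s, -15) != Z_OK)       /* raw inflate */
        return -1;

    /* If the boundary is mid-byte, re-read that byte and feed its
       leftover bits to inflate with inflatePrime(). */
    fseeko(in, cp->comp_off - (cp->bits ? 1 : 0), SEEK_SET);
    if (cp->bits) {
        int ch = getc(in);
        inflatePrime(&s, cp->bits, ch >> (8 - cp->bits));
    }
    inflateSetDictionary(&s, cp->window, cp->win_len);

    /* Position the output file, then inflate to the end as usual. */
    fseeko(out, cp->uncomp_off, SEEK_SET);
    unsigned char ibuf[16384], obuf[16384];
    int ret = Z_OK;
    while (ret != Z_STREAM_END) {
        s.avail_in = (uInt)fread(ibuf, 1, sizeof(ibuf), in);
        if (s.avail_in == 0) break;
        s.next_in = ibuf;
        do {
            s.next_out  = obuf;
            s.avail_out = sizeof(obuf);
            ret = inflate(&s, Z_NO_FLUSH);
            if (ret == Z_DATA_ERROR || ret == Z_MEM_ERROR) {
                inflateEnd(&s);
                return -1;
            }
            fwrite(obuf, 1, sizeof(obuf) - s.avail_out, out);
        } while (s.avail_out == 0);
    }
    inflateEnd(&s);
    return ret == Z_STREAM_END ? 0 : -1;
}

The save side is the inflate loop you already have, called with Z_BLOCK: per zlib.h, data_type then signals a block boundary (bit 128 set) and its low bits give the bit offset, which is where these checkpoint fields come from.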

Potential JPG exploit in Chinese GoPro knock-off

Does this look like a potential JPG exploit attempt to you?
I picked up one of these GoPro knock-off action cameras. I tried recording some videos, which seemed to work fine. I later went out with a buddy to shoot some pool and figured I'd make a cool timelapse of it.
Coming home I had hundreds of pictures, all of which seemed corrupt and couldn't be opened. I peeked at one with a hex editor to see why that might be, and stumbled upon this stuff at the top of the file.
Does my camera try to hack me?
Sample File
(mandatory warning, open at own risk of course)
That file doesn't contain the correct codes to make it recognizable as a JPEG image. It does contain all the correct information, but the first two bytes are wrong: the file should start with "FF D8 FF E1...", and in your sample those first two bytes are 00 00. If you fix them, the image opens fine.
(I had to scale the image down to get it to upload - the original is 4 times larger on each side. The quality is quite nice.)
Why this happens is a mystery to me, but most probably there's a bug in the recording software. It shouldn't be difficult to write a small program that reinstates the first two bytes. I suspect the supplied software concatenates the separate JPEGs into a movie.
So no, your jpegs are not invading your computer.
This is Friistyler's script to correct the files (from the comments):

import os

for name in os.listdir("<dir>"):
    if os.path.isfile("<dir>%s" % name):
        with open("<dir>%s" % name, 'r+b') as f:
            f.seek(0)
            f.write(b'\xff\xd8')  # restore the JPEG SOI marker

LabVIEW - buffer data then save to Excel file

My question is about a LabVIEW VI (2013) that I am trying to modify. (I am only just learning to use this language; I have searched the NI site and Stack Overflow for help without success - I suspect I am using the wrong keywords.)
My VI consists of a flat sequence, one pane of which contains a while loop where integer data is collected from a device and displayed on a graph.
I would like to buffer this data and then send it to disk when a preset number of samples has been collected. My attempts so far result in only the last record being saved.
Specifically, I need to know how to save the data in a buffer (array) and then, when the correct number of samples has been captured, save it all to disk (saving as it is captured slows the process down too much).
Hope the question is clear and thanks very much in advance for any suggestions.
Tom
Below is a simple circular buffer that holds the most recent 100 readings. Each time the buffer is refilled, its contents are written to a text file. Drag the image onto a VI's block diagram to try it out.
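Since LabVIEW snippets are images, here is the same logic sketched in C purely for reference (the names are made up):

/* Rough C equivalent of the circular-buffer logic above: keep the most
   recent N readings; each time the buffer fills, dump its contents to
   a text file and start over. */
#include <stdio.h>

#define N 100

static int    buf[N];
static size_t idx = 0;

void record_sample(int sample, FILE *log) {
    buf[idx++] = sample;
    if (idx == N) {               /* buffer full: flush it to disk */
        for (size_t i = 0; i < N; i++)
            fprintf(log, "%d\n", buf[i]);
        fflush(log);
        idx = 0;
    }
}

In the VI, the equivalent is typically a shift register holding the array, Replace Array Subset to store each new reading, and a case structure that writes the array to file when the index wraps.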
As you learn more about LabVIEW and as your performance and multi-threaded needs increase, consider reading about some of the LabVIEW design patterns mentioned in the other answers:
State machine: http://www.ni.com/tutorial/7595/en/
Producer-consumer: http://www.ni.com/white-paper/3023/en/
I'd suggest splitting the data acquisition and the data saving into two different loops using a producer/consumer design pattern.
Moreover, if you need very high throughput, consider using the TDMS file format.
Have a look here for an overview: http://www.ni.com/white-paper/3727/en/
A screenshot would definitely help. However, some things are clear:
Unless you are dealing with a very high volume of data, very slow hard drives, or other unusual requirements, open the file before your while loop, write to it every time you acquire a sample (leaving buffering to the OS), and close it afterwards.
If you decide you need to manage buffering on your own, you can use queues. See this example: https://decibel.ni.com/content/docs/DOC-14804 for reference (they stream data from disk, buffering it in the queue, but it is the same idea)
"My VI consists of a flat sequence one pane of which"
Replace the flat sequence with a finite state machine (e.g. http://forums.ni.com/t5/LabVIEW/Ending-a-Flat-Sequence-Inside-a-case-structure/td-p/3170025)

How to detect silence and cut mp3 file without re-encoding using NAudio and .NET

I've been looking for an answer everywhere and I was only able to find some bits and pieces. What I want to do is to load multiple mp3 files (kind of temporarily merge them) and then cut them into pieces using silence detection.
My understanding is that I can use Mp3FileReader for this but the questions are:
1. How do I read, say, 20 seconds of audio from an MP3 file? Do I need to read 20 × reader.WaveFormat.AverageBytesPerSecond bytes? Or keep reading frames until the sum of Mp3Frame.SampleCount / Mp3Frame.SampleRate exceeds 20 seconds?
2. How do I actually detect the silence? I would look at an appropriate number of consecutive samples to check whether they are all below some threshold. But how do I access the samples regardless of whether they are 8- or 16-bit, mono or stereo, etc.? Can I decode an MP3 frame directly?
3. After I have detected silence at, say, sample 10465, how do I map that back to an MP3 frame index so I can cut without re-encoding?
Here's the approach I'd recommend (which does involve re-encoding):
Use AudioFileReader to get your MP3 as floating point samples directly in the Read method
Find an open source noise gate algorithm, port it to C#, and use that to detect silence (i.e. when the noise gate is closed, you have silence; you'll want to tweak the threshold and attack/release times - a sketch of the gate logic follows this list)
Create a derived ISampleProvider that uses the noise gate, and in its Read method, does not return samples that are in silence
Either: pass the output into WaveFileWriter to create a WAV file, and then encode the WAV file to MP3
Or: use NAudio.Lame to encode directly without a WAV step. You'll probably need to go from the sample provider back down to a 16-bit wave provider first.
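For reference, here is the gate logic sketched in C (not NAudio code; C is used just to keep the idea concrete, and the constants are arbitrary starting points). The gate opens as soon as the level exceeds a threshold (instant attack, for simplicity) and closes once the signal has stayed below it for a release time.

/* Bare-bones noise gate on mono float samples.  Thresholds and times
   are arbitrary starting points, to be tweaked by ear. */
#include <math.h>

#define THRESHOLD    0.02f   /* linear amplitude, roughly -34 dBFS */
#define RELEASE_SECS 0.2f

typedef struct {
    int open;
    int quiet_run;      /* consecutive samples below threshold        */
    int release_len;    /* samples below threshold before gate closes */
} NoiseGate;

void gate_init(NoiseGate *g, int sample_rate) {
    g->open = 0;
    g->quiet_run = 0;
    g->release_len = (int)(RELEASE_SECS * sample_rate);
}

/* Returns 1 while the gate is open (signal), 0 during silence. */
int gate_process(NoiseGate *g, float sample) {
    if (fabsf(sample) >= THRESHOLD) {
        g->open = 1;
        g->quiet_run = 0;
    } else if (g->open && ++g->quiet_run >= g->release_len) {
        g->open = 0;
        g->quiet_run = 0;
    }
    return g->open;
}

A derived ISampleProvider would run each sample through something like gate_process() in its Read method and drop the samples for which the gate reports silence.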
BEFORE READING BELOW: Mark's answer is far easier to implement, and you'll almost certainly be happy with the results. This answer is for those who are willing to spend an inordinate amount of time on it.
So, with that said, cutting an MP3 file based on silence without re-encoding or fully decoding is actually possible... Basically, you can look at each frame's side info and each granule's gain and Huffman data to "estimate" the silence.
Find the silence
Copy all the frames from before the silence to a new file
Now it gets tricky...
Pull the audio data from the frames after the silence, keeping track of which frame header goes with which audio data.
Start writing the second new file, but as you write out the frames, update the main_data_begin field so the bit reservoir stays in sync with where the audio data really is (the sketch below shows where that field lives).
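To make that concrete, here is a made-up helper showing where main_data_begin lives, for MPEG-1 Layer III only; it ignores MPEG-2 (where the field is 8 bits), free-format streams, and most validity checks.

/* Sketch: read the 9-bit main_data_begin field from an MPEG-1
   Layer III frame header + side info. */
#include <stdint.h>
#include <stddef.h>

/* Returns main_data_begin for the frame starting at p, or -1 if p
   does not look like an MPEG-1 Layer III frame header. */
int main_data_begin(const uint8_t *p, size_t avail) {
    if (avail < 8) return -1;
    if (p[0] != 0xFF || (p[1] & 0xE0) != 0xE0) return -1; /* 11-bit sync */
    if ((p[1] & 0x18) != 0x18) return -1;  /* version bits: MPEG-1 only  */
    if ((p[1] & 0x06) != 0x02) return -1;  /* layer bits: Layer III      */
    int crc = ((p[1] & 0x01) == 0);        /* protection bit 0 => CRC    */
    const uint8_t *side = p + 4 + (crc ? 2 : 0);   /* side info start    */
    /* First 9 bits of the side info: how many bytes before this frame's
       sync word the frame's main (audio) data starts, not counting the
       headers and side info in between - i.e. the bit reservoir. */
    return (side[0] << 1) | (side[1] >> 7);
}

Rewriting the field means re-packing those 9 bits and recomputing where each frame's main data will land in the new file.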
MP3 is a compressed audio format. You can't just cut bits out and expect the remainder to still be a valid MP3 file. In fact, since it uses a frequency transform (an MDCT, not a plain DCT), the bits are in the frequency domain instead of the time domain. There simply are no bits for sample 10465. There's a frame which contains sample 10465, and there's a set of bits describing all the frequencies in that frame.
Simply cutting the audio at sample 10465 and continuing with some random other sample will probably cause a discontinuity, which means the number of frequencies present in the resulting frame skyrockets. So that definitely means a full re-encode. The better way is to smooth the transition, but that's not a trivial operation, and the result is of course slightly different from the input, so it still means a re-encode.
I don't understand why you'd want to read 20 seconds of audio anyway. Where's that number coming from? You usually want to read everything.
Sound is a wave; it's entirely expected that it crosses zero, so being close to zero isn't special by itself. For a 20 Hz wave (the lower limit of hearing), zero crossings happen 40 times per second, and each time there are multiple samples near zero. So you need a run of samples that are all close to zero, with quiet on both sides. Values like 5, 6, 7 aren't much for 16-bit audio, but they might very well be part of a wave that peaks at 10000. You really should check at least 0.05 seconds of samples to catch those 20 Hz sounds.
Since you detected silence over a 50 millisecond interval, you have a "position" that's hundreds to thousands of samples wide (depending on sample rate). With a bit of luck, there's a frame boundary in there: cut at it. Otherwise it's time for re-encoding.
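As a sketch of that check (on 16-bit mono PCM; the threshold and window are arbitrary starting points):

/* Find the first stretch of at least 50 ms in which every sample stays
   below a small threshold.  Returns the starting sample index, or -1. */
#include <stdlib.h>

long find_silence(const short *pcm, long nsamples, int sample_rate) {
    const int  threshold = 300;            /* roughly -40 dBFS at 16 bits */
    const long window = sample_rate / 20;  /* 0.05 s, enough for 20 Hz    */
    long run = 0;
    for (long i = 0; i < nsamples; i++) {
        if (abs(pcm[i]) < threshold) {
            if (++run >= window)
                return i - run + 1;        /* start of the quiet stretch  */
        } else {
            run = 0;
        }
    }
    return -1;
}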

How to deal with thousands of small audio files?

I need to implement an app that has a feature to play sounds. Each sound will be a single spoken word, and the number of expected sounds is about one thousand. So the simplest solution would be to store those sounds as sound files, one word per file, and play them on demand. Would there be any potential problems with such a large number of files?
No problem with that many files, but they will take up more space than the total of their sizes. Each file fills a whole number of filesystem blocks on the device, so on average you will waste half a block per file (as a rule of thumb). If all your files are significantly smaller than one block, you will always use 1000 blocks (one per file) and waste 1000 * (block size - average file size). For example, with 4 kB blocks that rule of thumb comes to roughly 2 kB wasted per file, or about 2 MB across 1000 files.
Things you could do:
Concatenate the files into one big file, store the start and length of each subfile, and either read the chunk into memory or copy it to a temporary file (a sketch of this follows the list).
Drop the files into a database as BLOB fields for easier retrieval. This won't save space, but it may make your code simpler or more reliable.
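Here is a sketch of the first option with a made-up index format: fixed-size name/offset/length records, looked up by name, with the clip bodies concatenated after them in one big pack file.

/* Sketch of the "one big file" option.  The layout is hypothetical:
   an index of fixed-size records, then the raw file bodies. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct entry {
    char name[64];
    long offset;    /* from the start of the pack file */
    long length;
};

/* Load one clip into a malloc'd buffer; the caller frees it. */
unsigned char *load_clip(FILE *pack, const struct entry *idx, long count,
                         const char *name, long *len_out) {
    for (long i = 0; i < count; i++) {
        if (strcmp(idx[i].name, name) == 0) {
            unsigned char *buf = malloc((size_t)idx[i].length);
            if (!buf) return NULL;
            fseek(pack, idx[i].offset, SEEK_SET);
            if (fread(buf, 1, (size_t)idx[i].length, pack)
                    != (size_t)idx[i].length) {
                free(buf);
                return NULL;
            }
            *len_out = idx[i].length;
            return buf;
        }
    }
    return NULL;
}

The packer is just the mirror image: write the count and the records first, then append each file and record its offset and length as you go.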
I don't think you need to write your own caching mechanism; most likely iOS has a system-wide cache that does a far better job. Caching should only become relevant if you hit performance issues and need much shorter load times - in that case, perhaps consider using blocks for loading and for dispatching the playback, as that's an easy way to hide the load latency and avoid UI freezes.
If your audio is uncompressed, the App Store will report the compressed size. If that differs a lot from the unpacked size, some (nitpicking) customers will definitely notice and complain, because they think the advertised size is the install size. I know this from personal experience. They will generally not take a technical answer for an answer, and may even bypass talking to you and just rate you down over it. I s#it you not.
You should be fine storing 1000 audio clip files within the IPA, but it is important to take note of the space requirements and organisation.
Also take into consideration that accessing the disk is slower than memory and drains the battery, so it may be better to keep the most frequently used audio clips in memory.
If you can afford it, use FMOD, which I believe can extract audio from various compression schemes. If you just want to handle all those files yourself, create a .zip file and extract them on the fly using libz (iOS's libz.dylib).