Potential JPG exploit in Chinese GoPro knock-off camera

Does this look like a potential JPG exploit attempt to you?
I picked up one of these GoPro knock-off action cameras. Recording videos seemed to work fine, so when I later went out with a buddy to shoot some pool, I thought a time-lapse of it would be cool.
Coming home I had hundreds of pictures, all of which seemed corrupt and couldn't be opened. I peeked at them with a hex editor to see why that might be and stumbled upon this stuff at the top of the file.
Does my camera try to hack me?
Sample File
(mandatory warning, open at own risk of course)

That file doesn't contain the correct marker bytes to make it recognizable as a JPEG image. All the image data is there, but the first two bytes are wrong. The file should start with "FF D8 FF E1..."; in your sample they are 00 00. If you patch those first two bytes back, the resulting image is:
(I had to scale the image to get it to upload - it's 4 times larger on each side. The quality is quite nice)
Why this happens is a mystery to me, but very probably there's a bug in the recording software. It shouldn't be difficult to make a small program which restores the first two bytes. I suspect that the supplied software concatenates the separate JPEGs into a movie.
So no, your JPEGs are not invading your computer.
This is Friistyler's script to correct the files (from the comment below):
import os

SRC_DIR = "<dir>"  # directory containing the corrupted JPEGs

for name in os.listdir(SRC_DIR):
    if os.path.isfile(os.path.join(SRC_DIR, name)):
        with open(os.path.join(SRC_DIR, name), 'r+b') as f:
            f.seek(0)             # go to the very first byte
            f.write(b'\xff\xd8')  # restore the JPEG SOI marker


How to remove everything except bitmaps from a PDF?

In How can I remove all images from a PDF?, Kurt Pfeifle gave a piece of PostScript code (by courtesy of Chris Liddell) to filter out all bitmaps from a PDF, using GhostScript.
This works like a charm; however, I'm also interested in the companion task of removing everything except bitmaps from the PDF, without recompressing the bitmaps. Or, alternatively, separating the vector and bitmap "layers." (I know, that's not what a layer is in PDF terminology.)
AFAIU, Kurt's filter works by sending all bitmaps to a null device, while leaving everything else to pdfwrite. I read that it is possible to use different devices with GS, so my hope is that it is possible to send everything to a fake/null device by default, and only switch to pdfwrite for those images which are captured by the filter. But unfortunately I'm completely unable to translate such a thing into PostScript code.
Can anyone help, or at least tell me if this approach might be doomed to fail?
It's possible, but it's a large amount of work.
You can't start with the nulldevice and push the pdfwrite device as needed; that simply won't work, because the pdfwrite device will write out the accumulated PDF file as soon as you unload it. Reloading it will start a new PDF file.
Also, you need the same instance of the pdfwrite device for all the code, so you can't load the pdfwrite device, load the nulldevice, then load the pdfwrite device again only for the bits you want. Which means the only approach which (currently) works is the one which Chris wrote. You need to load pdfwrite and push the null device into place whenever you want to silently consume an operation.
Filtering out just images is quite a limited amount of change, because there aren't that many operators which deal with images.
In order to remove everything except images, however, there are a lot more operators. You need to override: stroke, fill, eofill, rectstroke, rectfill, ustroke, ufill, ueofill, shfill, show, ashow, widthshow, awidthshow, xshow, xyshow, yshow, glyphshow, cshow and kshow. I might have missed a few operators, but those are the basics at least.
Note that the code Chris originally posted did actually filter various types of objects, not just images; you can find his code here:
http://www.ghostscript.com/~chrisl/filter-obs.ps
Please be aware this is unsupported example code only.

Optimized conversion of many PDF files to PNG if most of them have 90% the same content

I'm using ImageMagick to convert a few hundred thousand PDF files to PNG files, and ImageMagick takes about ten seconds per file. Most of these PDF files are automatically generated grading certificates, so it's basically just a bunch of PDFs with the same forms filled in with different numbers. There are also a few simple raster images on each PDF.
One option is to just throw computing power at it, but that means money, as well as making sure the files all end up in the right place when they come back. Another option is to just wait it out on our current computer, but I did the calculations, and we won't even be able to keep up with the certificates we get in real time.
Now, the option I'm hoping to pursue is to somehow take advantage of the fact that most of these files are very similar, so if we have some sort of pre-computed template to use, we can skip the process of calculating the entire PDF file every time the conversion is done. I'd do a quick check to see if the PDF fits any of the templates, run the optimized conversion if it does, and just do a full conversion if it doesn't.
Of course, my understanding of the PDF file format is intermediate at best, and I don't even know if this idea is practical. Would it require making a custom version of ImageMagick? Maybe contributing to the ImageMagick source code? Or is there some solution out there already that does exactly what I need? (We've all spent weeks on a project and then had that happen, I imagine.)
OK, I have had a look at this. I took your PDF and converted it to a JPEG like this (until you tell me the actual parameters you prefer):
convert -density 288 image.pdf image.jpg
and it takes 8 seconds and results in a 2448x3168 pixel JPEG of 1.6MB - enough for a page size print.
Then I copied your file 99 times so I had 100 PDFs, and processed them sequentially like this:
#!/bin/bash
for p in *.pdf; do
    new="${p%%pdf}jpg"          # replace the .pdf extension with .jpg
    echo "$new"
    convert -density 288 "$p" "$new"
done
and that took 14 minutes 32 seconds, or an average of 8.7 seconds per file.
Then I tried using GNU Parallel to do exactly the same 100 PDF files, like this:
time parallel -j 8 convert -density 288 {} {.}.jpg ::: *.pdf
keeping all 8 cores of my CPU very busy, and it processed the same 100 PDFs in 3 minutes 12 seconds, averaging 1.92 seconds each, or a 4.5x speed-up. I'd say that's well worth the effort for a pretty simple command line.
Depending on your preferred parameters for convert there may be further enhancements possible...
The solution in my case ended up being to use MuPDF (thanks @Vadim) from the command line, which is about ten times faster than GhostScript (the renderer used by ImageMagick). MuPDF fails on about 1% of the PDF files due to improper formatting, which GhostScript is able to handle reasonably well, so I wrote an exception handler to fall back to ImageMagick in those cases. Even so, it took about 24 hours on an 8-core server to process all the PDF files.
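For reference, here's a minimal sketch of that setup in Python, assuming the PyMuPDF binding (imported as fitz) rather than the MuPDF command-line tools, with ImageMagick's convert as the fallback. The function name, page choice and DPI are illustrative only, and PyMuPDF's method names have shifted a little between versions:

import subprocess
import fitz  # PyMuPDF

def render_first_page(pdf_path, png_path, dpi=288):
    """Render page 1 with MuPDF; fall back to ImageMagick/Ghostscript for the
    small fraction of malformed PDFs that MuPDF refuses to open."""
    try:
        doc = fitz.open(pdf_path)
        zoom = dpi / 72.0  # PDF user space is 72 units per inch
        pix = doc[0].get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        pix.save(png_path)
        doc.close()
    except Exception:
        subprocess.run(["convert", "-density", str(dpi), pdf_path, png_path],
                       check=True)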

How to detect silence and cut mp3 file without re-encoding using NAudio and .NET

I've been looking for an answer everywhere and I was only able to find some bits and pieces. What I want to do is to load multiple mp3 files (kind of temporarily merge them) and then cut them into pieces using silence detection.
My understanding is that I can use Mp3FileReader for this but the questions are:
1. How do I read, say, 20 seconds of audio from an MP3 file? Do I need to read 20 times reader.WaveFormat.AverageBytesPerSecond bytes? Or should I keep reading frames until the sum of Mp3Frame.SampleCount / Mp3Frame.SampleRate exceeds 20 seconds?
2. How do I actually detect the silence? I would look at an appropriate number of consecutive samples to check whether they are all below some threshold. But how do I access the samples regardless of them being 8- or 16-bit, mono or stereo, etc.? Can I decode an MP3 frame directly?
3. After I have detected silence at say sample 10465, how do I map it back to the mp3 frame index to perform the cutting without re-encoding?
Here's the approach I'd recommend (which does involve re-encoding)
Use AudioFileReader to get your MP3 as floating point samples directly in the Read method
Find an open source noise gate algorithm, port it to C#, and use that to detect silence (i.e. when noise gate is closed, you have silence. You'll want to tweak threshold and attack/release times)
Create a derived ISampleProvider that uses the noise gate, and in its Read method, does not return samples that are in silence
Either: pass the output into WaveFileWriter to create a WAV file and encode the WAV file to MP3
Or: use NAudio.Lame to encode directly without a WAV step. You'll probably need to go from SampleProvider back down to 16 bit WAV provider first
BEFORE READING BELOW: Mark's answer is far easier to implement, and you'll almost certainly be happy with the results. This answer is for those who are willing to spend an inordinate amount of time on it.
So with that said, cutting an MP3 file based on silence without re-encoding or full decoding is actually possible... Basically, you can look at each frame's side info and each granule's gain & Huffman data to "estimate" the silence.
Find the silence
Copy all the frames from before the silence to a new file (walking and copying whole frames is sketched just after this list)
Now it gets tricky...
Pull the audio data from the frames after the silence, keeping track of which frame header goes with what audio data.
Start writing the second new file, but as you write out the frames, update the main_data_begin field so the bit reservoir is in sync with where the audio data really is.
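If you want to experiment with the first two steps, here's a rough Python sketch of walking MPEG-1 Layer III frames and copying whole frames up to a cut point. It's an illustration only, not NAudio code: it ignores ID3 tags, other MPEG versions and layers, and the bit reservoir, which is exactly the part the remaining steps have to fix up for the second file. The function name is made up; the tables are the standard MP3 header layout:

# Minimal MPEG-1 Layer III frame walker (no ID3 handling, no bit reservoir).
BITRATES = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320]  # kbps
SAMPLE_RATES = [44100, 48000, 32000]
SAMPLES_PER_FRAME = 1152

def copy_frames_until(src_path, dst_path, cut_sample):
    """Copy whole frames from src to dst until the frame containing cut_sample."""
    data = open(src_path, "rb").read()
    pos, samples = 0, 0
    with open(dst_path, "wb") as out:
        while pos + 4 <= len(data) and samples < cut_sample:
            h = data[pos:pos + 4]
            # 11 sync bits, then MPEG-1 (11) and Layer III (01)
            if h[0] != 0xFF or (h[1] & 0xFE) != 0xFA:
                pos += 1
                continue
            br_idx, sr_idx = h[2] >> 4, (h[2] >> 2) & 0x03
            if br_idx in (0, 15) or sr_idx == 3:
                pos += 1            # free-format or reserved values: skip
                continue
            padding = (h[2] >> 1) & 0x01
            frame_len = 144 * BITRATES[br_idx] * 1000 // SAMPLE_RATES[sr_idx] + padding
            out.write(data[pos:pos + frame_len])
            samples += SAMPLES_PER_FRAME
            pos += frame_len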
MP3 is a compressed audio format. You can't just cut bits out and expect the remainder to still be a valid MP3 file. In fact, since it's a DCT-based transform, the bits are in the frequency domain instead of the time domain. There simply are no bits for sample 10465. There's a frame which contains sample 10465, and there's a set of bits describing all frequencies in that frame.
Plain cutting the audio at sample 10465 and continuing with some random other sample probably causes a discontinuity, which means the number of frequencies present in the resulting frame skyrockets. So that definitely means a full recode. The better way is to smooth the transition, but that's not a trivial operation. And the result is of course slightly different than the input, so it still means a recode.
I don't understand why you'd want to read 20 seconds of audio anyway. Where's that number coming from? You usually want to read everything.
Sound is a wave; it's entirely expected that it crosses zero, so a single sample being close to zero isn't special. For a 20 Hz wave (the threshold of hearing), zero crossings happen 40 times per second, and each time you'll have multiple samples near zero. So you basically need multiple samples that are all close to zero, but on both sides. Values like 5, 6, 7 aren't much for 16-bit sound, but they might very well be part of a wave that peaks at 10000. You really should check at least 0.05 seconds to catch those 20 Hz sounds.
Since you detected silence in a 50 millisecond interval, you have a "position" that is several hundred samples wide. With any bit of luck there's a frame boundary in there; cut there. Otherwise it's time for re-encoding.
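To make the windowed-threshold idea concrete, here's a small Python/NumPy sketch (an illustration of the logic only, not NAudio code). It assumes floating-point samples normalized to [-1, 1], like the ones AudioFileReader returns on the C# side; the threshold value and function name are placeholders to tune:

import numpy as np

def silent_regions(samples, sample_rate, threshold=0.02, min_seconds=0.05):
    """Return (start, end) sample indices of stretches where every sample stays
    below `threshold` for at least `min_seconds` (50 ms spans one 20 Hz period)."""
    min_len = max(1, int(min_seconds * sample_rate))
    quiet = np.abs(samples) < threshold       # per-sample test against the gate
    regions, run_start = [], None
    for i, q in enumerate(quiet):
        if q and run_start is None:
            run_start = i                     # a quiet run begins
        elif not q and run_start is not None:
            if i - run_start >= min_len:
                regions.append((run_start, i))
            run_start = None
    if run_start is not None and len(quiet) - run_start >= min_len:
        regions.append((run_start, len(quiet)))
    return regions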

Could someone explain Tesseract OCR training to me?

I'm trying to do the training process, but I don't understand even how to start. I would like to train it to read numbers. My images are from the real world, so the recognition didn't go so well.
It says that I have to have a ".tif" image with the examples... should that be a single image for each digit (in this case), or one image with a lot of different examples of the numbers (same font, though)?
And what about makebox? The command didn't work for me.
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
Could someone explain this better, at least how to start?
I saw a few programs that do this more quickly, but the one I tried (SunnyPage 1.8) isn't free. Does anyone know any free software that does this, or a good tutorial?
Using Tesseract 3 on Windows 8 (32-bit).
It is important to patiently follow the training wiki on the Google Code project site, multiple times if needed. Tesseract is an open-source library and is constantly evolving.
You will have to create a training image (TIFF) with a lot of different examples of the numbers; it should probably contain every digit you wish the engine to recognize.
Please consider posting the exact error message you got with makebox.
I think Tesseract is the best free solution available. You have to keep working at it and seek help from the community.
There is a very good post from Cédric here explaining the training process for Tesseract.
A good free OCR program is PDF OCR X, which is also based on Tesseract. I tried it to copy my notes in German, which I had scanned at 1200 dpi, and the results were commendable but not perfect. I found that this website - http://onlineocr.net - is a lot more accurate. If you are not registered, it allows a maximum file size of 4 MB from most image formats (BMP, PNG, JPEG etc.) and PDF. It can output the results as a Word file, an Excel file or a txt file.
Hope this helps.

Reading last lines of gzipped text file

Let's say file.txt.gz is 2 GB, and I want to see the last 100 lines or so. zcat <file.txt.gz | tail -n 100 would go through all of it.
I understand that compressed files cannot be randomly accessed, and if I cut off, let's say, the last 5 MB of it, the data just after the cut will be garbage - but can gzip resync and decode the rest of the stream?
If I understand it correctly, the gzip stream is a straightforward stream of commands describing what to output, so it should be possible to sync with that. Then there's a 32 kB sliding window of the most recent uncompressed data, which of course starts out as garbage if we begin in the middle, but I'd guess it would normally fill with real data quickly, and from that point decompression is trivial. (Well, it's possible that something gets recopied over and over again from the start of the file to the end, so the sliding window never clears - it would surprise me if that were all that common - and if that happens we just process the whole file.)
I'm not terribly eager to do this kind of gzip hackery myself - hasn't anybody done it before, for dealing with corrupted files if nothing else?
Alternatively - if gzip really cannot do that, are there perhaps any other stream compression programs that work pretty much like it, except they allow resyncing mid-stream?
EDIT: I found a pure-Ruby reimplementation of zlib and hacked it to print the ages of bytes within the sliding window. It turns out that things do get copied over and over again a lot, and even after 5 MB+ the sliding window still contains data from the first 100 bytes and from random places throughout the file.
We cannot even get around that by reading the first few blocks and the last few blocks, since those first bytes are not referenced directly; it's just a very long chain of copies, and the only way to find out what it refers to is by processing it all.
Essentially, with default options what I wanted is probably impossible.
On the other hand, zlib has a Z_FULL_FLUSH option that clears the sliding window for the purpose of syncing. So the question still stands: assuming the compressor flushes every now and then, are there any tools for reading just the end of the file without processing it all?
Z_FULL_FLUSH emits a known byte sequence (00 00 FF FF) that you can use to synchronize. This link may be useful.
This is analogous to the difference between block and stream ciphers. Because gzip is a stream format, you might need the whole file up to a certain point to decode the bytes at that point.
As you mention, when the window is cleared, you're golden. But there's no guarantee that zlib actually does this often enough for you... I suggest you seek backwards from the end of the file and find the marker for a full flush.
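As a rough sketch of that idea in Python (assuming the stream really was written with periodic Z_FULL_FLUSH, which plain gzip does not do by default), you can read a chunk from the end of the file, look for the 00 00 FF FF marker and hand everything after it to a raw-DEFLATE inflater; the function name and tail size are arbitrary:

import zlib

FLUSH_MARKER = b"\x00\x00\xff\xff"  # empty stored block written by Z_FULL_FLUSH

def tail_of_gzip(path, tail_bytes=5 * 1024 * 1024):
    """Decode only the tail of a gzip member, assuming the writer called
    Z_FULL_FLUSH now and then. Returns None if no usable flush point is found."""
    with open(path, "rb") as f:
        f.seek(0, 2)
        f.seek(max(0, f.tell() - tail_bytes))
        chunk = f.read()
    start = 0
    while True:
        pos = chunk.find(FLUSH_MARKER, start)
        if pos == -1:
            return None  # no flush point: the whole file must be decompressed
        try:
            # After a full flush the window is reset and the next block starts
            # on a byte boundary, so a fresh raw-DEFLATE inflater can resume here.
            d = zlib.decompressobj(-zlib.MAX_WBITS)
            return d.decompress(chunk[pos + len(FLUSH_MARKER):])
        except zlib.error:
            start = pos + 1  # false positive inside compressed data; keep looking

# e.g. print the last 100 lines, as in the question:
data = tail_of_gzip("file.txt.gz")
if data is not None:
    print(b"\n".join(data.splitlines()[-100:]).decode("utf-8", "replace"))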