How to extract the encoding dictionary from gzip archives

I am looking for a way to extract the encoding dictionary built by the DEFLATE algorithm from a gzip archive.
I need the LZ77 pointers (back-references) for the whole archive, which refer to earlier patterns in the file, as well as the Huffman tree used to encode those pointers.
Is there any solution in Python?
Does anyone know whether infgen (https://github.com/madler/infgen/blob/master/infgen.c) can provide the dictionary?

The "dictionary" used for compression at any point in the input is nothing more than the 32K bytes of uncompressed data that precede that point.
Yes, infgen will disassemble a deflate stream, showing all of the LZ77 references and the derived Huffman codes in a readable form. You could run infgen from Python and interpret the output in Python.
infgen also has a -b option for a non-human-readable binary format that might be faster to process for what you want to do.
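A minimal Python sketch of that approach. It assumes infgen has been built and is on the PATH, that it reads the gzip data from standard input, and that LZ77 references show up in its text output as lines of the form "match <length> <distance>". Check those assumptions against the infgen version you build; the function names here are made up for illustration.

```python
import subprocess


def run_infgen(gz_path, infgen="infgen"):
    """Feed a gzip file to infgen on stdin and return its text output.

    Assumption: an executable named 'infgen' is on the PATH.
    """
    with open(gz_path, "rb") as f:
        result = subprocess.run([infgen], stdin=f, capture_output=True, check=True)
    # latin-1 never fails to decode, whatever bytes the output contains.
    return result.stdout.decode("latin-1")


def extract_matches(infgen_text):
    """Collect (length, distance) pairs from 'match <length> <distance>' lines.

    Assumption: that is how the installed infgen prints LZ77 references.
    """
    matches = []
    for line in infgen_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "match":
            matches.append((int(parts[1]), int(parts[2])))
    return matches
```

Usage would be something like extract_matches(run_infgen("archive.gz")), where archive.gz is a placeholder name. Each distance counts back into the preceding uncompressed data, which is the "dictionary" described above.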

Related

Is it possible to obfuscate PDF file binary data?

Is it possible to obfuscate the bytes that are visible when a PDF file is opened with a hex editor? Also, would obfuscating the file cause any problems when viewing its contents?
You will always be able to see whatever bytes are within a file using a hex editor.
There might be ways to generate your PDF pages using methods that don't involve directly writing the text into the PDF (for example, using obfuscated JavaScript).
As answered above, the bytes of the file are always visible when viewed with a hex editor. However, there are some options to hide or protect data in the file:
You could encrypt either the whole PDF or parts of its data. Note that encryption and decryption always require a secret. When the file is fully encrypted, you can't read it without the key.
You can add additional, similar dataframes but make them invisible in the PDF. Note that this technique inflates the size of the file.
You can use scripting languages that build up your PDF dynamically. Be aware that this could look suspicious to users or to anti-virus software.
You can use steganography tools to hide your data. One example of such a tool is steghide.
You can simply compress the data streams in the PDF, e.g. using gzip or similar compression tools. That way they can't be read directly. However, compression is easy for anyone to recognize and reverse.
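To illustrate the last point, here is a small Python sketch using the standard zlib module (the same deflate-based compression behind PDF's FlateDecode filter). The content stream is a made-up example, not taken from a real file.

```python
import zlib

# Hypothetical page content stream, repeated so there is something to compress.
content = b"BT /F1 12 Tf (Hello, world) Tj ET\n" * 20

packed = zlib.compress(content)

# The text is no longer readable in a hex view of the compressed bytes...
assert b"Hello" not in packed
assert len(packed) < len(content)

# ...but a single call restores it, so this is obfuscation, not protection.
assert zlib.decompress(packed) == content
```

Anyone who spots the compression can undo it this way, which is why encryption is the option to use when the data actually needs protecting.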

puff.c: How do I create a deflate stream that will work?

I'm using zlib to deflate a series of arrays using compress. My test code uses uncompress and works correctly. Here's my question:
Can I use zlib to compress my array so that it can be uncompressed using puff.c? puff.c is already part of a much larger application, and I do not have the option of installing zlib as a library.
I ran pufftest.c with zeros.raw successfully, but how do I create zeros.raw?
"raw" means no zlib header or trailer. You can simply strip the two-byte header and four-byte trailer from the output of compress to feed to puff. Better would be to process the zlib header and trailer (documented in RFC 1950), and feed the deflate innards to puff. Then the trailer provides an integrity check on the uncompressed data, as was intended.
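The header and trailer handling can be checked quickly in Python, whose zlib module wraps the same library. This is only a sketch with made-up sample data; the same byte offsets apply to the output of compress in C, as long as no preset dictionary is used (compress never sets one).

```python
import zlib

data = bytes(1000)             # sample input (assumption): 1000 zero bytes
z = zlib.compress(data)        # zlib stream: 2-byte header, deflate data, 4-byte trailer

# RFC 1950: the two header bytes, read as a big-endian number, are a multiple of 31.
assert ((z[0] << 8) | z[1]) % 31 == 0

raw = z[2:-4]                  # raw deflate stream, the form puff expects
trailer = int.from_bytes(z[-4:], "big")   # Adler-32 of the uncompressed data

# A negative wbits value tells zlib to expect raw deflate (no header or trailer).
restored = zlib.decompress(raw, wbits=-15)
assert restored == data
assert zlib.adler32(restored) == trailer  # the integrity check the trailer provides
```

Writing raw to a file gives a test input in the same spirit as zeros.raw.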

With VB.NET, is there a way to find all compressed files within a folder and its subfolders?

I know how to find .zip files based on using the extension, but does anyone know of a way to find all compressed files without having to specify each type or extension?
Here's some code with pseudo logic at the end of it.
Dim zipFiles = New DirectoryInfo(tempFolder & "\extract") _
    .GetFiles("*", SearchOption.AllDirectories) _
    .Where(Function(f) IsCompressed(f)) ' pseudo logic: IsCompressed is the check I don't know how to write
So basically without having to specify every type of zipped/compressed extension.
Put simply, this is not possible in general. Though you could do it for a few well-known compression algorithms and formats, it is important to understand that anyone could come up with a new compression technique that uses its own file structure to store compressed data. Also, an uncompressed file could technically contain exactly the same sequence of bytes that a compression algorithm would generate for some input. So generally speaking, the extension is in most cases the only way of deciding whether or not a particular file contains compressed data.
Therefore your best bet would be to search for a list of known compression formats and the file extensions they use, and call the GetFiles() method with that list.
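For the "few well-known formats" case, one alternative to an extension list is checking the leading signature bytes of each file. Below is a rough sketch in Python rather than VB.NET; the signature table is deliberately short and not exhaustive, the function names are illustrative, and, as noted above, a match is only a strong hint, never proof.

```python
from pathlib import Path

# Leading "magic" bytes of a few well-known compressed formats (not exhaustive).
SIGNATURES = {
    b"\x1f\x8b": "gzip",
    b"PK\x03\x04": "zip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
    b"7z\xbc\xaf\x27\x1c": "7z",
}


def sniff_compression(head):
    """Return a format name if the leading bytes match a known signature, else None."""
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return None


def find_compressed(root):
    """Yield (path, format) for files under root that start with a known signature."""
    for path in Path(root).rglob("*"):
        if path.is_file():
            with path.open("rb") as f:
                kind = sniff_compression(f.read(8))
            if kind:
                yield path, kind
```

The same idea carries over to VB.NET by reading the first few bytes of each FileInfo inside the Where predicate.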

Compressing the final output data into gzip format

I am getting the final output data in C++ as a string and need to compress that data in gzip format. Can someone tell me how to implement this?
Use zlib. It's probably already available in your development environment. (Which for some reason you are keeping a secret in your question.)
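As a language-neutral illustration of what "use zlib" involves, here is a sketch in Python, whose zlib module wraps the same library. The key detail is the window-bits value: 16 + 15 requests a gzip header and trailer instead of the default zlib wrapper, and the same value is what deflateInit2() takes as windowBits in the C API. The function name is made up for this example.

```python
import zlib


def gzip_string(text):
    """Compress a string into gzip format using zlib."""
    # wbits = 16 + 15: maximum window size, with a gzip header and trailer.
    compressor = zlib.compressobj(level=9, wbits=16 + 15)
    return compressor.compress(text.encode("utf-8")) + compressor.flush()


packed = gzip_string("final output data")
assert packed[:2] == b"\x1f\x8b"   # gzip magic number

# wbits = 16 + 15 on the decoding side expects the gzip wrapper.
assert zlib.decompress(packed, wbits=16 + 15) == b"final output data"
```

In C++ the corresponding calls would be deflateInit2(), deflate() with Z_FINISH, and deflateEnd().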

Surefire way of determining the codec of a media file

I'm looking for a surefire way of determining the codec used in an audio or video file. The two things I am currently using are the file extension (obvious), and the mime type as determined by running `file -ib' on the file.
This doesn't seem to get me all the way there: loads of formats are `wrapper' formats that hide the exact codec used within -- for example, '.ogg' files can internally use the Vorbis, Speex, or FLAC codecs. Their MIME type is also usually hidden under 'application/ogg' or similar.
The `file' program is apparently able to tell me which codec is used, but it returns this as human-readable prose:
kb.ogg: Ogg data, Vorbis audio, stereo, 44100 Hz, ~0 bps
and as such it is dodgy to use programmatically.
What I'm essentially asking is: is there a script out there (any language) that can wade through these wrapper formats and tell me what the meat of the file is made of?
ffmpeg includes a library called libavformat that can open and demux pretty much any media format. Obviously that's more than you actually need, but I don't think you can find anything else that's quite as complete. I've used it myself with great success. Take a look at this article for an introduction. There are also bindings for these libraries for some common scripting languages, such as Python.
(If you don't want to build something using the library, you can probably use the regular ffmpeg binary.)
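One way to do that from a script is ffprobe, the inspection tool that ships with ffmpeg, which can print stream information as JSON instead of prose. A minimal Python sketch, assuming ffprobe is installed and on the PATH; the function names are illustrative.

```python
import json
import subprocess


def probe(path, ffprobe="ffprobe"):
    """Run ffprobe on a media file and return its JSON output as text.

    Assumption: an executable named 'ffprobe' is on the PATH.
    """
    cmd = [
        ffprobe, "-v", "error",
        "-show_entries", "stream=codec_type,codec_name",
        "-of", "json",
        path,
    ]
    return subprocess.run(cmd, capture_output=True, check=True, text=True).stdout


def codecs_from_json(text):
    """Return a list of (codec_type, codec_name) pairs, one per stream."""
    streams = json.loads(text).get("streams", [])
    return [(s.get("codec_type"), s.get("codec_name")) for s in streams]
```

Calling codecs_from_json(probe("kb.ogg")) would then give one pair per stream in the container, which is easier to handle programmatically than the prose printed by file.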
You can always use your own magic file, copied and modified from the pre-installed magic file, and change the return string so that it can be easily parsed by your program.
See:
http://linux.die.net/man/1/file
http://linux.die.net/man/5/magic