With VB.NET, is there a way to find all compressed files within a folder and its subfolders?

I know how to find .zip files based on using the extension, but does anyone know of a way to find all compressed files without having to specify each type or extension?
Here's some code with pseudo logic at the end of it.
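' Requires Imports System.IO and Imports System.Linq
Dim zipFiles = New DirectoryInfo(tempFolder & "\extract") _
    .GetFiles("*", SearchOption.AllDirectories) _
    .Where(Function(f) IsCompressed(f)) ' pseudo logic: IsCompressed is the part I don't know how to write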
So basically without having to specify every type of zipped/compressed extension.

In short, this is not possible in the general case. You could do it for a few well-known compression algorithms and formats, but anyone can come up with a new compression technique that uses its own file structure to store compressed data. An uncompressed file could also, technically, contain exactly the same sequence of bytes that a compression algorithm would generate for some input. So, generally speaking, the extension is in most cases the only way of deciding whether a particular file contains compressed data.
Therefore your best bet is to look up a list of known compression formats and the file extensions they use, then filter the results of GetFiles() against that list.
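For the "few well-known formats" case mentioned above, one option is to look at the first bytes of each file instead of its name. The following is only a rough sketch of that idea, written in Python rather than VB.NET; the signature table is deliberately incomplete, and the names SIGNATURES, sniff_format and find_compressed are made up for this illustration. As the answer warns, an uncompressed file can start with the same bytes by coincidence, and formats that are ZIP-based (such as .docx) will be reported as zip.

```python
import os

# Leading "magic" bytes of a few well-known formats (an incomplete, illustrative list).
SIGNATURES = {
    b"PK\x03\x04": "zip",
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"7z\xbc\xaf\x27\x1c": "7z",
    b"\xfd7zXZ\x00": "xz",
    b"Rar!\x1a\x07": "rar",
}


def sniff_format(path):
    """Return a format name if the file starts with a known signature, else None."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return None


def find_compressed(root):
    """Walk root and its subfolders, yielding (path, format) for recognised files."""
    for folder, _subdirs, files in os.walk(root):
        for filename in files:
            path = os.path.join(folder, filename)
            kind = sniff_format(path)
            if kind is not None:
                yield path, kind
```

Calling list(find_compressed(some_folder)) returns the recognised files regardless of their extensions; anything whose signature is not in the table is silently skipped.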

Related

Is it possible to obfuscate PDF file binary data?

Is it possible to obfuscate the bytes that are visible when a PDF file is opened with a hex editor? Also, I wonder if there is any problem in viewing the contents of the PDF file even if it is obfuscated.
You will always be able to see whatever bytes are within a file using a hex editor.
There might be ways to generate your PDF pages without writing the text directly into the PDF (for example, using obfuscated JavaScript).
As answered above, the bytes of the file are always visible when viewed with a hex editor. However, there are some options to hide or protect data in the file:
- You could encrypt either the whole PDF or parts of its data. Note that encryption and decryption always require a secret. When the file is fully encrypted, you can't read it without the key.
- You can add additional, similar data blocks but make them invisible in the PDF. Note that this technique inflates the size of the file.
- You can use scripting languages that build up your PDF dynamically. Be aware that this could look suspicious to users or to anti-virus software.
- You can use steganography tools to hide your data. One such tool is steghide.
- You can simply compress the data streams in the PDF, e.g. using gzip or similar compression tools. That way the content can't be read directly. However, this is easy for anyone to recognize and to decompress.
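The last option can be illustrated with a small Python sketch. It assumes the stream is compressed with zlib (the deflate-based compression behind PDF's Flate filter); the page content below is invented for the example. It shows both halves of the point: the text stops being visible in the raw bytes, and one library call brings it back.

```python
import zlib

# Invented page content; in a real PDF this would sit inside a stream object.
content = b"BT /F1 12 Tf (Confidential invoice text) Tj ET\n" * 20

packed = zlib.compress(content)      # what a compressed stream would hold
restored = zlib.decompress(packed)   # what any reader, or curious user, can do

# The text is plainly visible before compression...
print(b"Confidential" in content)
# ...no longer appears as-is in the compressed bytes...
print(b"Confidential" in packed)
# ...and is fully recovered by decompressing.
print(restored == content)
```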

embed identification in file and resistance to detection

Say I'm distributing a file that I want to be secret, and I assign each person that I give the file a unique id.
How can I embed this id in the file so that I can determine who leaks my file?
Some file formats have a section in which I can put information that won't render the file corrupt. But this is easily detectable by looking at the specific section, or by changing the information.
I would guess that any solution is identifiable by byte comparison, but I was wondering whether there are solutions that embed the id in a part of the file that, if changed, renders the file corrupt. (I would guess this would be file format specific, but this question is to learn about techniques, so I'd gladly read about specific cases.)
Thanks!
For image files and Unicode text you may use Steganography.
For audio files there are special watermarking algorithms that add noise not heard by humans.
You may use metadata to add watermarks, but they can easily be removed by the end user.
See what is currently possible in this SO question: Good library for Digital watermarking
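As a rough illustration of steganography in Unicode text, the Python sketch below hides a numeric id as invisible zero-width characters. The choice of characters, the 16-bit width and the function names embed_id and extract_id are all assumptions made for this example. It does not give the tamper-resistance the question asks about: anyone who knows to look can strip the characters.

```python
ZERO = "\u200b"  # zero-width space, used here to encode bit 0
ONE = "\u200c"   # zero-width non-joiner, used here to encode bit 1


def embed_id(text, recipient_id, bits=16):
    """Hide recipient_id as invisible characters after the first word of text."""
    if not 0 <= recipient_id < 2 ** bits:
        raise ValueError("id does not fit in the chosen bit width")
    pattern = format(recipient_id, "0{}b".format(bits))
    mark = "".join(ONE if bit == "1" else ZERO for bit in pattern)
    head, sep, tail = text.partition(" ")
    return head + mark + sep + tail


def extract_id(text):
    """Recover the id from a marked text, or return None if no mark is present."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    return int(bits, 2) if bits else None
```

For example, extract_id(embed_id("Quarterly report draft", 1234)) gives back 1234, while the marked text looks identical to the original when displayed.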

How to put files inside files

MS Word's .docx files contain a bunch of .xml files.
Setup.exe files spit out hundreds of files that a program uses.
Zips, rars etc also hold lots of compressed stuff.
So how are they made? What does MS Word or another program that produces these files have to do to put files inside files?
When I looked this up I just got a bunch of results about compression, but let's say I wanted to make a program that 'wraps' files inside a file without making the final result any smaller. What would I even have to write?
I'm not asking/expecting any source code that does this, I just need a pointer. Is there something you think I'm misunderstanding based on what I've asked here?
Even a simple link to an article or some documentation would be greatly appreciated.
Ok, I'll just come up with some headers for ordinary files and write them along with the bytes of the actual files into one custom-defined file. You guys were very helpful, thank you!
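The plan described above (made-up headers followed by the raw bytes of each file) could look roughly like the following Python sketch. The WRAP marker, the record layout and the function names are invented for this illustration and do not correspond to any existing format.

```python
import struct

MAGIC = b"WRAP"  # hypothetical 4-byte marker identifying this made-up format


def pack_files(files):
    """Serialise {name: bytes} into one blob: magic, count, then per-file records."""
    out = [MAGIC, struct.pack("<I", len(files))]
    for name, data in files.items():
        encoded = name.encode("utf-8")
        # Record header: name length (2 bytes) and data length (8 bytes), little-endian.
        out.append(struct.pack("<HQ", len(encoded), len(data)))
        out.append(encoded)
        out.append(data)
    return b"".join(out)


def unpack_files(blob):
    """Inverse of pack_files: walk the records and rebuild the {name: bytes} dict."""
    if blob[:4] != MAGIC:
        raise ValueError("not a WRAP container")
    (count,) = struct.unpack_from("<I", blob, 4)
    offset, files = 8, {}
    for _ in range(count):
        name_len, data_len = struct.unpack_from("<HQ", blob, offset)
        offset += struct.calcsize("<HQ")
        name = blob[offset:offset + name_len].decode("utf-8")
        offset += name_len
        files[name] = blob[offset:offset + data_len]
        offset += data_len
    return files
```

Nothing is compressed here: the container is slightly larger than the sum of its files because of the headers, which is exactly the "wrap without shrinking" behaviour asked about.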
Historically, Windows had a number of technologies to support solutions like this. These were often called Compound Files or Structured Storage. However, the newer Office documents do not use these technologies: the Office Open XML formats (.docx and friends) are ZIP archives with a different extension. If you rename a file with a .docx extension to .zip and open it with your favorite compression tool, you'll see a bunch of folders and XML files.
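Since ZIP is a container format in which compression is optional, it can also cover the "wrap without shrinking" case. Here is a minimal Python sketch using the standard zipfile module with ZIP_STORED (no compression); the helper names are made up for the example.

```python
import io
import zipfile


def wrap_without_compression(files):
    """Return the bytes of a ZIP archive that stores {name: bytes} uncompressed."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def list_members(archive_bytes):
    """List the member names of any ZIP-based file (the bytes of a .docx work too)."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.namelist()
```

Because the members are stored rather than compressed, their bytes appear unchanged inside the archive, surrounded by the ZIP headers and the central directory.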
Here are some links to descriptions of different file formats that create "files within files"
Zip file format
Compound File Binary Format (CFBF)
Structured Storage
Compound Document File Format
Office Open XML I: Exploring the Office Open XML Formats
At least on POSIX systems (e.g. Linux), a file is only a stream (i.e. a sequence) of bytes. And you can only grow (or shrink, i.e. truncate) it at the end - there is no way to insert bytes in the middle (without copying the rest).
You need some conventions, and some additional software, to handle it otherwise.
You might be interested in SQLite, which gives you a library to treat a single file (e.g. a *.sqlite file) as an SQL database.
You could also use GDBM - a library giving you some indexed file abstraction.
libtar is a library to manipulate tar archives. See also tardy, a tar file postprocessor.
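As one concrete take on the SQLite suggestion, the Python sketch below keeps several "files" as BLOB rows inside a single database file, using the standard sqlite3 module. The table layout and the function names are assumptions made for this illustration.

```python
import sqlite3
from contextlib import closing


def store_files(db_path, files):
    """Create (if needed) a table of named blobs and insert {name: bytes}."""
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, data BLOB)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO files VALUES (?, ?)", list(files.items())
            )


def load_file(db_path, name):
    """Return the stored bytes for name, or None if it is absent."""
    with closing(sqlite3.connect(db_path)) as conn:
        row = conn.execute(
            "SELECT data FROM files WHERE name = ?", (name,)
        ).fetchone()
    return row[0] if row else None
```

Compared with a hand-written container, this trades a larger dependency for the ability to replace or delete individual entries without rewriting the whole file yourself.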

Extract embedded PDF file without a full parse

I want to build a utility to extract embedded files from a PDF (see section 7.11.4 of the spec). However I want the utility to be "small" and not depend on a full PDF parsing framework. I'm wondering if the file format is such that a simple tool could scan through the document for some token or sequence, and from that know where to start extracting the embedded file(s).
Potential difficulties include the possibility that the token or sequence that you scan for could validly exist elsewhere in the document leading to spurious or corrupt document extraction.
I'm not that familiar with the PDF spec, and so I'm looking for:
- confirmation that this is possible
- a general approach that would work
There are at least two scenarios that are going to make your life difficult: encrypted files, and object streams (a compressed object that contains a collection of objects inside).
About the second item (object streams), some PDF generation tools will take most of the objects (dictionaries) inside a PDF file, put them inside a single object, and compress this single object (usually with deflate compression). This means that you cannot just skim through a PDF file looking for some particular token in order to extract the piece of information that you need while ignoring the rest. You will need to actually interpret the structure of PDF files, at least partially.
Note that the embedded files that you want to extract are very likely to be compressed as well, even if an object stream is not used.
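The effect of object streams on a naive token scan can be seen in a toy Python example. The bytes below are a heavily simplified stand-in and not a valid PDF (no xref table, no /Length entry, a single stream), and the helper names are invented; the point is only that a dictionary held inside a deflate-compressed stream does not show up in a raw byte search.

```python
import zlib

# Invented object dictionary of the kind a generator may pack into an object stream.
inner = b"<< /Type /EmbeddedFile /Length 11 >>"

# Simplified stand-in for a PDF: the dictionary only exists inside a
# compressed stream, so its tokens never appear in the raw bytes.
pdf_bytes = (
    b"%PDF-1.5\n1 0 obj\n<< /Type /ObjStm /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(inner)
    + b"\nendstream\nendobj\n"
)


def naive_scan(data, token=b"/EmbeddedFile"):
    """Report whether the token occurs literally in the raw file bytes."""
    return token in data


def scan_after_decoding(data, token=b"/EmbeddedFile"):
    """Decode the (single) stream first, then look for the token."""
    start = data.index(b"stream\n") + len(b"stream\n")
    end = data.index(b"\nendstream")
    return token in zlib.decompress(data[start:end])
```

Here naive_scan(pdf_bytes) misses the dictionary, while scan_after_decoding(pdf_bytes) finds it once the stream has been decompressed.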
Your program will need to be able to do at least the following:
- Processing xref tables
- Processing object streams
- Applying decoding/decompression filters to a data stream.
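The first item usually starts from the end of the file: a PDF ends with the keyword startxref, the byte offset of the most recent cross-reference section, and an %%EOF marker. The Python sketch below covers only that first step, under the assumption that the whole file is already in memory; the function names and the 1024-byte tail size are choices made for this example, and parsing the table or cross-reference stream itself is left out.

```python
import re


def find_startxref(pdf_bytes, tail_size=1024):
    """Return the byte offset recorded after the final 'startxref' keyword."""
    tail = pdf_bytes[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref entry found near the end of the file")
    return int(matches[-1])


def xref_kind(pdf_bytes):
    """Say whether the offset points at a classic xref table or something else."""
    offset = find_startxref(pdf_bytes)
    if pdf_bytes[offset:offset + 4] == b"xref":
        return "table"
    return "stream-or-other"  # e.g. a cross-reference stream, which needs decoding
```

When the result is not a classic table, the cross-reference data is itself a compressed stream, which is where the second and third items in the list come in.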
Once you are able to get all objects from the file, you could in theory go through all of them looking for dictionaries of type EmbeddedFile. This approach has the disadvantage that you might extract files that are no longer referenced from anywhere inside the document (because a user deleted them at some point in the file's history, for example).
Another approach could be to actually navigate through the structure of the file, looking for embedded files in the locations specified by the PDF spec. You can find embedded files in at least the following elements (this list is off the top of my head; there may be many more than these):
- Names dictionary
- Document outlines
- Page annotations

opening and writing to files in visual basic 2010

I am writing a simple application that can write to PDF, DOC, XLS and Access files. So far it can write to Word. I also want it to be able to navigate a hard disk and open these files using filters.
I was using this code to write to the files:
My.Computer.FileSystem.WriteAllText(SaveFileDialog1.FileName, TextBox1.Text, False)
How can I write to PDF and Access files, and also navigate to and open files using OpenFileDialog?
You should be using System.IO for writing to files. Read the documentation on StreamWriter. It is very straightforward. One of the constructors for StreamWriter accepts a string representation of the path to the file, and an overload adds a Boolean that controls whether to append. If you need a FileMode enumeration value, open a FileStream with it and pass that stream to the StreamWriter. Normally you will use FileMode.OpenOrCreate when writing to the file.
OpenFileDialog is also straightforward. Create an instance, show it, and read its FileName property to get a string representation of the selected path. Use the static File.Exists("path") to check that a valid path was returned, then use the given path to open the file with a StreamReader.
There is more than one way to skin a cat here, because System.IO provides the static File and Directory classes, and there are the corresponding FileInfo and DirectoryInfo classes, which must be instantiated.
The use of these classes is very straightforward so I'm not going to sit here and type you example code but that should get you started.
As far as creating PDF and XLS files goes, I am assuming that you already have raw bytes that are in the correct format for those file types? If not, I can't help you there offhand. As far as I am aware, there are no formatters in the .NET Framework that will convert ASCII or Unicode strings to a format that is acceptable for PDF or XLS. You are going to need either to dig into the specifics of those file formats or to find a third-party utility that will format your raw bytes or text into something that meets those specifications.
If you are receiving the PDF and XLS data already properly formatted as raw bytes, just use BinaryWriter to create the new file and write the raw array of bytes.