View/Edit File Headers

I'm conducting research on malformed archives, and one exploit (CVE-2012-1459) talks about manipulating the header information of the archive. I have a rough understanding of all of this and would like to clear it up.
Does every file have header information that contains byte length and other similar details?
If the above is incorrect, what is a header?
How is it possible to view/edit this information?
(Any information about headers is helpful. Information online is not clear.)

Many file formats have headers. If you want to look at the headers of a specific file, the best approach is to grab thorough documentation of the format, a hex editor, a calculator that converts between hex and decimal, and a notepad for sketching things out. Headers are often fairly involved and have many levels of indirection, and understanding exactly how they fit together can be daunting. Some files have many nested headers. Archives are an example: they typically have one header with information about the entire archive, plus one header per file inside, with details about that file's contents.
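As a concrete illustration, here is a minimal Python sketch that decodes the fixed-size local file header at the start of a ZIP archive with struct (the field layout comes from the ZIP specification; the file name example.zip is just a placeholder):

import struct

# Read the 30-byte fixed part of the first local file header of a ZIP archive.
with open("example.zip", "rb") as f:
    fixed = f.read(30)
    (sig, version, flags, method, mtime, mdate,
     crc, csize, usize, name_len, extra_len) = struct.unpack("<4s5H3I2H", fixed)
    if sig != b"PK\x03\x04":
        raise ValueError("does not start with a ZIP local file header")
    name = f.read(name_len).decode("utf-8", "replace")   # the file name follows the fixed header

print(f"{name}: method={method}, compressed={csize}, uncompressed={usize}, crc=0x{crc:08x}")

These are exactly the kinds of fields (compressed size, CRC, name length) that a malformed archive can lie about; a hex editor plus the format documentation lets you view and change the same bytes by hand.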

Embed identification in a file and resistance to detection

Say I'm distributing a file that I want to be secret, and I assign each person that I give the file a unique id.
How can I embed this id in the file so that I can determine who leaks my file?
Some file formats have a section in which I can put information without rendering the file corrupt, but that is easily detected by looking at the specific section, or easily removed by changing the information.
I would guess that any solution is identifiable by byte comparison, but I was wondering whether there exist solutions that embed the id in a part that, if changed, renders the file corrupt. (I would guess this would be file-format specific, but this question is to learn about techniques, so I'd gladly read about specific cases.)
Thanks!
For image files and Unicode text you may use steganography.
For audio files there are special watermarking algorithms that add noise not audible to humans.
You may use metadata to add watermarks, but it can easily be removed by the end user.
See what is currently possible in this SO question: Good library for Digital watermarking
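To make the steganography option concrete, here is a rough Python sketch using Pillow that hides a 32-bit recipient id in the least-significant bits of the red channel of a PNG (cover.png and the id value are placeholders; this is illustrative only and far weaker than a real watermarking scheme):

from PIL import Image

def embed_id(in_path, out_path, recipient_id, bits=32):
    # Overwrite the lowest bit of the red channel of the first `bits` pixels.
    img = Image.open(in_path).convert("RGB")
    pixels = list(img.getdata())
    for i in range(bits):
        r, g, b = pixels[i]
        pixels[i] = ((r & ~1) | ((recipient_id >> i) & 1), g, b)
    out = Image.new("RGB", img.size)
    out.putdata(pixels)
    out.save(out_path)

def extract_id(path, bits=32):
    pixels = list(Image.open(path).convert("RGB").getdata())
    return sum((pixels[i][0] & 1) << i for i in range(bits))

embed_id("cover.png", "tagged.png", recipient_id=1234)
print(extract_id("tagged.png"))   # -> 1234

Note that this only survives lossless formats such as PNG and is destroyed by re-encoding or resizing, which is exactly why robust watermarking algorithms exist for adversarial settings.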

How does including header files affect Cppcheck's analysis?

Please tell me the differences in Cppcheck's analysis with and without header files.
I am integrating Cppcheck's report with Sonar; will Sonar's dashboard show any differences?
After including header files, the analysis took about 5 days to complete, even though I used -j 4 and set max-configs to 2.
I am also confused that the LOC count decreased after including header files for analysis, and the reported numbers of functions and classes dropped as well.
Does Cppcheck report errors on header files? If yes, which rules are applied to them, and where can I find this information about the rules associated with header files?
Please help.
thanks,
Dinesh
I am a Cppcheck developer.
Whether you should include headers or not is not a technically trivial question. There are both benefits and drawbacks to including headers in the analysis: better type information is a good thing, while expanding macros might be a bad thing.
In case you wonder: the same checkers are used whether or not headers are included. It's just that the input data is not always better when all headers are included.
I certainly recommend that you don't include any standard headers (stdio, string, STL, etc.).
I personally don't include various system headers either. I would rather create a cfg file if I use a library; that gives Cppcheck better information about the library than the headers do.
I normally try to include the local headers in the project. Use -I to add the right paths for the project.
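If you are already driving Cppcheck from a script (for example to feed its XML report into Sonar), the advice above translates into something like this Python sketch; the src/include paths, the posix library and the report file name are placeholders, while -I, --library, --xml, --xml-version, -j and --max-configs are regular Cppcheck options:

import subprocess

# Run Cppcheck with local include paths and a library .cfg instead of system headers,
# then save the XML report (Cppcheck writes it to stderr) for import into Sonar.
cmd = [
    "cppcheck",
    "--enable=all",
    "--xml", "--xml-version=2",
    "--library=posix",        # .cfg knowledge about the library instead of its headers
    "-I", "include",          # local project headers
    "-j", "4",
    "--max-configs=2",
    "src",
]
result = subprocess.run(cmd, capture_output=True, text=True)
with open("cppcheck-report.xml", "w") as report:
    report.write(result.stderr)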

Extract embedded PDF file without a full parse

I want to build a utility to extract embedded files from a PDF (see section 7.11.4 of the spec). However I want the utility to be "small" and not depend on a full PDF parsing framework. I'm wondering if the file format is such that a simple tool could scan through the document for some token or sequence, and from that know where to start extracting the embedded file(s).
One potential difficulty is that the token or sequence you scan for could validly exist elsewhere in the document, leading to spurious or corrupt extraction.
I'm not that familiar with the PDF spec, so I'm looking for:
- confirmation that this is possible
- a general approach that would work
There are at least two scenarios that are going to make your life difficult: encrypted files, and object streams (a compressed object that contains a collection of objects inside).
About the second item (object streams): some PDF generation tools take most of the objects (dictionaries) inside a PDF file, put them inside a single object, and compress that single object (usually with deflate compression). This means that you cannot just skim through a PDF file looking for some particular token in order to extract a piece of information you need while ignoring the rest. You will need to actually interpret the structure of PDF files, at least partially.
Note that the embedded files you want to extract are very likely to be compressed as well, even if an object stream is not used.
Your program will need to be able to do at least the following:
- Processing xref tables
- Processing object streams
- Applying decoding/decompression filters to a data stream.
Once you are able to get all objects from the file, you could in theory go through all of them looking for dictionaries of type EmbeddedFile. This approach has the disadvantage that you might extract files that are not referenced from anywhere inside the document (for example, because a user deleted them at some point in the file's history).
Another approach is to actually navigate the structure of the file, looking for embedded files in the locations specified by the PDF spec. You can find embedded files in at least the following elements (this list is off the top of my head; there may well be more than these):
- Names dictionary
- Document outlines
- Page annotations
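To make the trade-off concrete, here is a deliberately naive Python sketch of the token-scanning approach the question asks about: it scans the raw bytes for /EmbeddedFile, grabs the following stream ... endstream span, and tries FlateDecode with zlib. It will miss anything inside object streams or encrypted files, ignores xref tables, indirect /Length values and other filters, and can be fooled by the token appearing elsewhere, which is exactly why a partial parser is recommended above (the file names are placeholders):

import zlib

data = open("input.pdf", "rb").read()
pos = count = 0
while True:
    hit = data.find(b"/EmbeddedFile", pos)
    if hit == -1:
        break
    start = data.find(b"stream", hit)             # stream data follows the dictionary
    end = data.find(b"endstream", start)
    if start == -1 or end == -1:
        break
    body = data[start + len(b"stream"):end].lstrip(b"\r\n")
    try:
        body = zlib.decompress(body)              # most embedded files use /FlateDecode
    except zlib.error:
        pass                                      # keep the raw bytes if it is not Flate
    with open(f"embedded_{count}.bin", "wb") as out:
        out.write(body)
    count += 1
    pos = end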

Displaying only the similarities between 20 or so files?

Say I've got a directory full of HTML pages. Their headers and footers are basically the same, but I want to be able to see only the portions of all the pages that are the same. I'd like to call it an n-way merge, but that isn't quite it; it's looking for just the similarities between the headers and the footers of all the files.
(And by header I don't mean just the <head> tag, but rather the portions of the page that are alike.)
Note: there are about 20 HTML files.
Is there a name for a tool that does this?
If you are looking for what they have in common, you need a clone detector. Such a tool finds the common code fragments across an arbitrarily large number of files and reports the commonality. A good one will discover commonality based on target language structure in spite of whitespace changes, etc.; that is, it isn't comparing lines but rather copy-and-pasted structures.
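If you only have about 20 files and want a quick look rather than a proper clone detector, a rough Python sketch with difflib gives the flavour: it treats the first file as a reference and keeps only the reference lines that appear, as part of matching blocks, in every other file (the pages/*.html pattern is a placeholder, and a real clone detector works on parsed structure rather than raw lines):

import difflib
import glob

files = sorted(glob.glob("pages/*.html"))
reference = open(files[0], encoding="utf-8").read().splitlines()
common = set(range(len(reference)))               # indices of reference lines still shared

for path in files[1:]:
    other = open(path, encoding="utf-8").read().splitlines()
    matcher = difflib.SequenceMatcher(None, reference, other, autojunk=False)
    covered = set()
    for block in matcher.get_matching_blocks():
        covered.update(range(block.a, block.a + block.size))
    common &= covered                             # drop lines this file does not share

for i in sorted(common):
    print(reference[i])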
I use ExamDiff

How to extract data from a PDF file while keeping track of its structure?

My objective is to extract the text and images from a PDF file while parsing its structure. The scope for parsing the structure is not exhaustive; I only need to be able to identify headings and paragraphs.
I have tried a few different things, but I did not get very far with any of them:
- Convert PDF to text. This does not work for me, as I lose the images and the structure of the document.
- Convert PDF to HTML. I found a few tools that helped me with this, and the best one so far is pdftohtml. The tool is really good presentation-wise, but I haven't been able to successfully parse the HTML.
- Convert PDF to XML. Same as above.
Does anyone have any suggestions on how to tackle this problem?
There is essentially no easy cut-and-paste solution, because PDF isn't really concerned with structure. There are many other answers on this site that will tell you things in much more detail, but this one should give you the main points:
If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?
If you want to do this in PDF itself (where you would have the majority of control over the process), you'll have to loop over all text on pages and identify headers by looking at their text properties (fonts used, size relative to the other text on the page, etc...).
On top of that you'll also have to identify paragraphs by looking at the positioning of text fragments, white space on the page, closeness of certain letters, words and lines... PDF by itself doesn't even have a concept for a "word", let alone "lines" or "paragraphs".
To complicate things even more, the way text is drawn on the page (and thus the order in which it appears in the PDF file itself) doesn't even have to be the proper reading order (or what we humans would consider proper reading order).
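As a rough illustration of "looking at text properties", here is a Python sketch using pdfminer.six (one of several libraries that expose per-character font sizes; document.pdf and the 1.2 size ratio are assumptions): it records the average font size of each text line and flags lines noticeably larger than the dominant size as candidate headings:

from collections import Counter
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

lines = []
for page in extract_pages("document.pdf"):
    for element in page:
        if not isinstance(element, LTTextContainer):
            continue
        for line in element:
            if not isinstance(line, LTTextLine):
                continue
            sizes = [ch.size for ch in line if isinstance(ch, LTChar)]
            if sizes:
                lines.append((sum(sizes) / len(sizes), line.get_text().strip()))

body_size = Counter(round(s) for s, _ in lines).most_common(1)[0][0]   # dominant font size
for size, text in lines:
    if text and size > body_size * 1.2:
        print(f"heading? ({size:.1f}pt) {text}")

Real documents need more cues than size alone (bold fonts, spacing, position on the page), but this is the general shape of the approach.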
Parsing a PDF for headings and their sub-contents is really very difficult (which doesn't mean it's impossible), as PDFs come in various formats. I recently came across a tool named GROBID which can help in this scenario. I know it's not perfect, but if we provide proper training it can accomplish our goals.
GROBID is available as open source on GitHub:
https://github.com/kermitt2/grobid
You may use the following approach with iTextSharp or other open source libraries:
- Read the PDF file with iTextSharp or a similar open source tool and collect all text objects into an array (or convert the PDF to HTML using a tool like pdftohtml and then parse the HTML)
- Sort all text objects by coordinates so you have them all together
- Then iterate through the objects and check the distance between them to decide whether two or more objects should be merged into one paragraph (a rough sketch of this step follows the list)
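The third step can be as simple as comparing the vertical gap between consecutive sorted fragments; here is a library-agnostic Python sketch (the fragments, coordinates, line height and gap factor are made-up illustrations):

def group_paragraphs(fragments, line_height=12.0, gap_factor=1.5):
    # fragments are (y, text) tuples from whatever extractor you used;
    # PDF y coordinates grow upward, so sort from the top of the page downward.
    paragraphs, current, prev_y = [], [], None
    for y, text in sorted(fragments, key=lambda f: -f[0]):
        if prev_y is not None and (prev_y - y) > line_height * gap_factor:
            paragraphs.append(" ".join(current))        # large gap: start a new paragraph
            current = []
        current.append(text)
        prev_y = y
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs

fragments = [(700, "Title line"), (660, "First paragraph, line one."),
             (648, "First paragraph, line two."), (600, "Second paragraph.")]
for p in group_paragraphs(fragments):
    print(p)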
Or you may use a commercial tool like ByteScout PDF Extractor SDK that is capable of doing exactly this:
- extract text and images along with analyzing the layout of the text
- export XML or CSV where text objects are merged or split into paragraphs inside a virtual layout grid
- access objects via a special API that makes it possible to address each object via its "virtual" row and column index, regardless of how it is stored inside the original PDF.
Disclaimer: I am affiliated with ByteScout
PDF files can be parsed with tabula-py or tabula-java.
I made a full tutorial on how to use tabula-py in this article. You can also use tabula in a web browser, as long as you have Java installed.
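For completeness, a minimal tabula-py sketch (tabula is aimed at tables specifically, it needs Java installed, and example.pdf is a placeholder):

import tabula

tables = tabula.read_pdf("example.pdf", pages="all")    # returns a list of DataFrames
for i, df in enumerate(tables):
    print(f"table {i}: {df.shape[0]} rows x {df.shape[1]} columns")
    df.to_csv(f"table_{i}.csv", index=False)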
Unless it is Marked Content, a PDF does not have structure. You have to 'guess' it, which is what the various tools are doing. There is a good blog post explaining the issues at http://blog.idrsolutions.com/2010/09/the-easy-way-to-discover-if-a-pdf-file-contains-structured-content/
As mentioned in the answers above, PDFs aren't very easy to parse. However, if you have certain additional information about the text that you want to parse, you can pull it off.
If your headings are positioned at specific parts of the page, you can parse the PDF file and sort the parsed output by coordinates.
If you have prior knowledge of the spacing between headings and paragraphs, you could also leverage this information to parse the file.
PDFBox is a PDF parsing tool that you can use for extracting text and images, on top of which you can define your custom parsing rules.
However, for parsing PDFs you need to have some prior knowledge of the general format of the PDF file. You can check out the following blog post, Document parsing, for more information.
Disclaimer: I was involved in writing the blog post.
iText API:
import com.itextpdf.text.pdf.PdfReader;    // iText 5.x package; older versions use com.lowagie.text.pdf
PdfReader pr = new PdfReader("C:\\test.pdf");   // the backslash must be escaped in a Java string
References:
PDFReader