Can I tell which software generated a PDF file? - pdf

Given a PDF file. Can I find out which software/libraries (e.g. PDFBox, Adobe Acrobat, iText...) where used to created/edit it?

The Adobe specification defines the Producer field (see 'Mac OS X 10.5.6 Quartz PDFContext' in screenshot nimeshjm's answer) as the name of the application that "converted from another format to PDF". In case of generating a PDF programmatically, the PDF isn't really converted so you will normally find the name of the generating SDK here.
The Creator field is related and is defined as the name of the application that created the document from which the PDF was converted. This is typically MS Word or so.
Note that this is all by convention. In practice, you cannot really rely on this and you may encounter for example empty Producer fields.

You can try opening the file in Adobe Acrobat Reader and look at the properties.
You can find this in: File -> Properties in Adobe Acrobat Reader after you open the pdf file.

You can probably get away without any PDF libraries for this type of operation. It won't be 100% reliable but I think you can probably assume 99% reliability.
So... write some code to open your PDF as a text stream and seaarch down for /Producer. You will find something like this:
69 0 obj
<<
/Creator (PDF+Forms 2.0)
/CreationDate (D:20010627111809)
/Title (Demo)
/Producer (Cardiff Software - TELEform 7.0)
/ModDate (D:20010627111810-05'00')
>>
Grab the bits between the parentheses and Bob's your uncle. Technically the text can be stored in other formats to but I think those will be pretty uncommon for this particular type of entry.
If you can't find anything here then look for the XMP data which is always guaranteed to be in clear text. It will look something like this,
39 0 obj
<</Subtype/XML/Length 15172/Type/Metadata>>stream
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.0-c320 44.293068, Sun Jul 08 2007 18:10:11">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xap="http://ns.adobe.com/xap/1.0/"
xmlns:xapGImg="http://ns.adobe.com/xap/1.0/g/img/"
xmlns:xapMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
dc:format="application/pdf"
xap:CreatorTool="Adobe Illustrator CS2"
xap:CreateDate="2006-05-04T15:53:27-07:00"
xap:ModifyDate="2006-05-04T15:53:27-07:00"
xap:MetadataDate="2006-05-04T15:53:27-07:00"
xapMM:DocumentID="uuid:61AC83CBC0DBDA11A32BC847EF128E34"
xapMM:InstanceID="uuid:cba15bf3-d7da-4a4e-a563-fc20d13e258a"
pdf:Producer="Adobe PDF library 7.77">
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">3.01 PDF components</rdf:li>
</rdf:Alt>
</dc:title>
...
The combination of these two is going to be practically always right. If you want 100% reliablity then by all means use a PDF library but for many purposes this should be sufficient.
My replies may feature concepts based around ABCpdf. It's what I work on. It's what I know. :-)

It is usually difficult to determine which software actually designed a PDF because most of Microsoft Office product can convert an edited file to PDF. By this I mean, opening a regular typed document, you have the option to save it as PDF. If you are familiar with Powerpoint slides, it can be easy to tell based on the design once the file is in PDF.
Where as on the other hand, Adobe Acrobat has the ability to create the file like those application forms we often download (from an embassy site, immigration site, etc).
Other software such as Adobe Photoshop, Illustrator, etc... can save files as PDF. Hope this help.

Related

Count PDF nodes using ghostscript?

I've already used ghostscript to check PDF files and now i need to identify a pdf with more than 1000 nodes. Is it possible to use ghostscript to count the number of nodes a PDF have?
My knowledge with ghostscript is basic and I have difficulty finding a solution in ruby (PDF reader 1.3) or using tools like imageMagick.
Edit:
I can not explain in a more technical way what kind of node I'm looking for. These nodes are equivalent to those found in the corel draw. Initially I thought it would not have equivalent in pdf however the pitstop plugin has the functionality to indentify nodes.
Example of identified nodes by PitStop Pro
Those aren't 'nodes', they are the start and end points of path segments. Ghostscript doesn't have a device to extract paths. It could do so easily enough (and recover the curve control points which don't appear to be displayed in your PitStop screen grab).
However, Ghostscript isn't an editing tool, so its not at all clear what you would intend to do with the information.
If you want this, then you are going to have to either parse the PDF yourself, or write a Ghostscript device to retrieve the information, or write a program for some other tool (eg MuPDF) to extract path information.

Get selected "PostScript" from PDF

I wasn't able to find anything on the internet and I get the feeling that what I want is not such a trivial thing. To make a long story short: I'd like to get my hands on the underlying code that describes the PDF document of a selected area from a .pdf file. I've been looking for libraries or open source readers but couldn't find anything useful yet.
Does there exist something that might be able to accomplish my needs here or anything that might be reused (like an open source reader) to get there a little faster and not having to write everything from scratch?
You can convert a whole PDF document to PostScript using pdftops, one of the utilities from the poppler PDF rendering library.
This utility enables you to convert individual pages, which is at least a start.
If you just want to extract bitmapped images, try pdfimages from the same package. This extraction can also be restricted to individual pages.
The poppler library was originally written for UNIX-like systems, but there are a couple of windows builds available.
The open source tool from iText called iText RUPS does what you want, showing you all the PDF commands for a particular PDF and allow you to visualize the structure and relationships.
http://sourceforge.net/projects/itextrups/

Create print-ready PDF/X (with bleedbox, trimbox, mediabox, etc) programatically?

I was wondering if it is possible to programaticaly create a PDF file with an acceptable quality for the production press, ideally using only open-source libraries.
Right now the process is like this:
-create texts and images
-merge them into a postscript file
-use Acrobat Distiller to convert it to PDF (Acrobat distiller helps you check all the parameters of the PDF)
-send the PDF to the press
What I want is something like:
-take all texts and pictures in this folder
-encode them into the press-ready PDF, something similar to what Distiller produces
-send them to the press
How would you do that?
Many thanks...
Are the Ghostscript's gsdll32.dll and gswin{32,64}.c.exe with their source code and the GPL3 enough (or too much) of Open Source? They ship as part of all recent releases (newest one currently: v8.71).
Ghostscript can create very good quality PDF. See here for the most recent documentation about its PDF/A and PDF/X support.
Note, that this documentation until very recently was a bit misleading: it missed hinting at the requirement to edit+adapt the referred PDFA_def.ps or PDFX_def.ps templates. If you followed the old documentation without editing the templates to specifically point to the ICC color profile you wanted to embed, your output would be valid PDF, but would not pass all checks testing for compliancy with the official PDF/A+PDF/X standards.
You can generate pdfs using f.e. TeXML and XeLaTeX (first one to make scripting easier -- TeX has lots of quirks in syntax).
I also tried OpenJade and its DocBook support, but the quality was lower. TeX seems to do typesetting much better.
Both ways are using standalone programs... which you can use in shell scripts or call using system facilities.
You didn't mention which version of Distiller you're using. Recent versions do have a setting that lets you generate (different verions of) PDF/X. See also the *.joboptions files which ship with Distiller.

Is there a reliable way to determine if a PDF was generated from a Powerpoint file?

Like the title says. Reason I ask is that we're converting PDFs to formatted ASCII text (using pdftotext) and only want to display the ones that look reasonably sane.
PPT files tend to have text over images, diagonal text and others things that don't translate to ASCII very well, so we'd like to filter them out if we can.
The creating application of a PDF is listed in its XMP metadata. You can see this quite easily in Acrobat 9 (and I believe earlier): go to File > Properties, click Additional Metadata..., then go to Advanced and it's listed under both XMP Core Properties and PDF Properties:
xmp:CreatorTool: Microsoft PowerPoint
pdf:Creator: Microsoft PowerPoint
I'm guessing you want to find this programatically, so you'll need to find a library to read this metadata that works with your language. Here is a list of some XMP tools.
Short answer:
No, I don't think so.
Long answer:
No, I don't think so, because there are may ways to convert a PowerPoint file to pdf, for example Adobe Acrobat and PDFCreator and many many others. It's up to the converters to embed specific information in the PDF file, even if you find a way to detect PowerPoint-source pdf from one convert, the same method may not work for another.
Even longer answer:
No, I don't think so, because of the reasons described in the "long answer". And I don't think detecting the source of the PDF is the best approach to the problem you are trying to solve. Not just PowerPoint produces overlapped text and images. I think it's much better to detect the actual layout of the PDF file. If there are overlay of image and text, then you do some filtering or pre-processing to cater for that.
Your reasoning is very arbitrary - there are surely plenty of PPT files without the features you describe, and plenty of PDF files with them, that were generated from another source.
In theory a better method would just be to detect when these "unwanted" situations occur. However, even though the PDF format is partly open (only for reading, apparently, so it's not truly an open format), extracting complex data like that would be incredibly difficult.
All PDFs can have this problem regardless of their source. Most desktop publishing suites are capable of outputting PDF and are often sold boasting their high quality and flashier PDF presentations ...
A "saner" method would be to use a PDF parser, ITextSharp, or pdfNet...etc, Using the library of your choice, find all image rectangles, and all text rectangles, SORT THE RECTANGLES, and then see if there is substantial overlap of text and image rects -- ignoring image to image overlaps. If so, reject the page and/or document.
That won't be perfect, but at least it's going to catch many PDFs that aren't sane, regardless of source. Other heuristics to add would include color analysis. (i.e. are the colors in the overlapping region sufficiently different to allow "sane" results?)
Best of luck to you
It might put its name in the creator or producer info, but I don't have a copy to check this theory with.
In general, it is not an easy task to programmatically determine (reliably) where a file came from or how it was generated based on its contents. After all, a file is just a collection of bits.
Unless you have a lot of resources to expend building the heuristics to determine whether a file looks "reasonably sane" according to your needs, I would consider this a task for human beings.
some converter from ppt to pdf preserve creator in comments at begin of pdf.
I think that PDF's generated from most applications seem to be the same. It may have some meta-data that you can read from the file...

Structure of a PDF file? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 2 years ago.
Improve this question
For a small project I have to parse pdf files and take a specific part of them (a simple chain of characters). I'd like to use python to do this and I've found several libraries that are capable of doing what I want in some ways.
But now after a few researches, I'm wondering what is the real structure of a pdf file, does anyone know if there is a spec or some explanations anywhere online? I've found a link on adobe but it seems that it's a dead link :(
Here is a link to Adobe's reference material
http://www.adobe.com/devnet/pdf/pdf_reference.html
You should know though that PDF is only about presentation, not structure. Parsing will not come easy.
I found the GNU Introduction to PDF to be helpful in understanding the structure. It includes an easily readable example PDF file that they describe in complete detail.
Other helpful links:
PDF Succinctly book is longer and has helpful pictures.
Introduction to the Insides of PDF is a presentation that isn't as in-depth but gives a quick overview and has lots of pictures.
When I first started working with PDF, I found the PDF reference very hard to navigate.
It might help you to know that the overview of the file structure is found in syntax, and what Adobe call the document structure is the object structure and not the file structure. That is also found in Syntax. The description of operators is hidden away in Appendix A - very useful for understanding what is happening in content streams. If you ever have the pain of working with colour spaces you will find that hidden in Graphics! Hopefully these pointers will help you find things more quickly than I did.
If you are using windows, pdftron CosEdit allows you to browse the object structure to understand it. There is a free demo available that allows you to examine the file but not save it.
Here's the raw reference of PDF 1.7, and here's an article describing the structure of a PDF file. If you use Vim, the pdftk plugin is a good way to explore the document in an ever-so-slightly less raw form, and the pdftk utility itself (and its GPL source) is a great way to tease documents apart.
I'm trying to do pretty much the same thing. The PDF reference is a very difficult document to read. This tutorial is a better start I think.
This may help shed a little light:
(from page 11 of PDF32000.book)
PDF syntax is best understood by considering it as four parts, as shown in Figure 1:
• Objects. A PDF document is a data structure composed from a small set of basic types of data objects.
Sub-clause 7.2, "Lexical Conventions," describes the character set used to write objects and other
syntactic elements. Sub-clause 7.3, "Objects," describes the syntax and essential properties of the objects.
Sub-clause 7.3.8, "Stream Objects," provides complete details of the most complex data type, the stream
object.
• File structure. The PDF file structure determines how objects are stored in a PDF file, how they are
accessed, and how they are updated. This structure is independent of the semantics of the objects. Sub-
clause 7.5, "File Structure," describes the file structure. Sub-clause 7.6, "Encryption," describes a file-level
mechanism for protecting a document’s contents from unauthorized access.
• Document structure. The PDF document structure specifies how the basic object types are used to
represent components of a PDF document: pages, fonts, annotations, and so forth. Sub-clause 7.7,
"Document Structure," describes the overall document structure; later clauses address the detailed
semantics of the components.
• Content streams. A PDF content stream contains a sequence of instructions describing the appearance of
a page or other graphical entity. These instructions, while also represented as objects, are conceptually
distinct from the objects that represent the document structure and are described separately. Sub-clause
7.8, "Content Streams and Resources," discusses PDF content streams and their associated resources.
Looks like navigating a PDF file will require a little more than a passing effort.
If You want to parse PDF using Python please have a look at PDFMINER. This is the best library to parse PDF files till date.
Didier have a tool to parse the PDF:
http://didierstevens.com/files/software/pdf-parser_V0_4_3.zip
or here:
http://blog.didierstevens.com/programs/pdf-tools/ which cataloged several related pdf-analysis tools.
Another tool is here:
http://mshahzadlatif.wordpress.com/2011/09/28/view-pdf-structure-using-adobe-acrobat-or-a-free-tool-called-pdfxplorer/
Extracting text from PDF is a hard problem because PDF has such a layout-oriented structure. You can see the docs and source code of my barely-successful attempt on CPAN (my implementation is in Perl). The PDF data structure is very cool and well designed, but it's easier to write than read.
One way to get some clues is to create a PDF file consisting of a blank page. I have CutePDF Writer on my computer, and made a blank Wordpad document of one page. Printed to a .pdf file, and then opened the .pdf file using Notepad.
Next, use a copy of this file and eliminate lines or blocks of text that might be of interest, then reload in Acrobat Reader. You'd be surprised at how little information is needed to make a working one-page PDF document.
I'm trying to make up a spreadsheet to create a PDF form from code.
You need the PDF Reference manual to start reading about the details and structure of PDF files. I suggest to start with version 1.7.
On windows I used a free tool PDF Analyzer to see the internal structure of PDF files.
This will help in your understanding when reading the reference manual.
(I'm affiliated with PDF Analyzer, no intention to promote)
To extract text from a PDF, try this on Linux, BSD, etc. machine or use Cygwin if on Windows:
pdfinfo -layout some_pdf_file.pdf
A plain text file named some_pdf_file.txt is created. The simpler the PDF file layout, the more straightforward the .txt file output will be.
Hexadecimal characters are frequently present in the .txt file output and will look strange in text editors. These hexadecimal characters usually represent curly single and double quotes, bullet points, hyphens, etc. in the PDF.
To see the context where the hexadecimal characters appear, run this grep command, and keep the original PDF handy to see what character the codes represent in the PDF:
grep -a --color=always "\\\\[0-9][0-9][0-9]" some_pdf_file.txt
This will provide a unique list of the different octal codes in the document:
grep -ao "\\\\[0-9][0-9][0-9]" some_pdf_file.txt|sort|uniq
To convert these hexadecimal characters to ASCII equivalents, a combination of grep, sed, and bc can be used, I'll post the procedure to do that soon.