What is the standard used by a "hybrid PDF file"?

I need to create an open and easily readable "PDF with source content" (also called a hybrid PDF) with a software tool like Prince or PDFreactor... This Document Foundation FAQ explains what a hybrid PDF is, but does not say which standard it uses:
PDF/A-3 of ISO 19005-3?
Other (simplest?) ISO 19005 feature?
Something of the ISO 32000? (something as embedded files?)
Detecting the standard of a known hybrid PDF file would be a good alternative approach... But some people say that it is impossible to detect the standard used in the PDF Hybrid file.

First of all, this Hybrid PDF appears not to be specified in an independent standard, i.e. there is no corresponding ISO/ETSI/ANSI/... standard for it.
That being said, it apparently is in particular a prominent feature of LibreOffice PDF exports:
(from the LibreOffice Writer FAQ on hybrid PDFs)
Inspecting such a file (e.g. this one) one sees that there are additional entries in the PDF trailer:
...
trailer
<</Size 128/Root 126 0 R
/Info 127 0 R
/ID [ <518EBB4C2FE2F6B638478335A7ED9CA4>
<518EBB4C2FE2F6B638478335A7ED9CA4> ]
/DocChecksum /7B00A6EE0349EB2EA1DFB5ECC5899A7C
/AdditionalStreams [/application#2Fvnd#2Eoasis#2Eopendocument#2Etext 66 0 R
]
>>
startxref
291605
%%EOF
and that referenced additional stream in object 66 indeed contains the source OpenOffice document.
Apparently applications supporting these hybrid PDF files inspect the value of that AdditionalStreams trailer entry, and if they know to handle the given document type (/application#2Fvnd#2Eoasis#2Eopendocument#2Etext here corresponds to application/vnd.oasis.opendocument.text), they provide a way to extract that embedded document and open it for editing.
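A minimal sketch of that lookup in Python, assuming the trailer is stored as plain text (as in the file above) rather than inside a cross-reference stream. The function names are made up for illustration; the code only decodes the #xx escapes in the name and reports which object number each entry points to. It does not extract the stream itself.

```python
import re
from typing import List, Tuple

def decode_pdf_name(name: bytes) -> str:
    """Decode a PDF name token, e.g. b'/a#2Fb' becomes 'a/b'."""
    body = name[1:] if name.startswith(b"/") else name
    # '#' followed by two hex digits stands for one byte of the name.
    decoded = re.sub(rb"#([0-9A-Fa-f]{2})",
                     lambda m: bytes([int(m.group(1), 16)]), body)
    return decoded.decode("latin-1")

def additional_streams(pdf_bytes: bytes) -> List[Tuple[str, int]]:
    """Return (document type, object number) pairs from the last
    /AdditionalStreams array found in the raw file bytes."""
    arrays = re.findall(rb"/AdditionalStreams\s*\[(.*?)\]", pdf_bytes, re.DOTALL)
    if not arrays:
        return []
    # Each entry is a name followed by an indirect reference "N G R".
    entries = re.findall(rb"(/[^\s/\[\]<>()]+)\s+(\d+)\s+\d+\s+R", arrays[-1])
    return [(decode_pdf_name(name), int(num)) for name, num in entries]
```

For the trailer shown above this yields the document type application/vnd.oasis.opendocument.text together with object number 66.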
Beware: Unless I overlooked some ISO standard, those extra entries are, strictly speaking, forbidden by the PDF specification ISO 32000: In the trailer there may only be entries with keys that are either defined for the trailer in ISO specifications or are second-class names. Neither AdditionalStreams nor DocChecksum is ISO specified or second class. Thus, strictly speaking, those hybrid PDFs are invalid PDFs.

Related

PDF incremental update for digital signatures: works with regular xref table, breaks with xref streams [duplicate]

Is it true that you cannot have both classic XRef tables and XRef streams in one PDF file?
I thought this was what is called a "hybrid PDF document"!
Any ideas?
Hybrid reference files are explained in ISO 32000-1 in section 7.5.8.4 Compatibility with Applications That Do Not Support Compressed Reference Streams.
it is possible to construct a file called a hybrid-reference file that is readable by readers designed only to support versions of PDF before PDF 1.5. Such a file contains objects referenced by standard cross-reference tables in addition to objects in object streams that are referenced by cross-reference streams.
PS: It is not allowed, though, to freely mix both styles. As Leonard Rosenthol (Adobe PDF Architect & Principal Scientist and member of relevant standardisation committees) puts it,
you can NOT "cross the streams" (to quote the classic movie phase).
If the original PDF uses classic xrefs, you need to use the same at append time. If it uses streams, you need to use streams.
(Adding XRef table to PDF w/ XRef streams? on the Adobe PDF Language and Specifications forum)
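To decide which style an incremental update has to use, one can look at what the last startxref offset points to. The following is only a rough sketch with a made-up function name; it assumes the offset is correct and does not try to repair damaged files.

```python
import re

def last_xref_style(pdf_bytes: bytes) -> str:
    """Return 'table', 'stream', or 'unknown' for the newest
    cross-reference section of the file."""
    # The last 'startxref' keyword is followed by the byte offset
    # of the most recent cross-reference section.
    offsets = re.findall(rb"startxref\s+(\d+)", pdf_bytes)
    if not offsets:
        return "unknown"
    start = int(offsets[-1])
    section = pdf_bytes[start:start + 32]
    if section.startswith(b"xref"):
        return "table"   # classic cross-reference table
    if re.match(rb"\d+\s+\d+\s+obj", section):
        return "stream"  # an indirect object, i.e. a cross-reference stream
    return "unknown"
```

Note that a hybrid-reference file is reported as 'table' here, because its main section is a classic table whose trailer additionally points to a cross-reference stream via the XRefStm entry.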

PDF File header sequence: Why '25 e2 e3 cf d3' bits stream used in many document?

I know that this comment tells a reader that the PDF contains binary data.
But why is it "25 e2 e3 cf d3" and not some random binary bytes? So many documents have exactly that sequence.
Is it just because so many tools use the same PDF library?
Refs:
PDF format. function of %-started sequence
comp.text.pdf>pdf format
Looking through the PDFs I have here it looks like a number of PDF processors use these very letters "%âãÏÓ", among them Adobe products.
Not all of those processors use the same basic PDF library, so the use of the same letters cannot be explained by something like that.
Most likely it is due to the fact that Adobe software creates PDFs with that second line comment. For many years developers of other software used example files produced by Adobe software as templates for the PDFs they created.
Yes, the specification ISO 32000-1 merely requires
If a PDF file contains binary data, as most do (see 7.2, "Lexical Conventions"), the header line shall be immediately followed by a comment line containing at least four binary characters—that is, characters whose codes are 128 or greater.
(and the earlier PDF references also recommend the same), so there is no need to use the same binary characters.
But there also is no reason not to use them. Why deviate from the working example files produced by Adobe software in this regard?
Especially in the years before the ISO specification, when there only were the PDF references, one tended to be as Adobe-like as possible in the document structure created as the PDF references were not considered normative in nature by Adobe. Thus, if your document was valid by the references, Adobe viewers could still reject it without that counting as a bug...
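The requirement quoted above is easy to check by hand. Here is a small Python sketch (the function name is invented for this example) that returns the second-line comment if it contains at least four bytes with codes of 128 or greater, whichever bytes those are:

```python
from typing import Optional

def binary_marker(pdf_bytes: bytes) -> Optional[bytes]:
    """Return the comment line after the header if it contains at least
    four bytes with codes >= 128, otherwise None."""
    lines = pdf_bytes[:1024].splitlines()
    if len(lines) < 2 or not lines[0].startswith(b"%PDF-"):
        return None
    second = lines[1]
    if not second.startswith(b"%"):
        return None  # the second line is not a comment at all
    high_bytes = sum(1 for byte in second if byte >= 128)
    return second if high_bytes >= 4 else None
```

Decoding the result as Latin-1 shows the familiar letters for files that use the Adobe-style bytes 25 e2 e3 cf d3, but any other four high bytes pass the check just as well.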

Can a PDF-1.3 be a PDF/A?

I have a PDF document which is supposed to be PDF/A conformant, but the metadata states that it is a PDF-1.3 document. Can a PDF-1.3 document conform to the rules of PDF/A?
Note that the first version of PDF/A is based on PDF-1.4 - hence my confusion.
Thanks in advance!
The PDF/A-1 specification (ISO 19005 part 1) states
5.1 General
This part of ISO 19005 defines a file format for representing electronic documents known as “PDF/A-1.”
Conforming PDF/A-1 files shall adhere to all requirements of PDF Reference as modified by this part of
ISO 19005.
"PDF Reference" previously is defined as
4 Notation
...
For the purposes of this part of ISO 19005, references to the “PDF Reference” are to PDF Reference: Adobe
Portable Document Format, version 1.4, 3rd ed., as amended by Errata for PDF Reference, 3rd ed. [...]
Section 5.1 continues:
Neither the version number in the header of a PDF file nor the value of the Version key in the document catalog dictionary shall be used in determining whether a file is
in accordance with this part of ISO 19005.
As these are the only metadata that can state that it is a PDF-1.3 document, this statement of version MUST NOT be used in determining whether a file is PDF/A-1.
Thus, concerning your question:
stijndg> Can a PDF-1.3 document be conform with the rules of PDF/A?
Yes, it can.
It merely has to
adhere to all requirements of PDF Reference, version 1.4 and
adhere to the requirements of ISO 19005 part 1 of Level A conformance or Level B conformance.
Furthermore Section 5.1 recommends:
Features described in PDF specifications prior to Version 1.4 which are not explicitly
described in PDF Reference should not be used.
But "should" indicates that this is a recommendation, so if for some reason the use of such features cannot be prevented, this does not keep a PDF from being PDF/A conformant.
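To see that the two pieces of information are independent, one can read them separately. The sketch below is not from the quoted specification text: it assumes that the file declares its PDF/A claim in the XMP metadata through the pdfaid:part and pdfaid:conformance properties, and that the XMP packet is stored uncompressed. The function names are made up, and a found claim is only a claim; actual conformance needs a validator.

```python
import re
from typing import Optional, Tuple

def header_version(pdf_bytes: bytes) -> Optional[str]:
    """Return the version from the '%PDF-x.y' header, e.g. '1.3'."""
    m = re.search(rb"%PDF-(\d\.\d)", pdf_bytes[:1024])
    return m.group(1).decode("ascii") if m else None

def pdfa_claim(pdf_bytes: bytes) -> Optional[Tuple[str, str]]:
    """Return (part, conformance level) as declared in the XMP metadata,
    or None if no pdfaid:part property is found."""
    # XMP allows a property as an attribute (name="v") or an element (<name>v</name>).
    part = re.search(rb"pdfaid:part(?:>|=[\"'])\s*(\d)", pdf_bytes)
    if not part:
        return None
    level = re.search(rb"pdfaid:conformance(?:>|=[\"'])\s*([A-Za-z])", pdf_bytes)
    return (part.group(1).decode("ascii"),
            level.group(1).decode("ascii") if level else "")
```

A file can therefore report header version 1.3 and still carry a PDF/A-1 claim; the first value simply plays no role in the second.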

Can I tell which software generated a PDF file?

Given a PDF file, can I find out which software/libraries (e.g. PDFBox, Adobe Acrobat, iText...) were used to create/edit it?
The Adobe specification defines the Producer field (see 'Mac OS X 10.5.6 Quartz PDFContext' in the screenshot in nimeshjm's answer) as the name of the application that "converted from another format to PDF". When a PDF is generated programmatically, it isn't really converted, so you will normally find the name of the generating SDK here.
The Creator field is related and is defined as the name of the application that created the document from which the PDF was converted. This is typically MS Word or so.
Note that this is all by convention. In practice, you cannot really rely on this and you may encounter for example empty Producer fields.
You can try opening the file in Adobe Acrobat Reader and look at the properties.
You can find this in: File -> Properties in Adobe Acrobat Reader after you open the pdf file.
You can probably get away without any PDF libraries for this type of operation. It won't be 100% reliable but I think you can probably assume 99% reliability.
So... write some code to open your PDF as a text stream and search down for /Producer. You will find something like this:
69 0 obj
<<
/Creator (PDF+Forms 2.0)
/CreationDate (D:20010627111809)
/Title (Demo)
/Producer (Cardiff Software - TELEform 7.0)
/ModDate (D:20010627111810-05'00')
>>
Grab the bits between the parentheses and Bob's your uncle. Technically the text can be stored in other formats too, but I think those will be pretty uncommon for this particular type of entry.
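A rough Python version of that idea, with an invented function name. It assumes the entry is stored as a literal string in parentheses in an uncompressed part of the file; hex strings, strings with unescaped nested parentheses, and dictionaries inside object streams are simply not found.

```python
import re
from typing import Optional

def info_string(pdf_bytes: bytes, key: str) -> Optional[str]:
    """Return the first literal-string value of /key (e.g. 'Producer')
    found in the raw bytes, or None."""
    pattern = (rb"/" + re.escape(key.encode("ascii"))
               + rb"\s*\(((?:\\.|[^\\()])*)\)")
    m = re.search(pattern, pdf_bytes, re.DOTALL)
    if not m:
        return None
    # Undo the backslash escapes for parentheses and the backslash itself.
    raw = re.sub(rb"\\([()\\])", rb"\1", m.group(1))
    return raw.decode("latin-1")
```

Calling info_string(data, "Producer") and info_string(data, "Creator") on the object above gives the two values between the parentheses.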
If you can't find anything here then look for the XMP data, which is normally stored in clear text. It will look something like this:
39 0 obj
<</Subtype/XML/Length 15172/Type/Metadata>>stream
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.0-c320 44.293068, Sun Jul 08 2007 18:10:11">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xap="http://ns.adobe.com/xap/1.0/"
xmlns:xapGImg="http://ns.adobe.com/xap/1.0/g/img/"
xmlns:xapMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
dc:format="application/pdf"
xap:CreatorTool="Adobe Illustrator CS2"
xap:CreateDate="2006-05-04T15:53:27-07:00"
xap:ModifyDate="2006-05-04T15:53:27-07:00"
xap:MetadataDate="2006-05-04T15:53:27-07:00"
xapMM:DocumentID="uuid:61AC83CBC0DBDA11A32BC847EF128E34"
xapMM:InstanceID="uuid:cba15bf3-d7da-4a4e-a563-fc20d13e258a"
pdf:Producer="Adobe PDF library 7.77">
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">3.01 PDF components</rdf:li>
</rdf:Alt>
</dc:title>
...
The combination of these two is going to be practically always right. If you want 100% reliability then by all means use a PDF library, but for many purposes this should be sufficient.
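The XMP lookup can be sketched the same way in Python. The function name is made up, and the code assumes the usual namespace prefixes (pdf:, xap: or the newer xmp:) and an uncompressed, unencrypted metadata stream. Strictly speaking the prefixes are arbitrary in XML, so a real XML parser is the safer route.

```python
import re
from typing import Dict

def xmp_tool_fields(pdf_bytes: bytes) -> Dict[str, str]:
    """Return the Producer / CreatorTool properties found in the XMP packet,
    keyed by the prefixed property name."""
    found: Dict[str, str] = {}
    for prop in (b"pdf:Producer", b"xap:CreatorTool", b"xmp:CreatorTool"):
        # A property may be an attribute (name="v") or an element (<name>v</name>).
        pattern = re.escape(prop) + rb"(?:\s*=\s*([\"'])(.*?)\1|>([^<]*)<)"
        m = re.search(pattern, pdf_bytes, re.DOTALL)
        if m:
            value = m.group(2) if m.group(2) is not None else m.group(3)
            found[prop.decode("ascii")] = value.decode("utf-8", errors="replace")
    return found
```

For the metadata stream shown above this finds pdf:Producer and xap:CreatorTool; an empty result means falling back to the /Producer scan or to a proper PDF library.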
My replies may feature concepts based around ABCpdf. It's what I work on. It's what I know. :-)
It is usually difficult to determine which software actually created a PDF, because most Microsoft Office products can convert an edited file to PDF. By this I mean that when you open a regular typed document, you have the option to save it as PDF. If you are familiar with PowerPoint slides, it can be easy to tell from the design once the file is in PDF.
On the other hand, Adobe Acrobat can create files like the application forms we often download (from an embassy site, an immigration site, etc.).
Other software such as Adobe Photoshop, Illustrator, etc. can also save files as PDF. Hope this helps.

What is a "packed PDF", and how can it be read?

I have been sent versions of "packed PDF" files where the top-level PDF contains child PDFs.
The top-level PDF acts primarily as a container. The packing is not always evident in Adobe reader (e.g. when pdftk is used to pack the link does not show). I can find little by Googling for this term nor in my 2012 book ("Whittington", "PDF Explained", O'Reilly).
Is this a standard part of PDF? If so I'd be grateful for pointers. And can PDFBox analyze it?
Concerning your question whether using PDF as a container file format is a standard part of PDF:
Yes, it is. ISO 32000-1:2008 describes it in section 7.11.4 Embedded file streams.
Most prominent are files associated to some document page, see 12.5.6.15, File Attachment Annotations, and those associated with the document as a whole through the EmbeddedFiles entry (PDF 1.4) in the PDF document’s name dictionary (see 7.7.4, Name Dictionary).
@JesseGood's link to PDF File Specification on the PDFBox site explains how to deal with the latter ones.
I'm not very knowledgeable concerning PDFBox and, therefore, don't know whether it allows easy access to the other kind of attachments, too. If it does not, you will essentially have to iterate the annotations of all pages to find the file attachment annotations and handle the contents according to the PDF specification.
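Before reaching for a library, a quick scan of the raw bytes can at least hint at which kind of attachment a file uses. This is only a heuristic sketch in Python with an invented function name: it counts the relevant keys in uncompressed parts of the file, so anything stored inside compressed object streams is missed, and the Type entry on embedded file streams is optional.

```python
import re
from typing import Dict

def attachment_hints(pdf_bytes: bytes) -> Dict[str, int]:
    """Count raw occurrences of the keys that mark embedded files."""
    return {
        # Document-level attachments: EmbeddedFiles entry of the name dictionary.
        "embedded_files_entries":
            len(re.findall(rb"/EmbeddedFiles\b", pdf_bytes)),
        # Page-level attachments: file attachment annotations.
        "file_attachment_annotations":
            len(re.findall(rb"/Subtype\s*/FileAttachment\b", pdf_bytes)),
        # Embedded file streams that carry the optional Type entry.
        "embedded_file_streams":
            len(re.findall(rb"/Type\s*/EmbeddedFile\b", pdf_bytes)),
    }
```

Non-zero counts for the annotation key but not for EmbeddedFiles suggest the page-by-page iteration described above is needed.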