My PDF contains "%PDF-1.3" in its header, which means the PDF version is 1.3. But Adobe Reader (XI) installed on my system shows the version as 1.5 under File > Properties.
What is right?
1.3 or 1.5?
I can get the PDF version as 1.3 by reading the PDF metadata in Java. How can I get the PDF version 1.5 through a Java program?
The version in the file header can be overridden later in the file, cf. the PDF specification:
Beginning with PDF 1.4, the Version entry in the document’s catalog dictionary (located via the Root entry in the file’s trailer, as described in 7.5.5, "File Trailer"), if present, shall be used instead of the version specified in the Header.
(section 7.5.2 File Header)
Thus,
What is right?
depends on the PDF contents. If you are not sure, please share your PDF for analysis.
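As a starting point, the header version can be read with nothing but the Java standard library. The following is a minimal sketch, not a full parser: the class name `PdfHeaderVersion` is made up for illustration, and looking only at the first 1024 bytes is an assumption about how tolerant a reader wants to be.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PdfHeaderVersion {
    // Matches the header comment, e.g. "%PDF-1.3", and captures "1.3".
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d\\.\\d)");

    /** Returns the version claimed by the header, or empty if none is found. */
    static Optional<String> read(byte[] pdf) {
        // Assumption: only the first 1024 bytes are inspected.
        int len = Math.min(pdf.length, 1024);
        // ISO-8859-1 maps each byte to exactly one char, so binary data cannot break decoding.
        String head = new String(pdf, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(head);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

For a file on disk this could be called as `PdfHeaderVersion.read(Files.readAllBytes(path))`. It only reports what the header claims; a Version entry in the catalog is not considered here.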
Concerning questions from comments...
(1) I don’t find anything like 1.5 when opening the PDF with Notepad, yet it still shows the version as 1.5. Would the version be in encoded form?
No, but it would be a name, not a number:
The value of this entry shall be a name object, not a number, and therefore shall be preceded by a SOLIDUS (2Fh) character (/) when written in the PDF file (for example, /1.4).
(Table 28 – Entries in the catalog dictionary)
So a search for "1.5" should find it. Unless, that is, compressed object streams (a PDF 1.5 feature) are used and the newest catalog has been put into such an object stream.
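Such a text search can also be done programmatically. The sketch below is a naive scan of the raw bytes for a `/Version` key followed by a name value; the class name `CatalogVersionScan` is hypothetical. It does not verify that the match sits in the catalog dictionary, and, as said above, it finds nothing if the catalog lives inside a compressed object stream.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CatalogVersionScan {
    // A name value such as "/Version /1.5" or "/Version/1.5".
    private static final Pattern VERSION_ENTRY = Pattern.compile("/Version\\s*/(\\d\\.\\d)");

    /** Returns every version name found after a /Version key, in file order. */
    static List<String> find(byte[] pdf) {
        // Decode byte-for-byte so that binary stream data does not break the scan.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        List<String> found = new ArrayList<>();
        Matcher m = VERSION_ENTRY.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }
}
```

In a file with incremental updates the last match would usually belong to the newest catalog, but a proper library lookup is more reliable than this heuristic.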
(2) Is there any PDF API available in Java to read such version entries?
You can read the entry using any library allowing access to its low level routines, e.g. iText, PDFBox, PDFClown, ...
(3) If yes, how?
In iText for a PdfReader reader:
reader.getCatalog().getAsName(PdfName.VERSION)
In PDFClown for a Document document:
document.getVersion()
while the original header version is retrieved from a File file using:
file.getVersion()
(PDFClown information proposed by Stefano Chizzolini)
(4) Would you please let me know what type of content I should check to detect pdf’s actual version?
Usually checking the header and the catalog should suffice.
Probably, though, some programs, when spotting the use of a PDF feature only present in later PDF specifications, return the smallest PDF specification version in which all used features are present. In that case you'd have to check all the reachable PDF content.
This would especially make sense for cross reference and object streams introduced in 1.5.
Also, if I edit the PDF header to version 1.6, it shows the version as 1.6. So it means Adobe doesn't simply display the value overridden by the Version entry in the document’s catalog dictionary; it takes the later version of the two.
That's correct, and it is also mentioned in the specification of the Version catalog entry:
The version of the PDF specification to which the document conforms (for example, 1.4) if later than the version specified in the file’s header (see 7.5.2, "File Header"). If the header specifies a later version, or if this entry is absent, the document shall conform to the version specified in the header.
(Table 28 – Entries in the catalog dictionary)
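The rule quoted from the specification boils down to "take the later of the two claims". A minimal sketch of that rule, assuming both versions are already available as strings like "1.4" (the class and method names are made up for illustration):

```java
final class EffectivePdfVersion {
    /**
     * Returns the version the document claims: the catalog version if it is
     * later than the header version, otherwise the header version.
     * catalogVersion may be null when the catalog has no Version entry.
     */
    static String effective(String headerVersion, String catalogVersion) {
        if (catalogVersion == null) {
            return headerVersion;
        }
        return compare(catalogVersion, headerVersion) > 0 ? catalogVersion : headerVersion;
    }

    /** Compares two "major.minor" strings numerically. */
    static int compare(String a, String b) {
        String[] pa = a.split("\\.");
        String[] pb = b.split("\\.");
        int major = Integer.compare(Integer.parseInt(pa[0]), Integer.parseInt(pb[0]));
        if (major != 0) {
            return major;
        }
        return Integer.compare(Integer.parseInt(pa[1]), Integer.parseInt(pb[1]));
    }
}
```

Note that this only covers what the file claims about itself; it does not look at which features the file actually uses.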
Concerning the provided screenshot
The OP provided a screenshot:
One can clearly see that the file in question is linearized (on the left side one can see the linearization parameter dictionary and on the right side this is confirmed by "Fast Web View: Yes"). Following the linearization parameter dictionary there are the cross references for the first page, and these cross references are provided as a cross reference stream, not a cross reference table.
Cross reference streams have been introduced in PDF 1.5, and PDFs using cross reference streams instead of cross reference tables cannot even be parsed according to the PDF 1.4 and 1.3 references.
I assume that Adobe Reader claims a version 1.5 because of this unparsability according to specifications before 1.5.
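If one wants to detect this situation without a PDF library, a rough heuristic is to look for the stream dictionaries that mark these PDF 1.5 structures. The sketch below assumes that searching the raw bytes for `/Type /XRef` and `/Type /ObjStm` is good enough; the class name `Pdf15FeatureScan` is hypothetical, and a coincidental match inside binary stream data is possible, so treat a hit as a hint rather than proof.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

final class Pdf15FeatureScan {
    // Dictionary entries of cross reference streams and object streams.
    private static final Pattern XREF_STREAM = Pattern.compile("/Type\\s*/XRef\\b");
    private static final Pattern OBJECT_STREAM = Pattern.compile("/Type\\s*/ObjStm\\b");

    static boolean usesXrefStream(byte[] pdf) {
        return XREF_STREAM.matcher(decode(pdf)).find();
    }

    static boolean usesObjectStream(byte[] pdf) {
        return OBJECT_STREAM.matcher(decode(pdf)).find();
    }

    // ISO-8859-1 keeps every byte as one char, so binary content is harmless here.
    private static String decode(byte[] pdf) {
        return new String(pdf, StandardCharsets.ISO_8859_1);
    }
}
```

A file for which either method returns true cannot be a plain PDF 1.3 or 1.4 file, whatever its header says.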
I think I would not be able to fetch 1.5 as the version from the PDF with any other API. Is that so?
I assume so, at least immediately; many libraries may hide such details (like whether cross reference streams or tables are used) from the user. As you have not provided the PDF in question, though, this is a mere assumption.
What solution should I provide to my customer? I work in the publishing domain, on an application developed in Java, and we have this validation check: the system must not allow PDF version 1.3 and before.
That requirement already is not well defined. What is a PDF version 1.3 and before?
Is it a PDF file which does claim to be 1.3 or before?
As a special case, what about PDFs claiming different versions? E.g. different entries in header and catalog, or different entries in different incremental updates. Is such a PDF 1.3 or before if one of the differing entries is 1.3 or before? Or only if all are 1.3 or before? Or does the newest catalog version entry need to be 1.3 or before?
Is it a PDF file which a chosen indicator program (e.g. Adobe Reader in a fixed version) recognizes as 1.3 or before?
Is it a PDF which is valid according to a PDF reference 1.3 or before?
Or is it a PDF which is not valid according to any PDF reference 1.4 and after?
The only thing easy to implement is the first variant (having decided on the special cases), but what customers from the publishing context most likely mean is something along the lines of the last variant.
We check the PDF version using PDF Tool Box-java jar, which gives the PDF version as 1.3, so the validation fails. The client argues that the PDF is fine, showing a screenshot of File > Properties after opening the PDF. Now, what should be the next step?
The next step? Get together with the customer and reach a common understanding of what a PDF version 1.3 and before means. And then reconsider if you still want to implement that. It might be a matter of some person-years.
Use Ghostscript to convert your file.
This is the Linux command for that:
gs -o tempPdfFilePath -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 pdfFilePath && mv tempPdfFilePath pdfFilePath
Note that you cannot read from and write to the same file, so you need a temp file name.
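Since the question is about a Java application, the same command could be driven from Java. This is only a sketch: it assumes a `gs` executable is on the PATH, the class name `GhostscriptConvert` is made up, and error handling is kept minimal.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

final class GhostscriptConvert {
    /** Builds the same argument list as the shell command above. */
    static List<String> command(Path temp, Path source) {
        return Arrays.asList(
                "gs",
                "-o", temp.toString(),
                "-sDEVICE=pdfwrite",
                "-dCompatibilityLevel=1.4",
                source.toString());
    }

    /** Converts into a temp file next to the source, then replaces the source. */
    static void convertInPlace(Path source) throws IOException, InterruptedException {
        Path dir = source.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "gs-", ".pdf");
        Process process = new ProcessBuilder(command(temp, source)).inheritIO().start();
        if (process.waitFor() != 0) {
            Files.deleteIfExists(temp);
            throw new IOException("gs exited with a non-zero status");
        }
        // Equivalent of the "mv" step: only runs when gs succeeded.
        Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Keep in mind that Ghostscript rewrites the whole document, so the result is a new file rather than the original with a different version label.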
Related
I have two PDF documents that were certified (signed and validated with the same mechanism based on iText 7), and when I use Adobe Reader DC to check their validity, only one has the green mark.
the good one:
https://1drv.ms/b/s!AkF6t4TavwMvgxWaidlUqvPvHH1r
the bad one:
https://1drv.ms/b/s!AkF6t4TavwMvgxQCMdGY61S1EvUh
Regards
David L
This is not an Adobe bug, it's a feature. (And an iText bug)
When Adobe performs the cryptographic validation, it will also perform additional checks to see if a signature was attacked or not. It analyses several suspects and if that analysis turns out negative, Adobe will show you an error message. This is Adobe misreporting the analysis and validity. However, there is a work around for these hidden requirements.
First off, iText was used in non-append mode to modify the document:
Unfortunately, in specific cases iText 7, when used in non-append mode, introduces changes that are disallowed by the specification. The issue is that iText introduces subsections. That is something the specification allows you to do, but this is explicitly disallowed for the first revision:
Section 7.5.4 Cross-Reference Table
[...] For a file that has never been incrementally updated, the cross-reference section shall contain only one subsection, whose object numbering begins at 0. [...]
Below you'll find the xref of the first revision after iText was used in non-append mode, every colored rectangle is a new subsection. To be compliant there should only be one rectangle.
This will be fixed in the upcoming 7.0.4 release, planned for end of July.
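To check a file for this problem yourself, you can count the subsections of its cross-reference table. The following is a rough sketch under several assumptions: the file uses a classic `xref` table (not a cross reference stream), the first `xref` line found in the file is the one of interest, and the table is well formed. The class name `XrefSubsections` is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XrefSubsections {
    // Subsection header: first object number and entry count, e.g. "0 6".
    private static final Pattern SUBSECTION = Pattern.compile("(\\d+) (\\d+)\\s*");

    /** Returns the first object number of each subsection in the first xref table. */
    static List<Integer> firstTableStarts(byte[] pdf) {
        String[] lines = new String(pdf, StandardCharsets.ISO_8859_1).split("\\r\\n|\\r|\\n");
        List<Integer> starts = new ArrayList<>();
        int i = 0;
        while (i < lines.length && !lines[i].trim().equals("xref")) {
            i++;
        }
        i++; // step past the "xref" keyword line
        while (i < lines.length) {
            Matcher m = SUBSECTION.matcher(lines[i]);
            if (!m.matches()) {
                break; // "trailer" or anything unexpected ends the table
            }
            starts.add(Integer.parseInt(m.group(1)));
            i += 1 + Integer.parseInt(m.group(2)); // skip the header and its entries
        }
        return starts;
    }

    /** The rule from section 7.5.4: exactly one subsection, numbering from 0. */
    static boolean validForFirstRevision(List<Integer> starts) {
        return starts.size() == 1 && starts.get(0) == 0;
    }
}
```

A result such as `[0, 3]` from `firstTableStarts` means two subsections, which is what the rule quoted above forbids for a file that has never been incrementally updated.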
Since multiple other tools validate these two documents without any issue... we may think that's an Adobe Reader bug.
In particular as Adobe Acrobat itself is torn:
I know that it tells a reader whether the PDF contains binary data or not.
But why "25 e2 e3 cf d3" and not random binary? So many documents have exactly that.
Is it just because so many use the same PDF library?
Refs:
PDF format. function of %-started sequence
comp.text.pdf>pdf format
Looking through the PDFs I have here it looks like a number of PDF processors use these very letters "%âãÏÓ", among them Adobe products.
Not all of those processors use the same basic PDF library, so the use of the same letters cannot be explained by something like that.
Most likely it is due to the fact that Adobe software creates PDFs with that second line comment. For many years developers of other software used example files produced by Adobe software as templates for the PDFs they created.
Yes, the specification ISO 32000-1 merely requires
If a PDF file contains binary data, as most do (see 7.2, "Lexical Conventions"), the header line shall be immediately followed by a comment line containing at least four binary characters—that is, characters whose codes are 128 or greater.
(and the earlier PDF references also recommend the same), so there is no need to use the same binary characters.
But there also is no reason not to use them. Why deviate from the working example files produced by Adobe software in this regard?
Especially in the years before the ISO specification, when there only were the PDF references, one tended to be as Adobe-like as possible in the document structure created as the PDF references were not considered normative in nature by Adobe. Thus, if your document was valid by the references, Adobe viewers could still reject it without that counting as a bug...
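The requirement itself is easy to check in code. A minimal sketch, assuming the comment line directly follows the header line (the class name `BinaryMarkerCheck` is made up for illustration):

```java
final class BinaryMarkerCheck {
    /**
     * Returns true if the line after the header is a comment containing
     * at least four bytes with values of 128 or greater.
     */
    static boolean hasBinaryComment(byte[] pdf) {
        int i = 0;
        // Skip the header line itself.
        while (i < pdf.length && pdf[i] != '\n' && pdf[i] != '\r') {
            i++;
        }
        // Skip the end-of-line marker (CR, LF, or CR LF).
        while (i < pdf.length && (pdf[i] == '\n' || pdf[i] == '\r')) {
            i++;
        }
        if (i >= pdf.length || pdf[i] != '%') {
            return false; // the second line is not a comment
        }
        int high = 0;
        for (i++; i < pdf.length && pdf[i] != '\n' && pdf[i] != '\r'; i++) {
            // Java bytes are signed, so mask before comparing.
            if ((pdf[i] & 0xFF) >= 128) {
                high++;
            }
        }
        return high >= 4;
    }
}
```

The common "25 e2 e3 cf d3" line passes this check, and so would any other four bytes above 127, which is the point made above.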
We have a PDF which, when opened in Acrobat Reader, shows a version of 1.5, but when using PDFBox (version 1.8.3) the version shows as 1.3.
The code that we are using:
`aDocument.getDocument().getVersion()`
where aDocument is an instance of PDDocument.
The PDFBox version we are using is 1.8.3.
Any help regarding this will be highly appreciated.
Hitesh Saliya already discussed that PDF in his question "Adobe showing incorrect PDF Version (of PDF) in Properties". In this answer it became apparent that
version 1.3 was correct if one only takes the version header into account (there are no Version catalog entries in the document to consider);
at least version 1.5 was correct if one also took into account that object streams, cross reference streams, layers, and transparency are used.
In a way, therefore, both PDFBox and Adobe Reader are correct.
Thus, one first has to decide what one considers the version of a PDF document to be.
Is it the version the PDF file claims to be?
As a special case, what about PDFs claiming different versions? E.g. different entries in header and catalog, or different entries in different incremental updates.
Is it the version a chosen indicator program (e.g. Adobe Reader in a fixed version) recognizes for the PDF?
Is it the smallest or the largest version for which the PDF is valid according to the respective PDF reference or specification?
Could even any version in that range be a correct answer (resulting not in the version but the versions of a document)?
Some mixture of the above, e.g. the maximum of the version claimed and the lowest version according to which the PDF is valid?
Seriously, though, one can hardly expect anything more than option 1 to be implemented in a general purpose PDF library.
I have been sent versions of "packed PDF" files where the top-level PDF contains child PDFs.
The top-level PDF acts primarily as a container. The packing is not always evident in Adobe reader (e.g. when pdftk is used to pack the link does not show). I can find little by Googling for this term nor in my 2012 book ("Whittington", "PDF Explained", O'Reilly).
Is this a standard part of PDF? If so I'd be grateful for pointers. And can PDFBox analyze it?
Concerning your question whether using PDF as a container file format is a standard part of PDF:
Yes, it is. ISO 32000-1:2008 describes it in section 7.11.4 Embedded file streams.
Most prominent are files associated to some document page, see 12.5.6.15, File Attachment Annotations, and those associated with the document as a whole through the EmbeddedFiles entry (PDF 1.4) in the PDF document’s name dictionary (see 7.7.4, Name Dictionary).
@JesseGood's link to PDF File Specification on the PDFBox site explains how to deal with the latter ones.
I'm not very knowledgeable concerning PDFBox and, therefore, don't know whether it allows easy access to the other kind of attachments, too. If it does not, you will essentially have to iterate the annotations of all pages to find the file attachment annotations and handle the contents according to the PDF specification.
I have a simple one page PDF document.
Using Adobe Acrobat X (10.1.4), I added 2 graphical annotations (Ink). So far so good.
Now I opened the document in Notepad++ to inspect it. Everything seemed fine. There was the annotations array, and both annotations. All good.
Then I randomly entered one space char " " in the xref table to make the document "invalid".
When I opened it in Adobe Acrobat X (Version 10.1.4), it was capable of displaying everything as before (apparently after automatically repairing the document) and then asked me whether I would like to save the new version to disk. I did.
Now I opened the document in Notepad++ again, just to find that it looks completely different from how it looked before I made the modifications.
The weirdest thing is that most of the objects just vanished from the document! There were still references to them, but the actual objects are not there.
In addition, there was a bunch of Flate-compressed data.
Is it possible that Adobe Acrobat not only compresses streams, but also whole objects including their "x y obj" and "endobj" tags?
As of PDF 1.5 object streams have been introduced to the PDF format, cf. section 7.5.7 of the current PDF specification ISO 32000-1:2008:
An object stream, is a stream object in which a sequence of indirect objects may be stored, as an alternative to their being stored at the outermost file level.
NOTE 1 Object streams are first introduced in PDF 1.5. The purpose of object streams is to allow indirect objects other than streams to be stored more compactly by using the facilities provided by stream compression filters.
By allowing Adobe Acrobat to save the repaired version of your document, you implicitly allowed it to do that in its preferred format, which for compactness uses object streams.