PDF version information not correct using PDFBox

We have a PDF that shows version 1.5 when opened in Acrobat Reader, but PDFBox (version 1.8.3) reports version 1.3.
The code we are using:
`aDocument.getDocument().getVersion()`
where aDocument is an instance of PDDocument.
Any help with this would be highly appreciated.

Hitesh Saliya already discussed that PDF in his question "Adobe showing incorrect PDF Version (of PDF) in Properties". In this answer it became apparent that
version 1.3 was correct if one only takes the version header into account (there are no Version catalog entries in the document to consider);
at least version 1.5 was correct if one also took into account that object streams, cross reference streams, layers, and transparency are used.
In a way, therefore, both PDFBox and Adobe Reader are correct.
Thus, one first has to decide what one considers the version of a PDF document to be.
1. Is it the version the PDF file claims to be?
   As a special case, what about PDFs claiming different versions? E.g. different entries in header and catalog, or different entries in different incremental updates.
2. Is it the version a chosen indicator program (e.g. Adobe Reader in a fixed version) recognizes for the PDF?
3. Is it the smallest / the largest version for which the PDF is valid according to the respective PDF reference/specification?
   Could even any version in that range be a correct answer (resulting not in the version but the versions of a document)?
4. Some mixture of the above, e.g. the maximum of the version claimed and the lowest version according to which the PDF is valid?
Seriously, though, one can hardly expect anything more than option 1 to be implemented in a general purpose PDF library.
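As a rough illustration of option 1, the claimed version can be read straight from the file header without any PDF library. The following is a minimal sketch in plain Java; the class name PdfHeaderVersion and the 1024-byte search window are assumptions made for this example, and the code deliberately ignores the catalog and any features the file actually uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Reads only the version a PDF claims in its header ("%PDF-x.y"). Hypothetical helper. */
final class PdfHeaderVersion {
    // Assumption: the header sits within the first 1024 bytes of the file.
    private static final int SEARCH_WINDOW = 1024;
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d)\\.(\\d)");

    /** Returns the claimed header version, e.g. "1.3", or empty if no header is found. */
    static Optional<String> fromBytes(byte[] fileStart) {
        int length = Math.min(fileStart.length, SEARCH_WINDOW);
        // ISO-8859-1 maps every byte to exactly one char, so binary data cannot break decoding.
        String text = new String(fileStart, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(text);
        return m.find() ? Optional.of(m.group(1) + "." + m.group(2)) : Optional.empty();
    }
}
```

It could be called with the result of Files.readAllBytes for the file in question. For the PDF discussed here it would be expected to report 1.3, matching what PDFBox returns.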

Related

Count PDF nodes using ghostscript?

I've already used Ghostscript to check PDF files, and now I need to identify PDFs with more than 1000 nodes. Is it possible to use Ghostscript to count the number of nodes a PDF has?
My knowledge of Ghostscript is basic, and I have had difficulty finding a solution in Ruby (PDF reader 1.3) or with tools like ImageMagick.
Edit:
I cannot explain in a more technical way what kind of node I'm looking for. These nodes are equivalent to those found in CorelDRAW. Initially I thought there would be no equivalent in PDF; however, the PitStop plugin has the functionality to identify nodes.
Example of identified nodes by PitStop Pro

Those aren't 'nodes', they are the start and end points of path segments. Ghostscript doesn't have a device to extract paths. It could do so easily enough (and recover the curve control points, which don't appear to be displayed in your PitStop screen grab).
However, Ghostscript isn't an editing tool, so it's not at all clear what you would intend to do with the information.
If you want this, then you are going to have to either parse the PDF yourself, or write a Ghostscript device to retrieve the information, or write a program for some other tool (e.g. MuPDF) to extract path information.
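If you go the route of parsing the PDF yourself, counting the points roughly comes down to counting path construction operators in the page content streams. The following is only a sketch in plain Java: the class name PathPointCounter is made up for this example, it assumes the content stream has already been decompressed, and it assumes the stream contains no string literals, inline images, or comments whose contents could be mistaken for operators. Whether its count matches what PitStop displays is also an assumption.

```java
/** Counts points added by path construction operators. Hypothetical helper. */
final class PathPointCounter {
    /**
     * Returns the number of anchor points added by path construction operators.
     * Assumption: the input is decoded operator text without string literals,
     * inline images, or comments, which a real parser would have to skip.
     */
    static int countAnchorPoints(String contentStream) {
        int points = 0;
        for (String token : contentStream.trim().split("\\s+")) {
            switch (token) {
                case "m": // moveto: start point of a new subpath
                case "l": // lineto: end point of a straight segment
                case "c": // curveto variants: end point of a Bezier segment
                case "v":
                case "y":
                    points += 1;
                    break;
                case "re": // rectangle: counted as its four corner points
                    points += 4;
                    break;
                default: // operands and all other operators are ignored
                    break;
            }
        }
        return points;
    }
}
```

Summing the result over all content streams of a document and comparing it to 1000 would give a first approximation of the check asked for. Control points of curves are not counted here.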

Why is one of these two iText 7 signed and validated documents not valid in Adobe Reader DC?

I have two certified PDF documents (signed and validated with the same mechanism based on iText 7), and when I use Adobe Reader DC to check their validity, only one gets the green mark.
the good one:
https://1drv.ms/b/s!AkF6t4TavwMvgxWaidlUqvPvHH1r
the bad one:
https://1drv.ms/b/s!AkF6t4TavwMvgxQCMdGY61S1EvUh
Regards
David L

This is not an Adobe bug, it's a feature. (And an iText bug)
When Adobe performs the cryptographic validation, it will also perform additional checks to see if a signature was attacked or not. It analyses several suspects and if that analysis turns out negative, Adobe will show you an error message. This is Adobe misreporting the analysis and validity. However, there is a workaround for these hidden requirements.
First off, iText was used in non-append mode to modify the document:
Unfortunately, in specific cases iText 7, when used in non-append mode, introduces changes that are disallowed by the specification. The issue is that iText introduces subsections. That is something the specification allows you to do, but this is explicitly disallowed for the first revision:
Section 7.5.4 Cross-Reference Table
[...] For a file that has never been incrementally updated, the cross-reference section shall contain only one subsection, whose object numbering begins at 0. [...]
Below you'll find the xref of the first revision after iText was used in non-append mode, every colored rectangle is a new subsection. To be compliant there should only be one rectangle.
This will be fixed in the upcoming 7.0.4 release, planned for end of July.
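To check a file for this condition yourself, you can count the subsection header lines in the cross-reference table of the first revision. A minimal sketch in plain Java follows; the class name XrefSubsectionCounter is hypothetical, and it assumes you have already extracted the text that starts at the xref keyword of a classic cross-reference table (not a cross-reference stream).

```java
/** Counts subsections of one classic cross-reference section. Hypothetical helper. */
final class XrefSubsectionCounter {
    /**
     * Counts subsection headers ("first count" lines) in the text of one
     * cross-reference section, reading from the "xref" keyword up to "trailer".
     */
    static int countSubsections(String xrefSection) {
        int subsections = 0;
        for (String line : xrefSection.split("\\r?\\n|\\r")) {
            String[] parts = line.trim().split("\\s+");
            if (parts[0].startsWith("trailer")) {
                break; // end of the cross-reference section
            }
            // Entries have three fields (offset, generation, n/f); headers have two integers.
            if (parts.length == 2 && parts[0].matches("\\d+") && parts[1].matches("\\d+")) {
                subsections++;
            }
        }
        return subsections;
    }
}
```

For a file that has never been incrementally updated, a result other than 1 would indicate the situation described above.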

Since multiple other tools validate these two documents without any issue, we may think that's an Adobe Reader bug.
In particular as Adobe Acrobat itself is torn:

PDF file header sequence: Why is the byte sequence '25 e2 e3 cf d3' used in so many documents?

I know that it tells a reader whether the PDF contains binary data or not.
But why "25 e2 e3 cf d3" and not random binary bytes? So many documents have exactly that sequence.
Is it just because so many tools use the same PDF library?
Refs:
PDF format. function of %-started sequence
comp.text.pdf>pdf format

Looking through the PDFs I have here it looks like a number of PDF processors use these very letters "%âãÏÓ", among them Adobe products.
Not all of those processors use the same basic PDF library, so the use of the same letters cannot be explained by something like that.
Most likely it is due to the fact that Adobe software creates PDFs with that second line comment. For many years developers of other software used example files produced by Adobe software as templates for the PDFs they created.
Yes, the specification ISO 32000-1 merely requires
If a PDF file contains binary data, as most do (see 7.2, "Lexical Conventions"), the header line shall be immediately followed by a comment line containing at least four binary characters—that is, characters whose codes are 128 or greater.
(and the earlier PDF references also recommend the same), so there is no need to use the same binary characters.
But there also is no reason not to use them. Why deviate from the working example files produced by Adobe software in this regard?
Especially in the years before the ISO specification, when there only were the PDF references, one tended to be as Adobe-like as possible in the document structure created as the PDF references were not considered normative in nature by Adobe. Thus, if your document was valid by the references, Adobe viewers could still reject it without that counting as a bug...
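The requirement itself is easy to test for, whichever bytes a producer chooses. Below is a minimal sketch in plain Java that checks whether the second line of a file is a comment with at least four bytes of value 128 or greater; the class name BinaryMarkerCheck is an assumption made for this example, and the code expects the header to be the very first line of the input.

```java
/** Checks for the binary comment line after the PDF header. Hypothetical helper. */
final class BinaryMarkerCheck {
    /**
     * Returns true if the line after the header line is a comment containing
     * at least four bytes with values of 128 or greater.
     */
    static boolean hasBinaryComment(byte[] fileStart) {
        int i = 0;
        // Skip the header line ("%PDF-x.y") up to its end-of-line marker.
        while (i < fileStart.length && fileStart[i] != '\n' && fileStart[i] != '\r') {
            i++;
        }
        // Skip the end-of-line marker itself (CR, LF, or CR LF).
        while (i < fileStart.length && (fileStart[i] == '\n' || fileStart[i] == '\r')) {
            i++;
        }
        if (i >= fileStart.length || fileStart[i] != '%') {
            return false; // the next line is not a comment
        }
        int highBytes = 0;
        for (i++; i < fileStart.length && fileStart[i] != '\n' && fileStart[i] != '\r'; i++) {
            if ((fileStart[i] & 0xFF) >= 128) {
                highBytes++;
            }
        }
        return highBytes >= 4;
    }
}
```

The sequence 25 e2 e3 cf d3 passes this check, but so would any other comment line with four such bytes.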

Adobe showing incorrect PDF Version (of PDF) in Properties

My PDF contains "%PDF-1.3" in its header, which means the PDF version is 1.3. But Adobe Reader (XI) installed on my system shows the version as 1.5 under File > Properties.
What is right?
1.3 or 1.5?
I can get the PDF version as 1.3 by reading the PDF metadata in Java. How can I get PDF version 1.5 through a Java program?

The version in the file header can be overridden later in the file, cf. the PDF specification:
Beginning with PDF 1.4, the Version entry in the document’s catalog dictionary (located via the Root entry in the file’s trailer, as described in 7.5.5, "File Trailer"), if present, shall be used instead of the version specified in the Header.
(section 7.5.2 File Header)
Thus, the answer to "What is right?" depends on the PDF contents. If you are not sure, please share your PDF for analysis.
Concerning questions from comments...
(1) I don't find anything like 1.5 when opening the PDF with Notepad, yet it still shows the version as 1.5. Would the version be in encoded form?
No, but it would be a name, not a number:
The value of this entry shall be a name object, not a number, and therefore shall be preceded by a SOLIDUS (2Fh) character (/) when written in the PDF file (for example, /1.4).
(Table 28 – Entries in the catalog dictionary)
So a search for "1.5" should find it. Unless, that is, compressed object streams (a PDF 1.5 feature) are used and the newest catalog has been put into such an object stream.
(2) Is there any pdf-api available in java to read such version entries.
You can read the entry using any library allowing access to its low level routines, e.g. iText, PDFBox, PDFClown, ...
(3) If Yes, how to ?
In iText for a PdfReader reader:
`reader.getCatalog().getAsName(PdfName.VERSION)`
In PDFClown for a Document document:
`document.getVersion()`
while the original header version is retrieved from a File file using:
`file.getVersion()`
(PDFClown information proposed by Stefano Chizzolini)
(4) Would you please let me know what type of content I should check to detect pdf’s actual version?
Usually checking the header and the catalog should suffice.
Probably, though, some programs, when spotting the use of a PDF feature only present in later PDF specifications, return the smallest PDF specification version in which all used features are present. In that case you'd have to check all the reachable PDF content.
This would especially make sense for cross reference and object streams introduced in 1.5.
Also, if I edit the PDF header to version 1.6, it shows the version as 1.6. So it means Adobe doesn't simply display the property overridden by the Version entry in the document's catalog dictionary; it takes the later version of the two.
That's correct, and it is also mentioned in the specification of the Version catalog entry:
The version of the PDF specification to which the document conforms (for example, 1.4) if later than the version specified in the file’s header (see 7.5.2, "File Header"). If the header specifies a later version, or if this entry is absent, the document shall conform to the version specified in the header.
(Table 28 – Entries in the catalog dictionary)
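The rule quoted above can be written down in a few lines once both values have been read. This is a minimal sketch in plain Java, not the API of any PDF library; the class name EffectivePdfVersion and its method names are assumptions made for this example, and the catalog value is passed as null when the catalog has no Version entry.

```java
/** Applies the "later of header and catalog" rule. Hypothetical helper. */
final class EffectivePdfVersion {
    /**
     * Returns the later of the header version and the catalog Version entry.
     * catalogVersion may be null if the entry is absent; a leading "/"
     * (name object syntax, e.g. "/1.5") is tolerated.
     */
    static String effectiveVersion(String headerVersion, String catalogVersion) {
        if (catalogVersion == null) {
            return headerVersion;
        }
        String catalog = catalogVersion.startsWith("/") ? catalogVersion.substring(1) : catalogVersion;
        return compare(catalog, headerVersion) > 0 ? catalog : headerVersion;
    }

    /** Compares "major.minor" strings numerically rather than as text. */
    static int compare(String a, String b) {
        String[] pa = a.split("\\.");
        String[] pb = b.split("\\.");
        int major = Integer.compare(Integer.parseInt(pa[0]), Integer.parseInt(pb[0]));
        return major != 0 ? major : Integer.compare(Integer.parseInt(pa[1]), Integer.parseInt(pb[1]));
    }
}
```

Note that this only covers the versions the file claims. It says nothing about which features the file actually uses.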
Concerning the provided screenshot
The OP provided a screenshot:
One can clearly see that the file in question is linearized (on the left side one can see the linearization parameter dictionary and on the right side this is confirmed by "Fast Web View: Yes"). Following the linearization parameter dictionary there are the cross references for the first page, and these cross references are provided as a cross reference stream, not a cross reference table.
Cross reference streams have been introduced in PDF 1.5, and PDFs using cross reference streams instead of cross reference tables cannot even be parsed according to the PDF 1.4 and 1.3 references.
I assume that Adobe Reader claims a version 1.5 because of this unparsability according to specifications before 1.5.
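Whether a file relies on a cross reference stream can be checked roughly by looking at what the last startxref offset points to. The following plain Java sketch illustrates the idea; the class name XrefKind and the returned labels are made up for this example, it reads the whole file into memory, and it does not handle hybrid-reference files or damaged offsets.

```java
import java.nio.charset.StandardCharsets;

/** Inspects the target of the last startxref offset. Hypothetical helper. */
final class XrefKind {
    /**
     * Returns "table" if the last startxref offset points at the "xref" keyword,
     * "stream" if it points at an indirect object ("n g obj"), otherwise "unknown".
     */
    static String lastXrefKind(byte[] pdf) {
        // ISO-8859-1 keeps one char per byte, so byte offsets remain valid as string indexes.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int marker = text.lastIndexOf("startxref");
        if (marker < 0) {
            return "unknown";
        }
        String[] tail = text.substring(marker + "startxref".length()).trim().split("\\s+");
        int offset;
        try {
            offset = Integer.parseInt(tail[0]);
        } catch (NumberFormatException e) {
            return "unknown";
        }
        if (offset < 0 || offset >= text.length()) {
            return "unknown";
        }
        String target = text.substring(offset, Math.min(text.length(), offset + 32));
        if (target.startsWith("xref")) {
            return "table";
        }
        if (target.matches("(?s)\\d+\\s+\\d+\\s+obj.*")) {
            return "stream";
        }
        return "unknown";
    }
}
```

A result of "stream" would mean the file cannot be parsed according to the references before 1.5, regardless of what its header claims.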
I think I would not be able to fetch 1.5 as the version from the PDF with another API. Is that so?
I assume so, at least immediately; many libraries may hide such details (like whether cross reference streams or tables are used) from the user. As you have not provided the PDF in question, though, this is a mere assumption.
What solution should I provide to my customer? I work in the publishing domain. In an application developed in Java, we have this validation check: the system must not allow PDF version 1.3 and before.
That requirement already is not well defined. What is a PDF version 1.3 and before?
Is it a PDF file which does claim to be 1.3 or before?
As a special case, what about PDFs claiming different versions? E.g. different entries in header and catalog, or different entries in different incremental updates. Is such a PDF 1.3 or before if one of the differing entries is 1.3 or before? Or only if all are 1.3 or before? Or does the newest catalog version entry need to be 1.3 or before?
Is it a PDF file which a chosen indicator program (e.g. Adobe Reader in a fixed version) recognizes as 1.3 or before?
Is it a PDF which is valid according to a PDF reference 1.3 or before?
Or is it a PDF which is not valid according to any PDF reference 1.4 and after?
The only thing easy to implement is the first variant (having decided on the special cases), but what customers from the publishing context most likely mean is something along the lines of the last variant.
We check the PDF version using the PDF Tool Box Java jar, which gives the PDF version as 1.3, so validation fails. The client objects that the PDF is correct, showing a screenshot of File > Properties from the opened PDF. Now, what should be the next step?
The next step? Get together with the customer and reach a common understanding of what a PDF version 1.3 and before means. And then reconsider if you still want to implement that. It might be a matter of some person-years.

Use Ghostscript to convert your file.
This is the Linux command for that:
`gs -o tempPdfFilePath -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 pdfFilePath && mv tempPdfFilePath pdfFilePath`
Note that you cannot read from and write to the same file, so you need a temporary file name.
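Since the application in question is written in Java, the same command could be started from there. The following is a sketch under several assumptions: the class name GhostscriptConvert and its method names are made up for this example, the gs executable is assumed to be on the PATH, and error handling is reduced to checking the exit status.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

/** Runs the Ghostscript conversion from Java. Hypothetical helper. */
final class GhostscriptConvert {
    /** Builds the Ghostscript argument list; "gs" is assumed to be on the PATH. */
    static List<String> buildCommand(Path input, Path tempOutput) {
        return Arrays.asList(
                "gs",
                "-o", tempOutput.toString(),
                "-sDEVICE=pdfwrite",
                "-dCompatibilityLevel=1.4",
                input.toString());
    }

    /** Converts via a temporary file, then replaces the original only on success. */
    static void convertInPlace(Path pdf) throws IOException, InterruptedException {
        Path temp = Files.createTempFile("gs-convert-", ".pdf");
        try {
            Process process = new ProcessBuilder(buildCommand(pdf, temp)).inheritIO().start();
            if (process.waitFor() != 0) {
                throw new IOException("Ghostscript exited with a non-zero status");
            }
            Files.move(temp, pdf, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; cleans up the temporary file on failure.
            Files.deleteIfExists(temp);
        }
    }
}
```

Keeping the argument list in its own method makes it possible to inspect the command without Ghostscript being installed.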

Create print-ready PDF/X (with bleedbox, trimbox, mediabox, etc.) programmatically?

I was wondering if it is possible to programmatically create a PDF file of acceptable quality for a production press, ideally using only open-source libraries.
Right now the process is like this:
-create texts and images
-merge them into a postscript file
-use Acrobat Distiller to convert it to PDF (Acrobat distiller helps you check all the parameters of the PDF)
-send the PDF to the press
What I want is something like:
-take all texts and pictures in this folder
-encode them into the press-ready PDF, something similar to what Distiller produces
-send them to the press
How would you do that?
Many thanks...

Are Ghostscript's gsdll32.dll and gswin{32,64}c.exe, with their source code and the GPL3, enough (or too much) Open Source? They ship as part of all recent releases (newest one currently: v8.71).
Ghostscript can create very good quality PDF. See here for the most recent documentation about its PDF/A and PDF/X support.
Note that this documentation was a bit misleading until very recently: it failed to mention the requirement to edit and adapt the referenced PDFA_def.ps or PDFX_def.ps templates. If you followed the old documentation without editing the templates to point specifically to the ICC color profile you wanted to embed, your output would be valid PDF, but it would not pass all checks for compliance with the official PDF/A and PDF/X standards.

You can generate PDFs using e.g. TeXML and XeLaTeX (the first one to make scripting easier, since TeX has lots of quirks in its syntax).
I also tried OpenJade and its DocBook support, but the quality was lower. TeX seems to do typesetting much better.
Both ways are using standalone programs... which you can use in shell scripts or call using system facilities.

You didn't mention which version of Distiller you're using. Recent versions do have a setting that lets you generate (different versions of) PDF/X. See also the *.joboptions files which ship with Distiller.
