I have a simple one page PDF document.
Using Adobe Acrobat X (10.1.4), I added 2 graphical annotations (Ink). So far so good.
Now I opened the document in Notepad++ to inspect it. Everything seemed fine. There was the annotations array, and both annotations. All good.
Then I randomly entered one space character " " in the xref table to make the document "invalid".
When I opened it in Adobe Acrobat X (version 10.1.4), it was able to display everything as before (apparently after automatically repairing the document) and then asked me whether I would like to save the new version to disk. I did.
Now I opened the document in Notepad++ again, only to find that it looked completely different from how it had looked before my modification.
The weirdest thing is that most of the objects just vanished from the document! There were still references to them, but the actual objects were not there.
In addition, there were a bunch of Flate-compressed streams.
Is it possible that Adobe Acrobat not only compresses streams, but also whole objects including their "x y obj" and "endobj" tags?
Object streams were introduced to the PDF format in PDF 1.5, cf. section 7.5.7 of the PDF specification ISO 32000-1:2008:
An object stream, is a stream object in which a sequence of indirect objects may be stored, as an alternative to their being stored at the outermost file level.
NOTE 1 Object streams are first introduced in PDF 1.5. The purpose of object streams is to allow indirect objects other than streams to be stored more compactly by using the facilities provided by stream compression filters.
By allowing Adobe Acrobat to save the repaired version of your document, you implicitly allowed it to do so in its preferred format, which uses object streams for compactness. That is why most objects seem to have vanished: they now live inside compressed object streams.
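To make this more tangible, here is a minimal Python sketch of how the decoded payload of an object stream is laid out, as I read section 7.5.7: first /N pairs of integers (object number, offset relative to /First), then the objects themselves, without any "x y obj" / "endobj" wrappers. The function name split_object_stream and the sample payload are made up for illustration; a real reader also has to locate the stream and honour its /Filter entry.

```python
import zlib


def split_object_stream(decoded: bytes, n: int, first: int) -> dict[int, bytes]:
    """Split the decoded payload of an object stream into its objects.

    `n` and `first` are the /N and /First entries of the stream dictionary.
    """
    # The header holds N pairs: object number, offset relative to /First.
    header = [int(tok) for tok in decoded[:first].split()]
    pairs = list(zip(header[0::2], header[1::2]))[:n]
    objects = {}
    for i, (obj_num, offset) in enumerate(pairs):
        start = first + offset
        # An object ends where the next one starts (or at the end of the data).
        end = first + pairs[i + 1][1] if i + 1 < len(pairs) else len(decoded)
        objects[obj_num] = decoded[start:end].strip()
    return objects


# A tiny, made-up payload: objects 11 and 12 at offsets 0 and 28.
header = b"11 0 12 28 "
body = b"<</Type/Annot/Subtype/Ink>> <</Type/Annot/Subtype/Ink/F 4>>"
compressed = zlib.compress(header + body)  # roughly what /Filter /FlateDecode stores

objects = split_object_stream(zlib.decompress(compressed), n=2, first=len(header))
```

Note that neither "11 0 obj" nor "endobj" appears anywhere in the payload, which matches what you saw in Notepad++.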
I am verifying a PDF with two signatures (Adobe Acrobat), both valid. One of them shows the text "cambio(s) varios" (my Adobe Acrobat is in Spanish), which translates to English as "change(s) various". My question is: what does this mean? The signatures are valid and the PDF is correct.
Thanks in advance
First of all, to outline what this is about, the Adobe Acrobat Reader signature panel looks like this for the document at hand
and the question is about the
1 Miscellaneous Change(s)
in-between.
According to Adobe Documentation
In a number of documents Adobe enumerates possible modification entries and characterizes "Miscellaneous Change(s)" like this:
Miscellaneous: Some changes which occur in memory or cannot be explicitly listed are labelled miscellaneous.
(e.g. in "Digital Signatures Workflow Guide for the Adobe® Acrobat Family of Products")
Now this documentation obviously is no help at all...
According to Adobe Acrobat
Fortunately Adobe Acrobat can be asked to show "Document Integrity Properties":
(Adobe Acrobat 9.5 output on "Signature Properties" - "Legal" - "View Document Integrity Properties...")
I assume it is this detail that makes Adobe Reader warn about miscellaneous changes.
In Your Document
Looking for transfer function use in your document, one indeed quickly finds one in an ExtGState resource of page 1:
The TR entry in that graphics state dictionary sets the transfer function here.
Interestingly the transfer function used is the Identity function! I assume that in most normal use cases setting the transfer function to Identity changes nothing...
What to Do
Thus, I would propose that you change your original document creation so that it does not include transfer functions, in particular not Identity transfer functions. Alternatively, pre-process your documents before applying the first signature and remove such functions.
I have a PDF whose content stream looks like image1.
But once I opened the PDF in Adobe Acrobat DC and tried to change the reading order, the entire content stream changed. (Please see image2.)
Here is the link to the source PDF: https://drive.google.com/file/d/1V2K3-2GdWG5DuTUv1fyfIIT54en70kI2/view
Is there a way to do the same programmatically (convert the content stream of graphical text to a proper stream)?
Thanks in advance !
Is there a way to do the same programmatically (convert the content stream of graphical text to a proper stream)?
First of all, both streams are proper. There are merely different (and in the case at hand considerably different) ways to create the same text on screen, each as valid as the other, and different PDF processors use different ways.
The processor that created your original PDF appears to have approached the task by dividing the text into small pieces (less than a text line) and drawing these pieces as independently as possible, i.e. as separate text objects (BT..ET) with text properties set in each (Tm, Tf, Tc), positioned and rescaled by transformation matrix changes (cm), enveloped in save/restore graphics state instructions (q..Q).
Adobe Acrobat, on the other hand, appears to prefer the page main text to be contained in a single text object with text properties only set when they change and no text object or graphics state switches in-between.
Neither of these is more "proper" or more "graphical" than the other. If anything, these structures mirror how these instructions are stored or processed internally by the respective PDF processor.
That being said, you do want to convert from the former style into the latter.
The main problem is that the latter style is not standardized (at least there is no published document normatively describing it). So, while you can surely attempt to follow the lead of the example you have, you can never be sure that you understood the style exactly. Thus, you always have to expect differences emerging in special, not yet encountered situations. Furthermore, there is no guarantee Adobe will meticulously adhere to that style across software versions.
Nonetheless, you can of course attempt to follow the style (as you perceive it) as well as possible.
An implementation will have to walk through the respective content stream, keeping track of the current graphics state, and transform the text drawing (and related) instructions into a single text object for as long as possible.
You have tagged your question both itext and pdfbox. Thus, you appear to be undecided about which PDF library to implement this with. Here are some ideas for both choices:
For processing content streams and keeping track of the current graphics state, iText offers a parser API: com.itextpdf.text.pdf.parser with the PdfContentStreamProcessor in iText 5.x, and com.itextpdf.kernel.pdf.canvas.parser with the PdfCanvasProcessor in iText 7.x.
You can extend them so that, in addition to analyzing the current contents, they also replace the content stream in question with an updated version, e.g. like I did in this answer for iText 5 or in this answer for iText 7.
PDFBox for the same task offers the class hierarchy based on the PDFStreamEngine. Based on these classes it should similarly be possible to create a graphics state aware content stream editor.
Both libraries also offer simpler classes for parsing the content streams into sequences of instructions, but those classes don't keep track of the graphics state, leaving that for you to implement.
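To illustrate what "walking through the content stream" means at the lowest level, here is a library-independent Python sketch (not iText or PDFBox code) that tokenizes a content stream and tracks only q/Q nesting and BT text objects. The names TOKEN and summarize are made up, and the tokenizer is deliberately naive: it assumes no nested parentheses in strings, no comments, and no inline images. A real implementation would have to track the full graphics state (CTM, font, text matrix and so on), which is exactly what the library classes above do for you.

```python
import re

TOKEN = re.compile(
    rb"\((?:\\.|[^\\()])*\)"   # literal string without nested parentheses
    rb"|<[0-9A-Fa-f\s]*>"      # hex string
    rb"|<<|>>|\[|\]"           # dictionary and array delimiters
    rb"|/[^\s/\[\]()<>]*"      # name
    rb"|[^\s/\[\]()<>]+",      # number or operator
    re.DOTALL,
)


def summarize(content: bytes) -> dict[str, int]:
    """Count text objects and track the deepest q/Q nesting in a content stream."""
    depth = max_depth = text_objects = 0
    for token in TOKEN.findall(content):
        if token == b"q":        # save graphics state
            depth += 1
            max_depth = max(max_depth, depth)
        elif token == b"Q":      # restore graphics state
            depth -= 1
        elif token == b"BT":     # begin text object
            text_objects += 1
    return {"text_objects": text_objects, "max_q_depth": max_depth}


# Two separately wrapped text objects, in the style of the original PDF.
sample = (
    b"q 1 0 0 1 72 700 cm BT /F1 12 Tf (Hello q) Tj ET Q "
    b"q 1 0 0 1 72 680 cm BT /F1 12 Tf (BT) Tj ET Q"
)
stats = summarize(sample)
```

Because strings are matched as whole tokens, the "q" and "BT" inside the string literals are not mistaken for operators.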
Given a PDF file, can I find out which software/libraries (e.g. PDFBox, Adobe Acrobat, iText...) were used to create/edit it?
The Adobe specification defines the Producer field (see "Mac OS X 10.5.6 Quartz PDFContext" in the screenshot in nimeshjm's answer) as the name of the application that "converted from another format to PDF". When a PDF is generated programmatically, nothing is really converted, so you will normally find the name of the generating SDK here.
The Creator field is related and is defined as the name of the application that created the document from which the PDF was converted. This is typically MS Word or so.
Note that this is all by convention. In practice, you cannot really rely on this and you may encounter for example empty Producer fields.
You can try opening the file in Adobe Acrobat Reader and look at the properties.
You can find this in: File -> Properties in Adobe Acrobat Reader after you open the pdf file.
You can probably get away without any PDF libraries for this type of operation. It won't be 100% reliable but I think you can probably assume 99% reliability.
So... write some code to open your PDF as a text stream and search down for /Producer. You will find something like this:
69 0 obj
<<
/Creator (PDF+Forms 2.0)
/CreationDate (D:20010627111809)
/Title (Demo)
/Producer (Cardiff Software - TELEform 7.0)
/ModDate (D:20010627111810-05'00')
>>
Grab the bits between the parentheses and Bob's your uncle. Technically the text can be stored in other formats too (hex strings, or inside a compressed object stream), but I think those will be pretty uncommon for this particular type of entry.
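A rough Python sketch of that search, under the same "good enough" assumptions: the entry is a plain literal string in uncompressed, unencrypted bytes, with no nested unescaped parentheses. The names INFO_ENTRY and find_info_entries are made up for this example.

```python
import re

# Matches "/Producer (...)" or "/Creator (...)" with a simple literal string;
# escaped parentheses are tolerated, nested unescaped ones are not.
INFO_ENTRY = re.compile(rb"/(Producer|Creator)\s*\(((?:\\.|[^\\()])*)\)")


def find_info_entries(raw: bytes) -> dict[str, str]:
    """Return the last /Producer and /Creator literal strings found in raw PDF bytes."""
    found = {}
    for key, value in INFO_ENTRY.findall(raw):
        found[key.decode("ascii")] = value.decode("latin-1")
    return found


sample = b"""69 0 obj
<<
/Creator (PDF+Forms 2.0)
/CreationDate (D:20010627111809)
/Title (Demo)
/Producer (Cardiff Software - TELEform 7.0)
>>
endobj
"""
info = find_info_entries(sample)
```

For a real file you would call find_info_entries(open(path, "rb").read()). If the file was saved incrementally, the last match wins, which is usually the most recent Info dictionary.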
If you can't find anything there, look for the XMP metadata, which is normally stored as clear text. It will look something like this:
39 0 obj
<</Subtype/XML/Length 15172/Type/Metadata>>stream
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.0-c320 44.293068, Sun Jul 08 2007 18:10:11">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xap="http://ns.adobe.com/xap/1.0/"
xmlns:xapGImg="http://ns.adobe.com/xap/1.0/g/img/"
xmlns:xapMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
dc:format="application/pdf"
xap:CreatorTool="Adobe Illustrator CS2"
xap:CreateDate="2006-05-04T15:53:27-07:00"
xap:ModifyDate="2006-05-04T15:53:27-07:00"
xap:MetadataDate="2006-05-04T15:53:27-07:00"
xapMM:DocumentID="uuid:61AC83CBC0DBDA11A32BC847EF128E34"
xapMM:InstanceID="uuid:cba15bf3-d7da-4a4e-a563-fc20d13e258a"
pdf:Producer="Adobe PDF library 7.77">
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">3.01 PDF components</rdf:li>
</rdf:Alt>
</dc:title>
...
The combination of these two is going to be right practically always. If you want 100% reliability then by all means use a PDF library, but for many purposes this should be sufficient.
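The XMP fallback can be sketched in Python the same way. This is only a heuristic under stated assumptions: the metadata stream is uncompressed, attribute values use double quotes as in the sample above, and the properties of interest are pdf:Producer and xap:CreatorTool (newer files use the xmp: prefix instead of xap:). XMP_FIELD and find_xmp_tools are names invented for this example.

```python
import re

# The property may be written as an attribute (name="value") or as an
# element (<name>value</name>); closing tags are excluded by the lookbehind.
XMP_FIELD = re.compile(
    rb'(?<![/\w])(pdf:Producer|xap:CreatorTool|xmp:CreatorTool)'
    rb'(?:\s*=\s*"([^"]*)"|\s*>([^<]*)<)'
)


def find_xmp_tools(raw: bytes) -> dict[str, str]:
    """Return producer/creator-tool properties found in clear-text XMP."""
    found = {}
    for name, attr_value, elem_value in XMP_FIELD.findall(raw):
        value = attr_value or elem_value
        found[name.decode("ascii")] = value.decode("utf-8", "replace")
    return found


sample = b'''<rdf:Description rdf:about=""
 xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
 xap:CreatorTool="Adobe Illustrator CS2"
 pdf:Producer="Adobe PDF library 7.77">'''
tools = find_xmp_tools(sample)
element_form = find_xmp_tools(b"<pdf:Producer>Some Tool 1.0</pdf:Producer>\n<x/>")
```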
My replies may feature concepts based around ABCpdf. It's what I work on. It's what I know. :-)
It is usually difficult to determine which software actually produced a PDF, because most Microsoft Office products can export a document to PDF: when you open a regular typed document, you have the option to save it as PDF. If you are familiar with PowerPoint slides, it can be easy to tell from the design once the file is a PDF.
On the other hand, Adobe Acrobat can create files like the application forms we often download (from an embassy site, an immigration site, etc.).
Other software such as Adobe Photoshop, Illustrator, etc. can also save files as PDF. Hope this helps.
I have a C# web application and I want to check whether a PDF document contains a cross-reference stream. If it does, I want to convert it to a cross-reference table.
Detection is fairly easy. Search the file from its end for "%%EOF"; proper PDF files end with a "%%EOF" line, while not-so-proper ones may have some trash bytes following that marker. The line before that marker line contains the byte offset of the last cross-reference section, and is itself preceded by a line with the startxref keyword (cf. Adobe's copy of ISO 32000-1:2008, section 7.5.5). Go to the offset noted there.
If at that position you find the xref keyword, the PDF has a cross reference table. If you find a PDF stream object instead (ibidem section 7.5.8), the PDF has a cross reference stream. If you find neither there, something about the file is fishy.
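Here is a minimal Python sketch of that check (your application is C#, but the logic carries over directly). The function name xref_kind is made up, and the sketch only inspects the newest cross-reference section; files with incremental updates or hybrid-reference files may mix both kinds further back in the chain.

```python
import re


def xref_kind(raw: bytes) -> str:
    """Classify the last cross-reference section as 'table', 'stream', or 'unknown'."""
    # Find the last "startxref <offset> %%EOF" sequence in the file.
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", raw)
    if not matches:
        return "unknown"
    offset = int(matches[-1])
    at_offset = raw[offset:offset + 64]
    if at_offset.startswith(b"xref"):
        return "table"
    # A cross-reference stream is an indirect object: "12 0 obj ...".
    if re.match(rb"\d+\s+\d+\s+obj\b", at_offset):
        return "stream"
    return "unknown"


# Two tiny, made-up files that contain only what the check looks at.
head = b"%PDF-1.4\n"
table_pdf = (
    head
    + b"xref\n0 1\n0000000000 65535 f \ntrailer\n<</Size 1>>\nstartxref\n"
    + str(len(head)).encode()
    + b"\n%%EOF\n"
)
stream_pdf = (
    b"%PDF-1.5\n"
    b"1 0 obj\n<</Type/XRef>>\nstream\n\nendstream\nendobj\n"
    b"startxref\n9\n%%EOF\n"
)
```

With these samples, xref_kind(table_pdf) gives "table" and xref_kind(stream_pdf) gives "stream"; anything else comes back as "unknown", i.e. the fishy case.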
Conversion is difficult, though, especially if the PDF also uses object streams which only can be used with cross reference streams. You might want to use a library like iText(Sharp) to read the PDF and export it again with less compression enabled.
Furthermore, if the PDF is signed, conversion is impossible without breaking the signature.
One thing to note when converting from a cross-reference stream to something you can parse is that a cross-reference stream allows for a new type of reference entry. Alongside "uncompressed" and "free" you now have "compressed" as a new entry type.
This entry cannot be converted 1:1 to a normal cross-reference table. A "compressed" entry inside a cross-reference stream points into a so-called "object stream", which contains multiple indexed objects: the entry names the object stream and an index within it. The header at the start of the decoded object stream (pairs of object number and offset) is then used to resolve the index to a byte offset inside the object stream.
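As a sketch of what these entries look like once decoded, the following Python snippet splits cross-reference stream data into rows according to the /W array of the stream dictionary. It assumes the data has already been decompressed and any PNG predictor from /DecodeParms has been undone; the function name parse_xref_stream_rows and the sample bytes are invented for illustration.

```python
def parse_xref_stream_rows(data: bytes, w: list[int]) -> list[tuple[int, ...]]:
    """Split decoded cross-reference stream data into (type, field2, field3) rows."""
    row_len = sum(w)
    rows = []
    for pos in range(0, len(data) - row_len + 1, row_len):
        cursor = pos
        fields = []
        for width in w:
            # Each field is a big-endian integer of the width given in /W.
            fields.append(int.from_bytes(data[cursor:cursor + width], "big"))
            cursor += width
        # A zero-width type field defaults to type 1 (uncompressed object).
        if w[0] == 0:
            fields[0] = 1
        rows.append(tuple(fields))
    return rows


# /W [1 2 1]: one byte for the type, two for field 2, one for field 3.
sample = bytes([0, 0, 0, 0,    # type 0: free entry
                1, 0, 17, 0,   # type 1: object at byte offset 17, generation 0
                2, 0, 5, 1])   # type 2: stored in object stream 5, at index 1
rows = parse_xref_stream_rows(sample, [1, 2, 1])
```

Only the type 0 and type 1 rows have an equivalent in a classic xref table; a type 2 row has none, which is why the objects would have to be written back to the outermost file level before a table can be produced.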
If that topic is still of interest to you, I suggest that you have a look at chapter "3.4.7 Cross-Reference Streams" of the Adobe PDF Reference. The paragraph "Compatibility with Applications That Do Not Support PDF 1.5" can help in particular. It deals with so-called "hybrid-reference" files, which do more or less what you want.
Now that we more or less know how to convert an xref stream to an xref table, let's continue with detecting an xref stream.
You can search for a stream with /Type/XRef (with optional whitespace between the two tokens).
Also, if you have any streams of /Type/ObjStm, you can deduce that there must be an xref stream, since only xref streams can point into object streams ;) (see above for an explanation).
Last but not least, if the PDF version of the document that you parse is less than 1.5, you can be somewhat sure that no xref stream is included. This heavily depends on the PDF authoring tool that created your document. Some stick to the reference, some don't.
I hope this helps.
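For what it's worth, the three hints above can be collected with a few regular expressions. This Python sketch is a heuristic only: it scans raw bytes, so it can be fooled by matching text inside uncompressed content, and the header version can be overridden by a /Version entry in the catalog. The name xref_stream_hints is made up.

```python
import re

XREF_STREAM = re.compile(rb"/Type\s*/XRef\b")
OBJECT_STREAM = re.compile(rb"/Type\s*/ObjStm\b")
HEADER = re.compile(rb"%PDF-(\d)\.(\d+)")


def xref_stream_hints(raw: bytes) -> dict[str, object]:
    """Collect cheap hints about whether a PDF uses cross-reference streams."""
    header = HEADER.search(raw[:1024])
    version = (int(header.group(1)), int(header.group(2))) if header else None
    return {
        "version": version,
        "has_xref_stream_dict": bool(XREF_STREAM.search(raw)),
        "has_object_stream_dict": bool(OBJECT_STREAM.search(raw)),
    }


hints = xref_stream_hints(b"%PDF-1.5\n1 0 obj\n<< /Type /XRef /W [1 2 1] >>\n")
```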
It's kind of a hack, but you can use the following code to detect whether a PDF contains cross-reference streams.
The code uses Docotic.Pdf library.
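// Requires the Docotic.Pdf library (namespace BitMiracle.Docotic.Pdf).
using BitMiracle.Docotic.Pdf;

public static bool ContainsCrossReferenceStreams(string fileName)
{
    using (PdfDocument document = new PdfDocument(fileName))
    {
        // Set to true by the library when the source uses cross-reference streams.
        return document.SaveOptions.UseObjectStreams;
    }
}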
When the library opens a PDF it sets SaveOptions.UseObjectStreams to true if the source document uses cross-reference streams. Otherwise the property returns false.
Disclaimer: I work for the vendor of the library.
I'd like to write some (Java) code that takes a PDF document and creates named destinations from all of the bookmarks. I think the iText API is the easiest way of doing this, but I have never used the API before.
How would you go about writing this sort of code with the iText API? Can iText do the parsing needed to manipulate existing PDFs by itself? The kind of manipulations I am thinking of are:
Open,
Find bookmarks,
Create destinations,
Save,
Close.
Or is there a different API that would be better?
Followup: I submitted a patch to iText a few months ago (it has now been accepted and is part of HEAD) that adds text parsing capabilities to iText. PdfBox (mentioned below) has (had?) problems with reading newer PDFs that use xref streams instead of the older xref table format.
Another library that is very good at parsing existing PDF files is PdfBox. It can also be used for modifying an existing PDF. FYI, this is the text parser that Lucene uses.
I will also mention that iText does have the ability to parse a PDF file, it's just not great at parsing the text content on each page. If you are looking at accessing the PDF higher level constructs (Dictionaries, etc...) that are used for storing bookmarks, etc... and you don't mind getting your hands a little dirty with reading the PDF spec, you can absolutely do what you are asking about (we do it quite a bit ourselves).
The PDF Spec is big, but readable for the most part, and you don't have to worry about the bulk of it (which is geared towards actual page content and rendering) if all you are trying to do is extract bookmarks.
I'll just warn you up front that you may be disappointed with this. iText isn't really intended to be used as a parser. It's really more for creating entirely new PDF documents, but you can take a whack at it.
To start, using iText, you won't be able to modify the existing PDF document. What you can do, though, is to make a copy with the additional features that you want. (If somebody else knows better, please let me know, this drives me crazy.)
What you will want to do is create a PdfReader object from an input stream on your source file. Then create a PdfCopy object (which is just an extended PdfWriter that makes getting data from an existing source more convenient) for your destination.
As far as I can tell, the bookmarks cannot be obtained from iText at all. Another library may be needed. I think JPedal may have the ability to extract them (it can get them as an XML document, which you may then have to parse to get what you want). However you get them, you can then add them to a java.util.List and set that list as the outline on the PdfCopy. The bookmarks themselves are just HashMaps with a particular set of keys. I'm not sure what all of the values are, but they include "Title", "Action" (which seems to be where you'd specify that this is a named destination, though I don't know what that value would be), and "URI" (which is used if this is an external link; I suspect that this would specify the name of the named destination that you're linking to). Again, the specifics are hard to find.
Then iterate over the pages of the reader, importing each page into the PdfCopy. This page may help you.
Sorry I'm not more helpful to you. Good luck.
P.S. If anybody else knows of a better tool that's either (L)GPL or BSD licensed, I'd love to hear about it.