Get annotation from a pdf to add to another document - vb.net

I am using iTextSharp version 5.0.
For my project, I need to copy my PDF document into another PDF document using PdfWriter. I can't use PdfCopy or PdfStamper.
As a result, all the annotations get lost during this operation.
To begin, I tried to find out how to get the annotation created by the "pencil comment drawing markup" tool in the Adobe Reader UI.
For my tests, I am using this PDF document with a drawing markup I added myself: https://easyupload.io/3c6i1g
I found how to get the annotation dictionary:
    Dim pdfReader As New PdfReader(pdfPath)
    Dim page As PdfDictionary = pdfReader.GetPageN(0)
    Dim annots As PdfArray = page.GetAsArray(PdfName.ANNOTS)
    If annots IsNot Nothing Then
        For i = 0 To annots.Size - 1
            Dim annotDict As PdfDictionary = annots.GetAsDict(i)
            Dim annotContents As PdfString = annotDict.GetAsString(PdfName.CONTENT)
            Dim annotSubtype As PdfString = annotDict.GetAsString(PdfName.SUBTYPE)
            Dim annotName As PdfString = annotDict.GetAsString(PdfName.T)
        Next
    End If
When the loop reaches my comment, the annotName variable returns my name, so I am sure I am parsing the annotation I am looking for, but annotSubtype is Nothing. How is that possible? According to the PDF specification at section 12.5.2 table 1666 (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf), the Subtype entry is required, so shouldn't it be something other than Nothing?
Also, how can I get the image related to this annotation? I thought it would be stored in the contents of the annotation dictionary, but that also returns Nothing in the code above...
As for why I can't use PdfStamper in the first place: one of the pages of my PDF document must be resized (downscaled) in order to add some text at the bottom of the page, so I must use PdfWriter for that.
Question: How can I get the drawn line of a comment annotation with iTextSharp 5.0?

There are a lot of separate questions in your post...
When the loop reaches my comment, the annotName variable returns my name, so I am sure I am parsing the annotation I am looking for, but annotSubtype is Nothing. How is that possible?
According to the PDF specification at section 12.5.2 table 1666 (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf), the Subtype entry is required, so shouldn't it be something other than Nothing?
According to table 164 in section 12.5.2 of ISO 32000-1, the Subtype entry is indeed required, but it is also specified to be a name, while you try to retrieve a string instead:
Dim annotSubtype As PdfString = annotDict.GetAsString(PdfName.SUBTYPE)
As the Subtype entry of that annotation in your PDF correctly is a name, GetAsString returns Nothing.
Thus, call GetAsName instead and expect a PdfName return type.
Also, how can I get the image related to this annotation? I thought it would be stored in the content of the annotation dictionary but this is also returning nothing in the code above...
The Contents entry is specified in the same table as above to be optional and (if present) to have a text string value containing "Text that shall be displayed for the annotation or, if this type of annotation does not display text, an alternate description of the annotation’s contents in human-readable form." As the annotation merely is a scribble, what should the annotation have as Contents value?
As your annotation actually is an Ink annotation, you can find the representation of the scribble in the required InkList and optional BS entries of the annotation, see table 182 of section 12.5.6.13 of ISO 32000-1.
The value of InkList is "An array of n arrays, each representing a stroked path. Each array shall be a series of alternating horizontal and vertical coordinates in default user space, specifying points along the path. When drawn, the points shall be connected by straight lines or curves in an implementation-dependent way."
The value of BS (if present) is "A border style dictionary (see Table 166) specifying the line width and dash pattern that shall be used in drawing the paths."
Beware, though: The annotation dictionary’s AP entry, if present, takes precedence over the InkList and BS entries. And in your PDF the annotation has an appearance entry. So the actually displayed content is that of the Normal appearance stream which contains vector graphics instructions drawing your scribble.
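To make the InkList layout described above more concrete, here is a minimal, library-independent Java sketch (not iTextSharp code; the class and method names are made up for illustration). It assumes the strokes have already been read from the annotation as flat arrays of alternating x and y values, and it joins the points with straight lines, which is only one of the renderings the specification allows. The output is the kind of path-drawing instructions a normal appearance stream might contain.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of iText/iTextSharp.
class InkListSketch {
    // Turns an InkList (one flat x0 y0 x1 y1 ... array per stroke) into
    // content stream operators: "w" sets the line width, "m" starts a path,
    // "l" adds a straight line segment, "S" strokes the path.
    static String toPathOperators(List<double[]> inkList, double lineWidth) {
        StringBuilder sb = new StringBuilder();
        sb.append(fmt(lineWidth)).append(" w\n");
        for (double[] stroke : inkList) {
            for (int i = 0; i + 1 < stroke.length; i += 2) {
                sb.append(fmt(stroke[i])).append(' ').append(fmt(stroke[i + 1]))
                  .append(i == 0 ? " m\n" : " l\n");
            }
            sb.append("S\n");
        }
        return sb.toString();
    }

    // Locale.ROOT keeps the decimal separator a dot, as PDF syntax requires.
    static String fmt(double v) {
        return String.format(Locale.ROOT, "%.2f", v);
    }
}
```

The line width passed in would come from the W entry of the BS dictionary when that entry is present.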
As for why I can't use PdfStamper in the first place: one of the pages of my PDF document must be resized (downscaled) in order to add some text at the bottom of the page, so I must use PdfWriter for that.
First of all, this only means that you have to do something special to that special page; there is no need to damage all pages by copying them with a PdfWriter. You could manipulate that single page in a separate document, then use PdfCopy to copy the pages before that page from the original PDF, then that page from the separate PDF, and then all pages after that page from the original again.
Thus, you'd only have to fix the annotations of that special page, the annotations on the other pages could remain untouched.
Furthermore, you can even use the PdfStamper if you are ready to use low-level iText routines. In particular, before stamping you can apply the static PdfReader method GetPageContent to the page dictionary of the special page to retrieve the page content as a byte array, build a new byte array from it in which you prepend an affine transformation which does the downscaling, and set the new byte array as the content of the page in question using the SetPageContent method of the underlying PdfReader.
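The byte array manipulation itself can be sketched without any PDF library. The following Java snippet is only an illustration under stated assumptions: the class and method names are hypothetical, the content is assumed to be an uncompressed content stream, and wrapping it in q ... Q (so that text added afterwards is not scaled too) is my own addition rather than something the approach above requires.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of iText/iTextSharp.
class ContentScaleSketch {
    // Prepends "q s 0 0 s tx ty cm" to the page content and appends "Q",
    // so the original content is drawn scaled by s and shifted by (tx, ty).
    static byte[] prependScale(byte[] content, double scale, double tx, double ty) {
        String prefix = String.format(Locale.ROOT,
                "q %.4f 0 0 %.4f %.4f %.4f cm\n", scale, scale, tx, ty);
        byte[] head = prefix.getBytes(StandardCharsets.US_ASCII);
        byte[] tail = "\nQ\n".getBytes(StandardCharsets.US_ASCII);
        byte[] result = new byte[head.length + content.length + tail.length];
        System.arraycopy(head, 0, result, 0, head.length);
        System.arraycopy(content, 0, result, head.length, content.length);
        System.arraycopy(tail, 0, result, head.length + content.length, tail.length);
        return result;
    }
}
```

For example, a scale of 0.9 with ty = 72 would shrink the page content to 90% and move it up by one inch, leaving room at the bottom for the extra text.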
Even in this scenario, though, you'd have to adjust the annotation coordinates (both of their rectangles and of other coordinates like the InkList in your case)...
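As a rough illustration of that coordinate adjustment, here is a small Java sketch. It is not iText code: the class and method names are hypothetical, and it assumes the page was transformed by a uniform scale s plus a translation (tx, ty). The same mapping would be applied to the four numbers of the annotation's Rect and to every stroke array in its InkList.

```java
// Hypothetical helper, not part of iText/iTextSharp.
class AnnotationScaleSketch {
    // Applies x' = s*x + tx and y' = s*y + ty to a flat array of
    // alternating x and y values (a Rect or one InkList stroke).
    static double[] transform(double[] coords, double s, double tx, double ty) {
        double[] out = new double[coords.length];
        for (int i = 0; i < coords.length; i++) {
            // Even indices hold x values, odd indices hold y values.
            out[i] = s * coords[i] + (i % 2 == 0 ? tx : ty);
        }
        return out;
    }
}
```

A line width in the BS dictionary would presumably need to be multiplied by s as well, and an existing appearance stream still has to be scaled or regenerated separately.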
Question: How can I get the drawn line of a comment annotation with iTextSharp 5.0?
See above: the annotation of the scribble is an Ink annotation, and the drawn path is specified in the InkList and BS entries of its dictionary and additionally instantiated in its normal appearance stream.

Related

Looking for software or API that will give me co-ordinates of text in a pdf

Simple question I hope - I have a pdf and want to detect the co-ordinates of specific word(s) or placeholder text. I then intend to use itextsharp to stamp a replacement bit of text on top at the co-ordinates found.
Can anyone recommend anything please?
Thanks
As answered in the comments, one could use iText to perform such a task. Maybe there are better solutions, but I doubt it. The cause of the mentioned issue, i.e. "[itextsharp] sometimes give co-ords of the start of the sentence the search text is in", is that sometimes glyphs are so close that their boxes overlap, hence I don't see how it could be handled as you want.
So you can do the following:
- Extend the LocationTextExtractionStrategy class and override eventOccurred, for example, as follows:
    @Override
    public void eventOccurred(IEventData data, EventType type) {
        if (type.equals(EventType.RENDER_TEXT)) {
            TextRenderInfo renderInfo = (TextRenderInfo) data;
            // Obtain all the necessary information from renderInfo, for example
            LineSegment segment = renderInfo.getBaseline();
            // ...
        }
    }
- Pass an instance of such an extended class to PdfTextExtractor.getTextFromPage as follows:
    PdfTextExtractor.getTextFromPage(pdfDocument.getPage(1), new ExtendedLocationTextExtractionStrategy());
- Once text is found, the event will be triggered.
There are some difficulties in such a solution, of course, because the text you want to find and write above could be present in the PDF not as "Text", but as "T", "ex", "t", or even "t", "x", "e", "T". However, since you use iText, you may want to harness the advantages of one of its products, pdfSweep. This product aims to completely remove unnecessary content from the PDF, with such content being passed either as some locations (which you want to obtain, so that is not an option) or as regexes.
This is how to create such a regex strategy (to find all "Dolor" and "dolor" instances in the document and completely remove them from all the streams, so that they are neither visible in a PDF viewer nor found in the underlying PDF objects):
RegexBasedCleanupStrategy strategy = new RegexBasedCleanupStrategy("(D|d)olor").setRedactionColor(ColorConstants.GREEN);
This is how to use it:
PdfAutoSweep autoSweep = new PdfAutoSweep(strategy);
autoSweep.cleanUp(pdf); // a PdfDocument instance
And this is how to write some text on the location, at which the unnecessary text was present:
    for (IPdfTextLocation location : strategy.getResultantLocations()) {
        Rectangle rect = location.getRectangle();
        // do something, for example, write some text
    }
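To illustrate the fragmentation problem mentioned above (a word stored as "T", "ex", "t" in arbitrary order), here is a library-independent Java sketch. It is not iText or pdfSweep code. The Fragment type and the locate method are hypothetical, all fragments are assumed to lie on the same baseline, and each character of a fragment is assumed to take an equal share of the fragment's width, which is only an approximation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of text fragments on a single line, not an iText API.
class FragmentSearchSketch {
    static final class Fragment {
        final String text;
        final double x;      // left edge of the fragment
        final double width;  // total width of the fragment
        Fragment(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Returns {left, right} of the first occurrence of term, or null if absent.
    static double[] locate(List<Fragment> fragments, String term) {
        if (term.isEmpty()) return null;
        // Restore left-to-right order regardless of the order in the stream.
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        StringBuilder line = new StringBuilder();
        List<double[]> charBox = new ArrayList<>(); // {left, right} per character
        for (Fragment f : sorted) {
            if (f.text.isEmpty()) continue;
            double perChar = f.width / f.text.length();
            for (int i = 0; i < f.text.length(); i++) {
                line.append(f.text.charAt(i));
                charBox.add(new double[] {f.x + i * perChar, f.x + (i + 1) * perChar});
            }
        }
        int start = line.indexOf(term);
        if (start < 0) return null;
        int end = start + term.length() - 1;
        return new double[] {charBox.get(start)[0], charBox.get(end)[1]};
    }
}
```

In a real extraction strategy the per-character boxes would come from the render info rather than from an even split, so treat this purely as a sketch of the bookkeeping involved.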

Special characters in PDF form fields and global and fieldbased DR

I have a question regarding a weird form field behaviour.
There are two PDF documents; both have text field(s) using Helvetica as the font.
Both are filled with values using the same iText logic (see below).
The field value (/V) is correct for both PDFs; the field appearance, however, is not.
One PDF works fine; the other scrambles special characters like the euro symbol € or German characters like üöäß.
I tried to define a substitute font (as described in the book) but never got € and ß to work.
The only difference I could find is that a /DR dictionary is defined at field level for the non-working PDF (in addition to the global one). But if I remove it, the € sign still doesn't work. Please note that I am not talking about Asian or exotic Unicode characters here: all of them are part of the standard Helvetica font (as the other PDF proves).
Question(s):
Any ideas how to get the non-working PDF to display the characters correctly?
Or does the PDF violate the PDF spec somehow? (It was created using Acrobat, which makes that unlikely but not impossible.)
If you suggest replacing the form field font: how can I differentiate between working and non-working PDF files, since I don't want to do that for perfectly valid and working files?
Update: The code is not the problem (I am certain of that since it's the same code for both); however, for the sake of completeness, here it is:
    AcroFields acroFields = stamper.getAcroFields();
    try {
        boolean successful = acroFields.setField("Mitarbeiter", "öäü߀#");
        if (!successful) {
            //throw some exception
        }
    }
    catch (DocumentException de) {
        //some exceptionhandling
    }
I didn't find any clues in the PDF reference about this, but the font that is used for the field doesn't define an encoding. However: an encoding is defined at the level of the resource dictionary (/DR). If you use that encoding, then the appearance of the field is created correctly. Note that the ISO specification doesn't say anything about the existence of an /Encoding entry at the level of the resource dictionary.
I've made a small update to iText. You can check the changes in revision 6693. This way, iText will now check if the /DR dictionary has encoding values in case no encoding is defined at the level of the font. With this fix, your form is filled out correctly.

PDFBox- is the reading order guaranteed with PDFTextStripper's processTextPosition ?

I am using PDFTextStripper (PDFBox 1.8.2) to process every TextPosition in a PDF file. I have tested with a lot of files and noticed that it processes text in reading order. However, this does not hold if a PDF has footers (in my case, a docx which I exported as PDF): PDFTextStripper processes the footer first and then the body of the file.
Is this expected behavior? Is there a way I can specify the order? Or is there any way I can identify that it's a footer so I can make the adjustment in my code?
PDFTextStripper has an attribute SortByPosition (getSortByPosition & setSortByPosition). It's false by default.
If this attribute is false, the PDFTextStripper essentially extracts the text in the order in which it appears in the PDF page content stream.
This order can be totally mangled (because the content stream can contain operators which position the next printed text anywhere on the page), but often text sections belonging together are kept together (because the operations required for such sections often are inserted in that stream as a block).
Headers and footers, though, often are added at the same time and, therefore, appear together before or after the main body text.
If this attribute is true, though, the PDFTextStripper essentially extracts the text from top to bottom, from left to right (unless the reading order is defined to be right to left). (Ok, ok, it also respects article beads, but you hardly can count on them being used in general.)
This order is good in case of one-column text, and headers come first and footers last, but unless proper article beads are used, multi-column pages get mangled up.
BTW, you can switch off the use of article beads using the attribute ShouldSeparateByBeads (getSeparateByBeads & setShouldSeparateByBeads).
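The difference between the two orders can be pictured with a small Java model. This is not PDFBox code and not how PDFTextStripper is implemented internally: the Chunk type and readingOrder method are hypothetical, and the sketch only sorts by coordinates, assuming y is measured upwards from the bottom of the page and ignoring article beads and right-to-left text.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of positioned text, not a PDFBox class.
class SortByPositionSketch {
    static final class Chunk {
        final String text;
        final double x;
        final double y;
        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Takes chunks in content stream order and returns their text sorted
    // top to bottom (larger y first), then left to right (smaller x first).
    static List<String> readingOrder(List<Chunk> streamOrder) {
        List<Chunk> sorted = new ArrayList<>(streamOrder);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                .thenComparingDouble(c -> c.x));
        List<String> texts = new ArrayList<>();
        for (Chunk c : sorted) {
            texts.add(c.text);
        }
        return texts;
    }
}
```

With a footer that comes first in the stream but sits at the bottom of the page, the sorted result puts it last, which matches the behaviour described for SortByPosition set to true on one-column pages.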

how to make pdf layers(optional content group) in tree structure

Not long ago I posted a question about how to use optional content groups in PDF. Now I have a new question: how can I arrange these optional content groups in a tree structure?
For example, I have 4 different layers, all of them OCG layers. 3 layers contain text labels and 1 layer contains vector graphics. So I want them shown like this:
Alllayers
---labels
     --layer1
     --layer2
     --layer3
---layer4
I use a PDF doc as an example.
It is in Chinese; the Chinese characters are the names of the layers, nothing more.
The answer to this question will depend on which pdf library you are using to generate your files. In general, you will need to produce a file that has an Order array in the optional content configuration dictionary that represents the tree that you want to display.
From the PDF Reference Document:
Key: Order
Type: array
Description: (Optional) An array specifying the order for presentation of optional content groups in a conforming reader’s user interface. The array elements may include the following objects:
- Optional content group dictionaries, whose Name entry shall be displayed in the user interface by the conforming reader.
- Arrays of optional content groups which may be displayed by a conforming reader in a tree or outline structure. Each nested array may optionally have as its first element a text string to be used as a non-selectable label in a conforming reader’s user interface.
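As a library-independent illustration of such a nested Order array, the following Java sketch serializes the tree from the question into PDF array syntax. Everything in it is an assumption made for the example: the Node type, the method names and the object numbers 11 to 14 are hypothetical, and both "Alllayers" and "labels" are modelled as non-selectable labels of nested arrays, as the quoted description permits.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model, not tied to any PDF library.
class OcgOrderSketch {
    static final class Node {
        final String label;        // label of a nested array, null for a leaf
        final String ocgRef;       // e.g. "11 0 R" for a leaf, null for a group
        final List<Node> children;
        private Node(String label, String ocgRef, List<Node> children) {
            this.label = label;
            this.ocgRef = ocgRef;
            this.children = children;
        }
        static Node leaf(String ocgRef) {
            return new Node(null, ocgRef, Collections.<Node>emptyList());
        }
        static Node group(String label, Node... children) {
            return new Node(label, null, Arrays.asList(children));
        }
    }

    // A leaf becomes an indirect reference; a group becomes a nested array
    // whose first element is the label as a text string.
    static String toPdf(Node n) {
        if (n.ocgRef != null) return n.ocgRef;
        StringBuilder sb = new StringBuilder("[ (").append(n.label).append(")");
        for (Node c : n.children) sb.append(' ').append(toPdf(c));
        return sb.append(" ]").toString();
    }

    static String orderEntry(Node... topLevel) {
        StringBuilder sb = new StringBuilder("/Order [");
        for (Node n : topLevel) sb.append(' ').append(toPdf(n));
        return sb.append(" ]").toString();
    }
}
```

Calling orderEntry with a group "Alllayers" that contains a group "labels" (three leaves) and a fourth leaf yields `/Order [ [ (Alllayers) [ (labels) 11 0 R 12 0 R 13 0 R ] 14 0 R ] ]`. Labels containing parentheses or non-ASCII characters such as Chinese layer names would need proper PDF string escaping or encoding, which this sketch leaves out.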

How to obtain PDF table of contents (outline) data in iOS (iPad)?

I am building an iPad application that displays PDFs, and I'd like to be able to display the table of contents and let the user navigate to the relevant pages.
I have invested several hours in research at this point, and it appears that since PDFKit is [not supported in iOS], my only option is to parse the PDF metadata manually.
I have looked at several solutions, but all of them are silent on one point - how to associate a page in the "outline" metadata with the real page number of the item. I have examined my PDF document with [the Voyeur tool] and I can see the outline in the tree.
[This solution] helped me figure out how to navigate down the Outline/A/S/D tree to find the "Dest" object, but it performs some kind of object comparison using [self.pages indexOfObjectIdenticalTo:destPageDic] that I don't understand.
I have read the [official PDF spec from adobe], and section "12.3.2.3 Named Destinations" describes the way that an outline entry can point to a page:
"Instead of being defined directly with the explicit syntax shown in Table 151, a destination may be referred to indirectly by means of a name object (PDF 1.1) or a byte string (PDF 1.2)."
And continues with this line which is utterly incomprehensible to me:
"The value of this entry shall be a dictionary in which each key is a destination name and the corresponding value is either an array defining the destination, using the syntax shown in Table 151, or a dictionary with a D entry whose value is such an array."
This refers to page 366, "12.3.2.2 Explicit Destinations" where a table describes a page: "In each case, page is an indirect reference to a page object"
So is the result of CGPDFDocumentGetPage or CGPDFPageGetDictionary an "indirect reference to a page object"?
I found a [thread on lists.apple.com] that discusses this. [This comment] implies that you can take the address (in memory?) of the CGPDFPageGetDictionary object for a given page and compare it to the pages in the "Outline" tree of the PDF metadata.
However, when I look at the addresses of page objects in the Outline tree and compare them to those addresses, they are never the same. The line used in that thread, "TTDPRINT(@"%d => %p", k+1, dict);", prints "dict" as a pointer in memory. There's no reason to believe that an object returned there would be the same as one returned somewhere else; they'd be in different places in memory!
My last hope was to look at the source code from apple's command line "outline" tool [mentioned in this book] (as [suggested by this thread]), but I can't find it anywhere.
Bottom line - does anyone have some insight into how PDF outlines work, or know of some open source code (preferably objective-c) that reads PDF outlines?
ARGG: I had all kinds of links posted here, but apparently a new user can only post one link at a time
The dictionary of a page returned by CGPDFDocumentGetPage (obtained via CGPDFPageGetDictionary) is the same as the indirect page reference you get when resolving a destination in an outline item. Both are essentially dictionaries and you can compare them using ==. When you have a CGPDFDictionaryRef that you want to know the page number of, you can do something like this:
    CGPDFDocumentRef doc = ...;
    CGPDFDictionaryRef outlinePageRef = ...;
    size_t pageCount = CGPDFDocumentGetNumberOfPages(doc);
    for (size_t p = 1; p <= pageCount; p++) {
        CGPDFPageRef page = CGPDFDocumentGetPage(doc, p);
        // Compare the page's dictionary (not the CGPDFPageRef itself) by pointer
        if (CGPDFPageGetDictionary(page) == outlinePageRef) {
            printf("found the page number: %zu\n", p);
            break;
        }
    }
An explicit destination however is not a page, but an array with the first element being the page. The other elements are the scroll position on the page etc.
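The lookup from an explicit destination to a page number can be modelled in a few lines of Java. This is only a sketch of the idea and not Core Graphics code: the class and method names are hypothetical, pages are plain objects standing in for page dictionaries, and the destination is assumed to be an already resolved array whose first element is the page object.

```java
import java.util.List;

// Hypothetical model of page lookup by object identity.
class DestinationSketch {
    // Returns the 1-based page number the destination points to, or -1.
    static int pageNumberOf(List<Object> pages, Object[] explicitDestination) {
        // The first element of an explicit destination is the page itself;
        // the remaining elements describe the view (fit type, position, zoom).
        Object target = explicitDestination[0];
        for (int i = 0; i < pages.size(); i++) {
            // Identity comparison, like == on dictionary pointers.
            if (pages.get(i) == target) {
                return i + 1;
            }
        }
        return -1;
    }
}
```

This corresponds to the indexOfObjectIdenticalTo call mentioned in the question: the match is by identity of the page object, not by comparing its contents.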