iTextSharp - when extracting a page it fails to carry over Adobe rectangle highlighting important info - pdf

Per the following site...
http://forums.asp.net/t/1630140.aspx?extracting+pdf+pages+using+itextsharp
...I use the function ExtractPages to produce a new PDF based on a range of page numbers. My problem is that a rectangle on the 2nd page of one PDF was not carried over when that page was extracted. This makes me worry that Adobe comments are also being dropped when pages are extracted.
Is there a way to adjust this code so that comments and objects like rectangles are brought over to the new PDF when ExtractPages is called? Am I missing some syntax, or is this not available in version 5.5.0 of iTextSharp?

Your use of the verb extract in the context of extracting pages is confusing. People will think you want to extract text from a page. In reality, you want to import or copy pages.
The example you refer to uses PdfWriter. That's wrong: you should use PdfStamper (if only one existing PDF is involved) or PdfCopy (if multiple existing PDFs are involved). See my answer to the question "How to keep original rotate page in itextSharp (dll)" to find out why the example on forums.asp.net is a really, really bad example.
The fact that a page has "a rectangle" is irrelevant. Maybe the rectangle is an annotation. In that case, you're throwing that rectangle away by using the wrong example. Maybe the origin of the page is different from 0,0...
If your purpose is to create a new PDF containing only a selection of pages of the original PDF, please read my answer to "Function that I can use to remove a single page from a PDF using iText".

Related

"Re-paginate" PDF using iText

Disclaimer:
I am using iText 5. I know this is generally frowned upon (vs. using iText 7), but I am working with considerable legacy code that uses iText 5, and upgrading does not fall under my control.
Requirements:
A "simple" PDF/A is received as input (text only, these are generated from RTF), as well as a float value corresponding to a desired first page length in inches.
A PDF/A must be output that is identical to the input PDF, except it is paginated as follows: first page length = input value; each subsequent (not first or last) page will fill a standard page length; the last page will be truncated a constant number of points below the content nearest the bottom of the page. Note that input and output width will be identical and constant.
Progress / Approach:
I have extended the SimpleTextExtractionStrategy to generate XML containing font information (size and family, bold or italics, etc.) as well as location information (relative to an absolute coordinate system whose origin is at the top left corner of the first page of the input PDF) for each "span" of text extracted from the input PDF.
I then generate a new PDF page by page (where each page is the desired length according to the requirements outlined above), filtering the extracted XML info with LINQ based on the bounds of each new page, and adding appropriately formatted text at the appropriate location using ColumnText.ShowTextAligned(...).
Problem:
The approach outlined above does fine. It generates PDFs with the desired page structure, but some information is lost in translation, namely colored text and underlined text. While colored text shouldn't be seen in these PDFs, underlined text absolutely must be detected.
This set of requirements should also include PDFs with tables. I originally planned on implementing a different module that adheres to the same interface for table PDFs, as these are generated and used separately from the PDFs generated from RTF, and iText has relatively strong table functionality built in.
The two concerns outlined above, coupled with the fact that my described approach was born out of an attempt to reuse existing code, lead me to believe that an entirely different approach may be necessary, or at least much better. It seems to me that there should be a way to capture content byte info and clip it as necessary to "re-paginate" the input PDF, only worrying about moving content that falls along a page boundary.
Essentially, I am looking for (iText based) recommendations for a better approach. Pseudo-code type answers or simply recommendations for classes / interfaces that may help are acceptable. While it would be nice to handle text and tables together, any advice pertinent to one or the other would also be appreciated. I have perused much of the available documentation on the iText website and other SO questions, but have not found quite what I'm looking for.
Note that no code is included in this question as I am looking for a high-level approach that is entirely different from what I have tried.
Edit:
I didn't notice it before, but the way in which I was reusing fonts (similar to this) resulted in some unexpected (but documented as such) behavior. It seems that I will need to avoid extracting information for re-pagination at the text level, as it will be difficult to ensure continuity of fonts between input and output.
I solved this problem a while ago, but figured I would post my solution. I'm sure it's not the most efficient solution, but it works well for my purposes. Note that this will re-paginate a text-only PDF as described in the question. Table PDFs are handled separately.
The basic process is this:
1. Use a custom TextExtractionStrategy to extract XML containing information regarding ascent and descent lines for all text in the input PDF, as well as what page it originally appears on.
2. Given the page length requirements as described in the question (first page = input value, subsequent = standard length, last page = fit content) and the XML info regarding text positions, determine what content will fit on each page of the output PDF. Create a map of where each input page will need to be cropped (top and bottom; note that each input page may be cropped more than once), as well as a map of which cropped pages will need to be "concatenated" together in the final output.
3. Copy the input PDF page by page to an intermediate temporary PDF (using PdfCopy). If an input page must be cropped more than once (ex: first 2 inches of input page 1 = page 1 output, next 6 inches of input page 1 = page 2 output, final 0.5 inch of input page 1 = top of page 3 output), ensure that it is copied the appropriate number of times (1 time per crop).
4. Crop each page of the intermediate copied PDF appropriately. This is done by modifying the MediaBox and / or CropBox.
5. Concatenate the appropriate cropped pages together into the final output PDF's pages. I used a PdfWriter to first create a new page of the appropriate height, then added each cropped page at the appropriate position in the output PDF page's byte content using contentByte.AddTemplate(inputCroppedPage, 0, bottomOfLastAddedCroppedPage).
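The planning in step 2 is plain arithmetic, so here is a rough sketch of how it could look. This is not my production code and not an iText API: the names Line, Crop, OutputPage and plan_pages are made up for illustration, coordinates are assumed to be in points measured downward from the top of each input page, all input pages are assumed to have the same height, and page margins are ignored. If the whole document fits on one output page, the "last page" rule wins in this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Line:
    page: int      # zero-based input page index
    top: float     # distance from the top of the input page, in points
    bottom: float  # distance from the top of the input page, in points


@dataclass(frozen=True)
class Crop:
    page: int      # input page to crop
    top: float     # crop window, relative to the top of that input page
    bottom: float


@dataclass(frozen=True)
class OutputPage:
    height: float       # height of the output page, in points
    crops: List[Crop]   # slices to stack top to bottom on this output page


def split_at_page_boundaries(start, end, input_height):
    """Cut an absolute range into one Crop per input page it touches."""
    crops = []
    while start < end:
        page = int(start // input_height)
        cut = min(end, (page + 1) * input_height)
        crops.append(Crop(page, start - page * input_height, cut - page * input_height))
        start = cut
    return crops


def plan_pages(lines, input_height, first_height, standard_height, tail_margin):
    """Greedily assign text lines to output pages and derive the crop map."""
    # One absolute axis: y grows downward from the top of input page 0.
    spans = sorted((l.page * input_height + l.top, l.page * input_height + l.bottom)
                   for l in lines)
    if not spans:
        return []
    ranges = []
    start = end = spans[0][0]
    capacity = first_height
    for top, bottom in spans:
        # Break before a line that would overflow, but keep at least one line per page.
        if bottom - start > capacity and end > start:
            ranges.append((start, end, capacity))
            start = end = top
            capacity = standard_height
        end = max(end, bottom)
    # The last page is truncated a fixed margin below the lowest content.
    ranges.append((start, end, end - start + tail_margin))
    return [OutputPage(height, split_at_page_boundaries(a, b, input_height))
            for a, b, height in ranges]
```

Each Crop then corresponds to one copy of an input page in step 3 and one MediaBox / CropBox change in step 4, and each OutputPage lists what gets stacked in step 5.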
To anyone who managed to read and understand all of that, congratulations. To anyone else, please let me know if you are confused. The solution described above is a little twisted and tough to put into words. While there is too much code to post here (and I am not at liberty to share the code on GitHub or similar), I would be happy to answer any questions that will help someone else implement something similar.
The TextExtractionStrategy mentioned in step 1 was inspired by this answer. Essentially, I used System.Xml.Linq to create an XML document rather than concatenating strings to form HTML, and I ignored any font information, storing only information regarding where text is located on the page (you'll see that this information is available in the linked answer, it just isn't written into the final HTML).

Determine the Text that can Display in Multiline PDTextField

Is there a way to determine the text that will actually display in a PDTextField when the PDF prints? If I call setValue and then getValue, it returns all of the text even though it will not all display.
I am trying to fill out a form with a limited size multiline text field that has the notation to attach another page for more details. I would like to limit the text to that which will display and generate the added detail page.
Thanks for indulging a PDFbox newbie.
There is no direct way to find that out, as the details of the text layout such as line breaks, padding and line spacing are hidden inside the non-public class PlainTextFormatter in the org.apache.pdfbox.pdmodel.interactive.form package. So you'd need to replicate that code.
PDFBox tries to resemble the calculations done by Adobe Acrobat and Adobe Reader, but the details of such calculations are not part of the PDF specification. So your calculation is only valid for a similar layout model. Other form filling applications might have a slightly different layout model, and your results will not apply to them.
In addition to that, Acrobat (and PDFBox) place text even though it might be partially clipped. Look at the results of the AlignmentTest.java unit test to see what I mean. So one might have a different expectation of what 'fitting' really means.
As I've thought about passing the information about which text fitted back to the calling application anyway I've opened an enhancement request https://issues.apache.org/jira/browse/PDFBOX-3413 for that.
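Until something like that exists, replicating the layout yourself could start from a sketch like the one below. To be clear, this is not PDFBox code and it does not reproduce PlainTextFormatter: split_for_field is a made-up name, and the average glyph width (avg_char_width) and line height (line_height_factor) are assumptions you would have to replace with real font metrics from the field's default appearance. It also collapses explicit line breaks in the text, which a real multiline field would keep.

```python
import textwrap


def split_for_field(text, field_width, field_height, font_size,
                    avg_char_width=0.5, line_height_factor=1.2):
    """Estimate which part of text is visible in a multiline field.

    Returns (visible, overflow). All sizes are in points. The width model
    (every glyph is avg_char_width * font_size wide) is a crude assumption.
    """
    chars_per_line = max(1, int(field_width // (avg_char_width * font_size)))
    max_lines = int(field_height // (font_size * line_height_factor))
    lines = textwrap.wrap(text, chars_per_line)
    visible = " ".join(lines[:max_lines])
    overflow = " ".join(lines[max_lines:])
    return visible, overflow
```

The visible part would go into setValue and the overflow part onto the attached detail page. Because the estimate is approximate, it is probably safer to be a little pessimistic (a slightly larger avg_char_width) than to risk clipped text.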

Livecycle Designer - Table row overlap issue

I have been looking all around for a solution to my issue, but I can't find a fix yet. Here's my problem:
I have a dynamic PDF which contains a table and several text fields per row that grow vertically as the user adds text (multiple lines and expand to fit vertically). The table properly breaks when the content doesn't fit in the current page. However, I found out that in some scenarios, with a certain number of characters, the row sometimes overlaps the content in the next page (see below).
By adding more text to the offending line, the content in that row properly breaks to the next page (see below).
I am not sure whether or not this is a known issue with the tool or I am missing some sort of configuration/setting in the template. I haven't found anything online or in the Adobe Documentation. Any thoughts?
I am using:
Adobe Acrobat Pro 9
Adobe LiveCycle Designer ES 8.2
The form version of the PDF runs in Adobe Reader 7.0.5 (For compatibility purposes with one of the tools our clients are using)
Thanks in advance
After a long time looking for a solution, I found a single blog post by someone who had the same issue, which Adobe, by the way, did not document. Anyhow, the following two processing instructions need to be added to the XML:
<?layout allowDissonantSplits 1?>
<?layout allowJaggedRowSplits 1?>
My XML looks like this:
<template xmlns="http://www.xfa.org/schema/xfa-template/2.4/">
<?formServer defaultPDFRenderFormat acrobat7.0.5dynamic?>
<?formServer allowRenderCaching 0?>
<?formServer formModel both?>
<?layout allowDissonantSplits 1?>
<?layout allowJaggedRowSplits 1?>
The author specifies that the directives should only be added if this problem occurs. I wonder why these instructions should only be used in this situation. The full blog post can be found here:
http://blogs.adobe.com/dmcmahon/2011/10/10/lc-forms-es-text-overlapping-on-page-break-using-nested-subforms/
Hope this saves someone else some time.

Possible to control PDF layout with iText?

I'm writing some logic to build a large single PDF file that our users can print at their convenience. I'm using Java's iText library (through Clojure's clj-pdf).
I'm trying to have the PDF show the same exact template form on every single page, however I can't seem to find any documentation or indication that one can have PDF content "fit to a page".
The text in these forms varies a little bit, so there's a chance it might require more or fewer text lines per page. This means that the content has a chance of spilling over to the next page, or being too short, making the next page creep up into the previous one, breaking the requirement of "one form per page" for the rest of the document.
I'm trying to figure out if my only option is to manually check the length of the text on each page and potentially crop it by hand if it goes over n lines, or if the PDF format somehow supports a smart way of having paragraphs+tables+headings all fit in one page. Some UI systems allow you to control how spill-over is handled, anywhere from cropping to resizing the font, so I'm curious if PDF supports anything of that sort.
Edit: ended up going with page breaks for simplicity; I wasn't aware of that option when I wrote this question.
If you want to take control over the space taken by text, for instance to fit it on a single page, the way to go would be to create a ColumnText object and to add the content in simulation mode. If the text fits the page, add it for real. If it doesn't, use a smaller font size. This is demonstrated in the MovieAds example where snippets of text are fitted into AcroForm fields.
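The logic of that "simulate, then shrink if needed" loop is sketched below in a library-independent way. This is not the iText API: in iText the measuring is done by ColumnText in simulation mode, whereas lines_needed here is a hypothetical stand-in that assumes every glyph is avg_char_width * font_size wide, and the start size, minimum size, step and leading factor are arbitrary values chosen for the example.

```python
import textwrap


def lines_needed(text, font_size, box_width, avg_char_width=0.5):
    """Stand-in for a real layout simulation: count wrapped lines."""
    chars_per_line = max(1, int(box_width // (avg_char_width * font_size)))
    return len(textwrap.wrap(text, chars_per_line))


def fit_font_size(text, box_width, box_height,
                  start_size=12.0, min_size=6.0, step=0.5, leading_factor=1.2):
    """Return the largest font size at which text fits the box, or None."""
    size = start_size
    while size >= min_size:
        # "Simulation": measure the height without drawing anything.
        height = lines_needed(text, size, box_width) * size * leading_factor
        if height <= box_height:
            return size  # it fits, so now the text can be added for real
        size -= step     # it does not fit, so try a smaller font
    return None          # still too long at min_size: crop or start a new page
```

A None result is the case where shrinking alone is not enough, so the caller has to fall back to truncating the text or to a page break.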

How do I embed a source PDF onto an existing page in a PDF?

I need to programmatically embed an existing PDF (a small graphic) onto a specific page of an existing PDF. Using iTextSharp I've been able to add a new page containing this embedded PDF, but what I need is to modify an existing page by adding this graphic. Is this possible using iTextSharp or any other PDF-generation library?
I tend to do this sort of thing using ConTeXt, which is a TeX-based layout tool that is integrated into the pdftex TeX/MetaPost engine. There's a learning curve involved, and installing ConTeXt isn't entirely trivial, but it makes very general programmatic document processing involving PDFs easy once you get the hang of it.
For this problem, you'd define two overlays. The first overlay is the main PDF, which you set as a background. Then, on the page you want to change, you define a foreground overlay with a \setlayer command containing a single \framed box, which superimposes the second PDF using an \externalfigure command.
The nice thing about Context for this kind of task is that it works with PDF as its internal representation all the way through, so there is no unexpected blow up in file size or deterioration in image quality, which you can get with other tools that convert between formats.