Determine the Text that can Display in Multiline PDTextField - pdfbox

Is there a way to determine the text that will actually display in a PDTextField when the PDF prints? If I call setValue and then getValue, it returns all of the text even though it will not all display.
I am trying to fill out a form with a limited size multiline text field that has the notation to attach another page for more details. I would like to limit the text to that which will display and generate the added detail page.
Thanks for indulging a PDFbox newbie.

There is no direct way to find that out as the details of the text layout such as line breaks, padding, line spacing are hidden inside the non public class PlainTextFormatter inside the org.apache.pdfbox.pdmodel.interactive.formpackage. So you'd need to replicate that code.
PDFBox tries to resemble the calculations done by Adobe Acrobat and Adobe Reader but the details of such calculations are not part of the PDF specification. So doing your calculation is only valid for a similar layout model. Other form filling applications might have a slightly different layout model and as a result your results will not apply to these.
In addition to that Acrobat (and PDFBox) place text although it might be partially clipped. Look at the results of the AlignmentTest.javaunit test to see what I mean. So one might have a different expectation to what 'fitting' really means.
As I've thought about passing the information about which text fitted back to the calling application anyway I've opened an enhancement request https://issues.apache.org/jira/browse/PDFBOX-3413 for that.

Related

"Re-paginate" PDF using iText

Disclaimer:
I am using iText 5. I know this is generally frowned upon (vs. using iText 7), but I am working with considerable legacy code that uses iText 5, and upgrading does not fall under my control.
Requirements:
A "simple" PDF/A is received as input (text only, these are generated from RTF), as well as a float value corresponding to a desired first page length in inches.
A PDF/A must be output that is identical to the input PDF, except it is paginated as follows: first page length = input value; each subsequent (not first or last) page will fill a standard page length; the last page will be truncated a constant number of points below the content nearest the bottom of the page. Note that input and output width will be identical and constant.
Progress / Approach:
I have extended the SimpleTextExtractionStrategy to generate XML containing font information (size and family, bold or italics, etc.) as well as location information (relative an absolute coordinate system where the origin is at the top left corner of the first page of the input PDF) for each "span" of text extracted from the input PDF.
I then generate a new PDF page by page (where each page is the desired length according to the requirements outlined above), filtering the extracted XML info with LINQ based on the bounds of each new page, and adding appropriately formatted text at the appropriate location using ColumnText.ShowTextAligned(...).
Problem:
The approach outlined above does fine. It generates PDFs with the desired page structure, but some information is lost in translation, namely colored text and underlined text. While colored text shouldn't be seen in these PDFs, underlined text absolutely must be detected.
This set of requirements should also include PDFs with tables. I originally planned on implementing a different module that adheres to the same interface for table PDFs, as these are generated and used separately from the PDFs generated from RTF, and iText has relatively strong table functionality built in.
The two concerns outlined above, coupled with the fact that my described approach was born out of an attempt to reuse existing code leads me to believe that an entirely different approach may be necessary or at least much better. It seems to me that there should be a way to capture content byte info and clip it as necessary to "re-paginate" the input PDF, only worrying about moving content that falls along a page boundary.
Essentially, I am looking for (iText based) recommendations for a better approach. Pseudo-code type answers or simply recommendations for classes / interfaces that may help are acceptable. While it would be nice to handle text and tables together, any advice pertinent to one or the other would also be appreciated. I have perused much of the available documentation on the iText website and other SO questions, but have not found quite what I'm looking for.
Note that no code is included in this question as I am looking for a high-level approach that is entirely different from what I have tried.
Edit:
I didn't notice it before, but the way in which I was reusing fonts (similar to this) resulted in some unexpected (but documented as such) behavior. It seems that I will need to avoid extracting information for re pagination at the text level, as it will be difficult to ensure continuity of fonts between input and output.
I solved this problem a while ago, but figured I would post my solution. I'm sure it's not the most efficient solution, but it works well for my purposes. Note that this will re-paginate a PDF as described in the question containing text only. Table PDF's are handled separately.
The basic process is this:
Use a custom TextExtractionStrategy to extract XML containing information regarding ascent and descent lines for all text in the input PDF, as well as what page it originally appears on.
Given the page length requirements as described in the question (first page = input value, subsequent = standard length, last page = fit content) and the XML info regarding text positions, determine what content will fit on each page of the output PDF. Create a map of where each input page will need to be cropped (top and bottom, note that each input page may be cropped more than once), as well as a map of which cropped pages will need to be "concatenated" together in the final output.
Copy the input PDF page by page to an intermediate temporary PDF (using PdfCopier). If an input page must be cropped more than once (ex: first 2 inches of input page 1 = page 1 output, next 6 inches of input page 1 = page 2 output, final 0.5 inch of input page 1 = top of page 3 output), ensure that it is copied the appropriate number of times (1 time per crop).
Crop each page of the intermediate copied PDF appropriately. This is done by modifying the MediaBox and / or CropBox.
Concatenate the appropriate cropped pages together into the final output PDF's pages. I used a PdfWriter to first create a new page of the appropriate height, then add each appropriate cropped page at the appropriate position in the output PDF page's byte content usingcontentByte.AddTemplate(inputCroppedPage, 0, bottomOfLastAddedCroppedPage).
To anyone who managed to read and understand all of that, congratulations. To anyone else, please let me know what you if you are confused. The solution described above is a little twisted and tough to put into words. While there is too much code to post here (and I am not at liberty to share the code on GitHub or similar), I would be happy to answer any questions that will help someone else implement something similar.
The TextExtractionStrategy mentioned in step 1 was inspired by this answer. Essentially, I used System.Xml.Linq to create an XML document rather than concatentating strings to form HTML, and I ignored any font information, storing only information regarding where text is located in the page (you'll see that this information is available in the linked answer, just isn't written into the final HTML).

Forcing screen-reader read alt-text on a text element when creating a PDF in iText7

Is there any way to add alt text to a text element in iText? I have seen there is a way to do it for images. Basically, I would like the screen reader to read something besides the actual text that is being displayed. There are two situations in my document that I would need to do this.
One is when the screen-reader is reading an acronym I would like the alt-text to force the screen reader to read each letter instead of trying to read a word. (ie read DIET as D-I-E-T instead of diet)
The second is when it is reading a phone number I would like it to read outloud "phone" before the number. In the document it is currently just the number which would be a little confusing for disabled users. I am unable to actually change the layout to include the word "phone" for non-technical reasons.
There is a method for that.
new Paragraph("Lorem").getAccessibilityProperties().setActualText("Ipsum")
You can call this method on every class that implements IAccessibleElement.

PDFtk and number formatted PDF form field

I'm using pdftk to fill in the form with the generated fdf.
In the PDF form, the form field is configured as a number field with 2 decimal point, and the negative value will be showing with parentheses, for example, -4444 will become (4444.00)
Using any PDF viewers and changing the value on form did make the form display the value correctly with the behaviour explained above (negative value will become value with parentheses)
Tested also with the FDF (by importing to the form), the negative value will be displayed correctly as well.
But when using the pdftk fill-form action, the negative value remains as it is without changing the display, which is still showing -4444 and not (4444.00)
Is anyone experienced this before / has a solution for this?
Update #1
I've also tested Apache PDFBox, it has the same issue :(
And now I'm trying to achieve this by using the PDF's javascript, any clue that this way will works?
Update #2
came across this thread How to refresh Formatting on Non-Calculated field and refresh Calculated fields in Fillable PDF form and so gave it a try with iText as well. However still unable to make it works
Finally, i've found some ways to get the formatting works in the PDF form fields.
Approach #1
Requirements
pdftk
PDF form with javascript
In the PDF, create a "Document Javascript" (see how) and re-assign the form's value to make it dirty as mentioned by Denis in the thread of Update #2. The script could be as simple as below:
var text1 = this.getField('text1');
text1.value = text1.value;
Downside
Javascript will only be triggered when you open the PDF, and if you would like to set the PDF with some ownerPassword to prevent user edit the file, the javascript will just failed because of read-only form fields. Otherwise, imagine you have 100 form fields in the PDF, re-assign each of them is a nightmare.
Approach #2
iText
PDF form
I personally will prefer this approach. The powerful iText already has the API to set the form field and at the same time formatting the field display (see how). The generated PDF is ready to print as well with the correct format.
Downside
Either using the iText API to find out the existing format of the form field, or you will hard code the format in your codes. Require more efforts than using the pdftk.
I'm using a simpler version of ChinKang's #1.
In Tools -> Document Javascripts:
this.calculateNow();
One thing to check is in Tools -> Set field calculation order that any calculations are ordered appropriately.
I then use https://github.com/ccnmtl/fdfgen and pdftk to fill the forms in.
The only gotcha is that you cannot flatten the PDF

Possible to control PDF layout with iText?

I'm writing some logic to build a large single PDF file that our users can print at their convenience. I'm using Java's iText library (through Clojure's clj-pdf).
I'm trying to have the PDF show the same exact template form on every single page, however I can't seem to find any documentation or indication that one can have PDF content "fit to a page".
The text in these forms varies a little bit, so there's a chance it might require more of fewer text lines per page. This means that the content has a chance of spilling over to the next page, or being too short, making the next page creep up into the previous one, breaking the requirement of "one form per page" for the rest of the document.
I'm trying to figure out if my option is pretty much only to manually check the length of the text on each page and potentially crop it by hand if I goes over n lines, or if the PDF format somehow supports a smart way of having paragraphs+tables+headings all fit in one page. Some UI systems allow you to control how spill-over is handled, anywhere from cropping to resizing the font, so I'm curious if PDF supports anything of that sort.
Edit: ended up going with pagebreaks for simplicity, wasn't aware of that option when I wrote this question.
If you want to take control over the space taken by text, for instance to fit it on a single page, the way to go would be to create a ColumnText object and to add the content in simulation mode. If the text fits the page, add it for real. If it doesn't, use a smaller font size. This is demonstrated in the MovieAds example where snippets of text are fitted into AcroForm fields.

Modify character spacing in a PDF form field

I'm trying to build a web app to programmatically fill out a PDF form. I am going to configure my form first in Adobe Acrobat, then write a Java app with iText to fill out all the form fields via user input from the web. The base form I need to fill out comes from the US government. They created form fields with extremely large kerning (character spacing) values I need to change. However, there appears to be no way to modify this value in the Acrobat UI.
Does anyone know how to manipulate character spacing on form fields in Acrobat 8.0 for Windows? I could try to use iText to programmatically manipulate the kerning of the original document, but this would be much more tedious.
I believe I figured this out: kerning is called "combing" in acrobat, and each of the form fields have been "combed". The strange thing is this option isn't checked when I view the properties of the form field, but "combing" is the behaviour I was attempting to replicate.