Printing pdf document on paper with predefined layout - pdf

We need to print a PDF document on paper that has predefined fields on it, basically a form whose fields need to be filled in.
We are using iTextSharp to create PDFs, and we use absolute positioning for elements based on the positions of the form fields. For instance, if a field starts 20 mm from the left and 20 mm from the top, I place the data at 21 mm from the left and 21 mm from the top so it fits inside that field. It works well on my printer.
But my question is: can different printers mess up the positioning because of different margins, font sizes, etc.? Maybe it will be the same; I am not aware of what differences different printers can introduce.
Is it important that the user chooses the Actual size option when printing the PDF?
I need to know what difficulties I can expect; better to know now than to wait for customers to call once this is in production.

The problem you anticipate exists. It can be avoided by setting a viewer preference.
See How to prevent the resizing of pages in PDF?
You have to set the print scaling to none:
writer.addViewerPreference(PdfName.PRINTSCALING, PdfName.NONE);
That's the line you'll need if you are using iText 5 (writer is an instance of PdfWriter). If you are using iText 7, you can define the viewer preferences like this:
PdfDocument pdf = new PdfDocument(new PdfWriter(dest));
PdfViewerPreferences preferences = new PdfViewerPreferences();
preferences.setPrintScaling(PdfViewerPreferencesConstants.NONE);
pdf.getCatalog().setViewerPreferences(preferences);
See Handling events; setting viewer preferences and printer properties.
Of course, end users can always overrule the print scaling in their PDF viewer, but that's their responsibility, not yours.
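The absolute positioning described in the question depends on converting millimetre offsets on the paper form into PDF user-space units (points, 1/72 inch), with the origin normally at the bottom-left of the page rather than the top-left. The following is a minimal sketch of that conversion in plain Java; the class and method names are hypothetical and not part of iText, and it assumes an unrotated page whose media box starts at (0, 0).

```java
// Sketch only: FormCoordinates and its methods are illustrative names, not an iText API.
final class FormCoordinates {
    static final double POINTS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    /** Converts millimetres to PDF points (1 point = 1/72 inch). */
    static double mmToPt(double mm) {
        return mm * POINTS_PER_INCH / MM_PER_INCH;
    }

    /** X coordinate in points for an offset measured from the left paper edge. */
    static double xFromLeft(double leftMm) {
        return mmToPt(leftMm);
    }

    /**
     * Y coordinate in points for an offset measured from the top paper edge.
     * The PDF y axis points upwards, so the offset is subtracted from the page height.
     */
    static double yFromTop(double pageHeightMm, double topMm) {
        return mmToPt(pageHeightMm - topMm);
    }

    public static void main(String[] args) {
        double a4HeightMm = 297.0; // A4 portrait, assumed for this example
        System.out.println("x = " + xFromLeft(21.0) + " pt");
        System.out.println("y = " + yFromTop(a4HeightMm, 21.0) + " pt");
    }
}
```

The y value computed here marks the top edge of the target position; text is placed relative to its baseline, so the font size usually has to be subtracted as well. These coordinates only match the paper if the page is printed without scaling.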

Related

Creating a custom diploma from a designer template, filling in data

I need to be able to create PDFs with printable diplomas for sports events (10K runs etc).
A graphic designer creates a beautiful diploma with placeholder texts (name of participant, finish time) - and I need to get from that to a finished PDF (on the fly), which the participant can download.
What output should I get from the designer (file format, prepared in any special way)?
How do I take that file, fill in data and generate PDF?
How can this be accomplished using IText?
I have done a lot of generating PDFs from HTML and Word docs, but this is something new to me, so I can't figure out where to start.
My best idea right now is to have the designer export as PDF without placeholder text, but with x/y coordinates and font on where to input name, time etc... But I would prefer to not have to store the x/y coordinates, font etc - and just be able to fill in a "template"...
There are several possibilities, e.g.:
1) Let your designer create a diploma PDF.
2) Add form fields at the places where you want to add the name, event, etc. This can be done by you, by the designer, or with a PDF library such as OpenPDF / PDFBox / iText.
3) Fill in the data using OpenPDF / PDFBox / iText and afterwards make the fields read-only.
4) You could even sign the PDF afterwards (and thus "protect" it from changes).
OR
2a) You could also add text to an existing PDF, but this is a bit trickier since you need to know the coordinates and take care of length issues, etc.

How is hidden text stored in OCR-enhanced PDF files

// EDIT 26.03.2018 - Who wants to continue my work can have a look on my source-files https://github.com/n0l0cale/ocr-sampledata
I'm actually looking for some details about PDF Files. It's most important for me that the files will be usable for a very long time and if possible the OCR should be automatically applied for new files (which seems to be not really possible with Adobe Acrobat...).
For that I've been looking at different solutions for OCRing my PDF files. I found three candidates which seem to do what they should (more or less). All three variants have their pros and cons, but they also seem to take different approaches to storing the data in the PDF file. Let me explain:
a File OCRed with Adobe Acrobat:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ACROBAT.pdf
results in a file that Acrobat is able to open in one step (no preloading of any background layer) and after a preflight-script I'm able to see the text which is stored hidden:
a File OCRed with Abby Finereader:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ABBY.pdf
does not seem suitable for the default adobe preflight-script as it does not display any additional layers:
But as far as I was able to reproduce, these files seem to have a background text layer, which contains the OCRed text and lies underneath the image that is finally shown to the user. Unfortunately this layer seems to be loaded separately, which is confusing when opening the file with Adobe Acrobat...
a File OCRed with Tesseract 4 (Alpha):
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_TESSERACT_oem2.pdf
is also doing some weird magic with the hidden text part:
But in all three cases I'm able to search for words in the files and see the text using "Remove hidden information" and selecting "hidden text":
I'm seriously confused.... Does anyone know how these programs are storing their hidden text information really?
S.
P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/
Does anyone know how these programs are storing their hidden text information really?
You correctly have found out that the approach of Abby Finereader is different from that of Adobe Acrobat and of Tesseract:
Abby creates a page content stream in which the text is first drawn normally on the page and then covered by the scanned image.
Acrobat and Tesseract create content streams in which first the image is drawn and then the text is drawn invisibly (using text rendering mode 3 which draws nothing).
The difference between the latter two results is the choice of font used:
Acrobat uses regular standard 14 fonts for which a PDF viewer has a font program to render them as normal glyphs.
Tesseract uses a font named GlyphLessFont, for which it embeds a font program in the result file. When rendered, the glyphs in this font do not show as our normal Latin glyphs but merely as empty space.
Considering the visual effect you observed for the Abby result, the approach used by Acrobat or Tesseract might be preferable.
Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract), is mostly a mere matter of taste. They are used only in the invisible rendering mode anyways.
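To make the two drawing orders concrete, here is a rough sketch in Java that merely assembles simplified content stream fragments as strings. The operators (q, cm, Do, Q, BT, Tr, Tf, Td, Tj, ET) are standard PDF operators, but the resource names /Im0 and /F1, the coordinates, and the class itself are assumptions for illustration; real files produced by these tools are considerably more involved.

```java
// Sketch only: builds illustrative content stream text, not a complete PDF.
final class HiddenTextStreams {

    /** Image first, then text in rendering mode 3 (invisible): the Acrobat/Tesseract order. */
    static String imageThenInvisibleText(String text) {
        return drawImage()
                + "BT\n"
                + "3 Tr\n"            // text rendering mode 3: neither fill nor stroke
                + "/F1 10 Tf\n"       // hypothetical font resource name
                + "72 700 Td\n"
                + "(" + escape(text) + ") Tj\n"
                + "ET\n";
    }

    /** Visible text first, then the scanned image painted over it: the order described for Abby. */
    static String textThenCoveringImage(String text) {
        return "BT\n"
                + "/F1 10 Tf\n"
                + "72 700 Td\n"
                + "(" + escape(text) + ") Tj\n"
                + "ET\n"
                + drawImage();
    }

    /** Scales the unit image square to an A4-sized page (595 x 842 points) and paints it. */
    private static String drawImage() {
        return "q\n595 0 0 842 0 0 cm\n/Im0 Do\nQ\n"; // /Im0 is a hypothetical image resource name
    }

    /** Escapes backslashes and parentheses for a PDF literal string. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)");
    }

    public static void main(String[] args) {
        System.out.println(imageThenInvisibleText("Hello"));
        System.out.println(textThenCoveringImage("Hello"));
    }
}
```

In both variants the text remains in the content stream, which is why searching works in all three sample files; only the mechanism that keeps it from being seen differs.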

Determine the Text that can Display in Multiline PDTextField

Is there a way to determine the text that will actually display in a PDTextField when the PDF prints? If I call setValue and then getValue, it returns all of the text even though it will not all display.
I am trying to fill out a form with a limited size multiline text field that has the notation to attach another page for more details. I would like to limit the text to that which will display and generate the added detail page.
Thanks for indulging a PDFbox newbie.
There is no direct way to find that out, as the details of the text layout such as line breaks, padding and line spacing are hidden inside the non-public class PlainTextFormatter in the org.apache.pdfbox.pdmodel.interactive.form package. So you'd need to replicate that code.
PDFBox tries to resemble the calculations done by Adobe Acrobat and Adobe Reader but the details of such calculations are not part of the PDF specification. So doing your calculation is only valid for a similar layout model. Other form filling applications might have a slightly different layout model and as a result your results will not apply to these.
In addition to that, Acrobat (and PDFBox) place text even though it might be partially clipped. Look at the results of the AlignmentTest.java unit test to see what I mean. So one might have a different expectation of what 'fitting' really means.
As I've thought about passing the information about which text fitted back to the calling application anyway I've opened an enhancement request https://issues.apache.org/jira/browse/PDFBOX-3413 for that.
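Until something like that exists, replicating the layout means doing your own line breaking and cutting the text where the field runs out of lines. The following is a deliberately crude sketch of that idea in plain Java: it is not PDFBox's PlainTextFormatter, and it assumes you already know how many characters fit on a line and how many lines fit in the field (both hypothetical parameters that a real implementation would derive from font metrics, padding and the field rectangle).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: a stand-in layout model, not the PDFBox or Acrobat algorithm.
final class FieldTextSplitter {

    /** The lines assumed to be visible in the field, and the text left over. */
    static final class Split {
        final List<String> visibleLines;
        final String overflow;

        Split(List<String> visibleLines, String overflow) {
            this.visibleLines = visibleLines;
            this.overflow = overflow;
        }
    }

    /** Greedy word wrap that stops once maxLines lines have been filled. */
    static Split split(String text, int maxCharsPerLine, int maxLines) {
        if (maxLines < 1) {
            throw new IllegalArgumentException("maxLines must be at least 1");
        }
        String trimmed = text.trim();
        String[] words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int next = 0; // index of the first word not yet placed on a line
        while (next < words.length) {
            String word = words[next];
            // A word on an empty line is always accepted, even if it is too long (it would be clipped).
            boolean fitsOnCurrent = current.length() == 0
                    || current.length() + 1 + word.length() <= maxCharsPerLine;
            if (fitsOnCurrent) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(word);
                next++;
            } else {
                lines.add(current.toString());
                current.setLength(0);
                if (lines.size() == maxLines) {
                    break; // the field is full; everything from 'next' onwards overflows
                }
            }
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        String overflow = String.join(" ", Arrays.asList(words).subList(next, words.length));
        return new Split(lines, overflow);
    }

    public static void main(String[] args) {
        Split s = split("aaa bbb ccc ddd", 7, 1);
        System.out.println(s.visibleLines + " | " + s.overflow);
    }
}
```

The visible part could be set as the field value and the overflow written to the attached detail page. Because viewers may lay text out differently, it seems safer to keep the limits conservative.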

Cropping a region from a PDF page with PDFBox

I am trying to crop a region out of a PDF page programmatically. Specifically, my input is going to be a single page PDF and a bounding box on the page. Output is going to be a PDF that contains the characters, graphics paths and images from the original PDF, and it should look like the original PDF. In other words, I want a function that is similar to cropping a region out of an image, but with PDFs.
Three questions:
Is it at all possible to do? From my knowledge of PDFs, it seems possible. But I'm no expert, so I would like to know first if there are some things I'm missing here.
Is there any open source software for this?
Can PDFBox do this currently? I couldn't find such a functionality but I might have missed it. Does anybody know of any attempt of doing this?
1- Yes, this is called the crop box.
2- Yes, e.g. PDFBox.
3- Yes, just open a PDF, set a crop box, and save it:
PDDocument doc = PDDocument.load(new File(...));
PDPage page = doc.getPage(0);
page.setCropBox(new PDRectangle(20, 20, 200, 400));
doc.save(...);
doc.close();
The numbers in PDRectangle are user space units. 1 unit = 1/72 inches.
Note that the contents outside the cropbox are not gone, they are just hidden.

Possible to control PDF layout with iText?

I'm writing some logic to build a large single PDF file that our users can print at their convenience. I'm using Java's iText library (through Clojure's clj-pdf).
I'm trying to have the PDF show the same exact template form on every single page, however I can't seem to find any documentation or indication that one can have PDF content "fit to a page".
The text in these forms varies a little bit, so there's a chance it might require more or fewer text lines per page. This means that the content has a chance of spilling over to the next page, or being too short, making the next page creep up into the previous one, breaking the requirement of "one form per page" for the rest of the document.
I'm trying to figure out if my only option is to manually check the length of the text on each page and potentially crop it by hand if it goes over n lines, or if the PDF format somehow supports a smart way of having paragraphs+tables+headings all fit in one page. Some UI systems allow you to control how spill-over is handled, anywhere from cropping to resizing the font, so I'm curious if PDF supports anything of that sort.
Edit: ended up going with pagebreaks for simplicity, wasn't aware of that option when I wrote this question.
If you want to take control over the space taken by text, for instance to fit it on a single page, the way to go would be to create a ColumnText object and to add the content in simulation mode. If the text fits the page, add it for real. If it doesn't, use a smaller font size. This is demonstrated in the MovieAds example where snippets of text are fitted into AcroForm fields.
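The "simulate first, then add for real" loop can be sketched independently of iText. The example below is plain Java and does not use ColumnText; in place of iText's simulation mode it uses a rough, assumed layout model (average character width of 0.5 times the font size, line height of 1.2 times the font size), so the names and factors are purely illustrative.

```java
// Sketch only: a stand-in for a simulated layout pass, not iText's ColumnText.
final class FontSizeFitter {
    static final double AVG_CHAR_WIDTH_FACTOR = 0.5; // assumption: average glyph width / font size
    static final double LINE_HEIGHT_FACTOR = 1.2;    // assumption: leading / font size

    /** "Simulation": reports whether the text would fit the box at the given font size. */
    static boolean fits(String text, double fontSize, double boxWidth, double boxHeight) {
        int charsPerLine = (int) Math.floor(boxWidth / (fontSize * AVG_CHAR_WIDTH_FACTOR));
        if (charsPerLine < 1) {
            return false;
        }
        int lines = countLines(text, charsPerLine);
        return lines * fontSize * LINE_HEIGHT_FACTOR <= boxHeight;
    }

    /** Counts the lines a greedy word wrap would produce. */
    static int countLines(String text, int charsPerLine) {
        int lines = 0;
        int used = 0; // characters used on the current line
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (lines == 0) {
                lines = 1;
                used = word.length();
            } else if (used + 1 + word.length() <= charsPerLine) {
                used += 1 + word.length();
            } else {
                lines++;
                used = word.length();
            }
        }
        return lines;
    }

    /** Tries sizes from maxSize down to minSize; returns the first that fits, or -1 if none does. */
    static double largestFittingSize(String text, double boxWidth, double boxHeight,
                                     double maxSize, double minSize, double step) {
        for (double size = maxSize; size >= minSize; size -= step) {
            if (fits(text, size, boxWidth, boxHeight)) {
                return size;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "aaaa bbbb cccc dddd";
        System.out.println(largestFittingSize(text, 100, 20, 12, 6, 1));
    }
}
```

With iText 5, the fits check would instead be a simulated ColumnText pass (go(true)) whose status is inspected before the content is added for real; the surrounding loop stays the same.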