PDFBox: Fill out a PDF with adding repeatively a one-page template containing a form

PDFBox: Fill out a PDF with adding repeatively a one-page template containing a form - pdfbox

Following SO question Java pdfBox: Fill out pdf form, append it to pddocument, and repeat I had trouble appending a cloned page to a new PDF.
Code from this page seemed really interesting, but didn't work for me.
Actually, the answer doesn't work because this is the same PDField you always modify and add to the list. So the next time you call 'getField' with initial name, it won't find it and you get an NPE. I tried with the same pdfbox version used (1.8.12) in the nice github project, but can't understand how he gets this working.
I had the same issue today trying to append a form on pages with different values in it. I was wondering if the solution was not to duplicate field, but can't succeed to do it properly. I always end with a PDF containing same values for each form.
(I provided a link to the template document for Mkl, but now I removed it because it doesn't belong to me)
Edit: Following Mkl's advices, I figured it out what I was missing, but performances are really bad with duplicating every pages. File size isn't satisfying. Maybe there's a way to optimize this, reusing similar parts in the PDF.

Finally I got it working without reloading the template each time. So the resulting file is as I wanted: not too big (4Mb for 164 pages).
I think I did 2 mistakes before: one on page creation, and probably one on field duplication.
So here is the working code, if someone happens to be stuck on the same problem.
Form creation:
PDAcroForm finalForm = new PDAcroForm(finalDoc, new COSDictionary());
finalForm.setDefaultResources(originForm.getDefaultResources())
Page creation:
PDPage clonedPage = templateDocument.getPage(0);
COSDictionary clonedDict = new COSDictionary(clonedPage.getCOSObject());
clonedDict.removeItem(COSName.ANNOTS);
clonedPage = new PDPage(clonedDict);
finalDoc.addPage(clonedPage);
Field duplication: (rename field to become unique and set value)
PDTextField field = (PDTextField) originForm.getField(fieldName);
PDPage page = finalDoc.getPages().get(nPage);
PDTextField clonedField = new PDTextField(finalForm);
List<PDAnnotationWidget> widgetList = new ArrayList<>();
for (PDAnnotationWidget paw : field.getWidgets()) {
PDAnnotationWidget newWidget = new PDAnnotationWidget();
newWidget.getCOSObject().setString(COSName.DA, paw.getCOSObject().getString(COSName.DA));
newWidget.setRectangle(paw.getRectangle());
widgetList.add(newWidget);
}
clonedField.setQ(field.getQ()); // To get text centered
clonedField.setWidgets(widgetList);
clonedField.setValue(value);
clonedField.setPartialName(fieldName + cnt++);
fields.add(clonedField);
page.getAnnotations().addAll(clonedField.getWidgets());
And at the end of the process:
finalDoc.getDocumentCatalog().setAcroForm(finalForm);
finalForm.setFields(fields);
finalForm.flatten();

Related

After content stream recreation (using PDFbox) getting error in axesPDF (insert spaces)?

when I am recreating a pdf after some changes , if I use the output pdf in axesPDF to fix spaces issue I am getting "ERROR". The only difference I observed my input pdf have Array of tokens but my recreated pdf has Dictionary of elements. As shown in below. Does that cause the problem? How can I recreate similar structure? (Left one is input pdf right one output pdf after editing)
Input pdf
output pdf with changes
The code I am using to save the pdf is
PDStream newContents = new PDStream(document);
OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
ContentStreamWriter writer = new ContentStreamWriter(out);
writer.writeTokens(tokens);
out.close();
document.getPage(pg_ind).setContents(newContents);
newPDF.addPage(document.getPage(pg_ind));
newPDF.save()
Please help me on this. Thanks in advance.
Updating question along with error.
Another Input file
The error is
The button I used.
I am wondering this time even the content stream is in COSDictionary format it's giving error. Something else causing this.

Can PDF notes/annotations include links to other pages in the PDF?

I want to add an annotation to a PDF page (i.e. something that would show as a pop-up note or appear in the list of notes for the current page).
And in that note, I want to say "See page 93", where clicking on that takes the user to page 93.
Is that possible? It seems like a useful feature, but I haven't been able to find any examples.
And if so, can it be done with Apache PDF Box?

Yes (it is possible) and yes (it can be done with PdfBox). That question has been asked before and answered several times. Read the follwing answer here and the see the full code here.
try ( InputStream resource = getClass().getResourceAsStream("some.pdf")) {
PDDocument document = Loader.loadPDF(resource);
PDPage page = document.getPage(1);
PDAnnotationLink link = new PDAnnotationLink();
PDPageDestination destination = new PDPageFitWidthDestination();
PDActionGoTo action = new PDActionGoTo();
//destination.setPageNumber(2);
destination.setPage(document.getPage(2));
action.setDestination(destination);
link.setAction(action);
link.setPage(page);
link.setRectangle(page.getMediaBox());
page.getAnnotations().add(link);
document.save(new File("RESULT_FOLDER", "output-with-link.pdf"));
}
Other answers are here and here.

iText 7 need to skip reading page header elements

I am using EventHandler to create page header for my pdf. The content of the header are added into a Table before adding to Canvas. As part of 508 compliance, i need to exclude the header content from being read out loud. How do i accomplice this?
public class TEirHeaderEventHandler : IEventHandler
{
public void HandleEvent(Event e)
{
PdfDocumentEvent docEvent = (PdfDocumentEvent)e;
PdfDocument pdf = docEvent.GetDocument();
PdfPage page = docEvent.GetPage();
PdfCanvas headerPdfCanvas = new PdfCanvas(page.NewContentStreamBefore(), page.GetResources(), pdf);
Rectangle headerRect = new Rectangle(60, 725, 495, 96);
Canvas headerCanvas = new Canvas(headerPdfCanvas, pdf, headerRect);
//creating content for header
CreateHeaderContent(headerCanvas);
headerCanvas.Close();
}
private void CreateHeaderContent(Canvas canvas)
{
//Create header content
Table table = new Table(UnitValue.CreatePercentArray(new float[] { 60, 25, 15 } ));
table.SetWidth(UnitValue.CreatePercentValue(100));
Cell cell1 = new Cell().Add(new Paragraph("Establishment Inspection Report").SetBold().SetTextAlignment(TextAlignment.LEFT));
cell1.SetBorder(Border.NO_BORDER);
table.AddCell(cell1);
Cell cell2 = new Cell().Add(new Paragraph("FEI Number:").SetBold().SetTextAlignment(TextAlignment.RIGHT));
cell2.SetBorder(Border.NO_BORDER);
table.AddCell(cell2);
Cell cell3 = new Cell().Add(new Paragraph(_feiNum).SetBold().SetTextAlignment(TextAlignment.RIGHT));
cell3.SetBorder(Border.NO_BORDER);
table.AddCell(cell3);
canvas.Add(table);
}
}
public static void CreatePdf()
{
using (MemoryStream writeStream = new MemoryStream())
using (FileStream inputHtmlStream = File.OpenRead(inputHtmlFile))
{
PdfDocument pdf = new PdfDocument(new PdfWriter(writeStream));
pdf.SetTagged();
iTextDocument document = new iTextDocument(pdf);
TEirHeaderEventHandler teirEvent = new TEirHeaderEventHandler();
pdf.AddEventHandler(PdfDocumentEvent.START_PAGE, teirEvent);
//Convert html to pdf
HtmlConverter.ConvertToDocument(inputHtmlStream, pdf, properties);
document.Close();
byte[] bytes = TEirReorderingPages(writeStream, numOfPages);
File.WriteAllBytes(outputPdfFile, bytes);
}
}
Note that i have set the document to be tagged. but i still get the "Reading Untagged Document" screen when i open the file. However, all of the content are read including the header when i activate the Read Out Loud feature. Any input or suggestion would be appreciated. Thank you in advance for your help.

General
The approach suggested by Alexey Subach is generally correct. You mark the content as artifact to differentiate it from real content.
element.getAccessibilityProperties().setRole(StandardRoles.ARTIFACT);
This marks the content in the content stream and it excludes the element from the structure tree.
Your case
However, your specific case is more nuanced.
For a well tagged PDF document, the proper way to read it out loud is to process the structure tree, which is a data structure that represents the logical reading order of the (semantic) elements of the document, such as paragraphs, tables and lists.
Because of the way you are creating the header content, it is not automatically tagged: a Canvas instance that is created from a PdfCanvas instance has autotagging disabled by default. So the table in the header is not marked in the content stream and it is not included in the structure tree. Marking it explicitly as an artifact, with the approach described above in General, should not make a significant difference because it was not in the structure tree to begin with.
If you enable autotagging by adding headerCanvas.enableAutoTagging(page), you will notice that the table does appear in the structure tree.
If you then add table.getAccessibilityProperties().setRole(StandardRoles.ARTIFACT), the table is excluded from the structure tree again.
Summary: looking at the structure tree, there's no difference between your original code and the approach of General.
Adobe reading order / accessibility settings
From your description, I think you are using Adobe Acrobat or Reader for the read out loud functionality. Under Preferences > Reading > Reading Order Options, you can configure how the content should be processed for the read out loud feature:
From https://helpx.adobe.com/reader/using/accessibility-features.html:
Infer Reading Order From Document (Recommended): Interprets the reading order of untagged documents by using an advanced method of structure inference layout analysis.
Left-To-Right, Top-To-Bottom Reading Order: Delivers the text according to its placement on the page, reading from left to right and then top to bottom. This method is faster than Infer Reading Order From Document. This method analyzes text only; form fields are ignored and tables aren’t recognized as such.
Override The Reading Order In Tagged Documents: Uses the reading order specified in the Reading preferences instead what the tag structure of the document specifies. Use this preference only when you encounter problems in poorly tagged PDFs.
In my tests, the only way I can make Adobe Reader read out loud the header content created with your original code, is when I select Left-To-Right, Top-To-Bottom Reading Order and enable Override The Reading Order In Tagged Documents. In that case, it is basically ignoring the tagging and just processing the content per the location on the page.
With Override The Reading Order In Tagged Documents disabled, the header content is not read, for your original code and with explicit artifacts.
Conclusion
Although it's a good idea to always tag artifacts as such, so they can be properly differentiated from real content, in this case I believe the behaviour you're experiencing is more related to application configuration than to file structure.

Headers and footers are typically pagination artifacts and should be marked as such in the following way:
table.getAccessibilityProperties().setRole(StandardRoles.ARTIFACT);
This will exclude the table from being read. Please note that you can mark any element implementing IAccessibleElement interface as artifact.

Word to HTML fields in header and footer

I'm using docx4j to convert a Word template to several HTML files, one per chapter.
The Word template has several custom properties mapped by several fields (DOCPROPERTY ...) represented as both simple and complex fields. I populate those properties to obtain Freemarker code when the word document is converted to HTML (like ${...} or [#... /] directives).
In a later step I look for "heading 1" paragraphs to identify chapters and then split the document in several Word documents before conversion, then these documents are converted to HTML and written to temporary files.
Each document is successfully converted to HTML and fields are correctly replaced with my markers, but it behaves wrong when it writes header and footer parts: field codes are written before field values (eg. DOCPROPERTY "PROPERTY_NAME" \* MERGEFORMAT ${constants['PROPERTY_NAME']} ) instead of field values only (eg. ${constants['PROPERTY_NAME']} ).
If I write the updated document to a docx file instead, nothing seems wrong into the generated document.
If it's useful to solve the problem, this is what I do to split the document (per chapter):
clone the updated WordprocessingMLPackage (clone method)
delete every root element before the chapter's "heading 1" element
delete every root element from the "heading 1" element of the next chapter
convert the cloned and cleaned document
(actually I don't use the clone method every time, but I write the updated document to a ByteArrayOutputStream and then read it for every chapter, inspired by the source of the clone method).
I suspect it's for a docx4j bug, did anybody else try something similar?
Finally these are my platform details:
JDK 1.6
Docx4J v3.2.2
Thanks in advance for any help
EDIT
To produce freemarker markers in place of Word fields, I set document property values as follows:
traverse the document looking for simple or complex fields with new TraversalUtil(wordMLPackage.getMainDocumentPart().getContent(), visitor);, where visitor is my custom callback for looking for fields and set properties
traversing the document I look for
FldChar elements with type BEGIN and parse them using FieldsPreprocessor.canonicalise((P) ((R) fc.getParent()).getParent(), fields); (I don't use the return value of canonicalise) where fc is the found FldChar and fields is a empty ArrayList<FieldRef>; then I extract and parse field's instrText attribute
CTSimpleField elements and parse them using FldSimpleModel fldSimpleModel = new FldSimpleModel(); fldSimpleModel.build((CTSimpleField) o, null);; then I use fldSimpleModel.getFldArgument() to get the property name
I look for the freemarker code to show in place of the current field and set it as property value using wordMLPackage.getDocPropsCustomPart().setProperty(propertyName, finalValue);
finally I do the same from step 1 for headers and footers as follows:
List<Relationship> rels = wordMLPackage.getMainDocumentPart().getRelationshipsPart().getRelationships().getRelationship();
for (Relationship rel : rels) {
Part p = wordMLPackage.getMainDocumentPart().getRelationshipsPart().getPart(rel);
if (p == null) {
continue;
}
if (p instanceof ContentAccessor) {
new TraversalUtil(((ContentAccessor) p).getContent(), visitor);
}
}
Finally I update fields as follows
FieldUpdater updater = new FieldUpdater(wordMLPackage);
try {
updater.update(true);
} catch (Docx4JException ex) {
Logger.getLogger(WorkerDocx4J.class.getName()).log(Level.SEVERE, null, ex);
}
After filling all field properties, I clone the document as previously described and convert filtered cloned instances using
HTMLSettings settings = Docx4J.createHTMLSettings();
settings.setWmlPackage(wordDoc);
settings.setImageHandler(new InlineImageHandler(myDataModel));
Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML", true);
ByteArrayOutputStream os = new ByteArrayOutputStream();
os.write("[#ftl]\r\n".getBytes("UTF-8"));
Docx4J.toHTML(settings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);
String template = new String(os.toByteArray(), "UTF-8");
then I obtain in template variable the resulting freemarker template.
The following XML is the content of footer1.xml part of the document generated after updating the document properties as described: footer1.xml after field updates
The very strange thing (in my opinion) is that if some properties are not found, step 5 throws an Exception (ok), fields updating stops at the wrong field (ok) and all fields in header and footer are rendered right. In this case, this is the content for footer1.xml.
In the last case, fields are defined in a different way. I think the HTML converter handles well the last case and does something wrong in the first one.
Is there something I do wrong or I can do better?

Can you insert blank lines in an already transformed PDF?

I have a situation where I need to increase the space between a table and the header on a PDF that has already been transformed from an XSL template.
I need to insert an address in the newly created space. This part is easy enough and I can do that using a stamper and a new table.
However, I am struggling to find a solution to move the grid down to make the space.
Basically I am using FOP to create the PDF from an XSL template using code similar to the following:
OutputStream out = new java.io.FileOutputStream(pdf);
Driver driver = new Driver();
driver.setRenderer(Driver.RENDER_PDF);
driver.setOutputStream(out);
TransformerFactory factory = TransformerFactory.newInstance();
Transformer transformer = factory.newTransformer(new StreamSource(xsl));
StringReader xmlStream = new StringReader(xmlData);
Source xmlSource = new StreamSource(xmlStream);
Result res = new SAXResult(driver.getContentHandler());
transformer.transform(xmlSource, res);
Is it even possible to access the PDF in a way to add the new space? If so, what are my options? I should mention that I don’t know at the time the transformation is happening that I will need the extra space. I only know I need it once I get a page count of the PDF.
Any help is greatly appreciated!

It's not possible to add "new space" per se, but it is possible to get the co-ordinates of an object on the page and then re-draw that object somewhere else. Unfortunately there's no quick and easy solution and you will need a third-party SDK to do this.
PDF isn't a word processor format, so it's not possible to simply add a couple of carriage returns, as you might in MS Word.
Try iText, it's written in Java and has a decent amount of functionality for manipulating PDFs.

It seems it is not possible (at least from everything I have tried and read) to move transformed objects around once the PDF has already been generated.
Since I was already using iText and the PdfStamper class I was able to insert a new page and insert a new table with the current address info. I did this with the following code:
//add new page
PdfStamper stamper = new PdfStamper(reader,new FileOutputStream(file);
stamper.insertPage(pageNumber,reader.getPageSizeWithRotation(1));
//add new table with data
BaseFont base = BaseFont.createFont(BaseFont.HELVETICA,"",BaseFont.NOT_EMBEDDED);
over.setFontAndSize(base,fontSize);
PdfPTable table = new PdfPTable(1);
table.getDefaultCell().setBorder(Rectangle.NO_BORDER);
table.addCell(data);
table.setTotalWidth(150f);
table.writeSelectedRows(0, -1, 73, 650, over);
This is not the answer to my question by a viable solution I thought I would share in case others get hung up on the same issue.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas