iText 7 need to skip reading page header elements - pdf

I am using an event handler to create the page header for my PDF. The header content is added to a Table before being added to the Canvas. For Section 508 compliance, I need to exclude the header content from being read out loud. How do I accomplish this?
public class TEirHeaderEventHandler : IEventHandler
{
    public void HandleEvent(Event e)
    {
        PdfDocumentEvent docEvent = (PdfDocumentEvent)e;
        PdfDocument pdf = docEvent.GetDocument();
        PdfPage page = docEvent.GetPage();
        PdfCanvas headerPdfCanvas = new PdfCanvas(page.NewContentStreamBefore(), page.GetResources(), pdf);
        Rectangle headerRect = new Rectangle(60, 725, 495, 96);
        Canvas headerCanvas = new Canvas(headerPdfCanvas, pdf, headerRect);
        //creating content for header
        CreateHeaderContent(headerCanvas);
        headerCanvas.Close();
    }

    private void CreateHeaderContent(Canvas canvas)
    {
        //Create header content
        Table table = new Table(UnitValue.CreatePercentArray(new float[] { 60, 25, 15 }));
        table.SetWidth(UnitValue.CreatePercentValue(100));
        Cell cell1 = new Cell().Add(new Paragraph("Establishment Inspection Report").SetBold().SetTextAlignment(TextAlignment.LEFT));
        cell1.SetBorder(Border.NO_BORDER);
        table.AddCell(cell1);
        Cell cell2 = new Cell().Add(new Paragraph("FEI Number:").SetBold().SetTextAlignment(TextAlignment.RIGHT));
        cell2.SetBorder(Border.NO_BORDER);
        table.AddCell(cell2);
        Cell cell3 = new Cell().Add(new Paragraph(_feiNum).SetBold().SetTextAlignment(TextAlignment.RIGHT));
        cell3.SetBorder(Border.NO_BORDER);
        table.AddCell(cell3);
        canvas.Add(table);
    }
}
public static void CreatePdf()
{
    using (MemoryStream writeStream = new MemoryStream())
    using (FileStream inputHtmlStream = File.OpenRead(inputHtmlFile))
    {
        PdfDocument pdf = new PdfDocument(new PdfWriter(writeStream));
        pdf.SetTagged();
        iTextDocument document = new iTextDocument(pdf);
        TEirHeaderEventHandler teirEvent = new TEirHeaderEventHandler();
        pdf.AddEventHandler(PdfDocumentEvent.START_PAGE, teirEvent);
        //Convert html to pdf
        HtmlConverter.ConvertToDocument(inputHtmlStream, pdf, properties);
        document.Close();
        byte[] bytes = TEirReorderingPages(writeStream, numOfPages);
        File.WriteAllBytes(outputPdfFile, bytes);
    }
}
Note that I have set the document to be tagged, but I still get the "Reading Untagged Document" screen when I open the file. In addition, all of the content is read, including the header, when I activate the Read Out Loud feature. Any input or suggestion would be appreciated. Thank you in advance for your help.

General
The approach suggested by Alexey Subach is generally correct. You mark the content as artifact to differentiate it from real content.
element.getAccessibilityProperties().setRole(StandardRoles.ARTIFACT);
This marks the content in the content stream and it excludes the element from the structure tree.
Your case
However, your specific case is more nuanced.
For a well tagged PDF document, the proper way to read it out loud is to process the structure tree, which is a data structure that represents the logical reading order of the (semantic) elements of the document, such as paragraphs, tables and lists.
Because of the way you are creating the header content, it is not automatically tagged: a Canvas instance that is created from a PdfCanvas instance has autotagging disabled by default. So the table in the header is not marked in the content stream and it is not included in the structure tree. Marking it explicitly as an artifact, with the approach described above in General, should not make a significant difference because it was not in the structure tree to begin with.
If you enable autotagging by adding headerCanvas.enableAutoTagging(page), you will notice that the table does appear in the structure tree.
If you then add table.getAccessibilityProperties().setRole(StandardRoles.ARTIFACT), the table is excluded from the structure tree again.
Summary: looking at the structure tree, there's no difference between your original code and the approach described under General.
Adobe reading order / accessibility settings
From your description, I think you are using Adobe Acrobat or Reader for the read out loud functionality. Under Preferences > Reading > Reading Order Options, you can configure how the content should be processed for the read out loud feature:
From https://helpx.adobe.com/reader/using/accessibility-features.html:
Infer Reading Order From Document (Recommended): Interprets the reading order of untagged documents by using an advanced method of structure inference layout analysis.
Left-To-Right, Top-To-Bottom Reading Order: Delivers the text according to its placement on the page, reading from left to right and then top to bottom. This method is faster than Infer Reading Order From Document. This method analyzes text only; form fields are ignored and tables aren’t recognized as such.
Override The Reading Order In Tagged Documents: Uses the reading order specified in the Reading preferences instead of what the tag structure of the document specifies. Use this preference only when you encounter problems in poorly tagged PDFs.
In my tests, the only way I can make Adobe Reader read out loud the header content created with your original code is to select Left-To-Right, Top-To-Bottom Reading Order and enable Override The Reading Order In Tagged Documents. In that case, it basically ignores the tagging and just processes the content according to its location on the page.
With Override The Reading Order In Tagged Documents disabled, the header content is not read, both for your original code and with explicit artifacts.
Conclusion
Although it's a good idea to always tag artifacts as such, so they can be properly differentiated from real content, in this case I believe the behaviour you're experiencing is more related to application configuration than to file structure.

Headers and footers are typically pagination artifacts and should be marked as such in the following way:
table.getAccessibilityProperties().setRole(StandardRoles.ARTIFACT);
This will exclude the table from being read. Please note that you can mark any element implementing the IAccessibleElement interface as an artifact.

Related

Filter out anything but interactive form fields in PDF's

I'm looking for a way to filter out all objects apart from interactive form fields in PDF files.
The programming language isn't too important. I would love to be able to do it from the Linux command line, but I'm pretty much open to anything.
E.g. choose a PDF input file, and output a new PDF file containing only the interactive form fields from the first.
The ultimate goal is to take an already printed but unfilled form and print only the content of the filled-in form fields onto it.
The closest I've gotten is by using ghostscript:
gs -o outfile.pdf -sDEVICE=pdfwrite -dFILTERTEXT -dFILTERIMAGE infile.pdf
But that still leaves a lot of lines in my case, as well as an image, despite -dFILTERIMAGE.
There's also a -dFILTERVECTOR option, but sadly it removes the form fields as well.
First and foremost you have to get rid of the static page content. Using an arbitrary general purpose pdf library you can do that by clearing the contents entry of every page.
E.g. using the Java version of iText7 this can be done as follows:
try (
    PdfReader pdfReader = new PdfReader(SOURCE);
    PdfWriter pdfWriter = new PdfWriter(RESULT);
    PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter)
) {
    for (int pageNr = 1; pageNr <= pdfDocument.getNumberOfPages(); pageNr++) {
        PdfPage pdfPage = pdfDocument.getPage(pageNr);
        pdfPage.getPdfObject().remove(PdfName.Contents);
        pdfPage.getPdfObject().setModified();
    }
}
(RemoveContent test testRemoveAllPageContentStreams)

PDFBox: Fill out a PDF by repeatedly adding a one-page template containing a form

Following the SO question "Java pdfBox: Fill out pdf form, append it to pddocument, and repeat", I had trouble appending a cloned page to a new PDF.
Code from this page seemed really interesting, but didn't work for me.
Actually, the answer doesn't work because it is always the same PDField that you modify and add to the list. So the next time you call 'getField' with the initial name, it won't find it and you get an NPE. I tried with the same PDFBox version (1.8.12) used in the nice GitHub project, but I can't understand how the author got this working.
I had the same issue today trying to append a form on pages with different values in it. I wondered whether the solution was to duplicate the field, but I couldn't manage to do it properly. I always ended up with a PDF containing the same values in each form.
(I provided a link to the template document for Mkl, but now I removed it because it doesn't belong to me)
Edit: Following Mkl's advice, I figured out what I was missing, but performance is really bad when duplicating every page, and the file size isn't satisfactory. Maybe there's a way to optimize this by reusing similar parts of the PDF.
Finally I got it working without reloading the template each time. So the resulting file is as I wanted: not too big (4Mb for 164 pages).
I think I made two mistakes before: one in page creation, and probably one in field duplication.
So here is the working code, if someone happens to be stuck on the same problem.
Form creation:
PDAcroForm finalForm = new PDAcroForm(finalDoc, new COSDictionary());
finalForm.setDefaultResources(originForm.getDefaultResources());
Page creation:
PDPage clonedPage = templateDocument.getPage(0);
COSDictionary clonedDict = new COSDictionary(clonedPage.getCOSObject());
clonedDict.removeItem(COSName.ANNOTS);
clonedPage = new PDPage(clonedDict);
finalDoc.addPage(clonedPage);
Field duplication: (rename field to become unique and set value)
PDTextField field = (PDTextField) originForm.getField(fieldName);
PDPage page = finalDoc.getPages().get(nPage);
PDTextField clonedField = new PDTextField(finalForm);
List<PDAnnotationWidget> widgetList = new ArrayList<>();
for (PDAnnotationWidget paw : field.getWidgets()) {
    PDAnnotationWidget newWidget = new PDAnnotationWidget();
    newWidget.getCOSObject().setString(COSName.DA, paw.getCOSObject().getString(COSName.DA));
    newWidget.setRectangle(paw.getRectangle());
    widgetList.add(newWidget);
}
clonedField.setQ(field.getQ()); // To get text centered
clonedField.setWidgets(widgetList);
clonedField.setValue(value);
clonedField.setPartialName(fieldName + cnt++);
fields.add(clonedField);
page.getAnnotations().addAll(clonedField.getWidgets());
And at the end of the process:
finalDoc.getDocumentCatalog().setAcroForm(finalForm);
finalForm.setFields(fields);
finalForm.flatten();

Word to HTML fields in header and footer

I'm using docx4j to convert a Word template to several HTML files, one per chapter.
The Word template has several custom properties mapped by several fields (DOCPROPERTY ...) represented as both simple and complex fields. I populate those properties to obtain Freemarker code when the word document is converted to HTML (like ${...} or [#... /] directives).
In a later step I look for "heading 1" paragraphs to identify chapters and then split the document into several Word documents before conversion. These documents are then converted to HTML and written to temporary files.
Each document is successfully converted to HTML and fields are correctly replaced with my markers, but the conversion behaves incorrectly when it writes the header and footer parts: field codes are written before field values (e.g. DOCPROPERTY "PROPERTY_NAME" \* MERGEFORMAT ${constants['PROPERTY_NAME']} ) instead of field values only (e.g. ${constants['PROPERTY_NAME']} ).
If I write the updated document to a docx file instead, nothing seems wrong in the generated document.
If it's useful to solve the problem, this is what I do to split the document (per chapter):
clone the updated WordprocessingMLPackage (clone method)
delete every root element before the chapter's "heading 1" element
delete every root element from the "heading 1" element of the next chapter
convert the cloned and cleaned document
(actually I don't use the clone method every time, but I write the updated document to a ByteArrayOutputStream and then read it for every chapter, inspired by the source of the clone method).
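The per-chapter split described above can be sketched independently of docx4j. The following is only an illustration under stated assumptions: Block and ChapterSplitter are hypothetical stand-ins (not docx4j classes), the body is modelled as a flat list of top-level elements, and the style id "Heading1" is assumed to identify chapter headings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a top-level body element; not a docx4j class.
final class Block {
    final String style; // assumed style id, e.g. "Heading1" or "Normal"
    final String text;

    Block(String style, String text) {
        this.style = style;
        this.text = text;
    }

    boolean isHeading1() {
        return "Heading1".equals(style);
    }
}

final class ChapterSplitter {
    // Returns the elements of the chapter with the given zero-based index.
    static List<Block> extractChapter(List<Block> body, int chapterIndex) {
        // Copying the list stands in for cloning the whole package.
        List<Block> copy = new ArrayList<>(body);
        int start = -1;
        int end = copy.size();
        int seen = -1;
        for (int i = 0; i < copy.size(); i++) {
            if (!copy.get(i).isHeading1()) {
                continue;
            }
            seen++;
            if (seen == chapterIndex) {
                start = i;
            } else if (seen == chapterIndex + 1) {
                end = i;
                break;
            }
        }
        if (start < 0) {
            throw new IllegalArgumentException("No such chapter: " + chapterIndex);
        }
        // Delete everything from the next chapter's heading onwards first,
        // so that the start index remains valid for the second deletion.
        copy.subList(end, copy.size()).clear();
        // Delete every element before this chapter's heading.
        copy.subList(0, start).clear();
        return copy;
    }

    public static void main(String[] args) {
        List<Block> body = Arrays.asList(
                new Block("Normal", "Cover text"),
                new Block("Heading1", "Chapter A"),
                new Block("Normal", "a1"),
                new Block("Heading1", "Chapter B"),
                new Block("Normal", "b1"));
        for (Block b : extractChapter(body, 1)) {
            System.out.println(b.style + ": " + b.text);
        }
    }
}
```

Because each chapter is cut from its own copy, the original list is left untouched and can be reused for the next chapter.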
I suspect it's a docx4j bug. Has anybody else tried something similar?
Finally these are my platform details:
JDK 1.6
Docx4J v3.2.2
Thanks in advance for any help
EDIT
To produce freemarker markers in place of Word fields, I set document property values as follows:
1. Traverse the document looking for simple or complex fields with new TraversalUtil(wordMLPackage.getMainDocumentPart().getContent(), visitor);, where visitor is my custom callback that looks for fields and sets properties.
2. While traversing the document, I look for:
   - FldChar elements with type BEGIN, and parse them using FieldsPreprocessor.canonicalise((P) ((R) fc.getParent()).getParent(), fields); (I don't use the return value of canonicalise), where fc is the found FldChar and fields is an empty ArrayList<FieldRef>; then I extract and parse the field's instrText attribute
   - CTSimpleField elements, and parse them using FldSimpleModel fldSimpleModel = new FldSimpleModel(); fldSimpleModel.build((CTSimpleField) o, null);; then I use fldSimpleModel.getFldArgument() to get the property name
3. I look for the freemarker code to show in place of the current field and set it as the property value using wordMLPackage.getDocPropsCustomPart().setProperty(propertyName, finalValue);
4. Finally, I do the same from step 1 for headers and footers as follows:
List<Relationship> rels = wordMLPackage.getMainDocumentPart().getRelationshipsPart().getRelationships().getRelationship();
for (Relationship rel : rels) {
    Part p = wordMLPackage.getMainDocumentPart().getRelationshipsPart().getPart(rel);
    if (p == null) {
        continue;
    }
    if (p instanceof ContentAccessor) {
        new TraversalUtil(((ContentAccessor) p).getContent(), visitor);
    }
}
5. Finally, I update the fields as follows:
FieldUpdater updater = new FieldUpdater(wordMLPackage);
try {
    updater.update(true);
} catch (Docx4JException ex) {
    Logger.getLogger(WorkerDocx4J.class.getName()).log(Level.SEVERE, null, ex);
}
After filling all field properties, I clone the document as previously described and convert filtered cloned instances using
HTMLSettings settings = Docx4J.createHTMLSettings();
settings.setWmlPackage(wordDoc);
settings.setImageHandler(new InlineImageHandler(myDataModel));
Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML", true);
ByteArrayOutputStream os = new ByteArrayOutputStream();
os.write("[#ftl]\r\n".getBytes("UTF-8"));
Docx4J.toHTML(settings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);
String template = new String(os.toByteArray(), "UTF-8");
The template variable then holds the resulting freemarker template.
The following XML is the content of footer1.xml part of the document generated after updating the document properties as described: footer1.xml after field updates
The very strange thing (in my opinion) is that if some properties are not found, step 5 throws an exception (OK), field updating stops at the failing field (OK), and all fields in the header and footer are rendered correctly. In this case, this is the content of footer1.xml.
In the latter case, the fields are defined in a different way. I think the HTML converter handles the latter case well and does something wrong in the former.
Is there something I am doing wrong, or something I could do better?

Change AcroFields order in existing PDF with iText?

I have a PDF with text form fields that are layered one on top of the other. When I fill the fields via iText and flatten the form, the form field that I had created on top of the other form field is now on the bottom.
For instance, I have a text field named "number_field" and that is underneath a second text field that is titled "name_field". When I set the value for those fields via iText (so 10 for number_field and 'John' for name_field), the number_field is now on top of the name_field.
How do I change the order on the page of these fields with iText? Is it possible?
Link to example PDF: https://freecompany.sharefile.com/d-s84f6d63e7d04fe79
I have made the following ticket in the issue tracker at iText Group:
A problem is caused by the fact that iText reads the field items into
a HashMap, hence there is no way to predict in which order they will
be flattened. This usually isn't a problem. I don't think this problem
occurs in case you don't flatten the PDF, because in that case, the
appearance is stored in the widget annotations and it's up to the PDF
viewer to decide which field covers another one in case of overlapping
fields.
However, if form fields overlap, then you can't predict which field
will cover which when flattening.
Suppose that we'd use a TreeMap instead of a HashMap, would this
solve the problem? Not really, because which Comparator would we
use? Sometimes a Tab-order is defined, but not always. If it's not
defined, should we order the fields in the order in which they appear
in the /Fields array? Or does it make more sense to order them based
on the order of the widget annotations in the /Annots array? Another
option is to order them based on their position on the page. In short:
this is not a decision iText should make.
However, if somebody would like to solve this problem, we could create
a Comparator member variable for PdfStamperImp. If such a
Comparator is provided (we could even provide some implementations),
then the flattening process would be executed in the order defined by
the Comparator.
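To illustrate the ordering question raised in the ticket, here is a minimal, library-free sketch. It is not iText code: Widget and FlattenOrder are hypothetical names, and the two comparators are just examples of the candidate orderings the ticket mentions (order in the /Annots array, or position on the page). Since content drawn later paints over content drawn earlier, the last widget in the resulting order would end up on top.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a form field widget; not an iText class.
final class Widget {
    final String name;
    final float left;     // x coordinate of the widget's left edge
    final float top;      // y coordinate of the widget's top edge
    final int annotIndex; // assumed index of the widget in the page's /Annots array

    Widget(String name, float left, float top, int annotIndex) {
        this.name = name;
        this.left = left;
        this.top = top;
        this.annotIndex = annotIndex;
    }
}

final class FlattenOrder {
    // Candidate ordering 1: follow the order of the /Annots array.
    static final Comparator<Widget> BY_ANNOTS =
            Comparator.comparingInt(w -> w.annotIndex);

    // Candidate ordering 2: top of the page first, then left to right.
    static final Comparator<Widget> BY_POSITION =
            Comparator.<Widget>comparingDouble(w -> -w.top)
                    .thenComparingDouble(w -> w.left);

    // Returns the field names in the order in which they would be drawn.
    static List<String> order(List<Widget> widgets, Comparator<Widget> comparator) {
        List<Widget> sorted = new ArrayList<>(widgets);
        sorted.sort(comparator); // stable sort: ties keep their input order
        List<String> names = new ArrayList<>();
        for (Widget w : sorted) {
            names.add(w.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Widget> widgets = Arrays.asList(
                new Widget("name_field", 100f, 700f, 1),
                new Widget("number_field", 100f, 700f, 0));
        System.out.println(order(widgets, BY_ANNOTS));
    }
}
```

Note that the two overlapping fields from the question share the same position, so a position-based comparator alone cannot decide between them. This is one reason why the choice of comparator has to be left to the caller.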
This ticket has received a very low priority (I assume that you're not a customer of one of the iText Software companies), but while writing this ticket, I had another idea.
I already referred to the question "underline portion of text using iTextSharp" in the comments. With that approach, you'd get all the field positions (using the getFieldPositions() method) and draw all the contents in the right order using ColumnText. This approach has several disadvantages: for the font, font size, and font color to be correct, you'd have to examine the fields. That requires some programming.
I am now posting this as an answer, because I have a much better alternative: fill out the form in two passes! This is shown in the FillFormFieldOrder example. We fill out the form src resulting in the flattened form dest like this:
public void manipulatePdf(String src, String dest) throws DocumentException, IOException {
    go2(go1(src), dest);
}
As you can see, we execute the go1() method first:
public byte[] go1(String src) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    PdfStamper stamper = new PdfStamper(reader, baos);
    AcroFields form = stamper.getAcroFields();
    form.setField("sunday_1", "1");
    form.setField("sunday_2", "2");
    form.setField("sunday_3", "3");
    form.setField("sunday_4", "4");
    form.setField("sunday_5", "5");
    form.setField("sunday_6", "6");
    stamper.setFormFlattening(true);
    stamper.partialFormFlattening("sunday_1");
    stamper.partialFormFlattening("sunday_2");
    stamper.partialFormFlattening("sunday_3");
    stamper.partialFormFlattening("sunday_4");
    stamper.partialFormFlattening("sunday_5");
    stamper.partialFormFlattening("sunday_6");
    stamper.close();
    reader.close();
    return baos.toByteArray();
}
This fills out all the sunday_x fields and uses partial form flattening to flatten only those fields. The go1() method takes src as parameter and returns a byte[] with the partially flattened form.
The byte[] will be used as a parameter for the go2() method, that takes dest as its second parameter. Now we are going to fill out the sunday_x_notes fields:
public void go2(byte[] src, String dest) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    AcroFields form = stamper.getAcroFields();
    form.setField("sunday_1_notes", "It's Sunday today, let's go to the sea");
    form.setField("sunday_2_notes", "It's Sunday today, let's go to the park");
    form.setField("sunday_3_notes", "It's Sunday today, let's go to the beach");
    form.setField("sunday_4_notes", "It's Sunday today, let's go to the woods");
    form.setField("sunday_5_notes", "It's Sunday today, let's go to the lake");
    form.setField("sunday_6_notes", "It's Sunday today, let's go to the river");
    stamper.setFormFlattening(true);
    stamper.close();
    reader.close();
}
As you can see, we now flatten all the remaining fields.
Now, you no longer have to worry about the order of the fields, not in the /Fields array, not in the /Annots array. The fields are filled out in the exact order you want. The notes cover the dates now, instead of the other way round.

Can you insert blank lines in an already transformed PDF?

I have a situation where I need to increase the space between a table and the header on a PDF that has already been transformed from an XSL template.
I need to insert an address in the newly created space. This part is easy enough and I can do that using a stamper and a new table.
However, I am struggling to find a solution to move the grid down to make the space.
Basically I am using FOP to create the PDF from an XSL template using code similar to the following:
OutputStream out = new java.io.FileOutputStream(pdf);
Driver driver = new Driver();
driver.setRenderer(Driver.RENDER_PDF);
driver.setOutputStream(out);
TransformerFactory factory = TransformerFactory.newInstance();
Transformer transformer = factory.newTransformer(new StreamSource(xsl));
StringReader xmlStream = new StringReader(xmlData);
Source xmlSource = new StreamSource(xmlStream);
Result res = new SAXResult(driver.getContentHandler());
transformer.transform(xmlSource, res);
Is it even possible to access the PDF in a way to add the new space? If so, what are my options? I should mention that I don’t know at the time the transformation is happening that I will need the extra space. I only know I need it once I get a page count of the PDF.
Any help is greatly appreciated!
It's not possible to add "new space" per se, but it is possible to get the co-ordinates of an object on the page and then re-draw that object somewhere else. Unfortunately there's no quick and easy solution and you will need a third-party SDK to do this.
PDF isn't a word processor format, so it's not possible to simply add a couple of carriage returns, as you might in MS Word.
Try iText, it's written in Java and has a decent amount of functionality for manipulating PDFs.
It seems it is not possible (at least from everything I have tried and read) to move transformed objects around once the PDF has already been generated.
Since I was already using iText and the PdfStamper class I was able to insert a new page and insert a new table with the current address info. I did this with the following code:
//add new page
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(file));
stamper.insertPage(pageNumber, reader.getPageSizeWithRotation(1));
//get the content layer of the new page (assumption: "over" was obtained like this in the omitted code)
PdfContentByte over = stamper.getOverContent(pageNumber);
//add new table with data
BaseFont base = BaseFont.createFont(BaseFont.HELVETICA, "", BaseFont.NOT_EMBEDDED);
over.setFontAndSize(base, fontSize);
PdfPTable table = new PdfPTable(1);
table.getDefaultCell().setBorder(Rectangle.NO_BORDER);
table.addCell(data);
table.setTotalWidth(150f);
table.writeSelectedRows(0, -1, 73, 650, over);
This is not the answer to my question, but it is a viable workaround that I thought I would share in case others get hung up on the same issue.