How do I get IDs "original" and "modified" for XFDF using iText?

The last tag in an XFDF file looks something like this:
<ids original="20639838865717E80D2556CB7B2AEC2D"
modified="754C78B10C9159419708446C3395CDBE"/>
I can get these values by exporting PDF form data from Acrobat using this method: http://wiki.developerforce.com/page/Adobe_XFDF_Ids_Determination.
However I want to get the IDs programmatically in order to build correct xfdf documents for arbitrary PDF forms.
How do I get these values using iText?

As explained in this post (removing PDFID in PDF), /ID is a recommended entry in the trailer dictionary (and a required one if the document is encrypted).
Using iText, IDs are accessed as a PdfArray of two PdfString objects in the trailer PdfDictionary. The String values will look like garbage because each is a representation of a byte array. These are the hex values you need for "original" and "modified".
The following code will print out the two IDs, which can be verified against e.g. an export from Acrobat Pro (NB Hex.encodeHexString is Apache commons-codec):
public void printIds(PdfReader reader) {
    PdfDictionary trailer = reader.getTrailer();
    if (trailer.contains(PdfName.ID)) {
        PdfArray ids = (PdfArray) trailer.get(PdfName.ID);
        // The two entries are the "original" and "modified" file identifiers.
        PdfString original = ids.getAsString(0);
        PdfString modified = ids.getAsString(1);
        System.out.println(Hex.encodeHexString(original.getBytes()));
        System.out.println(Hex.encodeHexString(modified.getBytes()));
    }
}
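If you'd rather not depend on commons-codec, the hex encoding and the closing <ids> element can be produced with plain Java. This is only a sketch; XfdfIds, toHex and buildIdsElement are illustrative names, not iText API. Note that Acrobat exports the IDs in uppercase hex while Hex.encodeHexString produces lowercase; the comparison is case-insensitive.

```java
public class XfdfIds {

    // Encode the raw /ID bytes as the uppercase hex string Acrobat exports.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Build the <ids> element that closes an XFDF file.
    public static String buildIdsElement(byte[] original, byte[] modified) {
        return "<ids original=\"" + toHex(original)
                + "\" modified=\"" + toHex(modified) + "\"/>";
    }
}
```

You would feed it the byte arrays obtained from `original.getBytes()` and `modified.getBytes()` in the snippet above.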

Related

iText 7 need to skip reading page header elements

I am using an EventHandler to create a page header for my PDF. The content of the header is added into a Table before being added to a Canvas. As part of 508 compliance, I need to exclude the header content from being read out loud. How do I accomplish this?
public class TEirHeaderEventHandler : IEventHandler
{
    public void HandleEvent(Event e)
    {
        PdfDocumentEvent docEvent = (PdfDocumentEvent)e;
        PdfDocument pdf = docEvent.GetDocument();
        PdfPage page = docEvent.GetPage();
        PdfCanvas headerPdfCanvas = new PdfCanvas(page.NewContentStreamBefore(), page.GetResources(), pdf);
        Rectangle headerRect = new Rectangle(60, 725, 495, 96);
        Canvas headerCanvas = new Canvas(headerPdfCanvas, pdf, headerRect);
        //creating content for header
        CreateHeaderContent(headerCanvas);
        headerCanvas.Close();
    }

    private void CreateHeaderContent(Canvas canvas)
    {
        //Create header content
        Table table = new Table(UnitValue.CreatePercentArray(new float[] { 60, 25, 15 }));
        table.SetWidth(UnitValue.CreatePercentValue(100));
        Cell cell1 = new Cell().Add(new Paragraph("Establishment Inspection Report").SetBold().SetTextAlignment(TextAlignment.LEFT));
        cell1.SetBorder(Border.NO_BORDER);
        table.AddCell(cell1);
        Cell cell2 = new Cell().Add(new Paragraph("FEI Number:").SetBold().SetTextAlignment(TextAlignment.RIGHT));
        cell2.SetBorder(Border.NO_BORDER);
        table.AddCell(cell2);
        Cell cell3 = new Cell().Add(new Paragraph(_feiNum).SetBold().SetTextAlignment(TextAlignment.RIGHT));
        cell3.SetBorder(Border.NO_BORDER);
        table.AddCell(cell3);
        canvas.Add(table);
    }
}
public static void CreatePdf()
{
    using (MemoryStream writeStream = new MemoryStream())
    using (FileStream inputHtmlStream = File.OpenRead(inputHtmlFile))
    {
        PdfDocument pdf = new PdfDocument(new PdfWriter(writeStream));
        pdf.SetTagged();
        iTextDocument document = new iTextDocument(pdf);
        TEirHeaderEventHandler teirEvent = new TEirHeaderEventHandler();
        pdf.AddEventHandler(PdfDocumentEvent.START_PAGE, teirEvent);
        //Convert html to pdf
        HtmlConverter.ConvertToDocument(inputHtmlStream, pdf, properties);
        document.Close();
        byte[] bytes = TEirReorderingPages(writeStream, numOfPages);
        File.WriteAllBytes(outputPdfFile, bytes);
    }
}
Note that I have set the document to be tagged, but I still get the "Reading Untagged Document" screen when I open the file. Moreover, all of the content is read, including the header, when I activate the Read Out Loud feature. Any input or suggestion would be appreciated. Thank you in advance for your help.
General
The approach suggested by Alexey Subach is generally correct. You mark the content as artifact to differentiate it from real content.
element.getAccessibilityProperties().setRole(StandardRoles.ARTIFACT);
This marks the content in the content stream and it excludes the element from the structure tree.
Your case
However, your specific case is more nuanced.
For a well tagged PDF document, the proper way to read it out loud is to process the structure tree, which is a data structure that represents the logical reading order of the (semantic) elements of the document, such as paragraphs, tables and lists.
Because of the way you are creating the header content, it is not automatically tagged: a Canvas instance that is created from a PdfCanvas instance has autotagging disabled by default. So the table in the header is not marked in the content stream and it is not included in the structure tree. Marking it explicitly as an artifact, with the approach described above in General, should not make a significant difference because it was not in the structure tree to begin with.
If you enable autotagging by adding headerCanvas.enableAutoTagging(page), you will notice that the table does appear in the structure tree.
If you then add table.getAccessibilityProperties().setRole(StandardRoles.ARTIFACT), the table is excluded from the structure tree again.
Summary: looking at the structure tree, there's no difference between your original code and the approach of General.
Adobe reading order / accessibility settings
From your description, I think you are using Adobe Acrobat or Reader for the read out loud functionality. Under Preferences > Reading > Reading Order Options, you can configure how the content should be processed for the read out loud feature:
From https://helpx.adobe.com/reader/using/accessibility-features.html:
Infer Reading Order From Document (Recommended): Interprets the reading order of untagged documents by using an advanced method of structure inference layout analysis.
Left-To-Right, Top-To-Bottom Reading Order: Delivers the text according to its placement on the page, reading from left to right and then top to bottom. This method is faster than Infer Reading Order From Document. This method analyzes text only; form fields are ignored and tables aren’t recognized as such.
Override The Reading Order In Tagged Documents: Uses the reading order specified in the Reading preferences instead of what the tag structure of the document specifies. Use this preference only when you encounter problems in poorly tagged PDFs.
In my tests, the only way I can make Adobe Reader read the header content created with your original code out loud is to select Left-To-Right, Top-To-Bottom Reading Order and enable Override The Reading Order In Tagged Documents. In that case, it basically ignores the tagging and processes the content according to its location on the page.
With Override The Reading Order In Tagged Documents disabled, the header content is not read, either with your original code or with explicit artifacts.
Conclusion
Although it's a good idea to always tag artifacts as such, so they can be properly differentiated from real content, in this case I believe the behaviour you're experiencing is more related to application configuration than to file structure.
Headers and footers are typically pagination artifacts and should be marked as such in the following way:
table.getAccessibilityProperties().setRole(StandardRoles.ARTIFACT);
This will exclude the table from being read. Please note that you can mark any element implementing the IAccessibleElement interface as an artifact.

Filter out anything but interactive form fields in PDF's

I'm looking for a way to filter out all objects apart from interactive form fields in PDF files.
The programming language isn't too important, but I would love to be able to do it from the Linux command line; that said, I'm pretty much open to anything.
E.g. take a PDF input file, and output a new PDF file with only the interactive form fields from the first.
The ultimate goal is to be able to take an already printed but unfilled form, and print only the content of the filled-in form fields onto it.
The closest I've gotten is by using ghostscript:
gs -o outfile.pdf -sDEVICE=pdfwrite -dFILTERTEXT -dFILTERIMAGE infile.pdf
But that still leaves a lot of lines in my case, as well as an image despite -dFILTERIMAGE.
There's also a -dFILTERVECTOR option, but sadly it removes the form fields as well.
I'm looking for a way to filter out all objects apart from interactive form fields in PDF files.
First and foremost you have to get rid of the static page content. Using an arbitrary general purpose pdf library you can do that by clearing the contents entry of every page.
E.g. using the Java version of iText7 this can be done as follows:
try (
    PdfReader pdfReader = new PdfReader(SOURCE);
    PdfWriter pdfWriter = new PdfWriter(RESULT);
    PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter)
) {
    for (int pageNr = 1; pageNr <= pdfDocument.getNumberOfPages(); pageNr++) {
        PdfPage pdfPage = pdfDocument.getPage(pageNr);
        // Drop the static page content; widget annotations (form fields) remain.
        pdfPage.getPdfObject().remove(PdfName.Contents);
        pdfPage.getPdfObject().setModified();
    }
}
(RemoveContent test testRemoveAllPageContentStreams)

Word to HTML fields in header and footer

I'm using docx4j to convert a Word template to several HTML files, one per chapter.
The Word template has several custom properties mapped by several fields (DOCPROPERTY ...) represented as both simple and complex fields. I populate those properties to obtain Freemarker code when the word document is converted to HTML (like ${...} or [#... /] directives).
In a later step I look for "heading 1" paragraphs to identify chapters and then split the document in several Word documents before conversion, then these documents are converted to HTML and written to temporary files.
Each document is successfully converted to HTML and fields are correctly replaced with my markers, but the header and footer parts behave wrongly: field codes are written before field values (e.g. DOCPROPERTY "PROPERTY_NAME" \* MERGEFORMAT ${constants['PROPERTY_NAME']}) instead of field values only (e.g. ${constants['PROPERTY_NAME']}).
If I write the updated document to a docx file instead, nothing seems wrong in the generated document.
If it's useful to solve the problem, this is what I do to split the document (per chapter):
clone the updated WordprocessingMLPackage (clone method)
delete every root element before the chapter's "heading 1" element
delete every root element from the "heading 1" element of the next chapter
convert the cloned and cleaned document
(actually I don't use the clone method every time, but I write the updated document to a ByteArrayOutputStream and then read it for every chapter, inspired by the source of the clone method).
I suspect it's due to a docx4j bug; has anybody else tried something similar?
Finally these are my platform details:
JDK 1.6
Docx4J v3.2.2
Thanks in advance for any help
EDIT
To produce freemarker markers in place of Word fields, I set document property values as follows:
traverse the document looking for simple or complex fields with new TraversalUtil(wordMLPackage.getMainDocumentPart().getContent(), visitor);, where visitor is my custom callback for looking for fields and set properties
traversing the document I look for
FldChar elements with type BEGIN, which I parse using FieldsPreprocessor.canonicalise((P) ((R) fc.getParent()).getParent(), fields); (I don't use the return value of canonicalise), where fc is the found FldChar and fields is an empty ArrayList<FieldRef>; then I extract and parse the field's instrText attribute
CTSimpleField elements and parse them using FldSimpleModel fldSimpleModel = new FldSimpleModel(); fldSimpleModel.build((CTSimpleField) o, null);; then I use fldSimpleModel.getFldArgument() to get the property name
I look for the freemarker code to show in place of the current field and set it as property value using wordMLPackage.getDocPropsCustomPart().setProperty(propertyName, finalValue);
finally I do the same from step 1 for headers and footers as follows:
List<Relationship> rels = wordMLPackage.getMainDocumentPart().getRelationshipsPart().getRelationships().getRelationship();
for (Relationship rel : rels) {
    Part p = wordMLPackage.getMainDocumentPart().getRelationshipsPart().getPart(rel);
    if (p == null) {
        continue;
    }
    if (p instanceof ContentAccessor) {
        new TraversalUtil(((ContentAccessor) p).getContent(), visitor);
    }
}
Finally I update fields as follows
FieldUpdater updater = new FieldUpdater(wordMLPackage);
try {
    updater.update(true);
} catch (Docx4JException ex) {
    Logger.getLogger(WorkerDocx4J.class.getName()).log(Level.SEVERE, null, ex);
}
After filling all field properties, I clone the document as previously described and convert filtered cloned instances using
HTMLSettings settings = Docx4J.createHTMLSettings();
settings.setWmlPackage(wordDoc);
settings.setImageHandler(new InlineImageHandler(myDataModel));
Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML", true);
ByteArrayOutputStream os = new ByteArrayOutputStream();
os.write("[#ftl]\r\n".getBytes("UTF-8"));
Docx4J.toHTML(settings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);
String template = new String(os.toByteArray(), "UTF-8");
then I obtain in template variable the resulting freemarker template.
The following XML is the content of footer1.xml part of the document generated after updating the document properties as described: footer1.xml after field updates
The very strange thing (in my opinion) is that if some properties are not found, step 5 throws an Exception (OK), field updating stops at the wrong field (OK), and all fields in header and footer are rendered correctly. In this case, this is the content for footer1.xml.
In the last case, fields are defined in a different way. I think the HTML converter handles well the last case and does something wrong in the first one.
Is there something I do wrong or I can do better?
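As a stopgap while investigating, I could strip the leaked field codes from the generated HTML after conversion. This is only a sketch, assuming the leaked codes always match the DOCPROPERTY ... \* MERGEFORMAT pattern shown above; FieldCodeCleaner is a hypothetical helper, not docx4j API:

```java
import java.util.regex.Pattern;

public class FieldCodeCleaner {

    // Matches a leaked field code such as: DOCPROPERTY "PROPERTY_NAME" \* MERGEFORMAT
    // (including the trailing whitespace before the Freemarker marker).
    private static final Pattern LEAKED_FIELD_CODE =
            Pattern.compile("DOCPROPERTY\\s+\"[^\"]+\"\\s+\\\\\\*\\s+MERGEFORMAT\\s+");

    // Remove leaked field codes, keeping only the marker that follows them.
    public static String strip(String html) {
        return LEAKED_FIELD_CODE.matcher(html).replaceAll("");
    }
}
```

This only hides the symptom in the header/footer output; it doesn't explain why the converter emits field codes there in the first place.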

Change AcroFields order in existing PDF with iText?

I have a PDF with text form fields that are layered one on top of the other. When I fill the fields via iText and flatten the form, the form field that I had created on top of the other form field ends up on the bottom.
For instance, I have a text field named "number_field" and that is underneath a second text field that is titled "name_field". When I set the value for those fields via iText (so 10 for number_field and 'John' for name_field), the number_field is now on top of the name_field.
How do I change the order on the page of these fields with iText? Is it possible?
Link to example PDF: https://freecompany.sharefile.com/d-s84f6d63e7d04fe79
I have made the following ticket in the issue tracker at iText Group:
A problem is caused by the fact that iText reads the field items into
a HashMap, hence there is no way to predict in which order they will
be flattened. This usually isn't a problem. I don't think this problem
occurs in case you don't flatten the PDF, because in that case, the
appearance is stored in the widget annotations and it's up to the PDF
viewer to decide which field covers another one in case of overlapping
fields.
However, if form fields overlap, then you can't predict which field
will cover which when flattening.
Suppose that we'd use a TreeMap instead of a HashMap, would this
solve the problem? Not really, because which Comparator would we
use? Sometimes a Tab-order is defined, but not always. If it's not
defined, should we order the fields in the order in which they appear
in the /Fields array? Or does it make more sense to order them based
on the order of the widget annotations in the /Annots array? Another
option is to order them based on their position on the page. In short:
this is not a decision iText should make.
However, if somebody would like to solve this problem, we could create
a Comparator member variable for PdfStamperImp. If such a
Comparator is provided (we could even provide some implementations),
then the flattening process would be executed in the order defined by
the Comparator.
This ticket has received a very low priority (I assume that you're not a customer of one of the iText Software companies), but while writing this ticket, I had another idea.
In the comments, I already referred to the question underline portion of text using iTextSharp. Following that approach, you'd get all the field positions (using the getFieldPositions() method) and draw all the contents in the right order using ColumnText. This approach has several disadvantages: for the font, font size and font color to be correct, you'd have to examine the fields. That requires some programming.
I am now posting this as an answer, because I have a much better alternative: fill out the form in two passes! This is shown in the FillFormFieldOrder example. We fill out the form src resulting in the flattened form dest like this:
public void manipulatePdf(String src, String dest) throws DocumentException, IOException {
    go2(go1(src), dest);
}
As you can see, we execute the go1() method first:
public byte[] go1(String src) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    PdfStamper stamper = new PdfStamper(reader, baos);
    AcroFields form = stamper.getAcroFields();
    form.setField("sunday_1", "1");
    form.setField("sunday_2", "2");
    form.setField("sunday_3", "3");
    form.setField("sunday_4", "4");
    form.setField("sunday_5", "5");
    form.setField("sunday_6", "6");
    stamper.setFormFlattening(true);
    stamper.partialFormFlattening("sunday_1");
    stamper.partialFormFlattening("sunday_2");
    stamper.partialFormFlattening("sunday_3");
    stamper.partialFormFlattening("sunday_4");
    stamper.partialFormFlattening("sunday_5");
    stamper.partialFormFlattening("sunday_6");
    stamper.close();
    reader.close();
    return baos.toByteArray();
}
This fills out all the sunday_x fields and uses partial form flattening to flatten only those fields. The go1() method takes src as a parameter and returns a byte[] with the partially flattened form.
The byte[] will be used as a parameter for the go2() method, that takes dest as its second parameter. Now we are going to fill out the sunday_x_notes fields:
public void go2(byte[] src, String dest) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    AcroFields form = stamper.getAcroFields();
    form.setField("sunday_1_notes", "It's Sunday today, let's go to the sea");
    form.setField("sunday_2_notes", "It's Sunday today, let's go to the park");
    form.setField("sunday_3_notes", "It's Sunday today, let's go to the beach");
    form.setField("sunday_4_notes", "It's Sunday today, let's go to the woods");
    form.setField("sunday_5_notes", "It's Sunday today, let's go to the lake");
    form.setField("sunday_6_notes", "It's Sunday today, let's go to the river");
    stamper.setFormFlattening(true);
    stamper.close();
    reader.close();
}
As you can see, we now flatten all the fields. You no longer have to worry about the order of the fields, neither in the /Fields array nor in the /Annots array. The fields are filled out in the exact order you want: the notes cover the dates now, instead of the other way round.
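If you have more than two overlapping groups of fields, the same trick generalizes to n passes: assign each field a layer number and flatten one layer per pass, lowest first, so later passes end up drawn on top. A sketch of the pure ordering step (FlattenPasses and the fieldLayers map are illustrations, not iText API; each inner list would be filled and partially flattened in its own PdfStamper pass):

```java
import java.util.*;

public class FlattenPasses {

    // Group field names into flattening passes by layer number.
    // Lower layers are flattened first, so higher layers end up on top.
    public static List<List<String>> passes(Map<String, Integer> fieldLayers) {
        TreeMap<Integer, List<String>> byLayer = new TreeMap<>();
        for (Map.Entry<String, Integer> e : fieldLayers.entrySet()) {
            byLayer.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        List<List<String>> result = new ArrayList<>(byLayer.values());
        for (List<String> pass : result) {
            Collections.sort(pass); // deterministic order within a pass
        }
        return result;
    }
}
```

Each pass would mirror go1() above: set the values of that pass's fields, call partialFormFlattening() for each of them, close the stamper, and feed the resulting bytes into the next pass.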

PdfTextExtractor.GetTextFromPage is not returning correct text

Using iTextSharp, I have the following code, that successfully pulls out the PDF's text for the majority of PDF's I'm trying to read...
PdfReader reader = new PdfReader(fileName);
for (int i = 1; i <= reader.NumberOfPages; i++)
{
    text += PdfTextExtractor.GetTextFromPage(reader, i);
}
reader.Close();
However, some of my PDF's have XFA forms (which have already been filled out), and this causes the 'text' field to be filled with the following garbage...
"Please wait... \n \nIf this message is not eventually replaced by the proper contents of the document, your PDF \nviewer may not be able to display this type of document. \n \nYou can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by \nvisiting http://www.adobe.com/products/acrobat/readstep2.html. \n \nFor more assistance with Adobe Reader visit http://www.adobe.com/support/products/\nacrreader.html. \n \nWindows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark \nof Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other \ncountries."
How can I work around this? I tried using the PdfStamper[1] from iTextSharp to flatten the PDF, but that didn't work - the resultant stream had the same garbage text.
[1]How to flatten already filled out PDF form using iTextSharp
You are confronted with a PDF that acts as a container for an XML stream. This XML stream is based on the XML Forms Architecture (XFA). The message you see is not garbage! It is the message contained in a PDF page that is shown when the document is opened in a viewer that reads the file as if it were an ordinary PDF.
For instance: if you open the document in Apple Preview, you will see the exact same message, because Apple Preview is not able to render an XFA form. It should not surprise you that you get this message when parsing the PDF contained in your file using iText. That is exactly the PDF content that is present in your file. The content you see when opening the document in Adobe Reader isn't stored in PDF syntax, it is stored as an XML stream.
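Because that dummy-page message is the only real page content in such files, its presence in the extracted text can serve as a rough heuristic for detecting an XFA container before deciding how to process a file. This is an assumption-based sketch, not an official API (checking the AcroForm dictionary's /XFA entry is the authoritative test):

```java
public class XfaDetector {

    // The dummy page of an XFA container starts with this message.
    private static final String DUMMY_PAGE_MARKER = "Please wait...";

    // Rough heuristic: if the extracted page text is the Adobe dummy page,
    // the real content lives in the XFA XML stream, not in the page content.
    public static boolean looksLikeXfaDummyPage(String extractedText) {
        return extractedText != null
                && extractedText.trim().startsWith(DUMMY_PAGE_MARKER);
    }
}
```

You could run this on the output of PdfTextExtractor.GetTextFromPage for page 1 and fall back to XFA-aware processing when it returns true.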
You say that you've tried to flatten the PDF as described in the answer to the question How to flatten already filled out PDF form using iTextSharp.
However, that question is about flattening a form based on AcroForm technology. It is not supposed to work with XFA forms. If you want to flatten an XFA form, you need to use XFA Worker on top of iText:
[JAVA]
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(dest));
XFAFlattener xfaf = new XFAFlattener(document, writer);
xfaf.flatten(new PdfReader(baos.toByteArray()));
document.close();
[C#]
Document document = new Document();
PdfWriter writer = PdfWriter.GetInstance(document, new FileStream(dest, FileMode.Create));
XFAFlattener xfaf = new XFAFlattener(document, writer);
ms.Position = 0;
xfaf.Flatten(new PdfReader(ms));
document.Close();
The result of this flattening process is an ordinary PDF that can be parsed by your original code.