Word to HTML fields in header and footer - docx4j

I'm using docx4j to convert a Word template to several HTML files, one per chapter.
The Word template has several custom properties referenced by DOCPROPERTY fields, stored as both simple and complex fields. I populate those properties with Freemarker code, so that the converted HTML contains interpolations like ${...} or directives like [#... /].
In a later step I look for "heading 1" paragraphs to identify chapters and split the document into several Word documents; these documents are then converted to HTML and written to temporary files.
Each document is successfully converted to HTML and fields are correctly replaced with my markers, but the output is wrong for header and footer parts: field codes are written before field values (e.g. DOCPROPERTY "PROPERTY_NAME" \* MERGEFORMAT ${constants['PROPERTY_NAME']} ) instead of field values only (e.g. ${constants['PROPERTY_NAME']} ).
If I write the updated document to a docx file instead, the generated document looks correct.
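To make the expected output concrete: in WordprocessingML a complex field is a run sequence delimited by fldChar markers (begin, separate, end), where the instruction sits between begin and separate and the displayed result sits between separate and end. The following is only a minimal sketch of "keep the result, drop the instruction" on a simplified model. The class name, the string markers and the flat list of runs are hypothetical stand-ins, not docx4j API, and nested fields are not handled.

```java
import java.util.Arrays;
import java.util.List;

public class FieldResultText {

    // Hypothetical markers standing in for fldChar runs of type begin/separate/end.
    public static final String BEGIN = "<begin>";
    public static final String SEPARATE = "<separate>";
    public static final String END = "<end>";

    /**
     * Concatenates the visible text of a run sequence: text between BEGIN and
     * SEPARATE (the field code) is skipped, everything else is kept.
     * Assumes fields are not nested.
     */
    public static String visibleText(List<String> runs) {
        StringBuilder out = new StringBuilder();
        boolean inInstruction = false;
        for (String run : runs) {
            if (BEGIN.equals(run)) {
                inInstruction = true;
            } else if (SEPARATE.equals(run) || END.equals(run)) {
                inInstruction = false;
            } else if (!inInstruction) {
                out.append(run);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> footer = Arrays.asList(
                BEGIN,
                " DOCPROPERTY \"PROPERTY_NAME\" \\* MERGEFORMAT ",
                SEPARATE,
                "${constants['PROPERTY_NAME']}",
                END);
        System.out.println(visibleText(footer));
    }
}
```

The output described above as wrong corresponds to emitting the instruction text as well as the result text.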
If it's useful to solve the problem, this is what I do to split the document (per chapter):
clone the updated WordprocessingMLPackage (clone method)
delete every root element before the chapter's "heading 1" element
delete every root element from the "heading 1" element of the next chapter
convert the cloned and cleaned document
(Actually I don't call the clone method every time: I write the updated document to a ByteArrayOutputStream once and then read it back for every chapter, inspired by the source of the clone method.)
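The splitting steps above can be sketched independently of docx4j. This is a minimal illustration, assuming each top-level body element has already been reduced to its paragraph style id and that chapter headings use the id "Heading1" (an assumption; the actual id depends on the template). The class and method names are hypothetical, and the code stays within Java 6 syntax.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChapterRanges {

    /**
     * Returns one {start, end} pair per chapter: start is the index of a
     * "Heading1" element, end (exclusive) is the index of the next one, or
     * the list size for the last chapter. Elements before the first heading
     * belong to no chapter.
     */
    public static List<int[]> chapterRanges(List<String> styleIds) {
        List<int[]> ranges = new ArrayList<int[]>();
        int start = -1;
        for (int i = 0; i < styleIds.size(); i++) {
            if ("Heading1".equals(styleIds.get(i))) {
                if (start >= 0) {
                    ranges.add(new int[] { start, i });
                }
                start = i;
            }
        }
        if (start >= 0) {
            ranges.add(new int[] { start, styleIds.size() });
        }
        return ranges;
    }

    /** Returns a copy of the body that keeps only the elements of one chapter. */
    public static <T> List<T> keepChapter(List<T> body, int[] range) {
        return new ArrayList<T>(body.subList(range[0], range[1]));
    }

    public static void main(String[] args) {
        List<String> styles = Arrays.asList(
                "Title", "Heading1", "Normal", "Heading1", "Normal", "Normal");
        for (int[] r : chapterRanges(styles)) {
            System.out.println(r[0] + ".." + r[1]);
        }
    }
}
```

Computing the ranges once on the full document and then trimming each fresh copy to one range keeps the per-chapter indices stable, since they always refer to the unmodified element list.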
I suspect this is a docx4j bug. Has anybody else tried something similar?
Finally these are my platform details:
JDK 1.6
Docx4J v3.2.2
Thanks in advance for any help
EDIT
To produce freemarker markers in place of Word fields, I set document property values as follows:
1. Traverse the document looking for simple or complex fields with new TraversalUtil(wordMLPackage.getMainDocumentPart().getContent(), visitor);, where visitor is my custom callback that looks for fields and sets properties.
2. While traversing the document I look for:
   - FldChar elements with type BEGIN, which I parse using FieldsPreprocessor.canonicalise((P) ((R) fc.getParent()).getParent(), fields); (I don't use the return value of canonicalise), where fc is the found FldChar and fields is an empty ArrayList<FieldRef>; then I extract and parse the field's instrText.
   - CTSimpleField elements, which I parse using FldSimpleModel fldSimpleModel = new FldSimpleModel(); fldSimpleModel.build((CTSimpleField) o, null);; then I use fldSimpleModel.getFldArgument() to get the property name.
3. I look up the Freemarker code to show in place of the current field and set it as the property value using wordMLPackage.getDocPropsCustomPart().setProperty(propertyName, finalValue);
4. I then repeat the process from step 1 for headers and footers, as follows:
List<Relationship> rels = wordMLPackage.getMainDocumentPart().getRelationshipsPart().getRelationships().getRelationship();
for (Relationship rel : rels) {
    Part p = wordMLPackage.getMainDocumentPart().getRelationshipsPart().getPart(rel);
    if (p == null) {
        continue;
    }
    if (p instanceof ContentAccessor) {
        new TraversalUtil(((ContentAccessor) p).getContent(), visitor);
    }
}
5. Finally I update fields as follows:
FieldUpdater updater = new FieldUpdater(wordMLPackage);
try {
updater.update(true);
} catch (Docx4JException ex) {
Logger.getLogger(WorkerDocx4J.class.getName()).log(Level.SEVERE, null, ex);
}
After filling all field properties, I clone the document as previously described and convert filtered cloned instances using
HTMLSettings settings = Docx4J.createHTMLSettings();
settings.setWmlPackage(wordDoc);
settings.setImageHandler(new InlineImageHandler(myDataModel));
Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML", true);
ByteArrayOutputStream os = new ByteArrayOutputStream();
os.write("[#ftl]\r\n".getBytes("UTF-8"));
Docx4J.toHTML(settings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);
String template = new String(os.toByteArray(), "UTF-8");
The template variable then holds the resulting Freemarker template.
The following XML is the content of footer1.xml part of the document generated after updating the document properties as described: footer1.xml after field updates
The very strange thing (in my opinion) is that if some properties are not found, step 5 throws an exception (OK), field updating stops at the offending field (OK), and all fields in the header and footer are then rendered correctly. In this case, this is the content of footer1.xml.
In the latter case, the fields are defined in a different way. I think the HTML converter handles the latter case well and does something wrong in the former.
Am I doing something wrong, or is there something I could do better?

Related

iText 7 need to skip reading page header elements

I am using an event handler to create the page header for my PDF. The header content is added to a Table before being added to a Canvas. As part of 508 compliance, I need to exclude the header content from being read out loud. How do I accomplish this?
public class TEirHeaderEventHandler : IEventHandler
{
public void HandleEvent(Event e)
{
PdfDocumentEvent docEvent = (PdfDocumentEvent)e;
PdfDocument pdf = docEvent.GetDocument();
PdfPage page = docEvent.GetPage();
PdfCanvas headerPdfCanvas = new PdfCanvas(page.NewContentStreamBefore(), page.GetResources(), pdf);
Rectangle headerRect = new Rectangle(60, 725, 495, 96);
Canvas headerCanvas = new Canvas(headerPdfCanvas, pdf, headerRect);
//creating content for header
CreateHeaderContent(headerCanvas);
headerCanvas.Close();
}
private void CreateHeaderContent(Canvas canvas)
{
//Create header content
Table table = new Table(UnitValue.CreatePercentArray(new float[] { 60, 25, 15 } ));
table.SetWidth(UnitValue.CreatePercentValue(100));
Cell cell1 = new Cell().Add(new Paragraph("Establishment Inspection Report").SetBold().SetTextAlignment(TextAlignment.LEFT));
cell1.SetBorder(Border.NO_BORDER);
table.AddCell(cell1);
Cell cell2 = new Cell().Add(new Paragraph("FEI Number:").SetBold().SetTextAlignment(TextAlignment.RIGHT));
cell2.SetBorder(Border.NO_BORDER);
table.AddCell(cell2);
Cell cell3 = new Cell().Add(new Paragraph(_feiNum).SetBold().SetTextAlignment(TextAlignment.RIGHT));
cell3.SetBorder(Border.NO_BORDER);
table.AddCell(cell3);
canvas.Add(table);
}
}
public static void CreatePdf()
{
using (MemoryStream writeStream = new MemoryStream())
using (FileStream inputHtmlStream = File.OpenRead(inputHtmlFile))
{
PdfDocument pdf = new PdfDocument(new PdfWriter(writeStream));
pdf.SetTagged();
iTextDocument document = new iTextDocument(pdf);
TEirHeaderEventHandler teirEvent = new TEirHeaderEventHandler();
pdf.AddEventHandler(PdfDocumentEvent.START_PAGE, teirEvent);
//Convert html to pdf
HtmlConverter.ConvertToDocument(inputHtmlStream, pdf, properties);
document.Close();
byte[] bytes = TEirReorderingPages(writeStream, numOfPages);
File.WriteAllBytes(outputPdfFile, bytes);
}
}
Note that I have set the document to be tagged, but I still get the "Reading Untagged Document" screen when I open the file. However, all of the content is read, including the header, when I activate the Read Out Loud feature. Any input or suggestion would be appreciated. Thank you in advance for your help.
General
The approach suggested by Alexey Subach is generally correct. You mark the content as artifact to differentiate it from real content.
element.getAccessibilityProperties().setRole(StandardRoles.ARTIFACT);
This marks the content in the content stream and it excludes the element from the structure tree.
Your case
However, your specific case is more nuanced.
For a well tagged PDF document, the proper way to read it out loud is to process the structure tree, which is a data structure that represents the logical reading order of the (semantic) elements of the document, such as paragraphs, tables and lists.
Because of the way you are creating the header content, it is not automatically tagged: a Canvas instance that is created from a PdfCanvas instance has autotagging disabled by default. So the table in the header is not marked in the content stream and it is not included in the structure tree. Marking it explicitly as an artifact, with the approach described above in General, should not make a significant difference because it was not in the structure tree to begin with.
If you enable autotagging by adding headerCanvas.enableAutoTagging(page), you will notice that the table does appear in the structure tree.
If you then add table.getAccessibilityProperties().setRole(StandardRoles.ARTIFACT), the table is excluded from the structure tree again.
Summary: looking at the structure tree, there's no difference between your original code and the approach of General.
Adobe reading order / accessibility settings
From your description, I think you are using Adobe Acrobat or Reader for the read out loud functionality. Under Preferences > Reading > Reading Order Options, you can configure how the content should be processed for the read out loud feature:
From https://helpx.adobe.com/reader/using/accessibility-features.html:
Infer Reading Order From Document (Recommended): Interprets the reading order of untagged documents by using an advanced method of structure inference layout analysis.
Left-To-Right, Top-To-Bottom Reading Order: Delivers the text according to its placement on the page, reading from left to right and then top to bottom. This method is faster than Infer Reading Order From Document. This method analyzes text only; form fields are ignored and tables aren’t recognized as such.
Override The Reading Order In Tagged Documents: Uses the reading order specified in the Reading preferences instead what the tag structure of the document specifies. Use this preference only when you encounter problems in poorly tagged PDFs.
In my tests, the only way I can make Adobe Reader read out loud the header content created with your original code, is when I select Left-To-Right, Top-To-Bottom Reading Order and enable Override The Reading Order In Tagged Documents. In that case, it is basically ignoring the tagging and just processing the content per the location on the page.
With Override The Reading Order In Tagged Documents disabled, the header content is not read, for your original code and with explicit artifacts.
Conclusion
Although it's a good idea to always tag artifacts as such, so they can be properly differentiated from real content, in this case I believe the behaviour you're experiencing is more related to application configuration than to file structure.
Headers and footers are typically pagination artifacts and should be marked as such in the following way:
table.getAccessibilityProperties().setRole(StandardRoles.ARTIFACT);
This will exclude the table from being read. Please note that you can mark any element implementing the IAccessibleElement interface as an artifact.

Birt export in pdf does not wordwrap long lines

My reports preview is ok.
But now I need to export to PDF, and I've got an issue: the content of some cells is truncated to the width of the column.
For instance, one cell should display "BASELINE". In the preview it's OK, but in the PDF it displays "BASEL".
I've been looking for a solution the whole day and did not find anything.
Of course, I don't want to fit the column width to the length of the word "BASELINE", because the content is dynamic.
Instead, I want to fix the column width, and the cell should then display something like this:
BASEL
INE
Any idea ?
Thanks in advance (I'm a little bit desperate...)
The solution is trivial in BIRT v4.9 if you've integrated the engine into your java code. Just set the PDF rendering options.
RenderOption options = new PDFRenderOption();
options.setOutputStream(out);
options.setOutputFormat("pdf");
options.setOption(PDFRenderOption.PDF_WORDBREAK, true);
options.setOption(PDFRenderOption.PDF_TEXT_WRAPPING, true);
task.setRenderOption(options);
You have to set a special PDF emitter option:
PDFRenderOption options = new PDFRenderOption();
options.setOption(PDFRenderOption.PDF_HYPHENATION, true);
This is if you integrated BIRT into your Java program.
For the viewer servlet, it is possible to set such options, too, AFAIK, but I don't know how; maybe on the URL or using environment variables.
I had the same issue. I found a very good topic about that : http://developer.actuate.com/community/forum/index.php?/topic/19827-how-do-i-use-word-wrap-in-report-design/?s=173b4ad992e47395e2c8b9070c2d3cce
This will split the string into chunks of the number of characters you want.
Here is the function to add to a functions.js file (for example). For context, my report project has several folders: one for the reports, one for the templates, one for the libraries and another for the resources; I added this JS file to the resources folder.
/**
 * Format a long String to be smaller and be entirely shown
 *
 * @param longStr
 *            the String to split
 * @param width
 *            the maximum number of characters per line
 *
 * @returns the split string
 */
function wrap(longStr, width) {
    // Declared with var so the variable does not leak into the global scope
    var length = longStr.length;
    if (length <= width)
        return longStr;
    return (longStr.substring(0, width) + "\n" + wrap(longStr.substring(width, length), width));
}
You will have to add this JS file to the reports: in the properties -> Resources -> Javascript files
This is working for me.
Note: you can add this function to your data directly if you only need it once.
The disadvantage: you have to specify a maximum number of characters per line, and you can end up with blank space in the column if the number you specify is too small to fill the column.
But this is the best way I found. Let me know if you find something else and if it's working.
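If the splitting is done in Java before the value reaches the report (for example when BIRT is embedded in an application), the same fixed-width split can be written iteratively. This is a minimal sketch, not part of the BIRT API; the class name is hypothetical and, like the script version, it breaks strictly every width characters without regard to word boundaries.

```java
public class TextWrap {

    /**
     * Inserts a line break after every `width` characters. Strings that
     * already fit (and null) are returned unchanged.
     */
    public static String wrap(String longStr, int width) {
        if (width <= 0) {
            throw new IllegalArgumentException("width must be positive");
        }
        if (longStr == null || longStr.length() <= width) {
            return longStr;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < longStr.length(); i += width) {
            if (i > 0) {
                out.append('\n');
            }
            // Append the next chunk, which may be shorter at the end
            out.append(longStr, i, Math.min(i + width, longStr.length()));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints BASEL and INE on two lines
        System.out.println(wrap("BASELINE", 5));
    }
}
```

The loop avoids the recursion depth that the script version would reach on very long strings with a small width.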

PDFBox: Fill out a PDF by repeatedly adding a one-page template containing a form

Following the SO question "Java pdfBox: Fill out pdf form, append it to pddocument, and repeat", I had trouble appending a cloned page to a new PDF.
The code from that page seemed really interesting, but didn't work for me.
Actually, the answer doesn't work because it is always the same PDField that gets modified and added to the list. So the next time you call 'getField' with the initial name, the field isn't found and you get an NPE. I tried with the same pdfbox version (1.8.12) used in the nice GitHub project, but can't understand how the author gets this working.
I had the same issue today, trying to append a form to pages with different values in it. I wondered whether the solution might be to duplicate the field, but couldn't manage to do it properly: I always ended up with a PDF containing the same values in each form.
(I provided a link to the template document for Mkl, but have since removed it because the document doesn't belong to me.)
Edit: Following Mkl's advice, I figured out what I was missing, but performance is really bad when every page is duplicated, and the file size isn't satisfying. Maybe there's a way to optimize this by reusing similar parts of the PDF.
Finally I got it working without reloading the template each time, so the resulting file is as I wanted: not too big (4 MB for 164 pages).
I think I made two mistakes before: one in page creation, and probably one in field duplication.
So here is the working code, in case someone gets stuck on the same problem.
Form creation:
PDAcroForm finalForm = new PDAcroForm(finalDoc, new COSDictionary());
finalForm.setDefaultResources(originForm.getDefaultResources());
Page creation:
PDPage clonedPage = templateDocument.getPage(0);
COSDictionary clonedDict = new COSDictionary(clonedPage.getCOSObject());
clonedDict.removeItem(COSName.ANNOTS);
clonedPage = new PDPage(clonedDict);
finalDoc.addPage(clonedPage);
Field duplication: (rename field to become unique and set value)
PDTextField field = (PDTextField) originForm.getField(fieldName);
PDPage page = finalDoc.getPages().get(nPage);
PDTextField clonedField = new PDTextField(finalForm);
List<PDAnnotationWidget> widgetList = new ArrayList<>();
for (PDAnnotationWidget paw : field.getWidgets()) {
PDAnnotationWidget newWidget = new PDAnnotationWidget();
newWidget.getCOSObject().setString(COSName.DA, paw.getCOSObject().getString(COSName.DA));
newWidget.setRectangle(paw.getRectangle());
widgetList.add(newWidget);
}
clonedField.setQ(field.getQ()); // To get text centered
clonedField.setWidgets(widgetList);
clonedField.setValue(value);
clonedField.setPartialName(fieldName + cnt++);
fields.add(clonedField);
page.getAnnotations().addAll(clonedField.getWidgets());
And at the end of the process:
finalDoc.getDocumentCatalog().setAcroForm(finalForm);
finalForm.setFields(fields);
finalForm.flatten();

Can I define a text property as rich text?

VS 2013, VB, EF6
I am creating an object that will keep user input in one of its properties. I would like that user input to be stored as rich text. What's involved to make that stored text be rich text format? So,
Public Property Text as <what?>
I thought I would post what was my answer for others who might ask the question the same way I did. I begin by stating that my question was poorly formed because I didn't understand I'm not really storing RTF, I'm storing WYSIWYG text with html tags. But I think the question as phrased is useful because that's how many people think until they are taught by others.
Ultimately this process opens a serious XSS vector, but first we have to at least collect the WYSIWYG text.
First step: using a script-based editor capture the text with html tags. I used CKEditor which is easy to download on NuGet. It comes in 3 flavors: basic, standard and full. Another popular one seems to be TinyMCE also available through NuGet.
CKEditor must be 'wired in' to replace the existing input element. I replaced @Html.EditorFor with a <textarea> directly, as follows. Model.UserPost.Body is the property into which I want to place the WYSIWYG text. The Raw helper is required so the output is NOT encoded, allowing us to see our WYSIWYG text.
<textarea name="model.UserPost.Body" id="model_UserPost_Body" class="form-control text-box multi-line">
@Html.Raw(Model.UserPost.Body)
</textarea>
CKEditor is 'wired in' using a script element to replace the <textarea> element.
@Section Scripts
<script src="~/scripts/ckeditor/ckeditor.js"></script>
<script>
CKEDITOR.replace('model.UserPost.Body');
</script>
End Section
The script above can be added to all pages via _Layout.vbhtml, or just to the target page via an @Section Scripts section as shown above, which is often recommended and is what I did. That may also require adding the following to the <head> section of the standard _Layout:
@RenderSection("Styles", False)
In the controller POST method for the view the following code is needed to capture the WYSIWYG text otherwise the default filter will raise an exception when it detects anything that looks like an html tag.
Dim rawBody = Request.Unvalidated.Form("model.UserPost.Body")
userPost.Body = rawBody
There are some possible gotchas. The 'body' property has to be removed from the Include:= list of the <Bind> attribute in the method parameter list, if <Bind> is being used. Also, although not directly related to this solution, you can't have a Data Annotation like <Required()> on this property in the model, because background checking won't be able to confirm that condition, so the ModelState.IsValid flag won't ever go true.
Second step: before saving the input it MUST be checked for XSS. Microsoft has a nice video explaining basic XSS that I recommend viewing; it's only 11 minutes.
Mikesdotnetting has a nice explanation of dealing with XSS and shows a whitelisting algorithm toward the bottom of this page. The following code is based on his work.
To create a whitelisting approach, the HTML Agility Pack is useful to catalogue the HTML nodes for review. This is easily loaded from NuGet as well. This is the code I used in the POST method to invoke the whitelist methods (yes, it could be more compact, but this is easier to read for us novices):
Dim tempDoc = New HtmlDocument()
tempDoc.LoadHtml(rawBody)
RemoveNodes(tempDoc.DocumentNode, allowedTags)
userPost.Body = tempDoc.DocumentNode.OuterHtml
The allowed tags are what you will allow, which means everything else is rejected, hence whitelisting. This is just a sample list:
Dim allowedTags As New List(Of String)() From {"p", "em", "s", "ol", "ul", "li", "h1", "h2", "h3", "h4", "h5", "h6", "strong"}
These are the methods based on Mikesdotnetting page:
Private Sub RemoveNodes(ByVal node As HtmlNode, allowedTags As List(Of String))
If (node.NodeType = HtmlNodeType.Element) Then
If Not allowedTags.Contains(node.Name) Then
node.ParentNode.RemoveChild(node)
Exit Sub
End If
End If
If (node.HasChildNodes) Then
RemoveChildren(node, allowedTags)
End If
End Sub
Private Sub RemoveChildren(ByVal parent As HtmlNode, allowedTags As List(Of String))
For i = parent.ChildNodes.Count() - 1 To 0 Step -1
RemoveNodes(parent.ChildNodes(i), allowedTags)
Next
End Sub
So basically, (1) CKEditor captures user input with html tags that looks nice, (2) the raw input is specially requested in the Controller POST method and then (3) cleaned using a white list. After that it can be output directly to the page using @Html.Raw() because it can be trusted.
That's it. I've not really posted solutions like this before, so if I've missed something let me know and I'll correct or add it.
Rich Text is stored in the Rich Text Format.
The Rich Text Format specifications can be found here:
http://www.microsoft.com/en-us/download/details.aspx?id=10725
It is just an ordinary string. You can extract the string from a RichTextBox using the SaveFile function:
Private Function GetRTF(ByRef Box As RichTextBox) As String
Using ms As New IO.MemoryStream
Box.SaveFile(ms, RichTextBoxStreamType.RichText)
Return System.Text.Encoding.ASCII.GetString(ms.ToArray)
End Using
End Function
You can load text in the Rich Text Format into a RichTextBox using the LoadFile method of the RichTextBox. The text needs to be in the correct format:
Dim rtf As String = "{\rtf1 {\colortbl;\red0\green0\blue255;\red255\green0\blue0;}Guten Tag!\line{\i Dies} ist ein\line formatierter {\b Text}.\line Das {\cf1 Ende}.}"
Using ms As New IO.MemoryStream(System.Text.Encoding.ASCII.GetBytes(rtf))
RichTextBox1.LoadFile(ms, RichTextBoxStreamType.RichText)
End Using
Ordinary controls usually will not interpret this format in their text property.

How to custom tag word(s) in GATE JAPE grammar?

I have a set of documents and each document has different heading. Example if document heading says "Psychological Evaluation" I want to tag the document as "Medicalrule".
I loaded the document and loaded ANNIE with defaults.
In Processing Resources > New > Jape Transducer
2.1 I wrote the following code in a text file and saved it with the .jape extension
CODE :
Phase: ConjunctionIdentifier
Input: Token Split
Rule: Medicalrule
(
({Token.string=="Psychological"})+({Token.string == " "})+ ({Token.string == "Evaluation"}):Meddoc({Token.kind=="word"})
)
-->
:Meddoc
{
gate.AnnotationSet matchedAnns = (gate.AnnotationSet) bindings.get("Meddoc");
gate.FeatureMap newFeatures = Factory.newFeatureMap();
newFeatures.put("rule", "Medicalrule");
annotations.add(matchedAnns.firstNode(), matchedAnns.lastNode(), "CC", newFeatures);
}
Loaded the above created .JAPE file and reinitialized
After the application is run the Annotation Set does not show the tag !
Am I doing something wrong somewhere? It would be great if someone could help me with this.
Appreciate your time.
Thank you
I'm sure that there is no annotation like: Token.string == " ". Try to use a SpaceToken annotation instead.
Also, why not try gazetteers instead of hardcoding text values into the JAPE code?
There are three issues I can see here.
First, as ashingel says, spaces are not represented as Token annotations - this is deliberate as in most cases you don't care about the spacing between words, only the words themselves.
Second, the trailing ({Token.kind=="word"}) means that the rule will only match when "Psychological Evaluation" is followed by another word before the end of the current sentence (because you've got Split in the Input line).
Third, you're only binding the Meddoc label to the "Evaluation" token, not to the whole match.
I would try and simplify the LHS of the rule:
Phase: ConjunctionIdentifier
Input: Token Split
Rule: Medicalrule
(
{Token.string=="Psychological"}
{Token.string == "Evaluation"}
):meddoc
and for the RHS (a) you don't need to do the explicit bindings.get because you've used a labelled block so you already have the bound annots available, (b) you should use outputAS instead of annotations, and (c) you should generally avoid the add method that takes nodes, as it isn't safe if the input and output annotation sets are different. If you're using a recent snapshot of GATE then the gate.Utils static methods can help you a lot here
:meddoc {
Utils.addAnn(outputAS, meddocAnnots,"CC",
Utils.featureMap("rule","Medicalrule"));
}
If you're using 7.1 or earlier then the addAnn method isn't available so it's slightly more convoluted:
:meddoc {
try {
outputAS.add(Utils.start(meddocAnnots), Utils.end(meddocAnnots),"CC",
Utils.featureMap("rule","Medicalrule"));
} catch(InvalidOffsetException e) { // can't happen, but won't compile without
throw new JapeException(e);
}
}
Finally, just to check, you did definitely add your new JAPE Transducer PR to the end of the pipeline?