docx4j: Use docx template and replace value of MergeField variable

So I'm trying to use docx4j. I made a simple template, and I want to load it and replace the values programmatically.
Here's my docx template.
My Name is «myName»
I’m «myAge» years old.
My address is «myAddress»
When I click toggle field codes it looks like this.
My Name is {MERGEFIELD myName \* MERGEFORMAT}
I’m {MERGEFIELD myAge \* MERGEFORMAT} years old.
My address is {MERGEFIELD myAddress \* MERGEFORMAT}
I do not know what MERGEFIELD is for, but I tried it simply because that's what the posts on the internet currently use.
Here's my code.
def config = grailsApplication.config.template.invoiceSample as Map
def String templateFileName = "${grailsApplication.config.template.dir}/${config.doc.templateFileName}"
WordprocessingMLPackage template = WordprocessingMLPackage.load(new File(templateFileName));
Map<DataFieldName, String> map = new HashMap<DataFieldName, String>();
map.put(new DataFieldName("#myName"),"Jean");
map.put(new DataFieldName("#myAge"),"30");
map.put(new DataFieldName("#myAddress"),"Sampaloc");
org.docx4j.model.fields.merge.MailMerger.setMERGEFIELDInOutput(MailMerger.OutputField.KEEP_MERGEFIELD);
org.docx4j.model.fields.merge.MailMerger.performMerge(template, map, false);
template.save(new File("C:/temp/OUT_SIMPLE.docx") );
However, it doesn't replace the values in the output file. Could you tell me what is wrong with my code?

I found out that this problem is easy to fix.
Basically this line affects the output.
MailMerger.setMERGEFIELDInOutput(MailMerger.OutputField.KEEP_MERGEFIELD);
Based on my observation, these options do the following:
MailMerger.OutputField.KEEP_MERGEFIELD - replaces the field with the value, but you need to click Toggle Field Codes to reveal it.
MailMerger.OutputField.REMOVED - the field codes are removed and, most importantly, the value is shown without needing to click Toggle Field Codes.
MailMerger.OutputField.AS_FORMTEXT_REGULAR - I honestly don't know what it does; it just shows {FORMTEXT} in Word.
MailMerger.OutputField.DEFAULT - does the same thing as REMOVED.
Please feel free to update my answer if you think I've said something wrong.
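For readers who, like the asker, are unsure what a MERGEFIELD code carries: the instruction text between the braces is essentially a field name plus formatting switches, and a merge looks that name up in the data map. Here is a minimal plain-Java sketch of that idea. It is not how docx4j implements the merge; MergeFieldSketch, fieldName and resolve are hypothetical names made up for illustration.

```java
import java.util.Collections;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of docx4j: shows what a MERGEFIELD instruction contains.
class MergeFieldSketch {

    // Matches e.g. "MERGEFIELD myName \* MERGEFORMAT" and captures the field name.
    private static final Pattern INSTRUCTION =
            Pattern.compile("\\s*MERGEFIELD\\s+\"?([^\\s\"]+)\"?.*");

    /** Returns the field name of a MERGEFIELD instruction, or null for any other field. */
    static String fieldName(String instruction) {
        Matcher m = INSTRUCTION.matcher(instruction);
        return m.matches() ? m.group(1) : null;
    }

    /** Returns the value for the field, or a placeholder in guillemets if no value is known. */
    static String resolve(String instruction, Map<String, String> data) {
        String name = fieldName(instruction);
        if (name == null) {
            return instruction; // not a MERGEFIELD, leave untouched
        }
        String value = data.get(name);
        return value != null ? value : "\u00AB" + name + "\u00BB";
    }

    public static void main(String[] args) {
        Map<String, String> data = Collections.singletonMap("myName", "Jean");
        System.out.println(resolve("MERGEFIELD myName \\* MERGEFORMAT", data));
        System.out.println(resolve("MERGEFIELD myAge \\* MERGEFORMAT", data));
    }
}
```

The lookup here is case-sensitive for simplicity; a real merge may normalize names differently.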

Related

Birt export in pdf does not wordwrap long lines

My reports preview is ok.
But now I need to export to PDF, and I've got an issue: the content of some cells is truncated to the width of the column.
For instance, one cell should display "BASELINE". In the preview it's OK, but in the PDF it displays "BASEL".
I've been looking for a solution the whole day and did not find anything.
Of course, I don't want to fit the width of the column to the length of the word "BASELINE", because the content is dynamic.
Instead, I want to fix the column width, and then the cell should display something like this:
BASEL
INE
Any ideas?
Thanks in advance (I'm getting a little desperate...)
The solution is trivial in BIRT v4.9 if you've integrated the engine into your Java code. Just set the PDF rendering options.
RenderOption options = new PDFRenderOption();
options.setOutputStream(out);
options.setOutputFormat("pdf");
options.setOption(PDFRenderOption.PDF_WORDBREAK, true);
options.setOption(PDFRenderOption.PDF_TEXT_WRAPPING, true);
task.setRenderOption(options);
You have to set a special PDF emitter option:
PDFRenderOption options = new PDFRenderOption();
options.setOption(PDFRenderOption.PDF_HYPHENATION, true);
This is if you integrated BIRT into your Java program.
For the viewer servlet, it is possible to set such options, too, AFAIK, but I don't know how; maybe on the URL or using environment variables.
I had the same issue. I found a very good topic about it: http://developer.actuate.com/community/forum/index.php?/topic/19827-how-do-i-use-word-wrap-in-report-design/?s=173b4ad992e47395e2c8b9070c2d3cce
This will split the string into chunks of the number of characters you want.
Here is the function to add to a functions.js file (for example). For context, I created several folders in my report project: one for the reports, one for the templates, one for the libraries, and another for the resources. I added this JS file to the resources folder.
/**
 * Format a long string so that it is shown in full on several shorter lines.
 *
 * @param longStr
 *            the string to split
 * @param width
 *            the maximum number of characters per line
 *
 * @returns the split string, with "\n" inserted every width characters
 */
function wrap(longStr, width) {
    var length = longStr.length; // local variable, avoids creating a global
    if (length <= width)
        return longStr;
    return longStr.substring(0, width) + "\n" + wrap(longStr.substring(width, length), width);
}
You will have to add this JS file to the report: in the properties -> Resources -> JavaScript files.
This is working for me.
Note: you can add this function directly in your data if you only need it once.
The disadvantage of this approach: you have to specify a maximum number of characters, and you can end up with blank space in the column if you specify a number too small to fill the column.
But this is the best way I found. Let me know if you find something else and whether it works.
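If the values are prepared in Java before they reach the report (an assumption; the answer above does it in a JavaScript resource file), the same hard wrap can be sketched iteratively, which avoids the recursion. WrapSketch is just an illustrative name, not a BIRT class.

```java
// Illustrative only: a Java counterpart of the wrap() function above.
class WrapSketch {

    /** Inserts a line break after every `width` characters, ignoring word boundaries. */
    static String wrap(String text, int width) {
        if (width <= 0) {
            throw new IllegalArgumentException("width must be positive");
        }
        StringBuilder out = new StringBuilder();
        for (int start = 0; start < text.length(); start += width) {
            if (start > 0) {
                out.append('\n'); // break between chunks, not before the first one
            }
            out.append(text, start, Math.min(start + width, text.length()));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints BASEL on one line and INE on the next.
        System.out.println(wrap("BASELINE", 5));
    }
}
```

Like the JavaScript version, this cuts in the middle of words, so the width has to be tuned to the column by hand.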

PDFBox: Fill out a PDF by repeatedly adding a one-page template containing a form

Following the SO question "Java pdfBox: Fill out pdf form, append it to pddocument, and repeat", I had trouble appending a cloned page to a new PDF.
The code from that page seemed really interesting, but it didn't work for me.
Actually, the answer doesn't work because it is always the same PDField that you modify and add to the list. So the next time you call 'getField' with the initial name, it won't find the field and you get an NPE. I tried with the same PDFBox version (1.8.12) used in the nice GitHub project, but I can't understand how he got it working.
I had the same issue today trying to append a form to several pages with different values in it. I wondered whether the solution was to duplicate the field, but I couldn't manage to do it properly. I always ended up with a PDF containing the same values in each form.
(I provided a link to the template document for Mkl, but I have since removed it because it doesn't belong to me.)
Edit: Following Mkl's advice, I figured out what I was missing, but performance is really bad when duplicating every page, and the file size isn't satisfactory. Maybe there's a way to optimize this by reusing similar parts in the PDF.
Finally I got it working without reloading the template each time. So the resulting file is as I wanted: not too big (4 MB for 164 pages).
I think I made two mistakes before: one in page creation, and probably one in field duplication.
So here is the working code, in case someone gets stuck on the same problem.
Form creation:
PDAcroForm finalForm = new PDAcroForm(finalDoc, new COSDictionary());
finalForm.setDefaultResources(originForm.getDefaultResources());
Page creation:
PDPage clonedPage = templateDocument.getPage(0);
COSDictionary clonedDict = new COSDictionary(clonedPage.getCOSObject());
clonedDict.removeItem(COSName.ANNOTS);
clonedPage = new PDPage(clonedDict);
finalDoc.addPage(clonedPage);
Field duplication: (rename field to become unique and set value)
PDTextField field = (PDTextField) originForm.getField(fieldName);
PDPage page = finalDoc.getPages().get(nPage);
PDTextField clonedField = new PDTextField(finalForm);
List<PDAnnotationWidget> widgetList = new ArrayList<>();
for (PDAnnotationWidget paw : field.getWidgets()) {
    PDAnnotationWidget newWidget = new PDAnnotationWidget();
    newWidget.getCOSObject().setString(COSName.DA, paw.getCOSObject().getString(COSName.DA));
    newWidget.setRectangle(paw.getRectangle());
    widgetList.add(newWidget);
}
clonedField.setQ(field.getQ()); // To get text centered
clonedField.setWidgets(widgetList);
clonedField.setValue(value);
clonedField.setPartialName(fieldName + cnt++);
fields.add(clonedField);
page.getAnnotations().addAll(clonedField.getWidgets());
And at the end of the process:
finalDoc.getDocumentCatalog().setAcroForm(finalForm);
finalForm.setFields(fields);
finalForm.flatten();

Word to HTML fields in header and footer

I'm using docx4j to convert a Word template to several HTML files, one per chapter.
The Word template has several custom properties mapped by several fields (DOCPROPERTY ...) represented as both simple and complex fields. I populate those properties to obtain Freemarker code when the word document is converted to HTML (like ${...} or [#... /] directives).
In a later step I look for "heading 1" paragraphs to identify chapters and then split the document in several Word documents before conversion, then these documents are converted to HTML and written to temporary files.
Each document is successfully converted to HTML and fields are correctly replaced with my markers, but the conversion behaves incorrectly when it writes header and footer parts: field codes are written before field values (e.g. DOCPROPERTY "PROPERTY_NAME" \* MERGEFORMAT ${constants['PROPERTY_NAME']} ) instead of field values only (e.g. ${constants['PROPERTY_NAME']} ).
If I write the updated document to a docx file instead, nothing seems wrong in the generated document.
If it's useful to solve the problem, this is what I do to split the document (per chapter):
clone the updated WordprocessingMLPackage (clone method)
delete every root element before the chapter's "heading 1" element
delete every root element from the "heading 1" element of the next chapter
convert the cloned and cleaned document
(actually I don't use the clone method every time, but I write the updated document to a ByteArrayOutputStream and then read it for every chapter, inspired by the source of the clone method).
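The slicing logic of the steps above can be sketched independently of docx4j. In this minimal sketch, Block is a hypothetical stand-in for a body element (in the real document these would be the root elements of the main document part), and the style name "Heading1" is an assumption about how "heading 1" paragraphs are identified.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: splits a flat list of body elements into chapters at each "heading 1".
class ChapterSplitSketch {

    /** Hypothetical stand-in for a root element: just a style name and some text. */
    static final class Block {
        final String style;
        final String text;

        Block(String style, String text) {
            this.style = style;
            this.text = text;
        }
    }

    /**
     * Returns one list per chapter. Each chapter starts at its "Heading1" block and
     * ends right before the next one; blocks before the first heading are dropped.
     */
    static List<List<Block>> splitByHeading1(List<Block> body) {
        List<List<Block>> chapters = new ArrayList<>();
        List<Block> current = null;
        for (Block block : body) {
            if ("Heading1".equals(block.style)) {
                current = new ArrayList<>();
                chapters.add(current);
            }
            if (current != null) {
                current.add(block);
            }
        }
        return chapters;
    }

    public static void main(String[] args) {
        List<Block> body = new ArrayList<>();
        body.add(new Block("Normal", "cover text"));
        body.add(new Block("Heading1", "Chapter A"));
        body.add(new Block("Normal", "text of A"));
        body.add(new Block("Heading1", "Chapter B"));
        body.add(new Block("Normal", "text of B"));
        for (List<Block> chapter : splitByHeading1(body)) {
            System.out.println(chapter.get(0).text + ": " + chapter.size() + " blocks");
        }
    }
}
```

Deleting everything outside one of these slices from a cloned package is equivalent to the two delete steps described above.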
I suspect it's a docx4j bug. Has anybody else tried something similar?
Finally these are my platform details:
JDK 1.6
Docx4J v3.2.2
Thanks in advance for any help
EDIT
To produce freemarker markers in place of Word fields, I set document property values as follows:
1. traverse the document looking for simple or complex fields with new TraversalUtil(wordMLPackage.getMainDocumentPart().getContent(), visitor);, where visitor is my custom callback that looks for fields and sets properties
2. while traversing the document I look for
   - FldChar elements with type BEGIN, which I parse using FieldsPreprocessor.canonicalise((P) ((R) fc.getParent()).getParent(), fields); (I don't use the return value of canonicalise), where fc is the found FldChar and fields is an empty ArrayList<FieldRef>; then I extract and parse the field's instrText attribute
   - CTSimpleField elements, which I parse using FldSimpleModel fldSimpleModel = new FldSimpleModel(); fldSimpleModel.build((CTSimpleField) o, null); then I use fldSimpleModel.getFldArgument() to get the property name
3. I look for the FreeMarker code to show in place of the current field and set it as the property value using wordMLPackage.getDocPropsCustomPart().setProperty(propertyName, finalValue);
4. finally I do the same from step 1 for headers and footers as follows:
List<Relationship> rels = wordMLPackage.getMainDocumentPart().getRelationshipsPart().getRelationships().getRelationship();
for (Relationship rel : rels) {
    Part p = wordMLPackage.getMainDocumentPart().getRelationshipsPart().getPart(rel);
    if (p == null) {
        continue;
    }
    if (p instanceof ContentAccessor) {
        new TraversalUtil(((ContentAccessor) p).getContent(), visitor);
    }
}
5. Finally I update fields as follows
FieldUpdater updater = new FieldUpdater(wordMLPackage);
try {
updater.update(true);
} catch (Docx4JException ex) {
Logger.getLogger(WorkerDocx4J.class.getName()).log(Level.SEVERE, null, ex);
}
After filling all field properties, I clone the document as previously described and convert filtered cloned instances using
HTMLSettings settings = Docx4J.createHTMLSettings();
settings.setWmlPackage(wordDoc);
settings.setImageHandler(new InlineImageHandler(myDataModel));
Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML", true);
ByteArrayOutputStream os = new ByteArrayOutputStream();
os.write("[#ftl]\r\n".getBytes("UTF-8"));
Docx4J.toHTML(settings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);
String template = new String(os.toByteArray(), "UTF-8");
The template variable then holds the resulting FreeMarker template.
The following XML is the content of footer1.xml part of the document generated after updating the document properties as described: footer1.xml after field updates
The very strange thing (in my opinion) is that if some properties are not found, step 5 throws an exception (OK), field updating stops at the faulty field (OK), and all fields in the header and footer are rendered correctly. In this case, this is the content of footer1.xml.
In the latter case, fields are defined in a different way. I think the HTML converter handles the latter case well and does something wrong in the former.
Is there something I'm doing wrong, or something I could do better?

Clear TextField rather than append

I'm using the following statements to clear the text field value:
input.value("abc")
input.value("")
input.value("def")
But instead of clearing the field and setting the new value, it appends the new value to the old one ('abcdef').
Is there any way to clear the TextField before setting a new value?
You can clear using the selenium element:
input.firstElement().clear()
And you can send keys using << like so:
input << "abc"
You can use the Selenium Keys to backspace over the text that you have already entered. There are many different ways to accomplish that; here is a simple one:
import org.openqa.selenium.Keys
input.value("abc")
input.value(Keys.chord(Keys.CONTROL, "A")+Keys.BACK_SPACE)
input.value("def")
It should do the job. Let us know whether it worked for you or not!
Cheers!

Tika - how to extract text from PDF text: underlined, highlighted, crossed out

I'm using Tika* to parse a PDF file.
There are no problems retrieving the document's text, but I can't figure out how to extract text that is:
underlined
highlighted
crossed out
Adobe Writer gives you different text edit options, but I'm not able to see where they are "hidden".
Is there a solution to extract this metadata (underline, highlight, ...)?
Do you know if Tika is able to extract this data?
*http://tika.apache.org/
Wow. 4 years is a long time to wait for an answer, and I figure you have found a solution by now. Anyways, for the sake of those who visit this link, the answer is yes. Apache Tika can extract not just the text in a document but also the formatting (e.g. bold, italicized). This was my scenario:
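import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;

// inputStream is the document you wish to parse.
AutoDetectParser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler(new ToXMLContentHandler());
Metadata metadata = new Metadata();
parser.parse(inputStream, handler, metadata);
System.out.println(handler.toString());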
The print statement prints an XML version of your document. With a little work cleaning up the XML (really HTML tags), you would be left with tags like <b>text</b> for bold text and <i>text</i> for italicized text. Then you could find a way to render it. Good luck.
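As a rough idea of the "little work" on the XML: once you have the output as a string, the JDK's own DOM parser can pull out the text inside a given tag. This is only a sketch under assumptions. The sample string below is made up, FormattedTextSketch and textOf are illustrative names, and whether tags such as <b>, <i> or <u> actually appear in the output depends on the parser and on the document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: collects the text content of every element with a given tag name.
class FormattedTextSketch {

    /** Returns the text inside each <tagName> element of a well-formed XHTML string. */
    static List<String> textOf(String xhtml, String tagName) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            // Parsing fails if the input is not well-formed XML.
            throw new IllegalArgumentException("could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample, standing in for the string printed by handler.toString().
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<p>plain <b>bold text</b> and <i>italic</i></p></body></html>";
        System.out.println(textOf(xhtml, "b"));
        System.out.println(textOf(xhtml, "i"));
    }
}
```

The same call with "u" would be the place to look for underlined text, if the parser emits such a tag for your file at all.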