docx4j - Nodes Omitted From XmlUtils.marshalToString()

Using XmlUtils.marshalToString() from docx4j, I have the following content at identical locations in two docx files (extracted from the corresponding word/document.xml after unzipping each .docx). These are the only differences between the files:
<w:t xml:space="preserve">New line. First is </w:t>
and
<w:t xml:space="preserve">
<w:r>
<w:t xml:space="preserve">New line.</w:t>
</w:r>
<w:r>
<w:t xml:space="preserve"> First is </w:t>
</w:r>
</w:t>
In the first document, the <w:t> node is output as above.
However, in the second, an empty <w:t> node is printed as follows:
<w:t xml:space="preserve"></w:t>
I checked the w:t schema at http://www.schemacentral.com/sc/ooxml/e-w_p-1.html and w:r is a valid contained element.
Edit: the above link is the schema of the w:p element, not w:t. The proper link for w:t is: http://www.schemacentral.com/sc/ooxml/e-w_t-1.html. It clearly shows that the only acceptable content for w:t is a string (not a w:r or any other tag). Consequently (as suggested in Jason's answer below), the XML in document.xml was invalid and, as such, was not being unmarshalled into docx4j. As a result, the text was not available for output by XmlUtils.marshalToString().
What is keeping the second block from being output?

You can trust marshalToString.
If it is returning an empty w:t, that's because the underlying org.docx4j.wml.Text object has a null or empty value field.
You need to look at whatever code is supposed to be populating that.
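To confirm that the problem is in document.xml itself rather than in marshalToString, it can help to scan the part for w:t elements that contain child elements, since w:t may only hold a string. The following is a minimal sketch in Python using only the standard library, not docx4j; the function name find_nested_text_nodes and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def find_nested_text_nodes(document_xml: str) -> list[list[str]]:
    """Return, for each w:t that has child elements, the local names of those children."""
    root = ET.fromstring(document_xml)
    offenders = []
    for text_node in root.iter(f"{{{W}}}t"):
        # A schema-valid w:t has no element children at all.
        children = [child.tag.rsplit("}", 1)[-1] for child in text_node]
        if children:
            offenders.append(children)
    return offenders


# Hypothetical sample: one valid run, then a w:t that wrongly wraps two runs.
NESTED_SAMPLE = f"""<w:p xmlns:w="{W}">
<w:r><w:t xml:space="preserve">New line. First is </w:t></w:r>
<w:r><w:t xml:space="preserve"><w:r><w:t xml:space="preserve">New line.</w:t></w:r><w:r><w:t xml:space="preserve"> First is </w:t></w:r></w:t></w:r>
</w:p>"""

print(find_nested_text_nodes(NESTED_SAMPLE))
```

An empty result means every w:t holds only text, in which case the empty value has to come from whatever code populates the Text object.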

Related

Merging xfdf into template pdf without losing some special characters (eg. ő,Ű,č)

I have an xfdf file, which is UTF-8 and may contain non-ASCII characters. I would like to merge it with the pdf that contains the form. I tried pdftk, and although the merge itself works, in that all fields are populated, some characters do not appear in the flattened pdf.
Taking the xfdf:
<?xml version="1.0" encoding="utf-8"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<fields>
<field name="some_data">
<value>Űző</value>
</field>
<field name="some_other_data">
<value>ùûüÿ€’“”«»àâæçéèêëïôœÙÛÜŸÀÂÆÇÉÈÊËÏÎÔ</value>
</field>
</fields>
</xfdf>
The resulting pdf's fields have the following values (excluding the quotation marks):
some_data: " z "
some_other_data: "ùûüÿ€’“”«»àâæçéèêëïôœÙÛÜŸÀÂÆÇÉÈÊËÏÎÔ"
So all the characters in some_other_data are stored correctly, but ő and Ű are stored as 00.
I also realized that if I uncompress the pdf with pdftk, I can find the original characters stored in the pdf as
/DA (/Helv 8.64 Tf 0 g)
/Subtype /Widget
/V (ţ˙ Q z\r )
/T (some_data)
The fact that the correct characters are there is also clear if I open the unflattened form with Adobe Reader. After opening, the form field some_data initially contains only the letter z surrounded by spaces, BUT if I click on the form field, the special characters are revealed, and any change made to the field value results in the correct characters staying visible. On the other hand, if I unfocus the form field without any modification, they disappear again.
I also tried to use numeric entities in the xfdf, but it did not help either.
I have 2 questions:
Why won't these characters appear in the value of the field when the pdf clearly contains the correct character information, and is also capable of rendering them?
Most importantly what can I do in order to have the correct characters appearing in the pdf after flattening the form? I'd prefer a solution which does not require any postprocessing once the xfdf is merged into the pdf form, but any solutions or ideas are welcome.
Thank you all!

MS Word / OpenXML: extracting grammar and spelling errors

In OpenXML there is the w:proofErr element, which can be of spelling type (attributes w:type="spellStart" and w:type="spellEnd") or of grammar type (attributes w:type="gramStart" and w:type="gramEnd"). I need to extract the errors of my document (both types). When I created a small test document (just one sentence with two errors), the information was indeed in the .docx XML file. But when I saved the whole text I need to process (a 5MB file), the information was not included in the .docx file (probably Word considers that in large documents this would be too much noise in the XML data).
How can I extract this information even in big files?
Is there some way to force MS Word to include the information in the .docx file?
If not, is there some VBA script that can mark spelling errors and grammar errors with, for example, a different color or some special character, so that the information becomes hardcoded into the file?
Here is an example, for the sentence "The children plays in the guarden" (which has an agreement error and a spelling error):
<w:t>The children </w:t>
</w:r>
<w:proofErr w:type="gramStart"/>
<w:r w:rsidRPr="008E17B0">
<w:rPr><w:lang w:val="en-US"/></w:rPr>
<w:t>plays</w:t>
</w:r>
<w:proofErr w:type="gramEnd"/>
<w:r w:rsidRPr="008E17B0">
<w:rPr><w:lang w:val="en-US"/></w:rPr>
<w:t xml:space="preserve"> in the </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r w:rsidRPr="008E17B0">
<w:rPr><w:lang w:val="en-US"/></w:rPr>
<w:t>guarden</w:t>
</w:r>
<w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/>
<w:proofErr w:type="spellEnd"/>
<w:r w:rsidRPr="008E17B0">
<w:rPr><w:lang w:val="en-US"/></w:rPr>
<w:t>.</w:t>
</w:r>
I would like to obtain, for example, "The children ▶plays◀ in the *guarden*"
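For files where Word did write the w:proofErr elements, the marked-up sentence can be rebuilt by walking the children of each paragraph in document order. The following is a minimal sketch in Python using only the standard library; the function name mark_proof_errors, the marker characters, and the enclosing w:p and first w:r (which the fragment above omits) are assumptions for illustration. It does not help when the elements are missing from the file.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Marker emitted for each w:proofErr type (chosen to match the desired output above)
MARKERS = {"gramStart": "▶", "gramEnd": "◀", "spellStart": "*", "spellEnd": "*"}


def mark_proof_errors(paragraph: ET.Element) -> str:
    """Concatenate the run text of a w:p, inserting a marker at each w:proofErr."""
    parts = []
    for child in paragraph:
        if child.tag == f"{{{W}}}proofErr":
            parts.append(MARKERS.get(child.get(f"{{{W}}}type"), ""))
        elif child.tag == f"{{{W}}}r":
            parts.extend(t.text or "" for t in child.iter(f"{{{W}}}t"))
        # Other children (bookmarks and so on) carry no text and are skipped.
    return "".join(parts)


# The fragment from the question, wrapped in an assumed w:p and with run properties dropped.
PROOF_SAMPLE = f"""<w:p xmlns:w="{W}">
<w:r><w:t>The children </w:t></w:r>
<w:proofErr w:type="gramStart"/>
<w:r><w:t>plays</w:t></w:r>
<w:proofErr w:type="gramEnd"/>
<w:r><w:t xml:space="preserve"> in the </w:t></w:r>
<w:proofErr w:type="spellStart"/>
<w:r><w:t>guarden</w:t></w:r>
<w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/>
<w:proofErr w:type="spellEnd"/>
<w:r><w:t>.</w:t></w:r>
</w:p>"""

print(mark_proof_errors(ET.fromstring(PROOF_SAMPLE)))
# prints: The children ▶plays◀ in the *guarden*.
```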

DITA OT printing '#' instead of Chinese characters in PDF

I am very new to DITA OT. I downloaded DITA-OT1.5.4_full_easy_install_bin and have been playing around with it. I'm trying to print a few characters in Simplified Chinese (zh-CN) into a PDF. The characters are printed correctly in XHTML, but in the PDF they are printed as "#".
On the command line I see this: Warning: Glyph "?" (0x611f) not available in font "Helvetica".
Here are the things I have tried so far:
In demo\fo\fop\conf\fop.xconf :
<fonts>
<font kerning="yes"
embed-url="file:///C:/Windows/Fonts/simsun.ttc"
embedding-mode="subset" encoding-mode="cid">
<font-triplet name="SimSun" style="normal" weight="normal"/>
</font>
<auto-detect/>
<directory recursive="true">C:\Windows\Fonts</directory>
</fonts>
In demo\fo\cfg\fo\attrs\custom.xsl :
<xsl:attribute-set name="__fo__root">
<xsl:attribute name="font-family">SimSun</xsl:attribute>
</xsl:attribute-set>
In demo\fo\cfg\fo\font-mapping.xml added this block for Sans, Serif & Monospaced logical fonts:
<physical-font char-set="Simplified Chinese">
<font-face>SimSun</font-face>
</physical-font>
In samples\concepts\garageconceptsoverview.xml :
<shortdesc xml:lang="zh_CN">職業道德感.</shortdesc>
And this is the command I am using to generate the PDF:
ant -Dargs.input=samples\hierarchy.ditamap -Dtranstype=pdf
Any help would be appreciated. Thanks.
[EDIT]
I see that the topic.fo file generated in the temp folder does contain the Chinese characters correctly, like this:
<fo:block font-size="10pt" keep-with-next.within-page="5" start-indent="25pt">職業道德感.</fo:block>
But I do not see the font related information anywhere in this document.
First of all, you should set the "xml:lang='zh_CN'" attribute on the root elements of all DITA topics and maps. This helps the DITA OT publishing decide which language to use for static texts like "Table X" and which charset to use for the font mappings.
Then you should run the publishing with the "clean.temp" parameter set to "no".
After the publishing you can look in the temporary files folder for a file called "topic.fo" and look inside it to see which font families are used.
This matters because even if you set a font on the root element, there are other places in the XSL-FO file where font families are set explicitly.
So instead of setting a font on the XSL-FO root element, you should edit the font mappings XML file and, for each of the logical fonts "Sans" and "Serif", configure the actual font family to use for the Chinese charset, something like:
<logical-font name="Sans">
.........
<physical-font char-set="Simplified Chinese">
<font-face>SimSun</font-face>
</physical-font>
......
</logical-font>
More about how the font mappings work:
https://www.oxygenxml.com/doc/versions/17.0/ug-editor/#topics/DITA-map-set-font-Apache-FOP.html
Update:
If you insist on keeping the XSLT customization that sets "SimSun" as the font family on the root element, then in font-mappings.xml you need to define a new mapping for your alias:
<aliases>
<alias name="SimSun">SimSun</alias>
</aliases>
and then map the logical font to a physical one in the same font-mappings.xml:
<logical-font name="SimSun">
<physical-font char-set="Simplified Chinese">
<font-face>SimSun</font-face>
</physical-font>
</logical-font>
0x611f is a Chinese character (感). Helvetica is a European font, so it does not contain this character. Search for where the "Helvetica" font is set; at that position your content (ditamap/dita) should use a Chinese font, not a European one. You must find the attribute that sets font-family to Helvetica and, in your own plugin, change it to [SimSun, Helvetica].
Sorry, I cannot answer your question, but you should definitely try a newer DITA-OT from http://dita-ot.github.io/. Your DITA-OT version is not supported anymore. Maybe your problem goes away with the latest release.

csv to xml: not sure the best way to do it in Mule ESB

I'm new to Mule, so bear with me. I have the following CSV file that I receive:
Company1,2,123 Street,Winchester,UK
"000010","CHRISTINE","I","HAAS","A00","3978","1995-01-01","PRES",18,"F","1963-08-24",152750.00
"000020","MICHAEL","L","THOMPSON","B01","3476","2003-10-10","MANAGER",18,"M","1978-02-02",94250.00
The first line, the header, contains company info plus the number of records (number of employees) in the CSV file (the second field in the header).
Now I need to convert it to the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<tns:employeedata xmlns:tns="http://coxb.test.legstar.com/payrollemployee" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://coxb.test.legstar.com/payrollemployee PayrollEmployee.xsd ">
<tns:employeecount>2</tns:employeecount>
<tns:employeelist>
<tns:employees>
<tns:employeenumber>000010</tns:employeenumber>
<tns:firstname>CHRISTINE</tns:firstname>
<tns:middleinitial>I</tns:middleinitial>
<tns:surname>HAAS</tns:surname>
<tns:department>A00</tns:department>
</tns:employees>
<tns:employees>
<tns:employeenumber>000020</tns:employeenumber>
<tns:firstname>MICHAEL</tns:firstname>
<tns:middleinitial>L</tns:middleinitial>
<tns:surname>THOMPSON</tns:surname>
<tns:department>B01</tns:department>
</tns:employees>
</tns:employeelist>
</tns:employeedata>
I could easily transform this file without the first line (header). My problem is how to process the header and extract/transform "employeecount".
Any help will be greatly appreciated.
The easiest way to do this is to use DataMapper. Set the input to CSV (using a sample CSV) and the output to XML (using your XSD or a sample XML).
Once you're in the mapping view, click on your employeecount field. You'll see an area where you can enter an expression. There is an undocumented parameter, $in.0.__id, which you can use and which will contain the record count. Note that this only works for CSV files.
Regarding how to skip the first line, DataMapper does this by default.
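Independent of DataMapper, the underlying logic is small: read the first row as the header, take its second field as employeecount, and map the first five fields of every remaining row. The following is a rough sketch of that logic in plain Python, not Mule; the function name csv_to_employee_xml and the FIELDS list are assumptions derived from the sample XML in the question.

```python
import csv
import io
import xml.etree.ElementTree as ET

TNS = "http://coxb.test.legstar.com/payrollemployee"
ET.register_namespace("tns", TNS)

# Assumed mapping: the first five CSV columns, in order, become these elements.
FIELDS = ["employeenumber", "firstname", "middleinitial", "surname", "department"]


def csv_to_employee_xml(text: str) -> ET.Element:
    """Build the employeedata tree; the header row supplies employeecount."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    header, records = rows[0], rows[1:]

    root = ET.Element(f"{{{TNS}}}employeedata")
    # Second header field is the number of employee records.
    ET.SubElement(root, f"{{{TNS}}}employeecount").text = header[1]

    employee_list = ET.SubElement(root, f"{{{TNS}}}employeelist")
    for record in records:
        employee = ET.SubElement(employee_list, f"{{{TNS}}}employees")
        for name, value in zip(FIELDS, record):
            ET.SubElement(employee, f"{{{TNS}}}{name}").text = value
    return root


CSV_SAMPLE = "\n".join([
    'Company1,2,123 Street,Winchester,UK',
    '"000010","CHRISTINE","I","HAAS","A00","3978","1995-01-01","PRES",18,"F","1963-08-24",152750.00',
    '"000020","MICHAEL","L","THOMPSON","B01","3476","2003-10-10","MANAGER",18,"M","1978-02-02",94250.00',
])

print(ET.tostring(csv_to_employee_xml(CSV_SAMPLE), encoding="unicode"))
```

The sketch leaves out the xsi:schemaLocation attribute and does not check that the header count matches the number of records.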

How do you comment out code snippets within a seam pdf file

How do you comment out pieces of code within a Seam PDF generation file? XML-style comments don't seem to work: the commented-out code appears as-is in the pdf file.
<p:font name="times-roman" size="12" style="bold normal">
<p:text value="Full Name."/>
</p:font>
<p:font name="times-roman" size="9" style="normal">
<p:text value=" #{abc.firstName} "/>
</p:font>
According to Adobe PDF Reference:
Any occurrence of the percent sign character (%) outside a string or stream introduces a comment. The comment consists of all characters between the percent sign and the end of the line, including regular, delimiter, space, and tab characters. PDF ignores comments, treating them as if they were single white-space characters. That is, a comment separates the token preceding it from the one following it; thus, the PDF fragment
abc% comment { /% ) blah blah blah
123
is syntactically equivalent to just the tokens abc and 123.
I have not used JBoss Seam myself, but I guess you could try combining the two comment styles (XML and PDF) in your XML input file so that your comments are not visible in the resulting pdf file.