Docx4j-ImportXHTML fails to underline text in italics - docx4j

When the HTML combines underlined and italic text, the generated .docx loses the underlining.
pom.xml dependencies:
<dependency>
<groupId>org.docx4j</groupId>
<artifactId>docx4j-ImportXHTML</artifactId>
<version>3.3.1</version>
</dependency>
Example snippet:
public class UnderlineTests {
static final String TEST_STRING = "<html><body><p><em>Italics</em></p>" +
"<p><strong>Bold</strong></p>" +
"<p><u>Underlined</u></p>" +
"<p><strong><em>BoldItalics</em></strong></p>" +
"<p><u><em>ItalicsUnderlined</em></u></p>" +
"<p><strong><u>BoldUnderlined</u></strong></p></body></html>";
public static void main(String[] args) throws ParserConfigurationException, Docx4JException {
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage(PageSizePaper.A4, true);
XHTMLImporterImpl XHTMLImporterForContent = new XHTMLImporterImpl(wordMLPackage);
wordMLPackage.getMainDocumentPart().getContent().addAll(XHTMLImporterForContent.convert(TEST_STRING, null));
wordMLPackage.save(new File("test.docx"));
}
}
In the resulting .docx, the <p><u><em>ItalicsUnderlined</em></u></p> paragraph is not underlined!
Any help, please? I need to keep the <u> tag because the HTML comes from CKEditor.

I found that swapping the order of the underline and italic tags makes it work, which is very strange:
<p><em><u>ItalicsUnderlined</u></em></p>
With this order, the paragraph is underlined as expected.
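The reordering workaround above could be applied automatically before the HTML reaches XHTMLImporterImpl. The following is a minimal sketch, not part of docx4j: UnderlineItalicSwap and swap are hypothetical names, and the regular expression only handles a <u> whose sole child is an <em> containing plain text (no attributes, no nested tags). Whether the swap also helps in versions other than 3.3.1 is not confirmed here.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not a docx4j class): rewrites u-around-em as em-around-u. */
final class UnderlineItalicSwap {
    // Matches <u><em>text</em></u> where "text" contains no further tags.
    private static final Pattern U_AROUND_EM =
            Pattern.compile("<u>\\s*<em>([^<]*)</em>\\s*</u>");

    /** Returns the markup with every matching pair re-nested; other markup is untouched. */
    static String swap(String xhtml) {
        return U_AROUND_EM.matcher(xhtml).replaceAll("<em><u>$1</u></em>");
    }

    public static void main(String[] args) {
        System.out.println(swap("<p><u><em>ItalicsUnderlined</em></u></p>"));
    }
}
```

In the example from the question, the call would become XHTMLImporterForContent.convert(UnderlineItalicSwap.swap(TEST_STRING), null). For arbitrary CKEditor output, a real HTML parser would be a safer choice than a regular expression.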

Related

How to fix the incorrect mapping of glyphs to unicode characters seen in PDF

When I extract text from the PDF, the letters marked as Latin-1 Supplement come out as different characters.
An example of the altered output is listed after the code.
What is causing this, and how can I fix it?
public static void main(String[] args) throws IOException {
    // PDDocument is Closeable, so try-with-resources releases the file handle
    try (PDDocument doc = PDDocument.load(new File("myPDF.pdf"))) {
        PDFTextStripper tStripper = new PDFTextStripper();
        String cont = tStripper.getText(doc);
        System.out.println(cont);
    }
}
in PDF -> in console
'Ô' -> '«'
'Á' -> '¸'
'Ệ' -> 'Ö'
"CÔNG BÁO" -> "c«ng b¸o"

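The pairs listed in the question are consistent with a font that uses a legacy 8-bit Vietnamese encoding (possibly TCVN3) without a usable ToUnicode map, although the question does not confirm this. Under that assumption, one stopgap is to remap the extracted string afterwards. The sketch below is only an illustration: GlyphRemap and remap are hypothetical names, the table contains nothing beyond the three pairs shown in the question, and it maps to lowercase letters because the console output in the question is lowercase.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical post-processing step; the table holds only the pairs from the question. */
final class GlyphRemap {
    private static final Map<Character, Character> FIX = new HashMap<>();
    static {
        FIX.put('\u00AB', '\u00F4'); // left guillemet       -> o with circumflex
        FIX.put('\u00B8', '\u00E1'); // spacing cedilla      -> a with acute
        FIX.put('\u00D6', '\u1EC7'); // O with diaeresis     -> e with circumflex and dot below
    }

    /** Replaces every character found in the table; all other characters pass through. */
    static String remap(String extracted) {
        StringBuilder sb = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            char fixed = FIX.getOrDefault(c, c);
            sb.append(fixed);
        }
        return sb.toString();
    }
}
```

It would be applied to the stripper output, for example GlyphRemap.remap(tStripper.getText(doc)). A blanket remap like this also rewrites genuine occurrences of those characters, so it only makes sense for documents known to use the affected font throughout.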
Get Jackson XMLMapper to read root element name

How do I get Jackson's XMLMapper to read the name of the root xml element when deserializing?
I am deserializing the input XML into a generic Java class, LinkedHashMap, and then serializing that to JSON. I want to read the root element of the input XML dynamically when deserializing to the LinkedHashMap.
Code
XmlMapper xmlMapper = new XmlMapper();
Map entries = xmlMapper.readValue(new File("source.xml"), LinkedHashMap.class);
ObjectMapper jsonMapper = new ObjectMapper();
String json = jsonMapper.writer().writeValueAsString(entries);
System.out.println(json);
Input XML
<?xml version="1.0" encoding="ISO-8859-1"?>
<File>
<NumLeases>1</NumLeases>
<NEDOCO>18738</NEDOCO>
<NWUNIT>0004</NWUNIT>
<FLAG>SUCCESS</FLAG>
<MESSAGE>Test Upload</MESSAGE>
<Lease>
<LeaseVersion>1</LeaseVersion>
<F1501B>
<NEDOCO>18738</NEDOCO>
<NWUNIT>0004</NWUNIT>
<NTRUSTRECORDKEY>12</NTRUSTRECORDKEY>
</F1501B>
</Lease>
</File>
Actual Output
{"NumLeases":"1","NEDOCO":"18738","NWUNIT":"0004","FLAG":"SUCCESS","MESSAGE":"Test Upload","Lease":{"LeaseVersion":"1","F1501B":{"NEDOCO":"18738","NWUNIT":"0004","NTRUSTRECORDKEY":"12"}}}
Expected Output (Note: There is a root element named "File" present in JSON)
{"File":{"NumLeases":"1","NEDOCO":"18738","NWUNIT":"0004","FLAG":"SUCCESS","MESSAGE":"Test Upload","Lease":{"LeaseVersion":"1","F1501B":{"NEDOCO":"18738","NWUNIT":"0004","NTRUSTRECORDKEY":"12"}}}}
There's probably a switch somewhere to set this. Any help would be appreciated.
Sadly, there is no flag for that. It can be done with a custom implementation of com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer (see Jackson How-To: Custom Deserializers):
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.*;
import com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer;
import com.fasterxml.jackson.databind.module.SimpleModule;
import com.fasterxml.jackson.dataformat.xml.XmlMapper;
import com.fasterxml.jackson.dataformat.xml.deser.FromXmlParser;
import java.io.File;
import java.io.IOException;
//...
XmlMapper xmlMapper = new XmlMapper();
xmlMapper.registerModule(new SimpleModule().addDeserializer(JsonNode.class,
new JsonNodeDeserializer() {
@Override
public JsonNode deserialize(JsonParser p, DeserializationContext ctxt) throws IOException {
String rootName = ((FromXmlParser)p).getStaxReader().getLocalName();
return ctxt.getNodeFactory()
.objectNode().set(rootName, super.deserialize(p, ctxt));
}
}));
JsonNode entries = xmlMapper.readTree(new File("source.xml"));
System.out.println(entries);
The accepted answer works for Jackson 2.10.* (and probably older versions), but not for any of the newer versions (this might get fixed in 2.14).
What worked for me:
public class CustomJsonNodeDeserializer extends JsonNodeDeserializer {
@Override
public JsonNode deserialize(JsonParser p, DeserializationContext context) throws IOException {
//first deserialize
JsonNode rootNode = super.deserialize(p, context);
//then get the root name
String rootName = ((FromXmlParser)p).getStaxReader().getLocalName();
return context.getNodeFactory().objectNode().set(rootName, rootNode);
}
}
I will update my answer if there's a new better solution.
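Because both deserializer variants depend on where Jackson's parser cursor happens to be, a version-independent alternative is to read the root element name separately with the JDK's own StAX API and wrap the deserialized map afterwards. This is a different technique from the answers above, shown only as a sketch: RootElementName and of are hypothetical names, and the input is assumed to be available as a String with no DOCTYPE.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: returns the local name of the document's root element. */
final class RootElementName {
    static String of(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                // nextTag() skips whitespace, comments and processing instructions
                // and stops at the first start tag, which is the root element.
                reader.nextTag();
                return reader.getLocalName();
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(of("<File><NumLeases>1</NumLeases></File>"));
    }
}
```

With the name in hand, the map from the question could be wrapped before it is written as JSON, for instance with Collections.singletonMap(rootName, entries). The cost is that the start of the document is parsed twice.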
While this question has an accepted answer, I found that it doesn't work on the latest Jackson version, 2.13.2, and that it uses a flawed approach anyway.
new SimpleModule().addDeserializer(JsonNode.class,
new JsonNodeDeserializer() {
@Override
public JsonNode deserialize(JsonParser p, DeserializationContext ctxt) throws IOException {
String rootName = ((FromXmlParser)p).getStaxReader().getLocalName();
return ctxt.getNodeFactory()
.objectNode().set(rootName, super.deserialize(p, ctxt));
}
}));
The .getLocalName() call will return the name of the first child element, not the actual root of the parsed input. Also, fetching just the name of the element ignores the attributes, so you'll end up with just a duplicated tag name in your output.
What to do instead?
After trying a number of workarounds, I've found only one that works properly. You have to let Jackson do its root node removal and fool it with a dummy wrapper tag.
JsonNode jsonNode = XML_MAPPER.readTree("<tag>" + nestedXmlString + "</tag>");
This will wrap the XML with a dummy <tag> which is then immediately removed and forgotten.
Then, you can work with the output tree as usual:
toXmlGenerator.writeTree(jsonNode);
Caution
However, please be aware that if your XML input String contains the XML Header declaration (<?xml...), then wrapping it with a dummy tag will result in a parsing exception. To avoid this, you'll have to first remove the declaration string from the input:
String nestedXmlString = input;
if (nestedXmlString.startsWith("<?xml")) {
nestedXmlString = nestedXmlString.substring(nestedXmlString.indexOf("?>") + 2);
}
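The two steps above (drop the declaration, then add the dummy wrapper) can be combined into one helper. This is a minimal sketch using plain string handling: DummyRootWrapper and wrap are hypothetical names, and it additionally trims leading whitespace, which would otherwise make the startsWith check miss the declaration.

```java
/** Hypothetical helper: strips an XML declaration, then wraps the rest in a dummy tag. */
final class DummyRootWrapper {
    static String wrap(String input) {
        String xml = input.trim();
        if (xml.startsWith("<?xml")) {
            int end = xml.indexOf("?>");
            if (end >= 0) {
                // Keep everything after the closing "?>" of the declaration.
                xml = xml.substring(end + 2).trim();
            }
        }
        return "<tag>" + xml + "</tag>";
    }

    public static void main(String[] args) {
        System.out.println(wrap("<?xml version=\"1.0\"?>\n<File><A>1</A></File>"));
    }
}
```

The result would then be passed to the mapper, for example XML_MAPPER.readTree(DummyRootWrapper.wrap(input)). A byte order mark or a DOCTYPE in front of the root element is not handled by this sketch.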

Import annotations (XFDF) to PDF

I have created a sample program that tries to import XFDF into a PDF using the Aspose library. The program runs without throwing an exception, but the output PDF does not include any annotations. Any suggestions for solving this problem?
Update - 2014-12-12
I have also sent the issue to Aspose. They can reproduce the same problem and logged a ticket PDFNEWJAVA-34609 in their issue tracking system.
Following is my sample program:
public static void main(String[] args) {
final String ROOT = "C:\\PdfAnnotation\\";
final String sourcePDF = "hackermonthly-issue.pdf";
final String destPDF = "output.pdf";
final String sourceXFDF = "XFDFTest.xfdf";
try
{
// Specify the path of license file
License lic = new License();
lic.setLicense(ROOT + "Aspose.Pdf.lic");
//create an object of PdfAnnotationEditor class
PdfAnnotationEditor editor = new PdfAnnotationEditor();
//bind input PDF file
editor.bindPdf(ROOT + sourcePDF);
//create a file stream for input XFDF file to import annotations
FileInputStream fileStream = new FileInputStream(ROOT + sourceXFDF);
//create an enumeration of all the annotation types which you want to import
//int[] annType = {AnnotationType.Ink };
//import annotations of specified type(s) from XFDF file
//editor.importAnnotationFromXfdf(fileStream, annType);
editor.importAnnotationFromXfdf(fileStream);
//save output pdf file
editor.save(ROOT + destPDF);
} catch (Exception e) {
System.out.println("exception: " + e.getMessage());
}
}

comment or highlight two-column pdf using pdf-clown

I have searched for a solution via Google, SO, and the pdfClown/pdfbox forums, and I am now posting the problem on SO.
Problem: I have been trying to highlight text that spans multiple lines in a PDF document. The PDF can have one- or two-column pages.
Using pdf-clown, I was able to highlight phrases ONLY if all the words appear on the same line. pdfBox produced the XML for individual words, but I could not find a solution for phrases or lines.
Please suggest a solution for pdf-clown, if there is one, or any other Java-compatible tool that can highlight text spanning multiple lines in a PDF.
There is a similar question about iText, but I could not understand its answer. Any help?
Multiline markup annotations with iText
It is possible to get the coordinates of each word in a PDF document using pdfbox. Here is the code for it:
import java.io.*;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;
import java.io.IOException;
import java.util.List;
public class PrintTextLocations extends PDFTextStripper {
public PrintTextLocations() throws IOException {
super.setSortByPosition(true);
}
public static void main(String[] args) throws Exception {
PDDocument document = null;
try {
File input = new File("C:\\path\\to\\PDF.pdf");
document = PDDocument.load(input);
if (document.isEncrypted()) {
try {
document.decrypt("");
} catch (InvalidPasswordException e) {
System.err.println("Error: Document is encrypted with a password.");
System.exit(1);
}
}
PrintTextLocations printer = new PrintTextLocations();
List allPages = document.getDocumentCatalog().getAllPages();
for (int i = 0; i < allPages.size(); i++) {
PDPage page = (PDPage) allPages.get(i);
System.out.println("Processing page: " + i);
PDStream contents = page.getContents();
if (contents != null) {
printer.processStream(page, page.findResources(), page.getContents().getStream());
}
}
} finally {
if (document != null) {
document.close();
}
}
}
protected void processTextPosition(TextPosition text) {
System.out.println("String[" + text.getXDirAdj() + ","
+ text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
+ text.getXScale() + " height=" + text.getHeightDir() + " space="
+ text.getWidthOfSpace() + " width="
+ text.getWidthDirAdj() + "]" + text.getCharacter());
}
}
Multi-column text is, at the moment (PDF Clown 0.1.2), not supported for extraction: the current algorithm gathers text lying on the same horizontal baseline without evaluating possible gaps between columns.
Automatic detection of multi-column layouts would be possible yet somewhat tricky, as PDF is essentially (you know) an unstructured graphic format. Nonetheless, I'm considering experimenting with it, in order to deal with at least the most common scenarios.
In the meantime, I can suggest an effective workaround (it assumes that you work on a document whose columns are placed in predictable areas): for each column, do a separate text extraction, instructing the TextExtractor to look into the corresponding page area, then put all those partial extraction results together and apply your filter.
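The "extract per column, then put the results together" idea can be illustrated independently of any PDF library. The sketch below is an assumption-laden illustration, not PDF Clown code: ColumnOrder, Word and readingOrder are hypothetical names, the page is assumed to have exactly two columns separated at a known x coordinate, and y is assumed to grow downward. The word coordinates would come from whichever extractor is in use.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration: orders word boxes column by column instead of line by line. */
final class ColumnOrder {
    static final class Word {
        final String text;
        final float x;
        final float y;
        Word(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns the words of the left column (top to bottom), followed by the right column. */
    static List<String> readingOrder(List<Word> words, float columnSplitX) {
        Comparator<Word> topToBottom = Comparator
                .comparingDouble((Word w) -> w.y)
                .thenComparingDouble(w -> w.x);
        List<Word> left = new ArrayList<>();
        List<Word> right = new ArrayList<>();
        for (Word w : words) {
            (w.x < columnSplitX ? left : right).add(w);
        }
        left.sort(topToBottom);
        right.sort(topToBottom);
        List<String> ordered = new ArrayList<>();
        for (Word w : left) ordered.add(w.text);
        for (Word w : right) ordered.add(w.text);
        return ordered;
    }
}
```

Once the words are in this order, a phrase that wraps from one line to the next inside a column becomes a contiguous run, so each matched word still carries the coordinates needed to draw one highlight box per line.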

Docx4j Test - No File is Output

I'm attempting to write my first class with docx4j (http://www.docx4java.org). The idea is to find a string of text in the .docx file and replace it with another string: essentially a mail merge. While I'm not receiving any errors, the merged document is not being saved to the path I've specified. This makes me think it's a file path problem, but I don't see anything wrong with the path.
package efi.mailmerge.servlets;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.bind.JAXBElement;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.wml.Text;
public class WordDocTest {
/**
* Open word document /Users/Jeff/Development/ReServe-Unleashed/Dev/MailMerge/uploads/Sample.docx, replace a piece of text and save
* the result to /Users/Jeff/Development/ReServe-Unleashed/Dev/MailMerge/uploads/Sample-Out.docx.
*
* The text <<CUS_FNAME>> will be replaced with John.
*
* @param args
*/
public static void main(String[] args) {
// Text nodes begin with w:t in the word document
final String XPATH_TO_SELECT_TEXT_NODES = "//w:t";
try {
// Open the input file
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new java.io.File("/Users/Jeff/Development/ReServe-Unleashed/Dev/MailMerge/uploads/Sample.docx"));
// Build a list of "text" elements
List texts = wordMLPackage.getMainDocumentPart().getJAXBNodesViaXPath(XPATH_TO_SELECT_TEXT_NODES, true);
// Loop through all "text" elements
for (Object obj : texts) {
Text text = (Text) ((JAXBElement) obj).getValue();
// Get the text value
String textValueBefore = text.getValue();
// Perform the replacement
String textValueAfter = textValueBefore.replaceAll("<<CUS_FNAME>>", "John");
// Show the element before and after the replacement
System.out.println("textValueBefore = " + textValueBefore);
System.out.println("textValueAfter = " + textValueAfter);
// Update the text element now that we have performed the replacement
text.setValue(textValueAfter);
}
wordMLPackage.save(new java.io.File("/Users/Jeff/Development/ReServe-Unleashed/Dev/MailMerge/uploads/Sample-Out.docx"));
} catch (Docx4JException e) {
Logger.getLogger(WordDocTest.class.getName()).log(Level.SEVERE, null, e);
e.printStackTrace();
} catch (Exception e) {
Logger.getLogger(WordDocTest.class.getName()).log(Level.SEVERE, null, e);
e.printStackTrace();
}
}
}
The input and output paths are in the WordprocessingMLPackage.load(...) and wordMLPackage.save(...) calls. I've confirmed that the Sample.docx input file exists and that the uploads directory has write permissions. Can you see anything wrong with my file paths here? I could be completely on the wrong track, but this is all very new to me, so I'm learning as I go.
Any and all help is very much appreciated.
At first sight, I would suggest writing your path the following way:
wordMLPackage.save(new java.io.File("\\Users\\Jeff\\Development\\ReServe-Unleashed\\Dev\\MailMerge\\uploads\\Sample-Out.docx"));
If it still doesn't work, please provide the stack trace; it could help. (If no document is saved, an exception must have been thrown.)
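While waiting for a stack trace, the output location can be checked before save is called. The following is a small sketch using only java.io.File: OutputPathCheck and problems are hypothetical names, and the checks cover only the obvious cases (missing or read-only parent directory, target that is itself a directory).

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight check for an output file path. */
final class OutputPathCheck {
    /** Returns a list of human-readable problems; an empty list means none were found. */
    static List<String> problems(File target) {
        List<String> problems = new ArrayList<>();
        // Resolve relative paths against the working directory first.
        File parent = target.getAbsoluteFile().getParentFile();
        if (parent == null || !parent.isDirectory()) {
            problems.add("parent directory does not exist: " + parent);
        } else if (!parent.canWrite()) {
            problems.add("parent directory is not writable: " + parent);
        }
        if (target.isDirectory()) {
            problems.add("target is a directory: " + target);
        }
        return problems;
    }
}
```

Printing target.getAbsolutePath() next to the result shows where the file would actually be written, which is useful when the code runs inside a servlet container with a different working directory.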