Docx4j v3 Docx to HTML with Images - docx4j

I'm working on converting a .docx document to HTML using Docx4j version 3.
The document contains whitespace consisting of tabs, spaces, and newlines. The resulting HTML either contains unrecognized characters or does not preserve the whitespace at all.
The Java code I'm using is:
// 'is' is the InputStream for the uploaded .docx
WordprocessingMLPackage wordMLPackage = Docx4J.load(is);
HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
htmlSettings.setImageDirPath(System.getProperty("user.dir") + uploadedImagesDirectory);
htmlSettings.setWmlPackage(wordMLPackage);
OutputStream out = new ByteArrayOutputStream();
Docx4J.toHTML(htmlSettings, out, Docx4J.FLAG_EXPORT_PREFER_XSL);
String result = ((ByteArrayOutputStream) out).toString();
How can I preserve the whitespace in the document? Also, is there a method to apply CSS to a particular node? Specifically, I have three images which should be evenly spaced horizontally on the page.
I've looked over the documentation and searched online with no success.
Thank you.

I resolved the issue, and it was not related to Docx4j.
Docx4j parsed the document perfectly! The problem was with sending the output in an email.
I set the MIME encoding on the Spring JavaMail helper to resolve the issue:
MimeMessageHelper message = new MimeMessageHelper(mimeMessage, true, "utf-8");
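For context, here is a minimal sketch of the surrounding mail code under that fix; the mailSender bean, recipient address, and subject are placeholders, not from the original post:
import javax.mail.internet.MimeMessage;
import org.springframework.mail.javamail.JavaMailSender;
import org.springframework.mail.javamail.MimeMessageHelper;

// 'mailSender' is an injected JavaMailSender; 'result' is the HTML produced above.
MimeMessage mimeMessage = mailSender.createMimeMessage();
// The third constructor argument forces UTF-8 for the message body, which is
// what keeps the tabs, spaces, and non-breaking spaces in the HTML intact.
MimeMessageHelper message = new MimeMessageHelper(mimeMessage, true, "utf-8");
message.setTo("recipient@example.com");   // placeholder address
message.setSubject("Converted document"); // placeholder subject
message.setText(result, true);            // true = send as HTML
mailSender.send(mimeMessage);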

Related

How to adjust font size in syncfusion html to pdf

I am able to convert HTML to PDF using Syncfusion.
The issue is that the conversion doesn't obey the font sizes in my CSS:
var htmlText = "<html><head><style>body{font-size:50px;}</style></head><body>Hello</body></html>";
var convertedHtmlDocument = ConvertFromHtml(htmlText);
var ms = new MemoryStream();
var fpath = AppDomain.CurrentDomain.BaseDirectory + "myfile.pdf";
SaveToFile(convertedHtmlDocument, fpath);
ms.Dispose();
It doesn't matter if I make the font size in the CSS 50 or 5, the font comes out the same size.
I also tried (same issue):
var htmlText = "<html><head><style>.myclass {font-size:50px;}</style></head><body><div class="myapp">Hello</div></body></html>";
Saving either of the two snippets above as an .html file and opening it in a browser renders the font sizes as the CSS specifies.
If I change my CSS to target table then it works, but I want a single font size for the whole document, not just for tables:
var htmlText = "<html><head><style>table{font-size:50px;}</style></head><body>Hello</body></html>"; //this works
What am I doing wrong?
We have checked the reported issue with different font sizes, but it is working properly on our end. We have attached a sample and the output for your reference.
Sample: https://www.syncfusion.com/downloads/support/directtrac/general/ze/WPF1621828031
Output: https://www.syncfusion.com/downloads/support/directtrac/general/ze/Output259867595
If you are still facing the issue, please share your modified code sample, the input HTML text, the product version, and a screenshot of the issue so we can check it on our end. That will help us analyze the problem and assist you further.

Adding Arial Unicode MS to CKEditor

My web application allows users to write rich text in CKEditor, then export the result as PDF with the Flying Saucer library.
As they need to write Greek characters, I chose to add Arial Unicode MS to the available fonts, as follows:
config.font_names = "*several fonts...*; Arial Unicode MS/Arial Unicode MS, serif";
This font is now displayed correctly in the CKEditor menu, but when I apply it to any element, I get the following result:
<span style="font-family:arial unicode ms,serif;"> some text </span>
As you can see, the uppercase characters are lost. This has a pretty bad effect during PDF export: Flying Saucer doesn't recognize the font, so it falls back to Helvetica, which does not support the needed Unicode characters, and the Greek characters are not displayed in the PDF.
If I manually change the source from
<span style="font-family:arial unicode ms,serif;"> some text </span>
to
<span style="font-family:Arial Unicode MS,serif;"> some text </span>
then it works as expected and the Greek characters are displayed.
Has anyone met this problem before? Is there a way to prevent the uppercase characters from being converted to lowercase?
I really want to avoid doing any kind of post-processing like :
htmlString = htmlString.replace("arial unicode ms", "Arial Unicode MS");
I agree with you about resolving this issue outside of Flying Saucer R8.
Although depending upon your workflow, would it be more efficient to allow CKEditor to preprocess and validate a completed HTML encoded file (render the entire document to HTML first)?
None of the CKEditor support tickets specify the true source of the issue, so I recommend confirming for yourself whether it is (A) a styling issue, or (B) a CSS processing issue, or (C) a peculiar CKEditor parsing issue.
A possible workaround:
1. Make a copy of the desired Unicode font and import it into Type 3.2 (works on both Mac and Windows): http://www.cr8software.net/type.html
2. Rename the duplicate font set to something all lowercase.
3. Limit your font selection:
config.font_names = "customfontnamehere";
Apply the style separately (the Unicode typeface GreatVibes is used below) and see if that gives you the desired result:
var s = CKEDITOR.document.$.createElement( 'style' );
s.type = 'text/css';
cardElement.$.appendChild( s );
s.styleSheet.cssText =
'@font-face {' +
'font-family: \'GreatVibes\';' +
'src: url(\'' + path +'fonts/GreatVibes-Regular.eot\');' +
'}' +
style;
If the above does not work, you can try the xmas plugin.js (which also uses the Unicode typeface GreatVibes and does all sorts of cool manipulations before output); it might be worth modifying it rather than starting from scratch:
'<style type="text/css">' +
'@font-face {' +
'font-family: "GreatVibes";' +
'src: url("' + path +'fonts/GreatVibes-Regular.ttf");' +
'}' +
style +
'</style>' )
Whichever approach you try, the goal is to test various styling and see if CKEditor defaults back to Helvetica again.
Lastly, the CKEditor SDK has excellent support, so if you have the time and energy, you could write a plugin. Sounds daunting, I know, but notice how the plugin.js within the /plugins/font directory has priority for size attributes.
If you are not interested in producing your own plugin, I recommend contacting doksoft, a prolific CKEditor plugin writer (listed both on the CKEditor site and on his own website), and asking for a demo of his commercial plugin "CKEditor Special Symbols", which has broad Unicode capability.
Hope that helps,
ClaireW
I didn't find any way to do it with Flying Saucer R8, but you can make it work using Flying Saucer R9.
The method ITextFontResolver.addFont(String path, String fontFamilyNameOverride, String encoding, boolean embedded, String pathToPFB) allows you to add the font under a specific name.
Code sample:
ITextRenderer renderer = new ITextRenderer();
// Adding fonts
renderer.getFontResolver().addFont("fonts/ARIALUNI.TTF", "arial unicode ms", BaseFont.IDENTITY_H, BaseFont.EMBEDDED, null);
renderer.getFontResolver().addFont("fonts/ARIALUNI.TTF", "Arial Unicode MS", BaseFont.IDENTITY_H, BaseFont.EMBEDDED, null);
String inputFile = "test.html";
renderer.setDocument(new File(inputFile));
renderer.layout();
String outputFile = "test.pdf";
OutputStream os = new FileOutputStream(outputFile);
renderer.createPDF(os);
os.close();
You can find Flying Saucer R9 on Maven Central (org.xhtmlrenderer:flying-saucer-pdf).
The simplest solution (until CKEditor fixes that bug) is to do the post-processing.
You can do it on the server (really simple, you already have the code) or with a little CKEditor plugin. That gives you the solution you want, and unless you need to add more fonts it will keep working without any further changes.
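If you go the server-side route, here is a minimal sketch of that post-processing, run on the HTML before handing it to Flying Saucer (the helper method name is illustrative):
// Normalizes the font name regardless of the casing CKEditor emitted;
// (?i) makes the regex match case-insensitive.
String fixFontName(String html) {
    return html.replaceAll("(?i)arial unicode ms", "Arial Unicode MS");
}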

Docx4J: Vertical text frame not exported to PDF

I'm using Docx4J to build an invoice template.
On the left side of the page, it's usual to show a legal sentence such as: Registered company in ... Book ... Page ...
I have inserted this in my template with a Word text frame.
My issue is: when exporting to .docx, this legal text is shown perfectly, but when exporting to .pdf, it's rendered as a horizontal table under the other data.
The code to export to PDF is:
FOSettings foSettings = Docx4J.createFOSettings();
foSettings.setFoDumpFile(foDumpFile); // dumps the intermediate XSL-FO for inspection
foSettings.setWmlPackage(template);
OutputStream fos = new FileOutputStream(new File("C:/mypath/prueba_OUT.pdf"));
Docx4J.toFO(foSettings, fos, Docx4J.FLAG_EXPORT_PREFER_XSL);
Any help would be very appreciated.
Thanks.
You'd need to extend the PDF-via-FO code; see further: How to correctly position a header image with docx4j?
Float left may or may not be easy; similarly the rotated text.
In general, the way to work on this is to take the FO generated by docx4j, then hand-edit it into something which FOP can convert to a PDF you are happy with. If you can do that, then it's a matter of modifying docx4j to generate that FO.
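For the hand-editing step, rotated side text is typically expressed in XSL-FO with a block-container whose reference-orientation is set. A minimal fragment as a starting point (the positions, sizes, and wording are illustrative assumptions, not taken from docx4j's actual output):
<!-- Rotates its content 90 degrees counter-clockwise and pins it to the left margin. -->
<fo:block-container absolute-position="absolute" left="5mm" top="50mm"
    width="180mm" height="10mm" reference-orientation="90">
  <fo:block font-size="6pt">Registered company in ... Book ... Page ...</fo:block>
</fo:block-container>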

Images in Html to PDF using wkhtmltopdf in mvc 4

I am using wkhtmltopdf to convert HTML to PDF in MVC 4. I was able to convert HTML to PDF; the only problem is that images do not render. A small rectangle appears where each image should be. My images are in a database, so when I get the HTML string in my controller, this is how an image is referenced right before I pass the string to the converter:
<img src="/Images/Image/GetImageThumbnail?idImage=300" alt=""/>
So I am thinking that this approach doesn't work because I pass a string to the converter, so the relative image URL cannot be resolved. Any ideas how to solve this problem when the images are in a DB?
I solved a similar issue by replacing src="/img/derp.png" with src="http://localhost/img/derp.png". I get the host part from the request that my controller receives.
// Here I'm actually processing with HtmlAgilityPack but you get the idea
string host = request.Headers["host"];
string src = node.Attributes["src"].Value;
node.Attributes["src"].Value = "http://" + host + src;
This means that the server must also be able to serve images directly from URLs like that.
I guess it could be done with string.Replace as well, if your HTML is in a string:
string host = request.Headers["host"];
html = html.Replace("src=\"/", "src=\"http://"+host+"/"); // not tested

Tika - how to extract text from PDF text: underlined, highlighted, crossed out

I'm using Tika* to parse a PDF file.
There is no problem retrieving the document's text, but I can't figure out how to extract text that is:
underlined
highlighted
crossed out
Adobe Writer gives you different text edit options, but I'm not able to see where they are "hidden".
Is there a way to extract this formatting metadata (underline, highlight, ...)?
Do you know if Tika is able to extract this data?
*http://tika.apache.org/
Wow. Four years is a long time to wait for an answer, and I figure you have found a solution by now. Anyway, for the sake of those who visit this link: the answer is yes. Apache Tika can extract not just the text in a document but also the formatting (e.g. bold, italics). This was my scenario:
// inputStream is the document you wish to parse.
AutoDetectParser parser = new AutoDetectParser();
// ToXMLContentHandler keeps the markup (bold, italics, ...) in the output.
ContentHandler handler = new BodyContentHandler(new ToXMLContentHandler());
Metadata metadata = new Metadata();
parser.parse(inputStream, handler, metadata);
System.out.println(handler.toString());
The print statement prints an XML rendering of your document. With a little work cleaning up the XML (really HTML tags), you would be left with tags like <b>text</b> for bold text and <i>text</i> for italic text. Then you could find a way to render it. Good luck.