Page breaks in html don't appear when embedded in a docx file with docx4j - docx4j

I'm outputting a Word (docx) file using docx4j, and the page breaks aren't appearing in the document.
I'm using:
hr {page-break-after: always}
for the css, but it isn't rendering as a page break in the Word document.
What html or css should I be using to get an html page break to transfer over to the docx file?

Worked for me, try to use this tag:
<br style=\"page-break-after: always; clear:both;\"></br>
All other tag not worked with style css.

Works with heading tags or paragraph tags (h1, h2, p):
I used this css style with h1 tags in my html content and successfully got the page-breaks applying after the heading in my exported word document:
String pageBreakMarker = "<h1 style=\"page-break-after: always;\"></h1>";
I think this is not working with html break tags has to do with how docx4j treats html break tags.
Even looking at the getting started sample on github, I only see the 'page-break-after' or 'page-break-before' being used with only html paragraph tags (p)
See link to Docx4j getting started guide on github below:
Docx4j Getting Started
Does not work with html break tags:
Using docx4, I can confirm I the "page-break-after" or "page-break-before" does not work when trying to create page breaks in word docs from html content with break tags (hr):
<br style=\"page-break-after: always; clear:both;\"></br>

Simply add this line before heading tags or paragraph tags (h1, h2, p) :
<p style="line-height: 100%; margin-bottom: 0mm; page-break-before: always">
When i open it in OpenOffice writer, it breaks the page.

Related

Change Title/Headings Font in Quarto PDF Output

When RMarkdown .rmd documents are knitted as PDF, the text body as well as the title, subtitle and headings are rendered in the same LaTeX standard font.
When rendering a Quarto .qmd document as PDF, the font for the text body remains the same, but the title, subtitle and headings are rendered in a different font, without serifs.
To achieve consistency between the outputs of older R Markdown documents and newer Quarto documents, I would like to change the font for the title, subtitle and headings back to the normal font. How can I achieve this?
I tried using fontfamily: in the YAML header, but this did not find the fonts I wanted. I had some success by using \setkomafont{section}{\normalfont} in include-in-header:, as this did change the font, but only for h1 headings, not for h2 nor for the title or subtitle. It also removed all other formatting for h1 (e.g. fontsize, bold, etc.), which is not what I want.
Using this answer from Tex StackExchange we can do this in quarto easily.
---
title: "Fonts"
subtitle: "Changing fonts of title, subtitle back to normal font"
author: "None"
format:
pdf:
include-in-header:
text: |
\addtokomafont{disposition}{\rmfamily}
---
## Quarto
Quarto enables you to weave together content and executable code into a
finished document. To learn more about Quarto see <https://quarto.org>.
## Running Code
When you click the **Render** button a document will be generated that includes
both content and the output of embedded code.
And of course, check the section 3.6 - Text Markup of KOMA-Script manual, which provides a very detailed list of elements (like author, chapter, title, subtitle, date, etc.) for whose such changes can be done.
If the font used in the body is known, then you can set the font used in title and headings with sansfont: .... It's wise to also set mainfont to make sure they are the same.
The default font used is Latin Modern Roman, so adding this to the YAML frontmatter should do it:
---
mainfont: Latin Modern Roman
sansfont: Latin Modern Roman
---

How to tokenize html tags with spacy?

I need to tokenize html text with spacy. Or merge tags after tokenization. They can be any html tags, e.g.:
<br> <br/> <br > <n class="ggg">
There is an example of tag merging in documentation for tag, but it can't work with all types of tags. If I write rule like:
[{'ORTH': '<'}, {}, {'ORTH': '>'}]
It will join some tags:
<br><p>
Or separate like:
<
n
class="ggg
"
>
I have tried to write custom tokenizer also, but I had problem with spaces.
I want every html tag to be a separate token, e.g.:
<br>
<br >
<n class="ggg">
IMHO, removing the HTML tags and converting to plain text is the correct way to go, rather than making html tags 'stop words', because some of those tags are actually valid words that can appear in text and should NOT be ignored (e.g., <body> vs body).
If you have a construct like
<span>word</span><span>word</span>
It renders as wordword in a user agent and should in fact be interpreted as a single word. For example, one might give you an HTML page containing something like:
<p><strong>S</strong>oup .... </p>
This obviously renders as 'Soup' and should be taken as the word soup and not as the words s and oup.
Now, if for whatever reason you must assume that any HTML tag boundary is a word separator (wrong, in most cases), you should do the following: use an HTML stream tokenizer, e.g., libxml2 and write handlers for startElement and characters only. The former should output a single space and the latter should output the characters as it gets them. This will convert your HTML input to plain text (just like an HTML tag remover would do), but also add a space after each element tag, so <span>word</span><span>word</span> would get converted to: "(space)word(space)word". This might add multiple spaces when nested tags are present, but you can easily deal with this when you split the cleaned-up text into words for further processing.

DITA OT printing '#' in stead of Chinese characters in PDF

I am very new to DITA OT. Downloaded the DITA-OT1.5.4_full_easy_install_bin and playing around with it. I'm trying to print few characters in Simplified Chinese (zh-CN) into a PDF. I see that the characters are printed correctly in XHTML but in PDF they are printed as "#".
In the command line I see this - "Warning: Glyph "?" (0x611f) not available in font "Helvetica".
Here are the things I have tried so far:
In demo\fo\fop\conf\fop.xconf :
<fonts>
<font kerning="yes"
embed-url="file:///C:/Windows/Fonts/simsun.ttc"
embedding-mode="subset" encoding-mode="cid">
<font-triplet name="SimSun" style="normal" weight="normal"/>
</font>
<auto-detect/>
<directory recursive="true">C:\Windows\Fonts</directory>
</fonts>
In demo\fo\cfg\fo\attrs\custom.xsl :
<xsl:attribute-set name="__fo__root">
<xsl:attribute name="font-family">SimSun</xsl:attribute>
</xsl:attribute-set>
In demo\fo\cfg\fo\font-mapping.xml added this block for Sans, Serif & Monospaced logical fonts:
<physical-font char-set="Simplified Chinese">
<font-face>SimSun</font-face>
</physical-font>
In samples\concepts\garageconceptsoverview.xml :
<shortdesc xml:lang="zh_CN">職業道德感.</shortdesc>
And this is the command I am using to generate the PDF:
ant -Dargs.input=samples\hierarchy.ditamap -Dtranstype=pdf
Any help would be appreciated. Thanks.
[EDIT]
I see that the topic.fo file which gets generated in temp folder, does contain the Chinese characters correctly. Like this:
<fo:block font-size="10pt" keep-with-next.within-page="5" start-indent="25pt">職業道德感.</fo:block>
But I do not see the font related information anywhere in this document.
First of all you should set the "xml:lang='zh_CN'" attribute on the root elements for all DITA topics and maps. This will help the DITA OT publishing decide the language to use for static texts like "Table X" and also to decide on the charset to use for the font mappings.
Then you should run the publishing by setting the parameter "clean.temp" parameter to "no".
After the publishing you can look in the temporary files folder for a file called "topic.fo" and look inside it to see what font families are used.
Because even if you set a font on the root element, there are other places in the XSL-FO file where you have font families set explicitly.
So instead of setting a font on the XSL-FO root element you should edit the font mappings XML file and for each of the logical fonts "Sans" and "Serif" you should configure the actual font family to use for the Chinese charset, something like:
<logical-font name="Sans">
.........
<physical-font char-set="Simplified Chinese">
<font-face>SimSun</font-face>
</physical-font>
......
</logical-font>
More about how the font mappings work:
https://www.oxygenxml.com/doc/versions/17.0/ug-editor/#topics/DITA-map-set-font-Apache-FOP.html
Update:
If you insist of having that XSLT customization which sets the "SimSun" font as a font family on the root element, then in the font-mappings.xml you need to define a new mapping for your alias:
<aliases>
<alias name="SimSun">SimSun</alias>
</aliases>
and then map the logical font to a physical one in the same font-mappings.xml:
<logical-font name="SimSun">
<physical-font char-set="Simplified Chinese">
<font-face>SimSun</font-face>
</physical-font>
</logical-font>
0x611f , this character is a chinese character (感), helvetica is an europe font , so no this character in the "helvetica" font. You can search this "helvetica" font loaction, in this position your content(ditamap/dita) should use chinese font, not europe font. You must find that arritbute that include the [font-famliy=helvetical], modify in your own plugin [SimSun, Helvetical].
Sorry, I cannot answer your question, but you should definetely try a newer DITA-OT from http://dita-ot.github.io/. Your DITA-OT is not supported anymore. Maybe your problem fades away using the latest release.

multiline code block in markdown adds unwanted tabs

today I'm implementing my page in nanoc (haml templates) and I wanted to write some posts in markdown, but when it goes to multiline code blocks something weird is happening - second line in code block has additional tabs. I've tried multiple markdown syntaxes, such as:
//double tab wrapping
line 1 is fine
line 2 is wrapping (don't know why!)
and
~~~
//tilde code wrapping
line 1 is fine
line 2 is wrapping
~~~
And both solutions gives me the result something like this:
line 1 is fine
line 2 is wrapping
Inspecting elements through browser shows that there is no additional padding - this whitespace is made with tabs for sure.
Can someone help me with this? Maybe I'm doing something wrong?
When you use = in Haml to include the results of a script, Haml will re-indent the inserted text so that it matches the indentation of where it is included. For example, if you have Haml that looks something like this:
%html
%body
.foo
= insert_something
and insert_something returns some HTML like this:
<p>
This is possily generated from Markdown.
</p>
then the resulting HTML will look like this:
<html>
<body>
<div class='foo'>
<p>
This is possily generated from Markdown.
</p>
</div>
</body>
</html>
Note how the p element is indented to match its position in the document.
Normally this doesn’t matter, because of the way whitespace in HTML is collapsed. However there are HTML elements where the whitespace is important, in particular pre.
What it looks like is happening here is that your Markdown is generating something like
<pre><code>line 1 is fine
line 2 is wrapping
</code></pre>
and when it is included in your Haml file (I’m guessing you’re using Haml layouts with = yield to include the Markdown) it is being indented and the whitespace is showing up when you view the page. Note how the first line is immediately after the opening tags so there is no extra whitespace.
There are a couple of ways to fix this. If you set the :ugly option then Haml won’t re-indent blocks like this (sorry I don’t know how you set Haml options in Nanoc).
You could also use the find_and_preserve helper method. This replaces all newlines in whitespace sensitive tags with HTML entity
, so that they won’t be affected by the extra whitespace when indented:
= find_and_preserve(yield)
Haml provides an easy way to use find_and_preserve; ~ works the same as =, except that it runs find_and_preserve on the result, so you can do:
~ yield

How to handle new line in handlebar.js

I am using HandleBar.js in my rails jquery mobile application.
I have a json returned value data= "hi\n\n\n\n\nb\n\n\n\nhow r u"
which when used in .hbs file as {{data}} showing me as hi how r u and not as with the actual new line inserted
Please suggest me.
Pre tag helps me
Handlebars doesn't mess with newlines in your data unless you have registered a helper which is doing something with them. A good way of dealing with newlines in HTML without converting them to br tags would be to use the CSS property white-space while rendering the handlebars template in HTML. You can set its value to pre-line.
Read the related documentation on MDN
Look at the source of the generated file - your newline characters are probably there, HTML simply does not render newline characters as new lines.
You can insert a linebreak with <br />
However, it looks like you're trying to format the position of your lines using newline characters, which technically should be done by wrapping your lines in <p> or <div> tags and styling with CSS.
Simply use the CSS property white-space and set the value as pre-line
For a example:
<p style="white-space: pre-line">
{{text}}
</p>