How to align a paragraph in XSL-FO PDF

I want to align the paragraph below in XSL-FO.
Label : value
Be careful not to set the heap size too large, as whatever
you allocate reduces the amount of memory available to the
operating system and other programs, which could cause
excessive paging (memory swapped back and forth between
RAM and the swap disc, which will slow your system down)
with proper alignment and indentation. I am able to create it with word breaks, but the indentation is missing. Can anyone suggest a solution?

Do you mean something like this?
<fo:block text-align="justify" margin-left="1.5cm">Your text here ...</fo:block>
Or you can use the linefeed-treatment, white-space-treatment and white-space-collapse attributes together for preformatted text.
And you can drop the fo:block into a fo:block-container for background coloring and bordering.
<fo:block margin-left="5mm" margin-right="5mm"
font-family="verdana" font-size="12pt"
space-before="5mm" space-after="5mm" keep-together="always"
linefeed-treatment="preserve" white-space-treatment="preserve" white-space-collapse="false">
Your multiline
code here,
and more lines,
with multiple spaces on start and within
and empty lines ...
</fo:block>

Related

XSLT new lines not being preserved

For some reason my spaces aren't being preserved in my final PDF after xslt. My desired output is:
Static
text bold.
Here's my xslt template:
<xsl:preserve-space elements="*" />
<xsl:strip-space elements="" />
<xsl:template match="coverPage">
<fo:block font-size="12pt" color="black" text-align="center">
<xsl:text>
Static
text
</xsl:text>
</fo:block>
<fo:block font-size="12pt" color="black" text-align="center" font-weight="bold">
<xsl:text>
bold.
</xsl:text>
</fo:block>
</xsl:template>
I think there are a few issues with your XSLT:
there is no need to enclose the text inside xsl:text elements, as those text nodes are not composed only of whitespace characters and therefore will never be stripped (see Whitespace stripping for more details)
for the same reason, there is no need to use xsl:preserve-space and xsl:strip-space, unless of course you need them for other reasons
preserving linefeeds in the transformation from XML to XSL-FO is just the (required) first step, but then you must preserve them during the processing of the XSL-FO file; in order to do this, you must use the linefeed-treatment property: linefeed-treatment="preserve"
a literal linefeed is equivalent to a &#x0A; entity, so in your input you have 3 linefeeds between "Static" and "text", which will produce two empty lines when preserved; if that's not what you want, you have to remove some of them
the words "text" and "bold" are inside two different fo:block elements, so this means they will always be on different lines; if you want them to be placed one beside the other, those words must be inside fo:inline elements instead (and there must be an outer fo:block to contain them)
A final word of warning
While looking at an FO file the difference between a preserved linefeed and an ignored one is not immediately apparent, as it boils down to the presence of the linefeed-treatment attribute in an ancestor element (which could be quite far from the text node itself).
Clearer ways to force a line break in a specific position include:
using different fo:block elements, each one containing the text that should create a line (or several ones)
<fo:block>Static</fo:block>
<fo:block>text <fo:inline font-weight="bold">bold.</fo:inline></fo:block>
using an empty fo:block where a line break should be
<fo:block>
Static
<fo:block/>
text <fo:inline font-weight="bold">bold.</fo:inline>
</fo:block>

Identify paragraphs of PDF files using iTextSharp

For some semantic analysis work, I need to identify paragraphs in PDF files with iTextSharp. I know that the origin of the coordinate system in iTextSharp is the bottom-left corner of a page. I have found three features that define paragraph boundaries:
if the horizontal axis of the first word in one line is less than that of the general lines;
if the leading of two consecutive lines is larger than that of the general ones;
if one line ends with "." and the horizontal axis of the ending word is less than that of the other lines
However, I am stuck on the second one. How can I know the general leading between two lines in a paragraph? I mean there are different gaps between two consecutive lines, because some letters like 'f','g' need more space than the others like 'a','n' and so on.
Thanks for your help!
I'm assuming that you are parsing your PDF files using the parser functionality available in iTextSharp. See for instance Extract font height and rotation from PDF files with iText/iTextSharp to see how others have done this before you. A more elaborate article can be found here: Using Open Source PDF Technology to Solve the Unstructured Data Problem in Healthcare
Your question is: how can I calculate the leading? That is: how do I know the distance between the base lines of two consecutive lines?
When you parse a PDF using iTextSharp, you see each line as a series of TextRenderInfo objects. These objects allow you to get the base line of the text:
LineSegment baseline = renderInfo.GetBaseline();
Vector startpoint = baseline.GetStartPoint();
This Vector consists of different elements: Getting Coordinates of string using ITextExtractionStrategy and LocationTextExtractionStrategy in Itextsharp
You need startpoint[Vector.I2]. See also: How to detect newline from PDF using iTextSharp
The difference between that value for two consecutive lines gives you the value of the leading in its modern meaning. In the old times of printing, every character was a block of a fixed size. Printers (the people, not the machines) put a strip of lead between the rows of blocks to create some extra space between the lines. In modern computing, the word was preserved, but its meaning changed. There are no "blocks" anymore, but you can work with the font size. The font size is an average size of the glyphs in a font. Some glyphs will take more vertical space, some will take less, but by taking both the leading (distance between baselines) and the font size (average height of each glyph) into account, you can get a fair idea of the "space between the lines".
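Because the baseline does not move with ascenders or descenders, letters like 'f' or 'g' do not change the measured distance. As a rough sketch of the idea (not iTextSharp code), the "general" leading can be estimated as the median distance between consecutive baselines, and any clearly larger gap treated as a paragraph boundary. The function name, the input list and the factor of 1.3 are assumptions made for illustration; the y values would come from startpoint[Vector.I2] of each line.

```python
from statistics import median

def paragraph_breaks(baselines, factor=1.3):
    """Return indices of lines that start a new paragraph.

    baselines: baseline y coordinates of consecutive lines, top to bottom
    (PDF y grows upwards, so the values decrease down the page).
    factor: hypothetical tolerance; a gap larger than factor * typical
    leading is treated as a paragraph boundary.
    """
    # Distance between each pair of consecutive baselines.
    gaps = [upper - lower for upper, lower in zip(baselines, baselines[1:])]
    if not gaps:
        return []
    # The median is robust against the few larger inter-paragraph gaps.
    typical = median(gaps)
    return [i + 1 for i, gap in enumerate(gaps) if gap > factor * typical]

# Baselines at 14.4pt leading, with one wider gap before the fourth line.
ys = [700.0, 685.6, 671.2, 650.0, 635.6]
print(paragraph_breaks(ys))
```

The threshold would need tuning per document, for example when font sizes change between sections.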

XSL: chop off string at arbitrary place

I'm using XSL to style an XML document. The XSL defines a table with two columns. Thanks to Kevin Brown, the following code works fine to chop off at a word boundary, but what I need is to chop off at an arbitrary place.
<fo:table-cell>
<fo:block-container overflow="hidden" height="15pt"><fo:block>this is a very, very, very long text here</fo:block></fo:block-container>
</fo:table-cell>
If you generate this from XML and XSL, you would normally create a template for that particular content and insert &#x200B; entities (the zero-width space character) between the letters. However you do it, make the content look like this (this says "very long word" with that entity between the letters):
v&#x200B;e&#x200B;r&#x200B;y l&#x200B;o&#x200B;n&#x200B;g w&#x200B;o&#x200B;r&#x200B;d
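The transformation itself is simple. As a minimal sketch outside XSLT (in a real stylesheet this would live in a template), the following inserts a zero-width space between the letters of every word. Both function names are made up for this example.

```python
ZWSP = "\u200b"  # zero-width space, written as &#x200B; in XML

def make_breakable(text):
    """Insert a zero-width space between the letters of every word."""
    return " ".join(ZWSP.join(word) for word in text.split(" "))

def as_xml_entities(text):
    """Show the zero-width spaces as character references."""
    return text.replace(ZWSP, "&#x200B;")

print(as_xml_entities(make_breakable("very long word")))
```

The ordinary spaces between words are left alone, since the formatter can already break there.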
So in your example (I only put them near the break so you can see):
<fo:table-cell border="1pt solid silver">
<fo:block-container overflow="hidden" height="15pt"><fo:block>this is a very, very, very l&#x200B;o&#x200B;n&#x200B;g t&#x200B;e&#x200B;x&#x200B;t here</fo:block></fo:block-container>
</fo:table-cell>
With this, the text now breaks at the "o" in "long".
An interesting effect, if you are so inclined, is to set text-align="justify" on that fo:block. Everything then lines up, provided you insert an fo:leader of sufficient length at the end of the block to fill the cell. NOTE: This does not work in Apache FOP, but it does in RenderX XEP.
Like:
<fo:table-cell border="1pt solid silver">
<fo:block-container overflow="hidden" height="15pt"><fo:block text-align="justify">this is a<fo:leader leader-length.minimum="3in"/></fo:block></fo:block-container>
</fo:table-cell>

What does an /ActualText of FEFF0009 mean in a PDF?

I've been looking into a PDF file to understand how it is built.
I noticed that InDesign has created PDFs with text as below (after decompression using pdftk).
0 Tc /Span<</ActualText<FEFF0009>>> BDC
4.018 -0.2 Td
( )Tj
I understand the role of ActualText (for copy/paste/searching) but I'm wondering exactly how I should be interpreting the FEFF0009. It looks like a UTF-16 string with BOM chars to represent a tab character. This seems incorrect as it's really a space. I'm wondering if there is a special meaning here?
"This seems incorrect as it's really a space."
No, it's really a tab.
14.9.4 Replacement Text
NOTE 1: Just as alternate descriptions can be provided for images and other items that do not translate naturally into text (as described in the preceding sub-clause), replacement text can be specified for content that does translate into text but that is represented in a nonstandard way.
(PDF 32000-1:2008)
The PDF text engine does not support the concept of 'tabs'. In this case, InDesign mimicked the function of a tab character by inserting a space in the text stream, and it could set the space width to match the distance spanned by the original tab or use a large relative positioning for the rest of the text (which it did here: the horizontal displacement of 4.018 in your code snippet).
The general idea is that a space is rendered on the position of the tab, but when you copy this text and paste somewhere else you get a tab character. I suppose the 'space' is only inserted to have something to copy.
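The reading of FEFF0009 as a tab can be checked directly. This is a small sketch assuming the hex string is a PDF text string that begins with the UTF-16BE byte order mark (FEFF); the function name is made up for this example.

```python
def decode_actual_text(hex_string):
    """Decode a hex-encoded PDF text string that starts with a UTF-16BE BOM."""
    raw = bytes.fromhex(hex_string)
    if raw[:2] != b"\xfe\xff":
        raise ValueError("expected a UTF-16BE byte order mark (FEFF)")
    # Everything after the BOM is UTF-16BE encoded text.
    return raw[2:].decode("utf-16-be")

print(repr(decode_actual_text("FEFF0009")))  # '\t'
```

So the two bytes after the BOM, 0009, are U+0009, the horizontal tab.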

Preserve "long" spaces in PDFBox text extraction

I am using PDFBox to extract text from PDF.
The PDF has a tabular structure, which is quite simple, and the columns are very widely spaced from each other.
This works really well, except that all kinds of horizontal space get converted into a single space character, so that I cannot tell columns apart anymore (space within words in a column looks just like space between columns).
I appreciate that a general solution is very hard, but in this case the columns are really far apart so that having a simple differentiation between "long spaces" and "space between words" would be enough.
Is there a way to tell PDFBox to turn horizontal whitespace of more than x inches into something other than a single space? A proportional approach (x inches become y spaces) would also work.
The pdftotext C library/tool has a '-layout' switch that tries to preserve the layout. Basically, if I can emulate that with PDFBox, that would be perfect.
There does not seem to be a setting for this, but I was able to modify the source for the PDFTextStripper tool to output a column separator (|) when a "long" space was encountered. In the code where it builds the output line, it is possible to look at the x positions of the current and previous letter, and if the gap between them is large enough, do something special. PDFTextStripper has lots of protected methods, but it turned out to be not really all that extensible. I ended up having to copy the whole class to change a private method.
Looking at the code in there, I call myself lucky that with the particular PDF, this simple approach was successful. A more general solution seems very tricky.
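The gap check described above can be sketched independently of PDFBox. This is not the modified PDFTextStripper code, only an illustration of the idea; the tuple layout, the function name and both thresholds (in PDF points, 36pt being half an inch) are assumptions.

```python
def join_with_separators(glyphs, word_gap=2.0, long_gap=36.0, sep="|"):
    """Rebuild one text line, marking wide horizontal gaps.

    glyphs: list of (x, width, char) tuples sorted by x, in PDF points.
    word_gap / long_gap: hypothetical thresholds for an ordinary space
    and for a column separator.
    """
    out = []
    prev_end = None
    for x, width, char in glyphs:
        if prev_end is not None:
            # Horizontal distance from the end of the previous glyph.
            gap = x - prev_end
            if gap > long_gap:
                out.append(sep)
            elif gap > word_gap:
                out.append(" ")
        out.append(char)
        prev_end = x + width
    return "".join(out)

line = [(100, 6, "a"), (106, 6, "b"), (116, 6, "c"), (250, 6, "d")]
print(join_with_separators(line))
```

For the sample line this yields "ab c|d": a 4pt gap becomes a space and the 128pt gap becomes a column separator.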
PDF text extraction is difficult.
If the text was output as one big string separated by spaces, such as:
PDFTextOut(" Column 1 Column 2 Column 3");
and you are using a fixed-width font such as Courier, then you could theoretically calculate the number of spaces between items of text, because each character is the same width. If the font is proportional, such as Arial, then the calculation is harder.
In reality, most PDFs are generated by individually placing each piece of text directly into its position. Therefore, there is technically no space character or any other character between columns. The text is just placed at an absolute position on the page.
PDFMoveTo(100,100);
PDFTextOut("Column 1");
PDFMoveTo(250,100);
PDFTextOut("Column 2");
In order to perform data extraction on PDF documents you have to do a little bit more work to find and match column data by using pixel locations as you have mentioned and by making some assumptions and having a little bit of luck.
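One way to match up such absolutely positioned pieces is to group them into rows by their y coordinate and then order each row by x. The sketch below assumes the fragments have already been extracted as (x, y, text) tuples; the function name and the 2pt tolerance are assumptions, and the simple bucketing can split a row whose baselines straddle a bucket boundary.

```python
from collections import defaultdict

def rows_from_fragments(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows ordered top to bottom."""
    rows = defaultdict(list)
    for x, y, text in fragments:
        # Snap y to a bucket so slightly different baselines share a row.
        rows[round(y / y_tolerance)].append((x, text))
    # PDF y grows upwards, so larger y values come first.
    return [[text for _, text in sorted(cells)]
            for _, cells in sorted(rows.items(), reverse=True)]

fragments = [(250, 100, "Column 2"), (100, 100, "Column 1"),
             (100, 86, "a"), (250, 86, "b")]
print(rows_from_fragments(fragments))
```

Matching cells to columns across rows would then compare the x positions, which is where the assumptions and luck mentioned above come in.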