iTextSharp - Incorrect text position - pdf

When extracting the position of the words in this example:
http://www.dertour.de/static/agb/2015/sommer/DER_Deutschland_So15.pdf
with iTextSharp 5.5.8
I'm getting 'incorrect' coordinates for some words. For example, on line 17 of the first paragraph ('gehen oder im Widerspruch zur Reiseaus-'), the x-values of the top-left position of the words are 118, 217, 296, 350, 524, 587. Only the first value seems correct; I would expect 118, 208, 277, 320, 487, 540. The x-value of the bottom-right point of the space character between 'gehen' and 'oder' is 208, which seems correct and also seems to be the correct x-position for the word 'oder'. Maybe it has something to do with the fill mode of the paragraph, but I'm not sure which actions I should perform to get the right coordinates.
I'm using LocationTextExtractionStrategy and calculate the word-positions to a 300 dpi coordinate system.
public override void RenderText(TextRenderInfo renderInfo)
{
    // for the provided example
    // uUnit = 1
    // originX = 33.862
    // originY = 33.555
    // dpi = 300
    // above values were calculated with code:
    // PdfNumber userUnit = pageDict.GetAsNumber(PdfName.USERUNIT);
    // if (userUnit != null)
    // {
    //     uUnit = userUnit.FloatValue;
    // }
    // Rectangle dim = reader.GetPageSize(i);
    // float originX = dim.Left;
    // float originY = dim.Bottom;

    // calculate coordinates:
    renderInfo.GetText();
    LineSegment segment = renderInfo.GetBaseline();
    List<TextRenderInfo> charInfo = renderInfo.GetCharacterRenderInfos().ToList();
    foreach (TextRenderInfo item in charInfo)
    {
        // per-character baseline; index [0] is x, index [1] is y
        LineSegment char_segment = item.GetBaseline();
        int char_left = (int)Math.Round((char_segment.GetStartPoint()[0] - originX) * dpi * uUnit / 72.0f);
        int char_top = (int)Math.Round((item.GetAscentLine().GetEndPoint()[1] - originY) * dpi * uUnit / 72.0f);
        int char_right = (int)Math.Round((char_segment.GetEndPoint()[0] - originX) * dpi * uUnit / 72.0f);
        int char_bottom = (int)Math.Round((item.GetDescentLine().GetStartPoint()[1] - originY) * dpi * uUnit / 72.0f);
    }
}

This indeed is a bug in iText & iTextSharp:
The lines with the extremely inaccurate x coordinates are those for which a large wordspacing value is set, e.g. your line:
0.2861 Tw T*
[<0047004500480045004E0000>-286<004F0044004500520000>-286<0049004D0000>-231<003700490044004500520053005000520055004300480000>-286<005A005500520000>-286<00320045004900530045004100550053000D>]TJ
(That 0.2861 argument for Tw is large.)
According to the ToUnicode map of the font in question, the 0000 at the end of each word maps to the space character. Thus, iText here adds the word spacing value when calculating the x coordinates, because according to the PDF specification ISO 32000-1:
Word spacing works the same way as character spacing but shall apply only to the ASCII SPACE character
(First sentence of section 9.3.3 Word Spacing)
Unfortunately it does not take into account
Word spacing shall be applied to every occurrence of the single-byte character code 32 in a string when using
a simple font or a composite font that defines code 32 as a single-byte code. It shall not apply to occurrences of
the byte value 32 in multiple-byte codes.
(Last sentence of section 9.3.3 Word Spacing)
At the 0000 above, therefore, word spacing must not be applied even though it is mapped to the space character, because
- the font encoding in question is purely multi-byte, and
- even in case of single-byte encoded space characters the word spacing is applied only at the single-byte code 32, not at a code which merely maps to the space character with ASCII code 32.
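The rule from section 9.3.3 can be sketched in Python as follows. This only illustrates the specification text and is not iText's actual code; the function name `word_spacing_for` and the convention of passing one character code as `bytes` are assumptions made for the example.

```python
def word_spacing_for(code: bytes, tw: float) -> float:
    """Return the word spacing to add after ONE character code.

    `code` holds the raw bytes of a single character code as split by
    the font's encoding (1 byte for simple fonts, possibly more for
    composite fonts). What the code maps to via ToUnicode is irrelevant.
    """
    # Only the single-byte code 32 receives word spacing.
    return tw if code == b"\x20" else 0.0


# The 2-byte code 0000 maps to a space in ToUnicode, but it is not
# the single-byte code 32, so no word spacing applies.
print(word_spacing_for(b"\x00\x00", 0.2861))  # 0.0
print(word_spacing_for(b"\x20", 0.2861))      # 0.2861
print(word_spacing_for(b"\x00\x20", 0.2861))  # 0.0
```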
Usually this is not a problem during text extraction: PDF generators which encode space characters using multi-byte encodings are usually aware that word spacing does not apply to them and, therefore, don't change the word spacing from its default value 0, so the iText bug does no harm there. Usage of word spacing instructions usually indicates that fonts are used which do map the single-byte code 32 to the space character.
Your PDF, on the other hand, seems not to have been created with that fact in mind. It looks like the word spacing was set first (0.2861 Tw) and, after recognizing that it made no difference, explicit gaps were added (-286 in the TJ instruction). (Or that was part of the development history of the PDF generator in question.)
Please be aware that positive values in the TJ argument mean a shift to the left, so negative values (as claimed for the -286 above) indeed widen or add gaps:
array TJ Show one or more text strings, allowing individual glyph positioning. Each element of array shall be either a string or a number. If the element is a string, this operator shall show the string. If it is a number, the operator shall adjust the text position by that amount; that is, it shall translate the text matrix, Tm . The number shall be expressed in thousandths of a unit of text space (see 9.4.4, "Text Space Details"). This amount shall be subtracted from the current horizontal or vertical coordinate, depending on the writing mode. In the default coordinate system, a positive adjustment has the effect of moving the next glyph painted either to the left or down by the given amount. Figure 46 shows an example of the effect of passing offsets to TJ.
(Table 109 – Text-showing operators in ISO 32000-1)
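The sign convention can be made concrete with a small Python sketch. It only evaluates the TJ-number term of the horizontal displacement formula in ISO 32000-1, section 9.4.4 (the number is divided by 1000, subtracted, and scaled by font size and horizontal scaling); the function name `tj_shift` and its parameters are hypothetical.

```python
def tj_shift(tj_number: float, font_size: float = 1.0, h_scale: float = 1.0) -> float:
    """Shift of the next glyph caused by one TJ number (horizontal writing).

    Positive results move the next glyph to the right, negative results
    to the left. `h_scale` is the horizontal scaling Th as a fraction
    (1.0 means 100%).
    """
    # The number is SUBTRACTED from the current position, hence the minus sign.
    return -tj_number / 1000.0 * font_size * h_scale


print(tj_shift(-286) > 0)  # True: a negative number widens the gap
print(tj_shift(286) < 0)   # True: a positive number moves the glyph left
```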

Related

PDF Annotation containing Unicode characters(spanning two bytes) are not showing in firefox but working fine in chrome

Setting unicode characters in the Annotation appearance stream using arial unicode is showing up the characters correctly in chrome but not in firefox. Any Idea on this? The annotation appearance stream is as below. For example, to show a tick symbol.
BT /F3 34 Tf 1.0 0.0 0.0 rg 107.44528 635.27405 Td [
<FEFF27132713>
] TJ ET
Most likely your content stream is invalid.
If I understand you correctly, you want to control the encoding of a string parameter of a text showing instruction in a PDF content stream by prefixing it with a Unicode BOM. This does not work:
A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.5, "Character encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite fonts".
(ISO 32000-2, section 9.4.3 "Text-showing operators")
In case of your example, therefore, that F3 font
either is a simple font with some single-byte encoding and your <FEFF27132713> string contains 6 separate character codes, each of them representing a glyph by itself if any,
or is a composite font possibly with a multi-byte encoding and your <FEFF27132713> string contains up to 6 separate character codes, each of them representing a glyph by itself if any.
In either case the interpretation of your string depends on a fixed encoding defined by the font object in question, you cannot manipulate it by some BOM prefix.
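To illustrate the two readings, here is a minimal Python sketch that splits the hex string into character codes. It assumes a fixed code length per font (1 byte for a simple font, 2 bytes for a composite font with a two-byte encoding); real CMaps may define variable-length codes, and the question does not say which encoding F3 actually uses. The name `split_codes` is hypothetical.

```python
from typing import List


def split_codes(hex_string: str, code_length: int) -> List[int]:
    """Split a PDF hex string into character codes of a fixed byte length."""
    data = bytes.fromhex(hex_string)
    return [
        int.from_bytes(data[i:i + code_length], "big")
        for i in range(0, len(data), code_length)
    ]


# Simple font: every byte is its own character code (6 codes).
print([hex(c) for c in split_codes("FEFF27132713", 1)])
# ['0xfe', '0xff', '0x27', '0x13', '0x27', '0x13']

# Two-byte composite font: FEFF is just one more character code, not a BOM.
print([hex(c) for c in split_codes("FEFF27132713", 2)])
# ['0xfeff', '0x2713', '0x2713']
```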

PDFBox - How to convert Xheight to points

I am trying to find the Xheight of a font using Pdfbox.
font is of type PDFont
println(font.name + ": " + font.fontDescriptor.xHeight)
Output of this is for font size 16pt:
TimesNewRomanPS-BoldMT: 546.0
But I am not able to identify how to convert this 546.0 into points or pixel or mm.
When you shared the PDF you took your information from, the cause became clear: The information in the font at hand simply is broken.
Details
As an example you refer to CourierNew in your example file font-list-1.pdf.
This font is used on page 2, the associated FontDescriptor is this object:
44 0 obj
<<
/StemV 42
/FontName/CourierNewPSMT
/FontStretch/Normal
/FontWeight 400
/Flags 34
/Descent -300
/FontBBox[-21 -680 638 1021]
/Ascent 832
/FontFamily(Courier New)
/CapHeight 578
/XHeight -578
/Type/FontDescriptor
/ItalicAngle 0
>>
endobj
So the font's XHeight value is -578. Which means it is rubbish in multiple ways:
It is negative. According to the specification the XHeight value is the vertical coordinate of the top of flat nonascending lowercase letters (like the letter x), measured from the baseline (ISO 32000-1, Table 122 – Entries common to all font descriptors). Having a negative value, therefore, means that all those flat nonascending lowercase letters are drawn completely below the baseline.
This obviously is nonsense for a fairly normal font like CourierNew.
When loading the font descriptor, PDFBox executes a sanity check and takes the absolute value here which is why you have not seen the negative sign.
The absolute value of XHeight equals the CapHeight value which is specified as the vertical coordinate of the top of flat capital letters, measured from the baseline (ibidem).
Ignoring the negative XHeight sign (which is nonsense, see above), therefore, the font claims that flat nonascending lowercase letters and flat capital letters reach up to the same top coordinate.
This obviously is nonsense for CourierNew.
(The XHeight values of many other fonts in your sample file are similarly broken.)
How else to get a sensible x height value
If you really need an x-height value for your fonts, you should inspect the drawing instructions for the flat nonascending lowercase letters in them and derive an x-height value from their respective heights.
(This won't always succeed because those fonts may be available as embedded subsets only, and such subsets might be void of flat nonascending lowercase letters.)
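As for the unit conversion itself: font descriptor metrics are given in glyph space, which for fonts other than Type 3 is 1/1000 of a text space unit. Assuming the value is trustworthy and no additional scaling is applied by the text matrix or the current transformation matrix, the conversion can be sketched in Python like this (the function names are hypothetical):

```python
def glyph_units_to_points(value: float, font_size: float) -> float:
    """Convert a glyph-space metric (1/1000 units) to points at a font size."""
    return value / 1000.0 * font_size


def points_to_mm(points: float) -> float:
    """Convert points (1/72 inch) to millimetres."""
    return points * 25.4 / 72.0


# Values from the question: XHeight 546 at font size 16.
x_height_pt = glyph_units_to_points(546.0, 16.0)
print(round(x_height_pt, 3))                # 8.736
print(round(points_to_mm(x_height_pt), 3))  # 3.082
```

Whether the result is meaningful still depends on the XHeight entry being sane, which it is not for several fonts in the sample file.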

PDF TJ operator

is it possible to determine if a number in TJ operator represents space between words?
Example: [(Sta)28(ry)-333(Plzenec,)]TJ
The number 28 is not enough for a space, whereas 333 should be a space given the actual font size. The font size is 9.96.
First of all, please be aware that there is no absolute threshold separating numbers that create spaces between words from numbers used for kerning. All you can do is develop heuristics which will fail for some documents, usually for very tightly set ones.
Now remember how those numbers are applied when calculating the text displacement tx or ty from the origin of the last character before the number to the origin of the first character thereafter:
tx = ((w0 - Tj / 1000) * Tfs + Tc + Tw) * Th
ty = (w1 - Tj / 1000) * Tfs + Tc + Tw
(ISO 32000-1, section 9.4.4 Text Space Details)
Thus, first of all such a number only widens the gap to the next character if it's negative.
Furthermore, the number is applied before the font size is multiplied; thus, one does not have to take the font size into account as I incorrectly claimed in a comment to the question.
The number (scaled by 1/1000) is directly subtracted from the glyph's displacement. So one can compare it with the glyph displacements of the font in question to get an impression of the meaning of the number.
The glyph displacements essentially are the numbers from the corresponding font's Widths or W array (defaulting to the MissingWidth / DW value) scaled by 1/1000. As both the TJ numbers and the widths are scaled by 1/1000, you can directly compare them.
Thus, an obvious option would be to compare the absolute value of negative TJ numbers to the width of the space glyph in the font in question. This differs from font to font, e.g. it's 600 for Courier, 278 for Helvetica, and 250 for Times-Roman.
Spaces between words created by TJ numbers don't necessarily have to be as wide as the full space glyph of the font, but a relevant fraction of it, e.g. half its value (YMMV), can be used as minimum for interpreting a TJ number as a space between words.
Unfortunately, though, if a PDF generator creates all spaces between words by TJ numbers and none by space glyphs, and if the font is embedded as a subset only, there is no need to embed the space glyph at all. In that case you might want to use other glyphs to compare with; often the length of a capital 'M' is used as a measure for the widths of a font, you might want to use a relevant fraction thereof, e.g. one fifth (YMMV again).
You can improve your heuristics
by also taking the character spacing value Tc into account: If Tc / Tfs is negative with a relevant absolute value, the text is tightly set. In that case you might want to lessen the limit number determined as above. Or
by an analysis of all the TJ numbers in your text or those in the surrounding text. Here I can only guess, though, what might be acceptable heuristics...
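The basic comparison against the space glyph width can be sketched in Python as follows. This is only one possible heuristic under the assumptions stated above; the function name `is_word_gap`, the default fraction of 0.5, and the use of 278 (the Helvetica space width mentioned above) as reference width are choices made for the example, since the question does not say which font is in use.

```python
def is_word_gap(tj_number: float, reference_width: float, fraction: float = 0.5) -> bool:
    """Guess whether a TJ number represents a space between words.

    `reference_width` is the width of the space glyph (or a stand-in such
    as a fraction of the width of 'M') in the same 1/1000 units as the
    TJ number, so both can be compared directly.
    """
    # Positive numbers move the next glyph left, so they never widen a gap.
    if tj_number >= 0:
        return False
    return -tj_number >= fraction * reference_width


# [(Sta)28(ry)-333(Plzenec,)]TJ with an assumed space width of 278
print(is_word_gap(28, 278))    # False
print(is_word_gap(-333, 278))  # True
```

For tightly set text the fraction could be lowered, as discussed for negative Tc values.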

How do I calculate word spacing in a PDF document?

For example:
20 0 0 48 20 500.0 Tm
[(H)6(ello)54(Wor)7(ld)] TJ
0 -1.1075 TD
There is no space (32) character in this array of text.
But somehow viewers understand that 54 is a space, while 6 and 7 are character spacing (kerning). Any ideas?
The TJ operator is documented in the PDF specification PDF 32000-1:2008 - Table 109 – Text-showing operators as follows:
Show one or more text strings, allowing individual glyph positioning. Each element of array shall be either a string or a number. If the element is a string, this operator shall show the string. If it is a number, the operator shall adjust the text position by that amount; that is, it shall translate the text matrix, Tm. The number shall be expressed in thousandths of a unit of text space [...]. This amount shall be subtracted from the current horizontal or vertical coordinate, depending on the writing mode. In the default coordinate system, a positive adjustment has the effect of moving the next glyph painted either to the left or down by the given amount. [...]

pdf decoding, what do the bracket and numbers do?

My PDF file uses deflate encoding. When I inflate the stream, it outputs something like this:
[(Lorem)-21( ipsum)-55( dolor)-14( sit)-55( amet,)-56( consectetur)-8( adipiscing)-14( elit.)-34( Donec)-15( faucibus)-49( lorem)-42( varius2)-56( mauris)-28( porttitor,)-34( et)-28( pellentesque)-1( )]TJ
What do the numbers and brackets mean?
They do not seem to be a character count or spacing.
Does anyone know?
That is an array for showing text (square brackets [] denote array objects); it should be followed by the TJ operator. The numbers are used to translate the text matrix (adjust the positioning of the text). Assuming horizontal text, a negative number moves the next glyph to the right.
From 9.4.3 Text-Showing Operators (Please see the spec for more details)
Show one or more text strings, allowing individual glyph positioning.
Each element of array shall be either a string or a number. If the
element is a string, this operator shall show the string. If it is a
number, the operator shall adjust the text position by that amount;
that is, it shall translate the text matrix, Tm. The number shall be
expressed in thousandths of a unit of text space (see 9.4.4, "Text
Space Details"). This amount shall be subtracted from the current
horizontal or vertical coordinate, depending on the writing mode. In
the default coordinate system, a positive adjustment has the effect of
moving the next glyph painted either to the left or down by the given
amount.
The parentheses denote string objects:
String objects shall be written in one of the following two ways:
As a sequence of literal characters enclosed in parentheses ( ) (using
LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS (29h)); see 7.3.4.2,
"Literal Strings."
...
A literal string shall be written as an arbitrary number of characters
enclosed in parentheses. Any characters may appear in a string except
unbalanced parentheses (LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS
(29h)) and the backslash (REVERSE SOLIDUS (5Ch)), which shall be
treated specially as described in this sub-clause. Balanced pairs of
parentheses within a string require no special treatment.
I suggest getting the PDF Spec and reading it to find out more info.
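To see the structure in practice, the following Python sketch splits such an array into its strings and numbers. It is a simplification, not a full PDF parser: it handles literal strings with balanced parentheses only, keeps a backslash-escaped character as-is instead of decoding escape sequences, and does not support hex strings. The function name `parse_tj_array` is hypothetical.

```python
def parse_tj_array(source: str) -> list:
    """Split a TJ array like "[(ab)-21(cd)]TJ" into strings and numbers."""
    items = []
    i = source.index("[") + 1
    while i < len(source) and source[i] != "]":
        ch = source[i]
        if ch == "(":
            # Literal string: read until the matching closing parenthesis.
            depth, i, chars = 1, i + 1, []
            while depth > 0:
                ch = source[i]
                if ch == "\\":
                    i += 1  # simplification: keep the escaped character
                    chars.append(source[i])
                elif ch == "(":
                    depth += 1
                    chars.append(ch)
                elif ch == ")":
                    depth -= 1
                    if depth > 0:
                        chars.append(ch)
                else:
                    chars.append(ch)
                i += 1
            items.append("".join(chars))
        elif ch.isspace():
            i += 1
        else:
            # Number: read up to the next delimiter or whitespace.
            start = i
            while i < len(source) and source[i] not in "()[] \t\r\n":
                i += 1
            items.append(float(source[start:i]))
    return items


print(parse_tj_array("[(Lorem)-21( ipsum)-55( dolor)]TJ"))
# ['Lorem', -21.0, ' ipsum', -55.0, ' dolor']
```

The strings are what gets painted; each number in between nudges the position of the following string.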