PDF TJ operator - pdf

Is it possible to determine whether a number in the array of a TJ operator represents a space between words?
Example: [(Sta)28(ry)-333(Plzenec,)]TJ
The number 28 is too small to be a space, whereas 333 should be a space given the actual font size. The font size is 9.96.

First of all please be aware that there is no absolute limit number separating numbers for spaces between words from spaces for kerning. All you can do is develop heuristics which will fail for some documents, usually for very tightly set ones.
Now remember how those numbers are applied when calculating the text displacement tx or ty from the origin of the last character before the number to the origin of the first character thereafter:
tx = ((w0 - Tj / 1000) × Tfs + Tc + Tw) × Th
ty = (w1 - Tj / 1000) × Tfs + Tc + Tw
(ISO 32000-1, section 9.4.4 Text Space Details, also discussed here)
Thus, first of all such a number only widens the gap to the next character if it's negative.
Furthermore, the number is applied before the font size is multiplied; thus, one does not have to take the font size into account as I incorrectly claimed in a comment to the question.
The number (scaled by 1/1000) is directly subtracted from the glyph's displacement. So one can compare it with the glyph displacements of the font in question to get an impression of the meaning of the number.
The glyph displacements essentially are the numbers from the corresponding font's Widths or W array (defaulting to the MissingWidth / DW value) scaled by 1/1000. As both the TJ numbers and the widths are scaled by 1/1000, you can directly compare them.
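To make the effect of the sign concrete, here is a minimal Python sketch of the horizontal displacement calculation from ISO 32000-1, section 9.4.4. The function name and the glyph width of 500 are illustrative assumptions, not values from any particular font; Tc, Tw and Th default to their neutral values.

```python
def horizontal_displacement(w0, tj, font_size,
                            char_spacing=0.0, word_spacing=0.0,
                            horizontal_scaling=1.0):
    """Horizontal text displacement for horizontal writing mode.

    Implements tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th.
    w0 is the glyph displacement in text space units, i.e. the
    width from the font's Widths array divided by 1000.
    """
    return ((w0 - tj / 1000.0) * font_size
            + char_spacing + word_spacing) * horizontal_scaling


# Hypothetical glyph width of 500 units, font size 9.96 as in the question.
plain = horizontal_displacement(0.5, 0, 9.96)
kerned = horizontal_displacement(0.5, 28, 9.96)     # positive: gap shrinks
spaced = horizontal_displacement(0.5, -333, 9.96)   # negative: gap widens
print(kerned < plain < spaced)
```

Note that the TJ number sits inside the parentheses that are multiplied by the font size, which is why it can be compared with the glyph widths directly.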
Thus, an obvious option would be to compare the absolute value of negative TJ numbers to the width of the space glyph in the font in question. This differs from font to font, e.g. it's 600 for Courier, 278 for Helvetica, and 250 for Times-Roman.
Spaces between words created by TJ numbers don't necessarily have to be as wide as the full space glyph of the font, but a relevant fraction of it, e.g. half its value (YMMV), can be used as minimum for interpreting a TJ number as a space between words.
Unfortunately, though, if a PDF generator creates all spaces between words by TJ numbers and none by space glyphs, and if the font is embedded as a subset only, there is no need to embed the space glyph at all. In that case you might want to use other glyphs to compare with. The width of a capital 'M' is often used as a measure for the widths of a font, so you might want to use a relevant fraction thereof, e.g. one fifth (YMMV again).
You can improve your heuristics in two ways:
By also taking the character spacing value Tc into account: If Tc / Tfs is negative with a relevant absolute value, the text is tightly set. In that case you might want to lower the limit number determined as above.
By analysing all the TJ numbers in your text or those in the surrounding text. Here I can only guess, though, what might be acceptable heuristics...
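The basic heuristic described above might be sketched in Python as follows. The function name, the parameter names and the default fractions (one half of the space width, one fifth of the 'M' width) are assumptions taken from the "YMMV" suggestions in this answer, not fixed rules.

```python
def is_word_gap(tj_number, space_width=None, m_width=None,
                space_fraction=0.5, m_fraction=0.2):
    """Guess whether a TJ number stands for a space between words.

    Widths are the raw values from the font's Widths / W array
    (thousandths of a text space unit), so they can be compared
    with the TJ number directly.
    """
    if tj_number >= 0:
        # Zero or positive numbers never widen the gap.
        return False
    if space_width is not None:
        threshold = space_fraction * space_width
    elif m_width is not None:
        # Fallback when the subset font has no space glyph.
        threshold = m_fraction * m_width
    else:
        raise ValueError("need the width of the space glyph or of 'M'")
    return -tj_number >= threshold


# The array from the question, assuming a space width of 278 (Helvetica).
for number in (28, -333):
    print(number, is_word_gap(number, space_width=278))
```

With these assumptions 28 is classified as kerning and -333 as a word gap. For tightly set text the fractions would have to be lowered.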

Related

PDFBox - How to convert Xheight to points

I am trying to find the x-height of a font using PDFBox.
font is of type PDFont
println(font.name + ": " + font.fontDescriptor.xHeight)
For a font size of 16pt, the output is:
TimesNewRomanPS-BoldMT: 546.0
But I cannot work out how to convert this 546.0 into points, pixels, or mm.
When you shared the PDF you took your information from, the cause became clear: The information in the font at hand simply is broken.
Details
As an example you refer to CourierNew in your example file font-list-1.pdf.
This font is used on page 2, the associated FontDescriptor is this object:
44 0 obj
<<
/StemV 42
/FontName/CourierNewPSMT
/FontStretch/Normal
/FontWeight 400
/Flags 34
/Descent -300
/FontBBox[-21 -680 638 1021]
/Ascent 832
/FontFamily(Courier New)
/CapHeight 578
/XHeight -578
/Type/FontDescriptor
/ItalicAngle 0
>>
endobj
So the font's XHeight value is -578. Which means it is rubbish in multiple ways:
It is negative. According to the specification, the XHeight value is the vertical coordinate of the top of flat nonascending lowercase letters (like the letter x), measured from the baseline (ISO 32000-1, Table 122 – Entries common to all font descriptors). A negative value, therefore, means that all those flat nonascending lowercase letters are drawn entirely below the baseline.
This obviously is nonsense for a fairly normal font like CourierNew.
When loading the font descriptor, PDFBox executes a sanity check and takes the absolute value here which is why you have not seen the negative sign.
The absolute value of XHeight equals the CapHeight value which is specified as the vertical coordinate of the top of flat capital letters, measured from the baseline (ibidem).
Ignoring the negative XHeight sign (which is nonsense, see above), therefore, the font claims that flat nonascending lowercase letters and flat capital letters reach up to the same top coordinate.
This obviously is nonsense for CourierNew.
(The XHeight values of many other fonts in your sample file are similarly broken.)
How else to get a sensible x height value
If you really need an x-height value for your fonts, you should inspect the drawing instructions for the flat nonascending lowercase letters in them and derive an x-height value from their respective heights.
(This won't always succeed because those fonts may be available as embedded subsets only, and such subsets might be void of flat nonascending lowercase letters.)
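As for the unit conversion asked about in the question: once you have an x-height value you trust, it is a glyph space value, and for fonts other than Type 3 glyph space is 1/1000 of a text space unit. A minimal Python sketch, assuming such a trustworthy value and an unscaled text matrix (the function names are illustrative, not PDFBox API):

```python
def glyph_units_to_points(value, font_size, units_per_em=1000.0):
    """Convert a glyph space value (e.g. XHeight) to points.

    Assumes the usual 1/1000 glyph space of non-Type 3 fonts and
    no additional scaling by the text matrix or the CTM.
    """
    return value / units_per_em * font_size


def points_to_mm(points):
    """One point is 1/72 inch, one inch is 25.4 mm."""
    return points * 25.4 / 72.0


x_height_pt = glyph_units_to_points(546.0, 16.0)
print(x_height_pt, points_to_mm(x_height_pt))
```

For the 546.0 reported in the question at 16pt this gives roughly 8.74pt, or about 3.08mm. A pixel value additionally depends on the resolution at which the page is rendered.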

How do I calculate word spacing in a PDF document?

For example:
20 0 0 48 20 500.0 Tm
[(H)6(ello)54(Wor)7(ld)] TJ
0 -1.1075 TD
There is no space character (code 32) in this text array.
But somehow viewers understand that 54 is a space, while 6 and 7 are character spacing (kerning). Any ideas?
The TJ operator is documented in the PDF specification, PDF 32000-1:2008, Table 109 – Text-showing operators, as follows:
Show one or more text strings, allowing individual glyph positioning. Each element of array shall be either a string or a number. If the element is a string, this operator shall show the string. If it is a number, the operator shall adjust the text position by that amount; that is, it shall translate the text matrix, Tm. The number shall be expressed in thousandths of a unit of text space [...]. This amount shall be subtracted from the current horizontal or vertical coordinate, depending on the writing mode. In the default coordinate system, a positive adjustment has the effect of moving the next glyph painted either to the left or down by the given amount. [...]

PDF extracted text seems to be unreadable

Situation: I have a PDF, version 1.6. That PDF contains several streams. The streams held Flate-compressed text, so I decompressed them. After that, I extracted the Tj parts of the decompressed streams. I assumed that there would be readable text between the brackets before the Tj command, but the result was the following:
Actual question: As I have no idea what I've got there, I would like to know what type of content it is. Furthermore: Is it possible to get a plain text out of these string or do I need further information to extract plain texts?
Further research: The PDFs which I am trying to analyze were generated by iTextSharp (which seems to be a C# library for generating PDFs). I don't know whether this is relevant, but it might be that the library uses a special way of encrypting its text data or something...
I assumed that there would be readable text between the brackets before the Tj command
This assumption only holds for simple PDFs.
To quote from the PDF specification (ISO 32000-1):
A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".
(Section 9.4.3 - Text-Showing Operators - ISO 32000-1)
Thus,
I would like to know what type of content it is.
As quoted above, these "strings" consist of single-byte or multi-byte character codes. These codes depend on the current font's encoding. Each font object in a PDF can have a different encoding.
Those encodings may be some standard encoding (MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding) or some custom encoding. In particular in case of embedded font subsets you often find encodings where 1 is the code of the first glyph drawn on a page, 2 is the code for the second, different glyph, 3 for the third, different one, etc.
Furthermore: Is it possible to get a plain text out of these string or do I need further information to extract plain texts?
As the encoding of the string arguments of text showing instructions depends on the current font, you at least need to keep track of the current font name (Tf instruction) and look up encoding information (Encoding or ToUnicode map) from the current font object.
Section 9.10 - Extraction of Text Content - of ISO 32000-1 explains this in some more detail.
Furthermore, the order of the text showing instructions need not be the order of reading. The word "Hello" can e.g. be shown by first drawing the 'o', then going left, then the 'el', then again left, then the 'H', then going right, and finally the remaining 'l'. And two words need not be separated by a space glyph, there simply might be a text positioning instruction going right a bit.
Thus, in general you also have to keep track of the position of the separate strings drawn.
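To illustrate the lookup step only, here is a small Python sketch that turns the bytes of a string operand into text using a code-to-Unicode table. The table is a hypothetical hand-written stand-in for what you would parse from the font's Encoding or ToUnicode entry, and the fixed code length is a simplification, as real CMaps may mix code lengths.

```python
def decode_codes(raw, code_to_unicode, code_length=1):
    """Map the character codes in a string operand to Unicode text.

    raw             -- the bytes between the string delimiters
    code_to_unicode -- dict of character code -> str, per font
    code_length     -- bytes per code (1 for simple fonts)
    """
    chars = []
    for i in range(0, len(raw), code_length):
        code = int.from_bytes(raw[i:i + code_length], "big")
        # U+FFFD marks codes the table does not cover.
        chars.append(code_to_unicode.get(code, "\ufffd"))
    return "".join(chars)


# Hypothetical subset font: codes assigned in order of first use.
subset_map = {1: "H", 2: "e", 3: "l", 4: "o"}
print(decode_codes(b"\x01\x02\x03\x03\x04", subset_map))
```

The same bytes decoded with the table of a different font would give different text, which is why the current font has to be tracked for every text-showing instruction.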

identify paragraphs of pdf files using itextsharp

Because of some semantic analysis work, I need to identify paragraphs in PDF files with iTextSharp. I know that the origin of the iTextSharp coordinate system is the bottom-left corner of a page. I have found three features that define paragraph boundaries:
if the horizontal coordinate of the first word in a line is less than that of the general lines;
if the leading between two consecutive lines is larger than that of the general ones;
if a line ends with "." and the horizontal coordinate of its last word is less than that of the other lines
However, I am stuck on the second one. How can I know the general leading between two lines in a paragraph? I mean there are different gaps between two consecutive lines, because some letters like 'f','g' need more space than the others like 'a','n' and so on.
Thanks for your help!
I'm assuming that you are parsing your PDF files using the parser functionality available in iTextSharp. See for instance Extract font height and rotation from PDF files with iText/iTextSharp to see how others have done this before you. A more elaborate article can be found here: Using Open Source PDF Technology to Solve the Unstructured Data Problem in Healthcare
Your question is: how can I calculate the leading? That is: how do I know the distance between the base lines of two consecutive lines?
When you parse a PDF using iTextSharp, you see each line as a series of TextRenderInfo objects. These objects allow you to get the base line of the text:
LineSegment baseline = renderInfo.GetBaseline();
Vector startpoint = baseline.GetStartPoint();
This Vector consists of different elements: Getting Coordinates of string using ITextExtractionStrategy and LocationTextExtractionStrategy in Itextsharp
You need startpoint[Vector.I2]. See also: How to detect newline from PDF using iTextSharp
The difference between that value for two consecutive lines gives you the value of the leading in its modern meaning. In the old times of printing, every character was a block of a fixed size. Printers (the people, not the machines) put a strip of lead between the rows of blocks to create some extra space between the lines. In modern computing, the word was preserved, but its meaning changed. There are no "blocks" anymore, but you could work with the font size. The font size is an average size of the glyphs in a font. Some glyphs will take more space in the height, some will take less, but taking both the leading (distance between baselines) and the font size (average height of each glyph) into account, you can get a fair idea of the "space between the lines".
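Because the baseline does not depend on whether a line contains letters like 'f' or 'g', the "general" leading can be estimated from the baseline differences alone. The following Python sketch shows the idea independently of iTextSharp; the function name, the use of the median and the factor of 1.3 are assumptions for illustration, and the input is a list of baseline y coordinates collected from top to bottom.

```python
from statistics import median


def find_paragraph_breaks(baselines_y, factor=1.3):
    """Estimate the typical leading and flag unusually large gaps.

    baselines_y -- baseline y coordinates, top line first (y grows
                   upwards, so the values decrease down the page)
    Returns (typical_leading, indices of lines that start after a
    gap larger than factor * typical_leading).
    """
    gaps = [upper - lower
            for upper, lower in zip(baselines_y, baselines_y[1:])]
    if not gaps:
        return None, []
    typical = median(gaps)
    breaks = [i + 1 for i, gap in enumerate(gaps)
              if gap > factor * typical]
    return typical, breaks


# Five lines; the fourth one follows a larger gap (20 instead of 12).
print(find_paragraph_breaks([700, 688, 676, 656, 644]))
```

The median is used so that a few large paragraph gaps do not distort the estimate of the normal line distance. On pages with mixed font sizes the estimate would have to be made per text block.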

pdf decoding, what do the bracket and numbers do?

My PDF file uses Deflate encoding; when I inflate the stream, it outputs something like this:
[(Lorem)-21( ipsum)-55( dolor)-14( sit)-55( amet,)-56( consectetur)-8( adipiscing)-14( elit.)-34( Donec)-15( faucibus)-49( lorem)-42( varius2)-56( mauris)-28( porttitor,)-34( et)-28( pellentesque)-1( )]TJ
What do the numbers and brackets mean?
It does not seem to be a character count or spacing.
Does anyone know?
That is an array for showing text (Stuff in brackets denote array objects []), it should be followed by the TJ operator. The number is used to translate the text matrix (adjust the positioning of the text). Assuming horizontal text, a negative number moves the next glyph to the right.
From 9.4.3 Text-Showing Operators (Please see the spec for more details)
Show one or more text strings, allowing individual glyph positioning.
Each element of array shall be either a string or a number. If the
element is a string, this operator shall show the string. If it is a
number, the operator shall adjust the text position by that amount;
that is, it shall translate the text matrix, Tm. The number shall be
expressed in thousandths of a unit of text space (see 9.4.4, "Text
Space Details"). This amount shall be subtracted from the current
horizontal or vertical coordinate, depending on the writing mode. In
the default coordinate system, a positive adjustment has the effect of
moving the next glyph painted either to the left or down by the given
amount.
The parentheses denote string objects:
String objects shall be written in one of the following two ways:
As a sequence of literal characters enclosed in parentheses ( ) (using
LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS (29h)); see 7.3.4.2,
"Literal Strings."
...
A literal string shall be written as an arbitrary number of characters
enclosed in parentheses. Any characters may appear in a string except
unbalanced parentheses (LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS
(29h)) and the backslash (REVERSE SOLIDUS (5Ch)), which shall be
treated specially as described in this sub-clause. Balanced pairs of
parentheses within a string require no special treatment.
I suggest getting the PDF Spec and reading it to find out more info.
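As a rough illustration of the syntax quoted above, the following Python sketch splits a TJ array into its strings and numbers. It is a simplification and not a full PDF tokenizer: it handles literal strings with balanced parentheses and keeps the character after a backslash as-is, but it does not interpret escapes such as \n or octal codes, and it rejects hexadecimal strings. The function name is an assumption.

```python
def parse_tj_array(source):
    """Return the elements of a TJ array as str and float items."""
    items = []
    i = source.index("[") + 1
    while i < len(source):
        ch = source[i]
        if ch == "]":
            break
        if ch == "(":
            depth, i, chars = 1, i + 1, []
            while i < len(source):
                ch = source[i]
                if ch == "\\":
                    # Keep the escaped character without decoding it.
                    i += 1
                    chars.append(source[i])
                elif ch == "(":
                    depth += 1
                    chars.append(ch)
                elif ch == ")":
                    depth -= 1
                    if depth == 0:
                        break
                    chars.append(ch)
                else:
                    chars.append(ch)
                i += 1
            items.append("".join(chars))
            i += 1  # skip the closing parenthesis
        elif ch == "<":
            raise ValueError("hexadecimal strings are not handled here")
        elif ch in "+-.0123456789":
            j = i
            while j < len(source) and source[j] in "+-.0123456789":
                j += 1
            items.append(float(source[i:j]))
            i = j
        else:
            i += 1  # whitespace between elements
    return items


print(parse_tj_array("[(Lorem)-21( ipsum)-55( dolor)]TJ"))
```

Strings come back as text and numbers as floats, so the negative adjustments between the words can be inspected separately from the shown strings.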