How the value of the tj operator is generated in a pdf document (justified text) - pdf

I can't understand and find how the value of the tj operator is generated??
Here I paste result before and after changes in the display of the text (on the second block I changed the position Left-Justice and then again comeback to Centered)
I think pdf use some of prng, but what kind of, I can't find
HElp please
[(\003\024\027\005\003\030\036\b)-114.267(\003\006\007\024\036\b)-113.297(\026\002\024\003\032\020\b)-113.337(\b)-111.574(#\024\002\f\005\002\021\003\007\004\f\005\b)-117.089(\003\006\002\003\b)-114.08
[(\003\024\027\005\003\030\036\b)-114.366(\003\006\007\024\036\b)-113.297(\026\002\024\003\032\020\b)-113.327(\b)-111.693(#\024\002\f\005\002\021\003\007\004\f\005\b)-116.98(\003\006\002\003\b)-114.188

First of all, the PDF format does not explicitly support text justification. PDF does not even know text column definitions to justify text in!
All the PDF format supports is
setting or changing the text matrix (and text line matrix), scaling, character and word spacing explicitly and
drawing text pieces which implicitly changes the text matrix.
Thus, if a PDF processor changes the justification of a line of text, it actually first has to have determined
which text pieces belong together and form that line of text;
text pieces can be given as arguments of the Tj or TJ instructions (or more seldom the " or ' instructions); in simple cases the whole line is drawn using a single instruction but you cannot count on that in general; and
what the left and right borders of the text column are to justify between;
e.g. these borders might be standard values assumed by the processor for certain page formats or derived from the current clip path.
Having determined these data, the procedure differs for different kinds of justification:
left justification - position the text matrix at the left text column border at the height of the line and simple let the text drawing instructions follow;
right justification - calculate the width of the drawn line using the current font, position the text matrix at the right text column border minus that width at the height of the line, and let the text drawing instructions follow;
center justification - calculate the width of the drawn line using the current font, position the text matrix at the middle of the text column minus half that width at the height of the line, and let the text drawing instructions follow;
full justification - calculate the width of the drawn line using the current font, set the character spacing and word spacing (using the Tc and Tw instructions, probably with a tweak of the Tz horizontal scaling) to use up the difference between that width and the text column width, position the text matrix at the left text column border at the height of the line, and let the text drawing instructions follow;
or calculate the width of the drawn line using the current font, change the text drawing instructions to use up the difference between that width and the text column width (e.g. using the numeric TJ array argument values), position the text matrix at the left text column border at the height of the line, and let the changed text drawing instructions follow;
or even apply a combination of these methods.
(The changes applied when doing a full justification - character spacing, word spacing, changes of text drawing instructions - obviously additionally are undone when later again changing to another type of justification...)
Positioning the text matrix can happen using the Tm, Td, TD, and T* instructions.
By the way, the positioning and scaling of the text also is influenced by the current transformation matrix. Thus, cm instructions can also be used for justification. But this is less likely than the use of the instructions mentioned above...
Unfortunately you merely supplied an excerpt from the array argument of a TJ instruction before and after such a justification job. One sees that the numeric elements of that array change very slightly. Whether this actually is the justification itself (as per the second option of the full justification above) or merely some computational inaccuracy cannot be told without the context.

Related

PDFBOX vertical text position with RotationMagic

I am using PDFBOX to extract the word coordinates. The PDF has some horizontal and vertical text mixed in a page, Here is the PDF, in Page 9, it has vertical text.
If I use PDFBOX -RotationMagic option, I can extract the vertical text layout correctly. However, its text Position is not correct inside PDFTextStripper class function:WriteString().
How can I convert to original coordinates (e.g., origin is at top left).
In page 9, the "1.Introduction" is vertical text and after adding "-optionMagic", the coordinates inside writeString() is like this:
If I do not apply '-rotationMagic', the textPosition is like this. it splits "1:Introduciton" to multiple words. The first one is "1:"
How can I convert the first (rotated) coordinates to the second one. Why is the first one is (446.73, 1029.6)?

Lost some text when extracting pdf

I've tried to get all the text on the page by using iText, but I have no idea why every coordinate text loses the last two character.
PdfDocument pdfDoc = new PdfDocument(new PdfReader(#"E:\Coding\COOR.pdf"));
LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
parser.ProcessPageContent(pdfDoc.GetFirstPage());
Console.Write(strategy.GetResultantText());
pdfDoc.Close();
Console.WriteLine("Great!");
Console.ReadKey();
You can also download my code from
https://1drv.ms/u/s!Al1hUSZtR4OjwU3XVBRQGneVaZlS
In short
The reason for that "lost text" is that the missing "text" isn't there to start with!
In detail
The contents of you PDF file are constructed in a misleading manner.
On the one hand there are very many path definitions which then are stroked (drawn). These drawings create what you can see in a viewer, both text and table lines.
On the other hand there are a few text drawing instructions to draw text using text rendering mode 3 which is... invisible! These drawings create the text you can copy&paste in a viewer or extract using iText.
Unfortunately the text in the text drawing instructions and the text drawn using paths does not match completely. The text you retrieve via copy&paste or text extraction, therefore, differs from your expectations.
Also the glyph sizes and positions are not exactly the same
To illustrate this I made the text drawing instructions use the normal (fill) text rendering mode. The top left corner which originally looks like this:
with that change looks like this:
As you see the formerly invisible text is only approximately at the same position as the visible drawings, and it is somewhat broken: The symbol for degrees is weirdly represented as "¡ã", and the longitude fractional seconds and the following symbol for seconds are missing.
To correctly extract the originally visible data, you'll need to use OCR instead of text extraction.

How to make justify formating for paragraph in pdfbox Example given below, any suggestion?

Problem :- Not able to apply justify formatting for paragraph in pdfbox.
I tried :- I included space in between words so that it justifies the paragraph. But this does not work for a variable width font. For that I not able to identify pixel position for each character since the width in pixels of each character is different.
Please give suggestions....
I want to make this type of alignment in pdfbox as given below :-

Wrong width calculation for words in PDF

I have a PDF, which has WinAnsiEncoding(Times) and it do not have any toUnicode or differences encoding.
When I try to select or highlight text, width of text is not calculated properly.
I have all the widths but somehow width is calculated incorrectly. What can be the problem? How can I get correct CID of character in order to find correct width of the glyph? I have attached screenshots for the same. "EXPLANATIONS" word is not selected completely. This can happen only if the widths of glyph are calculated incorrectly.

How to cut out numbers from an image dynamically?

i've got to this stage:
where i can find the numbers in the above image but i need to cut them out so i can retain the order etc. but the as the number increases the spacing changes and the position of the number?
so i think it should be a find a white PX the continue until it find a solid black col and then use the points to do a simple cut any help would be great.
A simple solution would be this:
Find the first upmost horizontal line which contains white pixels
From that line find the first horizontal line which contains only black pixels
Those two lines are your upper and lower borders.
Between this borders proceed like this:
Find the first most left vertical line which contains white pixels
From that line find the last vertical line which contains only black pixels and which comes directly after a line with white pixels.
Those two lines are your left and right borders.
The steps to separate single numbers can be performed analogously.
If you need to identify which numbers are in your picture, I recommend using specialized computer vision libraries.
Some VB.net pseudo code to get you going:
Sub FindTopBorder(image As MyImage) As Integer
For y = 0 to image.Height - 1
For x = 0 to image.Width - 1
Dim pixel = image.GetPixel(x, y)
If ('Check if pixel is white here with RGB or Color') Then
Return y
End If
Next
Next
' Just in case there are no white pixels or use an exception instead
Return -1
End Sub
I would start looking into Connected component segmentation. You find a pixel which is within a character (number). Then run the connected component algorithm which finds all connected pixels under specific set of rules (e.g. slight deviation in color, stop at hard borders etc).
http://en.wikipedia.org/wiki/Connected-component_labeling
If you can use libraries, I'm sure OpenCV or similar libraries support this out of the box.
//edit
I see you need VB.net. Probably it is easiest to port some algorithm to VB or create one yourself.
See e.g. http://www.codeproject.com/Articles/336915/Connected-Component-Labeling-Algorithm
What to expect
Input
An image containing two shapes:
Output
Now each is separated into single images.