How to apply justified formatting to a paragraph in PDFBox (example given below)? Any suggestions? - pdfbox

Problem: I am not able to apply justified formatting to a paragraph in PDFBox.
What I tried: I inserted extra spaces between words so that the paragraph appears justified. But this does not work for a variable-width font. For such a font I am not able to determine the pixel position of each character, since each character has a different width in pixels.
Please give suggestions.
I want to produce this type of alignment in PDFBox, as shown below:

Related

valign images in xlsxwriter cell

Is there any hack for vertically aligning images in a cell?
I am trying to create a dashboard; those traffic lights are images.
Since one of the columns is text-wrapped and the heights of those rows
are dynamic, I have no way of knowing the row height to calculate the y_offset for those images.
Does anyone have a recommendation on how I can handle this? Is there a way of getting the row_height after sheet.write and the text_wrap format are applied?
"Is there a way of getting the row_height after sheet.write and text_wrap format is applied?"
Probably not without access to Windows APIs for calculating bounding boxes for strings.
You could probably make some working estimates based on the length of your string. Each new line in text wrap is equal to 15 character units or 20 pixels.
"Since one of the columns is text-wrapped and the heights of those rows are dynamic, I have no way of knowing the row height to calculate the y_offset for those images."
This is the main problem. In order to specify the image position exactly you will need to specify explicit row heights so that XlsxWriter can calculate where the image will go based on the size of the cell. In other words, you will have to avoid the automatic row height that Excel gives you when wrapping text.
Once the row height is fixed you can position images exactly where you want them using the 'x_offset' and 'y_offset' options.
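The estimate and the fixed-height approach described above can be combined in a rough sketch. It assumes the 15 units / 20 pixels per wrapped line mentioned above; the function names, the `chars_per_line` value, and the 16-pixel image height are hypothetical, and the line count is only an approximation because it ignores proportional glyph widths and wrapping at word boundaries.

```python
import math

LINE_HEIGHT_UNITS = 15   # row height units per wrapped line (value passed to set_row)
LINE_HEIGHT_PIXELS = 20  # the same line height expressed in pixels


def estimate_wrapped_lines(text, chars_per_line):
    """Rough count of lines the text occupies when wrapped in a cell.

    chars_per_line is a guess that depends on the column width and font.
    """
    lines = 0
    for paragraph in text.split("\n"):
        lines += max(1, math.ceil(len(paragraph) / chars_per_line))
    return lines


def centered_y_offset(row_height_pixels, image_height_pixels):
    """Pixel offset from the top of the row that centers the image vertically."""
    return max(0, (row_height_pixels - image_height_pixels) // 2)


text = "x" * 95                                        # placeholder cell text
lines = estimate_wrapped_lines(text, chars_per_line=30)
row_height_units = lines * LINE_HEIGHT_UNITS           # explicit height for the row
row_height_pixels = lines * LINE_HEIGHT_PIXELS
y_offset = centered_y_offset(row_height_pixels, image_height_pixels=16)

# With XlsxWriter these values would then be used roughly like this
# (row, col, and the file name are placeholders):
#   worksheet.set_row(row, row_height_units)
#   worksheet.insert_image(row, col, "light.png", {"y_offset": y_offset})
```

Because the row height is set explicitly, the offset is computed from a known value instead of from whatever height Excel would pick at render time.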
Note, you can also use conditional formatting to create traffic lights based on cell values. See Sheet9/Example 9 of this code from the XlsxWriter docs and the image below. These can be centered automatically even with text wrapping.

PDFBOX vertical text position with RotationMagic

I am using PDFBox to extract word coordinates. The PDF has horizontal and vertical text mixed on a page. Here is the PDF; page 9 has vertical text.
If I use the PDFBox -rotationMagic option, I can extract the vertical text layout correctly. However, the text position is not correct inside the writeString() method of the PDFTextStripper class.
How can I convert to the original coordinates (e.g., with the origin at the top left)?
On page 9, "1.Introduction" is vertical text, and after adding "-rotationMagic" the coordinates inside writeString() look like this:
If I do not apply "-rotationMagic", the text position looks like this. It splits "1:Introduction" into multiple words. The first one is "1:"
How can I convert the first (rotated) coordinates to the second ones? Why is the first one (446.73, 1029.6)?

Lost some text when extracting pdf

I've tried to get all the text on the page by using iText, but I have no idea why every coordinate text loses its last two characters.
PdfDocument pdfDoc = new PdfDocument(new PdfReader(@"E:\Coding\COOR.pdf"));
LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
parser.ProcessPageContent(pdfDoc.GetFirstPage());
Console.Write(strategy.GetResultantText());
pdfDoc.Close();
Console.WriteLine("Great!");
Console.ReadKey();
You can also download my code from
https://1drv.ms/u/s!Al1hUSZtR4OjwU3XVBRQGneVaZlS
In short
The reason for that "lost text" is that the missing "text" isn't there to start with!
In detail
The contents of your PDF file are constructed in a misleading manner.
On the one hand there are very many path definitions which then are stroked (drawn). These drawings create what you can see in a viewer, both text and table lines.
On the other hand there are a few text drawing instructions to draw text using text rendering mode 3 which is... invisible! These drawings create the text you can copy&paste in a viewer or extract using iText.
Unfortunately the text in the text drawing instructions and the text drawn using paths does not match completely. The text you retrieve via copy&paste or text extraction, therefore, differs from your expectations.
Also, the glyph sizes and positions are not exactly the same.
To illustrate this I made the text drawing instructions use the normal (fill) text rendering mode. The top left corner which originally looks like this:
with that change looks like this:
As you see the formerly invisible text is only approximately at the same position as the visible drawings, and it is somewhat broken: The symbol for degrees is weirdly represented as "¡ã", and the longitude fractional seconds and the following symbol for seconds are missing.
To correctly extract the originally visible data, you'll need to use OCR instead of text extraction.

How the value of the TJ operator is generated in a PDF document (justified text)

I can't understand or find out how the value of the TJ operator is generated.
Here I paste the result before and after changes in the display of the text (in the second block I changed the alignment to left-justified and then changed it back to centered).
I think PDF uses some kind of PRNG, but I can't find out which kind.
Help, please.
[(\003\024\027\005\003\030\036\b)-114.267(\003\006\007\024\036\b)-113.297(\026\002\024\003\032\020\b)-113.337(\b)-111.574(#\024\002\f\005\002\021\003\007\004\f\005\b)-117.089(\003\006\002\003\b)-114.08
[(\003\024\027\005\003\030\036\b)-114.366(\003\006\007\024\036\b)-113.297(\026\002\024\003\032\020\b)-113.327(\b)-111.693(#\024\002\f\005\002\021\003\007\004\f\005\b)-116.98(\003\006\002\003\b)-114.188
First of all, the PDF format does not explicitly support text justification. PDF does not even have a notion of text columns within which to justify text!
All the PDF format supports is
setting or changing the text matrix (and text line matrix), scaling, character and word spacing explicitly and
drawing text pieces which implicitly changes the text matrix.
Thus, if a PDF processor changes the justification of a line of text, it first has to determine
which text pieces belong together and form that line of text;
text pieces can be given as arguments of the Tj or TJ instructions (or, less commonly, the " or ' instructions); in simple cases the whole line is drawn using a single instruction, but you cannot count on that in general; and
what the left and right borders of the text column are to justify between;
e.g. these borders might be standard values assumed by the processor for certain page formats or derived from the current clip path.
Having determined these data, the procedure differs for different kinds of justification:
left justification - position the text matrix at the left text column border at the height of the line and simply let the text drawing instructions follow;
right justification - calculate the width of the drawn line using the current font, position the text matrix at the right text column border minus that width at the height of the line, and let the text drawing instructions follow;
center justification - calculate the width of the drawn line using the current font, position the text matrix at the middle of the text column minus half that width at the height of the line, and let the text drawing instructions follow;
full justification - calculate the width of the drawn line using the current font, set the character spacing and word spacing (using the Tc and Tw instructions, probably with a tweak of the Tz horizontal scaling) to use up the difference between that width and the text column width, position the text matrix at the left text column border at the height of the line, and let the text drawing instructions follow;
or calculate the width of the drawn line using the current font, change the text drawing instructions to use up the difference between that width and the text column width (e.g. using the numeric TJ array argument values), position the text matrix at the left text column border at the height of the line, and let the changed text drawing instructions follow;
or even apply a combination of these methods.
(The changes applied when doing a full justification - character spacing, word spacing, changes of text drawing instructions - obviously also have to be undone when later changing to another type of justification...)
Positioning the text matrix can happen using the Tm, Td, TD, and T* instructions.
By the way, the positioning and scaling of the text also is influenced by the current transformation matrix. Thus, cm instructions can also be used for justification. But this is less likely than the use of the instructions mentioned above...
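The arithmetic behind the justification variants listed above can be sketched as follows. This is a minimal illustration, not how any particular PDF processor does it: the function names and sample glyph widths are hypothetical, glyph widths are taken in thousandths of a text space unit as in PDF font metrics, horizontal scaling is assumed to be 100%, the text matrix is assumed to be unscaled, and full justification is done with word spacing (Tw) only, which affects just the single-byte character code 32.

```python
def line_width(glyph_widths, font_size):
    """Width of a line, given glyph widths in 1/1000 text space units."""
    return sum(glyph_widths) / 1000.0 * font_size


def start_x(alignment, left, right, width):
    """X position of the text matrix for left, right, or center justification."""
    if alignment == "left":
        return left
    if alignment == "right":
        return right - width
    if alignment == "center":
        return (left + right) / 2 - width / 2
    raise ValueError("unknown alignment: " + alignment)


def word_spacing_for_full_justification(left, right, width, space_count):
    """Tw value that spreads the leftover column width over the spaces."""
    if space_count == 0:
        return 0.0  # nothing to stretch, fall back to left justification
    return (right - left - width) / space_count


# Hypothetical line "ab cd ef": six glyphs of 500 units, two spaces of 250 units.
widths = [500, 500, 250, 500, 500, 250, 500, 500]
width = line_width(widths, font_size=10)

left, right = 100.0, 200.0          # assumed text column borders
x_left = start_x("left", left, right, width)
x_right = start_x("right", left, right, width)
x_center = start_x("center", left, right, width)
tw = word_spacing_for_full_justification(left, right, width, space_count=2)
```

For the fully justified line, a processor would position the text matrix at `left`, emit `tw` as the operand of a Tw instruction, and then draw the text unchanged.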
Unfortunately you merely supplied an excerpt from the array argument of a TJ instruction before and after such a justification job. One sees that the numeric elements of that array change very slightly. Whether this actually is the justification itself (as per the second option of the full justification above) or merely some computational inaccuracy cannot be told without the context.
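To put the size of those changes in perspective: a number in a TJ array is expressed in thousandths of a text space unit and is subtracted from the current horizontal position, scaled by the font size and horizontal scaling. The sketch below applies this to the first five numeric elements of the two excerpts (the last element is left out because the excerpts are cut off there). The font size of 12 is purely an assumption, since the excerpt does not show the Tf instruction, and the function name is hypothetical.

```python
def tj_displacement(tj_number, font_size, horizontal_scaling=100.0):
    """Horizontal shift caused by one numeric element of a TJ array.

    Negative numbers move the following text to the right.
    """
    return -tj_number / 1000.0 * font_size * (horizontal_scaling / 100.0)


before = [-114.267, -113.297, -113.337, -111.574, -117.089]
after = [-114.366, -113.297, -113.327, -111.693, -116.98]

FONT_SIZE = 12.0  # assumed; not visible in the excerpt

# Change in each gap between the two versions, in text space units.
changes = [
    tj_displacement(a, FONT_SIZE) - tj_displacement(b, FONT_SIZE)
    for b, a in zip(before, after)
]
largest_change = max(abs(c) for c in changes)

for b, a, c in zip(before, after, changes):
    print(f"{b:>9} -> {a:>9}: gap changes by {c:+.6f} units")
```

Under that assumption every gap changes by less than two thousandths of a unit, which fits the observation that the numbers change only very slightly.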

Wrong width calculation for words in PDF

I have a PDF which uses WinAnsiEncoding (Times), and it does not have any ToUnicode map or Differences array in its encoding.
When I try to select or highlight text, the width of the text is not calculated properly.
I have all the widths, but somehow the width is calculated incorrectly. What could the problem be? How can I get the correct CID of a character in order to find the correct width of its glyph? I have attached screenshots. The word "EXPLANATIONS" is not selected completely. This can happen only if the glyph widths are calculated incorrectly.