I am using PDFBox to extract word coordinates. The PDF mixes horizontal and vertical text on the same page. Here is the PDF; page 9 has vertical text.
If I use the PDFBox -rotationMagic option, the vertical text layout is extracted correctly. However, the TextPosition values are not correct inside the writeString() method of the PDFTextStripper class.
How can I convert them back to the original coordinates (e.g., with the origin at the top left)?
On page 9, "1.Introduction" is vertical text, and after adding -rotationMagic, the coordinates inside writeString() look like this:
If I do not apply -rotationMagic, the TextPosition looks like this. It splits "1:Introduction" into multiple words. The first one is "1:"
How can I convert the first (rotated) coordinates to the second ones? Why is the first one (446.73, 1029.6)?
I've tried to get all the text on the page by using iText, but I have no idea why every piece of coordinate text loses its last two characters.
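// Namespaces this snippet needs (iText 7 for .NET)
using System;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
PdfDocument pdfDoc = new PdfDocument(new PdfReader(@"E:\Coding\COOR.pdf"));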
LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
parser.ProcessPageContent(pdfDoc.GetFirstPage());
Console.Write(strategy.GetResultantText());
pdfDoc.Close();
Console.WriteLine("Great!");
Console.ReadKey();
You can also download my code from
https://1drv.ms/u/s!Al1hUSZtR4OjwU3XVBRQGneVaZlS
In short
The reason for that "lost text" is that the missing "text" isn't there to start with!
In detail
The contents of your PDF file are constructed in a misleading manner.
On the one hand there are very many path definitions which then are stroked (drawn). These drawings create what you can see in a viewer, both text and table lines.
On the other hand there are a few text drawing instructions to draw text using text rendering mode 3 which is... invisible! These drawings create the text you can copy&paste in a viewer or extract using iText.
Unfortunately the text in the text drawing instructions and the text drawn using paths does not match completely. The text you retrieve via copy&paste or text extraction, therefore, differs from your expectations.
Also, the glyph sizes and positions are not exactly the same.
To illustrate this I made the text drawing instructions use the normal (fill) text rendering mode. The top left corner which originally looks like this:
with that change looks like this:
As you see the formerly invisible text is only approximately at the same position as the visible drawings, and it is somewhat broken: The symbol for degrees is weirdly represented as "¡ã", and the longitude fractional seconds and the following symbol for seconds are missing.
To correctly extract the originally visible data, you'll need to use OCR instead of text extraction.
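The rendering-mode change described above can be approximated with a naive rewrite of the page's decoded content stream. This is only a sketch in Python, not the tool actually used for the comparison: the function names are made up for illustration, it assumes the stream has already been decompressed, and it finds the operator with a regular expression instead of a real content stream parser, so a "3 Tr" sequence inside a string literal would be changed as well.

```python
import re

# The operand 3 followed by the Tr operator ("set text rendering mode").
# Mode 3 means "neither fill nor stroke", i.e. invisible text.
# The lookbehind keeps operands such as 13 or 0.3 from matching.
_INVISIBLE_TR = re.compile(rb"(?<![\d.+-])3(\s+)Tr(?![A-Za-z*])")


def count_invisible_modes(content: bytes) -> int:
    """Count how often the stream switches to invisible text (3 Tr)."""
    return len(_INVISIBLE_TR.findall(content))


def make_text_visible(content: bytes) -> bytes:
    """Replace every '3 Tr' with '0 Tr' (mode 0 is normal fill)."""
    return _INVISIBLE_TR.sub(rb"0\g<1>Tr", content)


if __name__ == "__main__":
    # Hypothetical content stream fragment, not taken from the PDF in question.
    sample = b"BT /F1 12 Tf 3 Tr 72 700 Td (hidden) Tj ET"
    print(count_invisible_modes(sample))
    print(make_text_visible(sample).decode("latin-1"))
```

A non-zero count for a page whose visible text is drawn with paths is a hint that text extraction will return the hidden layer, not what the viewer shows.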
I can't figure out how the numeric values in the argument of the TJ operator are generated.
Here I paste the result before and after changing the display of the text (for the second block I changed the alignment to left-justified and then changed it back to centered).
I think the PDF uses some kind of PRNG, but I can't find out which kind.
Help please
[(\003\024\027\005\003\030\036\b)-114.267(\003\006\007\024\036\b)-113.297(\026\002\024\003\032\020\b)-113.337(\b)-111.574(#\024\002\f\005\002\021\003\007\004\f\005\b)-117.089(\003\006\002\003\b)-114.08
[(\003\024\027\005\003\030\036\b)-114.366(\003\006\007\024\036\b)-113.297(\026\002\024\003\032\020\b)-113.327(\b)-111.693(#\024\002\f\005\002\021\003\007\004\f\005\b)-116.98(\003\006\002\003\b)-114.188
First of all, the PDF format does not explicitly support text justification. PDF does not even know text column definitions to justify text in!
All the PDF format supports is
setting or changing the text matrix (and text line matrix), scaling, character and word spacing explicitly and
drawing text pieces which implicitly changes the text matrix.
Thus, if a PDF processor changes the justification of a line of text, it actually first has to have determined
which text pieces belong together and form that line of text;
text pieces can be given as arguments of the Tj or TJ instructions (or, more seldom, of the ' or " instructions); in simple cases the whole line is drawn using a single instruction but you cannot count on that in general; and
what the left and right borders of the text column are to justify between;
e.g. these borders might be standard values assumed by the processor for certain page formats or derived from the current clip path.
Having determined these data, the procedure differs for different kinds of justification:
left justification - position the text matrix at the left text column border at the height of the line and simply let the text drawing instructions follow;
right justification - calculate the width of the drawn line using the current font, position the text matrix at the right text column border minus that width at the height of the line, and let the text drawing instructions follow;
center justification - calculate the width of the drawn line using the current font, position the text matrix at the middle of the text column minus half that width at the height of the line, and let the text drawing instructions follow;
full justification - calculate the width of the drawn line using the current font, set the character spacing and word spacing (using the Tc and Tw instructions, probably with a tweak of the Tz horizontal scaling) to use up the difference between that width and the text column width, position the text matrix at the left text column border at the height of the line, and let the text drawing instructions follow;
or calculate the width of the drawn line using the current font, change the text drawing instructions to use up the difference between that width and the text column width (e.g. using the numeric TJ array argument values), position the text matrix at the left text column border at the height of the line, and let the changed text drawing instructions follow;
or even apply a combination of these methods.
(The changes applied when doing a full justification - character spacing, word spacing, changes of text drawing instructions - obviously have to be undone as well when later changing to another type of justification...)
Positioning the text matrix can happen using the Tm, Td, TD, and T* instructions.
By the way, the positioning and scaling of the text also is influenced by the current transformation matrix. Thus, cm instructions can also be used for justification. But this is less likely than the use of the instructions mentioned above...
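The arithmetic behind these justification variants can be sketched as follows. This is a minimal illustration in Python, not what any particular PDF processor does: all function names are made up, glyph widths are assumed to be given in the usual 1/1000 units, and it assumes 100% horizontal scaling, zero character spacing, and a simple font in which the space is the single-byte code 32 (word spacing set with Tw only affects that code).

```python
def text_width(text, glyph_widths, font_size):
    """Width of the drawn text in text space units (widths are in 1/1000 units)."""
    return sum(glyph_widths[ch] for ch in text) * font_size / 1000.0


def line_start_x(mode, left, right, line_width):
    """x position for the text matrix for left, right or center justification."""
    if mode == "left":
        return left
    if mode == "right":
        return right - line_width
    if mode == "center":
        return (left + right) / 2.0 - line_width / 2.0
    raise ValueError(f"unknown justification mode: {mode}")


def word_spacing_for_full(left, right, line_width, space_count):
    """Tw operand that spreads the unused column width over the spaces."""
    if space_count == 0:
        return 0.0  # nothing to stretch; the line stays left justified
    return (right - left - line_width) / space_count


def tj_adjustment(extra_gap, font_size):
    """Number to put into a TJ array to widen one gap by extra_gap units.

    TJ numbers are in thousandths of a text space unit and are subtracted
    from the current position, hence the negative sign.
    """
    return -extra_gap * 1000.0 / font_size


if __name__ == "__main__":
    widths = {"a": 500, " ": 250, "b": 500}   # hypothetical font metrics
    width = text_width("a b", widths, 12)     # 15.0
    print(line_start_x("center", 50, 250, width))
    print(word_spacing_for_full(50, 250, width, 1))
```

The first three cases only move the starting point of the line, while full justification changes the spacing inside the line, either through Tw or through the numbers in the TJ array.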
Unfortunately you merely supplied an excerpt from the array argument of a TJ instruction before and after such a justification job. One sees that the numeric elements of that array change very slightly. Whether this actually is the justification itself (as per the second option of the full justification above) or merely some computational inaccuracy cannot be told without the context.
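To see how small these changes are, the numeric elements of the two excerpts can be compared directly. The following Python sketch does that under some assumptions: the helper names are invented, the string elements are stripped with a simple regular expression (good enough for these excerpts, not a general PDF string parser), and the font size passed to shift_in_text_space is a placeholder because the excerpts do not show it.

```python
import re

# The two excerpts from the question (both are truncated in the original post).
BEFORE = (
    r"[(\003\024\027\005\003\030\036\b)-114.267(\003\006\007\024\036\b)-113.297"
    r"(\026\002\024\003\032\020\b)-113.337(\b)-111.574"
    r"(#\024\002\f\005\002\021\003\007\004\f\005\b)-117.089(\003\006\002\003\b)-114.08"
)
AFTER = (
    r"[(\003\024\027\005\003\030\036\b)-114.366(\003\006\007\024\036\b)-113.297"
    r"(\026\002\024\003\032\020\b)-113.327(\b)-111.693"
    r"(#\024\002\f\005\002\021\003\007\004\f\005\b)-116.98(\003\006\002\003\b)-114.188"
)

STRING_LITERAL = re.compile(r"\((?:\\.|[^\\()])*\)")  # (...) with backslash escapes
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def tj_numbers(array_source):
    """Return only the numeric elements of a TJ array, in order."""
    without_strings = STRING_LITERAL.sub(" ", array_source)
    return [float(token) for token in NUMBER.findall(without_strings)]


def tj_deltas(before, after):
    """Element-wise change of the TJ numbers, in thousandths of a unit."""
    return [round(b - a, 3) for a, b in zip(tj_numbers(before), tj_numbers(after))]


def shift_in_text_space(delta, font_size):
    """Change in glyph displacement caused by a change in a TJ number."""
    return -delta / 1000.0 * font_size


if __name__ == "__main__":
    for delta in tj_deltas(BEFORE, AFTER):
        # font size 12 is an assumption; the excerpts do not include the Tf operator
        print(delta, shift_in_text_space(delta, 12))
```

The largest difference between the two excerpts is 0.119 thousandths of a unit, which at the assumed font size of 12 amounts to roughly 0.0014 units.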
I would like to draw 2 texts onto my PDF.
The first text should be aligned to the top left corner.
This works fine.
I'm using:
canvas = stamper.GetOverContent(i)
watermarkFont = iTextSharp.text.pdf.BaseFont.CreateFont(iTextSharp.text.pdf.BaseFont.HELVETICA, iTextSharp.text.pdf.BaseFont.CP1252, iTextSharp.text.pdf.BaseFont.NOT_EMBEDDED)
watermarkFontColor = iTextSharp.text.BaseColor.RED
canvas.MoveTo(0, 0) 'I think the canvas is the space that we draw onto. My documents always start at position X=0 and Y=0, so move to 0,0 should be fine
canvas.BeginText()
canvas.SetFontAndSize(watermarkFont, 12)
canvas.SetColorFill(watermarkFontColor)
canvas.ShowTextAligned(Element.ALIGN_TOP, uText, 0, 830, 0) 'is 830 the width of the available space?
canvas.EndText()
Now I would like to draw another text approximately 100 pixels below the first text.
I'm using:
canvas.MoveTo(0, 100) 'let's draw the second text at X=100, Y=100
canvas.BeginText()
canvas.SetFontAndSize(watermarkFont, 12)
canvas.SetColorFill(watermarkFontColor)
canvas.ShowTextAligned(Element.ALIGN_CENTER, uBewirtung, 0, 830, 0)
canvas.EndText()
The second text however doesn't show up at all.
I suspect I'm drawing outside the document, but I don't see my mistake.
The MoveTo() method is meant for drawing paths (lines and shapes in graphics state), not text (in text state). It adds an m operator to the content stream. If you are a PDF specialist, you should use the SetTextMatrix() method inside your BT/ET text block; see "What does setTextMatrix of contentByte class in iText do?".
Note the if; it is important. If you are not a PDF specialist, you shouldn't be toying around with those methods. You should use ColumnText.ShowTextAligned() instead of BeginText(), EndText() and all of the lines you added in-between. Those methods are meant for people who speak PDF syntax.
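As for the question in the code comment: 830 is not a width. In ShowTextAligned() the two numbers after the text are the x and y position of the text, and with the default PDF coordinate system y is measured in points upwards from the bottom of the page. Both calls in the question pass the same position (0, 830), so moving the second text down means passing a smaller y. A minimal sketch of that arithmetic in Python, assuming an unrotated A4 page (842 points high) whose lower-left corner is the origin; the helper name is made up, and the real page height should be read from the document:

```python
A4_HEIGHT_PT = 842.0  # assumption: unrotated A4 page, origin at the lower-left corner


def y_from_top(offset_from_top, page_height=A4_HEIGHT_PT):
    """Convert a distance measured down from the top edge into a PDF y coordinate."""
    return page_height - offset_from_top


first_y = 830.0             # y used for the first text in the question
second_y = first_y - 100.0  # 100 points further down the page

if __name__ == "__main__":
    print(y_from_top(12.0))  # distance of the first text from the top edge
    print(second_y)
```

Under these assumptions the first text sits 12 points below the top edge, and a text 100 points lower needs y = 730, not a MoveTo() call.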
Problem: I am not able to apply justified formatting to a paragraph in PDFBox.
What I tried: I inserted extra spaces between words so that the paragraph appears justified. But this does not work for a variable-width font, because I cannot determine the pixel position of each character when every character has a different width in pixels.
Please give suggestions.
I want to achieve this type of alignment in PDFBox, as shown below:
I have a PDF which uses a Times font with WinAnsiEncoding, and it does not have any ToUnicode map or Differences encoding.
When I try to select or highlight text, width of text is not calculated properly.
I have all the widths, but somehow the width is calculated incorrectly. What can be the problem? How can I get the correct CID of a character in order to find the correct width of the glyph? I have attached screenshots of the problem. The word "EXPLANATIONS" is not selected completely. This can happen only if the widths of the glyphs are calculated incorrectly.