Text position returned by PDFBox

I am using PDFBox to get the position of a single character via TextPosition.getX() and TextPosition.getY().
But I am not sure whether the returned coordinate is the upper-left or the bottom-left corner of the character.
What I want is the upper-left and the bottom-right coordinates of the character.
Thanks.

What is the scope of a path in a PDF text object?

I am confused by seemingly contradicting information in the PDF Standard (PDF 32000-1:2008).
To simplify, I assume no transparency complications at all, as I am confused enough as is, so alphas are all 1, and I have blending mode normal, and text knockout is not a topic.
The issue is text state (more precisely Tr, setting the text rendering mode) and text showing (Tj and friends) and how to reconcile these with the scope of what a SINGLE path is.
So these three things, A, B, and C, seem contradictory:
A. On p.247 (in section 9.3.6, under Table 106), it says (emphasis with caps is mine): "At the end of the text object, the accumulated glyph outlines, if any, shall be combined INTO A SINGLE PATH, treating the individual outlines as subpaths of that path and applying the nonzero winding number rule (...)."
B. According to p.113, Figure 9 ("Graphics Objects"), the allowed operators within a text object include: "Color" (e.g. a change of fill color), "Text state" (so e.g. changing Tr, the text rendering mode), and "Text-showing" (e.g. (Hello) Tj).
According to B, the following would then be a valid text object:
C. On p. 246, in the middle of the page, it says (emphasis in caps is mine): "In [text rendering] modes that perform both filling and stroking, the effect shall be as if each glyph outline were filled and then stroked in separate operations. If any of the glyphs overlap, the result shall be equivalent to filling and stroking them ONE AT A TIME, producing the appearance of stacked opaque glyphs, rather than first filling and stroking them all at once".
So, according to A, all the glyphs (in my example in B) in "Red hello filled" and "Blue hello filled then stroked" must be considered as A SINGLE path, and the individual glyphs are subpaths...
My understanding is that ONE path can be filled with ONE color (I do not mean gradients of course, I am talking about solid colors, like the red and the blue in my example in B), is this not so? And I can only apply 1 text rendering mode to it, right? How does it make sense then that I can change color and text rendering mode WITHIN the text object, which B tells me, if according to A, I am dealing with a SINGLE path?
And according to C, even each glyph is to be considered a SINGLE path, as we paint one after (and so on top of) the other, which is exactly the concept of a PATH. I cannot fill ONE PATH "one at a time"...
So, the bottom-line question: what is the scope of a path in a PDF text object?
Thank you very much for help!
If I understand the spec correctly, you misinterpret the term "the accumulated glyph outlines". It does not refer to all the outline paths of the shown glyphs and their rendering, as you assume. Instead, it refers only to the outlines that are added to the path for clipping, if any!
Thus, all text is filled and/or stroked immediately within the text object, one glyph at a time, according to the state at the time its text-showing instruction is evaluated. Consequently, text drawn by different text-showing instructions may have different colors and other characteristics, and later glyphs may overlap earlier ones.
Eventually, at the end of the text object, all the outlines of the text drawn in the object with text rendering modes 4 to 7 (if any) are combined into the single path that A talks about, and that path is used to intersect the previous clipping path.
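To make that reading concrete, here is a small toy model in Python (not a PDF renderer). It assumes the text rendering modes 0 to 7 from Table 106, uses one color for both fill and stroke as a simplification, and represents glyphs by their characters. The function name run_text_object and the tuple format of the operations are made up for this sketch.

```python
FILL, STROKE, CLIP = "fill", "stroke", "clip"

# What each text rendering mode (Tr) does with a glyph outline.
MODES = {
    0: {FILL}, 1: {STROKE}, 2: {FILL, STROKE}, 3: set(),
    4: {FILL, CLIP}, 5: {STROKE, CLIP}, 6: {FILL, STROKE, CLIP}, 7: {CLIP},
}

def run_text_object(ops):
    """Interpret a simplified BT ... ET operator list (hypothetical format)."""
    color, mode = "black", 0
    painted = []        # painting happens immediately, glyph by glyph
    clip_outlines = []  # only these are "accumulated" until ET
    for op, arg in ops:
        if op == "rg":      # color change (simplified: fill and stroke alike)
            color = arg
        elif op == "Tr":    # text rendering mode change
            mode = arg
        elif op == "Tj":    # show text: handle one glyph at a time
            for glyph in arg:
                if FILL in MODES[mode]:
                    painted.append((glyph, FILL, color))
                if STROKE in MODES[mode]:
                    painted.append((glyph, STROKE, color))
                if CLIP in MODES[mode]:
                    clip_outlines.append(glyph)
    # At ET the accumulated outlines would form the single clipping path.
    return painted, clip_outlines

painted, clip_path = run_text_object([
    ("rg", "red"), ("Tr", 0), ("Tj", "Hi"),
    ("rg", "blue"), ("Tr", 2), ("Tj", "Yo"),
    ("Tr", 7), ("Tj", "X"),
])
print(painted)
print(clip_path)
```

In this model the red and the blue text never share a path; only the glyph shown in mode 7 ends up in the accumulated outlines.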

React-native aligning header left, right suffixes along with title

Here is the problem I am facing:
I have a component in react native called Header. Header has 3 properties - left suffix (for back button for example), right suffix (for additional actions, sometimes 1 button, sometimes 2 buttons, sometimes none) and title.
Layout rules are as follows:
Title should always be centered in a header no matter if Right or Left suffixes are present. If Right suffix is present but Left suffix is not, title should still be centered in the middle of the header.
Right and Left suffixes should always be visible if they are declared. Meaning if I have Right Suffix + Left Suffix + a very long title, title should shrink to make space for suffixes.
I have made a snack to demonstrate my problem: https://snack.expo.io/@anjayka/header-challenge
As you can see, most of the layout works fine: if I add the right suffix, the title stays in position, and if I remove the left suffix, it's still in position. The problem appears when the title is a very long text: it expands so much that it pushes the suffixes out entirely.
Any help solving this puzzle is appreciated.
Just wrap the title in a View with position: 'absolute', center the title Text in the middle of it, and give the title Text style a maxWidth.

Why is it so hard to convert PDF to plain text?

I needed to convert some PDFs back to text. I tried many programs and online tools, and the result was always mediocre.
Why is it so difficult, technically speaking?
Let's assume you are not talking about PDFs which merely wrap some bitmap image, because it should be clear that in that case you can only resort to OCR with all its restrictions.
Let's instead assume that text is drawn in the PDF at hand.
What is drawn on a PDF page is determined by a sequence of instructions in the content stream of that page. "Text is drawn" on a page means that among those instructions there are some setting the font to use by the instructions to come, some setting the text position and direction to use by the instructions to come, and some actually drawing text given by "string arguments".
Text extraction is the task of taking the sequence of instructions from a content stream and, instead of drawing the text as indicated by the font and position setting instructions, exporting it in a sensible order using a standard encoding, usually the encoding of the character type of the programming language / platform used.
The first problem is to understand the encoding of the string arguments of those text drawing instructions:
each font can have its own encoding; to extract the text one cannot simply ignore everything but the instructions drawing text and concatenate their string contents, you always have to take the current font into account (some extremely simple text extractors ignore this and, therefore, fail pretty often to return something sensible);
there are a large number of predefined encodings, some reminiscent of encodings you know, e.g. WinAnsiEncoding, many you likely don't know, e.g. Add-RKSJ-H; these encodings may use a constant number of bytes per glyph or they may be mixed-multibyte; so a text extractor must support very many encodings to start with;
encodings also may be completely ad-hoc and arbitrary; in particular in case of embedded subset fonts one often sees ad-hoc encodings generated by dealing out character codes from some starting value whenever one is needed; i.e. the first glyph in a given font used on a page is given the starting value as code, the next, different glyph is given the starting value plus one, the next, different one the starting value plus two, etc; "Hello World" and a starting value of 48 (ASCII value of '0') would result in "01223453627"; these fonts may contain a mapping to Unicode but they are not required to.
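The ad-hoc scheme from the last item can be sketched in a few lines of Python. The function name adhoc_encode and the returned reverse table are made up for illustration; the reverse table stands in for the optional mapping to Unicode that a font may or may not carry.

```python
def adhoc_encode(text, start=48):
    """Hand out character codes in order of first appearance, from `start`."""
    codes = {}    # glyph -> character code
    encoded = []
    for ch in text:
        if ch not in codes:
            codes[ch] = start + len(codes)
        encoded.append(chr(codes[ch]))
    # The inverse table plays the role of an (optional) map to Unicode.
    to_unicode = {chr(code): ch for ch, code in codes.items()}
    return "".join(encoded), to_unicode

raw, to_unicode = adhoc_encode("Hello World")
decoded = "".join(to_unicode[c] for c in raw)
print(raw)      # 01223453627
print(decoded)  # Hello World
```

Without the reverse table, an extractor only sees the raw codes and has nothing to turn "01223453627" back into readable text.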
The next problem is to make sense out of the order of the strings:
the string drawing instructions may occur in an arbitrary order, e.g. "Hello" might be drawn "lo" first, then after moving back "el", then after again moving back "H"; to extract the text one cannot ignore text positioning instructions and simply concatenate text strings, you always have to take the current position into account (some simple text extractors ignore this and, therefore, can fail to return something sensible);
multi-columnar text may present a difficulty, text may be drawn line by line, e.g. first the text of the top line of the first column, then the top line of the second column, then the second line of the first column, then the second line of the second column, etc.; there need not be any hints in the PDF that the text is multi-columnar.
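A minimal Python sketch of position-aware ordering, assuming each drawn string has already been reduced to a text, a left edge x and a baseline y (the Chunk type, the coordinates and the tolerance value are all made up for this example). It groups chunks into lines by baseline and sorts each line left to right; it makes no attempt to detect columns, so multi-columnar pages would still come out interleaved.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    x: float  # left edge of the chunk
    y: float  # baseline; PDF y coordinates grow upward

def in_reading_order(chunks, line_tolerance=2.0):
    """Group chunks into lines by baseline, then sort each line by x."""
    lines = []  # list of (baseline_y, [chunks on that line])
    for chunk in sorted(chunks, key=lambda c: -c.y):  # top of page first
        if lines and abs(lines[-1][0] - chunk.y) <= line_tolerance:
            lines[-1][1].append(chunk)
        else:
            lines.append((chunk.y, [chunk]))
    return ["".join(c.text for c in sorted(line, key=lambda c: c.x))
            for _, line in lines]

# Drawing order as found in the content stream, not reading order.
drawn = [
    Chunk("World", 100.0, 686.0),
    Chunk("lo", 118.0, 700.0),
    Chunk("el", 106.0, 700.0),
    Chunk("H", 100.0, 700.0),
]
naive = "".join(c.text for c in drawn)
ordered = in_reading_order(drawn)
print(naive)    # WorldloelH
print(ordered)  # ['Hello', 'World']
```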
Another problem is to recognize formatting or styling artifacts:
spaces between words need not be created by drawing a space glyph, it may also be done by text position changing instructions; text extractors not trying to recognize gaps created by text positioning instructions may return a result without spaces; on the other hand the same technique can be used to draw adjacent glyphs at an optimal distance, aka kerning; text extractors trying to recognize gaps created by text positioning instructions may falsely return spaces where there should be none;
sometimes selected words are printed s p a c e d o u t for extra emphasis; in the extracted text these gaps might be presented as space characters which automatic postprocessing of the text may see as word separators;
usually for bold text one uses a different, bold font program; if that is not at hand, people sometimes get creative and emulate bold by printing the same text twice with a minute offset; with a slightly larger offset (or a different transformation) and a different color a shadow effect can be emulated; if the text extractor does not try to recognize this, you end up having some duplicate characters in the output.
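The space problem from the first item is usually attacked with a gap heuristic. Here is a minimal Python sketch, assuming the glyphs of one line are already sorted by x and come with their widths; the function name join_with_gaps, the tuple layout and the threshold of 0.3 times the font size are arbitrary choices for illustration, not values taken from any particular library.

```python
def join_with_gaps(glyphs, space_factor=0.3):
    """glyphs: (char, x, width, font_size) tuples of one line, sorted by x."""
    out = []
    prev_end = None
    for ch, x, width, font_size in glyphs:
        # A horizontal gap wider than the threshold is reported as a space.
        if prev_end is not None and x - prev_end > space_factor * font_size:
            out.append(" ")
        out.append(ch)
        prev_end = x + width
    return "".join(out)

# "Hi you" drawn without any space glyph: the gap comes from repositioning.
words = [("H", 0.0, 7.0, 10.0), ("i", 7.0, 3.0, 10.0),
         ("y", 14.0, 5.0, 10.0), ("o", 19.0, 5.5, 10.0), ("u", 24.5, 5.5, 10.0)]
# Kerned pair: the second glyph is pulled closer, so no space must appear.
kerned = [("A", 0.0, 7.0, 10.0), ("V", 6.2, 7.0, 10.0)]

print(join_with_gaps(words))   # Hi you
print(join_with_gaps(kerned))  # AV
```

The same threshold that finds the gap in "Hi you" would also turn spaced-out emphasis into separate single-letter words, which is exactly the trade-off described above.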
More problems arise due to incomplete or wrong extra information:
ToUnicode maps of fonts (optional maps from character code to Unicode) may be incomplete or contain errors; e.g. there are many questions here on Stack Overflow dealing with incorrect ToUnicode maps for Indic scripts; the text extraction results reflect these errors;
there even are PDFs with contradictory information, e.g. with an error in the ToUnicode map but the correct information in an ActualText entry; this is used by some PDF creators to allow correct copy&paste from some programs (preferring an ActualText entry in such a situation) while injecting errors in the output of other programs (preferring ToUnicode information then).
Yet another problem arises if you expect the text extractor to extract only text eventually visible in the page:
text may be drawn outside the current clipping area or outside the visible page area; text extractors need to keep these in mind;
text may be drawn using the rendering mode "invisible"; text extractors have to keep an eye on the rendering mode;
text may be drawn using the same color as the background; to recognize this, a text extractor can not only look at the current instruction and a few graphics state details, it has to take into account anything drawn beforehand in the location of the text;
text may be drawn as a clip path; to recognize whether this text is visible in the end, a text extractor must keep track of what is drawn in the text area as long as the clip path is active;
text may be covered by something else later; a text extractor must drop recognized text in such a case; but depending on blend modes and transparency settings these coverings might or might not allow the text to shine through; thus, for a correct result the text extractor must for each glyph keep track of the color it's drawn with, the color of the backdrop, and what all those spiffy effects do with those colors later on; and of course, both glyph color and backdrop color can be interesting, e.g. some shading colors; and the color spaces involved may differ, requiring one to convert back and forth between color spaces; and so on.
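Only the cheapest of these checks are easy to sketch. The Python snippet below assumes each chunk carries its text, its origin and its text rendering mode, and that the visible page area is a simple rectangle; the function name visible_candidates and the tuple layout are made up for this example. It drops text in rendering mode 3 (invisible) or 7 (clipping only) and text whose origin lies off the page, and it deliberately ignores everything else in the list (background color, clip paths, later coverings).

```python
def visible_candidates(chunks, page_box):
    """Keep chunks that paint something and start inside the page box."""
    x0, y0, x1, y1 = page_box
    kept = []
    for text, x, y, mode in chunks:
        if mode in (3, 7):  # 3 = invisible, 7 = only added to the clip path
            continue
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue        # origin outside the visible page area
        kept.append(text)
    return kept

chunks = [
    ("Visible text", 72.0, 700.0, 0),
    ("Hidden OCR layer", 72.0, 680.0, 3),
    ("Off-page note", -500.0, 700.0, 0),
]
result = visible_candidates(chunks, (0.0, 0.0, 612.0, 792.0))
print(result)  # ['Visible text']
```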
Furthermore, text may be drawn where text extractors usually don't look:
some tools hide text from text extraction by putting it into a pattern and filling the page area with that pattern;
similarly there are type 3 fonts; each character in a type 3 font is represented by its own content stream; thus, a tool can draw all text in the content stream of a single type 3 font glyph and then draw that glyph on the page.
...
You surely have meanwhile gotten an idea why text extraction results can be less than optimal. And be assured, the list above is not complete, there still are more complications for text extraction.

Hebrew punctuation displayed incorrectly in Objective-C

I have the very basic line:
self.label.text = @",הם הכריחו אותה לשתות ויסקי";
Notice the comma at the left of the string. When this displays in a UILabel, I see the comma at the right. This is one example of punctuation problems I am seeing with Hebrew.
Any ideas for resolving this?
Most of the text you have is right-to-left, but a comma is left-to-right. You are displaying source code here as it is displayed by Xcode. It's not at all obvious what rules Xcode would choose to display such text. You would be much more confident about what your source code is if you write
self.label.text = @"הם הכריחו אותה לשתות ויסקי" ",";
for example or
self.label.text = @"," "הם הכריחו אותה לשתות ויסקי";
so you know for sure what text you have in Xcode. After that I'm afraid it's very much a matter of reading the documentation and seeing what you need to do. While characters in text have some ordering, a text field on its own has a text ordering as well. You can have Latin text with a bit of Hebrew inside, or Hebrew (right-to-left) text with a bit of Latin inside, and they will behave differently.
What you describe looks like a left-to-right text field that is used to display some Hebrew text, so the overall display order is left to right, but Hebrew items inside (not the comma) are displayed right to left. You'd need to change the display order of the text field itself.
I've been reading up on bi-directional text, and it seems that certain Unicode characters specify certain properties of the following text. Through experimentation, I've found that the Right-To-Left Isolate character, U+2067, will cause the text that follows to be displayed in the correct order. So the Objective-C solution to the problem was:
self.label.text = [@"\u2067" stringByAppendingString:@",הם הכריחו אותה לשתות ויסקי"];

What does setTextMatrix of contentByte class in iText do?

I am using iText and am very new to it. There have been several situations where I think I could have figured out the problem with my code if I knew what I was doing - I use examples without knowing the workings behind the code, and even as I look at the source I can't figure out what the programmer was thinking.
What does setTextMatrix of PdfContentByte in iText do? And how do I figure out the parameter values I need?
For example:
cb.setTextMatrix(1, 0);
The input parameters are x,y coordinates in points, unless CTM scaling was defined.
0,0 would be the bottom left of the template you are referencing.
The position is the 'baseline' of the text, rather than the top or the bottom.
Transcribed from this source:
https://sourceforge.net/p/itext/mailman/message/12855218/
The first parameter sets the left margin, the second parameter sets the bottom margin.
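As a rough illustration in Python (not iText code): assuming the two-argument call corresponds to a PDF text matrix with identity scaling and the given translation, i.e. the six numbers 1 0 0 1 x y, the effect on the text origin can be sketched like this. The helper names text_matrix and apply are made up for this example.

```python
def text_matrix(x, y, a=1.0, b=0.0, c=0.0, d=1.0):
    """Six-number matrix (a, b, c, d, e, f); the defaults mean no scaling."""
    return (a, b, c, d, x, y)

def apply(matrix, px, py):
    """Map a point from text space to user space."""
    a, b, c, d, e, f = matrix
    return (a * px + c * py + e, b * px + d * py + f)

# Text origin one inch from the left edge and 720 pt above the bottom edge.
tm = text_matrix(72, 720)
origin = apply(tm, 0, 0)
print(origin)  # (72.0, 720.0)

# The call from the question, cb.setTextMatrix(1, 0), under the same assumption:
question_origin = apply(text_matrix(1, 0), 0, 0)
print(question_origin)  # (1.0, 0.0)
```

So under this assumption the example from the question places the text baseline 1 pt from the left edge and right at the bottom edge of the template.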