extract the structure of a letter - objective-c

Say I have the letter 'a'. My objective is to redraw the letter using 'little colored flowers'. How do I extract the path/structure/blueprint of a letter in order to use it as a path for my custom draw?

See this article: Low-level text rendering. The long second listing outlines the use of the CTFontCreatePathForGlyph() function. You probably don't need to include all the setup that takes place in that listing if you just want to convert a single glyph rather than multiple lines of text.

Related

Add spacing between numbers after some specific digits in React Native

I need to show dashes as hints and need to have equal spacing after some digits in Text Input similar to the picture attached .
Do I need to use separate text inputs for these or is it possible to achieve this in a single input text field? The user should be able to enter the number in a single go.
If you know phone number format you can do it manually in one input. You are saving it as a real number, but displaying it with spaces. If you can have various formats, from various countries you can use this package https://www.npmjs.com/package/react-native-phone-number-input
The best way to achieve that is by using mask input you can use react-native-mask-input library to do it you can check the library from here.

Why is it so hard to convert PDF to plain text?

I needed to convert some PDF back to text. I tried many soft and online tools and result was always mediocre.
Why is it so difficult technically speaking ?
Let's not assume you are talking about PDFs which merely wrap some bitmap image because it should be clear that in that case you can only resort to OCR with all its restrictions.
Let's instead assume that text is drawn in the PDF at hand.
What is drawn on a PDF page is determined by a sequence of instructions in the content stream of that page. "Text is drawn" on a page means that among those instructions there are some setting the font to use by the instructions to come, some setting the text position and direction to use by the instructions to come, and some actually drawing text given by "string arguments".
Text extraction is the task of taking the sequence of instructions from a content stream and instead of drawing the text as indicated by the font and position setting instructions, to export it in a sensible order using a standard encoding, usually the encoding of the character type of the used programming language / platform.
The first problem is to understand the encoding of the string arguments of those text drawing instructions:
each font can have its own encoding; to extract the text one cannot simply ignore everything but the instructions drawing text and concatenate their string contents, you always have to take the current font into account (some extremely simple text extractors ignore this and, therefore, fail pretty often to return something sensible);
there are a large number of predefined encodings, some reminding of encodings you know, e.g. WinAnsiEncoding, many you likely don't know, e.g. Add-RKSJ-H; these encodings may use a constant number of bytes per glyph or they may be mixed-multibyte; so a text extractor must support very many encodings to start with;
encodings also may be completely ad-hoc and arbitrary; in particular in case of embedded subset fonts one often sees ad-hoc encodings generated by dealing out character codes from some starting value whenever one is needed; i.e. the first glyph in a given font used on a page is given the starting value as code, the next, different glyph is given the starting value plus one, the next, different one the starting value plus two, etc; "Hello World" and a starting value of 48 (ASCII value of '0') would result in "01223453627"; these fonts may contain a mapping to Unicode but they are not required to.
The next problem is to make sense out of the order of the strings:
the string drawing instructions may occur in an arbitrary order, e.g "Hello" might be drawn "lo" first, then after moving back "el", then after again moving back "H"; to extract the text one cannot ignore text positioning instructions and simply concatenate text strings, you always have to take the current position into account (some simple text extractors ignore this and, therefore, can fail to return something sensible);
multi-columnar text may present a difficulty, text may be drawn line by line, e.g. first the text of the top line of the first column, then the top line of the second column, then the second line of the first column, then the second line of the second column, etc.; there need not be any hints in the PDF that the text is multi-columnar.
Another problem is to recognize formatting or styling artifacts:
spaces between words need not be created by drawing a space glyph, it may also be done by text position changing instructions; text extractors not trying to recognize gaps created by text positioning instructions may return a result without spaces; on the other hand the same technique can be used to draw adjacent glyphs at an optimal distance, aka kerning; text extractors trying to recognize gaps created by text positioning instructions may falsely return spaces where there should be none;
sometimes selected words are printed s p a c e d o u t for extra emphasis; in the extracted text these gaps might be presented as space characters which automatic postprocessing of the text may see as word separators;
usually for bold text one uses a different, bold font program; if that is not at hand, people sometimes get creative and emulate bold by printing the same text twice with a minute offset; with a slightly larger offset (or a different transformation) and a different color a shadow effect can be emulated; if the text extractor does not try to recognize this, you end up having some duplicate characters in the output.
More problems arise due to incomplete or wrong extra information:
ToUnicode maps of fonts (optional maps from character code to Unicode) may be incomplete or contain errors; there e.g. are many questions here on stack overflow dealing with incorrect ToUnicode maps for Indian writings; the text extraction results reflect these errors;
there even are PDFs with contradictory information, e.g. with an error in the ToUnicode map but the correct information in an ActualText entry; this is used by some PDF creators to allow correct copy&paste from some programs (preferring an ActualText entry in such a situation) while injecting errors in the output of other programs (preferring ToUnicode information then).
Yet another problem arises if you expect the text extractor to extract only text eventually visible in the page:
text may be drawn outside the current clipping area or outside the visible page area; text extractors need to keep these in mind;
text may be drawn using the rendering mode "invisible"; text extractors have to keep an eye on the rendering mode;
text may be drawn using the same color as the background; to recognize this, a text extractor can not only look at the current instruction and a few graphics state details, it has to take into account anything drawn beforehand in the location of the text;
text may be drawn as a clip path; to recognize whether this text is visible in the end, a text extractor must keep track of what is drawn in the text area as long as the clip path is active;
text may be covered by something else later; a text extractor must drop recognized text in such a case; but depending on blend modes and transparency settings these coverings might or might not allow the text to shine through; thus, for a correct result the text extractor must for each glyph keep track of the color its drawn with, the color of the backdrop, and what all those spiffy effects do with those colors later on; and of course, both glyph color and backdrop color can be interesting, e.g. some shading colors; and the color spaces involved may differ, requiring one to convert back and forth between color spaces; and so on.
Furthermore, text may be drawn where text extractors usually don't look:
some tools hide text from text extraction by putting it into a pattern and filling the page area with that pattern;
similarly there are type 3 fonts; each character in a type 3 font is represented by its own content stream; thus, a tool can draw all text in the content stream of a single type 3 font glyph and then draw that glyph on the page.
...
You surely have meanwhile gotten an idea why text extraction results can be less than optimal. And be assured, the list above is not complete, there still are more complications for text extraction.

How do I make iText 7 diacritic mark stacking work correctly?

I have run into a problem with iText 7 where diacritic marks are painted on top of one another instead of stacking properly when multiple marks are used on a single character. Is there a setting that makes them appear correctly, or is this a bug in iText 7? Any help greatly appreciated. This can be observed if you create a text object in your PDF like below. Obviously, replace the relevant bit with an actual font object, rather than than what I have in there.
new Text("ḗ and ṓ are characters that display incorrectly").setFont(<UNICODE COMPATIBLE FONT LIKE CHARIS>);
While Bruno and Benoit correctly pointed out that for advanced typography stuff like stacking diacritical marks you need pdfCalligraph module, there is a workaround you can try at your own risk. If your combinations of base glyph and diacritics are real, meaning they occur in real texts in some languages or some other known contexts, then such combinations are most probably present in Unicode and have their own number associated with them. For instance, in text you provided, they are 0xU1E17 and 0x1E53 Unicode characters. Some fonts may contain such glyphs, so that there is a second option to showing base glyph and stacking diacritics: showing combined glyphs. For example, ArialUni shipped with Windows does contain the above mentioned glyphs.
To try this approach, you would need the following code for composing known Unicode base glyph + diacritics combinations into single glyphs:
String originalStr = "ḗ and ṓ are characters that display incorrectly";
String normalizedStr = java.text.Normalizer.normalize(originalStr, Normalizer.Form.NFC);
new Text(normalizedStr); // Use this normalized Text instance
The result that I got with ArialUni:
But again as I mentioned do it at your own risk because it will only work if there are necessary combinations present both in Unicode and font. For correct rendering you still should use pdfCalligraph.

User input text translation

I'm working on a translator that will take English language text (as user input into a UITextView) and (with a button press) replace specific words with alternatives. I have both the English words in scope plus their alternatives in separate Arrays (englishArray and alternativeArray), indexed correspondingly.
My challenge is finding an algorithm that will allow me to identify a word in the input text (a UITextView) ignoring characters like <",.()>, lookup the word in englishArray (case insensitive), locate the corresponding word in alternativeArray and then use that word in place of the original - writing it back to the UITextView.
Any help greatly appreciated.
NB. I have created a Category extending the NSArray functionality with a indexOfCaseInsensitiveString method that ignores case when doing an indexOfObject type lookup if that helps.
Tony.
I think that using an NSScanner would be best to parse the string into separate words which you could then pass to your indexOfCaseInsensitiveString method. scanCharactersFromSet:intoString: using a set of all the characters you want to ignore, including whitespace and newline characters should get you to the start of a word, and then you could use scanUpToCharactersFromSet:intoString: using the same set to scan to the end of the word. Using scanLocation at the beginning and end of each scan should allow you to get the range of that word, so if you find a match in your array, you will know where in your string to make the replacement.
Thanks for your suggestion. It's working with one exception.
I want to capture all punctuation so I can recreate the original input but with the substituted words. Even though I have a 'space' in my Character Set, the scanner is not putting the spaces into the 'intoString'. Other characters I specify in the Character Set such as '(' and ';' are represented in the 'intoString'.
Net is that when I recreate the input, it's perfect except that I get individual words running into each other.
UPDATE: I fixed that issue by including:
[theScanner setCharactersToBeSkipped:nil];
Thanks again.

Change Url using Regex

I have url, for example:
http://i.myhost.com/myimage.jpg
I want to change this url to
http://i.myhost.com/myimageD.jpg.
(Add D after image name and before point)
i.e I want add some words after image name and before point using regex.
What is the best way do it using regex?
Try using ^(.*)\.([a-zA-Z]{3,5}) and replacing with \1D\2. I'm assuming the extension is 3-5 alphanumeric numbers but you can modify it to suit. E.g. if it's just jpg images then you can put that instead of the [a-zA-Z]{3,5}.
Sounds like a homework question given the solution must use a regex, on that assumption here is an outline to get you going.
If all you have is a URL then #mathematical.coffee's solution will suit. However if you have a chunk of text within which is one or more URLs and you have to locate and change just those then you'll need something a little more involved.
Look at the structure of a URL: {protocol}{address}{item}; where
{protocol} is "http://", "ftp://" etc.;
{address} is a name, e.g. "www.google.com", or a number, e.g. "74.125.237.116" - there will always be at least one dot in the address; and
{item} is "/name" where name is quite flexible - there will be zero or more items, you can think of them as directories and a file but this isn't strictly true. Also the sequence of items can end in a "/" (including when there are zero of them).
To make a regex which matches a URL start by matching each part. In the case of the items you'll want to match the last in the sequence separately - you'll have zero or more "directories" and one "file", the latter must be of the form "name.extension".
Once you have regexes for each part you just concatenate them to produce a regex for the whole. To form the replacement pattern you can surround parts of your regex with parentheses and refer to those parts using \number in the replacement string - see #mathematical.coffee's solution for an example.
The best way to learn regexs is to use an editor which supports them and just experiment. The exact syntax may not be the same as NSRegularExpression but they are mostly pretty similar for the basic stuff and you can translate from one to another easily.