Missing presentation forms (glyphs) of some Arabic characters in Unicode - pdf

I am working on code that generates PDFs containing Arabic text. For each character, I choose the correct glyph from the presentation forms so that the text displays correctly. This works fine, but Unicode doesn't contain presentation forms for all Arabic characters.
For example, take \u067D ARABIC LETTER TEH WITH THREE DOTS ABOVE DOWNWARDS ٽ. There is no presentation form for this character, even though the character has a medial form, as can be seen in this string: لٽط
What is the reason that presentation forms of this and other characters are missing?
Is the character not used in practice?
Can the simple ARABIC LETTER TEH, which contains only one dot above and has presentation forms, be used instead?
Or is it necessary to somehow build this character (e.g. by using \uFBB6 THREE DOTS ABOVE character)?

The Arabic presentation forms should never be used for writing text. They exist only because they were needed for compatibility with older standards long ago. As such, there aren’t presentation forms for all Arabic letters in Unicode, only those necessary for this specific purpose. Many letters were also added long after the presentation forms ceased being relevant altogether. See the FAQ on Arabic for more information.
Arabic text should always be entered and stored using the regular letters (from the blocks Arabic, Arabic Supplement, and Arabic Extended-A). These letters will then automatically assume the correct shape depending on where they are situated in the word (initial, medial, or final) as can be seen in the example string you provided.
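The gap in the presentation forms can be checked against the Unicode Character Database. The following is a minimal sketch in Python using the standard `unicodedata` module; the helper name `presentation_forms` is made up for this illustration. It scans the two Arabic Presentation Forms blocks for characters whose compatibility decomposition is exactly the given letter.

```python
import unicodedata

def presentation_forms(letter):
    """Map positional tags (e.g. '<medial>') to the presentation-form
    code point that decomposes to exactly `letter`."""
    target = "%04X" % ord(letter)
    forms = {}
    # Arabic Presentation Forms-A (FB50-FDFF) and Forms-B (FE70-FEFF)
    for block in (range(0xFB50, 0xFE00), range(0xFE70, 0xFF00)):
        for cp in block:
            # Decompositions look like "<medial> 062A"; ligatures have
            # more than one code point after the tag and are skipped.
            parts = unicodedata.decomposition(chr(cp)).split()
            if len(parts) == 2 and parts[1] == target:
                forms[parts[0]] = cp
    return forms

teh_forms = presentation_forms("\u062A")   # ARABIC LETTER TEH
ttheh_forms = presentation_forms("\u067D") # TEH WITH THREE DOTS ABOVE DOWNWARDS

for tag, cp in sorted(teh_forms.items()):
    print(tag, "U+%04X" % cp)
print("forms for U+067D:", ttheh_forms)
```

The plain TEH has all four positional forms, while U+067D has none, which matches what the question observed.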
Using the character U+FBB6 ﮶ ARABIC SYMBOL THREE DOTS ABOVE would not be appropriate in this context because it is not a combining mark. It isn’t used to build new characters, but to talk about the symbol itself in isolation. From the code chart for Arabic Presentation Forms-A:
These are spacing symbols representing Arabic letter diacritics
considered in isolation, as for example as in discussions about the
Arabic script.
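The same database also shows that U+FBB6 is a spacing symbol rather than a combining mark. This is a small Python sketch for inspection only; U+0651 ARABIC SHADDA is included purely as a contrasting example of a real combining mark and is not part of the original question.

```python
import unicodedata

symbol = "\uFBB6"  # ARABIC SYMBOL THREE DOTS ABOVE
shadda = "\u0651"  # ARABIC SHADDA, a genuine combining mark (for contrast)

def is_combining_mark(ch):
    # General categories Mn, Mc and Me are the combining marks.
    return unicodedata.category(ch).startswith("M")

for ch in (symbol, shadda):
    print(
        "U+%04X" % ord(ch),
        unicodedata.name(ch),
        unicodedata.category(ch),
        "combining" if is_combining_mark(ch) else "spacing",
    )
```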
If the software you are using does not handle Arabic letter joining correctly, then there simply is no Unicode-defined way to enter the medial form of ٽ in your document. You will either have to switch to another framework entirely, or (as a last resort) encode the contextual forms you need as private-use characters in a new font, but I strongly recommend against that solution.

Related

Need to prevent entry of "U+3000" full-width Japanese spaces in Word doc

Short version:
So I have a Word doc with very specific formatting guidelines, into which Japanese users have to input text. Despite being specifically told not to, instead of using "tab" to create indents, they will very often use full-width Japanese spaces (U+3000). I want to somehow prevent the entry of this character to avoid having to reformat.
Long version:
We send out Japanese/English-language script templates for Japanese users to input their own dialogue into. They will often ignore the formatting of the script, using hard returns and full-width spaces to hack-format the doc (very common practice among Japanese Word users). This leads to unnecessary time spent on re-formatting. As I see it I have three options:
Prevent the entry of undesired characters, by blocking the character or setting up a dialogue box every time it is used.
Automate a dialogue box to pop up when the document is saved, displaying a message to users to make sure no undesired characters were used.
Create a macro to auto-replace undesired characters on my end.
Any suggestions? Help is very much appreciated.

Thai character not rendered correctly in PDF

My app should be able to output a PDF file containing the user guide in several supported languages. (I'm using pdfkit)
I had some trouble finding a suitable font for Thai: some fonts that supposedly support Thai (including Noto Thai from Google) would output squares, question marks or, even worse, unreadable garbage.
After a bit of research, I found one that seemed to work reasonably well, until our Thai guy noted that the characters
ต่ำ
were rendered like in the picture below, basically with the two elements above the first character collapsed, one covering the other.
I'm using the Nimbus Sans Thai family downloaded from myfonts.com which, by the way, seems able to render those characters correctly, as you can see by copy-pasting ต่ำ into the preview input there.
Any hints?
Your font is incomplete in a certain way. It lacks some glyphs that usually reside in the Private Use Area (PUA) of Unicode.
Some applications (Microsoft Word is one I'm aware of) can work around this problem by positioning the glyphs themselves, but your rendering app (and Adobe Acrobat Viewer) does not.
You should either find a font that includes these glyphs or find an application that repositions the existing glyphs itself.
Many fonts claim to support Thai (and do, indeed, contain the "regular" Thai glyphs) yet are still incomplete.
Besides the canonical glyphs, a well-formed font should contain a "Private Use Area" (PUA) subrange that holds glyphs in non-canonical forms. Those glyphs include:
Tone marks shifted to the upper position for use in combination with upper vowels (SARA_I, SARA_UE, etc.), and shifted to a lower position for Consonant + Tone Mark with no upper vowel;
Tone marks and upper vowels shifted slightly to the left for use in combination with PO_PLA, FO_FAN, etc. (otherwise they would overlap with the consonant's upper tail);
Both effects combined, e.g. a tone mark shifted down and left at the same time;
Special glyphs for YO_YING and THO_THAN (with no tail) for use in combination with under-vowels;
Several more.
Normally, when a rendering app finds the symbol combinations mentioned above, it looks for substitute glyphs in the PUA. If none are found, it simply falls back to the default glyph, which is what happens in your case.
Here are two screenshots of the PUA areas of Arial Unicode and FreeSerif, which are self-explanatory: FreeSerif has an empty PUA. I think the same problem occurs with your Nimbus font.
One final observation: incorrect fonts can be incorrect in different ways. Above I described the more canonical case, where the standard positions of tone marks are the upper positions, while the non-standard positions are shifted down (or are absent, which constitutes an incomplete font).
There are, however, fonts that behave the opposite way; they (only) contain tone marks in lower positions. This is what you seem to observe.
The problem is that PDFKit does not perform complex script rendering.
Several scripts, such as Arabic and Thai, require glyph substitution and repositioning depending on context (position in the string, neighboring characters), and PDFKit does not seem to do this.
PDF viewer applications display exactly what is defined in the PDF file. The Nimbus Sans Thai font probably includes all the required glyphs but what bytebuster explains in his answer needs to be performed by PDFKit and not by the viewer application.
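To see why this particular word needs contextual positioning, it helps to look at the code points involved. The sketch below uses Python's standard `unicodedata` module purely for inspection; it is not what PDFKit or any shaping engine does internally, and the use of NFKD here is only a convenient way to expose the parts of SARA AM.

```python
import unicodedata

word = "\u0e15\u0e48\u0e33"  # the word from the question: TO TAO, MAI EK, SARA AM

for ch in word:
    print("U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.name(ch))

# SARA AM has a compatibility decomposition into NIKHAHIT (a mark drawn
# above the consonant) followed by SARA AA.
decomposed = unicodedata.normalize("NFKD", word)

# Non-spacing marks (category Mn) are the pieces stacked on the consonant.
marks_above = [ch for ch in decomposed if unicodedata.category(ch) == "Mn"]
print([unicodedata.name(ch) for ch in marks_above])
```

Two marks (MAI EK and NIKHAHIT) end up above the same consonant, so a renderer that places each glyph at its default position will draw them on top of each other.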

How to export text document containing astral Unicode characters to PDF

I regularly create documents that need Unicode characters above U+FFFF. Unfortunately, OpenOffice and LibreOffice are both unable to correctly export these characters when creating a PDF. The actual data gets mangled by a completely asinine algorithm, while the display just consists of various overlapping question mark boxes.
This is not a font issue. I embed all used fonts in the PDF and all characters below U+FFFF work perfectly fine.
Until now I have been working around this issue by mapping the glyphs I need to a custom PUA font. This solves the display problems, but obviously makes the actual content of the text unsearchable and quite fragile. I haven’t been able to find any settings that might affect the handling of Unicode characters in PDF.
Therefore I have three questions:
Is there a way to make OpenOffice/LibreOffice handle astral characters correctly on PDF export?
If not, is there an external tool that can convert .odt files to PDF while preserving astral characters?
If not, is there another good rich-text editor using a different file format that can deal with astral characters in PDFs?
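For reference, "astral" here means code points above U+FFFF, which UTF-16 represents as a pair of surrogate code units instead of a single unit. The following Python sketch shows the difference; the sample characters and the helper name `utf16_units` are assumptions chosen only for illustration.

```python
def utf16_units(text):
    """Return the UTF-16 code units of `text` as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

bmp_char = "\u00e9"         # below U+FFFF: one code unit
astral_char = "\U0001F600"  # above U+FFFF: two code units (a surrogate pair)

bmp_units = utf16_units(bmp_char)
astral_units = utf16_units(astral_char)

print(["%04X" % u for u in bmp_units])
print(["%04X" % u for u in astral_units])
```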

Spanish characters in vb.net desktop application

I'm learning Spanish and wrote an application to display a form at random intervals at random places on the screen. Spanish words are taken from random positions in a text file. When the form appears, a Spanish word is presented with a definition below it. When I click anywhere on the form it goes away to appear again later. The form appears at any interval less than ten minutes, or whatever value I enter for that.
Spanish characters with accent marks do not display correctly. A label is being used to render the text. What is the best way to have it display properly? I haven't done localization or other languages in a desktop application, only on the web. I only want to change the one label if possible. Thanks
I found the answer at http://www.vbforums.com/showthread.php?655592-RESOLVED-Extended-ASCII-characters-in-Stream-I-O and used the following code:
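' Code page 28591 is ISO-8859-1 (Latin-1), which covers the accented Spanish letters.
Private Const ISO_8859_1 As Integer = 28591
' Pass the encoding explicitly instead of relying on the StreamReader default.
Dim encoding As System.Text.Encoding = System.Text.Encoding.GetEncoding(ISO_8859_1)
reader = New IO.StreamReader(file_name, encoding)
Without the explicit encoding, the high-order bit was being removed when the file was read. Labels render everything in the extended ASCII set without issue.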
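The effect of the encoding choice can be reproduced outside VB.NET. This is a minimal Python sketch, not the code from the application: the sample word is an assumption, and the bytes are built in memory instead of being read from the text file.

```python
# An accented Spanish word as it would be stored in an ISO-8859-1 file.
raw = "ni\u00f1o".encode("latin-1")

# Decoding with the matching encoding recovers the accented letter.
correct = raw.decode("latin-1")

# Decoding the same bytes as UTF-8 fails on the byte 0xF1; with
# errors="replace" it becomes U+FFFD (the replacement character).
as_utf8 = raw.decode("utf-8", errors="replace")

# Literally clearing the high-order bit turns 0xF1 into 0x71 ("q").
high_bit_stripped = bytes(b & 0x7F for b in raw).decode("ascii")

print(repr(correct), repr(as_utf8), repr(high_bit_stripped))
```

Whichever way the text is damaged, the fix is the same as in the VB.NET snippet: tell the reader which encoding the file actually uses.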

How Does a PDF Store Text

I am attempting to gain a better understanding of how a PDF stores text. Generally speaking, when a PDF is created from an application like MS Word (or in my case SQL Server Reporting Services), how is text stored by the PDF? I would hope that the resulting document isn't OCR'ed in this particular scenario the way it would be if the original PDF document had been created from an image.
To get a bit more detailed, I am trying to understand how text extractors for PDFs work. My initial understanding of PDF was that it stored (PostScript) instructions on how to draw the "image" of the document to a page or a printer, and that there was no actual text contained within the document itself. Subsequently, I was thinking that a text extractor might reverse-engineer such instructions to generate the text that the PDF would otherwise generate. I am not confident of this, though.
PDF contains several different types of objects, not only vector or raster drawing instructions. Text in particular is represented by text elements. These include a string of characters that should be drawn at certain positions using a specific font.
Text extraction from PDFs can be a complicated affair because the file format is oriented for page layout. A text element may be an entire paragraph, or a single character. Even a single word may consist of several text elements if different typefaces are mixed. Also, the characters are not necessarily encoded in a standard encoding such as Unicode. They may be encoded in a way specific to a particular font.
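As a rough illustration of what a text extractor has to work with, the sketch below pulls the string operands of the `Tj` (show text) operator out of a small content stream. The stream is hand-written for this example, and the regular expression is a deliberate simplification: real content streams are usually compressed, strings may contain escapes or be written in hexadecimal, text may be shown with other operators such as `TJ`, and the bytes often need a font-specific mapping before they mean anything.

```python
import re

# Hand-written, uncompressed content stream fragment (illustrative only).
CONTENT_STREAM = b"""
BT
/F1 12 Tf
72 720 Td
(Hello, ) Tj
(world) Tj
ET
"""

# A literal string operand followed by the Tj operator. Escapes, nested
# parentheses, hex strings and TJ arrays are intentionally not handled.
TJ_PATTERN = re.compile(rb"\(([^()\\]*)\)\s*Tj")

def extract_text_elements(stream):
    """Return the raw string of each Tj operation, in stream order."""
    # Latin-1 is an assumption; the real mapping depends on the font.
    return [m.group(1).decode("latin-1") for m in TJ_PATTERN.finditer(stream)]

elements = extract_text_elements(CONTENT_STREAM)
print(elements)
print("".join(elements))
```

Even in this tiny example, one visible phrase is split across two text elements, which is why extractors have to reassemble words and lines from positions.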
If you are lucky enough to deal with Tagged PDF files (required by PDF/UA and by the "a" conformance levels of PDF/A), text extraction can be a lot easier because text spans are identified as such, and a mapping to Unicode characters is defined.
Wikipedia doesn't have the complete specification but does serve as an introduction: http://en.wikipedia.org/wiki/Portable_Document_Format#Text