How to compose syllable blocks with Hangul Jamo input

I'm working on a project that requires the input of old Hangul syllable blocks (i.e. Hangul syllable blocks that use obsolete characters such as ㆅ and ㅿ, located in the Hangul Compatibility Jamo Unicode block), but I've been having quite a difficult time displaying the blocks as whole blocks (like 룰) instead of a string of separated glyphs (like ᅘᆇᇈ). Apparently, the strings ㄱㅏㅁ, 가ㅁ, and 감 are equivalent to each other, but the "GSUB features" of Hangul fonts tie them together to varying extents. From what I've gathered, a similar process is applied to Hangul Jamo, in which the shape of the block is guessed from which vowel follows it (like the difference between the ㄱ in 구 and 기) and whether or not it has a final consonant (like the difference between the ㄱ in 가 and 갈).
I imagine that this is similar to how a combining diacritic works, where the renderer guesses the height difference between a capital Á and a minuscule á. Many Latin fonts don't support combining characters, and although ㄱㅏㅁ, 가ㅁ, and 감 are equal, in the end 감 is a precomposed character, and the whole purpose of the Hangul Jamo Unicode block is (according to the Wikipedia article on it) to "be used to dynamically compose syllables that are not available as precomposed Hangul syllables in Unicode, specifically archaic syllables containing sounds that have since merged phonetically with other sounds in modern pronunciation." This makes me wonder whether the Hangul Jamo behave more like spacing modifier characters that would need { EQ \o(X1,X2) } to be combined with their respective characters.
Most of what I've read speaks about font design and command-line tools, which makes it seem that the writers are doing rather more than just inputting obsolete characters into a word processor, and yet behold: https://github.com/adobe-fonts/source-han-sans/issues/34 . The poster and commenters are trying to figure out Hangul Jamo composition in vertical form, yet they have already composed syllable blocks horizontally in a word processor; how they did so is nowhere to be seen.

Although Unicode contains code points for obsolete or archaic jamo (like ㆅ and ㅿ), there are no precomposed forms with these characters, and they are excluded from the Unicode normalisation algorithm that composes/decomposes jamo into syllable blocks (although they do have compatibility decompositions into the regular Hangul jamo block, e.g. ㅾ U+317E to ᄶ U+1136).
That is, as far as Unicode is concerned these archaic jamo do not form Hangul syllable blocks at all.
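You can see this behaviour directly with Python's unicodedata module (a quick sketch):

    import unicodedata

    # Modern conjoining jamo compose into a precomposed syllable under NFC:
    modern = "\u1100\u1161\u11B7"   # CHOSEONG KIYEOK + JUNGSEONG A + JONGSEONG MIEUM
    print(unicodedata.normalize("NFC", modern))   # 감 (U+AC10)

    # Archaic jamo have no precomposed form, so NFC leaves them alone:
    archaic = "\u1158\u1161"        # CHOSEONG SSANGHIEUH + JUNGSEONG A
    nfc = unicodedata.normalize("NFC", archaic)
    print(nfc == archaic)           # True: no composition happened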
However, fonts may implement their own precomposed forms through the ccmp, ljmo, vjmo, and tjmo OpenType features for Hangul layout. Support for syllable composition using these features is up to the font designer and may go beyond what Unicode supports. Therefore, if you need support for syllable blocks that contain these jamo, you will need to find a font with such support.

It all depends on the font. Some fonts will give you the connected forms automatically; most will not. The only useful ones I have found so far are:
Malgun Gothic (comes with Windows 10)
Hayashi-Serif (downloadable for free)

The Hangul Compatibility Jamo (U+3130–U+318F) block doesn't have conjoining behavior. To get conjoining behavior, use Jamo from:
Hangul Jamo: U+1100–U+11FF
Hangul Jamo Extended-A: U+A960–U+A97F
Hangul Jamo Extended-B: U+D7B0–U+D7FF
In particular, the obsolete characters in the question are:
U+3185 ‹ㆅ› \N{HANGUL LETTER SSANGHIEUH}, which has a conjoining version at:
ᅘ U+1158 HANGUL CHOSEONG SSANGHIEUH (leading consonant)
U+317F ‹ㅿ› \N{HANGUL LETTER PANSIOS}, which has conjoining versions at:
ᅀ U+1140 HANGUL CHOSEONG PANSIOS (leading consonant)
ᇫ U+11EB HANGUL JONGSEONG PANSIOS (trailing consonant)
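To input the blocks, then, you enter the conjoining code points directly; whether they render as single blocks depends entirely on the font. A minimal sketch in Python (any language that can emit the raw code points works the same way):

    # Build archaic syllable blocks from conjoining jamo (L + V (+ T)).
    # A font with ljmo/vjmo/tjmo support will draw each as one block.
    ssanghieuh_a = "\u1158\u1161"              # ᅘ + ᅡ
    pansios_syllable = "\u1140\u1161\u11EB"    # ᅀ + ᅡ + ᇫ
    print(ssanghieuh_a, pansios_syllable)

    # The compatibility letters from the question map to the conjoining
    # (choseong) forms under compatibility normalisation:
    import unicodedata
    print(unicodedata.normalize("NFKD", "\u3185"))  # -> U+1158
    print(unicodedata.normalize("NFKD", "\u317F"))  # -> U+1140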

Related

What does LF in the Kotlin language specification stand for?

I have recently been learning Kotlin. While reading the Kotlin language specification, I ran into something I don't understand.
In section "1.2.1 Whitespace and comments", LF appears as below:
"LF: <unicode character Line Feed U+000A>"
So what is LF short for, and what is the "unicode character Line Feed U+000A"?
I think the answer is already in the question: unicode character Line Feed U+000A
As that implies, LF stands for Line Feed. It's a character, with Unicode value U+000A. It's used* to separate lines within text files — so is treated as whitespace, and is mentioned in the whitespace section of the docs.
There's plenty of information about that character on this site, on the official Unicode site, and on many other sites — all very easy to find with your favourite search engine.
(In fact, a web search is probably a much better place to start when learning a new language than from the language specification, which is highly technical and intended for people writing compilers and similar tools. Even better, start from the official docs — the ‘Basics’ section gives a quick intro, while the later sections go into full detail.)
(* Some platforms use only LF to separate lines; others use the CR character — carriage return, U+000D — followed by LF; and historically some used only CR.)
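A quick illustration (Python here, but '\n' denotes the same U+000A character in Kotlin):

    # LF is the '\n' escape and has code point U+000A:
    print("\n" == "\u000A")                  # True

    # Both LF and CR LF line endings count as line breaks:
    print("unix\nstyle".splitlines())        # ['unix', 'style']
    print("windows\r\nstyle".splitlines())   # ['windows', 'style']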

Generating PDF from scratch, how are glyphs mapped to character codes?

I want to generate a Portable Document Format (PDF) file with an original program of mine.
I am going to experiment with an original typesetting program, and in the course of development I want to avoid external tools and fonts as far as possible.
So it would be ideal to avoid using XeTeX, LuaTeX, and other engines.
And I want to store the glyph information internally in my program or my library.
But where should the character codes be specified in the PDF so that the viewer program knows what they are when they are copied or searched?
To generate glyphs, my naive approach is to save, in a local library, raster images or Bézier curve parameters that correspond to the characters.
According to the PDF Reference, that seems quite possible.
I do not care about kerning, ligatures, or other aesthetic virtues for my present purpose, or at least those can be dealt with later.
Initially, I thought I might generate PostScript and use Ghostscript to convert that to PDF.
But it has been pointed out here that PostScript does not support Unicode, which I will certainly use.
My options are then reduced to directly generating PDF from scratch.
My confusion is that, though my brute-force approach may render correctly, I guess the resulting PDF would be such that the viewer is unable to copy or search, since I would have specified nothing about the character codes.
In the PDF Reference, p. 122, we see that there are several different kinds of objects.
The relevant ones seem to be text objects, path objects, and image objects.
Is it possible to associate an image object with its character code?
As I recall, there are some scanned PDFs, for example the freely previewed parts of scanned Google Books, in which you can copy strings correctly.
What is the method or field that specifies that?
However, I think that in the various tables that follow in the PDF Reference, there is no suitable slot for a Unicode code point.
Similarly, it is not clear how to associate a path object with its character code.
If this can be done, the envisioned project would be easiest, since I could just extract some open-source fonts' Bézier curve parameters (I believe that can be done) and translate them myself into the PDF-allowed format.
If both image and path objects are unable to hold character codes, I conclude that a text object is (obviously) more suitable for representing a glyph together with its character code.
Maybe a more correct way would be to embed a custom font, synthesized at runtime, in the PDF.
This is mentioned verbally and briefly on p. 364, sec. 5.8, "Embedded Font Programs".
That does seem rather difficult and to require tremendous research.
I would like you to recommend some tutorials for embedding fonts, as they are not easy to find.
In fact, I find that exemplary PDF files are themselves already scarce, as most of them seem to come as LZ-compressed binary files (I guess).
Indeed, I tried to compile a "Hello world" PDF in a non-Computer-Modern font and open it with a text editor, and all I saw was blanks, control characters, and mojibake-like strings.
In summary, how do I (if possible) represent a glyph by a text object, image object, or path object so that its character code can be known?
For concreteness, can you generate a PDF such that a circle is shown, but when you copy it, you copy the character "A"?
The association between the curves and the character code is the font. There are several tables involved that do the mappings. The font has an Encoding vector, which is indexed by the character code and yields a glyph name. For copying out of the document, there must also be a ToUnicode vector which maps character codes to Unicode code points.
If you study a simple example of a PostScript Type 3 font, that should be very beneficial in understanding a PDF font. I have a short one in this calendar program.
To answer the bold question: if you convert gridcal.ps to PDF, copying the moon glyph results in the character 1, because it occupies the ASCII position of 1 in the Encoding vector. Some of the other glyphs, notably sun, mars, and venus, are recognized by Ghostscript, which produces a mapping to the Unicode character. This is very clever, but probably not sufficiently extensive to rely upon (indeed, moon, mercury, jupiter, and saturn are not recognized).
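To make that concrete, here is a minimal sketch (in Python, and not the gridcal example above; the object layout and file name are my own) that hand-writes a PDF containing a Type 3 font. Its single glyph paints a filled circle, /Encoding maps character code 65 to that glyph, and a /ToUnicode CMap maps the same code to U+0041, so copying the "circle" yields "A":

    buf = bytearray(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n")   # header + binary marker
    offsets = {}

    def add_obj(num, body):
        # Record the byte offset of this object for the cross-reference table.
        offsets[num] = len(buf)
        buf.extend(b"%d 0 obj\n" % num + body + b"\nendobj\n")

    def stream(data):
        return b"<< /Length %d >>\nstream\n" % len(data) + data + b"\nendstream"

    add_obj(1, b"<< /Type /Catalog /Pages 2 0 R >>")
    add_obj(2, b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>")
    add_obj(3, b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 300 300]"
               b" /Resources << /Font << /F1 5 0 R >> >> /Contents 4 0 R >>")

    # Page content: show the one-character string "A" (character code 65).
    add_obj(4, stream(b"BT /F1 100 Tf 100 100 Td (A) Tj ET"))

    # The Type 3 font: /Differences routes code 65 to the /circ glyph below,
    # while /ToUnicode (object 7) routes the same code to U+0041 for copying.
    add_obj(5, b"<< /Type /Font /Subtype /Type3"
               b" /FontBBox [0 0 1000 1000] /FontMatrix [0.001 0 0 0.001 0 0]"
               b" /CharProcs << /circ 6 0 R >>"
               b" /Encoding << /Type /Encoding /Differences [65 /circ] >>"
               b" /FirstChar 65 /LastChar 65 /Widths [1000] /ToUnicode 7 0 R >>")

    # Glyph procedure: a filled circle from four cubic Bezier arcs
    # (221 = 400 * 0.5523, the usual circle-approximation constant).
    add_obj(6, stream(b"1000 0 d0 900 500 m"
                      b" 900 721 721 900 500 900 c"
                      b" 279 900 100 721 100 500 c"
                      b" 100 279 279 100 500 100 c"
                      b" 721 100 900 279 900 500 c f"))

    # ToUnicode CMap: character code <41> extracts as U+0041 ("A").
    add_obj(7, stream(b"""/CIDInit /ProcSet findresource begin
    12 dict begin
    begincmap
    /CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
    /CMapName /Adobe-Identity-UCS def
    /CMapType 2 def
    1 begincodespacerange
    <00> <FF>
    endcodespacerange
    1 beginbfchar
    <41> <0041>
    endbfchar
    endcmap
    CMapName currentdict /CMap defineresource pop
    end
    end"""))

    # Cross-reference table and trailer (offsets 10 digits, entries 20 bytes).
    xref = len(buf)
    buf.extend(b"xref\n0 8\n0000000000 65535 f \n")
    for n in range(1, 8):
        buf.extend(b"%010d 00000 n \n" % offsets[n])
    buf.extend(b"trailer\n<< /Size 8 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % xref)

    with open("circle.pdf", "wb") as f:
        f.write(buf)

In a viewer that honours /ToUnicode (most do), selecting the circle and copying should produce the letter A; without object 7 and the /ToUnicode key, extraction would have to guess from the glyph name /circ, which means nothing to the standard mapping algorithm.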

PDF Copy Text Issue: Weird Characters

I tried to copy text from a PDF file but got some weird characters. Strangely, Okular can recognize the text, but Sumatra PDF and Adobe cannot; all three applications are installed on Windows 10 64-bit. To better explain my issue, here is a video: https://streamable.com/sw1hc. The "text layer workaround file" is one solution I got. Any help is greatly appreciated. Regards
In short: The (original) PDF does not contain the information required for regular text extraction as described in the PDF specification. Depending on the exact nature of your task, you might try to add the required information to the existing text objects and fonts or you might go for OCR.
Mapping character codes to Unicode as described in the PDF specification
The PDF specification ISO 32000-1 (and similarly ISO 32000-2, too) describes an algorithm for mapping character codes to Unicode values using information available directly inside the PDF.
It has been quoted very often in other stack overflow answers (see here, here, here, here, here, or here), so I won't quote it here again.
Essentially this is the algorithm used by Adobe Acrobat during copy&paste and also by many other text extractors.
In PDFs which don't contain the information required for text extraction, you eventually get to this point in the algorithm:
If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
What happens if the algorithm above fails to produce a Unicode value
This is where text extraction implementations differ: they try to determine the matching Unicode value using heuristics, information from beyond the PDF, or OCR applied to the glyph in question.
That the different programs you tried returned such different results shows that
your PDF does not contain the information required for the algorithm above from the PDF specification, and
the heuristics used by those programs differ considerably, with Okular's heuristics working best for your document.
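If you want to inspect a suspect file yourself, here is a small sketch using the pypdf library that lists which fonts carry a /ToUnicode CMap at all (it assumes the fonts sit in each page's own /Resources dictionary; resources can also be inherited):

    from pypdf import PdfReader

    reader = PdfReader("document.pdf")   # hypothetical file name
    for i, page in enumerate(reader.pages, start=1):
        resources = page.get("/Resources")
        if resources is None:
            continue
        fonts = resources.get_object().get("/Font")
        if not fonts:
            continue
        for name, ref in fonts.get_object().items():
            font = ref.get_object()
            status = "has" if "/ToUnicode" in font else "lacks"
            print(f"page {i}: font {name} {status} /ToUnicode")

A font that lacks /ToUnicode is not automatically unextractable (a standard /Encoding can suffice), but in files like yours it is the usual culprit.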
What to do in such a case
There are multiple options, more or less feasible depending on your concrete case:
Ask the source of the PDF for a version that contains proper information for text extraction.
Unless you have a contract with that source that requires them to supply the PDFs in a machine-readable form, or the source is otherwise obligated to do so, they usually will decline, though...
Apply OCR to the PDF in question.
Depending on the quality of the OCR software and the glyphs in the PDF, the results can be of a questionable quality; e.g. in your "PDF copy text issue-Text layer workaround.pdf" the header "Chapter 1: Derivative Securities" has been recognized as "Chapter1: Deratve Securites"...
You can try to interactively add manually created ToUnicode maps to the PDF, e.g. as described by Tilman Hausherr in his answer to "how to add unicode in truetype0font on pdfbox 2.0.0".
Depending on the number of different fonts you have to create the mappings for, this approach might easily require way too much time and effort...

Unused characters in the ANSI character set

I'm developing a small programming language together with an IDE.
The ANSI character set has a subset of unused characters. Here is the complete list: 0x7F, 0x81, 0x8D, 0x8F, 0x90, 0x9D.
I'd like to use some of them for invisible code markup, so I am wondering how they get printed in different environments. Can I assume they are always whitespace, or will some editors take the honor of replacing them with something like '?' or a grey rectangle?
Thank you, Dmitry
You seem to be talking about Windows-1252, which is just one of many "ANSI" code pages Windows can use, and it's probably not used outside of Windows. Don't tie a new product to an obsolete technology. Not supporting Unicode (be it UTF-16LE or UTF-8) is unacceptable for a programming language.
While it's rather moot to answer the direct question, the answer is no, you cannot assume they will be treated as whitespace. Some environments may. Some may replace them with a space. Some may replace them with another glyph. Some may use special colours. Some may give a warning. Some may not load the file at all.
By the way, if you are referring to Windows-1252, only 0x81, 0x8D, 0x8F, 0x90, 0x9D aren't assigned.
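You can check this quickly with Python's windows-1252 codec (note that 0x7F is the assigned DEL control character, so it decodes fine):

    for b in (0x7F, 0x81, 0x8D, 0x8F, 0x90, 0x9D):
        try:
            print(f"0x{b:02X} -> {bytes([b]).decode('cp1252')!r}")
        except UnicodeDecodeError:
            print(f"0x{b:02X} -> undefined in windows-1252")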
You shouldn't assume any specific behavior, as it will depend on the display widget and quite possibly on the font. Either preprocess the text for display or use an out-of-band markup mechanism (for example, many text field widgets let you attach attributes to runs of text).

I need a string that won't properly convert to ANSI using several code pages

My .NET library has to marshal strings to a C library that expects text encoded using the system's default ANSI code page. Since .NET supports Unicode, this makes it possible for users to pass the library a string that doesn't properly convert to ANSI. For example, on an English machine, "デスクトップ" will turn into "?????" when passed to the C library.
To address this, I wrote a method that detects when this will happen by comparing the original string to a string converted using the ANSI code page. I'd like to test this method, but I really need a string that's guaranteed not to be encodable. For example, we test our code on English and Japanese machines (among other languages). If I write the test to use the Japanese string above, the test will fail when the Japanese system properly encodes the string. I could write the test to check the current system's encoding, but then I have a maintenance nightmare every time we add or remove a language.
Is there a Unicode character that doesn't encode with any ANSI code page? Failing that, could a string be constructed with characters from enough different code pages to guarantee failure? My first attempt was to use Chinese characters, since we don't cover Chinese, but apparently Japanese code pages can convert the Chinese characters I tried.
Edit: I'm going to accept the answer that proposes a Georgian string for now, but I was really expecting a result with a smattering of characters from different languages. I don't know if we plan on supporting Georgian, so it seems OK for now. Now I have to test it on each language. Joy!
There are quite a few Unicode-only languages. Georgian is one of them. Here's the word 'English' in Georgian: ინგლისური
You can find more in the Georgian file (ka.xml) of the CLDR DB.
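The detection itself is just a round trip through the target code page; a sketch of the idea (in Python here, using the Georgian word above):

    def encodable(s: str, codepage: str) -> bool:
        """Return True if every character survives conversion to the code page."""
        try:
            s.encode(codepage)
            return True
        except UnicodeEncodeError:
            return False

    georgian = "ინგლისური"
    for cp in ("cp1252", "cp932", "cp936", "cp1251"):  # Western, Japanese, Chinese, Cyrillic
        print(cp, encodable(georgian, cp))             # False for each of these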
If by "ANSI" you mean Windows code pages, I am pretty sure that characters outside the BMP are not covered by any of them.
For instance, try some of the Byzantine Musical Symbols (U+1D000–U+1D0FF).
There are Windows code pages which cover all of Unicode (e.g. Cp1200, Cp12000, Cp65000, and Cp65001), so it's not always possible to create a string which is not convertible.
What do you mean by an 'ANSI code page'? On Windows, the code pages are Microsoft, not ANSI. ISO defines the 8859-x series of code sets; Microsoft has Windows code pages analogous to most of these.
Are you thinking of single-byte code sets? If so, you should look for Unicode characters in esoteric languages for which there is less likely to be a non-Unicode, single-byte code set.
You could look at languages such as Devanagari, Ol Chiki, Cherokee, or Ogham.