I'm developing a small programming language together with an IDE.
The ANSI character set leaves a subset of characters unused. Here is the complete list: 0x7F, 0x81, 0x8D, 0x8F, 0x90, 0x9D
I'd like to use some of them for an invisible code markup, so I am wondering how they get rendered in different environments. Can I assume they are always treated as whitespace, or will some editors take the honor of replacing them with something like '?' or a grey rectangle?
Thank you, Dmitry
You seem to be talking about Windows-1252, which is just one of many "ANSI" code pages Windows can use, and it's probably not used outside of Windows. Don't tie a new product to an obsolete technology. Not supporting Unicode (be it UTF-16le or UTF-8) is unacceptable for a programming language.
While answering the direct question is rather moot, the answer is no: you cannot assume they will be treated as whitespace. Some environments may. Some may replace them with a space. Some may replace them with another glyph. Some may use special colours. Some may give a warning. Some may not load the file.
By the way, if you are referring to Windows-1252, only 0x81, 0x8D, 0x8F, 0x90, 0x9D aren't assigned.
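You can verify that list from Python, whose cp1252 codec follows the Windows-1252 mapping and refuses unassigned bytes:

# Probe the upper control range of Windows-1252 for unassigned bytes.
undefined = []
for b in range(0x80, 0xA0):
    try:
        bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(hex(b))

print(undefined)  # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']

0x7F is not in the list because it is the ASCII DEL control character: assigned, though typically invisible.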
You shouldn't assume any specific behavior, as it will depend on the display widget and quite possibly on the font. Either preprocess the text for display or use an out-of-band markup mechanism (for example, many text field widgets let you attach attributes to runs of text).
I want to generate a Portable Document Format (PDF) file with an original program of mine.
I am going to experiment with an original typesetting program, and in the course of development I want to avoid external tools and fonts as far as possible.
So it would be ideal to avoid using XeTeX, LuaTeX, and other such engines.
And I want to store the glyph information internally in my program or my library.
But where in the PDF should the character codes be specified, so that the viewer program knows what characters are being copied or searched?
To generate glyphs, my naive approach is to save, in a local library, raster images or Bézier curve parameters that correspond to the characters.
According to the PDF Reference, that seems quite possible.
I do not care about kerning, ligatures, or other aesthetic virtues for my present purpose, or at least those can be dealt with later.
Initially, I thought I might generate PostScript and use Ghostscript to convert that to PDF.
But it is pointed out here that PostScript does not support Unicode, which I will certainly use.
My options are then reduced to generating PDF directly from scratch.
My confusion is that, though my brute-force approach may render correctly, I guess the resulting PDF would be such that the viewer is unable to copy or search text, since I would have specified the character codes nowhere.
In PDF Reference p.122, we see that there are several different objects.
What seems relevant are text objects, path objects, and image objects.
Is it possible to associate an image object with its character code?
As I recall, there are some scanned PDFs, for example the freely previewable parts of scanned Google Books, in which you can copy strings correctly.
What is the method or field specifying that?
However, I think that in the various tables of the PDF Reference, there is no suitable slot for a Unicode code point.
Similarly, it is not clear how to associate a path object with its character code.
If this can be done, the envisioned project would be easiest, since I could just extract some open-source fonts' Bézier curve parameters (I believe that can be done) and translate them myself into the format PDF allows.
If neither image objects nor path objects can hold character codes, I conclude that a text object is (obviously) more suitable for representing a glyph together with its character code.
Maybe a more correct way would be to embed a custom font, synthesized at runtime, in the PDF.
This is mentioned only briefly, and in prose, on p. 364, sec. 5.8, "Embedded Font Programs".
That does seem rather difficult and requires tremendous research.
I would appreciate recommendations for tutorials on embedding fonts; they are not easy to find.
In fact, I find that example PDF files are themselves already scarce, as most of them seem to come as compressed binary files (LZ-compressed, I guess).
Indeed, when I compile a "Hello world" PDF in a non-Computer-Modern font and open it with a text editor, all I see is blanks, control characters, and mojibake-like strings.
In summary, how do I (if possible) represent a glyph by a text object, image object, or path object so that its character code can be known?
For concreteness, can you generate a PDF that shows a circle, but when you copy that circle, you copy the character "A"?
The association between the curves and the character code is the font. There are several tables involved that do the mappings. The font has an Encoding vector, which is indexed by the character code and yields a glyph name. For copying out of the document, there must also be a ToUnicode CMap, which maps character codes to Unicode code points.
If you study a simple example of a PostScript Type 3 font, that should be very beneficial in understanding a PDF font. I have a short one in this calendar program.
To answer the concrete question at the end: if you convert gridcal.ps to PDF, copying the moon glyph results in the character 1, because it is in the ASCII position for 1 in the Encoding vector. Some of the other glyphs, notably sun, mars, and venus, are recognized by Ghostscript, which produces a mapping to the corresponding Unicode character. This is very clever, but probably not sufficiently extensive to rely upon (indeed, moon, mercury, jupiter and saturn are not recognized).
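To make that concrete, here is a minimal Python sketch that writes, from scratch, an uncompressed PDF containing a runtime-synthesized Type 3 font: the page shows a filled circle, but because the font's Encoding maps character code 65 to the circle glyph and its ToUnicode CMap maps code 65 to U+0041, copying the circle yields the character "A". (This is an illustrative sketch, not production code; the output file name and coordinates are arbitrary.)

def obj(num, body):
    """Wrap an object body in 'N 0 obj ... endobj'."""
    return b"%d 0 obj\n" % num + body + b"\nendobj\n"

def stream(data):
    """Wrap raw data in a stream dictionary with the correct /Length."""
    return b"<< /Length %d >>\nstream\n" % len(data) + data + b"\nendstream"

# Type 3 glyph procedure: set the advance width (d0), then fill a circle
# built from four Bezier segments in the 1000-unit glyph space.
circle = (b"1000 0 d0\n"
          b"500 900 m\n"
          b"721 900 900 721 900 500 c\n"
          b"900 279 721 100 500 100 c\n"
          b"279 100 100 279 100 500 c\n"
          b"100 721 279 900 500 900 c\n"
          b"f")

# ToUnicode CMap: character code 0x41 copies/searches as U+0041 ("A").
tounicode = b"""/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
1 beginbfchar
<41> <0041>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end"""

# Page content: a text object showing (A); code 65 selects /circle.
content = b"BT /F1 144 Tf 200 400 Td (A) Tj ET"

bodies = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
    b"/Resources << /Font << /F1 4 0 R >> >> /Contents 7 0 R >>",
    # The font: /Encoding maps code 65 to /circle, /ToUnicode maps 65 to "A".
    b"<< /Type /Font /Subtype /Type3 /FontBBox [0 0 1000 1000] "
    b"/FontMatrix [0.001 0 0 0.001 0 0] /CharProcs << /circle 5 0 R >> "
    b"/Encoding << /Type /Encoding /Differences [65 /circle] >> "
    b"/FirstChar 65 /LastChar 65 /Widths [1000] /ToUnicode 6 0 R >>",
    stream(circle),
    stream(tounicode),
    stream(content),
]

pdf = b"%PDF-1.4\n"
offsets = []
for num, body in enumerate(bodies, start=1):
    offsets.append(len(pdf))
    pdf += obj(num, body)

# Cross-reference table and trailer (xref entries are exactly 20 bytes each).
xref_pos = len(pdf)
pdf += b"xref\n0 %d\n0000000000 65535 f \n" % (len(bodies) + 1)
for off in offsets:
    pdf += b"%010d 00000 n \n" % off
pdf += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
        % (len(bodies) + 1, xref_pos))

with open("circle_is_A.pdf", "wb") as f:
    f.write(pdf)

Every piece the question asks about appears here: the path object lives inside the glyph procedure (object 5), the text object in the page content selects it by character code, and the ToUnicode stream supplies the copy/search mapping.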
I want to make a PDF available on my website, but I want to prevent automatic parsing by bots that might not respect the normal PDF "security". The reason is that this document is also commercially published, and I am allowed to share it for "personal use" but must not make it widely available that way. I originally created the PDF from Word.
I have tried using Ghostscript with the -dNoOutputFonts option to convert text to glyphs, but the result is ridiculously big (from 2.5 MB to 180 MB). Scrambling the text encoding seems a good option, but I have barely found any posts discussing this. There seems to be a commercial solution, but I was unable to find a way to do this using, e.g., Ghostscript or qpdf. Any suggestions on how to achieve this (or alternative solutions)?
Operating system: Windows 10 64bit
Available versions of Ghostscript: 9.18, 9.27
Simple example PDF
Well, that's the advantage of fonts, you only have to describe each character once. Convert to outlines and you need to describe it every time, so yeah, much bigger.
Ghostscript's pdfwrite device goes to considerable effort to try and make text searchable, because in general people shout at us when a 'searchable' file becomes 'non-searchable'. So (amongst other things) it preserves any ToUnicode CMaps in the input file. To prevent simple indexing you need to avoid that. You haven't linked to a PDF file so I can't test this, but....
There are three places you need to edit:
/ghostpdl/Resource/Init/gs_pdfwr.ps, line 642, change:
/WantsToUnicode /GetDeviceParam .special_op {
exch pop
}{
//true
}ifelse
To:
//false
In the same file, at line 982, change:
/WantsToUnicode /GetDeviceParam .special_op {
exch pop
}{
//false
}ifelse
To:
//false
Then in /ghostpdl/Resource/Init/pdf_font.ps, line 614, change:
/WantsToUnicode /GetDeviceParam .special_op {
exch pop
}{
//false
}ifelse
To:
//false
That should prevent any ToUnicode information in the input file making it through to the output file. Depending on the operating system you are using, and the way Ghostscript has been built (you haven't said), you may need to tell Ghostscript to include that directory in its search path, which you do with -I/ghostpdl/Resource/Init.
You should also set -dSubsetFonts=true; that will emit all fonts as subsets. I think that's the default, but I can't immediately recall, and it does no harm to set it. That means the first glyph that is encountered is encoded at index 1, the second at index 2, etc. So Hello World becomes 0x01, 0x02, 0x03, 0x03, 0x04, 0x05, 0x06, 0x04, 0x07, 0x03, 0x08. The ordering will be consistent throughout the file (obviously) but different for every font in the file and for every file. That should be adequately scrambled, I'd have thought. It certainly won't be possible to search/copy/paste trivially.
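The numbering scheme just described is easy to see in a Python sketch (purely illustrative; Ghostscript does this internally during subsetting):

# First-encounter glyph numbering: each distinct character gets the next
# free subset index the first time it appears.
def subset_codes(text):
    mapping = {}   # character -> subset index, in order of first use
    codes = []
    for ch in text:
        if ch not in mapping:
            mapping[ch] = len(mapping) + 1
        codes.append(mapping[ch])
    return codes

print([hex(c) for c in subset_codes("Hello World")])
# ['0x1', '0x2', '0x3', '0x3', '0x4', '0x5', '0x6', '0x4', '0x7', '0x3', '0x8']

Here the space is treated like any other character, which matches the sequence given above.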
If you make an example file available I can test it.
Oh, it also just occurred to me that you might be able to get the same effect by using the ps2write device to create a PostScript file, then using the pdfwrite device to convert that back to PDF. The ps2write device can't embed ToUnicode CMaps, because there's no standard support in PostScript for that. Of course, it also means the content drops back to PostScript, which may result in other, unacceptable, quality/size changes.
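For the record, that round trip would look something like this on the command line (file names are placeholders; on Windows the executable is gswin64c rather than gs):

gs -sDEVICE=ps2write -o intermediate.ps input.pdf
gs -sDEVICE=pdfwrite -o output.pdf intermediate.ps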
I'm working on a project that would require the input of old Hangul syllable blocks (i.e. Hangul syllable blocks that would use obsolete characters such as ㆅ and ㅿ, located in the Hangul Compatibility Jamo Unicode block), but I've been having quite a difficult time displaying the blocks as whole blocks (like 룰) instead of a string of separated glyphs (like ᅘᆇᇈ). Apparently, the strings ㄱㅏㅁ, 가ㅁ, and 감 are equivalent to each other, but the "GSUB features" of Hangul fonts tie them together to varying extents. From what I've gathered, a similar process is applied to Hangul Jamo, in which the shaping of the block is guessed from which vowel follows (like the difference between the ㄱ in 구 and 기) and whether or not it has a final consonant (like the difference between the ㄱ in 가 and 갈).
I imagine that this is similar to how a combining diacritic works, where the renderer guesses the height difference between a capital Á and a minuscule á. Many Latin fonts don't support combining characters, and although ㄱㅏㅁ, 가ㅁ, and 감 are equal, in the end 감 is a precomposed character, and the whole purpose of the Hangul Jamo Unicode block is (according to the Wikipedia article on it) to "be used to dynamically compose syllables that are not available as precomposed Hangul syllables in Unicode, specifically archaic syllables containing sounds that have since merged phonetically with other sounds in modern pronunciation." This makes me wonder if the Hangul Jamo behave more like spacing modifier characters that would need { EQ \o(X1,X2) } to be combined with their respective characters.
Most of what I've read speaks about font design and command lines, which makes it seem that the writer is doing a bit more than just inputting obsolete characters in a word processor, and yet behold: https://github.com/adobe-fonts/source-han-sans/issues/34 . The poster and commenters are trying to figure out Hangul Jamo composition in vertical form, yet they have already composed syllable blocks horizontally in a word processor; how they did that is nowhere to be seen.
Although Unicode contains code points for obsolete or archaic jamo (like ㆅ and ㅿ), there are no precomposed forms with these characters, and they are excluded from the Unicode normalisation algorithm that composes/decomposes jamo into syllable blocks (although they do have compatibility decompositions into the regular Hangul jamo block, e.g. ㅾ U+317E to ᄶ U+1136).
That is, as far as Unicode is concerned these archaic jamo do not form Hangul syllable blocks at all.
However, fonts may implement their own precomposed forms through the ccmp, ljmo, vjmo, and tjmo OpenType features for Hangul layout. Support for syllable composition using these features is up to the font designer and may go beyond what Unicode supports. Therefore, if you need support for syllable blocks that contain these jamo, you will need to find a font with such support.
It all depends on the font. Some fonts will give you the connected forms automatically, most will not. The only useful ones that I found so far are:
Malgun Gothic (comes with Windows 10)
Hayashi-Serif (downloadable for free)
The Hangul Compatibility Jamo (U+3130–U+318F) block doesn't have conjoining behavior. To get conjoining behavior, use Jamo from:
Hangul Jamo: U+1100–U+11FF
Hangul Jamo Extended-A: U+A960–U+A97F
Hangul Jamo Extended-B: U+D7B0–U+D7FF
In particular, the obsolete characters in the question are:
U+3185 ‹ㆅ› \N{HANGUL LETTER SSANGHIEUH}, which has a conjoining version at:
ᅘ U+1158 HANGUL CHOSEONG SSANGHIEUH (leading consonant)
U+317F ‹ㅿ› \N{HANGUL LETTER PANSIOS}, which has conjoining versions at:
ᅀ U+1140 HANGUL CHOSEONG PANSIOS (leading consonant)
ᇫ U+11EB HANGUL JONGSEONG PANSIOS (trailing consonant)
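A small Python sketch showing the relationship between the two blocks (whether the last line renders as a single block depends entirely on your font's ljmo/vjmo/tjmo support):

import unicodedata

compat = "\u3185"    # ㆅ HANGUL LETTER SSANGHIEUH (compatibility block)
conjoin = "\u1158"   # ᅘ HANGUL CHOSEONG SSANGHIEUH (conjoining block)

print(unicodedata.name(compat))   # HANGUL LETTER SSANGHIEUH
print(unicodedata.name(conjoin))  # HANGUL CHOSEONG SSANGHIEUH

# NFKD maps the compatibility jamo to its conjoining counterpart:
print(unicodedata.normalize("NFKD", compat) == conjoin)  # True

# An archaic syllable spelled with conjoining jamo
# (SSANGHIEUH + A + PANSIOS); the code point sequence is always valid,
# only its rendering as one block is font-dependent.
archaic = "\u1158\u1161\u11EB"
print(archaic)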
We're developing an Eclipse plugin.
When the user types <= I would like to display the left-arrow Unicode character ⇐ instead.
The file on disk still needs to contain the original "less than, equals" characters, because that is what the programming language prescribes.
In other contexts, in the same editor, I might want to display the same character sequence <= as the Unicode less-than-or-equal sign ≤. This would help the user understand how the compiler interprets the sequence <=, depending on the context. Again, the document (and file) should not be changed, only the way we display it.
What is the easiest way to do this? Note that we're already on xText, so we use the editor provided by xText.
Eclipse text editors normally use an object implementing IDocument (and usually also the numerous IDocumentExtensionXX interfaces), often by extending the AbstractDocument class.
This document class provides the text that the editor displays and is updated with the changes the user makes, so it should be able to manage converting between the file and display representations.
How do you change the name of the font embedded in a .ttf file? (It's for a device that expects a hard-coded font name, which I'd like to swap with another, more readable, openly licensed font.)
I'd prefer a method which I can implement myself rather than installing a program.
TrueType is a pretty complex binary data format -- the kind that takes an entire book-length spec to describe. I've worked with it in the distant past.
There are specialized tools that can edit fonts, including metadata like names. I would not recommend trying to mess with the binary data in a font file without such a tool. There might be libraries available that you could call to manipulate TrueType data; if one exists, I would guess Python would be the most likely language to find it in, because there's a long correlation between font hackers and Python (Guido van Rossum's brother is a well-known typographer).
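Such a library does in fact exist: fontTools. A minimal sketch of a rename (the file paths and the new name are placeholders; note that the PostScript name, nameID 6, must not contain spaces):

from fontTools.ttLib import TTFont

font = TTFont("original.ttf")  # placeholder input path
name_table = font["name"]

# Name IDs that carry the font's name: 1 = family name, 3 = unique ID,
# 4 = full name, 6 = PostScript name.
for record in name_table.names:
    if record.nameID in (1, 3, 4, 6):
        name_table.setName("NewName", record.nameID, record.platformID,
                           record.platEncID, record.langID)

font.save("renamed.ttf")       # placeholder output path

The doubled copies of the name that show up in a hex editor (one plain, one with 0x00 between the letters) are the same records stored once in a single-byte encoding for Macintosh-platform entries and once in UTF-16BE for Windows-platform entries; fontTools updates both.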
This may be useful only in very specific situations, but should you need to change a font's name to something else of the exact same length, you can do so in a hex editor (e.g. Okteta). Find all the instances of the name, and then edit them to be the new name. I found there were two copies of the name in each place: one that's normal, and another with 0x00 in between each letter (the latter is the UTF-16 form used by the Windows-platform name records).
The only evidence I have that this actually works is empirical with a sample size of 1.