What does LF in the Kotlin language specification stand for? - kotlin

I have recently been learning Kotlin. While reading the Kotlin language specification, I ran into something I don't understand: what does LF stand for?
In section "1.2.1 Whitespace and comments", LF appears as follows.
"LF: <unicode character Line Feed U+000A>"
So what is LF short for, and what is the "unicode character Line Feed U+000A"?

I think the answer is already in the question: unicode character Line Feed U+000A
As that implies, LF stands for Line Feed. It's a character, with Unicode value U+000A. It's used* to separate lines within text files — so is treated as whitespace, and is mentioned in the whitespace section of the docs.
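If it helps to see it from Kotlin itself, here is a minimal sketch showing that the familiar '\n' escape is exactly that character (the string contents are just an example):

```kotlin
fun main() {
    // '\n' is the LF (Line Feed) character, U+000A.
    println('\n' == '\u000A')   // true
    println('\n'.code)          // 10, i.e. 0x0A

    // LF separates lines, so it is treated as whitespace in source files
    // and acts as a line break inside strings.
    val lines = "first line\nsecond line".split('\n')
    println(lines)              // [first line, second line]
}
```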
There's plenty of information about that character on this site, on the official Unicode site, and on many other sites — all very easy to find with your favourite search engine.
(In fact, a web search is probably a much better place to start when learning a new language than from the language specification, which is highly technical and intended for people writing compilers and similar tools. Even better, start from the official docs — the ‘Basics’ section gives a quick intro, while the later sections go into full detail.)
(* Some platforms use only LF to separate lines; others use the CR character — carriage return, U+000D — followed by LF; and historically some used only CR.)

Related

PDF Copy Text Issue: Weird Characters

I tried to copy text from a PDF file but got some weird characters. Strangely, Okular can recognize the text, but Sumatra PDF and Adobe cannot; all three applications are installed on Windows 10 64-bit. To better explain my issue, here is a video: https://streamable.com/sw1hc. The "text layer workaround file" is one solution I got. Any help is greatly appreciated. Regards
In short: The (original) PDF does not contain the information required for regular text extraction as described in the PDF specification. Depending on the exact nature of your task, you might try to add the required information to the existing text objects and fonts or you might go for OCR.
Mapping character codes to Unicode as described in the PDF specification
The PDF specification ISO 32000-1 (and similarly ISO 32000-2, too) describes an algorithm for mapping character codes to Unicode values using information available directly inside the PDF.
It has been quoted very often in other stack overflow answers (see here, here, here, here, here, or here), so I won't quote it here again.
Essentially this is the algorithm used by Adobe Acrobat during copy&paste and also by many other text extractors.
In PDFs which don't contain the information required for text extraction, you eventually get to this point in the algorithm:
If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
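If you want to see what a spec-following extractor makes of your file, one option is to run it through Apache PDFBox, the library behind many Java-based extractors. A minimal Kotlin sketch, assuming PDFBox 2.x on the classpath and a placeholder file name:

```kotlin
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import java.io.File

fun main() {
    // "input.pdf" is a placeholder; point it at the problematic file.
    PDDocument.load(File("input.pdf")).use { doc ->
        val text = PDFTextStripper().getText(doc)
        // If the PDF lacks usable ToUnicode CMaps or encoding entries, this
        // output will contain the same garbage you see when copying & pasting.
        println(text)
    }
}
```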
What happens if the algorithm above fails to produce a Unicode value
This is where text extraction implementations differ: they try to determine the matching Unicode value using heuristics, information from beyond the PDF, or OCR applied to the glyph in question.
That the different programs you tried returned such different results shows that
your PDF does not contain the information required for the algorithm above from the PDF specification and
the heuristics used by those programs differ considerably, and Okular's heuristics work best for your document.
What to do in such a case
There are multiple options, more or less feasible depending on your concrete case:
Ask the source of the PDF for a version that contains proper information for text extraction.
Unless you have a contract with that source that requires them to supply the PDFs in a machine-readable form, or the source is otherwise obligated to do so, they will usually decline, though...
Apply OCR to the PDF in question.
Depending on the quality of the OCR software and the glyphs in the PDF, the results can be of a questionable quality; e.g. in your "PDF copy text issue-Text layer workaround.pdf" the header "Chapter 1: Derivative Securities" has been recognized as "Chapter1: Deratve Securites"...
You can try to interactively add manually created ToUnicode maps to the PDF, e.g. as described by Tilman Hausherr in his answer to "how to add unicode in truetype0font on pdfbox 2.0.0".
Depending on the number of different fonts you have to create the mappings for, this approach might easily require way too much time and effort...

How to compose syllable blocks with Hangul Jamo

I'm working on a project that requires the input of old Hangul syllable blocks (i.e. Hangul syllable blocks that use obsolete characters such as ㆅ and ㅿ, located in the Hangul Compatibility Jamo Unicode block), but I've been having quite a difficult time displaying the blocks as whole blocks (like 룰) instead of a string of separated glyphs (like ᅘᆇᇈ). Apparently, the strings ㄱㅏㅁ, 가ㅁ, and 감 are equivalent to each other, but the "GSUB features" of Hangul fonts tie them together to varying extents. From what I've gathered, a similar process is applied to Hangul Jamo, in which the shaping of a block is guessed from which vowel follows it (like the difference between the ㄱ in 구 and 기) and whether or not it has a final consonant (like the difference between the ㄱ in 가 and 갈).
I imagine that this is similar to how a combining diacritic works, where the renderer guesses the height difference between a capital Á and a minuscule á. Many Latin fonts don't support combining characters, and although ㄱㅏㅁ, 가ㅁ, and 감 are equal, 감 is in the end a precomposed character, while the whole purpose of the Hangul Jamo Unicode block is (according to the Wikipedia article on it) to "be used to dynamically compose syllables that are not available as precomposed Hangul syllables in Unicode, specifically archaic syllables containing sounds that have since merged phonetically with other sounds in modern pronunciation." This makes me wonder whether the Hangul Jamo behave more like spacing modifier characters that would need { EQ \o(X1,X2) } to be combined with their respective character.
Most of what I've read speaks about font design and command lines, which makes it seem that the writers are doing a bit more than just inputting obsolete characters in a word processor, and yet behold: https://github.com/adobe-fonts/source-han-sans/issues/34 . The poster and commenters are trying to figure out Hangul Jamo composition in vertical form, yet they have already composed syllable blocks horizontally in a word processor; how they did that is nowhere to be seen.
Although Unicode contains code points for obsolete or archaic jamo (like ㆅ and ㅿ), there are no precomposed forms with these characters, and they are excluded from the Unicode normalisation algorithm that composes/decomposes jamo into syllable blocks (although they do have compatibility decompositions into the regular Hangul jamo block, e.g. ㅾ U+317E to ᄶ U+1136).
That is, as far as Unicode is concerned these archaic jamo do not form Hangul syllable blocks at all.
However, fonts may implement their own precomposed forms through the ccmp, ljmo, vjmo, and tjmo OpenType features for Hangul layout. Support for syllable composition using these features is up to the font designer and may go beyond what Unicode supports. Therefore, if you need support for syllable blocks that contain these jamo, you will need to find a font with such support.
It all depends on the font. Some fonts will give you the connected forms automatically, most will not. The only useful ones that I found so far are:
Malgun Gothic (comes with Windows 10)
Hayashi-Serif (downloadable for free)
The Hangul Compatibility Jamo (U+3130–U+318F) block doesn't have conjoining behavior. To get conjoining behavior, use Jamo from:
Hangul Jamo: U+1100–U+11FF
Hangul Jamo Extended-A: U+A960–U+A97F
Hangul Jamo Extended-B: U+D7B0–U+D7FF
In particular, the obsolete characters in the question are:
U+3185 ‹ㆅ› \N{HANGUL LETTER SSANGHIEUH}, which has a conjoining version at:
ᅘ U+1158 HANGUL CHOSEONG SSANGHIEUH (leading consonant)
U+317F ‹ㅿ› \N{HANGUL LETTER PANSIOS}, which has conjoining versions at:
ᅀ U+1140 HANGUL CHOSEONG PANSIOS (leading consonant)
ᇫ U+11EB HANGUL JONGSEONG PANSIOS (trailing consonant)
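On the JVM you can see this difference directly: Unicode normalisation (NFC) composes modern conjoining jamo into a precomposed syllable, but leaves archaic and compatibility jamo alone. A minimal Kotlin sketch (the example syllables are arbitrary):

```kotlin
import java.text.Normalizer

fun main() {
    // Modern conjoining jamo (U+1100 block): NFC composes them into the
    // precomposed syllable 한 (U+D55C), which exists in Hangul Syllables.
    val modern = "\u1112\u1161\u11AB"            // ᄒ + ᅡ + ᆫ
    println(Normalizer.normalize(modern, Normalizer.Form.NFC))   // 한

    // Archaic jamo such as PANSIOS (U+1140) have no precomposed syllable, so
    // NFC leaves the sequence as separate code points; whether it renders as
    // one block depends entirely on the font's ljmo/vjmo/tjmo support.
    val archaic = "\u1140\u1161\u11AB"
    val nfc = Normalizer.normalize(archaic, Normalizer.Form.NFC)
    println(nfc.codePoints().toArray().joinToString(" ") { "U+%04X".format(it) })

    // Compatibility jamo (U+3130 block) never conjoin at all.
    val compat = "\u314E\u314F\u3134"            // ㅎ ㅏ ㄴ
    println(Normalizer.normalize(compat, Normalizer.Form.NFC) == compat)  // true
}
```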

I need a string that won't properly convert to ANSI using several code pages

My .NET library has to marshal strings to a C library that expects text encoded using the system's default ANSI code page. Since .NET supports Unicode, this makes it possible for users to pass a string to the library that doesn't properly convert to ANSI. For example, on an English machine, "デスクトップ" will turn into "?????" when passed to the C library.
To address this, I wrote a method that detects when this will happen by comparing the original string to a string converted using the ANSI code page. I'd like to test this method, but I really need a string that's guaranteed not to be encodable. For example, we test our code on English and Japanese machines (among other languages). If I write the test to use the Japanese string above, the test will fail when the Japanese system properly encodes the string. I could write the test to check the current system's encoding, but then I have a maintenance nightmare every time we add or remove a language.
Is there a Unicode character that doesn't encode with any ANSI code page? Failing that, could a string be constructed with characters from enough different code pages to guarantee failure? My first attempt was to use Chinese characters, since we don't cover Chinese, but apparently the Japanese code page can convert the Chinese characters I tried.
Edit: I'm going to accept the answer that proposes a Georgian string for now, but I was really expecting a result with a smattering of characters from different languages. I don't know if we plan on supporting Georgian, so it seems OK for now. Now I have to test it on each language. Joy!
There are quite a few Unicode-only languages. Georgian is one of them. Here's the word 'English' in Georgian: ინგლისური
You can find more in Georgian file (ka.xml) of the CLDR DB.
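The question is about .NET, but the same round-trip check can be sketched on the JVM in Kotlin, assuming the usual extended charsets are available; `encodesLosslessly` and the code-page names below are just illustrative choices:

```kotlin
import java.nio.charset.Charset

// Illustrative helper: true if `s` survives an encode/decode round trip in
// the given charset, i.e. no characters were replaced with '?' or similar.
fun encodesLosslessly(s: String, charsetName: String): Boolean {
    val cs = Charset.forName(charsetName)
    return String(s.toByteArray(cs), cs) == s
}

fun main() {
    val georgian = "ინგლისური"   // 'English' in Georgian, from the answer above
    // The Georgian script has no legacy single-byte Windows code page,
    // so the round trip fails for all of these.
    for (name in listOf("windows-1252", "windows-1251", "Shift_JIS")) {
        println("$name -> ${encodesLosslessly(georgian, name)}")   // false
    }
}
```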
If by "ANSI" you mean Windows code pages, I am pretty sure the characters out of BMP are not covered by any Windows code pages.
For instance, try some of Byzantine Musical Symbols
There are Windows code pages that cover all Unicode characters (e.g. Cp1200, Cp12000, Cp65000 and Cp65001), so it's not always possible to create a string that is not convertible.
What do you mean by an 'ANSI code page'? On Windows, the code pages are Microsoft, not ANSI. ISO defines the 8859-x series of code sets; Microsoft has Windows code pages analogous to most of these.
Are you thinking of single-byte code sets? If so, you should look for Unicode characters in esoteric languages for which there is less likely to be a non-Unicode, single-byte code set.
You could look at languages such as: Devanagari, Ol Chiki, Cherokee, Ogham.

Autodocumentation type functionality for Fortran?

In the past I've used Doxygen for C and C++, but now I've been thrown onto a Fortran project and I would like to get a quick, all-encompassing look at the architecture.
In the past I've found reverse engineering tools to be useful where no documentation of the architecture exists.
So, is there a tool out there that will reverse engineer Fortran code?
I tried to use Doxygen, but didn't have any luck. I will be working with two different projects - one in Fortran 90, and one that I think is in Fortran 77.
Thanks for any insights and feedback.
Tools which may help with reverse engineering:
SciTools Understand
Link with some more tools (search "fortran")
Also, maybe some of these unit testing frameworks will be helpful (I haven't used them, so I cannot comment on the pros and cons of any of them):
FUnit
FRUIT
Ftnunit
(these links link to fortranwiki, where you can find a tidbit on every one of them, and from there there are links to their home sites).
Doxygen 1.6.1 will generate documentation, call graphs, etc. for Fortran source code in free (F90) format. You are out of luck for auto-documenting fixed-format (F77) code with doxygen.
All is not lost, however. The conversion from fixed to free format is straightforward and can be automated to a great degree - change comment characters to '!', change continuation characters to '&', and append '&' to lines to be continued. In fact, if the appended continuation character is placed in column 73, it should be ignored by standard F77 compilers (which still only recognize code in columns 1 through 72) but will be recognized by F9x/F2003/F2008 compilers. This allows the same code to be recognized as both in fixed and free format, which lets you gracefully migrate from one format to the other.
Conveniently, there are about a thousand small programs that will do this format adjustment to some degree or another. Realistically, if you're going to be maintaining the code, you might as well move it away from the 1928 spec for Hollerith (IBM) punched cards. :)
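Purely for illustration, here is a rough Kotlin sketch of the comment/continuation rewrite described above. The file name is a placeholder, and it ignores corner cases (sequence numbers in columns 73-80, 'D' debug lines, tabs, and so on), so a dedicated converter is still the better choice for real code:

```kotlin
import java.io.File

fun fixedToFree(lines: List<String>): List<String> {
    val out = mutableListOf<String>()
    for (raw in lines) {
        var line = raw
        if (line.isNotEmpty() && line[0] in listOf('C', 'c', '*')) {
            // Fixed-format comment line: column 1 marker becomes '!'.
            line = "!" + line.substring(1)
        } else if (line.length > 6 && line[5] != ' ' && line[5] != '0') {
            // Continuation line (non-blank, non-zero column 6): append '&'
            // to the previous line and start this one with '&'.
            if (out.isNotEmpty()) out[out.size - 1] = out.last() + " &"
            line = "      & " + line.substring(6)
        }
        out += line
    }
    return out
}

fun main() {
    // "legacy.f" is a placeholder fixed-format input file.
    fixedToFree(File("legacy.f").readLines()).forEach(::println)
}
```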

Which is it Perl or perl, TIF or TIFF, ant or Ant, ClearCase or Clear Case?

In one sentence I have managed to create 16 possible variations on how I present information. Does it matter as long as the context is clear? Do any common mistakes irritate you?
regarding Perl: How should I capitalize Perl?
TIFF stands for Tagged Image File Format, whereas the extension of files using that format is often ".tif".
That is for the purpose of compatibility with 8.3 filenames, I believe.
I generally like the Perl way of capitalizing when used as a proper noun, but lowercasing when referring to the command itself (assuming the command is lowercase to begin with).
Well, Perl and TIFF have already been answered, so I'll add the last two
the Apache Foundation writes "Apache Ant".
Rational ClearCase (or sometimes "IBM Rational ClearCase") is written as such at its web site.
Even though Perl was originally an acronym for Practical Extraction and Report Language, it is written Perl.
These things don't 'bother' me as much as they provide insights into the level of knowledge of the speaker/author. You see, we work in an industry that requires precision, so precision in language does matter, as it affects the understanding of the consumer.
The one that really seems to bother me is when people fully upper case JAVA as though it was an acronym.