Looking for a test string with characters from all languages

For I18n testing, I'm looking for a test string that has a good representation of all commonly used languages (as supported by UTF-8) and includes the special characters of those languages that typically have display issues.
We will use this test string to make sure that our system processes these languages correctly and has fonts that can display all of them properly.
E.g. the sample text should have characters from Latin-script languages, Far East languages, right-to-left languages...

There is no clear answer to your question, as it is full of ambiguous terms, for instance "commonly used languages" or "normally have issues in display". This is highly dependent on the OS, the OS version, the text engine used to display the text, and the fonts installed. Pretty much the whole tech stack.
Sprinkling "all" through the question (all the special chars, all ... languages) makes any answer useless.
You would be looking at a string of tens of thousands of characters. Then you have a lot of combining marks and ligatures. Do you want to check all of those combinations too? Those might also have "issues in display".
If all you want to do is check that your application works in (most) languages, try taking some (not all) characters from each Unicode block. You might also want to avoid historical scripts (e.g. cuneiform, Egyptian hieroglyphs) that are not covered by common fonts.
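If you go that route, here is a minimal sketch of the sampling approach (in Swift; the block ranges and the sampling stride are arbitrary choices for illustration, not a standard list):

    // Sample a few code points from several Unicode blocks instead of
    // trying to cover "all" characters. Ranges and stride are arbitrary.
    let blocks: [ClosedRange<UInt32>] = [
        0x0041...0x005A, // Basic Latin (A-Z)
        0x00C0...0x00FF, // Latin-1 Supplement (accented Latin)
        0x0391...0x03C9, // Greek
        0x0410...0x044F, // Cyrillic
        0x05D0...0x05EA, // Hebrew (right-to-left)
        0x0627...0x064A, // Arabic (right-to-left, cursive joining)
        0x0E01...0x0E2E, // Thai (no spaces between words)
        0x3041...0x3096, // Hiragana
        0x4E00...0x4E80, // CJK Unified Ideographs (a tiny slice)
    ]

    var sample = ""
    for block in blocks {
        for value in stride(from: block.lowerBound, through: block.upperBound, by: 9) {
            if let scalar = Unicode.Scalar(value) {
                sample.unicodeScalars.append(scalar)
            }
        }
    }
    print(sample)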
In general, if your application does not corrupt the string somehow, the string will render properly. And if it does not, then it is not your app at fault; it is some limitation in the underlying technology (e.g. the Windows console).
If you explain what you are trying to do, you might get a better answer.
Or you can just search for internationalization testing.

Related

ANTLR and PL/I grammar

Right now we would like to have grammars for PL/I and COBOL based on ANTLR4. Does anyone provide these grammars?
If not, can you please share your thoughts/experience on developing these grammars from scratch?
Thanks
I assume you mean IBM PL/I and COBOL. (Not many other PL/Is around, but I don't think that really changes the answer much).
The obvious place to look for mature ANTLR grammars is the ANTLR3 grammar library; there are no PL/1 or COBOL grammars there. The ANTLR4 main page (ANTLR4 being a very new, radical, backwards-incompatible reengineering of ANTLR3) talks about Java and C#; no hint of PL/1 or COBOL there either; given its newness, no surprise. If you are really lucky, somebody who has one may speak up and offer it to you.
Developing such grammars is difficult for several reasons (based on personal experience building production-quality parsers for these two specific languages, using a very strong parser system different from ANTLR [see my bio for more details]):
The character set and column layout rules (columns 1-5, 6, and 72-80 are special) may be an issue: these languages were historically written in EBCDIC, in punch-card 80-column format without line-break characters between lines. Translation to ASCII sometimes produces nasty glitches; the ASCII end-of-line character is occasionally found in the middle of COBOL literal strings as a binary value, and because it has the same code in EBCDIC and ASCII, after translation it will be (and appear to be) an ASCII newline character. Character strings can also be long but split across multiple lines, while columns 72-80 by definition have to be ignored. Column 6 may contain a "D" character, which affects interpretation of the following source lines as "debug" or not. This means you need to get 80-column processing right. I don't know what ANTLR has to support processing characters-in-column-areas. You'll also need to worry about DBCS encoding of string literals, and variations of that if the source code is used in non-English-speaking countries, such as Japan.
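To give a concrete feel for that first step, here is a hypothetical sketch (in Swift) of normalizing one 80-column card image before lexing; the exact column positions are taken from the description above and should be verified against the IBM manuals:

    import Foundation

    // Hypothetical sketch: split one 80-column card image into the parts
    // the lexer cares about. Column positions follow the description
    // above (1-5 sequence area, 6 debug indicator, 72-80 ignored).
    func normalizeCardImage(_ rawLine: String) -> (code: String, isDebugLine: Bool) {
        // Pad short lines out to the full 80-column card width.
        let padded = rawLine.padding(toLength: 80, withPad: " ", startingAt: 0)
        let columns = Array(padded)

        // Column 6 (index 5) may carry a "D" marking debug code.
        let isDebugLine = columns[5] == "D" || columns[5] == "d"

        // Columns 72-80 must be ignored, so hand only columns 7-71
        // (indices 6..<71) to the lexer.
        let code = String(columns[6..<71])
        return (code, isDebugLine)
    }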
These languages are large and complex; IBM has had 40 years to decorate them with cruft. The IBM COBOL manual is some 600 pages... then you discover that COBOL also includes a Report Writer, which is another 600-page document. Capturing all the nuances of the lexical tokens and the grammar rules will take effort, and you have to do that from the IBM manuals, which don't contain nice BNF-style descriptions, which means guessing from the textual descriptions and some examples. For COBOL, expect several thousand grammar rules; PL/1 is less complicated in the abstract. Expect a certain amount of "lies": we've encountered a number of places where the reference documentation clearly says certain things are not legal, and yet the IBM compilers (as demonstrated by real, running source code) accept them, and vice versa. The only way you find these is by empirical experiment.
Both languages have constructs that are difficult to parse, e.g., requiring arbitrary lookahead and/or tolerating local ambiguity. From my understanding, ANTLR4 is much better than ANTLR3 on these, but that doesn't mean these aspects will be easy. PL/1 is particularly nasty in this regard: it has no reserved keywords, but hundreds of keywords-in-context (famously, a statement along the lines of IF IF = THEN THEN THEN = ELSE; is syntactically legal). To resolve these, one has to get the lexer and the parser to cooperate, and even then there may be many locally ambiguous parses. ANTLR3 doesn't handle these well; ANTLR4 is supposed to be better, but I don't know how it handles this, if it does at all.
To verify these parsers are right, you will need to run them on millions of lines of code (which means you have to have access to such code samples) and correct any errors you find. That takes a long time (in our case, several years of more or less continuous work/improvement to get production-quality grammars that work on large code bases). You might be miraculously faster than this; good luck.
You need to build a preprocessor for COBOL (COPY ... REPLACING), whose details are poorly documented, and eventually another one for PL/1 (whose preprocessor I understand to be fully Turing-capable).
After you build a parser, you need to capture a syntax tree; here ANTLR4 is supposed to be pretty good, in that it will capture one for the grammar you give it. That may or may not be the AST you want; with several thousand grammar rules, I'd expect not. ANTLR3 requires you to add, manually, indications of where and how to form the AST.
After you get the AST, you'll want to do something with it. This means you will need to build at least symbol tables (mappings from identifier instances to their declarations and any related type information). AFAIK, ANTLR provides nothing special to support this except support for walking the ASTs. This, too, is hard to get right; COBOL has crazy rules about how an unqualified identifier reference can be resolved to a specific data field provided there are no other conflicting interpretations. (There's lots more to Life After Parsing if you want good semantic information about the program; see my bio for more details. For each of these semantic aspects you have to develop them and then, for validation, go back and run them on large code bases again.)
TL;DR
Building parsers (well, "front ends") for these languages is a lot of work no matter what parsing engine you choose. That likely explains why they aren't already in ANTLR's grammar zoo.
Have a look at the open-source COBOL-85 parser from ProLeap, based on ANTLR4, which creates ASTs and ASGs as well.
And, best of all, it really works!
https://github.com/uwol/proleap-cobol-parser
I am not aware of a comparable PL/I grammar, but a very good start is the EBNF definition from Ralf Lämmel (CWI, Amsterdam) & Chris Verhoef (WINS, Universiteit van Amsterdam):
http://www.cs.vu.nl/grammarware/browsable/os-pli-v2r3/

How to use CFStringTokenizer with Chinese and Japanese?

I'm using the code here to split text into individual words, and it's working great for all languages I have tried, except for Japanese and Chinese.
Is there a way that code can be tweaked to properly tokenize Japanese and Chinese as well? The documentation says those languages are supported, but it does not seem to be breaking words in the proper places. For example, when it tokenizes "新しい" it breaks it into two words "新し" and "い" when it should be one (I don't speak Japanese, so I don't know if that is actually correct, but the sample I have says that those should all be one word). Other times it skips over words.
I did try creating Chinese and Japanese locales, while using kCFStringTokenizerUnitWordBoundary. The results improved, but are still not good enough for what I'm doing (adding hyperlinks to vocabulary words).
I am aware of some other tokenizers that are available, but would rather avoid them if I can just stick with Core Foundation.
[UPDATE] We ended up using MeCab with a custom user dictionary for Japanese for some time, and have now moved over to doing all of this on the server side. It may not be perfect there, but we have consistent results across all platforms.
If you know that you're parsing a particular language, you should create your CFStringTokenizer with the correct CFLocale (or at the very least, the guess from CFStringTokenizerCopyBestStringLanguage) and use kCFStringTokenizerUnitWordBoundary.
Unfortunately, perfect word segmentation of Chinese and Japanese text remains an open and complex problem, so any segmentation library you use is going to have some failings. For Japanese, CFStringTokenizer internally uses the MeCab library and ICU's boundary analysis (the latter only when using kCFStringTokenizerUnitWordBoundary, which is why you're getting a funny break with "新しい" without it).
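For what it's worth, here is a minimal sketch of that setup in Swift (the sample text and the "ja" locale identifier are just for illustration):

    import CoreFoundation
    import Foundation

    // Tokenize a Japanese string at word boundaries with an explicit
    // Japanese locale, as recommended above.
    let text = "新しい言葉を勉強します" as CFString
    let fullRange = CFRangeMake(0, CFStringGetLength(text))
    let japanese = CFLocaleCreate(kCFAllocatorDefault, "ja" as CFString)

    let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault,
        text,
        fullRange,
        kCFStringTokenizerUnitWordBoundary,
        japanese
    )

    // An empty token type means there are no more tokens.
    while CFStringTokenizerAdvanceToNextToken(tokenizer) != [] {
        let tokenRange = CFStringTokenizerGetCurrentTokenRange(tokenizer)
        if let word = CFStringCreateWithSubstring(kCFAllocatorDefault, text, tokenRange) {
            print(word as String) // where the breaks fall is up to MeCab/ICU
        }
    }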
Also have a look at NSLinguisticTagger.
But by itself it won't give you much more.
Truth be told, these two languages (and some others) are really hard to tokenize accurately programmatically.
You should also see the WWDC videos on LSM (Latent Semantic Mapping). They cover the topic of stemming and lemmas: the art and science of more accurately determining how to tokenize meaningfully.
What you want to do is hard. Finding word boundaries alone does not give you enough context to convey accurate meaning. It requires looking at the context and also identifying idioms and phrases that should not be broken up by word boundaries (not to mention grammatical forms).
After that, look again at the available libraries; then get a book on the Python NLTK to learn what you really need to know about NLP, and to understand how far you really want to pursue this.
Larger bodies of text inherently yield better results. There's no accounting for typos and bad grammar. Much of the context needed to drive logic in the analysis is implicit, not directly written as words. You get to build rules and train the thing.
Japanese is a particularly tough one, and many libraries developed outside of Japan don't come close. You need some knowledge of the language to know whether the analysis is working. Even native Japanese speakers can have a hard time doing the natural analysis without the proper context. There are common scenarios where the language presents two mutually intelligible, correct word boundaries.
To give an analogy, it's like doing lots of lookahead and lookbehind in regular expressions.

What is the point of the lower camel case variable casing convention (thisVariable, for example)?

I hope this doesn't get closed due to being too broad. I know it comes down to personal preference, but there is an origin to all casing conventions and I would like to know where this one came from and a logical explanation as to why people use it.
It's where you go all like var empName;. I call that lower camel, although it's probably technically called something else. Personally, I go like var EmpName. I call that proper camel and I like it.
When I first started programming, I began with the lower camel convention. I didn't know why. I just followed the examples set by all the old guys. Variables and functions (VB) got lower camel while subs and properties got proper camel. Then, after I finally acquired a firm grasp on programming itself, I became comfortable enough to question the tactics of my mentors. It didn't make logical sense to me to use lower camel because it wasn't consistent, especially if you have a variable that consists of one word which ends up being in all lowercase. There is also no validation mechanism in place to make sure you are appropriately using lower vs. upper camel, so I asked why not just use proper camel for everything. It's consistent since all variable names are subject to proper camelization.
Having dug deeper into it, it turns out that this is a very sensitive issue to many programmers when it is brought to question. They usually answer with, "Well, it's just personal preference" or "That's just how I learned it". Upon prodding further, it usually invokes a sort of dogmatic reaction with the person as I attempt to find a logical reason behind their use of lower camel.
So anyone want to shed a little history and logic behind casing of the proper camelatory variety?
It's a combination of two things:
The convention of variables starting with lower case, to differentiate from classes or other entities which use a capital. This is also sometimes used to differentiate based on access level (private/public)
CamelCasing as a way to make multi-word names more readable without spaces (of course this is a preference over underscore, which some people use). I would guess the logic is that CamelCasing is easier/faster for some to type than word_underscores.
Whether or not it gets used is of course up to whoever is setting the coding standards that govern the code being written. Underscores vs. CamelCase, lowercasevariables vs. Uppercasevariables. CamelCase + lowercasevariable = camelCase.
In languages like C# or VB, the standard is to start private things with lowercase and public/protected things with uppercase. This way, just by looking at the first letter you can tell whether the thing you are messing with could be used by other classes, and thus whether any changes need more scrutiny. Also, there are tools to enforce naming conventions like this. The one created/used internally at Microsoft is called StyleCop and is available as a free download.
Historically, well named variables in C (a case-sensitive language) consisted of a single word in lower case. UPPERCASE was reserved for macros.
Then along came C++, where classes are usually CapitalizedAndCamelCased, and variables/functions consisting of several words are camelCased. (Note that C people tend to dislike camelCase, and instead write identifiers_this_way.)
From there, it spread.
And, yes, probably other case-sensitive languages have had some influence.
lowerCamelCase, I think, has become popular because of Java and JavaScript.
In Java it is specifically defined: method names should be verbs, in mixed case with the first letter lowercase and the first letter of each subsequent word capitalized.
The reason Java chose lowerCamelCase depends, I think, on what it wanted to solve. Java was launched in 1995 as a language that would make programming easy. C/C++, which were widely used at the time, were often considered difficult and too technical.
This was something Java claimed to solve: more people would be able to program, and the same code would work on different hardware. The code was the documentation; you didn't need to comment it, just read it, and everything would be great.
lowerCamelCase makes it harder to write "technical" code because it removes options to use uppercase and lowercase letters to better describe the code from a technical perspective. Java didn't want to be hard; Java was the language to use, the one everyone could learn to program in.
JavaScript in browsers was created in 10 days by Brendan Eich in 1995. Why JavaScript selected lowerCamelCase is, I think, because of Java. It has nothing to do with Java, but it has "java" in its name, "JavaScript".

Should Unicode be allowed in usernames? [closed]

Why do most (all?) websites only support usernames in ASCII? Are there any security considerations if an admin decides to start accepting Unicode usernames?
Homoglyph attacks. The usernames 'cat' and 'сat' are different Unicode strings, although they look the same. The first letter in the second 'сat' is a Russian 'с', "CYRILLIC SMALL LETTER ES" to be exact. The system can't easily tell that you're spoofing another user's name; to the computer, the nicks are different.
Edit: Preventing mixed scripts does not solve the problem. For example, 'сосо' is pure Cyrillic and can be used to spoof the ASCII 'coco'.
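A tiny sketch of why the machine sees different users (in Swift, but any language that compares strings by code points behaves the same):

    import Foundation

    // "cat" with a Latin "c" vs. "сat" starting with CYRILLIC SMALL
    // LETTER ES: visually identical, different code points.
    let latin = "cat"
    let spoofed = "\u{0441}at" // U+0441 CYRILLIC SMALL LETTER ES + "at"

    print(latin == spoofed) // false: the first scalars differ

    for scalar in spoofed.unicodeScalars {
        print(String(format: "U+%04X", scalar.value)) // U+0441, U+0061, U+0074
    }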
Also, the left-to-right override character (and friends): leave them unsanitized and they'll mess up your whole page.
HTTP authentication?
There could be some problems with sending a Unicode username (and/or password) over existing protocols. One case that I have run into before is Basic authentication: there is no well-defined way to send Unicode usernames/passwords in the Basic auth headers.
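Here is a small sketch of that ambiguity: the Basic auth header carries base64-encoded bytes, and the charset used to produce those bytes was historically unspecified (RFC 7617 later added an optional charset parameter, for UTF-8 only), so the same credentials can yield different tokens (Swift; the credentials are hypothetical):

    import Foundation

    // base64(user + ":" + pass) is computed over bytes, so the encoding
    // the client picks changes the token on the wire.
    let credentials = "müller:geheim"

    let utf8Token = Data(credentials.utf8).base64EncodedString()
    let latin1Token = credentials.data(using: .isoLatin1)!.base64EncodedString()

    print("Authorization: Basic \(utf8Token)")   // what a UTF-8 client sends
    print("Authorization: Basic \(latin1Token)") // a Latin-1 client: different token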
While it is questionable why there should be a username at all, and not just a 'password' to identify a user, I think there's no reason to disallow Unicode usernames.
What's more important is that the password be validated in a language-agnostic way: it should treat keystrokes the same regardless of the user's keyboard layout. This means "שלום" and "akuo" would be the same password. This is important because users often don't see the password characters they're typing, and they get severely pissed if CAPS LOCK is on.
While you can go ahead and allow Unicode, understand that some usernames will not work as expected, thanks to different cultures applying different rules to the same characters.
Consider the basic case of breaking case sensitivity: in Turkish, the usernames "Id1" and "id1" are different (Turkish has two different letter Is, one with a dot and one without, resulting in two capital and two small letters that do not follow the same capitalization rules as English). So while any Turkish person can enter their name in their own language, the program will not treat their name as they expect; instead it will undergo a strange transformation into mutant English.
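A quick sketch of that Turkish "I" behavior, using Foundation's locale-aware case mapping in Swift:

    import Foundation

    let turkish = Locale(identifier: "tr_TR")

    // English rules fold "Id1" and "id1" together...
    print("Id1".lowercased())              // "id1"

    // ...but Turkish rules do not: I lowercases to dotless ı,
    // and i uppercases to dotted İ.
    print("Id1".lowercased(with: turkish)) // "ıd1"
    print("id1".uppercased(with: turkish)) // "İD1"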
Special Latin characters in European languages have similar overlaps, making it seemingly random which language they are being entered in. Other regions of the world have similar shared characters where the rules of use differ; in some cases, national and cultural hatreds could result in some very angry people when the characters making up their username are treated as if they were written in the language of their hated enemy (due to that being the operating system's default setting for those foreign characters).
Your observation is not always true, and the choice of ASCII is largely about human factors rather than technical or security issues.
In most cases, it is just for ease of programming. A programmer never knows whether all the software, libraries, and utilities in a website will break with some characters. Why risk the website's development when ASCII works well? Also, some packaged web software hinders the use of Unicode in usernames. All this contributes to the fact that many websites only support usernames in ASCII.
Theoretically, all current software can handle 8-bit data well. There is no problem with storage or transmission nowadays. Even where some protocols cannot, the data can be translated to UTF-7 or other transformation schemes.
There are some issues with Unicode, more on the data-processing side: display, fonts, readiness of software and libraries for non-BMP characters, collation, comparison, input methods, writing direction. Administrators might not be knowledgeable enough to handle them. Depending on the nature of the website, this could be a problem, but mostly it is not.
For administration purposes, it is not easy to type some exotic characters, which makes it hard for an admin to search for users. It is also hard for an admin to keep offensive usernames in foreign languages off the website.
However, it is not uncommon for Chinese usernames to be used on Chinese websites, and those are not always ASCII. The same goes for other cultures and languages. Some global projects accept nearly all kinds of Unicode characters; Wikipedia is an example.
Restricting to plain ASCII is rare, I'd say. Often it's just that no one thinks about it, since Latin-1 suffices in Western Europe and in the US as well. Some databases make distinctions between text in legacy character sets and Unicode (varchar vs. nvarchar); for other databases, a special character set has to be configured.
Especially in the US, many people never even notice that ASCII won't be enough. Some try to find excuses like »users have to be able to enter it«, but those are mostly bogus.
To answer your question: I doubt there are security considerations, except maybe for spoofing other people's names using different scripts (a and а look identical, but one is Latin and one is Cyrillic; this has been done with URLs before). Generally I see it as an oversight by developers who probably should know better.
I would say a big reason is the lack of support for Unicode in most PHP installations. It isn't easy to work with, so why allow it when the possibilities of ASCII are sufficient to cover your entire user base?
Or we could just stop giving a crap about what a username looks like and whether WE can pronounce/remember it. That should be the USER'S concern. If no one remembers you, that's your loss. And as for name spoofing, that is almost unavoidable in any case. And yet you rarely ever hear of username spoofs.
Imagine a forum, and imagine someone posts with an account that LOOKS identical to yours. You get in trouble, say you didn't do it, post a link to your history, and show the post isn't there. Click the profile of the guy who ACTUALLY posted it, and bam, you have his profile. He's now bannable.
Having the same name doesn't mean you have the same user data. Any application that doesn't make it easy for you to differentiate two similar users is piss-poor anyway and needs to be rewritten.

How to format source-code in a Framemaker document? [closed]

I have to write up a technical document in Framemaker that explains various programming source-code.
So my document consists of a bunch of text, followed by a bunch of source code (Java, XML) and then followed by more text, etc.
This question is not about whether I should or should not use Framemaker; that is the software I have to use...
What I'm confused about is how to format source code as part of my document. Has anyone done this for a technical document and come across any instructions or tips? So far my Googling hasn't produced anything relevant to what I need to do.
At the very least, create a paragraph style for code samples, use a good monospaced font, and don't forget to turn off hyphenation.
When I used to do this, I would create a table style and paste the code in there, so I had a nice title header above it, and it stood out a bit. The only gotcha there is that Frame table cells won't break across a page break, so if your code is longer than a page or threatens to go below the bottom of a page, you'll need to create multiple rows in your table and break up the code across the rows.
From a paper I wrote on this some years ago, which will be available online again next week:
Typographers are primarily concerned with legibility, and have tools, practices, and traditions dating back hundreds and indeed thousands of years on which to rely when setting texts in natural languages. However, computer programs are not written in natural languages. They are written in ‘programming languages’: artificial languages, which have their own rules of syntax, their own conventions of presentation, and their own criteria of legibility. Computer code is therefore a special domain for typesetting, just as are music, mathematics, and chemistry. These domains have their own rules, which are not the rules used when setting natural languages.
Computer programming itself is of very recent origin, and the practice of setting it in type doesn't go back more than about 45 years: significant volumes of computer code have only been published in the last 20 years or less. The associated typographical discipline is immature or indeed practically non-existent, and the typographical expectations of the practitioners in the field are also low, as you can see by inspecting many trade books. There's no reason why you can't try to do better.
Use a sans serif font. In one of my books I used the same font family, FF Scala for the text and FF Scala Sans for the code. I think it looks great, but there are contrary opinions: these may force you to use a monospaced font, although personally I think this is very outdated. Avoid Courier; it doesn't blend with anything.
Indentation is part of the notation. You must respect the existing left indents. The source code will already be tabbed. Reduce each tab to one or two spaces at most, otherwise you will run out of horizontal room.
Try to lose as much vertical space as possible, e.g. suppress blank lines. Try to get the entire sample onto one page. Let it float if necessary to accomplish that.
Line breaks are part of the notation. Don't add line breaks without consulting the author.
Quotation marks are part of the notation. Don't change single to double or vice versa.
Justification: Computer programs are always written, viewed, and set left-justified, right-ragged.
Page breaks. When setting computer code in a book, page breaks can’t just follow the simple orphan/widow principles used when typesetting natural languages. Instead, the logical ‘blocks’ of the code must be kept together if possible. It is not usually possible for the typographer to determine the block boundaries in code, although a blank line is generally an acceptable point for a page break. ‘Block comments’ should be kept with the following block of code. If you don’t know what these are, ask the author.
Hyphenation. Programming languages are not natural languages and do not observe the usual hyphenation conventions. Consult the author if you need to hyphenate, or just don't. Words in program text must never be hyphenated or line-broken except in accordance with the author’s instructions.
Upper and lower case. Case in program code is usually significant to the computer, and practically always to writers and their readers. Pairs of words are often used which differ only in case, representing different things: e.g. BufferedOutputStream and bufferedOutputStream. Programmers, especially author-programmers, are usually highly systematic about case, in ways which may not necessarily make sense to the typographer (or to other programmers!).
Practical recommendations
Indent in em units. The solution to many of the issues in typesetting computer programs is the em. The author’s tabs will most likely be to the next multiple of 8 spaces (1, 9, 17, …); typographic tabs for program code should be in multiples of 1 or 2 ems. Adopting the em as the unit of indentation may at first ‘look funny’ to the author, as the indents may be much narrower than seen on screens or printouts. However, as long as the vertical alignment of tab stops is preserved, the author’s intention is fully preserved.
Line breaks must be as per the MS (the author's manuscript).
Page breaks: If page breaks may occur in the middle of program code, the author must be consulted as to preferred page break points. Usually this is to be avoided altogether in short examples; in longer programs, the author should indicate all possible page breaks in the MS.
Quotes: Conventionally, ‘straight’ quotes are used, not typographic quotes. This is historically determined, by the use of fonts without typographic quotes (e.g. Courier, Helvetica) in typeset computer code. It is not required by the properties of the notation.
I see no reason against using typographic quotes when setting computer programs, as long as single quotes stay single and double quotes stay double, i.e. as long as the author’s quotes are preserved rather than ‘corrected’ to standard typographic practice.
Numerals: Conventionally, lining numerals have always been used in program code. If you can be bothered to use old-style numerals in program code, or if the font is built that way, I can see no reason against it. You must choose a font in which 1, I, and l (lower-case L) are distinct, as well as 0 (zero) and O (capital O).