What characters are allowed in function names etc? - elm

As the title says, which characters are allowed in identifiers (function, variable, and record field names)? aöø all seem to be fine, as do '_9 if not the first character. <$;% do not. Is it documented somewhere which ranges/blocks of unicode characters and symbols are allowed?
Follow-up question: which characters are allowed in infix operators?

So, after reading the Haskell specs (which can be assumed to have influenced Elm), the JavaScript specs, and some trial and error, I have arrived at the following rules:
An identifier must begin with a character from the unicode categories:
Uppercase letter (Lu) (modules, types)
Lowercase letter (Ll) (functions, variables)
Titlecase letter (Lt) (modules, types)
The rest of the characters must belong to any of the following categories:
Uppercase letter (Lu)
Lowercase letter (Ll)
Titlecase letter (Lt)
Modifier letter (Lm)
Other letter (Lo)
Decimal digit number (Nd)
Letter number (Nl)
Or be _ (except for in module names).
Technically "Other number" (No) seems to also be valid in Elm, but it crashes after it's been compiled to JavaScript.
I used this tool to get the ranges for each category.
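The rules above can be checked mechanically. Here is a minimal Python sketch, assuming the category lists in this answer are accurate (they come from trial and error, not from an official Elm grammar). The function name is_identifier and its allow_underscore parameter are made up for illustration. It covers only the listed categories and the underscore, so the ' character mentioned in the question is not handled.

```python
import unicodedata

# Categories taken from the answer's empirical rules (not an official Elm spec).
START_CATEGORIES = {"Lu", "Ll", "Lt"}
REST_CATEGORIES = START_CATEGORIES | {"Lm", "Lo", "Nd", "Nl"}


def is_identifier(name: str, allow_underscore: bool = True) -> bool:
    """Return True if `name` satisfies the rules listed above.

    `allow_underscore` should be False for module names, which per the
    answer may not contain an underscore.
    """
    if not name:
        return False
    # The first character must be an upper-, lower- or titlecase letter.
    if unicodedata.category(name[0]) not in START_CATEGORIES:
        return False
    for ch in name[1:]:
        if ch == "_":
            if not allow_underscore:
                return False
            continue
        if unicodedata.category(ch) not in REST_CATEGORIES:
            return False
    return True


if __name__ == "__main__":
    for candidate in ["aöø", "a_9", "_9", "a<", "x²"]:
        print(repr(candidate), is_identifier(candidate))
```

Note that x² is rejected here because ² is in the "Other number" (No) category, which matches the observation that such names break once compiled to JavaScript.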

Related

simple input of diacritical marks, and superscripts

There are times when you need to input modified variables with diacritical marks, or superscripts.
It seems that declare_index_properties allows doing this at the display stage.
But it is neither simple nor very useful in formulas.
Is there a simple way of adding hats, umlauts, and ' or " strokes on top of a symbol, so that it is distinguishable from the symbol without such a mark, both to the interpreter and to the human eye?
Maxima doesn't have a notion of declaring a symbol to have diacritical marks or other combining marks on it. However, Maxima allows Unicode characters in symbol names if the underlying Lisp implementation allows Unicode; almost all of them allow Unicode. GCL is the only Lisp implementation, so far as I know, which doesn't handle Unicode correctly.
wxMaxima appears to allow Unicode characters to be input. At least, it worked that way when I tried some examples. Command-line Maxima allows Unicode if the terminal it is running in allows Unicode.
I think any Unicode character should be OK in a string. For symbols, any character which passes ALPHA-CHAR-P (a built-in Lisp function) can be part of a symbol name. Also, any character which is declared to be alphabetic (via declare("x", alphabetic) where x is the character in question) can be part of a symbol name.
I think wxMaxima has some capability to allow the user to select characters with diacritical marks from a menu; I haven't tried it. When I want to use Unicode characters, I end up just pasting them from a web page or something. I have used https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html as a source of characters in the past.

Grammar and unicode characters

Why does the grammar below fail to parse Unicode characters?
It parses fine after removing the word boundaries around <sym>.
#!/usr/bin/env perl6
grammar G {
    proto rule TOP { * }
    rule TOP:sym<y> { «<.sym>» }
    rule TOP:sym<✓> { «<.sym>» }
}
say G.parse('y'); # 「y」
say G.parse('✓'); # Nil
From the « and » "left and right word boundary" doc:
[«] matches positions where there is a non-word character at the left, or the start of the string, and a word character to the right.
✓ isn't a word character. So the word boundary assertion fails.
What is and isn't a "word character"
"word", in the sense of the \w character class, has the same definition in P6 as it does in P5 (when not using the P5 /a regex modifier), namely letters, some decimal digits, or an underscore:
Characters whose Unicode general category starts with an L, which stands for Letter.1
Characters whose Unicode general category is Nd, which stands for Number, decimal.2
_, an underscore.
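As a rough cross-check of the three rules above, here is a minimal Python sketch that classifies single characters by Unicode general category. The function name is_word_char is made up for illustration, and the sketch only mirrors the definition as stated in this answer. It is not how the P6 regex engine is implemented.

```python
import unicodedata


def is_word_char(ch: str) -> bool:
    """Approximate the \\w definition described above for one character:
    any Letter category (L*), decimal digits (Nd), or an underscore."""
    category = unicodedata.category(ch)
    return category.startswith("L") or category == "Nd" or ch == "_"


if __name__ == "__main__":
    for ch in "y✓ら_1१¹①一":
        print(ch, unicodedata.category(ch), is_word_char(ch))
```

Under this definition ✓ (category So) is not a word character, which is why the word boundary assertion in the grammar fails for it.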
"alpha 'Nd under"
In a comment below #p6steve++ contributes a cute mnemonic that adds "under" to the usual "alphanum".
But "num" is kinda wrong because it isn't any number but only some decimal digits, specifically the characters that match the Unicode General Category Nd (matched by P6 regex /<:Nd>/).2
This leads naturally to alphaNdunder (alpha Nd under) pronounced "alpha 'nd under".
Footnotes
1 Letters are matched by the P6 regex /<:L>/. This includes Ll (Letter, lowercase) (matched by /<:Ll>/) as JJ notes but also others including Lu (Letter, uppercase) and Lo (Letter, other), which latter includes the ら character JJ also mentions. There are other letter sub-categories too.
2 Decimal digits with the Unicode general category Nd are matched by the P6 regex /<:Nd>/. This covers decimal digits that can be chained together to produce arbitrarily large decimal numbers where each digit position adds a power of ten. It excludes decimal digits that have a "typographic context" (my phrasing follows the example of Wikipedia). For example, 1 is the English decimal digit denoting one; it is included. But ¹ and ① are excluded because they have a "typographic context". For a billion+ people their native languages use १ to denote one and १ is included in the Nd category for decimal digits. But for another billion+ people their native languages use 一 for one but it is excluded from the Nd category (and is in the L category for letters instead). Similarly ६ (Devanagari 6) is included in the Nd category but 六 (Han number 6) is excluded.
I keep starting my answers with "Raiph is right". But he is. Also, an example of why this is so:
for <y ✓ Ⅲ> {
    say $_.uniprops;
    say m/<|w>/;
}
The second line of the loop tests against the word boundary anchor; only the first character, which can be part of an actual word, matches that anchor. The first line of the loop prints the Unicode properties: in the first case it's a letter (Ll), while in the other two cases it is not. You can use any character in your grammar, but only characters with a letter property such as Ll can actually form words.
grammar G {
    proto rule TOP { * }
    rule TOP:sym<y> { «<.sym>» }
    rule TOP:sym<ら> { «<.sym>» }
}
say G.parse('y'); # 「y」
say G.parse('ら'); # This is a hiragana letter, so it works.

Xpath: whitespace encoding

I need to create an XPath query to select a JCR node whose name contains a whitespace character.
For instance: /jcr:root/foo bar/
But that results in an invalid query.
How should whitespaces be encoded in an XPath query?
Try using something like this XPath query:
/jcr:root/foo_x0020_bar/
The JSR-170 (JCR 1.0) specification defines how XPath can be used to query a JCR repository, and even though JSR-283 (or JCR 2.0) deprecated XPath as a query language, many of the implementations still support XPath along with the other query languages (including the more powerful JCR-SQL2).
Now, regarding the rules for escaping characters in XPath, JSR-170 states the following in Section 6.6.4.9:
The names of elements and attributes (corresponding to nodes and properties, respectively) within an XPath statement must correspond to the form in which they (notionally) appear in the document view. This means that spaces (and any other non-XML characters) within names must be encoded according to the rules described in 6.4.3 Escaping of Names.
Section 6.4.3 defines how such characters are escaped in names:
The escape character is the underscore (“_”). Any invalid character is escaped as _xHHHH_, where HHHH is the four-digit hexadecimal UTF-16 code for the character. When producing escape sequences the implementation should use lowercase letters for the hex digits a-f. When unescaping, however, both upper and lowercase alphabetic hexadecimal characters must be recognized.
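To make the _xHHHH_ rule concrete, here is a minimal Python sketch of escaping and unescaping. The function names and the `invalid` parameter are made up for illustration; a real implementation would decide which characters are invalid from the XML name rules, whereas this sketch just takes the set of characters to escape as an argument. The unescape side only handles single UTF-16 code units.

```python
import re


def escape_char(ch: str) -> str:
    """Escape one character as _xHHHH_ per UTF-16 code unit, lowercase hex."""
    units = ch.encode("utf-16-be")
    parts = []
    for i in range(0, len(units), 2):
        code_unit = int.from_bytes(units[i:i + 2], "big")
        parts.append(f"_x{code_unit:04x}_")
    return "".join(parts)


def escape_name(name: str, invalid: str = " ") -> str:
    """Escape every character of `name` that appears in `invalid`.

    Assumption: the caller supplies the characters to treat as invalid;
    only the space is escaped by default.
    """
    return "".join(escape_char(c) if c in invalid else c for c in name)


def unescape_name(name: str) -> str:
    """Reverse the escaping, accepting upper- and lowercase hex digits."""
    return re.sub(
        r"_x([0-9a-fA-F]{4})_",
        lambda m: chr(int(m.group(1), 16)),
        name,
    )


if __name__ == "__main__":
    print(escape_name("foo bar"))
    print(unescape_name("foo_x0020_bar"))
```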
Although you didn't ask about it, you can easily do the same query in JCR-SQL2:
SELECT * FROM [nt:base] WHERE ISSAMENODE('/foo_x0020_bar')

RegEx to find % symbols in a string that don't form the start of a legal two-digit escape sequence?

I would like a regular expression to find the %s in the source string that don't form the start of a valid two-hex-digit escaped character (defined as a % followed by exactly two hexadecimal digits, upper or lower case) that can be used to replace only these % symbols with %25.
(The motivation is to make the best guess attempt to create legally escaped strings from strings of various origins that may be legally percent escaped and may not, and may even be a mixture of the two, without damaging the data intent if the original string was already correctly encoded, e.g. by blanket re-encoding).
Here's an example input string.
He%20has%20a%2050%%20chance%20of%20living%2C%20but%20there%27s%20only%20a%2025%%20chance%20of%20that.
This doesn't conform to any encoding standard because it is a mix of valid escaped characters, e.g. %20, and two loose percent symbols. I'd like to convert those %s to %25s.
My progress so far is to identify a regex %[0-9a-z]{2} that finds the % symbols that are legal but I can't work out how to modify it to find the ones that aren't legal.
%(?![0-9a-fA-F]{2})
Should do the trick. Use a look-ahead to find a % NOT followed by a valid two-digit hexadecimal value then replace the found % symbol with your %25 replacement.
(Hopefully this works with (presumably) NSRegularExpression, or whatever you're using)
%(?![a-fA-F0-9]{2})
That's a percent followed by a negative lookahead for two hex digits.
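For illustration, here is the same negative lookahead applied with Python's re module rather than NSRegularExpression (the question doesn't say which engine is in use, so this is only a sketch). The name fix_loose_percents is made up for the example.

```python
import re

# A percent sign NOT followed by exactly two hex digits.
LOOSE_PERCENT = re.compile(r"%(?![0-9a-fA-F]{2})")


def fix_loose_percents(text: str) -> str:
    """Replace only the % signs that do not start a valid %HH escape."""
    return LOOSE_PERCENT.sub("%25", text)


if __name__ == "__main__":
    source = (
        "He%20has%20a%2050%%20chance%20of%20living%2C%20but%20"
        "there%27s%20only%20a%2025%%20chance%20of%20that."
    )
    print(fix_loose_percents(source))
```

Already valid escapes such as %20 are left alone, so running the function a second time does not change the result. One caveat: a literal % that happens to be followed by two hex digits (as in "50%AB") cannot be told apart from a real escape.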

Collect a word between two spaces in objective c

I'm trying to implement something similar to spell check, but I need to get each word that is delimited by a space. For example, for "HI HOW R U" I need to collect HI, HOW and so on as the user types, i.e. after the user types HI and a space, I need to collect HI and run a spell check on it.
Check the documentation for NSString here. You want the message componentsSeparatedByString:.
I don't know objective-C, but I'm fairly sure it'll have a Regexp library - although it'd be straightforward to code it without one.
Regexp: \b([^\s])*\b
\b = word boundary (whitespace, comma, dot, exclamation-mark, etc.)
\s = whitespace character
[...] = character set
[^...] = negated character set (any character(s) EXCEPT ...)
() = grouping construct
* = zero or more times
So the suggested expression would start matching at any word boundary, then match every subsequent character that is not a whitespace character, then match a word boundary.
Your stated case is so simple you may just want to look for spaces (one char at a time) and get the substring, but RegExp is very widely used across a range of languages and platforms, and so it's fairly easy to find an expression when you need to - and one often does for common stuff like checking if zip codes, phone numbers, email addresses and so on are syntactically correct. So it's worth learning in any case. :)
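Both approaches are sketched below in Python rather than Objective-C, purely for illustration. The function names are made up. The regex uses + instead of the * in the expression above so that empty matches are not returned.

```python
import re
from typing import List, Optional


def words_by_regex(text: str) -> List[str]:
    """Find runs of non-whitespace characters delimited by word boundaries."""
    return re.findall(r"\b[^\s]+\b", text)


def last_completed_word(typed: str) -> Optional[str]:
    """Return the word just finished by a trailing space, else None.

    This mimics "collect the word once the user hits space".
    """
    if not typed.endswith(" "):
        return None
    words = typed.split()
    return words[-1] if words else None


if __name__ == "__main__":
    print(words_by_regex("HI HOW R U"))
    print(last_completed_word("HI "))
    print(last_completed_word("HI HOW"))
```

In this sketch last_completed_word would be called on every keystroke with the text typed so far, and the spell check would run whenever it returns a word.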