Grammar and unicode characters - raku

Why the below Grammar fails to parse for unicode characters?
it parses fine after removing word boundaries from <sym>.
#!/usr/bin/env perl6
grammar G {
proto rule TOP { * }
rule TOP:sym<y> { «<.sym>» }
rule TOP:sym<✓> { «<.sym>» }
}
say G.parse('y'); # 「y」
say G.parse('✓'); # Nil

From the « and » "left and right word boundary" doc:
[«] matches positions where there is a non-word character at the left, or the start of the string, and a word character to the right.
✓ isn't a word character. So the word boundary assertion fails.
What is and isn't a "word character"
"word", in the sense of the \w character class, has the same definition in P6 as it does in P5 (when not using the P5 \a regex modifier), namely letters, some decimal digits, or an underscore:
Characters whose Unicode general category starts with an L, which stands for Letter.1
Characters whose Unicode general category is Nd, which stands for Number, decimal.2
_, an underscore.
"alpha 'Nd under"
In a comment below #p6steve++ contributes a cute mnemonic that adds "under" to the usual "alphanum".
But "num" is kinda wrong because it isn't any number but only some decimal digits, specifically the characters that match the Unicode General Category Nd (matched by P6 regex /<:Nd>/).2
This leads naturally to alphaNdunder (alpha Nd under) pronounced "alpha 'nd under".
Footnotes
1 Letters are matched by the P6 regex /<:L>/. This includes Ll (Letter, lowercase) (matched by /<:Ll>/) as JJ notes but also others including Lu (Letter, uppercase) and Lo (Letter, other), which latter includes the ら character JJ also mentions. There are other letter sub-categories too.
2 Decimal digits with the Unicode general category Nd are matched by the P6 regex /<:Nd>/. This covers decimal digits that can be chained together to produce arbitrarily large decimal numbers where each digit position adds a power of ten. It excludes decimal digits that have a "typographic context" (my phrasing follows the example of Wikipedia). For example, 1 is the English decimal digit denoting one; it is included. But ¹ and ① are excluded because they have a "typographic context". For a billion+ people their native languages use १ to denote one and १ is included in the Nd category for decimal digits. But for another billion+ people their native languages use 一 for one but it is excluded from the Nd category (and is in the L category for letters instead). Similarly ६ (Devanagari 6) is included in the Nd category but 六 (Han number 6) is excluded.

I keep starting my answers with "Raiph is right". But he is. Also, an example of why this is so:
for <y ✓ Ⅲ> {
say $_.uniprops;
say m/<|w>/;
}
The second line of the loop compares against the word boundary anchor; just the first character, which can be a part of an actual word, matches that anchor. It also prints the Unicode properties in the first line of the loop; in the first case it's a letter, (Ll), it's not in the other two cases. You can use any Ll character as part of a word, and in your grammar, but only characters with that Unicode property can actually form words.
grammar G {
proto rule TOP { * }
rule TOP:sym<y> { «<.sym>» }
rule TOP:sym<ら> { «<.sym>» }
}
say G.parse('y'); # 「y」
say G.parse('ら'); # This is a hiragana letter, so it works.

Related

Making character class with modifier symbols in Perl 6

I'd like to make a user-defined character class of "vowels", which will match any literal English vowel letter (a, e, i, o, u) as well as any of these letters with any possible diacritics: ắ ḗ ú̱ å ų̄ ẹ́ etc.
This is what I've tried to do, but it doesn't work:
> my $vowel = / <[aeiou]> <:Sk>* /
/ <[aeiou]> <:Sk>* /
> "áei" ~~ m:g/ <$vowel> /
(「e」 「i」)
You could try use ignoremark:
The :ignoremark or :m adverb instructs the regex engine to only
compare base characters, and ignore additional marks such as combining
accents.
For your example:
my $vowel = /:m<[aeiou]>/;
.say for "áeikj" ~~ m:g/ <$vowel> /;
Output:
「á」
「e」
「i」
The reason you can't match a vowel with a combining character using / <[aeiou]> <:Sk>* / is that strings in Perl 6 are operated on at the grapheme level. At that level, ų̄ is already just a single character, and <[aeiou]> being a character class already matches one whole character.
The right solution is, as Håkon pointed out in the other answer, to use the ignoremark adverb. You can put it before the regex like rx:m/ <[aeiou]> / or inside of it, or even turn it on and off at different points with :m and :!m.

How hive sentences function breaks each sentence

Before posting, I tried the hive sentences function and did some search but couldn't get a clear understanding, my question is based on what delimiter hive sentences function breaks each sentence? hive manual says "appropriate boundary" what does that mean? Below is an example of my tries, I tried adding period (.) and exclamatory sign(!) at different points of the sentence. I'm getting different outputs, can someone explain on this?
with period (.)
select sentences('Tokenizes a string of natural language text into words and sentences. where each sentence is broken at the appropriate sentence boundary and returned as an array of words.') from dummytable
output - 1 array
[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences","where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]
with '!'
select sentences('Tokenizes a string of natural language text into words and sentences! where each sentence is broken at the appropriate sentence boundary and returned as an array of words.') from dummytable
output - 2 arrays
[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences"],["where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]
If you understand the functionality of sentences()..it clears your doubt.
Definition of sentences(str):
Splits str into arrays of sentences, where each sentence is an array
of words.
Example:
SELECT sentences('Hello there! I am a UDF.') FROM src LIMIT 1;
[ ["Hello", "there"], ["I", "am", "a", "UDF"] ]
SELECT sentences('review . language') FROM movies;
[["review","language"]]
An exclamation point is a type of punctuation mark that goes at the end of a sentence. Other examples of related punctuation marks include periods and question marks, which also go at the end of sentences.But as per the definition of sentences() ,Unnecessary punctuation, such as periods and commas in English, is automatically stripped.So,we are able to get two arrays of words with !. It completely involves java.util.Locale.java
I don't know the actual reason but observed after period(.) if you put space and next word first letter as capital then it is working.
Here I changed from where to Where it it worked. However this is not require for !
Tokenizes a string of natural language text into words and sentences. Where each sentence is broken at the appropriate sentence boundary and returned as an array of words.
And this is giving below output
[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences"],["Where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]

What characters are allowed in function names etc?

As the title says, which characters are allowed in identifiers (function, variable, and record field names)? aöø all seem to be fine, as do '_9 if not the first character. <$;% do not. Is it documented somewhere which ranges/blocks of unicode characters and symbols are allowed?
Follow-up question: which characters are allowed in infix operators?
So, after reading the Haskell specs (which can be assumed has influenced Elm), the JavaScript specs, and trial and error, I have arrived at the following rules:
An identifier must begin with a character from the unicode categories:
Uppercase letter (Lu) (modules, types)
Lowercase letter (Ll) (functions, variables)
Titlecase letter (Lt) (modules, types)
The rest of the characters must belong to any of the following categories:
Uppercase letter (Lu)
Lowercase letter (Ll)
Titlecase letter (Lt)
Modifier letter (Lm)
Other letter (Lo)
Decimal digit number (Nd)
Letter number (Nl)
Or be _ (except for in module names).
Technically "Other number" (No) seems to also be valid in Elm, but it crashes after it's been compiled to JavaScript.
I used this tool to get the ranges for each category.

Oracle regexp_like pattern with POSIX character class

I have username in this pattern
ref_2_34_aaa_dos
ref_2_34_bbb_dos
How can I use regexp_like for this?
SELECT username FROM all_users WHERE regexp_like(username, '^ref_2_34_[:alpha:]_dos$')
does not work. Nor can I use ESCAPE '\' with regexp_like. it would give a syntax error.
You need to put the POSIX character class [:alpha:] into a bracket expression (i.e. [...]) and apply a + quantifier to it:
regexp_like(username, '^ref_2_34_[[:alpha:]]+_dos$')
The + quantifier means there can be 1 or more letters between the last but one and last underscores.
If your string may have no user name at that location (it is empty), and you want to get those entries too, you would need to replace the plus with the * quantifier that matches zero or more occurrences of the quantified subpattern.
Since comments require some more clarifications, here is some bracket expression and POSIX character class reference.
Bracket expressions
Matches any single character in the list within the brackets. The following operators are allowed within the list, but other metacharacters included are treated as literals:
Range operator: -
POSIX character class: [: :]
POSIX collation element: [. .]
POSIX character equivalence class: [= =]
A dash (-) is a literal when it occurs first or last in the list, or as an ending range point in a range expression, as in [#--]. A right bracket (]) is treated as a literal if it occurs first in the list.
POSIX Character Class (can be a part of the bracket expression):
[:class:] - Matches any character belonging to the specified POSIX character class. You can use this operator to search for characters with specific formatting such as uppercase characters, or you can search for special characters such as digits or punctuation characters. The full set of POSIX character classes is supported.
...
The expression [[:upper:]]+ searches for one or more consecutive uppercase characters.
Bracket expressions can be considered a kind of a "container" construct for multiple atoms, that, as a whole regex unit, matches some class of characters you defined. If you need to match a <, or >, or letters, you may combine them into 1 bracket expression [<>[:alpha:]]. To match zero or more of <, > or letters, add a * quantifier after ]: [<>[:alpha:]]*.
Or, to imitate a trailing word boundary, one might use [^_[:alnum:]] (say, in a ($|[^_[:alnum:]]) pattern) that matches any character but a _, digits and letters ([:alnum:] matches alphanumerical symbols).

How to separate words characters and non word characters?

Unicode have categories of characters. Some are alpha numeric. Some are punctuation.
What about if I want to know whether a word belongs to keyword or not
For example,
A,a,b,c, tend to belong to words. So is Ƈ,Ǝ,ǟ, so are all chinese characters.
Sentences like
Hello World, I "like" (to) eat ƇƎǟ and 款开源 ©
Have keywords:
Hello
World
I
like
to
eat
ƇƎǟ
款
开
源
Here, , (),© are not word characters and hence should just be ignored and use.
© doesn't count as punctuation either. '©'.IsPunctuation returns false in vb.net but I want to get rid of that too.
Now I want to make a program that can split sentences into keywords. For that I need to know which characters are word characters and which one is not.
Is there a vb.net function for that?
Do it the other way round: use IsLetter for your test. Or better yet, use regular expressions to split your string by words:
Dim str = "Hello World, I ""like"" (to) eat ƇƎǟ and 款开源 ©"
Dim wordPattern As New Regex("\p{L}+")
For Each match in wordPattern.Matches(str))
Console.WriteLine(match)
Next
Here, \p{L} matches any word character. However, the above matches “款开源” in a single rather than in separate matches since there is no separator between the characters.
u need to deal with "keycodes"
like if u only want letters [a-z]
then
for(c>='a' && c<='z'){
}
or
for(c>=97 && C<=122){
}