Difference between single pipe and double pipe in Raku Regex (| Vs ||)

There are two types of alternation in Raku's regex: | and ||. What is the difference?
say 'foobar' ~~ / foo || foobar / # 「foo」
say 'foobar' ~~ / foo | foobar / # 「foobar」

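The || is the old alternation behaviour: the alternatives are tried in the order they are declared, from first to last, and the first one that matches wins.
The | tries the alternatives from the longest to the shortest declarative atom, regardless of the order in which they are written. This is called the Longest Token Matching (LTM) strategy.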
say 'foobar' ~~ / foo || foobar / # 「foo」 is the first declared
say 'foobar' ~~ / foo | foobar / # 「foobar」 is the longest token
More detailed answer in this post
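The two strategies can be imitated outside Raku. Below is a minimal Python sketch, assuming that "longest" simply means the longest text matched at the start of the string; Raku's real rule is based on declarative prefixes and is more subtle. The helper names first_match and longest_match are made up for this illustration.

```python
import re

def first_match(alternatives, text):
    """Ordered alternation (like ||): the first alternative that matches wins."""
    for alt in alternatives:
        m = re.match(alt, text)
        if m:
            return m.group(0)
    return None

def longest_match(alternatives, text):
    """Longest-match alternation (roughly like |): try all, keep the longest."""
    matches = [m.group(0) for alt in alternatives if (m := re.match(alt, text))]
    return max(matches, key=len, default=None)

alternatives = ["foo", "foobar"]
print(first_match(alternatives, "foobar"))    # foo
print(longest_match(alternatives, "foobar"))  # foobar
```

Python's own | behaves like Raku's ||: re.match("foo|foobar", "foobar") stops at foo.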

Related

Multiple Patterns in Regex

Can there be multiple patterns in Regexp_Replace?
Pattern 1 : '^#.*'
Pattern 2: '^//.*'
Pattern 3 : '^&&.*'
I want all three patterns in same regexp_replace function like
select REGEXP_REPLACE ('Unit testing last level','Pattern 1,Pattern 2,Pattern 3','',1,0,'m')
from dual;
You can use an alternation group where all alternative branches are |-separated.
^(#|//|&&).*
The (...) forms a grouping construct where you may place #, //, && and any other possible "branches". The | is an alternation operator.
The pattern will match:
^ - start of a line (as you are passing m match_parameter)
(#|//|&&) - either #, // or &&
.* - any 0+ chars other than a newline (since n match_parameter is not used).
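The same alternation group can be tried quickly outside the database. This is a minimal Python sketch, not Oracle code: re.MULTILINE stands in for the m match_parameter, and the sample text is invented for the illustration.

```python
import re

# Hypothetical sample input: three comment-like lines and one ordinary line.
text = "# hash comment\n// slash comment\n&& ampersand line\nUnit testing last level"

# re.MULTILINE makes ^ match at the start of every line, like Oracle's 'm'.
pattern = re.compile(r"^(#|//|&&).*", re.MULTILINE)

# . does not match a newline here, so the line breaks themselves survive.
cleaned = pattern.sub("", text)
print(repr(cleaned))  # '\n\n\nUnit testing last level'
```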

ANTLR4 - How do I get the token TYPE as the token text in ANTLR?

Say I have a grammar that has tokens like this:
AND : 'AND' | 'and' | '&&' | '&';
OR : 'OR' | 'or' | '||' | '|' ;
NOT : 'NOT' | 'not' | '~' | '!';
When I visualize the ParseTree using TreeViewer or print the tree using tree.toStringTree(), each node's text is the same as what was matched.
So if I parse "A and B or C", the two binary operators will be "and" / "or".
If I parse "A && B || C", they'll be "&&" / "||".
What I would LIKE is for them to always be "AND" / "OR" / "NOT", regardless of what literal symbol was matched. Is this possible?
This is what the vocabulary is for. Use yourLexer.getVocabulary() or yourParser.getVocabulary() and then vocabulary.getSymbolicName(tokenType) for the text representation of the token type. If that returns nothing (null in the Java runtime, an empty string in some other targets), try vocabulary.getLiteralName(tokenType) as a second step, which returns the text used to define the token.

Objective-C operator precedence of square brackets used as message expression/notation?

Does the Objective-C message expression (message notation) operator, which uses square brackets [], have the same precedence as the C operator for array subscripting, which also uses square brackets []?
I refer to this table of C operators.
Also, an analogous question applies to the Objective-C "dot syntax" operator for accessor method invocation compared to the C operator for "element selection by reference". Do they have the same precedence?
I searched for an hour for a straightforward, definitive answer to this basic question. Surprisingly, I did not find one. Hence, this question. Links welcome.
You have become confused because many explanations of grammars and precedence take the shortcut of saying that operators have precedence. They don't. It is productions in the grammar that have precedence, and they have precedence relative to other productions. It is only meaningful for two productions to have precedence relative to each other if the grammar is ambiguous (meaning it can produce two different parse trees for the same input), and if the ambiguity is resolved by specifying the precedence of one production over the other.
Let me explain with an example.
Here's a toy grammar:
expression =
| IDENTIFIER
| NUMBER
| expression '+' expression
| expression '*' expression
| expression '(' expression ')' // function call
| '(' expression ')' // grouping
| expression '[' expression ']' // array subscript
| '[' expression IDENTIFIER ':' expression ']' // message send
;
Now, consider parsing 1 + 2 * 3 with this grammar. There are two valid parse trees:
  +          *
 / \        / \
1   *      +   3
   / \    / \
  2   3  1   2
By specifying that the * production has a higher precedence than the + production, we require the parser to produce the left tree instead of the right tree. Thus the idea of a precedence relationship between the + production and the * production makes sense: it has an effect on the parser's output.
Similarly, 1 + foo(3) has two parse trees:
  +            ()
 / \          /  \
1   ()       +    3
   /  \     / \
 foo   3   1   foo
So again the idea of a precedence relationship between the '+' production and the function call production makes sense. The case of 1 + foo[3] (which uses the subscript production in place of the function call production) is analogous, so it makes sense to specify a precedence relationship between the '+' production and the subscript production.
Now consider 1 + (2 * 3). The grammar can only produce one possible parse tree:
  +
 / \
1  ( )
    |
    *
   / \
  2   3
There is no need for a precedence relationship between the + production and the grouping production, because there is only one way to parse this input. It would be meaningless to specify that the grouping production has higher precedence than the + production, because there is no other parse tree that you could produce by doing so.
Finally, consider 1 + [2 add:3]. This is analogous to the grouping example. There is only one possible parse tree:
    +
   / \
  /   \
 1    [ ]
     / | \
    /  |  \
   2  add  3
No other parse tree is possible. There is no need to specify a precedence relationship between the + production and the message send production. Specifying a precedence relationship between them would have no effect, because the grammar simply doesn't allow this input to be parsed any other way.
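The point can be checked with a small experiment. Here is a minimal Python sketch of a precedence-climbing parser for a cut-down version of the toy grammar (numbers, +, * and grouping only); the function names and the tuple representation of trees are assumptions made for this illustration. Swapping the precedences changes the tree for 1 + 2 * 3 but not for 1 + (2 * 3).

```python
import re

def tokenize(src):
    """Split the input into numbers, operators and parentheses."""
    return re.findall(r"\d+|[+*()]", src)

def parse(src, prec):
    """Parse src; prec maps each binary operator to its precedence."""
    tokens = tokenize(src)
    pos = 0

    def primary():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            inner = expr(0)   # precedence starts over inside a group
            pos += 1          # skip the closing ')'
            return ("group", inner)
        return int(tok)

    def expr(min_prec):
        nonlocal pos
        left = primary()
        while (pos < len(tokens) and tokens[pos] in prec
               and prec[tokens[pos]] >= min_prec):
            op = tokens[pos]
            pos += 1
            right = expr(prec[op] + 1)  # +1 makes operators left-associative
            left = (op, left, right)
        return left

    return expr(0)

usual = {"+": 1, "*": 2}
swapped = {"+": 2, "*": 1}

print(parse("1 + 2 * 3", usual))      # ('+', 1, ('*', 2, 3))
print(parse("1 + 2 * 3", swapped))    # ('*', ('+', 1, 2), 3)
print(parse("1 + (2 * 3)", usual) == parse("1 + (2 * 3)", swapped))  # True
```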
They are in the same precedence group. I believe the message send [] is equivalent to () because the runtime treats the brackets as parentheses in the case of messages.
http://www.techotopia.com/index.php/Objective-C_2.0_Operator_Precedence

regex - lazy and non-capturing

String to search:
VALUES ('9gfdg', to_date('1876/12/06','YYYY/MM/DD'), null)
Regex search so far:
VALUES\s*\(\s*'?\s*(.+?)\s*'?\s*,\s*'?\s*(.+?)\s*'?\s*,\s*'?\s*(.+?)\s*'?\s*\)
Regex replace to 3 groups: ie \1 \2 \3
I am aiming for a result of:
9gfdg to_date('1876/12/06' ,'YYYY/MM/DD') null
but instead get (because of that extra comma in to_Date and also lazy instead of greedy):
9gfdg to_date('1876/12/06 YYYY/MM/DD , null)
Note:
It is exactly 3 fields (the values within the 3 fields may be different, but you get the idea of the format I am grappling with). That is, each of the fields could have commas; a field is usually a character value, but it could also be a keyword such as null, a number, or a to_date expression.
Regex engine is VBA/VBScript
Anyone have any pointers on fixing up this regex?
Here is a solution.
Notice the regex for $field: it is yet another application of the normal* (special normal*)* pattern, with normal being anything but a comma ([^,]) and special a comma as long as it is not followed by two single quotes (,(?!'')). The first normal, however, is made non empty using + instead of *.
Demonstration code in perl. The string concatenation operator in perl is a dot:
fge@erwin $ cat t.pl
#!/usr/bin/perl -W
use strict;
# Value separator: a comma optionally surrounded by spaces
my $value_separator = '\s*,\s*';
# Literal "null", and a number
my $null = 'null';
my $number = '\d+';
# Text field
my $normal = '[^,]'; # Anything but a comma
my $special = ",(?!'')"; # A comma, _not_ followed by two single quotes
my $field = "'$normal+(?:$special$normal*)*'"; # a text field
# A to_date() expression
my $to_date = 'to_date\(\s*' . $field . $value_separator . $field . '\s*\)';
# Any field
my $any_field = '(' . $null . '|' . $number . '|' . $field . '|' . $to_date . ')';
# The full regex
my $full_regex = '^\s*VALUES\s*\(\s*' . $any_field . $value_separator . $any_field
. $value_separator . $any_field . '\s*\)\s*$';
# This builds a compiled form of the regex
my $re = qr/$full_regex/;
# Read from stdin, try and match (m//), if match, print the three captured groups
while (<STDIN>) {
m/$re/ and print <<EOF;
Argument 1: -->$1<--
Argument 2: -->$2<--
Argument 3: -->$3<--
EOF
}
Demonstration output:
fge@erwin ~ $ perl t.pl
VALUES ('9gfdg', to_date('1876/12/06','YYYY/MM/DD'), null)
Argument 1: -->'9gfdg'<--
Argument 2: -->to_date('1876/12/06','YYYY/MM/DD')<--
Argument 3: -->null<--
VALUES('prout', 'ma', 'chere')
Argument 1: -->'prout'<--
Argument 2: -->'ma'<--
Argument 3: -->'chere'<--
VALUES(324, 'Aiie, a comma', to_date('whatever', 'is there, even commas'))
Argument 1: -->324<--
Argument 2: -->'Aiie, a comma'<--
Argument 3: -->to_date('whatever', 'is there, even commas')<--
One thing to note: you will notice that I don't ever use any lazy quantifiers, and not even the dot!
edit: special in a field is actually a comma not followed by two single quotes, not one
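For readers without perl at hand, here is a rough Python port of the same regex, built from the same pieces. It is a sketch only: Python's re module is assumed to treat these constructs (negated classes, the (?!'') lookahead, non-capturing groups) the same way perl does, and the helper name split_values is invented for this example. The VBScript engine may still differ.

```python
import re

SEP = r"\s*,\s*"        # value separator: a comma optionally surrounded by spaces
NORMAL = r"[^,]"        # anything but a comma
SPECIAL = r",(?!'')"    # a comma not followed by two single quotes

# Text field: normal+ (special normal*)* between single quotes
FIELD = rf"'{NORMAL}+(?:{SPECIAL}{NORMAL}*)*'"
TO_DATE = rf"to_date\(\s*{FIELD}{SEP}{FIELD}\s*\)"
ANY_FIELD = rf"(null|\d+|{FIELD}|{TO_DATE})"

FULL = re.compile(
    rf"^\s*VALUES\s*\(\s*{ANY_FIELD}{SEP}{ANY_FIELD}{SEP}{ANY_FIELD}\s*\)\s*$"
)

def split_values(line):
    """Return the three captured fields, or None if the line does not match."""
    m = FULL.match(line)
    return m.groups() if m else None

print(split_values("VALUES ('9gfdg', to_date('1876/12/06','YYYY/MM/DD'), null)"))
```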
If only the second parameter can have commas in it, you could do something like:
^VALUES\s*\(\s*'?([^',]*)'?\s*,\s*(.*?)\s*,\s*'?([^',]*)'?\s*\)$
Otherwise, I don't know what features that regex flavor supports, so it is hard to make something more elaborate. You could always write a limited-depth nested-parentheses regex if (?R) is not supported.
For the more general case you could try something like:
^\s*
VALUES\s*
\(
\s*
(?: '([^']*)' | ( \w+ (?: \( [^()]* \) )? ) )
\s*,\s*
(?: '([^']*)' | ( \w+ (?: \( [^()]* \) )? ) )
\s*,\s*
(?: '([^']*)' | ( \w+ (?: \( [^()]* \) )? ) )
\s*
\)\s*
$
Spaces removed:
^\s*VALUES\s*\(\s*(?:'([^']*)'|(\w+(?:\([^()]*\))?))\s*,\s*(?:'([^']*)'|(\w+(?:\([^()]*\))?))\s*,\s*(?:'([^']*)'|(\w+(?:\([^()]*\))?))\s*\)\s*$
Replace with:
\1\2 \3\4 \5\6
Should work for one nested level of parentheses, as long as there are no quoted parentheses inside them.
PS: Not tested. You can usually use the spaced regex if your flavor supports the /x flag.
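A quick way to try the spaced regex is Python's re.VERBOSE flag, which plays the role of /x. This is a sketch under assumptions: it runs in Python rather than VBScript, the helper name flatten is made up, and it relies on Python 3.5+ replacing unmatched groups with an empty string in the substitution.

```python
import re

# One field: either a quoted string (quotes dropped) or a word with optional (...)
FIELD = r"(?: '([^']*)' | ( \w+ (?: \( [^()]* \) )? ) )"

PATTERN = re.compile(
    rf"^\s* VALUES \s* \( \s* {FIELD} \s*,\s* {FIELD} \s*,\s* {FIELD} \s* \) \s* $",
    re.VERBOSE,  # whitespace in the pattern is ignored, like the /x flag
)

def flatten(line):
    """Replace a matching VALUES(...) line with its three fields, space separated."""
    return PATTERN.sub(r"\1\2 \3\4 \5\6", line)

print(flatten("VALUES ('9gfdg', to_date('1876/12/06','YYYY/MM/DD'), null)"))
# 9gfdg to_date('1876/12/06','YYYY/MM/DD') null
```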

Negating inside lexer- and parser rules

How can the negation meta-character, ~, be used in ANTLR's lexer- and parser rules?
Negating can occur inside lexer and parser rules.
Inside lexer rules you can negate characters, and inside parser rules you can negate tokens (lexer rules). But both lexer- and parser rules can only negate either single characters, or single tokens, respectively.
A couple of examples:
lexer rules
To match one or more characters except lowercase ascii letters, you can do:
NO_LOWERCASE : ~('a'..'z')+ ;
(the negation-meta-char, ~, has a higher precedence than the +, so the rule above equals (~('a'..'z'))+)
Note that 'a'..'z' matches a single character (and can therefore be negated), but the following rule is invalid:
ANY_EXCEPT_AB : ~('ab') ;
Because 'ab' (obviously) matches 2 characters, it cannot be negated. To match a token that consists of 2 characters, but not 'ab', you'd have to do the following:
ANY_EXCEPT_AB
: 'a' ~'b' // any two chars starting with 'a' followed by any other than 'b'
| ~'a' . // other than 'a' followed by any char
;
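The same two-branch idea can be tried with an ordinary regex. Below is a minimal Python sketch, not ANTLR code: a[^b] plays the role of 'a' ~'b' and [^a]. the role of ~'a' . (with re.DOTALL so that . really matches any character). The helper name is invented for this illustration.

```python
import re

# Regex counterpart of the two alternatives:  'a' ~'b'  |  ~'a' .
ANY_EXCEPT_AB = re.compile(r"a[^b]|[^a].", re.DOTALL)

def is_two_chars_not_ab(s):
    """True if s is exactly two characters long and is not 'ab'."""
    return ANY_EXCEPT_AB.fullmatch(s) is not None

for candidate in ["ab", "ac", "ba", "bb", "aa"]:
    print(candidate, is_two_chars_not_ab(candidate))
```

Only "ab" is rejected among the candidates; every other two-character string falls into one of the two branches.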
parser rules
Inside parser rules, ~ negates a single token, or a set of alternative tokens. For example, you have the following tokens defined:
A : 'A';
B : 'B';
C : 'C';
D : 'D';
E : 'E';
If you now want to match any token except the A, you do:
p : ~A ;
And if you want to match any token except B and D, you can do:
p : ~(B | D) ;
However, if you want to match any two tokens other than A followed by B, you cannot do:
p : ~(A B) ;
Just as with lexer rules, you cannot negate more than a single token. To accomplish the above, you need to do:
p
: A ~B
| ~A .
;
Note that the . (DOT) char in a parser rule does not match any character as it does inside lexer rules. Inside parser rules, it matches any token (A, B, C, D or E, in this case).
Note that you cannot negate parser rules. The following is illegal:
p : ~a ;
a : A ;