What is the best way to treat repetitions in regexes like abc | cde | abc | cde | cde | abc or <regex1> | <regex2> | <regex3> | <regex4> | <regex5> | <regex6>, where many of the regexN are the same literals?
To explain what I mean, I'll give an example from German. Here is a sample grammar that can parse several present-tense verb forms.
grammar Verb {
token TOP {
<base>
<ending>
}
token base {
geh |
spiel |
mach
}
token ending {
e | # 1sg
st | # 2sg
t | # 3sg
en | # 1pl
t | # 2pl
en # 3pl
}
}
my @verbs = <gehe spielst machen>;
for @verbs -> $verb {
my $match = Verb.parse($verb);
say $match;
}
Endings for 1pl and 3pl (en) are the same, but for the sake of clarity it's more convenient to put them both into the token (in my real-life data the inflectional paradigms are much more complex, and it's easy to get lost). The token ending works as expected, but I understand that if I put en in only once, the program would work a bit faster (I've run tests with regexes consisting of many such repeated elements, and yes, the performance suffers greatly). With my data there are lots of such repetitions, so I wonder what the best way to treat them is.
Of course, I could put the endings in an array outside the grammar, make this array .unique and then just pass the values:
my @endings = < ... >;
@endings .= unique;
...
token ending { @endings }
But taking the data out of the grammar will make it less clear. Also, in some cases it might be necessary to make each ending a separate token (token ending {<ending_1sg> | <ending_2sg> ... <ending_3pl>}), which would be impossible if they were defined outside the grammar.
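That approach does work; here's a runnable sketch (in a Raku regex, an interpolated array matches as an alternation of its elements, longest first):

```raku
my @endings = <e st t en t en>.unique;   # => (e st t en)

grammar Verb {
    token TOP    { <base> <ending> }
    token base   { geh | spiel | mach }
    token ending { @endings }   # matches any element of the array
}

say Verb.parse('spielst');
```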
If I understand you correctly: for the sake of clarity, you want to repeat regex alternatives, each with a comment noting that it represents a separate concept? Just comment the duplicate line out.
By the way, since empty alternatives are ignored in this case, it's fine to begin each line with the alternation operator instead of putting it at the end. That makes things easier, especially when you need to add and remove lines. So I suggest something like this:
grammar Verb {
# ...
token ending {
| e # 1sg
| st # 2sg
| t # 3sg
| en # 1pl
#| t # 2pl
#| en # 3pl
}
}
Because what you're writing is exclusively for the human, not for the parser. Now, if you wanted the different regexes to go into different parse matches, so that you could access the ending as either $<_3sg> or $<_2pl> (named regexes, so both would succeed), then you couldn't comment one out, and you'd have to force the computer to do the extra work; you'd also obviously need :exhaustive or :overlap. Instead, I would suggest you make a single named regex that represents both 3sg and 2pl, and define it like I did above: write them both, but comment one out.
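A runnable sketch of that idea (the token names _1sg, _3sg_2pl, etc. are my own invention; pick whatever naming scheme fits your paradigm):

```raku
grammar Verb {
    token TOP  { <base> <ending> }
    token base { geh | spiel | mach }
    token ending {
        | <_1sg>
        | <_2sg>
        | <_3sg_2pl>   # one named token covering both cells
        | <_1pl_3pl>
    }
    token _1sg     { e }
    token _2sg     { st }
    token _3sg_2pl { t }    # 3sg
    #                         2pl (same form, kept as a comment)
    token _1pl_3pl { en }   # 1pl
    #                         3pl (same form, kept as a comment)
}

say Verb.parse('machen')<ending><_1pl_3pl>;   # 「en」
```

Accessing $<ending><_3sg_2pl> then works for both the 3sg and the 2pl reading without parsing twice.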
Related
I'm trying to write a Raku grammar that can parse commands that ask for programming puzzles.
This is a simplified version just for my question, but the commands combine a difficulty level with an optional list of languages.
Sample valid input:
No language: easy
One language: hard javascript
Multiple languages: medium javascript python raku
I can get it to match one language, but not multiple languages. I'm not sure where to add the :g.
Here's an example of what I have so far:
grammar Command {
rule TOP { <difficulty> <languages>? }
token difficulty { 'easy' | 'medium' | 'hard' }
rule languages { <language>+ }
token language { \w+ }
}
multi sub MAIN(Bool :$test) {
use Test;
plan 5;
# These first 3 pass.
ok Command.parse('hard', :token<difficulty>), '<difficulty> can parse a difficulty';
nok Command.parse('no', :token<difficulty>), '<difficulty> should not parse random words';
# Why does this parse <languages>, but <language> fails below?
ok Command.parse('js', :rule<languages>), '<languages> can parse a language';
# These last 2 fail.
ok Command.parse('js', :token<language>), '<language> can parse a language';
# Why does this not match both words? Can I use :g somewhere?
ok Command.parse('js python', :rule<languages>), '<languages> can parse multiple languages';
}
This works, even though my test #4 fails:
my token wrd { \w+ }
'js' ~~ &wrd; #=> 「js」
Extracting multiple languages works with a regex using this syntax, but I'm not sure how to use that in a grammar:
'js python' ~~ m:g/ \w+ /; #=> (「js」 「python」)
Also, is there an ideal way to make the order unimportant so that difficulty could come anywhere in the string? Example:
rule TOP { <languages>* <difficulty> <languages>? }
Ideally, I'd like for anything that is not a difficulty to be read as a language. Example: raku python medium js should read medium as a difficulty and the rest as languages.
There are two things at issue here.
To specify a subrule in a grammar parse, the named argument is always :rule, regardless of whether it's declared in the grammar as a rule, token, method, or regex. Your first two tests are passing because they represent valid full-grammar parses (that is, starting from TOP): the unknown :token named argument is simply ignored.
That gets us:
ok Command.parse('hard', :rule<difficulty>), '<difficulty> can parse a difficulty';
nok Command.parse('no', :rule<difficulty>), '<difficulty> should not parse random words';
ok Command.parse('js', :rule<languages> ), '<languages> can parse a language';
ok Command.parse('js', :rule<language> ), '<language> can parse a language';
ok Command.parse('js python', :rule<languages> ), '<languages> can parse multiple languages';
# Output
ok 1 - <difficulty> can parse a difficulty
ok 2 - <difficulty> should not parse random words
ok 3 - <languages> can parse a language
ok 4 - <language> can parse a language
not ok 5 - <languages> can parse multiple languages
The second issue is how implied whitespace is handled in a rule. In a token, the following are equivalent:
token foo { <alpha>+ }
token bar { <alpha> + }
But in a rule, they would be different. Compare the token equivalents for the following rules:
rule foo { <alpha>+ }
token foo { <alpha>+ <.ws> }
rule bar { <alpha> + }
token bar { [<alpha> <.ws>] + }
In your case, you have <language>+, and since language is \w+, it's impossible to match two languages: the first one consumes all the \w characters, and nothing in the pattern matches the space between them. The easy solution: just change <language>+ to <language> +.
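Applied to the grammar from the question, the only change is the space before the + (a sketch):

```raku
grammar Command {
    rule  TOP        { <difficulty> <languages>? }
    token difficulty { 'easy' | 'medium' | 'hard' }
    rule  languages  { <language> + }   # space before '+': <.ws> after each <language>
    token language   { \w+ }
}

say Command.parse('js python', :rule<languages>);   # now matches both words
```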
To allow the <difficulty> token to float around, the first solution that jumps to my mind is to match it and bail in a <language> token:
token language { <!difficulty> \w+ }
<!foo> fails if <foo> can match at that position. This will work almost perfectly, until you get a language like 'easyFoo'. The easy fix is to ensure that the difficulty token always ends at a word boundary:
token difficulty {
[
| easy
| medium
| hard
]
>>
}
where >> asserts a word boundary on the right.
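Putting both pieces together, here's a sketch of the floating-difficulty grammar (note the space before each *, for the same rule-whitespace reason explained above):

```raku
grammar Command {
    rule  TOP        { <language> * <difficulty> <language> * }
    token difficulty { [ easy | medium | hard ] >> }
    token language   { <!difficulty> \w+ }
}

my $m = Command.parse('raku python medium js');
say $m<difficulty>;   # 「medium」
```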
I have a number of hosts (servers) and I want to search through all of them except for 4 different ones.
Here is what I have working at the moment to exclude the 4 servers:
(host!="ajl2dal8" OR host!="ajl2dal9" OR host!="ajl2atl8" OR host!="ajl2atl9")
While this works fine, it's fairly sizable and will only get longer if I need to exclude more. Since they all begin with ajl2 and contain either atl or dal plus a number, is there any way I could get something like this to work:
(host!="ajl2[atl|dal][1|2|3|4]")
The search command (which is implied before the first pipe) does not support regular expressions. You can use wildcards, however, as in (host!="ajl2*"). You can use regular expressions after the first pipe with the where or regex commands. Note that in a regular expression, alternation is written with parentheses rather than square brackets, and the digits for your four hosts are 8 and 9, so the pattern is ajl2(atl|dal)[89]:
... | where NOT match(host, "ajl2(atl|dal)[89]") | ...
... | regex host!="ajl2(atl|dal)[89]" | ...
In the ANTLRv4 grammar that one can find in the grammars-v4 repository (https://github.com/antlr/grammars-v4/blob/master/antlr4/ANTLRv4Parser.g4) the optional rule ebnfSuffix is:
- sometimes matched using ebnfSuffix? (see lexerElement),
- sometimes matched using (ebnfSuffix | ) (see element).
I was wondering, and am asking here as well, whether the two have slightly different meanings.
The grammars-v4 repository has another example, in https://github.com/antlr/grammars-v4/blob/master/cql3/CqlParser.g4, of the same two patterns: the beginBatch rule is used either as an optional element or together with an empty alternative.
EDIT: I've added here the part of the grammar I'm referring to as suggested:
lexerElement
: labeledLexerElement ebnfSuffix? <-- case 1: optional rule
| lexerAtom ebnfSuffix?
| lexerBlock ebnfSuffix?
| actionBlock QUESTION?
;
element
: labeledElement (ebnfSuffix |) <-- case 2: block with empty alternative
| atom (ebnfSuffix |)
| ebnf
| actionBlock QUESTION?
;
Both ebnfSuffix? and (ebnfSuffix | ) result in exactly the same behaviour: they (greedily) optionally match ebnfSuffix.
The fact that they're both being used in a grammar could be because it was translated from some spec (or other grammar) that used that notation and that notation didn't have the ? operator, but that's just guessing.
Personally I'd just use ebnfSuffix?.
I am trying to parse a BibTeX author field using the following grammar:
use v6;
use Grammar::Tracer;
# Extract BibTeX author parts from string. The parts are separated
# by a comma and optional space around the comma
grammar Author {
token TOP {
<all-text>
}
token all-text {
[<author-part> [[\s* ',' \s*] || [\s* $]]]+
}
token author-part {
[<-[\s,]> || [\s* <!before ','>]]+
}
}
my $str = "Rockhold, Mark L";
my $result = Author.parse( $str );
say $result;
Output:
TOP
| all-text
| | author-part
| | * MATCH "Rockhold"
| | author-part
But here the program hangs (I have to press CTRL-C to abort).
I suspect the problem is related to the negative lookahead assertion. I tried removing it, and then the program no longer hangs, but I am also no longer able to extract the last part, "Mark L", which contains an internal space.
Note that for debugging purposes, the Author grammar above is a simplified version of the one used in my actual program.
The expression [\s* <!before ','>] may not make any progress. Since it's in a quantifier, it will be retried again and again (but not move forward), resulting in the hang observed.
Such a construct will reliably hang at the end of the string; doing [\s* <!before ',' || $>] fixes it by making the lookahead fail at the end of the string also (being at the end of the string is a valid way to not be before a ,).
At least for this simple example, it looks like the whole author-part token could just be <-[,]>+, but perhaps that's an oversimplification for the real problem that this was reduced from.
Glancing at all-text, I'd also point out the % quantifier modifier, which makes matching comma-separated (or anything-separated, really) lists easier.
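As a sketch, the whole grammar could then shrink to something like this (using the oversimplified author-part, with the separator allowing optional whitespace around the comma):

```raku
grammar Author {
    token TOP         { <author-part>+ % [\s* ',' \s*] }
    token author-part { <-[,]>+ }
}

say Author.parse('Rockhold, Mark L');   # two author parts: 「Rockhold」 and 「Mark L」
```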
I want to parse this
VALID_EMAIL_REGEX = /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z]+)*\.[a-z]+\z/i
and other variations of course of regular expressions.
Does someone know how to do this properly?
Thanks in advance.
Edit: I tried throwing in all regex signs and chars in one lexer rule like this
REGEX: ( DIV | ('i') | ('#') | ('[') | (']') | ('+') | ('.') | ('*') | ('-') | ('\\') | ('(') | (')') | ('A') | ('w') | ('a') | ('z') | ('Z') );
and then make a parser rule like this:
regex_assignment: (REGEX)+ ;
but there are recognition errors (extraneous input). This is definitely because these characters are, of course, already used in other rules.
The thing is, I don't actually need to process these regex assignments; I just want them to be recognized correctly, without errors. Does anyone have an approach for this in ANTLR? A solution that simply recognizes this as a regex and skips it would suffice.
Unfortunately, there is no regex grammar in the ANTLR grammar repository yet, but similar questions have come up before, e.g. Regex Grammar. Once you have the (E)BNF, you can convert it to ANTLR. Alternatively, you can use the BNF grammar to check whether your own grammar rules are correctly defined. Simply throwing all possible input characters together in a single rule won't work.