Proper way to resolve ANTLR lexer rule ambiguities?

Please see the source code available at: https://gist.github.com/1684022.
I've got two tokens defined:
ID : ('a'..'z' | 'A'..'Z') ('0'..'9' | 'a'..'z' | 'A'..'Z' | ' ')*;
PITCH
: (('A'|'a') '#'?)
| (('B'|'b') '#'?)
| (('C'|'c') '#'?);
Obviously, the letter "A" is ambiguous: it matches both ID and PITCH.
I further define:
note : PITCH;
name : ID;
main : name ':' note '\n'?;
Now, if I enter "A:A" as input to the parser, I always get an error. Either the parser expects PITCH or ID depending on whether ID or PITCH is defined first:
mismatched input 'A' expecting ID
What is the proper way to resolve this so that it works as intended?
As described, the way the parsing should work makes intuitive sense, but ANTLR doesn't do the "right thing". Even though the main rule says a name/ID should come first, the lexer is ignorant of this and identifies "A" as a PITCH, because it follows the "longest match"/"which comes first" rule rather than the more reasonable "what the rule says" rule.
Is the only solution to fake/hack it by matching both ID and PITCH, and then recombining them later as dasblinkenlight says?
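To make the behavior concrete, here is a minimal Python sketch of how a lexer of this kind picks tokens: the longest match wins, and ties go to the rule listed first. It is a simulation, not ANTLR itself; the `tokenize` helper and the `COLON` rule name (standing in for the literal ':') are assumptions made for the illustration.

```python
import re

def tokenize(text, rules):
    """Split text into (rule_name, lexeme) pairs: longest match wins,
    and on a tie the rule listed first is kept."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly-greater comparison keeps the earlier rule on a tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Regex equivalents of the two rules from the question.
ID = ("ID", r"[a-zA-Z][0-9a-zA-Z ]*")
PITCH = ("PITCH", r"[A-Ca-c]#?")
COLON = ("COLON", r":")  # hypothetical name for the ':' literal

pitch_first = tokenize("A:A", [PITCH, ID, COLON])
id_first = tokenize("A:A", [ID, PITCH, COLON])
print(pitch_first)
print(id_first)
```

Whichever rule comes first claims both occurrences of "A", so one side of the ':' always ends up with the wrong token type. The lexer never consults the parser rule to decide.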

Here is how I would re-factor this grammar to make it work:
ID : (('a'..'z' | 'A'..'Z') ('0'..'9' | 'a'..'z' | 'A'..'Z' | ' ')+)
| ('d'..'z' | 'D'..'Z');
PITCH : 'a'..'c' | 'A'..'C';
SHARP : '#';
note : PITCH SHARP?;
name : ID | PITCH;
main : name ':' note '\n'? EOF;
This separates long names from one-character pitch names, which get "reunited" in the parser. Also the "sharp" token gets its own name, and gets recognized in the parser as an optional token.
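The effect of this refactoring can be checked with a small Python simulation. This is only a sketch: the rules are rewritten as regular expressions, `COLON` and `NL` are hypothetical names for the ':' and '\n' literals, and the parser rule is approximated by a regular expression over token type names.

```python
import re

# Lexer rules in grammar order. COLON and NL stand in for the literals ':' and '\n'.
LEXER_RULES = [
    ("ID", re.compile(r"[a-zA-Z][0-9a-zA-Z ]+|[d-zD-Z]")),
    ("PITCH", re.compile(r"[a-cA-C]")),
    ("SHARP", re.compile(r"#")),
    ("COLON", re.compile(r":")),
    ("NL", re.compile(r"\n")),
]

# main : name ':' note '\n'? EOF  with  name : ID | PITCH  and  note : PITCH SHARP?
MAIN = re.compile(r"(ID|PITCH) COLON PITCH( SHARP)?( NL)?")

def token_types(text):
    """Return the token type names for text (longest match, earlier rule on ties)."""
    types, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in LEXER_RULES:
            m = pattern.match(text, pos)
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        types.append(best_name)
        pos += best_len
    return types

def accepts(text):
    """True when the token sequence for text fits the main rule."""
    return MAIN.fullmatch(" ".join(token_types(text))) is not None

print(token_types("A:A"), accepts("A:A"))
print(token_types("Song:A#"), accepts("Song:A#"))
```

A single letter in the range a..c is now always a PITCH, and the parser rule for name is what lets it act as a name.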

Related

Parsing letter ranges with ANTLR

I have the following parser rules:
defDirective : defType whiteSpace letterSpec (whiteSpace? COMMA whiteSpace? letterSpec)*;
defType :
DEFBOOL | DEFBYTE | DEFINT | DEFLNG | DEFLNGLNG | DEFLNGPTR | DEFCUR |
DEFSNG | DEFDBL | DEFDATE |
DEFSTR | DEFOBJ | DEFVAR
;
letterSpec : universalLetterRange | letterRange | singleLetter;
singleLetter : RESTRICTED_LETTER;
universalLetterRange : upperCaseA whiteSpace? MINUS whiteSpace? upperCaseZ;
upperCaseA : {_input.Lt(1).Text.Equals("A")}? RESTRICTED_LETTER;
upperCaseZ : {_input.Lt(1).Text.Equals("Z")}? RESTRICTED_LETTER;
letterRange : firstLetter whiteSpace? MINUS whiteSpace? lastLetter;
firstLetter : RESTRICTED_LETTER;
lastLetter : RESTRICTED_LETTER;
whiteSpace : (WS | LINE_CONTINUATION)+;
with the relevant Lexer Rules:
RESTRICTED_LETTER : [a-zA-Z];
MINUS : '-';
COMMA : ',';
WS : [ \t];
LINE_CONTINUATION : [ \t]* UNDERSCORE [ \t]* '\r'? '\n';
and the DefTypes matching their camel-case spelling.
Now when I try to test this on the following inputs, it works exactly as expected:
DefInt I,J,K
DefBool A-Z
It does not work, however, on arbitrary letter ranges (see rule letterRange). When I use the input DefByte B-F, I get the error message "line 1:8 mismatched input 'B' expecting RESTRICTED_LETTER".
I've tried expressing RESTRICTED_LETTER as a range ('A'..'Z'|'a'..'z'), but that didn't change anything about the error message.
When changing the first whiteSpace in defDirective to whiteSpace+ the error message gets a little longer (now including WS and LINE_CONTINUATION in the expected alternatives).
Also the parse-tree generated by the IntelliJ ANTLR Plugin suddenly starts recognizing the F as a singleLetter, which it previously didn't.
This behaviour seems to be consistent between the target languages Java and CSharp.
Previously the rule used to be a lot more relaxed, but that led to incorrect parse-trees, so I kinda want to fix this.
How can I correctly recognize letterRange here?

So ... @BartKiers had the right suspicion. The given lexer rules weren't all the rules involved in the process.
The full grammar contains a lexer rule B_CHAR : B that's used in a special case of an unrelated grammar rule. That B_CHAR took precedence over RESTRICTED_LETTER when lexing the input stream.
The grammar rules presented are correct (and work fine), but the B_CHAR token needs to be removed from the Tokens lexed.
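A quick way to see the shadowing is to simulate which rule claims each character of the input. The sketch below is an illustration only: the thread does not show how B_CHAR is defined, so the `[Bb]` pattern is an assumption, as is the `token_type` helper.

```python
import re

# For single-character tokens all matches have equal length, so the
# earlier rule wins. B_CHAR's definition is not shown in the thread;
# [Bb] is an assumption.
RULES = [
    ("B_CHAR", r"[Bb]"),
    ("RESTRICTED_LETTER", r"[a-zA-Z]"),
    ("MINUS", r"-"),
]

def token_type(ch):
    """Return the name of the first rule that matches the single character ch."""
    for name, pattern in RULES:
        if re.fullmatch(pattern, ch):
            return name
    return None

b_to_f = [token_type(c) for c in "B-F"]
a_to_z = [token_type(c) for c in "A-Z"]
print(b_to_f)
print(a_to_z)
```

The range A-Z lexes as expected, while the B in B-F never reaches the parser as a RESTRICTED_LETTER.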

antlr4 - conflicting rules, how to fix

I have the following rules:
property : NAME;
value : STRING | NUMBER;
NUMBER : ('0'..'9')+;
NAME : ('a'..'z' | 'A'..'Z' | '0'..'9' | '-' | '_')+;
STRING : '"' (~'"')* '"';
When a property is a number, ANTLR says:
line 1:14 mismatched input '5' expecting NAME
I understand why this happens. The NUMBER rule is mentioned before the NAME rule, so it has precedence. The number is recognized by the NUMBER rule.
What is the common way to handle this in ANTLR? I could rewrite the property rule as follows, but I don't really know if it is a good idea, as I am introducing redundancy.
property : NAME | NUMBER;
Re-ordering NUMBER and NAME isn't a good idea either, as it will break the value rule for numbers (same problem).
Important to note: I am fairly new to ANTLR and am still learning.

Yes, property : NAME | NUMBER; is the way to do it.
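As a rough illustration of why the parser rule has to accept both token types, the following Python sketch classifies the first token of an input using the two rules from the question. The `first_token` helper and `PROPERTY_TYPES` set are hypothetical names; the regular expressions mirror NUMBER and NAME.

```python
import re

# Rules in grammar order: NUMBER is listed before NAME.
RULES = [("NUMBER", r"[0-9]+"), ("NAME", r"[a-zA-Z0-9_-]+")]

def first_token(text):
    """Return (rule_name, lexeme) for the first token: longest match, earlier rule on ties."""
    best = None
    for name, pattern in RULES:
        m = re.match(pattern, text)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# property : NAME | NUMBER;
PROPERTY_TYPES = {"NAME", "NUMBER"}

for sample in ("5", "5px", "width"):
    kind, lexeme = first_token(sample)
    print(sample, kind, kind in PROPERTY_TYPES)
```

Only an all-digit input becomes a NUMBER; anything longer that NAME can match stays a NAME, so listing both types in property covers every case.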

Capturing formatted variable declarations in ANTLR

I have a simple lexer/grammar I've been working on and I'm having trouble understanding the standard operating procedure for matching formatted variables. I am trying to match the following:
Variable name can be 1 character minimum. If it is one char, it must be an uppercase or lowercase letter.
If it is greater than 1 character, it must begin with a letter of any case, and then be followed by any number of characters, including numbers, underscore and the dollar sign.
I've rewritten this several times, in many flavors, and I always get the following error:
Decision can match input such as "SINGLELETTER" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Would really appreciate some insight. I understand there is some ambiguity in my grammar, but I am a bit confused why multiple alternatives can be matched, once we enter the original matching loop. Thank you!
variablename
: (SINGLELETTER)
| (SINGLELETTER|UNDERSCORE)( SINGLELETTER|UNDERSCORE | DOLLAR | NUMBER)*;
SINGLELETTER : ( 'a'..'z' | 'A'..'Z');
fragment LOWERCASE : 'a'..'z';
fragment UNDERSCORE : '_';
fragment DOLLAR : '$';
fragment NUMBER : '0'..'9';

Why not make Variablename a lexer rule, which produces a single token for the entire name?
Variablename
: SINGLELETTER
| (SINGLELETTER|UNDERSCORE) (SINGLELETTER | UNDERSCORE | DOLLAR | NUMBER)*;
fragment SINGLELETTER : ( 'a'..'z' | 'A'..'Z');
fragment LOWERCASE : 'a'..'z';
fragment UNDERSCORE : '_';
fragment DOLLAR : '$';
fragment NUMBER : '0'..'9';
Also, the way you wrote variablename does not follow point #2 of your description: the grammar allows the variable to start with _, but your explanation does not.
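For comparison, the single-token rule can be written as one regular expression. The Python sketch below shows two hypothetical variants: `AS_WRITTEN` mirrors the lexer rule above (a leading underscore is allowed), while `AS_DESCRIBED` follows the prose description (the first character must be a letter).

```python
import re

# Mirrors the Variablename lexer rule: letter or underscore first.
AS_WRITTEN = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")

# Follows the written description: a letter first, then letters,
# digits, underscore or dollar sign.
AS_DESCRIBED = re.compile(r"[A-Za-z][A-Za-z0-9_$]*")

for name in ("x", "a$1", "_tmp", "1a"):
    print(name,
          AS_WRITTEN.fullmatch(name) is not None,
          AS_DESCRIBED.fullmatch(name) is not None)
```

Both patterns already cover the one-character case, which suggests the separate SINGLELETTER alternative in the rule is redundant.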

multi alternative and rule of thumb of grammar granularity

I asked related questions here and here, now I have a new question, but really I am asking for some general rule of thinking.
Here is the grammar:
grammar post2;
post2: action_cmd+
;
action_cmd
: cmd_name action_cmd_def
;
action_cmd_def
: (cmd_chars | cmd_literal)+ Semi_colon
;
cmd_name
: 'a'..'z' ('a'..'z' | '0'..'9' | '_' )*
;
cmd_chars
: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '.' | ':' | '-' |'\\')
;
cmd_literal
: SINGLE_QUOTE ~(SINGLE_QUOTE | '\n' | '\r') SINGLE_QUOTE
;
SINGLE_QUOTE
: '\''
;
Semi_colon
: ';'
;
WS : ('\t' | ' ')+ {$channel = HIDDEN;};
New_Line : ('\r' | '\n')+ {$channel = HIDDEN;};
It is not a surprise I got this error -
warning(200): post2.g:16:45:
Decision can match input such as "'_'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
The error is about rule "cmd_name".
I believe the reason is, as Bart indicated in another thread, when there is such input as "abc__", it can be parsed as either "abc_"(cmd_name) and "_"(action_cmd_def/cmd_chars) or "abc__"(cmd_name).
Here are my questions:
1) How to fix it? I tried adding "options {greedy=true;}" in front of cmd_name, but the error persists.
2) I know that if I combine cmd_name and action_cmd_def into one rule, the problem goes away, but this leads to the question of grammar granularity. Since ANTLR has such a powerful lexer/parser, I'd really like to use the grammar to filter out the meaningful strings. In this case, I know the input for "action_cmd" must start with a command name and then continue with some messy stuff, so I'd like the grammar to separate the two parts; otherwise I will have to do it in an action written in the target language (C in my case). But finer granularity brings so much trouble that I doubt whether I am on the right track.
So I'd like to ask: what is your rule of thumb for grammar granularity? Am I going overboard with the grammar?

This is a genuine ambiguity, but the greedy option should work for you. Maybe it needs to be at the subrule level? See if this works:
cmd_name
: 'a'..'z' (options {greedy=true;} : 'a'..'z' | '0'..'9' | '_' )*
;
As for the second part of your question, I think your rule granularity is fine. You can also resort to using syntactic predicates if there is an ambiguity that needs more than just the greedy flag to solve. It is well documented in the ANTLR 3 book, but not so well on the website.
It amounts to trying to match the predicate syntactically. If it succeeds, the parser matches that alternative for real; if it fails, the parser uses the other alternatives. For instance, in C you don't know if you have a function declaration or definition until you see the end of the declaration, which has no upper limit on its length. So you use a syntactic predicate to say "let's see if it is a declaration; if it is, match it for real; if not, try the other alternatives".
externalDef
: ( "typedef" | declaration )=> declaration
| functionDef
| asm_expr
;
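The idea behind a syntactic predicate (speculate, rewind, then commit) can be sketched in a few lines of Python. This is a hand-written toy, not ANTLR-generated code: the `Parser` class, its method names, and the two simplified rules (`declaration : 'int' ID ';'` and `functionDef : 'int' ID '(' ')' '{' '}'`) are all assumptions made for the illustration.

```python
class ParseError(Exception):
    pass

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def expect(self, tok):
        """Consume tok or raise ParseError."""
        if self.pos < len(self.tokens) and self.tokens[self.pos] == tok:
            self.pos += 1
        else:
            raise ParseError(f"expected {tok!r} at token {self.pos}")

    def speculate(self, rule):
        """Try rule without consuming input; report whether it would succeed."""
        start = self.pos
        try:
            rule()
            return True
        except ParseError:
            return False
        finally:
            self.pos = start  # always rewind, whatever the outcome

    def declaration(self):
        for tok in ("int", "ID", ";"):
            self.expect(tok)

    def function_def(self):
        for tok in ("int", "ID", "(", ")", "{", "}"):
            self.expect(tok)

    def external_def(self):
        # Equivalent of: (declaration)=> declaration | functionDef
        if self.speculate(self.declaration):
            self.declaration()
            return "declaration"
        self.function_def()
        return "functionDef"

print(Parser(["int", "ID", ";"]).external_def())
print(Parser(["int", "ID", "(", ")", "{", "}"]).external_def())
```

Both inputs start with the same two tokens, so no fixed lookahead of one or two tokens could tell them apart; the speculative pass reads as far as it needs to and then the chosen alternative is parsed for real.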

How to use similar lexers

I have the following grammar:
cmds
: cmd+
;
cmd
: include_cmd | other_cmd
;
include_cmd
: INCLUDE DOUBLE_QUOTE FILE_NAME DOUBLE_QUOTE
;
other_cmd
: CMD_NAME ARG+
;
INCLUDE
: '#include'
;
DOUBLE_QUOTE
: '"'
;
CMD_NAME
: ('a'..'z')*
;
ARG
: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')+
;
FILE_NAME
: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '.')+
;
So the difference between CMD_NAME, ARG and FILE_NAME is not large, CMD_NAME must be lower case letters, ARG can have upper case letter and "_" and FILE_NAME yet can have ".".
But this has a problem, when I test the rule with - #include "abc", 'abc' is interpreted as CMD_NAME instead of FILE_NAME, I think it is because CMD_NAME is before FILE_NAME in the grammar file, this leads to parsing error.
Do I have to rely on a technique such as predicates to deal with this? Is there a pure EBNF solution that does not rely on the host programming language?
Thanks.

"But this has a problem, when I test the rule with - #include "abc", 'abc' is interpreted as CMD_NAME instead of FILE_NAME, I think it is because CMD_NAME is before FILE_NAME in the grammar file, this leads to parsing error."
The set of all valid CMD_NAMEs intersects with the set of all valid FILE_NAMEs. Input abc qualifies as both. Both rules match the same three characters, so the lexer breaks the tie in favor of the rule listed first (as you suspected).
"Do I have to rely on such technique as [predicate] to deal with this? Is there a pure EBNF solution other than relying on host programming language?"
It depends on what you're willing to accept in your grammar. Consider changing your include_cmd rule to something more conventional, like this:
include_cmd : INCLUDE STRING;
STRING
: '"' ~('"'|'\r'|'\n')* '"' {String text = getText(); setText(text.substring(1, text.length() - 1));}
;
Now input #include "abc" turns into tokens [INCLUDE : #include] [STRING : abc].
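The token stream above can be reproduced with a short Python sketch. It is an approximation of the two rules as regular expressions; the `lex` helper and the whitespace-skipping `WS` group are assumptions, since the grammar in the question does not show a whitespace rule.

```python
import re

# INCLUDE and STRING mirror the rules above; WS is an assumed skip rule.
TOKEN = re.compile(r'(?P<INCLUDE>#include)|(?P<STRING>"[^"\r\n]*")|(?P<WS>[ \t]+)')

def lex(text):
    """Return (type, text) pairs, stripping the quotes from STRING tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        kind, value = m.lastgroup, m.group()
        if kind == "STRING":
            value = value[1:-1]  # same effect as the setText(...) action
        if kind != "WS":
            tokens.append((kind, value))
        pos = m.end()
    return tokens

print(lex('#include "abc"'))
print(lex('#include "my_file.h"'))
```

Because the quotes delimit the token, characters such as '.' no longer need a dedicated FILE_NAME rule that competes with CMD_NAME and ARG.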
I don't think the grammar should be responsible for determining whether a file name is valid or not: a valid file name doesn't imply a valid file, and the grammar would have to understand OS file naming conventions (valid characters, paths, etc.) that probably have no bearing on the grammar itself. I think you'll be fine if you're willing to drop rule FILE_NAME for something like the rules above.
Also worth noting, your CMD_NAME rule matches zero-length input. Consider changing ('a'..'z')* to ('a'..'z')+ unless a CMD_NAME really can be empty.
Keep in mind, too, that you'll have the same problem with ARG that you did with FILE_NAME. It's listed after CMD_NAME, so any input that qualifies for both rules (like abc again) will hit CMD_NAME. Consider breaking these rules up into more conventional ones like so:
other_cmd : ID (ID | NUMBER)+ SEMI; //instead of CMD_NAME ARG+
ID : ('a'..'z'|'A'..'Z'|'_')+; //instead of CMD_NAME, "id" part of ARG
NUMBER : ('0'..'9')+; //"number" part of ARG
SEMI : ';';
I added rule SEMI to mark the end of a command. Otherwise the parser won't know if input a b c d is supposed to be one command with three arguments (a(b,c,d)) or two commands with one argument each (a(b), c(d)).
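How the SEMI token removes the ambiguity can be seen in a small Python sketch that groups tokens into commands. The `commands` helper and the whitespace-skipping `WS` group are assumptions for the illustration; ID, NUMBER and SEMI mirror the rules above.

```python
import re

TOKEN = re.compile(r"(?P<ID>[a-zA-Z_]+)|(?P<NUMBER>[0-9]+)|(?P<SEMI>;)|(?P<WS>\s+)")

def commands(text):
    """Group tokens into (name, args) pairs following other_cmd : ID (ID | NUMBER)+ SEMI."""
    result, current = [], []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        pos = m.end()
        if m.lastgroup == "WS":
            continue  # whitespace only separates tokens
        if m.lastgroup == "SEMI":
            # A command needs an ID as its name plus at least one argument.
            if len(current) < 2 or current[0][0] != "ID":
                raise ValueError("a command needs an ID name and at least one argument")
            result.append((current[0][1], [value for _, value in current[1:]]))
            current = []
        else:
            current.append((m.lastgroup, m.group()))
    if current:
        raise ValueError("missing ';' after the last command")
    return result

print(commands("a b c d;"))
print(commands("a b; c d;"))
```

The same four words produce one command or two depending only on where the semicolons fall, which is exactly the information the parser was missing.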
