antlr4 - conflicting rules, how to fix - antlr

I have the following rules:
property : NAME;
value : STRING | NUMBER;
NUMBER : ('0'..'9')+;
NAME : ('a'..'z' | 'A'..'Z' | '0'..'9' | '-' | '_')+;
STRING : '"' (~'"')* '"';
When a property is a number, ANTLR says:
line 1:14 mismatched input '5' expecting NAME
I understand why this happens. The NUMBER rule is mentioned before the NAME rule, so it has precedence. The number is recognized by the NUMBER rule.
What is the common way to handle this in ANTLR? I could rewrite the property rule as follows, but I don't know whether it is a good idea, as it introduces redundancy.
property : NAME | NUMBER;
Re-ordering NUMBER and NAME isn't a good idea either, as it will break the value rule for numbers (same problem).
Important to note: I am fairly new to ANTLR and am still learning.

Yes, property : NAME | NUMBER; is the way to do it.
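The reason the extra alternative is needed can be reproduced outside ANTLR. The following is a minimal Python sketch, not ANTLR's implementation: a hypothetical tokenize helper applies the two rules an ANTLR lexer uses to choose between overlapping rules (the longest match wins, and on a tie the rule listed first wins), and is_property mirrors the parser rule property : NAME | NUMBER;. Skipping whitespace is an assumption, since the question shows no whitespace rule.

```python
import re

# Lexer rules in grammar order. Order only matters when two rules
# match the same number of characters.
RULES = [
    ("NUMBER", re.compile(r"[0-9]+")),
    ("NAME", re.compile(r"[A-Za-z0-9_-]+")),
    ("STRING", re.compile(r'"[^"]*"')),
]

def tokenize(text):
    """Longest match wins; on a tie the earlier rule wins."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():  # assumption: whitespace is skipped
            pos += 1
            continue
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens

def is_property(token):
    # Mirrors the parser rule: property : NAME | NUMBER;
    return token[0] in ("NAME", "NUMBER")
```

With these rules the input 5 becomes a NUMBER token because both rules match one character and NUMBER is listed first, while 5a becomes a NAME token because NAME matches more input. A parser rule that accepts only NAME therefore rejects 5, which is the error shown in the question.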

Related

Capturing formatted variable declarations in ANTLR

I have a simple lexer/grammar I've been working on and I'm having trouble understanding the standard operating procedure for matching formatted variables. I am trying to match the following:
Variable name can be 1 character minimum. If it is one char, it must be an uppercase or lowercase letter.
If it is greater than 1 character, it must begin with a letter of any case, and then be followed by any number of characters, including numbers, underscore and the dollar sign.
I've rewritten this several times, in many flavors, and I always get the following error:
Decision can match input such as "SINGLELETTER" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Would really appreciate some insight. I understand there is some ambiguity in my grammar, but I am a bit confused why multiple alternatives can be matched, once we enter the original matching loop. Thank you!
variablename
: (SINGLELETTER)
| (SINGLELETTER|UNDERSCORE)( SINGLELETTER|UNDERSCORE | DOLLAR | NUMBER)*;
SINGLELETTER : ( 'a'..'z' | 'A'..'Z');
fragment LOWERCASE : 'a'..'z';
fragment UNDERSCORE : '_';
fragment DOLLAR : '$';
fragment NUMBER : '0'..'9';
Why not make Variablename a lexer rule that produces a single token for the entire name?
Variablename
: SINGLELETTER
| (SINGLELETTER|UNDERSCORE) (SINGLELETTER | UNDERSCORE | DOLLAR | NUMBER)*;
fragment SINGLELETTER : ( 'a'..'z' | 'A'..'Z');
fragment LOWERCASE : 'a'..'z';
fragment UNDERSCORE : '_';
fragment DOLLAR : '$';
fragment NUMBER : '0'..'9';
Also, your variablename rule does not follow the second requirement you listed: the grammar allows a name to start with _, but your description does not.
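The two requirements in the question reduce to a single pattern: one letter followed by any number of letters, digits, underscores or dollar signs. The single-letter case is just that pattern with zero repetitions, which is why listing it as a separate alternative triggers the ambiguity warning. A minimal Python sketch of that pattern, following the prose description (so a leading underscore is rejected); the helper name is hypothetical:

```python
import re

# One letter, then any number of letters, digits, '_' or '$'.
VARIABLE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_$]*")

def is_variable_name(text):
    """True when the whole text is a valid variable name."""
    return VARIABLE_NAME.fullmatch(text) is not None
```

The equivalent lexer rule would need only one alternative, with the single letter covered by the zero-repetition case.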

EBNF grammar (ANTLR)

I've got a problem with EBNF grammar in ANTLRWorks:
line 37:
upper_lower_case
: LOWER_CASE
| UPPER_CASE
;
line 42:
CLASSNAME
: UPPER_CASE (DIGITS | upper_lower_case )*
;
line 51:
UPPER_CASE
: 'A'..'Z'
;
line 55:
LOWER_CASE
: 'a'..'z'
;
line 60:
DIGITS : '0'..'9'
;
I want CLASSNAME to always start with a capital letter, and then it can consist of digits and upper or lower case letters.
Error log:
[13:11:59] warning(200): classgenerator.g:43:42:
Decision can match input such as "'0'..'9'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
[13:11:59] warning(200): classgenerator.g:43:42:
Decision can match input such as "<EOT>" using multiple alternatives: 2, 3
As a result, alternative(s) 3 were disabled for that input
[13:11:59] error(201): classgenerator.g:43:42: The following alternatives can never be
matched: 3
[13:11:59] error(208): classgenerator.g:60:1: The following token definitions can never
be matched because prior tokens match the same input: UPPER_CASE,DIGITS
Could anyone help me solve this problem?
Thanks in advance.
Regards,
Hladeo
EDIT:
So I should use the fragment keyword if a rule doesn't refer to tokens? Would using the fragment keyword this way be wrong?
tokens {
PUBLIC = '+';
PRIVATE = '-';
PROTECTED = '=';
}
fragment ACCESSOR
: PUBLIC
| PRIVATE
| PROTECTED
;
and another question.
OBJECTNAME
: UPPER_LOWER_CASE (UPPER_LOWER_CASE | DIGIT)*
;
OBJECTNAME should consist of at least one letter (upper or lower case doesn't matter), optionally followed by more letters or digits. What's wrong with that part of the code? When I type, for example, variable, it's okay, but when I start with a capital letter, as in Variable, I get an error:
line 1:15 mismatched input 'Variable' expecting OBJECTNAME
Your lexer rule CLASSNAME currently references parser rule upper_lower_case (lexer rules start with an uppercase letter; parser rules start with lowercase). Lexer rules can only reference lexer rules.
In addition, it appears that UPPER_CASE, LOWER_CASE, and DIGITS should not create tokens themselves so they should be marked as fragment rules. In the following example, I changed DIGITS to DIGIT since it only ever matches one digit.
CLASSNAME : UPPER_CASE (DIGIT | UPPER_LOWER_CASE)*;
fragment UPPER_LOWER_CASE : LOWER_CASE | UPPER_CASE;
fragment UPPER_CASE : 'A'..'Z';
fragment LOWER_CASE : 'a'..'z';
fragment DIGIT : '0'..'9';
Edit 1 (for the edits in the question):
A piece of text in the input can only have one token type. For example, consider the input text X3. Since this text could match a CLASSNAME or an OBJECTNAME, the lexer will end up assigning it the type of the first rule appearing in the grammar. In other words, if CLASSNAME appears before OBJECTNAME in the grammar, the input X3 will always be a CLASSNAME token and will never be an OBJECTNAME token. If OBJECTNAME appears before CLASSNAME in the grammar, the input X3 will always be an OBJECTNAME and never be a CLASSNAME (in fact, in this case no token will ever be a CLASSNAME).
Your ACCESSOR rule looks like it should be a parser rule, like the following:
accessor : PUBLIC | PROTECTED | PRIVATE;
Edit 2 (for the comment about distinguishing CLASSNAME and OBJECTNAME):
To distinguish between CLASSNAME and OBJECTNAME, you can create a lexer rule IDENTIFIER which matches either.
IDENTIFIER : UPPER_LOWER_CASE (DIGIT | UPPER_LOWER_CASE)*;
You can then create a parser rule to handle the distinction:
classname : IDENTIFIER;
objectname : IDENTIFIER;
Obviously this allows x3 to be a classname, which is not valid in your language. When possible, I always prefer to relax the parser rules a bit and perform further validation later where I can provide a better error message. For example, if you allow x3 to match classname, then after you parse the input and have an AST (ANTLR 3) or parse tree (ANTLR 4), you can look for all instances of classname and make sure that the matched IDENTIFIER starts with the required upper case letter.
Example error message produced by the parser's automatic error reporting:
line 1:15 mismatched input 'variable' expecting CLASSNAME
Example error message produced by separate validation:
line 1:15 class name variable must start with an upper case letter
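The separate validation step could look roughly like the following Python sketch. It assumes you have already collected the matched IDENTIFIER text and its position from the AST or parse tree; the function name and its parameters are hypothetical, not part of any ANTLR API.

```python
def validate_class_name(name, line, column):
    """Return an error message, or None when the name is acceptable."""
    # The relaxed grammar accepts any IDENTIFIER as a classname,
    # so the upper case requirement is enforced here instead.
    if name and "A" <= name[0] <= "Z":
        return None
    return (
        f"line {line}:{column} class name {name} "
        "must start with an upper case letter"
    )
```

Calling validate_class_name("variable", 1, 15) produces the second message above, while a name such as Variable passes without an error.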

multi alternative and rule of thumb of grammar granularity

I asked related questions here and here, now I have a new question, but really I am asking for some general rule of thinking.
Here is the grammar:
grammar post2;
post2: action_cmd+
;
action_cmd
: cmd_name action_cmd_def
;
action_cmd_def
: (cmd_chars | cmd_literal)+ Semi_colon
;
cmd_name
: 'a'..'z' ('a'..'z' | '0'..'9' | '_' )*
;
cmd_chars
: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '.' | ':' | '-' |'\\')
;
cmd_literal
: SINGLE_QUOTE ~(SINGLE_QUOTE | '\n' | '\r') SINGLE_QUOTE
;
SINGLE_QUOTE
: '\''
;
Semi_colon
: ';'
;
WS : ('\t' | ' ')+ {$channel = HIDDEN;};
New_Line : ('\r' | '\n')+ {$channel = HIDDEN;};
It is not a surprise I got this error -
warning(200): post2.g:16:45:
Decision can match input such as "'_'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
The error is about rule "cmd_name".
I believe the reason is, as Bart indicated in another thread, that an input such as "abc__" can be parsed either as "abc_" (cmd_name) followed by "_" (action_cmd_def/cmd_chars), or as "abc__" (cmd_name).
Here are my questions:
1) How do I fix it? I tried adding "options {greedy=true;}" in front of cmd_name, but the error persists.
2) I know that combining cmd_name and action_cmd_def into one rule makes the problem go away, but that raises the question of grammar granularity. Since ANTLR has such powerful lexer/parser functionality, I would like to use the grammar to filter out the meaningful strings. In this case, I know the input for "action_cmd" must start with a command name, followed by some messy stuff, so I would like the grammar to separate the two parts; otherwise I will have to do it in an action written in the target language (C in my case). But finer granularity brings so much trouble that I doubt whether I am on the right track.
With this in mind, what is your rule of thumb for grammar granularity? Am I going nuts in my use of the grammar?
This is a genuine ambiguity, but the greedy option should work for you. Maybe it needs to be at the subrule level? See if this works:
cmd_name
: 'a'..'z' (options {greedy=true;} : 'a'..'z' | '0'..'9' | '_' )*
;
As for the second part of your question, I think your rule granularity is fine. You can also resort to using syntactic predicates if there is an ambiguity that needs more than just the greedy flag to solve. It is well documented in the ANTLR 3 book, but not so well on the website.
It amounts to trying to match the predicate syntactically. If it succeeds, the parser matches the alternative for real; if it fails, the parser tries the other alternatives. For instance, in C you don't know whether you have a function declaration or a definition until you see the end of the declaration, which has no upper limit on its length. So you use a syntactic predicate to say "let's see if it is a declaration; if it is, match it for real; if not, try the other alternatives."
externalDef
: ( "typedef" | declaration )=> declaration
| functionDef
| asm_expr
;
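The "try it first, then match it for real" idea can be illustrated with a toy Python sketch. This is not how ANTLR implements syntactic predicates; the two-alternative grammar, the token shapes and all function names below are assumptions made up for the illustration.

```python
def match_sequence(tokens, pos, tail):
    """Match two arbitrary tokens (type and name) followed by the literal tail."""
    end = pos + 2 + len(tail)
    if end <= len(tokens) and tokens[pos + 2:end] == tail:
        return end
    return None

def parse_declaration(tokens, pos):
    # Toy rule: TYPE NAME '(' ')' ';'
    return match_sequence(tokens, pos, ["(", ")", ";"])

def parse_function_def(tokens, pos):
    # Toy rule: TYPE NAME '(' ')' '{' '}'
    return match_sequence(tokens, pos, ["(", ")", "{", "}"])

def parse_external_def(tokens, pos=0):
    # Syntactic predicate: speculatively try the declaration first.
    if parse_declaration(tokens, pos) is not None:
        # The speculation succeeded, so match the alternative for real.
        return "declaration", parse_declaration(tokens, pos)
    end = parse_function_def(tokens, pos)
    if end is not None:
        return "functionDef", end
    raise SyntaxError("no alternative matches")
```

Both inputs start with the same tokens, so no fixed amount of lookahead decides between them; the speculative match scans as far as it needs to and only then commits to an alternative.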

How to use similar lexers

I have the following grammar:
cmds
: cmd+
;
cmd
: include_cmd | other_cmd
;
include_cmd
: INCLUDE DOUBLE_QUOTE FILE_NAME DOUBLE_QUOTE
;
other_cmd
: CMD_NAME ARG+
;
INCLUDE
: '#include'
;
DOUBLE_QUOTE
: '"'
;
CMD_NAME
: ('a'..'z')*
;
ARG
: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')+
;
FILE_NAME
: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '.')+
;
The differences between CMD_NAME, ARG and FILE_NAME are small: CMD_NAME must consist of lower case letters, ARG can also contain upper case letters, digits and "_", and FILE_NAME can additionally contain ".".
But this has a problem, when I test the rule with - #include "abc", 'abc' is interpreted as CMD_NAME instead of FILE_NAME, I think it is because CMD_NAME is before FILE_NAME in the grammar file, this leads to parsing error.
Do I have to rely on a technique such as predicates to deal with this? Is there a pure EBNF solution that does not rely on the host programming language?
Thanks.
But this has a problem, when I test the rule with - #include "abc", 'abc' is interpreted as CMD_NAME instead of FILE_NAME, I think it is because CMD_NAME is before FILE_NAME in the grammar file, this leads to parsing error.
The set of all valid CMD_NAMEs intersects with the set of all valid FILE_NAMEs. Input abc qualifies as both. The lexer matches the input with the first rule listed (as you suspected) because it's the first one matched.
Do I have to rely on such technique as [predicate] to deal with this? Is there a pure EBNF solution other than relying on host programming language?
It depends on what you're willing to accept in your grammar. Consider changing your include_cmd rule to something more conventional, like this:
include_cmd : INCLUDE STRING;
STRING
: '"' ~('"'|'\r'|'\n')* '"' {String text = getText(); setText(text.substring(1, text.length() - 1));}
;
Now input #include "abc" turns into tokens [INCLUDE : #include] [STRING : abc].
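The effect of the setText action can be checked with a small Python sketch. It is only a stand-in for the generated lexer: the regular expression covers just the INCLUDE and STRING rules plus whitespace skipping (an assumption), and the tokenize helper is hypothetical.

```python
import re

# INCLUDE and STRING as in the rules above; WS is an assumed skip rule.
TOKEN = re.compile(
    r'(?P<INCLUDE>#include)|(?P<STRING>"[^"\r\n]*")|(?P<WS>[ \t]+)'
)

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        kind, value = m.lastgroup, m.group()
        if kind == "STRING":
            # Same effect as the setText(...) action: drop both quotes.
            value = value[1:-1]
        if kind != "WS":
            tokens.append((kind, value))
        pos = m.end()
    return tokens
```

For the input #include "abc" this yields an INCLUDE token followed by a STRING token whose text is abc, without the surrounding quotes.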
I don't think the grammar should be responsible for determining whether a file name is valid or not: a valid file name doesn't imply a valid file, and the grammar would have to understand OS file naming conventions (valid characters, paths, etc.) that probably have no bearing on the grammar itself. I think you'll be fine if you're willing to drop rule FILE_NAME for something like the rules above.
Also worth noting, your CMD_NAME rule matches zero-length input. Consider changing ('a'..'z')* to ('a'..'z')+ unless a CMD_NAME really can be empty.
Keep in mind, too, that you'll have the same problem with ARG that you did with FILE_NAME. It's listed after CMD_NAME, so any input that qualifies for both rules (like abc again) will hit CMD_NAME. Consider breaking these rules up into more conventional ones like so:
other_cmd : ID (ID | NUMBER)+ SEMI; //instead of CMD_NAME ARG+
ID : ('a'..'z'|'A'..'Z'|'_')+; //instead of CMD_NAME, "id" part of ARG
NUMBER : ('0'..'9')+; //"number" part of ARG
SEMI : ';';
I added rule SEMI to mark the end of a command. Otherwise the parser won't know if input a b c d is supposed to be one command with three arguments (a(b,c,d)) or two commands with one argument each (a(b), c(d)).
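To see how the terminator removes the ambiguity, here is a minimal Python sketch that groups an already tokenized input into commands. It works on plain strings rather than ANTLR tokens, and the split_commands helper is hypothetical.

```python
def split_commands(tokens):
    """Group a flat token list into (name, args) pairs, one per ';'."""
    commands, current = [], []
    for token in tokens:
        if token == ";":
            # Mirrors other_cmd : ID (ID | NUMBER)+ SEMI;
            if len(current) < 2:
                raise SyntaxError(
                    "a command needs a name and at least one argument"
                )
            commands.append((current[0], current[1:]))
            current = []
        else:
            current.append(token)
    if current:
        raise SyntaxError("missing ';' after the last command")
    return commands
```

With the terminator in place, a b c d ; can only be one command with three arguments, and a b ; c d ; can only be two commands with one argument each.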

Proper way to resolve ANTLR lexer rule ambiguities?

Please see the source code available at: https://gist.github.com/1684022.
I've got two tokens defined:
ID : ('a'..'z' | 'A'..'Z') ('0'..'9' | 'a'..'z' | 'A'..'Z' | ' ')*;
PITCH
: (('A'|'a') '#'?)
| (('B'|'b') '#'?)
| (('C'|'c') '#'?);
Obviously, the letter "A" would be an ambiguity.
I further define:
note : PITCH;
name : ID;
main : name ':' note '\n'?;
Now, if I enter "A:A" as input to the parser, I always get an error. Either the parser expects PITCH or ID depending on whether ID or PITCH is defined first:
mismatched input 'A' expecting ID
What is the proper way to resolve this so that it works as intended?
As is described, although it makes intuitive sense how the parsing should work, ANTLR doesn't do the "right thing". That is, even though the main rule says a name/ID should come first, the lexer seems to be ignorant of this and identifies "A" as a PITCH because it follows the "longest match"/"which comes first" rule rather than the more reasonable "what the rule says" rule.
Is the only solution to fake/hack it by matching both ID and PITCH, and then recombining them later as dasblinkenlight says?
Here is how I would re-factor this grammar to make it work:
ID : (('a'..'z' | 'A'..'Z') ('0'..'9' | 'a'..'z' | 'A'..'Z' | ' ')+)
| ('d'..'z' | 'D'..'Z');
PITCH : 'a'..'c' | 'A'..'C';
SHARP : '#';
note : PITCH SHARP?;
name : ID | PITCH;
main : name ':' note '\n'? EOF;
This separates long names from one-character pitch names, which get "reunited" in the parser. Also the "sharp" token gets its own name, and gets recognized in the parser as an optional token.
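The refactored rules can be tried out with a rough Python sketch. It is an approximation, not the generated lexer: the regular expression tries ID before PITCH, which gives the same result as longest-match here because ID either matches at least two characters or a single letter that PITCH cannot match. The COLON name, the helper names, and leaving out the optional trailing newline are assumptions made for brevity.

```python
import re

TOKEN = re.compile(
    r"(?P<ID>[A-Za-z][0-9A-Za-z ]+|[d-zD-Z])"
    r"|(?P<PITCH>[a-cA-C])"
    r"|(?P<SHARP>#)"
    r"|(?P<COLON>:)"
)

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def parse_main(text):
    """main : name ':' note EOF, with name : ID | PITCH and note : PITCH SHARP?"""
    kinds = [kind for kind, _ in tokenize(text)]
    return (
        len(kinds) in (3, 4)
        and kinds[0] in ("ID", "PITCH")
        and kinds[1] == "COLON"
        and kinds[2] == "PITCH"
        and kinds[3:] in ([], ["SHARP"])
    )
```

The input A:A now tokenizes as PITCH, COLON, PITCH, and the name rule accepts the first PITCH, so the input parses. A longer name such as Song still becomes a single ID token.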