EBNF grammar (ANTLR) - antlr

I've got a problem with EBNF grammar in ANTLRWorks:
line 37:
upper_lower_case
: LOWER_CASE
| UPPER_CASE
;
line 42:
CLASSNAME
: UPPER_CASE (DIGITS | upper_lower_case )*
;
line 51:
UPPER_CASE
: 'A'..'Z'
;
line 55:
LOWER_CASE
: 'a'..'z'
;
line 60:
DIGITS : '0'..'9'
;
I want CLASSNAME to always start with the capital letter and than it can consists of digits, upper or lower case letters.
Error log:
[13:11:59] warning(200): classgenerator.g:43:42:
Decision can match input such as "'0'..'9'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
[13:11:59] warning(200): classgenerator.g:43:42:
Decision can match input such as "<EOT>" using multiple alternatives: 2, 3
As a result, alternative(s) 3 were disabled for that input
[13:11:59] error(201): classgenerator.g:43:42: The following alternatives can never be
matched: 3
[13:11:59] error(208): classgenerator.g:60:1: The following token definitions can never
be matched because prior tokens match the same input: UPPER_CASE,DIGITS
Could anyone help me solve this problem?
Thanks in advance.
Regards,
Hladeo
EDIT:
So I should use fragment keyword if it doesn't refers to the tokens? In this way using fragment keyword will be wrong?
tokens {
PUBLIC = '+';
PRIVATE = '-';
PROTECTED = '=';
}
fragment ACCESSOR
: PUBLIC
| PRIVATE
| PROTECTED
;
and another question.
OBJECTNAME
: UPPER_LOWER_CASE (UPPER_LOWER_CASE | DIGIT)*
;
OBJECTNAME should consists of at least one letter (upper or lower cased doesn't matter) and optionally of another letters or digits - what's wrong with that part of the code? When I try to type for example variable - it's okay, but when I start with capital letter Variable I'm getting an error:
line 1:15 mismatched input 'Variable' expecting OBJECTNAME

Your lexer rule CLASSNAME currently references parser rule upper_lower_case (lexer rules start with an uppercase letter; parser rules start with lowercase). Lexer rules can only reference lexer rules.
In addition, it appears that UPPER_CASE, LOWER_CASE, and DIGITS should not create tokens themselves so they should be marked as fragment rules. In the following example, I changed DIGITS to DIGIT since it only ever matches one digit.
CLASSNAME : UPPER_CASE (DIGIT | UPPER_LOWER_CASE)*;
fragment UPPER_LOWER_CASE : LOWER_CASE | UPPER_CASE;
fragment UPPER_CASE : 'A'..'Z';
fragment LOWER_CASE : 'a'..'z';
fragment DIGIT : '0'..'9';
Edit 1 (for the edits in the question):
A piece of text in the input can only have one token type. For example, consider the input text X3. Since this text could match a CLASSNAME or an OBJECTNAME, the lexer will end up assigning it the type of the first rule appearing in the grammar. In other words, if CLASSNAME appears before OBJECTNAME in the grammar, the input X3 will always be a CLASSNAME token and will never be a OBJECTNAME token. If OBJECTNAME appears before CLASSNAME in the grammar, the input X3 will always be an OBJECTNAME and never be a CLASSNAME (in fact, in this case no token will ever be a CLASSNAME).
Your ACCESSOR rule looks like it should be a parser rule, like the following:
accessor : PUBLIC | PROTECTED | PRIVATE;
Edit 2 (for the comment about distinguishing CLASSNAME and OBJECTNAME):
To distinguish between CLASSNAME and OBJECTNAME, you can create a lexer rule IDENTIFIER which matches either.
IDENTIFIER : UPPER_LOWER_CASE (DIGIT | UPPER_LOWER_CASE)*;
You can then create a parser rule to handle the distinction:
classname : IDENTIFIER;
objectname : IDENTIFIER;
Obviously this allows x3 to be a classname, which is not valid in your language. When possible, I always prefer to relax the parser rules a bit and perform further validation later where I can provide a better error message. For example, if you allow x3 to match classname, then after you parse the input and have an AST (ANTLR 3) or parse tree (ANTLR 4), you can look for all instances of classname and make sure that the matched IDENTIFIER starts with the required upper case letter.
Example error message produced by the parser's automatic error reporting:
line 1:15 mismatched input 'variable' expecting CLASSNAME
Example error message produced by separate validation:
line 1:15 class name variable must start with an upper case letter

Related

ANTLR4 No Viable Alternative At Input

I'm implementing a simple PseudoCode language with ANTLR4, this is my current grammar:
// Define a grammar called PseudoCode
grammar PseudoCode;
prog : FUNCTION SIGNATURE '(' ')'
| FUNCTION SIGNATURE '{' VARB '}' ;
param: VARB | VARB ',' param ;
assignment: VARB '=' NUMBER ;
FUNCTION: 'function' ;
VARB: [a-z0-9]+ ;
SIGNATURE: [a-zA-Z0-9]+ ;
NUMBER: [0-9]+ | [0-9]+ '.' [0-9]+ ;
WS: [ \t\r\n]+ -> skip ;
The problem is after compiling and generating the Parser, Lexer, etc... and then running with grun PseudoCode prog -tree with the input being for example: function bla{bleh}
I keep on getting the following error:
line 1:9 no viable alternative at input 'functionbla'
Can someone point out what is wrong with my grammar?
bla is a VARB, not a SIGNATURE, because it matches both rules and VARB comes first in the grammar. The way you defined your lexer rules, an identifier can only be matched as a SIGNATURE if it contains capital letters.
The simplest solution to this problem would be to have a single lexer rule for identifiers and then use that everywhere where you currently use SIGNATURE or VARB. If you want to disallow capital letters in certain places, you could simply check for this condition in an action or listener, which would also allow you to produce clearer error messages than syntax errors (e.g. "capital letters are not allowed in variable names").
If you absolutely do need capital letters in variable names to be syntax errors, you could define one rule for identifiers with capital letters and one without. Then you could use ID_WITH_CAPITALS | ID_LOWER_CASE_ONLY in places where you want to allow both and ID_LOWER_CASE_ONLY in cases where you only want to allow lower case letters.
PS: You'll also want to make sure that your identifier rule does not match numbers (which both VARB and SIGNATURE currently do). Currently NUMBER tokens will only be generated for numbers with a decimal point.

Ambiguous Lexer rules in Antlr

I have an antlr grammar with multiple lexer rules that match the same word. It can't be resolved during lexing, but with the grammar, it becomes unambiguous.
Example:
conversion: NUMBER UNIT CONVERT UNIT;
NUMBER: [0-9]+;
UNIT: 'in' | 'meters' | ......;
CONVERT: 'in';
Input: 1 in in meters
The word "in" matches the lexer rules UNIT and CONVERT.
How can this be solved while keeping the grammar file readable?
When an input matches two lexer rules, ANTLR chooses either the longest or the first, see disambiguate. With your grammar, in will be interpreted as UNIT, never CONVERT, and the rule
conversion: NUMBER UNIT CONVERT UNIT;
can't work because there are three UNIT tokens :
$ grun Question question -tokens -diagnostics input.txt
[#0,0:0='1',<NUMBER>,1:0]
[#1,1:1=' ',<WS>,channel=1,1:1]
[#2,2:3='in',<UNIT>,1:2]
[#3,4:4=' ',<WS>,channel=1,1:4]
[#4,5:6='in',<UNIT>,1:5]
[#5,7:7=' ',<WS>,channel=1,1:7]
[#6,8:13='meters',<UNIT>,1:8]
[#7,14:14='\n',<NL>,1:14]
[#8,15:14='<EOF>',<EOF>,2:0]
Question last update 0159
line 1:5 missing 'in' at 'in'
line 1:8 mismatched input 'meters' expecting <EOF>
What you can do is to have only ID or TEXT tokens and distinguish them with a label, like this :
grammar Question;
question
#init {System.out.println("Question last update 0132");}
: conversion NL EOF
;
conversion
: NUMBER unit1=ID convert=ID unit2=ID
{System.out.println("Quantity " + $NUMBER.text + " " + $unit1.text +
" to convert " + $convert.text + " " + $unit2.text);}
;
ID : LETTER ( LETTER | DIGIT | '_' )* ; // or TEXT : LETTER+ ;
NUMBER : DIGIT+ ;
NL : [\r\n] ;
WS : [ \t] -> channel(HIDDEN) ; // -> skip ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
Execution :
$ grun Question question -tokens -diagnostics input.txt
[#0,0:0='1',<NUMBER>,1:0]
[#1,1:1=' ',<WS>,channel=1,1:1]
[#2,2:3='in',<ID>,1:2]
[#3,4:4=' ',<WS>,channel=1,1:4]
[#4,5:6='in',<ID>,1:5]
[#5,7:7=' ',<WS>,channel=1,1:7]
[#6,8:13='meters',<ID>,1:8]
[#7,14:14='\n',<NL>,1:14]
[#8,15:14='<EOF>',<EOF>,2:0]
Question last update 0132
Quantity 1 in to convert in meters
Labels are available from the rule's context in the visitor, so it is easy to distinguish tokens of the same type.
Based on the info in your question, it's hard to say what the best solution would be - I don't know what your lexer rules are, for example - nor can I tell why you have lexer rules that are ambiguous at all.
In my experience with antlr, lexer rules don't generally carry any semantic meaning; they are just text that matches some kind of regular expression. So, instead of having VARIABLE, METHOD_NAME, etc, I'd just have IDENTIFIER, and then figure it out at a higher level.
In other words, it seems (from the little I can glean from your question) that you might benefit either from replacing UNIT and CONVERT with grammar rules, or just having a single rule:
conversion: NUMBER TEXT TEXT TEXT
and validating the text values in your ANTLR listener/tree-walker/etc.
EDIT
Thanks for updating your question with lexer rules. It's clear now why it's failing - as BernardK points out, antlr will always choose the first matching lexer rule. This means it's impossible for the second of two ambiguous lexer rules to match, and makes your proposed design infeasible.
My opinion is that lexer rules are not the correct layer to do things like unit validation; they excel at structure, not content. Evaluating the parse tree will be much more practical than trying to contort an antlr grammar.
Finally, you might also do something with embedded actions on parse rules, like validating the value of an ID token against a known set of units. It could work, but would destroy the reusability of your grammar.

Why isn't antlr 4 breaking my tokens up as expected?

So I am fairly new to ANTLR 4. I have stripped down the grammar as much as I can to show the problem:
grammar DumbGrammar;
equation
: expression (AND expression)*
;
expression
: ID
;
ID : LETTER(LETTER|DIGIT)* ;
AND: 'and';
LETTER: [a-zA-Z_];
DIGIT : [0-9];
WS : [ \r\n\t] + -> channel (HIDDEN);
If use this grammar, and use the sample text: abc and d I get a weird tree with unexpected structure as shown below(using IntelliJ and ANTLR4 plug in):
If I simply change the terminal rule AND: 'and'; to read AND: '&&'; and then submit abc && d as input I get the following tree, as expected:
I cannot figure out why it isn't parsing "and" correctly, but does parse '&&' correctly.
The input "and" is being tokenized as an ID token. Since both ID and AND match the input "and", ANTLR needs to make a decision which token to choose. It takes ID since it was defined before AND.
The solution: define AND before ID:
AND: 'and';
ID : LETTER(LETTER|DIGIT)* ;

Capturing formatted variable declarations in ANTLR

I have a simple lexer/grammar I've been working on and I'm having trouble understanding the standard operating procedure for matching formatted variables. I am trying to match the following:
Variable name can be 1 character minimum. If it is one char, it must be an uppercase or lowercase letter.
If it is greater than 1 character, it must begin with a letter of any case, and then be followed by any number of characters, including numbers, underscore and the dollar sign.
I've rewritten this several times, in many flavors, and I always get the following error:
Decision can match input such as "SINGLELETTER" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input"
Would really appreciate some insight. I understand there is some ambiguity in my grammar, but I am a bit confused why multiple alternatives can be matched, once we enter the original matching loop. Thank you!
variablename
: (SINGLELETTER)
| (SINGLELETTER|UNDERSCORE)( SINGLELETTER|UNDERSCORE | DOLLAR | NUMBER)*;
SINGLELETTER : ( 'a'..'z' | 'A'..'Z');
fragment LOWERCASE : 'a'..'z';
fragment UNDERSCORE : '_';
fragment DOLLAR : '$';
fragment NUMBER : '0'..'9';
Why not make VariableName, a lexer rule which produces a single token for the entire name?
Variablename
: SINGLELETTER
| (SINGLELETTER|UNDERSCORE) (SINGLELETTER | UNDERSCORE | DOLLAR | NUMBER)*;
fragment SINGLELETTER : ( 'a'..'z' | 'A'..'Z');
fragment LOWERCASE : 'a'..'z';
fragment UNDERSCORE : '_';
fragment DOLLAR : '$';
fragment NUMBER : '0'..'9';
Also, the way you wrote variableName does not follow point #2 you wrote (the grammar allows the variable to start with _, but you didn't allow that in your explanation).

How to use similar lexers

I have the following grammar:
cmds
: cmd+
;
cmd
: include_cmd | other_cmd
;
include_cmd
: INCLUDE DOUBLE_QUOTE FILE_NAME DOUBLE_QUOTE
;
other_cmd
: CMD_NAME ARG+
;
INCLUDE
: '#include'
;
DOUBLE_QUOTE
: '"'
;
CMD_NAME
: ('a'..'z')*
;
ARG
: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')+
;
FILE_NAME
: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '.')+
;
So the difference between CMD_NAME, ARG and FILE_NAME is not large, CMD_NAME must be lower case letters, ARG can have upper case letter and "_" and FILE_NAME yet can have ".".
But this has a problem, when I test the rule with - #include "abc", 'abc' is interpreted as CMD_NAME instead of FILE_NAME, I think it is because CMD_NAME is before FILE_NAME in the grammar file, this leads to parsing error.
Do I have to rely on such technique as predict to deal with this? Is there a pure EBNF solution other than relying on host programming language?
Thanks.
But this has a problem, when I test the rule with - #include "abc", 'abc' is interpreted as CMD_NAME instead of FILE_NAME, I think it is because CMD_NAME is before FILE_NAME in the grammar file, this leads to parsing error.
The set of all valid CMD_NAMEs intersects with the set of all valid FILE_NAMEs. Input abc qualifies as both. The lexer matches the input with the first rule listed (as you suspected) because it's the first one matched.
Do I have to rely on such technique as [predicate] to deal with this? Is there a pure EBNF solution other than relying on host programming language?
It depends on what you're willing accept in your grammar. Consider changing your include_cmd rule to something more conventional, like this:
include_cmd : INCLUDE STRING;
STRING
: '"' ~('"'|'\r'|'\n')* '"' {String text = getText(); setText(text.substring(1, text.length() - 1));}
;
Now input #include "abc" turns into tokens [INCLUDE : #include] [STRING : abc].
I don't think the grammar should be responsible for determining whether a file name is valid or not: a valid file name doesn't imply a valid file, and the grammar has to understand OS file naming conventions (valid characters, paths, etc) that probably have no bearing on the grammar itself. I think you'll be fine if you're willing to drop rule FILE_NAME for something like the rules the above.
Also worth noting, your CMD_NAME rule matches zero-length input. Consider changing ('a'..'z')* to ('a'..'z')+ unless a CMD_NAME really can be empty.
Keep in mind, too, that you'll have the same problem with ARG that you did with FILE_NAME. It's listed after CMD_NAME, so any input that qualifies for both rules (like abc again) will hit CMD_NAME. Consider breaking these rules up into more conventional ones like so:
other_cmd : ID (ID | NUMBER)+ SEMI; //instead of CMD_NAME ARG+
ID : ('a'..'z'|'A'..'Z'|'_')+; //instead of CMD_NAME, "id" part of ARG
NUMBER : ('0'..'9')+; //"number" part of ARG
SEMI : ';';
I added rule SEMI to mark the end of a command. Otherwise the parser won't know if input a b c d is supposed to be one command with three arguments (a(b,c,d)) or two commands with one argument each (a(b), c(d)).