Capturing formatted variable declarations in ANTLR

I have a simple lexer/grammar I've been working on and I'm having trouble understanding the standard operating procedure for matching formatted variables. I am trying to match the following:
Variable name can be 1 character minimum. If it is one char, it must be an uppercase or lowercase letter.
If it is greater than 1 character, it must begin with a letter of any case, and then be followed by any number of characters, including numbers, underscore and the dollar sign.
I've rewritten this several times, in many flavors, and I always get the following error:
Decision can match input such as "SINGLELETTER" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Would really appreciate some insight. I understand there is some ambiguity in my grammar, but I am a bit confused why multiple alternatives can be matched, once we enter the original matching loop. Thank you!
variablename
: (SINGLELETTER)
| (SINGLELETTER|UNDERSCORE)( SINGLELETTER|UNDERSCORE | DOLLAR | NUMBER)*;
SINGLELETTER : ( 'a'..'z' | 'A'..'Z');
fragment LOWERCASE : 'a'..'z';
fragment UNDERSCORE : '_';
fragment DOLLAR : '$';
fragment NUMBER : '0'..'9';

Why not make Variablename a lexer rule that produces a single token for the entire name?
Variablename
: SINGLELETTER
| (SINGLELETTER|UNDERSCORE) (SINGLELETTER | UNDERSCORE | DOLLAR | NUMBER)*;
fragment SINGLELETTER : ( 'a'..'z' | 'A'..'Z');
fragment LOWERCASE : 'a'..'z';
fragment UNDERSCORE : '_';
fragment DOLLAR : '$';
fragment NUMBER : '0'..'9';
Also, the variablename rule as you wrote it does not follow your point #2: the grammar allows a name to start with _, but your description requires the first character to be a letter.
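One way to see where the ambiguity warning comes from: the second alternative already matches a single letter, because its trailing `*` loop may run zero times, so the first alternative adds nothing. The sketch below checks the stated naming rule (a letter first, then letters, digits, `_`, or `$`) with a Python regular expression. The regex and the name `is_variable_name` are illustrative assumptions, not something ANTLR generates.

```python
import re

# Hypothetical regex mirroring the stated rule: one letter, then zero or more
# letters, digits, underscores, or dollar signs. The single-letter case is
# simply the "zero more characters" case, so no separate alternative is needed.
VARIABLE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_$]*")

def is_variable_name(text: str) -> bool:
    """Return True if the whole text is a valid variable name."""
    return VARIABLE_NAME.fullmatch(text) is not None

if __name__ == "__main__":
    for candidate in ["x", "Total$1", "a_b", "_tmp", "9lives", ""]:
        print(repr(candidate), is_variable_name(candidate))
```

If names really must start with a letter, the equivalent change in the grammar would be to drop UNDERSCORE from the first group of the lexer rule.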

Related

Can a fragment use another fragment in ANTLR4

When I use a fragment in ANTLR4, can I use another fragment?
For example, I want to define a fragment NUM_FRAGMENT which uses other fragments:
fragment DIGIT: [1-9];
fragment ZERO: [0];
fragment NUM_FRAGMENT: ZERO | DIGIT | [0-9];
Is the example above allowed in ANTLR4?
Yes, fragments can use other fragments.
In your example, fragment NUM_FRAGMENT: ZERO | DIGIT | [0-9];, the [0-9] alternative is redundant because ZERO | DIGIT already covers every digit, so you can write fragment NUM_FRAGMENT: ZERO | DIGIT;.
Note that the naming of the rules is not entirely correct: DIGIT suggests it matches any digit (from 0 to 9). And NUM_FRAGMENT suggests it matches a number which should match one or more digits.
I'd write the rules like this:
fragment NON_ZERO : [1-9];
fragment ZERO : '0';
fragment DIGIT : ZERO | NON_ZERO;
fragment NUM : DIGIT+;
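As a rough analogy outside ANTLR, fragments behave like named pieces of pattern text that are only ever used inside other patterns. The sketch below rebuilds the four rules above as Python regex strings. The function name `is_num` is an assumption made for illustration.

```python
import re

# Each "fragment" is a reusable piece of pattern text; none of them is used
# for matching on its own.
NON_ZERO = r"[1-9]"
ZERO = r"0"
DIGIT = rf"(?:{ZERO}|{NON_ZERO})"  # a fragment built from two other fragments
NUM = re.compile(rf"{DIGIT}+")     # one or more digits

def is_num(text: str) -> bool:
    """Return True if the whole text consists of one or more digits."""
    return NUM.fullmatch(text) is not None

if __name__ == "__main__":
    for candidate in ["0", "1204", "", "12a"]:
        print(repr(candidate), is_num(candidate))
```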

Parsing letter ranges with ANTLR

I have the following parser rules:
defDirective : defType whiteSpace letterSpec (whiteSpace? COMMA whiteSpace? letterSpec)*;
defType :
DEFBOOL | DEFBYTE | DEFINT | DEFLNG | DEFLNGLNG | DEFLNGPTR | DEFCUR |
DEFSNG | DEFDBL | DEFDATE |
DEFSTR | DEFOBJ | DEFVAR
;
letterSpec : universalLetterRange | letterRange | singleLetter;
singleLetter : RESTRICTED_LETTER;
universalLetterRange : upperCaseA whiteSpace? MINUS whiteSpace? upperCaseZ;
upperCaseA : {_input.Lt(1).Text.Equals("A")}? RESTRICTED_LETTER;
upperCaseZ : {_input.Lt(1).Text.Equals("Z")}? RESTRICTED_LETTER;
letterRange : firstLetter whiteSpace? MINUS whiteSpace? lastLetter;
firstLetter : RESTRICTED_LETTER;
lastLetter : RESTRICTED_LETTER;
whiteSpace : (WS | LINE_CONTINUATION)+;
with the relevant Lexer Rules:
RESTRICTED_LETTER : [a-zA-Z];
MINUS : '-';
COMMA : ',';
WS : [ \t];
LINE_CONTINUATION : [ \t]* UNDERSCORE [ \t]* '\r'? '\n';
and the DefTypes matching their camel-case spelling.
Now when I try to test this on the following inputs, it works exactly as expected:
DefInt I,J,K
DefBool A-Z
However, it does not work on arbitrary letter ranges (see rule letterRange). When I use the input DefByte B-F, I get the error message "line 1:8 mismatched input 'B' expecting RESTRICTED_LETTER"
I've tried expressing RESTRICTED_LETTER as a range ('A'..'Z'|'a'..'z'), but that didn't change anything about the error message.
When changing the first whiteSpace in defDirective to whiteSpace+ the error message gets a little longer (now including WS and LINE_CONTINUATION in the expected alternatives).
Also the parse-tree generated by the IntelliJ ANTLR Plugin suddenly starts recognizing the F as a singleLetter, which it previously didn't.
This behaviour seems to be consistent between the target languages Java and CSharp.
Previously the rule used to be a lot more relaxed, but that led to incorrect parse-trees, so I kinda want to fix this.
How can I correctly recognize letterRange here?
So ... @BartKiers had the right suspicion. The given lexer rules weren't all the rules involved in the process.
The full grammar contains a lexer rule B_CHAR : B that's used in a special case of an unrelated grammar rule. That B_CHAR took precedence over RESTRICTED_LETTER when lexing the input stream.
The grammar rules presented are correct (and work fine), but the B_CHAR token needs to be removed from the Tokens lexed.
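The precedence effect described above can be reproduced with a small simulation of the usual lexer tie-breaking: the longest match wins, and on equal length the rule listed first wins. This is a sketch, not ANTLR's implementation. The pattern `B` for B_CHAR and its position ahead of RESTRICTED_LETTER are assumptions, since the full grammar is not shown.

```python
import re

def tokenize(text, rules):
    """Tokenize using longest match; ties go to the rule listed first."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best, so on
            # equal length the earlier rule is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

LETTER_RULES = [("RESTRICTED_LETTER", r"[a-zA-Z]"), ("MINUS", r"-")]
# Assumption: B_CHAR is declared before RESTRICTED_LETTER in the full grammar.
WITH_B_CHAR = [("B_CHAR", r"B")] + LETTER_RULES

if __name__ == "__main__":
    print(tokenize("B-F", WITH_B_CHAR))   # the B is lexed as B_CHAR
    print(tokenize("B-F", LETTER_RULES))  # the B is lexed as RESTRICTED_LETTER
```

With B_CHAR present, the parser never sees a RESTRICTED_LETTER token for B, which is consistent with the "mismatched input 'B'" message.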

ANTLR with non-greedy rules

I would like to have the following grammar (part of it):
expression
:
expression 'AND' expression
| expression 'OR' expression
| StringSequence
;
StringSequence
:
StringCharacters
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\]
| EscapeSequence
;
It should match things like "a b c d f" (without the quotes), as well as things like "a AND b AND c".
The problem is that my rule StringSequence is greedy, and consumes the OR/AND as well. I've tried different approaches but couldn't get my grammar to work in the correct way. Is this possible with ANTLR4? Note that I don't want to put quotes around every string. Putting quotes works fine because the rule becomes non greedy, i.e.:
StringSequence
: '"' StringCharacters? '"'
;
You have no whitespace rule, so StringCharacter matches everything except quote and backslash characters (plus the escape sequence). Include a whitespace rule so that AND and OR are matched as individual tokens. Additionally, I recommend defining lexer rules for the string literals ('AND', 'OR') instead of embedding them in the parser rule(s). This way you not only get descriptive names for the tokens (instead of auto-generated ones), but you can also better control the match order.
Another, admittedly naive, solution:
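A minimal sketch of that suggestion, simulated in Python rather than written as a grammar: whitespace is skipped, so the input splits into words, and a word that is exactly AND or OR gets its own token type. The token name WORD and the function `tokenize` are assumptions made for illustration. Escape sequences are left out to keep the sketch short.

```python
import re

KEYWORDS = {"AND", "OR"}  # assumed to take priority over WORD on a tie
# A token is either a run of whitespace or a run of characters that are not
# whitespace, a double quote, or a backslash.
TOKEN = re.compile(r'(?P<WS>\s+)|(?P<WORD>[^\s"\\]+)')

def tokenize(text):
    """Return (type, text) pairs; whitespace is skipped."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        pos = m.end()
        if m.lastgroup == "WS":
            continue  # comparable to a `-> skip` whitespace rule
        word = m.group()
        # A whole word equal to a keyword becomes that keyword's token type.
        tokens.append((word if word in KEYWORDS else "WORD", word))
    return tokens

if __name__ == "__main__":
    print(tokenize("a AND b OR c"))
    print(tokenize("ANDROID a b"))
```

Under this scheme a sequence such as a b c d f arrives as several WORD tokens, so the parser rule would accept one or more of them instead of relying on a single greedy lexer rule.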
StringSequence :
(StringCharacter | NotAnd | NotOr)+
;
fragment NotAnd :
'AN' ~'D'
| 'A' ~'N'
;
fragment NotOr:
'O' ~('R')
;
fragment StringCharacter :
~('O'|'A')
;
This gets a bit more complex once whitespace rules are added. Another solution would be to use semantic predicates that look ahead and prevent keywords from being consumed.

ANTLR v4: Same character has different meaning in different contexts

This is my first crack at parser generators, and, consequently ANTLR. I'm using ANTLR v4 trying to generate a simple practice parser for Morse Code with the following extra rules:
A letter (e.g., ... [the letter 's']) can be denoted as capitalized if a '^' precedes it
ex.: ^... denotes a capital 'S'
Special characters can be embedded in parentheses
ex.: (#)
Each encoded entity will be separated by whitespace
So I could encode the following sentence:
ABC a#b.com
as (with corresponding letters shown underneath):
^.- ^-... ^-.-. ( ) .- (#) -... (.) -.-. --- --
A B C ' ' a '#' b '.' c o m
Particularly note the following two entities: ( ) (which denotes a space) and (.) (which denotes a period).
There is mainly one thing that I'm finding hard to wrap my head around: the same token can take on different meanings depending on whether it is in parentheses or not. That is, I want to tell ANTLR to discard whitespace, but not in the ( ) case. Also, a Morse Code character can consist of dots and dashes (periods and dashes), yet I don't want to consider the period in (.) as "any character".
Here is the grammar I have got so far:
grammar MorseCode;
file: entity*;
entity:
special
| morse_char;
special: '(' SPECIAL ')';
morse_char: '^'? (DOT_OR_DASH)+;
SPECIAL : .; // match any character
DOT_OR_DASH : ('.' | '-');
WS : [ \t\r\n]+ -> skip; // we don't care about whitespace (or do we?)
When I try it against the following input:
^... --- ...(#)
I get the following output (from grun ... -tokens):
[@0,0:0='^',<1>,1:0]
[@1,1:1='.',<4>,1:1]
...
[@15,15:14='<EOF>',<-1>,1:15]
line 1:1 mismatched input '.' expecting DOT_OR_DASH
It seems there is trouble with ambiguity between SPECIAL and DOT_OR_DASH?
It seems like your (#) syntax behaves like a quoted string in other programming languages. I would start by defining SPECIAL as:
SPECIAL : '(' .*? ')';
To ensure that . . and .. are actually different, you can use this:
SYMBOL : [.-]+;
Then you can define your ^ operator:
CARET : '^';
With these three tokens (and leaving WS as-is), you can simplify your parser rules significantly:
file
: entity* EOF
;
entity
: morse_char
| SPECIAL
;
morse_char
: CARET? SYMBOL
;
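To check how these three tokens split the sample input, here is a small Python simulation of the lexer. It is a sketch under assumptions: it tries the alternatives in a fixed order instead of reproducing ANTLR's exact matching rules, and the function name `tokenize` is made up for illustration.

```python
import re

# SPECIAL is tried first, so "(.)" and "( )" are consumed whole and the
# period or space inside them is never seen as a SYMBOL or as whitespace.
# re.DOTALL lets "." match any character, including a newline.
TOKEN = re.compile(
    r"(?P<SPECIAL>\(.*?\))"
    r"|(?P<SYMBOL>[.-]+)"
    r"|(?P<CARET>\^)"
    r"|(?P<WS>[ \t\r\n]+)",
    re.DOTALL,
)

def tokenize(text):
    """Return (type, text) pairs; whitespace outside parentheses is skipped."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        pos = m.end()
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
    return tokens

if __name__ == "__main__":
    print(tokenize("^... --- ...(#)"))
    print(tokenize("( ) (.)"))
```

Because the parenthesized form is a single token, the whitespace inside ( ) survives even though WS is skipped everywhere else.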

EBNF grammar (ANTLR)

I've got a problem with EBNF grammar in ANTLRWorks:
line 37:
upper_lower_case
: LOWER_CASE
| UPPER_CASE
;
line 42:
CLASSNAME
: UPPER_CASE (DIGITS | upper_lower_case )*
;
line 51:
UPPER_CASE
: 'A'..'Z'
;
line 55:
LOWER_CASE
: 'a'..'z'
;
line 60:
DIGITS : '0'..'9'
;
I want CLASSNAME to always start with a capital letter, and then it can consist of digits and upper or lower case letters.
Error log:
[13:11:59] warning(200): classgenerator.g:43:42:
Decision can match input such as "'0'..'9'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
[13:11:59] warning(200): classgenerator.g:43:42:
Decision can match input such as "<EOT>" using multiple alternatives: 2, 3
As a result, alternative(s) 3 were disabled for that input
[13:11:59] error(201): classgenerator.g:43:42: The following alternatives can never be
matched: 3
[13:11:59] error(208): classgenerator.g:60:1: The following token definitions can never
be matched because prior tokens match the same input: UPPER_CASE,DIGITS
Could anyone help me solve this problem?
Thanks in advance.
Regards,
Hladeo
EDIT:
So I should use the fragment keyword when a rule is not meant to produce a token itself? Would using the fragment keyword in the following way be wrong?
tokens {
PUBLIC = '+';
PRIVATE = '-';
PROTECTED = '=';
}
fragment ACCESSOR
: PUBLIC
| PRIVATE
| PROTECTED
;
and another question.
OBJECTNAME
: UPPER_LOWER_CASE (UPPER_LOWER_CASE | DIGIT)*
;
OBJECTNAME should consist of at least one letter (upper or lower case doesn't matter), optionally followed by more letters or digits. What's wrong with that part of the code? When I type, for example, variable, it's okay, but when I start with a capital letter, as in Variable, I get an error:
line 1:15 mismatched input 'Variable' expecting OBJECTNAME
Your lexer rule CLASSNAME currently references parser rule upper_lower_case (lexer rules start with an uppercase letter; parser rules start with lowercase). Lexer rules can only reference lexer rules.
In addition, it appears that UPPER_CASE, LOWER_CASE, and DIGITS should not create tokens themselves so they should be marked as fragment rules. In the following example, I changed DIGITS to DIGIT since it only ever matches one digit.
CLASSNAME : UPPER_CASE (DIGIT | UPPER_LOWER_CASE)*;
fragment UPPER_LOWER_CASE : LOWER_CASE | UPPER_CASE;
fragment UPPER_CASE : 'A'..'Z';
fragment LOWER_CASE : 'a'..'z';
fragment DIGIT : '0'..'9';
Edit 1 (for the edits in the question):
A piece of text in the input can only have one token type. For example, consider the input text X3. Since this text could match a CLASSNAME or an OBJECTNAME, the lexer will end up assigning it the type of the first rule appearing in the grammar. In other words, if CLASSNAME appears before OBJECTNAME in the grammar, the input X3 will always be a CLASSNAME token and will never be an OBJECTNAME token. If OBJECTNAME appears before CLASSNAME in the grammar, the input X3 will always be an OBJECTNAME and never a CLASSNAME (in fact, in this case no token will ever be a CLASSNAME).
Your ACCESSOR rule looks like it should be a parser rule, like the following:
accessor : PUBLIC | PROTECTED | PRIVATE;
Edit 2 (for the comment about distinguishing CLASSNAME and OBJECTNAME):
To distinguish between CLASSNAME and OBJECTNAME, you can create a lexer rule IDENTIFIER which matches either.
IDENTIFIER : UPPER_LOWER_CASE (DIGIT | UPPER_LOWER_CASE)*;
You can then create a parser rule to handle the distinction:
classname : IDENTIFIER;
objectname : IDENTIFIER;
Obviously this allows x3 to be a classname, which is not valid in your language. When possible, I always prefer to relax the parser rules a bit and perform further validation later where I can provide a better error message. For example, if you allow x3 to match classname, then after you parse the input and have an AST (ANTLR 3) or parse tree (ANTLR 4), you can look for all instances of classname and make sure that the matched IDENTIFIER starts with the required upper case letter.
Example error message produced by the parser's automatic error reporting:
line 1:15 mismatched input 'variable' expecting CLASSNAME
Example error message produced by separate validation:
line 1:15 class name variable must start with an upper case letter
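A minimal sketch of such a separate validation pass, written in Python for brevity. It assumes the IDENTIFIER tokens matched by the classname rule have already been collected from the tree, along with their line and column. The `Identifier` type and the `validate_class_names` function are hypothetical helpers, not part of ANTLR.

```python
from typing import List, NamedTuple

class Identifier(NamedTuple):
    """Hypothetical record for an IDENTIFIER token matched by classname."""
    text: str
    line: int
    column: int

def validate_class_names(class_names: List[Identifier]) -> List[str]:
    """Return one error message per class name not starting with 'A'..'Z'."""
    errors = []
    for ident in class_names:
        first = ident.text[:1]
        if not ("A" <= first <= "Z"):
            errors.append(
                f"line {ident.line}:{ident.column} class name {ident.text} "
                "must start with an upper case letter"
            )
    return errors

if __name__ == "__main__":
    found = [Identifier("variable", 1, 15), Identifier("X3", 2, 0)]
    for message in validate_class_names(found):
        print(message)
```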