Can a fragment use another fragment in ANTLR4?

When I use a fragment in ANTLR4, can I use another fragment?
For example, I want to define a fragment NUM_FRAGMENT which uses other fragments:
fragment DIGIT: [1-9];
fragment ZERO: [0];
fragment NUM_FRAGMENT: ZERO | DIGIT | [0-9];
Is the example above allowed in ANTLR4?

Yes, fragments can use other fragments.
In your example fragment NUM_FRAGMENT: ZERO | DIGIT | [0-9];, the [0-9] alternative is redundant because ZERO | DIGIT already covers every digit, so you can write fragment NUM_FRAGMENT: ZERO | DIGIT;.
Note that the rule names are somewhat misleading: DIGIT suggests it matches any digit (from 0 to 9), and NUM_FRAGMENT suggests it matches a number, which should be one or more digits.
I'd write the rules like this:
fragment NON_ZERO : [1-9];
fragment ZERO : '0';
fragment DIGIT : ZERO | NON_ZERO;
fragment NUM : DIGIT+;
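
As a rough illustration of how these fragments compose, here is a minimal Python sketch (not ANTLR itself) that builds one regular expression out of smaller named pieces, mirroring the four fragment rules above. The helper name matches_num is hypothetical and exists only for this example.

```python
import re

# Each "fragment" is a reusable regex piece; none of them is a token by itself.
NON_ZERO = r"[1-9]"                  # fragment NON_ZERO : [1-9];
ZERO = r"0"                          # fragment ZERO     : '0';
DIGIT = rf"(?:{ZERO}|{NON_ZERO})"    # fragment DIGIT    : ZERO | NON_ZERO;
NUM = rf"{DIGIT}+"                   # fragment NUM      : DIGIT+;


def matches_num(text: str) -> bool:
    """Return True if the whole text is matched by the composed NUM pattern."""
    return re.fullmatch(NUM, text) is not None


print(matches_num("2024"))  # True
print(matches_num("007"))   # True
print(matches_num("12a"))   # False
```

Just like a fragment referencing other fragments, NUM never mentions a character class directly; it only reuses DIGIT, which in turn reuses ZERO and NON_ZERO.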

Related

ANTLR v4 signed integer rule

In an ANTLR4 grammar for the MiniJava example, I want to parse signed integers with the following rules:
IntegerLiteral:IntegerSign? DecimalIntegerLiteral;
fragment
DecimalIntegerLiteral:DecimalNumeral IntegertypeSuffix?;
fragment
IntegerSign:'+'|'-';
fragment
IntegertypeSuffix:[lL];
fragment
DecimalNumeral:'0'| NonZeroDigit(Digits?| Underscores Digits);
fragment
Digits:Digit(DigitsAndUnderscores? Digit)?;
fragment
Digit:'0'| NonZeroDigit;
fragment
NonZeroDigit:[1-9];
fragment
DigitsAndUnderscores:DigitOrUnderscore+;
fragment
DigitOrUnderscore:Digit| '_';
fragment
Underscores:'_'+;
But when parsing, I get errors because of this expression rule:
expression:expression PLUS expression # addExpression
How should I avoid this conflict?

Getting inconsistent results

I'm using ANTLR 4.6 and I was trying to do some clean up on my grammar and ended up breaking it. I found out that it's because I had made the following change that I assumed would have been equivalent. Can someone explain why they are different?
First try
DIGIT : [0-9] ;
LETTER : [a-zA-Z] ;
ident : ('_'|LETTER) ('_'|LETTER|DIGIT)* ;
Second try
DIGIT : [0-9] ;
LETTER : [a-zA-Z_] ;
ident : LETTER (LETTER | DIGIT)* ;
Both produce different results than this
DIGIT : [0-9] ;
LETTER : [a-zA-Z_] ;
IDENT : LETTER (LETTER | DIGIT)* ;
In both of your tries, you changed ident from a lexer rule to a parser rule by writing it in lower case. Since that is the only difference between your second try and the last version, I assume that's the problem. Lexer rules define the tokens that are fed to the parser; parser rules define how the parse tree is built from those tokens. Beware that changes like this can make a big difference in how the tree is constructed.
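
To make the lexer/parser distinction concrete, here is a small Python sketch (not ANTLR itself) that imitates how a lexer chooses tokens: the longest match wins, and ties go to the rule listed first. The tokenize helper and the two rule lists are hypothetical, and skipping whitespace is an assumption, since the grammars above show no whitespace rule.

```python
import re


def tokenize(text, rules):
    """Longest-match tokenizer; ties go to the rule listed first.

    Whitespace is skipped (an assumption made for this sketch).
    """
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None  # (rule name, end position) of the best match so far
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict ">" keeps the earlier rule when two matches have equal length.
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens


# "ident" as a parser rule: the lexer only knows DIGIT and LETTER.
per_char = [("DIGIT", r"[0-9]"), ("LETTER", r"[a-zA-Z_]")]
# IDENT as a lexer rule, listed after DIGIT and LETTER as in the last grammar.
with_ident = per_char + [("IDENT", r"[a-zA-Z_][a-zA-Z_0-9]*")]

print(tokenize("ab1", per_char))
# [('LETTER', 'a'), ('LETTER', 'b'), ('DIGIT', '1')]
print(tokenize("ab1", with_ident))
# [('IDENT', 'ab1')]
print(tokenize("a", with_ident))
# [('LETTER', 'a')]
```

With a parser rule, the identifier arrives as a sequence of one-character tokens; with a lexer rule it is a single token. The last call also hints at a side effect of leaving LETTER as a regular token ahead of IDENT: a one-letter name ties in length and goes to LETTER.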

ANTLR with non-greedy rules

I would like to have the following grammar (part of it):
expression
:
expression 'AND' expression
| expression 'OR' expression
| StringSequence
;
StringSequence
:
StringCharacters
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\\]
| EscapeSequence
;
It should match things like "a b c d f" (without the quotes), as well as things like "a AND b AND c".
The problem is that my rule StringSequence is greedy, and consumes the OR/AND as well. I've tried different approaches but couldn't get my grammar to work in the correct way. Is this possible with ANTLR4? Note that I don't want to put quotes around every string. Putting quotes works fine because the rule becomes non greedy, i.e.:
StringSequence
: '"' StringCharacters? '"'
;
You have no whitespace rule, so StringCharacter matches everything except quote and backslash characters (plus the escape sequence). Include a whitespace rule so that AND and OR are matched as individual tokens. Additionally, I recommend defining lexer rules for the string literals ('AND', 'OR') instead of embedding them in the parser rule(s). This way you not only get descriptive names for the tokens (instead of auto-generated ones), but you can also better control the match order.
Another, naive solution:
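
The effect of adding a whitespace rule and dedicated keyword tokens can be sketched in Python (this is an illustration, not ANTLR output). It assumes whitespace separates tokens and that a word is a keyword only when it is exactly AND or OR; the names KEYWORDS, WORD and tokenize are hypothetical.

```python
import re

# Keywords get their own token types; everything else is a plain WORD.
KEYWORDS = {"AND", "OR"}


def tokenize(text):
    """Split on whitespace, then classify each piece as a keyword or a WORD."""
    tokens = []
    for word in re.findall(r"\S+", text):
        kind = word if word in KEYWORDS else "WORD"
        tokens.append((kind, word))
    return tokens


print(tokenize("a AND b OR c"))
# [('WORD', 'a'), ('AND', 'AND'), ('WORD', 'b'), ('OR', 'OR'), ('WORD', 'c')]
print(tokenize("ANDROID OR a"))
# [('WORD', 'ANDROID'), ('OR', 'OR'), ('WORD', 'a')]
```

Once whitespace splits the input, a run of words such as a b c d f becomes several WORD tokens, so the sequence would have to be assembled in a parser rule rather than in a single greedy lexer rule.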
StringSequence :
(StringCharacter | NotAnd | NotOr)+
;
fragment NotAnd :
'AN' ~'D'
| 'A' ~'N'
;
fragment NotOr:
'O' ~('R')
;
fragment StringCharacter :
~('O'|'A')
;
It gets a bit more complex once whitespace rules are added. Another solution would be to use semantic predicates that look ahead and prevent keywords from being consumed.

Capturing formatted variable declarations in ANTLR

I have a simple lexer/grammar I've been working on and I'm having trouble understanding the standard operating procedure for matching formatted variables. I am trying to match the following:
Variable name can be 1 character minimum. If it is one char, it must be an uppercase or lowercase letter.
If it is greater than 1 character, it must begin with a letter of any case, and then be followed by any number of characters, including numbers, underscore and the dollar sign.
I've rewritten this several times, in many flavors, and I always get the following error:
Decision can match input such as "SINGLELETTER" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Would really appreciate some insight. I understand there is some ambiguity in my grammar, but I am a bit confused why multiple alternatives can be matched, once we enter the original matching loop. Thank you!
variablename
: (SINGLELETTER)
| (SINGLELETTER|UNDERSCORE)( SINGLELETTER|UNDERSCORE | DOLLAR | NUMBER)*;
SINGLELETTER : ( 'a'..'z' | 'A'..'Z');
fragment LOWERCASE : 'a'..'z';
fragment UNDERSCORE : '_';
fragment DOLLAR : '$';
fragment NUMBER : '0'..'9';
Why not make Variablename a lexer rule, which produces a single token for the entire name?
Variablename
: SINGLELETTER
| (SINGLELETTER|UNDERSCORE) (SINGLELETTER | UNDERSCORE | DOLLAR | NUMBER)*;
fragment SINGLELETTER : ( 'a'..'z' | 'A'..'Z');
fragment LOWERCASE : 'a'..'z';
fragment UNDERSCORE : '_';
fragment DOLLAR : '$';
fragment NUMBER : '0'..'9';
Also, the way you wrote variablename does not follow your point #2: the grammar allows the name to start with _, but your description does not.

EBNF grammar (ANTLR)

I've got a problem with EBNF grammar in ANTLRWorks:
line 37:
upper_lower_case
: LOWER_CASE
| UPPER_CASE
;
line 42:
CLASSNAME
: UPPER_CASE (DIGITS | upper_lower_case )*
;
line 51:
UPPER_CASE
: 'A'..'Z'
;
line 55:
LOWER_CASE
: 'a'..'z'
;
line 60:
DIGITS : '0'..'9'
;
I want CLASSNAME to always start with a capital letter, after which it can consist of digits and upper- or lower-case letters.
Error log:
[13:11:59] warning(200): classgenerator.g:43:42:
Decision can match input such as "'0'..'9'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
[13:11:59] warning(200): classgenerator.g:43:42:
Decision can match input such as "<EOT>" using multiple alternatives: 2, 3
As a result, alternative(s) 3 were disabled for that input
[13:11:59] error(201): classgenerator.g:43:42: The following alternatives can never be
matched: 3
[13:11:59] error(208): classgenerator.g:60:1: The following token definitions can never
be matched because prior tokens match the same input: UPPER_CASE,DIGITS
Could anyone help me solve this problem?
Thanks in advance.
Regards,
Hladeo
EDIT:
So I should use the fragment keyword if the rule doesn't refer to tokens? Would using the fragment keyword in the following way be wrong?
tokens {
PUBLIC = '+';
PRIVATE = '-';
PROTECTED = '=';
}
fragment ACCESSOR
: PUBLIC
| PRIVATE
| PROTECTED
;
And another question:
OBJECTNAME
: UPPER_LOWER_CASE (UPPER_LOWER_CASE | DIGIT)*
;
OBJECTNAME should consist of at least one letter (upper or lower case doesn't matter), optionally followed by more letters or digits. What's wrong with that part of the code? When I type, for example, variable, it's okay, but when I start with a capital letter, as in Variable, I get an error:
line 1:15 mismatched input 'Variable' expecting OBJECTNAME
Your lexer rule CLASSNAME currently references parser rule upper_lower_case (lexer rules start with an uppercase letter; parser rules start with lowercase). Lexer rules can only reference lexer rules.
In addition, it appears that UPPER_CASE, LOWER_CASE, and DIGITS should not create tokens themselves so they should be marked as fragment rules. In the following example, I changed DIGITS to DIGIT since it only ever matches one digit.
CLASSNAME : UPPER_CASE (DIGIT | UPPER_LOWER_CASE)*;
fragment UPPER_LOWER_CASE : LOWER_CASE | UPPER_CASE;
fragment UPPER_CASE : 'A'..'Z';
fragment LOWER_CASE : 'a'..'z';
fragment DIGIT : '0'..'9';
Edit 1 (for the edits in the question):
A piece of text in the input can only have one token type. For example, consider the input text X3. Since this text could match a CLASSNAME or an OBJECTNAME, the lexer will end up assigning it the type of the first rule appearing in the grammar. In other words, if CLASSNAME appears before OBJECTNAME in the grammar, the input X3 will always be a CLASSNAME token and will never be an OBJECTNAME token. If OBJECTNAME appears before CLASSNAME in the grammar, the input X3 will always be an OBJECTNAME and never be a CLASSNAME (in fact, in this case no token will ever be a CLASSNAME).
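
The "first rule wins" behaviour can be illustrated with a short Python sketch (an imitation of the tie-break, not ANTLR itself). It assumes the text passed in is already the complete, longest match, so only the ordering of the rules matters; token_type and the two rule lists are hypothetical names.

```python
import re

CLASSNAME = ("CLASSNAME", r"[A-Z][A-Za-z0-9]*")
OBJECTNAME = ("OBJECTNAME", r"[A-Za-z][A-Za-z0-9]*")


def token_type(text, rules):
    """Return the name of the first rule that matches the whole text, else None."""
    for name, pattern in rules:
        if re.fullmatch(pattern, text):
            return name
    return None


class_first = [CLASSNAME, OBJECTNAME]
object_first = [OBJECTNAME, CLASSNAME]

print(token_type("X3", class_first))   # CLASSNAME
print(token_type("x3", class_first))   # OBJECTNAME
print(token_type("X3", object_first))  # OBJECTNAME
```

Because every string that matches CLASSNAME also matches OBJECTNAME, putting OBJECTNAME first means the CLASSNAME rule is never chosen in this sketch.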
Your ACCESSOR rule looks like it should be a parser rule, like the following:
accessor : PUBLIC | PROTECTED | PRIVATE;
Edit 2 (for the comment about distinguishing CLASSNAME and OBJECTNAME):
To distinguish between CLASSNAME and OBJECTNAME, you can create a lexer rule IDENTIFIER which matches either.
IDENTIFIER : UPPER_LOWER_CASE (DIGIT | UPPER_LOWER_CASE)*;
You can then create a parser rule to handle the distinction:
classname : IDENTIFIER;
objectname : IDENTIFIER;
Obviously this allows x3 to be a classname, which is not valid in your language. When possible, I always prefer to relax the parser rules a bit and perform further validation later where I can provide a better error message. For example, if you allow x3 to match classname, then after you parse the input and have an AST (ANTLR 3) or parse tree (ANTLR 4), you can look for all instances of classname and make sure that the matched IDENTIFIER starts with the required upper case letter.
Example error message produced by the parser's automatic error reporting:
line 1:15 mismatched input 'variable' expecting CLASSNAME
Example error message produced by separate validation:
line 1:15 class name variable must start with an upper case letter
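
A separate validation pass of this kind might look like the following Python sketch. It is not part of any ANTLR API: the function name validate_class_name and its line/column parameters are hypothetical, and it assumes the text and position of each classname have already been collected from the tree.

```python
def validate_class_name(name, line, column):
    """Return an error message if name does not start with A-Z, else None."""
    if name and "A" <= name[0] <= "Z":
        return None
    return (
        f"line {line}:{column} class name {name} "
        "must start with an upper case letter"
    )


print(validate_class_name("variable", 1, 15))
# line 1:15 class name variable must start with an upper case letter
print(validate_class_name("Variable", 1, 15))
# None
```

Keeping this check outside the grammar lets the parser accept any IDENTIFIER as a classname while the error message stays specific to the actual rule being enforced.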