Exclude tokens from Identifier lexical rule - antlr

I have an Identifier lexical rule:
Identifier
: ( 'a'..'z' | 'A'..'Z' | '_' ) ( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' )*
;
LogicalOr and LogicalAnd rules:
LogicalOr : '| ' | '||' | OR;
LogicalAnd : '&' | '&&' | AND;
fragment Or : '[Oo][Rr]';
fragment And : '[Aa][Nn][Dd]';
The strings "and" and "or" are recognized as identifiers instead of LogicalAnd and LogicalOr. Could someone help me solve this problem, please?

There are two potential issues at play. First and foremost, ANTLR 3 does not support the character class syntax introduced by ANTLR 4. Your Or fragment literally matches the input [Oo][Rr]; it does not match OR, or, or oR. The same applies to your And fragment. You need to write the rule like this instead:
fragment
Or
: ('O' | 'o') ('R' | 'r')
;
If this does not resolve your issue, then you need to make sure your LogicalOr and LogicalAnd rules are positioned before the Identifier rule in the grammar. The rule which appears first will determine what token type is assigned for this input sequence.
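Putting both points together, a minimal sketch of the corrected ANTLR 3 lexer rules might look like the following. The And fragment is written by analogy with the Or fragment above. Two things here are assumptions rather than part of the original question: the single-character '|' alternative (assuming the space in '| ' was unintended) and the Or/And spelling of the references (chosen to match the fragment names).

```antlr
// ANTLR 3 lexer rules. The operator rules come before Identifier,
// so they are preferred when both could match the same text.
LogicalOr  : '|' | '||' | Or ;
LogicalAnd : '&' | '&&' | And ;

// Case-insensitive keywords spelled out character by character
fragment Or  : ('O' | 'o') ('R' | 'r') ;
fragment And : ('A' | 'a') ('N' | 'n') ('D' | 'd') ;

Identifier
  : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
  ;
```

With this ordering, or and AND should come out as LogicalOr and LogicalAnd, while a longer name such as order should still be tokenized as an Identifier.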

Related

I'm just starting with ANTLR and I can't decipher where I'm messing up with a mismatched input error

I've just started using ANTLR, so I'd really appreciate the help! I'm just trying to make a variable declaration rule, but it's not working. I've put the files I'm working with below; please let me know if you need anything else!
INPUT CODE:
var test;
GRAMMAR G4 FILE:
grammar treetwo;
program : (declaration | statement)+ EOF;
declaration :
variable_declaration
| variable_assignment
;
statement:
expression
| ifstmnt
;
variable_declaration:
VAR NAME SEMICOLON
;
variable_assignment:
NAME '=' NUM SEMICOLON
| NAME '=' STRING SEMICOLON
| NAME '=' BOOLEAN SEMICOLON
;
expression:
operand operation operand SEMICOLON
| expression operation expression SEMICOLON
| operand operation expression SEMICOLON
| expression operation operand SEMICOLON
;
ifstmnt:
IF LPAREN term RPAREN LCURLY
(declaration | statement)+
RCURLY
;
term:
| NUM EQUALITY NUM
| NAME EQUALITY NUM
| NUM EQUALITY NAME
| NAME EQUALITY NAME
;
/*Tokens*/
NUM : '0' | '-'?[1-9][0-9]*;
STRING: [a-zA-Z]+;
BOOLEAN: 'true' | 'false';
VAR : 'var';
NAME : [a-zA-Z]+;
SEMICOLON : ';';
LPAREN: '(';
RPAREN: ')';
LCURLY: '{';
RCURLY: '}';
EQUALITY: '==' | '<' | '>' | '<=' | '>=' | '!=' ;
operation: '+' | '-' | '*' | '/';
operand: NUM;
IF: 'if';
WS : [ \t\r\n]+ -> skip;
Error I'm getting:
(line 1,char 0): mismatched input 'var' expecting {NUM, 'var', NAME, 'if'}
Your STRING rule is the same as your NAME rule.
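With the ANTLR lexer, if two lexer rules match the same input with the same length, the first one declared will be used. As a result, you'll never see a NAME token. The same goes for VAR, BOOLEAN and IF, which are all declared after STRING: the input var is tokenized as STRING, which is why the parser reports a mismatch even though 'var' appears in the list of expected tokens.
Most tutorials will show you how to dump out the token stream. It's usually a good idea to view the token stream and verify your lexer rules before getting too far into your parser rules.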
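One possible arrangement of the token rules is sketched below. It assumes that string values are meant to be double-quoted, which the original grammar does not say; if strings really are bare words, the lexer cannot tell them apart from NAME, and a single rule should be used for both.

```antlr
// Keywords first, so they win the tie against NAME for inputs like 'var'
VAR     : 'var';
IF      : 'if';
BOOLEAN : 'true' | 'false';
NUM     : '0' | '-'? [1-9] [0-9]*;
// Assumption: string literals are double-quoted, which keeps them distinct from NAME
STRING  : '"' ~["\r\n]* '"';
// Catch-all for names, declared after every keyword
NAME    : [a-zA-Z]+;
```

With this order, var test; should produce the tokens VAR, NAME and SEMICOLON. Running grun treetwo tokens -tokens on the input file prints the token stream so that this can be checked.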

Can't create a variable with just one letter

I wish the variables could be declared with only one letter in the name.
When I write Integer aa; all work, but
when I type Integer a; then grun says: mismatched input 'a' expecting ID.
I've seen the inverse problem but it didn't help. I think my code is right but I can't see where I'm wrong. This is my lexer:
lexer grammar Symbols;
...
LineComment: '//' ~[\u000A\u000D]* -> channel(HIDDEN) ;
DelimetedComment: '/*' .*? '*/' -> channel(HIDDEN) ;
String: '"' .*? '"' ;
Character: '\'' (EscapeSeq | .) '\'' ;
IntegerLiteral: '0' | (ADD?| SUB) DecDigitNoZero DecDigit+ ;
FloatLiteral: ((ADD? | SUB) (DecDigitNoZero DecDigit*)? DOT DecDigit+ | IntegerLiteral) [F] ;
DoubleLiteral: ((ADD? | SUB) (DecDigitNoZero DecDigit*)? DOT DecDigit+ | IntegerLiteral) [D] ;
LongLiteral: IntegerLiteral [L] ;
HexLiteral: '0' [xX] HexDigit (HexDigit | UNDERSCORE)* ;
BinLiteral: '0' [bB] BinDigit (BinDigit | UNDERSCORE)* ;
OctLiteral: '0' [cC] OctDigit (OctDigit | UNDERSCORE)* ;
Booleans: TRUE | FALSE ;
Number: IntegerLiteral | FloatLiteral | DoubleLiteral | BinLiteral | HexLiteral | OctLiteral | LongLiteral ;
EscapeSeq: UniCharacterLiteral | EscapedIdentifier;
UniCharacterLiteral: '\\' 'u' HexDigit HexDigit HexDigit HexDigit ;
EscapedIdentifier: '\\' ('t' | 'b' | 'r' | 'n' | '\'' | '"' | '\\' | '$') ;
HexDigit: [0-9a-fA-F] ;
BinDigit: [01] ;
OctDigit: [0-7];
DecDigit: [0-9];
DecDigitNoZero: [1-9];
ID: [a-z] ([a-zA-Z_] | [0-9])*;
TYPE: [A-Z] ([a-zA-Z] | UNDERSCORE | [0-9])* ;
DATATYPE: Number | String | Character | Booleans ;
When you get an error like "Unexpected input 'foo', expected BAR" and you think "But 'foo' is a BAR", the first thing you should do is to print the token stream for your input (you can do this by running grun Symbols tokens -tokens inputfile). If you do this, you'll see that the a in your input is recognized as a HexDigit, not as an ID.
Why does this happen? Because both HexDigit and ID match the input a and ANTLR (like most lexer generators) resolves ambiguities according to the maximal munch rule: When multiple rules can match the current input, it chooses the one that produces the longest match (which is why variables with more than one letter work) and then resolves ties by picking the one that is defined first, which is HexDigit in this case.
Note that the lexer does not care which lexer rules are used by the parser and when. The lexer decides which tokens to produce solely based on the contents of the lexer grammar, so the lexer does not know or care that the parser wants an ID right now. It looks at all rules that match and then picks one according to the maximal munch rule and that's it.
In your case you never actually use HexDigit in your parser grammar, so there is no reason why you'd ever want a HexDigit token to be created. Therefore HexDigit should not be a lexer rule - it should be a fragment:
fragment HexDigit : [0-9a-fA-F];
This also applies to your other rules that aren't used in the parser, including all the ...Digit rules.
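As a sketch, the digit helpers from the lexer above could be declared like this. It assumes that UNDERSCORE is defined in the elided part of the lexer and that no rule there also matches a single lowercase letter.

```antlr
// Helper rules as fragments: usable inside other lexer rules, never emitted as tokens
fragment HexDigit       : [0-9a-fA-F] ;
fragment BinDigit       : [01] ;
fragment OctDigit       : [0-7] ;
fragment DecDigit       : [0-9] ;
fragment DecDigitNoZero : [1-9] ;

// With the helpers out of the way, a single letter such as a can only be an ID
ID   : [a-z] ([a-zA-Z_] | [0-9])* ;
TYPE : [A-Z] ([a-zA-Z] | UNDERSCORE | [0-9])* ;
```

The literal rules (HexLiteral, BinLiteral and so on) can keep referring to the fragments exactly as before; only the standalone tokens disappear.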
PS: Your Number rule will never match because of these same rules. It should probably be a parser rule instead (or the other number rules should be fragments if you don't care which kind of number literal you have).

How to allow an identifier which can start with a digit without causing MismatchedTokenException

I want to match the following input:
statement span=1m 0_dur=12
with the following grammar:
options {
language = Java;
output=AST;
ASTLabelType=CommonTree;
}
statement :'statement' 'span' '=' INTEGER 'm' ident '=' INTEGER;
INTEGER
: DIGIT+
;
ident : IDENT | 'AVG' | 'COUNT';
IDENT
: (LETTER | DIGIT | '_')+ ;
WHITESPACE
: ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment
LETTER : ('a'..'z' | 'A'..'Z') ;
fragment
DIGIT : '0'..'9';
but it causes an error:
MismatchedTokenException : line 1:15 mismatched input '1m' expecting '\u0004'
Does anyone have any idea how to solve this?
Thanks
Charles
I think your grammar is context-sensitive, even at the lexical analyser (tokenizer) level. The string "1m" is recognized as a single IDENT, not as an INTEGER followed by 'm'. You can either redefine your syntax, use predicated parsing, or embed Java code in your grammar to detect the context (e.g. if the number appears after "span" followed by "=", then parse it as INTEGER).
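As an illustration of the first option (redefining the syntax), here is a rough sketch that treats 1m as one token. DURATION is a hypothetical token name that is not part of the original grammar, and the sketch relies on ANTLR 3 preferring the rule listed first when two lexer rules match the same text.

```antlr
// Sketch only: DURATION is a hypothetical token name introduced for illustration
statement : 'statement' 'span' '=' DURATION ident '=' INTEGER ;

ident : IDENT | 'AVG' | 'COUNT' ;

INTEGER  : DIGIT+ ;
// Listed before IDENT so that it is chosen when both rules match, e.g. for 1m
DURATION : DIGIT+ 'm' ;
IDENT    : (LETTER | DIGIT | '_')+ ;
```

The trade-off is that text made of digits followed by a single m (such as 12m) can no longer be an IDENT, and the numeric part of the span has to be extracted from the DURATION token text in code. An input like 0_dur is unaffected and should still be an IDENT.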

Why is this grammar giving me a "non LL(*) decision" error?

I am trying to add support for expressions in my grammar. I am following the example given by Scott Stanchfield's Antlr Tutorial. For some reason the add rule is causing an error. It is causing a non-LL(*) error saying, "Decision can match input such as "'+'..'-' IDENT" using multiple alternatives"
Simple input like:
a.b.c + 4
causes the error. I am using the AntlrWorks Interpreter to test my grammar as I go. There seems to be a problem with how the tree is built for the unary +/- and the add rule. I don't understand why there are two possible parses.
Here's the grammar:
path : (IDENT)('.'IDENT)* //(NAME | LCSTNAME)('.'(NAME | LCSTNAME))*
;
term : path
| '(' expression ')'
| NUMBER
;
negation
: '!'* term
;
unary : ('+' | '-')* negation
;
mult : unary (('*' | '/' | '%') unary)*
;
add : mult (( '+' | '-' ) mult)*
;
relation
: add (('==' | '!=' | '<' | '>' | '>=' | '<=') add)*
;
expression
: relation (('&&' | '||') relation)*
;
multiFunc
: IDENT expression+
;
NUMBER : DIGIT+ ('.'DIGIT+)?
;
IDENT : (LCLETTER|UCLETTER)(LCLETTER|UCLETTER|DIGIT|'_')*
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : (' ' | '\t' | '\r' | '\n' | '\f')+ {$channel = HIDDEN;}
;
fragment
LCLETTER
: 'a'..'z'
;
fragment
UCLETTER: 'A'..'Z'
;
fragment
DIGIT : '0'..'9'
;
I need an extra set of eyes. What am I missing?
The fact that you let one or more expressions match in:
multiFunc
: IDENT expression+
;
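makes your grammar ambiguous. Let's say you're trying to match "a 1 - - 2" using the multiFunc rule. The a is matched by IDENT, but the two minus signs in 1 - - 2 cause trouble for expression+, because the parser now has two possible ways to parse them:
parse 1: expression+ matches a single expression, 1 - (-2), where the first minus is the binary operator from the add rule and the second is a unary minus.
parse 2: expression+ matches two expressions, 1 followed by - - 2, where both minus signs are unary.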
Your multiFunc rule contains a list of expressions. An expression can begin with + or - because of the unary rule, so within the list an expression can also be followed by those same tokens. This conflicts with the add rule: after a mult, the parser cannot decide whether a + or - continues the current expression or terminates it and starts the next one.
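One way to remove the ambiguity, which neither answer above proposes and which assumes the input syntax may be changed, is to separate the arguments with a token that cannot appear inside an expression. A minimal sketch using a comma:

```antlr
// Sketch: the comma marks where one expression ends and the next begins,
// so a '+' or '-' after a mult can only continue the current expression
multiFunc
  : IDENT expression (',' expression)*
  ;
```

With this rule, a 1 - - 2 has a single parse (one argument, 1 - (-2)), and the two-argument reading has to be written explicitly as a 1, - - 2.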

ANTLR : A lexer or a parser error?

I wrote a simple lexer in ANTLR and the grammar for ID is something like this:
ID : (('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*|'_'('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*);
(No digits are allowed at the beginning)
When I generated the code (in Java) and tested the input:
3a
I expected an error, but the input was recognized as "INT ID". How can I fix the grammar to make it report an error (with only lexer rules)?
Thanks for your attention
Note that your rule could be rewritten into:
ID
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '0'..'9' |'_')*
;
or with fragments (rules that won't produce tokens, but are only used by other lexer rules):
ID
: (Letter | '_') (Letter| Digit |'_')*
;
fragment Letter
: 'a'..'z'
| 'A'..'Z'
;
fragment Digit
: '0'..'9'
;
But if input like "3a" is recognized by your lexer and produces the tokens INT and ID, then you shouldn't change anything. A problem with such input would probably come up in your parser rule(s) because it is semantically incorrect.
If you really want to let the lexer handle this kind of stuff, you could do something like this:
INT
: Digit+ (Letter {/* throw an exception */})?
;
And if you want to allow INT literals to optionally end with an f or L, then you'd first have to inspect the contents of Letter, and if it's not "f" or "L", then you throw an exception.
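A rough sketch of that suffix check for the ANTLR 3 Java target could look like this. The choice of RuntimeException and the error message are assumptions made for illustration; a real lexer would more likely report the problem through its own error handling.

```antlr
INT
  : Digit+
    ( Letter
      {
        // getText() returns the text matched so far for the current token
        String matched = getText();
        char suffix = matched.charAt(matched.length() - 1);
        // Assumption: only the suffixes f and L are legal after the digits
        if (suffix != 'f' && suffix != 'L') {
          throw new RuntimeException("invalid numeric suffix: " + matched);
        }
      }
    )?
  ;
```

With this rule, 3f and 3L are accepted as INT tokens, whereas 3a makes the lexer throw as soon as the a has been consumed.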