I'm at the very beginning of learning ANTLR4 lexer rules. My goal is to create a simple grammar for Java properties files. Here is what I have so far:
lexer grammar PropertiesLexer;
LineComment
: ( LineCommentHash
| LineCommentExcl
)
-> skip
;
fragment LineCommentHash
: '#' ~[\r\n]*
;
fragment LineCommentExcl
: '!' ~[\r\n]*
;
fragment WrappedLine
: '\\'
( '\r' '\n'?
| '\n'
)
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> skip
;
Key
: KeyLetterStart
( KeyLetter
| Escaped
)*
;
fragment KeyLetterStart
: ~[ \t\r\n:=]
;
fragment KeyLetter
: ~[\t\r\n:=]
;
fragment Escaped
: '\\' .?
;
Equal
: ( '\\'? ':'
| '\\'? '='
)
;
Value
: ValueLetterBegin
( ValueLetter
| Escaped
| WrappedLine
)*
;
fragment ValueLetterBegin
: ~[ \t\r\n]
;
fragment ValueLetter
: ~ [\r\n]+
;
Whitespace
: [ \t]+
-> skip
;
My test file is this one:
# comment 1
# comment 2
#
.key1= value1
key2\:sub=value2
key3 \= value3
key4=value41\
value42
# comment3
#comment4
key=value
When I run grun, I get the following output:
[#0,30:42='.key1= value1',<Value>,4:0]
[#1,45:60='key2\:sub=value2',<Value>,5:0]
[#2,63:76='key3 \= value3',<Value>,6:0]
[#3,81:102='key4=value41\\r\nvalue42',<Value>,8:0]
[#4,130:138='key=value',<Value>,13:0]
[#5,141:140='<EOF>',<EOF>,14:0]
I don't understand why the Value definition is matched. When I comment out the Value definition, however, the lexer recognizes the Key and Equal definitions:
[#0,30:34='.key1',<Key>,4:0]
[#1,35:35='=',<Equal>,4:5]
[#2,37:42='value1',<Key>,4:7]
[#3,45:49='key2\',<Key>,5:0]
[#4,50:50=':',<Equal>,5:5]
[#5,51:53='sub',<Key>,5:6]
[#6,54:54='=',<Equal>,5:9]
[#7,55:60='value2',<Key>,5:10]
[#8,63:68='key3 \',<Key>,6:0]
[#9,69:69='=',<Equal>,6:6]
[#10,71:76='value3',<Key>,6:8]
[#11,81:84='key4',<Key>,8:0]
[#12,85:85='=',<Equal>,8:4]
[#13,86:93='value41\',<Key>,8:5]
[#14,96:102='value42',<Key>,9:0]
[#15,130:132='key',<Key>,13:0]
[#16,133:133='=',<Equal>,13:3]
[#17,134:138='value',<Key>,13:4]
[#18,141:140='<EOF>',<EOF>,14:0]
But how can I get it to recognize the Key, Equal and Value definitions?
ANTLR's lexer rules match as many characters as possible. That is why you're seeing all these Value tokens being created: on each of those lines, Value matches the most characters.
Lexical modes seem like a good fit to use here. Something like this:
lexer grammar PropertiesLexer;
COMMENT
: [!#] ~[\r\n]* -> skip
;
KEY
: ( '\\' ~[\r\n] | ~[\r\n\\=:] )+
;
EQUAL
: [=:] -> pushMode(VALUE_MODE)
;
NL
: [\r\n]+ -> skip
;
mode VALUE_MODE;
VALUE
: ( ~[\\\r\n] | '\\' . )+
;
END_VALUE
: [\r\n]+ -> skip, popMode
;
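To consume these tokens, a parser grammar along the following lines could sit on top of the lexer. This is a minimal sketch and not part of the original answer: the grammar name PropertiesParser and the rule names file and entry are made up for illustration, and it assumes the lexer above is saved as PropertiesLexer.g4.

```antlr
parser grammar PropertiesParser;

// Assumption: the lexer grammar above is compiled as PropertiesLexer.
options { tokenVocab = PropertiesLexer; }

// Hypothetical rule names. A file is a sequence of key/value entries.
file
 : entry* EOF
 ;

// VALUE is optional so that a line such as "key=" still parses.
entry
 : KEY EQUAL VALUE?
 ;
```

Lexical modes are only available in lexer grammars, not in combined grammars, which is why the parser lives in a separate grammar here. Tokens defined inside VALUE_MODE are referenced like any other token, because modes only affect the lexer.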
I have tried to write a grammar to recognize expressions like:
(A + MAX(B) ) / ( C - AVERAGE(A) )
IF( A > AVERAGE(A), 0, 1 )
X / (MAX(X)
Unfortunately, ANTLR 3 fails with these errors:
error(210): The following sets of rules are mutually left-recursive [unaryExpression, additiveExpression, primaryExpression, formula, multiplicativeExpression]
error(211): DerivedKeywords.g:110:13: [fatal] rule booleanTerm has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
error(206): DerivedKeywords.g:110:13: Alternative 1: after matching input such as decision cannot predict what comes next due to recursion overflow to additiveExpression from formula
I have spent some hours trying to fix these. It would be great if anyone could at least help me fix the first problem. Thanks.
Code:
grammar DerivedKeywords;
options {
output=AST;
//backtrack=true;
}
WS : ( ' ' | '\t' | '\n' | '\r' )
{ skip(); }
;
//for numbers
DIGIT
: '0'..'9'
;
//for both integer and real number
NUMBER
: (DIGIT)+ ( '.' (DIGIT)+ )?( ('E'|'e')('+'|'-')?(DIGIT)+ )?
;
// Boolean operators
AND : 'AND';
OR : 'OR';
NOT : 'NOT';
EQ : '=';
NEQ : '!=';
GT : '>';
LT : '<';
GTE : '>=';
LTE : '<=';
COMMA : ',';
// Token for Functions
IF : 'IF';
MAX : 'MAX';
MIN : 'MIN';
AVERAGE : 'AVERAGE';
VARIABLE : 'A'..'Z' ('A'..'Z' | '0'..'9')*
;
// OPERATORS
LPAREN : '(' ;
RPAREN : ')' ;
DIV : '/' ;
PLUS : '+' ;
MINUS : '-' ;
STAR : '*' ;
expression : formula;
formula
: functionExpression
| additiveExpression
| LPAREN! a=formula RPAREN! // First Problem
;
additiveExpression
: a=multiplicativeExpression ( (MINUS^ | PLUS^ ) b=multiplicativeExpression )*
;
multiplicativeExpression
: a=unaryExpression ( (STAR^ | DIV^ ) b=unaryExpression )*
;
unaryExpression
: MINUS^ u=unaryExpression
| primaryExpression
;
functionExpression
: f=functionOperator LPAREN e=formula RPAREN
| IF LPAREN b=booleanExpression COMMA p=formula COMMA s=formula RPAREN
;
functionOperator :
MAX | MIN | AVERAGE;
primaryExpression
: NUMBER
// Used for scientific numbers
| DIGIT
| VARIABLE
| formula
;
// Boolean stuff
booleanExpression
: orExpression;
orExpression : a=andExpression (OR^ b=andExpression )*
;
andExpression
: a=notExpression (AND^ b=notExpression )*
;
notExpression
: NOT^ t=booleanTerm
| booleanTerm
;
booleanOperator :
GT | LT | EQ | GTE | LTE | NEQ;
booleanTerm : a=formula op=booleanOperator b=formula
| LPAREN! booleanTerm RPAREN! // Second problem
;
error(210): The following sets of rules are mutually left-recursive [unaryExpression, additiveExpression, primaryExpression, formula, multiplicativeExpression]
This means that once the parser enters the unaryExpression rule, it can reach additiveExpression, primaryExpression, formula, multiplicativeExpression and unaryExpression again without consuming a single token from the input. It cannot decide whether to apply those rules, because after applying them the input is still the same.
You're probably trying to allow parenthesized subexpressions through this chain of rules. If so, make sure that path consumes the left parenthesis of the subexpression: the formula alternative in primaryExpression should probably be changed to LPAREN formula RPAREN, with the rest of the grammar adjusted accordingly.
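One way the adjusted rules might look is sketched below, in ANTLR 3 syntax. This is an illustration under assumptions and has not been run through ANTLR: moving functionExpression into primaryExpression is my own choice so that inputs like MAX(B) + 1 stay possible, and the sketch only targets error(210). The booleanTerm decision, where formula and booleanTerm can both start with LPAREN, is likely to still need separate attention.

```antlr
// Sketch only: adjusted rules for the DerivedKeywords grammar (ANTLR 3 syntax).
// Every path that leads back into formula now consumes a token first.
formula
    : additiveExpression
    ;

primaryExpression
    : NUMBER
    | DIGIT
    | VARIABLE
    | functionExpression          // starts with MAX, MIN, AVERAGE or IF
    | LPAREN! formula RPAREN!     // consumes '(' before recursing into formula
    ;
```

With these two rules, the chain formula, additiveExpression, multiplicativeExpression, unaryExpression, primaryExpression can only return to formula after matching LPAREN or a function keyword, so it is no longer left-recursive.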
I'm trying to learn a bit of ANTLR4 and define a grammar for a 4GL language.
This is what I've got:
compileUnit
:
typedeclaration EOF
;
typedeclaration
:
ID LPAREN DATATYPE INT RPAREN
;
DATATYPE
:
DATATYPE_ALPHANUMERIC
| DATATYPE_NUMERIC
;
DATATYPE_ALPHANUMERIC
:
'A'
;
DATATYPE_NUMERIC
:
'N'
;
fragment
DIGIT
:
[0-9]
;
fragment
LETTER
:
[a-zA-Z]
;
INT
:
DIGIT+
;
ID
:
LETTER
(
LETTER
| DIGIT
)*
;
LPAREN
:
'('
;
RPAREN
:
')'
;
WS
:
[ \t\f]+ -> skip
;
What I want to be able to parse:
TEST (A10)
What I get:
typedeclaration:1:6: mismatched input 'A10' expecting DATATYPE
I am however able to write:
TEST (A 10)
Why do I need to put whitespace in here? LPAREN DATATYPE works on its own, so no space is needed in between. INT RPAREN works as well.
Why is a space needed between DATATYPE and INT? I'm a bit confused by that one.
I guess it's matching ID because that's the "longest" match, but there must be some way to force the lexer to be lazier here, right?
You should exclude the characters 'A' and 'N' from the first position of ID. As @CoronA noted, ANTLR matches the longest possible token (the ID 'A10' is longer than the DATATYPE_ALPHANUMERIC 'A'). Also read this: Priority rules. Try the following grammar:
grammar expr;
compileUnit
: typedeclaration EOF
;
typedeclaration
: ID LPAREN datatype INT RPAREN
;
datatype
: DATATYPE_ALPHANUMERIC
| DATATYPE_NUMERIC
;
DATATYPE_ALPHANUMERIC
: 'A'
;
DATATYPE_NUMERIC
: 'N'
;
INT
: DIGIT+
;
ID
: [b-mo-zB-MO-Z] (LETTER | DIGIT)*
;
LPAREN
: '('
;
RPAREN
: ')'
;
WS
: [ \t\f]+ -> skip
;
fragment
DIGIT
: [0-9]
;
fragment
LETTER
: [a-zA-Z]
;
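To see why the original grammar needed the space, the two lexer rules at play can be isolated in a small demo. This is a sketch for illustration only: the grammar name LongestMatchDemo is hypothetical, and the comments describe the tokens I would expect from ANTLR4's longest-match and rule-order behaviour rather than captured tool output.

```antlr
lexer grammar LongestMatchDemo; // hypothetical name, for illustration only

DATATYPE_ALPHANUMERIC : 'A' ;
INT                   : [0-9]+ ;
ID                    : [a-zA-Z] [a-zA-Z0-9]* ;
WS                    : [ \t]+ -> skip ;

// Input "A 10": 'A' is matched by both DATATYPE_ALPHANUMERIC and ID with the
//               same length (1), so the rule listed first wins; "10" is INT.
// Input "A10" : ID matches all three characters, DATATYPE_ALPHANUMERIC only
//               one, so the longer match wins and a single ID token results.
```

Restricting the first character of ID, as in the grammar above, removes the longer ID match for A10. The price is that identifiers beginning with those letters can no longer be lexed as a single ID.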
Alternatively, you can use the following grammar, which puts no restriction on id. The data type tokens are recognized before letters because their rules are defined first. It is not a very clean solution either:
grammar expr;
compileUnit
: typedeclaration EOF
;
typedeclaration
: id LPAREN datatype DIGIT+ RPAREN
;
id
: (datatype | LETTER) (datatype | LETTER | DIGIT)*
;
datatype
: DATATYPE_ALPHANUMERIC
| DATATYPE_NUMERIC
;
DATATYPE_ALPHANUMERIC: 'A';
DATATYPE_NUMERIC: 'N';
// List with another Data types.
LETTER: [a-zA-Z];
LPAREN
: '('
;
RPAREN
: ')'
;
WS
: [ \t\f]+ -> skip
;
DIGIT
: [0-9]
;
I defined the following grammar, following Scott Stanchfield's tutorial.
grammar SampleScript;
program
:
declaration+
;
declaration
: macrodeclaration
;
macrodeclaration
:
MACRO STRING (LEFTPAREN parameters RIGHTPAREN)?
statement*
ENDMACRO
;
statement
: assignmentStatement
| ifStatement
| iterationStatement
| jumpStatement
| procedureCallStatement
| dimStatement
| labeledStatement
;
actualParameters
: expression (',' expression?)*
;
parameters
: ID (',' ID)*
;
assignmentStatement
: ID ASSIGN expression
| ID MATRIXASSIGN expression
;
ifStatement
: IF expression THEN (statement|compoundStatement)
(ELSE expression (statement|compoundStatement))?
;
iterationStatement
: WHILE expression compoundStatement
| FOR ID '=' expression TO expression (STEP expression)? compoundStatement
;
jumpStatement
: BREAK
| CONTINUE
| GOTO ID
| RETURN LEFTPAREN expression RIGHTPAREN
;
procedureCallStatement //todo: expression statement
: ID LEFTPAREN actualParameters? RIGHTPAREN
;
dimStatement
: DIM ID LEFTBRACKET expression(',' expression)* RIGHTBRACKET (',' ID LEFTBRACKET expression(',' expression)* RIGHTBRACKET)*
;
labeledStatement
: ID ':' statement
;
compoundStatement
: DO statement* END
;
term
: NUMBER
| STRING
| ID
| LEFTPAREN expression RIGHTPAREN //( )
| ID LEFTPAREN actualParameters RIGHTPAREN //Procedure Call
| ID (LEFTBRACKET expression RIGHTBRACKET)+ //Array Arr[3]
| ID ('.' expression)+ //Array Arr.Length
| LEFTBRACE (expression)? (',' expression)* RIGHTBRACE //{"OK","False"}
;
negation
: 'not'* term
;
unary
: ('-')* negation
;
mult
: unary (('*' | '/') unary)*
;
add
: mult (('+' | '-') mult)*
;
relation
: add (('=' | '/=' | '<' | '<=' | '>=' | '>') add)*
;
expression
: relation (('and' | 'or') relation)*
;
//Keywords
DIM: D I M;
RETURN: R E T U R N;
FOR: F O R;
STEP: S T E P;
TO: T O;
WHILE: W H I L E;
DO: D O;
END: E N D;
GOTO: G O T O;
BREAK: B R E A K;
CONTINUE: C O N T I N U E;
IF: I F;
THEN: T H E N;
ELSE: E L S E;
MACRO :M A C R O;
ENDMACRO :E N D M A C R O;
ID : ('_'|LETTER) ('_'|LETTER|DIGIT)*;
ASSIGN: '=';
MATRIXASSIGN: ':=';
LEFTPAREN : '(';
RIGHTPAREN : ')';
LEFTBRACKET : '[';
RIGHTBRACKET : ']';
LEFTBRACE : '{';
RIGHTBRACE : '}';
//STRING : '"' .*? '"' ; // match anything in "..."
STRING
: '"' (STRING_ESCAPE_SEQ|~('\n'|'\r'))*? '"'
| '\'' (STRING_ESCAPE_SEQ|~('\n'|'\r'))*? '\''
;
/// stringescapeseq ::= "\" <any source character>
fragment STRING_ESCAPE_SEQ //'\\"'
: '\\' .
;
UNSIGNED_INT : DIGIT+; //('0' | '1'..'9' '0'..'9'*);
UNSIGNED_FLOAT: DIGIT+ '.' DIGIT* Exponent?
| '.' DIGIT+ Exponent?
| DIGIT+ Exponent
;
NUMBER
: UNSIGNED_INT
| UNSIGNED_FLOAT
;
fragment DIGIT : [0-9] ; // not a token by itself
fragment Exponent : ('e'|'E') ('+'|'-')? (DIGIT)+ ;
LINE_COMMENT : '//' .*? '\r'? '\n' -> skip ; // Match "//" stuff '\n'
COMMENT : '/*' .*? '*/' -> skip ; // Match "/*" stuff "*/"
fragment A:('a'|'A');
fragment B:('b'|'B');
fragment C:('c'|'C');
fragment D:('d'|'D');
fragment E:('e'|'E');
fragment F:('f'|'F');
fragment G:('g'|'G');
fragment H:('h'|'H');
fragment I:('i'|'I');
fragment J:('j'|'J');
fragment K:('k'|'K');
fragment L:('l'|'L');
fragment M:('m'|'M');
fragment N:('n'|'N');
fragment O:('o'|'O');
fragment P:('p'|'P');
fragment Q:('q'|'Q');
fragment R:('r'|'R');
fragment S:('s'|'S');
fragment T:('t'|'T');
fragment U:('u'|'U');
fragment V:('v'|'V');
fragment W:('w'|'W');
fragment X:('x'|'X');
fragment Y:('y'|'Y');
fragment Z:('z'|'Z');
fragment LETTER : [A-Za-z];
WS : [ \t\n\r]+ -> skip ; // skip spaces, tabs, newlines
I am trying to parse the following code
Macro 'test' (x)
a=1
b=2
c={}
d = x(3,4)
matrixinfo_skim = GetMatrixInfo(m_skim)
showmessage (i2s(a))
showarray(c)
endmacro
and I get the errors below. I have spent over two days on this and can't figure out why it fails to parse the assignment statements, starting with a=1. Can someone please help me?
[#0,0:4='Macro',<30>,1:0]
[#1,6:11=''test'',<41>,1:6]
[#2,13:13='(',<35>,1:13]
[#3,14:14='x',<32>,1:14]
[#4,15:15=')',<36>,1:15]
[#5,20:20='a',<32>,3:0]
[#6,21:21='=',<33>,3:1]
[#7,22:22='1',<42>,3:2]
[#8,25:25='b',<32>,4:0]
[#9,26:26='=',<33>,4:1]
[#10,27:27='2',<42>,4:2]
[#11,30:30='c',<32>,5:0]
[#12,31:31='=',<33>,5:1]
[#13,32:32='{',<39>,5:2]
[#14,33:33='}',<40>,5:3]
[#15,36:36='d',<32>,6:0]
[#16,38:38='=',<33>,6:2]
[#17,40:40='x',<32>,6:4]
[#18,41:41='(',<35>,6:5]
[#19,42:42='3',<42>,6:6]
[#20,43:43=',',<2>,6:7]
[#21,44:44='4',<42>,6:8]
[#22,45:45=')',<36>,6:9]
[#23,48:62='matrixinfo_skim',<32>,7:0]
[#24,64:64='=',<33>,7:16]
[#25,66:78='GetMatrixInfo',<32>,7:18]
[#26,79:79='(',<35>,7:31]
[#27,80:85='m_skim',<32>,7:32]
[#28,86:86=')',<36>,7:38]
[#29,91:101='showmessage',<32>,9:0]
[#30,103:103='(',<35>,9:12]
[#31,104:106='i2s',<32>,9:13]
[#32,107:107='(',<35>,9:16]
[#33,108:108='a',<32>,9:17]
[#34,109:109=')',<36>,9:18]
[#35,110:110=')',<36>,9:19]
[#36,113:121='showarray',<32>,10:0]
[#37,122:122='(',<35>,10:9]
[#38,123:123='c',<32>,10:10]
[#39,124:124=')',<36>,10:11]
[#40,127:134='endmacro',<31>,11:0]
[#41,140:139='<EOF>',<-1>,13:0]
line 3:2 extraneous input '1' expecting {'-', 'not', ID, '(', '{', STRING, NUMBER}
line 4:2 extraneous input '2' expecting {'-', 'not', ID, '(', '{', STRING, NUMBER}
line 6:6 mismatched input '3' expecting {'-', 'not', ID, '(', '{', STRING, NUMBER}
line 6:8 extraneous input '4' expecting {',', ')'}
(program (declaration (macrodeclaration Macro 'test' ( (parameters x) ) (statement (assignmentStatement a = (expression (relation (add (mult (unary 1 (negation (term b))))) = (add (mult (unary 2 (negation (term c))))) = (add (mult (unary (negation (term { }))))))))) (statement (assignmentStatement d = (expression (relation (add (mult (unary (negation (term x ( (actualParameters (expression (relation (add (mult (unary 3))))) , 4) )))))))))) (statement (assignmentStatement matrixinfo_skim = (expression (relation (add (mult (unary (negation (term GetMatrixInfo ( (actualParameters (expression (relation (add (mult (unary (negation (term m_skim)))))))) )))))))))) (statement (procedureCallStatement showmessage ( (actualParameters (expression (relation (add (mult (unary (negation (term i2s ( (actualParameters (expression (relation (add (mult (unary (negation (term a)))))))) ))))))))) ))) (statement (procedureCallStatement showarray ( (actualParameters (expression (relation (add (mult (unary (negation (term c)))))))) ))) endmacro)))
As the error messages indicate, things go wrong with the numbers, which should be matched by the expression in the assignmentStatement rule and ultimately as a NUMBER in the term rule.
Looking at the lexer rules responsible for the creation of a NUMBER token:
UNSIGNED_INT : DIGIT+;
UNSIGNED_FLOAT: DIGIT+ '.' DIGIT* Exponent?
| '.' DIGIT+ Exponent?
| DIGIT+ Exponent
;
NUMBER
: UNSIGNED_INT
| UNSIGNED_FLOAT
;
it appears that a NUMBER token is never created. NUMBER matches exactly the same input as UNSIGNED_INT or UNSIGNED_FLOAT, and when several lexer rules match the same number of characters, the rule defined first wins. Since these two rules are defined before NUMBER, the lexer creates UNSIGNED_INT and UNSIGNED_FLOAT tokens instead of NUMBER tokens.
You need to change UNSIGNED_INT and UNSIGNED_FLOAT into fragment rules instead:
fragment UNSIGNED_INT : DIGIT+;
fragment UNSIGNED_FLOAT: DIGIT+ '.' DIGIT* Exponent?
| '.' DIGIT+ Exponent?
| DIGIT+ Exponent
;
NUMBER
: UNSIGNED_INT
| UNSIGNED_FLOAT
;
Be sure to understand what a fragment is: What does "fragment" mean in ANTLR?
I have a problem parsing integer & hex numbers. I want to parse C++ enums with the following rules:
grammar enum;
rule_enum
: 'enum' ID '{' enum_values+ '}'';';
enum_values
: enum_value (COMMA enum_value)+;
enum_value
: ID ('=' number)?;
number : hex_number | integer_number;
hex_number
: '0' 'x' HEX_DIGIT+;
integer_number
: DIGIT+;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
DIGIT : ('0'..'9');
COMMA : ',';
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
The problem I have is the following - when parsing code like:
enum Enum
{
Option1 = 0,
Option2 = 1
};
it does not recognize the 0 as integer_number but tries to parse it as hex_number. How can I resolve this?
Thank you.
Tobias
First, fragment rules can only be "seen" by lexer rules, not parser rules. So, the following is invalid:
integer_number
: DIGIT+; // can't use DIGIT here!
fragment
DIGIT : ('0'..'9');
To fix the ambiguity with these numbers, it's IMO best to make the integer and hex numbers lexer rules instead of parser rules.
An example:
grammar enum;
rule_enum
: 'enum' ID '{' enum_values+ '}'';';
enum_values
: enum_value (COMMA enum_value)+;
enum_value
: ID ('=' number)?;
number
: HEX_NUMBER
| INTEGER_NUMBER
;
HEX_NUMBER
: '0' 'x' HEX_DIGIT+;
INTEGER_NUMBER
: DIGIT+;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
DIGIT : ('0'..'9');
COMMA : ',';
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
SPACE : (' ' | '\t' | '\r' | '\n') {skip();};
which parses your example snippet (the parse tree image from the original post is not reproduced here).
The following ANTLR grammar works for just the number part of the enum.
(edited to include Bart's advice below)
grammar enum;
number :
integer_number | hex_number ;
hex_number
: HEX_NUMBER;
integer_number
: INT_NUMBER;
HEX_NUMBER
: HEX_INTRO HEX_DIGIT+;
INT_NUMBER
: DIGIT+;
HEX_INTRO
: '0x';
DIGIT : ('0'..'9');
HEX_DIGIT
: ('0'..'9'|'a'..'f'|'A'..'F') ;
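In the grammar above, HEX_INTRO, DIGIT and HEX_DIGIT are ordinary lexer rules, so they can produce tokens of their own (a lone a in the input would become a HEX_DIGIT token, for example). Since the parser rules never reference them, they could be marked as fragments instead. The following is a sketch of that variant, offered as a suggestion and not as the answerer's own code:

```antlr
// Sketch: the lexer rules above, with the helper rules marked as fragments
// so that only HEX_NUMBER and INT_NUMBER are ever emitted as tokens.
HEX_NUMBER
    : HEX_INTRO HEX_DIGIT+
    ;

INT_NUMBER
    : DIGIT+
    ;

fragment HEX_INTRO : '0x' ;
fragment DIGIT     : '0'..'9' ;
fragment HEX_DIGIT : '0'..'9' | 'a'..'f' | 'A'..'F' ;
```

The parser rules number, hex_number and integer_number can stay as they are, because they only refer to HEX_NUMBER and INT_NUMBER.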