I'm at a very beginning of learning ANTLR4 lexer rules. My goal is to create a simple grammar for Java properties files. Here is what I have so far:
lexer grammar PropertiesLexer;
LineComment
: ( LineCommentHash
| LineCommentExcl
)
-> skip
;
fragment LineCommentHash
: '#' ~[\r\n]*
;
fragment LineCommentExcl
: '!' ~[\r\n]*
;
fragment WrappedLine
: '\\'
( '\r' '\n'?
| '\n'
)
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> skip
;
Key
: KeyLetterStart
( KeyLetter
| Escaped
)*
;
fragment KeyLetterStart
: ~[ \t\r\n:=]
;
fragment KeyLetter
: ~[\t\r\n:=]
;
fragment Escaped
: '\\' .?
;
Equal
: ( '\\'? ':'
| '\\'? '='
)
;
Value
: ValueLetterBegin
( ValueLetter
| Escaped
| WrappedLine
)*
;
fragment ValueLetterBegin
: ~[ \t\r\n]
;
fragment ValueLetter
: ~ [\r\n]+
;
Whitespace
: [ \t]+
-> skip
;
My test file is this one:
# comment 1
# comment 2
#
.key1= value1
key2\:sub=value2
key3 \= value3
key4=value41\
value42
# comment3
#comment4
key=value
When I run grun, I'm getting following output:
[#0,30:42='.key1= value1',<Value>,4:0]
[#1,45:60='key2\:sub=value2',<Value>,5:0]
[#2,63:76='key3 \= value3',<Value>,6:0]
[#3,81:102='key4=value41\\r\nvalue42',<Value>,8:0]
[#4,130:138='key=value',<Value>,13:0]
[#5,141:140='<EOF>',<EOF>,14:0]
I don't understand why the Value definition is matched. When commenting out the Value definition, however, it recognizes the Key and Equal definitions:
[#0,30:34='.key1',<Key>,4:0]
[#1,35:35='=',<Equal>,4:5]
[#2,37:42='value1',<Key>,4:7]
[#3,45:49='key2\',<Key>,5:0]
[#4,50:50=':',<Equal>,5:5]
[#5,51:53='sub',<Key>,5:6]
[#6,54:54='=',<Equal>,5:9]
[#7,55:60='value2',<Key>,5:10]
[#8,63:68='key3 \',<Key>,6:0]
[#9,69:69='=',<Equal>,6:6]
[#10,71:76='value3',<Key>,6:8]
[#11,81:84='key4',<Key>,8:0]
[#12,85:85='=',<Equal>,8:4]
[#13,86:93='value41\',<Key>,8:5]
[#14,96:102='value42',<Key>,9:0]
[#15,130:132='key',<Key>,13:0]
[#16,133:133='=',<Equal>,13:3]
[#17,134:138='value',<Key>,13:4]
[#18,141:140='<EOF>',<EOF>,14:0]
but how to let it recognize the Key, Equal and Value definitons?
ANTLR's lexer rules match as much characters as possible, that is why you're seeing all these Value tokens being created (they match the most characters).
Lexical modes seem like a good fit to use here. Something like this:
lexer grammar PropertiesLexer;
COMMENT
: [!#] ~[\r\n]* -> skip
;
KEY
: ( '\\' ~[\r\n] | ~[\r\n\\=:] )+
;
EQUAL
: [=:] -> pushMode(VALUE_MODE)
;
NL
: [\r\n]+ -> skip
;
mode VALUE_MODE;
VALUE
: ( ~[\\\r\n] | '\\' . )+
;
END_VALUE
: [\r\n]+ -> skip, popMode
;
I'm trying to learn a bit ANTLR4 and define a grammar for some 4GL language.
This is what I've got:
compileUnit
:
typedeclaration EOF
;
typedeclaration
:
ID LPAREN DATATYPE INT RPAREN
;
DATATYPE
:
DATATYPE_ALPHANUMERIC
| DATATYPE_NUMERIC
;
DATATYPE_ALPHANUMERIC
:
'A'
;
DATATYPE_NUMERIC
:
'N'
;
fragment
DIGIT
:
[0-9]
;
fragment
LETTER
:
[a-zA-Z]
;
INT
:
DIGIT+
;
ID
:
LETTER
(
LETTER
| DIGIT
)*
;
LPAREN
:
'('
;
RPAREN
:
')'
;
WS
:
[ \t\f]+ -> skip
;
What I want to be able to parse:
TEST (A10)
what I get:
typedeclaration:1:6: mismatched input 'A10' expecting DATATYPE
I am however able to write:
TEST (A 10)
Why do I need to put a whitespace in here? The LPAREN DATATYPE in itself is working, so there is no need for a space inbetween. Also the INT RPAREN is working.
Why is a space needed between DATATYPE and INT? I'm a bit confused on that one.
I guess that it's matching ID because it's the "longest" match, but there must be some way to force to be lazier here, right?
You should ignore 'A' and 'N' chats at first position of ID. As #CoronA noticed ANTLR matches token as long as possible (length of ID 'A10' more than length of DATATYPE_ALPHANUMERIC 'A'). Also read this: Priority rules. Try to use the following grammar:
grammar expr;
compileUnit
: typedeclaration EOF
;
typedeclaration
: ID LPAREN datatype INT RPAREN
;
datatype
: DATATYPE_ALPHANUMERIC
| DATATYPE_NUMERIC
;
DATATYPE_ALPHANUMERIC
: 'A'
;
DATATYPE_NUMERIC
: 'N'
;
INT
: DIGIT+
;
ID
: [b-mo-zB-MO-Z] (LETTER | DIGIT)*
;
LPAREN
: '('
;
RPAREN
: ')'
;
WS
: [ \t\f]+ -> skip
;
fragment
DIGIT
: [0-9]
;
fragment
LETTER
: [a-zA-Z]
;
Also you can use the following grammar without id restriction. Data types will be recognized earlier than letters. it's not clear too:
grammar expr;
compileUnit
: typedeclaration EOF
;
typedeclaration
: id LPAREN datatype DIGIT+ RPAREN
;
id
: (datatype | LETTER) (datatype | LETTER | DIGIT)*
;
datatype
: DATATYPE_ALPHANUMERIC
| DATATYPE_NUMERIC
;
DATATYPE_ALPHANUMERIC: 'A';
DATATYPE_NUMERIC: 'N';
// List with another Data types.
LETTER: [a-zA-Z];
LPAREN
: '('
;
RPAREN
: ')'
;
WS
: [ \t\f]+ -> skip
;
DIGIT
: [0-9]
;
As a homework I should create a parser for VC language V10.1 using ANTLR.
Here is my work:
grammar MyVCgrm2;
program : ( funcDecl | varDecl )*
;
funcDecl :
type identifier paraList compoundStmt
;
varDecl
: type initDeclaratorList ';'
;
initDeclaratorList
: initDeclarator ( ',' initDeclarator )*
;
initDeclarator
: declarator ( '=' initialiser )?
;
declarator
: identifier identifier2
;
identifier2
: |'[' INTLITERAL? ']'
;
initialiser
: expr
| '{' expr ( ',' expr ) * '}'
;
// primitive types
type
: 'void' | 'boolean' | 'int' | 'float'
;
// identifiers
identifier
: ID
;
// statements
compoundStmt
: '{' varDecl* stmt* '}'
;
stmt
:
compoundStmt
| ifStmt
| forStmt
| whileStmt
| breakStmt
| continueStmt
| returnStmt
| exprStmt
;
ifStmt
: 'if' '(' expr ')' stmt ( 'else' stmt )?
;
forStmt
: 'for' '(' expr? ';' expr? ';' expr? ')' stmt
;
whileStmt
: 'while' '(' expr ')' stmt
;
breakStmt
: 'break' ';'
;
continueStmt
: 'continue' ';'
;
returnStmt
: 'return' expr? ';'
;
exprStmt
: expr? ';'
;
// expressions
expr
: assignmentExpr
;
assignmentExpr
: ( condOrExpr '=' )* condOrExpr
;
condOrExpr
: condAndExpr condOrExpr2
;
condOrExpr2
: '||' condAndExpr condOrExpr
|
;
condAndExpr
: equalityExpr condAndExpr2
;
condAndExpr2
: '&&' equalityExpr condAndExpr2|
;
equalityExpr
: relExpr equalityExpr2
;
equalityExpr2
: '==' relExpr equalityExpr2 | '!=' relExpr equalityExpr2|
;
relExpr
: additiveExpr relExpr2
;
relExpr2
: '<' additiveExpr relExpr | '<=' additiveExpr relExpr | '>' additiveExpr relExpr| '>=' additiveExpr relExpr|
;
additiveExpr
:
multiplicativeExpr additiveExpr2
;
additiveExpr2
:
'+' multiplicativeExpr additiveExpr2
| '-' multiplicativeExpr additiveExpr2
|
;
multiplicativeExpr
:
unaryExpr multiplicativeExpr2
;
multiplicativeExpr2
:
'*' unaryExpr multiplicativeExpr2
| '/' unaryExpr multiplicativeExpr2
|
;
unaryExpr
: '+' unaryExpr
| '-' unaryExpr
| '!' unaryExpr
| primaryExpr
;
primaryExpr
: identifier primaryExpr2
| '(' expr ')'
| INTLITERAL
| FLOATLITERAL
| BOOLLITERAL
| STRINGLITERAL
;
primaryExpr2
: argList?
| '[' expr ']'
;
// parameters
paraList
: '(' properParaList? ')'
;
properParaList
: paraDecl ( ',' paraDecl ) *
;
paraDecl
: type declarator
;
argList
:'(' properArgList? ')'
;
properArgList
: arg ( ',' arg ) *
;
arg
: expr
;
BOOLLITERAL
: 'true'|'false'
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'?')*
;
INTLITERAL : '0'..'9'+
;
FLOATLITERAL
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
STRINGLITERAL
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
CHAR: '\'' ( ESC_SEQ | ~('\''|'\\') ) '\''
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
But when by trying generate code I got these errors in console:
[13:48:38] Checking Grammar MyVCgrm2.g...
[13:48:38] warning(200): MyVCgrm2.g:53:27:
Decision can match input such as "'else'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
[13:48:38] error(211): MyVCgrm2.g:78:21: [fatal] rule assignmentExpr has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
[13:48:38] error(211): MyVCgrm2.g:111:2: [fatal] rule additiveExpr2 has non-LL(*) decision due to recursive rule invocations reachable from alts 1,3. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
[13:48:38] warning(200): MyVCgrm2.g:111:2:
Decision can match input such as "'+' ID" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input
[13:48:38] error(211): MyVCgrm2.g:142:5: [fatal] rule primaryExpr2 has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
Warnings and errors are about these productions (in order): ifStmt, assignmentExpr, additiveExpr2, primaryExpr2.
I've done all left factorings and left recursion eliminations. I don't understand what ANTLR meant by "Resolve by left-factoring".
Why do I get these errors and how do I resolve them?
Thank you
I have an AlgebraRelacional.g4 file with this. I need to read a file with a syntax like a CSV file, put the content in some memory tables and then resolve relational algebra operations with that. Can you tell me if I am doing it right?
Example data file to read:
cod_buy(char);name_suc(char);Import(int);date_buy(date)
“P-11”;”DC Med”;900;01/03/14
“P-14”;”Center”;1500;02/05/14
Current ANTLR grammar:
grammar AlgebraRelacional;
SEL : '\u03C3'
;
PRO : '\u220F'
;
UNI : '\u222A'
;
DIF : '\u002D'
;
PROC : '\u0058'
;
INT : '\u2229'
;
AND : 'AND'
;
OR : 'OR'
;
NOT : 'NOT'
;
EQ : '='
;
DIFERENTE : '!='
;
MAYOR : '>'
;
MENOR : '<'
;
SUMA : '+'
;
MULTI : '*'
;
IPAREN : '('
;
DPAREN : ')'
;
COMA : ','
;
PCOMA : ';'
;
Comillas: '"'
;
file : hdr row+ ;
hdr : row ;
row : field (',' field)* '\r'? '\n' ;
field : TEXT | STRING | ;
TEXT : ~[,\n\r"]+ ;
STRING : '"' ('""'|~'"')* '"' ;
I suggest you that read this document (http://is.muni.cz/th/208197/fi_b/bc_thesis.pdf), It contains usefull information about how to write a parser for relational algebra. That is not ANTLR, but you only has to translate the grammar in BNF to EBNF.
I have a problem parsing integer & hex numbers. I want to parse C++ enums with the following rules:
grammar enum;
rule_enum
: 'enum' ID '{' enum_values+ '}'';';
enum_values
: enum_value (COMMA enum_value)+;
enum_value
: ID ('=' number)?;
number : hex_number | integer_number;
hex_number
: '0' 'x' HEX_DIGIT+;
integer_number
: DIGIT+;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
DIGIT : ('0'..'9');
COMMA : ',';
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
The problem I have is the following - when parsing code like:
enum Enum
{
Option1 = 0,
Option2 = 1
};
it does not recognize the 0 as integer_number but tries to parse it as hex_number. How can I resolve this?
Thank you.
Tobias
First, fragment rules can only be "seen" by lexer rules, not parser rules. So, the following is invalid:
integer_number
: DIGIT+; // can't use DIGIT here!
fragment
DIGIT : ('0'..'9');
To fix your ambiguity with these numbers, it's IMO best to make these integer- and hex numbers lexer rules instead of parser rules.
An example:
grammar enum;
rule_enum
: 'enum' ID '{' enum_values+ '}'';';
enum_values
: enum_value (COMMA enum_value)+;
enum_value
: ID ('=' number)?;
number
: HEX_NUMBER
| INTEGER_NUMBER
;
HEX_NUMBER
: '0' 'x' HEX_DIGIT+;
INTEGER_NUMBER
: DIGIT+;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
DIGIT : ('0'..'9');
COMMA : ',';
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
SPACE : (' ' | '\t' | '\r' | '\n') {skip();};
which produces the following parse tree of your example snippet:
The following ANTLR works for just the number bit of the enum.
(editted to include Bart's advice below)
grammar enum;
number :
integer_number | hex_number ;
hex_number
: HEX_NUMBER;
integer_number
: INT_NUMBER;
HEX_NUMBER
: HEX_INTRO HEX_DIGIT+;
INT_NUMBER
: DIGIT+;
HEX_INTRO
: '0x';
DIGIT : ('0'..'9');
HEX_DIGIT
: ('0'..'9'|'a'..'f'|'A'..'F') ;