I am new to the world of ANTLR, so I am hoping that someone could do a code review and see if what I am attempting todo is being done in the most efficient means possible. What I am trying todo is to parse what essentially is a (multi)-conditional statement which will be turned into Lambda statements:
Example:
Student Object
{
string ID { get; set;}
string Campus { get; set;}
string FirstName { get; set;}
}
Evaluation Statements:
StudentID="XXX"
StudentID="XXX" || FirstName=="John"
(StudentID="XXX" || FirstName=="John")
(StudentID="XXX" || FirstName=="John") || (StudentID="YYY" || FirstName=="DOE")
The following is the code that I've created so far. Could someone let me know if I'm going in the right direction:
/*lexer */grammar Test;
options {
output=AST;
}
tokens {
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
expr : multiExpression ((AND^|OR^) multiExpression)* EOF! ;
multiExpression
: primaryExpression ((AND^|OR^) primaryExpression)* |
LPAREN! primaryExpression ((AND^|OR^) primaryExpression)* RPAREN!;
primaryExpression
: ID (EQUALS^ | GTR^ | LTN^ | GTREQ^ | LSTEQ^ | NOT^) value |
LPAREN! ID (EQUALS^ | GTR^ | LTN^ | GTREQ^ | LSTEQ^ | NOT^) value RPAREN! ;
value: STRING
|ID
|INT
|FLOAT
|CHAR;
//expr : term ( ( PLUS | MINUS ) term )* ;
//term : factor ( ( MULT | DIV ) factor )* ;
//factor : NUMBER ;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
EQUALS : '==';
GTR : '>';
LTN : '<';
GTREQ : '>=';
LSTEQ : '<=';
NOT : '!=';
OR : '||';
LPAREN : '(';
RPAREN : ')';
AND : '&&';
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
INT : ('0'..'9')+;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
CHAR: '\'' ( ESC_SEQ | ~('\''|'\\') ) '\''
;
STRING
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
Related
I am new to antlr. I am trying to parse some queries like [network-traffic:src_port = '123] and [network-traffic:src_port =] and [network-traffic:src_port = ] and ... I have a grammar as follows:
grammar STIXPattern;
pattern
: observationExpressions EOF
;
observationExpressions
: <assoc=left> observationExpressions FOLLOWEDBY observationExpressions #observationExpressionsFollowedBY
| observationExpressionOr #observationExpressionOr_
;
observationExpressionOr
: <assoc=left> observationExpressionOr OR observationExpressionOr #observationExpressionOred
| observationExpressionAnd #observationExpressionAnd_
;
observationExpressionAnd
: <assoc=left> observationExpressionAnd AND observationExpressionAnd #observationExpressionAnded
| observationExpression #observationExpression_
;
observationExpression
: LBRACK comparisonExpression RBRACK # observationExpressionSimple
| LPAREN observationExpressions RPAREN # observationExpressionCompound
| observationExpression startStopQualifier # observationExpressionStartStop
| observationExpression withinQualifier # observationExpressionWithin
| observationExpression repeatedQualifier # observationExpressionRepeated
;
comparisonExpression
: <assoc=left> comparisonExpression OR comparisonExpression #comparisonExpressionOred
| comparisonExpressionAnd #comparisonExpressionAnd_
;
comparisonExpressionAnd
: <assoc=left> comparisonExpressionAnd AND comparisonExpressionAnd #comparisonExpressionAnded
| propTest #comparisonExpressionAndpropTest
;
propTest
: objectPath NOT? (EQ|NEQ) primitiveLiteral # propTestEqual
| objectPath NOT? (GT|LT|GE|LE) orderableLiteral # propTestOrder
| objectPath NOT? IN setLiteral # propTestSet
| objectPath NOT? LIKE StringLiteral # propTestLike
| objectPath NOT? MATCHES StringLiteral # propTestRegex
| objectPath NOT? ISSUBSET StringLiteral # propTestIsSubset
| objectPath NOT? ISSUPERSET StringLiteral # propTestIsSuperset
| LPAREN comparisonExpression RPAREN # propTestParen
| objectPath NOT? (EQ|NEQ) objectPathThl # propTestThlEqual
;
startStopQualifier
: START TimestampLiteral STOP TimestampLiteral
;
withinQualifier
: WITHIN (IntPosLiteral|FloatPosLiteral) SECONDS
;
repeatedQualifier
: REPEATS IntPosLiteral TIMES
;
objectPath
: objectType COLON firstPathComponent objectPathComponent?
;
objectPathThl
: varThlType DOT firstPathComponent objectPathComponent?
;
objectType
: IdentifierWithoutHyphen
| IdentifierWithHyphen
;
varThlType
: IdentifierWithoutHyphen
| IdentifierWithHyphen
;
firstPathComponent
: IdentifierWithoutHyphen
| StringLiteral
;
objectPathComponent
: <assoc=left> objectPathComponent objectPathComponent # pathStep
| '.' (IdentifierWithoutHyphen | StringLiteral) # keyPathStep
| LBRACK (IntPosLiteral|IntNegLiteral|ASTERISK) RBRACK # indexPathStep
;
setLiteral
: LPAREN RPAREN
| LPAREN primitiveLiteral (COMMA primitiveLiteral)* RPAREN
;
primitiveLiteral
: orderableLiteral
| BoolLiteral
| edgeCases
;
edgeCases
: QUOTE (IdentifierWithHyphen | IdentifierWithoutHyphen | IntNoSign) RBRACK
| RBRACK
;
orderableLiteral
: IntPosLiteral
| IntNegLiteral
| FloatPosLiteral
| FloatNegLiteral
| StringLiteral
| BinaryLiteral
| HexLiteral
| TimestampLiteral
;
IntNegLiteral :
'-' ('0' | [1-9] [0-9]*)
;
IntNoSign :
('0' | [1-9] [0-9]*)
;
IntPosLiteral :
'+'? ('0' | [1-9] [0-9]*)
;
FloatNegLiteral :
'-' [0-9]* '.' [0-9]+
;
FloatPosLiteral :
'+'? [0-9]* '.' [0-9]+
;
HexLiteral :
'h' QUOTE TwoHexDigits* QUOTE
;
BinaryLiteral :
'b' QUOTE
( Base64Char Base64Char Base64Char Base64Char )*
( (Base64Char Base64Char Base64Char Base64Char )
| (Base64Char Base64Char Base64Char ) '='
| (Base64Char Base64Char ) '=='
)
QUOTE
;
StringLiteral :
QUOTE ( ~['\\] | '\\\'' | '\\\\' )* QUOTE
;
BoolLiteral :
TRUE | FALSE
;
TimestampLiteral :
't' QUOTE
[0-9] [0-9] [0-9] [0-9] HYPHEN
( ('0' [1-9]) | ('1' [012]) ) HYPHEN
( ('0' [1-9]) | ([12] [0-9]) | ('3' [01]) )
'T'
( ([01] [0-9]) | ('2' [0-3]) ) COLON
[0-5] [0-9] COLON
([0-5] [0-9] | '60')
(DOT [0-9]+)?
'Z'
QUOTE
;
//////////////////////////////////////////////
// Keywords
AND: 'AND' ;
OR: 'OR' ;
NOT: 'NOT' ;
FOLLOWEDBY: 'FOLLOWEDBY';
LIKE: 'LIKE' ;
MATCHES: 'MATCHES' ;
ISSUPERSET: 'ISSUPERSET' ;
ISSUBSET: 'ISSUBSET' ;
LAST: 'LAST' ;
IN: 'IN' ;
START: 'START' ;
STOP: 'STOP' ;
SECONDS: 'SECONDS' ;
TRUE: 'true' ;
FALSE: 'false' ;
WITHIN: 'WITHIN' ;
REPEATS: 'REPEATS' ;
TIMES: 'TIMES' ;
// After keywords, so the lexer doesn't tokenize them as identifiers.
// Object types may have unquoted hyphens, but property names
// (in object paths) cannot.
IdentifierWithoutHyphen :
[a-zA-Z_] [a-zA-Z0-9_]*
;
IdentifierWithHyphen :
[a-zA-Z_] [a-zA-Z0-9_-]*
;
EQ : '=' | '==';
NEQ : '!=' | '<>';
LT : '<';
LE : '<=';
GT : '>';
GE : '>=';
QUOTE : '\'';
COLON : ':' ;
DOT : '.' ;
COMMA : ',' ;
RPAREN : ')' ;
LPAREN : '(' ;
RBRACK : ']' ;
LBRACK : '[' ;
PLUS : '+' ;
HYPHEN : MINUS ;
MINUS : '-' ;
POWER_OP : '^' ;
DIVIDE : '/' ;
ASTERISK : '*';
EQRBRAC : ']';
fragment HexDigit: [A-Fa-f0-9];
fragment TwoHexDigits: HexDigit HexDigit;
fragment Base64Char: [A-Za-z0-9+/];
// Whitespace and comments
//
WS : [ \t\r\n\u000B\u000C\u0085\u00a0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]+ -> skip
;
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
// Catch-all to prevent lexer from silently eating unusable characters.
InvalidCharacter
: .
;
Now when I feed [network-traffic:src_port = '123] I expect antlr to parse the query to '123 and ]
However the grammar return '123] and is not able to separate '123 and ]
Am missing anything?
grammar does not separate '123 and ] though the rule is set for it
That is not true. The quote and 123 are separate tokens. As demonstrated/suggested in your previous ANTLR question: start by printing all the tokens to your console to see what tokens are being created. This should always be the first thing you do when trying to debug an ANTLR grammar. It will save you a lot of time and headache.
The fact [network-traffic:src_port = '123] is not parsed properly, is because the ](RBRACK) is being consumed by the alternative observationExpressionSimple:
observationExpression
: LBRACK comparisonExpression RBRACK # observationExpressionSimple
| LPAREN observationExpressions RPAREN # observationExpressionCompound
| observationExpression startStopQualifier # observationExpressionStartStop
| observationExpression withinQualifier # observationExpressionWithin
| observationExpression repeatedQualifier # observationExpressionRepeated
;
Because RBRACK was already consumed by a parser rule, the edgeCases rule can't consume this RBRACK token as well.
To fix this, change your rule:
edgeCases
: QUOTE (IdentifierWithHyphen | IdentifierWithoutHyphen | IntNoSign) RBRACK
| RBRACK
;
into this:
edgeCases
: QUOTE (IdentifierWithHyphen | IdentifierWithoutHyphen | IntNoSign)
;
Now [network-traffic:src_port = '123] will be parsed properly:
Recently I am trying to upgrade my project from Antlr3 to Antlr4. But after making change in the grammar file, it seems the equations that worked previously is no longer working. I am new to Antlr4 so unable to understand whether my change broke something or not.
Here is my original grammar file:
grammar equation;
options {
language=CSharp2;
output=AST;
ASTLabelType=CommonTree;
}
tokens {
VARIABLE;
CONSTANT;
EXPR;
PAREXPR;
EQUATION;
UNARYEXPR;
FUNCTION;
BINARYOP;
LIST;
}
equationset: equation* EOF!;
equation: variable ASSIGN expression -> ^(EQUATION variable expression)
;
parExpression
: LPAREN expression RPAREN -> ^(PAREXPR expression)
;
expression
: conditionalexpression -> ^(EXPR conditionalexpression)
;
conditionalexpression
: orExpression
;
orExpression
: andExpression ( OR^ andExpression )*
;
andExpression
: comparisonExpression ( AND^ comparisonExpression )*;
comparisonExpression:
additiveExpression ((EQ^ | NE^ | LTE^ | GTE^ | LT^ | GT^) additiveExpression)*;
additiveExpression
: multiplicativeExpression ( (PLUS^ | MINUS^) multiplicativeExpression )*
;
multiplicativeExpression
: unaryExpression ( ( TIMES^ | DIVIDE^) unaryExpression )*
;
unaryExpression
: NOT unaryExpression -> ^(UNARYEXPR NOT unaryExpression)
| MINUS unaryExpression -> ^(UNARYEXPR MINUS unaryExpression)
| exponentexpression;
exponentexpression
: primary (CARET^ primary)*;
primary : parExpression | constant | booleantok | variable | function;
numeric: INTEGER | REAL;
constant: STRING -> ^(CONSTANT STRING) | numeric -> ^(CONSTANT numeric);
booleantok : BOOLEAN -> ^(BOOLEAN);
scopedidentifier
: (IDENTIFIER DOT)* IDENTIFIER -> IDENTIFIER+;
function
: scopedidentifier LPAREN argumentlist RPAREN -> ^(FUNCTION scopedidentifier argumentlist);
variable: scopedidentifier -> ^(VARIABLE scopedidentifier);
argumentlist: (expression) ? (COMMA! expression)*;
WS : (' '|'\r'|'\n'|'\t')+ {$channel=HIDDEN;};
COMMENT : '/*' .* '*/' {$channel=HIDDEN;};
LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;};
STRING: (('\"') ( (~('\"')) )* ('\"'))+;
fragment ALPHA: 'a'..'z'|'_';
fragment DIGIT: '0'..'9';
fragment ALNUM: ALPHA|DIGIT;
EQ : '==';
ASSIGN : '=';
NE : '!=' | '<>';
OR : 'or' | '||';
AND : 'and' | '&&';
NOT : '!'|'not';
LTE : '<=';
GTE : '>=';
LT : '<';
GT : '>';
TIMES : '*';
DIVIDE : '/';
BOOLEAN : 'true' | 'false';
IDENTIFIER: ALPHA (ALNUM)* | ('[' (~(']'))+ ']') ;
REAL: DIGIT* DOT DIGIT+ ('e' (PLUS | MINUS)? DIGIT+)?;
INTEGER: DIGIT+;
PLUS : '+';
MINUS : '-';
COMMA : ',';
RPAREN : ')';
LPAREN : '(';
DOT : '.';
CARET : '^';
And here is what I have after my changes:
grammar equation;
options {
}
tokens {
VARIABLE;
CONSTANT;
EXPR;
PAREXPR;
EQUATION;
UNARYEXPR;
FUNCTION;
BINARYOP;
LIST;
}
equationset: equation* EOF;
equation: variable ASSIGN expression
;
parExpression
: LPAREN expression RPAREN
;
expression
: conditionalexpression
;
conditionalexpression
: orExpression
;
orExpression
: andExpression ( OR andExpression )*
;
andExpression
: comparisonExpression ( AND comparisonExpression )*;
comparisonExpression:
additiveExpression ((EQ | NE | LTE | GTE | LT | GT) additiveExpression)*;
additiveExpression
: multiplicativeExpression ( (PLUS | MINUS) multiplicativeExpression )*
;
multiplicativeExpression
: unaryExpression ( ( TIMES | DIVIDE) unaryExpression )*
;
unaryExpression
: NOT unaryExpression
| MINUS unaryExpression
| exponentexpression;
exponentexpression
: primary (CARET primary)*;
primary : parExpression | constant | booleantok | variable | function;
numeric: INTEGER | REAL;
constant: STRING | numeric;
booleantok : BOOLEAN;
scopedidentifier
: (IDENTIFIER DOT)* IDENTIFIER;
function
: scopedidentifier LPAREN argumentlist RPAREN;
variable: scopedidentifier;
argumentlist: (expression) ? (COMMA expression)*;
WS : (' '|'\r'|'\n'|'\t')+ ->channel(HIDDEN);
COMMENT : '/*' .* '*/' ->channel(HIDDEN);
LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? '\n' ->channel(HIDDEN);
STRING: (('\"') ( (~('\"')) )* ('\"'))+;
fragment ALPHA: 'a'..'z'|'_';
fragment DIGIT: '0'..'9';
fragment ALNUM: ALPHA|DIGIT;
EQ : '==';
ASSIGN : '=';
NE : '!=' | '<>';
OR : 'or' | '||';
AND : 'and' | '&&';
NOT : '!'|'not';
LTE : '<=';
GTE : '>=';
LT : '<';
GT : '>';
TIMES : '*';
DIVIDE : '/';
BOOLEAN : 'true' | 'false';
IDENTIFIER: ALPHA (ALNUM)* | ('[' (~(']'))+ ']') ;
REAL: DIGIT* DOT DIGIT+ ('e' (PLUS | MINUS)? DIGIT+)?;
INTEGER: DIGIT+;
PLUS : '+';
MINUS : '-';
COMMA : ',';
RPAREN : ')';
LPAREN : '(';
DOT : '.';
CARET : '^';
A sample equation that I am trying to parse (which was working OK before) is:
[a].[b] = 1.76 * [Product_DC].[PDC_Inbound_Pallets] * if(product_dc.[PDC_DC] =="US84",1,0)
Thanks in advance.
Tokens should be listed with comma , not semicolon ;. See also Token Section paragraph in official doc.
Since ANTLR 4.7 backslash is not required for double quote escaping. STRING: (('\"') ( (~('\"')) )* ('\"'))+; should be rewritten to STRING: ('"' ~'"'* '"')+;.
You missed question mark in multiline comment token for non-greedy matching: '/*' .* '*/' -> '/*' .*? '*/'.
So, the fixed grammar looks like this:
grammar equation;
options {
}
tokens {
VARIABLE,
CONSTANT,
EXPR,
PAREXPR,
EQUATION,
UNARYEXPR,
FUNCTION,
BINARYOP,
LIST
}
equationset: equation* EOF;
equation: variable ASSIGN expression
;
parExpression
: LPAREN expression RPAREN
;
expression
: conditionalexpression
;
conditionalexpression
: orExpression
;
orExpression
: andExpression ( OR andExpression )*
;
andExpression
: comparisonExpression ( AND comparisonExpression )*;
comparisonExpression:
additiveExpression ((EQ | NE | LTE | GTE | LT | GT) additiveExpression)*;
additiveExpression
: multiplicativeExpression ( (PLUS | MINUS) multiplicativeExpression )*
;
multiplicativeExpression
: unaryExpression ( ( TIMES | DIVIDE) unaryExpression )*
;
unaryExpression
: NOT unaryExpression
| MINUS unaryExpression
| exponentexpression;
exponentexpression
: primary (CARET primary)*;
primary : parExpression | constant | booleantok | variable | function;
numeric: INTEGER | REAL;
constant: STRING | numeric;
booleantok : BOOLEAN;
scopedidentifier
: (IDENTIFIER DOT)* IDENTIFIER;
function
: scopedidentifier LPAREN argumentlist RPAREN;
variable: scopedidentifier;
argumentlist: (expression) ? (COMMA expression)*;
WS : (' '|'\r'|'\n'|'\t')+ ->channel(HIDDEN);
COMMENT : '/*' .*? '*/' -> channel(HIDDEN);
LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? '\n' ->channel(HIDDEN);
STRING: ('"' ~'"'* '"')+;
fragment ALPHA: 'a'..'z'|'_';
fragment DIGIT: '0'..'9';
fragment ALNUM: ALPHA|DIGIT;
EQ : '==';
ASSIGN : '=';
NE : '!=' | '<>';
OR : 'or' | '||';
AND : 'and' | '&&';
NOT : '!'|'not';
LTE : '<=';
GTE : '>=';
LT : '<';
GT : '>';
TIMES : '*';
DIVIDE : '/';
BOOLEAN : 'true' | 'false';
IDENTIFIER: ALPHA (ALNUM)* | ('[' (~(']'))+ ']') ;
REAL: DIGIT* DOT DIGIT+ ('e' (PLUS | MINUS)? DIGIT+)?;
INTEGER: DIGIT+;
PLUS : '+';
MINUS : '-';
COMMA : ',';
RPAREN : ')';
LPAREN : '(';
DOT : '.';
CARET : '^';
As a homework I should create a parser for VC language V10.1 using ANTLR.
Here is my work:
grammar MyVCgrm2;
program : ( funcDecl | varDecl )*
;
funcDecl :
type identifier paraList compoundStmt
;
varDecl
: type initDeclaratorList ';'
;
initDeclaratorList
: initDeclarator ( ',' initDeclarator )*
;
initDeclarator
: declarator ( '=' initialiser )?
;
declarator
: identifier identifier2
;
identifier2
: |'[' INTLITERAL? ']'
;
initialiser
: expr
| '{' expr ( ',' expr ) * '}'
;
// primitive types
type
: 'void' | 'boolean' | 'int' | 'float'
;
// identifiers
identifier
: ID
;
// statements
compoundStmt
: '{' varDecl* stmt* '}'
;
stmt
:
compoundStmt
| ifStmt
| forStmt
| whileStmt
| breakStmt
| continueStmt
| returnStmt
| exprStmt
;
ifStmt
: 'if' '(' expr ')' stmt ( 'else' stmt )?
;
forStmt
: 'for' '(' expr? ';' expr? ';' expr? ')' stmt
;
whileStmt
: 'while' '(' expr ')' stmt
;
breakStmt
: 'break' ';'
;
continueStmt
: 'continue' ';'
;
returnStmt
: 'return' expr? ';'
;
exprStmt
: expr? ';'
;
// expressions
expr
: assignmentExpr
;
assignmentExpr
: ( condOrExpr '=' )* condOrExpr
;
condOrExpr
: condAndExpr condOrExpr2
;
condOrExpr2
: '||' condAndExpr condOrExpr
|
;
condAndExpr
: equalityExpr condAndExpr2
;
condAndExpr2
: '&&' equalityExpr condAndExpr2|
;
equalityExpr
: relExpr equalityExpr2
;
equalityExpr2
: '==' relExpr equalityExpr2 | '!=' relExpr equalityExpr2|
;
relExpr
: additiveExpr relExpr2
;
relExpr2
: '<' additiveExpr relExpr | '<=' additiveExpr relExpr | '>' additiveExpr relExpr| '>=' additiveExpr relExpr|
;
additiveExpr
:
multiplicativeExpr additiveExpr2
;
additiveExpr2
:
'+' multiplicativeExpr additiveExpr2
| '-' multiplicativeExpr additiveExpr2
|
;
multiplicativeExpr
:
unaryExpr multiplicativeExpr2
;
multiplicativeExpr2
:
'*' unaryExpr multiplicativeExpr2
| '/' unaryExpr multiplicativeExpr2
|
;
unaryExpr
: '+' unaryExpr
| '-' unaryExpr
| '!' unaryExpr
| primaryExpr
;
primaryExpr
: identifier primaryExpr2
| '(' expr ')'
| INTLITERAL
| FLOATLITERAL
| BOOLLITERAL
| STRINGLITERAL
;
primaryExpr2
: argList?
| '[' expr ']'
;
// parameters
paraList
: '(' properParaList? ')'
;
properParaList
: paraDecl ( ',' paraDecl ) *
;
paraDecl
: type declarator
;
argList
:'(' properArgList? ')'
;
properArgList
: arg ( ',' arg ) *
;
arg
: expr
;
BOOLLITERAL
: 'true'|'false'
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'?')*
;
INTLITERAL : '0'..'9'+
;
FLOATLITERAL
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
STRINGLITERAL
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
CHAR: '\'' ( ESC_SEQ | ~('\''|'\\') ) '\''
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
But when by trying generate code I got these errors in console:
[13:48:38] Checking Grammar MyVCgrm2.g...
[13:48:38] warning(200): MyVCgrm2.g:53:27:
Decision can match input such as "'else'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
[13:48:38] error(211): MyVCgrm2.g:78:21: [fatal] rule assignmentExpr has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
[13:48:38] error(211): MyVCgrm2.g:111:2: [fatal] rule additiveExpr2 has non-LL(*) decision due to recursive rule invocations reachable from alts 1,3. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
[13:48:38] warning(200): MyVCgrm2.g:111:2:
Decision can match input such as "'+' ID" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input
[13:48:38] error(211): MyVCgrm2.g:142:5: [fatal] rule primaryExpr2 has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
Warnings and errors are about these productions (in order): ifStmt, assignmentExpr, additiveExpr2, primaryExpr2.
I've done all left factorings and left recursion eliminations. I don't understand what ANTLR meant by "Resolve by left-factoring".
Why do I get these errors and how do I resolve them?
Thank you
My Grammar file (see below) parses queries of the type:
(name = Jon AND age != 16 OR city = NY);
However, it doesn't allow something like:
(name = 'Jon Smith' AND age != 16);
ie, it doesn't allow assign to a field values with more than one word, separated by White Spaces. How can I modify my grammar file to accept that?
options
{
language = Java;
output = AST;
}
tokens {
BLOCK;
RETURN;
QUERY;
ASSIGNMENT;
INDEXES;
}
#parser::header {
package pt.ptinovacao.agorang.antlr;
}
#lexer::header {
package pt.ptinovacao.agorang.antlr;
}
query
: expr ('ORDER BY' NAME AD)? ';' EOF
-> ^(QUERY expr ^('ORDER BY' NAME AD)?)
;
expr
: logical_expr
;
logical_expr
: equality_expr (logical_op^ equality_expr)*
;
equality_expr
: NAME equality_op atom -> ^(equality_op NAME atom)
| '(' expr ')' -> ^('(' expr)
;
atom
: ID
| id_list
| Int
| Number
;
id_list
: '(' ID (',' ID)* ')'
-> ID+
;
NAME
: 'equipType'
| 'equipment'
| 'IP'
| 'site'
| 'managedDomain'
| 'adminState'
| 'dataType'
;
AD : 'ASC' | 'DESC' ;
equality_op
: '='
| '!='
| 'IN'
| 'NOT IN'
;
logical_op
: 'AND'
| 'OR'
;
Number
: Int ('.' Digit*)?
;
ID
: ('a'..'z' | 'A'..'Z' | '_' | '.' | '-' | Digit)*
;
String
#after {
setText(getText().substring(1, getText().length()-1).replaceAll("\\\\(.)", "$1"));
}
: '"' (~('"' | '\\') | '\\' ('\\' | '"'))* '"'
| '\'' (~('\'' | '\\') | '\\' ('\\' | '\''))* '\''
;
Comment
: '//' ~('\r' | '\n')* {skip();}
| '/*' .* '*/' {skip();}
;
Space
: (' ' | '\t' | '\r' | '\n' | '\u000C') {skip();}
;
fragment Int
: '1'..'9' Digit*
| '0'
;
fragment Digit
: '0'..'9'
;
indexes
: ('[' expr ']')+ -> ^(INDEXES expr+)
;
Include the String token as an alternative in your atom rule:
atom
: ID
| id_list
| Int
| Number
| String
;
I've got a grammar in ANTLR and don't understand how it can be recursive. Is there any way to get ANTLR to show the derivation that it used to see that my rules are recursive?
The recursive grammar in it's entirety:
grammar DeadMG;
options {
language = C;
}
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
INT : '0'..'9'+
;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
STRING
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
CHAR: '\'' ( ESC_SEQ | ~('\''|'\\') ) '\''
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
program
: namespace_scope_definitions;
namespace_scope_definitions
: (namespace_definition | type_definition | function_definition | variable_definition)+;
type_scope_definitions
: (type_definition | function_definition | variable_definition)*;
namespace_definition
: 'namespace' ID '{' namespace_scope_definitions '}';
type_definition
: 'type' ID? (':' expression (',' expression)+ )? '{' type_scope_definitions '}';
function_definition
: ID '(' function_argument_list ')' ('(' function_argument_list ')')? ('->' expression)? compound_statement;
function_argument_list
: expression? ID (':=' expression)? (',' function_argument_list)?;
variable_definition
: 'static'? expression? ID ':=' expression
| 'static'? expression ID ('(' (expression)* ')')?;
literal_expression
: CHAR
| FLOAT
| INT
| STRING
| 'auto'
| 'type'
| type_definition;
primary_expression
: literal_expression
| ID
| '(' expression ')';
expression
: assignment_expression;
assignment_expression
: logical_or_expression (('=' | '*=' | '/=' | '%=' | '+=' | '-=' | '<<='| '>>=' | '&=' | '^=' | '|=') assignment_expression)*;
logical_or_expression
: logical_and_expression ('||' logical_and_expression)*;
logical_and_expression
: inclusive_or_expression ('&&' inclusive_or_expression)*;
inclusive_or_expression
: exclusive_or_expression ('|' exclusive_or_expression)*;
exclusive_or_expression
: and_expression ('^' and_expression)*;
and_expression
: equality_expression ('&' equality_expression)*;
equality_expression
: relational_expression (('=='|'!=') relational_expression)*;
relational_expression
: shift_expression (('<'|'>'|'<='|'>=') shift_expression)*;
shift_expression
: additive_expression (('<<'|'>>') additive_expression)*;
additive_expression
: multiplicative_expression (('+' multiplicative_expression) | ('-' multiplicative_expression))*;
multiplicative_expression
: unary_expression (('*' | '/' | '%') unary_expression)*;
unary_expression
: '++' primary_expression
| '--' primary_expression
| ('&' | '*' | '+' | '-' | '~' | '!') primary_expression
| 'sizeof' primary_expression
| postfix_expression;
postfix_expression
: primary_expression
| '[' expression ']'
| '(' expression* ')'
| '.' ID
| '->' ID
| '++'
| '--';
initializer_statement
: expression ';'
| variable_definition ';';
return_statement
: 'return' expression ';';
try_statement
: 'try' compound_statement catch_statement;
catch_statement
: 'catch' '(' ID ')' compound_statement catch_statement?
| 'catch' '(' '...' ')' compound_statement;
for_statement
: 'for' '(' initializer_statement expression? ';' expression? ')' compound_statement;
while_statement
: 'while' '(' initializer_statement ')' compound_statement;
do_while_statement
: 'do' compound_statement 'while' '(' expression ')';
switch_statement
: 'switch' '(' expression ')' '{' case_statement '}';
case_statement
: 'case:' (statement)* case_statement?
| 'default:' (statement)*;
if_statement
: 'if' '(' initializer_statement ')' compound_statement;
statement
: compound_statement
| return_statement
| try_statement
| initializer_statement
| for_statement
| while_statement
| do_while_statement
| switch_statement
| if_statement;
compound_statement
: '{' (statement)* '}';
More specifically, I am having trouble with the following rules:
namespace_scope_definitions
: (namespace_definition | type_definition | function_definition | variable_definition)+;
type_scope_definitions
: (type_definition | function_definition | variable_definition)*;
ANTLR is saying that alternatives 2 and 4, that is, type_definition and variable_definition, are recursive. Here's variable_definition:
variable_definition
: 'static'? expression? ID ':=' expression
| 'static'? expression ID ('(' (expression)* ')')?;
and here's type_definition:
type_definition
: 'type' ID? (':' expression (',' expression)+ )? '{' type_scope_definitions '}';
'type' itself, and type_definition, is a valid expression in my expression syntax. However, removing it is not resolving the ambiguity, so it doesn't originate there. And I have plenty of other ambiguities I need to resolve- detailing all the warnings and errors would be quite too much, so I'd really like to see more details on how they are recursive from ANTLR itself.
My suggestion is to remove most of the operator precedence rules for now:
expression
: multiplicative_expression
(
('+' multiplicative_expression)
| ('-' multiplicative_expression)
)*;
and then inline the rules that have a single caller to isolate the ambiguities. Yes it is tedious.
I found a few ambiguities in the grammar, fixed them and got a lot less warnings. However, I think that probably, LL is just not the right parsing algorithm for me. I am writing a custom parser and lexer. It would still have been nice if ANTLR would show me how it found the problems though, so that I might intervene and fix them.