As a homework I should create a parser for VC language V10.1 using ANTLR.
Here is my work:
grammar MyVCgrm2;
program : ( funcDecl | varDecl )*
;
funcDecl :
type identifier paraList compoundStmt
;
varDecl
: type initDeclaratorList ';'
;
initDeclaratorList
: initDeclarator ( ',' initDeclarator )*
;
initDeclarator
: declarator ( '=' initialiser )?
;
declarator
: identifier identifier2
;
identifier2
: |'[' INTLITERAL? ']'
;
initialiser
: expr
| '{' expr ( ',' expr ) * '}'
;
// primitive types
type
: 'void' | 'boolean' | 'int' | 'float'
;
// identifiers
identifier
: ID
;
// statements
compoundStmt
: '{' varDecl* stmt* '}'
;
stmt
:
compoundStmt
| ifStmt
| forStmt
| whileStmt
| breakStmt
| continueStmt
| returnStmt
| exprStmt
;
ifStmt
: 'if' '(' expr ')' stmt ( 'else' stmt )?
;
forStmt
: 'for' '(' expr? ';' expr? ';' expr? ')' stmt
;
whileStmt
: 'while' '(' expr ')' stmt
;
breakStmt
: 'break' ';'
;
continueStmt
: 'continue' ';'
;
returnStmt
: 'return' expr? ';'
;
exprStmt
: expr? ';'
;
// expressions
expr
: assignmentExpr
;
assignmentExpr
: ( condOrExpr '=' )* condOrExpr
;
condOrExpr
: condAndExpr condOrExpr2
;
condOrExpr2
: '||' condAndExpr condOrExpr
|
;
condAndExpr
: equalityExpr condAndExpr2
;
condAndExpr2
: '&&' equalityExpr condAndExpr2|
;
equalityExpr
: relExpr equalityExpr2
;
equalityExpr2
: '==' relExpr equalityExpr2 | '!=' relExpr equalityExpr2|
;
relExpr
: additiveExpr relExpr2
;
relExpr2
: '<' additiveExpr relExpr | '<=' additiveExpr relExpr | '>' additiveExpr relExpr| '>=' additiveExpr relExpr|
;
additiveExpr
:
multiplicativeExpr additiveExpr2
;
additiveExpr2
:
'+' multiplicativeExpr additiveExpr2
| '-' multiplicativeExpr additiveExpr2
|
;
multiplicativeExpr
:
unaryExpr multiplicativeExpr2
;
multiplicativeExpr2
:
'*' unaryExpr multiplicativeExpr2
| '/' unaryExpr multiplicativeExpr2
|
;
unaryExpr
: '+' unaryExpr
| '-' unaryExpr
| '!' unaryExpr
| primaryExpr
;
primaryExpr
: identifier primaryExpr2
| '(' expr ')'
| INTLITERAL
| FLOATLITERAL
| BOOLLITERAL
| STRINGLITERAL
;
primaryExpr2
: argList?
| '[' expr ']'
;
// parameters
paraList
: '(' properParaList? ')'
;
properParaList
: paraDecl ( ',' paraDecl ) *
;
paraDecl
: type declarator
;
argList
:'(' properArgList? ')'
;
properArgList
: arg ( ',' arg ) *
;
arg
: expr
;
BOOLLITERAL
: 'true'|'false'
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'?')*
;
INTLITERAL : '0'..'9'+
;
FLOATLITERAL
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
STRINGLITERAL
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
CHAR: '\'' ( ESC_SEQ | ~('\''|'\\') ) '\''
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
But when by trying generate code I got these errors in console:
[13:48:38] Checking Grammar MyVCgrm2.g...
[13:48:38] warning(200): MyVCgrm2.g:53:27:
Decision can match input such as "'else'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
[13:48:38] error(211): MyVCgrm2.g:78:21: [fatal] rule assignmentExpr has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
[13:48:38] error(211): MyVCgrm2.g:111:2: [fatal] rule additiveExpr2 has non-LL(*) decision due to recursive rule invocations reachable from alts 1,3. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
[13:48:38] warning(200): MyVCgrm2.g:111:2:
Decision can match input such as "'+' ID" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input
[13:48:38] error(211): MyVCgrm2.g:142:5: [fatal] rule primaryExpr2 has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
Warnings and errors are about these productions (in order): ifStmt, assignmentExpr, additiveExpr2, primaryExpr2.
I've done all left factorings and left recursion eliminations. I don't understand what ANTLR meant by "Resolve by left-factoring".
Why do I get these errors and how do I resolve them?
Thank you
Related
I am new to antlr. I am trying to parse some queries like [network-traffic:src_port = '123] and [network-traffic:src_port =] and [network-traffic:src_port = ] and ... I have a grammar as follows:
grammar STIXPattern;
pattern
: observationExpressions EOF
;
observationExpressions
: <assoc=left> observationExpressions FOLLOWEDBY observationExpressions #observationExpressionsFollowedBY
| observationExpressionOr #observationExpressionOr_
;
observationExpressionOr
: <assoc=left> observationExpressionOr OR observationExpressionOr #observationExpressionOred
| observationExpressionAnd #observationExpressionAnd_
;
observationExpressionAnd
: <assoc=left> observationExpressionAnd AND observationExpressionAnd #observationExpressionAnded
| observationExpression #observationExpression_
;
observationExpression
: LBRACK comparisonExpression RBRACK # observationExpressionSimple
| LPAREN observationExpressions RPAREN # observationExpressionCompound
| observationExpression startStopQualifier # observationExpressionStartStop
| observationExpression withinQualifier # observationExpressionWithin
| observationExpression repeatedQualifier # observationExpressionRepeated
;
comparisonExpression
: <assoc=left> comparisonExpression OR comparisonExpression #comparisonExpressionOred
| comparisonExpressionAnd #comparisonExpressionAnd_
;
comparisonExpressionAnd
: <assoc=left> comparisonExpressionAnd AND comparisonExpressionAnd #comparisonExpressionAnded
| propTest #comparisonExpressionAndpropTest
;
propTest
: objectPath NOT? (EQ|NEQ) primitiveLiteral # propTestEqual
| objectPath NOT? (GT|LT|GE|LE) orderableLiteral # propTestOrder
| objectPath NOT? IN setLiteral # propTestSet
| objectPath NOT? LIKE StringLiteral # propTestLike
| objectPath NOT? MATCHES StringLiteral # propTestRegex
| objectPath NOT? ISSUBSET StringLiteral # propTestIsSubset
| objectPath NOT? ISSUPERSET StringLiteral # propTestIsSuperset
| LPAREN comparisonExpression RPAREN # propTestParen
| objectPath NOT? (EQ|NEQ) objectPathThl # propTestThlEqual
;
startStopQualifier
: START TimestampLiteral STOP TimestampLiteral
;
withinQualifier
: WITHIN (IntPosLiteral|FloatPosLiteral) SECONDS
;
repeatedQualifier
: REPEATS IntPosLiteral TIMES
;
objectPath
: objectType COLON firstPathComponent objectPathComponent?
;
objectPathThl
: varThlType DOT firstPathComponent objectPathComponent?
;
objectType
: IdentifierWithoutHyphen
| IdentifierWithHyphen
;
varThlType
: IdentifierWithoutHyphen
| IdentifierWithHyphen
;
firstPathComponent
: IdentifierWithoutHyphen
| StringLiteral
;
objectPathComponent
: <assoc=left> objectPathComponent objectPathComponent # pathStep
| '.' (IdentifierWithoutHyphen | StringLiteral) # keyPathStep
| LBRACK (IntPosLiteral|IntNegLiteral|ASTERISK) RBRACK # indexPathStep
;
setLiteral
: LPAREN RPAREN
| LPAREN primitiveLiteral (COMMA primitiveLiteral)* RPAREN
;
primitiveLiteral
: orderableLiteral
| BoolLiteral
| edgeCases
;
edgeCases
: QUOTE (IdentifierWithHyphen | IdentifierWithoutHyphen | IntNoSign) RBRACK
| RBRACK
;
orderableLiteral
: IntPosLiteral
| IntNegLiteral
| FloatPosLiteral
| FloatNegLiteral
| StringLiteral
| BinaryLiteral
| HexLiteral
| TimestampLiteral
;
IntNegLiteral :
'-' ('0' | [1-9] [0-9]*)
;
IntNoSign :
('0' | [1-9] [0-9]*)
;
IntPosLiteral :
'+'? ('0' | [1-9] [0-9]*)
;
FloatNegLiteral :
'-' [0-9]* '.' [0-9]+
;
FloatPosLiteral :
'+'? [0-9]* '.' [0-9]+
;
HexLiteral :
'h' QUOTE TwoHexDigits* QUOTE
;
BinaryLiteral :
'b' QUOTE
( Base64Char Base64Char Base64Char Base64Char )*
( (Base64Char Base64Char Base64Char Base64Char )
| (Base64Char Base64Char Base64Char ) '='
| (Base64Char Base64Char ) '=='
)
QUOTE
;
StringLiteral :
QUOTE ( ~['\\] | '\\\'' | '\\\\' )* QUOTE
;
BoolLiteral :
TRUE | FALSE
;
TimestampLiteral :
't' QUOTE
[0-9] [0-9] [0-9] [0-9] HYPHEN
( ('0' [1-9]) | ('1' [012]) ) HYPHEN
( ('0' [1-9]) | ([12] [0-9]) | ('3' [01]) )
'T'
( ([01] [0-9]) | ('2' [0-3]) ) COLON
[0-5] [0-9] COLON
([0-5] [0-9] | '60')
(DOT [0-9]+)?
'Z'
QUOTE
;
//////////////////////////////////////////////
// Keywords
AND: 'AND' ;
OR: 'OR' ;
NOT: 'NOT' ;
FOLLOWEDBY: 'FOLLOWEDBY';
LIKE: 'LIKE' ;
MATCHES: 'MATCHES' ;
ISSUPERSET: 'ISSUPERSET' ;
ISSUBSET: 'ISSUBSET' ;
LAST: 'LAST' ;
IN: 'IN' ;
START: 'START' ;
STOP: 'STOP' ;
SECONDS: 'SECONDS' ;
TRUE: 'true' ;
FALSE: 'false' ;
WITHIN: 'WITHIN' ;
REPEATS: 'REPEATS' ;
TIMES: 'TIMES' ;
// After keywords, so the lexer doesn't tokenize them as identifiers.
// Object types may have unquoted hyphens, but property names
// (in object paths) cannot.
IdentifierWithoutHyphen :
[a-zA-Z_] [a-zA-Z0-9_]*
;
IdentifierWithHyphen :
[a-zA-Z_] [a-zA-Z0-9_-]*
;
EQ : '=' | '==';
NEQ : '!=' | '<>';
LT : '<';
LE : '<=';
GT : '>';
GE : '>=';
QUOTE : '\'';
COLON : ':' ;
DOT : '.' ;
COMMA : ',' ;
RPAREN : ')' ;
LPAREN : '(' ;
RBRACK : ']' ;
LBRACK : '[' ;
PLUS : '+' ;
HYPHEN : MINUS ;
MINUS : '-' ;
POWER_OP : '^' ;
DIVIDE : '/' ;
ASTERISK : '*';
EQRBRAC : ']';
fragment HexDigit: [A-Fa-f0-9];
fragment TwoHexDigits: HexDigit HexDigit;
fragment Base64Char: [A-Za-z0-9+/];
// Whitespace and comments
//
WS : [ \t\r\n\u000B\u000C\u0085\u00a0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]+ -> skip
;
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
// Catch-all to prevent lexer from silently eating unusable characters.
InvalidCharacter
: .
;
Now when I feed [network-traffic:src_port = '123] I expect antlr to parse the query to '123 and ]
However the grammar return '123] and is not able to separate '123 and ]
Am missing anything?
grammar does not separate '123 and ] though the rule is set for it
That is not true. The quote and 123 are separate tokens. As demonstrated/suggested in your previous ANTLR question: start by printing all the tokens to your console to see what tokens are being created. This should always be the first thing you do when trying to debug an ANTLR grammar. It will save you a lot of time and headache.
The fact [network-traffic:src_port = '123] is not parsed properly, is because the ](RBRACK) is being consumed by the alternative observationExpressionSimple:
observationExpression
: LBRACK comparisonExpression RBRACK # observationExpressionSimple
| LPAREN observationExpressions RPAREN # observationExpressionCompound
| observationExpression startStopQualifier # observationExpressionStartStop
| observationExpression withinQualifier # observationExpressionWithin
| observationExpression repeatedQualifier # observationExpressionRepeated
;
Because RBRACK was already consumed by a parser rule, the edgeCases rule can't consume this RBRACK token as well.
To fix this, change your rule:
edgeCases
: QUOTE (IdentifierWithHyphen | IdentifierWithoutHyphen | IntNoSign) RBRACK
| RBRACK
;
into this:
edgeCases
: QUOTE (IdentifierWithHyphen | IdentifierWithoutHyphen | IntNoSign)
;
Now [network-traffic:src_port = '123] will be parsed properly:
Recently I am trying to upgrade my project from Antlr3 to Antlr4. But after making change in the grammar file, it seems the equations that worked previously is no longer working. I am new to Antlr4 so unable to understand whether my change broke something or not.
Here is my original grammar file:
grammar equation;
options {
language=CSharp2;
output=AST;
ASTLabelType=CommonTree;
}
tokens {
VARIABLE;
CONSTANT;
EXPR;
PAREXPR;
EQUATION;
UNARYEXPR;
FUNCTION;
BINARYOP;
LIST;
}
equationset: equation* EOF!;
equation: variable ASSIGN expression -> ^(EQUATION variable expression)
;
parExpression
: LPAREN expression RPAREN -> ^(PAREXPR expression)
;
expression
: conditionalexpression -> ^(EXPR conditionalexpression)
;
conditionalexpression
: orExpression
;
orExpression
: andExpression ( OR^ andExpression )*
;
andExpression
: comparisonExpression ( AND^ comparisonExpression )*;
comparisonExpression:
additiveExpression ((EQ^ | NE^ | LTE^ | GTE^ | LT^ | GT^) additiveExpression)*;
additiveExpression
: multiplicativeExpression ( (PLUS^ | MINUS^) multiplicativeExpression )*
;
multiplicativeExpression
: unaryExpression ( ( TIMES^ | DIVIDE^) unaryExpression )*
;
unaryExpression
: NOT unaryExpression -> ^(UNARYEXPR NOT unaryExpression)
| MINUS unaryExpression -> ^(UNARYEXPR MINUS unaryExpression)
| exponentexpression;
exponentexpression
: primary (CARET^ primary)*;
primary : parExpression | constant | booleantok | variable | function;
numeric: INTEGER | REAL;
constant: STRING -> ^(CONSTANT STRING) | numeric -> ^(CONSTANT numeric);
booleantok : BOOLEAN -> ^(BOOLEAN);
scopedidentifier
: (IDENTIFIER DOT)* IDENTIFIER -> IDENTIFIER+;
function
: scopedidentifier LPAREN argumentlist RPAREN -> ^(FUNCTION scopedidentifier argumentlist);
variable: scopedidentifier -> ^(VARIABLE scopedidentifier);
argumentlist: (expression) ? (COMMA! expression)*;
WS : (' '|'\r'|'\n'|'\t')+ {$channel=HIDDEN;};
COMMENT : '/*' .* '*/' {$channel=HIDDEN;};
LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;};
STRING: (('\"') ( (~('\"')) )* ('\"'))+;
fragment ALPHA: 'a'..'z'|'_';
fragment DIGIT: '0'..'9';
fragment ALNUM: ALPHA|DIGIT;
EQ : '==';
ASSIGN : '=';
NE : '!=' | '<>';
OR : 'or' | '||';
AND : 'and' | '&&';
NOT : '!'|'not';
LTE : '<=';
GTE : '>=';
LT : '<';
GT : '>';
TIMES : '*';
DIVIDE : '/';
BOOLEAN : 'true' | 'false';
IDENTIFIER: ALPHA (ALNUM)* | ('[' (~(']'))+ ']') ;
REAL: DIGIT* DOT DIGIT+ ('e' (PLUS | MINUS)? DIGIT+)?;
INTEGER: DIGIT+;
PLUS : '+';
MINUS : '-';
COMMA : ',';
RPAREN : ')';
LPAREN : '(';
DOT : '.';
CARET : '^';
And here is what I have after my changes:
grammar equation;
options {
}
tokens {
VARIABLE;
CONSTANT;
EXPR;
PAREXPR;
EQUATION;
UNARYEXPR;
FUNCTION;
BINARYOP;
LIST;
}
equationset: equation* EOF;
equation: variable ASSIGN expression
;
parExpression
: LPAREN expression RPAREN
;
expression
: conditionalexpression
;
conditionalexpression
: orExpression
;
orExpression
: andExpression ( OR andExpression )*
;
andExpression
: comparisonExpression ( AND comparisonExpression )*;
comparisonExpression:
additiveExpression ((EQ | NE | LTE | GTE | LT | GT) additiveExpression)*;
additiveExpression
: multiplicativeExpression ( (PLUS | MINUS) multiplicativeExpression )*
;
multiplicativeExpression
: unaryExpression ( ( TIMES | DIVIDE) unaryExpression )*
;
unaryExpression
: NOT unaryExpression
| MINUS unaryExpression
| exponentexpression;
exponentexpression
: primary (CARET primary)*;
primary : parExpression | constant | booleantok | variable | function;
numeric: INTEGER | REAL;
constant: STRING | numeric;
booleantok : BOOLEAN;
scopedidentifier
: (IDENTIFIER DOT)* IDENTIFIER;
function
: scopedidentifier LPAREN argumentlist RPAREN;
variable: scopedidentifier;
argumentlist: (expression) ? (COMMA expression)*;
WS : (' '|'\r'|'\n'|'\t')+ ->channel(HIDDEN);
COMMENT : '/*' .* '*/' ->channel(HIDDEN);
LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? '\n' ->channel(HIDDEN);
STRING: (('\"') ( (~('\"')) )* ('\"'))+;
fragment ALPHA: 'a'..'z'|'_';
fragment DIGIT: '0'..'9';
fragment ALNUM: ALPHA|DIGIT;
EQ : '==';
ASSIGN : '=';
NE : '!=' | '<>';
OR : 'or' | '||';
AND : 'and' | '&&';
NOT : '!'|'not';
LTE : '<=';
GTE : '>=';
LT : '<';
GT : '>';
TIMES : '*';
DIVIDE : '/';
BOOLEAN : 'true' | 'false';
IDENTIFIER: ALPHA (ALNUM)* | ('[' (~(']'))+ ']') ;
REAL: DIGIT* DOT DIGIT+ ('e' (PLUS | MINUS)? DIGIT+)?;
INTEGER: DIGIT+;
PLUS : '+';
MINUS : '-';
COMMA : ',';
RPAREN : ')';
LPAREN : '(';
DOT : '.';
CARET : '^';
A sample equation that I am trying to parse (which was working OK before) is:
[a].[b] = 1.76 * [Product_DC].[PDC_Inbound_Pallets] * if(product_dc.[PDC_DC] =="US84",1,0)
Thanks in advance.
Tokens should be listed with comma , not semicolon ;. See also Token Section paragraph in official doc.
Since ANTLR 4.7 backslash is not required for double quote escaping. STRING: (('\"') ( (~('\"')) )* ('\"'))+; should be rewritten to STRING: ('"' ~'"'* '"')+;.
You missed question mark in multiline comment token for non-greedy matching: '/*' .* '*/' -> '/*' .*? '*/'.
So, the fixed grammar looks like this:
grammar equation;
options {
}
tokens {
VARIABLE,
CONSTANT,
EXPR,
PAREXPR,
EQUATION,
UNARYEXPR,
FUNCTION,
BINARYOP,
LIST
}
equationset: equation* EOF;
equation: variable ASSIGN expression
;
parExpression
: LPAREN expression RPAREN
;
expression
: conditionalexpression
;
conditionalexpression
: orExpression
;
orExpression
: andExpression ( OR andExpression )*
;
andExpression
: comparisonExpression ( AND comparisonExpression )*;
comparisonExpression:
additiveExpression ((EQ | NE | LTE | GTE | LT | GT) additiveExpression)*;
additiveExpression
: multiplicativeExpression ( (PLUS | MINUS) multiplicativeExpression )*
;
multiplicativeExpression
: unaryExpression ( ( TIMES | DIVIDE) unaryExpression )*
;
unaryExpression
: NOT unaryExpression
| MINUS unaryExpression
| exponentexpression;
exponentexpression
: primary (CARET primary)*;
primary : parExpression | constant | booleantok | variable | function;
numeric: INTEGER | REAL;
constant: STRING | numeric;
booleantok : BOOLEAN;
scopedidentifier
: (IDENTIFIER DOT)* IDENTIFIER;
function
: scopedidentifier LPAREN argumentlist RPAREN;
variable: scopedidentifier;
argumentlist: (expression) ? (COMMA expression)*;
WS : (' '|'\r'|'\n'|'\t')+ ->channel(HIDDEN);
COMMENT : '/*' .*? '*/' -> channel(HIDDEN);
LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? '\n' ->channel(HIDDEN);
STRING: ('"' ~'"'* '"')+;
fragment ALPHA: 'a'..'z'|'_';
fragment DIGIT: '0'..'9';
fragment ALNUM: ALPHA|DIGIT;
EQ : '==';
ASSIGN : '=';
NE : '!=' | '<>';
OR : 'or' | '||';
AND : 'and' | '&&';
NOT : '!'|'not';
LTE : '<=';
GTE : '>=';
LT : '<';
GT : '>';
TIMES : '*';
DIVIDE : '/';
BOOLEAN : 'true' | 'false';
IDENTIFIER: ALPHA (ALNUM)* | ('[' (~(']'))+ ']') ;
REAL: DIGIT* DOT DIGIT+ ('e' (PLUS | MINUS)? DIGIT+)?;
INTEGER: DIGIT+;
PLUS : '+';
MINUS : '-';
COMMA : ',';
RPAREN : ')';
LPAREN : '(';
DOT : '.';
CARET : '^';
I’m currently trying to build a parser for the language Oberon using Antlr and Ecplise.
This is what I have got so far:
grammar oberon;
options
{
language = Java;
//backtrack = true;
output = AST;
}
#parser::header {package dhbw.Oberon;}
#lexer::header {package dhbw.Oberon; }
T_ARRAY : 'ARRAY' ;
T_BEGIN : 'BEGIN';
T_CASE : 'CASE' ;
T_CONST : 'CONST' ;
T_DO : 'DO' ;
T_ELSE : 'ELSE' ;
T_ELSIF : 'ELSIF' ;
T_END : 'END' ;
T_EXIT : 'EXIT' ;
T_IF : 'IF' ;
T_IMPORT : 'IMPORT' ;
T_LOOP : 'LOOP' ;
T_MODULE : 'MODULE' ;
T_NIL : 'NIL' ;
T_OF : 'OF' ;
T_POINTER : 'POINTER' ;
T_PROCEDURE : 'PROCEDURE' ;
T_RECORD : 'RECORD' ;
T_REPEAT : 'REPEAT' ;
T_RETURN : 'RETURN';
T_THEN : 'THEN' ;
T_TO : 'TO' ;
T_TYPE : 'TYPE' ;
T_UNTIL : 'UNTIL' ;
T_VAR : 'VAR' ;
T_WHILE : 'WHILE' ;
T_WITH : 'WITH' ;
module : T_MODULE ID SEMI importlist? declarationsequence?
(T_BEGIN statementsequence)? T_END ID PERIOD ;
importlist : T_IMPORT importitem (COMMA importitem)* SEMI ;
importitem : ID (ASSIGN ID)? ;
declarationsequence :
( T_CONST (constantdeclaration SEMI)*
| T_TYPE (typedeclaration SEMI)*
| T_VAR (variabledeclaration SEMI)*)
(proceduredeclaration SEMI | forwarddeclaration SEMI)*
;
constantdeclaration: identifierdef EQUAL expression ;
identifierdef: ID MULT? ;
expression: simpleexpression (relation simpleexpression)? ;
simpleexpression : (PLUS|MINUS)? term (addoperator term)* ;
term: factor (muloperator factor)* ;
factor: number
| stringliteral
| T_NIL
| set
| designator '(' explist? ')'
;
number: INT | HEX ; // TODO add real
stringliteral : '"' ( ~('\\'|'"') )* '"' ;
set: '{' elementlist? '}' ;
elementlist: element (COMMA element)* ;
element: expression (RANGESEP expression)? ;
designator: qualidentifier
('.' ID
| '[' explist ']'
| '(' qualidentifier ')'
| UPCHAR )+
;
explist: expression (COMMA expression)* ;
actualparameters: '(' explist? ')' ;
muloperator: MULT | DIV | MOD | ET ;
addoperator: PLUS | MINUS | OR ;
relation: EQUAL ; // TODO
typedeclaration: ID EQUAL type ;
type: qualidentifier
| arraytype
| recordtype
| pointertype
| proceduretype
;
qualidentifier: (ID '.')* ID ;
arraytype: T_ARRAY expression (',' expression) T_OF type;
recordtype: T_RECORD ('(' qualidentifier ')')? fieldlistsequence T_END ;
fieldlistsequence: fieldlist (SEMI fieldlist) ;
fieldlist: (identifierlist COLON type)? ;
identifierlist: identifierdef (COMMA identifierdef)* ;
pointertype: T_POINTER T_TO type ;
proceduretype: T_PROCEDURE formalparameters? ;
variabledeclaration: identifierlist COLON type ;
proceduredeclaration: procedureheading SEMI procedurebody ID ;
procedureheading: T_PROCEDURE MULT? identifierdef formalparameters? ;
formalparameters: '(' params? ')' (COLON qualidentifier)? ;
params: fpsection (SEMI fpsection)* ;
fpsection: T_VAR? idlist COLON formaltype ;
idlist: ID (COMMA ID)* ;
formaltype: (T_ARRAY T_OF)* (qualidentifier | proceduretype);
procedurebody: declarationsequence (T_BEGIN statementsequence)? T_END ;
forwarddeclaration: T_PROCEDURE UPCHAR? ID MULT? formalparameters? ;
statementsequence: statement (SEMI statement)* ;
statement : assignment
| procedurecall
| ifstatement
| casestatement
| whilestatement
| repeatstatement
| loopstatement
| withstatement
| T_EXIT
| T_RETURN expression?
;
assignment: designator ASSIGN expression ;
procedurecall: designator actualparameters? ;
ifstatement: T_IF expression T_THEN statementsequence
(T_ELSIF expression T_THEN statementsequence)*
(T_ELSE statementsequence)? T_END ;
casestatement: T_CASE expression T_OF caseitem ('|' caseitem)*
(T_ELSE statementsequence)? T_END ;
caseitem: caselabellist COLON statementsequence ;
caselabellist: caselabels (COMMA caselabels)* ;
caselabels: expression (RANGESEP expression)? ;
whilestatement: T_WHILE expression T_DO statementsequence T_END ;
repeatstatement: T_REPEAT statementsequence T_UNTIL expression ;
loopstatement: T_LOOP statementsequence T_END ;
withstatement: T_WITH qualidentifier COLON qualidentifier T_DO statementsequence T_END ;
ID : ('a'..'z'|'A'..'Z')('a'..'z'|'A'..'Z'|'_'|'0'..'9')* ;
fragment DIGIT : '0'..'9' ;
INT : ('-')?DIGIT+ ;
fragment HEXDIGIT : '0'..'9'|'A'..'F' ;
HEX : HEXDIGIT+ 'H' ;
ASSIGN : ':=' ;
COLON : ':' ;
COMMA : ',' ;
DIV : '/' ;
EQUAL : '=' ;
ET : '&' ;
MINUS : '-' ;
MOD : '%' ;
MULT : '*' ;
OR : '|' ;
PERIOD : '.' ;
PLUS : '+' ;
RANGESEP : '..' ;
SEMI : ';' ;
UPCHAR : '^' ;
WS : ( ' ' | '\t' | '\r' | '\n'){skip();};
My problem is when I check the grammar I get the following error and just can’t find an appropriate way to fix this:
rule statement has non-LL(*) decision
due to recursive rule invocations reachable from alts 1,2.
Resolve by left-factoring or using syntactic predicates
or using backtrack=true option.
|---> statement : assignment
Also I have the problem with declarationsequence and simpleexpression.
When I use options { … backtrack = true; … } it at least compiles, but obviously doesn’t work right anymore when I run a test-file, but I can’t find a way to resolve the left-recursion on my own (or maybe I’m just too blind at the moment because I’ve looked at this for far too long now). Any ideas how I could change the lines where the errors occurs to make it work?
EDIT
I could fix one of the three mistakes. statement works now. The problem was that assignment and procedurecall both started with designator.
statement : procedureassignmentcall
| ifstatement
| casestatement
| whilestatement
| repeatstatement
| loopstatement
| withstatement
| T_EXIT
| T_RETURN expression?
;
procedureassignmentcall : (designator ASSIGN)=> assignment | procedurecall;
assignment: designator ASSIGN expression ;
procedurecall: designator actualparameters? ;
I am new to the world of ANTLR, so I am hoping that someone could do a code review and see if what I am attempting todo is being done in the most efficient means possible. What I am trying todo is to parse what essentially is a (multi)-conditional statement which will be turned into Lambda statements:
Example:
Student Object
{
string ID { get; set;}
string Campus { get; set;}
string FirstName { get; set;}
}
Evaluation Statements:
StudentID="XXX"
StudentID="XXX" || FirstName=="John"
(StudentID="XXX" || FirstName=="John")
(StudentID="XXX" || FirstName=="John") || (StudentID="YYY" || FirstName=="DOE")
The following is the code that I've created so far. Could someone let me know if I'm going in the right direction:
/*lexer */grammar Test;
options {
output=AST;
}
tokens {
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
expr : multiExpression ((AND^|OR^) multiExpression)* EOF! ;
multiExpression
: primaryExpression ((AND^|OR^) primaryExpression)* |
LPAREN! primaryExpression ((AND^|OR^) primaryExpression)* RPAREN!;
primaryExpression
: ID (EQUALS^ | GTR^ | LTN^ | GTREQ^ | LSTEQ^ | NOT^) value |
LPAREN! ID (EQUALS^ | GTR^ | LTN^ | GTREQ^ | LSTEQ^ | NOT^) value RPAREN! ;
value: STRING
|ID
|INT
|FLOAT
|CHAR;
//expr : term ( ( PLUS | MINUS ) term )* ;
//term : factor ( ( MULT | DIV ) factor )* ;
//factor : NUMBER ;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
EQUALS : '==';
GTR : '>';
LTN : '<';
GTREQ : '>=';
LSTEQ : '<=';
NOT : '!=';
OR : '||';
LPAREN : '(';
RPAREN : ')';
AND : '&&';
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
INT : ('0'..'9')+;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
CHAR: '\'' ( ESC_SEQ | ~('\''|'\\') ) '\''
;
STRING
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
I'm a little confused. I have a grammar that works well and matches my language just as I want it to. Recently, I added a couple rules to the grammar and in the process of converting the new grammar rules to the tree grammar I am getting some strange errors. The first error I was getting was that the tree grammar was ambiguous.
The errors I received were:
[10:53:16] warning(200): ShiroDefinitionPass.g:129:17:
Decision can match input such as "'subjunctive node'" using multiple alternatives: 5, 6
As a result, alternative(s) 6 were disabled for that input
[10:53:16] warning(200): ShiroDefinitionPass.g:129:17:
Decision can match input such as "PORT_ASSIGNMENT" using multiple alternatives: 2, 6
and about 10 more similar errors.
I can't tell why the tree grammar is ambiguous. It was fine before I added the sNode, subjunctDeclNodeProd, subjunctDecl, and subjunctSelector rules.
My grammar is:
grammar Shiro;
options{
language = Java;
ASTLabelType=CommonTree;
output=AST;
}
tokens{
NEGATION;
STATE_DECL;
PORT_DECL;
PORT_INIT;
PORT_ASSIGNMENT;
PORT_TAG;
PATH;
PORT_INDEX;
EVAL_SELECT;
SUBJ_SELECT;
SUBJ_NODE_PROD;
ACTIVATION;
ACTIVATION_LIST;
PRODUCES;
}
shiro : statement+
;
statement
: nodestmt
| sNode
| graphDecl
| statestmt
| collection
| view
| NEWLINE!
;
view : 'view' IDENT mfName IDENT -> ^('view' IDENT mfName IDENT)
;
collection
: 'collection' IDENT orderingFunc path 'begin' NEWLINE
(collItem)+ NEWLINE?
'end'
-> ^('collection' IDENT orderingFunc path collItem+)
;
collItem: IDENT -> IDENT
;
orderingFunc
: IDENT -> IDENT
;
statestmt
: 'state' stateName 'begin' NEWLINE
stateHeader
'end' -> ^(STATE_DECL stateName stateHeader)
;
stateHeader
: (stateTimeStmt | stateCommentStmt | stateParentStmt | stateGraphStmt | activationPath | NEWLINE!)+
;
stateTimeStmt
: 'Time' time -> ^('Time' time)
;
stateCommentStmt
: 'Comment' comment -> ^('Comment' comment)
;
stateParentStmt
: 'Parent' stateParent -> ^('Parent' stateParent)
;
stateGraphStmt
: 'Graph' stateGraph -> ^('Graph' stateGraph)
;
stateName
: IDENT
;
time : STRING_LITERAL
;
comment : STRING_LITERAL
;
stateParent
: IDENT
;
stateGraph
: IDENT
;
activationPath
: l=activation ('.'^ (r=activation | activationList))*
;
activationList
: '<' activation (',' activation)* '>' -> ^(ACTIVATION_LIST activation+)
;
activation
: c=IDENT ('[' v=IDENT ']')? -> ^(ACTIVATION $c ($v)?)
;
graphDecl
: 'graph' IDENT 'begin' NEWLINE
graphLine+
'end'
-> ^('graph' IDENT graphLine+)
;
graphLine
: nodeProduction | portAssignment | NEWLINE!
;
nodeInternal
: (nodeProduction
| portAssignment
| portstmt
| nodestmt
| sNode
| NEWLINE!)+
;
nodestmt
: 'node'^ IDENT ('['! activeSelector ']'!)? 'begin'! NEWLINE!
nodeInternal
'end'!
;
sNode
: 'subjunctive node'^ IDENT '['! subjunctSelector ']'! 'begin'! NEWLINE!
(subjunctDeclNodeProd | subjunctDecl | NEWLINE!)+
'end'!
;
subjunctDeclNodeProd
: l=IDENT '->' r=IDENT 'begin' NEWLINE
nodeInternal
'end' -> ^(SUBJ_NODE_PROD $l $r nodeInternal )
;
subjunctDecl
: 'subjunct'^ IDENT ('['! activeSelector ']'!)? 'begin'! NEWLINE!
nodeInternal
'end'!
;
subjunctSelector
: IDENT -> ^(SUBJ_SELECT IDENT)
;
activeSelector
: IDENT -> ^(EVAL_SELECT IDENT)
;
nodeProduction
: path ('->'^ activationPath )+ NEWLINE!
;
portAssignment
: path '(' mfparams ')' NEWLINE -> ^(PORT_ASSIGNMENT path mfparams)
;
portDecl
: portType portName mfName -> ^(PORT_DECL ^(PORT_TAG portType) portName mfName)
;
portDeclInit
: portType portName mfCall -> ^(PORT_INIT ^(PORT_TAG portType) portName mfCall)
;
portstmt
: (portDecl | portDeclInit ) NEWLINE!
;
portName
: IDENT
;
portType: 'port'
| 'eval'
;
mfCall : mfName '(' mfparams ')' -> ^(mfName mfparams)
;
mfName : IDENT
;
mfparams: expression(',' expression)* -> expression+
;
// Path
path : (IDENT)('.' IDENT)*('[' pathIndex ']')? -> ^(PATH IDENT+ pathIndex?)
;
pathIndex
: portIndex -> ^(PORT_INDEX portIndex)
;
portIndex
: ( NUMBER |STRING_LITERAL )
;
// Expressions
term : path
| '(' expression ')' -> expression
| NUMBER
| STRING_LITERAL
;
unary : ('+'^ | '-'^)* term
;
mult : unary (('*'^ | '/'^ | '%'^) unary)*
;
add
: mult (( '+'^ | '-'^ ) mult)*
;
expression
: add (( '|'^ ) add)*
;
// LEXEMES
STRING_LITERAL
: '"' .* '"'
;
NUMBER : DIGIT+ ('.'DIGIT+)?
;
IDENT : (LCLETTER | UCLETTER | DIGIT)(LCLETTER | UCLETTER | DIGIT|'_')*
;
COMMENT
: '//' ~('\n'|'\r')* {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' NEWLINE?{$channel=HIDDEN;}
;
WS
: (' ' | '\t' | '\f')+ {$channel = HIDDEN;}
;
NEWLINE : '\r'? '\n'
;
fragment
LCLETTER
: 'a'..'z'
;
fragment
UCLETTER: 'A'..'Z'
;
fragment
DIGIT : '0'..'9'
;
My tree grammar for the section looks like:
tree grammar ShiroDefinitionPass;
options{
tokenVocab=Shiro;
ASTLabelType=CommonTree;
}
shiro
: statement+
;
statement
: nodestmt
| sNode
| graphDecl
| statestmt
| collection
| view
;
view : ^('view' IDENT mfName IDENT)
;
collection
: ^('collection' IDENT orderingFunc path collItem+)
;
collItem: IDENT
;
orderingFunc
: IDENT
;
statestmt
: ^(STATE_DECL stateHeader)
;
stateHeader
: (stateTimeStmt | stateCommentStmt | stateParentStmt| stateGraphStmt | activation )+
;
stateTimeStmt
: ^('Time' time)
;
stateCommentStmt
: ^('Comment' comment)
;
stateParentStmt
: ^('Parent' stateParent)
;
stateGraphStmt
: ^('Graph' stateGraph)
;
stateName
: IDENT
;
time : STRING_LITERAL
;
comment : STRING_LITERAL
;
stateParent
: IDENT
;
stateGraph
: IDENT
;
activationPath
: l=activation ('.' (r=activation | activationList))*
;
activationList
: ^(ACTIVATION_LIST activation+)
;
activation
: ^(ACTIVATION IDENT IDENT?)
;
// Graph Declarations
graphDecl
: ^('graph' IDENT graphLine+)
;
graphLine
: nodeProduction
| portAssignmen
;
// End Graph declaration
nodeInternal
: (nodeProduction
|portAssignment
|portstmt
|nodestmt
|sNode )+
;
nodestmt
: ^('node' IDENT activeSelector? nodeInternal)
;
sNode
: ^('subjunctive node' IDENT subjunctSelector (subjunctDeclNodeProd | subjunctDecl)*)
;
subjunctDeclNodeProd
: ^(SUBJ_NODE_PROD IDENT IDENT nodeInternal+ )
;
subjunctDecl
: ^('subjunct' IDENT activeSelector? nodeInternal )
;
subjunctSelector
: ^(SUBJ_SELECT IDENT)
;
activeSelector returns
: ^(EVAL_SELECT IDENT)
;
nodeProduction
: ^('->' nodeProduction)
| path
;
portAssignment
: ^(PORT_ASSIGNMENT path)
;
// Port Statement
portDecl
: ^(PORT_DECL ^(PORT_TAG portType) portName mfName)
;
portDeclInit
: ^(PORT_INIT ^(PORT_TAG portType) portName mfCall)
;
portstmt
: (portDecl | portDeclInit)
;
portName
: IDENT
;
portType returns
: 'port' | 'eval'
;
mfCall
: ^(mfName mfparams)
;
mfName
: IDENT
;
mfparams
: (exps=expression)+
;
// Path
path
: ^(PATH (id=IDENT)+ (pathIndex)? )
;
pathIndex
: ^(PORT_INDEX portIndex)
;
portIndex
: ( NUMBER
|STRING_LITERAL
)
;
// Expressions
expression
: ^('+' op1=expression op2=expression)
| ^('-' op1=expression op2=expression)
| ^('*' op1=expression op2=expression)
| ^('/' op1=expression op2=expression)
| ^('%' op1=expression op2=expression)
| ^('|' op1=expression op2=expression)
| NUMBER
| path
;
In your tree grammar, here is the declaration of rule nodeInternal:
nodeInternal
: (nodeProduction
|portAssignment
|portstmt
|nodestmt
|sNode)+
;
And here is your declaration of rule subjunctDeclNodeProd:
subjunctDeclNodeProd
: ^(SUBJ_NODE_PROD IDENT IDENT nodeInternal+ )
;
When subjunctDeclNodeProd is being processed, ANTLR doesn't know how to process an input such as this, with two PATH children:
^(SUBJ_NODE_PROD IDENT IDENT ^(PATH IDENT) ^(PATH IDENT))
Should it follow rule nodeInternal once and process nodeProduction, nodeProduction or should it follow nodeInternal twice and process nodeProduction each time?
Consider rewriting subjunctDeclNodeProd without the +:
subjunctDeclNodeProd
: ^(SUBJ_NODE_PROD IDENT IDENT nodeInternal)
;
I think that will take care of the problem.