Simple ANTLR grammar not disambiguating identifiers and literals - antlr

I'm trying to match this syntax:
P = 100
require P
credit account:subaccount P
The first is an assignment. The second is a "check" that P is truthy. The third is an instruction to move 100 to account:subaccount. The problem is that the grammar I've written thinks the third line is just an assignment with a missing equal sign. I can't see why.
program: (stmt NEWLINE)+;
stmt: require | entry;
require: 'require' filtrex;
entry: (CREDIT | DEBIT) JOURNAL filtrex (IF filtrex)? (LPARENHASH EXTID RPAREN)?;
assign: ID EQ filtrex;
filtrex: math;
math
: math (TIMES | DIV) math
| math (PLUS | MINUS) math
| LPAREN math RPAREN
| (PLUS | MINUS)* atom
;
atom: NUMBER
| ID
;
NUMBER
: ('0' .. '9') + ('.' ('0' .. '9') +)?
;
fragment SIGN
: ('+' | '-')
;
ID: [a-zA-Z]+[0-9a-zA-Z]*;
EQ: '=';
JOURNAL: [a-zA-Z:]+;
EXTID: [a-zA-Z0-9-]+;
COLON: ':';
CREDIT: 'credit';
DEBIT: 'debit';
IF: 'if';
NEWLINE : [\r\n];
NUM : [0-9.]+;
LPAREN: '(';
RPAREN: ')';
LPARENHASH: '(#';
PLUS: '+';
MINUS: '-';
TIMES: '*';
DIV: '/' ;
POINT: '.';
WS: [ \r\n\t] + -> skip;
UPDATE
Thanks to the suggestions below I have something that seems to work properly. Now to the implementation of the logic...
grammar Txl;
// High level language
program: stmt (NEWLINE stmt)* NEWLINE? EOF;
stmt: require | entry | assignment;
require: 'require' expr;
entry: (CREDIT | DEBIT) journal expr (IF expr)? (LPAREN 'id:' EXTID RPAREN)?;
assignment: IDENT ASSIGN expr;
journal: IDENT COLON IDENT;
expr: expr MULT expr
| expr DIV expr
| expr PLUS expr
| expr MINUS expr
| expr MOD expr
| expr POW expr
| MINUS expr
| expr AND expr
| expr OR expr
| NOT expr
| expr EQ expr
| expr NEQ expr
| expr LTE expr
| expr LT expr
| expr GTE expr
| expr GT expr
| expr QUESTION expr COLON expr
| LPAREN expr RPAREN
| NUMBER
| IDENT LPAREN args RPAREN
| IDENT
;
fnArg: expr | journal;
args: fnArg
| fnArg COMMA fnArg
|
;
// Reserved words
CREDIT: 'credit';
DEBIT: 'debit';
IF: 'if';
REQUIRE: 'require';
// Operators
MULT: '*';
DIV: '/';
MINUS: '-';
PLUS: '+';
POW: '^';
MOD: '%';
LPAREN: '(';
RPAREN: ')';
LBRACE: '[';
RBRACE: ']';
COMMA: ',';
EQ: '==';
NEQ: '!=';
GTE: '>=';
LTE: '<=';
GT: '>';
LT: '<';
ASSIGN: '=';
QUESTION: '?';
COLON: ':';
AND: 'and';
OR: 'or';
NOT: 'not';
HASH: '#';
NEWLINE : [\r\n];
WS: [ \t] + -> skip;
// Entities
NUMBER: ('0' .. '9') + ('.' ('0' .. '9') +)?;
IDENT: [a-zA-Z]+[0-9a-zA-Z]*;
EXTID: [a-zA-Z0-9-]+;

That is because the input credit is not being matched by your CREDIT rule, but by the ID rule. The lexer always tries to match as many characters as possible. So, the input credit can be matched by: ID, JOURNAL, EXTID and CREDIT. Whenever it happens that multiple rules can match the same characters, the one defined first "wins" (ID in this case). The lexer does not "listen" to what the parser is trying to match, it operates independently from the parser.
Note that the EXTID also causes the input - to be matched by it, causing the MINUS rule to never be matched.
The solution: place your keywords before the ID rule inside the grammar:
CREDIT : 'credit';
DEBIT : 'debit';
REQUIRE : 'require';
ID : [a-zA-Z]+ [0-9a-zA-Z]*;
And, if possible, I'd also remove the JOURNAL and EXTID lexer rules and try to "promote" them to parser rules:
journal
: ID COLON ID
;
extid
: ID (MINUS ID)*
;
NUMBER and NUM can also match the same, while NUM also matches input like 1........2.......22222.... I'd remove the NUM rule and only keep NUMBER.
Remove the \r\n part from WS: [ \r\n\t] + -> skip; since these are already matched by your NEWLINE rule.
By doing (stmt NEWLINE)+, every stmt must end with a new line (also the last one). This could be a better solution: stmt (NEWLINE stmt)* NEWLINE?.
The grammar could look like this:
program
: stmt (NEWLINE stmt)* NEWLINE? EOF
;
stmt
: require
| entry
| assign
;
require
: REQUIRE filtrex
;
entry
: (CREDIT | DEBIT) journal filtrex (IF filtrex)? (LPARENHASH extid RPAREN)?
;
assign
: ID EQ filtrex
;
journal
: ID COLON ID
;
extid
: ID (MINUS ID)*
;
filtrex
: math
;
math
: math (TIMES | DIV) math
| math (PLUS | MINUS) math
| LPAREN math RPAREN
| (PLUS | MINUS)* atom
;
atom
: NUMBER
| ID
;
NUMBER : [0-9]+ ('.' [0-9]+)?;
CREDIT : 'credit';
DEBIT : 'debit';
REQUIRE : 'require';
IF : 'if';
ID : [a-zA-Z]+ [0-9a-zA-Z]*;
EQ : '=';
COLON : ':';
NEWLINE : [\r\n];
LPAREN : '(';
RPAREN : ')';
LPARENHASH : '(#';
PLUS : '+';
MINUS : '-';
TIMES : '*';
DIV : '/' ;
POINT : '.';
WS : [ \t] + -> skip;
which will parse your example input like this:

Related

grammar does not separate '123 and ] though the rule is set for it

I am new to antlr. I am trying to parse some queries like [network-traffic:src_port = '123] and [network-traffic:src_port =] and [network-traffic:src_port = ] and ... I have a grammar as follows:
grammar STIXPattern;
pattern
: observationExpressions EOF
;
observationExpressions
: <assoc=left> observationExpressions FOLLOWEDBY observationExpressions #observationExpressionsFollowedBY
| observationExpressionOr #observationExpressionOr_
;
observationExpressionOr
: <assoc=left> observationExpressionOr OR observationExpressionOr #observationExpressionOred
| observationExpressionAnd #observationExpressionAnd_
;
observationExpressionAnd
: <assoc=left> observationExpressionAnd AND observationExpressionAnd #observationExpressionAnded
| observationExpression #observationExpression_
;
observationExpression
: LBRACK comparisonExpression RBRACK # observationExpressionSimple
| LPAREN observationExpressions RPAREN # observationExpressionCompound
| observationExpression startStopQualifier # observationExpressionStartStop
| observationExpression withinQualifier # observationExpressionWithin
| observationExpression repeatedQualifier # observationExpressionRepeated
;
comparisonExpression
: <assoc=left> comparisonExpression OR comparisonExpression #comparisonExpressionOred
| comparisonExpressionAnd #comparisonExpressionAnd_
;
comparisonExpressionAnd
: <assoc=left> comparisonExpressionAnd AND comparisonExpressionAnd #comparisonExpressionAnded
| propTest #comparisonExpressionAndpropTest
;
propTest
: objectPath NOT? (EQ|NEQ) primitiveLiteral # propTestEqual
| objectPath NOT? (GT|LT|GE|LE) orderableLiteral # propTestOrder
| objectPath NOT? IN setLiteral # propTestSet
| objectPath NOT? LIKE StringLiteral # propTestLike
| objectPath NOT? MATCHES StringLiteral # propTestRegex
| objectPath NOT? ISSUBSET StringLiteral # propTestIsSubset
| objectPath NOT? ISSUPERSET StringLiteral # propTestIsSuperset
| LPAREN comparisonExpression RPAREN # propTestParen
| objectPath NOT? (EQ|NEQ) objectPathThl # propTestThlEqual
;
startStopQualifier
: START TimestampLiteral STOP TimestampLiteral
;
withinQualifier
: WITHIN (IntPosLiteral|FloatPosLiteral) SECONDS
;
repeatedQualifier
: REPEATS IntPosLiteral TIMES
;
objectPath
: objectType COLON firstPathComponent objectPathComponent?
;
objectPathThl
: varThlType DOT firstPathComponent objectPathComponent?
;
objectType
: IdentifierWithoutHyphen
| IdentifierWithHyphen
;
varThlType
: IdentifierWithoutHyphen
| IdentifierWithHyphen
;
firstPathComponent
: IdentifierWithoutHyphen
| StringLiteral
;
objectPathComponent
: <assoc=left> objectPathComponent objectPathComponent # pathStep
| '.' (IdentifierWithoutHyphen | StringLiteral) # keyPathStep
| LBRACK (IntPosLiteral|IntNegLiteral|ASTERISK) RBRACK # indexPathStep
;
setLiteral
: LPAREN RPAREN
| LPAREN primitiveLiteral (COMMA primitiveLiteral)* RPAREN
;
primitiveLiteral
: orderableLiteral
| BoolLiteral
| edgeCases
;
edgeCases
: QUOTE (IdentifierWithHyphen | IdentifierWithoutHyphen | IntNoSign) RBRACK
| RBRACK
;
orderableLiteral
: IntPosLiteral
| IntNegLiteral
| FloatPosLiteral
| FloatNegLiteral
| StringLiteral
| BinaryLiteral
| HexLiteral
| TimestampLiteral
;
IntNegLiteral :
'-' ('0' | [1-9] [0-9]*)
;
IntNoSign :
('0' | [1-9] [0-9]*)
;
IntPosLiteral :
'+'? ('0' | [1-9] [0-9]*)
;
FloatNegLiteral :
'-' [0-9]* '.' [0-9]+
;
FloatPosLiteral :
'+'? [0-9]* '.' [0-9]+
;
HexLiteral :
'h' QUOTE TwoHexDigits* QUOTE
;
BinaryLiteral :
'b' QUOTE
( Base64Char Base64Char Base64Char Base64Char )*
( (Base64Char Base64Char Base64Char Base64Char )
| (Base64Char Base64Char Base64Char ) '='
| (Base64Char Base64Char ) '=='
)
QUOTE
;
StringLiteral :
QUOTE ( ~['\\] | '\\\'' | '\\\\' )* QUOTE
;
BoolLiteral :
TRUE | FALSE
;
TimestampLiteral :
't' QUOTE
[0-9] [0-9] [0-9] [0-9] HYPHEN
( ('0' [1-9]) | ('1' [012]) ) HYPHEN
( ('0' [1-9]) | ([12] [0-9]) | ('3' [01]) )
'T'
( ([01] [0-9]) | ('2' [0-3]) ) COLON
[0-5] [0-9] COLON
([0-5] [0-9] | '60')
(DOT [0-9]+)?
'Z'
QUOTE
;
//////////////////////////////////////////////
// Keywords
AND: 'AND' ;
OR: 'OR' ;
NOT: 'NOT' ;
FOLLOWEDBY: 'FOLLOWEDBY';
LIKE: 'LIKE' ;
MATCHES: 'MATCHES' ;
ISSUPERSET: 'ISSUPERSET' ;
ISSUBSET: 'ISSUBSET' ;
LAST: 'LAST' ;
IN: 'IN' ;
START: 'START' ;
STOP: 'STOP' ;
SECONDS: 'SECONDS' ;
TRUE: 'true' ;
FALSE: 'false' ;
WITHIN: 'WITHIN' ;
REPEATS: 'REPEATS' ;
TIMES: 'TIMES' ;
// After keywords, so the lexer doesn't tokenize them as identifiers.
// Object types may have unquoted hyphens, but property names
// (in object paths) cannot.
IdentifierWithoutHyphen :
[a-zA-Z_] [a-zA-Z0-9_]*
;
IdentifierWithHyphen :
[a-zA-Z_] [a-zA-Z0-9_-]*
;
EQ : '=' | '==';
NEQ : '!=' | '<>';
LT : '<';
LE : '<=';
GT : '>';
GE : '>=';
QUOTE : '\'';
COLON : ':' ;
DOT : '.' ;
COMMA : ',' ;
RPAREN : ')' ;
LPAREN : '(' ;
RBRACK : ']' ;
LBRACK : '[' ;
PLUS : '+' ;
HYPHEN : MINUS ;
MINUS : '-' ;
POWER_OP : '^' ;
DIVIDE : '/' ;
ASTERISK : '*';
EQRBRAC : ']';
fragment HexDigit: [A-Fa-f0-9];
fragment TwoHexDigits: HexDigit HexDigit;
fragment Base64Char: [A-Za-z0-9+/];
// Whitespace and comments
//
WS : [ \t\r\n\u000B\u000C\u0085\u00a0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]+ -> skip
;
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
// Catch-all to prevent lexer from silently eating unusable characters.
InvalidCharacter
: .
;
Now when I feed [network-traffic:src_port = '123] I expect antlr to parse the query to '123 and ]
However the grammar return '123] and is not able to separate '123 and ]
Am missing anything?
grammar does not separate '123 and ] though the rule is set for it
That is not true. The quote and 123 are separate tokens. As demonstrated/suggested in your previous ANTLR question: start by printing all the tokens to your console to see what tokens are being created. This should always be the first thing you do when trying to debug an ANTLR grammar. It will save you a lot of time and headache.
The fact [network-traffic:src_port = '123] is not parsed properly, is because the ](RBRACK) is being consumed by the alternative observationExpressionSimple:
observationExpression
: LBRACK comparisonExpression RBRACK # observationExpressionSimple
| LPAREN observationExpressions RPAREN # observationExpressionCompound
| observationExpression startStopQualifier # observationExpressionStartStop
| observationExpression withinQualifier # observationExpressionWithin
| observationExpression repeatedQualifier # observationExpressionRepeated
;
Because RBRACK was already consumed by a parser rule, the edgeCases rule can't consume this RBRACK token as well.
To fix this, change your rule:
edgeCases
: QUOTE (IdentifierWithHyphen | IdentifierWithoutHyphen | IntNoSign) RBRACK
| RBRACK
;
into this:
edgeCases
: QUOTE (IdentifierWithHyphen | IdentifierWithoutHyphen | IntNoSign)
;
Now [network-traffic:src_port = '123] will be parsed properly:

Why parse failing after upgrading from Antlr 3 to Antlr 4?

Recently I am trying to upgrade my project from Antlr3 to Antlr4. But after making change in the grammar file, it seems the equations that worked previously is no longer working. I am new to Antlr4 so unable to understand whether my change broke something or not.
Here is my original grammar file:
grammar equation;
options {
language=CSharp2;
output=AST;
ASTLabelType=CommonTree;
}
tokens {
VARIABLE;
CONSTANT;
EXPR;
PAREXPR;
EQUATION;
UNARYEXPR;
FUNCTION;
BINARYOP;
LIST;
}
equationset: equation* EOF!;
equation: variable ASSIGN expression -> ^(EQUATION variable expression)
;
parExpression
: LPAREN expression RPAREN -> ^(PAREXPR expression)
;
expression
: conditionalexpression -> ^(EXPR conditionalexpression)
;
conditionalexpression
: orExpression
;
orExpression
: andExpression ( OR^ andExpression )*
;
andExpression
: comparisonExpression ( AND^ comparisonExpression )*;
comparisonExpression:
additiveExpression ((EQ^ | NE^ | LTE^ | GTE^ | LT^ | GT^) additiveExpression)*;
additiveExpression
: multiplicativeExpression ( (PLUS^ | MINUS^) multiplicativeExpression )*
;
multiplicativeExpression
: unaryExpression ( ( TIMES^ | DIVIDE^) unaryExpression )*
;
unaryExpression
: NOT unaryExpression -> ^(UNARYEXPR NOT unaryExpression)
| MINUS unaryExpression -> ^(UNARYEXPR MINUS unaryExpression)
| exponentexpression;
exponentexpression
: primary (CARET^ primary)*;
primary : parExpression | constant | booleantok | variable | function;
numeric: INTEGER | REAL;
constant: STRING -> ^(CONSTANT STRING) | numeric -> ^(CONSTANT numeric);
booleantok : BOOLEAN -> ^(BOOLEAN);
scopedidentifier
: (IDENTIFIER DOT)* IDENTIFIER -> IDENTIFIER+;
function
: scopedidentifier LPAREN argumentlist RPAREN -> ^(FUNCTION scopedidentifier argumentlist);
variable: scopedidentifier -> ^(VARIABLE scopedidentifier);
argumentlist: (expression) ? (COMMA! expression)*;
WS : (' '|'\r'|'\n'|'\t')+ {$channel=HIDDEN;};
COMMENT : '/*' .* '*/' {$channel=HIDDEN;};
LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;};
STRING: (('\"') ( (~('\"')) )* ('\"'))+;
fragment ALPHA: 'a'..'z'|'_';
fragment DIGIT: '0'..'9';
fragment ALNUM: ALPHA|DIGIT;
EQ : '==';
ASSIGN : '=';
NE : '!=' | '<>';
OR : 'or' | '||';
AND : 'and' | '&&';
NOT : '!'|'not';
LTE : '<=';
GTE : '>=';
LT : '<';
GT : '>';
TIMES : '*';
DIVIDE : '/';
BOOLEAN : 'true' | 'false';
IDENTIFIER: ALPHA (ALNUM)* | ('[' (~(']'))+ ']') ;
REAL: DIGIT* DOT DIGIT+ ('e' (PLUS | MINUS)? DIGIT+)?;
INTEGER: DIGIT+;
PLUS : '+';
MINUS : '-';
COMMA : ',';
RPAREN : ')';
LPAREN : '(';
DOT : '.';
CARET : '^';
And here is what I have after my changes:
grammar equation;
options {
}
tokens {
VARIABLE;
CONSTANT;
EXPR;
PAREXPR;
EQUATION;
UNARYEXPR;
FUNCTION;
BINARYOP;
LIST;
}
equationset: equation* EOF;
equation: variable ASSIGN expression
;
parExpression
: LPAREN expression RPAREN
;
expression
: conditionalexpression
;
conditionalexpression
: orExpression
;
orExpression
: andExpression ( OR andExpression )*
;
andExpression
: comparisonExpression ( AND comparisonExpression )*;
comparisonExpression:
additiveExpression ((EQ | NE | LTE | GTE | LT | GT) additiveExpression)*;
additiveExpression
: multiplicativeExpression ( (PLUS | MINUS) multiplicativeExpression )*
;
multiplicativeExpression
: unaryExpression ( ( TIMES | DIVIDE) unaryExpression )*
;
unaryExpression
: NOT unaryExpression
| MINUS unaryExpression
| exponentexpression;
exponentexpression
: primary (CARET primary)*;
primary : parExpression | constant | booleantok | variable | function;
numeric: INTEGER | REAL;
constant: STRING | numeric;
booleantok : BOOLEAN;
scopedidentifier
: (IDENTIFIER DOT)* IDENTIFIER;
function
: scopedidentifier LPAREN argumentlist RPAREN;
variable: scopedidentifier;
argumentlist: (expression) ? (COMMA expression)*;
WS : (' '|'\r'|'\n'|'\t')+ ->channel(HIDDEN);
COMMENT : '/*' .* '*/' ->channel(HIDDEN);
LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? '\n' ->channel(HIDDEN);
STRING: (('\"') ( (~('\"')) )* ('\"'))+;
fragment ALPHA: 'a'..'z'|'_';
fragment DIGIT: '0'..'9';
fragment ALNUM: ALPHA|DIGIT;
EQ : '==';
ASSIGN : '=';
NE : '!=' | '<>';
OR : 'or' | '||';
AND : 'and' | '&&';
NOT : '!'|'not';
LTE : '<=';
GTE : '>=';
LT : '<';
GT : '>';
TIMES : '*';
DIVIDE : '/';
BOOLEAN : 'true' | 'false';
IDENTIFIER: ALPHA (ALNUM)* | ('[' (~(']'))+ ']') ;
REAL: DIGIT* DOT DIGIT+ ('e' (PLUS | MINUS)? DIGIT+)?;
INTEGER: DIGIT+;
PLUS : '+';
MINUS : '-';
COMMA : ',';
RPAREN : ')';
LPAREN : '(';
DOT : '.';
CARET : '^';
A sample equation that I am trying to parse (which was working OK before) is:
[a].[b] = 1.76 * [Product_DC].[PDC_Inbound_Pallets] * if(product_dc.[PDC_DC] =="US84",1,0)
Thanks in advance.
Tokens should be listed with comma , not semicolon ;. See also Token Section paragraph in official doc.
Since ANTLR 4.7 backslash is not required for double quote escaping. STRING: (('\"') ( (~('\"')) )* ('\"'))+; should be rewritten to STRING: ('"' ~'"'* '"')+;.
You missed question mark in multiline comment token for non-greedy matching: '/*' .* '*/' -> '/*' .*? '*/'.
So, the fixed grammar looks like this:
grammar equation;
options {
}
tokens {
VARIABLE,
CONSTANT,
EXPR,
PAREXPR,
EQUATION,
UNARYEXPR,
FUNCTION,
BINARYOP,
LIST
}
equationset: equation* EOF;
equation: variable ASSIGN expression
;
parExpression
: LPAREN expression RPAREN
;
expression
: conditionalexpression
;
conditionalexpression
: orExpression
;
orExpression
: andExpression ( OR andExpression )*
;
andExpression
: comparisonExpression ( AND comparisonExpression )*;
comparisonExpression:
additiveExpression ((EQ | NE | LTE | GTE | LT | GT) additiveExpression)*;
additiveExpression
: multiplicativeExpression ( (PLUS | MINUS) multiplicativeExpression )*
;
multiplicativeExpression
: unaryExpression ( ( TIMES | DIVIDE) unaryExpression )*
;
unaryExpression
: NOT unaryExpression
| MINUS unaryExpression
| exponentexpression;
exponentexpression
: primary (CARET primary)*;
primary : parExpression | constant | booleantok | variable | function;
numeric: INTEGER | REAL;
constant: STRING | numeric;
booleantok : BOOLEAN;
scopedidentifier
: (IDENTIFIER DOT)* IDENTIFIER;
function
: scopedidentifier LPAREN argumentlist RPAREN;
variable: scopedidentifier;
argumentlist: (expression) ? (COMMA expression)*;
WS : (' '|'\r'|'\n'|'\t')+ ->channel(HIDDEN);
COMMENT : '/*' .*? '*/' -> channel(HIDDEN);
LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? '\n' ->channel(HIDDEN);
STRING: ('"' ~'"'* '"')+;
fragment ALPHA: 'a'..'z'|'_';
fragment DIGIT: '0'..'9';
fragment ALNUM: ALPHA|DIGIT;
EQ : '==';
ASSIGN : '=';
NE : '!=' | '<>';
OR : 'or' | '||';
AND : 'and' | '&&';
NOT : '!'|'not';
LTE : '<=';
GTE : '>=';
LT : '<';
GT : '>';
TIMES : '*';
DIVIDE : '/';
BOOLEAN : 'true' | 'false';
IDENTIFIER: ALPHA (ALNUM)* | ('[' (~(']'))+ ']') ;
REAL: DIGIT* DOT DIGIT+ ('e' (PLUS | MINUS)? DIGIT+)?;
INTEGER: DIGIT+;
PLUS : '+';
MINUS : '-';
COMMA : ',';
RPAREN : ')';
LPAREN : '(';
DOT : '.';
CARET : '^';

"The following sets of rules are mutually left-recursive"

I have tried to write a grammar to recognize expressions like:
(A + MAX(B) ) / ( C - AVERAGE(A) )
IF( A > AVERAGE(A), 0, 1 )
X / (MAX(X)
Unfortunately antlr3 fails with these errors:
error(210): The following sets of rules are mutually left-recursive [unaryExpression, additiveExpression, primaryExpression, formula, multiplicativeExpression]
error(211): DerivedKeywords.g:110:13: [fatal] rule booleanTerm has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
error(206): DerivedKeywords.g:110:13: Alternative 1: after matching input such as decision cannot predict what comes next due to recursion overflow to additiveExpression from formula
I have spent some hours trying to fix these, it would be great if anyone could at least help me fix the first problem. Thanks
Code:
grammar DerivedKeywords;
options {
output=AST;
//backtrack=true;
}
WS : ( ' ' | '\t' | '\n' | '\r' )
{ skip(); }
;
//for numbers
DIGIT
: '0'..'9'
;
//for both integer and real number
NUMBER
: (DIGIT)+ ( '.' (DIGIT)+ )?( ('E'|'e')('+'|'-')?(DIGIT)+ )?
;
// Boolean operatos
AND : 'AND';
OR : 'OR';
NOT : 'NOT';
EQ : '=';
NEQ : '!=';
GT : '>';
LT : '<';
GTE : '>=';
LTE : '<=';
COMMA : ',';
// Token for Functions
IF : 'IF';
MAX : 'MAX';
MIN : 'MIN';
AVERAGE : 'AVERAGE';
VARIABLE : 'A'..'Z' ('A'..'Z' | '0'..'9')*
;
// OPERATORS
LPAREN : '(' ;
RPAREN : ')' ;
DIV : '/' ;
PLUS : '+' ;
MINUS : '-' ;
STAR : '*' ;
expression : formula;
formula
: functionExpression
| additiveExpression
| LPAREN! a=formula RPAREN! // First Problem
;
additiveExpression
: a=multiplicativeExpression ( (MINUS^ | PLUS^ ) b=multiplicativeExpression )*
;
multiplicativeExpression
: a=unaryExpression ( (STAR^ | DIV^ ) b=unaryExpression )*
;
unaryExpression
: MINUS^ u=unaryExpression
| primaryExpression
;
functionExpression
: f=functionOperator LPAREN e=formula RPAREN
| IF LPAREN b=booleanExpression COMMA p=formula COMMA s=formula RPAREN
;
functionOperator :
MAX | MIN | AVERAGE;
primaryExpression
: NUMBER
// Used for scientific numbers
| DIGIT
| VARIABLE
| formula
;
// Boolean stuff
booleanExpression
: orExpression;
orExpression : a=andExpression (OR^ b=andExpression )*
;
andExpression
: a=notExpression (AND^ b=notExpression )*
;
notExpression
: NOT^ t=booleanTerm
| booleanTerm
;
booleanOperator :
GT | LT | EQ | GTE | LTE | NEQ;
booleanTerm : a=formula op=booleanOperator b=formula
| LPAREN! booleanTerm RPAREN! // Second problem
;
error(210): The following sets of rules are mutually left-recursive [unaryExpression, additiveExpression, primaryExpression, formula, multiplicativeExpression]
- this means that if the parser enters unaryExpression rule, it has the possibility to match additiveExpression, primaryExpression, formula, multiplicativeExpression and unaryExpression again without ever consuming a single token from input - so it cannot decide whether to use those rules or not, because even if it uses the rules, the input will be the same.
You're probably trying to allow subexpressions in expressions by this sequence of rules - you need to make sure that path will consume the left parenthesis of the subexpression. Probably the formula alternative in primaryExpression should be changed to LPAREN formula RPAREN, and the rest of grammar be adjusted accordingly.

Antlr 4 whitespace in string been eliminated

I'm using Antlr 4 to build a compiler for a made up language. I'm having problems with eliminating whitespace properly. It will get rid of whitespace between tokens but it also delete whitespace within the string token which is obviously not what I want. I've tried using modes to clear this issue up with no avail.
Lexer.g4
lexer grammar WaccLexer;
SEMICOLON: ';' ;
WS: [ \n\t\r\u000C]+ -> skip;
EOL: '\n' ;
BEGIN: 'begin' ;
END: 'end' ;
SKIP: 'skip' ;
READ: 'read' ;
FREE: 'free' ;
RETURN: 'return' ;
EXIT: 'exit' ;
IS: 'is' ;
PRINT: 'print' ;
PRINTLN: 'println' ;
IF: 'if' ;
THEN: 'then' ;
ELSE: 'else' ;
FI: 'fi' ;
WHILE: 'while' ;
DO: 'do' ;
DONE: 'done' ;
NEWPAIR: 'newpair' ;
CALL: 'call' ;
FST: 'fst' ;
SND: 'snd' ;
INT: 'int' ;
BOOL: 'bool' ;
CHAR: 'char' ;
STRING: 'string' ;
PAIR: 'pair' ;
EXCLAMATION: '!' ;
LEN: 'len' ;
ORD: 'ord' ;
TOINT: 'toInt' ;
DIGIT: '0'..'9' ;
LOWCHAR: 'a'..'z' ;
R: 'r' ;
F: 'f' ;
N: 'n' ;
T: 't' ;
B: 'b' ;
ZERO: '0' ;
MULTI: '*' ;
DIVIDE: '/' ;
MOD: '%' ;
PLUS: '+' ;
MINUS: '-' ;
GT: '>' ;
GTE: '>=' ;
LT: '<' ;
LTE: '<=' ;
DOUBLEEQUAL: '==' ;
EQUAL: '=' ;
NOTEQUAL: '!=' ;
AND: '&&' ;
OR: '||' ;
UNDERSCORE: '_' ;
UPCHAR: 'A'..'Z' ;
OPENSQUARE: '[' ;
CLOSESQUARE: ']' ;
OPENPARENTHESIS: '(' ;
CLOSEPARENTHESIS: ')' ;
TRUE: 'true' ;
FALSE: 'false' ;
SINGLEQUOT: '\'' ;
DOUBLEQUOT: '\"' ;
BACKSLASH: '\\' ;
COMMA: ',' ;
NULL: 'null' ;
OPENSTRING : DOUBLEQUOT -> pushMode(STRINGMODE) ;
COMMENT: '#' ~[\r\n]* '\r'? '\n' -> skip ;
mode STRINGMODE ;
CLOSESTRING : DOUBLEQUOT -> popMode ;
CHARACTER : ~[\"\'\\] | (BACKSLASH ESCAPEDCHAR) ;
STRLIT : (CHARACTER)* ;
ESCAPEDCHAR : ZERO
| B
| T
| N
| F
| R
| DOUBLEQUOT
| SINGLEQUOT
| BACKSLASH
;
Parser.g4
parser grammar WaccParser;
options {
tokenVocab=WaccLexer;
}
program : BEGIN (func)* stat END EOF;
func : type ident OPENPARENTHESIS (paramlist)? CLOSEPARENTHESIS IS stat END ;
paramlist : param (COMMA param)* ;
param : type ident ;
stat : SKIP
| type ident EQUAL assignrhs
| assignlhs EQUAL assignrhs
| READ assignlhs
| FREE expr
| RETURN expr
| EXIT expr
| PRINT expr
| PRINTLN expr
| IF expr THEN stat ELSE stat FI
| WHILE expr DO stat DONE
| BEGIN stat END
| stat SEMICOLON stat
;
assignlhs : ident
| expr OPENSQUARE expr CLOSESQUARE
| pairelem
;
assignrhs : expr
| arrayliter
| NEWPAIR OPENPARENTHESIS expr COMMA expr CLOSEPARENTHESIS
| pairelem
| CALL ident OPENPARENTHESIS (arglist)? CLOSEPARENTHESIS
;
arglist : expr (COMMA expr)* ;
pairelem : FST expr
| SND expr
;
type : basetype
| type OPENSQUARE CLOSESQUARE
| pairtype
;
basetype : INT
| BOOL
| CHAR
| STRING
;
pairtype : PAIR OPENPARENTHESIS pairelemtype COMMA pairelemtype CLOSEPARENTHESIS ;
pairelemtype : basetype
| type OPENSQUARE CLOSESQUARE
| PAIR
;
expr : intliter
| boolliter
| charliter
| strliter
| pairliter
| ident
| expr OPENSQUARE expr CLOSESQUARE
| unaryoper expr
| expr binaryoper expr
| OPENPARENTHESIS expr CLOSEPARENTHESIS
;
unaryoper : EXCLAMATION
| MINUS
| LEN
| ORD
| TOINT
;
binaryoper : MULTI
| DIVIDE
| MOD
| PLUS
| MINUS nus
| GT
| GTE
| LT
| LTE
| DOUBLEEQUAL
| NOTEQUAL
| AND
| OR
;
ident : (UNDERSCORE | LOWCHAR | UPCHAR) (UNDERSCORE | LOWCHAR | UPCHAR | DIGIT)* ;
intliter : (intsign)? (digit)+ ;
digit : DIGIT ;
intsign : PLUS
| MINUS
;
boolliter : TRUE
| FALSE
;
charliter : CHARACTER;
strliter : OPENSTRING STRLIT CLOSESTRING;
arrayliter : OPENSQUARE (expr (COMMA expr)*)? CLOSESQUARE ;
Please also remember that comment starting with # need to be ignored. Thanks in advance.
The OPENSTRING lexer rule will never be matched in your grammar because the DOUBLEQUOT rule matches exactly the same input sequence and appears before it in the grammar. If you want to define a lexer rule, but you do not actually want that lexer rule to create a token on its own, then you need to define the rule with the fragment modifier.
fragment DOUBLEQUOT : '"';
In addition, you need to correct the warnings that appear when you generate code for your grammar. At least one of them (defined as EPSILON_TOKEN) indicates a major mistake that you made that used to be an error in ANTLR 4.0 but was changed to a warning in ANTLR 4.1 since there is an edge case where it can be used without problems.

How to have both function calls and parenthetical grouping without backtrack

Is there any way to specify a grammar which allows the following syntax:
f(x)(g, (1-(-2))*3, 1+2*3)[0]
which is transformed into (in pseudo-lisp to show order):
(index
((f x)
g
(* (- 1 -2) 3)
(+ (* 2 3) 1)
)
0
)
along with things like limited operator precedence etc.
The following grammar works with backtrack = true, but I'd like to avoid that:
grammar T;
options {
output=AST;
backtrack=true;
memoize=true;
}
tokens {
CALL;
INDEX;
LOOKUP;
}
prog: (expr '\n')* ;
expr : boolExpr;
boolExpr
: relExpr (boolop^ relExpr)?
;
relExpr
: addExpr (relop^ addExpr)?
| a=addExpr oa=relop b=addExpr ob=relop c=addExpr
-> ^(LAND ^($oa $a $b) ^($ob $b $c))
;
addExpr
: mulExpr (addop^ mulExpr)?
;
mulExpr
: atomExpr (mulop^ atomExpr)?
;
atomExpr
: INT
| ID
| OPAREN expr CPAREN -> expr
| call
;
call
: callable ( OPAREN (expr (COMMA expr)*)? CPAREN -> ^(CALL callable expr*)
| OBRACK expr CBRACK -> ^(INDEX callable expr)
| DOT ID -> ^(INDEX callable ID)
)
;
fragment
callable
: ID
| OPAREN expr CPAREN
;
fragment
boolop
: LAND | LOR
;
fragment
relop
: (EQ|GT|LT|GTE|LTE)
;
fragment
addop
: (PLUS|MINUS)
;
fragment
mulop
: (TIMES|DIVIDE)
;
EQ : '==' ;
GT : '>' ;
LT : '<' ;
GTE : '>=' ;
LTE : '<=' ;
LAND : '&&' ;
LOR : '||' ;
PLUS : '+' ;
MINUS : '-' ;
TIMES : '*' ;
DIVIDE : '/' ;
ID : ('a'..'z')+ ;
INT : '0'..'9' ;
OPAREN : '(' ;
CPAREN : ')' ;
OBRACK : '[' ;
CBRACK : ']' ;
DOT : '.' ;
COMMA : ',' ;
There are a couple of things wrong with your grammar:
1
Only lexer rules can be fragments, not parser rules. Some ANTLR targets simply ignore the fragment keyword in front of parser rules (like the Java target), but better just remove them from your grammar: if you decide to create a parser for a different target-language, you may run into problems because of it.
2
Without the backtrack=true, you cannot mix tree-rewrite operators (^ and !) and rewrite rules (->) because you need to create a single alternative inside relExpr instead of the two alternatives you now have (this is to eliminate an ambiguity).
In your case, you can't create the desired AST with just ^ (inside a single alternative), so you'll need to do it like this:
relExpr
: (a=addExpr -> $a) ( (oa=relOp b=addExpr -> ^($oa $a $b))
( ob=relOp c=addExpr -> ^(LAND ^($oa $a $b) ^($ob $b $c))
)?
)?
;
(yes, I know, it's not particularly pretty, but that can't be helped AFAIK)
Also, you can only put the LAND token in the rewrite rules if it is defined in the tokens { ... } block:
tokens {
// literal tokens
LAND='&&';
...
// imaginary tokens
CALL;
...
}
Otherwise you can only use tokens (and other parser rules) in rewrite rules if they really occur inside the parser rule itself.
3
You did not account for the unary minus in your grammar, implement it like this:
mulExpr
: unaryExpr ((TIMES | DIVIDE)^ unaryExpr)*
;
unaryExpr
: MINUS atomExpr -> ^(UNARY_MINUS atomExpr)
| atomExpr
;
Now, to create a grammar that does not need backtrack=true, remove the ID and '(' expr ')' from your atomExpr rule:
atomExpr
: INT
| call
;
and make everything passed callable optional inside your call rule:
call
: (callable -> callable) ( OPAREN params CPAREN -> ^(CALL $call params)
| OBRACK expr CBRACK -> ^(INDEX $call expr)
| DOT ID -> ^(INDEX $call ID)
)*
;
That way, ID and '(' expr ')' are already matched by call (and there's no ambiguity).
Taken all the remarks above into account, you could get the following grammar:
grammar T;
options {
output=AST;
}
tokens {
// literal tokens
EQ = '==' ;
GT = '>' ;
LT = '<' ;
GTE = '>=' ;
LTE = '<=' ;
LAND = '&&' ;
LOR = '||' ;
PLUS = '+' ;
MINUS = '-' ;
TIMES = '*' ;
DIVIDE = '/' ;
OPAREN = '(' ;
CPAREN = ')' ;
OBRACK = '[' ;
CBRACK = ']' ;
DOT = '.' ;
COMMA = ',' ;
// imaginary tokens
CALL;
INDEX;
LOOKUP;
UNARY_MINUS;
PARAMS;
}
prog
: expr EOF -> expr
;
expr
: boolExpr
;
boolExpr
: relExpr ((LAND | LOR)^ relExpr)?
;
relExpr
: (a=addExpr -> $a) ( (oa=relOp b=addExpr -> ^($oa $a $b))
( ob=relOp c=addExpr -> ^(LAND ^($oa $a $b) ^($ob $b $c))
)?
)?
;
addExpr
: mulExpr ((PLUS | MINUS)^ mulExpr)*
;
mulExpr
: unaryExpr ((TIMES | DIVIDE)^ unaryExpr)*
;
unaryExpr
: MINUS atomExpr -> ^(UNARY_MINUS atomExpr)
| atomExpr
;
atomExpr
: INT
| call
;
call
: (callable -> callable) ( OPAREN params CPAREN -> ^(CALL $call params)
| OBRACK expr CBRACK -> ^(INDEX $call expr)
| DOT ID -> ^(INDEX $call ID)
)*
;
callable
: ID
| OPAREN expr CPAREN -> expr
;
params
: (expr (COMMA expr)*)? -> ^(PARAMS expr*)
;
relOp
: EQ | GT | LT | GTE | LTE
;
ID : 'a'..'z'+ ;
INT : '0'..'9'+ ;
SPACE : (' ' | '\t') {skip();};
which would parse the input "a >= b < c" into the following AST:
and the input "f(x)(g, (1-(-2))*3, 1+2*3)[0]" as follows: