What's wrong with my ANTLR grammar file? - antlr
I defined the following grammar, following Scott Stanchfield's tutorial:
grammar SampleScript;
program
:
declaration+
;
declaration
: macrodeclaration
;
macrodeclaration
:
MACRO STRING (LEFTPAREN parameters RIGHTPAREN)?
statement*
ENDMACRO
;
statement
: assignmentStatement
| ifStatement
| iterationStatement
| jumpStatement
| procedureCallStatement
| dimStatement
| labeledStatement
;
actualParameters
: expression (',' expression?)*
;
parameters
: ID (',' ID)*
;
assignmentStatement
: ID ASSIGN expression
| ID MATRIXASSIGN expression
;
ifStatement
: IF expression THEN (statement|compoundStatement)
(ELSE expression (statement|compoundStatement))?
;
iterationStatement
: WHILE expression compoundStatement
| FOR ID '=' expression TO expression (STEP expression)? compoundStatement
;
jumpStatement
: BREAK
| CONTINUE
| GOTO ID
| RETURN LEFTPAREN expression RIGHTPAREN
;
procedureCallStatement //todo: expression statement
: ID LEFTPAREN actualParameters? RIGHTPAREN
;
dimStatement
: DIM ID LEFTBRACKET expression(',' expression)* RIGHTBRACKET (',' ID LEFTBRACKET expression(',' expression)* RIGHTBRACKET)*
;
labeledStatement
: ID ':' statement
;
compoundStatement
: DO statement* END
;
term
: NUMBER
| STRING
| ID
| LEFTPAREN expression RIGHTPAREN //( )
| ID LEFTPAREN actualParameters RIGHTPAREN //Procedure Call
| ID (LEFTBRACKET expression RIGHTBRACKET)+ //Array Arr[3]
| ID ('.' expression)+ //Array Arr.Length
| LEFTBRACE (expression)? (',' expression)* RIGHTBRACE //{"OK","False"}
;
negation
: 'not'* term
;
unary
: ('-')* negation
;
mult
: unary (('*' | '/') unary)*
;
add
: mult (('+' | '-') mult)*
;
relation
: add (('=' | '/=' | '<' | '<=' | '>=' | '>') add)*
;
expression
: relation (('and' | 'or') relation)*
;
//Keywords
DIM: D I M;
RETURN: R E T U R N;
FOR: F O R;
STEP: S T E P;
TO: T O;
WHILE: W H I L E;
DO: D O;
END: E N D;
GOTO: G O T O;
BREAK: B R E A K;
CONTINUE: C O N T I N U E;
IF: I F;
THEN: T H E N;
ELSE: E L S E;
MACRO :M A C R O;
ENDMACRO :E N D M A C R O;
ID : ('_'|LETTER) ('_'|LETTER|DIGIT)*;
ASSIGN: '=';
MATRIXASSIGN: ':=';
LEFTPAREN : '(';
RIGHTPAREN : ')';
LEFTBRACKET : '[';
RIGHTBRACKET : ']';
LEFTBRACE : '{';
RIGHTBRACE : '}';
//STRING : '"' .*? '"' ; // match anything in "..."
STRING
: '"' (STRING_ESCAPE_SEQ|~('\n'|'\r'))*? '"'
| '\'' (STRING_ESCAPE_SEQ|~('\n'|'\r'))*? '\''
;
/// stringescapeseq ::= "\" <any source character>
fragment STRING_ESCAPE_SEQ //'\\"'
: '\\' .
;
UNSIGNED_INT : DIGIT+; //('0' | '1'..'9' '0'..'9'*);
UNSIGNED_FLOAT: DIGIT+ '.' DIGIT* Exponent?
| '.' DIGIT+ Exponent?
| DIGIT+ Exponent
;
NUMBER
: UNSIGNED_INT
| UNSIGNED_FLOAT
;
fragment DIGIT : [0-9] ; // not a token by itself
fragment Exponent : ('e'|'E') ('+'|'-')? (DIGIT)+ ;
LINE_COMMENT : '//' .*? '\r'? '\n' -> skip ; // Match "//" stuff '\n'
COMMENT : '/*' .*? '*/' -> skip ; // Match "/*" stuff "*/"
fragment A:('a'|'A');
fragment B:('b'|'B');
fragment C:('c'|'C');
fragment D:('d'|'D');
fragment E:('e'|'E');
fragment F:('f'|'F');
fragment G:('g'|'G');
fragment H:('h'|'H');
fragment I:('i'|'I');
fragment J:('j'|'J');
fragment K:('k'|'K');
fragment L:('l'|'L');
fragment M:('m'|'M');
fragment N:('n'|'N');
fragment O:('o'|'O');
fragment P:('p'|'P');
fragment Q:('q'|'Q');
fragment R:('r'|'R');
fragment S:('s'|'S');
fragment T:('t'|'T');
fragment U:('u'|'U');
fragment V:('v'|'V');
fragment W:('w'|'W');
fragment X:('x'|'X');
fragment Y:('y'|'Y');
fragment Z:('z'|'Z');
fragment LETTER : [A-Za-z];
WS : [ \t\n\r]+ -> skip ; // skip spaces, tabs, newlines
I am trying to parse the following code:
Macro 'test' (x)
a=1
b=2
c={}
d = x(3,4)
matrixinfo_skim = GetMatrixInfo(m_skim)
showmessage (i2s(a))
showarray(c)
endmacro
and I get the errors below. I have spent over two days on it and cannot figure out why it fails to parse the assignment statements from a=1 onward. Can someone please help?
[#0,0:4='Macro',<30>,1:0]
[#1,6:11=''test'',<41>,1:6]
[#2,13:13='(',<35>,1:13]
[#3,14:14='x',<32>,1:14]
[#4,15:15=')',<36>,1:15]
[#5,20:20='a',<32>,3:0]
[#6,21:21='=',<33>,3:1]
[#7,22:22='1',<42>,3:2]
[#8,25:25='b',<32>,4:0]
[#9,26:26='=',<33>,4:1]
[#10,27:27='2',<42>,4:2]
[#11,30:30='c',<32>,5:0]
[#12,31:31='=',<33>,5:1]
[#13,32:32='{',<39>,5:2]
[#14,33:33='}',<40>,5:3]
[#15,36:36='d',<32>,6:0]
[#16,38:38='=',<33>,6:2]
[#17,40:40='x',<32>,6:4]
[#18,41:41='(',<35>,6:5]
[#19,42:42='3',<42>,6:6]
[#20,43:43=',',<2>,6:7]
[#21,44:44='4',<42>,6:8]
[#22,45:45=')',<36>,6:9]
[#23,48:62='matrixinfo_skim',<32>,7:0]
[#24,64:64='=',<33>,7:16]
[#25,66:78='GetMatrixInfo',<32>,7:18]
[#26,79:79='(',<35>,7:31]
[#27,80:85='m_skim',<32>,7:32]
[#28,86:86=')',<36>,7:38]
[#29,91:101='showmessage',<32>,9:0]
[#30,103:103='(',<35>,9:12]
[#31,104:106='i2s',<32>,9:13]
[#32,107:107='(',<35>,9:16]
[#33,108:108='a',<32>,9:17]
[#34,109:109=')',<36>,9:18]
[#35,110:110=')',<36>,9:19]
[#36,113:121='showarray',<32>,10:0]
[#37,122:122='(',<35>,10:9]
[#38,123:123='c',<32>,10:10]
[#39,124:124=')',<36>,10:11]
[#40,127:134='endmacro',<31>,11:0]
[#41,140:139='<EOF>',<-1>,13:0]
line 3:2 extraneous input '1' expecting {'-', 'not', ID, '(', '{', STRING, NUMBER}
line 4:2 extraneous input '2' expecting {'-', 'not', ID, '(', '{', STRING, NUMBER}
line 6:6 mismatched input '3' expecting {'-', 'not', ID, '(', '{', STRING, NUMBER}
line 6:8 extraneous input '4' expecting {',', ')'}
(program (declaration (macrodeclaration Macro 'test' ( (parameters x) ) (statement (assignmentStatement a = (expression (relation (add (mult (unary 1 (negation (term b))))) = (add (mult (unary 2 (negation (term c))))) = (add (mult (unary (negation (term { }))))))))) (statement (assignmentStatement d = (expression (relation (add (mult (unary (negation (term x ( (actualParameters (expression (relation (add (mult (unary 3))))) , 4) )))))))))) (statement (assignmentStatement matrixinfo_skim = (expression (relation (add (mult (unary (negation (term GetMatrixInfo ( (actualParameters (expression (relation (add (mult (unary (negation (term m_skim)))))))) )))))))))) (statement (procedureCallStatement showmessage ( (actualParameters (expression (relation (add (mult (unary (negation (term i2s ( (actualParameters (expression (relation (add (mult (unary (negation (term a)))))))) ))))))))) ))) (statement (procedureCallStatement showarray ( (actualParameters (expression (relation (add (mult (unary (negation (term c)))))))) ))) endmacro)))
As the error messages indicate, things go wrong with the numbers. They are matched by the expression in the assignmentStatement rule, and ultimately are (or should be) matched as a NUMBER in the term rule.
Looking at the lexer rules responsible for the creation of a NUMBER token:
UNSIGNED_INT : DIGIT+;
UNSIGNED_FLOAT: DIGIT+ '.' DIGIT* Exponent?
| '.' DIGIT+ Exponent?
| DIGIT+ Exponent
;
NUMBER
: UNSIGNED_INT
| UNSIGNED_FLOAT
;
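it appears that a NUMBER token is never created. NUMBER matches either an UNSIGNED_INT or an UNSIGNED_FLOAT, but those two are themselves token rules that are defined before NUMBER. When several lexer rules match the same input, the rule defined first wins, so the lexer creates UNSIGNED_INT and UNSIGNED_FLOAT tokens instead of NUMBER tokens. The token dump is consistent with this: '1' has type <42>, the type directly after STRING (<41>), which is where UNSIGNED_INT sits in the grammar.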
You need to change UNSIGNED_INT and UNSIGNED_FLOAT into fragment rules instead:
fragment UNSIGNED_INT : DIGIT+;
fragment UNSIGNED_FLOAT: DIGIT+ '.' DIGIT* Exponent?
| '.' DIGIT+ Exponent?
| DIGIT+ Exponent
;
NUMBER
: UNSIGNED_INT
| UNSIGNED_FLOAT
;
Be sure to understand what a fragment is: What does "fragment" mean in ANTLR?
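The effect of rule order can be reproduced outside ANTLR. The sketch below is not ANTLR code: it is a small Python model that assumes the lexer picks the longest match and, on a tie, the rule defined first. The helper first_token and the regular expressions are illustrative stand-ins for the lexer rules, not part of any ANTLR API.

```python
import re

def first_token(rules, text):
    """Return (rule_name, lexeme) for the token at the start of text.

    rules is an ordered list of (name, regex). The longest match wins;
    on a tie, the rule listed first wins.
    """
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# Regex stand-ins for the lexer rules in the question.
EXP = r"[eE][+-]?[0-9]+"
UNSIGNED_INT = r"[0-9]+"
UNSIGNED_FLOAT = rf"[0-9]+\.[0-9]*(?:{EXP})?|\.[0-9]+(?:{EXP})?|[0-9]+{EXP}"
NUMBER = rf"(?:{UNSIGNED_FLOAT})|(?:{UNSIGNED_INT})"

# Original grammar: all three are token rules, NUMBER is defined last.
as_tokens = [
    ("UNSIGNED_INT", UNSIGNED_INT),
    ("UNSIGNED_FLOAT", UNSIGNED_FLOAT),
    ("NUMBER", NUMBER),
]

# Fixed grammar: the two helpers are fragments, so only NUMBER emits tokens.
as_fragments = [("NUMBER", NUMBER)]

print(first_token(as_tokens, "1"))     # NUMBER never wins the tie
print(first_token(as_fragments, "1"))  # NUMBER is the only candidate
```

In this model NUMBER can never win against the two earlier rules, because any text it matches is matched with the same length by one of them. Turning them into fragments removes them from the competition.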
Related
Why parse failing after upgrading from Antlr 3 to Antlr 4?
Recently I am trying to upgrade my project from Antlr3 to Antlr4. But after making change in the grammar file, it seems the equations that worked previously is no longer working. I am new to Antlr4 so unable to understand whether my change broke something or not. Here is my original grammar file: grammar equation; options { language=CSharp2; output=AST; ASTLabelType=CommonTree; } tokens { VARIABLE; CONSTANT; EXPR; PAREXPR; EQUATION; UNARYEXPR; FUNCTION; BINARYOP; LIST; } equationset: equation* EOF!; equation: variable ASSIGN expression -> ^(EQUATION variable expression) ; parExpression : LPAREN expression RPAREN -> ^(PAREXPR expression) ; expression : conditionalexpression -> ^(EXPR conditionalexpression) ; conditionalexpression : orExpression ; orExpression : andExpression ( OR^ andExpression )* ; andExpression : comparisonExpression ( AND^ comparisonExpression )*; comparisonExpression: additiveExpression ((EQ^ | NE^ | LTE^ | GTE^ | LT^ | GT^) additiveExpression)*; additiveExpression : multiplicativeExpression ( (PLUS^ | MINUS^) multiplicativeExpression )* ; multiplicativeExpression : unaryExpression ( ( TIMES^ | DIVIDE^) unaryExpression )* ; unaryExpression : NOT unaryExpression -> ^(UNARYEXPR NOT unaryExpression) | MINUS unaryExpression -> ^(UNARYEXPR MINUS unaryExpression) | exponentexpression; exponentexpression : primary (CARET^ primary)*; primary : parExpression | constant | booleantok | variable | function; numeric: INTEGER | REAL; constant: STRING -> ^(CONSTANT STRING) | numeric -> ^(CONSTANT numeric); booleantok : BOOLEAN -> ^(BOOLEAN); scopedidentifier : (IDENTIFIER DOT)* IDENTIFIER -> IDENTIFIER+; function : scopedidentifier LPAREN argumentlist RPAREN -> ^(FUNCTION scopedidentifier argumentlist); variable: scopedidentifier -> ^(VARIABLE scopedidentifier); argumentlist: (expression) ? (COMMA! expression)*; WS : (' '|'\r'|'\n'|'\t')+ {$channel=HIDDEN;}; COMMENT : '/*' .* '*/' {$channel=HIDDEN;}; LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? 
'\n' {$channel=HIDDEN;}; STRING: (('\"') ( (~('\"')) )* ('\"'))+; fragment ALPHA: 'a'..'z'|'_'; fragment DIGIT: '0'..'9'; fragment ALNUM: ALPHA|DIGIT; EQ : '=='; ASSIGN : '='; NE : '!=' | '<>'; OR : 'or' | '||'; AND : 'and' | '&&'; NOT : '!'|'not'; LTE : '<='; GTE : '>='; LT : '<'; GT : '>'; TIMES : '*'; DIVIDE : '/'; BOOLEAN : 'true' | 'false'; IDENTIFIER: ALPHA (ALNUM)* | ('[' (~(']'))+ ']') ; REAL: DIGIT* DOT DIGIT+ ('e' (PLUS | MINUS)? DIGIT+)?; INTEGER: DIGIT+; PLUS : '+'; MINUS : '-'; COMMA : ','; RPAREN : ')'; LPAREN : '('; DOT : '.'; CARET : '^'; And here is what I have after my changes: grammar equation; options { } tokens { VARIABLE; CONSTANT; EXPR; PAREXPR; EQUATION; UNARYEXPR; FUNCTION; BINARYOP; LIST; } equationset: equation* EOF; equation: variable ASSIGN expression ; parExpression : LPAREN expression RPAREN ; expression : conditionalexpression ; conditionalexpression : orExpression ; orExpression : andExpression ( OR andExpression )* ; andExpression : comparisonExpression ( AND comparisonExpression )*; comparisonExpression: additiveExpression ((EQ | NE | LTE | GTE | LT | GT) additiveExpression)*; additiveExpression : multiplicativeExpression ( (PLUS | MINUS) multiplicativeExpression )* ; multiplicativeExpression : unaryExpression ( ( TIMES | DIVIDE) unaryExpression )* ; unaryExpression : NOT unaryExpression | MINUS unaryExpression | exponentexpression; exponentexpression : primary (CARET primary)*; primary : parExpression | constant | booleantok | variable | function; numeric: INTEGER | REAL; constant: STRING | numeric; booleantok : BOOLEAN; scopedidentifier : (IDENTIFIER DOT)* IDENTIFIER; function : scopedidentifier LPAREN argumentlist RPAREN; variable: scopedidentifier; argumentlist: (expression) ? (COMMA expression)*; WS : (' '|'\r'|'\n'|'\t')+ ->channel(HIDDEN); COMMENT : '/*' .* '*/' ->channel(HIDDEN); LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? 
'\n' ->channel(HIDDEN); STRING: (('\"') ( (~('\"')) )* ('\"'))+; fragment ALPHA: 'a'..'z'|'_'; fragment DIGIT: '0'..'9'; fragment ALNUM: ALPHA|DIGIT; EQ : '=='; ASSIGN : '='; NE : '!=' | '<>'; OR : 'or' | '||'; AND : 'and' | '&&'; NOT : '!'|'not'; LTE : '<='; GTE : '>='; LT : '<'; GT : '>'; TIMES : '*'; DIVIDE : '/'; BOOLEAN : 'true' | 'false'; IDENTIFIER: ALPHA (ALNUM)* | ('[' (~(']'))+ ']') ; REAL: DIGIT* DOT DIGIT+ ('e' (PLUS | MINUS)? DIGIT+)?; INTEGER: DIGIT+; PLUS : '+'; MINUS : '-'; COMMA : ','; RPAREN : ')'; LPAREN : '('; DOT : '.'; CARET : '^'; A sample equation that I am trying to parse (which was working OK before) is: [a].[b] = 1.76 * [Product_DC].[PDC_Inbound_Pallets] * if(product_dc.[PDC_DC] =="US84",1,0) Thanks in advance.
There are three problems:
1. Tokens should be separated by commas (,), not semicolons (;). See also the Tokens Section paragraph in the official documentation.
2. Since ANTLR 4.7, a backslash is not required to escape a double quote. STRING: (('\"') ( (~('\"')) )* ('\"'))+; should be rewritten to STRING: ('"' ~'"'* '"')+;.
3. The multiline comment rule is missing the question mark that makes the match non-greedy: '/*' .* '*/' should be '/*' .*? '*/'.
So, the fixed grammar looks like this:
grammar equation;
options { }
tokens { VARIABLE, CONSTANT, EXPR, PAREXPR, EQUATION, UNARYEXPR, FUNCTION, BINARYOP, LIST }
equationset: equation* EOF;
equation: variable ASSIGN expression ;
parExpression : LPAREN expression RPAREN ;
expression : conditionalexpression ;
conditionalexpression : orExpression ;
orExpression : andExpression ( OR andExpression )* ;
andExpression : comparisonExpression ( AND comparisonExpression )*;
comparisonExpression: additiveExpression ((EQ | NE | LTE | GTE | LT | GT) additiveExpression)*;
additiveExpression : multiplicativeExpression ( (PLUS | MINUS) multiplicativeExpression )* ;
multiplicativeExpression : unaryExpression ( ( TIMES | DIVIDE) unaryExpression )* ;
unaryExpression : NOT unaryExpression | MINUS unaryExpression | exponentexpression;
exponentexpression : primary (CARET primary)*;
primary : parExpression | constant | booleantok | variable | function;
numeric: INTEGER | REAL;
constant: STRING | numeric;
booleantok : BOOLEAN;
scopedidentifier : (IDENTIFIER DOT)* IDENTIFIER;
function : scopedidentifier LPAREN argumentlist RPAREN;
variable: scopedidentifier;
argumentlist: (expression) ? (COMMA expression)*;
WS : (' '|'\r'|'\n'|'\t')+ ->channel(HIDDEN);
COMMENT : '/*' .*? '*/' -> channel(HIDDEN);
LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? '\n' ->channel(HIDDEN);
STRING: ('"' ~'"'* '"')+;
fragment ALPHA: 'a'..'z'|'_';
fragment DIGIT: '0'..'9';
fragment ALNUM: ALPHA|DIGIT;
EQ : '==';
ASSIGN : '=';
NE : '!=' | '<>';
OR : 'or' | '||';
AND : 'and' | '&&';
NOT : '!'|'not';
LTE : '<=';
GTE : '>=';
LT : '<';
GT : '>';
TIMES : '*';
DIVIDE : '/';
BOOLEAN : 'true' | 'false';
IDENTIFIER: ALPHA (ALNUM)* | ('[' (~(']'))+ ']') ;
REAL: DIGIT* DOT DIGIT+ ('e' (PLUS | MINUS)? DIGIT+)?;
INTEGER: DIGIT+;
PLUS : '+';
MINUS : '-';
COMMA : ',';
RPAREN : ')';
LPAREN : '(';
DOT : '.';
CARET : '^';
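The difference between greedy and non-greedy matching in the comment rule can be seen with an ordinary regular expression. The snippet below uses Python's re module as an analogy only. It is not ANTLR, and the sample text is made up for illustration.

```python
import re

# Two block comments with code in between (illustrative input).
source = "/* first */ a = 1 /* second */"

# Greedy: '.*' runs to the LAST '*/', swallowing the code in between.
greedy = re.match(r"/\*.*\*/", source, re.DOTALL).group()

# Non-greedy: '.*?' stops at the FIRST '*/'.
lazy = re.match(r"/\*.*?\*/", source, re.DOTALL).group()

print(repr(greedy))
print(repr(lazy))
```

With the greedy form, everything between the first comment opener and the last comment closer is treated as one comment, which is why the equations between two comments would disappear from the token stream.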
"The following sets of rules are mutually left-recursive"
I have tried to write a grammar to recognize expressions like:
(A + MAX(B) ) / ( C - AVERAGE(A) )
IF( A > AVERAGE(A), 0, 1 )
X / (MAX(X)
Unfortunately antlr3 fails with these errors:
error(210): The following sets of rules are mutually left-recursive [unaryExpression, additiveExpression, primaryExpression, formula, multiplicativeExpression]
error(211): DerivedKeywords.g:110:13: [fatal] rule booleanTerm has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
error(206): DerivedKeywords.g:110:13: Alternative 1: after matching input such as decision cannot predict what comes next due to recursion overflow to additiveExpression from formula
I have spent some hours trying to fix these; it would be great if anyone could at least help me fix the first problem. Thanks
Code:
grammar DerivedKeywords;
options {
output=AST;
//backtrack=true;
}
WS : ( ' ' | '\t' | '\n' | '\r' ) { skip(); } ;
//for numbers
DIGIT : '0'..'9' ;
//for both integer and real number
NUMBER : (DIGIT)+ ( '.' (DIGIT)+ )?( ('E'|'e')('+'|'-')?(DIGIT)+ )? ;
// Boolean operators
AND : 'AND';
OR : 'OR';
NOT : 'NOT';
EQ : '=';
NEQ : '!=';
GT : '>';
LT : '<';
GTE : '>=';
LTE : '<=';
COMMA : ',';
// Token for Functions
IF : 'IF';
MAX : 'MAX';
MIN : 'MIN';
AVERAGE : 'AVERAGE';
VARIABLE : 'A'..'Z' ('A'..'Z' | '0'..'9')* ;
// OPERATORS
LPAREN : '(' ;
RPAREN : ')' ;
DIV : '/' ;
PLUS : '+' ;
MINUS : '-' ;
STAR : '*' ;
expression : formula;
formula
: functionExpression
| additiveExpression
| LPAREN! a=formula RPAREN! // First Problem
;
additiveExpression : a=multiplicativeExpression ( (MINUS^ | PLUS^ ) b=multiplicativeExpression )* ;
multiplicativeExpression : a=unaryExpression ( (STAR^ | DIV^ ) b=unaryExpression )* ;
unaryExpression
: MINUS^ u=unaryExpression
| primaryExpression
;
functionExpression
: f=functionOperator LPAREN e=formula RPAREN
| IF LPAREN b=booleanExpression COMMA p=formula COMMA s=formula RPAREN
;
functionOperator : MAX | MIN | AVERAGE;
primaryExpression
: NUMBER
// Used for scientific numbers
| DIGIT
| VARIABLE
| formula
;
// Boolean stuff
booleanExpression : orExpression;
orExpression : a=andExpression (OR^ b=andExpression )* ;
andExpression : a=notExpression (AND^ b=notExpression )* ;
notExpression
: NOT^ t=booleanTerm
| booleanTerm
;
booleanOperator : GT | LT | EQ | GTE | LTE | NEQ;
booleanTerm
: a=formula op=booleanOperator b=formula
| LPAREN! booleanTerm RPAREN! // Second problem
;
error(210): The following sets of rules are mutually left-recursive [unaryExpression, additiveExpression, primaryExpression, formula, multiplicativeExpression]
This means that once the parser enters the unaryExpression rule, it can reach additiveExpression, primaryExpression, formula, multiplicativeExpression, and unaryExpression again without consuming a single token from the input. It therefore cannot decide whether to apply those rules, because the input is exactly the same after applying them as before.
You are probably trying to allow subexpressions inside expressions through this sequence of rules. To do that, make sure the path consumes the left parenthesis of the subexpression. The formula alternative in primaryExpression should probably be changed to LPAREN formula RPAREN, with the rest of the grammar adjusted accordingly.
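The cycle can be checked mechanically. The following Python sketch is not how ANTLR detects left recursion internally. It is a simplified model that assumes no rule can match the empty string, so only the first symbol of each alternative matters. The function left_recursive_rules and the dictionary encoding of the grammar are made up for this illustration, and only the leading symbols of each alternative are listed.

```python
def left_recursive_rules(grammar):
    """Return the rules that can reach themselves without consuming a token.

    grammar maps a rule name to a list of alternatives, each a list of
    symbols. Keys of grammar are parser rules; any other symbol is a token.
    """
    def leftmost(rule):
        # Parser rules that can appear as the first symbol of an alternative.
        return {alt[0] for alt in grammar[rule] if alt[0] in grammar}

    result = set()
    for start in grammar:
        seen, stack = set(), list(leftmost(start))
        while stack:
            rule = stack.pop()
            if rule == start:
                result.add(start)
                break
            if rule not in seen:
                seen.add(rule)
                stack.extend(leftmost(rule))
    return result

def build(formula_alternative):
    """Expression rules from the question; the last alternative of
    primaryExpression is passed in so both variants can be compared."""
    return {
        "formula": [["functionExpression"], ["additiveExpression"],
                    ["LPAREN", "formula", "RPAREN"]],
        "additiveExpression": [["multiplicativeExpression"]],
        "multiplicativeExpression": [["unaryExpression"]],
        "unaryExpression": [["MINUS", "unaryExpression"],
                            ["primaryExpression"]],
        "functionExpression": [["functionOperator", "LPAREN"],
                               ["IF", "LPAREN"]],
        "functionOperator": [["MAX"], ["MIN"], ["AVERAGE"]],
        "primaryExpression": [["NUMBER"], ["DIGIT"], ["VARIABLE"],
                              formula_alternative],
    }

original = left_recursive_rules(build(["formula"]))
fixed = left_recursive_rules(build(["LPAREN", "formula", "RPAREN"]))
print(sorted(original))
print(sorted(fixed))
```

For the original grammar the model reports the same five rules as error(210). Once the formula alternative starts with LPAREN, every path back to formula consumes a token and the set is empty.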
ANTLR4 Token is not recognized when substituted
I am trying to modify the grammar of the SQLite syntax (I'm interested in a variant of the where clause only), and I keep getting a weird error when I replace the 'and' literal with its own token.
grammar wtfql;
/*
SQLite understands the following binary operators, in order from highest to lowest precedence:
||
* / %
+ -
<< >> & |
< <= > >=
= != <> IS IS NOT IN LIKE GLOB MATCH REGEXP
AND
OR
*/
start : expr EOF?;
expr
: literal_value
//BIND_PARAMETER
| ( table_name '.' )? column_name
| unary_operator expr
| expr '||' expr
| expr ( '*' | '/' | '%' ) expr
| expr ( '+' | '-' ) expr
| expr ( '<' | '<=' | '>' | '>=' ) expr
| expr ( '=' | '<>' | K_IN ) expr
| expr K_AND expr
| expr K_OR expr
| function_name '(' ( expr ( ',' expr )* )? ')'
| '(' expr ')'
| expr K_NOT expr
| expr ( K_NOT K_NULL )
| expr K_NOT? K_IN ( '(' ( expr ( ',' expr )* ) ')' )
;
unary_operator : '-' | '+' | K_NOT ;
literal_value : NUMERIC_LITERAL | STRING_LITERAL | K_NULL ;
function_name : IDENTIFIER ;
table_name : any_name ;
column_name : any_name ;
any_name
: IDENTIFIER
| keyword
// | '(' any_name ')'
;
keyword : K_AND | K_NOT | K_NULL | K_IN | K_OR ;
IDENTIFIER
: [a-zA-Z_] [a-zA-Z_0-9]* // TODO check: needs more chars in set
;
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )?
;
STRING_LITERAL : '\"' ( ~'\"' | '\"\"' )* '\"' ;
SPACES : [ \u000B\t\r\n] -> channel(HIDDEN) ;
DOT : '.';
OPEN_PAR : '(';
CLOSE_PAR : ')';
COMMA : ',';
STAR : '*';
PLUS : '+';
MINUS : '-';
TILDE : '~';
DIV : '/';
MOD : '%';
AMP : '&';
PIPE : '|';
LT : '<';
LT_EQ : '<=';
GT : '>';
GT_EQ : '>=';
EQ : '=';
NOT_EQ2 : '<>';
K_AND : A N D;
K_NOT : N O T;
K_NULL : N U L L;
K_OR : O R;
K_IN : I N;
fragment DIGIT : [0-9];
fragment A : [aA]; fragment B : [bB]; fragment C : [cC]; fragment D : [dD]; fragment E : [eE]; fragment F : [fF]; fragment G : [gG];
fragment H : [hH]; fragment I : [iI]; fragment J : [jJ]; fragment K : [kK]; fragment L : [lL]; fragment M : [mM]; fragment N : [nN];
fragment O : [oO]; fragment P : [pP]; fragment Q : [qQ]; fragment R : [rR]; fragment S : [sS]; fragment T : [tT]; fragment U : [uU];
fragment V : [vV]; fragment W : [wW]; fragment X : [xX]; fragment Y : [yY]; fragment Z : [zZ];
Writing
| expr K_AND expr
with the input
field1=1 and field2 = 2
results in
line 1:8 mismatched input 'and' expecting {<EOF>, '||', '*', '+', '-', '/', '%', '<', '<=', '>', '>=', '=', '<>', K_AND, K_NOT, K_OR, K_IN}
while
| expr 'and' expr
works like a charm:
$ antlr4 wtfql.g4 && javac -classpath /usr/local/Cellar/antlr/4.4/antlr-4.4-complete.jar wtfql*.java && cat test.txt | grun wtfql start -tree -gui
(start (expr (expr (expr (column_name (any_name feld1))) = (expr (literal_value 1))) and (expr (expr (column_name (any_name feld2))) = (expr (literal_value 2)))) <EOF>)
What am I missing?
I presume "and" is tokenized as an IDENTIFIER, since the IDENTIFIER rule comes before the K_AND rule and thus wins. If you write 'and' in the parser rule, this implicitly creates a token (not K_AND!) that takes precedence over IDENTIFIER and thus wins. Rule of thumb: put more specific lexer rules first, and don't create new lexer tokens implicitly in parser rules. If you check the token type, you'll get a clue about what's going on.
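The ordering effect can be illustrated with a short Python model. This is a sketch rather than ANTLR itself: it assumes the boundaries of the lexeme are already known and that, among rules matching the whole lexeme, the one defined first wins. The function token_type and the regular expressions are hypothetical stand-ins for the lexer rules.

```python
import re

def token_type(rules, lexeme):
    """Name of the first rule, in definition order, matching the whole lexeme."""
    for name, pattern in rules:
        if re.fullmatch(pattern, lexeme):
            return name
    return None

# Regex stand-ins for the two competing lexer rules.
IDENTIFIER = ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z_0-9]*")
K_AND = ("K_AND", r"[aA][nN][dD]")

# Order used in the question: IDENTIFIER is defined before K_AND.
question_order = [IDENTIFIER, K_AND]
# Suggested order: the more specific keyword rule comes first.
keyword_first = [K_AND, IDENTIFIER]

print(token_type(question_order, "and"))     # keyword is shadowed
print(token_type(keyword_first, "and"))      # keyword wins
print(token_type(keyword_first, "android"))  # ordinary names are unaffected
```

Moving the keyword rules above IDENTIFIER changes only the lexemes that both rules match in full, so ordinary identifiers still come out as IDENTIFIER.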
Antlr 4 whitespace in string being eliminated
I'm using Antlr 4 to build a compiler for a made up language. I'm having problems with eliminating whitespace properly. It will get rid of whitespace between tokens but it also delete whitespace within the string token which is obviously not what I want. I've tried using modes to clear this issue up with no avail. Lexer.g4 lexer grammar WaccLexer; SEMICOLON: ';' ; WS: [ \n\t\r\u000C]+ -> skip; EOL: '\n' ; BEGIN: 'begin' ; END: 'end' ; SKIP: 'skip' ; READ: 'read' ; FREE: 'free' ; RETURN: 'return' ; EXIT: 'exit' ; IS: 'is' ; PRINT: 'print' ; PRINTLN: 'println' ; IF: 'if' ; THEN: 'then' ; ELSE: 'else' ; FI: 'fi' ; WHILE: 'while' ; DO: 'do' ; DONE: 'done' ; NEWPAIR: 'newpair' ; CALL: 'call' ; FST: 'fst' ; SND: 'snd' ; INT: 'int' ; BOOL: 'bool' ; CHAR: 'char' ; STRING: 'string' ; PAIR: 'pair' ; EXCLAMATION: '!' ; LEN: 'len' ; ORD: 'ord' ; TOINT: 'toInt' ; DIGIT: '0'..'9' ; LOWCHAR: 'a'..'z' ; R: 'r' ; F: 'f' ; N: 'n' ; T: 't' ; B: 'b' ; ZERO: '0' ; MULTI: '*' ; DIVIDE: '/' ; MOD: '%' ; PLUS: '+' ; MINUS: '-' ; GT: '>' ; GTE: '>=' ; LT: '<' ; LTE: '<=' ; DOUBLEEQUAL: '==' ; EQUAL: '=' ; NOTEQUAL: '!=' ; AND: '&&' ; OR: '||' ; UNDERSCORE: '_' ; UPCHAR: 'A'..'Z' ; OPENSQUARE: '[' ; CLOSESQUARE: ']' ; OPENPARENTHESIS: '(' ; CLOSEPARENTHESIS: ')' ; TRUE: 'true' ; FALSE: 'false' ; SINGLEQUOT: '\'' ; DOUBLEQUOT: '\"' ; BACKSLASH: '\\' ; COMMA: ',' ; NULL: 'null' ; OPENSTRING : DOUBLEQUOT -> pushMode(STRINGMODE) ; COMMENT: '#' ~[\r\n]* '\r'? '\n' -> skip ; mode STRINGMODE ; CLOSESTRING : DOUBLEQUOT -> popMode ; CHARACTER : ~[\"\'\\] | (BACKSLASH ESCAPEDCHAR) ; STRLIT : (CHARACTER)* ; ESCAPEDCHAR : ZERO | B | T | N | F | R | DOUBLEQUOT | SINGLEQUOT | BACKSLASH ; Parser.g4 parser grammar WaccParser; options { tokenVocab=WaccLexer; } program : BEGIN (func)* stat END EOF; func : type ident OPENPARENTHESIS (paramlist)? 
CLOSEPARENTHESIS IS stat END ; paramlist : param (COMMA param)* ; param : type ident ; stat : SKIP | type ident EQUAL assignrhs | assignlhs EQUAL assignrhs | READ assignlhs | FREE expr | RETURN expr | EXIT expr | PRINT expr | PRINTLN expr | IF expr THEN stat ELSE stat FI | WHILE expr DO stat DONE | BEGIN stat END | stat SEMICOLON stat ; assignlhs : ident | expr OPENSQUARE expr CLOSESQUARE | pairelem ; assignrhs : expr | arrayliter | NEWPAIR OPENPARENTHESIS expr COMMA expr CLOSEPARENTHESIS | pairelem | CALL ident OPENPARENTHESIS (arglist)? CLOSEPARENTHESIS ; arglist : expr (COMMA expr)* ; pairelem : FST expr | SND expr ; type : basetype | type OPENSQUARE CLOSESQUARE | pairtype ; basetype : INT | BOOL | CHAR | STRING ; pairtype : PAIR OPENPARENTHESIS pairelemtype COMMA pairelemtype CLOSEPARENTHESIS ; pairelemtype : basetype | type OPENSQUARE CLOSESQUARE | PAIR ; expr : intliter | boolliter | charliter | strliter | pairliter | ident | expr OPENSQUARE expr CLOSESQUARE | unaryoper expr | expr binaryoper expr | OPENPARENTHESIS expr CLOSEPARENTHESIS ; unaryoper : EXCLAMATION | MINUS | LEN | ORD | TOINT ; binaryoper : MULTI | DIVIDE | MOD | PLUS | MINUS nus | GT | GTE | LT | LTE | DOUBLEEQUAL | NOTEQUAL | AND | OR ; ident : (UNDERSCORE | LOWCHAR | UPCHAR) (UNDERSCORE | LOWCHAR | UPCHAR | DIGIT)* ; intliter : (intsign)? (digit)+ ; digit : DIGIT ; intsign : PLUS | MINUS ; boolliter : TRUE | FALSE ; charliter : CHARACTER; strliter : OPENSTRING STRLIT CLOSESTRING; arrayliter : OPENSQUARE (expr (COMMA expr)*)? CLOSESQUARE ; Please also remember that comment starting with # need to be ignored. Thanks in advance.
The OPENSTRING lexer rule will never be matched in your grammar, because the DOUBLEQUOT rule matches exactly the same input sequence and appears before it in the grammar. If you want to define a lexer rule but do not want that rule to create a token on its own, define it with the fragment modifier:
fragment DOUBLEQUOT : '"';
In addition, you need to correct the warnings that appear when you generate code for your grammar. At least one of them (EPSILON_TOKEN) indicates a major mistake. It used to be an error in ANTLR 4.0 but was changed to a warning in ANTLR 4.1, since there is an edge case where it can be used without problems.
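A check for this kind of shadowing is easy to sketch. The Python function below is a hypothetical helper, not an ANTLR feature. It only handles rules that consist of a single literal within one lexer mode, and the rule list is a hand-copied excerpt of the default-mode rules from the question.

```python
def shadowed_rules(rules):
    """Return (later, earlier) pairs where the later rule matches exactly the
    same literal as an earlier rule and so can never produce a token."""
    first_owner = {}
    shadowed = []
    for name, literal in rules:
        if literal in first_owner:
            shadowed.append((name, first_owner[literal]))
        else:
            first_owner[literal] = name
    return shadowed

# Excerpt of the default-mode literal rules, in definition order.
token_rules = [
    ("SINGLEQUOT", "'"),
    ("DOUBLEQUOT", '"'),
    ("BACKSLASH", "\\"),
    ("COMMA", ","),
    ("OPENSTRING", '"'),
]

# With DOUBLEQUOT declared as a fragment it no longer competes for tokens.
with_fragment = [rule for rule in token_rules if rule[0] != "DOUBLEQUOT"]

print(shadowed_rules(token_rules))
print(shadowed_rules(with_fragment))
```

In the first list OPENSTRING is reported as shadowed by DOUBLEQUOT. After DOUBLEQUOT is taken out of the token rules, nothing is shadowed and OPENSTRING can start the string mode.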
How to have both function calls and parenthetical grouping without backtrack
Is there any way to specify a grammar which allows the following syntax: f(x)(g, (1-(-2))*3, 1+2*3)[0] which is transformed into (in pseudo-lisp to show order): (index ((f x) g (* (- 1 -2) 3) (+ (* 2 3) 1) ) 0 ) along with things like limited operator precedence etc. The following grammar works with backtrack = true, but I'd like to avoid that: grammar T; options { output=AST; backtrack=true; memoize=true; } tokens { CALL; INDEX; LOOKUP; } prog: (expr '\n')* ; expr : boolExpr; boolExpr : relExpr (boolop^ relExpr)? ; relExpr : addExpr (relop^ addExpr)? | a=addExpr oa=relop b=addExpr ob=relop c=addExpr -> ^(LAND ^($oa $a $b) ^($ob $b $c)) ; addExpr : mulExpr (addop^ mulExpr)? ; mulExpr : atomExpr (mulop^ atomExpr)? ; atomExpr : INT | ID | OPAREN expr CPAREN -> expr | call ; call : callable ( OPAREN (expr (COMMA expr)*)? CPAREN -> ^(CALL callable expr*) | OBRACK expr CBRACK -> ^(INDEX callable expr) | DOT ID -> ^(INDEX callable ID) ) ; fragment callable : ID | OPAREN expr CPAREN ; fragment boolop : LAND | LOR ; fragment relop : (EQ|GT|LT|GTE|LTE) ; fragment addop : (PLUS|MINUS) ; fragment mulop : (TIMES|DIVIDE) ; EQ : '==' ; GT : '>' ; LT : '<' ; GTE : '>=' ; LTE : '<=' ; LAND : '&&' ; LOR : '||' ; PLUS : '+' ; MINUS : '-' ; TIMES : '*' ; DIVIDE : '/' ; ID : ('a'..'z')+ ; INT : '0'..'9' ; OPAREN : '(' ; CPAREN : ')' ; OBRACK : '[' ; CBRACK : ']' ; DOT : '.' ; COMMA : ',' ;
There are a couple of things wrong with your grammar:
1. Only lexer rules can be fragments, not parser rules. Some ANTLR targets (like the Java target) simply ignore the fragment keyword in front of parser rules, but it is better to remove it from your grammar: if you decide to create a parser for a different target language, you may run into problems because of it.
2. Without backtrack=true, you cannot mix tree-rewrite operators (^ and !) and rewrite rules (->), because you need to create a single alternative inside relExpr instead of the two alternatives you have now (this eliminates an ambiguity). In your case, you can't create the desired AST with just ^ (inside a single alternative), so you'll need to do it like this:
relExpr
: (a=addExpr -> $a) ( (oa=relOp b=addExpr -> ^($oa $a $b))
( ob=relOp c=addExpr -> ^(LAND ^($oa $a $b) ^($ob $b $c)) )? )?
;
(Yes, I know, it's not particularly pretty, but that can't be helped AFAIK.)
Also, you can only put the LAND token in the rewrite rules if it is defined in the tokens { ... } block:
tokens {
// literal tokens
LAND='&&';
...
// imaginary tokens
CALL;
...
}
Otherwise you can only use tokens (and other parser rules) in rewrite rules if they really occur inside the parser rule itself.
3. You did not account for the unary minus in your grammar. Implement it like this:
mulExpr : unaryExpr ((TIMES | DIVIDE)^ unaryExpr)* ;
unaryExpr : MINUS atomExpr -> ^(UNARY_MINUS atomExpr) | atomExpr ;
Now, to create a grammar that does not need backtrack=true, remove the ID and '(' expr ')' alternatives from your atomExpr rule:
atomExpr : INT | call ;
and make everything past callable optional inside your call rule:
call
: (callable -> callable) ( OPAREN params CPAREN -> ^(CALL $call params)
| OBRACK expr CBRACK -> ^(INDEX $call expr)
| DOT ID -> ^(INDEX $call ID)
)*
;
That way, ID and '(' expr ')' are already matched by call (and there's no ambiguity).
Taking all the remarks above into account, you get the following grammar:
grammar T;
options { output=AST; }
tokens {
// literal tokens
EQ = '==' ;
GT = '>' ;
LT = '<' ;
GTE = '>=' ;
LTE = '<=' ;
LAND = '&&' ;
LOR = '||' ;
PLUS = '+' ;
MINUS = '-' ;
TIMES = '*' ;
DIVIDE = '/' ;
OPAREN = '(' ;
CPAREN = ')' ;
OBRACK = '[' ;
CBRACK = ']' ;
DOT = '.' ;
COMMA = ',' ;
// imaginary tokens
CALL;
INDEX;
LOOKUP;
UNARY_MINUS;
PARAMS;
}
prog : expr EOF -> expr ;
expr : boolExpr ;
boolExpr : relExpr ((LAND | LOR)^ relExpr)? ;
relExpr
: (a=addExpr -> $a) ( (oa=relOp b=addExpr -> ^($oa $a $b))
( ob=relOp c=addExpr -> ^(LAND ^($oa $a $b) ^($ob $b $c)) )? )?
;
addExpr : mulExpr ((PLUS | MINUS)^ mulExpr)* ;
mulExpr : unaryExpr ((TIMES | DIVIDE)^ unaryExpr)* ;
unaryExpr : MINUS atomExpr -> ^(UNARY_MINUS atomExpr) | atomExpr ;
atomExpr : INT | call ;
call
: (callable -> callable) ( OPAREN params CPAREN -> ^(CALL $call params)
| OBRACK expr CBRACK -> ^(INDEX $call expr)
| DOT ID -> ^(INDEX $call ID)
)*
;
callable : ID | OPAREN expr CPAREN -> expr ;
params : (expr (COMMA expr)*)? -> ^(PARAMS expr*) ;
relOp : EQ | GT | LT | GTE | LTE ;
ID : 'a'..'z'+ ;
INT : '0'..'9'+ ;
SPACE : (' ' | '\t') {skip();};
which would parse the input "a >= b < c" into the following AST:
[AST image not preserved]
and the input "f(x)(g, (1-(-2))*3, 1+2*3)[0]" as follows:
[AST image not preserved]
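As a rough illustration of how the suffix loop in the call rule nests calls and index operations, here is a small hand-written Python parser. It is a sketch, not generated by ANTLR: it covers only the arithmetic and call parts of the grammar (relational and boolean operators are left out), and the tuple shapes and the names Parser and parse are choices made for this example.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[a-z]+|[-+*/()\[\],.])")

def tokenize(text):
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, text):
        self.tokens, self.pos = tokenize(text), 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def binary(self, operand, ops):
        # operand (op operand)*, folded left-associatively
        node = operand()
        while self.peek() in ops:
            node = (self.take(), node, operand())
        return node

    def expr(self):      # addExpr
        return self.binary(self.mul, ("+", "-"))

    def mul(self):       # mulExpr
        return self.binary(self.unary, ("*", "/"))

    def unary(self):     # unaryExpr
        if self.peek() == "-":
            self.take()
            return ("UNARY_MINUS", self.atom())
        return self.atom()

    def atom(self):      # atomExpr : INT | call
        tok = self.peek()
        if tok is not None and tok.isdigit():
            return int(self.take())
        return self.call()

    def call(self):
        # Parse the callable once, then loop over suffixes. Each suffix wraps
        # the node built so far, mirroring $call in the rewrite rules.
        node = self.callable_()
        while self.peek() in ("(", "[", "."):
            tok = self.take()
            if tok == "(":
                args = []
                if self.peek() != ")":
                    args.append(self.expr())
                    while self.peek() == ",":
                        self.take()
                        args.append(self.expr())
                self.take(")")
                node = ("CALL", node, tuple(args))
            elif tok == "[":
                node = ("INDEX", node, self.expr())
                self.take("]")
            else:
                node = ("INDEX", node, self.name())
        return node

    def callable_(self):  # callable : ID | '(' expr ')'
        if self.peek() == "(":
            self.take()
            node = self.expr()
            self.take(")")
            return node
        return self.name()

    def name(self):
        tok = self.take()
        if not tok.isalpha():
            raise SyntaxError(f"expected identifier, got {tok!r}")
        return tok

def parse(text):
    parser = Parser(text)
    node = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return node

tree = parse("f(x)(g, (1-(-2))*3, 1+2*3)[0]")
print(tree)
```

Because an identifier and a parenthesized expression are both reached only through call, the parser never has to guess between grouping and calling: it parses one callable and then simply checks whether a suffix follows.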