Antlr 4 whitespace in string been eliminated - antlr

I'm using Antlr 4 to build a compiler for a made up language. I'm having problems with eliminating whitespace properly. It will get rid of whitespace between tokens but it also delete whitespace within the string token which is obviously not what I want. I've tried using modes to clear this issue up with no avail.
Lexer.g4
lexer grammar WaccLexer;
SEMICOLON: ';' ;
WS: [ \n\t\r\u000C]+ -> skip;
EOL: '\n' ;
BEGIN: 'begin' ;
END: 'end' ;
SKIP: 'skip' ;
READ: 'read' ;
FREE: 'free' ;
RETURN: 'return' ;
EXIT: 'exit' ;
IS: 'is' ;
PRINT: 'print' ;
PRINTLN: 'println' ;
IF: 'if' ;
THEN: 'then' ;
ELSE: 'else' ;
FI: 'fi' ;
WHILE: 'while' ;
DO: 'do' ;
DONE: 'done' ;
NEWPAIR: 'newpair' ;
CALL: 'call' ;
FST: 'fst' ;
SND: 'snd' ;
INT: 'int' ;
BOOL: 'bool' ;
CHAR: 'char' ;
STRING: 'string' ;
PAIR: 'pair' ;
EXCLAMATION: '!' ;
LEN: 'len' ;
ORD: 'ord' ;
TOINT: 'toInt' ;
DIGIT: '0'..'9' ;
LOWCHAR: 'a'..'z' ;
R: 'r' ;
F: 'f' ;
N: 'n' ;
T: 't' ;
B: 'b' ;
ZERO: '0' ;
MULTI: '*' ;
DIVIDE: '/' ;
MOD: '%' ;
PLUS: '+' ;
MINUS: '-' ;
GT: '>' ;
GTE: '>=' ;
LT: '<' ;
LTE: '<=' ;
DOUBLEEQUAL: '==' ;
EQUAL: '=' ;
NOTEQUAL: '!=' ;
AND: '&&' ;
OR: '||' ;
UNDERSCORE: '_' ;
UPCHAR: 'A'..'Z' ;
OPENSQUARE: '[' ;
CLOSESQUARE: ']' ;
OPENPARENTHESIS: '(' ;
CLOSEPARENTHESIS: ')' ;
TRUE: 'true' ;
FALSE: 'false' ;
SINGLEQUOT: '\'' ;
DOUBLEQUOT: '\"' ;
BACKSLASH: '\\' ;
COMMA: ',' ;
NULL: 'null' ;
OPENSTRING : DOUBLEQUOT -> pushMode(STRINGMODE) ;
COMMENT: '#' ~[\r\n]* '\r'? '\n' -> skip ;
mode STRINGMODE ;
CLOSESTRING : DOUBLEQUOT -> popMode ;
CHARACTER : ~[\"\'\\] | (BACKSLASH ESCAPEDCHAR) ;
STRLIT : (CHARACTER)* ;
ESCAPEDCHAR : ZERO
| B
| T
| N
| F
| R
| DOUBLEQUOT
| SINGLEQUOT
| BACKSLASH
;
Parser.g4
parser grammar WaccParser;
options {
tokenVocab=WaccLexer;
}
program : BEGIN (func)* stat END EOF;
func : type ident OPENPARENTHESIS (paramlist)? CLOSEPARENTHESIS IS stat END ;
paramlist : param (COMMA param)* ;
param : type ident ;
stat : SKIP
| type ident EQUAL assignrhs
| assignlhs EQUAL assignrhs
| READ assignlhs
| FREE expr
| RETURN expr
| EXIT expr
| PRINT expr
| PRINTLN expr
| IF expr THEN stat ELSE stat FI
| WHILE expr DO stat DONE
| BEGIN stat END
| stat SEMICOLON stat
;
assignlhs : ident
| expr OPENSQUARE expr CLOSESQUARE
| pairelem
;
assignrhs : expr
| arrayliter
| NEWPAIR OPENPARENTHESIS expr COMMA expr CLOSEPARENTHESIS
| pairelem
| CALL ident OPENPARENTHESIS (arglist)? CLOSEPARENTHESIS
;
arglist : expr (COMMA expr)* ;
pairelem : FST expr
| SND expr
;
type : basetype
| type OPENSQUARE CLOSESQUARE
| pairtype
;
basetype : INT
| BOOL
| CHAR
| STRING
;
pairtype : PAIR OPENPARENTHESIS pairelemtype COMMA pairelemtype CLOSEPARENTHESIS ;
pairelemtype : basetype
| type OPENSQUARE CLOSESQUARE
| PAIR
;
expr : intliter
| boolliter
| charliter
| strliter
| pairliter
| ident
| expr OPENSQUARE expr CLOSESQUARE
| unaryoper expr
| expr binaryoper expr
| OPENPARENTHESIS expr CLOSEPARENTHESIS
;
unaryoper : EXCLAMATION
| MINUS
| LEN
| ORD
| TOINT
;
binaryoper : MULTI
| DIVIDE
| MOD
| PLUS
| MINUS nus
| GT
| GTE
| LT
| LTE
| DOUBLEEQUAL
| NOTEQUAL
| AND
| OR
;
ident : (UNDERSCORE | LOWCHAR | UPCHAR) (UNDERSCORE | LOWCHAR | UPCHAR | DIGIT)* ;
intliter : (intsign)? (digit)+ ;
digit : DIGIT ;
intsign : PLUS
| MINUS
;
boolliter : TRUE
| FALSE
;
charliter : CHARACTER;
strliter : OPENSTRING STRLIT CLOSESTRING;
arrayliter : OPENSQUARE (expr (COMMA expr)*)? CLOSESQUARE ;
Please also remember that comment starting with # need to be ignored. Thanks in advance.

The OPENSTRING lexer rule will never be matched in your grammar because the DOUBLEQUOT rule matches exactly the same input sequence and appears before it in the grammar. If you want to define a lexer rule, but you do not actually want that lexer rule to create a token on its own, then you need to define the rule with the fragment modifier.
fragment DOUBLEQUOT : '"';
In addition, you need to correct the warnings that appear when you generate code for your grammar. At least one of them (defined as EPSILON_TOKEN) indicates a major mistake that you made that used to be an error in ANTLR 4.0 but was changed to a warning in ANTLR 4.1 since there is an edge case where it can be used without problems.

Related

ANTLR arithmetic and comparison expressions grammer ANTLR

how to add relational operations to my code
Thanks
My code is
grammar denem1;
options {
output=AST;
}
tokens {
ROOT;
}
parse
: stat+ EOF -> ^(ROOT stat+)
;
stat
: expr ';'
;
expr
: Id Assign expr -> ^(Assign Id expr)
| add
;
add
: mult (('+' | '-')^ mult)*
;
mult
: atom (('*' | '/')^ atom)*
;
atom
: Id
| Num
| '('! expr ')' !
;
Assign : '=' ;
Comment : '//' ~('\r' | '\n')* {skip();};
Id : 'a'..'z'+;
Num : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
Like this:
...
expr
: Id Assign expr -> ^(Assign Id expr)
| rel
;
rel
: add (('<=' | '<' | '>=' | '>')^ add)?
;
add
: mult (('+' | '-')^ mult)*
;
...
If possible, use ANTLR v4 instead of the old v3. In v4, you can simply do this:
stat
: expr ';'
;
expr
: Id Assign expr
| '-' expr
| expr ('*' | '/') expr
| expr ('+' | '-') expr
| expr ('<=' | '<' | '>=' | '>') expr
| Id
| Num
| '(' expr ')'
;

Simple ANTLR grammar not disambiguating identifiers and literals

I'm trying to match this syntax:
P = 100
require P
credit account:subaccount P
The first is an assignment. The second is a "check" that P is truthy. The third is an instruction to move 100 to account:subaccount. The problem is that the grammar I've written thinks the third line is just an assignment with a missing equal sign. I can't see why.
program: (stmt NEWLINE)+;
stmt: require | entry;
require: 'require' filtrex;
entry: (CREDIT | DEBIT) JOURNAL filtrex (IF filtrex)? (LPARENHASH EXTID RPAREN)?;
assign: ID EQ filtrex;
filtrex: math;
math
: math (TIMES | DIV) math
| math (PLUS | MINUS) math
| LPAREN math RPAREN
| (PLUS | MINUS)* atom
;
atom: NUMBER
| ID
;
NUMBER
: ('0' .. '9') + ('.' ('0' .. '9') +)?
;
fragment SIGN
: ('+' | '-')
;
ID: [a-zA-Z]+[0-9a-zA-Z]*;
EQ: '=';
JOURNAL: [a-zA-Z:]+;
EXTID: [a-zA-Z0-9-]+;
COLON: ':';
CREDIT: 'credit';
DEBIT: 'debit';
IF: 'if';
NEWLINE : [\r\n];
NUM : [0-9.]+;
LPAREN: '(';
RPAREN: ')';
LPARENHASH: '(#';
PLUS: '+';
MINUS: '-';
TIMES: '*';
DIV: '/' ;
POINT: '.';
WS: [ \r\n\t] + -> skip;
UPDATE
Thanks to the suggestions below I have something that seems to work properly. Now to the implementation of the logic...
grammar Txl;
// High level language
program: stmt (NEWLINE stmt)* NEWLINE? EOF;
stmt: require | entry | assignment;
require: 'require' expr;
entry: (CREDIT | DEBIT) journal expr (IF expr)? (LPAREN 'id:' EXTID RPAREN)?;
assignment: IDENT ASSIGN expr;
journal: IDENT COLON IDENT;
expr: expr MULT expr
| expr DIV expr
| expr PLUS expr
| expr MINUS expr
| expr MOD expr
| expr POW expr
| MINUS expr
| expr AND expr
| expr OR expr
| NOT expr
| expr EQ expr
| expr NEQ expr
| expr LTE expr
| expr LT expr
| expr GTE expr
| expr GT expr
| expr QUESTION expr COLON expr
| LPAREN expr RPAREN
| NUMBER
| IDENT LPAREN args RPAREN
| IDENT
;
fnArg: expr | journal;
args: fnArg
| fnArg COMMA fnArg
|
;
// Reserved words
CREDIT: 'credit';
DEBIT: 'debit';
IF: 'if';
REQUIRE: 'require';
// Operators
MULT: '*';
DIV: '/';
MINUS: '-';
PLUS: '+';
POW: '^';
MOD: '%';
LPAREN: '(';
RPAREN: ')';
LBRACE: '[';
RBRACE: ']';
COMMA: ',';
EQ: '==';
NEQ: '!=';
GTE: '>=';
LTE: '<=';
GT: '>';
LT: '<';
ASSIGN: '=';
QUESTION: '?';
COLON: ':';
AND: 'and';
OR: 'or';
NOT: 'not';
HASH: '#';
NEWLINE : [\r\n];
WS: [ \t] + -> skip;
// Entities
NUMBER: ('0' .. '9') + ('.' ('0' .. '9') +)?;
IDENT: [a-zA-Z]+[0-9a-zA-Z]*;
EXTID: [a-zA-Z0-9-]+;
That is because the input credit is not being matched by your CREDIT rule, but by the ID rule. The lexer always tries to match as many characters as possible. So, the input credit can be matched by: ID, JOURNAL, EXTID and CREDIT. Whenever it happens that multiple rules can match the same characters, the one defined first "wins" (ID in this case). The lexer does not "listen" to what the parser is trying to match, it operates independently from the parser.
Note that the EXTID also causes the input - to be matched by it, causing the MINUS rule to never be matched.
The solution: place your keywords before the ID rule inside the grammar:
CREDIT : 'credit';
DEBIT : 'debit';
REQUIRE : 'require';
ID : [a-zA-Z]+ [0-9a-zA-Z]*;
And, if possible, I'd also remove the JOURNAL and EXTID lexer rules and try to "promote" them to parser rules:
journal
: ID COLON ID
;
extid
: ID (MINUS ID)*
;
NUMBER and NUM can also match the same, while NUM also matches input like 1........2.......22222.... I'd remove the NUM rule and only keep NUMBER.
Remove the \r\n part from WS: [ \r\n\t] + -> skip; since these are already matched by your NEWLINE rule.
By doing (stmt NEWLINE)+, every stmt must end with a new line (also the last one). This could be a better solution: stmt (NEWLINE stmt)* NEWLINE?.
The grammar could look like this:
program
: stmt (NEWLINE stmt)* NEWLINE? EOF
;
stmt
: require
| entry
| assign
;
require
: REQUIRE filtrex
;
entry
: (CREDIT | DEBIT) journal filtrex (IF filtrex)? (LPARENHASH extid RPAREN)?
;
assign
: ID EQ filtrex
;
journal
: ID COLON ID
;
extid
: ID (MINUS ID)*
;
filtrex
: math
;
math
: math (TIMES | DIV) math
| math (PLUS | MINUS) math
| LPAREN math RPAREN
| (PLUS | MINUS)* atom
;
atom
: NUMBER
| ID
;
NUMBER : [0-9]+ ('.' [0-9]+)?;
CREDIT : 'credit';
DEBIT : 'debit';
REQUIRE : 'require';
IF : 'if';
ID : [a-zA-Z]+ [0-9a-zA-Z]*;
EQ : '=';
COLON : ':';
NEWLINE : [\r\n];
LPAREN : '(';
RPAREN : ')';
LPARENHASH : '(#';
PLUS : '+';
MINUS : '-';
TIMES : '*';
DIV : '/' ;
POINT : '.';
WS : [ \t] + -> skip;
which will parse your example input like this:

grammar does not separate '123 and ] though the rule is set for it

I am new to antlr. I am trying to parse some queries like [network-traffic:src_port = '123] and [network-traffic:src_port =] and [network-traffic:src_port = ] and ... I have a grammar as follows:
grammar STIXPattern;
pattern
: observationExpressions EOF
;
observationExpressions
: <assoc=left> observationExpressions FOLLOWEDBY observationExpressions #observationExpressionsFollowedBY
| observationExpressionOr #observationExpressionOr_
;
observationExpressionOr
: <assoc=left> observationExpressionOr OR observationExpressionOr #observationExpressionOred
| observationExpressionAnd #observationExpressionAnd_
;
observationExpressionAnd
: <assoc=left> observationExpressionAnd AND observationExpressionAnd #observationExpressionAnded
| observationExpression #observationExpression_
;
observationExpression
: LBRACK comparisonExpression RBRACK # observationExpressionSimple
| LPAREN observationExpressions RPAREN # observationExpressionCompound
| observationExpression startStopQualifier # observationExpressionStartStop
| observationExpression withinQualifier # observationExpressionWithin
| observationExpression repeatedQualifier # observationExpressionRepeated
;
comparisonExpression
: <assoc=left> comparisonExpression OR comparisonExpression #comparisonExpressionOred
| comparisonExpressionAnd #comparisonExpressionAnd_
;
comparisonExpressionAnd
: <assoc=left> comparisonExpressionAnd AND comparisonExpressionAnd #comparisonExpressionAnded
| propTest #comparisonExpressionAndpropTest
;
propTest
: objectPath NOT? (EQ|NEQ) primitiveLiteral # propTestEqual
| objectPath NOT? (GT|LT|GE|LE) orderableLiteral # propTestOrder
| objectPath NOT? IN setLiteral # propTestSet
| objectPath NOT? LIKE StringLiteral # propTestLike
| objectPath NOT? MATCHES StringLiteral # propTestRegex
| objectPath NOT? ISSUBSET StringLiteral # propTestIsSubset
| objectPath NOT? ISSUPERSET StringLiteral # propTestIsSuperset
| LPAREN comparisonExpression RPAREN # propTestParen
| objectPath NOT? (EQ|NEQ) objectPathThl # propTestThlEqual
;
startStopQualifier
: START TimestampLiteral STOP TimestampLiteral
;
withinQualifier
: WITHIN (IntPosLiteral|FloatPosLiteral) SECONDS
;
repeatedQualifier
: REPEATS IntPosLiteral TIMES
;
objectPath
: objectType COLON firstPathComponent objectPathComponent?
;
objectPathThl
: varThlType DOT firstPathComponent objectPathComponent?
;
objectType
: IdentifierWithoutHyphen
| IdentifierWithHyphen
;
varThlType
: IdentifierWithoutHyphen
| IdentifierWithHyphen
;
firstPathComponent
: IdentifierWithoutHyphen
| StringLiteral
;
objectPathComponent
: <assoc=left> objectPathComponent objectPathComponent # pathStep
| '.' (IdentifierWithoutHyphen | StringLiteral) # keyPathStep
| LBRACK (IntPosLiteral|IntNegLiteral|ASTERISK) RBRACK # indexPathStep
;
setLiteral
: LPAREN RPAREN
| LPAREN primitiveLiteral (COMMA primitiveLiteral)* RPAREN
;
primitiveLiteral
: orderableLiteral
| BoolLiteral
| edgeCases
;
edgeCases
: QUOTE (IdentifierWithHyphen | IdentifierWithoutHyphen | IntNoSign) RBRACK
| RBRACK
;
orderableLiteral
: IntPosLiteral
| IntNegLiteral
| FloatPosLiteral
| FloatNegLiteral
| StringLiteral
| BinaryLiteral
| HexLiteral
| TimestampLiteral
;
IntNegLiteral :
'-' ('0' | [1-9] [0-9]*)
;
IntNoSign :
('0' | [1-9] [0-9]*)
;
IntPosLiteral :
'+'? ('0' | [1-9] [0-9]*)
;
FloatNegLiteral :
'-' [0-9]* '.' [0-9]+
;
FloatPosLiteral :
'+'? [0-9]* '.' [0-9]+
;
HexLiteral :
'h' QUOTE TwoHexDigits* QUOTE
;
BinaryLiteral :
'b' QUOTE
( Base64Char Base64Char Base64Char Base64Char )*
( (Base64Char Base64Char Base64Char Base64Char )
| (Base64Char Base64Char Base64Char ) '='
| (Base64Char Base64Char ) '=='
)
QUOTE
;
StringLiteral :
QUOTE ( ~['\\] | '\\\'' | '\\\\' )* QUOTE
;
BoolLiteral :
TRUE | FALSE
;
TimestampLiteral :
't' QUOTE
[0-9] [0-9] [0-9] [0-9] HYPHEN
( ('0' [1-9]) | ('1' [012]) ) HYPHEN
( ('0' [1-9]) | ([12] [0-9]) | ('3' [01]) )
'T'
( ([01] [0-9]) | ('2' [0-3]) ) COLON
[0-5] [0-9] COLON
([0-5] [0-9] | '60')
(DOT [0-9]+)?
'Z'
QUOTE
;
//////////////////////////////////////////////
// Keywords
AND: 'AND' ;
OR: 'OR' ;
NOT: 'NOT' ;
FOLLOWEDBY: 'FOLLOWEDBY';
LIKE: 'LIKE' ;
MATCHES: 'MATCHES' ;
ISSUPERSET: 'ISSUPERSET' ;
ISSUBSET: 'ISSUBSET' ;
LAST: 'LAST' ;
IN: 'IN' ;
START: 'START' ;
STOP: 'STOP' ;
SECONDS: 'SECONDS' ;
TRUE: 'true' ;
FALSE: 'false' ;
WITHIN: 'WITHIN' ;
REPEATS: 'REPEATS' ;
TIMES: 'TIMES' ;
// After keywords, so the lexer doesn't tokenize them as identifiers.
// Object types may have unquoted hyphens, but property names
// (in object paths) cannot.
IdentifierWithoutHyphen :
[a-zA-Z_] [a-zA-Z0-9_]*
;
IdentifierWithHyphen :
[a-zA-Z_] [a-zA-Z0-9_-]*
;
EQ : '=' | '==';
NEQ : '!=' | '<>';
LT : '<';
LE : '<=';
GT : '>';
GE : '>=';
QUOTE : '\'';
COLON : ':' ;
DOT : '.' ;
COMMA : ',' ;
RPAREN : ')' ;
LPAREN : '(' ;
RBRACK : ']' ;
LBRACK : '[' ;
PLUS : '+' ;
HYPHEN : MINUS ;
MINUS : '-' ;
POWER_OP : '^' ;
DIVIDE : '/' ;
ASTERISK : '*';
EQRBRAC : ']';
fragment HexDigit: [A-Fa-f0-9];
fragment TwoHexDigits: HexDigit HexDigit;
fragment Base64Char: [A-Za-z0-9+/];
// Whitespace and comments
//
WS : [ \t\r\n\u000B\u000C\u0085\u00a0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]+ -> skip
;
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
// Catch-all to prevent lexer from silently eating unusable characters.
InvalidCharacter
: .
;
Now when I feed [network-traffic:src_port = '123] I expect antlr to parse the query to '123 and ]
However the grammar return '123] and is not able to separate '123 and ]
Am missing anything?
grammar does not separate '123 and ] though the rule is set for it
That is not true. The quote and 123 are separate tokens. As demonstrated/suggested in your previous ANTLR question: start by printing all the tokens to your console to see what tokens are being created. This should always be the first thing you do when trying to debug an ANTLR grammar. It will save you a lot of time and headache.
The fact [network-traffic:src_port = '123] is not parsed properly, is because the ](RBRACK) is being consumed by the alternative observationExpressionSimple:
observationExpression
: LBRACK comparisonExpression RBRACK # observationExpressionSimple
| LPAREN observationExpressions RPAREN # observationExpressionCompound
| observationExpression startStopQualifier # observationExpressionStartStop
| observationExpression withinQualifier # observationExpressionWithin
| observationExpression repeatedQualifier # observationExpressionRepeated
;
Because RBRACK was already consumed by a parser rule, the edgeCases rule can't consume this RBRACK token as well.
To fix this, change your rule:
edgeCases
: QUOTE (IdentifierWithHyphen | IdentifierWithoutHyphen | IntNoSign) RBRACK
| RBRACK
;
into this:
edgeCases
: QUOTE (IdentifierWithHyphen | IdentifierWithoutHyphen | IntNoSign)
;
Now [network-traffic:src_port = '123] will be parsed properly:

ANTLR4 Token is not recognized when substituted

I try to modify the grammar of the sqlite syntax (I'm interested in a variant of the where clause only) and I'm keep having a weird error when substituting AND to it's own token.
grammar wtfql;
/*
SQLite understands the following binary operators, in order from highest to
lowest precedence:
||
* / %
+ -
<< >> & |
< <= > >=
= != <> IS IS NOT IN LIKE GLOB MATCH REGEXP
AND
OR
*/
start : expr EOF?;
expr
: literal_value
//BIND_PARAMETER
| ( table_name '.' )? column_name
| unary_operator expr
| expr '||' expr
| expr ( '*' | '/' | '%' ) expr
| expr ( '+' | '-' ) expr
| expr ( '<' | '<=' | '>' | '>=' ) expr
| expr ( '=' | '<>' | K_IN ) expr
| expr K_AND expr
| expr K_OR expr
| function_name '(' ( expr ( ',' expr )* )? ')'
| '(' expr ')'
| expr K_NOT expr
| expr ( K_NOT K_NULL )
| expr K_NOT? K_IN ( '(' ( expr ( ',' expr )* ) ')' )
;
unary_operator
: '-'
| '+'
| K_NOT
;
literal_value
: NUMERIC_LITERAL
| STRING_LITERAL
| K_NULL
;
function_name
: IDENTIFIER
;
table_name
: any_name
;
column_name
: any_name
;
any_name
: IDENTIFIER
| keyword
// | '(' any_name ')'
;
keyword
: K_AND
| K_NOT
| K_NULL
| K_IN
| K_OR
;
IDENTIFIER
: [a-zA-Z_] [a-zA-Z_0-9]* // TODO check: needs more chars in set
;
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )?
;
STRING_LITERAL
: '\"' ( ~'\"' | '\"\"' )* '\"'
;
SPACES
: [ \u000B\t\r\n] -> channel(HIDDEN)
;
DOT : '.';
OPEN_PAR : '(';
CLOSE_PAR : ')';
COMMA : ',';
STAR : '*';
PLUS : '+';
MINUS : '-';
TILDE : '~';
DIV : '/';
MOD : '%';
AMP : '&';
PIPE : '|';
LT : '<';
LT_EQ : '<=';
GT : '>';
GT_EQ : '>=';
EQ : '=';
NOT_EQ2 : '<>';
K_AND : A N D;
K_NOT : N O T;
K_NULL : N U L L;
K_OR : O R;
K_IN : I N;
fragment DIGIT : [0-9];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];
writing
| expr K_AND expr
with the input
field1=1 and field2 = 2
results in
line 1:8 mismatched input 'and' expecting {<EOF>, '||', '*', '+', '-', '/', '%', '<', '<=', '>', '>=', '=', '<>', K_AND, K_NOT, K_OR, K_IN}
while
| expr 'and' expr
works like a charm:
$ antlr4 wtfql.g4 && javac -classpath /usr/local/Cellar/antlr/4.4/antlr-4.4-complete.jar wtfql*.java && cat test.txt | grun wtfql start -tree -gui
(start (expr (expr (expr (column_name (any_name feld1))) = (expr (literal_value 1))) and (expr (expr (column_name (any_name feld2))) = (expr (literal_value 2)))) <EOF>)
What am I missing?
I presume "and" is an IDENTIFIER since the rule for IDENTIFIER comes before the rule for AND and thus wins.
If you write 'and' in the parser rule this implicitly creates a token (not AND!) which comes before IDENTIFIER and thus wins.
Rule of thumb: More specific lexer rules first. Don't create new lexer tokens implicitly in parser rules.
If you check the token type, you'll get a clue what's going on.

Trying to resolve left-recursion trying to build Parser with ANTLR

I’m currently trying to build a parser for the language Oberon using Antlr and Ecplise.
This is what I have got so far:
grammar oberon;
options
{
language = Java;
//backtrack = true;
output = AST;
}
#parser::header {package dhbw.Oberon;}
#lexer::header {package dhbw.Oberon; }
T_ARRAY : 'ARRAY' ;
T_BEGIN : 'BEGIN';
T_CASE : 'CASE' ;
T_CONST : 'CONST' ;
T_DO : 'DO' ;
T_ELSE : 'ELSE' ;
T_ELSIF : 'ELSIF' ;
T_END : 'END' ;
T_EXIT : 'EXIT' ;
T_IF : 'IF' ;
T_IMPORT : 'IMPORT' ;
T_LOOP : 'LOOP' ;
T_MODULE : 'MODULE' ;
T_NIL : 'NIL' ;
T_OF : 'OF' ;
T_POINTER : 'POINTER' ;
T_PROCEDURE : 'PROCEDURE' ;
T_RECORD : 'RECORD' ;
T_REPEAT : 'REPEAT' ;
T_RETURN : 'RETURN';
T_THEN : 'THEN' ;
T_TO : 'TO' ;
T_TYPE : 'TYPE' ;
T_UNTIL : 'UNTIL' ;
T_VAR : 'VAR' ;
T_WHILE : 'WHILE' ;
T_WITH : 'WITH' ;
module : T_MODULE ID SEMI importlist? declarationsequence?
(T_BEGIN statementsequence)? T_END ID PERIOD ;
importlist : T_IMPORT importitem (COMMA importitem)* SEMI ;
importitem : ID (ASSIGN ID)? ;
declarationsequence :
( T_CONST (constantdeclaration SEMI)*
| T_TYPE (typedeclaration SEMI)*
| T_VAR (variabledeclaration SEMI)*)
(proceduredeclaration SEMI | forwarddeclaration SEMI)*
;
constantdeclaration: identifierdef EQUAL expression ;
identifierdef: ID MULT? ;
expression: simpleexpression (relation simpleexpression)? ;
simpleexpression : (PLUS|MINUS)? term (addoperator term)* ;
term: factor (muloperator factor)* ;
factor: number
| stringliteral
| T_NIL
| set
| designator '(' explist? ')'
;
number: INT | HEX ; // TODO add real
stringliteral : '"' ( ~('\\'|'"') )* '"' ;
set: '{' elementlist? '}' ;
elementlist: element (COMMA element)* ;
element: expression (RANGESEP expression)? ;
designator: qualidentifier
('.' ID
| '[' explist ']'
| '(' qualidentifier ')'
| UPCHAR )+
;
explist: expression (COMMA expression)* ;
actualparameters: '(' explist? ')' ;
muloperator: MULT | DIV | MOD | ET ;
addoperator: PLUS | MINUS | OR ;
relation: EQUAL ; // TODO
typedeclaration: ID EQUAL type ;
type: qualidentifier
| arraytype
| recordtype
| pointertype
| proceduretype
;
qualidentifier: (ID '.')* ID ;
arraytype: T_ARRAY expression (',' expression) T_OF type;
recordtype: T_RECORD ('(' qualidentifier ')')? fieldlistsequence T_END ;
fieldlistsequence: fieldlist (SEMI fieldlist) ;
fieldlist: (identifierlist COLON type)? ;
identifierlist: identifierdef (COMMA identifierdef)* ;
pointertype: T_POINTER T_TO type ;
proceduretype: T_PROCEDURE formalparameters? ;
variabledeclaration: identifierlist COLON type ;
proceduredeclaration: procedureheading SEMI procedurebody ID ;
procedureheading: T_PROCEDURE MULT? identifierdef formalparameters? ;
formalparameters: '(' params? ')' (COLON qualidentifier)? ;
params: fpsection (SEMI fpsection)* ;
fpsection: T_VAR? idlist COLON formaltype ;
idlist: ID (COMMA ID)* ;
formaltype: (T_ARRAY T_OF)* (qualidentifier | proceduretype);
procedurebody: declarationsequence (T_BEGIN statementsequence)? T_END ;
forwarddeclaration: T_PROCEDURE UPCHAR? ID MULT? formalparameters? ;
statementsequence: statement (SEMI statement)* ;
statement : assignment
| procedurecall
| ifstatement
| casestatement
| whilestatement
| repeatstatement
| loopstatement
| withstatement
| T_EXIT
| T_RETURN expression?
;
assignment: designator ASSIGN expression ;
procedurecall: designator actualparameters? ;
ifstatement: T_IF expression T_THEN statementsequence
(T_ELSIF expression T_THEN statementsequence)*
(T_ELSE statementsequence)? T_END ;
casestatement: T_CASE expression T_OF caseitem ('|' caseitem)*
(T_ELSE statementsequence)? T_END ;
caseitem: caselabellist COLON statementsequence ;
caselabellist: caselabels (COMMA caselabels)* ;
caselabels: expression (RANGESEP expression)? ;
whilestatement: T_WHILE expression T_DO statementsequence T_END ;
repeatstatement: T_REPEAT statementsequence T_UNTIL expression ;
loopstatement: T_LOOP statementsequence T_END ;
withstatement: T_WITH qualidentifier COLON qualidentifier T_DO statementsequence T_END ;
ID : ('a'..'z'|'A'..'Z')('a'..'z'|'A'..'Z'|'_'|'0'..'9')* ;
fragment DIGIT : '0'..'9' ;
INT : ('-')?DIGIT+ ;
fragment HEXDIGIT : '0'..'9'|'A'..'F' ;
HEX : HEXDIGIT+ 'H' ;
ASSIGN : ':=' ;
COLON : ':' ;
COMMA : ',' ;
DIV : '/' ;
EQUAL : '=' ;
ET : '&' ;
MINUS : '-' ;
MOD : '%' ;
MULT : '*' ;
OR : '|' ;
PERIOD : '.' ;
PLUS : '+' ;
RANGESEP : '..' ;
SEMI : ';' ;
UPCHAR : '^' ;
WS : ( ' ' | '\t' | '\r' | '\n'){skip();};
My problem is when I check the grammar I get the following error and just can’t find an appropriate way to fix this:
rule statement has non-LL(*) decision
due to recursive rule invocations reachable from alts 1,2.
Resolve by left-factoring or using syntactic predicates
or using backtrack=true option.
|---> statement : assignment
Also I have the problem with declarationsequence and simpleexpression.
When I use options { … backtrack = true; … } it at least compiles, but obviously doesn’t work right anymore when I run a test-file, but I can’t find a way to resolve the left-recursion on my own (or maybe I’m just too blind at the moment because I’ve looked at this for far too long now). Any ideas how I could change the lines where the errors occurs to make it work?
EDIT
I could fix one of the three mistakes. statement works now. The problem was that assignment and procedurecall both started with designator.
statement : procedureassignmentcall
| ifstatement
| casestatement
| whilestatement
| repeatstatement
| loopstatement
| withstatement
| T_EXIT
| T_RETURN expression?
;
procedureassignmentcall : (designator ASSIGN)=> assignment | procedurecall;
assignment: designator ASSIGN expression ;
procedurecall: designator actualparameters? ;