ANTLR4 mismatched input '<' expecting '<' with (seemingly) no lexer ambiguity

I can't figure out what ANTLR is doing with this grammar. It should match input like:
i,j : bool;
setvar : set<bool>;
i > 5;
j < 10;
But I keep getting the error "line 3:13 mismatched input '<' expecting '<'". This suggests some ambiguity in the lexer, but I only use '<' in a single token.
Here is the grammar:
//// Parser Rules
grammar MLTL1;
start: block*;
block: var_list ';'
| expr ';'
;
var_list: IDENTIFIER (',' IDENTIFIER)* ':' type ;
type: BASE_TYPE
| KW_SET REL_LT BASE_TYPE REL_GT
;
expr: expr REL_OP expr
| '(' expr ')'
| IDENTIFIER
| INT
;
//// Lexical Spec
// Types
BASE_TYPE: 'bool'
| 'int'
| 'float'
;
// Keywords
KW_SET: 'set' ;
// Op groups for precedence
REL_OP: REL_EQ | REL_NEQ | REL_GT | REL_LT
| REL_GTE | REL_LTE ;
// Relational ops
REL_EQ: '==' ;
REL_NEQ: '!=' ;
REL_GT: '>' ;
REL_LT: '<' ;
REL_GTE: '>=' ;
REL_LTE: '<=' ;
IDENTIFIER
: LETTER (LETTER | DIGIT)*
;
INT
: SIGN? NONZERODIGIT DIGIT*
| '0'
;
fragment
SIGN
: [+-]
;
fragment
DIGIT
: [0-9]
;
fragment
NONZERODIGIT
: [1-9]
;
fragment
LETTER
: [a-zA-Z_]
;
COMMENT : '#' ~[\r\n]* -> skip;
WS : [ \t\r\n]+ -> channel(HIDDEN);
I tested the grammar with this Python script to see which tokens it generates for the test input above:
from antlr4 import InputStream, CommonTokenStream
import MLTL1Lexer
import MLTL1Parser
input="""
i,j : bool;
setvar: set<bool>;
i > 5;
j < 10;
"""
lexer = MLTL1Lexer.MLTL1Lexer(InputStream(input))
stream = CommonTokenStream(lexer)
stream.fill()
tokens = stream.getTokens(0,100)
for t in tokens:
    print(str(t.type) + " " + t.text)
parser = MLTL1Parser.MLTL1Parser(stream)
parse_tree = parser.start()
print(parse_tree.toStringTree(recog=parser))
I noticed that both '>' and '<' were assigned the same token type despite being two different tokens. Am I missing something here?

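(There may be more than just these two instances, but...)
Change REL_OP and BASE_TYPE to parser rules (i.e. make them lowercase).
As you've used them, you're effectively turning many of your intended lexer rules into fragments.
It's important to understand that tokens are the "atoms" of your grammar. When you combine several of them into another lexer rule, that combined rule becomes the token type. Here REL_OP is defined before REL_LT and matches the same text, and when two lexer rules match the same input the one defined first wins, so '<' is always emitted as REL_OP and never as REL_LT. The parser then sees a REL_OP token with the text '<' where it expects REL_LT (which is displayed as '<'), hence the confusing message.
(If you used grun to dump the tokens, you would have seen them identified as REL_OP tokens.)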
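To see why '<' ends up as REL_OP, here is a minimal Python sketch of the matching rule as I understand it: the longest match wins, and among equally long matches the rule listed first wins. The rule table and the tokenize helper are simplified stand-ins I made up for illustration, not the ANTLR runtime.

```python
import re

# Simplified stand-ins for the lexer rules in the question, in grammar order.
ORIGINAL_RULES = [
    ("BASE_TYPE", r"bool|int|float"),
    ("KW_SET", r"set"),
    ("REL_OP", r"==|!=|>=|<=|>|<"),   # defined before the individual operators
    ("REL_EQ", r"=="),
    ("REL_NEQ", r"!="),
    ("REL_GT", r">"),
    ("REL_LT", r"<"),
    ("REL_GTE", r">="),
    ("REL_LTE", r"<="),
    ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z_0-9]*"),
]

# The fix: REL_OP is no longer a lexer rule (it becomes a parser rule instead).
FIXED_RULES = [rule for rule in ORIGINAL_RULES if rule[0] != "REL_OP"]


def tokenize(text, rules):
    """Longest match wins; on a tie, the rule listed first wins."""
    pos, tokens = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None  # (match length, rule name)
        for name, pattern in rules:
            match = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so earlier rules keep priority on ties.
            if match and (best is None or match.end() - pos > best[0]):
                best = (match.end() - pos, name)
        if best is None:
            raise ValueError(f"no rule matches {text[pos]!r}")
        length, name = best
        tokens.append((name, text[pos:pos + length]))
        pos += length
    return tokens


print(tokenize("set<bool>", ORIGINAL_RULES))
print(tokenize("set<bool>", FIXED_RULES))
```

With the original rule order both angle brackets come out as REL_OP; once REL_OP is removed from the lexer they come out as REL_LT and REL_GT, which is what the type rule expects.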
With the changes below, your sample input works just fine.
grammar MLTL1;
start: block*;
block: var_list ';' | expr ';';
var_list: IDENTIFIER (',' IDENTIFIER)* ':' type;
type: baseType | KW_SET REL_LT baseType REL_GT;
expr: expr rel_op expr | '(' expr ')' | IDENTIFIER | INT;
//// Lexical Spec
// Types
baseType: 'bool' | 'int' | 'float';
// Keywords
KW_SET: 'set';
// Op groups for precedence
rel_op: REL_EQ | REL_NEQ | REL_GT | REL_LT | REL_GTE | REL_LTE;
// Relational ops
REL_EQ: '==';
REL_NEQ: '!=';
REL_GT: '>';
REL_LT: '<';
REL_GTE: '>=';
REL_LTE: '<=';
IDENTIFIER: LETTER (LETTER | DIGIT)*;
INT: SIGN? NONZERODIGIT DIGIT* | '0';
fragment SIGN: [+-];
fragment DIGIT: [0-9];
fragment NONZERODIGIT: [1-9];
fragment LETTER: [a-zA-Z_];
COMMENT: '#' ~[\r\n]* -> skip;
WS: [ \t\r\n]+ -> channel(HIDDEN);

Related

ANTLR grammar issues with negative numbers

My ANTLR grammar for simple expressions is shown below.
It works for most scenarios, except when I try to use negative numbers:
abs(1.324) is valid
abs(-1.324) is reported as an error.
If the expression is just a negative number, such as -1.344, I also get an error in the console.
grammar ExpressionGrammar;
parse: expr EOF;
expr:
MIN expr
| expr ( MUL | DIV) expr
| expr ( ADD | MIN) expr
| expr ( MOD ) expr
| NUM
| ID
| STRING
| function
| '(' expr ')';
function: ID '(' arguments? ')';
arguments: expr ( ',' expr)*;
/* Tokens */
MUL: '*';
DIV: '/';
MIN: '-';
ADD: '+';
MOD: '%';
OPEN_PAR: '(';
CLOSE_PAR: ')';
NUM: ([0-9]*[.])?[0-9]+;
STRING: '"' ~ ["]* '"';
fragment ID_NODE: [a-zA-Z_$][a-zA-Z0-9_$]*;
ID: ID_NODE ('.' ID_NODE)*;
COMMENT: '/*' .*? '*/' -> skip;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
WS: [ \r\t\n]+ -> skip;
The grammar seems fine to me. It could be a bug with the runtime you're using, but that seems odd to me, given you're not doing anything special.
With the Java runtime, this is what I get when parsing/lexing the input abs(-1.324):
The following tokens are produced:
ID `abs`
OPEN_PAR `(`
MIN `-`
NUM `1.324`
CLOSE_PAR `)`
EOF `<EOF>`
and the entry point parse gives: [parse tree image]

How to enable combination of orExpression and andExpression rule in ANTLR?

I want to parse the following with antlr4
isSet(foo) or isSet(bar) and isSet(test)
In the parse tree I can see that only the first or is recognized. I can add more or's and the parse tree grows, but an additional and is not recognized. How can I define this in the grammar?
This is my current grammar file:
grammar Expr;
prog: (stat)+;
stat: (command | orExpression | andExpression | notExpression)+;
orExpression: command ( OR command | XOR command)*;
andExpression:command ( AND command)*;
notExpression:NOT command;
command:IS_SET LPAREN parameter RPAREN
| IS_EMPTY LPAREN parameter RPAREN;
parameter: ID;
LPAREN : '(';
RPAREN : ')';
LBRACE : '{';
RBRACE : '}';
LBRACK : '[';
RBRACK : ']';
SEMI : ';';
COMMA : ',';
DOT : '.';
ASSIGN : '=';
GT : '>';
LT : '<';
BANG : '!';
TILDE : '~';
QUESTION : '?';
COLON : ':';
EQUAL : '==';
LE : '<=';
GE : '>=';
NOTEQUAL : '!=';
AND : 'and';
OR : 'or';
XOR :'xor';
NOT :'not' ;
INC : '++';
DEC : '--';
ADD : '+';
SUB : '-';
MUL : '*';
DIV : '/';
INT: [0-9]+;
NEWLINE: '\r'? '\n';
IS_SET:'isSet';
IS_EMPTY:'isEmpty';
WS: [\t]+ -> skip;
ID
: JavaLetter JavaLetterOrDigit*
;
fragment
JavaLetter
: [a-zA-Z$_] // these are the "java letters" below 0xFF
| // covers all characters above 0xFF which are not a surrogate
~[\u0000-\u00FF\uD800-\uDBFF]
{Character.isJavaIdentifierStart(_input.LA(-1))}?
| // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
[\uD800-\uDBFF] [\uDC00-\uDFFF]
{Character.isJavaIdentifierStart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?
;
fragment
JavaLetterOrDigit
: [a-zA-Z0-9$_] // these are the "java letters or digits" below 0xFF
| // covers all characters above 0xFF which are not a surrogate
~[\u0000-\u00FF\uD800-\uDBFF]
{Character.isJavaIdentifierPart(_input.LA(-1))}?
| // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
[\uD800-\uDBFF] [\uDC00-\uDFFF]
{Character.isJavaIdentifierPart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?
;
Here you can see the parse tree, with the missing andExpression
Only the first part is parsed because the rule prog: (stat)+; is only told to parse at least 1 stat, which it does. If you want the parser to process all tokens, "anchor" your start rule with the EOF token:
prog : stat+ EOF;
And now your input isSet(foo) or isSet(bar) and isSet(test) will produce an error message. The first part, isSet(foo) or isSet(bar), is still recognised as an orExpression, but the last part, and isSet(test), cannot be matched. The general idea is to layer the rules by precedence, so that each rule is built from the next tighter-binding one:
prog : stat+ EOF;
stat : orExpression+;
orExpression : andExpression ( OR andExpression | XOR andExpression)*;
andExpression : notExpression ( AND notExpression)*;
notExpression : NOT? command;
command : IS_SET LPAREN parameter RPAREN
| IS_EMPTY LPAREN parameter RPAREN;
parameter : ID;
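The layered rules above map almost one-to-one onto a small recursive-descent parser, which may make the precedence easier to see. This is only a sketch under my own assumptions: the tokenizer is a bare regex that silently skips unknown characters, it parses a single expression rather than stat+, and the tuple-based tree format is made up for illustration.

```python
import re


def parse(text):
    """Parse one expression using the layering or -> and -> not -> command."""
    tokens = re.findall(r"[A-Za-z_$][A-Za-z0-9_$]*|[()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def or_expr():
        # orExpression : andExpression ( OR andExpression | XOR andExpression )*
        node = and_expr()
        while peek() in ("or", "xor"):
            op = take()
            node = (op, node, and_expr())
        return node

    def and_expr():
        # andExpression : notExpression ( AND notExpression )*
        node = not_expr()
        while peek() == "and":
            take()
            node = ("and", node, not_expr())
        return node

    def not_expr():
        # notExpression : NOT? command
        if peek() == "not":
            take()
            return ("not", command())
        return command()

    def command():
        name = take()
        if name not in ("isSet", "isEmpty"):
            raise SyntaxError(f"expected a command, got {name!r}")
        take("(")
        param = take()  # parameter : ID (not validated further in this sketch)
        take(")")
        return (name, param)

    tree = or_expr()
    if peek() is not None:  # plays the role of the EOF anchor
        raise SyntaxError(f"unexpected trailing input {peek()!r}")
    return tree


print(parse("isSet(foo) or isSet(bar) and isSet(test)"))
```

Because or_expr only ever combines results of and_expr, the and part ends up nested below the or node, which is the "and binds tighter than or" behaviour the grammar is after.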
But ANTLR4 supports direct left recursive rules, so you could also write the rules above like this:
prog: expr+ EOF;
expr
: NOT expr #NotExpr
| expr AND expr #AndExpr
| expr (OR | XOR) expr #OrExpr
| IS_SET LPAREN expr RPAREN #CommandExpr
| ID #IdExpr
;
which is, IMO, much nicer.

How to make ANTLR throw an exception on invalid input

I have the following grammar:
grammar Expr;
expr : '-' expr # unaryOpExpr
| expr ('*'|'/'|'%') expr # mulDivModuloExpr
| expr ('+'|'-') expr # addSubExpr
| '(' expr ')' # nestedExpr
| IDENTIFIER '(' fnArgs? ')' # functionExpr
| IDENTIFIER # identifierExpr
| DOUBLE # doubleExpr
| LONG # longExpr
| STRING # string
;
fnArgs : expr (',' expr)* # functionArgs
;
IDENTIFIER : [_$a-zA-Z][_$a-zA-Z0-9]* | '"' (ESC | ~ ["\\])* '"';
LONG : [0-9]+;
DOUBLE : [0-9]+ '.' [0-9]*;
WS : [ \t\r\n]+ -> skip ;
STRING: '"' (~["\\\r\n] | ESC)* '"';
fragment ESC : '\\' (['"\\/bfnrt] | UNICODE) ;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
MINUS : '-' ;
MUL : '*' ;
DIV : '/' ;
MODULO : '%' ;
PLUS : '+' ;
// math function
MAX: 'MAX';
When I enter the following text, it is accepted, as expected:
-1.1
But when I enter the following text:
-1.1ffff
I think ANTLR should report an error, but it doesn't. It parses the leading "-1.1" and discards "ffff".
I want to change this behavior so that the invalid input is not discarded and an exception is thrown
reporting the invalid token.
What should I do? Thanks for your advice.
Are you using expr as your main rule? If so, add another rule, called something like parse or program, and write it like this:
parse: expr EOF;
This makes ANTLR stop ignoring trailing tokens that don't make sense and report a syntax error for them instead.
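The effect of the EOF anchor can be illustrated with a tiny hand-written parser. This is a sketch only: the token regex and the reduced rule (expr : '-' expr | atom) are my simplifications of the grammar in the question, and the error text is illustrative rather than ANTLR's exact wording.

```python
import re

# Rough stand-ins for DOUBLE, LONG, IDENTIFIER and the operator literals.
TOKEN = re.compile(r"\d+\.\d*|\d+|[_$a-zA-Z][_$a-zA-Z0-9]*|[-+*/%()]")
OPERATORS = set("-+*/%()")


def parse_expr(tokens, pos=0):
    """expr : '-' expr | atom ; returns the position just after the expression."""
    if pos < len(tokens) and tokens[pos] == "-":
        return parse_expr(tokens, pos + 1)
    if pos < len(tokens) and tokens[pos] not in OPERATORS:
        return pos + 1
    raise SyntaxError(f"unexpected input at token index {pos}")


def parse_without_eof(text):
    """Start rule 'expr': stops after the first complete expression."""
    tokens = TOKEN.findall(text)
    return tokens[:parse_expr(tokens)]  # trailing tokens are silently left over


def parse_with_eof(text):
    """Start rule 'expr EOF': every token must be consumed."""
    tokens = TOKEN.findall(text)
    end = parse_expr(tokens)
    if end != len(tokens):
        raise SyntaxError(f"extraneous input {tokens[end]!r}, expecting <EOF>")
    return tokens


print(parse_without_eof("-1.1ffff"))  # only the tokens that were consumed
try:
    parse_with_eof("-1.1ffff")
except SyntaxError as error:
    print("rejected:", error)
```

Note that ANTLR itself, by default, reports syntax errors and tries to recover rather than throwing. If you need an actual exception, the Java runtime ships a BailErrorStrategy that throws instead of recovering, and a custom error listener that throws is another common approach.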

Decision can match input such as "MULOP LETTER" using multiple alternatives: 1, 2

I'm getting this error
[22:52:55] warning(200): ProjLang.g:53:30:
Decision can match input such as "MULOP LETTER" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
From my googling, it seems there is some ambiguity in my grammar, but I don't know how to remove it.
fragment LETTER
: ('a'..'z') | ('A'..'Z');
fragment DIGIT
: ('0'..'9');
ADDOP : ('+'|'-'|'|')
;
MULOP : ('*'|'/'|'&')
;
RELOP : ('<'|'=')
;
LPAR : '(';
RPAR : ')';
BOOL : 'true'|'false';
LCURL : '{';
RCURL : '}';
// parser rules: non terminal states should be lowercase
input : expr EOF
;
expr : 'if' expr 'then' expr 'else' expr
| 'let'( 'val' id RELOP expr 'in' expr 'end'|'fun' id LPAR id RPAR RELOP expr 'in' expr 'end')
| 'while' expr 'do' expr
| LCURL expr (';'expr)* RCURL
| '!'expr
| id ':=' expr
| relexpr
;
relexpr
: arithexpr (RELOP arithexpr)?
;
arithexpr
: term (MULOP term)*
;
term : factor (MULOP factor)*
;
factor : num
| BOOL
| id (LPAR expr RPAR)?
| LPAR expr RPAR
;
id : LETTER (LETTER | DIGIT)*;
num : DIGIT+;
I would like a grammar that produces no such warnings, so that I can generate a lexer and a parser from it.
These two rules do essentially the same thing:
arithexpr
: term (MULOP term)*
;
term : factor (MULOP factor)*
;
If you combine them you will get:
arithexpr: factor (MULOP factor)* (MULOP factor (MULOP factor)*)*
which contains an ambiguity (which of the two MULOP tokens should be matched after the initial factor?). But from the rewrite it's easy to see what to do:
arithexpr: factor (MULOP factor)*;
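which replaces the original term and arithexpr rules. If arithexpr was actually meant to handle the additive operators (ADDOP is defined but never used), keep both rules and write arithexpr: term (ADDOP term)*; instead.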
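To make the ambiguity concrete, the following sketch counts how many different parse trees the nested rules allow for an input of n factors joined by MULOP. It is a back-of-the-envelope model I wrote for illustration (the function names are hypothetical), not something ANTLR computes.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_nested(n_factors):
    """Number of parses of 'factor (MULOP factor)*' with n_factors factors under
    arithexpr : term (MULOP term)* ;  term : factor (MULOP factor)* ;
    Each parse is a way of cutting the factor sequence into consecutive terms."""
    if n_factors == 0:
        return 0
    total = 0
    for first_term_size in range(1, n_factors + 1):
        rest = n_factors - first_term_size
        # Either the first term swallows everything, or the rest is split further.
        total += 1 if rest == 0 else count_nested(rest)
    return total


def count_flat(n_factors):
    """Number of parses under the single rule arithexpr : factor (MULOP factor)* ;"""
    return 1 if n_factors >= 1 else 0


for n in range(1, 5):
    print(n, count_nested(n), count_flat(n))
```

Already with two factors (the "MULOP LETTER" case from the warning) the nested rules allow two parses, and the count doubles with every extra factor, while the combined rule always has exactly one.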

String interpolation: is it possible without adding members?

I would like to create an ANTLR parser for a custom language and decided to pick a simple calculator as an example. In my new grammar it should be possible to define a string, like this:
s = "Hello, I am a string"
and handle string interpolation.
Text enclosed in percent signs inside a double-quoted string should be treated as interpolated, e.g.
s = "Hello, did you know that %2 + 2% is 4?"
A double percent sign should not be processed, e.g.
s = "He wants 50%% of this deal."
But at the same time my calculator should support modulus operation:
x = 5 % 2
So far, I have been able to write a lexer and a parser grammar that switch modes and parse simple strings. Here they are:
lexer grammar CalcLexer;
EQ: '=';
PLUS: '+';
MINUS: '-';
MULT: '*';
DIV: '/';
LPAREN : '(' ;
RPAREN : ')' ;
SINGLE_PERCENT_POP: '%' -> popMode;
ID : [a-zA-Z]+ ;
INT : [0-9]+ ;
OPEN_DOUBLE_QUOTE: '"' -> pushMode(STRING_MODE);
NEWLINE:'\r'? '\n' ;
WS : [ \t]+ -> skip;
mode STRING_MODE;
DOUBLE_PERCENT: '%%';
SINGLE_PERCENT: '%' -> pushMode(DEFAULT_MODE);
TEXT: ~('%'|'\n'|'"')+;
CLOSE_DOUBLE_QUOTE: '"' -> popMode;
and
parser grammar CalcGrammar;
options { tokenVocab=CalcLexer; } // use tokens from CalcLexer.g4
prog: stat+ ;
stat: expr NEWLINE
| ID EQ (expr|text) NEWLINE
| NEWLINE
;
text: OPEN_DOUBLE_QUOTE content* CLOSE_DOUBLE_QUOTE;
content: DOUBLE_PERCENT | TEXT | SINGLE_PERCENT expr SINGLE_PERCENT_POP;
expr: expr (MULT|DIV) expr
| expr (PLUS|MINUS) expr
| INT
| ID
| LPAREN expr RPAREN
;
The only thing that doesn't work, and I'm not sure it is even possible to implement without custom code (members), is the modulus operation:
x = 5 % 2
There is no way I can ask ANTLR to check the previous mode and pop the mode safely.
But I hope my understanding is wrong and there is some way to treat the % sign as an operator in default mode.
I have found several sources for inspiration; they may help you as well:
Parsing string interpolation in ANTLR
ANTLR String interpolation
Parsing String Interpolations with ANTLR4
String interpolation and lexer modes
Murphy's law for StackOverflow: you will find the answer to your own question a few minutes after you post a detailed question to SO.
Instead of switching to DEFAULT_MODE, I should create a separate mode, STRING_INTERPOLATION. This means defining separate tokens for that mode, which lets me use the % sign as an operator in normal mode (and prohibits it as an operator inside an interpolation).
Here are the lexer and parser grammars that work for me:
lexer grammar CalcLexer;
EQ: '=';
PLUS: '+';
MINUS: '-';
MULT: '*';
DIV: '/';
MOD: '%';
LPAREN : '(' ;
RPAREN : ')' ;
ID : F_ID;
INT : F_INT;
fragment F_ID: [a-zA-Z]+ ;
fragment F_INT: [0-9]+ ;
OPEN_DOUBLE_QUOTE: '"' -> pushMode(STRING_MODE);
NEWLINE:'\r'? '\n' ;
WS : [ \t]+ -> skip;
mode STRING_MODE;
DOUBLE_PERCENT: '%%';
SINGLE_PERCENT: '%' -> pushMode(STRING_INTERPOLATION);
TEXT: ~('%'|'\n'|'"')+;
CLOSE_DOUBLE_QUOTE: '"' -> popMode;
mode STRING_INTERPOLATION;
SINGLE_PERCENT_POP: '%' -> popMode;
I_PLUS: PLUS -> type(PLUS);
I_MINUS: MINUS -> type(MINUS);
I_MULT: MULT -> type(MULT);
I_DIV: DIV -> type(DIV);
I_MOD: MOD -> type(MOD);
I_LPAREN: LPAREN -> type(LPAREN);
I_RPAREN: RPAREN -> type(RPAREN);
I_ID : F_ID -> type(ID);
I_INT : F_INT -> type(INT);
WS1 : [ \t]+ -> skip;
and
parser grammar CalcGrammar;
options { tokenVocab=CalcLexer; } // use tokens from CalcLexer.g4
prog: stat+ ;
stat: expr NEWLINE
| ID EQ (expr|text) NEWLINE
| NEWLINE
;
text: OPEN_DOUBLE_QUOTE content* CLOSE_DOUBLE_QUOTE;
content: DOUBLE_PERCENT | TEXT | SINGLE_PERCENT expr SINGLE_PERCENT_POP;
expr: expr (MULT|DIV|MOD) expr
| expr (PLUS|MINUS) expr
| INT
| ID
| LPAREN expr RPAREN
;
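For anyone who wants to see the mode switching step by step, here is a small Python model of the lexer above. It is only a sketch of how I understand the mode stack to behave (longest match wins, ties go to the earlier rule); the MODES table and the lex function are my own illustration, not generated ANTLR code.

```python
import re

# Rules shared by DEFAULT and STRING_INTERPOLATION, mirroring "-> type(...)".
EXPR_RULES = [
    ("PLUS", r"\+", None), ("MINUS", r"-", None),
    ("MULT", r"\*", None), ("DIV", r"/", None),
    ("LPAREN", r"\(", None), ("RPAREN", r"\)", None),
    ("ID", r"[a-zA-Z]+", None), ("INT", r"[0-9]+", None),
    ("WS", r"[ \t]+", "skip"),
]

# Each rule is (token name, regex, action); action is None, "skip", "pop"
# or ("push", mode name).
MODES = {
    "DEFAULT": [("EQ", r"=", None), ("MOD", r"%", None)] + EXPR_RULES
    + [("OPEN_DOUBLE_QUOTE", r'"', ("push", "STRING_MODE"))],
    "STRING_MODE": [
        ("DOUBLE_PERCENT", r"%%", None),
        ("SINGLE_PERCENT", r"%", ("push", "STRING_INTERPOLATION")),
        ("TEXT", r'[^%\n"]+', None),
        ("CLOSE_DOUBLE_QUOTE", r'"', "pop"),
    ],
    "STRING_INTERPOLATION": [("SINGLE_PERCENT_POP", r"%", "pop")] + EXPR_RULES,
}


def lex(text):
    stack, pos, tokens = ["DEFAULT"], 0, []
    while pos < len(text):
        best = None  # (end position, token name, action)
        for name, pattern, action in MODES[stack[-1]]:
            match = re.compile(pattern).match(text, pos)
            if match and (best is None or match.end() > best[0]):
                best = (match.end(), name, action)
        if best is None:
            raise ValueError(f"no rule matches {text[pos]!r} in mode {stack[-1]}")
        end, name, action = best
        if action != "skip":
            tokens.append((name, text[pos:end]))
        if action == "pop":
            stack.pop()
        elif isinstance(action, tuple):
            stack.append(action[1])
        pos = end
    return tokens


print(lex("x = 5 % 2"))
print(lex('s = "a %2 + 2% b %% c"'))
```

The same '%' character comes out as MOD in the default mode, as SINGLE_PERCENT when it opens an interpolation inside a string, and as SINGLE_PERCENT_POP when it closes one, which is exactly why the separate mode is needed.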
I hope this helps someone. Probably future me.