I want to parse the following with antlr4
isSet(foo) or isSet(bar) and isSet(test)
In the parse tree I can see that only the first or is recognized. I can add more or's and the parse tree grows, but an additional and is not recognized. How can I define this in the grammar?
This is my current grammar file:
grammar Expr;
prog: (stat)+;
stat: (command | orExpression | andExpression | notExpression)+;
orExpression: command ( OR command | XOR command)*;
andExpression:command ( AND command)*;
notExpression:NOT command;
command:IS_SET LPAREN parameter RPAREN
| IS_EMPTY LPAREN parameter RPAREN;
parameter: ID;
LPAREN : '(';
RPAREN : ')';
LBRACE : '{';
RBRACE : '}';
LBRACK : '[';
RBRACK : ']';
SEMI : ';';
COMMA : ',';
DOT : '.';
ASSIGN : '=';
GT : '>';
LT : '<';
BANG : '!';
TILDE : '~';
QUESTION : '?';
COLON : ':';
EQUAL : '==';
LE : '<=';
GE : '>=';
NOTEQUAL : '!=';
AND : 'and';
OR : 'or';
XOR :'xor';
NOT :'not' ;
INC : '++';
DEC : '--';
ADD : '+';
SUB : '-';
MUL : '*';
DIV : '/';
INT: [0-9]+;
NEWLINE: '\r'? '\n';
IS_SET:'isSet';
IS_EMPTY:'isEmpty';
WS: [\t]+ -> skip;
ID
: JavaLetter JavaLetterOrDigit*
;
fragment
JavaLetter
: [a-zA-Z$_] // these are the "java letters" below 0xFF
| // covers all characters above 0xFF which are not a surrogate
~[\u0000-\u00FF\uD800-\uDBFF]
{Character.isJavaIdentifierStart(_input.LA(-1))}?
| // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
[\uD800-\uDBFF] [\uDC00-\uDFFF]
{Character.isJavaIdentifierStart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?
;
fragment
JavaLetterOrDigit
: [a-zA-Z0-9$_] // these are the "java letters or digits" below 0xFF
| // covers all characters above 0xFF which are not a surrogate
~[\u0000-\u00FF\uD800-\uDBFF]
{Character.isJavaIdentifierPart(_input.LA(-1))}?
| // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
[\uD800-\uDBFF] [\uDC00-\uDFFF]
{Character.isJavaIdentifierPart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?
;
Here you can see the parse tree, with the missing andExpression
Only the first part is parsed because the rule prog: (stat)+; only requires at least one stat, which it finds; nothing forces the parser to consume the remaining tokens. If you want the parser to process all tokens, "anchor" your start rule with the EOF token:
prog : stat+ EOF;
And now your input isSet(foo) or isSet(bar) and isSet(test) will produce an error message. The first part, isSet(foo) or isSet(bar), is still recognised as an orExpression, but the last part, and isSet(test), cannot be matched. The general idea is to do something like this:
prog : stat+ EOF;
stat : orExpression+;
orExpression : andExpression ( OR andExpression | XOR andExpression)*;
andExpression : notExpression ( AND notExpression)*;
notExpression : NOT? command;
command : IS_SET LPAREN parameter RPAREN
| IS_EMPTY LPAREN parameter RPAREN;
parameter : ID;
But ANTLR4 supports direct left recursive rules, so you could also write the rules above like this:
prog: expr+ EOF;
expr
: NOT expr #NotExpr
| expr AND expr #AndExpr
| expr (OR | XOR) expr #OrExpr
| IS_SET LPAREN expr RPAREN #CommandExpr
| ID #IdExpr
;
which is, IMO, much nicer.
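To make the precedence idea concrete, here is a minimal hand-written recursive-descent sketch in Python that mirrors the layered orExpression / andExpression / notExpression rules. It is only an illustration, not what ANTLR generates: the Parser class, the tuple-based tree and the "<EOF>" marker are my own assumptions, and it parses a single expression rather than stat+.

```python
import re

# Hypothetical mini-lexer: words and parentheses only, whitespace ignored.
TOKEN_RE = re.compile(r"\s*([A-Za-z_$][\w$]*|[()])")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens + ["<EOF>"]

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def take(self, expected=None):
        tok = self.tokens[self.pos]
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def prog(self):
        # Anchoring with EOF forces the whole input to be consumed.
        tree = self.or_expression()
        self.take("<EOF>")
        return tree

    def or_expression(self):
        # orExpression : andExpression ((OR | XOR) andExpression)*
        node = self.and_expression()
        while self.peek() in ("or", "xor"):
            op = self.take()
            node = (op, node, self.and_expression())
        return node

    def and_expression(self):
        # andExpression : notExpression (AND notExpression)*
        node = self.not_expression()
        while self.peek() == "and":
            self.take()
            node = ("and", node, self.not_expression())
        return node

    def not_expression(self):
        # notExpression : NOT? command
        if self.peek() == "not":
            self.take()
            return ("not", self.command())
        return self.command()

    def command(self):
        name = self.take()
        if name not in ("isSet", "isEmpty"):
            raise SyntaxError(f"expected a command, got {name!r}")
        self.take("(")
        param = self.take()
        self.take(")")
        return (name, param)

def parse(text):
    return Parser(tokenize(text)).prog()

print(parse("isSet(foo) or isSet(bar) and isSet(test)"))
# ('or', ('isSet', 'foo'), ('and', ('isSet', 'bar'), ('isSet', 'test')))
```

Because or_expression calls and_expression for its operands, and binds tighter than or, which is the same effect the ordering of alternatives has in the left-recursive expr rule.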
I cannot seem to figure out what antlr is doing here in this grammar. I have a grammar that should match an input like:
i,j : bool;
setvar : set<bool>;
i > 5;
j < 10;
But I keep getting an error telling me that "line 3:13 mismatched input '<' expecting '<'". This tells me there is some ambiguity in the lexer, but I only use '<' in a single token.
Here is the grammar:
//// Parser Rules
grammar MLTL1;
start: block*;
block: var_list ';'
| expr ';'
;
var_list: IDENTIFIER (',' IDENTIFIER)* ':' type ;
type: BASE_TYPE
| KW_SET REL_LT BASE_TYPE REL_GT
;
expr: expr REL_OP expr
| '(' expr ')'
| IDENTIFIER
| INT
;
//// Lexical Spec
// Types
BASE_TYPE: 'bool'
| 'int'
| 'float'
;
// Keywords
KW_SET: 'set' ;
// Op groups for precedence
REL_OP: REL_EQ | REL_NEQ | REL_GT | REL_LT
| REL_GTE | REL_LTE ;
// Relational ops
REL_EQ: '==' ;
REL_NEQ: '!=' ;
REL_GT: '>' ;
REL_LT: '<' ;
REL_GTE: '>=' ;
REL_LTE: '<=' ;
IDENTIFIER
: LETTER (LETTER | DIGIT)*
;
INT
: SIGN? NONZERODIGIT DIGIT*
| '0'
;
fragment
SIGN
: [+-]
;
fragment
DIGIT
: [0-9]
;
fragment
NONZERODIGIT
: [1-9]
;
fragment
LETTER
: [a-zA-Z_]
;
COMMENT : '#' ~[\r\n]* -> skip;
WS : [ \t\r\n]+ -> channel(HIDDEN);
I tested the grammar to see what tokens it is generating for the test input above using this python:
from antlr4 import InputStream, CommonTokenStream
import MLTL1Lexer
import MLTL1Parser
input="""
i,j : bool;
setvar: set<bool>;
i > 5;
j < 10;
"""
lexer = MLTL1Lexer.MLTL1Lexer(InputStream(input))
stream = CommonTokenStream(lexer)
stream.fill()
tokens = stream.getTokens(0,100)
for t in tokens:
print(str(t.type) + " " + t.text)
parser = MLTL1Parser.MLTL1Parser(stream)
parse_tree = parser.start()
print(parse_tree.toStringTree(recog=parser))
And noticed that both '>' and '<' were assigned the same token value despite being two different tokens. Am I missing something here?
(There may be more than just these two instances, but...)
Change REL_OP and BASE_TYPE to parser rules (i.e. make them lowercase).
As you've used them, you're effectively turning many of your intended lexer rules into fragments.
It's important to understand that tokens are the "atoms" of your grammar. When you combine several of them into another lexer rule, that combined rule becomes the token type.
(If you had used grun to dump the tokens, you would have seen them identified as REL_OP tokens.)
With the changes below, your sample input works just fine.
grammar MLTL1;
start: block*;
block: var_list ';' | expr ';';
var_list: IDENTIFIER (',' IDENTIFIER)* ':' type;
type: baseType | KW_SET REL_LT baseType REL_GT;
expr: expr rel_op expr | '(' expr ')' | IDENTIFIER | INT;
//// Lexical Spec
// Types
baseType: 'bool' | 'int' | 'float';
// Keywords
KW_SET: 'set';
// Op groups for precedence
rel_op: REL_EQ | REL_NEQ | REL_GT | REL_LT | REL_GTE | REL_LTE;
// Relational ops
REL_EQ: '==';
REL_NEQ: '!=';
REL_GT: '>';
REL_LT: '<';
REL_GTE: '>=';
REL_LTE: '<=';
IDENTIFIER: LETTER (LETTER | DIGIT)*;
INT: SIGN? NONZERODIGIT DIGIT* | '0';
fragment SIGN: [+-];
fragment DIGIT: [0-9];
fragment NONZERODIGIT: [1-9];
fragment LETTER: [a-zA-Z_];
COMMENT: '#' ~[\r\n]* -> skip;
WS: [ \t\r\n]+ -> channel(HIDDEN);
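If it helps to see why '<' and '>' ended up with the same token type, here is a small Python sketch of the way ANTLR's lexer picks a rule: the longest match wins, and on a tie the rule defined first wins. This is a simplified model and not ANTLR code; the tokenize function, the regex translations of the rules and the PUNCT rule (standing in for the implicit ';', ':', ',' tokens) are assumptions of mine.

```python
import re

def tokenize(text, rules):
    """rules: ordered list of (name, regex). Longest match wins; ties go to the earlier rule."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    pos, out = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, rx in compiled:
            m = rx.match(text, pos)
            # Strict '>' keeps the earlier rule when two rules match the same length.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise SyntaxError(f"no rule matches at offset {pos}")
        if best_name != "WS":
            out.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return out

# Rule order as in the question: REL_OP comes before REL_GT and REL_LT.
BROKEN = [
    ("BASE_TYPE", r"bool|int|float"),
    ("KW_SET", r"set"),
    ("REL_OP", r"==|!=|>=|<=|>|<"),
    ("REL_GT", r">"),
    ("REL_LT", r"<"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("INT", r"[+-]?[1-9][0-9]*|0"),
    ("PUNCT", r"[;:,()]"),
    ("WS", r"\s+"),
]

# With REL_OP no longer a lexer rule, the individual operator tokens survive.
FIXED = [rule for rule in BROKEN if rule[0] != "REL_OP"]

print(tokenize("set<bool>", BROKEN))
# [('KW_SET', 'set'), ('REL_OP', '<'), ('BASE_TYPE', 'bool'), ('REL_OP', '>')]
print(tokenize("set<bool>", FIXED))
# [('KW_SET', 'set'), ('REL_LT', '<'), ('BASE_TYPE', 'bool'), ('REL_GT', '>')]
```

In the broken ordering the parser asks for a REL_LT token but only ever receives REL_OP, which is where the confusing "mismatched input '<' expecting '<'" message comes from.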
I'm getting this error
[22:52:55] warning(200): ProjLang.g:53:30:
Decision can match input such as "MULOP LETTER" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
From my googling, it seems there is some ambiguity in my grammar, but I don't know how to remove it.
fragment LETTER
: ('a'..'z') | ('A'..'Z');
fragment DIGIT
: ('0'..'9');
ADDOP : ('+'|'-'|'|')
;
MULOP : ('*'|'/'|'&')
;
RELOP : ('<'|'=')
;
LPAR : '(';
RPAR : ')';
BOOL : 'true'|'false';
LCURL : '{';
RCURL : '}';
// parser rules: non terminal states should be lowercase
input : expr EOF
;
expr : 'if' expr 'then' expr 'else' expr
| 'let'( 'val' id RELOP expr 'in' expr 'end'|'fun' id LPAR id RPAR RELOP expr 'in' expr 'end')
| 'while' expr 'do' expr
| LCURL expr (';'expr)* RCURL
| '!'expr
| id ':=' expr
| relexpr
;
relexpr
: arithexpr (RELOP arithexpr)?
;
arithexpr
: term (MULOP term)*
;
term : factor (MULOP factor)*
;
factor : num
| BOOL
| id (LPAR expr RPAR)?
| LPAR expr RPAR
;
id : LETTER (LETTER | DIGIT)*;
num : DIGIT+;
I'd like a grammar that produces no such warning, so that I can generate a lexer and a parser from it.
These two rules do essentially the same thing:
arithexpr
: term (MULOP term)*
;
term : factor (MULOP factor)*
;
If you combine them you will get:
arithexpr: factor (MULOP factor)* (MULOP factor (MULOP factor)*)*
which contains an ambiguity (which of the two MULOP tokens should be matched after the initial factor?). But from the rewrite it's easy to see what to do:
arithexpr: factor (MULOP factor)*;
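which replaces the original term and arithexpr rules. (Note that ADDOP is defined but never used in your grammar; if arithexpr was meant to read term (ADDOP term)*, that change also removes the ambiguity while keeping two precedence levels.)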
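To see how ambiguous the two nested MULOP loops are, here is a short Python sketch that enumerates every way the original arithexpr/term pair could group a chain of factors into terms. The groupings function is a hypothetical helper of mine and only models the grouping choice; it is not how ANTLR analyses the grammar.

```python
from itertools import combinations

def groupings(factors):
    """Yield every way to split factors into consecutive, non-empty terms.

    Each result corresponds to one possible parse under
    arithexpr: term (MULOP term)* with term: factor (MULOP factor)*.
    """
    n = len(factors)
    for k in range(n):  # k = number of MULOPs claimed by the arithexpr loop
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [factors[a:b] for a, b in zip(bounds, bounds[1:])]

for g in groupings(["a", "b", "c"]):
    print(g)
# [['a', 'b', 'c']]
# [['a'], ['b', 'c']]
# [['a', 'b'], ['c']]
# [['a'], ['b'], ['c']]
```

Every MULOP can belong to either loop, so a chain of n factors has 2^(n-1) groupings. With the single merged rule there is exactly one.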
I am trying to build a new language with ANTLR, and I have run into a problem. I am trying to support numerical expressions and mathematical operations on numbers (pretty important, I reckon), but the parser doesn't seem to be acting how I expect. Here is my grammar:
grammar Lumos;
/*
* Parser Rules
*/
program : 'start' stat+ 'stop';
block : stat*
;
stat : assign
| numop
| if_stat
| while_stat
| display
;
assign : LET ID BE expr ;
display : DISPLAY expr ;
numop : add | subtract | multiply | divide ;
add : 'add' expr TO ID ;
subtract : 'subtract' expr 'from' ID ;
divide : 'divide' ID BY expr ;
multiply : 'multiply' ID BY expr ;
append : 'append' expr TO ID ;
if_stat
: IF condition_block (ELSE IF condition_block)* (ELSE stat_block)?
;
condition_block
: expr stat_block
;
stat_block
: OBRACE block CBRACE
| stat
;
while_stat
: WHILE expr stat_block
;
expr : expr POW<assoc=right> expr #powExpr
| MINUS expr #unaryExpr
| NOT expr #notExpr
| expr op=(TIMES|DIV|MOD) expr #multiplicativeExpr
| expr op=(PLUS|MINUS) expr #additiveExpr
| expr op=RELATIONALOPERATOR expr #relationalExpr
| expr op=EQUALITYOPERATOR expr #equalityExpr
| expr AND expr #andExpr
| expr OR expr #orExpr
//| ARRAY #arrayExpr
| atom #atomExpr
;
atom : LPAREN expr RPAREN #parExpr
| (INT|FLOAT) #numberExpr
| (TRUE|FALSE) #booleanAtom
| ID #idAtom
| STRING #stringAtom
| NIX #nixAtom
;
compileUnit : EOF ;
/*
* Lexer Rules
*/
fragment LETTER : [a-zA-Z] ;
MATHOP : PLUS
| MINUS
| TIMES
| DIV
| MOD
| POW
;
RELATIONALOPERATOR : LTEQ
| GTEQ
| LT
| GT
;
EQUALITYOPERATOR : EQ
| NEQ
;
LPAREN : '(' ;
RPAREN : ')' ;
LBRACE : '{' ;
RBRACE : '}' ;
OR : 'or' ;
AND : 'and' ;
BY : 'by' ;
TO : 'to' ;
FROM : 'from' ;
LET : 'let' ;
BE : 'be' ;
EQ :'==' ;
NEQ :'!=' ;
LTEQ :'<=' ;
GTEQ :'>=' ;
LT :'<' ;
GT :'>' ;
//Different statements will choose between these, but they are pretty much the same.
PLUS :'plus' ;
ADD :'add' ;
MINUS :'minus' ;
SUBTRACT :'sub' ;
TIMES :'times' ;
MULT :'multiply' ;
DIV :'divide' ;
MOD :'mod' ;
POW :'pow' ;
NOT :'not' ;
TRUE :'true' ;
FALSE :'false' ;
NIX :'nix' ;
IF :'if' ;
THEN :'then' ;
ELSE :'else' ;
WHILE :'while' ;
DISPLAY :'display' ;
ARRAY : '['(INT|FLOAT)(','(INT|FLOAT))+']';
ID : [a-z]+ ;
WORD : LETTER+ ;
//NUMBER : INT | FLOAT ;
INT : [0-9]+ ;
FLOAT : [0-9]+ '.' [0-9]*
| '.'[0-9]+
;
COMMENT : '#' ~[\r\n]* -> channel(HIDDEN) ;
WS : [ \n\t\r]+ -> channel(HIDDEN) ;
STRING : '"' (~["{}])+ '"' ;
When given the input let foo be 5 times 3, the visitor sees let foo be 5 and an extraneous times 3. I thought I set up the expr rule so that it would recognize a multiplication expression before it recognizes atoms, so this wouldn't happen. I don't know where I went wrong, but it does not work how I expected.
If anyone has any idea where I went wrong or how I can fix this problem, I would appreciate your input.
You're using TIMES in your parser rules, but MATHOP also matches the same text, and since MATHOP is defined before TIMES, it takes precedence: the lexer emits a MATHOP token and never a TIMES token. That is why the alternative expr op=(TIMES|DIV|MOD) expr is never matched.
I don't see MATHOP used anywhere in your parser rules, so I recommend removing the MATHOP rule altogether.
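A quick way to spot this kind of problem is to check, for each literal-only lexer rule, whether an earlier rule already matches everything it can match. The Python sketch below does that for part of the grammar in the question. The shadowed_rules helper and the hand-written literal sets are assumptions of mine, and the check only makes sense for rules made purely of fixed literals.

```python
def shadowed_rules(rules):
    """rules: ordered list of (name, set_of_literals).

    Returns {shadowed_rule: earlier_rule_that_wins}. When two lexer rules
    match the same text, the one defined first produces the token.
    """
    shadowed = {}
    for i, (name, literals) in enumerate(rules):
        for earlier_name, earlier_literals in rules[:i]:
            if literals <= earlier_literals:
                shadowed[name] = earlier_name
                break
    return shadowed

# Literal sets copied by hand from the grammar, in definition order.
LUMOS_RULES = [
    ("MATHOP", {"plus", "minus", "times", "divide", "mod", "pow"}),
    ("RELATIONALOPERATOR", {"<=", ">=", "<", ">"}),
    ("EQUALITYOPERATOR", {"==", "!="}),
    ("EQ", {"=="}),
    ("NEQ", {"!="}),
    ("LT", {"<"}),
    ("GT", {">"}),
    ("PLUS", {"plus"}),
    ("MINUS", {"minus"}),
    ("TIMES", {"times"}),
    ("DIV", {"divide"}),
]

for rule, winner in shadowed_rules(LUMOS_RULES).items():
    print(f"{rule} is never emitted; {winner} wins")
```

The relational and equality operators are shadowed too, but that is harmless here because expr refers to RELATIONALOPERATOR and EQUALITYOPERATOR directly. Only the MATHOP group hurts, since the parser asks for TIMES, PLUS and friends.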
I would like to create an ANTLR parser for a custom language and decided to pick a simple calculator as an example. In my new grammar it should be possible to define a string, like this:
s = "Hello, I am a string"
and handle string interpolation.
Text inside double quotes that is enclosed in percent signs should be treated as interpolated, e.g.
s = "Hello, did you know that %2 + 2% is 4?"
Double percent sign should not be processed, e.g.
s = "He wants 50%% of this deal."
But at the same time my calculator should support modulus operation:
x = 5 % 2
So far, I was able to craft a Lexer/Grammar, which could switch mode and parse simple strings, here they are:
lexer grammar CalcLexer;
EQ: '=';
PLUS: '+';
MINUS: '-';
MULT: '*';
DIV: '/';
LPAREN : '(' ;
RPAREN : ')' ;
SINGLE_PERCENT_POP: '%' -> popMode;
ID : [a-zA-Z]+ ;
INT : [0-9]+ ;
OPEN_DOUBLE_QUOTE: '"' -> pushMode(STRING_MODE);
NEWLINE:'\r'? '\n' ;
WS : [ \t]+ -> skip;
mode STRING_MODE;
DOUBLE_PERCENT: '%%';
SINGLE_PERCENT: '%' -> pushMode(DEFAULT_MODE);
TEXT: ~('%'|'\n'|'"')+;
CLOSE_DOUBLE_QUOTE: '"' -> popMode;
and
parser grammar CalcGrammar;
options { tokenVocab=CalcLexer; } // use tokens from CalcLexer.g4
prog: stat+ ;
stat: expr NEWLINE
| ID EQ (expr|text) NEWLINE
| NEWLINE
;
text: OPEN_DOUBLE_QUOTE content* CLOSE_DOUBLE_QUOTE;
content: DOUBLE_PERCENT | TEXT | SINGLE_PERCENT expr SINGLE_PERCENT_POP;
expr: expr (MULT|DIV) expr
| expr (PLUS|MINUS) expr
| INT
| ID
| LPAREN expr RPAREN
;
The only thing that doesn't work, and I'm not sure it is even possible to implement without custom code (members), is the modulus operation:
x = 5 % 2
There is no way I can ask ANTLR to check the previous mode and pop the mode safely.
But I hope my understanding is wrong and there is some way to treat the % sign as an operator in default mode?
I have found several sources for inspiration, probably they would help you as well:
Parsing string interpolation in ANTLR
ANTLR String interpolation
Parsing String Interpolations with ANTLR4
String interpolation and lexer modes
Murphy's law for StackOverflow: you will find the answer to your own question a few minutes after posting a detailed question to SO.
Instead of switching to DEFAULT_MODE, I should create a separate mode, STRING_INTERPOLATION. This means defining separate tokens for that mode, which lets me use the % sign as an operator in normal mode (and prohibits it inside interpolation).
Here are the lexer and parser grammars that work for me:
lexer grammar CalcLexer;
EQ: '=';
PLUS: '+';
MINUS: '-';
MULT: '*';
DIV: '/';
MOD: '%';
LPAREN : '(' ;
RPAREN : ')' ;
ID : F_ID;
INT : F_INT;
fragment F_ID: [a-zA-Z]+ ;
fragment F_INT: [0-9]+ ;
OPEN_DOUBLE_QUOTE: '"' -> pushMode(STRING_MODE);
NEWLINE:'\r'? '\n' ;
WS : [ \t]+ -> skip;
mode STRING_MODE;
DOUBLE_PERCENT: '%%';
SINGLE_PERCENT: '%' -> pushMode(STRING_INTERPOLATION);
TEXT: ~('%'|'\n'|'"')+;
CLOSE_DOUBLE_QUOTE: '"' -> popMode;
mode STRING_INTERPOLATION;
SINGLE_PERCENT_POP: '%' -> popMode;
I_PLUS: PLUS -> type(PLUS);
I_MINUS: MINUS -> type(MINUS);
I_MULT: MULT -> type(MULT);
I_DIV: DIV -> type(DIV);
I_MOD: MOD -> type(MOD);
I_LPAREN: LPAREN -> type(LPAREN);
I_RPAREN: RPAREN -> type(RPAREN);
I_ID : F_ID -> type(ID);
I_INT : F_INT -> type(INT);
WS1 : [ \t]+ -> skip;
and
parser grammar CalcGrammar;
options { tokenVocab=CalcLexer; } // use tokens from CalcLexer.g4
prog: stat+ ;
stat: expr NEWLINE
| ID EQ (expr|text) NEWLINE
| NEWLINE
;
text: OPEN_DOUBLE_QUOTE content* CLOSE_DOUBLE_QUOTE;
content: DOUBLE_PERCENT | TEXT | SINGLE_PERCENT expr SINGLE_PERCENT_POP;
expr: expr (MULT|DIV|MOD) expr
| expr (PLUS|MINUS) expr
| INT
| ID
| LPAREN expr RPAREN
;
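For anyone who wants to see the mode stack in action without generating a lexer, here is a rough Python model of the three modes above. It is a sketch under my own assumptions, not ANTLR's implementation: the MODES table, the action strings and the single OP token (standing in for EQ, PLUS, MINUS, MULT, DIV and the parentheses) are made up for brevity.

```python
import re

# (token name, regex, action); action is "", "skip", "pop" or "push <MODE>".
MODES = {
    "DEFAULT": [
        ("MOD", r"%", ""),
        ("OPEN_DOUBLE_QUOTE", r'"', "push STRING_MODE"),
        ("INT", r"[0-9]+", ""),
        ("ID", r"[a-zA-Z]+", ""),
        ("OP", r"[=+\-*/()]", ""),
        ("WS", r"[ \t]+", "skip"),
    ],
    "STRING_MODE": [
        ("DOUBLE_PERCENT", r"%%", ""),
        ("SINGLE_PERCENT", r"%", "push STRING_INTERPOLATION"),
        ("TEXT", r'[^%\n"]+', ""),
        ("CLOSE_DOUBLE_QUOTE", r'"', "pop"),
    ],
    "STRING_INTERPOLATION": [
        ("SINGLE_PERCENT_POP", r"%", "pop"),
        ("INT", r"[0-9]+", ""),
        ("ID", r"[a-zA-Z]+", ""),
        ("OP", r"[+\-*/()]", ""),
        ("WS", r"[ \t]+", "skip"),
    ],
}

def tokenize(text):
    stack = ["DEFAULT"]  # the mode stack; the top entry is the active mode
    pos, out = 0, []
    while pos < len(text):
        best = None
        for name, pattern, action in MODES[stack[-1]]:
            m = re.compile(pattern).match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if m and (best is None or m.end() > best[0]):
                best = (m.end(), name, action)
        if best is None:
            raise SyntaxError(f"no rule matches at offset {pos} in mode {stack[-1]}")
        end, name, action = best
        if action != "skip":
            out.append((name, text[pos:end]))
        if action == "pop":
            stack.pop()
        elif action.startswith("push "):
            stack.append(action[len("push "):])
        pos = end
    return out

print(tokenize("x = 5 % 2"))
print(tokenize('s = "is %2 + 2% ok"'))
```

In default mode the % comes out as MOD, while inside a string the first % pushes the interpolation mode and the second one pops back into the string, so the two meanings never collide.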
I hope this helps someone. Probably future me.
I'm trying to write a simple lambda calculus grammar (shown below). The issue I am having is that function application seems to be treated as right associative instead of left associative, e.g. "f 1 2" is parsed as (f (1 2)) instead of ((f 1) 2). ANTLR has an assoc option for tokens, but I don't see how that helps here since there is no operator for function application. Does anyone see a solution?
LAMBDA : '\\';
DOT : '.';
OPEN_PAREN : '(';
CLOSE_PAREN : ')';
fragment ID_START : [A-Za-z+\-*/_];
fragment ID_BODY : ID_START | DIGIT;
fragment DIGIT : [0-9];
ID : ID_START ID_BODY*;
NUMBER : DIGIT+ (DOT DIGIT+)?;
WS : [ \t\r\n]+ -> skip;
parse : expr EOF;
expr : variable #VariableExpr
| number #ConstantExpr
| function_def #FunctionDefinition
| expr expr #FunctionApplication
| OPEN_PAREN expr CLOSE_PAREN #ParenExpr
;
function_def : LAMBDA ID DOT expr;
number : NUMBER;
variable : ID;
Thanks!
This breaks 4.1's pattern matcher for left recursion. It has been cleaned up in the main branch, I believe; try downloading the latest master and building it. Currently 4.1 generates:
expr[int _p]
: ( {} variable
| number
| function_def
| OPEN_PAREN expr CLOSE_PAREN
)
(
{2 >= $_p}? expr
)*
;
for that rule. The expr reference in the loop is actually expr[0], which isn't right.
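As a possible workaround while stuck on 4.1 (this is my own suggestion, not something from the answer above), you could match a flat run of atoms in the grammar and build the left-nested applications yourself when walking the tree. The Python sketch below shows the folding step only; apply_left and apply_right are hypothetical helpers and the tuples stand in for whatever application node you use.

```python
from functools import reduce

def apply_left(atoms):
    """Fold a flat list of atoms into left-nested applications: ((f 1) 2)."""
    return reduce(lambda fn, arg: (fn, arg), atoms)

def apply_right(atoms):
    """The unwanted right-nested shape, for comparison: (f (1 2))."""
    return reduce(lambda acc, atom: (atom, acc), reversed(atoms))

print(apply_left(["f", "1", "2"]))   # (('f', '1'), '2')
print(apply_right(["f", "1", "2"]))  # ('f', ('1', '2'))
```

The left fold gives the usual lambda-calculus reading, where f is applied to 1 and the result is applied to 2.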