Antlr 4.6.1 not generating errorNodes for inputstream

I have a simple grammar like this:
grammar CellMath;
equation : expr EOF;
expr
: '-'expr #UnaryNegation // unary minus
| expr op=('*'|'/') expr #MultiplicativeOp // MultiplicativeOperation
| expr op=('+'|'-') expr #AdditiveOp // AdditiveOperation
| FLOAT #Float // Floating Point Number
| INT #Integer // Integer Number
| '(' expr ')' #ParenExpr // Parenthesized Expression
;
MUL : '*' ;
DIV : '/' ;
ADD : '+' ;
SUB : '-' ;
FLOAT
: DIGIT+ '.' DIGIT*
| '.' DIGIT+
;
INT : DIGIT+ ;
fragment
DIGIT : [0-9] ; // match single digit
//fragment
//ATSIGN : [#];
WS : [ \t\r\n]+ -> skip ;
ERRORCHAR : . ;
I am not able to get an exception thrown when a special character appears in the middle of an expression, i.e. input of the form
[{Number}{SPLChar}{Chars}]
For example:
"123#abc",
"123&abc".
I expect an exception to be thrown.
For example:
Input stream: 123#abc should be reported as an error, just as ANTLR Lab does.
But in my case the output is '123', without any errors.
I'm using the listener pattern. Error nodes are simply ignored: nothing goes through VisitTerminal([NotNull] ITerminalNode node) or VisitErrorNode([NotNull] IErrorNode node) in the BaseListener class. I have also overridden all the methods of the BaseErrorListener class, and none of them is called either.
Thanks in advance for your help.

Related

Antlr4 mismatched input '<' expecting '<' with (seemingly) no lexer ambiguity

I cannot seem to figure out what antlr is doing here in this grammar. I have a grammar that should match an input like:
i,j : bool;
setvar : set<bool>;
i > 5;
j < 10;
But I keep getting an error telling me that "line 3:13 mismatched input '<' expecting '<'". This tells me there is some ambiguity in the lexer, but I only use '<' in a single token.
Here is the grammar:
//// Parser Rules
grammar MLTL1;
start: block*;
block: var_list ';'
| expr ';'
;
var_list: IDENTIFIER (',' IDENTIFIER)* ':' type ;
type: BASE_TYPE
| KW_SET REL_LT BASE_TYPE REL_GT
;
expr: expr REL_OP expr
| '(' expr ')'
| IDENTIFIER
| INT
;
//// Lexical Spec
// Types
BASE_TYPE: 'bool'
| 'int'
| 'float'
;
// Keywords
KW_SET: 'set' ;
// Op groups for precedence
REL_OP: REL_EQ | REL_NEQ | REL_GT | REL_LT
| REL_GTE | REL_LTE ;
// Relational ops
REL_EQ: '==' ;
REL_NEQ: '!=' ;
REL_GT: '>' ;
REL_LT: '<' ;
REL_GTE: '>=' ;
REL_LTE: '<=' ;
IDENTIFIER
: LETTER (LETTER | DIGIT)*
;
INT
: SIGN? NONZERODIGIT DIGIT*
| '0'
;
fragment
SIGN
: [+-]
;
fragment
DIGIT
: [0-9]
;
fragment
NONZERODIGIT
: [1-9]
;
fragment
LETTER
: [a-zA-Z_]
;
COMMENT : '#' ~[\r\n]* -> skip;
WS : [ \t\r\n]+ -> channel(HIDDEN);
I tested the grammar to see what tokens it generates for the test input above, using this Python script:
from antlr4 import InputStream, CommonTokenStream
import MLTL1Lexer
import MLTL1Parser
input="""
i,j : bool;
setvar: set<bool>;
i > 5;
j < 10;
"""
lexer = MLTL1Lexer.MLTL1Lexer(InputStream(input))
stream = CommonTokenStream(lexer)
stream.fill()
tokens = stream.getTokens(0,100)
for t in tokens:
    print(str(t.type) + " " + t.text)
parser = MLTL1Parser.MLTL1Parser(stream)
parse_tree = parser.start()
print(parse_tree.toStringTree(recog=parser))
And noticed that both '>' and '<' were assigned the same token value despite being two different tokens. Am I missing something here?
(There may be more than just these two instances, but...)
Change REL_OP and BASE_TYPE to parser rules (i.e. make them lowercase).
As you've used them, you're effectively turning many of your intended lexer rules into fragments.
It's important to understand that tokens are the "atoms" of your grammar; when you combine several of them into another lexer rule, that combined rule becomes the token type.
(If you had used grun to dump the tokens, you would have seen them identified as REL_OP tokens.)
With the changes below, your sample input works just fine.
grammar MLTL1;
start: block*;
block: var_list ';' | expr ';';
var_list: IDENTIFIER (',' IDENTIFIER)* ':' type;
type: baseType | KW_SET REL_LT baseType REL_GT;
expr: expr rel_op expr | '(' expr ')' | IDENTIFIER | INT;
//// Lexical Spec
// Types
baseType: 'bool' | 'int' | 'float';
// Keywords
KW_SET: 'set';
// Op groups for precedence
rel_op: REL_EQ | REL_NEQ | REL_GT | REL_LT | REL_GTE | REL_LTE;
// Relational ops
REL_EQ: '==';
REL_NEQ: '!=';
REL_GT: '>';
REL_LT: '<';
REL_GTE: '>=';
REL_LTE: '<=';
IDENTIFIER: LETTER (LETTER | DIGIT)*;
INT: SIGN? NONZERODIGIT DIGIT* | '0';
fragment SIGN: [+-];
fragment DIGIT: [0-9];
fragment NONZERODIGIT: [1-9];
fragment LETTER: [a-zA-Z_];
COMMENT: '#' ~[\r\n]* -> skip;
WS: [ \t\r\n]+ -> channel(HIDDEN);
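To see why both '>' and '<' came out with the same token type, here is a rough Python simulation of how an ANTLR lexer picks a rule: the longest match wins, and on a tie the rule defined first wins. This is only a sketch using regular expressions, not ANTLR's actual implementation; the `tokenize` helper and the reduced rule lists are assumptions made for illustration.

```python
import re

def tokenize(rules, text):
    """Longest match wins; on a tie, the rule listed first wins."""
    pos, out = 0, []
    while pos < len(text):
        if text[pos].isspace():              # stand-in for the WS rule
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, pattern in rules:          # rules are tried in definition order
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        out.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return out

# Original layout (subset): REL_OP is a lexer rule defined before the atoms it combines.
original = [
    ("REL_OP", r"==|!=|>=|<=|>|<"),
    ("REL_EQ", r"=="), ("REL_NEQ", r"!="),
    ("REL_GT", r">"), ("REL_LT", r"<"),
    ("REL_GTE", r">="), ("REL_LTE", r"<="),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("INT", r"[+-]?[1-9][0-9]*|0"),
]
# Fixed layout: the grouping moved to a parser rule, so the lexer only sees the atoms.
fixed = [rule for rule in original if rule[0] != "REL_OP"]

print(tokenize(original, "i > 5 j < 10"))  # '>' and '<' are both REL_OP
print(tokenize(fixed, "i > 5 j < 10"))     # '>' is REL_GT, '<' is REL_LT
```

With REL_OP in the lexer, the parser never receives a REL_LT or REL_GT token, which is why `KW_SET REL_LT BASE_TYPE REL_GT` cannot match.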

ANTLR: not match if a certain character follows

The following code is completely valid in the V programming language:
fn main() {
    a := 1.
    b := .1
    println("$a $b")
    for i in 0..10 {
        println(i)
    }
}
I want to write a lexer for syntax-coloring such files. 1. and .1 should be matched by the FloatNumber fragment, while the .. in the for loop should be matched by a punctuation rule. The problem is that my FloatNumber implementation already matches 0. and .10 from 0..10, and I have no idea how to tell it not to match if a . follows (or precedes it). Slightly simplified (leaving possible underscores aside), my grammar looks like this:
fragment FloatNumber
: ( Digit+ ('.' Digit*)? ([eE] [+-]? Digit+)?
| Digit* '.' Digit+ ([eE] [+-]? Digit+)?
)
;
fragment Digit
: [0-9]
;
Then you will have to introduce a predicate that checks that no . follows when matching a float like `1.`.
The following rules:
Plus
: '+'
;
FloatLiteral
: Digit+ '.' {_input.LA(1) != '.'}?
| Digit* '.' Digit+
;
Int
: Digit+
;
Range
: '..'
;
given the input "1.2 .3 4. 5 6..7 8.+9", will produce the following tokens:
FloatLiteral `1.2`
FloatLiteral `.3`
FloatLiteral `4.`
Int `5`
Int `6`
Range `..`
Int `7`
FloatLiteral `8.`
Plus `+`
Int `9`
Code inside a predicate is target specific. The predicate above ({_input.LA(1) != '.'}?) works with the Java target.
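For readers who want to try the idea without generating a lexer, the effect of the predicate can be approximated in plain Python with a negative lookahead. This is a sketch only: the `RULES` table and `tokenize` function are illustrative names, spaces are skipped by hand because the rules above show no whitespace rule, and regular expressions stand in for ANTLR's matching.

```python
import re

# Same rule order as the grammar above. FloatLiteral is split into its two alternatives.
RULES = [
    ("Plus", re.compile(r"\+")),
    ("FloatLiteral", re.compile(r"[0-9]+\.(?!\.)")),  # Digit+ '.' {next char != '.'}?
    ("FloatLiteral", re.compile(r"[0-9]*\.[0-9]+")),  # Digit* '.' Digit+
    ("Int", re.compile(r"[0-9]+")),
    ("Range", re.compile(r"\.\.")),
]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        if text[pos] == " ":                 # assumption: spaces are simply skipped
            pos += 1
            continue
        candidates = [(name, m.group()) for name, rx in RULES
                      if (m := rx.match(text, pos))]
        if not candidates:
            raise ValueError(f"no rule matches at offset {pos}")
        # Longest match wins; max() returns the first maximum, so earlier rules win ties.
        name, lexeme = max(candidates, key=lambda c: len(c[1]))
        tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens

for name, lexeme in tokenize("1.2 .3 4. 5 6..7 8.+9"):
    print(name, lexeme)
```

Running it on the same input yields the token sequence listed above: `6..7` becomes Int, Range, Int because the lookahead rejects `6.` when another `.` follows.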

How to make antlr find invalid input throw exception

I have the following grammar:
grammar Expr;
expr : '-' expr # unaryOpExpr
| expr ('*'|'/'|'%') expr # mulDivModuloExpr
| expr ('+'|'-') expr # addSubExpr
| '(' expr ')' # nestedExpr
| IDENTIFIER '(' fnArgs? ')' # functionExpr
| IDENTIFIER # identifierExpr
| DOUBLE # doubleExpr
| LONG # longExpr
| STRING # string
;
fnArgs : expr (',' expr)* # functionArgs
;
IDENTIFIER : [_$a-zA-Z][_$a-zA-Z0-9]* | '"' (ESC | ~ ["\\])* '"';
LONG : [0-9]+;
DOUBLE : [0-9]+ '.' [0-9]*;
WS : [ \t\r\n]+ -> skip ;
STRING: '"' (~["\\\r\n] | ESC)* '"';
fragment ESC : '\\' (['"\\/bfnrt] | UNICODE) ;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
MINUS : '-' ;
MUL : '*' ;
DIV : '/' ;
MODULO : '%' ;
PLUS : '+' ;
// math function
MAX: 'MAX';
When I enter the following text, it is accepted, as it should be:
-1.1
But when I enter the following text:
-1.1ffff
I think it should report an error, but ANTLR doesn't. It captures the leading "-1.1" and discards "ffff".
I want to change this behavior so that the invalid token is not discarded and an exception is thrown that reports it.
What should I do? Thanks for your advice.
Are you using expr as your main rule? If so, add another rule, called something like parse or program, and write it like this:
parse: expr EOF;
This forces the parser to consume the whole input, so trailing tokens that don't fit are reported as a syntax error instead of being silently ignored.
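The effect of the trailing EOF can be illustrated with a tiny hand-written parser in Python. This is a sketch under assumptions, not ANTLR output: it covers only a small subset of the grammar ('-' expr, DOUBLE, LONG, IDENTIFIER), and `ParseError`, `lex`, `parse_expr`, `parse_without_eof` and `parse_with_eof` are hypothetical names. Note also that ANTLR by default reports the error through its error listeners, whereas the sketch raises for brevity.

```python
import re

class ParseError(Exception):
    pass

TOKEN = re.compile(
    r"\s*(?:(?P<DOUBLE>[0-9]+\.[0-9]*)|(?P<LONG>[0-9]+)"
    r"|(?P<IDENTIFIER>[_$a-zA-Z][_$a-zA-Z0-9]*)|(?P<MINUS>-))"
)

def lex(text):
    text, pos, tokens = text.rstrip(), 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ParseError(f"unrecognized character at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

def parse_expr(tokens, pos=0):
    """expr : '-' expr | DOUBLE | LONG | IDENTIFIER. Returns (tree, next position)."""
    if pos >= len(tokens):
        raise ParseError("unexpected end of input")
    kind, text = tokens[pos]
    if kind == "MINUS":
        operand, pos = parse_expr(tokens, pos + 1)
        return ("neg", operand), pos
    return (kind, text), pos + 1

def parse_without_eof(text):
    tree, _ = parse_expr(lex(text))          # stops as soon as expr is satisfied
    return tree

def parse_with_eof(text):
    tokens = lex(text)
    tree, pos = parse_expr(tokens)
    if pos != len(tokens):                   # plays the role of EOF in `parse : expr EOF;`
        raise ParseError(f"extraneous input {tokens[pos][1]!r}")
    return tree

print(parse_without_eof("-1.1ffff"))         # trailing 'ffff' is silently ignored
try:
    parse_with_eof("-1.1ffff")
except ParseError as err:
    print("error:", err)
```

Without the end-of-input check the parser is free to stop after `-1.1`; with it, the leftover `ffff` token has nowhere to go and is reported.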

ANTLR4 Unexpected Parse Behavior

I am trying to build a new language with ANTLR, and I have run into a problem. I am trying to support numerical expressions and mathematical operations on numbers (pretty important, I reckon), but the parser doesn't seem to behave the way I expect. Here is my grammar:
grammar Lumos;
/*
* Parser Rules
*/
program : 'start' stat+ 'stop';
block : stat*
;
stat : assign
| numop
| if_stat
| while_stat
| display
;
assign : LET ID BE expr ;
display : DISPLAY expr ;
numop : add | subtract | multiply | divide ;
add : 'add' expr TO ID ;
subtract : 'subtract' expr 'from' ID ;
divide : 'divide' ID BY expr ;
multiply : 'multiply' ID BY expr ;
append : 'append' expr TO ID ;
if_stat
: IF condition_block (ELSE IF condition_block)* (ELSE stat_block)?
;
condition_block
: expr stat_block
;
stat_block
: OBRACE block CBRACE
| stat
;
while_stat
: WHILE expr stat_block
;
expr : expr POW<assoc=right> expr #powExpr
| MINUS expr #unaryExpr
| NOT expr #notExpr
| expr op=(TIMES|DIV|MOD) expr #multiplicativeExpr
| expr op=(PLUS|MINUS) expr #additiveExpr
| expr op=RELATIONALOPERATOR expr #relationalExpr
| expr op=EQUALITYOPERATOR expr #equalityExpr
| expr AND expr #andExpr
| expr OR expr #orExpr
//| ARRAY #arrayExpr
| atom #atomExpr
;
atom : LPAREN expr RPAREN #parExpr
| (INT|FLOAT) #numberExpr
| (TRUE|FALSE) #booleanAtom
| ID #idAtom
| STRING #stringAtom
| NIX #nixAtom
;
compileUnit : EOF ;
/*
* Lexer Rules
*/
fragment LETTER : [a-zA-Z] ;
MATHOP : PLUS
| MINUS
| TIMES
| DIV
| MOD
| POW
;
RELATIONALOPERATOR : LTEQ
| GTEQ
| LT
| GT
;
EQUALITYOPERATOR : EQ
| NEQ
;
LPAREN : '(' ;
RPAREN : ')' ;
LBRACE : '{' ;
RBRACE : '}' ;
OR : 'or' ;
AND : 'and' ;
BY : 'by' ;
TO : 'to' ;
FROM : 'from' ;
LET : 'let' ;
BE : 'be' ;
EQ :'==' ;
NEQ :'!=' ;
LTEQ :'<=' ;
GTEQ :'>=' ;
LT :'<' ;
GT :'>' ;
//Different statements will choose between these, but they are pretty much the same.
PLUS :'plus' ;
ADD :'add' ;
MINUS :'minus' ;
SUBTRACT :'sub' ;
TIMES :'times' ;
MULT :'multiply' ;
DIV :'divide' ;
MOD :'mod' ;
POW :'pow' ;
NOT :'not' ;
TRUE :'true' ;
FALSE :'false' ;
NIX :'nix' ;
IF :'if' ;
THEN :'then' ;
ELSE :'else' ;
WHILE :'while' ;
DISPLAY :'display' ;
ARRAY : '['(INT|FLOAT)(','(INT|FLOAT))+']';
ID : [a-z]+ ;
WORD : LETTER+ ;
//NUMBER : INT | FLOAT ;
INT : [0-9]+ ;
FLOAT : [0-9]+ '.' [0-9]*
| '.'[0-9]+
;
COMMENT : '#' ~[\r\n]* -> channel(HIDDEN) ;
WS : [ \n\t\r]+ -> channel(HIDDEN) ;
STRING : '"' (~["{}])+ '"' ;
When given the input let foo be 5 times 3, the visitor sees let foo be 5 and an extraneous times 3. I thought I set up the expr rule so that it would recognize a multiplication expression before it recognizes atoms, so this wouldn't happen. I don't know where I went wrong, but it does not work how I expected.
If anyone has any idea where I went wrong or how I can fix this problem, I would appreciate your input.
You're using TIMES in your parser rules, but the MATHOP also matches TIMES and since MATHOP is defined before your TIMES rule, it gets precedence. That is why the TIMES rule in expr op=(TIMES|DIV|MOD) expr isn't matched.
I don't see you using this MATHOP rule anywhere in your parser rules, so I recommend just removing the MATHOP rule altogether.

Antlr4 Grammar for Function Application

I'm trying to write a simple lambda calculus grammar (shown below). The issue I am having is that function application seems to be treated as right associative instead of left associative, e.g. "f 1 2" is parsed as (f (1 2)) instead of ((f 1) 2). ANTLR has an assoc option for tokens, but I don't see how that helps here since there is no operator for function application. Does anyone see a solution?
LAMBDA : '\\';
DOT : '.';
OPEN_PAREN : '(';
CLOSE_PAREN : ')';
fragment ID_START : [A-Za-z+\-*/_];
fragment ID_BODY : ID_START | DIGIT;
fragment DIGIT : [0-9];
ID : ID_START ID_BODY*;
NUMBER : DIGIT+ (DOT DIGIT+)?;
WS : [ \t\r\n]+ -> skip;
parse : expr EOF;
expr : variable #VariableExpr
| number #ConstantExpr
| function_def #FunctionDefinition
| expr expr #FunctionApplication
| OPEN_PAREN expr CLOSE_PAREN #ParenExpr
;
function_def : LAMBDA ID DOT expr;
number : NUMBER;
variable : ID;
Thanks!
This breaks 4.1's pattern matcher for left recursion. I believe it has been cleaned up in the main branch; try downloading the latest master and building it. Currently, 4.1 generates:
expr[int _p]
: ( {} variable
| number
| function_def
| OPEN_PAREN expr CLOSE_PAREN
)
(
{2 >= $_p}? expr
)*
;
for that rule. The expr reference in the loop is actually expr[0], which isn't right.
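To make the two groupings from the question concrete, here is a small Python sketch that nests a flat list of atoms either to the left (the intended result for function application) or to the right (what the question observed). It only illustrates the tree shapes and is not what ANTLR generates; `left_nested` and `right_nested` are hypothetical helper names.

```python
from functools import reduce

def left_nested(atoms):
    """Fold atoms so each new argument applies to everything before it."""
    return reduce(lambda fn, arg: (fn, arg), atoms)

def right_nested(atoms):
    """Nest atoms to the right, as a right-associative rule would."""
    head, *rest = atoms
    return head if not rest else (head, right_nested(rest))

print(left_nested(["f", "1", "2"]))    # (('f', '1'), '2')  i.e. ((f 1) 2)
print(right_nested(["f", "1", "2"]))   # ('f', ('1', '2'))  i.e. (f (1 2))
```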