Using floats in range and grammar mistakes? - antlr

I am trying to convert a LALR grammar to LL using ANTLR and I am running into a few problems. So far, I think converting the expressions into a Top-Down approach is straight forward to me. The problem is when I include Range (1..10) and (1.0..10.0) with floats.
I have tried to use the answer found here and somehow it is not even running correctly with my code, let alone solving a range of float, i.e. (float..float).
Float literal and range parameter in ANTLR
Attached is a sample of my grammar that just focuses on this issue.
grammar Test;
options {
language = Java;
output = AST;
}
parse: 'in' rangeExpression ';'
;
rangeExpression : expression ('..' expression)?
;
expression : addingExpression (('=='|'!='|'<='|'<'|'>='|'>') addingExpression)*
;
addingExpression : multiplyingExpression (('+'|'-') multiplyingExpression)*
;
multiplyingExpression : unaryExpression
(('*'|'/'|'div') unaryExpression)*
;
unaryExpression: ('+'|'-')* primitiveElement;
primitiveElement : literalExpression
| id ('.' id)?
| '(' expression ')'
;
literalExpression : NUMBER
| BOOLEAN_LITERAL
| 'infinity'
;
id : IDENTIFIER
;
// L E X I C A L R U L E S
Range
: '..'
;
NUMBER
: (DIGITS Range) => DIGITS {$type=DIGITS;}
| (FloatLiteral) => FloatLiteral {$type=FloatLiteral;}
| DIGITS {$type=DIGITS;}
;
// fragments
fragment FloatLiteral : Float;
fragment Float
: DIGITS ( options {greedy = true; } : '.' DIGIT* EXPONENT?)
| '.' DIGITS EXPONENT?
| DIGITS EXPONENT
;
BOOLEAN_LITERAL : 'false'
| 'true'
;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment LETTER : ('a'..'z' | 'A'..'Z' | '_') ;
fragment DIGITS: DIGIT+;
fragment DIGIT : '0'..'9';
fragment EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
Any reason why it is not even taking:
in 10;
or
in 10.0;
Thanks in advance!

The following things are not correct:
you're never matching a FloatLiteral in your literalExpression rule
in every alternative of NUMBER you're changing the type of the token, therefor a NUMBER token will never be created
Something like this should work for both 11..22 and 1.1..2.2:
...
literalExpression : INT
| BOOLEAN_LITERAL
| FLOAT
| 'infinity'
;
id : IDENTIFIER
;
// L E X I C A L R U L E S
Range
: '..'
;
INT
: (DIGITS Range)=> DIGITS
| DIGITS (('.' DIGITS EXPONENT? | EXPONENT) {$type=FLOAT;})?
;
BOOLEAN_LITERAL : 'false'
| 'true'
;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment LETTER : ('a'..'z' | 'A'..'Z' | '_') ;
fragment DIGITS: DIGIT+;
fragment DIGIT : '0'..'9';
fragment EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment FLOAT : ;

To your question about handling (1.0 .. 10.0):
Notice that parser rule primitiveElement defines an alternative as '(' expression ')', but rule expression can never reach rule rangeExpression.
Consider redefining expression and rangeExpression like so:
expression : rangeExpression
;
rangeExpression : compExpression ('..' compExpression)?
;
compExpression : addingExpression (('=='|'!='|'<='|'<'|'>='|'>') addingExpression)*
;
This ensures that the expression rule sits above all forms of expressions and will work as expected in parentheses.

Related

Antlr 4.6.1 not generating errorNodes for inputstream

I have a simple grammar like :
grammar CellMath;
equation : expr EOF;
expr
: '-'expr #UnaryNegation // unary minus
| expr op=('*'|'/') expr #MultiplicativeOp // MultiplicativeOperation
| expr op=('+'|'-') expr #AdditiveOp // AdditiveOperation
| FLOAT #Float // Floating Point Number
| INT #Integer // Integer Number
| '(' expr ')' #ParenExpr // Parenthesized Expression
;
MUL : '*' ;
DIV : '/' ;
ADD : '+' ;
SUB : '-' ;
FLOAT
: DIGIT+ '.' DIGIT*
| '.' DIGIT+
;
INT : DIGIT+ ;
fragment
DIGIT : [0-9] ; // match single digit
//fragment
//ATSIGN : [#];
WS : [ \t\r\n]+ -> skip ;
ERRORCHAR : . ;
Not able to throw an exception in case of special char in between expression
[{Number}{SPLChar}{Chars}]
Ex:
"123#abc",
"123&abc".
I expecting an exception to throw
For Example:
Input stream : 123#abc Just like in ANTLR labs Image
But in my case Output : '123' without any errors
I'm using Listener pattern, Error nodes are just ignored not going through VisitTerminal([NotNull] ITerminalNode node) / VisitErrorNode([NotNull] IErrorNode node) in the BaseListener class. Also all the BaseErrorListener class methods has been overridden not even there.
Thanks in advance for your help.

Antlr4 mismatched input '<' expecting '<' with (seemingly) no lexer ambiguity

I cannot seem to figure out what antlr is doing here in this grammar. I have a grammar that should match an input like:
i,j : bool;
setvar : set<bool>;
i > 5;
j < 10;
But I keep getting an error telling me that "line 3:13 mismatched input '<' expecting '<'". This tells me there is some ambiguity in the lexer, but I only use '<' in a single token.
Here is the grammar:
//// Parser Rules
grammar MLTL1;
start: block*;
block: var_list ';'
| expr ';'
;
var_list: IDENTIFIER (',' IDENTIFIER)* ':' type ;
type: BASE_TYPE
| KW_SET REL_LT BASE_TYPE REL_GT
;
expr: expr REL_OP expr
| '(' expr ')'
| IDENTIFIER
| INT
;
//// Lexical Spec
// Types
BASE_TYPE: 'bool'
| 'int'
| 'float'
;
// Keywords
KW_SET: 'set' ;
// Op groups for precedence
REL_OP: REL_EQ | REL_NEQ | REL_GT | REL_LT
| REL_GTE | REL_LTE ;
// Relational ops
REL_EQ: '==' ;
REL_NEQ: '!=' ;
REL_GT: '>' ;
REL_LT: '<' ;
REL_GTE: '>=' ;
REL_LTE: '<=' ;
IDENTIFIER
: LETTER (LETTER | DIGIT)*
;
INT
: SIGN? NONZERODIGIT DIGIT*
| '0'
;
fragment
SIGN
: [+-]
;
fragment
DIGIT
: [0-9]
;
fragment
NONZERODIGIT
: [1-9]
;
fragment
LETTER
: [a-zA-Z_]
;
COMMENT : '#' ~[\r\n]* -> skip;
WS : [ \t\r\n]+ -> channel(HIDDEN);
I tested the grammar to see what tokens it is generating for the test input above using this python:
from antlr4 import InputStream, CommonTokenStream
import MLTL1Lexer
import MLTL1Parser
input="""
i,j : bool;
setvar: set<bool>;
i > 5;
j < 10;
"""
lexer = MLTL1Lexer.MLTL1Lexer(InputStream(input))
stream = CommonTokenStream(lexer)
stream.fill()
tokens = stream.getTokens(0,100)
for t in tokens:
print(str(t.type) + " " + t.text)
parser = MLTL1Parser.MLTL1Parser(stream)
parse_tree = parser.start()
print(parse_tree.toStringTree(recog=parser))
And noticed that both '>' and '<' were assigned the same token value despite being two different tokens. Am I missing something here?
(There may be more than just these two instances, but...)
Change REL_OP and BASE_TYPE to parser rules (i.e. make them lowercase.
As you've used them, you're turning many of your intended Lexer rules, effectively into fragments.
I't important to understand that tokens are the "atoms" you have in your grammar, when you combine several of them into another Lexer rule, you just make that the token type.
(If you used grun to dump the tokens you would have seen them identified as REL_OP tokens.
With the changes below, your sample input works just fine.
grammar MLTL1
;
start: block*;
block: var_list ';' | expr ';';
var_list: IDENTIFIER (',' IDENTIFIER)* ':' type;
type: baseType | KW_SET REL_LT baseType REL_GT;
expr: expr rel_op expr | '(' expr ')' | IDENTIFIER | INT;
//// Lexical Spec
// Types
baseType: 'bool' | 'int' | 'float';
// Keywords
KW_SET: 'set';
// Op groups for precedence
rel_op: REL_EQ | REL_NEQ | REL_GT | REL_LT | REL_GTE | REL_LTE;
// Relational ops
REL_EQ: '==';
REL_NEQ: '!=';
REL_GT: '>';
REL_LT: '<';
REL_GTE: '>=';
REL_LTE: '<=';
IDENTIFIER: LETTER (LETTER | DIGIT)*;
INT: SIGN? NONZERODIGIT DIGIT* | '0';
fragment SIGN: [+-];
fragment DIGIT: [0-9];
fragment NONZERODIGIT: [1-9];
fragment LETTER: [a-zA-Z_];
COMMENT: '#' ~[\r\n]* -> skip;
WS: [ \t\r\n]+ -> channel(HIDDEN);

ANTLR4 Unexpected Parse Behavior

I am trying to build a new language with ANTLR, and I have run into a problem. I am trying to support numerical expressions and mathematical operations on numbers(pretty important I reckon), but the parser doesn't seem to be acting how I expect. Here is my grammar:
grammar Lumos;
/*
* Parser Rules
*/
program : 'start' stat+ 'stop';
block : stat*
;
stat : assign
| numop
| if_stat
| while_stat
| display
;
assign : LET ID BE expr ;
display : DISPLAY expr ;
numop : add | subtract | multiply | divide ;
add : 'add' expr TO ID ;
subtract : 'subtract' expr 'from' ID ;
divide : 'divide' ID BY expr ;
multiply : 'multiply' ID BY expr ;
append : 'append' expr TO ID ;
if_stat
: IF condition_block (ELSE IF condition_block)* (ELSE stat_block)?
;
condition_block
: expr stat_block
;
stat_block
: OBRACE block CBRACE
| stat
;
while_stat
: WHILE expr stat_block
;
expr : expr POW<assoc=right> expr #powExpr
| MINUS expr #unaryExpr
| NOT expr #notExpr
| expr op=(TIMES|DIV|MOD) expr #multiplicativeExpr
| expr op=(PLUS|MINUS) expr #additiveExpr
| expr op=RELATIONALOPERATOR expr #relationalExpr
| expr op=EQUALITYOPERATOR expr #equalityExpr
| expr AND expr #andExpr
| expr OR expr #orExpr
//| ARRAY #arrayExpr
| atom #atomExpr
;
atom : LPAREN expr RPAREN #parExpr
| (INT|FLOAT) #numberExpr
| (TRUE|FALSE) #booleanAtom
| ID #idAtom
| STRING #stringAtom
| NIX #nixAtom
;
compileUnit : EOF ;
/*
* Lexer Rules
*/
fragment LETTER : [a-zA-Z] ;
MATHOP : PLUS
| MINUS
| TIMES
| DIV
| MOD
| POW
;
RELATIONALOPERATOR : LTEQ
| GTEQ
| LT
| GT
;
EQUALITYOPERATOR : EQ
| NEQ
;
LPAREN : '(' ;
RPAREN : ')' ;
LBRACE : '{' ;
RBRACE : '}' ;
OR : 'or' ;
AND : 'and' ;
BY : 'by' ;
TO : 'to' ;
FROM : 'from' ;
LET : 'let' ;
BE : 'be' ;
EQ :'==' ;
NEQ :'!=' ;
LTEQ :'<=' ;
GTEQ :'>=' ;
LT :'<' ;
GT :'>' ;
//Different statements will choose between these, but they are pretty much the
same.
PLUS :'plus' ;
ADD :'add' ;
MINUS :'minus' ;
SUBTRACT :'sub' ;
TIMES :'times' ;
MULT :'multiply' ;
DIV :'divide' ;
MOD :'mod' ;
POW :'pow' ;
NOT :'not' ;
TRUE :'true' ;
FALSE :'false' ;
NIX :'nix' ;
IF :'if' ;
THEN :'then' ;
ELSE :'else' ;
WHILE :'while' ;
DISPLAY :'display' ;
ARRAY : '['(INT|FLOAT)(','(INT|FLOAT))+']';
ID : [a-z]+ ;
WORD : LETTER+ ;
//NUMBER : INT | FLOAT ;
INT : [0-9]+ ;
FLOAT : [0-9]+ '.' [0-9]*
| '.'[0-9]+
;
COMMENT : '#' ~[\r\n]* -> channel(HIDDEN) ;
WS : [ \n\t\r]+ -> channel(HIDDEN) ;
STRING : '"' (~["{}])+ '"' ;
When given the input let foo be 5 times 3, the visitor sees let foo be 5 and an extraneous times 3. I thought I set up the expr rule so that it would recognize a multiplication expression before it recognizes atoms, so this wouldn't happen. I don't know where I went wrong, but it does not work how I expected.
If anyone has any idea where I went wrong or how I can fix this problem, I would appreciate your input.
You're using TIMES in your parser rules, but the MATHOP also matches TIMES and since MATHOP is defined before your TIMES rule, it gets precedence. That is why the TIMES rule in expr op=(TIMES|DIV|MOD) expr isn't matched.
I don't see you using this MATHOP rule anywhere in your parser rules, so I recommend just removing the MATHOP rule all together.

Antlr4 ignoring newlines at all but one point

I am writing a parser for a scripting language, and using antlr 4.5.3 for the purpose.
grammar VSE;
chunk
: block* EOF
;
block
: var '=' exp
| functioncall
;
var
: NAME
| var '[' exp ']'
| var '.' var
;
exp
: number
| string
| var
| functioncall
| <assoc=right> exp exp //concat
;
functioncall
: NAME '(' (exp)? (',' exp)* ')'
| var '.' functioncall
;
string
: '"' (~('"' | '\\' | '\r' | '\n') | '\\' ('"' | '\\'))* '"'
;
NAME
: [a-zA-Z_][a-zA-Z_0-9]*
;
number
: INT | HEX | FLOAT
;
INT
: Digit+
;
HEX
: '0' [xX] [0-9a-fA-F]+
;
FLOAT
: Digit* '.' Digit+
;
Digit
: [0-9]
;
WS
: [ \t\u000C\r\n]+ -> skip
;
However, while testing it, I found a variable assignment like var = something followed by some function call in next line leads to a concat statement. (My concat statement is a variable followed by another like var = var1 var2) I understand that antlr is skipping ALL the new lines in favor of line continuation, but I'd like to add the condition that if there is a new line between two exps, it would treat them as two separate blocks instead of a concat statement. i.e.
var = var2
functioncall(var)
These should be two separate blocks instead of concat statement.
Is there any way to do this?
Does the following rule suitable for you?
block
: var '=' exp NEW_LINE
| functioncall NEW_LINE
;
NEW_LINE: '\r'? '\n'
WS
: [ \t]+ -> skip
;
In another case you should use Semantic Predicates or very unclear grammar.

Precedence in Antlr using parentheses

We are developing a DSL, and we're facing some problems:
Problem 1:
In our DSL, it's allowed to do this:
A + B + C
but not this:
A + B - C
If the user needs to use two or more different operators, he'll need to insert parentheses:
A + (B - C) or (A + B) - C.
Problem 2:
In our DSL, the most precedent operator must be surrounded by parentheses.
For example, instead of using this way:
A + B * C
The user needs to use this:
A + (B * C)
To solve the Problem 1 I've got a snippet of ANTLR that worked, but I'm not sure if it's the best way to solve it:
sumExpr
#init {boolean isSum=false;boolean isSub=false;}
: {isSum(input.LT(2).getText()) && !isSub}? multExpr('+'^{isSum=true;} sumExpr)+
| {isSub(input.LT(2).getText()) && !isSum}? multExpr('-'^{isSub=true;} sumExpr)+
| multExpr;
To solve the Problem 2, I have no idea where to start.
I appreciate your help to find out a better solution to the first problem and a direction to solve the seconde one. (Sorry for my bad english)
Below is the grammar that we have developed:
grammar TclGrammar;
options {
output=AST;
ASTLabelType=CommonTree;
}
#members {
public boolean isSum(String type) {
System.out.println("Tipo: " + type);
return "+".equals(type);
}
public boolean isSub(String type) {
System.out.println("Tipo: " + type);
return "-".equals(type);
}
}
prog
: exprMain ';' {System.out.println(
$exprMain.tree == null ? "null" : $exprMain.tree.toStringTree());}
;
exprMain
: exprQuando? (exprDeveSatis | exprDeveFalharCaso)
;
exprDeveSatis
: 'DEVE SATISFAZER' '{'! expr '}'!
;
exprDeveFalharCaso
: 'DEVE FALHAR CASO' '{'! expr '}'!
;
exprQuando
: 'QUANDO' '{'! expr '}'!
;
expr
: logicExpr
;
logicExpr
: boolExpr (('E'|'OU')^ boolExpr)*
;
boolExpr
: comparatorExpr
| emExpr
| 'VERDADE'
| 'FALSO'
;
emExpr
: FIELD 'EM' '[' (variable_lista | field_lista) comCruzamentoExpr? ']'
-> ^('EM' FIELD (variable_lista+)? (field_lista+)? comCruzamentoExpr?)
;
comCruzamentoExpr
: 'COM CRUZAMENTO' '(' FIELD ';' FIELD (';' FIELD)* ')' -> ^('COM CRUZAMENTO' FIELD+)
;
comparatorExpr
: sumExpr (('<'^|'<='^|'>'^|'>='^|'='^|'<>'^) sumExpr)?
| naoPreenchidoExpr
| preenchidoExpr
;
naoPreenchidoExpr
: FIELD 'NAO PREENCHIDO' -> ^('NAO PREENCHIDO' FIELD)
;
preenchidoExpr
: FIELD 'PREENCHIDO' -> ^('PREENCHIDO' FIELD)
;
sumExpr
#init {boolean isSum=false;boolean isSub=false;}
: {isSum(input.LT(2).getText()) && !isSub}? multExpr('+'^{isSum=true;} sumExpr)+
| {isSub(input.LT(2).getText()) && !isSum}? multExpr('-'^{isSub=true;} sumExpr)+
| multExpr
;
multExpr
: funcExpr(('*'^|'/'^) funcExpr)?
;
funcExpr
: 'QUANTIDADE'^ '('! FIELD ')'!
| 'EXTRAI_TEXTO'^ '('! FIELD ';' INTEGER ';' INTEGER ')'!
| cruzaExpr
| 'COMBINACAO_UNICA' '(' FIELD ';' FIELD (';' FIELD)* ')' -> ^('COMBINACAO_UNICA' FIELD+)
| 'EXISTE'^ '('! FIELD ')'!
| 'UNICO'^ '('! FIELD ')'!
| atom
;
cruzaExpr
: operadorCruzaExpr ('CRUZA COM'^|'CRUZA AMBOS'^) operadorCruzaExpr ondeExpr?
;
operadorCruzaExpr
: FIELD('('!field_lista')'!)?
;
ondeExpr
: ('ONDE'^ '('!expr')'!)
;
atom
: FIELD
| VARIABLE
| '('! expr ')'!
;
field_lista
: FIELD(';' field_lista)?
;
variable_lista
: VARIABLE(';' variable_lista)?
;
FIELD
: NONCONTROL_CHAR+
;
NUMBER
: INTEGER | FLOAT
;
VARIABLE
: '\'' NONCONTROL_CHAR+ '\''
;
fragment SIGN: '+' | '-';
fragment NONCONTROL_CHAR: LETTER | DIGIT | SYMBOL;
fragment LETTER: LOWER | UPPER;
fragment LOWER: 'a'..'z';
fragment UPPER: 'A'..'Z';
fragment DIGIT: '0'..'9';
fragment SYMBOL: '_' | '.' | ',';
fragment FLOAT: INTEGER '.' '0'..'9'*;
fragment INTEGER: '0' | SIGN? '1'..'9' '0'..'9'*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {skip();}
;
This is similar to not having operator precedence at all.
expr
: funcExpr
( ('+' funcExpr)*
| ('-' funcExpr)*
| ('*' funcExpr)*
| ('/' funcExpr)*
)
;
I think the following should work. I'm assuming some lexer tokens with obvious names.
expr: sumExpr;
sumExpr: onlySum | subExpr;
onlySum: atom ( PLUS onlySum )?;
subExpr: onlySub | multExpr;
onlySub: atom ( MINUS onlySub )? ;
multExpr: atom ( STAR atomic )? ;
parenExpr: OPEN_PAREN expr CLOSE_PAREN;
atom: FIELD | VARIABLE | parenExpr
The only* rules match an expression if it only has one type of operator outside of parentheses. The *Expr rules match either a line with the appropriate type of operators or go to the next operator.
If you have multiple types of operators, then they are forced to be inside parentheses because the match will go through atom.