Antlr compiles text but always throws an error when there's a whitespace

Antlr compiles text but always throws an error when there's a whitespace - antlr

So this is my grammar:
grammar Test;
prog: stmt_list;
stmt_list
: stmt_list stmt ';'
| stmt ';'
;
stmt
: assignment
| bind
;
assignment: 'var' IDENTIFIER ('=' | '+=' | '-=' | '*=' | '/=') expression;
type
: IDENTIFIER
| primitiveType
;
primitiveType
: 'int'
| 'float'
| 'string'
| 'bool'
;
expression
: atom
| expression ('*' | '/') expression
| expression ('+' | '-') expression
;
atom
: '(' expression ')'
| IDENTIFIER
| INT
| STRING
;
IDENTIFIER: [A-z_][A-z_0-9]*;
INT: [1-9][0-9]*;
STRING: '"' [A-z] '"';
WS: [\t\r\n]+ -> channel(HIDDEN);
I can compile it with antlr and everything works fine. When I test it with grun it will compile but it throws a "token recognition error" whenever there's a whitespace. For example with this input:
var a = b + c;
I get:
line 1:3 token recognition error at: ' '
line 1:5 token recognition error at: ' '
line 1:7 token recognition error at: ' '
line 1:9 token recognition error at: ' '
line 1:11 token recognition error at: ' '
Besides this everything works but it would still be nice if I could get rid of these messages.

You're only putting tabs and line break chars to the hidden channel, not spaces.
Instead of:
WS: [\t\r\n]+ -> channel(HIDDEN);
do:
WS: [ \t\r\n]+ -> channel(HIDDEN);

Related

Antlr4 mismatched input '<' expecting '<' with (seemingly) no lexer ambiguity

I cannot seem to figure out what antlr is doing here in this grammar. I have a grammar that should match an input like:
i,j : bool;
setvar : set<bool>;
i > 5;
j < 10;
But I keep getting an error telling me that "line 3:13 mismatched input '<' expecting '<'". This tells me there is some ambiguity in the lexer, but I only use '<' in a single token.
Here is the grammar:
//// Parser Rules
grammar MLTL1;
start: block*;
block: var_list ';'
| expr ';'
;
var_list: IDENTIFIER (',' IDENTIFIER)* ':' type ;
type: BASE_TYPE
| KW_SET REL_LT BASE_TYPE REL_GT
;
expr: expr REL_OP expr
| '(' expr ')'
| IDENTIFIER
| INT
;
//// Lexical Spec
// Types
BASE_TYPE: 'bool'
| 'int'
| 'float'
;
// Keywords
KW_SET: 'set' ;
// Op groups for precedence
REL_OP: REL_EQ | REL_NEQ | REL_GT | REL_LT
| REL_GTE | REL_LTE ;
// Relational ops
REL_EQ: '==' ;
REL_NEQ: '!=' ;
REL_GT: '>' ;
REL_LT: '<' ;
REL_GTE: '>=' ;
REL_LTE: '<=' ;
IDENTIFIER
: LETTER (LETTER | DIGIT)*
;
INT
: SIGN? NONZERODIGIT DIGIT*
| '0'
;
fragment
SIGN
: [+-]
;
fragment
DIGIT
: [0-9]
;
fragment
NONZERODIGIT
: [1-9]
;
fragment
LETTER
: [a-zA-Z_]
;
COMMENT : '#' ~[\r\n]* -> skip;
WS : [ \t\r\n]+ -> channel(HIDDEN);
I tested the grammar to see what tokens it is generating for the test input above using this python:
from antlr4 import InputStream, CommonTokenStream
import MLTL1Lexer
import MLTL1Parser
input="""
i,j : bool;
setvar: set<bool>;
i > 5;
j < 10;
"""
lexer = MLTL1Lexer.MLTL1Lexer(InputStream(input))
stream = CommonTokenStream(lexer)
stream.fill()
tokens = stream.getTokens(0,100)
for t in tokens:
print(str(t.type) + " " + t.text)
parser = MLTL1Parser.MLTL1Parser(stream)
parse_tree = parser.start()
print(parse_tree.toStringTree(recog=parser))
And noticed that both '>' and '<' were assigned the same token value despite being two different tokens. Am I missing something here?

(There may be more than just these two instances, but...)
Change REL_OP and BASE_TYPE to parser rules (i.e. make them lowercase.
As you've used them, you're turning many of your intended Lexer rules, effectively into fragments.
I't important to understand that tokens are the "atoms" you have in your grammar, when you combine several of them into another Lexer rule, you just make that the token type.
(If you used grun to dump the tokens you would have seen them identified as REL_OP tokens.
With the changes below, your sample input works just fine.
grammar MLTL1
;
start: block*;
block: var_list ';' | expr ';';
var_list: IDENTIFIER (',' IDENTIFIER)* ':' type;
type: baseType | KW_SET REL_LT baseType REL_GT;
expr: expr rel_op expr | '(' expr ')' | IDENTIFIER | INT;
//// Lexical Spec
// Types
baseType: 'bool' | 'int' | 'float';
// Keywords
KW_SET: 'set';
// Op groups for precedence
rel_op: REL_EQ | REL_NEQ | REL_GT | REL_LT | REL_GTE | REL_LTE;
// Relational ops
REL_EQ: '==';
REL_NEQ: '!=';
REL_GT: '>';
REL_LT: '<';
REL_GTE: '>=';
REL_LTE: '<=';
IDENTIFIER: LETTER (LETTER | DIGIT)*;
INT: SIGN? NONZERODIGIT DIGIT* | '0';
fragment SIGN: [+-];
fragment DIGIT: [0-9];
fragment NONZERODIGIT: [1-9];
fragment LETTER: [a-zA-Z_];
COMMENT: '#' ~[\r\n]* -> skip;
WS: [ \t\r\n]+ -> channel(HIDDEN);

ANTLR grammar not picking the right option

So, I'm trying to assign a method value to a var in a test program, I'm using a Decaf grammar.
The grammar:
// Define decaf grammar
grammar Decaf;
// Reglas LEXER
// Definiciones base para letras y digitos
fragment LETTER: ('a'..'z'|'A'..'Z'|'_');
fragment DIGIT: '0'..'9';
// Las otras reglas de lexer de Decaf
ID: LETTER (LETTER|DIGIT)*;
NUM: DIGIT(DIGIT)*;
CHAR: '\'' ( ~['\r\n\\] | '\\' ['\\] ) '\'';
WS : [ \t\r\n\f]+ -> channel(HIDDEN);
COMMENT
: '/*' .*? '*/' -> channel(2)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(2)
;
// -----------------------------------------------------------------------------------------------------------------------------------------
// Reglas PARSER
program:'class' 'Program' '{' (declaration)* '}';
declaration
: structDeclaration
| varDeclaration
| methodDeclaration
;
varDeclaration
: varType ID ';'
| varType ID '[' NUM ']' ';'
;
structDeclaration:'struct' ID '{' (varDeclaration)* '}' (';')?;
varType
: 'int'
| 'char'
| 'boolean'
| 'struct' ID
| structDeclaration
| 'void'
;
methodDeclaration: methodType ID '(' (parameter (',' parameter)*)* ')' block;
methodType
: 'int'
| 'char'
| 'boolean'
| 'void'
;
parameter
: parameterType ID
| parameterType ID '[' ']'
| 'void'
;
parameterType
: 'int'
| 'char'
| 'boolean'
;
block: '{' (varDeclaration)* (statement)* '}';
statement
: 'if' '(' expression ')' block ( 'else' block )? #stat_if
| 'while' '('expression')' block #stat_else
| 'return' expressionOom ';' #stat_return
| methodCall ';' #stat_mcall
| block #stat_block
| location '=' expression #stat_assignment
| (expression)? ';' #stat_line
;
expressionOom: expression |;
location: (ID|ID '[' expression ']') ('.' location)?;
expression
: location #expr_loc
| methodCall #expr_mcall
| literal #expr_literal
| '-' expression #expr_minus // Unary Minus Operation
| '!' expression #expr_not // Unary NOT Operation
| '('expression')' #expr_parenthesis
| expression arith_op_fifth expression #expr_arith5 // * / % << >>
| expression arith_op_fourth expression #expr_arith4 // + -
| expression arith_op_third expression #expr_arith3 // == != < <= > >=
| expression arith_op_second expression #expr_arith2 // &&
| expression arith_op_first expression #expr_arith1 // ||
;
methodCall: ID '(' arg1 ')';
// Puede ir algo que coincida con arg2 o nada, en caso de una llamada a metodo sin parametro
arg1: arg2 |;
// Expression y luego se utiliza * para permitir 0 o más parametros adicionales
arg2: (arg)(',' arg)*;
arg: expression;
// Operaciones
// Divididas por nivel de precedencia
// Especificación de precedencia: https://anoopsarkar.github.io/compilers-class/decafspec.html
rel_op : '<' | '>' | '<=' | '>=' ;
eq_op : '==' | '!=' ;
arith_op_fifth: '*' | '/' | '%' | '<<' | '>>';
arith_op_fourth: '+' | '-';
arith_op_third: rel_op | eq_op;
arith_op_second: '&&';
arith_op_first: '||';
literal : int_literal | char_literal | bool_literal ;
int_literal : NUM ;
char_literal : '\'' CHAR '\'' ;
bool_literal : 'true' | 'false' ;
And the test program is as follows:
class Program
{
int factorial(int b)
{
int n;
n = 1;
return n+2;
}
void main(void)
{
int a;
int b;
b=0;
a=factorial(b);
factorial(b);
return;
}
}
The parse tree for this program looks as following, at least for the part I'm interested which is a=factorial(b):
This tree is wrong, since it should look like location = expression -> methodCall
The following tree is how it looks on a friend's implementation, and it should sort of look like this if the grammar was correctly implemented:
This is correctly implemented, or the result I need, since I want the tree to look like location = expression -> methodCall and not location = expression -> location. If I remove the parameter from a=factorial(b) and leave it as a=factorial(), it will be read correctly as a methodCall, so I'm not sure what I'm missing.
So my question is, I'm not sure where I'm messing up in the grammar, I guess it's either on location or expression, but I'm not sure how to adjust it to behave the way I want it to. I sort of just got the rules literally from the specification we were provided.

In an ANTLR rule, alternatives are matches from top to bottom. So in your expression rule:
expression
: location #expr_loc
| methodCall #expr_mcall
...
;
the generated parser will try to match a location before it tries to match a methodCall. Try swapping those two around:
expression
: methodCall #expr_mcall
| location #expr_loc
...
;

How to make antlr find invalid input throw exception

I have the following grammar:
grammar Expr;
expr : '-' expr # unaryOpExpr
| expr ('*'|'/'|'%') expr # mulDivModuloExpr
| expr ('+'|'-') expr # addSubExpr
| '(' expr ')' # nestedExpr
| IDENTIFIER '(' fnArgs? ')' # functionExpr
| IDENTIFIER # identifierExpr
| DOUBLE # doubleExpr
| LONG # longExpr
| STRING # string
;
fnArgs : expr (',' expr)* # functionArgs
;
IDENTIFIER : [_$a-zA-Z][_$a-zA-Z0-9]* | '"' (ESC | ~ ["\\])* '"';
LONG : [0-9]+;
DOUBLE : [0-9]+ '.' [0-9]*;
WS : [ \t\r\n]+ -> skip ;
STRING: '"' (~["\\\r\n] | ESC)* '"';
fragment ESC : '\\' (['"\\/bfnrt] | UNICODE) ;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
MINUS : '-' ;
MUL : '*' ;
DIV : '/' ;
MODULO : '%' ;
PLUS : '+' ;
// math function
MAX: 'MAX';
when I enter following text,It should be effective
-1.1
bug when i enter following text:
-1.1ffff
I think it should report an error, bug antlr didn't do it, antlr captures the previous "-1.1", discard "ffff",
but i want to change this behavior, didn't discard invalid token, but throw exception,report
detection invalid token.
So what should i do, Thanks for your advice

Are you using expr as your main rule? if so make another rule, call it something like parse or program and simply write it like this:
parse: expr EOF;
This will make antlr not ignore trailing tokens that don't make sense, and actually throw an error.

ANTLR4 using HIDDEN channel causes errors while using skip does not

In my grammar I use:
WS: [ \t\r\n]+ -> skip;
when I change this to use HIDDEN channel:
WS: [ \t\r\n]+ -> channel(HIDDEN);
I receive errors (extraneous input ' '...) I did not receive while using 'skip'.
I thought, that skipping and sending to a channel does not differ if it comes to a content passed to a parser.
Below you can find a code excerpt in which the parser is executed:
CharStream charStream = new ANTLRInputStream(formulaString);
FormulaLexer lexer = new FormulaLexer(charStream);
BufferedTokenStream tokens = new BufferedTokenStream(lexer);
FormulaParser parser = new FormulaParser(tokens);
ParseTree tree = parser.startRule();
StartRuleVisitor startRuleVisitor = new StartRuleVisitor();
startRuleVisitor.visit(tree);
VariableVisitor variableVisitor = new VariableVisitor(tokens);
variableVisitor.visit(tree);
And a grammar itself:
grammar Formula;
startRule
: variable RELATION_OPERATOR integer
;
integer
: DIGIT+
;
identifier
: (LETTER | DIGIT) ( DIGIT | LETTER | '_' | '.')+
;
tableId
: 'T_' (identifier | WILDCARD)
;
rowId
: 'R_' (identifier | WILDCARD)
;
columnId
: 'C_' (identifier | WILDCARD)
;
sheetId
: 'S_' (identifier | WILDCARD)
;
variable
: L_CURLY_BRACKET cellIdComponent (COMMA cellIdComponent)+ R_CURLY_BRACKET
;
cellIdComponent
: tableId | rowId | columnId | sheetId
;
COMMA
: ','
;
RELATION_OPERATOR
: EQ
;
WILDCARD
: 'NNN'
;
L_CURLY_BRACKET
: '{'
;
R_CURLY_BRACKET
: '}'
;
LETTER
: ('a' .. 'z') | ('A' .. 'Z')
;
DIGIT
: ('0' .. '9')
;
EQ
: '='
| 'EQ' | 'eq'
;
WS
: [ \t\r\n]+ -> channel(HIDDEN)
;
String I try to parse:
{T_C 00.01, R_010, C_010} = 1
Output I get with channel(HIDDEN) used:
line 1:4 extraneous input ' ' expecting {'_', '.', LETTER, DIGIT}
line 1:11 extraneous input ' ' expecting {'T_', 'R_', 'C_', 'S_'}
line 1:18 extraneous input ' ' expecting {'T_', 'R_', 'C_', 'S_'}
line 1:27 extraneous input ' ' expecting RELATION_OPERATOR
line 1:29 extraneous input ' ' expecting DIGIT
But if I change channel(HIDDEN) to 'skip' there are no errors.
What is more, I have observed that for more complex grammar than this i get 'no viable alternative at input...' if I use channel(HIDDEN) and once again the error disappear for the 'skip'.
Do you know what may be the cause of it?

You should use CommonTokenStream instead of BufferedTokenStream. See BufferedTokenStream description on github:
This token stream ignores the value of {#link Token#getChannel}. If your
parser requires the token stream filter tokens to only those on a particular
channel, such as {#link Token#DEFAULT_CHANNEL} or
{#link Token#HIDDEN_CHANNEL}, use a filtering token stream such a
{#link CommonTokenStream}.

Antlr4 ignoring newlines at all but one point

I am writing a parser for a scripting language, and using antlr 4.5.3 for the purpose.
grammar VSE;
chunk
: block* EOF
;
block
: var '=' exp
| functioncall
;
var
: NAME
| var '[' exp ']'
| var '.' var
;
exp
: number
| string
| var
| functioncall
| <assoc=right> exp exp //concat
;
functioncall
: NAME '(' (exp)? (',' exp)* ')'
| var '.' functioncall
;
string
: '"' (~('"' | '\\' | '\r' | '\n') | '\\' ('"' | '\\'))* '"'
;
NAME
: [a-zA-Z_][a-zA-Z_0-9]*
;
number
: INT | HEX | FLOAT
;
INT
: Digit+
;
HEX
: '0' [xX] [0-9a-fA-F]+
;
FLOAT
: Digit* '.' Digit+
;
Digit
: [0-9]
;
WS
: [ \t\u000C\r\n]+ -> skip
;
However, while testing it, I found a variable assignment like var = something followed by some function call in next line leads to a concat statement. (My concat statement is a variable followed by another like var = var1 var2) I understand that antlr is skipping ALL the new lines in favor of line continuation, but I'd like to add the condition that if there is a new line between two exps, it would treat them as two separate blocks instead of a concat statement. i.e.
var = var2
functioncall(var)
These should be two separate blocks instead of concat statement.
Is there any way to do this?

Does the following rule suitable for you?
block
: var '=' exp NEW_LINE
| functioncall NEW_LINE
;
NEW_LINE: '\r'? '\n'
WS
: [ \t]+ -> skip
;
In another case you should use Semantic Predicates or very unclear grammar.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Antlr compiles text but always throws an error when there's a whitespace - antlr

You're only putting tabs and line break chars to the hidden channel, not spaces. Instead of: WS: [\t\r\n]+ -> channel(HIDDEN); do: WS: [ \t\r\n]+ -> channel(HIDDEN);

Related

Antlr4 mismatched input '<' expecting '<' with (seemingly) no lexer ambiguity

ANTLR grammar not picking the right option

How to make antlr find invalid input throw exception

ANTLR4 using HIDDEN channel causes errors while using skip does not

Antlr4 ignoring newlines at all but one point

Categories

Resources