ANTLR4 using HIDDEN channel causes errors while using skip does not - antlr

In my grammar I use:
WS: [ \t\r\n]+ -> skip;
when I change this to use HIDDEN channel:
WS: [ \t\r\n]+ -> channel(HIDDEN);
I receive errors (extraneous input ' '...) I did not receive while using 'skip'.
I thought, that skipping and sending to a channel does not differ if it comes to a content passed to a parser.
Below you can find a code excerpt in which the parser is executed:
CharStream charStream = new ANTLRInputStream(formulaString);
FormulaLexer lexer = new FormulaLexer(charStream);
BufferedTokenStream tokens = new BufferedTokenStream(lexer);
FormulaParser parser = new FormulaParser(tokens);
ParseTree tree = parser.startRule();
StartRuleVisitor startRuleVisitor = new StartRuleVisitor();
startRuleVisitor.visit(tree);
VariableVisitor variableVisitor = new VariableVisitor(tokens);
variableVisitor.visit(tree);
And a grammar itself:
grammar Formula;
startRule
: variable RELATION_OPERATOR integer
;
integer
: DIGIT+
;
identifier
: (LETTER | DIGIT) ( DIGIT | LETTER | '_' | '.')+
;
tableId
: 'T_' (identifier | WILDCARD)
;
rowId
: 'R_' (identifier | WILDCARD)
;
columnId
: 'C_' (identifier | WILDCARD)
;
sheetId
: 'S_' (identifier | WILDCARD)
;
variable
: L_CURLY_BRACKET cellIdComponent (COMMA cellIdComponent)+ R_CURLY_BRACKET
;
cellIdComponent
: tableId | rowId | columnId | sheetId
;
COMMA
: ','
;
RELATION_OPERATOR
: EQ
;
WILDCARD
: 'NNN'
;
L_CURLY_BRACKET
: '{'
;
R_CURLY_BRACKET
: '}'
;
LETTER
: ('a' .. 'z') | ('A' .. 'Z')
;
DIGIT
: ('0' .. '9')
;
EQ
: '='
| 'EQ' | 'eq'
;
WS
: [ \t\r\n]+ -> channel(HIDDEN)
;
String I try to parse:
{T_C 00.01, R_010, C_010} = 1
Output I get with channel(HIDDEN) used:
line 1:4 extraneous input ' ' expecting {'_', '.', LETTER, DIGIT}
line 1:11 extraneous input ' ' expecting {'T_', 'R_', 'C_', 'S_'}
line 1:18 extraneous input ' ' expecting {'T_', 'R_', 'C_', 'S_'}
line 1:27 extraneous input ' ' expecting RELATION_OPERATOR
line 1:29 extraneous input ' ' expecting DIGIT
But if I change channel(HIDDEN) to 'skip' there are no errors.
What is more, I have observed that for more complex grammar than this i get 'no viable alternative at input...' if I use channel(HIDDEN) and once again the error disappear for the 'skip'.
Do you know what may be the cause of it?

You should use CommonTokenStream instead of BufferedTokenStream. See BufferedTokenStream description on github:
This token stream ignores the value of {#link Token#getChannel}. If your
parser requires the token stream filter tokens to only those on a particular
channel, such as {#link Token#DEFAULT_CHANNEL} or
{#link Token#HIDDEN_CHANNEL}, use a filtering token stream such a
{#link CommonTokenStream}.

Related

How to make antlr find invalid input throw exception

I have the following grammar:
grammar Expr;
expr : '-' expr # unaryOpExpr
| expr ('*'|'/'|'%') expr # mulDivModuloExpr
| expr ('+'|'-') expr # addSubExpr
| '(' expr ')' # nestedExpr
| IDENTIFIER '(' fnArgs? ')' # functionExpr
| IDENTIFIER # identifierExpr
| DOUBLE # doubleExpr
| LONG # longExpr
| STRING # string
;
fnArgs : expr (',' expr)* # functionArgs
;
IDENTIFIER : [_$a-zA-Z][_$a-zA-Z0-9]* | '"' (ESC | ~ ["\\])* '"';
LONG : [0-9]+;
DOUBLE : [0-9]+ '.' [0-9]*;
WS : [ \t\r\n]+ -> skip ;
STRING: '"' (~["\\\r\n] | ESC)* '"';
fragment ESC : '\\' (['"\\/bfnrt] | UNICODE) ;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
MINUS : '-' ;
MUL : '*' ;
DIV : '/' ;
MODULO : '%' ;
PLUS : '+' ;
// math function
MAX: 'MAX';
when I enter following text,It should be effective
-1.1
bug when i enter following text:
-1.1ffff
I think it should report an error, bug antlr didn't do it, antlr captures the previous "-1.1", discard "ffff",
but i want to change this behavior, didn't discard invalid token, but throw exception,report
detection invalid token.
So what should i do, Thanks for your advice
Are you using expr as your main rule? if so make another rule, call it something like parse or program and simply write it like this:
parse: expr EOF;
This will make antlr not ignore trailing tokens that don't make sense, and actually throw an error.

ANTLR Pattern "line 1:9 extraneous input ' ' expecting WORD"

I'm just getting started with using ANTLR. I'm trying to write a parser for field definitions that look like:
field_name = value
Example:
is_true_true = yes;
My grammar looks like this:
grammar Hello;
//Lexer Rules
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT: '0'..'9';
fragment TRUE: 'TRUE'|'true';
fragment FALSE: 'FALSE'|'false';
INTEGER : DIGIT+ ;
STRING : ('\''.*?'\'') ;
BOOLEAN : (TRUE|FALSE);
WORD : (LOWERCASE | UPPERCASE | '_')+ ;
WHITESPACE : (' ' | '\t')+ ;
NEWLINE : ('\r'? '\n' | '\r')+ ;
field_def : WORD '=' WORD ';' ;
But when I run the generated Parser on 'working = yes;' i get the error message:
line 1:7 extraneous input ' ' expecting '='
line 1:9 extraneous input ' ' expecting WORD
I do not understand this fully, is there an error in matching the WORD-pattern or is it something else entirely?
Since it's quite usual that the whitespace is not significant to your grammar (i.e. there's no semantic meaning to it, apart of separating words), ANTLR makes it possible to just skip it:
In ANTLR 4 this is done by
WHITESPACE : (' ' | '\t')+ -> skip;
NEWLINE : ('\r'? '\n' | '\r')+ -> skip;
In ANTLR 3 the syntax is
WHITESPACE : (' ' | '\t')+ { $channel = HIDDEN; };
NEWLINE : ('\r'? '\n' | '\r')+ { $channel = HIDDEN; };
What this does is the lexer tokenizes the input as usual, but parser understands that these tokens are not significant to it and behaves as if they were not there, allowing you to keep your rules simple and without need to add optional whitespace everywhere.
Your example has whitespace but your field_def isn't accounting for it.

Antlr4 ignoring newlines at all but one point

I am writing a parser for a scripting language, and using antlr 4.5.3 for the purpose.
grammar VSE;
chunk
: block* EOF
;
block
: var '=' exp
| functioncall
;
var
: NAME
| var '[' exp ']'
| var '.' var
;
exp
: number
| string
| var
| functioncall
| <assoc=right> exp exp //concat
;
functioncall
: NAME '(' (exp)? (',' exp)* ')'
| var '.' functioncall
;
string
: '"' (~('"' | '\\' | '\r' | '\n') | '\\' ('"' | '\\'))* '"'
;
NAME
: [a-zA-Z_][a-zA-Z_0-9]*
;
number
: INT | HEX | FLOAT
;
INT
: Digit+
;
HEX
: '0' [xX] [0-9a-fA-F]+
;
FLOAT
: Digit* '.' Digit+
;
Digit
: [0-9]
;
WS
: [ \t\u000C\r\n]+ -> skip
;
However, while testing it, I found a variable assignment like var = something followed by some function call in next line leads to a concat statement. (My concat statement is a variable followed by another like var = var1 var2) I understand that antlr is skipping ALL the new lines in favor of line continuation, but I'd like to add the condition that if there is a new line between two exps, it would treat them as two separate blocks instead of a concat statement. i.e.
var = var2
functioncall(var)
These should be two separate blocks instead of concat statement.
Is there any way to do this?
Does the following rule suitable for you?
block
: var '=' exp NEW_LINE
| functioncall NEW_LINE
;
NEW_LINE: '\r'? '\n'
WS
: [ \t]+ -> skip
;
In another case you should use Semantic Predicates or very unclear grammar.

Lexer to handle lines with line number prefix

I'm writing a parser for a language that looks like the following:
L00<<identifier>>
L10<<keyword>>
L250<<identifier>>
<<identifier>>
That is, each line may or may not start with a line number of the form Lxxx.. ('L' followed by one or more digits) followed by an identifer or a keyword. Identifiers are standard [a-zA-Z_][a-zA-Z0-9_]* and the number of digits following the L is not fixed. Spaces between the line number and following identifer/keyword are optional (and not present in most cases).
My current lexer looks like:
// Parser rules
commands : command*;
command : LINE_NUM? keyword NEWLINE
| LINE_NUM? IDENTIFIER NEWLINE;
keyword : KEYWORD_A | KEYWORD_B | ... ;
// Lexer rules
fragment INT : [0-9]+;
LINE_NUM : 'L' INT;
KEYWORD_A : 'someKeyword';
KEYWORD_B : 'reservedWord';
...
IDENTIFIER : [a-zA-Z_][a-zA-Z0-9_]*
However this results in all lines beginning with a LINE_NUM token to be tokenized as IDENTIFIERs.
Is there a way to properly tokenize this input using an ANTLR grammar?
You need to add a semantic predicate to IDENTIFIER:
IDENTIFIER
: {_input.getCharPositionInLine() != 0
|| _input.LA(1) != 'L'
|| !Character.isDigit(_input.LA(2))}?
[a-zA-Z_] [a-zA-Z0-9_]*
;
You could also avoid semantic predicates by using lexer modes.
//
// Default mode is active at the beginning of a line
//
LINE_NUM
: 'L' [0-9]+ -> pushMode(NotBeginningOfLine)
;
KEYWORD_A : 'someKeyword' -> pushMode(NotBeginningOfLine);
KEYWORD_B : 'reservedWord' -> pushMode(NotBeginningOfLine);
IDENTIFIER
: ( 'L'
| 'L' [a-zA-Z_] [a-zA-Z0-9_]*
| [a-zA-KM-Z_] [a-zA-Z0-9_]*
)
-> pushMode(NotBeginningOfLine)
;
NL : ('\r' '\n'? | '\n');
mode NotBeginningOfLine;
NotBeginningOfLine_NL : ('\r' '\n'? | '\n') -> type(NL), popMode;
NotBeginningOfLine_KEYWORD_A : KEYWORD_A -> type(KEYWORD_A);
NotBeginningOfLine_KEYWORD_B : KEYWORD_B -> type(KEYWORD_B);
NotBeginningOfLine_IDENTIFIER
: [a-zA-Z_] [a-zA-Z0-9_]* -> type(IDENTIFIER)
;

Why does my antlr grammar seem to properly parse this input?

I've created a small grammar in ANTLR using python (a grammar that can accept either a list of numbers of a list of IDs), and yet when I input a string such as December 12 1965, ANTLR will run on the file and show me no errors with the following code (and all of the python code that I'm using is imbedded via the #main):
grammar ParserLang;
options {
language=Python;
}
#header {
import sys
import antlr3
from ParserLangLexer import ParserLangLexer
}
#main {
def main(argv, otherArg=None):
char_stream = antlr3.ANTLRInputStream(open(sys.argv[1],'r'))
lexer = ParserLangLexer(char_stream)
tokens = CommonTokenStream(lexer)
parser = ParserLangParser(tokens);
rule = parser.entry_rule()
}
program : idList EOF
| integerList EOF
;
idList : ID whitespace idList
| ID
;
integerList : INTEGER whitespace integerList
| INTEGER
;
whitespace : (WHITESPACE | COMMENT) +;
ID : LETTER (DIGIT | LETTER)*;
INTEGER : (NONZERO_DIGIT DIGIT*) | ZERO ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
COMMENT : ('/*' .* '*/') | ('//' .* '\n') { $channel = HIDDEN; } ;
fragment ZERO : '0' ;
fragment DIGIT : '0' .. '9';
fragment NONZERO_DIGIT : '1' .. '9';
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
Am I doing something wrong?
EDIT: When I use ANTLRWorks with the same grammar an input, a NoViableAltException is thrown. How do I get that error via code?
I could not reproduce it. When I generate a lexer and parser from your input after fixing the error in the grammar (rule = parser.entry_rule() should be: rule = parser.program()), and parse the input "December 12 1965" (either as input from a file, or as a plain string), I get the following error:
line 1:0 no viable alternative at input u'December'
Which may seem strange since that could be the start of a idList. The fact is, your grammar contains one more error and a small thing that could be improved:
WHITESPACE and COMMENT are placed on the HIDDEN channel, and are therefor not available in parser rules (at least, not without changing the channel from which the parser reads its tokens...);
a COMMENT at the end of the input, that is, without a \n at the end, will not be properly tokenized. Better define a single line comment like this: '//' ~('\r' | '\n')*. The trailing line break will be captured by the WHITESPACE rule after all.
Because the parser cannot match an idList (or a integerList for that matter) because of the whitespace rule, an error is produced pointing at the very first token ('December').
Here's a grammar that works (as expected):
grammar ParserLang;
options {
language=Python;
}
#header {
import sys
import antlr3
from ParserLangLexer import ParserLangLexer
}
#main {
def main(argv, otherArg=None):
lexer = ParserLangLexer(antlr3.ANTLRStringStream('December 12 1965'))
parser = ParserLangParser(CommonTokenStream(lexer))
parser.program()
}
program : idList EOF
| integerList EOF
;
idList : ID+
;
integerList : INTEGER+
;
ID : LETTER (DIGIT | LETTER)*;
INTEGER : (NONZERO_DIGIT DIGIT*) | ZERO ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
COMMENT : ('/*' .* '*/' | '//' ~('\r' | '\n')*) { $channel = HIDDEN; } ;
fragment ZERO : '0' ;
fragment DIGIT : '0' .. '9';
fragment NONZERO_DIGIT : '1' .. '9';
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
Running the parser generated from the grammar above will also produce an error:
line 1:9 missing EOF at u'12'
but that is expected: after an idList, the parser expects the EOF, but it encounters '12' instead.