I'm just getting started with using ANTLR. I'm trying to write a parser for field definitions that look like:
field_name = value
Example:
is_true_true = yes;
My grammar looks like this:
grammar Hello;
//Lexer Rules
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT: '0'..'9';
fragment TRUE: 'TRUE'|'true';
fragment FALSE: 'FALSE'|'false';
INTEGER : DIGIT+ ;
STRING : ('\''.*?'\'') ;
BOOLEAN : (TRUE|FALSE);
WORD : (LOWERCASE | UPPERCASE | '_')+ ;
WHITESPACE : (' ' | '\t')+ ;
NEWLINE : ('\r'? '\n' | '\r')+ ;
field_def : WORD '=' WORD ';' ;
But when I run the generated Parser on 'working = yes;' i get the error message:
line 1:7 extraneous input ' ' expecting '='
line 1:9 extraneous input ' ' expecting WORD
I do not understand this fully, is there an error in matching the WORD-pattern or is it something else entirely?
Since it's quite usual that the whitespace is not significant to your grammar (i.e. there's no semantic meaning to it, apart of separating words), ANTLR makes it possible to just skip it:
In ANTLR 4 this is done by
WHITESPACE : (' ' | '\t')+ -> skip;
NEWLINE : ('\r'? '\n' | '\r')+ -> skip;
In ANTLR 3 the syntax is
WHITESPACE : (' ' | '\t')+ { $channel = HIDDEN; };
NEWLINE : ('\r'? '\n' | '\r')+ { $channel = HIDDEN; };
What this does is the lexer tokenizes the input as usual, but parser understands that these tokens are not significant to it and behaves as if they were not there, allowing you to keep your rules simple and without need to add optional whitespace everywhere.
Your example has whitespace but your field_def isn't accounting for it.
Related
ANLTR 4:
I need to support a single quoted string literal with escaped characters AND the ability to use double curly braces as an 'escape sequence' that will need additional parsing. So both of these examples need to be supported. I'm not so worried about the second example because that seems trivial if I can get the first to work and not match double curly brace characters.
1. 'this is a string literal with an escaped\' character'
2. 'this is a string {{functionName(x)}} literal with double curlies'
StringLiteral
: '\'' (ESC | AnyExceptDblCurlies)*? '\'' ;
fragment
ESC : '\\' [btnr\'\\];
fragment
AnyExceptDblCurlies
: '{' ~'{'
| ~'{' .;
I've done a lot of research on this and understand that you can't negate multiple characters, and have even seen a similar approach work in Bart's answer in this post...
Negating inside lexer- and parser rules
But what I'm seeing is that in example 1 above, the escaped single quote is not being recognized and I get a parser error that it cannot match ' character'.
if I alter the string literal token rule to the following it works...
StringLiteral
: '\'' (ESC | .)*? '\'' ;
Any ideas how to handle this scenario better? I can deduce that the escaped character is getting matched by AnyExceptDblCurlies instead of ESC, but I'm not sure how to solve this problem.
To parse the template definition out of the string pretty much requires handling in the parser. Use lexer modes to distinguish between string characters and the template name.
Parser:
options {
tokenVocab = TesterLexer ;
}
test : string EOF ;
string : STRBEG ( SCHAR | template )* STREND ; // allow multiple templates per string
template : TMPLBEG TMPLNAME TMPLEND ;
Lexer:
STRBEG : Squote -> pushMode(strMode) ;
mode strMode ;
STRESQ : Esqote -> type(SCHAR) ; // predeclare SCHAR in tokens block
STREND : Squote -> popMode ;
TMPLBEG : DBrOpen -> pushMode(tmplMode) ;
STRCHAR : . -> type(SCHAR) ;
mode tmplMode ;
TMPLEND : DBrClose -> popMode ;
TMPLNAME : ~'}'* ;
fragment Squote : '\'' ;
fragment Esqote : '\\\'' ;
fragment DBrOpen : '{{' ;
fragment DBrClose : '}}' ;
Updated to correct the TMPLNAME rule, add main rule and options block.
In my grammar I use:
WS: [ \t\r\n]+ -> skip;
when I change this to use HIDDEN channel:
WS: [ \t\r\n]+ -> channel(HIDDEN);
I receive errors (extraneous input ' '...) I did not receive while using 'skip'.
I thought, that skipping and sending to a channel does not differ if it comes to a content passed to a parser.
Below you can find a code excerpt in which the parser is executed:
CharStream charStream = new ANTLRInputStream(formulaString);
FormulaLexer lexer = new FormulaLexer(charStream);
BufferedTokenStream tokens = new BufferedTokenStream(lexer);
FormulaParser parser = new FormulaParser(tokens);
ParseTree tree = parser.startRule();
StartRuleVisitor startRuleVisitor = new StartRuleVisitor();
startRuleVisitor.visit(tree);
VariableVisitor variableVisitor = new VariableVisitor(tokens);
variableVisitor.visit(tree);
And a grammar itself:
grammar Formula;
startRule
: variable RELATION_OPERATOR integer
;
integer
: DIGIT+
;
identifier
: (LETTER | DIGIT) ( DIGIT | LETTER | '_' | '.')+
;
tableId
: 'T_' (identifier | WILDCARD)
;
rowId
: 'R_' (identifier | WILDCARD)
;
columnId
: 'C_' (identifier | WILDCARD)
;
sheetId
: 'S_' (identifier | WILDCARD)
;
variable
: L_CURLY_BRACKET cellIdComponent (COMMA cellIdComponent)+ R_CURLY_BRACKET
;
cellIdComponent
: tableId | rowId | columnId | sheetId
;
COMMA
: ','
;
RELATION_OPERATOR
: EQ
;
WILDCARD
: 'NNN'
;
L_CURLY_BRACKET
: '{'
;
R_CURLY_BRACKET
: '}'
;
LETTER
: ('a' .. 'z') | ('A' .. 'Z')
;
DIGIT
: ('0' .. '9')
;
EQ
: '='
| 'EQ' | 'eq'
;
WS
: [ \t\r\n]+ -> channel(HIDDEN)
;
String I try to parse:
{T_C 00.01, R_010, C_010} = 1
Output I get with channel(HIDDEN) used:
line 1:4 extraneous input ' ' expecting {'_', '.', LETTER, DIGIT}
line 1:11 extraneous input ' ' expecting {'T_', 'R_', 'C_', 'S_'}
line 1:18 extraneous input ' ' expecting {'T_', 'R_', 'C_', 'S_'}
line 1:27 extraneous input ' ' expecting RELATION_OPERATOR
line 1:29 extraneous input ' ' expecting DIGIT
But if I change channel(HIDDEN) to 'skip' there are no errors.
What is more, I have observed that for more complex grammar than this i get 'no viable alternative at input...' if I use channel(HIDDEN) and once again the error disappear for the 'skip'.
Do you know what may be the cause of it?
You should use CommonTokenStream instead of BufferedTokenStream. See BufferedTokenStream description on github:
This token stream ignores the value of {#link Token#getChannel}. If your
parser requires the token stream filter tokens to only those on a particular
channel, such as {#link Token#DEFAULT_CHANNEL} or
{#link Token#HIDDEN_CHANNEL}, use a filtering token stream such a
{#link CommonTokenStream}.
I am an ANTLR newbie and struggling with some of the errors that I am getting. Below I have included the grammar that I am using, The input file and the error that I am getting.
My Antlr Grammar file is as follows:
grammar Simple;
#header
{
package simple;
}
PARSER
program :
anylinebefore+
processline
anylineafter+
'MSEND' NEWLINE
'.' EOF
;
anylinebefore: CH* NEWLINE | commentline;
anylineafter: statement | commentline;
statement: movestatement ;
movestatement : 'MOVE' arg ('to' | 'TO') ID '.' NEWLINE ;
arg : ID|STRING;
processline: PROCESSLITERAL NEWLINE;
commentline: '!' CH* NEWLINE;
LEXER
WS : [ \t]+ -> skip ;
STRING : '\'' (~['])* '\'';
ID : ('a'..'z'|'A'..'Z')+;
INT : '0'..'9'+;
TO : ('to' | 'TO');
CH : [\u0000-\uFFFE];
PROCESSLITERAL : 'PROCESS SOURCE FOLLOWS';
NEWLINE : '\r'? '\n' ;
My input file is as follows:
MODIFY
PROCESS SOURCE FOLLOWS
MOVE 'WSFRED' TO AGRPASSEDTWO.
MSEND
.
The error that I get is:
showtree:
[java] line 1:0 extraneous input 'MODIFY' expecting {'!', CH, NEWLINE}
I don't understand why this isn't matching anylinebefore in the grammar
Any help would be appreciated.
"MODIFY" is an ID, which doesn't match anylinebefore+
I've created a small grammar in ANTLR using python (a grammar that can accept either a list of numbers of a list of IDs), and yet when I input a string such as December 12 1965, ANTLR will run on the file and show me no errors with the following code (and all of the python code that I'm using is imbedded via the #main):
grammar ParserLang;
options {
language=Python;
}
#header {
import sys
import antlr3
from ParserLangLexer import ParserLangLexer
}
#main {
def main(argv, otherArg=None):
char_stream = antlr3.ANTLRInputStream(open(sys.argv[1],'r'))
lexer = ParserLangLexer(char_stream)
tokens = CommonTokenStream(lexer)
parser = ParserLangParser(tokens);
rule = parser.entry_rule()
}
program : idList EOF
| integerList EOF
;
idList : ID whitespace idList
| ID
;
integerList : INTEGER whitespace integerList
| INTEGER
;
whitespace : (WHITESPACE | COMMENT) +;
ID : LETTER (DIGIT | LETTER)*;
INTEGER : (NONZERO_DIGIT DIGIT*) | ZERO ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
COMMENT : ('/*' .* '*/') | ('//' .* '\n') { $channel = HIDDEN; } ;
fragment ZERO : '0' ;
fragment DIGIT : '0' .. '9';
fragment NONZERO_DIGIT : '1' .. '9';
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
Am I doing something wrong?
EDIT: When I use ANTLRWorks with the same grammar an input, a NoViableAltException is thrown. How do I get that error via code?
I could not reproduce it. When I generate a lexer and parser from your input after fixing the error in the grammar (rule = parser.entry_rule() should be: rule = parser.program()), and parse the input "December 12 1965" (either as input from a file, or as a plain string), I get the following error:
line 1:0 no viable alternative at input u'December'
Which may seem strange since that could be the start of a idList. The fact is, your grammar contains one more error and a small thing that could be improved:
WHITESPACE and COMMENT are placed on the HIDDEN channel, and are therefor not available in parser rules (at least, not without changing the channel from which the parser reads its tokens...);
a COMMENT at the end of the input, that is, without a \n at the end, will not be properly tokenized. Better define a single line comment like this: '//' ~('\r' | '\n')*. The trailing line break will be captured by the WHITESPACE rule after all.
Because the parser cannot match an idList (or a integerList for that matter) because of the whitespace rule, an error is produced pointing at the very first token ('December').
Here's a grammar that works (as expected):
grammar ParserLang;
options {
language=Python;
}
#header {
import sys
import antlr3
from ParserLangLexer import ParserLangLexer
}
#main {
def main(argv, otherArg=None):
lexer = ParserLangLexer(antlr3.ANTLRStringStream('December 12 1965'))
parser = ParserLangParser(CommonTokenStream(lexer))
parser.program()
}
program : idList EOF
| integerList EOF
;
idList : ID+
;
integerList : INTEGER+
;
ID : LETTER (DIGIT | LETTER)*;
INTEGER : (NONZERO_DIGIT DIGIT*) | ZERO ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
COMMENT : ('/*' .* '*/' | '//' ~('\r' | '\n')*) { $channel = HIDDEN; } ;
fragment ZERO : '0' ;
fragment DIGIT : '0' .. '9';
fragment NONZERO_DIGIT : '1' .. '9';
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
Running the parser generated from the grammar above will also produce an error:
line 1:9 missing EOF at u'12'
but that is expected: after an idList, the parser expects the EOF, but it encounters '12' instead.
I have a simple grammar to parse files containing identifiers and keywords between brackets (hopefully):
grammar Keyword;
// PARSER RULES
//
entry_point : ('['ID']')*;
// LEXER RULES
//
KEYWORD : '[Keyword]';
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
WS : ( ' ' | '\t' | '\r' | '\n' | '\r\n')
{
$channel = HIDDEN;
};
It works for input:
[Hi]
[Hi]
It returns a NoViableAltException error for input:
[Hi]
[Ki]
If I comment KEYWORD, then it works fine. Also, if I change my grammar to:
grammar Keyword;
// PARSER RULES
//
entry_point : ID*;
// LEXER RULES
//
KEYWORD : '[Keyword]';
ID : '[' ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* ']';
WS : ( ' ' | '\t' | '\r' | '\n' | '\r\n')
{
$channel = HIDDEN;
};
Then it works. Could you please help me figuring out why?
Best regards.
The 1st grammar fails because whenever the lexer sees "[K", the lexer will enter the KEYWORD rule. If it then encounters something other then "eyword]", "i" in your case, it tries to go back to some other rule that can match "[K". But there is no other lexer rule that starts with "[K" and will therefor throw an exception. Note that the lexer doesn't remove "K" and then tries to match again (the lexer is a dumb machine)!
Your 2nd grammar works, because the lexer now can find something to fall back on when "[Ki" does not get matched by the KEYWORD since ID now includes the "[".