Why does my antlr grammar seem to properly parse this input? - antlr

I've created a small grammar in ANTLR using python (a grammar that can accept either a list of numbers of a list of IDs), and yet when I input a string such as December 12 1965, ANTLR will run on the file and show me no errors with the following code (and all of the python code that I'm using is imbedded via the #main):
grammar ParserLang;
options {
language=Python;
}
#header {
import sys
import antlr3
from ParserLangLexer import ParserLangLexer
}
#main {
def main(argv, otherArg=None):
char_stream = antlr3.ANTLRInputStream(open(sys.argv[1],'r'))
lexer = ParserLangLexer(char_stream)
tokens = CommonTokenStream(lexer)
parser = ParserLangParser(tokens);
rule = parser.entry_rule()
}
program : idList EOF
| integerList EOF
;
idList : ID whitespace idList
| ID
;
integerList : INTEGER whitespace integerList
| INTEGER
;
whitespace : (WHITESPACE | COMMENT) +;
ID : LETTER (DIGIT | LETTER)*;
INTEGER : (NONZERO_DIGIT DIGIT*) | ZERO ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
COMMENT : ('/*' .* '*/') | ('//' .* '\n') { $channel = HIDDEN; } ;
fragment ZERO : '0' ;
fragment DIGIT : '0' .. '9';
fragment NONZERO_DIGIT : '1' .. '9';
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
Am I doing something wrong?
EDIT: When I use ANTLRWorks with the same grammar an input, a NoViableAltException is thrown. How do I get that error via code?

I could not reproduce it. When I generate a lexer and parser from your input after fixing the error in the grammar (rule = parser.entry_rule() should be: rule = parser.program()), and parse the input "December 12 1965" (either as input from a file, or as a plain string), I get the following error:
line 1:0 no viable alternative at input u'December'
Which may seem strange since that could be the start of a idList. The fact is, your grammar contains one more error and a small thing that could be improved:
WHITESPACE and COMMENT are placed on the HIDDEN channel, and are therefor not available in parser rules (at least, not without changing the channel from which the parser reads its tokens...);
a COMMENT at the end of the input, that is, without a \n at the end, will not be properly tokenized. Better define a single line comment like this: '//' ~('\r' | '\n')*. The trailing line break will be captured by the WHITESPACE rule after all.
Because the parser cannot match an idList (or a integerList for that matter) because of the whitespace rule, an error is produced pointing at the very first token ('December').
Here's a grammar that works (as expected):
grammar ParserLang;
options {
language=Python;
}
#header {
import sys
import antlr3
from ParserLangLexer import ParserLangLexer
}
#main {
def main(argv, otherArg=None):
lexer = ParserLangLexer(antlr3.ANTLRStringStream('December 12 1965'))
parser = ParserLangParser(CommonTokenStream(lexer))
parser.program()
}
program : idList EOF
| integerList EOF
;
idList : ID+
;
integerList : INTEGER+
;
ID : LETTER (DIGIT | LETTER)*;
INTEGER : (NONZERO_DIGIT DIGIT*) | ZERO ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
COMMENT : ('/*' .* '*/' | '//' ~('\r' | '\n')*) { $channel = HIDDEN; } ;
fragment ZERO : '0' ;
fragment DIGIT : '0' .. '9';
fragment NONZERO_DIGIT : '1' .. '9';
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
Running the parser generated from the grammar above will also produce an error:
line 1:9 missing EOF at u'12'
but that is expected: after an idList, the parser expects the EOF, but it encounters '12' instead.

Related

ANTLR Pattern "line 1:9 extraneous input ' ' expecting WORD"

I'm just getting started with using ANTLR. I'm trying to write a parser for field definitions that look like:
field_name = value
Example:
is_true_true = yes;
My grammar looks like this:
grammar Hello;
//Lexer Rules
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT: '0'..'9';
fragment TRUE: 'TRUE'|'true';
fragment FALSE: 'FALSE'|'false';
INTEGER : DIGIT+ ;
STRING : ('\''.*?'\'') ;
BOOLEAN : (TRUE|FALSE);
WORD : (LOWERCASE | UPPERCASE | '_')+ ;
WHITESPACE : (' ' | '\t')+ ;
NEWLINE : ('\r'? '\n' | '\r')+ ;
field_def : WORD '=' WORD ';' ;
But when I run the generated Parser on 'working = yes;' i get the error message:
line 1:7 extraneous input ' ' expecting '='
line 1:9 extraneous input ' ' expecting WORD
I do not understand this fully, is there an error in matching the WORD-pattern or is it something else entirely?
Since it's quite usual that the whitespace is not significant to your grammar (i.e. there's no semantic meaning to it, apart of separating words), ANTLR makes it possible to just skip it:
In ANTLR 4 this is done by
WHITESPACE : (' ' | '\t')+ -> skip;
NEWLINE : ('\r'? '\n' | '\r')+ -> skip;
In ANTLR 3 the syntax is
WHITESPACE : (' ' | '\t')+ { $channel = HIDDEN; };
NEWLINE : ('\r'? '\n' | '\r')+ { $channel = HIDDEN; };
What this does is the lexer tokenizes the input as usual, but parser understands that these tokens are not significant to it and behaves as if they were not there, allowing you to keep your rules simple and without need to add optional whitespace everywhere.
Your example has whitespace but your field_def isn't accounting for it.

Switching lexer state in antlr3 grammar

I'm trying to construct an antlr grammar to parse a templating language. that language can be embedded in any text and the boundaries are marked with opening/closing tags: {{ / }}. So a valid template looks like this:
foo {{ someVariable }} bar
Where foo and bar should be ignored, and the part inside the {{ and }} tags should be parsed. I've found this question which basically has an answer for the problem, except that the tags are only one { and }. I've tried to modify the grammar to match 2 opening/closing characters, but as soon as i do this, the BUFFER rule consumes ALL characters, also the opening and closing brackets. The LD rule is never being invoked.
Has anyone an idea why the antlr lexer is consuming all tokens in the Buffer rule when the delimiters have 2 characters, but does not consume the delimiters when they have only one character?
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
#lexer::members {
private boolean insideTag = false;
}
start
: (tag | BUFFER )*
;
tag
: LD IDENT^ RD
;
LD #after {
// flip lexer the state
insideTag=true;
System.err.println("FLIPPING TAG");
} : '{{';
RD #after {
// flip the state back
insideTag=false;
} : '}}';
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
IDENT : (LETTER)*;
BUFFER : { !insideTag }?=> ~(LD | RD)+;
fragment LETTER : ('a'..'z' | 'A'..'Z');
You can match any character once or more until you see {{ ahead by including a predicate inside the parenthesis ( ... )+ (see the BUFFER rule in the demo).
A demo:
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
#lexer::members {
private boolean insideTag = false;
}
start
: tag EOF
;
tag
: LD IDENT^ RD
;
LD
#after {insideTag=true;}
: '{{'
;
RD
#after {insideTag=false;}
: '}}'
;
BUFFER
: ({!insideTag && !(input.LA(1)=='{' && input.LA(2)=='{')}?=> .)+ {$channel=HIDDEN;}
;
SPACE
: (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
;
IDENT
: ('a'..'z' | 'A'..'Z')+
;
Note that it's best to keep the BUFFER rule as the first lexer rule in your grammar: that way, it will be the first token that is tried.
If you now parse "foo {{ someVariable }} bar", the following AST is created:
Wouldn't a grammar like this fit your needs? I don't see why the BUFFER needs to be that complicated.
grammar test;
options {
output=AST;
ASTLabelType=CommonTree;
}
#lexer::members {
private boolean inTag=false;
}
start
: tag* EOF
;
tag
: LD IDENT RD -> IDENT
;
LD
#after { inTag=true; }
: '{{'
;
RD
#after { inTag=false; }
: '}}'
;
IDENT : {inTag}?=> ('a'..'z'|'A'..'Z'|'_') 'a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
BUFFER
: . {$channel=HIDDEN;}
;

ANTLR parsing Java Properties

I'm trying to pick up ANTLR and writing a grammar for Java Properties. I'm hitting an issue here and will appreciate some help.
In Java Properties, it has a little strange escape handling. For example,
key1=1=Key1
key\=2==
results in key-value pairs in Java runtime as
KEY VALUE
=== =====
key1 1=Key1
key=2 =
So far, this is the best I can mimic.. by folding the '=' and value into one single token.
grammar Prop;
file : (pair | LINE_COMMENT)* ;
pair : ID VALUE ;
ID : (~('='|'\r'|'\n') | '\\=')* ;
VALUE : '=' (~('\r'|'\n'))*;
CARRIAGE_RETURN
: ('\r'|'\n') + {$channel=HIDDEN;}
;
LINE_COMMENT
: '#' ~('\r'|'\n')* ('\r'|'\n'|EOF)
;
Is there any good suggestion if I can implement a better one?
Thanks a lot
It's not as easy as that. You can't handle much at the lexing level because many things depend on a certain context. So at the lexing level, you can only match single characters and construct key and values in parser rules. Also, the = and : as possible key-value separators and the fact that these characters can be the start of a value, makes them a pain in the butt to translate into a grammar. The easiest would be to include these (possible) separator chars in your value-rule and after matching the separator and value together, strip the separator chars from it.
A small demo:
JavaProperties.g
grammar JavaProperties;
parse
: line* EOF
;
line
: Space* keyValue
| Space* Comment eol
| Space* LineBreak
;
keyValue
: key separatorAndValue eol
{
// Replace all escaped `=` and `:`
String k = $key.text.replace("\\:", ":").replace("\\=", "=");
// Remove the separator, if it exists
String v = $separatorAndValue.text.replaceAll("^\\s*[:=]\\s*", "");
// Remove all escaped line breaks with trailing spaces
v = v.replaceAll("\\\\(\r?\n|\r)[ \t\f]*", "").trim();
System.out.println("\nkey : `" + k + "`");
System.out.println("value : `" + v + "`");
}
;
key
: keyChar+
;
keyChar
: AlphaNum
| Backslash (Colon | Equals)
;
separatorAndValue
: (Space | Colon | Equals) valueChar+
;
valueChar
: AlphaNum
| Space
| Backslash LineBreak
| Equals
| Colon
;
eol
: LineBreak
| EOF
;
Backslash : '\\';
Colon : ':';
Equals : '=';
Comment
: ('!' | '#') ~('\r' | '\n')*
;
LineBreak
: '\r'? '\n'
| '\r'
;
Space
: ' '
| '\t'
| '\f'
;
AlphaNum
: 'a'..'z'
| 'A'..'Z'
| '0'..'9'
;
The grammar above can be tested with the class:
Main.java
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRFileStream("test.properties");
JavaPropertiesLexer lexer = new JavaPropertiesLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
JavaPropertiesParser parser = new JavaPropertiesParser(tokens);
parser.parse();
}
}
and the input file:
test.properties
key1 = value 1
key2:value 2
key3 :value3
ke\:\=y4=v\
a\
l\
u\
e 4
key\=5==
key6 value6
to produce the following output:
key : `key1`
value : `value 1`
key : `key2`
value : `value 2`
key : `key3`
value : `value3`
key : `ke:=y4`
value : `value 4`
key : `key=5`
value : `=`
key : `key6`
value : `value6`
Realize that my grammar is just an example: it does not account for all valid properties files (sometimes backslashes should be ignored, there's no Unicode escapes, many characters are missing in the key and value). For a complete specification of properties files, see:
http://download.oracle.com/javase/6/docs/api/java/util/Properties.html#load%28java.io.Reader%29

ANTLR expression interpreter

I have created the following grammar: I would like some idea how to build an interpreter that returns a tree in java, which I can later use for printing in the screen, Im bit stack on how to start on it.
grammar myDSL;
options {
language = Java;
}
#header {
package DSL;
}
#lexer::header {
package DSL;
}
program
: IDENT '={' components* '}'
;
components
: IDENT '=('(shape)(shape|connectors)* ')'
;
shape
: 'Box' '(' (INTEGER ','?)* ')'
| 'Cylinder' '(' (INTEGER ','?)* ')'
| 'Sphere' '(' (INTEGER ','?)* ')'
;
connectors
: type '(' (INTEGER ','?)* ')'
;
type
: 'MG'
| 'EL'
;
IDENT: ('a'..'z' | 'A'..'Z')('a'..'z' | 'A'..'Z' | '0'..'0')*;
INTEGER: '0'..'9'+;
// This if for the empty spaces between tokens and avoids them in the parser
WS: (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel=HIDDEN;};
COMMENT: '//' .* ('\n' | '\r') {$channel=HIDDEN;};
A couple of remarks:
There's no need to set the language for Java, which is the default target language. So you can remove this:
options {
language = Java;
}
Your IDENT contains an error:
IDENT: ('a'..'z' | 'A'..'Z')('a'..'z' | 'A'..'Z' | '0'..'0')*;
the '0'..'0') should most probably be '0'..'9').
The sub rule (INTEGER ','?)* also matches source like 1 2 3 4 (no comma's at all!). Perhaps you meant to do: (INTEGER (',' INTEGER)*)?
Now, as to your question: how to let ANTLR construct a proper AST? This can be done by adding output = AST; in your options block:
options {
//language = Java;
output = AST;
}
And then either adding the "tree operators" ^ and ! in your parser rules, or by using tree rewrite rules: rule: a b c -> ^(c b a).
The "tree operator" ^ is used to define the root of the (sub) tree and ! is used to exclude a token from the (sub) tree.
Rewrite rules have ^( /* tokens here */ ) where the first token (right after ^() is the root of the (sub) tree, and all following tokens are child nodes of the root.
An example might be in order. Let's take your first rule:
program
: IDENT '={' components* '}'
;
and you want to let IDENT be the root, components* the children and you want to exclude ={ and } from the tree. You can do that by doing:
program
: IDENT^ '={'! components* '}'!
;
or by doing:
program
: IDENT '={' components* '}' -> ^(IDENT components*)
;

Using antlr to parse a | separated file

So I think this should be easy, but I'm having a tough time with it. I'm trying to parse a | delimited file, and any line that doesn't start with a | is a comment. I guess I don't understand how comments work. It always errors out on a comment line. This is a legacy file, so there's no changing it. Here's my grammar.
grammar Route;
#header {
package org.benheath.codegeneration;
}
#lexer::header {
package org.benheath.codegeneration;
}
file: line+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;
COMMENT: ~'|' .* '\n' ;
ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'#'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace
Data:
! a comment
Another comment
| a | abc | b | def | ...
A grammar for that would look like this:
parse
: line* EOF
;
line
: ( comment | values ) ( NL | EOF )
;
comment
: ELEMENT+
;
values
: PIPE ( ELEMENT PIPE )+
;
PIPE
: '|'
;
ELEMENT
: ('a'..'z')+
;
NL
: '\r'? '\n' | '\r'
;
WS
: (' '|'\t') {$channel=HIDDEN;}
;
And to test it, you just need to sprinkle a bit of code in your grammar like this:
grammar Route;
#members {
List<List<String>> values = new ArrayList<List<String>>();
}
parse
: line* EOF
;
line
: ( comment | v=values {values.add($v.line);} ) ( NL | EOF )
;
comment
: ELEMENT+
;
values returns [List<String> line]
#init {line = new ArrayList<String>();}
: PIPE ( e=ELEMENT {line.add($e.text);} PIPE )*
;
PIPE
: '|'
;
ELEMENT
: ('a'..'z')+
;
NL
: '\r'? '\n' | '\r'
;
WS
: (' '|'\t') {$channel=HIDDEN;}
;
Now generate a lexer/parser by invoking:
java -cp antlr-3.2.jar org.antlr.Tool Route.g
create a class RouteTest.java:
import org.antlr.runtime.*;
import java.util.List;
public class RouteTest {
public static void main(String[] args) throws Exception {
String data =
"a comment\n"+
"| xxxxx | y | zzz |\n"+
"another comment\n"+
"| a | abc | b | def |";
ANTLRStringStream in = new ANTLRStringStream(data);
RouteLexer lexer = new RouteLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
RouteParser parser = new RouteParser(tokens);
parser.parse();
for(List<String> line : parser.values) {
System.out.println(line);
}
}
}
Compile all source files:
javac -cp antlr-3.2.jar *.java
and run the class RouteTest:
// Windows
java -cp .;antlr-3.2.jar RouteTest
// *nix/MacOS
java -cp .:antlr-3.2.jar RouteTest
If all goes well, you see this printed to your console:
[xxxxx, y, zzz]
[a, abc, b, def]
Edit: note that I simplified it a bit by only allowing lower case letters, you can always expand the set of course.
It's a nice idea to use ANTLR for a job like this, although I do think it's overkill. For example, it would be very easy to (in pseudo-code):
for each line in file:
if line begins with '|':
fields = /|\s*([a-z]+)\s*/g
Edit: Well, you can't express the distinction between comments and lines lexically, because there is nothing lexical that distinguishes them. A hint to get you in one workable direction.
line: comment | fields;
comment: NONBAR+ (BAR|NONBAR+) '\n';
fields = (BAR NONBAR)+;
This seems to work, I swear I tried it. Changing comment to lower case switched it to the parser vs the lexer, I still don't get it.
grammar Route;
#header {
package org.benheath.codegeneration;
}
#lexer::header {
package org.benheath.codegeneration;
}
file: (line|comment)+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;
comment : ~'|' .* '\n';
ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'#'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace