Using ANTLR to parse a | separated file

So I think this should be easy, but I'm having a tough time with it. I'm trying to parse a | delimited file, and any line that doesn't start with a | is a comment. I guess I don't understand how comments work. It always errors out on a comment line. This is a legacy file, so there's no changing it. Here's my grammar.
grammar Route;
@header {
package org.benheath.codegeneration;
}
@lexer::header {
package org.benheath.codegeneration;
}
file: line+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;
COMMENT: ~'|' .* '\n' ;
ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'@'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace
Data:
! a comment
Another comment
| a | abc | b | def | ...

A grammar for that would look like this:
parse
: line* EOF
;
line
: ( comment | values ) ( NL | EOF )
;
comment
: ELEMENT+
;
values
: PIPE ( ELEMENT PIPE )+
;
PIPE
: '|'
;
ELEMENT
: ('a'..'z')+
;
NL
: '\r'? '\n' | '\r'
;
WS
: (' '|'\t') {$channel=HIDDEN;}
;
And to test it, you just need to sprinkle a bit of code in your grammar like this:
grammar Route;
@members {
List<List<String>> values = new ArrayList<List<String>>();
}
parse
: line* EOF
;
line
: ( comment | v=values {values.add($v.line);} ) ( NL | EOF )
;
comment
: ELEMENT+
;
values returns [List<String> line]
@init {line = new ArrayList<String>();}
: PIPE ( e=ELEMENT {line.add($e.text);} PIPE )*
;
PIPE
: '|'
;
ELEMENT
: ('a'..'z')+
;
NL
: '\r'? '\n' | '\r'
;
WS
: (' '|'\t') {$channel=HIDDEN;}
;
Now generate a lexer/parser by invoking:
java -cp antlr-3.2.jar org.antlr.Tool Route.g
create a class RouteTest.java:
import org.antlr.runtime.*;
import java.util.List;
public class RouteTest {
public static void main(String[] args) throws Exception {
String data =
"a comment\n"+
"| xxxxx | y | zzz |\n"+
"another comment\n"+
"| a | abc | b | def |";
ANTLRStringStream in = new ANTLRStringStream(data);
RouteLexer lexer = new RouteLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
RouteParser parser = new RouteParser(tokens);
parser.parse();
for(List<String> line : parser.values) {
System.out.println(line);
}
}
}
Compile all source files:
javac -cp antlr-3.2.jar *.java
and run the class RouteTest:
// Windows
java -cp .;antlr-3.2.jar RouteTest
// *nix/MacOS
java -cp .:antlr-3.2.jar RouteTest
If all goes well, you see this printed to your console:
[xxxxx, y, zzz]
[a, abc, b, def]
Edit: note that I simplified it a bit by only allowing lowercase letters; you can always expand the set, of course.

It's a nice idea to use ANTLR for a job like this, although I do think it's overkill. For example, it would be very easy to (in pseudo-code):
for each line in file:
    if line begins with '|':
        fields = /\|\s*([a-z]+)\s*/g
Edit: Well, you can't express the distinction between comments and lines lexically, because there is nothing lexical that distinguishes them. Here's a hint to get you going in one workable direction:
line: comment | fields;
comment: NONBAR+ (BAR|NONBAR+) '\n';
fields: (BAR NONBAR)+;
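For comparison, the line-by-line idea from the pseudo-code above can be sketched in plain Java without ANTLR. This is only a rough sketch under a few assumptions: the class name PipeLineReader is made up for illustration, every line that does not start with | is treated as a comment, and empty fields are dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of ANTLR or the original post.
class PipeLineReader {

    // Returns one list of fields per data line; any line that does not
    // start with '|' is treated as a comment and skipped.
    static List<List<String>> parse(String text) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : text.split("\\r?\\n|\\r")) {
            if (!line.startsWith("|")) {
                continue; // comment (or empty) line
            }
            List<String> fields = new ArrayList<>();
            // Drop the leading '|' and split on the remaining ones.
            for (String field : line.substring(1).split("\\|")) {
                String trimmed = field.trim();
                if (!trimmed.isEmpty()) { // assumption: empty fields are ignored
                    fields.add(trimmed);
                }
            }
            rows.add(fields);
        }
        return rows;
    }

    public static void main(String[] args) {
        String data = "! a comment\nAnother comment\n| a | abc | b | def |";
        System.out.println(parse(data)); // prints [[a, abc, b, def]]
    }
}
```

Unlike the grammar, this accepts any characters inside a field, so it may be too lenient if the file needs validating.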

This seems to work; I swear I tried it before. Changing comment to lower case turned it from a lexer rule into a parser rule, but I still don't get why that makes the difference.
grammar Route;
@header {
package org.benheath.codegeneration;
}
@lexer::header {
package org.benheath.codegeneration;
}
file: (line|comment)+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;
comment : ~'|' .* '\n';
ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'@'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace

Related

Why does my antlr grammar seem to properly parse this input?

I've created a small grammar in ANTLR using Python (a grammar that can accept either a list of numbers or a list of IDs), and yet when I input a string such as December 12 1965, ANTLR will run on the file and show me no errors with the following code (all of the Python code that I'm using is embedded via the @main):
grammar ParserLang;
options {
language=Python;
}
@header {
import sys
import antlr3
from ParserLangLexer import ParserLangLexer
}
@main {
def main(argv, otherArg=None):
    char_stream = antlr3.ANTLRInputStream(open(sys.argv[1],'r'))
    lexer = ParserLangLexer(char_stream)
    tokens = CommonTokenStream(lexer)
    parser = ParserLangParser(tokens);
    rule = parser.entry_rule()
}
program : idList EOF
| integerList EOF
;
idList : ID whitespace idList
| ID
;
integerList : INTEGER whitespace integerList
| INTEGER
;
whitespace : (WHITESPACE | COMMENT) +;
ID : LETTER (DIGIT | LETTER)*;
INTEGER : (NONZERO_DIGIT DIGIT*) | ZERO ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
COMMENT : ('/*' .* '*/') | ('//' .* '\n') { $channel = HIDDEN; } ;
fragment ZERO : '0' ;
fragment DIGIT : '0' .. '9';
fragment NONZERO_DIGIT : '1' .. '9';
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
Am I doing something wrong?
EDIT: When I use ANTLRWorks with the same grammar and input, a NoViableAltException is thrown. How do I get that error via code?
I could not reproduce it. When I generate a lexer and parser from your input after fixing the error in the grammar (rule = parser.entry_rule() should be: rule = parser.program()), and parse the input "December 12 1965" (either as input from a file, or as a plain string), I get the following error:
line 1:0 no viable alternative at input u'December'
Which may seem strange, since that could be the start of an idList. The fact is, your grammar contains one more error and a small thing that could be improved:
WHITESPACE and COMMENT are placed on the HIDDEN channel, and are therefore not available in parser rules (at least, not without changing the channel from which the parser reads its tokens...);
a COMMENT at the end of the input, that is, without a \n at the end, will not be properly tokenized. Better define a single line comment like this: '//' ~('\r' | '\n')*. The trailing line break will be captured by the WHITESPACE rule after all.
Because the parser cannot match an idList (or an integerList, for that matter) because of the whitespace rule, an error is produced pointing at the very first token ('December').
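The difference between the two single-line comment definitions can be illustrated with plain Java regular expressions. This is only an analogy, not how the ANTLR lexer works internally: the class name and the two patterns are assumptions that roughly mirror '//' .* '\n' and '//' ~('\r' | '\n')*.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; java.util.regex stands in for the lexer rules.
class LineCommentDemo {

    // Roughly mirrors '//' .* '\n': a terminating line feed is required.
    static final Pattern NEEDS_NEWLINE = Pattern.compile("//.*\\n");

    // Roughly mirrors '//' ~('\r' | '\n')*: stops before any line break,
    // so a comment at the very end of the input still matches.
    static final Pattern NO_NEWLINE = Pattern.compile("//[^\\r\\n]*");

    // Length of the match at the start of the input, or -1 if there is none.
    static int matchLength(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }
}
```

With the input "// end" (no trailing line break), matchLength returns -1 for NEEDS_NEWLINE and 6 for NO_NEWLINE, which is the end-of-input case described above.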
Here's a grammar that works (as expected):
grammar ParserLang;
options {
language=Python;
}
@header {
import sys
import antlr3
from ParserLangLexer import ParserLangLexer
}
@main {
def main(argv, otherArg=None):
    lexer = ParserLangLexer(antlr3.ANTLRStringStream('December 12 1965'))
    parser = ParserLangParser(CommonTokenStream(lexer))
    parser.program()
}
program : idList EOF
| integerList EOF
;
idList : ID+
;
integerList : INTEGER+
;
ID : LETTER (DIGIT | LETTER)*;
INTEGER : (NONZERO_DIGIT DIGIT*) | ZERO ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
COMMENT : ('/*' .* '*/' | '//' ~('\r' | '\n')*) { $channel = HIDDEN; } ;
fragment ZERO : '0' ;
fragment DIGIT : '0' .. '9';
fragment NONZERO_DIGIT : '1' .. '9';
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
Running the parser generated from the grammar above will also produce an error:
line 1:9 missing EOF at u'12'
but that is expected: after an idList, the parser expects the EOF, but it encounters '12' instead.

Selectively Skip Newline Depending on Context

I must parse files made of two parts. In the first one, new lines must be skipped. In the second one, they are important and used as a delimiter.
I want to avoid solutions like http://www.antlr.org/wiki/pages/viewpage.action?pageId=1734 and use predicate instead.
For the moment, I have something like:
WS: ( ' ' | '\t' | NEWLINE) {SKIP();};
fragment NEWLINE : '\r'|'\n'|'\r\n';
I tried to add a dynamically scoped variable keepNewline that is set to true when "entering" second part of the file.
However, I am not able to create the correct predicate to switch off the "skipping" of newlines.
Any help would be greatly appreciated.
Best regards.
It's easier than you might think: you don't even need a predicate.
Let's say you want to preserve line breaks only inside <pre>...</pre> tags. The following dummy grammar does just that:
grammar Pre;
@lexer::members {
private boolean keepNewLine = false;
}
parse
: (t=.
{
System.out.printf("\%-10s '\%s'\n", tokenNames[$t.type], $t.text.replace("\n", "\\n"));
}
)*
EOF
;
Word
: ('a'..'z' | 'A'..'Z')+
;
OPr
: '<pre>' {keepNewLine = true;}
;
CPr
: '</pre>' {keepNewLine = false;}
;
NewLine
: ('\r'? '\n' | '\r') {if(!keepNewLine) skip();}
;
Space
: (' ' | '\t') {skip();}
;
which you can test with the class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
PreLexer lexer = new PreLexer(new ANTLRFileStream("in.txt"));
PreParser parser = new PreParser(new CommonTokenStream(lexer));
parser.parse();
}
}
And if in.txt contains:
foo bar
<pre>
a

b
</pre>
baz
the output of running the Main class would be:
Word 'foo'
Word 'bar'
OPr '<pre>'
NewLine '\n'
Word 'a'
NewLine '\n'
NewLine '\n'
Word 'b'
NewLine '\n'
CPr '</pre>'
Word 'baz'
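The same flag-based idea can be sketched as a hand-written scanner in plain Java, which may make it easier to see what the generated lexer does. This is a rough sketch, not the code ANTLR generates: the class name and the token labels are made up, and unlike the grammar it silently skips any character it does not recognize.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner that imitates the Pre grammar.
class PreSketch {

    static List<String> tokenize(String in) {
        List<String> out = new ArrayList<>();
        boolean keepNewLine = false; // same role as the lexer member
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (in.startsWith("<pre>", i)) {
                keepNewLine = true;
                out.add("OPr");
                i += 5;
            } else if (in.startsWith("</pre>", i)) {
                keepNewLine = false;
                out.add("CPr");
                i += 6;
            } else if (c == '\r' || c == '\n') {
                // Treat "\r\n" as a single line break.
                if (c == '\r' && i + 1 < in.length() && in.charAt(i + 1) == '\n') {
                    i++;
                }
                i++;
                if (keepNewLine) {
                    out.add("NewLine"); // only emitted inside <pre>...</pre>
                }
            } else if (isLetter(c)) {
                int start = i;
                while (i < in.length() && isLetter(in.charAt(i))) {
                    i++;
                }
                out.add("Word:" + in.substring(start, i));
            } else {
                i++; // spaces, tabs and anything else are skipped
            }
        }
        return out;
    }

    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }
}
```

Calling PreSketch.tokenize on the in.txt contents shown above yields the same token sequence as the ANTLR demo: the line breaks outside the pre block disappear and the ones inside are kept.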

Switching lexer state in antlr3 grammar

I'm trying to construct an ANTLR grammar to parse a templating language. That language can be embedded in any text, and the boundaries are marked with opening/closing tags: {{ / }}. So a valid template looks like this:
foo {{ someVariable }} bar
Where foo and bar should be ignored, and the part inside the {{ and }} tags should be parsed. I've found this question which basically has an answer for the problem, except that the tags are only one { and }. I've tried to modify the grammar to match 2 opening/closing characters, but as soon as I do this, the BUFFER rule consumes ALL characters, including the opening and closing brackets. The LD rule is never invoked.
Does anyone have an idea why the ANTLR lexer consumes all characters in the BUFFER rule when the delimiters have 2 characters, but does not consume the delimiters when they have only one character?
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
@lexer::members {
private boolean insideTag = false;
}
start
: (tag | BUFFER )*
;
tag
: LD IDENT^ RD
;
LD @after {
// flip the lexer state
insideTag=true;
System.err.println("FLIPPING TAG");
} : '{{';
RD @after {
// flip the state back
insideTag=false;
} : '}}';
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
IDENT : (LETTER)*;
BUFFER : { !insideTag }?=> ~(LD | RD)+;
fragment LETTER : ('a'..'z' | 'A'..'Z');
You can match any character once or more until you see {{ ahead by including a predicate inside the parenthesis ( ... )+ (see the BUFFER rule in the demo).
A demo:
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
@lexer::members {
private boolean insideTag = false;
}
start
: tag EOF
;
tag
: LD IDENT^ RD
;
LD
@after {insideTag=true;}
: '{{'
;
RD
@after {insideTag=false;}
: '}}'
;
BUFFER
: ({!insideTag && !(input.LA(1)=='{' && input.LA(2)=='{')}?=> .)+ {$channel=HIDDEN;}
;
SPACE
: (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
;
IDENT
: ('a'..'z' | 'A'..'Z')+
;
Note that it's best to keep the BUFFER rule as the first lexer rule in your grammar: that way, it will be the first token that is tried.
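To make the two-character lookahead concrete, here is a small hand-written scanner in plain Java that does the same check the predicate does with input.LA(1) and input.LA(2). It is only a sketch: the class and method names are made up, it does not use the ANTLR runtime, and it simply collects the identifiers found between {{ and }}.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; it does not use the ANTLR runtime.
class TagScanSketch {

    // Returns the identifiers found between "{{" and "}}".
    static List<String> identifiers(String src) {
        List<String> ids = new ArrayList<>();
        boolean insideTag = false; // same role as the lexer member
        int i = 0;
        while (i < src.length()) {
            if (!insideTag) {
                // Look two characters ahead: a single '{' is still buffer text.
                if (src.startsWith("{{", i)) {
                    insideTag = true;
                    i += 2;
                } else {
                    i++; // buffer character, ignored
                }
            } else if (src.startsWith("}}", i)) {
                insideTag = false;
                i += 2;
            } else if (isLetter(src.charAt(i))) {
                int start = i;
                while (i < src.length() && isLetter(src.charAt(i))) {
                    i++;
                }
                ids.add(src.substring(start, i));
            } else {
                i++; // whitespace (or anything else) inside a tag is skipped
            }
        }
        return ids;
    }

    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }
}
```

For the input "foo {{ someVariable }} bar" this returns a list containing only someVariable, and a lone { in the surrounding text does not start a tag.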
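If you now parse "foo {{ someVariable }} bar", an AST is created in which someVariable (the IDENT token) is the root of the tag subtree, with the {{ and }} tokens as its children.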
Wouldn't a grammar like this fit your needs? I don't see why the BUFFER needs to be that complicated.
grammar test;
options {
output=AST;
ASTLabelType=CommonTree;
}
@lexer::members {
private boolean inTag=false;
}
start
: tag* EOF
;
tag
: LD IDENT RD -> IDENT
;
LD
@after { inTag=true; }
: '{{'
;
RD
@after { inTag=false; }
: '}}'
;
IDENT : {inTag}?=> ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
BUFFER
: . {$channel=HIDDEN;}
;

Parsing a templating language

I'm trying to parse a templating language and I'm having trouble correctly parsing the arbitrary html that can appear between tags. So far what I have is below, any suggestions? An example of a valid input would be
{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}This Should Be Parsed as a Buffer.{/bar2}
And the grammar is:
grammar g;
options {
language=Java;
output=AST;
ASTLabelType=CommonTree;
}
/* LEXER RULES */
tokens {
}
LD : '{';
RD : '}';
LOOP : '#';
END_LOOP: '/';
PARTIAL : '>';
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
IDENT : (LETTER | '_') (LETTER | '_' | DIGIT)*;
BUFFER options {greedy=false;} : ~(LD | RD)+ ;
/* PARSER RULES */
start : body EOF
;
body : (tag | loop | partial | BUFFER)*
;
tag : LD! IDENT^ RD!
;
loop : LD! LOOP^ IDENT RD!
body
LD! END_LOOP! IDENT RD!
;
partial : LD! PARTIAL^ IDENT RD!
;
buffer : BUFFER
;
Your lexer tokenizes independently from your parser. If your parser tries to match a BUFFER token, the lexer does not take this info into account. In your case, with input like "blah blah blah", the lexer creates 3 IDENT tokens, not a single BUFFER token.
What you need to "tell" your lexer is that when you're inside a tag (i.e. you encountered an LD tag), an IDENT token should be created, and when you're outside a tag (i.e. you encountered an RD tag), a BUFFER token should be created instead of an IDENT token.
In order to implement this, you need to:
1. create a boolean flag inside the lexer that keeps track of whether you're inside or outside a tag. This can be done inside the @lexer::members { ... } section of your grammar;
2. after the lexer creates either an LD or an RD token, flip the boolean flag from (1). This can be done in the @after{ ... } section of the lexer rules;
3. before creating a BUFFER token inside the lexer, check if you're outside a tag at the moment. This can be done by using a semantic predicate at the start of your lexer rule.
A short demo:
grammar g;
options {
output=AST;
ASTLabelType=CommonTree;
}
@lexer::members {
private boolean insideTag = false;
}
start
: body EOF -> body
;
body
: (tag | loop | partial | BUFFER)*
;
tag
: LD IDENT RD -> IDENT
;
loop
: LD LOOP IDENT RD body LD END_LOOP IDENT RD -> ^(LOOP body IDENT IDENT)
;
partial
: LD PARTIAL IDENT RD -> ^(PARTIAL IDENT)
;
LD @after{insideTag=true;} : '{';
RD @after{insideTag=false;} : '}';
LOOP : '#';
END_LOOP : '/';
PARTIAL : '>';
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
IDENT : (LETTER | '_') (LETTER | '_' | DIGIT)*;
BUFFER : {!insideTag}?=> ~(LD | RD)+;
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
(Note that you probably want to discard spaces between tags, so I added a SPACE rule that puts them on the hidden channel.)
Test it with the following class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}" +
"This Should Be Parsed as a Buffer.{/bar2}";
gLexer lexer = new gLexer(new ANTLRStringStream(src));
gParser parser = new gParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.start().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
and after running the main class:
*nix/MacOS
java -cp antlr-3.3.jar org.antlr.Tool g.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
Windows
java -cp antlr-3.3.jar org.antlr.Tool g.g
javac -cp antlr-3.3.jar *.java
java -cp .;antlr-3.3.jar Main
You'll see some DOT source printed to the console, which you can render as an image of the AST (for example, with graphviz-dev.appspot.com).

Antlr Array Help

Hey, I've started to use ANTLR with Java and I wanted to know how I can store some values directly into a 2D array and return this array. I can't find any tutorials on this at all; all help is appreciated.
Let's say you want to parse a flat text file containing numbers separated by spaces. You'd like to parse this into a 2D array of ints where each line is a "row" in your array.
The ANTLR grammar for such a "language" could look like:
grammar Number;
parse
: line* EOF
;
line
: Number+ (LineBreak | EOF)
;
Number
: ('0'..'9')+
;
Space
: (' ' | '\t') {skip();}
;
LineBreak
: '\r'? '\n'
| '\r'
;
Now, you'd like to have the parse rule return a List of List<Integer> objects. Do that by adding returns [List<List<Integer>> numbers] after the name of your parse rule; the list can be initialized in an @init{ ... } block:
parse returns [List<List<Integer>> numbers]
@init {
$numbers = new ArrayList<List<Integer>>();
}
: line* EOF
;
Your line rule looks much the same, only it returns a one-dimensional list of numbers:
line returns [List<Integer> row]
@init {
$row = new ArrayList<Integer>();
}
: Number+ (LineBreak | EOF)
;
The next step is to fill the Lists with the actual values that are being parsed. This can be done by embedding the code {$row.add(Integer.parseInt($Number.text));} inside the Number+ loop in your line rule:
line returns [List<Integer> row]
@init {
$row = new ArrayList<Integer>();
}
: (Number {$row.add(Integer.parseInt($Number.text));})+ (LineBreak | EOF)
;
And lastly, you'll want to add the Lists returned by your line rule to the 2D numbers list in your parse rule:
parse returns [List<List<Integer>> numbers]
@init {
$numbers = new ArrayList<List<Integer>>();
}
: (line {$numbers.add($line.row);})* EOF
;
Below is the final grammar:
grammar Number;
parse returns [List<List<Integer>> numbers]
@init {
$numbers = new ArrayList<List<Integer>>();
}
: (line {$numbers.add($line.row);})* EOF
;
line returns [List<Integer> row]
@init {
$row = new ArrayList<Integer>();
}
: (Number {$row.add(Integer.parseInt($Number.text));})+ (LineBreak | EOF)
;
Number
: ('0'..'9')+
;
Space
: (' ' | '\t') {skip();}
;
LineBreak
: '\r'? '\n'
| '\r'
;
which can be tested with the following class:
import org.antlr.runtime.*;
import java.util.List;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"1 2 \n" +
"3 4 5 6 7 \n" +
" 8 \n" +
"9 10 11 ";
ANTLRStringStream in = new ANTLRStringStream(source);
NumberLexer lexer = new NumberLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
NumberParser parser = new NumberParser(tokens);
List<List<Integer>> numbers = parser.parse();
System.out.println(numbers);
}
}
Now generate a lexer and parser from the grammar:
java -cp antlr-3.2.jar org.antlr.Tool Number.g
compile all .java source files:
javac -cp antlr-3.2.jar *.java
and run the main class:
// On *nix
java -cp .:antlr-3.2.jar Main
// or Windows
java -cp .;antlr-3.2.jar Main
which produces the following output:
[[1, 2], [3, 4, 5, 6, 7], [8], [9, 10, 11]]
HTH
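Since the question asked for an actual 2D array, the List<List<Integer>> returned by parse() could be copied into an int[][] afterwards. A minimal sketch, assuming a made-up helper class JaggedArrays that is not part of the grammar or of ANTLR; the result is a jagged array because the rows can have different lengths.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, independent of the generated parser.
class JaggedArrays {

    // Copies each inner list into its own int[] row.
    static int[][] toArray(List<List<Integer>> rows) {
        int[][] result = new int[rows.size()][];
        for (int r = 0; r < rows.size(); r++) {
            List<Integer> row = rows.get(r);
            result[r] = new int[row.size()];
            for (int c = 0; c < row.size(); c++) {
                result[r][c] = row.get(c); // auto-unboxing Integer to int
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the value returned by parser.parse() above.
        List<List<Integer>> numbers = Arrays.asList(
                Arrays.asList(1, 2),
                Arrays.asList(3, 4, 5, 6, 7),
                Arrays.asList(8),
                Arrays.asList(9, 10, 11));
        System.out.println(Arrays.deepToString(toArray(numbers)));
        // prints [[1, 2], [3, 4, 5, 6, 7], [8], [9, 10, 11]]
    }
}
```

In the test class above, this would be called as JaggedArrays.toArray(parser.parse()).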
Here are some excerpts from a grammar I made that parses people's names and returns a Name object. It should be enough to show you how it works. Other objects such as arrays are done the same way.
In the grammar:
grammar PersonNames;
fullname returns [Name name]
@init {
name = new Name();
}
: (directory_style[name] | standard[name] | title_without_fname[name] | family_style[name] | proper_initials[name]) EOF;
standard[Name name]
: (title[name] ' ')* fname[name] ' ' (mname[name] ' ')* (nickname[name] ' ')? lname[name] (sep honorifics[name])*;
fname[Name name] : (f=NAME | f=INITIAL) { name.set(Name.Part.FIRST, toNameCase($f.text)); };
And in your regular Java code:
public static Name parseName(String str) throws RecognitionException {
System.err.println("parsing `" + str + "`");
CharStream stream = new ANTLRStringStream(str);
PersonNamesLexer lexer = new PersonNamesLexer(stream);
CommonTokenStream tokens = new CommonTokenStream(lexer);
PersonNamesParser parser = new PersonNamesParser(tokens);
return parser.fullname();
}