Antlr Array Help - antlr

Hey, I've started to use ANTLR with Java and I wanted to know how I can store some values directly into a 2D array and return this array. I can't find any tutorials on this at all; all help is appreciated.

Let's say you want to parse a flat text file containing numbers separated by spaces. You'd like to parse this into a 2D array of ints where each line is a "row" in your array.
The ANTLR grammar for such a "language" could look like:
grammar Number;
parse
: line* EOF
;
line
: Number+ (LineBreak | EOF)
;
Number
: ('0'..'9')+
;
Space
: (' ' | '\t') {skip();}
;
LineBreak
: '\r'? '\n'
| '\r'
;
Now, you'd like to have the parse rule return a List of List<Integer> objects. Do that by adding a returns [List<List<Integer>> numbers] after your parse rule, which can be initialized in an @init { ... } block:
parse returns [List<List<Integer>> numbers]
@init {
$numbers = new ArrayList<List<Integer>>();
}
: line* EOF
;
Your line rule looks much the same, only it returns a one-dimensional list of numbers:
line returns [List<Integer> row]
@init {
$row = new ArrayList<Integer>();
}
: Number+ (LineBreak | EOF)
;
The next step is to fill the Lists with the actual values that are being parsed. This can be done by embedding the code {$row.add(Integer.parseInt($Number.text));} inside the Number+ loop in your line rule:
line returns [List<Integer> row]
@init {
$row = new ArrayList<Integer>();
}
: (Number {$row.add(Integer.parseInt($Number.text));})+ (LineBreak | EOF)
;
And lastly, you'll want the Lists returned by your line rule to actually be added to the 2D numbers list in your parse rule:
parse returns [List<List<Integer>> numbers]
@init {
$numbers = new ArrayList<List<Integer>>();
}
: (line {$numbers.add($line.row);})* EOF
;
Below is the final grammar:
grammar Number;
parse returns [List<List<Integer>> numbers]
@init {
$numbers = new ArrayList<List<Integer>>();
}
: (line {$numbers.add($line.row);})* EOF
;
line returns [List<Integer> row]
@init {
$row = new ArrayList<Integer>();
}
: (Number {$row.add(Integer.parseInt($Number.text));})+ (LineBreak | EOF)
;
Number
: ('0'..'9')+
;
Space
: (' ' | '\t') {skip();}
;
LineBreak
: '\r'? '\n'
| '\r'
;
which can be tested with the following class:
import org.antlr.runtime.*;
import java.util.List;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"1 2 \n" +
"3 4 5 6 7 \n" +
" 8 \n" +
"9 10 11 ";
ANTLRStringStream in = new ANTLRStringStream(source);
NumberLexer lexer = new NumberLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
NumberParser parser = new NumberParser(tokens);
List<List<Integer>> numbers = parser.parse();
System.out.println(numbers);
}
}
Now generate a lexer and parser from the grammar:
java -cp antlr-3.2.jar org.antlr.Tool Number.g
compile all .java source files:
javac -cp antlr-3.2.jar *.java
and run the main class:
// On *nix
java -cp .:antlr-3.2.jar Main
// or Windows
java -cp .;antlr-3.2.jar Main
which produces the following output:
[[1, 2], [3, 4, 5, 6, 7], [8], [9, 10, 11]]
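Since the question asked for a 2D array rather than nested lists, the List<List<Integer>> returned by parse() could be copied into an int[][] afterwards. A minimal sketch, assuming a helper of your own (the names ListTo2dArray and toArray are made up for illustration and are not part of ANTLR); the lists built in main only stand in for the parser's return value:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of ANTLR): copies nested lists into a
// jagged int[][], one inner array per parsed input line.
class ListTo2dArray {

    static int[][] toArray(List<List<Integer>> rows) {
        int[][] result = new int[rows.size()][];
        for (int r = 0; r < rows.size(); r++) {
            List<Integer> row = rows.get(r);
            result[r] = new int[row.size()];
            for (int c = 0; c < row.size(); c++) {
                result[r][c] = row.get(c); // auto-unboxing Integer -> int
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the value returned by parser.parse()
        List<List<Integer>> numbers = new ArrayList<>();
        numbers.add(Arrays.asList(1, 2));
        numbers.add(Arrays.asList(3, 4, 5, 6, 7));
        numbers.add(Arrays.asList(8));
        numbers.add(Arrays.asList(9, 10, 11));

        int[][] grid = toArray(numbers);
        System.out.println(Arrays.deepToString(grid));
    }
}
```

Because rows can have different lengths, the result is a jagged array; if you need a rectangular one you would have to decide how to pad short rows.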
HTH

Here's some excerpts from a grammar I made that parses people's names and returns a Name object. Should be enough to show you how it works. Other objects such as arrays are done the same way.
In the grammar:
grammar PersonNames;
fullname returns [Name name]
@init {
name = new Name();
}
: (directory_style[name] | standard[name] | title_without_fname[name] | family_style[name] | proper_initials[name]) EOF;
standard[Name name]
: (title[name] ' ')* fname[name] ' ' (mname[name] ' ')* (nickname[name] ' ')? lname[name] (sep honorifics[name])*;
fname[Name name] : (f=NAME | f=INITIAL) { name.set(Name.Part.FIRST, toNameCase($f.text)); };
In your regular Java code:
public static Name parseName(String str) throws RecognitionException {
System.err.println("parsing `" + str + "`");
CharStream stream = new ANTLRStringStream(str);
PersonNamesLexer lexer = new PersonNamesLexer(stream);
CommonTokenStream tokens = new CommonTokenStream(lexer);
PersonNamesParser parser = new PersonNamesParser(tokens);
return parser.fullname();
}

Related

Selectively Skip Newline Depending on Context

I must parse files made of two parts. In the first one, new lines must be skipped. In the second one, they are important and used as a delimiter.
I want to avoid solutions like http://www.antlr.org/wiki/pages/viewpage.action?pageId=1734 and use predicate instead.
For the moment, I have something like:
WS: ( ' ' | '\t' | NEWLINE) {SKIP();};
fragment NEWLINE : '\r'|'\n'|'\r\n';
I tried to add a dynamically scoped variable keepNewline that is set to true when "entering" second part of the file.
However, I am not able to create the correct predicate to switch off the "skipping" of newlines.
Any help would be greatly appreciated.
Best regards.
It's easier than you might think: you don't even need a predicate.
Let's say you want to preserve line breaks only inside <pre>...</pre> tags. The following dummy grammar does just that:
grammar Pre;
@lexer::members {
private boolean keepNewLine = false;
}
parse
: (t=.
{
System.out.printf("\%-10s '\%s'\n", tokenNames[$t.type], $t.text.replace("\n", "\\n"));
}
)*
EOF
;
Word
: ('a'..'z' | 'A'..'Z')+
;
OPr
: '<pre>' {keepNewLine = true;}
;
CPr
: '</pre>' {keepNewLine = false;}
;
NewLine
: ('\r'? '\n' | '\r') {if(!keepNewLine) skip();}
;
Space
: (' ' | '\t') {skip();}
;
which you can test with the class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
PreLexer lexer = new PreLexer(new ANTLRFileStream("in.txt"));
PreParser parser = new PreParser(new CommonTokenStream(lexer));
parser.parse();
}
}
And if in.txt contains:
foo bar
<pre>
a

b
</pre>
baz
the output of running the Main class would be:
Word       'foo'
Word       'bar'
OPr        '<pre>'
NewLine    '\n'
Word       'a'
NewLine    '\n'
NewLine    '\n'
Word       'b'
NewLine    '\n'
CPr        '</pre>'
Word       'baz'

ANTLR simple grammar and Identifier

I wrote this simple grammar for ANTLR
grammar ALang;
@members {
public static void main(String[] args) throws Exception {
ALangLexer lex = new ALangLexer(new ANTLRFileStream("antlr/ALang.al"));
CommonTokenStream tokens = new CommonTokenStream(lex);
ALangParser parser = new ALangParser(tokens);
parser.prog();
}
}
prog :
ID | PRINT
;
PRINT : 'print';
ID : ( 'a'..'z' | 'A'..'Z' )+;
WS : (' ' | '\t' | '\n' | '\r')+ { skip(); };
Using as input:
print
the only token found is a token of type ID. Isn't it enough to put the PRINT token definition right before the ID definition?
Yes, that is enough. If you define PRINT after ID, ANTLR will produce an error:
ALang.g:21:1: The following token definitions can never be matched because prior tokens match the same input: PRINT
I'm so sorry, I didn't want to use this production: PRINT : 'print '; but the production without the trailing space: PRINT : 'print'; The problem is that 'print' is derived from ID and not from PRINT.
No, that can't be the case.
The following:
grammar ALang;
@members {
public static void main(String[] args) throws Exception {
ALangLexer lex = new ALangLexer(new ANTLRStringStream("sprint print prints foo"));
CommonTokenStream tokens = new CommonTokenStream(lex);
ALangParser parser = new ALangParser(tokens);
parser.prog();
}
}
prog
: ( ID {System.out.printf("ID :: '\%s'\n", $ID.text);}
| PRINT {System.out.printf("PRINT :: '\%s'\n", $PRINT.text);}
)*
EOF
;
PRINT : 'print';
ID : ('a'..'z' | 'A'..'Z')+;
WS : (' ' | '\t' | '\n' | '\r')+ {skip();};
will print:
ID :: 'sprint'
PRINT :: 'print'
ID :: 'prints'
ID :: 'foo'
As you see, the PRINT rule does match "print".
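Put informally, the input has to be exactly print to become a PRINT token; any other run of letters, including longer ones such as prints or sprint, is an ID. A plain-Java imitation of just that outcome for whitespace-separated words (KeywordOrId and classify are illustrative names, and this is not how the generated lexer works internally):

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: mimics the token types the demo grammar reports,
// it does not reproduce ANTLR's lexer algorithm.
class KeywordOrId {

    static List<String> classify(String input) {
        List<String> out = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // empty input yields no tokens
            }
            if (!word.matches("[a-zA-Z]+")) {
                throw new IllegalArgumentException("not a letter sequence: " + word);
            }
            // Only the whole word "print" is the keyword; everything else is an ID.
            String type = word.equals("print") ? "PRINT" : "ID";
            out.add(type + " :: '" + word + "'");
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : classify("sprint print prints foo")) {
            System.out.println(line);
        }
    }
}
```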

Parsing a templating language

I'm trying to parse a templating language and I'm having trouble correctly parsing the arbitrary html that can appear between tags. So far what I have is below, any suggestions? An example of a valid input would be
{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}This Should Be Parsed as a Buffer.{/bar2}
And the grammar is:
grammar g;
options {
language=Java;
output=AST;
ASTLabelType=CommonTree;
}
/* LEXER RULES */
tokens {
}
LD : '{';
RD : '}';
LOOP : '#';
END_LOOP: '/';
PARTIAL : '>';
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
IDENT : (LETTER | '_') (LETTER | '_' | DIGIT)*;
BUFFER options {greedy=false;} : ~(LD | RD)+ ;
/* PARSER RULES */
start : body EOF
;
body : (tag | loop | partial | BUFFER)*
;
tag : LD! IDENT^ RD!
;
loop : LD! LOOP^ IDENT RD!
body
LD! END_LOOP! IDENT RD!
;
partial : LD! PARTIAL^ IDENT RD!
;
buffer : BUFFER
;
Your lexer tokenizes independently from your parser. If your parser tries to match a BUFFER token, the lexer does not take this info into account. In your case with input like: "blah blah blah", the lexer creates 3 IDENT tokens, not a single BUFFER token.
What you need to "tell" your lexer is that when you're inside a tag (i.e. you encountered an LD token), an IDENT token should be created, and when you're outside a tag (i.e. you encountered an RD token), a BUFFER token should be created instead of an IDENT token.
In order to implement this, you need to:
1. create a boolean flag inside the lexer that keeps track of whether you're inside or outside a tag. This can be done in the @lexer::members { ... } section of your grammar;
2. after the lexer creates either an LD or an RD token, flip the boolean flag from (1). This can be done in the @after{ ... } section of the lexer rules;
3. before creating a BUFFER token inside the lexer, check whether you're outside a tag at the moment. This can be done by using a semantic predicate at the start of your lexer rule.
A short demo:
grammar g;
options {
output=AST;
ASTLabelType=CommonTree;
}
@lexer::members {
private boolean insideTag = false;
}
start
: body EOF -> body
;
body
: (tag | loop | partial | BUFFER)*
;
tag
: LD IDENT RD -> IDENT
;
loop
: LD LOOP IDENT RD body LD END_LOOP IDENT RD -> ^(LOOP body IDENT IDENT)
;
partial
: LD PARTIAL IDENT RD -> ^(PARTIAL IDENT)
;
LD @after{insideTag=true;} : '{';
RD @after{insideTag=false;} : '}';
LOOP : '#';
END_LOOP : '/';
PARTIAL : '>';
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
IDENT : (LETTER | '_') (LETTER | '_' | DIGIT)*;
BUFFER : {!insideTag}?=> ~(LD | RD)+;
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
(note that you probably want to discard spaces between tags, so I added a SPACE rule and discarded these spaces)
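To see what the insideTag flag buys you, the same idea can be imitated by hand outside ANTLR. The sketch below is an assumption-laden illustration, not the code ANTLR generates: TemplateLexerSketch and its token spelling are invented here, and unlike the grammar it keeps whitespace outside tags as part of the BUFFER text.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written sketch: shows how one boolean flag lets the same characters
// become IDENT inside a tag and BUFFER outside of it.
class TemplateLexerSketch {

    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        boolean insideTag = false; // plays the role of the flag in @lexer::members
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '{') {
                tokens.add("LD");
                insideTag = true;  // comparable to the @after action on LD
                i++;
            } else if (c == '}') {
                tokens.add("RD");
                insideTag = false; // comparable to the @after action on RD
                i++;
            } else if (!insideTag) {
                // comparable to the {!insideTag}?=> predicate: take everything up to a brace
                int start = i;
                while (i < src.length() && src.charAt(i) != '{' && src.charAt(i) != '}') {
                    i++;
                }
                tokens.add("BUFFER(" + src.substring(start, i) + ")");
            } else if (c == '#') {
                tokens.add("LOOP");
                i++;
            } else if (c == '/') {
                tokens.add("END_LOOP");
                i++;
            } else if (c == '>') {
                tokens.add("PARTIAL");
                i++;
            } else if (isIdentStart(c)) {
                int start = i;
                while (i < src.length() && isIdentPart(src.charAt(i))) {
                    i++;
                }
                tokens.add("IDENT(" + src.substring(start, i) + ")");
            } else if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
                i++; // whitespace inside a tag is dropped
            } else {
                throw new IllegalArgumentException("unexpected '" + c + "' at index " + i);
            }
        }
        return tokens;
    }

    static boolean isIdentStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    }

    static boolean isIdentPart(char c) {
        return isIdentStart(c) || (c >= '0' && c <= '9');
    }

    public static void main(String[] args) {
        for (String t : tokenize("{foo}{#bar}blah blah blah{zed}{/bar}")) {
            System.out.println(t);
        }
    }
}
```

Without the flag, the text blah blah blah would be split into identifiers, which is exactly the problem described in the question.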
Test it with the following class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}" +
"This Should Be Parsed as a Buffer.{/bar2}";
gLexer lexer = new gLexer(new ANTLRStringStream(src));
gParser parser = new gParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.start().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
and after running the main class:
*nix/MacOS
java -cp antlr-3.3.jar org.antlr.Tool g.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
Windows
java -cp antlr-3.3.jar org.antlr.Tool g.g
javac -cp antlr-3.3.jar *.java
java -cp .;antlr-3.3.jar Main
You'll see some DOT-source being printed to the console, which corresponds to the following AST:
[image of the AST, created using graphviz-dev.appspot.com]

An antlr problem with embedded comments

I am trying to implement a nested comment in D.
nestingBlockComment
: '/+' (options {greedy=false;} :nestingBlockCommentCharacters)* '+/' {$channel=HIDDEN;}; // line 58
nestingBlockCommentCharacters
: (nestingBlockComment| '/'~'+' | ~'/' ) ; //line 61
For me, it would be logical that this should work...
This is the error message I get:
[21:06:34] warning(200): d.g:58:64: Decision can match input such as "'+/'" using multiple alternatives: 1, 2
As a result, alternative(s) 1 were disabled for that input
[21:06:34] warning(200): d.g:61:7: Decision can match input such as "'/+'" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input
Could someone explain those error messages to me, and the fix?
Thanks.
AFAIK, the error is because nestingBlockCommentCharacters can match +/ (the ~'/' twice).
Personally, I'd keep the nestingBlockComment as a lexer rule instead of a parser rule. You can do that by adding a little helper method in the lexer class:
public boolean openOrCloseCommentAhead() {
// return true iff '/+' or '+/' is ahead in the character stream
}
and then in a lexer comment rule, use a gated semantic predicate with that helper method as the boolean expression inside the predicate:
// match nested comments
Comment
: '/+' (Comment | {!openOrCloseCommentAhead()}?=> Any)* '+/'
;
// match any character
Any
: .
;
A little demo-grammar:
grammar DComments;
@lexer::members {
public boolean openOrCloseCommentAhead() {
return (input.LA(1) == '+' && input.LA(2) == '/') ||
(input.LA(1) == '/' && input.LA(2) == '+');
}
}
parse
: token+ EOF
;
token
: Comment {System.out.println("comment :: "+$Comment.text);}
| Any
;
Comment
: '/+' (Comment | {!openOrCloseCommentAhead()}?=> Any)* '+/'
;
Any
: .
;
and a main class to test it:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRStringStream(
"foo /+ comment /+ and +/ comment +/ bar /+ comment +/ baz");
DCommentsLexer lexer = new DCommentsLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
DCommentsParser parser = new DCommentsParser(tokens);
parser.parse();
}
}
Then the following commands:
java -cp antlr-3.2.jar org.antlr.Tool DComments.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar Main
(for Windows, the last command is: java -cp .;antlr-3.2.jar Main)
produce the following output:
comment :: /+ comment /+ and +/ comment +/
comment :: /+ comment +/
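For comparison, the same nesting can be recognised by hand with a depth counter instead of a recursive lexer rule. A sketch in plain Java, where NestedComments and extract are illustrative names; silently dropping an unterminated comment is an assumption made here, not something the answer above specifies:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: collects top-level /+ ... +/ comments, allowing nesting,
// by counting how many comments are currently open.
class NestedComments {

    static List<String> extract(String src) {
        List<String> comments = new ArrayList<>();
        int depth = 0;   // number of currently open "/+"
        int start = -1;  // index where the outermost comment began
        int i = 0;
        while (i < src.length()) {
            if (src.startsWith("/+", i)) {
                if (depth == 0) {
                    start = i;
                }
                depth++;
                i += 2;
            } else if (depth > 0 && src.startsWith("+/", i)) {
                depth--;
                i += 2;
                if (depth == 0) {
                    comments.add(src.substring(start, i));
                }
            } else {
                i++; // any other character
            }
        }
        return comments; // an unterminated comment is dropped (assumption)
    }

    public static void main(String[] args) {
        String in = "foo /+ comment /+ and +/ comment +/ bar /+ comment +/ baz";
        for (String c : extract(in)) {
            System.out.println("comment :: " + c);
        }
    }
}
```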

Using antlr to parse a | separated file

So I think this should be easy, but I'm having a tough time with it. I'm trying to parse a | delimited file, and any line that doesn't start with a | is a comment. I guess I don't understand how comments work. It always errors out on a comment line. This is a legacy file, so there's no changing it. Here's my grammar.
grammar Route;
@header {
package org.benheath.codegeneration;
}
@lexer::header {
package org.benheath.codegeneration;
}
file: line+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;
COMMENT: ~'|' .* '\n' ;
ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'@'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace
Data:
! a comment
Another comment
| a | abc | b | def | ...
A grammar for that would look like this:
parse
: line* EOF
;
line
: ( comment | values ) ( NL | EOF )
;
comment
: ELEMENT+
;
values
: PIPE ( ELEMENT PIPE )+
;
PIPE
: '|'
;
ELEMENT
: ('a'..'z')+
;
NL
: '\r'? '\n' | '\r'
;
WS
: (' '|'\t') {$channel=HIDDEN;}
;
And to test it, you just need to sprinkle a bit of code in your grammar like this:
grammar Route;
@members {
List<List<String>> values = new ArrayList<List<String>>();
}
parse
: line* EOF
;
line
: ( comment | v=values {values.add($v.line);} ) ( NL | EOF )
;
comment
: ELEMENT+
;
values returns [List<String> line]
@init {line = new ArrayList<String>();}
: PIPE ( e=ELEMENT {line.add($e.text);} PIPE )*
;
PIPE
: '|'
;
ELEMENT
: ('a'..'z')+
;
NL
: '\r'? '\n' | '\r'
;
WS
: (' '|'\t') {$channel=HIDDEN;}
;
Now generate a lexer/parser by invoking:
java -cp antlr-3.2.jar org.antlr.Tool Route.g
create a class RouteTest.java:
import org.antlr.runtime.*;
import java.util.List;
public class RouteTest {
public static void main(String[] args) throws Exception {
String data =
"a comment\n"+
"| xxxxx | y | zzz |\n"+
"another comment\n"+
"| a | abc | b | def |";
ANTLRStringStream in = new ANTLRStringStream(data);
RouteLexer lexer = new RouteLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
RouteParser parser = new RouteParser(tokens);
parser.parse();
for(List<String> line : parser.values) {
System.out.println(line);
}
}
}
Compile all source files:
javac -cp antlr-3.2.jar *.java
and run the class RouteTest:
// Windows
java -cp .;antlr-3.2.jar RouteTest
// *nix/MacOS
java -cp .:antlr-3.2.jar RouteTest
If all goes well, you see this printed to your console:
[xxxxx, y, zzz]
[a, abc, b, def]
Edit: note that I simplified it a bit by only allowing lower case letters, you can always expand the set of course.
It's a nice idea to use ANTLR for a job like this, although I do think it's overkill. For example, it would be very easy to (in pseudo-code):
for each line in file:
if line begins with '|':
fields = /|\s*([a-z]+)\s*/g
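A runnable Java take on that pseudo-code might look like the following. PipeLines and parse are names made up here, and like the simplified grammar above the sketch only accepts lower-case letters in a field:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: line-by-line parsing of a '|' delimited file with a regex.
class PipeLines {

    // a '|' followed by optional whitespace, a lower-case field, optional whitespace
    private static final Pattern FIELD = Pattern.compile("\\|\\s*([a-z]+)\\s*");

    static List<List<String>> parse(String data) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : data.split("\\r?\\n|\\r")) {
            if (!line.startsWith("|")) {
                continue; // anything not starting with '|' is a comment
            }
            List<String> fields = new ArrayList<>();
            Matcher m = FIELD.matcher(line);
            while (m.find()) {
                fields.add(m.group(1));
            }
            rows.add(fields);
        }
        return rows;
    }

    public static void main(String[] args) {
        String data =
            "a comment\n" +
            "| xxxxx | y | zzz |\n" +
            "another comment\n" +
            "| a | abc | b | def |";
        for (List<String> row : parse(data)) {
            System.out.println(row);
        }
    }
}
```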
Edit: Well, you can't express the distinction between comments and lines lexically, because there is nothing lexical that distinguishes them. A hint to get you in one workable direction.
line: comment | fields;
comment: NONBAR+ (BAR|NONBAR+) '\n';
fields: (BAR NONBAR)+;
This seems to work, I swear I tried it. Changing comment to lower case switched it to the parser vs the lexer, I still don't get it.
grammar Route;
@header {
package org.benheath.codegeneration;
}
@lexer::header {
package org.benheath.codegeneration;
}
file: (line|comment)+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;
comment : ~'|' .* '\n';
ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'@'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace