An antlr problem with embedded comments - antlr

I am trying to implement a nested comment in D.
nestingBlockComment
: '/+' (options {greedy=false;} :nestingBlockCommentCharacters)* '+/' {$channel=HIDDEN;}; // line 58
nestingBlockCommentCharacters
: (nestingBlockComment| '/'~'+' | ~'/' ) ; //line 61
For me, it would be logical that this should work...
This is the error message I get:
[21:06:34] warning(200): d.g:58:64: Decision can match input such as "'+/'" using multiple alternatives: 1, 2
As a result, alternative(s) 1 were disabled for that input
[21:06:34] warning(200): d.g:61:7: Decision can match input such as "'/+'" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input
Could someone explan those error messages to me and the fix?
Thanks.

AFAIK, the error is because nestingBlockCommentCharacters can match +/ (the ~'/' twice).
Personally, I'd keep the nestingBlockComment as a lexer rule instead of a parser rule. You can do that by adding a little helper method in the lexer class:
public boolean openOrCloseCommentAhead() {
// return true iff '/+' or '+/' is ahead in the character stream
}
and then in a lexer comment-rule, use a gated semantic predicates with that helper method as the boolean expression inside the predicate:
// match nested comments
Comment
: '/+' (Comment | {!openOrCloseCommentAhead()}?=> Any)* '+/'
;
// match any character
Any
: .
;
A little demo-grammar:
grammar DComments;
#lexer::members {
public boolean openOrCloseCommentAhead() {
return (input.LA(1) == '+' && input.LA(2) == '/') ||
(input.LA(1) == '/' && input.LA(2) == '+');
}
}
parse
: token+ EOF
;
token
: Comment {System.out.println("comment :: "+$Comment.text);}
| Any
;
Comment
: '/+' (Comment | {!openOrCloseCommentAhead()}?=> Any)* '+/'
;
Any
: .
;
and a main class to test it:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRStringStream(
"foo /+ comment /+ and +/ comment +/ bar /+ comment +/ baz");
DCommentsLexer lexer = new DCommentsLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
DCommentsParser parser = new DCommentsParser(tokens);
parser.parse();
}
}
Then the following commands:
java -cp antlr-3.2.jar org.antlr.Tool DComments.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar Main
(for Windows, the last command is: java -cp .;antlr-3.2.jar Main)
produce the following output:
comment :: /+ comment /+ and +/ comment +/
comment :: /+ comment +/

Related

Antlr superfluous Predicate required?

I have a file where I want to ignore parts of it. In the Lexer I use gated semantic predicates to avoid creating tokens for the uninteresting part of the file. My rules are similar to the following.
A
: {!ignore}?=> 'A'
;
START_IGNORE
: 'foo' {ignore = true; skip();}
;
END_IGNORE
: 'oof' {ignore = false; skip();}
;
IGNORE
: {ignore}?=> . {skip();}
;
However unless I change START and END to also use semantic predicates (as below) it does not work..
A
: {!ignore}?=> 'A'
;
START_IGNORE
: {true}?=> 'foo' {ignore = true; skip();}
;
END_IGNORE
: {true}?=> 'oof' {ignore = false; skip();}
;
IGNORE
: {ignore}?=> . {skip();}
;
Why do I have to add the predicates?
EDIT: I am using antlr-3.4
Why do I have to add the predicates?
You don't. At least, not using ANTLR v3.3. I don't know how exactly you're testing, but don't use ANTLRWorks' interpreter or the Eclipse ANTLR IDE plugin. Always do a little test from the command line.
grammar T;
#parser::members {
public static void main(String[] args) throws Exception {
TLexer lexer = new TLexer(new ANTLRStringStream("A foo A B C oof A"));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.parse();
}
}
#lexer::members {
private boolean ignore = false;
}
parse
: (t=.
{System.out.printf("[\%02d] type=\%s text='\%s'\n", $t.getCharPositionInLine(), tokenNames[$t.type], $t.text);}
)* EOF
;
A
: {!ignore}?=> 'A'
;
START_IGNORE
: 'foo' {ignore = true; skip();}
;
END_IGNORE
: 'oof' {ignore = false; skip();}
;
IGNORE
: {ignore}?=> . {skip();}
;
SPACE
: ' ' {skip();}
;
Run it like this:
java -cp antlr-3.3.jar org.antlr.Tool T.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar TParser
which will print the following:
[00] type=A text='A'
[16] type=A text='A'
I.e.: from the input "A foo A B C oof A" the following: "foo A B C oof" is skipped.

Parsing a templating language

I'm trying to parse a templating language and I'm having trouble correctly parsing the arbitrary html that can appear between tags. So far what I have is below, any suggestions? An example of a valid input would be
{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}This Should Be Parsed as a Buffer.{/bar2}
And the grammar is:
grammar g;
options {
language=Java;
output=AST;
ASTLabelType=CommonTree;
}
/* LEXER RULES */
tokens {
}
LD : '{';
RD : '}';
LOOP : '#';
END_LOOP: '/';
PARTIAL : '>';
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
IDENT : (LETTER | '_') (LETTER | '_' | DIGIT)*;
BUFFER options {greedy=false;} : ~(LD | RD)+ ;
/* PARSER RULES */
start : body EOF
;
body : (tag | loop | partial | BUFFER)*
;
tag : LD! IDENT^ RD!
;
loop : LD! LOOP^ IDENT RD!
body
LD! END_LOOP! IDENT RD!
;
partial : LD! PARTIAL^ IDENT RD!
;
buffer : BUFFER
;
Your lexer tokenizes independently from your parser. If your parser tries to match a BUFFER token, the lexer does not take this info into account. In your case with input like: "blah blah blah", the lexer creates 3 IDENT tokens, not a single BUFFER token.
What you need to "tell" your lexer is that when you're inside a tag (i.e. you encountered a LD tag), a IDENT token should be created, and when you're outside a tag (i.e. you encountered a RD tag), a BUFFER token should be created instead of an IDENT token.
In order to implement this, you need to:
create a boolean flag inside the lexer that keeps track of the fact that you're in- or outside a tag. This can be done inside the #lexer::members { ... } section of your grammar;
after the lexer either creates a LD- or RD-token, flip the boolean flag from (1). This can be done in the #after{ ... } section of the lexer rules;
before creating a BUFFER token inside the lexer, check if you're outside a tag at the moment. This can be done by using a semantic predicate at the start of your lexer rule.
A short demo:
grammar g;
options {
output=AST;
ASTLabelType=CommonTree;
}
#lexer::members {
private boolean insideTag = false;
}
start
: body EOF -> body
;
body
: (tag | loop | partial | BUFFER)*
;
tag
: LD IDENT RD -> IDENT
;
loop
: LD LOOP IDENT RD body LD END_LOOP IDENT RD -> ^(LOOP body IDENT IDENT)
;
partial
: LD PARTIAL IDENT RD -> ^(PARTIAL IDENT)
;
LD #after{insideTag=true;} : '{';
RD #after{insideTag=false;} : '}';
LOOP : '#';
END_LOOP : '/';
PARTIAL : '>';
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
IDENT : (LETTER | '_') (LETTER | '_' | DIGIT)*;
BUFFER : {!insideTag}?=> ~(LD | RD)+;
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
(note that you probably want to discard spaces between tag, so I added a SPACE rule and discarded these spaces)
Test it with the following class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}" +
"This Should Be Parsed as a Buffer.{/bar2}";
gLexer lexer = new gLexer(new ANTLRStringStream(src));
gParser parser = new gParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.start().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
and after running the main class:
*nix/MacOS
java -cp antlr-3.3.jar org.antlr.Tool g.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
Windows
java -cp antlr-3.3.jar org.antlr.Tool g.g
javac -cp antlr-3.3.jar *.java
java -cp .;antlr-3.3.jar Main
You'll see some DOT-source being printed to the console, which corresponds to the following AST:
(image created using graphviz-dev.appspot.com)

Antlr Array Help

Hey ive started to use Antlr with java and i wanted to know how i can store some values directly into a 2d array and return this array? i cant find any tutorials on this at all, all help is apperciated.
Let's say you want to parse a flat text file containing numbers separated by spaces. You'd like to parse this into a 2d array of int's where each line is a "row" in your array.
The ANTLR grammar for such a "language" could look like:
grammar Number;
parse
: line* EOF
;
line
: Number+ (LineBreak | EOF)
;
Number
: ('0'..'9')+
;
Space
: (' ' | '\t') {skip();}
;
LineBreak
: '\r'? '\n'
| '\r'
;
Now, you'd like to have the parse rule return an List of List<Integer> objects. Do that by adding a returns [List<List<Integer>> numbers] after your parse rule which can be initialized in an #init{ ... } block:
parse returns [List<List<Integer>> numbers]
#init {
$numbers = new ArrayList<List<Integer>>();
}
: line* EOF
;
Your line rule looks a bit the same, only it returns a 1 dimensional list of numbers:
line returns [List<Integer> row]
#init {
$row = new ArrayList<Integer>();
}
: Number+ (LineBreak | EOF)
;
The next step is to fill the Lists with the actual values that are being parsed. This can be done embedding the code {$row.add(Integer.parseInt($Number.text));} inside the Number+ loop in your line rule:
line returns [List<Integer> row]
#init {
$row = new ArrayList<Integer>();
}
: (Number {$row.add(Integer.parseInt($Number.text));})+ (LineBreak | EOF)
;
And lastly, you'll want to add the Lists being returned by your line rule to be actually added to your 2D numbers list from your parse rule:
parse returns [List<List<Integer>> numbers]
#init {
$numbers = new ArrayList<List<Integer>>();
}
: (line {$numbers.add($line.row);})* EOF
;
Below is the final grammar:
grammar Number;
parse returns [List<List<Integer>> numbers]
#init {
$numbers = new ArrayList<List<Integer>>();
}
: (line {$numbers.add($line.row);})* EOF
;
line returns [List<Integer> row]
#init {
$row = new ArrayList<Integer>();
}
: (Number {$row.add(Integer.parseInt($Number.text));})+ (LineBreak | EOF)
;
Number
: ('0'..'9')+
;
Space
: (' ' | '\t') {skip();}
;
LineBreak
: '\r'? '\n'
| '\r'
;
which can be tested with the following class:
import org.antlr.runtime.*;
import java.util.List;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"1 2 \n" +
"3 4 5 6 7 \n" +
" 8 \n" +
"9 10 11 ";
ANTLRStringStream in = new ANTLRStringStream(source);
NumberLexer lexer = new NumberLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
NumberParser parser = new NumberParser(tokens);
List<List<Integer>> numbers = parser.parse();
System.out.println(numbers);
}
}
Now generate a lexer and parser from the grammar:
java -cp antlr-3.2.jar org.antlr.Tool Number.g
compile all .java source files:
javac -cp antlr-3.2.jar *.java
and run the main class:
// On *nix
java -cp .:antlr-3.2.jar Main
// or Windows
java -cp .;antlr-3.2.jar Main
which produces the following output:
[[1, 2], [3, 4, 5, 6, 7], [8], [9, 10, 11]]
HTH
Here's some excerpts from a grammar I made that parses people's names and returns a Name object. Should be enough to show you how it works. Other objects such as arrays are done the same way.
In the grammar:
grammar PersonNames;
fullname returns [Name name]
#init {
name = new Name();
}
: (directory_style[name] | standard[name] | title_without_fname[name] | family_style[name] | proper_initials[name]) EOF;
standard[Name name]
: (title[name] ' ')* fname[name] ' ' (mname[name] ' ')* (nickname[name] ' ')? lname[name] (sep honorifics[name])*;
fname[Name name] : (f=NAME | f=INITIAL) { name.set(Name.Part.FIRST, toNameCase($f.text)); };
in your regular Java code
public static Name parseName(String str) throws RecognitionException {
System.err.println("parsing `" + str + "`");
CharStream stream = new ANTLRStringStream(str);
PersonNamesLexer lexer = new PersonNamesLexer(stream);
CommonTokenStream tokens = new CommonTokenStream(lexer);
PersonNamesParser parser = new PersonNamesParser(tokens);
return parser.fullname();
}

In ANTLR, how do you specify a specific number of repetitions?

I'm using ANTLR to specify a file format that contains lines that cannot exceed 254 characters (excluding line endings). How do I encode this in the grammer, short of doing:
line : CHAR? CHAR? CHAR? CHAR? ... (254 times)
This can be handled by using a semantic predicate.
First write your grammar in such a way that it does not matter how long your lines are. An example would look like this:
grammar Test;
parse
: line* EOF
;
line
: Char+ (LineBreak | EOF)
| LineBreak // empty line!
;
LineBreak : '\r'? '\n' | '\r' ;
Char : ~('\r' | '\n') ;
and then add the "predicate" to the line rule:
grammar Test;
#parser::members {
public static void main(String[] args) throws Exception {
String source = "abcde\nfghij\nklm\nnopqrst";
ANTLRStringStream in = new ANTLRStringStream(source);
TestLexer lexer = new TestLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
TestParser parser = new TestParser(tokens);
parser.parse();
}
}
parse
: line* EOF
;
line
: (c+=Char)+ {$c.size()<=5}? (LineBreak | EOF)
| LineBreak // empty line!
;
LineBreak : '\r'? '\n' | '\r' ;
Char : ~('\r' | '\n') ;
The c+=Char will construct an ArrayList containing all characters in the line. The {$c.size()<=5}? causes to throw an exception when the ArrayList's size exceeds 5.
I also added a main method in the parser so you can test it yourself:
// *nix/MacOSX
java -cp antlr-3.2.jar org.antlr.Tool Test.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar TestParser
// Windows
java -cp antlr-3.2.jar org.antlr.Tool Test.g
javac -cp antlr-3.2.jar *.java
java -cp .;antlr-3.2.jar TestParser
which will output:
line 0:-1 rule line failed predicate: {$c.size()<=5}?
HTH

Using antlr to parse a | separated file

So I think this should be easy, but I'm having a tough time with it. I'm trying to parse a | delimited file, and any line that doesn't start with a | is a comment. I guess I don't understand how comments work. It always errors out on a comment line. This is a legacy file, so there's no changing it. Here's my grammar.
grammar Route;
#header {
package org.benheath.codegeneration;
}
#lexer::header {
package org.benheath.codegeneration;
}
file: line+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;
COMMENT: ~'|' .* '\n' ;
ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'#'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace
Data:
! a comment
Another comment
| a | abc | b | def | ...
A grammar for that would look like this:
parse
: line* EOF
;
line
: ( comment | values ) ( NL | EOF )
;
comment
: ELEMENT+
;
values
: PIPE ( ELEMENT PIPE )+
;
PIPE
: '|'
;
ELEMENT
: ('a'..'z')+
;
NL
: '\r'? '\n' | '\r'
;
WS
: (' '|'\t') {$channel=HIDDEN;}
;
And to test it, you just need to sprinkle a bit of code in your grammar like this:
grammar Route;
#members {
List<List<String>> values = new ArrayList<List<String>>();
}
parse
: line* EOF
;
line
: ( comment | v=values {values.add($v.line);} ) ( NL | EOF )
;
comment
: ELEMENT+
;
values returns [List<String> line]
#init {line = new ArrayList<String>();}
: PIPE ( e=ELEMENT {line.add($e.text);} PIPE )*
;
PIPE
: '|'
;
ELEMENT
: ('a'..'z')+
;
NL
: '\r'? '\n' | '\r'
;
WS
: (' '|'\t') {$channel=HIDDEN;}
;
Now generate a lexer/parser by invoking:
java -cp antlr-3.2.jar org.antlr.Tool Route.g
create a class RouteTest.java:
import org.antlr.runtime.*;
import java.util.List;
public class RouteTest {
public static void main(String[] args) throws Exception {
String data =
"a comment\n"+
"| xxxxx | y | zzz |\n"+
"another comment\n"+
"| a | abc | b | def |";
ANTLRStringStream in = new ANTLRStringStream(data);
RouteLexer lexer = new RouteLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
RouteParser parser = new RouteParser(tokens);
parser.parse();
for(List<String> line : parser.values) {
System.out.println(line);
}
}
}
Compile all source files:
javac -cp antlr-3.2.jar *.java
and run the class RouteTest:
// Windows
java -cp .;antlr-3.2.jar RouteTest
// *nix/MacOS
java -cp .:antlr-3.2.jar RouteTest
If all goes well, you see this printed to your console:
[xxxxx, y, zzz]
[a, abc, b, def]
Edit: note that I simplified it a bit by only allowing lower case letters, you can always expand the set of course.
It's a nice idea to use ANTLR for a job like this, although I do think it's overkill. For example, it would be very easy to (in pseudo-code):
for each line in file:
if line begins with '|':
fields = /|\s*([a-z]+)\s*/g
Edit: Well, you can't express the distinction between comments and lines lexically, because there is nothing lexical that distinguishes them. A hint to get you in one workable direction.
line: comment | fields;
comment: NONBAR+ (BAR|NONBAR+) '\n';
fields = (BAR NONBAR)+;
This seems to work, I swear I tried it. Changing comment to lower case switched it to the parser vs the lexer, I still don't get it.
grammar Route;
#header {
package org.benheath.codegeneration;
}
#lexer::header {
package org.benheath.codegeneration;
}
file: (line|comment)+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;
comment : ~'|' .* '\n';
ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'#'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace