In ANTLR, how do you specify a specific number of repetitions? - antlr

I'm using ANTLR to specify a file format that contains lines that cannot exceed 254 characters (excluding line endings). How do I encode this in the grammar, short of doing:
line : CHAR? CHAR? CHAR? CHAR? ... (254 times)

This can be handled by using a semantic predicate.
First write your grammar in such a way that it does not matter how long your lines are. An example would look like this:
grammar Test;
parse
: line* EOF
;
line
: Char+ (LineBreak | EOF)
| LineBreak // empty line!
;
LineBreak : '\r'? '\n' | '\r' ;
Char : ~('\r' | '\n') ;
and then add the "predicate" to the line rule:
grammar Test;
@parser::members {
public static void main(String[] args) throws Exception {
String source = "abcde\nfghij\nklm\nnopqrst";
ANTLRStringStream in = new ANTLRStringStream(source);
TestLexer lexer = new TestLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
TestParser parser = new TestParser(tokens);
parser.parse();
}
}
parse
: line* EOF
;
line
: (c+=Char)+ {$c.size()<=5}? (LineBreak | EOF)
| LineBreak // empty line!
;
LineBreak : '\r'? '\n' | '\r' ;
Char : ~('\r' | '\n') ;
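The c+=Char will construct an ArrayList containing all Char tokens in the line. The {$c.size()<=5}? predicate makes the parser throw an exception when the ArrayList's size exceeds 5 (the limit is 5 here only to keep the test short; use 254 for the actual file format).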
I also added a main method in the parser so you can test it yourself:
// *nix/MacOSX
java -cp antlr-3.2.jar org.antlr.Tool Test.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar TestParser
// Windows
java -cp antlr-3.2.jar org.antlr.Tool Test.g
javac -cp antlr-3.2.jar *.java
java -cp .;antlr-3.2.jar TestParser
which will output:
line 0:-1 rule line failed predicate: {$c.size()<=5}?
HTH
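To make concrete what the predicate enforces, here's a minimal plain-Java sketch of the same length check done outside ANTLR. LineLengthCheck and firstTooLongLine are hypothetical names made up for this illustration, not part of ANTLR, and the split pattern is only assumed to mirror the LineBreak rule above.

```java
import java.util.OptionalInt;

final class LineLengthCheck {
    // Returns the 1-based number of the first line longer than maxLength,
    // or an empty OptionalInt if every line fits. Line endings are not counted.
    static OptionalInt firstTooLongLine(String source, int maxLength) {
        // Split on \r\n, \n or \r (like the LineBreak rule); -1 keeps trailing empty lines.
        String[] lines = source.split("\r\n|\n|\r", -1);
        for (int i = 0; i < lines.length; i++) {
            if (lines[i].length() > maxLength) {
                return OptionalInt.of(i + 1);
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        String source = "abcde\nfghij\nklm\nnopqrst";
        // With a limit of 5, the fourth line ("nopqrst") is the first offender.
        System.out.println(firstTooLongLine(source, 5));
        // With the real limit of 254, nothing is reported.
        System.out.println(firstTooLongLine(source, 254));
    }
}
```

This is only a cross-check for the grammar, not a replacement for it: the predicate rejects the line while parsing, whereas this sketch inspects the raw text.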

Related

ANTLR - test class not found error when debugging

I get a "Test class not found" error when debugging via ANTLRWorks, even though I generated the parser with the command
java org.antlr.Tool something.g
I have searched the web for this without success. Do you know how to solve my problem?
Thanks
Edit - grammar:
grammar Expr;
prog : stat+ ;
stat : expr NEWLINE
| ID '=' expr NEWLINE
| NEWLINE
;
expr : multExpr (('+' |'-' ) multExpr)*;
multExpr: atom ('*' atom)*
;
atom : INT
| ID
| '(' expr ')'
;
ID : ('a'..'z' |'A'..'Z' )+ ;
INT : '0'..'9' + ;
NEWLINE:'\r' ? '\n' ;
WS : (' ' |'\t' |'\n' |'\r' )+ {skip();} ;
Error message:
Debugging with the input text 3*4 produces the output "Error: Could not find or load main class Test", even though Test is generated in the subfolder /output...
Since you commented in the comments that running a test class from the console also failed, here's one that works:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
ExprLexer lex = new ExprLexer(new ANTLRStringStream("1 + 2 - 3 * 4 - E\n"));
CommonTokenStream tokens = new CommonTokenStream(lex);
ExprParser parser = new ExprParser(tokens);
parser.prog();
}
}
Now put your ANTLR JAR (v3.3 in my example) in the same directory as Main.java and Expr.g and run the main class from your console like this:
java -cp antlr-3.3.jar org.antlr.Tool Expr.g
javac -cp antlr-3.3.jar *.java
java -cp .;antlr-3.3.jar Main
The fact that nothing is printed to your console means that the parsing went without any problems.

Parsing a templating language

I'm trying to parse a templating language and I'm having trouble correctly parsing the arbitrary html that can appear between tags. So far what I have is below, any suggestions? An example of a valid input would be
{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}This Should Be Parsed as a Buffer.{/bar2}
And the grammar is:
grammar g;
options {
language=Java;
output=AST;
ASTLabelType=CommonTree;
}
/* LEXER RULES */
tokens {
}
LD : '{';
RD : '}';
LOOP : '#';
END_LOOP: '/';
PARTIAL : '>';
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
IDENT : (LETTER | '_') (LETTER | '_' | DIGIT)*;
BUFFER options {greedy=false;} : ~(LD | RD)+ ;
/* PARSER RULES */
start : body EOF
;
body : (tag | loop | partial | BUFFER)*
;
tag : LD! IDENT^ RD!
;
loop : LD! LOOP^ IDENT RD!
body
LD! END_LOOP! IDENT RD!
;
partial : LD! PARTIAL^ IDENT RD!
;
buffer : BUFFER
;
Your lexer tokenizes independently from your parser. If your parser tries to match a BUFFER token, the lexer does not take this info into account. In your case with input like: "blah blah blah", the lexer creates 3 IDENT tokens, not a single BUFFER token.
What you need to "tell" your lexer is that when you're inside a tag (i.e. you've encountered an LD token), an IDENT token should be created, and when you're outside a tag (i.e. you've encountered an RD token), a BUFFER token should be created instead of an IDENT token.
In order to implement this, you need to:
1. create a boolean flag inside the lexer that keeps track of whether you're inside or outside a tag. This can be done inside the @lexer::members { ... } section of your grammar;
2. after the lexer creates either an LD or an RD token, flip the boolean flag from (1). This can be done in the @after{ ... } section of the lexer rules;
3. before creating a BUFFER token inside the lexer, check that you're outside a tag at the moment. This can be done by using a semantic predicate at the start of your lexer rule.
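The three steps above can be illustrated without ANTLR by a small hand-written lexer. This is only a sketch under my own assumptions: TagAwareLexer and tokenize are hypothetical names, tokens are represented as plain strings, and whitespace inside a tag is simply skipped.

```java
import java.util.ArrayList;
import java.util.List;

final class TagAwareLexer {
    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isIdentPart(char c) {
        return isLetter(c) || c == '_' || (c >= '0' && c <= '9');
    }

    // Tokenizes template source into strings such as "LD", "IDENT(foo)", "BUFFER(text)".
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        boolean insideTag = false;                 // step 1: the flag
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '{') {
                tokens.add("LD");
                insideTag = true;                  // step 2: flip after LD
                i++;
            } else if (c == '}') {
                tokens.add("RD");
                insideTag = false;                 // step 2: flip after RD
                i++;
            } else if (!insideTag) {
                // step 3: outside a tag, everything up to the next brace is one BUFFER
                int start = i;
                while (i < src.length() && src.charAt(i) != '{' && src.charAt(i) != '}') {
                    i++;
                }
                tokens.add("BUFFER(" + src.substring(start, i) + ")");
            } else if (c == '#') {
                tokens.add("LOOP");
                i++;
            } else if (c == '/') {
                tokens.add("END_LOOP");
                i++;
            } else if (c == '>') {
                tokens.add("PARTIAL");
                i++;
            } else if (isLetter(c) || c == '_') {
                int start = i;
                while (i < src.length() && isIdentPart(src.charAt(i))) {
                    i++;
                }
                tokens.add("IDENT(" + src.substring(start, i) + ")");
            } else if (Character.isWhitespace(c)) {
                i++;                               // ignore whitespace inside a tag
            } else {
                throw new IllegalArgumentException("unexpected character '" + c + "' at " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("{foo}{#bar}blah blah blah{zed}{/bar}"));
    }
}
```

The point to notice is that "blah blah blah" comes out as a single BUFFER token, because the flag is false when the lexer reaches it.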
A short demo:
grammar g;
options {
output=AST;
ASTLabelType=CommonTree;
}
@lexer::members {
private boolean insideTag = false;
}
start
: body EOF -> body
;
body
: (tag | loop | partial | BUFFER)*
;
tag
: LD IDENT RD -> IDENT
;
loop
: LD LOOP IDENT RD body LD END_LOOP IDENT RD -> ^(LOOP body IDENT IDENT)
;
partial
: LD PARTIAL IDENT RD -> ^(PARTIAL IDENT)
;
LD @after{insideTag=true;} : '{';
RD @after{insideTag=false;} : '}';
LOOP : '#';
END_LOOP : '/';
PARTIAL : '>';
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
IDENT : (LETTER | '_') (LETTER | '_' | DIGIT)*;
BUFFER : {!insideTag}?=> ~(LD | RD)+;
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
(note that you probably want to discard spaces between tags, so I added a SPACE rule that puts them on the hidden channel)
Test it with the following class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}" +
"This Should Be Parsed as a Buffer.{/bar2}";
gLexer lexer = new gLexer(new ANTLRStringStream(src));
gParser parser = new gParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.start().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
and after running the main class:
*nix/MacOS
java -cp antlr-3.3.jar org.antlr.Tool g.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
Windows
java -cp antlr-3.3.jar org.antlr.Tool g.g
javac -cp antlr-3.3.jar *.java
java -cp .;antlr-3.3.jar Main
You'll see some DOT source printed to the console, which describes the resulting AST (the original answer showed it as an image created using graphviz-dev.appspot.com).

An antlr problem with embedded comments

I am trying to implement a nested comment in D.
nestingBlockComment
: '/+' (options {greedy=false;} :nestingBlockCommentCharacters)* '+/' {$channel=HIDDEN;}; // line 58
nestingBlockCommentCharacters
: (nestingBlockComment| '/'~'+' | ~'/' ) ; //line 61
For me, it would be logical that this should work...
This is the error message I get:
[21:06:34] warning(200): d.g:58:64: Decision can match input such as "'+/'" using multiple alternatives: 1, 2
As a result, alternative(s) 1 were disabled for that input
[21:06:34] warning(200): d.g:61:7: Decision can match input such as "'/+'" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input
Could someone explain those error messages to me, and how to fix them?
Thanks.
AFAIK, the error is because nestingBlockCommentCharacters can match +/ (the ~'/' twice).
Personally, I'd keep the nestingBlockComment as a lexer rule instead of a parser rule. You can do that by adding a little helper method in the lexer class:
public boolean openOrCloseCommentAhead() {
// return true iff '/+' or '+/' is ahead in the character stream
}
and then in a lexer comment rule, use a gated semantic predicate with that helper method as the boolean expression inside the predicate:
// match nested comments
Comment
: '/+' (Comment | {!openOrCloseCommentAhead()}?=> Any)* '+/'
;
// match any character
Any
: .
;
A little demo-grammar:
grammar DComments;
@lexer::members {
public boolean openOrCloseCommentAhead() {
return (input.LA(1) == '+' && input.LA(2) == '/') ||
(input.LA(1) == '/' && input.LA(2) == '+');
}
}
parse
: token+ EOF
;
token
: Comment {System.out.println("comment :: "+$Comment.text);}
| Any
;
Comment
: '/+' (Comment | {!openOrCloseCommentAhead()}?=> Any)* '+/'
;
Any
: .
;
and a main class to test it:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRStringStream(
"foo /+ comment /+ and +/ comment +/ bar /+ comment +/ baz");
DCommentsLexer lexer = new DCommentsLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
DCommentsParser parser = new DCommentsParser(tokens);
parser.parse();
}
}
Then the following commands:
java -cp antlr-3.2.jar org.antlr.Tool DComments.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar Main
(for Windows, the last command is: java -cp .;antlr-3.2.jar Main)
produce the following output:
comment :: /+ comment /+ and +/ comment +/
comment :: /+ comment +/
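The same two-character lookahead can also be sketched in plain Java, without ANTLR. Note that this sketch swaps the recursive Comment rule for a depth counter, which I assume accepts the same comments; NestedComments, commentEnd and comments are hypothetical names used only here.

```java
import java.util.ArrayList;
import java.util.List;

final class NestedComments {
    // If a comment opens at 'start', returns the index just past its matching "+/".
    // Returns -1 if no comment opens at 'start' or if it is never closed.
    static int commentEnd(String s, int start) {
        if (!s.startsWith("/+", start)) {
            return -1;
        }
        int depth = 0;
        int i = start;
        while (i < s.length()) {
            if (s.startsWith("/+", i)) {          // an opener is ahead: go one level deeper
                depth++;
                i += 2;
            } else if (s.startsWith("+/", i)) {   // a closer is ahead: go one level up
                depth--;
                i += 2;
                if (depth == 0) {
                    return i;
                }
            } else {
                i++;                              // any other character
            }
        }
        return -1;
    }

    // Collects all top-level comments, including the text of any comments nested in them.
    static List<String> comments(String s) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int end = commentEnd(s, i);
            if (end >= 0) {
                result.add(s.substring(i, end));
                i = end;
            } else {
                i++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String src = "foo /+ comment /+ and +/ comment +/ bar /+ comment +/ baz";
        for (String c : comments(src)) {
            System.out.println("comment :: " + c);
        }
    }
}
```

Run on the input from the demo above, it reports the same two top-level comments.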

ANTLR grammar from bison

I'm trying to translate a grammar from bison to ANTLR. The grammar itself is pretty simple in bison but I cannot find a simple way for doing this.
Grammar in bison:
expr = expr or expr | expr and expr | (expr)
Any hints/links/pointers are welcome.
Thanks,
Iulian
In ANTLR3, you cannot create left recursive rules (ANTLR4 can handle left recursion in certain cases):
a : a b
;
tail recursion is fine:
a : b a
;
For more information on left recursive rules, see ANTLR's Wiki.
So, your example could look like:
parse
: expr+ EOF
;
expr
: orExpr
;
orExpr
: andExpr ('or' andExpr)*
;
andExpr
: atom ('and' atom)*
;
atom
: Boolean
| '(' expr ')'
;
Boolean
: 'true'
| 'false'
;
Here's a small demo in Java:
grammar BoolExp;
@members {
public static void main(String[] args) throws Exception {
if(args.length != 1) {
System.out.println("Usage:");
System.out.println(" - Windows : java -cp .;antlr-3.2.jar BoolExpParser \"EXPRESSION\"");
System.out.println(" - *nix/MacOS : java -cp .:antlr-3.2.jar BoolExpParser \"EXPRESSION\"");
System.exit(0);
}
ANTLRStringStream in = new ANTLRStringStream(args[0]);
BoolExpLexer lexer = new BoolExpLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
BoolExpParser parser = new BoolExpParser(tokens);
parser.parse();
}
}
parse
: e=expr EOF {System.out.println($e.bool);}
;
expr returns [boolean bool]
: e=orExpr {$bool = $e.bool;}
;
orExpr returns [boolean bool]
: e1=andExpr {$bool = $e1.bool;}
('or' e2=andExpr {$bool = $bool || $e2.bool;}
)*
;
andExpr returns [boolean bool]
: e1=atom {$bool = $e1.bool;}
('and' e2=atom {$bool = $bool && $e2.bool;}
)*
;
atom returns [boolean bool]
: b=Boolean {$bool = new Boolean($b.text).booleanValue();}
| '(' e=expr ')' {$bool = $e.bool;}
;
Boolean
: 'true'
| 'false'
;
Space
: (' ' | '\t' | '\n' | '\r') {skip();}
;
First create a lexer & parser (1) and then compile all source files (2). Finally, execute the BoolExpParser class (3).
1
// Windows & *nix/MacOS
java -cp antlr-3.2.jar org.antlr.Tool BoolExp.g
2
// Windows
javac -cp .;antlr-3.2.jar *.java
// *nix/MacOS
javac -cp .:antlr-3.2.jar *.java
3
// Windows
java -cp .;antlr-3.2.jar BoolExpParser "false and true or true"
// *nix/MacOS
java -cp .:antlr-3.2.jar BoolExpParser "false and true or true"
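For comparison, the layered orExpr / andExpr / atom rules map almost one-to-one onto a hand-written recursive-descent evaluator. The sketch below uses no ANTLR at all; BoolEval and eval are hypothetical names, and it assumes tokens are separated by whitespace or parentheses.

```java
final class BoolEval {
    private final String[] tokens;
    private int pos = 0;

    private BoolEval(String input) {
        // Pad parentheses with spaces so a whitespace split yields clean tokens.
        String padded = input.replace("(", " ( ").replace(")", " ) ").trim();
        tokens = padded.isEmpty() ? new String[0] : padded.split("\\s+");
    }

    // parse : expr EOF
    static boolean eval(String input) {
        BoolEval p = new BoolEval(input);
        boolean result = p.orExpr();
        if (p.pos != p.tokens.length) {
            throw new IllegalArgumentException("unexpected token: " + p.tokens[p.pos]);
        }
        return result;
    }

    private boolean peekIs(String s) {
        return pos < tokens.length && tokens[pos].equals(s);
    }

    // orExpr : andExpr ('or' andExpr)*
    private boolean orExpr() {
        boolean value = andExpr();
        while (peekIs("or")) {
            pos++;
            boolean right = andExpr();   // parse first, so short-circuiting never skips input
            value = value || right;
        }
        return value;
    }

    // andExpr : atom ('and' atom)*
    private boolean andExpr() {
        boolean value = atom();
        while (peekIs("and")) {
            pos++;
            boolean right = atom();
            value = value && right;
        }
        return value;
    }

    // atom : Boolean | '(' expr ')'
    private boolean atom() {
        if (pos >= tokens.length) {
            throw new IllegalArgumentException("unexpected end of input");
        }
        String t = tokens[pos++];
        if (t.equals("true")) return true;
        if (t.equals("false")) return false;
        if (t.equals("(")) {
            boolean value = orExpr();
            if (!peekIs(")")) {
                throw new IllegalArgumentException("expected ')'");
            }
            pos++;
            return value;
        }
        throw new IllegalArgumentException("unexpected token: " + t);
    }

    public static void main(String[] args) {
        System.out.println(eval("false and true or true"));   // prints true
    }
}
```

Because andExpr sits below orExpr, 'and' binds tighter than 'or', exactly as in the grammar; the loops take the place of the left recursion in the bison rule.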
Terence Parr's ANTLR reference is the book on ANTLR. And Scott created some excellent video tutorials on ANTLR 3 (with Eclipse).

Using antlr to parse a | separated file

So I think this should be easy, but I'm having a tough time with it. I'm trying to parse a | delimited file, and any line that doesn't start with a | is a comment. I guess I don't understand how comments work. It always errors out on a comment line. This is a legacy file, so there's no changing it. Here's my grammar.
grammar Route;
@header {
package org.benheath.codegeneration;
}
@lexer::header {
package org.benheath.codegeneration;
}
file: line+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;
COMMENT: ~'|' .* '\n' ;
ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'#'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace
Data:
! a comment
Another comment
| a | abc | b | def | ...
A grammar for that would look like this:
parse
: line* EOF
;
line
: ( comment | values ) ( NL | EOF )
;
comment
: ELEMENT+
;
values
: PIPE ( ELEMENT PIPE )+
;
PIPE
: '|'
;
ELEMENT
: ('a'..'z')+
;
NL
: '\r'? '\n' | '\r'
;
WS
: (' '|'\t') {$channel=HIDDEN;}
;
And to test it, you just need to sprinkle a bit of code in your grammar like this:
grammar Route;
@members {
List<List<String>> values = new ArrayList<List<String>>();
}
parse
: line* EOF
;
line
: ( comment | v=values {values.add($v.line);} ) ( NL | EOF )
;
comment
: ELEMENT+
;
values returns [List<String> line]
@init {line = new ArrayList<String>();}
: PIPE ( e=ELEMENT {line.add($e.text);} PIPE )*
;
PIPE
: '|'
;
ELEMENT
: ('a'..'z')+
;
NL
: '\r'? '\n' | '\r'
;
WS
: (' '|'\t') {$channel=HIDDEN;}
;
Now generate a lexer/parser by invoking:
java -cp antlr-3.2.jar org.antlr.Tool Route.g
create a class RouteTest.java:
import org.antlr.runtime.*;
import java.util.List;
public class RouteTest {
public static void main(String[] args) throws Exception {
String data =
"a comment\n"+
"| xxxxx | y | zzz |\n"+
"another comment\n"+
"| a | abc | b | def |";
ANTLRStringStream in = new ANTLRStringStream(data);
RouteLexer lexer = new RouteLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
RouteParser parser = new RouteParser(tokens);
parser.parse();
for(List<String> line : parser.values) {
System.out.println(line);
}
}
}
Compile all source files:
javac -cp antlr-3.2.jar *.java
and run the class RouteTest:
// Windows
java -cp .;antlr-3.2.jar RouteTest
// *nix/MacOS
java -cp .:antlr-3.2.jar RouteTest
If all goes well, you see this printed to your console:
[xxxxx, y, zzz]
[a, abc, b, def]
Edit: note that I simplified it a bit by only allowing lower case letters, you can always expand the set of course.
It's a nice idea to use ANTLR for a job like this, although I do think it's overkill. For example, it would be very easy to (in pseudo-code):
for each line in file:
if line begins with '|':
fields = /\|\s*([a-z]+)\s*/g
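In Java, that pseudo-code could look roughly like the sketch below. PipeLines and parse are hypothetical names, and, like the grammar above, it assumes fields consist of lower case letters only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PipeLines {
    // A literal '|', optional whitespace, one captured field, optional whitespace.
    private static final Pattern FIELD = Pattern.compile("\\|\\s*([a-z]+)\\s*");

    // Returns the fields of every line that starts with '|'; other lines are comments.
    static List<List<String>> parse(String data) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : data.split("\r\n|\n|\r")) {
            if (!line.startsWith("|")) {
                continue;                      // comment line: skip it
            }
            List<String> fields = new ArrayList<>();
            Matcher m = FIELD.matcher(line);
            while (m.find()) {
                fields.add(m.group(1));
            }
            rows.add(fields);
        }
        return rows;
    }

    public static void main(String[] args) {
        String data =
                "a comment\n" +
                "| xxxxx | y | zzz |\n" +
                "another comment\n" +
                "| a | abc | b | def |";
        for (List<String> row : parse(data)) {
            System.out.println(row);
        }
    }
}
```

The trailing | on each line produces no field, because the pattern requires at least one letter after the bar.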
Edit: Well, you can't express the distinction between comments and lines lexically, because there is nothing lexical that distinguishes them. A hint to get you in one workable direction.
line: comment | fields;
comment: NONBAR+ (BAR|NONBAR+) '\n';
fields: (BAR NONBAR)+;
This seems to work, I swear I tried it. Changing comment to lower case switched it to the parser vs the lexer, I still don't get it.
grammar Route;
@header {
package org.benheath.codegeneration;
}
@lexer::header {
package org.benheath.codegeneration;
}
file: (line|comment)+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;
comment : ~'|' .* '\n';
ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'#'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace