Why similar antrl2 grammar does not produce non-determenism warning? - antlr2

I'm trying to create antlr grammar to parse javadoc comment. I use ANTLR v2.
Why first grammar produces non-determinism warnings and second one does not?
I thought that they are similar.
First grammar:
class TestParser extends Parser;
options {
buildAST = true;
k = 2;
}
startRule: START TEXT END;
class TestLexer extends Lexer;
options {
k = 2;
}
START: "/**";
END: "*/";
TEXT: (
{LA(2)!='/'}? '*'
| ~('*')
)+;
Warnings:
warning:lexical nondeterminism between rules START and TEXT upon
k==1:'/'
k==2:'*'
warning:lexical nondeterminism between rules END and TEXT upon
k==1:'*'
k==2:'/'
Second grammar:
class TestParser extends Parser;
options {
buildAST = true;
k = 2;
}
startRule: "/**" TEXT "*/";
class TestLexer extends Lexer;
options {
k = 2;
}
TEXT: (
{LA(2)!='/'}? '*'
| ~('*')
)+;

Related

In antlr4 lexer, How to have a rule that catches all remaining "words" as Unknown token?

I have an antlr4 lexer grammar. It has many rules for words, but I also want it to create an Unknown token for any word that it can not match by other rules. I have something like this:
Whitespace : [ \t\n\r]+ -> skip;
Punctuation : [.,:;?!];
// Other rules here
Unknown : .+? ;
Now generated matcher catches '~' as unknown but creates 3 '~' Unknown tokens for input '~~~' instead of a single '~~~' token. What should I do to tell lexer to generate word tokens for unknown consecutive characters. I also tried "Unknown: . ;" and "Unknown : .+ ;" with no results.
EDIT: In current antlr versions .+? now catches remaining words, so this problem seems to be resolved.
.+? at the end of a lexer rule will always match a single character. But .+ will consume as much as possible, which was illegal at the end of a rule in ANTLR v3 (v4 probably as well).
What you can do is just match a single char, and "glue" these together in the parser:
unknowns : Unknown+ ;
...
Unknown : . ;
EDIT
... but I only have a lexer, no parsers ...
Ah, I see. Then you could override the nextToken() method:
lexer grammar Lex;
#members {
public static void main(String[] args) {
Lex lex = new Lex(new ANTLRInputStream("foo, bar...\n"));
for(Token t : lex.getAllTokens()) {
System.out.printf("%-15s '%s'\n", tokenNames[t.getType()], t.getText());
}
}
private java.util.Queue<Token> queue = new java.util.LinkedList<Token>();
#Override
public Token nextToken() {
if(!queue.isEmpty()) {
return queue.poll();
}
Token next = super.nextToken();
if(next.getType() != Unknown) {
return next;
}
StringBuilder builder = new StringBuilder();
while(next.getType() == Unknown) {
builder.append(next.getText());
next = super.nextToken();
}
// The `next` will _not_ be an Unknown-token, store it in
// the queue to return the next time!
queue.offer(next);
return new CommonToken(Unknown, builder.toString());
}
}
Whitespace : [ \t\n\r]+ -> skip ;
Punctuation : [.,:;?!] ;
Unknown : . ;
Running it:
java -cp antlr-4.0-complete.jar org.antlr.v4.Tool Lex.g4
javac -cp antlr-4.0-complete.jar *.java
java -cp .:antlr-4.0-complete.jar Lex
will print:
Unknown 'foo'
Punctuation ','
Unknown 'bar'
Punctuation '.'
Punctuation '.'
Punctuation '.'
The accepted answer works, but it only works for Java.
I converted the provided Java code for use with the C# ANTLR runtime. If anyone else needs it... here ya go!
#members {
private IToken _NextToken = null;
public override IToken NextToken()
{
if(_NextToken != null)
{
var token = _NextToken;
_NextToken = null;
return token;
}
var next = base.NextToken();
if(next.Type != UNKNOWN)
{
return next;
}
var originalToken = next;
var lastToken = next;
var builder = new StringBuilder();
while(next.Type == UNKNOWN)
{
lastToken = next;
builder.Append(next.Text);
next = base.NextToken();
}
_NextToken = next;
return new CommonToken(
originalToken
)
{
Text = builder.ToString(),
StopIndex = lastToken.Column
};
}
}

ANTLR What is simpliest way to realize python like indent-depending grammar?

I am trying realize python like indent-depending grammar.
Source example:
ABC QWE
CDE EFG
EFG CDE
ABC
QWE ZXC
As i see, what i need is to realize two tokens INDENT and DEDENT, so i could write something like:
grammar mygrammar;
text: (ID | block)+;
block: INDENT (ID|block)+ DEDENT;
INDENT: ????;
DEDENT: ????;
Is there any simple way to realize this using ANTLR?
(I'd prefer, if it's possible, to use standard ANTLR lexer.)
I don't know what the easiest way to handle it is, but the following is a relatively easy way. Whenever you match a line break in your lexer, optionally match one or more spaces. If there are spaces after the line break, compare the length of these spaces with the current indent-size. If it's more than the current indent size, emit an Indent token, if it's less than the current indent-size, emit a Dedent token and if it's the same, don't do anything.
You'll also want to emit a number of Dedent tokens at the end of the file to let every Indent have a matching Dedent token.
For this to work properly, you must add a leading and trailing line break to your input source file!
ANTRL3
A quick demo:
grammar PyEsque;
options {
output=AST;
}
tokens {
BLOCK;
}
#lexer::members {
private int previousIndents = -1;
private int indentLevel = 0;
java.util.Queue<Token> tokens = new java.util.LinkedList<Token>();
#Override
public void emit(Token t) {
state.token = t;
tokens.offer(t);
}
#Override
public Token nextToken() {
super.nextToken();
return tokens.isEmpty() ? Token.EOF_TOKEN : tokens.poll();
}
private void jump(int ttype) {
indentLevel += (ttype == Dedent ? -1 : 1);
emit(new CommonToken(ttype, "level=" + indentLevel));
}
}
parse
: block EOF -> block
;
block
: Indent block_atoms Dedent -> ^(BLOCK block_atoms)
;
block_atoms
: (Id | block)+
;
NewLine
: NL SP?
{
int n = $SP.text == null ? 0 : $SP.text.length();
if(n > previousIndents) {
jump(Indent);
previousIndents = n;
}
else if(n < previousIndents) {
jump(Dedent);
previousIndents = n;
}
else if(input.LA(1) == EOF) {
while(indentLevel > 0) {
jump(Dedent);
}
}
else {
skip();
}
}
;
Id
: ('a'..'z' | 'A'..'Z')+
;
SpaceChars
: SP {skip();}
;
fragment NL : '\r'? '\n' | '\r';
fragment SP : (' ' | '\t')+;
fragment Indent : ;
fragment Dedent : ;
You can test the parser with the class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
PyEsqueLexer lexer = new PyEsqueLexer(new ANTLRFileStream("in.txt"));
PyEsqueParser parser = new PyEsqueParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
If you now put the following in a file called in.txt:
AAA AAAAA
BBB BB B
BB BBBBB BB
CCCCCC C CC
BB BBBBBB
C CCC
DDD DD D
DDD D DDD
(Note the leading and trailing line breaks!)
then you'll see output that corresponds to the following AST:
Note that my demo wouldn't produce enough dedents in succession, like dedenting from ccc to aaa (2 dedent tokens are needed):
aaa
bbb
ccc
aaa
You would need to adjust the code inside else if(n < previousIndents) { ... } to possibly emit more than 1 dedent token based on the difference between n and previousIndents. Off the top of my head, that could look like this:
else if(n < previousIndents) {
// Note: assuming indent-size is 2. Jumping from previousIndents=6
// to n=2 will result in emitting 2 `Dedent` tokens
int numDedents = (previousIndents - n) / 2;
while(numDedents-- > 0) {
jump(Dedent);
}
previousIndents = n;
}
ANTLR4
For ANTLR4, do something like this:
grammar Python3;
tokens { INDENT, DEDENT }
#lexer::members {
// A queue where extra tokens are pushed on (see the NEWLINE lexer rule).
private java.util.LinkedList<Token> tokens = new java.util.LinkedList<>();
// The stack that keeps track of the indentation level.
private java.util.Stack<Integer> indents = new java.util.Stack<>();
// The amount of opened braces, brackets and parenthesis.
private int opened = 0;
// The most recently produced token.
private Token lastToken = null;
#Override
public void emit(Token t) {
super.setToken(t);
tokens.offer(t);
}
#Override
public Token nextToken() {
// Check if the end-of-file is ahead and there are still some DEDENTS expected.
if (_input.LA(1) == EOF && !this.indents.isEmpty()) {
// Remove any trailing EOF tokens from our buffer.
for (int i = tokens.size() - 1; i >= 0; i--) {
if (tokens.get(i).getType() == EOF) {
tokens.remove(i);
}
}
// First emit an extra line break that serves as the end of the statement.
this.emit(commonToken(Python3Parser.NEWLINE, "\n"));
// Now emit as much DEDENT tokens as needed.
while (!indents.isEmpty()) {
this.emit(createDedent());
indents.pop();
}
// Put the EOF back on the token stream.
this.emit(commonToken(Python3Parser.EOF, "<EOF>"));
}
Token next = super.nextToken();
if (next.getChannel() == Token.DEFAULT_CHANNEL) {
// Keep track of the last token on the default channel.
this.lastToken = next;
}
return tokens.isEmpty() ? next : tokens.poll();
}
private Token createDedent() {
CommonToken dedent = commonToken(Python3Parser.DEDENT, "");
dedent.setLine(this.lastToken.getLine());
return dedent;
}
private CommonToken commonToken(int type, String text) {
int stop = this.getCharIndex() - 1;
int start = text.isEmpty() ? stop : stop - text.length() + 1;
return new CommonToken(this._tokenFactorySourcePair, type, DEFAULT_TOKEN_CHANNEL, start, stop);
}
// Calculates the indentation of the provided spaces, taking the
// following rules into account:
//
// "Tabs are replaced (from left to right) by one to eight spaces
// such that the total number of characters up to and including
// the replacement is a multiple of eight [...]"
//
// -- https://docs.python.org/3.1/reference/lexical_analysis.html#indentation
static int getIndentationCount(String spaces) {
int count = 0;
for (char ch : spaces.toCharArray()) {
switch (ch) {
case '\t':
count += 8 - (count % 8);
break;
default:
// A normal space char.
count++;
}
}
return count;
}
boolean atStartOfInput() {
return super.getCharPositionInLine() == 0 && super.getLine() == 1;
}
}
single_input
: NEWLINE
| simple_stmt
| compound_stmt NEWLINE
;
// more parser rules
NEWLINE
: ( {atStartOfInput()}? SPACES
| ( '\r'? '\n' | '\r' ) SPACES?
)
{
String newLine = getText().replaceAll("[^\r\n]+", "");
String spaces = getText().replaceAll("[\r\n]+", "");
int next = _input.LA(1);
if (opened > 0 || next == '\r' || next == '\n' || next == '#') {
// If we're inside a list or on a blank line, ignore all indents,
// dedents and line breaks.
skip();
}
else {
emit(commonToken(NEWLINE, newLine));
int indent = getIndentationCount(spaces);
int previous = indents.isEmpty() ? 0 : indents.peek();
if (indent == previous) {
// skip indents of the same size as the present indent-size
skip();
}
else if (indent > previous) {
indents.push(indent);
emit(commonToken(Python3Parser.INDENT, spaces));
}
else {
// Possibly emit more than 1 DEDENT token.
while(!indents.isEmpty() && indents.peek() > indent) {
this.emit(createDedent());
indents.pop();
}
}
}
}
;
// more lexer rules
Taken from: https://github.com/antlr/grammars-v4/blob/master/python3/Python3.g4
There is an open-source library antlr-denter for ANTLR v4 that helps parse indents and dedents for you. Check out its README for how to use it.
Since it is a library, rather than code snippets to copy-and-paste into your grammar, its indentation-handling can be updated separately from the rest of your grammar.
There is a relatively simple way to do this ANTLR, which I wrote as an experiment: DentLexer.g4. This solution is different from the others mentioned on this page that were written by Kiers and Shavit. It integrates with the runtime solely via an override of the Lexer's nextToken() method. It does its work by examining tokens: (1) a NEWLINE token triggers the start of a "keep track of indentation" phase; (2) whitespace and comments, both set to channel HIDDEN, are counted and ignored, respectively, during that phase; and, (3) any non-HIDDEN token ends the phase. Thus controlling the indentation logic is a simple matter of setting a token's channel.
Both of the solutions mentioned on this page require a NEWLINE token to also grab all the subsequent whitespace, but in doing so can't handle multi-line comments interrupting that whitespace. Dent, instead, keeps NEWLINE and whitespace tokens separate and can handle multi-line comments.
Your grammar would be set up something like below. Note that the NEWLINE and WS lexer rules have actions that control the pendingDent state and keep track of indentation level with the indentCount variable.
grammar MyGrammar;
tokens { INDENT, DEDENT }
#lexer::members {
// override of nextToken(), see Dent.g4 grammar on github
// https://github.com/wevrem/wry/blob/master/grammars/Dent.g4
}
script : ( NEWLINE | statement )* EOF ;
statement
: simpleStatement
| blockStatements
;
simpleStatement : LEGIT+ NEWLINE ;
blockStatements : LEGIT+ NEWLINE INDENT statement+ DEDENT ;
NEWLINE : ( '\r'? '\n' | '\r' ) {
if (pendingDent) { setChannel(HIDDEN); }
pendingDent = true;
indentCount = 0;
initialIndentToken = null;
} ;
WS : [ \t]+ {
setChannel(HIDDEN);
if (pendingDent) { indentCount += getText().length(); }
} ;
BlockComment : '/*' ( BlockComment | . )*? '*/' -> channel(HIDDEN) ; // allow nesting comments
LineComment : '//' ~[\r\n]* -> channel(HIDDEN) ;
LEGIT : ~[ \t\r\n]+ ~[\r\n]*; // Replace with your language-specific rules...
Have you looked at the Python ANTLR grammar?
Edit: Added psuedo Python code for creating INDENT/DEDENT tokens
UNKNOWN_TOKEN = 0
INDENT_TOKEN = 1
DEDENT_TOKEN = 2
# filestream has already been processed so that each character is a newline and
# every tab outside of quotations is converted to 8 spaces.
def GetIndentationTokens(filestream):
# Stores (indentation_token, line, character_index)
indentation_record = list()
line = 0
character_index = 0
column = 0
counting_whitespace = true
indentations = list()
for c in filestream:
if IsNewLine(c):
character_index = 0
column = 0
line += 1
counting_whitespace = true
elif c != ' ' and counting_whitespace:
counting_whitespace = false
if(len(indentations) == 0):
indentation_record.append((token, line, character_index))
else:
while(len(indentations) > 0 and indentations[-1] != column:
if(column < indentations[-1]):
indentations.pop()
indentation_record.append((
DEDENT, line, character_index))
elif(column > indentations[-1]):
indentations.append(column)
indentation_record.append((
INDENT, line, character_index))
if not IsNewLine(c):
column += 1
character_index += 1
while(len(indentations) > 0):
indentations.pop()
indentation_record.append((DEDENT_TOKEN, line, character_index))
return indentation_record

ANTLR: Heterogeneous AST and imaginary tokens

it's my first question here :)
I'd like to build an heterogeneous AST with ANTLR for a simple grammar. There are different Interfaces to represent the AST nodes, e. g. IInfiExp, IVariableDecl. ANTLR comes up with CommonTree to hold all the information of the source code (line number, character position etc.) and I want to use this as a base for the implementations of the AST interfacese IInfixExp ...
In order to get an AST as output with CommonTree as node types, I set:
options {
language = Java;
k = 1;
output = AST;
ASTLabelType = CommonTree;
}
The IInifxExp is:
package toylanguage;
public interface IInfixExp extends IExpression {
public enum Operator {
PLUS, MINUS, TIMES, DIVIDE;
}
public Operator getOperator();
public IExpression getLeftHandSide();
public IExpression getRightHandSide();
}
and the implementation InfixExp is:
package toylanguage;
import org.antlr.runtime.Token;
import org.antlr.runtime.tree.CommonTree;
// IInitializable has only void initialize()
public class InfixExp extends CommonTree implements IInfixExp, IInitializable {
private Operator operator;
private IExpression leftHandSide;
private IExpression rightHandSide;
InfixExp(Token token) {
super(token);
}
#Override
public Operator getOperator() {
return operator;
}
#Override
public IExpression getLeftHandSide() {
return leftHandSide;
}
#Override
public IExpression getRightHandSide() {
return rightHandSide;
}
// from IInitializable. get called from ToyTreeAdaptor.rulePostProcessing
#Override
public void initialize() {
// term ((PLUS|MINUS) term)+
// atom ((TIMES|DIIDE) atom)+
// exact 2 children
assert getChildCount() == 2;
// left and right child are IExpressions
assert getChild(0) instanceof IExpression
&& getChild(1) instanceof IExpression;
// operator
switch (token.getType()) {
case ToyLanguageParser.PLUS:
operator = Operator.PLUS;
break;
case ToyLanguageParser.MINUS:
operator = Operator.MINUS;
break;
case ToyLanguageParser.TIMES:
operator = Operator.TIMES;
break;
case ToyLanguageParser.DIVIDE:
operator = Operator.DIVIDE;
break;
default:
assert false;
}
// left and right operands
leftHandSide = (IExpression) getChild(0);
rightHandSide = (IExpression) getChild(1);
}
}
The corresponding rules are:
exp // e.g. a+b
: term ((PLUS<InfixExp>^|MINUS<InfixExp>^) term)*
;
term // e.g. a*b
: atom ((TIMES<InfixExp>^|DIVIDE<InfixExp>^) atom)*
;
This works fine, becouse PLUS, MINUS etc. are "real" tokens.
But now comes to the imaginary token:
tokens {
PROGRAM;
}
The corresponding rule is:
program // e.g. var a, b; a + b
: varDecl* exp
-> ^(PROGRAM<Program> varDecl* exp)
;
With this, ANTLR doesn't create a tree with PROGRAM as root node.
In the parser, the following code creates the Program instance:
root_1 = (CommonTree)adaptor.becomeRoot(new Program(PROGRAM), root_1);
Unlike InfixExp not the Program(Token) constructor but Program(int) is invoked.
Program is:
package toylanguage;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import org.antlr.runtime.Token;
import org.antlr.runtime.tree.CommonTree;
class Program extends CommonTree implements IProgram, IInitializable {
private final LinkedList<IVariableDecl> variableDeclarations = new LinkedList<IVariableDecl>();
private IExpression expression = null;
Program(Token token) {
super(token);
}
public Program(int tokeType) {
// What to do?
super();
}
#Override
public List<IVariableDecl> getVariableDeclarations() {
// don't allow to change the list
return Collections.unmodifiableList(variableDeclarations);
}
#Override
public IExpression getExpression() {
return expression;
}
#Override
public void initialize() {
// program: varDecl* exp;
// at least one child
assert getChildCount() > 0;
// the last one is a IExpression
assert getChild(getChildCount() - 1) instanceof IExpression;
// iterate over varDecl*
int i = 0;
while (getChild(i) instanceof IVariableDecl) {
variableDeclarations.add((IVariableDecl) getChild(i));
i++;
}
// exp
expression = (IExpression) getChild(i);
}
}
you can see the constructor:
public Program(int tokeType) {
// What to do?
super();
}
as a result of it, with super() a CommonTree ist build without a token. So CommonTreeAdaptor.rulePostProcessing see a flat list, not a tree with a Token as root.
My TreeAdaptor looks like:
package toylanguage;
import org.antlr.runtime.tree.CommonTreeAdaptor;
public class ToyTreeAdaptor extends CommonTreeAdaptor {
public Object rulePostProcessing(Object root) {
Object result = super.rulePostProcessing(root);
// check if needs initialising
if (result instanceof IInitializable) {
IInitializable initializable = (IInitializable) result;
initializable.initialize();
}
return result;
};
}
And to test anything I use:
package toylanguage;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import org.antlr.runtime.tree.CommonTree;
import toylanguage.ToyLanguageParser.program_return;
public class Processor {
public static void main(String[] args) {
String input = "var a, b; a + b + 123"; // sample input
ANTLRStringStream stream = new ANTLRStringStream(input);
ToyLanguageLexer lexer = new ToyLanguageLexer(stream);
TokenStream tokens = new CommonTokenStream(lexer);
ToyLanguageParser parser = new ToyLanguageParser(tokens);
ToyTreeAdaptor treeAdaptor = new ToyTreeAdaptor();
parser.setTreeAdaptor(treeAdaptor);
try {
// test with: var a, b; a + b
program_return program = parser.program();
CommonTree root = program.tree;
// prints 'a b (+ a b)'
System.out.println(root.toStringTree());
// get (+ a b), the third child of root
CommonTree third = (CommonTree) root.getChild(2);
// prints '(+ a b)'
System.out.println(third.toStringTree());
// prints 'true'
System.out.println(third instanceof IInfixExp);
// prints 'false'
System.out.println(root instanceof IProgram);
} catch (RecognitionException e) {
e.printStackTrace();
}
}
}
For completeness, here is the full grammar:
grammar ToyLanguage;
options {
language = Java;
k = 1;
output = AST;
ASTLabelType = CommonTree;
}
tokens {
PROGRAM;
}
#header {
package toylanguage;
}
#lexer::header {
package toylanguage;
}
program // e.g. var a, b; a + b
: varDecl* exp
-> ^(PROGRAM<Program> varDecl* exp)
;
varDecl // e.g. var a, b;
: 'var'! ID<VariableDecl> (','! ID<VariableDecl>)* ';'!
;
exp // e.g. a+b
: term ((PLUS<InfixExp>^|MINUS<InfixExp>^) term)*
;
term // e.g. a*b
: atom ((TIMES<InfixExp>^|DIVIDE<InfixExp>^) atom)*
;
atom
: INT<IntegerLiteralExp> // e.g. 123
| ID<VariableExp> // e.g. a
| '(' exp ')' -> exp // e.g. (a+b)
;
INT : ('0'..'9')+ ;
ID : ('a'..'z')+ ;
PLUS : '+' ;
MINUS : '-' ;
TIMES : '*' ;
DIVIDE : '/' ;
WS : ('\t' | '\n' | '\r' | ' ')+ { $channel = HIDDEN; } ;
OK, the final question is how to get from
program // e.g. var a, b; a + b
: varDecl* exp
-> ^(PROGRAM<Program> varDecl* exp)
;
a tree with PROGRAM as root
^(PROGRAM varDecl* exp)
and not a flat list with
(varDecl* exp) ?
(Sorry for this numerous code fragments)
Ciao Vertex
Try creating the following constructor:
public Program(int tokenType) {
super(new CommonToken(tokenType, "PROGRAM"));
}

ANTLRWorks :Can't get operators to work

I've been trying to learn ANTLR for some time and finally got my hands on The Definitive ANTLR reference.
Well I tried the following in ANTLRWorks 1.4
grammar Test;
INT : '0'..'9'+
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
expression
: INT ('+'^ INT)*;
When I pass 2+4 and process expression, I don't get a tree with + as the root and 2 and 4 as the child nodes. Rather, I get expression as the root and 2, + and 4 as child nodes at the same level.
Can't figure out what I am doing wrong. Need help desparately.
BTW how can I get those graphic descriptions ?
Yes, you get the expression because it's an expression that your only rule expression is returning.
I have just added a virtual token PLUS to your example along with a rewrite expression that show the result your are expecting.
But it seems that you have already found the solution :o)
grammar Test;
options {
output=AST;
ASTLabelType = CommonTree;
}
tokens {PLUS;}
#members {
public static void main(String [] args) {
try {
TestLexer lexer =
new TestLexer(new ANTLRStringStream("2+2"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
TestParser parser = new TestParser(tokens);
TestParser.expression_return p_result = parser.expression();
CommonTree ast = p_result.tree;
if( ast == null ) {
System.out.println("resultant tree: is NULL");
} else {
System.out.println("resultant tree: " + ast.toStringTree());
}
} catch(Exception e) {
e.printStackTrace();
}
}
}
expression
: INT ('+' INT)* -> ^(PLUS INT+);
INT : '0'..'9'+
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;

Generating simple AST in ANTLR

I'm playing a bit around with ANTLR, and wish to create a function like this:
MOVE x y z pitch roll
That produces the following AST:
MOVE
|---x
|---y
|---z
|---pitch
|---roll
So far I've tried without luck, and I keep getting the AST to have the parameters as siblings, rather than children.
Code so far:
C#:
class Program
{
const string CRLF = "\r\n";
static void Main(string[] args)
{
string filename = "Script.txt";
var reader = new StreamReader(filename);
var input = new ANTLRReaderStream(reader);
var lexer = new ScorBotScriptLexer(input);
var tokens = new CommonTokenStream(lexer);
var parser = new ScorBotScriptParser(tokens);
var result = parser.program();
var tree = result.Tree as CommonTree;
Print(tree, "");
Console.Read();
}
static void Print(CommonTree tree, string indent)
{
Console.WriteLine(indent + tree.ToString());
if (tree.Children != null)
{
indent += "\t";
foreach (var child in tree.Children)
{
var childTree = child as CommonTree;
if (childTree.Text != CRLF)
{
Print(childTree, indent);
}
}
}
}
ANTLR:
grammar ScorBotScript;
options
{
language = 'CSharp2';
output = AST;
ASTLabelType = CommonTree;
backtrack = true;
memoize = true;
}
#parser::namespace { RSD.Scripting }
#lexer::namespace { RSD.Scripting }
program
: (robotInstruction CRLF)*
;
robotInstruction
: moveCoordinatesInstruction
;
/**
* MOVE X Y Z PITCH ROLL
*/
moveCoordinatesInstruction
: 'MOVE' x=INT y=INT z=INT pitch=INT roll=INT
;
INT : '-'? ( '0'..'9' )*
;
COMMENT
: '//' ~( CR | LF )* CR? LF { $channel = HIDDEN; }
;
WS
: ( ' ' | TAB | CR | LF ) { $channel = HIDDEN; }
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
STRING
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
;
fragment TAB
: '\t'
;
fragment CR
: '\r'
;
fragment LF
: '\n'
;
CRLF
: (CR ? LF) => CR ? LF
| CR
;
parse
: ID
| INT
| COMMENT
| STRING
| WS
;
I'm a beginner with ANTLR myself, this confused me too.
I think if you want to create a tree from your grammar that has structure, you augment your grammar with hints using the ^ and ! characters. This examples page shows how.
From the linked page:
By default ANTLR creates trees as
"sibling lists".
The grammar must be annotated to with
tree commands to produce a parser that
creates trees in the correct shape
(that is, operators at the root, which
operands as children). A somewhat more
complicated expression parser can be
seen here and downloaded in tar form
here. Note that grammar terminals
which should be at the root of a
sub-tree are annotated with ^.