How do I convert this Antlr3 AST to Antlr4? - antlr

I'm trying to convert my existing Antlr3 project to Antlr4 to get more functionality. I have this grammar that wouldn't compile with Antlr4.9
expr
: term ( OR^ term )* ;
and
factor
: ava | NOT^ factor | (LPAREN! expr RPAREN!) ;
Mostly because Antlr4 doesn't support ^ and ! anymore. From the documentation it seems like those are
AST root operator. When generating abstract syntax trees (ASTs), token
references suffixed with the "^" root operator force AST nodes to be
created and added as the root of the current tree. This symbol is only
effective when the buildAST option is set. More information about ASTs
is also available.
AST exclude operator. When generating abstract syntax trees, token
references suffixed with the "!" exclude operator are not included in
the AST constructed for that rule. Rule references can also be
suffixed with the exclude operator, which implies that, while the tree
for the referenced rule is constructed, it is not linked into the tree
for the referencing rule. This symbol is only effective when the
buildAST option is set. More information about ASTs is also available.
If I took those out it would compile but I'm not sure what do those mean and how would Antlr4 supports it.
LPAREN and RPAREN is tokens
tokens {
EQUALS = '=';
LPAREN = '(';
RPAREN = ')';
}
which Antlr4 kindly provides the way to convert that in the error messages but not ^ and !. The grammar is for parsing boolean expression for example (a=b AND b=c)
I think this is the rule
targetingexpr returns [boolean value]
: expr { $value = $expr.value; } ;
expr returns [boolean value]
: ^(NOT a=expr) { $value = !a; }
| ^(AND a=expr b=expr) { $value = a && b; }
| ^(OR a=expr b=expr) { $value = a || b; }
| ^(EQUALS A=ALPHANUM B=ALPHANUM) { $value = targetingContext.contains($A.text,$B.text); }
;

The v3 grammar:
...
tokens {
EQUALS = '=';
LPAREN = '(';
RPAREN = ')';
}
...
expr
: term ( OR^ term )* ;
factor
: ava | NOT^ factor | (LPAREN! expr RPAREN!) ;
in v4 would look like this:
...
expr
: term ( OR term )* ;
factor
: ava | NOT factor | (LPAREN expr RPAREN) ;
EQUALS : '=';
LPAREN : '(';
RPAREN : ')';
So, just remove the inline ^ and ! operators (tree rewriting is no longer available in ANTLR4), and move the literal tokens in the tokens { ... } sections into own lexer rules.
I think this is the rule
targetingexpr returns [boolean value]
: expr { $value = $expr.value; } ;
expr returns [boolean value]
: ^(NOT a=expr) { $value = !a; }
| ^(AND a=expr b=expr) { $value = a && b; }
| ^(OR a=expr b=expr) { $value = a || b; }
| ^(EQUALS A=ALPHANUM B=ALPHANUM) { $value = targetingContext.contains($A.text,$B.text); }
;
What you posted there is part of a tree grammar for which there is no equivalent. In ANTLR4 you'd use a visitor to evaluate your expressions instead of inside a tree grammar.

Related

Using Antlr to parse formulas with multiple locales

I'm very new to Antlr, so forgive what may be a very easy question.
I am creating a grammar which parses Excel-like formulas and it needs to support multiple locales based on the list separator (, for en-US) and decimal separator (. for en-US). I would prefer not to choose between separate grammars to parse with based on locale.
Can I modify or inherit from the CommonTokenStream class to accomplish this, or is there another way to do this? Examples would be helpful.
I am using the Antlr v4.5.0-alpha003 NuGet package in my VS2015 C# project.
What you can do is add a locale (or custom separator- and grouping-characters) to your lexer, and add a semantic predicate before the lexer rule that inspects your custom separator- and grouping-characters and match these tokens dynamically.
I don't have ANTLR and C# running here, but the Java demo should be pretty similar:
grammar LocaleDemo;
#lexer::header {
import java.text.DecimalFormatSymbols;
import java.util.Locale;
}
#lexer::members {
private char decimalSeparator = '.';
private char groupingSeparator = ',';
public LocaleDemoLexer(CharStream input, Locale locale) {
this(input);
DecimalFormatSymbols dfs = new DecimalFormatSymbols(locale);
this.decimalSeparator = dfs.getDecimalSeparator();
this.groupingSeparator = dfs.getGroupingSeparator();
}
}
parse
: .*? EOF
;
NUMBER
: D D? ( DG D D D )* ( DS D+ )?
;
OTHER
: .
;
fragment D : [0-9];
fragment DS : {_input.LA(1) == decimalSeparator}? . ;
fragment DG : {_input.LA(1) == groupingSeparator}? . ;
To test the grammar above, run this class:
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;
import java.util.Locale;
public class Main {
private static void tokenize(String input, Locale locale) {
LocaleDemoLexer lexer = new LocaleDemoLexer(new ANTLRInputStream(input), locale);
System.out.printf("\ninput='%s', locale=%s, tokens:\n", input, locale);
for (Token t : lexer.getAllTokens()) {
System.out.printf(" %-10s '%s'\n", LocaleDemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
public static void main(String[] args) throws Exception {
tokenize("1.23", Locale.ENGLISH);
tokenize("1.23", Locale.GERMAN);
tokenize("12.345.678,90", Locale.ENGLISH);
tokenize("12.345.678,90", Locale.GERMAN);
}
}
which would print:
input='1.23', locale=en, tokens:
NUMBER '1.23'
input='1.23', locale=de, tokens:
NUMBER '1'
OTHER '.'
NUMBER '23'
input='12.345.678,90', locale=en, tokens:
NUMBER '12.345'
OTHER '.'
NUMBER '67'
NUMBER '8'
OTHER ','
NUMBER '90'
input='12.345.678,90', locale=de, tokens:
NUMBER '12.345.678,90'
Related Q&A's:
What is a 'semantic predicate' in ANTLR?
What does "fragment" mean in ANTLR?
As a follow-up to Bart's answer, this is the grammar I created with his suggestions:
grammar ExcelScript;
#lexer::header
{
using System;
using System.Globalization;
}
#lexer::members
{
private Int32 listseparator = 44; // UTF16 value for comma
private Int32 decimalseparator = 46; // UTF16 value for period
/// <summary>
/// Creates a new lexer object
/// </summary>
/// <param name="input">The input stream</param>
/// <param name="locale">The locale to use in parsing numbers</param>
/// <returns>A new lexer object</returns>
public ExcelScriptLexer (ICharStream input, CultureInfo locale)
: this(input)
{
this.listseparator = Convert.ToInt32(locale.TextInfo.ListSeparator[0]);
this.decimalseparator = Convert.ToInt32(locale.NumberFormat.NumberDecimalSeparator[0]);
// special case for 8 locales where the list separator is a , and the number separator is a , too
// Excel uses semicolon for list separator, so we will too
if (this.listseparator == 44 && this.decimalseparator == 44)
this.listseparator = 59; // UTF16 value for semicolon
}
}
/*
* Parser Rules
*/
formula
: numberLiteral
| Identifier
| '=' expression
;
expression
: primary # PrimaryExpression
| Identifier arguments # FunctionCallExpression
| ('+' | '-') expression # UnarySignExpression
| expression ('*' | '/' | '%') expression # MulDivModExpression
| expression ('+' | '-') expression # AddSubExpression
| expression ('<=' | '>=' | '>' | '<') expression # CompareExpression
| expression ('=' | '<>') expression # EqualCompareExpression
;
primary
: '(' expression ')' # ParenExpression
| literal # LiteralExpression
| Identifier # IdentifierExpression
;
literal
: numberLiteral # NumberLiteralRule
| booleanLiteral # BooleanLiteralRule
;
numberLiteral
: IntegerLiteral
| FloatingPointLiteral
;
booleanLiteral
: TrueKeyword
| FalseKeyword
;
arguments
: '(' expressionList? ')'
;
expressionList
: expression (ListSeparator expression)*
;
/*
* Lexer Rules
*/
AddOperator : '+' ;
SubOperator : '-' ;
MulOperator : '*' ;
DivOperator : '/' ;
PowOperator : '^' ;
EqOperator : '=' ;
NeqOperator : '<>' ;
LeOperator : '<=' ;
GeOperator : '>=' ;
LtOperator : '<' ;
GtOperator : '>' ;
ListSeparator : {_input.La(1) == listseparator}? . ;
DecimalSeparator : {_input.La(1) == decimalseparator}? . ;
TrueKeyword : [Tt][Rr][Uu][Ee] ;
FalseKeyword : [Ff][Aa][Ll][Ss][Ee] ;
Identifier
: Letter (Letter | Digit)*
;
fragment Letter
: [A-Z_a-z]
;
fragment Digit
: [0-9]
;
IntegerLiteral
: '0'
| [1-9] [0-9]*
;
FloatingPointLiteral
: [0-9]+ DecimalSeparator [0-9]* Exponent?
| DecimalSeparator [0-9]+ Exponent?
| [0-9]+ Exponent
;
fragment Exponent
: ('e' | 'E') ('+' | '-')? ('0'..'9')+
;
WhiteSpace
: [ \t]+ -> channel(HIDDEN)
;

How to handle the precedence of assignment operator in a PHP parser?

I wrote a PHP5 parser in ANTLR 3.4, which is almost ready, but I can not handle one of the tricky feature of PHP. My problem is with the precedence of assignment operator. As the PHP manual says the precedence of assignment is almost at the end of the list. Only and, xor, or and , are after it in the list.
But there is a note on this the manual page which says:
Although = has a lower precedence than most other operators, PHP will
still allow expressions similar to the following: if (!$a = foo()), in
which case the return value of foo() is put into $a.
The small example in the note isn't a problem for my parser, I can handle this as a special case in the assigment rule.
But there are more complex codes eg:
if ($a && $b = func()) {}
My parser fails here, because it recognizes first $a && $b and can not deal with the rest of the conditioin. This is because the && has higher precedence, than =.
If I put brackets around the right side of &&:
if ($a && ($b = func())) {}
In this way the parser recognizes the structure well.
The operators are built in the way that the ANTLR book recommends: there are the base exressions at the first step and each level of operators are coming after each other.
Is there any way to handle this precedence jumping?
Don't look at it as an assignment, but let's name it an assignment expression. Put this assignment expression "below" the unary expressions (so they have a higher precedence than the unary ones):
grammar T;
options {
output=AST;
}
tokens {
BLOCK;
FUNC_CALL;
EXPR_LIST;
}
parse
: stat* EOF!
;
stat
: assignment ';'!
| if_stat
;
assignment
: Var '='^ expr
;
if_stat
: If '(' expr ')' block -> ^(If expr block)
;
block
: '{' stat* '}' -> ^(BLOCK stat*)
;
expr
: or_expr
;
or_expr
: and_expr ('||'^ and_expr)*
;
and_expr
: unary_expr ('&&'^ unary_expr)*
;
unary_expr
: '!'^ assign_expr
| '-'^ assign_expr
| assign_expr
;
assign_expr
: Var ('='^ atom)*
| atom
;
atom
: Num
| func_call
;
func_call
: Id '(' expr_list ')' -> ^(FUNC_CALL Id expr_list)
;
expr_list
: (expr (',' expr)*)? -> ^(EXPR_LIST expr*)
;
If : 'if';
Num : '0'..'9'+;
Var : '$' Id;
Id : ('a'..'z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
If you'd now parse the source:
if (!$a = foo()) { $a = 1 && 2; }
if ($a && $b = func()) { $b = 2 && 3; }
if ($a = baz() && $b) { $c = 3 && 4; }
the following AST would get constructed:

ANTLR Parse Grammar -> Tree Grammar

Our last assignment for our compiler theory class has us creating a compiler for a small subset of Java (not MiniJava). Our prof gave us the option of using whatever tools we wish, and after a lot of poking around, I settled on ANTLR. I've managed to get the scanner and the parser up and running, and the parser outputting an AST. I'm stuck now trying to get a tree grammar file to compile. I understand the basic idea is to copy the grammar rules from the parser and eliminate most of the code, leaving the rewrite rules in place, but it doesn't seem to want to compile (offendingToken error). Am I on the right track? Am I missing something trivial?
Tree Grammar:
tree grammar J0_SemanticAnalysis;
options {
language = Java;
tokenVocab = J0_Parser;
ASTLabelType = CommonTree;
}
#header
{
package ritterre.a4;
import java.util.Map;
import java.util.HashMap;
}
#members
{
}
walk
: compilationunit
;
compilationunit
: ^(UNIT importdeclaration* classdeclaration*)
;
importdeclaration
: ^(IMP_DEC IDENTIFIER+)
;
classdeclaration
: ^(CLASS IDENTIFIER ^(EXTENDS IDENTIFIER)? fielddeclaration* methoddeclaration*)
;
fielddeclaration
: ^(FIELD_DEC IDENTIFIER type visibility? STATIC?)
;
methoddeclaration
: ^(METHOD_DEC IDENTIFIER type visibility? STATIC? ^(PARAMS parameter+)? body)
;
visibility
: PRIVATE
| PUBLIC
;
parameter
: ^(PARAM IDENTIFIER type)
;
body
: ^(BODY ^(DECLARATIONS localdeclaration*) ^(STATEMENTS statement*))
;
localdeclaration
: ^(DECLARATION type IDENTIFIER)
;
statement
: assignment
| ifstatement
| whilestatement
| returnstatement
| callstatement
| printstatement
| block
;
assignment
: ^(ASSIGN IDENTIFIER+ expression? expression)
;
ifstatement
: ^(IF relation statement ^(ELSE statement)?)
;
whilestatement
: ^(WHILE relation statement)
;
returnstatement
: ^(RETURN expression?)
;
callstatement
: ^(CALL IDENTIFIER+ expression+)
;
printstatement
: ^(PRINT expression)
;
block
: ^(STATEMENTS statement*)
;
relation
// : expression (LTHAN | GTHAN | EQEQ | NEQ)^ expression
: ^(LTHAN expression expression)
| ^(GTHAN expression expression)
| ^(EQEQ expression expression)
| ^(NEQ expression expression)
;
expression
// : (PLUS | MINUS)? term ((PLUS | MINUS)^ term)*
: ^(PLUS term term)
| ^(MINUS term term)
;
term
// : factor ((MULT | DIV)^ factor)*
: ^(MULT factor factor)
| ^(DIV factor factor)
;
factor
: NUMBER
| IDENTIFIER (DOT IDENTIFIER | LBRAC expression RBRAC)?
| NULL
| NEW IDENTIFIER LPAREN RPAREN
| NEW (INT | IDENTIFIER) (LBRAC RBRAC)?
;
type
: (INT | IDENTIFIER) (LBRAC RBRAC)?
| VOID
;
Parser Grammar:
parser grammar J0_Parser;
options
{
output = AST; // Output an AST
tokenVocab = J0_Scanner; // Pull Tokens from Scanner
//greedy = true; // forcing this throughout?! success!!
//cannot force greedy true throughout. bad things happen and the parser doesnt build
}
tokens
{
UNIT;
IMP_DEC;
FIELD_DEC;
METHOD_DEC;
PARAMS;
PARAM;
BODY;
DECLARATIONS;
STATEMENTS;
DECLARATION;
ASSIGN;
CALL;
}
#header { package ritterre.a4; }
// J0 - Extended Specification - EBNF
parse
: compilationunit EOF -> compilationunit
;
compilationunit
: importdeclaration* classdeclaration*
-> ^(UNIT importdeclaration* classdeclaration*)
;
importdeclaration
: IMPORT IDENTIFIER (DOT IDENTIFIER)* SCOLON
-> ^(IMP_DEC IDENTIFIER+)
;
classdeclaration
: (PUBLIC)? CLASS n=IDENTIFIER (EXTENDS e=IDENTIFIER)? LBRAK (fielddeclaration|methoddeclaration)* RBRAK
-> ^(CLASS $n ^(EXTENDS $e)? fielddeclaration* methoddeclaration*)
;
fielddeclaration
: visibility? STATIC? type IDENTIFIER SCOLON
-> ^(FIELD_DEC IDENTIFIER type visibility? STATIC?)
;
methoddeclaration
: visibility? STATIC? type IDENTIFIER LPAREN (parameter (COMMA parameter)*)? RPAREN body
-> ^(METHOD_DEC IDENTIFIER type visibility? STATIC? ^(PARAMS parameter+)? body)
;
visibility
: PRIVATE
| PUBLIC
;
parameter
: type IDENTIFIER
-> ^(PARAM IDENTIFIER type)
;
body
: LBRAK localdeclaration* statement* RBRAK
-> ^(BODY ^(DECLARATIONS localdeclaration*) ^(STATEMENTS statement*))
;
localdeclaration
: type IDENTIFIER SCOLON
-> ^(DECLARATION type IDENTIFIER)
;
statement
: assignment
| ifstatement
| whilestatement
| returnstatement
| callstatement
| printstatement
| block
;
assignment
: IDENTIFIER (DOT IDENTIFIER | LBRAC a=expression RBRAC)? EQ b=expression SCOLON
-> ^(ASSIGN IDENTIFIER+ $a? $b)
;
ifstatement
: IF LPAREN relation RPAREN statement (options {greedy=true;} : ELSE statement)?
-> ^(IF relation statement ^(ELSE statement)?)
;
whilestatement
: WHILE LPAREN relation RPAREN statement
-> ^(WHILE relation statement)
;
returnstatement
: RETURN expression? SCOLON
-> ^(RETURN expression?)
;
callstatement
: IDENTIFIER (DOT IDENTIFIER)? LPAREN (expression (COMMA expression)*)? RPAREN SCOLON
-> ^(CALL IDENTIFIER+ expression+)
;
printstatement
: PRINT LPAREN expression RPAREN SCOLON
-> ^(PRINT expression)
;
block
: LBRAK statement* RBRAK
-> ^(STATEMENTS statement*)
;
relation
: expression (LTHAN | GTHAN | EQEQ | NEQ)^ expression
;
expression
: (PLUS | MINUS)? term ((PLUS | MINUS)^ term)*
;
term
: factor ((MULT | DIV)^ factor)*
;
factor
: NUMBER
| IDENTIFIER (DOT IDENTIFIER | LBRAC expression RBRAC)?
| NULL
| NEW IDENTIFIER LPAREN RPAREN
| NEW (INT | IDENTIFIER) (LBRAC RBRAC)?
;
type
: (INT | IDENTIFIER) (LBRAC RBRAC)?
| VOID
;
The problem is that in your tree grammar, you do the following (3 times I believe):
classdeclaration
: ^(CLASS ... ^(EXTENDS IDENTIFIER)? ... )
;
the ^(EXTENDS IDENTIFIER)? part is wrong: you need to wrap the tree around parenthesis, and only then make it optional:
classdeclaration
: ^(CLASS ... (^(EXTENDS IDENTIFIER))? ... )
;
However, it would be a bit too easy if that was all, wouldn't it? :)
When you fix the problem mentioned above, ANTLR will complain that the tree grammar is ambiguous when trying to generate a tree-walker from your tree grammar. ANTLR will throw the following towards you:
error(211): J0_SemanticAnalysis.g:61:26: [fatal] rule assignment has non-LL(*)
decision due to recursive rule invocations reachable from alts 1,2. Resolve by
left-factoring or using syntactic predicates or using backtrack=true option.
It complains about the assignment rule in your grammar:
assignment
: ^(ASSIGN IDENTIFIER+ expression? expression)
;
since ANTLR is an LL parser generator1, it parses tokens from left to right. Therefor the optional expression in IDENTIFIER+ expression? expression makes the grammar ambiguous. Fix this by moving the ? to the last expression:
assignment
: ^(ASSIGN IDENTIFIER+ expression expression?)
;
1 don't let the last two letters in the name ANTLR mislead you, they stand for Language Recognition, not the class of parsers it generates!

Solving antlr left recursion

I'm trying to parse a language using ANTLR which can contain the following syntax:
someVariable, somVariable.someMember, functionCall(param).someMember, foo.bar.baz(bjork).buffalo().xyzzy
This is the ANTLR grammar which i've come up with so far, and the access_operation throws the error
The following sets of rules are mutually left-recursive [access_operation, expression]:
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
LHS;
RHS;
CALL;
PARAMS;
}
start
: body? EOF
;
body
: expression (',' expression)*
;
expression
: function -> ^(CALL)
| access_operation
| atom
;
access_operation
: (expression -> ^(LHS)) '.'! (expression -> ^(RHS))
;
function
: (IDENT '(' body? ')') -> ^(IDENT PARAMS?)
;
atom
: IDENT
| NUMBER
;
fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT : '0'..'9';
IDENT : (LETTER)+ ;
NUMBER : (DIGIT)+ ;
SPACE : (' ' | '\t' | '\r' | '\n') { $channel=HIDDEN; };
What i could manage so far was to refactor the access_operation rule to '.' expression which generates an AST where the access_operation node only contains the right side of the operation.
What i'm looking for instead is something like this:
How can the left-recursion problem solved in this case?
By "wrong AST" I'll make a semi educated guess that, for input like "foo.bar.baz", you get an AST where foo is the root with bar as a child who in its turn has baz as a child, which is a leaf in the AST. You may want to have this reversed. But I'd not go for such an AST if I were you: I'd keep the AST as flat as possible:
foo
/ | \
/ | \
bar baz ...
That way, evaluating is far easier: you simply look up foo, and then walk from left to right through its children.
A quick demo:
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
BODY;
ACCESS;
CALL;
PARAMS;
}
start
: body EOF -> body
;
body
: expression (',' expression)* -> ^(BODY expression+)
;
expression
: atom
;
atom
: NUMBER
| (IDENT -> IDENT) ( tail -> ^(IDENT tail)
| call tail? -> ^(CALL IDENT call tail?)
)?
;
tail
: (access)+
;
access
: ('.' IDENT -> ^(ACCESS IDENT)) (call -> ^(CALL IDENT call))?
;
call
: '(' (expression (',' expression)*)? ')' -> ^(PARAMS expression*)
;
IDENT : LETTER+;
NUMBER : DIGIT+;
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT : '0'..'9';
which can be tested with:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "someVariable, somVariable.someMember, functionCall(param).someMember, " +
"foo.bar.baz(bjork).buffalo().xyzzy";
TestLexer lexer = new TestLexer(new ANTLRStringStream(src));
TestParser parser = new TestParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.start().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
The output of Main corresponds to the following AST:
EDIT
And since you indicated your ultimate goal is not evaluating the input, but that you rather need to conform the structure of the AST to some 3rd party API, here's a grammar that will create an AST like you indicated in your edited question:
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
BODY;
ACCESS_OP;
CALL;
PARAMS;
LHS;
RHS;
}
start
: body EOF -> body
;
body
: expression (',' expression)* -> ^(BODY expression+)
;
expression
: atom
;
atom
: NUMBER
| (ID -> ID) ( ('(' params ')' -> ^(CALL ID params))
('.' expression -> ^(ACCESS_OP ^(LHS ^(CALL ID params)) ^(RHS expression)))?
| '.' expression -> ^(ACCESS_OP ^(LHS ID) ^(RHS expression))
)?
;
params
: (expression (',' expression)*)? -> ^(PARAMS expression*)
;
ID : LETTER+;
NUMBER : DIGIT+;
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT : '0'..'9';
which creates the following AST if you run the Main class:
The atom rule may be a bit daunting, but you can't shorten it much since the left ID needs to be available to most of the alternatives. ANTLRWorks helps in visualizing the alternative paths this rule may take:
which means atom can be any of the 5 following alternatives (with their corresponding AST's):
+----------------------+--------------------------------------------------------+
| alternative | generated AST |
+----------------------+--------------------------------------------------------+
| NUMBER | NUMBER |
| ID | ID |
| ID params | ^(CALL ID params) |
| ID params expression | ^(ACCESS_OP ^(LHS ^(CALL ID params)) ^(RHS expression))|
| ID expression | ^(ACCESS_OP ^(LHS ID) ^(RHS expression) |
+----------------------+--------------------------------------------------------+

ANTLR expression interpreter

I have created the following grammar: I would like some idea how to build an interpreter that returns a tree in java, which I can later use for printing in the screen, Im bit stack on how to start on it.
grammar myDSL;
options {
language = Java;
}
#header {
package DSL;
}
#lexer::header {
package DSL;
}
program
: IDENT '={' components* '}'
;
components
: IDENT '=('(shape)(shape|connectors)* ')'
;
shape
: 'Box' '(' (INTEGER ','?)* ')'
| 'Cylinder' '(' (INTEGER ','?)* ')'
| 'Sphere' '(' (INTEGER ','?)* ')'
;
connectors
: type '(' (INTEGER ','?)* ')'
;
type
: 'MG'
| 'EL'
;
IDENT: ('a'..'z' | 'A'..'Z')('a'..'z' | 'A'..'Z' | '0'..'0')*;
INTEGER: '0'..'9'+;
// This if for the empty spaces between tokens and avoids them in the parser
WS: (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel=HIDDEN;};
COMMENT: '//' .* ('\n' | '\r') {$channel=HIDDEN;};
A couple of remarks:
There's no need to set the language for Java, which is the default target language. So you can remove this:
options {
language = Java;
}
Your IDENT contains an error:
IDENT: ('a'..'z' | 'A'..'Z')('a'..'z' | 'A'..'Z' | '0'..'0')*;
the '0'..'0') should most probably be '0'..'9').
The sub rule (INTEGER ','?)* also matches source like 1 2 3 4 (no comma's at all!). Perhaps you meant to do: (INTEGER (',' INTEGER)*)?
Now, as to your question: how to let ANTLR construct a proper AST? This can be done by adding output = AST; in your options block:
options {
//language = Java;
output = AST;
}
And then either adding the "tree operators" ^ and ! in your parser rules, or by using tree rewrite rules: rule: a b c -> ^(c b a).
The "tree operator" ^ is used to define the root of the (sub) tree and ! is used to exclude a token from the (sub) tree.
Rewrite rules have ^( /* tokens here */ ) where the first token (right after ^() is the root of the (sub) tree, and all following tokens are child nodes of the root.
An example might be in order. Let's take your first rule:
program
: IDENT '={' components* '}'
;
and you want to let IDENT be the root, components* the children and you want to exclude ={ and } from the tree. You can do that by doing:
program
: IDENT^ '={'! components* '}'!
;
or by doing:
program
: IDENT '={' components* '}' -> ^(IDENT components*)
;