ANTLR 4: taking decisions from a parse tree

I need to make some decisions based on the structure and information in a parse tree. This is an example of the trees I am generating now:
The code-generation decisions depend on the operator (";", "AND", "OR", "XOR") between two workflows. For instance, the code I need to generate from this tree is:
mustPrecede(T6,T4) AND mustPrecede(T6,T1)
AND mustPrecede(T4,T5) AND mustPrecede(T1,T5)
To do this, I first need to find out that the operator between T6 and (T4 AND T1) is ";" (the sequential composition operator). Then I need to find out that the operator between T4 and T1 is "AND", and finally I need to get T4 and T1 so that I can relate each of them to T5. My question is: how can I encode this in a parser?
This is my grammar definition
grammar Hello;
execution: workflow EOF;
workflow : Task
| workflow OPERATOR workflow
|'(' workflow (OPERATOR workflow)+ ')'
;
Task : 'T' ('0'..'9')+
| 'WF' ('0'..'9')+
;
OPERATOR: 'AND'
| 'OR'
| 'XOR'
| ';'
;
WS : [ \t\n\r]+ -> channel(HIDDEN) ;

You created one token, OPERATOR, which represents all of your operators. This makes it very difficult to distinguish between the different operators. An easier set of rules would be the following:
operator
: AND
| OR
| XOR
| SEMI
;
AND : 'AND';
OR : 'OR';
XOR : 'XOR';
SEMI : ';';
You would also replace references to OPERATOR with references to operator. Then, in your implementation of a listener or visitor, you could create methods like the following (using an example from a listener).
@Override
public void enterWorkflow(WorkflowContext ctx) {
    List<? extends OperatorContext> operatorContexts = ctx.operator();
    if (operatorContexts.isEmpty()) {
        // handle just a Task
    } else {
        for (OperatorContext operatorContext : operatorContexts) {
            // each operator rule matches exactly one token, so its start token identifies it
            switch (operatorContext.getStart().getType()) {
                case HelloLexer.AND:
                    // handle 'AND'
                    break;
                case HelloLexer.OR:
                    // handle 'OR'
                    break;
                case HelloLexer.XOR:
                    // handle 'XOR'
                    break;
                case HelloLexer.SEMI:
                    // handle ';'
                    break;
                default:
                    throw new IllegalStateException("Unrecognized operator.");
            }
        }
    }
}
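Once the operators can be told apart, the constraint generation described in the question could be sketched in plain Java, independent of ANTLR. This is only an illustration under several assumptions: the input is taken to be T6 ; (T4 AND T1) ; T5, the Node, Task and And types and the constraints method are hypothetical names standing in for whatever a listener would build from WorkflowContext nodes, and each step of a sequence is assumed to be a single task or an AND group of tasks.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PrecedenceSketch {
    /** Hypothetical mini-AST; a real listener would build it from WorkflowContext nodes. */
    public interface Node {
        List<String> tasks();
    }

    public static final class Task implements Node {
        private final String name;
        public Task(String name) { this.name = name; }
        @Override public List<String> tasks() { return Collections.singletonList(name); }
    }

    /** Workflows joined by AND: both sides sit between the neighbouring sequence steps. */
    public static final class And implements Node {
        private final Node left;
        private final Node right;
        public And(Node left, Node right) { this.left = left; this.right = right; }
        @Override public List<String> tasks() {
            List<String> all = new ArrayList<>(left.tasks());
            all.addAll(right.tasks());
            return all;
        }
    }

    /** For steps joined by ';', every task of one step must precede every task of the next. */
    public static List<String> constraints(Node... steps) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < steps.length; i++) {
            for (String before : steps[i].tasks()) {
                for (String after : steps[i + 1].tasks()) {
                    out.add("mustPrecede(" + before + "," + after + ")");
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> result = constraints(
                new Task("T6"),
                new And(new Task("T4"), new Task("T1")),
                new Task("T5"));
        System.out.println(String.join(" AND ", result));
    }
}
```

Nested sequences inside an AND group would need separate "first" and "last" task sets per node; the sketch leaves that out to stay short.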

Antlr: how to switch on token type in Visitor implementation

I'm playing around with ANTLR, designing a toy language, which I think is where most people start! I have a question on how best to think about switching on token type.
Consider a 'function call' in the language, where a function can consume a string, a number or a variable, for example like the below (project() is the function call):
project("ABC") vs project(123) vs project($SOME_VARIABLE)
I have the alternation operator in my grammar, so the grammar parses the right thing, but in the visitor code, it would be nice to tell the difference between the three versions of the above.
@Override
public ASTRoot visitCreateproj(projectmgmtParser.CreateprojContext ctx) {
    try {
        s1 = ctx.STRING_LITERAL().getText();
    } catch (Exception e) {}
    try {
        s2 = ctx.NUM().getText();
    } catch (Exception e) {}
    System.out.println("Created Project via => " + ctx.getChild(1).toString());
}
The code above works: depending on whether s1 or s2 is null, I can infer how I was called (with a literal or a number; I haven't shown the variable case above). However, I'm interested in whether there is a better or more elegant way, for example switching on the token type inside the visitor code to actually process the language.
The grammar I had for the above was
createproj: 'project('WS?(STRING_LITERAL|NUM)')';
When I use the IntelliJ ANTLR plugin, it seems to know the token type of the argument to the project() function, but I don't seem to be able to get to it from my code.
You could do something like this:
createproj
: 'project' '(' WS? param ')'
;
param
: STRING_LITERAL
| NUM
;
and in your visitor code:
@Override
public ASTRoot visitCreateproj(projectmgmtParser.CreateprojContext ctx) {
    switch (ctx.param().start.getType()) {
        case YourLexerName.STRING_LITERAL:
            ...
        case YourLexerName.NUM:
            ...
        ...
    }
}
So by inlining the token in the grammar I had originally, I've lost the opportunity to inspect it in the visitor code?
Not really, you could also do it like this:
createproj
: 'project' '(' WS? param_token=(STRING_LITERAL | NUM) ')'
;
and could then do this:
@Override
public ASTRoot visitCreateproj(projectmgmtParser.CreateprojContext ctx) {
    switch (ctx.param_token.getType()) {
        case YourLexerName.STRING_LITERAL:
            ...
        case YourLexerName.NUM:
            ...
        ...
    }
}
Just make sure you don't mix lexer rules (tokens) and parser rules in your set param_token=( ... ). When it's a parser rule, ctx.param_token.getType() will fail (it must then be ctx.param_token.start.getType()). That is why I recommended adding an extra parser rule, because this would then still work:
param
: STRING_LITERAL
| NUM
| some_parser_rule
;

How can I build an ANTLR Works style parse tree?

I've read that you need to use the '^' and '!' operators in order to build a parse tree similar to the ones displayed in ANTLR Works (even though you don't need to use them to get a nice tree in ANTLR Works). My question then is how can I build such a tree? I've seen a few pages on tree construction using the two operators and rewrites, and yet say I have an input string abc abc123 and a grammar:
grammar test;
program : idList;
idList : id* ;
id : ID ;
ID : LETTER (LETTER | NUMBER)* ;
LETTER : 'a' .. 'z' | 'A' .. 'Z' ;
NUMBER : '0' .. '9' ;
ANTLR Works will output:
What I don't understand is how you can get the 'idList' node on top of this tree (as well as the grammar one, as a matter of fact). How can I reproduce this tree using rewrites and those operators?
You can't use ^ and ! alone. These operators only operate on existing tokens, while you want to create extra tokens (and make these the root of your sub trees). You can do that using rewrite rules and defining some imaginary tokens.
A quick demo:
grammar test;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
IdList;
Id;
}
@parser::members {
    // prints the tree, one node per line, indented by depth
    private static void walk(CommonTree tree, int indent) {
        if(tree == null) return;
        for(int i = 0; i < indent; i++, System.out.print(" "));
        System.out.println(tree.getText());
        for(int i = 0; i < tree.getChildCount(); i++) {
            walk((CommonTree)tree.getChild(i), indent + 1);
        }
    }
    public static void main(String[] args) throws Exception {
        testLexer lexer = new testLexer(new ANTLRStringStream("abc abc123"));
        testParser parser = new testParser(new CommonTokenStream(lexer));
        walk((CommonTree)parser.program().getTree(), 0);
    }
}
program : idList EOF -> idList;
idList : id* -> ^(IdList id*);
id : ID -> ^(Id ID);
ID : LETTER (LETTER | DIGIT)*;
SPACE : ' ' {skip();};
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
fragment DIGIT : '0' .. '9';
If you run the demo above, you will see the following being printed to the console:
IdList
 Id
  abc
 Id
  abc123
As you can see, imaginary tokens must also start with an upper case letter, just like lexer rules. If you want to give the imaginary tokens the same text as the parser rule they represent, do something like this instead:
idList : id* -> ^(IdList["idList"] id*);
id : ID -> ^(Id["id"] ID);
which will print:
idList
 id
  abc
 id
  abc123

ANTLR: Is it possible to make a grammar with an embedded grammar inside?

ANTLR: Is it possible to make a grammar with an embedded grammar (with its own lexer) inside?
For example, in my language I have the ability to use embedded SQL:
var Query = [select * from table];
with Query do something ....;
Is it possible with ANTLR?
Is it possible to make a grammar with an embedded grammar (with its own lexer) inside?
If you mean whether it is possible to define two languages in a single grammar (using separate lexers), then the answer is: no, that's not possible.
However, if the question is whether it is possible to parse two languages into a single AST, then the answer is: yes, it is possible.
You simply need to:
define both languages in their own grammar;
create a lexer rule in your main grammar that captures the entire input of the embedded language;
use a rewrite rule that calls a custom method that parses the external AST and inserts it in the main AST using { ... } (see the expr rule in the main grammar (MyLanguage.g)).
MyLanguage.g
grammar MyLanguage;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
ROOT;
}
@members {
    // parses the embedded SQL with its own lexer/parser and returns the resulting sub-tree
    private CommonTree parseSQL(String sqlSrc) {
        try {
            MiniSQLLexer lexer = new MiniSQLLexer(new ANTLRStringStream(sqlSrc));
            MiniSQLParser parser = new MiniSQLParser(new CommonTokenStream(lexer));
            return (CommonTree)parser.parse().getTree();
        } catch(Exception e) {
            // on failure, return a single node carrying the error message
            return new CommonTree(new CommonToken(-1, e.getMessage()));
        }
    }
}
parse
: assignment+ EOF -> ^(ROOT assignment+)
;
assignment
: Var Id '=' expr ';' -> ^('=' Id expr)
;
expr
: Num
| SQL -> {parseSQL($SQL.text)}
;
Var : 'var';
Id : ('a'..'z' | 'A'..'Z')+;
Num : '0'..'9'+;
SQL : '[' ~']'* ']';
Space : ' ' {skip();};
MiniSQL.g
grammar MiniSQL;
options {
output=AST;
ASTLabelType=CommonTree;
}
parse
: '[' statement ']' EOF -> statement
;
statement
: select
;
select
: Select '*' From ID -> ^(Select '*' From ID)
;
Select : 'select';
From : 'from';
ID : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
Main.java
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "var Query = [select * from table]; var x = 42;";
MyLanguageLexer lexer = new MyLanguageLexer(new ANTLRStringStream(src));
MyLanguageParser parser = new MyLanguageParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
Run the demo
java -cp antlr-3.3.jar org.antlr.Tool MiniSQL.g
java -cp antlr-3.3.jar org.antlr.Tool MyLanguage.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
Given the input:
var Query = [select * from table]; var x = 42;
the output of the Main class corresponds to the following AST:
And if you want to allow string literals inside your SQL (which could contain ]), and comments (which could contain ' and ]), then you could use the following SQL rule inside your main grammar:
SQL
: '[' ( ~(']' | '\'' | '-')
| '-' ~'-'
| COMMENT
| STR
)*
']'
;
fragment STR
: '\'' (~('\'' | '\r' | '\n') | '\'\'')+ '\''
| '\'\''
;
fragment COMMENT
: '--' ~('\r' | '\n')*
;
which would properly parse the following input in a single token:
[
select a,b,c
from table
where a='A''B]C'
and b='' -- some ] comment ] here'
]
Just beware that trying to create a grammar for an entire SQL dialect (or even a large subset) is no trivial task! You may want to search for existing SQL parsers, or look at the ANTLR wiki for example-grammars.
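To make the behaviour of that SQL lexer rule more concrete, here is a rough hand-written equivalent in plain Java. It is only a sketch: the class and method names (SqlSpanSketch, findClosingBracket) are made up for this illustration, and unlike the STR fragment above it does not reject line breaks inside string literals.

```java
public class SqlSpanSketch {
    /**
     * Returns the index of the ']' that closes the '[' at position open,
     * skipping ']' inside '...' strings and '--' line comments.
     * Returns -1 if the block is never closed.
     */
    public static int findClosingBracket(String src, int open) {
        int n = src.length();
        int i = open + 1;
        while (i < n) {
            char c = src.charAt(i);
            if (c == ']') {
                return i;
            } else if (c == '\'') {
                i++; // skip the opening quote
                while (i < n) {
                    if (src.charAt(i) == '\'') {
                        if (i + 1 < n && src.charAt(i + 1) == '\'') {
                            i += 2; // a doubled quote is an escaped quote
                            continue;
                        }
                        break; // closing quote found
                    }
                    i++;
                }
                i++; // skip the closing quote
            } else if (c == '-' && i + 1 < n && src.charAt(i + 1) == '-') {
                // line comment: skip to the end of the line
                while (i < n && src.charAt(i) != '\r' && src.charAt(i) != '\n') {
                    i++;
                }
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String src = "var q = [select a from t where a='A''B]C' -- some ] comment\n];";
        int open = src.indexOf('[');
        int close = findClosingBracket(src, open);
        System.out.println(src.substring(open, close + 1));
    }
}
```

The main method prints the whole bracketed block, including the ] characters that appear inside the string literal and the comment.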
Yes, in ANTLR this is called an island grammar.
You can find a working example in the v3 examples, inside the island-grammar folder: it shows how a grammar can parse Javadoc comments inside Java code.
You can also find some clues in the document "Island Grammars Under Parser Control" and in another one.

What's the correct way to add new tokens (rewrite) to create AST nodes that are not on the input stream

I've a pretty basic math expression grammar for ANTLR here and what's of interest is handling the implied * operator between parentheses e.g. (2-3)(4+5)(6*7) should actually be (2-3)*(4+5)*(6*7).
Given the input (2-3)(4+5)(6*7), I'm trying to add the missing * operator to the AST while parsing. I think I've managed to achieve that in the following grammar, but I'm wondering whether this is the correct, most elegant way:
grammar G;
options {
language = Java;
output=AST;
ASTLabelType=CommonTree;
}
tokens {
ADD = '+' ;
SUB = '-' ;
MUL = '*' ;
DIV = '/' ;
OPARN = '(' ;
CPARN = ')' ;
}
start
: expression EOF!
;
expression
: mult (( ADD^ | SUB^ ) mult)*
;
mult
: atom (( MUL^ | DIV^) atom)*
;
atom
: INTEGER
| (
OPARN expression CPARN -> expression
)
(
OPARN expression CPARN -> ^(MUL expression)+
)*
;
INTEGER : ('0'..'9')+ ;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel = HIDDEN;};
This grammar appears to output the correct AST Tree in ANTLRworks:
I'm only just starting to get to grips with parsing and ANTLR and don't have much experience, so feedback is really appreciated!
Thanks in advance! Carl
First of all, you did a great job given the fact that you've never used ANTLR before.
You can omit the language=Java and ASTLabelType=CommonTree, which are the default values. So you can just do:
options {
output=AST;
}
Also, you don't have to specify the root node for each operator separately. So you don't have to do:
(ADD^ | SUB^)
but the following:
(ADD | SUB)^
will suffice. With only two operators, there's not much difference, but when implementing relational operators (>=, <=, > and <), the latter is a bit easier.
Now, for your AST: you'll probably want to create a binary tree. That way, all internal nodes are operators and the leaves are operands, which makes the actual evaluation of your expressions much easier. To get a binary tree, you'll have to change your atom rule slightly:
atom
: INTEGER
| (
OPARN expression CPARN -> expression
)
(
OPARN e=expression CPARN -> ^(MUL $atom $e)
)*
;
which produces the following AST given the input "(2-3)(4+5)(6*7)":
(image produced by: graphviz-dev.appspot.com)
The DOT file was generated with the following test-class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
GLexer lexer = new GLexer(new ANTLRStringStream("(2-3)(4+5)(6*7)"));
GParser parser = new GParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.start().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
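For readers who want to see the shape of that binary tree without running ANTLR, the same idea can be sketched as a small hand-written recursive-descent parser in plain Java. This is an illustration only, not the generated parser: the class name ImpliedMul and the toSExpr method are hypothetical, and the tree is rendered as an s-expression string instead of a CommonTree.

```java
public class ImpliedMul {
    private final String src;
    private int pos = 0;

    private ImpliedMul(String src) {
        this.src = src.replaceAll("\\s+", ""); // whitespace is ignored, as in the WS rule
    }

    /** Parses the input and returns the tree as an s-expression. */
    public static String toSExpr(String input) {
        ImpliedMul p = new ImpliedMul(input);
        String tree = p.expression();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("Unexpected input at position " + p.pos);
        }
        return tree;
    }

    private char peek() {
        return pos < src.length() ? src.charAt(pos) : '\0';
    }

    private void expect(char c) {
        if (peek() != c) {
            throw new IllegalArgumentException("Expected '" + c + "' at position " + pos);
        }
        pos++;
    }

    // expression : mult (('+' | '-') mult)*
    private String expression() {
        String left = mult();
        while (peek() == '+' || peek() == '-') {
            char op = src.charAt(pos++);
            left = "(" + op + " " + left + " " + mult() + ")";
        }
        return left;
    }

    // mult : atom (('*' | '/') atom)*
    private String mult() {
        String left = atom();
        while (peek() == '*' || peek() == '/') {
            char op = src.charAt(pos++);
            left = "(" + op + " " + left + " " + atom() + ")";
        }
        return left;
    }

    // atom : INTEGER | '(' expression ')' ('(' expression ')')*
    private String atom() {
        if (Character.isDigit(peek())) {
            int start = pos;
            while (Character.isDigit(peek())) {
                pos++;
            }
            return src.substring(start, pos);
        }
        String left = group();
        while (peek() == '(') {
            // implied '*': the tree built so far becomes the left operand
            left = "(* " + left + " " + group() + ")";
        }
        return left;
    }

    private String group() {
        expect('(');
        String inner = expression();
        expect(')');
        return inner;
    }

    public static void main(String[] args) {
        System.out.println(toSExpr("(2-3)(4+5)(6*7)"));
    }
}
```

The loop in atom() plays the role of the ^(MUL $atom $e) rewrite: each additional parenthesised group wraps the tree built so far as the left child of a new multiplication node.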

ANTLR Grammar gives me upsidedown-tree

I have a grammar which parses dot notation expressions like this:
a.b.c
memberExpression returns [Expression value]
: i=ID { $value = ParameterExpression($i.value); }
('.' m=memberExpression { $value = MemberExpression($m.value, $i.value); }
)*
;
This parses expressions fine and gives me a tree structure like this:
MemberExpression(
MemberExpression(
ParameterExpression("c"),
"b"
)
, "a"
)
But my problem is that I want a tree that looks like this:
MemberExpression(
MemberExpression(
ParameterExpression("a"),
"b"
)
, "c"
)
for the same expression "a.b.c"
How can I achieve this?
You could do this by collecting all tokens in a java.util.List using ANTLR's convenience += operator and creating the desired tree with a custom method in your @parser::members section:
// grammar def ...
// options ...
@parser::members {
    private Expression customTree(List tks) {
        // `tks` is a java.util.List containing `CommonToken` objects;
        // build the left-nested tree from them here and return it
    }
}
// parser ...
memberExpression returns [Expression value]
: ids+=ID ('.' ids+=ID)* { $value = customTree($ids); }
;
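The body of customTree is left open above. One way it might look is a left fold over the collected identifiers, sketched here in plain Java. The Expression, ParameterExpression and MemberExpression classes below are stand-ins written for this example (the question does not show the real ones), and the list holds plain strings, whereas inside the grammar it would hold CommonToken objects whose text comes from getText().

```java
import java.util.Arrays;
import java.util.List;

public class MemberChain {
    /** Stand-in types; the real Expression classes from the question are not shown. */
    public interface Expression {}

    public static final class ParameterExpression implements Expression {
        private final String name;
        public ParameterExpression(String name) { this.name = name; }
        @Override public String toString() {
            return "ParameterExpression(\"" + name + "\")";
        }
    }

    public static final class MemberExpression implements Expression {
        private final Expression target;
        private final String member;
        public MemberExpression(Expression target, String member) {
            this.target = target;
            this.member = member;
        }
        @Override public String toString() {
            return "MemberExpression(" + target + ", \"" + member + "\")";
        }
    }

    /** Folds the identifiers from left to right, so the first one ends up innermost. */
    public static Expression customTree(List<String> ids) {
        if (ids.isEmpty()) {
            throw new IllegalArgumentException("At least one identifier is required");
        }
        Expression result = new ParameterExpression(ids.get(0));
        for (String id : ids.subList(1, ids.size())) {
            result = new MemberExpression(result, id);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(customTree(Arrays.asList("a", "b", "c")));
    }
}
```

Because each step wraps the previous result, "a" becomes the innermost ParameterExpression and "c" the outermost member, which is the nesting the question asks for.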
I think what you are asking for is left recursive, and therefore ANTLR is not a good choice to parse it.
To elaborate, you need c at the root of the tree, and therefore your rule would be:
rule: rule ID;
This rule will be uncertain whether it should match
a.b
or
a.b.c