Antlr3 Non-recursive / leftfactorized grammar for expressions with a nice AST


I have defined the following production rules for expressions. The grammar is not allowed to use backtracking, and k must be better than or equal to 3. The current version seems to have some ambiguity, but I can't figure out where. I've removed the AST rules here, but the grammar is supposed to create a nice AST in which operations are nested according to their priority and which reflects the left associativity of the operations.
ANTLR 3.2.1 with ANTLRWorks 1.5.1
disjunctionExpresion
: (conjunctionExpresion Disjunction conjunctionExpresion disjunctionExpresionDash) | conjunctionExpresion;
disjunctionExpresionDash
: (Disjunction conjunctionExpresion disjunctionExpresionDash) |;
conjunctionExpresion
: (relationalExpresion Conjunction relationalExpresion conjunctionExpresionDash) | relationalExpresion;
conjunctionExpresionDash
: (Conjunction relationalExpresion conjunctionExpresionDash)|;
relationalExpresion
: (addExpresion RelationalOperator addExpresion relationalExpresionDash) | addExpresion;
relationalExpresionDash
: (RelationalOperator addExpresion relationalExpresionDash)|;
addExpresion
: (multiExpresion addOperator multiExpresion addExpresionDash)| multiExpresion;
addExpresionDash
: (addOperator multiExpresion addExpresionDash)|;
multiExpresion
: (unaryExpresion MultiOperator unaryExpresion multiExpresionDash) | unaryExpresion;
multiExpresionDash
: (MultiOperator unaryExpresion multiExpresionDash) | ;
unaryExpresion
: (unaryOperator basicExpr)->^(unaryOperator basicExpr) | basicExpr -> basicExpr;
basicExpr
: Number | var basicExprDash | ('(' expr ')')->expr;
basicExprDash
: 'touches' var | ;

With @kaby76's hint of using EBNF and looking up example grammars, I have ended up with something similar to this C++ ANTLR3 example. The ANTLR3 documentation for tree construction was also very helpful.
The "^" quick operator allowed me to create the desired AST.
expr : disjunctionExpression;
disjunctionExpression
: conjunctionExpression (Disjunction^ conjunctionExpression)*;
/*equal to:
disjunctionExpression
: (a=conjunctionExpression->$a) (o=Disjunction b=conjunctionExpression -> ^($o $disjunctionExpression $b) )*;
*/
conjunctionExpression
: relationalExpression (Conjunction^ relationalExpression)*;
relationalExpression
: additiveExpression (relationalOperator^ additiveExpression)*;
additiveExpression
: multiExpression (addOperator^ multiExpression)*;
multiExpression
: unaryExpression (multiOperator^ unaryExpression)*;
unaryExpression
: (unaryOperator^)? basicExpr;
basicExpr
: Number | var ('touches'^ var)? | '(' expr ')' -> expr;
Which is actually the compact version of what I had before:
expr : disjunctionExpresion;
disjunctionExpresion
: conjunctionExpresion disjunctionExpresionDash;
disjunctionExpresionDash
: (Disjunction conjunctionExpresion disjunctionExpresionDash) |;
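To make the effect of the loop-plus-"^" pattern concrete, here is a minimal hand-written sketch in plain Java. It is not ANTLR-generated code, and the class name LeftAssocSketch and its methods are made up for illustration. It only covers two precedence levels plus parentheses, and shows how making each operator the new root inside the loop yields a left-associative tree, printed in prefix form.

```java
/** Hand-written sketch of the pattern `operand (OP^ operand)*`; hypothetical, not ANTLR output. */
final class LeftAssocSketch {
    private final String src;
    private int pos = 0;

    private LeftAssocSketch(String src) {
        this.src = src.replace(" ", "");
    }

    /** Parses the whole input and returns the tree in prefix form, e.g. (+ 1 2). */
    static String parse(String input) {
        LeftAssocSketch p = new LeftAssocSketch(input);
        String tree = p.additive();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return tree;
    }

    private boolean at(char c) {
        return pos < src.length() && src.charAt(pos) == c;
    }

    // additive : multi (('+' | '-')^ multi)*
    private String additive() {
        String left = multi();
        while (at('+') || at('-')) {
            char op = src.charAt(pos++);
            String right = multi();
            // The operator becomes the new root; the tree built so far is its left child.
            left = "(" + op + " " + left + " " + right + ")";
        }
        return left;
    }

    // multi : basic (('*' | '/')^ basic)*
    private String multi() {
        String left = basic();
        while (at('*') || at('/')) {
            char op = src.charAt(pos++);
            String right = basic();
            left = "(" + op + " " + left + " " + right + ")";
        }
        return left;
    }

    // basic : Number | '(' additive ')' -> additive
    private String basic() {
        if (at('(')) {
            pos++;
            String inner = additive();
            if (!at(')')) {
                throw new IllegalArgumentException("')' expected at " + pos);
            }
            pos++;
            return inner;
        }
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("number expected at " + pos);
        }
        return src.substring(start, pos);
    }

    public static void main(String[] args) {
        System.out.println(parse("1 - 2 - 3")); // prints (- (- 1 2) 3)
    }
}
```

Each precedence level is one method, and the lower-priority level calls the higher-priority one, which mirrors the chain of rules in the grammar.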

Related

ANTLR3 - Decision can match input using multiple alternatives

When running ANTLR3 on the following code, I get the message - warning(200): MYGRAMMAR.g:40:36: Decision can match input such as "QMARK" using multiple alternatives: 3, 4
As a result, alternative(s) 4 were disabled for that input.
The warning message is pointing me to postfixExpr. Is there a way to fix this?
grammar MYGRAMMAR;
options {language = C;}
tokens {
BANG = '!';
COLON = ':';
FALSE_LITERAL = 'false';
GREATER = '>';
LSHIFT = '<<';
MINUS = '-';
MINUS_MINUS = '--';
PLUS = '+';
PLUS_PLUS = '++';
QMARK = '?';
QMARK_COLON = '?:';
TILDE = '~';
TRUE_LITERAL = 'true';
}
condExpr
: shiftExpr (QMARK condExpr COLON condExpr)? ;
shiftExpr
: addExpr ( shiftOp addExpr)* ;
addExpr
: qmarkColonExpr ( addOp qmarkColonExpr)* ;
qmarkColonExpr
: prefixExpr ( QMARK_COLON prefixExpr )? ;
prefixExpr
: ( prefixOrUnaryMinus | postfixExpr) ;
prefixOrUnaryMinus
: prefixOp prefixExpr ;
postfixExpr
: primaryExpr ( postfixOp | BANG | QMARK )*;
primaryExpr
: literal ;
shiftOp
: ( LSHIFT | rShift);
addOp
: (PLUS | MINUS);
prefixOp
: ( BANG | MINUS | TILDE | PLUS_PLUS | MINUS_MINUS );
postfixOp
: (PLUS_PLUS | MINUS_MINUS);
rShift
: (GREATER GREATER)=> a=GREATER b=GREATER {assertNoSpace($a,$b)}? ;
literal
: ( TRUE_LITERAL | FALSE_LITERAL );
assertNoSpace [pANTLR3_COMMON_TOKEN t1, pANTLR3_COMMON_TOKEN t2]
: { $t1->line == $t2->line && $t1->getCharPositionInLine($t1) + 1 == $t2->getCharPositionInLine($t2) }? ;

I think one problem is that PLUS_PLUS and MINUS_MINUS will never be matched, as they are defined after the respective PLUS and MINUS tokens. Therefore the lexer will always output two PLUS tokens instead of one PLUS_PLUS token.
To avoid this, you have to define your PLUS_PLUS or MINUS_MINUS token before the PLUS or MINUS token, as the lexer processes them in the order they are defined and won't look any further once it has found a way to match the current input.
The same problem applies to QMARK_COLON, as it is defined after QMARK (this is only a problem because there is another token type, COLON, to match the following colon).
See if fixing the ambiguities resolves the error message.
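The ordering concern can be illustrated with a small hand-written first-match scanner in plain Java. This is only a sketch of a scanner where order does matter; it does not show how ANTLR's generated lexer works, and whether the generated lexer really depends on declaration order here is worth checking against the generated code. The class name LongestFirstLexerSketch and the token table are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** First-match scanner sketch; hypothetical, not ANTLR's lexer. */
final class LongestFirstLexerSketch {
    // Longer literals are listed before their prefixes, so "++" is tried before "+".
    private static final String[][] TOKENS = {
        {"++", "PLUS_PLUS"},
        {"--", "MINUS_MINUS"},
        {"?:", "QMARK_COLON"},
        {"+", "PLUS"},
        {"-", "MINUS"},
        {"?", "QMARK"},
        {":", "COLON"},
    };

    /** Returns the token names found in the input, skipping whitespace. */
    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++;
                continue;
            }
            boolean matched = false;
            for (String[] token : TOKENS) {
                // The first entry that matches wins; later entries are never considered.
                if (input.startsWith(token[0], pos)) {
                    out.add(token[1]);
                    pos += token[0].length();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                throw new IllegalArgumentException("no token matches at " + pos);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("+++")); // prints [PLUS_PLUS, PLUS]
    }
}
```

If the entries for "+" and "++" were swapped in the table, this scanner would return three PLUS tokens for the same input.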

ANTLR Parser, need to know which parser rule is matched

In ANTLR, for a given token, is there a way to tell which parser rule is matched?
For example, from the ANTLR grammar:
tokens
{
ADD='Add';
SUB='Sub';
}
fragment
ANYDIGIT : '0'..'9';
fragment
UCASECHAR : 'A'..'Z';
fragment
LCASECHAR : 'a'..'z';
fragment
DATEPART : ('0'..'1') (ANYDIGIT) '/' ('0'..'3') (ANYDIGIT) '/' (ANYDIGIT) (ANYDIGIT) (ANYDIGIT) (ANYDIGIT);
fragment
TIMEPART : ('0'..'2') (ANYDIGIT) ':' ('0'..'5') (ANYDIGIT) ':' ('0'..'5') (ANYDIGIT);
SPACE : ' ';
NEWLINE : '\r'? '\n';
TAB : '\t';
FORMFEED : '\f';
WS : (SPACE|NEWLINE|TAB|FORMFEED)+ {$channel=HIDDEN;};
IDENTIFIER : (LCASECHAR|UCASECHAR|'_') (LCASECHAR|UCASECHAR|ANYDIGIT|'_')*;
TIME : '\'' (TIMEPART) '\'';
DATE : '\'' (DATEPART) (' ' (TIMEPART))? '\'';
STRING : '\''! (.)* '\''!;
DOUBLE : (ANYDIGIT)+ '.' (ANYDIGIT)+;
INT : (ANYDIGIT)+;
literal : INT|DOUBLE|STRING|DATE|TIME;
var : IDENTIFIER;
param : literal|fcn_call|var;
fcn_name : ADD |
SUB |
DIVIDE |
MOD |
DTSECONDSBETWEEN |
DTGETCURRENTDATETIME |
APPEND |
STRINGTOFLOAT;
fcn_call : fcn_name WS? '('! WS? ( param WS? ( ','! WS? param)*)* ')'!;
expr : fcn_call WS? EOF;
And in Java:
CommonTreeNodeStream nodes = new CommonTreeNodeStream(tree);
nodes.reset();
Object obj;
while((obj = nodes.nextElement()) != null)
{
if(nodes.isEOF(obj))
{
break;
}
System.out.println(obj);
}
So what I want to know is: at System.out.println(obj), did the node match the fcn_name rule or the var rule?
The reason being, I am trying to handle vars differently than fcn_names.

Add this to your listener/visitor:
String[] ruleNames;
public void loadParser(gramParser parser) { //get parser
ruleNames = parser.getRuleNames(); //load parser rules from parser
}
Call loadParser() from wherever you create your listener/visitor, e.g.:
MyParser parser = new MyParser(tokens);
MyListener listener = new MyListener();
listener.loadParser(parser); //so we can access rule names
Then inside each rule you can get the name of the rule like this:
ruleName = ruleNames[ctx.getRuleIndex()];

No, you cannot get the name of a parser rule (at least, not without an ugly hack ➊).
But if tree is an instance of CommonTree, it means you've already invoked the expr rule of your parser, which means you already know expr matches first (which in its turn matches fcn_name).
➊ On a related note, see: Get active Antlr rule

ANTLR Is it possible to make grammar with embed grammar inside?

ANTLR: Is it possible to write a grammar with an embedded grammar (with its own lexer) inside?
For example, my language allows embedded SQL:
var Query = [select * from table];
with Query do something ....;
Is it possible with ANTLR?

Is it possible to make grammar with embed grammar (with its own lexer) inside?
If you mean whether it is possible to define two languages in a single grammar (using separate lexers), then the answer is: no, that's not possible.
However, if the question is whether it is possible to parse two languages into a single AST, then the answer is: yes, it is possible.
You simply need to:
define both languages in their own grammar;
create a lexer rule in your main grammar that captures the entire input of the embedded language;
use a rewrite rule that calls a custom method that parses the external AST and inserts it in the main AST using { ... } (see the expr rule in the main grammar (MyLanguage.g)).
MyLanguage.g
grammar MyLanguage;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
ROOT;
}
@members {
private CommonTree parseSQL(String sqlSrc) {
try {
MiniSQLLexer lexer = new MiniSQLLexer(new ANTLRStringStream(sqlSrc));
MiniSQLParser parser = new MiniSQLParser(new CommonTokenStream(lexer));
return (CommonTree)parser.parse().getTree();
} catch(Exception e) {
return new CommonTree(new CommonToken(-1, e.getMessage()));
}
}
}
parse
: assignment+ EOF -> ^(ROOT assignment+)
;
assignment
: Var Id '=' expr ';' -> ^('=' Id expr)
;
expr
: Num
| SQL -> {parseSQL($SQL.text)}
;
Var : 'var';
Id : ('a'..'z' | 'A'..'Z')+;
Num : '0'..'9'+;
SQL : '[' ~']'* ']';
Space : ' ' {skip();};
MiniSQL.g
grammar MiniSQL;
options {
output=AST;
ASTLabelType=CommonTree;
}
parse
: '[' statement ']' EOF -> statement
;
statement
: select
;
select
: Select '*' From ID -> ^(Select '*' From ID)
;
Select : 'select';
From : 'from';
ID : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
Main.java
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "var Query = [select * from table]; var x = 42;";
MyLanguageLexer lexer = new MyLanguageLexer(new ANTLRStringStream(src));
MyLanguageParser parser = new MyLanguageParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
Run the demo
java -cp antlr-3.3.jar org.antlr.Tool MiniSQL.g
java -cp antlr-3.3.jar org.antlr.Tool MyLanguage.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
Given the input:
var Query = [select * from table]; var x = 42;
the output of the Main class corresponds to the following AST:
And if you want to allow string literals inside your SQL (which could contain ]), and comments (which could contain ' and ]), then you could use the following SQL rule inside your main grammar:
SQL
: '[' ( ~(']' | '\'' | '-')
| '-' ~'-'
| COMMENT
| STR
)*
']'
;
fragment STR
: '\'' (~('\'' | '\r' | '\n') | '\'\'')+ '\''
| '\'\''
;
fragment COMMENT
: '--' ~('\r' | '\n')*
;
which would properly parse the following input in a single token:
[
select a,b,c
from table
where a='A''B]C'
and b='' -- some ] comment ] here'
]
Just beware that trying to create a grammar for an entire SQL dialect (or even a large subset) is no trivial task! You may want to search for existing SQL parsers, or look at the ANTLR wiki for example-grammars.
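To show what the extended SQL rule has to do, here is a rough plain-Java sketch that finds the closing ] of an embedded block while skipping string literals and -- comments. It is a hand-written approximation, not the code ANTLR generates: a lone - is treated as an ordinary character, and the names SqlSpanSketch, findSqlEnd and skipString are made up for the example.

```java
/** Approximates the SQL lexer rule by hand; hypothetical helper, not ANTLR output. */
final class SqlSpanSketch {

    /** Returns the index of the ']' that closes the block opened at start, or -1 if there is none. */
    static int findSqlEnd(String src, int start) {
        if (start >= src.length() || src.charAt(start) != '[') {
            return -1;
        }
        int i = start + 1;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == ']') {
                return i; // a ']' outside any string or comment closes the block
            }
            if (c == '\'') {
                i = skipString(src, i);
                if (i < 0) {
                    return -1; // unterminated string literal
                }
            } else if (c == '-' && i + 1 < src.length() && src.charAt(i + 1) == '-') {
                // COMMENT: skip everything up to (not including) the line break
                while (i < src.length() && src.charAt(i) != '\r' && src.charAt(i) != '\n') {
                    i++;
                }
            } else {
                i++;
            }
        }
        return -1;
    }

    /** Returns the index just past the string literal opened at open, or -1 if it is not closed. */
    private static int skipString(String src, int open) {
        int i = open + 1;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '\'') {
                if (i + 1 < src.length() && src.charAt(i + 1) == '\'') {
                    i += 2; // a doubled quote is an escaped quote inside the literal
                    continue;
                }
                return i + 1;
            }
            if (c == '\r' || c == '\n') {
                return -1; // STR does not span lines
            }
            i++;
        }
        return -1;
    }

    public static void main(String[] args) {
        String src = "[select * from table]; var x = 42;";
        System.out.println(src.substring(0, findSqlEnd(src, 0) + 1)); // prints [select * from table]
    }
}
```

The substring between the brackets is what would then be handed to the embedded language's own lexer and parser.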

Yes, with ANTLR this is called an island grammar.
You can find a working example in the v3 examples, inside the island-grammar folder: it shows how a grammar is used to parse Javadoc comments inside Java code.
You can also find some clues in the document "Island Grammars Under Parser Control" and in another one.

How to pass CommonTree parameter to an Antlr rule

I am trying to do what I think is a simple parameter passing to a rule in Antlr 3.3:
grammar rule_params;
options
{
output = AST;
}
rule_params
: outer;
outer: outer_id '[' inner[$outer_id.tree] ']';
inner[CommonTree parent] : inner_id '[' ']';
outer_id : '#'! ID;
inner_id : '$'! ID ;
ID : ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' )* ;
So the inner[CommonTree parent] generates the following:
inner4=inner((outer_id2!=null?((Object)outer_id2.tree):null));
resulting in this error:
The method inner(CommonTree) in the type rule_paramsParser is not applicable for the arguments (Object)
As best I can tell, this is exactly the same as the example in the ANTLR book:
classDefinition[CommonTree mod]
(Kindle Location 3993) - sorry I don't know the page number but it is in the middle of the book in chapter 9, section labeled "Creating Nodes with Arbitrary Actions".
Thanks for any help.
M

If you don't explicitly specify the tree to be used in your grammar, .tree (which is short for getTree()) will return a java.lang.Object and a CommonTree will be used as default Tree implementation. To avoid casting, set the type of tree in your options { ... } section:
options
{
output=AST;
ASTLabelType=CommonTree;
}

What's the correct way to add new tokens (rewrite) to create AST nodes that are not in the input stream

I've a pretty basic math expression grammar for ANTLR here and what's of interest is handling the implied * operator between parentheses e.g. (2-3)(4+5)(6*7) should actually be (2-3)*(4+5)*(6*7).
Given the input (2-3)(4+5)(6*7), I'm trying to add the missing * operator to the AST while parsing. I think I've managed to achieve that in the following grammar, but I'm wondering whether this is the correct, most elegant way.
grammar G;
options {
language = Java;
output=AST;
ASTLabelType=CommonTree;
}
tokens {
ADD = '+' ;
SUB = '-' ;
MUL = '*' ;
DIV = '/' ;
OPARN = '(' ;
CPARN = ')' ;
}
start
: expression EOF!
;
expression
: mult (( ADD^ | SUB^ ) mult)*
;
mult
: atom (( MUL^ | DIV^) atom)*
;
atom
: INTEGER
| (
OPARN expression CPARN -> expression
)
(
OPARN expression CPARN -> ^(MUL expression)+
)*
;
INTEGER : ('0'..'9')+ ;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel = HIDDEN;};
This grammar appears to output the correct AST in ANTLRWorks:
I'm only just starting to get to grips with parsing and ANTLR and don't have much experience, so feedback would be really appreciated!
Thanks in advance! Carl

First of all, you did a great job given the fact that you've never used ANTLR before.
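You can omit language=Java, which is the default. You can also drop ASTLabelType=CommonTree here: CommonTree is the default node implementation, although without the option tree labels are typed as Object and need a cast. So you can just do: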
options {
output=AST;
}
Also, you don't have to specify the root node for each operator separately. So you don't have to do:
(ADD^ | SUB^)
but the following:
(ADD | SUB)^
will suffice. With only two operators, there's not much difference, but when implementing relational operators (>=, <=, > and <), the latter is a bit easier.
Now, for your AST: you'll probably want to create a binary tree. That way, all internal nodes are operators and the leaves are operands, which makes the actual evaluation of your expressions much easier. To get a binary tree, you'll have to change your atom rule slightly:
atom
: INTEGER
| (
OPARN expression CPARN -> expression
)
(
OPARN e=expression CPARN -> ^(MUL $atom $e)
)*
;
which produces the following AST given the input "(2-3)(4+5)(6*7)":
(image produced by: graphviz-dev.appspot.com)
The DOT file was generated with the following test-class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
GLexer lexer = new GLexer(new ANTLRStringStream("(2-3)(4+5)(6*7)"));
GParser parser = new GParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.start().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
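To illustrate why the binary shape is convenient, here is a minimal hand-written sketch in plain Java (no ANTLR involved) that mimics the atom rule: each additional parenthesized group is folded into a new MUL node whose left child is the tree built so far, and the result is then evaluated recursively. The class ImpliedMulSketch and its Node type are hypothetical names for this example only.

```java
/** Hand-written sketch of implied multiplication; hypothetical, not ANTLR-generated code. */
final class ImpliedMulSketch {

    /** Binary AST node: a leaf holds a value, an inner node holds an operator and two children. */
    static final class Node {
        final char op;
        final int value;
        final Node left;
        final Node right;

        Node(int value) {
            this.op = 'n';
            this.value = value;
            this.left = null;
            this.right = null;
        }

        Node(char op, Node left, Node right) {
            this.op = op;
            this.value = 0;
            this.left = left;
            this.right = right;
        }

        int eval() {
            switch (op) {
                case '+': return left.eval() + right.eval();
                case '-': return left.eval() - right.eval();
                case '*': return left.eval() * right.eval();
                case '/': return left.eval() / right.eval();
                default:  return value; // leaf
            }
        }

        @Override
        public String toString() {
            return left == null ? Integer.toString(value) : "(" + op + " " + left + " " + right + ")";
        }
    }

    private final String src;
    private int pos = 0;

    private ImpliedMulSketch(String src) {
        this.src = src.replaceAll("\\s+", "");
    }

    static Node parse(String input) {
        ImpliedMulSketch p = new ImpliedMulSketch(input);
        Node root = p.expression();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return root;
    }

    private boolean at(char c) {
        return pos < src.length() && src.charAt(pos) == c;
    }

    // expression : mult ((ADD | SUB)^ mult)*
    private Node expression() {
        Node left = mult();
        while (at('+') || at('-')) {
            char op = src.charAt(pos++);
            left = new Node(op, left, mult());
        }
        return left;
    }

    // mult : atom ((MUL | DIV)^ atom)*
    private Node mult() {
        Node left = atom();
        while (at('*') || at('/')) {
            char op = src.charAt(pos++);
            left = new Node(op, left, atom());
        }
        return left;
    }

    // atom : INTEGER | group (group -> ^(MUL $atom $e))*
    private Node atom() {
        if (!at('(')) {
            return integer();
        }
        Node result = group();
        while (at('(')) {
            result = new Node('*', result, group()); // insert the MUL node that is not in the input
        }
        return result;
    }

    // group : OPARN expression CPARN -> expression
    private Node group() {
        pos++; // consume '('
        Node inner = expression();
        if (!at(')')) {
            throw new IllegalArgumentException("')' expected at " + pos);
        }
        pos++;
        return inner;
    }

    private Node integer() {
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("integer expected at " + pos);
        }
        return new Node(Integer.parseInt(src.substring(start, pos)));
    }

    public static void main(String[] args) {
        Node tree = parse("(2-3)(4+5)(6*7)");
        System.out.println(tree + " = " + tree.eval());
    }
}
```

Because every inner node has exactly two children, eval() needs no special case for a MUL node with a variable number of operands.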
