I created an ANTLR grammar for parsing mathematical expressions and a second one for evaluating them. As I thought building an AST and then re-parsing in order to actually evaluate it is sort of one operation too much, I wanted to refactor my grammar to produce a hierarchy of "Term" objects representing the expression including the logic to perform that particular operation. The root Term object can then be simply evaluated to a concrete result.
I had to rewrite quite a lot of my grammar and finally got rid of the last error message. Unfortunately now ANTLR seems to sort of go into an infinite loop.
Could someone here please help me sort out the problem? I think the grammar should be pretty interesting for some, therefore I am posting it. (It is based upon a garmmar I found with google, I should admit, but I have altered it quite a lot to suite my needs).
grammar SecurityRulesNew;
options {
language = Java;
output=AST;
backtrack = true;
ASTLabelType=CommonTree;
k=2;
}
tokens {
POS;
NEG;
CALL;
}
#header {package de.cware.cweb.services.evaluator.parser;}
#lexer::header{package de.cware.cweb.services.evaluator.parser;}
formula returns [Term term]
: a=expression EOF { $term = a; }
;
expression returns [Term term]
: a=boolExpr { $term = a; }
;
boolExpr returns [Term term]
: a=sumExpr { $term = a; }
| a=sumExpr AND b=boolExpr { $term = new AndTerm(a, b); }
| a=sumExpr OR b=boolExpr { $term = new OrTerm(a, b); }
| a=sumExpr LT b=boolExpr { $term = new LessThanTerm(a, b); }
| a=sumExpr LTEQ b=boolExpr { $term = new LessThanOrEqualTerm(a, b); }
| a=sumExpr GT b=boolExpr { $term = new GreaterThanTerm(a, b); }
| a=sumExpr GTEQ b=boolExpr { $term = new GreaterThanTermOrEqual(a, b); }
| a=sumExpr EQ b=boolExpr { $term = new EqualsTerm(a, b); }
| a=sumExpr NOTEQ b=boolExpr { $term = new NotEqualsTerm(a, b); }
;
sumExpr returns [Term term]
: a=productExpr { $term = a; }
| a=productExpr SUB b=sumExpr { $term = new SubTerm(a, b); }
| a=productExpr ADD b=sumExpr { $term = new AddTerm(a, b); }
;
productExpr returns [Term term]
: a=expExpr { $term = a; }
| a=expExpr DIV productExpr { $term = new DivTerm(a, b); }
| a=expExpr MULT productExpr { $term = new MultTerm(a, b); }
;
expExpr returns [Term term]
: a=unaryOperation { $term = a; }
| a=unaryOperation EXP expExpr { $term = new ExpTerm(a, b); }
;
unaryOperation returns [Term term]
: a=operand { $term = a; }
| NOT a=operand { $term = new NotTerm(a); }
| SUB a=operand { $term = new NegateTerm(a); }
;
operand returns [Term term]
: l=literal { $term = l; }
| f=functionExpr { $term = f; }
| v=VARIABLE { $term = new VariableTerm(v); }
| LPAREN e=expression RPAREN { $term = e; }
;
functionExpr returns [Term term]
: f=FUNCNAME LPAREN! RPAREN! { $term = new CallFunctionTerm(f, null); }
| f=FUNCNAME LPAREN! a=arguments RPAREN! { $term = new CallFunctionTerm(f, a); }
;
arguments returns [List<Term> terms]
: a=expression
{
$terms = new ArrayList<Term>();
$terms.add(a);
}
| a=expression COMMA b=arguments
{
$terms = new ArrayList<Term>();
$terms.add(a);
$terms.addAll(b);
}
;
literal returns [Term term]
: n=NUMBER { $term = new NumberLiteral(n); }
| s=STRING { $term = new StringLiteral(s); }
| t=TRUE { $term = new TrueLiteral(t); }
| f=FALSE { $term = new FalseLiteral(f); }
;
STRING
:
'\"'
( options {greedy=false;}
: ESCAPE_SEQUENCE
| ~'\\'
)*
'\"'
|
'\''
( options {greedy=false;}
: ESCAPE_SEQUENCE
| ~'\\'
)*
'\''
;
WHITESPACE
: (' ' | '\n' | '\t' | '\r')+ {skip();};
TRUE
: ('t'|'T')('r'|'R')('u'|'U')('e'|'E')
;
FALSE
: ('f'|'F')('a'|'A')('l'|'L')('s'|'S')('e'|'E')
;
NOTEQ : '!=';
LTEQ : '<=';
GTEQ : '>=';
AND : '&&';
OR : '||';
NOT : '!';
EQ : '=';
LT : '<';
GT : '>';
EXP : '^';
MULT : '*';
DIV : '/';
ADD : '+';
SUB : '-';
LPAREN : '(';
RPAREN : ')';
COMMA : ',';
PERCENT : '%';
VARIABLE
: '[' ~('[' | ']')+ ']'
;
FUNCNAME
: (LETTER)+
;
NUMBER
: (DIGIT)+ ('.' (DIGIT)+)?
;
fragment
LETTER
: ('a'..'z') | ('A'..'Z')
;
fragment
DIGIT
: ('0'..'9')
;
fragment
ESCAPE_SEQUENCE
: '\\' 't'
| '\\' 'n'
| '\\' '\"'
| '\\' '\''
| '\\' '\\'
;
Help is greatly appreciated.
Chris
Because your grammar is so incredibly ambiguous, ANTLR has a problem creating a parser. Apparently ANTLR 3.3+ chokes on it, but ANTLR 3.2 (with less time than 3.3+) produces the following error:
error(10): internal error: org.antlr.tool.Grammar.createLookaheadDFA(Grammar.java:1279):
could not even do k=1 for decision 1; reason: timed out (>1000ms)
For a simple expression parser, you really shouldn't use backtrack=true.
Besides the fact your grammar is ambiguous, much of your embedded code contains errors.
Let's have a look at your formula rule:
formula returns [Term term]
: a=expression EOF { $term = $a; }
;
Also, the return type of a rule should be explicitly defined. The a in { $term = a; } should have a $ in front of it:
formula returns [Term term]
: a=expression EOF { $term = $a; }
;
but then $a refers to the entire "thing" expression returns. You then have to "tell" ANTLR you want the Term this expression creates. This can be done like this:
formula returns [Term term]
: a=expression EOF { $term = $a.term; }
;
expression returns [Term term]
: a=boolExpr { $term = $a.term; }
;
It looks like you've converted some LR grammar into an ANTLR grammar (note that although ANTLR ends with LR, ANTLR 3.x is an LL parser generator) and without testing in between, you had hoped it should all work: unfortunately, it doesn't. There's too much wrong with it to produce a small working example based on your grammar: I'd have a look at an existing expression parser based on an ANTLR grammar and try again. Have a look at these Q&A's:
ANTLR: Is there a simple example?
ANTLR: From CommonTree to useful object graph
First of all, thank you for that detailed explanation. That really helps :-) ... All of the "$a.term" and similar stuff is sorted out now and code is generated that actually compiles (I simply hacked in that code wanting to fix the issues with that as soon as something is generated at all). I simply commented out a lot of options and keept on generating until I came to the one fragment that seems to break the build. I turned on that backtrack feature, because some errors I got, suggested that I turn it on.
EDIT:
Well I actually refactored the grammar to get rid of the errors without activating backtrack and now my parser is generated really fast and it seems to do it's job nicely. Here comes the current version:
grammar SecurityRulesNew;
options {
language = Java;
output=AST;
ASTLabelType=CommonTree;
/* backtrack = true;*/
}
tokens {
POS;
NEG;
CALL;
}
#header {package de.cware.cweb.services.evaluator.parser;
import de.cware.cweb.services.evaluator.terms.*;}
#lexer::header{package de.cware.cweb.services.evaluator.parser;}
formula returns [Term term]
: a=expression EOF { $term = $a.term; }
;
expression returns [Term term]
: a=boolExpr { $term = $a.term; }
;
boolExpr returns [Term term]
: a=sumExpr (AND! b=boolExpr | OR! c=boolExpr | LT! d=boolExpr | LTEQ! e=boolExpr | GT! f=boolExpr | GTEQ! g=boolExpr | EQ! h=boolExpr | NOTEQ! i=boolExpr)? {
if(b != null) {
$term = new AndTerm($a.term, $b.term);
} else if(c != null) {
$term = new OrTerm($a.term, $c.term);
} else if(d != null) {
$term = new LessThanTerm($a.term, $d.term);
} else if(e != null) {
$term = new LessThanOrEqualTerm($a.term, $e.term);
} else if(f != null) {
$term = new GreaterThanTerm($a.term, $f.term);
} else if(g != null) {
$term = new GreaterThanOrEqualTerm($a.term, $g.term);
} else if(h != null) {
$term = new EqualsTerm($a.term, $h.term);
} else if(i != null) {
$term = new NotEqualsTerm($a.term, $i.term);
} else {
$term = $a.term;
}
}
;
sumExpr returns [Term term]
: a=productExpr (SUB! b=sumExpr | ADD! c=sumExpr)?
{
if(b != null) {
$term = new SubTerm($a.term, $b.term);
} else if(c != null) {
$term = new AddTerm($a.term, $c.term);
} else {
$term = $a.term;
}
}
;
productExpr returns [Term term]
: a=expExpr (DIV! b=productExpr | MULT! c=productExpr)?
{
if(b != null) {
$term = new DivTerm($a.term, $b.term);
} else if(c != null) {
$term = new MultTerm($a.term, $c.term);
} else {
$term = $a.term;
}
}
;
expExpr returns [Term term]
: a=unaryOperation (EXP! b=expExpr)?
{
if(b != null) {
$term = new ExpTerm($a.term, $b.term);
} else {
$term = $a.term;
}
}
;
unaryOperation returns [Term term]
: a=operand { $term = $a.term; }
| NOT! a=operand { $term = new NotTerm($a.term); }
| SUB! a=operand { $term = new NegateTerm($a.term); }
| LPAREN! e=expression RPAREN! { $term = $e.term; }
;
operand returns [Term term]
: l=literal { $term = $l.term; }
| v=VARIABLE { $term = new VariableTerm($v.text); }
| f=functionExpr { $term = $f.term; }
;
functionExpr returns [Term term]
: f=FUNCNAME LPAREN! (a=arguments)? RPAREN! { $term = new CallFunctionTerm($f.text, $a.terms); }
;
arguments returns [List<Term> terms]
: a=expression (COMMA b=arguments)?
{
$terms = new ArrayList<Term>();
$terms.add($a.term);
if(b != null) {
$terms.addAll($b.terms);
}
}
;
literal returns [Term term]
: n=NUMBER { $term = new NumberLiteral(Double.valueOf($n.text)); }
| s=STRING { $term = new StringLiteral($s.text.substring(1, s.getText().length() - 1)); }
| TRUE! { $term = new TrueLiteral(); }
| FALSE! { $term = new FalseLiteral(); }
;
STRING
:
'\"'
( options {greedy=false;}
: ESCAPE_SEQUENCE
| ~'\\'
)*
'\"'
|
'\''
( options {greedy=false;}
: ESCAPE_SEQUENCE
| ~'\\'
)*
'\''
;
WHITESPACE
: (' ' | '\n' | '\t' | '\r')+ {skip();};
TRUE
: ('t'|'T')('r'|'R')('u'|'U')('e'|'E')
;
FALSE
: ('f'|'F')('a'|'A')('l'|'L')('s'|'S')('e'|'E')
;
NOTEQ : '!=';
LTEQ : '<=';
GTEQ : '>=';
AND : '&&';
OR : '||';
NOT : '!';
EQ : '=';
LT : '<';
GT : '>';
EXP : '^';
MULT : '*';
DIV : '/';
ADD : '+';
SUB : '-';
LPAREN : '(';
RPAREN : ')';
COMMA : ',';
PERCENT : '%';
VARIABLE
: '[' ~('[' | ']')+ ']'
;
FUNCNAME
: (LETTER)+
;
NUMBER
: (DIGIT)+ ('.' (DIGIT)+)?
;
fragment
LETTER
: ('a'..'z') | ('A'..'Z')
;
fragment
DIGIT
: ('0'..'9')
;
fragment
ESCAPE_SEQUENCE
: '\\' 't'
| '\\' 'n'
| '\\' '\"'
| '\\' '\''
| '\\' '\\'
;
Thanks again for your explanation ... it got me on the right track :-)
Chris
Related
I'm trying to build an expression evaluator with ANTLR v3 but I can't get the factorial function because it is right associative.
This is the code:
class ExpressionParser extends Parser;
options { buildAST=true; }
imaginaryTokenDefinitions :
SIGN_MINUS
SIGN_PLUS;
expr : LPAREN^ sumExpr RPAREN! ;
sumExpr : prodExpr ((PLUS^|MINUS^) prodExpr)* ;
prodExpr : powExpr ((MUL^|DIV^|MOD^) powExpr)* ;
powExpr : runary (POW^ runary)? ;
runary : unary (FAT)?;
unary : (SIN^|COS^|TAN^|LOG^|LN^|RAD^)* signExpr;
signExpr : (
m:MINUS^ {#m.setType(SIGN_MINUS);}
| p:PLUS^ {#p.setType(SIGN_PLUS);}
)? atom ;
atom : NUMBER | expr ;
class ExpressionLexer extends Lexer;
PLUS : '+' ;
MINUS : '-' ;
MUL : '*' ;
DIV : '/' ;
MOD : '%' ;
POW : '^' ;
SIN : 's' ;
COS : 'c' ;
TAN : 't' ;
LOG : 'l' ;
LN : 'n' ;
RAD : 'r' ;
FAT : 'f' ;
LPAREN: '(' ;
RPAREN: ')' ;
SEMI : ';' ;
protected DIGIT : '0'..'9' ;
NUMBER : (DIGIT)+ ('.' (DIGIT)+)?;
{import java.lang.Math;}
class ExpressionTreeWalker extends TreeParser;
expr returns [double r]
{ double a,b; int i,f=1; r=0; }
: #(PLUS a=expr b=expr) { r=a+b; }
| #(MINUS a=expr b=expr) { r=a-b; }
| #(MUL a=expr b=expr) { r=a*b; }
| #(DIV a=expr b=expr) { r=a/b; }
| #(MOD a=expr b=expr) { r=a%b; }
| #(POW a=expr b=expr) { r=Math.pow(a,b); }
| #(SIN a=expr ) { r=Math.sin(a); }
| #(COS a=expr ) { r=Math.cos(a); }
| #(TAN a=expr ) { r=Math.tan(a); }
| #(LOG a=expr ) { r=Math.log10(a); }
| #(LN a=expr ) { r=Math.log(a); }
| #(RAD a=expr ) { r=Math.sqrt(a); }
| #(FAT a=expr ) { for(i=1; i<=a; i++){f=f*i;}; r=(double)f;}
| #(LPAREN a=expr) { r=a; }
| #(SIGN_MINUS a=expr) { r=-1*a; }
| #(SIGN_PLUS a=expr) { if(a<0)r=0-a; else r=a; }
| d:NUMBER { r=Double.parseDouble(d.getText()); } ;
if I change FAT matching case in class TreeWalker with something like this:
| #(a=expr FAT ) { for(i=1; i<=a; i++){f=f*i;}; r=(double)f;}
I get this errors:
Expression.g:56:7: rule classDef trapped:
Expression.g:56:7: unexpected token: a
error: aborting grammar 'ExpressionTreeWalker' due to errors
Exiting due to errors.
Your tree walker (the original one) is fine, as far as I can see.
However, you probably need to mark FAT in the grammar:
runary : unary (FAT^)?;
(Note the hat ^, as in all the other productions.)
Edit:
As explained in the Antlr3 wiki, the hat operator is needed to make the node the "root of subtree created for entire enclosing rule even if nested in a subrule". In this case, the ! operator is nested in a conditional subrule ((FAT)?). That's independent of whether the operator is prefix or postfix.
Note that in your grammar the ! operator is not right-associative since a!! is not valid at all. But I would say that associativity is only meaningful for infix operators.
I'm trying to generate a Lexer/Parser through a simple grammar using ANTLR. What I've done right now :
grammar ExprV2;
#header {
package mypack.parte2;
}
#lexer::header {
package mypack.parte2;
}
start
: expr EOF { System.out.println($expr.val);}
;
expr returns [int val]
: term e=exprP[$term.val] { $val = $e.val; }
;
exprP[int i] returns [int val]
: { $val = $i; }
| '+' term e=exprP[$i + $term.val] { $val = $e.val; }
| '-' term e=exprP[$i - $term.val] { $val = $e.val; }
;
term returns [int val]
: fact e=termP[$fact.val] { $val = $e.val; }
;
termP[int i] returns [int val]
: {$val = $i;}
| '*' fact e=termP[$i * $fact.val] {$val = $e.val; }
| '/' fact e=termP[$i / $fact.val] {$val = $e.val; }
;
fact returns [int val]
: '(' expr ')' { $val = $expr.val; }
| NUM { $val=Integer.parseInt($NUM.text); }
;
NUM : '0'..'9'+ ;
WS : (' ' | '\t' |'\r' | '\n')+ { skip(); };
What I would like to obtain is to generate a Lexer/Parser with a grammar written using EBNF but I'm stucked and I don't know how to go ahead. I looked on the internet but I did not succeed in finding anything clear. Thanks to all!
In ANTLR: Is there a simple example?, a question about antlr3, the accepted answer has this grammar:
grammar Exp;
eval returns [double value]
: exp=additionExp {$value = $exp.value;}
;
additionExp returns [double value]
: m1=multiplyExp {$value = $m1.value;}
( '+' m2=multiplyExp {$value += $m2.value;}
| '-' m2=multiplyExp {$value -= $m2.value;}
)*
;
multiplyExp returns [double value]
: a1=atomExp {$value = $a1.value;}
( '*' a2=atomExp {$value *= $a2.value;}
| '/' a2=atomExp {$value /= $a2.value;}
)*
;
atomExp returns [double value]
: n=Number {$value = Double.parseDouble($n.text);}
| '(' exp=additionExp ')' {$value = $exp.value;}
;
Number
: ('0'..'9')+ ('.' ('0'..'9')+)?
;
WS
: (' ' | '\t' | '\r'| '\n') {$channel=HIDDEN;}
It uses the $value attribute to pass information up the parse tree.
I want to do the same thing antlr4. It looks like the $value attribute isn't there any more. How can I add custom attributes to rules to pass information up the parse tree? If that's not the right mechanism to accomplish what I want, what mechanisms are there to accomplish something similar?
I tried using locals, like this:
/* Store each row in an ArrayList */
row
locals [
ArrayList<String> cells = null
]
: partial_row RowSeparator
{
$cells = $partial_row.cells;
}
;
partial_row
locals [
ArrayList<String> cells = null
]
: Cell
{
$cells = new java.util.ArrayList<String>();
$cells.add($Cell.text);
}
| partial_row Cell
{
$cells = $partial_row.cells;
$cells.add($Cell.text);
}
;
But this doesn't work, giving me this error:
error(65): csce322a1p2.g:70:24: unknown attribute 'cells' for rule 'partial_row' in '$partial_row.cells'
I think you are looking for "returns" not locals. Also that should work. My test works:
row
locals [
ArrayList cells = null
]
: A B
{
$cells = $A;
}
;
Instead of locals, I want to use #init and returns:
row returns [java.util.ArrayList<String> cells]
#init {
java.util.ArrayList<String> cells = null;
}
: partial_row RowSeparator
{
$cells = $partial_row.cells;
}
;
partial_row returns [java.util.ArrayListArrayList<String> cells]
#init {
java.util.ArrayListArrayList<String> cells = null;
}
: Cell
{
$cells = new java.util.ArrayList<String>();
$cells.add($Cell.text);
}
| partial_row Cell
{
$cells = $partial_row.cells;
$cells.add($Cell.text);
}
;
Anyone know of any ANTLR grammar for Liquid Markup or a JAVA library that can work with it? I have taken a look at Jangod but it doesn't seem to work much.
Thanks!
Here's a grammar:
grammar Liquid;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
ASSIGNMENT;
ATTRIBUTES;
BLOCK;
CAPTURE;
CASE;
COMMENT;
CYCLE;
ELSE;
FILTERS;
FILTER;
FOR_ARRAY;
FOR_RANGE;
GROUP;
IF;
INCLUDE;
LOOKUP;
OUTPUT;
PARAMS;
PLAIN;
RAW;
TABLE;
UNLESS;
WHEN;
WITH;
}
#parser::members {
#Override
public void reportError(RecognitionException e) {
throw new RuntimeException(e);
}
}
#lexer::members {
private boolean inTag = false;
private boolean openTagAhead() {
return input.LA(1) == '{' && (input.LA(2) == '{' || input.LA(2) == '\u0025');
}
#Override
public void reportError(RecognitionException e) {
throw new RuntimeException(e);
}
}
/* parser rules */
parse
: block EOF -> block
;
block
: (options{greedy=true;}: atom)* -> ^(BLOCK atom*)
;
atom
: tag
| output
| assignment
| Other -> ^(PLAIN Other)
;
tag
: raw_tag
| comment_tag
| if_tag
| unless_tag
| case_tag
| cycle_tag
| for_tag
| table_tag
| capture_tag
| include_tag
;
raw_tag
: TagStart RawStart TagEnd raw_body TagStart RawEnd TagEnd
-> ^(RAW raw_body)
;
raw_body
: ~TagStart*
;
comment_tag
: TagStart CommentStart TagEnd comment_body TagStart CommentEnd TagEnd
-> ^(COMMENT comment_body)
;
comment_body
: ~TagStart*
;
if_tag
: TagStart IfStart expr TagEnd block else_tag? TagStart IfEnd TagEnd
-> ^(IF expr block ^(ELSE else_tag?))
;
else_tag
: TagStart Else TagEnd block
-> block
;
unless_tag
: TagStart UnlessStart expr TagEnd block else_tag? TagStart UnlessEnd TagEnd
-> ^(UNLESS expr block ^(ELSE else_tag?))
;
case_tag
: TagStart CaseStart expr TagEnd when_tag+ else_tag? TagStart CaseEnd TagEnd
-> ^(CASE expr when_tag+ ^(ELSE else_tag?))
;
when_tag
: TagStart When expr TagEnd block
-> ^(WHEN expr block)
;
cycle_tag
: TagStart Cycle cycle_group? expr (Comma expr)* TagEnd
-> ^(CYCLE ^(GROUP cycle_group?) expr+)
;
cycle_group
: expr Col -> expr
;
for_tag
: for_array
| for_range
;
for_array // attributes must be 'limit' or 'offset'!
: TagStart ForStart Id In lookup attribute* TagEnd block TagStart ForEnd TagEnd
-> ^(FOR_ARRAY Id lookup ^(ATTRIBUTES attribute*) block)
;
attribute
: Id Col expr -> ^(Id expr)
;
for_range
: TagStart ForStart Id In OPar expr DotDot expr CPar TagEnd block TagStart ForEnd TagEnd
-> ^(FOR_RANGE Id expr expr block)
;
table_tag // attributes must be 'limit' or 'cols'!
: TagStart TableStart Id In Id attribute* TagEnd block TagStart TableEnd TagEnd
-> ^(TABLE Id Id ^(ATTRIBUTES attribute*) block)
;
capture_tag
: TagStart CaptureStart Id TagEnd block TagStart CaptureEnd TagEnd
-> ^(CAPTURE Id block)
;
include_tag
: TagStart Include a=Str (With b=Str)? TagEnd
-> ^(INCLUDE $a ^(WITH $b?))
;
output
: OutStart expr filter* OutEnd
-> ^(OUTPUT expr ^(FILTERS filter*))
;
filter
: Pipe Id params?
-> ^(FILTER Id ^(PARAMS params?))
;
params
: Col expr (Comma expr)* -> expr+
;
assignment
: TagStart Assign Id EqSign expr TagEnd
-> ^(ASSIGNMENT Id expr)
;
expr
: or_expr
;
or_expr
: and_expr (Or^ and_expr)*
;
and_expr
: eq_expr (And^ eq_expr)*
;
eq_expr
: rel_expr ((Eq | NEq)^ rel_expr)*
;
rel_expr
: term ((LtEq | Lt | GtEq | Gt)^ term)?
;
term
: Num
| Str
| True
| False
| Nil
| lookup
;
lookup
: Id (Dot Id)* -> ^(LOOKUP Id+)
;
/* lexer rules */
OutStart : '{{' {inTag=true;};
OutEnd : '}}' {inTag=false;};
TagStart : '{%' {inTag=true;};
TagEnd : '%}' {inTag=false;};
Str : {inTag}?=> (SStr | DStr);
DotDot : {inTag}?=> '..';
Dot : {inTag}?=> '.';
NEq : {inTag}?=> '!=';
Eq : {inTag}?=> '==';
EqSign : {inTag}?=> '=';
GtEq : {inTag}?=> '>=';
Gt : {inTag}?=> '>';
LtEq : {inTag}?=> '<=';
Lt : {inTag}?=> '<';
Pipe : {inTag}?=> '|';
Col : {inTag}?=> ':';
Comma : {inTag}?=> ',';
OPar : {inTag}?=> '(';
CPar : {inTag}?=> ')';
Num : {inTag}?=> Digit+;
WS : {inTag}?=> (' ' | '\t' | '\r' | '\n')+ {skip();};
Id
: {inTag}?=> (Letter | '_') (Letter | '_' | '-' | Digit)*
{
if($text.equals("capture")) $type = CaptureStart;
else if($text.equals("endcapture")) $type = CaptureEnd;
else if($text.equals("comment")) $type = CommentStart;
else if($text.equals("endcomment")) $type = CommentEnd;
else if($text.equals("raw")) $type = RawStart;
else if($text.equals("endraw")) $type = RawEnd;
else if($text.equals("if")) $type = IfStart;
else if($text.equals("endif")) $type = IfEnd;
else if($text.equals("unless")) $type = UnlessStart;
else if($text.equals("endunless")) $type = UnlessEnd;
else if($text.equals("else")) $type = Else;
else if($text.equals("case")) $type = CaseStart;
else if($text.equals("endcase")) $type = CaseEnd;
else if($text.equals("when")) $type = When;
else if($text.equals("cycle")) $type = Cycle;
else if($text.equals("for")) $type = ForStart;
else if($text.equals("endfor")) $type = ForEnd;
else if($text.equals("in")) $type = In;
else if($text.equals("and")) $type = And;
else if($text.equals("or")) $type = Or;
else if($text.equals("tablerow")) $type = TableStart;
else if($text.equals("endtablerow")) $type = TableEnd;
else if($text.equals("assign")) $type = Assign;
else if($text.equals("true")) $type = True;
else if($text.equals("false")) $type = False;
else if($text.equals("nil")) $type = Nil;
else if($text.equals("include")) $type = Include;
else if($text.equals("with")) $type = With;
}
;
Other
: ({!inTag && !openTagAhead()}?=> . )+
{
String s = getText().replaceAll("\\s+", " ").trim();
if(s.isEmpty()) {
skip();
}
else {
setText(s);
}
}
;
/* fragment rules */
fragment Letter : 'a'..'z' | 'A'..'Z';
fragment Digit : '0'..'9';
fragment SStr : '\'' ~'\''* '\'';
fragment DStr : '"' ~'"'* '"';
fragment CommentStart : ;
fragment CommentEnd : ;
fragment RawStart : ;
fragment RawEnd : ;
fragment IfStart : ;
fragment IfEnd : ;
fragment UnlessStart : ;
fragment UnlessEnd : ;
fragment Else : ;
fragment CaseStart : ;
fragment CaseEnd : ;
fragment When : ;
fragment Cycle : ;
fragment ForStart : ;
fragment ForEnd : ;
fragment In : ;
fragment And : ;
fragment Or : ;
fragment TableStart : ;
fragment TableEnd : ;
fragment Assign : ;
fragment True : ;
fragment False : ;
fragment Nil : ;
fragment Include : ;
fragment With : ;
fragment CaptureStart : ;
fragment CaptureEnd : ;
I have dusted the thing off a bit and put it in a Github repository: https://github.com/bkiers/Liqp
Be aware: although I have successfully used this grammar in the past, the input might have been rather "easy". If you're going to use it, and run into problems, I'd appreciate it if you let me know. If you're looking for a more robust, thoroughly tested library/parser/grammar, this might not be what you're looking for.
I've been working on a parser for some template language embeded in HTML (FreeMarker), piece of example here:
${abc}
<html>
<head>
<title>Welcome!</title>
</head>
<body>
<h1>
Welcome ${user}<#if user == "Big Joe">, our beloved
leader</#if>!
</h1>
<p>Our latest product:
${latestProduct}!
</body>
</html>
The template language is between some specific tags, e.g. '${' '}', '<#' '>'. Other raw texts in between can be treated like as the same tokens (RAW).
The key point here is that the same text, e.g. an integer, will mean differently thing for the parser depends on whether it's between those tags or not, and thus needs to be treated as different tokens.
I've tried with the following ugly implementation, with a self-defined state to indicate whether it's in those tags. As you see, I have to check the state almost in every rule, which drives me crazy...
I also thinked about the following two solutions:
Use multiple lexers. I can switch between two lexers when inside/outside those tags. However, the document for this is poor for ANTLR3. I don't know how to let one parser share two different lexers and switch between them.
Move the RAW rule up, after the NUMERICAL_ESCAPE rule. Check the state there, if it's in the tag, put back the token and continue trying the left rules. This would save lots of state checking. However, I don't find any 'put back' function and ANTLR complains about some rules can never be matched...
Is there an elegant solution for this?
grammar freemarker_simple;
#lexer::members {
int freemarker_type = 0;
}
expression
: primary_expression ;
primary_expression
: number_literal | identifier | parenthesis | builtin_variable
;
parenthesis
: OPEN_PAREN expression CLOSE_PAREN ;
number_literal
: INTEGER | DECIMAL
;
identifier
: ID
;
builtin_variable
: DOT ID
;
string_output
: OUTPUT_ESCAPE expression CLOSE_BRACE
;
numerical_output
: NUMERICAL_ESCAPE expression CLOSE_BRACE
;
if_expression
: START_TAG IF expression DIRECTIVE_END optional_block
( START_TAG ELSE_IF expression loose_directive_end optional_block )*
( END_TAG ELSE optional_block )?
END_TAG END_IF
;
list : START_TAG LIST expression AS ID DIRECTIVE_END optional_block END_TAG END_LIST ;
for_each
: START_TAG FOREACH ID IN expression DIRECTIVE_END optional_block END_TAG END_FOREACH ;
loose_directive_end
: ( DIRECTIVE_END | EMPTY_DIRECTIVE_END ) ;
freemarker_directive
: ( if_expression | list | for_each ) ;
content : ( RAW | string_output | numerical_output | freemarker_directive ) + ;
optional_block
: ( content )? ;
root : optional_block EOF ;
START_TAG
: '<#'
{ freemarker_type = 1; }
;
END_TAG : '</#'
{ freemarker_type = 1; }
;
DIRECTIVE_END
: '>'
{
if(freemarker_type == 0) $type=RAW;
freemarker_type = 0;
}
;
EMPTY_DIRECTIVE_END
: '/>'
{
if(freemarker_type == 0) $type=RAW;
freemarker_type = 0;
}
;
OUTPUT_ESCAPE
: '${'
{ if(freemarker_type == 0) freemarker_type = 2; }
;
NUMERICAL_ESCAPE
: '#{'
{ if(freemarker_type == 0) freemarker_type = 2; }
;
IF : 'if'
{ if(freemarker_type == 0) $type=RAW; }
;
ELSE : 'else' DIRECTIVE_END
{ if(freemarker_type == 0) $type=RAW; }
;
ELSE_IF : 'elseif'
{ if(freemarker_type == 0) $type=RAW; }
;
LIST : 'list'
{ if(freemarker_type == 0) $type=RAW; }
;
FOREACH : 'foreach'
{ if(freemarker_type == 0) $type=RAW; }
;
END_IF : 'if' DIRECTIVE_END
{ if(freemarker_type == 0) $type=RAW; }
;
END_LIST
: 'list' DIRECTIVE_END
{ if(freemarker_type == 0) $type=RAW; }
;
END_FOREACH
: 'foreach' DIRECTIVE_END
{ if(freemarker_type == 0) $type=RAW; }
;
FALSE: 'false' { if(freemarker_type == 0) $type=RAW; };
TRUE: 'true' { if(freemarker_type == 0) $type=RAW; };
INTEGER: ('0'..'9')+ { if(freemarker_type == 0) $type=RAW; };
DECIMAL: INTEGER '.' INTEGER { if(freemarker_type == 0) $type=RAW; };
DOT: '.' { if(freemarker_type == 0) $type=RAW; };
DOT_DOT: '..' { if(freemarker_type == 0) $type=RAW; };
PLUS: '+' { if(freemarker_type == 0) $type=RAW; };
MINUS: '-' { if(freemarker_type == 0) $type=RAW; };
TIMES: '*' { if(freemarker_type == 0) $type=RAW; };
DIVIDE: '/' { if(freemarker_type == 0) $type=RAW; };
PERCENT: '%' { if(freemarker_type == 0) $type=RAW; };
AND: '&' | '&&' { if(freemarker_type == 0) $type=RAW; };
OR: '|' | '||' { if(freemarker_type == 0) $type=RAW; };
EXCLAM: '!' { if(freemarker_type == 0) $type=RAW; };
OPEN_PAREN: '(' { if(freemarker_type == 0) $type=RAW; };
CLOSE_PAREN: ')' { if(freemarker_type == 0) $type=RAW; };
OPEN_BRACE
: '{'
{ if(freemarker_type == 0) $type=RAW; }
;
CLOSE_BRACE
: '}'
{
if(freemarker_type == 0) $type=RAW;
if(freemarker_type == 2) freemarker_type = 0;
}
;
IN: 'in' { if(freemarker_type == 0) $type=RAW; };
AS: 'as' { if(freemarker_type == 0) $type=RAW; };
ID : ('A'..'Z'|'a'..'z')+
//{ if(freemarker_type == 0) $type=RAW; }
;
BLANK : ( '\r' | ' ' | '\n' | '\t' )+
{
if(freemarker_type == 0) $type=RAW;
else $channel = HIDDEN;
}
;
RAW
: .
;
EDIT
I found the problem similar to How do I lex this input? , where a "start condition" is needed. But unfortunately, the answer uses a lot of predicates as well, just like my states.
Now, I tried to move the RAW higher with a predicate. Hoping to eliminate all the state checks after RAW rule. However, my example input failed, the first line end is recogonized as BLANK instead of RAW it should be.
I guess something wrong is about the rule priority:
After CLOSE_BRACE is matched, the next token is matched from rules after the CLOSE_BRACE rule, rather than start from the begenning again.
Any way to resolve this?
New grammar below with some debug outputs:
grammar freemarker_simple;
#lexer::members {
int freemarker_type = 0;
}
expression
: primary_expression ;
primary_expression
: number_literal | identifier | parenthesis | builtin_variable
;
parenthesis
: OPEN_PAREN expression CLOSE_PAREN ;
number_literal
: INTEGER | DECIMAL
;
identifier
: ID
;
builtin_variable
: DOT ID
;
string_output
: OUTPUT_ESCAPE expression CLOSE_BRACE
;
numerical_output
: NUMERICAL_ESCAPE expression CLOSE_BRACE
;
if_expression
: START_TAG IF expression DIRECTIVE_END optional_block
( START_TAG ELSE_IF expression loose_directive_end optional_block )*
( END_TAG ELSE optional_block )?
END_TAG END_IF
;
list : START_TAG LIST expression AS ID DIRECTIVE_END optional_block END_TAG END_LIST ;
for_each
: START_TAG FOREACH ID IN expression DIRECTIVE_END optional_block END_TAG END_FOREACH ;
loose_directive_end
: ( DIRECTIVE_END | EMPTY_DIRECTIVE_END ) ;
freemarker_directive
: ( if_expression | list | for_each ) ;
content : ( RAW | string_output | numerical_output | freemarker_directive ) + ;
optional_block
: ( content )? ;
root : optional_block EOF ;
START_TAG
: '<#'
{ freemarker_type = 1; }
;
END_TAG : '</#'
{ freemarker_type = 1; }
;
OUTPUT_ESCAPE
: '${'
{ if(freemarker_type == 0) freemarker_type = 2; }
;
NUMERICAL_ESCAPE
: '#{'
{ if(freemarker_type == 0) freemarker_type = 2; }
;
RAW
:
{ freemarker_type == 0 }?=> .
{System.out.printf("RAW \%s \%d\n",getText(),freemarker_type);}
;
DIRECTIVE_END
: '>'
{ if(freemarker_type == 1) freemarker_type = 0; }
;
EMPTY_DIRECTIVE_END
: '/>'
{ if(freemarker_type == 1) freemarker_type = 0; }
;
IF : 'if'
;
ELSE : 'else' DIRECTIVE_END
;
ELSE_IF : 'elseif'
;
LIST : 'list'
;
FOREACH : 'foreach'
;
END_IF : 'if' DIRECTIVE_END
;
END_LIST
: 'list' DIRECTIVE_END
;
END_FOREACH
: 'foreach' DIRECTIVE_END
;
FALSE: 'false' ;
TRUE: 'true' ;
INTEGER: ('0'..'9')+ ;
DECIMAL: INTEGER '.' INTEGER ;
DOT: '.' ;
DOT_DOT: '..' ;
PLUS: '+' ;
MINUS: '-' ;
TIMES: '*' ;
DIVIDE: '/' ;
PERCENT: '%' ;
AND: '&' | '&&' ;
OR: '|' | '||' ;
EXCLAM: '!' ;
OPEN_PAREN: '(' ;
CLOSE_PAREN: ')' ;
OPEN_BRACE
: '{'
;
CLOSE_BRACE
: '}'
{ if(freemarker_type == 2) {freemarker_type = 0;} }
;
IN: 'in' ;
AS: 'as' ;
ID : ('A'..'Z'|'a'..'z')+
{ System.out.printf("ID \%s \%d\n",getText(),freemarker_type);}
;
BLANK : ( '\r' | ' ' | '\n' | '\t' )+
{
System.out.printf("BLANK \%d\n",freemarker_type);
$channel = HIDDEN;
}
;
My input results with the output:
ID abc 2
BLANK 0 <<< incorrect, should be RAW when state==0
RAW < 0 <<< correct
ID html 0 <<< incorrect, should be RAW RAW RAW RAW
RAW > 0
EDIT2
Also tried the 2nd approach with Bart's grammar, still didn't work the 'html' is recognized as an ID, which should be 4 RAWs. When mmode=false, shouldn't RAW get matched first? Or the lexer still chooses the longest match here?
grammar freemarker_bart;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
FILE;
OUTPUT;
RAW_BLOCK;
}
#parser::members {
// merge a given list of tokens into a single AST
private CommonTree merge(List tokenList) {
StringBuilder b = new StringBuilder();
for(int i = 0; i < tokenList.size(); i++) {
Token token = (Token)tokenList.get(i);
b.append(token.getText());
}
return new CommonTree(new CommonToken(RAW, b.toString()));
}
}
#lexer::members {
private boolean mmode = false;
}
parse
: content* EOF -> ^(FILE content*)
;
content
: (options {greedy=true;}: t+=RAW)+ -> ^(RAW_BLOCK {merge($t)})
| if_stat
| output
;
if_stat
: TAG_START IF expression TAG_END raw_block TAG_END_START IF TAG_END -> ^(IF expression raw_block)
;
output
: OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
;
raw_block
: (t+=RAW)* -> ^(RAW_BLOCK {merge($t)})
;
expression
: eq_expression
;
eq_expression
: atom (EQUALS^ atom)*
;
atom
: STRING
| ID
;
// these tokens denote the start of markup code (sets mmode to true)
OUTPUT_START : '${' {mmode=true;};
TAG_START : '<#' {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});
RAW : {!mmode}?=> . ;
// these tokens denote the end of markup code (sets mmode to false)
OUTPUT_END : '}' {mmode=false;};
TAG_END : '>' {mmode=false;};
// valid tokens only when in "markup mode"
EQUALS : '==';
IF : 'if';
STRING : '"' ~'"'* '"';
ID : ('a'..'z' | 'A'..'Z')+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
You could let lexer rules match using gated semantic predicates where you test for a certain boolean expression.
A little demo:
freemarker_simple.g
grammar freemarker_simple;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
FILE;
OUTPUT;
RAW_BLOCK;
}
#parser::members {
// merge a given list of tokens into a single AST
private CommonTree merge(List tokenList) {
StringBuilder b = new StringBuilder();
for(int i = 0; i < tokenList.size(); i++) {
Token token = (Token)tokenList.get(i);
b.append(token.getText());
}
return new CommonTree(new CommonToken(RAW, b.toString()));
}
}
#lexer::members {
private boolean mmode = false;
}
parse
: content* EOF -> ^(FILE content*)
;
content
: (options {greedy=true;}: t+=RAW)+ -> ^(RAW_BLOCK {merge($t)})
| if_stat
| output
;
if_stat
: TAG_START IF expression TAG_END raw_block TAG_END_START IF TAG_END -> ^(IF expression raw_block)
;
output
: OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
;
raw_block
: (t+=RAW)* -> ^(RAW_BLOCK {merge($t)})
;
expression
: eq_expression
;
eq_expression
: atom (EQUALS^ atom)*
;
atom
: STRING
| ID
;
// these tokens denote the start of markup code (sets mmode to true)
OUTPUT_START : '${' {mmode=true;};
TAG_START : '<#' {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});
// these tokens denote the end of markup code (sets mmode to false)
OUTPUT_END : {mmode}?=> '}' {mmode=false;};
TAG_END : {mmode}?=> '>' {mmode=false;};
// valid tokens only when in "markup mode"
EQUALS : {mmode}?=> '==';
IF : {mmode}?=> 'if';
STRING : {mmode}?=> '"' ~'"'* '"';
ID : {mmode}?=> ('a'..'z' | 'A'..'Z')+;
SPACE : {mmode}?=> (' ' | '\t' | '\r' | '\n')+ {skip();};
RAW : . ;
which parses your input:
test.html
${abc}
<html>
<head>
<title>Welcome!</title>
</head>
<body>
<h1>
Welcome ${user}<#if user == "Big Joe">, our beloved leader</#if>!
</h1>
<p>Our latest product: ${latestProduct}!</p>
</body>
</html>
into the following AST:
as you can test yourself with the class:
Main.java
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
freemarker_simpleLexer lexer = new freemarker_simpleLexer(new ANTLRFileStream("test.html"));
freemarker_simpleParser parser = new freemarker_simpleParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
EDIT 1
When I run your example input with a parser generated from the second grammar you posted, the following are wthe first 5 lines being printed to the console (not counting the many warnings that are generated):
ID abc 2
RAW
0
RAW < 0
ID html 0
...
EDIT 2
Bood wrote:
Also tried the 2nd approach with Bart's grammar, still didn't work the 'html' is recognized as an ID, which should be 4 RAWs. When mmode=false, shouldn't RAW get matched first? Or the lexer still chooses the longest match here?
Yes, that is correct: ANTLR chooses the longer match in that case.
But now that I (finally :)) see what you're trying to do, here's a last proposal: you could let the RAW rule match characters as long as the rule can't see one of the following character sequences ahead: "<#", "</#" or "${". Note that the rule must still stay at the end in the grammar. This check is performed inside the lexer. Also, in that case you don't need the merge(...) method in the parser:
grammar freemarker_simple;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
FILE;
OUTPUT;
RAW_BLOCK;
}
#lexer::members {
private boolean mmode = false;
private boolean rawAhead() {
if(mmode) return false;
int ch1 = input.LA(1), ch2 = input.LA(2), ch3 = input.LA(3);
return !(
(ch1 == '<' && ch2 == '#') ||
(ch1 == '<' && ch2 == '/' && ch3 == '#') ||
(ch1 == '$' && ch2 == '{')
);
}
}
parse
: content* EOF -> ^(FILE content*)
;
content
: RAW
| if_stat
| output
;
if_stat
: TAG_START IF expression TAG_END RAW TAG_END_START IF TAG_END -> ^(IF expression RAW)
;
output
: OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
;
expression
: eq_expression
;
eq_expression
: atom (EQUALS^ atom)*
;
atom
: STRING
| ID
;
OUTPUT_START : '${' {mmode=true;};
TAG_START : '<#' {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});
OUTPUT_END : '}' {mmode=false;};
TAG_END : '>' {mmode=false;};
EQUALS : '==';
IF : 'if';
STRING : '"' ~'"'* '"';
ID : ('a'..'z' | 'A'..'Z')+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
RAW : ({rawAhead()}?=> . )+;
The grammar above will produce the following AST from the input posted at the start of this answer: