Different lexer rules in different state - antlr

I've been working on a parser for some template language embeded in HTML (FreeMarker), piece of example here:
${abc}
<html>
<head>
<title>Welcome!</title>
</head>
<body>
<h1>
Welcome ${user}<#if user == "Big Joe">, our beloved
leader</#if>!
</h1>
<p>Our latest product:
${latestProduct}!
</body>
</html>
The template language is between some specific tags, e.g. '${' '}', '<#' '>'. Other raw texts in between can be treated like as the same tokens (RAW).
The key point here is that the same text, e.g. an integer, will mean differently thing for the parser depends on whether it's between those tags or not, and thus needs to be treated as different tokens.
I've tried with the following ugly implementation, with a self-defined state to indicate whether it's in those tags. As you see, I have to check the state almost in every rule, which drives me crazy...
I also thinked about the following two solutions:
Use multiple lexers. I can switch between two lexers when inside/outside those tags. However, the document for this is poor for ANTLR3. I don't know how to let one parser share two different lexers and switch between them.
Move the RAW rule up, after the NUMERICAL_ESCAPE rule. Check the state there, if it's in the tag, put back the token and continue trying the left rules. This would save lots of state checking. However, I don't find any 'put back' function and ANTLR complains about some rules can never be matched...
Is there an elegant solution for this?
grammar freemarker_simple;
#lexer::members {
int freemarker_type = 0;
}
expression
: primary_expression ;
primary_expression
: number_literal | identifier | parenthesis | builtin_variable
;
parenthesis
: OPEN_PAREN expression CLOSE_PAREN ;
number_literal
: INTEGER | DECIMAL
;
identifier
: ID
;
builtin_variable
: DOT ID
;
string_output
: OUTPUT_ESCAPE expression CLOSE_BRACE
;
numerical_output
: NUMERICAL_ESCAPE expression CLOSE_BRACE
;
if_expression
: START_TAG IF expression DIRECTIVE_END optional_block
( START_TAG ELSE_IF expression loose_directive_end optional_block )*
( END_TAG ELSE optional_block )?
END_TAG END_IF
;
list : START_TAG LIST expression AS ID DIRECTIVE_END optional_block END_TAG END_LIST ;
for_each
: START_TAG FOREACH ID IN expression DIRECTIVE_END optional_block END_TAG END_FOREACH ;
loose_directive_end
: ( DIRECTIVE_END | EMPTY_DIRECTIVE_END ) ;
freemarker_directive
: ( if_expression | list | for_each ) ;
content : ( RAW | string_output | numerical_output | freemarker_directive ) + ;
optional_block
: ( content )? ;
root : optional_block EOF ;
START_TAG
: '<#'
{ freemarker_type = 1; }
;
END_TAG : '</#'
{ freemarker_type = 1; }
;
DIRECTIVE_END
: '>'
{
if(freemarker_type == 0) $type=RAW;
freemarker_type = 0;
}
;
EMPTY_DIRECTIVE_END
: '/>'
{
if(freemarker_type == 0) $type=RAW;
freemarker_type = 0;
}
;
OUTPUT_ESCAPE
: '${'
{ if(freemarker_type == 0) freemarker_type = 2; }
;
NUMERICAL_ESCAPE
: '#{'
{ if(freemarker_type == 0) freemarker_type = 2; }
;
IF : 'if'
{ if(freemarker_type == 0) $type=RAW; }
;
ELSE : 'else' DIRECTIVE_END
{ if(freemarker_type == 0) $type=RAW; }
;
ELSE_IF : 'elseif'
{ if(freemarker_type == 0) $type=RAW; }
;
LIST : 'list'
{ if(freemarker_type == 0) $type=RAW; }
;
FOREACH : 'foreach'
{ if(freemarker_type == 0) $type=RAW; }
;
END_IF : 'if' DIRECTIVE_END
{ if(freemarker_type == 0) $type=RAW; }
;
END_LIST
: 'list' DIRECTIVE_END
{ if(freemarker_type == 0) $type=RAW; }
;
END_FOREACH
: 'foreach' DIRECTIVE_END
{ if(freemarker_type == 0) $type=RAW; }
;
FALSE: 'false' { if(freemarker_type == 0) $type=RAW; };
TRUE: 'true' { if(freemarker_type == 0) $type=RAW; };
INTEGER: ('0'..'9')+ { if(freemarker_type == 0) $type=RAW; };
DECIMAL: INTEGER '.' INTEGER { if(freemarker_type == 0) $type=RAW; };
DOT: '.' { if(freemarker_type == 0) $type=RAW; };
DOT_DOT: '..' { if(freemarker_type == 0) $type=RAW; };
PLUS: '+' { if(freemarker_type == 0) $type=RAW; };
MINUS: '-' { if(freemarker_type == 0) $type=RAW; };
TIMES: '*' { if(freemarker_type == 0) $type=RAW; };
DIVIDE: '/' { if(freemarker_type == 0) $type=RAW; };
PERCENT: '%' { if(freemarker_type == 0) $type=RAW; };
AND: '&' | '&&' { if(freemarker_type == 0) $type=RAW; };
OR: '|' | '||' { if(freemarker_type == 0) $type=RAW; };
EXCLAM: '!' { if(freemarker_type == 0) $type=RAW; };
OPEN_PAREN: '(' { if(freemarker_type == 0) $type=RAW; };
CLOSE_PAREN: ')' { if(freemarker_type == 0) $type=RAW; };
OPEN_BRACE
: '{'
{ if(freemarker_type == 0) $type=RAW; }
;
CLOSE_BRACE
: '}'
{
if(freemarker_type == 0) $type=RAW;
if(freemarker_type == 2) freemarker_type = 0;
}
;
IN: 'in' { if(freemarker_type == 0) $type=RAW; };
AS: 'as' { if(freemarker_type == 0) $type=RAW; };
ID : ('A'..'Z'|'a'..'z')+
//{ if(freemarker_type == 0) $type=RAW; }
;
BLANK : ( '\r' | ' ' | '\n' | '\t' )+
{
if(freemarker_type == 0) $type=RAW;
else $channel = HIDDEN;
}
;
RAW
: .
;
EDIT
I found the problem similar to How do I lex this input? , where a "start condition" is needed. But unfortunately, the answer uses a lot of predicates as well, just like my states.
Now, I tried to move the RAW higher with a predicate. Hoping to eliminate all the state checks after RAW rule. However, my example input failed, the first line end is recogonized as BLANK instead of RAW it should be.
I guess something wrong is about the rule priority:
After CLOSE_BRACE is matched, the next token is matched from rules after the CLOSE_BRACE rule, rather than start from the begenning again.
Any way to resolve this?
New grammar below with some debug outputs:
grammar freemarker_simple;
#lexer::members {
int freemarker_type = 0;
}
expression
: primary_expression ;
primary_expression
: number_literal | identifier | parenthesis | builtin_variable
;
parenthesis
: OPEN_PAREN expression CLOSE_PAREN ;
number_literal
: INTEGER | DECIMAL
;
identifier
: ID
;
builtin_variable
: DOT ID
;
string_output
: OUTPUT_ESCAPE expression CLOSE_BRACE
;
numerical_output
: NUMERICAL_ESCAPE expression CLOSE_BRACE
;
if_expression
: START_TAG IF expression DIRECTIVE_END optional_block
( START_TAG ELSE_IF expression loose_directive_end optional_block )*
( END_TAG ELSE optional_block )?
END_TAG END_IF
;
list : START_TAG LIST expression AS ID DIRECTIVE_END optional_block END_TAG END_LIST ;
for_each
: START_TAG FOREACH ID IN expression DIRECTIVE_END optional_block END_TAG END_FOREACH ;
loose_directive_end
: ( DIRECTIVE_END | EMPTY_DIRECTIVE_END ) ;
freemarker_directive
: ( if_expression | list | for_each ) ;
content : ( RAW | string_output | numerical_output | freemarker_directive ) + ;
optional_block
: ( content )? ;
root : optional_block EOF ;
START_TAG
: '<#'
{ freemarker_type = 1; }
;
END_TAG : '</#'
{ freemarker_type = 1; }
;
OUTPUT_ESCAPE
: '${'
{ if(freemarker_type == 0) freemarker_type = 2; }
;
NUMERICAL_ESCAPE
: '#{'
{ if(freemarker_type == 0) freemarker_type = 2; }
;
RAW
:
{ freemarker_type == 0 }?=> .
{System.out.printf("RAW \%s \%d\n",getText(),freemarker_type);}
;
DIRECTIVE_END
: '>'
{ if(freemarker_type == 1) freemarker_type = 0; }
;
EMPTY_DIRECTIVE_END
: '/>'
{ if(freemarker_type == 1) freemarker_type = 0; }
;
IF : 'if'
;
ELSE : 'else' DIRECTIVE_END
;
ELSE_IF : 'elseif'
;
LIST : 'list'
;
FOREACH : 'foreach'
;
END_IF : 'if' DIRECTIVE_END
;
END_LIST
: 'list' DIRECTIVE_END
;
END_FOREACH
: 'foreach' DIRECTIVE_END
;
FALSE: 'false' ;
TRUE: 'true' ;
INTEGER: ('0'..'9')+ ;
DECIMAL: INTEGER '.' INTEGER ;
DOT: '.' ;
DOT_DOT: '..' ;
PLUS: '+' ;
MINUS: '-' ;
TIMES: '*' ;
DIVIDE: '/' ;
PERCENT: '%' ;
AND: '&' | '&&' ;
OR: '|' | '||' ;
EXCLAM: '!' ;
OPEN_PAREN: '(' ;
CLOSE_PAREN: ')' ;
OPEN_BRACE
: '{'
;
CLOSE_BRACE
: '}'
{ if(freemarker_type == 2) {freemarker_type = 0;} }
;
IN: 'in' ;
AS: 'as' ;
ID : ('A'..'Z'|'a'..'z')+
{ System.out.printf("ID \%s \%d\n",getText(),freemarker_type);}
;
BLANK : ( '\r' | ' ' | '\n' | '\t' )+
{
System.out.printf("BLANK \%d\n",freemarker_type);
$channel = HIDDEN;
}
;
My input results with the output:
ID abc 2
BLANK 0 <<< incorrect, should be RAW when state==0
RAW < 0 <<< correct
ID html 0 <<< incorrect, should be RAW RAW RAW RAW
RAW > 0
EDIT2
Also tried the 2nd approach with Bart's grammar, still didn't work the 'html' is recognized as an ID, which should be 4 RAWs. When mmode=false, shouldn't RAW get matched first? Or the lexer still chooses the longest match here?
grammar freemarker_bart;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
FILE;
OUTPUT;
RAW_BLOCK;
}
#parser::members {
// merge a given list of tokens into a single AST
private CommonTree merge(List tokenList) {
StringBuilder b = new StringBuilder();
for(int i = 0; i < tokenList.size(); i++) {
Token token = (Token)tokenList.get(i);
b.append(token.getText());
}
return new CommonTree(new CommonToken(RAW, b.toString()));
}
}
#lexer::members {
private boolean mmode = false;
}
parse
: content* EOF -> ^(FILE content*)
;
content
: (options {greedy=true;}: t+=RAW)+ -> ^(RAW_BLOCK {merge($t)})
| if_stat
| output
;
if_stat
: TAG_START IF expression TAG_END raw_block TAG_END_START IF TAG_END -> ^(IF expression raw_block)
;
output
: OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
;
raw_block
: (t+=RAW)* -> ^(RAW_BLOCK {merge($t)})
;
expression
: eq_expression
;
eq_expression
: atom (EQUALS^ atom)*
;
atom
: STRING
| ID
;
// these tokens denote the start of markup code (sets mmode to true)
OUTPUT_START : '${' {mmode=true;};
TAG_START : '<#' {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});
RAW : {!mmode}?=> . ;
// these tokens denote the end of markup code (sets mmode to false)
OUTPUT_END : '}' {mmode=false;};
TAG_END : '>' {mmode=false;};
// valid tokens only when in "markup mode"
EQUALS : '==';
IF : 'if';
STRING : '"' ~'"'* '"';
ID : ('a'..'z' | 'A'..'Z')+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};

You could let lexer rules match using gated semantic predicates where you test for a certain boolean expression.
A little demo:
freemarker_simple.g
grammar freemarker_simple;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
FILE;
OUTPUT;
RAW_BLOCK;
}
#parser::members {
// merge a given list of tokens into a single AST
private CommonTree merge(List tokenList) {
StringBuilder b = new StringBuilder();
for(int i = 0; i < tokenList.size(); i++) {
Token token = (Token)tokenList.get(i);
b.append(token.getText());
}
return new CommonTree(new CommonToken(RAW, b.toString()));
}
}
#lexer::members {
private boolean mmode = false;
}
parse
: content* EOF -> ^(FILE content*)
;
content
: (options {greedy=true;}: t+=RAW)+ -> ^(RAW_BLOCK {merge($t)})
| if_stat
| output
;
if_stat
: TAG_START IF expression TAG_END raw_block TAG_END_START IF TAG_END -> ^(IF expression raw_block)
;
output
: OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
;
raw_block
: (t+=RAW)* -> ^(RAW_BLOCK {merge($t)})
;
expression
: eq_expression
;
eq_expression
: atom (EQUALS^ atom)*
;
atom
: STRING
| ID
;
// these tokens denote the start of markup code (sets mmode to true)
OUTPUT_START : '${' {mmode=true;};
TAG_START : '<#' {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});
// these tokens denote the end of markup code (sets mmode to false)
OUTPUT_END : {mmode}?=> '}' {mmode=false;};
TAG_END : {mmode}?=> '>' {mmode=false;};
// valid tokens only when in "markup mode"
EQUALS : {mmode}?=> '==';
IF : {mmode}?=> 'if';
STRING : {mmode}?=> '"' ~'"'* '"';
ID : {mmode}?=> ('a'..'z' | 'A'..'Z')+;
SPACE : {mmode}?=> (' ' | '\t' | '\r' | '\n')+ {skip();};
RAW : . ;
which parses your input:
test.html
${abc}
<html>
<head>
<title>Welcome!</title>
</head>
<body>
<h1>
Welcome ${user}<#if user == "Big Joe">, our beloved leader</#if>!
</h1>
<p>Our latest product: ${latestProduct}!</p>
</body>
</html>
into the following AST:
as you can test yourself with the class:
Main.java
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
freemarker_simpleLexer lexer = new freemarker_simpleLexer(new ANTLRFileStream("test.html"));
freemarker_simpleParser parser = new freemarker_simpleParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
EDIT 1
When I run your example input with a parser generated from the second grammar you posted, the following are wthe first 5 lines being printed to the console (not counting the many warnings that are generated):
ID abc 2
RAW
0
RAW < 0
ID html 0
...
EDIT 2
Bood wrote:
Also tried the 2nd approach with Bart's grammar, still didn't work the 'html' is recognized as an ID, which should be 4 RAWs. When mmode=false, shouldn't RAW get matched first? Or the lexer still chooses the longest match here?
Yes, that is correct: ANTLR chooses the longer match in that case.
But now that I (finally :)) see what you're trying to do, here's a last proposal: you could let the RAW rule match characters as long as the rule can't see one of the following character sequences ahead: "<#", "</#" or "${". Note that the rule must still stay at the end in the grammar. This check is performed inside the lexer. Also, in that case you don't need the merge(...) method in the parser:
grammar freemarker_simple;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
FILE;
OUTPUT;
RAW_BLOCK;
}
#lexer::members {
private boolean mmode = false;
private boolean rawAhead() {
if(mmode) return false;
int ch1 = input.LA(1), ch2 = input.LA(2), ch3 = input.LA(3);
return !(
(ch1 == '<' && ch2 == '#') ||
(ch1 == '<' && ch2 == '/' && ch3 == '#') ||
(ch1 == '$' && ch2 == '{')
);
}
}
parse
: content* EOF -> ^(FILE content*)
;
content
: RAW
| if_stat
| output
;
if_stat
: TAG_START IF expression TAG_END RAW TAG_END_START IF TAG_END -> ^(IF expression RAW)
;
output
: OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
;
expression
: eq_expression
;
eq_expression
: atom (EQUALS^ atom)*
;
atom
: STRING
| ID
;
OUTPUT_START : '${' {mmode=true;};
TAG_START : '<#' {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});
OUTPUT_END : '}' {mmode=false;};
TAG_END : '>' {mmode=false;};
EQUALS : '==';
IF : 'if';
STRING : '"' ~'"'* '"';
ID : ('a'..'z' | 'A'..'Z')+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
RAW : ({rawAhead()}?=> . )+;
The grammar above will produce the following AST from the input posted at the start of this answer:

Related

ANTLR parses greedily even though it can match high priority rule

I am using the following ANTLR grammar to define a function.
definition_function
: DEFINE FUNCTION function_name '[' language_name ']'
RETURN attribute_type '{' function_body '}'
;
function_name
: id
;
language_name
: id
;
function_body
: SCRIPT
;
SCRIPT
: '{' ('\u0020'..'\u007e' | ~( '{' | '}' ) )* '}'
{ setText(getText().substring(1, getText().length()-1)); }
;
But when I try to parse two functions like below,
define function concat[Scala] return string {
var concatenatedString = ""
for(i <- 0 until data.length) {
concatenatedString += data(i).toString
}
concatenatedString
};
define function concat[JavaScript] return string {
var str1 = data[0];
var str2 = data[1];
var str3 = data[2];
var res = str1.concat(str2,str3);
return res;
};
Then ANTLR doesn't parse this like two function definitions, but like a single function with the following body,
var concatenatedString = ""
for(i <- 0 until data.length) {
concatenatedString += data(i).toString
}
concatenatedString
};
define function concat[JavaScript] return string {
var str1 = data[0];
var str2 = data[1];
var str3 = data[2];
var res = str1.concat(str2,str3);
return res;
Can you explain this behavior? The body of the function can have anything in it. How can I correctly define this grammar?
Your rule matches that much because '\u0020'..'\u007e' from the rule '{' ('\u0020'..'\u007e' | ~( '{' | '}' ) )* '}' matches both { and }.
Your rule should work if you define it like this:
SCRIPT
: '{' ( SCRIPT | ~( '{' | '}' ) )* '}'
;
However, this will fail when the script block contains, says, strings or comments that contain { or }. Here is a way to match a SCRIPT token, including comments and string literals that could contain { and '}':
SCRIPT
: '{' SCRIPT_ATOM* '}'
;
fragment SCRIPT_ATOM
: ~[{}]
| '"' ~["]* '"'
| '//' ~[\r\n]*
| SCRIPT
;
A complete grammar that properly parses your input would then look like this:
grammar T;
parse
: definition_function* EOF
;
definition_function
: DEFINE FUNCTION function_name '[' language_name ']' RETURN attribute_type SCRIPT ';'
;
function_name
: ID
;
language_name
: ID
;
attribute_type
: ID
;
DEFINE
: 'define'
;
FUNCTION
: 'function'
;
RETURN
: 'return'
;
ID
: [a-zA-Z_] [a-zA-Z_0-9]*
;
SCRIPT
: '{' SCRIPT_ATOM* '}'
;
SPACES
: [ \t\r\n]+ -> skip
;
fragment SCRIPT_ATOM
: ~[{}]
| '"' ~["]* '"'
| '//' ~[\r\n]*
| SCRIPT
;
which also parses the following input properly:
define function concat[JavaScript] return string {
for (;;) {
while (true) { }
}
var s = "}"
// }
return s
};
Unless you absolutely need SCRIPT to be a token (recognized by a lexer rule), you can use a parser rule which recognizes nested blocks (the block rule below). The grammar included here should parse your example as two distinct function definitions.
DEFINE : 'define';
FUNCTION : 'function';
RETURN : 'return';
ID : [A-Za-z]+;
ANY : . ;
WS : [ \r\t\n]+ -> skip ;
test : definition_function* ;
definition_function
: DEFINE FUNCTION function_name '[' language_name ']'
RETURN attribute_type block ';'
;
function_name : id ;
language_name : id ;
attribute_type : 'string' ;
id : ID;
block
: '{' ( ( ~('{'|'}') )+ | block)* '}'
;

Antlr grammar with EBNF

I'm trying to generate a Lexer/Parser through a simple grammar using ANTLR. What I've done right now :
grammar ExprV2;
#header {
package mypack.parte2;
}
#lexer::header {
package mypack.parte2;
}
start
: expr EOF { System.out.println($expr.val);}
;
expr returns [int val]
: term e=exprP[$term.val] { $val = $e.val; }
;
exprP[int i] returns [int val]
: { $val = $i; }
| '+' term e=exprP[$i + $term.val] { $val = $e.val; }
| '-' term e=exprP[$i - $term.val] { $val = $e.val; }
;
term returns [int val]
: fact e=termP[$fact.val] { $val = $e.val; }
;
termP[int i] returns [int val]
: {$val = $i;}
| '*' fact e=termP[$i * $fact.val] {$val = $e.val; }
| '/' fact e=termP[$i / $fact.val] {$val = $e.val; }
;
fact returns [int val]
: '(' expr ')' { $val = $expr.val; }
| NUM { $val=Integer.parseInt($NUM.text); }
;
NUM : '0'..'9'+ ;
WS : (' ' | '\t' |'\r' | '\n')+ { skip(); };
What I would like to obtain is to generate a Lexer/Parser with a grammar written using EBNF but I'm stucked and I don't know how to go ahead. I looked on the internet but I did not succeed in finding anything clear. Thanks to all!

Using Tree Walker with Boolean checks + capturing the whole expression

I have actually two questions that I hope can be answered as they are semi-dependent on my work. Below is the grammar + tree grammar + Java test file.
What I am actually trying to achieve is the following:
Question 1:
I have a grammar that parses my language correctly. I would like to do some semantic checks on variable declarations. So I created a tree walker and so far it semi works. My problem is it's not capturing the whole string of expression. For example,
float x = 10 + 10;
It is only capturing the first part, i.e. 10. I am not sure what I am doing wrong. If I did it in one pass, it works. Somehow, if I split the work into a grammar and tree grammar, it is not capturing the whole string.
Question 2:
I would like to do a check on a rule such that if my conditions returns true, I would like to remove that subtree. For example,
float x = 10;
float x; // <================ I would like this to be removed.
I have tried using rewrite rules but I think it is more complex than that.
Test.g:
grammar Test;
options {
language = Java;
output = AST;
}
parse : varDeclare+
;
varDeclare : type id equalExp? ';'
;
equalExp : ('=' (expression | '...'))
;
expression : binaryExpression
;
binaryExpression : addingExpression (('=='|'!='|'<='|'>='|'>'|'<') addingExpression)*
;
addingExpression : multiplyingExpression (('+'|'-') multiplyingExpression)*
;
multiplyingExpression : unaryExpression
(('*'|'/') unaryExpression)*
;
unaryExpression: ('!'|'-')* primitiveElement;
primitiveElement : literalExpression
| id
| '(' expression ')'
;
literalExpression : INT
;
id : IDENTIFIER
;
type : 'int'
| 'float'
;
// L E X I C A L R U L E S
INT : DIGITS ;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment LETTER : ('a'..'z' | 'A'..'Z' | '_') ;
fragment DIGITS: DIGIT+;
fragment DIGIT : '0'..'9';
TestTree.g:
tree grammar TestTree;
options {
language = Java;
tokenVocab = Test;
ASTLabelType = CommonTree;
}
#members {
SemanticCheck s;
public TestTree(TreeNodeStream input, SemanticCheck s) {
this(input);
this.s = s;
}
}
parse[SemanticCheck s]
: varDeclare+
;
varDeclare : type id equalExp? ';'
{s.check($type.name, $id.text, $equalExp.expr);}
;
equalExp returns [String expr]
: ('=' (expression {$expr = $expression.e;} | '...' {$expr = "...";}))
;
expression returns [String e]
#after {$e = $expression.text;}
: binaryExpression
;
binaryExpression : addingExpression (('=='|'!='|'<='|'>='|'>'|'<') addingExpression)*
;
addingExpression : multiplyingExpression (('+'|'-') multiplyingExpression)*
;
multiplyingExpression : unaryExpression
(('*'|'/') unaryExpression)*
;
unaryExpression: ('!'|'-')* primitiveElement;
primitiveElement : literalExpression
| id
| '(' expression ')'
;
literalExpression : INT
;
id : IDENTIFIER
;
type returns [String name]
#after { $name = $type.text; }
: 'int'
| 'float'
;
Java test file, Test.java:
import java.util.ArrayList;
import java.util.List;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RuleReturnScope;
import org.antlr.runtime.tree.CommonTree;
import org.antlr.runtime.tree.CommonTreeNodeStream;
public class Test {
public static void main(String[] args) throws Exception {
SemanticCheck s = new SemanticCheck();
String src =
"float x = 10+y; \n" +
"float x; \n";
TestLexer lexer = new TestLexer(new ANTLRStringStream(src));
//TestLexer lexer = new TestLexer(new ANTLRFileStream("input.txt"));
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
TestParser parser = new TestParser(tokenStream);
RuleReturnScope r = parser.parse();
System.out.println("Parse Tree:\n" + tokenStream.toString());
CommonTree t = (CommonTree)r.getTree();
CommonTreeNodeStream nodes = new CommonTreeNodeStream(t);
nodes.setTokenStream(tokenStream);
TestTree walker = new TestTree(nodes, s);
walker.parse(s);
}
}
class SemanticCheck {
List<String> names;
public SemanticCheck() {
this.names = new ArrayList<String>();
}
public boolean check(String type, String variableName, String exp) {
System.out.println("Type: " + type + " variableName: " + variableName + " exp: " + exp);
if(names.contains(variableName)) {
System.out.println("Remove statement! Already defined!");
return true;
}
names.add(variableName);
return false;
}
}
Thanks in advance!
I figured out my problem and it turns out I needed to build an AST first before I can do anything. This would help in understanding what is a flat tree look like vs building an AST.
How to output the AST built using ANTLR?
Thanks to Bart's endless examples here in StackOverFlow, I was able to do semantic predicates to do what I needed in the example above.
Below is the updated code:
Test.g
grammar Test;
options {
language = Java;
output = AST;
}
tokens {
VARDECL;
Assign = '=';
EqT = '==';
NEq = '!=';
LT = '<';
LTEq = '<=';
GT = '>';
GTEq = '>=';
NOT = '!';
PLUS = '+';
MINUS = '-';
MULT = '*';
DIV = '/';
}
parse : varDeclare+
;
varDeclare : type id equalExp ';' -> ^(VARDECL type id equalExp)
;
equalExp : (Assign^ (expression | '...' ))
;
expression : binaryExpression
;
binaryExpression : addingExpression ((EqT|NEq|LTEq|GTEq|LT|GT)^ addingExpression)*
;
addingExpression : multiplyingExpression ((PLUS|MINUS)^ multiplyingExpression)*
;
multiplyingExpression : unaryExpression
((MULT|DIV)^ unaryExpression)*
;
unaryExpression: ((NOT|MINUS))^ primitiveElement
| primitiveElement
;
primitiveElement : literalExpression
| id
| '(' expression ')' -> expression
;
literalExpression : INT
;
id : IDENTIFIER
;
type : 'int'
| 'float'
;
// L E X I C A L R U L E S
INT : DIGITS ;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment LETTER : ('a'..'z' | 'A'..'Z' | '_') ;
fragment DIGITS: DIGIT+;
fragment DIGIT : '0'..'9';
This should automatically build an AST whenever you have varDeclare. Now on to the tree grammar/walker.
TestTree.g
tree grammar TestTree;
options {
language = Java;
tokenVocab = Test;
ASTLabelType = CommonTree;
output = AST;
}
tokens {
REMOVED;
}
#members {
SemanticCheck s;
public TestTree(TreeNodeStream input, SemanticCheck s) {
this(input);
this.s = s;
}
}
start[SemanticCheck s] : varDeclare+
;
varDeclare : ^(VARDECL type id equalExp)
-> {s.check($type.text, $id.text, $equalExp.text)}? REMOVED
-> ^(VARDECL type id equalExp)
;
equalExp : ^(Assign expression)
| ^(Assign '...')
;
expression : ^(('!') expression)
| ^(('+'|'-'|'*'|'/') expression expression*)
| ^(('=='|'<='|'<'|'>='|'>'|'!=') expression expression*)
| literalExpression
;
literalExpression : INT
| id
;
id : IDENTIFIER
;
type : 'int'
| 'float'
;
Now on to test it:
Test.java
import java.util.ArrayList;
import java.util.List;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.tree.*;
public class Test {
public static void main(String[] args) throws Exception {
SemanticCheck s = new SemanticCheck();
String src =
"float x = 10; \n" +
"int x = 1; \n";
TestLexer lexer = new TestLexer(new ANTLRStringStream(src));
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
TestParser parser = new TestParser(tokenStream);
TestParser.parse_return r = parser.parse();
System.out.println("Tree:" + ((Tree)r.tree).toStringTree() + "\n");
CommonTreeNodeStream nodes = new CommonTreeNodeStream((Tree)r.tree);
nodes.setTokenStream(tokenStream);
TestTree walker = new TestTree(nodes, s);
TestTree.start_return r2 = walker.start(s);
System.out.println("\nTree Walker: "+((Tree)r2.tree).toStringTree());
}
}
class SemanticCheck {
List<String> names;
public SemanticCheck() {
this.names = new ArrayList<String>();
}
public boolean check(String type, String variableName, String exp) {
System.out.println("Type: " + type + " variableName: " + variableName + " exp: " + exp);
if(names.contains(variableName)) {
return true;
}
names.add(variableName);
return false;
}
}
Output:
Tree:(VARDECL float x (= 10)) (VARDECL int x (= 1))
Type: float variableName: x exp: = 10
Type: int variableName: x exp: = 1
Tree Walker: (VARDECL float x (= 10)) REMOVED
Hope this helps! Please feel free to point any errors if I did something wrong.

Simplified smalltalk grammar using antlr - unary minus and message chaining

I am writing simple smalltalk-like grammar using antlr. It is simplified version of smalltalk, but basic ideas are the same (message passing for example).
Here is my grammar so far:
grammar GAL;
options {
//k=2;
backtrack=true;
}
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
INT : '0'..'9'+
;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
COMMENT
: '"' ( options {greedy=false;} : . )* '"' {$channel=HIDDEN;}
;
WS : ( ' '
| '\t'
) {$channel=HIDDEN;}
;
NEW_LINE
: ('\r'?'\n')
;
STRING
: '\'' ( ESC_SEQ | ~('\\'|'\'') )* '\''
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
BINARY_MESSAGE_CHAR
: ('~' | '!' | '#' | '%' | '&' | '*' | '-' | '+' | '=' | '|' | '\\' | '<' | '>' | ',' | '?' | '/')
('~' | '!' | '#' | '%' | '&' | '*' | '-' | '+' | '=' | '|' | '\\' | '<' | '>' | ',' | '?' | '/')?
;
// parser
program
: NEW_LINE* (statement (NEW_LINE+ | EOF))*
;
statement
: message_sending
| return_statement
| assignment
| temp_variables
;
return_statement
: '^' statement
;
assignment
: identifier ':=' statement
;
temp_variables
: '|' identifier+ '|'
;
object
: raw_object
;
raw_object
: number
| string
| identifier
| literal
| block
| '(' message_sending ')'
;
message_sending
: keyword_message_sending
;
keyword_message_sending
: binary_message_sending keyword_message?
;
binary_message_sending
: unary_message_sending binary_message*
;
unary_message_sending
: object (unary_message)*
;
unary_message
: unary_message_selector
;
binary_message
: binary_message_selector unary_message_sending
;
keyword_message
: (NEW_LINE? single_keyword_message_selector NEW_LINE? binary_message_sending)+
;
block
:
'[' (block_signiture
)? NEW_LINE*
block_body
NEW_LINE* ']'
;
block_body
: (statement
)?
(NEW_LINE+ statement
)*
;
block_signiture
:
(':' identifier
)+ '|'
;
unary_message_selector
: identifier
;
binary_message_selector
: BINARY_MESSAGE_CHAR
;
single_keyword_message_selector
: identifier ':'
;
keyword_message_selector
: single_keyword_message_selector+
;
symbol
: '#' (string | identifier | binary_message_selector | keyword_message_selector)
;
literal
: symbol block? // if there is block then this is method
;
number
: /*'-'?*/
( INT | FLOAT )
;
string
: STRING
;
identifier
: ID
;
1. Unary Minus
I have a problem with unary minus for numbers (commented part for rule number). The problem is that minus is valid binary message. To make things worse two minus signs are also valid binary message. What I need is unary minus in case where there is no object to send binary message to (for example, -3+4 should be unary minus because there is nothing in frot of -3). Also, (-3) should be binary minus too. It would be great if 1 -- -2 would be binary message '--' with parameter -2, but I can live without that. How can I do this?
If I uncomment unary minus I get error MismatchedSetException(0!=null) when parsing something like 1-2.
2. Message chaining
What would be best way to implement message chainging like in smalltalk? What I mean by this is something like this:
obj message1 + 3;
message2;
+ 3;
keyword: 2+3
where every message would be sent to the same object, in this case obj. Message precedence should be kept (unary > binary > keyword).
3. Backtrack
Most of this grammar can be parsed with k=2, but when input is something like this:
1 + 2
Obj message:
1 + 2
message2: 'string'
parser tries to match Obj as single_keyword_message_selector and raises UnwantedTokenExcaption on token message. If remove k=2 and set backtrack=true (as I did) everything works as it should. How can I remove backtrack and get desired behaviour?
Also, most of the grammar can be parsed using k=1, so I tried to set k=2 only for rules that require it, but that is ignored. I did something like this:
rule
options { k = 2; }
: // rule definition
;
but it doesn't work until I set k in global options. What am I missing here?
Update:
It is not ideal solution to write grammar from scratch, because I have a lot of code that depends on it. Also, some features of smalltalk that are missing - are missing by design. This is not intended to be another smalltalk implementation, smalltalk was just an inspiration.
I would be more then happy to have unary minus working in cases like this: -1+2 or 2+(-1). Cases like 2 -- -1 are just not so important.
Also, message chaining is something that should be done as simple as posible. That means that I don't like idea of changeing AST I am generating.
About backtrack - I can live with it, just asked here out of personal curiosity.
This is little modified grammar that generates AST - maybe it will help to better understand what I don't want to change. (temp_variables are probably going to be deleted, I havent made that decision).
grammar GAL;
options {
//k=2;
backtrack=true;
language=CSharp3;
output=AST;
}
tokens {
HASH = '#';
COLON = ':';
DOT = '.';
CARET = '^';
PIPE = '|';
LBRACKET = '[';
RBRACKET = ']';
LPAREN = '(';
RPAREN = ')';
ASSIGN = ':=';
}
// generated files options
#namespace { GAL.Compiler }
#lexer::namespace { GAL.Compiler}
// this will disable CLSComplaint warning in ANTLR generated code
#parser::header {
// Do not bug me about [System.CLSCompliant(false)]
#pragma warning disable 3021
}
#lexer::header {
// Do not bug me about [System.CLSCompliant(false)]
#pragma warning disable 3021
}
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
INT : '0'..'9'+
;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
COMMENT
: '"' ( options {greedy=false;} : . )* '"' {$channel=Hidden;}
;
WS : ( ' '
| '\t'
) {$channel=Hidden;}
;
NEW_LINE
: ('\r'?'\n')
;
STRING
: '\'' ( ESC_SEQ | ~('\\'|'\'') )* '\''
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
BINARY_MESSAGE_CHAR
: ('~' | '!' | '#' | '%' | '&' | '*' | '-' | '+' | '=' | '|' | '\\' | '<' | '>' | ',' | '?' | '/')
('~' | '!' | '#' | '%' | '&' | '*' | '-' | '+' | '=' | '|' | '\\' | '<' | '>' | ',' | '?' | '/')?
;
// parser
public program returns [ AstProgram program ]
: { $program = new AstProgram(); }
NEW_LINE*
( statement (NEW_LINE+ | EOF)
{ $program.AddStatement($statement.stmt); }
)*
;
statement returns [ AstNode stmt ]
: message_sending
{ $stmt = $message_sending.messageSending; }
| return_statement
{ $stmt = $return_statement.ret; }
| assignment
{ $stmt = $assignment.assignment; }
| temp_variables
{ $stmt = $temp_variables.tempVars; }
;
return_statement returns [ AstReturn ret ]
: CARET statement
{ $ret = new AstReturn($CARET, $statement.stmt); }
;
assignment returns [ AstAssignment assignment ]
: dotted_expression ASSIGN statement
{ $assignment = new AstAssignment($dotted_expression.dottedExpression, $ASSIGN, $statement.stmt); }
;
temp_variables returns [ AstTempVariables tempVars ]
: p1=PIPE
{ $tempVars = new AstTempVariables($p1); }
( identifier
{ $tempVars.AddVar($identifier.identifier); }
)+
p2=PIPE
{ $tempVars.EndToken = $p2; }
;
object returns [ AstNode obj ]
: number
{ $obj = $number.number; }
| string
{ $obj = $string.str; }
| dotted_expression
{ $obj = $dotted_expression.dottedExpression; }
| literal
{ $obj = $literal.literal; }
| block
{ $obj = $block.block; }
| LPAREN message_sending RPAREN
{ $obj = $message_sending.messageSending; }
;
message_sending returns [ AstKeywordMessageSending messageSending ]
: keyword_message_sending
{ $messageSending = $keyword_message_sending.keywordMessageSending; }
;
keyword_message_sending returns [ AstKeywordMessageSending keywordMessageSending ]
: binary_message_sending
{ $keywordMessageSending = new AstKeywordMessageSending($binary_message_sending.binaryMessageSending); }
( keyword_message
{ $keywordMessageSending = $keywordMessageSending.NewMessage($keyword_message.keywordMessage); }
)?
;
binary_message_sending returns [ AstBinaryMessageSending binaryMessageSending ]
: unary_message_sending
{ $binaryMessageSending = new AstBinaryMessageSending($unary_message_sending.unaryMessageSending); }
( binary_message
{ $binaryMessageSending = $binaryMessageSending.NewMessage($binary_message.binaryMessage); }
)*
;
unary_message_sending returns [ AstUnaryMessageSending unaryMessageSending ]
: object
{ $unaryMessageSending = new AstUnaryMessageSending($object.obj); }
(
unary_message
{ $unaryMessageSending = $unaryMessageSending.NewMessage($unary_message.unaryMessage); }
)*
;
unary_message returns [ AstUnaryMessage unaryMessage ]
: unary_message_selector
{ $unaryMessage = new AstUnaryMessage($unary_message_selector.unarySelector); }
;
binary_message returns [ AstBinaryMessage binaryMessage ]
: binary_message_selector unary_message_sending
{ $binaryMessage = new AstBinaryMessage($binary_message_selector.binarySelector, $unary_message_sending.unaryMessageSending); }
;
keyword_message returns [ AstKeywordMessage keywordMessage ]
:
{ $keywordMessage = new AstKeywordMessage(); }
(
NEW_LINE?
single_keyword_message_selector
NEW_LINE?
binary_message_sending
{ $keywordMessage.AddMessagePart($single_keyword_message_selector.singleKwSelector, $binary_message_sending.binaryMessageSending); }
)+
;
block returns [ AstBlock block ]
: LBRACKET
{ $block = new AstBlock($LBRACKET); }
(
block_signiture
{ $block.Signiture = $block_signiture.blkSigniture; }
)? NEW_LINE*
block_body
{ $block.Body = $block_body.blkBody; }
NEW_LINE*
RBRACKET
{ $block.SetEndToken($RBRACKET); }
;
block_body returns [ IList<AstNode> blkBody ]
#init { $blkBody = new List<AstNode>(); }
:
( s1=statement
{ $blkBody.Add($s1.stmt); }
)?
( NEW_LINE+ s2=statement
{ $blkBody.Add($s2.stmt); }
)*
;
block_signiture returns [ AstBlockSigniture blkSigniture ]
#init { $blkSigniture = new AstBlockSigniture(); }
:
( COLON identifier
{ $blkSigniture.AddIdentifier($COLON, $identifier.identifier); }
)+ PIPE
{ $blkSigniture.SetEndToken($PIPE); }
;
unary_message_selector returns [ AstUnaryMessageSelector unarySelector ]
: identifier
{ $unarySelector = new AstUnaryMessageSelector($identifier.identifier); }
;
binary_message_selector returns [ AstBinaryMessageSelector binarySelector ]
: BINARY_MESSAGE_CHAR
{ $binarySelector = new AstBinaryMessageSelector($BINARY_MESSAGE_CHAR); }
;
single_keyword_message_selector returns [ AstIdentifier singleKwSelector ]
: identifier COLON
{ $singleKwSelector = $identifier.identifier; }
;
keyword_message_selector returns [ AstKeywordMessageSelector keywordSelector ]
#init { $keywordSelector = new AstKeywordMessageSelector(); }
:
( single_keyword_message_selector
{ $keywordSelector.AddIdentifier($single_keyword_message_selector.singleKwSelector); }
)+
;
symbol returns [ AstSymbol symbol ]
: HASH
( string
{ $symbol = new AstSymbol($HASH, $string.str); }
| identifier
{ $symbol = new AstSymbol($HASH, $identifier.identifier); }
| binary_message_selector
{ $symbol = new AstSymbol($HASH, $binary_message_selector.binarySelector); }
| keyword_message_selector
{ $symbol = new AstSymbol($HASH, $keyword_message_selector.keywordSelector); }
)
;
literal returns [ AstNode literal ]
: symbol
{ $literal = $symbol.symbol; }
( block
{ $literal = new AstMethod($symbol.symbol, $block.block); }
)? // if there is block then this is method
;
number returns [ AstNode number ]
: /*'-'?*/
( INT
{ $number = new AstInt($INT); }
| FLOAT
{ $number = new AstInt($FLOAT); }
)
;
string returns [ AstString str ]
: STRING
{ $str = new AstString($STRING); }
;
dotted_expression returns [ AstDottedExpression dottedExpression ]
: i1=identifier
{ $dottedExpression = new AstDottedExpression($i1.identifier); }
(DOT i2=identifier
{ $dottedExpression.AddIdentifier($i2.identifier); }
)*
;
identifier returns [ AstIdentifier identifier ]
: ID
{ $identifier = new AstIdentifier($ID); }
;
Hi Smalltalk Grammar writer,
Firstly, to get a smalltalk grammar to parse properly (1 -- -2) and to support the optional '.' on the last statement, etc., you should treat whitespace as significant. Don't put it on the hidden channel.
The grammar so far is not breaking down the rules into small enough fragments. This will be a problem like you have seen with K=2 and backtracking.
I suggest you check out a working Smalltalk grammar in ANTLR as defined by the Redline Smalltalk project http://redline.st & https://github.com/redline-smalltalk/redline-smalltalk
Rgs, James.

ANTLR Grammar for Liquid Markup?

Anyone know of any ANTLR grammar for Liquid Markup or a JAVA library that can work with it? I have taken a look at Jangod but it doesn't seem to work much.
Thanks!
Here's a grammar:
grammar Liquid;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
ASSIGNMENT;
ATTRIBUTES;
BLOCK;
CAPTURE;
CASE;
COMMENT;
CYCLE;
ELSE;
FILTERS;
FILTER;
FOR_ARRAY;
FOR_RANGE;
GROUP;
IF;
INCLUDE;
LOOKUP;
OUTPUT;
PARAMS;
PLAIN;
RAW;
TABLE;
UNLESS;
WHEN;
WITH;
}
#parser::members {
#Override
public void reportError(RecognitionException e) {
throw new RuntimeException(e);
}
}
#lexer::members {
private boolean inTag = false;
private boolean openTagAhead() {
return input.LA(1) == '{' && (input.LA(2) == '{' || input.LA(2) == '\u0025');
}
#Override
public void reportError(RecognitionException e) {
throw new RuntimeException(e);
}
}
/* parser rules */
parse
: block EOF -> block
;
block
: (options{greedy=true;}: atom)* -> ^(BLOCK atom*)
;
atom
: tag
| output
| assignment
| Other -> ^(PLAIN Other)
;
tag
: raw_tag
| comment_tag
| if_tag
| unless_tag
| case_tag
| cycle_tag
| for_tag
| table_tag
| capture_tag
| include_tag
;
raw_tag
: TagStart RawStart TagEnd raw_body TagStart RawEnd TagEnd
-> ^(RAW raw_body)
;
raw_body
: ~TagStart*
;
comment_tag
: TagStart CommentStart TagEnd comment_body TagStart CommentEnd TagEnd
-> ^(COMMENT comment_body)
;
comment_body
: ~TagStart*
;
if_tag
: TagStart IfStart expr TagEnd block else_tag? TagStart IfEnd TagEnd
-> ^(IF expr block ^(ELSE else_tag?))
;
else_tag
: TagStart Else TagEnd block
-> block
;
unless_tag
: TagStart UnlessStart expr TagEnd block else_tag? TagStart UnlessEnd TagEnd
-> ^(UNLESS expr block ^(ELSE else_tag?))
;
case_tag
: TagStart CaseStart expr TagEnd when_tag+ else_tag? TagStart CaseEnd TagEnd
-> ^(CASE expr when_tag+ ^(ELSE else_tag?))
;
when_tag
: TagStart When expr TagEnd block
-> ^(WHEN expr block)
;
cycle_tag
: TagStart Cycle cycle_group? expr (Comma expr)* TagEnd
-> ^(CYCLE ^(GROUP cycle_group?) expr+)
;
cycle_group
: expr Col -> expr
;
for_tag
: for_array
| for_range
;
for_array // attributes must be 'limit' or 'offset'!
: TagStart ForStart Id In lookup attribute* TagEnd block TagStart ForEnd TagEnd
-> ^(FOR_ARRAY Id lookup ^(ATTRIBUTES attribute*) block)
;
attribute
: Id Col expr -> ^(Id expr)
;
for_range
: TagStart ForStart Id In OPar expr DotDot expr CPar TagEnd block TagStart ForEnd TagEnd
-> ^(FOR_RANGE Id expr expr block)
;
table_tag // attributes must be 'limit' or 'cols'!
: TagStart TableStart Id In Id attribute* TagEnd block TagStart TableEnd TagEnd
-> ^(TABLE Id Id ^(ATTRIBUTES attribute*) block)
;
capture_tag
: TagStart CaptureStart Id TagEnd block TagStart CaptureEnd TagEnd
-> ^(CAPTURE Id block)
;
include_tag
: TagStart Include a=Str (With b=Str)? TagEnd
-> ^(INCLUDE $a ^(WITH $b?))
;
output
: OutStart expr filter* OutEnd
-> ^(OUTPUT expr ^(FILTERS filter*))
;
filter
: Pipe Id params?
-> ^(FILTER Id ^(PARAMS params?))
;
params
: Col expr (Comma expr)* -> expr+
;
assignment
: TagStart Assign Id EqSign expr TagEnd
-> ^(ASSIGNMENT Id expr)
;
expr
: or_expr
;
or_expr
: and_expr (Or^ and_expr)*
;
and_expr
: eq_expr (And^ eq_expr)*
;
eq_expr
: rel_expr ((Eq | NEq)^ rel_expr)*
;
rel_expr
: term ((LtEq | Lt | GtEq | Gt)^ term)?
;
term
: Num
| Str
| True
| False
| Nil
| lookup
;
lookup
: Id (Dot Id)* -> ^(LOOKUP Id+)
;
/* lexer rules */
OutStart : '{{' {inTag=true;};
OutEnd : '}}' {inTag=false;};
TagStart : '{%' {inTag=true;};
TagEnd : '%}' {inTag=false;};
Str : {inTag}?=> (SStr | DStr);
DotDot : {inTag}?=> '..';
Dot : {inTag}?=> '.';
NEq : {inTag}?=> '!=';
Eq : {inTag}?=> '==';
EqSign : {inTag}?=> '=';
GtEq : {inTag}?=> '>=';
Gt : {inTag}?=> '>';
LtEq : {inTag}?=> '<=';
Lt : {inTag}?=> '<';
Pipe : {inTag}?=> '|';
Col : {inTag}?=> ':';
Comma : {inTag}?=> ',';
OPar : {inTag}?=> '(';
CPar : {inTag}?=> ')';
Num : {inTag}?=> Digit+;
WS : {inTag}?=> (' ' | '\t' | '\r' | '\n')+ {skip();};
Id
: {inTag}?=> (Letter | '_') (Letter | '_' | '-' | Digit)*
{
if($text.equals("capture")) $type = CaptureStart;
else if($text.equals("endcapture")) $type = CaptureEnd;
else if($text.equals("comment")) $type = CommentStart;
else if($text.equals("endcomment")) $type = CommentEnd;
else if($text.equals("raw")) $type = RawStart;
else if($text.equals("endraw")) $type = RawEnd;
else if($text.equals("if")) $type = IfStart;
else if($text.equals("endif")) $type = IfEnd;
else if($text.equals("unless")) $type = UnlessStart;
else if($text.equals("endunless")) $type = UnlessEnd;
else if($text.equals("else")) $type = Else;
else if($text.equals("case")) $type = CaseStart;
else if($text.equals("endcase")) $type = CaseEnd;
else if($text.equals("when")) $type = When;
else if($text.equals("cycle")) $type = Cycle;
else if($text.equals("for")) $type = ForStart;
else if($text.equals("endfor")) $type = ForEnd;
else if($text.equals("in")) $type = In;
else if($text.equals("and")) $type = And;
else if($text.equals("or")) $type = Or;
else if($text.equals("tablerow")) $type = TableStart;
else if($text.equals("endtablerow")) $type = TableEnd;
else if($text.equals("assign")) $type = Assign;
else if($text.equals("true")) $type = True;
else if($text.equals("false")) $type = False;
else if($text.equals("nil")) $type = Nil;
else if($text.equals("include")) $type = Include;
else if($text.equals("with")) $type = With;
}
;
Other
: ({!inTag && !openTagAhead()}?=> . )+
{
String s = getText().replaceAll("\\s+", " ").trim();
if(s.isEmpty()) {
skip();
}
else {
setText(s);
}
}
;
/* fragment rules */
fragment Letter : 'a'..'z' | 'A'..'Z';
fragment Digit : '0'..'9';
fragment SStr : '\'' ~'\''* '\'';
fragment DStr : '"' ~'"'* '"';
fragment CommentStart : ;
fragment CommentEnd : ;
fragment RawStart : ;
fragment RawEnd : ;
fragment IfStart : ;
fragment IfEnd : ;
fragment UnlessStart : ;
fragment UnlessEnd : ;
fragment Else : ;
fragment CaseStart : ;
fragment CaseEnd : ;
fragment When : ;
fragment Cycle : ;
fragment ForStart : ;
fragment ForEnd : ;
fragment In : ;
fragment And : ;
fragment Or : ;
fragment TableStart : ;
fragment TableEnd : ;
fragment Assign : ;
fragment True : ;
fragment False : ;
fragment Nil : ;
fragment Include : ;
fragment With : ;
fragment CaptureStart : ;
fragment CaptureEnd : ;
I have dusted the thing off a bit and put it in a Github repository: https://github.com/bkiers/Liqp
Be aware: although I have successfully used this grammar in the past, the input might have been rather "easy". If you're going to use it, and run into problems, I'd appreciate it if you let me know. If you're looking for a more robust, thoroughly tested library/parser/grammar, this might not be what you're looking for.