Parsing a templating language - antlr

I'm trying to parse a templating language and I'm having trouble correctly parsing the arbitrary html that can appear between tags. So far what I have is below, any suggestions? An example of a valid input would be
{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}This Should Be Parsed as a Buffer.{/bar2}
And the grammar is:
grammar g;
options {
language=Java;
output=AST;
ASTLabelType=CommonTree;
}
/* LEXER RULES */
tokens {
}
LD : '{';
RD : '}';
LOOP : '#';
END_LOOP: '/';
PARTIAL : '>';
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
IDENT : (LETTER | '_') (LETTER | '_' | DIGIT)*;
BUFFER options {greedy=false;} : ~(LD | RD)+ ;
/* PARSER RULES */
start : body EOF
;
body : (tag | loop | partial | BUFFER)*
;
tag : LD! IDENT^ RD!
;
loop : LD! LOOP^ IDENT RD!
body
LD! END_LOOP! IDENT RD!
;
partial : LD! PARTIAL^ IDENT RD!
;
buffer : BUFFER
;

Your lexer tokenizes independently from your parser. If your parser tries to match a BUFFER token, the lexer does not take this info into account. In your case with input like: "blah blah blah", the lexer creates 3 IDENT tokens, not a single BUFFER token.
What you need to "tell" your lexer is that when you're inside a tag (i.e. you encountered a LD tag), a IDENT token should be created, and when you're outside a tag (i.e. you encountered a RD tag), a BUFFER token should be created instead of an IDENT token.
In order to implement this, you need to:
create a boolean flag inside the lexer that keeps track of the fact that you're in- or outside a tag. This can be done inside the #lexer::members { ... } section of your grammar;
after the lexer either creates a LD- or RD-token, flip the boolean flag from (1). This can be done in the #after{ ... } section of the lexer rules;
before creating a BUFFER token inside the lexer, check if you're outside a tag at the moment. This can be done by using a semantic predicate at the start of your lexer rule.
A short demo:
grammar g;
options {
output=AST;
ASTLabelType=CommonTree;
}
#lexer::members {
private boolean insideTag = false;
}
start
: body EOF -> body
;
body
: (tag | loop | partial | BUFFER)*
;
tag
: LD IDENT RD -> IDENT
;
loop
: LD LOOP IDENT RD body LD END_LOOP IDENT RD -> ^(LOOP body IDENT IDENT)
;
partial
: LD PARTIAL IDENT RD -> ^(PARTIAL IDENT)
;
LD #after{insideTag=true;} : '{';
RD #after{insideTag=false;} : '}';
LOOP : '#';
END_LOOP : '/';
PARTIAL : '>';
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
IDENT : (LETTER | '_') (LETTER | '_' | DIGIT)*;
BUFFER : {!insideTag}?=> ~(LD | RD)+;
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
(note that you probably want to discard spaces between tag, so I added a SPACE rule and discarded these spaces)
Test it with the following class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}" +
"This Should Be Parsed as a Buffer.{/bar2}";
gLexer lexer = new gLexer(new ANTLRStringStream(src));
gParser parser = new gParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.start().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
and after running the main class:
*nix/MacOS
java -cp antlr-3.3.jar org.antlr.Tool g.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
Windows
java -cp antlr-3.3.jar org.antlr.Tool g.g
javac -cp antlr-3.3.jar *.java
java -cp .;antlr-3.3.jar Main
You'll see some DOT-source being printed to the console, which corresponds to the following AST:
(image created using graphviz-dev.appspot.com)

Related

Why am I getting "mismatched input 'addr' expecting {<EOF>, 'addr'}"

Given this g4 grammar:
grammar smaller;
root
: ( componentDefinition )* EOF;
componentDefinition
: Addr
Id?
Lbrace
Rbrace
Semi
;
ExprElem
: Num
| Id
;
Addr : 'addr' {System.out.println("addr");};
Lbrace : '{' ;
Rbrace : '}' ;
Semi : ';' ;
Id : [a-zA-z0-9_]+ {System.out.println("id");};
Num : [0-9]+;
//------------------------------------------------
// Whitespace and Comments
//------------------------------------------------
Wspace : [ \t]+ -> skip;
Newline : ('\r' '\n'?
| '\n'
) -> skip;
and this file to parse
addr basic {
};
this cmdline:
rm *.class *.java ; java -Xmx500M org.antlr.v4.Tool smaller.g4 ; javac *.java ; cat basic | java org.antlr.v4.runtime.misc.TestRig smaller root -tree
I get this error:
line 2:0 mismatched input 'addr' expecting {<EOF>, 'addr'}
(root addr basic { } ;)
If I remove the ExprElem (which is not used anywhere else in the grammar), the parser works:
addr
id
(root (componentDefinition addr basic { } ;) <EOF>)
Why? Note that this is a greatly reduced version of the grammar. Normally, the ExprElem does have a purpose.
Addr is a literal, so it shouldn't conflict with Id in the way that other questions like this usually do.
Your rule ExprElem is a lexer rule, not a parser rule (it begins with an upercase) and is masking the Addr rule, so, no Addr :(
Also, as ExprElem is a lexer rule and it relies on Id or Num rule. Consequently, when an Id is found, ANTLR lexer gives it the ExprElem token type and not the Id token type.
So, two things, you can either rewrite your ExprElem rule to exprElem (assuming you want a parser rule):
exprElem : Num | Id;
or you can use Id token in your ExprElem as part of the rule but you need something that can differentiate ExprElem from Id (example below, but I really think you want a parser rule):
Addr : 'addr' {System.out.println("addr");};
ExprElem
: Sharp Num // This token use others but defines its own 'pattern'
| Sharp Id
;
Lbrace : '{' ;
Rbrace : '}' ;
Semi : ';' ;
Id : [a-zA-z0-9_]+ {System.out.println("id");};
Num : [0-9]+;
Sharp : '#';
From what I suppose, this is definitely not what you want, but I just put it here to illustrate how lexer rule can reuse others.
When you have doubt about what your token do, do not hesitate to display the recognize tokens. Here is the Java code fragment I often use (I named your grammar test in this case):
public class Main {
public static void main(String[] args) throws InterruptedException {
String txt =
"addr Basic {\n"
+ "\n"
+ "};";
TestLexer lexer = new TestLexer(new ANTLRInputStream(txt));
CommonTokenStream tokens = new CommonTokenStream(lexer);
TestParser parser = new TestParser(tokens);
parser.root();
for (Token t : tokens.getTokens()) {
System.out.println(t);
}
}
}
NOTE: by the way, Num will never be recognized as Id rule can match the same thing. Try this instead:
Id : Letter (Letter | [0-9])*;
Num : [0-9]+;
fragment Letter : [a-zA-z_];

Solving antlr left recursion

I'm trying to parse a language using ANTLR which can contain the following syntax:
someVariable, somVariable.someMember, functionCall(param).someMember, foo.bar.baz(bjork).buffalo().xyzzy
This is the ANTLR grammar which i've come up with so far, and the access_operation throws the error
The following sets of rules are mutually left-recursive [access_operation, expression]:
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
LHS;
RHS;
CALL;
PARAMS;
}
start
: body? EOF
;
body
: expression (',' expression)*
;
expression
: function -> ^(CALL)
| access_operation
| atom
;
access_operation
: (expression -> ^(LHS)) '.'! (expression -> ^(RHS))
;
function
: (IDENT '(' body? ')') -> ^(IDENT PARAMS?)
;
atom
: IDENT
| NUMBER
;
fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT : '0'..'9';
IDENT : (LETTER)+ ;
NUMBER : (DIGIT)+ ;
SPACE : (' ' | '\t' | '\r' | '\n') { $channel=HIDDEN; };
What i could manage so far was to refactor the access_operation rule to '.' expression which generates an AST where the access_operation node only contains the right side of the operation.
What i'm looking for instead is something like this:
How can the left-recursion problem solved in this case?
By "wrong AST" I'll make a semi educated guess that, for input like "foo.bar.baz", you get an AST where foo is the root with bar as a child who in its turn has baz as a child, which is a leaf in the AST. You may want to have this reversed. But I'd not go for such an AST if I were you: I'd keep the AST as flat as possible:
foo
/ | \
/ | \
bar baz ...
That way, evaluating is far easier: you simply look up foo, and then walk from left to right through its children.
A quick demo:
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
BODY;
ACCESS;
CALL;
PARAMS;
}
start
: body EOF -> body
;
body
: expression (',' expression)* -> ^(BODY expression+)
;
expression
: atom
;
atom
: NUMBER
| (IDENT -> IDENT) ( tail -> ^(IDENT tail)
| call tail? -> ^(CALL IDENT call tail?)
)?
;
tail
: (access)+
;
access
: ('.' IDENT -> ^(ACCESS IDENT)) (call -> ^(CALL IDENT call))?
;
call
: '(' (expression (',' expression)*)? ')' -> ^(PARAMS expression*)
;
IDENT : LETTER+;
NUMBER : DIGIT+;
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT : '0'..'9';
which can be tested with:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "someVariable, somVariable.someMember, functionCall(param).someMember, " +
"foo.bar.baz(bjork).buffalo().xyzzy";
TestLexer lexer = new TestLexer(new ANTLRStringStream(src));
TestParser parser = new TestParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.start().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
The output of Main corresponds to the following AST:
EDIT
And since you indicated your ultimate goal is not evaluating the input, but that you rather need to conform the structure of the AST to some 3rd party API, here's a grammar that will create an AST like you indicated in your edited question:
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
BODY;
ACCESS_OP;
CALL;
PARAMS;
LHS;
RHS;
}
start
: body EOF -> body
;
body
: expression (',' expression)* -> ^(BODY expression+)
;
expression
: atom
;
atom
: NUMBER
| (ID -> ID) ( ('(' params ')' -> ^(CALL ID params))
('.' expression -> ^(ACCESS_OP ^(LHS ^(CALL ID params)) ^(RHS expression)))?
| '.' expression -> ^(ACCESS_OP ^(LHS ID) ^(RHS expression))
)?
;
params
: (expression (',' expression)*)? -> ^(PARAMS expression*)
;
ID : LETTER+;
NUMBER : DIGIT+;
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT : '0'..'9';
which creates the following AST if you run the Main class:
The atom rule may be a bit daunting, but you can't shorten it much since the left ID needs to be available to most of the alternatives. ANTLRWorks helps in visualizing the alternative paths this rule may take:
which means atom can be any of the 5 following alternatives (with their corresponding AST's):
+----------------------+--------------------------------------------------------+
| alternative | generated AST |
+----------------------+--------------------------------------------------------+
| NUMBER | NUMBER |
| ID | ID |
| ID params | ^(CALL ID params) |
| ID params expression | ^(ACCESS_OP ^(LHS ^(CALL ID params)) ^(RHS expression))|
| ID expression | ^(ACCESS_OP ^(LHS ID) ^(RHS expression) |
+----------------------+--------------------------------------------------------+

Switching lexer state in antlr3 grammar

I'm trying to construct an antlr grammar to parse a templating language. that language can be embedded in any text and the boundaries are marked with opening/closing tags: {{ / }}. So a valid template looks like this:
foo {{ someVariable }} bar
Where foo and bar should be ignored, and the part inside the {{ and }} tags should be parsed. I've found this question which basically has an answer for the problem, except that the tags are only one { and }. I've tried to modify the grammar to match 2 opening/closing characters, but as soon as i do this, the BUFFER rule consumes ALL characters, also the opening and closing brackets. The LD rule is never being invoked.
Has anyone an idea why the antlr lexer is consuming all tokens in the Buffer rule when the delimiters have 2 characters, but does not consume the delimiters when they have only one character?
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
#lexer::members {
private boolean insideTag = false;
}
start
: (tag | BUFFER )*
;
tag
: LD IDENT^ RD
;
LD #after {
// flip lexer the state
insideTag=true;
System.err.println("FLIPPING TAG");
} : '{{';
RD #after {
// flip the state back
insideTag=false;
} : '}}';
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
IDENT : (LETTER)*;
BUFFER : { !insideTag }?=> ~(LD | RD)+;
fragment LETTER : ('a'..'z' | 'A'..'Z');
You can match any character once or more until you see {{ ahead by including a predicate inside the parenthesis ( ... )+ (see the BUFFER rule in the demo).
A demo:
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
#lexer::members {
private boolean insideTag = false;
}
start
: tag EOF
;
tag
: LD IDENT^ RD
;
LD
#after {insideTag=true;}
: '{{'
;
RD
#after {insideTag=false;}
: '}}'
;
BUFFER
: ({!insideTag && !(input.LA(1)=='{' && input.LA(2)=='{')}?=> .)+ {$channel=HIDDEN;}
;
SPACE
: (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
;
IDENT
: ('a'..'z' | 'A'..'Z')+
;
Note that it's best to keep the BUFFER rule as the first lexer rule in your grammar: that way, it will be the first token that is tried.
If you now parse "foo {{ someVariable }} bar", the following AST is created:
Wouldn't a grammar like this fit your needs? I don't see why the BUFFER needs to be that complicated.
grammar test;
options {
output=AST;
ASTLabelType=CommonTree;
}
#lexer::members {
private boolean inTag=false;
}
start
: tag* EOF
;
tag
: LD IDENT RD -> IDENT
;
LD
#after { inTag=true; }
: '{{'
;
RD
#after { inTag=false; }
: '}}'
;
IDENT : {inTag}?=> ('a'..'z'|'A'..'Z'|'_') 'a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
BUFFER
: . {$channel=HIDDEN;}
;

ANTLR - test class not found error when debugging

I am getting error of Test class not found error, even when I have made it via command
java org.antlr.Tool something.g
when debugging via ANTLRworks. I have been searching this on the web, but no success with it. Do you know, how to solve my problem?
Thanks
Edit - grammar:
grammar Expr;
prog : stat+ ;
stat : expr NEWLINE
| ID '=' expr NEWLINE
| NEWLINE
;
expr : multExpr (('+' |'-' ) multExpr)*;
multExpr: atom ('*' atom)*
;
atom : INT
| ID
| '(' expr ')'
;
ID : ('a'..'z' |'A'..'Z' )+ ;
INT : '0'..'9' + ;
NEWLINE:'\r' ? '\n' ;
WS : (' ' |'\t' |'\n' |'\r' )+ {skip();} ;
Error message:
When debugging with imput text 3*4 makes output - "Error: Could not find or load main class Test" While, Test is generated in subfolder /output...
Since you commented in the comments that running a test class from the console also failed, here's one that works:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
ExprLexer lex = new ExprLexer(new ANTLRStringStream("1 + 2 - 3 * 4 - E\n"));
CommonTokenStream tokens = new CommonTokenStream(lex);
ExprParser parser = new ExprParser(tokens);
parser.prog();
}
}
Now put your ANTLR JAR (v3.3 in my example) in the same directory as Main.java and Expr.g and run the main class from your console like this:
java -cp antlr-3.3.jar org.antlr.Tool Expr.g
javac -cp antlr-3.3.jar *.java
java -cp .;antlr-3.3.jar Main
The fact that nothing is printed to your console means that the parsing went without any problems.

Using antlr to parse a | separated file

So I think this should be easy, but I'm having a tough time with it. I'm trying to parse a | delimited file, and any line that doesn't start with a | is a comment. I guess I don't understand how comments work. It always errors out on a comment line. This is a legacy file, so there's no changing it. Here's my grammar.
grammar Route;
#header {
package org.benheath.codegeneration;
}
#lexer::header {
package org.benheath.codegeneration;
}
file: line+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;
COMMENT: ~'|' .* '\n' ;
ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'#'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace
Data:
! a comment
Another comment
| a | abc | b | def | ...
A grammar for that would look like this:
parse
: line* EOF
;
line
: ( comment | values ) ( NL | EOF )
;
comment
: ELEMENT+
;
values
: PIPE ( ELEMENT PIPE )+
;
PIPE
: '|'
;
ELEMENT
: ('a'..'z')+
;
NL
: '\r'? '\n' | '\r'
;
WS
: (' '|'\t') {$channel=HIDDEN;}
;
And to test it, you just need to sprinkle a bit of code in your grammar like this:
grammar Route;
#members {
List<List<String>> values = new ArrayList<List<String>>();
}
parse
: line* EOF
;
line
: ( comment | v=values {values.add($v.line);} ) ( NL | EOF )
;
comment
: ELEMENT+
;
values returns [List<String> line]
#init {line = new ArrayList<String>();}
: PIPE ( e=ELEMENT {line.add($e.text);} PIPE )*
;
PIPE
: '|'
;
ELEMENT
: ('a'..'z')+
;
NL
: '\r'? '\n' | '\r'
;
WS
: (' '|'\t') {$channel=HIDDEN;}
;
Now generate a lexer/parser by invoking:
java -cp antlr-3.2.jar org.antlr.Tool Route.g
create a class RouteTest.java:
import org.antlr.runtime.*;
import java.util.List;
public class RouteTest {
public static void main(String[] args) throws Exception {
String data =
"a comment\n"+
"| xxxxx | y | zzz |\n"+
"another comment\n"+
"| a | abc | b | def |";
ANTLRStringStream in = new ANTLRStringStream(data);
RouteLexer lexer = new RouteLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
RouteParser parser = new RouteParser(tokens);
parser.parse();
for(List<String> line : parser.values) {
System.out.println(line);
}
}
}
Compile all source files:
javac -cp antlr-3.2.jar *.java
and run the class RouteTest:
// Windows
java -cp .;antlr-3.2.jar RouteTest
// *nix/MacOS
java -cp .:antlr-3.2.jar RouteTest
If all goes well, you see this printed to your console:
[xxxxx, y, zzz]
[a, abc, b, def]
Edit: note that I simplified it a bit by only allowing lower case letters, you can always expand the set of course.
It's a nice idea to use ANTLR for a job like this, although I do think it's overkill. For example, it would be very easy to (in pseudo-code):
for each line in file:
if line begins with '|':
fields = /|\s*([a-z]+)\s*/g
Edit: Well, you can't express the distinction between comments and lines lexically, because there is nothing lexical that distinguishes them. A hint to get you in one workable direction.
line: comment | fields;
comment: NONBAR+ (BAR|NONBAR+) '\n';
fields = (BAR NONBAR)+;
This seems to work, I swear I tried it. Changing comment to lower case switched it to the parser vs the lexer, I still don't get it.
grammar Route;
#header {
package org.benheath.codegeneration;
}
#lexer::header {
package org.benheath.codegeneration;
}
file: (line|comment)+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;
comment : ~'|' .* '\n';
ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'#'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace