Simplified Smalltalk grammar using ANTLR - unary minus and message chaining

I am writing a simple Smalltalk-like grammar using ANTLR. It is a simplified version of Smalltalk, but the basic ideas are the same (message passing, for example).
Here is my grammar so far:
grammar GAL;
options {
//k=2;
backtrack=true;
}
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
INT : '0'..'9'+
;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
COMMENT
: '"' ( options {greedy=false;} : . )* '"' {$channel=HIDDEN;}
;
WS : ( ' '
| '\t'
) {$channel=HIDDEN;}
;
NEW_LINE
: ('\r'?'\n')
;
STRING
: '\'' ( ESC_SEQ | ~('\\'|'\'') )* '\''
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
BINARY_MESSAGE_CHAR
: ('~' | '!' | '#' | '%' | '&' | '*' | '-' | '+' | '=' | '|' | '\\' | '<' | '>' | ',' | '?' | '/')
('~' | '!' | '#' | '%' | '&' | '*' | '-' | '+' | '=' | '|' | '\\' | '<' | '>' | ',' | '?' | '/')?
;
// parser
program
: NEW_LINE* (statement (NEW_LINE+ | EOF))*
;
statement
: message_sending
| return_statement
| assignment
| temp_variables
;
return_statement
: '^' statement
;
assignment
: identifier ':=' statement
;
temp_variables
: '|' identifier+ '|'
;
object
: raw_object
;
raw_object
: number
| string
| identifier
| literal
| block
| '(' message_sending ')'
;
message_sending
: keyword_message_sending
;
keyword_message_sending
: binary_message_sending keyword_message?
;
binary_message_sending
: unary_message_sending binary_message*
;
unary_message_sending
: object (unary_message)*
;
unary_message
: unary_message_selector
;
binary_message
: binary_message_selector unary_message_sending
;
keyword_message
: (NEW_LINE? single_keyword_message_selector NEW_LINE? binary_message_sending)+
;
block
:
'[' (block_signiture
)? NEW_LINE*
block_body
NEW_LINE* ']'
;
block_body
: (statement
)?
(NEW_LINE+ statement
)*
;
block_signiture
:
(':' identifier
)+ '|'
;
unary_message_selector
: identifier
;
binary_message_selector
: BINARY_MESSAGE_CHAR
;
single_keyword_message_selector
: identifier ':'
;
keyword_message_selector
: single_keyword_message_selector+
;
symbol
: '#' (string | identifier | binary_message_selector | keyword_message_selector)
;
literal
: symbol block? // if there is block then this is method
;
number
: /*'-'?*/
( INT | FLOAT )
;
string
: STRING
;
identifier
: ID
;
1. Unary Minus
I have a problem with unary minus for numbers (the commented-out part of the number rule). The problem is that minus is a valid binary message. To make things worse, two minus signs are also a valid binary message. What I need is unary minus where there is no object to send a binary message to (for example, -3+4 should be unary minus because there is nothing in front of -3). Also, the minus in (-3) should be unary too. It would be great if 1 -- -2 were the binary message '--' with parameter -2, but I can live without that. How can I do this?
If I uncomment unary minus I get the error MismatchedSetException(0!=null) when parsing something like 1-2.
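The rule described above (a minus is a sign only when no operand precedes it) can be pictured outside ANTLR. The following is a minimal Python sketch under that assumption, not part of the GAL grammar; the token kinds (`INT`, `ID`, `PAREN`, `OP`, `NEG`) and the function `tokenize` are hypothetical names chosen for illustration.

```python
import re

# Hypothetical token pattern: integers, identifiers, parentheses, and
# one- or two-character binary selectors built from '+', '-', '*', '/'.
TOKEN = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|([()])|([-+*/]{1,2}))")

def tokenize(text):
    """Return (kind, text) pairs; a '-' with no operand before it is a sign."""
    tokens = []
    pos = 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at {pos}")
        number, ident, paren, op = m.groups()
        pos = m.end()
        if number:
            tokens.append(("INT", number))
        elif ident:
            tokens.append(("ID", ident))
        elif paren:
            tokens.append(("PAREN", paren))
        else:
            # An operand precedes the operator if the previous token is a
            # value or a closing parenthesis.
            prev = tokens[-1] if tokens else None
            has_operand = prev is not None and (
                prev[0] in ("INT", "ID") or prev == ("PAREN", ")")
            )
            if op == "-" and not has_operand:
                tokens.append(("NEG", op))   # unary minus
            else:
                tokens.append(("OP", op))    # binary message selector
    return tokens
```

With this rule, `-3+4` and `2+(-1)` both yield a `NEG` token, while `1-2` yields a binary `OP`. An input such as `2+-1` still lexes `+-` as a single two-character selector, which mirrors the BINARY_MESSAGE_CHAR rule in the grammar.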
2. Message chaining
What would be the best way to implement message chaining like in Smalltalk? What I mean is something like this:
obj message1 + 3;
message2;
+ 3;
keyword: 2+3
where every message would be sent to the same object, in this case obj. Message precedence should be kept (unary > binary > keyword).
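One way to model the chaining described above is a single receiver plus a list of message parts. The sketch below is in Python rather than ANTLR, and the names `Cascade`, `send`, `eval_cascade`, and `Recorder` are hypothetical, not part of the GAL compiler. It assumes, as the question does, that every part is sent to the same receiver.

```python
from dataclasses import dataclass, field

@dataclass
class Cascade:
    """A receiver plus message parts that are all sent to that receiver."""
    receiver: object
    parts: list = field(default_factory=list)  # (selector, args) pairs

def send(receiver, selector, args):
    """Hypothetical dispatch: look the selector up as a Python method."""
    return getattr(receiver, selector)(*args)

def eval_cascade(cascade):
    """Send every part to the same receiver; return the last result."""
    result = None
    for selector, args in cascade.parts:
        result = send(cascade.receiver, selector, args)
    return result

class Recorder:
    """Toy receiver that records which messages it was sent."""
    def __init__(self):
        self.log = []

    def message2(self):
        self.log.append("message2")
        return self

    def plus(self, n):
        self.log.append(f"+ {n}")
        return n

    def keyword(self, n):
        self.log.append(f"keyword: {n}")
        return n

obj = Recorder()
cascade = Cascade(obj, [("message2", []), ("plus", [3]), ("keyword", [2 + 3])])
last = eval_cascade(cascade)
```

Here the arguments are already evaluated before the cascade runs, so precedence inside each part (unary > binary > keyword) would be handled by the existing message rules.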
3. Backtrack
Most of this grammar can be parsed with k=2, but when the input is something like this:
1 + 2
Obj message:
1 + 2
message2: 'string'
the parser tries to match Obj as single_keyword_message_selector and raises UnwantedTokenException on the token message. If I remove k=2 and set backtrack=true (as I did), everything works as it should. How can I remove backtrack and get the desired behaviour?
Also, most of the grammar can be parsed using k=1, so I tried to set k=2 only for the rules that require it, but that is ignored. I did something like this:
rule
options { k = 2; }
: // rule definition
;
but it doesn't work until I set k in the global options. What am I missing here?
Update:
Writing the grammar from scratch is not an ideal solution, because I have a lot of code that depends on it. Also, the Smalltalk features that are missing are missing by design. This is not intended to be another Smalltalk implementation; Smalltalk was just an inspiration.
I would be more than happy to have unary minus working in cases like -1+2 or 2+(-1). Cases like 2 -- -1 are just not that important.
Also, message chaining should be done as simply as possible. That means I don't like the idea of changing the AST I am generating.
About backtrack: I can live with it; I only asked out of personal curiosity.
This is a slightly modified grammar that generates an AST; maybe it will help to show what I don't want to change. (temp_variables will probably be deleted; I haven't made that decision yet.)
grammar GAL;
options {
//k=2;
backtrack=true;
language=CSharp3;
output=AST;
}
tokens {
HASH = '#';
COLON = ':';
DOT = '.';
CARET = '^';
PIPE = '|';
LBRACKET = '[';
RBRACKET = ']';
LPAREN = '(';
RPAREN = ')';
ASSIGN = ':=';
}
// generated files options
@namespace { GAL.Compiler }
@lexer::namespace { GAL.Compiler}
// this will disable the CLSCompliant warning in ANTLR generated code
@parser::header {
// Do not bug me about [System.CLSCompliant(false)]
#pragma warning disable 3021
}
@lexer::header {
// Do not bug me about [System.CLSCompliant(false)]
#pragma warning disable 3021
}
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
INT : '0'..'9'+
;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
COMMENT
: '"' ( options {greedy=false;} : . )* '"' {$channel=Hidden;}
;
WS : ( ' '
| '\t'
) {$channel=Hidden;}
;
NEW_LINE
: ('\r'?'\n')
;
STRING
: '\'' ( ESC_SEQ | ~('\\'|'\'') )* '\''
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
BINARY_MESSAGE_CHAR
: ('~' | '!' | '#' | '%' | '&' | '*' | '-' | '+' | '=' | '|' | '\\' | '<' | '>' | ',' | '?' | '/')
('~' | '!' | '#' | '%' | '&' | '*' | '-' | '+' | '=' | '|' | '\\' | '<' | '>' | ',' | '?' | '/')?
;
// parser
public program returns [ AstProgram program ]
: { $program = new AstProgram(); }
NEW_LINE*
( statement (NEW_LINE+ | EOF)
{ $program.AddStatement($statement.stmt); }
)*
;
statement returns [ AstNode stmt ]
: message_sending
{ $stmt = $message_sending.messageSending; }
| return_statement
{ $stmt = $return_statement.ret; }
| assignment
{ $stmt = $assignment.assignment; }
| temp_variables
{ $stmt = $temp_variables.tempVars; }
;
return_statement returns [ AstReturn ret ]
: CARET statement
{ $ret = new AstReturn($CARET, $statement.stmt); }
;
assignment returns [ AstAssignment assignment ]
: dotted_expression ASSIGN statement
{ $assignment = new AstAssignment($dotted_expression.dottedExpression, $ASSIGN, $statement.stmt); }
;
temp_variables returns [ AstTempVariables tempVars ]
: p1=PIPE
{ $tempVars = new AstTempVariables($p1); }
( identifier
{ $tempVars.AddVar($identifier.identifier); }
)+
p2=PIPE
{ $tempVars.EndToken = $p2; }
;
object returns [ AstNode obj ]
: number
{ $obj = $number.number; }
| string
{ $obj = $string.str; }
| dotted_expression
{ $obj = $dotted_expression.dottedExpression; }
| literal
{ $obj = $literal.literal; }
| block
{ $obj = $block.block; }
| LPAREN message_sending RPAREN
{ $obj = $message_sending.messageSending; }
;
message_sending returns [ AstKeywordMessageSending messageSending ]
: keyword_message_sending
{ $messageSending = $keyword_message_sending.keywordMessageSending; }
;
keyword_message_sending returns [ AstKeywordMessageSending keywordMessageSending ]
: binary_message_sending
{ $keywordMessageSending = new AstKeywordMessageSending($binary_message_sending.binaryMessageSending); }
( keyword_message
{ $keywordMessageSending = $keywordMessageSending.NewMessage($keyword_message.keywordMessage); }
)?
;
binary_message_sending returns [ AstBinaryMessageSending binaryMessageSending ]
: unary_message_sending
{ $binaryMessageSending = new AstBinaryMessageSending($unary_message_sending.unaryMessageSending); }
( binary_message
{ $binaryMessageSending = $binaryMessageSending.NewMessage($binary_message.binaryMessage); }
)*
;
unary_message_sending returns [ AstUnaryMessageSending unaryMessageSending ]
: object
{ $unaryMessageSending = new AstUnaryMessageSending($object.obj); }
(
unary_message
{ $unaryMessageSending = $unaryMessageSending.NewMessage($unary_message.unaryMessage); }
)*
;
unary_message returns [ AstUnaryMessage unaryMessage ]
: unary_message_selector
{ $unaryMessage = new AstUnaryMessage($unary_message_selector.unarySelector); }
;
binary_message returns [ AstBinaryMessage binaryMessage ]
: binary_message_selector unary_message_sending
{ $binaryMessage = new AstBinaryMessage($binary_message_selector.binarySelector, $unary_message_sending.unaryMessageSending); }
;
keyword_message returns [ AstKeywordMessage keywordMessage ]
:
{ $keywordMessage = new AstKeywordMessage(); }
(
NEW_LINE?
single_keyword_message_selector
NEW_LINE?
binary_message_sending
{ $keywordMessage.AddMessagePart($single_keyword_message_selector.singleKwSelector, $binary_message_sending.binaryMessageSending); }
)+
;
block returns [ AstBlock block ]
: LBRACKET
{ $block = new AstBlock($LBRACKET); }
(
block_signiture
{ $block.Signiture = $block_signiture.blkSigniture; }
)? NEW_LINE*
block_body
{ $block.Body = $block_body.blkBody; }
NEW_LINE*
RBRACKET
{ $block.SetEndToken($RBRACKET); }
;
block_body returns [ IList<AstNode> blkBody ]
@init { $blkBody = new List<AstNode>(); }
:
( s1=statement
{ $blkBody.Add($s1.stmt); }
)?
( NEW_LINE+ s2=statement
{ $blkBody.Add($s2.stmt); }
)*
;
block_signiture returns [ AstBlockSigniture blkSigniture ]
@init { $blkSigniture = new AstBlockSigniture(); }
:
( COLON identifier
{ $blkSigniture.AddIdentifier($COLON, $identifier.identifier); }
)+ PIPE
{ $blkSigniture.SetEndToken($PIPE); }
;
unary_message_selector returns [ AstUnaryMessageSelector unarySelector ]
: identifier
{ $unarySelector = new AstUnaryMessageSelector($identifier.identifier); }
;
binary_message_selector returns [ AstBinaryMessageSelector binarySelector ]
: BINARY_MESSAGE_CHAR
{ $binarySelector = new AstBinaryMessageSelector($BINARY_MESSAGE_CHAR); }
;
single_keyword_message_selector returns [ AstIdentifier singleKwSelector ]
: identifier COLON
{ $singleKwSelector = $identifier.identifier; }
;
keyword_message_selector returns [ AstKeywordMessageSelector keywordSelector ]
@init { $keywordSelector = new AstKeywordMessageSelector(); }
:
( single_keyword_message_selector
{ $keywordSelector.AddIdentifier($single_keyword_message_selector.singleKwSelector); }
)+
;
symbol returns [ AstSymbol symbol ]
: HASH
( string
{ $symbol = new AstSymbol($HASH, $string.str); }
| identifier
{ $symbol = new AstSymbol($HASH, $identifier.identifier); }
| binary_message_selector
{ $symbol = new AstSymbol($HASH, $binary_message_selector.binarySelector); }
| keyword_message_selector
{ $symbol = new AstSymbol($HASH, $keyword_message_selector.keywordSelector); }
)
;
literal returns [ AstNode literal ]
: symbol
{ $literal = $symbol.symbol; }
( block
{ $literal = new AstMethod($symbol.symbol, $block.block); }
)? // if there is block then this is method
;
number returns [ AstNode number ]
: /*'-'?*/
( INT
{ $number = new AstInt($INT); }
| FLOAT
{ $number = new AstInt($FLOAT); }
)
;
string returns [ AstString str ]
: STRING
{ $str = new AstString($STRING); }
;
dotted_expression returns [ AstDottedExpression dottedExpression ]
: i1=identifier
{ $dottedExpression = new AstDottedExpression($i1.identifier); }
(DOT i2=identifier
{ $dottedExpression.AddIdentifier($i2.identifier); }
)*
;
identifier returns [ AstIdentifier identifier ]
: ID
{ $identifier = new AstIdentifier($ID); }
;

Hi Smalltalk Grammar writer,
Firstly, to get a Smalltalk grammar to parse (1 -- -2) properly and to support the optional '.' on the last statement, etc., you should treat whitespace as significant. Don't put it on the hidden channel.
The grammar so far does not break the rules down into small enough fragments. This will be a problem, as you have seen with k=2 and backtracking.
I suggest you check out a working Smalltalk grammar in ANTLR as defined by the Redline Smalltalk project http://redline.st & https://github.com/redline-smalltalk/redline-smalltalk
Rgs, James.

Related

ANTLR doesn't complain about unclosed comment

I'm trying to define a comment rule that allows == bla bla bla == but not == on its own.
FYI, originally, I had added F_COMMENT!? between everything and still was getting the same problem. I've left one F_COMMENT!? in the expression rule in some vain attempt at getting it working.
When I debug == in ANTLRWorks (1.5.2) it just ignores the == and returns the EOF token.
Here's some of the grammar...
expression
: F_COMMENT!? condition? EOF;
WS
: ( '\t' | ' ' | '\r' | '\n' | F_COMMENT)+ { $channel = HIDDEN; } ;
F_COMMENT
: '==' ( options {greedy=false;} : . )* '==';
UPDATE:
I've created a concise grammar for this problem which seems to be working, for the == case at least...
@members {
@Override
public void displayRecognitionError(String[] tokenNames, RecognitionException e) {
String hdr = getErrorHeader(e);
String msg = getErrorMessage(e, tokenNames);
throw new RuntimeException(hdr + ":" + msg);
}
}
expression
: condition? EOF;
condition
: (F_COMMENT!)* cnd_word;
cnd_word
: ( CND_WORD );
WS
: ( '\t' | ' ' | '\r' | '\n' )+ { $channel = HIDDEN; } ;
CND_WORD
: ('=' | '*'? (F_QUOTEDWORD+ | F_WORDCHARS+) '*'?) | '*';
fragment
F_COMMENT
: '==' ~('=')* '==';
fragment
F_QUOTEDWORD
: '"' ( ~('\\'|'"') | ('\\' '"') )* '"';
fragment
F_WORDCHARS
: ('a'..'z'|'A'..'Z'|'0'..'9')+;
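The revised F_COMMENT fragment above can be sanity-checked with an equivalent regular expression. The following Python sketch is only an illustration of the pattern `'==' ~('=')* '=='`, not of how ANTLR tokenizes the input; the helper name `is_comment` is hypothetical.

```python
import re

# Python regex mirroring the F_COMMENT fragment: '==' ~('=')* '=='
F_COMMENT = re.compile(r"==[^=]*==")

def is_comment(text):
    """True when the whole text is one closed '== ... ==' comment."""
    return F_COMMENT.fullmatch(text) is not None
```

A lone `==` has no closing delimiter and is rejected, while `== bla bla bla ==` is accepted. Because the body excludes `=`, a comment such as `== a = b ==` is rejected as well, which may or may not be what is wanted.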

ANTLR v3 Treewalker class. How to evaluate right associative function such as Factorial

I'm trying to build an expression evaluator with ANTLR v3, but I can't get the factorial function to work because it is right associative.
This is the code:
class ExpressionParser extends Parser;
options { buildAST=true; }
imaginaryTokenDefinitions :
SIGN_MINUS
SIGN_PLUS;
expr : LPAREN^ sumExpr RPAREN! ;
sumExpr : prodExpr ((PLUS^|MINUS^) prodExpr)* ;
prodExpr : powExpr ((MUL^|DIV^|MOD^) powExpr)* ;
powExpr : runary (POW^ runary)? ;
runary : unary (FAT)?;
unary : (SIN^|COS^|TAN^|LOG^|LN^|RAD^)* signExpr;
signExpr : (
m:MINUS^ {#m.setType(SIGN_MINUS);}
| p:PLUS^ {#p.setType(SIGN_PLUS);}
)? atom ;
atom : NUMBER | expr ;
class ExpressionLexer extends Lexer;
PLUS : '+' ;
MINUS : '-' ;
MUL : '*' ;
DIV : '/' ;
MOD : '%' ;
POW : '^' ;
SIN : 's' ;
COS : 'c' ;
TAN : 't' ;
LOG : 'l' ;
LN : 'n' ;
RAD : 'r' ;
FAT : 'f' ;
LPAREN: '(' ;
RPAREN: ')' ;
SEMI : ';' ;
protected DIGIT : '0'..'9' ;
NUMBER : (DIGIT)+ ('.' (DIGIT)+)?;
{import java.lang.Math;}
class ExpressionTreeWalker extends TreeParser;
expr returns [double r]
{ double a,b; int i,f=1; r=0; }
: #(PLUS a=expr b=expr) { r=a+b; }
| #(MINUS a=expr b=expr) { r=a-b; }
| #(MUL a=expr b=expr) { r=a*b; }
| #(DIV a=expr b=expr) { r=a/b; }
| #(MOD a=expr b=expr) { r=a%b; }
| #(POW a=expr b=expr) { r=Math.pow(a,b); }
| #(SIN a=expr ) { r=Math.sin(a); }
| #(COS a=expr ) { r=Math.cos(a); }
| #(TAN a=expr ) { r=Math.tan(a); }
| #(LOG a=expr ) { r=Math.log10(a); }
| #(LN a=expr ) { r=Math.log(a); }
| #(RAD a=expr ) { r=Math.sqrt(a); }
| #(FAT a=expr ) { for(i=1; i<=a; i++){f=f*i;}; r=(double)f;}
| #(LPAREN a=expr) { r=a; }
| #(SIGN_MINUS a=expr) { r=-1*a; }
| #(SIGN_PLUS a=expr) { if(a<0)r=0-a; else r=a; }
| d:NUMBER { r=Double.parseDouble(d.getText()); } ;
If I change the FAT matching case in the tree walker class to something like this:
| #(a=expr FAT ) { for(i=1; i<=a; i++){f=f*i;}; r=(double)f;}
I get these errors:
Expression.g:56:7: rule classDef trapped:
Expression.g:56:7: unexpected token: a
error: aborting grammar 'ExpressionTreeWalker' due to errors
Exiting due to errors.
Your tree walker (the original one) is fine, as far as I can see.
However, you probably need to mark FAT in the grammar:
runary : unary (FAT^)?;
(Note the hat ^, as in all the other productions.)
Edit:
As explained in the ANTLR 3 wiki, the hat operator is needed to make the node the "root of subtree created for entire enclosing rule even if nested in a subrule". In this case, the factorial token (FAT) is nested in an optional subrule ((FAT)?). That's independent of whether the operator is prefix or postfix.
Note that in your grammar the factorial operator is not right-associative, since applying it twice (a!! in the usual notation) is not valid at all. But I would say that associativity is only meaningful for infix operators.
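To see why the original tree walker alternative works once FAT is a subtree root, it may help to look at the tree shape directly. The sketch below uses nested Python tuples as a stand-in for the AST; the tuple layout and the function `evaluate` are assumptions made for illustration, not ANTLR API.

```python
import math

def evaluate(node):
    """Evaluate a tiny AST given as nested tuples: (operator, child, ...)."""
    if isinstance(node, (int, float)):
        return float(node)
    op, *children = node
    values = [evaluate(child) for child in children]
    if op == "PLUS":
        return values[0] + values[1]
    if op == "MUL":
        return values[0] * values[1]
    if op == "FAT":
        # FAT is the root of its subtree, so the operand is its only child,
        # even though the operator is written after the operand in the input.
        return float(math.factorial(int(values[0])))
    raise ValueError(f"unknown operator: {op}")

# With FAT^ the operand "3f" becomes the subtree (FAT 3), so a sum such as
# 2+3f has this shape:
tree = ("PLUS", 2, ("FAT", 3))
```

The walker pattern `#(FAT a=expr)` matches exactly this shape: the operator first, then its single child.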

ANTLR parses greedily even though it can match high priority rule

I am using the following ANTLR grammar to define a function.
definition_function
: DEFINE FUNCTION function_name '[' language_name ']'
RETURN attribute_type '{' function_body '}'
;
function_name
: id
;
language_name
: id
;
function_body
: SCRIPT
;
SCRIPT
: '{' ('\u0020'..'\u007e' | ~( '{' | '}' ) )* '}'
{ setText(getText().substring(1, getText().length()-1)); }
;
But when I try to parse two functions like below,
define function concat[Scala] return string {
var concatenatedString = ""
for(i <- 0 until data.length) {
concatenatedString += data(i).toString
}
concatenatedString
};
define function concat[JavaScript] return string {
var str1 = data[0];
var str2 = data[1];
var str3 = data[2];
var res = str1.concat(str2,str3);
return res;
};
Then ANTLR doesn't parse this as two function definitions, but as a single function with the following body,
var concatenatedString = ""
for(i <- 0 until data.length) {
concatenatedString += data(i).toString
}
concatenatedString
};
define function concat[JavaScript] return string {
var str1 = data[0];
var str2 = data[1];
var str3 = data[2];
var res = str1.concat(str2,str3);
return res;
Can you explain this behavior? The body of the function can have anything in it. How can I correctly define this grammar?
Your rule matches that much because the range '\u0020'..'\u007e' in the rule '{' ('\u0020'..'\u007e' | ~( '{' | '}' ) )* '}' matches both { and }.
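The overlap is easy to confirm outside ANTLR. This short Python check (an illustration only; the variable names are arbitrary) tests whether the code points of { and } fall inside the range '\u0020'..'\u007e'.

```python
# Code points covered by the range '\u0020'..'\u007e' in the original rule.
printable = range(0x20, 0x7E + 1)

# '{' is U+007B and '}' is U+007D, so both lie inside the range.
brace_in_range = {ch: ord(ch) in printable for ch in "{}"}
```

Since both braces are inside the range, the first alternative of the loop consumes them and the second alternative never gets to stop the match at a closing brace.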
Your rule should work if you define it like this:
SCRIPT
: '{' ( SCRIPT | ~( '{' | '}' ) )* '}'
;
However, this will fail when the script block contains, say, strings or comments that contain { or }. Here is a way to match a SCRIPT token, including comments and string literals that could contain { and }:
SCRIPT
: '{' SCRIPT_ATOM* '}'
;
fragment SCRIPT_ATOM
: ~[{}]
| '"' ~["]* '"'
| '//' ~[\r\n]*
| SCRIPT
;
A complete grammar that properly parses your input would then look like this:
grammar T;
parse
: definition_function* EOF
;
definition_function
: DEFINE FUNCTION function_name '[' language_name ']' RETURN attribute_type SCRIPT ';'
;
function_name
: ID
;
language_name
: ID
;
attribute_type
: ID
;
DEFINE
: 'define'
;
FUNCTION
: 'function'
;
RETURN
: 'return'
;
ID
: [a-zA-Z_] [a-zA-Z_0-9]*
;
SCRIPT
: '{' SCRIPT_ATOM* '}'
;
SPACES
: [ \t\r\n]+ -> skip
;
fragment SCRIPT_ATOM
: ~[{}]
| '"' ~["]* '"'
| '//' ~[\r\n]*
| SCRIPT
;
which also parses the following input properly:
define function concat[JavaScript] return string {
for (;;) {
while (true) { }
}
var s = "}"
// }
return s
};
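The idea behind SCRIPT and SCRIPT_ATOM (count nested braces, but skip over string literals and line comments) can also be written by hand. The following Python function is a rough approximation of that idea, not a translation of ANTLR's lexer algorithm; the name `match_script` is hypothetical.

```python
def match_script(text, start=0):
    """Return the index just past the '}' that closes the '{' at `start`."""
    if text[start] != "{":
        raise ValueError("expected '{'")
    depth = 0
    i = start
    while i < len(text):
        ch = text[i]
        if ch == '"':
            # Skip a double-quoted string, like '"' ~["]* '"' in SCRIPT_ATOM.
            end = text.find('"', i + 1)
            if end == -1:
                raise ValueError("unterminated string")
            i = end + 1
            continue
        if text.startswith("//", i):
            # Skip a line comment up to (not including) the line break.
            while i < len(text) and text[i] not in "\r\n":
                i += 1
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unbalanced braces")
```

Called on the start of a function body, it stops at the brace that closes that body, so the following `;` and the next `define function` are left for the rest of the parse.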
Unless you absolutely need SCRIPT to be a token (recognized by a lexer rule), you can use a parser rule which recognizes nested blocks (the block rule below). The grammar included here should parse your example as two distinct function definitions.
DEFINE : 'define';
FUNCTION : 'function';
RETURN : 'return';
ID : [A-Za-z]+;
ANY : . ;
WS : [ \r\t\n]+ -> skip ;
test : definition_function* ;
definition_function
: DEFINE FUNCTION function_name '[' language_name ']'
RETURN attribute_type block ';'
;
function_name : id ;
language_name : id ;
attribute_type : 'string' ;
id : ID;
block
: '{' ( ( ~('{'|'}') )+ | block)* '}'
;

Antlr grammar with EBNF

I'm trying to generate a lexer/parser from a simple grammar using ANTLR. What I've done so far:
grammar ExprV2;
@header {
package mypack.parte2;
}
@lexer::header {
package mypack.parte2;
}
start
: expr EOF { System.out.println($expr.val);}
;
expr returns [int val]
: term e=exprP[$term.val] { $val = $e.val; }
;
exprP[int i] returns [int val]
: { $val = $i; }
| '+' term e=exprP[$i + $term.val] { $val = $e.val; }
| '-' term e=exprP[$i - $term.val] { $val = $e.val; }
;
term returns [int val]
: fact e=termP[$fact.val] { $val = $e.val; }
;
termP[int i] returns [int val]
: {$val = $i;}
| '*' fact e=termP[$i * $fact.val] {$val = $e.val; }
| '/' fact e=termP[$i / $fact.val] {$val = $e.val; }
;
fact returns [int val]
: '(' expr ')' { $val = $expr.val; }
| NUM { $val=Integer.parseInt($NUM.text); }
;
NUM : '0'..'9'+ ;
WS : (' ' | '\t' |'\r' | '\n')+ { skip(); };
What I would like is to generate a lexer/parser from a grammar written using EBNF, but I'm stuck and don't know how to proceed. I looked on the internet but did not find anything clear. Thanks to all!
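For comparison, an EBNF formulation replaces the tail rules exprP and termP with repetition, for example `expr : term (('+' | '-') term)*`. The Python sketch below shows how such rules map onto loops in a hand-written recursive-descent evaluator. It is an illustration of the EBNF shape only, not generated ANTLR code, and the function names are hypothetical.

```python
import re

def int_div(a, b):
    """Integer division truncating toward zero, like Java's int '/'."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

def evaluate(text):
    """Evaluate integer arithmetic with loops in place of exprP/termP."""
    # Simple tokenizer: characters other than digits and operators are ignored.
    tokens = re.findall(r"\d+|[-+*/()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        token = tokens[pos]
        pos += 1
        return token

    # expr : term (('+' | '-') term)* ;
    def expr():
        value = term()
        while peek() in ("+", "-"):
            if advance() == "+":
                value += term()
            else:
                value -= term()
        return value

    # term : fact (('*' | '/') fact)* ;
    def term():
        value = fact()
        while peek() in ("*", "/"):
            if advance() == "*":
                value *= fact()
            else:
                value = int_div(value, fact())
        return value

    # fact : '(' expr ')' | NUM ;
    def fact():
        token = advance()
        if token == "(":
            value = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if not token.isdigit():
            raise SyntaxError(f"unexpected token {token!r}")
        return int(token)

    result = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result
```

Each `( ... )*` group becomes a `while` loop that accumulates the value, which is what the inherited attribute `i` was carrying through exprP and termP.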

ANTLR Grammar for Liquid Markup?

Does anyone know of an ANTLR grammar for Liquid Markup, or a Java library that can work with it? I have taken a look at Jangod, but it doesn't seem to work very well.
Thanks!
Here's a grammar:
grammar Liquid;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
ASSIGNMENT;
ATTRIBUTES;
BLOCK;
CAPTURE;
CASE;
COMMENT;
CYCLE;
ELSE;
FILTERS;
FILTER;
FOR_ARRAY;
FOR_RANGE;
GROUP;
IF;
INCLUDE;
LOOKUP;
OUTPUT;
PARAMS;
PLAIN;
RAW;
TABLE;
UNLESS;
WHEN;
WITH;
}
@parser::members {
@Override
public void reportError(RecognitionException e) {
throw new RuntimeException(e);
}
}
@lexer::members {
private boolean inTag = false;
private boolean openTagAhead() {
return input.LA(1) == '{' && (input.LA(2) == '{' || input.LA(2) == '\u0025');
}
@Override
public void reportError(RecognitionException e) {
throw new RuntimeException(e);
}
}
/* parser rules */
parse
: block EOF -> block
;
block
: (options{greedy=true;}: atom)* -> ^(BLOCK atom*)
;
atom
: tag
| output
| assignment
| Other -> ^(PLAIN Other)
;
tag
: raw_tag
| comment_tag
| if_tag
| unless_tag
| case_tag
| cycle_tag
| for_tag
| table_tag
| capture_tag
| include_tag
;
raw_tag
: TagStart RawStart TagEnd raw_body TagStart RawEnd TagEnd
-> ^(RAW raw_body)
;
raw_body
: ~TagStart*
;
comment_tag
: TagStart CommentStart TagEnd comment_body TagStart CommentEnd TagEnd
-> ^(COMMENT comment_body)
;
comment_body
: ~TagStart*
;
if_tag
: TagStart IfStart expr TagEnd block else_tag? TagStart IfEnd TagEnd
-> ^(IF expr block ^(ELSE else_tag?))
;
else_tag
: TagStart Else TagEnd block
-> block
;
unless_tag
: TagStart UnlessStart expr TagEnd block else_tag? TagStart UnlessEnd TagEnd
-> ^(UNLESS expr block ^(ELSE else_tag?))
;
case_tag
: TagStart CaseStart expr TagEnd when_tag+ else_tag? TagStart CaseEnd TagEnd
-> ^(CASE expr when_tag+ ^(ELSE else_tag?))
;
when_tag
: TagStart When expr TagEnd block
-> ^(WHEN expr block)
;
cycle_tag
: TagStart Cycle cycle_group? expr (Comma expr)* TagEnd
-> ^(CYCLE ^(GROUP cycle_group?) expr+)
;
cycle_group
: expr Col -> expr
;
for_tag
: for_array
| for_range
;
for_array // attributes must be 'limit' or 'offset'!
: TagStart ForStart Id In lookup attribute* TagEnd block TagStart ForEnd TagEnd
-> ^(FOR_ARRAY Id lookup ^(ATTRIBUTES attribute*) block)
;
attribute
: Id Col expr -> ^(Id expr)
;
for_range
: TagStart ForStart Id In OPar expr DotDot expr CPar TagEnd block TagStart ForEnd TagEnd
-> ^(FOR_RANGE Id expr expr block)
;
table_tag // attributes must be 'limit' or 'cols'!
: TagStart TableStart Id In Id attribute* TagEnd block TagStart TableEnd TagEnd
-> ^(TABLE Id Id ^(ATTRIBUTES attribute*) block)
;
capture_tag
: TagStart CaptureStart Id TagEnd block TagStart CaptureEnd TagEnd
-> ^(CAPTURE Id block)
;
include_tag
: TagStart Include a=Str (With b=Str)? TagEnd
-> ^(INCLUDE $a ^(WITH $b?))
;
output
: OutStart expr filter* OutEnd
-> ^(OUTPUT expr ^(FILTERS filter*))
;
filter
: Pipe Id params?
-> ^(FILTER Id ^(PARAMS params?))
;
params
: Col expr (Comma expr)* -> expr+
;
assignment
: TagStart Assign Id EqSign expr TagEnd
-> ^(ASSIGNMENT Id expr)
;
expr
: or_expr
;
or_expr
: and_expr (Or^ and_expr)*
;
and_expr
: eq_expr (And^ eq_expr)*
;
eq_expr
: rel_expr ((Eq | NEq)^ rel_expr)*
;
rel_expr
: term ((LtEq | Lt | GtEq | Gt)^ term)?
;
term
: Num
| Str
| True
| False
| Nil
| lookup
;
lookup
: Id (Dot Id)* -> ^(LOOKUP Id+)
;
/* lexer rules */
OutStart : '{{' {inTag=true;};
OutEnd : '}}' {inTag=false;};
TagStart : '{%' {inTag=true;};
TagEnd : '%}' {inTag=false;};
Str : {inTag}?=> (SStr | DStr);
DotDot : {inTag}?=> '..';
Dot : {inTag}?=> '.';
NEq : {inTag}?=> '!=';
Eq : {inTag}?=> '==';
EqSign : {inTag}?=> '=';
GtEq : {inTag}?=> '>=';
Gt : {inTag}?=> '>';
LtEq : {inTag}?=> '<=';
Lt : {inTag}?=> '<';
Pipe : {inTag}?=> '|';
Col : {inTag}?=> ':';
Comma : {inTag}?=> ',';
OPar : {inTag}?=> '(';
CPar : {inTag}?=> ')';
Num : {inTag}?=> Digit+;
WS : {inTag}?=> (' ' | '\t' | '\r' | '\n')+ {skip();};
Id
: {inTag}?=> (Letter | '_') (Letter | '_' | '-' | Digit)*
{
if($text.equals("capture")) $type = CaptureStart;
else if($text.equals("endcapture")) $type = CaptureEnd;
else if($text.equals("comment")) $type = CommentStart;
else if($text.equals("endcomment")) $type = CommentEnd;
else if($text.equals("raw")) $type = RawStart;
else if($text.equals("endraw")) $type = RawEnd;
else if($text.equals("if")) $type = IfStart;
else if($text.equals("endif")) $type = IfEnd;
else if($text.equals("unless")) $type = UnlessStart;
else if($text.equals("endunless")) $type = UnlessEnd;
else if($text.equals("else")) $type = Else;
else if($text.equals("case")) $type = CaseStart;
else if($text.equals("endcase")) $type = CaseEnd;
else if($text.equals("when")) $type = When;
else if($text.equals("cycle")) $type = Cycle;
else if($text.equals("for")) $type = ForStart;
else if($text.equals("endfor")) $type = ForEnd;
else if($text.equals("in")) $type = In;
else if($text.equals("and")) $type = And;
else if($text.equals("or")) $type = Or;
else if($text.equals("tablerow")) $type = TableStart;
else if($text.equals("endtablerow")) $type = TableEnd;
else if($text.equals("assign")) $type = Assign;
else if($text.equals("true")) $type = True;
else if($text.equals("false")) $type = False;
else if($text.equals("nil")) $type = Nil;
else if($text.equals("include")) $type = Include;
else if($text.equals("with")) $type = With;
}
;
Other
: ({!inTag && !openTagAhead()}?=> . )+
{
String s = getText().replaceAll("\\s+", " ").trim();
if(s.isEmpty()) {
skip();
}
else {
setText(s);
}
}
;
/* fragment rules */
fragment Letter : 'a'..'z' | 'A'..'Z';
fragment Digit : '0'..'9';
fragment SStr : '\'' ~'\''* '\'';
fragment DStr : '"' ~'"'* '"';
fragment CommentStart : ;
fragment CommentEnd : ;
fragment RawStart : ;
fragment RawEnd : ;
fragment IfStart : ;
fragment IfEnd : ;
fragment UnlessStart : ;
fragment UnlessEnd : ;
fragment Else : ;
fragment CaseStart : ;
fragment CaseEnd : ;
fragment When : ;
fragment Cycle : ;
fragment ForStart : ;
fragment ForEnd : ;
fragment In : ;
fragment And : ;
fragment Or : ;
fragment TableStart : ;
fragment TableEnd : ;
fragment Assign : ;
fragment True : ;
fragment False : ;
fragment Nil : ;
fragment Include : ;
fragment With : ;
fragment CaptureStart : ;
fragment CaptureEnd : ;
I have dusted the thing off a bit and put it in a Github repository: https://github.com/bkiers/Liqp
Be aware: although I have successfully used this grammar in the past, the input might have been rather "easy". If you're going to use it, and run into problems, I'd appreciate it if you let me know. If you're looking for a more robust, thoroughly tested library/parser/grammar, this might not be what you're looking for.
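The two lexer techniques the grammar relies on (an inTag flag that switches between plain text and tag tokens, and retyping identifiers that are keywords) can be sketched in a few lines. The Python below is a simplified illustration, not the Liqp implementation: it handles only identifiers inside tags, and the names `KEYWORDS`, `OPEN`, `CLOSE`, and `tokenize` are hypothetical.

```python
import re

# Hypothetical subset of the keyword table used by the Id rule.
KEYWORDS = {"if": "IfStart", "endif": "IfEnd", "else": "Else"}
OPEN = {"{{": "OutStart", "{%": "TagStart"}
CLOSE = {"}}": "OutEnd", "%}": "TagEnd"}
ID = re.compile(r"[A-Za-z_][A-Za-z_0-9-]*")

def tokenize(template):
    """Split a template into (type, text) pairs, tracking an in_tag flag."""
    tokens, pos, in_tag = [], 0, False
    while pos < len(template):
        two = template[pos:pos + 2]
        if not in_tag and two in OPEN:
            tokens.append((OPEN[two], two))
            in_tag, pos = True, pos + 2
        elif not in_tag:
            # Outside tags, everything up to the next opener is plain text.
            starts = [i for i in (template.find(d, pos) for d in OPEN) if i != -1]
            end = min(starts, default=len(template))
            text = " ".join(template[pos:end].split())
            if text:
                tokens.append(("Other", text))
            pos = end
        elif two in CLOSE:
            tokens.append((CLOSE[two], two))
            in_tag, pos = False, pos + 2
        elif template[pos].isspace():
            pos += 1  # whitespace is skipped only inside tags
        else:
            m = ID.match(template, pos)
            if m is None:
                raise ValueError(f"unsupported character {template[pos]!r}")
            # Keywords are retyped after matching, like the $type assignments.
            tokens.append((KEYWORDS.get(m.group(), "Id"), m.group()))
            pos = m.end()
    return tokens
```

For a template such as `Hi {% if user %}yes{% endif %}`, the text outside the tags comes back as `Other` tokens, and `if` and `endif` are retyped from `Id` to `IfStart` and `IfEnd`.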