How to get rid of useless nodes from this AST tree?

I have already looked at this question, and even though the question titles seem to be the same, it doesn't answer my question, at least not in any way that I can understand.
Parsing Math
Here is what I am parsing:
PI -> 3.14.
Number area(Number radius) -> PI * radius^2.
This is how I want my AST tree to look, minus all the useless root nodes.
how it should look http://vertigrated.com/images/How%20I%20want%20the%20tree%20to%20look.png
Here are what I hope are the relevant fragments of my grammar:
term : '(' expression ')'
| number -> ^(NUMBER number)
| (function_invocation)=> function_invocation
| ATOM
| ID
;
power : term ('^' term)* -> ^(POWER term (term)* ) ;
unary : ('+'! | '-'^)* power ;
multiply : unary ('*' unary)* -> ^(MULTIPLY unary (unary)* ) ;
divide : multiply ('/' multiply)* -> ^(DIVIDE multiply (multiply)* );
modulo : divide ('%' divide)* -> ^(MODULO divide (divide)*) ;
subtract : modulo ('-' modulo)* -> ^(SUBTRACT modulo (modulo)* ) ;
add : subtract ('+' subtract)* -> ^(ADDITION subtract (subtract)*) ;
relation : add (('=' | '!=' | '<' | '<=' | '>=' | '>') add)* ;
expression : relation (and_or relation)*
| string
| container_access
;
and_or : '&' | '|' ;
Precedence
I still want to keep the precedence as illustrated in the following diagrams, but want to eliminate the useless nodes if at all possible.
Source: Number a(x) -> 0 - 1 + 2 * 3 / 4 % 5 ^ 6.
Here are the nodes I want to eliminate:
how I want the precedence tree to look http://vertigrated.com/images/example%202%20desired%20result.png
Basically I want to eliminate any of those nodes that don't directly have a branch under them to binary operations.

You must realize that the two rules:
add : sub ( ('+' sub)+ -> ^(ADD sub (sub)*) | -> sub ) ;
and
add : sub ('+'^ sub)* ;
do not produce the same AST. Given the input 1+2+3, the first rule will produce:
   ADD
    |
 .--+--.
 |  |  |
 1  2  3
where the second rule produces:
      (+)
       |
    .--+--.
    |     |
   (+)    3
    |
 .--+--.
 |     |
 1     2
The latter makes more sense: infix expressions are expected to have 2 child nodes, not more.
Why not simply remove the literals in your parser rules and just do:
add : sub (ADD^ sub)*;
ADD : '+';
Creating the same AST using a rewrite rule would look like this:
add : (sub -> sub) ('+' s=sub -> ^(ADD $add $s))*;
Also see chapter 7: Tree Construction from The Definitive ANTLR Reference. Especially the paragraphs Rewrite Rules in Subrules (page 173) and Referencing Previous Rule ASTs in Rewrite Rules (page 174/175).
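For intuition, here is roughly what the `$add` rewrite does, written as a hand-rolled Java loop. This is only a sketch: the class and method names (`LeftAssocDemo`, `parseAdd`) are made up for illustration, operands are passed in as plain strings, and it is not the code ANTLR generates.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not ANTLR output. It mimics
//   add : (sub -> sub) ('+' s=sub -> ^(ADD $add $s))* ;
final class LeftAssocDemo {

    // Builds an AST string in the style of toStringTree(), e.g. "(ADD 1 2)".
    static String parseAdd(List<String> operands) {
        // (sub -> sub): with no operator the operand itself is the result,
        // so a lone operand is never wrapped in an ADD node.
        String tree = operands.get(0);
        for (int i = 1; i < operands.size(); i++) {
            // ^(ADD $add $s): the tree built so far becomes the left child.
            tree = "(ADD " + tree + " " + operands.get(i) + ")";
        }
        return tree;
    }

    public static void main(String[] args) {
        System.out.println(parseAdd(Arrays.asList("1")));
        System.out.println(parseAdd(Arrays.asList("1", "2", "3")));
    }
}
```

Running it prints `1` and then `(ADD (ADD 1 2) 3)`: every node has exactly two children, and a single operand produces no extra root.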

Your rule (and others like it)
add : subtract ('+' subtract)* -> ^(ADDITION subtract (subtract)*) ;
produces the useless production when you don't have a sequence of add operations.
I'm not an ANTLR expert, but I'd guess you need two cases, one for an add term
that is unary, and one for a set of children, the first of which generates your
standard tree, and the second of which simply passes the child tree up to the parent,
without creating a new node?
add : subtract ( ('+' subtract)+ -> ^(ADDITION subtract (subtract)*)
| -> subtract ) ;
Similar changes for other rules with sequences of operands to an operator.

To get rid of the irrelevant nodes, just be explicit:
subtract
:
modulo
(
( '-' modulo)+ -> ^(SUBTRACT modulo+) // no need for parentheses or asterisk
|
() -> modulo
)
;
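The effect of the two alternatives can be sketched in plain Java. This is an illustration under assumptions, not ANTLR-generated code: `FlatNodeDemo` and `subtract` are made-up names, and the operands are plain strings.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of
//   subtract : modulo ( ('-' modulo)+ -> ^(SUBTRACT modulo+) | -> modulo ) ;
final class FlatNodeDemo {

    static String subtract(List<String> operands) {
        if (operands.size() == 1) {
            // Empty alternative: pass the single child up unchanged.
            return operands.get(0);
        }
        // At least one '-' was seen: one SUBTRACT root over all operands.
        return "(SUBTRACT " + String.join(" ", operands) + ")";
    }

    public static void main(String[] args) {
        System.out.println(subtract(Arrays.asList("5")));
        System.out.println(subtract(Arrays.asList("5", "2", "1")));
    }
}
```

This prints `5` and then `(SUBTRACT 5 2 1)`. Note that the node is flat (n-ary), unlike the nested binary tree produced by the `$rule` style of rewrite.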

Even though I accepted Bart's answer as correct, I wanted to post my own complete answer with the example code that I got working, just for completeness.
Here is what I did based on Bart's answer:
unary : ('+'! | '-'^)? term ;
pow : (unary -> unary) ('^' s=unary -> ^(POWER $pow $s))*;
mod : (pow -> pow) ('%' s=pow -> ^(MODULO $mod $s))*;
mult : (mod -> mod) ('*' s=mod -> ^(MULTIPLY $mult $s))*;
div : (mult -> mult) ('/' s=mult -> ^(DIVIDE $div $s))*;
sub : (div -> div) ('-' s=div -> ^(SUBTRACT $sub $s))*;
add : (sub -> sub) ('+' s=sub -> ^(ADD $add $s))*;
And here is what the resulting tree looks like:
working answer http://vertigrated.com/images/working_answer.png
There is an alternative solution to just not use the rewrites and promote the symbols themselves to roots, but I want all descriptive labels in my tree if at all possible. I am just being anal about how the tree is represented so that my tree walking code will be as clean as possible!
power : unary ('^'^ unary)* ;
mod : power ('%'^ power)* ;
mult : mod ('*'^ mod)* ;
div : mult ('/'^ mult)* ;
sub : div ('-'^ div)* ;
add : sub ('+'^ sub)* ;
And this looks like this:
without rewrites http://vertigrated.com/images/without_the_rewrites.png

Related

Array support for Hplsql.g4 or Hive.g4

Good day everyone,
I am using antlr4 to create a parser and lexer for Hive SQL (Hplsql.g4).
I believe this is the latest grammar file.
https://github.com/AngersZhuuuu/Spark-Hive/blob/master/hplsql/src/main/antlr4/org/apache/hive/hplsql/Hplsql.g4
However, I found at least two additions that are needed: IF and array indices.
For example, in a select statement, I may have:
a) SELECT if(a>8,12,20) FROM x
b) SELECT column_name[2] FROM x
Both are valid in Hive, but neither parses when I create a parser and lexer for Java from the Hplsql.g4 above. I added an expression for the IF and it appears to work.
I added
expr :
...
| expr_if //I added
and a new rule:
expr_if :
T_IF T_OPEN_P bool_expr T_COMMA expr T_COMMA expr T_CLOSE_P //I added
;
However, figuring out how to allow an array index is not so easy because the grammar allows aliases:
select a from x
select a alias_of_a from x
select a[1] from x
select a[1] alias_of_a from x
should all be valid.
I tried adding a new expression for this like so:
expr :
...
| expr_array //I added
expr_array :
T_OPEN_SB L_INT T_OPEN_CB //I added
;
This didn't work for me. (T_OPEN_SB L_INT T_OPEN_CB are [ integer ] respectively). I tried so many variations on this as well. My questions are:
Am I using the right grammar file - if not is there a newer one with IF and array handling?
Has anyone been successful in extending this grammar to handle my cases above?
As per Bart's recommendations:
I updated ident.
I updated expr_atom.
I added array_index.
I had // | '[' .*? ']' commented out before.
Test Sql: select a[0] from t
Result:
line 1:8 no viable alternative at input 'selecta[0]'
line 1:8 mismatched input '[0]'
Tree
(program (block stmt (stmt select) (stmt (expr_stmt (expr (expr_atom (ident a)))))) [0] from t)
I feel like the problem is somehow related to select_list_alias below.
With select_list_alias containing ident and T_AS optional, ident is matching the array index.
I can't reconcile why this happens, especially since ident has been updated.
Excerpt from Hplsql.g4:
select_list :
select_list_set? select_list_limit? select_list_item (T_COMMA select_list_item)*
;
select_list_item :
(ident T_EQUAL)? expr select_list_alias?
| select_list_asterisk
;
select_list_alias :
{!_input.LT(1).getText().equalsIgnoreCase("INTO") && !_input.LT(1).getText().equalsIgnoreCase("FROM")}? T_AS? ident
| T_OPEN_P T_TITLE L_S_STRING T_CLOSE_P
;
If I pass in a simple SQL stmt to grun such as
select a[1] from t
The parse tree should look similar to this:
Instead of expr_atom, I want to see expr_array where it would split into expr_atom for the a and array_index for the [1].
Note that there is one SQL statement here. With my existing g4, the array index [1] (and the remainder of the stmt) gets parsed as a separate SQL statement.
Bart, I see from your parse tree that parsing resulted in two SQL statements from "select a[0] from t" - I was getting the same situation.
I will continue to explore different approaches - I am still suspicious of the select_list_alias which has T_AS? ident at the end. Just to confirm, I have commented out one line from ident_part like this: // | '[' .*? ']'
As mentioned in the comments: [ ... ] will be tokenised as an L_ID token. If you don't want that, remove the | '[' .*? ']' part:
fragment
L_ID_PART :
[a-zA-Z] ([a-zA-Z] | L_DIGIT | '_')* // Identifier part
| ('_' | '#' | ':' | '#' | '$') ([a-zA-Z] | L_DIGIT | '_' | '#' | ':' | '#' | '$')+ // (at least one char must follow special char)
| '"' .*? '"' // Quoted identifiers
// | '[' .*? ']' <-- removed
| '`' .*? '`'
;
and create/edit the grammar like this:
expr_atom :
date_literal
| timestamp_literal
| bool_literal
| expr_array // <-- added
| ident
| string
| dec_number
| int_number
| null_const
;
// new rule
expr_array
: ident array_index+
;
// new rule
array_index
: T_OPEN_SB expr T_CLOSE_SB
;
The rules above will cause select a[1] alias_of_a from x to be parsed successfully, but will fail on input like select a[1] alias_of_a from [identifier]: the [identifier] will not be matched as an identifier.
You could try adding something like this:
ident :
L_ID
| T_OPEN_SB ~T_CLOSE_SB+ T_CLOSE_SB // <-- added
| non_reserved_words
;
which will parse select a[1] alias_of_a from [identifier] properly, but I have no good picture of the whole grammar (or deep knowledge of HPL/SQL) to determine whether that will mess up other things :)
EDIT
With my proposed changes, the grammar looks like this: https://gist.github.com/bkiers/4aedd6074726cbcd5d87ede00000cd0d (I cannot post it here on SO because of the char limit)
Parsing select a[0] from t with this will result in the parse tree:
And parsing select a[0] from [t] with this will result in this parse tree:
You're also able to test it by running the following Java code:
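import java.util.Arrays;
import javax.swing.JFrame;
import javax.swing.JPanel;
import org.antlr.v4.gui.TreeViewer;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

// Parse the input with the generated lexer and parser
String source = "select a[0] from [t]";
HplsqlLexer lexer = new HplsqlLexer(CharStreams.fromString(source));
HplsqlParser parser = new HplsqlParser(new CommonTokenStream(lexer));
ParseTree root = parser.program();
// Show the resulting parse tree in a Swing window
JFrame frame = new JFrame("Antlr AST");
JPanel panel = new JPanel();
TreeViewer viewer = new TreeViewer(Arrays.asList(parser.getRuleNames()), root);
viewer.setScale(1.5);
panel.add(viewer);
frame.add(panel);
frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
frame.pack();
frame.setVisible(true);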

How to detect an expression result is unused in an interpreted programming language?

I'm working on a simple procedural interpreted scripting language, written in Java using ANTLR4. Just a hobby project. I have written a few DSLs using ANTLR4 and the lexer and parser presented no real problems. I got quite a bit of the language working by interpreting directly from the parse tree but that strategy, apart from being slow, started to break down when I started to add functions.
So I've created a stack-based virtual machine, based on Chapter 10 of "Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages". I have an assembler for the VM that works well and I'm now trying to make the scripting language generate assembly via an AST.
Something I can't quite see is how to detect when an expression or function result is unused, so that I can generate a POP instruction to discard the value from the top of the operand stack.
I want things like assignment statements to be expressions, so that I can do things like:
x = y = 1;
In the AST, the assignment node is annotated with the symbol (the lvalue) and the rvalue comes from visiting the children of the assignment node. At the end of the visit of the assignment node, the rvalue is stored into the lvalue, and this is reloaded back into the operand stack so that it can be used as an expression result.
This generates (for x = y = 1):
CLOAD 1 ; Push constant value
GSTOR y ; Store into global y and pop
GLOAD y ; Push value of y
GSTOR x ; Store into global x and pop
GLOAD x ; Push value of x
But it needs a POP instruction at the end to discard the result, otherwise the operand stack starts to grow with these unused results. I can't see the best way of doing this.
I guess my grammar could be flawed, which is preventing me seeing a solution here.
grammar g;
// ----------------------------------------------------------------------------
// Parser
// ----------------------------------------------------------------------------
parse
: (functionDefinition | compoundStatement)*
;
functionDefinition
: FUNCTION ID parameterSpecification compoundStatement
;
parameterSpecification
: '(' (ID (',' ID)*)? ')'
;
compoundStatement
: '{' compoundStatement* '}'
| conditionalStatement
| iterationStatement
| statement ';'
;
statement
: declaration
| expression
| exitStatement
| printStatement
| returnStatement
;
declaration
: LET ID ASSIGN expression # ConstantDeclaration
| VAR ID ASSIGN expression # VariableDeclaration
;
conditionalStatement
: ifStatement
;
ifStatement
: IF expression compoundStatement (ELSE compoundStatement)?
;
exitStatement
: EXIT
;
iterationStatement
: WHILE expression compoundStatement # WhileStatement
| DO compoundStatement WHILE expression # DoStatement
| FOR ID IN expression TO expression (STEP expression)? compoundStatement # ForStatement
;
printStatement
: PRINT '(' (expression (',' expression)*)? ')' # SimplePrintStatement
| PRINTF '(' STRING (',' expression)* ')' # PrintFormatStatement
;
returnStatement
: RETURN expression?
;
expression
: expression '[' expression ']' # Indexed
| ID DEFAULT expression # DefaultValue
| ID op=(INC | DEC) # Postfix
| op=(ADD | SUB | NOT) expression # Unary
| op=(INC | DEC) ID # Prefix
| expression op=(MUL | DIV | MOD) expression # Multiplicative
| expression op=(ADD | SUB) expression # Additive
| expression op=(GT | GE | LT | LE) expression # Relational
| expression op=(EQ | NE) expression # Equality
| expression AND expression # LogicalAnd
| expression OR expression # LogicalOr
| expression IF expression ELSE expression # Ternary
| ID '(' (expression (',' expression)*)? ')' # FunctionCall
| '(' expression ')' # Parenthesized
| '[' (expression (',' expression)* )? ']' # LiteralArray
| ID # Identifier
| NUMBER # LiteralNumber
| STRING # LiteralString
| BOOLEAN # LiteralBoolean
| ID ASSIGN expression # SimpleAssignment
| ID op=(CADD | CSUB | CMUL | CDIV) expression # CompoundAssignment
| ID '[' expression ']' ASSIGN expression # IndexedAssignment
;
// ----------------------------------------------------------------------------
// Lexer
// ----------------------------------------------------------------------------
fragment
IDCHR : [A-Za-z_$];
fragment
DIGIT : [0-9];
fragment
ESC : '\\' ["\\];
COMMENT : '#' .*? '\n' -> skip;
// ----------------------------------------------------------------------------
// Keywords
// ----------------------------------------------------------------------------
DO : 'do';
ELSE : 'else';
EXIT : 'exit';
FOR : 'for';
FUNCTION : 'function';
IF : 'if';
IN : 'in';
LET : 'let';
PRINT : 'print';
PRINTF : 'printf';
RETURN : 'return';
STEP : 'step';
TO : 'to';
VAR : 'var';
WHILE : 'while';
// ----------------------------------------------------------------------------
// Operators
// ----------------------------------------------------------------------------
ADD : '+';
DIV : '/';
MOD : '%';
MUL : '*';
SUB : '-';
DEC : '--';
INC : '++';
ASSIGN : '=';
CADD : '+=';
CDIV : '/=';
CMUL : '*=';
CSUB : '-=';
GE : '>=';
GT : '>';
LE : '<=';
LT : '<';
AND : '&&';
EQ : '==';
NE : '!=';
NOT : '!';
OR : '||';
DEFAULT : '??';
// ----------------------------------------------------------------------------
// Literals and identifiers
// ----------------------------------------------------------------------------
BOOLEAN : ('true'|'false');
NUMBER : DIGIT+ ('.' DIGIT+)?;
STRING : '"' (ESC | .)*? '"';
ID : IDCHR (IDCHR | DIGIT)*;
WHITESPACE : [ \t\r\n] -> skip;
ANYCHAR : . ;
So my question is where is the usual place to detect unused expression results, i.e. when expressions are used as plain statements? Is it something I should detect during the parse, then annotate the AST node? Or is this better done when visiting the AST for code generation (assembly generation in my case)? I just can't see where best to do it.
IMO it's not a question of the right grammar, but of how you process the AST/parse tree. Whether a result is used can be determined by checking the siblings (and the parent's siblings, etc.). An assignment, for instance, is made of the lvalue, the operator and the rvalue, so once you have determined the rvalue, check whether the previous sibling node is an operator. Similarly, you can check whether the parent is a parenthesized expression (for nested function calls, grouping, etc.).
statement
: ...
| expression
If you label this case with # ExpressionStatement, you can generate a pop after every expression statement by overriding exitExpressionStatement() in the listener or visitExpressionStatement in the visitor.
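A minimal sketch of that idea, assuming a hand-written mini AST rather than the ANTLR-generated visitor: only the instruction names (CLOAD, GSTOR, GLOAD, POP) come from the question, while `PopDemo`, `Expr`, `Num`, `Assign` and `emitExpressionStatement` are hypothetical names made up for this example. The point is that every expression leaves exactly one value on the stack, and the expression-statement node is the single place that knows the value is unused.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mini AST and emitter; not the asker's actual code.
final class PopDemo {

    interface Expr { void emit(List<String> out); }

    // Numeric constant: pushes one value.
    static final class Num implements Expr {
        final int value;
        Num(int value) { this.value = value; }
        public void emit(List<String> out) { out.add("CLOAD " + value); }
    }

    // Assignment used as an expression: stores, then reloads the value,
    // so it always leaves exactly one value on the operand stack.
    static final class Assign implements Expr {
        final String name;
        final Expr rhs;
        Assign(String name, Expr rhs) { this.name = name; this.rhs = rhs; }
        public void emit(List<String> out) {
            rhs.emit(out);
            out.add("GSTOR " + name);
            out.add("GLOAD " + name);
        }
    }

    // Plays the role of visitExpressionStatement: the value of an expression
    // used as a statement is never consumed, so the POP is emitted here.
    static List<String> emitExpressionStatement(Expr e) {
        List<String> out = new ArrayList<>();
        e.emit(out);
        out.add("POP");
        return out;
    }

    public static void main(String[] args) {
        // x = y = 1;
        Expr stmt = new Assign("x", new Assign("y", new Num(1)));
        emitExpressionStatement(stmt).forEach(System.out::println);
    }
}
```

For `x = y = 1;` this prints the five instructions from the question followed by a single `POP`. The inner assignment needs no special handling, because its value is consumed by the outer one.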

What is wrong with this ANTLR Grammar? Conditional statement nested parenthesis

I've been tasked with writing a prototype of my team's DSL in Java, so I thought I would try it out using ANTLR. However I'm having problems with the 'expression' and 'condition' rules.
The DSL is already well defined so I would like to keep as close to the current spec as possible.
grammar MyDSL;
// Obviously this is just a snippet of the whole language, but it should give a
// decent view of the issue.
entry
: condition EOF
;
condition
: LPAREN condition RPAREN
| atomic_condition
| NOT condition
| condition AND condition
| condition OR condition
;
atomic_condition
: expression compare_operator expression
| expression (IS NULL | IS NOT NULL)
| identifier
| BOOLEAN
;
compare_operator
: EQUALS
| NEQUALS
| GT | LT
| GTEQUALS | LTEQUALS
;
expression
: LPAREN expression RPAREN
| atomic_expression
| PREFIX expression
| expression (MULTIPLY | DIVIDE) expression
| expression (ADD | SUBTRACT) expression
| expression CONCATENATE expression
;
atomic_expression
: SUBSTR LPAREN expression COMMA expression (COMMA expression)? RPAREN
| identifier
| INTEGER
;
identifier
: WORD
;
// Function Names
SUBSTR: 'SUBSTR';
// Control Chars
LPAREN : '(';
RPAREN : ')';
COMMA : ',';
// Literals and Identifiers
fragment DIGIT : [0-9] ;
INTEGER: DIGIT+;
fragment LETTER : [A-Za-z#$#];
fragment CHARACTER : DIGIT | LETTER | '_';
WORD: LETTER CHARACTER*;
BOOLEAN: 'TRUE' | 'FALSE';
// Arithmetic Operators
MULTIPLY : '*';
DIVIDE : '/';
ADD : '+';
SUBTRACT : '-';
PREFIX: ADD| SUBTRACT ;
// String Operators
CONCATENATE : '||';
// Comparison Operators
EQUALS : '==';
NEQUALS : '<>';
GTEQUALS : '>=';
LTEQUALS : '<=';
GT : '>';
LT : '<';
// Logical Operators
NOT : 'NOT';
AND : 'AND';
OR : 'OR';
// Keywords
IS : 'IS';
NULL: 'NULL';
// Whitespace
BLANK: [ \t\n\r]+ -> channel(HIDDEN) ;
The phrase I'm testing with is
(FOO == 115 AND (SUBSTR(BAR,2,1) == 1 OR SUBSTR(BAR,4,1) == 1))
However it is breaking on the nested parentheses, matching the first ( with the first ) instead of the outermost (see below). In ANTLR3 I solved this with semantic predicates but it seems that ANTLR4 is supposed to have fixed the need for those.
I'd really like to keep the condition and the expression rules separate if at all possible. I have been able to get it to work when merged together in a single expression rule (based on examples here and elsewhere) but the current DSL spec has them as different and I'm trying to reduce any possible differences in behaviour.
Can anyone point out how I can get this all working while maintaining separate rules for conditions and expressions? Many thanks!
The grammar seems fine to me.
There's one thing going wrong in the lexer: the WORD token is defined before various keywords/operators causing it to get precedence over them. Place your WORD rule at the very end of your lexer rules (or at least after the last keywords which WORD could also match).
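To see why the order matters, here is a toy matcher that mimics the lexer's choice: the longest match wins, and when two rules match the same length, the one defined first wins. It is a sketch built on `java.util.regex`, not ANTLR's implementation; `RuleOrderDemo` and `firstToken` are made-up names, and the WORD pattern is a regex approximation of the rule in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy matcher: "longest match wins, earliest rule breaks ties".
final class RuleOrderDemo {

    // Returns the name of the rule that would claim the start of the input.
    // The map's iteration order stands in for the order of the lexer rules.
    static String firstToken(Map<String, String> rules, String input) {
        String best = null;
        int bestLen = 0;
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            Matcher m = Pattern.compile(rule.getValue()).matcher(input);
            // Strictly greater: on equal length the earlier rule is kept.
            if (m.lookingAt() && m.end() > bestLen) {
                best = rule.getKey();
                bestLen = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String word = "[A-Za-z#$][A-Za-z0-9#$_]*";

        Map<String, String> wordFirst = new LinkedHashMap<>();
        wordFirst.put("WORD", word);
        wordFirst.put("AND", "AND");

        Map<String, String> keywordFirst = new LinkedHashMap<>();
        keywordFirst.put("AND", "AND");
        keywordFirst.put("WORD", word);

        System.out.println(firstToken(wordFirst, "AND (X)"));    // WORD
        System.out.println(firstToken(keywordFirst, "AND (X)")); // AND
        System.out.println(firstToken(keywordFirst, "ANDY"));    // WORD
    }
}
```

With WORD first, the keyword `AND` is swallowed as a WORD. With the keyword first it is recognised, and a longer identifier such as `ANDY` still becomes a WORD because the longer match wins.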

Left-factoring grammar of coffeescript expressions

I'm writing an Antlr/Xtext parser for the CoffeeScript grammar. It's still at an early stage: I've only moved over a subset of the original grammar, and I am stuck with expressions. It's the dreaded "rule expression has non-LL(*) decision" error. I found some related questions here, Help with left factoring a grammar to remove left recursion and ANTLR Grammar for expressions. I also tried How to remove global backtracking from your grammar, but it just demonstrates a very simple case which I cannot use in real life. The post about ANTLR Grammar Tip: LL(*) and Left Factoring gave me more insights, but I still can't get a handle on it.
My question is how to fix the following grammar (sorry, I couldn't simplify it and still keep the error). I guess the troublemaker is the term rule, so I'd appreciate a local fix to it, rather than changing the whole thing (I'm trying to stay close to the rules of the original grammar). Pointers to tips on how to "debug" this kind of erroneous grammar in your head are also welcome.
grammar CoffeeScript;
options {
output=AST;
}
tokens {
AT_SIGIL; BOOL; BOUND_FUNC_ARROW; BY; CALL_END; CALL_START; CATCH; CLASS; COLON; COLON_SLASH; COMMA; COMPARE; COMPOUND_ASSIGN; DOT; DOT_DOT; DOUBLE_COLON; ELLIPSIS; ELSE; EQUAL; EXTENDS; FINALLY; FOR; FORIN; FOROF; FUNC_ARROW; FUNC_EXIST; HERECOMMENT; IDENTIFIER; IF; INDENT; INDEX_END; INDEX_PROTO; INDEX_SOAK; INDEX_START; JS; LBRACKET; LCURLY; LEADING_WHEN; LOGIC; LOOP; LPAREN; MATH; MINUS; MINUS; MINUS_MINUS; NEW; NUMBER; OUTDENT; OWN; PARAM_END; PARAM_START; PLUS; PLUS_PLUS; POST_IF; QUESTION; QUESTION_DOT; RBRACKET; RCURLY; REGEX; RELATION; RETURN; RPAREN; SHIFT; STATEMENT; STRING; SUPER; SWITCH; TERMINATOR; THEN; THIS; THROW; TRY; UNARY; UNTIL; WHEN; WHILE;
}
COMPARE : '<' | '==' | '>';
COMPOUND_ASSIGN : '+=' | '-=';
EQUAL : '=';
LOGIC : '&&' | '||';
LPAREN : '(';
MATH : '*' | '/';
MINUS : '-';
MINUS_MINUS : '--';
NEW : 'new';
NUMBER : ('0'..'9')+;
PLUS : '+';
PLUS_PLUS : '++';
QUESTION : '?';
RELATION : 'in' | 'of' | 'instanceof';
RPAREN : ')';
SHIFT : '<<' | '>>';
STRING : '"' (('a'..'z') | ' ')* '"';
TERMINATOR : '\n';
UNARY : '!' | '~' | NEW;
// Put it at the end, so keywords will be matched earlier
IDENTIFIER : ('a'..'z' | 'A'..'Z')+;
WS : (' ')+ {skip();} ;
root
: body
;
body
: line
;
line
: expression
;
assign
: assignable EQUAL expression
;
expression
: value
| assign
| operation
;
identifier
: IDENTIFIER
;
simpleAssignable
: identifier
;
assignable
: simpleAssignable
;
value
: assignable
| literal
| parenthetical
;
literal
: alphaNumeric
;
alphaNumeric
: NUMBER
| STRING;
parenthetical
: LPAREN body RPAREN
;
// term should be the same as expression except operation to avoid left-recursion
term
: value
| assign
;
questionOp
: term QUESTION?
;
mathOp
: questionOp (MATH questionOp)*
;
additiveOp
: mathOp ((PLUS | MINUS) mathOp)*
;
shiftOp
: additiveOp (SHIFT additiveOp)*
;
relationOp
: shiftOp (RELATION shiftOp)*
;
compareOp
: relationOp (COMPARE relationOp)*
;
logicOp
: compareOp (LOGIC compareOp)*
;
operation
: UNARY expression
| MINUS expression
| PLUS expression
| MINUS_MINUS simpleAssignable
| PLUS_PLUS simpleAssignable
| simpleAssignable PLUS_PLUS
| simpleAssignable MINUS_MINUS
| simpleAssignable COMPOUND_ASSIGN expression
| logicOp
;
UPDATE:
The final solution will use Xtext with an external lexer to avoid the intricacies of handling significant whitespace. Here is a snippet from my Xtext version:
CompareOp returns Operation:
AdditiveOp ({CompareOp.left=current} operator=COMPARE right=AdditiveOp)*;
My strategy is to make a working Antlr parser first without a usable AST. (Whether this is a feasible approach would deserve a separate question.) So I don't care about tokens at the moment; they are included to make development easier.
I am aware that the original grammar is LR. I don't know how close I can stay to it when transforming to LL.
UPDATE2 and SOLUTION:
I could simplify my problem with the insights gained from Bart's answer. Here is a working toy grammar to handle simple expressions with function calls to illustrate it. The comment before expression shows my insight.
grammar FunExp;
ID: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
NUMBER: '0'..'9'+;
WS: (' ')+ {skip();};
root
: expression
;
// atom and functionCall would go here,
// but they are reachable via operation -> term
// so they are omitted here
expression
: operation
;
atom
: NUMBER
| ID
;
functionCall
: ID '(' expression (',' expression)* ')'
;
operation
: multiOp
;
multiOp
: additiveOp (('*' | '/') additiveOp)*
;
additiveOp
: term (('+' | '-') term)*
;
term
: atom
| functionCall
| '(' expression ')'
;
When you generate a lexer and parser from your grammar, you see the following error printed to your console:
error(211): CoffeeScript.g:52:3: [fatal] rule expression has non-LL(*) decision due to recursive rule invocations reachable from alts 1,3. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
warning(200): CoffeeScript.g:52:3: Decision can match input such as "{NUMBER, STRING}" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input
(I've emphasized the important bits)
This is only the first error, but you start with the first and with a bit of luck, the errors below that first one will also disappear when you fix the first one.
The error posted above means that when you're trying to parse either a NUMBER or a STRING with the parser generated from your grammar, the parser can go two ways when it ends up in the expression rule:
expression
: value // choice 1
| assign // choice 2
| operation // choice 3
;
Namely, choice 1 and choice 3 both can parse a NUMBER or a STRING, as you can see by the "paths" the parser can follow to match these 2 choices:
choice 1:
  expression
    value
      literal
        alphaNumeric : {NUMBER, STRING}
choice 3:
  expression
    operation
      logicOp
        compareOp
          relationOp
            shiftOp
              additiveOp
                mathOp
                  questionOp
                    term
                      value
                        literal
                          alphaNumeric : {NUMBER, STRING}
In the last part of the warning, ANTLR informs you that it ignores choice 3 whenever either a NUMBER or a STRING will be parsed, causing choice 1 to match such input (since it is defined before choice 3).
So, either the CoffeeScript grammar is ambiguous in this respect (and handles this ambiguity somehow), or your implementation of it is wrong (I'm guessing the latter :)). You need to fix this ambiguity in your grammar: i.e. don't let the expression's choices 1 and 3 both match the same input.
I noticed 3 other things in your grammar:
1
Take the following lexer rules:
NEW : 'new';
...
UNARY : '!' | '~' | NEW;
Be aware that the token UNARY can never match the text 'new' since the token NEW is defined before it. If you want to let UNARY match this, remove the NEW rule and do:
UNARY : '!' | '~' | 'new';
2
On many occasions, you're collecting multiple types of tokens in a single one, like LOGIC:
LOGIC : '&&' | '||';
and then you use that token in a parser rule like this:
logicOp
: compareOp (LOGIC compareOp)*
;
But if you're going to evaluate such an expression at a later stage, you don't know what this LOGIC token matched ('&&' or '||') and you'll have to inspect the token's inner text to find that out. You'd better do something like this (at least, if you're doing some sort of evaluating at a later stage):
AND : '&&';
OR : '||';
...
logicOp
: compareOp ( AND compareOp // easier to evaluate, you know it's an AND expression
| OR compareOp // easier to evaluate, you know it's an OR expression
)*
;
3
You're skipping white spaces (and no tabs?) with:
WS : (' ')+ {skip();} ;
but doesn't CoffeeScript indent its code blocks with spaces (and tabs) just like Python? But perhaps you're going to do that at a later stage?
I just saw that the grammar you're looking at is a jison grammar (which is more or less a bison implementation in JavaScript). But bison, and therefore jison, generates LR parsers while ANTLR generates LL parsers. So trying to stay close to the rules of the original grammar will only result in more problems.

How do I make a TreeParser in ANTLR3?

I'm attempting to learn language parsing for fun...
I've created an ANTLR grammar which I believe will match a simple language I am hoping to implement. It will have the following syntax:
<FunctionName> ( <OptionalArguments>+) {
<OptionalChildFunctions>+
}
Actual Example:
ForEach(in:[1,2,3,4,5] as:"nextNumber") {
Print(message:{nextNumber})
}
I believe I have the grammar working correctly to match this construct, and now I am attempting to build an Abstract Syntax Tree for the language.
Firstly, I must admit I'm not exactly sure HOW this tree should look. Secondly, I'm at a complete loss how to do this in my Antlr grammar...I've been trying without much success for hours.
This is the current idea I'm going with for the tree:
FunctionName
/ \
Attributes \
/ \ / \
ID /\ ChildFunctions
/ \ ID etc
/ \
Attribute AttributeValue
Type
This is my current Antlr grammar file:
grammar Test;
options {output=AST;ASTLabelType=CommonTree;}
program : function ;
function : ID (OPEN_BRACKET (attribute (COMMA? attribute)*)? CLOSE_BRACKET)? (OPEN_BRACE function* CLOSE_BRACE)?;
attribute : ID COLON datatype;
datatype : NUMBER | STRING | BOOLEAN | array | lookup ;
array : OPEN_BOX (datatype (COMMA datatype)* )? CLOSE_BOX ;
lookup : OPEN_BRACE (ID (PERIOD ID)*) CLOSE_BRACE;
NUMBER
: ('+' | '-')? (INTEGER | FLOAT)
;
STRING
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
BOOLEAN
: 'true' | 'TRUE' | 'false' | 'FALSE'
;
ID : (LETTER|'_') (LETTER | INTEGER |'_')*
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WHITESPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;} ;
COLON : ':' ;
COMMA : ',' ;
PERIOD : '.' ;
OPEN_BRACKET : '(' ;
CLOSE_BRACKET : ')' ;
OPEN_BRACE : '{' ;
CLOSE_BRACE : '}' ;
OPEN_BOX : '[' ;
CLOSE_BOX : ']' ;
fragment
LETTER
: 'a'..'z' | 'A'..'Z'
;
fragment
INTEGER
: '0'..'9'+
;
fragment
FLOAT
: INTEGER+ '.' INTEGER*
;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
;
ANY help / advice would be great. I've tried reading dozens of tutorials and nothing about the AST generation seems to stick :(
Step 1 is to make the tree look like the little graph that you posted. Right now, you don't have any tree construction operators, so you're going to end up with a flat list.
See tree construction on the antlr.org website.
You can use ANTLRWorks to see what you're getting for a parse tree and AST. Start adding tree construction operators and watch how things change.
EDIT / Additional Info:
Here's a process you can follow to give you a rough idea of how to do it:
Download ANTLRWorks and use its graphing facilities. You will definitely want to see the parse tree and the AST before and after you make changes. Once you understand how everything works, then you can use any IDE or editor you want.
There are two basic operators for tree construction: the exclamation mark !, which tells ANTLR not to place the node in the AST, and the caret ^, which tells ANTLR to make something the root node. Start by going through each non-terminal rule and deciding which elements don't need to be in the AST. For example, you don't need commas or parentheses. Once you have all the information, you can populate a structure (or create your own AST structure) that holds it. Commas don't help any more, so add a ! to them. For example:
function: ID (OPEN_BRACKET! (attribute (COMMA!? attribute)*)? CLOSE_BRACKET!)? (OPEN_BRACE! function* CLOSE_BRACE!)?;
Take a look at the AST in ANTLRWorks before and after. Compare.
Now decide which element should be the root node. It looks like you want ID to be the root node, so add a ^ after ID and compare in ANTLRWorks.
Here's a few changes that bring it closer to what I think you want:
program : function ;
function : ID^ (OPEN_BRACKET! attributeList? CLOSE_BRACKET!)? (OPEN_BRACE! function* CLOSE_BRACE!)?;
attributeList: (attribute (COMMA!? attribute)*);
attribute : ID COLON! datatype;
datatype : NUMBER | STRING | BOOLEAN | array | lookup ;
array : OPEN_BOX! (datatype^ (COMMA! datatype)* )? CLOSE_BOX!;
lookup : OPEN_BRACE! (ID (PERIOD! ID)*) CLOSE_BRACE!;
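If it helps to see the effect of the two operators outside ANTLRWorks, here is a toy Java model of a single rule. It is not ANTLR's tree builder: `TreeOpsDemo` and `build` are made-up names, tokens are plain strings with the operator appended as a suffix, and only one ^ per rule is modelled.

```java
// Hypothetical toy: each element is a token's text, optionally suffixed with
// '!' (leave out of the AST) or '^' (make it the root of the rule's subtree).
final class TreeOpsDemo {

    static String build(String... elements) {
        String root = null;
        StringBuilder children = new StringBuilder();
        for (String element : elements) {
            if (element.endsWith("!")) {
                continue; // '!': token is matched but not added to the tree
            }
            if (element.endsWith("^")) {
                // '^': token becomes the root
                root = element.substring(0, element.length() - 1);
            } else {
                children.append(' ').append(element);
            }
        }
        if (root == null) {
            return children.toString().trim(); // no root: a flat list
        }
        return "(" + root + children + ")";
    }

    public static void main(String[] args) {
        // No operators: everything ends up in one flat list.
        System.out.println(build("ForEach", "(", "in", "as", ")"));
        // ID^ OPEN_BRACKET! attribute attribute CLOSE_BRACKET!
        System.out.println(build("ForEach^", "(!", "in", "as", ")!"));
    }
}
```

The first call prints the flat list `ForEach ( in as )`; the second prints `(ForEach in as)`, with the function name as root and the brackets gone.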
With that under your belt, now go look at some of the tutorials.