Tree Rewrite - whole subtree not just the top node should become root - antlr

I want the tree rewrite that uses *addition_operator* as its root to keep the whole subtree, not only the top node, so that the *hint_keywords* stay in the tree.
The *addition* rule is this complex because I want to add the T_LEFT and T_RIGHT nodes to the tree.
I am using ANTLR 3.3.
Grammar:
grammar Test;
options {
output = AST;
}
tokens {
T_LEFT;
T_RIGHT;
T_MARKER;
}
@lexer::header {
package com.spielwiese;
}
@header {
package com.spielwiese;
}
NUM : '0' .. '9' ( '0' .. '9' )*;
ASTERISK : '*';
PLUS : '+';
MINUS : '-';
WS : (' '|'\r'|'\t'|'\n') {skip();};
addition
:
(a=atom -> $a)
(
addition_operator b=atom
->
^(addition_operator
^(T_LEFT $addition)
^(T_RIGHT $b)
)
)+
;
atom
: NUM
| '(' addition ')' -> addition
;
addition_operator
: PLUS hints? -> ^(PLUS hints?)
| MINUS hints? -> ^(MINUS hints?)
;
hints
: '[' hint_keywords += hint_keyword (',' hint_keywords += hint_keyword)* ']'
->
$hint_keywords
;
hint_keyword
: 'FAST'
| 'SLOW'
| 'BIG'
| 'THIN'
;
As far as I can see, the reason is the implementation of RewriteRuleSubtreeStream#nextNode(), which uses adaptor.dupNode(tree), whereas I want adaptor.dupTree(tree).
For the input
2 + [BIG] 3 - [FAST, THIN] 4
the resulting tree is:
          +---------+
          |    -    |
          +---------+
            |      \
            |       \
         T_LEFT   T_RIGHT
            |        |
     +---------+
     |    +    |     4
     +---------+
       |      \
    T_LEFT   T_RIGHT
       |        |
       2        3
but it should be:
               +---------+
               |    -    |
               +---------+
              /  /   |    \
             /  /    |     \
        FAST  THIN T_LEFT  T_RIGHT
                     |        |
               +---------+
               |    +    |    4
               +---------+
                /   |     \
               /  T_LEFT  T_RIGHT
            BIG     |        |
                    2        3
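The difference between the two adaptor calls mentioned in the question can be sketched without the ANTLR runtime. The AstNode class below is hypothetical and only mimics the idea: a node-only copy drops the children, while a subtree copy keeps them. If the root of ^(addition_operator ...) is produced by a node-only copy, as the question suggests, that would explain why the hint keywords disappear.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an AST node; these are not the ANTLR runtime classes.
final class AstNode {
    final String text;
    final List<AstNode> children = new ArrayList<>();

    AstNode(String text) { this.text = text; }

    AstNode add(AstNode child) { children.add(child); return this; }

    // Copies only this node; the children are dropped (the dupNode idea).
    AstNode dupNode() { return new AstNode(text); }

    // Copies this node and, recursively, every descendant (the dupTree idea).
    AstNode dupTree() {
        AstNode copy = new AstNode(text);
        for (AstNode c : children) copy.add(c.dupTree());
        return copy;
    }

    // LISP-style rendering: a leaf prints as its text, a parent as (text child ...).
    String toStringTree() {
        if (children.isEmpty()) return text;
        StringBuilder sb = new StringBuilder("(").append(text);
        for (AstNode c : children) sb.append(' ').append(c.toStringTree());
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        AstNode op = new AstNode("-").add(new AstNode("FAST")).add(new AstNode("THIN"));
        System.out.println(op.dupNode().toStringTree()); // node only, hints lost
        System.out.println(op.dupTree().toStringTree()); // whole subtree, hints kept
    }
}
```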

Try this:
grammar Test;
options {
output=AST;
}
tokens {
T_MARKER;
T_LEFT;
T_RIGHT;
}
calc
: addition EOF -> addition
;
addition
: (a=atom -> $a) ( Add markers b=atom -> ^(Add markers ^(T_LEFT $addition) ^(T_RIGHT $b))
| Sub markers b=atom -> ^(Sub markers ^(T_LEFT $addition) ^(T_RIGHT $b))
)*
;
markers
: ('[' marker (',' marker)* ']')? -> ^(T_MARKER marker*)
;
marker
: Fast
| Thin
| Big
;
atom
: Num
| '(' addition ')' -> addition
;
Fast : 'FAST';
Thin : 'THIN';
Big : 'BIG';
Num : '0'..'9' ('0'..'9')*;
Add : '+';
Sub : '-';
Space : (' ' | '\t' | '\r' | '\n') {skip();};
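This parses the input 2 + [BIG] 3 - [FAST, THIN] 4 + 5 into an AST in which every Add or Sub node has a T_MARKER child (holding the markers, if any) followed by T_LEFT and T_RIGHT.
The trick is to use $addition in the rewrite rule to reference the tree built so far by the rule itself.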
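How $addition accumulates the tree can be sketched as a plain left fold. This is an illustration in ordinary Java, not generated parser code: the names AdditionFold, Step and fold are hypothetical, and the input is assumed to be already split into a first atom plus a list of (operator, markers, atom) steps.

```java
import java.util.Arrays;
import java.util.List;

final class AdditionFold {
    // One pass through the (...)* loop: the operator, its markers, the right-hand atom.
    static final class Step {
        final String op;
        final List<String> markers;
        final String atom;

        Step(String op, List<String> markers, String atom) {
            this.op = op;
            this.markers = markers;
            this.atom = atom;
        }
    }

    // Builds the LISP-style tree text the rewrite rules would describe.
    static String fold(String first, List<Step> steps) {
        String tree = first; // corresponds to (a=atom -> $a)
        for (Step s : steps) {
            String markers = s.markers.isEmpty()
                    ? "T_MARKER"
                    : "(T_MARKER " + String.join(" ", s.markers) + ")";
            // $addition on the right-hand side is the tree built so far.
            tree = "(" + s.op + " " + markers
                    + " (T_LEFT " + tree + ") (T_RIGHT " + s.atom + "))";
        }
        return tree;
    }

    public static void main(String[] args) {
        // 2 + [BIG] 3 - [FAST, THIN] 4
        System.out.println(fold("2", Arrays.asList(
                new Step("+", Arrays.asList("BIG"), "3"),
                new Step("-", Arrays.asList("FAST", "THIN"), "4"))));
    }
}
```

Because each iteration wraps the previous result in T_LEFT, the last operator in the input ends up as the root, which is the left-associative shape asked for in the question.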

Related

mincaml grammar in antlr4

I am trying to write a MinCaml parser in ANTLR 4. The original parser is on GitHub (https://github.com/esumii/min-caml/blob/master/parser.mly).
The project site (in Japanese) is http://esumii.github.io/min-caml/.
Here is the ANTLR 4 code:
grammar MinCaml;
simple_exp: #simpleExp
| LPAREN exp RPAREN #parenExp
| LPAREN RPAREN #emptyParen
| BOOL #boolExpr
| INT #intExpr
| FLOAT #floatExpr
| IDENT #identExpr
| simple_exp DOT LPAREN exp RPAREN #arrayGetExpr
;
exp : #programExp
| simple_exp #simpleExpInExp
| NOT exp #notExp
| MINUS exp #minusExp
| MINUS_DOT exp #minusFloatExp
| left = exp op = (AST_DOT | SLASH_DOT) right = exp #astSlashExp
| left = exp op = (PLUS | MINUS | MINUS_DOT | PLUS_DOT) right = exp #addSubExp
| left = exp op = (EQUAL | LESS_GREATER | LESS | GREATER | LESS_EQUAL | GREATER_EQUAL) right = exp #logicExp
| IF condition = exp THEN thenExp = exp ELSE elseExp = exp #ifExp
| LET IDENT EQUAL exp IN exp #letExp
| LET REC fundef IN exp #letRecExp
| exp actual_args #appExp
| exp COMMA exp elems #tupleExp
| LET LPAREN pat RPAREN EQUAL exp IN exp #tupleReadExp
| simple_exp DOT LPAREN exp RPAREN LESS_MINUS exp #putExp
| exp SEMICOLON exp #expSeqExp
| ARRAY_CREATE simple_exp simple_exp #arrayCreateExp
;
fundef:
| IDENT formal_args EQUAL exp
;
formal_args:
| IDENT formal_args
| IDENT
;
actual_args:
| actual_args simple_exp
| simple_exp
;
elems:
| COMMA exp elems
|
;
pat:
| pat COMMA IDENT
| IDENT COMMA IDENT
;
LET : 'let';
REC : 'rec';
IF : 'if';
THEN : 'then';
ELSE : 'else';
IN : 'in';
IDENT : '_' | [a-z][a-zA-Z0-9_]+;
ARRAY_CREATE : 'Array.create';
LPAREN : '(';
RPAREN : ')';
BOOL : 'true' 'false';
NOT : 'not';
INT : ['1'-'9'] (['0'-'9'])*;
FLOAT : (['0'-'9'])+ ('.' (['0'-'9'])*)? (['e', 'E'] (['+', '-'])? (['0'-'9'])+)?;
MINUS : '-';
PLUS : '+';
MINUS_DOT : '-.';
PLUS_DOT : '+.';
AST_DOT : '*.';
SLASH_DOT : '/.';
EQUAL : '=';
LESS_GREATER : '<>';
LESS_EQUAL : '<=';
GREATER_EQUAL : '>=';
LESS : '<';
GREATER : '>';
DOT : '.';
LESS_MINUS : '<-';
SEMICOLON : ';';
COMMA : ',';
WS : [ \t\r\n]+ -> skip ; // toss out whitespace
COMMENT : '(*' .*? '*)' -> skip;
But I get the following errors on the rules exp and actual_args:
error(148): MinCaml.g4:13:0: left recursive rule exp contains a left recursive alternative which can be followed by the empty string
error(148): MinCaml.g4:41:0: left recursive rule actual_args contains a left recursive alternative which can be followed by the empty string
But I don't see any possibility of an empty string in either rule. Or am I wrong?
What is wrong with this code?
The first line of the exp rule (actually of every rule) is the likely problem:
exp : #programExp
The standard rule form is
r: alt1 | alt2 | .... | altN ;
In this grammar, the first alternative (alt1) of every rule is empty: each rule body starts either with a bare label or directly with |. An empty alternative matches the empty string.
The elems rule appears to have an intentional empty alternative. In general, rules with empty alternatives can be problematic; rather than using one, make the corresponding element in the parent rule optional (with ? or *).
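To make the empty first alternative visible, here is a small stand-alone Java sketch that lists the empty alternatives of a rule body. It is a naive text check, not part of ANTLR: the class and method names are hypothetical, and it assumes the body contains no | inside string literals or parentheses.

```java
import java.util.ArrayList;
import java.util.List;

final class EmptyAltCheck {
    // Returns the 1-based positions of the empty alternatives in a rule body
    // (the text between ':' and ';'). Naive: splits on every '|'.
    static List<Integer> emptyAlternatives(String ruleBody) {
        List<Integer> empty = new ArrayList<>();
        String[] alts = ruleBody.split("\\|", -1); // -1 keeps trailing empty pieces
        for (int i = 0; i < alts.length; i++) {
            // Drop an alternative label such as "#programExp" before testing.
            String alt = alts[i].replaceAll("#\\s*\\w+", "").trim();
            if (alt.isEmpty()) empty.add(i + 1);
        }
        return empty;
    }

    public static void main(String[] args) {
        // The start of the exp rule from the question: alternative 1 is only a label.
        System.out.println(emptyAlternatives(" #programExp | simple_exp #simpleExpInExp | NOT exp #notExp"));
        // The same rule with the leading empty alternative removed.
        System.out.println(emptyAlternatives(" simple_exp #simpleExpInExp | NOT exp #notExp"));
    }
}
```

Run against the rules in the question, a check like this would flag position 1 in each of them, and additionally the last alternative of elems.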

Antlr linphone line 2:0 required (...)+ loop did not match anything at character 'V'

To make things clear, I made a simplified copy of the .g file that linphone uses to decode SIP messages.
My problem is that decoding the first line of a SIP message with this grammar sometimes fails.
The file is shown below:
grammar sipmessage;
options{
language = Java;
output = AST;
}
@lexer::header{
package org.meri.antlr_step_by_step.parsers;
import java.util.HashMap;
}
@parser::header{
package org.meri.antlr_step_by_step.parsers;
import java.util.HashMap;
}
status_line :
sip_version
sptab status_code {System.out.println($status_code.text);}
sptab reason_phrase {System.out.println($reason_phrase.text);}
CRLF
;
sip_version
: word;// 'SIP/' DIGIT '.' DIGIT;
reason_phrase
: ~(CRLF)*;
status_code
: extension_code;
word
:
(alphanum | mark | '%'
| PLUS | '`' |
LAQUOT | RAQUOT |
COLON | '\\' | DQUOTE | SLASH | '[' | ']' | '?' | '{' | '}'
)+
;
extension_code : DIGIT+;
alphanum : alpha | DIGIT ;
mark : '-' | '_' | '.' | '!' | '~' | STAR | '\'' | LPAREN | RPAREN ;
alpha : HEX_CHAR | COMMON_CHAR;
COMMON_CHAR : 'g'..'z' | 'G'..'Z' ;
HEX_CHAR: 'a'..'f' |'A'..'F';
RPAREN : ')';
LPAREN : '(';
STAR : '*';
DIGIT : '0'..'9' ;
PLUS: '+';
COLON : ':';
RAQUOT : '>';
LAQUOT : '<';
DQUOTE : '"';
SLASH : '/';
SP : ' ';
CRLF : '\r\n';
HTAB : ' ';
LWS : (SP* CRLF)? SP+ ; //linear whitespace
sptab : (SP|HTAB)+;
My problem: when I decode a string such as
"SIP/2.0 200 OK.\r\nVia: SIP/2.0/TCP 192.168.26.116:46448;alias;branch=z9hG4bK.TteIOuQeu;rport\r\n"
it prints the status code and reason phrase of the first line correctly, that is, 200 and OK.
But if I add some spaces before the first \r\n (the one before Via), that is,
"SIP/2.0 200 OK. \r\nVia: SIP/2.0/TCP 192.168.26.116:46448;alias;branch=z9hG4bK.TteIOuQeu;rport\r\n"
the result is wrong.
The reason phrase that is printed becomes "OK.ia: SIP/2.0/TCP 192.168.26.116:46448aliasbranchz9hG4bK.TteIOuQeurport"
and I get the warning "line 2:0 required (...)+ loop did not match anything at character 'V'".
Can somebody tell me why? I am not very experienced with English or ANTLR.
Thanks in advance!

Precedence in Antlr using parentheses

We are developing a DSL, and we're facing some problems:
Problem 1:
In our DSL, it's allowed to do this:
A + B + C
but not this:
A + B - C
If the user needs to use two or more different operators, he'll need to insert parentheses:
A + (B - C) or (A + B) - C.
Problem 2:
In our DSL, the operator with the highest precedence must be surrounded by parentheses.
For example, instead of using this way:
A + B * C
The user needs to use this:
A + (B * C)
To solve Problem 1, I have an ANTLR snippet that works, but I'm not sure it's the best way to solve it:
sumExpr
@init {boolean isSum=false;boolean isSub=false;}
: {isSum(input.LT(2).getText()) && !isSub}? multExpr('+'^{isSum=true;} sumExpr)+
| {isSub(input.LT(2).getText()) && !isSum}? multExpr('-'^{isSub=true;} sumExpr)+
| multExpr;
To solve Problem 2, I have no idea where to start.
I would appreciate help finding a better solution to the first problem and a direction for solving the second one. (Sorry for my bad English.)
Below is the grammar that we have developed:
grammar TclGrammar;
options {
output=AST;
ASTLabelType=CommonTree;
}
@members {
public boolean isSum(String type) {
System.out.println("Tipo: " + type);
return "+".equals(type);
}
public boolean isSub(String type) {
System.out.println("Tipo: " + type);
return "-".equals(type);
}
}
prog
: exprMain ';' {System.out.println(
$exprMain.tree == null ? "null" : $exprMain.tree.toStringTree());}
;
exprMain
: exprQuando? (exprDeveSatis | exprDeveFalharCaso)
;
exprDeveSatis
: 'DEVE SATISFAZER' '{'! expr '}'!
;
exprDeveFalharCaso
: 'DEVE FALHAR CASO' '{'! expr '}'!
;
exprQuando
: 'QUANDO' '{'! expr '}'!
;
expr
: logicExpr
;
logicExpr
: boolExpr (('E'|'OU')^ boolExpr)*
;
boolExpr
: comparatorExpr
| emExpr
| 'VERDADE'
| 'FALSO'
;
emExpr
: FIELD 'EM' '[' (variable_lista | field_lista) comCruzamentoExpr? ']'
-> ^('EM' FIELD (variable_lista+)? (field_lista+)? comCruzamentoExpr?)
;
comCruzamentoExpr
: 'COM CRUZAMENTO' '(' FIELD ';' FIELD (';' FIELD)* ')' -> ^('COM CRUZAMENTO' FIELD+)
;
comparatorExpr
: sumExpr (('<'^|'<='^|'>'^|'>='^|'='^|'<>'^) sumExpr)?
| naoPreenchidoExpr
| preenchidoExpr
;
naoPreenchidoExpr
: FIELD 'NAO PREENCHIDO' -> ^('NAO PREENCHIDO' FIELD)
;
preenchidoExpr
: FIELD 'PREENCHIDO' -> ^('PREENCHIDO' FIELD)
;
sumExpr
@init {boolean isSum=false;boolean isSub=false;}
: {isSum(input.LT(2).getText()) && !isSub}? multExpr('+'^{isSum=true;} sumExpr)+
| {isSub(input.LT(2).getText()) && !isSum}? multExpr('-'^{isSub=true;} sumExpr)+
| multExpr
;
multExpr
: funcExpr(('*'^|'/'^) funcExpr)?
;
funcExpr
: 'QUANTIDADE'^ '('! FIELD ')'!
| 'EXTRAI_TEXTO'^ '('! FIELD ';' INTEGER ';' INTEGER ')'!
| cruzaExpr
| 'COMBINACAO_UNICA' '(' FIELD ';' FIELD (';' FIELD)* ')' -> ^('COMBINACAO_UNICA' FIELD+)
| 'EXISTE'^ '('! FIELD ')'!
| 'UNICO'^ '('! FIELD ')'!
| atom
;
cruzaExpr
: operadorCruzaExpr ('CRUZA COM'^|'CRUZA AMBOS'^) operadorCruzaExpr ondeExpr?
;
operadorCruzaExpr
: FIELD('('!field_lista')'!)?
;
ondeExpr
: ('ONDE'^ '('!expr')'!)
;
atom
: FIELD
| VARIABLE
| '('! expr ')'!
;
field_lista
: FIELD(';' field_lista)?
;
variable_lista
: VARIABLE(';' variable_lista)?
;
FIELD
: NONCONTROL_CHAR+
;
NUMBER
: INTEGER | FLOAT
;
VARIABLE
: '\'' NONCONTROL_CHAR+ '\''
;
fragment SIGN: '+' | '-';
fragment NONCONTROL_CHAR: LETTER | DIGIT | SYMBOL;
fragment LETTER: LOWER | UPPER;
fragment LOWER: 'a'..'z';
fragment UPPER: 'A'..'Z';
fragment DIGIT: '0'..'9';
fragment SYMBOL: '_' | '.' | ',';
fragment FLOAT: INTEGER '.' '0'..'9'*;
fragment INTEGER: '0' | SIGN? '1'..'9' '0'..'9'*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {skip();}
;
This is similar to not having operator precedence at all.
expr
: funcExpr
( ('+' funcExpr)*
| ('-' funcExpr)*
| ('*' funcExpr)*
| ('/' funcExpr)*
)
;
I think the following should work. I'm assuming some lexer tokens with obvious names.
expr: sumExpr;
sumExpr: onlySum | subExpr;
onlySum: atom ( PLUS onlySum )?;
subExpr: onlySub | multExpr;
onlySub: atom ( MINUS onlySub )? ;
multExpr: atom ( STAR atom )? ;
parenExpr: OPEN_PAREN expr CLOSE_PAREN;
atom: FIELD | VARIABLE | parenExpr;
The only* rules match an expression if it only has one type of operator outside of parentheses. The *Expr rules match either a line with the appropriate type of operators or go to the next operator.
If you have multiple types of operators, then they are forced to be inside parentheses because the match will go through atom.
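The rule "one kind of operator per parenthesis level" can also be sketched as a small hand-written recursive-descent checker. This is plain Java rather than the ANTLR grammar above, and it rests on assumptions of its own: operands are simple alphanumeric names, whitespace is stripped up front, and chains of any single operator are allowed. The class name is hypothetical.

```java
final class SingleOperatorChecker {
    private final String src;
    private int pos;

    private SingleOperatorChecker(String input) {
        this.src = input.replaceAll("\\s+", ""); // whitespace is simply removed
    }

    // True if the input is well formed and never mixes operators without parentheses.
    static boolean isValid(String input) {
        SingleOperatorChecker p = new SingleOperatorChecker(input);
        return p.expr() && p.pos == p.src.length();
    }

    // expr: atom (OP atom)*, where every OP at this nesting level is the same character.
    private boolean expr() {
        if (!atom()) return false;
        char first = 0;
        while (pos < src.length() && "+-*/".indexOf(src.charAt(pos)) >= 0) {
            char op = src.charAt(pos++);
            if (first == 0) first = op;
            else if (op != first) return false; // mixed operators need parentheses
            if (!atom()) return false;
        }
        return true;
    }

    // atom: NAME | '(' expr ')'
    private boolean atom() {
        if (pos < src.length() && src.charAt(pos) == '(') {
            pos++;
            if (!expr()) return false;
            if (pos >= src.length() || src.charAt(pos) != ')') return false;
            pos++;
            return true;
        }
        int start = pos;
        while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
        return pos > start;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"A + B + C", "A + B - C", "A + (B - C)", "A + B * C", "A + (B * C)"}) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

As in the grammar, a different operator can only appear by going through atom, which forces the parentheses.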

ANTLR: Why the invalid input could match the grammar definition

I've written a very simple grammar definition for a calculation expression:
grammar SimpleCalc;
options {
output=AST;
}
tokens {
PLUS = '+' ;
MINUS = '-' ;
MULT = '*' ;
DIV = '/' ;
}
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
ID : ('a'..'z' | 'A' .. 'Z' | '0' .. '9')+ ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { Skip(); } ;
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
start: expr EOF;
expr : multExpr ((PLUS | MINUS)^ multExpr)*;
multExpr : atom ((MULT | DIV)^ atom )*;
atom : ID
| '(' expr ')' -> expr;
I tried the invalid expression ABC &* DEF with the start rule, but it passed. It looks like the & character is ignored. What's the problem here?
Actually, your invalid expression ABC &= DEF does not pass; it causes a NoViableAltException.

How do I distinguish a very keyword-like token from a keyword using ANTLR?

I am having trouble distinguishing a keyword from a non-keyword when a grammar allows the non-keyword to have a similar "look" to the keyword.
Here's the grammar:
grammar Query;
options {
output = AST;
backtrack = true;
}
tokens {
DefaultBooleanNode;
}
// Parser
startExpression : expression EOF ;
expression : withinExpression ;
withinExpression
: defaultBooleanExpression
(WSLASH^ NUMBER defaultBooleanExpression)*
;
defaultBooleanExpression
: (queryFragment -> queryFragment)
(e=queryFragment -> ^(DefaultBooleanNode $defaultBooleanExpression $e))*
;
queryFragment : unquotedQuery ;
unquotedQuery : UNQUOTED | NUMBER ;
// Lexer
WSLASH : ('W'|'w') '/';
NUMBER : Digit+ ('.' Digit+)? ;
UNQUOTED : UnquotedStartChar UnquotedChar* ;
fragment UnquotedStartChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
| ':' | '"' | '/' | '(' | ')' | '[' | ']'
| '{' | '}' | '-' | '+' | '~' | '&' | '|'
| '!' | '^' | '?' | '*' )
;
fragment UnquotedChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
| ':' | '"' | '(' | ')' | '[' | ']' | '{'
| '}' | '~' | '&' | '|' | '!' | '^' | '?'
| '*' )
;
fragment EscapeSequence
: '\\'
( 'u' HexDigit HexDigit HexDigit HexDigit
| ~( 'u' )
)
;
fragment Digit : ('0'..'9') ;
fragment HexDigit : ('0'..'9' | 'a'..'f' | 'A'..'F') ;
WHITESPACE : ( ' ' | '\r' | '\t' | '\u000C' | '\n' ) { skip(); };
I have simplified it enough to get rid of the distractions but I think removing any more would remove the problem.
A slash is permitted in the middle of an unquoted query fragment.
Boolean queries in particular have no required keyword.
A new syntax (e.g. W/3) is being introduced but I'm trying not to affect existing queries which happen to look similar (e.g. X/Y)
Due to '/' being valid as part of a word, ANTLR appears to be giving me "W/3" as a single token of type UNQUOTED instead of it being a WSLASH followed by a NUMBER.
Due to the above, I end up with a tree like: DefaultBooleanNode(DefaultBooleanNode(~first clause~, "W/3"), ~second clause~), whereas what I really wanted was WSLASH(~first clause~, "3", ~second clause~).
What I would like to do is somehow write the UNQUOTED rule as "what I have now, but not matching ~~~~", but I'm at a loss for how to do that.
I realise that I could spell it out in full, e.g.:
Any character from UnquotedStartChar except 'w', followed by the rest of the rule
'w' followed by any character from UnquotedChar except '/', followed by the rest of the rule
'w/' followed by any character from UnquotedChar except digits
...
However, that would look awful. :)
When a lexer generated by ANTLR "sees" that certain input can be matched by more than one rule, it chooses the longest match. If you want a shorter match to take precedence, you'll need to merge all the similar rules into one and then check with a gated semantic predicate whether the shorter match is ahead or not. If the shorter match is ahead, you change the type of the token.
A demo:
Query.g
grammar Query;
tokens {
WSlash;
}
@lexer::members {
private boolean ahead(String text) {
for(int i = 0; i < text.length(); i++) {
if(input.LA(i + 1) != text.charAt(i)) {
return false;
}
}
return true;
}
}
parse
: (t=. {System.out.printf("\%-10s \%s\n", tokenNames[$t.type], $t.text);} )* EOF
;
NUMBER
: Digit+ ('.' Digit+)?
;
UNQUOTED
: {ahead("W/")}?=> 'W/' { $type=WSlash; /* change the type of the token */ }
| {ahead("w/")}?=> 'w/' { $type=WSlash; /* change the type of the token */ }
| UnquotedStartChar UnquotedChar*
;
fragment UnquotedStartChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
| ':' | '"' | '/' | '(' | ')' | '[' | ']'
| '{' | '}' | '-' | '+' | '~' | '&' | '|'
| '!' | '^' | '?' | '*' )
;
fragment UnquotedChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
| ':' | '"' | '(' | ')' | '[' | ']' | '{'
| '}' | '~' | '&' | '|' | '!' | '^' | '?'
| '*' )
;
fragment EscapeSequence
: '\\'
( 'u' HexDigit HexDigit HexDigit HexDigit
| ~'u'
)
;
fragment Digit : '0'..'9';
fragment HexDigit : '0'..'9' | 'a'..'f' | 'A'..'F';
WHITESPACE : (' ' | '\r' | '\t' | '\u000C' | '\n') { skip(); };
Main.java
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
QueryLexer lexer = new QueryLexer(new ANTLRStringStream("P/3 W/3"));
QueryParser parser = new QueryParser(new CommonTokenStream(lexer));
parser.parse();
}
}
To run the demo on *nix/MacOS:
java -cp antlr-3.3.jar org.antlr.Tool Query.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
or on Windows:
java -cp antlr-3.3.jar org.antlr.Tool Query.g
javac -cp antlr-3.3.jar *.java
java -cp .;antlr-3.3.jar Main
which will print the following:
UNQUOTED P/3
WSlash W/
NUMBER 3
EDIT
To eliminate the warning when using the WSlash token in a parser rule, simply add an empty fragment rule to your grammar:
fragment WSlash : /* empty */ ;
It's a bit of a hack, but that's how it's done. No more warnings.
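For readers who want to see the idea without generating a lexer, the following is a hand-written Java sketch of the same decision: look ahead for the short keyword first, and only otherwise take the longest match. It is not the code ANTLR generates. The class name is hypothetical, escape sequences are left out, and letting NUMBER win a tie against UNQUOTED is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

final class QueryLexerSketch {
    // Characters that cannot appear inside an unquoted word (UnquotedChar in the grammar).
    private static final String EXCLUDED = " \r\t\f\n\\:\"()[]{}~&|!^?*";
    // The first character is additionally barred from being '/', '-' or '+'.
    private static final String EXCLUDED_AT_START = EXCLUDED + "/-+";

    static List<String> tokenize(String in) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < in.length()) {
            if (" \r\t\f\n".indexOf(in.charAt(i)) >= 0) { i++; continue; } // skip whitespace
            // The "predicate": test for the short keyword before anything else.
            if (in.startsWith("W/", i) || in.startsWith("w/", i)) {
                out.add("WSlash " + in.substring(i, i + 2));
                i += 2;
                continue;
            }
            int num = numberLength(in, i);
            int word = wordLength(in, i);
            if (num == 0 && word == 0) {
                throw new IllegalArgumentException("unexpected '" + in.charAt(i) + "' at " + i);
            }
            if (num >= word) { out.add("NUMBER " + in.substring(i, i + num)); i += num; }
            else { out.add("UNQUOTED " + in.substring(i, i + word)); i += word; }
        }
        return out;
    }

    private static boolean isDigit(char c) { return c >= '0' && c <= '9'; }

    // Length of Digit+ ('.' Digit+)? starting at 'from', or 0 if there is no digit.
    private static int numberLength(String in, int from) {
        int i = from;
        while (i < in.length() && isDigit(in.charAt(i))) i++;
        if (i == from) return 0;
        if (i + 1 < in.length() && in.charAt(i) == '.' && isDigit(in.charAt(i + 1))) {
            i++;
            while (i < in.length() && isDigit(in.charAt(i))) i++;
        }
        return i - from;
    }

    // Length of the unquoted word starting at 'from', or 0 if the start char is not allowed.
    private static int wordLength(String in, int from) {
        if (EXCLUDED_AT_START.indexOf(in.charAt(from)) >= 0) return 0;
        int i = from + 1;
        while (i < in.length() && EXCLUDED.indexOf(in.charAt(i)) < 0) i++;
        return i - from;
    }

    public static void main(String[] args) {
        for (String token : tokenize("P/3 W/3")) System.out.println(token);
    }
}
```

The keyword test only runs at the start of a token, so a word that merely contains W/ further in is still read as a single UNQUOTED token.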