Antlr not producing a visit method for some reason - antlr

The following grammar, in Java, does not produce a visitor with "visitExpr" and I have no idea why... I added valueExpression and it DOES produce visitValueExpression, but this doesn't make it easy to get an expr out of all the math expressions.
grammar Txml;
program: statement (NEWLINE statement)* NEWLINE? EOF;
statement: require # Condition
| entry # CreateEntry
| assignment # Assign
;
require: REQUIRE valueExpression;
valueExpression: expr;
expr: lhs=expr ('*' | '/') rhs=expr # MulDiv
| lhs=expr ('+' | '-') rhs=expr # AddSub
| lhs=expr '%' rhs=expr # Mod
| lhs=expr '^' rhs=expr # Pow
| '(' expr ')' # Parens
| NUMBER # NumberLiteral
| IDENT '(' args ')' # FunctionCall
| IDENT # Identifier
| STRING_LITERAL # StringLiteral
;
functionArgument: expr;
args: (functionArgument (',' functionArgument)*)?;
// Reserved words
REQUIRE: 'require';
// Whitespace and line break
NEWLINE : [\r\n];
WS: [ \t] + -> skip;
// Entities
NUMBER: ('0' .. '9') + ('.' ('0' .. '9') +)?;
IDENT: [a-zA-Z]+[0-9a-zA-Z]*;
STRING_LITERAL : '"' (~('"' | '\\' | '\r' | '\n') | '\\' ('"' | '\\'))* '"';
Also, I don't quite see how to visit a "generic" node in the base visitor - how do I get a RuleNode out of a specific context?

[...] does not produce a visitor with "visitExpr" and I have no idea why
When labelling a parser rule r, no visitR(...) will be generated. Only the visit...() methods of the alternatives will be generated.
So without alt labels:
r
: a
| b
;
// Only 1 method:
// - visitR(...)
With alt labels:
r
: a #altA
| b #altB
;
// Two methods:
// - visitAltA(...)
// - visitAltB(...)
Also, I don't quite see how to visit a "generic" node in the base visitor - how do I get a RuleNode out of a specific context?
You could override the AbstractParseTreeVisitor<T>#visitChildren(...) method to listen to any rule. A quick demo:
public class Main {
public static void main(String[] args) throws Exception {
String source = "require a + b";
TxmlLexer lexer = new TxmlLexer(CharStreams.fromString(source));
TxmlParser parser = new TxmlParser(new CommonTokenStream(lexer));
ParseTree root = parser.program();
new TestTXmlVisitor().visit(root);
}
}
class TestTXmlVisitor extends TxmlBaseVisitor<Object> {
#Override
public Object visitChildren(RuleNode node) {
System.out.println("visited: " + node.getClass().getSimpleName() + " -> " + node.getText());
return super.visitChildren(node);
}
}
will print:
visited: ProgramContext -> requirea+b<EOF>
visited: ConditionContext -> requirea+b
visited: RequireContext -> requirea+b
visited: ValueExpressionContext -> a+b
visited: AddSubContext -> a+b
visited: IdentifierContext -> a
visited: IdentifierContext -> b

Related

Using Antlr to parse formulas with multiple locales

I'm very new to Antlr, so forgive what may be a very easy question.
I am creating a grammar which parses Excel-like formulas and it needs to support multiple locales based on the list separator (, for en-US) and decimal separator (. for en-US). I would prefer not to choose between separate grammars to parse with based on locale.
Can I modify or inherit from the CommonTokenStream class to accomplish this, or is there another way to do this? Examples would be helpful.
I am using the Antlr v4.5.0-alpha003 NuGet package in my VS2015 C# project.
What you can do is add a locale (or custom separator- and grouping-characters) to your lexer, and add a semantic predicate before the lexer rule that inspects your custom separator- and grouping-characters and match these tokens dynamically.
I don't have ANTLR and C# running here, but the Java demo should be pretty similar:
grammar LocaleDemo;
#lexer::header {
import java.text.DecimalFormatSymbols;
import java.util.Locale;
}
#lexer::members {
private char decimalSeparator = '.';
private char groupingSeparator = ',';
public LocaleDemoLexer(CharStream input, Locale locale) {
this(input);
DecimalFormatSymbols dfs = new DecimalFormatSymbols(locale);
this.decimalSeparator = dfs.getDecimalSeparator();
this.groupingSeparator = dfs.getGroupingSeparator();
}
}
parse
: .*? EOF
;
NUMBER
: D D? ( DG D D D )* ( DS D+ )?
;
OTHER
: .
;
fragment D : [0-9];
fragment DS : {_input.LA(1) == decimalSeparator}? . ;
fragment DG : {_input.LA(1) == groupingSeparator}? . ;
To test the grammar above, run this class:
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;
import java.util.Locale;
public class Main {
private static void tokenize(String input, Locale locale) {
LocaleDemoLexer lexer = new LocaleDemoLexer(new ANTLRInputStream(input), locale);
System.out.printf("\ninput='%s', locale=%s, tokens:\n", input, locale);
for (Token t : lexer.getAllTokens()) {
System.out.printf(" %-10s '%s'\n", LocaleDemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
public static void main(String[] args) throws Exception {
tokenize("1.23", Locale.ENGLISH);
tokenize("1.23", Locale.GERMAN);
tokenize("12.345.678,90", Locale.ENGLISH);
tokenize("12.345.678,90", Locale.GERMAN);
}
}
which would print:
input='1.23', locale=en, tokens:
NUMBER '1.23'
input='1.23', locale=de, tokens:
NUMBER '1'
OTHER '.'
NUMBER '23'
input='12.345.678,90', locale=en, tokens:
NUMBER '12.345'
OTHER '.'
NUMBER '67'
NUMBER '8'
OTHER ','
NUMBER '90'
input='12.345.678,90', locale=de, tokens:
NUMBER '12.345.678,90'
Related Q&A's:
What is a 'semantic predicate' in ANTLR?
What does "fragment" mean in ANTLR?
As a follow-up to Bart's answer, this is the grammar I created with his suggestions:
grammar ExcelScript;
#lexer::header
{
using System;
using System.Globalization;
}
#lexer::members
{
private Int32 listseparator = 44; // UTF16 value for comma
private Int32 decimalseparator = 46; // UTF16 value for period
/// <summary>
/// Creates a new lexer object
/// </summary>
/// <param name="input">The input stream</param>
/// <param name="locale">The locale to use in parsing numbers</param>
/// <returns>A new lexer object</returns>
public ExcelScriptLexer (ICharStream input, CultureInfo locale)
: this(input)
{
this.listseparator = Convert.ToInt32(locale.TextInfo.ListSeparator[0]);
this.decimalseparator = Convert.ToInt32(locale.NumberFormat.NumberDecimalSeparator[0]);
// special case for 8 locales where the list separator is a , and the number separator is a , too
// Excel uses semicolon for list separator, so we will too
if (this.listseparator == 44 && this.decimalseparator == 44)
this.listseparator = 59; // UTF16 value for semicolon
}
}
/*
* Parser Rules
*/
formula
: numberLiteral
| Identifier
| '=' expression
;
expression
: primary # PrimaryExpression
| Identifier arguments # FunctionCallExpression
| ('+' | '-') expression # UnarySignExpression
| expression ('*' | '/' | '%') expression # MulDivModExpression
| expression ('+' | '-') expression # AddSubExpression
| expression ('<=' | '>=' | '>' | '<') expression # CompareExpression
| expression ('=' | '<>') expression # EqualCompareExpression
;
primary
: '(' expression ')' # ParenExpression
| literal # LiteralExpression
| Identifier # IdentifierExpression
;
literal
: numberLiteral # NumberLiteralRule
| booleanLiteral # BooleanLiteralRule
;
numberLiteral
: IntegerLiteral
| FloatingPointLiteral
;
booleanLiteral
: TrueKeyword
| FalseKeyword
;
arguments
: '(' expressionList? ')'
;
expressionList
: expression (ListSeparator expression)*
;
/*
* Lexer Rules
*/
AddOperator : '+' ;
SubOperator : '-' ;
MulOperator : '*' ;
DivOperator : '/' ;
PowOperator : '^' ;
EqOperator : '=' ;
NeqOperator : '<>' ;
LeOperator : '<=' ;
GeOperator : '>=' ;
LtOperator : '<' ;
GtOperator : '>' ;
ListSeparator : {_input.La(1) == listseparator}? . ;
DecimalSeparator : {_input.La(1) == decimalseparator}? . ;
TrueKeyword : [Tt][Rr][Uu][Ee] ;
FalseKeyword : [Ff][Aa][Ll][Ss][Ee] ;
Identifier
: Letter (Letter | Digit)*
;
fragment Letter
: [A-Z_a-z]
;
fragment Digit
: [0-9]
;
IntegerLiteral
: '0'
| [1-9] [0-9]*
;
FloatingPointLiteral
: [0-9]+ DecimalSeparator [0-9]* Exponent?
| DecimalSeparator [0-9]+ Exponent?
| [0-9]+ Exponent
;
fragment Exponent
: ('e' | 'E') ('+' | '-')? ('0'..'9')+
;
WhiteSpace
: [ \t]+ -> channel(HIDDEN)
;

ANTLR: Why the invalid input could match the grammar definition

I've written a very simple grammar definition for a calculation expression:
grammar SimpleCalc;
options {
output=AST;
}
tokens {
PLUS = '+' ;
MINUS = '-' ;
MULT = '*' ;
DIV = '/' ;
}
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
ID : ('a'..'z' | 'A' .. 'Z' | '0' .. '9')+ ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { Skip(); } ;
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
start: expr EOF;
expr : multExpr ((PLUS | MINUS)^ multExpr)*;
multExpr : atom ((MULT | DIV)^ atom )*;
atom : ID
| '(' expr ')' -> expr;
I've tried the invalid expression ABC &* DEF by start but it passed. It looks like the & charactor is ignored. What's the problem here?
Actually your invalid expression ABC &= DEF hasn't been passed; it causes NoViableAltException.

Solving antlr left recursion

I'm trying to parse a language using ANTLR which can contain the following syntax:
someVariable, somVariable.someMember, functionCall(param).someMember, foo.bar.baz(bjork).buffalo().xyzzy
This is the ANTLR grammar which i've come up with so far, and the access_operation throws the error
The following sets of rules are mutually left-recursive [access_operation, expression]:
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
LHS;
RHS;
CALL;
PARAMS;
}
start
: body? EOF
;
body
: expression (',' expression)*
;
expression
: function -> ^(CALL)
| access_operation
| atom
;
access_operation
: (expression -> ^(LHS)) '.'! (expression -> ^(RHS))
;
function
: (IDENT '(' body? ')') -> ^(IDENT PARAMS?)
;
atom
: IDENT
| NUMBER
;
fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT : '0'..'9';
IDENT : (LETTER)+ ;
NUMBER : (DIGIT)+ ;
SPACE : (' ' | '\t' | '\r' | '\n') { $channel=HIDDEN; };
What i could manage so far was to refactor the access_operation rule to '.' expression which generates an AST where the access_operation node only contains the right side of the operation.
What i'm looking for instead is something like this:
How can the left-recursion problem solved in this case?
By "wrong AST" I'll make a semi educated guess that, for input like "foo.bar.baz", you get an AST where foo is the root with bar as a child who in its turn has baz as a child, which is a leaf in the AST. You may want to have this reversed. But I'd not go for such an AST if I were you: I'd keep the AST as flat as possible:
foo
/ | \
/ | \
bar baz ...
That way, evaluating is far easier: you simply look up foo, and then walk from left to right through its children.
A quick demo:
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
BODY;
ACCESS;
CALL;
PARAMS;
}
start
: body EOF -> body
;
body
: expression (',' expression)* -> ^(BODY expression+)
;
expression
: atom
;
atom
: NUMBER
| (IDENT -> IDENT) ( tail -> ^(IDENT tail)
| call tail? -> ^(CALL IDENT call tail?)
)?
;
tail
: (access)+
;
access
: ('.' IDENT -> ^(ACCESS IDENT)) (call -> ^(CALL IDENT call))?
;
call
: '(' (expression (',' expression)*)? ')' -> ^(PARAMS expression*)
;
IDENT : LETTER+;
NUMBER : DIGIT+;
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT : '0'..'9';
which can be tested with:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "someVariable, somVariable.someMember, functionCall(param).someMember, " +
"foo.bar.baz(bjork).buffalo().xyzzy";
TestLexer lexer = new TestLexer(new ANTLRStringStream(src));
TestParser parser = new TestParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.start().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
The output of Main corresponds to the following AST:
EDIT
And since you indicated your ultimate goal is not evaluating the input, but that you rather need to conform the structure of the AST to some 3rd party API, here's a grammar that will create an AST like you indicated in your edited question:
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
BODY;
ACCESS_OP;
CALL;
PARAMS;
LHS;
RHS;
}
start
: body EOF -> body
;
body
: expression (',' expression)* -> ^(BODY expression+)
;
expression
: atom
;
atom
: NUMBER
| (ID -> ID) ( ('(' params ')' -> ^(CALL ID params))
('.' expression -> ^(ACCESS_OP ^(LHS ^(CALL ID params)) ^(RHS expression)))?
| '.' expression -> ^(ACCESS_OP ^(LHS ID) ^(RHS expression))
)?
;
params
: (expression (',' expression)*)? -> ^(PARAMS expression*)
;
ID : LETTER+;
NUMBER : DIGIT+;
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT : '0'..'9';
which creates the following AST if you run the Main class:
The atom rule may be a bit daunting, but you can't shorten it much since the left ID needs to be available to most of the alternatives. ANTLRWorks helps in visualizing the alternative paths this rule may take:
which means atom can be any of the 5 following alternatives (with their corresponding AST's):
+----------------------+--------------------------------------------------------+
| alternative | generated AST |
+----------------------+--------------------------------------------------------+
| NUMBER | NUMBER |
| ID | ID |
| ID params | ^(CALL ID params) |
| ID params expression | ^(ACCESS_OP ^(LHS ^(CALL ID params)) ^(RHS expression))|
| ID expression | ^(ACCESS_OP ^(LHS ID) ^(RHS expression) |
+----------------------+--------------------------------------------------------+

ANTLR Template translator match part of grammar

I wrote a grammar for a language and now I want to treat some syntactic sugar constructions, for that I was thinking of writing a template translator.
The problem is I want my template grammar to translate only some constructions of the language and leave the rest as it is.
For example:
I have this as input:
class Main {
int a[10];
}
and I want to translate that into something like:
class Main {
Array a = new Array(10);
}
Ideally I would like to do some think like this in ANTLR
grammer Translator
options { output=template;}
decl
: TYPE ID '[' INT ']' -> template(name = {$ID.text}, size ={$INT.text})
"Array <name> = new Array(<size>);
I would like it to leave the rest of the input that doesn't match rule decl as it is.
How can I achieve this in ANTLR without writing the full grammar for the language ?
I would simply handle such things in the parser grammar.
Assuming you're constructing an AST in your parser grammar, I guess you'll have a rule to parse input like Array a = new Array(10); similar to:
decl
: TYPE ID '=' expr ';' -> ^(DECL TYPE ID expr)
;
where expr eventually matches a term like this:
term
: NUMBER
| 'new' ID '(' (expr (',' expr)*)? ')' -> ^('new' ID expr*)
| ...
;
To account for your short-hand declaration int a[10];, all you have to do is expand decl like this:
decl
: TYPE ID '=' expr ';' -> ^(DECL TYPE ID expr)
| TYPE ID '[' expr ']' ';' -> ^(DECL 'Array' ID ^(NEW ARRAY expr))
;
which will rewrite the input int a[10]; into the following AST:
which is exactly the same as the AST created for input Array a = new Array(10);.
EDIT
Here's a small working demo:
grammar T;
options {
output=AST;
}
tokens {
ROOT;
DECL;
NEW='new';
INT='int';
ARRAY='Array';
}
parse
: decl+ EOF -> ^(ROOT decl+)
;
decl
: type ID '=' expr ';' -> ^(DECL type ID expr)
| type ID '[' expr ']' ';' -> ^(DECL ARRAY ID ^(NEW ARRAY expr))
;
expr
: Number
| NEW type '(' (expr (',' expr)*)? ')' -> ^(NEW ID expr*)
;
type
: INT
| ARRAY
| ID
;
ID : ('a'..'z' | 'A'..'Z')+;
Number : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n') {skip();};
which can be tested with the class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "Array a = new Array(10); int a[10];";
TLexer lexer = new TLexer(new ANTLRStringStream(src));
TParser parser = new TParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}

ANTLR expression interpreter

I have created the following grammar: I would like some idea how to build an interpreter that returns a tree in java, which I can later use for printing in the screen, Im bit stack on how to start on it.
grammar myDSL;
options {
language = Java;
}
#header {
package DSL;
}
#lexer::header {
package DSL;
}
program
: IDENT '={' components* '}'
;
components
: IDENT '=('(shape)(shape|connectors)* ')'
;
shape
: 'Box' '(' (INTEGER ','?)* ')'
| 'Cylinder' '(' (INTEGER ','?)* ')'
| 'Sphere' '(' (INTEGER ','?)* ')'
;
connectors
: type '(' (INTEGER ','?)* ')'
;
type
: 'MG'
| 'EL'
;
IDENT: ('a'..'z' | 'A'..'Z')('a'..'z' | 'A'..'Z' | '0'..'0')*;
INTEGER: '0'..'9'+;
// This if for the empty spaces between tokens and avoids them in the parser
WS: (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel=HIDDEN;};
COMMENT: '//' .* ('\n' | '\r') {$channel=HIDDEN;};
A couple of remarks:
There's no need to set the language for Java, which is the default target language. So you can remove this:
options {
language = Java;
}
Your IDENT contains an error:
IDENT: ('a'..'z' | 'A'..'Z')('a'..'z' | 'A'..'Z' | '0'..'0')*;
the '0'..'0') should most probably be '0'..'9').
The sub rule (INTEGER ','?)* also matches source like 1 2 3 4 (no comma's at all!). Perhaps you meant to do: (INTEGER (',' INTEGER)*)?
Now, as to your question: how to let ANTLR construct a proper AST? This can be done by adding output = AST; in your options block:
options {
//language = Java;
output = AST;
}
And then either adding the "tree operators" ^ and ! in your parser rules, or by using tree rewrite rules: rule: a b c -> ^(c b a).
The "tree operator" ^ is used to define the root of the (sub) tree and ! is used to exclude a token from the (sub) tree.
Rewrite rules have ^( /* tokens here */ ) where the first token (right after ^() is the root of the (sub) tree, and all following tokens are child nodes of the root.
An example might be in order. Let's take your first rule:
program
: IDENT '={' components* '}'
;
and you want to let IDENT be the root, components* the children and you want to exclude ={ and } from the tree. You can do that by doing:
program
: IDENT^ '={'! components* '}'!
;
or by doing:
program
: IDENT '={' components* '}' -> ^(IDENT components*)
;