ANTLR Is it possible to make grammar with embed grammar inside? - antlr

ANTLR: Is it possible to make grammar with embed grammar (with it's own lexer) inside?
For example in my language I have ability to use embed SQL language:
var Query = [select * from table];
with Query do something ....;
Is it possible with ANTLR?

Is it possible to make grammar with embed grammar (with it's own lexer) inside?
If you mean whether it is possible to define two languages in a single grammar (using separate lexers), then the answer is: no, that's not possible.
However, if the question is whether it is possible to parse two languages into a single AST, then the answer is: yes, it is possible.
You simply need to:
define both languages in their own grammar;
create a lexer rule in you main grammar that captures the entire input of the embedded language;
use a rewrite rule that calls a custom method that parses the external AST and inserts it in the main AST using { ... } (see the expr rule in the main grammar (MyLanguage.g)).
MyLanguage.g
grammar MyLanguage;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
ROOT;
}
#members {
private CommonTree parseSQL(String sqlSrc) {
try {
MiniSQLLexer lexer = new MiniSQLLexer(new ANTLRStringStream(sqlSrc));
MiniSQLParser parser = new MiniSQLParser(new CommonTokenStream(lexer));
return (CommonTree)parser.parse().getTree();
} catch(Exception e) {
return new CommonTree(new CommonToken(-1, e.getMessage()));
}
}
}
parse
: assignment+ EOF -> ^(ROOT assignment+)
;
assignment
: Var Id '=' expr ';' -> ^('=' Id expr)
;
expr
: Num
| SQL -> {parseSQL($SQL.text)}
;
Var : 'var';
Id : ('a'..'z' | 'A'..'Z')+;
Num : '0'..'9'+;
SQL : '[' ~']'* ']';
Space : ' ' {skip();};
MiniSQL.g
grammar MiniSQL;
options {
output=AST;
ASTLabelType=CommonTree;
}
parse
: '[' statement ']' EOF -> statement
;
statement
: select
;
select
: Select '*' From ID -> ^(Select '*' From ID)
;
Select : 'select';
From : 'from';
ID : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
Main.java
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "var Query = [select * from table]; var x = 42;";
MyLanguageLexer lexer = new MyLanguageLexer(new ANTLRStringStream(src));
MyLanguageParser parser = new MyLanguageParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
Run the demo
java -cp antlr-3.3.jar org.antlr.Tool MiniSQL.g
java -cp antlr-3.3.jar org.antlr.Tool MyLanguage.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
Given the input:
var Query = [select * from table]; var x = 42;
the output of the Main class corresponds to the following AST:
And if you want to allow string literals inside your SQL (which could contain ]), and comments (which could contain ' and ]), the you could use the following SQL rule inside your main grammar:
SQL
: '[' ( ~(']' | '\'' | '-')
| '-' ~'-'
| COMMENT
| STR
)*
']'
;
fragment STR
: '\'' (~('\'' | '\r' | '\n') | '\'\'')+ '\''
| '\'\''
;
fragment COMMENT
: '--' ~('\r' | '\n')*
;
which would properly parse the following input in a single token:
[
select a,b,c
from table
where a='A''B]C'
and b='' -- some ] comment ] here'
]
Just beware that trying to create a grammar for an entire SQL dialect (or even a large subset) is no trivial task! You may want to search for existing SQL parsers, or look at the ANTLR wiki for example-grammars.

Yes, with AntLR it is called Island grammar.
You can get a working example in the v3 examples, inside the island-grammar folder : it shows the usage of a grammar to parse javadoc comments inside of java code.
You can also find some clues in the doc Island Grammars Under Parser Control and that Another one.

Related

Unable to parse APL Symbol using ANTLR

I am trying to parse APL expressions using ANTLR, It is sort of APL source code parser. It parse normal characters but fails to parse special symbols(like '←')
expression = N←0
Lexer
/* Lexer Tokens. */
NUMBER:
(DIGIT)+ ( '.' (DIGIT)+ )?;
ASSIGN:
'←'
;
DIGIT :
[0-9]
;
Output:
[#0,0:1='99',<NUMBER>,1:0]
**[#1,4:6='â??',<'â??'>,2:0**]
[#2,7:6='<EOF>',<EOF>,2:3]
Can some one help me to parse special characters from APL language.
I am following below steps.
Written Grammar
"antlr4.bat" used to generate parser from grammar.
"grun.bat" is used to generate token
"grun.bat" is used to generate token
That just means your terminal cannot display the character properly. There is nothing wrong with the generated parser or lexer not being able to recognise ←.
Just don't use the bat file, but rather test your lexer and parser by writing a small class yourself using your favourite IDE (which can display the characters properly).
Something like this:
grammar T;
expression
: ID ARROW NUMBER
;
ID : [a-zA-Z]+;
ARROW : '←';
NUMBER : [0-9]+;
SPACE : [ \t\r\n]+ -> skip;
and a main class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
TLexer lexer = new TLexer(CharStreams.fromString("N ← 0"));
TParser parser = new TParser(new CommonTokenStream(lexer));
System.out.println(parser.expression().toStringTree(parser));
}
}
which will display:
(expression N ← 0)
EDIT
You could also try using the unicode escape for the arrow like this:
grammar T;
expression
: ID ARROW NUMBER
;
ID : [a-zA-Z]+;
ARROW : '\u2190';
NUMBER : [0-9]+;
SPACE : [ \t\r\n]+ -> skip;
and the Java class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
String source = "N \u2190 0";
TLexer lexer = new TLexer(CharStreams.fromString(source));
TParser parser = new TParser(new CommonTokenStream(lexer));
System.out.println(source + ": " + parser.expression().toStringTree(parser));
}
}
which will print:
N ← 0: (expression N ← 0)

how to report grammar ambiguity in antlr4

According to the antlr4 book (page 159), and using the grammar Ambig.g4, grammar ambiguity can be reported by:
grun Ambig stat -diagnostics
or equivalently, in code form:
parser.removeErrorListeners();
parser.addErrorListener(new DiagnosticErrorListener());
parser.getInterpreter().setPredictionMode(PredictionMode.LL_EXACT_AMBIG_DETECTION);
The grun command reports the ambiguity properly for me, using antlr-4.5.3. But when I use the code form, I dont get the ambiguity report. Here is the command trace:
$ antlr4 Ambig.g4 # see the book's page.159 for the grammar
$ javac Ambig*.java
$ grun Ambig stat -diagnostics < in1.txt # in1.txt is as shown on page.159
line 1:3 reportAttemptingFullContext d=0 (stat), input='f();'
line 1:3 reportAmbiguity d=0 (stat): ambigAlts={1, 2}, input='f();'
$ javac TestA_Listener.java
$ java TestA_Listener < in1.txt # exits silently
The TestA_Listener.java code is the following:
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.atn.*; // for PredictionMode
import java.util.*;
public class TestA_Listener {
public static void main(String[] args) throws Exception {
ANTLRInputStream input = new ANTLRInputStream(System.in);
AmbigLexer lexer = new AmbigLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
AmbigParser parser = new AmbigParser(tokens);
parser.removeErrorListeners(); // remove ConsoleErrorListener
parser.addErrorListener(new DiagnosticErrorListener());
parser.getInterpreter().setPredictionMode(PredictionMode.LL_EXACT_AMBIG_DETECTION);
parser.stat();
}
}
Can somebody please point out how the above java code should be modified, to print the ambiguity report?
For completeness, here is the code Ambig.g4 :
grammar Ambig;
stat: expr ';' // expression statement
| ID '(' ')' ';' // function call statement
;
expr: ID '(' ')'
| INT
;
INT : [0-9]+ ;
ID : [a-zA-Z]+ ;
WS : [ \t\r\n]+ -> skip ;
And here is the input file in1.txt :
f();
Antlr4 is a top-down parser, so for the given input, the parse match is unambiguously:
stat -> expr -> ID -> ( -> ) -> stat(cnt'd) -> ;
The second stat alt is redundant and never reached, not ambiguous.
To resolve the apparent redundancy, a predicate might be used:
stat: e=expr {isValidExpr($e)}? ';' #exprStmt
| ID '(' ')' ';' #funcStmt
;
When isValidExpr is false, the function statement alternative will be evaluated.
I waited for several days for other people to post their answers. Finally after several rounds of experimenting, I found an answer:
The following line should be deleted from the above code. Then we get the same ambiguity report as given by grun.
parser.removeErrorListeners(); // remove ConsoleErrorListener
The following code will be work
public static void main(String[] args) throws IOException {
CharStream input = CharStreams.fromStream(System.in);
AmbigLexer lexer = new AmbigLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
AmbigParser parser = new AmbigParser(tokens);
//parser.removeErrorListeners(); // remove ConsoleErrorListener
parser.addErrorListener(new org.antlr.v4.runtime.DiagnosticErrorListener()); // add ours
parser.getInterpreter().setPredictionMode(PredictionMode.LL_EXACT_AMBIG_DETECTION);
parser.stat(); // parse as usual
}

variable not passed to predicate method in ANTLR

The java code generated from ANTLR is one rule, one method in most times. But for the following rule:
switchBlockLabels[ITdcsEntity _entity,TdcsMethod _method,List<IStmt> _preStmts]
: ^(SWITCH_BLOCK_LABEL_LIST switchCaseLabel[_entity, _method, _preStmts]* switchDefaultLabel? switchCaseLabel*)
;
it generates a submethod named synpred125_TreeParserStage3_fragment(), in which mehod switchCaseLabel(_entity, _method, _preStmts) is called:
synpred125_TreeParserStage3_fragment(){
......
switchCaseLabel(_entity, _method, _preStmts);//variable not found error
......
}
switchBlockLabels(ITdcsEntity _entity,TdcsMethod _method,List<IStmt> _preStmts){
......
synpred125_TreeParserStage3_fragment();
......
}
The problem is switchCaseLabel has parameters and the parameters come from the parameters of switchBlockLabels() method, so "variable not found error" occurs.
How can I solve this problem?
My guess is that you've enabled global backtracking in your grammar like this:
options {
backtrack=true;
}
in which case you can't pass parameters to ambiguous rules. In order to communicate between ambiguous rules when you have enabled global backtracking, you must use rule scopes. The "predicate-methods" do have access to rule scopes variables.
A demo
Let's say we have this ambiguous grammar:
grammar Scope;
options {
backtrack=true;
}
parse
: atom+ EOF
;
atom
: numberOrName+
;
numberOrName
: Number
| Name
;
Number : '0'..'9'+;
Name : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
(for the record, the atom+ and numberOrName+ make it ambiguous)
If you now want to pass information between the parse and numberOrName rule, say an integer n, something like this will fail (which is the way you tried it):
grammar Scope;
options {
backtrack=true;
}
parse
#init{int n = 0;}
: (atom[++n])+ EOF
;
atom[int n]
: (numberOrName[n])+
;
numberOrName[int n]
: Number {System.out.println(n + " = " + $Number.text);}
| Name {System.out.println(n + " = " + $Name.text);}
;
Number : '0'..'9'+;
Name : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
In order to do this using rule scopes, you could do it like this:
grammar Scope;
options {
backtrack=true;
}
parse
scope{int n; /* define the scoped variable */ }
#init{$parse::n = 0; /* important: initialize the variable! */ }
: atom+ EOF
;
atom
: numberOrName+
;
numberOrName /* increment and print the scoped variable from the parse rule */
: Number {System.out.println(++$parse::n + " = " + $Number.text);}
| Name {System.out.println(++$parse::n + " = " + $Name.text);}
;
Number : '0'..'9'+;
Name : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
Test
If you now run the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "foo 42 Bar 666";
ScopeLexer lexer = new ScopeLexer(new ANTLRStringStream(src));
ScopeParser parser = new ScopeParser(new CommonTokenStream(lexer));
parser.parse();
}
}
you will see the following being printed to the console:
1 = foo
2 = 42
3 = Bar
4 = 666
P.S.
I don't know what language you're parsing, but enabling global backtracking is usually overkill and can have quite an impact on the performance of your parser. Computer languages often are ambiguous in just a few cases. Instead of enabling global backtracking, you really should look into adding syntactic predicates, or enabling backtracking on those rules that are ambiguous. See The Definitive ANTLR Reference for more info.

Whats the correct way to add new tokens (rewrite) to create AST nodes that are not on the input steam

I've a pretty basic math expression grammar for ANTLR here and what's of interest is handling the implied * operator between parentheses e.g. (2-3)(4+5)(6*7) should actually be (2-3)*(4+5)*(6*7).
Given the input (2-3)(4+5)(6*7) I'm trying to add the missing * operator to the AST tree while parsing, in the following grammar I think I've managed to achieve that but I'm wondering if this is the correct, most elegant way?
grammar G;
options {
language = Java;
output=AST;
ASTLabelType=CommonTree;
}
tokens {
ADD = '+' ;
SUB = '-' ;
MUL = '*' ;
DIV = '/' ;
OPARN = '(' ;
CPARN = ')' ;
}
start
: expression EOF!
;
expression
: mult (( ADD^ | SUB^ ) mult)*
;
mult
: atom (( MUL^ | DIV^) atom)*
;
atom
: INTEGER
| (
OPARN expression CPARN -> expression
)
(
OPARN expression CPARN -> ^(MUL expression)+
)*
;
INTEGER : ('0'..'9')+ ;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel = HIDDEN;};
This grammar appears to output the correct AST Tree in ANTLRworks:
I'm only just starting to get to grips with parsing and ANTLR, don't have much experience so feedback with really appreciated!
Thanks in advance! Carl
First of all, you did a great job given the fact that you've never used ANTLR before.
You can omit the language=Java and ASTLabelType=CommonTree, which are the default values. So you can just do:
options {
output=AST;
}
Also, you don't have to specify the root node for each operator separately. So you don't have to do:
(ADD^ | SUB^)
but the following:
(ADD | SUB)^
will suffice. With only two operators, there's not much difference, but when implementing relational operators (>=, <=, > and <), the latter is a bit easier.
Now, for you AST: you'll probably want to create a binary tree: that way, all internal nodes are operators, and the leafs will be operands which makes the actual evaluating of your expressions much easier. To get a binary tree, you'll have to change your atom rule slightly:
atom
: INTEGER
| (
OPARN expression CPARN -> expression
)
(
OPARN e=expression CPARN -> ^(MUL $atom $e)
)*
;
which produces the following AST given the input "(2-3)(4+5)(6*7)":
(image produced by: graphviz-dev.appspot.com)
The DOT file was generated with the following test-class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
GLexer lexer = new GLexer(new ANTLRStringStream("(2-3)(4+5)(6*7)"));
GParser parser = new GParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.start().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}

ANTLR : How to replace all characters defined as space with actual space

My ANTLR code is as follow :
LPARENTHESIS : ('(');
RPARENTHESIS : (')');
fragment CHARACTER : ('a'..'z'|'0'..'9'|);
fragment QUOTE : ('"');
fragment WILDCARD : ('*');
fragment SPACE : (' '|'\n'|'\r'|'\t'|'\u000C'|';'|':'|',');
WILD_STRING
: (CHARACTER)*
(
('?')
(CHARACTER)*
)+
;
PREFIX_STRING
: (CHARACTER)+
(
('*')
)+
;
WS : (SPACE) { $channel=HIDDEN; };
PHRASE : (QUOTE)(LPARENTHESIS)?(WORD)(WILDCARD)?(RPARENTHESIS)?((SPACE)+(LPARENTHESIS)?(WORD)(WILDCARD)?(RPARENTHESIS)?)*(SPACE)+(QUOTE);
WORD : (CHARACTER)+;
What I would like to do is to replace all characters marked as space to be replaced with actual space character in the PHRASE. Also if possible, I would then like all continuous spaces to be represented by a single space.
Any help would be most appreciated. For some reason, I am finding it hard to understand ANTLR. Any good tutorials out there ?
Java
Invoke your lexer's setText(...) method:
grammar T;
parse
: words EOF {System.out.println($words.text);}
;
words
: Word (Spaces Word)*
;
Word
: ('a'..'z'|'A'..'Z')+
;
Spaces
: (' ' | '\t' | '\r' | '\n')+ {setText(" ");}
;
Which can be tested with the class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source = "This is \n just \t\t\t\t\t\t a \n\t\t test";
ANTLRStringStream in = new ANTLRStringStream(source);
TLexer lexer = new TLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
TParser parser = new TParser(tokens);
System.out.println("------------------------------\nSource:\n" + source +
"\n------------------------------\nAfter parsing:");
parser.parse();
}
}
which produces the following output:
------------------------------
Source:
This is
just a
test
------------------------------
After parsing:
This is just a test
Puneet Pawaia wrote:
Any help would be most appreciated. For some reason, I am finding it hard to understand ANTLR. Any good tutorials out there ?
The ANTLR Wiki has loads of informative info, albeit a bit unstructured (but that could just be me!).
The best ANTLR tutorial is the book: The Definitive ANTLR Reference: Building Domain-Specific Languages.
C#
For the C# target, try this:
grammar T;
options {
language=CSharp2;
}
#parser::namespace { Demo }
#lexer::namespace { Demo }
parse
: words EOF {Console.WriteLine($words.text);}
;
words
: Word (Spaces Word)*
;
Word
: ('a'..'z'|'A'..'Z')+
;
Spaces
: (' ' | '\t' | '\r' | '\n')+ {Text = " ";}
;
with the test class:
using System;
using Antlr.Runtime;
namespace Demo
{
class MainClass
{
public static void Main (string[] args)
{
ANTLRStringStream Input = new ANTLRStringStream("This is \n just \t\t\t\t\t\t a \n\t\t test");
TLexer Lexer = new TLexer(Input);
CommonTokenStream Tokens = new CommonTokenStream(Lexer);
TParser Parser = new TParser(Tokens);
Parser.parse();
}
}
}
which also prints This is just a test to the console. I tried to use SetText(...) instead of setText(...) but that didn't work either, and the C# API docs are currently off-line, so I used the trial and error-hack {Text = " ";}. I tested it with the C# 3.1.1 runtime DLL's.
Good luck!