Suppose grammar G has two productions …
S → λ
S → aSb
In Raku, how would one create this grammar programmatically (i.e., dynamically, at runtime)?
The goal is to have a Raku program create — at runtime — a Raku grammar that might be written statically as …
grammar Parser
{
token TOP { <S> }
token S { '' | 'a' <S> 'b' }
}
Given the first answer to my question, I am trying to dynamically create what would be statically written as …
grammar Parser
{
token TOP { <S> }
} # end grammar Parser
I have tried …
constant Parser := Metamodel::GrammarHOW.new_type( name => 'Parser' ) ;
Parser.^add_method('TOP', my method TOP(Parser:) { <S> }) ;
Parser.^compose; # }
say Parser.HOW.^name ;
say Parser.^methods(:local) ;
However, the reply is …
Perl6::Metamodel::GrammarHOW
(TOP)
… rather than the hoped-for …
Perl6::Metamodel::GrammarHOW
(token TOP { <S> } BUILDALL)
How should add_method be invoked to add the TOP token (and later, other tokens such as the S token)?
After more work, I believe that I may have a solution …
constant Parser := Metamodel::GrammarHOW.new_type( name => 'Parser' ) ;
Parser.^add_method( 'TOP', my token TOP { <S> } ) ;
Parser.^add_method( 'S', my token S { '' | 'a' <S> 'b' } ) ;
Parser.^compose ;
say Parser.HOW.^name ;
say Parser.^methods( :local ) ;
say Parser.parse: 'aabb' ;
Output is …
Perl6::Metamodel::GrammarHOW
(token TOP { <S> } token S { '' | 'a' <S> 'b' })
「aabb」
S => 「aabb」
S => 「ab」
S => 「」
I had coded a static version of Parser, and for that static Parser the corresponding output had been …
(token TOP { <S> } token S { '' | 'a' 'a' <S> 'b' 'b' } BUILDALL)
I am not sure what to make of the fact that BUILDALL is missing from my dynamically created Parser. I do not understand BUILDALL and did not find much about it when searching online.
Using the metamodel, of course. Just as there is a HOW (Higher Order Workings) for every basic type, there's a GrammarHOW for grammars. Unfortunately, there's not a whole lot of information on the metamodel; there's this article by Masak which mentions the GrammarHOW, and that's about it. However, looking at the code, a grammar is essentially a class; you'll probably be fine if you follow the ClassHOW examples and make methods be tokens and classes be grammars.
Metaprogramming, in general, is a subject that has not been extensively covered so far, which is a pity.
Related
I'm playing around with ANTLR, designing a toy language (which I think is where most people start!). I have a question on how best to think about switching on token type.
Consider a 'function call' in the language, where a function can consume a string, a number or a variable, for example like the below (project() is the function call):
project("ABC") vs project(123) vs project($SOME_VARIABLE)
I have the alternation operator in my grammar, so the grammar parses the right thing, but in the visitor code it would be nice to tell the difference between the three versions of the above.
@Override
public ASTRoot visitCreateproj(projectmgmtParser.CreateprojContext ctx) {
    String s1 = null, s2 = null;
    try {
        s1 = ctx.STRING_LITERAL().getText();
    } catch (Exception e) {}
    try {
        s2 = ctx.NUM().getText();
    } catch (Exception e) {}
    System.out.println("Created Project via => " + ctx.getChild(1).toString());
    return null;
}
The code above works: depending on whether s1 or s2 is null, I can infer how I was called (with a literal or a number; I haven't shown the variable case above). But I'm interested in whether there is a better or more elegant way, for example switching on the token type inside the visitor code to actually process the language.
The grammar I had for the above was
createproj: 'project('WS?(STRING_LITERAL|NUM)')';
and when I use the IntelliJ ANTLR plugin, it seems to know the token type of the argument to the project() function, but I don't seem to be able to get at it from my code.
You could do something like this:
createproj
: 'project' '(' WS? param ')'
;
param
: STRING_LITERAL
| NUM
;
and in your visitor code:
@Override
public ASTRoot visitCreateproj(projectmgmtParser.CreateprojContext ctx) {
switch(ctx.param().start.getType()) {
case YourLexerName.STRING_LITERAL:
...
case YourLexerName.NUM:
...
...
}
}
So by inlining the token in the grammar I had originally, I've lost the opportunity to inspect it in the visitor code?
Not really: you could also do it like this:
createproj
: 'project' '(' WS? param_token=(STRING_LITERAL | NUM) ')'
;
and could then do this:
@Override
public ASTRoot visitCreateproj(projectmgmtParser.CreateprojContext ctx) {
switch(ctx.param_token.getType()) {
case YourLexerName.STRING_LITERAL:
...
case YourLexerName.NUM:
...
...
}
}
Just make sure you don't mix lexer rules (tokens) and parser rules in your set param_token=( ... ). When it's a parser rule, ctx.param_token.getType() will fail (it must then be ctx.param_token.start.getType()). That is why I recommended adding an extra parser rule, because this would then still work:
param
: STRING_LITERAL
| NUM
| some_parser_rule
;
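To illustrate, the visitor does not need to change for that case, because start is simply the first token the param rule consumed, whichever alternative was taken. A sketch (YourLexerName stands for the generated lexer class, as in the earlier snippet, and the ASTRoot construction is left out):
@Override
public ASTRoot visitCreateproj(projectmgmtParser.CreateprojContext ctx) {
    // `start` is the first token matched by `param`, so this works for the
    // token alternatives as well as for `some_parser_rule`.
    int tokenType = ctx.param().start.getType();
    switch (tokenType) {
        case YourLexerName.STRING_LITERAL:
            // the argument was a string literal
            break;
        case YourLexerName.NUM:
            // the argument was a number
            break;
        default:
            // the argument matched some_parser_rule (e.g. a variable reference)
            break;
    }
    return null; // build and return your ASTRoot here
}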
The following is a simplified version of my actual grammar:
grammar org.hello.World
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
generate world "http://www.hello.org/World"
Model:
content=AnyContent greetings+=Greeting*;
AnyContent:
(ID | ANY_OTHER)*
;
Greeting:
'<hello>' name=ID '</hello>';
terminal ID:
('a'..'z'|'A'..'Z')+
;
terminal ANY_OTHER:
.
;
So, using the above grammar, if my input is like:
<hi><hello>world</hello>
Then I get a syntax error saying mismatched character 'i' expecting 'e' at column 2.
My requirement is that AnyContent should match "<hi>". Can anyone guide me on how to achieve that?
If you want to do it with Xtext, I advise you to split your problem. Your first problem is syntactic: you need to parse your file. The second problem is semantic: you want to give a "meaning" to your objects and tell which one is the container. Defining the container and the containment for XML can't be done inside your grammar.
Make a custom Ecore model and a simple grammar with start and end tags. You don't really care about the names of your tags.
Example:
Model returns XmlFile: (StartTag|EndTag|Text)+;
Text returns Text: text=STRING;
StartTag returns StartTag: '<' name=ID '>';
EndTag returns EndTag: '</' name=ID '>';
Change the TokenSource. The token source feeds the tokens to your parser, and you can override the nature of your tokens, merging or splitting them.
The idea here is to merge all the tokens that lie between a ">" and the following "</".
Such a run of tokens represents a Text, so you can create a single token for everything contained between those two delimiters. Example:
class CustomTokenSource extends XtextTokenStream {

    new(TokenSource tokenSource, ITokenDefProvider tokenDefProvider) {
        super(tokenSource, tokenDefProvider)
    }

    override LT(int k) {
        var Token token = super.LT(k)
        // hook: rewrite the token before the parser sees it
        if (token != null && token.text != null) token.tokenOverride(k);
        token
    }
}
In this example you need to add your custom code in the "tokenOverride" method.
Add your custom token source to your parser:
class XDSLParser extends DSLParser{
override protected XtextTokenStream createTokenStream(TokenSource tokenSource) {
return new CustomTokenSource(tokenSource, getTokenDefProvider());
}
}
Compute the containment: the containment of your elements can be computed after parsing. Once parsing is done, you can get your model and change it as you wish. To do that, override the "doParse" method of your parser "XDSLParser" as follows:
override protected IParseResult doParse(String ruleName, CharStream in, NodeModelBuilder nodeModelBuilder, int initialLookAhead) {
    var IParseResult result = super.doParse(ruleName, in, nodeModelBuilder, initialLookAhead)
    // result.rootASTElement gives you the (flat) model to post-process here
    result.rootASTElement;
    return result
}
Note: the model that you obtain after parsing will be flat. The XmlFile object will contain all the elements in the right order. You need to write an algorithm to build the containment in your AST model, as in the sketch below.
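A hedged sketch of that post-parse pass, assuming the grammar assigns the parsed elements to a list feature on XmlFile (e.g. elements+=(StartTag|EndTag|Text)+) so that getElements() returns them in document order; StartTag, EndTag, Text and XmlFile are the types from the example grammar above:
import java.util.ArrayDeque;
import java.util.Deque;

public class ContainmentBuilder {
    public static void build(XmlFile file) {
        Deque<StartTag> openTags = new ArrayDeque<>();
        for (Object element : file.getElements()) {
            if (element instanceof StartTag) {
                // a new nesting level opens
                openTags.push((StartTag) element);
            } else if (element instanceof EndTag) {
                // the current level is complete: move the popped StartTag (and
                // everything attached to it) into the containment reference of
                // openTags.peek(), or leave it at the root if the stack is empty
                StartTag closed = openTags.pop();
            } else {
                // Text: attach it to openTags.peek(), or to the root if no tag is open
            }
        }
        // mismatched or unclosed tags are not handled in this sketch
    }
}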
This will require a lot of tweaking in the grammar due to the nature of the ANTLR lexer that Xtext uses. The lexer will not roll back for the keyword <hello>: as soon as it sees a < followed by an h, it will try to consume the hello token. Something along these lines could work though:
Model:
content=AnyContent greetings+=Greeting*;
AnyContent:
(ID | ANY_OTHER | '<' (ID | ANY_OTHER | '/' | '>') | '/' | '>' | 'hello')*
;
Greeting:
'<' 'hello' '>' name=ID '<' '/' 'hello' '>';
terminal ID:
('a'..'z'|'A'..'Z')+
;
terminal ANY_OTHER:
.
;
The approach won't scale for real-world grammars, but maybe it helps to get you onto a working track.
I have an ANTLR4 listener which handles a standard and well-formed grammar; however, I am struggling with how to deal with the non-standard implementations. Although all of the variants go through the lexer without problems, the parse stage is a lot trickier.
A traditional way of doing this would be something like
// Header of document
variant = STANDARD;
if (header.indexOf("microsoft") != -1) {
variant = MICROSOFT;
} else if (header.indexOf("google") != -1) {
variant = GOOGLE;
}
...
// Parsing a particular element
if (variant.equals(MICROSOFT)) {
// Microsoft-specific stuff
} else if (variant.equals(GOOGLE)) {
// Google-specific stuff
} else {
// Standard stuff
}
but this quickly becomes unmaintainable. The obvious solution is to have a ParseTreeListener for the standard implementation and then subclass it for each variant, but I don't know which variant it is until I've started the parse.
So how can I either switch from one listener to another part-way through the parse, or restart the parse with a new listener once I know which variant I'm dealing with?
If these variants occur frequently, you might want to consider embedding custom code to handle context sensitive parsing by using predicates (the {...}? construct in the following pseudo grammar):
rule
: { boolean-expression-a }? a-alternative
| { boolean-expression-b }? b-alternative
| /* fall through */ not-a-or-b-alternative
;
Let's say you want to parse a file containing chunks. A chunk consists of a header and a data row. In the header you can set your variant. The data of a normal variant contains 3 NUMBERs, Google's variant contains 2 NUMBERs and Microsoft's variant contains a single NUMBER. An example of such a file would look like this:
header: none
data: 1 2 3
header: google
data: 4 5
header: microsoft
data: 6
And here's a demo of a context sensitive ANTLR v4 grammar able to parse this:
grammar T;
@parser::members {
enum Variant {
GOOGLE,
MICROSOFT,
OTHER;
public static Variant tryValueOf(String name) {
try {
return Variant.valueOf(name.toUpperCase());
}
catch(Exception e) {
return OTHER;
}
}
}
private Variant variant = Variant.OTHER;
}
parse
: chunk+ EOF
;
chunk
: header data
;
header
: K_HEADER COLON NAME {variant = Variant.tryValueOf($NAME.text);}
;
data
: {variant == Variant.MICROSOFT}? K_DATA COLON NUMBER #MicrosoftData
| {variant == Variant.GOOGLE}? K_DATA COLON NUMBER NUMBER #GoogleData
| K_DATA COLON NUMBER NUMBER NUMBER #OtherData
;
K_DATA : 'data';
K_HEADER : 'header';
NAME : [a-zA-Z]+;
NUMBER : [0-9]+;
COLON : ':';
SPACE : [ \t\r\n] -> skip;
Resulting in a parse in which each data row is matched by the alternative selected for the current variant.
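As for the other option raised in the question above (picking a different listener once the variant is known), a minimal ANTLR v4 sketch would be to parse once, inspect the input, and then walk the already-built tree with the matching listener subclass. The DocLexer/DocParser classes, the document rule and the listener class names here are hypothetical:
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

public final class VariantDriver {
    public static void process(String input) {
        DocLexer lexer = new DocLexer(CharStreams.fromString(input));
        DocParser parser = new DocParser(new CommonTokenStream(lexer));
        ParseTree tree = parser.document();     // parse once, keep the tree

        // Cheap variant detection, e.g. on the raw header text.
        StandardListener listener;
        if (input.contains("microsoft")) {
            listener = new MicrosoftListener(); // subclass overriding the special cases
        } else if (input.contains("google")) {
            listener = new GoogleListener();
        } else {
            listener = new StandardListener();
        }

        // Replay the already-built tree with the chosen listener.
        ParseTreeWalker.DEFAULT.walk(listener, tree);
    }
}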
I have an ANTLR grammar with subtrees like this:
^(type ID)
that I want to convert to:
^(type DUMMY ID)
where type is 'a'|'b'.
Note: what I really want to do is convert anonymous instantiations to explicit by generating dummy names.
I've narrowed it down to the grammars below, but I'm getting this:
(a bar) (b bar)
got td
got bu
Exception in thread "main" org.antlr.runtime.tree.RewriteEmptyStreamException: rule type
at org.antlr.runtime.tree.RewriteRuleElementStream._next(RewriteRuleElementStream.java:157)
at org.antlr.runtime.tree.RewriteRuleSubtreeStream.nextNode(RewriteRuleSubtreeStream.java:77)
at Pattern.bu(Pattern.java:382)
The error message continues. My debugging so far:
The input made it through the initial grammar, generating two trees: a bar and b bar.
The second grammar does match the trees; it prints td and bu.
The rewrite crashes, but I have no idea why. What does RewriteEmptyStreamException mean?
What's the proper way to do this kind of rewrite?
My main grammar Rewrite.g:
grammar Rewrite;
options {
output=AST;
}
@members {
public static void main(String[] args) throws Exception {
RewriteLexer lexer = new RewriteLexer(new ANTLRStringStream("a foo\nb bar"));
RewriteParser parser = new RewriteParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.test().getTree();
System.out.println(tree.toStringTree());
CommonTreeNodeStream nodes = new CommonTreeNodeStream(tree);
Pattern p = new Pattern(nodes);
CommonTree newtree = (CommonTree) p.downup(tree);
}
}
type
: 'a'
| 'b'
;
test : id+;
id : type ID -> ^(type ID["bar"]);
DUMMY : 'dummy';
ID : ('a'..'z')+;
WS : (' '|'\n'|'\r')+ {$channel=HIDDEN;};
and Pattern.g
tree grammar Pattern;
options {
tokenVocab = Rewrite;
ASTLabelType=CommonTree;
output=AST;
filter=true; // tree pattern matching mode
}
topdown
: td
;
bottomup
: bu
;
type
: 'a'
| 'b'
;
td
: ^(type ID) { System.out.println("got td"); }
;
bu
: ^(type ID) { System.out.println("got bu"); }
-> ^(type DUMMY ID)
;
To compile and run:
java -cp ../jar/antlr-3.4-complete-no-antlrv2.jar org.antlr.Tool Rewrite.g
java -cp ../jar/antlr-3.4-complete-no-antlrv2.jar org.antlr.Tool Pattern.g
javac -cp ../jar/antlr-3.4-complete-no-antlrv2.jar *.java
java -classpath .:../jar/antlr-3.4-complete-no-antlrv2.jar RewriteParser
EDIT 1: I have also tried using antlr4 and I get the same crash.
There are two small problems to address to get the rewrite to work, one problem in Rewrite and the other in Pattern.
The Rewrite grammar produces ^(type ID) as root elements in the output AST, as shown in the output (a bar) (b bar). A root element can't be transformed because transforming is actually a form of child-swapping: the element's parent drops the element and replaces it with the new, "transformed" version. Without a parent, you'll get the error Can't set single child to a list. Adding the root is a matter of creating an imaginary token ROOT or whatever name you like and referencing it in your entry-level rule's AST generation like so: test : id+ -> ^(ROOT id+);.
The Pattern grammar, the one producing the error you're getting, is confused by the type rule: type : 'a' | 'b' ; as part of the rewrite. I don't know the low-level details here, but apparently a tree parser doesn't maintain the state of a visited root rule like type in ^(type ID) when writing a transform (or maybe it can't or shouldn't, or maybe it's some other limitation). The easiest way to address this is with the following two changes:
Let text "a" and "b" match rule ID in the lexer by changing rule type in Rewrite from type: 'a' | 'b'; to just type: ID;.
Let rule bu in Pattern match against ^(ID ID) and transform to ^(ID DUMMY ID).
Now with a couple of minor debugging changes to Rewrite's main, input "a foo\nb bar" produces the following output:
(ROOT (a foo) (b bar))
got td
got bu
(a foo) -> (a DUMMY foo)
got td
got bu
(b bar) -> (b DUMMY bar)
(ROOT (a DUMMY foo) (b DUMMY bar))
Here are the files as I've changed them:
Rewrite.g
grammar Rewrite;
options {
output=AST;
}
tokens {
ROOT;
}
@members {
public static void main(String[] args) throws Exception {
RewriteLexer lexer = new RewriteLexer(new ANTLRStringStream("a foo\nb bar"));
RewriteParser parser = new RewriteParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.test().getTree();
System.out.println(tree.toStringTree());
CommonTreeNodeStream nodes = new CommonTreeNodeStream(tree);
Pattern p = new Pattern(nodes);
CommonTree newtree = (CommonTree) p.downup(tree, true); //print the transitions to help debugging
System.out.println(newtree.toStringTree()); //print the final result
}
}
type : ID;
test : id+ -> ^(ROOT id+);
id : type ID -> ^(type ID);
DUMMY : 'dummy';
ID : ('a'..'z')+;
WS : (' '|'\n'|'\r')+ {$channel=HIDDEN;};
Pattern.g
tree grammar Pattern;
options {
tokenVocab = Rewrite;
ASTLabelType=CommonTree;
output=AST;
filter=true; // tree pattern matching mode
}
topdown
: td
;
bottomup
: bu
;
td
: ^(ID ID) { System.out.println("got td"); }
;
bu
: ^(ID ID) { System.out.println("got bu"); }
-> ^(ID DUMMY ID)
;
I don't have much experience with tree-patterns, with or without rewrites. But when using rewrite rules in them, I believe your options should also include rewrite=true;. The Definitive ANTLR Reference doesn't handle them, so I'm not entirely sure (have a look at the ANTLR wiki for more info).
However, for such (relatively) simple rewrites, you don't really need a separate grammar. You could make DUMMY an imaginary token and inject it in some other parser rule, like this:
grammar T;
options {
output=AST;
}
tokens {
DUMMY;
}
test : id+;
id : type ID -> ^(type DUMMY["dummy"] ID);
type
: 'a'
| 'b'
;
ID : ('a'..'z')+;
WS : (' '|'\n'|'\r')+ {$channel=HIDDEN;};
which would parse the input:
a bar
b foo
into an AST in which each type node ('a' or 'b') has two children: the injected DUMMY node and the ID.
Note that if your lexer is also meant to tokenize the input "dummy" as a DUMMY token, change the tokens { ... } block into this:
tokens {
DUMMY='dummy';
}
and you'd still be able to inject a DUMMY in other rules.
I have a grammar which parses dot notation expressions like this:
a.b.c
memberExpression returns [Expression value]
: i=ID { $value = ParameterExpression($i.value); }
('.' m=memberExpression { $value = MemberExpression($m.value, $i.value); }
)*
;
This parses expressions fine and gives me a tree structure like this:
MemberExpression(
MemberExpression(
ParameterExpression("c"),
"b"
)
, "a"
)
But my problem is that I want a tree that looks like this:
MemberExpression(
MemberExpression(
ParameterExpression("a"),
"b"
)
, "c"
)
for the same expression "a.b.c"
How can I achieve this?
You could do this by collecting all the tokens in a java.util.List using ANTLR's convenience += operator and then creating the desired tree with a custom method in your @parser::members section:
// grammar def ...
// options ...
@parser::members {
private Expression customTree(List tks) {
// `tks` is a java.util.List containing `CommonToken` objects
}
}
// parser ...
memberExpression returns [Expression value]
: ids+=ID ('.' ids+=ID)* { $value = customTree($ids); }
;
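For completeness, a body for customTree along the lines of the question's own classes might look like this (a sketch: ParameterExpression and MemberExpression are the question's types, and the elements of the list are org.antlr.runtime.Token objects, as noted in the comment above):
private Expression customTree(List tks) {
    // `tks` holds the ID tokens in source order, e.g. a, b, c
    Expression expr = new ParameterExpression(((Token) tks.get(0)).getText());
    for (int i = 1; i < tks.size(); i++) {
        // left-fold so that the first ID ends up deepest in the tree:
        // MemberExpression(MemberExpression(ParameterExpression("a"), "b"), "c")
        expr = new MemberExpression(expr, ((Token) tks.get(i)).getText());
    }
    return expr;
}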
I think what you are asking for is left recursive, and therefore ANTLR is not a good choice to parse it.
To elaborate, you need c at the root of the tree, and therefore your rule would be:
rule: rule ID;
This rule will be uncertain whether it should match
a.b
or
a.b.c