ANTLR: how to parse a region within matching brackets with a lexer - antlr

i want to parse something like this in my lexer:
( begin expression )
where expressions are also surrounded by brackets. it isn't important what is in the expression, i just want to have all what's between the (begin and the matching ) as a token. an example would be:
(begin
(define x (+ 1 2)))
so the text of the token should be (define x (+ 1 2)))
something like
PROGRAM : LPAREN BEGIN .* RPAREN;
does (obviously) not work because as soon as he sees a ")", he thinks the rule is over, but i need the matching bracket for this.
how can i do that?

Inside lexer rules, you can invoke rules recursively. So, that's one way to solve this. Another approach would be to keep track of the number of open- and close parenthesis and let a gated semantic predicate loop as long as your counter is more than zero.
A demo:
T.g
grammar T;
parse
: BeginToken {System.out.println("parsed :: " + $BeginToken.text);} EOF
;
BeginToken
#init{int open = 1;}
: '(' 'begin' ( {open > 0}?=> // keep reapeating `( ... )*` as long as open > 0
( ~('(' | ')') // match anything other than parenthesis
| '(' {open++;} // match a '(' in increase the var `open`
| ')' {open--;} // match a ')' in decrease the var `open`
)
)*
;
Main.java
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String input = "(begin (define x (+ (- 1 3) 2)))";
TLexer lexer = new TLexer(new ANTLRStringStream(input));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.parse();
}
}
java -cp antlr-3.3-complete.jar org.antlr.Tool T.g
javac -cp antlr-3.3-complete.jar *.java
java -cp .:antlr-3.3-complete.jar Main
parsed :: (begin (define x (+ (- 1 3) 2)))
Note that you'll need to beware of string literals inside your source that might include parenthesis:
BeginToken
#init{int open = 1;}
: '(' 'begin' ( {open > 0}?=> // ...
( ~('(' | ')' | '"') // ...
| '(' {open++;} // ...
| ')' {open--;} // ...
| '"' ... // TODO: define a string literal here
)
)*
;
or comments that may contain parenthesis.
The suggestion with the predicate uses some language specific code (Java, in this case). An advantage of calling a lexer rule recursively is that you don't have custom code in your lexer:
BeginToken
: '(' Spaces? 'begin' Spaces? NestedParens Spaces? ')'
;
fragment NestedParens
: '(' ( ~('(' | ')') | NestedParens )* ')'
;
fragment Spaces
: (' ' | '\t')+
;

Related

R Language: Grammar for Raw Strings

I'm trying to create a new rule in the R grammar for Raw Strings.
Quote of the R news:
There is a new syntax for specifying raw character constants similar
to the one used in C++: r"(...)" with ... any character sequence not
containing the sequence )". This makes it easier to write strings that
contain backslashes or both single and double quotes. For more details
see ?Quotes.
Examples:
## A Windows path written as a raw string constant:
r"(c:\Program files\R)"
## More raw strings:
r"{(\1\2)}"
r"(use both "double" and 'single' quotes)"
r"---(\1--)-)---"
But I'm unsure if a grammar file alone is enough to implement the rule.
Until now I tried something like this as a basis from older suggestions of similar grammars:
Parser:
| RAW_STRING_LITERAL #e42
Lexer:
RAW_STRING_LITERAL
: ('R' | 'r') '"' ( '\\' [btnfr"'\\] | ~[\r\n"]|LETTER )* '"' ;
Any hints or suggestions are appreciated.
R ANTLR Grammar:
https://github.com/antlr/grammars-v4/blob/master/r/R.g4
Original R Grammar in Bison:
https://svn.r-project.org/R/trunk/src/main/gram.y
To match start- and end-delimiters, you will have to use target specific code. In Java that could look like this:
#lexer::members {
boolean closeDelimiterAhead() {
// Get the part between `r"` and `(`
String delimiter = getText().substring(2, getText().indexOf('('));
// Construct the end of the raw string
String stopFor = ")" + delimiter + "\"";
for (int n = 1; n <= stopFor.length(); n++) {
if (this._input.LA(n) != stopFor.charAt(n - 1)) {
// No end ahead yet
return false;
}
}
return true;
}
}
RAW_STRING
: [rR] '"' ~[(]* '(' ( {!closeDelimiterAhead()}? . )* ')' ~["]* '"'
;
which tokenizes r"---( )--" )----" )---" as a single RAW_STRING.
EDIT
And since the delimiters can only consist of hyphens (and parenthesis/braces) and not just any arbitrary character, this should do it as well:
RAW_STRING
: [rR] '"' INNER_RAW_STRING '"'
;
fragment INNER_RAW_STRING
: '-' INNER_RAW_STRING '-'
| '(' .*? ')'
| '{' .*? '}'
| '[' .*? ']'
;

ANTLR Template translator match part of grammar

I wrote a grammar for a language and now I want to treat some syntactic sugar constructions, for that I was thinking of writing a template translator.
The problem is I want my template grammar to translate only some constructions of the language and leave the rest as it is.
For example:
I have this as input:
class Main {
int a[10];
}
and I want to translate that into something like:
class Main {
Array a = new Array(10);
}
Ideally I would like to do some think like this in ANTLR
grammer Translator
options { output=template;}
decl
: TYPE ID '[' INT ']' -> template(name = {$ID.text}, size ={$INT.text})
"Array <name> = new Array(<size>);
I would like it to leave the rest of the input that doesn't match rule decl as it is.
How can I achieve this in ANTLR without writing the full grammar for the language ?
I would simply handle such things in the parser grammar.
Assuming you're constructing an AST in your parser grammar, I guess you'll have a rule to parse input like Array a = new Array(10); similar to:
decl
: TYPE ID '=' expr ';' -> ^(DECL TYPE ID expr)
;
where expr eventually matches a term like this:
term
: NUMBER
| 'new' ID '(' (expr (',' expr)*)? ')' -> ^('new' ID expr*)
| ...
;
To account for your short-hand declaration int a[10];, all you have to do is expand decl like this:
decl
: TYPE ID '=' expr ';' -> ^(DECL TYPE ID expr)
| TYPE ID '[' expr ']' ';' -> ^(DECL 'Array' ID ^(NEW ARRAY expr))
;
which will rewrite the input int a[10]; into the following AST:
which is exactly the same as the AST created for input Array a = new Array(10);.
EDIT
Here's a small working demo:
grammar T;
options {
output=AST;
}
tokens {
ROOT;
DECL;
NEW='new';
INT='int';
ARRAY='Array';
}
parse
: decl+ EOF -> ^(ROOT decl+)
;
decl
: type ID '=' expr ';' -> ^(DECL type ID expr)
| type ID '[' expr ']' ';' -> ^(DECL ARRAY ID ^(NEW ARRAY expr))
;
expr
: Number
| NEW type '(' (expr (',' expr)*)? ')' -> ^(NEW ID expr*)
;
type
: INT
| ARRAY
| ID
;
ID : ('a'..'z' | 'A'..'Z')+;
Number : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n') {skip();};
which can be tested with the class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "Array a = new Array(10); int a[10];";
TLexer lexer = new TLexer(new ANTLRStringStream(src));
TParser parser = new TParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}

How to find the length of a token in antlr?

I am trying to create a grammar which accepts any character or number or just about anything, provided its length is equal to 1.
Is there a function to check the length?
EDIT
Let me make my question more clear with an example.
I wrote the following code:
grammar first;
tokens {
SET = 'set';
VAL = 'val';
UND = 'und';
CON = 'con';
ON = 'on';
OFF = 'off';
}
#parser::members {
private boolean inbounds(Token t, int min, int max) {
int n = Integer.parseInt(t.getText());
return n >= min && n <= max;
}
}
parse : SET expr;
expr : VAL('u'('e')?)? String |
UND('e'('r'('l'('i'('n'('e')?)?)?)?)?)? (ON | OFF) |
CON('n'('e'('c'('t')?)?)?)? oneChar
;
CHAR : 'a'..'z';
DIGIT : '0'..'9';
String : (CHAR | DIGIT)+;
dot : .;
oneChar : dot { $dot.text.length() == 1;} ;
Space : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
I want my grammar to do the following things:
Accept commands like: 'set value abc' , 'set underli on' , 'set conn #'. The grammar should be intelligent enough to accept incomplete words like 'underl' instead of 'underline. etc etc.
The third syntax: 'set connect oneChar' should accept any character, but just one character. It can be a numeric digit or alphabet or any special character. I am getting a compiler error in the generated parser file because of this.
The first syntax: 'set value' should accept all the possible strings, even on and off. But when I give something like: 'set value offer', the grammar is failing. I think this is happening because I already have a token 'OFF'.
In my grammar all the three requirements I have listed above are not working fine. Don't know why.
There are some mistakes and/or bad practices in your grammar:
#1
The following is not a validating predicate:
{$dot.text.length() == 1;}
A proper validating predicate in ANTLR has a question mark at the end, and the inner code has no semi colon at the end. So it should be:
{$dot.text.length() == 1}?
instead.
#2
You should not be handling these alternative commands:
expr
: VAL('u'('e')?)? String
| UND('e'('r'('l'('i'('n'('e')?)?)?)?)?)? (ON | OFF)
| CON('n'('e'('c'('t')?)?)?)? oneChar
;
in a parser rule. You should let the lexer handle this instead. Something like this will do it:
expr
: VAL String
| UND (ON | OFF)
| CON oneChar
;
// ...
VAL : 'val' ('u' ('e')?)?;
UND : 'und' ( 'e' ( 'r' ( 'l' ( 'i' ( 'n' ( 'e' )?)?)?)?)?)?;
CON : 'con' ( 'n' ( 'e' ( 'c' ( 't' )?)?)?)?;
(also see #5!)
#3
Your lexer rules:
CHAR : 'a'..'z';
DIGIT : '0'..'9';
String : (CHAR | DIGIT)+;
are making things complicated for you. The lexer can produce three different kind of tokens because of this: CHAR, DIGIT or String. Ideally, you should only create String tokens since a String can already be a single CHAR or DIGIT. You can do that by adding the fragment keyword before these rules:
fragment CHAR : 'a'..'z' | 'A'..'Z';
fragment DIGIT : '0'..'9';
String : (CHAR | DIGIT)+;
There will now be no CHAR and DIGIT tokens in your token stream, only String tokens. In short: fragment rules are only used inside lexer rules, by other lexer rules. They will never be tokens of their own (and can therefor never appear in any parser rule!).
#4
The rule:
dot : .;
does not do what you think it does. It matches "any token", not "any character". Inside a lexer rule, the . matches any character but in parser rules, it matches any token. Realize that parser rules can only make use of the tokens created by the lexer.
The input source is first tokenized based on the lexer-rules. After that has been done, the parser (though its parser rules) can then operate on these tokens (not characters!!!). Make sure you understand this! (if not, ask for clarification or grab a book about ANTLR)
- an example -
Take the following grammar:
p : . ;
A : 'a' | 'A';
B : 'b' | 'B';
The parser rule p will now match any token that the lexer produces: which is only a A- or B-token. So, p can only match one of the characters 'a', 'A', 'b' or 'B', nothing else.
And in the following grammar:
prs : . ;
FOO : 'a';
BAR : . ;
the lexer rule BAR matches any single character in the range \u0000 .. \uFFFF, but it can never match the character 'a' since the lexer rule FOO is defined before the BAR rule and captures this 'a' already. And the parser rule prs again matches any token, which is either FOO or BAR.
#5
Putting single characters like 'u' inside your parser rules, will cause the lexer to tokenize an u as a separate token: you don't want that. Also, by putting them in parser rules, it is unclear which token has precedence over other tokens. You should keep all such literals outside your parser rules and make them explicit lexer rules instead. Only use lexer rules in your parser rules.
So, don't do:
pRule : 'u' ':' String
String : ...
but do:
pRule : U ':' String
U : 'u';
String : ...
You could make ':' a lexer rule, but that is of less importance. The 'u' however can also be a String so it must appear as a lexer rule before the String rule.
Okay, those were the most obvious things that come to mind. Based on them, here's a proposed grammar:
grammar first;
parse
: (SET expr {System.out.println("expr = " + $expr.text);} )+ EOF
;
expr
: VAL String {System.out.print("A :: ");}
| UL (ON | OFF) {System.out.print("B :: ");}
| CON oneChar {System.out.print("C :: ");}
;
oneChar
: String {$String.text.length() == 1}?
;
SET : 'set';
VAL : 'val' ('u' ('e')?)?;
UL : 'und' ( 'e' ( 'r' ( 'l' ( 'i' ( 'n' ( 'e' )?)?)?)?)?)?;
CON : 'con' ( 'n' ( 'e' ( 'c' ( 't' )?)?)?)?;
ON : 'on';
OFF : 'off';
String : (CHAR | DIGIT)+;
fragment CHAR : 'a'..'z' | 'A'..'Z';
fragment DIGIT : '0'..'9';
Space : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
that can be tested with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"set value abc \n" +
"set underli on \n" +
"set conn x \n" +
"set conn xy ";
ANTLRStringStream in = new ANTLRStringStream(source);
firstLexer lexer = new firstLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
firstParser parser = new firstParser(tokens);
System.out.println("parsing:\n======\n" + source + "\n======");
parser.parse();
}
}
which, after generating the lexer and parser:
java -cp antlr-3.2.jar org.antlr.Tool first.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar Main
prints the following output:
parsing:
======
set value abc
set underli on
set conn x
set conn xy
======
A :: expr = value abc
B :: expr = underli on
C :: expr = conn x
line 0:-1 rule oneChar failed predicate: {$String.text.length() == 1}?
C :: expr = conn xy
As you can see, the last command, C :: expr = conn xy, produces an error, as expected.

ANTLR expression interpreter

I have created the following grammar: I would like some idea how to build an interpreter that returns a tree in java, which I can later use for printing in the screen, Im bit stack on how to start on it.
grammar myDSL;
options {
language = Java;
}
#header {
package DSL;
}
#lexer::header {
package DSL;
}
program
: IDENT '={' components* '}'
;
components
: IDENT '=('(shape)(shape|connectors)* ')'
;
shape
: 'Box' '(' (INTEGER ','?)* ')'
| 'Cylinder' '(' (INTEGER ','?)* ')'
| 'Sphere' '(' (INTEGER ','?)* ')'
;
connectors
: type '(' (INTEGER ','?)* ')'
;
type
: 'MG'
| 'EL'
;
IDENT: ('a'..'z' | 'A'..'Z')('a'..'z' | 'A'..'Z' | '0'..'0')*;
INTEGER: '0'..'9'+;
// This if for the empty spaces between tokens and avoids them in the parser
WS: (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel=HIDDEN;};
COMMENT: '//' .* ('\n' | '\r') {$channel=HIDDEN;};
A couple of remarks:
There's no need to set the language for Java, which is the default target language. So you can remove this:
options {
language = Java;
}
Your IDENT contains an error:
IDENT: ('a'..'z' | 'A'..'Z')('a'..'z' | 'A'..'Z' | '0'..'0')*;
the '0'..'0') should most probably be '0'..'9').
The sub rule (INTEGER ','?)* also matches source like 1 2 3 4 (no comma's at all!). Perhaps you meant to do: (INTEGER (',' INTEGER)*)?
Now, as to your question: how to let ANTLR construct a proper AST? This can be done by adding output = AST; in your options block:
options {
//language = Java;
output = AST;
}
And then either adding the "tree operators" ^ and ! in your parser rules, or by using tree rewrite rules: rule: a b c -> ^(c b a).
The "tree operator" ^ is used to define the root of the (sub) tree and ! is used to exclude a token from the (sub) tree.
Rewrite rules have ^( /* tokens here */ ) where the first token (right after ^() is the root of the (sub) tree, and all following tokens are child nodes of the root.
An example might be in order. Let's take your first rule:
program
: IDENT '={' components* '}'
;
and you want to let IDENT be the root, components* the children and you want to exclude ={ and } from the tree. You can do that by doing:
program
: IDENT^ '={'! components* '}'!
;
or by doing:
program
: IDENT '={' components* '}' -> ^(IDENT components*)
;

eliminate extra spaces in a given ANTLR grammar

In any grammar I create in ANTLR, is it possible to parse the grammar and the result of the parsing can eleminate any extra spaces in the grammar. f.e
simple example ;
int x=5;
if I write
int x = 5 ;
I would like that the text changes to the int x=5 without the extra spaces. Can the parser return the original text without extra spaces?
Can the parser return the original text without extra spaces?
Yes, you need to define a lexer rule that captures these spaces and then skip() them:
Space
: (' ' | '\t') {skip();}
;
which will cause spaces and tabs to be ignored.
PS. I'm assuming you're using Java as the target language. The skip() can be different in other targets (Skip() for C#, for example). You may also want to include \r and \n chars in this rule.
EDIT
Let's say your language only consists of a couple of variable declarations. Assuming you know the basics of ANTLR, the following grammar should be easy to understand:
grammar T;
parse
: stat* EOF
;
stat
: Type Identifier '=' Int ';'
;
Type
: 'int'
| 'double'
| 'boolean'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
Int
: '0'..'9'+
;
Space
: (' ' | '\t' | '\n' | 'r')+ {skip();}
;
And you're parsing the source:
int x = 5 ; double y =5;boolean z = 0 ;
which you'd like to change into:
int x=5;
double y=5;
boolean z=0;
Here's a way to embed code in your grammar and let the parser rules return custom objects (Strings, in this case):
grammar T;
parse returns [String str]
#init{StringBuilder buffer = new StringBuilder();}
#after{$str = buffer.toString();}
: (stat {buffer.append($stat.str).append('\n');})* EOF
;
stat returns [String str]
: Type Identifier '=' Int ';'
{$str = $Type.text + " " + $Identifier.text + "=" + $Int.text + ";";}
;
Type
: 'int'
| 'double'
| 'boolean'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
Int
: '0'..'9'+
;
Space
: (' ' | '\t' | '\n' | 'r')+ {skip();}
;
Test it with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source = "int x = 5 ; double y =5;boolean z = 0 ;";
ANTLRStringStream in = new ANTLRStringStream(source);
TLexer lexer = new TLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
TParser parser = new TParser(tokens);
System.out.println("Result:\n"+parser.parse());
}
}
which produces:
Result:
int x=5;
double y=5;
boolean z=0;