Replace token in ANTLR

I want to replace a token using ANTLR.
I tried with TokenRewriteStream and replace, but it didn't work.
Any suggestions?
ANTLRStringStream in = new ANTLRStringStream(source);
MyLexer lexer = new MyLexer(in);
TokenRewriteStream tokens = new TokenRewriteStream(lexer);
for(Object obj : tokens.getTokens()) {
CommonToken token = (CommonToken)obj;
tokens.replace(token, "replacement");
}
The lexer finds all occurrences of single-line comments, and I want to replace them in the original source too.
EDIT:
This is the grammar:
grammar ANTLRTest;
options {
language = Java;
}
@header {
  package main;
}
@lexer::header {
  package main;
}
rule: SingleLineComment+;
SingleLineComment
: '//' ~( '\r' | '\n' )* {setText("replacement");}
;
What I want to do is replace all single-line comments in a file, for example.

Rewrite the text inside the lexer:
SingleLineComment
: '//' ~('\r' | '\n')* {setText("replacement");}
;
EDIT
Okay, here's a quick demo of how you can filter certain tokens from a language:
SingleCommentStrip.g
grammar SingleCommentStrip;
parse returns [String str]
@init{StringBuilder builder = new StringBuilder();}
: (t=. {builder.append($t.text);})* EOF {$str = builder.toString();}
;
SingleLineComment
: '//' ~('\r' | '\n')* {skip();}
;
MultiLineComment
: '/*' .* '*/'
;
StringLiteral
: '"' ('\\' . | ~('"' | '\\' | '\r' | '\n'))* '"'
;
AnyOtherChar
: .
;
Main.java
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
SingleCommentStripLexer lexer = new SingleCommentStripLexer(new ANTLRFileStream("Test.java"));
SingleCommentStripParser parser = new SingleCommentStripParser(new CommonTokenStream(lexer));
String adjusted = parser.parse();
System.out.println(adjusted);
}
}
Test.java
// COMMENT
class Test {
/*
// don't remove
*/
// COMMENT AS WELL
String s = "/* don't // remove */ \" \\ me */ as well";
}
Now run the demo:
java -cp antlr-3.3.jar org.antlr.Tool SingleCommentStrip.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
which will print:
class Test {
/*
// don't remove
*/
String s = "/* don't // remove */ \" \\ me */ as well";
}
(i.e. the single line comments are removed)
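For readers who want to follow the rule order without generating a parser, here is a minimal hand-written sketch in plain Java. It is not ANTLR code and not what the generated lexer looks like; the class name SingleLineCommentStripper and the method strip are made up for illustration. It mirrors the four lexer rules of the demo: skip // comments, and copy block comments, string literals and any other character unchanged. As simplifications, an unterminated block comment or string is copied to the end of the input, and line breaks inside a string are not rejected.

```java
final class SingleLineCommentStripper {

    /** Returns src with every // comment removed; everything else is copied unchanged. */
    static String strip(String src) {
        StringBuilder out = new StringBuilder();
        int n = src.length();
        int i = 0;
        while (i < n) {
            if (src.startsWith("//", i)) {
                // SingleLineComment: skip up to, but not including, the line break
                while (i < n && src.charAt(i) != '\r' && src.charAt(i) != '\n') {
                    i++;
                }
            } else if (src.startsWith("/*", i)) {
                // MultiLineComment: copy through the closing */ (or to the end if unterminated)
                int end = src.indexOf("*/", i + 2);
                int stop = (end < 0) ? n : end + 2;
                out.append(src, i, stop);
                i = stop;
            } else if (src.charAt(i) == '"') {
                // StringLiteral: copy through the closing quote, stepping over backslash escapes
                int j = i + 1;
                while (j < n && src.charAt(j) != '"') {
                    j += (src.charAt(j) == '\\' && j + 1 < n) ? 2 : 1;
                }
                int stop = Math.min(j + 1, n);
                out.append(src, i, stop);
                i = stop;
            } else {
                // AnyOtherChar: copy a single character
                out.append(src.charAt(i++));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String src = "// COMMENT\nint a; /* // keep */ String s = \"// keep\"; // gone\n";
        System.out.print(strip(src));
    }
}
```

Like the grammar, the sketch leaves the line break after a removed comment in place, so a comment-only line becomes an empty line.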

Related

How to write a lexer rule that references a character?

I want to create a lexer rule that can read a string literal that defines its own delimiter (specifically, the Oracle quote-delimited string):
q'!My string which can contain 'single quotes'!'
where the ! serves as the delimiter, but can in theory be any character.
Is it possible to do this via a lexer rule, without introducing a dependency on a given language target?
Is it possible to do this via a lexer rule, without introducing a dependency on a given language target?
No, target dependent code is needed for such a thing.
Just in case you, or someone else reading this Q&A, is wondering how this can be done using target code, here's a quick demo:
lexer grammar TLexer;
@members {
  boolean ahead(String text) {
    for (int i = 0; i < text.length(); i++) {
      if (_input.LA(i + 1) != text.charAt(i)) {
        return false;
      }
    }
    return true;
  }
}
TEXT
: [nN]? ( ['] ( [']['] | ~['] )* [']
| [qQ] ['] QUOTED_TEXT [']
)
;
// Skip everything other than TEXT tokens
OTHER
: . -> skip
;
fragment QUOTED_TEXT
: '[' ( {!ahead("]'")}? . )* ']'
| '{' ( {!ahead("}'")}? . )* '}'
| '<' ( {!ahead(">'")}? . )* '>'
| '(' ( {!ahead(")'")}? . )* ')'
| . ( {!ahead(getText().charAt(0) + "'")}? . )* .
;
which can be tested with the class:
import org.antlr.v4.runtime.*;
public class Main {
static void test(String input) {
TLexer lexer = new TLexer(new ANTLRInputStream(input));
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
tokenStream.fill();
System.out.printf("input: `%s`\n", input);
for (Token token : tokenStream.getTokens()) {
if (token.getType() != TLexer.EOF) {
System.out.printf(" token: -> %s\n", token.getText());
}
}
System.out.println();
}
public static void main(String[] args) throws Exception {
test("foo q'!My string which can contain 'single quotes'!' bar");
test("foo q'(My string which can contain 'single quotes')' bar");
test("foo 'My string which can contain ''single quotes' bar");
}
}
which will print:
input: `foo q'!My string which can contain 'single quotes'!' bar`
token: -> q'!My string which can contain 'single quotes'!'
input: `foo q'(My string which can contain 'single quotes')' bar`
token: -> q'(My string which can contain 'single quotes')'
input: `foo 'My string which can contain ''single quotes' bar`
token: -> 'My string which can contain ''single quotes'
The . in the alternative
| . ( {!ahead(getText().charAt(0) + "'")}? . )* .
might be a bit too permissive, but that can be tweaked by replacing it with a negated, or regular character set.
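To make the delimiter logic easier to follow outside of a grammar, here is a rough plain-Java sketch of the same idea. It is an illustration under my own assumptions, not the generated lexer: the class QuoteDelimitedScanner and the method matchQuoted are hypothetical names, and only the q'...' form is handled (not the ordinary or n'...' strings). The closing delimiter is derived from the opening one, and the scan stops at the first position where the delimiter followed by a single quote is ahead.

```java
final class QuoteDelimitedScanner {

    /** Counterpart of the grammar's ahead(): true if text occurs in input at position pos. */
    static boolean ahead(String input, int pos, String text) {
        return input.startsWith(text, pos);
    }

    /**
     * If a q'...' literal starts at index start, returns the index just past it;
     * returns -1 if there is no such literal or it is unterminated.
     */
    static int matchQuoted(String input, int start) {
        if (start + 3 > input.length()) {
            return -1;
        }
        char q = input.charAt(start);
        if ((q != 'q' && q != 'Q') || input.charAt(start + 1) != '\'') {
            return -1;
        }
        // The character after q' picks the closing delimiter
        char open = input.charAt(start + 2);
        char close;
        switch (open) {
            case '[': close = ']'; break;
            case '{': close = '}'; break;
            case '<': close = '>'; break;
            case '(': close = ')'; break;
            default:  close = open;
        }
        String terminator = close + "'";
        int i = start + 3;
        // Consume characters until the terminator is ahead
        while (i < input.length() && !ahead(input, i, terminator)) {
            i++;
        }
        return (i < input.length()) ? i + 2 : -1;
    }

    public static void main(String[] args) {
        String input = "foo q'!My string which can contain 'single quotes'!' bar";
        int end = matchQuoted(input, 4);
        System.out.println(input.substring(4, end));
        // prints: q'!My string which can contain 'single quotes'!'
    }
}
```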

Dynamically create lexer rule

Here is a simple rule:
NAME : 'name1' | 'name2' | 'name3';
Is it possible to provide alternatives for such rule dynamically using an array that contains strings?
Yes, dynamic tokens match IDENTIFIER rule
In that case, simply do a check after the Id has matched completely to see if the text the Id matched is in a predefined collection. If it is in the collection (a Set in my example) change the type of the token.
A small demo:
grammar T;
@lexer::members {
  private java.util.Set<String> special;
  public TLexer(ANTLRStringStream input, java.util.Set<String> special) {
    super(input);
    this.special = special;
  }
}
parse
: (t=. {System.out.printf("\%-10s'\%s'\n", tokenNames[$t.type], $t.text);})* EOF
;
Id
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
{if(special.contains($text)) $type=Special;}
;
Int
: '0'..'9'+
;
Space
: (' ' | '\t' | '\r' | '\n') {skip();}
;
fragment Special : ;
And if you now run the following demo:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source = "foo bar baz Mu";
java.util.Set<String> set = new java.util.HashSet<String>();
set.add("Mu");
set.add("bar");
TLexer lexer = new TLexer(new ANTLRStringStream(source), set);
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.parse();
}
}
You will see the following being printed:
Id 'foo'
Special 'bar'
Id 'baz'
Special 'Mu'
ANTLR4
For ANTLR4, you can do something like this:
grammar T;
@lexer::members {
  private java.util.Set<String> special = new java.util.HashSet<>();
  public TLexer(CharStream input, java.util.Set<String> special) {
    this(input);
    this.special = special;
  }
}
tokens {
Special
}
parse
: .*? EOF
;
Id
: [a-zA-Z_] [a-zA-Z_0-9]* {if(special.contains(getText())) setType(TParser.Special);}
;
Int
: [0-9]+
;
Space
: [ \t\r\n] -> skip
;
Test it with the class:
import org.antlr.v4.runtime.*;
import java.util.HashSet;
import java.util.Set;
public class Main {
public static void main(String[] args) {
String source = "foo bar baz Mu";
Set<String> set = new HashSet<String>(){{
add("Mu");
add("bar");
}};
TLexer lexer = new TLexer(CharStreams.fromString(source), set);
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
tokenStream.fill();
for (Token t : tokenStream.getTokens()) {
System.out.printf("%-10s '%s'\n", TParser.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
}
which will print:
Id 'foo'
Special 'bar'
Id 'baz'
Special 'Mu'
EOF '<EOF>'
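The trick in both versions is the same: match the general identifier shape first, then decide the token type by looking the matched text up in a set. A small stand-alone sketch of that two-step decision, using a regular expression instead of an ANTLR lexer, could look like this. The class SpecialWordClassifier and the method classify are names I made up, and anything that is neither an identifier nor an integer is simply ignored here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SpecialWordClassifier {

    // Group 1 has the same shape as the Id rule, group 2 the same shape as the Int rule
    private static final Pattern TOKEN =
            Pattern.compile("([A-Za-z_][A-Za-z_0-9]*)|([0-9]+)");

    /** Returns one "Type 'text'" entry per token found in source. */
    static List<String> classify(String source, Set<String> special) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String text = m.group();
            String type;
            if (m.group(1) != null) {
                // Matched as an identifier first; the set decides the final type
                type = special.contains(text) ? "Special" : "Id";
            } else {
                type = "Int";
            }
            result.add(type + " '" + text + "'");
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> special = new HashSet<>(Arrays.asList("Mu", "bar"));
        for (String line : classify("foo bar baz Mu", special)) {
            System.out.println(line);
        }
    }
}
```

Because the set is passed in at run time, the list of special words can change without touching the matching logic, which is the point of the approach above.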

ANTLRWorks: Can't get operators to work

I've been trying to learn ANTLR for some time and finally got my hands on The Definitive ANTLR reference.
Well I tried the following in ANTLRWorks 1.4
grammar Test;
INT : '0'..'9'+
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
expression
: INT ('+'^ INT)*;
When I pass 2+4 and process expression, I don't get a tree with + as the root and 2 and 4 as the child nodes. Rather, I get expression as the root and 2, + and 4 as child nodes at the same level.
Can't figure out what I am doing wrong. Need help desperately.
BTW, how can I get those graphic descriptions?
Yes, you get expression as the root because expression, your only rule, is what is being returned.
I have just added a virtual token PLUS to your example, along with a rewrite expression that shows the result you are expecting.
But it seems that you have already found the solution :o)
grammar Test;
options {
output=AST;
ASTLabelType = CommonTree;
}
tokens {PLUS;}
@members {
public static void main(String [] args) {
try {
TestLexer lexer =
new TestLexer(new ANTLRStringStream("2+2"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
TestParser parser = new TestParser(tokens);
TestParser.expression_return p_result = parser.expression();
CommonTree ast = p_result.tree;
if( ast == null ) {
System.out.println("resultant tree: is NULL");
} else {
System.out.println("resultant tree: " + ast.toStringTree());
}
} catch(Exception e) {
e.printStackTrace();
}
}
}
expression
: INT ('+' INT)* -> ^(PLUS INT+);
INT : '0'..'9'+
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
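To compare the tree shapes involved, here is a small stand-alone Java sketch. It does not use the ANTLR runtime; Node, withRootOperator and withRewrite are hypothetical names that only imitate how the trees end up looking. With output=AST, the ^ suffix in INT ('+'^ INT)* makes each '+' the root of whatever has been built so far, while the rewrite -> ^(PLUS INT+) hangs every INT under a single PLUS node. Both methods assume at least one operand.

```java
import java.util.ArrayList;
import java.util.List;

final class AstShapes {

    /** Minimal tree node with a LISP-style printer similar to toStringTree(). */
    static final class Node {
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String text) {
            this.text = text;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }

        String toStringTree() {
            if (children.isEmpty()) {
                return text;
            }
            StringBuilder sb = new StringBuilder("(").append(text);
            for (Node child : children) {
                sb.append(' ').append(child.toStringTree());
            }
            return sb.append(')').toString();
        }
    }

    /** Shape for INT ('+'^ INT)* : every '+' becomes the new root of the tree so far. */
    static Node withRootOperator(String... ints) {
        Node tree = new Node(ints[0]);
        for (int i = 1; i < ints.length; i++) {
            tree = new Node("+").add(tree).add(new Node(ints[i]));
        }
        return tree;
    }

    /** Shape for INT ('+' INT)* -> ^(PLUS INT+) : one PLUS root holding all INTs. */
    static Node withRewrite(String... ints) {
        Node root = new Node("PLUS");
        for (String value : ints) {
            root.add(new Node(value));
        }
        return root;
    }

    public static void main(String[] args) {
        System.out.println(withRootOperator("1", "2", "3").toStringTree()); // (+ (+ 1 2) 3)
        System.out.println(withRewrite("2", "2").toStringTree());           // (PLUS 2 2)
    }
}
```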

Generating simple AST in ANTLR

I'm playing around a bit with ANTLR, and wish to create a function like this:
MOVE x y z pitch roll
That produces the following AST:
MOVE
|---x
|---y
|---z
|---pitch
|---roll
So far I've tried without luck, and I keep getting the AST to have the parameters as siblings, rather than children.
Code so far:
C#:
class Program
{
const string CRLF = "\r\n";
static void Main(string[] args)
{
string filename = "Script.txt";
var reader = new StreamReader(filename);
var input = new ANTLRReaderStream(reader);
var lexer = new ScorBotScriptLexer(input);
var tokens = new CommonTokenStream(lexer);
var parser = new ScorBotScriptParser(tokens);
var result = parser.program();
var tree = result.Tree as CommonTree;
Print(tree, "");
Console.Read();
}
static void Print(CommonTree tree, string indent)
{
Console.WriteLine(indent + tree.ToString());
if (tree.Children != null)
{
indent += "\t";
foreach (var child in tree.Children)
{
var childTree = child as CommonTree;
if (childTree.Text != CRLF)
{
Print(childTree, indent);
}
}
}
}
}
ANTLR:
grammar ScorBotScript;
options
{
language = 'CSharp2';
output = AST;
ASTLabelType = CommonTree;
backtrack = true;
memoize = true;
}
@parser::namespace { RSD.Scripting }
@lexer::namespace { RSD.Scripting }
program
: (robotInstruction CRLF)*
;
robotInstruction
: moveCoordinatesInstruction
;
/**
* MOVE X Y Z PITCH ROLL
*/
moveCoordinatesInstruction
: 'MOVE' x=INT y=INT z=INT pitch=INT roll=INT
;
INT : '-'? ( '0'..'9' )*
;
COMMENT
: '//' ~( CR | LF )* CR? LF { $channel = HIDDEN; }
;
WS
: ( ' ' | TAB | CR | LF ) { $channel = HIDDEN; }
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
STRING
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
;
fragment TAB
: '\t'
;
fragment CR
: '\r'
;
fragment LF
: '\n'
;
CRLF
: (CR ? LF) => CR ? LF
| CR
;
parse
: ID
| INT
| COMMENT
| STRING
| WS
;
I'm a beginner with ANTLR myself, this confused me too.
I think if you want to create a tree from your grammar that has structure, you augment your grammar with hints using the ^ and ! characters. This examples page shows how.
From the linked page:
By default ANTLR creates trees as "sibling lists". The grammar must be annotated with tree commands to produce a parser that creates trees in the correct shape (that is, operators at the root, with operands as children). A somewhat more complicated expression parser can be seen here and downloaded in tar form here. Note that grammar terminals which should be at the root of a sub-tree are annotated with ^.

ANTLR: multiplication omitting '*' symbol

I'm trying to create a grammar for multiplying and dividing numbers in which the '*' symbol does not need to be included. I need it to output an AST. So for input like this:
1 2 / 3 4
I want the AST to be
(* (/ (* 1 2) 3) 4)
I've hit upon the following, which uses Java code to create the appropriate nodes:
grammar TestProd;
options {
output = AST;
}
tokens {
PROD;
}
DIV : '/';
multExpr: (INTEGER -> INTEGER)
( {div = null;}
div=DIV? b=INTEGER
->
^({$div == null ? (Object)adaptor.create(PROD, "*") : (Object)adaptor.create(DIV, "/")}
$multExpr $b))*
;
INTEGER: ('0' | '1'..'9' '0'..'9'*);
WHITESPACE: (' ' | '\t')+ { $channel = HIDDEN; };
This works. But is there a better/simpler way?
Here's a way:
grammar Test;
options {
backtrack=true;
output=AST;
}
tokens {
MUL;
DIV;
}
parse
: expr* EOF
;
expr
: (atom -> atom)
( '/' a=atom -> ^(DIV $expr $a)
| a=atom -> ^(MUL $expr $a)
)*
;
atom
: Number
| '(' expr ')' -> expr
;
Number
: '0'..'9'+
;
Space
: (' ' | '\t' | '\r' | '\n') {skip();}
;
Tested with:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.Tree;
public class Main {
public static void main(String[] args) throws Exception {
String source = "1 2 / 3 4";
ANTLRStringStream in = new ANTLRStringStream(source);
TestLexer lexer = new TestLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
TestParser parser = new TestParser(tokens);
TestParser.parse_return result = parser.parse();
Tree tree = (Tree)result.getTree();
System.out.println(tree.toStringTree());
}
}
produced:
(MUL (DIV (MUL 1 2) 3) 4)
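The rewrite rules in expr amount to a left fold: the tree built so far ($expr) becomes the left child of each new MUL or DIV node. A minimal plain-Java sketch of that fold, with no ANTLR involved, is shown below. The class ImplicitProduct and the method toTree are made-up names, and the sketch assumes whitespace-separated tokens and does not handle the parenthesized atoms that the grammar supports.

```java
final class ImplicitProduct {

    /** Folds whitespace-separated operands into a left-nested MUL/DIV tree string. */
    static String toTree(String source) {
        String[] tokens = source.trim().split("\\s+");
        String tree = tokens[0];
        int i = 1;
        while (i < tokens.length) {
            if (tokens[i].equals("/")) {
                if (i + 1 >= tokens.length) {
                    throw new IllegalArgumentException("missing operand after '/'");
                }
                // '/' a=atom -> ^(DIV $expr $a)
                tree = "(DIV " + tree + " " + tokens[i + 1] + ")";
                i += 2;
            } else {
                // a=atom -> ^(MUL $expr $a): two adjacent operands mean multiplication
                tree = "(MUL " + tree + " " + tokens[i] + ")";
                i += 1;
            }
        }
        return tree;
    }

    public static void main(String[] args) {
        System.out.println(toTree("1 2 / 3 4")); // (MUL (DIV (MUL 1 2) 3) 4)
    }
}
```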