How to write a lexer rule that references a character? - antlr

I want to create a lexer rule that can read a string literal that defines its own delimiter (specifically, the Oracle quote-delimited string):
q'!My string which can contain 'single quotes'!'
where the ! serves as the delimiter, but can in theory be any character.
Is it possible to do this via a lexer rule, without introducing a dependency on a given language target?

Is it possible to do this via a lexer rule, without introducing a dependency on a given language target?
No, target dependent code is needed for such a thing.
Just in case you, or someone else reading this Q&A is wondering how this can be done using target code, here's a quick demo:
lexer grammar TLexer;
#members {
boolean ahead(String text) {
for (int i = 0; i < text.length(); i++) {
if (_input.LA(i + 1) != text.charAt(i)) {
return false;
}
}
return true;
}
}
TEXT
: [nN]? ( ['] ( [']['] | ~['] )* [']
| [qQ] ['] QUOTED_TEXT [']
)
;
// Skip everything other than TEXT tokens
OTHER
: . -> skip
;
fragment QUOTED_TEXT
: '[' ( {!ahead("]'")}? . )* ']'
| '{' ( {!ahead("}'")}? . )* '}'
| '<' ( {!ahead(">'")}? . )* '>'
| '(' ( {!ahead(")'")}? . )* ')'
| . ( {!ahead(getText().charAt(0) + "'")}? . )* .
;
which can be tested with the class:
public class Main {
static void test(String input) {
TLexer lexer = new TLexer(new ANTLRInputStream(input));
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
tokenStream.fill();
System.out.printf("input: `%s`\n", input);
for (Token token : tokenStream.getTokens()) {
if (token.getType() != TLexer.EOF) {
System.out.printf(" token: -> %s\n", token.getText());
}
}
System.out.println();
}
public static void main(String[] args) throws Exception {
test("foo q'!My string which can contain 'single quotes'!' bar");
test("foo q'(My string which can contain 'single quotes')' bar");
test("foo 'My string which can contain ''single quotes' bar");
}
}
which will print:
input: `foo q'!My string which can contain 'single quotes'!' bar`
token: -> q'!My string which can contain 'single quotes'!'
input: `foo q'(My string which can contain 'single quotes')' bar`
token: -> q'(My string which can contain 'single quotes')'
input: `foo 'My string which can contain ''single quotes' bar`
token: -> 'My string which can contain ''single quotes'
The . in the alternative
| . ( {!ahead(getText().charAt(0) + "'")}? . )* .
might be a bit too permissive, but that can be tweaked by replacing it with a negated, or regular character set.

Related

Using a grammar with a visitor to calculate arithmetic expressions

We've been given a grammar in class that looks like this:
grammar Calculator;
#header {
import java.util.*;
}
#parser::members {
/** "memory" for our calculator; variable/value pairs go here */
Map<String, Double> memory = new HashMap<String, Double>();
}
statlist : stat+ ;
stat : vgl NL #printCompare
| ass NL #printAssign
| NL #blank
;
ass : <assoc=right> VAR ('=') vgl #assign
;
vgl : sum(op=('<'|'>') sum)* #compare
;
sum : prod(op=('+'|'-') prod)* #addSub
;
prod : pot(op=('*'|'/') pot)* #mulDiv
;
pot :<assoc=right> term(op='^' pot)? #poten
;
term : '+' term #add
| '-' term #subtract
| '(' sum ')' #parens
| VAR #var
| INT #int
;
/*Rules for the lexer */
MUL : '*' ;
DIV : '/' ;
ADD : '+' ;
SUB : '-' ;
BIG : '>' ;
SML : '<' ;
POT : '^' ;
VAR : [a-zA-Z]+ ;
NL : [\n] ;
INT : [0-9]+ ;
WS : [ \r\t]+ -> skip ; // skip spaces, tabs
I am having problems translating constructs like these
sum : prod(op=('+'|'-') prod)* #addSub
into working code. Currently the corresponding method looks like this:
/** prod(op=('+'|'-') prod)* */
#Override
public Double visitAddSub(CalculatorParser.AddSubContext ctx) {
double left = visit(ctx.prod(0));
if(ctx.op == null){
return left;
}
double right = visit(ctx.prod(1));
return (ctx.op.getType() == CalculatorParser.ADD) ? left+right : left-right;
}
Current output would look like this
3+3+3
6.0
which is obviously false. How do I get my visitor to visit the nodes correctly without touching the grammar?
Take a look at the rule:
prod(op=('+'|'-') prod)*
See that *? It means that what's inside the parentheses can come up 0 or more times.
Your visitor code assumes there will either be only one or two child prod, but no more. That's why you see 6.0: the parser put 3+3+3 into the context, but your visitor only processed 3+3 and leaved the final +3 out.
So just use a while loop over all the op and prod children, and accumulate them into the result.
Okay, with the help of Lucas and the usage of op+= I manage to fix my problem. It looks pretty complicated but it works.
/** prod(op+=('+'|'-') prod)* */
#Override
public Double visitAddSub(CalculatorParser.AddSubContext ctx) {
Stack<Double> temp = new Stack<Double>();
switch(ctx.children.size()){
case 1: return visit(ctx.prod(0));
default:
Double ret = 0.0;
for(int i = 0; i < ctx.op.size(); i++){
if(ctx.op.get(i).getType()==CalculatorParser.ADD){
if(temp.isEmpty()) {
ret = visit(ctx.prod(i)) + visit(ctx.prod(i+1));
temp.push(ret);
} else {
ret = temp.pop() + visit(ctx.prod(i+1));
temp.push(ret);
}
} else {
if(temp.isEmpty()) {
ret = visit(ctx.prod(i)) - visit(ctx.prod(i+1));
temp.push(ret);
} else {
ret = temp.pop() - visit(ctx.prod(i+1));
temp.push(ret);
}
}
}
}
return temp.pop();
}
We are using a switch-case to determine how many children this context has. If its more than 3 we have atleast 2 operators. We're then using the individual operator and a stack to determine the result.

In antlr4 lexer, How to have a rule that catches all remaining "words" as Unknown token?

I have an antlr4 lexer grammar. It has many rules for words, but I also want it to create an Unknown token for any word that it can not match by other rules. I have something like this:
Whitespace : [ \t\n\r]+ -> skip;
Punctuation : [.,:;?!];
// Other rules here
Unknown : .+? ;
Now generated matcher catches '~' as unknown but creates 3 '~' Unknown tokens for input '~~~' instead of a single '~~~' token. What should I do to tell lexer to generate word tokens for unknown consecutive characters. I also tried "Unknown: . ;" and "Unknown : .+ ;" with no results.
EDIT: In current antlr versions .+? now catches remaining words, so this problem seems to be resolved.
.+? at the end of a lexer rule will always match a single character. But .+ will consume as much as possible, which was illegal at the end of a rule in ANTLR v3 (v4 probably as well).
What you can do is just match a single char, and "glue" these together in the parser:
unknowns : Unknown+ ;
...
Unknown : . ;
EDIT
... but I only have a lexer, no parsers ...
Ah, I see. Then you could override the nextToken() method:
lexer grammar Lex;
#members {
public static void main(String[] args) {
Lex lex = new Lex(new ANTLRInputStream("foo, bar...\n"));
for(Token t : lex.getAllTokens()) {
System.out.printf("%-15s '%s'\n", tokenNames[t.getType()], t.getText());
}
}
private java.util.Queue<Token> queue = new java.util.LinkedList<Token>();
#Override
public Token nextToken() {
if(!queue.isEmpty()) {
return queue.poll();
}
Token next = super.nextToken();
if(next.getType() != Unknown) {
return next;
}
StringBuilder builder = new StringBuilder();
while(next.getType() == Unknown) {
builder.append(next.getText());
next = super.nextToken();
}
// The `next` will _not_ be an Unknown-token, store it in
// the queue to return the next time!
queue.offer(next);
return new CommonToken(Unknown, builder.toString());
}
}
Whitespace : [ \t\n\r]+ -> skip ;
Punctuation : [.,:;?!] ;
Unknown : . ;
Running it:
java -cp antlr-4.0-complete.jar org.antlr.v4.Tool Lex.g4
javac -cp antlr-4.0-complete.jar *.java
java -cp .:antlr-4.0-complete.jar Lex
will print:
Unknown 'foo'
Punctuation ','
Unknown 'bar'
Punctuation '.'
Punctuation '.'
Punctuation '.'
The accepted answer works, but it only works for Java.
I converted the provided Java code for use with the C# ANTLR runtime. If anyone else needs it... here ya go!
#members {
private IToken _NextToken = null;
public override IToken NextToken()
{
if(_NextToken != null)
{
var token = _NextToken;
_NextToken = null;
return token;
}
var next = base.NextToken();
if(next.Type != UNKNOWN)
{
return next;
}
var originalToken = next;
var lastToken = next;
var builder = new StringBuilder();
while(next.Type == UNKNOWN)
{
lastToken = next;
builder.Append(next.Text);
next = base.NextToken();
}
_NextToken = next;
return new CommonToken(
originalToken
)
{
Text = builder.ToString(),
StopIndex = lastToken.Column
};
}
}

Dynamically create lexer rule

Here is a simple rule:
NAME : 'name1' | 'name2' | 'name3';
Is it possible to provide alternatives for such rule dynamically using an array that contains strings?
Yes, dynamic tokens match IDENTIFIER rule
In that case, simply do a check after the Id has matched completely to see if the text the Id matched is in a predefined collection. If it is in the collection (a Set in my example) change the type of the token.
A small demo:
grammar T;
#lexer::members {
private java.util.Set<String> special;
public TLexer(ANTLRStringStream input, java.util.Set<String> special) {
super(input);
this.special = special;
}
}
parse
: (t=. {System.out.printf("\%-10s'\%s'\n", tokenNames[$t.type], $t.text);})* EOF
;
Id
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
{if(special.contains($text)) $type=Special;}
;
Int
: '0'..'9'+
;
Space
: (' ' | '\t' | '\r' | '\n') {skip();}
;
fragment Special : ;
And if you now run the following demo:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source = "foo bar baz Mu";
java.util.Set<String> set = new java.util.HashSet<String>();
set.add("Mu");
set.add("bar");
TLexer lexer = new TLexer(new ANTLRStringStream(source), set);
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.parse();
}
}
You will see the following being printed:
Id 'foo'
Special 'bar'
Id 'baz'
Special 'Mu'
ANTLR4
For ANTLR4, you can do something like this:
grammar T;
#lexer::members {
private java.util.Set<String> special = new java.util.HashSet<>();
public TLexer(CharStream input, java.util.Set<String> special) {
this(input);
this.special = special;
}
}
tokens {
Special
}
parse
: .*? EOF
;
Id
: [a-zA-Z_] [a-zA-Z_0-9]* {if(special.contains(getText())) setType(TParser.Special);}
;
Int
: [0-9]+
;
Space
: [ \t\r\n] -> skip
;
test it with the class:
import org.antlr.v4.runtime.*;
import java.util.HashSet;
import java.util.Set;
public class Main {
public static void main(String[] args) {
String source = "foo bar baz Mu";
Set<String> set = new HashSet<String>(){{
add("Mu");
add("bar");
}};
TLexer lexer = new TLexer(CharStreams.fromString(source), set);
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
tokenStream.fill();
for (Token t : tokenStream.getTokens()) {
System.out.printf("%-10s '%s'\n", TParser.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
}
which will print:
Id 'foo'
Special 'bar'
Id 'baz'
Special 'Mu'
EOF '<EOF>'

ANTLRWorks :Can't get operators to work

I've been trying to learn ANTLR for some time and finally got my hands on The Definitive ANTLR reference.
Well I tried the following in ANTLRWorks 1.4
grammar Test;
INT : '0'..'9'+
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
expression
: INT ('+'^ INT)*;
When I pass 2+4 and process expression, I don't get a tree with + as the root and 2 and 4 as the child nodes. Rather, I get expression as the root and 2, + and 4 as child nodes at the same level.
Can't figure out what I am doing wrong. Need help desparately.
BTW how can I get those graphic descriptions ?
Yes, you get the expression because it's an expression that your only rule expression is returning.
I have just added a virtual token PLUS to your example along with a rewrite expression that show the result your are expecting.
But it seems that you have already found the solution :o)
grammar Test;
options {
output=AST;
ASTLabelType = CommonTree;
}
tokens {PLUS;}
#members {
public static void main(String [] args) {
try {
TestLexer lexer =
new TestLexer(new ANTLRStringStream("2+2"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
TestParser parser = new TestParser(tokens);
TestParser.expression_return p_result = parser.expression();
CommonTree ast = p_result.tree;
if( ast == null ) {
System.out.println("resultant tree: is NULL");
} else {
System.out.println("resultant tree: " + ast.toStringTree());
}
} catch(Exception e) {
e.printStackTrace();
}
}
}
expression
: INT ('+' INT)* -> ^(PLUS INT+);
INT : '0'..'9'+
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;

Generating simple AST in ANTLR

I'm playing a bit around with ANTLR, and wish to create a function like this:
MOVE x y z pitch roll
That produces the following AST:
MOVE
|---x
|---y
|---z
|---pitch
|---roll
So far I've tried without luck, and I keep getting the AST to have the parameters as siblings, rather than children.
Code so far:
C#:
class Program
{
const string CRLF = "\r\n";
static void Main(string[] args)
{
string filename = "Script.txt";
var reader = new StreamReader(filename);
var input = new ANTLRReaderStream(reader);
var lexer = new ScorBotScriptLexer(input);
var tokens = new CommonTokenStream(lexer);
var parser = new ScorBotScriptParser(tokens);
var result = parser.program();
var tree = result.Tree as CommonTree;
Print(tree, "");
Console.Read();
}
static void Print(CommonTree tree, string indent)
{
Console.WriteLine(indent + tree.ToString());
if (tree.Children != null)
{
indent += "\t";
foreach (var child in tree.Children)
{
var childTree = child as CommonTree;
if (childTree.Text != CRLF)
{
Print(childTree, indent);
}
}
}
}
ANTLR:
grammar ScorBotScript;
options
{
language = 'CSharp2';
output = AST;
ASTLabelType = CommonTree;
backtrack = true;
memoize = true;
}
#parser::namespace { RSD.Scripting }
#lexer::namespace { RSD.Scripting }
program
: (robotInstruction CRLF)*
;
robotInstruction
: moveCoordinatesInstruction
;
/**
* MOVE X Y Z PITCH ROLL
*/
moveCoordinatesInstruction
: 'MOVE' x=INT y=INT z=INT pitch=INT roll=INT
;
INT : '-'? ( '0'..'9' )*
;
COMMENT
: '//' ~( CR | LF )* CR? LF { $channel = HIDDEN; }
;
WS
: ( ' ' | TAB | CR | LF ) { $channel = HIDDEN; }
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
STRING
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
;
fragment TAB
: '\t'
;
fragment CR
: '\r'
;
fragment LF
: '\n'
;
CRLF
: (CR ? LF) => CR ? LF
| CR
;
parse
: ID
| INT
| COMMENT
| STRING
| WS
;
I'm a beginner with ANTLR myself, this confused me too.
I think if you want to create a tree from your grammar that has structure, you augment your grammar with hints using the ^ and ! characters. This examples page shows how.
From the linked page:
By default ANTLR creates trees as
"sibling lists".
The grammar must be annotated to with
tree commands to produce a parser that
creates trees in the correct shape
(that is, operators at the root, which
operands as children). A somewhat more
complicated expression parser can be
seen here and downloaded in tar form
here. Note that grammar terminals
which should be at the root of a
sub-tree are annotated with ^.