Antlr backtrack option not working - antlr

I am not sure but I think the Antlr backtrack option is not working properly or something...
Here is my grammar:
grammar Test;
options {
backtrack=true;
memoize=true;
}
prog: (code)+;
code
: ABC {System.out.println("ABC");}
| OTHER {System.out.println("OTHER");}
;
ABC : 'ABC';
OTHER : .;
If the input stream is "ABC" then I'll see ABC printed.
If the input stream is "ACD" then I'll see 3 times OTHER printed.
But if the input stream is "ABD" then I'll see
line 1:2 mismatched character 'D' expecting 'C'
line 1:3 required (...)+ loop did not match anything at input ''
but I expect to see three times OTHER, since the input should match the second rule if the first rule fails.
That doesn't make any sense. Why the parser didn't backtrack when it sees that the last character was not 'C'? However, it was ok with "ACD."
Could someone please help me solve this issue???
Thanks for your time!!!

The option backtrack=true applies to parser rules only, not lexer rules.
EDIT
The only work-around I am aware of, is by letting "AB" followed by some other char other than "C" be matched in the same ABC rule and then manually emitting other tokens.
A demo:
grammar Test;
#lexer::members {
List<Token> tokens = new ArrayList<Token>();
public void emit(int type, String text) {
state.token = new CommonToken(type, text);
tokens.add(state.token);
}
public Token nextToken() {
super.nextToken();
if(tokens.size() == 0) {
return Token.EOF_TOKEN;
}
return tokens.remove(0);
}
}
prog
: code+
;
code
: ABC {System.out.println("ABC");}
| OTHER {System.out.println("OTHER");}
;
ABC
: 'ABC'
| 'AB' t=~'C'
{
emit(OTHER, "A");
emit(OTHER, "B");
emit(OTHER, String.valueOf((char)$t));
}
;
OTHER
: .
;

Another solution. this might be a simpler solution though. i made use of "syntactic predicates".
grammar ABC;
#lexer::header {package org.inanme.antlr;}
#parser::header {package org.inanme.antlr;}
prog: (code)+ EOF;
code: ABC {System.out.println($ABC.text);}
| OTHER {System.out.println($OTHER.text);};
ABC : ('ABC') => 'ABC' | 'A';
OTHER : .;

Related

No way to implement a q quoted string with custom delimiters in Antlr4

I'm trying to implement a lexer rule for an oracle Q quoted string mechanism where we have something like q'$some string$'
Here you can have any character in place of $ other than whitespace, (, {, [, <, but the string must start and end with the same character. Some examples of accepted tokens would be:
q'!some string!'
q'ssome strings'
Notice how s is the custom delimiter but it is fine to have that in the string as well because we would only end at s'
Here's how I was trying to implement the rule:
Q_QUOTED_LITERAL: Q_QUOTED_LITERAL_NON_TERMINATED . QUOTE-> type(QUOTED_LITERAL);
Q_QUOTED_LITERAL_NON_TERMINATED:
Q QUOTE ~[ ({[<'"\t\n\r] { setDelimChar( (char)_input.LA(-1) ); }
( . { !isValidEndDelimChar() }? )*
;
I have already checked the value I get from !isValidEndDelimChar() and I'm getting a false predicate here at the right place so everything should work, but antlr simply ignores this predicate. I've also tried moving the predicate around, putting that part in a separate rule, and a bunch of other stuff, after a day and a half of research on the same I'm finally raising this issue.
I have also tried to implement it in other ways but there doesn't seem to be a way to implement a custom char delimited string in antlr4 (The antlr3 version used to work).
Not sure why the { ... } action isn't invoked, but it's not needed. The following grammar worked for me (put the predicate in front of the .!):
grammar Test;
#lexer::members {
boolean isValidEndDelimChar() {
return (_input.LA(1) == getText().charAt(2)) && (_input.LA(2) == '\'');
}
}
parse
: .*? EOF
;
Q_QUOTED_LITERAL
: 'q\'' ~[ ({[<'"\t\n\r] ( {!isValidEndDelimChar()}? . )* . '\''
;
SPACE
: [ \t\f\r\n] -> skip
;
If you run the class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
Lexer lexer = new TestLexer(CharStreams.fromString("q'ssome strings' q'!foo!'"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-20s %s\n", TestLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
}
the following output will be printed:
Q_QUOTED_LITERAL q'ssome strings'
Q_QUOTED_LITERAL q'!foo!'
EOF <EOF>

Yield a modified token in ANTLR4

I have a syntax like the following
Identifier
: [a-zA-Z0-9_.]+
| '`' Identifier '`'
;
When I matched an identifier, e.g `someone`, I'd like to strip the backtick and yield a different token, aka someone
Of course, I could walk through the final token array, but is it possible to do it during token parsing?
If I well understand, given the input (file t.text) :
one `someone`
two `fred`
tree `henry`
you would like that tokens are automatically produced as if the grammar had the lexer rules :
SOMEONE : 'someone' ;
FRED : 'fred' ;
HENRY : 'henry' ;
ID : [a-zA-Z0-9_.]+ ;
But tokens are identified by a type, i.e. an integer, not by the name of the lexer rule. You can change this type with setType() :
grammar Question;
/* Change `someone` to SOMEONE, `fred` to FRED, etc. */
#lexer::members { int next_number = 1001; }
question
#init {System.out.println("Question last update 1117");}
: expr+ EOF
;
expr
: ID BACKTICK_ID
;
ID : [a-zA-Z0-9_.]+ ;
BACKTICK_ID : '`' ID '`' { setType(next_number); next_number+=1; } ;
WS : [ \r\n\t] -> skip ;
Execution :
$ grun Question question -tokens -diagnostics t.text
[#0,0:2='one',<ID>,1:0]
[#1,4:12='`someone`',<1001>,1:4]
[#2,14:16='two',<ID>,2:0]
[#3,18:23='`fred`',<1002>,2:4]
[#4,25:28='tree',<ID>,3:0]
[#5,30:36='`henry`',<1003>,3:5]
[#6,38:37='<EOF>',<EOF>,4:0]
Question last update 1117
line 1:4 mismatched input '`someone`' expecting BACKTICK_ID
line 2:4 mismatched input '`fred`' expecting BACKTICK_ID
line 3:5 mismatched input '`henry`' expecting BACKTICK_ID
The basic types come from the lexer rules :
$ cat Question.tokens
ID=1
BACKTICK_ID=2
WS=3
the other from setType. Instead of incrementing a number for each token, you could write the tokens found in a table, and before creating a new one, access the table to check if it already exists and avoid duplicate tokens receive a different number.
Anyway you can do nothing useful in the parser because parser rules need to know the type number.
If you have a set of names known in advance, you can list them in a tokens statement :
grammar Question;
/* Change `someone` to SOMEONE, `fred` to FRED, etc. */
#lexer::header {
import java.util.*;
}
tokens { SOMEONE, FRED, HENRY }
#lexer::members {
Map<String,Integer> keywords = new HashMap<String,Integer>() {{
put("someone", QuestionParser.SOMEONE);
put("fred", QuestionParser.FRED);
put("henry", QuestionParser.HENRY);
}};
}
question
#init {System.out.println("Question last update 1746");}
: expr+ EOF
;
expr
: ID SOMEONE
| ID FRED
| ID HENRY
;
ID : [a-zA-Z0-9_.]+ ;
BACKTICK_ID : '`' ID '`'
{ String textb = getText();
String texta = textb.substring(1, textb.length() - 1);
System.out.println("text before=" + textb + ", text after="+ texta);
if ( keywords.containsKey(texta)) {
setType(keywords.get(texta)); // reset token type
setText(texta); // remove backticks
}
}
;
WS : [ \r\n\t] -> skip ;
Execution :
$ grun Question question -tokens -diagnostics t.text
text before=`someone`, text after=someone
text before=`fred`, text after=fred
text before=`henry`, text after=henry
[#0,0:2='one',<ID>,1:0]
[#1,4:12='someone',<4>,1:4]
[#2,14:16='two',<ID>,2:0]
[#3,18:23='fred',<5>,2:4]
[#4,25:28='tree',<ID>,3:0]
[#5,30:36='henry',<6>,3:5]
[#6,38:37='<EOF>',<EOF>,4:0]
Question last update 1746
$ cat Question.tokens
ID=1
BACKTICK_ID=2
WS=3
SOMEONE=4
FRED=5
HENRY=6
As you can see, there are no more errors because the expr rule is happy with well identified tokens. Even if there are no
SOMEONE : 'someone' ;
FRED : 'fred' ;
HENRY : 'henry' ;
only ID and BACKTICK_ID, the types have been defined behind the scene by the tokens statement :
public static final int
ID=1, BACKTICK_ID=2, WS=3, SOMEONE=4, FRED=5, HENRY=6;
I'm afraid that if you want a free list of names, it's not possible because the parser works with types, not the name of lexer rules :
public static class ExprContext extends ParserRuleContext {
public TerminalNode ID() { return getToken(QuestionParser.ID, 0); }
public TerminalNode SOMEONE() { return getToken(QuestionParser.SOMEONE, 0); }
public TerminalNode FRED() { return getToken(QuestionParser.FRED, 0); }
public TerminalNode HENRY() { return getToken(QuestionParser.HENRY, 0); }
...
public final ExprContext expr() throws RecognitionException {
try { ...
setState(17);
case 1:
enterOuterAlt(_localctx, 1);
{
setState(11);
match(ID);
setState(12);
match(SOMEONE);
}
break;
In
match(SOMEONE);
SOMEONE is a constant representing the number 4.
If you don't have a list of known names, emit will not solve your problem because it creates a Token whose most important field is _type :
public Token emit() {
Token t = _factory.create(_tokenFactorySourcePair, _type, _text, _channel, _tokenStartCharIndex, getCharIndex()-1,
_tokenStartLine, _tokenStartCharPositionInLine);
emit(t);
return t;
}

ANTLR: removing clutter

i'm learning ANTLR right now. Let's say, I have a VHDL code and would like to do some processing on the PROCESS blocks. The rest should be completely ignored. I don't want to describe the whole VHDL language, since I'm interested only in the process blocks. So I could write a rule that matches process blocks. But how do I tell ANTLR to match only the process block rule and ignore anything else?
I know next to no VHDL, so let's say you want to replace all single line comments in a (Java) source file with multi-line comments:
//foo
should become:
/* foo */
You need to let the lexer match single line comments, of course. But you should also make sure it recognizes multi-line comments because you don't want //bar to be recognized as a single line comment in:
/*
//bar
*/
The same goes for string literals:
String s = "no // comment";
Finally, you should create some sort of catch-all rule in the lexer that will match any character.
A quick demo:
grammar T;
parse
: (t=. {System.out.print($t.text);})* EOF
;
Str
: '"' ('\\' . | ~('\\' | '"'))* '"'
;
MLComment
: '/*' .* '*/'
;
SLComment
: '//' ~('\r' | '\n')*
{
setText("/* " + getText().substring(2) + " */");
}
;
Any
: . // fall through rule, matches any character
;
If you now parse input like this:
//comment 1
class Foo {
//comment 2
/*
* not // a comment
*/
String s = "not // a // comment"; //comment 3
}
the following will be printed to your console:
/* comment 1 */
class Foo {
/* comment 2 */
/*
* not // a comment
*/
String s = "not // a // comment"; /* comment 3 */
}
Note that this is just a quick demo: a string literal in Java could contain Unicode escapes, which my demo doesn't support, and my demo also does not handle char-literals (the char literal char c = '"'; would break it). All of these things are quite easy to fix, of course.
In the upcoming ANTLR v4, you can do fuzzy parsing. take a look at
http://www.antlr.org/wiki/display/ANTLR4/Wildcard+Operator+and+Nongreedy+Subrules
You can get the beta software here:
http://antlr.org/download/antlr-4.0b3-complete.jar
Terence

ANTLR Variable Troubles

In short: how do I implement dynamic variables in ANTLR?
I come to you again with a basic ANTLR question.
I have this grammar:
grammar Amethyst;
options {
language = Java;
}
#header {
package org.omer.amethyst.generated;
import java.util.HashMap;
}
#lexer::header {
package org.omer.amethyst.generated;
}
#members {
HashMap memory = new HashMap();
}
begin: expr;
expr: (defun | println)*
;
println:
'println' atom {System.out.println($atom.value);}
;
defun:
'defun' VAR INT {memory.put($VAR.text, Integer.parseInt($INT.text));}
| 'defun' VAR STRING_LITERAL {memory.put($VAR.text, $STRING_LITERAL.text);}
;
atom returns [Object value]:
INT {$value = Integer.parseInt($INT.text);}
| ID
{
Object v = memory.get($ID.text);
if (v != null) $value = v;
else System.err.println("undefined variable " + $ID.text);
}
| STRING_LITERAL
{
String v = (String) memory.get($STRING_LITERAL.text);
if (v != null) $value = String.valueOf(v);
else System.err.println("undefined variable " + $STRING_LITERAL.text);
}
;
INT: '0'..'9'+ ;
STRING_LITERAL: '"' .* '"';
VAR: ('a'..'z'|'A'..'Z')('a'..'z'|'A'..'Z'|'0'..'9')* ;
ID: ('a'..'z'|'A'..'Z'|'0'..'9')+ ;
LETTER: ('a..z'|'A'..'Z')+ ;
WS: (' '|'\t'|'\n'|'\r')+ {skip();} ;
What it does (or should do), so far, is have a built-in "println" function to do exactly what you think it does, and a "defun" rule to define variables.
When "defun" is called on either a string or integer, the value is put into the "memory" HashMap with the first parameter being the variable's name and the second being its value.
When println is called on an atom, it should display the atom's value. The atom can be either a string or integer. It gets its value from memory and returns it. So for example:
defun greeting "Hello world!"
println greeting
But when I run this code, I get this error:
line 3:8 no viable alternative at input 'greeting'
null
NOTE: This output comes when I do:
println "greeting"
Output:
undefined variable "greeting"null
Does anyone know why this is so? Sorry if I'm not being clear, I don't understand most of this.
defun greeting "Hello world!"
println greeting
But when I run this code, I get this error:
line 3:8 no viable alternative at input 'greeting'
Because the input "greeting" is being tokenized as a VAR and a VAR is no atom. So the input defun greeting "Hello world!" is properly matched by the 2nd alternative of the defun rule:
defun
: 'defun' VAR INT // 1st alternative
| 'defun' VAR STRING_LITERAL // 2nd alternative
;
but the input println "greeting" cannot be matched by the println rule:
println
: 'println' atom
;
You must realize that the lexer does not produce tokens based on what the parser tries to match at a particular time. The input "greeting" will always be tokenized as a VAR, never as an ID rule.
What you need to do is remove the ID rule from the lexer, and replace ID with VAR inside your parser rules.

How can I build an ANTLR Works style parse tree?

I've read that you need to use the '^' and '!' operators in order to build a parse tree similar to the ones displayed in ANTLR Works (even though you don't need to use them to get a nice tree in ANTLR Works). My question then is how can I build such a tree? I've seen a few pages on tree construction using the two operators and rewrites, and yet say I have an input string abc abc123 and a grammar:
grammar test;
program : idList;
idList : id* ;
id : ID ;
ID : LETTER (LETTER | NUMBER)* ;
LETTER : 'a' .. 'z' | 'A' .. 'Z' ;
NUMBER : '0' .. '9' ;
ANTLR Works will output:
What I dont understand is how you can get the 'idList' node on top of this tree (as well as the grammar one as a matter of fact). How can I reproduce this tree using rewrites and those operators?
What I dont understand is how you can get the 'idList' node on top of this tree (as well as the grammar one as a matter of fact). How can I reproduce this tree using rewrites and those operators?
You can't use ^ and ! alone. These operators only operate on existing tokens, while you want to create extra tokens (and make these the root of your sub trees). You can do that using rewrite rules and defining some imaginary tokens.
A quick demo:
grammar test;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
IdList;
Id;
}
#parser::members {
private static void walk(CommonTree tree, int indent) {
if(tree == null) return;
for(int i = 0; i < indent; i++, System.out.print(" "));
System.out.println(tree.getText());
for(int i = 0; i < tree.getChildCount(); i++) {
walk((CommonTree)tree.getChild(i), indent + 1);
}
}
public static void main(String[] args) throws Exception {
testLexer lexer = new testLexer(new ANTLRStringStream("abc abc123"));
testParser parser = new testParser(new CommonTokenStream(lexer));
walk((CommonTree)parser.program().getTree(), 0);
}
}
program : idList EOF -> idList;
idList : id* -> ^(IdList id*);
id : ID -> ^(Id ID);
ID : LETTER (LETTER | DIGIT)*;
SPACE : ' ' {skip();};
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
fragment DIGIT : '0' .. '9';
If you run the demo above, you will see the following being printed to the console:
IdList
Id
abc
Id
abc123
As you can see, imaginary tokens must also start with an upper case letter, just like lexer rules. If you want to give the imaginary tokens the same text as the parser rule they represent, do something like this instead:
idList : id* -> ^(IdList["idList"] id*);
id : ID -> ^(Id["id"] ID);
which will print:
idList
id
abc
id
abc123