No way to implement a q quoted string with custom delimiters in Antlr4 - antlr

I'm trying to implement a lexer rule for Oracle's q-quoted string mechanism, where a literal looks like q'$some string$'
Any character may be used in place of $ except whitespace, (, {, [, <, but the string must start and end with the same character. Some examples of accepted tokens would be:
q'!some string!'
q'ssome strings'
Notice how s is the custom delimiter, yet it may also appear inside the string, because the literal only ends at s'
Here's how I was trying to implement the rule:
Q_QUOTED_LITERAL: Q_QUOTED_LITERAL_NON_TERMINATED . QUOTE-> type(QUOTED_LITERAL);
Q_QUOTED_LITERAL_NON_TERMINATED:
Q QUOTE ~[ ({[<'"\t\n\r] { setDelimChar( (char)_input.LA(-1) ); }
( . { !isValidEndDelimChar() }? )*
;
I have checked the value returned by !isValidEndDelimChar() and the predicate does evaluate to false at the right place, so this should work, but ANTLR simply ignores the predicate. I have also tried moving the predicate around, putting that part in a separate rule, and several other variations. After a day and a half of research I'm finally raising this issue.
I have also tried other approaches, but there doesn't seem to be a way to implement a string with a custom delimiter character in ANTLR 4 (the ANTLR 3 version used to work).

Not sure why the { ... } action isn't invoked, but it's not needed. The following grammar worked for me (note that the predicate is placed in front of the . wildcard):
grammar Test;
@lexer::members {
boolean isValidEndDelimChar() {
return (_input.LA(1) == getText().charAt(2)) && (_input.LA(2) == '\'');
}
}
parse
: .*? EOF
;
Q_QUOTED_LITERAL
: 'q\'' ~[ ({[<'"\t\n\r] ( {!isValidEndDelimChar()}? . )* . '\''
;
SPACE
: [ \t\f\r\n] -> skip
;
If you run the class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
Lexer lexer = new TestLexer(CharStreams.fromString("q'ssome strings' q'!foo!'"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-20s %s\n", TestLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
}
the following output will be printed:
Q_QUOTED_LITERAL q'ssome strings'
Q_QUOTED_LITERAL q'!foo!'
EOF <EOF>
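To make the end-of-literal condition easier to see outside ANTLR, here is a minimal hand-written sketch of the same check in plain Java. The class name QQuoteSketch and the method matchQQuoted are made up for this illustration and are not part of the generated lexer; the sketch only assumes the delimiter rules stated in the grammar above.

```java
// Hand-written illustration of the "delimiter followed by a quote" check.
// QQuoteSketch and matchQQuoted are hypothetical names, not ANTLR API.
final class QQuoteSketch {

    // Returns the end index (exclusive) of the q-quoted literal that starts
    // at 'start', or -1 if no well-formed literal starts there.
    static int matchQQuoted(String s, int start) {
        // A literal needs at least q, a quote, and a delimiter character.
        if (start + 2 >= s.length()) return -1;
        if (s.charAt(start) != 'q' || s.charAt(start + 1) != '\'') return -1;
        char delim = s.charAt(start + 2);
        // Same excluded set as ~[ ({[<'"\t\n\r] in the lexer rule.
        if (" ({[<'\"\t\n\r".indexOf(delim) >= 0) return -1;
        // The literal ends at the first delimiter that is followed by a quote.
        for (int i = start + 3; i + 1 < s.length(); i++) {
            if (s.charAt(i) == delim && s.charAt(i + 1) == '\'') {
                return i + 2;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String input = "q'ssome strings' q'!foo!'";
        int end = matchQQuoted(input, 0);
        System.out.println(input.substring(0, end));        // q'ssome strings'
        int end2 = matchQQuoted(input, end + 1);
        System.out.println(input.substring(end + 1, end2)); // q'!foo!'
    }
}
```

The loop plays the role of the predicated ( ... . )* part of the rule: an s inside the text is skipped unless a quote follows it.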

Related

Unable to parse APL Symbol using ANTLR

I am trying to parse APL expressions using ANTLR; it is a sort of APL source code parser. It parses normal characters but fails to parse special symbols (like '←').
expression = N←0
Lexer
/* Lexer Tokens. */
NUMBER:
(DIGIT)+ ( '.' (DIGIT)+ )?;
ASSIGN:
'←'
;
DIGIT :
[0-9]
;
Output:
[#0,0:1='99',<NUMBER>,1:0]
[#1,4:6='â??',<'â??'>,2:0]
[#2,7:6='<EOF>',<EOF>,2:3]
Can someone help me parse special characters from the APL language?
I am following the steps below:
Written Grammar
"antlr4.bat" used to generate parser from grammar.
"grun.bat" is used to generate token
That just means your terminal cannot display the character properly. There is nothing wrong with the generated parser or lexer; they are able to recognise ←.
Just don't use the bat file, but rather test your lexer and parser by writing a small class yourself using your favourite IDE (which can display the characters properly).
Something like this:
grammar T;
expression
: ID ARROW NUMBER
;
ID : [a-zA-Z]+;
ARROW : '←';
NUMBER : [0-9]+;
SPACE : [ \t\r\n]+ -> skip;
and a main class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
TLexer lexer = new TLexer(CharStreams.fromString("N ← 0"));
TParser parser = new TParser(new CommonTokenStream(lexer));
System.out.println(parser.expression().toStringTree(parser));
}
}
which will display:
(expression N ← 0)
EDIT
You could also try using the unicode escape for the arrow like this:
grammar T;
expression
: ID ARROW NUMBER
;
ID : [a-zA-Z]+;
ARROW : '\u2190';
NUMBER : [0-9]+;
SPACE : [ \t\r\n]+ -> skip;
and the Java class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
String source = "N \u2190 0";
TLexer lexer = new TLexer(CharStreams.fromString(source));
TParser parser = new TParser(new CommonTokenStream(lexer));
System.out.println(source + ": " + parser.expression().toStringTree(parser));
}
}
which will print:
N ← 0: (expression N ← 0)
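If you still want to print to a console, one thing that may help is to check the character independently of the terminal and to write through a stream with an explicit encoding. The following is only a sketch: the class name ArrowEncodingSketch and the helper utf8Hex are made up for this example, and whether the arrow actually renders still depends on the terminal's own code page and font.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for inspecting how the arrow is encoded.
final class ArrowEncodingSketch {

    // Encodes the text as UTF-8 and renders the bytes as upper-case hex.
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String source = "N \u2190 0";
        // Inside Java the arrow is a single char; only its byte form varies.
        System.out.println(source.length());   // 5
        System.out.println(utf8Hex("\u2190")); // E28690
        // Write through an explicit UTF-8 stream instead of the platform default.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(source);
    }
}
```

Three UTF-8 bytes shown as three separate garbage characters is the typical symptom of those bytes being interpreted in a single-byte code page.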

how to report grammar ambiguity in antlr4

According to the antlr4 book (page 159), and using the grammar Ambig.g4, grammar ambiguity can be reported by:
grun Ambig stat -diagnostics
or equivalently, in code form:
parser.removeErrorListeners();
parser.addErrorListener(new DiagnosticErrorListener());
parser.getInterpreter().setPredictionMode(PredictionMode.LL_EXACT_AMBIG_DETECTION);
The grun command reports the ambiguity properly for me, using antlr-4.5.3. But when I use the code form, I don't get the ambiguity report. Here is the command trace:
$ antlr4 Ambig.g4 # see the book's page.159 for the grammar
$ javac Ambig*.java
$ grun Ambig stat -diagnostics < in1.txt # in1.txt is as shown on page.159
line 1:3 reportAttemptingFullContext d=0 (stat), input='f();'
line 1:3 reportAmbiguity d=0 (stat): ambigAlts={1, 2}, input='f();'
$ javac TestA_Listener.java
$ java TestA_Listener < in1.txt # exits silently
The TestA_Listener.java code is the following:
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.atn.*; // for PredictionMode
import java.util.*;
public class TestA_Listener {
public static void main(String[] args) throws Exception {
ANTLRInputStream input = new ANTLRInputStream(System.in);
AmbigLexer lexer = new AmbigLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
AmbigParser parser = new AmbigParser(tokens);
parser.removeErrorListeners(); // remove ConsoleErrorListener
parser.addErrorListener(new DiagnosticErrorListener());
parser.getInterpreter().setPredictionMode(PredictionMode.LL_EXACT_AMBIG_DETECTION);
parser.stat();
}
}
Can somebody please point out how the above java code should be modified, to print the ambiguity report?
For completeness, here is the code Ambig.g4 :
grammar Ambig;
stat: expr ';' // expression statement
| ID '(' ')' ';' // function call statement
;
expr: ID '(' ')'
| INT
;
INT : [0-9]+ ;
ID : [a-zA-Z]+ ;
WS : [ \t\r\n]+ -> skip ;
And here is the input file in1.txt :
f();
Antlr4 is a top-down parser, so for the given input, the parse match is unambiguously:
stat -> expr -> ID -> ( -> ) -> stat(cnt'd) -> ;
The second stat alt is redundant and never reached, not ambiguous.
To resolve the apparent redundancy, a predicate might be used:
stat: e=expr {isValidExpr($e)}? ';' #exprStmt
| ID '(' ')' ';' #funcStmt
;
When isValidExpr is false, the function statement alternative will be evaluated.
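I waited several days for other people to post their answers. Finally, after several rounds of experimenting, I found an answer:
The following line should be deleted from the above code. Then we get the same ambiguity report as given by grun. This appears to be because DiagnosticErrorListener does not print anything itself: it forwards its reports to the parser's registered error listeners, and the default ConsoleErrorListener is the one that writes them to the console.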
parser.removeErrorListeners(); // remove ConsoleErrorListener
The following code will work:
public static void main(String[] args) throws IOException {
CharStream input = CharStreams.fromStream(System.in);
AmbigLexer lexer = new AmbigLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
AmbigParser parser = new AmbigParser(tokens);
//parser.removeErrorListeners(); // remove ConsoleErrorListener
parser.addErrorListener(new org.antlr.v4.runtime.DiagnosticErrorListener()); // add ours
parser.getInterpreter().setPredictionMode(PredictionMode.LL_EXACT_AMBIG_DETECTION);
parser.stat(); // parse as usual
}

ANTLR: removing clutter

I'm learning ANTLR right now. Let's say I have some VHDL code and would like to do some processing on the PROCESS blocks. The rest should be completely ignored. I don't want to describe the whole VHDL language, since I'm interested only in the process blocks. So I could write a rule that matches process blocks. But how do I tell ANTLR to match only the process block rule and ignore anything else?
I know next to no VHDL, so let's say you want to replace all single line comments in a (Java) source file with multi-line comments:
//foo
should become:
/* foo */
You need to let the lexer match single line comments, of course. But you should also make sure it recognizes multi-line comments because you don't want //bar to be recognized as a single line comment in:
/*
//bar
*/
The same goes for string literals:
String s = "no // comment";
Finally, you should create some sort of catch-all rule in the lexer that will match any character.
A quick demo:
grammar T;
parse
: (t=. {System.out.print($t.text);})* EOF
;
Str
: '"' ('\\' . | ~('\\' | '"'))* '"'
;
MLComment
: '/*' .* '*/'
;
SLComment
: '//' ~('\r' | '\n')*
{
setText("/* " + getText().substring(2) + " */");
}
;
Any
: . // fall through rule, matches any character
;
If you now parse input like this:
//comment 1
class Foo {
//comment 2
/*
* not // a comment
*/
String s = "not // a // comment"; //comment 3
}
the following will be printed to your console:
/* comment 1 */
class Foo {
/* comment 2 */
/*
* not // a comment
*/
String s = "not // a // comment"; /* comment 3 */
}
Note that this is just a quick demo: a string literal in Java could contain Unicode escapes, which my demo doesn't support, and my demo also does not handle char-literals (the char literal char c = '"'; would break it). All of these things are quite easy to fix, of course.
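For comparison, the same idea (recognise the constructs you care about, recognise the ones that could hide them, and pass every other character through) can be sketched by hand in plain Java. This is only an illustration with the same limitations as the grammar demo; the class name CommentRewriteSketch and the method rewrite are made up for the example.

```java
// Hand-written equivalent of the Str / MLComment / SLComment / Any rules.
// CommentRewriteSketch is a hypothetical name used for illustration only.
final class CommentRewriteSketch {

    // Rewrites "//x" comments to "/* x */" and copies everything else,
    // including string literals and multi-line comments, unchanged.
    static String rewrite(String src) {
        StringBuilder out = new StringBuilder();
        int n = src.length();
        int i = 0;
        while (i < n) {
            char c = src.charAt(i);
            if (c == '"') {
                // String literal: skip to the closing quote, honouring \ escapes.
                int j = i + 1;
                while (j < n && src.charAt(j) != '"') {
                    j += (src.charAt(j) == '\\') ? 2 : 1;
                }
                j = Math.min(j + 1, n);
                out.append(src, i, j);
                i = j;
            } else if (src.startsWith("/*", i)) {
                // Multi-line comment: copy through to the first "*/".
                int close = src.indexOf("*/", i + 2);
                int j = (close < 0) ? n : close + 2;
                out.append(src, i, j);
                i = j;
            } else if (src.startsWith("//", i)) {
                // Single-line comment: rewrite up to (not including) the line end.
                int j = i + 2;
                while (j < n && src.charAt(j) != '\r' && src.charAt(j) != '\n') {
                    j++;
                }
                out.append("/* ").append(src, i + 2, j).append(" */");
                i = j;
            } else {
                // Catch-all, like the Any rule: copy one character.
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("String s = \"not // a comment\"; //comment 3"));
    }
}
```

For the VHDL case, the branches would instead recognise process blocks plus whatever could contain text that looks like one (comments, string literals), with the final branch skipping everything else.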
In the upcoming ANTLR v4, you can do fuzzy parsing. Take a look at
http://www.antlr.org/wiki/display/ANTLR4/Wildcard+Operator+and+Nongreedy+Subrules
You can get the beta software here:
http://antlr.org/download/antlr-4.0b3-complete.jar
Terence

ANTLR Variable Troubles

In short: how do I implement dynamic variables in ANTLR?
I come to you again with a basic ANTLR question.
I have this grammar:
grammar Amethyst;
options {
language = Java;
}
@header {
package org.omer.amethyst.generated;
import java.util.HashMap;
}
@lexer::header {
package org.omer.amethyst.generated;
}
@members {
HashMap memory = new HashMap();
}
begin: expr;
expr: (defun | println)*
;
println:
'println' atom {System.out.println($atom.value);}
;
defun:
'defun' VAR INT {memory.put($VAR.text, Integer.parseInt($INT.text));}
| 'defun' VAR STRING_LITERAL {memory.put($VAR.text, $STRING_LITERAL.text);}
;
atom returns [Object value]:
INT {$value = Integer.parseInt($INT.text);}
| ID
{
Object v = memory.get($ID.text);
if (v != null) $value = v;
else System.err.println("undefined variable " + $ID.text);
}
| STRING_LITERAL
{
String v = (String) memory.get($STRING_LITERAL.text);
if (v != null) $value = String.valueOf(v);
else System.err.println("undefined variable " + $STRING_LITERAL.text);
}
;
INT: '0'..'9'+ ;
STRING_LITERAL: '"' .* '"';
VAR: ('a'..'z'|'A'..'Z')('a'..'z'|'A'..'Z'|'0'..'9')* ;
ID: ('a'..'z'|'A'..'Z'|'0'..'9')+ ;
LETTER: ('a'..'z'|'A'..'Z')+ ;
WS: (' '|'\t'|'\n'|'\r')+ {skip();} ;
What it does (or should do), so far, is have a built-in "println" function to do exactly what you think it does, and a "defun" rule to define variables.
When "defun" is called on either a string or integer, the value is put into the "memory" HashMap with the first parameter being the variable's name and the second being its value.
When println is called on an atom, it should display the atom's value. The atom can be either a string or integer. It gets its value from memory and returns it. So for example:
defun greeting "Hello world!"
println greeting
But when I run this code, I get this error:
line 3:8 no viable alternative at input 'greeting'
null
NOTE: This output comes when I do:
println "greeting"
Output:
undefined variable "greeting"null
Does anyone know why this is so? Sorry if I'm not being clear, I don't understand most of this.
defun greeting "Hello world!"
println greeting
But when I run this code, I get this error:
line 3:8 no viable alternative at input 'greeting'
Because the input greeting (without quotes) is tokenized as a VAR, and VAR is not one of the alternatives of the atom rule. So the input defun greeting "Hello world!" is properly matched by the 2nd alternative of the defun rule:
defun
: 'defun' VAR INT // 1st alternative
| 'defun' VAR STRING_LITERAL // 2nd alternative
;
but the input println greeting cannot be matched by the println rule, since atom only accepts an INT, an ID or a STRING_LITERAL:
println
: 'println' atom
;
You must realize that the lexer does not produce tokens based on what the parser tries to match at a particular time. The input greeting will always be tokenized as a VAR, never as an ID, because both rules match it and VAR is defined first.
What you need to do is remove the ID rule from the lexer, and replace ID with VAR inside your parser rules.
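The point that the lexer decides on its own, before the parser is involved, can be illustrated with a deliberately simplified model in plain Java. This is not ANTLR's actual algorithm; it only mimics the common "longest match wins, ties go to the rule listed first" convention, and the class name TokenOrderSketch and the method classify are made up for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified, hypothetical model of lexer rule selection (not ANTLR code).
final class TokenOrderSketch {

    // Insertion order stands in for the order of the rules in the grammar.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("INT", Pattern.compile("[0-9]+"));
        RULES.put("VAR", Pattern.compile("[a-zA-Z][a-zA-Z0-9]*"));
        RULES.put("ID", Pattern.compile("[a-zA-Z0-9]+"));
    }

    // Returns the rule that wins at the start of the input: the longest
    // match, with ties going to the rule that was listed first.
    static String classify(String input) {
        String best = null;
        int bestLen = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            if (m.lookingAt() && m.end() > bestLen) {
                best = rule.getKey();
                bestLen = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(classify("greeting")); // VAR
        System.out.println(classify("42"));       // INT
        System.out.println(classify("9lives"));   // ID
    }
}
```

In this model nothing about the parser's expectations enters classify, which is why greeting can never come out as ID while VAR is listed before it.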

ANTLR : How to replace all characters defined as space with actual space

My ANTLR code is as follow :
LPARENTHESIS : ('(');
RPARENTHESIS : (')');
fragment CHARACTER : ('a'..'z'|'0'..'9'|);
fragment QUOTE : ('"');
fragment WILDCARD : ('*');
fragment SPACE : (' '|'\n'|'\r'|'\t'|'\u000C'|';'|':'|',');
WILD_STRING
: (CHARACTER)*
(
('?')
(CHARACTER)*
)+
;
PREFIX_STRING
: (CHARACTER)+
(
('*')
)+
;
WS : (SPACE) { $channel=HIDDEN; };
PHRASE : (QUOTE)(LPARENTHESIS)?(WORD)(WILDCARD)?(RPARENTHESIS)?((SPACE)+(LPARENTHESIS)?(WORD)(WILDCARD)?(RPARENTHESIS)?)*(SPACE)+(QUOTE);
WORD : (CHARACTER)+;
What I would like to do is replace every character defined as SPACE with an actual space character inside a PHRASE. If possible, I would also like each run of consecutive spaces to be collapsed into a single space.
Any help would be most appreciated. For some reason, I am finding it hard to understand ANTLR. Are there any good tutorials out there?
Java
Invoke your lexer's setText(...) method:
grammar T;
parse
: words EOF {System.out.println($words.text);}
;
words
: Word (Spaces Word)*
;
Word
: ('a'..'z'|'A'..'Z')+
;
Spaces
: (' ' | '\t' | '\r' | '\n')+ {setText(" ");}
;
Which can be tested with the class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source = "This is \n just \t\t\t\t\t\t a \n\t\t test";
ANTLRStringStream in = new ANTLRStringStream(source);
TLexer lexer = new TLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
TParser parser = new TParser(tokens);
System.out.println("------------------------------\nSource:\n" + source +
"\n------------------------------\nAfter parsing:");
parser.parse();
}
}
which produces the following output:
------------------------------
Source:
This is
just a
test
------------------------------
After parsing:
This is just a test
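If the normalisation only has to happen on the token text after lexing, a plain Java helper is another option. The sketch below assumes the separator set from the question's SPACE fragment (space, newline, carriage return, tab, form feed, ;, : and ,); the class name SpaceNormalizeSketch and the method normalize are made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical post-processing helper; not part of the ANTLR runtime.
final class SpaceNormalizeSketch {

    // One or more of the characters the question's SPACE fragment lists.
    private static final Pattern SPACE_RUN = Pattern.compile("[ \\n\\r\\t\\f;:,]+");

    // Replaces every run of separator characters with a single space.
    static String normalize(String phrase) {
        return SPACE_RUN.matcher(phrase).replaceAll(" ");
    }

    public static void main(String[] args) {
        String source = "This is \n just \t\t\t\t\t\t a \n\t\t test";
        System.out.println(normalize(source)); // This is just a test
    }
}
```

Because the pattern matches whole runs, both requirements (mapping every separator to a space and collapsing repeats) are handled by the single replaceAll call.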
Puneet Pawaia wrote:
Any help would be most appreciated. For some reason, I am finding it hard to understand ANTLR. Are there any good tutorials out there?
The ANTLR Wiki has a lot of useful information, albeit a bit unstructured (but that could just be me!).
The best ANTLR tutorial is the book: The Definitive ANTLR Reference: Building Domain-Specific Languages.
C#
For the C# target, try this:
grammar T;
options {
language=CSharp2;
}
@parser::namespace { Demo }
@lexer::namespace { Demo }
parse
: words EOF {Console.WriteLine($words.text);}
;
words
: Word (Spaces Word)*
;
Word
: ('a'..'z'|'A'..'Z')+
;
Spaces
: (' ' | '\t' | '\r' | '\n')+ {Text = " ";}
;
with the test class:
using System;
using Antlr.Runtime;
namespace Demo
{
class MainClass
{
public static void Main (string[] args)
{
ANTLRStringStream Input = new ANTLRStringStream("This is \n just \t\t\t\t\t\t a \n\t\t test");
TLexer Lexer = new TLexer(Input);
CommonTokenStream Tokens = new CommonTokenStream(Lexer);
TParser Parser = new TParser(Tokens);
Parser.parse();
}
}
}
which also prints This is just a test to the console. I tried to use SetText(...) instead of setText(...) but that didn't work either, and the C# API docs are currently off-line, so I used the trial and error-hack {Text = " ";}. I tested it with the C# 3.1.1 runtime DLL's.
Good luck!