Parsing input files with size bigger than computer memory requires three steps:
The use of unbuffered character and token streams
Copy text out of sliding buffer and store in tokens
Ask the parser not to create parse trees
The objective is a simplified grammar to parse a file composed of many lines, each of one composed of many words.
With this requirement in mind, the following code uses #members action and sub classing the parser generated by ANTLR
The method printPagesAndWords receives a List of LineContext objects.
It prints the total number of lines (with the start method provided by LineContext), but it cannot access to the number of WORDs per line, whith the method WORD() provided by the LineContext object.
This is the output obtained:
Number of lines in page: 3
Line start token: word1
Number of words in line: 0
Line start token: 2323
Number of words in line: 0
Line start token: 554545
Number of words in line: 0
Furthermore, If I try to get the WORDs in a line, for example changing the line
System.out.println("Number of words in line:\t"+row.WORD().size()+"\n");
by the line
System.out.println("Number of words in line:\t"+row.WORD(1).getText()+"\n");
The following exception is thrown:
Exception in thread "main" java.lang.NullPointerException
at TestContext.printPagesAndWords(TestContext.java:11)
at ContextParser.read(ContextParser.java:130)
at Main.main(Main.java:11)
The following is the complete set of files:
ContextLexer.g4
lexer grammar ContextLexer;
NL
: '\r'? '\n'
;
WORD
: ~[ \t\n\r]+
;
SPACE
: [ \t] -> skip
;
ContextParser.g4
parser grammar ContextParser;
options {
tokenVocab=ContextLexer;
}
#members{
void printPagesAndWords(List<LineContext> rows){};
}
read
: dataLine+=line* {printPagesAndWords($dataLine);}
;
line
: WORD* NL
;
TestContext which extends ContextParser
import org.antlr.v4.runtime.TokenStream;
import java.util.List;
public class TestContext extends ContextParser {
public TestContext(TokenStream input) {
super(input);
}
void printPagesAndWords(List<LineContext> rows){
System.out.println("Number of lines in page:\t" + rows.size()+"\n");
for(LineContext row: rows){
System.out.println("Line start token:\t\t\t"+row.start.getText());
System.out.println("Number of words in line:\t"+row.WORD().size()+"\n");
}
};
}
The Main Class:
import org.antlr.v4.runtime.*;
import java.io.IOException;
public class Main {
public static void main(String[] args) throws IOException {
String source = "word1 word2 number anothernumber \n 2323 55r ere\n554545 lll 545\n";
ContextLexer lexer = new ContextLexer(CharStreams.fromString(source));
lexer.setTokenFactory(new CommonTokenFactory(true));
TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lexer);
TestContext parser = new TestContext(tokens);
parser.setBuildParseTree(false);
parser.read();
}
}
Related
I'm trying to implement a lexer rule for an oracle Q quoted string mechanism where we have something like q'$some string$'
Here you can have any character in place of $ other than whitespace, (, {, [, <, but the string must start and end with the same character. Some examples of accepted tokens would be:
q'!some string!'
q'ssome strings'
Notice how s is the custom delimiter but it is fine to have that in the string as well because we would only end at s'
Here's how I was trying to implement the rule:
Q_QUOTED_LITERAL: Q_QUOTED_LITERAL_NON_TERMINATED . QUOTE-> type(QUOTED_LITERAL);
Q_QUOTED_LITERAL_NON_TERMINATED:
Q QUOTE ~[ ({[<'"\t\n\r] { setDelimChar( (char)_input.LA(-1) ); }
( . { !isValidEndDelimChar() }? )*
;
I have already checked the value I get from !isValidEndDelimChar() and I'm getting a false predicate here at the right place so everything should work, but antlr simply ignores this predicate. I've also tried moving the predicate around, putting that part in a separate rule, and a bunch of other stuff, after a day and a half of research on the same I'm finally raising this issue.
I have also tried to implement it in other ways but there doesn't seem to be a way to implement a custom char delimited string in antlr4 (The antlr3 version used to work).
Not sure why the { ... } action isn't invoked, but it's not needed. The following grammar worked for me (put the predicate in front of the .!):
grammar Test;
#lexer::members {
boolean isValidEndDelimChar() {
return (_input.LA(1) == getText().charAt(2)) && (_input.LA(2) == '\'');
}
}
parse
: .*? EOF
;
Q_QUOTED_LITERAL
: 'q\'' ~[ ({[<'"\t\n\r] ( {!isValidEndDelimChar()}? . )* . '\''
;
SPACE
: [ \t\f\r\n] -> skip
;
If you run the class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
Lexer lexer = new TestLexer(CharStreams.fromString("q'ssome strings' q'!foo!'"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-20s %s\n", TestLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
}
the following output will be printed:
Q_QUOTED_LITERAL q'ssome strings'
Q_QUOTED_LITERAL q'!foo!'
EOF <EOF>
I am trying to parse APL expressions using ANTLR, It is sort of APL source code parser. It parse normal characters but fails to parse special symbols(like '←')
expression = N←0
Lexer
/* Lexer Tokens. */
NUMBER:
(DIGIT)+ ( '.' (DIGIT)+ )?;
ASSIGN:
'←'
;
DIGIT :
[0-9]
;
Output:
[#0,0:1='99',<NUMBER>,1:0]
**[#1,4:6='â??',<'â??'>,2:0**]
[#2,7:6='<EOF>',<EOF>,2:3]
Can some one help me to parse special characters from APL language.
I am following below steps.
Written Grammar
"antlr4.bat" used to generate parser from grammar.
"grun.bat" is used to generate token
"grun.bat" is used to generate token
That just means your terminal cannot display the character properly. There is nothing wrong with the generated parser or lexer not being able to recognise ←.
Just don't use the bat file, but rather test your lexer and parser by writing a small class yourself using your favourite IDE (which can display the characters properly).
Something like this:
grammar T;
expression
: ID ARROW NUMBER
;
ID : [a-zA-Z]+;
ARROW : '←';
NUMBER : [0-9]+;
SPACE : [ \t\r\n]+ -> skip;
and a main class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
TLexer lexer = new TLexer(CharStreams.fromString("N ← 0"));
TParser parser = new TParser(new CommonTokenStream(lexer));
System.out.println(parser.expression().toStringTree(parser));
}
}
which will display:
(expression N ← 0)
EDIT
You could also try using the unicode escape for the arrow like this:
grammar T;
expression
: ID ARROW NUMBER
;
ID : [a-zA-Z]+;
ARROW : '\u2190';
NUMBER : [0-9]+;
SPACE : [ \t\r\n]+ -> skip;
and the Java class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
String source = "N \u2190 0";
TLexer lexer = new TLexer(CharStreams.fromString(source));
TParser parser = new TParser(new CommonTokenStream(lexer));
System.out.println(source + ": " + parser.expression().toStringTree(parser));
}
}
which will print:
N ← 0: (expression N ← 0)
Here is my grammar i am trying to give input as
alter table ;
everything works fine but when i give
altasder table; alter table ;
it gives me an error on first string as expected but i want is to parse the second command ignoring the first 'altasder table;'
grammar Hello;
start : compilation;
compilation : sql*;
sql : altercommand;
altercommand : ALTER TABLE SEMICOLON;
ALTER: 'alter';
TABLE: 'table';
SEMICOLON : ';';
how can i achieve it???
I have used the DefualtError stategy but still its not wotking
import org.antlr.v4.runtime.DefaultErrorStrategy;
import org.antlr.v4.runtime.Parser;
import org.antlr.v4.runtime.RecognitionException;
import org.antlr.v4.runtime.TokenStream;
import org.antlr.v4.runtime.misc.IntervalSet;
public class CustomeErrorHandler extends DefaultErrorStrategy {
#Override
public void recover(Parser recognizer, RecognitionException e) {
// TODO Auto-generated method stub
super.recover(recognizer, e);
TokenStream tokenStream = (TokenStream)recognizer.getInputStream();
if (tokenStream.LA(1) == HelloParser.SEMICOLON )
{
IntervalSet intervalSet = getErrorRecoverySet(recognizer);
tokenStream.consume();
consumeUntil(recognizer, intervalSet);
}
}
}
main class :
public class Main {
public static void main(String[] args) throws IOException {
ANTLRInputStream ip = new ANTLRInputStream("altasdere table ; alter table ;");
HelloLexer lex = new HelloLexer(ip);
CommonTokenStream token = new CommonTokenStream(lex);
HelloParser parser = new HelloParser(token);
parser.setErrorHandler(new CustomeErrorHandler());
System.out.println(parser.start().toStringTree(parser));
}
}
myoutput :
line 1:0 token recognition error at: 'alta'
line 1:4 token recognition error at: 's'
line 1:5 token recognition error at: 'd'
line 1:6 token recognition error at: 'e'
line 1:7 token recognition error at: 'r'
line 1:8 token recognition error at: 'e'
line 1:9 token recognition error at: ' '
(start compilation)
why its not moving to second command ?
Need to use the DefaultErrorStrategy to control how the parser behaves in response to recognition errors. Extend as necessary, modifying the #recover method, to consume tokens up to the desired parsing restart point in the token stream.
A naive implementation of #recover would be:
#Override
public void recover(Parser recognizer, RecognitionException e) {
if (e instanceof InputMismatchException) {
int ttype = recognizer.getInputStream().LA(1);
while (ttype != Token.EOF && ttype != HelloParser.SEMICOLON) {
recognizer.consume();
ttype = recognizer.getInputStream().LA(1);
}
} else {
super.recover(recognizer, e);
}
}
Adjust the while condition as necessary to identify the next valid point to resume recognition.
Note, the error messages are due to the lexer being unable to match extraneous input characters. To remove the error messages, add as the last lexer rule:
ERR_TOKEN : . ;
According to the antlr4 book (page 159), and using the grammar Ambig.g4, grammar ambiguity can be reported by:
grun Ambig stat -diagnostics
or equivalently, in code form:
parser.removeErrorListeners();
parser.addErrorListener(new DiagnosticErrorListener());
parser.getInterpreter().setPredictionMode(PredictionMode.LL_EXACT_AMBIG_DETECTION);
The grun command reports the ambiguity properly for me, using antlr-4.5.3. But when I use the code form, I dont get the ambiguity report. Here is the command trace:
$ antlr4 Ambig.g4 # see the book's page.159 for the grammar
$ javac Ambig*.java
$ grun Ambig stat -diagnostics < in1.txt # in1.txt is as shown on page.159
line 1:3 reportAttemptingFullContext d=0 (stat), input='f();'
line 1:3 reportAmbiguity d=0 (stat): ambigAlts={1, 2}, input='f();'
$ javac TestA_Listener.java
$ java TestA_Listener < in1.txt # exits silently
The TestA_Listener.java code is the following:
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.atn.*; // for PredictionMode
import java.util.*;
public class TestA_Listener {
public static void main(String[] args) throws Exception {
ANTLRInputStream input = new ANTLRInputStream(System.in);
AmbigLexer lexer = new AmbigLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
AmbigParser parser = new AmbigParser(tokens);
parser.removeErrorListeners(); // remove ConsoleErrorListener
parser.addErrorListener(new DiagnosticErrorListener());
parser.getInterpreter().setPredictionMode(PredictionMode.LL_EXACT_AMBIG_DETECTION);
parser.stat();
}
}
Can somebody please point out how the above java code should be modified, to print the ambiguity report?
For completeness, here is the code Ambig.g4 :
grammar Ambig;
stat: expr ';' // expression statement
| ID '(' ')' ';' // function call statement
;
expr: ID '(' ')'
| INT
;
INT : [0-9]+ ;
ID : [a-zA-Z]+ ;
WS : [ \t\r\n]+ -> skip ;
And here is the input file in1.txt :
f();
Antlr4 is a top-down parser, so for the given input, the parse match is unambiguously:
stat -> expr -> ID -> ( -> ) -> stat(cnt'd) -> ;
The second stat alt is redundant and never reached, not ambiguous.
To resolve the apparent redundancy, a predicate might be used:
stat: e=expr {isValidExpr($e)}? ';' #exprStmt
| ID '(' ')' ';' #funcStmt
;
When isValidExpr is false, the function statement alternative will be evaluated.
I waited for several days for other people to post their answers. Finally after several rounds of experimenting, I found an answer:
The following line should be deleted from the above code. Then we get the same ambiguity report as given by grun.
parser.removeErrorListeners(); // remove ConsoleErrorListener
The following code will be work
public static void main(String[] args) throws IOException {
CharStream input = CharStreams.fromStream(System.in);
AmbigLexer lexer = new AmbigLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
AmbigParser parser = new AmbigParser(tokens);
//parser.removeErrorListeners(); // remove ConsoleErrorListener
parser.addErrorListener(new org.antlr.v4.runtime.DiagnosticErrorListener()); // add ours
parser.getInterpreter().setPredictionMode(PredictionMode.LL_EXACT_AMBIG_DETECTION);
parser.stat(); // parse as usual
}
My ANTLR code is as follow :
LPARENTHESIS : ('(');
RPARENTHESIS : (')');
fragment CHARACTER : ('a'..'z'|'0'..'9'|);
fragment QUOTE : ('"');
fragment WILDCARD : ('*');
fragment SPACE : (' '|'\n'|'\r'|'\t'|'\u000C'|';'|':'|',');
WILD_STRING
: (CHARACTER)*
(
('?')
(CHARACTER)*
)+
;
PREFIX_STRING
: (CHARACTER)+
(
('*')
)+
;
WS : (SPACE) { $channel=HIDDEN; };
PHRASE : (QUOTE)(LPARENTHESIS)?(WORD)(WILDCARD)?(RPARENTHESIS)?((SPACE)+(LPARENTHESIS)?(WORD)(WILDCARD)?(RPARENTHESIS)?)*(SPACE)+(QUOTE);
WORD : (CHARACTER)+;
What I would like to do is to replace all characters marked as space to be replaced with actual space character in the PHRASE. Also if possible, I would then like all continuous spaces to be represented by a single space.
Any help would be most appreciated. For some reason, I am finding it hard to understand ANTLR. Any good tutorials out there ?
Java
Invoke your lexer's setText(...) method:
grammar T;
parse
: words EOF {System.out.println($words.text);}
;
words
: Word (Spaces Word)*
;
Word
: ('a'..'z'|'A'..'Z')+
;
Spaces
: (' ' | '\t' | '\r' | '\n')+ {setText(" ");}
;
Which can be tested with the class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source = "This is \n just \t\t\t\t\t\t a \n\t\t test";
ANTLRStringStream in = new ANTLRStringStream(source);
TLexer lexer = new TLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
TParser parser = new TParser(tokens);
System.out.println("------------------------------\nSource:\n" + source +
"\n------------------------------\nAfter parsing:");
parser.parse();
}
}
which produces the following output:
------------------------------
Source:
This is
just a
test
------------------------------
After parsing:
This is just a test
Puneet Pawaia wrote:
Any help would be most appreciated. For some reason, I am finding it hard to understand ANTLR. Any good tutorials out there ?
The ANTLR Wiki has loads of informative info, albeit a bit unstructured (but that could just be me!).
The best ANTLR tutorial is the book: The Definitive ANTLR Reference: Building Domain-Specific Languages.
C#
For the C# target, try this:
grammar T;
options {
language=CSharp2;
}
#parser::namespace { Demo }
#lexer::namespace { Demo }
parse
: words EOF {Console.WriteLine($words.text);}
;
words
: Word (Spaces Word)*
;
Word
: ('a'..'z'|'A'..'Z')+
;
Spaces
: (' ' | '\t' | '\r' | '\n')+ {Text = " ";}
;
with the test class:
using System;
using Antlr.Runtime;
namespace Demo
{
class MainClass
{
public static void Main (string[] args)
{
ANTLRStringStream Input = new ANTLRStringStream("This is \n just \t\t\t\t\t\t a \n\t\t test");
TLexer Lexer = new TLexer(Input);
CommonTokenStream Tokens = new CommonTokenStream(Lexer);
TParser Parser = new TParser(Tokens);
Parser.parse();
}
}
}
which also prints This is just a test to the console. I tried to use SetText(...) instead of setText(...) but that didn't work either, and the C# API docs are currently off-line, so I used the trial and error-hack {Text = " ";}. I tested it with the C# 3.1.1 runtime DLL's.
Good luck!