Using ANTLR 3 with Java
I'm having trouble printing the text from the token here:
@lexer::members {
    public void displayRecognitionError(String[] tokenNames, RecognitionException e) {
        System.err.println("Encountered an illegal char " + getText() + " on line "+getLine()+":"+getCharPositionInLine ());
    }
}
I'm making a more detailed error report for the lexer grammar.
The problem is that when the error occurs (for example, the user enters a character like : that no rule defines), it only prints "Encountered an illegal char on line x:y". It should print the invalid character : between "char" and "on line x:y".
What can I do to show the invalid character, line and column?
Parsing input files that are larger than the computer's memory requires three steps:
The use of unbuffered character and token streams
Copy text out of sliding buffer and store in tokens
Ask the parser not to create parse trees
The objective is a simplified grammar to parse a file composed of many lines, each composed of many words.
With this requirement in mind, the following code uses the @members action and subclasses the parser generated by ANTLR.
The method printPagesAndWords receives a List of LineContext objects.
It prints the total number of lines and each line's start token (through the start field provided by LineContext), but it cannot get the number of WORDs per line through the WORD() method provided by the LineContext object.
This is the output obtained:
Number of lines in page: 3
Line start token: word1
Number of words in line: 0
Line start token: 2323
Number of words in line: 0
Line start token: 554545
Number of words in line: 0
Furthermore, if I try to get the WORDs in a line, for example by changing the line
System.out.println("Number of words in line:\t"+row.WORD().size()+"\n");
to the line
System.out.println("Number of words in line:\t"+row.WORD(1).getText()+"\n");
The following exception is thrown:
Exception in thread "main" java.lang.NullPointerException
at TestContext.printPagesAndWords(TestContext.java:11)
at ContextParser.read(ContextParser.java:130)
at Main.main(Main.java:11)
The following is the complete set of files:
ContextLexer.g4
lexer grammar ContextLexer;
NL
    : '\r'? '\n'
    ;
WORD
    : ~[ \t\n\r]+
    ;
SPACE
    : [ \t] -> skip
    ;
ContextParser.g4
parser grammar ContextParser;
options {
    tokenVocab=ContextLexer;
}
@members {
    void printPagesAndWords(List<LineContext> rows) {}
}
read
    : dataLine+=line* {printPagesAndWords($dataLine);}
    ;
line
    : WORD* NL
    ;
TestContext which extends ContextParser
import org.antlr.v4.runtime.TokenStream;
import java.util.List;
public class TestContext extends ContextParser {
    public TestContext(TokenStream input) {
        super(input);
    }
    void printPagesAndWords(List<LineContext> rows) {
        System.out.println("Number of lines in page:\t" + rows.size()+"\n");
        for (LineContext row : rows) {
            System.out.println("Line start token:\t\t\t"+row.start.getText());
            System.out.println("Number of words in line:\t"+row.WORD().size()+"\n");
        }
    }
}
The Main Class:
import org.antlr.v4.runtime.*;
import java.io.IOException;
public class Main {
    public static void main(String[] args) throws IOException {
        String source = "word1 word2 number anothernumber \n 2323 55r ere\n554545 lll 545\n";
        ContextLexer lexer = new ContextLexer(CharStreams.fromString(source));
        lexer.setTokenFactory(new CommonTokenFactory(true));
        TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lexer);
        TestContext parser = new TestContext(tokens);
        parser.setBuildParseTree(false);
        parser.read();
    }
}
I'm trying to implement a lexer rule for Oracle's q-quoted string mechanism, where a literal looks like q'$some string$'
Here you can have any character in place of $ other than whitespace, (, {, [, <, but the string must start and end with the same character. Some examples of accepted tokens would be:
q'!some string!'
q'ssome strings'
Notice that s is the custom delimiter here, but it may also appear inside the string, because the literal only ends at s'.
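As a rough illustration of that termination rule outside ANTLR, the following plain-Java sketch scans for the end of such a literal. The class QQuoteScanner and the method endOf are hypothetical names invented for this example, and the set of excluded delimiters (whitespace, the opening brackets, and both quote characters) is an assumption based on the description above rather than on Oracle's documentation.

```java
// Hypothetical helper for illustration; it is not part of ANTLR or Oracle.
final class QQuoteScanner {
    // Characters assumed to be invalid as a custom delimiter.
    private static final String FORBIDDEN_DELIMS = " ({[<'\"\t\n\r";

    /**
     * Returns the index just past the closing quote of the q-quoted literal
     * that starts at {@code start}, or -1 if no well-formed literal starts there.
     */
    static int endOf(String input, int start) {
        // The literal must begin with q' followed by one delimiter character.
        if (!input.startsWith("q'", start) || start + 2 >= input.length()) {
            return -1;
        }
        char delim = input.charAt(start + 2);
        if (FORBIDDEN_DELIMS.indexOf(delim) >= 0) {
            return -1;
        }
        // The literal ends at the first delimiter that is immediately followed by a quote;
        // a delimiter followed by anything else is ordinary string content.
        for (int i = start + 3; i + 1 < input.length(); i++) {
            if (input.charAt(i) == delim && input.charAt(i + 1) == '\'') {
                return i + 2;
            }
        }
        return -1; // unterminated literal
    }

    public static void main(String[] args) {
        String input = "q'ssome strings' q'!foo!'";
        int end = endOf(input, 0);
        System.out.println(input.substring(0, end)); // q'ssome strings'
    }
}
```

The two-character lookahead (delimiter, then quote) is the same check a lexer predicate has to make before deciding that the literal is finished.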
Here's how I was trying to implement the rule:
Q_QUOTED_LITERAL: Q_QUOTED_LITERAL_NON_TERMINATED . QUOTE-> type(QUOTED_LITERAL);
Q_QUOTED_LITERAL_NON_TERMINATED:
    Q QUOTE ~[ ({[<'"\t\n\r] { setDelimChar( (char)_input.LA(-1) ); }
    ( . { !isValidEndDelimChar() }? )*
    ;
I have already checked the value returned by !isValidEndDelimChar(), and the predicate evaluates to false at the right place, so everything should work, but ANTLR simply ignores the predicate. I've also tried moving the predicate around, putting that part in a separate rule, and a number of other things. After a day and a half of research, I'm finally raising this issue.
I have also tried other approaches, but there doesn't seem to be a way to implement a custom-character-delimited string in ANTLR 4 (the ANTLR 3 version used to work).
Not sure why the { ... } action isn't invoked, but it's not needed. The following grammar worked for me (put the predicate in front of the .!):
grammar Test;
@lexer::members {
    boolean isValidEndDelimChar() {
        return (_input.LA(1) == getText().charAt(2)) && (_input.LA(2) == '\'');
    }
}
parse
    : .*? EOF
    ;
Q_QUOTED_LITERAL
    : 'q\'' ~[ ({[<'"\t\n\r] ( {!isValidEndDelimChar()}? . )* . '\''
    ;
SPACE
    : [ \t\f\r\n] -> skip
    ;
If you run the class:
import org.antlr.v4.runtime.*;
public class Main {
    public static void main(String[] args) {
        Lexer lexer = new TestLexer(CharStreams.fromString("q'ssome strings' q'!foo!'"));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        for (Token t : tokens.getTokens()) {
            System.out.printf("%-20s %s\n", TestLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
        }
    }
}
the following output will be printed:
Q_QUOTED_LITERAL     q'ssome strings'
Q_QUOTED_LITERAL     q'!foo!'
EOF                  <EOF>
Here is my grammar. When I give it the input
alter table ;
everything works fine, but when I give it
altasder table; alter table ;
it reports an error on the first command, as expected. What I want is for it to skip the first 'altasder table;' and still parse the second command.
grammar Hello;
start : compilation;
compilation : sql*;
sql : altercommand;
altercommand : ALTER TABLE SEMICOLON;
ALTER: 'alter';
TABLE: 'table';
SEMICOLON : ';';
How can I achieve this?
I have used DefaultErrorStrategy, but it is still not working:
import org.antlr.v4.runtime.DefaultErrorStrategy;
import org.antlr.v4.runtime.Parser;
import org.antlr.v4.runtime.RecognitionException;
import org.antlr.v4.runtime.TokenStream;
import org.antlr.v4.runtime.misc.IntervalSet;
public class CustomeErrorHandler extends DefaultErrorStrategy {
    @Override
    public void recover(Parser recognizer, RecognitionException e) {
        // TODO Auto-generated method stub
        super.recover(recognizer, e);
        TokenStream tokenStream = (TokenStream)recognizer.getInputStream();
        if (tokenStream.LA(1) == HelloParser.SEMICOLON )
        {
            IntervalSet intervalSet = getErrorRecoverySet(recognizer);
            tokenStream.consume();
            consumeUntil(recognizer, intervalSet);
        }
    }
}
main class :
import java.io.IOException;
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
public class Main {
    public static void main(String[] args) throws IOException {
        ANTLRInputStream ip = new ANTLRInputStream("altasdere table ; alter table ;");
        HelloLexer lex = new HelloLexer(ip);
        CommonTokenStream token = new CommonTokenStream(lex);
        HelloParser parser = new HelloParser(token);
        parser.setErrorHandler(new CustomeErrorHandler());
        System.out.println(parser.start().toStringTree(parser));
    }
}
myoutput :
line 1:0 token recognition error at: 'alta'
line 1:4 token recognition error at: 's'
line 1:5 token recognition error at: 'd'
line 1:6 token recognition error at: 'e'
line 1:7 token recognition error at: 'r'
line 1:8 token recognition error at: 'e'
line 1:9 token recognition error at: ' '
(start compilation)
Why is it not moving on to the second command?
Need to use the DefaultErrorStrategy to control how the parser behaves in response to recognition errors. Extend as necessary, modifying the #recover method, to consume tokens up to the desired parsing restart point in the token stream.
A naive implementation of #recover would be:
@Override
public void recover(Parser recognizer, RecognitionException e) {
    if (e instanceof InputMismatchException) {
        int ttype = recognizer.getInputStream().LA(1);
        while (ttype != Token.EOF && ttype != HelloParser.SEMICOLON) {
            recognizer.consume();
            ttype = recognizer.getInputStream().LA(1);
        }
    } else {
        super.recover(recognizer, e);
    }
}
Adjust the while condition as necessary to identify the next valid point to resume recognition.
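To see the effect of such a resynchronization loop without running ANTLR, here is a minimal plain-Java sketch over a hand-built token list. The class ResyncSketch, the Tok enum, and countCommands are hypothetical names used only for illustration. Unlike the recover method above, this sketch also consumes the SEMICOLON it stops at, which is an assumption about where parsing should restart.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the parser loop; token names mirror the Hello grammar.
final class ResyncSketch {
    enum Tok { ALTER, TABLE, SEMICOLON, ERR, EOF }

    /**
     * Counts well-formed "ALTER TABLE SEMICOLON" commands and skips malformed ones.
     * The token list is assumed to end with exactly one EOF token.
     */
    static int countCommands(List<Tok> tokens) {
        int pos = 0;
        int recognized = 0;
        while (tokens.get(pos) != Tok.EOF) {
            boolean matches = tokens.get(pos) == Tok.ALTER
                    && tokens.get(pos + 1) == Tok.TABLE
                    && tokens.get(pos + 2) == Tok.SEMICOLON;
            if (matches) {
                recognized++;
                pos += 3;
            } else {
                // Recovery: discard tokens up to the next SEMICOLON (or EOF) ...
                while (tokens.get(pos) != Tok.EOF && tokens.get(pos) != Tok.SEMICOLON) {
                    pos++;
                }
                // ... then step over the SEMICOLON so the next command starts cleanly.
                if (tokens.get(pos) == Tok.SEMICOLON) {
                    pos++;
                }
            }
        }
        return recognized;
    }

    public static void main(String[] args) {
        // Roughly "altasder table ; alter table ;" with the bad word lexed as ERR.
        List<Tok> tokens = Arrays.asList(
                Tok.ERR, Tok.TABLE, Tok.SEMICOLON,
                Tok.ALTER, Tok.TABLE, Tok.SEMICOLON,
                Tok.EOF);
        System.out.println(countCommands(tokens)); // 1
    }
}
```

Whether the terminating token is consumed during recovery or left for the parser depends on how the surrounding rules are written, so treat that choice as something to verify against your own grammar.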
Note, the error messages are due to the lexer being unable to match extraneous input characters. To remove the error messages, add as the last lexer rule:
ERR_TOKEN : . ;
I'm very new to ANTLR4 and am trying to build my own language. So my grammar starts at
program: <EOF> | statement | functionDef | statement program | functionDef program;
and my statement is
statement: selectionStatement | compoundStatement | ...;
and
selectionStatement
: If LeftParen expression RightParen compoundStatement (Else compoundStatement)?
| Switch LeftParen expression RightParen compoundStatement
;
compoundStatement
: LeftBrace statement* RightBrace;
Now the problem is that when I test a piece of code against selectionStatement or statement, it passes the test, but when I test it against program, it is not recognized. Can anyone help me with this? Thank you very much.
edit: the code I use to test is the following:
if (x == 2) {}
It passes the test against selectionStatement and statement but fails at program. It appears that program only accepts if...else
if (x == 2) {} else {}
Edit 2:
The error message I received was
<unknown>: Incorrect error: no viable alternative at input 'if(x==2){}'
Cannot answer your question given the incomplete information provided: the statement rule is partial and the compoundStatement rule is missing.
Nonetheless, there are two techniques you should be using to answer this kind of question yourself (in addition to unit tests).
First, ensure that the lexer is working as expected. This answer shows how to dump the token stream directly.
Second, use a custom ErrorListener to provide a meaningful/detailed description of its parse path to every encountered error. An example:
import java.util.List;
import org.antlr.v4.runtime.BaseErrorListener;
import org.antlr.v4.runtime.Parser;
import org.antlr.v4.runtime.RecognitionException;
import org.antlr.v4.runtime.Recognizer;
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.TokenStream;
// JavaLexer is the generated lexer; Log stands for whatever logging facade you use.
public class JavaErrorListener extends BaseErrorListener {
    public int lastError = -1;
    @Override
    public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine,
            String msg, RecognitionException e) {
        Parser parser = (Parser) recognizer;
        String name = parser.getSourceName();
        TokenStream tokens = parser.getInputStream();
        Token offSymbol = (Token) offendingSymbol;
        int thisError = offSymbol.getTokenIndex();
        if (offSymbol.getType() == -1 && thisError == tokens.size() - 1) {
            Log.debug(this, name + ": Incorrect error: " + msg);
            return;
        }
        String offSymName = JavaLexer.VOCABULARY.getSymbolicName(offSymbol.getType());
        List<String> stack = parser.getRuleInvocationStack();
        // Collections.reverse(stack);
        Log.error(this, name);
        Log.error(this, "Rule stack: " + stack);
        Log.error(this, "At line " + line + ":" + charPositionInLine + " at " + offSymName + ": " + msg);
        // Show at most the last 10 tokens leading up to the error.
        if (thisError > lastError + 10) {
            lastError = thisError - 10;
        }
        for (int idx = lastError + 1; idx <= thisError; idx++) {
            Token token = tokens.get(idx);
            if (token.getChannel() != Token.HIDDEN_CHANNEL) Log.error(this, token.toString());
        }
        lastError = thisError;
    }
}
Note: adjust the Log statements to whatever logging package you are using.
Finally, Antlr doesn't do 'weird' things - just things that you don't understand.
Lex and Yacc are not reporting an error when an unexpected character is parsed. In the code below, there is no error when #set label sample is parsed, but the # is not valid.
Lex portion of code
identifier [\._a-zA-Z0-9\/]+
<INITIAL>{s}{e}{t} {
    return SET;
}
<INITIAL>{l}{a}{b}{e}{l} {
    return LABEL;
}
<INITIAL>{i}{d}{e}{n}{t}{i}{f}{i}{e}{r} {
    strncpy(yylval.str, yytext,1023);
    yylval.str[1023] = '\0';
    return IDENTIFIER;
}
Yacc portion of code.
definition : SET LABEL IDENTIFIER
    {
        cout<<"set label "<<$3<<endl;
    };
When #set label sample is parsed, there should be an error reported because # is an unexpected character. But there is no error reported. How should I modify the code so an error is reported?
(Comments converted to a SO style Q&A format)
@JonathanLeffler wrote:
That's why you need a default rule in the lexical analyzer (typically the LHS is .) that arranges for an error to be reported. Without it, the default action is just to echo the unmatched character and proceed onwards with the next one.
At the least you would want to include the specific character that is causing trouble in the error message. You might well want to return it as a single-character token, which will generally trigger an error in the grammar. So:
<*>. { cout << "Error: unexpected character " << yytext << endl; return *yytext; }
might be appropriate.
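The same catch-all idea can be sketched in plain Java for readers who want to experiment without lex. CatchAllScanner and its token strings are hypothetical names made up for this example. The identifier character set mirrors the identifier definition in the question, and keywords are matched case-insensitively on the assumption that {s}, {e}, {t} and so on are case-insensitive letter classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner for illustration; it is not generated by lex.
final class CatchAllScanner {
    // Mirrors the character class [\._a-zA-Z0-9\/] from the question.
    static boolean isIdentChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9') || c == '.' || c == '_' || c == '/';
    }

    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
                i++; // whitespace is assumed to be skipped
            } else if (isIdentChar(c)) {
                // Longest match: take the whole word, then classify it.
                int start = i;
                while (i < input.length() && isIdentChar(input.charAt(i))) {
                    i++;
                }
                String word = input.substring(start, i);
                if (word.equalsIgnoreCase("set")) {
                    tokens.add("SET");
                } else if (word.equalsIgnoreCase("label")) {
                    tokens.add("LABEL");
                } else {
                    tokens.add("IDENTIFIER(" + word + ")");
                }
            } else {
                // Catch-all branch, playing the role of the "." rule:
                // report the character and hand it on as an error token.
                System.err.println("Error: unexpected character " + c);
                tokens.add("ERROR(" + c + ")");
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("#set label sample"));
    }
}
```

Without the final else branch, the # would be dropped silently, which is the Java analogue of lex echoing an unmatched character and carrying on.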